Tinkering with the sklearn forest dataset

2022-11-11 2 minute read

Portfolio: working with the SCK forest dataset

=============================

SCK forest data

https://scikit-learn.org/stable/datasets/real_world.html#forest-covertypes

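The samples in this dataset correspond to 30×30m patches of forest in the US, collected for the task of predicting each patch's cover type, i.e. the dominant species of tree. There are seven cover types, which makes this a multiclass classification problem. Each sample has 54 features, described on the dataset's homepage. Some of the features are boolean indicators, while others are discrete or continuous measurements.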

This portfolio post downloads the US forest data published with the description above and explores it in a few different ways.

by E Creator

import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

import pandas as pd

from sklearn.datasets import fetch_covtype

covtype = fetch_covtype()
print(covtype.DESCR)

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

    =================   ============
    Classes                        7
    Samples total             581012
    Dimensionality                54
    Features                     int
    =================   ============

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like 'Bunch' object
with the feature matrix in the ``data`` member
and the target values in ``target``. If optional argument 'as_frame' is
set to 'True', it will return ``data`` and ``target`` as pandas
data frame, and there will be an additional member ``frame`` as well.
The dataset will be downloaded from the web if necessary.

df = pd.DataFrame(covtype.data, 
                  columns=["x{:02d}".format(i + 1) for i in range(covtype.data.shape[1])],
                  dtype=int)
sy = pd.Series(covtype.target, dtype="category")
df['covtype'] = sy
df.tail()

	x01	x02	x03	x04	x05	x06	x07	x08	x09	x10	...	covtype
581007	2396	153	20	85	17	108	240	237	118	837	...	3
581008	2391	152	19	67	12	95	240	237	119	845	...	3
581009	2386	159	17	60	7	90	236	241	130	854	...	3
581010	2384	170	15	60	5	90	230	245	143	864	...	3
581011	2383	165	13	60	4	67	231	244	141	875	...	3

5 rows × 55 columns
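The last five rows all carry cover type 3, which fits the fact that `fetch_covtype` does not shuffle rows by default, so the tail says little about how the seven classes are distributed. A minimal sketch of a class-balance check follows. The helper name `class_balance` and the toy labels are hypothetical stand-ins; on the real data the call would be `class_balance(df["covtype"])`.

```python
import numpy as np
import pandas as pd

def class_balance(labels):
    """Return per-class counts and shares, sorted by class label."""
    counts = pd.Series(labels).value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})

# Hypothetical toy labels standing in for df["covtype"].
toy_labels = np.array([1, 2, 2, 2, 3, 3, 7, 7, 7, 7])
balance = class_balance(toy_labels)
print(balance)
```

The shares sum to 1, so the table can be read directly as the fraction of samples in each class.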

pd.DataFrame(df.nunique()).T

	x01	x02	x03	x04	x05	x06	x07	x08	x09	x10	...	x46	x47	x48	x49	x50	x51	x52	x53	x54	covtype
0	1978	361	67	551	700	5785	207	185	255	5827	...	2	2	2	2	2	2	2	2	2	7

1 rows × 55 columns
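The `nunique` row separates the two kinds of features mentioned in the description: columns with only 2 distinct values behave like boolean indicators, while columns with hundreds or thousands of distinct values are measurements. A rough sketch of that split is shown below. The helper `split_by_cardinality`, its `max_levels` threshold, and the toy frame are assumptions made for illustration. A constant column would also land in the indicator group.

```python
import pandas as pd

def split_by_cardinality(frame, max_levels=2):
    """Split column names into (indicator-like, other) by distinct-value count."""
    n_unique = frame.nunique()
    low = n_unique[n_unique <= max_levels].index.tolist()
    high = n_unique[n_unique > max_levels].index.tolist()
    return low, high

# Hypothetical toy frame; column names only mimic the x01..x54 scheme.
toy = pd.DataFrame({
    "x01": [2396, 2391, 2386, 2384],  # measurement-style column
    "x02": [153, 152, 159, 170],      # measurement-style column
    "x11": [0, 1, 0, 0],              # 0/1 indicator
    "x12": [1, 0, 1, 1],              # 0/1 indicator
})
indicator_cols, measured_cols = split_by_cardinality(toy)
print(indicator_cols)
print(measured_cols)
```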

# Pass a column-to-dtype mapping to astype so that x11..x54 really become category columns.
df = df.astype({col: 'category' for col in df.columns[10:54]})
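It is worth confirming that a dtype conversion like the one above took effect. Depending on the pandas version, assigning through `iloc` may write the values into the existing integer columns and leave their dtype unchanged. The sketch below uses a hypothetical toy frame and converts through a mapping passed to `DataFrame.astype`, then inspects `dtypes`.

```python
import pandas as pd

# Hypothetical toy frame standing in for a slice of df.
toy = pd.DataFrame({"x10": [837, 845, 854], "x11": [0, 1, 0], "x12": [1, 0, 1]})

# astype with a column-to-dtype mapping returns a new frame in which
# only the listed columns are converted.
cat_cols = toy.columns[1:3]
converted = toy.astype({col: "category" for col in cat_cols})

print(converted.dtypes)
```

On the real frame the same check is `df.dtypes.value_counts()`, which should report the category columns separately from the integer ones.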

import seaborn as sns

df_count = df.pivot_table(index="covtype", columns="x14", aggfunc="size")
sns.heatmap(df_count, cmap=sns.light_palette("gray", as_cmap=True), annot=True, fmt="0")
plt.show()
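The table behind the heatmap is a plain contingency table: one row per cover type, one column per value of `x14`, and each cell holds a count of samples. As a sketch under toy data (the frame below is hypothetical), `pd.crosstab` builds the same counts. The visible difference is that empty combinations become 0 instead of missing values.

```python
import pandas as pd

# Hypothetical toy frame standing in for df.
toy = pd.DataFrame({
    "x01":     [2396, 2391, 2386, 2384, 2383, 2380],
    "covtype": [1, 1, 2, 2, 2, 3],
    "x14":     [0, 1, 0, 0, 1, 0],
})

# Same call shape as in the post: count rows per (covtype, x14) pair.
by_pivot = toy.pivot_table(index="covtype", columns="x14", aggfunc="size")

# crosstab counts co-occurrences directly and fills empty cells with 0.
by_crosstab = pd.crosstab(toy["covtype"], toy["x14"])

print(by_pivot)
print(by_crosstab)
```

Because the crosstab result holds integers only, it can be passed to `sns.heatmap` with `fmt="d"` for the annotations.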

Wrap-up

For now, for various reasons, the forest data
