Machine Learning 03 - Getting to Know sklearn (Classification)

Big Data/Machine-Learning 2022. 2. 8. 16:33

Reference: Fast Campus, '직장인을 위한 파이썬 데이터분석 올인원 패키지 Online' (Python Data Analysis All-in-One Package for Working Professionals)

The classification examples use datasets that ship with sklearn.
Before starting, suppress unnecessary warning output.

import warnings # suppress unnecessary warning output
warnings.filterwarnings('ignore')
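
Ignoring every warning globally also hides deprecation notices from pandas, seaborn, and matplotlib. As a minimal sketch of an alternative, the standard library's warnings.catch_warnings() limits the suppression to a single block. The noisy() function here is a hypothetical stand-in for any call that emits a warning.

```python
import warnings


def noisy():
    # Hypothetical helper: stands in for a library call that warns
    warnings.warn("this call is deprecated", DeprecationWarning)
    return 42


# Warnings are ignored only inside this block;
# the previous filter settings are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy()

print(result)
```

Outside the with block, warnings behave as before, so later deprecation messages stay visible.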

01 Classifying Flower Species

This uses 'iris', one of the datasets provided by sklearn.
The goal is a model that learns from the iris dataset and then predicts the species.

import pandas as pd

from sklearn.datasets import load_iris
iris = load_iris()

data - feature data
feature_names - column names of the feature data
target - label data (numeric)
target_names - label names (strings)

The sklearn (scikit-learn) package bundles the iris dataset, and load_iris() loads it as shown above. The returned object has the following keys: data, target, target_names, DESCR, feature_names, filename
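
A quick way to check these keys is to print them along with the array shapes. This is a small sketch; the exact key list can differ between scikit-learn versions, so newer releases may show a few more keys than the ones listed above.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# load_iris() returns a dictionary-like Bunch object
print(list(iris.keys()))

# 150 samples, each with 4 features, and one label per sample
print(iris['data'].shape)
print(iris['target'].shape)

# Keys can also be read as attributes
print(iris.target_names)
```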

Load the iris data, then assign each part to a variable

data = iris['data']
data[:5]
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
       

feature_names = iris['feature_names']
feature_names
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
 
 
target = iris['target']
target[-5:] # the data is not shuffled
array([2, 2, 2, 2, 2])

print("Dataset contents: ", iris['target'])
Dataset contents:  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
 
 
import collections
print('target info: ', collections.Counter(iris['target']))
target info:  Counter({0: 50, 1: 50, 2: 50})

# The target variable is encoded as 0, 1, 2.
# The 150 samples are stored in a 1-D array, with 50 samples for each of the 3 classes (0, 1, 2).
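
The same class counts can be obtained with NumPy alone. The sketch below uses np.unique with return_counts=True and pairs each class code with its species name. It is an alternative to collections.Counter, not a replacement for the approach above.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Unique class codes and how often each one occurs
classes, counts = np.unique(iris['target'], return_counts=True)

for cls, cnt in zip(classes, counts):
    # target_names[cls] maps the numeric code to the species name
    print(cls, iris['target_names'][cls], cnt)
```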


target_names = iris['target_names']
target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

▶ 'target' identifies the three iris species (encoded as 0, 1, 2)

Building a DataFrame (collect the pieces above into a single DataFrame)

# 'data' is currently a 2-D array, so convert it to a DataFrame.
df_iris = pd.DataFrame(data, columns=feature_names) # pd.DataFrame(values, columns=column names)

df_iris.head()

Add the target column

df_iris['target'] = target

df_iris
# df_iris.head()
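
As a side note, load_iris() can build a comparable DataFrame directly. The sketch below assumes scikit-learn 0.23 or later, where the as_frame=True option is available. The 'species' column name is a hypothetical addition for readability and is not part of the original walkthrough.

```python
from sklearn.datasets import load_iris

# as_frame=True makes the Bunch carry pandas objects
iris_frame = load_iris(as_frame=True)

# .frame holds the 4 feature columns plus the 'target' column
df = iris_frame.frame
print(df.head())

# Hypothetical extra column: map the codes 0, 1, 2 to species names
code_to_name = dict(enumerate(iris_frame.target_names))
df['species'] = df['target'].map(code_to_name)
print(df[['target', 'species']].drop_duplicates())
```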

▶ Visualizing with 'Seaborn'

import matplotlib.pyplot as plt
import seaborn as sns

# sns.scatterplot(x=x column, y=y column, data=DataFrame)
# hue = column used to color the points and label the legend
# palette = color palette to use
sns.scatterplot(x='sepal width (cm)', y='sepal length (cm)', hue='target', palette='muted', data=df_iris)
plt.title('Sepal')
plt.show()

sns.scatterplot(x='petal width (cm)', y='petal length (cm)', hue='target', palette='muted', data=df_iris)
plt.title('Petal')
plt.show()

▶ Drawing a 3D Plot

PCA can reduce the number of dimensions. The iris data has four features, so to plot it in three dimensions PCA projects the four features onto three components.

from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA

# figure size
fig = plt.figure(figsize=(8, 6)) # figsize=(width, height) in inches

# create the 3D axes
ax = fig.add_subplot(111, projection='3d', elev=-150, azim=110)

# PCA reduces the 4 features to 3 components
X_reduced = PCA(n_components=3).fit_transform(df_iris.drop('target', axis=1))


# ax.scatter(values for the 1st, 2nd, 3rd dimensions, styling)
# c = marker colors
# cmap = colormap the colors are drawn from
# s = marker size
# edgecolor = marker edge color
ax.scatter(X_reduced[:,0], X_reduced[:,1],X_reduced[:,2],c=df_iris['target'],
          cmap=plt.cm.Set1, edgecolor='k', s=40)

ax.set_title('Iris 3D')
ax.set_xlabel("x")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("y")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("z")
ax.zaxis.set_ticklabels([])

plt.show()
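
To see how much information the three PCA components keep, one option is to inspect explained_variance_ratio_ on the fitted PCA object. This is a small sketch that fits PCA on the raw feature array instead of df_iris so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris()['data']  # shape (150, 4)

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

# Each sample is now described by 3 components instead of 4 features
print(X_reduced.shape)

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)

# Combined fraction kept by the 3 components
total_ratio = pca.explained_variance_ratio_.sum()
print(total_ratio)
```

A total close to 1 means little information is lost by dropping the fourth dimension.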

02 Splitting the Dataset

from sklearn.model_selection import train_test_split

# Order matters: the x outputs come first, then the y outputs; train_test_split(features, labels)
x_train, x_valid, y_train, y_valid = train_test_split(df_iris.drop('target', axis=1), df_iris['target'])
# The 'target' column is what we predict, so it must never be included in the features

x_train.shape, y_train.shape
((112, 4), (112,))

x_valid.shape, y_valid.shape
((38, 4), (38,))
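
The 112/38 split above follows from the default test_size of 0.25 (150 × 0.25 = 37.5, rounded up to 38). The sketch below makes that setting explicit and adds random_state so the split is reproducible. The value 42 is an arbitrary choice, not something from the original post.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.25 is what train_test_split uses when no size is given
x_train, x_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(x_train.shape, x_valid.shape)

# The same random_state produces the same split again
_, _, _, y_valid_again = train_test_split(
    X, y, test_size=0.25, random_state=42
)
same_split = bool((y_valid == y_valid_again).all())
print(same_split)
```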

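※ Don't mix these up! The features must never contain the value being predicted. When you are unsure what the data looks like at a given step, print each piece as shown below.

Plotting the class counts at this point shows that the classes are distributed randomly between the training and validation sets. For example, the training data may receive more samples of class 2 while the validation data receives more of class 1, which is an imbalance.

sns.countplot(x=y_train)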

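Re-running the split and plotting the training and validation data again gives a different chart each time. Use the option that keeps the classes evenly distributed.
(Be sure to see this link: https://teddylee777.github.io/scikit-learn/train-test-split)

(Link: https://hyjykelly.tistory.com/44)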

# stratify= keeps the class ratio of the dataset (both the training and validation sets get classes 0, 1, 2 in even proportion, which prevents skew)
x_train, x_valid, y_train, y_valid = train_test_split(df_iris.drop('target', axis=1), df_iris['target'],
                                                     stratify=df_iris['target'])
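
The effect of stratify can be checked numerically rather than with a chart. The sketch below counts the classes in a plain split and in a stratified split. It uses the raw arrays from load_iris(return_X_y=True) so that it runs on its own, and random_state=0 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Plain split: class counts depend on the random shuffle
_, _, y_train_plain, _ = train_test_split(X, y, random_state=0)

# Stratified split: class ratio of y is preserved in both parts
_, _, y_train_strat, y_valid_strat = train_test_split(
    X, y, stratify=y, random_state=0
)

plain_counts = np.bincount(y_train_plain, minlength=3)
strat_counts = np.bincount(y_train_strat, minlength=3)
valid_counts = np.bincount(y_valid_strat, minlength=3)

print("train (plain):     ", plain_counts)
print("train (stratified):", strat_counts)
print("valid (stratified):", valid_counts)
```

With stratification the per-class counts should differ by at most one sample, while the plain split can drift further apart depending on the shuffle.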

