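A Taste of Data Analysis by Type 03 - Classification - Logistic Regression (3)

Big Data/Machine-Learning 2022. 2. 17. 15:15

Reference: Fast Campus 'Python Data Analysis All-in-One Package for Office Workers Online'

Let's work through a hands-on logistic regression exercise.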

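01 Data Preprocessing

Using the Pokémon dataset on which EDA has already been done, we will practice classifying whether a Pokémon is Legendary!
Given the current state of the data, every column we want to use other than the numeric ones needs to be converted to an appropriate type.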
```
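# Convert data types
df['Legendary'] = df['Legendary'].astype(int)
df['Generation'] = df['Generation'].astype(str)
# .copy() gives an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
preprocessed_df = df[['Type 1', 'Type 2','Total','HP','Attack','Defense','Sp. Atk','Sp. Def', 'Speed', 'Generation', 'Legendary']].copy()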
```
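The classification target 'Legendary' is currently a boolean (True/False), so it is converted to int.
Generation is currently stored as an integer (1, 2, 3, 4, 5, ...), but given what the feature means, these numbers act as category labels rather than quantities. It is therefore converted to str.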

One-hot encoding example

# Apply to Type 1
# pd.get_dummies() -> function that applies one-hot encoding
encoded_de = pd.get_dummies(preprocessed_df['Type 1'])
encoded_de.head()

The analysis will use multi-label encoding. Before that, the two type columns need to be merged.

# Merge the types, then apply multi-label encoding
def make_list(x1, x2):
    type_list = []
    type_list.append(x1) # Type 1 (x1) has no nulls, so append it as-is
    if pd.notna(x2): # Type 2 (x2) can be null, so check before appending
        type_list.append(x2)
    return type_list



preprocessed_df['Type'] = preprocessed_df.apply(lambda x: make_list(x['Type 1'], x['Type 2']), axis=1)
preprocessed_df.head()
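
A note on the null check for Type 2: the sketch below shows why a value-based test such as `pd.notna` tends to be safer than an identity comparison with `np.nan`. The helper names `make_list_identity` and `make_list_notna` and the sample values are made up for illustration and are not part of the course code.

```python
import numpy as np
import pandas as pd

def make_list_identity(x1, x2):
    # Identity check: only recognizes the np.nan object itself
    return [x1] if x2 is np.nan else [x1, x2]

def make_list_notna(x1, x2):
    # Value check: recognizes any missing value (np.nan, float('nan'), None, ...)
    return [x1, x2] if pd.notna(x2) else [x1]

other_nan = float('nan')  # a NaN that is a different object from np.nan

print(make_list_identity('Fire', np.nan))     # ['Fire']
print(make_list_identity('Fire', other_nan))  # ['Fire', nan]  <- the NaN slips through
print(make_list_notna('Fire', other_nan))     # ['Fire']
print(make_list_notna('Grass', 'Poison'))     # ['Grass', 'Poison']
```

Whether the identity check works in practice depends on how the NaN values were created, so the value check is the more defensive choice.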

Lambda function reference link (https://blockdmask.tistory.com/520#:~:text=%EB%9E%8C%EB%8B%A4%ED%95%A8%EC%88%98%20%EC%84%A0%EC%96%B8%20%EB%B0%A9%EB%B2%95&text=lambda%20%EB%9D%BC%EB%8A%94%20%ED%82%A4%EC%9B%8C%EB%93%9C%EB%A5%BC%20%EC%9E%85%EB%A0%A5,2%20%EC%9D%B4%EB%9F%B0%EC%8B%9D%EC%9C%BC%EB%A1%9C%20%EB%90%A9%EB%8B%88%EB%8B%A4.)

- Type 1 and Type 2 can now be deleted!

del preprocessed_df['Type 1']
del preprocessed_df['Type 2']

preprocessed_df.head()

Because some Pokémon have more than one type, multi-label encoding fits better than one-hot encoding here. We apply it to preprocess the data.

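# Preprocess the types with multi-label encoding
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# fit_transform returns the multi-label indicator matrix; mlb.classes_ holds the type names used as column names
type_matrix = mlb.fit_transform(preprocessed_df.pop('Type'))
# Reuse the original index so that join aligns the rows correctly
type_df = pd.DataFrame(type_matrix, columns=mlb.classes_, index=preprocessed_df.index)
preprocessed_df = preprocessed_df.join(type_df)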
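
To see what the multi-label step produces, here is a small sketch on three hand-written rows (the toy DataFrame is an assumption for illustration, not the course dataset). Each row can have a 1 in more than one type column, which a single one-hot column cannot express.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data: the 'Type' column holds a list of one or two types per row
toy = pd.DataFrame({
    'Name': ['Bulbasaur', 'Charmander', 'Charizard'],
    'Type': [['Grass', 'Poison'], ['Fire'], ['Fire', 'Flying']],
})

mlb = MultiLabelBinarizer()
type_matrix = mlb.fit_transform(toy['Type'])  # one column per distinct type
encoded = pd.DataFrame(type_matrix, columns=mlb.classes_, index=toy.index)

print(list(mlb.classes_))  # ['Fire', 'Flying', 'Grass', 'Poison']
print(encoded)
```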

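Next, one-hot encode Generation so that each string value gets its own column.

# One-hot encode Generation so that each string value gets its own column
preprocessed_df = pd.get_dummies(preprocessed_df) # only string columns are converted, which here is just 'Generation'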
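
The reason Generation had to be cast to str first can be checked on a toy frame (the column values below are made up). As a rough sketch: `pd.get_dummies` leaves numeric columns alone and only expands string or categorical ones.

```python
import pandas as pd

toy = pd.DataFrame({'HP': [45, 60, 80], 'Generation': [1, 2, 1]})

# Generation left as int: treated as a number, nothing is encoded
as_int = pd.get_dummies(toy)

# Generation cast to str: expanded into one indicator column per value
as_str = pd.get_dummies(toy.assign(Generation=toy['Generation'].astype(str)))

print(list(as_int.columns))  # ['HP', 'Generation']
print(list(as_str.columns))  # ['HP', 'Generation_1', 'Generation_2']
```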

▶ With preprocessing done, the numeric data now needs to be standardized.

# Standardize the data
from sklearn.preprocessing import StandardScaler

# 1. Create the scaler
scaler = StandardScaler()

# 2. List the columns to scale
scale_columns = ['Total','HP','Attack','Defense','Sp. Atk','Sp. Def', 'Speed']

# 3. Data to scale - selecting by column name applies the scaler to exactly those columns
preprocessed_df[scale_columns] = scaler.fit_transform(preprocessed_df[scale_columns])

preprocessed_df.head()

This uses ordinary standardization, which produces z-scores. MinMaxScaler would instead rescale each feature to the range 0 to 1, which can also give good results.
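
A quick side-by-side of the two scalers on a made-up single column (the numbers are arbitrary): StandardScaler centers the values at mean 0 with standard deviation 1, while MinMaxScaler maps the minimum to 0 and the maximum to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

values = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

z = StandardScaler().fit_transform(values)  # (x - mean) / std
mm = MinMaxScaler().fit_transform(values)   # (x - min) / (max - min)

print(z.ravel())   # symmetric around 0
print(mm.ravel())  # [0.   0.25 0.5  0.75 1.  ]
```

Which one works better depends on the data and the model, so it is worth trying both.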

02 Splitting the Dataset

from sklearn.model_selection import train_test_split

x = preprocessed_df.loc[:, preprocessed_df.columns != 'Legendary']
y = preprocessed_df['Legendary']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)

x_train.shape
x_test.shape

(600, 31)
(200, 31)

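03 Training the Model

# Import logistic regression and the evaluation metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Train LR. Unlike ordinary regression it is fit by an iterative solver, so random_state is fixed for reproducibility
# (it only has an effect with the sag, saga and liblinear solvers)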
lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)

# Result
y_pred = lr.predict(x_test)

# Evaluation
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

0.955
0.6153846153846154
0.6666666666666666
0.64
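
On `random_state`: as far as I understand the scikit-learn documentation, it is only used by the `sag`, `saga` and `liblinear` solvers, which shuffle the data. The sketch below checks this on synthetic data from `make_classification` (not the Pokémon data): with the default solver, two different seeds should give the same coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Same default solver, different seeds
coef_a = LogisticRegression(random_state=0).fit(X, y).coef_
coef_b = LogisticRegression(random_state=123).fit(X, y).coef_

same = np.allclose(coef_a, coef_b)
print(same)
```

Setting `random_state` anyway does no harm and keeps the result reproducible if the solver is changed later.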

As we saw in the earlier lecture, the model ended up with high accuracy only. We need to understand why, so let's check with the confusion matrix.

# Check the situation with the confusion matrix
from sklearn.metrics import confusion_matrix

confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)

[[183   5]
 [  4   8]]

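There are 9 wrong predictions in total: 5 false positives and 4 false negatives. The cause is class imbalance: 188 of the 200 test samples are non-Legendary, so even a model that answers 'not Legendary' every time would reach 94% accuracy.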
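
The four scores can be recomputed by hand from the confusion matrix printed above, which makes the gap between accuracy and the other metrics easier to see. The sketch uses plain Python; `baseline_accuracy` is a name introduced here for the score of a model that always predicts class 0.

```python
# Confusion matrix from above: rows = actual, columns = predicted
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = 183, 5, 4, 8
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted Legendary, how many really are
recall = tp / (tp + fn)      # of the actual Legendary, how many were found
f1 = 2 * precision * recall / (precision + recall)

# A model that always predicts 0 gets every negative right and every positive wrong
baseline_accuracy = (tn + fp) / total

print(accuracy, precision, recall, f1)
print(baseline_accuracy)
```

The trained model's 0.955 accuracy is barely above the 0.94 of the always-negative baseline, while precision and recall show how many Legendary Pokémon it actually gets right.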

04 Adjusting the Class Imbalance

Before adjusting, we should look at how imbalanced the classes are.
```
preprocessed_df['Legendary'].value_counts()
0    735
1     65
```
That is a ratio of roughly 11:1!

Now we sample the two classes 1:1.

# Take only the rows whose label is 1 and extract their index values
positive_random_idx = preprocessed_df[preprocessed_df['Legendary']==1].sample(65, random_state=33).index.tolist()

print(positive_random_idx)
[796, 537, 704, 164, 262, 429, 542, 707, 705, 264, 551, 430, 418, 163, 424, 706, 157, 545, 431, 710, 708, 702, 156, 699, 428, 703, 538, 420, 795, 540, 793, 270, 798, 544, 794, 426, 711, 797, 799, 552, 712, 709, 419, 425, 414, 415, 550, 700, 539, 416, 541, 543, 701, 553, 417, 422, 549, 162, 792, 269, 421, 158, 427, 263, 423]


# Take only the rows whose label is 0 and extract their index values
negative_random_idx = preprocessed_df[preprocessed_df['Legendary']==0].sample(65, random_state=33).index.tolist()

print(negative_random_idx)
[321, 101, 86, 291, 122, 64, 661, 198, 575, 298, 614, 605, 448, 501, 745, 464, 138, 479, 187, 660, 31, 441, 151, 22, 674, 599, 302, 121, 375, 670, 285, 8, 331, 770, 450, 327, 301, 217, 338, 286, 374, 399, 4, 297, 165, 96, 89, 451, 507, 721, 513, 698, 486, 623, 320, 160, 21, 366, 442, 607, 311, 645, 336, 358, 514]

This gives 65 samples each for classes 0 and 1.
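
The same 1:1 undersampling can also be written in one step with `groupby(...).sample(...)`. This is an alternative sketch on a made-up toy frame, not the code used in the course, and it will not reproduce the exact index lists above.

```python
import pandas as pd

# Toy imbalanced data: 4 positives, 16 negatives
toy = pd.DataFrame({'Speed': range(20), 'Legendary': [1] * 4 + [0] * 16})

# Size of the smaller class decides how many rows to draw from each class
n_minority = int(toy['Legendary'].value_counts().min())

# Draw the same number of rows from every class
balanced = toy.groupby('Legendary').sample(n=n_minority, random_state=33)

print(balanced['Legendary'].value_counts())
```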

Split into datasets again

# Combine the two index lists and split again
from sklearn.model_selection import train_test_split

random_idx = positive_random_idx + negative_random_idx

x = preprocessed_df.loc[random_idx, preprocessed_df.columns != 'Legendary']
y = preprocessed_df.loc[random_idx, 'Legendary']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=33)

x_train.shape
(97, 31)

x_test.shape
(33, 31)

The 130 samples, balanced 1:1, are split into training and test data at a ratio of 7.5:2.5.

Train and test again

# Train and test again
# Import logistic regression and the evaluation metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

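# Train LR. Unlike ordinary regression it is fit by an iterative solver, so random_state is fixed for reproducibility
# (it only has an effect with the sag, saga and liblinear solvers)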
lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)

# Result
y_pred = lr.predict(x_test)

# Evaluation
print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))
0.9696969696969697
0.9230769230769231
1.0
0.9600000000000001

All of the metrics now come out high!

The confusion matrix looks good as well
```
# Check the situation with the confusion matrix
from sklearn.metrics import confusion_matrix

confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)

[[20  1]
 [ 0 12]]
```
You might wonder why the results are better even though there is less data; this is what adjusting the class imbalance made possible.
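
One thing to keep in mind when comparing the two runs is that the second test set is itself balanced 1:1, and precision depends on the class mix. The arithmetic below is only a back-of-the-envelope projection, under the assumption that the error rates measured on the balanced test set would carry over unchanged to a test set with the original mix of 188 negatives and 12 positives. It is not a measured result.

```python
# Rates measured on the balanced test set: [[20, 1], [0, 12]]
tpr = 12 / (12 + 0)   # recall on the positives
fpr = 1 / (20 + 1)    # share of negatives wrongly flagged as Legendary

balanced_precision = 12 / (12 + 1)

# Hypothetical: apply the same rates to the class mix of the first test set
n_neg, n_pos = 188, 12
expected_tp = tpr * n_pos
expected_fp = fpr * n_neg
projected_precision = expected_tp / (expected_tp + expected_fp)

print(round(balanced_precision, 3), round(projected_precision, 3))
```

Recall is unaffected by the class mix, so the jump in recall is the clearest sign of the improvement.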
