A Taste of Data Analysis by Type 06 - Sentiment Analysis

H-V 2022. 2. 18. 18:12

Reference: the Fast Campus course '직장인을 위한 파이썬 데이터분석 올인원 패키지 Online'

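01 What is sentiment classification?

A model that determines which sentiment a sentence leans toward.

Sentiment classification consists of three steps.
1) Text preprocessing - convert the text into a form the machine can work with
2) Binary classification - turn each sentence or word into a two-class classification problem
3) Positive/negative keyword analysis - from the binary-classified data, analyze which keywords fall on the positive side and which on the negative side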

02 EDA

* Required libraries

%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv("https://raw.githubusercontent.com/yoonkt200/FastCampusDataset/master/tripadviser_review.csv")

df.head()

df.shape
(1001, 2)

df.isnull().sum()
rating    0
text      0
dtype: int64

df.info()
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   rating  1001 non-null   int64 
 1   text    1001 non-null   object
 
 df['text'][100]
 '올 봄에 벚꽃기간에 방문, 협재를 바라보는 바다뷰가 좋고 대로변이라 렌트해서 가기도 좋음.
 조식은 이용안했는데 근처 옹포밥집까지 아침 산책겸 걸어가서 하고옴. 
 루프탑 수영장과 바가 있었는데 내가 갔을때는 밤에 비바람이 너무 불어서 이용못하고옴 ㅠㅠ
 단점으로는 모 유명 여행블로거 리뷰처럼 화장실 물떄가... 그거빼곤 다 만족'
 
 len(df['text'].values.sum())
 223576

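There are 1,001 reviews with no null values. They are written in Korean and occasionally contain special characters or expressions such as 'ㅠㅠ'.
The combined length of all review text is 223,576 characters.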

03 Korean preprocessing

Install the libraries needed for preprocessing with the command below.
```
pip install konlpy==0.5.1 jpype1 Jpype1-py3
```
If the installation fails, following the steps at this link exactly should get it installed (https://0ver-grow.tistory.com/69)

Regular expressions

import re

def apply_regular_expression(text):
    hangul = re.compile('[^ ㄱ-ㅣ가-힣]')
    result = hangul.sub('',text)
    return result
    
apply_regular_expression(df['text'][0])

df['text'][0]

'여행에 집중할수 있게 편안한 휴식을 제공하는 호텔이었습니다 위치선정 또한 적당한 편이었고 청소나 청결상태도 좋았습니다'
'여행에 집중할수 있게 편안한 휴식을 제공하는 호텔이었습니다. 위치선정 또한 적당한 편이었고 청소나 청결상태도 좋았습니다.'
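
To see what the character class keeps and drops, here is a small sketch on a made-up review string. The `sample` text and the `keep_hangul` name are illustrative and not from the dataset, and collapsing the leftover spaces is an extra step the function above does not perform. The pattern removes everything that is not a space, a Hangul jamo (ㄱ-ㅣ), or a complete Hangul syllable (가-힣), so digits, Latin letters, and punctuation disappear while 'ㅠㅠ' survives.

```python
import re

# Same pattern as above: anything that is NOT a space, jamo, or Hangul syllable
HANGUL_ONLY = re.compile('[^ ㄱ-ㅣ가-힣]')

def keep_hangul(text):
    """Strip non-Hangul characters, then collapse the leftover runs of spaces."""
    stripped = HANGUL_ONLY.sub('', text)
    return ' '.join(stripped.split())

sample = "Wi-Fi 빠르고 조식 Good!! 별 5개 ㅠㅠ"  # hypothetical review text
cleaned = keep_hangul(sample)
print(cleaned)
```

Note that the digit in '5개' is removed, which is part of why single-character leftovers such as '개' show up later.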

Extracting noun morphemes

# Extract noun morphemes
from konlpy.tag import Okt
from collections import Counter

nouns_tagger = Okt()
nouns = nouns_tagger.nouns(apply_regular_expression(df['text'][0]))
nouns

['여행', '집중', '휴식', '제공', '호텔', '위치', '선정', '또한', '청소', '청결', '상태']


# Process the whole corpus at once
# The konlpy tagger takes a string, not a list, so the reviews are joined into a single text first.
nouns = nouns_tagger.nouns(apply_regular_expression("".join(df['text'].tolist())))
nouns
['여행',
 '집중',
 '휴식',
 '제공',
 '호텔',
 '위치',
 '선정',
 ...]
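
One caveat with joining on an empty string: the last word of one review is glued to the first word of the next, which may change how the tagger splits tokens at review boundaries. A minimal sketch with two made-up reviews, assuming a single space is an acceptable separator:

```python
reviews = ["위치 좋은 호텔", "직원 친절"]  # hypothetical reviews

glued = "".join(reviews)    # reviews run into each other
spaced = " ".join(reviews)  # reviews stay separated by a space

print(glued)
print(spaced)
```

Whether this matters for the counts in this post has not been checked here; it is only a point to be aware of when reusing the snippet.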
 
 
counter = Counter(nouns)
counter.most_common(10)
[('호텔', 803),
 ('수', 498),
 ('것', 436),
 ('방', 330),
 ('위치', 328),
 ('우리', 327),
 ('곳', 320),
 ('공항', 307),
 ('직원', 267),
 ('매우', 264)]

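The problem here is that two-character words are interpretable, but single-character nouns such as 수, 것, and 곳 are hard to interpret and hard to analyze. Let's remove them.

Removing single-character nouns

# Remove single-character nouns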
available_counter = Counter({x : counter[x] for x in counter if len(x) > 1 })
available_counter.most_common(10)

[('호텔', 803),
 ('위치', 328),
 ('우리', 327),
 ('공항', 307),
 ('직원', 267),
 ('매우', 264),
 ('가격', 245),
 ('객실', 244),
 ('시설', 215),
 ('제주', 192)]
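
The same filter on a tiny hand-made token list (the tokens are illustrative) shows that the counts of the remaining words are unchanged; only the single-character keys are dropped.

```python
from collections import Counter

# Hypothetical noun tokens, including single-character ones
tokens = ['호텔', '수', '위치', '것', '호텔', '곳', '직원', '수']
counter = Counter(tokens)

# Keep only words longer than one character; their counts carry over as-is
available_counter = Counter({w: c for w, c in counter.items() if len(w) > 1})
print(available_counter.most_common(1))
```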

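Stopword removal - for Korean, stopwords are removed by loading a stopword dictionary and filtering against it. A few words that carry no meaning for this dataset are also added to the dictionary.

# Load the stopword dictionary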
# source - https://www.ranks.nl/stopwords/korean
stopwords = pd.read_csv("https://raw.githubusercontent.com/yoonkt200/FastCampusDataset/master/korean_stopwords.txt").values.tolist()
print(stopwords[:10])

[['휴'], ['아이구'], ['아이쿠'], ['아이고'], ['어'], ['나'], ['우리'], ['저희'], ['따라'], ['의해']]

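# Add words that carry no meaning for this analysis
# Note: the loaded entries are one-element lists (see the output above), while the
# words appended here are plain strings, so a later `x not in stopwords` check
# with a string only matches the appended words.
jeju_list = ['제주','제주도','호텔','리뷰','숙소','여행','트립']
for word in jeju_list:
    stopwords.append(word)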
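
Calling `.values.tolist()` on a one-column DataFrame gives a list of one-element lists, as the printed output shows, while the loop above appends plain strings. A membership test with a string therefore only matches the appended words. The sketch below reproduces this with a few entries and shows one possible way to flatten everything into a set; `flat_stopwords` is a name introduced here for illustration. Using it in the pipeline would likely change the vocabulary and counts shown in the rest of this post.

```python
# A few entries in the same shape as the loaded dictionary (list of lists)
stopwords = [['휴'], ['아이구'], ['우리']]
for word in ['제주', '호텔']:
    stopwords.append(word)  # plain strings, as in the loop above

print('호텔' in stopwords)  # True: appended as a string
print('우리' in stopwords)  # False: stored as ['우리'], not '우리'

# Flatten into a set of strings so every entry can be matched
flat_stopwords = {
    word
    for item in stopwords
    for word in ([item] if isinstance(item, str) else item)
}
print('우리' in flat_stopwords)  # True
```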
    
# Build the BoW using the stopword dictionary
from sklearn.feature_extraction.text import CountVectorizer

def text_cleaning(text):
    hangul = re.compile('[^ ㄱ-ㅣ가-힣]')
    result = hangul.sub('', text) # keep only Hangul and spaces
    nouns = nouns_tagger.nouns(result) # extract nouns, reusing the Okt tagger created above
    nouns = [x for x in nouns if len(x) > 1] # drop single-character nouns
    nouns = [x for x in nouns if x not in stopwords] # drop stopwords
    return nouns
    

# The custom function can now be passed as the tokenizer when vectorizing
vect = CountVectorizer(tokenizer=text_cleaning)
bow_vect = vect.fit_transform(df['text'].tolist())
word_list = vect.get_feature_names_out().tolist() # use get_feature_names() on scikit-learn < 1.0
count_list = bow_vect.toarray().sum(axis=0)

word_list
['가가',
 '가게',
 '가격',
 ..]
 
 count_list
 array([  4,   8, 245, ...,   1,   7,  14], dtype=int64)

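1) apply_regular_expression() - set up the Korean preprocessing function
2) Extract Korean morphemes with Okt()
3) Put all morphemes into a Counter, then remove the single-character entries
4) Download the Korean stopword dictionary and add a few extra words
-------> The preprocessing pipeline ends here
5) With the stopword dictionary ready, define text_cleaning(), the function used to build the BoW
6) text_cleaning() runs the whole preprocessing pipeline above in one call
7) Set up 'CountVectorizer' to vectorize the preprocessed morphemes
8) Fit it by passing in the data to vectorize
9) The word list and count list show each morpheme and how many times it appears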
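
Steps 7 to 9 can be pictured with a hand-rolled bag-of-words on already-tokenized toy documents. This is a sketch of the idea rather than `CountVectorizer`'s implementation, and the documents are made up: each distinct word gets a column index in sorted order, each document becomes a row of counts, and summing the columns gives the total frequency per word.

```python
from collections import Counter

# Hypothetical, already-tokenized documents (what a tokenizer would return)
docs = [['가격', '위치', '가격'], ['위치', '직원'], ['가격']]

# Vocabulary: each distinct token gets a column index, in sorted order
all_words = sorted({w for doc in docs for w in doc})
vocabulary = {word: idx for idx, word in enumerate(all_words)}

# Document-term matrix: one row per document, one column per word
bow = []
for doc in docs:
    counts = Counter(doc)
    bow.append([counts.get(word, 0) for word in vocabulary])

word_list = list(vocabulary)
count_list = [sum(col) for col in zip(*bow)]  # column sums = total frequency
print(dict(zip(word_list, count_list)))
```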

Bundle the results into a dictionary for visualization, or process them with TF-IDF

# Build a dictionary for visualization and TF-IDF processing
word_count_dict = dict(zip(word_list, count_list))
print(str(word_count_dict)[:100])

{'가가': 4, '가게': 8, '가격': 245, '가격표': 1, '가구': 8, '가급': 1, '가기': 20, '가까이': 20, '가끔': 5, '가능': 10, '가

from sklearn.feature_extraction.text import TfidfTransformer

tfidf_vectorizer = TfidfTransformer()
tf_idf_vect = tfidf_vectorizer.fit_transform(bow_vect)

print(tf_idf_vect[0])
  (0, 3588)	0.35673213299026796
  (0, 2927)	0.2582351368959594
  (0, 2925)	0.320251680858207
  (0, 2866)	0.48843555212083145
  (0, 2696)	0.23004450213863206
  (0, 2311)	0.15421663035331626
  (0, 1584)	0.48843555212083145
  (0, 1527)	0.2928089229786031
  (0, 790)	0.2528176728459411
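
For reference, with its default settings `TfidfTransformer` computes a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw count, and then L2-normalizes each row. The following is a hand-rolled sketch of that calculation on a toy count matrix, not scikit-learn's own code; the matrix and variable names are made up for illustration.

```python
import math

# Toy document-term counts: 2 documents x 3 terms (hypothetical)
counts = [[1, 1, 0],
          [0, 1, 1]]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf, matching the TfidfTransformer defaults (smooth_idf=True)
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])  # L2-normalize the row

print(tfidf[0])
```

In the first row the term that appears in only one document ends up with a larger weight than the term shared by both, which is the behavior TF-IDF is used for.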

Mapping indices to words

# Map each column index back to its word
invert_index_vectorizer = {v: k for k, v in vect.vocabulary_.items()}
print(str(invert_index_vectorizer)[:100]+'..')

{2866: '집중', 3588: '휴식', 2696: '제공', 2311: '위치', 1584: '선정', 790: '또한', 2927: '청소', 2925: '청결', 1527..

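04 Sentiment analysis with binary classification

The texts have now been converted into X values. The remaining question is what to use for Y.
The rating is a suitable choice for Y.
Plotting the ratings as a histogram suggests that most ratings fall into either 1-2 or 4-5, where 1-2 can be read as negative and 4-5 as positive.

To make the 1-5 scores usable for classification, they need to be converted to binary form: ratings of 1-3 are labeled negative and ratings of 4-5 positive.

# Binarize: 1-3 is negative, 4-5 is positive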
def rating_to_label(rating):
    if rating > 3:
        return 1
    else:
        return 0
    
df['y'] = df['rating'].apply(lambda x:rating_to_label(x))

df.head()

df.y.value_counts()

1    726 # positive
0    275 # negative
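
These counts also set a baseline worth keeping in mind: a classifier that always predicts the majority class would already be right for 726 of the 1,001 reviews, so accuracy alone says little on this data. The arithmetic, as a quick sketch:

```python
positive, negative = 726, 275  # counts from value_counts() above
total = positive + negative

# Accuracy of a trivial model that always predicts the larger class
majority_baseline = max(positive, negative) / total
print(total, round(majority_baseline, 3))
```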

- Splitting the dataset

# Split the dataset

from sklearn.model_selection import train_test_split

y = df['y']
x_train, x_test, y_train, y_test = train_test_split(tf_idf_vect, y, test_size=0.3)

print(x_test.shape)
(301, 3599) # about 30% of the 1,001 reviews went to the test set

- Training and evaluating the model

# Train the model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Train LR model
lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)

# Predict class labels
y_pred = lr.predict(x_test)

# Evaluate
accuracy_score(y_test, y_pred)

precision_score(y_test, y_pred)

recall_score(y_test, y_pred)

f1_score(y_test, y_pred)

from sklearn.metrics import confusion_matrix

confmat = confusion_matrix(y_test, y_pred)
print(confmat)

[[  5  82]
 [  0 214]]

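The number of wrong predictions is too large: 82 of the 87 negative reviews are predicted as positive. The 82 and the 5 would need to swap places for the model to be accurate and trustworthy. Let's adjust the sampling.

(When y was first split into 1 and 0, class 1 was far more frequent, which is why this happens. Let's resample to a 1:1 ratio.)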
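
The usual metrics can be recomputed by hand from the confusion matrix printed above, which makes the imbalance visible: recall for the positive class is perfect, while only 5 of the 87 negative reviews are recognized as negative. A small sketch, assuming scikit-learn's layout of true labels in rows and predictions in columns; `specificity` is added here for illustration and is not computed in the original code.

```python
# Confusion matrix from above, laid out as [[TN, FP], [FN, TP]] for labels 0 and 1
(tn, fp), (fn, tp) = [[5, 82], [0, 214]]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)  # share of negative reviews predicted as negative

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```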

# Rebalance the sampling
positive_random_idx = df[df['y']==1].sample(275, random_state=33).index.tolist()
negative_random_idx = df[df['y']==0].sample(275, random_state=33).index.tolist()

# Rebuild the dataset
random_idx = positive_random_idx + negative_random_idx
X = tf_idf_vect[random_idx] # X: the rows of tf_idf_vect at the positions in random_idx
y = df['y'][random_idx] # y: the values of df['y'] at the indices in random_idx
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)

print(x_train.shape)
print(x_test.shape)
(412, 3599)
(138, 3599)
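
The rebalancing above is random undersampling: the larger class is cut down to the size of the smaller one, so 451 of the 726 positive reviews are left out. The same idea in plain Python on made-up labels, as a sketch rather than a replacement for the pandas version:

```python
import random

# Hypothetical labels: 8 positive (1) and 3 negative (0) reviews
labels = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

rng = random.Random(33)  # fixed seed so the draw is reproducible
pos_idx = [i for i, y in enumerate(labels) if y == 1]
neg_idx = [i for i, y in enumerate(labels) if y == 0]

# Keep as many samples per class as the smaller class has
n_keep = min(len(pos_idx), len(neg_idx))
balanced_idx = rng.sample(pos_idx, n_keep) + rng.sample(neg_idx, n_keep)

balanced_labels = [labels[i] for i in balanced_idx]
print(sorted(balanced_labels))
```

The trade-off is that the model trains on less data, which is one plausible reason for lower scores after rebalancing.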

# Retrain the model
lr = LogisticRegression(random_state=0)
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)


# Re-evaluate the model
print("accuracy: %.2f" % accuracy_score(y_test, y_pred))
print("Precision : %.3f" % precision_score(y_test, y_pred))
print("Recall : %.3f" % recall_score(y_test, y_pred))
print("F1 : %.3f" % f1_score(y_test, y_pred))
accuracy: 0.72
Precision : 0.644
Recall : 0.797
F1 : 0.712

confmat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print(confmat)
[[53 26]
 [12 47]]

The scores dropped a little, but the confusion matrix shows that correct predictions are now spread across both classes instead of piling up on one side.

05 Positive/negative keyword analysis

# Positive/negative keyword analysis
plt.rcParams['figure.figsize'] = [10,8]
plt.bar(range(len(lr.coef_[0])), lr.coef_[0])

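The coefficients split fairly evenly between positive and negative values, and those with small magnitude can be read as keywords that appear in both positive and negative reviews.

Sorting the coefficients from largest to smallest gives an indicator for separating positive keywords from negative ones.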

print(sorted(((value, index) for index, value in enumerate(lr.coef_[0])), reverse=True)[:5])
print(sorted(((value, index) for index, value in enumerate(lr.coef_[0])), reverse=True)[-5:])

[(1.332130808711117, 2400), (1.1098677278465363, 2977), (1.029120247844704, 1247), (0.9474432432978868, 2957), (0.9049132254229898, 26)]
[(-0.6491883332225629, 363), (-0.6683241824194205, 3538), (-0.6811855513119685, 1909), (-0.9632209931825515, 1293), (-1.1245008869879292, 515)]

- All that remains is to turn these indices back into words and check them.

coef_pos_index = sorted(((value, index) for index, value in enumerate(lr.coef_[0])), reverse=True)
coef_neg_index = sorted(((value, index) for index, value in enumerate(lr.coef_[0])), reverse=False)

for coef in coef_pos_index[:15]:
    print(invert_index_vectorizer[coef[1]], coef[0])
    
for coef in coef_neg_index[:15]:
    print(invert_index_vectorizer[coef[1]], coef[0])
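
The lookup above can be followed on a toy example. The vocabulary and coefficient values below are made up for illustration; the point is that one sorted list of (coefficient, index) pairs serves both ends, with the head holding the most positive words and the tail the most negative ones.

```python
# Hypothetical vocabulary (word -> column index) and fitted coefficients
vocabulary = {'최고': 0, '냄새': 1, '위치': 2, '최악': 3}
coefs = [1.2, -0.9, 0.1, -1.1]

# Invert the mapping so a column index leads back to its word
index_to_word = {idx: word for word, idx in vocabulary.items()}

# Sort (coefficient, index) pairs once, largest coefficient first
ranked = sorted(((value, idx) for idx, value in enumerate(coefs)), reverse=True)

top_positive = [index_to_word[idx] for _, idx in ranked[:2]]
top_negative = [index_to_word[idx] for _, idx in ranked[::-1][:2]]
print(top_positive, top_negative)
```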
