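Hands-on Example - YouTube Channel and Video Analysis 07

Big Data/Data-Analysis 2022. 2. 26. 12:58

Reference: Fast Campus, 'Python Data Analysis All-in-One Package for Office Workers Online'

We will visualize a ranking of popular YouTube channels and analyze the titles of popular videos.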

* Required libraries and suppressing warnings

# Required libraries
import pandas as pd
import seaborn as sns

# Korean font setup
from matplotlib import rc
import matplotlib.font_manager as fm

# # List the installed fonts; look for Nanum or Malgun Gothic here and install one if neither is present
# font_list = [font.name for font in fm.fontManager.ttflist]
# font_list

# Korean font test
import matplotlib as mpl
import matplotlib.pyplot as plt

plt.rcParams['font.family'] = 'Malgun Gothic'
plt.figure(figsize=(5,5))
plt.plot([0,1],[0,1],label='한글테스트')
plt.legend();

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

01 EDA

https://www.kaggle.com/datasnaek/youtube-new

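Download the data and place it in your Python working directory before loading it.

Some readers may find that the Korean text is garbled when the downloaded file is used as-is. There are three ways to fix this:

1. Open the file in Notepad and, in Save As, change the encoding to 'utf-8'.

2. Open it in VS Code in the same way and save it again as a CSV file.

3. Import it into Excel, where the Korean text comes through correctly, then change the settings when saving it under a new name.

When importing, choose Delimited, select Comma as the delimiter, and finish.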
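The manual re-save can also be done in Python. The following is a minimal sketch, not part of the original course material: the helper name reencode_to_utf8, the output file name, and the candidate encodings (cp949 is a common legacy Korean encoding on Windows) are all assumptions, and the file's actual encoding may differ on your machine.

```python
def reencode_to_utf8(src, dst, candidates=("utf-8-sig", "cp949")):
    """Decode src with the first candidate encoding that works, write dst as UTF-8.

    Hypothetical helper: the candidate list is an assumption, not a fact
    about the Kaggle file. Returns the encoding that succeeded.
    """
    with open(src, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # try the next candidate
        # newline="" keeps the original line endings untouched
        with open(dst, "w", encoding="utf-8", newline="") as f:
            f.write(text)
        return enc
    raise ValueError(f"none of {candidates} could decode {src}")

# Usage (file names are placeholders):
# reencode_to_utf8("KRvideos.csv", "KRvideos_utf8.csv")
```

A successful decode does not prove the guess was right, so it is still worth checking a few Korean titles with df.head() after loading the new file.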

- Starting the EDA

df = pd.read_csv('KRvideos.csv')
df.head()

# Number of rows and columns
df.shape
(34567, 16)

# Check for missing values
df.isnull().sum()
video_id                     0
trending_date                0
title                        0
channel_title                0
category_id                  0
publish_time                 0
tags                         0
views                        0
likes                        0
dislikes                     0
comment_count                0
thumbnail_link               0
comments_disabled            0
ratings_disabled             0
video_error_or_removed       0
description               3163

df.info()
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   video_id                34567 non-null  object
 1   trending_date           34567 non-null  object
 2   title                   34567 non-null  object
 3   channel_title           34567 non-null  object
 4   category_id             34567 non-null  int64 
 5   publish_time            34567 non-null  object
 6   tags                    34567 non-null  object
 7   views                   34567 non-null  int64 
 8   likes                   34567 non-null  int64 
 9   dislikes                34567 non-null  int64 
 10  comment_count           34567 non-null  int64 
 11  thumbnail_link          34567 non-null  object
 12  comments_disabled       34567 non-null  bool  
 13  ratings_disabled        34567 non-null  bool  
 14  video_error_or_removed  34567 non-null  bool  
 15  description             31404 non-null  object

- Keep only the needed columns, then handle duplicates and missing values

# Keep only the needed columns
df = df[['title','channel_title', 'views']]
df.head()

df_sorted = df.sort_values(by='views', ascending=False)
df_sorted

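Duplicate rows appear because the dataset is accumulated daily: if the same video is trending today and tomorrow, it is recorded again each day. These duplicates need to be removed.

# Remove duplicates
# The data is already sorted by views in descending order, so keep='first' keeps the row with the highest view count
df_sorted_max = df_sorted.drop_duplicates(['title','channel_title'], keep='first')
df_sorted_max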
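To see why sorting first matters, here is a small sketch on made-up data (the toy rows are not from KRvideos.csv). The same video appears on two trending days, and sorting in descending order before drop_duplicates means keep='first' retains the larger view count.

```python
import pandas as pd

# Toy data: video "A" was trending on two days with different view counts.
toy = pd.DataFrame({
    "title": ["A", "A", "B"],
    "channel_title": ["ch1", "ch1", "ch2"],
    "views": [100, 250, 40],
})

# Sort descending so the first row in each duplicate group has the max views.
toy_sorted = toy.sort_values(by="views", ascending=False)

# drop_duplicates returns a new DataFrame, so the result must be assigned.
toy_max = toy_sorted.drop_duplicates(["title", "channel_title"], keep="first")
print(toy_max)
```

Without the sort, keep='first' would simply keep whichever duplicate happens to come first in the file.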

02 Visualization

Let's pick the top 100 channels by total views and visualize them.

# Group by channel title and total the views of each channel with .sum()
df_channel_view_sum = df_sorted_max.groupby('channel_title')[['views']].sum()
df_channel_view_sum

# Sort by total views
df_channel_view = df_channel_view_sum.sort_values(by='views', ascending=False)
df_channel_view

# Take only the top 100
df_channnel_view_top = df_channel_view[:100]
df_channnel_view_top

# channel_title is currently the index, so use .reset_index() to turn it back into a column
df_channnel_view_top = df_channnel_view_top.reset_index()
df_channnel_view_top

# Visualize the selected data
sns.barplot(x='channel_title', y='views', data=df_channnel_view_top);

# Adjust the figure size and draw horizontal bars so the channel names are readable
plt.figure(figsize=(20, 100))
sns.barplot(x='views', y='channel_title', data=df_channnel_view_top);
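The group, sort, slice, and reset_index steps can also be written as a single chain. This is a sketch on made-up data that uses Series.nlargest in place of sort_values plus slicing; it is an alternative phrasing, not the course's code.

```python
import pandas as pd

# Toy data: ch1 has two videos, so its views should be summed.
toy = pd.DataFrame({
    "channel_title": ["ch1", "ch2", "ch1", "ch3"],
    "views": [100, 300, 250, 40],
})

top2 = (
    toy.groupby("channel_title")["views"]  # one group per channel
    .sum()                                 # total views per channel
    .nlargest(2)                           # top N, already sorted descending
    .reset_index()                         # channel_title back to a column
)
print(top2)
```

The resulting frame has the columns channel_title and views, which is the shape sns.barplot expects in the calls above.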

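03 Popular Video Title Analysis

Let's clean the data in the same way and use regular expressions to extract the titles.

df = pd.read_csv('KRvideos.csv')
df.head()

# Keep only the needed columns
df = df[['title', 'views']]
df

# Sort, then remove duplicates
df_sorted = df.sort_values(by='views', ascending=False).drop_duplicates(['title'], keep='first')
df_sorted

To analyze popular video titles, the titles need to be in a consistent form. Let's keep only the Korean parts of each title, using regular expressions.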

df_sorted['title'].values

array(['YouTube Rewind: The Shape of 2017 | #YouTubeRewind',
       "Marvel Studios' Avengers: Infinity War Official Trailer",
       "BTS (방탄소년단) 'FAKE LOVE' Official MV", ...,
       '[홍익인간 인성교육] 7128강 산에 들어가고 싶다',
       '만취 브이로그ㅣ서프라이즈 생일 파티ㅣ실시간 현실 술판ㅣ여자셋 일상ㅣ일상브이로그', '소셜 잠금화면 앱 (달고나)'],

import re
# .apply() -> applies the function to each element of the column
# re.sub(pattern, replacement, string)
# Replace anything that is not a Hangul syllable or whitespace with ''
df_sorted['title_refined'] = df_sorted['title'].apply(lambda x: re.sub(r'[^가-힣\s]', '', x))
df_sorted

# Also strip whitespace, so titles with no Korean at all become empty strings
df_sorted['another_refined'] = df_sorted['title_refined'].apply(lambda x: re.sub('[^가-힣]', '', x))
df_sorted

# Drop the rows whose title has no Korean left
df_sorted = df_sorted[df_sorted['title_refined'].apply(lambda x: re.sub('[^가-힣]', '', x)) !='']
df_sorted
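The effect of the pattern is easier to see on plain strings. The sketch below applies the same character class to three titles from the output above. The helper name keep_korean is hypothetical, and collapsing repeated spaces is an extra step that the original code does not perform.

```python
import re

# Matches every character that is NOT a Hangul syllable or whitespace.
NOT_HANGUL_OR_SPACE = re.compile(r"[^가-힣\s]")

def keep_korean(title):
    """Drop non-Hangul characters, then collapse leftover runs of spaces."""
    cleaned = NOT_HANGUL_OR_SPACE.sub("", title)
    return " ".join(cleaned.split())

titles = [
    "BTS (방탄소년단) 'FAKE LOVE' Official MV",
    "YouTube Rewind: The Shape of 2017 | #YouTubeRewind",
    "소셜 잠금화면 앱 (달고나)",
]
refined = [keep_korean(t) for t in titles]

# Titles with no Korean at all end up as empty strings and are filtered out.
korean_only = [t for t in refined if t]
print(korean_only)
```

Note that the range 가-힣 covers complete syllables only, so standalone jamo such as ㅣ or ㅋ are removed as well.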

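04 Korean Word Extraction

KoNLPy is commonly used to process Korean text, but it has the drawback of not recognizing new words. 'soynlp' is a library that addresses this.
soynlp can learn new words from the data, and extraction relies on two commonly used components:
1. WordExtractor - a statistics-based word extractor (extracts character sequences that appear frequently in a document set)
2. Tokenizer - uses the WordExtractor results to split sentences into words along word boundaries
(LTokenizer - for text with spacing, MaxScoreTokenizer - for text without spacing)
In Korean, the words that carry meaning (nouns/verbs/adjectives/adverbs) usually sit on the left side of an eojeol (a space-delimited unit), giving an L-R structure.

So we only need to extract the meaningful L part.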
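The L-R idea can be illustrated without soynlp. The sketch below is a simplified stand-in, not soynlp's actual implementation: for each space-delimited unit it tries every prefix as a candidate L and keeps the one with the highest score. The function names and the scores are made up for illustration.

```python
def split_l_r(eojeol, scores):
    """Return (L, R), where L is the best-scoring prefix of the eojeol.

    Ties (including the case where no prefix has a score) go to the
    longest prefix, so unknown words are kept whole.
    """
    candidates = [(eojeol[:i], eojeol[i:]) for i in range(1, len(eojeol) + 1)]
    return max(candidates, key=lambda lr: (scores.get(lr[0], 0.0), len(lr[0])))

def tokenize_l(sentence, scores):
    """Keep only the L part of every space-delimited unit."""
    return [split_l_r(e, scores)[0] for e in sentence.split()]

# Made-up scores standing in for what a trained word extractor would provide.
scores = {"데이터": 0.7, "데이": 0.3, "분석": 0.6}

print(split_l_r("데이터를", scores))
print(tokenize_l("데이터를 분석했다", scores))
```

Here the particle 를 and the verb ending 했다 fall into the R part and are dropped, which is the behavior that remove_r=True asks for in LTokenizer.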

Let's install soynlp.

pip install soynlp

Now train it!

# Usage based on examples found online

from soynlp.word import WordExtractor

word_extractor = WordExtractor(min_frequency=100,
    min_cohesion_forward=0.05, 
    min_right_branching_entropy=0.0
)
word_extractor.train(df_sorted['title_refined'].values) # list of str or like
words = word_extractor.extract()
words

# Split each unit into L-R with LTokenizer
from soynlp.tokenizer import LTokenizer
from soynlp.word import WordExtractor
from soynlp.utils import DoublespaceLineCorpus

cohesion_score = {word:score.cohesion_forward for word, score in words.items()}
tokenizer = LTokenizer(scores=cohesion_score)

# remove_r=True drops the R part right away
df_sorted['tokenized'] = df_sorted['title_refined'].apply(lambda x: tokenizer.tokenize(x, remove_r=True))
df_sorted

The result shows how each title is split into tokens after training. The splits are determined by the WordExtractor scores.

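05 Word Cloud

Let's count how often the words left by the tokenizer appear and build a word cloud from them.

# Collect the tokenized words from videos with at least 500,000 views into one list to count frequencies
# Assumption: the original post never defines df_top; it is taken here to be the rows of df_sorted with 500,000+ views
df_top = df_sorted[df_sorted['views'] >= 500000]
words = []
for i in df_top['tokenized'].values:
    # print(i)  each i is the list of tokens for one title
    for k in i:
        # print(k)  take the tokens out one by one and append them to a single flat list
        words.append(k)

words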

Count the frequencies, then build the word cloud!

# Count frequencies
from collections import Counter
count = Counter(words)

count

# Convert the Counter to a plain dictionary before continuing
word_dict = dict(count)
word_dict
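The flattening loop and the counting step can be combined with itertools.chain. A minimal sketch on made-up token lists follows (the tokenized variable stands in for the tokenized column; the words are illustrative only).

```python
from collections import Counter
from itertools import chain

# Stand-in for the per-title token lists.
tokenized = [["먹방", "리뷰"], ["먹방"], ["리뷰", "먹방", "일상"]]

# chain.from_iterable walks every inner list in turn, so no explicit
# nested loop or intermediate flat list is needed.
count = Counter(chain.from_iterable(tokenized))

# most_common(n) returns the n most frequent (word, count) pairs.
print(count.most_common(2))
```

Counter is a dict subclass, so dict(count) gives an equivalent plain dictionary if one is preferred.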

# Word cloud settings
from wordcloud import WordCloud
font_path='NanumSquareB.otf'
wordcloud = WordCloud(font_path=font_path, width=500,height=500, background_color='white').generate_from_frequencies(word_dict)

plt.figure(figsize=(10,10))
plt.imshow(wordcloud)
plt.axis('off');  # the x and y axes are not needed, so turn them off

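Finally, let's remove the words we don't need (탄, 했다, 먹고, etc.).

stopwords={'탄','일','대','이','분','회','온','했다','화'}
for word in stopwords:
    # pop(word, None) avoids a KeyError when a stopword is not in the dictionary
    word_dict.pop(word, None)

--> Then re-run the word cloud.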
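An alternative that leaves word_dict untouched is to build a filtered copy and pass that to generate_from_frequencies. This is a sketch with made-up counts, and remove_stopwords is a hypothetical helper name.

```python
def remove_stopwords(freqs, stopwords):
    """Return a new dict without the stopwords; the input is not modified."""
    return {word: n for word, n in freqs.items() if word not in stopwords}

# Made-up frequencies for illustration.
word_dict = {"먹방": 3, "탄": 5, "했다": 2}
stopwords = {"탄", "했다", "화"}  # "화" is absent from word_dict, which is fine

filtered = remove_stopwords(word_dict, stopwords)
print(filtered)
```

Keeping the original dictionary makes it easy to adjust the stopword set and regenerate the cloud without recounting.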
