BeautifulSoup 03 - Basics of data science tasks (2) - Scraping Wikipedia movie pages

Big Data/BeautifulSoup 2021. 12. 8. 09:06

Based on the lectures by YouTuber 'Keith Galli'

Open the Wikipedia page below and follow along (https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films)

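The goal is to follow each link in the list and grab the infobox from the page it points to.
Note that Wikipedia allows web scraping, but it blocks clients that send too many requests or use it inappropriately. So scrape slowly, a little at a time!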
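
One way to follow the "slowly, a little at a time" advice is to pause between requests. The sketch below is only an illustration: `fetch_politely`, its `delay_seconds` parameter, and the injected `fetch` callable are hypothetical names that do not appear in the lecture, and the one-second default is an assumption rather than a documented Wikipedia limit.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page content.
    The one-second default delay is an assumption, not a Wikipedia rule.
    """
    pages = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages
```

With the setup used in this post, `fetch` could be something like `lambda url: requests.get(url).content`.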

Basic setup

from bs4 import BeautifulSoup as bs
import requests

# Load the webpage
r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")

# Convert to a BeautifulSoup object
soup = bs(r.content, "lxml")

# Print
contents = soup.prettify()
print(contents)

Inspecting the page shows that the tables repeat in a similar pattern.

movies = soup.select(".wikitable.sortable")
movies

[<table class="wikitable sortable" style="width:100%;">
 <tbody><tr>
 <th style="width:1em;">
 </th>
 <th style="width:35%;">Title
 ....
 
 
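 1. soup.find_all did not find the tables.
 2. Using select with the entire class string does not work either, because the trailing part is not matched.
	(wikitable sortable "jquery-tablesorter")
 3. So use select with .class1.class2, leaving out the trailing part.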
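
The difference between the three attempts can be reproduced on a small piece of HTML. This is a sketch under assumptions: the two-table markup is a made-up miniature of the page, and the explanation that `jquery-tablesorter` is added by JavaScript in the browser (so it is absent from the HTML that `requests` downloads) is an inference, not something the lecture states.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page: one table as served, one unrelated table.
html = """
<table class="wikitable sortable" id="films"><tr><td>A</td></tr></table>
<table class="wikitable" id="other"><tr><td>B</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# .class1.class2 matches elements that carry both classes,
# no matter which extra classes are also present.
both = soup.select(".wikitable.sortable")
matched_ids = [table["id"] for table in both]

# Searching for the full class string copied from the browser inspector is an
# exact-string match, so it finds nothing when the last class is missing.
exact = soup.find_all(class_="wikitable sortable jquery-tablesorter")

print(matched_ids)
print(len(exact))
```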

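Next, we need to collect the link for each title.
The tables repeat in the same format, and the same tag names repeat inside them.

- First, grab only the <i> tags (the same pattern repeats inside each large <table> tag, and each <i> tag holds the information we need, including the <a> tag).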

movies = soup.select(".wikitable.sortable i")
movies[0:5]

[<i><a href="/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons" title="Academy Award Review of Walt Disney Cartoons">Academy Award Review of Walt Disney Cartoons</a></i>,
 <i><a href="/wiki/Snow_White_and_the_Seven_Dwarfs_(1937_film)" title="Snow White and the Seven Dwarfs (1937 film)">Snow White and the Seven Dwarfs</a></i>,
 <i><a href="/wiki/Pinocchio_(1940_film)" title="Pinocchio (1940 film)">Pinocchio</a></i>,
 <i><a href="/wiki/Fantasia_(1940_film)" title="Fantasia (1940 film)">Fantasia</a></i>,
 <i><a href="/wiki/The_Reluctant_Dragon_(1941_film)" title="The Reluctant Dragon (1941 film)">The Reluctant Dragon</a></i>]

- Getting the link from a single tag

movies = soup.select(".wikitable.sortable i")
movies[0:5]
movies[0].a['href']

'/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons'

* Indexing a tag with an attribute name returns that attribute's value.
movies[0].a['title']
'Academy Award Review of Walt Disney Cartoons'

→ The results come back as an indexable list, so each entry and its attributes can be read by index.

- Extracting the link and title with a loop over the indexed results

# Load the webpage
r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")

# Convert to a BeautifulSoup object
soup = bs(r.content, "lxml")
movies = soup.select(".wikitable.sortable i")

for index, movie in enumerate(movies):
    relative_path = movie.a['href']
    title = movie.a['title']
    
    print(relative_path)
    print(title)
    break

/wiki/Academy_Award_Review_of_Walt_Disney_Cartoons
Academy Award Review of Walt Disney Cartoons

→ Fetching every title and link on the page raises a 'NoneType' error. Let's fix that.

for index, movie in enumerate(movies):
    
    try:
        relative_path = movie.a['href']
        title = movie.a['title']

    except Exception as e:
        print(movie.get_text())
        print(e)
        
Escape from the Dark
'NoneType' object is not subscriptable
The Omega Connection
'NoneType' object is not subscriptable
Trail of the Panda
'NoneType' object is not subscriptable
Growing Up Wild
....

* Wrapping the lookup in try/except shows exactly which entries raise the error. Worth remembering!

→ Looking at the 'NoneType' cases reveals a common cause: the <i> tag has no title or no link inside it.

movies = soup.select(".wikitable.sortable i a")

* The simple fix is to look inside '.wikitable.sortable' for <i> tags and keep only the <a> tags nested in them.
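
The cause of the error and the effect of the narrower selector can be checked on a tiny example. The two-row table below is hypothetical markup written for illustration; only the selectors come from the post.

```python
from bs4 import BeautifulSoup

# Hypothetical table rows: the second title is italic text with no link.
html = """
<table class="wikitable sortable">
  <tr><td><i><a href="/wiki/Pinocchio_(1940_film)" title="Pinocchio (1940 film)">Pinocchio</a></i></td></tr>
  <tr><td><i>Escape from the Dark</i></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

italics = soup.select(".wikitable.sortable i")
# Tag.a returns the first <a> descendant, or None when there is none,
# and None['href'] is what raises "'NoneType' object is not subscriptable".
missing = [i.get_text() for i in italics if i.a is None]

# Selecting "i a" returns the <a> tags themselves, so unlinked titles drop out.
links = soup.select(".wikitable.sortable i a")
hrefs = [a["href"] for a in links]

print(missing)
print(hrefs)
```

Because the selector now returns <a> tags directly, the attributes are read as `movie['href']` rather than `movie.a['href']`.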

- Second attempt

def get_content_value(row_data):
    if row_data.find("li"):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all("li")]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

def get_info_box(url):
    
    r = requests.get(url)
    soup = bs(r.content, "lxml")
    info_box = soup.find(class_="infobox vevent")
    info_rows = info_box.find_all("tr")
    
    movie_info = {}
    for index, row in enumerate(info_rows):
        if index == 0:
            movie_info['title'] = row.find("th").get_text()

        # Skip the image row
        elif index == 1:
            continue

        # Keep collecting the remaining fields
        else:
            content_key = row.find("th").get_text()
            content_value = get_content_value(row.find("td"))
            movie_info[content_key] = content_value
    
    return movie_info

# Load the webpage
r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")

# Convert to a BeautifulSoup object
soup = bs(r.content, "lxml")
movies = soup.select(".wikitable.sortable i a")

base_path = "https://en.wikipedia.org/"
movie_info_list = []

for index, movie in enumerate(movies):
    if index == 10:
        break

    try:
        relative_path = movie['href']
        full_path = base_path + relative_path
        title = movie['title']
        
        movie_info_list.append(get_info_box(full_path))
        
    except Exception as e:
        print(movie.get_text())
        print(e)

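The infoboxes all share the same layout, so the function from the first lesson is reused as-is.
Because we have to follow each link and fetch its infobox, a function that fetches the infobox is added ('def get_info_box(url)').
The data that comes out of each scraped infobox row goes through 'def get_content_value(row_data)'.
In order of execution: before either function runs, the for loop over the first page collects the links needed to build full_path.
Once a link is collected and 'full_path' is built, that address is passed to get_info_box(url), which scrapes the infobox.
The scraped infobox content is then cleaned up, which gives the result below.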
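
Building full_path by string addition keeps both the trailing slash of the base and the leading slash of the href. As a possible alternative (not what the lecture uses), the standard library's `urllib.parse.urljoin` resolves the relative path against the base. The variable names below are chosen for this sketch only.

```python
from urllib.parse import urljoin

base_url = "https://en.wikipedia.org/"          # trailing slash, as in the post
relative_path = "/wiki/Pinocchio_(1940_film)"   # href value taken from the list page

concatenated = base_url + relative_path    # plain string addition keeps both slashes
joined = urljoin(base_url, relative_path)  # resolves the path against the base

print(concatenated)  # https://en.wikipedia.org//wiki/Pinocchio_(1940_film)
print(joined)        # https://en.wikipedia.org/wiki/Pinocchio_(1940_film)
```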

movie_info_list[0]

{'title': 'Academy Award Review of Walt Disney Cartoons',
 'Productioncompany ': 'Walt Disney Productions',
 'Distributed by': 'RKO Radio Pictures',
 'Release date': ['May 19, 1937 ( 1937-05-19 )'],
 'Running time': '41 minutes (74 minutes 1966 release)',
 'Country': 'United States',
 'Language': 'English',
 'Box office': '$45.472'}

Checking the progress so far, a few movies still raise 'NoneType' errors.

from bs4 import BeautifulSoup as bs
import requests

def get_content_value(row_data):
#     print(row_data)

    if row_data.find('li'):
        return [li.get_text(" ", strip=True).replace("\xa0", " ") for li in row_data.find_all('li')]
    else:
        return row_data.get_text(" ", strip=True).replace("\xa0", " ")

def get_info_box(url):

    r = requests.get(url)

    # Convert to a BeautifulSoup object
    soup = bs(r.content, 'html.parser')
    info_box = soup.find(class_="infobox vevent")
    info_rows = info_box.find_all('tr')

    movie_info = {}

    for idx, row in enumerate(info_rows):
        if idx == 0: # the title is the row at index 0
            movie_info['title'] = row.find('th').get_text(" ", strip=True)
        elif idx == 1: # skip the image row
            continue
        else:
            content_key = row.find('th').get_text(" ", strip=True)
#             print(content_key)
            content_value = get_content_value(row.find('td')) # this is where get_content_value is called
            movie_info[content_key] = content_value

    return movie_info

r = requests.get("https://en.wikipedia.org/wiki/List_of_Walt_Disney_Pictures_films")

# Convert to a BeautifulSoup object
soup = bs(r.content, 'html.parser')

# movies = soup.select('.wikitable.sortable i') still includes the NoneType cases below, so be more specific
movies = soup.select('.wikitable.sortable i a') # i.e. only the <a> tags nested inside <i> tags
# print(len(movies)) # count after excluding the NoneType cases: 510

base_path = "https://en.wikipedia.org/" # set the base URL first

movie_info_list = [] # list that collects the scraped results
for index, movie in enumerate(movies):
    # For debugging
#     if index == 10:
#         break
    try:
        relative_path = movie['href']
        full_path = base_path + relative_path # append the scraped 'href' to build full_path, then pass it to the function
        title = movie['title']
#         relative_path = movie.a['href'] # only the <a> tags inside <i> are selected, so .a is dropped
#         title = movie.a['title']
        # Printing the values above shows entries where relative_path has no 'href' to read; that is why try/except is needed
#         print(relative_path)
#         print(title)
#         print()

        movie_info_list.append(get_info_box(full_path)) # URL passed in here: base_path + relative_path

    except Exception as e: # on an error, execution jumps here and runs the lines below
        print(movie.get_text()) # shows which movie raised the error
        # Some movies take their 'href' from the 'Note' column instead of the table's Title column.
        # Some movies take their 'href' from the Title column, but the target is not a proper link.

        # Even with the final code, a few movies still fail on find/find_all/get_text
        print(e)

It is not perfect, but around 10 errors out of more than 500 entries is not a bad result.
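
The remaining failures plausibly come from pages with no matching infobox, or from infobox rows that lack a <th> or a <td>, since calling a method on the resulting None raises the same kind of error. One possible way to tolerate such rows is sketched below. `parse_info_rows` is a hypothetical helper, not the lecture's code, and it is deliberately simplified: it ignores the title row and does not split <li> lists the way get_content_value does.

```python
from bs4 import BeautifulSoup


def parse_info_rows(info_box):
    """Collect header/value pairs, skipping rows that lack a <th> or a <td>."""
    movie_info = {}
    for row in info_box.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        if header is None or value is None:
            continue  # e.g. title-only or image-only rows
        movie_info[header.get_text(" ", strip=True)] = value.get_text(" ", strip=True)
    return movie_info


# Hypothetical infobox with a title row, an image row, and one regular row.
html = """
<table class="infobox vevent">
  <tr><th>Pinocchio</th></tr>
  <tr><td><img src="poster.jpg"></td></tr>
  <tr><th>Country</th><td>United States</td></tr>
</table>
"""
info_box = BeautifulSoup(html, "html.parser").find(class_="infobox vevent")
result = parse_info_rows(info_box)
print(result)
```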
Next, these dictionaries are saved in JSON format so the work can continue from there.

import json

# Save as JSON
def save_data(title, data):
    with open(title, 'w', encoding='utf-8') as f:
        # json.dump() converts a Python object to JSON and writes it to the file
        # ensure_ascii - if True, non-ASCII characters are escaped as \uXXXX; if False, they are written as-is
        # indent - indentation width (number of spaces)
        json.dump(data, f, ensure_ascii=False, indent=2)


# Load from JSON
def load_data(title):
    with open(title, encoding='utf-8') as f:
        return json.load(f)
        
save_data("Disney_data.json", movie_info_list)
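
The effect of ensure_ascii is easy to see with `json.dumps`, which returns the JSON text as a string instead of writing it to a file. The record below is a made-up example, not data scraped in this post.

```python
import json

# Hypothetical record containing a non-ASCII character.
record = {"title": "Amélie"}

escaped = json.dumps(record)                      # ensure_ascii defaults to True
literal = json.dumps(record, ensure_ascii=False)  # characters written as-is

# Both strings decode back to the same Python object.
same = json.loads(escaped) == json.loads(literal) == record

print(escaped)  # {"title": "Am\u00e9lie"}
print(literal)  # {"title": "Amélie"}
```

Either form loads back to the same data, so ensure_ascii=False mainly makes the saved file easier to read by eye.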
