BeautifulSoup 02 - Code Navigation/Exercise - 2

Big Data/BeautifulSoup 2021. 11. 28. 10:12

Based on the lecture by YouTuber 'Keith Galli'

Getting the src of the images under 'Photos'

# t1 = webpage.select("div.row")
# t2 = webpage.select("div.column")
t3 = webpage.select("div.column img")
t4 = [tt['src'] for tt in t3]
print(t4)

['images/italy/lake_como.jpg', 'images/italy/pontevecchio.jpg', 'images/italy/riomaggiore.jpg']


* The results come back as a list, so pull items out by index or with a for loop.
for i in t4:
    print(i)
    
images/italy/lake_como.jpg
images/italy/pontevecchio.jpg
images/italy/riomaggiore.jpg
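
The same selection can be tried offline. The sketch below assumes a small, made-up HTML snippet in place of the real 'Photos' markup (the third img deliberately has no src) and uses the built-in "html.parser" so it runs without lxml. It shows tag.get("src"), which returns None for a missing attribute where tt['src'] would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the 'Photos' markup; the real page may differ.
html = """
<div class="row">
  <div class="column"><img src="images/italy/lake_como.jpg" alt="Lake Como"></div>
  <div class="column"><img src="images/italy/pontevecchio.jpg"></div>
  <div class="column"><img alt="no src here"></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# .get("src") yields None when the attribute is absent instead of raising KeyError
srcs = [img.get("src") for img in soup.select("div.column img")]

# Keep only the images that actually carry a src
present = [s for s in srcs if s is not None]

print(srcs)
print(present[0])  # index access works because the result is a plain list
```

Filtering out None before using the paths keeps a later download loop from failing on an image tag that has no src.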

Getting the table contents with pandas

import pandas as pd

# 1. Grab the whole table
t1 = webpage.select("table.hockey-stats")[0]

# 2. The header is laid out as <thead> -> <tr> -> <th>; keep only the 'th' tags
t2 = t1.find("thead").find_all("th")

# 3. Take just the text of each 'th'
t3 = [t.string for t in t2]

# 4. Extract the table body rows
t4 = t1.find("tbody").find_all("tr")
l = []
for tr in t4:
    td = tr.find_all('td')
    # strip() removes surrounding spaces and blank characters
    row = [str(cell.get_text()).strip() for cell in td]
    l.append(row)

# 5. Build a table with pandas (pd.DataFrame(rows, columns))
df = pd.DataFrame(l, columns=t3)

print(df.head())


* With pandas you can filter the table and pull out only the rows you want.
print(df.loc[df['Team'] != 'Did not play'])
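
As a self-contained sketch of the table steps above, the example below runs against a tiny made-up table. The class name matches the selector used here, but the column names other than 'Team' and all cell values are assumptions for illustration only. It uses get_text(strip=True) as a shorthand for calling get_text() and then strip().

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature of the hockey-stats table; values are invented.
html = """
<table class="hockey-stats">
  <thead><tr><th>Season</th><th>Team</th><th>GP</th></tr></thead>
  <tbody>
    <tr><td>2014-15</td><td> MIT </td><td>17</td></tr>
    <tr><td>2015-16</td><td>Did not play</td><td></td></tr>
    <tr><td>2016-17</td><td>MIT</td><td>23</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.select("table.hockey-stats")[0]

# Header cells become the column names
headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]

# One inner list per <tr>, one stripped string per <td>
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find("tbody").find_all("tr")
]

df = pd.DataFrame(rows, columns=headers)

# Boolean mask: keep only the seasons that were actually played
played = df.loc[df["Team"] != "Did not play"]
print(played)
```

Every cell arrives as a string, so a numeric column such as the assumed 'GP' would need a conversion (for example with pd.to_numeric) before doing arithmetic on it.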

Getting only the 'Fun Facts' sentences that contain 'is'

import re

# 1. Get every <li> inside the <ul> tag whose class is 'fun-facts'
t1 = webpage.select("ul.fun-facts li")

# 2. Compile the pattern with the 're' library and search each item of 't1' for a matching string
t2 = [t.find(string=re.compile("is")) for t in t1]

# 3. Keep only the matches that contain 'is'. The matched string can be cut off at '... is ',
#    so use .find_parent() to get its parent (the <li> tag) and read the whole sentence from it
t3 = [t.find_parent().get_text() for t in t2 if t]
print(t3)

['Middle name is Ronald', 'Dunkin Donuts coffee is better than Starbucks',
"A favorite book series of mine is Ender's Game", 
'Current video game of choice is Rocket League', 
"The band that I've seen the most times live is the Zac Brown Band"]
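
One thing worth checking with this pattern: re.compile("is") matches the two letters anywhere, including inside words such as "this". The sketch below compares it with a word-boundary pattern using only the standard library. The second and third sentences are made up for illustration; only the first comes from the output above.

```python
import re

# Only the first sentence is from the page output; the others are invented.
sentences = [
    "Middle name is Ronald",
    "Owned this car in high school",  # contains "is" only inside "this"
    "Likes hockey",
]

loose = re.compile("is")         # matches "is" anywhere, even inside another word
strict = re.compile(r"\bis\b")   # matches "is" only as a whole word

loose_hits = [s for s in sentences if loose.search(s)]
strict_hits = [s for s in sentences if strict.search(s)]

print(loose_hits)
print(strict_hits)
```

Passing the stricter pattern to find(string=...) would be one way to keep only sentences that use "is" as a word, assuming that is the intended filter.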

Downloading images

import requests
from bs4 import BeautifulSoup as bs
import re

url = "https://keithgalli.github.io/web-scraping/"
r = requests.get(url + "webpage.html")
r.raise_for_status()

webpage = bs(r.content, "lxml")

# 1. Scrape the image locations
t1 = webpage.select("div.row div.column img")

# 2. Pick the first image by index
t2 = t1[0]['src']
print(t2)

# 3. Build the download url
t3 = url + t2 

img_data = requests.get(t3).content
with open('lake_como.jpg', 'wb') as handler:
    handler.write(img_data)

* Downloading several images at once

# 2. Loop over every image location
for idx, i in enumerate(t1):
    print(i["src"])
    image_url = i["src"]
    # src is a relative path here, so prepend the base url
    if not image_url.startswith("http"):
        image_url = url + image_url

    image_res = requests.get(image_url)
    image_res.raise_for_status()
    
    with open("pic{}.jpg".format(idx+1), "wb") as f:
        f.write(image_res.content)
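
String concatenation (url + image_url) works here because the base url ends with a slash and every src is relative. As a hedged alternative, the standard library's urljoin handles relative and absolute values alike, and the file name can be taken from the url path. The example.com address below is a placeholder, not something found on the page.

```python
import posixpath
from urllib.parse import urljoin, urlparse

base = "https://keithgalli.github.io/web-scraping/"

# A relative src is resolved against the base url
relative = urljoin(base, "images/italy/lake_como.jpg")

# An absolute src is returned unchanged (placeholder address)
absolute = urljoin(base, "https://example.com/pic.jpg")

# Derive a local file name from the last path segment
filename = posixpath.basename(urlparse(relative).path)

print(relative)
print(absolute)
print(filename)
```

Using the original file name in place of "pic{}.jpg" keeps the downloaded files recognizable, at the cost of possible name clashes if two images share a name.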

Getting the contents behind the 'Mystery Message Challenge' links

import requests
from bs4 import BeautifulSoup as bs
import re

url = "https://keithgalli.github.io/web-scraping/"
r = requests.get(url + "webpage.html")
r.raise_for_status()

webpage = bs(r.content, "lxml")

# 1. Get all of the links
t1 = webpage.select("div.block a")

# 2. Pull out the href of each link
t2 = [t['href'] for t in t1]

# 3. Join each href with the base url, open the page, and read the secret word
for t in t2:
    full_url = url + t
    # print(full_url)
    page = requests.get(full_url)
    bs_page = bs(page.content, features="lxml")
    secret_word_element = bs_page.find("p", attrs={"id":"secret-word"})
    secret_word = secret_word_element.string
    print(secret_word)
    
    
Make
sure
to
smash
that
like
button
and
subscribe
!!!
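
The loop above assumes every linked page contains a p tag with id 'secret-word'. find() returns None when nothing matches, so a small guard avoids an AttributeError on pages without it. The sketch below uses made-up page bodies in place of the real linked files and joins the words into one message.

```python
from bs4 import BeautifulSoup

# Hypothetical page bodies standing in for the linked files
pages = [
    '<html><body><p id="secret-word">Make</p></body></html>',
    '<html><body><p>no secret here</p></body></html>',
    '<html><body><p id="secret-word"> sure </p></body></html>',
]

words = []
for html in pages:
    soup = BeautifulSoup(html, "html.parser")
    element = soup.find("p", id="secret-word")
    if element is None:  # find() returns None when no tag matches
        continue
    words.append(element.get_text(strip=True))

# Join the collected words into a single line
message = " ".join(words)
print(message)
```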
