BeautifulSoup 02 - Code Navigation/Exercise - 1

Big Data/BeautifulSoup 2021. 11. 24. 04:21

Based on the lectures of YouTuber 'Keith Galli'

BeautifulSoup basic setup

import requests
from bs4 import BeautifulSoup as bs
import re

r = requests.get("https://keithgalli.github.io/web-scraping/example.html")
r.raise_for_status()

soup = bs(r.content, "lxml")

Code navigation is a shortcut: instead of writing out every find/select call as in the previous post, you step through the tags directly.

a = soup.body.div.h1.string
print(a)

HTML Webpage

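When the tag path is unambiguous (no duplicate tag names along the way), you can reach an element like this without calling find/select at each step. Each step returns the first tag with that name.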
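To see what "unambiguous" means here, the sketch below uses a small inline page rather than the real one. The HTML string and the "html.parser" backend are assumptions made so the example runs without a network call or lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two <div> blocks, used only for illustration.
html = """
<html><body>
  <div><h1>HTML Webpage</h1></div>
  <div><h1>Second heading</h1></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Each attribute step returns the FIRST tag with that name.
first_heading = soup.body.div.h1.string
print(first_heading)  # HTML Webpage

# A tag name that does not exist yields None instead of raising.
missing = soup.body.table
print(missing)  # None

# find_all() is still needed to reach the later matches.
all_headings = [h1.string for h1 in soup.find_all("h1")]
print(all_headings)
```

So dotted navigation is convenient for the first match, but it silently ignores the second <div>.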

Put simply, code navigation depends on telling parents, siblings, and children apart.
Here <body> is the parent. The <div> and the other tags at the same nesting level as it are siblings, and the tags nested one level deeper inside them are children.
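The three relationships can be checked directly on a tag. This is a minimal sketch on assumed markup (the inline HTML and the "html.parser" backend are not from the lecture page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, written without whitespace between tags so that
# .children does not pick up newline text nodes.
html = (
    "<html><body>"
    "<div><h1>HTML Webpage</h1></div>"
    "<h2>A Header</h2>"
    "<p><i>Some italicized text</i></p>"
    "</body></html>"
)

soup = BeautifulSoup(html, "html.parser")
div = soup.body.div

# Parent: the tag that directly contains <div>.
parent_name = div.parent.name

# Siblings: tags at the same level as <div> under <body>.
sibling_names = [tag.name for tag in div.find_next_siblings()]

# Children: nodes nested directly inside <div>.
child_names = [child.name for child in div.children]

print(parent_name)    # body
print(sibling_names)  # ['h2', 'p']
print(child_names)    # ['h1']
```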

In other words, once you can read the tag structure, scraping gets easier with methods like the ones below.
(https://www.crummy.com/software/BeautifulSoup/bs4/doc/) → the documentation lists every available method.

find_parents()
find_next_siblings()
find_previous()
find_all()
...

a = soup.body.find("div").find_next_siblings()
print(a)

[<h2>A Header</h2>, <p><i>Some italicized text</i></p>,
<h2>Another header</h2>, <p id="paragraph-id"><b>Some bold text</b></p>]
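The other methods in the list work the same way, just in different directions. The sketch below tries three of them on an inline page reconstructed from the output above; the exact markup and the "html.parser" backend are assumptions.

```python
from bs4 import BeautifulSoup

# Assumed markup, rebuilt from the sibling list printed above.
html = (
    "<html><body>"
    "<div><h1>HTML Webpage</h1></div>"
    "<h2>A Header</h2>"
    "<p><i>Some italicized text</i></p>"
    "<h2>Another header</h2>"
    '<p id="paragraph-id"><b>Some bold text</b></p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

bold = soup.find("b")

# find_parents(): every ancestor of <b>, innermost first.
ancestors = [tag.name for tag in bold.find_parents()]

# find_previous(): the closest earlier match in document order.
previous_h2 = bold.find_previous("h2").string

# find_next_sibling(): only the first following sibling that matches.
next_p = soup.body.div.find_next_sibling("p").get_text()

print(ancestors)
print(previous_h2)  # Another header
print(next_p)       # Some italicized text
```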

Exercise!

Basic setup

import requests
from bs4 import BeautifulSoup as bs
import re

r = requests.get("https://keithgalli.github.io/web-scraping/webpage.html")
r.raise_for_status()

webpage = bs(r.content, "lxml")

Getting all the links

#SELECT

* In select(), '#' matches a specific id and '.' matches a specific class.
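A quick way to check the '#' and '.' rules is to run them on a tiny document. The markup below is a hypothetical excerpt (ids and classes borrowed from the examples in this post), parsed with "html.parser" so it runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt for illustration only.
html = """
<html><body>
  <p id="paragraph-id"><b>Some bold text</b></p>
  <ul class="socials">
    <li class="social instagram">
      <a href="https://www.instagram.com/keithgalli/">Instagram</a>
    </li>
  </ul>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#paragraph-id")         # '#' -> id
by_class = soup.select(".socials")           # '.' -> class, any tag name
by_tag_and_class = soup.select("ul.socials") # tag name plus class
nested = soup.select("ul.socials a")         # <a> anywhere inside that <ul>

print(by_id[0].name)      # p
print(by_class[0].name)   # ul
print(nested[0]["href"])  # https://www.instagram.com/keithgalli/
```

select() always returns a list, even when only one element matches.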

1. 
links = webpage.select("a")

[<a href="https://www.youtube.com/kgmit">youtube.com/kgmit</a>, 
<a href="#footer"><sup>1</sup></a>, <a href="https://www.in
...
→ Returns every '<a>' tag on the page.


2. 
links = webpage.select("ul.socials")
print(links)

[<ul class="socials"><li class="social instagram"><b>Instagram: </b>
<a href="https://www.instagram.com/keithgalli/">https://www.instagram.com/keithgalli/</a>
</li><li class="social twitter"><b>Twitter:
....

→ The links sit inside the <ul> tag whose class is 'socials'.
  But this also returns the <ul> and <li> tags themselves.
  
  
3.
links = webpage.select("ul.socials a")
print(links)

→ Returns the '<a>' tags inside the tag with that class name. However, each result is still the whole '<a>' tag, not just the URL.



4.
links = webpage.select("ul.socials a")
what_we_need = [link['href'] for link in links]
print(what_we_need)

['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 
'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/@keithgalli']

→ Find the tag with the specific class, then the '<a>' tags under it. Because there are several,
  loop over them (here with a list comprehension) and, for each 'link',
  pull out only the part you need ('href').
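One thing to watch with link['href'] is that it raises KeyError when an '<a>' tag has no href. The sketch below shows two ways around that on a made-up list; the second <a> without an href is a hypothetical case and does not come from the lecture page.

```python
from bs4 import BeautifulSoup

# Hypothetical list where one <a> has no href attribute.
html = """
<ul class="socials">
  <li><a href="https://twitter.com/keithgalli">Twitter</a></li>
  <li><a name="no-link">placeholder without href</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.select("ul.socials a")

# link["href"] would raise KeyError on the second <a>; .get() returns None.
hrefs_or_none = [link.get("href") for link in links]

# Alternatively, let the CSS selector keep only <a> tags that have an href.
hrefs_only = [link["href"] for link in soup.select("ul.socials a[href]")]

print(hrefs_or_none)  # ['https://twitter.com/keithgalli', None]
print(hrefs_only)     # ['https://twitter.com/keithgalli']
```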

#FIND

1.
links = webpage.find("a")
print(links)

<a href="https://www.youtube.com/kgmit">youtube.com/kgmit</a>

→ Unlike select(), find() returns only the first '<a>' tag on the page.
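The difference in return types is worth spelling out. A minimal sketch on assumed inline markup (two links taken from the output shown earlier, parsed with "html.parser"):

```python
from bs4 import BeautifulSoup

# Assumed excerpt containing the first two links from the page.
html = """
<html><body>
  <a href="https://www.youtube.com/kgmit">youtube.com/kgmit</a>
  <a href="#footer"><sup>1</sup></a>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")       # a single Tag (the first match)
same_link = soup.select_one("a")  # CSS counterpart of find()
all_links = soup.find_all("a")    # list-like result with every match
no_match = soup.find("table")     # None when nothing matches

# Guard against None before indexing into the result.
table_id = no_match["id"] if no_match is not None else None

print(first_link["href"])  # https://www.youtube.com/kgmit
print(len(all_links))      # 2
print(table_id)            # None
```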


2.
links = webpage.find("ul", attrs={"class":"socials"})
print(links)

<ul class="socials">
<li class="social instagram">
...

→ Narrowing down to the part you need first makes the rest easier.


3.
ulist = webpage.find("ul", attrs={"class":"socials"})
links = ulist.find_all("a")
what_we_need = [link['href'] for link in links]
print(what_we_need)

['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 
'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/@keithgalli']

→ Same idea: use find() to locate the specific section, then extract only the '<a>' tags from it. Since several tags come back,
  loop over them and pull out only the part you need ('href').
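As a side note, bs4 also accepts the class as a keyword argument, class_ (with a trailing underscore because class is a Python keyword). The sketch below compares it with the attrs form on an assumed two-link excerpt.

```python
from bs4 import BeautifulSoup

# Assumed excerpt of the socials list, for illustration only.
html = """
<ul class="socials">
  <li class="social instagram"><a href="https://www.instagram.com/keithgalli/">Instagram</a></li>
  <li class="social twitter"><a href="https://twitter.com/keithgalli">Twitter</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Two spellings of the same search.
ulist_attrs = soup.find("ul", attrs={"class": "socials"})
ulist_kw = soup.find("ul", class_="socials")

# Both point at the same <ul>, so the follow-up search is identical.
hrefs = [a["href"] for a in ulist_kw.find_all("a")]
print(hrefs)
```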

#SELECT2
- The <li> tags that hold the links all have classes of the form 'social ...'. Let's take advantage of that.

1. 
links = webpage.select("li.social a") 
print(links)
[<a href="https://www.instagram.com/keithgalli/"> https://www.instagram.com/keithgalli/</a>,
<a hr ... 

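→ Returns the '<a>' tags inside every '<li>' that has the class 'social' (for example class="social instagram", which carries the two classes 'social' and 'instagram').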
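The distinction between "has the class social" and "class starts with social" can be checked with a deliberately tricky case. In the sketch below the 'socialite' item and its example.com URL are hypothetical and exist only to show the difference.

```python
from bs4 import BeautifulSoup

# The third <li> is a made-up item whose class merely begins with "social".
html = """
<ul class="socials">
  <li class="social instagram"><a href="https://www.instagram.com/keithgalli/">Instagram</a></li>
  <li class="social twitter"><a href="https://twitter.com/keithgalli">Twitter</a></li>
  <li class="socialite"><a href="https://example.com/other">Other</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# "li.social" matches the whole class token "social", so "socialite" is skipped.
token_match = [a["href"] for a in soup.select("li.social a")]

# An attribute selector does a literal prefix test on the class string.
prefix_match = [a["href"] for a in soup.select('li[class^="social"] a')]

print(len(token_match))   # 2
print(len(prefix_match))  # 3
```

On the lecture page both selectors would presumably give the same four links; they only differ when a class like 'socialite' shows up.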

2. 
links = webpage.select("li.social a")
what_we_need = [link['href'] for link in links] 
print(what_we_need) 

['https://www.instagram.com/keithgalli/', 'https://twitter.com/keithgalli', 
'https://www.linkedin.com/in/keithgalli/', 'https://www.tiktok.com/@keithgalli'] 

→ Running the same loop gives the result. In other words, when the target tags share a class pattern, you can find them this way too.

Getting the 'Fun Facts'

try_1 = webpage.find("ul", attrs={"class":"fun-facts"})
try_2 = try_1.find_all("li")
for i in try_2:
    print(i.get_text())
    
Owned my dream car in high school 1
Middle name is Ronald
Never had been on a plane until college
Dunkin Donuts coffee is better than Starbucks
A favorite book series of mine is Ender's Game
Current video game of choice is Rocket League
The band that I've seen the most times live is the Zac Brown Band

* Without get_text(), the output includes the '<li>' tags as well.
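The difference is easy to see side by side. This sketch uses a two-item excerpt; the footnote markup in the first item and the padded whitespace in the second are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Assumed excerpt of the fun-facts list.
html = """
<ul class="fun-facts">
  <li>Owned my dream car in high school <a href="#footer"><sup>1</sup></a></li>
  <li>  Middle name is Ronald  </li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
items = soup.find("ul", attrs={"class": "fun-facts"}).find_all("li")

as_markup = [str(li) for li in items]                   # tags included
as_text = [li.get_text() for li in items]               # text only, whitespace kept
as_stripped = [li.get_text(strip=True) for li in items] # text only, each piece trimmed

print(as_markup[1])    # <li>  Middle name is Ronald  </li>
print(as_text[0])      # Owned my dream car in high school 1
print(as_stripped[1])  # Middle name is Ronald
```

Note that strip=True trims every text piece before joining them, so in the first item the space before the footnote number disappears as well.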
