BeautifulSoup 01 - find-find_all/select

Big Data/BeautifulSoup 2021. 11. 19. 01:40

Based on a course by the YouTuber 'Keith Galli'

01 A quick look at a website

Almost any website you visit is built from three basic components → HTML/CSS/JS
In Google Chrome, the Inspect tool makes it easy to see how a page is put together.

Let's use the libraries directly to look at the components of a website

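1. Install Python and Visual Studio Code

2. In the VS Code terminal, install requests/beautifulsoup4/lxml with 'pip install..'

3. If the install fails in VS Code, open a cmd window (Windows key + R, then type cmd), move to the Python installation path, and install directly from there

4. If the Python path contains a Scripts folder, move into Scripts in cmd and install from there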
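After the install steps, it can help to confirm that all three packages are visible to the interpreter you are actually running. The following is a minimal sketch using only the standard library; the helper name `check_installed` and the `REQUIRED` mapping are made up for this illustration.

```python
import importlib.util

# pip names and import names differ: beautifulsoup4 is imported as bs4.
REQUIRED = {"requests": "requests", "beautifulsoup4": "bs4", "lxml": "lxml"}


def check_installed(packages=REQUIRED):
    """Return {pip_name: True/False} depending on whether the module can be found."""
    return {
        pip_name: importlib.util.find_spec(module_name) is not None
        for pip_name, module_name in packages.items()
    }


status = check_installed()
for pip_name, found in status.items():
    hint = "installed" if found else f"missing, try: python -m pip install {pip_name}"
    print(f"{pip_name}: {hint}")
```

Running pip as `python -m pip install ...` installs into the same interpreter that runs the script, which usually avoids having to change into the Scripts folder by hand.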

import requests
from bs4 import BeautifulSoup as bs

# Fetch the web page
r = requests.get("https://keithgalli.github.io/web-scraping/example.html")

# Convert the response into a BeautifulSoup object
soup = bs(r.content, "lxml")

print(soup.prettify())

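.get() → a requests function that sends a GET request and returns the response for the page
bs() → parses the response content into a BeautifulSoup object
.prettify() → formats the parsed content with indentation so it is easier to read when printed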
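The three calls above can be wrapped into one small helper. This is a sketch, not part of the original course: the function name `fetch_soup`, the 10-second timeout, and the status check are assumptions added for illustration. The offline part uses Python's built-in "html.parser" so it runs even without lxml.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it; raises requests.HTTPError on 4xx/5xx responses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return BeautifulSoup(response.content, "lxml")


# BeautifulSoup accepts any HTML string, so prettify() can be tried offline.
snippet = "<div><h2>A Header</h2><p>Some text</p></div>"
soup = BeautifulSoup(snippet, "html.parser")
pretty = soup.prettify()
print(pretty)  # one tag or string per line, indented by nesting depth
```

Usage with the page from this post would be `soup = fetch_soup("https://keithgalli.github.io/web-scraping/example.html")`.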

02 find/find_all

Basic BeautifulSoup setup

import requests
from bs4 import BeautifulSoup as bs

# Fetch the web page
r = requests.get("https://keithgalli.github.io/web-scraping/example.html")

#Convert to beautifulsoup object
soup = bs(r.content, "lxml")

Return the first matching tag (find)

find_headers = soup.find("h2")
print(find_headers)

<h2>A Header</h2>

Return all matching tags as a list (find_all)

paragraphs = soup.find_all("p")
print(paragraphs)

[<p>Link to more interesting example: 
<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>,
<p><i>Some italicized text</i></p>, 
<p id="paragraph-id"><b>Some bold text</b></p>]
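One difference between the two calls shows up when nothing matches: find returns None, while find_all returns an empty list. The sketch below uses a hand-written stand-in for the example page (an assumption, not the page's actual source) and the built-in "html.parser" so it runs offline.

```python
from bs4 import BeautifulSoup

# Stand-in HTML modeled on the outputs shown in this post.
HTML = """
<html><body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")

missing = soup.find("table")        # no <table> in the document, so this is None
no_tables = soup.find_all("table")  # an empty list rather than None

# limit stops the search after the given number of matches.
first_two = soup.find_all("p", limit=2)

# A list of names matches any of them, in document order.
heading_names = [tag.name for tag in soup.find_all(["h1", "h2"])]

print(missing, len(no_tables), len(first_two), heading_names)
```

Because find can return None, it is worth checking the result before calling further methods on it.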

Find tags by a specific attribute

attrs_p = soup.find_all("p", attrs={"id":"paragraph-id"})
print(attrs_p)

[<p id="paragraph-id"><b>Some bold text</b></p>]

attrs_div = soup.find_all("div", attrs={"align":"middle"})
print(attrs_div)

[<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>]
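The attrs dictionary has a shorter keyword form that behaves the same way. A small sketch follows; the HTML is a hand-written stand-in, and the `class="note"` paragraph is a hypothetical addition that does not exist on the example page.

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
<div align="middle"><h1>HTML Webpage</h1></div>
<p><i>Some italicized text</i></p>
<p id="paragraph-id"><b>Some bold text</b></p>
<p class="note">A hypothetical paragraph with a class</p>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")

# Dictionary form and keyword form find the same tag.
by_dict = soup.find_all("p", attrs={"id": "paragraph-id"})
by_kwarg = soup.find_all("p", id="paragraph-id")

# True matches any <p> that has the attribute, whatever its value.
with_any_id = soup.find_all("p", id=True)

# "class" is a reserved word in Python, so the keyword is spelled class_.
by_class = soup.find_all("p", class_="note")

print(by_dict, by_kwarg, with_any_id, by_class, sep="\n")
```

The dictionary form is still needed for attribute names that are not valid Python keyword arguments, such as names containing a hyphen.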

Narrow the search area, then find content
```
body = soup.find("body")
div = body.find("div")
header = div.find("h1")
print(header)

<h1>HTML Webpage</h1>
```
A very useful feature. The larger the page, the more it helps to narrow the search to one section first.
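To see what narrowing the search buys you, compare a search over the whole document with a search inside one tag. This is a sketch on a hand-written stand-in for the example page; the helper `find_in` is a made-up name that guards against a missing parent.

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")

all_p = soup.find_all("p")        # every <p> in the document
div = soup.find("div")
div_p = div.find_all("p")         # only the <p> tags inside the <div>


def find_in(parent, name):
    """Return parent.find(name), or None when the parent itself was not found."""
    return parent.find(name) if parent is not None else None


# Chaining .find() calls raises AttributeError if an earlier step returned None;
# the helper returns None instead.
header = find_in(soup.find("div"), "h1")
nothing = find_in(soup.find("section"), "h1")

print(len(all_p), len(div_p), header, nothing)
```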

Find tags that contain a specific string

import re
string = soup.find_all("p", string=re.compile("Some"))
print(string)

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]

headers = soup.find_all("h2", string=re.compile("(H|h)eader"))
print(headers)

[<h2>A Header</h2>, <h2>Another header</h2>]
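The `(H|h)eader` pattern can also be written with a case-insensitive flag, and a plain string can be passed when an exact match is wanted. A sketch on a hand-written stand-in for the example page, parsed with the built-in "html.parser":

```python
import re

from bs4 import BeautifulSoup

HTML = """
<html><body>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")

# re.IGNORECASE matches "Header" and "header" without spelling out both cases.
any_case = soup.find_all("h2", string=re.compile("header", re.IGNORECASE))

# A plain string must equal the tag's whole text.
exact = soup.find_all("h2", string="A Header")

# A compiled pattern only has to match somewhere in the text; ^ anchors it to the start.
starts_with = soup.find_all("h2", string=re.compile(r"^Another"))

print(any_case, exact, starts_with, sep="\n")
```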

03 select - specialized for finding elements with CSS selectors

For a full reference on CSS selectors, see https://www.w3schools.com/cssref/css_selectors.asp

The difference between find/find_all and select
Locating the same "p" tag with both find and select shows why select is worth using

# From div down to p, using find
body = soup.find("body")
div = body.find("div")
p = div.find("p")
print(p)


# From div down to p, using select
p_select = soup.select("div p")
print(p_select)

* Both approaches find the same <p> tag. find returns the tag itself, while select returns a list:
[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]

→ With find/find_all, drilling down to a specific element takes several lines of code. With select, a single CSS selector gets there directly.

※ With select, '#' targets a tag by its id and '.' targets a tag by its class.
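A short sketch of the '#' and '.' notation. The example page in this post has no class attributes, so the two `class="note"` paragraphs below are hypothetical additions to a hand-written stand-in.

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
<p id="paragraph-id"><b>Some bold text</b></p>
<p class="note">First note</p>
<p class="note important">Second note</p>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")

by_id = soup.select("#paragraph-id")          # any tag with this id
by_tag_and_id = soup.select("p#paragraph-id") # a <p> with this id
by_class = soup.select("p.note")              # every <p> whose classes include "note"
both_classes = soup.select("p.note.important")

# select_one returns the first match itself (or None) instead of a list.
first_note = soup.select_one("p.note")
no_match = soup.select_one("p.missing")

print(by_id, by_class, both_classes, first_note, no_match, sep="\n")
```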

Select elements that come after a given element

after_p = soup.select("h2 ~ p")
print(after_p)

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]

bold_text = soup.select("p#paragraph-id b")
print(bold_text)

[<b>Some bold text</b>]
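The `~` in "h2 ~ p" matches every later sibling, which is easy to confuse with `+`, which matches only the sibling immediately after. The sketch below uses a small made-up HTML fragment (not the example page) where the two give different results.

```python
from bs4 import BeautifulSoup

HTML = """
<div>
<h2>Title</h2>
<p>first</p>
<p>second</p>
<span>not a paragraph</span>
<p>third</p>
</div>
"""
soup = BeautifulSoup(HTML, "html.parser")

# "h2 ~ p": every <p> that is a later sibling of an <h2>.
later_siblings = [p.get_text() for p in soup.select("h2 ~ p")]

# "h2 + p": only a <p> that comes directly after an <h2>.
next_sibling = [p.get_text() for p in soup.select("h2 + p")]

print(later_siblings)
print(next_sibling)
```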

Get a tag nested inside another tag
Get the <p> tags that are direct children of the <body> tag.

tag_p = soup.select("body > p")
print(tag_p)

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]


tag_p = soup.select("body > p")
for i in tag_p:
    print(i.select("i"))
    
[<i>Some italicized text</i>]
[]
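The `>` matters here: "body > p" matches only direct children, while "body p" (with a space) matches a <p> at any depth. A sketch on a hand-written stand-in for the example page, parsed with the built-in "html.parser":

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")

descendants = soup.select("body p")   # includes the <p> nested inside the <div>
children = soup.select("body > p")    # only <p> tags directly under <body>

# The loop over tag_p can also be folded into the selector itself.
italics = [i.get_text() for i in soup.select("body > p > i")]

print(len(descendants), len(children), italics)
```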

Find a specific tag nested inside several tags

try_1 = soup.select("body > div > p > a")
print(try_1)

[<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a>]

* You can also find the <div> tag by one of its attributes (here align, which is an attribute rather than a class).

try_2 = soup.select("[align=middle]")
print(try_2)

[<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>      
</div>]
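Attribute selectors in square brackets have a few more forms than the exact match used above. This sketch runs on a hand-written stand-in for the example page; the selector variants are standard CSS, not something taken from the course.

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<p id="paragraph-id"><b>Some bold text</b></p>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")

exact = soup.select('div[align="middle"]')       # attribute equals the value
has_href = soup.select("a[href]")                # attribute is present
https_links = soup.select('a[href^="https://"]') # value starts with
html_links = soup.select('a[href$=".html"]')     # value ends with

print(len(exact), len(has_href), len(https_links), len(html_links))
```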

Get the string inside a tag

find_1 = soup.find("h2")
print(find_1.string)

A Header

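* A caveat when using .string

If you try to print the string of the <body> → <div> tag, you get 'None'. The <div> contains several tags, each holding its own string, so .string cannot pick a single one. In other words, when printing a string you have to be explicit about which tag's string you want.

In that situation, when a tag has many child tags, .get_text() is more useful than .string (which is an attribute, not a method).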
```
try_3 = soup.find("div")
print(try_3.get_text())

HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
```
As long as you specify the tag, it returns all the text inside that tag.
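The difference between .string and .get_text() can be checked side by side. A sketch on a hand-written stand-in for the example page; the separator and strip arguments are optional extras not used in the original post.

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")

h2 = soup.find("h2")
div = soup.find("div")

single = h2.string    # the <h2> holds exactly one string
ambiguous = div.string  # the <div> has several children, so this is None

# get_text() joins every string under the tag; a separator plus strip=True
# removes the newlines that come from the page's own formatting.
compact = div.get_text(" ", strip=True)

# stripped_strings yields the same pieces one at a time.
pieces = list(div.stripped_strings)

print(single, ambiguous, compact, pieces, sep="\n")
```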

Get a specific <a> tag and its link
```
link = soup.find("a")
print(link['href'])

https://keithgalli.github.io/web-scraping/webpage.html
```
Pick the <a> tag and index it with the name of the attribute that holds the link ('href') to get the URL.

Get a tag's id value

paragraphs = soup.select("p#paragraph-id")
print(paragraphs[0]['id'])

paragraph-id

.select() returns a list, as shown below.
'[<p id="paragraph-id"><b>Some bold text</b></p>]'
The list holds a single item at index 0, so index that item with the key 'id'.
To get the <b> inside this <p> tag, use the following:
'try_4 = soup.select("p#paragraph-id > b")'
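Indexing a tag with `['href']` or `['id']` raises KeyError when the attribute is missing, so a few safer variants are sketched here. The HTML is a hand-written stand-in for the example page, parsed with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

HTML = """
<html><body>
<div align="middle">
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<p id="paragraph-id"><b>Some bold text</b></p>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")

link = soup.find("a")
href = link["href"]          # raises KeyError if the attribute is absent
target = link.get("target")  # .get() returns None instead of raising
all_attrs = link.attrs       # every attribute of the tag as a dict

# select_one avoids the [0] index that select() needs.
paragraph = soup.select_one("p#paragraph-id")
paragraph_id = paragraph["id"]
bold_text = soup.select_one("p#paragraph-id > b").get_text()

# href=True keeps only <a> tags that actually have an href.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(href, target, all_attrs, paragraph_id, bold_text, hrefs, sep="\n")
```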
