[Python] Fetching a URL with a Library Found Online


sagesse2021 2021. 11. 20. 14:46

Fetching the HTML of the indeed website

1. Copy the requests example code from GitHub
2. On repl.it, search Packages for the requests library and install it

import requests

indeed_result = requests.get("https://www.indeed.com/jobs?q=python&limit=50&radius=25&start=950")

print(indeed_result.text)  # print the fetched HTML

3. Paste the code and run it to load the page's entire HTML
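The query string in the URL above carries the search settings (q, limit, radius, start). As a rough sketch, the same URL can be assembled from a dictionary with the standard library's urllib.parse.urlencode. The helper name build_search_url and its default values are assumptions made for illustration, not part of the original code.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com/jobs"


def build_search_url(query, limit=50, radius=25, start=0):
    """Assemble the search URL from its query parameters (hypothetical helper)."""
    params = {"q": query, "limit": limit, "radius": radius, "start": start}
    # urlencode keeps the dictionary order and escapes special characters
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url("python", start=950))
```

requests.get also accepts such a dictionary through its params argument, so the string does not have to be built by hand.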

To read the pagination (the page numbers), use BeautifulSoup, a screen-scraping library.
BeautifulSoup: a library that is useful for extracting information from HTML

4. Install beautifulsoup4 from the repl.it Packages panel

Extracting information from the fetched HTML (page numbers)

Open the documentation on the BeautifulSoup site and copy the example code

import requests
from bs4 import BeautifulSoup  # import the BeautifulSoup class
indeed_result = requests.get("https://www.indeed.com/jobs?q=python&limit=50&radius=25&start=950")

indeed_soup = BeautifulSoup(indeed_result.text, "html.parser")

print(indeed_soup)

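soup.p : the first <p> tag in the document (p stands for paragraph)
soup.p['class'] : the class attribute of that <p> tag
soup.find_all('a') : finds every anchor (<a>) element and returns them all as a list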
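The three lookups above can be tried on a small piece of HTML without touching the network. This is only a sketch: SAMPLE_HTML is made up for the example and has nothing to do with the real indeed page.

```python
from bs4 import BeautifulSoup

# A tiny made-up HTML snippet, used only for illustration
SAMPLE_HTML = """
<p class="intro">Hello</p>
<p>Second paragraph</p>
<a href="/first">First</a>
<a href="/second">Second</a>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_p = soup.p                 # the first <p> tag in the document
p_classes = soup.p["class"]      # its class attribute, returned as a list
hrefs = [a["href"] for a in soup.find_all("a")]  # href of every <a> tag

print(first_p.string, p_classes, hrefs)
```

Note that class can hold several values, so BeautifulSoup returns it as a list rather than a single string.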

On the indeed site, right-click and choose Inspect, then check the a links inside the element whose class is pagination

Find the div whose class name is pagination
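Here is a minimal sketch of that lookup, run against made-up markup that only imitates the pagination block (the real page may be structured differently). The function name parse_last_page and the fallback value of 1 are assumptions for this example. It also guards against find() returning None, which happens when the div is missing.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the pagination block; the real page may differ
SAMPLE_HTML = """
<div class="pagination">
  <a href="/jobs?start=0">1</a>
  <a href="/jobs?start=50">2</a>
  <a href="/jobs?start=100">3</a>
  <a href="/jobs?start=50">Next</a>
</div>
"""


def parse_last_page(html):
    """Return the largest page number in the pagination block, or 1 if absent."""
    soup = BeautifulSoup(html, "html.parser")
    pagination = soup.find("div", {"class": "pagination"})
    if pagination is None:  # find() returns None when nothing matches
        return 1
    numbers = [
        int(text)
        for text in (a.get_text(strip=True) for a in pagination.find_all("a"))
        if text.isdigit()  # skips non-numeric links such as "Next"
    ]
    return max(numbers) if numbers else 1


print(parse_last_page(SAMPLE_HTML))
```

Filtering with isdigit() is one alternative to slicing off the last link, and it keeps working if the "Next" link is not there.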

<main.py>
from indeed import get_jobs as get_indeed_jobs

indeed_jobs = get_indeed_jobs()
print(indeed_jobs)

<indeed.py>
import requests
from bs4 import BeautifulSoup  # import the BeautifulSoup class

LIMIT = 50
URL = f"https://www.indeed.com/jobs?q=python&limit={LIMIT}"


def get_last_page():
    result = requests.get(URL)
    soup = BeautifulSoup(result.text, "html.parser")
    pagination = soup.find("div", {"class": "pagination"})

    links = pagination.find_all('a')
    pages = []

    for link in links[:-1]:  # slice off the last element so that "Next" is skipped
        pages.append(int(link.string))

    max_page = pages[-1]
    return max_page


def extract_job(html):
    title = html.find("h2", {"class": "title"}).find("a")["title"]  # job title
    company = html.find("span", {"class": "company"})  # company name
    company_anchor = company.find("a")
    if company_anchor is not None:
        company = str(company_anchor.string)
    else:
        company = str(company.string)
    company = company.strip()
    location = html.find("div", {"class": "recJobLoc"})["data-rc-loc"]
    job_id = html["data-jk"]
    print(job_id)
    return {
        'title': title,
        'company': company,
        'location': location,
        "link": f"https://www.indeed.com/viewjob?jk={job_id}"
    }


def extract_jobs(last_page):
    jobs = []
    for page in range(last_page):
        print(f"Scraping page {page}")
        response = requests.get(f"{URL}&start={page*LIMIT}")
        soup = BeautifulSoup(response.text, "html.parser")
        # find_all returns a list of every match; find returns only the first match
        cards = soup.find_all("div", {"class": "jobsearch-SerpJobCard"})
        for card in cards:
            jobs.append(extract_job(card))
    return jobs  # return only after every page has been scraped

def get_jobs():
  last_page = get_last_page()
  jobs = extract_jobs(last_page)
  return jobs
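The start value in each request is just the page index multiplied by LIMIT. The sketch below shows that arithmetic on its own; page_offsets is a hypothetical helper that is not part of indeed.py.

```python
LIMIT = 50  # results per page, the same value used in indeed.py


def page_offsets(last_page, limit=LIMIT):
    """Return the start value for each page (hypothetical helper)."""
    return [page * limit for page in range(last_page)]


print(page_offsets(3))  # [0, 50, 100]
```

With 20 pages the last offset is 950, which matches the start=950 in the URL used at the beginning of this post.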
