Collecting Naver News in Notion (Using the notion-py Library)

Genie.Choi 2021. 8. 6. 07:46

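Do you read a lot of news articles online?

With more people investing in stocks, Bitcoin, and other areas these days, quite a few of you probably look up specific articles for investment information.

(I only have pocket change, but I'm thinking about investing here and there myself 🙂)

Looking up keywords one by one and clipping the related articles is quite a hassle.

In that case, the productivity tool Notion and the unofficial Python library notion-py make it easy to scrape articles.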

Finished news crawling page built with the notion-py library

This post shows how to automate scraping news into a news list like the one above.

In this tutorial we crawl the Naver news search page. (Note: the crawler will need updating if the page's HTML attributes change later.)

First, click + add a page in Notion to create a new page. I named mine news crawling.

Then create a table-inline block and name its columns. (title / crawlingdate / keyword / url)

github.com/jamalex/notion-py

We will use a library called notion-py. To use it, you need a url and a token value.

For the url, open Notion in a web browser and copy the link at the top of the page you want to automate.

For the token, open Chrome DevTools (F12) -> Application -> Cookies and copy the value of token_v2.
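Since token_v2 works like a login credential, one option is to keep it out of the script itself. Here is a minimal sketch, assuming two hypothetical environment variable names, NOTION_TOKEN_V2 and NOTION_PAGE_URL. Neither name comes from notion-py; they are made up for this example.

```python
import os


def load_notion_settings(env=None):
    """Return (token, url) read from environment-style variables.

    NOTION_TOKEN_V2 and NOTION_PAGE_URL are hypothetical names chosen for
    this sketch; notion-py itself does not read them.
    """
    env = os.environ if env is None else env
    token = env.get("NOTION_TOKEN_V2", "").strip()
    url = env.get("NOTION_PAGE_URL", "").strip()
    if not token or not url:
        raise ValueError("set NOTION_TOKEN_V2 and NOTION_PAGE_URL first")
    return token, url
```

The returned pair can then be used wherever the token and url values are needed, instead of pasting them into the code.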

The library code is used for the following steps.

- Create a Notion client with the token

- Select a specific page with the url

- Use a custom function, news(), to collect the crawled data into a list

- Write the data into the page one row at a time, matching each column

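from notion.client import NotionClient

token = "your token_v2 value"
url = "url"  # get_collection_view expects the link to the table view

# Create the client and fetch the table view
client = NotionClient(token_v2=token)
page = client.get_collection_view(url)

# news() is the crawler function written later in this post;
# a different variable name keeps the function from being overwritten
news_items = news()

for onenews in news_items:
    row = page.collection.add_row()
    row.title = onenews['기사 제목']
    row.crawlingdate = onenews['크롤링 날짜']
    row.keyword = onenews['키워드']
    row.url = onenews['url']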
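Because add_row() always creates a new row, running the script twice on the same day would store the same articles again. The sketch below shows one way to skip articles that are already saved. The helper filter_new_items is a hypothetical name, not part of notion-py, and it assumes every item has a 'url' key as in the loop above.

```python
def filter_new_items(items, existing_urls):
    """Return the items whose 'url' is not already stored, keeping order."""
    seen = set(existing_urls)
    fresh = []
    for item in items:
        if item["url"] in seen:
            continue  # already stored, or repeated within this batch
        seen.add(item["url"])
        fresh.append(item)
    return fresh


sample = [{"url": "a"}, {"url": "b"}, {"url": "a"}, {"url": "c"}]
print(filter_new_items(sample, existing_urls=["b"]))
# → [{'url': 'a'}, {'url': 'c'}]
```

With notion-py, the existing URLs could be gathered with something like {row.url for row in page.collection.get_rows()}, assuming the column is named url as in this tutorial.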

Now let's write the news() function that crawls Naver articles.

It searches for a single keyword, and the date range is set to today only.

Searching for articles on Naver News produces a URL of the following form.

https://search.naver.com/search.naver?where=news&query=%EC%A3%BC%EC%8B%9D&sm=tab_opt&sort=0&photo=0&field=0&reporter_article=&pd=3&ds=2021.01.01&de=2021.01.17&docid=&nso=so%3Ar%2Cp%3Afrom20210101to20210117%2Ca%3Aall&mynews=0&refresh_start=0&related=0

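&query = "search keyword"
&ds = "start date"
&de = "end date"
from "start date" to "end date" (inside the &nso value)
&start = "page number"

We format these settings into the URL and write code that searches for the chosen keyword on the chosen date, page by page.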

from bs4 import BeautifulSoup
import requests
import datetime

keyword = "주식" # keyword to search for
now = datetime.datetime.now()
now_date = now.strftime('%Y.%m.%d') # today's date in the format used by ds and de
now_date2 = now.strftime('%Y%m%d') # today's date in the format used inside nso

news_list = []

for page_number in range(1, 11, 10): # 10 articles per page; this range covers the first page only
    url_format = "https://search.naver.com/search.naver?&where=news&query={}&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=0&ds={}&de={}&docid=&nso=so:r,p:from{}to{},a:all&mynews=0&cluster_rank=27&start={}"
    req = requests.get(url_format.format(keyword, now_date, now_date, now_date2, now_date2, str(page_number)),
                       headers={'User-Agent': 'Mozilla/5.0'}) # browser-like User-Agent so the request is not turned away as a crawler
    sp = BeautifulSoup(req.text, 'html.parser')  # parse the HTML so that only the needed data can be pulled out

The Naver news page has now been parsed.
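As a side note, the query string can also be assembled from a dictionary with the standard library's urllib.parse.urlencode, which percent-encodes the keyword and the nso value. This is only a sketch: build_search_url is a hypothetical helper, and it includes just the parameters explained above (query, ds, de, nso, start). The other parameters from the article's URL, such as sort or pd, can be added to the dictionary in the same way.

```python
import datetime
from urllib.parse import urlencode


def build_search_url(keyword, day, start=1):
    """Build a Naver news search URL restricted to a single day.

    `day` is a datetime.date; `start` is the index of the first result.
    """
    dotted = day.strftime("%Y.%m.%d")   # format used by ds and de
    compact = day.strftime("%Y%m%d")    # format used inside nso
    params = {
        "where": "news",
        "query": keyword,
        "ds": dotted,
        "de": dotted,
        "nso": "so:r,p:from{0}to{0},a:all".format(compact),
        "start": start,
    }
    return "https://search.naver.com/search.naver?" + urlencode(params)


print(build_search_url("주식", datetime.date(2021, 1, 1)))
```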

The page number (page_number) goes 1, 11, 21, and so on, so range(1, ***, 10) is used to count up from 1 in steps of 10.
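To make the page arithmetic concrete, here is a small sketch. page_starts is a hypothetical helper that turns "how many pages do I want" into the list of start values, assuming 10 articles per page as above.

```python
def page_starts(num_pages, per_page=10):
    """Return the start value of each result page: 1, 11, 21, ..."""
    return list(range(1, num_pages * per_page + 1, per_page))


print(page_starts(1))  # → [1]
print(page_starts(3))  # → [1, 11, 21]
```

Note that range(1, 11, 10) produces only the value 1, so the loop in this tutorial fetches just the first page. Raising the stop value (for example to 31) fetches more pages.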

Next, look at the page source and pull out each article's title and url.

On Naver News, each article's information sits at div.group_news > ul.list_news > li div.news_area > a.

Put the title and url into crawling_news, then append crawling_news to the news_list created earlier.

sources = sp.select('div.group_news > ul.list_news > li div.news_area > a')

for source in sources:
    title = source.attrs['title']
    url = source.attrs['href']

    crawling_news = {
        '기사 제목' : title,
        '키워드' : keyword,
        'url' : url,
        '크롤링 날짜' : str(now_date)
    }

    # append inside the loop so that every article is stored
    news_list.append(crawling_news)
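If the page markup changes and a matched link has no title or href attribute, indexing attrs directly raises a KeyError. A slightly more defensive sketch is shown below. to_record is a hypothetical helper, and it takes a plain dict so it works both with BeautifulSoup's source.attrs and without BeautifulSoup installed.

```python
def to_record(attrs, keyword, crawl_date):
    """Build one news record from a link's attribute dict, or return None.

    `attrs` is a plain dict such as BeautifulSoup's `tag.attrs`.
    """
    title = attrs.get("title")
    url = attrs.get("href")
    if not title or not url:
        return None  # skip links that are missing a title or an address
    return {
        "기사 제목": title,
        "키워드": keyword,
        "url": url,
        "크롤링 날짜": str(crawl_date),
    }


print(to_record({"href": "https://example.com/a"}, "주식", "2021.08.06"))
# → None
```

Inside the loop, this would be called as to_record(source.attrs, keyword, now_date), appending the result only when it is not None.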

Putting it all together, the custom function news() looks like this.

# Crawl news for a keyword
def news():
    keyword = '추천+종목'
    now = datetime.datetime.now()
    now_date = now.strftime('%Y.%m.%d')  # format used by ds and de
    now_date2 = now.strftime('%Y%m%d')  # format used inside nso

    news_list = []

    for page_number in range(1, 11, 10): # 10 articles per page; first page only
        url_format = "https://search.naver.com/search.naver?&where=news&query={}&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=0&ds={}&de={}&docid=&nso=so:r,p:from{}to{},a:all&mynews=0&cluster_rank=27&start={}"
        req = requests.get(url_format.format(keyword, now_date, now_date, now_date2, now_date2, str(page_number)),
                           headers={'User-Agent': 'Mozilla/5.0'})
        sp = BeautifulSoup(req.text, 'html.parser')  # parse the HTML so that only the needed data can be pulled out

        sources = sp.select('div.group_news > ul.list_news > li div.news_area > a')

        for source in sources:
            title = source.attrs['title']
            url = source.attrs['href']

            crawling_news = {
                '기사 제목' : title,
                '키워드' : keyword,
                'url' : url,
                '크롤링 날짜' : str(now_date)
            }

            news_list.append(crawling_news)

    return news_list
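The function above searches for one hard-coded keyword. To follow several keywords, one possible approach is to loop over them and merge the results. In this sketch, crawl_keywords is a hypothetical helper and fetch stands for any function that returns the record list for one keyword, for example a variant of news() that takes the keyword as an argument (the tutorial's news() takes none).

```python
def crawl_keywords(keywords, fetch):
    """Call fetch(keyword) for each keyword and merge the returned lists."""
    records = []
    for keyword in keywords:
        records.extend(fetch(keyword))
    return records


def fake_fetch(keyword):
    """Stand-in for a real crawler so the sketch runs offline."""
    return [{"키워드": keyword, "url": "https://example.com/" + keyword}]


print(len(crawl_keywords(["stock", "bitcoin"], fake_fetch)))  # → 2
```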

The complete code, combining every step, is as follows.

from notion.client import NotionClient

from bs4 import BeautifulSoup
import requests
import datetime

# Crawl news for a keyword
def news():
    keyword = '추천+종목'
    now = datetime.datetime.now()
    now_date = now.strftime('%Y.%m.%d')  # format used by ds and de
    now_date2 = now.strftime('%Y%m%d')  # format used inside nso

    news_list = []

    for page_number in range(1, 11, 10): # 10 articles per page; first page only
        url_format = "https://search.naver.com/search.naver?&where=news&query={}&sm=tab_pge&sort=0&photo=0&field=0&reporter_article=&pd=0&ds={}&de={}&docid=&nso=so:r,p:from{}to{},a:all&mynews=0&cluster_rank=27&start={}"
        req = requests.get(url_format.format(keyword, now_date, now_date, now_date2, now_date2, str(page_number)),
                           headers={'User-Agent': 'Mozilla/5.0'})
        sp = BeautifulSoup(req.text, 'html.parser')  # parse the HTML so that only the needed data can be pulled out

        sources = sp.select('div.group_news > ul.list_news > li div.news_area > a')

        for source in sources:
            title = source.attrs['title']
            url = source.attrs['href']

            crawling_news = {
                '기사 제목' : title,
                '키워드' : keyword,
                'url' : url,
                '크롤링 날짜' : str(now_date)
            }

            news_list.append(crawling_news)

    return news_list


token = 'your token_v2 value'
url = 'url'  # get_collection_view expects the link to the table view

# Create the client and fetch the table view
client = NotionClient(token_v2=token)
page = client.get_collection_view(url)

# a different variable name keeps the news() function from being overwritten
news_items = news()

for onenews in news_items:
    row = page.collection.add_row()
    row.title = onenews['기사 제목']
    row.crawlingdate = onenews['크롤링 날짜']
    row.keyword = onenews['키워드']
    row.url = onenews['url']

Using the title, keyword, url, and crawlingdate values stored in news_list,

each call to row = page.collection.add_row() adds one row to the page's table, and the values are filled in to match each column.
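The four assignments per row can also be driven by a small mapping from dictionary key to column name, which makes it easier to add a column later. This is a sketch under assumptions: fill_row and COLUMN_FOR_KEY are hypothetical names, and a SimpleNamespace stands in for the Notion row so the example runs without a Notion account.

```python
from types import SimpleNamespace

# Maps the dictionary keys produced by news() to the Notion column names.
COLUMN_FOR_KEY = {
    "기사 제목": "title",
    "크롤링 날짜": "crawlingdate",
    "키워드": "keyword",
    "url": "url",
}


def fill_row(row, item):
    """Copy each value of `item` onto the matching attribute of `row`."""
    for key, column in COLUMN_FOR_KEY.items():
        setattr(row, column, item[key])
    return row


# Stand-in row object; with notion-py this would be page.collection.add_row().
demo = fill_row(SimpleNamespace(), {
    "기사 제목": "Sample headline",
    "크롤링 날짜": "2021.08.06",
    "키워드": "주식",
    "url": "https://example.com/a",
})
print(demo.title, demo.crawlingdate)  # → Sample headline 2021.08.06
```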

That covers crawling Naver news into Notion with notion-py.

For more Notion tips, check out my YouTube channel :-)

(Subscribes and likes are love... 🥰)

www.youtube.com/channel/UCpMsx_Ac9qVr9bFrBSOI-WQ/featured
