I'm working on an AI project, and one of the steps is to get ~5,000 articles from an online outlet.
I'm a beginner programmer, so please be kind. I've found a site that is very easy to scrape from, in terms of URL structure - I just need a scraper that can take an entire article from a site (we will be analyzing the articles in bulk, with AI).
The div containing the article text is the same across the entire site: "col-md-12 description-content-wrap".
Does anyone know a simple Python script that would go through a .CSV of URLs, pull the text from that div for each article, and output it as plain text? I've found a few solutions, but none are 100% what I need.
Ideally all of the 5,000 articles would be outputted in one file, but if they need to each be separate, that's fine too. Thanks in advance!
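For reference, here is a minimal sketch of exactly that workflow: read a CSV of URLs, pull the text of the article div from each page, and append everything to one plain-text file. The url column name, the urls.csv and articles.txt filenames, and matching the div by its description-content-wrap class are assumptions to adapt to your data.
import csv
import requests
from bs4 import BeautifulSoup

# Assumes urls.csv has a column named "url"; writes all articles into articles.txt
with open("urls.csv", newline="", encoding="utf-8") as f, open("articles.txt", "w", encoding="utf-8") as out:
    for row in csv.DictReader(f):
        url = row["url"]
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
        except requests.RequestException as exc:
            print(f"Skipping {url}: {exc}")
            continue
        soup = BeautifulSoup(resp.text, "html.parser")
        # class_ matches any div whose class list contains "description-content-wrap"
        div = soup.find("div", class_="description-content-wrap")
        if div is None:
            print(f"No article div found at {url}")
            continue
        out.write(f"=== {url} ===\n")
        out.write(div.get_text(separator="\n", strip=True) + "\n\n")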
I did something a little bit similar to this about a week ago. Here is the code that I came up with.
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download('vader_lexicon') once

# Collect every link on the CNBC finance landing page that points back to cnbc.com
resp = urllib.request.urlopen("https://www.cnbc.com/finance/")
soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().get_param('charset'))

substring = 'https://www.cnbc.com/'
links = []
for link in soup.find_all('a', href=True):
    if link['href'].find(substring) == 0:
        links.append(link['href'])

# Convert the list to a data frame with a single 'review' column
df = pd.DataFrame(links, columns=['review'])

# Score each entry with VADER and map the compound score to a label
sid = SentimentIntensityAnalyzer()
df['sentiment'] = df['review'].apply(lambda x: sid.polarity_scores(x))

def convert(x):
    if x < 0:
        return "negative"
    elif x > .2:
        return "positive"
    else:
        return "neutral"

df['result'] = df['sentiment'].apply(lambda x: convert(x['compound']))

df_final = pd.merge(df['review'], df['result'], left_index=True, right_index=True)
df_final.to_csv('C:\\Users\\ryans\\OneDrive\\Desktop\\out.csv')
I recently asked a very similar question to this; however, I have run into some additional issues. My goal was to extract the links from the players column into a new column, with rows corresponding to the given player. My current method works for tables where these links are the first to appear in the table. However, with tables like the one in the code below, there are links prior to the player ones, and those are what come through to the new column. I have attempted to exclude certain links from the extracted ones using substrings, but I am unclear on what format the "list" (not a list object) of links comes in as. Does anyone know of a way to extract solely the player links? I cannot find any obvious differences between the columns' links within the HTML, so I am unclear whether this is even possible. If anyone more knowledgeable about BeautifulSoup could take a look, that would be amazing.
Below I have provided the code, the type of links coming in, and the desired links. Thank you in advance.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import unicodedata
def MVPWINNERS():
    url = "https://www.basketball-reference.com/awards/mvp.html"
    html = requests.get(url).text.replace('<!--', '').replace('-->', '')
    soup = BeautifulSoup(html, "html.parser")
    tabs = soup.select('table[id*="mvp_NBA"]')
    for tab in tabs:
        cols, players = [], []
        for s in tab.select('thead tr:nth-child(2) th'):
            cols.append(s.text)
        for j in tab.select('tbody tr, tfoot tr'):
            player = [dat.text for dat in j.select('td,th')]
            player_links = j.find('a')['href']
            player.append(player_links)
            players.append(player)
        max_length = len(max(players, key=len))
        players_plus = [player + [""] * (max_length - len(player)) for player in players]
        df = pd.DataFrame(players_plus, columns=cols + ["player_links"])
        print(df)

MVPWINNERS()
Current output:
/leagues/NBA_2022.html
Desired Output:
/players/j/jokicni01.html
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from pprint import pp
def main(url):
    # Collect the unique player profile links from the "player" column cells only
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    players = set(urljoin(url, x['href'])
                  for x in soup.select('td[data-stat=player] a'))
    pp(players)

main('https://www.basketball-reference.com/awards/mvp.html')
Output:
{'https://www.basketball-reference.com/players/a/abdulka01.html',
'https://www.basketball-reference.com/players/a/antetgi01.html',
'https://www.basketball-reference.com/players/b/barklch01.html',
'https://www.basketball-reference.com/players/b/birdla01.html',
'https://www.basketball-reference.com/players/b/bryanko01.html',
'https://www.basketball-reference.com/players/c/chambwi01.html',
'https://www.basketball-reference.com/players/c/cousybo01.html',
'https://www.basketball-reference.com/players/c/cowenda01.html',
'https://www.basketball-reference.com/players/c/cunnibi01.html',
'https://www.basketball-reference.com/players/c/curryst01.html',
'https://www.basketball-reference.com/players/d/danieme01.html',
'https://www.basketball-reference.com/players/d/duncati01.html',
'https://www.basketball-reference.com/players/d/duranke01.html',
'https://www.basketball-reference.com/players/e/ervinju01.html',
'https://www.basketball-reference.com/players/g/garneke01.html',
'https://www.basketball-reference.com/players/g/gilmoar01.html',
'https://www.basketball-reference.com/players/h/hardeja01.html',
'https://www.basketball-reference.com/players/h/hawkico01.html',
'https://www.basketball-reference.com/players/h/haywosp01.html',
'https://www.basketball-reference.com/players/i/iversal01.html',
'https://www.basketball-reference.com/players/j/jamesle01.html',
'https://www.basketball-reference.com/players/j/johnsma02.html',
'https://www.basketball-reference.com/players/j/jokicni01.html',
'https://www.basketball-reference.com/players/j/jordami01.html',
'https://www.basketball-reference.com/players/m/malonka01.html',
'https://www.basketball-reference.com/players/m/malonmo01.html',
'https://www.basketball-reference.com/players/m/mcadobo01.html',
'https://www.basketball-reference.com/players/m/mcginge01.html',
'https://www.basketball-reference.com/players/n/nashst01.html',
'https://www.basketball-reference.com/players/n/nowitdi01.html',
'https://www.basketball-reference.com/players/o/olajuha01.html',
'https://www.basketball-reference.com/players/o/onealsh01.html',
'https://www.basketball-reference.com/players/p/pettibo01.html',
'https://www.basketball-reference.com/players/r/reedwi01.html',
'https://www.basketball-reference.com/players/r/roberos01.html',
'https://www.basketball-reference.com/players/r/robinda01.html',
'https://www.basketball-reference.com/players/r/rosede01.html',
'https://www.basketball-reference.com/players/r/russebi01.html',
'https://www.basketball-reference.com/players/u/unselwe01.html',
'https://www.basketball-reference.com/players/w/waltobi01.html',
'https://www.basketball-reference.com/players/w/westbru01.html'}
I am trying to retrieve all journals that exist within a subject area of Scopus, say 'Medicine', using the Python package pybliometrics.
According to the Scopus search (online), there are 13,477 journals in this category.
I access the SerialTitle API of Scopus via pybliometrics.scopus.SerialSearch(). For the category Medicine, subjArea='MEDI' and subjCode='2700'. The list of all codes associated with the Scopus subject categories is listed here
With subjCode='2700' I am not able to get more than 5,000 journals, but with subjArea='MEDI' I am able to retrieve more than 5,000, though not more than 10,000.
I do not understand why searching with subjArea and subjCode fetches different results for me. Can anyone help me understand why this could be happening?
I am adding my code for both these search queries for better understanding:
import pandas as pd
from pybliometrics.scopus import SerialSearch
def search_by_subject_area(subject_area):
    print("Searching journals by subject area....")
    df = pd.DataFrame()
    i = 0
    # the i < 10000 cap is added because going further raises a Scopus 500 error
    while i > -1 and i < 10000:
        s = SerialSearch(query={"subj": f"{str(subject_area)}"}, start=f'{i}', refresh=True)
        if s.get_results_size() == 0:
            break
        else:
            i += s.get_results_size()
            df_new = pd.DataFrame(s.results)
            df = pd.concat([df, df_new], axis=0, ignore_index=True)
    print(i, " journals obtained!")


def search_by_subject_code(code):
    print("------------------------------------------------\n Searching journals by subject codes....")
    df = pd.DataFrame()
    i = 0
    while i > -1:
        s = SerialSearch(query={"subjCode": f"{code}"}, start=f'{i}', refresh=True)
        if s.get_results_size() == 0:
            break
        else:
            i += s.get_results_size()
            df_new = pd.DataFrame(s.results)
            df = pd.concat([df, df_new], axis=0, ignore_index=True)
    print(i, " journals obtained!")


if __name__ == '__main__':
    search_by_subject_area(subject_area='MEDI')
    search_by_subject_code('2700')
Certain Scopus APIs, including the Serial Search API used here, are restricted: they do not return more than 5,000 results.
Some of the Search APIs do have pagination active, which allows you to cycle through a potentially unlimited number of results.
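For illustration only, here is a minimal sketch of cursor-style pagination against the Scopus Search API using requests. The endpoint, the X-ELS-APIKey header, the cursor parameter, and the shape of the JSON response are assumptions to verify against Elsevier's API documentation; the SerialTitle API queried above does not offer this.
import requests

API_KEY = "your-api-key"  # assumption: a valid Elsevier API key
URL = "https://api.elsevier.com/content/search/scopus"  # assumption: Scopus Search endpoint

def fetch_all(query):
    cursor = "*"  # "*" is assumed to start a new cursor session
    while True:
        resp = requests.get(
            URL,
            headers={"X-ELS-APIKey": API_KEY, "Accept": "application/json"},
            params={"query": query, "cursor": cursor, "count": 25},
        )
        resp.raise_for_status()
        data = resp.json()["search-results"]
        entries = data.get("entry", [])
        # assumption: an exhausted result set comes back as a single "error" entry
        if not entries or "error" in entries[0]:
            return
        yield from entries
        cursor = data["cursor"]["@next"]

results = list(fetch_all("SUBJAREA(MEDI)"))
print(len(results))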
I've managed to expose the right data (some of it is calculated on the fly in the page, so it was a bit more complex than I thought), but I now need to get it into a JSON string and, despite many attempts, I'm stuck!
This Python script is as follows (using Selenium & BeautifulSoup):
from bs4 import BeautifulSoup
from selenium import webdriver
import datetime
from dateutil import parser
import requests
import json
url = 'https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1'
browser = webdriver.Chrome(executable_path = r'C:/Users/user/Downloads/chromedriver.exe')
browser.get(url)
html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, "html.parser")
data=soup.find_all("div", {"class":"date_display"})
#print(data)
#out = {}
for data in data:
    bin_colour = data.find('h3').text
    bin_date = parser.parse(data.find('p').text).strftime('%Y-%m-%d')
    print(bin_colour)
    print(bin_date)
    print()
browser.quit()
This results in:
Grey Bin
2021-06-30
Green Bin
2021-06-23
Clear Sack
2021-06-23
Food Bin
2021-06-23
It might (probably) not be the best code/approach so am open to your suggestions. The main goal is to end up with:
{"Grey Bin": "2021-06-30", "Green Bin": "2021-06-23", "Clear Sack": "2021-06-23", "Food Bin": "2021-06-23"}
Hope this makes sense. I've tried various ways of getting the data into the right format but just seem to lose it all, so after many hours of trying I'm hoping you can help.
Update:
Both of MendelG's solutions worked perfectly. Vitalis's solution gave four outputs, the last being the required output - so thank you both for very quick and working solutions - I was close, but couldn't see the wood for the trees!
To get the data in a dictionary format (here, tag refers to the result of your soup.find_all(...) call), you can try:
out = {}
for data in tag:
    out[data.find("h3").text] = parser.parse(data.find("p").text).strftime("%Y-%m-%d")

print(out)
Or, use a dictionary comprehension:
print(
    {
        data.find("h3").text: parser.parse(data.find("p").text).strftime("%Y-%m-%d")
        for data in tag
    }
)
Output:
{'Grey Bin': '2021-06-30', 'Green Bin': '2021-06-23', 'Clear Sack': '2021-06-23', 'Food Bin': '2021-06-23'}
You can create an empty dictionary, add values there and print it.
Solution
from bs4 import BeautifulSoup
from selenium import webdriver
import datetime
from dateutil import parser
import requests
import json
url = 'https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1'
browser = webdriver.Chrome(executable_path='/snap/bin/chromium.chromedriver')
browser.get(url)
html = browser.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all("div", {"class":"date_display"})
result = {}
for item in data:
    bin_colour = item.find('h3').text
    bin_date = parser.parse(item.find('p').text).strftime('%Y-%m-%d')
    result[bin_colour] = bin_date

print(result)
OUTPUT
{'Grey Bin': '2021-06-30', 'Green Bin': '2021-06-23', 'Clear Sack': '2021-06-23', 'Food Bin': '2021-06-23'}
You can do it in a similar way if you need the output as a list, but then you'll need to .append the values, as I did in Trouble retrieving elements and looping pages using next page button.
If you need double quotes, use this print:
print(json.dumps(result))
It will print:
{"Grey Bin": "2021-06-30", "Green Bin": "2021-06-23", "Clear Sack": "2021-06-23", "Food Bin": "2021-06-23"}
You could collect all the listed dates using requests and re. You regex out the various JavaScript objects containing the dates for each collection type. You then need to add 1 to each month value to get months in the range 1-12, which can be done with regex named groups. These can then be converted to actual dates for later filtering.
Initially storing all dates in a dictionary with key as collection type and values as a list of collection dates, you can use zip_longest to create a DataFrame. You can then use filtering to find the next collection date for a given collection.
I use a couple of helper functions to achieve this.
import re
import requests
from dateutil import parser
from datetime import datetime
from pandas import to_datetime, DataFrame
from itertools import zip_longest


def get_dates(dates):
    # JavaScript Date months run 0-11, so add 1 to the month before parsing
    dates = [re.sub(r'(?P<g1>\d+),(?P<g2>\d+),(?P<g3>\d+)$',
                    lambda d: parser.parse('-'.join([d.group('g1'), str(int(d.group('g2')) + 1), d.group('g3')])).strftime('%Y-%m-%d'), i)
             for i in re.findall(r'Date\((\d{4},\d{1,2},\d{1,2}),', dates)]
    dates = [datetime.strptime(i, '%Y-%m-%d').date() for i in dates]
    return dates


def get_next_collection(collection, df):
    return df[df[collection] >= to_datetime('today')][collection].iloc[0]


collection_types = ['grey', 'green', 'clear', 'food']
r = requests.get('https://www.braintree.gov.uk/bins-waste-recycling/route-3-collection-dates/1')

collections = {}
for collection in collection_types:
    dates = re.search(r'var {0}(?:(?:bin)|(?:sack)) = (\[.*?\])'.format(collection), r.text, re.S).group(1)
    collections[collection] = get_dates(dates)

df = DataFrame(zip_longest(collections['grey'], collections['green'],
                           collections['clear'], collections['food']),
               columns=collection_types)

get_next_collection('grey', df)
You could also use a generator and islice, as detailed by @Martijn Pieters, to work directly off the dictionary entries (holding the collection dates) and limit how many future dates you are interested in, e.g.
filtered = (i for i in collections['grey'] if i >= date.today())
list(islice(filtered, 3))
Altered import lines are:
from itertools import zip_longest, islice
from datetime import datetime, date
You then don't need the pandas imports or creation of a DataFrame.
I am relatively new to Python and completely new to webscraping, but I am trying to gather data from this website:
https://www.usclimatedata.com/climate/cumming/georgia/united-states/usga1415
I want to grab the info from the tables for Jan-Dec, put it into a pandas data frame, and print it back to the user. I plan on doing some more with the data, like computing my own averages and means/medians, but I am struggling with getting the data initially. Any help would be appreciated!
If you are getting data from files, you can use x = pd.read_csv(...) (or the pandas reader matching the file extension that you use instead of csv) and then print(x).
First check the website's terms of service and robots.txt to see whether you are allowed to scrape the page.
If you are, then you can use bs4's BeautifulSoup package to scrape the web page.
import pandas as pd
from datetime import datetime


def get_state_holiday_data(self, year: int, state_name: str) -> pd.DataFrame:
    try:
        # self.get_page_content is assumed to return the parsed (BeautifulSoup) page
        pagecontent = self.get_page_content(year, state_name)
        holiday_table_list = []
        for table in pagecontent.findAll("table"):
            for tbody in table.findAll("tbody"):
                for row in tbody.findAll("tr"):
                    holiday_row_list = []
                    if len(row.findAll("td")) == 3:
                        for cell_data in row.findAll("td"):
                            holiday_row_list.append(cell_data.find(text=True).replace('\n', '').strip(' '))
                        holiday_table_list.append(holiday_row_list)
            break  # only the first table on the page is needed
        state_holiday_df = pd.DataFrame.from_records(holiday_table_list, columns=['Date', 'Day', 'Holiday'])
        state_holiday_df['Date'] = state_holiday_df['Date'].apply(
            lambda date: str(year) + '-' + datetime.strptime(date, '%d %b').strftime('%m-%d'))
        del state_holiday_df['Day']
        return state_holiday_df
    except Exception as e:
        raise e
Above is sample code to scrape a table and convert it to a DataFrame, where table and tbody are the names of the HTML table elements.
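For the climate tables in the question, a shorter route may be pandas.read_html, assuming the monthly tables are present in the HTML the server returns (if they are filled in by JavaScript, you would need Selenium instead). The User-Agent header below is an assumption, added because some sites reject bare requests.
import pandas as pd
import requests

# Requires lxml or html5lib to be installed for pandas.read_html
url = "https://www.usclimatedata.com/climate/cumming/georgia/united-states/usga1415"
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
resp.raise_for_status()

tables = pd.read_html(resp.text)  # one DataFrame per <table> found in the HTML
for i, table in enumerate(tables):
    print(f"Table {i}:")
    print(table.head(), "\n")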
How do I get the resulting URL: https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm
...from this page ...
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001633917&owner=exclude&count=40
... by specifying date = '2018-04-25' and 8-K for Filing? Do I loop through, or is there a one-liner that will get me the result?
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests
date='2018-04-25'
CIK='1633917'
url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=' + CIK + '&owner=exclude&count=100'
r = requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')
a=soup.find('table', class_='tableFile2').findAll('tr')
for i in a:
    print(i)
There is no one-liner to get what you want. You'll have to loop through the rows and then check whether the values match.
But, there is a slightly better approach which narrows down the rows. You can directly select the rows which match one of the values. For example, you can select all the rows which have date = '2018-04-25' and then check if the Filing matches.
Code:
for date in soup.find_all('td', text='2018-04-25'):
    row = date.find_parent('tr')
    if row.td.text == '8-K':
        link = row.a['href']
        print(link)
Output:
/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm
So, here, instead of looping over all the rows, you simply loop over the rows having the date you want. In this case, there is only one such row, and hence we loop only once.
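As a small follow-up: the table stores a relative href, so if you want the full https://www.sec.gov/... address shown in the question, you can join the relative link with the site root.
from urllib.parse import urljoin

# Turn the relative href from the table row into an absolute URL
relative = '/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm'
print(urljoin('https://www.sec.gov/', relative))
# https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm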