Beautiful Soup: find a link in a table by specifying two things - Python

How do I get the resulting URL: https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm
...from this page ...
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001633917&owner=exclude&count=40
... by specifying date = '2018-04-25' and Filing = '8-K'? Do I loop through, or is there a one-liner that will get me the result?
from bs4 import BeautifulSoup
import requests

date = '2018-04-25'
CIK = '1633917'
url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=' + CIK + '&owner=exclude&count=100'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
rows = soup.find('table', class_='tableFile2').findAll('tr')
for row in rows:
    print(row)

There is no one-liner that gets you what you want. You'll have to loop through the rows and check whether the values match.
But there is a slightly better approach that narrows down the rows: you can directly select the rows that match one of the values. For example, you can select all the rows with date = '2018-04-25' and then check whether the Filing matches.
Code:
for date in soup.find_all('td', text='2018-04-25'):
    row = date.find_parent('tr')
    if row.td.text == '8-K':
        link = row.a['href']
        print(link)
Output:
/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm
So, here, instead of looping over all the rows, you simply loop over the rows having the date you want. In this case, there is only one such row, and hence we loop only once.
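If you do want it condensed into a single expression, here is a hedged sketch of the same logic written as a generator expression (it is still a loop under the hood, so it buys brevity, not speed):

link = next(
    (td.find_parent('tr').a['href']
     for td in soup.find_all('td', text='2018-04-25')
     if td.find_parent('tr').td.text == '8-K'),
    None,  # fallback if no matching row exists
)
print(link)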

Related

How to scrape website with mouseover events and no class on most elements

I am trying to scrape a table of data off a website with mouse-over color-changing events, where on each row only the first two columns have a class:
Additionally, when I do try to scrape those first two columns, I get the output as such:
This is the code I am running.
lists = soup.find_all('table', class_="report")
for list in lists:
    date = list.find_all('td', class_="time")
    flag = list.find_all('td', class_="flag")
    info = [date, flag]
    print(info)
I was expecting to receive only the numerical values so that I can export them and work with them.
I tried to use the .replace() function but it didn't remove anything.
I was unable to use .text even after converting date and flag to strings.
Notes
It might be better not to use list as a variable name, since it already means something in Python.
Also, I can't test any of the suggestions below without access to your HTML. It would be helpful if you included how you fetched the HTML to parse with BeautifulSoup. (Did you use some form of requests? Or something like selenium? Or do you just have the HTML file?) With requests or even selenium, sometimes the HTML fetched is not what you expected, so that's another possible issue.
The Reason behind the Issue
You can't apply .text/.get_text to the ResultSets (lists) that .find_all and .select return. You can apply .text/.get_text to Tags, like the ones returned by .find or .select_one (but you should first check that None was not returned, which happens when nothing is found).
So date[0].get_text() might have returned something, but since you probably want all the dates and flags, that's not the solution.
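A tiny illustration of the Tag-vs-ResultSet distinction, on a made-up snippet of HTML:

from bs4 import BeautifulSoup

snippet = '<td class="time">09:30</td><td class="time">10:00</td>'
soup = BeautifulSoup(snippet, 'html.parser')

cells = soup.find_all('td', class_='time')  # ResultSet: has no .get_text()
first = soup.find('td', class_='time')      # Tag, or None if nothing matched
print(first.get_text() if first else None)  # -> 09:30
print([td.get_text() for td in cells])      # -> ['09:30', '10:00']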
Solution Part 1 Option A: Getting the Rows with .find... chain
Instead of iterating through tables directly, you need to get the rows first (tr tags) before trying to get at the cells (td tags); if you have only one table with class= "report", you could just do something like:
rows = soup.find('table', class_= "report").find_all('tr', {'onmouseover': True})
But it's risky to chain multiple .find...s like that, because an error will be raised if any of them return None before reaching the last one.
Solution Part 1 Option B: Getting the Rows More Safely with .find...
It would be safer to do something like
table = soup.find('table', class_= "report")
rows = table.find_all('tr', {'onmouseover': True}) if table else []
or, if there might be more than one table with class= "report",
rows = []
for table in soup.find_all('table', class_="report"):
    rows += table.find_all('tr', {'onmouseover': True})
Solution Part 1 Option C: Getting the Rows with .select
However, I think the most convenient way is to use .select with CSS selectors:
# if there might be other tables with report AND other classes that you don't want:
# rows = soup.select('table[class="report"] tr[onmouseover]')
rows = soup.select('table.report tr[onmouseover]')
This method is only unsuitable if there might be more than one table with class="report" but you only want rows from the first one; in that case, you might prefer the table.find_... if table else [] approach.
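For completeness, a hedged .select_one sketch of that "first table only" case:

first_table = soup.select_one('table.report')
rows = first_table.select('tr[onmouseover]') if first_table else []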
Solution Part 2 Option A: Iterating Over rows to Print Cell Contents
Once you have rows, you can iterate over them to print the date and flag cell contents:
for r in rows:
    date, flag = r.find('td', class_="time"), r.find('td', class_="flag")
    info = [i.get_text() if i else i for i in [date, flag]]
    if any([i is not None for i in info]): print(info)
    # only prints if there are any non-null values
Solution Part 2 Option B: Iterating Over rows with a Function
Btw, if you're going to be extracting multiple tag attributes/texts repeatedly, you might find my selectForList function useful - it could have been used like
for r in rows:
    info = selectForList(r, ['td.time', 'td.flag'], printList=True)
    if any([i is not None for i in info]): print(info)
or, to get a list of dictionaries like [{'time': time_1, 'flag': flag_1}, {'time': time_2, 'flag': flag_2}, ...],
infList = [selectForList(r, {
    'time': 'td.time', 'flag': 'td.flag',  # add selectors for any info you want
}) for r in soup.select('table.report tr[onmouseover]')]
infList = [d for d in infList if [v for v in d.values() if v is not None]]
EDIT:
To get all displayed cell contents:
for r in rows:
    cells = [(td.get_text(), ''.join(td.get('style', '').split())) for td in r.select('td')]
    info = [txt.strip() for txt, styl in cells if 'display:none' not in styl]
    # info = [i if i else 0 for i in info]  # fill blanks with zeroes instead of ''
    if any(info): print(info)  # or write to CSV
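And a hedged follow-up on the "write to CSV" comment above, reusing rows from earlier (output.csv is an assumed filename):

import csv

all_rows = []
for r in rows:
    cells = [(td.get_text(), ''.join(td.get('style', '').split())) for td in r.select('td')]
    visible = [txt.strip() for txt, styl in cells if 'display:none' not in styl]
    if any(visible):
        all_rows.append(visible)

with open('output.csv', 'w', newline='') as f:
    csv.writer(f).writerows(all_rows)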

Extracting Player Hyperlinks from Basketball Reference Column (From Tables With Multiple Columns of Hyperlinks) to New Column in Pandas DataFrame

I recently asked a very similar question to this; however, I have run into some additional issues. My goal is to extract the links from the players column into a new column, with rows corresponding to the given player. My current method works for tables where these links are the first to appear in each row. However, with tables like the one in the code below, there are links before the player ones, and those are what come through to the new column. I have attempted to exclude certain links from the ones extracted using substrings, but I am unclear on what format the "list" (not as in a list object) of links comes in as. Does anyone know of a way to extract solely the player links? I cannot find any obvious differences between the columns' links within the HTML, so I am unclear if this is even possible. However, if anyone more knowledgeable of BeautifulSoup could take a look, that would be amazing.
Below I have provided the code, the type of links coming in, and the desired links. Thank you in advance.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import unicodedata

def MVPWINNERS():
    url = "https://www.basketball-reference.com/awards/mvp.html"
    html = requests.get(url).text.replace('<!--', '').replace('-->', '')
    soup = BeautifulSoup(html, "html.parser")
    tabs = soup.select('table[id*="mvp_NBA"]')
    for tab in tabs:
        cols, players = [], []
        for s in tab.select('thead tr:nth-child(2) th'):
            cols.append(s.text)
        for j in tab.select('tbody tr, tfoot tr'):
            player = [dat.text for dat in j.select('td,th')]
            player_links = j.find('a')['href']  # grabs the row's FIRST link
            player.append(player_links)
            players.append(player)
        max_length = len(max(players, key=len))
        players_plus = [player + [""] * (max_length - len(player)) for player in players]
        df = pd.DataFrame(players_plus, columns=cols + ["player_links"])
        print(df)

MVPWINNERS()
Current output:
/leagues/NBA_2022.html
Desired Output:
/players/j/jokicni01.html
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from pprint import pp

def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')
    players = set(urljoin(url, x['href'])
                  for x in soup.select('td[data-stat=player] a'))
    pp(players)

main('https://www.basketball-reference.com/awards/mvp.html')
Output:
{'https://www.basketball-reference.com/players/a/abdulka01.html',
'https://www.basketball-reference.com/players/a/antetgi01.html',
'https://www.basketball-reference.com/players/b/barklch01.html',
'https://www.basketball-reference.com/players/b/birdla01.html',
'https://www.basketball-reference.com/players/b/bryanko01.html',
'https://www.basketball-reference.com/players/c/chambwi01.html',
'https://www.basketball-reference.com/players/c/cousybo01.html',
'https://www.basketball-reference.com/players/c/cowenda01.html',
'https://www.basketball-reference.com/players/c/cunnibi01.html',
'https://www.basketball-reference.com/players/c/curryst01.html',
'https://www.basketball-reference.com/players/d/danieme01.html',
'https://www.basketball-reference.com/players/d/duncati01.html',
'https://www.basketball-reference.com/players/d/duranke01.html',
'https://www.basketball-reference.com/players/e/ervinju01.html',
'https://www.basketball-reference.com/players/g/garneke01.html',
'https://www.basketball-reference.com/players/g/gilmoar01.html',
'https://www.basketball-reference.com/players/h/hardeja01.html',
'https://www.basketball-reference.com/players/h/hawkico01.html',
'https://www.basketball-reference.com/players/h/haywosp01.html',
'https://www.basketball-reference.com/players/i/iversal01.html',
'https://www.basketball-reference.com/players/j/jamesle01.html',
'https://www.basketball-reference.com/players/j/johnsma02.html',
'https://www.basketball-reference.com/players/j/jokicni01.html',
'https://www.basketball-reference.com/players/j/jordami01.html',
'https://www.basketball-reference.com/players/m/malonka01.html',
'https://www.basketball-reference.com/players/m/malonmo01.html',
'https://www.basketball-reference.com/players/m/mcadobo01.html',
'https://www.basketball-reference.com/players/m/mcginge01.html',
'https://www.basketball-reference.com/players/n/nashst01.html',
'https://www.basketball-reference.com/players/n/nowitdi01.html',
'https://www.basketball-reference.com/players/o/olajuha01.html',
'https://www.basketball-reference.com/players/o/onealsh01.html',
'https://www.basketball-reference.com/players/p/pettibo01.html',
'https://www.basketball-reference.com/players/r/reedwi01.html',
'https://www.basketball-reference.com/players/r/roberos01.html',
'https://www.basketball-reference.com/players/r/robinda01.html',
'https://www.basketball-reference.com/players/r/rosede01.html',
'https://www.basketball-reference.com/players/r/russebi01.html',
'https://www.basketball-reference.com/players/u/unselwe01.html',
'https://www.basketball-reference.com/players/w/waltobi01.html',
'https://www.basketball-reference.com/players/w/westbru01.html'}
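If you want the links as a column in the DataFrame itself (the original goal), here is a hedged rework of the question's loop; it assumes, as the selector above suggests, that td[data-stat="player"] marks the player cell:

import requests
import pandas as pd
from bs4 import BeautifulSoup

def mvp_winners_with_links():
    url = 'https://www.basketball-reference.com/awards/mvp.html'
    html = requests.get(url).text.replace('<!--', '').replace('-->', '')
    soup = BeautifulSoup(html, 'html.parser')
    frames = []
    for tab in soup.select('table[id*="mvp_NBA"]'):
        cols = [th.text for th in tab.select('thead tr:nth-child(2) th')]
        rows = []
        for tr in tab.select('tbody tr, tfoot tr'):
            row = [cell.text for cell in tr.select('td,th')]
            link = tr.select_one('td[data-stat="player"] a')  # the player cell only
            row.append(link['href'] if link else '')
            rows.append(row)
        width = max(len(r) for r in rows)
        rows = [r + [''] * (width - len(r)) for r in rows]
        frames.append(pd.DataFrame(rows, columns=cols + ['player_links']))
    return frames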

Scraper To Copy Articles In Bulk

I'm working on an AI project, and one of the steps is to get ~5,000 articles from an online outlet.
I'm a beginner programmer, so please be kind. I've found a site that is very easy to scrape from, in terms of URL structure - I just need a scraper that can take an entire article from a site (we will be analyzing the articles in bulk, with AI).
The div containing the article text for each piece is the same across the entire site: "col-md-12 description-content-wrap".
Does anyone know a simple Python script that would go through a CSV of URLs, pull the text from the div listed above for each article, and output it as plain text? I've found a few solutions, but none are 100% what I need.
Ideally all 5,000 articles would be output to one file, but if they each need to be separate, that's fine too. Thanks in advance!
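A minimal sketch of the task as described, untested against the actual site; the file names urls.csv and articles.txt and the single-column CSV layout are assumptions:

import csv
import requests
from bs4 import BeautifulSoup

with open('urls.csv', newline='') as f, open('articles.txt', 'w', encoding='utf-8') as out:
    for row in csv.reader(f):
        if not row:  # skip blank lines in the CSV
            continue
        url = row[0].strip()
        resp = requests.get(url, timeout=30)
        soup = BeautifulSoup(resp.text, 'html.parser')
        div = soup.select_one('div.col-md-12.description-content-wrap')
        if div:  # skip pages where the article div is missing
            out.write(div.get_text(separator='\n', strip=True) + '\n\n')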
I did something a little bit similar to this about a week ago. Here is the code that I came up with.
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

resp = urllib.request.urlopen("https://www.cnbc.com/finance/")
soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))

substring = 'https://www.cnbc.com/'
df = ['review']  # this first entry doubles as the column header below
for link in soup.find_all('a', href=True):
    if link['href'].find(substring) == 0:
        df.append(link['href'])

# convert list to data frame and add column name
df = pd.DataFrame(df)
df.columns = ['review']

from nltk.sentiment.vader import SentimentIntensityAnalyzer  # needs nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()
df['sentiment'] = df['review'].apply(lambda x: sid.polarity_scores(x))

def convert(x):
    if x < 0:
        return "negative"
    elif x > .2:
        return "positive"
    else:
        return "neutral"

df['result'] = df['sentiment'].apply(lambda x: convert(x['compound']))
df_final = pd.merge(df['review'], df['result'], left_index=True, right_index=True)
df_final.to_csv('C:\\Users\\ryans\\OneDrive\\Desktop\\out.csv')

Parsing output from scraped webpage using pandas and bs4: way to make output more readable?

I wanted to scrape this page.
I wrote this code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("http://yadamp.unisa.it/showItem.aspx?yadampid=18")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
print(df[0].to_json(orient='records'))
But the output isn't ideal. The output is:
[{"0":"ID","1":"18","2":"NAME","3":"Colutellin-A Blast NCBI-PROT","4":null,"5":null},{"0":"LENGTH","1":"7","2":"DISULFIDE BRIDGE","3":null,"4":"View PDB \/\/ Small molecules can be embedded in the page var glmol02 = new GLmol('glmol02');","5":null},{"0":"SEQUENCE","1":"VISIIPV","2":null,"3":null,"4":null,"5":null},{"0":"HELICITY","1":"85.70","2":"INSTAB. INDEX","3":"31.97","4":"FLEXIBILITY","5":"5.43"},{"0":"a HYD. MOM.","1":"16.35","2":"b HYD. MOM.","3":"9.04","4":"c HYD. MOM","5":"1.37"},{"0":"a MEAN HYD. MOM.","1":"2.34","2":"b MEAN HYD. MOM.","3":"1.29","4":"c MEAN HYD. MOM.","5":"0.20"},{"0":"CHARGE pH5","1":"0.00","2":"CHARGE pH7","3":"0.00","4":"CHARGE pH9","5":"-0.17"},{"0":"\u0394 CHARGE pH5-pH9","1":"0.17","2":"ISOELECTRIC POINT","3":"5.49","4":"BOMAN INDEX","5":"-2.78"},{"0":"\u0394G","1":"-368","2":"CPP","3":"-027","4":"MLP","5":"-006"},{"0":"MOLECULAR VOLUME","1":null,"2":"POLARITY","3":null,"4":null,"5":null},{"0":"MIC E. coli","1":null,"2":"MIC P. aeruginosa","3":null,"4":"MIC S. typhimurium","5":null},{"0":"MIC S. aureus","1":null,"2":"MIC M. luteus","3":null,"4":"MIC B. subtilis","5":null},{"0":"MIC C. albicans","1":null,"2":"OTHER","3":"S.sclerotiorum = 30.86; B.cinerea = 10.29","4":null,"5":null},{"0":"MIC OTHER gram+","1":null,"2":null,"3":null,"4":null,"5":null},{"0":"MIC OTHERgram-","1":null,"2":null,"3":null,"4":null,"5":null},{"0":"PHYLUM","1":"Ascomycota","2":"CLASS","3":"Sordariomycetes","4":"ORDER","5":"Glomerellales"},{"0":"FAMILY","1":"Glomerellaceae","2":"GENUS","3":"Colletotrichum","4":"SPECIES","5":"Colletotrichum dematium"},{"0":"DATE","1":"2008","2":null,"3":null,"4":null,"5":null},{"0":"TITLE PAPER","1":"Colutellin A, an immunosuppressive peptide from Colletotrichum dematium","2":null,"3":null,"4":null,"5":null}]
You can see it's difficult for me to make sense of this list, because I have to loop through a list of multiple dictionaries, and then join pairs of keys together. I was hoping the output would be more like:
ID 18
Name Colutellin-A
Helicity 85.7
etc....just something more readable. Can anyone pinpoint a section of the code that I should change to improve this?
Thanks
You can use pandas read_html() to get the table and then navigate it as a pandas DataFrame; see the code below!
import pandas as pd

url = 'http://yadamp.unisa.it/showItem.aspx?yadampid=18'
table = pd.read_html(url, attrs={'class': 'table table-responsive'}, header=0)
print(pd.DataFrame(table[0]))
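To get the readable label/value listing the question asks for, here is a hedged sketch that assumes the 6-column layout visible in the JSON dump (columns 0/1, 2/3, and 4/5 each hold a label/value pair):

import pandas as pd

url = 'http://yadamp.unisa.it/showItem.aspx?yadampid=18'
df = pd.read_html(url, attrs={'class': 'table table-responsive'})[0]

for _, row in df.iterrows():
    for label_col, value_col in ((0, 1), (2, 3), (4, 5)):
        label, value = row[label_col], row[value_col]
        if pd.notna(label):  # skip the empty filler cells
            print(f'{label:<22} {"" if pd.isna(value) else value}')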

Pandas DataFrame Speed

So I have the preceding dataframe, to which I want to add a new column called "dload", which I achieve by coding df["dload"] = np.nan.
I then want to fill in the nan value with the returns of this function:
def func_ret_value(soup, tables):
    for td in tables[40].findAll("td"):
        if td.text == "Short Percent of Float":
            value = list(td.next_siblings)[1].text.strip("%")
            return value
To do this I write the following code:
for index in df.index:
    r = requests.get(url_pre + df.iloc[index]["Symbol"] + url_suf)
    soup = BeautifulSoup(r.text, "html.parser")
    tables = soup.findAll("table")
    df.loc[index, "dload"] = func_ret_value(soup, tables)
Is there some iterrows or apply that is a faster way of doing this?
Thank you.
You could use apply(), but I would guess that the most computationally intensive part of your code is the HTTP requests (as @Peter Leimbigler mentioned in his comment). Here is an example with your function:
def func_ret_value(x):
    r = requests.get(url_pre + x['Symbol'] + url_suf)
    soup = BeautifulSoup(r.text, 'html.parser')
    tables = soup.findAll('table')
    for td in tables[40].findAll("td"):
        if td.text == "Short Percent of Float":
            return list(td.next_siblings)[1].text.strip("%")

df['dload'] = df.apply(func_ret_value, axis=1)
Note that axis=1 specifies that you will apply this function row-wise.
You may also consider implementing some error-handling here in the case that your if statement inside your func_ret_value() function is never triggered for a given row.
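A hedged sketch of one way to add that error handling: fall back to NaN when the request fails, the page has fewer tables than expected, or the label never appears (url_pre and url_suf come from the question's context).

import numpy as np
import requests
from bs4 import BeautifulSoup

def func_ret_value(x):
    try:
        r = requests.get(url_pre + x['Symbol'] + url_suf, timeout=30)
        r.raise_for_status()  # raise on HTTP error codes
        tables = BeautifulSoup(r.text, 'html.parser').findAll('table')
        for td in tables[40].findAll('td'):
            if td.text == 'Short Percent of Float':
                return list(td.next_siblings)[1].text.strip('%')
    except (requests.RequestException, IndexError, AttributeError):
        pass  # fall through to the NaN fallback below
    return np.nan  # returned when nothing matched or the page was malformed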
