So I have the preceding dataframe, to which I want to add a new column called "dload"; I do this with df["dload"] = np.nan.
I then want to fill in the NaN values with the return value of this function:
def func_ret_value(soup, tables):
    for td in tables[40].findAll("td"):
        if td.text == "Short Percent of Float":
            value = list(td.next_siblings)[1].text.strip("%")
            #print(value)
            return value
To do this I write the following code:
for index in df.index:
    # print(index, row)
    # print(index, df.iloc[index]["Symbol"])
    r = requests.get(url_pre + df.iloc[index]["Symbol"] + url_suf)
    soup = BeautifulSoup(r.text, "html.parser")
    tables = soup.findAll("table")
    #print(row["dload"])
    df.loc[index, "dload"] = func_ret_value(soup, tables)
Is there some iterrows or apply that is a faster way of doing this?
Thank you.
You could use apply(), but I would guess that the most computationally intensive part of your code is the HTTP requests (as @Peter Leimbigler mentioned in his comment). Here is an example with your function:
def func_ret_value(x):
    r = requests.get(url_pre + x['Symbol'] + url_suf)
    soup = BeautifulSoup(r.text, 'html.parser')
    tables = soup.findAll('table')
    for td in tables[40].findAll("td"):
        if td.text == "Short Percent of Float":
            return list(td.next_siblings)[1].text.strip("%")
df['dload'] = df.apply(func_ret_value, axis=1)
Note that axis=1 specifies that you will apply this function row-wise.
You may also want to add some error handling for the case where the if statement inside func_ret_value() is never triggered for a given row.
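If you do add that, here is a minimal sketch of one way to handle it (returning np.nan when the label is missing or the request fails; url_pre, url_suf, the "Short Percent of Float" label, and the table index come from the question, the rest is an assumption on my part):

import numpy as np

def func_ret_value(x):
    try:
        r = requests.get(url_pre + x['Symbol'] + url_suf, timeout=10)
        r.raise_for_status()
    except requests.RequestException:
        return np.nan  # network/HTTP failure: leave the cell as NaN
    soup = BeautifulSoup(r.text, 'html.parser')
    tables = soup.findAll('table')
    if len(tables) <= 40:
        return np.nan  # page layout changed: the expected table isn't there
    for td in tables[40].findAll('td'):
        if td.text == "Short Percent of Float":
            return list(td.next_siblings)[1].text.strip("%")
    return np.nan  # explicit fallback when the label is never found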
I am trying to scrape a table of data off a website with mouse-over color-changing events; on each row, only the first two columns have a class:
Additionally, when I do try to scrape those first two columns, I get output like this:
This is the code I am running.
lists = soup.find_all('table', class_="report")
for list in lists:
    date = list.find_all('td', class_="time")
    flag = list.find_all('td', class_="flag")
    info = [date, flag]
    print(info)
I was expecting to receive only the numerical values so that I can export them and work with them.
I tried to use the .replace() function but it didn't remove anything.
I was unable to use .text even after converting date and flag to strings.
Notes
It might be better not to use list as a variable name, since it already means something in Python.
Also, I can't test any of the suggestions below without access to your HTML. It would help if you included how you fetched the HTML that you parse with BeautifulSoup. (Did you use some form of requests? Something like selenium? Or do you just have the HTML file?) With requests, or even selenium, the HTML you fetch is sometimes not what you expected, so that's another possible issue.
The Reason behind the Issue
You can't apply .text/.get_text() to the ResultSets (lists) that .find_all and .select return. You can apply .text/.get_text() to Tags like the ones returned by .find or .select_one (but you should first check that None was not returned, which is what happens when nothing is found).
So date[0].get_text() might have returned something, but since you probably want all the dates and flags, that's not the solution.
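As a small, hypothetical illustration (not code from the question):

cells = soup.find_all('td', class_="time")   # ResultSet: has no .get_text()
texts = [c.get_text() for c in cells]        # call .get_text() on each Tag instead
first = soup.find('td', class_="time")       # a single Tag, or None if nothing matched
if first is not None:
    print(first.get_text())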
Solution Part 1 Option A: Getting the Rows with .find... chain
Instead of iterating through tables directly, you need to get the rows first (tr tags) before trying to get at the cells (td tags); if you have only one table with class= "report", you could just do something like:
rows = soup.find('table', class_= "report").find_all('tr', {'onmouseover': True})
But it's risky to chain multiple .find...s like that, because an error will be raised if any of them return None before reaching the last one.
Solution Part 1 Option B: Getting the Rows More Safely with .find...
It would be safer to do something like
table = soup.find('table', class_= "report")
rows = table.find_all('tr', {'onmouseover': True}) if table else []
or, if there might be more than one table with class= "report",
rows = []
for table in soup.find_all('table', class_="report"):
    rows += table.find_all('tr', {'onmouseover': True})
Solution Part 1 Option C: Getting the Rows with .select
However, I think the most convenient way is to use .select with CSS selectors
# if there might be other tables with report AND other classes that you don't want:
# rows = soup.select('table[class="report"] tr[onmouseover]')
rows = soup.select('table.report tr[onmouseover]')
This method is only unsuitable if there might be more than one table with class="report" but you only want rows from the first one; in that case, you might prefer the table.find_... if table else [] approach.
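If you do hit that first-table-only case, a select-based sketch of the same safe pattern could be:

table = soup.select_one('table.report')                  # first matching table, or None
rows = table.select('tr[onmouseover]') if table else []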
Solution Part 2 Option A: Iterating Over rows to Print Cell Contents
Once you have rows, you can iterate over them to print the date and flag cell contents:
for r in rows:
    date, flag = r.find('td', class_="time"), r.find('td', class_="flag")
    info = [i.get_text() if i else i for i in [date, flag]]
    if any([i is not None for i in info]): print(info)
    # [ only prints if there are any non-null values ]
Solution Part 2 Option B: Iterating Over rows with a Function
Btw, if you're going to be extracting multiple tag attributes/texts repeatedly, you might find my selectForList function useful - it could have been used like
for r in rows:
    info = selectForList(r, ['td.time', 'td.flag'], printList=True)
    if any([i is not None for i in info]): print(info)
or, to get a list of dictionaries like [{'time': time_1, 'flag': flag_1}, {'time': time_2, 'flag': flag_2}, ...],
infoList = [selectForList(r, {
    'time': 'td.time', 'flag': 'td.flag',  ## add selectors for any info you want
}) for r in soup.select('table.report tr[onmouseover]')]
infoList = [d for d in infoList if [v for v in d.values() if v is not None]]
Added EDIT:
To get all displayed cell contents:
for r in rows:
    info = [(td.get_text(), ''.join(td.get('style', '').split())) for td in r.select('td')]
    info = [txt.strip() for txt, styl in info if 'display:none' not in styl]
    # info = [i if i else 0 for i in info]  # fill blanks with zeroes instead of ''
    if any(info): print(info)  ## or write to CSV
I'm trying to filter data with Pandas using a list of values, each a pair of a str book_title and an int book_price:
import pandas as pd
import requests
from bs4 import BeautifulSoup
# settings_#############################################################################
isbn = {'9782756002484', '9782756025117', '9782756072449'}
url = 'https://www.abebooks.fr/servlet/SearchResults?sts=t&cm_sp=SearchF-_-NullResults-_-Results&isbn={}'
book_title = ["Mondes", "X-Wing"]
book_price = [100, 10]
#######################################################################################
### create the links from the ISBN codes
def url_isbn(isbn):
    merged = []
    for link in isbn:
        link_isbn = url.format(link)
        merged.append(link_isbn)
    return merged
### scraping each url from url_isbn
def get_data():
    data = []
    for i in url_isbn(isbn):
        r = requests.get(i)
        soup = BeautifulSoup(r.text, 'html.parser')
        item = soup.find_all('div', {'class': 'result-data col-xs-9 cf'})
        for x in item:
            title = x.find('h2', {'class': 'title'}).text.replace('\n', '')
            price = x.find('p', {'class': 'item-price'}).text.replace('EUR ', '').replace(',', '.')
            url = 'https://www.abebooks.fr' + x.find('a', {'itemprop': 'url'})['href']
            products = title, int(float(price)), url
            data.append(products)
    return data
###creating the dataframe
df = pd.DataFrame(get_data(), columns=["Titre", "Prix", "URL"])
###Filter data into the dataframe
for filtered in df:
    df_final_to_email = filtered[(df['Titre'].str.contains(book_title) & (df.Prix < book_price))]
    print(df_final_to_email)
I'm getting an error: TypeError: unhashable type: 'list'.
I assume I cannot use a list for filtering because of the mix of data types; I tested with a tuple and a dict and got the same kind of error.
I also tried df.query, but it gives an empty dataframe.
The filter should let me keep all the books that have "Mondes" in the title with a price < 100, but also all the books that contain "X-Wing" with a price < 10; I'll also add more items to search for later, each with its own price limit.
Titre                          Prix
Mondes Infernaux               95,10
Star Wars, Mondes Infernaux    75,50
X-Wing Rogue                   9,50
X-Wing Rogue Squadron          7,50
Nothing to do with the filtering, but do you know how I could improve the following: products = title, int(float(price)), url? I had to go through float because int(price) fails on the scraped price strings, and I'm a bit annoyed at ending up with rounded-down numbers in the dataframe. (If any moderator thinks this specific need deserves a separate post, please tell me, thank you.)
Thank you for your kind help
The error resides in your filtering code:
df_final_to_email = filtered[(df['Titre'].str.contains(book_title) & (df.Prix < book_price))]
book_title is a list, and .str.contains does not work with a list; it works with a single string or a regex pattern.
If your intention is to find books with "Mondes" in the title and price under 100 or "X-Wing" in the title and price under 10, you can use the following filtering code:
###Filter data into the dataframe
cond = pd.Series([False] * len(df), index=df.index)
for title, price in zip(book_title, book_price):
    cond |= df["Titre"].str.contains(title) & df["Prix"].lt(price)
print(df[cond])
How it works:
We start by selecting no rows: cond = <all False>.
For each (title, price) pair, we evaluate each row to see whether it meets the criteria. A row only needs to match one (title, price) condition from the list, so we use the in-place "or" operator (|=) to update our selection mask.
The |= operator is equivalent to:
cond = cond | (df["Titre"].str.contains(title) & df["Prix"].lt(price))
If you need all matching rows in the dataframe there's no need to use a for loop.
Maybe try something like this:
def find_book(title_str, price):
    # (parameter renamed so it doesn't shadow the built-in str)
    return df[(df['Titre'].str.contains(title_str)) & (df['Prix'] < price)]

# find all books containing the substring 'Wing' in the title with price < 7
find_book('Wing', 7)
I believe this piece of code filters the dataframe as you want (not tested):
df_final_to_email = pd.concat([df.loc[df["Titre"].str.contains(t) & df["Prix"].lt(p)]
for t,p in zip(book_title, book_price)])
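One small caveat (my note, not part of the original suggestion): if a book could match more than one (title, price) pair, the concatenated result may contain duplicate rows, so dropping them is a reasonable safeguard:

df_final_to_email = pd.concat([df.loc[df["Titre"].str.contains(t) & df["Prix"].lt(p)]
                               for t, p in zip(book_title, book_price)]).drop_duplicates()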
I'm working on an AI project, and one of the steps is to get ~5,000 articles from an online outlet.
I'm a beginner programmer, so please be kind. I've found a site that is very easy to scrape in terms of URL structure - I just need a scraper that can take an entire article from the site (we will be analyzing the articles in bulk, with AI).
The div containing the article text is the same for each piece across the entire site - "col-md-12 description-content-wrap".
Does anyone know of a simple Python script that would go through a .CSV of URLs, pull the text from the above-listed div of each article, and output it as plain text? I've found a few solutions, but none are 100% what I need.
Ideally all of the 5,000 articles would be outputted in one file, but if they need to each be separate, that's fine too. Thanks in advance!
I did something a little bit similar to this about a week ago. Here is the code that I came up with.
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd  # needed for pd.merge below
from pandas import DataFrame
resp = urllib.request.urlopen("https://www.cnbc.com/finance/")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))
substring = 'https://www.cnbc.com/'
df = ['review']
for link in soup.find_all('a', href=True):
    #print(link['href'])
    if (link['href'].find(substring) == 0):
        # append
        df.append(link['href'])
        #print(link['href'])
#list(df)
# convert list to data frame
df = DataFrame(df)
#type(df)
#list(df)
# add column name
df.columns = ['review']
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
df['sentiment'] = df['review'].apply(lambda x: sid.polarity_scores(x))
def convert(x):
    if x < 0:
        return "negative"
    elif x > .2:
        return "positive"
    else:
        return "neutral"
df['result'] = df['sentiment'].apply(lambda x:convert(x['compound']))
df['result']
df_final = pd.merge(df['review'], df['result'], left_index=True, right_index=True)
df_final
df_final.to_csv('C:\\Users\\ryans\\OneDrive\\Desktop\\out.csv')
Result: (screenshot of the resulting dataframe not reproduced here)
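For the original ask - looping through a .CSV of URLs and pulling the text of the "col-md-12 description-content-wrap" div into one plain-text file - a minimal sketch could look like the following; the file names urls.csv and articles.txt, and the assumption that the CSV holds one URL per line with no header, are mine:

import csv
import requests
from bs4 import BeautifulSoup

# read the list of article URLs (assumed: one URL per line, no header)
with open('urls.csv', newline='') as f:
    urls = [row[0] for row in csv.reader(f) if row]

# fetch each article and append the div's text to a single output file
with open('articles.txt', 'w', encoding='utf-8') as out:
    for u in urls:
        resp = requests.get(u, timeout=10)
        soup = BeautifulSoup(resp.text, 'html.parser')
        div = soup.select_one('div.col-md-12.description-content-wrap')
        if div is not None:
            out.write(div.get_text(separator='\n').strip() + '\n\n')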
How do i get the resulting url: https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm
...from this page ...
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001633917&owner=exclude&count=40
... by specifying date = '2018-04-25' and Filing type 8-K? Do I loop through, or is there a one-liner that will get me the result?
from bs4 import BeautifulSoup
from bs4.element import Comment
import requests
date='2018-04-25'
CIK='1633917'
url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=' + CIK + '&owner=exclude&count=100'
r = requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')
a=soup.find('table', class_='tableFile2').findAll('tr')
for i in a:
    print(i)
There is no one-liner to get what you want. You'll have to loop through the rows and check whether the values match.
But, there is a slightly better approach which narrows down the rows. You can directly select the rows which match one of the values. For example, you can select all the rows which have date = '2018-04-25' and then check if the Filing matches.
Code:
for date in soup.find_all('td', text='2018-04-25'):
    row = date.find_parent('tr')
    if row.td.text == '8-K':
        link = row.a['href']
        print(link)
Output:
/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm
So, here, instead of looping over all the rows, you simply loop over the rows having the date you want. In this case, there is only one such row, and hence we loop only once.
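One extra note (not part of the original answer): the href in the table is site-relative, as the output above shows. If you want the full https://www.sec.gov/... URL from the question, you can join it with the host:

from urllib.parse import urljoin

full_link = urljoin('https://www.sec.gov', link)
print(full_link)
# https://www.sec.gov/Archives/edgar/data/1633917/000163391718000094/0001633917-18-000094-index.htm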
The set up
I want to add a new column that contains a URL that has a base/template form and should have certain values interpolated into it based on the information contained in the row.
Table
What I would LOVE to be able to do
base_link = "https://www.vectorbase.org/Glossina_fuscipes/Location/View?r=%(scaffold)s:%(start)s-%(end)s"
# simplify getting column data from data_frame
start = operator.attrgetter('start')
end = operator.attrgetter('end')
scaffold = operator.attrgetter('seqname')
def get_links_to_genome_browser(data_frame):
    base_links = pd.Series([base_link] * len(data_frame.index))
    links = base_links % {"scaffold": scaffold(data_frame), "start": start(data_frame), "end": end(data_frame)}
    return links
So I am answering my own question: I finally figured it out, and I want to close this out and record the solution.
The solution is to use data_frame.apply() but to change my indexing syntax in the get_links_to_genome_browser function to Series syntax rather than DataFrame indexing syntax.
def get_links_to_genome_browser(series):
    # index into the row Series by label (.ix is deprecated/removed in newer pandas)
    link = base_link % {"scaffold": series['seqname'], "start": series['start'], "end": series['end']}
    return link
Then call it like:
df.apply(get_links_to_genome_browser, axis=1)
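If you want the result stored on the dataframe, assign it back to a column (the column name url here is just an assumption):

df['url'] = df.apply(get_links_to_genome_browser, axis=1)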
I think I get what you're asking; let me know if this isn't it.
base_link = "https://www.vectorbase.org/Glossina_fuscipes/Location/View?r=%(scaffold)s:%(start)s-%(end)s"
then you can do something like this
data_frame['url'] = base_link + data_frame['start'] + data_frame['end'] + etc...
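As written, that last line is shorthand; a concrete vectorized version (using the column names from the question and casting to str so the concatenation works) might look like:

data_frame['url'] = (
    "https://www.vectorbase.org/Glossina_fuscipes/Location/View?r="
    + data_frame['seqname'].astype(str) + ":"
    + data_frame['start'].astype(str) + "-"
    + data_frame['end'].astype(str)
)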