Python BeautifulSoup Extracting Data From Header

This is a follow-up from another question. Thanks for the help so far.
I've got some code that loops through a page and creates a dataframe. I'm trying to add a third piece of information, but it is contained within a header, so it just returns blank. The level information is contained in the td and h3 parts of the markup. It returns the error "AttributeError: 'NoneType' object has no attribute 'text'". If I change level.h3.text to level.h3 it will run, but then the data frame contains the full tags instead of just the number.
import urllib.request
import bs4 as bs
import pandas as pd
#import csv as csv
sauce = urllib.request.urlopen('https://us.diablo3.com/en/item/helm/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
item_details = soup.find('tbody')
names = item_details.find_all('div', class_='item-details')
types = item_details.find_all('ul', class_='item-type')
#levels = item_details.find_all('h3', class_='subheader-3')
levels = item_details.find_all('td', class_='column-level align-center')
print(levels)
mytable = []
for name, type, level in zip(names, types, levels):
    mytable.append((name.h3.a.text, type.span.text, level.h3.text))
export = pd.DataFrame(mytable, columns=('Item', 'Type','Level'))

Try to modify your code as below:
for name, type, level in zip(names, types, levels):
    mytable.append((name.h3.a.text, type.span.text, level.h3.text if level.h3 else "No level"))
Now "No level" (you can use "N/A", None or whatever you like the most) will be added as third value in case there is no level (no header)

Creating a list using BeautifulSoup

I want to scrape the IRS prior forms site to gather data for studying data mining. The site presents one big table spread across 101 pages.
Here's the link:
https://apps.irs.gov/app/picklist/list/priorFormPublication.html
My task:
Taking a list of tax form names (ex: "Form W-2", "Form 1095-C"), search the website
and return some informational results. Specifically, you must return the "Product
Number", the "Title", and the maximum and minimum years the form is available for
download. The forms returned should be an exact match for the input (ex: "Form W-2"
should not return "Form W-2 P", etc.) The results should be returned as json.
MY CODE SO FAR:
import requests
import lxml.html as lh
import pandas as pd
from bs4 import BeautifulSoup
url="https://apps.irs.gov/app/picklist/list/priorFormPublication.html"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
print(soup.prettify())
forms_table = soup.find("table", class_= "picklist-dataTable")
forms_table_data = forms_table.find_all("tr") # contains 2 rows
headings = []
for tr in forms_table_data[0].find_all("th"):
    headings.append(tr.b.text.replace('\n', ' ').strip())
print(headings)
THIS IS WHERE I AM GETTING HORRIBLY STUCK:
data = {}
for table, heading in zip(forms_table_data, headings):
    t_headers = []
    for th in table.find_all("th"):
        t_headers.append(th.text.replace('\n', ' ').strip())
    table_data = []
    for tr in table.tbody.find_all("tr"):  # find all tr's from table's tbody
        t_row = {}
        for td, th in zip(tr.find_all("td"), t_headers):
            t_row[th] = td.text.replace('\n', '').strip()
        table_data.append(t_row)
    data[heading] = table_data
print(data)
I also seem to be missing how to incorporate the rest of the pages on the site.
Thanks for your patience!
The easiest way, as mentioned, to get the table into a data frame is read_html(). Be aware that pandas reads all the tables from the page and puts them in a list of data frames; in your case you have to select the one at index [3].
Since your question is not that clear and is hard to read with all those images, you should improve it.
Example (Form W-2)
import pandas as pd
df = pd.read_html('https://apps.irs.gov/app/picklist/list/priorFormPublication.html?resultsPerPage=200&sortColumn=sortOrder&indexOfFirstRow=0&criteria=formNumber&value=Form+W-2&isDescending=false')[3]
Then you can filter and sort the data frame and export it as json.
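Building on the df above, a rough sketch of the exact-match filter and the JSON export the task asks for. The column names 'Product Number', 'Title' and 'Revision Date' are assumptions about how pandas labels the parsed table; check df.columns on your end first.
import json

# keep only exact matches so "Form W-2 P" and similar are excluded
w2 = df[df['Product Number'] == 'Form W-2']

result = {
    'form_number': 'Form W-2',
    'form_title': w2['Title'].iloc[0],
    'min_year': int(w2['Revision Date'].min()),
    'max_year': int(w2['Revision Date'].max()),
}
print(json.dumps(result))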

Getting headers from html (parsing)

The source is https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States. I am looking to use the table called "COVID-19 pandemic in the United States by state and territory" which is the third diagram on the page.
Here is my code so far
from bs4 import BeautifulSoup
import pandas as pd
with open("COVID-19 pandemic in the United States - Wikipedia.htm", "r", encoding="utf-8") as fd:
    soup = BeautifulSoup(fd)
print(soup.prettify())
all_tables = soup.find_all("table")
print("The total number of tables are {} ".format(len(all_tables)))
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
print(type(data_table))
sources = data_table.tbody.findAll('tr', recursive=False)[0]
sources_list = [td for td in sources.findAll('td')]
print(len(sources_list))
data = data_table.tbody.findAll('tr', recursive=False)[1].findAll('td', recursive=False)
data_tables = []
for td in data:
    data_tables.append(td.findAll('table'))
header1 = [th.getText().strip() for th in data_tables[0][0].findAll('thead')[0].findAll('th')]
header1
This last line with header1 is giving me the error "list index out of range". What it is supposed to print is "U.S State or territory.....".
I don't know anything about html, and everything gets me stuck and confused. The soup.find could also be referencing the wrong part of the webpage.
Can you just use
headers = [element.text.strip() for element in data_table.find_all("th")]
To get the text in the headers?
To get the entire table as a pandas dataframe, you can do:
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_file)
data_table = soup.find("div", {"class": 'mw-stack stack-container stack-clear-right mobile-float-reset'})
rows = data_table.find_all("tr")
# Delete first row as it's not part of the table and confuses pandas
# this removes it from both soup and data_table
rows[0].decompose()
# Same for third row
rows[2].decompose()
# Same for last two rows
rows[-1].decompose()
rows[-2].decompose()
# Read html with pandas
df = pd.read_html(str(data_table))[0]
# Keep only the useful columns
df = df[['U.S. state or territory[i].1', 'Cases[ii]', 'Deaths', 'Recov.[iii]', 'Hosp.[iv]']]
# Rename columns
df.columns = ["State", "Cases", "Deaths", "Recov.", "Hosp."]
It's probably easier in these cases to try to read tables with pandas, and go from there:
import pandas as pd
table = soup.select_one("div#covid19-container table")
df = pd.read_html(str(table))[0]
df
The output is the target table.
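Since the original goal was the header text: after read_html the labels are already attached to the data frame, so a quick check is enough (for this Wikipedia table they may come back as a MultiIndex because of the nested header):
print(df.columns.tolist())  # e.g. the "U.S. state or territory ..." labels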
Looking at your code, I think you should select the tag with find, not find_all, when you are after a single title tag.

Combine multiple lists into one organized csv using bs4

I am new to this, I am using it as a learning opportunity, and have only gotten this far thanks to this community's help. I am trying to grab multiple sections from pages like this:
https://m.the-numbers.com/movie/Black-Panther
specifically the summary, starring cast, and supporting cast
I have been successful writing 1 list to csv, but cannot seem to find a way to write multiple. I am looking for a solution that is scalable, where I can keep adding more lists to the export.
Things I have tried: putting them in separate lists such as details and actors, using the same list with details.extend(), etc., and nothing seems to work.
Expected result is producing a table such as:
HEADERS:
title, amount, starName, starCharacter
with the data listed underneath.
ERRORS:
Exception has occurred: AttributeError: 'str' object has no attribute 'keys'
from bs4 import BeautifulSoup
import csv
import re
import requests
# Making get request
r = requests.get('https://m.the-numbers.com/movie/Black-Panther')
# Creating BeautifulSoup object
soup = BeautifulSoup(r.text, 'lxml')
# Localizing table from the BS object
table_soup = soup.find('div', class_='row').find('div', class_='table-responsive').find('table', id='movie_finances')
website = 'https://m.the-numbers.com/'
details = []
# Iterating through all trs in the table except the first(header) and the last two(summary) rows
for tr in table_soup.find_all('tr')[2:4]:
    tds = tr.find_all('td')
    # Creating dict for each row and appending it to the details list
    details.extend({
        'title': tds[0].text.strip(),
        'amount': tds[1].text.strip(),
    })
cast_soup = soup.find('div', id='accordion').find('div', class_='cast_new').find('table', class_='table table-sm')
for tr in cast_soup.find_all('tr')[2:15]:
    tdc = tr.find_all('td')
    # Creating dict for each row and appending it to the details list
    details.append({
        'starName': tdc[0].text.strip(),
        'starCharacter': tdc[1].text.strip(),
    })
# Writing details list of dicts to file using csv.DictWriter
with open('moviesPage2018.csv', 'w', encoding='utf-8', newline='\n') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=details[0].keys())
    writer.writeheader()
    writer.writerows(details)
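A minimal sketch of one way to get past that error and keep the export scalable: the AttributeError comes from details.extend(...), which unpacks the dict into its keys (plain strings), so the first loop should use details.append(...) just like the second one; then give DictWriter a fixed column list so rows that only fill some of the columns still write cleanly. The field names below simply mirror the keys used above.
import csv

fieldnames = ['title', 'amount', 'starName', 'starCharacter']

with open('moviesPage2018.csv', 'w', encoding='utf-8', newline='') as csv_file:
    # restval='' leaves the unused columns blank for rows that only carry some of the keys
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(details)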

Loop through a python dataframe with 10 urls and extract contents from them (BeautifulSoup)

I have a csv called 'df' with one column: a header and 10 urls.
Col
"http://www.cnn.com"
"http://www.fark.com"
etc
etc
This is my ERROR code
import bs4 as bs
import pandas as pd
import urllib2

df_link = pd.read_csv('df.csv')
for link in df_link:
    x = urllib2.urlopen(link[0])
    new = x.read()
    # Code does not even get past here as far as I checked
    soup = bs.BeautifulSoup(new, "lxml")
    for text in soup.find_all('a', href=True):
        text.append((text.get('href')))
I am getting an error which says
ValueError: unknown url type: C
I also get other variations of this error.
The issue is, it is not even getting past
x = urllib2.urlopen(link[0])
On the other hand, this is the WORKING CODE...
url = "http://www.cnn.com"
x = urllib2.urlopen(url)
new = x.read()
soup = bs.BeautifulSoup(new,"lxml")
for link in soup.find_all('a',href = True):
links.append((link.get('href')))
Fixed answer
I didn't realize you were using pandas, so what I said wasn't very helpful.
The way you want to do this using pandas is to iterate over the rows and extract the info from them. The following should work without having to get rid of the header:
import bs4 as bs
import pandas as pd
import urllib2

df_link = pd.read_csv('df.csv')
links = []
for link in df_link.iterrows():
    url = link[1]['Col']
    x = urllib2.urlopen(url)
    new = x.read()
    soup = bs.BeautifulSoup(new, "lxml")
    for text in soup.find_all('a', href=True):
        links.append(text.get('href'))
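If all you need from each row is the URL, a slightly simpler sketch is to iterate the column itself (this assumes the header in df.csv really is Col, as in the sample data):
for url in df_link['Col']:
    x = urllib2.urlopen(url)
    soup = bs.BeautifulSoup(x.read(), "lxml")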
Original misleading answer below
It looks like iterating directly over df_link yields the column names, so in the first iteration link is "Col" and link[0] is just "C", which isn't a valid URL (hence "unknown url type: C").

Parsing html data into python list for manipulation

I am trying to read in html websites and extract their data. For example, I would like to read in the EPS (earnings per share) for the past 5 years of companies. Basically, I can read it in and can use either BeautifulSoup or html2text to create a huge text block. I then want to search the file -- I have been using re.search -- but can't seem to get it to work properly. Here is the line I am trying to access:
EPS (Basic)\n13.4620.6226.6930.1732.81\n\n
So I would like to create a list called EPS = [13.46, 20.62, 26.69, 30.17, 32.81].
Thanks for any help.
from stripogram import html2text
from urllib import urlopen
import re
from BeautifulSoup import BeautifulSoup
ticker_symbol = 'goog'
url = 'http://www.marketwatch.com/investing/stock/'
full_url = url + ticker_symbol + '/financials' #build url
text_soup = BeautifulSoup(urlopen(full_url).read()) #read in
text_parts = text_soup.findAll(text=True)
text = ''.join(text_parts)
eps = re.search("EPS\s+(\d+)", text)
if eps is not None:
    print eps.group(1)
It's not a good practice to use regex for parsing html. Use BeautifulSoup parser: find the cell with rowTitle class and EPS (Basic) text in it, then iterate over next siblings with valueCell class:
from urllib import urlopen
from BeautifulSoup import BeautifulSoup
url = 'http://www.marketwatch.com/investing/stock/goog/financials'
text_soup = BeautifulSoup(urlopen(url).read()) #read in
titles = text_soup.findAll('td', {'class': 'rowTitle'})
for title in titles:
    if 'EPS (Basic)' in title.text:
        print [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
prints:
['13.46', '20.62', '26.69', '30.17', '32.81']
Hope that helps.
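If you want the numeric list from the question (EPS = [13.46, ...]), a small follow-up sketch, assuming the loop above has already found the title cell and the cells hold plain numbers:
values = [td.text for td in title.findNextSiblings(attrs={'class': 'valueCell'}) if td.text]
EPS = [float(v) for v in values]  # [13.46, 20.62, 26.69, 30.17, 32.81]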
I would take a very different approach. We use lxml for scraping html pages.
One of the reasons we switched was that BS was not being maintained for a while, or I should say updated.
In my test I ran the following
import requests
from lxml import html
from collections import OrderedDict
page_as_string = requests.get('http://www.marketwatch.com/investing/stock/goog/financials').content
tree = html.fromstring(page_as_string)
Now I looked at the page and I see the data is divided into two tables. Since you want EPS, I noted that it is in the second table. We could write some code to sort this out programmatically but I will leave that for you.
tables = [ e for e in tree.iter() if e.tag == 'table']
eps_table = tables[-1]
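A sketch of the programmatic selection left as an exercise above: pick the table whose text actually mentions the row we need, instead of assuming it is the last one.
# choose the table that contains the 'EPS (Basic)' row
eps_table = next(t for t in tables if 'EPS (Basic)' in t.text_content())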
now I noticed that the first row has the column headings, so I want to separate all of the rows
table_rows = [ e for e in eps_table.iter() if e.tag == 'tr']
Now let's get the column headings:
column_headings =[ e.text_content() for e in table_rows[0].iter() if e.tag == 'th']
Finally we can map the column headings to the row labels and cell values
my_results = []
for row in table_rows[1:]:
    cell_content = [e.text_content() for e in row.iter() if e.tag == 'td']
    temp_dict = OrderedDict()
    for numb, cell in enumerate(cell_content):
        if numb == 0:
            temp_dict['row_label'] = cell.strip()
        else:
            dict_key = column_headings[numb]
            temp_dict[dict_key] = cell
    my_results.append(temp_dict)
now to access the results
for row_dict in my_results:
    if row_dict['row_label'] == 'EPS (Basic)':
        for key in row_dict:
            print key, ':', row_dict[key]
row_label : EPS (Basic)
2008 : 13.46
2009 : 20.62
2010 : 26.69
2011 : 30.17
2012 : 32.81
5-year trend :
Now there is still more to do, for example I did not test for squareness (number of cells in each row is equal).
Finally I am a novice and I suspect others will advise more direct methods of getting at these elements (xPath or cssselect) but this does work and it gets you everything from the table in a nice structured manner.
I should add that every row from the table is available, they are in the original row order. The first item (which is a dictionary) in the my_results list has the data from the first row, the second item has the data from the second row etc.
When I need a new build of lxml I visit a page maintained by a really nice guy at UC-IRVINE
I hope this helps
from bs4 import BeautifulSoup
import urllib2
import lxml
import pandas as pd
url = 'http://markets.ft.com/research/Markets/Tearsheets/Financials?s=CLLN:LSE&subview=BalanceSheet'
soup = BeautifulSoup(urllib2.urlopen(url).read())
table = soup.find('table', {'data-ajax-content' : 'true'})
data = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    cols = [ele.text.strip() for ele in cells]
    data.append([ele for ele in cols if ele])
df = pd.DataFrame(data)
print df
dictframe = df.to_dict()
print dictframe
The above code will give you a DataFrame from the webpage and then use it to create a Python dictionary.
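As a small follow-up sketch, df.to_dict() with no arguments keys the result by column; the 'records' orientation gives one dict per table row, which is often handier here:
records = df.to_dict('records')  # one dict per row of the table
print records[0]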
