Parsing HTML a convoluted Table w/ BeautifulSoup

Parsing HTML a convoluted Table w/ BeautifulSoup - python

I am trying to create a csv file from NOAA data from their http://www.srh.noaa.gov/data/obhistory/PAFA.html.
I tried working with the table tag,but it failed. So I am trying to do it by identifying <tr> on each line.
So this is my code:
#This script should take table context from URL and save new data into a CSV file.
noaa = urllib2.urlopen("http://www.srh.noaa.gov/data/obhistory/PAFA.html").read()
soup = BeautifulSoup(noaa)
#Iterate from lines 7 to 78 and extract the text in each line. I probably would like
#space delimited between each text
#for i in range(7, 78, 1):
rows = soup.findAll('tr')[i]
for tr in rows:
for n in range(0, 15, 1):
cols = rows.findAll('td')[n]
for td in cols[n]:
print td.find(text=true)....(match.group(0), match.group(2), match.group(3), ...
match.group(15)
At the moment some stuff is working as expected some is not, and the last part I am not sure how to stitch the way I would like it.
Ok so I took what "That1guy " suggested, and tried to extend it to the CSV component.
So:
import urllib2 as urllib
from bs4 import BeautifulSoup
from time import localtime, strftime
import csv
url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)
table = soup('table')[3]
table_rows = table.findAll('tr')
row_count = 0
for table_row in table_rows:
row_count += 1
if row_count < 4:
continue
date = table_row('td')[0].contents[0]
time = table_row('td')[1].contents[0]
wind = table_row('td')[2].contents[0]
print date, time, wind
with open("/home/eyalak/Documents/weather/weather.csv", "wb") as f:
writer = csv.writer(f)
print date, time, wind
writer.writerow( ('Title 1', 'Title 2', 'Title 3') )
writer.writerow(str(time)+str(wind)+str(date)+'\n')
if row_count == 74:
print "74"
break
The printed result is fine, it is the file that is not.
I get:
Title 1,Title 2,Title 3
0,5,:,5,3,C,a,l,m,0,8,"
The problems in the CSV file created are:
The title is broken into the wrong columns;column 2, has "1,Title" versus "title 2"
The data is comma delineated in the wrong places
As The script writes new lines it over writes on the previous one, instead of appending
from the bottom.
Any thoughts?

This worked for me:
url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)
table = soup('table')[3]
table_rows = table.findAll('tr')
row_count = 0
for table_row in table_rows:
row_count += 1
if row_count < 4:
continue
date = table_row('td')[0].contents[0]
time = table_row('td')[1].contents[0]
wind = table_row('td')[2].contents[0]
print date, time, wind
if row_count == 74:
break
This code obviously only returns the first 3 cells of each row, but you get the idea. Also, note some empty cells. In these cases, to make sure they're populated (or else probably receive an IndexError), I would check the length of each row before grabbing .contents. ie:
if len(table_row('td')[offset]) > 0:
variable = table_row('td')[offset].contents[0]
This will ensure the cell is populated and you will avoid IndexErrors

Related

CSV -(excel)- Python. Seems like wrong writing on csv from python

I´m trying to export some data from a website and I first tried on one single page. I´ve to import text delimited by titles:
['Drug name','General Information','Clinical Results','Side Effects','Mechanism of Action','Literature
References','Additional Information','Approval Date','Date Created','Company Name']
The url is https://www.centerwatch.com/directories/1067-fda-approved-drugs/listing/3092-afinitor-everolimus
The code currently works, it gives me all the data. But when I insert it on the CSV , the information is not delimited as I wish.
As it is one single page, the excel should have ONE row... but it doesn´t
The code:
from bs4 import BeautifulSoup
import requests
import csv
csv_file = open('Drugs.csv','w')
csv_writer = csv.writer(csv_file, delimiter ='+')
csv_writer.writerow(['Drug name','General Information','Clinical Results','Side Effects','Mechanism of Action','Literature References','Additional Information','Approval Date','Date Created','Company Name'])
link = requests.get('https://www.centerwatch.com/directories/1067-fda-approved-drugs/listing/3092-afinitor-everolimus')
aux =[]
soup = BeautifulSoup(link.content, 'lxml')
drugName = soup.find('div', class_='company-navigation').find('h1').text
gralInfo = soup.find('div', class_='body directory-listing-profile__description')
y = 0
for h2 in gralInfo.find_all('h2'):
print (y)
text =''
for sibling in h2.find_next_siblings():
if (sibling.name == 'h2'):
break
else:
text = text + sibling.get_text(separator ='\n') + '\n'
print(text)
aux.append(text)
print()
print()
y = y + 1
auxi = []
for info in soup.find_all('div', class_='contact directory-listing-profile__master-detail'):
print(info.text)
auxi.append(info.text)
csv_writer.writerow([drugName, aux[0], aux[1], aux[2], aux[3], aux[4], aux[5], auxi[0], auxi[1], auxi[2]])

How to Export the Multiple pages data to Excel/CSV by python?

currently i have Webscraped the data and printed it but now i want it to export to excel/csvi am new to python need help there are multiple pages that i have scraped now i need to export them to csv/excel.need help
my code below
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs
def scrap_bid_data():
page_no = 1 #initial page number
while True:
print('Hold on creating URL to fetch data...')
URL = 'https://bidplus.gem.gov.in/bidlists?bidlists&page_no=' + str(page_no) #create dynamic URL
print('URL cerated: ' + URL)
scraped_data = requests.get(URL,verify=False) # request to get the data
soup_data = bs(scraped_data.text, 'lxml') #parse the scraped data using lxml
extracted_data = soup_data.find('div',{'id':'pagi_content'}) #find divs which contains required data
if len(extracted_data) == 0: # **if block** which will check the length of extracted_data if it is 0 then quit and stop the further execution of script.
break
else:
for idx in range(len(extracted_data)): # loops through all the divs and extract and print data
if(idx % 2 == 1): #get data from odd indexes only because we have required data on odd indexes
bid_data = extracted_data.contents[idx].text.strip().split('\n')
print('-' * 100)
print(bid_data[0]) #BID number
print(bid_data[5]) #Items
print(bid_data[6]) #Quantitiy Required
print(bid_data[10] + bid_data[12].strip()) #Department name and address
print(bid_data[16]) #Start date
print(bid_data[17]) #End date
print('-' * 100)
page_no +=1 #increments the page number by 1
scrap_bid_data()

Since you have the data elements already, use can write them to a csv in a couple steps.
Create a list of lists, with each list being a single row of data elements
Save the full list to csv using csv.writer.writerows passing in the full list
Here are the code updates:
def scrap_bid_data():
csvlst = [['BID number','Items','Quantity Required','Department name and address','Start date','End date']] # header row # ADD THIS LINE
page_no = 1 #initial page number
while True:
...................
if len(extracted_data) == 0: # **if block** which will check the length of extracted_data if it is 0 then quit and stop the further execution of script.
break
else:
for idx in range(len(extracted_data)): # loops through all the divs and extract and print data
if(idx % 2 == 1): #get data from odd indexes only because we have required data on odd indexes
bid_data = extracted_data.contents[idx].text.strip().split('\n')
.................
csvlst.append([bid_data[0],bid_data[5],bid_data[6],bid_data[10],bid_data[16],bid_data[17]]) # CSV row # ADD THIS LINE
page_no +=1 #increments the page number by 1
import csv # Write CSV # ADD THIS SECTION
with open("out.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(csvlst)
scrap_bid_data()

Webscraping issue where dataframe to csv put output into one cell

I am trying to help out our soccer coach who is doing some work on helping underprivileged kids get recruited. I am trying to scrape a "topdrawer" website page so we can track where players get placed. I am not a python expert at all and am banging my head against the wall. I got some help yesterday and tried to implement - see two sets of code below. Neither puts the data into a nice table we can sort and analyze etc. Thanks in advance for any help.
import bs4 as bs
import urllib.request
import pandas as pd
import csv
max_page_num = 14
max_page_dig = 1 # number of digits in the page number
with open('result.csv',"w", newline='') as f:
f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")
for i in range(0, max_page_num):
page_num = (max_page_dig - len(str(i))) * "0" +str(i) #gives a string in the format of 1, 01 or 001, 005 etc
source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
df = pd.read_html(source)
df = pd.DataFrame(df)
df.to_csv('results.csv', header=False, index=False, mode='a') #'a' should append each table to the csv file, instead of overwriting it.
The second method jumbles the output up into one line with /n separators etc
import bs4 as bs
import urllib.request
import pandas as pd
import csv
max_page_num = 14
max_page_dig = 1 # number of digits in the page number
with open('result.csv',"w", newline='') as f:
f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")
for i in range(0, max_page_num):
page_num = (max_page_dig - len(str(i))) * "0" +str(i) #gives a string in the format of 1, 01 or 001, 005 etc
print(page_num)
source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
print(source)
url = urllib.request.urlopen(source).read()
soup = bs.BeautifulSoup(url,'lxml')
table = soup.find('table')
#table = soup.table
table_rows = table.find_all('tr')
with open('result.csv', 'a', newline='') as f:
for tr in table_rows:
td = tr.find_all('td')
row = [i.text for i in td]
f.write(str(row))
in the first version the data is all place on one line and not separated.
The second version puts each page into one cell and splits the pages in half.

Page may have many <table> in HTML (sometimes used to create menu or to organize elements on page) and pandas.read_html() creates DataFrame for every <table> on page and it always returns list with all created DataFrames (even if there was only one <table>) and you have to check which one has your data. You can display every DataFrame from list to see which one you need. This way I know that first DataFrame has your data and you have to use [0] to get it.
import pandas as pd
max_page_num = 15 # it has to be 15 instead of 14 because `range(15)` will give `0-14`
with open('result.csv', 'w', newline='') as f:
f.write('Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment\n')
for i in range(max_page_num):
print('page:', i)
page_num = str(i)
source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
all_tables = pd.read_html(source)
df = all_tables[0]
print('items:', len(df))
df.to_csv('results.csv', header=False, index=False, mode='a') #'a' should append each table to the csv file, instead of overwriting it.
EDIT:
In second version you should use strip() to remove \n which csv would tread as beginning of new row.
You shouldn't use str(row) because it creates string with [ ] which is not correct in csv file. You should rather use ",".join(row) to create string. And you have to add \n at the end of every row because write() doesn't add it.
But it could be better to use csv module and its writerow() for this. It will convert list to string with , as separtor and add \n automatically. If some item will have , or \n then it will put it in " " to create correct row.
import bs4 as bs
import urllib.request
import csv
max_page_num = 15
fh = open('result.csv', "w", newline='')
csv_writer = csv.writer(fh)
csv_writer.writerow( ["Name", "Gender", "State", "Position", "Grad", "Club/HS", "Rating", "Commitment"] )
for i in range(max_page_num):
print('page:', i)
page_num = str(i)
source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
url = urllib.request.urlopen(source).read()
soup = bs.BeautifulSoup(url, 'lxml')
table = soup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
td = tr.find_all('td')
#row = [i.text.strip() for i in td] # strip to remove spaces and '\n'
row = [i.get_text(strip=True) for i in td] # strip to remove spaces and '\n'
if row: # check if row is not empty
#print(row)
csv_writer.writerow(row)
fh.close()

How to bypass IndexError

Here is my situation: My code parses out data from HTML tables that are within emails. The roadblock I'm running into is that some of these tables have blank empty rows right in the middle of the table, as seen in the photo below. This blank space causes my code to fail (IndexError: list index out of range) since it attempts to extract text from the cells.
Is it possible to say to Python: "ok, if you run into this error that comes from these blank rows, just stop there and take the rows you have acquired text from so far and execute the rest of the code on those"...?
That might sound like a dumb solution to this problem but my project involves me taking data from only the most recent date in the table anyway, which is always amongst the first few rows, and always before these blank empty rows.
So if it is possible to say "if you hit this error, just ignore it and proceed" then I would like to learn how to do that. If it's not then I'll have to figure out another way around this. Thanks for any and all help.
The table with the gap:
My code:
from bs4 import BeautifulSoup, NavigableString, Tag
import pandas as pd
import numpy as np
import os
import re
import email
import cx_Oracle
dsnStr = cx_Oracle.makedsn("sole.nefsc.noaa.gov", "1526", "sole")
con = cx_Oracle.connect(user="user", password="password", dsn=dsnStr)
def celltext(cell):
'''
textlist=[]
for br in cell.findAll('br'):
next = br.nextSibling
if not (next and isinstance(next,NavigableString)):
continue
next2 = next.nextSibling
if next2 and isinstance(next2,Tag) and next2.name == 'br':
text = str(next).strip()
if text:
textlist.append(next)
return (textlist)
'''
textlist=[]
y = cell.find('span')
for a in y.childGenerator():
if isinstance(a, NavigableString):
textlist.append(str(a))
return (textlist)
path = 'Z:\\blub_2'
for filename in os.listdir(path):
file_path = os.path.join(path, filename)
if os.path.isfile(file_path):
html=open(file_path,'r').read()
soup = BeautifulSoup(html, 'lxml') # Parse the HTML as a string
table = soup.find_all('table')[1] # Grab the second table
df_Quota = pd.DataFrame()
for row in table.find_all('tr'):
columns = row.find_all('td')
if columns[0].get_text().strip()!='ID': # skip header
Quota = celltext(columns[1])
Weight = celltext(columns[2])
price = celltext(columns[3])
print(Quota)
Nrows= max([len(Quota),len(Weight),len(price)]) #get the max number of rows
IDList = [columns[0].get_text()] * Nrows
DateList = [columns[4].get_text()] * Nrows
if price[0].strip()=='Package':
price = [columns[3].get_text()] * Nrows
if len(Quota)<len(Weight):#if Quota has less itmes extend with NaN
lstnans= [np.nan]*(len(Weight)-len(Quota))
Quota.extend(lstnans)
if len(price) < len(Quota): #if price column has less items than quota column,
val = [columns[3].get_text()] * (len(Quota)-len(price)) #extend with
price.extend(val) #whatever is in
#price column
#if len(DateList) > len(Quota): #if DateList is longer than Quota,
#print("it's longer than")
#value = [columns[4].get_text()] * (len(DateList)-len(Quota))
#DateList = value * Nrows
if len(Quota) < len(DateList): #if Quota is less than DateList (due to gap),
stu = [np.nan]*(len(DateList)-len(Quota)) #extend with NaN
Quota.extend(stu)
if len(Weight) < len(DateList):
dru = [np.nan]*(len(DateList)-len(Weight))
Weight.extend(dru)
FinalDataframe = pd.DataFrame(
{
'ID':IDList,
'AvailableQuota': Quota,
'LiveWeightPounds': Weight,
'price':price,
'DatePosted':DateList
})
df_Quota = df_Quota.append(FinalDataframe, ignore_index=True)
#df_Quota = df_Quota.loc[df_Quota['DatePosted']=='5/20']
df_Q = df_Quota['DatePosted'].iloc[0]
df_Quota = df_Quota[df_Quota['DatePosted'] == df_Q]
print (df_Quota)
for filename in os.listdir(path):
file_path = os.path.join(path, filename)
if os.path.isfile(file_path):
with open(file_path, 'r') as f:
pattern = re.compile(r'Sent:.*?\b(\d{4})\b')
email = f.read()
dates = pattern.findall(email)
if dates:
print("Date:", ''.join(dates))
#cursor = con.cursor()
#exported_data = [tuple(x) for x in df_Quota.values]
#sql_query = ("INSERT INTO ROUGHTABLE(species, date_posted, stock_id, pounds, money, sector_name, ask)" "VALUES (:1, :2, :3, :4, :5, 'NEFS 2', '1')")
#cursor.executemany(sql_query, exported_data)
#con.commit()
#cursor.close()
#con.close()

continue is the keyword to use for skipping empty/problem rows. IndexError is thanks to the attempt to access columns[0] on an empty columns list. So just skip to next row when there is an exception.
for row in table.find_all('tr'):
columns = row.find_all('td')
try:
if columns[0].get_text().strip()!='ID':
# Rest as above in original code.
except IndexError:
continue

Use try: ... except: ...:
try:
#extract data from table
except IndexError:
#execute rest of program

BeautifulSoup and CSV files

I'm looking to pull the table from http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1# and put all the information in a csv file.
I've done this but am having a few issues. The first column of the table contains both the ranking of the player and their name. I want to split these up so that one column just contains the ranking and the other column contains the player name.
Here's the code:
import urllib2
from bs4 import BeautifulSoup
import csv
URL = 'http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1#'
req = urllib2.Request(URL)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
tables = soup.findAll('table')
my_table = tables[0]
with open('out2.csv', 'w') as f:
csvwriter = csv.writer(f)
for row in my_table.findAll('tr'):
cells = [c.text.encode('utf-8') for c in row.findAll('td')]
if len(cells) == 16:
csvwriter.writerow(cells)
Here's the output of a few players:
"1
Novak Djokovic",SRB,5-0,0-0,9,1.8,7,1.4,62%,74%,58%,88%,42%,68%,39%-57%,46%
"2
Roger Federer",SUI,1-1,0-1,9,4.5,2,1.0,59%,68%,54%,84%,46%,67%,37%-49%,33%
"3
Andy Murray",GBR,0-0,0-0,0,0.0,0,0.0,0%,0%,0%,0%,0%,0%,0%-0%,0%
"4
Rafael Nadal",ESP,11-3,2-1,25,1.8,18,1.3,68%,69%,57%,82%,43%,57%,36%-58%,38%
"5
Kei Nishikori",JPN,5-0,0-0,14,2.8,9,1.8,57%,75%,62%,92%,49%,80%,39%-62%,42%
As you can see the first column isn't displayed properly with the number being on a higher line than the rest of the data as well as the extremely large gap.
The HTML code for the problem column is slightly more complex than the rest of the columns:
<td class="col1" rel="1">1
Novak Djokovic</td>
I tried separating it from that but I couldn't get it to work and thought it might be easier to fix the current CSV file.

Separating the field after pulling it out is pretty easy. You've got a number, a bunch of whitespace, and a name. So just use split, with the default delimiter, and a max split of 1:
cells = [c.text.encode('utf-8') for c in row.findAll('td')]
if len(cells) == 16:
cells[0:1] = cells[0].split(None, 1)
csvwriter.writerow(cells)
But you can also separate it from within the soup, and that's probably more robust:
cells = row.find_all('td')
cell0 = cells.pop(0)
rank = next(cell0.children).strip().encode('utf-8')
name = cell0.find('a').text.encode('utf-8')
cells = [rank, name] + [c.text.encode('utf-8') for c in cells]

Since the value you're concerned with contains multiple tabs and the player's name is directly after the final tab, I'd suggest split by tab and collect the last item from the resulting tuple.
The line I added is cells[0] = cells[0].split('\t')[-1]
import urllib2
from bs4 import BeautifulSoup
import csv
URL = 'http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1#'
req = urllib2.Request(URL)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
tables = soup.findAll('table')
my_table = tables[0]
with open('out2.csv', 'w') as f:
csvwriter = csv.writer(f)
for row in my_table.findAll('tr'):
cells = [c.text.encode('utf-8') for c in row.findAll('td')]
if len(cells) == 16:
cells[0] = cells[0].split('\t')[-1]
csvwriter.writerow(cells)
f.close()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parsing HTML a convoluted Table w/ BeautifulSoup - python

Related

CSV -(excel)- Python. Seems like wrong writing on csv from python

How to Export the Multiple pages data to Excel/CSV by python?

Webscraping issue where dataframe to csv put output into one cell

How to bypass IndexError

BeautifulSoup and CSV files

Categories

Resources