BeautifulSoup and CSV files - python

I'm looking to pull the table from http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1# and put all the information in a csv file.
I've done this but am having a few issues. The first column of the table contains both the ranking of the player and their name. I want to split these up so that one column just contains the ranking and the other column contains the player name.
Here's the code:
import urllib2
from bs4 import BeautifulSoup
import csv
URL = 'http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1#'
req = urllib2.Request(URL)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
tables = soup.findAll('table')
my_table = tables[0]
with open('out2.csv', 'w') as f:
    csvwriter = csv.writer(f)
    for row in my_table.findAll('tr'):
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        if len(cells) == 16:
            csvwriter.writerow(cells)
Here's the output of a few players:
"1
Novak Djokovic",SRB,5-0,0-0,9,1.8,7,1.4,62%,74%,58%,88%,42%,68%,39%-57%,46%
"2
Roger Federer",SUI,1-1,0-1,9,4.5,2,1.0,59%,68%,54%,84%,46%,67%,37%-49%,33%
"3
Andy Murray",GBR,0-0,0-0,0,0.0,0,0.0,0%,0%,0%,0%,0%,0%,0%-0%,0%
"4
Rafael Nadal",ESP,11-3,2-1,25,1.8,18,1.3,68%,69%,57%,82%,43%,57%,36%-58%,38%
"5
Kei Nishikori",JPN,5-0,0-0,14,2.8,9,1.8,57%,75%,62%,92%,49%,80%,39%-62%,42%
As you can see, the first column isn't displayed properly: the ranking number ends up on a separate line from the rest of the data, followed by an extremely large gap of whitespace before the name.
The HTML code for the problem column is slightly more complex than the rest of the columns:
<td class="col1" rel="1">1
Novak Djokovic</td>
I tried separating it from that but I couldn't get it to work and thought it might be easier to fix the current CSV file.

Separating the field after pulling it out is pretty easy. You've got a number, a bunch of whitespace, and a name. So just use split, with the default delimiter, and a max split of 1:
cells = [c.text.encode('utf-8') for c in row.findAll('td')]
if len(cells) == 16:
    cells[0:1] = cells[0].split(None, 1)
    csvwriter.writerow(cells)
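For example, assuming the first cell's text looks like the output above (the rank, a run of whitespace, then the name):

>>> '1\n                Novak Djokovic'.split(None, 1)
['1', 'Novak Djokovic']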
But you can also separate it from within the soup, and that's probably more robust:
cells = row.find_all('td')
cell0 = cells.pop(0)
rank = next(cell0.children).strip().encode('utf-8')
name = cell0.find('a').text.encode('utf-8')
cells = [rank, name] + [c.text.encode('utf-8') for c in cells]

Since the value you're concerned with contains multiple tabs and the player's name comes directly after the final tab, I'd suggest splitting by tab and collecting the last item from the resulting list.
The line I added is cells[0] = cells[0].split('\t')[-1]
import urllib2
from bs4 import BeautifulSoup
import csv
URL = 'http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1#'
req = urllib2.Request(URL)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
tables = soup.findAll('table')
my_table = tables[0]
with open('out2.csv', 'w') as f:
    csvwriter = csv.writer(f)
    for row in my_table.findAll('tr'):
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        if len(cells) == 16:
            cells[0] = cells[0].split('\t')[-1]
            csvwriter.writerow(cells)
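Note that this keeps only the name and discards the rank; if you want both columns, as in the first answer, take the first and last items of the split. For example, assuming the cell text is tab-separated as described:

>>> '1\t\t\tNovak Djokovic'.split('\t')
['1', '', '', 'Novak Djokovic']
>>> '1\t\t\tNovak Djokovic'.split('\t')[-1]
'Novak Djokovic'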

Related

CSV - (Excel) - Python: seems like wrong writing to CSV from Python

I'm trying to export some data from a website, starting with one single page. I have to import text delimited by these titles:
['Drug name','General Information','Clinical Results','Side Effects','Mechanism of Action','Literature References','Additional Information','Approval Date','Date Created','Company Name']
The URL is https://www.centerwatch.com/directories/1067-fda-approved-drugs/listing/3092-afinitor-everolimus
The code currently works and gives me all the data, but when I write it to the CSV the information is not delimited as I wish.
As it is one single page, the Excel file should have ONE row... but it doesn't.
The code:
from bs4 import BeautifulSoup
import requests
import csv
csv_file = open('Drugs.csv','w')
csv_writer = csv.writer(csv_file, delimiter ='+')
csv_writer.writerow(['Drug name','General Information','Clinical Results','Side Effects','Mechanism of Action','Literature References','Additional Information','Approval Date','Date Created','Company Name'])
link = requests.get('https://www.centerwatch.com/directories/1067-fda-approved-drugs/listing/3092-afinitor-everolimus')
aux =[]
soup = BeautifulSoup(link.content, 'lxml')
drugName = soup.find('div', class_='company-navigation').find('h1').text
gralInfo = soup.find('div', class_='body directory-listing-profile__description')
y = 0
for h2 in gralInfo.find_all('h2'):
    print(y)
    text = ''
    for sibling in h2.find_next_siblings():
        if (sibling.name == 'h2'):
            break
        else:
            text = text + sibling.get_text(separator='\n') + '\n'
    print(text)
    aux.append(text)
    print()
    print()
    y = y + 1
auxi = []
for info in soup.find_all('div', class_='contact directory-listing-profile__master-detail'):
    print(info.text)
    auxi.append(info.text)
csv_writer.writerow([drugName, aux[0], aux[1], aux[2], aux[3], aux[4], aux[5], auxi[0], auxi[1], auxi[2]])
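Two things likely cause the bad delimiting here: the section texts collected in aux contain embedded newlines (text + '\n'), which spreadsheet programs render as extra lines, and delimiter='+' means the file is no longer standard CSV, so Excel won't split it into columns. A minimal sketch of a fix, assuming the goal is one ordinary comma-delimited row per page:

# Sketch: collapse embedded whitespace/newlines in every field and use the
# default ',' delimiter so Excel can parse the file as ordinary CSV.
csv_writer = csv.writer(csv_file)  # no custom delimiter
row = [' '.join(field.split()) for field in [drugName] + aux + auxi]
csv_writer.writerow(row)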

Webscraping issue where dataframe to csv put output into one cell

I am trying to help out our soccer coach who is doing some work on helping underprivileged kids get recruited. I am trying to scrape a "topdrawer" website page so we can track where players get placed. I am not a python expert at all and am banging my head against the wall. I got some help yesterday and tried to implement - see two sets of code below. Neither puts the data into a nice table we can sort and analyze etc. Thanks in advance for any help.
import bs4 as bs
import urllib.request
import pandas as pd
import csv
max_page_num = 14
max_page_dig = 1 # number of digits in the page number
with open('result.csv', "w", newline='') as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")
for i in range(0, max_page_num):
    page_num = (max_page_dig - len(str(i))) * "0" + str(i)  # gives a string in the format of 1, 01 or 001, 005 etc
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    df = pd.read_html(source)
    df = pd.DataFrame(df)
    df.to_csv('results.csv', header=False, index=False, mode='a')  # 'a' should append each table to the csv file, instead of overwriting it.
The second method jumbles the output up into one line with \n separators etc.
import bs4 as bs
import urllib.request
import pandas as pd
import csv
max_page_num = 14
max_page_dig = 1 # number of digits in the page number
with open('result.csv', "w", newline='') as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")
for i in range(0, max_page_num):
    page_num = (max_page_dig - len(str(i))) * "0" + str(i)  # gives a string in the format of 1, 01 or 001, 005 etc
    print(page_num)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    print(source)
    url = urllib.request.urlopen(source).read()
    soup = bs.BeautifulSoup(url, 'lxml')
    table = soup.find('table')
    #table = soup.table
    table_rows = table.find_all('tr')
    with open('result.csv', 'a', newline='') as f:
        for tr in table_rows:
            td = tr.find_all('td')
            row = [i.text for i in td]
            f.write(str(row))
In the first version the data is all placed on one line and not separated.
The second version puts each page into one cell and splits the pages in half.
A page may have many <table> elements in its HTML (tables are sometimes used to create menus or to organize elements on a page). pandas.read_html() creates a DataFrame for every <table> on the page and always returns a list of all the DataFrames it created (even if there was only one <table>), so you have to check which one has your data. You can display every DataFrame from the list to see which one you need. This way I know that the first DataFrame has your data, so you have to use [0] to get it.
import pandas as pd
max_page_num = 15 # it has to be 15 instead of 14 because `range(15)` will give `0-14`
with open('result.csv', 'w', newline='') as f:
    f.write('Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment\n')

for i in range(max_page_num):
    print('page:', i)
    page_num = str(i)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    all_tables = pd.read_html(source)
    df = all_tables[0]
    print('items:', len(df))
    df.to_csv('result.csv', header=False, index=False, mode='a')  # 'a' appends each table to the same file as the header, instead of overwriting it
EDIT:
In the second version you should use strip() to remove \n, which csv would treat as the beginning of a new row.
You shouldn't use str(row), because it creates a string with [ ] which is not correct in a csv file. You should rather use ",".join(row) to create the string, and you have to add \n at the end of every row because write() doesn't add it.
But it is better to use the csv module and its writerow() for this. It will convert the list to a string with , as the separator and add \n automatically. If some item contains , or \n, it will put it in " " to create a correct row.
import bs4 as bs
import urllib.request
import csv
max_page_num = 15
fh = open('result.csv', "w", newline='')
csv_writer = csv.writer(fh)
csv_writer.writerow( ["Name", "Gender", "State", "Position", "Grad", "Club/HS", "Rating", "Commitment"] )
for i in range(max_page_num):
    print('page:', i)
    page_num = str(i)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    url = urllib.request.urlopen(source).read()
    soup = bs.BeautifulSoup(url, 'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')
    for tr in table_rows:
        td = tr.find_all('td')
        #row = [i.text.strip() for i in td] # strip to remove spaces and '\n'
        row = [i.get_text(strip=True) for i in td]  # strip to remove spaces and '\n'
        if row:  # check if row is not empty
            #print(row)
            csv_writer.writerow(row)
fh.close()

Loop in Python script, Only get last results

Why do I only get the stats from the last player in PLAYER_NAME?
I would like to get the stats from all the players in PLAYER_NAME.
import csv
import requests
from bs4 import BeautifulSoup
import urllib
PLAYER_NAME = ["andy-murray/mc10", "rafael-nadal/n409"]
URL_PATTERN = 'http://www.atpworldtour.com/en/players/{}/player-stats?year=0&surfaceType=clay'
for item in zip(PLAYER_NAME):
    url = URL_PATTERN.format(item)
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('div', attrs={'class': 'mega-table-wrapper'})
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = (cell.text.encode("utf-8").strip())
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)
outfile = open("./tennis.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Name", "Stat"])
writer.writerows(list_of_rows)
As mentioned in the comments, you're recreating list_of_rows on every iteration, so only the last player's rows survive. To fix that, move it outside the for loop, and instead of appending each player's rows one list at a time, extend it.
On a side note, you have a few other issues with your code:
zip is redundant here, and it actually ends up converting your names into one-item tuples, which will cause incorrect formatting; you just want to iterate over PLAYER_NAME directly, and while you're at it, maybe rename it to PLAYER_NAMES (since it's a list of names).
When formatting the string you have empty braces, which only work on Python 2.7+/3.1+; it's clearer to specify the position of the argument in format explicitly - in this case {0}.
PLAYER_NAMES = ["andy-murray/mc10", "rafael-nadal/n409"]
URL_PATTERN = 'http://www.atpworldtour.com/en/players/{0}/player-stats?year=0&surfaceType=clay'
list_of_rows = []
for item in PLAYER_NAMES:
    url = URL_PATTERN.format(item)
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('div', attrs={'class': 'mega-table-wrapper'})
    # for row in table.findAll('tr'):
    #     list_of_cells = []
    #     for cell in row.findAll('td'):
    #         text = (cell.text.encode("utf-8").strip())
    #         list_of_cells.append(text)
    #     list_of_rows.extend(list_of_cells) # Change to extend here
    # Incidentally, the for loop above could also be written as:
    list_of_rows += [
        [cell.text.encode("utf-8").strip() for cell in row.findAll('td')]
        for row in table.findAll('tr')
    ]
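After the loop, the collected rows still need to be written out once; a minimal completion, reusing the writer setup from the question:

outfile = open("./tennis.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Name", "Stat"])
writer.writerows(list_of_rows)  # all players' rows, written in one go
outfile.close()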

Parse HTML table data to JSON and save to text file in Python 2.7

I'm trying to extract the data on the crime rate across states from this webpage: http://www.disastercenter.com/crime/uscrime.htm
I am able to get this into a text file, but I would like to get the response in JSON format. How can I do this in Python?
Here is my code:
import urllib
import re
from bs4 import BeautifulSoup
link = "http://www.disastercenter.com/crime/uscrime.htm"
f = urllib.urlopen(link)
myfile = f.read()
soup = BeautifulSoup(myfile)
soup1=soup.find('table', width="100%")
soup3=str(soup1)
result = re.sub("<.*?>", "", soup3)
print(result)
output=open("output.txt","w")
output.write(result)
output.close()
The following code will get the data from the two tables and output all of it as a json formatted string.
Working Example (Python 2.7.9):
from lxml import html
import requests
import re as regular_expression
import json
page = requests.get("http://www.disastercenter.com/crime/uscrime.htm")
tree = html.fromstring(page.text)
tables = [tree.xpath('//table/tbody/tr[2]/td/center/center/font/table/tbody'),
          tree.xpath('//table/tbody/tr[5]/td/center/center/font/table/tbody')]

tabs = []
for table in tables:
    tab = []
    for row in table:
        for col in row:
            var = col.text_content()
            var = var.strip().replace(" ", "")
            var = var.split('\n')
            if regular_expression.match('^\d{4}$', var[0].strip()):
                tab_row = {}
                tab_row["Year"] = var[0].strip()
                tab_row["Population"] = var[1].strip()
                tab_row["Total"] = var[2].strip()
                tab_row["Violent"] = var[3].strip()
                tab_row["Property"] = var[4].strip()
                tab_row["Murder"] = var[5].strip()
                tab_row["Forcible_Rape"] = var[6].strip()
                tab_row["Robbery"] = var[7].strip()
                tab_row["Aggravated_Assault"] = var[8].strip()
                tab_row["Burglary"] = var[9].strip()
                tab_row["Larceny_Theft"] = var[10].strip()
                tab_row["Vehicle_Theft"] = var[11].strip()
                tab.append(tab_row)
    tabs.append(tab)

json_data = json.dumps(tabs)

output = open("output.txt", "w")
output.write(json_data)
output.close()
This might be what you want, if you can use the requests and lxml modules. The data structure presented here is very simple, adjust this to your needs.
First, get a response from your requested URL and parse the result into an HTML tree:
import requests
from lxml import etree
import json
response = requests.get("http://www.disastercenter.com/crime/uscrime.htm")
tree = etree.HTML(response.text)
Assuming you want to extract both tables, create this XPath and unpack the results. totals is "Number of Crimes" and rates is "Rate of Crime per 100,000 People":
xpath = './/table[@width="100%"][@style="background-color: rgb(255, 255, 255);"]//tbody'
totals, rates = tree.findall(xpath)
Extract the raw data (td.find('./') means the first child item, whatever tag it has) and clean the strings ('\xa0' is the non-breaking space these cells are padded with):
raw_data = []
for tbody in totals, rates:
    rows = []
    for tr in tbody.getchildren():
        row = []
        for td in tr.getchildren():
            child = td.find('./')
            if child is not None and child.tag != 'br':
                row.append(child.text.strip('\xa0').strip('\n').strip())
            else:
                row.append('')
        rows.append(row)
    raw_data.append(rows)
Zip together the table headers in the first two rows, then delete the redundant rows, seen as the 11th & 12th steps in slice notation:
data = {}
data['tags'] = [tag0 + tag1 for tag0, tag1 in zip(raw_data[0][0], raw_data[0][1])]
for raw in raw_data:
    del raw[::12]
    del raw[::11]
Store the rest of the raw data and create a JSON file (optional: eliminate whitespace with separators=(',', ':')):
data['totals'], data['rates'] = raw_data[0], raw_data[1]
with open('data.json', 'w') as f:
    json.dump(data, f, separators=(',', ':'))
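A quick way to sanity-check the result is to load the file straight back and inspect it:

import json

with open('data.json') as f:
    data = json.load(f)
print(data['tags'])       # the combined header labels
print(data['totals'][0])  # first row of "Number of Crimes"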

Parsing HTML a convoluted Table w/ BeautifulSoup

I am trying to create a CSV file from NOAA data at http://www.srh.noaa.gov/data/obhistory/PAFA.html.
I tried working with the table tag, but it failed, so I am trying to do it by identifying <tr> on each line.
So this is my code:
#This script should take table context from URL and save new data into a CSV file.
import urllib2
from bs4 import BeautifulSoup

noaa = urllib2.urlopen("http://www.srh.noaa.gov/data/obhistory/PAFA.html").read()
soup = BeautifulSoup(noaa)
#Iterate from lines 7 to 78 and extract the text in each line. I probably would like
#space delimited between each text
#for i in range(7, 78, 1):
rows = soup.findAll('tr')[i]
for tr in rows:
    for n in range(0, 15, 1):
        cols = rows.findAll('td')[n]
        for td in cols[n]:
            print td.find(text=True)....(match.group(0), match.group(2), match.group(3), ... match.group(15)
At the moment some of this works as expected and some doesn't, and I am not sure how to stitch the last part together the way I would like.
OK, so I took what "That1guy" suggested and tried to extend it to the CSV component.
So:
import urllib2 as urllib
from bs4 import BeautifulSoup
from time import localtime, strftime
import csv
url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)
table = soup('table')[3]
table_rows = table.findAll('tr')
row_count = 0
for table_row in table_rows:
    row_count += 1
    if row_count < 4:
        continue
    date = table_row('td')[0].contents[0]
    time = table_row('td')[1].contents[0]
    wind = table_row('td')[2].contents[0]
    print date, time, wind
    with open("/home/eyalak/Documents/weather/weather.csv", "wb") as f:
        writer = csv.writer(f)
        print date, time, wind
        writer.writerow( ('Title 1', 'Title 2', 'Title 3') )
        writer.writerow(str(time)+str(wind)+str(date)+'\n')
    if row_count == 74:
        print "74"
        break
The printed result is fine; it is the file that is not.
I get:
Title 1,Title 2,Title 3
0,5,:,5,3,C,a,l,m,0,8,"
The problems in the CSV file created are:
The title is broken into the wrong columns; column 2 has "1,Title" versus "Title 2".
The data is comma-delimited in the wrong places.
As the script writes new lines, it overwrites the previous one instead of appending at the bottom.
Any thoughts?
This worked for me:
url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)
table = soup('table')[3]
table_rows = table.findAll('tr')
row_count = 0
for table_row in table_rows:
    row_count += 1
    if row_count < 4:
        continue
    date = table_row('td')[0].contents[0]
    time = table_row('td')[1].contents[0]
    wind = table_row('td')[2].contents[0]
    print date, time, wind
    if row_count == 74:
        break
This code obviously only returns the first 3 cells of each row, but you get the idea. Also, note that some cells are empty. In those cases, to make sure they're populated (and avoid an IndexError), I would check the length of each cell before grabbing .contents, i.e.:
if len(table_row('td')[offset]) > 0:
    variable = table_row('td')[offset].contents[0]
This will ensure the cell is populated and you will avoid IndexErrors.
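The CSV problems from the question are a separate issue: the file is reopened in "wb" mode inside the loop (overwriting it each time), and writerow() is handed one concatenated string, which csv then splits into single-character columns. A minimal sketch of a fix, assuming the same three columns (the header names here are placeholders) - open the file once, write the header once, and pass each row as a sequence:

with open("weather.csv", "wb") as f:  # open once, before the loop
    writer = csv.writer(f)
    writer.writerow(('Date', 'Time', 'Wind'))  # header row, written once
    for table_row in table_rows[3:]:  # skip the three header rows, as above
        cells = table_row('td')
        if len(cells) >= 3:  # guard against short rows
            writer.writerow((cells[0].contents[0],
                             cells[1].contents[0],
                             cells[2].contents[0]))

If individual cells can be empty, combine this with the length check shown above before taking .contents[0].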
