Web scraping issue where DataFrame to CSV puts output into one cell - Python

I am trying to help out our soccer coach, who is doing some work on helping underprivileged kids get recruited. I am trying to scrape a "topdrawer" website page so we can track where players get placed. I am not a Python expert at all and am banging my head against the wall. I got some help yesterday and tried to implement it - see the two sets of code below. Neither puts the data into a nice table we can sort and analyze. Thanks in advance for any help.
import bs4 as bs
import urllib.request
import pandas as pd
import csv

max_page_num = 14
max_page_dig = 1  # number of digits in the page number

with open('result.csv', "w", newline='') as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")

for i in range(0, max_page_num):
    page_num = (max_page_dig - len(str(i))) * "0" + str(i)  # gives a string in the format of 1, 01 or 001, 005 etc.
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    df = pd.read_html(source)
    df = pd.DataFrame(df)
    df.to_csv('results.csv', header=False, index=False, mode='a')  # 'a' should append each table to the csv file, instead of overwriting it.
The second method jumbles the output up into one line, with \n separators etc.:
import bs4 as bs
import urllib.request
import pandas as pd
import csv

max_page_num = 14
max_page_dig = 1  # number of digits in the page number

with open('result.csv', "w", newline='') as f:
    f.write("Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment \n")

for i in range(0, max_page_num):
    page_num = (max_page_dig - len(str(i))) * "0" + str(i)  # gives a string in the format of 1, 01 or 001, 005 etc.
    print(page_num)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"
    print(source)
    url = urllib.request.urlopen(source).read()
    soup = bs.BeautifulSoup(url, 'lxml')
    table = soup.find('table')
    #table = soup.table
    table_rows = table.find_all('tr')
    with open('result.csv', 'a', newline='') as f:
        for tr in table_rows:
            td = tr.find_all('td')
            row = [i.text for i in td]
            f.write(str(row))
In the first version the data is all placed on one line and not separated. The second version puts each page into one cell and splits the pages in half.

A page may have many <table> elements in its HTML (tables are sometimes used to create menus or to organize elements on a page), and pandas.read_html() creates a DataFrame for every <table> on the page. It always returns a list with all the created DataFrames (even if there was only one <table>), so you have to check which one has your data. You can display every DataFrame from the list to see which one you need. This way I know that the first DataFrame has your data, and you have to use [0] to get it.
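For instance, a quick way to inspect everything read_html() returns for the first page (a small side sketch, separate from the fixed script below):

import pandas as pd

url = ("https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m"
       "&graduationYear=2020&positionId=0&playerRating=&stateId=All"
       "&pageNo=0&area=commitments")

all_tables = pd.read_html(url)  # one DataFrame per <table> found on the page
print('tables found:', len(all_tables))
for number, table in enumerate(all_tables):
    print(number, table.shape)  # (rows, columns) of every table
    print(table.head())         # preview, to spot the one with the player data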
import pandas as pd

max_page_num = 15  # it has to be 15 instead of 14 because `range(15)` will give `0-14`

with open('result.csv', 'w', newline='') as f:
    f.write('Name, Gender, State, Position, Grad, Club/HS, Rating, Commitment\n')

for i in range(max_page_num):
    print('page:', i)
    page_num = str(i)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"

    all_tables = pd.read_html(source)
    df = all_tables[0]

    print('items:', len(df))

    df.to_csv('result.csv', header=False, index=False, mode='a')  # 'a' appends each table to the same csv file, instead of overwriting it
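A variation on the same idea, as a sketch: collect every page first and write a single CSV at the end, which also makes the data easy to sort in pandas before saving (the column names are an assumption based on the header row in the question):

import pandas as pd

frames = []
for i in range(15):
    url = ("https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m"
           "&graduationYear=2020&positionId=0&playerRating=&stateId=All"
           "&pageNo=" + str(i) + "&area=commitments")
    frames.append(pd.read_html(url)[0])  # first table on each page holds the players

result = pd.concat(frames, ignore_index=True)
result.columns = ['Name', 'Gender', 'State', 'Position', 'Grad',
                  'Club/HS', 'Rating', 'Commitment']  # assumes the table has these eight fields
result.to_csv('result.csv', index=False)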
EDIT:
In the second version you should use strip() to remove \n, which csv would treat as the beginning of a new row.
You shouldn't use str(row) because it creates a string with [ ], which is not correct in a csv file. You should rather use ",".join(row) to create the string, and you have to add \n at the end of every row because write() doesn't add it.
But it is better to use the csv module and its writerow() for this. It converts the list to a string with , as separator and adds \n automatically. If some item contains , or \n, then it puts it in " " to create a correct row.
import bs4 as bs
import urllib.request
import csv

max_page_num = 15

fh = open('result.csv', "w", newline='')
csv_writer = csv.writer(fh)
csv_writer.writerow(["Name", "Gender", "State", "Position", "Grad", "Club/HS", "Rating", "Commitment"])

for i in range(max_page_num):
    print('page:', i)
    page_num = str(i)
    source = "https://www.topdrawersoccer.com/search/?query=&divisionId=&genderId=m&graduationYear=2020&positionId=0&playerRating=&stateId=All&pageNo=" + page_num + "&area=commitments"

    url = urllib.request.urlopen(source).read()
    soup = bs.BeautifulSoup(url, 'lxml')
    table = soup.find('table')
    table_rows = table.find_all('tr')

    for tr in table_rows:
        td = tr.find_all('td')
        #row = [i.text.strip() for i in td]  # strip to remove spaces and '\n'
        row = [i.get_text(strip=True) for i in td]  # strip to remove spaces and '\n'
        if row:  # check if row is not empty
            #print(row)
            csv_writer.writerow(row)

fh.close()

Related

CSV - (Excel) - Python: seems like wrong writing to CSV from Python

I'm trying to export some data from a website, and I first tried it on one single page. I have to import text delimited by these titles:
['Drug name','General Information','Clinical Results','Side Effects','Mechanism of Action','Literature References','Additional Information','Approval Date','Date Created','Company Name']
The url is https://www.centerwatch.com/directories/1067-fda-approved-drugs/listing/3092-afinitor-everolimus
The code currently works and gives me all the data, but when I insert it into the CSV, the information is not delimited as I wish.
As it is one single page, the Excel sheet should have ONE row... but it doesn't.
The code:
from bs4 import BeautifulSoup
import requests
import csv

csv_file = open('Drugs.csv', 'w')
csv_writer = csv.writer(csv_file, delimiter='+')
csv_writer.writerow(['Drug name', 'General Information', 'Clinical Results', 'Side Effects',
                     'Mechanism of Action', 'Literature References', 'Additional Information',
                     'Approval Date', 'Date Created', 'Company Name'])

link = requests.get('https://www.centerwatch.com/directories/1067-fda-approved-drugs/listing/3092-afinitor-everolimus')
aux = []
soup = BeautifulSoup(link.content, 'lxml')

drugName = soup.find('div', class_='company-navigation').find('h1').text
gralInfo = soup.find('div', class_='body directory-listing-profile__description')

y = 0
for h2 in gralInfo.find_all('h2'):
    print(y)
    text = ''
    for sibling in h2.find_next_siblings():
        if (sibling.name == 'h2'):
            break
        else:
            text = text + sibling.get_text(separator='\n') + '\n'
    print(text)
    aux.append(text)
    print()
    print()
    y = y + 1

auxi = []
for info in soup.find_all('div', class_='contact directory-listing-profile__master-detail'):
    print(info.text)
    auxi.append(info.text)

csv_writer.writerow([drugName, aux[0], aux[1], aux[2], aux[3], aux[4], aux[5], auxi[0], auxi[1], auxi[2]])
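A likely culprit is the writer configuration: csv.writer(csv_file, delimiter='+') joins fields with + instead of commas, so spreadsheet software never splits them into columns. A minimal sketch of the fix, keeping the scraping logic above unchanged (it assumes drugName, aux and auxi are filled exactly as in the question):

import csv

# Default comma delimiter; newline='' lets the csv module manage line endings,
# and embedded commas/newlines in the long text fields get quoted automatically.
with open('Drugs.csv', 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['Drug name', 'General Information', 'Clinical Results',
                         'Side Effects', 'Mechanism of Action', 'Literature References',
                         'Additional Information', 'Approval Date', 'Date Created',
                         'Company Name'])
    csv_writer.writerow([drugName] + aux[:6] + auxi[:3])  # one row for the single page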

How to write a string into one cell in csv using Python?

I am trying to extract a review from one page in Zomato using request and Beautiful Soup 4 in Python. I want to store the link of the requested page and the review extracted into one csv file.
My problem is that the review I extracted does not store into one cell but instead it splits into multiple cells. How do I store my extracted review into one cell?
Here is my code:
import time
from bs4 import BeautifulSoup
import requests

URL = "https://www.zomato.com/review/eQEygl"
time.sleep(2)

reviewPage = requests.get(URL, headers={'user-agent': 'my-app/0.0.1'})
reviewSoup = BeautifulSoup(reviewPage.content, "html.parser")
reviewText = reviewSoup.find("div", {"class": "rev-text"})
textSoup = BeautifulSoup(str(reviewText), "html.parser")

reviewElem = [URL, ""]
for string in textSoup.stripped_strings:
    reviewElem[1] += string

csv = open("out.csv", "w", encoding="utf-8")
csv.write("Link, Review\n")
row = reviewElem[0] + "," + reviewElem[1] + "\n"
csv.write(row)
csv.close()
(The actual and expected output were attached as screenshots: the review text ends up split across several cells instead of staying in one.)
I think the problem is the commas embedded in the reviewElem[1] string, because they are the default delimiter in most CSV software. The following avoids the problem by wrapping the contents of the string in " characters to indicate it's all one cell:
import time
from bs4 import BeautifulSoup
import requests

URL = "https://www.zomato.com/review/eQEygl"
time.sleep(2)

reviewPage = requests.get(URL, headers={'user-agent': 'my-app/0.0.1'})
reviewSoup = BeautifulSoup(reviewPage.content, "html.parser")
reviewText = reviewSoup.find("div", {"class": "rev-text"})
textSoup = BeautifulSoup(str(reviewText), "html.parser")

reviewElem = [URL, ""]
for string in textSoup.stripped_strings:
    reviewElem[1] += string

csv = open("out.csv", "w", encoding="utf-8")
csv.write("Link, Review\n")
#row = reviewElem[0] + "," + reviewElem[1] + "\n"
row = reviewElem[0] + ',"{}"\n'.format(reviewElem[1])  # quote the review string
csv.write(row)
csv.close()
There is no need to manually construct a CSV string. When you do it manually and there are column delimiters (, by default) inside the column values, they are interpreted as delimiters rather than literal text, and a single value ends up scattered across multiple columns.
Use the csv module and its .writerow() method:
import csv

# ...

with open("out.csv", "w", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["Link", "Review"])
    writer.writerow(reviewElem)
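As a quick illustration of what writerow() does with an embedded comma (a sketch using an in-memory buffer):

import csv
import io

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["https://www.zomato.com/review/eQEygl",
                 "Great food, friendly staff, will come again"])
print(buffer.getvalue())
# https://www.zomato.com/review/eQEygl,"Great food, friendly staff, will come again"
# The field containing commas is quoted, so it stays in one cell.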

Write data into csv

I am crawling data from Wikipedia and it works so far. I can display it on the terminal, but I can't write it the way I need it into a csv file :-/
The code is pretty long, but I paste it here anyway and hope that somebody can help me.
import csv
import requests
from bs4 import BeautifulSoup

def spider():
    url = 'https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9F-_und_Mittelst%C3%A4dte_in_Deutschland'
    code = requests.get(url).text  # Read source code and make unicode
    soup = BeautifulSoup(code, "lxml")  # create BS object

    table = soup.find(text="Rang").find_parent("table")
    for row in table.find_all("tr")[1:]:
        partial_url = row.find_all('a')[0].attrs['href']
        full_url = "https://de.wikipedia.org" + partial_url
        get_single_item_data(full_url)  # goes into the individual sites

def get_single_item_data(item_url):
    page = requests.get(item_url).text  # Read source code & format with .text to unicode
    soup = BeautifulSoup(page, "lxml")  # create BS object

    def getInfoBoxBasisDaten(s):
        return str(s) == 'Basisdaten' and s.parent.name == 'th'

    basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

    basisdaten_list = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:', 'Einwohner:', 'Bevölkerungsdichte:',
                       'Postleitzahl', 'Vorwahl:', 'Kfz-Kennzeichen:', 'Gemeindeschlüssel:', 'Stadtgliederung:',
                       'Adresse', 'Anschrift', 'Webpräsenz:', 'Website:', 'Bürgermeister', 'Bürgermeisterin',
                       'Oberbürgermeister', 'Oberbürgermeisterin']

    with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:', 'Einwohner:', 'Bevölkerungsdichte:',
                      'Postleitzahl', 'Vorwahl:', 'Kfz-Kennzeichen:', 'Gemeindeschlüssel:', 'Stadtgliederung:',
                      'Adresse', 'Anschrift', 'Webpräsenz:', 'Website:', 'Bürgermeister', 'Bürgermeisterin',
                      'Oberbürgermeister', 'Oberbürgermeisterin']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, delimiter=';', quotechar='|', quoting=csv.QUOTE_MINIMAL, extrasaction='ignore')
        writer.writeheader()
        for i in basisdaten_list:
            wanted = i
            current = basisdaten.parent.parent.nextSibling
            while True:
                if not current.name:
                    current = current.nextSibling
                    continue
                if wanted in current.text:
                    items = current.findAll('td')
                    print(BeautifulSoup.get_text(items[0]))
                    print(BeautifulSoup.get_text(items[1]))
                    writer.writerow({i: BeautifulSoup.get_text(items[1])})
                if '<th ' in str(current): break
                current = current.nextSibling

print(spider())
The output is incorrect in two ways: the cells are not in their right places, and only one city is written; all the others are missing. (The actual and expected output were attached as screenshots.)
'... only one city is written ...': You call get_single_item_data for each city, and inside this function you open the output file with the same name, in the statement with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:, which overwrites the output file each time you call the function.
Each variable is written to a new row: in the statement writer.writerow({i: BeautifulSoup.get_text(items[1])}) you write the value of a single variable to its own row. What you need to do instead is build a dictionary for the values before you start looking for page values. As you accumulate the values from the page, you put them into the dictionary by field name. Then, after you have found all of the values available, you call writer.writerow once, as sketched below.
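Putting both fixes together, a sketch of the restructuring: open the file once, reduce each city page to one dictionary, and call writerow() once per city. The field matching below is a simplified stand-in for the question's sibling-walking loop, so treat it as an outline rather than a drop-in replacement:

import csv
import requests
from bs4 import BeautifulSoup

FIELDNAMES = ['Bundesland', 'Regierungsbezirk:', 'Höhe:', 'Fläche:', 'Einwohner:',
              'Bevölkerungsdichte:', 'Postleitzahl', 'Vorwahl:', 'Kfz-Kennzeichen:',
              'Gemeindeschlüssel:', 'Stadtgliederung:', 'Adresse', 'Anschrift',
              'Webpräsenz:', 'Website:', 'Bürgermeister', 'Bürgermeisterin',
              'Oberbürgermeister', 'Oberbürgermeisterin']

def get_single_item_data(item_url):
    # Scrape one city page and return all found values as a dict.
    soup = BeautifulSoup(requests.get(item_url).text, 'lxml')
    city = {}
    for th in soup.find_all('th'):  # simplified: scan the infobox header cells
        label = th.get_text(strip=True)
        for field in FIELDNAMES:
            if field.rstrip(':') in label:
                td = th.find_next('td')
                if td is not None:
                    city[field] = td.get_text(' ', strip=True)  # accumulate by field name
    return city

def spider():
    url = 'https://de.wikipedia.org/wiki/Liste_der_Gro%C3%9F-_und_Mittelst%C3%A4dte_in_Deutschland'
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    table = soup.find(text='Rang').find_parent('table')
    # Open the output file ONCE so later cities don't overwrite earlier ones.
    with open('staedte.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES, delimiter=';',
                                extrasaction='ignore')
        writer.writeheader()
        for row in table.find_all('tr')[1:]:
            full_url = 'https://de.wikipedia.org' + row.find_all('a')[0].attrs['href']
            writer.writerow(get_single_item_data(full_url))  # one complete row per city

spider()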

BeautifulSoup and CSV files

I'm looking to pull the table from http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1# and put all the information in a csv file.
I've done this but am having a few issues. The first column of the table contains both the ranking of the player and their name. I want to split these up so that one column just contains the ranking and the other column contains the player name.
Here's the code:
import urllib2
from bs4 import BeautifulSoup
import csv

URL = 'http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1#'

req = urllib2.Request(URL)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

tables = soup.findAll('table')
my_table = tables[0]

with open('out2.csv', 'w') as f:
    csvwriter = csv.writer(f)
    for row in my_table.findAll('tr'):
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        if len(cells) == 16:
            csvwriter.writerow(cells)
Here's the output of a few players:
"1
Novak Djokovic",SRB,5-0,0-0,9,1.8,7,1.4,62%,74%,58%,88%,42%,68%,39%-57%,46%
"2
Roger Federer",SUI,1-1,0-1,9,4.5,2,1.0,59%,68%,54%,84%,46%,67%,37%-49%,33%
"3
Andy Murray",GBR,0-0,0-0,0,0.0,0,0.0,0%,0%,0%,0%,0%,0%,0%-0%,0%
"4
Rafael Nadal",ESP,11-3,2-1,25,1.8,18,1.3,68%,69%,57%,82%,43%,57%,36%-58%,38%
"5
Kei Nishikori",JPN,5-0,0-0,14,2.8,9,1.8,57%,75%,62%,92%,49%,80%,39%-62%,42%
As you can see, the first column isn't displayed properly: the number ends up on a higher line than the rest of the data, and there is an extremely large gap.
The HTML code for the problem column is slightly more complex than the rest of the columns:
<td class="col1" rel="1">1
Novak Djokovic</td>
I tried separating it from that but I couldn't get it to work and thought it might be easier to fix the current CSV file.
Separating the field after pulling it out is pretty easy. You've got a number, a bunch of whitespace, and a name. So just use split, with the default delimiter, and a max split of 1:
cells = [c.text.encode('utf-8') for c in row.findAll('td')]
if len(cells) == 16:
    cells[0:1] = cells[0].split(None, 1)
    csvwriter.writerow(cells)
But you can also separate it from within the soup, and that's probably more robust:
cells = row.find_all('td')
cell0 = cells.pop(0)
rank = next(cell0.children).strip().encode('utf-8')
name = cell0.find('a').text.encode('utf-8')
cells = [rank, name] + [c.text.encode('utf-8') for c in cells]
Since the value you're concerned with contains multiple tabs and the player's name comes directly after the final tab, I'd suggest splitting by tab and collecting the last item from the resulting list.
The line I added is cells[0] = cells[0].split('\t')[-1]
import urllib2
from bs4 import BeautifulSoup
import csv

URL = 'http://www.atpworldtour.com/Rankings/Top-Matchfacts.aspx?y=2015&s=1#'

req = urllib2.Request(URL)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)

tables = soup.findAll('table')
my_table = tables[0]

with open('out2.csv', 'w') as f:
    csvwriter = csv.writer(f)
    for row in my_table.findAll('tr'):
        cells = [c.text.encode('utf-8') for c in row.findAll('td')]
        if len(cells) == 16:
            cells[0] = cells[0].split('\t')[-1]
            csvwriter.writerow(cells)

Parsing HTML a convoluted Table w/ BeautifulSoup

I am trying to create a csv file from NOAA data from their page http://www.srh.noaa.gov/data/obhistory/PAFA.html.
I tried working with the table tag, but it failed, so I am trying to do it by identifying <tr> on each line.
So this is my code:
#This script should take table context from URL and save new data into a CSV file.
noaa = urllib2.urlopen("http://www.srh.noaa.gov/data/obhistory/PAFA.html").read()
soup = BeautifulSoup(noaa)

#Iterate from lines 7 to 78 and extract the text in each line. I probably would like
#space delimited between each text
#for i in range(7, 78, 1):
rows = soup.findAll('tr')[i]
for tr in rows:
    for n in range(0, 15, 1):
        cols = rows.findAll('td')[n]
        for td in cols[n]:
            print td.find(text=true)....(match.group(0), match.group(2), match.group(3), ... match.group(15)
At the moment some stuff is working as expected, some is not, and I am not sure how to stitch the last part together the way I would like.
Ok, so I took what "That1guy" suggested and tried to extend it to the CSV component.
So:
import urllib2 as urllib
from bs4 import BeautifulSoup
from time import localtime, strftime
import csv

url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)

table = soup('table')[3]
table_rows = table.findAll('tr')

row_count = 0
for table_row in table_rows:
    row_count += 1
    if row_count < 4:
        continue
    date = table_row('td')[0].contents[0]
    time = table_row('td')[1].contents[0]
    wind = table_row('td')[2].contents[0]
    print date, time, wind
    with open("/home/eyalak/Documents/weather/weather.csv", "wb") as f:
        writer = csv.writer(f)
        print date, time, wind
        writer.writerow( ('Title 1', 'Title 2', 'Title 3') )
        writer.writerow(str(time)+str(wind)+str(date)+'\n')
    if row_count == 74:
        print "74"
        break
The printed result is fine, it is the file that is not.
I get:
Title 1,Title 2,Title 3
0,5,:,5,3,C,a,l,m,0,8,"
The problems in the CSV file created are:
- The title is broken into the wrong columns; column 2 has "1,Title" versus "Title 2".
- The data is comma-delimited in the wrong places.
- As the script writes new lines, it overwrites the previous ones instead of appending at the bottom.
Any thoughts?
This worked for me:
url = 'http://www.srh.noaa.gov/data/obhistory/PAFA.html'
file_pointer = urllib.urlopen(url)
soup = BeautifulSoup(file_pointer)

table = soup('table')[3]
table_rows = table.findAll('tr')

row_count = 0
for table_row in table_rows:
    row_count += 1
    if row_count < 4:
        continue
    date = table_row('td')[0].contents[0]
    time = table_row('td')[1].contents[0]
    wind = table_row('td')[2].contents[0]
    print date, time, wind
    if row_count == 74:
        break
This code obviously only returns the first 3 cells of each row, but you get the idea. Also, note some empty cells. In these cases, to make sure they're populated (and to avoid an IndexError), I would check the length of each cell before grabbing .contents, i.e.:
if len(table_row('td')[offset]) > 0:
    variable = table_row('td')[offset].contents[0]
This will ensure the cell is populated, and you will avoid IndexErrors.
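Applied to the loop above, the guard could look like this (a sketch; the wind column is just the example):

for table_row in table_rows:
    cells = table_row('td')
    # Guard against missing or empty cells before touching .contents
    if len(cells) > 2 and len(cells[2]) > 0:
        wind = cells[2].contents[0]
    else:
        wind = ''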
