I have some Python code which scrapes the game logs of NBA players for a given season (for instance: the data here) into a CSV file. I'm using Beautiful Soup. I am aware that there is an option to just get a CSV version by clicking a link on the website, but I am adding something to each line, so I feel like scraping line by line is the easiest option. The goal is to eventually write code that does this for every season of every player.
The code looks like this:
import urllib
from bs4 import BeautifulSoup

def getData(url):
    html = urllib.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    file = open('/Users/Mika/Desktop/a_players.csv', 'a')
    for table in soup.find_all("pre", class_=""):
        dataline = table.getText()
        player_id = player_season_url[47:-14]  # slice the player id out of the URL
        file.write(player_id + ',' + dataline + '\n')
    file.close()

player_season_url = "https://www.basketball-reference.com/players/a/abdelal01/gamelog/1991/"
getData(player_season_url)
The problem is this: as you can see from inspecting the page's elements, some cells in the table have empty values.
<td class="right " data-stat="fg3_pct"></td>
(this is an example of a cell with a value ("1") in it that is properly scraped):
<th scope="row" class="right " data-stat="ranker" csk="1">1</th>
When scraping, the rows come out uneven, skipping over the empty values to create a CSV file with the values out of place. Is there a way to ensure that those empty values get replaced with " " in the CSV file?
For writing CSV files, Python has built-in support (the csv module). For grabbing the whole table from the page you could use a script like this:
import requests
from bs4 import BeautifulSoup
import csv
import re

def getData(url):
    soup = BeautifulSoup(requests.get(url).text, 'lxml')
    # pull the player id (e.g. 'abdelal01') out of the URL
    player_id = re.findall(r'(?:/[^/]/)(.*?)(?:/gamelog)', url)[0]
    with open('%s.csv' % player_id, 'w') as f:
        csvwriter = csv.writer(f, delimiter=',', quotechar='"')
        # one list of cell texts per table row; rows without <td> cells (headers) are skipped
        d = [[td.text for td in tr.find_all('td')]
             for tr in soup.find('div', id='all_pgl_basic').find_all('tr')
             if tr.find_all('td')]
        for row in d:
            csvwriter.writerow([player_id] + row)

player_season_url = "https://www.basketball-reference.com/players/a/abdelal01/gamelog/1991/"
getData(player_season_url)
The output is in the CSV file (screenshot of the file opened in LibreOffice not reproduced here).
Edit:
- the player_id is now extracted from the URL
- the file is saved as {player_id}.csv
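A note on why this approach also fixes the empty-cell problem from the question: .text on an empty <td> returns an empty string rather than skipping the cell, so every row keeps the same number of fields. A minimal check, using hypothetical markup mirroring the question's example:

from bs4 import BeautifulSoup

# Hypothetical row with one empty cell, like the fg3_pct example in the question
row = BeautifulSoup('<tr><td>1</td><td class="right " data-stat="fg3_pct"></td><td>5</td></tr>', 'html.parser')
print([td.text for td in row.find_all('td')])  # ['1', '', '5'] -- the empty cell stays in place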
I am trying to scrape the prices from a website and it's working, but... I can't write the result to a text file.
This is my Python code:
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.futbin.com/stc/cheapest"
r = requests.get(url)
soup = bs(r.content, "html.parser")
price = soup.find("div", {"class": "d-flex row col-md-9 px-0"})

name = "example"
f = open(name + '.txt', "a")
f.write(price.text)
This is not working, but if I print it instead of trying to write it to a text file, it works. I have searched for a long time but don't understand it. I think it must be a string to write to a text file, but I don't know how to change the output to a string.
You're getting the error due to a Unicode character.
Try adding the encoding='utf-8' argument when opening the file.
Also, your code gives a bit messy output. Try this instead:
import requests
from bs4 import BeautifulSoup as bs

url = "https://www.futbin.com/stc/cheapest"
r = requests.get(url)
soup = bs(r.content, "html.parser")
rows = soup.find("div", {"class": "d-flex row col-md-9 px-0"})
prices = rows.findAll("span", {"class": "price-holder-row"})
names = rows.findAll("div", {"class": "name-holder"})

price_list = []
name_list = []

for price in prices:
    price_list.append(price.text.strip("\n "))

for name in names:
    name_list.append(name.text.split()[0])

name = "example"
with open(f"{name}.txt", mode='w', encoding='utf-8') as f:
    for name, price in zip(name_list, price_list):
        f.write(f"{name}:{price}\n")
I'm very new to Python web scraping, and I want to extract text from a website and export it to a CSV file, but I ran into a problem when checking the CSV file.
When I run this code (with print):
import requests
from bs4 import BeautifulSoup
import csv

URL = "https://intanseafood.com/demersal-fish"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')

quotes = []
table = soup.find('div', attrs={'id': 'archive-product'})
for row in table.findAll('div', attrs={'class': 'product-h2'}):
    quote = {}
    quote['product'] = print(row.get_text())
    quotes.append(quote)
Results:
Fish Goldband Snapper Natural Cut
Fish Grouper Portion
Fish Ruby Snaper Natural Cut
Fish Croaker
Fish Grouper WGGS
Fish Pinjalo Snapper Natural Cut
Fish Parrotfish WGGS
Fish Snapper One Cut
But when I change it to this code (exporting to CSV):
import requests
from bs4 import BeautifulSoup
import csv

URL = "https://intanseafood.com/demersal-fish"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')

quotes = []
table = soup.find('div', attrs={'id': 'archive-product'})
for row in table.findAll('div', attrs={'class': 'product-h2'}):
    quote = {}
    quote['product'] = row.get_text()
    quotes.append(quote)

filename = 'demersal.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['product'])
    w.writeheader()
    for quote in quotes:
        w.writerow(quote)
The CSV file is created, but there is nothing inside except the header. Could anybody kindly help me resolve this? Thanks in advance.
There is a lot of whitespace in your first output, which means there are tabs/spaces/newlines in the string. A little digging showed it was a newline and tabs. Remove them, for example:
text = row.get_text()
quote['product'] = text.replace("\t", "").replace("\n", "")
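A slightly more general cleanup (not from the original answer) is to let str.split() collapse every run of whitespace, or to pass strip=True to get_text(), which trims whitespace around each text fragment:

# Collapses tabs, newlines, and repeated spaces into single spaces
quote['product'] = " ".join(row.get_text().split())
# or trim whitespace around each text fragment as BeautifulSoup extracts it
quote['product'] = row.get_text(strip=True)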
I have a CSV file with 45k+ rows, each containing a different path on the same domain (the pages are structurally identical to each other), and every single one is reachable. I managed to use BeautifulSoup to scrape the title and content of each one, and through the print function I was able to validate the scraper. However, when I try to export the information gathered to a new CSV file, I only get the last URL's street name and description, not all of them as I expected.
from bs4 import BeautifulSoup
import requests
import csv

with open('URLs.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        site = requests.get(row['addresses']).text
        soup = BeautifulSoup(site, 'lxml')
        StreetName = soup.find('div', class_='hist-title').text
        Description = soup.find('div', class_='hist-content').text

with open('OutputList.csv', 'w', newline='') as output:
    Header = ['StreetName', 'Description']
    writer = csv.DictWriter(output, fieldnames=Header)
    writer.writeheader()
    writer.writerow({'StreetName': StreetName, 'Description': Description})
How can the output CSV have on each row the street name and description for the respective URL row in the input CSV file?
You need to open both files on the same level and then read and write on each iteration. Something like this:
from bs4 import BeautifulSoup
import requests
import csv

with open('URLs.csv') as a, open('OutputList.csv', 'w', newline='') as b:
    reader = csv.reader(a)
    writer = csv.writer(b, quoting=csv.QUOTE_ALL)
    next(reader, None)  # skip the 'addresses' header row in the input file
    writer.writerow(['StreetName', 'Description'])

    # Assuming the URL is the first field in the CSV
    for url, *_ in reader:
        r = requests.get(url)
        if r.ok:
            soup = BeautifulSoup(r.text, 'lxml')
            street_name = soup.find('div', class_='hist-title').text.strip()
            description = soup.find('div', class_='hist-content').text.strip()
            writer.writerow([street_name, description])
I hope it helps.
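One design note that is not in the original answer: for 45k+ requests to the same domain, a requests.Session reuses the underlying connection, which can speed the run up considerably, and a try/except around each fetch keeps one bad URL from aborting everything. A sketch under those assumptions (the URLs are hypothetical):

import requests

session = requests.Session()  # one Session reuses connections across requests

urls = ["https://example.com/street-1", "https://example.com/street-2"]  # hypothetical
for url in urls:
    try:
        r = session.get(url, timeout=10)
        r.raise_for_status()
    except requests.RequestException:
        continue  # skip unreachable rows instead of crashing the whole run
    # ... parse r.text and write the row, as in the answer above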
I have a list of names and I am trying to parse the whole table content into one row of a CSV with XPath. For some names, if there is less content, my webdriver crashes and the program stops, so I decided to parse the table with pandas instead. I did my research on parsing a table with pandas into a CSV file, but I don't know how to implement it.
Here is the link to the table I am trying to parse into a row of the CSV:
DLLC , ACT , OREGON , 11-25-2015 , 11-25-2017 , PPB , PRINCIPAL PLACE OF BUSINESS , 22325 SW MURPHY ST,BEAVERTON , OR and so on.
Every data field from that table should look like this in Excel, one field per cell. I don't want any header, just the table data in a row.
Now I have a list of names in a CSV, something like this:
HALF MOON BEND FARM, LLC
NICELY GROWN LLC
COPR INCORPORATED
so on......
Here is the code:
from selenium import webdriver
from bs4 import BeautifulSoup
import lxml
import time
import csv

driver = webdriver.Chrome()
driver.get("url")
#time.sleep(5)
username = driver.find_element_by_name("p_name")
#time.sleep(1)
username.send_keys("xxxxxxx")
#username.clear()
driver.find_element_by_xpath("html/body/form/table[6]/tbody/tr/td[2]/input").click()
entity = driver.find_element_by_partial_link_text("xxxxxxx")
entity.click()
html = driver.page_source

Registry_nbr = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[1]").text
Entity_type = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[2]").text
Entity_status = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[3]").text
Registry_date = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[6]").text
#Next_renewal_date = driver.find_element_by_xpath("html/body/form/table[2]/tbody/tr[2]/td[6]").text
entity_name = driver.find_element_by_xpath("html/body/form/table[3]/tbody/tr/td[2]").text
Ttest = driver.find_element_by_xpath("html/body/form/table[32]/tbody/tr/td[2]").text

with open("sos.csv", "w") as scoreFile:
    scoreFileWriter = csv.writer(scoreFile)
    scoreFileWriter.writerow([Registry_nbr, Entity_type, Entity_status, Registry_date, entity_name])

soup = BeautifulSoup(html)
for tag in soup.find_all('table'):
    print tag.text
Use this after entity.click():
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
words = soup.find_all("td")

word = list()
for cell in words:
    word.append(cell.text.encode('utf-8'))

with open('name.csv', 'w') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=',')
    spamwriter.writerow(word)
Hope this will help.
Once you have the HTML you can parse it using BeautifulSoup and find the table you want. Looking at the HTML page you reference, I do not see any class ids or identifying keys to search for, so just indexing into table[2] will have to do.
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

NBSP = u'\xa0'
tables = [[map(lambda d: d.text.replace(NBSP, u''), r.findAll('td'))
           for r in t.findAll('tr')]
          for t in soup.findAll('table')]

business_entity_data = tables[2]
keys = business_entity_data[0]

with open('page.csv', 'wb') as csvfile:
    csvwriter = csv.DictWriter(csvfile, keys)
    csvwriter.writeheader()
    csvwriter.writerow(dict(zip(keys, business_entity_data[1])))
You should end up with a file containing:
Registry Nbr,Entity Type,Entity Status,Jurisdiction,Registry Date,Next Renewal Date,Renewal Due?
1164570-94,DLLC,ACT,OREGON,11-25-2015,11-25-2017,
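Since the question specifically asked about pandas: pandas.read_html can replace most of the manual cell extraction. A minimal sketch (assumes pandas and lxml are installed; driver is the Selenium driver from the question's script, and the table index 2 is carried over from the answer above):

import pandas as pd

html = driver.page_source           # page source from the question's Selenium session
tables = pd.read_html(html)         # one DataFrame per <table> on the page
business_entity_data = tables[2]    # same table index as in the answer above
# header=False, index=False writes only the data cells, as the question asked
business_entity_data.to_csv('page.csv', header=False, index=False)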
I am trying to extract a list of all the golf courses in the USA through this link. I need to extract the name of the golf course, the address, and the phone number. My script is supposed to extract all the data from the website, but it looks like it only prints one row in my CSV file. I noticed that when I print the "name" field it only prints once, despite the find_all function. All I need is the data, not just one field from multiple links on the website.
How do I go about fixing my script so that it prints all the needed data into a CSV file?
Here is my script:
import csv
import requests
from bs4 import BeautifulSoup

courses_list = []
for i in range(1):
    url = "http://www.thegolfcourses.net/page/1?ls&location=California&orderby=title&radius=6750#038;location=California&orderby=title&radius=6750"  #.format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    g_data2 = soup.find_all("div", {"class": "list"})

    for item in g_data2:
        try:
            name = item.contents[7].find_all("a", {"class": "entry-title"})[0].text
            print name
        except:
            name = ''
        try:
            phone = item.contents[7].find_all("p", {"class": "listing-phone"})[0].text
        except:
            phone = ''
        try:
            address = item.contents[7].find_all("p", {"class": "listing-address"})[0].text
        except:
            address = ''
        course = [name, phone, address]
        courses_list.append(course)

with open('PGN_Final.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
Here is a neater implementation for your code. You can use the urllib2 library instead of requests, and BeautifulSoup works much the same:
import csv
import urllib2
from BeautifulSoup import *

url = "http://www.thegolfcourses.net/page/1?ls&location=California&orderby=title&radius=6750#038;location=California&orderby=title&radius=6750"  #.format(i)
r = urllib2.urlopen(url).read()
soup = BeautifulSoup(r)

courses_list = []
courses_list.append(("Course name", "Phone Number", "Address"))

names = soup.findAll('h2', attrs={'class': 'entry-title'})
phones = soup.findAll('p', attrs={'class': 'listing-phone'})
address = soup.findAll('p', attrs={'class': 'listing-address'})
for na, ph, add in zip(names, phones, address):
    courses_list.append((na.text, ph.text, add.text))

with open('PGN_Final.csv', 'a') as file:
    writer = csv.writer(file)
    for row in courses_list:
        writer.writerow([s.encode("utf-8") for s in row])
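For anyone on Python 3, where urllib2 and the old BeautifulSoup package no longer exist, a rough equivalent with requests and bs4 might look like this (same selectors as above, untested against the live site):

import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.thegolfcourses.net/page/1?ls&location=California&orderby=title&radius=6750"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

names = soup.find_all('h2', attrs={'class': 'entry-title'})
phones = soup.find_all('p', attrs={'class': 'listing-phone'})
addresses = soup.find_all('p', attrs={'class': 'listing-address'})

with open('PGN_Final.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["Course name", "Phone Number", "Address"])
    for na, ph, add in zip(names, phones, addresses):
        writer.writerow([na.text.strip(), ph.text.strip(), add.text.strip()])

Note that zip pairs the three lists positionally, so a listing with a missing phone or address will shift the columns; the per-item try/except in the question's loop is the safer pattern when fields can be absent.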