I am trying to read the text data from the URL mentioned in the code, but it throws an error:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 2
url="https://cdn.upgrad.com/UpGrad/temp/d934844e-5182-4b58-b896-4ba2a499aa57/companies.txt"
c=pd.read_csv(url, encoding='utf-8')
It seems pd.read_csv() never split the rows into columns, most likely because the file is tab-delimited rather than comma-separated (the error complains about an unexpected number of fields, not about encoding). One workaround is to download the text with requests and split it manually:
#!/usr/bin/env python3
import sys

import requests
import pandas as pd

url = "https://cdn.upgrad.com/UpGrad/temp/d934844e-5182-4b58-b896-4ba2a499aa57/companies.txt"
r = requests.get(url)
df = None
if r.status_code == 200:
    # Split the response into lines, then split each line on tabs
    rows = r.text.split('\r\n')
    header = rows[0].split('\t')
    data = []
    for n in range(1, len(rows)):
        cols = rows[n].split('\t')
        data.append(cols)
    df = pd.DataFrame(columns=header, data=data)
else:
    print("error: unable to load {}".format(url))
    sys.exit(-1)
print(df.shape)
print(df.head(2))
$ ./test.py
(66369, 10)
permalink name homepage_url category_list status country_code state_code region city founded_at
0 /Organization/-Fame #fame http://livfame.com Media operating IND 16 Mumbai Mumbai
1 /Organization/-Qounter :Qounter http://www.qounter.com Application Platforms|Real Time|Social Network... operating USA DE DE - Other Delaware City 04-09-2014
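Alternatively, since the rows turn out to be tab-separated, it may be enough to tell read_csv about the delimiter directly and skip the manual splitting. A minimal sketch, assuming the file is consistently tab-delimited; if utf-8 raises a decode error, falling back to a more permissive encoding such as 'ISO-8859-1' is a common workaround:
import pandas as pd

url = "https://cdn.upgrad.com/UpGrad/temp/d934844e-5182-4b58-b896-4ba2a499aa57/companies.txt"
# sep='\t' tells pandas the columns are tab-separated instead of comma-separated.
# The encoding here is an assumption; keep utf-8 if it works for this file.
c = pd.read_csv(url, sep='\t', encoding='ISO-8859-1')
print(c.shape)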
Site that will be web scraped: https://api.github.com/repos/angular/angular-cli/issues?state=all&per_page=1&page=1
import requests
import bs4

page_number = 1
number = []
state = []
created_at = []
closed_at = []
while page_number < 100:
    base_url = f'https://api.github.com/repos/angular/angular-cli/issues?state=all&per_page=1&page={page_number}'
    result = requests.get(base_url)
    soup = bs4.BeautifulSoup(result.text, 'lxml')
    test_string = soup.select('p')[0].getText()[2:]
    my_list = test_string.split(',')
    number.append(my_list[8])
    state.append(my_list[29])
    created_at.append(my_list[35])
    closed_at.append(my_list[37])
    page_number = page_number + 1
My data of interest are number, state, created_at and closed_at.
When I print their lists, the results are incorrect, probably because the indices change on other pages and I based the indexing only on the first page.
There is no need to parse with BeautifulSoup; the data comes back in a clean JSON format. Note, however, that the API has a rate limit of 60 requests per hour for unauthenticated calls.
import requests
import pandas as pd

page_number = 1
rows = []
while page_number < 100:
    base_url = f'https://api.github.com/repos/angular/angular-cli/issues?state=all&per_page=100&page={page_number}'
    response = requests.get(base_url)
    if response.status_code != 200:
        print('Response: %s' % response.status_code)
        break
    jsonData = response.json()
    rows += jsonData
    print(page_number)
    page_number = page_number + 1

df = pd.DataFrame(rows)
output = df[['number', 'state', 'created_at', 'closed_at']]
Output sample:
print(output)
number state created_at closed_at
0 23719 open 2022-08-10T09:38:18Z None
1 23718 open 2022-08-10T09:20:48Z None
2 23717 open 2022-08-10T08:56:29Z None
3 23716 open 2022-08-10T08:13:01Z None
4 23715 open 2022-08-10T07:08:05Z None
5 23714 open 2022-08-09T23:36:26Z None
6 23713 open 2022-08-09T22:44:22Z None
7 23712 open 2022-08-09T22:11:42Z None
8 23711 open 2022-08-09T20:22:43Z None
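If the 60-requests-per-hour unauthenticated limit becomes a problem, sending a personal access token with each request raises it considerably. A minimal sketch; the GITHUB_TOKEN environment variable is a placeholder for your own token, not something from the original post:
import os
import requests

# Hypothetical setup: a personal access token stored in an environment variable.
token = os.environ.get('GITHUB_TOKEN')
headers = {'Authorization': f'token {token}'} if token else {}

url = 'https://api.github.com/repos/angular/angular-cli/issues?state=all&per_page=100&page=1'
response = requests.get(url, headers=headers)
# GitHub reports the remaining quota in a response header, which is handy
# for deciding when to pause the crawl.
print(response.headers.get('X-RateLimit-Remaining'))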
I am crawling data from Yahoo Finance. I am using this link:
https://query1.finance.yahoo.com/v7/finance/download/BVH?period1=923729900&period2=1618039708&interval=1d&events=history&includeAdjustedClose=true
import requests
import pandas as pd

def createLink(symbol, table):
    s = "https://query1.finance.yahoo.com/v7/finance/download/BVH?period1=923729900&period2=1618039708&interval=1d&events=history&includeAdjustedClose=true"
    return s.replace("BVH", symbol).replace("history", table)

def getData(symbol, table):
    URL = createLink(symbol, table)
    web = requests.get(URL)
    if web.status_code == 200:
        reader = pd.read_csv(URL)
    else:
        reader = pd.DataFrame({"Data": [], "Dividends": [], "Stock Splits": []})
    return reader

def history(symbol):
    history_close = getData(symbol, 'history')
    if history_close.empty:
        return history_close
    divend = getData(symbol, 'div')
    stock = getData(symbol, 'split')
    x = pd.merge(divend, stock, how="outer", on="Date")
    data = pd.merge(history_close, x, how="outer", on="Date")
    return data

df = pd.read_excel("/content/drive/MyDrive/Colab Notebooks/symbolNYSE.xlsx")
count = 0
count_fail = 0
for i in range(0, len(df["Symbol"])):
    try:
        count += 1
        print(df["Symbol"][i], count)
        a = history(df["Symbol"][i])
        if not a.empty:
            a.to_excel("/content/drive/MyDrive/ColabNotebooks/GetCloseYahoo/" + df["Symbol"][i] + ".xlsx")
    except:
        count_fail += 1
        pass

print("success:", count)
print("fail:", count_fail)
I am using Python, requests, and pandas in Jupyter to crawl it.
The errors:
Error tokenizing data. C error: Expected 2 fields in line 3, saw 12
Key error
At first, I can crawl about 100-200 companies. Then the program starts failing on arbitrary symbols. If I wait a while and run it again, it works without error.
What is the reason? Thank you so much.
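The symptom described (failures after a few hundred requests that disappear after waiting) is consistent with the server throttling the crawler. Two things in the code above are worth noting: pd.read_csv(URL) issues a second request of its own, which can receive a throttled, non-CSV response even when the first requests.get succeeded (hence the tokenizing error), and the fallback DataFrame in getData has a "Data" column rather than "Date", so any symbol that falls back will raise a KeyError in the later merge on "Date". A minimal sketch of one way to slow down and retry; getDataWithRetry, the retry count, and the delay values are illustrative, not from the original post:
import time
from io import StringIO

import requests
import pandas as pd

def getDataWithRetry(symbol, table, retries=3, wait=60):
    # Same URL scheme as createLink() in the question; the symbol and table
    # are substituted into the download link.
    url = ("https://query1.finance.yahoo.com/v7/finance/download/"
           f"{symbol}?period1=923729900&period2=1618039708&interval=1d"
           f"&events={table}&includeAdjustedClose=true")
    for attempt in range(retries):
        web = requests.get(url)
        if web.status_code == 200:
            # Parse the body we already downloaded instead of requesting it again.
            return pd.read_csv(StringIO(web.text))
        # Illustrative back-off: pause before retrying the same symbol.
        time.sleep(wait)
    return pd.DataFrame({"Date": [], "Dividends": [], "Stock Splits": []})
Reading the already-downloaded text also avoids the second request that pd.read_csv(URL) would otherwise make for every symbol.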
I am working on a screen scraper to pull football statistics from www.pro-football-reference.com. I'm currently scraping the main player stats page and then diving into each player's individual page with their statistics by year.
I was able to implement this process successfully with my first set of players (quarterbacks, using the Passing table). However, when I attempted to re-create the process for running back data, I am receiving an additional column in my DataFrame with values like "Unnamed: x_level_0". This is my first experience with HTML data, so I'm not sure what piece I missed; I assumed it would be the same code as for the quarterbacks.
Below is the QB Code sample and the correct dataframe:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
from pandas import DataFrame
import lxml
import re
import csv

p = 1
url = 'https://www.pro-football-reference.com'
year = 2020
maxp = 300

#Passing Data
r = requests.get(url + '/years/' + str(year) + '/passing.htm')
soup = BeautifulSoup(r.content, 'html.parser')
parsed_table = soup.find_all('table')[0]
results = soup.find(id='div_passing')
job_elems = results.find_all('tr')

df = []
LastNameList = []
FirstNameList = []
for i, row in enumerate(parsed_table.find_all('tr')[2:]):
    dat = row.find('td', attrs={'data-stat': 'player'})
    if dat != None:
        name = dat.a.get_text()
        print(name)
        stub = dat.a.get('href')
        #pos = row.find('td', attrs={'data-stat': 'fantasy_pos'}).get_text()
        #print(pos)
        # grab this players stats
        tdf = pd.read_html(url + stub)[1]
        for k, v in tdf.iterrows():
            #Scrape 2020 stats, if no 2020 stats move on
            try:
                FindYear = re.search(".*2020.*", v['Year'])
                if FindYear:
                    #If Year for stats is current year append data to dataframe
                    #get Name data
                    fullName = row.find('td', {'class': 'left'})['csk']
                    findComma = fullName.find(',', 0, len(fullName))
                    lName = fullName[0:findComma]
                    fName = fullName[findComma + 1:len(fullName)]
                    LastNameList.append(lName)
                    FirstNameList.append(fName)
                    #get basic stats
                    df.append(v)
            except:
                pass
This output looks like the following:
Philip Rivers
Year 2020
Age 39
Tm IND
Pos qb
No. 17
G 1
GS 1
Below is the RB Code sample and the incorrect dataframe:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import pandas as pd
from pandas import DataFrame
import lxml
import re
import csv

p = 1
url = 'https://www.pro-football-reference.com'
year = 2020
maxp = 300

#Rushing Data
r = requests.get(url + '/years/' + str(year) + '/rushing.htm')
soup = BeautifulSoup(r.content, 'html.parser')
parsed_table = soup.find_all('table')[0]
results = soup.find(id='div_rushing')
job_elems = results.find_all('tr')

df = []
LastNameList = []
FirstNameList = []
for i, row in enumerate(parsed_table.find_all('tr')[2:]):
    dat = row.find('td', attrs={'data-stat': 'player'})
    if dat != None:
        name = dat.a.get_text()
        print(name)
        stub = dat.a.get('href')
        print(stub)
        #pos = row.find('td', attrs={'data-stat': 'fantasy_pos'}).get_text()
        #print(pos)
        # grab this players stats
        tdf = pd.read_html(url + stub)[1]
        for k, v in tdf.iterrows():
            print(v)
            #Scrape 2020 stats, if no 2020 stats move on
            try:
                FindYear = re.search(".*2020.*", v['Year'])
                print('found 2020')
                if FindYear:
                    #If Year for stats is current year append data to dataframe
                    #get Name data
                    fullName = row.find('td', {'class': 'left'})['csk']
                    findComma = fullName.find(',', 0, len(fullName))
                    lName = fullName[0:findComma]
                    fName = fullName[findComma + 1:len(fullName)]
                    LastNameList.append(lName)
                    FirstNameList.append(fName)
                    #get basic stats
                    df.append(v)
            except:
                pass
This output looks like the following:
Unnamed: 0_level_0 Year 2020
Unnamed: 1_level_0 Age 26
Unnamed: 2_level_0 Tm TEN
Unnamed: 3_level_0 Pos rb
Unnamed: 4_level_0 No. 22
Games G 1
GS 1
Rushing Rush 31
Yds 116
TD 0
An example URL where this data is pulled from is: https://www.pro-football-reference.com/players/J/JacoJo01.htm
And it is pulling from the Rushing & Receiving table. Is there something additional I need to be on the lookout for when it comes to parsing HTML?
I attempted to add index_col=1 to tdf = pd.read_html(url + stub)[1], but that just grouped the two values into one column.
Any input on this would be greatly appreciated. If I can provide any further information, please let me know.
Thank you
You can try this code to parse the passing table for each player (here the players are taken from https://www.pro-football-reference.com/years/2020/passing.htm, but you can pass any player URL to it):
import requests
import pandas as pd
from bs4 import BeautifulSoup

def scrape_player(player_name, player_url, year="2020"):
    out = []
    soup = BeautifulSoup(requests.get(player_url).content, 'html.parser')
    row = soup.select_one('table#passing tr:has(th:contains("{}"))'.format(year))
    if row:
        tds = [player_name] + [t.text for t in row.select('th, td')]
        headers = ['Name'] + [th.text for th in row.find_previous('thead').select('th')]
        out.append(dict(zip(headers, tds)))
    return out

url = 'https://www.pro-football-reference.com/years/2020/passing.htm'
all_data = []
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for player in soup.select('table#passing [data-stat="player"] a'):
    print(player.text)
    for data in scrape_player(player.text, 'https://www.pro-football-reference.com' + player['href']):
        all_data.append(data)

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
The script saves the results to data.csv.
EDIT: To parse the Rushing & Receiving table, you can use this script:
import requests
import pandas as pd
from bs4 import BeautifulSoup, Comment

def scrape_player(player_name, player_url, year="2020"):
    out = []
    soup = BeautifulSoup(requests.get(player_url).content, 'html.parser')
    # The Rushing & Receiving table sits inside an HTML comment, so re-parse that comment.
    soup = BeautifulSoup(soup.select_one('#rushing_and_receiving_link').find_next(text=lambda t: isinstance(t, Comment)), 'html.parser')
    row = soup.select_one('table#rushing_and_receiving tr:has(th:contains("{}"))'.format(year))
    if row:
        tds = [player_name] + [t.text for t in row.select('th, td')]
        headers = ['Name'] + [th.text for th in row.find_previous('thead').select('tr')[-1].select('th')]
        out.append(dict(zip(headers, tds)))
    return out

url = 'https://www.pro-football-reference.com/years/2020/passing.htm'
all_data = []
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for player in soup.select('table#passing [data-stat="player"] a'):
    print(player.text)
    for data in scrape_player(player.text, 'https://www.pro-football-reference.com' + player['href']):
        all_data.append(data)

df = pd.DataFrame(all_data)
df.to_csv('data.csv')
print(df)
Again, the script saves the results to data.csv.
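As for the "Unnamed: x_level_0" values in the original pd.read_html approach: the Rushing & Receiving table has two header rows, so pandas builds a MultiIndex for the columns, and the blank cells in the top header row get placeholder names like "Unnamed: 0_level_0". If you prefer to stay with pd.read_html, one option is to drop that top level so the frame looks like the single-header Passing table. A minimal sketch, assuming (as in the question) that the second table on the player page is Rushing & Receiving; note that some short names such as Yds repeat across groups after flattening:
import pandas as pd

# Example player page from the question; index [1] was the Rushing & Receiving table there.
tdf = pd.read_html('https://www.pro-football-reference.com/players/J/JacoJo01.htm')[1]

# If pandas produced a two-level header, drop the top level
# ("Unnamed: x_level_0", "Games", "Rushing", ...) and keep the real column names.
if isinstance(tdf.columns, pd.MultiIndex):
    tdf.columns = tdf.columns.droplevel(0)

print(tdf.head())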
I am following this tutorial to retrieve data from news sites.
The main function is getDailyNews. It loops over each news source, requests the API, extracts the data, dumps it into a pandas DataFrame, and then exports the result to a CSV file.
But when I run the code, I get an error.
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
from tqdm import tqdm, tqdm_notebook
from functools import reduce

def getSources():
    source_url = 'https://newsapi.org/v1/sources?language=en'
    response = requests.get(source_url).json()
    sources = []
    for source in response['sources']:
        sources.append(source['id'])
    return sources

def mapping():
    d = {}
    response = requests.get('https://newsapi.org/v1/sources?language=en')
    response = response.json()
    for s in response['sources']:
        d[s['id']] = s['category']
    return d

def category(source, m):
    try:
        return m[source]
    except:
        return 'NC'

def getDailyNews():
    sources = getSources()
    key = '96f279e1b7f845669089abc016e915cc'
    url = 'https://newsapi.org/v1/articles?source={0}&sortBy={1}&apiKey={2}'
    responses = []
    for i, source in tqdm_notebook(enumerate(sources), total=len(sources)):
        try:
            u = url.format(source, 'top', key)
        except:
            u = url.format(source, 'latest', key)
        response = requests.get(u)
        r = response.json()
        try:
            for article in r['articles']:
                article['source'] = source
            responses.append(r)
        except:
            print('Rate limit exceeded ... please wait and retry in 6 hours')
            return None
    articles = list(map(lambda r: r['articles'], responses))
    articles = list(reduce(lambda x, y: x + y, articles))
    news = pd.DataFrame(articles)
    news = news.dropna()
    news = news.drop_duplicates()
    news.reset_index(inplace=True, drop=True)
    d = mapping()
    news['category'] = news['source'].map(lambda s: category(s, d))
    news['scraping_date'] = datetime.now()
    try:
        aux = pd.read_csv('./data/news.csv')
        aux = aux.append(news)
        aux = aux.drop_duplicates('url')
        aux.reset_index(inplace=True, drop=True)
        aux.to_csv('./data/news.csv', encoding='utf-8', index=False)
    except:
        news.to_csv('./data/news.csv', index=False, encoding='utf-8')
    print('Done')

if __name__ == '__main__':
    getDailyNews()
Error:
FileNotFoundError: [Errno 2] No such file or directory: './data/news.csv'
I know that I have to give a path name to pd.read_csv, but I don't know which path I should give here.
This error would make sense if there is no data folder in the directory you are executing this program from: pandas will not create the missing ./data directory for you when writing the CSV. There is a similar problem in the post here.
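A minimal fix, assuming you want to keep the relative ./data path from the tutorial, is to create the folder before the script writes to it (a sketch, not part of the original code):
import os

# Ensure the output directory exists before read_csv/to_csv touch it;
# to_csv does not create missing parent directories on its own.
os.makedirs('./data', exist_ok=True)
Placing this at the top of getDailyNews(), or running it once before calling the function, should let both the read and the write succeed on a fresh setup.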
I am using Python 2.7.6 to submit queries to an .aspx web page, pick up the results with BeautifulSoup, and store them in an Excel spreadsheet.
import mechanize
import re
import xlwt
from bs4 import BeautifulSoup
import urllib2

book = xlwt.Workbook(encoding='utf-8', style_compression=0)
sheet = book.add_sheet('Legi', cell_overwrite_ok=True)

for items in ['university student', 'high school student']:
    url = r'http://legistar.council.nyc.gov/Legislation.aspx'
    request = mechanize.Request(url)
    response = mechanize.urlopen(request)
    forms = mechanize.ParseResponse(response, backwards_compat=False)
    form = forms[0]
    response.close()

    form['ctl00$ContentPlaceHolder1$txtSearch'] = items
    submit_page = mechanize.urlopen(form.click())
    soup = BeautifulSoup(submit_page.read())
    aa = soup.find_all(href=re.compile('LegislationDetail'))

    for bb in aa:
        cc = bb.text
        #print cc
        results = []
        results.append(cc)

    for row, legi_no in enumerate(results):
        sheet.write(row, 0, legi_no)

book.save("C:\\legi results.xls")
It finds and picks up the results if I print the variable cc; however, writing to the Excel spreadsheet is not successful because only the first cell is written.
You create the results variable inside the for bb in aa loop.
This means results is reset to [] for each value in aa, so at the end results contains only one element (the last one), which is of course not what you intended.
Move the initialization of results outside that loop and it should work fine, as shown below.
import mechanize
import re
import xlwt
from bs4 import BeautifulSoup
import urllib2

book = xlwt.Workbook(encoding='utf-8', style_compression=0)
sheet = book.add_sheet('Legi', cell_overwrite_ok=True)

for items in ['university student', 'high school student']:
    url = r'http://legistar.council.nyc.gov/Legislation.aspx'
    request = mechanize.Request(url)
    response = mechanize.urlopen(request)
    forms = mechanize.ParseResponse(response, backwards_compat=False)
    form = forms[0]
    response.close()

    form['ctl00$ContentPlaceHolder1$txtSearch'] = items
    submit_page = mechanize.urlopen(form.click())
    soup = BeautifulSoup(submit_page.read())
    aa = soup.find_all(href=re.compile('LegislationDetail'))

    results = []  # Initialize results here !!!
    for bb in aa:
        cc = bb.text
        #print cc
        results.append(cc)

    for row, legi_no in enumerate(results):
        sheet.write(row, 0, legi_no)

book.save("C:\\legi results.xls")