Python: Using XPath to get data from a table

I'm trying to get data from the table at the bottom of http://projects.fivethirtyeight.com/election-2016/delegate-targets/.
import requests
from lxml import html
url = "http://projects.fivethirtyeight.com/election-2016/delegate-targets/"
response = requests.get(url)
doc = html.fromstring(response.text)
tables = doc.findall('.//table[@class="delegates desktop"]')
election = tables[0]
election_rows = election.findall('.//tr')
def extractCells(row, isHeader=False):
    if isHeader:
        cells = row.findall('.//th')
    else:
        cells = row.findall('.//td')
    return [val.text_content() for val in cells]
import pandas
def parse_options_data(table):
    rows = table.findall(".//tr")
    header = extractCells(rows[1], isHeader=True)
    data = [extractCells(row, isHeader=False) for row in rows[2:]]
    return pandas.DataFrame(data, columns=header)
election_data = parse_options_data(election)
election_data
I'm having trouble with the topmost row with the candidates' names ('Trump', 'Cruz', 'Kasich'). It is under tr class="top" and right now I only have tr class="bottom" (starting with the row that says "won/target").
Any help is much appreciated!

The candidate names are in the 0-th row:
candidates = [val.text_content() for val in rows[0].findall('.//th')[1:]]
Or, if reusing the same extractCells() function:
candidates = extractCells(rows[0], isHeader=True)[1:]
The [1:] slice skips the first, empty th cell.

Not ideal (it is hard-coded), but it runs the way you want:
def parse_options_data(table):
    rows = table.findall(".//tr")
    candidate = extractCells(rows[0], isHeader=True)[1:]
    header = extractCells(rows[1], isHeader=True)[:3] + candidate
    data = [extractCells(row, isHeader=False) for row in rows[2:]]
    return pandas.DataFrame(data, columns=header)
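For comparison, pandas.read_html can often parse such a table directly; a minimal sketch (assuming the page is still reachable, and noting that the two-row header may still need manual cleanup):
import pandas

url = "http://projects.fivethirtyeight.com/election-2016/delegate-targets/"
# read_html returns one DataFrame per matching <table> on the page
tables = pandas.read_html(url, attrs={"class": "delegates desktop"})
election_data = tables[0]
print(election_data.head())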

Related

ValueError: setting an array element with a sequence. For pandas.concat

I have tried many ways to concatenate a list of DataFrames together but am continuously getting the error message "ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part."
At the moment the list only contains two elements, both of them DataFrames. They do have different columns in places, but I didn't think this would be an issue. At the moment I have:
df_year_stats = pd.concat(yearStats, axis = 0, ignore_index = True).reset_index(drop=True)
I don't think the dataframes have any lists in them but that is the only plausible solution I have thought of so far, if so how would I go about checking for these.
Any help would be greatly appreciated, thank you.
Edit: the full code:
import pandas as pd
from pandas.api.types import is_string_dtype
import requests
from bs4 import BeautifulSoup as bs

course_df = pd.read_csv("dg_course_table.csv")
soup = bs(requests.get('https://www.pgatour.com/stats/categories.ROTT_INQ.html').text, 'html.parser')
tabs = soup.find('div', attrs={'class', 'tabbable-head clearfix hidden-small'})
subStats = tabs.find_all('a')
# creating lists of tab and link, and removing the first and last
tab_links = []
tab_names = []
for subStat in subStats:
    tab_names.append(subStat.text)
    tab_links.append(subStat.get('href'))
tab_names = tab_names[1:-2]  # potentially remove other areas here - points/rankings and streaks
tab_links = tab_links[1:-2]
# creating empty lists
stat_links = []
all_stat_names = []
# looping through each tab and extracting all of the stats URLs, along with the corresponding stat name
for link in tab_links:
    page2 = 'https://www.pgatour.com' + str(link)
    req2 = requests.get(page2)
    soup2 = bs(req2.text, 'html.parser')
    # find the correct part of the html code
    stat = soup2.find('section', attrs={'class', 'module-statistics-off-the-tee clearfix'})
    specificStats = stat.find_all('a')
    for stat in specificStats:
        stat_links.append(stat.get('href'))
        all_stat_names.append(stat.text)
s_asl = pd.Series(stat_links, index=all_stat_names)
s_asl = s_asl.drop(labels='show more')
s_asl = s_asl.str[:-4]
tourn_links = pd.Series([], dtype='str')
df_all_stats = []
req4 = requests.get('https://www.pgatour.com/content/pgatour/stats/stat.120.y2014.html')
soup4 = bs(req4.text, 'html.parser')
stat = soup4.find('select', attrs={'aria-label': 'Available Tournaments'})
htm = stat.find_all('option')
for h in htm:  # finding all tournament codes for the given year
    z = pd.Series([h.get('value')], index=[h.text])
    tourn_links = tourn_links.append(z)
yearStats = []
count = 0
for tournament in tourn_links[0:2]:  # create stat tables for two different golf tournaments
    print(tournament)
    df1 = []
    df_labels = []
    for r in range(0, len(s_asl)):  # loop through all stat links, adding the corresponding stat to that tournament's df
        try:
            link = 'https://www.pgatour.com' + s_asl[r] + 'y2014.eon.' + tournament + '.html'
            web = pd.read_html(requests.get(link).text)
            table = web[1].set_index('PLAYER NAME')
            df1.append(table)
            df_labels.append(s_asl.index[r])
        except:
            print("empty table")
    try:
        df_tourn_stats = pd.concat(df1, keys=df_labels, axis=1)
        df_tourn_stats.reset_index(level=0, inplace=True)
        df_tourn_stats.insert(1, 'Tournament Name', tourn_links.index[count])
        df_tourn_stats.to_csv(str(count) + ".csv")
        df_tourn_stats = df_tourn_stats.loc[:, ~df_tourn_stats.columns.duplicated()].copy()
        yearStats.append(df_tourn_stats)
    except:
        print("NO DATA")
    count = count + 1
# combine the stats of the two different tournaments into one dataframe
df_year_stats = pd.concat(yearStats, axis=0, ignore_index=True).reset_index(drop=True)
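One way to check for list-valued cells (and for duplicate column labels, another common trigger of this ValueError) is a diagnostic pass over yearStats before the concat; a sketch:
import numpy as np

for i, df in enumerate(yearStats):
    # True if any cell holds a list-like value instead of a scalar
    has_lists = df.applymap(lambda x: isinstance(x, (list, tuple, np.ndarray))).any().any()
    # column labels that appear more than once within this frame
    dup_cols = df.columns[df.columns.duplicated()].tolist()
    print(i, "list-like cells:", has_lists, "duplicate columns:", dup_cols)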

Python KeyError in web scraper

I am getting a KeyError: 'title' error in my web scraping program and am not sure what the issue is. When I use inspect element on the webpage, I can see the element that I am trying to find:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.ncaagamesim.com/college-basketball-predictions.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table')
# Get column names
headers = table.find_all('th')
cols = [x.text for x in headers]
# Get all rows in table body
table_rows = table.find_all('tr')
rows = []
# Grab the text of each td, and put into a rows list
for each in table_rows[1:]:
    odd_avail = True
    data = each.find_all('td')
    time = data[0].text.strip()
    try:
        matchup, odds = data[1].text.strip().split('\xa0')
        odd_margin = float(odds.split('by')[-1].strip())
    except:
        matchup = data[1].text.strip()
        odd_margin = '-'
        odd_avail = False
    odd_team_win = data[1].find_all('img')[-1]['title']
    sim_team_win = data[2].find('img')['title']
    sim_margin = float(re.findall(r"\d+\.\d+", data[2].text)[-1])
    if odd_avail == True:
        if odd_team_win == sim_team_win:
            diff = sim_margin - odd_margin
        else:
            diff = -1 * odd_margin - sim_margin
    else:
        diff = '-'
    row = {cols[0]: time, 'Matchup': matchup, 'Odds Winner': odd_team_win, 'Odds': odd_margin,
           'Simulation Winner': sim_team_win, 'Simulation Margin': sim_margin, 'Diff': diff}
    rows.append(row)
df = pd.DataFrame(rows)
print(df.to_string())
# df.to_csv('odds.csv', index=False)
I am getting the error on the line setting sim_team_win. It takes data[2], which is the third column on the website, and finds the img title to get the team name. Is it because the img title is within another div? Also, when running this code, it does not print out the "Odds" column, which is stored in the odd_margin variable. Is there something wrong with how that variable is set? Thanks in advance for the help!
As for not finding the img title: if you look at the row with New Mexico # Dixie State, there is no image in the third column, and no img title in the source either.
For the Odds column: after wrapping the sim_team_win assignment in a try/except, I get all the Odds values in the table.
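A minimal sketch of that guard, assuming rows without a simulation image should simply be skipped inside the existing loop:
for each in table_rows[1:]:
    data = each.find_all('td')
    img = data[2].find('img')
    if img is None:
        continue  # no image in the third column, so there is no 'title' to read
    sim_team_win = img['title']
    # ... rest of the row handling as before ...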

python pandas remove character

I am working on a project, and I need to remove the leftmost and rightmost character of a data result. The data comes from a scrape of Craigslist, and the neighborhood results come back as '(####)', but what I need is ####. I am using pandas and trying to use lstrip and rstrip. When I attempt it inside the Python shell it works, but when I use it on my data it does not.
post_results['neighborhood'] = post_results['neighborhood'].str.lstrip('(')
post_results['neighborhood'] = post_results['neighborhood'].str.rstrip(')')
For some reason the rstrip does work and removes the ')', but the lstrip does not.
The full code is:
from bs4 import BeautifulSoup
import json
from requests import get
import numpy as np
import pandas as pd
import csv
print('hello world')
#get the initial page for the listings, to get the total count
response = get('https://washingtondc.craigslist.org/search/hhh?query=rent&availabilityMode=0&sale_date=all+dates')
html_result = BeautifulSoup(response.text, 'html.parser')
results = html_result.find('div', class_='search-legend')
total = int(results.find('span',class_='totalcount').text)
pages = np.arange(0,total+1,120)
neighborhood = []
bedroom_count =[]
sqft = []
price = []
link = []
for page in pages:
    # print(page)
    response = get('https://washingtondc.craigslist.org/search/hhh?s=' + str(page) + 'query=rent&availabilityMode=0&sale_date=all+dates')
    html_result = BeautifulSoup(response.text, 'html.parser')
    posts = html_result.find_all('li', class_='result-row')
    for post in posts:
        if post.find('span', class_='result-hood') is not None:
            post_url = post.find('a', class_='result-title hdrlnk')
            post_link = post_url['href']
            link.append(post_link)
            post_neighborhood = post.find('span', class_='result-hood').text
            post_price = int(post.find('span', class_='result-price').text.strip().replace('$', ''))
            neighborhood.append(post_neighborhood)
            price.append(post_price)
            if post.find('span', class_='housing') is not None:
                if 'ft2' in post.find('span', class_='housing').text.split()[0]:
                    post_bedroom = np.nan
                    post_footage = post.find('span', class_='housing').text.split()[0][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                elif len(post.find('span', class_='housing').text.split()) > 2:
                    post_bedroom = post.find('span', class_='housing').text.replace("br", "").split()[0]
                    post_footage = post.find('span', class_='housing').text.split()[2][:-3]
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
                elif len(post.find('span', class_='housing').text.split()) == 2:
                    post_bedroom = post.find('span', class_='housing').text.replace("br", "").split()[0]
                    post_footage = np.nan
                    bedroom_count.append(post_bedroom)
                    sqft.append(post_footage)
            else:
                post_bedroom = np.nan
                post_footage = np.nan
                bedroom_count.append(post_bedroom)
                sqft.append(post_footage)
#create results data frame
post_results = pd.DataFrame({'neighborhood':neighborhood,'footage':sqft,'bedroom':bedroom_count,'price':price,'link':link})
#clean up results
post_results = post_results.drop_duplicates(subset='link')  # drop_duplicates is not in-place; assign the result back
post_results['footage'] = post_results['footage'].replace(0,np.nan)
post_results['bedroom'] = post_results['bedroom'].replace(0,np.nan)
post_results['neighborhood'] = post_results['neighborhood'].str.lstrip('(')
post_results['neighborhood'] = post_results['neighborhood'].str.rstrip(')')
post_results = post_results.dropna(subset=['footage','bedroom'],how='all')
post_results.to_csv("rent_clean.csv",index=False)
print(len(post_results.index))
This problem happens when there is whitespace at the front (or end) of the string.
For example:
s = pd.Series([' (xxxx)', '(yyyy) '])
s.str.strip('(|)')
0     (xxxx
1    yyyy)
dtype: object
What we can do is strip twice:
s.str.strip().str.strip('(|)')
0    xxxx
1    yyyy
dtype: object
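An alternative sketch, using a regex to pull out just the text between the parentheses (rows with no match become NaN):
s.str.extract(r'\((.*)\)', expand=False)
0    xxxx
1    yyyy
dtype: object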
From my understanding of your question, you are removing characters from a string. You don't need pandas for this: strings support slicing, so you can remove the first and last character like this:
new_word = old_word[1:-1]
This should work for you. Good luck.
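Note that on a pandas column this slicing needs the .str accessor; combined with a whitespace strip, it would look like:
post_results['neighborhood'] = post_results['neighborhood'].str.strip().str[1:-1]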

AttributeError when webscraping

I received an AttributeError when web scraping, but I am unsure what I am doing wrong. What does an AttributeError mean?
import requests
from bs4 import BeautifulSoup

response_obj = requests.get('https://en.wikipedia.org/wiki/Demographics_of_New_York_City').text
soup = BeautifulSoup(response_obj, 'lxml')
Population_Census_Table = soup.find('table', {'class': 'wikitable sortable'})
# preparation of the table
rows = Population_Census_Table.select("tbody > tr")[3:8]
jurisdiction = []
for row in rows:
    jurisdiction = {}
    tds = row.select('td')
    jurisdiction["jurisdiction"] = tds[0].text.strip()
    jurisdiction["population_census"] = tds[1].text.strip()
    jurisdiction["%_white"] = float(tds[2].text.strip().replace(",", ""))
    jurisdiction["%_black_or_african_amercian"] = float(tds[3].text.strip().replace(",", ""))
    jurisdiction["%_Asian"] = float(tds[4].text.strip().replace(",", ""))
    jurisdiction["%_other"] = float(tds[5].text.strip().replace(",", ""))
    jurisdiction["%_mixed_race"] = float(tds[6].text.strip().replace(",", ""))
    jurisdiction["%_hispanic_latino_of_other_race"] = float(tds[7].text.strip().replace(",", ""))
    jurisdiction["%_catholic"] = float(tds[7].text.strip().replace(",", ""))
    jurisdiction["%_jewish"] = float(tds[8].text.strip().replace(",", ""))
    jurisdiction.append(jurisdiction)
print(jurisdiction)
AttributeError
---> 18 jurisdiction.append(jurisdiction)
AttributeError: 'dict' object has no attribute 'append'
You start with jurisdiction as a list and immediately rebind it to a dict. You then treat it as a dict until the error line, where you try to treat it as a list again. You need another name for the list at the start; possibly you meant jurisdictions (plural) for the list. However, IMO there are two other areas that also definitely need fixing:
find returns the first matching table, but the labels/keys in your dict indicate you want a later table (not the first match)
Your indexing is incorrect for the target table
You want something like:
import requests, re
from bs4 import BeautifulSoup

response_obj = requests.get('https://en.wikipedia.org/wiki/Demographics_of_New_York_City').text
soup = BeautifulSoup(response_obj, 'lxml')
Population_Census_Table = soup.select_one('.wikitable:nth-of-type(5)')  # use a css selector to target the correct table
jurisdictions = []
rows = Population_Census_Table.select("tbody > tr")[3:8]
for row in rows:
    jurisdiction = {}
    tds = row.select('td')
    jurisdiction["jurisdiction"] = tds[0].text.strip()
    jurisdiction["population_census"] = tds[1].text.strip()
    jurisdiction["%_white"] = float(tds[2].text.strip().replace(",", ""))
    jurisdiction["%_black_or_african_amercian"] = float(tds[3].text.strip().replace(",", ""))
    jurisdiction["%_Asian"] = float(tds[4].text.strip().replace(",", ""))
    jurisdiction["%_other"] = float(tds[5].text.strip().replace(",", ""))
    jurisdiction["%_mixed_race"] = float(tds[6].text.strip().replace(",", ""))
    jurisdiction["%_hispanic_latino_of_other_race"] = float(tds[7].text.strip().replace(",", ""))
    jurisdiction["%_catholic"] = float(tds[10].text.strip().replace(",", ""))
    jurisdiction["%_jewish"] = float(tds[12].text.strip().replace(",", ""))
    jurisdictions.append(jurisdiction)
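If a table is the end goal, the list of dicts loads straight into pandas; a usage sketch:
import pandas as pd

df = pd.DataFrame(jurisdictions)  # one row per jurisdiction, keys become columns
print(df)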

LOOP When scraping data

I am trying to scrape data using a loop, and this is the code:
import requests
import json
import pandas as pd

parameters = ['a:1','a:2','a:3','a:4','a:3','a:4','a:5','a:6','a:7','a:8','a:9','a:10']
results = pd.DataFrame()
for item in parameters:
    key, value = item.split(':')
    url = "https://xxxx.000webhostapp.com/getNamesEnc02Motasel2.php?keyword=%s&type=2&limit=%s" % (key, value)
    r = requests.get(url)
    cont = json.loads(r.content)
    temp_df = pd.DataFrame(cont)
    results = results.append(temp_df)
results.to_csv('ScrapeData.csv', index=False)
This method works great, but the problem is that I need the parameters to go up to 'a:1000', and I think there is a better solution to loop from 'a:1' to 'a:1000' instead of duplicating the parameters like in my code.
I really need your help.
You can use a for value in range(start, end) loop, like this:
results = pd.DataFrame()
key = 'a'
# Goes from 1 to 1000 (including both)
for value in range(1, 1001):
    url = f'https://xxxx.000webhostapp.com/getNamesEnc02Motasel2.php?keyword={key}&type=2&limit={value}'
    r = requests.get(url)
    cont = json.loads(r.content)
    temp_df = pd.DataFrame(cont)
    results = results.append(temp_df)
results.to_csv('ScrapeData.csv', index=False)
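One caveat on the pattern above: DataFrame.append is deprecated (and removed in pandas 2.0), so collecting the pieces in a list and concatenating once is the safer equivalent; a sketch:
import json
import requests
import pandas as pd

frames = []
key = 'a'
for value in range(1, 1001):
    url = f'https://xxxx.000webhostapp.com/getNamesEnc02Motasel2.php?keyword={key}&type=2&limit={value}'
    cont = json.loads(requests.get(url).content)
    frames.append(pd.DataFrame(cont))
results = pd.concat(frames, ignore_index=True)  # one concat instead of repeated appends
results.to_csv('ScrapeData.csv', index=False)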
Use a counter:
value = 1
key = 'a'
while value <= 1000:
    url = .....%(key, str(value))
    ....
    ....
    value += 1
......
