AttributeError when webscraping - python

I received an AttributeError when web scraping, but I am unsure what I am doing wrong. What does AttributeError mean?
import requests
from bs4 import BeautifulSoup

response_obj = requests.get('https://en.wikipedia.org/wiki/Demographics_of_New_York_City').text
soup = BeautifulSoup(response_obj,'lxml')
Population_Census_Table = soup.find('table', {'class':'wikitable sortable'})

# preparation of the table
rows = Population_Census_Table.select("tbody > tr")[3:8]
jurisdiction = []
for row in rows:
    jurisdiction = {}
    tds = row.select('td')
    jurisdiction["jurisdiction"] = tds[0].text.strip()
    jurisdiction["population_census"] = tds[1].text.strip()
    jurisdiction["%_white"] = float(tds[2].text.strip().replace(",",""))
    jurisdiction["%_black_or_african_amercian"] = float(tds[3].text.strip().replace(",",""))
    jurisdiction["%_Asian"] = float(tds[4].text.strip().replace(",",""))
    jurisdiction["%_other"] = float(tds[5].text.strip().replace(",",""))
    jurisdiction["%_mixed_race"] = float(tds[6].text.strip().replace(",",""))
    jurisdiction["%_hispanic_latino_of_other_race"] = float(tds[7].text.strip().replace(",",""))
    jurisdiction["%_catholic"] = float(tds[7].text.strip().replace(",",""))
    jurisdiction["%_jewish"] = float(tds[8].text.strip().replace(",",""))
    jurisdiction.append(jurisdiction)

print(jurisdiction)
AttributeError
---> 18 jurisdiction.append(jurisdiction)
AttributeError: 'dict' object has no attribute 'append'

You start with jurisdiction as a list and then immediately rebind the same name to a dict. You then treat it as a dict until the error line, where you try to treat it as a list again. You need a different name for the list at the start; you probably meant jurisdictions (plural) for the list. However, IMO there are two other areas that also definitely need fixing:
find returns a single table, and the labels/keys in your dict indicate you want a later table (not the first match)
Your indexing is incorrect for the target table
You want something like:
import requests, re
from bs4 import BeautifulSoup

response_obj = requests.get('https://en.wikipedia.org/wiki/Demographics_of_New_York_City').text
soup = BeautifulSoup(response_obj,'lxml')
Population_Census_Table = soup.select_one('.wikitable:nth-of-type(5)') #use css selector to target correct table.

jurisdictions = []
rows = Population_Census_Table.select("tbody > tr")[3:8]
for row in rows:
    jurisdiction = {}
    tds = row.select('td')
    jurisdiction["jurisdiction"] = tds[0].text.strip()
    jurisdiction["population_census"] = tds[1].text.strip()
    jurisdiction["%_white"] = float(tds[2].text.strip().replace(",",""))
    jurisdiction["%_black_or_african_amercian"] = float(tds[3].text.strip().replace(",",""))
    jurisdiction["%_Asian"] = float(tds[4].text.strip().replace(",",""))
    jurisdiction["%_other"] = float(tds[5].text.strip().replace(",",""))
    jurisdiction["%_mixed_race"] = float(tds[6].text.strip().replace(",",""))
    jurisdiction["%_hispanic_latino_of_other_race"] = float(tds[7].text.strip().replace(",",""))
    jurisdiction["%_catholic"] = float(tds[10].text.strip().replace(",",""))
    jurisdiction["%_jewish"] = float(tds[12].text.strip().replace(",",""))
    jurisdictions.append(jurisdiction)
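If you want to eyeball the result, a quick check along these lines should do (just a sketch; pandas is only needed for the tabular view):
import pandas as pd

for j in jurisdictions:
    print(j)

# or view the list of dicts as a table
print(pd.DataFrame(jurisdictions))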


ValueError: setting an array element with a sequence. For pandas.concat

I have tried many ways to concatenate a list of DataFrames together but am continuously getting the error message "ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part."
The list currently contains only two elements, both of them DataFrames. They do have different columns in places, but I didn't think this would be an issue. At the moment I have:
df_year_stats = pd.concat(yearStats, axis = 0, ignore_index = True).reset_index(drop=True)
I don't think the DataFrames have any lists in them, but that is the only plausible cause I have thought of so far; if so, how would I go about checking for them?
Any help would be greatly appreciated, thank you.
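One way to check for that, sketched here on the assumption that yearStats is the list of DataFrames being concatenated, is to scan every cell for list/array values:
import numpy as np

# sketch: flag any cell that holds a list, tuple or ndarray in each DataFrame of yearStats
for i, frame in enumerate(yearStats):
    mask = frame.applymap(lambda x: isinstance(x, (list, tuple, np.ndarray)))
    if mask.any().any():
        print(f"DataFrame {i} has sequence-valued cells in columns:", list(mask.any()[mask.any()].index))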
Edit (code):
import pandas as pd
from pandas.api.types import is_string_dtype
import requests
from bs4 import BeautifulSoup as bs

course_df = pd.read_csv("dg_course_table.csv")
soup = bs(requests.get('https://www.pgatour.com/stats/categories.ROTT_INQ.html').text, 'html.parser')
tabs = soup.find('div',attrs={'class','tabbable-head clearfix hidden-small'})
subStats = tabs.find_all('a')

# creating lists of tab and link, and removing the first and last
tab_links = []
tab_names = []
for subStat in subStats:
    tab_names.append(subStat.text)
    tab_links.append(subStat.get('href'))
tab_names = tab_names[1:-2] #potentially remove other areas here- points/rankings and streaks
tab_links = tab_links[1:-2]

# creating empty lists
stat_links = []
all_stat_names = []

# looping through each tab and extracting all of the stats URLs, along with the corresponding stat name
for link in tab_links:
    page2 = 'https://www.pgatour.com' + str(link)
    req2 = requests.get(page2)
    soup2 = bs(req2.text, 'html.parser')
    # find correct part of html code
    stat = soup2.find('section',attrs={'class','module-statistics-off-the-tee clearfix'})
    specificStats = stat.find_all('a')
    for stat in specificStats:
        stat_links.append(stat.get('href'))
        all_stat_names.append(stat.text)

s_asl = pd.Series(stat_links, index=all_stat_names)
s_asl = s_asl.drop(labels='show more')
s_asl = s_asl.str[:-4]

tourn_links = pd.Series([], dtype=('str'))
df_all_stats = []
req4 = requests.get('https://www.pgatour.com/content/pgatour/stats/stat.120.y2014.html')
soup4 = bs(req4.text, 'html.parser')
stat = soup4.find('select',attrs={'aria-label':'Available Tournaments'})
htm = stat.find_all('option')
for h in htm: #finding all tournament codes for the given year
    z = pd.Series([h.get('value')], index=[h.text])
    tourn_links = tourn_links.append(z)

yearStats = []
count = 0
for tournament in tourn_links[0:2]: # create stat tables for two different golf tournaments
    print(tournament)
    df1 = []
    df_labels = []
    for r in range(0, len(s_asl)): #loop through all stat links adding the corresponding stat to that tournament's df
        try:
            link = 'https://www.pgatour.com'+s_asl[r]+'y2014.eon.'+tournament+'.html'
            web = pd.read_html(requests.get(link).text)
            table = web[1].set_index('PLAYER NAME')
            df1.append(table)
            df_labels.append(s_asl.index[r])
        except:
            print("empty table")
    try:
        df_tourn_stats = pd.concat(df1, keys=df_labels, axis=1)
        df_tourn_stats.reset_index(level=0, inplace=True)
        df_tourn_stats.insert(1, 'Tournament Name', tourn_links.index[count])
        df_tourn_stats.to_csv(str(count) + ".csv")
        df_tourn_stats = df_tourn_stats.loc[:, ~df_tourn_stats.columns.duplicated()].copy()
        yearStats.append(df_tourn_stats)
    except:
        print("NO DATA")
    count = count + 1

#combine the stats of the two different tournaments into one dataframe
df_year_stats = pd.concat(yearStats, axis=0, ignore_index=True).reset_index(drop=True)

Only items from first Beautiful Soup object are being added to my lists

I suspect this isn't very complicated, but I can't seem to figure it out. I'm using Selenium and Beautiful Soup to parse Petango.com. The data will be used to help a local shelter understand how they compare on different metrics to other area shelters, so the next step will be taking these data frames and doing some analysis.
I grab detail urls from a different module and import the list here.
My issue is that my lists only show the values from the HTML for the first dog. I was stepping through and noticed the len values differ across the soup iterations, so I realize my error is somewhere after that, but I can't figure out how to fix it.
Here is my code so far (running the whole process vs using a cached page)
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
from Petango import pet_links

headings = []
values = []
ShelterInfo = []
ShelterInfoWebsite = []
ShelterInfoEmail = []
ShelterInfoPhone = []
ShelterInfoAddress = []
Breed = []
Age = []
Color = []
SpayedNeutered = []
Size = []
Declawed = []
AdoptionDate = []

# to access sites, change url list to pet_links (break out as needed) and change if false to true. false looks to the html file
url_list = (pet_links[4], pet_links[6], pet_links[8])
#url_list = ("Petango.html", "Petango.html", "Petango.html")

for link in url_list:
    page_source = None
    if True:
        # pet page = link should populate links from above, hard coded link was for 1 detail page, = to html was for cached site
        PetPage = link
        #PetPage = 'https://www.petango.com/Adopt/Dog-Terrier-American-Pit-Bull-45569732'
        #PetPage = Petango.html
        PetDriver = webdriver.Chrome(executable_path='/Users/paulcarson/Downloads/chromedriver')
        PetDriver.implicitly_wait(30)
        PetDriver.get(link)
        page_source = PetDriver.page_source
        PetDriver.close()
    else:
        with open("Petango.html",'r') as f:
            page_source = f.read()

    PetSoup = BeautifulSoup(page_source, 'html.parser')
    print(len(PetSoup.text))

    #get the details about the shelter and add to lists
    ShelterInfo.append(PetSoup.find('div', class_ = "DNNModuleContent ModPethealthPetangoDnnModulesShelterShortInfoC").find('h4').text)
    ShelterInfoParagraphs = PetSoup.find('div', class_ = "DNNModuleContent ModPethealthPetangoDnnModulesShelterShortInfoC").find_all('p')
    First_Paragraph = ShelterInfoParagraphs[0]
    if "Website" not in First_Paragraph.text:
        raise AssertionError("first paragraph is not about site")
    ShelterInfoWebsite.append(First_Paragraph.find('a').text)
    Second_Paragraph = ShelterInfoParagraphs[1]
    ShelterInfoEmail.append(Second_Paragraph.find('a')['href'])
    Third_Paragraph = ShelterInfoParagraphs[2]
    ShelterInfoPhone.append(Third_Paragraph.find('span').text)
    Fourth_Paragraph = ShelterInfoParagraphs[3]
    ShelterInfoAddress.append(Fourth_Paragraph.find('span').text)

    #get the details about the pet
    ul = PetSoup.find('div', class_='group details-list').ul # Gets the ul tag
    li_items = ul.find_all('li') # Finds all the li tags within the ul tag
    for li in li_items:
        heading = li.strong.text
        headings.append(heading)
        value = li.span.text
        if value:
            values.append(value)
        else:
            values.append(None)
    Breed.append(values[0])
    Age.append(values[1])
    print(Age)
    Color.append(values[2])
    SpayedNeutered.append(values[3])
    Size.append(values[4])
    Declawed.append(values[5])
    AdoptionDate.append(values[6])

ShelterDF = pd.DataFrame(
    {
        'Shelter': ShelterInfo,
        'Shelter Website': ShelterInfoWebsite,
        'Shelter Email': ShelterInfoEmail,
        'Shelter Phone Number': ShelterInfoPhone,
        'Shelter Address': ShelterInfoAddress
    })
PetDF = pd.DataFrame(
    {'Breed': Breed,
     'Age': Age,
     'Color': Color,
     'Spayed/Neutered': SpayedNeutered,
     'Size': Size,
     'Declawed': Declawed,
     'Adoption Date': AdoptionDate
    })

print(PetDF)
print(ShelterDF)
Output from printing len(PetSoup.text) and the Age list as the loop progresses:
12783
['6y 7m']
10687
['6y 7m', '6y 7m']
10705
['6y 7m', '6y 7m', '6y 7m']
Could someone please point me in the right direction?
Thank you for your help!
Paul
You need to change the find method into find_all() in BeautifulSoup so it locates all the elements.
values is a global list that keeps growing, so with a static index you only ever append the first dog's value to Age:
Age.append(values[1])
The same problem applies to your other global lists (static index, whether 1 or 2 etc.).
You need a way to track the appropriate index to use, perhaps through a counter, or other logic that ensures the current value is added, e.g. for the current Age, is it the second li in the loop? Or just append PetSoup.select_one("[data-bind='text: age']").text
It looks like each item of interest (e.g. colour, spayed/neutered) carries a data-bind attribute, so you can use those with the appropriate attribute value to select each value and avoid the loop over li elements.
e.g. current_colour = PetSoup.select_one("[data-bind='text: color']").text
Best to assign the result to a variable and test that it is not None before accessing .text.
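Put together, the per-dog extraction inside the existing for link loop could look roughly like this (a sketch only; text: age and text: color come from the selectors above, and any other data-bind values would need checking against the page):
def grab(soup, binding):
    # return the element's text for a given data-bind value, or None if it is missing
    node = soup.select_one(f"[data-bind='{binding}']")
    return node.text.strip() if node is not None else None

# inside the loop, after PetSoup is built:
Age.append(grab(PetSoup, 'text: age'))
Color.append(grab(PetSoup, 'text: color'))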

How to build DataFrame from two dicts Python

I am trying to build a DataFrame; in this attempt the data and column names come from dicts. (I tried doing this with pd.Series but kept running into issues there as well.)
import requests
import pandas as pd
from bs4 import BeautifulSoup

# get link and parse
page = requests.get('https://www.finviz.com/screener.ashx?v=111&ft=4')
soup = BeautifulSoup(page.text, 'html.parser')

# return 'Title's for each filter
# to be used as columns in dataframe
titles = soup.find_all('span', attrs={'class': 'screener-combo-title'})
title_list = []
for t in titles:
    t = t.stripped_strings
    t = ' '.join(t)
    title_list.append(t)
title_list = {k: v for k, v in enumerate(title_list)}

# finding filters-cells tag id's
# to be used to build url
filters = soup.find_all('select', attrs={'data-filter': True})
filter_list = []
for f in filters:
    filter_list.append(f.get('data-filter'))

# finding selectable values per cell
# to be used as data in dataframe
final_list = []
for f in filters:
    options = f.find_all('option', attrs={'value': True})
    option_list = [] # list needs to stay inside
    for option in options:
        if option['value'] != "":
            option_list.append(option['value'])
    final_list.append(option_list)
final_list = {k: v for k, v in enumerate(final_list)}

df = pd.DataFrame([final_list], columns=[title_list])
print(df)
This results in TypeError: unhashable type: 'dict'. An example of the desired output would look like this (the first column is NOT the index):
Exchange Index ...
amex s&p500 ...
nasd djia
nyse
Here is an attempt to build a dict where each key corresponds to a filter title and each value is the list of possible choices for that filter. Does it suit your needs?
import requests
import pandas as pd
from bs4 import BeautifulSoup

# get link and parse
page = requests.get('https://www.finviz.com/screener.ashx?v=111&ft=4')
soup = BeautifulSoup(page.text, 'html.parser')

all_dict = {}
filters = soup.find_all('td', attrs={'class': 'filters-cells'})
for i in range(len(filters) // 2):
    i_title = 2 * i
    i_value = 2 * i + 1
    sct = filters[i_title].find_all('span', attrs={'class': 'screener-combo-title'})
    if len(sct) == 1:
        title = ' '.join(sct[0].stripped_strings)
        values = [v.text for v in filters[i_value].find_all('option', attrs={'value': True}) if v.text]
        all_dict[title] = values

max_element = max([len(v) for v in all_dict.values()])
for k in all_dict:
    all_dict[k] = all_dict[k] + [''] * (max_element - len(all_dict[k]))

df = pd.DataFrame.from_dict(all_dict)
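To sanity-check the result (just an inspection sketch; shorter option lists were padded with empty strings above so the columns line up):
print(df.shape)
print(list(df.columns))
print(df.head())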

How to iterate through each sub link to gather data

How do you iterate through each sub-link (fighter) to get the data I need, then leave the sub-link and go back to the page with all the fighters' names, move on to the next fighter (link), get all the data on that fighter, and keep doing that until the end of the list on that page?
records = []
r = requests.get('http://www.espn.com/mma/fighters')
soup = BeautifulSoup(r.text,'html.parser')
data = soup.find_all('tr',attrs={'class':['oddrow','evenrow']})
for d in data:
    try:
        name = d.find('a').text
    except AttributeError: name = ""
    try:
        country = d.find('td').findNext('td').text
    except AttributeError: county = ""
    records.append([name,country])
The above code is where all the fighters' names are located. I am able to iterate over each row to collect the fighter's name and country.
links = [f"http://www.espn.com{i['href']}" for i in data.find_all('a') if re.findall('^/mma/', i['href'])][1]
r1 = requests.get(links)
data1 = BeautifulSoup(test.text,'html.parser')
bio = data1.find('div', attrs={'class':'mod-content'})
weightClass = data1.find('li',attrs={'class':'first'}).text
trainingCenter = data1.find('li',attrs={'class':'last'}).text
wins = data1.find('table',attrs={'class':'header-stats'})('td')[0].text
loses = data1.find('table',attrs={'class':'header-stats'})('td')[1].text
draws = data1.find('table',attrs={'class':'header-stats'})('td')[2].text
tkos = data1.find_all('table',attrs={'class':'header-stats'})[1]('td')[0].text
subs = data1.find_all('table',attrs={'class':'header-stats'})[1]('td')[1].text
The above code currently enters the second fighter's page and collects all the data for that specific fighter (link).
records = []
r = requests.get('http://www.espn.com/mma/fighters')
soup = BeautifulSoup(r.text,'html.parser')
data = soup.find_all('tr',attrs={'class':['oddrow','evenrow']})
links = [f"http://www.espn.com{i['href']}" for i in data.find_all('a') if re.findall('^/mma/', i['href'])]
for d in data:
    try:
        name = d.find('a').text
    except AttributeError: name = ""
    try:
        country = d.find('td').findNext('td').text
    except AttributeError: county = ""
    for l in links:
        r1 = requests.get(links)
        data1 = BeautifulSoup(test.text,'html.parser')
        bio = data1.find('div', attrs={'class':'mod-content'})
        for b in bio:
            try:
                weightClass = data1.find('li',attrs={'class':'first'}).text
            except AttributeError: name = ""
            try:
                trainingCenter = data1.find('li',attrs={'class':'last'}).text
            except AttributeError: name = ""
            try:
                wins = data1.find('table',attrs={'class':'header-stats'})('td')[0].text
            except AttributeError: name = ""
            try:
                loses = data1.find('table',attrs={'class':'header-stats'})('td')[1].text
            except AttributeError: name = ""
            try:
                draws = data1.find('table',attrs={'class':'header-stats'})('td')[2].text
            except AttributeError: name = ""
            try:
                tkos = data1.find_all('table',attrs={'class':'header-stats'})[1]('td')[0].text
            except AttributeError: name = ""
            try:
                subs = data1.find_all('table',attrs={'class':'header-stats'})[1]('td')[1].text
            except AttributeError: name = ""
            records.append([name,country,weightClass])
The above code is what I am trying, but I am getting an error message:
"ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?"
How do I add this to my initial code so that I can collect the fighter's name and country from the original page, then go into each fighter's link and gather the data shown above, and do this for all the fighters on that page?
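For reference, that error comes from data.find_all('a'): data is a ResultSet (a list of rows), and find_all only exists on individual elements. A per-row version of the link extraction, sketched inside the existing row loop (fighter_url is a hypothetical name), would be:
import re

for d in data:
    a = d.find('a', href=re.compile('^/mma/'))  # look for the fighter link within this single row
    if a is not None:
        fighter_url = f"http://www.espn.com{a['href']}"
        # request fighter_url and parse the bio here, as in the second snippet above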
Check out this solution. I don't have much time at the moment but I will check around as soon as I'm free. You can do the main operation using the following code. The only thing you need to do is get the data from the target page. The script below fetches all the links from each page, going through the pagination (a to z), and then collects the names from each target page.
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://www.espn.com/mma/fighters?search={}"
for linknum in [chr(i) for i in range(ord('a'),ord('z')+1)]:
    r = requests.get(url.format(linknum))
    soup = BeautifulSoup(r.text,'html.parser')
    for links in soup.select(".tablehead a[href*='id']"):
        res = requests.get(urljoin(url,links.get("href")))
        sauce = BeautifulSoup(res.text,"lxml")
        title = sauce.select_one(".player-bio h1").text
        print(title)
import requests, re
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import pandas as pd

url = "http://www.espn.com/mma/fighters?search={}"

titleList = []
countryList = []
stanceList = []
reachList = []
ageList = []
weightClassList = []
trainingCenterList = []
winsList = []
losesList = []
drawsList = []
tkosList = []
subsList = []

#i believe this is what takes us from one page to another, but not 100% sure yet
for linknum in [chr(i) for i in range(ord('a'),ord('z')+1)]:
    r = requests.get(url.format(linknum))
    soup = BeautifulSoup(r.text,'html.parser')
    #a[href*=] gets all anchors a that contain whatever the href*=''
    for links in soup.select(".tablehead a[href*='id']"):
        #urljoin just takes a url and another string and combines them to create a new url
        res = requests.get(urljoin(url,links.get("href")))
        sauce = BeautifulSoup(res.text,"lxml")
        try:
            title = sauce.select_one(".player-bio h1").text
        except AttributeError: title = ""
        try:
            country = sauce.find('span',text='Country').next_sibling
        except AttributeError: country = ""
        try:
            stance = sauce.find('span',text='Stance').next_sibling
        except AttributeError: stance = ""
        try:
            reach = sauce.find('span',text='Reach').next_sibling
        except AttributeError: reach = ""
        try:
            age = sauce.find('span',text='Birth Date').next_sibling[-3:-1]
        except AttributeError: age = ""
        try:
            weightClass = sauce.find('li',attrs={'class':'first'}).text
        except AttributeError: weightClass = ""
        try:
            trainingCenter = sauce.find('li',attrs={'class':'last'}).text
        except AttributeError: trainingCenter = ""
        try:
            wins = sauce.find('table',attrs={'class':'header-stats'})('td')[0].text
        except AttributeError: wins = ""
        try:
            loses = sauce.find('table',attrs={'class':'header-stats'})('td')[1].text
        except AttributeError: loses = ""
        try:
            draws = sauce.find('table',attrs={'class':'header-stats'})('td')[2].text
        except AttributeError: draws = ""
        try:
            tkos = sauce.find_all('table',attrs={'class':'header-stats'})[1]('td')[0].text
        except AttributeError: tkos = ""
        try:
            subs = sauce.find_all('table',attrs={'class':'header-stats'})[1]('td')[1].text
        except AttributeError: subs = ""
        titleList.append(title)
        countryList.append(country)
        stanceList.append(stance)
        reachList.append(reach)
        ageList.append(age)
        weightClassList.append(weightClass)
        trainingCenterList.append(trainingCenter)
        winsList.append(wins)
        losesList.append(loses)
        drawsList.append(draws)
        tkosList.append(tkos)
        subsList.append(subs)

df = pd.DataFrame()
df['title'] = titleList
df['country'] = countryList
df['stance'] = stanceList
df['reach'] = reachList
df['age'] = ageList
df['weightClass'] = weightClassList
df['trainingCenter'] = trainingCenterList
df['wins'] = winsList
df['loses'] = losesList
df['draws'] = drawsList
df['tkos'] = tkosList
df['subs'] = subsList

df.to_csv('MMA Fighters', encoding='utf-8')

Python: Using XPath to get data from a table

I'm trying to get data from the table at the bottom of http://projects.fivethirtyeight.com/election-2016/delegate-targets/.
import requests
from lxml import html

url = "http://projects.fivethirtyeight.com/election-2016/delegate-targets/"
response = requests.get(url)
doc = html.fromstring(response.text)
tables = doc.findall('.//table[@class="delegates desktop"]')
election = tables[0]
election_rows = election.findall('.//tr')

def extractCells(row, isHeader=False):
    if isHeader:
        cells = row.findall('.//th')
    else:
        cells = row.findall('.//td')
    return [val.text_content() for val in cells]

import pandas

def parse_options_data(table):
    rows = table.findall(".//tr")
    header = extractCells(rows[1], isHeader=True)
    data = [extractCells(row, isHeader=False) for row in rows[2:]]
    return pandas.DataFrame(data, columns=header)

election_data = parse_options_data(election)
election_data
I'm having trouble with the topmost row with the candidates' names ('Trump', 'Cruz', 'Kasich'). It is under tr class="top" and right now I only have tr class="bottom" (starting with the row that says "won/target").
Any help is much appreciated!
The candidate names are in the 0-th row:
candidates = [val.text_content() for val in rows[0].findall('.//th')[1:]]
Or, if reusing the same extractCells() function:
candidates = extractCells(rows[0], isHeader=True)[1:]
[1:] slice here is to skip the first empty th cell.
Not great (hard-coded), but it runs the way you want:
def parse_options_data(table):
    rows = table.findall(".//tr")
    candidate = extractCells(rows[0], isHeader=True)[1:]
    header = extractCells(rows[1], isHeader=True)[:3] + candidate
    data = [extractCells(row, isHeader=False) for row in rows[2:]]
    return pandas.DataFrame(data, columns=header)
