Python: Limit time to run pandas read_html

I am trying to limit the running time of dfs = pd.read_html(str(response.text)): once it has run for more than 5 seconds, it should stop for that URL and move on to the next one. I did not find a timeout argument in pd.read_html, so how can I do that?
from bs4 import BeautifulSoup
import re
import requests
import os
import time
from pandas import DataFrame
import pandas as pd
from urllib.request import urlopen

headers = {'User-Agent': 'regsre#jh.edu'}
urls = {'https://www.sec.gov/Archives/edgar/data/1058307/0001493152-21-003451.txt',
        'https://www.sec.gov/Archives/edgar/data/1064722/0001760319-21-000006.txt'}

for url in urls:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    time.sleep(0.1)
    dfs = pd.read_html(str(response.text))
    print(url)
    for item in dfs:
        try:
            Operation = (item[0].apply(str).str.contains('Revenue') | item[0].apply(str).str.contains('profit'))
            if Operation.empty:
                pass
            if Operation.any():
                Operation_sheet = item
            if not Operation.any():
                CashFlows = (item[0].apply(str).str.contains('income') | item[0].apply(str).str.contains('loss'))
                if CashFlows.any():
                    Operation_sheet = item
                if not CashFlows.any():
                    pass
        except KeyError:
            # The except clause is missing in the original snippet; without it the try
            # block is a syntax error. Skip tables that have no first column.
            continue

I'm not certain what the issue is, but pandas seems to get overwhelmed by this file. If we instead use BeautifulSoup to find the tables, prettify them, and pass those to pd.read_html(), then it seems to handle things just fine.
from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User-Agent': 'regsre#jh.edu'}
url = 'https://www.sec.gov/Archives/edgar/data/1064722/0001760319-21-000006.txt'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text)

dfs = []
for table in soup.find_all('table'):
    dfs.extend(pd.read_html(table.prettify()))

# Printing the first few:
for df in dfs[0:3]:
    print(df, '\n')
0 1 2 3 4
0 Nevada NaN 4813 NaN 65-0783722
1 (State or other jurisdiction of NaN (Primary Standard Industrial NaN (I.R.S. Employer
2 incorporation or organization) NaN Classification Code Number) NaN Identification Number)
0
0 Ralph V. De Martino, Esq.
1 Alec Orudjev, Esq.
2 Schiff Hardin LLP
3 901 K Street, NW, Suite 700
4 Washington, DC 20001
5 Phone (202) 778-6400
6 Fax: (202) 778-6460
0 1
0 Large accelerated filer [ ] Accelerated filer [ ]
1 NaN NaN
2 Non-accelerated filer [X] Smaller reporting company [X]
3 NaN NaN
4 NaN Emerging growth company [ ]
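
For what it's worth, the 5-second cap the question asks about is not addressed above. One common workaround (a minimal sketch, not taken from this answer) is to hand the parse to a worker thread and stop waiting for it after 5 seconds:

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
import requests
import pandas as pd

headers = {'User-Agent': 'regsre#jh.edu'}
urls = {'https://www.sec.gov/Archives/edgar/data/1058307/0001493152-21-003451.txt',
        'https://www.sec.gov/Archives/edgar/data/1064722/0001760319-21-000006.txt'}

# Caveat: a timed-out parse keeps running in its background thread; the point is only
# that the loop itself gives up on it and moves on to the next URL.
pool = ThreadPoolExecutor()

for url in urls:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    future = pool.submit(pd.read_html, response.text)
    try:
        dfs = future.result(timeout=5)  # wait at most 5 seconds for the parse
    except FutureTimeout:
        print(f'skipped (parsing took longer than 5 seconds): {url}')
        continue
    print(url, len(dfs))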


bs4 only prints out 1/4 of the contents in the spreadsheet

Here's the code. I'm trying to parse a spreadsheet with over 2000 items, but whenever I run this script I only get the last one hundred or so. What could I do to fix this? I have tried different parsers and haven't found any solutions.
from bs4 import BeautifulSoup
import requests

url = "https://backpack.tf/spreadsheet"
sourse = requests.get(url).text
soup = BeautifulSoup(sourse, "html.parser")

try:
    for name in soup.find_all("tr"):
        header = name.text
        print(header)
except:
    pass
Couldn't get the HTML to work, sorry, so please go to https://backpack.tf/spreadsheet
The simplest way to read the table from this page is with pd.read_html:
import requests
import pandas as pd

url = "https://backpack.tf/spreadsheet"
r = requests.get(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0"
    },
)

df = pd.read_html(r.text)[0]
print(df)
Prints:
Name Type Genuine Vintage Unique Strange Haunted Collector's
0 A Brush with Death Cosmetic NaN NaN 5.11–5.22 ref 18.4 keys NaN 200 keys
1 A Color Similar to Slate Tool NaN NaN 33.11–33.22 ref NaN NaN NaN
2 A Color Similar to Slate (Non-Craftable) Tool NaN NaN 30 ref NaN NaN NaN
...
2851 Zepheniah's Greed (Non-Craftable) Tool NaN NaN 12 ref NaN NaN NaN
2852 Zipperface (Non-Craftable) Cosmetic NaN NaN 10.55 ref NaN NaN NaN
2853 Zipperface Cosmetic NaN NaN 1.65 keys NaN 13–14.66 ref NaN
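
As a side note on the original symptom (only the last hundred or so rows showing up): that is most likely the console truncating or scrolling past the output rather than missing data. A small sketch, reusing the request above, that writes the full frame to disk so every row can be checked:

import requests
import pandas as pd

url = "https://backpack.tf/spreadsheet"
r = requests.get(
    url,
    headers={
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:103.0) Gecko/20100101 Firefox/103.0"
    },
)

df = pd.read_html(r.text)[0]
print(len(df))                             # ~2850 rows in the run shown above
df.to_csv("spreadsheet.csv", index=False)  # keeps every row, unlike a scrolled-off console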

Scraping tbody data -- trouble

I'm new to this, and I'm struggling to understand why nothing seems to be working.
Basically, all I want to do is scrape the table data (play by play) from a list of ncaa.com links (sample below)
https://stats.ncaa.org/game/play_by_play/12465
https://stats.ncaa.org/game/play_by_play/12755
https://stats.ncaa.org/game/play_by_play/12640
https://stats.ncaa.org/game/play_by_play/12290
For context, I got these links by scraping HREF tags from a different list of links (which contained every NCAA team's game schedule).
I've struggled through a lot of this, but there's always been an answer somewhere...
The inspector makes it seem like the play-by-play (table) data is inside a tbody tag, or at least I think so.
I've tried a script as simple as this (which works for other websites)
import pandas as pd

df = pd.read_html('https://stats.ncaa.org/game/play_by_play/13592')[0]
print(df)
But it still didn't work for this site... I read a bit about using the lxml parser instead of html.parser (as in the code below), but that's not working either. I thought this was my best chance at getting the tables from multiple links at once:
import requests
from bs4 import BeautifulSoup

profiles = []
urls = [
    'https://stats.ncaa.org/game/play_by_play/12564',
    'https://stats.ncaa.org/game/play_by_play/13592'
]

for url in urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml.parser')
    for profile in soup.find_all('a'):
        profile = profile.get('tbody')
        profiles.append(profile)
        # print(profiles)

for p in profiles:
    print(p)
Any thoughts as to what is unique about this site/what could be the issue would be greatly appreciated.
That website checks whether the request comes from a bot or a browser, so you need to update requests' headers with a real user-agent.
Each page has 8 tables. The code below goes through each URL you mentioned above and prints all the tables; you can review them and see which one you need:
import requests
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

urls = [
    'https://stats.ncaa.org/game/play_by_play/12465',
    'https://stats.ncaa.org/game/play_by_play/12755',
    'https://stats.ncaa.org/game/play_by_play/12640',
    'https://stats.ncaa.org/game/play_by_play/12290',
]

s = requests.Session()
s.headers.update(headers)

for url in urls:
    r = s.get(url)
    dfs = pd.read_html(r.text)
    len(dfs)
    for df in dfs:
        print(df)
        print('___________')
Response:
0 1 2 3
0 NaN 1st Half 2nd Half Total
1 UTRGV 23 19 42
2 UTEP 26 34 60
___________
0 1
0 Game Date: 01/02/2009
1 Location: El Paso, Texas (Don Haskins Center)
2 Attendance: 8413
___________
0 1
0 Officials: John Higgins, Duke Edsall, Quinton, Reece
___________
0 1
0 1st Half 1 2
___________
0 1 2 \
0 Time UTRGV Score
1 19:45 NaN 0-0
2 19:45 NaN 0-0
3 19:33 NaN 0-0
4 19:33 Emmanuel Jones Defensive Rebound 0-0
.. ... ... ...
163 00:11 NaN 23-25
164 00:11 NaN 23-26
165 00:00 Emmanuel Jones missed Two Point Jumper 23-26
166 00:00 NaN 23-26
167 End of 1st Half End of 1st Half End of 1st Half
3
0 UTEP
1 Arnett Moultrie missed Two Point Jumper
2 Julyan Stone Offensive Rebound
3 Stefon Jackson missed Three Point Jumper
4 NaN
.. ...
163 Stefon Jackson made Free Throw
164 Stefon Jackson made Free Throw
165 NaN
166 Julyan Stone Defensive Rebound
167 End of 1st Half
[168 rows x 4 columns]
___________
0 1
0 2nd Half 1 2
[..]
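
If you only want the play-by-play itself rather than all 8 tables, one rough filter (an assumption based on the output above, where the play-by-play tables are by far the longest) is to keep only the large frames:

import requests
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
s = requests.Session()
s.headers.update(headers)

r = s.get('https://stats.ncaa.org/game/play_by_play/12465')
dfs = pd.read_html(r.text)

# The score/officials tables above have only a handful of rows, while the
# play-by-play halves have well over a hundred, so filter on row count.
play_by_play = [df for df in dfs if len(df) > 50]
print([df.shape for df in play_by_play])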

I have to create a dataframe from webscraping in a specific way

I have to create a dataframe in python by creating a bunch of lists from a table in a wikipedia article.
code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import pandas as pd
import numpy as np

url = "https://en.wikipedia.org/wiki/Texas_Killing_Fields"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

all_tables = soup.find_all('table')
all_sortable_tables = soup.find_all('table', class_='wikitable sortable')
right_table = all_sortable_tables

A = []
B = []
C = []
D = []
E = []

for row in right_table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 5:
        row.strip('\n')
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
        D.append(cells[3].find(text=True))
        E.append(cells[4].find(text=True))

df = pd.DataFrame(A, columns=['Victim'])
df['Victim'] = A
df['Age'] = B
df['Residence'] = C
df['Last Seen'] = D
df['Discovered'] = E
I keep getting an attribute error "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?"
I have tried a bunch of methods and nothing has helped me. I'm also following a tutorial the teacher gave us, and it's not helpful either.
tutorial: https://alanhylands.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas/#heading-10.-loop-through-the-rows
first time here btw as a questioner.
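For what it's worth, the AttributeError itself comes from find_all() returning a ResultSet (a list of tags) rather than a single tag; indexing it before calling find_all() again is the one-line fix. A minimal sketch against the question's code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Texas_Killing_Fields"
soup = BeautifulSoup(urlopen(url), 'html.parser')

all_sortable_tables = soup.find_all('table', class_='wikitable sortable')  # a ResultSet (list-like)
right_table = all_sortable_tables[0]    # take the first matching table, a single Tag...
for row in right_table.find_all('tr'):  # ...which does have find_all()
    cells = row.find_all('td')
    if len(cells) == 5:
        print(cells[0].get_text(strip=True))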
Note: As mentioned by #ggorlen, using an existing API would be the best approach. I would also recommend a more structured way to store your data, to avoid this bunch of lists.
data = []

for row in soup.select('table.wikitable.sortable tr:has(td)'):
    data.append(
        dict(
            zip([h.text.strip() for h in soup.select('table.wikitable.sortable tr th')[:5]],
                [c.text.strip() for c in row.select('td')][:5])
        )
    )

pd.DataFrame(data)
Just an alternative approach to scrape tables, using pandas.read_html(), since you already imported pandas. It uses BeautifulSoup under the hood and does the job for you:
import pandas as pd

df = pd.read_html('https://en.wikipedia.org/wiki/Texas_Killing_Fields')[1]
df.iloc[:, :5]  ### displays only the first 5 columns, as in your example
Output:
            Victim  Age         Residence          Last seen         Discovered
0     Brenda Jones   14  Galveston, Texas       July 1, 1971       July 2, 1971
1   Colette Wilson   13      Alvin, Texas      June 17, 1971  November 26, 1971
2   Rhonda Johnson   14    Webster, Texas     August 4, 1971    January 3, 1972
3      Sharon Shaw   13    Webster, Texas     August 4, 1971    January 3, 1972
4  Gloria Gonzales   19    Houston, Texas   October 28, 1971  November 23, 1971
...
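
A related note on the hard-coded [1] index: pd.read_html also has standard match and attrs parameters that let you target the table directly, which is less brittle if Wikipedia reorders the page. A sketch (the 'Victim' match string is simply taken from the header shown above):

import pandas as pd

# match= keeps only tables whose text contains the given string or regex;
# attrs={'class': 'wikitable sortable'} would be another way to narrow it down.
dfs = pd.read_html('https://en.wikipedia.org/wiki/Texas_Killing_Fields', match='Victim')
df = dfs[0]
print(df.iloc[:, :5].head())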

beautiful soup find_all() not returning all elements

I am trying to scrape this website using bs4. Using inspect on a particular car ad tile, I figured out what I need to scrape in order to get the title and the link to the car's page.
I am making use of the find_all() function of the bs4 library, but the issue is that it's not scraping the required info for all the cars. It returns info for only about 21 cars, whereas it's clearly visible on the website that there are about 2410 cars.
The relevant code:
from bs4 import BeautifulSoup as bs
from urllib.request import Request, urlopen
import re
import requests
url = 'https://www.cardekho.com/used-cars+in+bangalore'
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = bs(webpage,"html.parser")
tags = page_soup.find_all("div","title")
print(len(tags))
How do I get info on all of the cars present on the page?
P.S. I want to point out just one thing: not all the cars are displayed at once; more car info gets loaded as you scroll down. Could it be because of that? Not sure.
OK, I've written up some sample code to show you how it can be done. Although the site has a convenient API that we can leverage, the first page is not available through the API; it is embedded in a script tag in the HTML code, which requires additional processing to extract. After that it is simply a matter of getting the JSON data from the API, parsing it into Python dictionaries and appending the car entries to a list. The link to the API can be found by inspecting network activity in Chrome or Firefox while scrolling the site.
from bs4 import BeautifulSoup
import re
import json
from subprocess import check_output
import requests
import time
from tqdm import tqdm  # tqdm just adds a progress bar, https://pypi.org/project/tqdm/

cars = []  # empty list to which we will append the car dicts from the json data

url = 'https://www.cardekho.com/used-cars+in+bangalore'
r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content.decode('utf-8'), "html.parser")

# Find the section with the json data: look for a script tag with type application/ld+json
# and take the next tag, which is the one with the data we need (see the page source).
s = soup.find('script', {"type": "application/ld+json"}).next_sibling

# Strip the text of unnecessary strings and prepare it so the json can be loaded as a python dict, taken from:
# https://stackoverflow.com/questions/54991571/extract-json-from-html-script-tag-with-beautifulsoup-in-python/54992015#54992015
js = 'window = {};\n' + s.text.strip() + ';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));'

with open('temp.js', 'w') as f:  # save the string to a javascript file
    f.write(js)

# Execute the file with node, which returns the json data that json.loads parses.
data_site = json.loads(check_output(['node', 'temp.js']))

for i in data_site['items']:  # append all cars from the first page to the list 'cars'
    cars.append(i)

# 'pagefrom' in the api call is 20, 40, 60, etc., so create a range and loop over it
for page in tqdm(range(20, data_site['total_count'], 20)):
    r = requests.get(f"https://www.cardekho.com/api/v1/usedcar/search?&cityId=105&connectoid=&lang_code=en&regionId=0&searchstring=used-cars%2Bin%2Bbangalore&pagefrom={page}&sortby=updated_date&sortorder=asc&mink=0&maxk=200000&dealer_id=&regCityNames=&regStateNames=", headers={'User-Agent': 'Mozilla/5.0'})
    data = r.json()
    for i in data['data']['cars']:  # append all cars from this page to the list 'cars'
        cars.append(i)
    time.sleep(5)  # wait a few seconds to avoid overloading the site
This will result in cars being a list of dictionaries. The car names can be found in the vid key, and the urls are present in the vlink key.
You can load it into a pandas dataframe to explore the data:
import pandas as pd
df = pd.DataFrame(cars)
df.head() will output (I omitted the images column for readability):
Columns: loc | myear | bt | ft | km | it | pi | pn | pu | dvn | ic | ucid | sid | ip | oem | model | vid | city | vlink | p_numeric | webp_image | position | pageNo | centralVariantId | isExpiredModel | modelId | isGenuine | is_ftc | seller_location | utype | views | tmGaadiStore | cls
Row 0: Koramangala | 2014 | SUV | Diesel | 30,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2206305_1614944913.jpg | 9.9 Lakh | Mahindra XUV500 W6 2WD | 13 | 3019084 | 9509A09F1673FE2566DF59EC54AAC05B | 1 | Mahindra | Mahindra XUV500 | Mahindra XUV500 2011-2015 W6 2WD | Bangalore | /used-car-details/used-Mahindra-XUV500-2011-2015-W6-2WD-cars-Bangalore_9509A09F1673FE2566DF59EC54AAC05B.htm | 990000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2206305_1614944913.webp | 1 | 1 | 3822 | True | 570 | 0 | 0 | {'address': 'BDA Complex, 100 Feet Rd, 3rd Block, Koramangala 3 Block, Koramangala, Bengaluru, Karnataka 560034, Bangalore', 'lat': 12.931, 'lng': 77.6228} | Dealer | 235 | False
Row 1: Marathahalli Colony | 2017 | SUV | Petrol | 30,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2203506_1614754307.jpeg | 7.85 Lakh | Ford Ecosport 1.5 Petrol Trend BSIV | 14 | 3015331 | 2C0E4C4E543D4792C1C3186B361F718B | 1 | Ford | Ford Ecosport | Ford Ecosport 2015-2021 1.5 Petrol Trend BSIV | Bangalore | /used-car-details/used-Ford-Ecosport-2015-2021-1.5-Petrol-Trend-BSIV-cars-Bangalore_2C0E4C4E543D4792C1C3186B361F718B.htm | 785000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2203506_1614754307.webp | 2 | 1 | 6086 | True | 175 | 0 | 0 | {'address': '2, Varthur Rd, Ayyappa Layout, Chandra Layout, Marathahalli, Bengaluru, Karnataka 560037, Marathahalli Colony, Bangalore', 'lat': 12.956727624875453, 'lng': 77.70174980163576} | Dealer | 495 | False
Row 2: Yelahanka | 2020 | SUV | Diesel | 13,969 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/usedcar_11_276591614316705_1614316747.jpg | 41 Lakh | Toyota Fortuner 2.8 4WD AT | 12 | 3007934 | BBC13FB62DF6840097AA62DDEA05BB04 | 1 | Toyota | Toyota Fortuner | Toyota Fortuner 2016-2021 2.8 4WD AT | Bangalore | /used-car-details/used-Toyota-Fortuner-2016-2021-2.8-4WD-AT-cars-Bangalore_BBC13FB62DF6840097AA62DDEA05BB04.htm | 4100000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/usedcar_11_276591614316705_1614316747.webp | 3 | 1 | 7618 | True | 364 | 0 | 0 | {'address': 'Sonnappanahalli Kempegowda Intl Airport Road Jala Uttarahalli Hobli, Yelahanka, Bangalore, Karnataka 560064', 'lat': 13.1518821, 'lng': 77.6220694} | Dealer | 516 | False
Row 3: Byatarayanapura | 2017 | Sedans | Diesel | 18,000 | 0 | https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2202297_1615013237.jpg | 35 Lakh | Mercedes-Benz E-Class E250 CDI Avantgarde | 15 | 3013606 | 4553943A967049D873712AFFA5F65A56 | 1 | Mercedes-Benz | Mercedes-Benz E-Class | Mercedes-Benz E-Class 2009-2012 E250 CDI Avantgarde | Bangalore | /used-car-details/used-Mercedes-Benz-E-Class-2009-2012-E250-CDI-Avantgarde-cars-Bangalore_4553943A967049D873712AFFA5F65A56.htm | 3500000 | https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2202297_1615013237.webp | 4 | 1 | 4611 | True | 674 | 0 | 0 | {'address': 'NO 19, Near Traffic Signal, Byatanarayanapura, International Airport Road, Byatarayanapura, Bangalore, Karnataka 560085', 'lat': 13.0669588, 'lng': 77.5928756} | Dealer | 414 | False
Row 4: nan | 2015 | Sedans | Diesel | 80,000 | 0 | https://stimg.cardekho.com/pwa/img/noimage.svg | 12.5 Lakh | Skoda Octavia Elegance 2.0 TDI AT | 1 | 3002709 | 156E5F2317C0A3A3BF8C03FFC35D404C | 1 | Skoda | Skoda Octavia | Skoda Octavia 2013-2017 Elegance 2.0 TDI AT | Bangalore | /used-car-details/used-Skoda-Octavia-2013-2017-Elegance-2.0-TDI-AT-cars-Bangalore_156E5F2317C0A3A3BF8C03FFC35D404C.htm | 1250000 | 5 | 1 | 3092 | True | 947 | 0 | 0 | {'lat': 0, 'lng': 0} | Individual | 332 | False
Or if you wish to explode the dict in seller_location to columns, you can load it with df = pd.json_normalize(cars).
You can save all data to a csv file: df.to_csv('output.csv')
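Putting those two hints together, a short follow-up sketch (it assumes cars is the list built by the code above):

import pandas as pd

df = pd.json_normalize(cars)         # flattens seller_location into address/lat/lng columns
print(df[['vid', 'vlink']].head())   # car names and relative links, as noted above
df.to_csv('output.csv', index=False)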

Reading tables with Pandas and Selenium

I am trying to figure out an elegant way to scrape tables from a website. However, when running the script below I get a ValueError: No tables found error.
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome(executable_path=r'C:...\chromedriver.exe')
driver.implicitly_wait(30)
driver.get("https://www.gallop.co.za/#meeting#20201125#3")
df_list=pd.read_html(driver.find_element_by_id("eventTab_4").get_attribute('outerHTML'))
When I look at the site elements, I notice that the code above works when the <table> tag lies neatly within the <div id="...">. However, in this case, I think the code is not working for the following reasons:
There is a < div > within a < div > and then there is the < table > tag.
The site uses Javascript with the tables.
Grateful for advice on how to pull the tables for all races. That is, there are several tables which are made visible as the user clicks on each tab (race). I need to extract all of them into separate dataframes.
from selenium import webdriver
import time
import pandas as pd

pd.set_option('display.max_column', None)

driver = webdriver.Chrome(executable_path='C:/bin/chromedriver.exe')
driver.get("https://www.gallop.co.za/#meeting#20201125#3")
time.sleep(5)

tab = driver.find_element_by_id('tabs')        # all tabs are here
li_list = tab.find_elements_by_tag_name('li')  # they are in "li" elements

a_list = []
for li in li_list[1:]:                    # the first tab has nothing, so skip it
    a = li.find_element_by_tag_name('a')  # extract the "a" element from the "li"
    a_list.append(a)

df = []
for a in a_list:
    a.click()                    # next tab
    time.sleep(8)                # tables take some time to load fully
    page = driver.page_source    # get the HTML of the new tab page
    source = pd.read_html(page)
    table = source[1]            # get the 2nd table
    df.append(table)

print(df)
Output
[ Silk No Horse Unnamed: 3 ACS SH CFR Trainer \
0 NaN 1 JEM ROCK NaN 4bC A NaN Eric Sands
1 NaN 2 MAISON MERCI NaN 3chC A NaN Brett Crawford
2 NaN 3 WORDSWORTH NaN 3bC AB NaN Greg Ennion
3 NaN 4 FOUND THE DREAM NaN 3bG A NaN Adam Marcus
4 NaN 5 IZAPHA NaN 3bG A NaN Andre Nel
5 NaN 6 JACKBEQUICK NaN 3grG A NaN Glen Kotzen
6 NaN 7 MHLABENI NaN 3chG A NaN Eric Sands
7 NaN 8 ORLOV NaN 3bG A NaN Adam Marcus
8 NaN 9 T'CHALLA NaN 3bC A NaN Justin Snaith
9 NaN 10 WEST COAST WONDER NaN 3bG A NaN Piet Steyn
Jockey Wgt MR Dr Odds Last 3 Runs
0 D Dillon 60 0 7 NaN NaN
continued
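
A side note on the fixed time.sleep(8): Selenium's explicit waits can make this less fragile by polling until a table is actually present. A minimal sketch for a single page load (the same idea can be applied after each tab click):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

driver = webdriver.Chrome(executable_path='C:/bin/chromedriver.exe')
driver.get("https://www.gallop.co.za/#meeting#20201125#3")

# Wait up to 20 seconds for at least one <table> to appear instead of sleeping blindly.
WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.TAG_NAME, "table")))

dfs = pd.read_html(driver.page_source)
print(len(dfs))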
