Here I'm trying to read the page and create a CSV with the respective columns, but I'm unable to use find on the parsed data: the soup doesn't contain the data that is visible on the webpage.
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.fancraze.com/marketplace/sales/mornemorkel1?tab=latest-sales"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
The site uses an API to load its data, so you can query that API directly:
import pandas as pd
import requests
url = 'https://api.faze.app/v1/latestSalesInAGroup/mornemorkel1'
result = []
response = requests.get(url=url)
for data in response.json()['data']:
    row = {
        'id': data['momentId']['id'],
        'seller': data['sellerAddress']['userName'],
        'buyer': data['buyerAddress']['userName'],
        'price': data['price'],
        'created': data['createdAt'],
    }
    result.append(row)
df = pd.DataFrame(result)
print(df)
OUTPUT:
id seller ... price created
0 1882 singal22 ... 8 2022-06-22T14:34:39.403Z
1 1737 olive_creepy2343 ... 7 2022-06-22T14:09:32.070Z
2 1256 tomato_wicked3294 ... 10 2022-06-22T13:49:20.895Z
3 1931 aquamarine_productive9244 ... 6 2022-06-22T13:41:49.153Z
4 1603 aquamarine_productive9244 ... 9 2022-06-22T13:28:01.624Z
.. ... ... ... ... ...
95 1026 olive_creepy2343 ... 7 2022-04-16T18:00:00.662Z
96 1719 Hhassan136 ... 5 2022-04-14T23:14:12.037Z
97 2054 Cricket101 ... 5 2022-04-14T21:30:13.185Z
98 1961 emzeden_9 ... 6 2022-04-14T18:02:05.194Z
99 1194 amaranth_curious1871 ... 5 2022-04-14T17:45:25.266Z
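Since the goal was a CSV, you can write the frame straight to disk once it is built; a minimal follow-up (the filename here is just an example):
df.to_csv('latest_sales.csv', index=False)  # write the collected sales to CSV; the filename is arbitrary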
I am trying to scrape tables from Rotowire. pd.read_html is only returning the headers.
import pandas as pd
url = pd.read_html("http://www.rotowire.com/daily/mlb/optimizer.htm?site=DraftKings&sport=MLB")
# for idx, table in enumerate(url):
# print("***************************")
# print(idx)
# print(table)
url[5]
Output:
Player Team Position Salary Fpts. Val Min. % Max. % Exposure
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
No idea which table you want, but you're not going to get anything from the static HTML response, as the page is rendered through JavaScript. They do have some data you can access, though. You'd have to work out the parameters:
import pandas as pd
import requests
url = 'https://www.rotowire.com/daily/tables/optimizer-mlb.php'
payload = {
    'siteID': '1',
    'slateID': '6441',
    'projSource': 'RotoWire',
    'rst': 'RotoWire',
}
jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData)
Output:
print(df)
id playerID rotoPlayerID ... ie_green_lights ie_matchup_notes ie_volatility
0 12739 11095 12739 ... 0 0
1 10510 4081 10510 ... 0 0
2 16036 5163 16036 ... 0 0
3 14194 10827 14194 ... 0 0
4 14865 15463 14865 ... 0 0
.. ... ... ... ... ... ... ...
687 14444 11330 14444 ... 0 0
688 14440 18894 14440 ... 0 0
689 14439 18905 14439 ... 0 0
690 14435 5058 14435 ... 0 0
691 17921 18828 17921 ... 0 0
[692 rows x 99 columns]
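Once you have the full frame, you can inspect the 99 columns and keep only the ones you care about before saving; a small sketch ('id' and 'playerID' appear in the output above, the CSV filename is just an example):
# list all 99 column names so you can pick the ones you need
print(df.columns.tolist())
# keep a couple of columns shown in the output above and save them
df[['id', 'playerID']].to_csv('rotowire_optimizer.csv', index=False)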
I have scraped information with the results of the 2016 Chess Olympiad, using the following code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
#Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
requests.get(url)
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')
#Subsets the HTML to only get the HTML of our table needed
table = soup.find('table', attrs = {'border': '1'})
print(table)
#Gets all the column headers of our table, but just for the first eleven columns in the webpage
table.find_all('td', class_= 'bog')[1:12]
headers = []
for i in table.find_all('td', class_= 'bog')[1:12]:
    title = i.text.strip()
    headers.append(title)
#Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns = headers)
table.find_all('tr')[3:] #We grab data from the fourth row on; the previous rows belong to the headers.
for j in table.find_all('tr')[3:]:
    row_data = j.find_all('td')
    row = [tr.text for tr in row_data][0:11]
    length = len(df)
    df.loc[length] = row
I want to do the same thing for the 2014 and 2012 results (the Olympiad is normally played every two years), automatically. I have gotten the code about halfway, but I really don't know how to continue. This is what I've done so far.
import requests
from bs4 import BeautifulSoup
import pandas as pd
#Imports the HTML into python
url = 'https://www.olimpbase.org/2016/2016te14.html'
requests.get(url)
page = requests.get(url)
print(page)
soup = BeautifulSoup(page.text, 'lxml')
#Subsets the HTML to only get the HTML of our table needed
table = soup.find('table', attrs = {'border': '1'})
print(table)
#Gets all the column headers of our table
table.find_all('td', class_= 'bog')[1:12]
headers = []
for i in table.find_all('td', class_= 'bog')[1:12]:
    title = i.text.strip()
    headers.append(title)
#Creates a dataframe using the column headers from our table
df = pd.DataFrame(columns = headers)
table.find_all('tr')[3:] #We grab data from the fourth row on; the previous rows belong to the headers.
start_year = 2012
i = 2
end_year = 2016

def download_chess(start_year):
    url = f'https://www.olimpbase.org/{start_year}/{start_year}te14.html'
    response = requests.get(url)
    soup = BeautifulSoup(page.text, 'lxml')
    for j in table.find_all('tr')[3:]:
        row_data = j.find_all('td')
        row = [tr.text for tr in row_data][0:11]
        length = len(df)
        df.loc[length] = row

while start_year < end_year:
    download_chess(start_year)
    start_year += i
download_chess(start_year)
I don't have much experience, so I don't quite understand the logic of constructing the filenames for each year. I hope you can help me.
The following will retrieve information for a range of years (in this case 2000 through 2018) and save each year's table to a CSV as well:
import requests
import pandas as pd
years = range(2000, 2019, 2)
for y in years:
    try:
        df = pd.read_html(f'https://www.olimpbase.org/{y}/{y}te14.html')[1]
        new_header = df.iloc[2]
        df = df[3:]
        df.columns = new_header
        print(df)
        df.to_csv(f'chess_olympics_{y}.csv')
    except Exception as e:
        print(y, 'error', e)
This will print out the results table for each year:
   no.  team     Elo   flag  code  pos.  pts  Buch   MP  gms  nan  +   =  -  nan  +   =   -  nan  %     Eloav  Elop  ind.medals
3  1    Russia   2685  nan   RUS   1     38   457.5  20  56   nan  8   4  2  nan  23  30  3  nan  67.9  2561   2694  1 - 0 - 2
4  2    Germany  2604  nan   GER   2     37   455.5  22  56   nan  10  2  2  nan  21  32  3  nan  66.1  2568   2685  0 - 0 - 2
5  3    Ukraine  2638  nan   UKR   3     35½  457.5  21  56   nan  8   5  1  nan  18  35  3  nan  63.4  2558   2653  1 - 0 - 0
6  4    Hungary  2661  nan   HUN   4     35½  455.5  21  56   nan  8   5  1  nan  22  27  7  nan  63.4  2570   2665  0 - 0 - 0
7  5    Israel   2652  nan   ISR   5     34½  463.5  20  56   nan  7   6  1  nan  17  35  4  nan  61.6  2562   2649  0 - 0 - 0
[...]
Relevant documentation for pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
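If you'd rather keep the BeautifulSoup approach from the question, the two things to fix are parsing the response you just fetched (the function currently parses the old page object) and looking the table up inside the function for each year. A minimal sketch, assuming every year's page uses the same layout as 2016 (every data row from the fourth <tr> onward has at least eleven cells):
import requests
import pandas as pd
from bs4 import BeautifulSoup

def download_chess(year):
    # fetch and parse the page for the requested year
    url = f'https://www.olimpbase.org/{year}/{year}te14.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    table = soup.find('table', attrs={'border': '1'})
    # headers are the first eleven 'bog' cells; data rows start at the fourth <tr>
    headers = [td.text.strip() for td in table.find_all('td', class_='bog')[1:12]]
    rows = [[td.text for td in tr.find_all('td')][0:11]
            for tr in table.find_all('tr')[3:]]
    return pd.DataFrame(rows, columns=headers)

for year in range(2012, 2017, 2):  # 2012, 2014, 2016
    df = download_chess(year)
    df.to_csv(f'chess_olympics_{year}.csv', index=False)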
There is a website whose data is pulled from an API. The maximum number of rows per page is 100, and the API URL changes from page 1 to page 2 to page 3, and so on. So far I have reused the same script and just switched out the URL, but then I also have to save each run to a different Excel file or the previous data gets overwritten.
I'd like a script that pulls all the information from this table and places it into one Excel sheet without the values being overwritten.
The main page I'm using is http://www.nhl.com/stats/teams?aggregate=0&report=daysbetweengames&reportType=game&dateFrom=2021-10-12&dateTo=2021-11-30&gameType=2&filter=gamesPlayed,gte,1&sort=a_teamFullName,daysRest&page=0&pageSize=50 but please keep in mind that all the information on that page is being pulled from an API.
Here is the code I'm using:
import requests
import json
import pandas as pd
url = ('https://api.nhle.com/stats/rest/en/team/daysbetweengames?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22teamFullName%22,%22direction%22:%22ASC%22%7D,%7B%22property%22:%22daysRest%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=500&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameDate%3C=%222021-11-30%2023%3A59%3A59%22%20and%20gameDate%3E=%222021-10-12%22%20and%20gameTypeId=2')
resp = requests.get(url).text
resp = json.loads(resp)
df = pd.DataFrame(resp['data'])
df.to_excel('Master File.xlsx', sheet_name = 'Info')
Any help would be greatly appreciated.
Thanks!
The URL has start=..., so you can use a for loop that replaces this value with 0, 100, 200, etc., runs the request for each URL, and appends the results to one DataFrame.
It is simpler if you put all the arguments from the URL (after the ? character) into a dictionary and pass it with get(url, params=...).
requests has response.json(), so you don't need json.loads(response.text).
import requests
import pandas as pd
# --- before loop ---
url = 'https://api.nhle.com/stats/rest/en/team/daysbetweengames'
payload = {
    'isAggregate': 'false',
    'isGame': 'true',
    'start': 0,
    'limit': 100,
    'sort': '[{"property":"teamFullName","direction":"ASC"},{"property":"daysRest","direction":"DESC"},{"property":"teamId","direction":"ASC"}]',
    'factCayenneExp': 'gamesPlayed>=1',
    'cayenneExp': 'gameDate<="2021-11-30 23:59:59" and gameDate>="2021-10-12" and gameTypeId=2',
}
all_rows = []
# --- loop ---
for start in range(0, 1000, 100):
    print('start:', start)
    payload['start'] = start
    response = requests.get(url, params=payload)
    data = response.json()
    # DataFrame.append was removed in pandas 2.x, so collect the rows in a list
    all_rows.extend(data['data'])
# --- after loop ---
df = pd.DataFrame(all_rows)
print(df)
df.to_excel('Master File.xlsx', sheet_name='Info')
Result:
daysRest faceoffWinPct gameDate ... ties timesShorthandedPerGame wins
0 4 0.47169 2021-10-13 ... None 5.0 1
1 3 0.50847 2021-11-22 ... None 4.0 0
2 2 0.45762 2021-10-26 ... None 1.0 0
3 2 0.56666 2021-11-05 ... None 2.0 1
4 2 0.54716 2021-11-14 ... None 1.0 1
.. ... ... ... ... ... ... ...
675 1 0.37209 2021-10-28 ... None 2.0 1
676 1 0.48000 2021-10-21 ... None 3.0 1
677 0 0.57692 2021-11-06 ... None 1.0 0
678 0 0.32727 2021-11-19 ... None 3.0 0
679 0 0.47169 2021-11-27 ... None 4.0 1
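If you don't know how many rows there are in advance, you can keep requesting pages until the API returns fewer rows than the page size instead of hard-coding range(0, 1000, 100); a small sketch of that variation, reusing the url and payload from above:
all_rows = []
start = 0
while True:
    payload['start'] = start
    data = requests.get(url, params=payload).json()
    rows = data['data']
    all_rows.extend(rows)
    # a page shorter than the limit means there is no more data
    if len(rows) < payload['limit']:
        break
    start += payload['limit']
df = pd.DataFrame(all_rows)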
I would like to seek help with my Google Colaboratory notebook. The error occurs in the fourth cell.
Context:
We're scraping BTC's historical data from the web.
Here's my code:
First cell (Executed successfully)
#importing libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
Second cell (Executed successfully)
#sample url
url = "https://www.bitrates.com/coin/BTC/historical-data/USD?period=allData&limit=500"
#request the page
page = requests.get(url)
#creating a soup object and the parser
soup = BeautifulSoup(page.text, 'lxml')
#creating a table body to pass on the soup to find the table
table_body = soup.find('table')
#creating an empty list to store information
row_data = []
#creating a table
for row in table_body.find_all('tr'):
    col = row.find_all('td')
    col = [ele.text.strip() for ele in col]  # stripping the whitespaces
    row_data.append(col)  # append the column
# extracting all data on table entries
df = pd.DataFrame(row_data)
df
Third cell (Executed successfully)
headers = []
for i in soup.find_all('th'):
    col_name = i.text.strip().lower().replace(" ", "_")
    headers.append(col_name)
headers
Fourth cell (Execution failed)
df = pd.DataFrame(row_data, columns=headers)
df
#into a file
df.to_csv('/content/file.csv')
The error! :(
AssertionError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in _list_to_arrays(data, columns, coerce_float, dtype)
563 try:
--> 564 columns = _validate_or_indexify_columns(content, columns)
565 result = _convert_object_array(content, dtype=dtype, coerce_float=coerce_float)
AssertionError: 13 columns passed, passed data had 7 columns
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in _list_to_arrays(data, columns, coerce_float, dtype)
565 result = _convert_object_array(content, dtype=dtype, coerce_float=coerce_float)
566 except AssertionError as e:
--> 567 raise ValueError(e) from e
568 return result, columns
569
ValueError: 13 columns passed, passed data had 7 columns
To load the table you can use a simple pd.read_html() call. For example:
import pandas as pd
url = "https://www.bitrates.com/coin/BTC/historical-data/USD?period=allData&limit=500"
df = pd.read_html(url)[0]
print(df)
df.to_csv("data.csv")
This creates data.csv.
To correct your example: the header list was built from every <th> on the page instead of just this table, and rows without <td> cells were collected as well, so the number of column names did not match the width of the data rows. Scoping both lookups to the table and selecting only rows that contain <td> cells fixes that:
# importing libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
# sample url
url = "https://www.bitrates.com/coin/BTC/historical-data/USD?period=allData&limit=500"
# request the page
page = requests.get(url)
# creating a soup object and the parser
soup = BeautifulSoup(page.text, "lxml")
# creating a table body to pass on the soup to find the table
table_body = soup.find("table")
# creating an empty list to store information
row_data = []
# creating a table
for row in table_body.select("tr:has(td)"):
    col = row.find_all("td")
    col = [ele.text.strip() for ele in col]  # stripping the whitespaces
    row_data.append(col)  # append the column
# extracting all data on table entries
df = pd.DataFrame(row_data)
headers = []
for i in table_body.select("th"):
    col_name = i.text.strip().lower().replace(" ", "_")
    headers.append(col_name)
df = pd.DataFrame(row_data, columns=headers)
print(df)
df.to_csv("/content/file.csv")
import pandas as pd
df = pd.read_json(
    'https://www.bitrates.com/api/node/v1/symbols/USDTUSD/bitrates/series?aggregate=3&period=lastMonth').T['series'].to_dict()['data']
print(pd.DataFrame(df))
Output:
date open close ... supply market_volume24 btc_ratio
0 2021-04-11T06:00:00.000Z 0.999212 0.999114 ... 4.584629e+10 3.146109e+08 0.000016
1 2021-04-12T00:00:00.000Z 0.999114 0.999317 ... 4.584629e+10 2.100706e+09 0.000016
2 2021-06-04T18:00:00.000Z 0.999317 1.000613 ... 6.447629e+10 7.298208e+08 0.000025
3 2021-06-05T12:00:00.000Z 1.000613 1.000328 ... 0.000000e+00 6.502947e+09 0.000025
4 2021-06-06T06:00:00.000Z 1.000328 1.000499 ... 6.447629e+10 6.649574e+08 0.000025
5 2021-06-07T00:00:00.000Z 1.000499 1.000408 ... 6.447629e+10 8.272473e+09 0.000025
6 2021-06-07T18:00:00.000Z 1.000408 1.000338 ... 6.447629e+10 1.090599e+09 0.000025
7 2021-06-08T12:00:00.000Z 1.000338 1.000840 ... 6.447177e+10 2.196249e+09 0.000028
8 2021-06-09T06:00:00.000Z 1.000840 1.001088 ... 0.000000e+00 1.080053e+10 0.000028
9 2021-06-10T00:00:00.000Z 1.001088 1.000618 ... 6.447177e+10 4.158914e+09 0.000026
10 2021-06-10T18:00:00.000Z 1.000618 1.000436 ... 6.447177e+10 6.713012e+08 0.000026
11 2021-06-11T12:00:00.000Z 1.000436 1.000234 ... 6.447177e+10 4.093096e+09 0.000025
12 2021-06-12T06:00:00.000Z 1.000234 1.000385 ... 6.447177e+10 5.042653e+09 0.000026
13 2021-06-13T00:00:00.000Z 1.000385 1.000302 ... 0.000000e+00 5.502808e+09 0.000026
14 2021-06-13T18:00:00.000Z 1.000302 1.000110 ... 6.447177e+10 1.008952e+10 0.000024
15 2021-06-14T12:00:00.000Z 1.000110 1.000309 ... 6.447177e+10 7.405940e+09 0.000024
16 2021-06-15T06:00:00.000Z 1.000309 1.000205 ... 6.447177e+10 4.256491e+09 0.000023
17 2021-06-16T00:00:00.000Z 1.000205 1.000104 ... 0.000000e+00 1.495518e+09 0.000023
18 2021-06-16T18:00:00.000Z 1.000104 0.999833 ... 0.000000e+00 3.033091e+09 0.000024
19 2021-06-17T12:00:00.000Z 0.999833 1.000016 ... 6.447177e+10 1.449031e+08 0.000024
20 2021-07-10T00:00:00.000Z 1.000016 1.000100 ... 6.446977e+10 7.586923e+08 0.000025
21 2021-07-10T18:00:00.000Z 1.000100 1.000199 ... 6.446977e+10 2.312489e+09 0.000025
22 2021-07-11T12:00:00.000Z 1.000199 1.000134 ... 6.446977e+10 2.236517e+09 0.000024
23 2021-07-12T06:00:00.000Z 1.000134 1.000192 ... 6.446977e+10 8.140557e+09 0.000024
24 2021-07-13T00:00:00.000Z 1.000192 1.000290 ... 6.446977e+10 3.846952e+09 0.000026
25 2021-07-13T18:00:00.000Z 1.000290 1.000411 ... 6.446977e+10 1.278604e+09 0.000026
26 2021-07-14T12:00:00.000Z 1.000411 1.000315 ... 6.446977e+10 3.279535e+09 0.000026
27 2021-07-15T06:00:00.000Z 1.000315 1.000142 ... 6.446977e+10 8.086642e+08 0.000026
28 2021-07-16T00:00:00.000Z 1.000142 1.000295 ... 6.446977e+10 1.187211e+09 0.000027
29 2021-07-16T18:00:00.000Z 1.000295 1.000610 ... 6.446977e+10 7.721854e+08 0.000027
30 2021-07-17T12:00:00.000Z 1.000610 1.000535 ... 6.446977e+10 4.535049e+09 0.000027
31 2021-07-18T06:00:00.000Z 1.000535 1.000610 ... 6.446977e+10 2.345491e+09 0.000026
32 2021-07-19T00:00:00.000Z 1.000610 1.000386 ... 6.446977e+10 4.725531e+09 0.000027
33 2021-07-19T18:00:00.000Z 1.000386 1.000215 ... 6.446977e+10 3.314499e+09 0.000028
34 2021-07-20T12:00:00.000Z 1.000215 1.000324 ... 6.446977e+10 5.315525e+09 0.000030
35 2021-07-21T06:00:00.000Z 1.000324 1.000277 ... 6.446977e+10 7.141479e+09 0.000028
36 2021-07-22T00:00:00.000Z 1.000277 1.000255 ... 6.446977e+10 2.533840e+09 0.000028
37 2021-07-22T18:00:00.000Z 1.000255 1.000325 ... 6.446977e+10 2.699050e+09 0.000027
38 2021-07-23T12:00:00.000Z 1.000325 1.000363 ... 6.446977e+10 2.681340e+09 0.000026
39 2021-07-24T06:00:00.000Z 1.000363 1.000644 ... 6.446974e+10 6.241232e+08 0.000026
[40 rows x 10 columns]
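If you want to work with the series afterwards, it can help to parse the timestamps and index on them; a small sketch, reusing the frame from the snippet above (the column names come from the printed output):
series = pd.DataFrame(df)  # df above holds the raw 'data' records
# parse the ISO timestamps and sort the frame chronologically
series['date'] = pd.to_datetime(series['date'])
series = series.set_index('date').sort_index()
print(series[['open', 'close']].head())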
I'm practicing some web scraping and for this project I'm scraping this website: https://assetdash.com/?all=true
I'm getting and parsing the HTML code as follows:
from urllib.request import urlopen
from bs4 import BeautifulSoup

my_url = 'https://assetdash.com/?all=true'
client = urlopen(my_url)
page_html = client.read()
client.close()
soup = BeautifulSoup(page_html, 'html.parser')
rows = soup.findAll("tr", {"class":"Table__Tr-sc-1pfmqa-5 gNrtPb"})
print(len(rows))
This returns a length of 0 whereas it should be returning a much higher value. Have I done something wrong with the parsing or am I retrieving the rows incorrectly?
It's dynamic and rendered with JavaScript. Go straight to the source of the data.
Code:
import requests
import pandas as pd
my_url = 'https://assetdash.herokuapp.com/assets?currentPage=1&perPage=200&typesOfAssets[]=Stock&typesOfAssets[]=ETF&typesOfAssets[]=Cryptocurrency'
data = requests.get(my_url).json()
df = pd.DataFrame(data['data'])
Output:
print (df)
id ticker ... peRatio rank
0 60 AAPL ... 35.17 1
1 2287 MSFT ... 34.18 2
2 251 AMZN ... 91.52 3
3 1527 GOOGL ... 33.79 4
4 1276 FB ... 31.09 5
.. ... ... ... ... ...
195 537 BMWYY ... 15.06 196
196 3756 WBK ... 35.57 197
197 1010 DG ... 23.40 198
198 1711 HUM ... 12.77 199
199 1194 EQNR ... -15.82 200
[200 rows x 13 columns]
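The URL already exposes currentPage and perPage parameters, so if you need more than the first 200 rows you can loop over pages and concatenate the results; a minimal sketch, assuming the API accepts higher page numbers and returns an empty 'data' list once you run out of rows:
import requests
import pandas as pd

base_url = 'https://assetdash.herokuapp.com/assets'
frames = []
for page in range(1, 6):  # first five pages as an example
    params = {
        'currentPage': page,
        'perPage': 200,
        'typesOfAssets[]': ['Stock', 'ETF', 'Cryptocurrency'],
    }
    rows = requests.get(base_url, params=params).json()['data']
    if not rows:  # assumed end-of-data signal
        break
    frames.append(pd.DataFrame(rows))
df = pd.concat(frames, ignore_index=True)
print(df)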