Geting Rating data from Vivino page with python - python

I am scraping the ranking data from Vivino.
The website is: link
but I got an empty data list can anyone help me. Image of the result i got Image
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszVXy02ssDU2UEuutPXzUUu2dQ0NUiuwNVRLT7MtSyzKTC1JzFHLT7ItSizJzEsvjk8sSy1KTE9Vy7dNSS1OVisviY4FKgZTRhDKGMozgdDmEMoEAJ7xJhY%3D" #the url of the website we're scraping
page = requests.get(url)
html = BeautifulSoup(page.content, "html.parser")
Name_html = html.find_all(class_="item_description")
Rating_html = html.find_all(class_="vivinoRating__averageValue--3Navj")
Name_list = []
Rating_list = []
#loop over HTML elements, get text and add to list
for i in Name_html:
Name_list.append(i)
for i in Rating_html:
Rating_list.append(i)
#make a dictionary with column names for data to put in DataFrame
data = {"description": Name_list,
"airdate": Rating_list}
df = pd.DataFrame(data) #make a dataframe

The wines labels are loaded through Javascript, you can't scrape them with bs4 as if they're part of a static html.
You need to either manually send the POST requests to get the response content, or scrape the page dinamically with selenium through any of the available browser simulation drivers (i.e. chromedriver).

Related

Extract data by looping though dates using pandas

I want to scrape exchange rate data from July 1 2021 to June 30 2022 by enumerating exchangeDate variable and save it to excel.
Here is my code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Set the URL for the website you want to scrape
url = "https://www.bot.go.tz/ExchangeRate/previous_rates?__RequestVerificationToken=P0qGKEy8P6ISFMLlu7mKvMi4YrMyeHc1aCz4ZuGQVyJ6mK9w6StV6QPyinF7ym_mAZG6yO6ShU1DuFm6teqBAxCcCrEQSjz7KtXzi2kbJH41&exchangeDate=04%2F05%2F2022"
# Send an HTTP request to the website and retrieve the HTML content
response = requests.get(url)
html = response.content
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Find the table containing the data you want to scrape
table = soup.find("table", attrs={"class": "table"})
# Extract the data from the table and save it to a Pandas DataFrame
df = pd.read_html(str(table))[0]
# Save the DataFrame to an Excel file
df.to_excel("exchange_Rate_data.xlsx", index=False)
How do I loop through all dates?
You can use something like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
start='2021-07-01'
end='2022-06-30'
dates=[i.replace('-','%2F') for i in pd.date_range(start,end,freq='d').strftime('%m-%d-%Y').tolist()]
final_df=pd.DataFrame()
for i in dates:
# Set the URL for the website you want to scrape
url = "https://www.bot.go.tz/ExchangeRate/previous_rates?__RequestVerificationToken=P0qGKEy8P6ISFMLlu7mKvMi4YrMyeHc1aCz4ZuGQVyJ6mK9w6StV6QPyinF7ym_mAZG6yO6ShU1DuFm6teqBAxCcCrEQSjz7KtXzi2kbJH41&exchangeDate={}".format(i)
# Send an HTTP request to the website and retrieve the HTML content
response = requests.get(url)
html = response.content
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Find the table containing the data you want to scrape
table = soup.find("table", attrs={"class": "table"})
# Extract the data from the table and save it to a Pandas DataFrame
df = pd.read_html(str(table))[0]
final_df=pd.concat([final_df,df])
final_df.to_excel("exchange_Rate_data.xlsx", index=False)

BeautifulSoup not working after the first page

I'm trying to use Python's BeautifulSoup to scrape data from the following website. The data on the website is split over four different pages. Each page has a unique link (i.e. http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/0 for the first page, http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/1 for the second page, etc.). I am able to successfully scrape the data on the first page, but when I try to scrape data for the second page onward it comes up empty. Here is the code I'm using:
# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
# Define url and request webpage
season = 2019
page = 1
url = "http://insider.espn.com/nbadraft/results/top100/_/year/{}/set/{}".format(season, page)
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
# Scrape all of the data in the table
rows = page_soup.findAll('tr')[1:]
player_stats = [[td.getText() for td in rows[i].findAll('td')]
for i in range(len(rows))]
# Get the column headers
headers = player_stats[0]
# Remove the first row
player_stats.pop(0)
# Convert to pandas dataframe
df = pd.DataFrame(player_stats, columns = headers)
# Remove all rows where Name = None
df = df[~df['NAME'].isnull()]
# Remove PLAYER column because it's empty
df = df.drop(columns='PLAYER')
df
Any advice would be much appreciated! I'm a bit new to using BeautifulSoup, so I apologize in advance if the code isn't particularly nice or efficient.
Update: The links only work if opened in Chrome, which is likely what is causing the problem. Is there any way around it?

Python web scraping a table with filters

I am having difficulty extracting text when web-scraping a table, and I think the filters on the page are to blame. I have tried isolating all the "tr" and "th" elements, but cannot seem to get the underlying text into Python. What am I doing wrong?
My code:
from bs4 import BeautifulSoup
import requests
page_link ='https://www.ersteliga.hu/stats#/players/1945/regular/points'
page = requests.get(page_link)
html = page.content
soup = BeautifulSoup(html, 'html.parser')
my_table = soup.find_all('table', class_= "IHD-TABLE")
columns = my_table.find('th')
I expect the output to contain the text values seen online, but I seem to get output that looks different than the HTML on the web page.
Data comes dynamically from another xhr request the page makes you can find in the network tab. It returns json.
import requests
import pandas as pd
headers = {
'referer': 'https://www.ersteliga.hu/stats',
'user-agent': 'Mozilla/5.0'
}
data = {'championshipId': '1945', 'division': 'Alapszakasz','type': 'playerStatsChampionShipPeriod'}
r = requests.post('https://www.ersteliga.hu/ajax/CallWS', headers = headers, data=data).json()
df = pd.DataFrame([i for i in r['d']], columns = list(r['d'][0].keys()))
print(df)
Sort on point column desc to get same order as on page
print(df.sort_values(['point'], ascending=[False]))

Web Scraping Python For Two Different Buttons

I am trying to scrape data from https://www.wsj.com/market-data/bonds/treasuries.
There are two tables on this website which get switched when we select the options:
1. Treasury Notes and Bond
2. Treasury Bills
I want to scrape the data for Treasury bills. But there is no change in the link and attributes or anything when i click that option. I have tried a lot of things but every time, i am able to scrape the data for Treasury Notes and Bond.
Can someone help me with that?
Following the my code:
import re
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup
mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('tr')
list_rows = []
for row in rows:
cells = row.find_all('td')
str_cells = str(cells)
clean = re.compile('<.*?>')
clean2 = (re.sub(clean, '',str_cells))
list_rows.append(clean2)
df = pd.DataFrame(list_rows)
df1 = df[0].str.split(',', expand=True)
All the data in the site is loaded once and then js is used to update the values in the table
Here is a working quickly written code:
import requests
from bs4 import BeautifulSoup
import json
mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('script') # we get all the script tags
importantJson = ''
for r in rows:
text = r.text
if 'NOTES_AND_BONDS' in text: # the scirpt tags containing the date, probably you can do this better
importantJson = text
break
# remove the non json stuff
importantJson = importantJson\
.replace('window.__STATE__ =', '')\
.replace(';', '')\
.strip()
#parse the json
jsn = json.loads(importantJson)
print(jsn) #json object containing all the data you need
How did I got to this conclusion?
First I noticed that switching between the two tables makes no http requests to the server, meaning the data is already there.
Then I inspected the table html and noticed that there is only one table and its contents are dynamically changing, which lead me to the conclusion that this data is already on the page.
Then with simple search in the source I found the script tag containing the json.

using beautiful soup for simulating a page-click to access all HTML on a page?

I'm trying to scrape the following website:
https://www.bandsintown.com/?came_from=257&sort_by_filter=Number+of+RSVPs
I'm able to successfully scrape the events listed on the page using beautifulsoup, using the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.bandsintown.com/?came_from=257&sort_by_filter=Number+of+RSVPs'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
dates = soup.find_all('div', {'class': 'event-b58f7990'})
month=[]
day=[]
for i in dates:
md = i.find_all('div')
month.append(md[0].text)
day.append(md[1].text)
However, the issue I'm having is that I'm only able to scrape the first 18 events - the rest of the page is only available if the 'view all' button is clicked at the bottom. Is there a way in beautifulsoup, or otherwise, to simulate this button being clicked, so that I can scrape ALL of the data? I'd prefer to keep this in python as I'm doing most scraping with beautifulsoup. Thanks so much!
If you can work out the end point or set an end point for range in the following (with error handling for going too far) you can get a json response and parse out the info you require as follows. Depending on how many requests making you may choose to re-use connection with session.
import requests
import pandas as pd
url = 'https://www.bandsintown.com/upcomingEvents?came_from=257&sort_by_filter=Number+of+RSVPs&page={}&latitude=51.5167&longitude=0.0667'
results = []
for page in range(1,20):
data = requests.get(url.format(page)).json()
for item in data['events']:
results.append([item['artistName'], item['eventDate']['day'],item['eventDate']['month']])
df = pd.DataFrame(results)
print(df)

Categories

Resources