Web Scraping Python For Two Different Buttons

I am trying to scrape data from https://www.wsj.com/market-data/bonds/treasuries.
There are two tables on this website which get switched when we select the options:
1. Treasury Notes and Bonds
2. Treasury Bills
I want to scrape the data for Treasury Bills, but there is no change in the link, attributes, or anything else when I click that option. I have tried a lot of things, but every time I am only able to scrape the data for Treasury Notes and Bonds.
Can someone help me with that?
Following is my code:
import re
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('tr')
list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = re.sub(clean, '', str_cells)
    list_rows.append(clean2)
df = pd.DataFrame(list_rows)
df1 = df[0].str.split(',', expand=True)

All the data on the site is loaded once, and then JavaScript is used to update the values in the table.
Here is some quickly written, working code:
import requests
from bs4 import BeautifulSoup
import json

mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('script')  # we get all the script tags
importantJson = ''
for r in rows:
    text = r.text
    if 'NOTES_AND_BONDS' in text:  # the script tag containing the data; you can probably do this better
        importantJson = text
        break
# remove the non-JSON stuff
importantJson = importantJson\
    .replace('window.__STATE__ =', '')\
    .replace(';', '')\
    .strip()
# parse the JSON
jsn = json.loads(importantJson)
print(jsn)  # JSON object containing all the data you need
How did I get to this conclusion?
First, I noticed that switching between the two tables makes no HTTP requests to the server, meaning the data is already there.
Then I inspected the table HTML and noticed that there is only one table whose contents are changed dynamically, which led me to the conclusion that this data is already on the page.
Then, with a simple search in the source, I found the script tag containing the JSON.
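If you want just the Treasury Bills half of that state object, one rough way to find it (continuing from the snippet above, where jsn holds the parsed state) is to walk the JSON and look for keys mentioning 'BILLS'. This is only a sketch; the exact key names inside window.__STATE__ are an assumption here, so inspect the printed paths and drill into jsn accordingly:
# Rough helper to locate where the Treasury Bills data lives inside the
# parsed state object. The key name 'BILLS' is a guess based on the
# 'NOTES_AND_BONDS' marker used above; verify it against the real structure.
def find_keys(obj, needle, path=''):
    hits = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            p = path + '/' + str(k)
            if needle in str(k):
                hits.append(p)
            hits.extend(find_keys(v, needle, p))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            hits.extend(find_keys(v, needle, '{}[{}]'.format(path, i)))
    return hits

print(find_keys(jsn, 'BILLS'))  # paths to any keys that mention BILLS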

Scraping a table from html using python and beautifulsoup

I am trying to scrape data from a table on a government website. I have tried to watch some tutorials, but so far to no avail (coding dummy over here!!!).
I would like to get a .csv file out of the tables they have, containing the date, the type of event, and the adopted measures for a project. I'll leave the website here if any of you want to crack it!
https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps
!pip install beautifulsoup4
!pip install requests

from bs4 import BeautifulSoup
import requests

url = "https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
FIRST_table = soup.find('table', class_='tableau-timeline')
print(FIRST_table)
for timeline in FIRST_table.find_all('tbody'):
    rows = timeline.find_all('tr')
    for row in rows:
        pl_timeline = row.find('td', class_='date').text
        print(pl_timeline)
I was expecting to get the dates in order and to use the same for loop to get the other two columns as well, by tweaking it for "Type d'événement" and "Mesures adoptées".
What am I doing wrong? How can I tweak it? (I am using Colab, if it makes any difference.) Thanks in advance.
Make your life easier and just use pandas to parse the tables and dump the data to a .csv file.
For example, this gets you all the tables, merges them, and spits out a .csv file:
import pandas as pd
import requests
url = "https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps"
tables = pd.read_html(requests.get(url).text, flavor="lxml")
df = pd.concat(tables)
df.to_csv("data.csv", index=False)

Getting Rating data from Vivino page with Python

I am scraping the ranking data from Vivino.
The website is: link
But I got an empty data list. Can anyone help me?
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszVXy02ssDU2UEuutPXzUUu2dQ0NUiuwNVRLT7MtSyzKTC1JzFHLT7ItSizJzEsvjk8sSy1KTE9Vy7dNSS1OVisviY4FKgZTRhDKGMozgdDmEMoEAJ7xJhY%3D"  # the url of the website we're scraping
page = requests.get(url)
html = BeautifulSoup(page.content, "html.parser")
Name_html = html.find_all(class_="item_description")
Rating_html = html.find_all(class_="vivinoRating__averageValue--3Navj")
Name_list = []
Rating_list = []
# loop over HTML elements, get text and add to list
for i in Name_html:
    Name_list.append(i)
for i in Rating_html:
    Rating_list.append(i)
# make a dictionary with column names for data to put in DataFrame
data = {"description": Name_list,
        "airdate": Rating_list}
df = pd.DataFrame(data)  # make a dataframe
The wine labels are loaded through JavaScript; you can't scrape them with bs4 as if they were part of a static HTML page.
You need to either manually send the POST requests to get the response content, or scrape the page dynamically with Selenium through any of the available browser-automation drivers (e.g. chromedriver).
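For the Selenium route, a minimal sketch could look like the following. The class names used in the selectors are taken from the attempt above and may well have changed on Vivino's side, so treat them as assumptions and check the live page first:
# Minimal Selenium sketch (the second option above). Requires chromedriver on PATH.
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.vivino.com/explore?e=eJzLLbI11jNVy83MszVXy02ssDU2UEuutPXzUUu2dQ0NUiuwNVRLT7MtSyzKTC1JzFHLT7ItSizJzEsvjk8sSy1KTE9Vy7dNSS1OVisviY4FKgZTRhDKGMozgdDmEMoEAJ7xJhY%3D"

driver = webdriver.Chrome()
driver.get(url)
driver.implicitly_wait(10)  # give the JavaScript time to render the wine cards

# class names below are assumptions taken from the question; inspect the page to confirm
names = [e.text for e in driver.find_elements(By.CLASS_NAME, "item_description")]
ratings = [e.text for e in driver.find_elements(By.CSS_SELECTOR, "[class*='averageValue']")]
driver.quit()

print(list(zip(names, ratings)))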

scrape book body text from project gutenberg de

I am new to Python and I am looking for a way to extract, with Beautiful Soup, existing open source books that are available on gutenberg-de, such as this one.
I need to use them for further analysis and text mining.
I tried this code, found in a tutorial, and it extracts metadata, but instead of the body content it gives me a list of the "pages" I need to scrape the text from.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
    "https://www.projekt-gutenberg.org/keller/heinrich/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_title, page_head)
I suppose I could use that as a second step to extract it then? I am not sure how, though.
Ideally I would like to store them in a tabular way and be able to save them as CSV, preserving the metadata: author, title, year, and chapter. Any ideas?
What happens?
First of all, you get a list of pages because you are not requesting the right URL. Change it to:
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
I recommend that, if you are looping over all the URLs, you store the content in a list of dicts and push it to CSV or pandas or ...
Example
import requests
from bs4 import BeautifulSoup

data = []

# Make a request
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
soup = BeautifulSoup(page.content, 'html.parser')

data.append({
    'title': soup.title,
    'chapter': soup.h2.get_text(),
    'text': ' '.join([p.get_text(strip=True) for p in soup.select('body p')[2:]])
})

data
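And here is a sketch of the looping idea, pushed into a CSV at the end. The chapter file names (hein101.html, hein102.html, ...) are an assumption based on the first chapter's URL; if they differ, collect the real chapter links from the index page instead:
# Loop over a few chapter pages and dump the collected rows to a CSV.
# The URL pattern and the range are assumptions for illustration.
import requests
import pandas as pd
from bs4 import BeautifulSoup

base = 'https://www.projekt-gutenberg.org/keller/heinrich/'
data = []
for n in range(101, 106):  # first few chapters, hypothetical range
    page = requests.get('{}hein{}.html'.format(base, n))
    if page.status_code != 200:
        break
    soup = BeautifulSoup(page.content, 'html.parser')
    data.append({
        'title': soup.title.get_text(strip=True) if soup.title else '',
        'chapter': soup.h2.get_text(strip=True) if soup.h2 else '',
        'text': ' '.join(p.get_text(strip=True) for p in soup.select('body p')[2:]),
    })

pd.DataFrame(data).to_csv('heinrich.csv', index=False)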

BeautifulSoup not working after the first page

I'm trying to use Python's BeautifulSoup to scrape data from the following website. The data on the website is split over four different pages. Each page has a unique link (i.e. http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/0 for the first page, http://insider.espn.com/nbadraft/results/top100/_/year/2019/set/1 for the second page, etc.). I am able to successfully scrape the data on the first page, but when I try to scrape data for the second page onward it comes up empty. Here is the code I'm using:
# Import libraries
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd
# Define url and request webpage
season = 2019
page = 1
url = "http://insider.espn.com/nbadraft/results/top100/_/year/{}/set/{}".format(season, page)
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
page_soup = soup(webpage, "html.parser")
# Scrape all of the data in the table
rows = page_soup.findAll('tr')[1:]
player_stats = [[td.getText() for td in rows[i].findAll('td')]
                for i in range(len(rows))]
# Get the column headers
headers = player_stats[0]
# Remove the first row
player_stats.pop(0)
# Convert to pandas dataframe
df = pd.DataFrame(player_stats, columns = headers)
# Remove all rows where Name = None
df = df[~df['NAME'].isnull()]
# Remove PLAYER column because it's empty
df = df.drop(columns='PLAYER')
df
Any advice would be much appreciated! I'm a bit new to using BeautifulSoup, so I apologize in advance if the code isn't particularly nice or efficient.
Update: The links only work if opened in Chrome, which is likely what is causing the problem. Is there any way around it?
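For reference, a sketch of looping over the four set values described above, reusing the same User-Agent header, is below. This only automates the pagination and is not a confirmed fix for the empty results on set/1 and beyond, so check what each response actually returns:
# Loop over the four result pages (/set/0 ... /set/3) and stack whatever
# tables come back. Does not by itself resolve why later pages are empty.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup
import pandas as pd

season = 2019
frames = []
for page in range(4):
    url = "http://insider.espn.com/nbadraft/results/top100/_/year/{}/set/{}".format(season, page)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    page_soup = soup(urlopen(req).read(), "html.parser")
    rows = page_soup.findAll('tr')[1:]
    player_stats = [[td.getText() for td in row.findAll('td')] for row in rows]
    if len(player_stats) > 1:
        frames.append(pd.DataFrame(player_stats[1:], columns=player_stats[0]))

df = pd.concat(frames) if frames else pd.DataFrame()
print(df.head())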

Is there a way to choose the column for .writerow?

I want to scrape a site and save the data in a csv file that could be opened in Excel. I've managed to retrieve the information, but I have trouble transferring it to a csv document. When I open the document, the headers are there and in different columns, but the actual contents are in the same one, name first and price second.
I have tried putting file.writerow([Name, Price]) at the end of the code, but, probably because I've used span.find for name, only the last name value is displayed. I figured file.writerow has to be in the loop to work, but I can't move the data to another column.
Here's the code:
import requests
from bs4 import BeautifulSoup
import csv
file = csv.writer(open('GPU.csv', 'w'))
file.writerow(['Name','Price'])
url = 'link'
page = requests.get(url)
soup = BeautifulSoup(page.text,'html.parser')
for span in soup.findAll('span', attrs={'class': 'details'}):
    name = span.find('a').string
    file.writerow([name])
for span in soup.findAll('span', attrs={'class': 'price'}):
    price = span.findAll(text=True)
    file.writerow([price])
If there is nothing I can do with file.writerow, looping could be the issue. I have no experience with coding and would appreciate any advice.
The csv module can only write full rows sequentially; you can't go back and fill in a second column later. However, you can gather the names and prices up into separate lists up front, then use the zip() function to iterate over them in pairs, like so:
import requests
from bs4 import BeautifulSoup
import csv
url = "link"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
names = []
prices = []
for span in soup.findAll("span", attrs={"class": "details"}):
names.append(span.find("a").string)
for span in soup.findAll("span", attrs={"class": "price"}):
prices.append(span.findAll(text=True))
file = csv.writer(open("GPU.csv", "w"))
file.writerow(["Name", "Price"])
for name, price in zip(names, prices):
file.writerow([name, price])
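As a side note, the csv docs recommend opening the file with newline='' (and a with block closes it for you), which avoids blank lines between rows on Windows. A variant of the writing part:
# Same zip() idea, with the file handled by a context manager and newline=''
with open("GPU.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Price"])
    for name, price in zip(names, prices):
        writer.writerow([name, price])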
