scraping data from Chicago Mercantile Exchange website - python

I am trying to scrape data from a table on the CME website. Specifically, I want to pull the open interest data for every currency future, but when I try to parse the table I get nothing.
The link I am trying to scrape from, and the code I am using, are given below.
from bs4 import BeautifulSoup
import requests
url="https://www.cmegroup.com/market-data/volume-open-interest/fx-volume.html"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "html.parser")
table = soup.find("table", attrs={"class": "cmeData voiDataset"})
print(table)

The table data comes from another HTML document, which you can get with:
from bs4 import BeautifulSoup
import requests
url = 'https://www.cmegroup.com/CmeWS/mvc/xsltTransformer.do?xlstDoc=/XSLT/md/voi/voi_asset_class_final.xsl&url=/da/VOI/V2/Totals/TradeDate/20201116/AssetClassId/3/ReportType/F?excluded=CEE,CEU,KCB&hidelinks=false&html='
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "html.parser")
table = soup.find("table", attrs={"class": "cmeData voiDataset"})
print(table)
To get data for a specific date, you can change the URL as below:
# Date for 2020 November 12
date = '20201112'
url = 'https://www.cmegroup.com/CmeWS/mvc/xsltTransformer.do?xlstDoc=/XSLT/md/voi/voi_asset_class_final.xsl&url=/da/VOI/V2/Totals/TradeDate/{}/AssetClassId/3/ReportType/F?excluded=CEE,CEU,KCB&hidelinks=false&html='.format(date)
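If you need several dates, here is a minimal sketch (assuming the endpoint keeps this URL scheme; pandas can parse the returned table directly):
import requests
import pandas as pd
from io import StringIO

base = ('https://www.cmegroup.com/CmeWS/mvc/xsltTransformer.do'
        '?xlstDoc=/XSLT/md/voi/voi_asset_class_final.xsl'
        '&url=/da/VOI/V2/Totals/TradeDate/{}/AssetClassId/3/ReportType/F'
        '?excluded=CEE,CEU,KCB&hidelinks=false&html=')

frames = []
for date in ['20201112', '20201113', '20201116']:
    html_content = requests.get(base.format(date)).text
    # read_html parses every <table> in the fragment; take the first one
    df = pd.read_html(StringIO(html_content))[0]
    df['trade_date'] = date
    frames.append(df)

voi = pd.concat(frames, ignore_index=True)
print(voi.head())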

Related

web scraping infogol.net AttributeError with beautiful soup

I am trying to scrape the information for Ajax matches from Infogol. When I inspect the webpage I find that the table class is 'teamstats-summary-matches ng-scope', but when I try this I find nothing. So far I have come up with the following code:
import requests
from bs4 import BeautifulSoup
# Set the URL of the webpage you want to scrape
url = 'https://www.infogol.net/en/team/ajax/62'
# Make a request to the webpage
response = requests.get(url)
# Parse the HTML of the webpage
soup = BeautifulSoup(response.text, 'html.parser')
# Find the table containing the data
table = soup.find('table', class_='teamstats-summary-matches ng-scope')
if not table:
    print('Cannot find table')
Check that you have found what you are expecting before proceeding
import sys

# Find the table containing the data
table = soup.find('table', class_='stats-table')
if not table:
    print('Cannot find table')
    sys.exit(1)
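The ng-scope class is also a strong hint that the table is rendered client-side by Angular, so plain requests will never see it in the raw HTML. A sketch that renders the page with Selenium first (assuming a Chrome driver is available; the stats-table class comes from the snippet above):
import sys
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.infogol.net/en/team/ajax/62')
# Wait until Angular has rendered the table before grabbing the page source
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'table.stats-table')))
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', class_='stats-table')
if not table:
    print('Cannot find table')
    sys.exit(1)
print(table)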

Extract data by looping through dates using pandas

I want to scrape exchange rate data from July 1, 2021 to June 30, 2022 by enumerating the exchangeDate variable, and save it to Excel.
Here is my code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# Set the URL for the website you want to scrape
url = "https://www.bot.go.tz/ExchangeRate/previous_rates?__RequestVerificationToken=P0qGKEy8P6ISFMLlu7mKvMi4YrMyeHc1aCz4ZuGQVyJ6mK9w6StV6QPyinF7ym_mAZG6yO6ShU1DuFm6teqBAxCcCrEQSjz7KtXzi2kbJH41&exchangeDate=04%2F05%2F2022"
# Send an HTTP request to the website and retrieve the HTML content
response = requests.get(url)
html = response.content
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Find the table containing the data you want to scrape
table = soup.find("table", attrs={"class": "table"})
# Extract the data from the table and save it to a Pandas DataFrame
df = pd.read_html(str(table))[0]
# Save the DataFrame to an Excel file
df.to_excel("exchange_Rate_data.xlsx", index=False)
How do I loop through all dates?
You can use something like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
start='2021-07-01'
end='2022-06-30'
dates=[i.replace('-','%2F') for i in pd.date_range(start,end,freq='d').strftime('%m-%d-%Y').tolist()]
final_df=pd.DataFrame()
for i in dates:
    # Set the URL for the website you want to scrape
    url = "https://www.bot.go.tz/ExchangeRate/previous_rates?__RequestVerificationToken=P0qGKEy8P6ISFMLlu7mKvMi4YrMyeHc1aCz4ZuGQVyJ6mK9w6StV6QPyinF7ym_mAZG6yO6ShU1DuFm6teqBAxCcCrEQSjz7KtXzi2kbJH41&exchangeDate={}".format(i)
    # Send an HTTP request to the website and retrieve the HTML content
    response = requests.get(url)
    html = response.content
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    # Find the table containing the data you want to scrape
    table = soup.find("table", attrs={"class": "table"})
    # Extract the data from the table and save it to a Pandas DataFrame
    df = pd.read_html(str(table))[0]
    final_df = pd.concat([final_df, df])
final_df.to_excel("exchange_Rate_data.xlsx", index=False)
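A couple of refinements worth considering: keep the source date with each day's rows, skip days that return no table (weekends and holidays), and pause between requests. A sketch of the same loop with those changes, assuming the page also answers without the __RequestVerificationToken parameter (if it does not, keep the full original URL):
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

dates = pd.date_range('2021-07-01', '2022-06-30', freq='d').strftime('%m-%d-%Y')
frames = []
for d in dates:
    url = ('https://www.bot.go.tz/ExchangeRate/previous_rates'
           '?exchangeDate=' + d.replace('-', '%2F'))
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    table = soup.find('table', attrs={'class': 'table'})
    if table is None:              # weekends/holidays may have no table
        continue
    df = pd.read_html(str(table))[0]
    df['date'] = d                 # keep the source date with each row
    frames.append(df)
    time.sleep(1)                  # be polite to the server
pd.concat(frames).to_excel('exchange_Rate_data.xlsx', index=False)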

scrape book body text from project gutenberg de

I am new to Python and I am looking for a way to extract, with Beautiful Soup, open source books that are available on gutenberg-de, such as this one.
I need to use them for further analysis and text mining.
I tried this code, found in a tutorial; it extracts metadata, but instead of the body content it gives me a list of the "pages" I need to scrape the text from.
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get("https://www.projekt-gutenberg.org/keller/heinrich/")
soup = BeautifulSoup(page.content, 'html.parser')
# Extract title of page
page_title = soup.title
# Extract body of page
page_body = soup.body
# Extract head of page
page_head = soup.head
# print the result
print(page_title, page_head)
I suppose I could use that as a second step to extract it then? I am not sure how, though.
Ideally I would like to store them in a tabular way and be able to save them as CSV, preserving the metadata: author, title, year, and chapter. Any ideas?
What happens?
First of all, you get a list of pages because you are not requesting the right URL. Change it to:
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
If you are looping over all the URLs, I recommend storing the content in a list of dicts and pushing it to CSV or pandas or ...
Example
import requests
from bs4 import BeautifulSoup
data = []
# Make a request
page = requests.get('https://www.projekt-gutenberg.org/keller/heinrich/hein101.html')
soup = BeautifulSoup(page.content, 'html.parser')
data.append({
    'title': soup.title.get_text(),
    'chapter': soup.h2.get_text(),
    'text': ' '.join([p.get_text(strip=True) for p in soup.select('body p')[2:]])
})
print(data)
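Following that recommendation, here is a sketch that walks the chapter pages via the next-page link and writes a CSV. The link text 'weiter >>' and the hand-filled author/title metadata are assumptions; check them against the actual pages:
import csv
import requests
from bs4 import BeautifulSoup

data = []
url = 'https://www.projekt-gutenberg.org/keller/heinrich/hein101.html'
while url:
    page = requests.get(url)
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.text, 'html.parser')
    data.append({
        'author': 'Gottfried Keller',   # known metadata, filled in by hand
        'title': 'Der grüne Heinrich',
        'chapter': soup.h2.get_text(strip=True) if soup.h2 else '',
        'text': ' '.join(p.get_text(strip=True) for p in soup.select('body p')[2:]),
    })
    # assumed label of the next-chapter link; verify on the real pages
    nxt = soup.find('a', string='weiter >>')
    url = requests.compat.urljoin(url, nxt['href']) if nxt else None

with open('heinrich.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['author', 'title', 'chapter', 'text'])
    writer.writeheader()
    writer.writerows(data)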

Scraping a table with beautiful soup

I'm trying to scrape the price table (buy yes, prices and contracts available) from this site: https://www.predictit.org/Contract/7069/Will-the-Senate-pass-the-Better-Care-Reconciliation-Act-by-July-31#prices.
This is my (obviously very preliminary) code, structured now just to find the table:
from bs4 import BeautifulSoup
import requests
from lxml import html
import json, re
url = "https://www.predictit.org/Contract/7069/Will-the-Senate-pass-the-Better-Care-Reconciliation-Act-by-July-31#prices"
ret = requests.get(url).text
soup = BeautifulSoup(ret, "lxml")
try:
    table = soup.find('table')
    print(table)
except AttributeError as e:
    print('No tables found, exiting')
The code finds and parses a table; however, it's the wrong one (the data table on a different tab https://www.predictit.org/Contract/7069/Will-the-Senate-pass-the-Better-Care-Reconciliation-Act-by-July-31#data).
How do I resolve this error to ensure the code identifies the correct table?
As @downshift mentioned in the comments, the table is generated by JavaScript via an XHR request.
So you can either use Selenium or make a direct request to the site's api.
Using the 2nd option:
url = "https://www.predictit.org/PrivateData/GetPriceListAjax?contractId=7069"
ret = requests.get(url).text
soup = BeautifulSoup(ret, "lxml")
table = soup.find('table')
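From there you can walk the rows as usual; a minimal sketch (the exact column layout the endpoint returns is an assumption, so print a row to confirm):
import requests
from bs4 import BeautifulSoup

url = "https://www.predictit.org/PrivateData/GetPriceListAjax?contractId=7069"
soup = BeautifulSoup(requests.get(url).text, "lxml")
table = soup.find('table')

rows = []
for tr in table.find_all('tr'):
    # collect the text of every header/data cell in the row
    cells = [td.get_text(strip=True) for td in tr.find_all(['th', 'td'])]
    if cells:
        rows.append(cells)
for row in rows:
    print(row)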

Parsing NBA reference with python beautiful soup

So I'm trying to scrape the miscellaneous stats table from this site http://www.basketball-reference.com/leagues/NBA_2016.html using Python and Beautiful Soup. This is the basic code so far; I just want to see if it is even reading the table, but when I print table I just get None.
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "http://www.basketball-reference.com/leagues/NBA_2016.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
table = soup.find('table', id='misc_stats')
print(table)
When I inspect the HTML on the webpage itself, the table that I want appears with the symbol <!-- in front, and the HTML text is green for that portion. What can I do?
<!-- is the start of a comment and --> is the end in HTML, so just remove the comments before you parse it:
from bs4 import BeautifulSoup
import requests
import re

comm = re.compile("<!--|-->")
html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").text
cleaned_soup = BeautifulSoup(comm.sub("", html), 'html.parser')
tableStats = cleaned_soup.find('table', {'id': 'misc_stats'})
print(tableStats)
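An alternative that avoids regex on the raw HTML is to let BeautifulSoup hand you the comment nodes and parse each one; a sketch using bs4's Comment type:
import requests
from bs4 import BeautifulSoup, Comment

html = requests.get("http://www.basketball-reference.com/leagues/NBA_2016.html").text
soup = BeautifulSoup(html, 'html.parser')

table = soup.find('table', id='misc_stats')
if table is None:
    # the table lives inside an HTML comment, so parse each comment separately
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        table = BeautifulSoup(comment, 'html.parser').find('table', id='misc_stats')
        if table is not None:
            break
print(table)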
