Scraping a table from HTML using Python and BeautifulSoup

I am trying to scrape data from a table on a government website. I have tried watching some tutorials, but so far to no avail (coding dummy over here!).
I would like to get a .csv file out of the tables they have containing the date, the type of event, and the adopted measures for a project. Here is the website if any of you want to take a crack at it!
https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps
!pip install beautifulsoup4
!pip install requests

from bs4 import BeautifulSoup
import requests

url = "https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

FIRST_table = soup.find('table', class_='tableau-timeline')
print(FIRST_table)

for timeline in FIRST_table.find_all('tbody'):
    rows = timeline.find_all('tr')
    for row in rows:
        pl_timeline = row.find('td', class_='date').text
        print(pl_timeline)
I was expecting to get the dates in order, and then to reuse the same for loop for the other two columns by tweaking it for "Type d'événement" and "Mesures adoptées".
What am I doing wrong? How can I tweak it? (I am using Colab, if it makes any difference.) Thanks in advance.

Make your life easier and just use pandas to parse the tables and dump the data to a .csv file.
For example, this gets you all the tables, merges them, and spits out a .csv file:
import pandas as pd
import requests

url = "https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps"
tables = pd.read_html(requests.get(url).text, flavor="lxml")
df = pd.concat(tables)  # read_html returns a list of DataFrames; merge them
df.to_csv("data.csv", index=False)
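If you only want the timeline tables rather than every table on the page, read_html can also filter on attributes. A sketch, assuming the tables still carry the tableau-timeline class the question identified:

import pandas as pd
import requests

url = "https://www.inspq.qc.ca/covid-19/donnees/ligne-du-temps"
# attrs narrows the match to the class the question found; the bs4 flavor
# still matches if the table carries additional classes
tables = pd.read_html(requests.get(url).text,
                      attrs={"class": "tableau-timeline"}, flavor="bs4")
pd.concat(tables).to_csv("timeline.csv", index=False)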

Related

bs4 - How to extract table data from website?

Here is the link:
https://www.vit.org/WebReports/vesselschedule.aspx
I'm using BeautifulSoup and my goal was to extract the table from it.
I wrote this code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url="https://www.vit.org/WebReports/vesselschedule.aspx"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp_table = soup.find("table", attrs={"id": "ctl00_ContentPlaceHolder1_VesselScheduleControl1_Grid1"})
The final line of code just gave me None instead of the table.
I'm new to this web scraping, can you help me find a solution to get the table?
Why not pd.read_html(url)?
It will extract tables automatically
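For example (a sketch: the page nests several tables, so read_html returns a list and you still have to pick the right one, index 6 per the answers below):

import pandas as pd

# read_html fetches the URL itself and returns one DataFrame per <table>
tables = pd.read_html("https://www.vit.org/WebReports/vesselschedule.aspx")
print(len(tables))       # several nested tables on this page
print(tables[6].head())  # index 6 held the schedule when the answers below were written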
The problem is that the id you are looking the table up by is appended to the element dynamically via JS. Since the requests library only downloads the file at the URL, nothing is dynamically appended, and as a result your table has no id.
If you encounter a similar error in the future (the element exists but bs4 can't find it), try saving the response as text to an HTML file and inspecting it in your browser.
For your particular case this code could be used:
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.vit.org/WebReports/vesselschedule.aspx")
with open("tmp.html", "w", encoding="utf-8") as f:
    f.write(resp.text)

bs = BeautifulSoup(resp.text, "html.parser")
table = bs.find_all("table")[6]  # not the best way to select elements
rows = table.find_all("tr")
Warning: try to avoid this style of positional selecting; web pages are constantly updated, and such code may produce errors in the future.
I parsed the table, added each row to a list, and appended that to a data list. Here you go! I also added the full list on [Hashbin].
from bs4 import BeautifulSoup
import requests

url = "https://www.vit.org/WebReports/vesselschedule.aspx"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
table = soup.find_all('table')[6]  # not the best way, as darkKnight noted
rows = table.find_all('tr')

data = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append(cols)
print(data)
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.vit.org/WebReports/vesselschedule.aspx")
soup = BeautifulSoup(res.text, "html.parser")
Find the columns with the code below:
table = soup.find_all("table")[6]
columns = [col.get_text(strip=True) for col in table.find("tr", class_="HeadingRow").find_all("td")[1:-1]]
Find the row data with the code below:
main_lst = []
for row in table.find_all("tr", class_="Row"):
    lst = [i.get_text(strip=True) for i in row.find_all("td")[1:-1]]
    main_lst.append(lst)
Create the table using pandas:
import pandas as pd
df = pd.DataFrame(columns=columns, data=main_lst)
df
You need a way to specify a pattern that uniquely identifies the target table within the nested tabular structure. The following CSS pattern will grab that table based on a string it contains ("Shipline"), an attribute that is not present, and the table's relationship to other elements within the DOM.
You can then pass that specific table to read_html and do some cleaning on the returned DataFrame.
import requests
from bs4 import BeautifulSoup as bs
from pandas import read_html as rh
r = requests.get('https://www.vit.org/WebReports/vesselschedule.aspx').text
soup = bs(r, 'lxml')
df = rh(str(soup.select_one('table table:not([style]):-soup-contains("Shipline")')))[0]  # with earlier soupsieve versions, use :contains
df.dropna(how='all', axis=1, inplace=True)
df.columns = df.iloc[0, :]
df = df.iloc[1:, :]

Finding tables returns [] with bs4

I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(url.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!
The reason is that the response you get from requests.get() does not contain the table data; it is most likely loaded client-side by JavaScript.
What can you do about this? Using a Selenium WebDriver is one possible solution: you can wait until the table is loaded and becomes interactive, get the page content with Selenium, and then pass it to bs4 to do the scraping.
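A minimal sketch of that approach, assuming Chrome and the selenium package are installed (recent Selenium versions fetch a matching driver themselves):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://cryptoli.st/lists/fixed-supply")

# wait until the DataTables wrapper shows up in the DOM
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "dataTables_scroll"))
)

html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)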
You can check the response by writing it to a file:
f = open("demofile.html", "w", encoding='utf-8')
f.write(soup.prettify())
f.close()
and you will be able to see "...Loading..." where the table is expected.
I believe the data is loaded from a script tag. I have to go to work, so I can't spend more time right now working out how to properly recreate a DataFrame from the "|"-delimited data, but the following may serve as a starting point for others, as it extracts the relevant entries from the script tag for the table body.
import requests, re
import ast

r = requests.get('https://cryptoli.st/lists/fixed-supply').text
# grab the array literal assigned to cl.coinmainlist.dataraw in the script tag
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags=re.S).group(1)
data = ast.literal_eval(s)           # parse it as a Python list
data = [i.split('|') for i in data]  # each entry is a "|"-delimited record
print(data)
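For anyone picking this up: once data is a list of "|"-split rows, handing it to pandas is straightforward. A sketch (the column names are not recoverable from this snippet alone, so they are left as pandas' default integers):

import pandas as pd

# rows are the "|"-split records extracted above; column names unknown here
df = pd.DataFrame(data)
df.to_csv("fixed_supply.csv", index=False)
print(df.head())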

How do I change this code to scrape a table

I am struggling to scrape this website:
https://wix-visual-data.appspot.com/app/widget?pageId=cu7nt&compId=comp-kesofw00&viewerCompId=comp-kesofw00&siteRevision=947&viewMode=site&deviceType=desktop&locale=en&tz=Europe%2FLondon&width=980&height=890&instance=k983l1LiiUeOz5_3Pd_CLXbjfadc08q1fEu54xfh9aA.eyJpbnN0YW5jZUlkIjoiYjQ0MWIxMGUtNTRmNy00YzdhLTgwY2QtNmU0ZjkwYzljMzA3IiwiYXBwRGVmSWQiOiIxMzQxMzlmMy1mMmEwLTJjMmMtNjkzYy1lZDIyMTY1Y2ZkODQiLCJtZXRhU2l0ZUlkIjoiM2M3ZmE5OWItY2I3Yy00MTg0LTk1OTEtNWY0MDhmYWYwZmRhIiwic2lnbkRhdGUiOiIyMDIxLTAxLTMwVDAxOjIzOjAyLjU1MVoiLCJ1aWQiOiIzYWMyNDI3YS04NGVhLTQ0ZGUtYjYxMS02MTNiZTVlOWJiZGQiLCJkZW1vTW9kZSI6ZmFsc2UsImFpZCI6IjczYWE3ZWNjLTQyODUtNDY2My1iNjMxLTMzMjE0MWJiZDhhMiIsImJpVG9rZW4iOiI4ODNlMTg5NS05ZjhiLTBkZmUtMTU1Yy0zMTBmMWY2NmNjZGQiLCJzaXRlT3duZXJJZCI6ImVhYWU1MDEzLTMxZjgtNDQzNC04MDFhLTE3NDQ2N2EwZjE5YSIsImV4cGlyYXRpb25EYXRlIjoiMjAyMS0wMS0zMFQwNToyMzowMi41NTFaIiwiaGFzVXNlclJvbGUiOmZhbHNlfQ&currency=GBP&currentCurrency=GBP&vsi=795183b4-8f30-4854-bd85-77678dbe4cf8&consent-policy=%7B%22func%22%3A0%2C%22anl%22%3A0%2C%22adv%22%3A0%2C%22dt3%22%3A1%2C%22ess%22%3A1%7D&commonConfig=%7B%22brand%22%3A%22wix%22%2C%22bsi%22%3Anull%2C%22BSI%22%3Anull%7D
This URL has a table, but for some reason I am not able to scrape it into an Excel file. This is my current code in Python and this is what I have tried. Any help is much appreciated, thank you legends!
import urllib
import urllib.request
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://wix-visual-data.appspot.com/app/widget?pageId=cu7nt&compId=comp-kesofw00&viewerCompId=comp-kesofw00&siteRevision=947&viewMode=site&deviceType=desktop&locale=en&tz=Europe%2FLondon&width=980&height=890&instance=dxGyx3zK9ULK0A8UtGOrLw-__FTD9EBEfzQojJ7Bz00.eyJpbnN0YW5jZUlkIjoiYjQ0MWIxMGUtNTRmNy00YzdhLTgwY2QtNmU0ZjkwYzljMzA3IiwiYXBwRGVmSWQiOiIxMzQxMzlmMy1mMmEwLTJjMmMtNjkzYy1lZDIyMTY1Y2ZkODQiLCJtZXRhU2l0ZUlkIjoiM2M3ZmE5OWItY2I3Yy00MTg0LTk1OTEtNWY0MDhmYWYwZmRhIiwic2lnbkRhdGUiOiIyMDIxLTAxLTI5VDE4OjM0OjQwLjgwM1oiLCJ1aWQiOiIzYWMyNDI3YS04NGVhLTQ0ZGUtYjYxMS02MTNiZTVlOWJiZGQiLCJkZW1vTW9kZSI6ZmFsc2UsImFpZCI6IjczYWE3ZWNjLTQyODUtNDY2My1iNjMxLTMzMjE0MWJiZDhhMiIsImJpVG9rZW4iOiI4ODNlMTg5NS05ZjhiLTBkZmUtMTU1Yy0zMTBmMWY2NmNjZGQiLCJzaXRlT3duZXJJZCI6ImVhYWU1MDEzLTMxZjgtNDQzNC04MDFhLTE3NDQ2N2EwZjE5YSIsImV4cGlyYXRpb25EYXRlIjoiMjAyMS0wMS0yOVQyMjozNDo0MC44MDNaIiwiaGFzVXNlclJvbGUiOmZhbHNlfQ&currency=GBP&currentCurrency=GBP&vsi=57130cda-8191-488e-8089-f472928266e3&consent-policy=%7B%22func%22%3A0%2C%22anl%22%3A0%2C%22adv%22%3A0%2C%22dt3%22%3A1%2C%22ess%22%3A1%7D&commonConfig=%7B%22brand%22%3A%22wix%22%2C%22bsi%22%3Anull%2C%22BSI%22%3Anull%7D"
table_id = "theTable"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table', attrs={"id": table_id})
df = pd.read_html(str(table))
The page is loading the data using JavaScript. You can find the URL it loads the data from using the Network tab in Firefox. Even better news: the data is in CSV format, so you don't even need an HTML parser to handle it.
You can find the CSV here.
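A sketch of that route. The csv_url below is a placeholder, not the real endpoint (the actual URL is the one linked above, copied from the Network tab); writing .xlsx additionally requires openpyxl:

import pandas as pd

# placeholder: substitute the CSV endpoint copied from the Network tab
csv_url = "https://wix-visual-data.appspot.com/app/...csv"  # hypothetical URL

df = pd.read_csv(csv_url)
df.to_excel("table.xlsx", index=False)  # the Excel file the question asked for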

Web Scraping Python For Two Different Buttons

I am trying to scrape data from https://www.wsj.com/market-data/bonds/treasuries.
There are two tables on this website which get switched when you select one of the options:
1. Treasury Notes and Bonds
2. Treasury Bills
I want to scrape the data for Treasury Bills, but nothing in the link or the attributes changes when I click that option. I have tried a lot of things, but every time I end up scraping the data for Treasury Notes and Bonds.
Can someone help me with that?
Following is my code:
import re
import csv
import requests
import pandas as pd
from bs4 import BeautifulSoup

mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')

rows = soup.find_all('tr')
list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)
    clean = re.compile('<.*?>')
    clean2 = re.sub(clean, '', str_cells)
    list_rows.append(clean2)

df = pd.DataFrame(list_rows)
df1 = df[0].str.split(',', expand=True)
All the data on the site is loaded once, and then JavaScript is used to switch the values shown in the table.
Here is some quickly written, working code:
import requests
from bs4 import BeautifulSoup
import json

mostActiveStocksUrl = "https://www.wsj.com/market-data/bonds/treasuries"
page = requests.get(mostActiveStocksUrl)
data = page.text
soup = BeautifulSoup(page.content, 'html.parser')

rows = soup.find_all('script')  # we get all the script tags
importantJson = ''
for r in rows:
    text = r.text
    if 'NOTES_AND_BONDS' in text:  # the script tag containing the data; you can probably do this better
        importantJson = text
        break

# remove the non-JSON stuff
importantJson = importantJson\
    .replace('window.__STATE__ =', '')\
    .replace(';', '')\
    .strip()

# parse the JSON
jsn = json.loads(importantJson)
print(jsn)  # JSON object containing all the data you need
How did I get to this conclusion?
First I noticed that switching between the two tables makes no HTTP requests to the server, meaning the data is already there.
Then I inspected the table HTML and noticed that there is only one table whose contents change dynamically, which led me to the conclusion that the data is already on the page.
Then, with a simple search of the source, I found the script tag containing the JSON.
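Since the shape of window.__STATE__ is not documented here, a generic helper (my sketch, not the site's schema) that walks the parsed object and prints the paths of keys matching a substring can help locate the Treasury Bills entries inside jsn:

def find_keys(obj, needle, path=""):
    # recursively yield the paths of dict keys containing `needle`
    if isinstance(obj, dict):
        for k, v in obj.items():
            p = f"{path}.{k}" if path else str(k)
            if needle in str(k):
                yield p
            yield from find_keys(v, needle, p)
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from find_keys(v, needle, f"{path}[{i}]")

# assumes `jsn` from the snippet above; the "BILLS" needle is a guess
for hit in find_keys(jsn, "BILLS"):
    print(hit)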

Pandas dataframe not exporting all rows to csv (but all are displayed in terminal)

I've started playing with pandas and web scraping. The code seems to work, and all result rows are displayed in the terminal when I run it; however, when I export to CSV, only half of the result rows end up in the file. It might have something to do with my iterating through URLs, but then I'm not sure why the results are still displayed correctly in the terminal.
import pandas as pd
import requests
import bs4
from bs4 import BeautifulSoup

urls = ['https://www.indeed.co.uk/jobs?q=Scrum+master&l=London',
        'https://www.indeed.co.uk/jobs?q=Scrum+master&l=London&start=10']

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.text, 'html.parser')
    job_results = soup.find(id='resultsCol')
    jobs = job_results.find_all(class_='jobsearch-SerpJobCard')
    titles = [job.find(class_='jobtitle').get_text() for job in jobs]
    descriptions = [job.find('div', attrs={'class': 'summary'}).get_text() for job in jobs]
    jobs_filtered = pd.DataFrame({
        'title': titles,
        'description': descriptions,
    })
    print(jobs_filtered)
    jobs_filtered.to_csv('jobs_filtered11.csv')
Each pass through the loop overwrites the file, so only the last page's rows survive. Please use append mode to get the required output:
jobs_filtered.to_csv('jobs_filtered11.csv', mode='a', header=False)  # header=True the first time, if necessary
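An alternative sketch (mine, not part of the original answer): collect each page's frame in a list and write the file once after the loop, which sidesteps the overwrite and avoids repeated header rows. It assumes Indeed's markup still matches the class names above:

import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = ['https://www.indeed.co.uk/jobs?q=Scrum+master&l=London',
        'https://www.indeed.co.uk/jobs?q=Scrum+master&l=London&start=10']

frames = []
for url in urls:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    jobs = soup.find(id='resultsCol').find_all(class_='jobsearch-SerpJobCard')
    frames.append(pd.DataFrame({
        'title': [j.find(class_='jobtitle').get_text() for j in jobs],
        'description': [j.find('div', class_='summary').get_text() for j in jobs],
    }))

# one write at the end: a single header row and every page's rows
pd.concat(frames, ignore_index=True).to_csv('jobs_filtered11.csv', index=False)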
