Parsing a table on webpage using BeautifulSoup - python

Trying to get a table from the SGX website.
The page is saved to the local drive and I am using BeautifulSoup to parse it:
soup = BeautifulSoup(open(pages), "lxml")
soup.prettify()
list_0 = soup.find_all('table')[0]
print(list_0)
What it returned is not the first row on the page:
[<tr><td>Zhongmin Baihui</td><td>5SR</td><td class="nowrap">09:44 AM</td><td class="nowrap">09:49 AM</td><td>0.615</td><td>0.675</td><td>0.555</td></tr>]
What's the right way to retrieve this table?
Thank you.

The data is fetched after the page loads via an AJAX request. By inspecting the page you can find the API URL (the URL below), and then you can do something like this:
import pandas as pd
import requests
import json
response = requests.get('https://api.sgx.com/securities/v1.1?excludetypes=bonds&params=nc%2Cadjusted-vwap%2Cb%2Cbv%2Cp%2Cc%2Cchange_vs_pc%2Cchange_vs_pc_percentage%2Ccx%2Ccn%2Cdp%2Cdpc%2Cdu%2Ced%2Cfn%2Ch%2Ciiv%2Ciopv%2Clt%2Cl%2Co%2Cp_%2Cpv%2Cptd%2Cs%2Csv%2Ctrading_time%2Cv_%2Cv%2Cvl%2Cvwap%2Cvwap-currency')
# The JSON payload nests the rows under data -> prices
data = json.loads(response.content)["data"]["prices"]
df = pd.DataFrame(data)
print(df)
If your requirements are complex and you crawl on a regular basis, I recommend using Scrapy.
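As a starting point, a minimal Scrapy spider hitting the same API endpoint might look like this (the spider name and the idea of yielding each price row as an item are my assumptions, not part of the original answer):
import scrapy
import json

class SgxPricesSpider(scrapy.Spider):
    # Hypothetical spider name; start_urls reuses the API URL from above
    name = "sgx_prices"
    start_urls = ['https://api.sgx.com/securities/v1.1?excludetypes=bonds&params=nc%2Cadjusted-vwap%2Cb%2Cbv%2Cp%2Cc%2Cchange_vs_pc%2Cchange_vs_pc_percentage%2Ccx%2Ccn%2Cdp%2Cdpc%2Cdu%2Ced%2Cfn%2Ch%2Ciiv%2Ciopv%2Clt%2Cl%2Co%2Cp_%2Cpv%2Cptd%2Cs%2Csv%2Ctrading_time%2Cv_%2Cv%2Cvl%2Cvwap%2Cvwap-currency']

    def parse(self, response):
        # Each price row becomes one scraped item
        for price in json.loads(response.text)["data"]["prices"]:
            yield price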

Finding tables returns [] with bs4

I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(url.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!
The reason is that the response you get from requests.get() does not contain the table data.
It is likely loaded client-side (by JavaScript).
What can you do about this? Using a Selenium webdriver is a possible solution. You can "wait" until the table is loaded and becomes interactive, then get the page content with Selenium and pass it to bs4 to do the scraping, as sketched below.
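A minimal sketch of that approach, assuming Chrome and an arbitrary 15-second timeout on the "dataTables_scroll" container:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://cryptoli.st/lists/fixed-supply')
# Block until the DataTables container actually holds a table
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.dataTables_scroll table'))
)
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
print(soup.find_all('div', {'class': 'dataTables_scroll'}))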
You can check the response by writing it to a file:
with open("demofile.html", "w", encoding='utf-8') as f:
    f.write(soup.prettify())
and you will be able to see "...Loading..." where the table is expected.
I believe the data is loaded from a script tag. I have to go to work so can't spend more time working out how to appropriately recreate a dataframe from the "|" delimited data at present, but the following may serve as a starting point for others, as it extracts the relevant entries from the script tag for the table body.
import requests, re
import ast

r = requests.get('https://cryptoli.st/lists/fixed-supply').text
# Pull out the JavaScript array assigned to cl.coinmainlist.dataraw
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags=re.S).group(1)
data = ast.literal_eval(s)
# Each entry is a single "|"-delimited string; split it into fields
data = [i.split('|') for i in data]
print(data)
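As one possible continuation (an assumption on my part; the column names aren't recovered here, so pandas' default integer headers are used):
import pandas as pd

df = pd.DataFrame(data)
print(df.head())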

Scraping Webpage With Beautiful Soup

I am new to web scraping and I am trying to scrape wind data from a website. Here is the website: https://wx.ikitesurf.com/spot/507.
I understand that I can do this using Selenium to find elements, but I think I may have found a better way. Please correct me if I am wrong. In developer tools I can find this request by going to Network -> JS -> getGraph?
https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881
This page contains all the data I need and it is constantly updating. Here is my code:
import time
import requests
from bs4 import BeautifulSoup

url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?callback=jQuery17200020271765600428093_1619158293267&units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
time.sleep(3)
wind = soup.find("last_ob_wind_desc")
print(wind)
I tried using beautiful soup to scrape but I always receive the answer "None". Does anyone know how I can scrape this page? I would like to know what I am doing wrong. Thanks for any help!
Removing callback=jQuery17200020271765600428093_1619158293267& from the API URL will make it return proper JSON:
import requests
url = 'https://api.weatherflow.com/wxengine/rest/graph/getGraph?units_wind=mph&units_temp=f&units_distance=mi&fields=wind&format=json&null_ob_min_from_now=60&show_virtual_obs=true&spot_id=507&time_start_offset_hours=-36&time_end_offset_hours=0&type=dataonly&model_ids=-101&wf_token=3a648ec44797cbf12aca8ebc6c538868&_=1619158293881'
response = requests.get(url).json()
response is now a dictionary with the data. last_ob_wind_desc can be retrieved with response['last_ob_wind_desc'].
You can also save the data to csv or other file formats with pandas:
import pandas as pd
df = pd.json_normalize(response)
df.to_csv('filename.csv')

How do I change this code to scrape a table

I am struggling to scrape data from this website:
https://wix-visual-data.appspot.com/app/widget?pageId=cu7nt&compId=comp-kesofw00&viewerCompId=comp-kesofw00&siteRevision=947&viewMode=site&deviceType=desktop&locale=en&tz=Europe%2FLondon&width=980&height=890&instance=k983l1LiiUeOz5_3Pd_CLXbjfadc08q1fEu54xfh9aA.eyJpbnN0YW5jZUlkIjoiYjQ0MWIxMGUtNTRmNy00YzdhLTgwY2QtNmU0ZjkwYzljMzA3IiwiYXBwRGVmSWQiOiIxMzQxMzlmMy1mMmEwLTJjMmMtNjkzYy1lZDIyMTY1Y2ZkODQiLCJtZXRhU2l0ZUlkIjoiM2M3ZmE5OWItY2I3Yy00MTg0LTk1OTEtNWY0MDhmYWYwZmRhIiwic2lnbkRhdGUiOiIyMDIxLTAxLTMwVDAxOjIzOjAyLjU1MVoiLCJ1aWQiOiIzYWMyNDI3YS04NGVhLTQ0ZGUtYjYxMS02MTNiZTVlOWJiZGQiLCJkZW1vTW9kZSI6ZmFsc2UsImFpZCI6IjczYWE3ZWNjLTQyODUtNDY2My1iNjMxLTMzMjE0MWJiZDhhMiIsImJpVG9rZW4iOiI4ODNlMTg5NS05ZjhiLTBkZmUtMTU1Yy0zMTBmMWY2NmNjZGQiLCJzaXRlT3duZXJJZCI6ImVhYWU1MDEzLTMxZjgtNDQzNC04MDFhLTE3NDQ2N2EwZjE5YSIsImV4cGlyYXRpb25EYXRlIjoiMjAyMS0wMS0zMFQwNToyMzowMi41NTFaIiwiaGFzVXNlclJvbGUiOmZhbHNlfQ&currency=GBP&currentCurrency=GBP&vsi=795183b4-8f30-4854-bd85-77678dbe4cf8&consent-policy=%7B%22func%22%3A0%2C%22anl%22%3A0%2C%22adv%22%3A0%2C%22dt3%22%3A1%2C%22ess%22%3A1%7D&commonConfig=%7B%22brand%22%3A%22wix%22%2C%22bsi%22%3Anull%2C%22BSI%22%3Anull%7D
This URL has a table, but for some reason I am not able to scrape it into an Excel file. This is my current Python code and what I have tried. Any help is much appreciated, thank you legends!
import urllib
import urllib.request
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://wix-visual-data.appspot.com/app/widget?pageId=cu7nt&compId=comp-kesofw00&viewerCompId=comp-kesofw00&siteRevision=947&viewMode=site&deviceType=desktop&locale=en&tz=Europe%2FLondon&width=980&height=890&instance=dxGyx3zK9ULK0A8UtGOrLw-__FTD9EBEfzQojJ7Bz00.eyJpbnN0YW5jZUlkIjoiYjQ0MWIxMGUtNTRmNy00YzdhLTgwY2QtNmU0ZjkwYzljMzA3IiwiYXBwRGVmSWQiOiIxMzQxMzlmMy1mMmEwLTJjMmMtNjkzYy1lZDIyMTY1Y2ZkODQiLCJtZXRhU2l0ZUlkIjoiM2M3ZmE5OWItY2I3Yy00MTg0LTk1OTEtNWY0MDhmYWYwZmRhIiwic2lnbkRhdGUiOiIyMDIxLTAxLTI5VDE4OjM0OjQwLjgwM1oiLCJ1aWQiOiIzYWMyNDI3YS04NGVhLTQ0ZGUtYjYxMS02MTNiZTVlOWJiZGQiLCJkZW1vTW9kZSI6ZmFsc2UsImFpZCI6IjczYWE3ZWNjLTQyODUtNDY2My1iNjMxLTMzMjE0MWJiZDhhMiIsImJpVG9rZW4iOiI4ODNlMTg5NS05ZjhiLTBkZmUtMTU1Yy0zMTBmMWY2NmNjZGQiLCJzaXRlT3duZXJJZCI6ImVhYWU1MDEzLTMxZjgtNDQzNC04MDFhLTE3NDQ2N2EwZjE5YSIsImV4cGlyYXRpb25EYXRlIjoiMjAyMS0wMS0yOVQyMjozNDo0MC44MDNaIiwiaGFzVXNlclJvbGUiOmZhbHNlfQ&currency=GBP&currentCurrency=GBP&vsi=57130cda-8191-488e-8089-f472928266e3&consent-policy=%7B%22func%22%3A0%2C%22anl%22%3A0%2C%22adv%22%3A0%2C%22dt3%22%3A1%2C%22ess%22%3A1%7D&commonConfig=%7B%22brand%22%3A%22wix%22%2C%22bsi%22%3Anull%2C%22BSI%22%3Anull%7D"
table_id = "theTable"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find('table', attrs={"id": table_id})
df = pd.read_html(str(table))
The page is loading the data using JavaScript. You can find the URL using the Network tab of Firefox. Even better news: the data is in CSV format, so you don't even need an HTML parser to parse it.
You can find the CSV here.
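Once you have the CSV URL from the Network tab, pandas can read it directly; a sketch (the URL below is a hypothetical placeholder, not the real endpoint):
import pandas as pd

# Hypothetical placeholder; substitute the CSV URL found in the Network tab
csv_url = 'https://wix-visual-data.appspot.com/app/widget/csv'
df = pd.read_csv(csv_url)
# Writing to Excel requires the openpyxl package
df.to_excel('table.xlsx', index=False)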

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from jQuery. I have managed to write the below code, which gets a large amount of text where index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value > b').text
If you view the Page Source of the URL, you will find that there are two script elements that have CDATA. But the script element in which you are interested has jQuery in it, so you have to select the script element based on this knowledge. After that, you need to do some cleaning to get rid of the CDATA tags and the jQuery wrapper. Then, with the help of the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = (script.text
                         .replace('<!--//--><![CDATA[//>', '')
                         .replace('<!--', '')
                         .replace('//--><!]]>', '')
                         .replace('jQuery.extend(Drupal.settings,', '')
                         .replace(');', ''))
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])

using beautiful soup for simulating a page-click to access all HTML on a page?

I'm trying to scrape the following website:
https://www.bandsintown.com/?came_from=257&sort_by_filter=Number+of+RSVPs
I'm able to successfully scrape the events listed on the page using beautifulsoup, using the following code:
from bs4 import BeautifulSoup
import requests
url = 'https://www.bandsintown.com/?came_from=257&sort_by_filter=Number+of+RSVPs'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
dates = soup.find_all('div', {'class': 'event-b58f7990'})
month=[]
day=[]
for i in dates:
    md = i.find_all('div')
    month.append(md[0].text)
    day.append(md[1].text)
However, the issue I'm having is that I'm only able to scrape the first 18 events - the rest of the page is only available if the 'view all' button is clicked at the bottom. Is there a way in beautifulsoup, or otherwise, to simulate this button being clicked, so that I can scrape ALL of the data? I'd prefer to keep this in python as I'm doing most scraping with beautifulsoup. Thanks so much!
If you can work out the end point, or set an end point for the range in the following (with error handling for going too far), you can get a JSON response and parse out the info you require as follows. Depending on how many requests you are making, you may choose to re-use the connection with a Session.
import requests
import pandas as pd
url = 'https://www.bandsintown.com/upcomingEvents?came_from=257&sort_by_filter=Number+of+RSVPs&page={}&latitude=51.5167&longitude=0.0667'
results = []
for page in range(1, 20):
    data = requests.get(url.format(page)).json()
    for item in data['events']:
        results.append([item['artistName'], item['eventDate']['day'], item['eventDate']['month']])
df = pd.DataFrame(results)
print(df)
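And a sketch of the Session re-use mentioned above, assuming (unverified) that the endpoint returns an empty 'events' list once you request past the last page:
import requests
import pandas as pd

url = 'https://www.bandsintown.com/upcomingEvents?came_from=257&sort_by_filter=Number+of+RSVPs&page={}&latitude=51.5167&longitude=0.0667'
results = []
with requests.Session() as s:
    page = 1
    while True:
        data = s.get(url.format(page)).json()
        if not data.get('events'):
            # Assumed stop condition: an empty list means we ran past the end
            break
        for item in data['events']:
            results.append([item['artistName'], item['eventDate']['day'], item['eventDate']['month']])
        page += 1

df = pd.DataFrame(results)
print(df)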
