I'm unable to parse past the div with id="divTradeHaltResults". When I try to return the table within this div, I get None. Thanks in advance!
from bs4 import BeautifulSoup
import requests
my_url = "https://www.nasdaqtrader.com/Trader.aspx?id=TradeHalts"
r = requests.get(url=my_url)
page_text = r.text
soup = BeautifulSoup(page_text, "lxml")
table = soup.table
print(table)
If you select that tag inside the soup, you get the tag, but it's empty. On the webpage itself, you can see the table inside that tag. My guess is that this table is generated with JavaScript (in some form), so it does not come with the HTML. My solution would be to turn to something like Selenium.
This is the code I ran to select that tag:
soup.find('div', {'id':'divTradeHaltResults'})
# <div id="divTradeHaltResults"></div>
If you look into the JS on the page, you can actually find the function that generates the table, as I mentioned above:
function GetTradeHalts()
{
    document.getElementById('divTradeHaltResults').innerHTML = "updating....";
    Server.BL_TradeHalt.GetTradeHalts(cb_GetTradeHalts);
    setTimeout(GetTradeHalts, 1000 * 60);
}
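As suggested above, one workaround is to let a real browser run the JavaScript and parse the rendered source afterwards. A sketch under that assumption (the helper function name is mine, and the Selenium part is left commented out since it needs a browser and driver):

```python
from bs4 import BeautifulSoup

def parse_halt_rows(page_source):
    """Extract rows of cell text from the rendered trade-halts table, if present."""
    soup = BeautifulSoup(page_source, 'html.parser')
    div = soup.find('div', {'id': 'divTradeHaltResults'})
    if div is None or div.find('table') is None:
        return []
    return [[cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
            for tr in div.find_all('tr')]

# With Selenium (not run here), feed in the rendered source instead:
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get('https://www.nasdaqtrader.com/Trader.aspx?id=TradeHalts')
#   rows = parse_halt_rows(driver.page_source)

# On the raw HTML that requests receives, the div is empty, so no rows come back:
print(parse_halt_rows('<div id="divTradeHaltResults"></div>'))  # []
```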
I'm trying to parse a table with orders from an HTML page.
Here is the HTML:
[screenshot of the HTML]
I need to get the data from those table rows.
Here is what I tried:
response = requests.get('https://partner.market.yandex.ru/supplier/23309133/fulfillment/orders', headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
q = soup.findAll('tr')
a = soup.find('tr')
print(q)
print(a)
But both give me nothing: find_all returns an empty list and find returns None. So, any idea how to get into those table rows?
I tried iterating over each div in the HTML; once I get close to the div which contains those tables, it gives me None as well.
Appreciate any help.
Alright, I found a solution by using Selenium instead of the requests lib.
I'm not sure why it doesn't work with requests, since on the face of it Selenium does the same thing (sends a GET request); the likely difference is that Selenium drives a real browser, which executes the page's JavaScript. Either way, with Selenium it works.
So here is what I do:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

driver = webdriver.Chrome(r"C:\Users\Booking\PycharmProjects\britishairways\chromedriver.exe")
driver.get('https://www.britishairways.com/travel/managebooking/public/ru_ru')
time.sleep(15)  # time to log in manually
res = driver.page_source
print(res)
soup = BeautifulSoup(res, 'lxml')
b = soup.find_all('tr')
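Once Selenium has handed over the rendered source, the rows from find_all('tr') can be flattened into lists of cell text. A small sketch on a stand-in table (the real markup will differ):

```python
from bs4 import BeautifulSoup

sample = """
<table>
  <tr><th>Order</th><th>Status</th></tr>
  <tr><td>23309133-1</td><td>shipped</td></tr>
</table>
"""
soup = BeautifulSoup(sample, 'html.parser')
rows = [[cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
        for tr in soup.find_all('tr')]
print(rows)  # [['Order', 'Status'], ['23309133-1', 'shipped']]
```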
This is the link of the webpage I want to scrape:
https://www.tripadvisor.in/Restaurants-g494941-Indore_Indore_District_Madhya_Pradesh.html
I have also applied additional filters by clicking on the encircled heading (screenshot 1).
This is how the webpage looks after clicking on it (screenshot 2).
I want to get the names of all the places displayed on the webpage, but I'm having trouble because the URL doesn't change when the filter is applied.
I am using python urllib for this.
Here is my code:
from urllib.request import urlopen

url = "https://www.tripadvisor.in/Hotels-g494941-Indore_Indore_District_Madhya_Pradesh-Hotels.html"
page = urlopen(url)
html_bytes = page.read()
html = html_bytes.decode("utf-8")
print(html)
You can use bs4. bs4 (BeautifulSoup) is a Python module that lets you pull specific things out of webpages. This will get the text from the site:
from bs4 import BeautifulSoup as bs
soup = bs(html, features='html5lib')
text = soup.get_text()
print(text)
If you would like to get something other than the text, perhaps elements with a certain tag, you can also use bs4:
soup.find_all('p')  # getting all p tags
soup.find_all('p', class_='Title')  # getting all p tags with a class of Title
Find what class and tag all of the place names have, and then use the above to get all the place names.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
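For example, once you have identified the tag and class that the place names carry (the class name below is invented for illustration), the extraction looks like:

```python
from bs4 import BeautifulSoup

html = ('<div><p class="Title">Cafe One</p>'
        '<p class="Title">Cafe Two</p>'
        '<p>not a place name</p></div>')
soup = BeautifulSoup(html, 'html.parser')
names = [p.get_text() for p in soup.find_all('p', class_='Title')]
print(names)  # ['Cafe One', 'Cafe Two']
```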
I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and is also filled in by jQuery. I have managed to write the code below, which gets a large amount of text, where index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and print "redshift":"0.06", but I don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.find_all('b')
print(soup.find_all('script')[21])
It can be done with your current approach, but I'd advise against it. There's a neater way, since the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value > b').text
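On a stand-in fragment shaped like that markup, the selector picks out the bold value directly:

```python
from bs4 import BeautifulSoup

html = '<div class="field-redshift"><div class="value"><b>0.06</b></div></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('div.field-redshift > div.value > b').text)  # 0.06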
If you view the page source of the URL, you will find two script elements that contain CDATA. The script element you are interested in also contains jQuery, so you can select it based on that knowledge. After that, you need to do some cleaning to get rid of the CDATA markers and the jQuery wrapper. Then, with the help of the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
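The same cleaning can be exercised offline on a simplified stand-in of the script body (the JSON here is trimmed down to just the key of interest):

```python
import json

# Simplified stand-in for script.text from the page
script_text = ('<!--//--><![CDATA[//><!--'
               'jQuery.extend(Drupal.settings,'
               '{"objectFlot": {"plotMain1": {"params": {"redshift": "0.06"}}}});'
               '//--><!]]>')
cleaned = (script_text
           .replace('<!--//--><![CDATA[//>', '')
           .replace('<!--', '')
           .replace('//--><!]]>', '')
           .replace('jQuery.extend(Drupal.settings,', '')
           .replace(');', ''))
data = json.loads(cleaned)
print(data['objectFlot']['plotMain1']['params']['redshift'])  # 0.06
```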
I have tried to get the data from this table, but I have been unable to do so: https://datagolf.ca/player-trends
I have tried many things over the last few hours; below is my most recent attempt, which just returns an empty list.
import bs4
import requests
res = requests.get('https://datagolf.ca/player-trends')
soup = bs4.BeautifulSoup(res.text, 'lxml')
table = soup.find_all("div", class_ = "table")
table
Is the issue something similar to this:
Scrape of table with only 'div's
This page is JavaScript-rendered. Take a closer look at what requests.get('https://datagolf.ca/player-trends') actually returns: it does not contain the table.
Instead, the page pulls its data from https://dl.dropboxusercontent.com/s/hprrfklqs5q0oge/player_trends_dev.csv?dl=1
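Since the underlying data is a plain CSV, the standard csv module (or pandas) can read it once fetched. A sketch on an in-memory sample (the column names are made up; fetch the dropboxusercontent URL above for the real ones):

```python
import csv
import io

# In practice the text would come from:
#   requests.get('https://dl.dropboxusercontent.com/s/hprrfklqs5q0oge/player_trends_dev.csv?dl=1').text
sample = io.StringIO('player,trend\nA. Golfer,+1.2\nB. Golfer,-0.4\n')
rows = list(csv.reader(sample))
print(rows[0])       # ['player', 'trend']
print(len(rows) - 1) # 2 data rows
```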
I am trying to find a table in a Wikipedia page using BeautifulSoup. I know how to get the first table, but how do I get the second table (Recent changes to the list of S&P 500 Components) with the same class wikitable sortable?
my code:
import bs4 as bs
import requests
url='https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
r=requests.get(url)
url=r.content
soup = bs.BeautifulSoup(url,'html.parser')
tab = soup.find("table",{"class":"wikitable sortable"})
https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
You can use soup.find_all and access the last table. Since there are only two table tags with wikitable sortable as their class, the last element in the resulting list will be the "Recent changes" table:
soup.find_all("table", {"class":"wikitable sortable"})[-1]
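On a stand-in document with two such tables, indexing with [-1] picks out the second one:

```python
from bs4 import BeautifulSoup

html = ('<table class="wikitable sortable"><tr><td>constituents</td></tr></table>'
        '<table class="wikitable sortable"><tr><td>recent changes</td></tr></table>')
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'wikitable sortable'})[-1]
print(tab.td.text)  # recent changes
```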
You could use an :nth-of-type CSS selector to specify the second matching table:
import bs4 as bs
import requests
url = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
r = requests.get(url)
url = r.content
soup = bs.BeautifulSoup(url,'lxml')
tab = soup.select_one("table.wikitable.sortable:nth-of-type(2)")
print(tab)