Beautiful Soup parsing table from react html

Beautiful Soup parsing table from react html - python

Im trying to parse table with orders from html page.
Here the html:
HTML PIC
I need to get data from those table rows,
Here what i tried to do:
response = requests.get('https://partner.market.yandex.ru/supplier/23309133/fulfillment/orders', headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
q = soup.findAll('tr')
a = soup.find('tr')
print(q)
print(a)
But it gives me None. So any idea how to get into those table rows?
I tried to iterate over each div in html... once i get closer to div which contains those tables it give me None as well.
Appreciate any help

Aight. I found a solution by using selenium instead of requests lib.
I don't have any idea why it doesn't work with requests lib since it's doing the same thing as selenium (just sending an get request). But, with the selenium it works.
So here is what I do:
driver = webdriver.Chrome(r"C:\Users\Booking\PycharmProjects\britishairways\chromedriver.exe")
driver.get('https://www.britishairways.com/travel/managebooking/public/ru_ru')
time.sleep(15) # make an authorization
res = driver.page_source
print(res)
soup = BeautifulSoup(res, 'lxml')
b = soup.find_all('tr')

Related

scraping table with bs4

I am trying to scrape a table that is under a div tag with id pcaxis_tablediv using the following code. However, when I am printing it, it returns None. I am looking at the source code of the website and I can't see what am I doing wrong.
url='https://www.statistikdatabasen.scb.se/pxweb/sv/ssd/START__AM__AM0208__AM0208B/YREG65/sortedtable/tableViewSorted/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
wanted_table = soup.find_all('div', id="pcaxis_tablediv")
print(wanted_table)

Scraping a table with BeautifulSoup

I am trying to scrape a table with info on footballplayers on https://www.transfermarkt.co.uk/manchester-city/kader/verein/281/saison_id/2019/plus/1
It works fine when I try to get information manually like this:
url = 'https://www.transfermarkt.co.uk/manchester-city/startseite/verein/281/saison_id/2019'
response = requests.get(url, headers={'User-Agent': 'Custom5'})
data = response.text
soup = BeautifulSoup(data, 'html.parser')
players_table = soup.find("table", attrs={"class": "items"})
Players = soup.find_all("a", {"class": "spielprofil_tooltip"})
Players[5].text
Values = soup.find_all("td", {"class": "rechts hauptlink"})
Values[9].text
Birthdays = soup.find_all("td", {"class": "zentriert"})
Birthdays[1].text
But to actually get the data into a table I think I need to use a for loop with td and tr tags. I have looked for solutions but cannot find anything that works with this particular website.
When I try this for example, the list remains empty
data = []
for tr in players_table.find_all("tr"):
# remove any newlines and extra spaces from left and right
data.append
print(data)

You don't actually append anything to the list.
Change data.append to data.append(tr).
That way you tell the your program what to append to the list, assuming players_table.find_all("tr") does return at least 1 item.

The website uses JavaScript, but requests doesn't support it. so we can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the correct ChromeDriver from here.
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
URL = "https://www.transfermarkt.co.uk/manchester-city/kader/verein/281/saison_id/2019/plus/1"
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get(URL)
# Wait 5 seconds for the page to load
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
players_table = soup.find("table", attrs={"class": "items"})
for tr in players_table.find_all('tr'):
tds = ' '.join(td.get_text(strip=True) for td in tr.select('td'))
print(tds)
driver.quit()

How can I get the text from this specific div class?

I want to extract the text here
a lot of text
I used
url = ('https://osu.ppy.sh/users/1521445')
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
mestuff = soup.find("div", {"class":"bbcode bbcode--profile-page"})
but it never fails to return with "None" in the terminal.
How can I go about this?
Link is "https://osu.ppy.sh/users/1521445"
(This is a repost since the old question was super old. I don't know if I should've made another question or not but aa)

Data is dynamically loaded from script tag so, as in other answer, you can grab from that tag. You can target the tag by its id then you need to pull out the relevant json, then the html from that json, then parse html which would have been loaded dynamically on page (at this point you can use your original class selector)
import requests, json, pprint
from bs4 import BeautifulSoup as bs
r = requests.get('https://osu.ppy.sh/users/1521445')
soup = bs(r.content, 'lxml')
all_data = json.loads(soup.select_one('#json-user').text)
soup = bs(all_data['page']['html'], 'lxml')
pprint.pprint(soup.select_one('.bbcode--profile-page').get_text('\n'))

You could try this:
url = ('https://osu.ppy.sh/users/1521445')
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
x = soup.findAll("script",{"id":re.compile(r"json-user")})
result = re.findall('raw\":(.+)},\"previous_usernames', x[0].text.strip())
print(result)
Im not sure why the div with class='bbcode bbcode--profile-page' is string inside script tag with class='json-user', that's why you can't get it's value by div with class='bbcode bbcode--profile-page'
Hope this could help

Extract data from BSE website

How can I extract the value of Security ID, Security Code, Group / Index, Wtd.Avg Price, Trade Date, Quantity Traded, % of Deliverable Quantity to Traded Quantity using Python 3 and save it to an XLS file. Below is the link.
https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/
PS: I am completely new to the python. I know there are few libs which make scrapping easier like BeautifulSoup, selenium, requests, lxml etc. Don't have much idea about them.
Edit 1:
I tried something
from bs4 import BeautifulSoup
import requests
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'id':'newheaddivgrey'})
print(table)
Its output is None. I was expecting all tables in the webpage and filter them further to get required data.
import requests
import lxml.html
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
root = lxml.html.fromstring(r.content)
title = root.xpath('//*[#id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(title)
Tried another code. Same problem.
Edit 2:
Tried selenium. But I am not getting the table contents.
from selenium import webdriver
driver = webdriver.Chrome(r"C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\bin\chromedriver.exe")
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
table=driver.find_elements_by_xpath('//*[#id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(table)
driver.quit()
Output is [<selenium.webdriver.remote.webelement.WebElement (session="befdd4f01e6152942c9cfc7c563a6bf2", element="0.13124528538297953-1")>]

After loading the page with Selenium, you can get the Javascript modified page source using driver.page_source. You can then pass this page source in the BeautifulSoup object.
driver = webdriver.Chrome()
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('div', id='SecuritywiseDeliveryPosition')
This code will give you the Securitywise Delivery Position table in the table variable. You can then parse this BeautifulSoup object to get the different values you want.
The soup object contains the full page source including the elements that were dynamically added. Now, you can parse this to get all the things you mentioned.

Extracting specific Information from a website using BeautifulSoup (Python)

I am accessing the following website to extract a list of stocks:
http://www.barchart.com/stocks/performance/12month.php
I am using the following code:
from bs4 import BeautifulSoup
import requests
url=raw_input("http://www.barchart.com/stocks/performance/12month.php")
r = requests.get("http://www.barchart.com/stocks/performance/12month.php")
data = r.text
soup =BeautifulSoup(data, "lxml")
for link in soup.find_all('a'):
print(link.get('href'))
The problem is I am getting a lot of other information that is not needed. I wanted to ask what would be a method that would just give me the stock names and nothing else.

r = requests.get("http://www.barchart.com/stocks/performance/12month.php")
html = r.text
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all("td", {"class": "ds_name"})
for td in tds:
print td.a.text
If you look at the source code of the page, you will find that all you need is in a table. To be specific, the stocks' names are in <td></td> whose class="ds_name". So, that's it.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Beautiful Soup parsing table from react html - python

Related

scraping table with bs4

Scraping a table with BeautifulSoup

How can I get the text from this specific div class?

Extract data from BSE website

Extracting specific Information from a website using BeautifulSoup (Python)

Categories

Resources