I am trying to scrape a table that sits inside a div tag with id pcaxis_tablediv, using the following code. However, when I print the result, it comes back empty. I am looking at the source code of the website and I can't see what I am doing wrong.
import requests
from bs4 import BeautifulSoup

url = 'https://www.statistikdatabasen.scb.se/pxweb/sv/ssd/START__AM__AM0208__AM0208B/YREG65/sortedtable/tableViewSorted/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
wanted_table = soup.find_all('div', id="pcaxis_tablediv")
print(wanted_table)
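One quick way to narrow this down (purely a diagnostic sketch, continuing from the snippet above) is to check whether the id occurs anywhere in the raw response at all; if it does not, the table is rendered client-side by JavaScript and requests alone will never see it.

# diagnostic: does the target id appear in the HTML that requests received?
# if this prints False, the table is generated by JavaScript in the browser
print('pcaxis_tablediv' in page.text)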
I'm trying to parse a table with orders from an HTML page.
Here is the HTML:
(screenshot of the table's HTML)
I need to get the data from those table rows.
Here is what I tried:
import requests
from bs4 import BeautifulSoup

# 'headers' is assumed to be defined earlier in the original post
response = requests.get('https://partner.market.yandex.ru/supplier/23309133/fulfillment/orders', headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
q = soup.find_all('tr')
a = soup.find('tr')
print(q)
print(a)
But it gives me None. So, any idea how to get into those table rows?
I tried to iterate over each div in the HTML... once I get closer to the div which contains those tables, it gives me None as well.
Appreciate any help.
Alright, I found a solution by using selenium instead of the requests lib.
I don't have any idea why it doesn't work with the requests lib, since it seems to be doing the same thing as selenium (just sending a GET request). But with selenium it works.
So here is what I do:
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome(r"C:\Users\Booking\PycharmProjects\britishairways\chromedriver.exe")
driver.get('https://www.britishairways.com/travel/managebooking/public/ru_ru')
time.sleep(15)  # time to complete the authorization manually
res = driver.page_source
print(res)
soup = BeautifulSoup(res, 'lxml')
b = soup.find_all('tr')
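If the fixed sleep turns out to be unreliable, one possible refinement (just a sketch continuing from the snippet above; the tr locator and timeout are assumptions about the page) is an explicit wait that returns as soon as the rows actually exist:

# sketch only: wait up to 60 s for table rows to appear instead of sleeping a fixed 15 s;
# this still leaves time to complete the authorization by hand
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

WebDriverWait(driver, 60).until(
    EC.presence_of_element_located((By.TAG_NAME, 'tr'))
)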
I'm trying to web-scrape the Google News page for a personal project and retrieve the article headlines to print out onto another page. I've been searching for any typos or mistakes but I'm not sure why my element keeps returning as "None" when I try to print it.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.google.com/search?q=beyond+meat&rlz=1C1CHBF_enUS898US898&sxsrf=ALeKk00IH9jp1Kz5-LSyi7FUB4rd6--_hw:1624935518812&source=lnms&tbm=nws&sa=X&ved=2ahUKEwicqIbD7LvxAhVWo54KHXgRA9oQ_AUoAXoECAEQAw&biw=1536&bih=754'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find('div', id='rso') #grabs everything in column
article_results = results.find_all('div', class_='yr3B8d KWQBje') #grabs divs surrounding each article
for article_result in article_results:
    headliner = article_result.find('div', class_='JheGif nDgy9d')  # grabs article header div for every article
    if headliner is None:
        continue
    headliner_text = headliner.text.strip()
    print(headliner_text)
import requests
from bs4 import BeautifulSoup
URL = 'https://www.google.com/search?q=beyond+meat&rlz=1C1CHBF_enUS898US898&sxsrf=ALeKk00IH9jp1Kz5-LSyi7FUB4rd6--_hw:1624935518812&source=lnms&tbm=nws&sa=X&ved=2ahUKEwicqIbD7LvxAhVWo54KHXgRA9oQ_AUoAXoECAEQAw&biw=1536&bih=754'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
headers = soup.find_all('div', class_='BNeawe vvjwJb AP7Wnd')
for h in headers:
    print(h.text)
Refer to the output... Is this what you are expecting?
I've been trying to scrape two values from a website using Beautiful Soup in Python, and it's been giving me trouble. Here is the URL of the page I'm scraping:
https://www.stjosephpartners.com/Home/Index
Here are the values I'm trying to scrape:
(screenshot of the HTML of the website to be scraped)
I tried:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.stjosephpartners.com/Home/Index').text
soup = BeautifulSoup(source, 'lxml')
gold_spot_shell = soup.find('div', class_ = 'col-lg-10').children
print(gold_spot_shell)
The output I got was: <list_iterator object at 0x039FD0A0>
When I tried using: gold_spot_shell = soup.find('div', class_ = 'col-lg-10').children
The output was: ['\n']
When I tried using: gold_spot_shell = soup.find('div', class_ = 'col-lg-10').span
The output was: None
The HTML definitely has at least one span child. I'm not sure how to scrape the values I'm after. Thanks.
BeautifulSoup + requests is not a good method to scrape a dynamic website like this one. That span is generated by JavaScript, so when you fetch the HTML with requests, it simply does not exist.
You can try to use selenium instead.
You can check whether the website uses JavaScript to render an element by disabling JavaScript on the page and looking for that element again, or just by using "view page source".
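For reference, a minimal selenium sketch along those lines might look like the following (the col-lg-10 selector comes from the question; the wait condition, timeout, and driver setup are assumptions):

# sketch: let a real browser run the page's JavaScript, then parse the rendered HTML
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

driver = webdriver.Chrome()  # assumes chromedriver is available
driver.get('https://www.stjosephpartners.com/Home/Index')
# wait until the JavaScript-generated span exists inside the column div
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.col-lg-10 span'))
)
soup = BeautifulSoup(driver.page_source, 'lxml')
gold_spot_shell = soup.find('div', class_='col-lg-10')
print(gold_spot_shell.get_text(strip=True))
driver.quit()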
I want to extract the text here
a lot of text
I used
import requests
from bs4 import BeautifulSoup

url = 'https://osu.ppy.sh/users/1521445'
page = requests.get(url, headers=headers)  # 'headers' is assumed to be defined elsewhere in the original post
soup = BeautifulSoup(page.content, 'html.parser')
mestuff = soup.find("div", {"class": "bbcode bbcode--profile-page"})
but it always comes back with "None" in the terminal.
How can I go about this?
Link is "https://osu.ppy.sh/users/1521445"
(This is a repost since the old question was very old. I don't know if I should have made another question or not.)
Data is dynamically loaded from a script tag, so, as in the other answer, you can grab it from that tag. You can target the tag by its id, then pull out the relevant JSON, then the HTML from that JSON, and then parse the HTML that would otherwise have been loaded dynamically on the page (at this point you can use your original class selector).
import requests, json, pprint
from bs4 import BeautifulSoup as bs

r = requests.get('https://osu.ppy.sh/users/1521445')
soup = bs(r.content, 'lxml')
# the profile data is embedded as JSON in the script tag with id "json-user"
all_data = json.loads(soup.select_one('#json-user').text)
# the bbcode section is an HTML string inside that JSON, so parse it again
soup = bs(all_data['page']['html'], 'lxml')
pprint.pprint(soup.select_one('.bbcode--profile-page').get_text('\n'))
You could try this:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://osu.ppy.sh/users/1521445'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
x = soup.findAll("script", {"id": re.compile(r"json-user")})
result = re.findall('raw\":(.+)},\"previous_usernames', x[0].text.strip())
print(result)
I'm not sure why the div with class='bbcode bbcode--profile-page' is stored as a string inside the script tag with id='json-user'; that's why you can't get its value by searching directly for the div with class='bbcode bbcode--profile-page'.
Hope this helps.
I am accessing the following website to extract a list of stocks:
http://www.barchart.com/stocks/performance/12month.php
I am using the following code:
from bs4 import BeautifulSoup
import requests
url = "http://www.barchart.com/stocks/performance/12month.php"
r = requests.get(url)
data = r.text
soup =BeautifulSoup(data, "lxml")
for link in soup.find_all('a'):
    print(link.get('href'))
The problem is that I am getting a lot of other information that is not needed. What would be a method that gives me just the stock names and nothing else?
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.barchart.com/stocks/performance/12month.php")
html = r.text
soup = BeautifulSoup(html, 'html.parser')
tds = soup.find_all("td", {"class": "ds_name"})
for td in tds:
    print(td.a.text)
If you look at the source code of the page, you will find that everything you need is in a table. To be specific, the stock names are in <td></td> elements whose class="ds_name". So, that's it.