I'm trying to read a news webpage to get the titles of its stories. I'm attempting to put them in a list, but I keep getting an empty list. Can someone please point me in the right direction here? What am I missing? Please see the code below. Thanks.
import requests
from bs4 import BeautifulSoup
url = 'https://nypost.com/'
ttl_lst = []
soup = BeautifulSoup(requests.get(url).text, "lxml")
title = soup.findAll('h2', {'class': 'story-heading'})
for row in title:
    ttl_lst.append(row.text)
print (ttl_lst)
The requests module only returns the initial HTML document the server sends back; it does not execute JavaScript. Sites like nypost.com load their articles with AJAX requests after the page loads, so the story headings are not in that first response. You will have to use something like Selenium, which drives a real browser and lets those AJAX requests complete before you scrape.
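For example, here is a minimal Selenium sketch (assuming Chrome with a matching chromedriver, and that h2.story-heading is still the selector the site uses - news-site markup changes often):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://nypost.com/')
# wait until JavaScript has rendered at least one heading
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h2.story-heading')))
ttl_lst = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'h2.story-heading')]
driver.quit()
print(ttl_lst)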
I am trying to update my web-scraping scripts because the site (https://covid19.gov.vn/) has been updated, but I can't for the life of me figure out how to parse these elements. Inspecting the elements, they seem to be there as usual, but I cannot parse them with BeautifulSoup. I also tried using Playwright, but I still couldn't scrape them correctly. Viewing the page source, it's almost as if the elements are not there at all. Can anyone with more knowledge about HTML and web scraping explain to me how this works? I'm pretty much stuck here.
This is basically my last attempt before I gave up and went digging through the page source:
from bs4 import BeautifulSoup as bs
import requests
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://covid19.gov.vn/")
    page_content = page.content()
    soup = bs(page_content, features="lxml")
    test = soup.findAll('div', class_="content-tab show", id="vi")
    print(test)
    browser.close()
My idea was to scrape that container and just iterate through all the content inside. But it doesn't work. I'd really appreciate it if anyone can help me with this! Thanks!
Try the code below - it is based on an HTTP GET call that fetches the data you are looking for directly from the site's JSON endpoint.
import requests
r = requests.get('https://static.pipezero.com/covid/data.json')
if r.status_code == 200:
    data = r.json()
    print(data['total']['internal'])
Output:
{'death': 17545, 'treating': 27876, 'cases': 707436, 'recovered': 475343}
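Endpoints like this data.json can usually be found in the browser DevTools Network tab (filter on XHR/Fetch) while the page loads. As for why the Playwright attempt came up empty: page.content() can run before the AJAX data arrives. A sketch of one possible fix, assuming the div.content-tab selector from the question is still what the site renders, is to wait for the element first:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://covid19.gov.vn/")
    # wait until JavaScript has actually rendered the tab content
    page.wait_for_selector("div.content-tab.show")
    print(page.inner_html("div.content-tab.show"))
    browser.close()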
I'm trying to parse a page to learn BeautifulSoup; here is the code:
import requests as req
from bs4 import BeautifulSoup
page = 'https://www.pathofexile.com/trade/search/Delirium/w0brcb'
resp = req.get(page)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all('results')
print(len(res))
Result: 0
The goal is to get the first price.
I tried to look for the tag in Chrome and it's there, but the browser probably makes another request to get the results.
Can someone explain what I am missing here?
[screenshot: website's source code]
Problems with the code
Your code is looking for a "results" element. What you really have to look for (based on your screenshot) is a div element with the class "results".
So try this:
soup.find_all("div", attrs={"class":"results"})
But if you want the price you have to dig deeper for the element which contains the price:
price = soup.find("span", attrs={"data-field":"price"}).text
Problems with the site
It seems the site loads its data via AJAX; with requests you get the page before/without that AJAX call.
In this case you should switch from requests to Selenium. It drives a real browser, so you can wait until the data has finally loaded before you start scraping.
Documentation: Selenium
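A minimal sketch of that approach (assuming Chrome and that the span[data-field="price"] element from above is what the site renders once the AJAX call completes):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.pathofexile.com/trade/search/Delirium/w0brcb')
# wait until the AJAX results have filled in at least one price
price = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'span[data-field="price"]')))
print(price.text)
driver.quit()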
I am trying to extract all the links to the store locations on a web site: https://www.ulta.com/stores/directory
Inspecting the page, the links sit under a ul with class "sl-all-locations__wrapper". I want to extract all of them using BeautifulSoup as below, but I get nothing from the code:
import requests
import bs4 as bs
resp = requests.get('https://www.ulta.com/stores/directory')
soup = bs.BeautifulSoup(resp.text, 'lxml')
soup.find_all('ul',{'class': 'sl-all-locations__wrapper'})
Queries like
soup.find_all('div', id='sl-all-locations')
do not work either.
I am wondering if there is anything wrong with my code, or if the website is blocking scrapers. Can anyone help me with this?
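A quick way to check whether this is the same JavaScript-rendering issue as in the threads above is to search the raw response for the class name; this is only a diagnostic sketch, not a fix:

import requests

resp = requests.get('https://www.ulta.com/stores/directory',
                    headers={'User-Agent': 'Mozilla/5.0'})
# if the class name is absent from the raw HTML, the ul is rendered
# by JavaScript and requests alone will never see it
print('sl-all-locations__wrapper' in resp.text)

If that prints False, you will need a browser-driven tool such as Selenium or Playwright, or the site's own location endpoint if you can find one in the DevTools Network tab.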
I want to scrape the number of likes, comments and shares with BeautifulSoup and Python.
I have written the code below, but it returns an empty list and I do not know why.
This is the code:
from bs4 import BeautifulSoup
import requests
website = "https://www.facebook.com/nike"
soup = requests.get(website).text
my_html = BeautifulSoup(soup, 'lxml')
list_of_likes = my_html.find_all('span', class_='_81hb')
print(list_of_likes)
for i in list_of_likes:
    print(i)
The same happens with comments and shares. What should I do?
Facebook uses client-side rendering. That means the HTML document you get (the one stored in your soup variable) contains mostly JavaScript code; the actual content is rendered only when the page is displayed in a browser.
You can probably try Selenium.
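A heavily hedged sketch of that idea, assuming the _81hb class from the question still appears in the rendered markup (Facebook's class names are obfuscated and change often, and the page may require logging in before the counts are shown):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.facebook.com/nike')
# give the client-side rendering time to finish
driver.implicitly_wait(10)
for span in driver.find_elements(By.CSS_SELECTOR, 'span._81hb'):
    print(span.text)
driver.quit()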
I have written a small web scraper in BS4. With the code I am able to scrape one page at a time; here is the relevant code:
import csv
from bs4 import BeautifulSoup
import requests
html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id=129867").text
soup = BeautifulSoup(html,'lxml')
This code scrapes one page, but I want to scrape more than one page at a time (a range), so I tried adding a for loop like this:
import csv
from bs4 import BeautifulSoup
import requests
for ace in range(129867, 129869):
    html = requests.get("http://www.gbgb.org.uk/resultsMeeting.aspx?id= {ace}").text
    soup = BeautifulSoup(html,'lxml')
Nothing happens when I run the code, and I don't even get any of the usual cryptic messages hinting at what went wrong. Could it be syntax, or is it something else? Any help appreciated.
You should do all the processing inside the loop now. Also, you are not actually inserting the ace value into the URL (it is a plain string, not an f-string), and there is an extra space after the id=. It might also be a good idea to establish a web-scraping session and use the params keyword of the get() method.
Fixed version:
import csv
from bs4 import BeautifulSoup
import requests
with requests.Session() as session:
    for ace in range(129867, 129869):
        url = "http://www.gbgb.org.uk/resultsMeeting.aspx"
        html = session.get(url, params={'id': ace}).text
        soup = BeautifulSoup(html, 'lxml')
Note this code is still blocking in nature; it processes the pages one at a time. If you want to speed things up, look into the Scrapy web-scraping framework.
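As a lighter-weight alternative to Scrapy, here is a sketch of fetching the pages concurrently with the standard library's concurrent.futures (same URL and id range as above; error handling omitted):

import concurrent.futures

import requests
from bs4 import BeautifulSoup

URL = "http://www.gbgb.org.uk/resultsMeeting.aspx"

def fetch(ace):
    # fetch and parse a single results page
    html = requests.get(URL, params={'id': ace}).text
    return BeautifulSoup(html, 'lxml')

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    soups = list(pool.map(fetch, range(129867, 129869)))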