BeautifulSoup findAll not returning results - python

I want to get the product names and prices from this page. I pretty much repeated for the price the exact same thing I did for the product name, but I'm not getting anything.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bSoup
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:77.0) Gecko/20100101 Firefox/77.0'}
url = "https://www.walmart.ca/search?q=lettuce"
req = Request(url=url, headers=header)
client = urlopen(req)
pageHtml = client.read()
client.close()
pageSoup = bSoup(pageHtml, 'html.parser')
products = pageSoup.findAll("div", {"class": "css-155zfob e175iya63"})
print(len(products))  # prints 15, as expected
for product in products:
    pass
prices = pageSoup.findAll("div", {"class": "css-8frhg8 e175iya65"})
print(len(prices))  # prints 0 and idk why :/
for price in prices:
    pass

The page https://www.walmart.ca/search?q=lettuce does not return the content you expect:
curl -s -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:77.0) Gecko/20100101 Firefox/77.0' 'https://www.walmart.ca/search?q=lettuce' | grep 'css-8frhg8'
You probably saw that class in a browser, where part of the content is rendered at run time via JavaScript. This means you need to use a library that can emulate a browser with JavaScript support.
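For example, a minimal sketch with Selenium (assuming Selenium is installed and a chromedriver binary is on your PATH; the class names below are the ones from your question and may change whenever the site is redeployed):
import time
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()  # needs chromedriver on PATH
driver.get("https://www.walmart.ca/search?q=lettuce")
time.sleep(5)  # crude wait so the JavaScript can populate the page
pageSoup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
prices = pageSoup.findAll("div", {"class": "css-8frhg8 e175iya65"})
print(len(prices))  # should be non-zero now, if the class still exists
A production version would replace the sleep with Selenium's explicit waits, but this shows the idea.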

Python webscraping doesn't produce raw HTML

I am working on scraping data from a table.
With the following code, instead of the rendered values being fetched, I am given the template variable, so instead of the value in the table I get {value}.
How can I get the actual values rather than the variable name?
Thanks
import requests
from bs4 import BeautifulSoup
my_session = requests.session()
for_cookies = my_session.get("https://uk.investing.com")
cookies = for_cookies.cookies
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
my_url = 'https://uk.investing.com/stock-screener/?sp=country::4|sector::a|industry::80|equityType::a|exchange::3%3Ceq_market_cap;1'
response = my_session.get(my_url, headers=headers, cookies=cookies)
soup = BeautifulSoup(response.content, "lxml")
print(soup.prettify())  # print the parsed HTML
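One quick check, reusing the response from above: if the literal placeholder is already in the raw response, the values are filled in client-side by JavaScript, and the plain HTML will never contain them.
print('{value}' in response.text)  # True would mean the table is rendered by JavaScript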

Trimming the outputs in python

I made something that gets the time from https://time.is/ and shows it. I used BeautifulSoup and urllib.request.
But I want to trim the output. I'm getting this, and I want to strip the surrounding HTML:
<div id="twd">07:29:26</div>
Program File:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://time.is/'
hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)' }
req = urllib.request.Request(url, headers=hdr)
res = urllib.request.urlopen(req)
soup = BeautifulSoup(res, 'html.parser')
string = soup.find(id='twd')
print(string)
How can I get just the text?
You can get the text from the DOM element with .text:
string.text
Test Code:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://time.is/'
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
req = urllib.request.Request(url, headers=hdr)
res = urllib.request.urlopen(req)
soup = BeautifulSoup(res, 'html.parser')
string = soup.find(id='twd')
print(string.text)
Results:
07:06:11PM
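If the element ever contains nested tags or stray whitespace, get_text() does the same job with a bit more control:
print(string.get_text(strip=True))  # same as .text, but trims surrounding whitespace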

Python request not showing desired result

I am not an expert with Python, but this is what I did with python-requests. I am trying to call this URL, which returns the email address of a user given first_name, last_name, and domain:
https://dry-tor-58240.herokuapp.com
However, when I request it with Python I get a 200 response code, but when I convert response.text to a BeautifulSoup object I don't see the email address anywhere in it.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {"first_name": "nandish", "last_name": "ajani", "domain": "atyantik.com"}
r = requests.get("https://dry-tor-58240.herokuapp.com/", headers=headers, params=payload)
soup = BeautifulSoup(r.text, 'lxml')
Can anyone let me know what is it that I am doing wrong?
It should be a POST request. The endpoint returns JSON, so I also used requests' .json():
import requests
request_url = 'https://dry-tor-58240.herokuapp.com/find'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
payload = {"first_name": "nandish", "last_name": "ajani", "domain": "atyantik.com"}
jsonObj = requests.post(request_url, headers=headers, json=payload).json()
The email address is then available as:
print(jsonObj['email'])
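If the shape of the response is ever in doubt, it helps to fail loudly on HTTP errors and look at the whole object before indexing into it (a small sketch built on the same request):
resp = requests.post(request_url, headers=headers, json=payload)
resp.raise_for_status()  # raise immediately on a non-2xx status
print(resp.json())       # inspect the full JSON before picking out 'email'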

Unable to scrape google news accurately

I'm trying to scrape Google News headlines for a given keyword (e.g. Blackrock) over a given period (e.g. 7-Jan-2012 to 14-Jan-2012).
I'm doing this by constructing the URL and then fetching it with urllib2, as shown in the code below. If I put the constructed URL in a browser, it gives me the correct result; however, if I use it through Python, I get news results for the right keyword but for the current period.
Here's the code. Can someone tell me what I'm doing wrong and how I can correct it?
import urllib2
from bs4 import BeautifulSoup
url = 'https://www.google.com/search?q=Blackrock&hl=en&gl=uk&authuser=0&source=lnt&tbs=cdr%3A1%2Ccd_min%3A7%2F1%2F2012%2Ccd_max%3A14%2F1%2F2012&tbm=nws'
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
response = urllib2.urlopen(req)
html = response.read()
soup = BeautifulSoup(html, 'html.parser')
text = soup.text
start = text.index('000 results') + 11
end = text.index('NextThe selection')
text = text[start:end]
print text
The problem is your user-agent; it works for me with:
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36')
You are using a user-agent for Firefox 3, which is about 6 years old.
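The same fix with the requests library, as a sketch:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}
response = requests.get(url, headers=headers)  # url: the constructed search URL from the question
soup = BeautifulSoup(response.text, 'html.parser')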

Why can't I scrape Amazon by BeautifulSoup?

Here is my python code:
import urllib2
from bs4 import BeautifulSoup
page = urllib2.urlopen("http://www.amazon.com/")
soup = BeautifulSoup(page)
print soup
It works for google.com and many other websites, but it doesn't work for amazon.com.
I can open amazon.com in my browser, but the resulting "soup" is still empty.
Besides, I find that it cannot scrape appannie.com either. However, rather than returning nothing, the code raises an error:
HTTPError: HTTP Error 503: Service Temporarily Unavailable
So I suspect that Amazon and App Annie block scraping.
Add a header, then it will work.
from bs4 import BeautifulSoup
import requests
url = "http://www.amazon.com/"
# add header
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
print soup
I just ran into this and found that setting any user-agent will work; you don't need to lie about your user agent. (This example is Ruby with HTTParty, but the same applies in Python.)
response = HTTParty.get(url, headers: {'User-Agent' => 'Httparty'})
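The same idea with Python's requests (the user-agent string here is an arbitrary placeholder):
import requests
# any custom User-Agent seems to be enough; 'my-scraper' is just a placeholder
r = requests.get("http://www.amazon.com/", headers={'User-Agent': 'my-scraper'})
print(r.status_code)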
Add a header, and make sure it is actually passed with the request (the original snippet built headers but never used them):
import urllib2
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36'}
req = urllib2.Request("http://www.amazon.com/", headers=headers)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
print soup
