Python webscraping doesn't produce raw HTML

I am trying to scrape data from a table.
With the following code, the HTML I get back contains template placeholders instead of the actual values: where I expect a value from the table, I get {value}.
How can I get the HTML with the real values rather than the placeholder names?
Thanks
import requests
from bs4 import BeautifulSoup

my_session = requests.session()
# Visit the home page first so the session picks up its cookies
for_cookies = my_session.get("https://uk.investing.com")
cookies = for_cookies.cookies
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0'}
my_url = 'https://uk.investing.com/stock-screener/?sp=country::4|sector::a|industry::80|equityType::a|exchange::3%3Ceq_market_cap;1'
response = my_session.get(my_url, headers=headers, cookies=cookies)
soup = BeautifulSoup(response.content, "lxml")
print(soup.prettify())  # print the parsed HTML
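
One way to confirm what is happening is to check whether the HTML the server sent still contains unrendered {name} placeholders, which would suggest the table is filled in by JavaScript after the page loads. The following is a minimal sketch using only the standard library; the function name find_placeholders and the sample markup are made up for illustration and are not taken from the site:

```python
import re

# Matches simple single-brace template tokens such as {value} or {pair_name}.
PLACEHOLDER_RE = re.compile(r"\{([A-Za-z_][A-Za-z0-9_]*)\}")


def find_placeholders(html):
    """Return the distinct placeholder names found in html, in order of first appearance."""
    seen = []
    for name in PLACEHOLDER_RE.findall(html):
        if name not in seen:
            seen.append(name)
    return seen


# Hypothetical sample: a table row template that was never filled in.
sample = "<tr><td>{name}</td><td>{value}</td><td>{value}</td></tr>"
print(find_placeholders(sample))  # ['name', 'value']
```

With the question's code you would call find_placeholders(response.text). This is only a heuristic and may produce false positives on inline scripts, but if it returns names, the values are probably loaded by a separate request or rendered in the browser.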

Related

BeautifulSoup findAll not returning results

I want to get the product names and prices from this page. For the price I did pretty much the same thing as for the product name, but I'm not getting anything.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as bSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:77.0) Gecko/20100101 Firefox/77.0'}
url = "https://www.walmart.ca/search?q=lettuce"
req = Request(url=url, headers=header)
client = urlopen(req)
pageHtml = client.read()
client.close()
pageSoup = bSoup(pageHtml, 'html.parser')
products = pageSoup.findAll("div", {"class": "css-155zfob e175iya63"})
print(len(products))  # prints 15, like expected
for product in products:
    pass
prices = pageSoup.findAll("div", {"class": "css-8frhg8 e175iya65"})
print(len(prices))  # prints 0 and idk why :/
for price in prices:
    pass

The HTML returned for https://www.walmart.ca/search?q=lettuce does not contain the content you expect. You can check by searching the raw response for the class name:
curl -s -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:77.0) Gecko/20100101 Firefox/77.0' 'https://www.walmart.ca/search?q=lettuce' | grep 'css-8frhg8'
You probably saw that class in a browser, where part of the content is rendered at run time by JavaScript. To scrape it you need a library that can emulate a browser with JavaScript support.
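
The same check as the curl and grep command can be done in Python before writing any findAll calls. This is a rough sketch using only the standard library; ClassCollector, classes_in_html and the sample markup are hypothetical names and data chosen for illustration:

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every CSS class name that appears in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key == "class" and value:
                # A class attribute may hold several space-separated names.
                self.classes.update(value.split())


def classes_in_html(html):
    """Return the set of class names present in the raw HTML string."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


# Hypothetical sample: the product markup is present, the price markup is not.
sample = '<div class="css-155zfob e175iya63">Lettuce</div><div id="price-root"></div>'
found = classes_in_html(sample)
print("css-155zfob" in found)  # True
print("css-8frhg8" in found)   # False
```

In the question's code you would pass pageHtml.decode() to classes_in_html. If a class you saw in the browser's inspector is missing from the result, it was most likely added by JavaScript after the page loaded.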

beautifulsoup not returning all html

import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.amazon.com/s?k=iphone+5s&ref=nb_sb_noss')
c = r.content
soup = BeautifulSoup(c, 'html.parser')
all = soup.find_all("span", {"class": "a-size-medium a-color-base a-text-normal"})
print(all)
This is my simple Python script for scraping an Amazon page, but not all of the HTML ends up in the soup variable, so I get nothing when I try to find a specific set of tags and extract them.

Try the code below; it should do the trick for you.
Your code was missing the request headers.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
url = 'https://www.amazon.com/s?k=iphone+5s&ref=nb_sb_noss'
response = requests.get(url, headers=headers)
print(response.text)
soup = BeautifulSoup(response.content, features="lxml")
my_all = soup.find_all("span", {"class": "a-size-medium a-color-base a-text-normal"})
print(my_all)

BeautifulSoup Find periodically returns None

I am trying to get a value from an element by its class. Sometimes find returns the value I need, but at other times it returns None.
Code:
import requests
from bs4 import BeautifulSoup
url = 'https://beru.ru/catalog/molotyi-kofe/76321/list'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                         '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
item_count = (soup.find('div', class_='_2StYqKhlBr')).text.split()[4]
print(item_count)

You get the value only some of the time because the website is protected by a CAPTCHA.
When a request is blocked by the CAPTCHA, it ends up at a URL like the following:
https://beru.ru/showcaptcha?retpath=https://beru.ru/catalog/molotyi-kofe/76321/list?ncrnd=4561_aa1b86c2ca77ae2b0831c4d95b9d85a4&t=0/1575204790/b39289ef083d539e2a4630548592a778&s=7e77bfda14c97f6fad34a8a654d9cd16
You can verify this by parsing the response content:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://beru.ru/catalog/molotyi-kofe/76321/list')
soup = BeautifulSoup(r.text, 'html.parser')
# Present when the catalog page was served
for item in soup.findAll('div', attrs={'class': '_2StYqKhlBr _1wAXjGKtqe'}):
    print(item)
# Present when the CAPTCHA page was served instead
for item in soup.findAll('div', attrs={'class': 'captcha__image'}):
    for captcha in item.findAll('img'):
        print(captcha.get('src'))
And you will get the CAPTCHA image link:
https://beru.ru/captchaimg?aHR0cHM6Ly9leHQuY2FwdGNoYS55YW5kZXgubmV0L2ltYWdlP2tleT0wMEFMQldoTnlaVGh3T21WRmN4NWFJRUdYeWp2TVZrUCZzZXJ2aWNlPW1hcmtldGJsdWU,_0/1575206667/b49556a86deeece9765a88f635c7bef2_df12d7a36f0e2d36bd9c9d94d8d9e3d7
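
Building on that, the scraper can check for the CAPTCHA page before calling find(...).text, which is the call that fails when find returns None. A minimal sketch, assuming the two markers seen above (showcaptcha in the URL and the captcha__image class) are reliable indicators; the helper name looks_like_captcha and the sample strings are hypothetical:

```python
def looks_like_captcha(final_url, html):
    """Heuristic: True if the response appears to be the CAPTCHA page, not the catalog."""
    return "showcaptcha" in final_url or "captcha__image" in html


# Hypothetical samples standing in for a blocked and a normal response.
blocked_url = "https://beru.ru/showcaptcha?retpath=https://beru.ru/catalog/molotyi-kofe/76321/list"
blocked_html = '<div class="captcha__image"><img src="/captchaimg"></div>'
normal_url = "https://beru.ru/catalog/molotyi-kofe/76321/list"
normal_html = '<div class="_2StYqKhlBr">...</div>'

print(looks_like_captcha(blocked_url, blocked_html))  # True
print(looks_like_captcha(normal_url, normal_html))    # False
```

With requests, page.url holds the final URL after redirects and page.text the body, so the check would be looks_like_captcha(page.url, page.text). When it returns True the script can skip parsing or retry later instead of raising AttributeError.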

Trimming the outputs in python

I made a script that gets the time from https://time.is/ and shows it. I used BeautifulSoup and urllib.request.
But I want to trim the output. This is what I get, and I want to remove the HTML markup around the time:
<div id="twd">07:29:26</div>
Program File:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://time.is/'
hdr = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)' }
req = urllib.request.Request(url, headers=hdr)
res = urllib.request.urlopen(req)
soup = BeautifulSoup(res, 'html.parser')
string = soup.find(id='twd')
print(string)
How can I get just the text?

You can get the text from the DOM element with .text, like:
string.text
Test Code:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://time.is/'
hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
req = urllib.request.Request(url, headers=hdr)
res = urllib.request.urlopen(req)
soup = BeautifulSoup(res, 'html.parser')
string = soup.find(id='twd')
print(string.text)
Results:
07:06:11PM

Python request not showing desired result

I am not an expert with Python, but this is what I did with python-requests. I am trying to call this URL, which gives me the email address of a user if I provide the first_name, last_name and domain.
https://dry-tor-58240.herokuapp.com
However, when I request it from Python I get a 200 response code, but when I parse response.text with Beautiful Soup I don't see the email address anywhere in it.
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0'}
payload = {"first_name":"nandish","last_name":"ajani","domain":"atyantik.com"}
r = requests.get("https://dry-tor-58240.herokuapp.com/", headers = headers, params = payload)
soup = BeautifulSoup(r.text, 'lxml')
Can anyone tell me what I am doing wrong?

It should be a POST request. The response is JSON, so I also used the response's .json() method.
import requests
from bs4 import BeautifulSoup
request_url = 'https://dry-tor-58240.herokuapp.com/find'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
payload = {"first_name":"nandish","last_name":"ajani","domain":"atyantik.com"}
jsonObj = requests.post(request_url, headers = headers, json = payload).json()
Then print the email field:
print (jsonObj['email'])
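
To see why the two calls differ, it can help to build both requests without sending them. The sketch below uses the standard library's urllib instead of requests so that it runs offline; it only illustrates where the payload ends up, and the Content-Type header is an assumption about what a JSON endpoint expects:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

payload = {"first_name": "nandish", "last_name": "ajani", "domain": "atyantik.com"}

# GET with params: the payload is encoded into the query string of the URL.
get_req = Request("https://dry-tor-58240.herokuapp.com/?" + urlencode(payload))

# POST with json: the payload is serialized into the request body.
post_req = Request(
    "https://dry-tor-58240.herokuapp.com/find",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},  # assumed header for a JSON body
)

# Nothing is sent over the network; we only inspect the prepared requests.
print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

The GET request only fetches the page with extra query parameters, while the POST request carries the fields in its body, which is what requests.post(..., json=payload) does in the answer above.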
