Why Python output doesn't match HTML for Target website - python

I'm trying to scrape the Target website for details such as price, name, and a JPEG of each product, but what Python pulls back through BeautifulSoup doesn't seem to match the HTML of the target website (viewed using F12).
I've tried using html.parser and lxml within the BeautifulSoup function, but neither seems to make a difference. I've tried googling similar problems but haven't found anything. I'm using Atom to run the Python code on Ubuntu 18.04.2. I am pretty new to Python, but have coded a bit before.
from requests import get
from bs4 import BeautifulSoup

url = 'https://www.target.com/s?searchTerm=dove'
# Gets html from the given url
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
items = html_soup.find_all('li', class_ = 'bkaxin')
print(len(items))
It's supposed to output 28, but I consistently get 0.

It looks like the elements that you're trying to find aren't there because they are created dynamically after the site loads. You can see that for yourself by looking at the page source when the website first loads. You can also try printing html_soup.prettify() and you'll see that the elements you're trying to find aren't there.
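For example, a quick check like this (a sketch reusing the response object from the question) shows whether the class name appears anywhere in the raw, pre-JavaScript HTML:
# If this prints False, the <li> elements are created later by JavaScript,
# so requests + BeautifulSoup alone will never see them.
print('bkaXIn' in response.text)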
Inspired by this question, I present a solution based on using selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.target.com/s?searchTerm=dove"
driver = webdriver.Firefox()
driver.get(url)
# Grab the page source after JavaScript has rendered the content
html = driver.page_source
html_soup = BeautifulSoup(html, 'html.parser')
items = html_soup.find_all('li', class_ = 'bkaXIn')
driver.close()
print(len(items))
The previous code outputs 28 when I run it.
Note that you need to install selenium (e.g. with pip install selenium) and the appropriate driver for this to work (in my solution I used the Firefox driver, geckodriver).
Also note that I use class_ = 'bkaXIn' (case sensitive!) in html_soup.find_all.
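As a side note, if you are ever unsure of the exact casing, BeautifulSoup also accepts a compiled regular expression for class_, so a case-insensitive match is possible (a sketch reusing html_soup from above):
import re
# Case-insensitive class match; finds 'bkaXIn' even when written as 'bkaxin'.
items = html_soup.find_all('li', class_=re.compile('bkaxin', re.I))
print(len(items))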

Related

Extracting content from webpage using BeautifulSoup

I am working on scraping numbers from the Powerball website with the code below.
However, numbers keeps coming back empty. Why is this?
import requests
from bs4 import BeautifulSoup
url = 'https://www.powerball.com/games/home'
page = requests.get(url).text
bsPage = BeautifulSoup(page, 'html.parser')
numbers = bsPage.find_all("div", class_="field_numbers")
numbers
I can confirm @Teprr is absolutely correct. You'll need to download Chrome and add chromedriver.exe to your system path for this to work, but the following code gets what you are looking for. You can use other browsers too; you just need their respective drivers.
from bs4 import BeautifulSoup
from selenium import webdriver
import time
url = 'https://www.powerball.com/games/home'
options = webdriver.ChromeOptions()
options.add_argument('headless')
browser = webdriver.Chrome(options=options)
browser.get(url)
time.sleep(3) # wait three seconds for all the js to happen
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
draws = soup.find_all("div", {"class": "number-card"})
print(draws)
for d in draws:
    info = d.find("div", {"class": "field_draw_date"}).getText()
    balls = d.find("div", {"class": "field_numbers"}).find_all("div", {"class": "numbers-ball"})
    numbers = [ball.getText() for ball in balls]
    print(info)
    print(numbers)
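As a side note, a fixed time.sleep(3) can be flaky on slow connections. An explicit wait is usually more robust; here is a sketch of the same idea (assuming number-card is still the class the page renders):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait up to 10 seconds for at least one number card to appear,
# instead of sleeping for a fixed three seconds.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "number-card"))
)
html = browser.page_source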
If you download that file and inspect it locally, you can see that there is no <div> with that class. That means it is likely generated dynamically by your browser using JavaScript, so you would need to use something like selenium to get the full, generated HTML content.
Anyway, in this specific case, this piece of HTML seems to be the container for the data you are looking for:
<div data-url="/api/v1/numbers/powerball/recent?_format=json" class="recent-winning-numbers"
data-numbers-powerball="Power Play" data-numbers="All Star Bonus">
Now, if you check that custom data-url, you can find the information you want in JSON format.
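For example, you could skip the HTML parsing entirely and query that endpoint with requests (a sketch; the exact shape of the returned JSON is an assumption and may change):
import requests
# The data-url attribute points straight at a JSON API.
api_url = 'https://www.powerball.com/api/v1/numbers/powerball/recent?_format=json'
data = requests.get(api_url).json()
print(data)  # inspect the structure to pull out draw dates and numbers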

Why is my parsed image link coming out in base64 format

I was trying to parse an image link from a website.
When I inspect the link on the website, it is this one: https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png but when I parse it with my code the output is data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7.
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').text
soup = BeautifulSoup(source, 'lxml')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.find('img', class_='css-1fxh5tw product-card__hero-image')['src']
print(image_scr)
I think the code isn't the issue, but I don't know what's causing the link to come out in base64 format. So how could I get the code to return the link as a .png?
What happens?
First of all, take a look into your soup: there is the truth. The website does not provide all of its information statically; a lot of it is generated dynamically and rendered by the browser, so requests won't get this info this way.
Workaround
Take a look at the <noscript> tag next to your selection; it holds a smaller version of the image and provides the src.
Example
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').content
soup = BeautifulSoup(source, 'lxml')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.select_one('noscript img.css-1fxh5tw.product-card__hero-image')['src']
print(image_scr)
Output
https://static.nike.com/a/images/c_limit,w_318,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png
If you would like a bigger picture, just replace the parameter w_318 with w_1000.
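For example (a one-line sketch, reusing image_scr from above):
# Request a larger rendition by swapping the width parameter in the URL.
big_image = image_scr.replace('w_318', 'w_1000')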
Edit
Concerning your comment: there are a lot more solutions, but it depends on what you want to do with the information and what you are going to work with.
The following approach uses Selenium, which, unlike requests, renders the website and gives you the "right" page source back, but it also needs more resources than requests:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome(r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok')
soup = BeautifulSoup(driver.page_source, 'html.parser')
pair = soup.find('div', class_='product-card__body')
image_scr = pair.select_one('img.css-1fxh5tw.product-card__hero-image')['src']
print(image_scr)
Output
https://static.nike.com/a/images/c_limit,w_592,f_auto/t_product_v1/df7c2668-f714-4ced-9f8f-1f0024f945a9/chaussure-de-basketball-zoom-freak-3-MZpJZF.png
Since you want to grab the src, i.e. download image data from the server using requests, you need to use the response's .content (raw bytes), as follows:
source = requests.get('https://www.nike.com/fr/w/hommes-chaussures-nik1zy7ok').content
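The same applies to the image itself: once you have its URL (image_scr above), request it and write the raw bytes to disk (a sketch; the filename shoe.png is arbitrary):
import requests
# .content gives the raw bytes of the image, suitable for writing to a file.
img = requests.get(image_scr).content
with open('shoe.png', 'wb') as f:
    f.write(img)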

Web scraping with Python and beautifulsoup: What is saved by the BeautifulSoup function?

This question follows this previous question. I want to scrape data from a betting site using Python. I first tried to follow this tutorial, but the problem is that the site tipico is not available from Switzerland. I thus chose another betting site: Winamax. In the tutorial, the tipico webpage is first inspected in order to find where the betting rates are located in the HTML file. In the tipico webpage, they were stored in buttons of class "c_but_base c_but". By writing the following lines, the rates could therefore be saved and printed using the BeautifulSoup module:
from bs4 import BeautifulSoup
import urllib.request
import re
url = "https://www.tipico.de/de/live-wetten/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
I thus tried to do the same with the webpage Winamax. I inspected the page and found that the betting rates were stored in buttons of class "ui-touchlink-needsclick price odd-price". See the code below:
from bs4 import BeautifulSoup
import urllib.request
import re
url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    print(f"An error occurred: {e}")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('ui-touchlink-needsclick price odd-price')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
The problem is that it prints nothing: Python does not find any elements of that class (right?). I thus tried to print the soup object in order to see what exactly the BeautifulSoup function was doing. I added this line:
print(soup)
When printing it (I do not show the print of soup here because it is too long), I notice that this is not the same text as what appears when I right-click and "inspect" the Winamax webpage. So what is the BeautifulSoup function exactly doing? How can I store the betting rates from the Winamax website using BeautifulSoup?
EDIT: I have never coded in HTML and I'm a beginner in Python, so some terminology might be wrong; that's why some parts are in italics.
That's because the website is using JavaScript to display these details, and BeautifulSoup does not interact with JS on its own.
First, try to find out if the element you want to scrape is present in the page source; if it is, you can scrape pretty much anything! In your case the button/span tags were not in the page source (meaning they are hidden or pulled in through a script).
No <button> tag in the page source:
So I suggest using Selenium as the solution, and I tried a basic scrape of the website.
Here is the code I used :
from selenium import webdriver
option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.binary_location = r'Your chrome.exe file path'
browser = webdriver.Chrome(executable_path=r'Your chromedriver.exe file path', options=option)
browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")
span_tags = browser.find_elements_by_tag_name('span')
for span_tag in span_tags:
    print(span_tag.text)
browser.quit()
The output contains some junk data, but that's for you to figure out what you need and what you don't!

Trying to print HREF tags from a site and getting weird results

I'm trying to print the HREF tags from the link below.
Here's my first attempt.
# the Python 3 version:
from bs4 import BeautifulSoup
import urllib.request
resp = urllib.request.urlopen("https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER")
soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))
for link in soup.find_all('a', href=True):
    print(link['href'])
When I run that, I get this.
/feed/
/feed/
/feed/
/mynetwork/
/jobs/
/messaging/
/notifications/
#
Here's my second attempt.
# and a version using the requests library, which as written will work in both Python 2 and 3:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER')
soup = bs(r.content, 'lxml')
links = [item['href'] if item.get('href') is not None else item['src'] for item in soup.select('[href^="https"]') ]
print(links)
When I run that, I get this.
['https://static-exp1.licdn.com/sc/h/n7m1fekt1d9hawp3s7wats11', 'https://static-exp1.licdn.com/sc/h/al2o9zrvru7aqj8e1x2rzsrca', 'https://static-exp1.licdn.com/sc/h/2if24wp7oqlodqdlgei1n1520', 'https://static-exp1.licdn.com/sc/h/eahiplrwoq61f4uan012ia17i', 'https://static-exp1.licdn.com/sc/h/2if24wp7oqlodqdlgei1n1520', 'https://static-exp1.licdn.com/sc/h/eahiplrwoq61f4uan012ia17i', 'https://static-exp1.licdn.com/sc/h/c7y7qgvm2uh1zn8pgl84l3rty', 'https://static-exp1.licdn.com/sc/h/auhsc2hi2zkvt7nbqep2ejauv', 'https://static-exp1.licdn.com/sc/h/9vf4mi871c6wolrcm3pgqywes', 'https://static-exp1.licdn.com/sc/h/7z1536jzhgep1sw5uk19e8ec7', 'https://static-exp1.licdn.com/sc/h/a0on5mxqtufmy9y66neg9mdgy', 'https://static-exp1.licdn.com/sc/h/1edhu1lemiqjsbgubat2dejxr', 'https://static-exp1.licdn.com/sc/h/2gdon0pq1074su3zwdop1y2g1']
I was expecting to see something like this:
https://www.linkedin.com/in/timlmorgan/
https://www.linkedin.com/in/timmorgan3/
https://www.linkedin.com/in/tim-morgan-19543731/
etc., etc., etc.
I guess LinkedIn must be doing something special, which I'm not aware of. When I run the same code against 'https://www.nytimes.com/', I get the results that I would expect. This is just a learning exercise. I'm curious to know what's going on here. I'm not interested in actually scanning LinkedIn for data.
LinkedIn loads its data asynchronously. If you actually view the source (Ctrl + U on Windows) of the URL you're fetching, you won't find your expected results, because JavaScript loads them after the page has already loaded with the base information.
BeautifulSoup won't execute the JavaScript on the page that fetches that data.
To solve this, one would figure out the API endpoints behind the page and have the script call those directly, e.g.:
https://www.linkedin.com/voyager/api/search/filters?filters=List()&keywords=tim%20morgan&q=universalAll&queryContext=List(primaryHitType-%3EPEOPLE)
except that you would need to adjust your call to pass the CSRF check, or actually utilize their official API.
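A rough sketch of that idea with a requests session follows. Everything here is an assumption: you would need real cookies copied from a logged-in browser session, and the csrf-token header mirroring the JSESSIONID cookie is commonly reported behaviour for this endpoint, not a documented API:
import requests
session = requests.Session()
# Hypothetical values: copy JSESSIONID and li_at from a logged-in browser.
# The csrf-token header must match the JSESSIONID cookie value.
session.cookies.set('JSESSIONID', 'ajax:1234567890')
session.cookies.set('li_at', 'YOUR_LI_AT_COOKIE')
session.headers['csrf-token'] = 'ajax:1234567890'
url = ('https://www.linkedin.com/voyager/api/search/filters'
       '?filters=List()&keywords=tim%20morgan&q=universalAll'
       '&queryContext=List(primaryHitType-%3EPEOPLE)')
response = session.get(url)
print(response.status_code)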
I tested some Selenium code which seems to do the trick.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox(executable_path=r'C:\files\geckodriver.exe')
driver.set_page_load_timeout(30)
driver.get("https://www.google.com/")
driver.get("https://www.linkedin.com/search/results/all/?keywords=tim%20morgan&origin=GLOBAL_SEARCH_HEADER")
continue_link = driver.find_element_by_tag_name('a')
elems = driver.find_elements_by_xpath("//a[@href]")  # note: @href, not #href
for elem in elems:
    print(elem.get_attribute("href"))

Requests.content not matching with Chrome inspect element

I'm using BeautifulSoup and Requests to scrape allrecipes user data.
When inspecting the HTML code I find that the data I want is contained within
<article class="profile-review-card">
However when I use the following code
import requests
from bs4 import BeautifulSoup

URL = 'http://allrecipes.com/cook/2010/reviews/'
response = requests.get(URL).content
soup = BeautifulSoup(response, 'html.parser')
X = soup.find_all('article', class_="profile-review-card")
While soup and response are full of HTML, X is empty. I've looked through them, and there are some inconsistencies between what I see with inspect element and requests.get(URL).content. What is going on?
What Chrome inspect shows me
That's because it's loaded using Ajax/JavaScript. The requests library doesn't handle that; you'll need to use something that can execute these scripts and get the DOM. There are various options; I'll list a couple to get you started, with a minimal sketch after the list:
Selenium
ghost.py
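For example, a minimal Selenium version of your snippet might look like this (a sketch; it assumes chromedriver is on your PATH and that the profile-review-card class is unchanged):
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('http://allrecipes.com/cook/2010/reviews/')
# Parse the JavaScript-rendered DOM instead of the raw HTTP response.
soup = BeautifulSoup(driver.page_source, 'html.parser')
X = soup.find_all('article', class_='profile-review-card')
driver.quit()
print(len(X))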
