I'm new to coding with Python, so please bear with me. I'm trying to find the number of product images a product has on Amazon.
1. I can't seem to get it to work correctly.
2. Is there a way to insert a list of ASINs so they can all print out with the number?
Thanks!
import bs4
import webbrowser
import requests
response = requests.get('https://www.amazon.com/dp/B01MRXQPJ5')
soup = bs4.BeautifulSoup(response.text, 'html.parser')
elems = soup.select('ul.a-unordered-list.a-nostyle.a-button-list.a-vertical.a-spacing-top-micro > li')
Since Amazon renders its pages using JavaScript, the content is generated client-side rather than server-side.
When you use requests you only get the server-side content. To get the content generated client-side, you must use something like selenium or dryscrape.
Here's working code that will count the number of images for a list of product IDs.
Code:
import selenium.webdriver as webdriver
import lxml.html.clean as clean
from bs4 import BeautifulSoup

urls = ['B017TSPK5K', 'B00B96KLCQ', 'B01MZ9E6CG']
browser = webdriver.Chrome()

for url in urls:
    amazon_url = "https://www.amazon.com/dp/{}".format(url)
    browser.get(amazon_url)
    content = browser.page_source
    # strip scripts and styles before handing the HTML to BeautifulSoup
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    soup = BeautifulSoup(content, 'html.parser')
    # each image thumbnail is an <li> with these classes
    soup_li = soup.find_all('li', {'class': 'a-spacing-small item a-declarative'})
    print("Product ID: {} has {} images.".format(url, len(soup_li)))

browser.close()
Output:
'Product ID: B017TSPK5K has 2 images.'
'Product ID: B00B96KLCQ has 5 images.'
'Product ID: B01MZ9E6CG has 3 images.'
I am trying to web scrape a dynamic page using Selenium, BeautifulSoup and Python, and I am able to scrape the first page. But when I try to get to the next page, the URL doesn't change, and when I inspect it, I am unable to see any Form Data either. Can someone help me?
import time
from selenium import webdriver
from parsel import Selector
from bs4 import BeautifulSoup
import random
import re
import csv
import requests
import pandas as pd
companies = []
overview = []
people = []
driver = webdriver.Chrome(executable_path=r'C:\\Users\\rahul\Downloads\\chromedriver_win32 (1)\\chromedriver.exe')
driver.get('https://coverager.com/data/companies/')
driver.maximize_window()
src = driver.page_source
soup = BeautifulSoup(src, 'lxml')
table = soup.find('tbody')
descrip = []
table_rows = table.find_all('tr')
for tr in table_rows:
    td = tr.find_all('td')
    # print(td)
    row = [i.text.strip() for i in td]
    descrip.append(row)
    # print(row)

# file = open('gag.csv', 'w')
# with file:
#     write = csv.writer(file)
#     write.writerows(descrip)

url = 'https://coverager.com'
a_tags = table.find_all('a', href=True)

for link in a_tags:
    ol = link.get('href')
    pl = link.string.strip()
    # companies.append(row)
    # print(pl)
    # print(ol)
    driver.get(url + ol)
    driver.implicitly_wait(1000)
    data1 = driver.find_element_by_class_name('tab-details').text
    overview.append(data1.strip())
    driver.find_element_by_link_text('People').click()
    p_tags = driver.find_element_by_class_name('tab-details').text
    people.append(p_tags)
In the case of https://coverager.com/data/companies/ it would be much easier to scrape the API call instead of the HTML on the page.
Open dev tools (in Chrome, right-click and hit Inspect) and go to the Network tab. When you hit the "Next" button, a row should show up in the Network tab. Click on this row and then go to Preview; you should see the companies there.
The API is accessed through links that look like the following:
https://coverager.com/wp-json/ath/v1/coverager-data/companies?per_page=20&page=2&draw=4&column=3&dir=desc&filters=%7B%22companies%22:[],%22company_lob%22:[],%22industry%22:[],%22company_type%22:[],%22company_category%22:[],%22region%22:[],%22founded%22:[],%22company_stage%22:[],%22company_business_model%22:[]%7D
It seems like all the pages call the same API URL but change page= and draw=, which stay 2 apart.
So, simply use requests to call this class of links and loop through as many pages as you need. You could also change per_page to return as many companies as you need. You will have to test that, though.
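A minimal sketch of that loop, assuming the URL pattern above holds, the endpoint is publicly reachable with plain requests, and draw= really does stay 2 ahead of page=; the JSON structure of the response is something you would still need to inspect yourself:
import requests

# Reuse the URL observed in the Network tab and only vary page= and draw=
# (assumption: draw stays 2 ahead of page, as noted above; the filters blob is
# the empty-filter default captured from the page).
base_url = ('https://coverager.com/wp-json/ath/v1/coverager-data/companies'
            '?per_page=20&page={page}&draw={draw}&column=3&dir=desc'
            '&filters=%7B%22companies%22:[],%22company_lob%22:[],%22industry%22:[],'
            '%22company_type%22:[],%22company_category%22:[],%22region%22:[],'
            '%22founded%22:[],%22company_stage%22:[],%22company_business_model%22:[]%7D')

for page in range(1, 6):  # loop over as many pages as you need
    url = base_url.format(page=page, draw=page + 2)
    data = requests.get(url).json()  # inspect this JSON to pull out the company fields you want
    print(page, data)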
Here is my first post; I hope it's clear.
I'm scraping a website and here is the HTML I'm interested in scraping:
<div id="live-table">
<div class="event mobile event--summary">
<div elementtiming="SpeedCurveFRP" class="leagues--static event--leagues summary-results">
<div class="sportName tennis">
<div id="g_2_ldRHDOEp" title="Clicca per i dettagli dell'incontro!" class="event__matchevent__match--static event__match--twoLine">
...
What I would like to obtain is the last id (g_2_ldRHDOEp), and here is the code I produced using the BeautifulSoup library:
import urllib.request, urllib.error, urllib.parse
from bs4 import BeautifulSoup
url = '...'
response = urllib.request.urlopen(url)
webContent = response.read()
soup = BeautifulSoup(webContent, 'html.parser')
list = []
list = soup.find_all("div")
total_id = " "

for i in list:
    id = i.get('id')
    total_id = total_id + "\n" + str(id)

print(total_id)
But what I get is only
live-table
None
None
None
None
I'm quite new to both Python and BeautifulSoup, and I'm not a serious programmer; I do this just for fun.
Can anyone tell me why I can't get what I want, and maybe show me a better, more successful way to do it?
Thank you in advance
First of all, id and list are built-in names, so don't use them as variable names.
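A quick, self-contained illustration of why shadowing them bites (hypothetical values, just for demonstration):
list = [1, 2, 3]      # shadows the built-in list type
# list('abc')         # would now raise TypeError: 'list' object is not callable
del list              # deleting the name restores access to the built-in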
The website is loaded dynamically, so requests alone won't get the rendered content. We can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the correct ChromeDriver from here.
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
URL = "https://www.flashscore.it/giocatore/djokovic-novak/AZg49Et9/"
driver = webdriver.Chrome(r"C:\path\to\chromedriver.exe")
driver.get(URL)
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
for tag in soup.find_all("div", id="g_2_ldRHDOEp"):
    print(tag.get_text(separator=" "))
driver.quit()
Output:
30.10. 12:05 Djokovic N. (Srb) Sonego L. (Ita) 0 2 2 6 1 6 P
I am trying to retrieve the 'Shares Outstanding' of a stock via this page:
https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#
(Click on 'Financial Statements' > 'Condensed Consolidated Balance Sheets (Unaudited) (Parenthetical)'.)
The data is at the bottom of the table, on the left. I am using Beautiful Soup, but I am having issues retrieving the share count.
The code I am using:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    document = row.find('a', string='Common stock, shares outstanding (in shares)')
    shares = row.find('td', class_='nump')
    if None in (document, shares):
        continue
    print(document)
    print(shares)
This returns nothing, but the desired output is 4,323,987,000.
Can someone help me retrieve this data?
Thanks!
That's a JS-rendered page. Use Selenium:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
# import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/cgi-bin/viewer?action=view&cik=320193&accession_number=0000320193-20-000052&xbrl_type=v#'

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.set_window_size(1024, 600)
driver.maximize_window()
driver.get(url)
sleep(10)  # <--- waits 10 seconds so that the page can get rendered
# action = webdriver.ActionChains(driver)
# print(driver.page_source)  # <--- this will give you the page source

soup = BeautifulSoup(driver.page_source, 'html.parser')
rows = soup.find_all('tr')
for row in rows:
    shares = row.find('td', class_='nump')
    if shares:
        print(shares)
<td class="nump">4,334,335<span></span>
</td>
<td class="nump">4,334,335<span></span>
</td>
Better, use:
shares = soup.find('td', class_='nump')
if shares:
    print(shares.text.strip())
4,334,335
Ah, the joys of scraping EDGAR filings :(...
You're not getting your expected output because you're looking in the wrong place. The URL you have is an iXBRL viewer. The data comes from here:
url = 'https://www.sec.gov/Archives/edgar/data/320193/000032019320000052/R1.htm'
You can either find that URL by looking at the Network tab in the developer tools, or you can simply translate the viewer URL into this URL: for example, the 320193 figure is the CIK number, etc.
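For illustration only, a minimal sketch of that translation, assuming the Archives folder name is simply the accession number with its dashes stripped and that R1.htm happens to be the report you want for this particular filing:
cik = '320193'
accession = '0000320193-20-000052'

# Assumption: folder name = accession number without dashes; R1.htm = the report
# holding the parenthetical balance sheet for this filing.
url = 'https://www.sec.gov/Archives/edgar/data/{}/{}/R1.htm'.format(
    cik, accession.replace('-', ''))
print(url)
# https://www.sec.gov/Archives/edgar/data/320193/000032019320000052/R1.htm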
Once you figure that out, the rest is simple:
import requests
from bs4 import BeautifulSoup as bs

req = requests.get(url)
soup = bs(req.text, 'lxml')
soup.select_one('.nump').text.strip()
Output:
'4,334,335'
Edit:
To search by "Shares Outstanding", try:
targets = soup.select('tr.ro')
for target in targets:
    targ = target.select('td.pl')
    for t in targ:
        if "Shares Outstanding" in t.text:
            print(target.select_one('td.nump').text.strip())
And might as well throw this one in: another way to do that is to use XPath instead, via the lxml library:
import lxml.html as lh

doc = lh.fromstring(req.text)
doc.xpath('//tr[@class="ro"]//td[@class="pl "][contains(.//text(),"Shares Outstanding")]/following-sibling::td[@class="nump"]/text()')[0]
I am trying to web scrape this site in order to get basic stock information: https://www.macrotrends.net/stocks/charts/AAPL/apple/financial-ratios
My code is as follows:
from requests import get
from bs4 import BeautifulSoup as bs
url = 'https://www.macrotrends.net/stocks/charts/AAPL/apple/financial-ratios'
response = get(url)
html_soup = bs(response.text, 'html.parser')
stock_container = html_soup.find_all("div", attrs= {'id': 'row0jqxgrid'})
print(len(stock_container))
Right now I am taking it slow and just trying to return the number of div elements under the id "row0jqxgrid". I am pretty sure everything up to line 8 is fine, but I don't know how to properly reference the id using attrs, or if that's even possible.
Can anybody provide any information?
Ross
You can use selenium for this job:
from selenium import webdriver
import os

# define path to chrome driver
chrome_driver = os.path.abspath(os.path.dirname(__file__)) + '/chromedriver'
browser = webdriver.Chrome(chrome_driver)
browser.get("https://www.macrotrends.net/stocks/charts/AAPL/apple/financial-ratios")

# get row element
row = browser.find_element_by_xpath('//*[@id="row0jqxgrid"]')

# find all divs currently displayed
divs_list = row.find_elements_by_tag_name('div')

# get text from cells
for item in divs_list:
    print(item.text)
Output:
The output text is doubled because the table data are loaded dynamically as you move the bottom scrollbar to the right; a small deduplication sketch follows the output below.
Current Ratio
Current Ratio
1.5401
1.5401
1.1329
1.1329
1.2761
1.2761
1.3527
1.3527
1.1088
1.1088
1.0801
1.0801
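If the doubling gets in the way, here is a minimal sketch that collapses consecutive duplicates, assuming divs_list from the code above is still in scope and that each cell really does show up exactly twice in a row, as in the output above:
from itertools import groupby

# Collapse runs of identical adjacent values; non-adjacent legitimate repeats survive.
texts = [item.text for item in divs_list]
deduped = [value for value, _ in groupby(texts)]
print(deduped)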
I'd like to gather a list of films, and their links, for all available movies on the Sky Cinema website.
The website is:
http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200
I am using Python 3.6 and Beautiful Soup.
I am having problems finding the title and link, especially as there are several pages to click through, possibly based on scroll position (in the URL?).
I've tried using BeautifulSoup and Python but there is no output. The code I have tried only returns the title, and I'd like both the title and the link to the film. As these are in different areas of the site, I am unsure how this is done.
Code I have tried:
from bs4 import BeautifulSoup
import requests
link = "http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200"
r = requests.get(link)
page = BeautifulSoup(r.content, "html.parser")
for dd in page.find_all("div", {"class": "sentence-result-infos"}):
    title = dd.find(class_="title ellipsis ng-binding").text.strip()
    print(title)

spans = page.find_all('span', {'class': 'title ellipsis ng-binding'})
for span in spans:
    print(span.text)
I'd like the output to show as the title, link.
EDIT:
I have just tried the following, but I get an error that "text" is not an attribute:
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('http://www.sky.com/tv/channel/skycinema/find-a-movie/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200')
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('span', {'class': 'title ellipsis ng-binding'}).text.strip()
print(title)
There is an API to be found in the Network tab. You can get all results with one call by setting the limit to a number greater than the expected result count:
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=10000&window=skyMovies').json()
Or use the number you can see on the page:
import requests
import pandas as pd
base = 'http://www.sky.com/tv'
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=1555&window=skyMovies').json()
data = [(item['title'], base + item['url']) for item in r['items']]
df = pd.DataFrame(data, columns = ['Title', 'Link'])
print(df)
First of all, read the terms and conditions of the site you are going to scrape.
Next, you need Selenium:
from selenium import webdriver
import bs4
# MODIFY the url with YOURS
url = "type the url to scrape here"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = bs4.BeautifulSoup(html, "html.parser")
baseurl = 'http://www.sky.com/'
titles = [n.text for n in soup.find_all('span', {'class':'title ellipsis ng-binding'})]
links = [baseurl+h['href'] for h in soup.find_all('a', {'class':'sentence-result-pod ng-isolate-scope'})]
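To get the "title, link" output the question asks for, and assuming the two lists line up one-to-one on the rendered page, you could then pair them up:
# Assumes titles and links come out in the same order and in equal numbers.
for title, link in zip(titles, links):
    print(title, link, sep=', ')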