I cannot scrape items from this website (Python)

I am trying to scrape all the clothing items on this website, but I have not been able to. I set limit=3 in find_all, but it gives me only 1 result. How can I get all the results in one request?
Please help me, I am stuck on this!
This is the e-commerce website I am trying to scrape:
import requests
from bs4 import BeautifulSoup

def trendyol():
    url = "https://www.trendyol.com/erkek+kazak--hirka?filtreler=22|175"
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
    page = requests.get(url, headers=headers).text
    soup = BeautifulSoup(page, "html.parser")
    # limit=3 should cap the results at 3 matches; here it returns only 1
    products = soup.find_all("div", {"class": "p-card-chldrn-cntnr"}, limit=3)
    for div in products:
        link = "https://www.trendyol.com/" + div.a.get("href")
        name = div.find("span", {"class": "prdct-desc-cntnr-name hasRatings"}).text
        print(f'link: {link}')
        print(f'isim: {name}')

Try this code:
from bs4 import BeautifulSoup
import requests

def trendyol(url):
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
    page = requests.get(url, headers=headers).text
    soup = BeautifulSoup(page, "html.parser")
    # the wrapper div holds every product card on the page
    container = soup.find("div", {'class': 'prdct-cntnr-wrppr'})
    for card in container.find_all('div', {'class': 'p-card-chldrn-cntnr'}):
        print("https://www.trendyol.com" + card.find('a', href=True)['href'])
        print(card.find('div', {'class': 'image-container'}).img['alt'])
        print(card.find('span', {'class': 'prdct-desc-cntnr-ttl'}).text)

url = "https://www.trendyol.com/erkek+kazak--hirka?filtreler=22%7C175&pi=3"
trendyol(url)
This code will print each product's URL, title, and the alt text of its image. Thanks.
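Since the listing is paginated, "all results in one request" is not really possible; the usual approach is to loop over the pi page parameter visible in the URL above. A minimal sketch of that idea (the max_pages cap and the empty-page stop condition are assumptions, not documented behavior of the site):

from bs4 import BeautifulSoup
import requests

def trendyol_all_pages(base_url, max_pages=10):
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
    cards = []
    for pi in range(1, max_pages + 1):
        # `pi` is assumed to be Trendyol's page index, based on the URL above
        page = requests.get(f"{base_url}&pi={pi}", headers=headers).text
        soup = BeautifulSoup(page, "html.parser")
        found = soup.find_all("div", {"class": "p-card-chldrn-cntnr"})
        if not found:  # assumption: a page past the end has no product cards
            break
        cards.extend(found)
    return cards

cards = trendyol_all_pages("https://www.trendyol.com/erkek+kazak--hirka?filtreler=22%7C175")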

Related

Web Scraping a list of links from Tripadvisor

I'm trying to create a web scraper that will return a list of links to individual attractions from the example website.
The code I wrote takes the list of pages and returns the list of links to each attraction, but in the wrong shape (the links are grouped per page instead of one after the other):
Could someone help me correct this code so that it returns a flat list of links like the one below?
I will be grateful for any help.
My code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
}
restaurantLinks = open('pages.csv')
print(restaurantLinks)
urls = [url.strip() for url in restaurantLinks.readlines()]
restlist = []
for link in urls:
    print("Opening link: " + str(link))
    response = requests.get(link, headers=header)
    soup = BeautifulSoup(response.text, 'html.parser')
    productlist = soup.find_all('div', class_='cNjlV')
    print(productlist)
    productlinks = []
    for link in productlist:
        for a in link.find_all('a', href=True):
            productlinks.append('https://www.tripadvisor.com' + a['href'])
    print(productlinks)
    restlist.append(productlinks)
print(restlist)
df = pd.DataFrame(restlist)
df.to_csv('links.csv')
Instead of calling append() on your list, extend() it:
restlist.extend(productlinks)
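The difference in miniature: append() adds its argument as a single (nested) element, while extend() adds the argument's items one by one:

links = []
links.append(['a', 'b'])   # -> [['a', 'b']]   one nested list element
links = []
links.extend(['a', 'b'])   # -> ['a', 'b']     items added individually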
Example
import requests
from bs4 import BeautifulSoup
import pandas as pd

header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
}
urls = ['https://www.tripadvisor.com/Attractions-g187427-Activities-oa60-Spain.html']
restlist = []
for link in urls:
    print("Opening link: " + str(link))
    response = requests.get(link, headers=header)
    soup = BeautifulSoup(response.text, 'html.parser')
    restlist.extend(['https://www.tripadvisor.com' + a['href'] for a in soup.select('a:has(h3)')])

df = pd.DataFrame(restlist)
df.to_csv('links.csv', index=False)
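The a:has(h3) selector (supported by the soupsieve backend that BeautifulSoup's select() uses) matches anchors that contain an h3 heading, which on this listing page appears to be where each attraction link lives.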

Python requests returning a different result than the original page (browser)

I am trying to build a simple web scraper to monitor Nike's site here in Brazil.
Basically I want to track products that are in stock right now, to check when new products are added.
My problem is that when I navigate to the site https://www.nike.com.br/snkrs#estoque I see different products compared to what I get with Python's requests.
Here is the code I am using:
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
url = 'https://www.nike.com.br/snkrs#estoque'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
len(soup.find_all(class_='produto produto--comprar'))
This code gives me 40, but in the browser I can see 56 products: https://prnt.sc/26jeo1i
The data comes from a different source, a paginated endpoint, split across 3 pages.
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}

productList = []
for p in [1, 2, 3]:
    url = f'https://www.nike.com.br/Snkrs/Estoque?p={p}&demanda=true'
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    productList += soup.find_all(class_='produto produto--comprar')
Output:
print(len(productList))
56
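If the page count changes over time, one option is to keep requesting pages until one comes back empty instead of hardcoding [1, 2, 3]. A sketch of that variant (it assumes the endpoint returns a page with no product cards past the end, which is untested here):

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}

productList = []
p = 1
while True:
    url = f'https://www.nike.com.br/Snkrs/Estoque?p={p}&demanda=true'
    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
    items = soup.find_all(class_='produto produto--comprar')
    if not items:  # assumption: a page past the end yields no product cards
        break
    productList += items
    p += 1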

How to collect all specified hrefs?

In this test model I can collect the href value for the first ('tr', class_='rowLive'). I've tried to write a loop to collect all the other hrefs, but it always gives IndentationError: expected an indented block, or complains that I'm using find instead of find_all.
How should I proceed to collect every href?
import requests
from bs4 import BeautifulSoup

url = 'http://sports.williamhill.com/bet/pt/betlive/9'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
site = requests.get(url, headers=headers)
soup = BeautifulSoup(site.content, 'html.parser')
jogos = soup.find_all('tr', class_='rowLive')
jogo = jogos[0]
linksgame = jogo.find('a', href=True).attrs['href'].strip()
print(linksgame)
jogos is a list (find_all() always returns one); you can loop over it and call find() on each row:
import requests
from bs4 import BeautifulSoup

url = "http://sports.williamhill.com/bet/pt/betlive/9"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
site = requests.get(url, headers=headers)
soup = BeautifulSoup(site.content, "html.parser")
jogos = soup.find_all("tr", class_="rowLive")
for tag in jogos:
    print(tag.find("a", href=True)["href"])
Or:
print([tag.find("a", href=True)["href"] for tag in jogos])
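If some rows happen to carry no anchor, find() returns None and the ["href"] lookup raises a TypeError; a defensive variant of the same loop:

for tag in jogos:
    a = tag.find("a", href=True)
    if a is not None:  # skip rows without a link
        print(a["href"])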

My code prints None when trying to web scrape

I'm a beginner who just started learning Python a week ago. I was trying to get the title of a specific product on Amazon, but when I run my code it prints "None" instead of the title. Any help?
import requests
from bs4 import BeautifulSoup

url = 'https://www.amazon.com/Sony-ILCE7SM2-mount-Camera-Full-Frame/dp/B0158SRJVQ/ref=sr_1_1?dchild=1&keywords=a7s&qid=1589917834&sr=8-1'
headers = {
    'user_agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id='productTitle')
print(title)
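One likely culprit here, besides Amazon's bot detection: requests sends header names verbatim, so the key user_agent is not a valid User-Agent header, and Amazon answers with a block page that has no productTitle element. A sketch of the fix (the title can still come back None if Amazon serves a captcha page):

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id='productTitle')
if title is not None:  # guard against a block/captcha page
    print(title.get_text(strip=True))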

It returns None when I get the id from the URL using Beautiful Soup; how could I get the content of its id?

import requests
import json
from bs4 import BeautifulSoup
URL = 'https://www.amazon.com/Ozeri-Digital-Multifunction-Kitchen-Elegant/dp/B01LAVADW2?pf_rd_p=3e7c8265-9bb7-5ab2-be71-1af95f06a1ad&pf_rd_r=52Z7DNQGKGV31B114R1K&pd_rd_wg=IAKey&ref_=pd_gw_ri&pd_rd_w=rDONb&pd_rd_r=b6b3cf66-c4a8-449a-8676-9027e8922b96'
headers = {"User-Agent":'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'}
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
You created a headers variable, but you didn't pass it to your request; also, you are not checking the response status code (which is 503).
Fixing your code, it should look something like this:
import requests
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/Ozeri-Digital-Multifunction-Kitchen-Elegant/dp/B01LAVADW2?pf_rd_p=3e7c8265-9bb7-5ab2-be71-1af95f06a1ad&pf_rd_r=52Z7DNQGKGV31B114R1K&pd_rd_wg=IAKey&ref_=pd_gw_ri&pd_rd_w=rDONb&pd_rd_r=b6b3cf66-c4a8-449a-8676-9027e8922b96'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'}
r = requests.get(URL, headers=headers)
if r.status_code == 200:
    soup = BeautifulSoup(r.text, 'html.parser')  # specify the parser explicitly
    title = soup.find(id="productTitle")
    print(title.next)
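Note that title.next prints only the first child node of the span; if you just want the cleaned title text, title.get_text(strip=True) is the more common idiom.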
