Class ID to scrape
I wanted to scrape data from Facebook Marketplace using Python with the script below, but no data shows up when I run it. The class ID is shown in the picture above.
elements = driver.find_elements_by_class_name('l9j0dhe7 f9o22wc5 ad2k81qe')
for ele in elements:
    print(ele.text)
    print(ele.get_attribute('title'))
Okay, there are a few things you should take a look at: find_element_by_class_name accepts only a single class name, so you are better off using find_element_by_css_selector.
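For reference, this is roughly how the CSS-selector approach would look with the class names from your own snippet; a minimal sketch only, since these auto-generated Facebook class names change frequently and are best treated as placeholders:
# hedged sketch: joining several class names into one CSS selector
# (class values copied from the question and likely already stale)
elements = driver.find_elements_by_css_selector('.l9j0dhe7.f9o22wc5.ad2k81qe')
for ele in elements:
    print(ele.text)
    print(ele.get_attribute('title'))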
A solution for getting the information you are after:
Get all results:
elements = driver.find_elements_by_class_name('kbiprv82')
Loop results and print:
for ele in elements:
    title = ele.find_element_by_css_selector('span.a8c37x1j.ni8dbmo4.stjgntxs.l9j0dhe7').text
    price = ele.find_element_by_css_selector('div.hlyrhctz > span').text
    print(title, price)
Output
IMac 21,5 Ende 2012 380 €
Imac 27 Mitte 2011 (Wie Neu Zustand ) 550 €
iMac 14,2 27" 550 €
Hope that helps, let us know.
Related
I want to download some movie reviews from IMDb so that I can use those reviews for my LDA model (for school).
But default website for reviews contains only 25 reviews (e. g. https://www.imdb.com/title/tt0111161/reviews/?ref_=tt_ql_urv)
If I want more I need to press "Load more" button at the bottom of website, which gives me 25 more reviews.
I don't know how to automate this in Python; I can't just go to https://www.imdb.com/title/tt0111161/reviews/?ref_=tt_ql_urv/2 or add a ?page=2 parameter.
How to automate traversing pages at imdb review site using python?
Also, is this deliberately made so difficult?
When I click Load More, then in DevTools in Chrome/Firefox (tab: Network, filter: XHR) I see a link like
https://www.imdb.com/title/tt0111161/reviews/_ajax?ref_=undefined&paginationKey=g4xolermtiqhejcxxxgs753i36t52q343mpt34pjada6qpye4w6qtalmfyy7wfxcwfzuwsyh
and it has paginationKey=g4x..., and I see something similar in the HTML: <div ... data-key="g4x..." - so using this data-key I can build the link to get the next page.
Example code.
First I get the HTML from the normal URL and extract the review titles. Next I read the data-key and build the URL to fetch the next batch of reviews. I repeat this in a for loop to get 3 pages, but you could use a while True loop and keep repeating as long as there is still a data-key (see the sketch after the results).
import requests
from bs4 import BeautifulSoup
s = requests.Session()
#s.headers['User-Agent'] = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:93.0) Gecko/20100101 Firefox/93.0'
# get first/full page
url = 'https://www.imdb.com/title/tt0111161/reviews/?ref_=tt_ql_urv'
r = s.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
items = soup.find_all('a', {'class': 'title'})
for number, title in enumerate(items, 1):
    print(number, '>', title.text.strip())

# get next page(s)
for _ in range(3):
    div = soup.find('div', {'data-key': True})
    print('---', div['data-key'], '---')

    url = 'https://www.imdb.com/title/tt0111161/reviews/_ajax'
    payload = {
        'ref_': 'tt_ql_urv',
        'paginationKey': div['data-key']
    }
    #headers = {'X-Requested-With': 'XMLHttpRequest'}
    r = s.get(url, params=payload) #, headers=headers)

    soup = BeautifulSoup(r.text, 'html.parser')
    items = soup.find_all('a', {'class': 'title'})
    for number, title in enumerate(items, 1):
        print(number, '>', title.text.strip())
Result:
1 > Enthralling, fantastic, intriguing, truly remarkable!
2 > "I Had To Go To Prison To Learn To Be A Crook"
3 > Masterpiece
4 > All-time prison film classic
5 > Freeman gives it depth
6 > impressive
7 > Simply a great story that is moving and uplifting
8 > An incredible movie. One that lives with you.
9 > "I'm a convicted murderer who provides sound financial planning".
10 > IMDb and the Greatest Film of All Time
11 > never give up hope
12 > The Shawshank Redemption
13 > Brutal Anti-Bible Bigotry Prevails Again
14 > Time and Pressure.
15 > A classic
16 > An extraordinary and unforgettable film about a bank veep who is convicted of murders and sentenced to the toughest prison
17 > A genre picture, but a satisfying one...
18 > Why it is ranked so highly.
19 > Exceptional
20 > Shawshank Redemption- Prison Film is Redeemed by Quality ****
21 > A Classic Film On Hope And Redemption
22 > Compelling masterpiece
23 > Relentless Storytelling
24 > Some birds aren't meant to be caged.
25 > Good , But It Is Overrated By Some
--- g4xolermtiqhejcxxxgs753i36t52q343mpt34pjada6qpye4w6qtalmfyy7wfxcwfzuwsyh ---
1 > Stephen King's prison tale with a happy ending...
2 > Breaking Big Rocks Into Little Rocks
3 > Over Rated
4 > Terrific stuff!
5 > Highly Overrated But Still Good
6 > Superb
7 > Beautiful movie
8 > Tedious, overlong, with "hope" being the second word spoken in just about every sentence... who cares?
9 > Excellent Stephen King adaptation; flawless Robbins & Freeman
10 > Good for the spirit
11 > Entertaining Prison Movie Isn't Nearly as Good as Its Rabid Fan Base Would Lead You to Believe
12 > Observations...
13 > Why can't they make films like this anymore?
14 > Shawshank Redemption Comes Out Clean
15 > Hope Springs Eternal:Rita Hayworth And The Shawshank Redemption.
16 > Redeeming.
17 > You don't understand! I'm not supposed to be here!
18 > A Story Of Hope & Resilence
19 > Salvation lies within....
20 > Pretty good movie...one of those that you do not really need to watch from beginning to end.
21 > A film of Eloquence
22 > A great film of a helping hand leading to end-around justice
23 > about freedom
24 > Reputation notwithstanding, this is powerful stuff
25 > The best film ever made!
--- g4uorermtiqhejcxxxgs753i36t52q343eod34plapeoqp27z6b2lhdevccn5wyrz2vmgufh ---
1 > A Sort of Secular Redemption
2 > In virus times, we need this hope.
3 > The placement isn't an exaggeration
4 > A true story of friendship and hard times
5 > Escape from Shawshank
6 > Great Story Telling
7 > moving and emotionally impactful(if you liked "The Green Mile" you will like this movie)
8 > Super Good - Highly Recommended
9 > I can see why this is rated Number 1 on IMDb.
# ...
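For completeness, here is a hedged sketch of the while True variant mentioned above. It is meant to replace the for _ in range(3) loop, reuses the session s and the soup from the first page, and keeps fetching pages until the response no longer contains a data-key:
# hedged sketch: page until no data-key is left in the response
while True:
    div = soup.find('div', {'data-key': True})
    if not div:
        break  # no more pages
    payload = {'ref_': 'tt_ql_urv', 'paginationKey': div['data-key']}
    r = s.get('https://www.imdb.com/title/tt0111161/reviews/_ajax', params=payload)
    soup = BeautifulSoup(r.text, 'html.parser')
    items = soup.find_all('a', {'class': 'title'})
    for number, title in enumerate(items, 1):
        print(number, '>', title.text.strip())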
I am trying to access the reviews and star rating of each reviewer and append the values to a list. However, it doesn't return the output I expect. Can anyone tell me what's wrong with my code?
l = []
for i in range(0, len(allrev)):
    try:
        l["stars"] = allrev[i].allrev.find("div", {"class": "lemon--div__373c0__1mboc i-stars__373c0__1T6rz i-stars--regular-4__373c0__2YrSK border-color--default__373c0__3-ifU overflow--hidden__373c0__2y4YK"}).get('aria-label')
    except:
        l["stars"] = None
    try:
        l["review"] = allrev[i].find("span", {"class": "lemon--span__373c0__3997G raw__373c0__3rKqk"}).text
    except:
        l["review"] = None
    u.append(l)
    l = {}
print({"data": u})
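As an aside, the two immediate bugs in the snippet above are that l starts out as a list but is indexed like a dict, and allrev[i].allrev.find repeats the attribute access. A minimal, hedged fix that keeps the same approach (assuming allrev was collected earlier, as in the original code) might look like:
# hedged sketch: start each l as a dict and drop the duplicated .allrev access
u = []
for rev in allrev:
    l = {}
    try:
        l["stars"] = rev.find("div", {"class": "lemon--div__373c0__1mboc i-stars__373c0__1T6rz i-stars--regular-4__373c0__2YrSK border-color--default__373c0__3-ifU overflow--hidden__373c0__2y4YK"}).get("aria-label")
    except AttributeError:
        l["stars"] = None
    try:
        l["review"] = rev.find("span", {"class": "lemon--span__373c0__3997G raw__373c0__3rKqk"}).text
    except AttributeError:
        l["review"] = None
    u.append(l)
print({"data": u})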
To get all the reviews you can try the following:
import requests
from bs4 import BeautifulSoup
URL = "https://www.yelp.com/biz/sushi-yasaka-new-york"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for star, review in zip(
    soup.select(
        ".margin-b1__373c0__1khoT .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .overflow--hidden__373c0__2y4YK"
    ),
    soup.select(".comment__373c0__3EKjH .raw__373c0__3rcx7"),
):
    print(star.get("aria-label"))
    print(review.text)
    print("-" * 50)
Output:
5 star rating
I've been craving sushi for weeks now and Sushi Yasaka hit the spot for me. Their lunch prices are unbeatable. Their lunch specials seem to extend through weekends which is also amazing.I got the Miyabi lunch as take out and ate in along the benches near the MTA. It came with 4 nigiri, 7 sashimi and you get to pick the other roll (6 pieces). It also came with a side (choose salad or soup, add $1 for both). It was an incredible deal for only $20. I was so full and happy! The fish tasted very fresh with wonderful flavor. I ordered right as they opened and there were at least 10 people waiting outside when I picked up my food so I imagine there is high turnover, keeping the seafood fresh. This will be a regular splurge lunch spot for sure.
--------------------------------------------------
5 star rating
If you're looking for great sushi on Manhattan's upper west side, head over to Sushi Yakasa ! Best sushi lunch specials, especially for sashimi. I ordered the Miyabi - it included a fresh oyster ! The oyster was delicious, served raw on the half shell. The sashimi was delicious too. The portion size was very good for the area, which tends to be a pricey neighborhood. The restaurant is located on a busy street (west 72nd) & it was packed when I dropped by around lunchtimeStill, they handled my order with ease & had it ready quickly. Streamlined service & highly professional. It's a popular sushi place for a reason. Every piece of sashimi was perfect. The salmon avocado roll was delicious too. Very high quality for the price. Highly recommend! Update - I've ordered from Sushi Yasaka a few times since the pandemic & it's just as good as it was before. Fresh, and they always get my order correct. I like their takeout system - you can order over the phone (no app required) & they text you when it's ready. Home delivery is also available & very reliable. One of my favorite restaurants- I'm so glad they're still in business !
--------------------------------------------------
...
...
Edit to only get the first 100 reviews:
import csv
import requests
from bs4 import BeautifulSoup
url = "https://www.yelp.com/biz/sushi-yasaka-new-york?start={}"
offset = 0
review_count = 0
with open("output.csv", "a", encoding="utf-8") as f:
    csv_writer = csv.writer(f, delimiter="\t")
    csv_writer.writerow(["rating", "review"])
    while True:
        resp = requests.get(url.format(offset))
        soup = BeautifulSoup(resp.content, "html.parser")
        for rating, review in zip(
            soup.select(
                ".margin-b1__373c0__1khoT .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .overflow--hidden__373c0__2y4YK"
            ),
            soup.select(".comment__373c0__3EKjH .raw__373c0__3rcx7"),
        ):
            print(f"review # {review_count}. link: {resp.url}")
            csv_writer.writerow([rating.get("aria-label"), review.text])
            review_count += 1
            if review_count > 100:
                raise Exception("Exceeded 100 reviews.")
        offset += 20
Intent: Scrape company data from the Inc.5000 list (e.g., rank, company name, growth, industry, state, city, description (via hovering over company name)).
Problem: From what I can see, data from the list is dynamically generated in the browser (no AJAX). Additionally, I can't just scroll to the bottom and then scrape the whole page because only a certain number of companies are available at any one time. In other words, companies 1-10 render, but once I scroll to companies 500-510, companies 1-10 are "de-rendered".
Current effort: The following code is where I'm at now.
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get('https://www.inc.com/inc5000/list/2020')
all_companies = []
scroll_max = 600645 #found via Selenium IDE
curr_scroll = 0
next_scroll = curr_scroll+2000
for elem in driver.find_elements_by_class_name('franchise-list__companies'):
    while curr_scroll <= scroll_max:
        scroll_fn = ''.join(("window.scrollTo(", str(curr_scroll), ", ", str(next_scroll), ")"))
        driver.execute_script(scroll_fn)
        all_companies.append(elem.text.split('\n'))
        print('Current length: ', len(all_companies))
        curr_scroll += 2000
        next_scroll += 2000
Most SO posts related to infinite scroll deal with pages that either keep previously rendered data in the DOM as scrolling occurs, or fire AJAX requests that can be tapped directly. This problem is an exception to both (but if I missed an applicable SO post, feel free to point me in that direction).
Problem:
Redundant data is scraped (e.g. a single company may be scraped twice)
I still have to split out the data afterwards (the final destination is a Pandas dataframe)
Doesn't include the company description (seen by hovering over company name)
It's slow (I realize this is a caveat of Selenium itself, but I think the code can be optimized)
The data is loaded from an external URL. To print all companies, you can use this example:
import json
import requests
url = 'https://www.inc.com/rest/i5list/2020'
data = requests.get(url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for i, company in enumerate(data['companies'], 1):
    print('{:>05d} {}'.format(i, company['company']))
    # the hover text is stored in company['ifc_business_model']
Prints:
00001 OneTrust
00002 Create Music Group
00003 Lovell Government Services
00004 Avalon Healthcare Solutions
00005 ZULIE VENTURE INC
00006 Hunt A Killer
00007 Case Energy Partners
00008 Nationwide Mortgage Bankers
00009 Paxon Energy
00010 Inspire11
00011 Nugget
00012 TRYFACTA
00013 CannaSafe
00014 BRUMATE
00015 Resource Innovations
...and so on.
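Since the question mentions a Pandas dataframe as the final destination, here is a hedged sketch of loading the same JSON straight into one. Only the 'company' and 'ifc_business_model' keys are confirmed above; any other column names are assumptions, so inspect df.columns against the real response:
# hedged sketch: the endpoint returns a list of company dicts,
# which maps directly onto a DataFrame
import pandas as pd
import requests

data = requests.get('https://www.inc.com/rest/i5list/2020').json()
df = pd.DataFrame(data['companies'])
print(df.columns.tolist())   # check which fields are actually present
print(df['company'].head())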
I started coding in Python 3 recently. I would like to extract some data related to movies; here is the list link.
I already scraped the data related to the number of votes:
first_votes = first_movie.find('span', attrs = {'name':'nv'})
first_votes
first_votes['data-value']
Which gives me exactly the number of times the movie has been rated by users.
But when I try to scrape the gross amount, I don't really know how to make the code target the gross, since both the gross and the number of votes have the same construction:
This is what the DevTool shows
Does anyone know how to solve this? Sorry if I didn't provide enough information, but I am new. If you need anything else, I would be happy to provide it.
You could use the findAll method to get a list of all of the elements that match your criteria and then you can select the second element in the list, e.g.:
first_votes = first_movie.findAll('span', attrs = {'name':'nv'})[1]
You can try this for votes and gross
votes = first_movie.find_all('span', attrs = {'name':'nv'})[0]['data-value']
gross = first_movie.find_all('span', attrs = {'name':'nv'})[1]['data-value']
Or in a single line
votes, gross = [item['data-value'] for item in first_movie.find_all('span', attrs = {'name':'nv'})]
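One caveat to the single-line version: if a title happens to have only one span with name="nv" (for example, votes but no gross figure), the unpacking raises a ValueError. A hedged, more defensive sketch:
# hedged sketch: tolerate titles where the gross value is missing
nv_values = [item['data-value'] for item in first_movie.find_all('span', attrs={'name': 'nv'})]
votes = nv_values[0] if len(nv_values) > 0 else None
gross = nv_values[1] if len(nv_values) > 1 else None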
I'm building a web scraper for this page, http://espn.go.com/nba/teams, in order to fill a database with all the team names and their corresponding divisions, using the Scrapy Python library. I am attempting to write my parse function, but I still don't exactly understand how to extract the corresponding division name to match each team.
[1] https://www.dropbox.com/s/jv1n49rg4p6p2yh/2014-12-29%2014.08.07-2.jpg?dl=0
def parse(self, response):
    items = []
    mex = "//div[@class='span-6']/div[@class='span-4']/div/div/div/div[2]/ul/li"
    i = 0
    for sel in response.xpath(mex):
        item = TeamStats()
        item['team'] = sel.xpath(mex + "/div/h5/a/text()")[i]
        item['division'] = sel.xpath("//div[@class='span-6']/div[@class='span-4']/div/div/div/div[1]/h4")
        items.append(item)
        i = i + 1
    return items
My parse function is able to return a list of teams and a corresponding divisions list, but that list contains ALL divisions. Now I'm not really sure how to pick out the exact division; it seems to me that I must navigate from the selected team name (represented by item['team'] = sel.xpath(mex + "/div/h5/a/text()")[i]) up the DOM using the parent/preceding relation (I was going to include the website I've been following as a tutorial, but I don't have 10 reputation points) to get the RIGHT division, but I'm not sure how to write that...
If I'm on the wrong track with this, let me know, as I'm no expert with XPath. I'm actually not even sure whether I need a counter, because if I remove the [i] then I just get 30 lists, each containing all 30 teams.
Let's make it simpler.
Each division is represented by a div with the mod-teams-list-medium class. Each division div consists of 2 parts:
div with class="mod-header" containing the division name
div with class="mod-content" containing the list of teams
Inside your spider it would be reflected this way:
for division in response.xpath('//div[@id="content"]//div[contains(@class, "mod-teams-list-medium")]'):
    division_name = division.xpath('.//div[contains(@class, "mod-header")]/h4/text()').extract()[0]
    print division_name
    print
    for team in division.xpath('.//div[contains(@class, "mod-content")]//li'):
        team_name = team.xpath('.//h5/a/text()').extract()[0]
        print team_name
    print "------"
And here is what I'm getting on the console:
Atlantic
Boston Celtics
Brooklyn Nets
New York Knicks
Philadelphia 76ers
Toronto Raptors
------
Pacific
Golden State Warriors
Los Angeles Clippers
Los Angeles Lakers
Phoenix Suns
Sacramento Kings
------
...
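To turn that console output back into the items from the question, the same two loops can populate and yield TeamStats objects inside the parse method; a hedged sketch using the 'team' and 'division' fields defined in the question:
# hedged sketch: yield one TeamStats item per team, carrying its division name
def parse(self, response):
    for division in response.xpath('//div[@id="content"]//div[contains(@class, "mod-teams-list-medium")]'):
        division_name = division.xpath('.//div[contains(@class, "mod-header")]/h4/text()').extract()[0]
        for team in division.xpath('.//div[contains(@class, "mod-content")]//li'):
            item = TeamStats()
            item['team'] = team.xpath('.//h5/a/text()').extract()[0]
            item['division'] = division_name
            yield item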