Access aria label and reviews of yelp with BeautifulSoup [closed] - python

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 2 years ago.
Improve this question
I am trying to access the reviews and star rating of each reviewer and append the values to the list. However, it doesn't allow me to retun the output. Can anyone tell me what's wrong with my codes?
l=[]
for i in range(0,len(allrev)):
try:
l["stars"]=allrev[i].allrev.find("div",{"class":"lemon--div__373c0__1mboc i-stars__373c0__1T6rz i-stars--regular-4__373c0__2YrSK border-color--default__373c0__3-ifU overflow--hidden__373c0__2y4YK"}).get('aria-label')
except:
l["stars"]= None
try:
l["review"]=allrev[i].find("span",{"class":"lemon--span__373c0__3997G raw__373c0__3rKqk"}).text
except:
l["review"]=None
u.append(l)
l={}
print({"data":u})

To get all the reviews you can try the following:
import requests
from bs4 import BeautifulSoup
URL = "https://www.yelp.com/biz/sushi-yasaka-new-york"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
for star, review in zip(
soup.select(
".margin-b1__373c0__1khoT .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .overflow--hidden__373c0__2y4YK"
),
soup.select(".comment__373c0__3EKjH .raw__373c0__3rcx7"),
):
print(star.get("aria-label"))
print(review.text)
print("-" * 50)
Output:
5 star rating
I've been craving sushi for weeks now and Sushi Yasaka hit the spot for me. Their lunch prices are unbeatable. Their lunch specials seem to extend through weekends which is also amazing.I got the Miyabi lunch as take out and ate in along the benches near the MTA. It came with 4 nigiri, 7 sashimi and you get to pick the other roll (6 pieces). It also came with a side (choose salad or soup, add $1 for both). It was an incredible deal for only $20. I was so full and happy! The fish tasted very fresh with wonderful flavor. I ordered right as they opened and there were at least 10 people waiting outside when I picked up my food so I imagine there is high turnover, keeping the seafood fresh. This will be a regular splurge lunch spot for sure.
--------------------------------------------------
5 star rating
If you're looking for great sushi on Manhattan's upper west side, head over to Sushi Yakasa ! Best sushi lunch specials, especially for sashimi. I ordered the Miyabi - it included a fresh oyster ! The oyster was delicious, served raw on the half shell. The sashimi was delicious too. The portion size was very good for the area, which tends to be a pricey neighborhood. The restaurant is located on a busy street (west 72nd) & it was packed when I dropped by around lunchtimeStill, they handled my order with ease & had it ready quickly. Streamlined service & highly professional. It's a popular sushi place for a reason. Every piece of sashimi was perfect. The salmon avocado roll was delicious too. Very high quality for the price. Highly recommend! Update - I've ordered from Sushi Yasaka a few times since the pandemic & it's just as good as it was before. Fresh, and they always get my order correct. I like their takeout system - you can order over the phone (no app required) & they text you when it's ready. Home delivery is also available & very reliable. One of my favorite restaurants- I'm so glad they're still in business !
--------------------------------------------------
...
...
Edit to only get the first 100 reviews:
import csv
import requests
from bs4 import BeautifulSoup
url = "https://www.yelp.com/biz/sushi-yasaka-new-york?start={}"
offset = 0
review_count = 0
with open("output.csv", "a", encoding="utf-8") as f:
csv_writer = csv.writer(f, delimiter="\t")
csv_writer.writerow(["rating", "review"])
while True:
resp = requests.get(url.format(offset))
soup = BeautifulSoup(resp.content, "html.parser")
for rating, review in zip(
soup.select(
".margin-b1__373c0__1khoT .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .border-color--default__373c0__3-ifU .overflow--hidden__373c0__2y4YK"
),
soup.select(".comment__373c0__3EKjH .raw__373c0__3rcx7"),
):
print(f"review # {review_count}. link: {resp.url}")
csv_writer.writerow([rating.get("aria-label"), review.text])
review_count += 1
if review_count > 100:
raise Exception("Exceeded 100 reviews.")
offset += 20

Related

Having two divs under the same class take the content of the fist and second seperately webscraping with BeautifulSoup

I have such a html page inside the content_list variable
<h3 class="sds-heading--7 title">Problems with battery capacity long-term</h3>
<div class="review-byline review-section">
<div>July 21, 2014</div>
<div>By Cathie from San Diego</div>
<div class="review-type"><strong>Owns this car</strong></div>
</div>
<div class="review-section">
<p class="review-body">We have owned our Leaf since May 2011. We have loved the car but are now getting quite concerned. My husband drives the car, on average, 20-40 miles/day to and from work and running errands, mostly 100% on city roads. We live in San Diego, so no issue with winter weather and we live 7 miles from the ocean so seldom have daytime temperatures above 85. Originally, we would get 65-70 miles per 80-90% charge. Last fall we noticed that there was considerably less remaining charge left after a day of driving. He began to track daily miles, remaining "bars", as well as started charging it 100%. For 9 months we have only been getting 40-45 miles on a full charge with only 1-2 "bars" remaining at the end of the day. Sometimes it will be blinking and "talking" to us to get to a charging place ASAP. We just had it into the dealership. Though on a full charge, the car gauge shows 12 bars, the dealership states that the batteries have lost 2 bars via the computer diagnostics (which we are told is a different reading from the car gauge itself) and, that they say, is average and excepted for the car at this age. Everything else (software, diagnostics, etc.) shows 100%, so the dealership thinks that the car is functioning as it should. They are unable to explain why we can only go 40-45 miles on a charge, but keep saying that the car tests out fine. If the distance one is able to drive on a full charge decreases any further, it will begin to render the car useless. As someone else recommended, in retrospect, the best way to go is to lease the Leaf so that battery life is not an issue.</p>
</div>
First I used this code to get to the collection of reviews
ua = UserAgent()
header = {'User-Agent':str(ua.safari)}
url = 'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page=1'
response = requests.get(url, headers=header)
print(response)
html_soup = BeautifulSoup(response.text, 'lxml')
content_list = html_soup.find_all('div', attrs={'class': 'consumer-review-container'})
Now I would like to take the value of date of the review and the name of the reviewer which in this case would be
<div class="review-byline review-section">
<div>July 21, 2014</div>
<div>By Cathie from San Diego</div>
The problem is I can't separate those two divs
My code:
data = []
for e in content_list:
data.append({
'review_date':e.find_all("div", {"class":"review-byline"})[0].text,
'overall_rating': e.select_one('span.sds-rating__count').text,
'review_title': e.h3.text,
'review_content': e.select_one('p').text,
})
The result of my code
{'overall_rating': '4.7',
'review_content': 'This is the perfect electric car for driving around town, doing errands or even for a short daily commuter. It is very comfy and very quick. The only issue was the first gen battery. The 2011-2014 battery degraded quickly and if the owner did not have Nissan replace it, all those cars are now junk and can only go 20 miles or so on a charge. We had Nissan replace our battery with the 2nd gen battery and it is good as new!',
'review_date': '\nFebruary 24, 2020\nBy EVs are the future from Tucson, AZ\nOwns this car\n',
'review_title': 'Great Electric Car!'}
For the first one you could the <div> directly:
'review_date':e.find("div", {"class":"review-byline"}).div.text,
for the second one use e.g. css selector:
'reviewer_name':e.select_one("div.review-byline div:nth-of-type(2)").text,
Example
url = 'https://www.cars.com/research/nissan-leaf-2011/consumer-reviews/?page=1'
response = requests.get(url, headers=header)
html_soup = BeautifulSoup(response.text, 'lxml')
content_list = html_soup.find_all('div', attrs={'class': 'consumer-review-container'})
data = []
for e in content_list:
data.append({
'review_date':e.find("div", {"class":"review-byline"}).div.text,
'reviewer_name':e.select_one("div.review-byline div:nth-of-type(2)").text,
'overall_rating': e.select_one('span.sds-rating__count').text,
'review_title': e.h3.text,
'review_content': e.select_one('p').text,
})
data

Ranker.com python beautifulsoup scraper not scraping the entire website

So I am working on a beautifulsoup scraper that would scrape 100 names from the ranker.com page list. The code is as follows
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.ranker.com/crowdranked-list/best-anime-series-all-time')
soup = BeautifulSoup(r.text, 'html.parser')
for p in soup.find_all('a', class_='gridItem_name__3zasT gridItem_nameLink__3jE6V'):
print(p.text)
This works and gives the output as
Attack on Titan
My Hero Academia
Naruto: Shippuden
Hunter x Hunter (2011)
One-Punch Man
Fullmetal Alchemist: Brotherhood
One Piece
Naruto
Tokyo Ghoul
Assassination Classroom
The Seven Deadly Sins
Parasyte: The Maxim
Code Geass
Haikyuu!!
Your Lie in April
Noragami
Akame ga Kill!
Dragon Ball
No Game No Life
Fullmetal Alchemist
Dragon Ball Z
Cowboy Bebop
Steins;Gate
Mob Psycho 100
Fairy Tail
I wanted the program to fetch 100 items from the list, but it just gives 25 items. Can someone pls help me with this.
Additional items come from API call with offset and limit params to determine next batch of 25 results to return. You can simply remove both of these and get a max 200 results, or leave in limit and set to 100. You can ignore everything else in the API call apart from the endpoint.
import requests
r = requests.get('https://api.ranker.com/lists/538997/items?limit=100')
data = r.json()['listItems']
ranked_titles = {i['rank']:i['name'] for i in data}
print(ranked_titles)

Selenium scraping dynamic infinite scroll without AJAX

Intent: Scrape company data from the Inc.5000 list (e.g., rank, company name, growth, industry, state, city, description (via hovering over company name)).
Problem: From what I can see, data from the list is dynamically generated in the browser (no AJAX). Additionally, I can't just scroll to the bottom and then scrape the whole page because only a certain number of companies are available at any one time. In other words, companies 1-10 render, but once I scroll to companies 500-510, companies 1-10 are "de-rendered".
Current effort: The following code is where I'm at now.
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get('https://www.inc.com/inc5000/list/2020')
all_companies = []
scroll_max = 600645 #found via Selenium IDE
curr_scroll = 0
next_scroll = curr_scroll+2000
for elem in driver.find_elements_by_class_name('franchise-list__companies'):
while scroll_num <= scroll_max:
scroll_fn = ''.join(("window.scrollTo(", str(curr_scroll), ", ", str(next_scroll), ")"))
driver.execute_script(scroll_fn)
all_companies.append(elem.text.split('\n'))
print('Current length: ', len(all_companies))
curr_scroll += 2000
next_scroll += 2000
Most SO posts related to infinite scroll deal with those that either maintain the data generated as scrolling occurs, or produce AJAX that can be tapped. This problem is an exception to both (but if I missed an applicable SO post, feel free to point me in that direction).
Problem:
Redundant data is scraped (e.g. a single company may be scraped twice)
I still have to split out the data afterwards (final destination is a Pandas datafarame)
Doesn't include the company description (seen by hovering over company name)
It's slow (I realize this is a caveat to Selenium itself, but think the code can be optimized)
The data is loaded from external URL. To print all companies, you can use this example:
import json
import requests
url = 'https://www.inc.com/rest/i5list/2020'
data = requests.get(url).json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
for i, company in enumerate(data['companies'], 1):
print('{:>05d} {}'.format(i, company['company']))
# the hover text is stored in company['ifc_business_model']
Prints:
00001 OneTrust
00002 Create Music Group
00003 Lovell Government Services
00004 Avalon Healthcare Solutions
00005 ZULIE VENTURE INC
00006 Hunt A Killer
00007 Case Energy Partners
00008 Nationwide Mortgage Bankers
00009 Paxon Energy
00010 Inspire11
00011 Nugget
00012 TRYFACTA
00013 CannaSafe
00014 BRUMATE
00015 Resource Innovations
...and so on.

Less memory intensive way to parse large JSON file in Python

Here is my code
import json
data = []
with open("review.json") as f:
for line in f:
data.append(json.loads(line))
lst_string = []
lst_num = []
for i in range(len(data)):
if (data[i]["stars"] == 5.0):
x = data[i]["text"]
for word in x.split():
if word in lst_string:
lst_num[lst_string.index(word)] += 1
else:
lst_string.append(word)
lst_num.append(1)
result = set(zip(lst_string, lst_num))
print(result)
with open("set.txt", "w") as g:
g.write(str(result))
I'm trying to write a set of all words in reviews that were given 5 stars from a pulled in json file formatted like
{"review_id":"Q1sbwvVQXV2734tPgoKj4Q","user_id":"hG7b0MtEbXx5QzbzE6C_VA","business_id":"ujmEBvifdJM6h6RLv4wQIg","stars":1.0,"useful":6,"funny":1,"cool":0,"text":"Total bill for this horrible service? Over $8Gs. These crooks actually had the nerve to charge us $69 for 3 pills. I checked online the pills can be had for 19 cents EACH! Avoid Hospital ERs at all costs.","date":"2013-05-07 04:34:36"}
{"review_id":"GJXCdrto3ASJOqKeVWPi6Q","user_id":"yXQM5uF2jS6es16SJzNHfg","business_id":"NZnhc2sEQy3RmzKTZnqtwQ","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"I *adore* Travis at the Hard Rock's new Kelly Cardenas Salon! I'm always a fan of a great blowout and no stranger to the chains that offer this service; however, Travis has taken the flawless blowout to a whole new level! \n\nTravis's greets you with his perfectly green swoosh in his otherwise perfectly styled black hair and a Vegas-worthy rockstar outfit. Next comes the most relaxing and incredible shampoo -- where you get a full head message that could cure even the very worst migraine in minutes --- and the scented shampoo room. Travis has freakishly strong fingers (in a good way) and use the perfect amount of pressure. That was superb! Then starts the glorious blowout... where not one, not two, but THREE people were involved in doing the best round-brush action my hair has ever seen. The team of stylists clearly gets along extremely well, as it's evident from the way they talk to and help one another that it's really genuine and not some corporate requirement. It was so much fun to be there! \n\nNext Travis started with the flat iron. The way he flipped his wrist to get volume all around without over-doing it and making me look like a Texas pagent girl was admirable. It's also worth noting that he didn't fry my hair -- something that I've had happen before with less skilled stylists. At the end of the blowout & style my hair was perfectly bouncey and looked terrific. The only thing better? That this awesome blowout lasted for days! \n\nTravis, I will see you every single time I'm out in Vegas. You make me feel beauuuutiful!","date":"2017-01-14 21:30:33"}
{"review_id":"2TzJjDVDEuAW6MR5Vuc1ug","user_id":"n6-Gk65cPZL6Uz8qRm3NYw","business_id":"WTqjgwHlXbSFevF32_DJVw","stars":1.0,"useful":3,"funny":0,"cool":0,"text":"I have to say that this office really has it together, they are so organized and friendly! Dr. J. Phillipp is a great dentist, very friendly and professional. The dental assistants that helped in my procedure were amazing, Jewel and Bailey helped me to feel comfortable! I don't have dental insurance, but they have this insurance through their office you can purchase for $80 something a year and this gave me 25% off all of my dental work, plus they helped me get signed up for care credit which I knew nothing about before this visit! I highly recommend this office for the nice synergy the whole office has!","date":"2016-11-09 20:09:03"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"Went in for a lunch. Steak sandwich was delicious, and the Caesar salad had an absolutely delicious dressing, with a perfect amount of dressing, and distributed perfectly across each leaf. I know I'm going on about the salad ... But it was perfect.\n\nDrink prices were pretty good.\n\nThe Server, Dawn, was friendly and accommodating. Very happy with her.\n\nIn summation, a great pub experience. Would go again!","date":"2018-01-09 20:56:38"}
{"review_id":"yi0R0Ugj_xUx_Nek0-_Qig","user_id":"dacAIZ6fTM6mqwW5uxkskg","business_id":"ikCg8xy5JIg_NGPx-MSIDA","stars":5.0,"useful":0,"funny":0,"cool":0,"text":"a b aa bb a b","date":"2018-01-09 20:56:38"}
but it is using all the memory on my computer before it can output into a text file. How can I use a less memory intensive way?
Only get text where stars == 5:
Data:
Based on the question, the data is a file containing rows of dicts.
Get the text into a list:
Given the data from Yelp Challenge, getting the 5 stars text into a list, doesn't take that much memory.
The Windows resource manager showed an increase of about 1.3GB, but the object size of text_list was about 25MB.
import json
text_list = list()
with open("review.json", encoding="utf8") as f:
for line in f:
line = json.loads(line)
if line['stars'] == 5:
text_list.append(line['text'])
print(text_list)
>>> ['Test text, example 1!', 'Test text, example 2!']
Extra:
Everything after loading the data, seems to require a lot of memory that isn't being released.
When cleaning the text, Windows resource manager went up by 16GB, though the final size of clean_text was also only about 25MB.
Interestingly, deleting clean_text does not release the 16GB of memory.
In Jupyter Lab, restarting the Kernel will release the memory
In PyCharm, stopping the process also releases the memory
I tried manually running the garbage collector, but that didn't release the memory
Clean text_list:
import string
def clean_string(value: str) -> list:
value = value.lower()
value = value.translate(str.maketrans('', '', string.punctuation))
value = value.split()
return value
clean_text = [clean_string(item) for item in text_list]
print(clean_text)
>>> [['test', 'text', 'example', '1'], ['test', 'text', 'example', '2']]
Count words in clean_text:
from collection import Counter
words = Counter()
for item in clean_text:
words.update(item)
print(words)
>>> Counter({'test': 2, 'text': 2, 'example': 2, '1': 1, '2': 1})

Improve speed/performance of web-scraping with lots of exceptions

I've written some web-scraping code that is currently working, however quite slow. Some background: I am using Selenium as it requires several stages of clicks and entry, along with BeautifulSoup. My code is looking at a list of materials within subcategories on a website(image below) and scraping them. If the material scraped from the website is one of the 30 I am interested in (lst below), then it writes the number 1 to a dataframe which I later convert to an Excel sheet.
The reason it is so slow, I believe anyway, is due to the fact that there are a lot of exceptions. However, I am not sure how to handle these besides try/except. The main bits of code can be seen below, as the entire piece of code is quite lengthy. I have also attached an image of the website in question for reference.
lst = ["Household cleaner and detergent bottles", "Plastic milk bottles", "Toiletries and shampoo bottles", "Plastic drinks bottles",
"Drinks cans", "Food tins", "Metal lids from glass jars", "Aerosols",
"Food pots and tubs", "Margarine tubs", "Plastic trays","Yoghurt pots", "Carrier bags",
"Aluminium foil", "Foil trays",
"Cardboard sleeves", "Cardboard egg boxes", "Cardboard fruit and veg punnets", "Cereal boxes", "Corrugated cardboard", "Toilet roll tubes", "Food and drink cartons",
"Newspapers", "Window envelopes", "Magazines", "Junk mail", "Brown envelopes", "Shredded paper", "Yellow Pages" , "Telephone directories",
"Glass bottles and jars"]
def site_scraper(site):
page_loc = ('//*[#id="wrap-rlw"]/div/div[2]/div/div/div/div[2]/div/ol/li[{}]/div').format(site)
page = driver.find_element_by_xpath(page_loc)
page.click()
driver.execute_script("arguments[0].scrollIntoView(true);", page)
soup=BeautifulSoup(driver.page_source, 'lxml')
for i in x:
for j in y:
try:
material = soup.find_all("div", class_ = "rlw-accordion-content")[i].find_all('li')[j].get_text(strip=True).encode('utf-8')
if material in lst:
df.at[code_no, material] = 1
else:
continue
continue
except IndexError:
continue
x = xrange(0,8)
y = xrange(0,9)
p = xrange(1,31)
for site in p:
site_scraper(site)
Specifically, the i's and j's rarely go to 6,7 or 8 but when they do, it is important that I capture that information too. For context, the i's correspond to the number of different categories in the image below (Automative, Building materials etc.) whilst the j's represent the sub-list (car batteries and engine oil etc.). Because these two loops are repeated for all 30 sites for each code, and I have 1500 codes, this is extremely slow. Currently it is taking 6.5 minutes for 10 codes.
Is there a way I could improve this process? I tried list comprehension however it was difficult to handle errors like this and my results were no longer accurate. Could an "if" function be a better choice for this and if so, how would I incorporate it? I also would be happy to attach the full code if required. Thank you!
EDIT:
by changing
except IndexError:
continue
to
except IndexError:
break
it is now running almost twice as fast! Obviously it is best to exit to loop after it fails once, as the later iterations will also fail. However any other pythonic tips are still welcome :)
It sounds like you just need the text of those lis:
lis = driver.execute_script("[...document.querySelectorAll('.rlw-accordion-content li')].map(li => li.innerText.trim())")
Now you can use those for your logic:
for material in lis:
if material in lst:
df.at[code_no, material] = 1

Categories

Resources