google-play-scraper doesn't scrape all reviews - python

I tried to scrape the reviews for an app using the google-play-scraper package. I followed the code from their readme on GitHub: https://github.com/JoMingyu/google-play-scraper. But it doesn't seem to scrape all reviews!
I'm not sure whether it's possible to scrape all reviews, as the lang and country arguments default to 'en' and 'us'. But I was trying to scrape several apps that are used exclusively in Germany, using 'de' for both country and language. I know there will be some people with a foreign Play Store account who reviewed the app, but for an app that only exists in Germany, this share shouldn't be too high. Yet for many apps I've tried, the difference between the number of reviews stated on Google Play's website and the number of reviews that are scraped is just implausible.
Here's my code:
from google_play_scraper import app
import pandas as pd
import numpy as np
from google_play_scraper import Sort, reviews_all

app_reviews = reviews_all(
    'de.flaschenpost.app',
    sleep_milliseconds=0,
    lang='de',
    country='de',
    sort=Sort.NEWEST
)
For this app, Google Play has 69,600 reviews, but only 7,608 are scraped.
Other examples: de.hafas.android.db (184,247 reviews, 51,552 scraped), de.materna.bbk.mobile.app (24,401 reviews, 12,896 scraped).
Am I missing something? Thanks a lot!
EDIT: Thanks to Joaquin for pointing me in the right direction. Every scraped review contains a written comment, i.e. the larger number shown on Google Play presumably counts all ratings, including those from users who only left 1-5 stars and didn't write anything!
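One quick way to sanity-check this is to compare the total rating count with the written-review count returned by the app() function, and then with what reviews_all() scrapes. This is a minimal sketch, assuming the app() result dict exposes 'ratings' and 'reviews' keys as described in the package readme:

from google_play_scraper import app, reviews_all, Sort

# Fetch the app's metadata; the keys below are assumed from the package readme.
info = app('de.flaschenpost.app', lang='de', country='de')
print('total ratings  :', info.get('ratings'))   # includes star-only ratings
print('written reviews:', info.get('reviews'))   # only ratings with a text comment

# reviews_all() should roughly match the written-review count, not the rating count.
scraped = reviews_all('de.flaschenpost.app', lang='de', country='de', sort=Sort.NEWEST)
print('scraped reviews:', len(scraped))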

Related

Python googlesearch API - change country location and get Ads results

I'm trying to use the googlesearch API in Python to get the top 10 results for several queries, and I'm encountering two issues:
1. Changing the country using the 'country' param (e.g. country='us', etc.) doesn't seem to have any effect on the results at all. I tried this with several countries.
2. I want to include the Ads results and can't find any way to do so.
If anyone knows how to do this with googlesearch or any other free API that would be great.
Thanks!
# coding: utf-8
from googlesearch import search
from urlparse import urlparse
import csv
import datetime

keywords = [
    "best website builder"
]
countries = [
    "us",
    "il"
]
filename = 'google_results.csv'

with open(filename, 'w') as f:
    writer = csv.writer(f, delimiter=',')
    for country in countries:
        for keyword in keywords:
            print "Showing results for: '" + keyword + "'"
            writer.writerow([])
            writer.writerow([keyword])
            for url in search(keyword, lang='en', stop=10, country=country):
                print(urlparse(url).netloc)
                print(url)
                writer.writerow([urlparse(url).netloc, url])
Answer 1. Your country format is incorrect.
What the module does is build the request URL, using the following format:
url_search = "https://www.google.%(tld)s/search?hl=%(lang)s&q=%(query)s&btnG=Google+Search&tbs=%(tbs)s&safe=%(safe)s&cr=%(country)s"
When you give it a country, simply passing in us or il is not enough. The country parameter needs to be in the format countryXX, where XX is the two-letter abbreviation; for example, France is FR, so country becomes countryFR.
And even the source code says that this parameter is not always reliable:
:param str country: Country or region to focus the search on. Similar to
changing the TLD, but does not yield exactly the same results.
Only Google knows why...
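With the format above, the original call would look something like this (a minimal sketch, assuming a googlesearch version that still accepts the country keyword):

from googlesearch import search

# Same query as in the question, but with the country value in the
# "country" + two-letter-code format the URL template expects.
for url in search("best website builder", lang='en', stop=10, country='countryUS'):
    print(url)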
Answer 2: Ads are dynamically loaded using JavaScript. This library, on the other hand, only does static parsing; it does not execute any JavaScript. You will need to run Selenium or pyppeteer to have a browser execute the JavaScript and render the ads.
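A minimal Selenium sketch of that idea (the ad-container selector is only an assumption; Google changes its ad markup frequently, so inspect the page and adjust):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Render the results page in a real browser so the ad slots are filled in by JavaScript.
driver = webdriver.Chrome()
try:
    driver.get("https://www.google.com/search?q=best+website+builder&hl=en")
    ads = driver.find_elements(By.CSS_SELECTOR, "#tads a")  # assumed ad-container id
    for ad in ads:
        print(ad.get_attribute("href"))
finally:
    driver.quit()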
Unfortunately, the country-targeting parameter is just a signal to Google, not a setting change. Google will not actually show you the results as they appear to an anonymous user in that country, so it's basically useless.
The APIs mentioned above will not fix this either, as they only use US-based IP addresses. (#Link can you confirm? I'd pay for your API if it weren't only on US servers.)
So you're actually going to need to run this code from a server with an IP address in the country you're targeting, with the browser's language settings set to that country as well.
You won't be able to render the ads either, as they're rendered separately, slightly after the fact. There is a huge industry trying to get this right, and anyone who has nailed it charges pretty high fees. The best place to start would be an IP address in that country and Selenium; Requests won't cut it, and certainly not if you want ads.
Finally, Google is super aggressive with automated-search detection, as every automated search that shows an ad skews their advertiser numbers and actually costs advertisers money even if you don't click on the ads (due to a mechanism called Quality Score).
If your volume is low, a Selenium-based script on a private IP (as in, not an AWS or Azure data-center IP) in that country is your best bet.
And if you figure out how to do this at scale, you'll have people falling over themselves to get the solution.

get filmography for a chosen company with IMDbPY

From the documentation I see that companies have only 'main' and don't have 'filmography', unlike persons, but is there a way to fetch the movies for a chosen company? Maybe it's possible to see the list of 'Films in Production' and 'Past Film & Video'?
I tried to populate the company's data with related movies somehow, but 'main' stays an empty list.
I don't want to go through all the movies in the database to check whether the company is present there, as that seems very inefficient. I use 'http' access, as I don't need a local copy of the database.
from imdb import IMDb

ia = IMDb()  # default 'http' access, no local database copy needed

my_company = ia.search_company('Walt Disney Pictures [US]')[0]
company_id = my_company.companyID
ia.get_company(company_id)  # the only info I can get!
ia.update(my_company)
ia.get_company_infoset()
my_company.infoset2keys
Unfortunately, the IMDb website changed the information shown on its company pages, and right now IMDbPY is no longer able to collect any information besides the company's name and country.
I have opened an issue to describe the problem: https://github.com/alberanid/imdbpy/issues/198

How to get list of categories in Google Books API?

I was searching for an already answered question about this but couldn't find one so please forgive me if I somehow missed it.
I'm using the Google Books API and I know I can search for books by a specific category.
My question is, how can I get all the available categories from the API?
I looked in the API documentation but couldn't find any mention of this.
The Google Books API does not have an endpoint for returning categories that are not associated with a specific book.
The Google Books API is only there to work with books. You can:
search and browse through the list of books that match a given query.
view information about a book, including metadata, availability and price, and links to the preview page.
manage your own bookshelves.
You can see the categories of an individual book, but you cannot get a list of all available categories in the whole system.
You may be interested to know this has been on their to-do list since 2012 (category list):
We have numerous requests for this and we're investigating how we can properly provide the data. One issue is Google does not own all the category information. "New York Times Bestsellers" is one obvious example. We need to first identify what we can publish through the API.
Workaround
I worked around it by implementing my own category-list mechanism, so I can pull all the categories that exist in my app's database.
(Unfortunately, the newly announced ScriptDb deprecation means my whole system will go to waste in a couple of months anyway... but that's another story.)
https://support.google.com/books/partner/answer/3237055?hl=en
Scroll down to subject/genres and you will see this link.
https://bisg.org/page/bisacedition
This list is apparently a list of subjects, aka categories, for North American books. I have been making various GET requests with an API testing tool and getting, for the most part, perfect matches between whatever subject I choose from the BISG subjects list and what comes back in the JSON response under the "categories" key (you may have to drop a word from the query string, e.g. "criticism" instead of "literary criticism").
Ex: GET https://www.googleapis.com/books/v1/volumes?q=business+subject:juvenile+fiction
Long story short, the BISG link is where I'm pretty sure Google got all the options for their "categories" key from.
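The same check can be scripted; here's a minimal sketch using requests against the public volumes endpoint (no API key needed for low-volume anonymous queries, though Google may throttle them):

import requests

# Query the volumes endpoint with a subject: filter and read back each result's
# "categories" key to compare it against the BISG subject list.
params = {"q": "business subject:juvenile fiction"}
resp = requests.get("https://www.googleapis.com/books/v1/volumes", params=params)
resp.raise_for_status()

for item in resp.json().get("items", []):
    info = item["volumeInfo"]
    print(info.get("title"), "->", info.get("categories"))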

data extraction from web

I am planning to do data extraction from web sources (web scraping) as part of my work. I would like to extract info within a 10 km radius of my company.
I would like to extract information such as condominiums, their addresses, the number of units and the price per sqft, as well as the number of schools, kindergartens and hotels in the area.
I understand I will need to extract from a few sources/webpages. I will also be using Python.
I would like to know which library or libraries I should be using. Is web scraping the only means? Can we extract info from Google Maps?
Also, if anyone has any experience I will really appreciate if you can guide me on this.
Thanks a lot, guys.
For Google Maps, try the API. Using web scraping tools for Maps data extraction is highly discouraged by Google TOS.
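A minimal sketch of that approach, assuming the official googlemaps Python client and a Places API key (the coordinates and place type are placeholders):

import googlemaps

gmaps = googlemaps.Client(key="YOUR_API_KEY")  # placeholder key
results = gmaps.places_nearby(
    location=(52.5200, 13.4050),  # placeholder coordinates: use your office location
    radius=10000,                 # metres, i.e. the 10 km radius from the question
    type="school",
)
for place in results.get("results", []):
    print(place["name"], "-", place.get("vicinity"))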
If you are using Python, it has very nice libraries, BeautifulSoup and Scrapy, for this purpose.
Other means? You can extract POIs from OSM data; try the open-source tools. Property info? Maybe it's available for your county/state from a government office; give it a try.
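For the OSM route, here is a minimal sketch using the public Overpass API (the endpoint, coordinates and amenity tags are assumptions to adjust):

import requests

# Ask Overpass for schools and kindergartens within 10 km of a point.
query = """
[out:json];
node["amenity"~"school|kindergarten"](around:10000,52.5200,13.4050);
out;
"""
resp = requests.post("https://overpass-api.de/api/interpreter", data={"data": query})
resp.raise_for_status()

for element in resp.json()["elements"]:
    tags = element.get("tags", {})
    print(tags.get("name", "unnamed"), "-", tags.get("amenity"))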

Getting a list of all churches in a certain state using Python

I am pretty good with Python, so pseudo-code will suffice when details are trivial. Please get me started on the task: how do I go about crawling the net for the snail-mail addresses of churches in my state? Once I have a one-liner such as "123 Old West Road #3 Old Lyme City MD 01234", I can probably parse it into city, state, street, number, and apt with enough trial and error. My problem is: if I use white pages online, how do I deal with all the HTML junk, HTML tables, ads, etc.? I do not think I need their phone number, but it will not hurt; I can always throw it out once parsed. Even if your solution is half-manual (such as save to PDF, then open Acrobat, save as text), I might be happy with it still. Thanks! Heck, I will even accept Perl snippets; I can translate them myself.
You could use mechanize. It's a Python library that simulates a browser, so you could crawl through the white pages (similarly to what you would do manually).
To deal with the 'HTML junk', Python has a library for that too: BeautifulSoup.
It is a lovely way to get the data you want out of HTML (of course it assumes you know a little bit about HTML, as you will still have to navigate the parse tree).
Update: As to your follow-up question on how to click through multiple pages: mechanize is a library to do just that. Take a closer look at their examples, especially the follow_link method. As I said, it simulates a browser, so 'clicking' can be realized quickly in Python.
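A minimal sketch combining the two suggestions (the directory URL and the CSS class are placeholders; inspect the real site first):

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)  # many directory sites disallow bots in robots.txt
br.open("http://www.example-directory.com/churches?state=MD")  # placeholder URL

soup = BeautifulSoup(br.response().read(), "html.parser")
for row in soup.find_all("div", class_="address"):  # placeholder selector
    print(row.get_text(strip=True))

# Paging to the next results page is what follow_link is for, e.g.:
# br.follow_link(text_regex=r"Next")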
Try lynx --dump <url> to download the web pages. All the troublesome HTML tags will be stripped from the output, and all the links from the page will appear together.
What you're trying to do is called scraping, or web scraping.
If you do some searches on Python and scraping, you may find a list of tools that will help.
(I have never used Scrapy, but its site looks promising :)
Beautiful Soup is a no-brainer. Here's a site you might start at: http://www.churchangel.com/. They have a huge list and the formatting is very regular -- translation: easy to set up BeautifulSoup to scrape.
Python scripts might not be the best tool for this job if you're just looking for addresses of churches in a geographic area.
The US Census provides a dataset of churches for use with geographic information systems. If finding all the X in a spatial area is a recurring problem, invest in learning a GIS. Then you can bring your Python skills to bear on many geographic tasks.
