scrape urls from google search - python

I am trying to write code that gets the first 1000 URLs of HTTP pages from a Google search for some word. I used this code in Python to get the first 1000 URLs:
import GoogleScraper
import urllib

urls = GoogleScraper.scrape('english teachers', number_pages=2)
for url in urls:
    print(urllib.parse.unquote(url.geturl()))

print('[!] Received %d results by asking %d pages with %d results per page' %
      (len(urls), 2, 100))
But this code returns 0 received results.
Is there another way to get a lot of URLs from a Google search in a convenient way?
I also tried the xgoogle and pygoogle modules, but they can only handle a small number of page requests.

Google has a Custom Search API which allows you to make 100 queries a day for free. Since each query returns at most 10 results per page, you can barely fit 1000 results into a day. xgoogle and pygoogle are just wrappers around this API, so I don't think you'll be able to get more results by using them.
If you do need more, consider creating another Google account with another API key, which will effectively double your limit. If you're okay with slightly inferior results, you can try Bing's Search API (they offer 5000 requests a month).
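If you go the Custom Search route, you page through results by moving the start parameter in steps of 10. Here is a minimal sketch with requests, assuming you already have an API key and a search engine ID (cx); the placeholders below are yours to fill in:

import requests

API_KEY = 'YOUR_API_KEY'   # placeholder
CX = 'YOUR_CX_ID'          # placeholder: custom search engine ID

def custom_search(query, pages=10):
    """Collect up to pages * 10 result URLs from the Custom Search JSON API."""
    urls = []
    for start in range(1, pages * 10, 10):   # start = 1, 11, 21, ...
        resp = requests.get(
            'https://www.googleapis.com/customsearch/v1',
            params={'key': API_KEY, 'cx': CX, 'q': query,
                    'start': start, 'num': 10},
        )
        resp.raise_for_status()
        items = resp.json().get('items', [])
        if not items:          # no more results for this query
            break
        urls.extend(item['link'] for item in items)
    return urls

print(len(custom_search('english teachers', pages=10)))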

Related

Getting the number of results of a google search using python in 2020?

There were solutions provided before, but they don't work anymore:
extract the number of results from google search
For example, the code above doesn't work anymore because the number of results doesn't even seem to be in the response: there is no resultStats ID. In my browser the count sits in an element with the ID "result-status", but that doesn't exist in the response either.
I don't want to use Google's API, because there is a big limit on daily searches and I need to search for thousands of words daily. What is the solution for me?
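For reference, the kind of check being described looks roughly like this. It is only a sketch: Google's markup changes often, and neither of the IDs mentioned above is guaranteed to appear in the HTML served to a script:

import requests
from bs4 import BeautifulSoup

resp = requests.get('https://www.google.com/search',
                    params={'q': 'english teachers'},
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'html.parser')

# Neither ID is guaranteed to exist in the response a script receives.
stats = soup.find(id='resultStats') or soup.find(id='result-status')
print(stats.get_text() if stats else 'result count not present in response')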

Twitter scraping in Python

I have to scrape tweets from Twitter for a specific user (#salvinimi), from January 2018. The issue is that there are a lot of tweets in that time range, so I am not able to scrape all the ones I need!
I tried multiple solutions:
1)
pip install twitterscraper
from twitterscraper import query_tweets_from_user as qtfu
tweets = qtfu(user='matteosalvinimi')
With this method, I get only a few tweets (roughly 500-600), instead of all of them... Do you know why?
2)
!pip install twitter_scraper
from twitter_scraper import get_tweets
tweets = []
for i in get_tweets('matteosalvinimi', pages=100):
    tweets.append(i)
With this method I get an error -> "ParserError: Document is empty"...
If I set "pages=40", I get the tweets without errors, but not all the ones. Do you know why?
Three things for the first issue you encounter:
First of all, every API has its limits, and one like Twitter's can be expected to monitor usage and eventually cut off a user who asks for more than those limits allow. Trying to work around an API's limitations might not be the best idea and could result in being banned from the site (I'm guessing here, as I don't know Twitter's policy on the matter). That said, the documentation of the library you're using states:
With Twitter's Search API you can only send 180 requests every 15 minutes. With a maximum of 100 tweets per request, this means you can mine 4 x 180 x 100 = 72,000 tweets per hour.
By using TwitterScraper you are not limited by this number but by your internet speed/bandwidth and the number of instances of TwitterScraper you are willing to start.
Then, the function you're using, query_tweets_from_user(), has a limit argument which you can set to an integer. One thing to try is raising that argument and seeing whether you get what you want.
Finally, if the above does not work, you could split your time range into two, three or more subsets, collect the data separately and merge it afterwards (see the sketch just below this list).
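A sketch of the last two suggestions, assuming twitterscraper's documented parameters (limit for query_tweets_from_user, begindate/enddate for query_tweets); check your installed version, since the exact names may differ:

import datetime as dt
from twitterscraper import query_tweets, query_tweets_from_user

# Suggestion 2: raise the limit explicitly instead of relying on the default.
tweets = query_tweets_from_user('matteosalvinimi', limit=10000)

# Suggestion 3: split the time range into subsets and merge the results.
ranges = [(dt.date(2018, 1, 1), dt.date(2018, 7, 1)),
          (dt.date(2018, 7, 1), dt.date(2019, 1, 1))]
all_tweets = []
for begin, end in ranges:
    all_tweets.extend(query_tweets('from:matteosalvinimi',
                                   begindate=begin, enddate=end))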
The second issue you mention might be due to many different things, so I'll just take a broad guess here. Either pages=100 is too high and, one way or another, the program or the API cannot retrieve that much data, or you're asking for a hundred pages when in reality there are fewer than a hundred to look through, which leaves the program trying to parse an empty document.

Am I setting my multithreading web scraper properly?

I'm trying to improve the speed of my web scraper, and I have thousands of sites I need to get info from. I'm trying to get the Facebook and Yelp ratings and numbers of ratings that appear in the Google search pages for those sites. I would normally just use an API, but because I have a huge list of sites to search and time is of the essence, Facebook's small hourly request limits make their Graph API infeasible (I've tried...). My sites are all in Google search pages. What I have so far (I have provided 8 sample sites for reproducibility):
from multiprocessing.dummy import Pool
import requests
from bs4 import BeautifulSoup
pools = Pool(8) #My computer has 8 cores
proxies = MY_PROXIES
#How I set up my urls for requests on Google searches.
#Since each item has a "+" in between in a Google search, I have to format
#my urls to copy it.
site_list = ['Golden Gate Bridge', 'Statue of Liberty', 'Empire State Building', 'Millennium Park', 'Gum Wall', 'The Alamo', 'National Art Gallery', 'The Bellagio Hotel']
urls = list(map(lambda x: "+".join(x.split(" ")), site_list))
def scrape_google(url_list):
    info = []
    for i in url_list:
        reviews = {'FB Rating': None,
                   'FB Reviews': None,
                   'Yelp Rating': None,
                   'Yelp Reviews': None}
        request = requests.get(i, proxies=proxies, verify=False).text
        search = BeautifulSoup(request, 'lxml')
        results = search.find_all('div', {'class': 's'})  # Where the ratings roughly are
        for j in results:
            if 'Rating' in str(j.findChildren()) and 'yelp' in str(j.findChildren()[1]):
                reviews['Yelp Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1]  # Had to brute-force get the ratings this way.
                reviews['Yelp Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]
            elif 'Rating' in str(j.findChildren()) and 'facebook' in str(j.findChildren()[1]):
                reviews['FB Rating'] = str(j.findChildren()).partition('Rating')[2].split()[1]
                reviews['FB Reviews'] = str(j.findChildren()).partition('Rating')[2].split()[3]
        info.append(reviews)
    return info

results = pools.map(scrape_google, urls)
I tried something similar to this, but I think I'm getting way too many duplicated results. Will multithreading make this run more quickly? I did diagnostics on my code to see which parts took the most time, and by far the biggest cost was making the requests.
EDIT: I just tried this out, and I get the following error:
Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
I don't understand what the problem is, because if I try my scrape_google function without multithreading it works just fine (albeit very, very slowly), so URL validity should not be an issue.
Yes, multithreading will probably make it run more quickly.
As a very rough rule of thumb, you can usually profitably make about 8-64 requests in parallel, as long as no more than 2-12 of them are to the same host. So, one dead-simple way to apply that is to just toss all of your requests into a concurrent.futures.ThreadPoolExecutor with, say, 8 workers.
In fact, that's the main example for ThreadPoolExecutor in the docs.
(By the way, the fact that your computer has 8 cores is irrelevant here. Your code isn't CPU-bound, it's I/O bound. If you do 12 requests in parallel, or even 500 of them, at any given moment, almost all of your threads are waiting on a socket.recv or similar call somewhere, blocking until the server responds, so they aren't using your CPU.)
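For reference, a minimal version of that pattern with concurrent.futures; the URLs, the fetch function and the worker count here are illustrative, not taken from the question:

import concurrent.futures
import requests

def fetch(url):
    # One request per call; the thread blocks on I/O, not the CPU.
    return requests.get(url).text

urls = ['https://example.com/a', 'https://example.com/b']  # illustrative
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    pages = list(executor.map(fetch, urls))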
However:
I think I'm getting way too many duplicated results
Fixing this may help far more than threading. Although, of course, you can do both.
I have no idea what your issue is here from the limited information you provided, but there's a pretty obvious workaround: Keep a set of everything you've seen so far. Whenever you get a new URL, if it's already in the set, throw it away instead of queuing up a new request.
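In code, that bookkeeping is just a set check before queuing; maybe_enqueue and enqueue below are hypothetical names standing in for whatever your pipeline actually uses:

seen = set()

def maybe_enqueue(url, enqueue):
    # Skip URLs we've already requested; queue the rest exactly once.
    if url in seen:
        return
    seen.add(url)
    enqueue(url)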
Finally:
I would just use an API normally, but because I have a huge list of sites to search for and time is of the essence, Facebook's small request limits per hour make this not feasible
If you're trying to get around the rate limits for a major site, (a) you're probably violating their T&C, and (b) you're almost surely going to trigger some kind of detection and get yourself blocked.1
In your edited question, you attempted to do this with multiprocessing.dummy.Pool.map, which is fine—but you're getting the arguments wrong.
Your function takes a list of urls and loops over them:
def scrape_google(url_list):
    # ...
    for i in url_list:
But then you call it with a single URL at a time:
results = pools.map(scrape_google, urls)
This is similar to using the builtin map, or a list comprehension:
results = map(scrape_google, urls)
results = [scrape_google(url) for url in urls]
What happens if you get a single URL instead of a list of them, but try to use it as a list? A string is a sequence of its characters, so you loop over the characters of the URL one by one, trying to download each character as if it were a URL. That is exactly what the Invalid URL 'h' error is telling you: 'h' is just the first character of one of your URLs.
So, you want to change your function, like this:
def scrape_google(url):
    reviews = # …
    request = requests.get(url, proxies=proxies, verify=False).text
    # …
    return reviews
Now it takes a single URL and returns the reviews dict for that URL. pools.map will call it once per URL and give you back an iterable of those dicts, one per URL.
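With that change, the original call works as intended and results becomes a list of those dicts:

results = pools.map(scrape_google, urls)  # one reviews dict per URL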
1. Or maybe something more creative. Someone posted a question on SO a few years ago about a site that apparently sent corrupted responses that seem to have been specifically crafted to waste exponential CPU for a typical scraper regex…

How to extract all tweets from multiple users' timelines using R?

I am working on a project for which I want to extract the timelines of around 500 different twitter users (I am using this for historical analysis, so I'll only need to retrieve them all once- no need to update with incoming tweets).
While I know the Twitter API only allows the last 3,200 tweets to be retrieved, when I use the basic UserTimeline method of the R twitteR package I only seem to fetch about 20 each time I try (for users with significantly more recent tweets). Is this because of rate limiting, or because I am doing something wrong?
Does anyone have tips for doing this most efficiently? I realize it might take a lot of time because of rate limiting, is there a way of automating/iterating this process in R?
I am quite stuck, so thank you very much for any help/tips you may have!
(I have some experience using the Twitter API/twitteR package to extract tweets using a certain hashtag over a couple of days. I have basic Python skills, if it turns out to be easier/quicker to do in Python).
It looks like the twitteR documentation suggests using the maxID argument for pagination. So when you get the first batch of results, you could use the minimum ID in that set minus one as the maxID for the next request, until you get no more results back (meaning you've gotten to the beginning of a user's timeline).
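The loop being described, sketched in Python since that is what the rest of this page uses; get_batch is a hypothetical stand-in for whatever call you make (for twitteR that would be userTimeline with its maxID argument):

def fetch_full_timeline(get_batch):
    """get_batch(max_id) -> list of tweets, each with an 'id' field; max_id=None means newest."""
    tweets, max_id = [], None
    while True:
        batch = get_batch(max_id)
        if not batch:
            break          # reached the start of the retrievable timeline
        tweets.extend(batch)
        # Ask for everything strictly older than the oldest tweet seen so far.
        max_id = min(t['id'] for t in batch) - 1
    return tweets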

Query a website/web service using python

I am trying to find tweets per day for a Twitter handle. I found a site called http://www.howoftendoyoutweet.com/ which gives you the tweets per day when you provide it with the Twitter handle. I want my Python script to query this website for a list of Twitter handles and extract the tweets per day from each page.
I know that I have to use urllib2 and json for it, but I have not been able to get it working. Is there a better way to find tweets per day?
Seems like the python-twitter library might give you better results.
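A rough sketch of that approach, assuming you have Twitter API credentials (the keys below are placeholders, and GetUserTimeline only returns the most recent tweets, so the per-day counts are approximate):

import collections
import datetime
import twitter  # python-twitter

api = twitter.Api(consumer_key='...', consumer_secret='...',
                  access_token_key='...', access_token_secret='...')

def tweets_per_day(handle, count=200):
    """Count recent tweets per calendar day for one handle."""
    statuses = api.GetUserTimeline(screen_name=handle, count=count)
    return collections.Counter(
        datetime.datetime.utcfromtimestamp(s.created_at_in_seconds).date()
        for s in statuses)

for handle in ['twitter', 'github']:   # illustrative handles
    print(handle, tweets_per_day(handle))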
