Twitter scraping in Python

Twitter scraping in Python - python

I have to scrape tweets from Twitter for a specific user (#salvinimi), from January 2018. The issue is that there are a lot of tweets in this range of time, and so I am not able to scrape all the ones I need!
I tried multiple solutions:
1)
pip install twitterscraper
from twitterscraper import query_tweets_from_user as qtfu
tweets = qtfu(user='matteosalvinimi')
With this method, I get only a few teets (500~600 more or less), instead of all the tweets... Do you know why?
2)
!pip install twitter_scraper
from twitter_scraper import get_tweets
tweets = []
for i in get_tweets('matteosalvinimi', pages=100):
tweets.append(i)
With this method I get an error -> "ParserError: Document is empty"...
If I set "pages=40", I get the tweets without errors, but not all the ones. Do you know why?

Three things for the first issue you encounter:
first of all, every API has its limits and one like Twitter would be expected to monitor its use and eventually stop a user from retrieving data if the user is asking for more than the limits. Trying to overcome the limitations of the API might not be the best idea and might result in being banned from accessing the site or other things (I'm taking guesses here as I don't know what's the policy of Twitter on the matter). That said, the documentation on the library you're using states :
With Twitter's Search API you can only sent 180 Requests every 15 minutes. With a maximum number of 100 tweets per Request this means you can mine for 4 x 180 x 100 = 72.000 tweets per hour.
By using TwitterScraper you are not limited by this number but by your internet speed/bandwith and the number of instances of TwitterScraper you are willing to start.
then, the function you're using, query_tweets_from_user() has a limit argument which you can set to an integer. One thing you can try is changing that argument and seeing whether you get what you want or not.
finally, if the above does not work, you could be subsetting your time range in two, three ore more subsets if needed, collect the data separately and merge them together afterwards.
The second issue you mention might be due to many different things so I'll just take a broad guess here. For me, either setting pages=100 is too high and by one way or another the program or the API is unable to retrieve the data, or you're trying to look at a hundred pages when there is less than a hundred in pages to look for reality, which results in the program trying to parse an empty document.

Related

Number of orders matching parameters?

I'm trying to setup some automation in my life here and I've hit a bit of a snag.
I'm using the WooCommerce API and I've utilized the woocommerce python library.
I'm able to get a max of 100 orders for a given date range. So I've done the following:
wcapi.get("orders?per_page=100&before=2023-02-01T23:59:59&after=2023-01-01T00:00:00").json()
It appears 100 is the max you can set for "per_page". I can then utilize the "page" argument to get page 2 and so on. I'm aware of how to loop through the pages, however I can't find anything in the documentation that can tell me how many orders fit my before/after parameters. So I'm left with paging through until I receive an error?
reports/orders/totals appears to ignore any parameters given to it, so there is no way to do any date filtering.

Getting the number of results of a google search using python in 2020?

There were solutions provided before, but they don't work anymore :
extract the number of results from google search
for example the above code doesn't work anymore because the number of results doesn't seem to even be in the respond, there is no resultStats ID, in my browser the result is in the id of "result-status" but this doesn't exist in the respond
I don't want to actually use the API of google because there is a big limit on daily search, and i need to search for thousands of words daily, what is the solution for me?

How to extract all tweets from multiple users' timelines using R?

I am working on a project for which I want to extract the timelines of around 500 different twitter users (I am using this for historical analysis, so I'll only need to retrieve them all once- no need to update with incoming tweets).
While I know the Twitter API only allows the last 3,200 tweets to be retrieved, when I use the basic UserTimeline method of the R twitteR package, I only seem to fetch about 20 every time I try (for users with significantly more, recent, tweets). Is this because of rate limiting, or because I am doing something wrong?
Does anyone have tips for doing this most efficiently? I realize it might take a lot of time because of rate limiting, is there a way of automating/iterating this process in R?
I am quite stuck, so thank you very much for any help/tips you may have!
(I have some experience using the Twitter API/twitteR package to extract tweets using a certain hashtag over a couple of days. I have basic Python skills, if it turns out to be easier/quicker to do in Python).

It looks like the twitteR documentation suggests using the maxID argument for pagination. So when you get the first batch of results, you could use the minimum ID in that set minus one as the maxID for the next request, until you get no more results back (meaning you've gotten to the beginning of a user's timeline).

How much data can I get with the Twitter Search API for one specific keyword?

I want collect data from twitter using python Tweepy library.
I surveyed the rate limits for Twitter API,which is 180 requests per 15-minute.
What I want to know how many data I can get for one specific keyword?put it in another way , when I use the Tweepy.Cursor,when it'll stops?
I not saying the maths calculation(100 count * 180 request * 4 times/hour etc.) but the real experience.I found a view as follows:
"With a specific keyword, you can typically only poll the last 5,000 tweets per keyword. You are further limited by the number of requests you can make in a certain time period. "
http://www.brightplanet.com/2013/06/twitter-firehose-vs-twitter-api-whats-the-difference-and-why-should-you-care/
Is this correct(if this's correct,I only need to run the program for 5 minutes or so)? or I am needed to keep getting as many tweets as they are there(which may make the program keep running very long time)?

You will definitely not be getting as many tweets as exist. The way Twitter limits how far back you can go (and therefore how many tweets are available) is with a minimum since_id parameter passed to the GET search/tweets call to the Twitter API. In Tweepy, the API.search function interfaces with the Twitter API. Twitter's GET search/tweets documentation has a lot of good info:
There are limits to the number of Tweets which can be accessed through the API. If the limit of Tweets has occured since the since_id, the since_id will be forced to the oldest ID available.
In practical terms, Tweepy's API.search should not take long to get all the available tweets. Note that not all tweets are available per the Twitter API, but I've never had a search take up more than 10 minutes.

Python Twitter Statistics

I need to get the number of people who have followed a certain account by month, also the number of people who have unfollowed the same account by month, the total number of tweets by month, and the total number of times something the account tweeted has been retweeted by month.
I am using python to do this, and have installed python-twitter, but as the documentation is rather sparse, I'm having to do a lot of guesswork. I was wondering if anyone could point me in the right direction? I was able to get authenticated using OAuth, so thats not an issue, I just need some help with getting those numbers.
Thank you all.

These types of statistical breakdowns are not generally available via the Twitter API. Depending on your sample date range, you may have luck using Twittercounter.com's API (you can sign up for an API key here).
The API is rate limited to 100 calls per hour, unless you get whitelisted. You can get results for the previous 14 days. An example request is below:
http://api.twittercounter.com?twitter_id=813286&apikey=[api_key]
The results, in JSON, look like this:
{"version":"1.1","username":"BarackObama","url":"http:\/\/www.barackobama.com","avatar":"http:\/\/a1.twimg.com\/profile_images\/784227851\/BarackObama_twitter_photo_normal.jpg","followers_current":7420937,"date_updated":"2011-04-16","follow_days":"563","started_followers":"2264457","growth_since":5156480,"average_growth":"9166","tomorrow":"7430103","next_month":"7695917","followers_yesterday":7414507,"rank":"3","followers_2w_ago":7243541,"growth_since_2w":177396,"average_growth_2w":"12671","tomorrow_2w":"7433608","next_month_2w":"7801067","followersperdate":{"date2011-04-16":7420937,"date2011-04-15":7414507,"date2011-04-14":7400522,"date2011-04-13":7385729,"date2011-04-12":7370229,"date2011-04-11":7366548,"date2011-04-10":7349078,"date2011-04-09":7341737,"date2011-04-08":7325918,"date2011-04-07":7309609,"date2011-04-06":7306325,"date2011-04-05":7283591,"date2011-04-04":7269377,"date2011-04-03":7257596},"last_update":1302981230}
The retweet stats aren't available from Twittercounter, but you might be able to obtain those from Favstar (although they don't have a public API currently.)

My problem is I also need to get unfollow statistics, which twittercounter does not supply.
My solution was to access the twitter REST API directly, using the oauth2 library in python. I found this very simple compared to some of the other twitter libraries for python out there. This example was particularly helpful: http://parand.com/say/index.php/2010/06/13/using-python-oauth2-to-access-oauth-protected-resources/

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.