This is a small project I'd like to get started on in the near future. It's still in the planning stage, so this post is more about being steered in the right direction.
Essentially, I'd like to obtain tweets from a user and parse them into a table/database, with the aim of eventually running the program in real time.
My initial plan to tackle this was to use Beautiful Soup, a Python library for parsing HTML; however, I believe the Twitter API is the better approach (advice on this subject would be appreciated).
There are still 3 unknowns:
1) Where do I store the tweets once obtained?
2) How do I parse the tweets?
3) Where do I store the parsed data?
To answer (3), I suppose it depends on what I want to do with the data. I haven't yet decided how I'll use the parsed data, but I know I'd like it sorted into categories, so my thinking is probably a database, a table, or a spreadsheet.
A few questions remain to answer, and I'd like you guys to steer me in the right direction. My programming knowledge is limited to C for now, but as this project means a great deal to me, I'm willing to put in the effort and learn the necessary languages/APIs.
What languages/APIs will I need to understand to accomplish this project? From where I stand, it seems to be the Twitter API and Python.
EDIT: I now have a basic script going that obtains a user's tweets, and it works better than expected. However, I'd like to take it a step further: I'd like to keep a user's tweet only if it contains a hashtag, ignoring all other tweets. How best to do this?
Here is a snippet of the basic code I have going:
import tweepy
import twitter_credentials

# Authenticate with keys kept out of the script, in a twitter_credentials module
auth = tweepy.OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# Fetch the user's 10 most recent tweets, excluding retweets
stuff = api.user_timeline(screen_name='XXXXXXXXXX', count=10, include_rts=False)
for status in stuff:
    print(status.text)
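In case it helps later readers, here is one way to extend the snippet above to keep only tweets containing a hashtag. This is a sketch: it relies on the fact that tweepy's Status objects carry an `entities` dict in which the API returns pre-parsed hashtags, so there is no need to regex-match the tweet text yourself.

```python
def has_hashtag(status):
    """True if the status carries at least one hashtag entity.

    Assumes tweepy-style Status objects, whose `entities` dict holds the
    hashtags the API already parsed out of the tweet text.
    """
    return bool(status.entities.get('hashtags'))
```

With the snippet above, the loop body then becomes: `if has_hashtag(status): print(status.text)`.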
Scraping Twitter (or any other social network) with, for example, Beautiful Soup is, as you said, not a good idea, for two reasons:
if the source pages change (name attributes, div ids...), you have to keep your code up to date;
your script can be banned, because scraping is not allowed by the terms of service.
To answer your questions:
1) You can store the tweets wherever you want: CSV, MySQL, SQLite, Redis, Neo4j...
2) With the official API, you get JSON. Here is the Tweet object reference: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html. With tweepy, for example, status.text will give you the text of the tweet.
3) Same as #1. If you don't yet know what you will do with the data, store the full JSON payloads; you will be able to parse them later.
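To illustrate point 3, here is a minimal sketch of storing the full payloads, one JSON object per line (the "JSON Lines" convention). It assumes tweepy, whose Status objects keep the raw API dict on the `_json` attribute (technically a private attribute, but a common way to get at the full payload):

```python
import json

def store_raw(statuses, path):
    """Append each tweet's full JSON payload to `path`, one object per line."""
    with open(path, 'a') as f:
        for status in statuses:
            # `_json` holds the raw dict the API returned for this tweet
            f.write(json.dumps(status._json) + '\n')
```

Later you can re-read the file line by line with `json.loads` and parse whichever fields you end up needing.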
I suggest tweepy for Python (http://www.tweepy.org/) or twit for Node.js (https://www.npmjs.com/package/twit). And read the official docs: https://developer.twitter.com/en/docs/api-reference-index
I was wondering if anyone has any sample code, preferably in Python, for finding recently posted tweets that contain a certain keyword and have a certain number of likes within a certain timeframe. Anything related to this would help a lot. Thank you!
I have personally not done this before, but a simple Google search yielded this (a Python wrapper for the Twitter API):
https://python-twitter.readthedocs.io/en/latest/index.html
and a GitHub repository with examples, linked from its Getting Started page:
https://github.com/bear/python-twitter/tree/master/examples
There you can find example code for getting all of a user's tweets, and much more.
Iterating through the list of a user's tweets might do the job here, but if that doesn't cut it, I recommend searching the docs linked above for what you need.
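As a starting sketch (untested against the live API), the filtering part can be kept separate from the fetching: search for the keyword, then keep only statuses with enough likes that are recent enough. The attribute names `favorite_count` and `created_at` are what tweepy's Status objects expose; the thresholds below are placeholders.

```python
from datetime import datetime, timedelta

def recent_and_popular(status, min_likes, max_age_days, now=None):
    """Keep a status only if it has at least `min_likes` likes and was
    posted within the last `max_age_days` days.

    Assumes tweepy-style attributes: `favorite_count` (int) and
    `created_at` (a naive UTC datetime, as the REST API returns).
    """
    now = now or datetime.utcnow()
    recent = now - status.created_at <= timedelta(days=max_age_days)
    return status.favorite_count >= min_likes and recent

# Typical use (needs credentials, so shown commented out):
# for status in api.search('your keyword', count=100):
#     if recent_and_popular(status, min_likes=50, max_age_days=7):
#         print(status.text)
```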
I have tried using tweepy to extract tweets for a specific keyword, but the count of extracted tweets is lower than the number of tweets I can see for the same keyword on Twitter search.
I also want to know how to effectively extract ALL the tweets for a specific keyword of interest, using any Twitter data-extraction library (tweepy/twython).
I also face a problem of irrelevant tweets coming up for the same keyword. Is there a way to fine-tune the search and perform an accurate extraction, so that I get all the tweets for the specific keyword?
I'm adding the code snippet, as many asked for it, but I don't think the problem is in the code itself, since it runs.
import pandas as pd  # `api` is an authenticated tweepy.API instance, set up as usual

tweets = api.search('Mexican Food', count=500, tweet_mode='extended')
data = pd.DataFrame(data=[tweet.full_text for tweet in tweets],
                    columns=['Tweets'])
data.head(10)
print(tweets[0].created_at)
My question is: how do I get ALL the tweets with a particular keyword? For example, each time I run the above code I get a different count of tweets. I also cross-checked with a manual search on Twitter, and there seem to be many more tweets than tweepy extracts for the particular keyword.
I also want to know if there is any way to fine-tune the keyword search through Python, so that all the relevant tweets for my keyword of interest are fetched.
The thing is, tweepy has some limitations when used this way: the standard search API won't be able to fetch older tweets (it only indexes roughly the last seven days).
So I will suggest you use
https://github.com/Jefferson-Henrique/GetOldTweets-python
in place of tweepy to fetch the older tweets.
Since you refuse to help me with your question, I'll do the bare minimum with my answer:
You are probably not doing pagination correctly.
P.S.: Check out the Stack Overflow guidelines; there is an important point about helping others reproduce the problem.
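To make "pagination" concrete: a single api.search() call returns one page (at most 100 tweets, whatever `count` you pass), so you have to keep requesting older pages with `max_id` set just below the lowest tweet id you've seen. Here is that loop in the abstract, with the actual API call injected as a function so the logic is visible. With tweepy you would simply use `tweepy.Cursor(api.search, q='Mexican Food', tweet_mode='extended').items()`, which does this for you.

```python
def paginate(fetch_page, per_page=100):
    """max_id pagination, the pattern tweepy.Cursor automates: keep asking
    for tweets strictly older than the lowest id seen so far, until a page
    comes back empty.  `fetch_page(count, max_id)` stands in for api.search."""
    max_id = None
    results = []
    while True:
        page = fetch_page(count=per_page, max_id=max_id)
        if not page:
            break
        results.extend(page)
        max_id = min(t['id'] for t in page) - 1  # next page: only older tweets
    return results
```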
I am working on a project for which I want to extract the timelines of around 500 different Twitter users (I am using this for historical analysis, so I'll only need to retrieve them all once, with no need to update with incoming tweets).
While I know the Twitter API only allows the last 3,200 tweets to be retrieved, when I use the basic userTimeline method of the R twitteR package, I only seem to fetch about 20 every time I try (for users with significantly more recent tweets). Is this because of rate limiting, or because I am doing something wrong?
Does anyone have tips for doing this most efficiently? I realize it might take a lot of time because of rate limiting; is there a way of automating/iterating this process in R?
I am quite stuck, so thank you very much for any help/tips you may have!
(I have some experience using the Twitter API and the twitteR package to extract tweets with a certain hashtag over a couple of days. I have basic Python skills, if it turns out to be easier/quicker to do in Python.)
It looks like the twitteR documentation suggests using the maxID argument for pagination. When you get the first batch of results, use the minimum ID in that set, minus one, as the maxID for the next request, and repeat until you get no more results back (meaning you've reached the beginning of the user's timeline).
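Since the asker mentions basic Python skills, here is what that maxID scheme looks like in code, applied to many users at once. The fetch call is injected as a function, `get_page(screen_name, count, max_id)`, standing in for the real API call (tweepy's api.user_timeline, or twitteR's userTimeline) because the pagination logic is the same either way; 16 pages of 200 tweets covers the roughly 3,200-tweet ceiling per user.

```python
def fetch_all_users(get_page, screen_names, per_page=200, max_pages=16):
    """Collect up to ~3,200 tweets per user via maxID pagination.

    `get_page(screen_name, count, max_id)` stands in for the real API call;
    with tweepy you would also want wait_on_rate_limit=True on the API object,
    so the script sleeps through rate-limit windows instead of failing.
    """
    timelines = {}
    for name in screen_names:
        tweets, max_id = [], None
        for _ in range(max_pages):
            batch = get_page(name, count=per_page, max_id=max_id)
            if not batch:
                break  # reached the start of this user's timeline
            tweets.extend(batch)
            max_id = min(t['id'] for t in batch) - 1  # minimum id seen, minus one
        timelines[name] = tweets
    return timelines
```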
I'm trying to visit a set of, say, 8,000 LinkedIn profiles that belong to people with a certain first name (for example, "Larry"), and then extract the kinds of jobs each user has held in the past. Is there an efficient way to do this? I would need each Larry to be picked independently of the others; basically, traversing someone's network isn't a good way to do this. Is there a way to completely randomize how the Larrys are picked?
I don't even know where to start. Thanks.
To start:
Trying to crawl the response LinkedIn gives your browser would be almost suicidal.
Check their APIs (particularly the People Search API) and their code samples.
Important disclaimer found in the People Search API docs:
People Search API is a part of our Vetted API Access Program. You must apply here and get LinkedIn's approval before using this API.
MAYBE with that in mind you'll be able to write a script that queries and parses those APIs, for instance retrieving users with Larry as a first name: http://api.linkedin.com/v1/people-search?first-name=Larry
Once you get approved by LinkedIn, have retrieved some data from their APIs, and have tried some JSON or XML parsing (whatever the APIs return), you will have something more specific to ask.
If you still want to crawl the HTML returned by LinkedIn when you hit https://www.linkedin.com/pub/dir/?first=Larry&last=&search=Search, take a look at BeautifulSoup.
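If you do go the BeautifulSoup route, the parsing itself is the easy part. The sketch below pulls profile links out of a directory-style page; the markup it looks for (anchors whose href contains "/pub/") is an assumption made purely for illustration, since LinkedIn's real pages change often and generally sit behind a login.

```python
from bs4 import BeautifulSoup

def extract_profile_links(html):
    """Collect hrefs that look like public-profile links from a page.

    The '/pub/' pattern is a guess for illustration, not a stable fact
    about LinkedIn's markup.
    """
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a')
            if '/pub/' in a.get('href', '')]

# Hypothetical use (page layout and availability are not guaranteed):
# import requests
# html = requests.get('https://www.linkedin.com/pub/dir/?first=Larry&last=&search=Search').text
# print(extract_profile_links(html))
```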
I am sorry for asking, but I am new to writing crawlers.
I would like to crawl Twitter for its users and the follow relationships among them, using Python.
Any recommendations for starting points, such as tutorials?
Thank you very much in advance.
I'm a big fan of Tweepy myself: https://github.com/tweepy/tweepy
You'll have to refer to the Twitter docs for the API methods you're going to need. As far as I know, Tweepy wraps all of them, but I recommend looking at Twitter's own docs to find out which ones you need.
To construct a following/follower graph, you're going to need some of these:
GET followers/ids - grabs a user's followers (as IDs)
GET friends/ids - grabs a user's followings (as IDs)
GET users/lookup - grabs up to 100 users, specified by their IDs
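Putting those endpoints together, one way to represent the graph is a set of directed (follower, followed) edges. The sketch below injects the two id-fetchers as functions so the graph-building logic stands alone; in practice they would be tweepy's api.followers_ids and api.friends_ids, paged with tweepy.Cursor and heavily rate-limited, so expect a crawl of any size to take a while.

```python
def build_graph(get_follower_ids, get_friend_ids, seed_users):
    """Build a directed follow graph as a set of (follower, followed) edges.

    `get_follower_ids(user)` and `get_friend_ids(user)` stand in for the
    followers/ids and friends/ids endpoints listed above.
    """
    edges = set()
    for user in seed_users:
        for follower in get_follower_ids(user):
            edges.add((follower, user))  # follower -> user
        for friend in get_friend_ids(user):
            edges.add((user, friend))    # user follows friend
    return edges
```

users/lookup can then turn the ids in the edge set back into screen names, 100 at a time.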
Besides reading the Twitter API docs?
A good starting point would be the great Python twitter library by Mike Verdone, which I personally think is the best one (there is also an introduction here).
Also see this question on Stack Overflow.