I'm sorry for asking, but I'm new to writing crawlers.
I would like to crawl Twitter for users and the follow relationships among them, using Python.
Any recommendations for starting points, such as tutorials?
Thank you very much in advance.
I'm a big fan of Tweepy myself - https://github.com/tweepy/tweepy
As far as I know, Tweepy wraps all of the Twitter API methods, but you'll have to refer to Twitter's own docs to find out which ones you need.
To construct a following/follower graph, you're going to need some of these:
GET followers/ids - grab followers (in IDs) for a user
GET friends/ids - grab followings (in IDs) for a user
GET users/lookup - grab up to 100 users, specified by IDs
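For a rough illustration, here is a minimal sketch of how those three calls look through Tweepy (assuming the 3.x-era API; the credentials and screen name are placeholders you'd substitute):

import tweepy

# Placeholder credentials - substitute your own app's keys
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)  # sleep through rate limits

# GET followers/ids and GET friends/ids, as wrapped by Tweepy
follower_ids = api.followers_ids(screen_name='some_user')
friend_ids = api.friends_ids(screen_name='some_user')

# GET users/lookup: hydrate up to 100 IDs per call
for user in api.lookup_users(user_ids=follower_ids[:100]):
    print(user.id, user.screen_name)

From the ID lists you can then build the edges of your follow graph.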
Besides reading the Twitter API docs?
A good starting point would be the great Python Twitter Tools library by Mike Verdone, which I personally think is the best one (there's also an introduction here).
Also see this question on Stack Overflow.
I was wondering if anyone has sample code for finding recently posted tweets that contain a certain keyword and have a certain number of likes within a certain timeframe, preferably in Python. Anything related to this would help a lot if you have it. Thank you!
I have personally not done this before, but a simple Google search yielded this (a Python wrapper for the Twitter API):
https://python-twitter.readthedocs.io/en/latest/index.html
and a GitHub repo with examples, linked from their getting started page:
https://github.com/bear/python-twitter/tree/master/examples
There you can find example code for getting all of a user's tweets and much more.
Iterating through a user's tweets might do the job here, but if that doesn't cut it, I recommend searching the docs linked above for what you need.
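To give a rough idea, here is a minimal sketch using that wrapper (the keyword, the 50-like threshold, and the 24-hour window are arbitrary placeholders, and the credentials are assumed to be your own app's):

import time
import twitter

# Placeholder credentials - substitute your own app's keys
api = twitter.Api(consumer_key=CONSUMER_KEY,
                  consumer_secret=CONSUMER_SECRET,
                  access_token_key=ACCESS_TOKEN,
                  access_token_secret=ACCESS_TOKEN_SECRET)

# Search for recent tweets containing the keyword
results = api.GetSearch(term='python', count=100, result_type='recent')

cutoff = time.time() - 24 * 60 * 60  # only tweets from the last 24 hours
for status in results:
    if status.favorite_count >= 50 and status.created_at_in_seconds >= cutoff:
        print(status.favorite_count, status.text)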
This is a small project I'd like to get started on in the near future. It's still in the planning stage, so this post is more about being steered in the right direction.
Essentially, I'd like to obtain tweets from a user and parse them into a table/database, with the aim of eventually running this program in real time.
My initial plan was to use Beautiful Soup, a Python parsing library; however, I believe the Twitter API is the better approach (advice on this subject would be appreciated).
There are still 3 unknowns:
Where do I store the tweets once obtained?
How to parse the tweets?
Where to store the parsed data?
To answer (3), I suppose it depends on what I want to do with the data. I still haven't decided how I'll use the parsed data, but I know that I'd like it put into categories, so my thinking is probably a database/table/Excel?
There are still a few questions to answer, and I'd like you guys to steer me in the right direction. My programming language knowledge is limited to just C for now, but as this project means a great deal to me, I'm willing to put in the effort and learn the necessary languages/APIs.
What languages/APIs will I need to gain an understanding of to accomplish this project? From where I stand, it seems to be Twitter API and Python.
EDIT: So I have a basic script going which obtains a user's tweets. It works better than expected. However, I'd like to take it another step: I'd like to obtain a user's tweets only if they contain a hashtag, and ignore all other tweets. How best to do this?
Here is a snippet of the basic code I have going:
import tweepy
import twitter_credentials

# Authenticate with the keys stored in twitter_credentials.py
auth = tweepy.OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# Fetch the 10 most recent tweets (excluding retweets) from a user's timeline
stuff = api.user_timeline(screen_name='XXXXXXXXXX', count=10, include_rts=False)
for status in stuff:
    print(status.text)
Scraping Twitter (or any other social network) with, for example, Beautiful Soup, as you said, is not a good idea, for 2 reasons:
if the source pages change (name attributes, div ids...), you have to keep your code up to date
your script can get banned, because scraping is not allowed by the terms of service.
To answer your questions:
1) You can store the tweets wherever you want: CSV, MySQL, SQLite, Redis, Neo4j...
2) With the official API, you get JSON. Here is the Tweet object: https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object.html . With tweepy, for example, status.text will give you the text of the tweet.
3) Same as #1. If you don't know yet what you will do with the data, store the full JSON objects; you will be able to parse them later.
I suggest tweepy/Python (http://www.tweepy.org/) or twit/Node.js (https://www.npmjs.com/package/twit). And read the official docs: https://developer.twitter.com/en/docs/api-reference-index
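For the hashtag filtering in your edit, here is a minimal sketch building on your own snippet (tweepy exposes the tweet's entities, and status._json holds the full JSON, which here gets stashed in SQLite as suggested in #3):

import json
import sqlite3
import tweepy
import twitter_credentials

auth = tweepy.OAuthHandler(twitter_credentials.CONSUMER_KEY, twitter_credentials.CONSUMER_SECRET)
auth.set_access_token(twitter_credentials.ACCESS_TOKEN, twitter_credentials.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

db = sqlite3.connect('tweets.db')
db.execute('CREATE TABLE IF NOT EXISTS tweets (id INTEGER PRIMARY KEY, raw TEXT)')

for status in api.user_timeline(screen_name='XXXXXXXXXX', count=10, include_rts=False):
    # Keep only tweets that contain at least one hashtag
    if status.entities.get('hashtags'):
        db.execute('INSERT OR REPLACE INTO tweets VALUES (?, ?)',
                   (status.id, json.dumps(status._json)))
db.commit()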
I'm trying to visit, say, a set of 8,000 LinkedIn profiles that belong to people who have a certain first name (just for example, let's say "Larry"), and then I would like to extract the kinds of jobs each user has held in the past. Is there an efficient way to do this? I would need each Larry to be picked independently of the others; basically, traversing someone's network isn't a good way to do this. Is there a way to completely randomize how the Larrys are picked?
I don't even know where to start. Thanks.
To Start:
Trying to crawl the response LinkedIn gives your browser would be almost suicidal.
Check their APIs (particularly the People Search API) and their code samples.
Important disclaimer found in the People Search API docs:
People Search API is a part of our Vetted API Access Program. You must
apply here and get LinkedIn's approval before using this API.
MAYBE with that in mind you'll be able to write a script that queries and parses those APIs. For instance, retrieving users with Larry as a first name: http://api.linkedin.com/v1/people-search?first-name=Larry
Once you get approved by LinkedIn, have retrieved some data from their APIs, and have tried some JSON or XML parsing (whatever the APIs return), you will have something more specific to ask.
If you still want to crawl the HTML returned by LinkedIn when you hit https://www.linkedin.com/pub/dir/?first=Larry&last=&search=Search, take a look at Beautiful Soup.
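If you go down that road anyway, here is a minimal sketch of the idea (the link filter below is a guess; inspect the real markup and adjust, since LinkedIn changes it often and actively blocks scrapers):

import requests
from bs4 import BeautifulSoup

# Public directory search for a first name (URL from above)
url = 'https://www.linkedin.com/pub/dir/?first=Larry&last=&search=Search'
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'html.parser')

# Hypothetical filter: collect anything that looks like a profile link
for a in soup.find_all('a', href=True):
    if '/pub/' in a['href'] or '/in/' in a['href']:
        print(a['href'])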
I've been looking into scraping, and I can't manage to scrape Twitter searches that date a long way back using Python code. I can do it with a Chrome add-on, but that falls short since it only lets me obtain a very limited number of tweets. Can anybody point me in the right direction?
I found a related answer here:
https://stackoverflow.com/a/6082633/1449045
you can get up to 1,500 tweets for a search, and 3,200 for a particular user's timeline
source: https://dev.twitter.com/docs/things-every-developer-should-know
see "There are pagination limits"
Here you can find a list of libraries that simplify use of the APIs, in different languages, including Python:
https://dev.twitter.com/docs/twitter-libraries
You can use snscrape. It doesn't require a Twitter developer account, and it can go back many years.
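A minimal sketch (the query and date range are placeholders; attribute names such as tweet.content have shifted between snscrape versions, so check the version you install):

import snscrape.modules.twitter as sntwitter

# Search operators let you reach far beyond the official API's pagination limits
query = 'python since:2015-01-01 until:2015-12-31'

for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
    if i >= 100:  # stop after 100 tweets for this demo
        break
    print(tweet.date, tweet.content)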
I am working on a website for which it would be useful to know the number of links shared by a particular Facebook page (e.g., http://www.facebook.com/cocacola) so that the user can know whether they are 'liking' a firehose of information or a dribble of goodness. What is the best way to get the number of links/status updates shared by a particular page?
+1 for implementations that use Python (this is a Django website), but any solutions are welcome! I tried using fbconsole to accomplish this, but I came up a little short.
For what it's worth, this unanswered question seems relevant. So does the fact that, as of 2012-04-18, you can export your data to CSV from the Insights management page on the Facebook site. The information is in there; I just don't know how to get it out...
Thanks for your help!
In the event that anyone else finds this useful, I thought I'd post my gist example here. fbconsole makes it fairly simple to extract data through the Facebook Graph API.
The caveat is that it was not terribly easy to programmatically extract data through fbconsole, so I wrote fbconsole.automatically_authenticate to make it much easier to access this information in a systematic way. This addition has not yet been incorporated into the master branch of fbconsole (it was just posted this morning), but it is available here in the meantime for those who are interested.
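For flavor, here is a minimal sketch of the kind of Graph API call involved, using fbconsole's stock interactive login rather than my helper (the page name is from the question, and the 'type' field and 'read_stream' permission follow the Graph API of that era):

import fbconsole

fbconsole.AUTH_SCOPE = ['read_stream']  # era-specific permission to read a feed
fbconsole.authenticate()                # opens a browser window for login

# Pull one page of posts from the page's feed and tally them by type
posts = fbconsole.get('/cocacola/posts')
counts = {}
for post in posts['data']:
    kind = post.get('type', 'unknown')  # 'link', 'status', 'photo', ...
    counts[kind] = counts.get(kind, 0) + 1
print(counts)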