I have a list of a few thousand Twitter IDs and I would like to check who follows whom in this network.
I used Tweepy to get the accounts using something like:
ids = {}
for i in list_of_accounts:
    ids[i] = []
    for page in tweepy.Cursor(api.followers_ids, screen_name=i).pages():
        ids[i].extend(page)  # accumulate every page instead of overwriting the previous one
        time.sleep(60)
The values in the dictionary ids form the network I would like to analyze. If I try to get the complete list of followers for each id (to compare to the list of users in the network) I run into two problems.
The first is that I may not have permission to see the user's followers - that's okay and I can skip those - but they stop my program. This is the case with the following code:
connections = {}
for x in user_ids:
    l = []
    for page in tweepy.Cursor(api.followers_ids, user_id=x).pages():
        l.append(page)
    connections[x] = l
The second is that I have no way of telling when my program will need to sleep to avoid the rate-limit. If I put a 60 second wait after every page in this query - my program would take too long to run.
I tried to find a simple 'exists_friendship' command that might get around these issues in a simpler way - but I only find things that became obsolete with the change to API 1.1. I am open to using other packages for Python. Thanks.
if api.exists_friendship(userid_a, userid_b):
    print("a follows b")
else:
    print("a doesn't follow b, check separately if b follows a")
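Once follower IDs have been collected, the who-follows-whom question inside the network reduces to set intersection, with no per-pair friendship call. A minimal sketch, assuming a hypothetical `followers` dict that maps each user ID in the network to the IDs of its followers (the function name and sample data are illustrative and not part of Tweepy):

```python
def edges_within_network(followers):
    """Return (follower, followed) pairs where both users are in the network.

    `followers` maps a user ID to an iterable of the IDs that follow it.
    Users whose follower list could not be fetched can map to an empty list.
    """
    network = set(followers)
    edges = set()
    for followed, follower_ids in followers.items():
        # Keep only the followers that are themselves members of the network.
        for follower in network.intersection(follower_ids):
            edges.add((follower, followed))
    return edges


# Hypothetical sample data: 1 is followed by 2 and 99, 2 is followed by 1.
sample = {1: [2, 99], 2: [1], 3: []}
edges = edges_within_network(sample)
```

Here user 99 is dropped because it is not a key of the dict, so only edges between known accounts survive.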
I am using Python and Selenium to scrape some data out of a website. This website has the following structure:
The first group item has the base ID frmGroupList_Label_GroupName, and you add _2 or _3 to the end of this base ID to get the 2nd/3rd group's ID.
The same goes for the user item: it has the base ID frmGroupContacts_TextLabel3, and you add _2 or _3 to the end of this base ID to get the 2nd/3rd user's ID.
What I am trying to do is get all the users out of each group. This is how I did it: find the first group, select it and grab all of its users, then go back to the 2nd group, grab its users, and so on.
def grab_contact(number_of_members):
    groupContact = 'frmGroupContacts_TextLabel3'
    contact = browser.find_element_by_id(groupContact).text
    print(contact)
    i = 2
    time.sleep(1)
    # write_to_excel(contact, group)
    while i <= number_of_members:
        group_contact_string = groupContact + '_' + str(i)
        print(group_contact_string)
        try:
            contact = browser.find_element_by_id(group_contact_string).text
            print(contact)
            i = i + 1
            time.sleep(1)
            # write_to_excel(contact, group)
        except NoSuchElementException:
            break
    time.sleep(3)
Same code applies for scraping the groups. And it works, up to a point!! Although the IDs of the groups are different, the IDs of the users are the same from one group to another. Example:
group_id_1 = user_id_1, user_id_2
group_id_2 = user_id_1, user_id_2, user_id_3, user_id_4, user_id_5
group_id_3 = user_id_1, user_id_2, user_id_3
The code runs: it goes to group_id_1 and grabs user_id_1 and user_id_2 correctly. But when it gets to group_id_2, user_id_1 and user_id_2 (whose content differs from the first group's) come back EMPTY, and only user_id_3, user_id_4 and user_id_5 are correct. Then, when it gets to group_id_3, all of the users are empty.
This has to do with the users having the same IDs. Once the scraper has reached a certain user ID in one group, it cannot retrieve any of the users up to that ID in another group. I tried quitting the browser and opening a new one (it doesn't work, the new browser doesn't open), refreshing the page (doesn't work), and opening a new tab (doesn't work).
I think the content of the IDs gets stuck in memory when it is accessed and is not freed when accessing a new group. Any ideas on how to get past this?
Thanks!
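The ID naming scheme described in the question (the base ID for the first item, then `_2`, `_3`, ... for later ones) can be isolated in a small helper so the scraping loop only deals with ready-made IDs. A minimal sketch; `element_ids` is a hypothetical name and does not touch the browser:

```python
def element_ids(base_id, count):
    """Yield the element IDs for `count` items that share `base_id`.

    The first item uses the base ID as-is; later items append _2, _3, ...
    """
    for index in range(1, count + 1):
        yield base_id if index == 1 else "{}_{}".format(base_id, index)


contact_ids = list(element_ids("frmGroupContacts_TextLabel3", 3))
```

Each generated ID could then be passed to the same element lookup the question already uses.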
As the saying goes... it ain't stupid, if it works.
def refresh():
    # leave the site, then come back to the groups page
    url = "https://google.com"
    browser.get(url)
    time.sleep(5)
    url = "https://my_url.com"
    browser.get(url)
    time.sleep(5)
While trying to debug this and find a solution, I thought: "what if you go to another website, then come back to yours, between scraping groups?"... and it works! Until I find another solution, I'll stick with this one.
Please help!
First of all, I will explain what this should do. I'm trying to store Discord users and servers (the users that use the bot and the servers the bot is in) in a list, keyed by their ID.
for example:
class User(object):
    name = ""
    uid = 0
Discord IDs are very long, and I want to store many users and servers (one list for each). Suppose I have 10,000 users in my list and I want the last one (without knowing it is the last one): a linear search would take a lot of time. Instead, I thought I could build a directory-like system for storing users in the list and finding them quickly. This is how it works:
I can get the id easily so imagine my id is 12345.
Now I convert it into a string using Python's str(id) function and store it in a variable, strId.
Each digit of that string is used as an index into the nested users list, like this:
The User() is where the user is stored
users_list = [[[], [[], [], [[], [], [], [User()]]]]]  # schematic nesting only
actual_dir = users_list
for digit in strId:
    actual_dir = actual_dir[int(digit)]  # descend one level per digit
user = actual_dir[0]
And that's how I reach the user (or something like that)
Now, here is my problem. I can reach the user easily by following the ID, but when I want to save the changes I would have to do something like users_list[1][2][3][4][5] = changed_user_variable, and as far as I know I cannot build up an index chain step by step with something like list[1] += [2].
Is there any way to reach the user and save the changes?
Thanks in advance
You can use a python dictionary with the user id as the key and the user object as the value. I ran a test on my own computer and found that finding 100 000 random users in a dictionary with 10 million users only took 0.3s. This method is much simpler and I would guess it's just as fast, if not faster.
You can create a dictionary and add users with:
users = {}
users[userID] = some_user
(many other ways of doing this)
By using a dictionary you can easily change a user's field with:
users[userID].some_field = "Some value"
or overwrite the entry the same way you added the user in the first place.
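To make the dictionary approach concrete, here is a minimal sketch of storing, looking up and updating users by ID. The `save_user` and `rename_user` helpers and the sample ID are hypothetical names chosen for illustration:

```python
class User(object):
    def __init__(self, uid, name=""):
        self.uid = uid
        self.name = name


users = {}  # maps user ID -> User


def save_user(user):
    """Add a user, or overwrite the existing entry with the same ID."""
    users[user.uid] = user


def rename_user(uid, new_name):
    """Change a field in place; the dict still points at the same object."""
    users[uid].name = new_name


save_user(User(12345, "alice"))
rename_user(12345, "alice_renamed")
found = users.get(12345)    # the stored User
missing = users.get(99999)  # None, because the ID is unknown
```

Because the dictionary holds a reference to the object, changing a field needs no separate "save" step, which is the part the nested-list design made awkward.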
On the Soundcloud API guide (https://developers.soundcloud.com/docs/api/guide#pagination)
the example given for reading more than 100 pieces of data is as follows:
# get first 100 tracks
tracks = client.get('/tracks', order='created_at', limit=page_size)
for track in tracks:
    print track.title
# start paging through results, 100 at a time
tracks = client.get('/tracks', order='created_at', limit=page_size,
                    linked_partitioning=1)
for track in tracks:
    print track.title
I'm pretty certain this is wrong as I found that 'tracks.collection' needs referencing rather than just 'tracks'. Based on the GitHub python soundcloud API wiki it should look more like this:
tracks = client.get('/tracks', order='created_at', limit=10, linked_partitioning=1)
while tracks.collection != None:
    for track in tracks.collection:
        print(track.playback_count)
    tracks = tracks.GetNextPartition()
Here I have removed one level of indent from the last line (I think there is an error on the wiki: there it sits inside the for loop, which makes no sense to me). This works for the first page. However, it doesn't work for successive pages because the "GetNextPartition()" function is not found. I've tried the last line as:
tracks = tracks.collection.GetNextPartition()
...but no success.
Maybe I'm getting versions mixed up? But I'm trying to run this with Python 3.4 after downloading the version from here: https://github.com/soundcloud/soundcloud-python
Any help much appreciated!
For anyone that cares, I found this solution on the SoundCloud developer forum. It is slightly modified from the original case (searching for tracks) to list my own followers. The trick is to call the client.get function repeatedly, passing the previously returned "users.next_href" as the request that points to the next page of results. Hooray!
pgsize = 200
c = 1
me = client.get('/me')
# first call to get a page of followers
users = client.get('/users/%d/followers' % me.id, limit=pgsize, order='id',
                   linked_partitioning=1)
for user in users.collection:
    print(c, user.username)
    c = c + 1
# linked_partitioning means .next_href exists
while users.next_href != None:
    # pass the contents of users.next_href that contains 'cursor=' to
    # locate next page of results
    users = client.get(users.next_href, limit=pgsize, order='id',
                       linked_partitioning=1)
    for user in users.collection:
        print(c, user.username)
        c = c + 1
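The same follow-the-`next_href` pattern can be written once as a generator, so the per-item code is not duplicated for the first page and the later pages. This is a sketch that assumes each page is a dict with a "collection" list and an optional "next_href" key; the `fetch` callable and the in-memory pages are hypothetical stand-ins for real API calls:

```python
def iter_collection(fetch, first_url):
    """Yield every item across pages by following `next_href` links.

    `fetch` is any callable that takes a URL and returns a dict with a
    "collection" list and an optional "next_href" key.
    """
    url = first_url
    while url is not None:
        page = fetch(url)
        for item in page["collection"]:
            yield item
        url = page.get("next_href")  # None on the last page ends the loop


# Hypothetical in-memory pages standing in for real API responses.
FAKE_PAGES = {
    "/followers": {"collection": ["ann", "bob"],
                   "next_href": "/followers?cursor=2"},
    "/followers?cursor=2": {"collection": ["cy"]},
}

usernames = list(iter_collection(FAKE_PAGES.__getitem__, "/followers"))
```

With a real client, `fetch` would wrap whatever call returns the next page for a given URL.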
I am dealing with the Box.com API using python and am having some trouble automating a step in the authentication process.
I am able to supply my API key and client secret key to Box. Once Box.com accepts my login credentials, they supply me with an HTTP GET parameter like
'http://www.myapp.com/finish_box?code=my_code&'
I want to be able to read and store my_code using python. Any ideas? I am new to python and dealing with APIs.
This is actually a more robust question than it seems, as it exposes some useful functions with web dev in general. You're basically asking how to separate my_code in the string 'http://www.myapp.com/finish_box?code=my_code&'.
Let's take it in bits and pieces. First of all, you only really need the part after the question mark. You don't need to know which website it came from (though that is worth keeping in case we need it later); you just need to know which arguments are being passed back. Let's start with str.split():
>>> return_string = 'http://www.myapp.com/finish_box?code=my_code&'
>>> step1 = return_string.split('?')
>>> step1
['http://www.myapp.com/finish_box', 'code=my_code&']
This will return a list to step1 containing two elements, "http://www.myapp.com/finish_box" and "code=my_code&". Well hell, we're there! Let's split the second one again on the equals sign!
>>> step2 = step1[1].split("=")
>>> step2
['code', 'my_code&']
Well lookie there, we're almost done! However, this doesn't really allow any more robust uses of it. What if instead we're given:
>>> return_string = r'http://www.myapp.com/finish_box?code=my_code&junk_data=ohyestheresverymuch&my_birthday=nottoday&stackoverflow=usefulplaceforinfo'
Suddenly our plan doesn't work. Let's instead break that second set on the & sign, since that's what's separating the key:value pairs.
>>> step1 = return_string.split('?')
>>> step2 = step1[1].split("&")
>>> step2
['code=my_code', 'junk_data=ohyestheresverymuch', 'my_birthday=nottoday', 'stackoverflow=usefulplaceforinfo']
Now we're getting somewhere. Let's save those as a dict, shall we?
>>> list_those_args = {}
>>> for each_item in step2:
...     key, value = each_item.split("=", 1)
...     list_those_args[key] = value
...
Now we've got a dictionary in list_those_args that contains key and value for every argument the GET passed back to you! Science!
So how do you access it now?
>>> list_those_args['code']
'my_code'
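The manual splitting above shows what is going on; the standard library can do the same parsing in one step. A minimal sketch using `urllib.parse` from Python 3 (`get_query_param` is a hypothetical helper name):

```python
from urllib.parse import urlparse, parse_qs


def get_query_param(url, name):
    """Return the first value of query parameter `name`, or None if absent."""
    query = urlparse(url).query          # e.g. "code=my_code&"
    values = parse_qs(query).get(name)   # parse_qs maps each key to a list
    return values[0] if values else None


redirect_url = "http://www.myapp.com/finish_box?code=my_code&"
code = get_query_param(redirect_url, "code")
```

`parse_qs` also handles the trailing `&` and percent-encoded values, which the hand-rolled version does not.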
You need a web server and a CGI script to do this. I have set up a single Python script that handles it. You can see my code at:
https://github.com/jkitchin/box-course/blob/master/box_course/cgi-bin/box-course-authenticate
When you access the script, it redirects you to box for authentication. After authentication, if "code" is in the incoming request, the code is grabbed and redirected to the site where tokens are granted.
You have to set up a .htaccess file to store your secret key and ID.
I am trying to loop through some 50-odd files in a directory. Each file has some text for which I am trying to find the keywords using Yahoo Term Extractor. I am able to extract text from each file, but I am not able to iteratively call the API using the text as input. Only the keywords for the first file are displayed.
Here is my code snippet:
In the 'comments' list, I have extracted and stored the text from each file.
for c in comments:
    print "building query"
    dataDict = [ ('appid', appid), ('context', c)]
    queryData = urllib.urlencode(dataDict)
    request.add_data(queryData)
    print "fetching result"
    result = OPENER.open(request).read()
    print result
    time.sleep(1)
Well I don't know anything about the Yahoo Term Extractor, but I'd presume that your call request.add_data(queryData) simply tacks on another data set with each iteration of your loop. And then the call to OPENER.open(request).read() would probably only process the results of the first data set. So either your request object can only hold one query, or your OPENER object's inner workings can only process one query, it's as simple as that.
Actually a third reason comes to mind now that I read the documentation provided at your link, and this is probably the true one:
RATE LIMITS
The Term Extraction service is limited to 5,000 queries per IP address per day and to noncommercial use. See information on rate limiting.
So it would make sense that the API would limit your usage to one query at a time, and not allow you to flood a bunch of queries in a single request.
In any event, I'd assume you could fix your problem in a "naive" way by having many request variables instead of just one, or maybe just creating a new request with every iteration of your loop. If you're not worried about storing your results, and just trying to debug, you could try:
for c in comments:
    print "building query"
    dataDict = [ ('appid', appid), ('context', c)]
    queryData = urllib.urlencode(dataDict)
    request = urllib2.Request() # I don't know how to initialize this variable, do it yourself
    request.add_data(queryData)
    print "fetching result"
    result = OPENER.open(request).read()
    print result
    time.sleep(1)
Again, I don't know about the Yahoo Term Extractor (nor do I really have time to research it) so there may very well be a better, more native way to do this. If you post more details of your code (i.e. what classes are the request and OPENER objects coming from) then I might be able to elaborate on this.
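The "new request per iteration" idea can be sketched with Python 3's standard library, where `urllib.request.Request` takes the URL and the encoded body directly. The endpoint URL, app ID and `build_requests` helper below are placeholders rather than the real service details, and nothing is sent over the network:

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_requests(endpoint, appid, texts):
    """Build one independent POST request per text, never reusing an object."""
    requests = []
    for text in texts:
        body = urlencode([("appid", appid), ("context", text)]).encode("utf-8")
        # A fresh Request each time, so no data carries over between texts.
        requests.append(Request(endpoint, data=body))
    return requests


# Placeholder endpoint and app ID, for illustration only.
reqs = build_requests("http://example.com/term-extraction", "my-app-id",
                      ["first text", "second text"])
```

Each request could then be opened in turn, with a pause between calls to stay under the rate limit.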