urllib & requests "sometimes" fail to get the final URL - Python

To give an overview of the problem: I have a list of Twitter users' screen_names and I want to verify whether they are suspended accounts or not. I don't want to use the Twitter search API, to avoid the rate-limit problem (the list is quite big). Instead, I am trying to use a cluster of computers to label my dataset (whether each account in my database is suspended or not).
If an account is suspended by Twitter and you try to access it through the link https://twitter.com/screen_name, you get redirected to https://twitter.com/account/suspended.
I tried to capture this behaviour in Python 2.7 with urllib, using the geturl() method. It works, but it is not reliable: I don't get the same result for the same link. I tested it on the same account, and sometimes it returns https://twitter.com/account/suspended while at other times it returns https://twitter.com/screen_name.
The same problem occurs with requests.
My code:
import urllib

import requests
from lxml import html

screen_name = 'IaMaGuyGetIt'
account_url = "https://twitter.com/" + screen_name

url = requests.get(account_url)
print url.url

req = urllib.urlopen(url.url).read()
page = html.fromstring(req)
for heading in page.xpath("//h1"):
    if heading.text == 'Account suspended':
        print True

The Twitter server only serves you the 302 redirect once; after that it assumes your browser has cached the redirect.
The body of the page does contain a pointer, though, so even if you were not redirected you can see that the link is still there:
>>> r = requests.get(account_url)
>>> r.url
u'https://twitter.com/IaMaGuyGetIt'
>>> r.text
u'<html><body>You are being redirected.</body></html>'
Look for that exact text.
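For completeness, a minimal sketch that combines both signals, the final URL after redirects and the fallback redirect stub in the body (the helper name is mine, not from the original code):
import requests

def is_suspended(screen_name):
    # requests follows redirects by default, so r.url is the final URL
    r = requests.get("https://twitter.com/" + screen_name)
    if r.url == "https://twitter.com/account/suspended":
        return True
    # Fall back to the body text in case the server skips the redirect
    return "You are being redirected." in r.text

print is_suspended('IaMaGuyGetIt')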

Related

Printing Cookie domain in RequestsCookieJar using Python 3

I'm working with the Python requests module and have run into an issue I can't seem to resolve. I'm relatively new to requests, but even with googling and the documentation I have hit a wall.
I am trying to get a link (or "domain") from a specific cookie in the response to a GET request I've made. I can only seem to print the cookie itself, not the domain. Explanation below:
Code (See comments):
import requests
import time

r = requests.post('https://example.com/AddToCartURL')
print("----------------------------------------")
cookies = r.cookies
print(r.cookies)  # prints all cookies
time.sleep(3)

# Code below will now use cookies and do a GET request to checkout URL
r = requests.get("https://example.com/checkout", cookies=cookies)
print("HEADERS")
time.sleep(1)
print(r.headers)
print("---------")
print("Cookie")
print(r.cookies)
print("---------")
# THIS IS WHERE MY ISSUE IS:
print(r.cookies['checkout'])            # this prints the cookie's value
print(r.cookies['checkout']['domain'])  # this raises the TypeError below
Outputs & issues:
#1 - The CookieJar cookie is shown like this when I print r.cookies:
<Cookie checkout=r3jk43nb42knj32--fjnk3jk2jkn2323njk for www.example.com/checkout/url>
And when I print(r.cookies['checkout']) I get the cookie's value, obviously:
r3jk43nb42knj32--fjnk3jk2jkn2323njk
Well what I'm trying to do here, essentially, is get the domain from it, which I try to do as:
print(r.cookies['checkout']['domain'])
getting the response:
Traceback (most recent call last):
File "/Users/user/PycharmProjects/project/Main.py", line 29, in <module>
print(r.cookies['checkout']['domain'])
TypeError: string indices must be integers
The TypeError is something you'll find from a quick Google search. Documentation-wise I wasn't able to find a clear answer, probably because I'm still bad at searching. I tried the obvious fix of using an integer, presuming the error refers to indexing; however, that just prints a single character of the cookie itself.
What I'm trying to print, specifically, is example.com/checkout/url from the cookie above. I want to interact with it so my code can continue through the checkout process.
This is an example of the full cookie jar: (called by print(r.cookies))
<RequestsCookieJar[<Cookie random_other_cookie= 3j32fj302fj023jfi for example.com/checkoutURL>, <Cookie checkout=r3jk43nb42knj32--fjnk3jk2jkn2323njk for example.com/checkoutURL>]>
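Incidentally, a RequestsCookieJar is iterable, and each element is a Cookie object whose metadata lives in attributes rather than dictionary keys; a minimal sketch of reading the domain and path that way:
# Indexing the jar by name returns only the value string; iterating
# yields Cookie objects that carry the metadata as attributes.
for cookie in r.cookies:
    if cookie.name == 'checkout':
        print(cookie.domain)  # e.g. example.com
        print(cookie.path)    # e.g. /checkoutURL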
--------------------------------
#2 - Am I tackling this the wrong way?
A little more background information: the GET request above gives a response of 301 Moved Permanently. I am 99% sure that the URL I end up at (at least front-end wise) is the URL I need/the same URL as in the cookie above.
My question is: should I not be trying to grab the domain from the cookie, and instead somehow grab the redirection URL?
(aka the URL that the request ends up at, not the original URL https://example.com/checkout)
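If the redirect target is what's actually needed, requests exposes it directly; a short sketch (requests follows the 301 automatically for a GET):
r = requests.get("https://example.com/checkout", cookies=cookies)
print(r.history)  # the intermediate responses, e.g. [<Response [301]>]
print(r.url)      # the final URL the request ended up at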
------------------------------------
I hope I outlined my issues well enough. This is my first post on StackOverflow after months of lurking around for answers.
Thank you.

Imgur API - How do I retrieve all favorites without pagination?

According to the Imgur Docs, the "GET Account Favorites" API call takes optional arguments for pagination, implying that all objects are returned when those are omitted.
However, when I use the following code snippet (the application has been registered and OAuth has already been performed against my account for testing), I get only the first 30 JSON objects. In the snippet below, I already have an access_token for an authorized user and can retrieve data for that username, but the length of the returned list is always 30 items.
username = token['username']
bearer_headers = {
    'Authorization': 'Bearer ' + token['access_token']
}
fav_url = 'https://api.imgur.com/3/account/' + username + '/favorites'
r = requests.get(fav_url, headers=bearer_headers)
r_json = r.json()
favorites = r_json['data']
print(len(favorites))  # always 30
print(favorites)
The requests response returns a dictionary with three keys: status (the HTTP status code), success (true or false), and data, whose value is a list of dictionaries (one per favorited item).
I'm trying to retrieve this without pagination so I can extract specific metadata values (id, post date, etc.) into a Pandas dataframe.
I originally thought this was a Pandas display problem in the Jupyter notebook, but tracked it back to the API only returning the newest 30 list items, despite the docs indicating otherwise. If I place an arbitrary page number at the end (e.g., "/favorites/1"), it returns the 30 items appropriate to that page, but there doesn't seem to be an option to get all items, or to retrieve a count of the total items or number of pages in advance.
What am I missing?
Postscript: It appears that none of the URIs work without pagination (e.g., get account images, get gallery submissions): anything with an optional "/{{page}}" parameter defaults to the first page if none is specified. So I guess the larger question is: does the Imgur API even support non-paginated data, and how is that accessed?
Paginated data is usually used when the size of a response can be arbitrarily large. I would be surprised if a major service like Imgur had an API that didn't work this way.
As you have found, the page attribute may be optional, and if you don't provide it, you get the first page as your response.
If you want to get more than the first page, you will need to loop over the page number:
data = []
page = 0
while block := connection.get(page=page):
    data.append(block)
    page += 1
This assumes Python3.8+ due to the := assignment expression. If you are on an older version you'll need to set block in the loop body, but the same idea applies.
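Applied to the endpoint from the question, that loop might look like the sketch below; it assumes the username and bearer_headers from the snippet above, and relies on the observed behaviour that "/favorites/{page}" returns one 30-item page (an empty data list is taken to mean the end):
import requests

favorites = []
page = 0
while True:
    # whether pages start at 0 or 1 is worth verifying against the docs
    url = 'https://api.imgur.com/3/account/%s/favorites/%d' % (username, page)
    batch = requests.get(url, headers=bearer_headers).json()['data']
    if not batch:  # an empty page means there are no more items
        break
    favorites.extend(batch)
    page += 1
print(len(favorites))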

Why do I get this error - "missing token"? (playing with an API using Python)

So, I am playing around with Etilbudsavis' API (a Danish directory containing offers from retail stores). My goal is to retrieve data based on a search query. The API actually allows this out of the box. However, when I try to do so, I end up with an error saying that my token is missing. Anyway, here is my code:
import requests

body = {'api_key': 'secret_api_key'}
response = requests.post('https://api.etilbudsavis.dk/v2/sessions', data=body)
print response.text

new_body = {'_token': 'token_obtained_from_POST_method', 'query': 'coca-cola'}
new_response = requests.get('https://api.etilbudsavis.dk/v2/offers/search', data=new_body)
print new_response.text
Full error:
{"code":1107,"id":"00ilpgq7etum2ksrh4nr6y1jlu5ng8cj","message":"missing token","
details":"Missing token\nNo token found in request to an endpoint that requires
a valid token.","previous":null,"#note.1":"Hey! It looks like you found an error
. If you have any questions about this error, feel free to contact support with
the above error id."}
Since this is a GET request, you should use the params argument to pass the data in the URL.
new_response = requests.get('https://api.etilbudsavis.dk/v2/offers/search', params=new_body)
See the requests docs.
I managed to solve the problem with the help of Daniel Roseman, who reminded me that playing with an API in the Python shell is different from interacting with the API in the browser. The docs clearly state that you have to sign the API token in order for it to work. I missed that tiny detail ... Nevertheless, Daniel helped me figure everything out. Thanks again, Dan.
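For reference, a sketch of the whole flow with params; the 'token' key in the session response JSON is an assumption about the API's shape, and the token-signing step the docs require is omitted here:
import requests

# Step 1: create a session to obtain a token ('token' key is assumed)
body = {'api_key': 'secret_api_key'}
session_resp = requests.post('https://api.etilbudsavis.dk/v2/sessions', data=body)
token = session_resp.json()['token']  # hypothetical key; check the real response

# Step 2: a GET request carries its data in the query string, so use params=
search = requests.get(
    'https://api.etilbudsavis.dk/v2/offers/search',
    params={'_token': token, 'query': 'coca-cola'},
)
print(search.text)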

Python - Facebook fb_dtsg

On Facebook I want to find fb_dtsg to make a status:
import urllib, urllib2, cookielib

jar = cookielib.CookieJar()
cookie = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie)

data = urllib.urlencode({'email': "email", 'pass': "password", "Log+In": "Log+In"})
req = urllib2.Request('http://www.facebook.com/login.php')
opener.open(req, data)
opener.open(req, data)  # needs to be opened twice to log on

req2 = urllib2.Request("http://www.facebook.com/")
page = opener.open(req2).read()  # read the response body as a string
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 33]  # this just finds the value of "fb_dtsg"
Yes, this does find a value that looks like fb_dtsg should, but the value changes every time I open the page again, and when I use it to make a status it does not work. When I record what happens in Google Chrome while making a status normally, I get a working fb_dtsg value that does not change (for a long session) and that works when I use it to make a status. Please, please show me how I can fix this without using the API.
The search slice used to extract fb_dtsg truncates the last character, so change 33 to 34:
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 34]
Anyway, a more robust way to find fb_dtsg is with re:
re.findall('fb_dtsg.+?value="([^"]+)"', page)
As I answered in one of your earlier posts, the form may also require other hidden variables.
If this still doesn't work, can you provide the code where you make the post, including all of the POST form data?
BTW, sorry for not looking at all your previous posts with the same content :P

Sending the variable's content to my mailbox in Python?

I have asked this question here about a Python command that fetches a web page and stores it in a variable. The first thing that I wanted to know then was whether the variable in this code contains the HTML code of a web page:
from google.appengine.api import urlfetch

url = "http://www.google.com/"
result = urlfetch.fetch(url)
if result.status_code == 200:
    doSomethingWithResult(result.content)
The answer that I received was "yes", i.e. the variable "result" in the code did contain the HTML code of a web page, and the programmer who answered said that I needed to "check the Content-Type header and verify that it's either text/html or application/xhtml+xml". I've looked through several Python tutorials but couldn't find anything about headers. So my question is: where is this Content-Type header located, and how can I check it? And could I send the content of that variable directly to my mailbox?
Here is where I got this code. It's on Google App Engines.
If you look at the Google App Engine documentation for the response object, the result of urlfetch.fetch() contains the member headers, which holds the HTTP response headers as a mapping of names to values. So all you probably need to do is:
if result.headers['Content-Type'] in ('text/html', 'application/xhtml+xml'):
    # assuming you want to do something with the content
    doSomethingWithXHTML(result.content)
else:
    # use content for something else
    doTheOtherThing(result.content)
As for emailing the variable's contents, I suggest the Python email module.
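Alternatively, since this code already runs on App Engine, the platform's own mail API (a different route than the email module suggested above) can send the fetched content directly; a minimal sketch, with placeholder addresses (the sender must be an address your app is authorized to send from):
from google.appengine.api import mail, urlfetch

result = urlfetch.fetch("http://www.google.com/")
if result.status_code == 200:
    # Email the fetched page body to yourself; both addresses are placeholders
    mail.send_mail(
        sender="you@your-app-id.appspotmail.com",
        to="you@example.com",
        subject="Fetched page content",
        body=result.content,
    )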
For info on sending the Content-Type header, see: http://code.google.com/appengine/docs/python/urlfetch/overview.html#Request_Headers
