On Facebook I want to find fb_dtsg so that I can post a status:
import urllib, urllib2, cookielib
jar = cookielib.CookieJar()
cookie = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie)
data = urllib.urlencode({'email':"email",'pass':"password", "Log+In":"Log+In"})
req = urllib2.Request('http://www.facebook.com/login.php')
opener.open(req, data)
opener.open(req, data) #Needs to be opened twice to log on.
req2 = urllib2.Request("http://www.facebook.com/")
page = opener.open(req2).read() # read the response body as a string so .find() below works
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 33] #This just finds the value of "fb_dtsg".
Yes, this does find a value, and one that looks like an fb_dtsg value, but it changes every time I open the page again, and when I use it to post a status it does not work. When I record what happens in Google Chrome while making a status normally, I get a working fb_dtsg value that does not change (for a long session) and that does work when I use it to post a status. Please, please show me how I can fix this without using the API.
The slice used to find fb_dtsg truncates the last character, so change 33 to 34:
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 34]
In any case, a more robust way of finding fb_dtsg is to use re:
re.findall('fb_dtsg.+?value="([^"]+)"',page)
As I answered in one of your earlier posts, the request may also require other hidden form variables.
If this still doesn't work, can you provide the code where you make the POST, including all of the form data?
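In the meantime, here is a hedged sketch of the re approach, assuming page is the HTML string fetched above; the hidden-input pattern is only illustrative and may need adjusting to the actual markup:

import re

# page is the HTML string returned by opener.open(req2).read() above
fb_dtsg = re.findall('fb_dtsg.+?value="([^"]+)"', page)[0]

# The status POST may also need other hidden form fields; this pattern assumes
# the attributes appear in this exact order, so adjust it to the real markup.
hidden_fields = dict(re.findall('<input type="hidden" name="([^"]+)" value="([^"]*)"', page))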
By the way, sorry for not looking at all your previous posts with the same content. :P
EDIT:
In a similar vein, when I now try to log into their account with a POST request, what is returned is none of the errors they document on their site, but a "JSON exception". Is there any way to debug this, or is an HTTP 500 error completely impossible to deal with?
I'm well aware this question has been asked before. Sadly, when trying the proposed answers, none worked. I have an extremely simple Python project with urllib, and I've never done web programming in Python before, nor am I even a regular Python user. My friend needs to get access to content from this site, but their user-friendly front-end is down and I learned that they have a public API to access their content. Not knowing what I'm doing, but glad to try to help and interested in the challenge, I have very slowly set out.
Note that it is necessary for me to only use standard Python libraries, so that any finished project could easily be emailed to their computer and just work.
The following works completely fine without the "originalLanguage" query. The API documents that parameter as an array value, but no matter whether I comma-separate the values, write "originalLanguage[0]" or "originalLanguage0", or try anything else I've seen online, the server returns an error along the lines of "Array value expected but string detected".
Is there any way for me to get this working? Because it clearly can work, otherwise the API wouldn't document it. Many thanks.
In case it helps: when using "[]" or "<>" or "{}" or any delimiter I could think of, my IDE didn't recognise it as part of the URL.
import urllib.request as request
import urllib.parse as parse

def make_query(url, params):
    url += "?"
    for i in range(len(params)):
        url += list(params)[i]
        url += '='
        url += list(params.values())[i]
        if i < len(params) - 1:
            url += '&'
    return url

base = "https://api.mangadex.org/manga"
params = {
    "limit": "50",
    "originalLanguage": "en"
}

url = make_query(base, params)
req = request.Request(url)
response = request.urlopen(req)
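For what it's worth, a minimal sketch of one common way array parameters are encoded, using urllib.parse.urlencode with doseq=True so that each list element becomes its own key=value pair. The bracketed key name originalLanguage[] is an assumption about what this particular API expects:

import urllib.request as request
import urllib.parse as parse

base = "https://api.mangadex.org/manga"
params = {
    "limit": "50",
    "originalLanguage[]": ["en"],  # assumed array syntax: repeat key[]=value per element
}
# doseq=True expands list values into repeated key=value pairs
url = base + "?" + parse.urlencode(params, doseq=True)
response = request.urlopen(request.Request(url))
print(response.status)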
To give an overview of the problem: I have a list of Twitter users' screen_names and I want to verify whether they are suspended accounts or not. I don't want to use the Twitter search API, to avoid the rate-limit problem (the list is quite big). Therefore, I am trying to use a cluster of computers to label my dataset (whether an account in my database is suspended or not).
If an account is suspended by Twitter and you try to access it through the link https://twitter.com/screen_name, you get redirected to https://twitter.com/account/suspended.
I tried to capture this behaviour using Python 2.7 with urllib and the geturl() method. It works but is not reliable (I don't get the same results on the same link): testing the same account, sometimes it returns https://twitter.com/account/suspended and other times it returns https://twitter.com/screen_name.
The same problem occurs with requests.
My code:
import urllib
import requests
from lxml import html
screen_name = 'IaMaGuyGetIt'
account_url = "https://twitter.com/"+screen_name
url = requests.get(account_url)
print url.url
req = urllib.urlopen(url.url).read()
page = html.fromstring(req)
for heading in page.xpath("//h1"):
    if heading.text == 'Account suspended':
        print True
The twitter server only serves you the 302 redirect once; after that it'll assume your browser has cached the redirect.
The body of the page does contain a pointer though, so even if you were not redirected you can see that there is still the link there:
>>> r = requests.get(account_url)
>>> r.url
u'https://twitter.com/IaMaGuyGetIt'
>>> r.text
u'<html><body>You are being redirected.</body></html>'
Look for that exact text.
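For example, a minimal sketch of that check, assuming the redirect stub text shown above is what the server returns for a suspended account:

import requests

screen_name = 'IaMaGuyGetIt'
r = requests.get("https://twitter.com/" + screen_name)
# Either we actually landed on the suspension page, or the body still
# contains the redirect stub quoted above.
suspended = r.url.endswith('/account/suspended') or 'You are being redirected' in r.text
print suspended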
Based on some quick examples found on SO and other sources, I am trying to use Python urllib/urllib2 to submit a form in the following manner:
>>> import urllib, urllib2
>>> url = 'http://example.com'
>>> r_params = {'a':'test','b':'hooray'}
>>> e_params = urllib.urlencode(r_params)
>>> user_agent = 'some browser and such'
>>> headers = {'User-Agent': user_agent}
>>> req = urllib2.Request(url, e_params, headers)
>>> response = urllib2.urlopen(req)
>>> data = response.read()
I've gotten this to work, however, on the particular form I am looking for there are two buttons of type "submit". e.g.:
<b><input type="submit" name="ButtonA" value="SUBMIT"></b>
<b><input type="submit" name="ButtonB" value="LINK"></b>
I believe the problem I'm having results from the current code choosing the wrong one. How do I get a response by submitting ButtonB rather than ButtonA? Some of the stuff I've read seems to indicate that I could try using mechanize, but I was hoping to keep this simple without having to read up and learn mechanize. Is there an easy way to do this, or do I need to suck it up and actually take the time to learn and understand what I'm doing?
If that's the case, it should be fairly simple, but you should look into what exactly you're doing. Specifically, you're sending a POST request (urllib2.urlopen sends a POST automatically when the data argument is supplied) with the data that would normally be supplied by the form elements themselves. In the case of multiple "submit" inputs, the name and value of the activated submit input are sent along with the rest of the form data.
So that's all you have to do: include "ButtonB":"LINK" in the data.
A quick reference so you can see how HTML does all the stuff it does:
http://www.w3.org/TR/html401/interact/forms.html#submit-format
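For example, a hedged sketch based on the request code from the question, with the submit button's name/value added to the form data:

import urllib, urllib2

url = 'http://example.com'
# Include the name/value pair of the submit button you want to "click" (ButtonB).
r_params = {'a': 'test', 'b': 'hooray', 'ButtonB': 'LINK'}
req = urllib2.Request(url, urllib.urlencode(r_params),
                      {'User-Agent': 'some browser and such'})
data = urllib2.urlopen(req).read()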
I recommend using a tool like TamperData for Firefox to discover precisely how the site's POSTs are formed. Activate TamperData just before you're ready to click one of the buttons. When it's up, go ahead and click one. The POST will be recorded in TamperData. Find it and click on it.
Find the POSTDATA row below and double-click it. Select the "Decoded" radio button to remove the HTML escapes. Now you have a 1:1 reference you should copy when making your "r_params" dictionary. For instance, if the POSTDATA looked like this:
Name        | Value
--------------------
QueryString | test
Page        |
Search      | blah
then you will create your dictionary like this:
r_params = {'QueryString': 'test',
            'Page': '',
            'Search': 'blah'}
After you've found out what the POSTDATA looks like for each separate submit event, you'll know how to create the right dictionary to send along. Also, be sure to confirm you are POSTing to the correct URL. Good luck!
I am so frustrated trying to use Ruby to fetch the content of a specific URL.
I've tried many different ways, like open-uri and a standard request; none has worked so far, and I always get empty HTML. I also tried using Python to fetch the same URL, and it always returned the correct HTML content. I am really not sure why... Please help, as I am a newbie to both Ruby and Python. I want to use Ruby (I prefer the tidy syntax and human-friendly function names, and it's easier to install libs using gem and Homebrew (on Mac) than with Python's easy_install), but I am now considering Python because it just works (though I'm still trying to get my head around the 2.x vs 3.x issue). I may be doing something really stupid, but I think that is very unlikely.
ruby 1.9.2p136 (2010-12-25 revision 30365) [i386-darwin10.6.0]
Implementation 1:
url = URI.parse('http//:www.stackoverflow.com/') req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
puts res.body #empty
Implementation 2:
doc = Nokogiri::HTML(open("http//:www.stackoverflow.com/", "User-Agent" => "Safari"))
#empty
#I tried to use without user agent, without Nokogiri none worked.
The Python implementation, which worked perfectly every time:
f = urllib.urlopen("http//:www.stackoverflow.com/")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()
print s
If that is your exact code it is invalid for several reasons.
http//: should be http://
The URL needs a path: if you want the root page of example.com it needs to be http://example.com/ (the trailing slash is significant).
If you put two statements on one line you need to use ; to denote the end of the first one.
So:
require 'net/http'
url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
puts res.body
The same is true when using open with Nokogiri.
EDIT: that site is returning bad results much of the time:
counter = 0
20.times do
  url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
  req = Net::HTTP::Get.new(url.path)
  res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
  sleep 1
  counter += 1 unless res.body.empty?
end
puts counter
For me this returned a non-empty body only once. If you substitute another site, it works every time.
curl "http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia"
Yields the same inconsistent results.
Two examples with open-uri (standard lib), a wrapper for (among others) the rather cumbersome Net::HTTP:
require 'open-uri'
open("http://www.stackoverflow.com/"){|f| puts f.read}
puts URI::parse("http://www.google.com/").read
I have a CGI script for which I've successfully set a cookie (which I can see in Firefox/Chrome!) with (say) the name uid and the value 1. I don't seem to understand how to access this cookie from another CGI script, and I'm working in Python 2.4, so a lot of the examples I've found may not apply.
This code prints "can't get uid" followed by the rest of the page:
import os
import Cookie

c = Cookie.SimpleCookie(os.environ.get("HTTP_COOKIE"))
print("Content-Type: text/html")
print c.output()
print("\n\n")
uid = c.get("uid")
#uid = c["uid"].value # this would create an error and the page would fail totally
if uid is None:
    print("can't get uid")
    uid = 1 # set manually to prevent the rest of the page from failing
I haven't done anything fishy with the domain the cookie applies to, so I don't understand why this doesn't grab the uid value. By the way, if I try to print c.output(), it's blank.
First thing: are you sure the web server or the framework is setting the HTTP_COOKIE environment variable?
Otherwise, in one of your scripts you may want to store the cookies in a CookieJar file on the file system and access the saved cookies from there.
import cookielib

COOKIEFILE = 'Cookies.lwp'
cookiejar = cookielib.LWPCookieJar()
cookiejar.load(COOKIEFILE)
# A CookieJar has no dict-style assignment; build a Cookie object and add it.
# The domain and path here are placeholders for whatever your site actually uses.
uid_cookie = cookielib.Cookie(0, 'uid', '1', None, False, 'example.com', False, False,
                              '/', True, False, None, False, None, None, {})
cookiejar.set_cookie(uid_cookie)
cookiejar.save(COOKIEFILE)
Load the same cookie jar in the other script and read the uid value from it.
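A minimal sketch of what the other script might look like, assuming the same Cookies.lwp file:

import cookielib

COOKIEFILE = 'Cookies.lwp'
cookiejar = cookielib.LWPCookieJar()
cookiejar.load(COOKIEFILE)

# A CookieJar is iterable, so pick out the cookie saved by the first script.
uid = None
for cookie in cookiejar:
    if cookie.name == 'uid':
        uid = cookie.value
print uid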
Okay, I think I figured this out! I confirmed that os.environ.get("HTTP_COOKIE") was getting something, and then played with the order of the elements in my tiny test until it worked. Then I reproduced that order in my more complicated script. (Specifically: content type declaration, two newlines, get cookie, get value from cookie, everything else.)
The main thing I've learned about Python and CGI is that the order of elements (starting with the content type declaration) is very fussy. Thanks very much for the hints in the right direction.
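For reference, a minimal sketch of the order described above (the cookie name and fallback value mirror the question's code; the page body is just a placeholder):

import os
import Cookie

# 1. Content type declaration, followed by the newlines that end the HTTP headers.
print "Content-Type: text/html"
print ""

# 2. Get the cookie, then get the value from it.
c = Cookie.SimpleCookie(os.environ.get("HTTP_COOKIE"))
uid = 1  # fallback, as in the question
if "uid" in c:
    uid = c["uid"].value

# 3. Everything else.
print "<html><body>uid is %s</body></html>" % uid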