Ruby fetching URL content is always empty - Python

I am so frustrated trying to use Ruby to fetch the content of a specific URL.
I've tried many different ways, like open-uri and a plain Net::HTTP request, and none has worked so far: I always get empty HTML. I also tried using Python to fetch the same URL, and it returned the correct HTML content every time. I am really not sure why... Please help, as I am a newbie to both Ruby and Python... I want to use Ruby (I prefer the tidy syntax and human-friendly method names, and it's easier to install libs with gem and Homebrew on a Mac than with Python's easy_install), but I am now considering Python because it just works (though I'm still trying to get my head around the 2.x vs 3.x issue). I may be doing something really stupid, but I think that's very unlikely.
ruby 1.9.2p136 (2010-12-25 revision 30365) [i386-darwin10.6.0]
Implementation 1:
url = URI.parse('http//:www.stackoverflow.com/') req = Net::HTTP::Get.new(url.path)
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
puts res.body #empty
Implementation 2:
doc = Nokogiri::HTML(open("http//:www.stackoverflow.com/", "User-Agent" => "Safari"))
#empty
#I tried without the user agent and without Nokogiri; none worked.
Python implementation, which worked perfectly every time:
f = urllib.urlopen("http//:www.stackoverflow.com/")
# Read from the object, storing the page's contents in 's'.
s = f.read()
f.close()
print s

If that is your exact code, it is invalid for several reasons.
http//: should be http://
The URL needs a path: if you want the root page of example.com it needs to be http://example.com/ (the trailing slash is significant).
If you put two statements on one line, you need a ; to denote the end of the first.
So:
require 'net/http'

url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
req = Net::HTTP::Get.new(url.request_uri) # request_uri keeps the query string; url.path would drop it
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
puts res.body
The same URL fixes apply when using open with Nokogiri.
EDIT: that site is returning bad results many times:
counter = 0
20.times do
  url = URI.parse('http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia')
  req = Net::HTTP::Get.new(url.request_uri)
  res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
  sleep 1
  counter += 1 unless res.body.empty?
end
puts counter
For me this returned a non-empty body only once. If you substitute another site, it works every time.
curl "http://www.yellowpages.com.au/search/listings?clue=plumber&locationClue=Australia"
Yields the same inconsistent results.

Two examples with open-uri (a standard-library wrapper for, among others, the rather cumbersome Net::HTTP):
require 'open-uri'
open("http://www.stackoverflow.com/"){|f| puts f.read}
puts URI::parse("http://www.google.com/").read


How do I make this website recognise my arrays as part of a valid url query?

EDIT:
In a similar vein, when I now try to log into the account with a POST request, what is returned is none of the errors documented on the site, but a "JSON exception". Is there any way to debug this, or is an error code 500 impossible to deal with?
I'm well aware this question has been asked before. Sadly, none of the proposed answers worked when I tried them. I have an extremely simple Python project using urllib; I've never done web programming in Python before, nor am I even a regular Python user. My friend needs access to content from this site, but its user-friendly front end is down, and I learned that it has a public API to access its content. Not knowing what I'm doing, but glad to try to help and interested in the challenge, I have very slowly set out.
Note that I can only use standard Python libraries, so that any finished project could easily be emailed to their computer and just work.
The following works completely fine without the "originalLanguage" query. But when I use it (the API documents it as an array value), no matter whether I comma-separate the values or write "originalLanguage[0]" or "originalLanguage0" or anything else I've seen online, the server returns an error along the lines of "Array value expected but string detected".
Is there any way for me to get this working? It clearly can work, otherwise the API wouldn't document it. Many thanks.
In case it helps: when using "[]" or "<>" or "{}" or any delimiter I could think of, my IDE didn't recognise it as part of the URL.
import urllib.request as request
import urllib.parse as parse

def make_query(url, params):
    url += "?"
    for i in range(len(params)):
        url += list(params)[i]
        url += '='
        url += list(params.values())[i]
        if i < len(params) - 1:
            url += '&'
    return url

base = "https://api.mangadex.org/manga"
params = {
    "limit": "50",
    "originalLanguage": "en"
}
url = make_query(base, params)
req = request.Request(url)
response = request.urlopen(req)
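Not from the original post, but one way to attack both problems, assuming the API follows the common PHP-style convention of a repeated bracketed key for arrays (originalLanguage[]=en), which is what the MangaDex docs appear to describe. urlencode with doseq=True expands list values into repeated pairs, and reading the body of an HTTPError usually reveals what a 500 is actually complaining about; this sketch sticks to the standard library:
import urllib.error
import urllib.parse as parse
import urllib.request as request

base = "https://api.mangadex.org/manga"
# doseq=True turns {"k": ["a", "b"]} into k=a&k=b; the bracketed
# parameter name is an assumption based on the API's docs
query = parse.urlencode({"limit": "50", "originalLanguage[]": ["en"]}, doseq=True)

try:
    response = request.urlopen(base + "?" + query)
    print(response.read().decode())
except urllib.error.HTTPError as e:
    # a 500 is not a dead end: the response body usually carries the real message
    print(e.code, e.read().decode())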

Python - Facebook fb_dtsg

On Facebook I want to find fb_dtsg to make a status:
import urllib, urllib2, cookielib
jar = cookielib.CookieJar()
cookie = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie)
data = urllib.urlencode({'email':"email",'pass':"password", "Log+In":"Log+In"})
req = urllib2.Request('http://www.facebook.com/login.php')
opener.open(req, data)
opener.open(req, data) #Needs to be opened twice to log on.
req2 = urllib2.Request("http://www.facebook.com/")
page = opener.open(req2).read() # read() so the HTML can be sliced as a string below
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 33] #This just finds the value of "fb_dtsg".
Yes, this does find a value, and one that looks like fb_dtsg should look. But the value changes every time I open the page again, and when I use it to make a status it does not work. When I record what happens in Google Chrome while making a status normally, I get a working fb_dtsg value that does not change (for a long session) and that does work when I use it to make a status. Please, please show me how I can fix this without using the API.
The search slice used to find fb_dtsg truncates the last digit, so change 33 to 34:
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 34]
In any case, a better way of finding fb_dtsg is with re:
import re
re.findall('fb_dtsg.+?value="([^"]+)"', page)
As I answered in one of your earlier posts, the form may also require other hidden variables.
If this still doesn't work, can you provide the code where you make the post, including all the post form data?
BTW, sorry for not noticing your previous posts with the same content :P
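Not part of the original answer, but a sketch of collecting those other hidden variables in one pass; the regex assumes the attribute order type, name, value inside each input tag, which the real markup may not follow:
import re

# page is the HTML string fetched above; every hidden input becomes {name: value}
hidden_inputs = dict(re.findall(
    r'<input type="hidden" name="([^"]+)" value="([^"]*)"', page))
print(hidden_inputs.get("fb_dtsg"))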

Parse what you google search

I'd like to write a script (preferably in Python, but other languages are not a problem) that can parse what you type into a Google search. Suppose I search for 'cats'; then I'd like to be able to parse the string cats and, for example, append it to a .txt file on my computer.
So if my searches were 'cats', 'dogs', 'cows', then I could have a .txt file like so:
cats
dogs
cows
Does anyone know of any APIs that can read the search bar and return the string that was entered? Or some object that I can cast to a string?
EDIT: I don't want to make a Chrome extension or anything; I'd prefer a Python (or Bash or Ruby) script I can run in a terminal that can do this.
Thanks
If you have access to the URL, you can look for "&q=" to find the search term (http://google.com/...&q=cats..., for example).
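A quick sketch of that idea, assuming you can capture the full search URL somewhere (browser history, a proxy log, etc.); the file name is just an example:
from urlparse import urlparse, parse_qs # urllib.parse in Python 3

def search_term(url):
    # parse_qs returns each parameter as a list of values
    terms = parse_qs(urlparse(url).query).get("q")
    return terms[0] if terms else None

term = search_term("http://google.com/search?q=cats&hl=en")
if term:
    with open("searches.txt", "a") as f:
        f.write(term + "\n") # appends: cats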
I can offer two popular solutions.
1) Google has a search-engine API: https://developers.google.com/products/#google-search
(It has a restriction of 100 requests per day.)
Abridged code:
import urllib2, json, re

def gapi_parser(args):
    query = args.text; count = args.max_sites
    import config
    api_key = config.api_key
    cx = config.cx
    #Note: This API returns up to the first 100 results only.
    #https://developers.google.com/custom-search/v1/using_rest?hl=ru-RU#WorkingResults
    results = []; domains = set(); errors = []; start = 1
    while True:
        req = 'https://www.googleapis.com/customsearch/v1?key={key}&cx={cx}&q={q}&alt=json&start={start}'.format(key=api_key, cx=cx, q=query, start=start)
        if start >= 100: # the API cannot go past the first 100 results
            break
        con = urllib2.urlopen(req)
        if con.getcode() == 200:
            data = con.read()
            j = json.loads(data)
            start = int(j['queries']['nextPage'][0]['startIndex'])
            for item in j['items']:
                match = re.search('^(https?://)?\w(\w|\.|-)+', item['link'])
                if match:
                    domain = match.group(0)
                    if domain not in results:
                        results.append(domain)
                        domains.update([domain])
                else:
                    errors.append('Can`t recognize domain: %s' % item['link'])
        if len(domains) >= args.max_sites:
            break
    print
    for error in errors:
        print error
    return (results, domains)
2) I wrote a Selenium-based script that parses the page in a real browser instance, but this solution has some restrictions, for example CAPTCHAs if you run searches like a robot.
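For reference, a bare-bones version of that idea might look like the sketch below; Selenium must be installed separately, and the selector for result titles is an assumption that rots as Google changes its markup:
from selenium import webdriver

driver = webdriver.Firefox() # opens a real browser instance
driver.get("https://www.google.com/search?q=cats")
for result in driver.find_elements_by_css_selector("h3"):
    print(result.text) # result titles
driver.quit()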
A few options you might consider, with their advantages and disadvantages:
URL:
advantage: as Chris mentioned, accessing the URL and manually changing it is an option. It should be easy to write a script for this, and I can send you my Perl script if you want.
disadvantage: I am not sure you can do it. I made a Perl script for that before, but it didn't work, because Google states that you can't use its services outside the Google interface. You might face the same problem.
Google's search API:
advantage: popular choice with good documentation. It should be a safe choice.
disadvantage: Google's restrictions.
Research other search engines:
advantage: they might not have the same restrictions as Google. You might find some search engines that let you play around more and have more freedom in general.
disadvantage: you're not going to get results that are as good as Google's.

Exporting Wikipedia with Python

I am trying to export a category from the Turkish Wikipedia by following http://www.mediawiki.org/wiki/Manual:Parameters_to_Special:Export . Here is the code I am using:
# -*- coding: utf-8 -*-
import requests
from BeautifulSoup import BeautifulStoneSoup
from sys import version

link = "http://tr.wikipedia.org/w/index.php?title=%C3%96zel:D%C4%B1%C5%9FaAktar&action=submit"

def get(pages=[], category=False, curonly=True):
    params = {}
    if pages:
        params["pages"] = "\n".join(pages)
    if category:
        params["addcat"] = 1
        params["category"] = category
    if curonly:
        params["curonly"] = 1
    headers = {"User-Agent": "Wiki Downloader -- Python %s, contact: Yaşar Arabacı: yasar11732#gmail.com" % version}
    r = requests.post(link, headers=headers, data=params)
    return r.text

print get(category="Matematik")
Since I am trying to get data from the Turkish Wikipedia, I have used its URL. Other things should be self-explanatory. I am getting the form page that you can use to export data instead of the actual XML. Can anyone see what I am doing wrong here? I have also tried making a GET request.
There is no parameter named category; the category name should be in the catname parameter.
But Special:Export was not built for bots, it was built for humans. So if you use catname correctly, it will return the form again, this time with the pages from the category filled in. Then you are supposed to click "Submit" again, which will return the XML you want.
I think doing this in code would be too complicated. It would be easier if you used the API instead. There are some Python libraries that can help you with that: Pywikipediabot or wikitools.
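For example, a minimal sketch against the raw API (using requests, which the question already imports) that lists the members of the category; the parameter names come from the MediaWiki API documentation for list=categorymembers, and "Kategori:" is the Turkish Category: namespace prefix:
import json
import requests

api = "http://tr.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "list": "categorymembers",
    "cmtitle": "Kategori:Matematik",
    "cmlimit": "500",
    "format": "json",
}
r = requests.get(api, params=params)
data = json.loads(r.text) # avoids depending on a particular requests version
for member in data["query"]["categorymembers"]:
    print(member["title"])
Each title can then be fed back through Special:Export (or action=query with the export flag) to fetch the XML.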
Sorry, my original answer was horribly flawed. I misunderstood the original intent.
I did some more experimenting because I was curious. It seems the code you have above is not necessarily incorrect; rather, the Special:Export documentation is misleading. The documentation states that using catname and addcat will add the category's pages to the output, but instead it only lists the pages and categories within the specified catname inside an HTML form. It seems that Wikipedia requires that the pages you wish to download be specified explicitly. Granted, the documentation doesn't appear to be very thorough on that matter. I would suggest that you parse the page for the pages within the category and then explicitly download those pages with your script. I do see an issue with this approach in terms of efficiency: due to the nature of Wikipedia's data, you'll get a lot of pages which are simply category pages for other pages.
As an aside, it could be faster to use the actual corpus of data from Wikipedia, which is available for download.
Good luck!

Can't seem to access cookie in Python (2.4)

I have a CGI script for which I've successfully set a cookie (which I can see in Firefox/Chrome!) with (say) the name uid and the content 1. I don't understand how to access this cookie from another CGI script, and I'm working in Python 2.4, so a lot of the examples I've found may not apply.
This code prints "can't get uid" followed by the rest of the page:
import os
import Cookie

c = Cookie.SimpleCookie(os.environ.get("HTTP_COOKIE"))
print("Content-Type: text/html")
print c.output()
print("\n\n")
uid = c.get("uid")
#uid = c["uid"].value # this would raise an error and the page would fail completely
if uid is None:
    print("can't get uid")
    uid = 1 # set manually to prevent the rest of the page from failing
I haven't done anything fishy with the domain the cookie applies to, so I don't understand why this doesn't grab the uid value. By the way, if I try to print c.output(), it's blank.
The first thing is: are you sure the webserver or the framework is actually setting the HTTP_COOKIE environment variable?
Otherwise, in one of your scripts you may want to store the cookies in a CookieJar file on the file system and read the cookies back from there.
import cookielib
COOKIEFILE = 'Cookies.lwp'
cookiejar = cookielib.LWPCookieJar()
cookiejar.load(COOKIEFILE)
# a CookieJar is not a dict, so build a Cookie object and add it with set_cookie()
uid_cookie = cookielib.Cookie(
    version=0, name='uid', value='1', port=None, port_specified=False,
    domain='example.com', domain_specified=False, domain_initial_dot=False, # placeholder domain
    path='/', path_specified=True, secure=False, expires=None, discard=False,
    comment=None, comment_url=None, rest={})
cookiejar.set_cookie(uid_cookie)
cookiejar.save(COOKIEFILE)
Load the same cookie jar in the other script and read uid back from it.
Okay, I think I figured this out! I confirmed that os.environ.get("HTTP_COOKIE") was returning something, and then played with the order of the elements in my tiny test case until it worked. Then I reproduced that order in my more complicated script. (Specifically: content-type declaration, two newlines, get cookie, get value from cookie, everything else.)
The main thing I've learned about Python and CGI is that the order of elements (starting with the content-type declaration) is very fussy. Thanks very much for the hints in the right direction.
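For anyone landing here later, a minimal sketch of the order described above (assumptions: Python 2.4 CGI, and a uid cookie already set by an earlier response):
import os
import Cookie

# 1. content-type declaration, then a blank line to close the headers
print "Content-Type: text/html"
print

# 2. only then read the cookie and pull out its value
c = Cookie.SimpleCookie(os.environ.get("HTTP_COOKIE", ""))
morsel = c.get("uid")
uid = morsel and morsel.value # None if the cookie was not sent

# 3. everything else
print "uid = %r" % uid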
