I am trying to query counts of bookmarks for research papers in CiteULike. I am using the "http://www.citeulike.org/api/posts/for/doi/" URL to request (using the urllib2 library for Python) an XML document which contains information on the bookmarks for a given DOI (a unique identifier for papers). However, I keep getting an HTTP 403 Error: Forbidden.
Does anyone know why I am getting this error? I've tried putting the URL with the DOI in the browser and that returns the XML just fine, so the problem seems related to my automated requests.
Thanks,
Nathanael
You should read http://wiki.citeulike.org/index.php/Importing_and_Exporting#Scripting_CiteULike
If you access CiteULike via an automated process, you MUST provide a
means to identify yourself via the User-Agent string. Please use
"<username>/<email> <application>" e.g., "fred/fred#wilma.com
myscraper/1.0". Any scripting of the site without a means to identify
you may result in a block.
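In Python 2's urllib2, which the question uses, setting that User-Agent might look roughly like this (a minimal sketch; the DOI and identification string are placeholders you must replace with your own details):
import urllib2

# Placeholder DOI and identification string -- substitute your own details
doi = "10.1000/example.doi"
url = "http://www.citeulike.org/api/posts/for/doi/" + doi

request = urllib2.Request(url)
# Identify the script in the "<username>/<email> <application>" format the wiki asks for
request.add_header("User-Agent", "fred/fred@wilma.com myscraper/1.0")

response = urllib2.urlopen(request)
xml_data = response.read()
print(xml_data)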
Related
Hello, I am currently trying to solve this problem. How would one go about getting a specific resource, i.e. "resources/2021/helloworld", from a web service at a specific URL (example.com/example) using Python? I also need to specify a user agent (Chrome on Android in this case) and a link which it was referred by (i.e. google.com/resourcelink). Then, preferably, print the text of the resource or write it to a file.
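A minimal sketch with the requests library, assuming the URLs from the question are placeholders and using an ordinary Chrome-on-Android User-Agent string:
import requests

# Placeholder URLs taken from the question; replace with the real service and referrer
url = "http://example.com/example/resources/2021/helloworld"
headers = {
    # A Chrome-on-Android browser identification string
    "User-Agent": ("Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/96.0.4664.45 Mobile Safari/537.36"),
    # The page the request should appear to have been referred by
    "Referer": "https://google.com/resourcelink",
}

response = requests.get(url, headers=headers)

# Print the text of the resource, or write it to a file instead
print(response.text)
with open("helloworld.txt", "w") as f:
    f.write(response.text)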
I recently wanted to extract data from a website that seems to use cookies to grant me access. I do not know very much about these procedures, but apparently this interferes with my method of getting the HTML content of the website via Python and its requests module.
The code I am running to extract the information contains the following lines:
import requests
#...
response = requests.get(url, proxies=proxies)
content = response.text
Where the website I am referring to is http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=6675630&tag=1 and proxies is a valid dict of my proxy servers (I tested those settings on websites that seemed to work fine). However, instead of the content of the article on this site, I receive the HTML content of the page that you get when you do not accept cookies in your browser.
As I am not really aware of what the website is actually doing and I lack real web development experience, I could not find a solution so far, even if a similar question might have been asked before. Is there any solution to access the content of this website via Python?
# Fetch the login page first so the server can set its session cookies ...
startr = requests.get('https://viennaairport.com/login/')
# ... then send those cookies along with the follow-up request
secondr = requests.post('http://xxx/', cookies=startr.cookies)
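The snippet above passes the cookies along by hand; a requests.Session object does the same bookkeeping automatically, which is usually less error-prone. A minimal sketch using the same placeholder URLs:
import requests

# A Session stores any cookies the server sets and sends them on later requests
session = requests.Session()

session.get('https://viennaairport.com/login/')   # server sets its session cookies here
response = session.post('http://xxx/')            # cookies are sent along automatically
print(response.status_code)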
I'm writing a simple Python web crawler using the mechanize library.
Right now, I just want to do the following:
1. Accept a list of startURLs as input
2. For each URL in startURLs, grab all the links on the page
3. Then, do an HTTP request for each of those links, and grab all of the links from them ...
4. Repeat this to the specified depth from the startURL.
So my problem is that when it is in step 3, I want it to skip downloading any links that point to image files (so if there is a URL like http://www.example.com/kittens.jpg, I want it to not add that to the list of URLs to fetch).
Obviously I could do this by just using a regex to match various file extensions in the URL path, but I was wondering if there is a cleaner way to determine whether or not a URL points to an image file, rather than an HTML document. Is there some sort of library function (either in mechanize, or some other library) that will let me do this?
Your suggested approach of using a regex on the URL is probably the best way to do this; the only way to see for sure what the URL points to would be to make a request to the server and examine the Content-Type header of the response to see if it starts with 'image/'.
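If you want something a little cleaner than a hand-rolled regex for the extension check, Python's standard mimetypes module can guess a type from the URL's extension. A minimal sketch (still only a heuristic, since it never contacts the server):
import mimetypes

def looks_like_image(url):
    # guess_type() looks only at the file extension, not the actual content
    guessed_type, _ = mimetypes.guess_type(url)
    return guessed_type is not None and guessed_type.startswith('image/')

print(looks_like_image('http://www.example.com/kittens.jpg'))  # True
print(looks_like_image('http://www.example.com/index.html'))   # False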
If you don't mind the overhead of making additional server requests, then you should send a HEAD request for the resource rather than the usual GET request. This will cause the server to return information about the resource (including its content type) without actually returning the file itself, saving you some bandwidth.
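With the requests library (rather than mechanize, which the question uses), such a HEAD-based check might look like this:
import requests

def is_image_url(url):
    # HEAD returns only the response headers, not the body
    response = requests.head(url, allow_redirects=True)
    content_type = response.headers.get('Content-Type', '')
    return content_type.startswith('image/')

print(is_image_url('http://www.example.com/kittens.jpg'))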
I am still fairly new to Python and have some questions regarding using requests to sign in. I have read for hours but can't seem to get an answer to the following questions. If I choose a site such as www.amazon.com, I can sign in and determine the sign-in link: https://www.amazon.com/gp/sign-in.html...
I can also find the sent form data, which includes items such as:
appActionToken:
appAction:SIGNIN
openid.pape.max_auth_age:ape:MA==
openid.return_to:
password: XXXX
email: XXXX
prevRID:
create:
metadata1: XXXX
My questions are as follows:
When finding form data, how do I know which items I must send back in a dictionary via a POST request? For the above, are email and password sufficient, and when browsing other sites, how do I know which ones are necessary?
The following code should work, but doesn't. What am I doing wrong?
The example includes a header to specify the browser type. Another site, such as www.slashdot.org, does not need the header value to sign in. How do I know which sites require the header value and which ones don't?
Anyone who could provide input and help me sign in with requests would be doing me a great favor. I thank you very much.
import requests
session = requests.Session()
data = {'email': 'xxxxx', 'password': 'xxxxx'}
header = {'User-Agent': 'Mozilla/5.0'}
response = session.post('https://www.amazon.com/gp/sign-in.html', data, headers=header)
print(response.content)
When finding form data, how do I know which items I must send back in a dictionary via a POST request? For the above, are email and password sufficient, and when browsing other sites, how do I know which ones are necessary?
You generally need to either (a) read the documentation for the site you're using, if it's available, or (b) examine the HTML yourself (and possibly trace the http traffic) to see what parameters are necessary.
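One common way to do (b) in code is to fetch the login page, collect every input field in the form (including hidden token fields), and then submit them all back with your credentials filled in. A rough sketch with requests and BeautifulSoup, using a hypothetical login URL and field names:
import requests
from bs4 import BeautifulSoup

session = requests.Session()
login_url = 'https://www.example.com/login'  # hypothetical login page

# Fetch the login page and collect every <input> in the form, hidden fields included
page = session.get(login_url)
form = BeautifulSoup(page.text, 'html.parser').find('form')
data = {field.get('name'): field.get('value', '')
        for field in form.find_all('input') if field.get('name')}

# Overwrite the credential fields with real values, then submit
# (note the form's action attribute may point somewhere other than login_url)
data['email'] = 'you@example.com'
data['password'] = 'secret'
response = session.post(login_url, data=data)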
The following code should work, but doesn't. What am I doing wrong?
You didn't provide any details about how your code is not working.
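If you want to narrow it down yourself, printing a few attributes of the response usually shows whether the login succeeded, where you were redirected, and what the server sent back. For example, with the response object from your snippet:
print(response.status_code)         # e.g. 200, 302, 403
print(response.url)                 # where you ended up after any redirects
print(response.history)             # intermediate redirect responses, if any
print('Sign Out' in response.text)  # look for text that only appears when signed in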
The example includes a header to specify the browser type. Another site, such as www.slashdot.org, does not need the header value to sign in. How do I know which sites require the header value and which ones don't?
The answer here is really the same as for the first question. Either you are using an API for which documentation exists that answers this question, or you're trying to automate a site that was designed primarily for human consumption via a web browser, which means you're going to have to figure out, through investigation, trial, and error, exactly what parameters you need to provide to make the remote server happy.
I have a scraper script that pulls binary content off publishers' websites. It's built to replace the manual action of saving hundreds of individual PDF files that colleagues would otherwise have to undertake.
The websites are credential based, and we have the correct credentials and permissions to collect this content.
I have encountered a website that has the pdf file inside an iFrame.
I can extract the content URL from the HTML. When I feed the URL to the content grabber, I collect a small piece of HTML that says: <html><body>Forbidden: Direct file requests are not allowed.</body></html>
I can feed the URL directly to the browser, and the PDF file resolves correctly.
I am assuming that there is a session cookie (or something, I'm not 100% comfortable with the terminology) that gets sent with the request to show that the GET request comes from a live session, not a remote link.
I looked at the referring URL, and saw these different URLs that point to the same article, collected over a day of testing (I have scrubbed identifiers from the URLs):
http://content_provider.com/NDM3NTYyNi45MTcxODM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3NjYyMS4wNjU3MzY%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njc3Mi4wOTY3MDM%3D/elibrary//title/issue/article.pdf
http://content_provider.com/NDM3Njg3Ni4yOTc0NDg%3D/elibrary//title/issue/article.pdf
This suggests that there is something in the URL that is unique and needs to be associated with something else to circumvent the direct-link detector.
Any suggestions on how to get round this problem?
OK. The answer was cookies and headers. I collected the GET header info via HttpFox and made an identical header object in my script, and I grabbed the session ID from request.cookie and sent the cookie with each request.
For good measure I also set the user agent to a known working browser agent, just in case the server was checking agent details.
Works fine.
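The details will differ from publisher to publisher, but the same idea (carry the session cookies and copy the browser's headers, including the user agent) can be expressed with a requests.Session. A rough sketch, with hypothetical URLs standing in for the real ones:
import requests

article_page = 'http://content_provider.com/elibrary/title/issue/article'  # hypothetical
pdf_url = 'http://content_provider.com/NDM3NTYyNi45MTcxODM%3D/elibrary//title/issue/article.pdf'

session = requests.Session()
# Copy the headers the browser sends, including a known working user agent
session.headers.update({
    'User-Agent': 'Mozilla/5.0',
    'Referer': article_page,
})

# Visiting the article page first gives the session the cookies the server
# expects to see on the follow-up PDF request
session.get(article_page)
pdf = session.get(pdf_url)

with open('article.pdf', 'wb') as f:
    f.write(pdf.content)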