I want to be able to scrape data from a website. I use the requests.get function to do so. Everything works out fine, except that the website takes time to fully load, so when I download it, some parts are not fully loaded.
I tried to use the timeout and stream arguments of the get function, but it doesn't work.
Here is my code:
import requests

acc = open(r'C:\Users\axelg\.spyder-py3\accueil.html', 'w', encoding="utf-8")
with requests.Session() as s:
    url = 'http://localhost/mysiste.php'
    s.get(url)
    login_data = {'log': 'myLog', 'pwd': 'MyPwd'}
    s.post(url, data=login_data)
    r = s.get('http://localhost/location/', stream=True)
    acc.write(r.text)
acc.close()
Thank you for your answers!
You might be using the timeout argument the wrong way:
r = s.get('http://localhost/location/', timeout=(5, 20))
The first value of the timeout tuple is the connect timeout (how long to wait while establishing the connection) and the second value is the read timeout (how long to wait for the server to send data). Try increasing the second value to suit your requirements.
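For instance, a sketch based on the question's second request (the exact values here are placeholders, not recommendations):

import requests

with requests.Session() as s:
    # timeout=(connect, read): give up quickly if the server is unreachable,
    # but allow a slow page up to 30 seconds to send its data
    r = s.get('http://localhost/location/', timeout=(5, 30))
    print(r.status_code, len(r.text))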
I am rewriting this question to make it more concise and focused on the real problem:
test program code:
https://drive.google.com/file/d/1kDEUxSpNMlyxPYqPw0ikUxG8INW4My8n/view?usp=sharing
implementation program code:
https://drive.google.com/file/d/14v06AZlGMTFMmFeaQ9cbpmNJwI33fNtS/view?usp=sharing
Currently, I have the same code that I am trying to run in the test program and the implementation program.
r = requests.get(url, headers = head)
It is located at line 58 in the test program and line 381 in the implementation program.
In the implementation program, that line throws this error:
r = requests.get(url, headers = head)
TypeError: get() takes no keyword arguments
This does not happen in the test program. Any suggestions would be very much appreciated. Thanks!
So, when you have a bearer token, is a client ID really needed?
Well, here's how I do it:
headers = { 'Authorization' : 'Bearer ' + 'bd897dbb4e493881c8385f89f608d5e3bf28c656' }
r = requests.get(<url>, headers=headers, verify=False)
response = r.text
Please give it a try.
The problem was that I also had a dictionary named requests. Thanks for your time and help!
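For anyone hitting the same error, here is a hypothetical illustration of that shadowing problem; the URL and headers are placeholders:

import requests

url = 'http://example.com'          # placeholder URL
head = {'User-Agent': 'test'}       # placeholder headers

requests = {'cached': 'data'}       # oops: this dict now shadows the requests module

try:
    r = requests.get(url, headers=head)   # actually calls dict.get()
except TypeError as e:
    print(e)                              # get() takes no keyword arguments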
The .get() function doesn't require you to name the argument. Just pass your URL into requests.get() like so:
import requests
url = '...'  # your URL here
r = requests.get(url)
I am trying to read Twitter feeds using the URL. Yesterday I was able to pull some 80K tweets using the code, but due to some updates on my machine, my Mac terminal stopped responding before the Python code completed.
Today the same code is not returning any JSON data; it is returning empty results. Yet if I type the same URL into a browser, I get a JSON file full of data.
Here is my code:
Method 1:
import urllib.request

try:
    urllib.request.urlcleanup()
    response = urllib.request.urlopen(url)
    print('URL to used: ', url)
    testURL = response.geturl()
    print('URL you used: ', testURL)
    jsonResponse = response.read()
    jsonResponse = urllib.request.urlopen(url).read()
except Exception as e:
    print('Exception while fetching response:', e)
This printed:
URL to used: https://twitter.com/i/search/timeline?f=tweets&q=%20since%3A2017-08-14%20until%3A2017-08-15%20USA&src=typd&max_position=
URL you used: https://twitter.com/i/search/timeline?f=tweets&q=%20since%3A2017-08-14%20until%3A2017-08-15%20USA&src=typd&max_position=
json: {'items_html': '\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n', 'focused_refresh_interval': 30000, 'has_more_items': False, 'min_position': 'TWEET--', 'new_latent_count': 0}
Method 2:
try:
    request = urllib.request.Request(url, headers=headers)
except:
    print("Thats the problem here:")

try:
    response = urllib.request.urlopen(request)
except:
    print("Exception while fetching response")

testURL = response.geturl()
print('URL you used: ', testURL)

try:
    jsonResponse = response.read()
except:
    print("Exception while reading response")
Same results in both cases.
Kindly help.
Based on my testing this behavior has nothing to do with urllib. The same thing happens with the requests library for example.
It appears Twitter detects automated scraping via repeated hits against the search URL, based on your IP address and user agent (UA) string. At some point, subsequent hits return empty results. This seems to happen after a day or so, probably as a result of delayed analysis on Twitter's part.
If you change the UA string in the search URL request header, you should once again receive valid results in the response. Twitter will probably block you again after a while, so you'll need to change your UA string frequently.
I assume Twitter expires these blocks after some timeout, but I don't know how long that will take.
By way of reference, the twitter-past-crawler project demonstrates using a semi-random UA string taken from a file containing multiple UA strings.
Also, the Twitter-Search-API-Python project uses a hardcoded UA string, which stopped working a day or so after my first test. Changing the string in the code (adding random characters) resulted in a resumption of prior functionality.
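As a rough sketch of the idea (the UA strings, URL handling, and helper below are illustrative assumptions, not code from either project):

import random
import urllib.request

# Pick a different User-Agent for each request so repeated hits are less
# likely to be flagged; the strings here are just examples.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/603.1.30',
    'Mozilla/5.0 (X11; Linux x86_64; rv:54.0) Gecko/20100101 Firefox/54.0',
]

def fetch(url):
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return response.read()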
I'm fetching a batch of URLs using the Python Requests module. I first want to read their headers only, to get the actual URL and size of the response. Then I get the actual content for any that pass muster.
So I use stream=True to delay getting the content. This generally works fine.
But I'm encountering the occasional URL that doesn't respond. So I put in timeout=3.
But those never time out. They just hang. If I remove the stream=True it times out correctly. Is there some reason stream and timeout shouldn't work together? Removing stream=True forces me to get all the content.
Doing this:
import requests
url = 'http://bit.ly/1pQH0o2'
x = requests.get(url) # hangs
x = requests.get(url, stream=True) # hangs
x = requests.get(url, stream=True, timeout=1) # hangs
x = requests.get(url, timeout=3) # times out correctly after 3 seconds
There was a relevant github issue:
Timeouts do not occur when stream == True
The fix was included in requests 2.3.0.
Tested it using the latest version; it worked for me.
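For reference, a minimal check (assuming a URL that is slow to respond, like the one from the question; on an up-to-date requests this should raise Timeout instead of hanging):

import requests

try:
    r = requests.get('http://bit.ly/1pQH0o2', stream=True, timeout=1)
    print(r.status_code)
except requests.exceptions.Timeout:
    print('timed out')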
Do you close your responses? Unclosed and partially read responses can make multiple connections to the same resource, and the site may have a connection limit for a single IP.
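A minimal sketch of what that might look like, assuming you only need the headers before deciding whether to download the body (the URL and size limit are placeholders):

import requests

url = 'http://example.com/some/resource'   # placeholder URL
TOO_LONG = 10_000_000                      # placeholder size limit in bytes

# Using the response as a context manager (requests >= 2.18) guarantees the
# connection is released back to the pool even if the body is never read.
with requests.get(url, stream=True, timeout=3) as r:
    size = r.headers.get('content-length')
    final_url = r.url
    if size is not None and int(size) < TOO_LONG:
        body = r.content   # the body is only downloaded here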
url = 'http://developer.usa.gov/1usagov.json'
r = requests.get(url)
The Python code hangs forever, and I am not behind an HTTP proxy or anything.
Pointing my browser directly to the URL works.
Following my comment above: I think your problem is the continuous stream. You need to do something like in the docs:
r = requests.get(url, stream=True)
if int(r.headers['content-length']) < TOO_LONG:
    # rebuild the content and parse
Use a while instead of the if if you want a continuous loop.
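A minimal sketch of that continuous-loop version, assuming the endpoint streams one JSON object per line (which would also explain why a plain requests.get() appears to hang: the response never ends):

import json
import requests

url = 'http://developer.usa.gov/1usagov.json'

with requests.get(url, stream=True) as r:
    for line in r.iter_lines():
        if not line:          # skip keep-alive newlines
            continue
        event = json.loads(line)
        print(event)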
I am developing a download manager and using the requests module in Python to check for valid links (and hopefully broken links).
My code for checking a link is below:
url = 'http://pyscripter.googlecode.com/files/PyScripter-v2.5.3-Setup.exe'
r = requests.get(url, allow_redirects=False) # this line takes 40 seconds
if r.status_code == 200:
    print("link valid")
else:
    print("link invalid")
Now, the issue is that this takes approximately 40 seconds to perform the check, which is huge.
My question is: how can I speed this up, maybe using urllib2 or something?
Note: also, if I replace url with the actual URL, which is 'http://pyscripter.googlecode.com/files/PyScripter-v2.5.3-Setup.exe', this takes one second, so it appears to be an issue with requests.
Not all hosts support HEAD requests. You can use this instead:
r = requests.get(url, stream=True)
This actually only downloads the headers, not the response content. Moreover, if the idea is to get the file afterwards, you don't have to make another request.
See here for more info.
Don't use get, which actually retrieves the file; use:
r = requests.head(url, allow_redirects=False)
which goes from 6.9 seconds on my machine to 0.4 seconds.
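A minimal sketch combining both suggestions: try HEAD first, and fall back to a streamed GET (headers only) for hosts that don't accept HEAD. The 405 check and timeout values are assumptions, not part of either answer.

import requests

url = 'http://pyscripter.googlecode.com/files/PyScripter-v2.5.3-Setup.exe'

r = requests.head(url, allow_redirects=False, timeout=5)
if r.status_code == 405:                 # Method Not Allowed: host rejects HEAD
    r = requests.get(url, stream=True, timeout=5)
    r.close()                            # we only wanted the headers

print('link valid' if r.status_code == 200 else 'link invalid')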