Hi, I am trying to hit an API using the requests module in Python. The API has to be hit about 20,000 times, since there are around 20,000 pages. Each response is around 10 MB, and by the end of the process the script has created a JSON file of around 100 GB. Here is the code I have written:
import json
import requests

# url, headers and the initial next_page_cursor are defined earlier in the script
with open('file.json', 'wb', buffering=100 * 1048576) as f:
    while next_page_cursor != "":
        with requests.get(url, headers=headers) as response:
            json_response = json.loads(response.content.decode('utf-8'))
            # The JSON response looks something like this:
            # {
            #     content: [{}, {}, {}, ... 50 dictionaries]
            #     next_page_cursor: "abcd"
            # }
            next_page_cursor = json_response['next_page_cursor']
            # Write each record as one JSON line
            for data in json_response['content']:
                f.write((json.dumps(data) + "\n").encode())
But after running successfully for a few pages, the code fails with the error below:
Traceback (most recent call last):
File "<command-1206920060120926>", line 65, in <module>
with requests.get(data_url, headers = headers) as response:
File "/databricks/python/lib/python3.7/site-packages/requests/api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "/databricks/python/lib/python3.7/site-packages/requests/api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "/databricks/python/lib/python3.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/databricks/python/lib/python3.7/site-packages/requests/sessions.py", line 686, in send
r.content
File "/databricks/python/lib/python3.7/site-packages/requests/models.py", line 828, in content
self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
File "/databricks/python/lib/python3.7/site-packages/requests/models.py", line 753, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: OSError("(104, \'ECONNRESET\')")', OSError("(104, 'ECONNRESET')"))
You need to use response.iter_content:
https://2.python-requests.org/en/master/api/#requests.Response.iter_content
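As a rough illustration of what the answer suggests, here is a minimal sketch that reads one page through iter_content and appends its records as JSON lines. The helper name write_page and the 1 MiB chunk size are assumptions made for this example; the field names content and next_page_cursor come from the question. Streaming the body this way is not guaranteed to prevent a connection reset on its own.

```python
import json


def write_page(response, out_file, chunk_size=1024 * 1024):
    """Read one streamed page chunk by chunk, then append its records as JSON lines.

    `response` is expected to behave like a requests.Response opened with
    stream=True. Returns the cursor for the next page ("" when there is none).
    """
    body = bytearray()
    for chunk in response.iter_content(chunk_size=chunk_size):
        body.extend(chunk)
    # The page is a single JSON document, so it is parsed once it is complete.
    page = json.loads(body.decode("utf-8"))
    for record in page["content"]:
        out_file.write((json.dumps(record) + "\n").encode("utf-8"))
    return page.get("next_page_cursor", "")
```

With requests, this could be called as next_page_cursor = write_page(response, f) inside a with requests.get(url, headers=headers, stream=True) as response: block, so that the body is only pulled from the socket while iter_content is being consumed.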
Related
I'm trying to implement a PUT request on HDFS via the HDFS Web API.
So I looked up the documentation on how to do that: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE
First do a PUT without following redirects to get a 307, catch the new URL, and then PUT to that URL with the data.
When I do the first PUT, I do get the 307, but the URL is the same as the first one, so I'm not sure whether I'm already on the "good" datanode or not.
Anyway, I take this URL and try to send the data to it, but I get an error. From what I understand, it is a connection error: the host is cutting the connection.
import requests
from requests_kerberos import OPTIONAL, HTTPKerberosAuth
from utils import settings

class HttpFS:
    def __init__(self, url=settings.HTTPFS_URL):
        self.httpfs = url
        self.auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)

    def put(self, local_file, hdfs_dir):
        url = "{}{}".format(self.httpfs, hdfs_dir)
        params = {"op": "CREATE", "overwrite": True}
        print(url)
        # First PUT: no data and no redirect following, to obtain the 307 Location
        r = requests.put(url, auth=self.auth, params=params, stream=True, verify=settings.CA_ROOT_PATH, allow_redirects=False)
        # Second PUT: send the file contents to the returned Location
        r = requests.put(r.headers['Location'], auth=self.auth, data=open(local_file, 'rb'), params=params, stream=True, verify=settings.CA_ROOT_PATH)
Here is the error given:
r = requests.put(r.headers['Location'], auth=self.auth, data=open(local_file, 'rb'), params=params, stream=True, verify=settings.CA_ROOT_PATH)
File "/home/cdsw/.local/lib/python3.6/site-packages/requests/api.py", line 134, in put
return request('put', url, data=data, **kwargs)
File "/home/cdsw/.local/lib/python3.6/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/home/cdsw/.local/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/home/cdsw/.local/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/home/cdsw/.local/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BrokenPipeError(32, 'Broken pipe'))
Edit 1:
I also tried the https://github.com/pywebhdfs/pywebhdfs library, since it is supposed to do exactly what I'm looking for, but I still get this Broken Pipe error.
from requests_kerberos import OPTIONAL, HTTPKerberosAuth
from pywebhdfs.webhdfs import PyWebHdfsClient
from utils import settings
auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
url = f"https://{settings.HDFS_HTTPFS_HOST}:{settings.HDFS_HTTPFS_PORT}/webhdfs/v1/"
hdfs_client = PyWebHdfsClient(base_uri_pattern=url, request_extra_opts={'auth':auth, 'verify': settings.CA_ROOT_PATH})
with open(data_dir + file_name, 'rb') as file_data:
    hdfs_client.create_file(hdfs_path + file_name, file_data=file_data, overwrite=True)
Same error:
hdfs_client.create_file(hdfs_path + file_name, file_data=file_data, overwrite=True)
File "/home/cdsw/.local/lib/python3.6/site-packages/pywebhdfs/webhdfs.py", line 115, in create_file
**self.request_extra_opts)
File "/home/cdsw/.local/lib/python3.6/site-packages/requests/api.py", line 134, in put
return request('put', url, data=data, **kwargs)
File "/home/cdsw/.local/lib/python3.6/site-packages/requests/api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "/home/cdsw/.local/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
resp = self.send(prep, **send_kwargs)
File "/home/cdsw/.local/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
r = adapter.send(request, **kwargs)
File "/home/cdsw/.local/lib/python3.6/site-packages/requests/adapters.py", line 498, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BrokenPipeError(32, 'Broken pipe'))
Edit 2:
I found out I was sending too much data at once. So now I create a file on HDFS and then append to it in chunks. But it is slow, and I still randomly get the same error as above. The bigger the chunks, the more likely I am to get a "Connection aborted". My files are in the range of 200 MB, so it takes ages compared to the Hadoop binary "hdfs dfs -put".
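The chunking step described in Edit 2 might look something like the sketch below. It only covers reading the local file in fixed-size pieces; the helper name iter_file_chunks and the 8 MiB default are assumptions for illustration, and each yielded chunk would still have to be sent with whatever create or append call is in use.

```python
def iter_file_chunks(path, chunk_size=8 * 1024 * 1024):
    """Yield the contents of the file at `path` as byte chunks of at most chunk_size."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # An empty read means the end of the file was reached.
                break
            yield chunk
```

A smaller chunk size means more requests but less data lost when a connection is aborted, which matches the trade-off the post observes.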
I am trying to make an app that downloads comics, but whenever I try to download an image, it says no host supplied.
I searched thoroughly and found nothing.
This is the code:
import requests,bs4
url='https://www.marvel.com/comics/issue/71314/edge_of_spider-geddon_2018_1'
res=requests.get(url,stream=True)
res.raise_for_status()
soup=bs4.BeautifulSoup(res.text)
elem=soup.select('div[class="row-item-image"] img')#.viewer-cnt .row .col-xs-12 #ppp img')
#print(elem)
comicurl='https:'+elem[0].get('src')
res=requests.get(comicurl,stream=True,allow_redirects=True)
res.raise_for_status()
with open(comicurl[comicurl.rfind('/')+1:],'wb') as i:
    for chunk in res.iter_content(100000):
        i.write(chunk)
I expect it to download the image but it gives me this error:
Traceback (most recent call last):
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\comicdownloader.py", line 10, in <module>
res=requests.get(comicurl,stream=True,allow_redirects=True)
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\api.py", line 75, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 519, in request
prep = self.prepare_request(req)
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 462, in prepare_request
hooks=merge_hooks(request.hooks, self.hooks),
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\models.py", line 313, in prepare
self.prepare_url(url, params)
File "C:\Users\Islam\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\models.py", line 390, in prepare_url
raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL 'https:https://i.annihil.us/u/prod/marvel/i/mg/6/b0/5b6c5e4154f75/portrait_uncanny.jpg': No host supplied
And it gives it to me whenever I try it on any website.
It looks like elem[0].get('src') evaluates to https://i.annihil.us/u/prod/marvel/i/mg/6/b0/5b6c5e4154f75/portrait_uncanny.jpg.
So on the line comicurl='https:'+elem[0].get('src') you add https: in front of an already well-formed URL, making it invalid.
Can't argue with this: Invalid URL 'https:https://i.annihil.us/u/prod -- the URL really is invalid. You should probably get rid of the 'https:' prefix in the following statement:
comicurl='https:'+elem[0].get('src')
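If the src attribute is sometimes absolute and sometimes scheme-relative (starting with //), one option is to resolve it against the page URL with the standard library instead of concatenating a prefix by hand. This is a small sketch; the helper name absolute_url is made up for the example.

```python
from urllib.parse import urljoin


def absolute_url(page_url, src):
    """Resolve an <img src> value against the URL of the page it came from."""
    # urljoin leaves absolute URLs untouched and fills in the missing
    # scheme or host for scheme-relative and path-relative values.
    return urljoin(page_url, src)
```

In the question's code this would replace the concatenation, for example comicurl = absolute_url(url, elem[0].get('src')).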
I use the Reddit API wrappers praw and psraw to extract comments from a subreddit. However, I got two errors today after running a few loops:
1. A "JSON object decoded error or empty" error (ValueError). Even though I catch the exception in my code, it still doesn't work.
2. An HTTP request error, for example:
Traceback (most recent call last):
File "C:/Users/.../subreddit psraw.py", line 20, in <module>
for comment in submission.comments:
File "C:\Python27\lib\site-packages\praw\models\reddit\base.py", line 31, in __getattr__
self._fetch()
File "C:\Python27\lib\site-packages\praw\models\reddit\submission.py", line 142, in _fetch
'sort': self.comment_sort})
File "C:\Python27\lib\site-packages\praw\reddit.py", line 367, in get
data = self.request('GET', path, params=params)
File "C:\Python27\lib\site-packages\praw\reddit.py", line 451, in request
params=params)
File "C:\Python27\lib\site-packages\prawcore\sessions.py", line 174, in request
params=params, url=url)
File "C:\Python27\lib\site-packages\prawcore\sessions.py", line 108, in _request_with_retries
data, files, json, method, params, retries, url)
File "C:\Python27\lib\site-packages\prawcore\sessions.py", line 93, in _make_request
params=params)
File "C:\Python27\lib\site-packages\prawcore\rate_limit.py", line 33, in call
response = request_function(*args, **kwargs)
File "C:\Python27\lib\site-packages\prawcore\requestor.py", line 49, in request
raise RequestException(exc, args, kwargs)
prawcore.exceptions.RequestException: error with request
HTTPSConnectionPool(host='oauth.reddit.com', port=443): Read timed out. (read timeout=16.0)
Since a subreddit contains 10k+ comments, is there a way to solve this issue? Or is it because the Reddit website is having problems today?
My code:
import praw, datetime, os, psraw

reddit = praw.Reddit('bot1')
subreddit = reddit.subreddit('example')
for submission in psraw.submission_search(reddit, subreddit='example', limit=1000000):
    try:
        # get comments
        for comment in submission.comments:
            subid = submission.id
            comid = comment.id
            com_body = comment.body.encode('utf-8').replace("\n", " ")
            com_date = datetime.datetime.utcfromtimestamp(comment.created_utc)
            string_com = '"{0}", "{1}", "{2}"\n'
            formatted_string_com = string_com.format(comid, com_body, com_date)
            # append to the per-submission file and close it after each write
            with open('path' + subid + '.txt', 'a+') as indexFile_comment:
                indexFile_comment.write(formatted_string_com)
    except ValueError:
        print("error")
        continue
    except AttributeError:
        print("error")
        continue
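The traceback ends in prawcore.exceptions.RequestException, which the ValueError and AttributeError handlers above do not catch. One possible way to cope with occasional read timeouts is to retry the failing call with an increasing delay. The sketch below is generic; the helper name retry_with_backoff and its defaults are assumptions, not part of praw or psraw.

```python
import time


def retry_with_backoff(func, exceptions, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or `attempts` tries have failed.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between tries and
    re-raises the last error if every try fails.
    """
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

In the loop above it could wrap the comment fetch, for instance comments = retry_with_backoff(lambda: list(submission.comments), (prawcore.exceptions.RequestException,)), assuming prawcore is imported.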
macOS 10.12.3, Python 2.7.13, requests 2.13.0
I use the requests package to send POST requests. The request requires being logged in before posting data, so I use requests.Session() and load a cookie from a logged-in session.
Then I use this session to post data in a loop.
The same code used to run without errors on Windows and Linux.
Simple code:
import cookielib
import requests

s = requests.Session()
# Load the saved cookies of the logged-in session
s.cookies = cookielib.LWPCookieJar('cookise')
s.cookies.load(ignore_discard=True)
for user_id in range(100, 200):
    url = 'http://xxxx'
    data = {'user': user_id, 'content': '123'}
    r = s.post(url, data)
    ...
But the program crashes frequently (at roughly regular intervals). The error is AttributeError: 'module' object has no attribute 'kqueue'
Traceback (most recent call last):
File "/Users/gasxia/Dev/Projects/TgbookSpider/kfz_send_msg.py", line 90, in send_msg
r = requests.post(url, data) # catch error if user isn't exist
File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 535, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 488, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 609, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/site-packages/requests/adapters.py", line 423, in send
timeout=timeout
File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 588, in urlopen
conn = self._get_conn(timeout=pool_timeout)
File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py", line 241, in _get_conn
if conn and is_connection_dropped(conn):
File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/util/connection.py", line 27, in is_connection_dropped
return bool(wait_for_read(sock, timeout=0.0))
File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/util/wait.py", line 33, in wait_for_read
return _wait_for_io_events(socks, EVENT_READ, timeout)
File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/util/wait.py", line 22, in _wait_for_io_events
with DefaultSelector() as selector:
File "/usr/local/lib/python2.7/site-packages/requests/packages/urllib3/util/selectors.py", line 431, in __init__
self._kqueue = select.kqueue()
AttributeError: 'module' object has no attribute 'kqueue'
This looks like a problem that commonly arises if you're using something like eventlet or gevent, both of which monkeypatch the select module. If you're using those to achieve asynchrony, you will need to ensure that those monkeypatches are applied before importing requests. This is a known bug, being tracked in this issue.
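A quick way to check whether this explanation applies is to look at what the select module exposes at the point of failure and whether gevent or eventlet has been imported. The following is only a diagnostic sketch using the standard library; the function name selector_report is invented for the example.

```python
import select
import sys


def selector_report():
    """Summarize which polling primitives `select` exposes and whether
    gevent or eventlet has already been imported in this process."""
    return {
        "select_attrs": [name for name in ("kqueue", "epoll", "poll", "select")
                         if hasattr(select, name)],
        "gevent_loaded": "gevent" in sys.modules,
        "eventlet_loaded": "eventlet" in sys.modules,
    }


print(selector_report())
```

On macOS an unpatched select module normally exposes kqueue. If it is missing from the list while gevent or eventlet is loaded, monkeypatching is a likely cause, and moving the patch call (for gevent, from gevent import monkey; monkey.patch_all()) to the very top of the entry script, before import requests, is the arrangement the answer describes.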
I am writing a Python script in which I right-click a file and click on a context menu item. Clicking it opens a webpage, and I have to validate the URL of that webpage. How do I shift control from the OS to the browser and get the current URL?
Why don't you call requests.get(url) and check the response code? Another option is to call requests.head(url):
>>> import requests
>>> url1 = 'http://example.com'
>>> url2 = 'http://sdsdsdsdsdss.com'
>>> r = requests.head(url1)
>>> r.status_code
200
>>> r = requests.head(url2, timeout=5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/requests/api.py", line 77, in head
return request('head', url, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 383, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 486, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 387, in send
raise Timeout(e)
requests.exceptions.Timeout: (<urllib3.connectionpool.HTTPConnectionPool object at 0x7f4e77635950>, 'Connection to sdsdsdsdsdss.com timed out. (connect timeout=5)')
>>>
You need to handle the exception. Details about requests module: http://docs.python-requests.org/en/latest/
And if you really need to open the web browser, you can use this library: https://docs.python.org/2/library/webbrowser.html
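Handling the exception could look roughly like the sketch below. The function name check_url and the "status below 400" rule are assumptions for illustration. The function that performs the request is passed in, so requests.head can be supplied by the caller.

```python
def check_url(url, head, timeout=5):
    """Return (ok, detail) for url.

    `head` performs the request, for example requests.head. ok is True when a
    response with a status code below 400 comes back; detail is the status
    code, or the exception text if the request failed.
    """
    try:
        response = head(url, timeout=timeout)
    except OSError as exc:
        # requests' exceptions (Timeout, ConnectionError, ...) derive from
        # IOError, which is the same class as OSError on Python 3.
        return False, str(exc)
    return response.status_code < 400, response.status_code
```

For example, ok, detail = check_url('http://example.com', requests.head) returns a flag plus the status code, and a timeout like the one in the traceback above comes back as (False, message) instead of raising.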