How to write content directly to a tar file? - python

I have a url that I can make curl requests against
curl --insecure --header "Expect:" \
--header "Authorization: Bearer <api key>" \
https://some-url --silent --show-error --fail -o data-package.tar -v
Here I am trying to do it with the requests module
r = requests.get('https://stg-app.conduce.com/conduce/api/v1/admin/export/' + id,
headers=headers)
r.content  # binary tar file contents
How do I write this content out to a tar file, like the data-package.tar that curl produces?

The content will be the entire file (as bytes) that you can write out.
import requests
r = requests.get('...YOUR URL...')
# Create a file to write to in binary mode and just write out
# the entire contents at once. Also check that we got a successful response
# (add whatever codes are necessary if this endpoint can return
# something other than 200 for success).
if r.status_code in (200,):
    with open('tarfile.tar', 'wb') as tarfile:
        tarfile.write(r.content)
If you are downloading any arbitrary tar file and it could be rather large, you can choose to stream it instead.
import requests
tar_url = 'YOUR TAR URL HERE'
rsp = requests.get(tar_url, stream=True)
if rsp.status_code in (200,):
    with open('tarfile.tar', 'wb') as tarfile:
        # chunk size is how many bytes to read at a time;
        # feel free to adjust up or down as you see fit.
        for file_chunk in rsp.iter_content(chunk_size=512):
            tarfile.write(file_chunk)
Note that this pattern (opening a file in 'wb' mode) generally works for writing any type of binary file. I would suggest reading the Python documentation on reading and writing files.
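Once the archive is on disk, the standard-library tarfile module can be used as a quick sanity check that the download really is a valid tar file. A minimal sketch (the file name and the 'extracted' directory are just example names, not from the answer above):
import tarfile

with tarfile.open('data-package.tar') as archive:
    print(archive.getnames())             # list members as a quick sanity check
    archive.extractall(path='extracted')  # extract into a local directory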

Related

Python version for curl --output

I have a GitLab API (v4) call that I need to make to get a project sub-directory (apparently new in v14.4; it does not seem to be included in the python-gitlab library yet), which in curl can be done with the following command:
curl --header "PRIVATE-TOKEN: A_Token001" http://192.168.156.55/api/v4/projects/10/repository/archive?path=ProjectSubDirectory --output ~./temp/ProjectSubDirectory.tar.gz
The issue is in the last part, the --output ~./GitLab/some_project_files/ProjectSubDirectory.tar.gz
I tried different methods (.content, .text), which failed, for example:
...
response = requests.get(url=url, headers=headers, params=params).content
# and save the response content with open(...)
but in all cases it saved an invalid tar.gz file, or had other issues.
I even tried https://curlconverter.com/, but the code it generates does not work either; it seems to ignore precisely the --output parameter and says nothing about the file itself:
headers = {'PRIVATE-TOKEN': 'A_Token001',}
params = (('path', 'ProjectSubDirectory'),)
response = requests.get('http://192.168.156.55/api/v4/projects/10/repository/archive', headers=headers, params=params)
For now, I just created a script and call it with subprocess, but I don't like this approach much, since Python has libraries such as requests that should have some way to do the same...
Two key things:
1. Allow redirects.
2. Use raise_for_status() to make sure the request was successful before writing the file. This will help uncover other potential issues, like failed authentication.
After that, write response.content to a file opened in binary mode for writing ('wb'):
import requests
url = "https://..."
headers = {} # ...
params = {}  # ...
output_path = 'path/to/local/file.tar.gz'
response = requests.get(url, headers=headers, params=params, allow_redirects=True)
response.raise_for_status()  # make sure the request was successful
with open(output_path, 'wb') as f:
    f.write(response.content)
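If the exported archive can be large, the same two points can be combined with streaming, as in the first answer above. A sketch (URL, token and parameters are taken from the question; the chunk size is an arbitrary choice):
import requests

url = 'http://192.168.156.55/api/v4/projects/10/repository/archive'
headers = {'PRIVATE-TOKEN': 'A_Token001'}
params = {'path': 'ProjectSubDirectory'}
output_path = 'ProjectSubDirectory.tar.gz'

with requests.get(url, headers=headers, params=params,
                  allow_redirects=True, stream=True) as response:
    response.raise_for_status()
    with open(output_path, 'wb') as f:
        # 8192 bytes per chunk; adjust as needed
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)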

Convert Grobid curl command to requests in Python

I'm trying to convert a curl command that parses a PDF file with a Grobid server into Python requests code.
Basically, if I run the grobid server as follows,
./gradlew run
I can use the following curl to get the output of parsed XML of an academic paper example.pdf as below
curl -v --form input=@example.pdf localhost:8070/api/processHeaderDocument
However, I don't know how to convert this command into Python. Here is my attempt using requests:
GROBID_URL = 'http://localhost:8070'
url = '%s/processHeaderDocument' % GROBID_URL
pdf = 'example.pdf'
xml = requests.post(url, files=[pdf]).text
I got the answer. Basically, I had missed api in the GROBID_URL, and the input files should be passed as a dictionary instead of a list.
GROBID_URL = 'http://localhost:8070'
url = '%s/api/processHeaderDocument' % GROBID_URL
pdf = 'example.pdf'
xml = requests.post(url, files={'input': open(pdf, 'rb')}).text
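A slightly more defensive version of the same call, sketched with a context manager so the file handle is closed, plus a status check (the URL and file name come from the question):
import requests

GROBID_URL = 'http://localhost:8070'
url = '%s/api/processHeaderDocument' % GROBID_URL

with open('example.pdf', 'rb') as pdf_file:
    response = requests.post(url, files={'input': pdf_file})

response.raise_for_status()  # fail loudly if the service returned an error
xml = response.text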
Here is an example bash script from http://ceur-ws.bitplan.com/index.php/Grobid. Please note that there is also a ready-to-use Python client available; see https://github.com/kermitt2/grobid_client_python
#!/bin/bash
# WF 2020-08-04
# call grobid service with paper from ceur-ws
v=2644
p=44
vol=Vol-$v
pdf=paper$p.pdf
if [ ! -f $pdf ]
then
  wget http://ceur-ws.org/$vol/$pdf
else
  echo "paper $p from volume $v already downloaded"
fi
curl -v --form input=@./$pdf http://grobid.bitplan.com/api/processFulltextDocument > $p.tei
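For completeness, a rough Python-requests equivalent of the bash script above (an illustrative sketch, not part of the original answer):
import os
import requests

v, p = 2644, 44
vol = 'Vol-%d' % v
pdf = 'paper%d.pdf' % p

# download the paper if it is not already present
if not os.path.isfile(pdf):
    paper = requests.get('http://ceur-ws.org/%s/%s' % (vol, pdf))
    paper.raise_for_status()
    with open(pdf, 'wb') as f:
        f.write(paper.content)
else:
    print('paper %d from volume %d already downloaded' % (p, v))

# send it to the Grobid service and save the TEI output
with open(pdf, 'rb') as f:
    tei = requests.post('http://grobid.bitplan.com/api/processFulltextDocument',
                        files={'input': f}).text

with open('%d.tei' % p, 'w') as out:
    out.write(tei)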

Python requests zip upload makes zipfile unreadable in Windows

I'm trying to upload a zipfile to a server using Python requests. The upload works fine. However, the uploaded file cannot be opened using Windows Explorer or Ark. I suppose there's some problem with the mime-type or Content-Length.
Oddly, uploading the file using curl does not seem to cause the same problem.
Here is my python code for the request:
s = requests.Session()
headers = {'Content-Type': 'application/zip'}
zip = open('file.zip', 'rb')
files = {'file': ('file.zip', zip, 'application/zip')}
fc = {'Content-Disposition': 'attachment; filename=file.zip'}
headers.update(fc)
r = requests.Request('POST', url, files=files, headers=headers, auth=(user, password))
prepared = r.prepare()
resp = s.send(prepared)
This is the curl code, which works flawlessly:
curl -X POST \
-ik \
-u user:password \
--data-binary '@file.zip' \
-H 'Content-Type: application/zip' \
-H "Content-Disposition: attachment; filename=file.zip" \
url
Uploading the file works in both cases, and the server also seems to recognize the content-type. However, the file is rendered invalid when re-downloading. The zipfile is readable before sending via requests, or after sending with normal curl using --data-binary.
Opening the downloaded zipfile with unzip or file-roller works either way.
EDIT:
I was uploading two files successively. Oddly, the error was fixed when uploading the exact same files in reverse order.
This has NOT been a Python problem. When trying with standard curl,
I must have accidentally reversed the order, which is why it appeared to work.
I cannot explain this behavior, nor do I have a fix for it.
In conclusion: uploading the bigger file first did the trick.
All of the above seems to be applicable in curl, pycurl and python requests, so I assume it's some kind of bug in one of the curl libraries.
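One observation, separate from the ordering quirk: curl's --data-binary sends the raw zip bytes as the request body, whereas files={...} in requests builds a multipart/form-data request (the Fedora answer further down makes the same point). A sketch that mirrors the curl command more closely; the URL and credentials are placeholders:
import requests

url = 'https://example.invalid/upload'   # placeholder for the real endpoint
user, password = 'user', 'password'      # placeholders

headers = {
    'Content-Type': 'application/zip',
    'Content-Disposition': 'attachment; filename=file.zip',
}
with open('file.zip', 'rb') as zip_file:
    resp = requests.post(url, data=zip_file, headers=headers,
                         auth=(user, password))
resp.raise_for_status()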

How do I use requests.put() to upload a file using Python?

I am trying to use the requests library in Python to upload a file into Fedora commons repository on localhost. I'm fairly certain my main problem is not understanding open() / read() and what I need to do to send data with an http request.
def postBinary(fileName, dirPath, url):
    path = dirPath + '/' + fileName
    print('to ' + url + '\n' + path)
    openBin = {'file': (fileName, open(path, 'rb').read())}
    headers = {'Slug': fileName}  # not important
    r = requests.put(url, files=openBin, headers=headers, auth=HTTPBasicAuth('username', 'pass'))
    print(r.text)
    print("and the url used:")
    print(r.url)
This successfully uploads a file to the repository, but the file ends up slightly larger and corrupted afterwards. For example, an image that was 6.6 kB became 6.75 kB and could not be opened anymore.
So how should I properly open and upload a file using put in python?
Extra details:
When I replace files=openBin with data=openBin, I end up with my dictionary form-encoded and, I presume, the data sent as a string. I don't know if that information is helpful or not.
"file=FILE_NAME.extension&file=TYPE89a%24%02Q%03%E7%FF%00E%5B%19%FC%....
and the size of the file increases to a number of megabytes
I am using PUT specifically because the Fedora RESTful HTTP API endpoint says to use PUT.
The following command does work:
curl -u username:password -H "Content-Type: text/plain" -X PUT -T /path/to/someFile.jpeg http://localhost:8080/fcrepo/rest/someFile.jpeg
Updated
Using requests.put() with the files parameter sends a multipart/form-data encoded request which the server does not seem to be able to handle without corrupting the data, even when the correct content type is declared.
The curl command simply performs a PUT with the raw data contained in the body of the request. You can create a similar request by passing the file data in the data parameter. Specify the content type in the header:
headers = {'Content-type': 'image/jpeg', 'Slug': fileName}
r = requests.put(url, data=open(path, 'rb'), headers=headers, auth=('username', 'pass'))
You can vary the Content-type header to suit the payload as required.
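If the content type is not known in advance, one option (an assumption on my part, not part of the answer above) is to guess it from the file name with the standard mimetypes module:
import mimetypes
import requests

path = '/path/to/someFile.jpeg'
content_type = mimetypes.guess_type(path)[0] or 'application/octet-stream'
headers = {'Content-type': content_type, 'Slug': 'someFile.jpeg'}

with open(path, 'rb') as f:
    r = requests.put('http://localhost:8080/fcrepo/rest/someFile.jpeg',
                     data=f, headers=headers, auth=('username', 'pass'))
r.raise_for_status()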
Try setting the Content-type for the file.
If you are sure that it is a text file then try text/plain which you used in your curl command - even though you would appear to be uploading a jpeg file? However, for a jpeg image, you should use image/jpeg.
Otherwise for arbitrary binary data you can use application/octet-stream:
openBin = {'file': (fileName, open(path,'rb'), 'image/jpeg' )}
Also it is not necessary to explicitly read the file contents in your code, requests will do that for you, so just pass the open file handle as shown above.
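Applied to the function from the question, that advice might look like the sketch below (keeping the original names; note that the previous answer suggests data= may be needed instead of files= if the server cannot handle multipart requests):
import requests
from requests.auth import HTTPBasicAuth

def postBinary(fileName, dirPath, url):
    path = dirPath + '/' + fileName
    with open(path, 'rb') as f:
        # pass the open file handle and a content type; no explicit .read()
        openBin = {'file': (fileName, f, 'application/octet-stream')}
        headers = {'Slug': fileName}
        r = requests.put(url, files=openBin, headers=headers,
                         auth=HTTPBasicAuth('username', 'pass'))
    return r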

transpose curl PUT to requests.put

I would like to transpose the curl command (which uploads a local file to Rackspace)
curl -X PUT -T screenies/hello.jpg -D - \
-H "X-Auth-Token: fc81aaa6-98a1-9ab0-94ba-aba9a89aa9ae" \
https://storage101.dfw1.clouddrive.com/v1/CF_xer7_343/images/hello.jpg
to python requests. So far I have:
url = 'http://storage.clouddrive.com/v1/CF_xer7_343/images/hello.jpg'
headers = {'X-Auth-Token': 'fc81aaa6-98a1-9ab0-94ba-aba9a89aa9ae'}
request = requests.put(url, headers=headers, data={})
where do I specify I want to upload screenies/hello.jpg?
I understand -T in curl represents 'to FTP server', but I have searched the requests GitHub repository and cannot find any mention of FTP.
No, -T just means 'upload this file', which can be used with FTP but is not limited to that.
You can just upload the file data as the data parameter:
with open('screenies/hello.jpg', 'rb') as image:
    request = requests.put(url, headers=headers, data=image)
where requests will read and upload the image data for you.
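A fuller sketch that mirrors the whole curl command, including a look at the response (curl's -D - dumps the response headers); the URL and token are taken from the question:
import requests

url = 'https://storage101.dfw1.clouddrive.com/v1/CF_xer7_343/images/hello.jpg'
headers = {'X-Auth-Token': 'fc81aaa6-98a1-9ab0-94ba-aba9a89aa9ae'}

with open('screenies/hello.jpg', 'rb') as image:
    response = requests.put(url, headers=headers, data=image)

# the equivalent of curl's -D - : inspect the response status and headers
print(response.status_code)
print(response.headers)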
