'application/octet-stream' instead of 'application/csv'? - python

I am quite new to Python. I want to confirm that the type of the dataset (URL in the code below) is indeed a csv file. However, when checking via the headers I get 'application/octet-stream' instead of 'application/csv'.
I assume that I defined something in the wrong way when reading in the data, but I don't know what.
import requests

url = "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/data.csv"
d1 = requests.get(url)

filePath = 'data/data_notebook-1_covid-new.csv'
with open(filePath, "wb") as f:
    f.write(d1.content)
## data type via headers  #PROBLEM
headerDict = d1.headers
# accessing the Content-Type header
if "Content-Type" in headerDict:
    print("Content-Type:")
    print(headerDict['Content-Type'])

I assume that I defined something in the wrong way when reading in the data
No, you didn't. The Content-Type header is supposed to indicate what the response body is, but there is nothing you can do to force the server to set that to a value you expect. Some servers are just badly configured and don't play along.
application/octet-stream is the most generic content type of them all - it gives you no more info than "it's a bunch of bytes, have fun".
What's more, there isn't necessarily One True Type for each kind of content, only more-or-less widely agreed-upon conventions. For CSV, a common one would be text/csv.
So if you're sure what the content is, feel free to ignore the Content-Type header.
import requests
url = "https://opendata.ecdc.europa.eu/covid19/casedistribution/csv/data.csv"
response = requests.get(url)
filePath = 'data/data_notebook-1_covid-new.csv'
with open(filePath, "wb") as f:
    f.write(response.content)
Writing to file in binary mode is a good idea in the absence of any further information, because this will retain the original bytes exactly as they were.
In order to convert that to string, it needs to be decoded using a certain encoding. Since the Content-Type did not give any indication here (it could have said Content-Type: text/csv; charset=XYZ), the best first assumption for data from the Internet would be UTF-8:
import csv
filePath = 'data/data_notebook-1_covid-new.csv'
with open(filePath, encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        print(row)
Should that turn out to be wrong (i.e. there are decoding errors or garbled characters), you can try a different encoding until you find one that works. That would not be possible if you had written the file in text mode in the beginning, as any data corruption from wrong decoding would have made it into the file.
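For instance, a minimal sketch of that trial-and-error loop (the candidate encodings are an assumption, not part of the original answer):
filePath = 'data/data_notebook-1_covid-new.csv'
# Try a few common encodings until one decodes cleanly; latin-1 can
# decode any byte sequence, so it acts as a last-resort fallback.
for enc in ('utf-8', 'cp1252', 'latin-1'):
    try:
        with open(filePath, encoding=enc) as f:
            text = f.read()
        print("decoded successfully as", enc)
        break
    except UnicodeDecodeError:
        print(enc, "failed, trying the next candidate")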

Related

Python - read huge online csv through proxy

I have a huge csv online and I want to read it line by line without downloading it. But this file is behind a proxy.
I wrote this code:
import io

import pandas as pd
import requests
from requests_ntlm import HttpNtlmAuth  # missing from the original snippet

cafile = 'mycert.crt'
proxies = {"http": "http://ipproxy:port", "https": "http://ipproxy:port"}
auth = HttpNtlmAuth('Username', 'Password')
url = 'http://myurl/ressources.csv'

content = requests.get(url, proxies=proxies, auth=auth, verify=cafile).content
csv_read = pd.read_csv(io.StringIO(content.decode('utf-8')))

pattern = 'mypattern'
# note: iterating a DataFrame directly yields its column labels, not rows
for row in csv_read:
    if row[0] == pattern:
        print(row)
        break
This code above works, but the line 'content = requests.get(...)' takes so much time because of the size of the csv file.
So my question is:
Is it possible to read an online csv line by line through a proxy?
Ideally, I would read the first row, check whether it equals my pattern, break if it does, and otherwise read the second line, and so on.
Thanks for your help.
You can pass stream=True to requests.get to avoid fetching the entire result immediately. In that case you can access a pseudo-file object through response.raw and build your CSV reader based on that (alternatively, the response object has iter_content and iter_lines methods, but I don't know how easy it is to feed those to a CSV parser).
However, while the stdlib's csv module simply yields a sequence of lists or dicts and can therefore easily be lazy, pandas returns a DataFrame, which is not lazy: you need to pass the chunksize parameter to read_csv, and you then get one DataFrame per chunk.
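A minimal sketch of the streaming approach with the stdlib csv module (the URL, delimiter, and pattern are placeholders taken from the question):
import csv
import requests

url = 'http://myurl/ressources.csv'
pattern = 'mypattern'

# stream=True defers the body download; iter_lines then yields one
# decoded line at a time instead of loading the whole file.
with requests.get(url, stream=True) as response:
    response.encoding = response.encoding or 'utf-8'
    reader = csv.reader(response.iter_lines(decode_unicode=True), delimiter=';')
    for row in reader:
        if row and row[0] == pattern:
            print(row)
            break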
The requests.get call will get you the whole file anyway. You'd need to implement your own HTTP code, down to the socket level, to be able to process the content as it comes in with a plain HTTP GET.
The only way of getting partial results and slicing the download is to add HTTP Range request headers, if the server providing the file supports them (requests lets you set these headers).
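For illustration, a sketch of such a range request (assuming the server honors Range headers; the URL is the placeholder from the question):
import requests

# Ask for only the first ~100 KB of the file; a server that supports
# ranges answers with status 206 (Partial Content).
resp = requests.get('http://myurl/ressources.csv',
                    headers={'Range': 'bytes=0-99999'})
print(resp.status_code)  # 206 if the range was honored, 200 otherwise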
Enter requests' advanced usage: the good news is that requests can do this for you under the hood. You can set the stream=True parameter when calling requests, and it will even let you iterate the contents line by line; check the documentation on that part.
Here is more or less what requests does under the hood so that you can get your contents line by line:
It will fetch reasonably sized chunks of your data, but certainly not request one line at a time (think ~80 bytes versus ~100,000 bytes), because otherwise it would need a new HTTP request for each line, and the overhead for each request is not trivial, even if made over the same TCP connection.
Anyway, since CSV is a text format, neither requests nor any other software could know the size of the lines, and even less the exact size of the "next" line to be read, before setting the Range headers accordingly.
So, for this to work, there has to be Python code to (a sketch follows this list):

- accept a request for a "new line" of the CSV; if there are buffered text lines, yield the next line;
- otherwise make an HTTP request for the next ~100 KB;
- concatenate the downloaded data to the remainder of the last downloaded line;
- split the downloaded data at the last line feed in the binary data, and save the remainder of the last line;
- convert the binary buffer to text (you'd have to take care of multi-byte character boundaries in a multi-byte encoding like utf-8, but cutting at newlines may save you that);
- yield the next text line.
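Sketched out, under the assumption that the server supports Range requests (the function name and chunk size are illustrative, not from the original answer):
import requests

def iter_remote_lines(url, chunk_size=100_000, encoding='utf-8'):
    # Fetch the file in ~100 KB slices via HTTP Range requests and
    # yield one text line at a time, buffering the partial line left
    # over after each slice. Splitting at b'\n' avoids the multi-byte
    # boundary problem for utf-8, since no multi-byte sequence
    # contains the newline byte.
    offset = 0
    remainder = b''
    while True:
        headers = {'Range': 'bytes=%d-%d' % (offset, offset + chunk_size - 1)}
        response = requests.get(url, headers=headers)
        if response.status_code not in (200, 206) or not response.content:
            break  # e.g. 416 Range Not Satisfiable: past the end of the file
        data = remainder + response.content
        head, sep, remainder = data.rpartition(b'\n')
        if sep:
            for line in head.split(b'\n'):
                yield line.decode(encoding)
        offset += chunk_size
        if response.status_code == 200:
            break  # the server ignored Range and sent the whole body
    if remainder:
        yield remainder.decode(encoding)

# usage, with the placeholders from the question:
for line in iter_remote_lines('http://myurl/ressources.csv'):
    if line.split(';')[0] == 'mypattern':
        print(line)
        break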
Following Masklinn's answer, my code looks like this now:
import requests
from requests_ntlm import HttpNtlmAuth  # missing from the original snippet

cafile = 'mycert.crt'
proxies = {"http": "http://ipproxy:port", "https": "http://ipproxy:port"}
auth = HttpNtlmAuth('Username', 'Password')
url = 'http://myurl/ressources.csv'
pattern = 'mypattern'

r = requests.get(url, stream=True, proxies=proxies, auth=auth, verify=cafile)
if r.encoding is None:
    r.encoding = 'ISO-8859-1'
for line in r.iter_lines(decode_unicode=True):
    if line.split(';')[0] == pattern:
        print(line)
        break

Python csv package - issue with the DictReader class

I'm having a curious issue with the csv package in Python 3.7.
I'm importing a csv file and I'm able to access all of the file as expected, with one exception: the header row, as stored in the "fieldnames" object, appears to have the first column header (the first item in fieldnames) malformed.
This first field always has the format: 'xxx"header"'
where:
xxx are garbage characters that always seem to be the same
header is the correct header text
The malformed header is visible in my debug window's view of the table <csv.DictReader> object (screenshot omitted).
My code to open the file follows. I added the line headers[0] = table.fieldnames[0].split('"')[1] in order to extract the correct header and place it back into fieldnames.
import csv
with self.inputfile.open() as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    headers[0] = table.fieldnames[0].split('"')[1]
(Note: self.inputfile is a pathlib.Path object)
I didn't notice this for a long time because I wasn't using the first column (with the # header) - I've been happily parsing with the rest of the columns for a while on multiple files.
If I look directly at the csv, there doesn't appear to be any issue (screenshot omitted).
Questions:
Does anyone know what the issue is? Is there anything I can try to correct the import issue?
If there isn't a fix, is there a better way to parse the garbage? I realize this could clear up in the future, but I think the split will still work even with just bare double quotes (the header should still be the 2nd item in the split, right?). Is there a better solution?
It looks like your csv file is encoded as utf-8-sig - a version of utf-8 used by some Windows applications, but it's being decoded as cp1252 - another encoding in common use on Windows.
>>> print('"#"'.encode('utf-8-sig').decode('cp1252'))
"#"
The "garbage" characters preceding the header are the byte-order-mark that utf-8-sig uses to tell Windows applications that a file is encoded as utf-8 rather than one of the historically more common 8-bit encodings.
To avoid the "garbage", specify utf-8-sig as the encoding when opening your file.
The code in the question could be modified to work like this:
import csv
encoding = 'utf-8-sig'
with self.inputfile.open(encoding=encoding, newline='') as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    ...
If, as seems likely, the encoding of input files may vary, the value of encoding (or a best guess) must be determined using a tool like chardet, as suggested in the comments.
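A small sketch of that guess, assuming the third-party chardet package is installed:
import chardet  # third-party: pip install chardet

raw = self.inputfile.read_bytes()  # self.inputfile is the pathlib.Path from the question
guess = chardet.detect(raw)        # e.g. {'encoding': 'UTF-8-SIG', 'confidence': 1.0, ...}
encoding = guess['encoding'] or 'utf-8-sig'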

Parsing an email with Python

I feel like this is a simple question but nonetheless I cannot find a straightforward answer.
I have an email (an .eml file) that I need to parse. This email has a data table in the body that I need to export to my database. I have been successfully parsing data out of txt-file emails and attached PDF files, so I understand concepts like mapping to where the data is stored as well as regular expressions, but I can't seem to figure out these .eml files.
In my code below I have three blocks of code essentially trying to do the same thing (two of them are commented out now). I am simply attempting to capture any, or all, of the data in the email. Each block of code produces the same error though:
TypeError: initial_value must be str or None, not _io.TextIOWrapper
I have read that this error is most likely due to Python expecting to receive a string but receiving bytes instead, or vice versa. So I followed up those attempts by trying to implement io.StringIO or io.BytesIO, but neither worked. I would like to be able to recognize and parse specific data out of the email.
Thank you for any help, as well as question asking criticism.
My code:
import email
#import io
import os
import re

path = 'Z:\\folderwithemlfile'

for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        with open(file_path, 'r', encoding="utf-8") as f:
            b = email.message_from_string(f)  # raises the TypeError: f is a file object, not a str
            if b.is_multipart():
                for payload in b.get_payload():
                    print(payload.get_payload())
            else:
                print(b.get_payload())

            #b = email.message_from_string(f)
            #bbb = b['from']
            #ccc = b['to']
            #print(f)

            #msg = email.message_from_string(f)
            #msg['from']
            #msg['to']
Picture of the email omitted.
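A likely fix, sketched here rather than taken from the thread: email.message_from_string expects a str, so pass f.read() instead of the file object, or use email.message_from_file, which accepts the open file directly.
import email

with open(file_path, 'r', encoding='utf-8') as f:
    msg = email.message_from_file(f)  # accepts the file object directly
    # or: msg = email.message_from_string(f.read())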

Trying to download data from URL with CSV File

I'm slightly new to Python and have a question as to why the following code doesn't produce any output in the csv file. The code is as follows:
import csv
import urllib2
url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib2.urlopen(url)
cr = csv.reader(response)
for row in cr:
    with open("AusCentralbank.csv", "wb") as f:
        writer = csv.writer(f)
        writer.writerows(row)
Cheers.
Edit:
Brien and Albert solved the initial issue I had. However, I now have one further question. The CSV file listed above is found at "http://www.rba.gov.au/statistics/tables/#interest-rates" under Zero-coupon "Interest Rates - Analytical Series - 2009 to Current - F17" and is the F-17 Yields CSV. I see that it has 5 workbooks, and I actually just want to gather the data in the 5th workbook. Is there a way I could do this? Cheers.
I could only test my code using Python 3. However, the only difference should be urllib2, hence I am using urllib.request for opening the desired URL.
The variable html is of type bytes and can generally be written to a file in binary mode. Additionally, your source is a csv file already, so there should be no need to convert it somehow:
#!/usr/bin/env python3
# coding: utf-8
import urllib.request

url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib.request.urlopen(url)
html = response.read()

with open('output.csv', 'wb') as f:
    f.write(html)
It is probably because of your opening mode.
According to the documentation:

'w' is for writing only (an existing file with the same name will be erased).

You should use append mode ('a') to add to the end of the file:

'a' opens the file for appending; any data written to the file is automatically added to the end.

Also, since the file you are trying to download is a csv file, you don't need to convert it.
@albert had a great answer. I've gone ahead and converted it to the equivalent Python 2.x code. You were doing a bit too much work in your original program; since the file was already a csv, you didn't need to do any special work to turn it into one.
import urllib2
url = 'http://www.rba.gov.au/statistics/tables/csv/f17-yields.csv'
response = urllib2.urlopen(url)
html = response.read()
with open('AusCentralbank.csv', 'wb') as f:
    f.write(html)

Returning a python bytearray in an HttpResponse

I have a django view that I want to return an Excel file. The code is below:
def get_template(request, spec_pk):
    spec = get_object_or_404(Spec, pk=spec_pk)
    response = HttpResponse(spec.get_template(), mimetype='application/ms-excel')
    response['Content-Disposition'] = 'attachment; filename=%s_template.xls' % spec.name
    return response
In that example, the type of spec.get_template() is <type 'bytearray'> which contains the binary data of an Excel spreadsheet.
The problem is, when I try to download that view, and open it with Excel, it comes in as garbled binary data. I know that the bytearray is correct though, because if I do the following:
f = open('temp.xls', 'wb')
f.write(spec.get_template())
I can open temp.xls in the Excel perfectly.
I've even gone so far as to modify my view to:
def get_template(request, spec_pk):
    spec = get_object_or_404(Spec, pk=spec_pk)
    f = open('/home/user/temp.xls', 'wb')
    f.write(spec.get_template())
    f.close()
    f = open('/home/user/temp.xls', 'rb')
    response = HttpResponse(f.read(), mimetype='application/ms-excel')
    response['Content-Disposition'] = 'attachment; filename=%s_template.xls' % spec.name
    return response
And it works perfectly- I can open the xls file from the browser into Excel and everything is alright.
So my question is: what do I need to do to that bytearray before I pass it to the HttpResponse? Why does saving it as binary and then re-opening it work perfectly, while passing the bytearray itself results in garbled data?
Okay, through completely random (and very persistent) trial and error, I found a solution using the python binascii module.
This works:
response = HttpResponse(binascii.a2b_qp(spec.get_template()), mimetype='application/ms-excel')
According to the python docs for binascii.a2b_qp:
Convert a block of quoted-printable data back to binary and return the binary data. More than one line may be passed at a time. If the optional argument header is present and true, underscores will be decoded as spaces.
Would love for someone to tell me why saving it as binary, then reopening it worked though.
TLDR: Cast the bytearray to bytes
The problem is that Django's HttpResponse doesn't treat bytearray objects the same as bytes objects. HttpResponse has a special case for bytes which sends them to the client as-is, but it doesn't have a similar case for bytearray objects. They get handled by a catchall case which treats them as an iterable of int.
If you open the corrupted Excel file in a text editor, you'll probably see a bunch of ASCII numbers, which are the numeric values of the bytes you were trying to return from the bytearray.
Mr. Digital Flapjack gives a very complete explanation here: https://www.digitalflapjack.com/blog/2021/4/27/bytes-not-bytearrays-with-django-please
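Applied to the view from the question, the fix is a one-line cast (a sketch; mimetype matches the question's older Django version, newer Django spells it content_type):
def get_template(request, spec_pk):
    spec = get_object_or_404(Spec, pk=spec_pk)
    # bytes(...) copies the bytearray into an immutable bytes object,
    # which HttpResponse sends to the client as-is.
    response = HttpResponse(bytes(spec.get_template()), mimetype='application/ms-excel')
    response['Content-Disposition'] = 'attachment; filename=%s_template.xls' % spec.name
    return response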
