I would like to be able to run a call like this:
pv -ptebar compressed.csv.gz | python my_script.py
Inside my_script.py I would like to decompress compressed.csv.gz and parse it with the Python csv parser. I expected something like this to work:
import csv
import gzip
import sys
with gzip.open(fileobj=sys.stdin, mode='rt') as f:
    reader = csv.reader(f)
    print(next(reader))
    print(next(reader))
    print(next(reader))
Of course it doesn't work, because gzip.open has no fileobj argument. Could you provide a working example that solves this issue?
UPDATE
Traceback (most recent call last):
File "my_script.py", line 8, in <module>
print(next(reader))
File "/usr/lib/python3.5/gzip.py", line 287, in read1
return self._buffer.read1(size)
File "/usr/lib/python3.5/_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "/usr/lib/python3.5/gzip.py", line 461, in read
if not self._read_gzip_header():
File "/usr/lib/python3.5/gzip.py", line 404, in _read_gzip_header
magic = self._fp.read(2)
File "/usr/lib/python3.5/gzip.py", line 91, in read
self.file.read(size-self._length+read)
File "/usr/lib/python3.5/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
The traceback above appeared after applying @Rawing's advice.
In python 3.3+, you can pass a file object to gzip.open:
The filename argument can be an actual filename (a str or bytes object), or an existing file object to read from or write to.
So your code should work if you just omit the fileobj=:
with gzip.open(sys.stdin, mode='rt') as f:
Or, a slightly more efficient solution:
with gzip.open(sys.stdin.buffer, mode='rb') as f:
If for some odd reason you're using a python older than 3.3, you can directly invoke the gzip.GzipFile constructor. However, these old versions of the gzip module didn't have support for files opened in text mode, so we'll use sys.stdin's underlying buffer instead:
with gzip.GzipFile(fileobj=sys.stdin.buffer) as f:
Using gzip.open(sys.stdin.buffer, 'rt') fixes the issue for Python 3.
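Putting the accepted fix together, here is a self-contained sketch; `io.BytesIO` stands in for `sys.stdin.buffer` so the example runs without an actual pipe (that substitution is mine, not from the original answers):

```python
import csv
import gzip
import io

# Simulate gzipped CSV arriving on a binary stream (a stand-in for sys.stdin.buffer).
raw = gzip.compress(b"a,b,c\n1,2,3\n")
stream = io.BytesIO(raw)

# gzip.open accepts a file object since Python 3.3; mode='rt' decodes
# the decompressed bytes to text, which is what csv.reader expects.
with gzip.open(stream, mode='rt', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    rows = list(reader)

print(rows)  # [['a', 'b', 'c'], ['1', '2', '3']]
```

In the real script you would replace `stream` with `sys.stdin.buffer`; the rest is unchanged.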
Related
I was trying to write a script that generates one combined PDF from a bunch of small PDF files, but the script failed with a UnicodeEncodeError.
I also tried to include an encoding param:
with open("Combined.pdf", "w", encoding='utf-8-sig') as outputStream:
but the interpreter said the file needs to be opened in binary ('wb') mode, so that isn't working.
Below is the code:
writer = PdfFileWriter()
input_stream = []
for f2 in f_re:
    inputf_file = str(mypath + '\\' + f2[2])
    input_stream.append(open(inputf_file, 'rb'))
for reader in map(PdfFileReader, input_stream):
    for n in range(reader.getNumPages()):
        writer.addPage(reader.getPage(n))
with open("Combined.pdf", "wb") as outputStream:
    writer.write(outputStream)
    writer.save()
for f in input_stream:
    f.close()
Below is error message:
Traceback (most recent call last):
File "\Workspace\Python\py_CombinPDF\py_combinePDF.py", line 89, in <module>
writer.write(outputStream)
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\pdf.py", line 501, in write
obj.writeToStream(stream, key)
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\generic.py", line 549, in writeToStream
value.writeToStream(stream, encryption_key)
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\generic.py", line 472, in writeToStream
stream.write(b_(self))
File "\AppData\Local\Programs\Python\Python36\lib\site-packages\PyPDF2\utils.py", line 238, in b_
r = s.encode('latin-1')
UnicodeEncodeError: 'latin-1' codec can't encode characters in position 8-9: ordinal not in range(256)
Upgrading PyPDF2 solved this issue.
Now, four years later, people should use pypdf instead; it contains the latest code. (I'm the maintainer of both PyPDF2 and pypdf.)
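For the curious, the traceback above bottoms out in a plain Latin-1 encode (`r = s.encode('latin-1')` in PyPDF2's `b_` helper), and Latin-1 only covers code points up to U+00FF; any character beyond that, such as curly quotes, fails exactly as shown. A minimal stdlib reproduction:

```python
# Characters within Latin-1 (<= U+00FF) encode fine.
ok = "caf\u00e9"  # é is U+00E9
assert ok.encode('latin-1') == b'caf\xe9'

# Characters beyond U+00FF (here: curly quotes U+201C/U+201D) cannot
# be represented in Latin-1 and raise UnicodeEncodeError.
bad = "\u201cquoted\u201d"
try:
    bad.encode('latin-1')
    raised = False
except UnicodeEncodeError:
    raised = True
print(raised)  # True
```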
After web scraping an e-commerce site, I saved all the data into a pandas dataframe. When I try to save the dataframe to an Excel file, I get the following error:
Traceback (most recent call last):
  File "<ipython-input-7-3dafdf6b87bd>", line 2, in <module>
    sheet_name='Dolci', encoding='iso-8859-1')
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\core\frame.py", line 1466, in to_excel
    excel_writer.save()
  File "C:\ProgramData\Anaconda2\lib\site-packages\pandas\io\excel.py", line 1502, in save
    return self.book.close()
  File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\workbook.py", line 299, in close
    self._store_workbook()
  File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\workbook.py", line 607, in _store_workbook
    xml_files = packager._create_package()
  File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\packager.py", line 139, in _create_package
    self._write_shared_strings_file()
  File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\packager.py", line 286, in _write_shared_strings_file
    sst._assemble_xml_file()
  File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\sharedstrings.py", line 53, in _assemble_xml_file
    self._write_sst_strings()
  File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\sharedstrings.py", line 83, in _write_sst_strings
    self._write_si(string)
  File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\sharedstrings.py", line 110, in _write_si
    self._xml_si_element(string, attributes)
  File "C:\ProgramData\Anaconda2\lib\site-packages\xlsxwriter\xmlwriter.py", line 122, in _xml_si_element
    self.fh.write("""<si><t%s>%s</t></si>""" % (attr, string))
  File "C:\ProgramData\Anaconda2\lib\codecs.py", line 706, in write
    return self.writer.write(data)
  File "C:\ProgramData\Anaconda2\lib\codecs.py", line 369, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 11: ordinal not in range(128)
The code I use is this:
df.to_excel('my_file.xlsx', sheet_name='Dolci', encoding='iso-8859-1')
but it doesn't work. I have even tried:
df.to_excel('my_file.xlsx', sheet_name='Dolci', encoding='utf-8')
but it still gives me an error.
Can somebody help me on this issue?
It seems like you're using the xlsxwriter engine in ExcelWriter.
Try using openpyxl instead.
writer = pd.ExcelWriter('file_name.xlsx', engine='openpyxl')
df.to_excel(writer)
writer.save()
There is an essential engine param of the to_excel method; try
df.to_excel('filename.xlsx', engine='openpyxl')
It works for me.
Adding to @Vadym's response, you may have to close your writer to get the file created.
writer = pd.ExcelWriter(xlPath, engine='openpyxl')
df.to_excel(writer)
writer.close()
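Combining the answers above, the context-manager form of ExcelWriter saves and closes in one step; a sketch assuming pandas with openpyxl installed (note that recent pandas versions dropped to_excel's encoding argument entirely, so leaving it out is the forward-compatible choice):

```python
import os

import pandas as pd

df = pd.DataFrame({'dolce': ['tiramis\u00f9', 'caff\u00e8']})

# The with-block calls writer.close() for us, which writes the file.
with pd.ExcelWriter('my_file.xlsx', engine='openpyxl') as writer:
    df.to_excel(writer, sheet_name='Dolci', index=False)

print(os.path.exists('my_file.xlsx'))  # True
```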
"depends on the behaviour of the used engine"
See:
https://github.com/pandas-dev/pandas/issues/9145
This should be a comment but I don't have the rep...
I'm trying to use the following code (within web2py) to read a csv file and convert it into a json object:
import csv
import json
from StringIO import StringIO  # Python 2

originalfilename, file_stream = db.tablename.file.retrieve(info.file)
file_contents = file_stream.read()
csv_reader = csv.DictReader(StringIO(file_contents))
json = json.dumps([x for x in csv_reader])
This produces the following error:
'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Apparently, there is a problem handling the spaces in the .csv file. The problem appears to stem from the json.dumps() line. The traceback from that point on:
Traceback (most recent call last):
File ".../web2py/gluon/restricted.py", line 212, in restricted
exec ccode in environment
File ".../controllers/default.py", line 2345, in <module>
File ".../web2py/gluon/globals.py", line 194, in <lambda>
self._caller = lambda f: f()
File ".../web2py/gluon/tools.py", line 3021, in f
return action(*a, **b)
File ".../controllers/default.py", line 697, in generate_vis
request.vars.json = json.dumps(list(csv_reader))
File "/usr/local/lib/python2.7/json/__init__.py", line 243, in dumps
return _default_encoder.encode(obj)
File "/usr/local/lib/python2.7/json/encoder.py", line 207, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python2.7/json/encoder.py", line 270, in iterencode
return _iterencode(o, 0)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
Any suggestions regarding how to resolve this, or another way to get a csv file (which contains a header; using StringIO) into a json object that won't produce similar complications? Thank you.
The csv module (under Python 2) is purely byte-based; all strings you get out of it are bytes. However JSON is Unicode character-based, so there is an implicit conversion when you try to write out the bytes you got from CSV into JSON. Python guessed UTF-8 for this, but your CSV file wasn't UTF-8 - it was probably Windows code page 1252 (Western European - like ISO-8859-1 only not quite).
A quick fix would be to transcode your input (file_contents= file_contents.decode('windows-1252').encode('utf-8')), but probably you don't really want to rely on json guessing a particular encoding.
Best would be to explicitly decode your strings at the point of reading them from CSV. Then JSON will be able to cope with them OK. Unfortunately csv doesn't have built-in decoding (at least in this Python version), but you can do it manually:
class UnicodeDictReader(csv.DictReader):
    def __init__(self, f, encoding, *args, **kwargs):
        csv.DictReader.__init__(self, f, *args, **kwargs)
        self.encoding = encoding

    def next(self):
        return {
            k.decode(self.encoding): v.decode(self.encoding)
            for (k, v) in csv.DictReader.next(self).items()
        }

csv_reader = UnicodeDictReader(StringIO(file_contents), 'windows-1252')
json_output = json.dumps(list(csv_reader))
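Under Python 3 the whole dance collapses, because csv operates on text: decode the bytes once, up front, with the file's real encoding. A sketch of the same pipeline (the sample bytes are mine, chosen to reproduce the 0xa0 byte from the question):

```python
import csv
import io
import json

# Bytes as they might arrive from the upload: 0xa0 is a non-breaking
# space in Windows-1252 but an invalid start byte in UTF-8.
file_contents = b"name,price\ncaf\xe9,\xa01.50\n"

# Decode once, explicitly, with the file's real encoding; from then on
# csv and json only ever see text.
text = file_contents.decode('windows-1252')
csv_reader = csv.DictReader(io.StringIO(text))
out = json.dumps([dict(row) for row in csv_reader], ensure_ascii=False)
print(out)
```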
it's not known in advance what sort of encoding will come up
Well, that's more of a problem, since it's impossible to guess accurately what encoding a file is in. You would either have to pick a particular encoding, or give the user a way to signal what the encoding is, if you want to support non-ASCII characters properly.
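If the encoding genuinely can't be known in advance, a common compromise (a heuristic sketch of mine, not from the original answers) is a fallback chain: try UTF-8 first, then a likely single-byte encoding, ending with one that can never fail:

```python
def decode_with_fallback(data, encodings=('utf-8', 'windows-1252', 'latin-1')):
    """Return (text, encoding_used) for the first encoding that decodes.

    latin-1 maps every possible byte, so as the last resort it never
    raises -- though it may mislabel the bytes' true meaning.
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue

print(decode_with_fallback(b'caf\xc3\xa9'))  # valid UTF-8 -> ('café', 'utf-8')
print(decode_with_fallback(b'caf\xe9'))      # falls back -> ('café', 'windows-1252')
```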
Try replacing your final line with
json = json.dumps([{k.decode('windows-1252'): v.decode('windows-1252') for k, v in x.items()} for x in csv_reader])
DictReader yields dicts of byte strings, so each key and value has to be decoded individually before json can serialize it.
Running unidecode over the file contents seems to do the trick:
from isounidecode import unidecode
...
file_contents = unidecode(file_stream.read())
...
Thanks, everyone!
I am trying to read a tgz file and write it to couchdb.
Here is the code:
import couchdb
conn = couchdb.Server('http://localhost:5984')
db = conn['test']
with open('/tmp/test.txt.tgz.enc') as f:
    data = f.read()

doc = {'file': data}
db.save(doc)
it fails with
Traceback (most recent call last):
File "<stdin>", line 4, in <module>
File "/usr/local/lib/python2.7/dist-packages/couchdb/client.py", line 407, in save
_, _, data = func(body=doc, **options)
File "/usr/local/lib/python2.7/dist-packages/couchdb/http.py", line 399, in post_json
status, headers, data = self.post(*a, **k)
File "/usr/local/lib/python2.7/dist-packages/couchdb/http.py", line 381, in post
**params)
File "/usr/local/lib/python2.7/dist-packages/couchdb/http.py", line 419, in _request
credentials=self.credentials)
File "/usr/local/lib/python2.7/dist-packages/couchdb/http.py", line 176, in request
body = json.encode(body).encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 11: ordinal not in range(128)
still googling around to find a solution myself.
Alright, I solved it. I double-checked the documentation: there is a put_attachment function, but it requires that the document you assign the attachment to already be created upfront.
Code example, just in case somebody else needs it:
import couchdb

conn = couchdb.Server('http://localhost:5984')
db = conn['test1']

doc = {'name': 'testfile'}
db.save(doc)

with open('/tmp/test.txt.tgz.enc', 'rb') as f:
    db.put_attachment(doc, f, filename="test.txt.tgz")
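Alternatively, CouchDB accepts inline attachments: base64-encode the bytes into an `_attachments` stub inside the document itself, which sidesteps the JSON/bytes mismatch entirely. A stdlib-only sketch of building such a document (saving it would still be the same `db.save(doc)` call; the sample bytes are placeholders):

```python
import base64
import json

# Raw bytes of the tarball; placeholder content for illustration.
data = b'\x1f\x8b\x08\x00 not really a tarball'

doc = {
    'name': 'testfile',
    '_attachments': {
        'test.txt.tgz': {
            'content_type': 'application/octet-stream',
            # base64 turns arbitrary bytes into ASCII text, so the
            # document can be serialized as JSON without decode errors.
            'data': base64.b64encode(data).decode('ascii'),
        }
    }
}

body = json.dumps(doc)  # now pure text: serializes cleanly
```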
OK, I got it. See the example below: couch.create('test1') creates a database named test1; doc = {'name': 'testfile'} is the key-value pair; f = open('/home/yamunapriya/pythonpractices/addd.py', 'r') opens the file in read mode; db.save(doc) saves the document; and db.put_attachment(doc, f, filename=...) takes the document (the key-value pair), the open file object, and the attachment filename.
import couchdb

couch = couchdb.Server()
db = couch.create('test1')
doc = {'name': 'testfile'}
f = open('/home/yamunapriya/pythonpractices/addd.py', 'r')
db.save(doc)
db.put_attachment(doc, f, filename="/home/yamunapriya/pythonpractices/addd.py")
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
UnicodeDecodeError when passing GET data in Python/AppEngine
I get the following traceback locally and in production when trying to submit a form. Can you explain where I should look, or should I start adding debugging statements to find where in the code the exception occurs?
Traceback (most recent call last):
File "/media/Lexar/montao/google/appengine/tools/dev_appserver.py", line 3858, in _HandleRequest
self._Dispatch(dispatcher, self.rfile, outfile, env_dict)
File "/media/Lexar/montao/google/appengine/tools/dev_appserver.py", line 3792, in _Dispatch
base_env_dict=env_dict)
File "/media/Lexar/montao/google/appengine/tools/dev_appserver.py", line 580, in Dispatch
base_env_dict=base_env_dict)
File "/media/Lexar/montao/google/appengine/tools/dev_appserver.py", line 2918, in Dispatch
self._module_dict)
File "/media/Lexar/montao/google/appengine/tools/dev_appserver.py", line 2822, in ExecuteCGI
reset_modules = exec_script(handler_path, cgi_path, hook)
File "/media/Lexar/montao/google/appengine/tools/dev_appserver.py", line 2704, in ExecuteOrImportScript
script_module.main()
File "/media/Lexar/montao/classifiedsmarket/main.py", line 2497, in main
util.run_wsgi_app(application)
File "/media/Lexar/montao/google/appengine/ext/webapp/util.py", line 98, in run_wsgi_app
run_bare_wsgi_app(add_wsgi_middleware(application))
File "/media/Lexar/montao/google/appengine/ext/webapp/util.py", line 116, in run_bare_wsgi_app
result = application(env, _start_response)
File "/media/Lexar/montao/google/appengine/ext/webapp/__init__.py", line 655, in __call__
response.wsgi_write(start_response)
File "/media/Lexar/montao/google/appengine/ext/webapp/__init__.py", line 274, in wsgi_write
body = self.out.getvalue()
File "/usr/lib/python2.6/StringIO.py", line 270, in getvalue
self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)
The error manifested itself in /usr/lib/python2.6/StringIO.py i.e. the Python StringIO module. We don't need to read too far into that source file (line 49) to find this warning:
The StringIO object can accept either Unicode or 8-bit strings, but
mixing the two may take some care. If both are used, 8-bit strings that
cannot be interpreted as 7-bit ASCII (that use the 8th bit) will
cause a UnicodeError to be raised when getvalue() is called.
Bingo! And the warning is repeated again in the getvalue() method. Note that the warning is ancient; it mentions UnicodeError instead of UnicodeDecodeError, but you get the drift.
I'd suggest patching the module so that it displays what's in the bag when the error happens. Wrap up the offending statement at line 270 like this:
if self.buflist:
    try:
        self.buf += ''.join(self.buflist)
    except UnicodeDecodeError:
        import sys
        print >> sys.stderr, "*** error context: buf=%r buflist=%r" % (self.buf, self.buflist)
        raise
    self.buflist = []
return self.buf
If the idea of patching a Python-supplied module in situ horrifies you, put the patched version in a directory that's earlier in sys.path than /usr/lib/python2.6.
Here's an example of mixing non-ASCII str and unicode:
>>> from StringIO import StringIO
>>> f = StringIO()
>>> f.write('ascii')
>>> f.write(u'\u1234'.encode('utf8'))
>>> f.write(u'\u5678')
>>> f.getvalue()
*** error context: buf='' buflist=['ascii', '\xe1\x88\xb4', u'\u5678']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\python26\lib\StringIO.py", line 271, in getvalue
self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe1 in position 0: ordinal not in range(128)
>>>
Then you can run your application and look at what is in buflist: which parts are data that you wrote, and which are provided by GAE. You need to look at the GAE docs to see whether it is expecting str contents (with what encoding?) or unicode contents, and adjust your code accordingly.
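For comparison, Python 3's io.StringIO refuses bytes at write() time instead of deferring a confusing failure to getvalue(), which is why this whole class of bug disappeared there:

```python
import io

f = io.StringIO()
f.write('ascii')
f.write('\u1234')  # any text is fine

# Bytes are rejected immediately with a TypeError, long before
# getvalue() -- the mixing can never silently accumulate.
try:
    f.write(b'\xe1\x88\xb4')
    rejected = False
except TypeError:
    rejected = True

print(rejected, f.getvalue())
```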