Extract gzip file without BOM in Python 3.6

I have multiple .gz files in subfolders that I want to unzip into one folder. The extraction works fine, but there is a BOM signature at the beginning of each output file that I would like removed. I have checked other questions, such as Removing BOM from gzip'ed CSV in Python and Convert UTF-8 with BOM to UTF-8 with no BOM in Python, but their answers don't seem to work here. I use Python 3.6 in PyCharm on Windows.
First, here is my code without any removal attempt:
import gzip
import pickle
import glob

def save_object(obj, filename):
    with open(filename, 'wb') as output:  # Overwrites any existing file.
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

output_path = 'path_out'
i = 1
for filename in glob.iglob('path_in/**/*.gz', recursive=True):
    print(filename)
    with gzip.open(filename, 'rb') as f:  # the with block closes f; no explicit close needed
        file_content = f.read()
        new_file = output_path + "z" + str(i) + ".txt"
        save_object(file_content, new_file)
    i += 1
Now, following the logic defined in Removing BOM from gzip'ed CSV in Python (at least as I understand it), if I replace file_content = f.read() with file_content = csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines()), I get:
TypeError: can't pickle _csv.reader objects
I checked for this error (e.g. "Can't pickle <type '_csv.reader'>" error when using multiprocessing on Windows) but found no solution I could apply.
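A note on that TypeError: a csv.reader is a lazy iterator, which pickle cannot serialize. One way to sidestep it, assuming the parsed rows are what should be saved, is to materialize the reader into a list first; a sketch:
import csv
import gzip

filename = 'path_in/example.gz'  # placeholder
with gzip.open(filename, 'rb') as f:
    # csv.reader objects can't be pickled, but a plain list of rows can
    rows = list(csv.reader(f.read().decode('utf-8-sig').splitlines()))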

A minor adaptation of the very first question you link to trivially works.
tripleee$ cat bomgz.py
import gzip
from subprocess import run

with open('bom.txt', 'w') as handle:
    handle.write('\ufeffmoo!\n')
run(['gzip', 'bom.txt'])

with gzip.open('bom.txt.gz', 'rb') as f:
    file_content = f.read().decode('utf-8-sig')
with open('nobom.txt', 'w') as output:
    output.write(file_content)
tripleee$ python3 bomgz.py
tripleee$ gzip -dc bom.txt.gz | xxd
00000000: efbb bf6d 6f6f 210a                      ...moo!.
tripleee$ xxd nobom.txt
00000000: 6d6f 6f21 0a                             moo!.
The pickle parts didn't seem relevant here, but they might have been obscuring the actual goal: getting a block of decoded str out of an encoded blob of bytes.
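Transplanted back into the loop from the question, the decode step could look like this sketch (path_in/path_out are the question's placeholders; writing plain UTF-8 text files instead of pickles is an assumption about the desired output):
import glob
import gzip

output_path = 'path_out'
i = 1
for filename in glob.iglob('path_in/**/*.gz', recursive=True):
    with gzip.open(filename, 'rb') as f:
        # utf-8-sig strips a leading BOM if present, and is harmless otherwise
        file_content = f.read().decode('utf-8-sig')
    with open(output_path + "z" + str(i) + ".txt", 'w', encoding='utf-8') as out:
        out.write(file_content)
    i += 1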

Related

Python: how to compress lists to a zip file [duplicate]

I want to write a file. Based on the name of the file this may or may not be compressed with the gzip module. Here is my code:
import gzip

filename = 'output.gz'
opener = gzip.open if filename.endswith('.gz') else open

with opener(filename, 'wb') as fd:
    print('blah blah blah'.encode(), file=fd)
I'm opening the writable file in binary mode and encoding my string to be written. However, I get the following error:
File "/usr/lib/python3.5/gzip.py", line 258, in write
data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'
Why is my object not a bytes? I get the same error if I open the file with 'w' and skip the encoding step. I also get the same error if I remove the '.gz' from the filename.
I'm using Python 3.5 on Ubuntu 16.04.
For me, changing the gzip mode to 'wt' did the job. I could write the original string without "byting" it.
(Tested on Python 3.5 and 3.7 on Ubuntu 16.)
From the Python 3 gzip docs: "... The mode argument can be any of 'r', 'rb', 'a', 'ab', 'w', 'wb', 'x' or 'xb' for binary mode, or 'rt', 'at', 'wt', or 'xt' for text mode ..."
import gzip

filename = 'output.gz'
opener = gzip.open if filename.endswith('.gz') else open

with opener(filename, 'wt') as fd:
    print('blah blah blah', file=fd)

!zcat output.gz
> blah blah blah
Alternatively, you can convert the string to bytes yourself:
import gzip

with gzip.open(filename, 'wb') as fd:
    fd.write('blah blah blah'.encode('utf-8'))
print is a relatively complex function. It writes str to a file, but not the str you pass: it writes the str that results from rendering its parameters.
If you already have bytes, you can use fd.write(bytes) directly, taking care of adding a newline if you need it.
If you don't have bytes, make sure fd is opened to receive text.
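A minimal sketch contrasting the two paths ('output.gz' is a placeholder name):
import gzip

# path 1: binary handle, pre-encoded bytes, explicit newline
with gzip.open('output.gz', 'wb') as fd:
    fd.write('blah blah blah'.encode('utf-8') + b'\n')

# path 2: text handle, let print render the value and append the newline
with gzip.open('output.gz', 'wt') as fd:
    print('blah blah blah', file=fd)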
You can serialize it using pickle: first serialize the object with pickle, then compress the result with gzip.
To save the object:
import gzip, pickle

filename = 'non-serialize_object.zip'
# serialize the object (obj is whatever you want to save; the original
# snippet used the name object, which shadows a builtin)
serialized_obj = pickle.dumps(obj)
# write the gzip file
with gzip.open(filename, 'wb') as f:
    f.write(serialized_obj)
To load the object:
import gzip, pickle

filename = 'non-serialize_object.zip'
with gzip.open(filename, 'rb') as f:
    serialized_obj = f.read()
# de-serialize the object
obj = pickle.loads(serialized_obj)
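A quick round-trip check of the two snippets above (the sample dict is arbitrary):
import gzip, pickle

data = {'a': 1, 'b': [2, 3]}
with gzip.open('demo.zip', 'wb') as f:
    f.write(pickle.dumps(data))
with gzip.open('demo.zip', 'rb') as f:
    assert pickle.loads(f.read()) == data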

Converting a .csv.gz to .csv in Python 2.7

I have read the documentation and a few additional posts on SO and other various places, but I can't quite figure out this concept:
When you call csvFilename = gzip.open(filename, 'rb') then reader = csv.reader(open(csvFilename)), is that reader not a valid csv file?
I am trying to solve the problem outlined below, and am getting a coercing to Unicode: need string or buffer, GzipFile found error on lines 41 and 7 (marked below), leading me to believe that gzip.open and csv.reader do not work the way I had previously thought.
Problem I am trying to solve
I am trying to take a results.csv.gz and convert it to a results.csv so that I can turn the results.csv into a python dictionary and then combine it with another python dictionary.
File 1:
alertFile = payload.get('results_file')
alertDataCSV = rh.dataToDict(alertFile) # LINE 41
alertDataTotal = rh.mergeTwoDicts(splunkParams, alertDataCSV)
Calls File 2:
import gzip
import csv

def dataToDict(filename):
    csvFilename = gzip.open(filename, 'rb')
    reader = csv.reader(open(csvFilename))  # LINE 7
    alertData = {}
    for row in reader:
        alertData[row[0]] = row[1:]
    return alertData

def mergeTwoDicts(dictA, dictB):
    dictC = dictA.copy()
    dictC.update(dictB)
    return dictC
*edit: also forgive my non-PEP style of naming in Python
gzip.open returns a file-like object (the same kind plain open returns), not the name of a decompressed file. Simply pass the result directly to csv.reader and it will work (csv.reader will receive the decompressed lines). csv does expect text though, so on Python 3 you need to open it in text mode (on Python 2 'rb' is fine; the gzip module doesn't deal with encodings there, but then, neither does the csv module). Simply change:
csvFilename = gzip.open(filename, 'rb')
reader = csv.reader(open(csvFilename))
to:
# Python 2
csvFile = gzip.open(filename, 'rb')
reader = csv.reader(csvFile)  # No reopening involved

# Python 3
csvFile = gzip.open(filename, 'rt', newline='')  # Open in text mode, not binary, no line ending translation
reader = csv.reader(csvFile)  # No reopening involved
The following worked for me on python==3.7.9:
import gzip

my_filename = 'my_compressed_file.csv.gz'

with gzip.open(my_filename, 'rt') as gz_file:
    data = gz_file.read()  # read decompressed data
with open(my_filename[:-3], 'wt') as out_file:
    out_file.write(data)  # write decompressed data
my_filename[:-3] strips the trailing '.gz' to recover the actual filename, so the output doesn't get a made-up name.
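If the file is too large to read into memory at once, a streaming variation with shutil.copyfileobj would be one alternative (a sketch, not part of the original answer):
import gzip
import shutil

my_filename = 'my_compressed_file.csv.gz'
# stream the decompressed data chunk by chunk instead of reading it all
with gzip.open(my_filename, 'rt') as gz_file, \
        open(my_filename[:-3], 'wt') as out_file:
    shutil.copyfileobj(gz_file, out_file)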

Merging multiple JSONs into one: TypeError, a bytes-like object is required, not 'str'

So, I am trying to write a small program in Python 3.6 to merge multiple JSON files (17k of them), and I am getting the above error.
I put the script together by reading other SO Q&As. I played around with it a bit, getting various errors; nevertheless, I couldn't get it to work. Here is my code:
# -*- coding: utf-8 -*-
import glob
import json
import os
import sys

def merge_json(source, target):
    os.chdir(source)
    read_files = glob.glob("*.json")
    result = []
    i = 0
    for f in glob.glob("*.json"):
        print("Files merged so far:" + str(i))
        with open(f, "rb") as infile:
            print("Appending file:" + f)
            result.append(json.load(infile))
        i = i + 1
    output_folder = os.path.join(target, "mergedJSON")
    output_folder = os.path.join(output_folder)
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    os.chdir(output_folder)
    with open("documents.json", "wb") as outfile:
        json.dump(result, outfile)

try:
    sys.argv[1], sys.argv[2]
except:
    sys.exit("\n\n Error: Missing arguments!\n See usage example:\n\n python merge_json.py {JSON source directory} {output directory} \n\n")

merge_json(sys.argv[1], sys.argv[2])
In your case you are opening the output file in 'wb' mode, which means it accepts bytes-like objects only, but json.dump tries to write a str to it. Simply change the open mode from 'wb' to 'w' (text mode) and it will work.
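Applied to the script above, the change is the open mode on the output file; a self-contained sketch:
import json

result = [{"example": 1}]  # stands in for the list built by merge_json
# text mode: json.dump writes str, so the handle must accept str
with open("documents.json", "w") as outfile:
    json.dump(result, outfile)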

I'm trying to encode csv file to utf8 using python

I'm using Python to read many files and re-encode them to UTF-8. I tried it with the code below:
import os
from os import listdir

def find_csv_filenames(path_to_dir, suffix=".csv"):
    path_to_dir = os.path.normpath(path_to_dir)
    filenames = listdir(path_to_dir)
    # Check *.csv files in the directory
    fp = lambda f: not os.path.isdir(path_to_dir + "/" + f) and f.endswith(suffix)
    return [path_to_dir + "/" + fname for fname in filenames if fp(fname)]

def convert_files(files, ascii, to="utf-8"):
    count = 0
    lineno = 0
    for name in files:
        lineno = lineno + 1
        with open(name) as f:
            file_target = open(name, mode='r', encoding='latin-1')
            file_content = file_target.read()
            file_target.close
            print(lineno)
            file_source = open("./csv/data{}.csv".format(lineno), mode='w', encoding='utf-8')
            file_source.write(file_content)

csv_files = find_csv_filenames('./csv', ".csv")
convert_files(csv_files, "cp866")
The problem is that after I read the data and write it to other files with UTF-8 encoding set, it still does not work.
Before you open a file whose encoding is not clear, you can use chardet to detect the encoding rather than opening the file with a guessed one. Note that chardet.detect expects raw bytes, not a path, so read the file in binary mode first:
>>> import chardet
>>> with open('PATH/TO/FILE', 'rb') as f:
...     encoding = chardet.detect(f.read())['encoding']
Then open the file with the detected encoding and write the contents into a file opened with 'utf-8' encoding.
If you're not sure whether the result really is 'utf-8'-encoded, you can use enca in a Linux shell to check whether the file's encoding is 'ASCII' or 'utf-8':
$ enca FILENAME
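Putting the two steps together, a minimal sketch of the full flow (the filenames are placeholders):
import chardet

with open('input.csv', 'rb') as f:
    raw = f.read()
# chardet may return None for empty or ambiguous input; a real script should guard for that
encoding = chardet.detect(raw)['encoding']

with open('output.csv', 'w', encoding='utf-8') as out:
    out.write(raw.decode(encoding))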

How do I automatically handle decompression when reading a file in Python?

I am writing some Python code that loops through a number of files and processes the first few hundred lines of each file. I would like to extend this code so that if any of the files in the list are compressed, it will automatically decompress while reading them, so that my code always receives the decompressed lines. Essentially my code currently looks like:
for f in files:
    handle = open(f)
    process_file_contents(handle)
Is there any function that can replace open in the above code so that if f is either plain text or gzip-compressed text (or bzip2, etc.), the function will always return a file handle to the decompressed contents of the file? (No seeking required, just sequential access.)
I had the same problem: I'd like my code to accept filenames and return a file handle to be used with with, with decompression handled automatically.
In my case, I'm willing to trust the filename extensions, and I only need to deal with gzip and maybe bzip2 files.
import gzip
import bz2

def open_by_suffix(filename):
    if filename.endswith('.gz'):
        return gzip.open(filename, 'rb')
    elif filename.endswith('.bz2'):
        return bz2.BZ2File(filename, 'r')  # note: BZ2File, capital F
    else:
        return open(filename, 'r')
If we don't trust the filenames, we can compare the initial bytes of the file for magic strings (modified from https://stackoverflow.com/a/13044946/117714):
import gzip
import bz2

magic_dict = {
    b"\x1f\x8b\x08": (gzip.open, 'rb'),
    b"\x42\x5a\x68": (bz2.BZ2File, 'r'),
}
max_len = max(len(x) for x in magic_dict)

def open_by_magic(filename):
    # read the first few bytes in binary mode to compare against the magic numbers
    with open(filename, 'rb') as f:
        file_start = f.read(max_len)
    for magic, (fn, flag) in magic_dict.items():
        if file_start.startswith(magic):
            return fn(filename, flag)
    return open(filename, 'r')
Usage:
# cat-like loop over a list of filenames
for filename in filenames:
    with open_by_suffix(filename) as f:
        for line in f:
            print(line)
Your use-case would look like:
for f in files:
    with open_by_suffix(f) as handle:
        process_file_contents(handle)
