Convert bytes object to string object in Python

python code
#!python3
import sys
import os.path
import codecs

if not os.path.exists(sys.argv[1]):
    print("File does not exist: " + sys.argv[1])
    sys.exit(1)

file_name = sys.argv[1]
with codecs.open(file_name, 'rb', errors='ignore') as file:
    file_contents = file.readlines()

for line_content in file_contents:
    print(type(line_content))
    line_content = codecs.decode(line_content)
    print(line_content)
    print(type(line_content))
File content: Log.txt
b'\x03\x00\x00\x00\xc3\x8a\xc3\xacRb\x00\x00\x00\x00042284899:ATBADSFASF:DSF456582:US\r\n1'
Output:
python3 file_convert.py Log.txt
<class 'bytes'>
b'\x03\x00\x00\x00\xc3\x8a\xc3\xacRb\x00\x00\x00\x00042284899:ATBADSFASF:DSF456582:US\r\n1'
<class 'str'>
I tried all of the methods below:
line_content = line_content.decode('UTF-8')
line_content = line_content.decode()
line_content = codecs.decode(line_content, 'UTF-8')
Is there any other way to handle this?
The line_content variable still holds the byte data and only the type changes to str, which is kind of confusing.

The data in Log.txt is the string representation of a Python bytes object. That is odd, but we can deal with it. Since it is a bytes literal, evaluate it, which converts it to a real Python bytes object. There is still the question of what its encoding is.
I don't see any advantage to using codecs.open. That's a way to read Unicode files in Python 2.7, not usually needed in Python 3. Guessing UTF-8, your code would be
#!python3
import sys
import os
import ast

if not os.path.exists(sys.argv[1]):
    print("File does not exist: " + sys.argv[1])
    sys.exit(1)

file_name = sys.argv[1]
with open(file_name) as file:
    file_contents = file.readlines()

for line_content in file_contents:
    print(type(line_content))
    line_content = ast.literal_eval(line_content).decode("utf-8")
    print(line_content)
    print(type(line_content))
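To illustrate the key step in isolation, here is a tiny standalone sketch (the literal below is a shortened, hypothetical stand-in for a line from Log.txt, and UTF-8 is assumed, as in the code above):
import ast

raw_line = "b'042284899:ATBADSFASF:DSF456582:US\\r\\n'"  # textual repr of a bytes object, as stored in the file
real_bytes = ast.literal_eval(raw_line)                   # evaluates the literal into an actual bytes object
text = real_bytes.decode('utf-8')                          # encoding is a guess (UTF-8)
print(type(real_bytes), type(text))
print(text)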

I think it's a list, not a string. Whenever you see a byte string that starts with \ (backslash) escapes, it could potentially be a list.
Try this:
decoded_line_content = list(line_content)

Related

Extract gzip file without BOM in Python 3.6

I have multiple gz files in subfolders that I want to unzip into one folder. It works fine, but there's a BOM signature at the beginning of each file that I would like removed. I have checked other questions like Removing BOM from gzip'ed CSV in Python or Convert UTF-8 with BOM to UTF-8 with no BOM in Python, but it doesn't seem to work. I use Python 3.6 in PyCharm on Windows.
Here is my code first, without any attempt at removing the BOM:
import gzip
import pickle
import glob

def save_object(obj, filename):
    with open(filename, 'wb') as output:  # Overwrites any existing file.
        pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

output_path = 'path_out'
i = 1
for filename in glob.iglob('path_in/**/*.gz', recursive=True):
    print(filename)
    with gzip.open(filename, 'rb') as f:
        file_content = f.read()
    new_file = output_path + "z" + str(i) + ".txt"
    save_object(file_content, new_file)
    f.close()
    i += 1
Now, with the logic defined in Removing BOM from gzip'ed CSV in Python (at least what I understand of it), if I replace file_content = f.read() with file_content = csv.reader(f.read().decode('utf-8-sig').encode('utf-8').splitlines()), I get:
TypeError: can't pickle _csv.reader objects
I checked for this error (e.g. "Can't pickle <type '_csv.reader'>" error when using multiprocessing on Windows) but I found no solution I could apply.
A minor adaptation of the very first question you link to trivially works.
tripleee$ cat bomgz.py
import gzip
from subprocess import run

with open('bom.txt', 'w') as handle:
    handle.write('\ufeffmoo!\n')
run(['gzip', 'bom.txt'])

with gzip.open('bom.txt.gz', 'rb') as f:
    file_content = f.read().decode('utf-8-sig')

with open('nobom.txt', 'w') as output:
    output.write(file_content)
tripleee$ python3 bomgz.py
tripleee$ gzip -dc bom.txt.gz | xxd
00000000: efbb bf6d 6f6f 210a                      ...moo!.
tripleee$ xxd nobom.txt
00000000: 6d6f 6f21 0a                             moo!.
The pickle parts didn't seem relevant here but might have been obscuring the goal of getting a block of decoded str out of an encoded blob of bytes.
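Applied back to the original loop, a rough sketch (it keeps the asker's placeholder paths 'path_in' and 'path_out', and assumes writing plain text files instead of pickling is acceptable):
import glob
import gzip

output_path = 'path_out'
i = 1
for filename in glob.iglob('path_in/**/*.gz', recursive=True):
    with gzip.open(filename, 'rb') as f:
        # utf-8-sig strips a leading BOM if one is present
        file_content = f.read().decode('utf-8-sig')
    new_file = output_path + "z" + str(i) + ".txt"
    with open(new_file, 'w', encoding='utf-8') as out:
        out.write(file_content)
    i += 1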

Writing yaml file: attribute error

I'm trying to read a YAML file, replace part of it, and write the result back into the same file, but I get an attribute error.
Code
import yaml
import glob
import re
from yaml import load, dump
from yaml import CLoader as Loader, CDumper as Dumper
import io

list_paths = glob.glob("my_path/*.yaml")
for path in list_paths:
    with open(path, 'r') as stream:
        try:
            text = load(stream, Loader=Loader)
            text = str(text)
            print text
            if "my_string" in text:
                start = "'my_string': '"
                end = "'"
                m = re.compile(r'%s.*?%s' % (start, end), re.S)
                m = m.search(text).group(0)
                text[m] = "'my_string': 'this is my string'"
        except yaml.YAMLError as exc:
            print(exc)
    with io.open(path, 'w', encoding='utf8') as outfile:
        yaml.dump(text, path, default_flow_style=False, allow_unicode=True)
Error
I get this error for the yaml.dump line:
AttributeError: 'str' object has no attribute 'write'
What I have tried so far
Not converting the text to a string, but then I get an error on the m.search line:
TypeError: expected string or buffer
Converting first to a string and then back to a dict, but then dict(text) raises: ValueError: dictionary update sequence element #0 has length 1; 2 is required
Yaml file
my string: something
string2: something else
Expected result: yaml file
my string: this is my string
string2: something else
To stop getting that error all you need to do is change the
with io.open(path, 'w', encoding = 'utf8') as outfile:
    yaml.dump(text, path, default_flow_style=False, allow_unicode=True)
to
with open(path, 'w') as outfile:
    yaml.dump(text.encode("UTF-8"), outfile, default_flow_style=False, allow_unicode=True)
As the other answer says, this solution simply replaces the string path with the open file descriptor.
This
yaml.dump(text, path, default_flow_style=False, allow_unicode=True)
is not possible if path is a str. It must be an open file.
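Putting both points together, a hedged sketch of the whole round trip (it keeps the loaded document as a dict rather than converting it to str, assumes the key is literally 'my string' as in the sample file, and uses Python 3 syntax):
import glob
import yaml

for path in glob.glob("my_path/*.yaml"):
    with open(path) as stream:
        data = yaml.safe_load(stream)          # a dict, no str() conversion needed

    if 'my string' in data:
        data['my string'] = 'this is my string'

    with open(path, 'w', encoding='utf-8') as outfile:
        # pass the open file object, not the path string
        yaml.dump(data, outfile, default_flow_style=False, allow_unicode=True)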

Python3 write gzip file - memoryview: a bytes-like object is required, not 'str'

I want to write a file. Based on the name of the file this may or may not be compressed with the gzip module. Here is my code:
import gzip
filename = 'output.gz'
opener = gzip.open if filename.endswith('.gz') else open
with opener(filename, 'wb') as fd:
    print('blah blah blah'.encode(), file=fd)
I'm opening the writable file in binary mode and encoding my string to be written. However I get the following error:
File "/usr/lib/python3.5/gzip.py", line 258, in write
data = memoryview(data)
TypeError: memoryview: a bytes-like object is required, not 'str'
Why is my object not a bytes? I get the same error if I open the file with 'w' and skip the encoding step. I also get the same error if I remove the '.gz' from the filename.
I'm using Python 3.5 on Ubuntu 16.04.
For me, changing the gzip mode flag to 'wt' did the job. I could write the original string without "byting" it (tested on Python 3.5 and 3.7 on Ubuntu 16).
From the Python 3 gzip docs, quoting: "... The mode argument can be any of 'r', 'rb', 'a', 'ab', 'w', 'wb', 'x' or 'xb' for binary mode, or 'rt', 'at', 'wt', or 'xt' for text mode..."
import gzip
filename = 'output.gz'
opener = gzip.open if filename.endswith('.gz') else open
with opener(filename, 'wt') as fd:
    print('blah blah blah', file=fd)
!zcat output.gz
> blah blah blah
You can convert it to bytes like this:
import gzip
with gzip.open(filename, 'wb') as fd:
    fd.write('blah blah blah'.encode('utf-8'))
print is a relatively complex function. It writes str to a file, but not the str you pass it: it writes the str that results from rendering its parameters.
If you already have bytes, you can call fd.write(bytes) directly and take care of adding a newline if you need one.
If you don't have bytes, make sure fd is opened to receive text.
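To make the two options concrete, a minimal sketch (gzip output and the question's filename are assumed):
import gzip

# Option 1: binary mode, write the bytes yourself and add the newline manually
with gzip.open('output.gz', 'wb') as fd:
    fd.write('blah blah blah'.encode('utf-8') + b'\n')

# Option 2: text mode, let print() render the str and append the newline
with gzip.open('output.gz', 'wt') as fd:
    print('blah blah blah', file=fd)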
You can serialize it using pickle.
First serialize the object to be written with pickle, then compress it with gzip.
To save the object:
import gzip, pickle
filename = 'non-serialize_object.zip'
# serialize the object
serialized_obj = pickle.dumps(object)
# writing zip file
with gzip.open(filename, 'wb') as f:
    f.write(serialized_obj)
To load the object:
import gzip, pickle
filename = 'non-serialize_object.zip'
with gzip.open(filename, 'rb') as f:
    serialized_obj = f.read()

# de-serialize the object
object = pickle.loads(serialized_obj)
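For example, a quick round trip with a small sample dict (hypothetical data, and obj is used here to avoid shadowing the built-in name object):
import gzip, pickle

obj = {'id': 42, 'tags': ['a', 'b']}      # hypothetical sample data
filename = 'non-serialize_object.zip'

with gzip.open(filename, 'wb') as f:
    f.write(pickle.dumps(obj))

with gzip.open(filename, 'rb') as f:
    restored = pickle.loads(f.read())

assert restored == obj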

StringIO with binary files?

I seem to get different outputs:
from StringIO import *
file = open('1.bmp', 'r')
print file.read(), '\n'
print StringIO(file.read()).getvalue()
Why? Is it because StringIO only supports text strings or something?
When you call file.read(), it will read the entire file into memory. Then, if you call file.read() again on the same file object, it will already have reached the end of the file, so it will only return an empty string.
Instead, try e.g. reopening the file:
from StringIO import *
file = open('1.bmp', 'r')
print file.read(), '\n'
file.close()
file2 = open('1.bmp', 'r')
print StringIO(file2.read()).getvalue()
file2.close()
You can also use the with statement to make that code cleaner:
from StringIO import *

with open('1.bmp', 'r') as file:
    print file.read(), '\n'

with open('1.bmp', 'r') as file2:
    print StringIO(file2.read()).getvalue()
As an aside, I would recommend opening binary files in binary mode: open('1.bmp', 'rb')
The second file.read() actually returns just an empty string. You should do file.seek(0) to rewind the internal file offset.
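For instance, a minimal sketch of the seek(0) approach (Python 2, to match the question's code, and binary mode is assumed):
from StringIO import StringIO

f = open('1.bmp', 'rb')
first = f.read()
f.seek(0)  # rewind the file offset so the next read() starts from the beginning
second = StringIO(f.read()).getvalue()
f.close()
print first == second  # True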
Shouldn't you be using "rb" to open, instead of just "r", since plain text mode can translate newlines and treat certain characters as EOF on some platforms, corrupting binary data?
