I'm new to Python and I have a file like this:
cw==ZA==YQ==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==dA==ZQ==cw==dA==
It's keyboard input, encoded with base64, and now I want to decode it.
I tried the code below, but it stops after the first decoded character.
import base64
file = "my_file.txt"
fin = open(file, "rb")
binary_data = fin.read()
fin.close()
b64_data = base64.b64decode(binary_data)
b64_fname = "original_b64.txt"
fout = open(b64_fname, "w")
fout.write(b64_data)
fout.close()
Any help is welcome. Thanks.
I assume that you created your test input string yourself.
If I split your test input string into blocks of 4 characters and decode each one separately, I get the following:
>>> import base64
>>> s = 'cw==ZA==YQ==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==dA==ZQ==cw==dA=='
>>> ''.join(base64.b64decode(s[i:i+4]) for i in range(0, len(s), 4))
'sdadasdasdasdasdtest'
However, the correct base64 encoding of your test string sdadasdasdasdasdtest is:
>>> base64.b64encode('sdadasdasdasdasdtest')
'c2RhZGFzZGFzZGFzZGFzZHRlc3Q='
If you place this string in my_file.txt (and rewrite your code to be a bit more concise), then it all works.
import base64
with open("my_file.txt") as f, open("original_b64.txt", 'w') as g:
    encoded = f.read()
    decoded = base64.b64decode(encoded)
    g.write(decoded)
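On Python 3, b64decode returns bytes rather than str, so the block-by-block one-liner above needs a small change. The following is a minimal sketch, assuming (as the test input suggests) that every keystroke was encoded separately into its own padded 4-character block; the helper name decode_blocks is hypothetical.

```python
import base64

def decode_blocks(encoded, block_size=4):
    """Decode a string made of independently encoded base64 blocks."""
    chunks = (encoded[i:i + block_size] for i in range(0, len(encoded), block_size))
    # b64decode returns bytes on Python 3, so join the bytes and decode once at the end
    return b"".join(base64.b64decode(chunk) for chunk in chunks).decode("utf-8")

s = "cw==ZA==YQ==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==dA==ZQ==cw==dA=="
text = decode_blocks(s)
print(text)  # sdadasdasdasdasdtest

# Encoding the whole string in one go gives the compact form instead
whole = base64.b64encode(text.encode("utf-8"))
print(whole)  # b'c2RhZGFzZGFzZGFzZGFzZHRlc3Q='
```

If the file really is written one key at a time, decoding per block like this avoids having to change the program that produced it.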
f.read(1) will return 1 byte, not one character. The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string. There is no newline character at the end of the string. How do I read such strings?
I have seen this question but none of the answers address the UTF-8 case.
Example code:
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x41')
    f.write(b'\xD0')
    f.write(b'\xB1')
    f.write(b'\xC0')
with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.buffer.read(1), '+', f.read(1))
This outputs:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 2: invalid start byte
When f.write(b'\xC0') is removed, it works as expected. It seems to read more than it is told: the code doesn't say to read the 0xC0 byte.
The file is binary but particular ranges in the file are UTF-8 encoded strings with the length coming before the string.
You have the length of the string, which is likely the byte length as it makes the most sense in a binary file. Read the range of bytes in binary mode and decode it after-the-fact. Here's a contrived example of writing a binary file with a UTF-8 string with the length encoded first. It has a two-byte length followed by the encoded string data, surrounded with 10 bytes of random data on each side.
import os
import struct
string = "我不喜欢你女朋友。你需要一个新的。"
with open('sample.bin','wb') as f:
    f.write(os.urandom(10)) # write 10 random bytes
    encoded = string.encode()
    f.write(len(encoded).to_bytes(2,'big')) # write a two-byte big-endian length
    f.write(encoded) # write string
    f.write(os.urandom(10)) # 10 more random bytes
with open('sample.bin','rb') as f:
    print(f.read()) # show the raw data
# Option 1: Seeking to the known offset, read the length, then the string
with open('sample.bin','rb') as f:
    f.seek(10)
    length = int.from_bytes(f.read(2),'big')
    result = f.read(length).decode()
    print(result)
# Option 2: read the fixed portion as a structure.
with open('sample.bin','rb') as f:
    # read 10 bytes and a big endian 16-bit value (the string length)
    *other,length = struct.unpack('>10bH',f.read(12))
    result = f.read(length).decode()
    print(result)
Output:
b'\xa3\x1e\x07S8\xb9LA\xf0_\x003\xe6\x88\x91\xe4\xb8\x8d\xe5\x96\x9c\xe6\xac\xa2\xe4\xbd\xa0\xe5\xa5\xb3\xe6\x9c\x8b\xe5\x8f\x8b\xe3\x80\x82\xe4\xbd\xa0\xe9\x9c\x80\xe8\xa6\x81\xe4\xb8\x80\xe4\xb8\xaa\xe6\x96\xb0\xe7\x9a\x84\xe3\x80\x82ta\xacg\x9c\x82\x85\x95\xf9\x8c'
我不喜欢你女朋友。你需要一个新的。
我不喜欢你女朋友。你需要一个新的。
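If the same layout occurs in several places, the read can be wrapped in a small function. This is only a sketch under the same assumption as the example above (a two-byte big-endian byte length directly before the string); the name read_prefixed_string is hypothetical, and io.BytesIO stands in for the real file.

```python
import io
import struct

def read_prefixed_string(stream, encoding="utf-8"):
    """Read a 2-byte big-endian length, then that many bytes, and decode them."""
    header = stream.read(2)
    if len(header) != 2:
        raise EOFError("stream ended before the length field")
    (length,) = struct.unpack(">H", header)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("stream ended inside the string")
    return payload.decode(encoding)

encoded = "我不喜欢".encode("utf-8")  # 4 characters, 12 bytes
stream = io.BytesIO(b"\x00" * 10 + struct.pack(">H", len(encoded)) + encoded + b"\xff" * 10)
stream.seek(10)                       # skip the unrelated leading bytes
result = read_prefixed_string(stream)
position = stream.tell()              # first byte after the string
print(result, position)
```

Because the length counts bytes, the function never has to know where character boundaries fall; decoding happens only after the exact range has been read.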
If you do need to read UTF-8 characters from a particular byte offset in a file, you can wrap the binary stream in a UTF-8 reader after seeking:
import codecs
with open('sample.bin','rb') as f:
    f.seek(12)
    c = codecs.getreader('utf8')(f)
    print(c.read(1))
Output:
我
Here's a character that takes up more than one byte. In text mode, whether or not you pass the utf-8 encoding explicitly (assuming it is your default), read(1) returns one character rather than one byte, so you get the whole character.
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write('⾀'.encode('utf-8'))
    f.write(b'\x01')
with open(file, 'rb') as f:
    print(f.read(1))
with open(file, 'r') as f:
    print(f.read(1))
Output:
b'\xe2'
⾀
Even though some of the file is non utf-8, you can still open the file in reading mode (non-binary), skip to the byte you want to read and then read a whole character by running read(1).
This works even if your character isn't in the beginning of the file:
file = 'temp.txt'
with open(file, 'wb') as f:
    f.write(b'\x01')
    f.write('⾀'.encode('utf-8'))
with open(file, 'rb') as f:
    print(f.read(1), '+', f.read(1))
with open(file, 'r') as f:
    print(f.read(1),'+', f.read(1))
If this does not work for you please provide an example.
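One caveat, going by the traceback in the question: a text-mode file decodes a buffered chunk at a time, so an invalid byte later in the file can raise even when only one character was requested. A minimal sketch of reading exactly one character from a binary stream with the standard codecs incremental decoder follows; the helper name read_one_char is hypothetical, and io.BytesIO stands in for a file opened in 'rb' mode.

```python
import codecs
import io

def read_one_char(stream, encoding="utf-8"):
    """Read exactly one character from a binary stream, consuming only its bytes."""
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        byte = stream.read(1)
        if not byte:
            # End of stream: flush the decoder (raises if a character is incomplete)
            return decoder.decode(b"", final=True)
        char = decoder.decode(byte)
        if char:
            return char

# Same bytes as the question's temp.txt: 'A', a two-byte character, then a stray 0xC0
data = io.BytesIO(b"\x41\xd0\xb1\xc0")
first = read_one_char(data)    # one byte consumed
second = read_one_char(data)   # two bytes consumed
remaining = data.read()        # the invalid byte was never fed to the decoder
print(first, second, remaining)
```

Feeding the decoder one byte at a time is slow for large reads, but it guarantees that nothing beyond the requested character is decoded.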
I have this code
import collections
import csv
import sys
import codecs
from xml.dom.minidom import parse
import xml.dom.minidom
String = collections.namedtuple("String", ["tag", "text"])
def read_translations(filename): #Reads a csv file with rows made up of 2 columns: the string tag, and the translated tag
    with codecs.open(filename, "r", encoding='utf-8') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=",")
        result = [String(tag=row[0], text=row[1]) for row in csv_reader]
    return result
The CSV file I'm reading contains Brazilian Portuguese characters. When I try to run this, I get an error:
'utf8' codec can't decode byte 0x88 in position 21: invalid start byte
I'm using Python 2.7. As you can see, I'm encoding with codecs, but it doesn't work.
Any ideas?
The idea of this line:
with codecs.open(filename, "r", encoding='utf-8') as csvfile:
is to say "This file was saved as utf-8. Please make appropriate conversions when reading from it."
That works fine if the file was actually saved as utf-8. If some other encoding was used, decoding fails with an error like the one you got.
What then?
Determine which encoding was used. Assuming the information cannot be obtained from the software which created the file - guess.
Open the file normally and print each line:
with open(filename, 'rt') as f:
    for line in f:
        print repr(line)
Then look for a character which is not ASCII, e.g. ñ - this letter will be printed as some code, e.g.:
'espa\xc3\xb1ol'
Above, ñ is represented as \xc3\xb1, because that is the utf-8 sequence for it.
Now, you can check what various encodings would give and see which is right:
>>> ntilde = u'\N{LATIN SMALL LETTER N WITH TILDE}'
>>>
>>> print repr(ntilde.encode('utf-8'))
'\xc3\xb1'
>>> print repr(ntilde.encode('windows-1252'))
'\xf1'
>>> print repr(ntilde.encode('iso-8859-1'))
'\xf1'
>>> print repr(ntilde.encode('macroman'))
'\x96'
Or print all of them:
import encodings.aliases
for c in encodings.aliases.aliases:
    try:
        encoded = ntilde.encode(c)
        print c, repr(encoded)
    except Exception:
        # skip codecs that cannot encode this character
        pass
Then, when you have guessed which encoding it is, use that, e.g.:
with codecs.open(filename, "r", encoding='iso-8859-1') as csvfile:
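The guessing step can also be run in the other direction: take raw bytes from the file and see what each candidate encoding makes of them. This is a rough Python 3 sketch rather than a detection algorithm; the function name try_encodings and the candidate list are assumptions, and the sample bytes are made up for illustration.

```python
def try_encodings(raw, candidates=("utf-8", "cp1252", "iso-8859-1", "mac_roman", "cp850")):
    """Decode raw bytes with each candidate; None marks an encoding that fails."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Hypothetical sample: 'español' as a Windows-1252 editor would have saved it
sample = "espa\u00f1ol".encode("cp1252")
results = try_encodings(sample)
for name, text in results.items():
    print(name, repr(text))
```

An encoding that decodes without error is not necessarily the right one, so the final check is still to read the output and see which candidate produces sensible words.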
I have some Amazon review data that I have successfully converted from text format to CSV. The problem is that when I try to read it into a dataframe using pandas, I get this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte
I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove them and save the result to another CSV file?
Thank you!
EDIT1:
Here is the code I use to convert the text to CSV:
import csv
import string
INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"
header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]
f = open(INPUT_FILE_NAME,encoding="utf-8")
outfile = open(OUTPUT_FILE_NAME,"w")
outfile.write(",".join(header) + "\n")
currentLine = []
for line in f:
    line = line.strip()
    # need to remove the , so that the review text won't be split into many columns
    line = line.replace(',','')
    if line == "":
        outfile.write(",".join(currentLine))
        outfile.write("\n")
        currentLine = []
        continue
    parts = line.split(":",1)
    currentLine.append(parts[1])
if currentLine != []:
    outfile.write(",".join(currentLine))
f.close()
outfile.close()
EDIT2:
Thanks to all of you for trying to help me out.
I have solved it by changing the output encoding in my code:
outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
If the input file is not utf-8 encoded, it is probably not a good idea to try to read it as utf-8...
You have basically 2 ways to deal with decode errors:
use a charset that accepts any byte, such as iso-8859-15, also known as latin9
if the data should be utf-8 but contains errors, use errors='ignore' -> silently removes the invalid bytes, or errors='replace' -> replaces the invalid bytes with a replacement marker (U+FFFD when decoding)
For example:
f = open(INPUT_FILE_NAME,encoding="latin9")
or
f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
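To save a cleaned copy rather than only reading past the errors, the same errors argument can be used while copying. The sketch below assumes the file is mostly valid UTF-8 with a few stray bytes; the function name rewrite_as_utf8, the file names, and the sample row are all made up for illustration.

```python
import os
import tempfile

def rewrite_as_utf8(src, dst, errors="replace"):
    """Copy src to dst, dropping or replacing bytes that are not valid UTF-8."""
    with open(src, encoding="utf-8", errors=errors, newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)

raw = b"id,text\n1,great movie \xf8 stars\n"  # 0xf8 can never appear in valid UTF-8

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "dirty.csv")
    dst = os.path.join(tmp, "clean.csv")
    with open(src, "wb") as f:
        f.write(raw)
    rewrite_as_utf8(src, dst, errors="ignore")
    with open(dst, "rb") as f:
        cleaned = f.read()

print(cleaned)  # b'id,text\n1,great movie  stars\n'

# With errors="replace" the invalid byte becomes U+FFFD instead of disappearing
replaced = raw.decode("utf-8", errors="replace")
print(replaced)
```

Both options lose information, so if the real encoding can be identified it is usually better to decode with that encoding and then write UTF-8.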
If you are using python3, it provides inbuilt support for unicode content -
f = open('file.csv', encoding="utf-8")
If you still want to remove all unicode data from it, you can read it as a normal text file and remove the unicode content
import re

def remove_unicode(string_data):
    """ (str|bytes) -> str
    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, bytes):
        string_data = string_data.decode('ascii', 'ignore')
    else:
        # drop non-ascii characters, then go back to str for the regex below
        string_data = string_data.encode('ascii', 'ignore').decode('ascii')
    # strip control characters but keep newlines so the CSV rows survive
    remove_ctrl_chars_regex = re.compile(r'[^\n\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', string_data)

with open('file.csv', 'r+', encoding="utf-8") as csv_file:
    content = remove_unicode(csv_file.read())
    csv_file.seek(0)
    csv_file.write(content)
    csv_file.truncate()  # the cleaned content is shorter than the original
Now you can read it without any unicode data issues.
I work with a thermal printer. It can print images, but it needs to get the data in hex format. For this I need a Python function that reads an image and returns a value containing the image data in hex format.
I currently use this format to send hex data to the printer:
content = b"\x1B\x4E"
What is the simplest way to do this using Python 2.7?
All the best;
I don't really know what you mean by "hex format", but if it needs to get the whole file as a sequence of bytes you can do:
with open("image.jpeg", "rb") as fp:
    img = fp.read()
If your printer expects the image in some other format (like 8-bit values for every pixel), then try using the pillow library; it has many image manipulation functions and handles a wide range of input and output formats.
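It may help to see that a literal such as b"\x1B\x4E" is not a separate "hex format": the escapes are just source-code notation for two raw bytes, the same kind of object that fp.read() returns. The following Python 3 sketch illustrates this; the name build_payload is hypothetical, the four image bytes are a stand-in for real file contents, and whether the printer accepts the data in this form is an assumption that depends on its manual.

```python
command = b"\x1B\x4E"  # exactly the same object as bytes([0x1B, 0x4E])

def build_payload(prefix, image_bytes):
    """Concatenate a command prefix with raw image data."""
    return prefix + image_bytes

# Stand-in for fp.read(): the first four bytes of a typical JPEG file
image_bytes = bytes([0xFF, 0xD8, 0xFF, 0xE0])
payload = build_payload(command, image_bytes)

print(payload)        # b'\x1bN\xff\xd8\xff\xe0'
print(payload.hex())  # 1b4effd8ffe0
```

The repr shows 0x4E as N because printable bytes are displayed as ASCII; the underlying data is unchanged.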
How about this:
with open('something.jpeg', 'rb') as f:
    binValue = f.read(1)
    while len(binValue) != 0:
        hexVal = hex(ord(binValue))
        # Do something with the hex value
        binValue = f.read(1)
Or for a function, something like this:
import re
def imgToHex(file):
    string = ''
    with open(file, 'rb') as f:
        binValue = f.read(1)
        while len(binValue) != 0:
            hexVal = hex(ord(binValue))
            string += '\\' + hexVal
            binValue = f.read(1)
    string = re.sub('0x', 'x', string) # Replace '0x' with 'x' for your needs
    return string
Note: You do not necessarily need to do the re.sub portion if you use struct.pack to write the bits, but this will get it into the format that you need
Read in a jpg and make a string of hex values. Then reverse the procedure. Take a string of hex and write it out as a jpg file...
import binascii
with open('my_img.jpg', 'rb') as f:
    data = f.read()
print(data[:10])
im_hex = binascii.hexlify(data)
# check out the hex...
print(im_hex[:10])
# reversing the procedure
im_hex = binascii.a2b_hex(im_hex)
print(im_hex[:10])
# write it back out to a jpg file
with open('my_hex.jpg', 'wb') as image_file:
    image_file.write(im_hex)
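On Python 3 the same round trip is also available directly on the bytes type, without importing binascii. A short sketch, using a few made-up bytes in place of a real image file:

```python
# Stand-in for the contents of an image file read in 'rb' mode
data = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])

im_hex = data.hex()               # str of hex digits, two per byte
restored = bytes.fromhex(im_hex)  # back to the original bytes

print(im_hex)    # ffd8ffe00010
print(restored)  # b'\xff\xd8\xff\xe0\x00\x10'
```

Unlike binascii.hexlify, which returns bytes, bytes.hex() returns a str, which is convenient when the hex text is going to be printed or written to a text file.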
I would like to convert a binary to hexadecimal in a certain format and save it as a text file.
The end product should be something like this:
"\x7f\xe8\x89\x00\x00\x00\x60\x89\xe5\x31\xd2\x64\x8b\x52"
Input is from an executable file "a".
This is my current code:
with open('a', 'rb') as f:
    byte = f.read(1)
    hexbyte = '\\x%02s' % byte
    print hexbyte
A few issues with this:
This only prints the first byte.
The result is "\x" followed by a box containing "00 7f"; it looks exactly the same in the terminal.
Why is this so? And finally, how do I save all the hexadecimals to a text file to get the end product shown above?
EDIT: Able to save the file as text with
txt = open('out.txt', 'w')
print >> txt, hexbyte
txt.close()
You can't inject numbers into escape sequences like that. Escape sequences are essentially constants, so, they can't have dynamic parts.
There's already a module for this, anyway:
from binascii import hexlify
with open('test', 'rb') as f:
    print(hexlify(f.read()).decode('utf-8'))
Just use the hexlify function on a byte string and it'll give you a hex byte string. You need the decode to convert it back into an ordinary string.
Not quite sure if decode works in Python 2, but you really should be using Python 3, anyway.
Your output looks like a representation of a bytestring in Python returned by repr():
with open('input_file', 'rb') as file:
    print repr(file.read())
Note: some bytes are shown as ascii characters e.g. '\x52' == 'R'. If you want all bytes to be shown as the hex escapes:
with open('input_file', 'rb') as file:
    print "\\x" + "\\x".join([c.encode('hex') for c in file.read()])
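The 'hex' codec used above exists only on Python 2. On Python 3, iterating over bytes yields integers, so each one can be formatted directly. A minimal sketch, with the hypothetical helper name to_hex_escapes and the sample bytes taken from the desired output in the question:

```python
def to_hex_escapes(data):
    """Format every byte as a literal \\xNN escape, including printable ones."""
    return "".join("\\x%02x" % byte for byte in data)

sample = bytes([0x7F, 0xE8, 0x89, 0x00, 0x00, 0x00, 0x60,
                0x89, 0xE5, 0x31, 0xD2, 0x64, 0x8B, 0x52])
escaped = to_hex_escapes(sample)
print('"%s"' % escaped)
# "\x7f\xe8\x89\x00\x00\x00\x60\x89\xe5\x31\xd2\x64\x8b\x52"
```

For a real file, pass the result of open(path, 'rb').read() to the function and write the returned string to a file opened in text mode.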
Just add the content to a list and print it:
with open("default.png",'rb') as file_png:
    a = file_png.read()
l = []
l.append(a)
print l