I have a program in Python which analyses file headers and decides which file type each file is (https://github.com/LeoGSA/Browser-Cache-Grabber).
The problem is the following:
I read the first 24 bytes of a file:
with open (from_folder+"/"+i, "rb") as myfile:
    header=str(myfile.read(24))
Then I look for a pattern in it:
if y[1] in header:
    shutil.move (from_folder+"/"+i, to_folder+y[2]+i+y[3])
where y = ['/video', r'\x47\x40\x00', '/video/', '.ts']
y[1] is the pattern and equals r'\x47\x40\x00'.
The file does contain this sequence: its header starts with the bytes 47 40 00, as the hex dump further down shows.
But the program does NOT find this pattern (r'\x47\x40\x00') in the file header.
So I tried to print header, and it starts with b'G@\x00\x1b...'.
You see? Python shows the first two bytes as 'G@' instead of '\x47\x40'.
And if I search for 'G@'+r'\x00' in header, everything is OK: it finds it.
Question: What am I doing wrong? I want to look for r'\x47\x40\x00' and find it, not for some strange 'G@'+r'\x00'.
OR
Why does Python show the first two bytes as 'G@' and not as '\x47\x40', while it shows the rest of the header in hex escapes? Is there a way to fix it?
My workaround is to read the header as bytes and convert it to a hex string:

import binascii

with open (from_folder+"/"+i, "rb") as myfile:
    header = myfile.read(24)
    header = str(binascii.hexlify(header))[2:-1]

The result I get is:

4740001b0000b00d0001c100000001efff3690e23dffffff

and I can work with that.
P.S. But anyway, if anybody can explain what the problem was with the first 2 bytes, I would be grateful.
In Python 3 you'll get bytes from a binary read, rather than a string.
There is no need to convert it to a string with str().
print will try to convert bytes to something human readable.
If you don't want that, convert the bytes to, e.g., hex representations of their integer values:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (''.join ([hex (aByte) for aByte in aBytes]))
Output as redirected from the console:
b'\x00G@\x00\x13\x00\x00\xb0'
0x00x470x400x00x130x00x00xb0
You can't find a str pattern such as r'\x47\x40\x00' in aBytes with the in operator, since aBytes isn't a string but a bytes object; mixing str and bytes in a search raises a TypeError. You can search with a bytes pattern directly (b'\x47\x40\x00' in aBytes), or, if you want to apply a string search on '\x00\x47\x40', convert the bytes to the same escaped-hex text and search in that:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (r'\x'.join ([''] + ['%0.2x'%aByte for aByte in aBytes]))
Which will give you:
b'\x00G@\x00\x13\x00\x00\xb0'
\x00\x47\x40\x00\x13\x00\x00\xb0
So there are a number of separate issues at play here:
print tries to print something human readable, which succeeds only for the 'G' and '@' bytes; the rest are shown as \x escapes.
A str pattern is never found inside bytes with in, so either search with a bytes pattern or convert both sides to a string of fixed-length hex representations, as shown.
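Applied to the original file-header check, a minimal sketch (assuming Python 3; the folder variables and the y rule here are illustrative stand-ins for the ones in the question):

import shutil

# Illustrative values; in the question these come from the surrounding loop.
from_folder = "cache"
to_folder = "sorted"
i = "some_cached_file"

# Same rule as in the question, but the pattern is a bytes literal, not a raw str.
y = ['/video', b'\x47\x40\x00', '/video/', '.ts']

with open(from_folder + "/" + i, "rb") as myfile:
    header = myfile.read(24)              # bytes object, no str() conversion

if y[1] in header:                         # bytes-in-bytes search works directly
    shutil.move(from_folder + "/" + i, to_folder + y[2] + i + y[3])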
I have a list of 77 items. I have placed all 77 items in a text file (one per line).
I am trying to read this into my python script (where I will then compare each item in a list, to another list pulled via API).
Problem: for some reason, 2 of the 77 items on the list come through with stray encoded characters, "\u00c2" and "\u00a2", which means they are not comparing correctly and are being missed. I have no idea why these 2 of the 77 have this while the other 75 are fine, and I don't know how to get rid of it in Python.
Question:
How can I get rid of these characters in Python, to ensure none of the items have any special/weird characters and are just plain text?
Is there a method I can use to do this when reading the file in?
Here is how I am reading the text file into python:
with open("name_list2.txt", "r") as myfile:
policy_match_list = myfile.readlines()
policy_match_list = [x.strip() for x in policy_match_list]
Note - "policy_match_list" is the list of 77 policies read in from the text file.
Here is how I am comparing my two lists:
for policy_name in policy_match_list:
    for us_policy in us_policies:
        if policy_name == us_policy["name"]:
            print(f"Match #{match} | {policy_name}")
            match += 1
Note - "us_policies" is another list of thousands of policies, pulled via API that I am comparing to
This results in only 75 of the 77 expected matches, because the other 2 policies end up comparing e.g. "text12 - text" to "text12\u00c2-\u00a2text" rather than "text12 - text" to "text12 - text".
I hope this makes sense, let me know if I can add any further info
Cheers!
Did you try to open the file while decoding from UTF-8? Because I can't see the file I can't tell whether this is the problem, but the file might have characters that the default encoding (which is platform-dependent; on Windows it is often cp1252/Latin-1) doesn't process the way you expect.
Try doing:
with open("name_list2.txt", "r", encoding="utf-8") as myfile:
Also, you can look at this question about how to treat control characters: Python - how to delete hidden signs from string?
Sorry about not posting it as a comment (as I really don't know if this is the solution), I don't have enough reputation for that.
Certain Unicode characters aren't properly decoded in some cases. In your case, the characters \u00c2 and \u00a2 caused the issue. As of now, I see two fixes:
Try to resolve the encoding by replacing the characters (refer to https://stackoverflow.com/a/56967370)
Copy the text to a new plain text file (if possible) and save it. These extra characters tend to get ignored in that case and consequently removed.
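If the stray characters can simply be discarded, one approach is to normalize both sides before comparing. A minimal sketch (assuming the characters can be dropped and that us_policies has the shape described in the question; the sample value is illustrative):

import re

def normalize(s):
    # Drop anything outside ASCII (e.g. the stray '\u00c2' / '\u00a2' characters)
    # and collapse runs of whitespace, so both sides compare on plain text only.
    ascii_only = s.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"\s+", " ", ascii_only).strip()

with open("name_list2.txt", "r", encoding="utf-8") as myfile:
    policy_match_list = [normalize(line) for line in myfile]

us_policies = [{"name": "text12 - text"}]   # illustrative; from the API in the question

for policy_name in policy_match_list:
    for us_policy in us_policies:
        if policy_name == normalize(us_policy["name"]):
            print(f"Match | {policy_name}")

Whether this gives an exact match depends on what the stray characters replaced; if they stood in for spaces, the whitespace normalization handles that, otherwise the cleaning step may need adjusting.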
I am using python v2.7.3 and am trying to get a conversion to work but am having some issues.
This is code that works the way I would like it to:
testString = "\x00\x13\xA2\x00\x40\xAA\x15\x47"
print 'Test String:',testString
This produces the following result
Test String: ¢@ªG
Now I load the same string as above along with some other data:
\x00\x13\xA2\x00\x40\xAA\x15\x47123456
into a SQLite3 database and then pull it from the database as such:
cur.execute('select datafield from databasetable')
rows = cur.fetchall()
if len(rows) == 0:
    print 'Sorry Found Nothing!'
else:
    print rows[0][0][:32]
This however produces the following result:
\x00\x13\xA2\x00\x40\xAA\x15\x47
I cannot figure out how to convert the database-stored string into the byte string (if that is what it is) that the first snippet of code produces. I actually need it loaded into a variable in that format so I can pass it to a function for further processing.
I have tried the following:
print "My Addy:",bytes(row[0][:32])
print '{0}'.format(row[0][:32])
...
They all produce the same results...
Please:
First, can anyone tell me what format the first result is in? I think it's a byte string but am not sure.
Second, how can I convert the database-stored text into that same format?
I would be eternally grateful for any help.
Thanks in advance,
Ed
The problem is that you're not storing the value in the database properly. You want to store a sequence of bytes, but you're storing an escaped version of those bytes instead.
When entering string literals into a programming language, you can use escape codes in your source code to access non-printing characters. That's what you've done in your first example:
testString = "\x00\x13\xA2\x00\x40\xAA\x15\x47"
print 'Test String:',testString
But this is processing done by the Python interpreter as it's reading through your program and executing it.
Change the database column to a binary blob instead of a string, then go back to the code you're using to store the bytes in SQLite3, and have it store the actual bytes ('ABC', 3 bytes) instead of an escaped string ('\x41\x42\x43', 12 characters).
If you truly need to store the escaped string in SQLite3 and convert it at run-time, you might be able to use ast.literal_eval() to evaluate it as a string literal.
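To illustrate the run-time conversion, here is a minimal sketch (Python 2.7 as in the question; the escaped value is the one from the question, everything else is illustrative):

import ast

# What the database currently returns: a 32-character escaped string,
# not 8 raw bytes.
escaped = r"\x00\x13\xA2\x00\x40\xAA\x15\x47"

# Evaluate it as a Python string literal to turn the escapes into real bytes.
raw = ast.literal_eval('"%s"' % escaped)

print len(escaped), len(raw)   # 32 8
print 'Test String:', raw      # same output as the first snippet

If you switch the column to a BLOB as suggested, you can insert the raw bytes directly (e.g. with sqlite3.Binary(raw)) and skip this conversion entirely.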
Whenever a program opens a file it sees the file as binary data; it then presents it in a higher-level form, e.g. octal, hex, ASCII, etc. In this case the hex editor displays hexadecimal in the left-hand pane and ANSI text (Windows 7, so it should be CP1252) in the right-hand pane. The three pictures below illustrate the original view, then the desired alteration; the third is the actual change made by the code:
with open(tar,'rb') as f:
    data = binascii.hexlify(f.read(160))
if old in data:
    print 'found!'
    data = data.replace(old, new)
else:
    print 'not found'
with open(tar+'new', 'wb') as fo:
    binascii.unhexlify(data)
    fo.write(data)
I have obviously not correctly targeted the write delivery method.
Hint: What is the difference between these two lines:
data = binascii.hexlify(f.read(160))
binascii.unhexlify(data)
In Python, string objects are immutable. There is nothing you can call upon data that will cause the string that data names to change, because strings do not change. binascii.unhexlify instead returns a new string - which is why the first statement even works in the first place. If you wanted to .write the resulting new string, then that's what you should specify to happen in the code - either directly:
fo.write(binascii.unhexlify(data))
or by assigning it back to data first.
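Putting that back into the snippet from the question (Python 2; the target path and the patterns here are illustrative, and note that old and new must be lowercase hex, since hexlify() produces lowercase digits):

import binascii

tar = 'target.bin'                         # illustrative path
old = binascii.hexlify('\x47\x40\x00')     # pattern to replace, as lowercase hex
new = binascii.hexlify('\x47\x41\x00')     # replacement of the same length

with open(tar, 'rb') as f:
    data = binascii.hexlify(f.read(160))   # only the first 160 bytes, as in the question

if old in data:
    print 'found!'
    data = data.replace(old, new)
else:
    print 'not found'

with open(tar + 'new', 'wb') as fo:
    fo.write(binascii.unhexlify(data))     # write the decoded bytes, not the hex text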
This question already has answers here: Suppress the u'prefix indicating unicode' in python strings (11 answers). Closed 8 years ago.
I want to go through the files in my folder, identify them, and rename them according to a list of rules I have in an Excel spreadsheet.
I load the needed libraries,
make my directory the working directory,
and read in the Excel file (using xlrd),
and when I try to read the data by columns, e.g.:
fname = metadata.col_values(0, start_rowx=1, end_rowx=None)
the list of values comes back with a u in front of each of them - I guess Unicode - such as:
fname = [u'file1', u'file2'] and so on
How can I convert fname to a list of ASCII strings?
I'm not sure what the big issue behind having Unicode filenames is, but assuming that all of your characters are ASCII-valid, the following should do it. This solution will just ignore anything that's non-ASCII, but it's worth thinking about why you're doing this in the first place:
ascii_string = unicode_string.encode("ascii", "ignore")
Specifically, for converting a whole list I would use a list comprehension:
ascii_list = [old_string.encode("ascii", "ignore") for old_string in fname]
The u at the front is just a visual item to show you, when you print the string, what the underlying representation is. It's like the single-quotes around the strings when you print that list--they are there to show you something about the object being printed (specifically, that it's a string), but they aren't actually a part of the object.
In the case of the u, it's saying it's a unicode object. When you use the string internally, that u on the outside doesn't exist, just like the single-quotes. Try opening a file and writing the strings there, and you'll see that the u and the single-quotes don't show up, because they're not actually part of the underlying string objects.
with open(r'C:\test\foo.bar', 'w') as f:
    for item in fname:
        f.write(item)
        f.write('\n')
If you really need to print strings without the u at the start, you can convert them to ASCII with u'unicode stuff'.encode('ascii'), but honestly I doubt this is something that actually matters for what you're doing.
You could also just use Python 3, where Unicode is the default and the u isn't normally printed.
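A quick way to see that the u is only part of the repr, not of the string itself (Python 2, matching the question):

fname = [u'file1', u'file2']
print fname        # [u'file1', u'file2']  <- reprs, with the u and the quotes
print fname[0]     # file1                 <- the actual value: no u, no quotes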
I wrote a simple function to write into a text file, like this:
def write_func(var):
    var = str(var)
    myfile.write(var)

a = 5
b = 5
c = a + b
write_func(c)
This will write the output to the desired file.
Now, I want the output in another format, say:
write_func("Output is :"+c)
so that the output has a meaningful name in the file. How do I do it?
And why is it that I can't write an integer to a file? Do I have to do int = str(int) before writing to a file?
You can't add/concatenate a string and integer directly.
If you do anything more complicated than "string :"+str(number), I would strongly recommend using string formatting:
write_func('Output is: %i' % (c))
Python is a strongly typed language. This means, among other things, that you cannot concatenate a string and an integer. Therefore you'll have to convert the integer to a string before concatenating. This can be done using a format string (as Nick T suggested) or by passing the integer to the built-in str function (as NullUserException suggested).
Simple, you do:
write_func('Output is' + str(c))
You have to convert c to a string before you can concatenate it with another string. Then you can also drop the
var = str(var)
from your function.
why is it that I can't write an integer to a file? Do I have to do int = str(int) before writing to a file?
You can write binary data to a file, but byte representations of numbers aren't really human readable. -2 for example is 0xfffffffe in a 2's complement 32-bit integer. It's even worse when the number is a float: 2.1 is 0x40066666.
If you plan on having a human-readable file, you need human-readable characters in it. In an ASCII file '0.5' isn't a number (at least not as a computer understands numbers), but instead the characters '0', '.' and '5'. And that's why you need to convert your numbers to strings.
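To make that concrete, here is a small sketch (Python 2, matching the question) contrasting the raw binary representation of a number with its string form:

import struct
import binascii

n = -2
print binascii.hexlify(struct.pack('>i', n))   # fffffffe - 4 raw bytes, not readable
print str(n)                                    # -2       - 2 readable characters

x = 2.1
print binascii.hexlify(struct.pack('>f', x))   # 40066666 - the value mentioned above
print str(x)                                    # 2.1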
From http://docs.python.org/library/stdtypes.html#file.write
file.write(str)
    Write a string to the file. There is no return value. Due to buffering, the string may not actually show up in the file until the flush() or close() method is called.
Note how documentation specifies that write's argument must be a string.
So you should create a string yourself before passing it to file.write().
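So, applied to the question, build the string first and then pass it to write. A minimal sketch (the file name is illustrative):

with open("output.txt", "w") as myfile:            # illustrative file name
    c = 5 + 5
    myfile.write("Output is: " + str(c) + "\n")    # build the string, then write it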