encoding error when reading excel file [duplicate] - python

This question already has answers here:
Suppress the u'prefix indicating unicode' in python strings
(11 answers)
Closed 8 years ago.
I want to go through data in my folder, identify them and rename them according to a list of rules I have in an excel spreadsheet
I load the needed libraries,
I make my directory the working directory;
I read in the xcel file (using xlrd)
and when I try to read the data by columns e.g. :
fname = metadata.col_values(0, start_rowx=1, end_rowx=None)
the list of values comes with a u in front of them - I guess unicode - such as:
fname = [u'file1', u'file2'] and so on
How can I convert fname to a list of ascii strings?

I'm not sure what the big issue behind having unicode filenames is, but assuming that all of your characters are ascii-valid characters the following should do it. This solution will just ignore anything that's non-ascii, but it's worth thinking about why you're doing this in the first place:
ascii_string = unicode_string.encode("ascii", "ignore")
Specifically, for converting a whole list I would use a list comprehension:
ascii_list = [old_string.encode("ascii", "ignore") for old_string in fname]

The u at the front is just a visual item to show you, when you print the string, what the underlying representation is. It's like the single-quotes around the strings when you print that list--they are there to show you something about the object being printed (specifically, that it's a string), but they aren't actually a part of the object.
In the case of the u, it's saying it's a unicode object. When you use the string internally, that u on the outside doesn't exist, just like the single-quotes. Try opening a file and writing the strings there, and you'll see that the u and the single-quotes don't show up, because they're not actually part of the underlying string objects.
with open(r'C:\test\foo.bar', 'w') as f:
for item in fname:
f.write(item)
f.write('\n')
If you really need to print strings without the u at the start, you can convert them to ASCII with u'unicode stuff'.encode('ascii'), but honestly I doubt this is something that actually matters for what you're doing.
You could also just use Python 3, where Unicode is the default and the u isn't normally printed.

Related

How to create variables for substitution based on user for unique filepath in python?

I'm writing code that I want to make generic to whoever needs to follow it.
Part of the code is reading in an excel file that the user has to download. I know that each user has a specific 6-digit unique ID, and the folder and name of the file remains the same. Is there some way for me to modify the pd.read_csv function so that it is like this:
USERID = '123abc'
pd.read_csv(r'C:\Users\USERID\Documents\Dataset.csv')
I keep getting stuck because there is an ' next to the r so concatenation with a constant does not seem to work.
Similarly, is there a method for code for exporting that would insert the current date in the title?
What you want to use are formatted strings. The r preceding the string literal in your code denotes that you are creating a raw string, which means that you aren't going to ever see the value of your variable get assigned correctly within that string. Python's docs explain what these raw strings are:
Both string and bytes literals may optionally be prefixed with a letter 'r' or 'R'; such strings are called raw strings and treat backslashes as literal characters. (3.10.4 Python Language Reference, Lexical Analysis)
Like Fredericka mentions in her comment, the formatted string is a great way to accomplish what you're trying to do. If you're using Python version 3.6 or greater, you can also use the format method on the string, which does the same thing.
# set the User ID
user_id = "PythonUser1"
# print the full filepath
print("C:\\Users\\{}\\Documents\\Dataset.csv".format(user_id))
# read the CSV file using formatted string literals
my_csv = pd.read_csv("C:\\Users\\{user_id}\\Documents\\Dataset.csv")
# read the CSV file using the format method
my_csv = pd.read_csv("C:\\Users\\{}\\Documents\\Dataset.csv".format(user_id))
For more information, I'd recommend checking out the official Python docs on input and output.

Python - How to read multiple lines from text file as a string and remove all encoding?

I have a list of 77 items. I have placed all 77 items in a text file (one per line).
I am trying to read this into my python script (where I will then compare each item in a list, to another list pulled via API).
Problem: for some reason, 2/77 of the items on the list have encoding, giving me characters of "u00c2" and "u00a2" which means they are not comparing correctly and being missed. I have no idea why these 2/77 have this encoding, but the other 75 are fine, and I don't know how to get rid of the encoding, in python.
Question:
In Python, How can I get rid of the encoding to ensure none of them have any special/weird characters and are just plain text?
Is there a method I can use to do this upon reading the file in?
Here is how I am reading the text file into python:
with open("name_list2.txt", "r") as myfile:
policy_match_list = myfile.readlines()
policy_match_list = [x.strip() for x in policy_match_list]
Note - "policy_match_list" is the list of 77 policies read in from the text file.
Here is how I am comparing my two lists:
for policy_name in policy_match_list:
for us_policy in us_policies:
if policy_name == us_policy["name"]:
print(f"Match #{match} | {policy_name}")
match += 1
Note - "us_policies" is another list of thousands of policies, pulled via API that I am comparing to
Which is resulting in 75/77 expected matches, due to the other 2 policies comparing e.g. "text12 - text" to "text12u00c2-u00a2text" rather than "text12 - text" to "text12 - text"
I hope this makes sense, let me know if I can add any further info
Cheers!
Did you try to open the file while decoding from utf8? because I can't see the file I can't tell this is the problem, but the file might have characters that the default decoding option (which I think is Latin) can't process.
Try doing:
with open("name_list2.txt", "r", encoding="utf-8") as myfile:
Also, you can watch this question about how to treat control characters: Python - how to delete hidden signs from string?
Sorry about not posting it as a comment (as I really don't know if this is the solution), I don't have enough reputation for that.
Certain Unicode characters aren't properly decoded in some cases. In your case, the characters \u00c2 and \u00a2 caused the issue. As of now, I see two fixes:
Try to resolve the encoding by replacing the characters (refer to https://stackoverflow.com/a/56967370)
Copy the text to a new plain text file (if possible) and save it. These extra characters tend to get ignored in that case and consequently removed.

hex header of file, magic numbers, python

I have a program in Python which analyses file headers and decides which file type it is. (https://github.com/LeoGSA/Browser-Cache-Grabber)
The problem is the following:
I read first 24 bytes of a file:
with open (from_folder+"/"+i, "rb") as myfile:
header=str(myfile.read(24))
then I look for pattern in it:
if y[1] in header:
shutil.move (from_folder+"/"+i,to_folder+y[2]+i+y[3])
where y = ['/video', r'\x47\x40\x00', '/video/', '.ts']
y[1] is the pattern and = r'\x47\x40\x00'
the file has it inside, as you can see from the picture below.
the program does NOT find this pattern (r'\x47\x40\x00') in the file header.
so, I tried to print header:
You see? Python sees it as 'G#' instead of '\x47\x40'
and if i search for 'G#'+r'\x00' in header - everything is ok. It finds it.
Question: What am I doing wrong? I want to look for r'\x47\x40\x00' and find it. Not for some strange 'G#'+r'\x00'.
OR
why python sees first two numbers as 'G#' and not as '\x47\x40', though the rest of header it sees in HEX? Is there a way to fix it?
with open (from_folder+"/"+i, "rb") as myfile:
header=myfile.read(24)
header = str(binascii.hexlify(header))[2:-1]
the result I get is:
And I can work with it
4740001b0000b00d0001c100000001efff3690e23dffffff
P.S. But anyway, if anybody will explain what was the problem with 2 first bytes - I would be grateful.
In Python 3 you'll get bytes from a binary read, rather than a string.
No need to convert it to a string by str.
Print will try to convert bytes to something human readable.
If you don't want that, convert your bytes to e.g. hex representations of the integer values of the bytes by:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (''.join ([hex (aByte) for aByte in aBytes]))
Output as redirected from the console:
b'\x00G#\x00\x13\x00\x00\xb0'
0x00x470x400x00x130x00x00xb0
You can't search in aBytes directly with the in operator, since aBytes isn't a string but an array of bytes.
If you want to apply a string search on '\x00\x47\x40', use:
aBytes = b'\x00\x47\x40\x00\x13\x00\x00\xb0'
print (aBytes)
print (r'\x'.join ([''] + ['%0.2x'%aByte for aByte in aBytes]))
Which will give you:
b'\x00G#\x00\x13\x00\x00\xb0'
\x00\x47\x40\x00\x13\x00\x00\xb0
So there's a number of separate issues at play here:
print tries to print something human readable, which succeeds only for the first two chars.
You can't directly search for bytearrays in bytearrays with in, so convert them to a string containing fixed length hex representations as substrings, as shown.

Writing unicode symbols to files (as opposed to unicode code)

I'm new to python and unicode is starting to give me headaches.
Currently I write to file like this:
my_string = "马/馬"
f = codecs.open(local_filepath, encoding='utf-8', mode='w+')
f.write(my_string)
f.close()
And when I open file with i.e. Gedit, I can see something like this:
\u9a6c/\u99ac\tm\u01ce
While I'd like to see exactly what I've written:
马/馬
I've tried a few different variations, like writing my_string.decode() or my_string.encode('utf-8') instead of just my_string, I know those two methods are the opposites but I was not sure which one I needed. Neither worked anyway.
If I manually write these symbols to text file, then with python read the file, re-write what I've just read back to the same file and save, symbols get turned to the code \u9a6c. Not sure if this is importat, figured I'd just mention it to help identify the problem.
Edit: the strings came from SQL Alchemy objects repr method, which turned out to be where the problem lied. I didn't mention it because it just didn't occur to me it can be related to the problem somehow. Thanks again for your help!
From the comments it is now clear you are using either the repr() function or calling the object.__repr__() method directly.
Don't do that. You are writing debugging information to your file:
>>> my_string = u"马/馬"
>>> print repr(my_string)
u'\u9a6c/\u99ac'
The value produced is meant to be pastable back into a Python session so you can re-produce the exact same value, and as such it is ASCII-safe (so it can be used in Python 2 source code without encoding issues).
From the repr() documentation:
For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object.
Write the Unicode objects to your file directly instead, codecs.open() handles encoding to UTF-8 correctly if you do.

How to remove double quotes from strings in a list in python?

I am trying to get some data in a list of dictionaries.
The data comes from a csv file so it's all string.
the the keys in the file all have double qoutes, but since these are all strings, I want to remove them so they look like this in the dictionary:
{'key':value}
instead of this
{'"key"':value}
I tried simply using string = string[1:-1], but this doesn's work...
Here is my code:
csvDelimiter = ","
tsvDelimiter = "\t"
dataOutput = []
dataFile = open("browser-ww-monthly-201305-201405.csv","r")
for line in dataFile:
line = line[:-1] # Removes \n after every line
data = line.split(csvDelimiter)
for i in data:
if type(i) == str: # Doesn't work, I also tried if isinstance(i, str)
# but that didn't work either.
print i
i = i[1:-1]
print i
dataOutput.append({data[0] : data[1]})
dataFile.close()
print "Data output:\n"
print dataOutput
all the prints I get from print i are good, without double quotes, but when I append data to dataOutput, the quotes are back!
Any idea how to make them disappear forever?
Strip it. For example:
data[0].strip('"')
However, when reading cvs files, the best is to use the built-in cvs module. It takes care of this for you.
As noted in the comments, when dealing with CSV files you truly ought to use Python's built-in csv module (linking to Python 2 docs since it seems that's what you're using).
Another thing to note is that when you do:
data = line.split(csvDelimiter)
every item in the returned list, if it is not empty, will be strings. There's no sense in doing a type check in the loop (though if there were a reason to you would use isinstance). I don't know what "didn't work" about it, though it's possible you were using unicode strings. On Python 2 you can usually use isinstance(..., basestring) where basestring is a base class for both str and unicode. On Python 3 just use str unless you know you're dealing with bytes.
You said: "I tried simply using string = string[1:-1], but this doesn't work...". It seems to work fine for me:
In [101]: s="'word'"
In [102]: s[1:-1]
Out[102]: 'word'

Categories

Resources