Enabling Excel to Open Python CSV Output as UTF-8

I've worked through several similar posts on this issue, but to no avail: I output a list of lists to a CSV file, but the special characters don't display properly because Excel isn't reading the file as UTF-8.
I'm a beginner, so I'm struggling to implement some of the workarounds people have used, such as writing UTF-16 or adding BOMs. My latest attempt was to add the BOM to my output CSV, but it's not working:
with open(outputname, "wb") as f:
    writer = csv.writer(f)
    writer.writerows(my_list)
    f.write(u'\ufeff'.encode('utf8'))  # BOM
I've tried some more complicated approaches, like using UnicodeWriter, but with no luck. Any ideas would be appreciated!

I finally figured this out, so just in case it's useful for anyone else - I got around it by decoding my CSV content (stored in a nested list prior to exporting to CSV) and then re-encoding it as 'cp1252', which I believe is the default encoding used by Excel. I haven't tested this on Mac yet, but it certainly works on Windows.
for i in range(len(nestedlist)):
    for j in range(7):
        x = nestedlist[i][j]
        y = unicode(x, 'utf-8')  # Python 2: decode UTF-8 bytes to unicode
        nestedlist[i][j] = y.encode('cp1252')
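On Python 3, the manual re-encoding above can be skipped entirely: the utf-8-sig codec makes open() prepend the BOM that Excel looks for (the same fix recommended in the answers further down this page). A minimal sketch, reusing outputname and my_list from the question:
import csv

with open(outputname, 'w', encoding='utf-8-sig', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(my_list)  # BOM is written automatically; Excel then reads the file as UTF-8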

وصلى characters showing when writing the text obtained through web scraping into a csv file [duplicate]

I'm attempting to extract article information using the Python newspaper3k package and then write it to a CSV file. While the info is downloaded correctly, I'm having issues with the output to CSV. I don't think I fully understand Unicode, despite my efforts to read about it.
from newspaper import Article, Source
import csv

first_article = Article(url="http://www.bloomberg.com/news/articles/2016-09-07/asian-stock-futures-deviate-as-s-p-500-ends-flat-crude-tops-46")
first_article.download()
if first_article.is_downloaded:
    first_article.parse()
    first_article.nlp()  # must be called to populate keywords and summary

article_array = []
collate = {}
collate['title'] = first_article.title
collate['content'] = first_article.text
collate['keywords'] = first_article.keywords
collate['url'] = first_article.url
collate['summary'] = first_article.summary
print(collate['content'])
article_array.append(collate)

keys = article_array[0].keys()
with open('bloombergtest.csv', 'w') as output_file:
    csv_writer = csv.DictWriter(output_file, keys)
    csv_writer.writeheader()
    csv_writer.writerows(article_array)
    # output_file.close() is redundant: the with-block closes the file
When I print collate['content'], which is first_article.text, the console outputs the article's content just fine - everything shows up correctly, apostrophes and all. When I write to the CSV, however, the content cell has odd characters in it. For example:
“At the end of the day, Europe’s economy isn’t in great shape, inflation doesn’t look exciting and there are a bunch of political risks to reckon with.
So far I have tried:
with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file:
to no avail. I also tried utf-16 instead of utf-8, but that just resulted in the cells being written in an odd order - the output text looked correct, but the cells weren't created correctly in the CSV. I've also tried .encode('utf-8') on various variables, but nothing has worked.
What's going on? Why would the console print the text correctly, while the CSV file has odd characters? How can I fix this?
Add encoding='utf-8-sig' to open(). Excel requires the UTF-8-encoded BOM (Byte Order Mark, code point U+FEFF) at the start of the file in order to interpret it as UTF-8; otherwise, it assumes the default localized encoding.
Changing with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file: to with open('bloombergtest.csv', 'w', encoding='utf-8-sig') as output_file: worked, as recommended by Leon and Mark Tolonen.
That's most probably a problem with the software you use to open or print the CSV file - it doesn't "understand" that the CSV is encoded in UTF-8 and assumes ASCII, latin-1, ISO-8859-1 or a similar encoding instead.
You can help that software recognize the file's encoding by placing a BOM sequence at the beginning of the file (which, in general, is not recommended for UTF-8).
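For reference, the utf-8-sig signature is nothing more than the three-byte UTF-8 encoding of U+FEFF prepended to the file, as a quick interpreter check shows:
>>> '\ufeff'.encode('utf-8')
b'\xef\xbb\xbf'
These are also the bytes the first question on this page tried to append by hand - the catch being that Excel only honors them at the very start of the file.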

Receiving Data From Universal Robot and Decoding

I am working on a project where we are looking to get some data from a universal robot, such as position and force data, and then store that data in a text file for later reference. We can receive the data just fine, but turning it into readable coordinates is an issue. An example data string is below:
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80\xbf\x00\x00\x80\xbf\x00\x00\x80\xbf\x00\x00\x80\xbf\x00\x00\x80\xbf\x00\x00\x80\xbf\x00\x00\xc0?\x00\x00\x16C\x00\x00\xc0?\x00\x00\x16C\x00\x00\x00?\xcd\xcc\xcc>\x00\x00\x96C\x00\x00\xc8A\x1e\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x88\xfb\x7f?\xd0M><\xc0G\x9e:tNT?\r\x11\x07\xbc\xb9\xfd\x7f?~\xa0\xa1:\x03\x02+?\x16\xeb\x7f\xbf#\xce\xcc\xbc9\xdfl\xbbq\xc3\x8a>i\x19T<\xf3\xf9\x7f\xbf\xb4k\x87\xbb->\xc2>\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x80?\xdb\x0f\xc9#\xa7\xdcU#\xa7\xdcU#\xa7\xdcU#\xa7\xdcU#\xa7\xdcU#\xa7\xdcU#\xfe\xff\xff\xff\xfe\xff\xff\xff\xfe\xff\xff\xff\xfe\xff\xff\xff\xfe\xff\xff\xff\xff\xff\xff\xff\xecb\xc7#\xecb\xc7#\xecb\xc7#\
*not entire string received
At first I thought it was hex so I tried the code:
packet_12 = packet_12.encode('hex')
x = str(packet_12)
x = struct.unpack('!d', packet_12.decode('hex'))[0]
all_data.write("X=", x * 1000)
But to no avail. I tried several different decoding methods using codecs and .encode, but none worked. I found the two code blocks below in a different post here:
y = codecs.decode(packet_12, 'utf-8', errors='ignore')
packet_12 = s.recv(8)
z = str(packet_12)
x = ''.join('%02x' % ord(c) for c in packet_12)
Neither worked for my application. Finally, I tried saving the entire string to a .txt file, opening it with Python, and decoding it with the code below, but again nothing seemed to happen.
with io.open('C:\\Users\\myuser\\Desktop\\decode.txt', 'r', encoding='utf8') as f:
    text = f.read()
with io.open('C:\\Users\\myuser\\Desktop\\decode', 'w', encoding='utf8') as f:
    f.write(text)
I am aware I might be missing something incredibly simple, such as using the wrong decoding type, or I might even have gibberish as the robot output, but any help is appreciated.
The easiest way to receive data from the robot with Python is to use Universal Robots' Real-Time Data Exchange (RTDE) interface. They offer some Python examples for receiving and sending data.
Check out my GitHub repo for example code based on the official code from UR:
https://github.com/jonenfabian/Read_Data_From_Universal_Robots
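As for the string in the question: it isn't text in any codec, so decoding will never work - it looks like packed binary numbers. For instance, the recurring b'\x00\x00\x80\xbf' is the little-endian IEEE-754 single-precision encoding of -1.0. A minimal sketch for pulling such values out with struct; the format and offsets here are assumptions for illustration (the real packet layout is defined in the UR client-interface documentation):
import struct

def unpack_floats(packet, offset, count, fmt='<f'):
    # Read `count` consecutive binary floats from `packet` starting at byte `offset`.
    # '<f' = little-endian 4-byte float; some UR interfaces use big-endian
    # doubles ('>d') instead, so treat the format as something to verify.
    size = struct.calcsize(fmt)
    return [struct.unpack_from(fmt, packet, offset + i * size)[0]
            for i in range(count)]

# e.g. unpack_floats(packet_12, 36, 6) would yield six -1.0 values from the
# sample packet above (assuming the '<f' guess and offset are right)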

Python csv package - issue with DictReader module

I'm having a curious issue with the csv package in Python 3.7.
I'm importing a CSV file and able to access all of the data as expected, with one exception - the header row, as stored in the fieldnames attribute, appears to have the first column header (the first item in fieldnames) malformed.
This first field always has the format: 'xxx"header"'
where:
xxx are garbage characters that always seem to be the same
header is the correct header text
See the following screenshot of my table <csv.DictReader> object from my debug window:
My code to open the file follows. I added headers[0] = table.fieldnames[0].split('"')[1] in order to extract the correct header and place it back into fieldnames.
import csv

with self.inputfile.open() as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    headers[0] = table.fieldnames[0].split('"')[1]
(Note: self.inputfile is a pathlib.Path object)
I didn't notice this for a long time because I wasn't using the first column (with the # header) - I've been happily parsing the rest of the columns for a while, across multiple files.
If I look directly at the csv file, there doesn't appear to be any issue.
Questions:
Does anyone know what the issue is? Is there anything I can try to correct the import issue?
If there isn't a fix, is there a better way to parse the garbage? I realize this could clear up in the future, but I think the split will still work even with just bare double quotes (the header should still be the 2nd item in the split, right?). Is there a better solution?
It looks like your csv file is encoded as utf-8-sig - a version of utf-8 used by some Windows applications, but it's being decoded as cp1252 - another encoding in common use on Windows.
>>> print('"#"'.encode('utf-8-sig').decode('cp1252'))
"#"
The "garbage" characters preceding the header are the byte-order-mark that utf-8-sig uses to tell Windows applications that a file is encoded as utf-8 rather than one of the historically more common 8-bit encodings.
To avoid the "garbage", specify utf-8-sig as the encoding when opening your file.
The code in the question could be modified to work like this:
import csv

encoding = 'utf-8-sig'
with self.inputfile.open(encoding=encoding, newline='') as self.inputfid:
    table = csv.DictReader(self.inputfid, delimiter=',')
    headers = table.fieldnames
    ...
If - as seems likely - the encoding of input files may vary, the value of encoding (or a best guess) must be determined using a tool like chardet, as suggested in the comments.
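A minimal sketch of that approach, assuming chardet is installed (pip install chardet):
import chardet

with self.inputfile.open('rb') as f:      # pathlib.Path.open, raw bytes, no decoding
    raw = f.read(4096)                    # a sample from the start is usually enough
guess = chardet.detect(raw)               # e.g. {'encoding': 'UTF-8-SIG', 'confidence': 1.0, ...}
encoding = guess['encoding'] or 'utf-8-sig'  # fall back if detection comes up empty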

How to resolve an encoding issue?

I need to read the content of a CSV file using Python. However, when I run this code:
with open(self.path, 'r') as csv_file:
    csv_reader = csv.reader(csv_file, dialect=csv.excel, delimiter=';')
    self.data = [[cell for cell in row] for row in csv_reader]
I get this error:
File "C:\Python36\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1137: character maps to <undefined>
My understanding is that this file was not encoded in cp1252, and that I need to find out which encoding was used. I tried a bunch of things, but nothing has worked so far.
About the file:
It is sent by an external company; I can't get more information about it.
It comes with other similar files, with which I don't have any issue when I run the same code.
It has an .xls extension, but it is really a CSV file delimited with semicolons.
When I open it with Excel, it opens in Compatibility Mode, but I don't see any sort of encoding issue: everything displays correctly.
What I already tried:
Saving it under a different file format to get rid of the compatibility mode
Adding an encoding in the first line of my code (I tried, more or less at random, some encodings that I know of):
with(open(self.path, 'r', encoding = 'utf8')) as csv_file:
Copy-pasting the content of the file into a new file, and even deleting the whole content of the file. Still does not work. This one really bugs me, because I feel like it means the problem is not in the content of the file, but in the file itself.
Searching everywhere for how to solve this kind of issue.
I recommend using the pandas library (along with numpy); it is very handy when it comes to data manipulation. This function imports the data from an xlsx or csv file:
/!\ change dataPath according to your needs /!\
import os
import pandas as pd

def GetData(directory, dataUse, format):
    dataPath = os.getcwd() + "\\Data\\" + directory + "\\" + dataUse + "Set." + format
    if format == "xlsx":
        dataSet = pd.read_excel(dataPath, sheet_name='Sheet1')  # 'sheetname' in older pandas
    elif format == "csv":
        dataSet = pd.read_csv(dataPath)
    return dataSet
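Usage would look something like this (the directory and file names are hypothetical):
dataSet = GetData("Sales", "train", "csv")   # reads .\Data\Sales\trainSet.csv
Note that pd.read_csv also accepts an encoding argument (e.g. pd.read_csv(dataPath, encoding='cp1252')), which is the knob that matters for the decoding error in the question.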
I finally found some sort of solution:
Open the file with Excel
Display the file properly using the "Text to Columns" feature
Save the file to csv format
Run the code
This does not quite satisfy me, but it works.
I still don't understand what the problem actually is or why this solved it, so I am interested in any additional information!
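One way to investigate such a file without round-tripping through Excel is to probe candidate encodings directly. A rough sketch (path stands for the problem file; note that latin-1 maps every possible byte, so it never fails and proves nothing on its own):
candidates = ['utf-8', 'utf-8-sig', 'cp1252', 'cp850', 'utf-16', 'latin-1']
with open(path, 'rb') as f:
    raw = f.read()
for enc in candidates:
    try:
        raw.decode(enc)
        print('decodes without error as', enc)
    except UnicodeDecodeError:
        print('fails as', enc)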

Django upload and handle CSV file with right encoding

I'm trying to upload and handle a CSV file in my Django project, but I get an encoding error. The CSV file was created on a Mac with Excel:
reader = csv.reader(request.FILES['file'].read().splitlines(), delimiter=";")
if withheader:
    reader.next()
data = [[field.decode('utf-8') for field in row] for row in reader]
With this code example I get an error: http://puu.sh/1VmXc
If I use latin-1 to decode instead, I get a different kind of "error":
data = [[field.decode('latin-1') for field in row] for row in reader]
the result is v¾gmontere, but it should be vægmontere.
Does anyone know what to do? I have tried a lot!
The Python 2 csv module comes with lots of unicode hassle. Try unicodecsv instead (see the sketch after these notes) or use Python 3.
Excel on Mac exports CSV with a broken encoding. Don't use it; use something more useful like LibreOffice instead (it has a much better CSV export, with options).
When handling user files: either make sure the files are consistently encoded in UTF-8 and only decode as UTF-8 (recommended), or use an encoding-detection library like chardet.
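A minimal sketch of the unicodecsv route, assuming the upload really is UTF-8. Given the symptom in the question ('æ' is byte 0xBE in Mac Roman, and 0xBE renders as '¾' in latin-1), 'mac_roman' is likely the encoding to try for files saved by Excel on a Mac:
import unicodecsv

# request.FILES['file'] is file-like; unicodecsv decodes each field itself
reader = unicodecsv.reader(request.FILES['file'], delimiter=';',
                           encoding='utf-8')  # or encoding='mac_roman'
if withheader:
    next(reader)
data = [row for row in reader]  # rows of unicode strings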
