Python 2.7: Write a unicode dictionary into a CSV file - python

I have a dictionary scraped from a Chinese website, all processed as Unicode. Now I want to write the data into a CSV file: the first line should contain all the dict.keys() and the second line all the dict.values().
How do I write this dictionary into the CSV? Specifically, I need all the Chinese characters displayed correctly in the CSV; I am having trouble converting them.
Thanks in advance,
data = {u'\u6ce8\u518c\u8d44\u672c': u'6500\u4e07\u5143\u4eba\u6c11\u5e01[8]', u'\u7ecf\u8425\u8303\u56f4': u'\u4e92\u8054\u7f51', u'\u5b98\u7f51': u'http://www.tencent.com/', u'\u6210\u7acb\u65f6\u95f4': u'1998\u5e7411\u670811\u65e5[8]', u'\u6ce8\u518c\u53f7': u'440301103448669[8]', u'\u5e74\u8425\u4e1a\u989d': u'1028.63\u4ebf\u5143\u4eba\u6c11\u5e01\uff082015\u5e74\uff09[9]', u'\u521b\u59cb\u4eba': u'\u9a6c\u5316\u817e\u5f20\u5fd7\u4e1c\u8bb8\u6668\u6654\u9648\u4e00\u4e39\u66fe\u674e\u9752[10]', u'\u603b\u90e8\u5730\u70b9': u'\u4e2d\u56fd\u6df1\u5733', u'\u603b\u88c1': u'\u5218\u70bd\u5e73', u'\u6ce8\u518c\u5730': u'\u6df1\u5733', u'\u5916\u6587\u540d\u79f0': u'Tencent', u'\u8463\u4e8b\u5c40\u4e3b\u5e2d': u'\u9a6c\u5316\u817e', u'\u5458\u5de5\u6570': u'2.5\u4e07\u4f59\u4eba\uff082014\u5e74\uff09', u'\u516c\u53f8\u6027\u8d28': u'\u6709\u9650\u8d23\u4efb\u516c\u53f8[8]', u'\u516c\u53f8\u53e3\u53f7': u'\u4e00\u5207\u4ee5\u7528\u6237\u4ef7\u503c\u4e3a\u4f9d\u5f52', u'\u4f01\u4e1a\u613f\u666f': u'\u6700\u53d7\u5c0a\u656c\u7684\u4e92\u8054\u7f51\u4f01\u4e1a', u'\u516c\u53f8\u4f7f\u547d': u'\u901a\u8fc7\u4e92\u8054\u7f51\u670d\u52a1\u63d0\u5347\u4eba\u7c7b\u751f\u6d3b\u54c1\u8d28', u'\u6cd5\u5b9a\u4ee3\u8868\u4eba': u'\u9a6c\u5316\u817e', u'\u767b\u8bb0\u673a\u5173': u'\u6df1\u5733\u5e02\u5e02\u573a\u76d1\u7763\u7ba1\u7406\u5c40\u5357\u5c71\u5c40[8]', u'\u516c\u53f8\u540d\u79f0': u'\u6df1\u5733\u5e02\u817e\u8baf\u8ba1\u7b97\u673a\u7cfb\u7edf\u6709\u9650\u516c\u53f8[8]'}

It would be trivial if you were using Python 3, which handles Unicode natively:
import csv
with open("file.csv", "w", newline='', encoding='utf8') as fd:
    dw = csv.DictWriter(fd, data.keys())
    dw.writeheader()
    dw.writerow(data)
Since you prefixed your unicode strings with u, I assume that you are using Python 2. The csv module is great at processing CSV files, but the Python 2 version does not natively handle Unicode strings. To process a unicode dict, you can simply encode its keys and values in UTF-8:
import csv
utf8data = { k.encode('utf8'): v.encode('utf8') for (k, v) in data.iteritems() }
with open("file.csv", "wb") as fd:
    dw = csv.DictWriter(fd, utf8data.keys())
    dw.writeheader()
    dw.writerow(utf8data)

Try the codecs module:
import codecs
with codecs.open(filename, "w", "utf-8") as f:
    for key, value in data.iteritems():
        f.write(key + ',' + value + '\n')
This should have the desired behaviour.

Encoding as 'utf-8' solves the problem. One approach for converting the dictionary into a CSV file is the pandas library, which handles this easily (note that orient='index' writes one key/value pair per row, rather than keys on one line and values on the next):
import pandas as pd
df = pd.DataFrame.from_dict(data, orient='index')
df.to_csv('output.csv', encoding='utf-8', header=None)

Related

What is the simplest way to fix an existing UTF-8 (no BOM) CSV file that does not display correctly in Excel?

I have the task of converting a UTF-8 CSV file to an Excel file, but it is not read properly in Excel because there is no byte order mark (BOM) at the beginning of the file.
I found this approach:
https://stackoverflow.com/a/38025106/6102332
import csv
with open('test.csv', 'w', newline='', encoding='utf-8-sig') as f:
    w = csv.writer(f)
    # Write Unicode strings.
    w.writerow([u'English', u'Chinese'])
    w.writerow([u'American', u'美国人'])
    w.writerow([u'Chinese', u'中国人'])
But it seems that only works when creating a brand-new file; it does not work for my file, which already has data.
Are there any easy ways anyone can share?
Is there any other way than this one? https://stackoverflow.com/a/6488070/6102332
Save the exported file as a csv
Open Excel
Import the data using Data-->Import External Data --> Import Data
Select the file type of "csv" and browse to your file
In the import wizard change the File_Origin to "65001 UTF" (or choose correct language character identifier)
Change the Delimiter to comma
Select where to import to and Finish
Read the file in and write it back out with the desired encoding:
with open('input.csv', 'r', encoding='utf-8-sig') as fin:
    with open('output.csv', 'w', encoding='utf-8-sig') as fout:
        fout.write(fin.read())
The utf-8-sig codec removes the BOM if present on read, and adds a BOM on write, so the above can safely run on files with or without a BOM originally.
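As a quick sanity check (a minimal standalone sketch, no file involved), the utf-8-sig round-trip behaviour can be seen directly on bytes:

```python
# encode prepends the UTF-8 BOM (EF BB BF)...
encoded = u'English,Chinese'.encode('utf-8-sig')
print(encoded[:3])   # b'\xef\xbb\xbf'

# ...decode strips a leading BOM if present...
print(encoded.decode('utf-8-sig'))   # English,Chinese

# ...and is harmless when no BOM is there.
print(b'plain'.decode('utf-8-sig'))  # plain
```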
You can convert in place by doing:
file = 'test.csv'
with open(file, 'r', encoding='utf-8-sig') as f:
    data = f.read()
with open(file, 'w', encoding='utf-8-sig') as f:
    f.write(data)
Note also that UTF-16 works as well; some older versions of Excel don't handle UTF-8 correctly.
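If you go the UTF-16 route, Python prepends the BOM for you when the file is opened with encoding='utf-16'. A sketch (the filename is illustrative; Excel generally opens UTF-16 text most reliably when it is tab-delimited):

```python
import csv

# Illustrative filename; tab delimiter for Excel's UTF-16 text handling.
with open('utf16_test.csv', 'w', newline='', encoding='utf-16') as f:
    w = csv.writer(f, delimiter='\t')
    w.writerow([u'American', u'美国人'])

# encoding='utf-16' writes a BOM automatically
# (b'\xff\xfe' on little-endian platforms).
with open('utf16_test.csv', 'rb') as f:
    head = f.read(2)
print(head)
```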
Thank you!
I have found a way to automatically handle the missing UTF-8 BOM signature.
Besides the missing BOM, there is another problem: a duplicate BOM signature can end up mixed into the file data. Excel does not show it, so comparisons and calculations against other data silently go wrong. For example:
data -> Excel
Chinese -> Chinese
12 -> 12
The cell actually contains BOM + 'Chinese', so if you compare it, obviously ChineseBOM will not be equal to Chinese.
Python code to solve the problem:
import codecs
bom_utf8 = codecs.BOM_UTF8
def fix_duplicate_bom_utf8(file, bom=bom_utf8):
    with open(file, 'rb') as f:
        data_f = f.read()
    data_finish = bom + data_f.replace(bom, b'')
    with open(file, 'wb') as f:
        f.write(data_finish)
    return
# Use:
file_csv = r"D:\data\d20200114.csv"  # American, 美国人
fix_duplicate_bom_utf8(file_csv)
# file_csv -> American, 美国人
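The fix can be checked end-to-end on a throwaway file; the sketch below repeats the function and uses an illustrative filename:

```python
import codecs

bom = codecs.BOM_UTF8

def fix_duplicate_bom_utf8(path, bom=bom):
    # Strip every BOM from the data, then prepend exactly one.
    with open(path, 'rb') as f:
        data = f.read()
    with open(path, 'wb') as f:
        f.write(bom + data.replace(bom, b''))

# Simulate a file with a stray BOM stuck in the middle of the data.
with open('bom_test.csv', 'wb') as f:
    f.write(bom + b'American,' + bom + u'美国人'.encode('utf-8'))

fix_duplicate_bom_utf8('bom_test.csv')

with open('bom_test.csv', 'rb') as f:
    cleaned = f.read()
print(cleaned.count(bom) == 1 and cleaned.startswith(bom))   # True
```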

Import Data from scraping into CSV

I'm using pycharm and Python 3.7.
I would like to write data to a CSV, but my code writes only the first line of my data to the file... does anyone know why?
This is my code:
from pytrends.request import TrendReq
import csv
pytrend = TrendReq()
pytrend.build_payload(kw_list=['auto model A',
'auto model C'])
# Interest Over Time
interest_over_time_df = pytrend.interest_over_time()
print(interest_over_time_df.head(100))
writer = csv.writer(open("C:\\Users\\Desktop\\Data\\c.csv", 'w', encoding='utf-8'))
writer.writerow(interest_over_time_df)
Try using pandas:
import pandas as pd
interest_over_time_df.to_csv("file.csv")
Once I encountered the same problem and solved it like below:
with open("file.csv", "rb") as fh:
More precisely:
r = read mode
b = binary mode specifier: open() treats the file as binary, so the contents remain bytes and no decoding attempt happens. (Binary mode cannot be combined with an encoding argument.)
As we know, Python tries to convert a byte array (bytes that it assumes to be a utf-8-encoded string) to a unicode string (str). This process is of course a decoding according to UTF-8 rules. When it tries this, it encounters a byte sequence which is not allowed in utf-8-encoded strings (namely this 0xff at position 0).
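The failure described above can be reproduced in isolation (a sketch with a hand-made byte string, not the actual file):

```python
# A buffer that starts with 0xff, e.g. a UTF-16-LE BOM, is not valid UTF-8.
data = b'\xff\xfeH\x00i\x00'

try:
    data.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc.reason)   # invalid start byte
    print(exc.start)    # 0

# Read in binary mode, no decoding is attempted and the bytes survive as-is.
print(data[:1])         # b'\xff'
```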
You could try something like:
import csv
with open(<path to output_csv>, "wb") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    for line in interest_over_time_df:
        writer.writerow(line)
Read more here: https://www.pythonforbeginners.com/files/with-statement-in-python
You need to loop over the data and write it line by line.
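A minimal sketch of that row-by-row loop (Python 3; the header and rows below are hypothetical stand-ins for the pytrends result — with a real interest_over_time_df you would iterate interest_over_time_df.itertuples() instead):

```python
import csv

# Hypothetical data standing in for interest_over_time_df.
header = ['date', 'auto model A', 'auto model C']
rows = [('2020-01-01', 55, 43),
        ('2020-01-02', 57, 41)]

with open('c.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(header)   # column names first
    for row in rows:          # then one data row per line
        writer.writerow(row)
```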

Cleaning unicode characters while writing to csv

I am using a certain REST API to get data, and then attempting to write it to a CSV using Python 2.7.
In the CSV, every item from a tuple has u' ' around it. For example, with the 'tags' field I am retrieving, I get [u'01d/02d/major/--', u'45m/04h/12h/24h', u'internal', u'net', u'premium_custom', u'priority_fields_swapped', u'priority_saved', u'problem', u'urgent', u'urgent_priority_issue']. However, if I print the data in the program before it is written to the CSV, the data looks fine, i.e. ('01d/02d/major/--', '45m/04h/12h/24h', etc.). So I assume I have to modify something in the CSV write command or within the csv writer object itself. My question is how to write the data into the CSV properly so that there are no unicode prefixes.
In Python 3:
Just define the encoding when opening the CSV file to write into. (With encoding='ascii' as below, a row containing non-ASCII characters will raise UnicodeEncodeError; use encoding='utf-8' if the data may contain any.)
row = [u'01d/02d/major/--', u'45m/04h/12h/24h', u'internal', u'net', u'premium_custom', u'priority_fields_swapped', u'priority_saved', u'problem', u'urgent', u'urgent_priority_issue']
import csv
with open('output.csv', 'w', newline='', encoding='ascii') as f:
    writer = csv.writer(f)
    writer.writerow(row)
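For what it's worth, the u'...' prefixes usually mean the whole list landed in a single cell: csv stringifies non-string fields, and in Python 2 the list's repr includes the u prefixes. A Python 3 sketch of the difference (where the repr simply shows plain quotes):

```python
import csv
import io

tags = ['internal', 'net', 'urgent']
buf = io.StringIO()
writer = csv.writer(buf)

writer.writerow([tags])   # ONE cell containing the repr of the list
writer.writerow(tags)     # three cells: internal,net,urgent

print(buf.getvalue())
# "['internal', 'net', 'urgent']"
# internal,net,urgent
```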

How can I remove the quote characters from the first field name when unicodecsv.DictReader is parsing a UTF-8-BOM file in Python2.7?

The issue: when unicodecsv.DictReader parses a CSV file whose fields contain quotes and the file is encoded as UTF-8 with a BOM, the first field retains the quote characters while all subsequent fields have them properly removed.
Example UTF-8-BOM encoded CSV File:
"Field1","Field2","Field3"
content1,content2,content3
Example Python Code:
from unicodecsv import DictReader
filename = "/tmp/test.csv"
with open(filename, mode='r') as read_stream:
    reader = DictReader(read_stream, encoding='utf-8-sig')
    print reader.fieldnames
Print Value:
['"Field1"','Field2','Field3']
Is there a way to have that first field be like the others and have the quote characters removed?
One way is to consume the BOM manually yourself (though I suspect the code as written demonstrates an actual bug in the underlying library, and it should be reported in their issues on GitHub). After consuming the BOM, use the plain utf-8 codec instead.
# My test code to write a file with a BOM
import io
filename = "/tmp/test.csv"
with io.open(filename, 'w', encoding='utf-8-sig') as f:
    f.write(u'''\
"Field1","Field2","Field3"
content1,content2,content3
''')

from unicodecsv import DictReader
with open(filename, mode='r') as read_stream:
    # Consume the BOM
    read_stream.read(3)
    reader = DictReader(read_stream, encoding='utf-8')
    print reader.fieldnames

Non-ASCII character export to CSV with Python

I have a list of lists, and I export it to a CSV. Some of the list entries are strings, some of which contain non-ASCII characters.
For example: Name = "Ömer Berin"
I tried Name.encode('utf-8') before exporting, but in the CSV the name shows up like this: "Γ–mer Berin"
I use this code for exporting:
import csv
with open("output.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(mylist)
The UnicodeWriter example in the Python 2 csv documentation should satisfy your needs: http://docs.python.org/2/library/csv.html#csv-examples
