Python - read xls -> manipulate -> write CSV - python

I'm trying to achieve the following:
input: xls file
output: csv file
I want to read the xls file and do some manipulations: rewrite the headers (original: customernumer, the CSV needs Customer_Number__c), remove some columns, etc.
Right now I'm already reading the xls file and trying to write it out as CSV (without any manipulations), but I'm struggling with the encoding.
The original file contains some "special" characters like "/", "\", and most importantly "ä, ü, ö, ß".
I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 8: ordinal not in range(128)
I don't know in advance which special characters a file may contain; this changes from time to time.
Here is my current sandbox code:
# -*- coding: utf-8 -*-
__author__ = 'adieball'
import xlrd
import csv
from os import sys
import argparse
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("inname", type=str,
                        help="Names of the Input File in single quotes")
    parser.add_argument("--outname", type=str,
                        help="Optional enter the name of the output (csv) file. if nothing is given, "
                             "we use the name of the input file and add .csv to it")
    args = parser.parse_args()
    if args.outname is None:
        outname = args.inname + ".csv"
    else:
        outname = args.outname
    wb = xlrd.open_workbook(args.inname)
    xl_sheet = wb.sheet_by_index(0)
    print args.inname
    print ('Retrieved worksheet: %s' % xl_sheet.name)
    print outname
    output = open(outname, 'wb')
    wr = csv.writer(output, quoting=csv.QUOTE_ALL)
    for rownum in xrange(wb.sheet_by_index(0).nrows):
        wr.writerow(wb.sheet_by_index(0).row_values(rownum))
    output.close()
Is there anything I can do here to make sure these special characters get written to the CSV exactly as they appear in the original xls file?
Thanks,
Andre

a simple
from os import sys
reload(sys)
sys.setdefaultencoding("utf-8")
did the trick
Andre

You could convert the script to Python 3 and open the output file in mode "w" instead of "wb", so that it writes Unicode text. Not trying to evangelize, but Python 3 makes this sort of thing easier. If you want to stay with Python 2, check out this guide: https://docs.python.org/2/howto/unicode.html
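As a rough idea of what the Python 3 route could look like for the write step only, here is a minimal sketch. It assumes the rows are plain lists like the ones row_values() returns; the function name write_rows_utf8 and the explicit UTF-8 target encoding are my own choices, not something taken from the question.

```python
import csv


def write_rows_utf8(rows, out_path):
    """Write rows (lists of values) to out_path as UTF-8 encoded CSV.

    Hypothetical helper for illustration; the name and the UTF-8
    target encoding are assumptions.
    """
    # Text mode plus an explicit encoding lets Python encode ä, ü, ö, ß
    # on the way out. newline="" leaves line endings to the csv module.
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
        writer.writerows(rows)
```

In the question's loop this would replace the open(outname, 'wb') / csv.writer pair, with the rows collected from the sheet first.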

If you want to write a UTF-8 encoded file in Python 2, use codecs.open. Try this small example:
o1 = open('/tmp/o1.txt', 'wb')
try:
    o1.write(u'\u20ac')
except Exception, exc:
    print exc
o1.close()
import codecs
o2 = codecs.open('/tmp/o2.txt', 'w', 'utf-8')
o2.write(u'\u20ac')
o2.close()

Why not use the UnicodeWriter class from the examples in the csv docs: https://docs.python.org/2/library/csv.html#examples ? I think it should solve your problem.
If not, and if you have Excel, I would suggest a different approach: use win32com to dispatch Excel and work with the Excel object model. You can use built-in Excel functions to rename columns, delete columns, etc., and then save the result as CSV.
E.g.
import win32com.client
excelInstance = win32com.client.gencache.EnsureDispatch('Excel.Application')
workbook = excelInstance.Workbooks.Open(filepath)
worksheet = workbook.Worksheets('WorksheetName')
#### do what you like
worksheet.UsedRange.Find('customernumer').Value2 = 'Customer_Number__c'
####
workbook.SaveAs('Filename.csv', 6) #6 means csv in XlFileFormat enumeration

Related

Read CSV file into pandas dataframe from FTPS server

I am unable to grab the data from a CSV file to be able to put it into a pandas dataframe. I am able to get into the directory and see all of the files there, but I haven't been able to access the document.
Here is my code:
from ftplib import FTP_TLS
import socket
import pandas as pd
server = FTP_TLS('server', certfile=r'C:/')
server.login(user,pw)
# get into respective directory
server.cwd('Banana')
server.prot_p()
# This piece here is needed in order to see what is in my directory, I don't understand why.
# Something about the server not being set up correctly?
server.af = socket.AF_INET6
# check location
server.pwd()
# check files
server.dir()
# Get CSV file data
import io
download_file = io.BytesIO()
download_file.seek(0)
server.retrbinary('RETR ' + str('file.csv'), download_file.write)
download_file.seek(0)
file_to_process = pd.read_csv(download_file, engine='python')
The problem is that the last part of the code, from import io down to file_to_process, just sits there and does nothing. Maybe it times out? I'm unsure what the issue is.
The new error is this:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3376: character maps to <undefined>
Edit: Now I'm trying to save the file to disk instead, but this code empties the file. Am I misunderstanding how write works?
filematch = 'Try20.csv'
target_dir = r'\\server'
import os
for filename in server.nlst(filematch):
    target_file_name = os.path.join(target_dir,os.path.basename(filename))
    with open(target_file_name,'wb') as fhandle:
        server.retrbinary('RETR %s' %filename, fhandle.write)
Also, I don't understand how to load the contents of fhandle into a dataframe now.
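On the UnicodeDecodeError: the 'charmap' codec in the message suggests the downloaded bytes were decoded with a single-byte Windows code page rather than the file's real encoding. A minimal sketch of decoding the in-memory buffer with an explicit encoding follows. The UTF-8 default is only a guess about the file, and rows_from_download is a made-up name; it uses the standard csv module to stay self-contained.

```python
import csv
import io


def rows_from_download(buffer, encoding="utf-8"):
    """Parse CSV rows from a BytesIO buffer using an explicit encoding.

    Hypothetical helper; the default encoding is an assumption about
    the file, not something known from the question.
    """
    # getvalue() returns the whole buffer regardless of the current
    # position, so no seek(0) is needed after retrbinary has written to it.
    text = buffer.getvalue().decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))
```

With pandas, the analogous step would be passing an encoding argument to pd.read_csv on the rewound buffer.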

how to convert xlsx to tab delimited files

I have quite a lot of xlsx files, and converting them one by one to tab-delimited files is a pain.
I would like to know whether there is a way to do this with Python. Here is what I found and what I tried, without success.
I found this and tried the solution, but it did not work: Mass Convert .xls and .xlsx to .txt (Tab Delimited) on a Mac
I also tried it on a single file to see how it works, but with no success:
#!/usr/bin/python
import xlrd
import csv
def main():
    # I open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # I don't know the name of sheet
    mysheet = myfile.sheet_by_index(0)
    # I open the output csv
    myCsvfile = open('my.csv', 'wb')
    # I write the file into it
    wr = csv.writer(myCsvfile, delimiter="\t")
    for rownum in xrange(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))
    myCsvfile.close()
if __name__ == '__main__':
    main()
There is no real need for the main function.
I'm not sure about your indentation problems, but this is how I would write what you have. (It should work, according to the first comment above.)
#!/usr/bin/python
import xlrd
import csv
# open the output csv
with open('my.csv', 'wb') as myCsvfile:
    # define a writer
    wr = csv.writer(myCsvfile, delimiter="\t")
    # open the xlsx file
    myfile = xlrd.open_workbook('myfile.xlsx')
    # get a sheet
    mysheet = myfile.sheet_by_index(0)
    # write the rows
    for rownum in xrange(mysheet.nrows):
        wr.writerow(mysheet.row_values(rownum))
Why go through so much pain when you can do it in 3 lines:
import pandas as pd
file = pd.read_excel('myfile.xlsx')
# write to a separate .txt file so the source workbook is not overwritten
file.to_csv('myfile.txt',
            sep="\t",
            index=False)
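Since the question is about many files rather than one, here is a rough Python 3 sketch of the batch loop. The reading step is passed in as a function so the sketch does not depend on a particular Excel library; convert_folder and read_rows are hypothetical names, and writing each output next to its source with a .txt extension is my assumption.

```python
import csv
from pathlib import Path


def convert_folder(folder, read_rows):
    """Write a tab-delimited .txt file next to every .xlsx file in folder.

    read_rows is a caller-supplied function (hypothetical) that takes a
    Path and returns the sheet as a list of rows.
    """
    written = []
    for source in sorted(Path(folder).glob("*.xlsx")):
        target = source.with_suffix(".txt")
        # newline="" leaves line endings to the csv module.
        with open(target, "w", newline="", encoding="utf-8") as handle:
            csv.writer(handle, delimiter="\t").writerows(read_rows(source))
        written.append(target)
    return written
```

With pandas installed, read_rows could be something like lambda p: pd.read_excel(p, header=None).values.tolist(), though I have not tried that against your files.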

CSV new-line character seen in unquoted field error

the following code worked until today when I imported from a Windows machine and got this error:
new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
import csv
class CSV:
    def __init__(self, file=None):
        self.file = file
    def read_file(self):
        data = []
        file_read = csv.reader(self.file)
        for row in file_read:
            data.append(row)
        return data
    def get_row_count(self):
        return len(self.read_file())
    def get_column_count(self):
        new_data = self.read_file()
        return len(new_data[0])
    def get_data(self, rows=1):
        data = self.read_file()
        return data[:rows]
How can I fix this issue?
def upload_configurator(request, id=None):
    """
    A view that allows the user to configure the uploaded CSV.
    """
    upload = Upload.objects.get(id=id)
    csvobject = CSV(upload.filepath)
    upload.num_records = csvobject.get_row_count()
    upload.num_columns = csvobject.get_column_count()
    upload.save()
    form = ConfiguratorForm()
    row_count = csvobject.get_row_count()
    colum_count = csvobject.get_column_count()
    first_row = csvobject.get_data(rows=1)
    first_two_rows = csvobject.get_data(rows=5)
It would be good to see the CSV file itself, but this might work for you. Try replacing:
file_read = csv.reader(self.file)
with:
file_read = csv.reader(self.file, dialect=csv.excel_tab)
Or open the file in universal-newline mode and pass it to csv.reader, like this:
reader = csv.reader(open(self.file, 'rU'), dialect=csv.excel_tab)
Or use splitlines(), like this:
def read_file(self):
    with open(self.file, 'r') as f:
        data = [row for row in csv.reader(f.read().splitlines())]
    return data
I realize this is an old post, but I ran into the same problem and don't see the correct answer, so I will give it a try.
Python Error:
_csv.Error: new-line character seen in unquoted field
Caused by trying to read Macintosh (pre OS X formatted) CSV files. These are text files that use CR for end of line. If using MS Office make sure you select either plain CSV format or CSV (MS-DOS). Do not use CSV (Macintosh) as save-as type.
My preferred EOL version would be LF (Unix/Linux/Apple), but I don't think MS Office provides the option to save in this format.
For Mac OS X, save your CSV file in "Windows Comma Separated (.csv)" format.
If this happens to you on a Mac (as it did to me):
Save the file as CSV (MS-DOS Comma-Separated)
Run the following script
with open(csv_filename, 'rU') as csvfile:
    csvreader = csv.reader(csvfile)
    for row in csvreader:
        print ', '.join(row)
Try running dos2unix on the files imported from Windows first.
This is an error that I faced. I had saved the .csv file on Mac OS X.
Saving it as "Windows Comma Separated Values (.csv)" instead resolved the issue.
This worked for me on OS X.
import csv
# allow a string variable to be opened like a file
from io import StringIO
# library that transliterates accented and other non-ASCII characters to plain ASCII
from unidecode import unidecode
# cleanse an input file with Windows formatting into a plain Unicode string
with open(filename, 'rb') as fID:
    uncleansedBytes = fID.read()
    # decode the file using the correct encoding scheme
    # (probably this old windows one)
    uncleansedText = uncleansedBytes.decode('Windows-1252')
    # replace carriage returns with newlines
    cleansedText = uncleansedText.replace('\r', '\n')
    # transliterate any remaining non-ASCII characters (not used by the reader below)
    asciiText = unidecode(cleansedText)
# read each line of the csv file and store as an array of dicts,
# use first line as field names for each dict.
reader = csv.DictReader(StringIO(cleansedText))
for line_entry in reader:
    # do something with your read data
    pass
I know this was answered some time ago, but the answers did not solve my problem. I am using DictReader and StringIO to read my CSV because of some other complications. I was able to solve the problem more simply by replacing the line endings explicitly:
import urllib.request
with urllib.request.urlopen(q) as response:
    raw_data = response.read()
    encoding = response.info().get_content_charset('utf8')
data = raw_data.decode(encoding)
if '\r\n' not in data:
    # probably CR-only line endings...try to convert them to CRLF
    data = data.replace('\r', '\r\n')
This might not be reasonable for enormous CSV files, but it worked well for my use case.
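Along the same lines, here is a small Python 3 sketch that normalises every line-ending style to LF before handing the text to the csv module. It is a variation on the replace idea above, not the code any of the answers used, and read_csv_any_newline is a made-up name. Like the answer above, it reads the whole text into memory.

```python
import csv
import io


def read_csv_any_newline(text):
    """Parse CSV text whose lines end in CRLF, bare CR, or LF.

    Hypothetical helper for illustration only.
    """
    # Convert CRLF first so a CRLF pair does not turn into two newlines.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # newline="" stops StringIO from translating line endings again.
    return list(csv.reader(io.StringIO(normalized, newline="")))
```

One side effect to be aware of: a bare CR inside a quoted field also becomes LF.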
An alternative, quick solution: I faced the same error. I reopened the "weird" CSV file in Gnumeric on my Lubuntu machine and exported it again as CSV. This fixed the issue.

python unicode csv export using pyramid

I'm trying to export mongodb that has non ascii characters into csv format.
Right now, I'm dabbling with pyramid and using pyramid.response.
import csv
import os
import time
from pyramid.response import Response
from mycart.Member import Member
#view_config(context="mycart:resources.Member", name='', request_method="POST", permission = 'admin')
def member_export( context, request):
    filename = 'member-'+time.strftime("%Y%m%d%H%M%S")+".csv"
    download_path = os.getcwd() + '/MyCart/mycart/static/downloads/'+filename
    member = Members(request)
    my_list = [['First Name,Last Name']]
    record = member.get_all_member( )
    for r in record:
        mystr = [ r['fname'], r['lname']]
        my_list.append(mystr)
    with open(download_path, 'wb') as f:
        fileWriter = csv.writer(f, delimiter=',',quotechar='|', quoting=csv.QUOTE_MINIMAL)
        for l in my_list:
            print(l)
            fileWriter.writerow(l)
    size = os.path.getsize(download_path)
    response = Response(content_type='application/force-download', content_disposition='attachment; filename=' + filename)
    response.app_iter = open(download_path , 'rb')
    response.content_length = size
    return response
In MongoDB, the first name shows as 王, and when I use print it also shows 王. However, when I open the file in Excel, it shows random characters: ç¾…
However, when I tried to view it in shell
$ more member-20130227141550.csv
It managed to display the non ascii character correctly.
How should I rectify this problem?
I'm not a Windows guy, so I am not sure whether the problem may be with your code or with excel just not handling non-ascii characters nicely. But I have noticed that you are writing your file with python csv module, which is notorious for headaches with unicode.
Other users have reported success with using unicodecsv as a replacement for the csv module. Perhaps you could try dropping in this module as a csv writer and see if your problem magically goes away.
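A separate workaround that is often suggested when Excel shows UTF-8 text as garbage is to write a byte order mark at the start of the file, since Excel tends to rely on it to recognise UTF-8. This is not what the answer above proposes. It is sketched here for Python 3, and write_csv_for_excel is a made-up name.

```python
import csv


def write_csv_for_excel(rows, out_path):
    """Write rows as UTF-8 CSV with a leading byte order mark.

    Hypothetical helper; whether Excel then opens the file correctly
    depends on the Excel version and is not guaranteed here.
    """
    # The "utf-8-sig" codec writes the BOM bytes EF BB BF before the data.
    with open(out_path, "w", newline="", encoding="utf-8-sig") as handle:
        csv.writer(handle).writerows(rows)
```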

python file copy gives larger file

I stumbled across something that is not a problem, but rather puzzling. I am copying an XML file, myxml.xml, to myxml_copy.xml, and the output file is bigger than the input. I don't understand why this is so. Does it have anything to do with file encoding?
Anyway, the code I am using (although it is fairly trivial):
from xml.dom.minidom import parseString
import sys
def parseXml():
    data = open(in_filename,'r').read()
    return data
try:
    in_filename = sys.argv[1]
    out_filename = sys.argv[2]
    out_file = open(out_filename,'w')
    out_file.write(parseXml())
    out_file.close()
except Exception,e:
    print "usage: python copy.py <in_file> <out_file>"
    print "Error",e
NOTE: I am not looking for a way to copy a file. I will be modifying the original xml file later (cutting and pasting different parts of it).
I think the problem is that you need to open the files in binary mode: rb instead of r for reading, and wb instead of w for writing.
With rb, sequences like \r\n are left as they are, but with r they are converted to \n.
In short - just change the lines:
data = open(in_filename,'r').read()
out_file = open(out_filename,'w')
to
data = open(in_filename,'rb').read()
out_file = open(out_filename,'wb')
Did that help?
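To see the difference between the two modes, here is a small Python 3 sketch (the question's code is Python 2, where the translation depends on the platform). The function names are made up, and the UTF-8 encoding in the text-mode read is an assumption.

```python
def copy_binary(src, dst):
    """Copy src to dst byte for byte, with no newline translation."""
    with open(src, "rb") as reader, open(dst, "wb") as writer:
        writer.write(reader.read())


def read_text_mode(path):
    """Read a file in text mode; Python 3 turns \\r\\n into \\n here."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()
```

A file containing <a>\r\n</a>\r\n comes back from read_text_mode with plain \n line endings, while copy_binary leaves every byte as it was, so the sizes of original and copy match.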
