Handling special characters from Excel to CSV using Python

Hello, I am having issues handling a special character when converting an Excel sheet to CSV using Python.
When I use

    else:
        # Encode strings into UTF-8 format to preserve content of cell
        row_values.append(cell.value.encode("UTF-8").strip())

I get the special character as 'Â', and when I use

    else:
        # Encode strings into ISO-8859-1 format to preserve content of cell
        row_values.append(cell.value.encode("iso-8859-1").strip())

I get the special character as '�' (a question mark in a diamond).
I believe it is something to do with encoding, but I am not sure which one to use. These characters come from an Excel sheet converted to CSV.
Here is the code I used:
    def convert_to_csv(excel_file, input_dir, output_dir):
        """Convert an excel file to a CSV file by removing irrelevant data"""
        try:
            sheet = read_excel(excel_file)
        except UnicodeDecodeError:
            print 'File %s is possibly corrupt. Please check again.' % (excel_file)
            sys.exit(1)
        row_num = sheet.get_highest_row()  # Number of rows
        col_num = sheet.get_highest_column()  # Number of columns
        all_rows = []
        # Loop through rows and columns
        for row in range(row_num):
            row_values = []
            for column in range(col_num):
                # Get cell element
                cell = sheet.cell(row=row, column=column)
                # Ignore empty cells
                if cell.value is not None:
                    if type(cell.value) == int or type(cell.value) == float:
                        # String encoding not applicable for integers and floating point numbers
                        row_values.append(cell.value)
                    else:
                        # Encode strings into ISO-8859-1 format to preserve content of cell
                        row_values.append(cell.value.encode("iso-8859-1").strip())
                else:
                    row_values.append('')
            # Append rows only having more than three values each
            if len(set(row_values) - {''}) > 3:
                # print row_values
                all_rows.append(row_values)
        # Saving the data to a csv extension with the same name as the given excel file
        output_path = os.path.join(output_dir, excel_file.split('.')[0] + '.csv')
        with open(output_path, 'wb') as f:
            writer = csv.writer(f, delimiter=";", quoting=csv.QUOTE_ALL)
            writer.writerows(all_rows[1:])
I am using Python 2.6.9.
I was wondering if we can use a regular expression just before writing to CSV.
Is there any way we can handle this?
Thanks in advance.

Well, I got it fixed (note this needs import re):

    else:
        # Strip non-ASCII characters to preserve the readable content of the cell
        row_values.append(
            re.sub(r'[^\x00-\x7f]', r'', cell.value).strip())
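
For reference, a minimal sketch (my own example, Python 2) of why 'Â' appeared: a character such as U+00A0, the non-breaking space Excel often inserts, becomes the two bytes C2 A0 when encoded as UTF-8, and a viewer that assumes Latin-1 renders that C2 byte as 'Â':

    # -*- coding: utf-8 -*-
    nbsp = u'\xa0'                             # U+00A0, non-breaking space
    encoded = nbsp.encode('utf-8')             # the two bytes '\xc2\xa0'
    print repr(encoded.decode('iso-8859-1'))   # u'\xc2\xa0' -- shows up as 'Â' plus a space

The re.sub() fix above simply removes every non-ASCII character, which is why the stray characters disappear.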

Related

Extract data from text file to "String only" csv with python

I am trying to extract data from a text file into a CSV file.
I am using this:

    fileOutput = open(outputFolder + '/' + outputfile, mode='w+', newline='', encoding='utf-8')
    file_writer = csv.writer(fileOutput, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

However, when I have a value like "0009", it's parsed as "9" in the CSV.
My question is:
Is there a way I can force all values to be parsed as strings to get the data as it is?
Thank you
The CSV writer includes the leading 0s if you pass the value as a string, but not if you've already converted the value to an integer:

    writer.writerow([int('0009')])  # writes "9" to the file
    writer.writerow(['0009'])       # writes "0009" to the file

When you pass it an integer, it has no way of knowing how many leading zeros there were in the original text - that information has already been discarded. You need to look at the code that's extracting your data from the original text file, and keep that code from doing a conversion to integer.
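For illustration, a minimal sketch (my own; the tab-separated input file name is an assumption, not from the question) that extracts fields without ever converting them to int, so '0009' survives intact:

    import csv

    # Every field stays a string from extraction to output.
    with open('input.txt', encoding='utf-8') as src, \
            open('output.csv', mode='w', newline='', encoding='utf-8') as dst:
        file_writer = csv.writer(dst, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for line in src:
            file_writer.writerow(line.rstrip('\n').split('\t'))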

How to preserve trailing zeros with Python CSV Writer

I am trying to convert a JSON file with individual JSON lines to CSV. The JSON data has some elements with trailing zeros that I need to maintain (e.g. 1.000000). When writing to CSV the value is changed to 1.0, removing all trailing zeros except the first zero following the decimal point. How can I keep all trailing zeros? The number of trailing zeros may not always be static.
Updated the formatting of the sample data.
Here is a sample of the json input:
{"ACCOUNTNAMEDENORM":"John Smith","DELINQUENCYSTATUS":2.0000000000,"RETIRED":0.0000000000,"INVOICEDAYOFWEEK":5.0000000000,"ID":1234567.0000000000,"BEANVERSION":69.0000000000,"ACCOUNTTYPE":1.0000000000,"ORGANIZATIONTYPEDENORM":null,"HIDDENTACCOUNTCONTAINERID":4321987.0000000000,"NEWPOLICYPAYMENTDISTRIBUTABLE":"1","ACCOUNTNUMBER":"000-000-000-00","PAYMENTMETHOD":12345.0000000000,"INVOICEDELIVERYTYPE":98765.0000000000,"DISTRIBUTIONLIMITTYPE":3.0000000000,"CLOSEDATE":null,"FIRSTTWICEPERMTHINVOICEDOM":1.0000000000,"HELDFORINVOICESENDING":"0","FEINDENORM":null,"COLLECTING":"0","ACCOUNTNUMBERDENORM":"000-000-000-00","CHARGEHELD":"0","PUBLICID":"xx:1234346"}
Here is a sample of the output:
ACCOUNTNAMEDENORM,DELINQUENCYSTATUS,RETIRED,INVOICEDAYOFWEEK,ID,BEANVERSION,ACCOUNTTYPE,ORGANIZATIONTYPEDENORM,HIDDENTACCOUNTCONTAINERID,NEWPOLICYPAYMENTDISTRIBUTABLE,ACCOUNTNUMBER,PAYMENTMETHOD,INVOICEDELIVERYTYPE,DISTRIBUTIONLIMITTYPE,CLOSEDATE,FIRSTTWICEPERMTHINVOICEDOM,HELDFORINVOICESENDING,FEINDENORM,COLLECTING,ACCOUNTNUMBERDENORM,CHARGEHELD,PUBLICID
John Smith,2.0,0.0,5.0,1234567.0,69.0,1.0,,4321987.0,1,000-000-000-00,10012.0,10002.0,3.0,,1.0,0,,0,000-000-000-00,0,bc:1234346
Here is the code:

    import json
    import csv

    f = open('test2.json')  # open input file
    outputFile = open('output.csv', 'w', newline='')  # load csv file
    output = csv.writer(outputFile)  # create a csv.writer
    i = 1
    for line in f:
        try:
            data = json.loads(line)  # reads current line into a dict
        except:
            print("Can't load line {}".format(i))
        if i == 1:
            header = data.keys()
            output.writerow(header)  # writes header row
        i += 1
        output.writerow(data.values())  # writes values row
    f.close()  # close input file
The desired output would look like:
ACCOUNTNAMEDENORM,DELINQUENCYSTATUS,RETIRED,INVOICEDAYOFWEEK,ID,BEANVERSION,ACCOUNTTYPE,ORGANIZATIONTYPEDENORM,HIDDENTACCOUNTCONTAINERID,NEWPOLICYPAYMENTDISTRIBUTABLE,ACCOUNTNUMBER,PAYMENTMETHOD,INVOICEDELIVERYTYPE,DISTRIBUTIONLIMITTYPE,CLOSEDATE,FIRSTTWICEPERMTHINVOICEDOM,HELDFORINVOICESENDING,FEINDENORM,COLLECTING,ACCOUNTNUMBERDENORM,CHARGEHELD,PUBLICID
John Smith,2.0000000000,0.0000000000,5.0000000000,1234567.0000000000,69.0000000000,1.0000000000,,4321987.0000000000,1,000-000-000-00,10012.0000000000,10002.0000000000,3.0000000000,,1.0000000000,0,,0,000-000-000-00,0,bc:1234346
I've been trying and I think this may solve your problem:
Pass the str function to the parse_float argument in json.loads :)

    data = json.loads(line, parse_float=str)

This way, when json.loads() tries to parse a float it will use str instead, so the value will be parsed as a string and maintain the zeroes. Tried doing that and it worked:

    i = 1
    for line in f:
        try:
            data = json.loads(line, parse_float=str)  # reads current line into a dict
        except:
            print("Can't load line {}".format(i))
        if i == 1:
            header = data.keys()
            print(header)  # prints header row
        i += 1
        print(data.values())  # prints values row
More information here: Json Documentation
PS: You could use a boolean instead of i += 1 to get the same behaviour.
The decoder of the json module parses real numbers with float by default, so trailing zeroes are not preserved, as they are not in Python. You can use the parse_float parameter of the json.loads method to override the constructor of a real number for the JSON decoder with the str constructor instead:

    data = json.loads(line, parse_float=str)
Use format, but note that here you need to give a static decimal precision:

    >>> '{:.10f}'.format(10.0)
    '10.0000000000'
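
A further option (my own addition, not from the answers above): handing decimal.Decimal to parse_float preserves the digits exactly without hard-coding a precision:

    import json
    from decimal import Decimal

    line = '{"ID": 1234567.0000000000}'
    data = json.loads(line, parse_float=Decimal)
    print(data['ID'])  # 1234567.0000000000 -- trailing zeros kept

Since csv.writer calls str() on each value it writes, Decimal values keep their trailing zeros in the output file as well.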

python error when writing data in csv file

I wrote a Python program to write data to a .csv file, but I find that every item in the .csv has a "b'" before the content, and there are blank lines that I do not know how to remove. Also, some items in the .csv file are unrecognizable characters, such as "b'\xe7\xbe\x85\xe5\xb0\x91\xe5\x90\x9b'", because some data are in Chinese and Japanese, so I think something may go wrong when writing these data to the .csv file. Please help me to solve the problem.
The program is:

    # write data in .csv file
    def data_save_csv(type, data, id_name, header, since=None):
        # get the date when storing data
        date_storage()
        # create the data storage directory
        csv_parent_directory = os.path.join("dataset", "csv", type, glovar.date)
        directory_create(csv_parent_directory)
        # write data in .csv
        if type == "group_members":
            csv_file_prefix = "gm"
        if since:
            csv_file_name = csv_file_prefix + "_" + since.strftime("%Y%m%d-%H%M%S") + "_" + time_storage() + id_name + ".csv"
        else:
            csv_file_name = csv_file_prefix + "_" + time_storage() + "_" + id_name + ".csv"
        csv_file_directory = os.path.join(csv_parent_directory, csv_file_name)
        with open(csv_file_directory, 'w') as csvfile:
            writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            # csv header
            writer.writerow(header)
            row = []
            for i in range(len(data)):
                for k in data[i].keys():
                    row.append(str(data[i][k]).encode("utf-8"))
                writer.writerow(row)
                row = []
[screenshot of the resulting .csv file]
You have a couple of problems. The funky "b" thing happens because csv will cast data to a string before adding it to a column. When you did str(data[i][k]).encode("utf-8"), you got a bytes object, and its string representation is b"..." filled with UTF-8-encoded data. You should handle encoding when you open the file. In Python 3, open opens a file with the encoding from locale.getpreferredencoding(), but it's a good idea to be explicit about what you want to write.
Next, there's nothing that says that two dicts will enumerate their keys in the same order. The csv.DictWriter class is built to pull data from dictionaries, so use it instead. In my example I assumed that header has the names of the keys you want. It could be that header is different, and in that case, you'll also need to pass in the actual dict key names you want.
Finally, you can just strip out empty dicts while you are writing the rows.
    with open(csv_file_directory, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=header, delimiter=',',
                                quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writeheader()
        writer.writerows(d for d in data if d)
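
For illustration, a self-contained sketch of the same approach (the header and rows are invented; the first name is the decoded form of the bytes shown in the question):

    import csv

    header = ['name', 'age']
    data = [{'name': u'羅少君', 'age': 25}, {}, {'age': 30, 'name': 'Alice'}]

    with open('out.csv', 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=header,
                                quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writeheader()
        writer.writerows(d for d in data if d)  # empty dicts are skipped

The file then contains readable text with no b'...' prefixes, and newline='' prevents the blank lines between rows on Windows.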
It sounds like at least some of your issues have to do with incorrect Unicode handling.
Try working the snippet below into your existing code. As the comments say, the first part reads your input decoded as UTF-8; the second bit will return your output in the expected format of ASCII.

    import codecs
    import unicodedata

    f = codecs.open('path/to/textfile.txt', mode='r', encoding='utf-8')  # take input and turn into unicode
    for line in f.readlines():
        line = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore')  # ensure output is in ASCII

Python: Read csv file with an arbitrary number of tabs as delimiter

I have my CSV file formatted with all columns nicely aligned by using one or more tabs in between different values.
I know it is possible to use a single tab as delimiter with csv.register_dialect("tab_delimiter", delimiter="\t"). But this only works with exactly one tab between the values. I would like to process the file keeping its format, i.e., not deleting duplicate tabs. Each field (row, column) contains a value.
Is it possible to use a number of 1+ tabs as delimiter, or to ignore additional tabs without affecting the numbering of the values in a row? row[1] should be the second value independent of how many tabs come after row[0].
    ## Sample.txt
    ## ID    name    Age
    ## 1     11      111
    ## 2     22      222

    import pandas as pd
    df = pd.read_csv('Sample.txt', sep=r'\t+')
    print df
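
A side note on the snippet above (my own addition): pandas needs the Python parsing engine for regular-expression separators and warns when it falls back to it; passing engine='python' explicitly avoids that:

    import pandas as pd

    # engine='python' because the C engine does not support regex separators
    df = pd.read_csv('Sample.txt', sep=r'\t+', engine='python')
    print(df)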
Assuming that there will never be empty fields, you can use a generator to remove duplicates from the incoming CSV file and then use the csv module as usual:

    import csv

    def de_dup(f, delimiter='\t'):
        for line in f:
            yield delimiter.join(field for field in line.split(delimiter) if field)

    with open('data.csv') as f:
        for row in csv.reader(de_dup(f), delimiter='\t'):
            print(row)
An alternative way is to use re.sub() in the generator:

    import re

    def de_dup(f, delimiter='\t'):
        for line in f:
            yield re.sub(r'{}{{2,}}'.format(delimiter), delimiter, line)

but this still has the limitation that all fields must contain a value.
The most convenient way for me to deal with the multiple tabs was using an additional function that takes the row and removes the empty values/fields that are created by multiple tabs in a row. This doesn't affect the formatting of the CSV file, and I can access the second value in the row with row[1] - even with multiple tabs before it.

    def remove_empty(line):
        result = []
        for i in range(len(line)):
            if line[i] != "":
                result.append(line[i])
        return result

And in the code where I read the file and process the values:

    for row in reader:
        row = remove_empty(row)
        # ... continue processing normally
I think this solution is similar to mhawke's, but with his solution I couldn't access the same values with row[i] as before (i.e., with only one delimiter in between each value).
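As a brief aside (my own note), the same filtering can also be written inline as a list comprehension, which behaves exactly like remove_empty:

    for row in reader:
        row = [field for field in row if field != ""]
        # ... continue processing normally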
Or the completely general solution for any type of repeated separators is to repeatedly replace each multiple separator with a single separator and write to a new file (although this is slow for gigabyte-sized CSV files):

    def replaceMultipleSeparators(fileName, oldSeparator, newSeparator):
        linesOfCsvInputFile = open(fileName, encoding='utf-8', mode='r').readlines()
        csvNewFileName = fileName + ".new"
        print('Writing: %s replacing %s with %s' % (csvNewFileName, oldSeparator, newSeparator), end='')
        outputFileStream = open(csvNewFileName, 'w')
        for line in linesOfCsvInputFile:
            newLine = line.rstrip()
            processedLine = ""
            # collapse runs of the old separator down to a single one
            while newLine != processedLine:
                processedLine = newLine
                newLine = processedLine.replace(oldSeparator + oldSeparator, oldSeparator)
            newLine = newLine.replace(oldSeparator, newSeparator)
            outputFileStream.write(newLine + '\n')
        outputFileStream.close()

which, given input testFile.csv, will generate testFile.csv.new with TABs replaced by PIPEs if you run:

    replaceMultipleSeparators('testFile.csv', '\t', '|')
Sometimes you will need to replace the 'utf-8' encoding with 'latin-1' for some Microsoft US-generated CSV files. See errors related to reading the byte 0xe4 for this issue.

Ascii Code error while converting from xlsx to csv

I have referred to some posts related to Unicode errors but didn't find any solution for my problem. I am converting xlsx to csv from a workbook of 6 sheets.
I use the following code:
    def csv_from_excel(file_loc):
        # file access check
        print os.access(file_loc, os.R_OK)
        wb = xlrd.open_workbook(file_loc)
        print wb.nsheets
        sheet_names = wb.sheet_names()
        print sheet_names
        counter = 0
        while counter < wb.nsheets:
            try:
                sh = wb.sheet_by_name(sheet_names[counter])
                file_name = str(sheet_names[counter]) + '.csv'
                print file_name
                fh = open(file_name, 'wb')
                wr = csv.writer(fh, quoting=csv.QUOTE_ALL)
                for rownum in xrange(sh.nrows):
                    wr.writerow(sh.row_values(rownum))
            except Exception as e:
                print str(e)
            finally:
                fh.close()
                counter += 1
I get an error on the 4th sheet:

    'ascii' codec can't encode character u'\u2018' in position 0: ordinal not in range(128)

but position 0 is blank, and it had converted to CSV up to the 33rd row.
I am unable to figure it out. CSV was an easy way to read the content and put it in my data structure.
You'll need to manually encode Unicode values to bytes; for CSV usually UTF-8 is fine:

    for rownum in xrange(sh.nrows):
        wr.writerow([unicode(c).encode('utf8') for c in sh.row_values(rownum)])

Here I use unicode() for column data that is not text.
The character you encountered is U+2018 LEFT SINGLE QUOTATION MARK, which is just a fancy form of the ' single quote. Office software (spreadsheets, word processors, etc.) often auto-replaces single and double quotes with the 'fancy' versions. You could also just replace those with ASCII equivalents. You can do that with the Unidecode package:

    from unidecode import unidecode

    for rownum in xrange(sh.nrows):
        wr.writerow([unidecode(unicode(c)) for c in sh.row_values(rownum)])

Use this when non-ASCII codepoints are only used for quotes, dashes and other punctuation.
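
For instance (a small sketch of my own), unidecode maps the curly quotes straight back to plain ASCII apostrophes:

    # Python 2; requires the Unidecode package (pip install Unidecode)
    from unidecode import unidecode
    print unidecode(u'\u2018hello\u2019')  # prints 'hello' with plain quotes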
