I'm trying to export a Series object to a text file that needs a very specific format.
Each element of the Series is a row in which all columns of a DataFrame have been concatenated, each padded to a fixed length (with either spaces or 0s). Because of this, each line is a single string.
Format rules:
All lines should be exactly the same length, meaning that shorter values should be padded with either spaces (alphanumeric) or 0s (numeric)
Output should be a flat file (.txt)
Content of input can't be adjusted
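For illustration, the padding rule above could be sketched as follows. The column names, widths, and the `build_line` helper are hypothetical and not taken from the question; the sketch only shows `str.ljust` for alphanumeric values and `str.zfill` for numeric ones.

```python
# Hypothetical layout: (column name, width, kind); not taken from the question.
LAYOUT = [("code", 9, "alpha"), ("amount", 9, "numeric"), ("note", 10, "alpha")]

def build_line(record):
    """Concatenate the fields of one record into a fixed-width line."""
    parts = []
    for name, width, kind in LAYOUT:
        value = str(record[name])
        if len(value) > width:
            raise ValueError(f"{name!r} is longer than {width} characters")
        # Numeric fields are left-padded with zeros, others right-padded with spaces.
        parts.append(value.zfill(width) if kind == "numeric" else value.ljust(width))
    return "".join(parts)

line = build_line({"code": "LMNOP", "amount": 1234, "note": "ab,cd"})
print(repr(line))
```

Every line built this way has the same length (the sum of the widths), whether or not a value contains a comma.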
Used code:
IPGJ_flat = IPGJ['period']
IPGJ_flat.to_csv('20201119_DAF IPGJ Test.txt', index = False)
Sample Output (Fake)
ABCDEFGHIJK
"LMNOPQRST123456789 abcdf,gh,i abcd"
"UVWXYZABC123456789 abc,def,gh abcd"
UVWXYZABC123456789 abcdefghij abcd
Needed output:
ABCDEFGHIJK
LMNOPQRST123456789 abcdf,gh,i abcd
UVWXYZABC123456789 abc,def,gh abcd
UVWXYZABC123456789 abcdefghij abcd
Quotation marks are only added to rows that contain a ','.
I've already tried the following:
IPGJ_flat = IPGJ['period'].to_frame()
IPGJ_flat.to_csv('20201119_DAF IPGJ Test.txt', index = False, sep = '|', quoting=csv.QUOTE_NONE, escapechar = ' ')
I tried variations of the separator and escapechar, but this seems to mess up the formatting (new lines aren't identified correctly).
Any ideas on how to solve this?
You're trying to use a CSV function (.to_csv()) for something that's clearly not CSV. So why not just write it to file, without treating it as CSV?
import pandas
# Recreating your test data
s = pandas.Series([
"LMNOPQRST123456789 abcdf,gh,i abcd",
"UVWXYZABC123456789 abc,def,gh abcd",
"UVWXYZABC123456789 abcdefghij abcd",
], name="ABCDEFGHIJK")
# Open a file to write to
with open("test.txt", "w") as f:
    # Write the header to file
    f.write(s.name)
    f.write("\n")
    # Construct the content and write it to file
    content = "\n".join(s)
    f.write(content)
# For demo purposes, show the content of the file
with open("test.txt", "r") as f:
    print(f.read())
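To see where the quotes in the question's output come from: `to_csv` defaults to minimal quoting (`csv.QUOTE_MINIMAL`), under which a field is quoted only when it contains the delimiter, the quote character, or a line break. A small sketch using the standard `csv` module with the same quoting rule, writing to in-memory buffers rather than files:

```python
import csv
import io

rows = ["LMNOPQRST123456789 abcdf,gh,i abcd",
        "UVWXYZABC123456789 abcdefghij abcd"]

# Written as one-column CSV: only the row containing commas gets quoted.
as_csv = io.StringIO()
writer = csv.writer(as_csv, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
for row in rows:
    writer.writerow([row])

# Written as plain text: every line is emitted exactly as it is.
as_text = io.StringIO()
as_text.write("\n".join(rows) + "\n")

print(as_csv.getvalue())
print(as_text.getvalue())
```

Writing the lines as plain text sidesteps quoting and escaping altogether, which is why it suits a fixed-width flat file.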
Try 'to_string' instead of 'to_csv':
IPGJ_flat.to_string('20201119_DAF IPGJ Test.txt', index = False)
Read up on DataFrame.to_string() for more options.
As an addition to the answer from Gijs Wobben, I solved it with the following:
for item in IPGJ_flat:
    with open('Test.txt','a') as f:
        f.write(item)
        f.write("\n")
I am trying to convert a JSON file with individual JSON lines to CSV. The JSON data has some elements with trailing zeros that I need to maintain (e.g. 1.000000). When writing to CSV the value is changed to 1.0, removing all trailing zeros except the first zero following the decimal point. How can I keep all trailing zeros? The number of trailing zeros may not always be the same.
Updated the formatting of the sample data.
Here is a sample of the json input:
{"ACCOUNTNAMEDENORM":"John Smith","DELINQUENCYSTATUS":2.0000000000,"RETIRED":0.0000000000,"INVOICEDAYOFWEEK":5.0000000000,"ID":1234567.0000000000,"BEANVERSION":69.0000000000,"ACCOUNTTYPE":1.0000000000,"ORGANIZATIONTYPEDENORM":null,"HIDDENTACCOUNTCONTAINERID":4321987.0000000000,"NEWPOLICYPAYMENTDISTRIBUTABLE":"1","ACCOUNTNUMBER":"000-000-000-00","PAYMENTMETHOD":12345.0000000000,"INVOICEDELIVERYTYPE":98765.0000000000,"DISTRIBUTIONLIMITTYPE":3.0000000000,"CLOSEDATE":null,"FIRSTTWICEPERMTHINVOICEDOM":1.0000000000,"HELDFORINVOICESENDING":"0","FEINDENORM":null,"COLLECTING":"0","ACCOUNTNUMBERDENORM":"000-000-000-00","CHARGEHELD":"0","PUBLICID":"xx:1234346"}
Here is a sample of the output:
ACCOUNTNAMEDENORM,DELINQUENCYSTATUS,RETIRED,INVOICEDAYOFWEEK,ID,BEANVERSION,ACCOUNTTYPE,ORGANIZATIONTYPEDENORM,HIDDENTACCOUNTCONTAINERID,NEWPOLICYPAYMENTDISTRIBUTABLE,ACCOUNTNUMBER,PAYMENTMETHOD,INVOICEDELIVERYTYPE,DISTRIBUTIONLIMITTYPE,CLOSEDATE,FIRSTTWICEPERMTHINVOICEDOM,HELDFORINVOICESENDING,FEINDENORM,COLLECTING,ACCOUNTNUMBERDENORM,CHARGEHELD,PUBLICID
John Smith,2.0,0.0,5.0,1234567.0,69.0,1.0,,4321987.0,1,000-000-000-00,10012.0,10002.0,3.0,,1.0,0,,0,000-000-000-00,0,bc:1234346
Here is the code:
import json
import csv
f = open('test2.json')  # open input file
outputFile = open('output.csv', 'w', newline='')  # open output csv file
output = csv.writer(outputFile)  # create a csv.writer
i = 1
for line in f:
    try:
        data = json.loads(line)  # parse the current line into a dict
    except ValueError:
        print("Can't load line {}".format(i))
        continue  # skip lines that are not valid JSON
    if i == 1:
        header = data.keys()
        output.writerow(header)  # write the header row
        i += 1
    output.writerow(data.values())  # write the values row
f.close()  # close input file
outputFile.close()  # close output file so all rows are flushed to disk
The desired output would look like:
ACCOUNTNAMEDENORM,DELINQUENCYSTATUS,RETIRED,INVOICEDAYOFWEEK,ID,BEANVERSION,ACCOUNTTYPE,ORGANIZATIONTYPEDENORM,HIDDENTACCOUNTCONTAINERID,NEWPOLICYPAYMENTDISTRIBUTABLE,ACCOUNTNUMBER,PAYMENTMETHOD,INVOICEDELIVERYTYPE,DISTRIBUTIONLIMITTYPE,CLOSEDATE,FIRSTTWICEPERMTHINVOICEDOM,HELDFORINVOICESENDING,FEINDENORM,COLLECTING,ACCOUNTNUMBERDENORM,CHARGEHELD,PUBLICID
John Smith,2.0000000000,0.0000000000,5.0000000000,1234567.0000000000,69.0000000000,1.0000000000,,4321987.0000000000,1,000-000-000-00,10012.0000000000,10002.0000000000,3.0000000000,,1.0000000000,0,,0,000-000-000-00,0,bc:1234346
I've been trying and I think this may solve your problem:
Pass the str function to the parse_float argument in json.loads :)
data = json.loads(line, parse_float=str)
This way, when json.loads() encounters a float it calls str on the original text, so the value is kept as a string and the zeros are maintained. I tried that and it worked:
i = 1
for line in f:
    try:
        data = json.loads(line, parse_float=str)  # floats are kept as strings
    except ValueError:
        print("Can't load line {}".format(i))
        continue  # skip lines that are not valid JSON
    if i == 1:
        header = data.keys()
        print(header)  # prints the header row
        i += 1
    print(data.values())  # prints the values row
More information here: Json Documentation
PS: You could use a boolean instead of i += 1 to get the same behaviour.
By default, the decoder of the json module parses real numbers with float, and a Python float does not preserve trailing zeros. You can use the parse_float parameter of json.loads to have the JSON decoder construct real numbers with str instead:
data = json.loads(line, parse_float=str)
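A minimal end-to-end sketch of this approach, using a shortened version of the sample line (the reduced set of keys is an assumption made for brevity) and an in-memory buffer in place of the output file:

```python
import csv
import io
import json

line = '{"NAME": "John Smith", "STATUS": 2.0000000000, "CLOSEDATE": null, "ID": 1234567.0000000000}'

default = json.loads(line)                   # floats become Python floats
as_text = json.loads(line, parse_float=str)  # floats are kept as their original text

out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(as_text.keys())
writer.writerow(as_text.values())  # None (JSON null) is written as an empty field
print(out.getvalue())
```

Note that the values are then strings, so any arithmetic on them would need an explicit conversion first.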
Use format, but note that this requires a fixed decimal precision.
>>> '{:.10f}'.format(10.0)
'10.0000000000'
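If a fixed precision is not acceptable because the number of trailing zeros varies, one alternative that is not part of the answers above is to pass `decimal.Decimal` as `parse_float`. A `Decimal` keeps the digits as they were written while remaining a numeric type. A short sketch with made-up sample values:

```python
import json
from decimal import Decimal

line = '{"A": 1.000000, "B": 2.50, "C": 7}'  # made-up sample values
data = json.loads(line, parse_float=Decimal)

# For comparison: format() always emits the precision you ask for.
fixed = '{:.10f}'.format(1.0)

print(str(data["A"]), str(data["B"]), fixed)
```

Only real numbers go through `parse_float`; the integer `7` is still parsed as an `int`.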
I have a file saved as .csv
"400":0.1,"401":0.2,"402":0.3
Ultimately I want to save the data in a proper format in a csv file for further processing. The problem is that there are no line breaks in the file.
pathname = r"C:\pathtofile\file.csv"
with open(pathname, newline='') as file:
    reader = file.read().replace(',', '\n')
    print(reader)
with open(r"C:\pathtofile\filenew.csv", 'w') as new_file:
    csv_writer = csv.writer(new_file)
    csv_writer.writerow(reader)
The print reader output looks exactly how I want (or at least it's a format I can further process).
"400":0.1
"401":0.2
"402":0.3
And now I want to save that to a new csv file. However the output looks like
"""",4,0,0,"""",:,0,.,1,"
","""",4,0,1,"""",:,0,.,2,"
","""",4,0,2,"""",:,0,.,3
I'm sure it would be intelligent to convert the format to
400,0.1
401,0.2
402,0.3
at this stage instead of doing it later with another script.
The main problem is that my current code
with open(pathname, newline='') as file:
    reader = file.read().replace(',', '\n')
    reader = csv.reader(reader,delimiter=':')
    x = []
    y = []
    print(reader)
    for row in reader:
        x.append( float(row[0]) )
        y.append( float(row[1]) )
print(x)
print(y)
works fine for the type of csv files I currently have, but doesn't work for these mentioned above:
y.append( float(row[1]) )
IndexError: list index out of range
So I'm trying to find a way to work with them too. I think I'm missing something obvious as I imagine that it can't be too hard to properly define the linebreak character and delimiter of a file.
with open(pathname, newline=',') as file:
yields
ValueError: illegal newline value: ,
The right way with csv module, without replacing and casting to float:
import csv
with open('file.csv', 'r') as f, open('filenew.csv', 'w', newline='') as out:
    reader = csv.reader(f)
    writer = csv.writer(out, quotechar=None)
    for r in reader:
        for i in r:
            writer.writerow(i.split(':'))
The resulting filenew.csv contents (according to your "intelligent" condition):
400,0.1
401,0.2
402,0.3
Nuances:
csv.reader and csv.writer objects treat the comma , as the default delimiter (no need for file.read().replace(',', '\n'))
quotechar=None is specified for csv.writer object to eliminate double quotes around the values being saved
You need to split the values to form a list that represents a row. At present the code passes a string, which is split into individual characters to form the row.
import csv

pathname = r"C:\pathtofile\file.csv"
with open(pathname) as old_file:
    with open(r"C:\pathtofile\filenew.csv", 'w', newline='') as new_file:
        csv_writer = csv.writer(new_file, delimiter=',')
        text_rows = old_file.read().strip().split(",")
        for row in text_rows:
            items = row.split(":")
            # Strip the surrounding double quotes before converting the key
            csv_writer.writerow([int(items[0].strip('"')), items[1]])
If you look at the documentation for writerow, it says:
Write the row parameter to the writer's file object, formatted according to the current dialect.
But in your code you are passing an entire string:
csv_writer.writerow(reader)
because reader is a string at this point, so each character is written as a separate field.
Now, the format you want to use in your CSV file is not clearly mentioned in the question. But as you said, if you can do some preprocessing to create a list of lists and pass each sublist to writerow(), you should be able to produce the required file format.
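As a rough sketch of that preprocessing, assuming the `key,value` layout the question mentions as a possible target, the input can be turned into a list of lists and handed to `writerows()`. In-memory buffers stand in for the files here:

```python
import csv
import io

text = '"400":0.1,"401":0.2,"402":0.3'

# writerow() iterates over its argument, so a string becomes one field per character.
wrong = io.StringIO()
csv.writer(wrong, lineterminator="\n").writerow("400:0.1")

# Build a list of lists instead: one [key, value] pair per row.
rows = []
for pair in text.split(","):
    key, value = pair.split(":")
    rows.append([key.strip('"'), value])

right = io.StringIO()
csv.writer(right, lineterminator="\n").writerows(rows)

print(wrong.getvalue())
print(right.getvalue())
```

The first buffer reproduces the character-per-field symptom from the question; the second contains one `key,value` pair per line.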
I wrote a Python program to write data to a .csv file, but every item in the file has a "b'" before its content, and there are blank lines that I don't know how to remove. Some items are also unreadable, such as "b'\xe7\xbe\x85\xe5\xb0\x91\xe5\x90\x9b'". Some of the data is in Chinese and Japanese, so I think something goes wrong when this data is written to the .csv file. Please help me solve the problem.
The program is:
#write data in .csv file
def data_save_csv(type,data,id_name,header,since = None):
    #get the date when storage data
    date_storage()
    #create the data storage directory
    csv_parent_directory = os.path.join("dataset","csv",type,glovar.date)
    directory_create(csv_parent_directory)
    #write data in .csv
    if type == "group_members":
        csv_file_prefix = "gm"
    if since:
        csv_file_name = csv_file_prefix + "_" + since.strftime("%Y%m%d-%H%M%S") + "_" + time_storage() + id_name + ".csv"
    else:
        csv_file_name = csv_file_prefix + "_" + time_storage() + "_" + id_name + ".csv"
    csv_file_directory = os.path.join(csv_parent_directory,csv_file_name)
    with open(csv_file_directory,'w') as csvfile:
        writer = csv.writer(csvfile,delimiter=',',quotechar='"',quoting=csv.QUOTE_MINIMAL)
        #csv header
        writer.writerow(header)
        row = []
        for i in range(len(data)):
            for k in data[i].keys():
                row.append(str(data[i][k]).encode("utf-8"))
            writer.writerow(row)
            row = []
the .csv file
You have a couple of problems. The funky "b" appears because csv converts each value to a string before writing it to a column. When you did str(data[i][k]).encode("utf-8"), you got a bytes object, whose string representation is b"..." filled with the UTF-8 encoded data. You should handle encoding when you open the file instead. In Python 3, open uses the encoding from locale.getpreferredencoding(False) by default, but it's a good idea to be explicit about what you want to write.
Next, there's nothing that says that two dicts will enumerate their keys in the same order. The csv.DictWriter class is built to pull data from dictionaries, so use it instead. In my example I assumed that header has the names of the keys you want. It could be that header is different, and in that case, you'll also need to pass in the actual dict key names you want.
Finally, you can just strip out empty dicts while you are writing the rows.
with open(csv_file_directory,'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=header, delimiter=',',
                            quotechar='"',quoting=csv.QUOTE_MINIMAL)
    writer.writeheader()
    writer.writerows(d for d in data if d)
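A self-contained sketch of the same idea that can be run without the question's helper functions. The column names, sample records, and temporary file are assumptions for illustration; the encoded name is the byte string quoted in the question. Opening the file with `newline=''` also matters here: the csv module writes its own line endings, and without it an extra carriage return can show up as blank lines on platforms that use `\r\n`.

```python
import csv
import os
import tempfile

name = b'\xe7\xbe\x85\xe5\xb0\x91\xe5\x90\x9b'.decode("utf-8")  # the bytes from the question
header = ["id", "name"]  # hypothetical column names
data = [{"name": name, "id": 1}, {}, {"id": 2, "name": "Ann"}]

# What ends up in the file when bytes are handed to csv: their repr, b'' wrapper included.
as_written_before = str(name.encode("utf-8"))

path = os.path.join(tempfile.mkdtemp(), "members.csv")
with open(path, "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=header)
    writer.writeheader()
    writer.writerows(d for d in data if d)  # skip empty dicts

# Read the file back to check that the text survived the round trip.
with open(path, newline="", encoding="utf-8") as csvfile:
    rows = list(csv.reader(csvfile))
print(rows)
```

`DictWriter` orders each row by `fieldnames`, so the differing key order of the two sample dicts does not matter.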
It sounds like at least some of your issues have to do with incorrect Unicode handling.
Try working the snippet below into your existing code. As the comments say, the first part reads your input and decodes it from UTF-8 into Unicode text.
The second part normalizes each line and encodes it to ASCII; characters with no ASCII equivalent, which includes Chinese and Japanese text, are dropped.
import codecs
import unicodedata
f = codecs.open('path/to/textfile.txt', mode='r', encoding='utf-8')  # Read input and decode it into Unicode
for line in f.readlines():
    line = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore')  # Ensure output is ASCII
My CSV file is created using values from a table, as follows:
with open(fullpath,'wb') as csvFile:
    writer = csv.writer(csvFile,delimiter='|',
                        escapechar=' ',
                        quoting=csv.QUOTE_NONE)
    string_values = 'abc|xyz|mno'
    for obj in queryset.iterator():
        writer.writerow([
            smart_str(obj.code),#code
            smart_str(string_values),# string of values
        ])
The csv generated should output result as:
code1|abc|xyz|mno
but what is generated is:
code1|abc |xyz |mno
I am unable to remove the spaces from the string while creating the CSV, whatever I try. The string values are generated dynamically and can contain one or more values. I cannot use quoting=csv.QUOTE_NONE without specifying escapechar=' ' (i.e. a space). Please suggest a solution.
You can't just write strings containing the delimiter, unescaped!
Use, instead:
string_values = 'abc|xyz|mno'.split('|')
for obj in queryset.iterator():
    writer.writerow([smart_str(obj.code)] + string_values)
IOW, writerow a list of strings, and let the writer insert the delimiters between the strings in the list!
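The difference can be reproduced without Django in a few lines. This sketch targets Python 3, writes to in-memory buffers, and uses a backslash as the escape character (instead of the question's space) so that the inserted escapes are easy to see. The literal `"code1"` stands in for `obj.code`.

```python
import csv
import io

string_values = "abc|xyz|mno"

# One field that contains the delimiter: with QUOTE_NONE every '|' inside it must be escaped.
escaped = io.StringIO()
csv.writer(escaped, delimiter="|", escapechar="\\", quoting=csv.QUOTE_NONE,
           lineterminator="\n").writerow(["code1", string_values])

# Separate fields: the writer inserts the delimiters itself, so nothing needs escaping.
split = io.StringIO()
csv.writer(split, delimiter="|", quoting=csv.QUOTE_NONE,
           lineterminator="\n").writerow(["code1"] + string_values.split("|"))

print(escaped.getvalue())
print(split.getvalue())
```

With a space as the escape character, the escapes in the first buffer would appear as the stray spaces seen in the question.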