saving as csv corrupts dataframe - python

I have a pandas dataframe of shape (455698, 62). I want to save it as a csv file, and load it again later with pandas. For now I do this :
df.to_csv("/path/to/file.csv",index=False,sep="\\", encoding='utf-8') #saving
df=pd.read_csv("/path/to/file.csv",delimiter="\\",encoding ='utf-8') #loading
and I get a dataframe with shape (455700, 62): 2 more lines? When I check in detail (looking at all unique values in each column), I find that some values changed columns in the process.
I've tried multiple separators and forcing dtype="object", and I can't figure out where the bug is. What should I try?

Is it possible that some of your strings contain a new-line (\n) character?
In this case I would suggest using quoting when saving your CSV file:
import csv
df.to_csv("/path/to/file.csv",index=False,sep="\\", encoding='utf-8', quoting=csv.QUOTE_NONNUMERIC)
...
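If you want to confirm that this is the cause, a quick check on the string columns plus a round-trip sanity test might look like this (a sketch, assuming df is the dataframe from the question and the path is a placeholder):
import csv
import pandas as pd

# Diagnostic: look for embedded newlines in the string (object) columns
for col in df.select_dtypes(include="object").columns:
    if df[col].astype(str).str.contains("\n").any():
        print(f"column {col!r} contains newline characters")

# Round-trip check after saving with quoting enabled
df.to_csv("/path/to/file.csv", index=False, sep="\\", encoding="utf-8",
          quoting=csv.QUOTE_NONNUMERIC)
reloaded = pd.read_csv("/path/to/file.csv", delimiter="\\", encoding="utf-8")
assert reloaded.shape == df.shape  # if newlines were the cause, the shapes now match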

Related

Preserve input file format with pandas

I need to process hundreds of fairly large CSV files. Each file contains 4 header lines followed by 864000 lines of data and weighs more than 200 MB. Column types are most of the time recognized as object because missing values are indicated as "NAN" (with quotes). I want to perform a couple of operations on these data and export them to a new file in a format similar to the input file. To do so, I wrote the following code:
import pandas as pd

df = pd.read_csv(in_file, skiprows=[0,2,3])
# Get file header
with open(in_file, 'r') as fi:
    header = [next(fi) for x in range(4)]
# Write header to destination file
with open(out_file, 'w') as fo:
    for i_line in header:
        fo.write(i_line)
# Do some data transformation here
df = foobar(df)
# Append data to destination file
df.to_csv(out_file, header=False, index=False, mode='a')
I struggle to preserve exactly the input format. For instance, I have dates in the input files formatted as "2019-08-28 00:00:00.2" while they are written in the output files as 2019-08-28 00:00:00.2, i.e. without the quotation marks.
Same for "NAN" values, which are rewritten without their quotes. Pandas wants to clean everything out.
I tried other variants that worked, but because of the file size, running time was unreasonable.
Include the quoting parameter in to_csv, i.e. quoting=csv.QUOTE_NONNUMERIC or quoting=2,
so your to_csv statement will be as follows:
df.to_csv(out_file, header=False, index=False, mode='a', quoting=2)
Note: you need to import csv if you want to use csv.QUOTE_NONNUMERIC
More details about the parameters can be found in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
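Putting the code from the question and the quoting suggestion together, a possible end-to-end sketch looks like this (in_file, out_file and foobar are the names used above; keep_default_na=False is an extra, optional tweak so that "NAN" stays a literal string instead of being converted to NaN):
import csv
import pandas as pd

# keep_default_na=False leaves "NAN" as a literal string, so the column
# stays object dtype and the marker is re-quoted on output
df = pd.read_csv(in_file, skiprows=[0, 2, 3], keep_default_na=False)

# Copy the four header lines verbatim
with open(in_file, 'r') as fi:
    header = [next(fi) for _ in range(4)]
with open(out_file, 'w') as fo:
    fo.writelines(header)

df = foobar(df)  # data transformation from the question

# QUOTE_NONNUMERIC wraps every non-numeric field in quotes, matching the
# quoted dates and "NAN" markers of the input files
df.to_csv(out_file, header=False, index=False, mode='a',
          quoting=csv.QUOTE_NONNUMERIC)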

How to get the contents of rows from csv file?

I tried to get some basic statistics of my columns in a csv file, but apparently I can't even get the contents of the columns in my output.
I tried data['columnname']
import pandas as p
data = p.read_csv('Amazon.csv',delimiter='~}',na_values='nan')
data.columns
data['Title']
I expect to get the contents of 'Title' in my output
Without knowing the exact format of the .csv it's a bit hard, but hopefully something like this helps:
data = pd.read_csv("Amazon.csv", ... , header=0)
Setting header=0 will read the first line of the file and make it the column names.
If names aren't defined by the .csv you can either add them to the first line and use header=0 or use names=<array-like>.
data = pd.read_csv("Amazon.csv", ... , names=['Title',...,'LastCol'])
See: Pandas Docs for read_csv
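Once the header is read correctly, selecting a column and getting basic statistics might look like this (a sketch, assuming the first line of Amazon.csv holds the column names and '~}' really is the separator; a multi-character separator needs the Python parser engine):
import pandas as pd

# engine="python" is required for a multi-character separator like '~}'
data = pd.read_csv("Amazon.csv", sep="~}", engine="python", header=0)

print(data.columns)                   # the parsed column names
print(data["Title"].head())           # first few values of the Title column
print(data.describe(include="all"))   # basic statistics for every column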

how to write comma separated list items to csv in a single column in python

I have a list (fulllist) of 292 items and converted it to a data frame. Then I tried writing it to csv in python.
import pandas as pd
my_df = pd.DataFrame(fulllist)
my_df.to_csv('Desktop/pgm/111.csv', index=False,sep=',')
But some comma-separated values spill across several columns of the csv. I am trying to keep those values in a single column.
A portion of the output is shown below.
I have tried with writerows but it won't work.
import csv
with open('Desktop/pgm/111.csv', "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(fulllist)
I also tried with "".join each time the length of the list is higher than 1, but that did not give the result either. How do I make a proper csv so that the values end up in a single column?
My expected output csv is
Please keep in mind that .csv files are in fact plain text files, and how a given piece of software interprets .csv depends on its implementation; for example, some implementations allow a newline character as part of a field when it sits between " and ", while others treat every newline character as the start of the next row.
Do you have to use the .csv format? If not, consider other possibilities:
DSV (https://en.wikipedia.org/wiki/Delimiter-separated_values) is similar to csv, but you can use, for example, ; instead of ,, which should help if you do not have ; in your data.
openpyxl allows writing and reading of .xlsx files.
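A concrete sketch of how this might look with pandas (assuming fulllist is a flat list of strings, some of which contain commas; the quoting option is an additional assumption, while the ';' separator follows the DSV suggestion above):
import csv
import pandas as pd

my_df = pd.DataFrame({"items": fulllist})  # a single named column

# Option 1: quote every field so a comma inside a value cannot split it
my_df.to_csv("Desktop/pgm/111.csv", index=False, quoting=csv.QUOTE_ALL)

# Option 2: use a different separator such as ';' (DSV-style)
my_df.to_csv("Desktop/pgm/111_semicolon.csv", index=False, sep=";")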

Saving DataFrame to csv but output cells type becomes number instead of text

import pandas as pd
check = pd.read_csv('1.csv')
nocheck = check['CUSIP'].str[:-1]
nocheck = nocheck.to_frame()
nocheck['CUSIP'] = nocheck['CUSIP'].astype(str)
nocheck.to_csv('NoCheck.csv')
This works, but when the csv file is opened in Excel, a value for an identifier like 0003418 (type = str) is converted to 3418 (type = general). How do I avoid this?
I couldn't find a dupe for this question, so I'll post my comment as a solution.
This is an Excel issue, not a python error. Excel autoformats numeric columns to remove leading 0's. You can "fix" this by forcing pandas to quote when writing:
import csv
# insert pandas code from question here
# use csv.QUOTE_ALL when writing CSV.
nocheck.to_csv('NoCheck.csv', quoting=csv.QUOTE_ALL)
Note that this will actually put quotes around each value in your CSV. It will render the way you want in Excel, but you may run into issues if you try to read the file some other way.
Another solution is to write the CSV without quoting, and change the cell format in Excel to "General" instead of "Numeric".
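If the goal is also to keep the leading zeros inside pandas itself, reading the identifier column as text avoids the numeric conversion before writing; a sketch along those lines (dtype={'CUSIP': str} is an addition to the code from the question):
import csv
import pandas as pd

# Read CUSIP as text so pandas never strips the leading zeros
check = pd.read_csv('1.csv', dtype={'CUSIP': str})
nocheck = check['CUSIP'].str[:-1].to_frame()

# Quote every field on output, as suggested above
nocheck.to_csv('NoCheck.csv', index=False, quoting=csv.QUOTE_ALL)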

Unable to get correct output from tsv file using pandas

I have a tsv file which I am trying to read with the help of pandas. The first two rows of the file are of no use and need to be ignored. However, when I get the output, I get it in the form of two columns. The name of the first column is Index and the name of the second column is a random row from the csv file.
import pandas as pd
data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2)
Please refer to the screenshot below.
The second column name is in bold black, which is one of the rows from the file. Moreover, using '\t' as the delimiter does not separate the values into different columns. I am using the Spyder IDE for this. Am I doing something wrong here?
Try this:
data = pd.read_table('zahlen.csv', header=None, skiprows=2)
read_table() is better suited for tsv files; read_csv() is a more specialized version of it that defaults to a comma separator. With header=None, the first row becomes data instead of being used as the header.
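If you prefer read_csv, the equivalent call would be a sketch like this (same assumptions as above: a tab-separated file with two useless rows at the top):
import pandas as pd

# Tab separator, skip the two useless rows, and treat the first kept row as data
data = pd.read_csv('zahlen.csv', sep='\t', header=None, skiprows=2)
print(data.head())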
