I need to output a CSV file in Python, and because the file is too large, I use the 'zipfile' package to zip it. However, when the CSV file is written and then unzipped, the columns are merged.
The code looks like this:
for i in dealers:
    data_1=data_dealer[data_dealer['DEALER_ID']==i]
    data=data_1.to_string(index=False, header=True).encode("utf_8_sig")
    azip=zipfile.ZipFile('data%s.zip'%i,mode='w')
    azip.writestr('data%s.csv'%i,data=data,compress_type=zipfile.ZIP_DEFLATED)
    azip.close()
The CSV originally looked like this (comma-separated):
a,1600,2018,NaN,......
Now there is only one column, with the values separated by spaces:
a 1600 2018 NaN ......
Does anyone know how to zip a CSV in Python without merging the columns?
Thanks a lot!!
The pandas to_string() function creates a fixed-width text table suitable for display on a console. What you want is CSV output with commas as delimiters between the columns, so you need to use the to_csv() function instead. If no filename is given to to_csv(), it returns the whole table as a string, which can then be passed to writestr():
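import zipfile

for i in dealers:
    # Select the rows belonging to this dealer
    data_1 = data_dealer[data_dealer['DEALER_ID'] == i]
    # With no filename, to_csv() returns the CSV text as a string
    data = data_1.to_csv(index=False, header=True).encode("utf_8_sig")
    # The context manager closes the archive even if writing fails
    with zipfile.ZipFile('data{}.zip'.format(i), mode='w') as azip:
        azip.writestr('data{}.csv'.format(i), data=data, compress_type=zipfile.ZIP_DEFLATED)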
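To check that the columns survive the round trip, the archive can be read back and parsed again. The following is a minimal sketch using only the standard library. The helper names write_csv_to_zip and read_csv_from_zip are made up for this illustration, and the rows are plain lists rather than a DataFrame.

```python
import csv
import io
import zipfile


def write_csv_to_zip(zip_path, member_name, rows):
    """Write rows (lists of values) as a comma-separated member of a new ZIP archive."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    # utf_8_sig prepends a BOM, the same encoding used in the code above
    payload = buffer.getvalue().encode("utf_8_sig")
    with zipfile.ZipFile(zip_path, mode="w") as archive:
        archive.writestr(member_name, payload, compress_type=zipfile.ZIP_DEFLATED)


def read_csv_from_zip(zip_path, member_name):
    """Read a CSV member back from the archive and return its rows as lists of strings."""
    with zipfile.ZipFile(zip_path) as archive:
        text = archive.read(member_name).decode("utf_8_sig")
    return list(csv.reader(io.StringIO(text)))
```

If the rows that come back have one element each, the member was written as space-padded text rather than as CSV.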
Related
From Python, I want to export a dataframe to CSV format.
The dataframe contains two columns like this:
So when I write this:
df['NAME'] = df['NAME'].astype(str) # or .astype('string')
df.to_csv('output.csv',index=False,sep=';')
Opening the CSV output in Excel gives this:
Excel reads the value "MAY8218" as a date, "may-18", while I want it to be read as "MAY8218".
I've tried many approaches, but none of them works. I don't want a workaround such as putting quotation marks to the left and right of the value.
Thanks.
If you want to export the dataframe for use in Excel, just export it as xlsx. It works for me and keeps the value as a string in its original format.
df.to_excel('output.xlsx',index=False)
CSV is a text format. The file contains no hint about the type of each field. The problem is that Excel has the worst possible support for CSV files: when you open one, it assumes the file follows its own conventions. In short, an Excel installation can only read correctly what it has written itself...
That means you cannot prevent Excel from interpreting the CSV data the way it wants, at least when you open a CSV file. Fortunately, you have other options:
Import the CSV file instead of opening it. This way you get options to configure how the file should be processed.
Use LibreOffice Calc for processing CSV files. LibreOffice is a little behind Microsoft Office on most points, except for CSV handling, where its support is excellent.
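The point that the file carries no type information can be seen with a small round trip. This is a sketch using the standard csv module. The QTY column is invented for the example, and the ; delimiter matches the question.

```python
import csv
import io

rows = [["NAME", "QTY"], ["MAY8218", "3"]]

# Write the rows as semicolon-separated text, as to_csv(sep=';') would
buffer = io.StringIO()
csv.writer(buffer, delimiter=";").writerows(rows)
raw_text = buffer.getvalue()

# Reading the same text back yields plain strings; nothing in the file
# marks "MAY8218" as text or as a date
parsed = list(csv.reader(io.StringIO(raw_text), delimiter=";"))
```

The value is stored as the characters MAY8218 and nothing more, so any conversion to a date happens in the program that opens the file.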
I have two different CSV files which I have imported using pd.read_csv.
The files have different header names. I would like to export the column under the header ["Model"] in the first CSV file to the second CSV file, under the header ["Product"].
I have tried the following code, but it raised a ValueError:
writer=df1[df1['Model']==df2['Product']]
Would appreciate any help.
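Try joining the DataFrames on the index using pandas.DataFrame.join, then export the result as a CSV using pandas.DataFrame.to_csv. Note that join returns a new DataFrame and leaves df1 unchanged, so the result has to be assigned before it is written.
joined = df1.join(df2)
joined.to_csv('./df2.csv')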
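If the goal is only to place the values of "Model" under the "Product" header, one possible reading of the question is a row-by-row copy aligned on the index. The sketch below assumes exactly that. The sample frames and the extra columns Price and Stock are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the two imported CSV files
df1 = pd.DataFrame({"Model": ["A1", "B2", "C3"], "Price": [10, 20, 30]})
df2 = pd.DataFrame({"Product": ["old1", "old2", "old3"], "Stock": [5, 6, 7]})

# assign() returns a new frame; values are matched to rows by index label
df2_updated = df2.assign(Product=df1["Model"])

# join() also aligns on the index and returns a new frame
joined = df1.join(df2_updated[["Stock"]])

# With no path given, to_csv() returns the CSV text
csv_text = df2_updated.to_csv(index=False)
```

If the two files do not share the same row order, an index-based copy will pair the wrong rows, and a key-based merge would be needed instead.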
I regularly get sent a CSV containing 100+ columns and millions of rows. These CSV files always contain a certain set of columns, Core_cols = [col_1, col_2, col_3], and a variable number of other columns, Var_col = [a, b, c, d, e]. The core columns are always there, and there can be 0-200 of the variable columns. Sometimes one of the variable columns contains a carriage return. I know which columns this can happen in: bad_cols = [a, b, c].
When I import the CSV with pd.read_csv, these carriage returns produce corrupt rows in the resulting dataframe. I can't re-make the CSV without these columns.
How do I either:
Ignore these columns and the carriage return contained within? or
Replace the carriage returns with blanks in the csv?
My current code looks something like this:
df = pd.read_csv('data.csv', dtype=str)
I've tried things like removing the columns after the import, but the damage seems to already have been done by this point. I can't find the code now, but when testing one fix the error said something like "invalid character u000D in data". I don't control the source of the data so can't make the edits to that.
Pandas supports multiline CSV files if the file is properly escaped and quoted. If you can neither read a CSV file in Python using the pandas or csv modules nor open it in MS Excel, then it's probably a non-compliant "CSV" file.
I recommend manually editing a sample of the CSV file until it opens correctly in Excel. Then recreate those steps programmatically in Python to normalize the large file.
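For the case where the fields containing carriage returns are properly quoted, a sketch of such a normalization step with the standard csv module could look like this. The function name strip_carriage_returns is made up for this example, and it does not repair files where the embedded line breaks are unquoted.

```python
import csv
import io


def strip_carriage_returns(src, dst, replacement=" "):
    """Copy CSV rows from src to dst, replacing line breaks inside fields."""
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        cleaned = [
            field.replace("\r\n", replacement)
                 .replace("\r", replacement)
                 .replace("\n", replacement)
            for field in row
        ]
        writer.writerow(cleaned)


# Small in-memory demonstration; the second field holds a quoted carriage return
source = io.StringIO('col_1,a\r\n1,"line one\rline two"\r\n')
target = io.StringIO()
strip_carriage_returns(source, target)
```

With real files, the csv documentation recommends opening them with newline="" so that embedded line breaks reach the parser untouched.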
Use this code to create a sample CSV file by copying the first ~100 lines into a new file.
with open('bigfile.csv', "r") as csvin, open('test.csv', "w") as csvout:
    line = csvin.readline()
    count = 0
    while line and count < 100:
        csvout.write(line)
        count += 1
        line = csvin.readline()
Now you have a small test file to work with. If the original CSV file has millions of rows and "bad" rows are found much later in the file then you need to add some logic to find the "bad" lines.
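One simple form of that logic is to compare each row's field count with the header's. The sketch below assumes that a "bad" row shows up as a row with the wrong number of fields, which is typical when an unquoted line break splits a record. The name find_bad_rows and its parameters are invented for this example.

```python
import csv
import io


def find_bad_rows(lines, expected_fields=None, limit=10):
    """Return (line_number, field_count) for rows whose width differs from the header's."""
    reader = csv.reader(lines)
    bad = []
    for row in reader:
        if expected_fields is None:
            # Use the first row (the header) as the reference width
            expected_fields = len(row)
            continue
        if len(row) != expected_fields:
            # reader.line_num is the number of source lines read so far
            bad.append((reader.line_num, len(row)))
            if len(bad) >= limit:
                break
    return bad


sample = io.StringIO("a,b,c\n1,2,3\n4,5\n6,7,8\n9\n")
bad_rows = find_bad_rows(sample)
```

Because the reader streams the input, the same function can be pointed at the full file without loading it into memory.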
Is there a method I can use to output the inferred schema on a large CSV using pandas?
In addition, is there any way to have it tell me, along with the type, whether each column is nullable/blank based on the CSV?
File is about 500k rows with 250 columns.
With my new job, I'm constantly being handed CSV files with zero format documentation.
Is it necessary to load the whole CSV file? At the very least, you can use the read_csv function if you know the separator (running cat on the file will show you which one is used). Then call .info():
df = pd.read_csv(path_to_file,...)
df.info()
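To get the type and the nullability side by side, the inferred dtypes can be combined with a missing-value check. This is a sketch on a small in-memory sample. The column names are invented, and the result only describes the rows that were actually read.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the undocumented CSV file
sample_csv = io.StringIO("id,name,score\n1,alice,3.5\n2,,4.0\n3,carol,\n")
df = pd.read_csv(sample_csv)

# One row per column: inferred dtype, whether any value is missing, and how many
schema = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nullable": df.isna().any(),
    "null_count": df.isna().sum(),
})
```

For a quick first look at a large file, read_csv also accepts an nrows argument to read only the first rows, at the cost of missing nulls or type changes that appear later in the file.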
I have a list (fulllist) of 292 items, which I converted to a data frame and then tried to write to a CSV in Python.
import pandas as pd
my_df = pd.DataFrame(fulllist)
my_df.to_csv('Desktop/pgm/111.csv', index=False,sep=',')
But some of the comma-separated values are spread across the columns of the CSV. I am trying to keep those values in a single column.
A portion of the output is shown below.
I have also tried writerows, but it doesn't work.
import csv
with open('Desktop/pgm/111.csv', "wb") as f:
    writer = csv.writer(fulllist)
    writer.writerows(fulllist)
I also tried "".join each time the length of a list is greater than 1, but that did not give the result either. How do I make a proper CSV so that each field fills its own column?
My expected output csv is
Please keep in mind that .csv files are in fact plain text files, and how a given piece of software understands .csv depends on its implementation. For example, some programs allow a newline character as part of a field when it sits between " and ", while others treat every newline character as the start of the next row.
Do you have to use the .csv format? If not, consider other possibilities:
DSV https://en.wikipedia.org/wiki/Delimiter-separated_values is similar to CSV, but you can use, for example, ; instead of , as the delimiter, which should help if your data does not contain ;
openpyxl allows writing and reading of .xlsx files.
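The DSV option can be sketched with the standard csv module by passing a different delimiter. The helper name write_dsv and the sample records are invented for this example.

```python
import csv
import io


def write_dsv(rows, out, delimiter=";"):
    """Write rows to out using the given delimiter instead of a comma."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)


# The first value contains a comma, which is not special when ; is the delimiter
records = [["title, with comma", "42"], ["plain", "7"]]

output = io.StringIO()
write_dsv(records, output)
dsv_text = output.getvalue()

# Reading back with the same delimiter restores the original fields
round_trip = list(csv.reader(io.StringIO(dsv_text), delimiter=";"))
```

The program that reads the file must be told about the delimiter as well, otherwise each line is treated as a single comma-separated record again.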