Is there a method I can use to output the inferred schema on a large CSV using pandas?
In addition, is there any way to have it report, alongside each inferred type, whether the column is nullable/blank based on the CSV contents?
File is about 500k rows with 250 columns.
With my new job, I'm constantly being handed CSV files with zero format documentation.
Is it necessary to load the whole CSV file? At the least, you can use the read_csv function if you know the separator (running cat on the file is a quick way to find out what it is). Then call .info():
df = pd.read_csv(path_to_file,...)
df.info()
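Since df.info() prints its report rather than returning it, a small summary table can be easier to save or share. The following is a minimal sketch, assuming the file parses with the default separator; the inline sample data and the column names of the summary (dtype, nullable, null_count) are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file; replace the StringIO
# object with the actual path when running against a CSV on disk.
sample = io.StringIO("id,name,price\n1,apple,0.5\n2,,1.25\n3,cherry,\n")
df = pd.read_csv(sample)

# One row per CSV column: the inferred dtype and whether any value is missing.
schema = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "nullable": df.isna().any(),
    "null_count": df.isna().sum(),
})
print(schema)
```

Passing nrows= to read_csv limits how many rows are read, but then both the dtypes and the nullable flags only reflect that sample, so a column reported as non-nullable may still have blanks further down the file.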
From Python, I want to export a dataframe to CSV format.
The dataframe contains two columns like this:
So when I write this:
df['NAME'] = df['NAME'].astype(str) # or .astype('string')
df.to_csv('output.csv',index=False,sep=';')
When the CSV output is opened in Excel, it shows this:
Excel reads the value "MAY8218" as a date, "may-18", while I want it to be read as "MAY8218".
I've tried many ways but none of them works. I don't want a workaround like putting quotation marks to the left and right of the value.
Thanks.
If you want to export the dataframe for use in Excel, just export it as xlsx. It works for me and keeps the value as a string in its original format.
df.to_excel('output.xlsx',index=False)
The CSV format is a text format. The file contains no hint about the type of each field. The problem is that Excel has the worst possible support for CSV files: it assumes that CSV files always use its own conventions when you try to read one. In short, one Excel implementation can only read correctly what it has written...
That means that you cannot prevent Excel from interpreting the CSV data the way it wants, at least when you open a CSV file. Fortunately you have other options:
Import the CSV file instead of opening it. This time you have options to configure the way the file should be processed.
Use LibreOffice Calc for processing CSV files. LibreOffice is a little behind Microsoft Office on most points except for CSV file handling, where it has excellent support.
I'm importing an .xlsx file with pd.read_excel(). I received this file as a CSV file and used Excel to separate it by comma, so I get a proper .xlsx file with columns etc. Six of the dataframe columns have a number as header (e.g. 5030, 5031,...). When I want to change the column name with df = df.rename(columns={...}), this does not work. Also df["5030"] does not work; it throws an error: KeyError: '5030'. This code works for columns which have regular/non-integer names.
However, when I import the raw .csv file with pd.read_csv(), all the code above does work. I can just rename column names. The df's do look exactly the same when imported with both techniques, but apparently something is different.
It is not a serious issue, as I can change the column names to non-integers manually in Excel, but I'm very curious about what the underlying "problem" is here and how these two functions operate differently.
Thanks!
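One likely explanation, offered as an assumption rather than a diagnosis of this particular file: read_excel keeps numeric header cells as integers, so the column label is the integer 5030, whereas read_csv reads every header as text and produces the string "5030". The sketch below builds a small frame with integer labels by hand (so it does not need an Excel file) to show the difference; the data and the new name code_a are hypothetical.

```python
import pandas as pd

# Stand-in for what read_excel may return when header cells hold numbers:
# two integer column labels and one string label.
df = pd.DataFrame([[1, 2, "x"]], columns=[5030, 5031, "name"])

has_string_label = "5030" in df.columns   # the string label is absent
has_integer_label = 5030 in df.columns    # the integer label is present

# rename() silently ignores keys that do not match any label,
# so the key must be the integer 5030, not the string "5030".
unchanged = df.rename(columns={"5030": "code_a"})
renamed = df.rename(columns={5030: "code_a"})

# Alternatively, convert every label to a string once, after which
# df["5030"] works the same way as it does after read_csv.
df.columns = df.columns.astype(str)
value = df["5030"].iloc[0]
```

Printing df.columns right after loading shows which kind of label is present, since integer labels appear without quotes.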
I am importing CSV based data in an Excel spreadsheet via Python. I would like to know if it is possible to import the data and divide it in several columns (like we would do via the importing menu under DATA in Excel).
So far, I have read my CSV into a pandas DataFrame and written it to Excel, but all my data ends up in a single column:
import pandas as pd

df = pd.read_csv(r'C:\Users\Contractuel\Desktop\Test\Candiac_TypeLum_UTF8.csv')
writer = pd.ExcelWriter('TypeLum_TEST.xlsx')
df.to_excel(writer, index=False)
writer.close()  # writer.save() was removed in pandas 2.0
Thanks!
The read_csv function takes an argument sep= which tells pandas what separates the fields. You probably need to use it to specify the separator used in your CSV file. The default is a comma (,), but CSV files sometimes use ; or other characters as separators.
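To illustrate, here is a small sketch, assuming the file is semicolon-separated (the actual separator in the question is not shown, and the sample data is made up). It compares the default parse with an explicit sep=";", and also shows sep=None with engine="python", which asks pandas to detect the separator itself.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated content standing in for the real file.
raw = "city;type;count\nCandiac;LED;12\nCandiac;HPS;7\n"

# Default sep="," finds no commas, so each line becomes a single column.
wrong = pd.read_csv(io.StringIO(raw))

# Naming the separator splits the fields into three columns.
explicit = pd.read_csv(io.StringIO(raw), sep=";")

# sep=None with the python engine lets pandas guess the separator.
sniffed = pd.read_csv(io.StringIO(raw), sep=None, engine="python")

print(wrong.shape, explicit.shape, sniffed.shape)
```

Detection is convenient for exploring unknown files, but naming the separator explicitly is the safer choice once it is known.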
I need to output a CSV file in Python, and as the file is too large, I use the 'zipfile' package to zip it. However, when the CSV file is written and then unzipped, the columns are merged.
The code is like:
for i in dealers:
    data_1 = data_dealer[data_dealer['DEALER_ID'] == i]
    data = data_1.to_string(index=False, header=True).encode("utf_8_sig")
    azip = zipfile.ZipFile('data%s.zip' % i, mode='w')
    azip.writestr('data%s.csv' % i, data=data, compress_type=zipfile.ZIP_DEFLATED)
    azip.close()
the csv was originally like (separated by comma):
a,1600,2018,NaN,......
now there's only one column, or separated by space:
a 1600 2018 NaN ......
Does anyone know how to zip a CSV in Python without merging the columns?
Thanks a lot!!
By using the pandas to_string() function, you were creating a fixed-width text table suitable for display on a console. What you wanted, though, was CSV output with , delimiters between the columns, so you need to use the to_csv() function instead. If no filename is given to the function, it returns the whole table as a string, which can then be passed to writestr():
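import zipfile

for i in dealers:
    # Select the rows belonging to this dealer
    data_1 = data_dealer[data_dealer['DEALER_ID'] == i]
    # to_csv() without a filename returns the CSV text as a string
    data = data_1.to_csv(index=False, header=True).encode("utf_8_sig")
    # The context manager closes the archive even if writing fails
    with zipfile.ZipFile('data{}.zip'.format(i), mode='w') as azip:
        azip.writestr('data{}.csv'.format(i), data=data, compress_type=zipfile.ZIP_DEFLATED)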
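To check that the columns survive the round trip, the same idea can be run entirely in memory. This is a sketch under assumptions: the data_dealer frame below is invented sample data, and the archives are kept in a dictionary of byte strings instead of being written to disk.

```python
import io
import zipfile

import pandas as pd

# Hypothetical sample data standing in for the real data_dealer frame.
data_dealer = pd.DataFrame({
    "DEALER_ID": ["a", "a", "b"],
    "PRICE": [1600, 1700, 900],
    "YEAR": [2018, 2019, 2018],
})

archives = {}
for dealer_id, group in data_dealer.groupby("DEALER_ID"):
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as azip:
        # to_csv() returns comma-separated text when no path is given
        csv_bytes = group.to_csv(index=False).encode("utf_8_sig")
        azip.writestr(f"data{dealer_id}.csv", csv_bytes)
    archives[dealer_id] = buffer.getvalue()

# Read one archive back to confirm the columns are still separate.
with zipfile.ZipFile(io.BytesIO(archives["a"])) as azip:
    with azip.open("dataa.csv") as handle:
        restored = pd.read_csv(handle, encoding="utf_8_sig")

print(restored)
```

Writing to a real file only requires passing a filename to zipfile.ZipFile in place of the BytesIO buffer.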
I have 4 separate CSV files that I wish to read into Pandas. I want to merge these CSV files into one dataframe.
The problem is that the columns within the CSV files contain the following: , ; | and spaces. Therefore I have to use different delimiters when reading the different CSV files and do some transformations to get them in the correct format.
Each CSV file contains an 'ID' column. When I merge my dataframes, it is not done correctly and I get 'NaN' in the column which has been merged.
Do you have to use the same delimiter in order for the dataframes to merge properly?
In short: no, your files do not need to share a delimiter for pandas DataFrames to merge. Once the data has been imported (which requires setting the right delimiter for each file), it lives in memory and keeps no record of the original delimiter. You can see this by writing your imported dataframes back out with the .to_csv method: the delimiter will be , by default, whatever the source file used.
Now, in order to understand what is going wrong with your merge, please post more details about your data and the code you are using to perform the operation.
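In the meantime, here is a sketch of one common cause of NaN after a merge, namely key values that look equal but are not (stray whitespace in this made-up case). It is an assumption about what might be happening, not a diagnosis of your data; the two inline CSV samples and their column names are hypothetical.

```python
import io

import pandas as pd

# Two hypothetical files with different delimiters; the second has a
# trailing space after one ID value.
csv_a = "ID;name\nA1;alice\nA2;bob\n"
csv_b = "ID|score\nA1 |10\nA2|20\n"

# Each file gets its own separator; ID is read as text in both.
left = pd.read_csv(io.StringIO(csv_a), sep=";", dtype={"ID": str})
right = pd.read_csv(io.StringIO(csv_b), sep="|", dtype={"ID": str})

# "A1 " does not equal "A1", so that row finds no partner and score is NaN.
# indicator=True adds a _merge column showing where each row came from.
naive = left.merge(right, on="ID", how="left", indicator=True)

# Stripping whitespace from the keys makes the values comparable.
for frame in (left, right):
    frame["ID"] = frame["ID"].str.strip()
cleaned = left.merge(right, on="ID", how="left", indicator=True)

print(naive)
print(cleaned)
```

Comparing the ID columns' dtypes and a few raw values with repr() before merging is a quick way to spot this kind of mismatch.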