Pandas Converting CSV to Parquet - String having , not able to convert - python

I am using Pandas to convert CSV to Parquet; the code below is straightforward.
import pandas as pd
df = pd.read_csv('path/xxxx.csv')
print(df)
df.to_parquet('path/xxxx.parquet')
Problem
In a string such as David,Johnson, the embedded comma causes an error saying there is a problem in the data.
If I remove the comma, the CSV file converts to Parquet fine.
Any suggestions? I need help.
Thanks
Madhu

Do you need to keep the comma in the name of the file? Otherwise you can do input = 'David,Johnson'; output = input.replace(',', '_'). I don't think it is generally good practice to have commas in your file names.
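A minimal sketch of that suggestion (the value and names are illustrative, not from the asker's data):

```python
# Illustrative sketch: strip the comma from a value before using it as a file name.
name = 'David,Johnson'                 # value containing a comma
safe_name = name.replace(',', '_')     # swap the comma for an underscore
print(safe_name)                       # David_Johnson
```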

Related

String to pandas dataframe Name too long error

I have a comma-separated string. When I export it to a CSV file and then load the resulting CSV in pandas, I get my desired result. But when I try to load the string directly into a pandas DataFrame, I get the error Filename too long. Please help me solve it.
I found the error: it was actually showing that behaviour for large strings.
I used
import io
df = pd.read_csv(io.StringIO(s), sep=',', engine='python')
It solved the issue. Here s is the variable holding the string (better not to name it str, since that shadows the built-in).
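A self-contained sketch of that approach, with an illustrative data string:

```python
import io
import pandas as pd

# A comma-separated string, not a file path: wrap it in StringIO so
# read_csv treats it as a file-like object instead of a (too-long) filename.
s = "name,age\nAlice,30\nBob,25"
df = pd.read_csv(io.StringIO(s), sep=',', engine='python')
print(df.shape)   # (2, 2)
```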

Can Pandas output inferred schema for a CSV file?

Is there a method I can use to output the inferred schema on a large CSV using pandas?
In addition, any way to have it tell me with that type if it is nullable/blank based off the CSV?
File is about 500k rows with 250 columns.
With my new job, I'm constantly being handed CSV files with zero format documentation.
Is it necessary to load the whole CSV file? You could use the read_csv function if you know the separator (or cat the file to find out), then use .info():
df = pd.read_csv(path_to_file,...)
df.info()
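A quick sketch with toy data, showing how .info() (and .dtypes) report the inferred schema, and how the non-null counts reveal which columns contain blanks:

```python
import io
import pandas as pd

# Toy CSV with a blank in each of the last two columns.
csv_text = "id,name,score\n1,Alice,9.5\n2,Bob,\n3,,7.0"
df = pd.read_csv(io.StringIO(csv_text))

df.info()            # dtypes plus non-null counts per column
print(df.dtypes)     # id: int64, name: object, score: float64
```

Comparing the non-null count against the total row count tells you which columns can be blank in the file.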

How to stop python auto date parsing while reading a excel file

I need to read an Excel file without changing any date, time, or float format and convert it to a DataFrame. This works fine if I convert the Excel file to CSV and read it using read_csv().
eg:
import pandas as pd
import numpy as np
#code for reading excel
df=pd.read_excel("605.xlsx",parse_dates=False,sheet_name="Group 1",keep_default_na=False,dtype=str)
print("df_excel:", df)
#code for reading csv
df1=pd.read_csv("Group 1.csv",parse_dates=False,dtype=str,na_filter = False)
print("df_csv:",df1)
output:
In the above code, parse_dates=False works fine when reading the CSV file, but it has no effect in read_excel().
Expected output:
Need the exact excel data into a data-frame without changing the date , time format.
From the Pandas docs on the parse_dates parameter for read_excel():
If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. If you don't want to parse some cells as date just change their type in Excel to “Text”.
You could try this:
df = pd.read_excel("605.xlsx", parse_dates=False, sheet_name="Group 1", keep_default_na=False, dtype=str, converters={'as_at_date': str})
Explicitly converting the date column to string might help.
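As a CSV analogue of the same idea (illustrative data; the as_at_date column name comes from the answer above), dtype=str keeps date-like values exactly as written in the file:

```python
import io
import pandas as pd

# Two different date spellings; dtype=str keeps both verbatim.
csv_text = "as_at_date,value\n01/02/2020,10\n2020-03-04,20"
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(df['as_at_date'].tolist())   # ['01/02/2020', '2020-03-04'] -- unchanged
print(df['value'].tolist())        # ['10', '20'] -- still strings
```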

Importing CSV Data and formatting in Excel via Python

I am importing CSV-based data into an Excel spreadsheet via Python. I would like to know if it is possible to import the data and divide it into several columns (as we would do via the import menu under Data in Excel).
So far, I convert my CSV to a pandas DataFrame and export it to Excel, but all my data is clustered in one column:
df = pd.read_csv(r'C:\Users\Contractuel\Desktop\Test\Candiac_TypeLum_UTF8.csv')
with pd.ExcelWriter('TypeLum_TEST.xlsx') as writer:
    df.to_excel(writer, index=False)
Thanks!
The read_csv method takes an argument sep= that tells pandas what separates the data. You probably need to use it to specify the separator in your CSV file. The default is ',', but CSVs sometimes use ';' or other characters as separators.
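A small sketch of the sep= argument with a semicolon-separated sample (toy data, not the asker's file):

```python
import io
import pandas as pd

# Semicolon-separated data: without sep=';' everything lands in one column.
data = "name;city\nAlice;Montreal\nBob;Candiac"
df = pd.read_csv(io.StringIO(data), sep=';')

print(df.shape)   # (2, 2) -- two proper columns instead of one
```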

Saving DataFrame to csv but output cells type becomes number instead of text

import pandas as pd
check = pd.read_csv('1.csv')
nocheck = check['CUSIP'].str[:-1]
nocheck = nocheck.to_frame()
nocheck['CUSIP'] = nocheck['CUSIP'].astype(str)
nocheck.to_csv('NoCheck.csv')
This works, but when the CSV is written, a value for an identifier like 0003418 (type str) shows up as 3418 (format General) when the file is opened in Excel. How do I avoid this?
I couldn't find a dupe for this question, so I'll post my comment as a solution.
This is an Excel issue, not a python error. Excel autoformats numeric columns to remove leading 0's. You can "fix" this by forcing pandas to quote when writing:
import csv
# insert pandas code from question here
# use csv.QUOTE_ALL when writing CSV.
nocheck.to_csv('NoCheck.csv', quoting=csv.QUOTE_ALL)
Note that this will actually put quotes around each value in your CSV. It will render the way you want in Excel, but you may run into issues if you try to read the file some other way.
Another solution is to write the CSV without quoting, and change the cell format in Excel to "General" instead of "Numeric".
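A self-contained sketch of the quoting approach (toy values; the CUSIP column name comes from the question), including a dtype=str read-back to show the leading zeros survive a round trip:

```python
import csv
import io
import pandas as pd

df = pd.DataFrame({'CUSIP': ['0003418', '0101990']})

# Write with every value quoted, so Excel treats the column as text.
buf = io.StringIO()
df.to_csv(buf, index=False, quoting=csv.QUOTE_ALL)
print(buf.getvalue())   # "CUSIP" / "0003418" / "0101990"

# Reading back: dtype=str keeps the leading zeros intact.
df2 = pd.read_csv(io.StringIO(buf.getvalue()), dtype=str)
print(df2['CUSIP'].tolist())   # ['0003418', '0101990']
```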
