I have imported a CSV file into pandas that contains fields which look like datetimes but are initially parsed as 'object'. I make the required conversion from 'object' to 'datetime' using df.X = pd.to_datetime(df.X).
Now, when I try to save these changes by writing the data out to a new .csv file and importing that, the dtype is still 'object'. Is there any way to fix its datatype so that I don't have to perform the conversion every time I import it? My dataset is quite big and the conversion takes some time, which I want to save.
Date parsing can be expensive, so pandas doesn't parse dates by default. You need to pass the parse_dates argument when calling read_csv:
df = pd.read_csv('my_file.csv', parse_dates=['date_column'])
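A quick sanity check (reusing the placeholder file and column names from above) confirms the dtype survives the read, so no separate pd.to_datetime() pass is needed:
import pandas as pd

# 'my_file.csv' and 'date_column' are the placeholder names from above
df = pd.read_csv('my_file.csv', parse_dates=['date_column'])
print(df.dtypes)  # date_column shows datetime64[ns]
print(df['date_column'].dt.year.head())  # the .dt accessor works right away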
I need to read an Excel file without changing any date, time, or float formats and convert it to a DataFrame. This works fine if I convert the Excel file to CSV and read it with read_csv().
e.g.:
import pandas as pd
import numpy as np

# code for reading excel
df = pd.read_excel("605.xlsx", parse_dates=False, sheet_name="Group 1",
                   keep_default_na=False, dtype=str)
print("df_excel:", df)

# code for reading csv
df1 = pd.read_csv("Group 1.csv", parse_dates=False, dtype=str, na_filter=False)
print("df_csv:", df1)
In the above code, parse_dates=False works fine when reading the CSV file, but it has no effect in read_excel().
Expected output:
The exact Excel data in a DataFrame, without the date and time formats being changed.
From the Pandas docs on the parse_dates parameter for read_excel():
If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. If you don't want to parse some cells as date just change their type in Excel to "Text".
You could try this:
df = pd.read_excel("605.xlsx", parse_dates=False, sheet_name="Group 1",
                   keep_default_na=False, dtype=str,
                   converters={'as_at_date': str})
Explicitly converting the date column to string might help.
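To verify that nothing was coerced (reusing the column name from the snippet above), you can inspect the result of the read:
print(df.dtypes)  # every column should show 'object' because of dtype=str
print(df['as_at_date'].head())  # shows exactly what ended up in the date column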
I have a dataframe that contains a column with sets. When I save the dataframe using .to_csv() and then re-open it with pd.read_csv(), the column that contained sets now contains strings.
Here is a code example:
df = pd.DataFrame({'numbers': [1, 2, 3],
                   'sets': [set('abc'), set('XYZ'), set([1, 2, 3])]})
print(type(df.sets[0]))  # Type = set
df.to_csv('xxx/test.csv')
df = pd.read_csv('xxx/test.csv', header=0, index_col=0)
print(type(df.sets[0]))  # Type = str
Is there a way to avoid the type changing? I can't find which parameter of either .to_csv() or pd.read_csv() controls this behavior.
The only way I found to get around this problem is by using pickle, but I'm guessing there is a way of doing it with pandas.
Do you know what a CSV file is? It is just a text file. You can open it with vi or Notepad to make sure.
That means that what is saved in a CSV file is just a text representation of the fields. read_csv does its best to convert integer and floating point values back, and it can even find dates if you use the parse_dates parameter.
Here you could use ast.literal_eval as a custom converter:
import ast
...
df = pd.read_csv('xxx/test.csv', header=0, index_col=0,
                 converters={'sets': ast.literal_eval})
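Round-tripping the example from the question with this converter (a quick sanity check, written against the same hypothetical path) should preserve the type:
import ast
import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3],
                   'sets': [set('abc'), set('XYZ'), set([1, 2, 3])]})
df.to_csv('xxx/test.csv')

# literal_eval parses the saved "{'a', 'b', 'c'}" strings back into sets
df = pd.read_csv('xxx/test.csv', header=0, index_col=0,
                 converters={'sets': ast.literal_eval})
print(type(df.sets[0]))  # <class 'set'>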
I can't seem to find any information about converting an NDFrame to a DataFrame. I am looking to do this because I can't seem to write an NDFrame to a CSV file. I've tried the code below, but it still returns an NDFrame. How do I make this conversion? Or how do I write an NDFrame to CSV?
df = pd.DataFrame(df)
Here is the error I'm receiving:
ImportError: cannot import name 'get_compression_method' from 'pandas.io.common'
I have one service running pandas version 0.25.2. This service reads data from a database and stores a snapshot as CSV:
df = pd.read_sql_query(sql_cmd, oracle)
The query results in a dataframe with some very large datetime values (e.g. 3000-01-02 00:00:00).
Afterwards I use df.to_csv(index=False) to create a CSV snapshot and write it to a file.
On a different machine with pandas 0.25.3 installed, I read the content of the CSV file into a dataframe and try to change the datatype of the date column to datetime. This results in an OutOfBoundsDatetime exception:
df = pd.read_csv("xy.csv")
pd.to_datetime(df['val_until'])
pandas._libs.tslibs.np_datetime.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3000-01-02 00:00:00
I am thinking about using pickle to create the snapshots and load the dataframes directly. However, I am curious why pandas is able to handle a big datetime in the first case and not in the second one.
Also, any suggestions on how to keep using CSV as the transfer format are appreciated.
I believe I got it.
In the first case, I'm not sure what the actual data type stored in the SQL database is, but if not otherwise specified, reading it into the df likely results in some generic or string type, which has a much higher overflow value.
Eventually, though, it ends up in a CSV file as a string, which can be arbitrarily long without overflowing. The datetime64[ns] type that pandas.to_datetime casts into, on the other hand, has a maximum value of '2262-04-11 23:47:16.854775807' (see pandas.Timestamp.max in the pandas docs).
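If you want to keep CSV as the transfer format, here is a minimal sketch of two workarounds, assuming the column layout from the question; neither is the one true fix:
from datetime import datetime
import pandas as pd

df = pd.read_csv("xy.csv")

# Option 1: let pandas replace out-of-range values such as
# 3000-01-02 with NaT instead of raising
safe = pd.to_datetime(df['val_until'], errors='coerce')

# Option 2: parse into plain Python datetime objects instead; the
# column dtype stays object, so the nanosecond Timestamp limit never
# applies (format string assumed from the error message above)
df['val_until'] = df['val_until'].apply(
    lambda s: datetime.strptime(s, '%Y-%m-%d %H:%M:%S'))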
When I run the following code
import glob, os
import pandas as pd

dirpath = os.getcwd()
inputdirectory = dirpath
for xls_file in glob.glob(os.path.join(inputdirectory, "*.xls*")):
    data_xls = pd.read_excel(xls_file, sheet_name=0, index_col=None)
    csv_file = os.path.splitext(xls_file)[0] + ".csv"
    data_xls.to_csv(csv_file, encoding='utf-8', index=False)
It will convert all xls files in the folder into CSV as I want.
HOWEVER, on doing so, any dates such as 20/12/2018 will be converted to 20/12/2018 00:00:00, which is causing major issues with later data processing.
What is going wrong with this?
Nothing is "going wrong" per se. You simply need to provide a custom date_format to df.to_csv:
date_format : string, default None
Format string for datetime objects
In your case that would be
data_xls.to_csv(csv_file, encoding='utf-8', index=False, date_format='%d/%m/%Y')
This will fix the way the raw data is saved to the file. If you open the file in Excel, you may still see the full format: Excel tries to infer cell formats from their content. You will need to right-click the column and select another cell format; there is nothing that pandas or Python can do about that (as long as you are using to_csv and not to_excel).
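To confirm the fix at the file level (using a hypothetical 'date' column name), read the CSV back with everything kept as strings and look at the raw values:
df_check = pd.read_csv(csv_file, dtype=str)
print(df_check['date'].head())  # e.g. 20/12/2018, with no time component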
If the above answers still don't work, try this:
xls_data['date'] = pd.to_datetime(xls_data['date'], format="%d/%m/%Y")
xls_data['date'] = xls_data['date'].dt.date  # keeps only the date part, dropping the time
The original xls file is actually storing these fields as datetimes.
When you open it with Excel, you see them formatted the way Excel thinks you want to see them, based on your settings / OS locale / etc.
When Python reads the file, the date cells become Python datetime objects.
CSV files are basically just text; they cannot hold datetime objects.
So when Python needs to write a datetime object to a text file, it writes out the full text representation.
So you have 2 options:
Change the original file's date column to text type.
Or, the better option:
Use Python to iterate over these fields and convert them to the text format you would like to see in the CSV (see the sketch after this list).
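Here is a minimal sketch of that second option, assuming the column is named 'date' and that day/month/year is the text format you want:
import pandas as pd

xls_data = pd.read_excel('test.xls', sheet_name=0, index_col=None)

# Render the datetime column as fixed text before writing, so the
# CSV holds exactly the string you chose, not the full timestamp
xls_data['date'] = xls_data['date'].dt.strftime('%d/%m/%Y')
xls_data.to_csv('test.csv', encoding='utf-8', index=False)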
I just tried to reproduce your issue with no success:
>>> import pandas as pd
>>> xls_data = pd.read_excel('test.xls', sheet_name=0, index_col=None)
>>> xls_data
name date
0 walla 1988-12-10
1 cool 1999-12-10
>>> xls_data.to_csv(encoding='utf-8', index=False)
'name,date\nwalla,1988-12-10\ncool,1999-12-10\n'
P.S. Any time you deal with datetime objects, you should test the result to see if anything changes based on your PC's locale settings.