I'm having an issue with Dates not showing as expected once they have been written to a parquet file from a Pandas df.
Here is a brief description of my workflow:
Step 1: Parquet file1 is located in a storage account and can be queried using Synapse serverless SQL. When queried, the values in the Date column show as expected, i.e. 2022-01-01 (yyyy-MM-dd). No time is included in the source data.
Step 2: Parquet file1 is loaded into a pandas df using pd.read_parquet. Once the parquet file has been loaded into the df, the dtype for the Date column is datetime64[ns].
Step 3: Some processing of the df is performed that effectively adds some columns to the existing columns in the df, keeping the same indexes. The Date column is not changed.
Step 4: Before the df is written to parquet, it is confirmed that the Date column is still of dtype datetime64[ns], and listing the contents of df['Date'] gives values such as 2022-01-13.
Step 5: The df is written to parquet file2 using df.to_parquet.
Step 6: Parquet file2 is queried in Synapse serverless SQL and the values in the Date column show as an epoch time, for example: 1640995200000000.
How do I get the Date to be stored in file2 in the same way as it is in file1? I don't need a timestamp, but if one is required to make it work it could be added, i.e. 'T00:00:00'.
Pandas is using pyarrow for the parquet parsing in my current setup.
pandas does not support dates natively, only timestamps.
When you read your data from parquet, the dates are therefore re-interpreted as timestamps (pandas has no date dtype to map them to), and when you save the frame back to parquet they remain timestamps.
You can either change your code so that the dates never get converted to timestamps, or convert the timestamps back to dates before saving to parquet:
import pandas as pd

df = pd.DataFrame(
    {"Date": pd.Series(pd.Timestamp(2023, 1, 1))}
)

def convert_timestamps_to_dates_df(df):
    # Any datetime64[ns] column becomes plain Python dates, which
    # pyarrow writes out as a parquet DATE (date32) column.
    for col in df.columns:
        if df[col].dtype == "datetime64[ns]":
            df[col] = df[col].dt.date
    return df

convert_timestamps_to_dates_df(df).to_parquet("file.parquet")
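To confirm how the column actually landed on disk, you can inspect the file's schema with pyarrow. A minimal sketch, reusing the file name from the example above:

import pyarrow.parquet as pq

# After the conversion, Date should appear as date32 in the schema,
# which Synapse serverless SQL reads back as a plain date.
print(pq.read_schema("file.parquet"))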
I need to read an Excel file without changing any date, time, or float format and convert it to a DataFrame. This works fine if I convert the Excel file to CSV and read it using read_csv().
e.g.:
import pandas as pd
import numpy as np

# code for reading excel
df = pd.read_excel("605.xlsx", parse_dates=False, sheet_name="Group 1", keep_default_na=False, dtype=str)
print("df_excel:", df)

# code for reading csv
df1 = pd.read_csv("Group 1.csv", parse_dates=False, dtype=str, na_filter=False)
print("df_csv:", df1)
In the above code, parse_dates=False works fine when reading the CSV file, but parse_dates=False is not working in read_excel(): the date column still comes back parsed.
Expected output:
The exact Excel data in a DataFrame, without changing the date or time formats.
From the Pandas docs on the parse_dates parameter for read_excel():
If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. If you don't want to parse some cells as date just change their type in Excel to “Text”.
You could try this:
df = pd.read_excel("605.xlsx", parse_dates=False, sheet_name="Group 1", keep_default_na=False, dtype=str, converters={'as_at_date': str})
Explicitly converting the date column to string might help.
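If the column still comes back as parsed datetimes, another option is to let pandas parse it and then freeze it back into the exact text you want. A hedged sketch, assuming the as_at_date column parses cleanly and originally looked like yyyy-mm-dd:

import pandas as pd

df = pd.read_excel("605.xlsx", sheet_name="Group 1", keep_default_na=False, dtype=str)

# Re-render the parsed dates as fixed-format strings; the format string
# is an assumption about how the column looked in the original sheet.
df['as_at_date'] = pd.to_datetime(df['as_at_date']).dt.strftime('%Y-%m-%d')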
The below code simply reads in an Excel file, stores it as a df and writes the df back into an Excel file. When I open the output file in Excel, the columns (dates, numbers) are not the same... some are text, some are numbers, etc.
import pandas as pd

df = pd.read_csv("test.csv", encoding="ISO-8859-1", dtype=object)
writer = pd.ExcelWriter('outputt.xlsx', engine='xlsxwriter')
df.to_excel(writer, index=False, sheet_name='Sheet1')  # drop the index
writer.save()
Is there a way to have the column types (as defined in the initial file) be preserved, or reverted back to the datatypes from when the file was read in?
You are reading in a csv file, which is certainly not the same as an Excel file. You can read a csv file with Excel in Windows, but the encoding is different when the file is saved. You can certainly format cells according to xlsxwriter specifications.
However, it is important to note that xlsxwriter cannot format any cells that already have a format such as the header or index, or dates or datetime objects. If you have multiple datatypes in a single column, that will also be problematic, as pandas will then default that column to object. An item of type "object" will be inferred in output, so again it will be dynamically assigned as a "best guess".
When you read your csv in you should specify the format if you want it to be maintained. Right now you are having pandas do this dynamically (Pandas will try to guess column types using the first 100 or so rows).
Change the line where you read in to include dtypes and they will be preserved in output. I am going to assume your columns have headers "ColumnA", "ColumnB", "ColumnC":
import pandas as pd

df = pd.read_csv("test.csv", encoding="ISO-8859-1",
                 dtype={'ColumnA': int,
                        'ColumnB': float,
                        'ColumnC': str})
Let's use "ColumnC" as an example of a date column. I like to first read dates in as strings, then apply the formatting I want. So you could add this:
df['ColumnC'] = pd.to_datetime(df['ColumnC']).dt.strftime('%m/%d/%Y')
# date would look like: 06/08/2016, but you can look at other formats for dt.strftime
This will ensure specific types in the output. Further formatting can be applied, such as setting the number of decimals in a float or including percents in the output, by following the XlsxWriter formatting guides.
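As a hedged sketch of that kind of output formatting (the column range, width, and number format here are assumptions for illustration):

import pandas as pd

df = pd.read_csv("test.csv", encoding="ISO-8859-1",
                 dtype={'ColumnA': int, 'ColumnB': float, 'ColumnC': str})

writer = pd.ExcelWriter('outputt.xlsx', engine='xlsxwriter')
df.to_excel(writer, index=False, sheet_name='Sheet1')

# Apply an explicit two-decimal format to the cells of ColumnB;
# 'B:B' assumes ColumnB lands in spreadsheet column B.
workbook = writer.book
two_dp = workbook.add_format({'num_format': '0.00'})
writer.sheets['Sheet1'].set_column('B:B', 12, two_dp)
writer.save()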
My advice if you have columns with multiple data types: Don't. This is unorganized and makes use cases much more complex for downstream applications. Spend more time organizing data on the front end so you have less headache on the back end.
I have a csv file that has a column named DOB, but when I try to change that column's data type to a date type it gives an error.
Here is the code:
b['DOB'] = pd.to_datetime(b['DOB'], format='%Y-%m-%d')
When you read the csv in pandas, let the reader parse the dates: pd.read_csv(file_name, parse_dates=['DOB'])
Passing parse_dates a list of column names tells pandas to parse those columns as dates at read time (parse_dates=True on its own only attempts to parse the index).
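If to_datetime itself raises, the usual cause is that the strings don't match the format you passed. A minimal sketch, where the file name and the day-first format are assumptions about your data:

import pandas as pd

b = pd.read_csv('file.csv')  # hypothetical file name

# Match the format to how DOB actually looks, e.g. '24/11/1990';
# errors='coerce' turns unparseable values into NaT instead of raising.
b['DOB'] = pd.to_datetime(b['DOB'], format='%d/%m/%Y', errors='coerce')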
I have an xlsx file with multiple tabs, each tab has a Date column in the format of MM/DD/YYYY
I read each tab into a pandas dataframe, applied some operations to each tab, and then wrote the dataframe back into two formats: xlsx and csv.
In the xlsx file, the Date column (index) takes on a format that has the time attached: 1/1/2013 12:00:00 AM, while the Date column in the csv file remains unchanged: MM/DD/YYYY.
How can I make sure the Date column in the xlsx file maintains the same format MM/DD/YYYY?
You are seeing the datetime in the default pandas Excel datetime format. However, you can easily set it to whatever you want:
# Set the default datetime and/or date formats.
writer = pd.ExcelWriter("pandas_datetime.xlsx",
                        engine='xlsxwriter',
                        date_format='mm/dd/yyyy',
                        datetime_format='mm/dd/yyyy')

# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
See a full example in the XlsxWriter docs: Example: Pandas Excel output with date times.
You can convert the date to a string. If the date is in the index, you could do:
df.set_index(df.index.map(lambda x: x.strftime('%m/%d/%Y'))).to_excel('output.xlsx')  # output file name is a placeholder
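If the dates sit in a regular column rather than the index, the same idea looks like this (a sketch; the Date column name is from the question, the file name is a placeholder):

import pandas as pd

# Render the Date column as fixed MM/DD/YYYY strings before writing,
# so Excel receives text rather than a datetime to reformat.
df['Date'] = pd.to_datetime(df['Date']).dt.strftime('%m/%d/%Y')
df.to_excel('output.xlsx', index=False)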
I have a huge pickle file which needs to be updated every 3 hours from a daily data file (a csv file).
There are two fields named TRX_DATE and TIME_STAMP, with values like 24/11/2015 and 24/11/2015 10:19:02 respectively (plus 50 additional fields).
So what I am doing is first reading the huge pickle into a dataframe, then dropping any values for today's date by comparing with the TRX_DATE field.
Then I read the csv file into another dataframe, append the two dataframes, and create a new pickle.
My script looks like:
import pandas as pd
import datetime as dt

df = pd.read_pickle('hugedata.pkl')  # the huge pickle; file name is a placeholder

# Delete any entries for today from the main pickle
# (assumes TRX_DATE is a datetime64 column).
today = dt.date.today()
df = df[df.TRX_DATE.dt.date != today]

df1 = pd.read_csv('dailydata.csv')  # the daily data csv file; placeholder name
df = df.append(df1, ignore_index=True)
df.to_pickle('hugedata.pkl')  # write the same huge pickle back
The problems are as follows:
1. It takes huge memory as well as time to read that huge pickle.
2. I need to append df1 to df so that only the columns from df remain, excluding any new columns coming from df1. But I am getting new columns with NaN values in so many places.
So I need assistance on these things:
1. Is there a way to read only the small csv and append it to the pickle file (or is reading the pickle mandatory)?
2. Can it be done by converting the csv to a pickle and merging the two pickles with the load/dump method (I have actually never used that)?
3. How do I read the time from the TIME_STAMP field and get the data between two timestamps (filtering by TIME_STAMP), then update that into the main pickle? Previously I was filtering by TRX_DATE values.
Is there a better way? Please suggest.
HDF5 is made for what you are trying to do.
import pandas as pd  # HDF5 support requires the PyTables package ("tables")

df.to_hdf('test.h5', key='test1')    # create an hdf5 file
pd.read_hdf('test.h5', key='test1')  # read an hdf5 file

df.to_hdf() opens the file in append mode ('a') by default, so writing new keys does not clobber existing ones.
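Building on that, a hedged sketch of the update loop from the question: append each daily csv to a table-format store and filter by TIME_STAMP on disk, so the full dataset never has to be re-read. File and column names follow the question; the timestamps in the where clause are made up for illustration:

import pandas as pd

# format='table' plus data_columns is what enables appending and
# on-disk queries; dayfirst=True matches dates like 24/11/2015.
daily = pd.read_csv('dailydata.csv', parse_dates=['TRX_DATE', 'TIME_STAMP'],
                    dayfirst=True)
daily.to_hdf('test.h5', key='test1', format='table', append=True,
             data_columns=['TRX_DATE', 'TIME_STAMP'])

# Read back only the rows between two timestamps, without loading the rest.
subset = pd.read_hdf(
    'test.h5', key='test1',
    where="TIME_STAMP >= '2015-11-24 10:00:00' & TIME_STAMP <= '2015-11-24 13:00:00'"
)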