Pandas does not retain frequency format when exporting to excel - python

I have a monthly dataframe, and after resampling to annual data I used pandas' to_period('M') so the index is shown in monthly format. That works fine in the notebook, but when I export to Excel the index appears in datetime format there.
How can I retain the period format when exporting to Excel?
Data sample in Jupyter Notebook:
2014 1463 146.416667 1110.877414 197.230546 199.230546
Data sample in excel:
2014-01-01 00:00:00 1463 146.416667 1110.877414 197.230546 199.230546

@Arthur Gouveia: Thanks for your response!
I tried changing the data type to string and it worked. But is there a better solution?
if type(TNA_BB2_a.index) == pd.tseries.period.PeriodIndex:
    TNA_BB2_a.index = TNA_BB2_a.index.strftime('%Y')
if type(tna_n_m_BB.index) == pd.tseries.period.PeriodIndex:
    tna_n_m_BB.index = tna_n_m_BB.index.strftime('%Y-%m')
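A slightly more general version of the same idea, as a sketch (the dataframe, the annual resampling, and the output path below are placeholder assumptions, not the poster's actual objects): Excel has no notion of a pandas Period, so the writer receives the underlying timestamps; converting the PeriodIndex to strings right before exporting keeps the '2014' / '2014-01' look in the file.
import pandas as pd

# Placeholder monthly data resampled to an annual PeriodIndex
df = pd.DataFrame({'value': range(24)},
                  index=pd.date_range('2014-01-01', periods=24, freq='MS'))
annual = df.resample('A').sum().to_period('A')

# Convert the PeriodIndex to strings so Excel shows '2014' instead of a datetime
if isinstance(annual.index, pd.PeriodIndex):
    annual.index = annual.index.strftime('%Y')

annual.to_excel('annual.xlsx')  # placeholder output path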

Related

How to force google colab and gspread to write strings and dates to google sheets in the correct and respective format

I have a dataframe that has a series with dates that look like this:
2022-06-01
and also a series with ids that look like this:
7582857e38
When I write the df to a Google Sheet, I have two options for writing data:
value_input_option='RAW' or value_input_option='USER_ENTERED'
If I use RAW (the default), the id is written properly but the date is written as a forced string:
'2022-06-01
If I use USER_ENTERED, then some of the ids are converted to numbers in scientific notation, but the date is written properly.
Is there a way to deal with this?
I figured it out.
I ended up writing the dataframe to Google Sheets in two stages:
1st stage: write the entire df as RAW
2nd stage: make a temp df holding the date series (column) and then overwrite that column with USER_ENTERED
# write the entire df as RAW
cam_ws.update([cam_df.columns.values.tolist()] + cam_df.values.tolist(), value_input_option='RAW')
# re-write the date columns as USER_ENTERED so they are interpreted as dates
Reseller_Order_Date_df = cam_df[['Reseller Order Date']]
cam_ws.update('J1', [Reseller_Order_Date_df.columns.values.tolist()] + Reseller_Order_Date_df.values.tolist(), value_input_option='USER_ENTERED')
Expired_At_df = cam_df[['Expired At']]
cam_ws.update('P1', [Expired_At_df.columns.values.tolist()] + Expired_At_df.values.tolist(), value_input_option='USER_ENTERED')
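An alternative that avoids the two-stage write, sketched here as an assumption rather than something from the thread: pre-format the id column as plain text (the 'A2:A' range is a placeholder, and this relies on gspread's Worksheet.format being available), then write everything once with USER_ENTERED.
# Format the id column as plain text so USER_ENTERED does not coerce
# ids like 7582857e38 into scientific notation (range is a placeholder)
cam_ws.format('A2:A', {'numberFormat': {'type': 'TEXT'}})

# A single USER_ENTERED write then keeps dates as dates and ids as text
cam_ws.update([cam_df.columns.values.tolist()] + cam_df.values.tolist(),
              value_input_option='USER_ENTERED')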

Why pyspark converting string date values to null?

Question: Why is myTimeStampCol1 in the following code returning a null value in the third row, and how can we fix the issue?
from pyspark.sql.functions import *
df = spark.createDataFrame(
    data=[("1", "Arpit", "2021-07-24 12:01:19.000"),
          ("2", "Anand", "2019-07-22 13:02:20.000"),
          ("3", "Mike", "11-16-2021 18:00:08")],
    schema=["id", "Name", "myTimeStampCol"])
df.select(col("myTimeStampCol"),
          to_timestamp(col("myTimeStampCol"), "yyyy-MM-dd HH:mm:ss.SSSS").alias("myTimeStampCol1")).show()
Output:
+--------------------+-------------------+
|      myTimeStampCol|    myTimeStampCol1|
+--------------------+-------------------+
|2021-07-24 12:01:...|2021-07-24 12:01:19|
|2019-07-22 13:02:...|2019-07-22 13:02:20|
| 11-16-2021 18:00:08|               null|
+--------------------+-------------------+
Remarks:
I'm running the code in a Python notebook in Azure Databricks (which is essentially the same as Databricks).
The above example is just a sample to explain the issue. The real code imports a data file with millions of records, and that file has a column in the format MM-dd-yyyy HH:mm:ss (for example 11-16-2021 18:00:08); all the values in that column have exactly the same format.
The error occurs because of the difference in formats. Since all the records in this column are in the format MM-dd-yyyy HH:mm:ss, you can modify the code as follows:
df.select(col("myTimeStampCol"),
          to_timestamp(col("myTimeStampCol"), 'MM-dd-yyyy HH:mm:ss').alias("myTimeStampCol1")).show(truncate=False)
# only if all the records in this column are in 'MM-dd-yyyy HH:mm:ss' format
to_timestamp() expects one or two arguments: the column holding the timestamp values and, optionally, the format of those values. Since all the values here share the format MM-dd-yyyy HH:mm:ss, you can pass that as the second argument.
It seems like your timestamp pattern at index #3 is not aligned with the others.
Spark uses the default patterns yyyy-MM-dd for dates and yyyy-MM-dd HH:mm:ss for timestamps.
Changing the value to the default format should solve the problem: 2021-11-16 18:00:08
Edit #1:
Alternatively, creating a custom transformation function may be a good idea (sorry, I only found an example in Scala):
Spark : Parse a Date / Timestamps with different Formats (MM-dd-yyyy HH:mm, MM/dd/yy H:mm ) in same column of a Dataframe
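If the column really does mix several formats, a common PySpark pattern, sketched here rather than taken from the answers above (the list of candidate formats is an assumption), is to try each format and keep the first one that parses:
from pyspark.sql.functions import coalesce, col, to_timestamp

# Candidate formats to try, in order (assumed for illustration)
formats = ["yyyy-MM-dd HH:mm:ss.SSSS", "MM-dd-yyyy HH:mm:ss"]

# coalesce() returns the first non-null result, i.e. the first format that parses
parsed = df.select(
    col("myTimeStampCol"),
    coalesce(*[to_timestamp(col("myTimeStampCol"), f) for f in formats]).alias("myTimeStampCol1"))
parsed.show(truncate=False)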

how to efficiently store pandas series in Postgresql?

I am trying to store a pandas Series column with around 3000 rows in a new table in PostgreSQL. This column is part of an Excel file holding time series data recorded by a number of different sensors. My Excel file looks like this:
dateTime             sensor1  sensor2  ...  sensor-n
2021-06-12 00:00:01  0.0
2021-06-13 00:00:03  0.0      1.0
2021-06-14 00:00:05  0.0
...
If I store the sensor name and the pandas Series for each sensor separately, that gives me redundancy. Do you have any idea how I can store a pandas Series efficiently for different sensors in PostgreSQL? Please help, I am new to PostgreSQL. Thank you.
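The thread does not contain an answer; one common way to avoid the redundancy, sketched under assumptions (SQLAlchemy and psycopg2 installed; the file path, table name, and connection string are placeholders), is to melt the wide sensor columns into a long table of (timestamp, sensor, value) rows:
import pandas as pd
from sqlalchemy import create_engine

# Wide frame as read from the Excel file (placeholder path)
wide = pd.read_excel('sensors.xlsx', parse_dates=['dateTime'])

# Reshape to long format: one row per (timestamp, sensor) pair,
# dropping missing readings instead of storing empty cells
long_df = wide.melt(id_vars='dateTime', var_name='sensor', value_name='value')
long_df = long_df.dropna(subset=['value'])

# Write everything into a single PostgreSQL table (connection string is a placeholder)
engine = create_engine('postgresql+psycopg2://user:password@localhost:5432/mydb')
long_df.to_sql('sensor_readings', engine, if_exists='append', index=False)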

Wrong Max values returned by panda on reading a CSV Data File

I am attempting to calculate the maximum stock price and the latest date (today) by reading two CSV files, using the pandas max() function. However, the maximum value returned from the 'Close/Last' column of one of the CSV files seems implausible.
# Read in libraries
import pandas as pd

# Define functions
def get_max_close(symbol):
    """Return the max closing value for the stock indicated by symbol."""
    df = pd.read_csv("Data\{}.csv".format(symbol))  # Read in data
    return df[' Close/Last'].max(), df['Date'].max()  # Compute max and return the data to test_run

def test_run():
    """Function called by Test Run."""
    for symbol in ['AAPL', 'IBM']:
        print("Max close")
        print(symbol, get_max_close(symbol))

# Main program
if __name__ == "__main__":
    test_run()
The answer I get is:
Max close
AAPL (' $99.99', '12/31/2019')
Max close
IBM (' $215.8', '12/31/2019')
Clearly, the actual max close values are higher than $99.99, and the date is outdated.
I updated the pandas library too, but the issue persists. Any help here would be appreciated.
The AAPL CSV file has data such as (example):
[image: AAPL.csv file data]
' $99.99' is a string, not a number, so Pandas' max() compares the values as text rather than numerically. In any case it is safer to convert your data to proper float values before taking the maximum. The same issue causes the problem with the dates: pandas has a dedicated datetime type, and pd.to_datetime() can convert the column accordingly. Then max() will work as you intended.
Date values should be converted to a datetime type using pd.to_datetime(). The prices are strings, so you need to strip off the whitespace and the $ sign and convert them to float:
df['Date'] = pd.to_datetime(df['Date'])
# Strip surrounding whitespace and the leading '$' before converting to float
df['Close/Last'] = df['Close/Last'].str.strip().str.lstrip('$').astype(float)
print(df['Close/Last'].max())
print(df['Date'].max())
Output:
302.74
2020-03-13 00:00:00
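An equivalent approach, sketched here as an assumption rather than part of the original answer (the column names follow the question; the converter function is hypothetical), is to do the cleanup while reading the file, so every downstream max() already sees proper types:
import pandas as pd

def parse_price(value):
    # Strip whitespace and the leading '$' before converting to float
    return float(value.strip().lstrip('$'))

df = pd.read_csv(r'Data\AAPL.csv',
                 converters={' Close/Last': parse_price},
                 parse_dates=['Date'])
print(df[' Close/Last'].max(), df['Date'].max())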

Calculating and plotting a 20 year Climatology

I am working on plotting a 20 year climatology and have had issues with averaging.
My data is hourly data since December 1999 in CSV format. I used an API to get the data and currently have it in a pandas data frame. I was able to split up hours, days, etc like this:
dfROVC1['Month'] = dfROVC1['time'].apply(lambda cell: int(cell[5:7]))
dfROVC1['Day'] = dfROVC1['time'].apply(lambda cell: int(cell[8:10]))  # two digits for the day
dfROVC1['Year'] = dfROVC1['time'].apply(lambda cell: int(cell[0:4]))
dfROVC1['Hour'] = dfROVC1['time'].apply(lambda cell: int(cell[11:13]))
So I averaged all the days using:
z=dfROVC1.groupby([dfROVC1.index.day,dfROVC1.index.month]).mean()
That worked, but I realized I should take the average of the daily minimums and the average of the daily maximums across all my data. I have been having a hard time figuring this out.
I want my plot to look like this:
[example plot: Monthly Average Section]
but I can't figure out how to make it work.
I am currently using Jupyter Notebook with Python 3.
Any help would be appreciated.
Is there a reason you didn't just use datetime to convert your time column?
The minimums by month would be:
z=dfROVC1.groupby(['Year','Month']).min()
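Building on that suggestion, here is a sketch of the full datetime-based workflow (it assumes the 'time' column parses with pd.to_datetime and that the hourly variable is in a column called 'temp'; both names are placeholders):
import pandas as pd

# Parse the timestamp column once instead of slicing strings
dfROVC1['time'] = pd.to_datetime(dfROVC1['time'])
dfROVC1['Year'] = dfROVC1['time'].dt.year
dfROVC1['Month'] = dfROVC1['time'].dt.month
dfROVC1['Day'] = dfROVC1['time'].dt.day

# Daily min and max of the hourly values, then the 20-year monthly climatology
daily = dfROVC1.groupby(['Year', 'Month', 'Day'])['temp'].agg(['min', 'max'])
climatology = daily.groupby('Month').mean()  # mean daily min and max per calendar month

climatology.plot()  # one line for the average daily minimum, one for the maximum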
