Pandas aligning multiple dataframes with TimeStamp index - python

This has been the bane of my life for the past couple of days. I have numerous pandas DataFrames containing time series data with irregular frequencies, and I am trying to align them into a single DataFrame.
Below is some code, with representative DataFrames df1, df2, and df3 (I actually have n = 5, and would appreciate a solution that works for any n > 2):
# df1, df2, df3 are given at the bottom
import pandas as pd
import datetime
# I can align df1 to df2 easily
df1aligned, df2aligned = df1.align(df2)
# And then concatenate into a single dataframe
combined_1_n_2 = pd.concat([df1aligned, df2aligned], axis=1)
# Since I don't know any better, I then try to align df3 to combined_1_n_2 manually:
combined_1_n_2.align(df3)
error: Reindexing only valid with uniquely valued Index objects
I have an idea why I get this error, so I get rid of the duplicate indices in combined_1_n_2 and try again:
combined_1_n_2 = combined_1_n_2.groupby(combined_1_n_2.index).first()
combined_1_n_2.align(df3)  # But I still get the same error
error: Reindexing only valid with uniquely valued Index objects
Why am I getting this error? Even if this worked, it is completely manual and ugly. How can I align >2 time series and combine them in a single dataframe?
Data:
df1 = pd.DataFrame({'price': [62.1250, 62.2500, 62.2375, 61.9250, 61.9125]},
                   index=pd.to_datetime(['2008-06-01 06:03:59.614000', '2008-06-01 06:03:59.692000',
                                         '2008-06-01 06:15:42.004000', '2008-06-01 06:15:42.083000',
                                         '2008-06-01 06:17:01.654000']))
df2 = pd.DataFrame({'price': [241.0625, 241.5000, 241.3750, 241.2500, 241.3750]},
                   index=pd.to_datetime(['2008-06-01 06:13:34.524000', '2008-06-01 06:13:34.602000',
                                         '2008-06-01 06:15:05.399000', '2008-06-01 06:15:05.399000',
                                         '2008-06-01 06:15:42.082000']))
df3 = pd.DataFrame({'price': [67.656, 67.875, 67.8125, 67.75, 67.6875]},
                   index=pd.to_datetime(['2008-06-01 06:03:52.281000', '2008-06-01 06:03:52.359000',
                                         '2008-06-01 06:13:34.848000', '2008-06-01 06:13:34.926000',
                                         '2008-06-01 06:15:05.321000']))

Your specific error is due to the column names of combined_1_n_2 having duplicates (both columns are named 'price'). If you renamed the columns, the second align would work.
An alternative is to chain the join operator, which merges frames on the index, as below.
In [23]: df1.join(df2, how='outer', rsuffix='_1').join(df3, how='outer', rsuffix='_2')
Out[23]:
price price_1 price_2
2008-06-01 06:03:52.281000 NaN NaN 67.6560
2008-06-01 06:03:52.359000 NaN NaN 67.8750
2008-06-01 06:03:59.614000 62.1250 NaN NaN
2008-06-01 06:03:59.692000 62.2500 NaN NaN
2008-06-01 06:13:34.524000 NaN 241.0625 NaN
2008-06-01 06:13:34.602000 NaN 241.5000 NaN
2008-06-01 06:13:34.848000 NaN NaN 67.8125
2008-06-01 06:13:34.926000 NaN NaN 67.7500
2008-06-01 06:15:05.321000 NaN NaN 67.6875
2008-06-01 06:15:05.399000 NaN 241.3750 NaN
2008-06-01 06:15:05.399000 NaN 241.2500 NaN
2008-06-01 06:15:42.004000 62.2375 NaN NaN
2008-06-01 06:15:42.082000 NaN 241.3750 NaN
2008-06-01 06:15:42.083000 61.9250 NaN NaN
2008-06-01 06:17:01.654000 61.9125 NaN NaN
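The chained join gets unwieldy as n grows. A more general sketch (assuming each frame holds a single 'price' column, as here, and that duplicate timestamps within each frame have been dropped first; note df2 above contains one duplicate) renames each column uniquely and outer-concatenates everything in one call:

```python
import pandas as pd

# Hypothetical small frames standing in for df1..dfn; any number of frames works.
frames = [
    pd.DataFrame({'price': [1.0, 2.0]},
                 index=pd.to_datetime(['2008-06-01 06:00:00', '2008-06-01 06:00:01'])),
    pd.DataFrame({'price': [10.0]},
                 index=pd.to_datetime(['2008-06-01 06:00:01'])),
    pd.DataFrame({'price': [7.0]},
                 index=pd.to_datetime(['2008-06-01 06:00:02'])),
]

# Give each frame a unique column name, then concatenate on the union of indexes.
renamed = [f.rename(columns={'price': f'price_{i}'}) for i, f in enumerate(frames)]
combined = pd.concat(renamed, axis=1)  # outer join on all timestamps
print(combined)
```

Timestamps that appear in only one frame get NaN in the other columns, exactly as in the join output above.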

Related

I am trying to join a dataframe with a pandas series but get NaN values in the dataframe

I have a pandas series called mean_return:
BHP 0.094214
GOOG 0.180892
INTC -0.179899
MRK 0.065741
MSFT 0.205519
MXL 0.153332
SHEL 0.001714
TSM 0.162741
WBD -0.233863
dtype: float64
pandas.core.series.Series
When I try to merge the above with the dataframe below, I get NaN values for mean_return (excuse the formatting; I'm not sure how to copy and paste the dataframe). I see the tickers are not ordered the same in the series as in the DataFrame. What can I do to merge the DataFrame and the series?
mkt_value weights investment shares mean_return
0 GOOG 51.180000 0.115308 14413.469698 281.623087 NaN
1 BHP 99.570000 0.140488 17560.996495 176.368349 NaN
INTC 25.719999 0.092804 11600.473577 451.029311 NaN
MXL 87.599998 0.110175 13771.865664 157.213081 NaN
MRK 234.240005 0.102416 12801.944297 54.653108 NaN
MSFT 34.220001 0.123298 15412.217160 450.386225 NaN
SHEL 51.970001 0.142114 17764.225757 341.816920 NaN
TSM 69.750000 0.134963 16870.389838 241.869388 NaN
WBD 11.980000 0.038435 4804.417515 40
Here is the code I used:
df=pd.DataFrame(tickers_list)
df.rename({'index':'tickers_list'},axis='columns',inplace=True)
df['mkt_value']=data.values[-1]
df['weights']=weights
df['investment']=port_size*weights
df['shares']=df['investment']/df['mkt_value']
df['mean_return']=mean_return
df
If the tickers in the df are unique, I would set them as the index and then use pd.concat, which concatenates based on the index.
df = df.set_index('tickers_list')
mean_return.name = 'mean_return'
df = pd.concat([df, mean_return], axis=1)
You set the name attribute because that becomes the name of the new column.
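A minimal, self-contained sketch of that pattern (the ticker names are from the question, the numbers are invented for illustration):

```python
import pandas as pd

# Hypothetical per-ticker data standing in for the question's dataframe.
df = pd.DataFrame({'tickers_list': ['GOOG', 'BHP', 'INTC'],
                   'weights': [0.115, 0.140, 0.093]})

# The series is keyed by ticker, in a different order than the dataframe.
mean_return = pd.Series({'BHP': 0.094214, 'INTC': -0.179899, 'GOOG': 0.180892})
mean_return.name = 'mean_return'  # becomes the new column's name

df = df.set_index('tickers_list')
df = pd.concat([df, mean_return], axis=1)  # aligns on the ticker index, not on position
print(df)
```

Because the concatenation aligns on the index rather than on row position, the differing ticker order no longer matters.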

Pandas Timeseries reindex producing NaNs

I am surprised that my reindex produces NaNs across the whole dataframe, when the original dataframe does have numerical values in it. I don't know why.
Code:
df =
A ... D
Unnamed: 0 ...
2022-04-04 11:00:05 NaN ... 2419.0
2022-04-04 11:00:10 NaN ... 2419.0
## exp start and end times
exp_start, exp_end = '2022-04-04 11:00:00','2022-04-04 13:00:00'
## one second index
onesec_idx = pd.date_range(start=exp_start,end=exp_end,freq='1s')
## map new index to the df
df = df.reindex(onesec_idx)
Result:
df =
A ... D
2022-04-04 11:00:00 NaN ... NaN
2022-04-04 11:00:01 NaN ... NaN
2022-04-04 11:00:02 NaN ... NaN
2022-04-04 11:00:03 NaN ... NaN
2022-04-04 11:00:04 NaN ... NaN
2022-04-04 11:00:05 NaN ... NaN
From the documentation you can see that df.reindex() places NA/NaN in locations that had no value in the previous index.
However, you can also provide a value to use for those missing entries (it defaults to NaN):
df.reindex(onesec_idx, fill_value='')
If you want to replace the NaN in a particular column, or in the whole dataframe, you can run something like this after the reindex:
df = df.fillna('')           # replace NaN in the entire df with ''
df['D'] = df['D'].fillna(0)  # replace NaN in the D column with 0
Sources:
Documentation for reindex: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html
Documentation for fillna: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
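A runnable sketch of the reindex-with-fill pattern (toy values standing in for the question's data):

```python
import pandas as pd

# Toy frame with a few irregular timestamps, like the question's df.
df = pd.DataFrame({'D': [2419.0, 2419.0]},
                  index=pd.to_datetime(['2022-04-04 11:00:05',
                                        '2022-04-04 11:00:10']))

# One-second grid over the experiment window.
onesec_idx = pd.date_range('2022-04-04 11:00:00', '2022-04-04 11:00:10', freq='1s')

# Rows whose timestamp exists in df keep their values; the rest get fill_value.
# Note: labels only match if df.index really is a DatetimeIndex -- if it was read
# from CSV as strings, convert with df.index = pd.to_datetime(df.index) first.
out = df.reindex(onesec_idx, fill_value=0)
print(out)
```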

NaN columns pandas turning into empty row when using pivot_table

I use read_csv to fill a pandas DataFrame. The DataFrame has a column that is entirely NaN, and this becomes a problem when I use pivot_table.
Here my situation:
import numpy as np
import pandas as pd

d = {'dates': ['01/01/20', '01/02/20', '01/03/20'],
     'country': ['Fra', 'Fra', 'Fra'],
     'val': [np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
piv=df.pivot_table(index='country',values='val',columns='dates')
print(piv)
Empty DataFrame
Columns: []
Index: []
I would like to have this :
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
From the docs, set dropna=False (see DataFrame.pivot_table):
piv = df.pivot_table(index='country',values='val',columns='dates', dropna=False)
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN
Just use the dropna argument of pivot_table:
df.pivot_table(index='country',columns='dates', values='val', dropna = False)
The output is:
dates 01/01/20 01/02/20 01/03/20
country
Fra NaN NaN NaN

Selecting column values of a dataframe which is in a range and put it in appropriate columns of another dataframe in pandas

I have a csv file which is something like below
date,mean,min,max,std
2018-03-15,3.9999999999999964,inf,0.0,100.0
2018-03-16,0.46403712296984756,90.0,0.0,inf
2018-03-17,2.32452732452731,,0.0,143.2191767899579
2018-03-18,2.8571428571428523,inf,0.0,100.0
2018-03-20,0.6928406466512793,100.0,0.0,inf
2018-03-22,2.8675703858185635,,0.0,119.05383697172658
I want to select those column values which are > 20 and < 500 (i.e. in the range 20 to 500) and put those values, along with the date, into another column of a dataframe. The other dataframe looks something like this:
Date percentage_change location
2018-02-14 23.44 BOM
So I want to get the date and value from the csv and add them to the new dataframe in the appropriate columns, something like:
Date percentage_change location
2018-02-14 23.44 BOM
2018-03-15 100.0 NaN
2018-03-16 90.0 NaN
2018-03-17 143.2191767899579 NaN
.... .... ....
Now I am aware of functions like df.max(axis=1) and df.min(axis=1), which give you the min and max, but not one for finding values based on a range. So how can this be achieved?
Given dataframes df1 and df2, you can achieve this by aligning column names, cleaning the numeric data, and then concatenating. (pd.DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.)
df_app = df1.loc[:, ['date', 'mean', 'min', 'std']]\
            .rename(columns={'date': 'Date'})\
            .replace(np.inf, 0)\
            .fillna(0)

df_app['percentage_change'] = np.maximum(df_app['min'], df_app['std'])
df_app = df_app[df_app['percentage_change'].between(20, 500)]

res = pd.concat([df2, df_app.loc[:, ['Date', 'percentage_change']]])
print(res)
# Date percentage_change location
# 0 2018-02-14 23.440000 BOM
# 0 2018-03-15 100.000000 NaN
# 1 2018-03-16 90.000000 NaN
# 2 2018-03-17 143.219177 NaN
# 3 2018-03-18 100.000000 NaN
# 4 2018-03-20 100.000000 NaN
# 5 2018-03-22 119.053837 NaN
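For the general question ("like df.max(axis=1), but for a range"), a boolean mask plus stack pulls every (date, value) pair whose value falls in the range, regardless of which column it came from. A sketch with made-up numbers resembling the csv:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'min': [np.inf, 90.0, np.nan],
                   'std': [100.0, np.inf, 143.22]},
                  index=pd.to_datetime(['2018-03-15', '2018-03-16', '2018-03-17']))

# Long form: one row per (date, column) pair; stack drops NaN by default.
vals = df.replace(np.inf, np.nan).stack()
in_range = vals[vals.between(20, 500)]  # keep only values in the range 20..500
print(in_range)
```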

Extract series objects from Pandas DataFrame

I have a dataframe with the columns
['CPL4', 'Part Number', 'Calendar Year/Month', 'Sales', 'Inventory']
'Calendar Year/Month' is unique within each 'Part Number'.
I want to convert each part number to a univariate Series with 'Calendar Year/Month' as the index and either 'Sales' or 'Inventory' as the value.
How can I accomplish this using pandas built-in functions and not iterating through the dataframe manually?
In pandas this is called a MultiIndex. Try setting both columns as the index:
import pandas as pd

df = df.set_index(['Part Number', 'Calendar Year/Month'])[['Sales', 'Inventory']]
You can use the groupby method, such as:
grouped_df = df.groupby('Part Number')
You can then access the sub-frame for a given part number and set its index easily:
new_df = grouped_df.get_group('THEPARTNUMBERYOUWANT').set_index('Calendar Year/Month')
If you only want the two columns, you can do:
print(new_df[['Sales', 'Inventory']])
From the answers and comments here, along with a little more research, I ended up with the following solution:
temp_series = df[df['Part Number'] == sku].pivot(columns='Calendar Year/Month', values='Sales').iloc[0]
where sku is a specific part number from df['Part Number'].unique().
This will give you a univariate time series (temp_series) indexed by 'Calendar Year/Month' with values of 'Sales', e.g.:
1.2015 NaN
1.2016 NaN
2.2015 NaN
2.2016 NaN
3.2015 NaN
3.2016 NaN
4.2015 NaN
4.2016 NaN
5.2015 NaN
5.2016 NaN
6.2015 NaN
6.2016 NaN
7.2015 NaN
7.2016 NaN
8.2015 NaN
8.2016 NaN
9.2015 NaN
10.2015 NaN
11.2015 NaN
12.2015 NaN
Name: 161, dtype: float64
<class 'pandas.core.series.Series'>
from the original columns ['CPL4', 'Part Number', 'Calendar Year/Month', 'Sales', 'Inventory'].
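Putting the set_index idea together as a self-contained sketch (part numbers and values invented for illustration):

```python
import pandas as pd

# Toy data with the question's columns.
df = pd.DataFrame({'CPL4': ['A', 'A', 'B', 'B'],
                   'Part Number': ['P1', 'P1', 'P2', 'P2'],
                   'Calendar Year/Month': ['1.2015', '2.2015', '1.2015', '2.2015'],
                   'Sales': [10, 12, 7, 9],
                   'Inventory': [100, 90, 50, 45]})

# MultiIndex of (part, month); xs(part) then gives a frame indexed by month,
# and selecting one column yields the univariate series.
indexed = df.set_index(['Part Number', 'Calendar Year/Month'])
sales_p1 = indexed.xs('P1')['Sales']  # Series indexed by 'Calendar Year/Month'
print(sales_p1)
```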
