using cumsum method on multi level index in pandas - python

I have the following multi-level dataframe (partial)
Px_last FINAL RETURN Stock_RES WANTED
Stock Date
ALKM 10/27/2016 0.0013 1 -53.85 NaN -53.85
1/17/2017 0.0009 1 111.11 NaN 57.26
1/18/2017 0.0012 1 233.33 NaN 290.60
1/23/2018 0.0012 1 16.67 NaN 307.26
1/30/2018 0.0019 1 -42.11 NaN 265.16
ANDI 12/28/2017 0.0017 1 370.59 NaN 370.59
2/14/2018 0.0324 1 20.00 NaN 390.59
APPZ 9/22/2017 0.0002 1 -50.00 NaN -50.00
12/5/2017 0.0001 1 -100.00 NaN -150.00
12/6/2017 0.0001 1 0.00 NaN -150.00
I can do a cumulative sum for the entire dataframe with the following code
df3['TTL_SUM'] = df3['RETURN'].cumsum()
But what I want to do is a cumulative sum per stock, and when I do the following I get a column of NaN (see the dataframe above). Does anyone know what I am doing wrong here?
df3['Stock_RES'] = df3.groupby(level=0)['RETURN'].sum()
It does seem to work when I assign it to a separate variable, but ultimately I want the result inside the dataframe:
RESULTS = df3.groupby(level=0)['RETURN'].sum()
Can someone help me out? It seems like the same code to me, so I'm not sure why it won't assign directly into the dataframe.

You were using sum and not cumsum in a groupby context.
df.assign(WANTED1=df.groupby('Stock').RETURN.cumsum())
Px_last FINAL RETURN Stock_RES WANTED WANTED1
Stock Date
ALKM 10/27/2016 0.0013 1 -53.85 NaN -53.85 -53.85
1/17/2017 0.0009 1 111.11 NaN 57.26 57.26
1/18/2017 0.0012 1 233.33 NaN 290.60 290.59
1/23/2018 0.0012 1 16.67 NaN 307.26 307.26
1/30/2018 0.0019 1 -42.11 NaN 265.16 265.15
ANDI 12/28/2017 0.0017 1 370.59 NaN 370.59 370.59
2/14/2018 0.0324 1 20.00 NaN 390.59 390.59
APPZ 9/22/2017 0.0002 1 -50.00 NaN -50.00 -50.00
12/5/2017 0.0001 1 -100.00 NaN -150.00 -150.00
12/6/2017 0.0001 1 0.00 NaN -150.00 -150.00
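For the original MultiIndexed frame (indexed by Stock and Date), the same result can be assigned straight back onto df3 by grouping on the index level; a minimal sketch, assuming df3 from the question:
df3['Stock_RES'] = df3.groupby(level=0)['RETURN'].cumsum()  # running sum per stock, aligned row-for-row
The reason the original attempt produced NaN is that sum() collapses each group to a single value indexed only by Stock, which no longer aligns with the (Stock, Date) row index, whereas cumsum() returns one value per row.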

Related

Is there a way to calculate the values in between two non-NaN values?

I'm having trouble writing code to find the in-between values in a pandas dataframe.
The dataframe:
value
30
NaN
NaN
25
NaN
20
NaN
NaN
NaN
NaN
15
...
The formula is like this:
value_before_NaN - ((value_before_NaN - value_after_NaN) / number_of_NaNs_in_between)
An example of the expected values looks like this:
30 - (30-25)/2 = 27.5
27.5 - (27.5-25)/1 = 25
So the expected dataframe will look like this:
value   expected value
30      30
NaN     27.5
NaN     25
25      25
NaN     20
20      20
NaN     18.75
NaN     17.5
NaN     16.25
NaN     15
15      15
...     ...
IIUC, you can generalize your formula into two parts:
Any NaN right before a non-NaN is just the same as that number:
{value-before-nan} - ({value-before-nan} - {value-after-nan})/1 = {value-after-nan}
The rest of the NaNs are linear interpolation.
So you can use bfill with interpolate:
df.bfill(limit=1).interpolate()
Output:
value
0 30.00
1 27.50
2 25.00
3 25.00
4 20.00
5 20.00
6 18.75
7 17.50
8 16.25
9 15.00
10 15.00
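A self-contained version of the same approach, assuming a single column named value built from the numbers in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [30, np.nan, np.nan, 25, np.nan, 20,
                             np.nan, np.nan, np.nan, np.nan, 15]})
# bfill(limit=1) copies each number one step back onto the NaN directly before it,
# then interpolate() fills the remaining NaNs linearly between the known values.
df['expected value'] = df['value'].bfill(limit=1).interpolate()
print(df)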

Resampling in Pandas is thrown off by daylight savings time

I have a dataset that marks the occurrences of an event every minute for four years. Here's a sample:
In [547]: result
Out[547]:
uuid timestamp col1 col2 col3
0 100 2016-03-30 00:00:00+02:00 NaN NaN NaN
1 100 2016-03-30 00:01:00+02:00 NaN NaN NaN
2 100 2016-03-30 00:02:00+02:00 NaN NaN NaN
3 100 2016-03-30 00:03:00+02:00 1.49 1.79 0.979
4 100 2016-03-30 00:04:00+02:00 NaN NaN NaN
... ... ... .. ...
1435 100 2016-03-30 23:55:00+02:00 NaN NaN NaN
1436 100 2016-03-30 23:56:00+02:00 1.39 2.19 1.09
1437 100 2016-03-30 23:57:00+02:00 NaN NaN NaN
1438 100 2016-03-30 23:58:00+02:00 NaN NaN NaN
1439 100 2016-03-30 23:59:00+02:00 NaN NaN NaN
[1440 rows x 5 columns]
I am trying to get summary statistics every time there is a non-blank row and get these statistics for every six hours. To do this, the resample() function works great. Here's a sample:
In [548]: result = result.set_index('timestamp').tz_convert('Europe/Berlin').resample('6h', label='right', closed='right', origin='start_day').agg(['mean', 'last', 'count']).iloc[:,-9:]
Out[548]:
col1_mean col1_last ... col3_last times_changed
timestamp ...
2016-03-30 00:00:00+02:00 NaN NaN ... NaN 0
2016-03-30 07:00:00+02:00 1.0690 1.069 ... 1.279 1
2016-03-30 13:00:00+02:00 1.0365 1.009 ... 1.239 4
2016-03-30 19:00:00+02:00 1.0150 0.989 ... 1.209 5
2016-03-30 01:00:00+02:00 1.1290 1.129 ... 1.329 1
[5 rows x 7 columns]
This looks great and is the format I'd like to work with. However, when I run my code on all data (spanning many years), here's an excerpt of what the output looks like:
In [549]: result
Out[549]:
col1_mean col1_last ... col3_last times_changed
timestamp ...
2016-03-27 00:00:00+01:00 NaN NaN ... NaN 0
2016-03-27 07:00:00+02:00 1.0690 1.069 ... 1.279 1
2016-03-27 13:00:00+02:00 1.0365 1.009 ... 1.239 4
2016-03-27 19:00:00+02:00 1.0150 0.989 ... 1.209 5
2016-03-28 01:00:00+02:00 1.1290 1.129 ... 1.329 1
[5 rows x 7 columns]
The new index takes DST into consideration and throws everything off by an hour. I would like the new times to still be between 0–6, 6–12 etc.
Is there a way to coerce my dataset to adhere to a 0–6, 6–12 format? If there's an extra hour, maybe the aggregations from that could still be tucked into the 0–6 range?
The timezone I'm working with is Europe/Berlin and I tried converting everything to UTC. However, values are not at their right date or time — for example, an occurrence at 00:15hrs would be 23:15hrs the previous day, which throws off those summary statistics.
Are there any creative solutions to fix this?
Have you tried this? I think it should work.
(It first converts to the local timezone, then strips the timezone info with .tz_localize(None).)
result = result.set_index('timestamp').tz_convert('Europe/Berlin').tz_localize(None).resample('6h', label='right', closed='right', origin='start_day').agg(['mean', 'last', 'count']).iloc[:,-9:]
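A minimal sketch of the idea on made-up minute data crossing the spring clock change (the column name and random values are assumptions, not the original dataset):
import numpy as np
import pandas as pd

# Minute stamps spanning the 2016-03-27 DST switch in Europe/Berlin
idx = pd.date_range('2016-03-26', '2016-03-28', freq='min', tz='Europe/Berlin')
df = pd.DataFrame({'col1': np.random.rand(len(idx))}, index=idx)

# Dropping the timezone keeps the bins on local wall-clock boundaries (0-6, 6-12, ...),
# so the resampled index is no longer shifted by the missing hour.
out = (df.tz_localize(None)
         .resample('6h', label='right', closed='right', origin='start_day')
         .agg(['mean', 'last', 'count']))
print(out)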

Regex in pandas filter the columns with ^

I am working with pandas and want to filter the columns with a regex. It returns something when I change the regex to rf"{c}(\.)?(\d)*", but as soon as I want it to start with a certain letter it breaks and the filtered dataframe is empty.
for c in self.variables.split():
    reg = rf"^{c}(\.)?(\d)*$"
    print(reg)
    filtered = self.raw_data.filter(regex=reg)
What did I do wrong, and how can I fix it?
PS: This is a sample of the data:
variable T T.1 T.2 T.3 T.4 ... T.8 T.9 l phi dl
0 29.63 27.87 26.95 26.64 26.25 ... 23.3 22.42 2.141 0.093551 0.002
1 29.70 NaN NaN NaN NaN ... NaN NaN 2.043 0.098052 0.002
2 29.62 NaN NaN NaN NaN ... NaN NaN 1.892 0.089973 0.002
3 29.65 NaN NaN NaN NaN ... NaN NaN 1.828 0.093132 0.002
And I would like it to return 4 dfs each only containing the data of a specific variable e.g.
variable T T.1 T.2 T.3 T.4 T.5 T.6 T.7 T.8 T.9
0 29.63 27.87 26.95 26.64 26.25 25.62 24.99 23.85 23.3 22.42
1 29.70 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 29.62 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 29.65 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 29.38 NaN NaN NaN NaN NaN NaN NaN NaN NaN
or only l without the dl (this is why I thought I needed to use ^ in my regex):
variable l
0 2.141
1 2.043
2 1.892
3 1.828
Thx in advance dear community
Details
variable - matches the literal string variable
| - logical OR, since you want the variable column kept alongside every other match
^ - start of a string
{c} - followed by an f-string substitution of the desired variable
(\.\d+)? - an optional sequence of a literal . followed by one or more digits
$ - end of string.
import pandas as pd
df = pd.read_csv("sample.csv", sep=r'\s+')
print(df)
variables = ['T', 'l', 'phi', 'dl']
for c in variables:
    ds = df.filter(regex=rf"variable|^{c}(\.\d+)?$")
    print(f'\n---Variable: [{c}] ---')
    print(ds)
---Variable: [T] ---
variable T T.1 T.2 T.3 T.4 T.5 T.6 T.7 T.8 T.9
0 0 29.63 27.87 26.95 26.64 26.25 25.62 24.99 23.85 23.3 22.42
1 1 29.70 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2 29.62 NaN NaN NaN NaN NaN NaN NaN NaN NaN
...
---Variable: [l] ---
variable l
0 0 2.141
1 1 2.043
2 2 1.892
...
---Variable: [phi] ---
variable phi
0 0 0.093551
1 1 0.098052
2 2 0.089973
...
---Variable: [dl] ---
variable dl
0 0 0.002
1 1 0.002
2 2 0.002
...
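If the goal is four separate dataframes, the filtered frames can be kept as they are produced, for example in a dict keyed by the variable name (a small sketch reusing df and variables from above):
frames = {c: df.filter(regex=rf"variable|^{c}(\.\d+)?$") for c in variables}
frames['l']   # only the l column (plus variable); dl is excluded by the ^...$ anchors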

How to get max of a slice of a dataframe based on column values?

I'm looking to make a new column, MaxPriceBetweenEntries, based on the max() of a slice of the dataframe:
idx Price EntryBar ExitBar
0 10.00 0 1
1 11.00 NaN NaN
2 10.15 2 4
3 12.14 NaN NaN
4 10.30 NaN NaN
turned into
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 10.00 0 1 11.00
1 11.00 NaN NaN NaN
2 10.15 2 4 12.14
3 12.14 NaN NaN NaN
4 10.30 NaN NaN NaN
I can get all the rows with an EntryBar or ExitBar value with df.loc[df["EntryBar"].notnull()] and df.loc[df["ExitBar"].notnull()], but I can't use that to set a new column:
df.loc[df["EntryBar"].notnull(),"MaxPriceBetweenEntries"] = df.loc[df["EntryBar"]:df["ExitBar"]]["Price"].max()
but that's effectively a guess at this point, because nothing I'm trying works. Ideally the solution wouldn't involve a loop directly because there may be millions of rows.
You can group by the cumulative sum of non-null entries and take the max, using np.where() to apply the result only to the non-null rows:
import numpy as np

df['MaxPriceBetweenEntries'] = np.where(df['EntryBar'].notnull(),
                                        df.groupby(df['EntryBar'].notnull().cumsum())['Price'].transform('max'),
                                        np.nan)
df
Out[1]:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
Let's try groupby() and where:
s = df['EntryBar'].notna()
df['MaxPriceBetweenEntries'] = df.groupby(s.cumsum())['Price'].transform('max').where(s)
Output:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
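For completeness, a reproducible version of the groupby()/where() approach above, with the sample frame from the question built inline:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Price':    [10.00, 11.00, 10.15, 12.14, 10.30],
                   'EntryBar': [0, np.nan, 2, np.nan, np.nan],
                   'ExitBar':  [1, np.nan, 4, np.nan, np.nan]})

# Each non-null EntryBar starts a new group; cumsum() turns that into a running
# group id, transform('max') broadcasts the group maximum back to every row, and
# where(s) keeps the value only on the rows that open an entry.
s = df['EntryBar'].notna()
df['MaxPriceBetweenEntries'] = df.groupby(s.cumsum())['Price'].transform('max').where(s)
print(df)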
You can forward fill the null values, group by entry and get the max of that group's Price. Use that as the right side of a left join and you should be in business.
df.merge(df.ffill().groupby('EntryBar')['Price'].max().reset_index(name='MaxPriceBetweenEntries'),
         on='EntryBar',
         how='left')
Try
df.loc[df['ExitBar'].notna(),'Max']=df.groupby(df['ExitBar'].ffill()).Price.max().values
df
Out[74]:
idx Price EntryBar ExitBar Max
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN

Nuances of merging multiple pandas dataframes (3+) on a key column

First question here and a long one - there are a couple of things I am struggling with regarding merging and formatting my dataframes. I have some half-working solutions, but I am unsure whether they are the best possible for what I want.
Here are the standard formats of the dataframes I am merging with pandas.
df1 =
RT %Area RRT
0 4.83 5.257 0.509
1 6.76 0.424 0.712
2 7.27 0.495 0.766
3 7.70 0.257 0.811
4 7.79 0.122 0.821
5 9.49 92.763 1.000
6 11.40 0.681 1.201
df2=
RT %Area RRT
0 4.83 0.731 0.508
1 6.74 1.243 0.709
2 7.28 0.109 0.766
3 7.71 0.287 0.812
4 7.79 0.177 0.820
5 9.50 95.824 1.000
6 11.31 0.348 1.191
7 11.40 1.166 1.200
8 12.09 0.113 1.273
df3 = ...
Currently I am using a reduce operation on pd.merge_ordered() like below to merge my dataframes (3+). This kind of yields what I want and was from a previous question (pandas three-way joining multiple dataframes on columns). I am merging on RRT, and want the indexes with the same RRT values to be placed on the same row - and if the RRT values are unique for that dataset I want a NaN for missing data from other datasets.
# The for loop I use to generate the list of formatted dataframes prior to merging
dfs = []
for entry in os.scandir(directory):
    if entry.path.endswith(".csv") and entry.is_file():
        entry = pd.read_csv(entry.path, header=None)
        # Block of formatting code removed
        dfs.append(entry.round(2))
dfs = [df1ar,df2ar,df3ar]
df_final = reduce(lambda left,right: pd.merge_ordered(left,right,on='RRT'), dfs)
cols = ['RRT', 'RT_x', '%Area_x', 'RT_y', '%Area_y', 'RT', '%Area']
df_final = df_final[cols]
print(df_final)
RRT RT_x %Area_x RT_y %Area_y RT %Area
0 0.508 NaN NaN 4.83 0.731 NaN NaN
1 0.509 4.83 5.257 NaN NaN 4.83 5.257
2 0.709 NaN NaN 6.74 1.243 NaN NaN
3 0.712 6.76 0.424 NaN NaN 6.76 0.424
4 0.766 7.27 0.495 7.28 0.109 7.27 0.495
5 0.811 7.70 0.257 NaN NaN 7.70 0.257
6 0.812 NaN NaN 7.71 0.287 NaN NaN
7 0.820 NaN NaN 7.79 0.177 NaN NaN
8 0.821 7.79 0.122 NaN NaN 7.79 0.122
9 1.000 9.49 92.763 9.50 95.824 9.49 92.763
10 1.191 NaN NaN 11.31 0.348 NaN NaN
11 1.200 NaN NaN 11.40 1.166 NaN NaN
12 1.201 11.40 0.681 NaN NaN 11.40 0.681
13 1.273 NaN NaN 12.09 0.113 NaN NaN
This works, but:
Can I insert a multiindex based on the filename of the dataframe that the data came from, and place it above the corresponding columns? Like the suffix option, but related back to the filename and for more than two sets of data. Is this better done prior to merging, and if so, how do I do it? (I've included the for loop I use to create the list of tables prior to merging.)
Is this reduced merge_ordered the simplest way of doing this?
Can I do a similar merge with pd.merge_asof() and use the tolerance value to fine tune the merging based on the similarities between the RRT values? That is, can it be done without cutting off data from the longer dataframes?
I've tried the above and searched for answers, but I'm struggling to find the most efficient way to do everything I want.
concat = pd.concat(dfs, axis=1, keys=['A','B','C'])
concat_final = concat.round(3)
print(concat_final)
A B C
RT %Area RRT RT %Area RRT RT %Area RRT
0 4.83 5.257 0.509 4.83 0.731 0.508 4.83 5.257 0.509
1 6.76 0.424 0.712 6.74 1.243 0.709 6.76 0.424 0.712
2 7.27 0.495 0.766 7.28 0.109 0.766 7.27 0.495 0.766
3 7.70 0.257 0.811 7.71 0.287 0.812 7.70 0.257 0.811
4 7.79 0.122 0.821 7.79 0.177 0.820 7.79 0.122 0.821
5 9.49 92.763 1.000 9.50 95.824 1.000 9.49 92.763 1.000
6 11.40 0.681 1.201 11.31 0.348 1.191 11.40 0.681 1.201
7 NaN NaN NaN 11.40 1.166 1.200 NaN NaN NaN
8 NaN NaN NaN 12.09 0.113 1.273 NaN NaN NaN
I have also tried this - and it gives me the multiindex denoting which file (A, B, C, just as placeholders) the data came from. However, it has obviously not merged based on the RRT value like I want.
Can I apply an operation to change this into a similar format to the pd.merge_ordered() format above? Would groupby() work?
Thanks!
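A minimal sketch of one way to get both things at once (alignment on RRT and a filename-based column level); the outer-join behaviour and the use of the file name as the key are assumptions, directory is the same folder as in the loop above, and the formatting block is the one elided in the question:
import os
import pandas as pd

dfs, keys = [], []
for entry in os.scandir(directory):
    if entry.path.endswith(".csv") and entry.is_file():
        frame = pd.read_csv(entry.path, header=None)
        # block of formatting code from the question, producing RT, %Area and RRT columns
        dfs.append(frame.set_index('RRT'))
        keys.append(os.path.splitext(entry.name)[0])  # file name (minus extension) as the key

# With RRT as the index, an axis=1 concat aligns rows on RRT much like the
# merge_ordered result (outer join, NaN where a file has no matching RRT), and
# keys= puts the file name on the top level of the column MultiIndex.
merged = pd.concat(dfs, axis=1, keys=keys).sort_index().round(3).reset_index()
print(merged)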
