Pandas: create new columns based on an existing column, with a repeating count - python

It's a bit complicated to explain, so I'll do my best. I have a pandas DataFrame with two columns: hour (from 1 to 24) and value (corresponding to each hour). The index is huge, but the hour column keeps repeating on a 24-hour basis (from 1 to 24). I am trying to create 24 new columns: value -1, value -2, value -3 ... value -24, where for each row the value -n column holds the value from n hours earlier (taken from the rows above).
hour | value | value -1 | value -2 | value -3 | ... | value -24
  1  |  10   |     0    |     0    |     0    | ... |     0
  2  |  11   |    10    |     0    |     0    | ... |     0
  3  |  12   |    11    |    10    |     0    | ... |     0
  4  |  13   |    12    |    11    |    10    | ... |     0
 ...
 24  |  32   |    31    |    30    |    29    | ... |     0
  1  |  33   |    32    |    31    |    30    | ... |    10
  2  |  34   |    33    |    32    |    31    | ... |    11
and so on...
All the numbers above are just example values. As I said, there are lots of rows: not only 24 rows for a single day, but the whole time series with hour running from 1 to 24 again and again.
Thanks in advance and may the force be with you!

Is this what you need?
import pandas as pd

df = pd.DataFrame([[1, 10], [2, 11],
                   [3, 12], [4, 13]], columns=['hour', 'value'])
for i in range(1, 25):  # lags 1 through 24 inclusive
    df['value -' + str(i)] = df['value'].shift(i).fillna(0)
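As a side note, a dict comprehension plus a single pd.concat builds all 24 lag columns in one shot instead of inserting them one at a time (a sketch under the same assumptions as the code above):
lags = {'value -' + str(i): df['value'].shift(i).fillna(0) for i in range(1, 25)}
df = pd.concat([df, pd.DataFrame(lags)], axis=1)  # all lag columns added at once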
Is this what you are looking for?
import pandas as pd

df = pd.DataFrame({'hour': list(range(24))*2,
                   'value': list(range(48))})

shift_cols_n = 10
for shift in range(1, shift_cols_n):
    new_column_name = 'value - ' + str(shift)
    # Assuming that you don't have any NAs in your dataframe
    df[new_column_name] = df['value'].shift(shift).fillna(0)
    # A safer (and less simple) way, in case you do have NAs in your dataframe
    df[new_column_name] = df['value'].shift(shift)
    df.loc[:shift - 1, new_column_name] = 0  # .loc slicing is label-inclusive
print(df.head(9))
hour value value - 1 value - 2 value - 3 value - 4 value - 5 \
0 0 0 0.0 0.0 0.0 0.0 0.0
1 1 1 0.0 0.0 0.0 0.0 0.0
2 2 2 1.0 0.0 0.0 0.0 0.0
3 3 3 2.0 1.0 0.0 0.0 0.0
4 4 4 3.0 2.0 1.0 0.0 0.0
5 5 5 4.0 3.0 2.0 1.0 0.0
6 6 6 5.0 4.0 3.0 2.0 1.0
7 7 7 6.0 5.0 4.0 3.0 2.0
8 8 8 7.0 6.0 5.0 4.0 3.0
value - 6 value - 7 value - 8 value - 9
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0
7 1.0 0.0 0.0 0.0
8 2.0 1.0 0.0 0.0

Related

Python pandas: how to pick up certain values by internal numbering?

I have a dataframe that looks like this:
Answers all_answers Score
0 0.0 0 72
1 0.0 0 73
2 0.0 0 74
3 1.0 1 1
4 -1.0 1 2
5 1.0 1 3
6 -1.0 1 4
7 1.0 1 5
8 0.0 0 1
9 0.0 0 2
10 -1.0 1 1
11 0.0 0 1
12 0.0 0 2
13 1.0 1 1
14 0.0 0 1
15 0.0 0 2
16 1.0 1 1
The first column is a signal whose sign changes over the course of the calculation.
The second one is the same signal with the minus sign removed.
The third is a running count for the second column: how many ones or zeros have occurred in a row.
I want to add a fourth column that keeps only the ones that occurred in a row at least, say, 5 times, while preserving the sign from the first column.
To get something like this:
Answers all_answers Score New
0 0.0 0 72 0
1 0.0 0 73 0
2 0.0 0 74 0
3 1.0 1 1 1
4 -1.0 1 2 -1
5 1.0 1 3 1
6 -1.0 1 4 -1
7 1.0 1 5 1
8 0.0 0 1 0
9 0.0 0 2 0
10 -1.0 1 1 0
11 0.0 0 1 0
12 0.0 0 2 0
13 1.0 1 1 0
14 0.0 0 1 0
15 0.0 0 2 0
16 1.0 1 1 0
17 0.0 0 1 0
Is it possible to do this with Pandas?
You can use:
# group by consecutive 0/1
g = df['all_answers'].ne(df['all_answers'].shift()).cumsum()
# get size of each group and compare to threshold
m = df.groupby(g)['all_answers'].transform('size').ge(5)
# mask small groups
df['New'] = df['Answers'].where(m, 0)
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0
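The first line there is the standard run-length grouping idiom: comparing each element to its predecessor flags the start of every run, and the cumulative sum turns those flags into group ids. A tiny sketch on a made-up mini-series (not the question's data):
import pandas as pd
s = pd.Series([0, 0, 1, 1, 1, 0, 1])
g = s.ne(s.shift()).cumsum()   # flag each run start, then number the runs
print(g.tolist())              # [1, 1, 2, 2, 2, 3, 4] -> one id per run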
A faster way (with regex):
import pandas as pd
import re

def repl5(m):
    return '5' * len(m.group())

# Note: this trick relies on every value in all_answers rendering as a single
# character ('0' or '1') when the column is joined into one string.
s = df['all_answers'].astype(str).str.cat()
d = re.sub('(?:1{5,})', repl5, s)   # rewrite runs of five or more ones as '5's
d = [x == '5' for x in list(d)]     # per-row boolean mask
df['New'] = df['Answers'].where(d, 0.0)
df
Output:
Answers all_answers Score New
0 0.0 0 72 0.0
1 0.0 0 73 0.0
2 0.0 0 74 0.0
3 1.0 1 1 1.0
4 -1.0 1 2 -1.0
5 1.0 1 3 1.0
6 -1.0 1 4 -1.0
7 1.0 1 5 1.0
8 0.0 0 1 0.0
9 0.0 0 2 0.0
10 -1.0 1 1 0.0
11 0.0 0 1 0.0
12 0.0 0 2 0.0
13 1.0 1 1 0.0
14 0.0 0 1 0.0
15 0.0 0 2 0.0
16 1.0 1 1 0.0

Sklearn's RobustScaler doesn't scale certain columns at all

I have the following Pandas DF:
A B
0 0.0 114422.0
1 99997.0 174382.0
2 0.0 24863.0
3 0.0 91559.0
4 0.0 94248.0
5 0.0 66020.0
6 0.0 61543.0
7 0.0 69267.0
8 0.0 6253.0
9 0.0 93002.0
10 0.0 13891.0
11 0.0 49261.0
12 0.0 20050.0
13 0.0 24710.0
14 0.0 10034.0
15 0.0 24508.0
16 0.0 18249.0
17 0.0 50646.0
18 0.0 150033.0
19 0.0 68424.0
20 0.0 125526.0
21 0.0 110526.0
22 40000.0 217450.0
23 0.0 75543.0
24 145000.0 305310.0
25 12000.0 98583.0
26 0.0 262202.0
27 0.0 277680.0
28 0.0 101420.0
29 0.0 109480.0
30 0.0 65230.0
which I tried to normalize (column-wise) with scikit-learn's RobustScaler:
from sklearn.preprocessing import RobustScaler

array_scaled = RobustScaler().fit_transform(df)
df_scaled = pd.DataFrame(array_scaled, columns=df.columns)
However, in the resulting df_scaled the first column has not been scaled (or changed) at all:
A B
0 0.0 0.515555
1 99997.0 1.310653
2 0.0 -0.672042
3 0.0 0.212380
4 0.0 0.248037
5 0.0 -0.126280
6 0.0 -0.185647
7 0.0 -0.083223
8 0.0 -0.918819
9 0.0 0.231515
10 0.0 -0.817536
11 0.0 -0.348512
12 0.0 -0.735864
13 0.0 -0.674070
14 0.0 -0.868681
15 0.0 -0.676749
16 0.0 -0.759746
17 0.0 -0.330146
18 0.0 0.987774
19 0.0 -0.094401
20 0.0 0.662799
21 0.0 0.463892
22 40000.0 1.881756
23 0.0 0.000000
24 145000.0 3.046823
25 12000.0 0.305522
26 0.0 2.475190
27 0.0 2.680435
28 0.0 0.343142
29 0.0 0.450021
30 0.0 -0.136755
I do not understand this. I expected column A to be scaled (and centered) by the interquartile range as well, like column B. What is the explanation here?
Your middle 50% of values in A are all zero, so the IQR and the overall median are both zero. Subtracting a zero median changes nothing, and since scikit-learn replaces a zero quantile-range scale with 1 to avoid division by zero, the scaling step changes nothing either - the column passes through untouched.
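A quick way to see it (a sketch, assuming the df from the question):
import numpy as np
q1, med, q3 = np.percentile(df['A'], [25, 50, 75])
print(med, q3 - q1)   # 0.0 0.0 -- more than half of column A is zero
# RobustScaler computes (x - median) / (q3 - q1); with a zero range it
# divides by 1 instead, so (x - 0) / 1 == x.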

Pandas apply function to column taking the value of previous column

I have to create a time series using column values for computing the Recency of a customer.
The formula I have to use is R(t) = 0 if the customer has bought something in that month, and R(t-1) + 1 otherwise.
I managed to compute this dataframe:
CustomerID -1 0 1 2 3 4 5 6 7 8 9 10 11 12
0 17850 0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1 13047 0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0
2 12583 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 14688 0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 15311 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3750 15471 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3751 13436 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3752 15520 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3753 14569 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3754 12713 0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
In which there is a 0 if the customer bought something in that month and a 1 otherwise. The column names indicate time periods, with column "-1" as a dummy column.
How can I replace the value in each column with 0 if the current value is 0 and with the value of the previous column + 1 otherwise?
For example, the final result for the second customer should be 0 1 0 0 1 0 0 1 0 1 0 1 2
I know how to apply a function to a column, but I don't know how to make that function use the value from the previous column.
Just use the apply function to iterate over the rows of the dataframe and do the manipulation. Note that the running value has to be read from the already-computed output (out[-1]), not from the raw previous cell of the row, otherwise runs of three or more break:
def apply_function(row):
    out = []
    for i, item in enumerate(row):
        # keep the CustomerID cell; reset to 0 on a purchase (0 = bought); else previous + 1
        out.append(item if i == 0 else 0 if item == 0 else out[-1] + 1)
    return out
new_df = df.apply(apply_function, axis=1, result_type='expand')
new_df.columns = df.columns  # just to restore the previous column names
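For the sample in the question, this reproduces the expected second row: 13047, then 0 for the dummy column and 0 1 0 0 1 0 0 1 0 1 0 1 2 for columns 0 through 12.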
Do you insist on using the column structure? It is common with time series to use rows instead, e.g., a dataframe with columns CustomerID and hasBoughtThisMonth. You can then easily add the Recency column by using a pandas transform().
I cannot yet place comments, hence the question in this way.
Edit: here is another way to go about it. I took two customers as an example, with some random numbers for whether or not they bought something in a given month.
Basically, you pivot your table and use a groupby + cumsum to get your result. Notice that this avoids your dummy column.
import pandas as pd
import numpy as np
np.random.seed(1)

# Make example dataframe
df = pd.DataFrame({'CustomerID': [1]*12 + [2]*12,
                   'Month': [1,2,3,4,5,6,7,8,9,10,11,12]*2,
                   'hasBoughtThisMonth': np.random.randint(2, size=24)})

# Make Recency column by finding contiguous groups of ones, and groupby
contiguous_groups = df['hasBoughtThisMonth'].diff().ne(0).cumsum()
df['Recency'] = df.groupby(by=['CustomerID', contiguous_groups],
                           as_index=False)['hasBoughtThisMonth'].cumsum().reset_index(drop=True)
The result is
CustomerID Month hasBoughtThisMonth Recency
0 1 1 1 1
1 1 2 1 2
2 1 3 0 0
3 1 4 0 0
4 1 5 1 1
5 1 6 1 2
6 1 7 1 3
7 1 8 1 4
8 1 9 1 5
9 1 10 0 0
10 1 11 0 0
11 1 12 1 1
12 2 1 0 0
13 2 2 1 1
14 2 3 1 2
15 2 4 0 0
16 2 5 0 0
17 2 6 1 1
18 2 7 0 0
19 2 8 0 0
20 2 9 0 0
21 2 10 1 1
22 2 11 0 0
23 2 12 0 0
It would be easier if you first set CustomerID as the index and transpose your dataframe, then apply your custom function, i.e. something like:
df.T.apply(custom_func)
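A minimal sketch of that idea (custom_func is not defined in the answer; the version below is a hypothetical implementation of the question's formula, R(t) = 0 on a purchase and R(t-1) + 1 otherwise):
import pandas as pd

def custom_func(col):
    # col is one customer's time series after the transpose (0 = bought that month)
    out, prev = [], 0
    for v in col:
        prev = 0 if v == 0 else prev + 1
        out.append(prev)
    return pd.Series(out, index=col.index)

recency = df.set_index('CustomerID').T.apply(custom_func).T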

Calculating RSI in Python

I am trying to calculate RSI on a dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({"Close": [100,101,102,103,104,105,106,105,103,102,103,104,103,105,106,107,108,106,105,107,109]})
df["Change"] = df["Close"].diff()
df["Gain"] = np.where(df["Change"] > 0, df["Change"], 0)
df["Loss"] = np.where(df["Change"] < 0, abs(df["Change"]), 0)
df["Index"] = [x for x in range(len(df))]
print(df)
Close Change Gain Loss Index
0 100 NaN 0.0 0.0 0
1 101 1.0 1.0 0.0 1
2 102 1.0 1.0 0.0 2
3 103 1.0 1.0 0.0 3
4 104 1.0 1.0 0.0 4
5 105 1.0 1.0 0.0 5
6 106 1.0 1.0 0.0 6
7 105 -1.0 0.0 1.0 7
8 103 -2.0 0.0 2.0 8
9 102 -1.0 0.0 1.0 9
10 103 1.0 1.0 0.0 10
11 104 1.0 1.0 0.0 11
12 103 -1.0 0.0 1.0 12
13 105 2.0 2.0 0.0 13
14 106 1.0 1.0 0.0 14
15 107 1.0 1.0 0.0 15
16 108 1.0 1.0 0.0 16
17 106 -2.0 0.0 2.0 17
18 105 -1.0 0.0 1.0 18
19 107 2.0 2.0 0.0 19
20 109 2.0 2.0 0.0 20
RSI_length = 7
Now I am stuck calculating "Avg Gain". The logic here is: the first average gain, at index 6, is the plain mean of "Gain" over the first RSI_length periods. Every subsequent "Avg Gain" should be
(Previous Avg Gain * (RSI_length - 1) + Gain) / RSI_length
I tried the following, but it does not work as expected:
df["Avg Gain"] = np.nan
df["Avg Gain"] = np.where(df["Index"]==(RSI_length-1),df["Gain"].rolling(window=RSI_length).mean(),\
np.where(df["Index"]>(RSI_length-1),(df["Avg Gain"].iloc[df["Index"]-1]*(RSI_length-1)+df["Gain"]) / RSI_length,np.nan))
The output of this code is:
print(df)
Close Change Gain Loss Index Avg Gain
0 100 NaN 0.0 0.0 0 NaN
1 101 1.0 1.0 0.0 1 NaN
2 102 1.0 1.0 0.0 2 NaN
3 103 1.0 1.0 0.0 3 NaN
4 104 1.0 1.0 0.0 4 NaN
5 105 1.0 1.0 0.0 5 NaN
6 106 1.0 1.0 0.0 6 0.857143
7 105 -1.0 0.0 1.0 7 NaN
8 103 -2.0 0.0 2.0 8 NaN
9 102 -1.0 0.0 1.0 9 NaN
10 103 1.0 1.0 0.0 10 NaN
11 104 1.0 1.0 0.0 11 NaN
12 103 -1.0 0.0 1.0 12 NaN
13 105 2.0 2.0 0.0 13 NaN
14 106 1.0 1.0 0.0 14 NaN
15 107 1.0 1.0 0.0 15 NaN
16 108 1.0 1.0 0.0 16 NaN
17 106 -2.0 0.0 2.0 17 NaN
18 105 -1.0 0.0 1.0 18 NaN
19 107 2.0 2.0 0.0 19 NaN
20 109 2.0 2.0 0.0 20 NaN
Desired output is:
Close Change Gain Loss Index Avg Gain
0 100 NaN 0 0 0 NaN
1 101 1.0 1 0 1 NaN
2 102 1.0 1 0 2 NaN
3 103 1.0 1 0 3 NaN
4 104 1.0 1 0 4 NaN
5 105 1.0 1 0 5 NaN
6 106 1.0 1 0 6 0.857143
7 105 -1.0 0 1 7 0.734694
8 103 -2.0 0 2 8 0.629738
9 102 -1.0 0 1 9 0.539775
10 103 1.0 1 0 10 0.605522
11 104 1.0 1 0 11 0.661876
12 103 -1.0 0 1 12 0.567322
13 105 2.0 2 0 13 0.771990
14 106 1.0 1 0 14 0.804563
15 107 1.0 1 0 15 0.832483
16 108 1.0 1 0 16 0.856414
17 106 -2.0 0 2 17 0.734069
18 105 -1.0 0 1 18 0.629202
19 107 2.0 2 0 19 0.825030
20 109 2.0 2 0 20 0.992883
Here's an implementation of your formula. (The np.where attempt cannot work: the recurrence needs the previously computed "Avg Gain", but np.where evaluates the whole "Avg Gain" column up front, while it is still all NaN.)
RSI_LENGTH = 7

rolling_gain = df["Gain"].rolling(RSI_LENGTH).mean()
df.loc[RSI_LENGTH - 1, "RSI"] = rolling_gain[RSI_LENGTH - 1]

for inx in range(RSI_LENGTH, len(df)):
    df.loc[inx, "RSI"] = (df.loc[inx - 1, "RSI"] * (RSI_LENGTH - 1) + df.loc[inx, "Gain"]) / RSI_LENGTH
The result is:
Close Change Gain Loss Index RSI
0 100 NaN 0.0 0.0 0 NaN
1 101 1.0 1.0 0.0 1 NaN
2 102 1.0 1.0 0.0 2 NaN
3 103 1.0 1.0 0.0 3 NaN
4 104 1.0 1.0 0.0 4 NaN
5 105 1.0 1.0 0.0 5 NaN
6 106 1.0 1.0 0.0 6 0.857143
7 105 -1.0 0.0 1.0 7 0.734694
8 103 -2.0 0.0 2.0 8 0.629738
9 102 -1.0 0.0 1.0 9 0.539775
10 103 1.0 1.0 0.0 10 0.605522
11 104 1.0 1.0 0.0 11 0.661876
12 103 -1.0 0.0 1.0 12 0.567322
13 105 2.0 2.0 0.0 13 0.771990
14 106 1.0 1.0 0.0 14 0.804563
15 107 1.0 1.0 0.0 15 0.832483
16 108 1.0 1.0 0.0 16 0.856414
17 106 -2.0 0.0 2.0 17 0.734069
18 105 -1.0 0.0 1.0 18 0.629202
19 107 2.0 2.0 0.0 19 0.825030
20 109 2.0 2.0 0.0 20 0.992883
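For longer frames the Python loop can be avoided: the recurrence is exactly an exponential moving average with alpha = 1/RSI_LENGTH, seeded with the simple mean of the first window. A vectorized sketch reusing df, pd and np from above (the column name "Avg Gain 2" is only for comparison):
seed = df["Gain"].iloc[:RSI_LENGTH].mean()   # plain mean of the first window
rest = df["Gain"].iloc[RSI_LENGTH:]          # gains after the seeded point
smoothed = pd.concat([pd.Series([seed]), rest]).ewm(alpha=1/RSI_LENGTH, adjust=False).mean()
df["Avg Gain 2"] = np.nan
df.loc[RSI_LENGTH - 1:, "Avg Gain 2"] = smoothed.values  # matches the loop's "RSI" column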

Filling the missing rows in pandas dataframe

import pandas as pd
import numpy as np

data = {'node1': [1, 1, 1, 2, 2, 5],
        'node2': [8, 16, 22, 5, 25, 10],
        'weight': [1, 1, 1, 1, 1, 1]}
df = pd.DataFrame(data, columns=['node1', 'node2', 'weight'])
df2 = df.assign(Cu=df.groupby('node1').cumcount()).set_index('Cu').groupby('node1') \
        .apply(lambda x: x['node2']).unstack('Cu').fillna(np.nan)
Output:
1 8.0 16.0 22.0
2 5.0 25.0 0.0
5 10.0 0.0 0.0
This is the output I am getting, but I require this output:
1 8 16 22
2 5 25 0
3 0 0 0
4 0 0 0
5 10 0 0
The rows that are missing from the data, like 3 and 4, should have their columns filled with zeros.
Here are a few ways of doing it.
Option 1
In [36]: idx = np.arange(df.node1.min(), df.node1.max()+1)
In [37]: df.groupby('node1')['node2'].apply(list).apply(pd.Series).reindex(idx).fillna(0)
Out[37]:
0 1 2
node1
1 8.0 16.0 22.0
2 5.0 25.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 10.0 0.0 0.0
Option 2
In [39]: (df.groupby('node1')['node2'].apply(lambda x: pd.Series(x.values))
            .unstack().reindex(idx).fillna(0))
Out[39]:
0 1 2
node1
1 8.0 16.0 22.0
2 5.0 25.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 10.0 0.0 0.0
Option 3
In [55]: pd.DataFrame.from_dict(
             {i: x.values for i, x in df.groupby('node1')['node2']},
             orient='index').reindex(idx).fillna(0)
Out[55]:
0 1 2
1 8.0 16.0 22.0
2 5.0 25.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 10.0 0.0 0.0
Measure the efficiency and readability, and choose based on your use case.
In [15]: idx = np.arange(df.node1.min(), df.node1.max()+1)

In [16]: df.pivot_table(index='node1',
                        columns=df.groupby('node1').cumcount(),
                        values='node2',
                        fill_value=0) \
            .reindex(idx) \
            .fillna(0)
Out[16]:
0 1 2
node1
1 8.0 16.0 22.0
2 5.0 25.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 10.0 0.0 0.0
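One small follow-up (a sketch reusing df and idx from above): the reindex introduces NaNs, which is why every output here shows floats; casting after the fill restores integers, assuming node2 is integer-valued to begin with.
result = (df.pivot_table(index='node1',
                         columns=df.groupby('node1').cumcount(),
                         values='node2',
                         fill_value=0)
            .reindex(idx)
            .fillna(0)
            .astype(int))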
