Using bins in pandas data frame - python

I am working on a data frame that has 4 columns in total, and I want to bin each column of that data frame iteratively into 8 equal parts. The bin number should be written to a separate column for each original column.
The code should also work if a different data frame with different column names is provided.
Here is the code I tried:
for c in df3.columns:
    df3['bucket_' + c] = (df3.max() - df3.min()) // 2 + 1
    buckets = pd.cut(df3['bucket_' + c], 8, labels=False)
A sample data frame and the expected output are shown below. The respective bin columns show the bin number assigned to each data point according to the range it falls into (using pd.cut to cut the column into 8 equal parts).
Thanks in advance!!
Sample data:
gp1_min gp2 gp3 gp4
17.39 23.19 28.99 44.93
0.74 1.12 3.35 39.78
12.63 13.16 13.68 15.26
72.76 73.92 75.42 94.35
77.09 84.14 74.89 89.87
73.24 75.72 77.28 92.3
78.63 84.35 64.89 89.31
65.59 65.95 66.49 92.43
76.79 83.93 75.89 89.73
57.78 57.78 2.22 71.11
99.9 99.1 100 100
100 100 40.963855 100
Expected output:
gp1_min gp2 gp3 gp4 bin_gp1 bin_gp2 bin_gp3 bin_gp4
17.39 23.19 28.99 44.93 2 2 2 3
0.74 1.12 3.35 39.78 1 1 1 3
12.63 13.16 13.68 15.26 1 2 2 2
72.76 73.92 75.42 94.35 5 6 6 7
77.09 84.14 74.89 89.87 6 7 6 7
73.24 75.72 77.28 92.3 6 6 6 7
78.63 84.35 64.89 89.31 6 7 5 7
65.59 65.95 66.49 92.43 5 6 5 7
76.79 83.93 75.89 89.73 6 7 6 7
57.78 57.78 2.22 71.11 4 4 1 6
99.9 99.1 100 100 8 8 8 8
100 100 40.96 100 8 8 3 8

I would use a couple of functions from numpy, namely np.linspace to make the bin boundaries and np.digitize to put the dataframe's values into bins:
import numpy as np

def binner(df, num_bins):
    for c in df.columns:
        # num_bins + 1 equally spaced edges between the column's min and max
        cbins = np.linspace(df[c].min(), df[c].max(), num_bins + 1)
        # np.digitize gives 1-based bin indices; the column maximum equals the last
        # edge and would land in bin num_bins + 1, so clip it back into the top bin
        df[c + '_binned'] = np.digitize(df[c], cbins).clip(1, num_bins)
    return df
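For reference, a minimal usage sketch on a shortened version of the sample data above (column names taken from the question); pd.cut(..., labels=False) gives the same 8 equal-width bins directly, just 0-based:
import pandas as pd

df3 = pd.DataFrame({
    "gp1_min": [17.39, 0.74, 12.63, 72.76, 100.0],
    "gp2":     [23.19, 1.12, 13.16, 73.92, 100.0],
})

binner(df3, 8)   # adds gp1_min_binned / gp2_binned columns

# equivalent per-column route with pd.cut (labels run 0-7, so add 1):
for c in ["gp1_min", "gp2"]:
    df3["bin_" + c] = pd.cut(df3[c], bins=8, labels=False) + 1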

Related

Pandas rolling mean with offset by (not continuously available) date

Given the following example table:
Index  Date        Weekday  Value
1      05/12/2022  2        10
2      06/12/2022  3        20
3      07/12/2022  4        40
4      09/12/2022  6        10
5      10/12/2022  7        60
6      11/12/2022  1        30
7      12/12/2022  2        40
8      13/12/2022  3        50
9      14/12/2022  4        60
10     16/12/2022  6        20
11     17/12/2022  7        50
12     18/12/2022  1        10
13     20/12/2022  3        20
14     21/12/2022  4        10
15     22/12/2022  5        40
I want to calculate a rolling average of the last three observations that are at least a week old. I cannot use .shift, as some dates are randomly missing, so .shift would not produce a reliable output.
Desired output example for last three rows in the example dataset:
Index 13: Avg of indices 8, 7, 6 = (30+40+50) / 3 = 40
Index 14: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
Index 15: Avg of indices 9, 8, 7 = (40+50+60) / 3 = 50
What would be a working solution for this? Thanks!
Mostly inspired by @Aidis's answer, you could make his solution an apply:
df['mean'] = df.apply(
    lambda y: df["Value"][df['Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(),
    axis=1,
)
or splitting the data at each call, which may run faster if you have lots of data (to be tested):
df['mean'] = df.apply(
    lambda y: df.loc[:y.name, "Value"][df.loc[:y.name, 'Date'] <= y['Date'] - pd.Timedelta(1, "W")].tail(3).mean(),
    axis=1,
)
which returns:
Index Date Weekday Value mean
0 1 2022-12-05 2 10 NaN
1 2 2022-12-06 3 20 NaN
2 3 2022-12-07 4 40 NaN
3 4 2022-12-09 6 10 NaN
4 5 2022-12-10 7 60 NaN
5 6 2022-12-11 1 30 NaN
6 7 2022-12-12 2 40 10.000000
7 8 2022-12-13 3 50 15.000000
8 9 2022-12-14 4 60 23.333333
9 10 2022-12-16 6 20 23.333333
10 11 2022-12-17 7 50 36.666667
11 12 2022-12-18 1 10 33.333333
12 13 2022-12-20 3 20 40.000000
13 14 2022-12-21 4 10 50.000000
14 15 2022-12-22 5 40 50.000000
I apologize for this ugly code. But it seems to work:
df = df.set_index("Index")
df['Date'] = df['Date'].astype("datetime64")
for id in df.index:
dfs = df.loc[:id]
mean = dfs["Value"][dfs['Date'] <= dfs.iloc[-1]['Date'] - pd.Timedelta(1, "W")].tail(3).mean()
print(id, mean)
Result:
1 nan
2 10.0
3 15.0
4 23.333333333333332
5 23.333333333333332
6 36.666666666666664
7 33.333333333333336
8 33.333333333333336
9 33.333333333333336
10 33.333333333333336
11 33.333333333333336
12 33.333333333333336
13 40.0
14 50.0
15 50.0
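If you want the values in a column rather than printed, here is a small sketch of the same loop under the same assumptions, collecting the means into a list and assigning them at the end:
means = []
for id in df.index:
    dfs = df.loc[:id]
    means.append(
        dfs["Value"][dfs['Date'] <= dfs.iloc[-1]['Date'] - pd.Timedelta(1, "W")]
        .tail(3)
        .mean()
    )
df['mean'] = means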

Data augmentation with pandas

I'm doing some data augmentation on my data.
Basically, it looks like this:
country. size. price. product
CA. 1. 3.99. 12
US. 1. 2.99. 12
BR. 1. 10.99. 13
Because the size is fixed to 1, I want to add 3 more sizes per country and per product and increase the price accordingly. So if the size is 2, then the price is the size-1 price times 2, and so on.
So basically, I'm looking for this:
country. size. price. product
CA. 1. 3.99. 12
CA. 2. 7.98. 12
CA. 3. 11.97. 12
CA. 4. 15.96. 12
US. 1. 2.99. 12
US. 2. 5.98. 12
US. 3. 8.97. 12
US. 4. 11.96. 12
BR. 1. 10.99. 13
BR. 2. 21.98. 13
BR. 3. 32.97. 13
BR. 4. 43.96. 13
What is a good way to do this with pandas?
I tried doing it in a loop with iterrows(), but that wasn't a fast solution for my data. Am I missing something?
Use Index.repeat to add the new rows, then aggregate with GroupBy.cumsum, add a counter with GroupBy.cumcount, and finally reset the index to get a default unique one:
df = df.loc[df.index.repeat(4)]
df['size'] = df.groupby(level=0).cumcount().add(1)
df['price'] = df.groupby(level=0)['price'].cumsum()
df = df.reset_index(drop=True)
print (df)
country size price product
0 CA 1 3.99 12
1 CA 2 7.98 12
2 CA 3 11.97 12
3 CA 4 15.96 12
4 US 1 2.99 12
5 US 2 5.98 12
6 US 3 8.97 12
7 US 4 11.96 12
8 BR 1 10.99 13
9 BR 2 21.98 13
10 BR 3 32.97 13
11 BR 4 43.96 13
Another idea without cumcount, but with numpy.tile:
import numpy as np

add = 3
df1 = df.loc[df.index.repeat(add + 1)]
df1['size'] = np.tile(range(1, add + 2), len(df))
df1['price'] = df1.groupby(level=0)['price'].cumsum()
df1 = df1.reset_index(drop=True)
print (df1)
country size price product
0 CA 1 3.99 12
1 CA 2 7.98 12
2 CA 3 11.97 12
3 CA 4 15.96 12
4 US 1 2.99 12
5 US 2 5.98 12
6 US 3 8.97 12
7 US 4 11.96 12
8 BR 1 10.99 13
9 BR 2 21.98 13
10 BR 3 32.97 13
11 BR 4 43.96 13
Construct 2 columns using assign and lambda:
import numpy as np

s = np.tile(np.arange(4), df.shape[0])
df_final = df.loc[df.index.repeat(4)].assign(size=lambda x: x['size'] + s,
                                             price=lambda x: x['price'] * (s + 1))
Out[90]:
country size price product
0 CA 1.0 3.99 12
0 CA 2.0 7.98 12
0 CA 3.0 11.97 12
0 CA 4.0 15.96 12
1 US 1.0 2.99 12
1 US 2.0 5.98 12
1 US 3.0 8.97 12
1 US 4.0 11.96 12
2 BR 1.0 10.99 13
2 BR 2.0 21.98 13
2 BR 3.0 32.97 13
2 BR 4.0 43.96 13
Since the size is always 1, you basically only need to multiply size and price by a constant factor. You can do this straightforwardly, write the result into a separate DataFrame, and then use pd.concat to join them together:
In [20]: df2 = pd.concat([df[['country.', 'product']], df[['size.', 'price.']] * 2], axis=1)
In [21]: pd.concat([df, df2])
Out[21]:
country. size. price. product
0 CA. 1.0 3.99 12
1 US. 1.0 2.99 12
2 BR. 1.0 10.99 13
0 CA. 2.0 7.98 12
1 US. 2.0 5.98 12
2 BR. 2.0 21.98 13
To augment some more, simply loop over all desired sizes:
In [22]: list_of_dfs = []
In [23]: list_of_dfs.append(df)
In [24]: for size in range(2,5):
...: list_of_dfs.append(pd.concat([df[['country.', 'product']], df[['size.', 'price.']] * size], axis=1))
...:
In [25]: pd.concat(list_of_dfs)
Out[25]:
country. size. price. product
0 CA. 1.0 3.99 12
1 US. 1.0 2.99 12
2 BR. 1.0 10.99 13
0 CA. 2.0 7.98 12
1 US. 2.0 5.98 12
2 BR. 2.0 21.98 13
0 CA. 3.0 11.97 12
1 US. 3.0 8.97 12
2 BR. 3.0 32.97 13
0 CA. 4.0 15.96 12
1 US. 4.0 11.96 12
2 BR. 4.0 43.96 13
This is a relatively naive approach, but should work fine in your case and makes good use of vectorization under the hood.
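One hedged follow-up: the concatenated frame above is ordered by size rather than grouped by country as in the expected output. Assuming list_of_dfs was built exactly as shown, a stable sort on the repeated original index regroups the rows:
result = pd.concat(list_of_dfs).sort_index(kind="stable").reset_index(drop=True)
print(result)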

Subtracting Two Columns of DataFrames not giving expected result - Python, Pandas

I have two data frames, each with 672 rows of data.
I want to subtract the values in a column of one data frame from the values in a column of the other data frame. The result can either be a new data frame, or a series, it does not really matter to me. The size of the result should obviously be 672 rows or 672 values.
I have the code:
stock_returns = beta_portfolios_196307_201906.iloc[:,6] - \
fama_french_factors_196307_201906.iloc[:,4]
I also tried
stock_returns = beta_portfolios_196307_201906["Lo 10"] + \
fama_french_factors_196307_201906["RF"]
For both, the result is a Series of size (1116,), and most of the values in the series are NaN, with a few being numeric values.
Could someone please explain why this happening and how I can get the result I want?
Here is the .head() of my data frames:
beta_portfolios_196307_201906.head()
Date Lo 20 Qnt 2 Qnt 3 Qnt 4 ... Dec 6 Dec 7 Dec 8 Dec 9 Hi 10
0 196307 1.13 -0.08 -0.97 -0.94 ... -1.20 -0.49 -1.39 -1.94 -0.77
1 196308 3.66 4.77 6.46 6.23 ... 7.55 7.57 4.91 9.04 10.47
2 196309 -2.78 -0.76 -0.78 -0.81 ... -0.27 -0.63 -1.00 -1.92 -3.68
3 196310 0.74 3.56 2.03 5.70 ... 1.78 6.63 4.78 3.10 3.01
4 196311 -0.63 -0.26 -0.81 -0.92 ... -0.69 -1.32 -0.51 -0.20 0.52
[5 rows x 16 columns]
fama_french_factors_196307_201906.head()
Date Mkt-RF SMB HML RF
444 196307 -0.39 -0.56 -0.83 0.27
445 196308 5.07 -0.94 1.67 0.25
446 196309 -1.57 -0.30 0.18 0.27
447 196310 2.53 -0.54 -0.10 0.29
448 196311 -0.85 -1.13 1.71 0.27
One last thing I should add: At first, all of the values in both data frames were strings, so I had to convert the values to numeric values using:
beta_portfolios_196307_201906 = beta_portfolios_196307_201906.apply(pd.to_numeric, errors='coerce')
Let's explain the issue with an example of just 5 rows.
When both DataFrames, a and b, have the same indices, e.g.:
a:                  b:
   Lo 10  Xxx          RF  Yyy
0     10    1       0   9    1
1     20    1       1   8    1
2     30    1       2   7    1
3     40    1       3   6    1
4     50    1       4   5    1
The result of subtraction a['Lo 10'] - b['RF'] is:
0 1
1 12
2 23
3 34
4 45
dtype: int64
Rows of both DataFrames are aligned on the index and then corresponding
elements are subtracted.
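As a minimal sketch of that first case (the frames a and b are just the toy illustration above, rebuilt in code):
import pandas as pd

a = pd.DataFrame({"Lo 10": [10, 20, 30, 40, 50], "Xxx": 1})
b = pd.DataFrame({"RF": [9, 8, 7, 6, 5], "Yyy": 1})

# both frames share the default RangeIndex 0..4, so the rows pair up one-to-one
print(a["Lo 10"] - b["RF"])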
And now take a look at the case when b has some other indices, e.g.:
RF Yyy
0 9 1
1 8 1
2 7 1
8 6 1
9 5 1
i.e. the last 2 rows have indices 8 and 9, which are absent in a.
Then the result of the same subtraction is:
0 1.0
1 12.0
2 23.0
3 NaN
4 NaN
8 NaN
9 NaN
dtype: float64
i.e.:
- rows with indices 0, 1 and 2 are computed as before, since both DataFrames have these indices,
- but if some index is present in only one DataFrame, the result is NaN,
- and the number of rows in the result is bigger.
If you want to align both columns by position instead of by the index, you
can run a.reset_index()['Lo 10'] - b.reset_index()['RF'], getting the
result as in the first case.
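Applied to the frames from the question, that positional alignment would look roughly like this (frame and column names taken from the question; a sketch, not tested against the real data):
stock_returns = (
    beta_portfolios_196307_201906.reset_index(drop=True)["Lo 10"]
    - fama_french_factors_196307_201906.reset_index(drop=True)["RF"]
)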

Change rolling window size as it rolls

I have a pandas data frame like this:
>df
leg speed
1 10
1 11
1 12
1 13
1 12
1 15
1 19
1 12
2 10
2 10
2 12
2 15
2 19
2 11
: :
I want to make a new column roll_speed that takes a rolling average of the speed over the last 5 rows. But I want to put more detailed conditions on it:
Group by leg (it should not take into account the speed of rows in a different leg).
I want the rolling window to grow from 1 up to a maximum of 5 according to the available rows. For example, in leg == 1 the first row has only one row to calculate with, so the rolling speed should be 10/1 = 10. For the second row there are only two rows available, so the rolling speed should be (10+11)/2 = 10.5.
leg speed roll_speed
1 10 10 # 10/1
1 11 10.5 # (10+11)/2
1 12 11 # (10+11+12)/3
1 13 11.5 # (10+11+12+13)/4
1 12 11.6 # (10+11+12+13+12)/5
1 15 12.6 # (11+12+13+12+15)/5
1 19 14.2 # (12+13+12+15+19)/5
1 12 14.2 # (13+12+15+19+12)/5
2 10 10 # 10/1
2 10 10 # (10+10)/2
2 12 10.7 # (10+10+12)/3
2 15 11.8 # (10+10+12+15)/4
2 19 13.2 # (10+10+12+15+19)/5
2 11 13.4 # (10+12+15+19+11)/5
: :
My attempt:
df['roll_speed'] = df.speed.rolling(5).mean()
But it just returns NaN for rows where fewer than five rows are available for the calculation. How should I solve this problem? Thank you for any help!
Set the parameter min_periods to 1:
df['roll_speed'] = df.groupby('leg').speed.rolling(5, min_periods = 1).mean()\
.round(1).reset_index(drop = True)
leg speed roll_speed
0 1 10 10.0
1 1 11 10.5
2 1 12 11.0
3 1 13 11.5
4 1 12 11.6
5 1 15 12.6
6 1 19 14.2
7 1 12 14.2
8 2 10 10.0
9 2 10 10.0
10 2 12 10.7
11 2 15 11.8
12 2 19 13.2
13 2 11 13.4
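A note on the assignment above: groupby(...).rolling(...) returns a Series with a (leg, original index) MultiIndex, so reset_index(drop=True) only lines up with df by position. A hedged variant that aligns on the original index instead is to drop just the group level:
roll = df.groupby('leg').speed.rolling(5, min_periods=1).mean().droplevel(0)
df['roll_speed'] = roll.round(1)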
Using rolling(5) will get you your results for all but the first 4 occurrences of each group. We can fill the remaining values with the expanding mean:
(df.groupby('leg').speed.rolling(5)
.mean().fillna(df.groupby('leg').speed.expanding().mean())
).reset_index(drop=True)
0 10.000000
1 10.500000
2 11.000000
3 11.500000
4 11.600000
5 12.600000
6 14.200000
7 14.200000
8 10.000000
9 10.000000
10 10.666667
11 11.750000
12 13.200000
13 13.400000
Name: speed, dtype: float64

How to create new df based on columns of two different data frames?

I am working on the following data frames. The original data frames are quite large, with thousands of lines; for illustration purposes I am using a much more basic df.
My first df is the following:
ID value
0 3 7387
1 8 4784
2 11 675
3 21 900
And there is another huge df, say df2:
x y final_id
0 -7.35 2.09 3
1 -6.00 2.76 3
2 -5.89 1.90 4
3 -4.56 2.67 5
4 -3.46 1.34 8
5 -4.67 1.23 8
6 -1.99 3.44 8
7 -5.67 2.40 11
8 -7.56 1.66 11
9 -9.00 3.12 21
10 -8.01 3.11 21
11 -7.90 3.19 22
Now, from the first df, I want to consider only the "ID" column and match its values against the "final_id" column in the second data frame (df2).
I want to create another df that contains only the filtered rows of df2, i.e. only the rows whose "final_id" is 3, 8, 11, or 21 (as per the "ID" column of df1).
Below would be the resultant df:
x y final_id
0 -7.35 2.09 3
1 -6.00 2.76 3
2 -3.46 1.34 8
3 -4.67 1.23 8
4 -1.99 3.44 8
5 -5.67 2.40 11
6 -7.56 1.66 11
7 -9.00 3.12 21
8 -8.01 3.11 21
We can see that rows 2, 3, and 11 from df2 have been removed in the resultant df.
Please help.
You can use isin to create a mask and then use the boolean mask to subset your df2:
mask = df2["final_id"].isin(df["ID"])
print(df2[mask])
x y final_id
0 -7.35 2.09 3
1 -6.00 2.76 3
4 -3.46 1.34 8
5 -4.67 1.23 8
6 -1.99 3.44 8
7 -5.67 2.40 11
8 -7.56 1.66 11
9 -9.00 3.12 21
10 -8.01 3.11 21
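If you also want the clean 0-based index shown in the expected output, a small follow-up on the same mask:
result = df2[mask].reset_index(drop=True)
print(result)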
