I have a multi-indexed dataframe and I wish to extract a subset based on index values and on a boolean criterion. I want to overwrite a specific column with new values, using multi-index keys and boolean indexers to select the records to modify.
import pandas as pd
import numpy as np
years = [1994,1995,1996]
householdIDs = list(range(1, 100))
midx = pd.MultiIndex.from_product( [years, householdIDs], names = ['Year', 'HouseholdID'] )
householdIncomes = np.random.randint( 10000,100000, size = len(years)*len(householdIDs) )
householdSize = np.random.randint( 1,5, size = len(years)*len(householdIDs) )
df = pd.DataFrame( {'HouseholdIncome':householdIncomes, 'HouseholdSize':householdSize}, index = midx )
df.sort_index(inplace = True)
Here's what the sample data looks like...
df.head()
=> HouseholdIncome HouseholdSize
Year HouseholdID
1994 1 23866 3
2 57956 3
3 21644 3
4 71912 4
5 83663 3
I'm able to successfully query the dataframe using the indices and column labels.
This example gives me the HouseholdSize for household 3 in year 1996
df.loc[ (1996,3 ) , 'HouseholdSize' ]
=> 1
However, I'm unable to combine boolean selection with multi-index queries...
The pandas docs on MultiIndex / advanced indexing say there is a way to combine boolean indexing with multi-indexing and give an example...
In [52]: idx = pd.IndexSlice
In [56]: mask = dfmi[('a','foo')]>200
In [57]: dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']]
Out[57]:
lvl0 a b
lvl1 foo foo
A3 B0 C1 D1 204 206
C3 D0 216 218
D1 220 222
B1 C1 D0 232 234
D1 236 238
C3 D0 248 250
D1 252 254
...which I can't seem to replicate on my dataframe
idx = pd.IndexSlice
householdSizeAbove2 = ( df.HouseholdSize > 2 )
df.loc[ idx[ householdSizeAbove2, 1996, :] , 'HouseholdSize' ]
Traceback (most recent call last):
File "python", line 1, in <module>
KeyError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (3), lexsort depth (2)'
In this example I would want to see all the households in 1996 with HouseholdSize above 2.
Pandas.query() should work in this case:
df.query("Year == 1996 and HouseholdID > 2")
Demo:
In [326]: with pd.option_context('display.max_rows',20):
...: print(df.query("Year == 1996 and HouseholdID > 2"))
...:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 3 28664 4
4 11057 1
5 36321 2
6 89469 4
7 35711 2
8 85741 1
9 34758 3
10 56085 2
11 32275 4
12 77096 4
... ... ...
90 40276 4
91 10594 2
92 61080 4
93 65334 2
94 21477 4
95 83112 4
96 25627 2
97 24830 4
98 85693 1
99 84653 4
[97 rows x 2 columns]
UPDATE:
Is there a way to select a specific column?
In [333]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdIncome']
Out[333]:
Year HouseholdID
1996 3 28664
4 11057
5 36321
6 89469
7 35711
8 85741
9 34758
10 56085
11 32275
12 77096
...
90 40276
91 10594
92 61080
93 65334
94 21477
95 83112
96 25627
97 24830
98 85693
99 84653
Name: HouseholdIncome, dtype: int32
and ultimately I want to overwrite the data on the dataframe.
In [331]: df.loc[df.eval("Year == 1996 and HouseholdID > 2"), 'HouseholdSize'] *= 10
In [332]: df.loc[df.eval("Year == 1996 and HouseholdID > 2")]
Out[332]:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 3 28664 40
4 11057 10
5 36321 20
6 89469 40
7 35711 20
8 85741 10
9 34758 30
10 56085 20
11 32275 40
12 77096 40
... ... ...
90 40276 40
91 10594 20
92 61080 40
93 65334 20
94 21477 40
95 83112 40
96 25627 20
97 24830 40
98 85693 10
99 84653 40
[97 rows x 2 columns]
UPDATE2:
I want to pass a variable year instead of a specific value. Is there
a cleaner way to do it than "Year == " + str(year) + " and HouseholdID > " + str(householdSize)?
In [5]: year = 1996
In [6]: household_ids = [1, 2, 98, 99]
In [7]: df.loc[df.eval("Year == @year and HouseholdID in @household_ids")]
Out[7]:
HouseholdIncome HouseholdSize
Year HouseholdID
1996 1 42217 1
2 66009 3
98 33121 4
99 45489 3
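For completeness, here is a small sketch (not part of the original answer) that avoids query/eval entirely by combining the column condition with the 'Year' index level in a single boolean mask; it also sidesteps the lexsort error from the IndexSlice attempt in the question:
mask = (df['HouseholdSize'] > 2) & (df.index.get_level_values('Year') == 1996)
df.loc[mask, 'HouseholdSize'] *= 10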
Related
I want to sum up data across overlapping bins. Basically the question here, but instead of the bins being (0-8 years old), (9-17 years old), (18-26 years old), (27-35 years old), and (36-44 years old), I want them to be (0-8 years old), (1-9 years old), (2-10 years old), (3-11 years old), and (4-12 years old).
Starting with a df like this:
id  awards  age
 1     100   24
 1     150   26
 1      50   54
 2     193   34
 2     209   50
I am using the code from this answer to calculate summation across non-overlapping bins.
bins = [9 * i for i in range(0, df['age'].max() // 9 + 2)]
cuts = pd.cut(df['age'], bins, right=False)
print(cuts)
0 [18, 27)
1 [18, 27)
2 [54, 63)
3 [27, 36)
4 [45, 54)
Name: age, dtype: category
Categories (7, interval[int64, left]): [[0, 9) < [9, 18) < [18, 27) < [27, 36) < [36, 45) < [45, 54) < [54, 63)]
df_out = (df.groupby(['id', cuts])
.agg(total_awards=('awards', 'sum'))
.reset_index(level=0)
.reset_index(drop=True)
)
df_out['age_interval'] = df_out.groupby('id').cumcount()
Result
print(df_out)
id total_awards age_interval
0 1 0 0
1 1 0 1
2 1 250 2
3 1 0 3
4 1 0 4
5 1 0 5
6 1 50 6
7 2 0 0
8 2 0 1
9 2 0 2
10 2 193 3
11 2 0 4
12 2 209 5
13 2 0 6
Is it possible to work off the existing code to do this with overlapping bins?
First pivot_table your data to get one row per id with the ages as columns. Then reindex to get all the possible ages, from 0 to at least the max of the age column (here I use the max plus the interval length). Now you can use rolling along the columns. Rename the columns to create meaningful names. Finally, stack and reset_index to get a dataframe with the expected shape.
interval = 9  # include both bounds, e.g. 0 and 8 for the first interval
res = (
    df.pivot_table(index='id', columns='age', values='awards',
                   aggfunc=sum, fill_value=0)
      .reindex(columns=range(0, df['age'].max()+interval), fill_value=0)
      .rolling(interval, axis=1, min_periods=interval).sum()
      .rename(columns=lambda x: f'{x-interval+1}-{x} y.o.')
      .stack()
      .reset_index(name='awards')
)
With the input data provided in the question, you get:
print(res)
# id age awards
# 0 1 0-8 y.o. 0.0
# 1 1 1-9 y.o. 0.0
# ...
# 15 1 15-23 y.o. 0.0
# 16 1 16-24 y.o. 100.0
# 17 1 17-25 y.o. 100.0
# 18 1 18-26 y.o. 250.0
# 19 1 19-27 y.o. 250.0
# 20 1 20-28 y.o. 250.0
# 21 1 21-29 y.o. 250.0
# 22 1 22-30 y.o. 250.0
# 23 1 23-31 y.o. 250.0
# 24 1 24-32 y.o. 250.0
# 25 1 25-33 y.o. 150.0
# 26 1 26-34 y.o. 150.0
# 27 1 27-35 y.o. 0.0
# ...
# 45 1 45-53 y.o. 0.0
# 46 1 46-54 y.o. 50.0
# 47 1 47-55 y.o. 50.0
# 48 1 48-56 y.o. 50.0
# 49 1 49-57 y.o. 50.0
# ...
I think the best approach would be to first compute per-age sums and then use a rolling window to get all 9-year intervals. This only works because all your intervals have the same size; otherwise it would be much harder.
>>> totals = df.groupby('age')['awards'].sum()
>>> totals = totals.reindex(np.arange(0, df['age'].max() + 9)).fillna(0, downcast='infer')
>>> totals
0 6
1 2
2 4
3 6
4 4
..
98 0
99 0
100 0
101 0
102 0
Name: age, Length: 103, dtype: int64
>>> totals.rolling(9).sum().dropna().astype(int).rename(lambda age: f'{age-8}-{age}')
0-8 42
1-9 43
2-10 45
3-11 47
4-12 47
..
90-98 31
91-99 27
92-100 20
93-101 13
94-102 8
Name: age, Length: 95, dtype: int64
This is slightly complicated by the fact you also want to group by id, but the idea stays the same:
>>> idx = pd.MultiIndex.from_product([df['id'].unique(), np.arange(0, df['age'].max() + 9)], names=['id', 'age'])
>>> totals = df.groupby(['id', 'age']).sum().reindex(idx).fillna(0, downcast='infer')
>>> totals
awards
1 0 128
1 204
2 136
3 367
4 387
... ...
2 98 0
99 0
100 0
101 0
102 0
[206 rows x 1 columns]
>>> totals.groupby('id').rolling(9).sum().droplevel(0).dropna().astype(int).reset_index('id')
id awards
age
8 1 3112
9 1 3390
10 1 3431
11 1 3609
12 1 3820
.. .. ...
98 2 1786
99 2 1226
100 2 900
101 2 561
102 2 317
[190 rows x 2 columns]
This is the same as @Ben.T's answer except we keep the Series shape, while his answer pivots it to a dataframe. At any step you could .stack('age') or .unstack('age') to switch between the two answers' formats.
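For example, a quick sketch assuming the totals frame built above:
wide = totals.unstack('age')   # one column per age, like the pivot_table answer
long = wide.stack('age')       # back to the (id, age) MultiIndex format used here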
IIUC, you can use pd.IntervalIndex with some list comprehension:
ii = pd.IntervalIndex.from_tuples(
    [
        (s, e)
        for e, s in pd.Series(np.arange(51)).rolling(9).agg(min).dropna().iteritems()
    ]
)
df_out = pd.concat(
    [
        pd.Series(ii.contains(x["age"]) * x["awards"], index=ii)
        for i, x in df[["age", "awards"]].iterrows()
    ],
    axis=1,
).groupby(level=0).sum().T
df_out.stack()
df_out.stack()
Output:
0 (0.0, 8.0] 0
(1.0, 9.0] 0
(2.0, 10.0] 0
(3.0, 11.0] 0
(4.0, 12.0] 0
...
4 (38.0, 46.0] 0
(39.0, 47.0] 0
(40.0, 48.0] 0
(41.0, 49.0] 0
(42.0, 50.0] 209
Length: 215, dtype: int64
An old-school way without pd.cut, using a for loop and some masks.
import pandas as pd
max_age = df["age"].max()
interval_length = 8
values = []
for min_age in range(max_age - interval_length + 1):
    max_age = min_age + interval_length
    awards = df.query("@min_age <= age <= @max_age").loc[:, "awards"].sum()
    values.append([min_age, max_age, awards])
df_out = pd.DataFrame(values, columns=["min_age", "max_age", "awards"])
Let me know if this is what you want :)
Let df be a DataFrame:
import pandas as pd
import random
def r(b, e):
    return [random.randint(b, e) for _ in range(300)]
df = pd.DataFrame({'id': r(1, 3), 'awards': r(0, 400), 'age': r(1, 99)})
For binning by age, I would advise creating a new column since it is clearer (and faster):
df['bin'] = df['age'].apply(lambda x: x // 9)
print(df)
The number of awards per id per bin can be obtained using simply:
totals_separate = df.groupby(['id', 'bin'])['awards'].sum()
print(totals_separate)
If I understand correctly, you would like the sum for each window of size 9 rows:
totals_rolling = df.groupby(['id', 'bin'])['awards'].rolling(9, min_periods=1).sum()
print(totals_rolling)
Reference: https://pandas.pydata.org/docs/reference/api/pandas.Series.rolling.html
I need to check df.head() and df.tail() many times.
When I use df.head() and df.tail() separately, the Jupyter notebook displays an ugly output.
Is there any single-line command so that we can select only the first 5 and last 5 rows:
something like:
df.iloc[:5 | -5:] ?
Test example:
df = pd.DataFrame(np.random.rand(20,2))
df.iloc[:5]
Update
Ugly but working ways:
df.iloc[(np.where( (df.index < 5) | (df.index > len(df)-5)))[0]]
or,
df.iloc[np.r_[np.arange(5), np.arange(df.shape[0]-5, df.shape[0])]]
Take a look at numpy.r_:
df.iloc[np.r_[0:5, -5:0]]
Out[358]:
0 1
0 0.899673 0.584707
1 0.443328 0.126370
2 0.203212 0.206542
3 0.562156 0.401226
4 0.085070 0.206960
15 0.082846 0.548997
16 0.435308 0.669673
17 0.426955 0.030303
18 0.327725 0.340572
19 0.250246 0.162993
Also, head + tail is not a bad solution:
df.head(5).append(df.tail(5))
Out[362]:
0 1
0 0.899673 0.584707
1 0.443328 0.126370
2 0.203212 0.206542
3 0.562156 0.401226
4 0.085070 0.206960
15 0.082846 0.548997
16 0.435308 0.669673
17 0.426955 0.030303
18 0.327725 0.340572
19 0.250246 0.162993
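Note that DataFrame.append was deprecated and later removed in pandas 2.0; the pd.concat equivalent of the line above is:
pd.concat([df.head(5), df.tail(5)])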
df.query("index<5 | index>"+str(len(df)-5))
Here's a way to query the index. You can change the values to whatever you want.
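A slightly tidier variant of the same idea, sketched with query's @-variable syntax instead of string concatenation:
n = len(df)
df.query("index < 5 or index > @n - 5")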
Another approach (per this SO post)
uses only Pandas .isin()
Generate some dummy/demo data
df = pd.DataFrame({'a':range(10,100)})
print(df.head())
a
0 10
1 11
2 12
3 13
4 14
print(df.tail())
a
85 95
86 96
87 97
88 98
89 99
print(df.shape)
(90, 1)
Generate list of required indexes
ls = list(range(5)) + list(range(len(df)-5, len(df)))
print(ls)
[0, 1, 2, 3, 4, 85, 86, 87, 88, 89]
Slice DataFrame using list of indexes
df_first_last_5 = df[df.index.isin(ls)]
print(df_first_last_5)
a
0 10
1 11
2 12
3 13
4 14
85 95
86 96
87 97
88 98
89 99
Here is the problem:
import numpy
import pandas
dfl = pandas.DataFrame(numpy.random.randn(30,10))
Now, I want the following cells put in a data frame:
For row 1: columns 3 to 6 (length = 4 cells),
For row 2: columns 4 to 7 (length = 4 cells),
For row 3: columns 1 to 4 (length = 4 cells),
etc...
Each of these ranges is always 4 cells wide, but the start/end columns differ.
The row-wise start points are in a list [3, 4, 1, ...] and so are the row-wise end points. The list of rows I'm interested in is also a list, [1, 2, 3].
Finally, dfl has a datetime index which I would like to preserve
(meaning the end result should be a data frame with index dfl.index[[1, 2, 3]]).
Edit: when the range exceeds the frame
Some of the entries of the vector of row-wise start points are too large (say a row-wise start point of 9 in the example matrix above). In those cases, I just want all the columns from the row-wise start point onward and then as many NaNs as necessary to get the right shape (so since 9+4 > 10, the corresponding row of the result data frame should be [9, 10, NaN, NaN]).
Using NumPy broadcasting to create all those column indices and then advanced-indexing into the array data -
def extract_rows(dfl, starts, L, fillval=np.nan):
    a = dfl.values
    idx = np.asarray(starts)[:,None] + range(L)
    valid_mask = idx < dfl.shape[1]
    idx[~valid_mask] = 0
    val = a[np.arange(len(idx))[:,None],idx]
    return pd.DataFrame(np.where(valid_mask, val, fillval))
Sample runs -
In [541]: np.random.seed(0)
In [542]: dfl = pandas.DataFrame(numpy.random.randint(11,99,(3,10)))
In [543]: dfl
Out[543]:
0 1 2 3 4 5 6 7 8 9
0 55 58 75 78 78 20 94 32 47 98
1 81 23 69 76 50 98 57 92 48 36
2 88 83 20 31 91 80 90 58 75 93
In [544]: extract_rows(dfl, starts=[3,4,8], L=4, fillval=np.nan)
Out[544]:
0 1 2 3
0 78.0 78.0 20.0 94.0
1 50.0 98.0 57.0 92.0
2 75.0 93.0 NaN NaN
In [545]: extract_rows(dfl, starts=[3,4,8], L=4, fillval=-1)
Out[545]:
0 1 2 3
0 78 78 20 94
1 50 98 57 92
2 75 93 -1 -1
Or we can use .iloc with enumerate:
l = [3, 4, 1]
pd.DataFrame(data=[dfl.iloc[x:x+1, y:y+4].values[0] for x, y in enumerate(l)])
Out[107]:
0 1 2 3
0 1.224124 -0.938459 -1.114081 -1.128225
1 -0.445288 0.445390 -0.154295 -1.871210
2 0.784677 0.997053 2.144286 -0.179895
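If keeping dfl's original (datetime) index matters, as the question asks, here is a small sketch along the same lines; the row list and start columns are the ones from the question, and pd.DataFrame pads ragged rows with NaN, which also covers the "range exceeds" case:
rows = [1, 2, 3]
starts = [3, 4, 1]
out = pd.DataFrame(
    [dfl.iloc[r, s:s+4].tolist() for r, s in zip(rows, starts)],
    index=dfl.index[rows],
)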
I'm trying to subtract a df "p_df" (144 rows x 1 col) from a df "stock_returns" (144 rows x 517 cols).
I have tried:
stock_returns - p_df
stock_returns.rsub(p_df,axis=1)
stock_returns.subtract(p_df)
But none of them work and they all return NaN values.
I'm passing it through this function, and using the for loop to get the args:
def disp_calc(returns, p, wi): #apply(disp_calc, rows = ...)
    wi = wi/np.sum(wi)
    rp = (col_len(returns)*(returns-p)**2).sum()  # returns - p causing problems
    return np.sqrt(rp)
for i in sectors:
    stock_returns = returns_rolling[sectordict[i]]  # .apply(np.mean,axis=1)
    portfolio_return = returns_rolling[i]; p_df = portfolio_return.to_frame()
    disp_df[i] = stock_returns.apply(disp_calc, args=(portfolio_return, wi))
My expected output is to subtract the single column in p_df from all 517 columns in the first df, so the final result would still have 517 columns. Thanks.
You're almost there; you just need to set axis=0 to subtract along the index:
>>> stock_returns = pd.DataFrame([[10,100,200],
[15, 115, 215],
[20,120, 220],
[25,125,225],
[30,130,230]], columns=['A', 'B', 'C'])
>>> stock_returns
A B C
0 10 100 200
1 15 115 215
2 20 120 220
3 25 125 225
4 30 130 230
>>> p_df = pd.DataFrame([1,2,3,4,5], columns=['P'])
>>> p_df
P
0 1
1 2
2 3
3 4
4 5
>>> stock_returns.sub(p_df['P'], axis=0)
A B C
0 9 99 199
1 13 113 213
2 17 117 217
3 21 121 221
4 25 125 225
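If the single column's name isn't known up front, a hedged alternative is to squeeze p_df down to a Series first:
stock_returns.sub(p_df.squeeze(axis=1), axis=0)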
data['new_col3'] = data['col1'] - data['col2']
I have the following df:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
4 9 2 64 32 343
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
What I'm trying to do is:
For every 'clienthostid' I look for the 'usersidid' with the highest 'LoginDaysSum', and I check whether there is a usersidid that has the highest LoginDaysSum in two or more different clienthostid groups (for instance, usersidid = 9 is the highest LoginDaysSum in clienthostid 1, 2 and 3, in rows 0, 4 and 7 respectively).
In this case, I want to choose the highest LoginDaysSum (in the example it would be the row with 1728); let's call it maxRT.
I want to calculate the ratio of LoginDaysSumLast7Days between maxRT and each of the other rows (in example, it would be rows index 7 and 4).
If the ratio is below 0.8 then I want to drop the row:
index 4- LoginDaysSumLast7Days_ratio = 7/32 < 0.8 //row will drop!
index 7- LoginDaysSumLast7Days_ratio = 7/3 > 0.8 //row will stay!
The same condition will also be applied to LoginDaysSumLastMonth.
So for the example the result will be:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200
Now here's the snag: performance is critical.
I tried to implement it using .apply, but not only could I not make it work right, it also ran way too slow :(
My code so far (forgive me if it's written terribly wrong, I only started working with SQL, Pandas and Python for the first time last week, and everything I learned is from examples I found here ^_^):
df_client_Logindayssum_pairs = df.merge(df.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].max(),df, how='inner', on=['clienthostid', 'LoginDaysSum'])
UsersWithMoreThan1client = df_client_Logindayssum_pairs.groupby(['usersidid'], as_index=False, sort=False)['LoginDaysSum'].count().rename(columns={'LoginDaysSum': 'NumOfClientsPerUesr'})
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.NumOfClientsPerUesr >= 2]
UsersWithMoreThan1client = df_client_Logindayssum_pairs[df_client_Logindayssum_pairs.usersidid.isin(UsersWithMoreThan1Device.loc[:, 'usersidid'])].reset_index(drop=True)
UsersWithMoreThan1client = UsersWithMoreThan1client.sort_values(['clienthostid', 'LoginDaysSum'], ascending=[True, False], inplace=True)
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLast7Days'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio > 0.8]
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLastMonth'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio2')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio2 > 0.8]
Would very much appreciate any suggestions on how to do it
Thank you
I believe this is what you need:
# Put the index as a regular column
data = data.reset_index()
# Find greatest LoginDaysSum for each clienthostid
agg1 = data.sort_values(by='LoginDaysSum', ascending=False).groupby(['clienthostid']).first()
# Collect greatest LoginDaysSum for each usersidid
agg2 = agg1.sort_values(by='LoginDaysSum', ascending=False).groupby('usersidid').first()
# Join both previous aggregations
joined = agg1.set_index('usersidid').join(agg2, rsuffix='_max')
# Compute ratios
joined['LoginDaysSumLast7Days_ratio'] = joined['LoginDaysSumLast7Days_max'] / joined['LoginDaysSumLast7Days']
joined['LoginDaysSumLastMonth_ratio'] = joined['LoginDaysSumLastMonth_max'] / joined['LoginDaysSumLastMonth']
# Select index values that do not meet the required criteria
rem_idx = joined[(joined['LoginDaysSumLast7Days_ratio'] < 0.8) | (joined['LoginDaysSumLastMonth_ratio'] < 0.8)]['index']
# Restore index and remove the selected rows
data = data.set_index('index').drop(rem_idx)
The result in data is:
usersidid clienthostid LoginDaysSumLastMonth LoginDaysSumLast7Days LoginDaysSum
index
0 9 1 50 7 1728
1 3 1 43 3 1331
2 6 1 98 9 216
3 4 1 10 6 64
5 12 3 45 43 1000
6 8 3 87 76 512
7 9 3 16 3 1200