What is the proper way to query top N rows by group in python datatable?
For example, to get the top 2 rows having the largest v3 value within each (id2, id4) group, I would write the pandas expression in the following way:
df.sort_values('v3', ascending=False).groupby(['id2','id4']).head(2)
in R using data.table:
DT[order(-v3), head(.SD, 2L), by=.(id2, id4)]
or in R using dplyr:
DF %>% arrange(desc(v3)) %>% group_by(id2, id4) %>% filter(row_number() <= 2L)
Example data and expected output using pandas:
import datatable as dt
DT = dt.Frame(id2=[1, 2, 1, 2, 1, 2], id4=[1, 1, 1, 1, 1, 1], v3=[1, 3, 2, 3, 3, 3])
df = DT.to_pandas()
df.sort_values('v3', ascending=False).groupby(['id2','id4']).head(2)
#    id2  id4  v3
# 1    2    1   3
# 3    2    1   3
# 4    1    1   3
# 2    1    1   2
Starting from datatable version 0.8.0, this can be achieved by combining grouping, sorting and filtering:
from datatable import *
DT = Frame(id2=[1, 2, 1, 2, 1, 2],
           id4=[1, 1, 1, 1, 1, 1],
           v3=[1, 3, 2, 3, 3, 3])
DT[:2, :, by(f.id2, f.id4), sort(-f.v3)]
which produces
     id2  id4  v3
 --  ---  ---  --
  0    1    1   3
  1    1    1   2
  2    2    1   3
  3    2    1   3

[4 rows x 3 columns]
Explanation:
by(f.id2, f.id4) groups the data by columns "id2" and "id4";
the sort(-f.v3) command tells datatable to sort the records by column "v3" in descending order. In the presence of by() this operator will be applied within each group;
the first :2 selects the top 2 rows, again within each group;
the second : selects all columns. If needed, this could have been a list of columns or expressions, allowing you to perform some operation(s) on the first 2 rows of each group.
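For example, a minimal sketch of that last point, computing derived expressions on the top 2 rows of each group (the name "v3_half" is made up for illustration; this reuses DT, f, by and sort from above):
# instead of all columns, evaluate expressions per group
DT[:2, {"v3": f.v3, "v3_half": f.v3 / 2}, by(f.id2, f.id4), sort(-f.v3)]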
Related
I have the following dataframe:
import pandas as pd

d_test = {
    'random_staff': ['gfda', 'fsd', 'gec', 'erw', 'gd', 'kjhk', 'fd', 'kui'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
The cluster_number column contains values from 1 to n. Values may repeat, but none are missing; in the example above those values are 1, 2, 3, 4.
I want to be able to select some value from the cluster_number column and change every occurrence of it to a set of unique values, with no value missing from the sequence. For example, if we select the value 2, the desired outcome for cluster_number is [1, 2, 3, 3, 5, 1, 4, 6]. Note we had three 2s in the column: we keep the first as 2, change the next occurrence of 2 to 5, and change the last occurrence of 2 to 6.
I wrote code for the logic above and it works fine:
cluster_number_to_change = 2
max_cluster = max(df_test['cluster_number'])
first_iter = True
i = cluster_number_to_change

for index, row in df_test.iterrows():
    if row['cluster_number'] == cluster_number_to_change:
        df_test.loc[index, 'cluster_number'] = i
        if first_iter:
            i = max_cluster + 1
            first_iter = False
        else:
            i += 1
But it is written as a for-loop, and I am trying to understand whether it can be transformed into a pandas .apply call (or any other efficient vectorized solution).
Using boolean indexing:
# get cluster #2
m1 = df_test['cluster_number'].eq(2)
# identify duplicates
m2 = df_test['cluster_number'].duplicated()
# increment duplicates using the max as reference
df_test.loc[m1 & m2, 'cluster_number'] = (
    m2.where(m1).cumsum()
      .add(df_test['cluster_number'].max())
      .convert_dtypes()
)
print(df_test)
Output:
  random_staff  cluster_number
0         gfda               1
1          fsd               2
2          gec               3
3          erw               3
4           gd               5
5         kjhk               1
6           fd               4
7          kui               6
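If this renumbering needs to run for arbitrary target values, here is a sketch of a reusable helper built on the same masks (the name respread_value is hypothetical):
import pandas as pd

def respread_value(df, col, value):
    m1 = df[col].eq(value)         # rows holding the chosen value
    m2 = df[col].duplicated()      # every repeated occurrence in the column
    # renumber the 2nd, 3rd, ... occurrences of `value` past the current max
    df.loc[m1 & m2, col] = (
        m2.where(m1).cumsum()
          .add(df[col].max())
          .convert_dtypes()
    )
    return df

# usage: respread_value(df_test, 'cluster_number', 2)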
I have a pandas dataframe with a column such as :
df1 = pd.DataFrame({ 'val': [997.95, 997.97, 989.17, 999.72, 984.66, 1902.15]})
I have 2 types of events that can be detected from this column, and I want to label them 1 and 2.
I need to get the indexes of each label; to do so I need to find where the 'val' column has changed a lot (±7) from the previous row.
Expected output:
one = [0, 1, 3, 5]
two = [2, 4]
Use Series.diff with a mask testing for values less than 0, then use boolean indexing to get the indices:
m = df1.val.diff().lt(0)
# to test for a drop of more than 7 instead:
# m = df1.val.diff().lt(-7)
one = df1.index[~m]
two = df1.index[m]
print(one)
Int64Index([0, 1, 3, 5], dtype='int64')
print(two)
Int64Index([2, 4], dtype='int64')
If need lists:
one = df1.index[~m].tolist()
two = df1.index[m].tolist()
Details:
print(df1.val.diff())
0       NaN
1      0.02
2     -8.80
3     10.55
4    -15.06
5    917.49
Name: val, dtype: float64
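If a label column is more convenient than two index lists, here is a small sketch under the same assumption (a drop from the previous row marks event type 2):
import numpy as np

# 2 where val dropped from the previous row, else 1
df1['label'] = np.where(df1.val.diff().lt(0), 2, 1)
one = df1.index[df1['label'].eq(1)].tolist()
two = df1.index[df1['label'].eq(2)].tolist()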
I'm currently creating a new column in my pandas dataframe, calculated by subtracting a fixed value from another column. This is my current code, which almost gives me the output I desire (example shortened for reproduction):
subtraction_value = 3
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]}
data['new_column'] = data['test'][::-1] - subtraction_value
When run, this gives me the current output:
print(data['new_column'])
[9, 1, 2, 1, -2, 0, -1, 2, 7, 6]
However, suppose I wanted to subtract a different value at position [0], use the original subtraction value on positions [1:3], then apply the second value again at position [4], and so on, repeating this pattern down the column. I realize I could use a for loop to achieve this, but for performance reasons I'd like to do it another way. My new output would ideally look like this:
subtraction_value_2 = 6
print(data['new_column'])
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
You can use positional indexing:
subtraction_value_2 = 6
col = data.columns.get_loc('new_column')
data.iloc[0::4, col] = data['test'].iloc[0::4].sub(subtraction_value_2)
or with numpy.where (this assumes numpy has been imported as np):
data['new_column'] = np.where(data.index % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
output:
   test  new_column
0    12           6
1     4           1
2     5           2
3     4           1
4     1          -5
5     3           0
6     2          -1
7     5           2
8    10           4
9     9           6
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data.test - subtraction_value
data.loc[::4, 'new_column'] = data.test[::4] - subtraction_value_2
print(list(data.new_column))
Output:
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
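If the pattern ever grows beyond two values, a hypothetical generalization is to subtract a repeating pattern of any period:
import numpy as np
import pandas as pd

# one subtraction value per position in a repeating pattern of period 4
# (the second value applies at positions 0, 4, 8, ...)
pattern = np.array([subtraction_value_2, subtraction_value,
                    subtraction_value, subtraction_value])
data['new_column'] = data['test'] - pattern[data.index % len(pattern)]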
In the dataframe below, the column "CumRetperTrade" consists of a few vertical vectors (sequences of numbers) separated by zeros (these vectors correspond to the non-zero elements of column "Portfolio"). I would like to find the cumulative local maxima of every non-zero vector contained in column "CumRetperTrade".
To be precise, I would like to transform (using vectorized or other methods) column "CumRetperTrade" into the column "PeakCumRet" (the desired result), which, for every vector (every subset where Portfolio = 1) contained in column "CumRetperTrade", gives the cumulative maximum of all its previous values. A numeric example is below. Thanks in advance!
PS In other words, I guess we need to use cummax(), but apply it only to the consecutive subsets of "CumRetperTrade" where Portfolio = 1.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"Portfolio": [1, 1, 1, 1, 0 , 0, 0, 1, 1, 1],
"CumRetperTrade": [2, 3, 2, 1, 0 , 0, 0, 4, 2, 1],
"PeakCumRet": [2, 3, 3, 3, 0 , 0, 0, 4, 4, 4]})
df1
   Portfolio  CumRetperTrade  PeakCumRet
0          1               2           2
1          1               3           3
2          1               2           3
3          1               1           3
4          0               0           0
5          0               0           0
6          0               0           0
7          1               4           4
8          1               2           4
9          1               1           4
PPS I already asked a similar question (Dataframe column: to find local maxima) and received a correct answer; however, in that question I did not explicitly mention the requirement of cumulative local maxima.
You only need a small modification to the previous answer:
df1["PeakCumRet"] = (
df1.groupby(df1["Portfolio"].diff().ne(0).cumsum())
["CumRetperTrade"].expanding().max()
.droplevel(0)
)
expanding().max() is what produces the local maxima.
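An equivalent sketch uses cummax directly, which stays aligned to the original index and avoids the droplevel step:
df1["PeakCumRet"] = (
    df1.groupby(df1["Portfolio"].diff().ne(0).cumsum())
       ["CumRetperTrade"].cummax()
)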
I have the following table in a Pandas dataframe:
Seconds   Color  Break  NaN  End
0.639588  123    4      NaN  -
1.149597  123    1      NaN  -
1.671333  123    2      NaN  -
1.802052  123    2      NaN  -
1.900091  123    1      NaN  -
2.031240  123    4      NaN  -
2.221477  123    3      NaN  -
2.631840  123    2      NaN  -
2.822245  123    1      NaN  -
2.911147  123    4      NaN  -
3.133344  123    1      NaN  -
3.531246  123    1      NaN  -
3.822389  123    1      NaN  -
3.999389  123    2      NaN  -
4.327990  123    4      NaN  -
I'm trying to extract subgroups of the column labelled as 'Break' in such a way that the first and last item of each group is a '4'. So, the first group should be: [4,1,2,2,1,4]; the second group: [4,3,2,1,4]; the third group: [4,1,1,1,2,4]. The last '4' of each group is the first '4' of the following group.
I have the following code:
groups = []

def extract_phrases_between_four(data, new_group=[]):
    row_iterator = data.iterrows()
    for index, row in row_iterator:
        if row['Break_Level_Annotation'] != '4':
            new_group.append(row['Break_Level_Annotation'])
        if row['Break_Level_Annotation'] == '4':
            new_group = []
            new_group.append(row['Break_Level_Annotation'])
        groups.append(new_group)
    return groups
but my output is:
[[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,3,2,1],[4,3,2,1],[4,3,2,1],[4,3,2,1],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2],[4,1,1,1,2]].
It's returning the same new_group repeatedly as many times as there are items in each new_group, while at the same time not including the final '4' of each new_group.
I've tried to move around the code but I can't seem to understand what the problem is. How can I get each new_group to include its first and final '4' and for the new_group to be included only once in the array 'groups'?
IIUC you can extract the index and use a list comprehension:
import numpy as np

s = df.loc[df["Break"].eq(4)].index
print([df.loc[np.arange(x, y+1), "Break"].tolist() for x, y in zip(s, s[1:])])
[[4, 1, 2, 2, 1, 4], [4, 3, 2, 1, 4], [4, 1, 1, 1, 2, 4]]
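An equivalent sketch without numpy pairs consecutive positions of '4' with itertools.pairwise (Python 3.10+); label-based .loc slicing includes both endpoints:
from itertools import pairwise

s = df.index[df["Break"].eq(4)]
print([df.loc[x:y, "Break"].tolist() for x, y in pairwise(s)])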
The problem is that at each step of the for loop you are adding new_group to groups, even though you are still adding elements to new_group. You need to execute groups.append(new_group) inside the second if statement instead.
Also, note that you can iterate directly over the "Break" column values instead of iterating over the whole dataframe and accessing the value on each row.
I rewrote the code a little bit, and it looks as follows:
groups = []
new_group = []
for i in data["Break"]:
new_group.append(i)
if i == 4:
if len(new_group) > 1:
groups.append(new_group)
new_group = [4]
print(groups)
And here is the result:
[[4, 1, 2, 2, 1, 4], [4, 3, 2, 1, 4], [4, 1, 1, 1, 2, 4]]