Python Dataframe GroupBy Function

Python Dataframe GroupBy Function - python

I am having hard time understanding what the code below does. I initially thought it was counting the unique appearances of the values in (weight age) and (weight height) however when I ran this example, I found out it was doing something else.
data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],[1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data,columns=["Lbl","Weight","Age","Height"])
print (df)
def group_fea(df,key,target):
'''
Adds columns for feature combinations
'''
tmp = df.groupby(key, as_index=False)[target].agg({
key+target + '_nunique': 'nunique',
}).reset_index()
del tmp['index']
print("****{}****".format(target))
return tmp
#Add feature combinations
feature_key = ['Weight']
feature_target = ['Age','Height']
for key in feature_key:
for target in feature_target:
tmp = group_fea(df,key,target)
df = df.merge(tmp,on=key,how='left')
print (df)
Lbl Weight Age Height
0 0 33 15 4
1 1 44 12 3
2 0 44 12 5
3 1 33 15 4
4 0 77 13 4
5 1 33 15 4
6 1 99 40 7
7 0 58 45 4
8 1 11 13 4
****Age****
****Height****
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
I want to understand what the values in WeightAge_nunique WeightHeight_nunique mean

The value of WeightAge_nunique on a given row is the number of unique Ages that have the same Weight. The corresponding thing is true of WeightHeight_nunique. E.g., for people of Weight=44, there is only 1 unique age (12), hence WeightAge_nunique=1 on those rows, but there are 2 unique Heights (3 and 5), hence WeightHeight_nunique=2 on those same rows.
You can see that this happens because the grouping function groups by the "key" column (Weight), then performs the "nunique" aggregation function on the "target" column (either Age or Height).

Let us try transform
g = df.groupby('Weight').transform('nunique')
df['WeightAge_nunique'] = g['Age']
df['WeightHeight_nunique'] = g['Height']
df
Out[196]:
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1

Related

Subtract fixed row value in reference to column value in pandas dataframe

I would like to subtract a fixed row value in rows, in reference to their values in another column.
My data looks like this:
TRACK TIME POSITION_X
0 1 0 12
1 1 30 13
2 1 60 15
3 1 90 11
4 2 0 10
5 2 20 11
6 2 60 13
7 2 90 17
I would like to subtract a fixed row value (WHEN TIME=0) of the POSITION_X column in reference to the TRACK column, and create a new column ("NEW_POSX") with those values. The output should be like this:
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 3
3 1 90 11 -1
4 2 0 10 0
5 2 20 11 1
6 2 60 13 3
7 2 90 17 7
I have been using the following code to get this done:
import pandas as pd
data = {'TRACK': [1,1,1,1,2,2,2,2],
'TIME': [0,30,60,90,0,20,60,90],
'POSITION_X': [12,13,15,11,10,11,13,17],
}
df = pd.DataFrame (data, columns = ['TRACK','TIME','POSITION_X'])
df['NEW_POSX']= df.groupby('TRACK')['POSITION_X'].diff().fillna(0).astype(int)
df.head(8)
... but I don't get the desired output. Instead, I get a new column where every row is subtracted by the previous row (according to the "TRACK" column):
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 2
3 1 90 11 -4
4 2 0 10 0
5 2 20 11 1
6 2 60 13 2
7 2 90 17 4
can anyone help me with this?

You can use transform and first to get the value at time 0, and then substract it to the 'POSITION_X' column:
s=df.groupby('TRACK')['POSITION_X'].transform('first')
df['NEW_POSX']=df['POSITION_X']-s
#Same as:
#df['NEW_POSX']=df['POSITION_X'].sub(s)
Output:
df
TRACK TIME POSITION_X NEW_POSX
0 1 0 12 0
1 1 30 13 1
2 1 60 15 3
3 1 90 11 -1
4 2 0 10 0
5 2 20 11 1
6 2 60 13 3
7 2 90 17 7

Remove duplicates after a certain number of occurrences

How do we filter the dataframe below to remove all duplicate ID rows after a certain number of ID occurrence. I.E. remove all rows of ID == 0 after the 3rd occurrence of ID == 0
Thanks
pd.DataFrame(np.random.randint(0,10,size=(100, 2)), columns=['ID', 'Value']).sort_values('ID')
Output:
ID Value
0 7
0 8
0 5
0 5
... ... ...
9 7
9 7
9 1
9 3
Desired Output for filter_count = 3:
Output:
ID Value
0 7
0 8
0 5
1 7
1 7
1 1
2 3

If you want to do this for all IDs, use:
df.groupby("ID").head(3)
For single ID, you can assign a new column using cumcount and then filter by conditions:
df["count"] = df.groupby("ID")["Value"].cumcount()
print (df.loc[(df["ID"].ne(0))|((df["ID"].eq(0)&(df["count"]<3)))])
ID Value count
64 0 6 0
77 0 6 1
83 0 0 2
44 1 7 0
58 1 5 1
40 1 2 2
35 1 7 3
89 1 9 4
19 1 7 5
10 1 3 6
45 2 4 0
68 2 1 1
74 2 4 2
75 2 8 3
34 2 4 4
60 2 6 5
78 2 0 6
31 2 8 7
97 2 9 8
2 2 6 9
93 2 8 10
13 2 2 11
...

I will do without groupby
df = pd.concat([df.loc[df.ID==0].head(3),df.loc[df.ID!=0]])

Thanks Henry,
I modified your code and I think this should work as well.
Your df.groupby("ID").head(3) is great. Thanks.
df["count"] = df.groupby("ID")["Value"].cumcount()
df.loc[df["count"]<3].drop(['count'], axis=1)

How to check if the value is between two consecutive rows in dataframe or numpy array?

I need to write a code that checks if a certain value is between 2 consecutive rows, for example:
row <50 < next row
meaning if the value is between row and its consecutive row.
df = pd.DataFrame(np.random.randint(0,100,size=(10, 1)), columns=list('A'))
The output is:
A
0 67
1 78
2 53
3 44
4 84
5 2
6 63
7 13
8 56
9 24
What I'd like to do is to check if (let's say I have a set value) "50" is between all consecutive rows.
Say, we check if 50 is between 67 and 78 and then between 78 and 53, obviously the answer is no, therefore in column B the result would be 0.
Now, if we check if 50 is between 53 and 44, then we'll get 1 in column B and we'll use cumsum() to count how many times the value of 50 is between consecutive rows in column A.
UPDATE: Let's say, if I have column C where I have 2 categories only: 1 and 2. How would I ensure that the check is performed within each of the categories separately? In other words, the check is reset once the category changes?
The desired output is:
A B C count
0 67 0 1 0
1 78 0 1 0
2 53 0 1 0
3 44 1 2 0
4 84 2 1 0
5 2 3 2 0
6 63 4 1 0
7 13 5 2 0
8 56 6 1 0
9 24 7 1 1
Greatly appreciate your help.

Let's just subtract "50" from series and check sign change:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[67,78,53,44,84,2,63,13,56,24]}, columns=list('A'))
s = df['A'] - 50
df['count'] = np.sign(s).diff().fillna(0).ne(0).cumsum()
print(df)
Output:
A count
0 67 0
1 78 0
2 53 0
3 44 1
4 84 2
5 2 3
6 63 4
7 13 5
8 56 6
9 24 7

This should work:
what = ((df.A < 50) | (50 > df.A.shift())) & ((df.A > 50) | (50 < df.A.shift()))
df['count'] = what.astype(int).cumsum()
A count
0 67 0
1 78 0
2 53 0
3 44 1
4 84 2
5 2 3
6 63 4
7 13 5
8 56 6
9 24 7
or
df = pd.DataFrame(np.random.randint(0,100,size=(10, 1)), columns=list('A'))
what = ((df.A < 50) | (50 > df.A.shift())) & ((df.A > 50) | (50 < df.A.shift()))
df['count'] = what.astype(int).cumsum()
A count
0 45 0
1 53 1
2 44 2
3 87 3
4 47 4
5 13 4
6 20 4
7 89 5
8 81 5
9 53 5
Would your second output look like this:
df
A B C
0 67 0 1
1 78 0 1
2 53 0 1
3 44 1 2
4 84 2 1
5 2 3 2
6 63 4 1
7 13 5 2
8 56 6 1
9 24 7 1
df_new = df
what = ((df_new.A < 50) | (50 > df_new.A.shift())) & ((df_new.A > 50) | (50 < df_new.A.shift())) & ((df_new.C == df_new.C.shift() ))
df['count'] = what.astype(int).cumsum()
df
Ouput:
A B C count
0 67 0 1 0
1 78 0 1 0
2 53 0 1 0
3 44 1 2 0
4 84 2 1 0
5 2 3 2 0
6 63 4 1 0
7 13 5 2 0
8 56 6 1 0
9 24 7 1 1

Is there a way to parallelise this loop over a pandas dataframe?

I've got 8 columns in my dataframe where the values can range from the numbers 1 to 99. I'm trying to create additional columns i.e. '1_observed', '2_observed', '3_observed'... '99_observed', depending on whether any of those numbers are appearing in that observation.
The code I'm running works, but it's very slow as I'm running a loop within a loop.
for index in df[observed_nos].index:
for num in range(1,100):
if num in df[observed_nos].iloc[index].values:
df[f'{num}_observed'][index] = '1'
else:
df[f'{num}_observed'][index] = '0'
I am not massively experienced with pandas, is there a way to run this faster / parallelise it?
EDIT:
sample dataframe as below:
data = [[12,14,33,45,22,33,86,56],
[78,12,52,1,99,22,4,19],
[15,11,7,23,30,19,63,71],
[2,14,52,36,17,95,8,39],
[1,4,31,42,72,23,67,15],
[92,28,32,52,77,19,55,10],
[42,16,64,25,92,11,26,36],
[12,21,38,17,90,32,41,74],
]
df = pd.DataFrame(data, columns =['N1','N2','N3','N4','N5','N6','N7','N8'])
this results in the following df
. N1 N2 N3 N4 N5 N6 N7 N8
0 12 14 33 45 22 33 86 56
1 78 12 52 1 99 22 4 19
2 15 11 7 23 30 19 63 71
3 2 14 52 36 17 95 8 39
4 1 4 31 42 72 23 67 15
5 92 28 32 52 77 19 55 10
6 42 16 64 25 92 11 26 36
7 12 21 38 17 90 32 41 74
the output i'm trying to get to would be as follows:
N1 N2 N3 N4 N5 N6 N7 N8 1_ 2_ 3_ 4_ 5_ 6_ 7_ 8_ 9_
0 12 14 33 45 22 33 86 56 0 0 0 0 0 0 0 0 0
1 78 12 52 1 99 22 4 19 1 0 0 1 0 0 0 0 0
2 15 11 7 23 30 19 63 71 0 0 0 0 0 0 1 0 0
3 2 14 52 36 17 95 8 39 0 1 0 0 0 0 0 1 0
4 1 4 31 42 72 23 67 15 1 0 0 1 0 0 0 0 0
5 92 28 32 52 77 19 55 10 0 0 0 0 0 0 0 0 0
6 42 16 64 25 92 11 26 36 0 0 0 0 0 0 0 0 0
7 12 21 38 17 90 32 41 74 0 0 0 0 0 0 0 0 0
(I've truncated the above example to only check for the occurrences of numbers 1 - 9, to make it easier to view)

I played around a bit with pandas and found another solution that might work for you. Although it does not provide 0 and 1, but instead Trua and False (you might have to modify the data to fit your needs).
Also, you might want to check if this code is in fact any faster than yours:
rand = np.random.RandomState(42)
items = rand.randint(1, 100, 800).reshape((100, 8))
df = pd.DataFrame(items)
for n in range(1, 100):
df[f'{n}_observed'] = df[df == n].any(axis=1)
print(df)
Hope this suggestion helps you!

If the numbers are positive numbers, you can just treat them as indices on a 2D mapping grid. So, use a boolean grid array, use the given values as column indices and for each row of the input dataframe, use the same row indices. Now, with these row and col indices, assign True values there. This grid will also be your final array, when viewed as int array. So, the implementation would look something like this -
def presence_df(df, start=1, stop=99, str_postfix='_'):
c = df.to_numpy()
n = len(c)
id_ar = np.zeros((n,stop+1), dtype=bool)
id_ar[np.arange(n)[:,None],c] = 1
df1 = pd.DataFrame(id_ar[:,start:stop+1].view('i1'))
df1.columns = [str(i) + str_postfix for i in range(start,stop+1)]
df_out = pd.concat([df,df1],axis=1)
return df_out
Sample run -
In [41]: np.random.seed(0)
...: df = pd.DataFrame(np.random.randint(1,10,(8,10)))
In [42]: presence_df(df,start=1, stop=9)
Out[42]:
0 1 2 3 4 5 6 7 8 9 1_ 2_ 3_ 4_ 5_ 6_ 7_ 8_ 9_
0 6 1 4 4 8 4 6 3 5 8 1 0 1 1 1 1 0 1 0
1 7 9 9 2 7 8 8 9 2 6 0 1 0 0 0 1 1 1 1
2 9 5 4 1 4 6 1 3 4 9 1 0 1 1 1 1 0 0 1
3 2 4 4 4 8 1 2 1 5 8 1 1 0 1 1 0 0 1 0
4 4 3 8 3 1 1 5 6 6 7 1 0 1 1 1 1 1 1 0
5 9 5 2 5 9 2 2 8 4 7 0 1 0 1 1 0 1 1 1
6 8 3 1 4 6 5 5 7 5 5 1 0 1 1 1 1 1 1 0
7 4 5 5 9 5 4 8 6 6 1 1 0 0 1 1 1 0 1 1
Timings on given sample data and larger one -
In [17]: data = [[12,14,33,45,22,33,86,56],
...: [78,12,52,1,99,22,4,19],
...: [15,11,7,23,30,19,63,71],
...: [2,14,52,36,17,95,8,39],
...: [1,4,31,42,72,23,67,15],
...: [92,28,32,52,77,19,55,10],
...: [42,16,64,25,92,11,26,36],
...: [12,21,38,17,90,32,41,74],
...: ]
...: df = pd.DataFrame(data, columns =['N1','N2','N3','N4','N5','N6','N7','N8'])
In [18]: %timeit presence_df(df)
1000 loops, best of 3: 575 µs per loop
In [19]: df = pd.DataFrame(np.random.randint(1,100,(1000,1000)))
In [20]: %timeit presence_df(df)
100 loops, best of 3: 8.86 ms per loop

Pivot column and column values in pandas dataframe

I have a dataframe that looks like this, but with 26 rows and 110 columns:
index/io 1 2 3 4
0 42 53 23 4
1 53 24 6 12
2 63 12 65 34
3 13 64 23 43
Desired output:
index io value
0 1 42
0 2 53
0 3 23
0 4 4
1 1 53
1 2 24
1 3 6
1 4 12
2 1 63
2 2 12
...
I have tried with dict and lists by transforming the dataframe to dict, and then create a new list with index values and update in new dict with io.
indx = []
for key, value in mydict.iteritems():
for k, v in value.iteritems():
indx.append(key)
indxio = {}
for element in indx:
for key, value in mydict.iteritems():
for k, v in value.iteritems():
indxio.update({element:k})
I know this is too far probably, but it's the only thing I could think of. The process was too long, so I stopped.

You can use set_index, stack, and reset_index().
df.set_index("index/io").stack().reset_index(name="value")\
.rename(columns={'index/io':'index','level_1':'io'})
Output:
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43

You need set_index + stack + rename_axis + reset_index:
df = df.set_index('index/io').stack().rename_axis(('index','io')).reset_index(name='value')
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
Solution with melt, rename, but there is different order of values, so sort_values is necessary:
d = {'index/io':'index'}
df = df.melt('index/io', var_name='io', value_name='value') \
.rename(columns=d).sort_values(['index','io']).reset_index(drop=True)
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43
And alternative solution for numpy lovers:
df = df.set_index('index/io')
a = np.repeat(df.index, len(df.columns))
b = np.tile(df.columns, len(df.index))
c = df.values.ravel()
cols = ['index','io','value']
df = pd.DataFrame(np.column_stack([a,b,c]), columns = cols)
print (df)
index io value
0 0 1 42
1 0 2 53
2 0 3 23
3 0 4 4
4 1 1 53
5 1 2 24
6 1 3 6
7 1 4 12
8 2 1 63
9 2 2 12
10 2 3 65
11 2 4 34
12 3 1 13
13 3 2 64
14 3 3 23
15 3 4 43

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Dataframe GroupBy Function - python

Related

Subtract fixed row value in reference to column value in pandas dataframe

Remove duplicates after a certain number of occurrences

How to check if the value is between two consecutive rows in dataframe or numpy array?

Is there a way to parallelise this loop over a pandas dataframe?

Pivot column and column values in pandas dataframe

Categories

Resources