I currently have two dataframes, df_ages and df_count:
In [1]: df_ages
Out [1]:
Enrolled Age
1 Y 44
2 Y 35
3 N 37
4 Y 55
5 N 26
6 Y 19
7 N 18
8 N 49
9 Y 26
10 Y 25
11 Y 25
12 Y 32
13 Y 25
14 N 50
15 N 58
In [2]: df_count
Out [2]:
Min Max counts percentage
1 18 25
2 26 35
3 36 45
4 46 55
5 56 65
I am looking for code to populate the counts column of df_count with the number of people whose age falls within the Min and Max range given in the preceding columns.
The percentage column should hold that count as a percentage of the total number of entries in df_ages.
The desired resulting output is shown below:
In [2]: df_count
Out [2]:
Min Max counts percentage
1 18 25 5 33.3
2 26 35 4 26.7
3 36 45 2 13.3
4 46 55 3 20.0
5 56 65 1 6.7
You can use apply on the rows together with Series.between:
df_count['counts'] = df_count.apply(lambda row: df_ages['Age'].between(row['Min'], row['Max']).sum(), axis=1)
df_count['percentage'] = df_count['counts'].div(len(df_ages)).mul(100).round(1)
print(df_count)
Min Max counts percentage
0 18 25 5 33.3
1 26 35 4 26.7
2 36 45 2 13.3
3 46 55 3 20.0
4 56 65 1 6.7
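For larger frames, the row-wise apply can be swapped for NumPy broadcasting. The following is a minimal sketch, assuming the two frames are rebuilt from the data in the question and that the ranges are inclusive at both ends (the default for Series.between); the intermediate names ages, lo, hi and in_range are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the frames from the question (default 0-based index).
df_ages = pd.DataFrame({
    "Enrolled": list("YYNYNYNNYYYYYNN"),
    "Age": [44, 35, 37, 55, 26, 19, 18, 49, 26, 25, 25, 32, 25, 50, 58],
})
df_count = pd.DataFrame({"Min": [18, 26, 36, 46, 56],
                         "Max": [25, 35, 45, 55, 65]})

ages = df_ages["Age"].to_numpy()             # shape (n_people,)
lo = df_count["Min"].to_numpy()[:, None]     # shape (n_bins, 1)
hi = df_count["Max"].to_numpy()[:, None]     # shape (n_bins, 1)

# Broadcasting yields an (n_bins, n_people) boolean matrix.
in_range = (ages >= lo) & (ages <= hi)

df_count["counts"] = in_range.sum(axis=1)
df_count["percentage"] = (df_count["counts"] / len(df_ages) * 100).round(1)
print(df_count)
```

The boolean matrix has one cell per (range, person) pair, so memory grows with the product of the two lengths; with very many ranges the apply version may be the safer choice.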
Suppose we take a pandas dataframe...
item MRP sold
0 A 10 10
1 A 36 4
2 B 32 6
3 A 26 7
4 B 30 9
Then do a groupby('item').mean()
it becomes
item MRP sold
0 A 24 7
1 B 31 7.5
Is there a way to keep the mean MRP of each unique item and put it in another column, so that every row of the ungrouped dataframe carries the mean for its item?
Basically, what I want is
item MRP sold Mean_MRP
0 A 10 10 24
1 A 36 4 24
2 B 32 6 31
3 A 26 7 24
4 B 30 9 31
There are a lot of items, so I need a fast, optimised way to do this.
Use groupby together with transform:
df = (df
      .assign(Mean_MRP=lambda x: x.groupby('item')['MRP']
                                  .transform('mean')))
df
item MRP sold Mean_MRP
0 A 10 10 24
1 A 36 4 24
2 B 32 6 31
3 A 26 7 24
4 B 30 9 31
You could also use the pyjanitor module, which makes the code a bit cleaner:
import janitor
df.groupby_agg(by='item',
               agg='mean',
               agg_column_name="MRP",
               new_column_name='Mean_MRP')
Try using transform on the MRP column (selecting the column first, so that the result is a single Series):
df['Mean_MRP'] = df.groupby('item')['MRP'].transform('mean')
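For reference, here is a self-contained sketch of the transform approach on the data from the question, together with an alternative that maps the per-group means back onto the item column. The name group_means and the second column Mean_MRP_map are illustrative and not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["A", "A", "B", "A", "B"],
    "MRP": [10, 36, 32, 26, 30],
    "sold": [10, 4, 6, 7, 9],
})

# transform returns a Series aligned with the original rows.
df["Mean_MRP"] = df.groupby("item")["MRP"].transform("mean")

# Alternative: compute one mean per item, then look it up for each row.
group_means = df.groupby("item")["MRP"].mean()
df["Mean_MRP_map"] = df["item"].map(group_means)

print(df)
```

Both columns hold the same values; the means come back as floats (24.0 and 31.0) even though MRP is an integer column.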
Assume the following:
df1:
x y z
1 10 11
2 20 22
3 30 33
4 40 44
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
3 40 43
4 10 14
4 20 24
4 30 34
df2:
x b
1 100
2 200
df3:
y c
10 1000
20 2000
I want all rows from df1 for which x appears in df2 or y appears in df3, meaning in this case
out:
x y z
1 10 11
2 20 22
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
4 10 14
4 20 24
I would like to do this in pure pandas, with no for loops. It seems standard enough to me, but I don't really know what to look for.
You can use isin for both conditions, combine them with a bitwise OR, and use the result to perform boolean indexing on the dataframe:
df1[df1.x.isin(df2.x) | df1.y.isin(df3.y)]
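As a runnable check, the sketch below rebuilds the three frames from the question and applies the same mask. The variable names mask and out are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "x": [1, 2, 3, 4, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "y": [10, 20, 30, 40, 20, 30, 40, 10, 30, 40, 10, 20, 40, 10, 20, 30],
    "z": [11, 22, 33, 44, 21, 31, 41, 12, 32, 42, 31, 23, 43, 14, 24, 34],
})
df2 = pd.DataFrame({"x": [1, 2], "b": [100, 200]})
df3 = pd.DataFrame({"y": [10, 20], "c": [1000, 2000]})

# Each isin call gives a boolean Series; | combines them element-wise.
mask = df1["x"].isin(df2["x"]) | df1["y"].isin(df3["y"])
out = df1[mask]
print(out)
```

The parentheses-free form works here because method calls bind tighter than |; when the conditions are comparisons such as df1.x > 1, each one needs its own parentheses.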
I have a dataframe like the following.
idx vals
0 10
1 21
2 12
3 33
4 14
5 55
6 16
7 77
I would like to perform a cumsum (and avoid a for loop), but only over rows with the same idx mod 2. For instance, for row 3 I would like to obtain 21+33=54, while for row 4, 10+12+14=36.
Any ideas?
You just need groupby here, using idx % 2 as the grouping key:
df.vals.groupby(df.idx%2).cumsum()
Out[75]:
0 10
1 21
2 22
3 54
4 36
5 109
6 52
7 186
Name: vals, dtype: int64
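The same idea as a self-contained sketch, assuming idx is an ordinary column as in the question; the result column name cumsum_mod2 is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "idx": [0, 1, 2, 3, 4, 5, 6, 7],
    "vals": [10, 21, 12, 33, 14, 55, 16, 77],
})

# Group by parity of idx; cumsum then runs separately within
# the even rows and the odd rows, keeping the original row order.
df["cumsum_mod2"] = df["vals"].groupby(df["idx"] % 2).cumsum()
print(df)
```

If idx is the index rather than a column, df.index % 2 can be passed as the grouping key instead.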
I need to find the quickest way to sort each row in a dataframe with millions of rows and around a hundred columns.
So something like this:
A B C D
3 4 8 1
9 2 7 2
Needs to become:
A B C D
8 4 3 1
9 7 2 2
Right now I'm applying sort to each row and building up a new dataframe row by row. I'm also doing a couple of extra, less important things to each row (hence why I'm using pandas and not numpy). Could it be quicker to instead create a list of lists and then build the new dataframe at once? Or do I need to go cython?
I think I would do this in numpy:
In [11]: a = df.values
In [12]: a.sort(axis=1) # no ascending argument
In [13]: a = a[:, ::-1] # so reverse
In [14]: a
Out[14]:
array([[8, 4, 3, 1],
[9, 7, 2, 2]])
In [15]: pd.DataFrame(a, df.index, df.columns)
Out[15]:
A B C D
0 8 4 3 1
1 9 7 2 2
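I had thought this might work, but it sorts the column labels rather than the values in each row (DataFrame.sort has since been removed from pandas; sort_index(axis=1, ascending=False) orders the columns the same way):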
In [21]: df.sort(axis=1, ascending=False)
Out[21]:
D C B A
0 1 8 4 3
1 2 7 2 9
Ah, pandas raises:
In [22]: df.sort(df.columns, axis=1, ascending=False)
ValueError: When sorting by column, axis must be 0 (rows)
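The NumPy route from this answer can be sketched without touching the original frame, assuming all columns share one numeric dtype. np.sort returns a sorted copy, so df stays as it was; the names values and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [3, 9], "B": [4, 2], "C": [8, 7], "D": [1, 2]})

# Sort each row ascending (axis=1), then reverse the column order
# of the array to get descending values.
values = np.sort(df.to_numpy(), axis=1)[:, ::-1]

# Rebuild a frame with the original index and column labels.
result = pd.DataFrame(values, index=df.index, columns=df.columns)
print(result)
```

After sorting, the column labels no longer describe where a value came from; they are kept only as positional names.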
To add to the answer given by @Andy-Hayden: you can do this in place on the whole frame. I'm not really sure why this works, but it does. There seems to be no control over the sort order.
In [97]: A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
In [98]: A
Out[98]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [99]: A.values.sort
Out[99]: <function ndarray.sort>
In [100]: A
Out[100]:
one two three four five
0 22 63 72 46 49
1 43 30 69 33 25
2 93 24 21 56 39
3 3 57 52 11 74
In [101]: A.values.sort()
In [102]: A
Out[102]:
one two three four five
0 22 46 49 63 72
1 25 30 33 43 69
2 21 24 39 56 93
3 3 11 52 57 74
In [103]: A = A.iloc[:,::-1]
In [104]: A
Out[104]:
five four three two one
0 72 63 49 46 22
1 69 43 33 30 25
2 93 56 39 24 21
3 74 57 52 11 3
I hope someone can explain the why of this, just happy that it works 8)
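A possible explanation, offered as a sketch rather than a guarantee: ndarray.sort() sorts in place along the last axis by default (axis=-1), which for a 2-D array means each row is sorted independently. Whether the frame itself changes depends on whether A.values hands back a view of the frame's data, which can vary with the dtypes involved and the pandas version. The example below therefore sorts an explicit copy and rebuilds the frame, using the values shown in the answer above; the names a, ascending and descending are illustrative.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(
    [[22, 63, 72, 46, 49],
     [43, 30, 69, 33, 25],
     [93, 24, 21, 56, 39],
     [3, 57, 52, 11, 74]],
    columns=["one", "two", "three", "four", "five"],
)

a = A.to_numpy(copy=True)  # explicit copy, so A is never modified
a.sort()                   # in place, along the last axis (each row)

ascending = pd.DataFrame(a, index=A.index, columns=A.columns)
# Reversing the array, not the frame, keeps the column labels in place.
descending = pd.DataFrame(a[:, ::-1], index=A.index, columns=A.columns)
print(descending)
```

Unlike A.iloc[:, ::-1], which also reverses the column labels, this keeps the labels in their original order.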
You could use DataFrame.apply.
E.g.:
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
one two three four five
0 2 75 44 53 46
1 18 51 73 80 66
2 35 91 86 44 25
3 60 97 57 33 79
A = A.apply(np.sort, axis=1, raw=True)  # raw=True returns a DataFrame rather than a Series of arrays
print(A)
one two three four five
0 2 44 46 53 75
1 18 51 66 73 80
2 25 35 44 86 91
3 33 57 60 79 97
Since you want it in descending order, you can simply multiply the dataframe with -1 and sort it.
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
A = A * -1
A = A.apply(np.sort, axis=1, raw=True)
A = A * -1
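The negation trick can be checked on fixed data. This is a minimal sketch, assuming numeric columns and using the values printed earlier in this answer instead of random ones; passing raw=True is an assumption made here so that apply hands back a DataFrame of the same shape.

```python
import numpy as np
import pandas as pd

A = pd.DataFrame(
    [[2, 75, 44, 53, 46],
     [18, 51, 73, 80, 66],
     [35, 91, 86, 44, 25],
     [60, 97, 57, 33, 79]],
    columns=["one", "two", "three", "four", "five"],
)

# Negate, sort each row ascending, negate again: the result is descending.
desc = -((-A).apply(np.sort, axis=1, raw=True))
print(desc)
```

Negation only works for numeric data; for other dtypes, sorting ascending and reversing the result is the more general route.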
Instead of using the pd.DataFrame constructor, an easier way to assign the sorted values back is to index with a list of column names (double brackets):
original dataframe:
A B C D
3 4 8 1
9 2 7 2
df[['A', 'B', 'C', 'D']] = np.sort(df)[:, ::-1]
A B C D
0 8 4 3 1
1 9 7 2 2
This way you can also sort a part of the columns:
df[['B', 'C']] = np.sort(df[['B', 'C']])[:, ::-1]
A B C D
0 3 8 4 1
1 9 7 2 2
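The partial-column variant can be run end to end as follows. This is a sketch on the two-row frame from the question; the list name cols is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [3, 9], "B": [4, 2], "C": [8, 7], "D": [1, 2]})

cols = ["B", "C"]
# Sort only the selected columns within each row, descending,
# and write the values back into the same columns.
df[cols] = np.sort(df[cols].to_numpy(), axis=1)[:, ::-1]
print(df)
```

Columns A and D are left untouched, so only the values in B and C are reordered within each row.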
One could try this approach, which returns a DataFrame with the original index and columns:
import pandas as pd
import numpy as np
A = pd.DataFrame(np.random.randint(0,100,(4,5)), columns=['one','two','three','four','five'])
print (A)
print(type(A))
one two three four five
0 85 27 64 50 55
1 3 90 65 22 8
2 0 7 64 66 82
3 58 21 42 27 30
<class 'pandas.core.frame.DataFrame'>
B = A.apply(lambda x: np.sort(x), axis=1, raw=True)
print(B)
print(type(B))
one two three four five
0 27 50 55 64 85
1 3 8 22 65 90
2 0 7 64 66 82
3 21 27 30 42 58
<class 'pandas.core.frame.DataFrame'>