How can I extract data using 'groupby'? - python

import pandas as pd

df = pd.DataFrame({'date': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
                   'name': list('aaaaabbbbbccccc'),
                   'v1': [10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20, 30, 40, 50],
                   'v2': [10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20, 30, 40, 50],
                   'v3': [10, 20, 30, 40, 50, 10, 20, 30, 40, 50, 10, 20, 30, 40, 50]})
a = list(set(df.name))
plus = []
for i in a:
    sep = df[df.name == i]
    sep2 = sep[(sep.v1 >= 10) & (sep.v2 >= 20) & (sep.v3 <= 40)]
    plus.append(sep2)
result = pd.concat(plus)
print(result)
I know this is not a good example anyway;
I would like to handle the data separately by name,
but this approach takes too long on big data.
How can I extract the data using groupby?
Even better if a function is used (def ... apply ...):
df.groupby(['name'])(df['v1'] > 20) ...???? It cannot work...

Looking at your desired data set, I don't think you need to group your df - you can simply filter it:
In [112]: df.query('v1 >= 10 and v2 >= 20 and v3 <= 40')
Out[112]:
    date name  v1  v2  v3
1      2    a  20  20  20
2      3    a  30  30  30
3      4    a  40  40  40
6      2    b  20  20  20
7      3    b  30  30  30
8      4    b  40  40  40
11     2    c  20  20  20
12     3    c  30  30  30
13     4    c  40  40  40
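That said, since the question explicitly asks for a groupby with a function, the per-group pattern can be sketched with apply. This is only a sketch of the pattern (the helper name pick is made up); for a plain row-wise condition it is slower than the query above:

```python
import pandas as pd

df = pd.DataFrame({'date': [1, 2, 3, 4, 5] * 3,
                   'name': list('aaaaabbbbbccccc'),
                   'v1': [10, 20, 30, 40, 50] * 3,
                   'v2': [10, 20, 30, 40, 50] * 3,
                   'v3': [10, 20, 30, 40, 50] * 3})

def pick(g):
    # the same filter, applied to each name-group separately
    return g[(g.v1 >= 10) & (g.v2 >= 20) & (g.v3 <= 40)]

# group_keys=False keeps the original index instead of adding a group level
result = df.groupby('name', group_keys=False).apply(pick)
print(result)
```

A custom pick can do anything per group (rolling windows, per-group thresholds, ...), which is where apply earns its cost over a vectorized filter.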

Related

Pandas - How to offset all values if less than the previous value on the whole dataframe

I've got a dataframe as follows:
time  value
0     30
1     40
5     55
10    10
11    25
20    10
As the stored value should only increment (but it sometimes gets reset), I want to create output like the following:
0     30
1     40
5     55
10    65   // offset 55
11    80   // offset 55
20    90   // offset 80
Any easy way to achieve it with pandas?
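A loop-free sketch of one way to do this (assuming a value column as above): detect each reset with diff, accumulate the pre-reset values as a running offset with cumsum, and add the offset back:

```python
import pandas as pd

df = pd.DataFrame({'time': [0, 1, 5, 10, 11, 20],
                   'value': [30, 40, 55, 10, 25, 10]})

s = df['value']
# a reset is any point where the raw value drops below its predecessor
reset = s.diff() < 0
# at each reset the running offset grows by the previous raw value, which,
# together with the offset already applied, equals the previous corrected value
offset = s.shift(1).where(reset, 0).cumsum()
df['value'] = (s + offset).astype(int)
print(df)
```

This yields 30, 40, 55, 65, 80, 90 for the sample data, matching the desired output above.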

Is there any way to ungroup the groupby dataframe while adding an additional column?

Suppose we take a pandas dataframe...
item MRP sold
0 A 10 10
1 A 36 4
2 B 32 6
3 A 26 7
4 B 30 9
Then, after a groupby('item').mean(),
it becomes:
item MRP sold
0 A 24 7
1 B 31 7.5
Is there a way to retain the mean values of MRP for all the unique items and create another column that contains those values when ungrouped?
Basically what I want is:
item MRP sold Mean_MRP
0 A 10 10 24
1 A 36 4 24
2 B 32 6 31
3 A 26 7 24
4 B 30 9 31
There are a lot of items, so I need a faster, optimised way to do this.
Use the transform function:
df = df.assign(Mean_MRP=lambda x: x.groupby('item')['MRP'].transform('mean'))
df
item MRP sold Mean_MRP
0 A 10 10 24
1 A 36 4 24
2 B 32 6 31
3 A 26 7 24
4 B 30 9 31
You could also use the pyjanitor module, which makes the code a bit cleaner:
import janitor

df.groupby_agg(by='item',
               agg='mean',
               agg_column_name='MRP',
               new_column_name='Mean_MRP')
Try using transform (select the MRP column first, otherwise the transform returns one column per value column):
df['Mean_MRP'] = df.groupby('item')['MRP'].transform('mean')

Select rows from pandas df, where index appears somewhere in another df

Assume the following:
df1:
x y z
1 10 11
2 20 22
3 30 33
4 40 44
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
3 40 43
4 10 14
4 20 24
4 30 34
df2:
x b
1 100
2 200
df3:
y c
10 1000
20 2000
I want all rows from df1, for which either x or y appears in either df2 or df3 respectively, meaning in this case
out:
x y z
1 10 11
2 20 22
1 20 21
1 30 31
1 40 41
2 10 12
2 30 32
2 40 42
3 10 31
3 20 23
4 10 14
4 20 24
I would like to do this in pure pandas with no for loops. It seems standard enough to me, but I don't really know what to search for.
You can use isin for both conditions, chain them with a bitwise OR, and perform boolean indexing on the dataframe with the result:
df1[df1.x.isin(df2.x) | df1.y.isin(df3.y)]
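As a self-contained check (rebuilding the sample frames from the question), the filter returns exactly the 12 rows listed above:

```python
import pandas as pd

df1 = pd.DataFrame({'x': [1, 2, 3, 4, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
                    'y': [10, 20, 30, 40, 20, 30, 40, 10, 30, 40,
                          10, 20, 40, 10, 20, 30],
                    'z': [11, 22, 33, 44, 21, 31, 41, 12, 32, 42,
                          31, 23, 43, 14, 24, 34]})
df2 = pd.DataFrame({'x': [1, 2], 'b': [100, 200]})
df3 = pd.DataFrame({'y': [10, 20], 'c': [1000, 2000]})

# keep rows whose x occurs in df2.x OR whose y occurs in df3.y
out = df1[df1.x.isin(df2.x) | df1.y.isin(df3.y)]
print(out)
```

Only the four rows with x in {3, 4} and y in {30, 40} (z values 33, 44, 43, 34) are dropped.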

Checking if values of a row are consecutive

I have a df like this:
1 2 3 4 5 6
0 5 10 12 35 70 80
1 10 11 23 40 42 47
2 5 26 27 38 60 65
All the values in each row are distinct and in increasing order.
I would like to create a new column with 1 or 0 depending on whether the row contains at least two consecutive numbers.
For example, the second and third rows qualify: they have 10 and 11, and 26 and 27. Is there a more pythonic way than using an iterator?
Thanks
Use DataFrame.diff for the difference per row, compare with 1, check whether at least one value per row is True, and finally cast to integer:
df['check'] = df.diff(axis=1).eq(1).any(axis=1).astype(int)
print (df)
1 2 3 4 5 6 check
0 5 10 12 35 70 80 0
1 10 11 23 40 42 47 1
2 5 26 27 38 60 65 1
To improve performance, use numpy:
import numpy as np

arr = df.to_numpy()
df['check'] = np.any((arr[:, 1:] - arr[:, :-1]) == 1, axis=1).astype(int)

Create a column with periodically repeated values in pandas

I have a sample data frame df with one column:
Cost
30
49
98
10
37
20
10
48
70
20
30
40
50
29
90
39
30
29
50
40
and a list: id_list = ["A","B","C","D"], which contains 4 different id types. I would like to create a new column in the data frame where the first 5 cost values get "A", the next 5 cost values get "B", ..., and the last 5 cost values get "D". In other words, I want to repeat each element of id_list 5 times, so my new df will look like this:
Cost ID
30 A
49 A
98 A
10 A
37 A
20 B
10 B
48 B
70 B
20 B
30 C
40 C
50 C
29 C
90 C
39 D
30 D
29 D
50 D
40 D
My actual data frame has many rows, and the actual id_list has many elements.
The row count is a multiple of 5, so the final data frame fills exactly.
In general I know how to add a column with specific values to a pandas data frame,
but I don't know how to do this with repeated values.
Could you suggest how I can do this in Python?
Thanks in advance for any help.
There is a function in numpy for this, repeat:
import numpy as np

df['New'] = np.repeat(id_list, 5)
df
Out[23]:
Cost New
0 30 A
1 49 A
2 98 A
3 10 A
4 37 A
5 20 B
6 10 B
7 48 B
8 70 B
9 20 B
10 30 C
11 40 C
12 50 C
13 29 C
14 90 C
15 39 D
16 30 D
17 29 D
18 50 D
19 40 D
Numpy-free v1:
df.assign(ID=sum(zip(*[id_list] * 5), tuple()))
Cost ID
0 30 A
1 49 A
2 98 A
3 10 A
4 37 A
5 20 B
6 10 B
7 48 B
8 70 B
9 20 B
10 30 C
11 40 C
12 50 C
13 29 C
14 90 C
15 39 D
16 30 D
17 29 D
18 50 D
19 40 D
Numpy-free v2:
df.assign(ID=[x for x in id_list for _ in range(5)])
I would suggest something like this, which takes advantage of the [item]*n => [item, item, item, ...] expansion that Python does:
labels = ['label1', 'label2', 'label3']
num = 5
repeated = []
for i in labels:
    repeated.extend([i] * num)
You can then add the column to your dataframe, e.g. df['ID'] = repeated.

Categories

Resources