How to create a dataframe from another dataframe based on GroupBy condition - python

I don't know how to create a dataframe from another dataframe based on a groupby condition. For example, I have a dataframe to which I apply:
flights_df.groupby(by='DepHour')['Cancelled'].value_counts()
I obtain something like this:
DepHour  Cancelled
0.0      0             20361
         1                 7
1.0      0              5857
         1                 4
2.0      0              1850
         1                 1
3.0      0               833
4.0      0              3389
         1                 1
5.0      0            148143
         1                24
As can be seen, for DepHour == 3.0 there are no cancelled flights.
Using the same dataframe I used to generate this output, I want to create another dataframe containing only the DepHour values that have no cancelled flights. In this case, the output will be a dataframe containing only rows where DepHour == 3.0.
I know I can use a mask, but that only filters rows where Cancelled == 0 (i.e. the Cancelled == 0 rows of every DepHour are included, not just the hours with no cancellations at all).
Thanks, and sorry for my bad English!

There might be a cleaner way (probably without using groupby twice), but this should work:
flights_df.groupby('DepHour') \
    .filter(lambda x: (x['Cancelled'].unique() == [0]).all()) \
    .groupby('DepHour')['Cancelled'].value_counts()
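The answer above leaves room for a cleaner variant; here is a sketch of one (assuming the same flights_df and that Cancelled is coded 0/1): flag each row according to whether its DepHour had any cancellation, then filter with plain boolean indexing.

# True for rows whose DepHour had no cancelled flights at all
no_cancellations = flights_df.groupby('DepHour')['Cancelled'].transform('max').eq(0)
result = flights_df[no_cancellations]
result.groupby('DepHour')['Cancelled'].value_counts()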

Related

Columns are missing in the result of groupby

I have a dataframe like this:
   source_ip      dest_ip        dest_ip_usage  dest_ip_count
0  4:107:27:41    1:23:54:114          2028544              2
1  4:107:27:41    2:112:41:134         3145639              1
2  4:107:27:41    2:112:41:178         4145639              1
3  1:192:221:145  32:107:27:134        6358000              1
4  1:192:344:161  3:243:82:204         6341359              1
I am using the syntax: df1 = df.groupby(['source_ip','dest_ip'])['dest_ip_usage'].nlargest(2)
But I am not getting the group indexes, just this result:
0    2028544
1    3145639
2    4145639
3    6358000
4    6341359
This is not possible when using groupby on multiple columns directly.
If you want nlargest with a groupby on multiple columns, use the apply method on the specific columns and call nlargest inside it:
df.groupby(['source_ip'])[['dest_ip', 'dest_ip_usage']].apply(lambda x: x.nlargest(2, columns=['dest_ip_usage']))
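If it helps, an equivalent pattern (a sketch against the same df) is to sort by usage first and take the head of each group, which also keeps the original row index intact:

# Top 2 rows by dest_ip_usage per source_ip, original index preserved
df.sort_values('dest_ip_usage', ascending=False).groupby('source_ip').head(2)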

Apply different functions to different columns with a single pandas groupby command

My data is stored in df. I have multiple users per group. I want to group df by group and apply different functions to different columns. The twist is that I would like to assign custom names to the new columns during this process.
np.random.seed(123)
df = pd.DataFrame({"user": range(4), "group": [1, 1, 2, 2],
                   "crop": ["2018-01-01", "2018-01-01", "2018-03-01", "2018-03-01"],
                   "score": np.random.randint(400, 1000, 4)})
df["crop"] = pd.to_datetime(df["crop"])
print(df)
   user  group       crop  score
0     0      1 2018-01-01    910
1     1      1 2018-01-01    765
2     2      2 2018-03-01    782
3     3      2 2018-03-01    722
I want to get the mean of score, and the min and max values of crop grouped by group and assign custom names to each new column. The desired output should look like this:
   group  mean_score   min_crop   max_crop
0      1       837.5 2018-01-01 2018-01-01
1      2       752.0 2018-03-01 2018-03-01
I don't know how to do this in a one-liner in Python. In R, I would use data.table and get the following:
df[, list(mean_score = mean(score),
          max_crop = max(crop),
          min_crop = min(crop)), by = group]
I know I could group the data and use .agg combined with a dictionary. Is there an alternative way where I can custom-name each column in this process?
Try creating a function with the required operations using groupby().apply():
def f(x):
    d = {}
    d['mean_score'] = x['score'].mean()
    d['min_crop'] = x['crop'].min()
    d['max_crop'] = x['crop'].max()
    return pd.Series(d, index=['mean_score', 'min_crop', 'max_crop'])

data = df.groupby('group').apply(f)
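For completeness: if your pandas version is 0.25 or newer, named aggregation gives the same custom-named columns without apply (a sketch over the same df):

data = df.groupby('group').agg(
    mean_score=('score', 'mean'),
    min_crop=('crop', 'min'),
    max_crop=('crop', 'max'),
).reset_index()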

comparing each value in two columns

How can I compare two columns in a dataframe and create a new column based on the difference of those two columns efficiently?
I have a feature in my table that has a lot of missing values, and I need to backfill that information by using other tables in the database that contain the same feature. I have used np.select to compare the feature in my original table with the same feature in another table, but I feel like there should be an easier way.
E.g.: pd.DataFrame({'A': [1, 2, 3, 4, np.nan], 'B': [1, np.nan, 30, 4, np.nan]})
I expect the new column to contain the values [1, 2, "different", 4, np.nan]. Any help will be appreciated!
pandas.Series.combine_first or pandas.DataFrame.combine_first could be useful here. These operate like a SQL COALESCE and combine the two columns by choosing the first non-null value if one exists.
df = pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
C = df.A.combine_first(df.B)
C looks like:
0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
Then, to capture your requirement that two different non-null values should give "different" when combined, just find those indices and update the values.
mask = ~df.A.isna() & ~df.B.isna() & (df.A != df.B)
C[mask] = 'different'
C now looks like:
0            1
1            2
2    different
3            4
4          NaN
Another way is to use pd.DataFrame.iterrows with nunique:
import pandas as pd
df['C'] = [s['A'] if s.nunique()<=1 else 'different' for _, s in df.iterrows()]
Output:
     A     B          C
0  1.0   1.0          1
1  2.0   NaN          2
2  3.0  30.0  different
3  4.0   4.0          4
4  NaN   NaN        NaN
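A vectorized variant of the first answer (a sketch using the same df) avoids iterrows entirely: build the coalesced column, then mask the rows where both values are present but disagree.

# Rows where both columns have values that disagree
conflict = df['A'].notna() & df['B'].notna() & df['A'].ne(df['B'])
df['C'] = df['A'].combine_first(df['B']).mask(conflict, 'different')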

Mean value of DataFrame columns based on Columns name extension

I have a DataFrame A in Jupyter that looks like the following:
Index  Var1.A.1  Var1.B.1  Var1.CA.1  Var2.A.1  Var2.B.1  Var2.CA.1
0             1        21          3         3         4          4
1             3         5          4         9         5          1
...
100           9        75          2         4         8          2
I'd like to assess the mean value based on the extension of the name, i.e.
Mean value of .A.1
Mean Value of .B.1
Mean value of .CA.1
For example, to assess the mean value of the variables with extension .A.1, I've tried the following, which doesn't return what I'm looking for:
List=['.A.1', '.B.1', '.CA.1']
A[List[List.str.contains('.A.1')]].mean()
However, this way I get the mean values of the different variables, including CA.1, which is not what I'm looking for.
Any advice?
Thanks
If you want the mean per row over all columns that share the same name after the first ., use groupby with a lambda function and mean:
df = df.groupby(lambda x: x.split('.', 1)[-1], axis=1).mean()
print (df)
     A.1   B.1  CA.1
0    2.0  12.5   3.5
1    6.0   5.0   2.5
100  6.5  41.5   2.0
Here is a third option:
columns = A.columns
A[[s for s in columns if ".A.1" in s]].stack().mean()
A.filter(like='.A.1') gives you the columns containing the '.A.1' substring.
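Putting that filter hint to use (a sketch, assuming the DataFrame is named A as in the question):

A.filter(like='.A.1').mean()         # mean of each '.A.1' column
A.filter(like='.A.1').mean().mean()  # single overall mean across all '.A.1' values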

Add column to pandas dataframe based on previous values

I have a dataframe with an observation number, an id, and a value:
Obs#  Id  Value
---------------
1     1   5.643
2     1   7.345
3     2   0.567
4     2   1.456
I want to calculate a new column that is the mean of the previous values for that specific Id.
I am trying to use something like this, but it only acquires the previous value:
df.groupby('Id')['Value'].apply(lambda x: x.shift(1) ...
My question is: how do I acquire the range of previous values, filtered by the Id, so I can calculate the mean?
So the new column, based on this example, should be:
5.643
6.494
0.567
1.0115
It seems that you want expanding, then mean
df.groupby('Id').Value.expanding().mean()
Id
1.0  1    5.6430
     2    6.4940
2.0  3    0.5670
     4    1.0115
Name: Value, dtype: float64
You can also do it like this (note that np must be imported):
import numpy as np
df = pd.DataFrame({'Obs': [1, 2, 3, 4], 'Id': [1, 1, 2, 2], 'Value': [5.643, 7.345, 0.567, 1.456]})
df.groupby('Id')['Value'].apply(lambda x: x.cumsum() / np.arange(1, len(x) + 1))
It gives this output:
5.643
6.494
0.567
1.0115
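One caveat: both answers include the current row in the mean, which matches the expected output shown in the question. If "previous values" were meant to exclude the current row, a shifted variant (a sketch) would be:

# Expanding mean of strictly earlier rows within each Id (NaN for each group's first row)
df.groupby('Id')['Value'].apply(lambda s: s.expanding().mean().shift())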
