I have a dataframe like this:
source_ip dest_ip dest_ip_usage dest_ip_count
0 4:107:27:41 1:23:54:114 2028544 2
1 4:107:27:41 2:112:41:134 3145639 1
2 4:107:27:41 2:112:41:178 4145639 1
3 1:192:221:145 32:107:27:134 6358000 1
4 1:192:344:161 3:243:82:204 6341359 1
I am using this syntax: df1 = df.groupby(['source_ip','dest_ip'])['dest_ip_usage'].nlargest(2)
But I am not getting the group keys in the index; the result is just:
0 2028544
1 3145639
2 4145639
3 6358000
4 6341359
You won't get the group keys in the index that way when grouping on multiple columns.
If you want to find the nlargest with a groupby over multiple columns, use the apply method on the specific columns you are ranking:
df.groupby('source_ip')[['dest_ip','dest_ip_usage']].apply(lambda x: x.nlargest(2, columns=['dest_ip_usage']))
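An equivalent approach that avoids apply, in case it feels heavy, is to sort once and take the head of each group (a sketch using the same column names):
# Sort by usage descending, then keep the top 2 rows per source_ip.
df1 = (df.sort_values('dest_ip_usage', ascending=False)
         .groupby('source_ip', sort=False)
         .head(2))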
I have a dataframe which I want to group, filter columns by regex, and then sum.
My code looks like this:
import pandas as pd
df = pd.DataFrame({'ID': [1,1,2,2,3,3],
                   'Invasive': [12,1,1,0,1,0],
                   'invasive': [1,4,5,3,4,6],
                   'Wild': [4,7,1,0,0,0],
                   'wild': [0,0,9,8,3,2],
                   'Crop': [0,0,0,0,0,0],
                   'Crop_2': [2,3,2,2,1,2]})
df.groupby(['ID']).filter(regex='(Invasive)|(invasive)|(Wild)|(wild)').sum()
The error message I get is:
DataFrameGroupBy.filter() missing 1 required positional argument: 'func'
I get the same error message if groupby comes after filter.
Why does this happen? Where do I input the func argument?
EDIT:
My expected output is one column, summed across the filtered columns and grouped by ID. E.g.:
ID Output
0 1 29
1 2 27
2 3 16
What you want to do doesn't make sense as written: GroupBy.filter filters out whole groups of rows (that is what its required func argument decides), and it is not to be confused with DataFrame.filter.
You likely want to filter the columns, then aggregate:
df.filter(regex='(?i)(Invasive|Wild)').groupby(df['ID']).sum()
NB. I replaced (Invasive)|(invasive)|(Wild)|(wild) with (?i)(Invasive|Wild), which means 'Invasive OR Wild' regardless of case.
Output:
Invasive invasive Wild wild
ID
1 13 5 11 0
2 1 8 1 17
3 1 10 0 5
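For contrast, here is what GroupBy.filter and its func argument are actually for: keeping or dropping whole groups of rows (a minimal illustration on the same df, not part of the solution):
# Keep only the ID groups whose total 'Wild' count is positive;
# func receives each group as a DataFrame and must return a boolean.
df.groupby('ID').filter(lambda g: g['Wild'].sum() > 0)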
Edit: the expected output you show needs a further summation per row:
out = (df.filter(regex='(?i)(Invasive|Wild)')
         .groupby(df['ID']).sum()
         .sum(axis=1)
         .reset_index(name='Output')
      )
# or with the row summation first:
out = (df.filter(regex='(?i)(Invasive|Wild)')
         .sum(axis=1)
         .groupby(df['ID']).sum()
         .reset_index(name='Output')
      )
Output:
ID Output
0 1 29
1 2 27
2 3 16
I have this dataframe and I need to drop all duplicates, but I need to keep the first AND last values.
For example:
1 0
2 0
3 0
4 0
output:
1 0
4 0
I tried df.column.drop_duplicates(keep=("first","last")) but it doesn't work, it returns:
ValueError: keep must be either "first", "last" or False
Does anyone know a workaround for this?
Thanks
You could use the pandas concat function to create a dataframe with both the first and last values.
pd.concat([
    df['X'].drop_duplicates(keep='first'),
    df['X'].drop_duplicates(keep='last'),
])
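One caveat (this assumes the column is named X, which the question doesn't state): a value that occurs only once survives both drop_duplicates calls and therefore appears twice in the concat. Dropping repeated index entries afterwards handles that:
out = pd.concat([
    df['X'].drop_duplicates(keep='first'),
    df['X'].drop_duplicates(keep='last'),
])
# A unique value is both its own first and last occurrence,
# so remove the second copy via its repeated index label.
out = out[~out.index.duplicated()]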
You can't keep both first and last in a single drop_duplicates call, so the trick is to concat the keep='first' and keep='last' results.
When you concat, you have to avoid duplicating the rows that were never duplicates to begin with (they survive both calls), so only take the indexes from the second dataframe that are not already in the first. (A merge/join might work as well.)
import pandas as pd
d = {1:0,2:0,10:1, 3:0,4:0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])
print(df)
cnt
1 0
2 0
10 1
3 0
4 0
Then do this:
d1 = df.drop_duplicates(keep='first')
d2 = df.drop_duplicates(keep='last')
d3 = pd.concat([d1, d2.loc[list(set(d2.index) - set(d1.index))]])
print(d3)
cnt
1 0
10 1
4 0
Use a groupby on your column (here literally named column), then keep the first and last row of each group. If you ever want to check for duplicate values across more than one column, you can extend the columns you include in the groupby.
df = pd.DataFrame({'column':[0,0,0,0]})
Input:
column
0 0
1 0
2 0
3 0
(df.groupby('column', as_index=False)
   .apply(lambda x: x if len(x) == 1 else x.iloc[[0, -1]])
   .reset_index(level=0, drop=True))
Output:
column
0 0
3 0
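A vectorized alternative to the apply, keeping every row that is either the first or the last occurrence of its value (a sketch on the same df):
# duplicated(keep='first') leaves first occurrences unflagged;
# duplicated(keep='last') leaves last occurrences unflagged.
mask = ~df.duplicated('column') | ~df.duplicated('column', keep='last')
print(df[mask])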
I am trying to find the values that are not common between two dataframes.
df1
df1 = pd.DataFrame({
'contact_id': [1,2,3,4]
})
contact_id
0 1
1 2
2 3
3 4
df2
df2 = pd.DataFrame({
'contact_id': [1,3,4,5]
})
contact_id
0 1
1 3
2 4
3 5
Expected output
contact_id
0 2
1 5
I have tried the code below, but it gives an incorrect result: it only returns the values of df2 that are missing from df1 (i.e. 5), and never finds 2.
df = df2[~df2.contact_id.isin(df1.contact_id)]
Can anyone help me get the expected output?
Try merge() with indicator=True, then filter out the rows found in both frames using query(), and finally drop the extra column using drop():
out = (df1.merge(df2, indicator=True, on='contact_id', how='outer')
          .query("_merge != 'both'")
          .drop(columns='_merge'))
Output of out:
contact_id
1 2
4 5
Alternatively, you can concat the two datasets, drop every row that appears in both (keep=False), and reset the index. If you want to keep the original index, remove the reset_index call.
pd.concat([df1, df2]).drop_duplicates(keep=False).reset_index(drop=True)
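If the column holds plain hashable values, Python sets give the same symmetric difference (a minimal sketch; note it discards the original index and order):
diff = set(df1['contact_id']).symmetric_difference(df2['contact_id'])
out = pd.DataFrame({'contact_id': sorted(diff)})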
In pandas I can drop columns that contain nulls with:
data = data.dropna(axis='columns')
I am trying to do something similar with a cuDF dataframe, but the API doesn't offer this functionality.
My solution is to convert to a pandas df, run the command above, then convert back to cuDF. Is there a better solution?
cuDF now supports column based dropna, so the following will work:
import cudf
df = cudf.DataFrame({'a':[0,1,None], 'b':[None,0,2], 'c':[1,2,3]})
print(df)
a b c
0 0 null 1
1 1 0 2
2 null 2 3
df.dropna(axis='columns')
c
0 1
1 2
2 3
On older cuDF versions where column-wise dropna is not yet implemented, you can check the null_count of each column and drop the ones with null_count > 0.
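A sketch of that fallback, assuming the same df as above (Series.null_count is cuDF's per-column null counter):
# Keep only the columns that contain no nulls.
keep = [col for col in df.columns if df[col].null_count == 0]
df = df[keep]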
I have a pandas DataFrame which contains information in columns which I would like to extract into a new column.
It is best explained visually:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Number Type 1': [1, 2, np.nan],
                   'Number Type 2': [np.nan, 3, 4],
                   'Info': list('abc')})
The table shows the initial DataFrame with the Number Type 1 and Number Type 2 columns.
I would like to extract the types and create a new Type column, reshaping the DataFrame accordingly:
the numbers are collapsed into a single Number column, and the types are extracted into the Type column. The information in the Info column stays bound to the numbers (e.g. 2 and 3 share the same information, b).
What is the best way to do this in Pandas?
Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
4 b 2 3
5 c 2 4
Another solution with set_index and stack:
df = df.set_index('Info').stack().rename_axis(('Info','Type')).reset_index(name='Number')
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print (df)
Info Type Number
0 a 1 1
1 b 1 2
2 b 2 3
3 c 2 4
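A third option that avoids the string extraction step: pd.wide_to_long splits the column names into stub and numeric suffix for you (a sketch assuming exactly the column names above):
out = (pd.wide_to_long(df, stubnames='Number Type', i='Info', j='Type', sep=' ')
         .dropna()
         .rename(columns={'Number Type': 'Number'})
         .reset_index())
out['Number'] = out['Number'].astype(int)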