Get only matching rows for groups in Pandas groupby - python

I have the following df:
d = {"Col1":['a','d','b','c','a','d','b','c'],
"Col2":['x','y','x','z','x','y','z','y'],
"Col3":['n','m','m','l','m','m','l','l'],
"Col4":[1,4,2,2,1,4,2,2]}
df = pd.DataFrame(d)
When I groupby on three fields, I get the result:
gb = df.groupby(['Col1', 'Col2', 'Col3'])['Col4'].agg(['sum', 'mean'])
How can I extract only the groups whose row matches at least one row of another group on the grouped columns? Please see the picture below; I want to get the highlighted rows.
I want to get the rows in red, on the basis of the ones in blue and black, which match each other.
Apologies if my statement is ambiguous. Any help would be appreciated.

You can call reset_index, then use duplicated with boolean indexing to filter your dataframe:
gb = gb.reset_index()
gb[gb.duplicated(subset=['Col2','Col3'], keep=False)]
Output:
Col1 Col2 Col3 sum mean
0 a x m 1 1
2 b x m 2 2
3 b z l 2 2
5 c z l 2 2

Make a table with all allowed combinations and then inner join it with this dataframe.
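One possible reading of the inner-join suggestion above, as a minimal sketch: treat a (Col2, Col3) pair as "allowed" when it occurs under more than one Col1 value, then inner join the grouped result against that table. The names `pair_counts`, `allowed`, `matched` and `n_groups` are illustrative, not from the original answer.

```python
import pandas as pd

d = {"Col1": ['a', 'd', 'b', 'c', 'a', 'd', 'b', 'c'],
     "Col2": ['x', 'y', 'x', 'z', 'x', 'y', 'z', 'y'],
     "Col3": ['n', 'm', 'm', 'l', 'm', 'm', 'l', 'l'],
     "Col4": [1, 4, 2, 2, 1, 4, 2, 2]}
df = pd.DataFrame(d)

gb = (df.groupby(['Col1', 'Col2', 'Col3'])['Col4']
        .agg(['sum', 'mean'])
        .reset_index())

# Assumption: a (Col2, Col3) pair is "allowed" if it appears under
# more than one distinct Col1 value.
pair_counts = (gb.groupby(['Col2', 'Col3'])['Col1']
                 .nunique()
                 .reset_index(name='n_groups'))
allowed = pair_counts.loc[pair_counts['n_groups'] > 1, ['Col2', 'Col3']]

# The inner join keeps only the grouped rows whose pair is in the allowed table.
matched = gb.merge(allowed, on=['Col2', 'Col3'], how='inner')
print(matched)
```

On this data the join should select the same four groups as the duplicated-based filter, but with a fresh 0..n-1 index.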

Related

Dropping column if more than half of the values are same - Python

I have pandas df which looks like the pic:
I want to delete any column in which more than half of the values are the same, and I don't know how to do this.
I tried using pandas.Series.value_counts,
but with no luck.
You can iterate over the columns, count the occurrences of each value with value_counts as you tried, and check whether the most common value makes up more than 50% of the column's data.
n = len(df)
cols_to_drop = []
for e in df.columns:
    max_occ = df[e].value_counts().iloc[0]  # occurrences of the most common value in column e
    if 2 * max_occ > n:  # more than half the length of the dataset
        cols_to_drop.append(e)
df = df.drop(cols_to_drop, axis=1)
You can use apply + value_counts and take the first value to get the max count per column:
count = df.apply(lambda s: s.value_counts().iat[0])
col1 4
col2 2
col3 6
dtype: int64
Thus, simply turn it into a mask depending on whether the greatest count is more than half len(df), and slice:
count = df.apply(lambda s: s.value_counts().iat[0])
df.loc[:, count.le(len(df)/2)] # use 'lt' if needed to drop if exactly half
output:
col2
0 0
1 1
2 0
3 1
4 2
5 3
Used input:
df = pd.DataFrame({'col1': [0,1,0,0,0,1],
'col2': [0,1,0,1,2,3],
'col3': [0,0,0,0,0,0],
})
Boolean slicing with a comprehension
df.loc[:, [
    df.shape[0] // s.value_counts().max() >= 2
    for _, s in df.items()  # iteritems() was removed in pandas 2.0
]]
col2
0 0
1 1
2 0
3 1
4 2
5 3
Credit to @mozway for the input data.
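The answers above can also be wrapped in a small reusable helper. This is only a sketch: `drop_dominated_columns` and its `threshold` parameter are hypothetical names, and it assumes the columns contain no missing values (value_counts ignores NaN by default, so the shares would then be relative to the non-null rows).

```python
import pandas as pd

def drop_dominated_columns(df, threshold=0.5):
    """Drop columns whose most frequent value covers more than `threshold` of the rows."""
    # value_counts(normalize=True) returns relative frequencies, most frequent first.
    top_share = df.apply(lambda s: s.value_counts(normalize=True).iloc[0])
    # Keep only the columns where the top value does not exceed the threshold.
    return df.loc[:, top_share <= threshold]

df = pd.DataFrame({'col1': [0, 1, 0, 0, 0, 1],
                   'col2': [0, 1, 0, 1, 2, 3],
                   'col3': [0, 0, 0, 0, 0, 0]})

result = drop_dominated_columns(df)
print(result.columns.tolist())
```

With the sample input only col2 should remain, since 0 fills 4 of 6 rows in col1 and all rows in col3.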

grouping and printing the maximum in a dataframe in python

A dataframe has 3 Columns
A B C
^0hand(%s)leg$ 27;30 42;54
^-(%s)hand0leg 39;30 47;57
^0hand(%s)leg$ 24;33 39;54
Column A holds regex patterns. When two rows have the same pattern (here, rows 1 and 3), they should be merged into a single row that keeps only the maximum of each value, as below:
Output:
A B C
^0hand(%s)leg$ 27;33 42;54
^-(%s)hand0leg 39;30 47;57
Any leads will be helpful
You could use:
(df.set_index('A').stack()
   .str.extract(r'(\d+);(\d+)').astype(int)
   .groupby(level=[0,1]).agg('max').astype(str)
   .assign(s=lambda d: d[0]+';'+d[1])['s'] # OR # .apply(';'.join, axis=1)
   .unstack(1)
   .loc[df['A'].unique()] ## only if the order of rows matters
   .reset_index()
)
output:
A B C
0 ^0hand(%s)leg$ 27;33 42;54
1 ^-(%s)hand0leg 39;30 47;57
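As an alternative sketch of the same idea, a plain Python helper can compute the position-wise maximum of the "x;y" strings within each group. The helper name `pairwise_max` is made up for this illustration, and the code assumes every cell holds integers separated by ';'.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['^0hand(%s)leg$', '^-(%s)hand0leg', '^0hand(%s)leg$'],
    'B': ['27;30', '39;30', '24;33'],
    'C': ['42;54', '47;57', '39;54'],
})

def pairwise_max(values):
    """Return the position-wise maximum of strings such as '27;30' and '24;33'."""
    # Split every cell into a tuple of ints, e.g. '27;30' -> (27, 30).
    pairs = [tuple(int(p) for p in v.split(';')) for v in values]
    # zip(*pairs) regroups the numbers by position so max() works per position.
    return ';'.join(str(max(position)) for position in zip(*pairs))

# sort=False keeps the groups in order of first appearance.
result = df.groupby('A', sort=False)[['B', 'C']].agg(pairwise_max).reset_index()
print(result)
```

This is slower than the vectorised chain on large frames, but it may be easier to adapt if the cells hold more than two numbers.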

Compare 2 columns and merge rows on match?

New to coding here and trying to make a project. I want to compare two DataFrames and, if any row in the Product column matches, copy that row over to a new DataFrame. The rows in DF1 and DF2 will not be in the same position, so I want to compare each row of DF1 against the entire column in DF2. Is there an easy solution to this?
Take a look at this: https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/
You can try:
df3 = df1[df1['Product'].isin(set(df2['Product']))]
Which gives:
>>> df1 = pd.DataFrame({'prod':[1,2], 'ean':[5,6]})
>>> df1
prod ean
0 1 5
1 2 6
>>> df2 = pd.DataFrame({'prod':[3,2]})
>>> df2
prod
0 3
1 2
>>> df1[df1['prod'].isin(set(df2['prod']))]
prod ean
1 2 6
To explain:
df1[...] is to filter the rows of df1 based on criterion ...
I'm using a set() here so it is fast to check whether a row in df1 is in df2's "Product" column
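If the matching rows should also carry columns from the second frame, an inner merge is another option. A minimal sketch with made-up data (the `ean` and `price` columns and their values are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'Product': ['apple', 'pear', 'plum'], 'ean': [5, 6, 7]})
df2 = pd.DataFrame({'Product': ['kiwi', 'plum', 'apple'], 'price': [1.0, 2.5, 0.9]})

# The inner merge keeps only products present in both frames,
# regardless of their row positions.
df3 = df1.merge(df2, on='Product', how='inner')
print(df3)
```

Unlike isin, a merge repeats a row of df1 once for every matching row in df2, so duplicates in df2's Product column would multiply the output rows.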

Unable to fillna a column in dataframe with values from a series

I am trying to fillna in a specific column of the dataframe with the mean of not-null values of the same type (based on the value from another column in the dataframe).
Here is the code to reproduce my issue:
import numpy as np
import pandas as pd
df = pd.DataFrame()
#Create the DataFrame with a column of floats
#And a column of labels (str)
np.random.seed(seed=6)
df['col0']=np.random.randn(100)
lett=['a','b','c','d']
df['col1']=np.random.choice(lett,100)
#Set some of the floats to NaN for the test.
toz = np.random.randint(0,100,25)
df.loc[toz,'col0']=np.nan
df[df['col0'].isnull()==False].count()
#Create a DF with mean for each label.
w_series = df.loc[(~df['col0'].isnull())].groupby('col1').mean()
col0
col1
a 0.057199
b 0.363899
c -0.068074
d 0.251979
#This dataframe has our label (a,b,c,d) as the index. Doesn't seem
#to work when I try to df.fillna(w_series). So I try to reindex such
#that the labels (a,b,c,d) become a column again.
#
#For some reason I cannot just do a set_index and expect the
#old index to become column. So I append the new index and
#then reset it.
w_series['col2'] = list(range(w_series.size))
w_frame = w_series.set_index('col2',append=True)
w_frame.reset_index('col1',inplace=True)
#I try fillna() with the new dataframe.
df.fillna(w_frame)
Still no luck:
col0 col1
0 0.057199 b
1 0.729004 a
2 0.217821 d
3 0.251979 c
4 -2.486781 a
5 0.913252 b
6 NaN a
7 NaN b
What am I doing wrong?
How do I fillna the dataframe with the averages of specific rows that match the missing information?
Does the size of the dataframe being filled (df) and the filler dataframe (w_frame) have to match?
Thank you
fillna aligns on the index, so the target dataframe and the filler dataframe need to share the same index:
df.set_index('col1')['col0'].fillna(w_frame.set_index('col1').col0).reset_index()
# I only show the first 11 rows
Out[74]:
col1 col0
0 b 0.363899
1 a 0.729004
2 d 0.217821
3 c -0.068074
4 a -2.486781
5 b 0.913252
6 a 0.057199
7 b 0.363899
8 c -0.068074
9 b -0.429894
10 a 2.631281
My way to fillna:
df['col0']=df.groupby("col1")['col0'].transform(lambda x: x.fillna(x.mean()))
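A small deterministic sketch of the group-mean fill, using made-up values so the result is easy to check by hand (the means are 2.0 for 'a' and 15.0 for 'b'):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col0': [1.0, np.nan, 3.0, 10.0, np.nan, 20.0],
    'col1': ['a', 'a', 'a', 'b', 'b', 'b'],
})

# transform('mean') returns a Series aligned with df's own index, holding
# each row's group mean (NaN values are skipped when averaging).
group_mean = df.groupby('col1')['col0'].transform('mean')

# Because both sides share the same index, fillna can align them row by row.
df['col0'] = df['col0'].fillna(group_mean)
print(df)
```

This avoids the index mismatch in the question: the filler has one value per row of df rather than one per label.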

drop rows that have duplicated indices

I have a DataFrame where each observation is identified by an index. However, for some indices the DF contains several observations. One of them has the most updated data. I would like to drop the outdated duplicated rows based on values from some of the columns.
For example, in the following DataFrame, how can I drop the first and third rows with index = 122?
index col1 col2
122 - -
122 one two
122 - two
123 four one
124 five -
That is, I would like to get a final DF like this:
index col1 col2
122 one two
123 four one
124 five -
This seems to be a very common problem when we get data through several different retrievals over time. But I cannot figure out an efficient way of cleaning the data.
You could use groupby/transform to create a boolean mask which is True where the group count is greater than 1 and any of the values in the row equals '-'. Then you could use df.loc[~mask] to select the unmasked rows of df:
import pandas as pd
df = pd.read_table('data', sep=r'\s+')
count = df.groupby(['index'])['col1'].transform('count') > 1
mask = (df['col1'] == '-') | (df['col2'] == '-')
mask = mask & count
result = df.loc[~mask]
print(result)
yields
index col1 col2
0 122 one two
1 123 four one
2 124 five -
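The mask approach can be tried without a data file by building the frame inline. This is a sketch under the assumption that 'index' is an ordinary column and '-' marks outdated values; the names `is_dup` and `has_placeholder` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'index': [122, 122, 122, 123, 124],
    'col1': ['-', 'one', '-', 'four', 'five'],
    'col2': ['-', 'two', 'two', 'one', '-'],
})

# True for rows whose 'index' value occurs more than once.
is_dup = df.groupby('index')['col1'].transform('count') > 1
# True for rows that contain the '-' placeholder in any data column.
has_placeholder = (df[['col1', 'col2']] == '-').any(axis=1)

# Drop a row only when it is both duplicated and incomplete.
result = df.loc[~(is_dup & has_placeholder)]
print(result)
```

The filtered frame keeps the original row labels; call reset_index(drop=True) if a fresh 0..n-1 index is wanted.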
If the index is already a column then you can call drop_duplicates and pass keep='last' (older pandas versions spelled this take_last=True):
In [14]:
df.drop_duplicates('index', keep='last')
Out[14]:
index col1 col2
1 122 - two
2 123 four one
If it's actually your index, then you'd be better off calling reset_index first, performing the above step, and then setting the index back again.
Index also has a drop_duplicates method, but it only removes duplicates from the index itself. The returned index cannot be used to select the de-duplicated rows of the df, so I recommend the above approach of calling drop_duplicates on the df itself.
EDIT
Based on your new information, the easiest way may be to replace the outdated data with NaN values and drop these:
In [36]:
df.replace('-', np.nan).dropna()
Out[36]:
col1 col2
index
122 one two
123 four one
Another Edit
What you could do is groupby the index and take the first values of the remaining columns, then call reset_index:
In [56]:
df.groupby('index')[['col1', 'col2']].first().reset_index()
Out[56]:
index col1 col2
0 122 - -
1 123 four one
2 124 five -
