Delete keys with missing values? - python

My dataframe is
ID Alphabet Number1 Number2
1 A NaN 9
1 A 3 5
1 A 1 4
1 A 2 4
2 B 7 3
2 B 2 8
2 B 4 1
2 B 8 5
3 C 2 2
3 C 1 9
4 D 2 3
4 D 6 2
4 D 8 NaN
I got unique Alphabets by doing
df.groupby('Alphabet')['ID'].nunique()
and the result is
A 1
B 1
C 1
D 1
but I want to store only the Alphabets that do NOT have any missing data in them.
I want the result to look like
B 1
C 1
and from this console result, how would I store "B" and "C" into a list?

IIUC, using all()
s = df.groupby('Alphabet').apply(lambda x: x.notnull().all()).all(axis=1)
df.groupby('Alphabet').ID.nunique()[s[s].index]
Out[1082]:
Alphabet
B 1
C 1
Name: ID, dtype: int64
Or
df.loc[df.Alphabet.isin(s[s].index)].groupby('Alphabet').ID.nunique()
Out[1095]:
Alphabet
B 1
C 1
Name: ID, dtype: int64
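To answer the last part of the question, the group labels without missing data are simply the True entries of s, so (a small sketch reusing the s computed above):
alphabet_list = s[s].index.tolist()
print(alphabet_list)
# ['B', 'C']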

Related

pandas dataframe duplicate values count not properly working

The value counts are df['ID'].value_counts().values, which gives:
array([4, 3, 3, 1], dtype=int64)
input:
ID emp
a 1
a 1
b 1
a 1
b 1
c 1
c 1
a 1
b 1
c 1
d 1
When I shuffle the ID column and then run
df.loc[~df.duplicated(keep='first', subset=['ID']), 'emp']= df['ID'].value_counts().values
output:
ID emp
a 4
c 3
d 3
c 1
b 1
a 1
c 1
a 1
b 1
b 1
a 1
expected result:
ID emp
a 4
c 3
d 1
c 1
b 3
a 1
c 1
a 1
b 1
b 1
a 1
Problem: the count is assigned without checking which ID it belongs to.
The problem is that df['ID'].value_counts() returns a Series with a different number of values than the original data. To fill the new column with the counts, use Series.map:
df.loc[~df.duplicated(subset=['ID']), 'emp'] = df['ID'].map(df['ID'].value_counts())
Or use GroupBy.transform with 'size':
df.loc[~df.duplicated(subset=['ID']), 'emp'] = df.groupby('ID')['ID'].transform('size')
The output Series has only 4 values and cannot be assigned back directly, because its index (df['ID'].value_counts().index) differs from df.index:
print (df['ID'].value_counts())
a 4
b 3
c 3
d 1
Name: ID, dtype: int64
If you convert it to a NumPy array, the values are assigned purely by position: ~df.duplicated(subset=['ID']) is True four times (once for each of the groups a, b, c, d), so the four counts are written in the order 4, 3, 3, 1 rather than matched to their IDs, which is the reason for the wrong output:
print (df['ID'].value_counts().values)
[4 3 3 1]
What is needed is a new column (Series) aligned with df.index:
print (df['ID'].map(df['ID'].value_counts()))
0 4
1 4
2 3
3 4
4 3
5 3
6 3
7 4
8 3
9 3
10 1
Name: ID, dtype: int64
print (df.groupby('ID')['ID'].transform('size'))
0 4
1 4
2 3
3 4
4 3
5 3
6 3
7 4
8 3
9 3
10 1
Name: ID, dtype: int64
For your given sample dataframe, df.loc[~df.duplicated(keep='first', subset=['ID']), 'emp'] = df['ID'].value_counts().values happens to produce the desired output, but only because the row order matches the value_counts order; a safer version is:
cond = ~df.duplicated(keep='first', subset=['ID'])
df.loc[cond, 'emp'] = df.loc[cond, 'ID'].map(df['ID'].value_counts())
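A minimal, self-contained sketch of the map approach, reconstructing the shuffled input from the expected output above:
import pandas as pd

df = pd.DataFrame({'ID': ['a', 'c', 'd', 'c', 'b', 'a', 'c', 'a', 'b', 'b', 'a'],
                   'emp': 1})
cond = ~df.duplicated(subset=['ID'])
# value_counts gives a=4, b=3, c=3, d=1; map aligns each count with its row's ID
df.loc[cond, 'emp'] = df['ID'].map(df['ID'].value_counts())
print(df)  # the first occurrence of each ID carries its count, the rest keep 1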

Python Counting Same Values For Specific Columns

If I have a dataframe:
A B C D
1 1 2 2 1
2 1 1 2 1
3 3 1 0 1
4 2 4 4 4
I want to add columns B and C together and count how many of the sums differ from column D. The desired output is:
A B C B+C D
1 1 2 2 4 1
2 1 1 2 3 1
3 3 1 0 1 1
4 2 4 4 8 4
There are 3 rows where "B+C" differs from "D".
Could you please help me with this?
You could do something like:
df.B.add(df.C).ne(df.D).sum()
# 3
If you need to add the column:
df['B+C'] = df.B.add(df.C)
diff = df['B+C'].ne(df.D).sum()
print(f'There are {diff} rows where "B+C" differs from "D"')
# There are 3 rows where "B+C" differs from "D"
df.insert(3, 'B+C', df['B'] + df['C'])
Here 3 is the column position at which 'B+C' is inserted.
df.head()
A B C B+C D
0 1 2 2 4 1
1 1 1 2 3 1
2 3 1 0 1 1
3 2 4 4 8 4
After that you can follow the steps of @yatu:
df['B+C'].ne(df['D'])
0 True
1 True
2 False
3 True
dtype: bool
df['B+C'].ne(df['D']).sum()
3
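For reference, a self-contained sketch combining both answers (reconstructing the sample frame from the question):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3, 2],
                   'B': [2, 1, 1, 4],
                   'C': [2, 2, 0, 4],
                   'D': [1, 1, 1, 4]})
df.insert(3, 'B+C', df['B'] + df['C'])  # insert the sum at column position 3
diff = df['B+C'].ne(df['D']).sum()      # count rows where the sum differs from D
print(diff)  # 3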

How to pandas groupby one column and filter dataframe based on the minimum unique values of another column?

I have a data frame that looks like this:
CP AID type
1 1 b
1 2 b
1 3 a
2 4 a
2 4 b
3 5 b
3 6 a
3 7 b
I would like to group by the CP column and filter so it only returns rows where the CP has at least 3 unique values in the AID column.
The result should look like this:
CP AID type
1 1 b
1 2 b
1 3 a
3 5 b
3 6 a
3 7 b
You can groupby in combination with unique:
m = df.groupby('CP').AID.transform('unique').str.len() >= 3
print(df[m])
CP AID type
0 1 1 b
1 1 2 b
2 1 3 a
5 3 5 b
6 3 6 a
7 3 7 b
Or as RafaelC mentioned in the comments:
m = df.groupby('CP').AID.transform('nunique').ge(3)
print(df[m])
CP AID type
0 1 1 b
1 1 2 b
2 1 3 a
5 3 5 b
6 3 6 a
7 3 7 b
You can also do it with an explicit count, as long as you use nunique so that duplicate AID values within a CP are not over-counted, and >= 3 so that "at least 3" is honored:
count = df1[['CP', 'AID']].groupby('CP')['AID'].nunique().reset_index()
df1 = df1[df1['CP'].isin(count.loc[count['AID'] >= 3, 'CP'])]
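A third option, not in the answers above, is GroupBy.filter, which keeps whole groups satisfying a predicate; a sketch (typically slower than the transform-based masks on large frames, but arguably the most readable):
out = df.groupby('CP').filter(lambda g: g['AID'].nunique() >= 3)
print(out)  # rows for CP 1 and 3 only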

Pandas reverse column values groupwise

I want to reverse a column's values in my dataframe, but only on an individual "groupby" level. Below is a minimal demonstration example, where I want to "flip" values that belong to the same letter A, B or C:
df = pd.DataFrame({"group": ["A", "A", "A", "B", "B", "B", "B", "C", "C"],
                   "value": [1, 3, 2, 4, 4, 2, 3, 2, 5]})
group value
0 A 1
1 A 3
2 A 2
3 B 4
4 B 4
5 B 2
6 B 3
7 C 2
8 C 5
My desired output looks like this (the column is added instead of replaced purely for demonstration purposes):
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
As always, when I don't see a proper vectorized approach, I end up messing with loops just for the sake of the final output, and my current code hurts me very much:
for i in list(set(df["group"].values.tolist())):
    reversed_group = df.loc[df["group"] == i, "value"].values.tolist()[::-1]
    df.loc[df["group"] == i, "value_desired"] = reversed_group
Pandas gurus, please show me the way :)
You can use transform
In [900]: df.groupby('group')['value'].transform(lambda x: x[::-1])
Out[900]:
0 2
1 3
2 1
3 3
4 2
5 4
6 4
7 5
8 2
Name: value, dtype: int64
Details
In [901]: df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x[::-1])
In [902]: df
Out[902]:
group value value_desired
0 A 1 2
1 A 3 3
2 A 2 1
3 B 4 3
4 B 4 2
5 B 2 4
6 B 3 4
7 C 2 5
8 C 5 2
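One caveat, as an assumption about newer pandas versions rather than part of the original answer: because x[::-1] keeps its original index, some versions will re-align the result and silently undo the reversal. Passing the underlying array sidesteps index alignment:
# reverse the raw values so pandas cannot re-align them by index
df['value_desired'] = df.groupby('group')['value'].transform(lambda x: x.values[::-1])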

Handling duplicate rows in python

I have a data frame df, let's say with 5 columns: a, b, c, d, e.
a b c d e
1 6 x 8 3
2 3 y 2 3
3 5 d 1 1
3 4 g 3 4
5 3 z 3 1
This is what I want to do: for all the rows with the same value in column a, I want to drop duplicates, but the value of column b should be summed across those rows, and for the rest of the columns I want to keep the first value.
Final Data frame will be :
a b c d e
1 6 x 8 3
2 3 y 2 3
3 9 d 1 1
5 3 z 3 1
How to do this?
I'd assign to column 'b' the result of grouping on 'a' and summing; you can then drop the duplicates:
In [171]:
df['b'] = df.groupby('a')['b'].transform('sum')
df
Out[171]:
a b c d e
0 1 6 x 8 3
1 2 3 y 2 3
2 3 9 d 1 1
3 3 9 g 3 4
4 5 3 z 3 1
In [172]:
df.drop_duplicates('a')
Out[172]:
a b c d e
0 1 6 x 8 3
1 2 3 y 2 3
2 3 9 d 1 1
4 5 3 z 3 1
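The same result in a single step, as a sketch using an aggregation dict over the same columns:
out = df.groupby('a', as_index=False).agg({'b': 'sum', 'c': 'first', 'd': 'first', 'e': 'first'})
print(out)  # matches the frame above, with a fresh 0..3 index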
