Group By Sum Multiple Columns in Pandas (Ignoring duplicates) - python

I have the following code, where my dataframe contains three columns to be summed (plus one other column):
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
oneframe = pd.concat([df['toBeSummed'],df['toBeSummed2'],df['toBesummed3']], axis=1).reset_index()
temp = oneframe.groupby(['toBeSummed']).size().reset_index()
temp2 = oneframe.groupby(['toBeSummed2']).size().reset_index()
temp3 = oneframe.groupby(['toBesummed3']).size().reset_index()
temp.columns.values[0] = "SameName"
temp2.columns.values[0] = "SameName"
temp3.columns.values[0] = "SameName"
final = pd.concat([temp,temp2,temp3]).groupby(['SameName']).sum().reset_index()
final.columns.values[0] = "Letter"
final.columns.values[1] = "Sum"
The problem is that with this code, every occurrence of each value gets counted, so calling final results in
Letter Sum
0 X 3
1 Y 4
2 Z 5
However, I want each value to be counted at most once per row (i.e. the first row has two X's, so only one X should be counted for that row).
Meaning the desired output is
Letter Sum
0 X 2
1 Y 3
2 Z 3
I can update or add more comments if this is confusing.

Given df:
toBeSummed toBeSummed2 toBesummed3 someColumn
0 X X Y NaN
1 X Y Z NaN
2 Y Y Z NaN
3 Z Z Z NaN
Doing:
sum_cols = ['toBeSummed', 'toBeSummed2', 'toBesummed3']
out = df[sum_cols].apply(lambda x: x.unique(), axis=1).explode().value_counts()
print(out.to_frame('Sum'))
Output:
Sum
Y 3
Z 3
X 2
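For reference, a rough alternative sketch (assuming the same df and sum_cols as above) that makes the per-row de-duplication explicit by reshaping to long format first:
out = (
    df[sum_cols]
    .stack()                            # long format: (row, column) -> letter
    .reset_index(level=1, drop=True)    # keep only the original row label
    .reset_index(name='Letter')         # columns: 'index' (row label) and 'Letter'
    .drop_duplicates()                  # at most one occurrence of a letter per row
    ['Letter']
    .value_counts()
)
print(out.to_frame('Sum'))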

Related

group column values with a difference of 3 (say) in python

I am new to Python. The problem statement is that we have the below data as a dataframe:
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10], 'value':[x,x,y,x,x,x,y,x,z,x,x,y,y,z]})
Diff value
1 x
1 x
2 y
3 x
4 x
4 x
5 y
6 x
7 z
7 x
8 x
9 y
9 y
10 z
We need to group the Diff column in bins of 3 (say), i.e. 0-3, 3-6, 6-9, >=9, and count the values in each bin.
Expected output is like
Diff  x  y  z
0-3   2  1
3-6   3  1
6-9   3     1
>=9      2  1
Example
The example code in the question is wrong; anyone who wants to try this themselves can use the following instead:
df = pd.DataFrame({'Diff':[1,1,2,3,4,4,5,6,7,7,8,9,9,10],
'value':'x,x,y,x,x,x,y,x,z,x,x,y,y,z'.split(',')})
Code
labels = ['0-3', '3-6', '6-9', '>=9']
grouper = pd.cut(df['Diff'], bins=[0, 3, 6, 9, float('inf')], right=False, labels=labels)
pd.crosstab(grouper, df['value'])
Output:
value x y z
Diff
0-3 2 1 0
3-6 3 1 0
6-9 3 0 1
>=9 0 2 1
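For what it's worth, a groupby-based sketch that should produce the same table (assuming the df, labels and grouper defined above); observed=False keeps bins that happen to be empty:
out = (
    df.groupby([grouper, 'value'], observed=False)
      .size()
      .unstack(fill_value=0)   # one column per value, zeros for missing combinations
)
print(out)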

calculating the percentage of count in pandas groupby

I want to discover the underlying pattern between my features and target, so I tried groupby, but instead of the count I want the ratio or percentage relative to the total count of each class.
The following code is similar to what I have done.
import numpy as np
import pandas as pd

fet1=["A","B","C"]
fet2=["X","Y","Z"]
target=["0","1"]
df = pd.DataFrame(data={"fet1":np.random.choice(fet1,1000),"fet2":np.random.choice(fet2,1000),"class":np.random.choice(target,1000)})
df.groupby(['fet1','fet2','class'])['class'].agg(['count'])
You can achieve this more simply with:
out = df.groupby('class').value_counts(normalize=True).mul(100)
Output:
class fet1 fet2
0 A Y 13.859275
B Y 12.366738
X 12.153518
C X 11.513859
Y 10.660981
B Z 10.447761
A Z 10.021322
C Z 9.594883
A X 9.381663
1 A Y 14.124294
C Z 13.935970
B Z 11.676083
Y 11.111111
C Y 11.111111
X 11.111111
A X 10.169492
B X 9.416196
A Z 7.344633
dtype: float64
If you want the same order of multiindex:
out = (df
.groupby('class').value_counts(normalize=True).mul(100)
.reorder_levels(['fet1', 'fet2', 'class']).sort_index()
)
Output:
fet1 fet2 class
A X 0 9.381663
1 10.169492
Y 0 13.859275
1 14.124294
Z 0 10.021322
1 7.344633
B X 0 12.153518
1 9.416196
Y 0 12.366738
1 11.111111
Z 0 10.447761
1 11.676083
C X 0 11.513859
1 11.111111
Y 0 10.660981
1 11.111111
Z 0 9.594883
1 13.935970
dtype: float64
I achieved it by doing this
fet1=["A","B","C"]
fet2=["X","Y","Z"]
target=["0","1"]
df = pd.DataFrame(data={"fet1":np.random.choice(fet1,1000),"fet2":np.random.choice(fet2,1000),"class":np.random.choice(target,1000)})
df.groupby(['fet1','fet2','class'])['class'].agg(['count'])/df.groupby(['class'])['class'].agg(['count'])*100
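A related sketch (assuming the same df), if you want to keep the raw counts next to the percentages: compute the counts once, then divide by a per-class total obtained with transform:
counts = df.groupby(['fet1', 'fet2', 'class']).size().rename('count').reset_index()
# per-class total broadcast back to each row, turned into a percentage
counts['pct'] = counts['count'] / counts.groupby('class')['count'].transform('sum') * 100
print(counts)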

Pandas remove reversed duplicates across two columns

An example DataFrame:
df = pd.DataFrame({'node_a': ['X', 'X', 'X', 'Y', 'Y', 'Y', 'Z', 'Z', 'Z'],
'node_b': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
'value': [ 2, 8, 1, 8, 7, 3, 1, 3, 2]})
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
3 Y X 8
4 Y Y 7
5 Y Z 3
6 Z X 1
7 Z Y 3
8 Z Z 2
I need to remove reversed duplicates, e.g. keep node_a = 'X', node_b = 'Y' but remove node_a = 'Y', node_b = 'X'.
Desired output:
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
4 Y Y 7
5 Y Z 3
8 Z Z 2
Please note I need a general solution not specific to this actual data.
Let's use np.sort along axis=1 to sort node_a and node_b within each row, assign the sorted values as new columns, and then use drop_duplicates to drop the duplicate rows based on those columns:
df[['x', 'y']] = np.sort(df[['node_a', 'node_b']], axis=1)
out = df.drop_duplicates(['x', 'y']).drop(columns=['x', 'y'])
Result:
print(out)
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
4 Y Y 7
5 Y Z 3
8 Z Z 2
You could do the following:
# duplicates regardless the order
un_dups = pd.Series([frozenset(row) for row in df[['node_a', 'node_b']].to_numpy()]).duplicated()
# duplicates with the same order
o_dups = df.duplicated(subset=['node_a', 'node_b'])
# keep only those that are not duplicates with reverse order xor
mask = ~(un_dups ^ o_dups)
print(df[mask])
Output
node_a node_b value
0 X X 2
1 X Y 8
2 X Z 1
4 Y Y 7
5 Y Z 3
8 Z Z 2
The idea is to create a mask that will be False if you are a duplicate in reverse order.
To better understand the approach see the truth values:
node_a node_b value un_dups o_dups xor
0 X X 2 False False False
1 X Y 8 False False False
2 X Z 1 False False False
3 Y X 8 True False True
4 Y Y 7 False False False
5 Y Z 3 False False False
6 Z X 1 True False True
7 Z Y 3 True False True
8 Z Z 2 False False False
As you can see, the xor (exclusive or) is true whenever its inputs differ. Since an ordered duplicate is also a duplicate when unordered, the xor is true only for rows that are duplicates in reverse order.
Finally, notice that the mask is the negation of the xor, i.e. it keeps the rows that are not reverse-order duplicates.
Here's one way to do it: create a temporary column holding node_a and node_b sorted within each row, then drop duplicates on that column, keeping the first instance of each pairing:
df['sorted'] = df.apply(lambda x: ''.join(sorted([x['node_a'],x['node_b']])),axis=1)
# node_a node_b value sorted
# 0 X X 2 XX
# 1 X Y 8 XY
# 2 X Z 1 XZ
# 3 Y X 8 XY
# 4 Y Y 7 YY
# 5 Y Z 3 YZ
# 6 Z X 1 XZ
# 7 Z Y 3 YZ
# 8 Z Z 2 ZZ
df.drop_duplicates(subset='sorted').drop('sorted',axis=1)
# node_a node_b value
# 0 X X 2
# 1 X Y 8
# 2 X Z 1
# 4 Y Y 7
# 5 Y Z 3
# 8 Z Z 2
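A compact variant of the sorting idea that avoids adding temporary columns to df (a sketch, assuming numpy is imported as np):
# sort each (node_a, node_b) pair within its row, then drop rows whose sorted pair already appeared
pairs = pd.DataFrame(np.sort(df[['node_a', 'node_b']].to_numpy(), axis=1), index=df.index)
out = df[~pairs.duplicated()]
print(out)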

How to change values in a dataframe Python

I've searched for an answer for the past 30 min, but the only solutions are either for a single column or in R. I have a dataset in which I want to change the ('Y/N') values to 1 and 0 respectively. I feel like copying and pasting the code below 17 times is very inefficient.
df.loc[df.infants == 'n', 'infants'] = 0
df.loc[df.infants == 'y', 'infants'] = 1
df.loc[df.infants == '?', 'infants'] = 1
My solution is the following. This doesn't cause an error, but the values in the dataframe don't change. I'm assuming I need to do something like df = df_new, but how do I do this?
for coln in df:
    for value in coln:
        if value == 'y':
            value = '1'
        elif value == 'n':
            value = '0'
        else:
            value = '1'
EDIT: There are 17 columns in this dataset, but there is another dataset I'm hoping to tackle which contains 56 columns.
republican n y n.1 y.1 y.2 y.3 n.2 n.3 n.4 y.4 ? y.5 y.6 y.7 n.5 y.8
0 republican n y n y y y n n n n n y y y n ?
1 democrat ? y y ? y y n n n n y n y y n n
2 democrat n y y n ? y n n n n y n y n n y
3 democrat y y y n y y n n n n y ? y y y y
4 democrat n y y n y y n n n n n n y y y y
This should work:
for col in df.columns:
    df.loc[df[col] == 'n', col] = 0
    df.loc[df[col] == 'y', col] = 1
    df.loc[df[col] == '?', col] = 1
I think the simplest is to use replace with a dict:
np.random.seed(100)
df = pd.DataFrame(np.random.choice(['n','y','?'], size=(5,5)),
columns=list('ABCDE'))
print (df)
A B C D E
0 n n n ? ?
1 n ? y ? ?
2 ? ? y n n
3 n n ? n y
4 y ? ? n n
d = {'n':0,'y':1,'?':1}
df = df.replace(d)
print (df)
A B C D E
0 0 0 0 1 1
1 0 1 1 1 1
2 1 1 1 0 0
3 0 0 1 0 1
4 1 1 1 0 0
This should do:
df.infants = df.infants.map({'y': 1, 'n': 0, '?': 1})
Maybe you can try apply:
import pandas as pd
# create dataframe
number = [1,2,3,4,5]
sex = ['male','female','female','female','male']
df_new = pd.DataFrame()
df_new['number'] = number
df_new['sex'] = sex
df_new.head()
# create def for category to number 0/1
def tran_cat_to_num(df):
    if df['sex'] == 'male':
        return 1
    elif df['sex'] == 'female':
        return 0
# create sex_new
df_new['sex_new'] = df_new.apply(tran_cat_to_num, axis=1)
df_new
raw
number sex
0 1 male
1 2 female
2 3 female
3 4 female
4 5 male
after using apply
number sex sex_new
0 1 male 1
1 2 female 0
2 3 female 0
3 4 female 0
4 5 male 1
You can change the values using the map function.
Ex.:
x = {'y': 1, 'n': 0, '?': 1}
for col in df.columns:
    df[col] = df[col].map(x)
This way you map each column of your dataframe. Note that map returns NaN for any value that is not a key in the dict, so include every value you expect (here '?' as well).
All the solutions above are correct, but what you could also do is:
df["infants"] = df["infants"].replace("Y", 1).replace("N", 0).replace("?", 1)
which now that I read more carefully is very similar to using replace with dict !
Use dataframe.replace():
df.replace({'infants':{'y':1,'?':1,'n':0}},inplace=True)
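If only some of the columns actually hold y/n/? values, a small sketch that limits the replacement to those columns (the column selection below is an assumption, since the full column names aren't shown):
yn_cols = df.columns.drop('republican')   # assumption: every column except the party column is y/n/?
df[yn_cols] = df[yn_cols].replace({'y': 1, 'n': 0, '?': 1})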

Pandas calculate length of consecutive equal values from a grouped dataframe

I want to do what they've done in the answer here: Calculating the number of specific consecutive equal values in a vectorized way in pandas, but using a grouped dataframe instead of a series.
So given a dataframe with several columns
A B C
------------
x x 0
x x 5
x x 2
x x 0
x x 0
x x 3
x x 0
y x 1
y x 10
y x 0
y x 5
y x 0
y x 0
I want to groupby columns A and B, then count the number of consecutive zeros in C. After that I'd like to return counts of the number of times each length of zeros occurred. So I want output like this:
A B num_consecutive_zeros count
---------------------------------------
x x 1 2
x x 2 1
y x 1 1
y x 2 1
I don't know how to adapt the answer from the linked question to deal with grouped dataframes.
Here is the code. count_consecutive_zeros() uses numpy and pandas.value_counts() to get the counts for one group, groupby().apply(count_consecutive_zeros) calls it for every (A, B) group, and reset_index() turns the resulting MultiIndex into columns:
import pandas as pd
import numpy as np
from io import BytesIO
text = """A B C
x x 0
x x 5
x x 2
x x 0
x x 0
x x 3
x x 0
y x 1
y x 10
y x 0
y x 5
y x 0
y x 0"""
df = pd.read_csv(BytesIO(text.encode()), sep=r"\s+")
def count_consecutive_zeros(s):
    # +1 marks where a run of zeros starts, -1 marks just after it ends
    v = np.diff(np.r_[0, s.values == 0, 0])
    # run length = end position - start position; count how often each length occurs
    s = pd.value_counts(np.where(v == -1)[0] - np.where(v == 1)[0])
    s.index.name = "num_consecutive_zeros"
    s.name = "count"
    return s
df.groupby(["A", "B"]).C.apply(count_consecutive_zeros).reset_index()
