nunique excluding some values in pandas - python

I am calculating unique values per row. However, I want to exclude the value 0 before counting the unique values.
d = {'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0],}
df = pd.DataFrame(data=d)
df
col1 col2 col3
0 1 3 0
1 2 4 4
2 3 0 0
Expected output
col1 col2 col3 uniques
0 1 3 0 2
1 2 4 4 2
2 3 0 0 1
I tried df.nunique(axis=1), but this includes all values.

To do this you can simply replace zeroes with NaN values, which nunique skips by default.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, 3], 'col2': [3, 4, 0], 'col3': [0, 4, 0]}
df = pd.DataFrame(data=d)
df['uniques'] = df.replace(0, np.nan).nunique(axis=1)
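which should reproduce the expected result from the question:
df
col1 col2 col3 uniques
0 1 3 0 2
1 2 4 4 2
2 3 0 0 1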

Try this:
def func(x):
    s = set(x)
    s.discard(0)
    return len(s)

df['uniq'] = df.apply(lambda x: func(x), axis=1)

A slightly more concise version without using replace:
df['unique'] = df[df!=0].nunique(axis=1)
df
Output:
col1 col2 col3 unique
0 1 3 0 2
1 2 4 4 2
2 3 0 0 1

Related

Sort dataframe based on minimum value of two columns

Let's assume I have the following dataframe:
import pandas as pd
d = {'col1': [1, 2,3,4], 'col2': [4, 2, 1, 3], 'col3': [1,0,1,1], 'outcome': [1,0,1,0]}
df = pd.DataFrame(data=d)
I want this dataframe sorted by the minimum value of col1 and col2 per row. The order of the indexes should be 2, 0, 1, 3.
I tried df.sort_values(by=['col2', 'col1']), but that sorts by col2 first and then by col1. Is there any way to sort by taking the minimum of the two columns?
Using numpy.lexsort:
order = np.lexsort(np.sort(df[['col1', 'col2']])[:, ::-1].T)
out = df.iloc[order]
Output:
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
Note that you can easily handle any number of columns:
df.iloc[np.lexsort(np.sort(df[['col1', 'col2', 'col3']])[:, ::-1].T)]
col1 col2 col3 outcome
1 2 2 0 0
2 3 1 1 1
0 1 4 1 1
3 4 3 1 0
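For reference, here is the two-column lexsort expression broken into steps (the intermediate variable names are mine):
vals = np.sort(df[['col1', 'col2']])  # each row sorted: [row minimum, row maximum]
keys = vals[:, ::-1].T                # reverse and transpose: the last key is the row minimum
order = np.lexsort(keys)              # lexsort uses its last key as the primary sort key
out = df.iloc[order]                  # rows ordered by their minimum, ties broken on the maximum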
One way (not the most efficient):
idx = df[['col2', 'col1']].apply(lambda x: tuple(sorted(x)), axis=1).sort_values().index
Output:
>>> df.loc[idx]
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
>>> idx
Int64Index([2, 0, 1, 3], dtype='int64')
You can decorate-sort-undecorate, where the decoration is the minimal and the other (i.e., maximal) value per row:
cols = ["col1", "col2"]
(df.assign(_min=df[cols].min(axis=1), _other=df[cols].max(axis=1))
.sort_values(["_min", "_other"])
.drop(columns=["_min", "_other"]))
to get
col1 col2 col3 outcome
2 3 1 1 1
0 1 4 1 1
1 2 2 0 0
3 4 3 1 0
I would compute min(col1, col2) as a new column and then sort by it:
import pandas as pd
d = {'col1': [1, 2,3,4], 'col2': [4, 2, 1, 3], 'col3': [1,0,1,1], 'outcome': [1,0,1,0]}
df = pd.DataFrame(data=d)
df['colmin'] = df[['col1','col2']].min(axis=1) # compute min
df = df.sort_values(by='colmin').drop(columns='colmin') # sort then drop min
print(df)
gives output
col1 col2 col3 outcome
0 1 4 1 1
2 3 1 1 1
1 2 2 0 0
3 4 3 1 0
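Note that sorting on the minimum alone keeps the tie between rows 0 and 2 (both have minimum 1) in their original order, which is why this output differs from the requested 2, 0, 1, 3. If you also want to break ties on the larger of the two values, one possible extension (the colmin/colmax names are mine) is:
df['colmin'] = df[['col1', 'col2']].min(axis=1)
df['colmax'] = df[['col1', 'col2']].max(axis=1)
df = df.sort_values(by=['colmin', 'colmax']).drop(columns=['colmin', 'colmax'])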

How to count the number of elements in a list and create a new column?

I have a df as follows:
Col1 Col2
0 [7306914, 7306915]
1 [7295911, 7295912]
2 [7324496]
3 [7294109, 7294110]
4 [7313713]
Each entry in the second column is a list. What I would like is to create a new column that contains the number of elements in each list.
Expected Output:
Col1 Col2 Col3
0 [7306914, 7306915] 2
1 [7295911, 7295912] 2
2 [7324496] 1
3 [7294109, 7294110] 2
4 [7313713] 1
Use Series.str.len. This is a vectorized method and is more efficient than the apply function, which essentially loops under the hood:
df = pd.DataFrame([{'Col1': 0, 'Col2': [7306914, 7306915]}, {'Col1': 1, 'Col2': [7295911, 7295912]}, {'Col1': 2, 'Col2': [7324496]}, {'Col1': 3, 'Col2': [7294109, 7294110]}, {'Col1': 4, 'Col2': [7313713]}])
df['Col3'] = df['Col2'].str.len()
[out]
print(df)
Col1 Col2 Col3
0 0 [7306914, 7306915] 2
1 1 [7295911, 7295912] 2
2 2 [7324496] 1
3 3 [7294109, 7294110] 2
4 4 [7313713] 1
Try this:
df_tmp = pd.DataFrame({'col1':[[1,2,3], [1,2]]}).reset_index()
In [360]:
df_tmp.head()
Out[360]:
index col1
0 0 [1, 2, 3]
1 1 [1, 2]
In [364]:
df_tmp['len'] = df_tmp.apply(lambda x: len(x['col1']), axis=1)
In [365]:
df_tmp
Out[365]:
index col1 len
0 0 [1, 2, 3] 3
1 1 [1, 2] 2
apply is another simple way to do this.
Using Series.apply() (or DataFrame.apply()) like this:
df['Col3'] = df['Col2'].apply(len)
Hope it helps.

Conditional ratio with group by in pandas

I want to group by column 1, take the sum of the values in column 2 where column 3 equals 1, and divide it by the total sum of column 2 within the same group.
An example is given below:
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 2, 7], 'col3': [1, 1, 0, 0]}
df = pd.DataFrame(data=d)
col1 col2 col3
0 1 3 1
1 2 4 1
2 1 2 0
3 2 7 0
I want to create a new column, col4. For this column I group by col1 and take the sum of col2 where col3 is 1, divided by the total grouped sum of col2, so that I end up with the following result. (I put it in fractions to make the calculations easier to follow.)
col1 col2 col3 col4
0 1 3 1 3/5
1 2 4 1 4/11
2 1 2 0 3/5
3 2 7 0 4/11
I tried the following, but this does not work unfortunately:
df.col4 = df.groupby(['col1']).transform(lambda x: np.where(x.col3 == 1, x.col2, 0).sum()) / df.groupby(['col1']).col2.transform('sum')
Edit | Extended example
I extended the example as the solution provided by Wen only covered the above simple example.
d = {'col1': [1, 2, 1, 2, 1, 2], 'col2': [3, 4, 2, 7, 6, 8], 'col3': [1, 1, 0, 0, 1, 0]}
df = pd.DataFrame(data=d)
col1 col2 col3
0 1 3 1
1 2 4 1
2 1 2 0
3 2 7 0
4 1 6 1
5 2 8 0
Edit | Possible solution
I found a possible solution. I would like to do it in a cleaner way, but this is readable and pretty simple. Any alternatives that combine these two lines of code are of course still appreciated.
df['col4'] = np.where(df.col3 == 1, df.col2, 0)
df['col4'] = df.groupby(['col1']).col4.transform('sum') / df.groupby(['col1']).col2.transform('sum')
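If you prefer a single expression, the two lines can be combined (a sketch of the same logic; Series.where keeps col2 where col3 equals 1 and substitutes 0 elsewhere):
df['col4'] = (df.col2.where(df.col3.eq(1), 0).groupby(df.col1).transform('sum')
              / df.groupby('col1').col2.transform('sum'))
Because both the conditional sum and the total are computed per col1 group, this should also handle the extended example.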
You may need to correct your expected output; then, using map after filtering:
df.col1.map(df.loc[df.col3==1,].set_index('col1').col2)/df.groupby(['col1']).col2.transform('sum')
Out[566]:
0 0.600000
1 0.363636
2 0.600000
3 0.363636
dtype: float64
simple :)
d = {'col1': [1, 2, 1, 2], 'col2': [3, 4, 2, 7], 'col3': [1, 1, 0, 0]}
df = pd.DataFrame(data=d)
df['col4'] = 0.0
def con(data):
    part_a = sum(data[data['col3'] == 1]['col2'])
    part_b = sum(data['col2'])
    data.col4 = part_a / part_b
    return data

df.groupby('col1').apply(con)
Output
col1 col2 col3 col4
0 1 3 1 0.600000
1 2 4 1 0.363636
2 1 2 0 0.600000
3 2 7 0 0.363636

pandas row operation to keep only the rightmost non-zero value per row

How can I keep the rightmost non-zero number in each row of a dataframe?
a = [[1, 2, None], [1, 3, 0], [1, 0, 0]]
df = pd.DataFrame(a, columns=['col1','col2','col3'])
df
col1 col2 col3
row0 1 2 NaN
row1 1 3 0
row2 1 0 0
Then after transformation
col1 col2 col3
row0 0 2 0
row1 0 3 0
row2 1 0 0
Based on the suggestion by divakar I've come up with the following:
import pandas as pd
import numpy as np
a = [[1, 2, 0, None],
[1, 3, 0,0],
[1, 0, 0,0],
[1, 0, 0,0],
[1, 0, 0,0],
[0, 0, 0,1]]
df = pd.DataFrame(a, columns=['col1','col2','col3','col4'])
df.fillna(value=0,inplace=True) # Get rid of non numeric items
a
[[1, 2, 0, None],
[1, 3, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 0, 1]]
# Return index of first occurrence of maximum over requested axis.
# 0 or 'index' for row-wise, 1 or 'columns' for column-wise
df.idxmax(1)
0 col2
1 col2
2 col1
3 col1
4 col1
5 col4
dtype: object
Create a matrix to mask values:
numberOfRows = df.shape[0]
df_mask = pd.DataFrame(columns=df.columns, index=np.arange(0, numberOfRows))
df_mask.fillna(value=0, inplace=True)  # Get rid of non numeric items
# Add mask entries
for row, col in enumerate(df.idxmax(1)):
    df_mask.loc[row, col] = 1
df_result = df * df_mask
df_result
col1 col2 col3 col4
0 0 2 0 0.0
1 0 3 0 0.0
2 1 0 0 0.0
3 1 0 0 0.0
4 1 0 0 0.0
5 0 0 0 1.0
Here is a workaround that requires the use of helper functions:
import pandas as pd
# Helper functions
def last_number(lst):
    if all(map(lambda x: x == 0, lst)):
        return 0
    elif lst[-1] != 0:
        return len(lst) - 1
    else:
        return last_number(lst[:-1])

def fill_others(lst):
    new_lst = [0] * len(lst)
    new_lst[last_number(lst)] = lst[last_number(lst)]
    return new_lst
#Data
a = [[1, 2, 0], [1, 3, 0], [1, 0, 0]]
df = pd.DataFrame(a, columns=['col1','col2','col3'])
df.fillna(0, inplace = True)
print(df)
col1 col2 col3
0 1 2 0
1 1 3 0
2 1 0 0
#Application
print(df.apply(lambda x: fill_others(x.values.tolist()), axis=1))
col1 col2 col3
0 0 2 0
1 0 3 0
2 1 0 0
As their names suggest, the functions find the last non-zero number in a given row and fill the other values with zeros.
I hope this helps.
Working at NumPy level, here's one vectorized approach using broadcasting. The running count of non-zeros, (a!=0).cumsum(1), reaches its maximum at the last non-zero position of each row, and argmax returns the first index at which that maximum occurs -
np.where(((a!=0).cumsum(1).argmax(1))[:,None] == np.arange(a.shape[1]),a,0)
Sample run -
In [7]: a # NumPy array
Out[7]:
array([[1, 2, 0],
[1, 3, 0],
[1, 0, 0]])
In [8]: np.where(((a!=0).cumsum(1).argmax(1))[:,None] == np.arange(a.shape[1]),a,0)
Out[8]:
array([[0, 2, 0],
[0, 3, 0],
[1, 0, 0]])
Porting it to pandas, we would have an implementation like so -
idx = (df!=0).values.cumsum(1).argmax(1)
df_out = df*(idx[:,None] == np.arange(df.shape[1]))
Sample run -
In [19]: df
Out[19]:
col1 col2 col3 col4
0 1 2 0 0.0
1 1 3 0 0.0
2 2 2 2 0.0
3 1 0 0 0.0
4 1 0 0 0.0
5 0 0 0 1.0
In [20]: idx = (df!=0).values.cumsum(1).argmax(1)
In [21]: df*(idx[:,None] == np.arange(df.shape[1]))
Out[21]:
col1 col2 col3 col4
0 0 2 0 0.0
1 0 3 0 0.0
2 0 0 2 0.0
3 1 0 0 0.0
4 1 0 0 0.0
5 0 0 0 1.0
You can back-fill the null values and then take the values of the resulting last column:
In [49]: df.fillna(axis=0, method='bfill')['col3']
Out[49]:
0 0.0
1 0.0
2 0.0
Name: col3, dtype: float64
Full Example
In [50]: a = [[1, 2, None], [1, 3, 0], [0, 0, 0]]
In [51]: df = pd.DataFrame(a, columns=['col1','col2','col3'])
In [52]: df.fillna(axis=0, method='bfill')['col3']
Out[52]:
0 0.0
1 0.0
2 0.0
Name: col3, dtype: float64

create new columns from a list of columns in pandas

I have a pandas dataframe that has a column where the data is a list of statistics calculated from a groupby operation.
df = pd.DataFrame({'a':[1,1,1,2,2,2,3], 'b':[3,4,2,3,4,3,2]})
def calculate_stuff(x):
    return len(x)/5, sum(x)/len(x), sum(x)
>>> df.groupby('a').apply(lambda row : calculate_stuff(row.b))
a
1 (0, 3, 9)
2 (0, 3, 10)
3 (0, 2, 2)
dtype: object
Basically, I have several statistics that depend on each other and have to be calculated for each groupby row. The function that does this returns a tuple of the statistics values. What I want is to create a new column for each index of the tuple so that it looks like this:
a col1 col2 col3
1 0 3 9
2 0 3 10
3 0 2 2
I don't think I can use df.groupby('a').agg because one of the calculations is required for the other calculations. Any suggestions?
edit: I realized my aggregate functions in my example were not aggregate functions so I changed them
Adding an extra 'a' category item so the result is 4x3:
df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 4],
                   'b': [3, 4, 2, 3, 4, 3, 2, 1]})
new_cols = ['col1', 'col2', 'col3']
gb = df.groupby('a').apply(lambda group: calculate_stuff(group.b))
>>> pd.DataFrame(zip(*gb), columns=gb.index, index=new_cols).T
col1 col2 col3
a
1 0 3 9
2 0 3 10
3 0 2 2
4 0 1 1
You can try a list comprehension:
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2,3], 'b':[3,4,2,3,4,3,2]})
def calculate_stuff(x):
    return len(x)/5, sum(x)/len(x), sum(x)

group_df = df.groupby('a').apply(lambda row: calculate_stuff(row.b))
print(pd.DataFrame([x for x in group_df],
                   columns=['col1', 'col2', 'col3'],
                   index=group_df.index))
col1 col2 col3
a
1 0 3 9
2 0 3 10
3 0 2 2
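For reference, the tuple-valued result can also be expanded without zip, by converting it to a list of tuples first. A minimal sketch along the same lines (the gb and out names are mine; // reproduces the Python 2 integer division behind the values shown above):
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3], 'b': [3, 4, 2, 3, 4, 3, 2]})

def calculate_stuff(x):
    # // mirrors the Python 2 integer division used in the outputs above
    return len(x) // 5, sum(x) // len(x), sum(x)

gb = df.groupby('a')['b'].apply(calculate_stuff)
out = pd.DataFrame(gb.tolist(), columns=['col1', 'col2', 'col3'], index=gb.index)
print(out)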
