Pandas Calculate Sum of Multiple Columns Given Multiple Conditions - python

I have a wide table in a format as follows (for up to 10 people):
person1_status | person2_status | person3_status | person1_type | person2_type | person3_type
0 | 1 | 0 | 7 | 4 | 6
Where status can be a 0 or a 1 (first 3 cols).
Where type can be a # ranging from 4-7. The value here corresponds to another table that specifies a value based on type. So...
Type | Value
4 | 10
5 | 20
6 | 30
7 | 40
I need to calculate two columns, 'A', and 'B', where:
A is the sum of values of each person's type (in that row) where
status = 0.
B is the sum of values of each person's type (in that row) where
status = 1.
For example, the resulting columns 'A', and 'B' would be as follows:
A | B
70 | 10
An explanation of this:
'A' has the value 70 because person1 and person3 have "status" 0 and have corresponding types of 7 and 6 (which correspond to values 40 and 30).
Similarly, there should be another column 'B' that has the value "10" because only person2 has status "1" and their type is "4" (which has corresponding value of 10).
This is probably a simple question, but how do I do this in a vectorized way? I don't want to use a for loop or anything since it'll be less efficient.
For simpler calculated columns I was getting away with just np.where, but I'm a little stuck here since I need to calculate the sum of values from multiple columns given certain conditions, while pulling in those values from a separate table.
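For reference, here is the example above as a reproducible snippet (the column names follow the layout in the question; the expected A and B values are the ones worked out above):
import pandas as pd

# One-row example from the question
df = pd.DataFrame({'person1_status': [0], 'person2_status': [1], 'person3_status': [0],
                   'person1_type':   [7], 'person2_type':   [4], 'person3_type':   [6]})

# Type -> value lookup from the second table
lookup = pd.Series({4: 10, 5: 20, 6: 30, 7: 40})

# Expected result for this row: A = 40 + 30 = 70 (status 0), B = 10 (status 1)
expected = pd.DataFrame({'A': [70], 'B': [10]})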

Use the filter method, which selects the columns whose names contain a given string.
Make a lookup Series other_table indexed by the type values.
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Full example below:
Create fake data
df = pd.DataFrame({'person_1_status': np.random.randint(0, 2, 1000),
                   'person_2_status': np.random.randint(0, 2, 1000),
                   'person_3_status': np.random.randint(0, 2, 1000),
                   'person_1_type': np.random.randint(4, 8, 1000),
                   'person_2_type': np.random.randint(4, 8, 1000),
                   'person_3_type': np.random.randint(4, 8, 1000)},
                  columns=['person_1_status', 'person_2_status', 'person_3_status',
                           'person_1_type', 'person_2_type', 'person_3_type'])
person_1_status person_2_status person_3_status person_1_type \
0 1 0 0 7
1 0 1 0 6
2 1 0 1 7
3 0 0 0 7
4 0 0 1 4
person_2_type person_3_type
0 5 5
1 7 7
2 7 7
3 7 7
4 7 7
Make other_table
other_table = pd.Series({4:10, 5:20, 6:30, 7:40})
4 10
5 20
6 30
7 40
dtype: int64
Filter out status and type columns to their own dataframes
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
Make lookup table
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values
Multiply elementwise by the status masks and sum across rows.
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Output
person_1_status person_2_status person_3_status person_1_type \
0 0 0 1 7
1 0 1 0 4
2 0 1 1 7
3 0 1 0 6
4 0 0 1 5
person_2_type person_3_type A B
0 7 5 80 20
1 6 4 20 30
2 5 5 40 40
3 6 4 40 30
4 7 5 60 20
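As a side note, DataFrame.applymap is deprecated in newer pandas releases (in favour of DataFrame.map), and the per-element .loc lookup can be slow on large frames. A hedged alternative for the lookup step, reusing df_status, df_type and other_table from above, is to map the type codes with replace:
# Alternative lookup sketch: replace maps each type code to its value
# instead of calling other_table.loc once per element.
df_type_lookup = df_type.replace(other_table.to_dict()).to_numpy()

df['A'] = ((df_status == 0).to_numpy() * df_type_lookup).sum(axis=1)
df['B'] = ((df_status == 1).to_numpy() * df_type_lookup).sum(axis=1)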

consider the dataframe df
mux = pd.MultiIndex.from_product([['status', 'type'], ['p%i' % i for i in range(1, 6)]])
data = np.concatenate([np.random.choice((0, 1), (10, 5)), np.random.rand(10, 5)], axis=1)
df = pd.DataFrame(data, columns=mux)
df
The way this is structured, we can do this for status == 1
df.status.mul(df.type).sum(1)
0 0.935290
1 1.252478
2 1.354461
3 1.399357
4 2.102277
5 1.589710
6 0.434147
7 2.553792
8 1.205599
9 1.022305
dtype: float64
and for status == 0
df.status.rsub(1).mul(df.type).sum(1)
0 1.867986
1 1.068045
2 0.653943
3 2.239459
4 0.214523
5 0.734449
6 1.291228
7 0.614539
8 0.849644
9 1.109086
dtype: float64
You can get your columns in this format using the following code
df.columns = df.columns.str.split('_', expand=True)
df = df.swaplevel(0, 1, 1)
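Applied to the OP's original column names (person1_status, person1_type, ...) plus the type-to-value lookup, the whole pipeline could look roughly like this (a sketch; the lookup is folded in via replace):
import pandas as pd

df = pd.DataFrame({'person1_status': [0], 'person2_status': [1], 'person3_status': [0],
                   'person1_type':   [7], 'person2_type':   [4], 'person3_type':   [6]})
lookup = {4: 10, 5: 20, 6: 30, 7: 40}

df.columns = df.columns.str.split('_', expand=True)   # ('person1', 'status'), ...
df = df.swaplevel(0, 1, axis=1)                        # ('status', 'person1'), ...

values = df['type'].replace(lookup)                    # map type codes to values
A = df['status'].rsub(1).mul(values).sum(axis=1)       # sum where status == 0 -> 70
B = df['status'].mul(values).sum(axis=1)               # sum where status == 1 -> 10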

Related

pandas ascend sort multiple columns but reverse sort one column

I have a pandas DataFrame that has a little over 100 columns.
There are about 50 columns that I want to sort ascending and then there is one column (a date_time column) that I want to reverse sort.
How do I go about achieving this? I know I can do something like...
df = df.sort_values(by = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time'], ascending=[True, True, True, True,... False])
... but I am trying to avoid having to type 'True' 50 times.
Just wondering if there is a quick hand way of doing this.
Thanks.
Dan
You can use:
cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=[True]*49+[False])
Or, for a programmatic variant for which you don't need to know the position of the False, using numpy:
cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=np.array(cols)!='date_time')
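As a quick illustration on a toy frame (column names are made up), the ascending flags can also be built with a plain list comprehension:
import pandas as pd

df = pd.DataFrame({'column_001': [2, 1, 1], 'column_003': [5, 9, 7],
                   'date_time': ['2021-03-01', '2021-01-01', '2021-02-01']})
cols = ['column_001', 'column_003', 'date_time']
ascending = [c != 'date_time' for c in cols]   # [True, True, False]

print(df.sort_values(by=cols, ascending=ascending))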
It should go something like this.
to_be_reserved = "COLUMN_TO_BE_RESERVED"
df = df.sort_values(by=[col for col in df.columns if col != to_be_reserved],ignore_index=True)
df[to_be_reserved] = df[to_be_reserved].sort_values(ascending=False,ignore_index = True)
You can also use filter if your 49 columns have a regular pattern:
# if you have a column name pattern
cols = df.filter(regex=('^(column_|date_time)')).columns.tolist()
ascending_false = ['date_time']
ascending = [True if c not in ascending_false else False for c in cols]
df.sort_values(by=cols, ascending=ascending)
Example:
>>> df
column_0 column_1 date_time value other_value another_value
0 4 2 6 6 1 1
1 4 4 0 6 0 2
2 3 2 6 9 0 7
3 9 2 1 7 4 7
4 6 9 2 4 4 1
>>> df.sort_values(by=cols, ascending=ascending)
column_0 column_1 date_time value other_value another_value
2 3 2 6 9 0 7
0 4 2 6 6 1 1
1 4 4 0 6 0 2
4 6 9 2 4 4 1
3 9 2 1 7 4 7

Sample Pandas dataframe based on multiple values in column

I'm trying to even up a dataset for machine learning. There are great answers for how to sample a dataframe with two values in a column (a binary choice).
In my case I have many values in column x. I want an equal number of records in the dataframe where
x is 0 or not 0
or in a more complicated example the value in x is 0, 5 or other value
Examples
x
0 5
1 5
2 5
3 0
4 0
5 9
6 18
7 3
8 5
For the first:
I have 2 rows where x = 0 and 7 where x != 0. The result should balance this up and be 4 rows: the two with x = 0 and 2 where x != 0 (randomly selected). Preserving the same index for the sake of illustration
1 5
3 0
4 0
6 18
For the second:
I have 2 rows where x = 0, 4 rows where x = 5 and 3 rows where x != 0 && x != 5. The result should balance this up and be 6 rows in total: two for each condition. Preserving the same index for the sake of illustration
1 5
3 0
4 0
5 9
6 18
8 5
I've done examples with 2 conditions & 3 conditions. A solution that generalises to more would be good. It is better if it detects the minimum number of rows (for 0 in this example) so I don't need to work this out first before writing the condition.
How do I do this with pandas? Can I pass a custom function to .groupby() to do this?
IIUC, you could groupby on the condition whether "x" is 0 or not and sample the smallest-group-size number of entries from each group:
g = df.groupby(df['x']==0)['x']
out = g.sample(n=g.count().min()).sort_index()
(An example) output:
1 5
3 0
4 0
5 9
Name: x, dtype: int64
For the second case, we could use numpy.select and numpy.unique to get the groups (the rest are essentially the same as above):
import numpy as np
groups = np.select([df['x']==0, df['x']==5], [1,2], 3)
g = df.groupby(groups)['x']
out = g.sample(n=np.unique(groups, return_counts=True)[1].min()).sort_index()
An example output:
2 5
3 0
4 0
5 9
7 3
8 5
Name: x, dtype: int64
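Since the question asks for something that generalises and picks the smallest group automatically, the same idea works with any per-row labelling (a hedged sketch; the label names are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [5, 5, 5, 0, 0, 9, 18, 3, 5]})

# Any labelling of the rows works; here: 'zero', 'five', 'other'
labels = np.select([df['x'].eq(0), df['x'].eq(5)], ['zero', 'five'], default='other')

g = df.groupby(labels)['x']
balanced = g.sample(n=g.size().min()).sort_index()   # minimum group size found automatically
print(balanced)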
IIUC, and you want any two non-zero records:
mask = df['x'].eq(0)
pd.concat([df[mask], df[~mask].sample(mask.sum())]).sort_index()
Output:
x
1 5
2 5
3 0
4 0
Part II:
mask0 = df['x'].eq(0)
mask5 = df['x'].eq(5)
pd.concat([df[mask0],
           df[mask5].sample(mask0.sum()),
           df[~(mask0 | mask5)].sample(mask0.sum())]).sort_index()
Output:
x
2 5
3 0
4 0
6 18
7 3
8 5

Split a dataframe based on certain column values

Let's say I have a DF like this:
Mean 1 | Mean 2 | Stat 1 | Stat 2 | ID
5 | 10 | 15 | 20 | Z
3 | 6 | 9 | 12 | X
Now, I want to split the dataframe to separate the data based on whether it is a #1 or #2 for each ID.
Basically I would double the number of rows for each ID, with each row dedicated to either #1 or #2, and a new column added to specify which number we are looking at. Instead of Mean 1 and Mean 2 being on the same row, they would be listed in two separate rows, with the # column making it clear which one we are looking at. What's the best way to do this? I was trying pd.melt(), but it seems like a slightly different use case.
Mean | Stat | ID | #
5 | 15 | Z | 1
10 | 20 | Z | 2
3 | 9 | X | 1
6 | 12 | X | 2
Use pd.wide_to_long:
new_df = pd.wide_to_long(
    df, stubnames=['Mean', 'Stat'], i='ID', j='#', sep=' '
).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
Or use set_index, then str.split the columns, then stack, if the row order must match the OP's:
new_df = df.set_index('ID')
new_df.columns = new_df.columns.str.split(expand=True)
new_df = new_df.stack().rename_axis(['ID', '#']).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 Z 2 10 20
2 X 1 3 9
3 X 2 6 12
Here is a solution with melt and pivot:
df = df.melt(id_vars=['ID'], value_name='Mean')
df[['variable', '#']] = df['variable'].str.split(expand=True)
df = (df.assign(idx=df.groupby('variable').cumcount())
        .pivot(index=['idx', 'ID', '#'], columns='variable')
        .reset_index()
        .drop(('idx', ''), axis=1))
df.columns = [col[0] if col[1] == '' else col[1] for col in df.columns]
df
Out[1]:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12

Filling in a new data frame based on two other data frames

I want an efficient way to solve this problem below because my code seems inefficient.
First of all, let me provide a dummy dataset.
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
df1= {'a0' : [1,2,2,1,3], 'a1' : [2,3,3,2,4], 'a2' : [3,4,4,3,5], 'a3' : [4,5,5,4,6], 'a4' : [5,6,6,5,7]}
df2 = {'b0' : [3,6,6,3,8], 'b1' : [6,8,8,6,9], 'b2' : [8,9,9,8,7], 'b3' : [9,7,7,9,2], 'b4' : [7,2,2,7,1]}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
My actual dataset has more than 100,000 rows and 15 columns. Now, what I want to do is pretty complicated to explain, but here we go.
Goal: I want to create a new df using the two dfs above.
Find the global min and max from df1. Since the values are sorted within each row, column 'a0' will always hold the row minimum and 'a4' the row maximum. Therefore, I will find the minimum of column 'a0' and the maximum of 'a4'.
Min = df1['a0'].min()
Max = df1['a4'].max()
Min
Max
Then I will create a data frame filled with 0s and columns of range(Min, Max). In this case, 1 through 7.
column = []
for i in np.arange(Min, Max+1):
    column.append(i)
newdf = pd.DataFrame(0, index=df1.index, columns=column)
The third step is to find the place where the values from df2 will go:
I want to loop through each value in df1 and match each value with the column name in the new df in the same row.
For example, if we are looking at row 0 and go through each column; the values in this case [1,2,3,4,5]. Then the row 0 of the newdf, column 1,2,3,4,5 will be filled with the corresponding values from df2.
Lastly, each corresponding value from df2 (at the same position) will be written into the cell found in the previous step.
So, the very first row of the new df will look like this:
output = {'1' : [3], '2' : [6], '3' : [8], '4' : [9], '5' : [7], '6' : [0], '7' : [0]}
output = pd.DataFrame(output)
Column 6 and 7 will not be updated because we didn't have 6 and 7 in the very first row of df1.
Here is my code for this process:
for rowidx in range(0, len(df1)):
    for columnidx in range(0, len(df1.columns)):
        new_column = df1[str(df1.columns[columnidx])][rowidx]
        newdf.loc[newdf.index[rowidx], new_column] = df2['b' + df1.columns[columnidx][1:]][rowidx]
I think this does the job, but as I said, my actual dataset is huge with 2999999 rows and Min to Max range is 282 which means 282 columns in the new data frame.
So, the code above runs forever. Is there a faster way to do this? I think I learned something like map-reduce, but I don't know if that would apply here.
The idea is to create matching default column names in both DataFrames, then concat the stacked Series, append the first column (0) to the index, drop the second index level, and finally use DataFrame.unstack:
df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))
newdf = (pd.concat([df1.stack(), df2.stack()], axis=1)
           .set_index(0, append=True)
           .reset_index(level=1, drop=True)[1]
           .unstack(fill_value=0)
           .rename_axis(None, axis=1))
print (newdf)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
Another solutions:
comp =[pd.Series(a, index=df1.loc[i]) for i, a in enumerate(df2.values)]
df = pd.concat(comp, axis=1).T.fillna(0).astype(int)
print (df)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
Or:
comp = [dict(zip(x, y)) for x, y in zip(df1.values, df2.values)]
c = pd.DataFrame(comp).fillna(0).astype(int)
print (c)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
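Another hedged sketch, closer to plain NumPy: because every value in df1 is itself a column label of the output, df2's values can be scattered into a zero array with fancy indexing (this assumes df1 holds integers between Min and Max, as in the dummy data):
import numpy as np
import pandas as pd

# df1 and df2 as defined in the question
Min, Max = int(df1.values.min()), int(df1.values.max())
out = np.zeros((len(df1), Max - Min + 1), dtype=df2.values.dtype)

rows = np.repeat(np.arange(len(df1)), df1.shape[1])   # row index for every cell
cols = (df1.values - Min).ravel()                      # target column for every cell
out[rows, cols] = df2.values.ravel()                   # scatter df2's values

newdf = pd.DataFrame(out, index=df1.index, columns=range(Min, Max + 1))
print(newdf)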

python pandas - remove duplicates in a column and keep rows according to a complex criteria

Suppose I have this DF:
s1 = pd.Series([1,1,2,2,2,3,3,3,4])
s2 = pd.Series([10,20,10,5,10,7,7,3,10])
s3 = pd.Series([0,0,0,0,1,1,0,2,0])
df = pd.DataFrame([s1,s2,s3]).transpose()
df.columns = ['id','qual','nm']
df
id qual nm
0 1 10 0
1 1 20 0
2 2 10 0
3 2 5 0
4 2 10 1
5 3 7 1
6 3 7 0
7 3 3 2
8 4 10 0
I want to get a new DF in which there are no duplicate ids, so there should be 4 rows with ids 1,2,3,4. The row that should be kept should be chosen based on the following criteria: take the one with smallest nm, if equal, take the one with largest qual, if still equal, just choose one.
I figure that my code should look something like:
df.groupby('id').apply(lambda x: ???)
And it should return:
id qual nm
0 1 20 0
1 2 10 0
2 3 7 0
3 4 10 0
But not sure what my function should take and return.
Or possibly there is an easier way?
Thanks!
Use boolean indexing with GroupBy.transform to keep the rows with the minimum nm per group, then those with the maximum qual, and finally, if there are still duplicates, remove them with DataFrame.drop_duplicates:
#get minimal nm
df1 = df[df['nm'] == df.groupby('id')['nm'].transform('min')]
#get maximal qual
df1 = df1[df1['qual'] == df1.groupby('id')['qual'].transform('max')]
#if still dupes get first id
df1 = df1.drop_duplicates('id')
print (df1)
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
Use -
grouper = df.groupby(['id'])
df.loc[(grouper['nm'].transform(min) == df['nm'] ) & (grouper['qual'].transform(max) == df['qual']),:].drop_duplicates(subset=['id'])
Output
id qual nm
1 1 20 0
2 2 10 0
6 3 7 0
8 4 10 0
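A further hedged option (not from the original answers) is to encode the tie-break order directly in a sort and then keep the first row per id, which avoids apply entirely: sort nm ascending and qual descending, then drop_duplicates:
out = (df.sort_values(['id', 'nm', 'qual'], ascending=[True, True, False])
         .drop_duplicates('id'))
print(out)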
