Changing a cell value based on other rows and columns - python

I have a dataframe called Result that comes from a SQL query:
Loc ID Bank
1 23 NULL
1 24 NULL
1 25 NULL
2 23 6
2 24 7
2 25 8
I am trying to set the values of Loc == 1 Bank equal to the Bank of Loc == 2 when the ID is the same, resulting in:
Loc ID Bank
1 23 6
1 24 7
1 25 8
2 23 6
2 24 7
2 25 8
Here is where I am at with the code. I know the ending is probably simple, but I can't wrap my head around a solution that doesn't involve iterating over every row (~9000).
result.loc[(result['Loc'] == '1'), 'bank'] = ???

You can try this. It builds an ID-to-Bank dict from the Loc == 2 rows and uses map() on ID to fill the Loc == 1 values.
for_map = df.loc[df['Loc'] == 2].set_index('ID')['Bank'].squeeze().to_dict()
df.loc[df['Loc'] == 1,'Bank'] = df.loc[df['Loc'] == 1,'Bank'].fillna(df['ID'].map(for_map))
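For reference, a minimal self-contained sketch that reproduces this on the sample data from the question (assuming the SQL NULLs arrive as NaN, which the fillna above implies; the .squeeze() in the answer is a no-op on a multi-row selection and is omitted here):

import numpy as np
import pandas as pd

# Minimal reproduction of the frame from the question (SQL NULL -> NaN).
df = pd.DataFrame({'Loc': [1, 1, 1, 2, 2, 2],
                   'ID': [23, 24, 25, 23, 24, 25],
                   'Bank': [np.nan, np.nan, np.nan, 6, 7, 8]})

# Build an ID -> Bank mapping from the Loc == 2 rows ...
for_map = df.loc[df['Loc'] == 2].set_index('ID')['Bank'].to_dict()
# ... and use it to fill the missing Bank values on the Loc == 1 rows.
df.loc[df['Loc'] == 1, 'Bank'] = df.loc[df['Loc'] == 1, 'Bank'].fillna(df['ID'].map(for_map))
print(df)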

You could do a self-merge of the dataframe on ID, then keep only the rows where the right-hand Loc equals 2:
(
    df.merge(df, on="ID")
    .loc[lambda df: df.Loc_y == 2, ["Loc_x", "ID", "Bank_y"]]
    .rename(columns=lambda x: x.split("_")[0] if "_" in x else x)
    .astype({"Bank": "Int8"})
    .sort_values("Loc", ignore_index=True)
)
Loc ID Bank
0 1 23 6
1 1 24 7
2 1 25 8
3 2 23 6
4 2 24 7
5 2 25 8
You could also stack/unstack, although this fails if you have duplicate indices:
(
    df.set_index(["Loc", "ID"])
    .unstack("Loc")
    .bfill(1)
    .stack()
    .reset_index()
    .reindex(columns=df.columns)
)
Loc ID Bank
0 1 23 6.0
1 2 23 6.0
2 1 24 7.0
3 2 24 7.0
4 1 25 8.0
5 2 25 8.0
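One small addition (not part of the original answer): the NaN in the original Bank column forces a float dtype through the reshape, so if you want integers back you could tack a nullable-integer astype onto the end of the chain, e.g.:

(
    df.set_index(["Loc", "ID"])
    .unstack("Loc")
    .bfill(1)
    .stack()
    .reset_index()
    .reindex(columns=df.columns)
    .astype({"Bank": "Int64"})   # restore a (nullable) integer dtype
)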

Why not use pandas.MultiIndex?
Commonalities
# Arguments,
_0th_level = 'Loc'
merge_key = 'ID'
value_key = 'Bank' # or a list of colnames or `slice(None)` to propagate all columns values.
src_key = '2'
dst_key = '1'
# Computed once, up front,
df = result.set_index([_0th_level, merge_key])
df2 = df.xs(key=src_key, level=_0th_level, drop_level=False)
df1_ = df2.rename(level=_0th_level, index={src_key: dst_key})
First (naive) approach
df.loc[df1_.index, value_key] = df1_
# to get `result` back : df.reset_index()
Second (robust) approach
That said, the first approach may be illegal (since pandas version 1.0.0) if there are one or more missing labels [...].
So if you must ensure that indexes exist both at source and destination, the following does the job on shared IDs only.
df1 = df.xs(key=dst_key, level=_0th_level, drop_level=False)
idx = df1.index.intersection(df1_.index) # <-----
df.loc[idx, value_key] = df1_.loc[idx, value_key]
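For completeness, a minimal end-to-end sketch of the robust variant (this assumes Loc comes back from the query as integers, so the keys here are 1 and 2 rather than the quoted '1'/'2' used above, and that the SQL NULLs arrive as NaN):

import numpy as np
import pandas as pd

# Sample data from the question.
result = pd.DataFrame({'Loc': [1, 1, 1, 2, 2, 2],
                       'ID': [23, 24, 25, 23, 24, 25],
                       'Bank': [np.nan, np.nan, np.nan, 6, 7, 8]})

_0th_level, merge_key, value_key = 'Loc', 'ID', 'Bank'
src_key, dst_key = 2, 1

df = result.set_index([_0th_level, merge_key])
df2 = df.xs(key=src_key, level=_0th_level, drop_level=False)
df1_ = df2.rename(level=_0th_level, index={src_key: dst_key})

df1 = df.xs(key=dst_key, level=_0th_level, drop_level=False)
idx = df1.index.intersection(df1_.index)          # only IDs present on both sides
df.loc[idx, value_key] = df1_.loc[idx, value_key]

print(df.reset_index())                           # Bank for Loc 1 is now 6, 7, 8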

Related

How to fill missing values in a column by random sampling another column by other column values

I have missing values in one column that I would like to fill by random sampling from a source distribution:
import pandas as pd
import numpy as np
source = pd.DataFrame({'age': 5*[21],
                       'location': [0, 0, 1, 1, 1],
                       'x': [1, 2, 3, 4, 4]})
source
age location x
0 21 0 1
1 21 0 2
2 21 1 3
3 21 1 4
4 21 1 4
target = pd.DataFrame({'age': 5*[21],
                       'location': [0, 0, 0, 1, 2],
                       'x': 5*[np.nan]})
target
age location x
0 21 0 NaN
1 21 0 NaN
2 21 0 NaN
3 21 1 NaN
4 21 2 NaN
Now I need to fill in the missing values of x in the target dataframe by sampling (with replacement) a random value of x from the rows of source that have the same age and location as the missing row. If source has no rows with matching age and location, the value should stay missing.
Expected output:
age location x
0 21 0 1 with probability 0.5 2 otherwise
1 21 0 1 with probability 0.5 2 otherwise
2 21 0 1 with probability 0.5 2 otherwise
3 21 1 3 with probability 0.33 4 otherwise
4 21 2 NaN
I can loop through all the missing combinations of age and location and slice the source dataframe and then take a random sample, but my dataset is large enough that it takes quite a while to do.
Is there a better way?
You can create a MultiIndex in both DataFrames and then, in a custom function used with GroupBy.transform, replace the NaN values with draws from numpy.random.choice over the matching source group:
source = pd.DataFrame({'age': 5*[21],
                       'location': [0, 0, 1, 1, 1],
                       'x': [1, 2, 3, 4, 4]})
target = pd.DataFrame({'age': 5*[21],
                       'location': [0, 0, 0, 1, 2],
                       'x': 5*[np.nan]})

cols = ['age', 'location']
source1 = source.set_index(cols)['x']
target1 = target.set_index(cols)['x']
def f(x):
    try:
        # source values sharing this group's (age, location)
        a = source1.loc[x.name].to_numpy()
        m = x.isna()
        x[m] = np.random.choice(a, size=m.sum())
        return x
    except KeyError:
        # no matching (age, location) in source: leave the group missing
        return np.nan
target1 = target1.groupby(level=[0,1]).transform(f).reset_index()
print (target1)
age location x
0 21 0 1.0
1 21 0 2.0
2 21 0 2.0
3 21 1 3.0
4 21 2 NaN
You can create a common grouper and perform a merge:
cols = ['age', 'location']
(target[cols]
 .assign(group=target.groupby(cols).cumcount())  # compute subgroup for duplicates
 .merge((  # below: assigns a random row group
         source.assign(group=source.sample(frac=1).groupby(cols, sort=False).cumcount())
               .groupby(cols+['group'], as_index=False)  # get one row per group
               .first()
        ),
        on=cols+['group'], how='left')  # merge
 # .drop('group', axis=1)  # column kept for clarity, uncomment to remove
)
output:
age location group x
0 20 0 0 0.339955
1 20 0 1 0.700506
2 21 0 0 0.777635
3 22 1 0 NaN

sum duplicate row with condition using pandas

I have a dataframe that looks like this:
Name rent sale
0 A 180 2
1 B 1 4
2 M 12 1
3 O 10 1
4 A 180 5
5 M 2 19
I want to apply the following condition: if rows are duplicated on Name and share the same value in a column, keep that value only once (do not sum it). For example, the duplicate rows for A both have 180 in the rent column, so rent stays 180.
Otherwise, sum the values. For example, the duplicate rows for A have different sale values 2 and 5, and the duplicate rows for M have different values in both the rent and sale columns.
Expected output:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
I tried this code but it's not working as I want:
import pandas as pd
df = pd.DataFrame({'Name': ['A', 'B', 'M', 'O', 'A', 'M'],
                   'rent': [180, 1, 12, 10, 180, 2],
                   'sale': [2, 4, 1, 1, 5, 19]})
df2 = df.drop_duplicates().groupby('Name', sort=False, as_index=False).agg(Name=('Name', 'first'),
                                                                           rent=('rent', 'sum'),
                                                                           sale=('sale', 'sum'))
print(df2)
I got this output
Name rent sale
0 A 360 7
1 B 1 4
2 M 14 20
3 O 10 1
You can try summing only the unique values per group:
def sum_unique(s):
    return s.unique().sum()

df2 = df.groupby('Name', sort=False, as_index=False).agg(
    Name=('Name', 'first'),
    rent=('rent', sum_unique),
    sale=('sale', sum_unique)
)
df2:
Name rent sale
0 A 180 7
1 B 1 4
2 M 14 20
3 O 10 1
You can first group by Name and rent, and then just by Name:
df2 = df.groupby(['Name', 'rent'], as_index=False).sum().groupby('Name', as_index=False).sum()
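As a quick check, this reproduces the expected output on the sample data (a usage sketch, not part of the original answer). Note that the deduplication here happens only via the (Name, rent) grouping, so duplicated sale values would still be summed, unlike with sum_unique above.

import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'M', 'O', 'A', 'M'],
                   'rent': [180, 1, 12, 10, 180, 2],
                   'sale': [2, 4, 1, 1, 5, 19]})

# Collapse rows that share both Name and rent (summing sale), then sum per Name.
df2 = (df.groupby(['Name', 'rent'], as_index=False).sum()
         .groupby('Name', as_index=False).sum())
print(df2)
#   Name  rent  sale
# 0    A   180     7
# 1    B     1     4
# 2    M    14    20
# 3    O    10     1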

Split a dataframe based on certain column values

Let's say I have a DF like this:
Mean 1  Mean 2  Stat 1  Stat 2  ID
     5      10      15      20   Z
     3       6       9      12   X
Now, I want to split the dataframe to separate the data based on whether it is a #1 or #2 for each ID.
Basically I would double the number of rows for each ID, with each row dedicated to either #1 or #2, and a new column added to specify which number we are looking at. Instead of Mean 1 and Mean 2 being on the same row, they will be listed in two separate rows, with the # column making it clear which one we are looking at. What's the best way to do this? I was trying pd.melt(), but it seems like a slightly different use case.
Mean  Stat  ID  #
   5    15   Z  1
  10    20   Z  2
   3     9   X  1
   6    12   X  2
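For reference, the sample frame the answers below operate on can be reconstructed like this (a sketch based on the table above; the space inside the column names is what sep=' ' in the first answer relies on):

import pandas as pd

df = pd.DataFrame({'Mean 1': [5, 3], 'Mean 2': [10, 6],
                   'Stat 1': [15, 9], 'Stat 2': [20, 12],
                   'ID': ['Z', 'X']})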
Use pd.wide_to_long:
new_df = pd.wide_to_long(
    df, stubnames=['Mean', 'Stat'], i='ID', j='#', sep=' '
).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12
Or set_index, then str.split the columns, then stack, if the order must match the OP's:
new_df = df.set_index('ID')
new_df.columns = new_df.columns.str.split(expand=True)
new_df = new_df.stack().rename_axis(['ID', '#']).reset_index()
new_df:
ID # Mean Stat
0 Z 1 5 15
1 Z 2 10 20
2 X 1 3 9
3 X 2 6 12
Here is a solution with melt and pivot:
df = df.melt(id_vars=['ID'], value_name='Mean')
df[['variable', '#']] = df['variable'].str.split(expand=True)
df = (df.assign(idx=df.groupby('variable').cumcount())
        .pivot(index=['idx', 'ID', '#'], columns='variable')
        .reset_index()
        .drop(('idx', ''), axis=1))
df.columns = [col[0] if col[1] == '' else col[1] for col in df.columns]
df
Out[1]:
ID # Mean Stat
0 Z 1 5 15
1 X 1 3 9
2 Z 2 10 20
3 X 2 6 12

3 data frames and 3 rules in operation to insert data into another dataframe - No common columns - Big data

I have 3 different dataframes which can be generated using the code given below:
data_file = pd.DataFrame({'person_id': [1, 2, 3], 'gender': ['Male', 'Female', 'Not disclosed'],
                          'ethnicity': ['Chinese', 'Indian', 'European'],
                          'Marital_status': ['Single', 'Married', 'Widowed'],
                          'Smoke_status': ['Yes', 'No', 'No']})
map_file = pd.DataFrame({'gender': ['1.Male', '2. Female', '3. Not disclosed'],
                         'ethnicity': ['1.Chinese', '2. Indian', '3.European'],
                         'Marital_status': ['1.Single', '2. Married', '3 Widowed'],
                         'Smoke_status': ['1. Yes', '2. No', np.nan]})
hash_file = pd.DataFrame({'keys': ['gender', 'ethnicity', 'Marital_status', 'Smoke_status', 'Yes', 'No', 'Male',
                                   'Female', 'Single', 'Married', 'Widowed', 'Chinese', 'Indian', 'European'],
                          'values': [21, 22, 23, 24, 125, 126, 127, 128, 129, 130, 131, 141, 142, 0]})
Another, empty, dataframe that the output should be written into can be generated using the code below:
columns = ['person_id','obsid','valuenum','valuestring','valueid']
obs = pd.DataFrame(columns=columns)
What I am trying to achieve is shown in the table of rules describing how the data is to be filled.
I did try a for-loop approach, but as soon as I unstack, I lose the column names and am not sure how to proceed further.
a = 1
for i in range(len(data_file)):
    df_temp = data_file[i:a]
    a = a + 1
    df_temp = df_temp.unstack()
    df_temp = df_temp.to_frame().reset_index()
How can I get my output dataframe filled as shown below (I have shown only person_id = 1 and 4 columns)? In reality I have more than 25k persons and 400 columns per person, so an elegant and efficient approach, unlike my for loop, would be helpful.
After discussion in chat and removing duplicates from the data, the following is possible (note that the column names studyid, VARIABLE and concept_id come from the data discussed in chat rather than the sample frames above):
s = hash_file.set_index('VARIABLE')['concept_id']
df1 = map_file.melt().dropna(subset=['value'])
df1[['valueid','valuestring']] = df1.pop('value').str.extract('(\d+)\.(.+)')
df1['valuestring'] = df1['valuestring'].str.strip()
columns = ['studyid','obsid','valuenum','valuestring','valueid']
obs = data_file.melt('studyid', value_name='valuestring').sort_values('studyid')
#merge by 2 columns variable, valuestring
obs = (obs.merge(df1, on=['variable','valuestring'], how='left')
          .rename(columns={'valueid':'valuenum'}))
obs['obsid'] = obs['variable'].map(s)
obs['valueid'] = obs['valuestring'].map(s)
#map by only one column variable
s1 = df1.drop_duplicates('variable').set_index('variable')['valueid']
obs['valuenum_new'] = obs['variable'].map(s1)
obs = obs.reindex(columns + ['valuenum_new'], axis=1)
print (obs)
#compare number of non missing rows
print (len(obs.dropna(subset=['valuenum'])))
print (len(obs.dropna(subset=['valuenum_new'])))
Here is an alternative approach using DataFrame.melt and Series.map:
# Solution for pandas V 0.24.0 +
columns = ['person_id','obsid','valuenum','valuestring','valueid']
# Create map Series
hash_map = hash_file.set_index('keys')['values']
value_map = map_file.stack().str.split('\.\s?', expand=True).set_index(1, append=True).droplevel(0)[0]
# Melt and add mapped columns
obs = data_file.melt(id_vars=['person_id'], value_name='valuestring')
obs['obsid'] = obs.variable.map(hash_map)
obs['valueid'] = obs.valuestring.map(hash_map).astype('Int64')
obs['valuenum'] = obs[['variable', 'valuestring']].apply(tuple, axis=1).map(value_map)
# Reindex and sort for desired output
obs.reindex(columns=columns).sort_values('person_id')
[out]
person_id obsid valuenum valuestring valueid
0 1 21 1 Male 127
3 1 22 1 Chinese 141
6 1 23 1 Single 129
9 1 24 1 Yes 125
1 2 21 2 Female 128
4 2 22 2 Indian 142
7 2 23 2 Married 130
10 2 24 2 No 126
2 3 21 3 Not disclosed NaN
5 3 22 3 European 0
8 3 23 3 Widowed 131
11 3 24 2 No 126

Pandas Calculate Sum of Multiple Columns Given Multiple Conditions

I have a wide table in a format as follows (for up to 10 people):
person1_status | person2_status | person3_status | person1_type | person2_type | person3_type
0 | 1 | 0 | 7 | 4 | 6
Where status can be a 0 or a 1 (first 3 cols).
Where type can be a # ranging from 4-7. The value here corresponds to another table that specifies a value based on type. So...
Type | Value
4 | 10
5 | 20
6 | 30
7 | 40
I need to calculate two columns, 'A', and 'B', where:
A is the sum of values of each person's type (in that row) where status = 0.
B is the sum of values of each person's type (in that row) where status = 1.
For example, the resulting columns 'A', and 'B' would be as follows:
A | B
70 | 10
An explanation of this:
'A' has value 70 because person1 and person3 have "status" 0 and have corresponding types of 7 and 6 (which correspond to values 40 and 30).
Similarly, there should be another column 'B' that has the value "10" because only person2 has status "1" and their type is "4" (which has corresponding value of 10).
This is probably a stupid question, but how do I do this in a vectorized way? I don't want to use a for loop or anything since it'll be less efficient...
I hope that made sense... could anyone help me? I think I'm brain dead trying to figure this out.
For simpler calculated columns I was getting away with just np.where but I'm a little stuck here since I need to calculate the sum of values from multiple columns given certain conditions while pulling in those values from a separate table...
Use the filter method, which selects the columns whose names contain a given string.
Build a lookup table other_table for the values, indexed by type.
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Full example below:
Create fake data
df = pd.DataFrame({'person_1_status': np.random.randint(0, 2, 1000),
                   'person_2_status': np.random.randint(0, 2, 1000),
                   'person_3_status': np.random.randint(0, 2, 1000),
                   'person_1_type': np.random.randint(4, 8, 1000),
                   'person_2_type': np.random.randint(4, 8, 1000),
                   'person_3_type': np.random.randint(4, 8, 1000)},
                  columns=['person_1_status', 'person_2_status', 'person_3_status',
                           'person_1_type', 'person_2_type', 'person_3_type'])
person_1_status person_2_status person_3_status person_1_type \
0 1 0 0 7
1 0 1 0 6
2 1 0 1 7
3 0 0 0 7
4 0 0 1 4
person_2_type person_3_type
0 5 5
1 7 7
2 7 7
3 7 7
4 7 7
Make other_table
other_table = pd.Series({4:10, 5:20, 6:30, 7:40})
4 10
5 20
6 30
7 40
dtype: int64
Filter out status and type columns to their own dataframes
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
Make lookup table
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values
Multiply element-wise and sum across rows.
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Output
person_1_status person_2_status person_3_status person_1_type \
0 0 0 1 7
1 0 1 0 4
2 0 1 1 7
3 0 1 0 6
4 0 0 1 5
person_2_type person_3_type A B
0 7 5 80 20
1 6 4 20 30
2 5 5 40 40
3 6 4 40 30
4 7 5 60 20
Consider the dataframe df:
mux = pd.MultiIndex.from_product([['status', 'type'], ['p%i' % i for i in range(1, 6)]])
data = np.concatenate([np.random.choice((0, 1), (10, 5)), np.random.rand(10, 5)], axis=1)
df = pd.DataFrame(data, columns=mux)
df
The way this is structured, we can compute the per-row sum for status == 1
df.status.mul(df.type).sum(1)
0 0.935290
1 1.252478
2 1.354461
3 1.399357
4 2.102277
5 1.589710
6 0.434147
7 2.553792
8 1.205599
9 1.022305
dtype: float64
and for status == 0
df.status.rsub(1).mul(df.type).sum(1)
0 1.867986
1 1.068045
2 0.653943
3 2.239459
4 0.214523
5 0.734449
6 1.291228
7 0.614539
8 0.849644
9 1.109086
dtype: float64
You can get your columns in this format using the following code
df.columns = df.columns.str.split('_', expand=True)
df = df.swaplevel(0, 1, 1)
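To tie this back to the value lookup from the question, here is a minimal sketch built on the restructured columns (the column names and the type-to-value mapping are taken from the question; masking with where is just one option, equivalent to the multiply approach above):

import pandas as pd

# Sample row and lookup from the question.
df = pd.DataFrame([[0, 1, 0, 7, 4, 6]],
                  columns=['person1_status', 'person2_status', 'person3_status',
                           'person1_type', 'person2_type', 'person3_type'])
values = {4: 10, 5: 20, 6: 30, 7: 40}

# Split the flat names into a (measure, person) MultiIndex, as shown above.
df.columns = df.columns.str.split('_', expand=True)
df = df.swaplevel(0, 1, axis=1)

value_df = df['type'].replace(values)          # type codes -> looked-up values
status = df['status']

A = value_df.where(status.eq(0)).sum(axis=1)   # sum of values where status == 0
B = value_df.where(status.eq(1)).sum(axis=1)   # sum of values where status == 1
print(A.iloc[0], B.iloc[0])                    # 70.0 10.0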
