Multiplying and summing certain columns based on name in pandas - python

I have a small sample data set:
import pandas as pd

d = {
    'measure1_x': [10, 12, 20, 30, 21],
    'measure2_x': [11, 12, 10, 3, 3],
    'measure3_x': [10, 0, 12, 1, 1],
    'measure1_y': [1, 2, 2, 3, 1],
    'measure2_y': [1, 1, 1, 3, 3],
    'measure3_y': [1, 0, 2, 1, 1]
}
df = pd.DataFrame(d)
df = df.reindex(columns=[
    'measure1_x', 'measure2_x', 'measure3_x', 'measure1_y', 'measure2_y', 'measure3_y'
])
It looks like:
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y
10 11 10 1 1 1
12 12 0 2 1 0
20 10 12 2 1 2
30 3 1 3 3 1
21 3 1 1 3 1
I created the column names to be almost identical except for the '_x' and '_y' suffixes, to help identify which pairs should be multiplied: I want to multiply each pair whose names match once '_x' and '_y' are disregarded, then sum the products to get a total number. Keep in mind my actual data set is huge and the columns are not in this neat order, so this naming is the way to identify the correct pairs to multiply:
total = measure1_x * measure1_y + measure2_x * measure2_y + measure3_x * measure3_y
So the desired output is:
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y total
10 11 10 1 1 1 31
12 12 0 2 1 0 36
20 10 12 2 1 2 74
30 3 1 3 3 1 100
21 3 1 1 3 1 31
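For the first row, for example, total = 10*1 + 11*1 + 10*1 = 31.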
My attempt and thought process; I cannot proceed any further syntax-wise:
# First identify the column names that have '_x' or '_y', then check whether
# the column names are the same after removing '_x' and '_y'; if a pair has
# the same name, multiply the two columns, do that for all pairs, and sum the
# results to get the total number.
for colname in df.columns:
    if "_x" in colname.lower() or "_y" in colname.lower():
        if "_x" in colname.lower():
            colnamex = colname
        if "_y" in colname.lower():
            colnamey = colname
# if colnamex[:-2] is the same for colnamex and colnamey, multiply and sum
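One way to finish that loop-based idea (a sketch, assuming every pair uses exactly the '_x'/'_y' suffix) is to collect the columns keyed by their stripped name, then multiply the matching pairs and sum the products:
x_cols, y_cols = {}, {}
for colname in df.columns:
    if colname.lower().endswith('_x'):
        x_cols[colname[:-2]] = colname      # key on the name without the suffix
    elif colname.lower().endswith('_y'):
        y_cols[colname[:-2]] = colname

total = 0
for base, cx in x_cols.items():
    if base in y_cols:                      # same base name -> a valid pair
        total = total + df[cx] * df[y_cols[base]]
df['total'] = total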

filter + np.einsum
Thought I'd try something a little different this time—
get your _x and _y columns separately
do a product-sum. This is very easy to specify with einsum (and fast).
import numpy as np

df = df.sort_index(axis=1)  # optional, do this if your columns aren't sorted
i = df.filter(like='_x')
j = df.filter(like='_y')
df['Total'] = np.einsum('ij,ij->i', i, j)  # equivalent to (i.values * j.values).sum(axis=1)
df
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y Total
0 10 11 10 1 1 1 31
1 12 12 0 2 1 0 36
2 20 10 12 2 1 2 74
3 30 3 1 3 3 1 100
4 21 3 1 1 3 1 31
A slightly more robust version which filters out non-numeric columns and performs an assertion beforehand—
df = df.sort_index(axis=1).select_dtypes(exclude=[object])
i = df.filter(regex='.*_x')
j = df.filter(regex='.*_y')
assert i.shape == j.shape
df['Total'] = np.einsum('ij,ij->i', i, j)
If the assertion fails, then the assumptions of 1) your columns being numeric, and 2) the number of x and y columns being equal, as your question would suggest, do not hold for your actual dataset.
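If sorting alone cannot be trusted to line the pairs up, a sketch that pairs the columns explicitly by their stripped name (assuming the suffix is always the last two characters):
x = df.filter(like='_x').rename(columns=lambda c: c[:-2])
y = df.filter(like='_y').rename(columns=lambda c: c[:-2])
df['Total'] = (x * y).sum(axis=1)  # pandas aligns the columns on the stripped names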

Use df.columns.str.split to generate a new MultiIndex
Use prod with axis and level arguments
Use sum with axis argument
Use assign to create new column
df.assign(
    Total=df.set_axis(
        df.columns.str.split('_', expand=True),
        axis=1, inplace=False
    ).prod(axis=1, level=0).sum(1)
)
measure1_x measure2_x measure3_x measure1_y measure2_y measure3_y Total
0 10 11 10 1 1 1 31
1 12 12 0 2 1 0 36
2 20 10 12 2 1 2 74
3 30 3 1 3 3 1 100
4 21 3 1 1 3 1 31
Restrict the dataframe to just columns that look like 'measure[i]_[j]':
df.assign(
    Total=df.filter(regex=r'^measure\d+_\w+$').pipe(
        lambda d: d.set_axis(
            d.columns.str.split('_', expand=True),
            axis=1, inplace=False
        )
    ).prod(axis=1, level=0).sum(1)
)
Debugging
See if this gets you the correct Totals
d_ = df.copy()
d_.columns = d_.columns.str.split('_', expand=True)
d_.prod(axis=1, level=0).sum(1)
0 31
1 36
2 74
3 100
4 31
dtype: int64
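Note that prod(axis=1, level=0) has since been deprecated and removed in newer pandas; an equivalent under current pandas (a sketch, not part of the original answer) is to group the transposed frame on its first index level:
d_ = df.copy()
d_.columns = d_.columns.str.split('_', expand=True)
d_.T.groupby(level=0).prod().T.sum(axis=1)  # same per-row totals as above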

Related

Pandas: finding mean in the dataframe based on condition included in another

I have two dataframes, just like below.
Dataframe1:
country  type  start_week  end_week
1        a     12          13
2        b     13          14
Dataframe2:
country  type  week  value
1        a     12    1000
1        a     13    900
1        a     14    800
2        b     12    1000
2        b     13    900
2        b     14    800
I want to add a column to the first dataframe with the mean value from the second dataframe, matched on the key (country + type) and restricted to weeks between start_week and end_week.
I want the desired output to look like below:
country  type  start_week  end_week  avg
1        a     12          13        950
2        b     13          14        850
Here is one way:
combined = df1.merge(df2, on=['country','type'])
combined = combined.loc[(combined.start_week <= combined.week) & (combined.week <= combined.end_week)]
output = combined.groupby(['country','type','start_week','end_week'])['value'].mean().reset_index()
output:
>>
country type start_week end_week value
0 1 a 12 13 950.0
1 2 b 13 14 850.0
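If you then want that result back on df1 with the column named avg (a small follow-up sketch, not part of the answer above):
output = output.rename(columns={'value': 'avg'})
df1 = df1.merge(output, on=['country', 'type', 'start_week', 'end_week'])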
You can use pd.melt and comparison of numpy arrays.
# melt df1
melted_df1 = df1.melt(id_vars=['country','type'],value_name='week')[['country','type','week']]
# for loop to compare two dataframe arrays
result = []
for i in df2.values:
    for j in melted_df1.values:
        if (j == i[:3]).all():
            result.append(i)
            break
# Computing mean of the result dataframe
result_df = pd.DataFrame(result,columns=df2.columns).groupby('type').mean().reset_index()['value']
# Assigning result_df to df1
df1['avg'] = result_df
country type start_week end_week avg
0 1 a 12 13 950.0
1 2 b 13 14 850.0

Rewriting a column's cell values in a dataframe based on when the value changes, without using an if statement

I have a column with faulty values: it is supposed to count cycles, but the device the data comes from resets the count after 50, so I was left with, for example, [1,1,1,1,2,2,2,3,3,3,3,...,50,50,50,1,1,1,2,2,2,2,3,3,3,...,50,50,...,50]
My solution is below, and I can't even make it work (for simplicity I made the data reset after 10 cycles):
import pandas as pd

data = {'Cyc-Count': [1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,1,1,1,2,3,3,3,3,
                      4,4,5,6,6,6,7,8,8,8,8,9,10]}
df = pd.DataFrame(data)
x = 0
count = 0
old_value = df.at[x, 'Cyc-Count']
for x in range(x, len(df) - 1):
    if df.at[x, 'Cyc-Count'] == df.at[x + 1, 'Cyc-Count']:
        old_value = df.at[x + 1, 'Cyc-Count']
        df.at[x + 1, 'Cyc-Count'] = count
    else:
        old_value = df.at[x + 1, 'Cyc-Count']
        count += 1
        df.at[x + 1, 'Cyc-Count'] = count
I need to fix this, but preferably without even using if statements.
The desired output for the example above should be:
data = {'Cyc-Count': [1,1,2,2,2,3,4,5,6,7,7,7,8,9,10,11,11,11,12,13,13,13,13,
                      14,14,15,16,16,16,17,18,18,18,18,19,20]}
Hint: my method has a big issue in that the last indexed value is hard to change, since when comparing it with index+1, that index doesn't even exist.
IIUC, you want to continue the count when the counter decreases.
You can use vectorial code:
s = df['Cyc-Count'].shift()
df['Cyc-Count2'] = (df['Cyc-Count']
                    + s.where(s.gt(df['Cyc-Count']))
                       .fillna(0, downcast='infer')
                       .cumsum()
                    )
Or, to modify the column in place:
s = df['Cyc-Count'].shift()
df['Cyc-Count'] += (s.where(s.gt(df['Cyc-Count']))
                    .fillna(0, downcast='infer').cumsum()
                    )
output:
Cyc-Count Cyc-Count2
0 1 1
1 1 1
2 1 1
3 1 1
4 2 2
5 2 2
6 2 2
7 3 3
8 3 3
9 3 3
10 3 3
11 4 4
12 5 5
13 5 5
14 5 5
15 1 6
16 1 6
17 1 6
18 2 7
19 2 7
20 2 7
21 2 7
22 3 8
23 3 8
24 3 8
25 4 9
26 5 10
27 5 10
28 1 11
29 2 12
30 2 12
31 3 13
32 4 14
33 5 15
34 5 15
used input:
l = [1,1,1,1,2,2,2,3,3,3,3,4,5,5,5,1,1,1,2,2,2,2,3,3,3,4,5,5,1,2,2,3,4,5,5]
df = pd.DataFrame({'Cyc-Count': l})
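To see the mechanism on a tiny made-up series (a sketch, not the OP's data): at each point where the counter drops, the value it dropped from is added, cumulatively, to everything that follows.
import pandas as pd

s = pd.Series([1, 2, 3, 1, 2, 1])
prev = s.shift()
offset = prev.where(prev.gt(s)).fillna(0).cumsum()   # 0 0 0 3 3 5
print((s + offset).astype(int).tolist())             # [1, 2, 3, 4, 5, 6]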
You can use df.loc to access a group of rows and columns by label(s) or a boolean array.
syntax: df.loc[df['column name'] condition, 'column name or the new one'] = 'value if condition is met'
for example:
import pandas as pd
numbers = {'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]}
df = pd.DataFrame(numbers,columns=['set_of_numbers'])
print (df)
df.loc[df['set_of_numbers'] == 0, 'set_of_numbers'] = 999
df.loc[df['set_of_numbers'] == 5, 'set_of_numbers'] = 555
print (df)
Before: 'set_of_numbers': [1,2,3,4,5,6,7,8,9,10,0,0]
After:  'set_of_numbers': [1,2,3,4,555,6,7,8,9,10,999,999]

How to add row with default values if not in consecutive order

I have a df like this:
time units cost
0 4 10
1 2 10
3 4 20
4 1 20
5 3 10
6 1 20
9 2 10
As you can see, df.time is not consecutive. If there is a missing value, I want to add a new row, populating df.time with the consecutive time value, df.units with 2 and df.cost with 20.
Expected output:
time units cost
0 4 10
1 2 10
2 2 20
3 4 20
4 1 20
5 3 10
6 1 20
7 2 20
8 2 20
9 2 10
How do I do this? I understand how to do this by deconstructing all the series into lists, looping through them and appending values whenever a time value is not one more than the previous one, but this seems inefficient.
You can use the reindex method with a call to fillna to do this:
# Build new index that ranges from time min to time max with a step of 1
new_index = range(df["time"].min(), df["time"].max() + 1)
out = (df.set_index("time")                    # index the dataframe with the original time column
         .reindex(new_index)                   # reindex with new_index; all empty cells become NaN
         .fillna({"units": 2, "cost": 20})     # fill the NaNs for units and cost with 2 and 20 respectively
         .astype(int))                         # recast from float (caused by the NaNs) back to int (not necessary, but produces cleaner output)
print(out)
units cost
time
0 4 10
1 2 10
2 2 20
3 4 20
4 1 20
5 3 10
6 1 20
7 2 20
8 2 20
9 2 10
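If you want time back as a regular column rather than as the index, you could append a reset_index() to the chain above (a small optional addition):
out = out.reset_index()  # move 'time' from the index back into a column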
You can use df.reindex, then pd.Series.fillna.
idx = pd.RangeIndex(df['time'].min(), df['time'].max()+1)
# If `df.time` is always sorted then,
# idx = pd.RangeIndex(df['time'].iat[0], df['time'].iat[-1]+1)
df = df.set_index('time')
df = df.reindex(idx)
df['units'] = df['units'].fillna(2).astype(int)
df['cost'] = df['cost'].fillna(20).astype(int)
# If you prefer not to hard-code the column names, replace the last
# two lines with:
# defaults = [2, 20]
# for (name, default) in zip(df.columns, defaults):
#     df[name] = df[name].fillna(default).astype(type(default))
units cost
time
0 4 10
1 2 10
2 2 20
3 4 20
4 1 20
5 3 10
6 1 20
7 2 20
8 2 20
9 2 10
You can construct a new DataFrame with a complete "time" column and then .fillna() from the original dataframe (df is your original dataframe):
import numpy as np

r = range(df['time'].min(), df['time'].max()+1)
df_out = pd.DataFrame({'time': r, 'units': [np.nan]*len(r), 'cost': [np.nan]*len(r)}).set_index('time')
df_out = df_out.fillna(df.set_index('time'))
df_out['units'] = df_out['units'].fillna(2).astype(int)
df_out['cost'] = df_out['cost'].fillna(20).astype(int)
print(df_out)
Prints:
units cost
time
0 4 10
1 2 10
2 2 20
3 4 20
4 1 20
5 3 10
6 1 20
7 2 20
8 2 20
9 2 10

Pandas Calculate Sum of Multiple Columns Given Multiple Conditions

I have a wide table in a format as follows (for up to 10 people):
person1_status | person2_status | person3_status | person1_type | person2_type | person3_type
0 | 1 | 0 | 7 | 4 | 6
Where status can be a 0 or a 1 (first 3 cols).
Where type can be a # ranging from 4-7. The value here corresponds to another table that specifies a value based on type. So...
Type | Value
4 | 10
5 | 20
6 | 30
7 | 40
I need to calculate two columns, 'A', and 'B', where:
A is the sum of values of each person's type (in that row) where
status = 0.
B is the sum of values of each person's type (in that row) where
status = 1.
For example, the resulting columns 'A', and 'B' would be as follows:
A | B
70 | 10
An explanation of this:
'A' has value 70 because person1 and person3 have "status" 0 and have corresponding types of 7 and 6 (which correspond to values 40 and 30).
Similarly, there should be another column 'B' that has the value "10" because only person2 has status "1" and their type is "4" (which has corresponding value of 10).
This is probably a stupid question, but how do I do this in a vectorized way? I don't want to use a for loop or anything since it'll be less efficient...
I hope that made sense... could anyone help me? I think I'm brain dead trying to figure this out.
For simpler calculated columns I was getting away with just np.where but I'm a little stuck here since I need to calculate the sum of values from multiple columns given certain conditions while pulling in those values from a separate table...
hope that made sense
Use the filter method which will filter the column names for those where a string appears in them.
Make a dataframe for the lookup values other_table and set the index as the type column.
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Full example below:
Create fake data
import numpy as np
import pandas as pd

df = pd.DataFrame({'person_1_status': np.random.randint(0, 2, 1000),
                   'person_2_status': np.random.randint(0, 2, 1000),
                   'person_3_status': np.random.randint(0, 2, 1000),
                   'person_1_type': np.random.randint(4, 8, 1000),
                   'person_2_type': np.random.randint(4, 8, 1000),
                   'person_3_type': np.random.randint(4, 8, 1000)},
                  columns=['person_1_status', 'person_2_status', 'person_3_status',
                           'person_1_type', 'person_2_type', 'person_3_type'])
person_1_status person_2_status person_3_status person_1_type \
0 1 0 0 7
1 0 1 0 6
2 1 0 1 7
3 0 0 0 7
4 0 0 1 4
person_2_type person_3_type
0 5 5
1 7 7
2 7 7
3 7 7
4 7 7
Make other_table
other_table = pd.Series({4:10, 5:20, 6:30, 7:40})
4 10
5 20
6 30
7 40
dtype: int64
Filter out status and type columns to their own dataframes
df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
Make lookup table
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values
Apply element-wise multiplication and sum across rows.
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
Output
person_1_status person_2_status person_3_status person_1_type \
0 0 0 1 7
1 0 1 0 4
2 0 1 1 7
3 0 1 0 6
4 0 0 1 5
person_2_type person_3_type A B
0 7 5 80 20
1 6 4 20 30
2 5 5 40 40
3 6 4 40 30
4 7 5 60 20
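The applymap call above does a label lookup per cell, which can be slow on a large frame; a possible alternative (a sketch, reusing other_table from above) is to let replace map every type code at once:
df_type_lookup = df_type.replace(other_table.to_dict()).values
df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)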
consider the dataframe df
mux = pd.MultiIndex.from_product([['status', 'type'], ['p%i' % i for i in range(1, 6)]])
data = np.concatenate([np.random.choice((0, 1), (10, 5)), np.random.rand(10, 5)], axis=1)
df = pd.DataFrame(data, columns=mux)
df
The way this is structured, we can do this for status == 1:
df.status.mul(df.type).sum(1)
0 0.935290
1 1.252478
2 1.354461
3 1.399357
4 2.102277
5 1.589710
6 0.434147
7 2.553792
8 1.205599
9 1.022305
dtype: float64
and for status == 0:
df.status.rsub(1).mul(df.type).sum(1)
0 1.867986
1 1.068045
2 0.653943
3 2.239459
4 0.214523
5 0.734449
6 1.291228
7 0.614539
8 0.849644
9 1.109086
dtype: float64
You can get your columns in this format using the following code
df.columns = df.columns.str.split('_', expand=True)
df = df.swaplevel(0, 1, 1)
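Putting those pieces together for the question's layout (a sketch using a small hypothetical frame; the column names and value_map here are illustrative assumptions, not taken from the answer above):
import pandas as pd

df = pd.DataFrame({
    'person1_status': [0, 1], 'person2_status': [1, 0], 'person3_status': [0, 0],
    'person1_type':   [7, 4], 'person2_type':   [4, 6], 'person3_type':   [6, 5],
})
value_map = {4: 10, 5: 20, 6: 30, 7: 40}

wide = df.set_axis(df.columns.str.split('_', expand=True), axis=1).swaplevel(0, 1, axis=1)
status = wide['status']                     # one 0/1 column per person
values = wide['type'].replace(value_map)    # type codes mapped to their values

df['A'] = values.where(status.eq(0), 0).sum(axis=1)   # sum of values where status == 0
df['B'] = values.where(status.eq(1), 0).sum(axis=1)   # sum of values where status == 1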

GroupBy one column, custom operation on another column of grouped records in pandas

I wanted to apply a custom operation on a column by grouping the values on another column: group by one column to get the count, then divide the other column's value by this count for all the grouped records.
My Data Frame:
emp opp amount
0 a 1 10
1 b 1 10
2 c 2 30
3 b 2 30
4 d 2 30
My scenario:
For opp=1, two emps worked (a, b), so the amount should be shared:
10/2 = 5
For opp=2, three emps worked (b, c, d), so the amount should be:
30/3 = 10
Final Output DataFrame:
emp opp amount
0 a 1 5
1 b 1 5
2 c 2 10
3 b 2 10
4 d 2 10
What is the best possible way to do so?
df['amount'] = df.groupby('opp')['amount'].transform(lambda g: g/g.size)
df
# emp opp amount
# 0 a 1 5
# 1 b 1 5
# 2 c 2 10
# 3 b 2 10
# 4 d 2 10
Or:
df['amount'] = df.groupby('opp')['amount'].apply(lambda g: g/g.size)
does similar thing.
You could try something like this:
df2 = df.groupby('opp').amount.count()
df.loc[:, 'calculated'] = df.apply(
    lambda row: row.amount / df2.loc[row.opp], axis=1)
df
Yields:
emp opp amount calculated
0 a 1 10 5
1 b 1 10 5
2 c 2 30 10
3 b 2 30 10
4 d 2 30 10
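A slightly more concise variant of the same idea (a sketch): divide by the per-group size obtained with transform, which avoids the row-wise apply.
df['amount'] = df['amount'] / df.groupby('opp')['amount'].transform('size')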
