Python Dataframe Sum/Subtract Many Columns to Many Columns in one line - python

I want to subtract many columns from many other columns in a data frame.
My code:
df =
A1 B1 A2 B2
0 15 30 50 70
1 25 40 60 80
# I have many columns like this. I want to do something like this: A1-A2, B1-B2, etc.
# My approach is
first_cols = ["A1", "B1"]
sec_cols = ["A2", "B2"]
# New column names
sub_cols = ["A_sub", "B_sub"]
df[sub_cols] = df[first_cols] - df[sec_cols]
Present output:
ValueError: Wrong number of items passed, placement implies 1
Expected output:
df =
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40

I think what you are trying to do is similar to this post. In DataFrames, arithmetic operations are generally aligned on column and row indices. Since you are trying to subtract differently named columns, pandas doesn't carry out the operation, so df[sub_cols] = df[first_cols] - df[sec_cols] won't work.
However, if you use the underlying numpy array on the right-hand side, pandas carries the operation out elementwise. So df[sub_cols] = df[first_cols] - df[second_cols].values will work and give you the expected result.
import pandas as pd
df = {"A1":[15,25], "B1": [30, 40], "A2":[50,60], "B2": [70, 80]}
df = pd.DataFrame(df)
first_cols = ["A1", "B1"]
second_cols = ["A2", "B2"]
sub_cols = ["A_sub","B_sub"]
df[sub_cols] = df[first_cols] - df[second_cols].values  # .values drops the labels, so the subtraction is positional
print(df)
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40
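(In current pandas, .to_numpy() is the recommended spelling of .values; both strip the column labels so the subtraction happens purely by position.)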

You could also pull it off with a groupby on the columns:
import numpy as np

subtraction = (df.groupby(df.columns.str[0], axis = 1)
               .agg(np.subtract.reduce, axis = 1)
               .add_suffix("_sub")
               )
df.assign(**subtraction)
A1 B1 A2 B2 A_sub B_sub
0 15 30 50 70 -35 -40
1 25 40 60 80 -35 -40
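Note that groupby(axis=1) is deprecated in recent pandas versions. A sketch of the same idea grouping the transposed frame instead, assuming the paired columns stay in their original order:
subtraction = (df.T
               .groupby(df.columns.str[0])   # group rows A1/A2 and B1/B2 by first letter
               .agg(np.subtract.reduce)      # first member minus the second in each group
               .T
               .add_suffix("_sub")
               )
df.assign(**subtraction)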

It's not quite clear what you want. If you want one column that's A1-A2 and another that's B1-B2, you can do df[["A1", "B1"]].sub(df[["A2", "B2"]].to_numpy()).
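A minimal sketch of that one-liner on the question's frame (.to_numpy() strips the labels so the pairs line up by position):
df[["A_sub", "B_sub"]] = df[["A1", "B1"]].sub(df[["A2", "B2"]].to_numpy())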

Related

Create a master data set comprised of multiple data frames

I have been stuck on this problem for a while now! Included below is a very simplified version of my program, along with some context. Essentially, what I want is one large dataframe that holds all of my desired permutations based on my input variables. This is in the context of scenario analysis, and it will help me avoid doing on-demand calculations through my BI tool when the user wants to change variables to visualise the output.
I have tried:
Creating a function out of my code and trying to apply the function with each of the step-size changes of my input variables (no idea what I am doing there).
Literally manually changing the input variables myself (as a noob I realise this is not the way to go, but I had to first see my code was working to append df's).
Essentially what I want to achieve is as follows:
use the variables "date_offset" and "cost" and vary each of them by the defined step size and number of steps
As an example, if there are 2 values for date_offset (step size 1) and 2 values for cost (step size 1), there are 4 possible combinations, so the data set will be 4 times the size of the df in my code below.
Once I have all of the permutations of the input variables and the corresponding data frame for each of those permutations, I would like to append the data frames together.
I should be left with one data frame for all of the possible scenarios which I can then visualise with a BI tool.
I hope you guys can help :)
Here is my code.....
import pandas as pd
import numpy as np
#want to iterate through starting at a date_offset of 0 with a total of 5 steps and a step size of 1
date_offset = 0
steps_1 = 5
stepsize_1 = 1
#want to iterate through starting at a cost of 5 with a total number of steps of 5 and a step size of 1
cost = 5
steps_2 = 4
step_size = 1
df = {'id':['1a', '2a', '3a', '4a'],'run_life':[10,20,30,40]}
df = pd.DataFrame(df)
df['date_offset'] = date_offset
df['cost'] = cost
df['calc_col1'] = df['run_life']*cost
Are you trying to do something like this:
from itertools import product
data = {'id': ['1a', '2a', '3a', '4a'], 'run_life': [10, 20, 30, 40]}
df = pd.DataFrame(data)
date_offset = 0
steps_1 = 5
stepsize_1 = 1
cost = 5
steps_2 = 4
stepsize_2 = 1
df2 = pd.DataFrame(
    product(
        range(date_offset, date_offset + steps_1 * stepsize_1 + 1, stepsize_1),
        range(cost, cost + steps_2 * stepsize_2 + 1, stepsize_2)
    ),
    columns=['offset', 'cost']
)
result = df.merge(df2, how='cross')
result['calc_col1'] = result['run_life'] * result['cost']
Output:
id run_life offset cost calc_col1
0 1a 10 0 5 50
1 1a 10 0 6 60
2 1a 10 0 7 70
3 1a 10 0 8 80
4 1a 10 0 9 90
.. .. ... ... ... ...
115 4a 40 5 5 200
116 4a 40 5 6 240
117 4a 40 5 7 280
118 4a 40 5 8 320
119 4a 40 5 9 360
[120 rows x 5 columns]
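As a side note, merge(how='cross') requires pandas 1.2 or later. On older versions the same cross join can be sketched with a throwaway join key (the column name key below is arbitrary):
result = (df.assign(key=1)
            .merge(df2.assign(key=1), on='key')
            .drop(columns='key'))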

How to name the column when using the value_counts function in pandas?

I was counting the number of occurrences of angle and dist with the code below:
g = new_df.value_counts(subset=['Current_Angle','Current_dist'], sort=False)
the output:
current_angle current_dist 0
-50 30 1
-50 40 2
-50 41 6
-50 45 4
try 1:
g.columns = ['angle','Distance','count','Percentage Missed'] resulted in no change to the column names.
try 2:
When I print the columns using print(g.columns), I end up with the error AttributeError: 'Series' object has no attribute 'columns'.
I want to rename column 0 as count and add a new column percent missed to the dataframe g, calculated as 100 minus the value in column 0.
Expected output
current_angle current_dist count percent missed
-50 30 1 99
-50 40 2 98
-50 41 6 94
-50 45 4 96
1. How can I modify the code? Instead of value_counts, is there another function that can give the expected output?
2. How can I get the expected output with the current method?
EDIT 1 (exceptional case)
data:
angle  distance  velocity
    0       124        -3
   50        24       -25
   50        34        25
expected output (count is calculated based on distance):
angle  distance  velocity  count  percent missed
    0       124        -3      1              99
   50        24       -25      1              99
   50        34        25      1              99
First add Series.reset_index, because DataFrame.value_counts returns a Series; its name parameter renames column 0 to count. Then subtract from 100 into a new column with Series.rsub, which subtracts from the right side, i.e. 100 - df['count']:
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'], sort=False)
        .reset_index(name='count')
        .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
Or, if you also need to set new column names, use DataFrame.set_axis:
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'], sort=False)
        .reset_index(name='count')
        .set_axis(['angle','Distance','count'], axis=1)
        .assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
If you need to assign new column names, here is an alternative solution:
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'], sort=False)
        .reset_index())
df.columns = ['angle','Distance','count']
df['percent missed'] = df['count'].rsub(100)
Assuming a DataFrame as input (if not, reset_index first), simply use rename and a subtraction:
df = df.rename(columns={'0': 'count'}) # assuming string '0' here, else use 0
df['percent missed'] = 100 - df['count']
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
alternative: using groupby.size:
(new_df
.groupby(['current_angle','current_dist']).size()
.reset_index(name='count')
.assign(**{'percent missed': lambda d: 100-d['count']})
)
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
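For completeness, a sketch of the Series-native route behind try 2: value_counts returns a Series, which has a name rather than columns, so name it with Series.rename before reset_index:
g = new_df.value_counts(subset=['Current_Angle','Current_dist'], sort=False)
df = g.rename('count').reset_index()   # the Series name becomes the column label
df['percent missed'] = 100 - df['count']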

Determine the range of a value using a lookup table

I have a df with numbers:
numbers = pd.DataFrame(columns=['number'], data=[
50,
65,
75,
85,
90
])
and a df with ranges (look up table):
ranges = pd.DataFrame(
columns=['range','range_min','range_max'],
data=[
['A',90,100],
['B',85,95],
['C',70,80]
]
)
I want to determine what range (in second table) a value (in the first table) falls in. Please note ranges overlap, and limits are inclusive.
Also please note that the vanilla dataframe above has 3 ranges; however, this dataframe is generated dynamically and could have anywhere from 2 to 7 ranges.
Desired result:
numbers = pd.DataFrame(columns=['number','detected_range'], data=[
[50,'out_of_range'],
[65, 'out_of_range'],
[75,'C'],
[85,'B'],
[90, 'overlap']   # could be A or B
])
I solved this with a for loop, but it doesn't scale well to the big dataset I am using, and the code is too extensive and inelegant. See below:
numbers['detected_range'] = np.nan
for i, row1 in numbers.iterrows():
    for j, row2 in ranges.iterrows():
        if row1.number >= row2.range_min and row1.number <= row2.range_max:
            numbers.loc[i, 'detected_range'] = row2['range']
        elif (other cases...):
            ...and so on...
How could I do this?
You can use a few vectorized numpy operations to generate masks, and use them to select your labels:
import numpy as np
a = numbers['number'].values # numpy array of numbers
r = ranges.set_index('range') # dataframe of min/max with labels as index
m1 = (a>=r['range_min'].values[:,None]).T # is number at or above each min
m2 = (a<=r['range_max'].values[:,None]).T # is number at or below each max (limits are inclusive)
m3 = (m1&m2) # combine both conditions above
# NB. the two operations could be done without the intermediate variables m1/m2
m4 = m3.sum(1) # how many matches?
# 0 -> out_of_range
# 2 -> overlap
# 1 -> get column name
# now we select the label according to the conditions
numbers['detected_range'] = np.select([m4==0, m4==2], # out_of_range and overlap
['out_of_range', 'overlap'],
# otherwise get column name
default=np.take(r.index, m3.argmax(1))
)
output:
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
Edit:
It works with any number of intervals in ranges.
Example output with an extra row ['D', 50, 51]:
number detected_range
0 50 D
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
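Per the NB in the code above, the two masks can also be built in a single broadcasted expression, for example:
m3 = ((a[:, None] >= r['range_min'].values) &
      (a[:, None] <= r['range_max'].values))   # shape: (len(numbers), len(ranges))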
Pandas IntervalIndex fits in here; however, since your data has overlapping intervals, a for loop is the approach I'll use (for unique, non-overlapping intervals, Index.get_indexer is a fast approach):
intervals = pd.IntervalIndex.from_arrays(ranges.range_min,
                                         ranges.range_max,
                                         closed='both')
box = []
for num in numbers.number:
    bools = intervals.contains(num)
    if bools.sum() == 1:
        box.append(ranges.range[bools].item())
    elif bools.sum() > 1:
        box.append('overlap')
    else:
        box.append('out_of_range')
numbers.assign(detected_range = box)
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
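For reference, a sketch of the get_indexer shortcut mentioned above. It only applies when the intervals do not overlap (with overlapping intervals it raises an error), which is why the loop is used here:
idx = intervals.get_indexer(numbers.number)   # -1 where no interval matches
numbers['detected_range'] = np.where(idx == -1, 'out_of_range',
                                     ranges['range'].to_numpy()[idx])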
First, explode the ranges into one row per covered integer (the + 1 keeps the limits inclusive, as the question requires):
df1 = ranges.assign(col1=ranges.apply(lambda ss: range(ss.range_min, ss.range_max + 1), axis=1)).explode('col1')
df1
range range_min range_max col1
0 A 90 100 90
0 A 90 100 91
0 A 90 100 92
... ... ... ...
0 A 90 100 100
1 B 85 95 85
... ... ... ...
1 B 85 95 95
2 C 70 80 70
... ... ... ...
2 C 70 80 80
Second, check whether each number appears in the exploded df:
def function1(x):
    df11 = df1.loc[df1.col1 == x]
    if len(df11) == 0:
        return 'out_of_range'
    if len(df11) > 1:
        return 'overlap'
    return df11.iloc[0, 0]
numbers.assign(col2=numbers.number.map(function1))
number col2
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
The logic is simple and clear.

Removing rows from pandas dataframe based on several columns

From a pandas dataframe, I want to remove the "rois" where, in half or more
of the rows, any of the columns s, b1 or b2 has a value below 50.
Here an example dataframe:
roi s b1 b2
4 40 60 70
4 60 40 80
4 80 70 60
5 60 40 60
5 60 60 60
5 60 60 60
Only the three rows corresponding to roi 5 should be left over (roi 4 has 2 out of 3 rows
where at least one of the values of s, b1, b2 is below 50).
I have this implemented already, but wonder if there is a shorter (i.e. faster and cleaner) way to do this:
for roi in data.roi.unique():
subdata = data[data['roi']==roi];
subdatas = subdata[subdata['s']>=50];
subdatab1 = subdatas[subdatas['b1']>=50];
subdatab2 = subdatab1[subdatab1['b2']>=50]
if((subdatab2.size/10)/(subdata.size/10) < 0.5):
data = data[data['roi']!=roi];
You can do this with transform:
s = (data.set_index('roi')      # move `roi` out of the comparison
     .lt(50).any(axis=1)        # True where any of s, b1, b2 is below 50
     .groupby('roi')            # group the per-row flags by roi
     .transform('mean')         # fraction of failing rows per roi
     .lt(0.5)                   # keep rois where fewer than half the rows fail
     .values
     )
data[s]
Output:
roi s b1 b2
3 5 60 40 60
4 5 60 60 60
5 5 60 60 60
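The transform computes the failing fraction once per group in a single vectorized pass, instead of re-filtering the whole frame once per roi the way the loop does.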
You can combine all the filter conditions at once to avoid creating intermediate data frames (better space efficiency), for example:
for roi in data.roi.unique():
    subdata2 = data[(data['roi']==roi) &
                    (data['s']>=50) &
                    (data['b1']>=50) &
                    (data['b2']>=50)]
    if len(subdata2) / len(data[data['roi']==roi]) < 0.5:
        data = data[data['roi']!=roi]
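Another compact option is a sketch with groupby.filter, keeping a roi only when fewer than half of its rows have any value below 50:
data.groupby('roi').filter(
    lambda g: g[['s', 'b1', 'b2']].lt(50).any(axis=1).mean() < 0.5
)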

broadcast groupby with boolean filter scalar in pandas

I have a data frame as below.
df = pd.DataFrame({'var1' : list('a' * 3) + list('b' * 2) + list('c' * 4)
,'var2' : [i for i in range(9)]
,'var3' : [20, 40, 100, 10, 80, 12,24, 53, 90]
})
End result that I want is the following:
var1 var2 var3 var3_lt_50
0 a 0 20 60
1 a 1 40 60
2 a 2 100 60
3 b 3 10 10
4 b 4 80 10
5 c 5 12 36
6 c 6 24 36
7 c 7 53 36
8 c 8 90 36
I get this result in two steps, through a group-by and a merge, with the code below:
df = df.merge(df[df.var3 < 50][['var1', 'var3']]
                .groupby('var1', as_index = False).sum()
                .rename(columns = {'var3' : 'var3_lt_50'}),
              how = 'left',
              left_on = 'var1',
              right_on = 'var1')
Can someone show me a way of doing this kind of boolean filtering plus broadcasting of a per-group scalar without the groupby + merge steps I'm doing today? I want a smoother line of code.
Thanks in advance for input,
/Swepab
You can use groupby.transform, which keeps the shape and index of the transformed variable, so you can assign the result straight back to the data frame:
df['var3_lt_50'] = df.groupby('var1').var3.transform(lambda g: g[g < 50].sum())
df
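If the Python-level lambda turns out to be slow on large frames, a sketch of an equivalent fully vectorized variant: zero out the values at or above 50 with Series.where, then use the built-in 'sum' transform:
df['var3_lt_50'] = (df['var3']
                    .where(df['var3'] < 50, 0)   # replace values >= 50 with 0
                    .groupby(df['var1'])
                    .transform('sum'))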
