I have a dataset, df, where I would like to create columns that display the output of a subtraction calculation:
Data
count power id p_q122 p_q222 c_q122 c_q222
100 1000 aa 200 300 10 20
100 2000 bb 400 500 5 10
Desired
count power id p_q122 avail1 p_q222 avail2 c_q122 count1 c_q222 count2
100 1000 aa 200 800 300 700 10 90 20 80
100 2000 bb 400 1600 500 1500 5 95 10 90
Doing
df['avail1'] = df['power'] - df['p_q122']
df['avail2'] = df['power'] - df['p_q222']
I am looking for a more elegant way that produces the desired output. Any suggestion is appreciated.
We can perform 2D subtraction with numpy:
pd.DataFrame(
    df['power'].to_numpy()[:, None] - df.filter(like='p_').to_numpy()
).rename(columns=lambda i: f'avail{i + 1}')
avail1 avail2
0 800 700
1 1600 1500
The benefit here is that no matter how many p_ columns there are, each will be subtracted from the power column.
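For instance, with a hypothetical extra column p_q322 (not in the original data, added only to illustrate), the same expression picks it up automatically:
demo = df.copy()
demo['p_q322'] = [350, 550]  # hypothetical third p_ column
pd.DataFrame(
    demo['power'].to_numpy()[:, None] - demo.filter(like='p_').to_numpy()
).rename(columns=lambda i: f'avail{i + 1}')
#    avail1  avail2  avail3
# 0     800     700     650
# 1    1600    1500    1450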
We can concat all of the computations with df like:
df = pd.concat([
    df,
    # power calculations
    pd.DataFrame(
        df['power'].to_numpy()[:, None] - df.filter(like='p_').to_numpy()
    ).rename(columns=lambda i: f'avail{i + 1}'),
    # count calculations
    pd.DataFrame(
        df['count'].to_numpy()[:, None] - df.filter(like='c_').to_numpy()
    ).rename(columns=lambda i: f'count{i + 1}'),
], axis=1)
which gives df:
count power id p_q122 p_q222 ... c_q222 avail1 avail2 count1 count2
0 100 1000 aa 200 300 ... 20 800 700 90 80
1 100 2000 bb 400 500 ... 10 1600 1500 95 90
[2 rows x 11 columns]
If we have many column groups to do, we can build the list of DataFrames programmatically as well:
df = pd.concat([df, *(
    pd.DataFrame(
        df[col].to_numpy()[:, None] - df.filter(like=filter_prefix).to_numpy()
    ).rename(columns=lambda i: f'{new_prefix}{i + 1}')
    for col, filter_prefix, new_prefix in [
        ('power', 'p_', 'avail'),
        ('count', 'c_', 'count'),
    ]
)], axis=1)
Setup and imports:
import pandas as pd
df = pd.DataFrame({
    'count': [100, 100], 'power': [1000, 2000], 'id': ['aa', 'bb'],
    'p_q122': [200, 400], 'p_q222': [300, 500], 'c_q122': [10, 5],
    'c_q222': [20, 10],
})
Try:
df['avail1'] = df['power'].sub(df['p_q122'])
df['avail2'] = df['power'].sub(df['p_q222'])
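If more quarters are added later, the same idea generalizes with a loop over the suffixes (a sketch; the suffix list is inferred from the column names shown):
for i, q in enumerate(['q122', 'q222'], start=1):
    df[f'avail{i}'] = df['power'].sub(df[f'p_{q}'])
    df[f'count{i}'] = df['count'].sub(df[f'c_{q}'])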
I'm sure there's a better way to describe what I'm trying to do, but here's an example.
Say I have a dataframe:
d = {'col1': [1, 5, 10, 22, 36, 57], 'col2': [100, 450, 1200, 2050, 3300, 6000]}
df = pd.DataFrame(data=d)
df
   col1  col2
0     1   100
1     5   450
2    10  1200
3    22  2050
4    36  3300
5    57  6000
and a second dataframe (or series I suppose):
d2 = {'col2': [100, 200, 450, 560, 900, 1200, 1450, 1800, 2050, 2600, 3300, 5000, 6000]}
df2 = pd.DataFrame(data=d2)
df2
col2
0 100
1 200
2 450
3 560
4 900
5 1200
6 1450
7 1800
8 2050
9 2600
10 3300
11 5000
12 6000
I need an efficient way to assign a value to a second column in df2 in the following way:
if the value in df2['col2'] matches a value in df['col2'], assign the value of df['col1'] from the same row.
if there isn't a matching value, find the range it fits in and interpolate the value from that range. E.g. for df2.loc[1, 'col2'], the col2 value is 200, which falls between 100 and 450 in the first dataframe, so the new value would be (5 - 1) / (450 - 100) * 200 = 2.2857.
Edit: the correct formula is (5 - 1) / (450 - 100) * (200 - 100) + 1 = 2.1429.
Now that you have confirmed your requirement, we can build a solution: use a loop to find segments bounded by non-NaN values and linearly interpolate the points in between.
This algorithm only works when col1 is anchored by non-NaN values on both ends, hence the assert statement.
import numpy as np

# Left-join df onto df2 so unmatched rows of col1 are NaN
col1, col2 = df2.merge(df, how='left', on='col2')[['col1', 'col2']].to_numpy().T
assert ~np.isnan(col1[[0, -1]]).any(), 'First and last elements of col1 must not be NaN'

n = len(col1)
i = 0
while i < n:
    j = i + 1
    # Advance j to the next non-NaN anchor
    while j < n and np.isnan(col1[j]):
        j += 1
    if j - i > 1:
        # The linear equation through the two anchor points
        f = np.polyfit(col2[[i, j]], col1[[i, j]], deg=1)
        # Apply the equation to all points between i and j
        col1[i:j+1] = np.polyval(f, col2[i:j+1])
    i = j

df2['col1'] = col1  # write the interpolated values back
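For reference, the same mapping can be produced in one call with np.interp, assuming col2 in df is sorted ascending (a sketch, separate from the loop above):
import numpy as np
# piecewise-linear interpolation of df['col1'] over df['col2'], evaluated at df2['col2']
df2['col1'] = np.interp(df2['col2'], df['col2'], df['col1'])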
Have you considered training a regression model on your first dataframe, then predicting the values on your second?
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
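A minimal sketch of that idea, assuming scikit-learn is installed (note that a single linear fit only approximates the piecewise-linear mapping and will not reproduce exact matches):
import pandas as pd
from sklearn.linear_model import LinearRegression

d = {'col1': [1, 5, 10, 22, 36, 57], 'col2': [100, 450, 1200, 2050, 3300, 6000]}
df = pd.DataFrame(data=d)
d2 = {'col2': [100, 200, 450, 560, 900, 1200, 1450, 1800, 2050, 2600, 3300, 5000, 6000]}
df2 = pd.DataFrame(data=d2)

model = LinearRegression()
model.fit(df[['col2']], df['col1'])         # train on the known (col2, col1) pairs
df2['col1'] = model.predict(df2[['col2']])  # predict col1 for every col2 in df2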
Here's my code with 2 dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
    np.array([[1, 2, 3, 5, 2], [2, 2, 3, 5, 2], [3, 2, 3, 5, 2], [10, 2, 3, 5, 2]]),
    columns=['ID', 'itemX_2', 'itemK_3', 'itemC_5', 'itemH_2'])
df2 = pd.DataFrame(
    np.array([[1, 1, 1, 2, 2, 2, 3, 3, 3, 10, 10, 10],
              [2, 3, 5, 2, 3, 5, 2, 3, 5, 2, 3, 5],
              [20, 40, 60, 80, 100, 200, 220, 240, 260, 500, 505, 520]]).T,
    columns=['ID', 'Item_id', 'value_to_assign'])
Based on df2 I want to modify df1
Expected output:
df_expected_output = pd.DataFrame(
    np.array([[1, 20, 40, 60, 20], [2, 80, 100, 200, 80],
              [3, 220, 240, 260, 220], [10, 500, 505, 520, 500]]),
    columns=['ID', 'itemX_2', 'itemK_3', 'itemC_5', 'itemH_2'])
I have done it by iterating over the columns with some operations. However, my real dataframes have many more columns and rows, so it's pretty slow. Does anyone know how to do this in a fast, efficient way? Thanks.
Here is one solution: pivot df2 into a format similar to df1, then fill column by column, matching on the number after the last '_'.
df2_pivot = df2.pivot(index='ID', columns='Item_id', values='value_to_assign').rename_axis(None, axis=1)

df3 = df1.set_index('ID')
for c in df3:
    df3[c] = df2_pivot[int(c.rsplit('_', 1)[-1])]
Or, using a dictionary comprehension for the second part:
df3 = pd.DataFrame({c: df2_pivot[int(c.rsplit('_', 1)[-1])]
                    for c in df1.columns[1:]},
                   index=df1['ID']).reset_index()
output:
>>> df3
ID itemX_2 itemK_3 itemC_5 itemH_2
0 1 20 40 60 20
1 2 80 100 200 80
2 3 220 240 260 220
3 10 500 505 520 500
Another way would be:
Stack the original df whose values are to be replaced.
Grab the index and split the second level to get the value after '_'.
Using pd.Index.map, map these index tuples to values from df2.
Create a dataframe with the mapped values as data and the stacked MultiIndex as index, then unstack.
s = df1.set_index("ID").stack()
i = s.index.map(lambda x: (x[0], x[1].split("_")[1]))
v = i.map(df2.set_index(["ID", df2['Item_id'].map(str)])['value_to_assign'])
out = pd.DataFrame({"value": v}, index=s.index)['value'].unstack().reset_index()
print(out)
print(out)
ID itemX_2 itemK_3 itemC_5 itemH_2
0 1 20 40 60 20
1 2 80 100 200 80
2 3 220 240 260 220
3 10 500 505 520 500
DataFrame.replace
We can use pivot to reshape df2 so that the replace method can substitute the values in df1. This works because each cell of df1 holds an Item_id, which replace maps to the corresponding value_to_assign:
df1.set_index('ID').T.replace(df2.pivot(index='Item_id', columns='ID', values='value_to_assign')).T
itemX_2 itemK_3 itemC_5 itemH_2
ID
1 20 40 60 20
2 80 100 200 80
3 220 240 260 220
10 500 505 520 500
You can iterate over the columns of df1 and perform a pd.merge:
for col in df1.columns:
    if col == 'ID': continue
    df_temp = pd.merge(df1.loc[:, ['ID', col]], df2, on='ID')
    df1[col] = df_temp[df_temp[col] == df_temp['Item_id']]['value_to_assign'].reset_index(drop=True)
output:
ID itemX_2 itemK_3 itemC_5 itemH_2
0 1 20 40 60 20
1 2 80 100 200 80
2 3 220 240 260 220
3 10 500 505 520 500
Say I have a dataframe like so, which I have read in from a file (note: *.ene is a text file):
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to change only the TS column, replacing all of its values with the values from range(0, 751, 125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight regarding the code to do such a thing in a general format.
I used a for loop to store the values into a list:
K = (6 * 125) + 1
m = []
for i in range(0, K, 125):
    m.append(i)
I thought to use .replace like so:
df['TS'] = df['TS'].replace(old_value, m, inplace=True)
but I was not sure what to put in place of old_value to select all the values of the 'TS' column, or whether this would even work as a method.
It's pretty straightforward: if you're replacing all the data, you just need to do
df['TS'] = m
Example:
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90
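Applied to the question's frame, a sketch (assuming filename.ene parses as shown): the assigned sequence must have exactly one value per row, so range(0, 751, 125), which yields 7 values, requires a 7-row frame.
import pandas as pd

df = pd.read_fwf('filename.ene')
ts = list(range(0, 751, 125))  # [0, 125, 250, 375, 500, 625, 750]
assert len(ts) == len(df), 'replacement list must match the number of rows'
df['TS'] = ts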
I want to create a new column in my data frame with multiple if conditions, either adding a value or subtracting a value from the previous row depending on the conditions.
I tried using a lambda function, but I am pretty certain the syntax is wrong, so I don't really have any idea how to solve the problem. df['b'] is just a dummy column for testing the lambda function in place of df['a'].
Maybe as an equation: a_i = t_i * 40 + a_0 until T = 620 is reached; after that it decreases at the same rate, a_i = a_{i-1} - (t_i - t_{i-1}) * 40 (and rises again after T = 300, as the desired output below shows).
import pandas as pd

df = pd.DataFrame({'t': [0, 8, 32, 56, 96, 128],
                   'T': [460, 500, 620, 500, 300, 460],
                   })
df1 = pd.DataFrame({'t': [0, 8, 32, 56, 96, 128],
                    'T': [460, 500, 620, 500, 300, 460],
                    'a': [10000, 10320, 11280, 10320, 8720, 10000]})
print(df)

df.loc[0, 'a'] = 10000
df['a'] = df['t'] * 5 + df.loc[0, 'a']
df.loc[0, 'b'] = 100
# this is not valid Python: each conditional expression needs an else clause
df['b'] = df['T'].apply(lambda x: df['t'] * 5 + df.loc[0, 'b'] if x <= 500
                        df['t'] * 5 + df.loc[0, 'b'] if x <= 620
                        -df['t'] * 5 + df.loc[0, 'b'] if x <= 300)
#df.loc[0,'i']=50
#df['i'] = [5+ df.loc[0,'i'] for x in range(df.shape[0])]
print(df)
print(df1)
I wanted to create something like this:
t T a
0 0 460 10000
1 8 500 10320
2 32 620 11280
3 56 500 10320
4 96 300 8720
5 128 460 10000
If you want to reference multiple columns with apply, you have to write your function for a full row and apply it to the full data frame. For example:
import pandas as pd

df = pd.DataFrame({
    't': [0, 8, 32, 56, 96, 128],
    'T': [460, 500, 620, 500, 300, 460],
})

def a(row):
    a_0 = 10000
    return a_0 + row['t'] + row['T']

df['a'] = df.apply(a, axis=1)
print(df)
Which prints:
t T a
0 0 460 10460
1 8 500 10508
2 32 620 10652
3 56 500 10556
4 96 300 10396
5 128 460 10588
It's not a solution to your question, but you can see that values from both row['t'] and row['T'] are accessible.
However, in your case, I think you'd better just use a for loop until 620 is reached, and then further loops for the remaining segments:
import pandas as pd

df = pd.DataFrame({
    't': [0, 8, 32, 56, 96, 128],
    'T': [460, 500, 620, 500, 300, 460],
})

a_0 = 10000
# rise at 40 per unit of t until T == 620
for i in range(len(df)):
    df.loc[i, 'a'] = df['t'][i] * 40 + a_0
    if df['T'][i] == 620:
        break
# fall at the same rate until T == 300
for i in range(i + 1, len(df)):
    df.loc[i, 'a'] = df['a'][i - 1] - (df['t'][i] - df['t'][i - 1]) * 40
    if df['T'][i] == 300:
        break
# rise again until T == 460
for i in range(i + 1, len(df)):
    df.loc[i, 'a'] = df['a'][i - 1] + (df['t'][i] - df['t'][i - 1]) * 40
    if df['T'][i] == 460:
        break
print(df)
Which prints:
t T a
0 0 460 10000.0
1 8 500 10320.0
2 32 620 11280.0
3 56 500 10320.0
4 96 300 8720.0
5 128 460 10000.0
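For larger frames, the same three-phase rule can be vectorized. The following is a sketch of one possible way (my own rephrasing of the rule, not part of the answer above): flip the sign of the rate after the rows where T hits 620 and 300, then take a cumulative sum.
import numpy as np
import pandas as pd

df = pd.DataFrame({'t': [0, 8, 32, 56, 96, 128],
                   'T': [460, 500, 620, 500, 300, 460]})

# -1 for rows after T == 620, +1 again after T == 300, 0 = "no change"
flip = np.select([df['T'].shift() == 620, df['T'].shift() == 300], [-1, 1], default=0)
# carry the last non-zero sign forward, starting at +1 (rising)
sign = pd.Series(flip).replace(0, np.nan).ffill().fillna(1)
# accumulate the signed rate of 40 per unit of t
df['a'] = 10000 + (sign * df['t'].diff().fillna(0) * 40).cumsum()
print(df)  # matches the expected output: 10000, 10320, 11280, 10320, 8720, 10000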
I am concatenating two Pandas dataframes as below.
import numpy as np
import pandas as pd

part1 = pd.DataFrame({'id': [100, 200, 300, 400, 500],
                      'amount': np.random.randn(5)})
part2 = pd.DataFrame({'id': [700, 100, 800, 500, 300],
                      'amount': np.random.randn(5)})
concatenated = pd.concat([part1, part2], axis=0)
amount id
0 -0.458653 100
1 2.172348 200
2 0.072494 300
3 -0.253939 400
4 -0.061866 500
0 -1.187505 700
1 -0.810784 100
2 0.321881 800
3 -1.935284 500
4 -1.351507 300
How can I limit the operation so that a row in part2 is only included in concatenated if the row id does not already appear in part1? In a way, I want to treat the id column like a set.
Is it possible to do this during concat() or is this more a post-processing step?
Desired output for this example would be:
concatenated_desired
amount id
0 -0.458653 100
1 2.172348 200
2 0.072494 300
3 -0.253939 400
4 -0.061866 500
0 -1.187505 700
2 0.321881 800
Call drop_duplicates() after concat(); it keeps the first occurrence of each id, so rows from part1 win:
part1 = pd.DataFrame({'id': [100, 200, 300, 400, 500],
                      'amount': np.arange(5)})
part2 = pd.DataFrame({'id': [700, 100, 800, 500, 300],
                      'amount': np.random.randn(5)})
concatenated = pd.concat([part1, part2], axis=0)
print(concatenated.drop_duplicates(subset='id'))
Calculate the ids not in part1:
In [28]:
diff = part2.loc[~part2['id'].isin(part1['id'])]
diff
Out[28]:
amount id
0 -2.184038 700
2 -0.070749 800
Now concat:
In [29]:
concatenated = pd.concat([part1, diff], axis=0)
concatenated
Out[29]:
amount id
0 -2.240625 100
1 -0.348184 200
2 0.281050 300
3 0.082460 400
4 -0.045416 500
0 -2.184038 700
2 -0.070749 800
You can also put this in a one-liner:
concatenated = pd.concat([part1, part2.loc[~part2['id'].isin(part1['id'])]], axis=0)
If you have a column with an id, use it as the index: performing manipulations with a real index will make things easier. Here you can use combine_first, which does what you are searching for (it keeps part1's value wherever both frames share an id):
part1 = part1.set_index('id')
part2 = part2.set_index('id')
part1.combine_first(part2)
Out[38]:
amount
id
100 1.685685
200 -1.895151
300 -0.804097
400 0.119948
500 -0.434062
700 0.215255
800 -0.031562
If you'd rather not keep that index, reset it afterwards:
part1.combine_first(part2).reset_index()
Out[39]:
id amount
0 100 1.685685
1 200 -1.895151
2 300 -0.804097
3 400 0.119948
4 500 -0.434062
5 700 0.215255
6 800 -0.031562