Sum based on a value in dataframe - python

I need to sum the "Runs" column when MatchN is x and B is between i and j.
MatchN I B Runs
1000887 1 0.1 1
1000887 1 0.2 3
1000887 1 0.3 0
1000887 1 0.4 2
1000887 1 0.5 1
I tried using for loop but not able to crack it so far. Any suggestions?

You can first use a filter, and then sum the Runs column of the result, like:
df[(df['MatchN'] == x) & (i <= df['B']) & (df['B'] <= j)]['Runs'].sum()
#  \____________________________________________________/\______/\____/
#                            v                               v      v
#                       filter part                        column   sum
So the filter part is the logical AND of three conditions:
df['MatchN'] == x;
i <= df['B']; and
df['B'] <= j.
We use the & operator to combine the three filters. Next we select the matching rows with df[<filter-condition>] (where <filter-condition> is the filter discussed above).
Next we select the Runs column of the filtered rows, and then finally we calculate the .sum() of that column.
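For completeness, a small self-contained sketch with the sample data from the question (the values chosen for x, i and j below are assumptions for illustration):
import pandas as pd

df = pd.DataFrame({
    'MatchN': [1000887] * 5,
    'I': [1] * 5,
    'B': [0.1, 0.2, 0.3, 0.4, 0.5],
    'Runs': [1, 3, 0, 2, 1],
})

x, i, j = 1000887, 0.2, 0.4  # assumed example bounds
total = df[(df['MatchN'] == x) & (i <= df['B']) & (df['B'] <= j)]['Runs'].sum()
print(total)  # 3 + 0 + 2 = 5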

You can use query:
x = 1000887
i = 0.2
j = 0.4
df.query('MatchN == @x and @i <= B <= @j')['Runs'].sum()
(the @ prefix lets query reference Python variables defined outside the dataframe)
Output:
5

Related

Pandas - New column based on the value of another column N rows back, when N is stored in a column

I have a pandas dataframe with example data:
idx price lookback
0 5
1 7 1
2 4 2
3 3 1
4 7 3
5 6 1
Lookback can be positive or negative, but I want to use its absolute value as the number of rows back to take the value from.
I am trying to create a new column that contains the value of price from lookback + 1 rows ago, for example:
idx price lookback lb_price
0 5 NaN NaN
1 7 1 NaN
2 4 2 NaN
3 3 1 7
4 7 3 5
5 6 1 3
I started with what felt like the most obvious way; this did not work:
df['lb_price'] = df['price'].shift(df['lookback'].abs() + 1)
I then tried using a lambda; this did not work, but I probably did it wrong:
sbc = lambda c, x: pd.Series(zip(*[c.shift(x + 1)]))
df['lb_price'] = sbc(df['price'], df['lookback'].abs())
I also tried a loop (which was extremely slow, but worked) but I am sure there is a better way:
lookback = np.nan
for i in range(len(df)):
    if df.loc[i, 'lookback']:
        if not np.isnan(df.loc[i, 'lookback']):
            lookback = abs(int(df.loc[i, 'lookback']))
    if not np.isnan(lookback) and (lookback + 1) < i:
        df.loc[i, 'lb_price'] = df.loc[i - (lookback + 1), 'price']
I have seen examples using lambda, df.apply, and perhaps Series.map but they are not clear to me as I am quite a novice with Python and Pandas.
I am looking for the fastest way I can do this, if there is a way without using a loop.
Also, for what it's worth, I plan to use this computed column to create yet another column, which I can do as follows:
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']
But if I can combine all of it into one really efficient way of doing it, that would be ideal.
Solution!
Several of the provided solutions worked great (thank you!), but all needed some small tweaks to deal with my potential for negative numbers and the fact that it is lookback + 1, not lookback - 1, so I felt it was prudent to post my modifications here.
All of them were significantly faster than my original loop, which took 5m 26s to process my dataset.
I marked the one I observed to be the fastest as accepted, since improving the speed of my loop was the main objective.
Edited Solutions
From Manas Sambare - 41 seconds
df['lb_price'] = df.apply(
    lambda x: df['price'][x.name - (abs(int(x['lookback'])) + 1)]
    if not np.isnan(x['lookback']) and x.name >= (abs(int(x['lookback'])) + 1)
    else np.nan,
    axis=1)
From mannh - 43 seconds
def get_lb_price(row, df):
    if not np.isnan(row['lookback']):
        lb_idx = row.name - (abs(int(row['lookback'])) + 1)
        if lb_idx >= 0:
            return df.loc[lb_idx, 'price']
    return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
From Bill - 18 seconds
lookup_idxs = df.index.values - (abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]
By getting the row's index inside the df.apply() call using row.name, you can look up the 'lb_price' value relative to the row you are currently on.
%time
df.apply(
    lambda x: df['price'][x.name - int(x['lookback'] + 1)]
    if not np.isnan(x['lookback']) and x.name >= x['lookback'] + 1
    else np.nan,
    axis=1)
# > CPU times: user 2 µs, sys: 0 ns, total: 2 µs
# > Wall time: 4.05 µs
FYI: There is an error in your example as idx[5]'s lb_price should be 3 and not 7.
Here is an example which uses a regular function
def get_lb_price(row, df):
    lb_idx = row.name - abs(row['lookback']) - 1
    if lb_idx >= 0:
        return df.loc[lb_idx, 'price']
    else:
        return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
Here's a vectorized version (i.e. no for loops) using numpy array indexing.
lookup_idxs = df.index.values - df['lookback'].values - 1
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df.price.to_numpy()[lookup_idxs[valid_lookups].astype(int)]
print(df)
Output:
price lookback lb_price
idx
0 5 NaN NaN
1 7 1.0 NaN
2 4 2.0 NaN
3 3 1.0 7.0
4 7 3.0 5.0
5 6 1.0 3.0
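If you also want the rate-of-change column from the question, and need the absolute value of lookback as the asker noted, here is a hedged sketch building on the same array indexing (column names as in the question):
import numpy as np

# Sketch, assuming df has 'price' and 'lookback' columns as above
lookup_idxs = df.index.values - (np.abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0

df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]

# Rate-of-change column from the question, computed from the new column
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']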
This solution loops over the values of the column lookback and calculates the index of the wanted value in the column price, which I store in a new array.
The rule is that the lookback value has to be a number and that the wanted index is not smaller than 0.
new = np.zeros(df.shape[0])
price = df.price.values

for i, lookback in enumerate(df.lookback.values):
    # lookback has to be a number and the index is not allowed to be less than 0
    # 0 < i - lookback is equivalent to 0 <= i - (lookback + 1)
    if not np.isnan(lookback) and 0 < i - lookback:
        new[i] = price[int(i - (lookback + 1))]
    else:
        new[i] = np.nan

df['lb_price'] = new

Manual Feature Engineering in Pandas - Mean of 1 Column vs All Other Columns

Hard to describe this one, but for every column in a dataframe, create a new column that contains the mean of the current column and the one next to it, then the mean of that first column and the next one down the line, and so on. Running Python 3.6.
For example, given a small dataframe of numeric columns a, b and c (shown as an image in the original post), I would like to get the same dataframe with the pairwise-mean columns appended (also shown as an image).
That exact order of the added columns at the end isn't important, but it needs to be able to handle every possible combination of means between all columns, with a depth of 2 (i.e. compare one column to another). Ideally, I would like to have the depth set as a separate variable, so I could have a depth of 3, where it would do this but compare 3 columns to one another.
Ideas? Thanks!
UPDATE
I got this to work, but I'm wondering if there's a more computationally efficient way of doing it. I basically just wrote a nested loop (a loop within a loop) to compare one column against the rest, skipping same-column comparisons:
eng_features = pd.DataFrame()

for col in df.columns:
    for col2 in df.columns:
        # Don't compare same columns, or inversed same columns
        if col == col2 or (str(col2) + '_' + str(col)) in eng_features:
            continue
        else:
            eng_features[str(col) + '_' + str(col2)] = df[[col, col2]].mean(axis=1)

df = pd.concat([df, eng_features], axis=1)
Use itertools, a built-in Python package of iterator utilities:
from itertools import permutations

for col1, col2 in permutations(df.columns, r=2):
    df[f'Mean_of_{col1}-{col2}'] = df[[col1, col2]].mean(axis=1)
and you will get what you need:
a b c Mean_of_a-b Mean_of_a-c Mean_of_b-a Mean_of_b-c Mean_of_c-a \
0 1 1 0 1.0 0.5 1.0 0.5 0.5
1 0 1 0 0.5 0.0 0.5 0.5 0.0
2 1 1 0 1.0 0.5 1.0 0.5 0.5
Mean_of_c-b
0 0.5
1 0.5
2 0.5
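The question also asks for a configurable depth; here is a hedged sketch using itertools.combinations (which avoids duplicate orderings such as a-b vs b-a; the depth variable is an assumption based on the question, not part of the original answer):
from itertools import combinations

depth = 3  # how many columns to average together; 2 reproduces the pairwise case above
base_cols = list(df.columns)  # snapshot so newly added mean columns are not re-combined

for cols in combinations(base_cols, r=depth):
    name = 'Mean_of_' + '-'.join(str(c) for c in cols)
    df[name] = df[list(cols)].mean(axis=1)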

How can I replace values in a column using pandas?

It's my first time using Python and pandas (please help this old man). I have a column with floats, including negative numbers, and I want to replace them based on conditions.
E.g. if the number is between -2 and -1.6, I'll replace it with -2, and so on.
How can I create the conditions (using if/else or otherwise) to modify my column? Thanks a lot.
mean = []
for row in df.values["mean"]:
    if row <= -1.5:
        mean.append(-2)
    elif row <= -0.5 and =-1.4:
        mean.append(-1)
    elif row <= 0.5 and =-0.4:
        mean.append(0)
    else:
        mean.append(1)
df = df.assign(mean=mean)
Doesn't work
create a function defining your conditions and then apply it to your column (I fixed some of your conditionals based on what I thought they should be):
df = pd.read_table('fun.txt')

# create function to apply for value ranges
def labels(x):
    if x <= -1.5:
        return -2
    elif -1.5 < x <= -0.5:
        return -1
    elif -0.5 < x < 0.5:
        return 0
    else:
        return 1

df['mean'] = df['mean'].apply(lambda x: labels(x))  # apply your function to your table
print(df)
another way to apply your function that returns the same result:
df['mean'] = df['mean'].map(labels)
fun.txt:
mean
0
-1.5
-1
-0.5
0.1
1.1
output from above:
mean
0 0
1 -2
2 -1
3 -1
4 0
5 1
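As an alternative to apply, here is a hedged sketch using pandas' pd.cut to bin the values in one vectorized call (note that pd.cut bins are right-inclusive, so the boundary value 0.5 would fall in the 0 bin, slightly different from the function above):
import numpy as np
import pandas as pd

df = pd.read_table('fun.txt')

# Bin edges and the label assigned to each bin
bins = [-np.inf, -1.5, -0.5, 0.5, np.inf]
labels = [-2, -1, 0, 1]
df['mean'] = pd.cut(df['mean'], bins=bins, labels=labels).astype(int)
print(df)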

Column operation conditioned on previous row

I presume similar questions exist, but could not locate them. I have Pandas 0.19.2 installed. I have a large dataframe, and for each row value I want to carry over the previous row's value for the same column based on some logical condition.
Below is a brute-force double for loop solution for a small example. What is the most efficient way to implement this? Is it possible to solve this in a vectorised manner?
import pandas as pd
import numpy as np
np.random.seed(10)
df = pd.DataFrame(np.random.uniform(low=-0.2, high=0.2, size=(10,2) ))
print(df)
for col in df.columns:
    prev = None
    for i, r in df.iterrows():
        if prev is not None:
            if (df[col].loc[i] <= prev * 1.5) and (df[col].loc[i] >= prev * 0.5):
                df[col].loc[i] = prev
        prev = df[col].loc[i]
print(df)
Output:
0 1
0 0.108528 -0.191699
1 0.053459 0.099522
2 -0.000597 -0.110081
3 -0.120775 0.104212
4 -0.132356 -0.164664
5 0.074144 0.181357
6 -0.198421 0.004877
7 0.125048 0.045010
8 0.125048 -0.083250
9 0.125048 0.085830
EDIT: Please note one value can be carried over multiple times, so long as it satisfies the logical condition.
# Compare each value with the one in the row directly above it
prev = df.shift()
# Values inside [0.5 * prev, 1.5 * prev] are replaced by the previous value
replace_mask = (0.5 * prev <= df) & (df <= 1.5 * prev)
df = df.where(~replace_mask, prev)
I came up with this:
keep_going = True
while keep_going:
    df = df.mask((df.diff(1) / df.shift(1) < 0.5) & (df.diff(1) / df.shift(1) > -0.5) & (df.diff(1) / df.shift(1) != 0)).ffill()
    trimming_to_do = ((df.diff(1) / df.shift(1) < 0.5) & (df.diff(1) / df.shift(1) > -0.5) & (df.diff(1) / df.shift(1) != 0)).values.any()
    if not trimming_to_do:
        keep_going = False
which gives the desired result (at least for this case):
print(df)
0 1
0 0.108528 -0.191699
1 0.053459 0.099522
2 -0.000597 -0.110081
3 -0.120775 0.104212
4 -0.120775 -0.164664
5 0.074144 0.181357
6 -0.198421 0.004877
7 0.125048 0.045010
8 0.125048 -0.083250
9 0.125048 0.085830

pandas merge by coordinates

I am trying to merge two pandas tables where I find all rows in df2 which have coordinates close to each row in df1. Example follows.
df1:
x y val
0 0 1 A
1 1 3 B
2 2 9 C
df2:
x y val
0 1.2 2.8 a
1 0.9 3.1 b
2 2.0 9.5 c
desired result:
x y val_x val_y
0 0 1 A NaN
1 1 3 B a
2 1 3 B b
3 2 9 B C c
Each row in df1 can have 0, 1, or many corresponding entries in df2, and matches should be found using the Cartesian (squared) distance condition:
(x1 - x2)^2 + (y1 - y2)^2 < 1
The input dataframes have different sizes, even though they don't in this example. I can get close by iterating over the rows in df1 and finding the close values in df2, but am not sure what to do from there:
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    # ?? What now?
Any help would be very much appreciated. I made this example in an IPython notebook, which you can view/access here: http://nbviewer.ipython.org/gist/anonymous/49a3d821420c04169f02
I found an answer, though I am not really happy with having to loop over the rows in df1. In this case there are only a few hundred rows so I can deal with it, but it won't scale well. Solution:
df2_list = []
df1['merge_row'] = df1.index.values  # Make a column to merge on using the index values

for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    df2_subset['merge_row'] = i  # Add the merge key
    df2_list.append(df2_subset)

df2_found = pd.concat(df2_list)
result = pd.merge(df1, df2_found, on='merge_row', how='left')
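For larger inputs, here is a hedged sketch of a more scalable variant that replaces the Python loop over df1 with a KD-tree radius query from scipy (assumptions: both frames have default integer indexes and numeric x/y columns as in the example, and the radius check is inclusive at exactly 1, unlike the strict < 1 above):
import pandas as pd
from scipy.spatial import cKDTree

# Build a KD-tree on df2's coordinates and find all df2 rows within distance 1 of each df1 row
tree = cKDTree(df2[['x', 'y']].to_numpy())
neighbors = tree.query_ball_point(df1[['x', 'y']].to_numpy(), r=1.0)

# (df1 position, df2 position) pairs for every match inside the radius
pairs = [(i, j) for i, idxs in enumerate(neighbors) for j in idxs]
matches = pd.DataFrame(pairs, columns=['merge_row', 'df2_row'])

# Pull the matched df2 rows and left-merge them back onto df1 (unmatched df1 rows get NaN)
df2_found = df2.iloc[matches['df2_row']].assign(merge_row=matches['merge_row'].to_numpy())
result = pd.merge(df1.assign(merge_row=range(len(df1))), df2_found,
                  on='merge_row', how='left', suffixes=('_x', '_y'))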
