Python dataframe groupby multiple columns with conditional sum - python

I have a df which looks like this:
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
I am grouping the df by col1 and col2. For each member of each group, I want to sum the target values of the other group members whose now date is earlier (before) than the current member's previous date.
For example for:
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
I want to sum the target values of:
col1 col2 now previous target
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
to eventually have:
col1 col2 now previous target sum
A 1 1-1-2015 4-1-2014 0.2 1.8

Interesting problem. I've got something that I think may work, although it is slow: the nested loop is quadratic in the size of each group, so roughly O(n**2) in the worst case (all rows in one group).
Setup data
import pandas as pd
import numpy as np
import io
datastring = io.StringIO(
"""
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
C 1 31-12-2014 4-9-2014 1.9
""")
# arguments for pandas.read_csv
kwargs = {
    "sep": r"\s+",          # whitespace-separated file
    "parse_dates": [2, 3],  # parse "now" and "previous" as dates
}
# read the csv into a pandas dataframe
df = pd.read_csv(datastring, **kwargs)
Pseudo code for algorithm
For each row:
    For each *other* row:
        If "now" of the *other* row comes before "previous" of this row:
            Add the *other* row's "target" to "sum" of this row
Run the algorithm
Start by setting up a function f() that will be applied to each of the groups computed by df.groupby(["col1","col2"]). All f() does is implement the pseudocode above.
def f(df):
    _sum = np.zeros(len(df))
    # represent the desired columns of the sub-dataframe as a numpy array
    data = df[["now", "previous", "target"]].values
    # loop through the rows in the sub-dataframe, df
    for i, outer_row in enumerate(data):
        # for each row, loop through all the rows again
        for j, inner_row in enumerate(data):
            # skip the iteration where the outer loop row is the inner loop row
            if i == j:
                continue
            # get the dates from the rows
            outer_prev = outer_row[1]
            inner_now = inner_row[0]
            # if the "previous" datetime of the outer loop row is greater than
            # the "now" datetime of the inner loop row, add that row's "target"
            # to the cumulative sum
            if outer_prev > inner_now:
                _sum[i] += inner_row[2]
    # add a new column for the "sum" we just calculated
    df["sum"] = _sum
    return df
Now just apply f() over the grouped data.
done = df.groupby(["col1","col2"]).apply(f)
Output
col1 col2 now previous target sum
0 A 1 2015-01-01 2014-04-01 0.20 1.7
1 B 0 2015-02-01 2014-02-05 0.33 0.0
2 A 0 2013-03-01 2011-03-09 0.10 0.0
3 A 1 2014-01-01 2011-04-09 1.70 0.0
4 A 1 2014-12-31 2014-04-09 1.90 1.7
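A faster per-group variant (my sketch, not part of the original answer): sort each group by "now", take a cumulative prefix sum of "target", and look up each row's "previous" with searchsorted. This replaces the O(k**2) double loop with O(k log k) per group and produces the same sums.
def f_fast(g):
    now = g["now"].values
    prev = g["previous"].values
    tgt = g["target"].values
    order = np.argsort(now)
    # prefix sums of "target" in "now" order, with a leading zero
    csum = np.concatenate(([0.0], np.cumsum(tgt[order])))
    # for each row, count how many "now" values fall strictly before its "previous"
    k = np.searchsorted(now[order], prev, side="left")
    s = csum[k]
    # drop the row's own contribution if its own "now" predates its "previous"
    s -= np.where(now < prev, tgt, 0.0)
    g["sum"] = s
    return g
done_fast = df.groupby(["col1", "col2"]).apply(f_fast)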

Related

How to locate and replace values in dataframe based on some criteria

I would like to locate all places where the value in Col2 changes (for example from A to C), and then modify the Col1 value in the row where the change happens (so for A -> C, the value in the same row as C) by adding half of the difference between the current and previous value to the previous value (in this example 1 + (1.5-1)/2 = 1.25).
The output table is the result of applying that replacement across the whole table.
How can I achieve that?
Col1  Col2
1     A
1.5   C
2.0   A
2.5   A
3.0   D
3.5   D
OUTPUT:
Col1  Col2
1     A
1.25  C
1.75  A
2.5   A
2.75  D
3.5   D
Use np.where and a Series holding the values of your formula:
solution = df.Col1.shift() + ((df.Col1 - df.Col1.shift()) / 2)
df['Col1'] = np.where(~df.Col2.eq(df.Col2.shift()), solution.fillna(df.Col1), df.Col1)
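For reference, a minimal runnable sketch of the above (the DataFrame construction is my reconstruction of the question's table):
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'Col1': [1, 1.5, 2.0, 2.5, 3.0, 3.5],
    'Col2': ['A', 'C', 'A', 'A', 'D', 'D'],
})
# previous value plus half the step up to the current value
solution = df.Col1.shift() + ((df.Col1 - df.Col1.shift()) / 2)
# apply it only where Col2 differs from the previous row; keep Col1 otherwise
df['Col1'] = np.where(~df.Col2.eq(df.Col2.shift()), solution.fillna(df.Col1), df.Col1)
# df.Col1 is now 1.0, 1.25, 1.75, 2.5, 2.75, 3.5, matching the expected output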

Calculate perc in Pandas Dataframe based on rows having a specific condition for each distinct value in column

I have a dataframe with sample values as given below
col1  col2
A     ['1','2','er']
A     []
B     ['3','4','ac']
B     ['5']
C     []
For each distinct value in col1, I want to calculate the percentage of its rows whose col2 is a non-empty list.
I am able to do it for a single value in col1; I am looking for a solution that covers all values at once. Thanks for the help.
I believe you need to compare the length of each list with 0 (greater than), convert the resulting booleans to numbers, and then aggregate with mean:
df1 = df['col2'].str.len().gt(0).view('i1').groupby(df['col1']).mean().reset_index(name='%')
print (df1)
col1 %
0 A 0.5
1 B 1.0
2 C 0.0
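A minimal sketch to reproduce this (my reconstruction; it assumes col2 holds actual Python lists, and uses astype instead of the older Series.view, which newer pandas no longer supports):
import pandas as pd
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'B', 'C'],
    'col2': [['1', '2', 'er'], [], ['3', '4', 'ac'], ['5'], []],
})
# True where the list is non-empty, cast to int, then mean per col1 group
df1 = (df['col2'].str.len().gt(0)
         .astype('i1')
         .groupby(df['col1'])
         .mean()
         .reset_index(name='%'))
print(df1)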

Pandas, Python : How to get the max value across columns while also satisfying a second condition

I am trying to select the max value from a set of columns while also satisfying a second condition. The max value here corresponds to pct_change relative to previous row. The second condition corresponds to % contribution of each column value to sum total for that row.
Essentially, I am trying to get the max among columns but only for those columns that satisfy the second condition. I have created an example using the code below.
import pandas as pd
import numpy as np
# Creating series to initialize df
series_1_units = pd.Series(np.array([1,20,25,1,9]))
series_2_units = pd.Series(np.array([1,1,30,25,1]))
series_3_units = pd.Series(np.array([1,1,1,25,30]))
df = pd.DataFrame({'Type1':series_1_units, 'Type2':series_2_units, 'Type3':series_3_units})
# Calculate the % contribution of each type to total units summed across row
df_contrib_to_total = df.div(df.sum(axis=1), axis=0)*100.0
# Calculate % difference to previous row
df_pct_diff = df.pct_change()
# Join the different df to compare
df_all_cols = df.join(df_pct_diff, rsuffix='_Pct_Change')
df_all_cols = df_all_cols.join(df_contrib_to_total,rsuffix='_Contrib')
# A final requirement is setting a threshold that decides whether a given column is to be included or excluded
# This is based on number of units relative to total for each row
# If value below threshold then do not include in max calculation for each week
contribution_threshold = 25.0
contribution_mask = df_contrib_to_total >= contribution_threshold
df_all_cols = df_all_cols.join(contribution_mask, rsuffix='_Contrib_Mask')
# Get the column with the highest Pct_change for each row - get the actual pct_change value as well as the column name responsible for it
df_all_cols['Highest_Pct_Diff'] = df_all_cols.iloc[:,3:6].max(axis=1)
df_all_cols['Type_With_Highest_Pct_Diff'] = df_all_cols.iloc[:,3:6].idxmax(axis=1)
# The above df has an incorrect result in the row corresponding to index = 4:
# the column with the highest pct_diff has a False in its contribution mask.
# The desired result is as below:
# the highest pct_change among columns with a True in the contrib mask is Type3
df_all_cols_desired_result = df_all_cols.copy(deep=True)
df_all_cols_desired_result.iloc[4,12] = 0.2
df_all_cols_desired_result.iloc[4,13] = 'Type3_Pct_Change'
How do I apply multiple conditions to achieve the above?
If you can only take the max over some of the values in each row, then first filter your input dataframe on your second criterion, and then apply your max function to the filtered dataframe:
df_contrib_to_total = df.div(df.sum(axis=1), axis=0)*100.0
contribution_threshold = 25.0
contribution_mask = df_contrib_to_total >= contribution_threshold
df_pct_diff = df[contribution_mask].pct_change()
This gives you NaN values in any position the mask has excluded, so they won't be taken into consideration when calculating the max:
>>> df_pct_diff
Type1 Type2 Type3
0 NaN NaN NaN
1 19.00 NaN NaN
2 0.25 29.000000 NaN
3 NaN -0.166667 24.0
4 NaN NaN 0.2
>>> df_all_cols.iloc[:, 3:6].max(axis=1)
0 NaN
1 19.0
2 29.0
3 24.0
4 0.2
dtype: float64
>>> df_all_cols.iloc[:, 3:6].idxmax(axis=1)
0 NaN
1 Type1_Pct_Change
2 Type2_Pct_Change
3 Type3_Pct_Change
4 Type3_Pct_Change
dtype: object
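To fill the two summary columns from this, one option (a sketch of mine, not from the original answer; it reuses the question's *_Pct_Change naming) is to take max/idxmax on the masked frame directly rather than on positional iloc slices:
df_pct_diff_masked = df[contribution_mask].pct_change()
df_all_cols['Highest_Pct_Diff'] = df_pct_diff_masked.max(axis=1)
# add_suffix makes idxmax return names like 'Type3_Pct_Change'; the all-NaN
# first row yields NaN (newer pandas versions warn about this case)
df_all_cols['Type_With_Highest_Pct_Diff'] = (
    df_pct_diff_masked.add_suffix('_Pct_Change').idxmax(axis=1)
)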

Find index of first row closest to value in pandas DataFrame

So I have a dataframe containing multiple columns. For each column, I would like to get the index of the first row that is nearly equal to a user specified number (e.g. within 0.05 of desired number). The dataframe looks kinda like this:
ix col1 col2 col3
0 nan 0.2 1.04
1 0.98 nan 1.5
2 1.7 1.03 1.91
3 1.02 1.42 0.97
Say I want the first row that is nearly equal to 1.0, I would expect the result to be:
index 1 for col1 (not index 3 even though they are mathematically equally close to 1.0)
index 2 for col2
index 0 for col3 (not index 3 even though 0.97 is closer to 1 than 1.04)
I've tried an approach that makes use of argsort():
df.iloc[(df.col1-1.0).abs().argsort()[:1]]
This would, according to other topics, give me the index of the row in col1 with the value closest to 1.0. However, it returns only a dataframe full of nans. I would also imagine this method does not give the first value close to 1 it encounters per column, but rather the value that is closest to 1.
Can anyone help me with this?
Use DataFrame.sub for the difference, convert to absolute values with abs, compare with lt (<), and finally get the index of the first True with DataFrame.idxmax:
a = df.sub(1).abs().lt(0.05).idxmax()
print (a)
col1 1
col2 2
col3 0
dtype: int64
For a more general solution that also works when the boolean mask fails for a column (no value is within the tolerance), append a row of Trues with the name NaN, so that idxmax returns NaN for such columns:
print (df)
col1 col2 col3
ix
0 NaN 0.20 1.07
1 0.98 NaN 1.50
2 1.70 1.03 1.91
3 1.02 1.42 0.87
s = pd.Series([True] * len(df.columns), index=df.columns, name=np.nan)
a = df.sub(1).abs().lt(0.05).append(s).idxmax()
print (a)
col1 1.0
col2 2.0
col3 NaN
dtype: float64
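Note that DataFrame.append was removed in pandas 2.0; on newer versions the same trick can be written with pd.concat (my adaptation, same result):
s = pd.Series([True] * len(df.columns), index=df.columns, name=np.nan)
a = pd.concat([df.sub(1).abs().lt(0.05), s.to_frame().T]).idxmax()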
Suppose you have some tolerance value tol for the "nearly match" threshold. You can create a masked dataframe that keeps only the values whose difference from the target is below the threshold, and use first_valid_index() on each column to get the index of the first match occurrence.
tol = 0.05
mask = df[(df - 1).abs() < tol]
for col in df:
    print(col, mask[col].first_valid_index())
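If you want the result as a Series rather than printed lines, a small wrapper along these lines works (a sketch of mine; first_close_index, target and tol are hypothetical names, not from the answers):
import pandas as pd
def first_close_index(df, target=1.0, tol=0.05):
    # keep only values within tol of target, NaN elsewhere
    mask = df[(df - target).abs() < tol]
    # first non-NaN index per column (None if a column has no match)
    return pd.Series({col: mask[col].first_valid_index() for col in df})
# first_close_index(df) -> col1: 1, col2: 2, col3: 0 for the question's data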

Splitting array values in dataframe into new dataframe - python

I have a pandas dataframe with a variable that is an array of arrays. I would like to create a new dataframe from this variable.
My current dataframe 'fruits' looks like this...
Id Name Color price_trend
1 apple red [['1420848000','1.25'],['1440201600','1.35'],['1443830400','1.52']]
2 lemon yellow [['1403740800','0.32'],['1422057600','0.25']]
What I would like is a new dataframe from the 'price_trend' column that looks like this...
Id date price
1 1420848000 1.25
1 1440201600 1.35
1 1443830400 1.52
2 1403740800 0.32
2 1422057600 0.25
Thanks for the advice!
A groupby+apply should do the trick.
import pandas as pd

def f(group):
    # each Id appears once in the question's data, so take the group's first row
    row = group.iloc[0]  # the original answer used group.irow(0), removed in modern pandas
    ids = [row['Id'] for v in row['price_trend']]
    dates = [v[0] for v in row['price_trend']]
    prices = [v[1] for v in row['price_trend']]
    return pd.DataFrame({'Id': ids, 'date': dates, 'price': prices})
In[7]: df.groupby('Id', group_keys=False).apply(f)
Out[7]:
Id date price
0 1 1420848000 1.25
1 1 1440201600 1.35
2 1 1443830400 1.52
0 2 1403740800 0.32
1 2 1422057600 0.25
Edit:
To filter out bad data (for instance, a price_trend value of [['None']]), one option is to use pandas boolean indexing.
criterion = df['price_trend'].map(lambda x: len(x) > 0 and all(len(pair) == 2 for pair in x))
df[criterion].groupby('Id', group_keys=False).apply(f)
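On pandas 0.25+ there is also a shorter route via DataFrame.explode. This is my sketch, not part of the original answer; it assumes the question's fruits dataframe with real lists in price_trend:
out = fruits[['Id', 'price_trend']].explode('price_trend').dropna(subset=['price_trend'])
# each remaining cell is a [date, price] pair; split it into two columns
out[['date', 'price']] = pd.DataFrame(out['price_trend'].tolist(), index=out.index)
out = out.drop(columns='price_trend').reset_index(drop=True)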
