How to locate and replace values in a dataframe based on some criteria - python

I would like to locate every place where the value in Col2 changes (for example, a change from A to C) and then modify the value in Col1 in the row where the change happens (so for A -> C, the value in the same row as C) by adding half of the difference between the current and the previous value (in this example: 1 + (1.5 - 1)/2 = 1.25).
The output table is the result of replacing all such occurrences in the whole table.
How can I achieve that?
Col1  Col2
1     A
1.5   C
2.0   A
2.5   A
3.0   D
3.5   D
OUTPUT:
Col1  Col2
1     A
1.25  C
1.75  A
2.5   A
2.75  D
3.5   D

Use np.where together with a Series holding the values of your formula:
solution = df.Col1.shift() + ((df.Col1 - df.Col1.shift()) / 2)
df['Col1'] = np.where(~df.Col2.eq(df.Col2.shift()), solution.fillna(df.Col1), df.Col1)
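For context, here is a minimal runnable sketch of the same idea on the question's sample data (the DataFrame constructor below is an assumption about how the input is built):

import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({"Col1": [1.0, 1.5, 2.0, 2.5, 3.0, 3.5],
                   "Col2": ["A", "C", "A", "A", "D", "D"]})

# midpoint between the previous and the current Col1 value
solution = df.Col1.shift() + (df.Col1 - df.Col1.shift()) / 2

# replace Col1 only in rows where Col2 differs from the previous row;
# the first row has no previous value, so fillna keeps its original Col1
df["Col1"] = np.where(~df.Col2.eq(df.Col2.shift()),
                      solution.fillna(df.Col1),
                      df.Col1)

print(df)   # Col1 becomes 1.0, 1.25, 1.75, 2.5, 2.75, 3.5 as in the OUTPUT table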

Related

Group by and apply sum, divide, and round functions to a single column along with aggregators on other columns

I have a dataframe like this
df1 = pd.DataFrame([['a',1,100],['b',2,300],['c',3,400]],columns = ['col1','col2','col3'])
This is how I currently produce the required output:
summary_df = df1.groupby('col1').agg({'col2':'sum','col3':'sum'}).reset_index() # line1
summary_df['col3'] = round(summary_df['col3']/1000, 2)
Can we do the division and rounding in line1 itself? I have more columns to handle like that, so adding a line for each column is not a good idea.
You can also pass a lambda function as the aggregator and perform column-specific functions for a particular column:
>>> df1.groupby('col1').agg({'col2':'sum','col3':lambda x:round(x.sum()/1000,2)})
      col2  col3
col1
a        1   0.1
b        2   0.3
c        3   0.4
If you need to apply the same function more than once, it's better to create a normal function and use it for multiple columns instead of using a lambda:
def func(x):
    return round(x.sum()/1000, 2)

df1.groupby('col1').agg({'col2':'sum','col3':func})
      col2  col3
col1
a        1   0.1
b        2   0.3
c        3   0.4
Yes, you can do that using assign:
summary_df = (df1.groupby('col1')
                 .agg({'col2':'sum','col3':'sum'})
                 .reset_index()
                 .assign(col3=lambda x: round(x['col3']/1000, 2)))  # line1
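Since the question mentions having many more columns to treat this way, one option is to build the agg dictionary programmatically instead of writing a line per column. This is only a sketch; scaled_cols is a hypothetical list of the columns that need the divide-and-round treatment:

import pandas as pd

df1 = pd.DataFrame([['a', 1, 100], ['b', 2, 300], ['c', 3, 400]],
                   columns=['col1', 'col2', 'col3'])

# hypothetical: columns to be summed, divided by 1000 and rounded
scaled_cols = ['col3']

agg_spec = {'col2': 'sum'}
agg_spec.update({c: (lambda x: round(x.sum() / 1000, 2)) for c in scaled_cols})

summary_df = df1.groupby('col1').agg(agg_spec).reset_index()
print(summary_df)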

Calculate percentage in a Pandas DataFrame based on rows having a specific condition for each distinct value in a column

I have a dataframe with sample values as given below
col1  col2
A     ['1', '2', 'er']
A     []
B     ['3', '4', 'ac']
B     ['5']
C     []
I want to calculate, for each value in col1, the percentage of its rows whose col2 is not an empty list.
I am able to do it for a single value in col1, but I am looking for a solution that does this for every value at once. Thanks for the help.
I believe you need to compare the lengths of the lists with 0, convert the result to numbers, and then aggregate with mean:
df1 = df['col2'].str.len().gt(0).view('i1').groupby(df['col1']).mean().reset_index(name='%')
print (df1)
col1 %
0 A 0.5
1 B 1.0
2 C 0.0
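As a side note, Series.view is deprecated in recent pandas releases; since the mean of a boolean Series already gives the fraction of True values, a sketch of an equivalent version (assuming the sample data above) is:

import pandas as pd

# sample data from the question
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'B', 'C'],
                   'col2': [['1', '2', 'er'], [], ['3', '4', 'ac'], ['5'], []]})

# True where the list in col2 is non-empty; the per-group mean of this
# boolean Series is the fraction of non-empty rows
non_empty = df['col2'].str.len().gt(0)
df1 = non_empty.groupby(df['col1']).mean().reset_index(name='%')
print(df1)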

pandas groupby where and else

I have a dataframe like this:
col1 col2
0 a 100
1 a 200
2 a 150
3 b 1000
4 c 400
5 c 200
What I want to do is group by col1 and count the number of occurrences; if the count is equal to or greater than 2, calculate the mean of col2 for those rows, and if not, apply another function. The output should be:
col1 mean
0 a 150
1 b whatever aggregator function returns
2 c 300
I followed @ansev's solution in pandas groupby count and then conditional mean, however I don't want to replace the values with NaN; I actually want to replace them with a value returned from another function like this:
def aggregator(col1, col2):
    return col1 + col2
Please keep in mind that the actual aggregator function is more complicated and depends on other tables; this one is just to simplify the question.
I'm not sure this is what you need, but you can resort to apply:
def aggregator(x):
    if len(x) == 1:
        return pd.Series((x['col1'] + x['col2'].astype(str)).values)
    else:
        return pd.Series(x['col2'].mean())

df.groupby('col1').apply(aggregator)
Output:
          0
col1
a       150
b     b1000
c       300
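For reference, a minimal sketch that rebuilds the sample data and runs the aggregator above; note that recent pandas versions warn about (and newer ones exclude) the grouping column inside apply, so accessing x['col1'] there may need adjusting:

import pandas as pd

# sample data from the question
df = pd.DataFrame({'col1': ['a', 'a', 'a', 'b', 'c', 'c'],
                   'col2': [100, 200, 150, 1000, 400, 200]})

def aggregator(x):
    # singleton group: fall back to the custom function (here col1 + col2 as text)
    if len(x) == 1:
        return pd.Series((x['col1'] + x['col2'].astype(str)).values)
    # two or more rows: the mean of col2
    return pd.Series(x['col2'].mean())

print(df.groupby('col1').apply(aggregator))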

Find index of first row closest to value in pandas DataFrame

So I have a dataframe containing multiple columns. For each column, I would like to get the index of the first row that is nearly equal to a user-specified number (e.g. within 0.05 of the desired number). The dataframe looks kinda like this:
ix col1 col2 col3
0 nan 0.2 1.04
1 0.98 nan 1.5
2 1.7 1.03 1.91
3 1.02 1.42 0.97
Say I want the first row that is nearly equal to 1.0, I would expect the result to be:
index 1 for col1 (not index 3 even though they are mathematically equally close to 1.0)
index 2 for col2
index 0 for col3 (not index 3 even though 0.97 is closer to 1 than 1.04)
I've tried an approach that makes use of argsort():
df.iloc[(df.col1-1.0).abs().argsort()[:1]]
This would, according to other topics, give me the index of the row in col1 with the value closest to 1.0. However, it returns only a dataframe full of nans. I would also imagine this method does not give the first value close to 1 it encounters per column, but rather the value that is closest to 1.
Can anyone help me with this?
Use DataFrame.sub for the difference, convert to absolute values with abs, compare with lt (<), and finally get the index of the first matching value with DataFrame.idxmax:
a = df.sub(1).abs().lt(0.05).idxmax()
print (a)
col1 1
col2 2
col3 0
dtype: int64
But for a more general solution, which also works if the boolean mask has no True for a column (no value is within the tolerance), append a row filled with Trues and labelled NaN, so idxmax returns NaN for such columns:
print (df)
    col1  col2  col3
ix
0    NaN  0.20  1.07
1   0.98   NaN  1.50
2   1.70  1.03  1.91
3   1.02  1.42  0.87
s = pd.Series([True] * len(df.columns), index=df.columns, name=np.nan)
a = df.sub(1).abs().lt(0.05).append(s).idxmax()
print (a)
col1 1.0
col2 2.0
col3 NaN
dtype: float64
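Note that DataFrame.append was removed in pandas 2.0; on current versions the same sentinel-row trick can be written with pd.concat (a sketch reusing the df from above):

import numpy as np
import pandas as pd

mask = df.sub(1).abs().lt(0.05)
# one extra all-True row labelled NaN: columns with no match fall through to it
sentinel = pd.DataFrame([[True] * len(df.columns)],
                        columns=df.columns, index=[np.nan])
a = pd.concat([mask, sentinel]).idxmax()
print(a)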
Suppose you have some tolerance value tol for the nearly-equal threshold. You can create a mask dataframe for values below the threshold and use first_valid_index() on each column to get the index of the first match occurrence.
tol = 0.05
mask = df[(df - 1).abs() < tol]
for col in df:
    print(col, mask[col].first_valid_index())
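Putting that together on the question's sample data (a sketch; per the expected result, the first hits are index 1, 2 and 0):

import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({"col1": [np.nan, 0.98, 1.70, 1.02],
                   "col2": [0.20, np.nan, 1.03, 1.42],
                   "col3": [1.04, 1.50, 1.91, 0.97]})

tol = 0.05
mask = df[(df - 1).abs() < tol]   # keep only values within tol of 1.0
# first index per column that is within tolerance (None if there is no match)
first_hits = {col: mask[col].first_valid_index() for col in df}
print(first_hits)                 # {'col1': 1, 'col2': 2, 'col3': 0}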

Python dataframe groupby multiple columns with conditional sum

I have a df which looks like this:
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
I am grouping the df by col1 and col2, and for each member of each group I want to sum the target values of only those other group members whose now date is smaller than (i.e. before) the current member's previous date.
For example for:
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
I want to sum the target values of:
col1 col2 now previous target
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
to eventually have:
col1 col2 now previous target sum
A 1 1-1-2015 4-1-2014 0.2 1.8
Interesting problem, I've got something that I think may work, although with slow time complexity: worst case O(n**3) and best case O(n**2).
Setup data
import pandas as pd
import numpy as np
import io

datastring = io.StringIO(
"""
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
C 1 31-12-2014 4-9-2014 1.9
""")

# arguments for pandas.read_csv
kwargs = {
    "sep": r"\s+",          # specifies that it's a space separated file
    "parse_dates": [2, 3],  # parse "now" and "previous" as dates
}

# read the csv into a pandas dataframe
df = pd.read_csv(datastring, **kwargs)
Pseudo code for algorithm
For each row:
    For each *other* row:
        If the "now" of the *other* row comes before the "previous" of this row,
            then add the *other* row's "target" to the "sum" of this row
Run the algorithm
First start by setting up a function f() that is to be applied over all the groups computed by df.groupby(["col1","col2"]). All that f() does is try to implement the pseudo code above.
def f(df):
    _sum = np.zeros(len(df))
    # represent the desired columns of the sub-dataframe as a numpy object
    data = df[["now","previous","target"]].values
    # loop through the rows in the sub-dataframe, df
    for i, outer_row in enumerate(data):
        # for each row, loop through all the rows again
        for j, inner_row in enumerate(data):
            # skip iteration if outer loop row is equal to the inner loop row
            if i == j: continue
            # get the dates from rows
            outer_prev = outer_row[1]
            inner_now = inner_row[0]
            # if the "previous" datetime of the outer loop is greater than
            # the "now" datetime of the inner loop, then add "target" to
            # the cumulative sum
            if outer_prev > inner_now:
                _sum[i] += inner_row[2]
    # add a new column for this new "sum" that we calculated
    df["sum"] = _sum
    return df
Now just apply f() over the grouped data.
done = df.groupby(["col1","col2"]).apply(f)
Output
col1 col2 now previous target sum
0 A 1 2015-01-01 2014-04-01 0.20 1.7
1 B 0 2015-02-01 2014-02-05 0.33 0.0
2 A 0 2013-03-01 2011-03-09 0.10 0.0
3 A 1 2014-01-01 2011-04-09 1.70 0.0
4 A 1 2014-12-31 2014-04-09 1.90 1.7
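If the double Python loop becomes a bottleneck on larger groups, a possible vectorized variant of f() is sketched below. It is not part of the original answer; it performs the same O(n**2) comparisons per group, but via NumPy broadcasting instead of nested Python loops, reusing the df and imports from the setup above:

def f_vectorized(g):
    now = g["now"].to_numpy()                  # shape (n,)
    prev = g["previous"].to_numpy()[:, None]   # shape (n, 1), broadcasts against now
    target = g["target"].to_numpy()
    not_self = ~np.eye(len(g), dtype=bool)     # exclude each row's comparison with itself
    # (prev > now) is an (n, n) matrix; row i marks the other rows whose "now"
    # is before row i's "previous"; the matrix product sums their targets
    return g.assign(sum=((prev > now) & not_self) @ target)

done = df.groupby(["col1", "col2"]).apply(f_vectorized)
print(done)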
