Sum of values based on a condition in pandas - python

        ID    change   SX  Supresult
0    UNITY       NaN    0        NaN
1    UNITY -0.009434  100  -0.015283 (P1)
2    UNITY  0.003463    0        NaN
3  TRINITY  0.008628  100  -0.043363
4  TRINITY -0.027374  100   0.008423 (P2)
5  TRINITY -0.011002    0        NaN
6  TRINITY -0.004987  100        NaN
7  TRINITY  0.007566    0        NaN
I use the following program, which creates a new column 'Supresult' wherever 'SX' equals 100. The new column stores the sum of the NEXT three 'change' values. For instance, at index 1 the Supresult is the sum of the change values at indices 2, 3 and 4.
df['Supresult'] = df[df.SX == 100].index.to_series().apply(lambda x: df.change.shift(-1).iloc[x: x + 3].sum())
However, I am facing two problems that I need assistance with:
(P1): I want the sum to be 'ID'-specific. For instance, the result at index 1 currently takes the sum of one value from UNITY and two from TRINITY. The sum should only be made while it stays within the same 'ID'. I tried adding .groupby('ID') at the end of my code, but it gave a KeyError.
(P2): Since index 3 already sums the next three change values, index 4 shouldn't go on to sum the next three days. The next sum should only be taken once the previous calculation period is complete, i.e. from index 6 onwards.
Intended result:
        ID    change   SX  Supresult
0    UNITY       NaN    0        NaN
1    UNITY -0.009434  100        NaN
2    UNITY  0.003463    0        NaN
3  TRINITY  0.008628  100  -0.043363
4  TRINITY -0.027374  100        NaN
5  TRINITY -0.011002    0        NaN
6  TRINITY -0.004987  100        NaN
7  TRINITY  0.007566    0        NaN
A little help would be appreciated, thanks!

Given your complex requirements, I think a loop is appropriate:
import numpy as np

# If your data frame is not indexed sequentially, this will make it so.
# The algorithm needs the frame to be indexed 0, 1, 2, ...
df.reset_index(drop=True, inplace=True)

# Every row starts off in "unconsumed" state
consumed = np.zeros(len(df), dtype=int)
result = np.full(len(df), np.nan)

for i, sx in df['SX'].items():
    # The next three rows (positional, end-exclusive)
    nxt = df.iloc[i + 1:i + 4]
    # A row is considered a match if:
    #   * it has SX == 100,
    #   * there are three following rows, all with the same ID as row i,
    #   * it is not involved in a previous summation.
    match = (
        sx == 100
        and len(nxt) == 3
        and (nxt['ID'] == df.loc[i, 'ID']).all()
        and consumed[i] == 0
    )
    if match:
        consumed[i + 1:i + 4] = 1
        result[i] = nxt['change'].sum()

df['Supresult'] = result
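On the sample frame above this reproduces the intended result: index 1 fails the same-ID test (its next three rows span UNITY and TRINITY), index 4 falls inside the window consumed by index 3, and index 6 has fewer than three following rows, so only index 3 receives a sum.
print(df)
#         ID    change   SX  Supresult
# 0    UNITY       NaN    0        NaN
# 1    UNITY -0.009434  100        NaN
# 2    UNITY  0.003463    0        NaN
# 3  TRINITY  0.008628  100  -0.043363
# 4  TRINITY -0.027374  100        NaN
# 5  TRINITY -0.011002    0        NaN
# 6  TRINITY -0.004987  100        NaN
# 7  TRINITY  0.007566    0        NaN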

Related

Pandas sum last n rows of df.count() results into one row

I am looking for a way to generate nice summary statistics of a dataframe. Consider the following example:
>>> df = pd.DataFrame({"category": ['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
>>> df['category'].value_counts()
z    4
x    4
y    3
u    2
v    1
w    1
>>> ??
           count  pct
z              4  27%
x              4  27%
y              3  20%
Other (3)      4  27%
The result sums up the value counts of the last n=3 rows, deletes them, and then appends them as one row to the original value counts. It would also be nice to have everything as percentages. Any ideas how to implement this? Cheers!
For a DataFrame with percentages, use Series.iloc for positional indexing, create a DataFrame with Series.to_frame, then add the new row and a new column filled with percentages:
s = df['category'].value_counts()
n = 3
out = s.iloc[:-n].to_frame('count')
out.loc[f'Other ({n})'] = s.iloc[-n:].sum()
out['pct'] = out['count'].div(out['count'].sum()).apply(lambda x: f"{x:.0%}")
print(out)
           count  pct
z              4  27%
x              4  27%
y              3  20%
Other (3)      4  27%
I would use tail(-3) to get all the values except for the first 3:
counts = df['category'].value_counts()
others = counts.tail(-3)
counts[f'Others ({len(others)})'] = others.sum()
counts.drop(others.index, inplace=True)
counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).mul(100).round())
Output:
            count   pct
z               4  27.0
x               4  27.0
y               3  20.0
Others (3)      4  27.0
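If the pct column should render as percent strings like in the expected output, a small variation of the last line (same chain, just formatting with an f-string instead of rounding) would be:
counts.to_frame(name='count').assign(pct=lambda d: d['count'].div(d['count'].sum()).apply(lambda x: f"{x:.0%}"))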
This snippet
df = pd.DataFrame({"category": ['u','v','w','u','y','z','y','z','x','x','y','z','x','z','x']})
cutoff_index = 3
category_counts = pd.DataFrame(
    [df['category'].value_counts(), df['category'].value_counts(normalize=True)],
    index=["Count", "Percent"],
).T.reset_index()
other_rows = category_counts[cutoff_index:].set_index("index")
category_counts = category_counts[:cutoff_index].set_index("index")
summary_table = pd.concat(
    [category_counts,
     pd.DataFrame(other_rows.sum(), columns=[f"Other ({len(other_rows)})"]).T]
)
summary_table = summary_table.astype({'Count': 'int'})
summary_table['Percent'] = summary_table['Percent'].apply(lambda x: "{0:.2f}%".format(x * 100))
print(summary_table)
will give you what you need. Also in a nice format;)
           Count  Percent
z              4   26.67%
x              4   26.67%
y              3   20.00%
Other (3)      4   26.67%

More efficient function to find the next change in a dataframe column with time series data

I have a large dataframe with a price column that stays at the same value as time increases, then changes value and stays at that new value for a while before going up or down. I want to write a function that looks at the price column and creates a new column called "next movement" that indicates whether the next movement of the price column will be up or down.
For example, if the price column looked like [1,1,1,2,2,2,4,4,4,3,3,3,4,4,4,2,1], then the next movement column should be [1,1,1,1,1,1,0,0,0,1,1,1,0,0,0,0,-1], with 1 representing the next movement being up, 0 representing the next movement being down, and -1 representing unknown.
def make_next_movement_column(DataFrame, column):
    DataFrame["next movement"] = -1
    for i in range(DataFrame.shape[0]):
        for j in range(i + 1, DataFrame.shape[0]):
            if DataFrame[column][j] > DataFrame[column][i]:
                DataFrame["next movement"][i:j] = 1
                break
            if DataFrame[column][j] < DataFrame[column][i]:
                DataFrame["next movement"][i:j] = 0
                break
        i = j - 1
    return DataFrame
I wrote this function and it does work, but the problem is that it is horribly inefficient. I was wondering if there is a more efficient way to write it.
This answer doesn't seem to work, because the diff method only looks at the next row, but I want to find the next movement no matter how far away it is.
Annotated code:
# Calculate the diff between each row and the next (current minus next)
s = df['column'].diff(-1)
# Broadcast the next nonzero diff backwards over each flat stretch
s = s.mask(s == 0).bfill()
# Select 1 (up) where the next change is an increase, 0 (down) where it
# is a decrease, and fall back to -1 where no further change exists
df['next_movement'] = np.select([s < 0, s > 0], [1, 0], -1)
Result
column next_movement
0 1 1
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 4 0
7 4 0
8 4 0
9 3 1
10 3 1
11 3 1
12 4 0
13 4 0
14 4 0
15 2 0
16 1 -1
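For reference, here is a self-contained version of the approach (a sketch, using the example list from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'column': [1,1,1,2,2,2,4,4,4,3,3,3,4,4,4,2,1]})

s = df['column'].diff(-1)      # current value minus the next value
s = s.mask(s == 0).bfill()     # propagate the next nonzero diff backwards
df['next_movement'] = np.select([s < 0, s > 0], [1, 0], -1)

print(df['next_movement'].tolist())
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, -1]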

Create a binary matrix after comparing columns' values in a dataframe

The text is long but the question is simple!
I have two dataframes that bring different information about my variables, and I need to create a binary matrix as my output after following some steps.
Let's say my dataframes are these:
market_values = pd.DataFrame({'variableA': (1,2.0,3), 'variableB': (np.nan,2,np.nan), 'variableC': (9,10,15), 'variableD' : (18,25,43),'variableE':(36,11,12),'variableF':(99,10,98), 'variableG': (42,19,27)})
variableA variableB variableC variableD variableE variableF variableG
0 1.0 NaN 9 18 36 99 42
1 2.0 2.0 10 25 11 10 19
2 3.0 NaN 15 43 12 98 27
negociation_values = pd.DataFrame({'variableA': (0.1,0.2,0.3), 'variableB': (0.5,np.nan,0.303), 'variableC': (0.9,0.10,0.4), 'variableD' : (0.12,0.11,0.09),'variableE':(np.nan,0.13,0.21),'variableF':(0.14,np.nan,0.03), 'variableG': (1.4,0.134,0.111)})
variableA variableB variableC variableD variableE variableF variableG
0 0.1 0.500 0.9 0.12 NaN 1.4 0.141
1 0.2 NaN 0.1 0.11 0.13 NaN 0.134
2 0.3 0.303 0.4 0.09 0.21 0.03 0.111
And I need to follow these steps:
1 - Check if two columns in my 'market_values' df have at least one value that is equal (for the same row).
2 - If a pair of columns has one value that is equal (for the same row), then I need to compare these same columns in my 'negociation_values' df.
3 - Then I have to discover which variable has the higher negociation value (for a given row).
4 - Finally I need to create a binary matrix. For each pair of variables with an equal market value, I'll put 1 where the negociation value is higher and 0 for the other. If a column doesn't have an equal value with another column, I'll just put 1 for the entire column.
The desired output matrix will be like:
variableA variableB variableC variableD variableE variableF variableG
0 0 1 0 1 1 1 1
1 1 0 1 1 1 0 1
2 0 1 1 1 1 0 1
The main difficulty is at steps 3 and 4.
I've done steps 1 and 2 so far; my code is below:
arr = market_values.to_numpy()
is_equal = ((arr == arr[None].T).any(axis=1))
is_equal[np.tril_indices_from(is_equal)] = False
inds_of_same_cols = [*zip(*np.where(is_equal))]
equal_cols = [market_values.columns[list(inds)].tolist() for inds in inds_of_same_cols]
print(equal_cols)
-----------------
[['variableA', 'variableB'], ['variableC', 'variableF']]
h = []
for i in equal_cols:
op = pd.DataFrame(negociation_values[i])
h.append(op)
print(h)
-------
[ variableA variableB
0 0.1 0.500
1 0.2 NaN
2 0.3 0.303,
variableC variableF
0 0.9 0.14
1 0.1 NaN
2 0.4 0.03]
The code above returns the negociation values for the columns that have at least one equal value in the market values df.
Unfortunately, I don't know where to go from here. I need to write code that says something like: "If variableA > variableB (for a row), insert '1' in a new matrix under the variableA column and a '0' under the variableB column for that row. Keep doing that, and then do it for the others." Also, I need to say: "If a variable doesn't have an equal value in some other column, insert 1 for all its values in this binary matrix."
Your negociation_values definition and presented table are not the same. Here is the definition I used:
market_values = pd.DataFrame({'variableA': (1,2.0,3), 'variableB': (np.nan,2,np.nan), 'variableC': (9,10,15), 'variableD' : (18,25,43),'variableE':(36,11,12),'variableF':(99,10,98), 'variableG': (42,19,27)})
negociation_values = pd.DataFrame({'variableA': (0.1,0.2,0.3), 'variableB': (0.5,np.nan,0.303), 'variableC': (0.9,0.10,0.4), 'variableD' : (0.12,0.11,0.09),'variableE':(np.nan,0.13,0.21),'variableF':(1.4,np.nan,0.03), 'variableG': (0.141,0.134,0.111)})
The following code gives me the required matrix (though there are a number of edge cases you will need to consider):
cols = market_values.columns.values
bmatrix = pd.DataFrame(index=market_values.index, columns=cols, data=1)

for idx, col in enumerate(cols):
    # Compare col against every later column
    df_m = market_values[cols[idx + 1:]]
    df_n = negociation_values[cols[idx + 1:]]
    # Keep the later columns that match col somewhere, then compare
    # their negociation values against col's
    v = (
        df_n.loc[:, df_m.sub(market_values[col], axis=0).eq(0).any()]
        .sub(negociation_values[col], axis=0)
        .applymap(lambda x: 1 if x > 0 else 0)
    )
    if v.columns.size > 0:
        bmatrix[v.columns[0]] = v
        bmatrix[col] = 1 - v
The result is as required:
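Running it on the definitions above reproduces the matrix requested in the question:
   variableA  variableB  variableC  variableD  variableE  variableF  variableG
0          0          1          0          1          1          1          1
1          1          0          1          1          1          0          1
2          0          1          1          1          1          0          1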
The pseudo code is:
for each column of the market matrix:
    subtract it from the later columns
    keep the columns with any zeros (edge case: more than one such column)
    for each column with a zero, find the difference between the corresponding negociation values
    set the result to 1 if > 0, otherwise 0
    enter it into the binary matrix
Hope that makes sense.

Pandas: select the highest and lowest values between two specific values from another column

My original dataframe looks like this:
macd_histogram direct event
1.675475e-07 up crossing up
2.299171e-07 up 0
2.246809e-07 up 0
1.760860e-07 up 0
1.899371e-07 up 0
1.543226e-07 up 0
1.394901e-07 up 0
-3.461691e-08 down crossing down
1.212740e-06 up 0
6.448285e-07 up 0
2.227792e-07 up 0
-8.738289e-08 down crossing up
-3.109205e-07 down 0
The event column is filled with crossing up and crossing down! What I need is, between a crossing up and the following crossing down, to take the highest and lowest values of the macd_histogram column (between those same indices), subtract the lowest from the highest, and add the result to a new column next to the crossing up row!
I tried to do it with a for loop, but I am a bit lost on how to select the range between each crossing up and crossing down... any help? Thanks!
What I expect in fact (following the above dataframe):
macd_histogram direct event magnitude
1.675475e-07 up crossing up (0.851908e-07)
2.299171e-07 up 0
2.246809e-07 up 0
1.760860e-07 up 0
1.899371e-07 up 0
1.543226e-07 up 0
1.394901e-07 up 0
-3.461691e-08 down crossing down (2.651908e-06)
1.212740e-06 up 0
6.448285e-07 up 0
2.227792e-07 up 0
-8.738289e-08 down crossing up etc..
-3.109205e-07 down 0
This is what I tried so far:
index_up = df[df.event == 'crossing up'].index.values
index_down = df[df.event == 'crossing down'].index.values
df['magnitude'] = 0
array = np.array([])
for i in index_up:
for idx in index_down:
values = df.loc[i:idx, 'macd_histogram'].tolist()
max = np.max(values)
min = np.min(values)
magnitutde = max-min
print(magnitude)
df.at[i,'magnitude'] = magnitude
But I get the following error message: ValueError: zero-size array to reduction operation maximum which has no identity
I think I understand what you are asking for, but my result numbers don't match your example, so maybe I don't understand fully. Hopefully this answer will still help you.
First create a column to place the result.
df['result'] = np.nan
Create a variable with just the index of rows with crossing up/down.
event_range = df[df['event'] != '0'].index
Make a for loop through the index array: create a start and end index number for each section, get the maximum and minimum of the range between them, subtract, and place the result in the right column.
for x in range(len(event_range) - 1):
    start = event_range[x]
    end = event_range[x + 1] + 1  # I'm not sure if this is the range you want
    max = df.iloc[start:end, 0].max()
    min = df.iloc[start:end, 0].min()
    diff = max - min
    df.iloc[start, 3] = diff

df
macd_histogram direct event result
0 1.675480e-07 up crossing up 2.645339e-07
1 2.299170e-07 up 0 NaN
2 2.246810e-07 up 0 NaN
3 1.760860e-07 up 0 NaN
4 1.899370e-07 up 0 NaN
5 1.543230e-07 up 0 NaN
6 1.394900e-07 up 0 NaN
7 -3.461690e-08 down crossing down 1.300123e-06
8 1.212740e-06 up 0 NaN
9 6.448290e-07 up 0 NaN
10 2.227790e-07 up 0 NaN
11 -8.738290e-08 down crossing up NaN
12 -3.109210e-07 down 0 NaN
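For larger frames, a vectorized sketch of the same idea (an alternative I'm suggesting, not part of the original answer) labels each stretch of rows starting at a crossing with cumsum and aggregates max minus min per stretch. Note the edges differ slightly from the loop above: each range here excludes the next crossing row, and the last crossing also receives a value.
import numpy as np

# Group id increases by one at every crossing row
grp = (df['event'] != '0').cumsum()
# Max minus min of macd_histogram within each stretch, broadcast back to the rows
rng = df.groupby(grp)['macd_histogram'].transform(lambda s: s.max() - s.min())
# Keep the value only on the crossing rows
df['magnitude'] = np.where(df['event'] != '0', rng, np.nan)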

create a summary of movements between prices by date in a pandas dataframe

I have a dataframe which shows: 1) dates, 2) prices and 3) the difference between two prices by row.
dates   data   result   change
24-09     24        0     none
25-09     26        2      pos
26-09     27        1      pos
27-09     28        1      pos
28-09     26       -2      neg
I want to create a summary of the above data in a new dataframe. The summary would have 4 columns: 1) start date, 2) end date 3) number of days 4) run
For example, using the above, there was a positive run of +4 from 25-09 to 27-09. In the new dataframe there would be one new row for every change in the value of result from positive to negative. Where run = 0, this indicates no change from the previous day's price and would also need its own row in the dataframe. So I would want rows like:
start date   end date   num days   run
     25-09      27-09          3     4
     27-09      28-09          1    -2
     23-09      24-09          1     0
I think the first step would be to create a new column, "change", based on the value of run, which then shows one of "positive", "negative" or "no change". Then maybe I could groupby this column.
A couple of useful functions for this style of problem are diff() and cumsum().
I added some extra datapoints to your sample data to flesh out the functionality.
The ability to pick and choose different (and more than one) aggregation functions assigned to different columns is a super feature of pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({'dates': ['24-09', '25-09', '26-09', '27-09', '28-09', '29-09', '30-09', '01-10', '02-10', '03-10', '04-10'],
                   'data': [24, 26, 27, 28, 26, 25, 30, 30, 30, 28, 25],
                   'result': [0, 2, 1, 1, -2, 0, 5, 0, 0, -2, -3]})

def cat(x):
    return 1 if x > 0 else -1 if x < 0 else 0

df['cat'] = df['result'].map(lambda x: cat(x))  # probably there is a better way to do this
df['change'] = df['cat'].diff()
df['change_flag'] = df['change'].map(lambda x: 1 if x != 0 else x)
df['change_cum_sum'] = df['change_flag'].cumsum()  # which gives us our groupings
foo = df.groupby(['change_cum_sum']).agg({'dates': [np.min, np.max, 'count'], 'result': np.sum})
foo.reset_index(inplace=True)
foo.columns = ['id', 'start date', 'end date', 'num days', 'run']
print(foo)
which yields:
id start date end date num days run
0 1 24-09 24-09 1 0
1 2 25-09 27-09 3 4
2 3 28-09 28-09 1 -2
3 4 29-09 29-09 1 0
4 5 30-09 30-09 1 5
5 6 01-10 02-10 2 0
6 7 03-10 04-10 2 -5
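As a side note (a hedged simplification, not in the original answer), the cat helper and the map call can be replaced by np.sign, which returns -1, 0 or 1 directly:
df['cat'] = np.sign(df['result'])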
