I have a dataframe with week number as int, item name, and ranking.
For instance:
item_name ranking week_number
0 test 4 1
1 test 3 2
I'd like to add a new column with the ranking evolution since the last week.
The math is very simple:
df['ranking_evolution'] = ranking_previous_week - df['ranking']
It would only require exception handling for week 1.
But I'm not sure how to return the ranking previous week.
I could do it by iterating over the rows but I'm wondering if there is a cleaner way so I can just declare a column?
The issue is that I'd have to compare the dataframe to itself.
I've candidly tried:
df['ranking_evolution'] = df['ranking'].loc[(df['item_name'] == df['item_name']) & (df['week_number'] == df['week_number'] - 1)] - df['ranking']
But this returns NaN values.
Even using a copy returned NaN values.
I assume this is a simplified example; you probably have different products and maybe missing weeks.
A robust way is to perform a self-merge on week_number+1:
(df.merge(df.assign(week_number=df['week_number']+1),
          on=['item_name', 'week_number'],
          suffixes=(None, '_evolution'),
          how='left')
   .assign(ranking_evolution=lambda d: d['ranking_evolution'].sub(d['ranking']))
)
Output:
item_name ranking week_number ranking_evolution
0 test 4 1 NaN
1 test 3 2 1.0
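If the weeks are guaranteed to be consecutive for each item, a groupby/shift sketch would also give the previous week's ranking (that consecutiveness is an assumption; with gaps in week_number this would silently compare against the wrong week, which is why the merge above is more robust):

# Sketch, assuming consecutive week_numbers per item (no gaps):
df['ranking_evolution'] = (
    df.sort_values('week_number')
      .groupby('item_name')['ranking']
      .shift()            # previous week's ranking within each item
    - df['ranking']       # minus the current week's ranking
)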
In short, try this code to see the trick:
import pandas as pd
data = {
'item_name': ['test', 'test', 'test', 'test', 'test', 'test', 'test', 'test', 'test', 'test'],
'ranking': [4, 3, 2, 1, 2, 3, 4, 5, 6, 7],
'week_number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
df = pd.DataFrame(data)
df['ranking_evolution'] = df['ranking'].diff(-1)  # the trick: current ranking minus the next row's ranking
print(df)
Results:
  item_name  ranking  week_number  ranking_evolution
0      test        4            1                1.0
1      test        3            2                1.0
2      test        2            3                1.0
3      test        1            4               -1.0
...
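With several different items in the real data, the same idea would presumably need to be applied per item; a sketch of that assumption:

# Sketch: per-item version of the same trick, grouping by item_name (assumption)
df['ranking_evolution'] = df.groupby('item_name')['ranking'].diff(-1)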
I have the following dataframe:
import pandas as pd

d_test = {
    'random_staff': ['gfda', 'fsd', 'gec', 'erw', 'gd', 'kjhk', 'fd', 'kui'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
The cluster_number column contains values from 1 to n. Some values may repeat, but no values are missing. In the example above those values are 1, 2, 3, 4.
I want to be able to select some value from the cluster_number column and change every occurrence of it to a set of unique values, without leaving gaps in the numbering. For example, if we select the value 2, the desired outcome for cluster_number is [1, 2, 3, 3, 5, 1, 4, 6]. Note we had three 2s in the column: we kept the first one as 2, changed the next occurrence of 2 to 5, and changed the last occurrence of 2 to 6.
I wrote code for the logic above and it works fine:
cluster_number_to_change = 2
max_cluster = max(df_test['cluster_number'])
first_iter = True
i = cluster_number_to_change
for index, row in df_test.iterrows():
    if row['cluster_number'] == cluster_number_to_change:
        df_test.loc[index, 'cluster_number'] = i
        if first_iter:
            i = max_cluster + 1
            first_iter = False
        else:
            i += 1
But it is written as a for-loop, and I am trying to understand whether it can be transformed into a pandas .apply method (or any other efficient vectorized solution).
Using boolean indexing:
# get cluster #2
m1 = df_test['cluster_number'].eq(2)
# identify duplicates
m2 = df_test['cluster_number'].duplicated()
# increment duplicates using the max as reference
df_test.loc[m1 & m2, 'cluster_number'] = (
    m2.where(m1).cumsum()
      .add(df_test['cluster_number'].max())
      .convert_dtypes()
)
print(df_test)
Output:
random_staff cluster_number
0 gfda 1
1 fsd 2
2 gec 3
3 erw 3
4 gd 5
5 kjhk 1
6 fd 4
7 kui 6
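A sketch of the same logic wrapped in a reusable function, so any cluster value can be selected (the function name and signature are mine, not from the question):

def relabel_duplicates(df, col, value):
    # keep the first occurrence of `value`, renumber the rest after the column max
    m1 = df[col].eq(value)        # rows holding the selected value
    m2 = df[col].duplicated()     # repeated occurrences (all but the first)
    df.loc[m1 & m2, col] = (
        m2.where(m1).cumsum()
          .add(df[col].max())
          .convert_dtypes()
    )
    return df

relabel_duplicates(df_test, 'cluster_number', 2)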
I have a pandas dataframe with a column such as:
df1 = pd.DataFrame({ 'val': [997.95, 997.97, 989.17, 999.72, 984.66, 1902.15]})
I have 2 types of events that can be detected from this column, and I want to label them 1 and 2.
I need to get the indexes for each label, and to do so I need to find where the 'val' column has changed a lot (±7) from the previous row.
Expected output:
one = [0, 1, 3, 5]
two = [2, 4]
Use Series.diff with a mask to test for values less than 0, then use boolean indexing to get the indices:
m = df1.val.diff().lt(0)
# if you need to test for changes below -7 instead:
# m = df1.val.diff().lt(-7)
one = df1.index[~m]
two = df1.index[m]
print (one)
Int64Index([0, 1, 3, 5], dtype='int64')
print (two)
Int64Index([2, 4], dtype='int64')
If you need lists:
one = df1.index[~m].tolist()
two = df1.index[m].tolist()
Details:
print (df1.val.diff())
0 NaN
1 0.02
2 -8.80
3 10.55
4 -15.06
5 917.49
Name: val, dtype: float64
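If you also want the 1/2 labels in a column rather than as two index lists, a small sketch (the 'event' column name is my own):

import numpy as np

# label 2 where the value dropped from the previous row, 1 otherwise
df1['event'] = np.where(m, 2, 1)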
I'm currently creating a new column in my pandas dataframe, which calculates a value based on a simple calculation using a value in another column, and a simple value subtracting from it. This is my current code, which almost gives me the output I desire (example shortened for reproduction):
import pandas as pd

subtraction_value = 3
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data['test'][::-1] - subtraction_value
When run, this gives me the current output:
print(data['new_column'])
[9, 1, 2, 1, -2, 0, -1, 2, 7, 6]
However, suppose I wanted to use a different value to subtract at position [0], use the original subtraction value on positions [1:3], then use the second value again at position [4], and repeat this pattern along the column. How would I do this without iterating? I realize I could use a for loop, but for performance reasons I'd like to do this another way. My new output would ideally look like this:
subtraction_value_2 = 6
print(data['new_column'])
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
You can use positional indexing (this keeps the existing new_column values and only overwrites every 4th row):
subtraction_value_2 = 6
col = data.columns.get_loc('new_column')
data.iloc[0::4, col] = data['test'].iloc[0::4].sub(subtraction_value_2)
or with numpy.where:
import numpy as np

data['new_column'] = np.where(data.index % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
output:
test new_column
0 12 6
1 4 1
2 5 2
3 4 1
4 1 -5
5 3 0
6 2 -1
7 5 2
8 10 4
9 9 6
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data.test - subtraction_value
data['new_column'][::4] = data.test[::4] - subtraction_value_2
print(list(data.new_column))
Output:
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
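As a side note, the chained assignment on the last line can trigger a SettingWithCopyWarning in recent pandas versions; a sketch of the same step written with .loc:

# assign every 4th row via .loc instead of chained indexing
data.loc[data.index[::4], 'new_column'] = data['test'].iloc[::4] - subtraction_value_2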
I have a large dataframe of stock price data with df.columns = ['open', 'high', 'low', 'close'].
Problem definition:
When an EMA crossover happens, I mark it with df['cross'] = 'cross'. Every time a crossover happens, if we label the current crossover as crossover 4, I want to check whether the minimum value of df['low'] between crossovers 3 and 4 IS GREATER THAN the minimum value of df['low'] between crossovers 1 and 2. I have made an attempt at the code based on the help I have received from 'Gherka' so far. I have indexed the crossovers and found the minimum values between consecutive crossovers.
So, every time a crossover happens, it has to be compared with the previous 3 crossovers, and I need to check MIN(CROSS3, CROSS4) > MIN(CROSS1, CROSS2).
I would really appreciate it if you could help me complete this.
import pandas as pd
import numpy as np
import bisect as bs

data = pd.read_csv("Nifty.csv")
df = pd.DataFrame(data)

df['5EMA'] = df['Close'].ewm(span=5).mean()
df['10EMA'] = df['Close'].ewm(span=10).mean()

condition1 = df['5EMA'].shift(1) < df['10EMA'].shift(1)
condition2 = df['5EMA'] > df['10EMA']
df['cross'] = np.where(condition1 & condition2, 'cross', None)
cross_index_array = df.loc[df['cross'] == 'cross'].index

def find_index(a, x):
    i = bs.bisect_left(a, x)
    return a[i-1]

def min_value(x):
    """Find the minimum value of 'Low' between crossovers 1 and 2, crossovers 3 and 4, etc..."""
    cur_index = x.name
    prev_cross_index = find_index(cross_index_array, cur_index)
    return df.loc[prev_cross_index:cur_index, 'Low'].min()

df['min'] = None
df['min'][df['cross'] == 'cross'] = df.apply(min_value, axis=1)
print(df)
This should do the trick:
import pandas as pd

df = pd.DataFrame({'open': [1, 2, 3, 4, 5],
                   'high': [5, 6, 6, 5, 7],
                   'low': [1, 3, 3, 4, 4],
                   'close': [3, 5, 3, 5, 6]})

# flag bull days where the close is above the open
df['day'] = df.apply(lambda x: 'bull' if x['close'] > x['open'] else None, axis=1)

# rolling minimum of 'low' over consecutive bull days
# (pd.rolling_min has been removed from pandas; use .rolling().min() instead)
df['min'] = None
df.loc[df['day'] == 'bull', 'min'] = (
    df.loc[df['day'] == 'bull', 'low'].rolling(window=2).min()
)
print(df)
#    open  high  low  close   day  min
# 0     1     5    1      3  bull  NaN
# 1     2     6    3      5  bull  1.0
# 2     3     6    3      3  None  None
# 3     4     5    4      5  bull  3.0
# 4     5     7    4      6  bull  4.0
Open for comments!
If I understand your question correctly, you need a dynamic "rolling window" over which to calculate the minimum value. Assuming your index is a default one, meaning it is sorted in ascending order, you can try the following approach:
import pandas as pd
import numpy as np
from bisect import bisect_left
df = pd.DataFrame({'open': [1, 2, 3, 4, 5],
                   'high': [5, 6, 6, 5, 7],
                   'low': [1, 3, 2, 4, 4],
                   'close': [3, 5, 3, 5, 6]})
This uses the same sample data as mommermi, but with low on the third day changed to 2 as the third day should also be included in the "rolling window".
df['day'] = np.where(df['close'] > df['open'], 'bull', None)
We calculate the day column using vectorized numpy operation which should be a little faster.
bull_index_array = df.loc[df['day'] == 'bull'].index
We store the index values of the rows (days) that we've flagged as bulls.
def find_index(a, x):
    i = bisect_left(a, x)
    return a[i-1]
Bisect from the core library will enable us to find the index of the previous bull day in an efficient way. This requires that the index is sorted which it is by default.
def min_value(x):
    cur_index = x.name
    prev_bull_index = find_index(bull_index_array, cur_index)
    return df.loc[prev_bull_index:cur_index, 'low'].min()
Next, we define a function that will create our "dynamic" rolling window by slicing the original dataframe by previous and current index.
df['min'] = df.apply(min_value, axis=1)
Finally, we apply the min_value function row-wise to the dataframe, yielding this:
open high low close day min
0 1 5 1 3 bull NaN
1 2 6 3 5 bull 1.0
2 3 6 2 3 None 2.0
3 4 5 4 5 bull 2.0
4 5 7 4 6 bull 4.0
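From here, the comparison the question asks for (is the minimum between crossovers 3 and 4 greater than the minimum between crossovers 1 and 2?) could be sketched by shifting the 'min' column two flagged rows back; the 'higher_low' column name is my own, and this assumes 'min' was computed as above:

# compare each interval minimum with the one from two flagged rows earlier
flagged = df.loc[df['day'] == 'bull', 'min']
df.loc[flagged.index, 'higher_low'] = flagged > flagged.shift(2)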
For example I have a dataframe:
df = pd.DataFrame({'Value_Bucket': [5, 5, 5, 10, 10, 10],
'DayofWeek': [1, 1, 3, 2, 4, 2],
'Hour_Bucket': [1, 5, 7, 4, 3, 12],
'Values': [1, 1.5, 2, 3, 5, 3]})
The actual data set is rather large (5000 rows+). I'm looking to perform functions on 'Values' if the "Value_Bucket" = 5, and for each possible combination of "DayofWeek" and "Hour_Bucket".
Essentially the data will be grouped to a table of 24 rows (Hour_Bucket) and 7 columns (DayofWeek), and each cell is filled with the result of a function (say average for example). I can use a groupby function for 1 criteria, can someone explain how I can group two criteria and tabulate the result in a table?
Query to subset, then groupby, then unstack:
df.query('Value_Bucket == 5').groupby(
['Hour_Bucket', 'DayofWeek']).Values.mean().unstack()
DayofWeek 1 3
Hour_Bucket
1 1.0 NaN
5 1.5 NaN
7 NaN 2.0
If you want to have zeros instead of NaN:
df.query('Value_Bucket == 5').groupby(
['Hour_Bucket', 'DayofWeek']).Values.mean().unstack(fill_value=0)
DayofWeek 1 3
Hour_Bucket
1 1.0 0.0
5 1.5 0.0
7 0.0 2.0
Pivot tables seem more natural to me than groupby paired with unstack, though they do the exact same thing.
pd.pivot_table(data=df.query('Value_Bucket == 5'),
index='Hour_Bucket',
columns='DayofWeek',
values='Values',
aggfunc='mean',
fill_value=0)
Output:
DayofWeek 1 3
Hour_Bucket
1 1.0 0
5 1.5 0
7 0.0 2
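If you want the full 24 x 7 grid described in the question even when some hours or days never occur in the subset, a reindex sketch (the exact hour/day ranges here are assumptions):

result = (pd.pivot_table(data=df.query('Value_Bucket == 5'),
                         index='Hour_Bucket', columns='DayofWeek',
                         values='Values', aggfunc='mean', fill_value=0)
            .reindex(index=range(1, 25), columns=range(1, 8), fill_value=0))
print(result)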