how to figure out trend per unique key. dataframe - python

I have a DataFrame with 2 cols
ColA| ColB
D 2
D 12
D 15
A 20
A 40
A 60
C 60
C 55
C 70
C 45
L 45
L 23
L 10
L 5
RESULT/Output would be
L Down
Where UP is result of adding up all the relevant Weights: each successive weight for each key, must be less than the previous weight.
for UP you must have

Here's a simple technique; it might not suit all cases:
def sum_t(x):
    # Compare each value with the previous value
    m = x > x.shift()
    # If all of them are increasing then return Up
    if m.sum() == len(m)-1:
        return 'UP'
    # if all of them are decreasing then return Down
    elif m.sum() == 0:
        return 'DOWN'
    # else return flat
    return 'FLAT'
df.groupby('ColA')['ColB'].apply(sum_t)
Name: ColB, dtype: object
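For reference, here is a minimal self-contained sketch of the same idea on the question's data. It uses diff instead of shift, and the name trend_label is only illustrative. It assumes that "Up" means every step is strictly increasing and "Down" means every step is strictly decreasing.

```python
import pandas as pd

# The sample data from the question
df = pd.DataFrame({
    "ColA": list("DDDAAACCCCLLLL"),
    "ColB": [2, 12, 15, 20, 40, 60, 60, 55, 70, 45, 45, 23, 10, 5],
})

def trend_label(values):
    # Differences between consecutive values; the first one is NaN and is dropped
    steps = values.diff().dropna()
    if (steps > 0).all():
        return "UP"
    if (steps < 0).all():
        return "DOWN"
    return "FLAT"

result = df.groupby("ColA")["ColB"].apply(trend_label)
print(result)
```

Note that a key with a single row has no steps at all, so this sketch would label it 'UP'; handle that case separately if it can occur.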

Using diff and crosstab
s = df.groupby('ColA').ColB.diff().dropna()  # dropna: the first value of each group has no predecessor
pd.crosstab(df.ColA.loc[s.index], s > 0, normalize='index')[True].map({1: 'Up', 0: 'Down'}).fillna('Flat')
A Up
C Flat
D Up
L Down
Name: True, dtype: object

Variation on @Dark's idea: I would first calculate GroupBy + diff and then use unique before feeding the result to a custom function.
Then use logic based on min / max values.
def calc_label(x):
    if min(x) >= 0:
        return 'UP'
    elif max(x) <= 0:
        return 'DOWN'
    return 'FLAT'
res = df.assign(C=df.groupby('ColA')['ColB'].diff().fillna(0))\
        .groupby('ColA')['C'].unique()\
        .apply(calc_label)
Name: C, dtype: object

Using numpy.polyfit in a custom def
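This way you can tweak the gradient below which you would class a trend as 'FLAT'
import numpy as np
def trend(x, flat=3.5):
    # Slope of the least-squares line through the points (1, x1), (2, x2), ...
    m = np.polyfit(np.arange(1, len(x)+1), x, 1)[0]
    if abs(m) < flat:
        return 'FLAT'
    elif m > 0:
        return 'UP'
    return 'DOWN'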
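As a usage sketch, the slope-based label can be applied per key with groupby. The DataFrame below is the sample data from the question, and the threshold flat=3.5 is just the default from the answer, not a recommended value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ColA": list("DDDAAACCCCLLLL"),
    "ColB": [2, 12, 15, 20, 40, 60, 60, 55, 70, 45, 45, 23, 10, 5],
})

def trend(x, flat=3.5):
    # Slope of the straight line fitted through (1, x1), (2, x2), ...
    m = np.polyfit(np.arange(1, len(x) + 1), x, 1)[0]
    if abs(m) < flat:
        return "FLAT"
    elif m > 0:
        return "UP"
    return "DOWN"

result = df.groupby("ColA")["ColB"].apply(trend)
print(result)
```

For key C the fitted slope is about -3, which falls inside the flat band, so it is labelled 'FLAT' even though the last value is lower than the first.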

A solution that fits a linear regression to the points of each key and labels the trend by the slope of the fitted line:
import numpy as np
from sklearn import linear_model
def slope(x, min_slope, max_slope):
    # Regress the values on their positions 0, 1, 2, ... within the group
    reg = linear_model.LinearRegression()
    reg.fit(np.arange(len(x)).reshape(-1, 1), x.to_numpy().reshape(-1, 1))
    slope = reg.coef_[0][0]
    if slope < min_slope:
        return 'Down'
    if slope > max_slope:
        return 'Up'
    return 'Flat'
min_slope = -1
max_slope = 1
# One label per key; the result is indexed by ColA, not by the rows of df
slopes = df.groupby('ColA')['ColB'].apply(lambda x: slope(x, min_slope, max_slope))


Calculating smallest within trio distance

I have a pandas dataframe similar to the one below:
Output var1 var2 var3
1 0.487981 0.297929 0.214090
1 0.945660 0.031666 0.022674
2 0.119845 0.828661 0.051495
2 0.095186 0.852232 0.052582
3 0.059520 0.053307 0.887173
3 0.091049 0.342226 0.566725
3 0.119295 0.414376 0.466329
... ... ... ... ...
Basically, I have 3 columns (propensity score values) and one output (treatment). I want to calculate the within-trio distance to find trios of outputs with the smallest within-trio distance.
The experiment is taken from the paper "Matching by Propensity Score in Cohort Studies with Three Treatment Groups" by Rassen et al. From their explanation, it looks like calculating the perimeter of a triangle, but I am not sure.
I think that at this GitHub link: there is Java code doing this stuff more or less, but I am not sure on how to use it. I use Python, so I have two options: try to adapt this previous code or write something else.
My idea is that var1, var2 and var3 can be considered like spatial x,y,z coordinates, and the output is like a point in the space.
I found a function that calculates the distance between 2 points:
# found here
import itertools
import numpy as np
distance = lambda p1, p2: np.sqrt(np.sum((p1 - p2) ** 2, axis=0))
def min_distance(cloud):
    pairs = itertools.combinations(cloud, 2)
    # map() returns an iterator in Python 3, so use the built-in min over a generator
    return min(distance(*pair) for pair in pairs)
def get_points(filename):
    with open(filename, 'r') as file:
        rows = np.genfromtxt(file, delimiter=',', skip_header=True)
    return rows
filename = 'cloud.csv'
cloud = get_points(filename)
min_dist = min_distance(cloud)
However, I want to calculate the distance between 3 points, so I think I need to iterate over all the possible pairs of the three points (XY, XZ and YZ), but I am not sure about this procedure.
Finally, I tried my own solution, which I think is correct but may be too computationally expensive.
I created my 3 datasets according to the Output value: dataset1 = dataset[dataset["Output"]==1], and the same for Output=2 and Output=3.
This is my distance function:
def Euclidean_Dist(df1, df2):
return np.linalg.norm(df1 - df2)
My variables:
tripletta_for = []
tripletta_tot_wr = []
p_inf = float('inf')
counter = 1
These are the steps used to compute the within-trio distance. I hope they are correct.
# i[0] = index
# i[1] = treatment prop1
# i[1][0] = treatment
# i[1][1] = prop
# I want to compute the distance between i[1][1], j[1][1] and k[1][1]
for i in dataset1.iterrows():
    minimum_distance = p_inf
    counter = counter + 1
    for j in dataset2.iterrows():
        dist12 = Euclidean_Dist(i[1][1], j[1][1])
        for k in dataset3.iterrows():
            dist13 = Euclidean_Dist(i[1][1], k[1][1])
            dist23 = Euclidean_Dist(j[1][1], k[1][1])
            somma = dist12 + dist13 + dist23
            if somma < minimum_distance:
                minimum_distance = somma
                tripletta_for = i[0], j[0], k[0]
    # Remove the matched rows so they cannot be reused for the next i
    dataset2.drop(index=tripletta_for[1], inplace=True)
    dataset3.drop(tripletta_for[2], inplace=True)
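If the within-trio distance is read as the perimeter of the triangle formed by one point from each treatment group (an interpretation, not something confirmed by the paper here), the search for the single best trio can be sketched with NumPy broadcasting instead of three nested loops. The function name best_trio and the sample arrays are illustrative only. Unlike the loop above, this sketch does not remove matched rows; it only finds the one trio with the smallest perimeter.

```python
import numpy as np

def best_trio(p1, p2, p3):
    """Return (i, j, k, perimeter) for the trio with the smallest perimeter.

    p1, p2, p3 are 2-D arrays of shape (n_points, n_features), one per group.
    """
    # Pairwise Euclidean distances between the groups
    d12 = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=-1)  # (n1, n2)
    d13 = np.linalg.norm(p1[:, None, :] - p3[None, :, :], axis=-1)  # (n1, n3)
    d23 = np.linalg.norm(p2[:, None, :] - p3[None, :, :], axis=-1)  # (n2, n3)
    # Perimeter of every possible trio, shape (n1, n2, n3)
    perimeter = d12[:, :, None] + d13[:, None, :] + d23[None, :, :]
    i, j, k = np.unravel_index(np.argmin(perimeter), perimeter.shape)
    return int(i), int(j), int(k), float(perimeter[i, j, k])

# The rows shown in the question, split by Output value
group1 = np.array([[0.487981, 0.297929, 0.214090], [0.945660, 0.031666, 0.022674]])
group2 = np.array([[0.119845, 0.828661, 0.051495], [0.095186, 0.852232, 0.052582]])
group3 = np.array([[0.059520, 0.053307, 0.887173], [0.091049, 0.342226, 0.566725],
                   [0.119295, 0.414376, 0.466329]])

i, j, k, perimeter = best_trio(group1, group2, group3)
print(i, j, k, perimeter)
```

The perimeter array holds n1 * n2 * n3 values, so this only fits in memory for moderately sized groups; for large groups the search would have to be done in chunks.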

How to get last value of column from a data frame

I have a data frame like this
ntil ureach_x ureach_y awgt
0 1 1 34 2204.25
1 2 35 42 1700.25
2 3 43 48 898.75
3 4 49 53 160.25
and an array of values like this
ulist = [41,57]
For each value in the list [41,57] I am trying to find if the values fall in between ureach_x and ureach_y and return the awgt value.
for u in ulist:
    for index, rows in df.iterrows():
        if (u >= rows['ureach_x'] and u <= rows['ureach_y']):
            print(rows['awgt'])
The above code works for values within the ranges of ureach_x and ureach_y. How do I check whether a value in the list is greater than ureach_y in the last row? My data frame has a dynamic shape, with a varying number of rows.
For example, The desired output for value 57 in the list is 160.25
I tried the following:
for u in ulist:
    for index, rows in df.iterrows():
        if (u >= rows['ureach_x'] and u <= rows['ureach_y']):
            print(rows['awgt'])
        elif (u >= rows['ureach_x'] and u > rows['ureach_y']):
            print(rows['awgt'])
However, this returns multiple values for 41 in the list. How do I refer only to the last value in the ureach_y column inside an iterrows loop?
The expected output is as follows:
for values in list:
the corresponding values from df has to be returned.
[1700.25 ,160.25]
If I've understood correctly, you can perform a merge_asof:
s = pd.Series([41,57], name='index')
(pd.merge_asof(s, df, left_on='index', right_on='ureach_x')
   .set_index('index')['awgt'])
41 1700.25
57 160.25
Name: awgt, dtype: float64
If you have 0 in the data and you want to have 2204.25 returned, you can add two lines to @mozway's code and perform merge_asof twice, once going backwards and once going forwards; then combine the two.
ulist = [0, 41, 57]
srs = pd.Series(ulist, name='num')
backward = pd.merge_asof(srs, df, left_on='num', right_on='ureach_x')
forward = pd.merge_asof(srs, df, left_on='num', right_on='ureach_x', direction='forward')
out = backward.combine_first(forward)['awgt']
0 2204.25
1 1700.25
2 160.25
Name: awgt, dtype: float64
Another option (an explicit loop over ulist):
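out = []
for num in ulist:
    # num falls inside one of the ranges
    if ((df['ureach_x'] <= num) & (num <= df['ureach_y'])).any():
        x = df.loc[(df['ureach_x'] <= num) & (num <= df['ureach_y']), 'awgt'].iloc[-1]
    # num lies below a range: take the first range that starts after it
    elif (df['ureach_x'] > num).any():
        x = df.loc[df['ureach_x'] > num, 'awgt'].iloc[0]
    # num lies above every range: take the last row
    else:
        x = df.loc[df['ureach_y'] < num, 'awgt'].iloc[-1]
    out.append(x)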
[2204.25, 1700.25, 160.25]
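A vectorised alternative, sketched under the assumption that the rows are sorted by ureach_x and that the ranges are contiguous (each ureach_x is the previous ureach_y plus one, as in the sample), is to locate each value with np.searchsorted and clip the position to the valid rows. Values below the first range then map to the first row and values above the last range map to the last row.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ntil": [1, 2, 3, 4],
    "ureach_x": [1, 35, 43, 49],
    "ureach_y": [34, 42, 48, 53],
    "awgt": [2204.25, 1700.25, 898.75, 160.25],
})
ulist = [0, 41, 57]

# Position of the last range whose start is <= each value (-1 if there is none)
pos = np.searchsorted(df["ureach_x"].to_numpy(), ulist, side="right") - 1
# Clamp to the first / last row for values outside every range
pos = np.clip(pos, 0, len(df) - 1)
out = df["awgt"].to_numpy()[pos].tolist()
print(out)  # [2204.25, 1700.25, 160.25]
```

If the ranges can have gaps between them, this shortcut would assign a value in a gap to the range before it, so the explicit checks above are the safer choice in that case.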

How to shorten my code with lambda statement in python?

I have trouble with shortening my code with lambda if possible. bp is my data name.
My data looks like this:
user label
1 b
2 b
3 c
I expect to have
user label Y
1 b 1
2 b 1
3 c 0
Here is my code:
counts = bp['Label'].value_counts()
def score_to_numeric(x):
    if counts['b'] > counts['s']:
        if x == 'b':
            return 1
        return 0
    if x == 'b':
        return 0
    return 1
bp['Y'] = bp['Label'].apply(score_to_numeric) # apply above function to convert data
It is a function converting a categorical data 'b' or 's' in column named 'Label' into numeric data: 0 or 1. The line counts = bp['Label'].value_counts() counts the number of 'b' or 's' in column 'Label'. Then, in score_to_numeric, if the count of 'b' is more than 's', then give value 1 to b in a new column called 'Y', and vice versa.
I would like to shorten my code into 3-4 lines at most. I think perhaps using a lambda statement will do this, but I'm not familiar enough with lambdas.
Since True and False evaluate to 1 and 0, respectively, you can simply return the Boolean expression, converted to integer.
def score_to_numeric(x):
    return int((counts['b'] > counts['s']) == (x == 'b'))
It returns 1 iff both expressions have the same Boolean value.
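I don't think you need to use the apply method. Something simple like this should work (written with .loc into the new column Y, because chained indexing such as bp.Label[...] = ... may assign to a temporary copy):
value_counts = bp.Label.value_counts()
bp['Y'] = 0
bp.loc[bp.Label == 'b', 'Y'] = 1 if value_counts['b'] > value_counts['s'] else 0
bp.loc[bp.Label == 's', 'Y'] = 1 if value_counts['s'] > value_counts['b'] else 0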
You could do the following
counts = bp['Label'].value_counts()
t = 1 if counts['b'] > counts['s'] else 0
bp['Y'] = bp['Label'].apply(lambda x: t if x == 'b' else 1 - t)
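The same mapping can also be sketched without apply or a lambda, by comparing two Boolean values element-wise. The small bp frame below is made-up sample data, and counts.get(..., 0) is used here only so that a missing label does not raise a KeyError.

```python
import pandas as pd

# Hypothetical sample data with the two labels 'b' and 's'
bp = pd.DataFrame({"user": [1, 2, 3], "Label": ["b", "b", "s"]})

counts = bp["Label"].value_counts()
# True when 'b' is the more frequent label
b_is_majority = counts.get("b", 0) > counts.get("s", 0)

# A row gets 1 when "is this row 'b'?" agrees with "is 'b' the majority?"
bp["Y"] = (bp["Label"].eq("b") == b_is_majority).astype(int)
print(bp)
```

As in the original function, a tie between the two counts gives 0 to 'b' and 1 to 's'.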

Compound inequality in if statement

This is a generalized function I want to use to check if each row of a dataframe follows a specific trend in column values.
def follows_trend(row):
    trend = None
    if row[("col_5" < "col_6" < "col_4" < "col_1" < "col_2" < "col_3")]:
        trend = True
    else:
        trend = False
    return trend
I'll apply it like this
df_trend = df.apply(follows_trend, axis=1)
When I do, it returns all True when there are clearly some rows that should return False. I'm not sure if there is something wrong with the inequality I used or the function itself.
The compound comparisons don't "expand out of" the dict lookup. "col_5" < "col_6" < "col_4" < "col_1" < "col_2" < "col_3" will be evaluated first, producing False because the strings aren't sorted - so your if statement is actually if row[(False)]:. You need to do this:
if row["col_5"] < row["col_6"] < row["col_4"] < row["col_1"] < row["col_2"] < row["col_3"]:
If you have a lot of these expressions, you should probably extract this to a method that takes row and a list of the column names, and uses a loop for the comparisons. If you only have one, but want a somewhat more nice-looking version, try this:
a, b, c, d, e, f = (row[c] for c in ("col_5", "col_6", "col_4", "col_1", "col_2", "col_3"))
if a < b < c < d < e < f:
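The helper suggested above, one that takes the row and a list of column names and loops over the comparisons, could look roughly like this. The name follows_order and the two-row sample frame are illustrative assumptions.

```python
import pandas as pd

def follows_order(row, columns):
    """True if the row's values are strictly increasing in the given column order."""
    values = [row[c] for c in columns]
    # Compare every value with its successor
    return all(a < b for a, b in zip(values, values[1:]))

order = ["col_5", "col_6", "col_4", "col_1", "col_2", "col_3"]
df = pd.DataFrame(
    [[4, 5, 6, 3, 1, 2],   # col_5 < col_6 < col_4 < col_1 < col_2 < col_3 holds
     [4, 5, 6, 3, 2, 1]],  # col_5 > col_6, so the trend is broken
    columns=["col_1", "col_2", "col_3", "col_4", "col_5", "col_6"],
)

df_trend = df.apply(follows_order, axis=1, args=(order,))
print(df_trend)
```

Passing the column order through args keeps the function reusable for other orderings.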
You can also reorder the column names, use diff to take the differences along each row, and compare the result with 0:
(df[["col_5", "col_6", "col_4", "col_1", "col_2", "col_3"]]
   .diff(axis=1).drop(columns='col_5').gt(0).all(axis=1))
import pandas as pd
df = pd.DataFrame({"A": [1,2], "B": [3,1], "C": [4,2]})
# A B C
#0 1 3 4
#1 2 1 2
df.diff(axis=1).drop(columns='A').gt(0).all(axis=1)
#0 True
#1 False
#dtype: bool
You could use query for this. See the example below:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5, 3), columns=['col1','col2','col3'])
print(df)
print(df.query('col2>col3>col1')) # query accepts a string with chained comparisons.
results in
col1 col2 col3
0 -0.788909 1.591521 1.709402
1 -1.563310 1.188993 2.295683
2 -1.572323 -0.600015 -1.518411
3 1.786051 0.303291 -0.344720
4 0.756029 -0.393941 1.059874
col1 col2 col3
2 -1.572323 -0.600015 -1.518411

stratified sampling in numpy

In numpy I have a dataset like this. The first two columns are indices. I can divide my dataset into blocks via the indices, i.e. the first block is 0 0, the second block is 0 1, the third block is 0 2, then 1 0, 1 1, 1 2 and so on. Each block has at least two elements. The numbers in the index columns can vary.
I need to split the dataset along these blocks 80%-20% randomly such that after the split each block in both datasets has at least 1 element. How could I do that?
indices | real data
0 0 | 43.25 665.32 ... } 1st block
0 0 | 11.234 }
0 1 ... } 2nd block
0 1 }
0 2 } 3rd block
0 2 }
1 0 } 4th block
1 0 }
1 0 }
1 1 ...
1 1
1 2
1 2
2 0
2 0
2 1
2 1
2 1
See how you like this. To introduce randomness, I shuffle the entire dataset; it is the only way I have found to do the splitting vectorized. Maybe you could simply shuffle an indexing array, but that was one indirection too many for my brain today. I have also used a structured array, for ease in extracting the blocks. First, let's create a sample dataset:
from __future__ import division
import numpy as np
# Create a sample data set
c1, c2 = 10, 5
idx1, idx2 = np.arange(c1), np.arange(c2)
idx1, idx2 = np.repeat(idx1, c2), np.tile(idx2, c1)
items = 1000
i = np.random.randint(c1*c2, size=(items - 2*c1*c2,))
d = np.random.rand(items+5)
dataset = np.empty((items+5,), [('idx1', int), ('idx2', int),
                                ('data', float)])
dataset['idx1'][:2*c1*c2] = np.tile(idx1, 2)
dataset['idx1'][2*c1*c2:-5] = idx1[i]
dataset['idx2'][:2*c1*c2] = np.tile(idx2, 2)
dataset['idx2'][2*c1*c2:-5] = idx2[i]
dataset['data'] = d
# Add blocks with only 2 and only 3 elements to test corner case
dataset['idx1'][-5:] = -1
dataset['idx2'][-5:] = [0] * 2 + [1]*3
And now the stratified sampling:
# For randomness, shuffle the entire array
np.random.shuffle(dataset)
blocks, _ = np.unique(dataset[['idx1', 'idx2']], return_inverse=True)
block_count = np.bincount(_)
where = np.argsort(_)
block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))
# If we have n elements in a block, and we assign 1 to each array, we
# are left with only n-2. If we randomly assign a fraction x of these
# to the first array, the expected ratio of items will be
# (x*(n-2) + 1) : ((1-x)*(n-2) + 1)
# Setting the ratio equal to 4 (80/20) and solving for x, we get
# x = 4/5 + 3/5/(n-2)
x = 4/5 + 3/5/(block_count - 2)
x = np.clip(x, 0, 1) # if n in (2, 3), the ratio is larger than 1
threshold = np.repeat(x, block_count)
threshold[block_start] = 1 # first item goes to A
threshold[block_start + 1] = 0 # second item goes to B
a_idx = threshold > np.random.rand(len(dataset))
A = dataset[where[a_idx]]
B = dataset[where[~a_idx]]
After running it, the split is roughly 80/20, and all blocks are represented in both arrays:
>>> len(A)
>>> len(B)
>>> np.all(np.unique(A[['idx1', 'idx2']]) == np.unique(B[['idx1', 'idx2']]))
Here's an alternative solution. I'm open to a code review if it is possible to implement this in a more numpy way (without for loops). @Jamie's answer is really good; it's just that it sometimes produces skewed ratios within blocks of data.
ratio = 0.8
IDX1 = 0
IDX2 = 1
idx1s = np.arange(len(np.unique(data[:, IDX1])))
idx2s = np.arange(len(np.unique(data[:, IDX2])))
valid = None
train = None
for i1 in idx1s:
    for i2 in idx2s:
        # Row positions of the current block
        mask = np.nonzero((data[:, IDX1] == i1) & (data[:, IDX2] == i2))
        curr_data = data[mask, :]
        start = np.min(mask)
        end = np.max(mask)
        thres = start + np.around((end - start) * ratio).astype(int)
        selected = mask < thres
        train_idx = mask[0][selected[0]]
        valid_idx = mask[0][~selected[0]]
        if train is not None:
            train = np.vstack((train, data[train_idx]))
            valid = np.vstack((valid, data[valid_idx]))
        else:
            train = data[train_idx]
            valid = data[valid_idx]
I'm assuming that each block has at least two entries and also that if it has more than two you want them assigned as closely as possible to 80/20. The easiest way to do this seems to be to assign a random number to all rows, and then choose based on percentiles within each stratified sample. Say this is the data in file strat_sample.csv:
Then this code (using Pandas data structures) works as desired
import numpy as np
import random as rnd
import pandas as pd
from math import ceil, floor
#sample data strat_sample.csv, contents to follow
def TreatmentOneCount(n, *args):
    #assign a minimum one to each group but as close as possible to fraction OptimalRatio in group 1.
    OptimalRatio = args[0]
    if n < 2:
        print("N too small, assignment not defined.")
        a = np.nan
    elif n == 2:
        a = 1
    else:
        # One of two integers is closest to the target ratio: one above it, the other below.
        # If the number above is N and it is closest to optimal, use N-1 so that both groups have at least one member (recall n>2).
        # If the number below is 0 and it is closest to optimal, use 1 so that both groups have at least one member (recall n>2).
        targetassigment = OptimalRatio * n
        if targetassigment - floor(targetassigment) > 0.5:
            a = min(ceil(targetassigment), n-1)
        else:
            a = max(floor(targetassigment), 1)
    return a
df = pd.read_csv('strat_sample.csv', sep=',' , header=0)
#assign a random number to each entry
df['RandScore'] = np.random.uniform(0,1,df.shape[0])
df.sort_values(by=['Index_1', 'Index_2', 'RandScore'], inplace=True)
#Within each block assign a rank based on random number.
df['RandRank'] = df.groupby(['Index_1','Index_2'])['RandScore'].rank()
#make a group index
df['MasterIdx'] = df['Index_1'].apply(str) + df['Index_2'].apply(str)
#Store the counts for members of each block
seriestest = df.groupby('MasterIdx')['RandRank'].count()
seriestest.name = "Counts"
dftest = pd.DataFrame(seriestest)
#Add the block counts to the data
df = df.merge(dftest, how='left', left_on = 'MasterIdx', right_index= True)
#Make the actual assignments to the two groups
df['Assignment'] = (df['RandRank'] <= df['Counts'].apply(TreatmentOneCount, args = (0.8,))) * -1 + 2
df = df.drop(['MasterIdx', 'Counts', 'RandRank', 'RandScore'], axis=1)
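# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)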
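For the requirement that every block keeps at least one element on each side of the split, a minimal NumPy sketch is to permute the rows of each block and clamp the number of training rows to the range 1 to n-1. The function name stratified_split_mask and the sample keys are assumptions for illustration; the loop runs once per block rather than once per row.

```python
import numpy as np

def stratified_split_mask(keys, ratio=0.8, seed=None):
    """Boolean mask of training rows; keys is a 2-D array of block indices.

    Every block must contain at least two rows, so that both sides of the
    split receive at least one row from it.
    """
    rng = np.random.default_rng(seed)
    # Label each row with the number of the block it belongs to
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.ravel()
    train_mask = np.zeros(len(keys), dtype=bool)
    for block in range(inverse.max() + 1):
        rows = rng.permutation(np.flatnonzero(inverse == block))
        if len(rows) < 2:
            raise ValueError("every block needs at least two rows")
        # As close to the ratio as possible, but never 0 and never the whole block
        n_train = int(np.clip(round(ratio * len(rows)), 1, len(rows) - 1))
        train_mask[rows[:n_train]] = True
    return train_mask

# Blocks of size 2, 3 and 10
keys = np.array([[0, 0]] * 2 + [[0, 1]] * 3 + [[1, 0]] * 10)
mask = stratified_split_mask(keys, ratio=0.8, seed=0)
train_keys, valid_keys = keys[mask], keys[~mask]
```

With an array that also holds the data columns, the same mask selects the two parts, for example data[mask] and data[~mask].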

