I have 3 ranges of data values in series:
min_range:
27 893.151613
26 882.384516
20 817.781935
dtype: float64
max_range:
28 903.918710
27 893.151613
21 828.549032
dtype: float64
I have created a list of ranges:
range = list(zip(min_range, max_range))
output:
[(893.1516129032259, 903.91870967741943), (882.38451612903225, 893.1516129032259), (817.78193548387094, 828.54903225806447)]
I have got a sub-range:
sub_range1 = 824
sub_range2 = 825
I want to find the region in which the sub range lies.
for p, q in zip(min_range, max_range):
    if sub_range1 > p and sub_range2 < q:
        print(p, q)
output:
817.781935484 828.549032258
I want to find the respective position from that defined "range".
Expected Output:
817.781935484 828.549032258
range = 2 (Position in the range list)
How can I achieve this? Any help would be appreciated.
Use enumerate to get the index, i.e.:
for i, (p, q) in enumerate(zip(min_range, max_range)):
    if sub_range1 > p and sub_range2 < q:
        print(i)
Output: 2
A simple approach using a counter:
cnt = 0
for p, q in zip(min_range, max_range):
    if sub_range1 > p and sub_range2 < q:
        print(p, q)
        print(cnt)
    cnt = cnt + 1
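For reference, here is a minimal, self-contained version of the enumerate approach, with the min/max values copied from the question (the variable name `position` is mine):

```python
# Minimal sketch of the enumerate approach, using the min/max
# values shown in the question.
min_range = [893.1516129032259, 882.38451612903225, 817.78193548387094]
max_range = [903.91870967741943, 893.1516129032259, 828.54903225806447]

sub_range1, sub_range2 = 824, 825

position = None  # index of the matching range, if any
for i, (p, q) in enumerate(zip(min_range, max_range)):
    # A match means the whole sub-range falls strictly inside (p, q)
    if sub_range1 > p and sub_range2 < q:
        position = i
        break

print(position)  # 2
```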
I have a data frame like this
ntil ureach_x ureach_y awgt
0 1 1 34 2204.25
1 2 35 42 1700.25
2 3 43 48 898.75
3 4 49 53 160.25
and an array of values like this
ulist = [41,57]
For each value in the list [41, 57] I am trying to find whether it falls between ureach_x and ureach_y and, if so, return the awgt value.
awt = []
for u in ulist:
    for index, rows in df.iterrows():
        if u >= rows['ureach_x'] and u <= rows['ureach_y']:
            awt.append(rows['awgt'])
The above code works for values within the ranges of ureach_x and ureach_y. How do I check whether a value in the list is greater than the last row of ureach_y? My data frame has a dynamic shape with a varying number of rows.
For example, The desired output for value 57 in the list is 160.25
I tried the following:
for u in ulist:
    for index, rows in df.iterrows():
        if u >= rows['ureach_x'] and u <= rows['ureach_y']:
            awt.append(rows['awgt'])
        elif u >= rows['ureach_x'] and u > rows['ureach_y']:
            awt.append(rows['awgt'])
However, this returns multiple values for 41 in the list. How do I refer only to the last value in the ureach_y column inside an iterrows loop?
The expected output is as follows: for the values in the list
[41, 57]
the corresponding values from df have to be returned:
[1700.25, 160.25]
If I've understood correctly, you can perform a merge_asof:
s = pd.Series([41,57], name='index')
(pd.merge_asof(s, df, left_on='index', right_on='ureach_x')
.set_index('index')['awgt']
)
Output:
index
41 1700.25
57 160.25
Name: awgt, dtype: float64
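If it helps, here is a self-contained sketch of the merge_asof approach, reconstructing the sample frame from the question:

```python
import pandas as pd

# Reconstructing the sample frame from the question.
df = pd.DataFrame({'ntil': [1, 2, 3, 4],
                   'ureach_x': [1, 35, 43, 49],
                   'ureach_y': [34, 42, 48, 53],
                   'awgt': [2204.25, 1700.25, 898.75, 160.25]})

s = pd.Series([41, 57], name='num')
# merge_asof matches each value to the last row whose ureach_x <= value,
# so 57 (past the last range) picks up the last row's awgt.
out = pd.merge_asof(s, df, left_on='num', right_on='ureach_x')['awgt'].tolist()
print(out)  # [1700.25, 160.25]
```

Note that merge_asof requires both keys to be sorted, which holds for this data.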
If you have 0 in the data and you want to have 2204.25 returned, you can add two lines to @mozway's code and perform merge_asof twice, once going backwards and once going forwards; then combine the two.
ulist = [0, 41, 57]
srs = pd.Series(ulist, name='num')
backward = pd.merge_asof(srs, df, left_on='num', right_on='ureach_x')
forward = pd.merge_asof(srs, df, left_on='num', right_on='ureach_x', direction='forward')
out = backward.combine_first(forward)['awgt']
Output:
0 2204.25
1 1700.25
2 160.25
Name: awgt, dtype: float64
Another option (an explicit loop over ulist):
out = []
for num in ulist:
    if ((df['ureach_x'] <= num) & (num <= df['ureach_y'])).any():
        x = df.loc[(df['ureach_x'] <= num) & (num <= df['ureach_y']), 'awgt'].iloc[-1]
    elif (df['ureach_x'] > num).any():
        x = df.loc[df['ureach_x'] > num, 'awgt'].iloc[0]
    else:
        x = df.loc[df['ureach_y'] < num, 'awgt'].iloc[-1]
    out.append(x)
Output:
[2204.25, 1700.25, 160.25]
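As a self-contained check, the loop can be run against a reconstruction of the sample frame, with 0 added to ulist to exercise the "before the first range" branch:

```python
import pandas as pd

# Reconstruction of the sample frame from the question.
df = pd.DataFrame({'ureach_x': [1, 35, 43, 49],
                   'ureach_y': [34, 42, 48, 53],
                   'awgt': [2204.25, 1700.25, 898.75, 160.25]})
ulist = [0, 41, 57]

out = []
for num in ulist:
    inside = (df['ureach_x'] <= num) & (num <= df['ureach_y'])
    if inside.any():
        x = df.loc[inside, 'awgt'].iloc[-1]                # value inside a range
    elif (df['ureach_x'] > num).any():
        x = df.loc[df['ureach_x'] > num, 'awgt'].iloc[0]   # before a range
    else:
        x = df.loc[df['ureach_y'] < num, 'awgt'].iloc[-1]  # past the last range
    out.append(x)

print(out)  # [2204.25, 1700.25, 160.25]
```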
I'm fairly new to Python, so I hope somebody can help me. We have a list:
X LABELS
[0.85142858] 1
[0.85566274] 0
[0.85364912] 0
[0.81536489] 2
I applied k-medoids to cluster these elements with k=3.
The following script calculates the max SI (silhouette) score for each element:
# Assumes X (the data array), a fitted kmedoids model and
# sklearn's euclidean_distances are already defined.
for i in range(len(X)):
    s = []
    print("Client", i+1, X[i])
    for label in range(3):
        b = []
        print('S', label, ':')
        a = euclidean_distances(X[[i]], X[kmedoids.medoid_indices_][[label]])
        print('a:', label, a)
        for k in range(3):
            if k != label:
                b.append(euclidean_distances(X[[i]], X[kmedoids.medoid_indices_][[k]]))
                print('b:', k, b)
        bmin = min(b)
        print('minimum b', bmin)
        print('bi-ai', bmin - a)
        print('max{a_i, b_i}', max(a, bmin))
        s.append((bmin - a) / max(a, bmin))
        print('SI', s, label)
        print("-------------------")
    max_value = max(s)
    print("SI, max value:", max_value)
    print("***********************")
If we assume the following results for element 1, where S0, S1, S2 represent the clusters,
how could we assign element 1 to the cluster with the max SI value (here [0.76259282] for cluster 0)? So we change the cluster for element 1 from 1 to 0.
element 1 [0.85142858]
S 0 :
SI [ array([[0.76259282]])]
-------------------
S 1 :
SI [array([[-0.76259282]])]
-------------------
S 2 :
SI [ array([[-0.96782002]])]
-------------------
SI, max value: [[0.76259282]]
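One way to do the reassignment, sketched with made-up variable names (`si_per_cluster` holds one scalar SI value per candidate cluster, `labels` the current assignments), is to take the argmax:

```python
import numpy as np

# Hypothetical per-cluster SI values for element 1, taken from the
# output above (one scalar per candidate cluster).
si_per_cluster = np.array([0.76259282, -0.76259282, -0.96782002])

# The new cluster is simply the index of the largest SI value.
best_cluster = int(np.argmax(si_per_cluster))
print(best_cluster)  # 0

# Reassign element 1 (index 0) in a hypothetical labels array.
labels = np.array([1, 0, 0, 2])
labels[0] = best_cluster
```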
I have a DataFrame with 2 columns:
ColA| ColB
D 2
D 12
D 15
A 20
A 40
A 60
C 60
C 55
C 70
C 45
L 45
L 23
L 10
L 5
The result/output would be:
D UP
A UP
C FLAT
L Down
Where the label is the result of comparing all the relevant weights for a key: for UP, each successive weight must be greater than the previous weight; for DOWN, each must be less; everything else is FLAT.
For example, for UP you must have a strictly increasing sequence such as 2, 12, 15.
Here's a simple technique; it might not suit all cases, i.e.:
def sum_t(x):
    # Compare each value with the previous value
    m = x > x.shift()
    # If all of them are increasing, return UP
    if m.sum() == len(m) - 1:
        return 'UP'
    # If all of them are decreasing, return DOWN
    elif m.sum() == 0:
        return 'DOWN'
    # Else return FLAT
    else:
        return 'FLAT'
df.groupby('ColA')['ColB'].apply(sum_t)
Output:
ColA
A UP
C FLAT
D UP
L DOWN
Name: ColB, dtype: object
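As a self-contained check, the function can be run against a reconstruction of the sample frame from the question:

```python
import pandas as pd

# Sample frame rebuilt from the question's two columns.
df = pd.DataFrame({'ColA': list('DDDAAACCCCLLLL'),
                   'ColB': [2, 12, 15, 20, 40, 60, 60, 55, 70, 45, 45, 23, 10, 5]})

def sum_t(x):
    m = x > x.shift()            # compare each value with the previous one
    if m.sum() == len(m) - 1:    # all increasing
        return 'UP'
    elif m.sum() == 0:           # all decreasing
        return 'DOWN'
    return 'FLAT'

res = df.groupby('ColA')['ColB'].apply(sum_t)
print(res.to_dict())  # {'A': 'UP', 'C': 'FLAT', 'D': 'UP', 'L': 'DOWN'}
```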
Using diff and crosstab
s = df.groupby('ColA').ColB.diff().dropna()  # dropna since the first value in each group is NaN
pd.crosstab(df.ColA.loc[s.index], s > 0, normalize='index')[True].map({1: 'Up', 0: 'Down'}).fillna('Flat')
Out[100]:
ColA
A Up
C Flat
D Up
L Down
Name: True, dtype: object
A variation on @Dark's idea: I would first calculate a groupby + diff, then use unique before feeding the result to a custom function.
Then use logic based on min / max values.
def calc_label(x):
    if min(x) >= 0:
        return 'UP'
    elif max(x) <= 0:
        return 'DOWN'
    else:
        return 'FLAT'

res = df.assign(C=df.groupby('ColA').diff().fillna(0))\
        .groupby('ColA')['C'].unique()\
        .apply(calc_label)
print(res)
ColA
A UP
C FLAT
D UP
L DOWN
Name: C, dtype: object
Using numpy.polyfit in a custom def.
This way you can tweak the gradient you would class as 'FLAT':
def trend(x, flat=3.5):
    m = np.polyfit(np.arange(1, len(x)+1), x, 1)[0]
    if abs(m) < flat:
        return 'FLAT'
    elif m > 0:
        return 'UP'
    return 'DOWN'
df.groupby('ColA')['ColB'].apply(np.array).apply(trend)
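A self-contained check of this approach against a reconstruction of the sample frame (with the default flat=3.5 threshold, C's least-squares slope of -3 is classed as FLAT):

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question's two columns.
df = pd.DataFrame({'ColA': list('DDDAAACCCCLLLL'),
                   'ColB': [2, 12, 15, 20, 40, 60, 60, 55, 70, 45, 45, 23, 10, 5]})

def trend(x, flat=3.5):
    # Slope of the least-squares line through the group's values.
    m = np.polyfit(np.arange(1, len(x) + 1), x, 1)[0]
    if abs(m) < flat:
        return 'FLAT'
    elif m > 0:
        return 'UP'
    return 'DOWN'

res = df.groupby('ColA')['ColB'].apply(np.array).apply(trend)
print(res.to_dict())  # {'A': 'UP', 'C': 'FLAT', 'D': 'UP', 'L': 'DOWN'}
```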
A solution applying linear regression to each ID's points and determining the trend from the slope of the fitted line:
import numpy as np
from sklearn import linear_model
def slope(x, min_slope, max_slope):
    reg = linear_model.LinearRegression()
    # fit expects a 2-D X; reshape the positional index into a column
    reg.fit(np.arange(len(x)).reshape(-1, 1), np.asarray(x).reshape(-1, 1))
    m = reg.coef_[0][0]
    if m < min_slope:
        return 'Down'
    if m > max_slope:
        return 'Up'
    return 'Flat'

min_slope = -1
max_slope = 1
slopes = df.groupby('ColA').apply(lambda x: slope(x['ColB'], min_slope, max_slope))
Basically, I'm aggregating prices over three indices to determine: mean, std, as well as an upper/lower limit. So far so good. However, now I want to also find the lowest identified price which is still >= the computed lower limit.
My first idea was to use np.min to find the lowest price -> this obviously disregards the lower-limit and is not useful. Now I'm trying to store all the values the pivot table identified to find the price which still is >= lower-limit. Any ideas?
pivot = pd.pivot_table(temp, index=['A','B','C'],values=['price'], aggfunc=[np.mean,np.std],fill_value=0)
pivot['lower_limit'] = pivot['mean'] - 2 * pivot['std']
pivot['upper_limit'] = pivot['mean'] + 2 * pivot['std']
First, merge pivoted['lower_limit'] back into temp. Thus, for each price in temp there is also a lower_limit value.
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
Then you can restrict your attention to those rows in temp for which the price is >= lower_limit:
temp.loc[temp['price'] >= temp['lower_limit']]
The desired result can be found by computing a groupby/min:
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
For example,
import numpy as np
import pandas as pd
np.random.seed(2017)
N = 1000
ABC = list('ABC')
temp = pd.DataFrame(np.random.randint(2, size=(N,3)), columns=ABC)
temp['price'] = np.random.random(N)
pivoted = pd.pivot_table(temp, index=['A','B','C'],values=['price'],
aggfunc=[np.mean,np.std],fill_value=0)
pivoted['lower_limit'] = pivoted['mean'] - 2 * pivoted['std']
pivoted['upper_limit'] = pivoted['mean'] + 2 * pivoted['std']
temp = pd.merge(temp, pivoted['lower_limit'].reset_index(), on=ABC)
result = temp.loc[temp['price'] >= temp['lower_limit']].groupby(ABC)['price'].min()
print(result)
yields
A B C
0 0 0 0.003628
1 0.000132
1 0 0.005833
1 0.000159
1 0 0 0.006203
1 0.000536
1 0 0.001745
1 0.025713
I have a nested list with values:
list = [
...
['Country1', 142.8576737907048, 207.69725105029553, 21.613192419863577, 15.129178465784218],
['Country2', 109.33326343550823, 155.6847323746669, 15.450489646386226, 14.131554442715336],
['Country3', 99.23033109735835, 115.37122637190915, 5.380298424850267, 5.422030104456135],
...]
I want to count values in the second index / column by order of magnitude, starting at the lowest order of magnitude and ending at the largest...e.g.
99.23033109735835 = 10 <= x < 100
142.8576737907048 = 100 <= x < 1000
9432 = 1000 <= x < 10000
The aim is to output a simple char (#) count for how many index values fall in each category, e.g.
10 <= x < 100: ###
100 <= x < 1000: #########
I've started by grabbing the max() and min() values for the column in order to automatically calculate the largest and smallest magnitude categories, but I'm not sure how to associate each value in the column with an order of magnitude. If someone could point me in the right direction or give me some ideas I would be most grateful.
This function will turn your double into an integer order of magnitude:
>>> def magnitude(x):
... return int(math.log10(x))
...
>>> magnitude(99.23)
1
>>> magnitude(9432)
3
(so 10 ** magnitude(x) <= x < 10 ** (1 + magnitude(x)) for all x >= 1).
Just use the magnitude as a key, and count the occurrences per key. defaultdict may be helpful here.
Note this magnitude only works for values >= 1 (because int() truncation rounds towards zero).
Use
def magnitude(x):
    return int(math.floor(math.log10(x)))
instead if this matters for your use case. (Thanks to larsmans for pointing this out).
Extending Useless' answer to all real numbers, you can use:
import math

def magnitude(value):
    if value == 0:
        return 0
    return int(math.floor(math.log10(abs(value))))
Test cases:
In [123]: magnitude(0)
Out[123]: 0
In [124]: magnitude(0.1)
Out[124]: -1
In [125]: magnitude(0.02)
Out[125]: -2
In [126]: magnitude(150)
Out[126]: 2
In [127]: magnitude(-5280)
Out[127]: 3
If x is one of your numbers, what is len(str(int(x))) ?
Or, if you have numbers less than 1, what is int(math.log10(x))?
(See also the log10 docs. Also note that int() rounding here may not be what you want; see ceil and floor, and note you may need int(ceil(...)) or int(floor(...)) to get an integer answer.)
To categorize by the order of magnitude do:
from math import floor, log10
from collections import Counter
counter = Counter(int(floor(log10(x[1]))) for x in list)
A key of 1 covers 10 <= x < 100, and 2 covers 100 <= x < 1000.
print(counter)
Counter({2: 2, 1: 1})
Then it's just a matter of printing it out:
for x in sorted(counter.keys()):
    print("%d <= x < %d: %d" % (10**x, 10**(x+1), counter[x]))
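For the '#' bars the question asked for, here is a self-contained variant of the same Counter approach, with the data truncated to the three rows shown:

```python
from collections import Counter
from math import floor, log10

# The three rows shown in the question.
data = [
    ['Country1', 142.8576737907048, 207.69725105029553],
    ['Country2', 109.33326343550823, 155.6847323746669],
    ['Country3', 99.23033109735835, 115.37122637190915],
]

# Bucket each second-column value by its order of magnitude.
counter = Counter(int(floor(log10(row[1]))) for row in data)

for x in sorted(counter):
    print('%d <= x < %d: %s' % (10 ** x, 10 ** (x + 1), '#' * counter[x]))
# 10 <= x < 100: #
# 100 <= x < 1000: ##
```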
In case you ever want overlapping ranges or ranges with arbitrary bounds (not sticked to orders of magnitude/powers of 2/any other predictable series):
from collections import defaultdict
lst = [
['Country1', 142.8576737907048, 207.69725105029553, 21.613192419863577, 15.129178465784218],
['Country2', 109.33326343550823, 155.6847323746669, 15.450489646386226, 14.131554442715336],
['Country3', 99.23033109735835, 115.37122637190915, 5.380298424850267, 5.422030104456135],
]
buckets = {
    '10<=x<100': lambda x: 10 <= x < 100,
    '100<=x<1000': lambda x: 100 <= x < 1000,
}
result = defaultdict(int)
for item in lst:
    second_column = item[1]
    for label, range_check in buckets.items():
        if range_check(second_column):
            result[label] += 1
print(result)
Another option, using bisect
import bisect
from collections import Counter
list0 = [
['Country1', 142.8576737907048, 207.69725105029553, 21.613192419863577, 15.129178465784218],
['Country2', 109.33326343550823, 155.6847323746669, 15.450489646386226, 14.131554442715336],
['Country3', 99.23033109735835, 115.37122637190915, 5.380298424850267, 5.422030104456135]
]
magnitudes = [10**x for x in range(5)]
c = Counter(bisect.bisect(magnitudes, x[1]) for x in list0)
for x in c:
    print(x, '#' * c[x])
import bisect
from collections import defaultdict
lis1 = [['Country1', 142.8576737907048, 207.69725105029553, 21.613192419863577, 15.129178465784218],
['Country2', 109.33326343550823, 155.6847323746669, 15.450489646386226, 14.131554442715336],
['Country3', 99.23033109735835, 115.37122637190915, 5.380298424850267, 5.422030104456135],
]
lis2 = [0, 100, 1000, 10000]
dic = defaultdict(int)
for x in lis1:
    x = x[1]
    ind = bisect.bisect(lis2, x)
    if not (x >= lis2[-1] or x <= lis2[0]):
        sm, bi = lis2[ind-1], lis2[ind]
        dic["{} <= {} <= {}".format(sm, x, bi)] += 1

for k, v in dic.items():
    print(k, '-->', v)
output:
0 <= 99.2303310974 <= 100 --> 1
100 <= 142.857673791 <= 1000 --> 1
100 <= 109.333263436 <= 1000 --> 1