Sort list based on another list - python

I have two lists in Python 3.6, and I would like to sort w by considering the values of d. This is similar to this question,
Sorting list based on values from another list?, though I could not use zip because w and d are not paired data.
I have a code sample, and want to get the t variable.
Updated
I could do it by using a for loop. Is there any faster way?
import numpy as np
w = np.arange(0.0, 1.0, 0.1)
t = np.zeros(10)
d = np.array([3.1, 0.2, 5.3, 2.2, 4.9, 6.1, 7.7, 8.1, 1.3, 9.4])
ind = np.argsort(d)
print('w', w)
print('d', d)
for i in range(10):
    t[ind[i]] = w[i]
print('t', t)
#w [ 0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
#d [ 3.1 0.2 5.3 2.2 4.9 6.1 7.7 8.1 1.3 9.4]
#t [ 0.3 0. 0.5 0.2 0.4 0.6 0.7 0.8 0.1 0.9]

Use argsort like so:
>>> t = np.empty_like(w)
>>> t[d.argsort()] = w
>>> t
array([0.3, 0. , 0.5, 0.2, 0.4, 0.6, 0.7, 0.8, 0.1, 0.9])

They are paired data, but in the opposite direction.
Make a third list, i, np.arange(0, 10).
zip this with d.
Sort the tuples with d as the sort key; i still holds the original index of each d element.
zip this with w.
Sort the triples (well, pairs with a pair as one element) with i as the sort key.
Extract the w values in their new order; this is your t array.
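A minimal pure-Python sketch of those steps (using the same w and d as in the question; the exact wiring of the nested sorts is my own assumption) could look like this:
import numpy as np
w = np.arange(0.0, 1.0, 0.1)
d = np.array([3.1, 0.2, 5.3, 2.2, 4.9, 6.1, 7.7, 8.1, 1.3, 9.4])
i = np.arange(0, 10)                  # original positions of the d elements
pairs = sorted(zip(d, i))             # sort by d; each pair remembers its original index
triples = zip(pairs, w)               # attach the w values in sorted-d order
t = np.array([wv for (_, orig), wv in sorted(triples, key=lambda p: p[0][1])])
print(t)  # [0.3 0.  0.5 0.2 0.4 0.6 0.7 0.8 0.1 0.9]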

The answers for this question are fantastic, but I feel it is prudent to point out you are not doing what you think you are doing.
What you want to do (or at least what I gather): you want t to contain the values of w rearranged into the sorted order of d.
What you are doing: filling t in the sorted order of d with the elements of w. You are only changing the order in which t gets filled; you are not reflecting the sort of d onto w in t.
Consider a small variation in your for loop
for i in range(0, 10):
    t[i] = w[ind[i]]
This outputs a t
('t', array([0.1, 0.8, 0.3, 0. , 0.4, 0.2, 0.5, 0.6, 0.7, 0.9]))
You can just adapt PaulPanzer's answer to this as well.
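One way to adapt it to this variation (a sketch, with the same w and d as above) is a single fancy-indexing expression:
t = w[d.argsort()]  # pick the w values in the order that sorts d
print('t', t)  # t [0.1 0.8 0.3 0.  0.4 0.2 0.5 0.6 0.7 0.9]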

Related

What's the best way for looping through pandas df and comparing 2 different dataframes then performing division on values returned?

I'm currently writing Python code that compares offensive and defensive stats in basketball and I want to be able to create weights with the given stats. I have my stats saved in a dataframe according to: team, position, and other numerical stats. I want to be able to loop through each team and their respective positions and corresponding stats. e.g.:
['DAL', 'C', 0.0, 3.0, 0.5, 0.4, 0.5, 0.7, 6.4] vs ['BOS', 'C', 1.7, 6.0, 2.1, 0.1, 0.7, 1.9, 9.0]
So I would like to compare BOS vs DAL at the C position and compare points, rebounds, assists etc. If one is greater than the other then divide the greater by the lesser.
The best thing I have so far is to convert the dataframes to numpy and then loop through those and append into a blank list:
df1 = df1.to_numpy()
df2 = df2.to_numpy()
df1_array = []
df2_array = []
for x in range(len(df1)):
    for a, h in zip(away, home):
        if df1[x][0] == a or df1[x][0] == h:
            df1_array.append(df1[x])
After I get the new arrays I would then loop through them again to compare values; however, I feel like this is too rudimentary. What could be a more efficient or smarter way of executing this?
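One possible alternative, before converting to NumPy (a sketch; it assumes the team abbreviation is in a column hypothetically named 'team', so adjust to your real column name), is to filter the dataframes with a boolean mask instead of nested loops:
teams_of_interest = set(away) | set(home)                # all teams playing today
df1_filtered = df1[df1['team'].isin(teams_of_interest)]
df2_filtered = df2[df2['team'].isin(teams_of_interest)]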
Use numpy.where to compare rows and return the truth value of ('team1' > 'team2') element-wise:
import pandas as pd
import numpy as np
# Creating the dataframe
team1 = ['DAL', 'C', 0.1, 3.0, 0.5, 0.4, 0.5, 0.7, 6.4]
team2 = ['BOS', 'C', 1.7, 6.0, 2.1, 0.1, 0.7, 1.9, 9.0]
df = pd.DataFrame({'team1': team1,
                   'team2': team2})
# Select the rows that contain numbers
df2 = df.iloc[2:].copy()
# Make the comparison: if team1 is larger than team2 then team1/team2, and vice versa.
df2['result'] = np.where(df2['team1'] > df2['team2'],
                         df2['team1'] / df2['team2'],
                         df2['team2'] / df2['team1'])
df['result'] = df2['result'].fillna(0)
This yields
team1 team2 result
0 DAL BOS NaN
1 C C NaN
2 0.1 1.7 17.0
3 3.0 6.0 2.0
4 0.5 2.1 4.2
5 0.4 0.1 4.0
6 0.5 0.7 1.4
7 0.7 1.9 2.714286
8 6.4 9.0 1.40625
Be careful with the 0 in the first column of values in your problem description, though; I changed it to 0.1 because otherwise it will give a zero-division error.

python check if float value is missing

I am trying to find "missing" values in a python array of floats.
For example, given [1.1, 1.3, 2.1, 2.2, 2.3] I would like to print "1.2".
I don't have much experience with floats. I have tried something like this How to find a missing number from a list? but it doesn't work on floats.
Thanks!
To solve this, the problem needs to be simplified first. I am assuming that all the values are floats with one decimal place, that there can be multiple ranges like 1.1-1.3 and 2.1-2.3, and that the numbers are in sorted order. Here is a solution (written in Python 3, by the way):
vals = [1.1, 1.3, 2.1, 2.2, 2.3] # This will be the values in which to find the missing number
# The logic starts from here
for i in range(len(vals) - 1):
    if vals[i + 1] * 10 - vals[i] * 10 == 2:
        print((vals[i] * 10 + 1) / 10)
print("\nfinished")
You might want to use numpy.arange (https://numpy.org/doc/stable/reference/generated/numpy.arange.html)
to create a list of floats (if you know the start, end, and step values).
Then you can create two sets and use their difference to find the missing values.
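A minimal sketch of that idea, assuming the full range is known to be 1.1 to 2.3 in steps of 0.1:
import numpy as np
vals = [1.1, 1.3, 2.1, 2.2, 2.3]
full = np.round(np.arange(1.1, 2.35, 0.1), 1).tolist()      # assumed known start/end/step
missing = sorted(set(full) - set(np.round(vals, 1).tolist()))
print(missing)  # every value of the assumed range that is absent from vals
Note that this reports every absent value in the range (1.4, 1.5, ... as well as 1.2), not only gaps inside consecutive runs.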
Simplest yet dumb way:
Split float to integer and decimal parts.
Create cartesian product of both to generate Full array.
Use set and XOR to find out missing ones.
from itertools import product
source = [1.1, 1.3, 2.1, 2.2, 2.3]
separated = [str(n).split(".") for n in source]
integers, decimals = map(set, zip(*separated))
products = [float(f"{i}.{d}") for i, d in product(integers, decimals)]
print(*(set(products) ^ set(source)))
output:
1.2
I guess that the solutions to the question you quote probably work in your case; you just need to adapt the built-in range function to numpy.arange, which allows you to create a range of floats.
It looks something like this (just a simple example):
import numpy as np
np_range = np.arange(1, 2, 0.1)
float_list = [1.2, 1.3, 1.4, 1.6]
for i in np_range:
    if round(i, 1) not in float_list:
        print(round(i, 1))
output:
1.0
1.1
1.5
1.7
1.8
1.9
This is an absolutely AWFUL way to do this, but depending on how many numbers you have in the list and how difficult the other solutions are you might appreciate it.
If you write
firstvalue = 1.1
secondvalue = 1.2
thirdvalue = 1.3
# assign these for every value you are keeping track of

if firstvalue in val:  # (or whatever you named your list)
    print("1.1 is in the list")
else:
    print("1.1 is missing!")

if secondvalue in val:
    print("1.2 is in the list")
else:
    print("1.2 is missing!")

# etc. for every value in the list. It's tedious and dumb, but if you have few
# enough values in your list it might be your simplest option.
With numpy
import numpy as np
arr = [1.1, 1.3, 2.1, 2.2, 2.3]
find_gaps = np.array(arr).round(1)
find_gaps[np.r_[np.diff(find_gaps).round(1), False] == 0.2] + 0.1
Output
array([1.2])
Test with random data
import numpy as np
np.random.seed(10)
arr = np.arange(0.1, 10.4, 0.1)
mask = np.random.randint(0, 2, len(arr)).astype(bool)
gaps = arr[mask]
print(gaps)
find_gaps = np.array(gaps).round(1)
print('missing values:')
print(find_gaps[np.r_[np.diff(find_gaps).round(1), False] == 0.2] + 0.1)
Output
[ 0.1 0.2 0.4 0.6 0.7 0.9 1. 1.2 1.3 1.6 2.2 2.5 2.6 2.9
3.2 3.6 3.7 3.9 4. 4.1 4.2 4.3 4.5 5. 5.2 5.3 5.4 5.6
5.8 5.9 6.1 6.4 6.8 6.9 7.3 7.5 7.6 7.8 7.9 8.1 8.7 8.9
9.7 9.8 10. 10.1]
missing values:
[0.3 0.5 0.8 1.1 3.8 4.4 5.1 5.5 5.7 6. 7.4 7.7 8. 8.8 9.9]
More general solution
Find all missing values with a specific gap size
import numpy as np
def find_missing(find_gaps, gaps=1):
    find_gaps = np.array(find_gaps)
    gaps_diff = np.r_[np.diff(find_gaps).round(1), False]
    gaps_index = find_gaps[(gaps_diff >= 0.2) & (gaps_diff <= round(0.1 * (gaps + 1), 1))]
    gaps_values = np.searchsorted(find_gaps, gaps_index)
    ranges = np.vstack([(find_gaps[gaps_values] + 0.1).round(1), find_gaps[gaps_values + 1]]).T
    return np.concatenate([np.arange(start, end, 0.1001) for start, end in ranges]).round(1)
vals = [0.1,0.3, 0.6, 0.7, 1.1, 1.5, 1.8, 2.1]
print('Vals:', vals)
print('gap=1', find_missing(vals, gaps = 1))
print('gap=2', find_missing(vals, gaps = 2))
print('gap=3', find_missing(vals, gaps = 3))
Output
Vals: [0.1, 0.3, 0.6, 0.7, 1.1, 1.5, 1.8, 2.1]
gap=1 [0.2]
gap=2 [0.2 0.4 0.5 1.6 1.7 1.9 2. ]
gap=3 [0.2 0.4 0.5 0.8 0.9 1. 1.2 1.3 1.4 1.6 1.7 1.9 2. ]

Averaging values with irregular time intervals

I have several pairs of arrays of measurements and the times at which the measurements were taken that I want to average. Unfortunately, the times at which these measurements were taken aren't regular or the same for each pair.
My idea for averaging them is to create a new array with the value at each second then average these. It works but it seems a bit clumsy and means I have to create many unnecessarily long arrays.
Example Inputs
m1 = [0.4, 0.6, 0.2]
t1 = [0.0, 2.4, 5.2]
m2 = [1.0, 1.4, 1.0]
t2 = [0.0, 3.6, 4.8]
Generated Regular Arrays for values at each second
r1 = [0.4, 0.4, 0.4, 0.6, 0.6, 0.6, 0.2]
r2 = [1.0, 1.0, 1.0, 1.0, 1.4, 1.0]
Average values up to length of shortest array
a = [0.7, 0.7, 0.7, 0.8, 1.0, 0.8]
My attempt, given a list of measurement arrays measurements and a corresponding list of time arrays times:
def granulate(values, times):
    count = 0
    regular_values = []
    for index, x in enumerate(times):
        while count <= x:
            regular_values.append(values[index])
            count += 1
    return np.array(regular_values)

processed_measurements = [granulate(m, t) for m, t in zip(measurements, times)]
min_length = min(len(m) for m in processed_measurements)
processed_measurements = [m[:min_length] for m in processed_measurements]
average_measurement = np.mean(processed_measurements, axis=0)
Is there a better way to do it, ideally using numpy functions?
This will average to the closest second (assuming m1, t1, m2, t2 are numpy arrays):
time_series = np.arange(np.stack((t1, t2)).max())
np.mean([m1[abs(t1-time_series[:,None]).argmin(axis=1)], m2[abs(t2-time_series[:,None]).argmin(axis=1)]], axis=0)
If you want to floor times to each second (with possibility of generalizing to more arrays):
m = [m1, m2]
t = [t1, t2]
m_t=[]
time_series = np.arange(np.stack(t).max())
for i in range(len(t)):
    time_diff = time_series - t[i][:, None]
    m_t.append(m[i][np.where(time_diff > 0, time_diff, np.inf).argmin(axis=0)])
average = np.mean(m_t, axis=0)
output:
[0.7 0.7 0.7 0.8 1. 0.8]
You can do it in a slightly more numpy-ish way:
import numpy as np
# oddly enough, numpy doesn't have its own ffill function:
def np_ffill(arr):
    mask = np.arange(len(arr))
    mask[np.isnan(arr)] = 0
    np.maximum.accumulate(mask, axis=0, out=mask)
    return arr[mask]
t1=np.ceil(t1).astype("int")
t2=np.ceil(t2).astype("int")
r1=np.empty(max(t1)+1)
r2=np.empty(max(t2)+1)
r1[:]=np.nan
r2[:]=np.nan
r1[t1]=m1
r2[t2]=m2
r1=np_ffill(r1)
r2=np_ffill(r2)
>>> print(r1,r2)
[0.4 0.4 0.4 0.6 0.6 0.6 0.2] [1. 1. 1. 1. 1.4 1. ]
#in order to get avg:
r3=np.vstack([r1[:len(r2)],r2[:len(r1)]]).mean(axis=0)
>>> print(r3)
[0.7 0.7 0.7 0.8 1. 0.8]
I see two possible solutions:
Create a 'bucket' for each time step, let's say 1 second, and insert all measurements that were taken at that time step +/- 1 second into the bucket. Average all values in each bucket.
Interpolate every measurement row so that they have equal time steps, then average all measurements for every time step (see the sketch below).
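A rough sketch of the second (interpolation) option with numpy, using the example data; note that linear interpolation gives slightly different numbers than the step-wise fill in the question:
import numpy as np
m1, t1 = np.array([0.4, 0.6, 0.2]), np.array([0.0, 2.4, 5.2])
m2, t2 = np.array([1.0, 1.4, 1.0]), np.array([0.0, 3.6, 4.8])
grid = np.arange(0.0, min(t1.max(), t2.max()))  # 1 s grid covered by both series
r1 = np.interp(grid, t1, m1)                    # linear interpolation onto the grid
r2 = np.interp(grid, t2, m2)
average = (r1 + r2) / 2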

Finding counts of relative and absolute fluctuations in dataframe where each row contains a timeseries

I have a dataframe containing a table of financial timeseries, with each row having the columns:
ID of that timeseries
a Target value (against which we want to measure deviations, both relative and absolute)
and a timeseries of values for various dates: 1/01, 1/02, 1/03, ...
We want to calculate the fluctuation counts, both relative and absolute, for every row/ID's timeseries. Then we want to find which row/ID has the most fluctuations/'spikes', as follows:
First, we find the difference between two timeseries values and estimate a threshold. The threshold represents how much difference is allowed between two values before we declare it a 'fluctuation' or 'spike'. If the difference between any two columns' values is higher than the threshold you set, then it's a spike.
However, we need to ensure that the threshold is generic and works with both % and absolute values between any two values in any row.
So basically, we choose a threshold in percentage form (an educated guess), since one row's values are represented in '%' form; a '%' threshold will also work properly with the absolute values.
The output should be a new column fluctuation counts (FCount), both relative and absolute, for every row/ID.
Code:
import pandas as pd
# Create sample dataframe
raw_data = {'ID': ['A1', 'B1', 'C1', 'D1'],
'Domain': ['Finance', 'IT', 'IT', 'Finance'],
'Target': [1, 2, 3, '0.9%'],
'Criteria':['<=', '<=', '>=', '>='],
"1/01":[0.9, 1.1, 2.1, 1],
"1/02":[0.4, 0.3, 0.5, 0.9],
"1/03":[1, 1, 4, 1.1],
"1/04":[0.7, 0.7, 0.1, 0.7],
"1/05":[0.7, 0.7, 0.1, 1],
"1/06":[0.9, 1.1, 2.1, 0.6],}
df = pd.DataFrame(raw_data, columns = ['ID', 'Domain', 'Target','Criteria', '1/01',
'1/02','1/03', '1/04','1/05', '1/06'])
ID Domain Target Criteria 1/01 1/02 1/03 1/04 1/05 1/06
0 A1 Finance 1 <= 0.9 0.4 1.0 0.7 0.7 0.9
1 B1 IT 2 <= 1.1 0.3 1.0 0.7 0.7 1.1
2 C1 IT 3 >= 2.1 0.5 4.0 0.1 0.1 2.1
3 D1 Finance 0.9% >= 1.0 0.9 1.1 0.7 1.0 0.6
And here's the expected output, with a fluctuation count (FCount) column. Then we can get whichever ID has the largest FCount.
ID Domain Target Criteria 1/01 1/02 1/03 1/04 1/05 1/06 FCount
0 A1 Finance 1 <= 0.9 0.4 1.0 0.7 0.7 0.9 -
1 B1 IT 2 <= 1.1 0.3 1.0 0.7 0.7 1.1 -
2 C1 IT 3 >= 2.1 0.5 4.0 0.1 0.1 2.1 -
3 D1 Finance 0.9% >= 1.0 0.9 1.1 0.7 1.0 0.6 -
Given,
# importing pandas as pd
import pandas as pd
import numpy as np
# Create sample dataframe
raw_data = {'ID': ['A1', 'B1', 'C1', 'D1'],
'Domain': ['Finance', 'IT', 'IT', 'Finance'],
'Target': [1, 2, 3, '0.9%'],
'Criteria':['<=', '<=', '>=', '>='],
"1/01":[0.9, 1.1, 2.1, 1],
"1/02":[0.4, 0.3, 0.5, 0.9],
"1/03":[1, 1, 4, 1.1],
"1/04":[0.7, 0.7, 0.1, 0.7],
"1/05":[0.7, 0.7, 0.1, 1],
"1/06":[0.9, 1.1, 2.1, 0.6],}
df = pd.DataFrame(raw_data, columns = ['ID', 'Domain', 'Target','Criteria', '1/01',
'1/02','1/03', '1/04','1/05', '1/06'])
It is easier to tackle this problem by breaking it into two parts (absolute thresholds and relative thresholds) and going through it step by step on the underlying numpy arrays.
EDIT: Long explanation ahead, skip to the end for just the final function
First, create a list of date columns to access only the relevant columns in every row.
date_columns = ['1/01', '1/02','1/03', '1/04','1/05', '1/06']
df[date_columns].values
#Output:
array([[0.9, 0.4, 1. , 0.7, 0.7, 0.9],
[1.1, 0.3, 1. , 0.7, 0.7, 1.1],
[2.1, 0.5, 4. , 0.1, 0.1, 2.1],
[1. , 0.9, 1.1, 0.7, 1. , 0.6]])
Then we can use np.diff to easily get differences between the dates on the underlying array. We will also take the absolute value, because that is what we are interested in.
np.abs(np.diff(df[date_columns].values))
#Output:
array([[0.5, 0.6, 0.3, 0. , 0.2],
[0.8, 0.7, 0.3, 0. , 0.4],
[1.6, 3.5, 3.9, 0. , 2. ],
[0.1, 0.2, 0.4, 0.3, 0.4]])
Now, just worrying about the absolute thresholds, it is as simple as just checking if the values in the differences are greater than a limit.
abs_threshold = 0.5
np.abs(np.diff(df[date_columns].values)) > abs_threshold
#Output:
array([[False, True, False, False, False],
[ True, True, False, False, False],
[ True, True, True, False, True],
[False, False, False, False, False]])
We can see that the sum over this array for every row will give us the result we need (summing a boolean array uses the underlying True=1 and False=0, so you are effectively counting how many True values are present). For percentage thresholds, we just need one additional step: dividing all differences by the original values before the comparison. Putting it all together.
To elaborate:
We can see how the sum along each row gives us the counts of values crossing the absolute threshold as follows.
abs_fluctuations = np.abs(np.diff(df[date_columns].values)) > abs_threshold
print(abs_fluctuations.sum(-1))
#Output:
[1 2 4 0]
To start with relative thresholds, we can create the differences array same as before.
dates = df[date_columns].values #same as before, but just assigned
differences = np.abs(np.diff(dates)) #same as before, just assigned
pct_threshold=0.5 #aka 50%
print(differences.shape) #(4, 5) aka 4 rows, 5 columns if you want to think traditional tabular 2D shapes only
print(dates.shape) #(4, 6) 4 rows, 6 columns
Now, note that the differences array has one column fewer, which makes sense: for 6 dates there are 5 differences, one for each gap.
Now, just focusing on 1 row, we see that calculating percent changes is simple.
print(dates[0][:2]) #for first row[0], take the first two dates[:2]
#Output:
array([0.9, 0.4])
print(differences[0][0]) #for first row[0], take the first difference[0]
#Output:
0.5
A change from 0.9 to 0.4 is a change of 0.5 in absolute terms, but in percentage terms it is a change of 0.5/0.9 (difference/original), i.e. about 0.5555, or 55.555% if multiplied by 100 (the multiplication by 100 is omitted here to keep things simple).
The main thing to realise at this step is that we need to do this division against the "original" values for all differences to get percent changes.
However, the dates array has one "column" too many, so we take a simple slice.
dates[:,:-1] #For all rows(:,), take all columns except the last one(:-1).
#Output:
array([[0.9, 0.4, 1. , 0.7, 0.7],
[1.1, 0.3, 1. , 0.7, 0.7],
[2.1, 0.5, 4. , 0.1, 0.1],
[1. , 0.9, 1.1, 0.7, 1. ]])
Now I can just calculate relative (percentage) changes by element-wise division:
relative_differences = differences / dates[:,:-1]
And then, same thing as before: pick a threshold and see if it's crossed.
rel_fluctuations = relative_differences > pct_threshold
#Output:
array([[ True, True, False, False, False],
[ True, True, False, False, True],
[ True, True, True, False, True],
[False, False, False, False, False]])
Now, if we want to consider whether either one of absolute or relative threshold is crossed, we just need to take a bitwise OR | (it's even there in the sentence!) and then take the sum along rows.
Putting all this together, we can just create a function that is ready to use. Note that functions are nothing special, just a way of grouping lines of code together for ease of use. Using a function is as simple as calling it; you have been using functions/methods all along without realising it.
date_columns = ['1/01', '1/02','1/03', '1/04','1/05', '1/06'] #if hardcoded.
date_columns = df.columns[4:] #if you wish to assign dynamically, and all dates start from the 5th column.
def get_FCount(df, date_columns, abs_threshold=0.5, pct_threshold=0.5):
    '''Expects a list of date columns with at least two values.
    Returns a 1D array with FCounts for every row.
    pct_threshold: percentage, where 1 means 100%
    '''
    dates = df[date_columns].values
    differences = np.abs(np.diff(dates))
    abs_fluctuations = differences > abs_threshold
    rel_fluctuations = differences / dates[:, :-1] > pct_threshold
    return (abs_fluctuations | rel_fluctuations).sum(-1)  # we took a bitwise OR, since we are concerned with values that cross even one of the thresholds
df['FCount'] = get_FCount(df, date_columns) #call our function, and assign the result array to a new column
print(df['FCount'])
#Output:
0 2
1 3
2 4
3 0
Name: FCount, dtype: int32
Assuming you want pct_change() across all columns in a row with a threshold, you can also try pct_change() on axis=1:
thresh_=0.5
s=pd.to_datetime(df.columns,format='%d/%m',errors='coerce').notna() #all date cols
df=df.assign(Count=df.loc[:,s].pct_change(axis=1).abs().gt(0.5).sum(axis=1))
Or:
df.assign(Count=df.iloc[:,4:].pct_change(axis=1).abs().gt(0.5).sum(axis=1))
ID Domain Target Criteria 1/01 1/02 1/03 1/04 1/05 1/06 Count
0 A1 Finance 1.0 <= 0.9 0.4 1.0 0.7 0.7 0.9 2
1 B1 IT 2.0 <= 1.1 0.3 1.0 0.7 0.7 1.1 3
2 C1 IT 3.0 >= 2.1 0.5 4.0 0.1 0.1 2.1 4
3 D1 Finance 0.9 >= 1.0 0.9 1.1 0.7 1.0 0.6 0
Try a loc and an iloc and a sub and an abs and a sum and an idxmin:
print(df.loc[df.iloc[:, 4:].sub(df['Target'].tolist(), axis='rows').abs().sum(1).idxmin(), 'ID'])
Output:
D1
Explanation:
I first get the columns starting from the 4th one, then simply subtract the corresponding Target value from each row.
Then take the absolute value, so -1.1 becomes 1.1 and 1.1 stays 1.1; then sum each row and take the row with the lowest total.
Then use a loc to get that index in the actual dataframe, and get the ID column of it which gives you D1.
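Unpacked into named steps (a sketch of the same chain; it assumes Target has already been converted to plain numbers, since the sample data contains the string '0.9%'):
values = df.iloc[:, 4:]                                              # just the date columns
deviations = values.sub(df['Target'].tolist(), axis='rows').abs()    # |value - Target| per cell
totals = deviations.sum(1)                                           # total deviation per row
print(df.loc[totals.idxmin(), 'ID'])                                 # smallest total -> D1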
The following is much cleaner pandas idiom and improves on @ParitoshSingh's version. It's much cleaner to keep two separate dataframes:
a ts (metadata) dataframe for the timeseries columns 'ID', 'Domain', 'Target', 'Criteria'
a values dataframe for the timeseries values (or 'dates' as the OP keeps calling them)
Use ID as the common index for both dataframes; you then get seamless merge/join, including on any results such as what compute_FCount_df() returns.
There is also no need to pass around ugly lists of column names or indices (into compute_FCount_df()). This is way better deduplication, as mentioned in the comments. The code for this is at the bottom.
Doing this reduces compute_FCount_df to a four-liner (I also improved @ParitoshSingh's version to use the pandas builtins df.diff(axis=1) and .abs(); note that the resulting series is returned with the correct ID index, not 0:3, and hence can be used directly in assignment/insertion/merge/join):
def compute_FCount_df(dat, abs_threshold=0.5, pct_threshold=0.5):
    """Compute FluctuationCount for all timeseries/rows"""
    differences = dat.diff(axis=1).iloc[:, 1:].abs()
    abs_fluctuations = differences > abs_threshold
    rel_fluctuations = differences / dat.iloc[:, :-1] > pct_threshold
    return (abs_fluctuations | rel_fluctuations).sum(1)
where the boilerplate to set up the two separate dataframes is at the bottom.
Also note it's cleaner not to put the fcounts series/column in either values (where it definitely doesn't belong) or ts (where it would be kind of kludgy). Instead of assigning something like ts['FCount'], keep it as a standalone series:
fcounts = compute_FCount_df(values)
>>> fcounts
A1 2
B1 2
C1 4
D1 1
and this allows you to directly get the index (ID) of the timeseries with most 'fluctuations':
>>> fcounts.idxmax()
'C1'
But really, since conceptually we're applying the function separately, row-wise, to each row of timeseries values, we should use values.apply(..., axis=1) with a per-row helper:
def compute_FCount_ts(dat, abs_threshold=0.5, pct_threshold=0.5):
    """Compute FluctuationCount for a single timeseries (row)"""
    differences = dat.diff().iloc[1:].abs()
    abs_fluctuations = differences > abs_threshold
    # dat arrives as a Series here, so index it 1-D and compare raw values
    # to avoid label alignment between the shifted and unshifted series
    rel_fluctuations = differences.values / dat.iloc[:-1].values > pct_threshold
    return (abs_fluctuations.values | rel_fluctuations).sum()

values.apply(compute_FCount_ts, axis=1)
(Note: inside apply(axis=1) each row arrives as a Series, so 2-D indexing like dat.iloc[:, :-1] raises pandas' "Too many indexers" error; hence the 1-D indexing above.)
Last, here's the boilerplate code to set up two separate dataframes, with shared index ID:
import pandas as pd
import numpy as np
ts = pd.DataFrame(index=['A1', 'B1', 'C1', 'D1'], data={
'Domain': ['Finance', 'IT', 'IT', 'Finance'],
'Target': [1, 2, 3, '0.9%'],
'Criteria':['<=', '<=', '>=', '>=']})
values = pd.DataFrame(index=['A1', 'B1', 'C1', 'D1'], data={
"1/01":[0.9, 1.1, 2.1, 1],
"1/02":[0.4, 0.3, 0.5, 0.9],
"1/03":[1, 1, 4, 1.1],
"1/04":[0.7, 0.7, 0.1, 0.7],
"1/05":[0.7, 0.7, 0.1, 1],
"1/06":[0.9, 1.1, 2.1, 0.6]})

Last decimal digit precision changes in different call of same generator function [python] [duplicate]

This question already has answers here:
Strange behaviour with floats and string conversion
(3 answers)
What is the difference between __str__ and __repr__?
(28 answers)
Why does str(float) return more digits in Python 3 than Python 2?
(1 answer)
Closed 8 years ago.
I created this generator function:
def myRange(start, stop, step):
    r = start
    while r < stop:
        yield r
        r += step
and I use it in two different ways. 1st:
for x in myRange(0, 1, 0.1):
    print x
Result:
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
2nd way to call the function:
a = [x for x in myRange(0,1,0.1)]
which results in:
[0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7, 0.7999999999999999, 0.8999999999999999, 0.9999999999999999]
Why are the values generated different?
It is not the order in which you called your generator, but the way you are presenting the numbers, that caused this change in output.
You are printing a list object the second time, and that's a container. Container contents are printed using repr(), while before you used print on the float directly, which uses str().
The repr() and str() output of floating point numbers simply differs:
>>> lst = [0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7, 0.7999999999999999, 0.8999999999999999, 0.9999999999999999]
>>> print lst
[0, 0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7, 0.7999999999999999, 0.8999999999999999, 0.9999999999999999]
>>> for elem in lst:
...     print elem
...
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
>>> str(lst[3])
'0.3'
>>> repr(lst[3])
'0.30000000000000004'
repr() on a float produces a result that'll let you reproduce the same value accurately. str() rounds the floating point number for presentation.
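If you just want both call styles to display the same thing, one option (a sketch, not the only way) is to format the floats explicitly before putting them in a container:
>>> print ['%.1f' % x for x in myRange(0, 1, 0.1)]
['0.0', '0.1', '0.2', '0.3', '0.4', '0.5', '0.6', '0.7', '0.8', '0.9', '1.0']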
