I am trying to find "missing" values in a Python array of floats.
For example, given [1.1, 1.3, 2.1, 2.2, 2.3] I would like to print "1.2".
I don't have much experience with floats. I have tried something like the answers to "How to find a missing number from a list?", but it doesn't work on floats.
Thanks!
To solve this, the problem needs to be simplified first. I am assuming that all the values are floats with one decimal place, that there can be multiple ranges such as 1.1-1.3 and 2.1-2.3, and that the numbers are in sorted order. Here is a solution, written in Python 3.
vals = [1.1, 1.3, 2.1, 2.2, 2.3] # This will be the values in which to find the missing number
# The logic starts from here
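for i in range(len(vals) - 1):
    # Compare in whole tenths; round() guards against float error (e.g. 0.7 * 10 == 7.000000000000001)
    if round(vals[i + 1] * 10) - round(vals[i] * 10) == 2:
        print((round(vals[i] * 10) + 1) / 10)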
print("\nfinished")
You might want to use https://numpy.org/doc/stable/reference/generated/numpy.arange.html
and create a list of floats (if you know start, end, step values)
Then you can create two sets and use difference to find missing values
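Here is a rough sketch of that idea, assuming you know the start, end and step and that every value has one decimal place. The function name missing_values and its parameters are made up for this example. Both sets are built from whole tenths (integers) rather than raw floats, because np.arange with a float step can produce values like 1.2000000000000002 that would never match 1.2 in a set.

```python
import numpy as np

def missing_values(vals, start, stop, step=0.1, decimals=1):
    # Hypothetical helper: compare integer multiples of 10**-decimals, not raw floats.
    scale = 10 ** decimals
    n = int(round((stop - start) / step)) + 1  # number of grid points, stop included
    expected = {int(round(x * scale)) for x in start + step * np.arange(n)}
    present = {int(round(v * scale)) for v in vals}
    return [k / scale for k in sorted(expected - present)]

vals = [1.1, 1.3, 2.1, 2.2, 2.3]
print(missing_values(vals, 1.1, 1.3))  # only the first range
print(missing_values(vals, 2.1, 2.3))  # nothing missing in the second range
```

Note that if you pass the whole span (1.1 to 2.3) you also get 1.4 through 2.0 reported as missing, so with several separate ranges you have to call it once per range.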
Simplest yet dumb way:
Split float to integer and decimal parts.
Create cartesian product of both to generate Full array.
Use set and XOR to find out missing ones.
from itertools import product
source = [1.1, 1.3, 2.1, 2.2, 2.3]
separated = [str(n).split(".") for n in source]
integers, decimals = map(set, zip(*separated))
products = [float(f"{i}.{d}") for i, d in product(integers, decimals)]
print(*(set(products) ^ set(source)))
output:
1.2
I guess the solutions to the question you linked probably work in your case; you just need to replace the built-in range function with numpy.arange, which lets you create a range of floats.
It looks something like this (just a simple example):
import numpy as np
np_range = np.arange(1, 2, 0.1)
float_list = [1.2, 1.3, 1.4, 1.6]
for i in np_range:
    if round(i, 1) not in float_list:
        print(round(i, 1))
output:
1.0
1.1
1.5
1.7
1.8
1.9
This is an absolutely AWFUL way to do this, but depending on how many numbers you have in the list and how difficult the other solutions are you might appreciate it.
If you write
firstvalue = 1.1
secondvalue = 1.2
thirdvalue = 1.3
#assign these for every value you are keeping track of
if firstvalue in val:  # (or whatever you named your list)
    print("1.1 is in the list")
else:
    print("1.1 is missing!")
if secondvalue in val:
    print("1.2 is in the list")
else:
    print("1.2 is missing!")
# etc etc etc for every value in the list. It's tedious and dumb but if you have few enough values in your list it might be your simplest option
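If the copy-pasting gets too tedious, the same membership check can be written as a loop. This is only a sketch: the list name expected and its contents are made up here, and you still have to type in every value you are keeping track of.

```python
val = [1.1, 1.3, 2.1, 2.2, 2.3]   # the list being checked
expected = [1.1, 1.2, 1.3]        # hypothetical: every value you are keeping track of

missing = []
for x in expected:
    if x in val:
        print(f"{x} is in the list")
    else:
        print(f"{x} is missing!")
        missing.append(x)
```

This works as long as the values were typed as literals on both sides; values that come out of arithmetic (like 1.1 + 0.1) may not compare equal to 1.2.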
With numpy
import numpy as np
arr = [1.1, 1.3, 2.1, 2.2, 2.3]
find_gaps = np.array(arr).round(1)
find_gaps[np.r_[np.diff(find_gaps).round(1), False] == 0.2] + 0.1
Output
array([1.2])
Test with random data
import numpy as np
np.random.seed(10)
arr = np.arange(0.1, 10.4, 0.1)
mask = np.random.randint(0, 2, len(arr)).astype(bool)  # np.bool was removed in NumPy 1.24; use the builtin bool
gaps = arr[mask]
print(gaps)
find_gaps = np.array(gaps).round(1)
print('missing values:')
print(find_gaps[np.r_[np.diff(find_gaps).round(1), False] == 0.2] + 0.1)
Output
[ 0.1 0.2 0.4 0.6 0.7 0.9 1. 1.2 1.3 1.6 2.2 2.5 2.6 2.9
3.2 3.6 3.7 3.9 4. 4.1 4.2 4.3 4.5 5. 5.2 5.3 5.4 5.6
5.8 5.9 6.1 6.4 6.8 6.9 7.3 7.5 7.6 7.8 7.9 8.1 8.7 8.9
9.7 9.8 10. 10.1]
missing values:
[0.3 0.5 0.8 1.1 3.8 4.4 5.1 5.5 5.7 6. 7.4 7.7 8. 8.8 9.9]
More general solution
Find all missing values up to a specific gap size
import numpy as np
def find_missing(find_gaps, gaps = 1):
    find_gaps = np.array(find_gaps)
    gaps_diff = np.r_[np.diff(find_gaps).round(1), False]
    gaps_index = find_gaps[(gaps_diff >= 0.2) & (gaps_diff <= round(0.1*(gaps + 1),1))]
    gaps_values = np.searchsorted(find_gaps, gaps_index)
    ranges = np.vstack([(find_gaps[gaps_values]+0.1).round(1),find_gaps[gaps_values+1]]).T
    return np.concatenate([np.arange(start, end, 0.1001) for start, end in ranges]).round(1)
vals = [0.1,0.3, 0.6, 0.7, 1.1, 1.5, 1.8, 2.1]
print('Vals:', vals)
print('gap=1', find_missing(vals, gaps = 1))
print('gap=2', find_missing(vals, gaps = 2))
print('gap=3', find_missing(vals, gaps = 3))
Output
Vals: [0.1, 0.3, 0.6, 0.7, 1.1, 1.5, 1.8, 2.1]
gap=1 [0.2]
gap=2 [0.2 0.4 0.5 1.6 1.7 1.9 2. ]
gap=3 [0.2 0.4 0.5 0.8 0.9 1. 1.2 1.3 1.4 1.6 1.7 1.9 2. ]
I am trying to generate 2 children from 2 parents by crossover. I want to fix a part from parent A and fill the blanks with elements from parent B.
I was able to mask both parents and get the elements on another array, but I am not able to fill in the gaps from the fixed part of Parent A with the fill elements from Parent B
Here's what I have tried so far:
import numpy as np
from numpy.random import default_rng
rng = default_rng()
numMachines = 5
numJobs = 5
population =[[[4, 0, 2, 1, 3],
[4, 2, 0, 1, 3],
[4, 2, 0, 1, 3],
[4, 0, 3, 2, 1],
[2, 3, 4, 1, 0]],
[[2, 0, 1, 3, 4],
[4, 3, 1, 2, 0],
[2, 0, 3, 4, 1],
[4, 3, 1, 0, 2],
[4, 0, 3, 1, 2]]]
parentA = np.array(population[0])
parentB = np.array(population[1])
childA = np.zeros((numJobs, numMachines))
np.copyto(childA, parentA)
childB = np.zeros((numJobs, numMachines))
np.copyto(childB, parentB)
subJobs = np.stack([rng.choice(numJobs ,size=int(np.max([2, np.floor(numJobs/2)])), replace=False) for i in range(numMachines)])
maskA = np.stack([(np.isin(childA[i], subJobs[i])) for i in range(numMachines)])
invMaskA = np.invert(maskA)
maskB = np.stack([(np.isin(childB[i], subJobs[i])) for i in range(numMachines)])
invMaskB = np.invert(maskB)
maskedChildAFixed = np.ma.masked_array(childA, maskA)
maskedChildBFixed = np.ma.masked_array(childB, maskB)
maskedChildAFill = np.ma.masked_array(childA, invMaskA)
maskedChildBFill = np.ma.masked_array(childB, invMaskB)
maskedChildAFill = np.stack([maskedChildAFill[i].compressed() for i in range(numMachines)])
maskedChildBFill = np.stack([maskedChildBFill[i].compressed() for i in range(numMachines)])
EDIT:
Sorry, I was so frustrated with this yesterday that I forgot to add some more information to make it more clear. First, I have fixed the code so it now runs by just copying and pasting (I forgot to add some import calls and some variables).
This is a fixed portion from Parent A that won't change in child A.
>>> print(maskedChildAFixed)
[[-- 0.0 2.0 -- 3.0]
[4.0 -- 0.0 1.0 --]
[4.0 -- -- 1.0 3.0]
[-- 0.0 3.0 2.0 --]
[-- -- 4.0 1.0 0.0]]
I need to fill in these blank parts with the fill part from parent B.
>>> print(maskedChildBFill)
[[1. 4.]
[3. 2.]
[2. 0.]
[4. 1.]
[3. 2.]]
For my children to be legal I can't repeat an integer in any row. If I try to use the "np.ma.filled()" function with the compressed maskedChildBFill, it gives me an error.
>>> print(np.ma.filled(maskedChildAFixed, fill_value=maskedChildBFill))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Rafael\.conda\envs\CoutoBinario\lib\site-packages\numpy\ma\core.py", line 639, in filled
return a.filled(fill_value)
File "C:\Users\Rafael\.conda\envs\CoutoBinario\lib\site-packages\numpy\ma\core.py", line 3752, in filled
np.copyto(result, fill_value, where=m)
File "<__array_function__ internals>", line 6, in copyto
ValueError: could not broadcast input array from shape (5,2) into shape (5,5)
I'll now comment out the part of the code that compresses the fill portion (lines 46 and 47). That way the blank spaces are not deleted from maskedChildBFill, so the sizes of the matrices are preserved.
>>> print(np.ma.filled(maskedChildAFixed, fill_value=maskedChildBFill))
[[2. 0. 2. 3. 3.]
[4. 3. 0. 1. 0.]
[4. 0. 3. 1. 3.]
[4. 0. 3. 2. 2.]
[4. 0. 4. 1. 0.]]
See how I get an invalid individual? Note the repeated integers in row 1. The individual should look like this:
[[1.0 0.0 2.0 4.0 3.0]
[4.0 3.0 0.0 1.0 2.0]
[4.0 2.0 0.0 1.0 3.0]
[4.0 0.0 3.0 2.0 1.0]
[3.0 2.0 4.0 1.0 0.0]]
I hope this update makes it easier to understand what I am trying to do. Thanks for all the help so far! <3
EDIT 2
I was able to work around by converting everything to list and then with for loops substitute the values in place, but this should be super slow. There might be a way to do this using numpy.
maskedChildAFill = maskedChildAFill.tolist()
maskedChildBFill = maskedChildBFill.tolist()
maskedChildAFixed = maskedChildAFixed.tolist()
maskedChildBFixed = maskedChildBFixed.tolist()
for i in range(numMachines):
    counterA = 0
    counterB = 0
    for n, j in enumerate(maskedChildAFixed[i]):
        if maskedChildAFixed[i][n] is None:
            maskedChildAFixed[i][n] = maskedChildBFill[i][counterA]
            counterA += 1
    for n, j in enumerate(maskedChildBFixed[i]):
        if maskedChildBFixed[i][n] is None:
            maskedChildBFixed[i][n] = maskedChildAFill[i][counterB]
            counterB += 1
I think you are looking for this:
parentA = np.array(population[0])
parentB = np.array(population[1])
childA = np.zeros((numJobs, numMachines))
np.copyto(childA, parentA)
childB = np.zeros((numJobs, numMachines))
np.copyto(childB, parentB)
subJobs = np.stack([rng.choice(numJobs ,size=int(np.max([2, np.floor(numJobs/2)])), replace=False) for i in range(numMachines)])
maskA = np.stack([(np.isin(childA[i], subJobs[i])) for i in range(numMachines)])
invMaskA = np.invert(maskA)
maskB = np.stack([(np.isin(childB[i], subJobs[i])) for i in range(numMachines)])
invMaskB = np.invert(maskB)
maskedChildAFixed = np.ma.masked_array(childA, maskA)
maskedChildBFixed = np.ma.masked_array(childB, maskB)
maskedChildAFill = np.ma.masked_array(childB, invMaskA)
maskedChildBFill = np.ma.masked_array(childA, invMaskB)
from operator import and_
crossA = np.ma.array(maskedChildAFixed.filled(0)+maskedChildAFill.filled(0),mask=list(map(and_,maskedChildAFixed.mask,maskedChildAFill.mask)))
crossB = np.ma.array(maskedChildBFixed.filled(0)+maskedChildBFill.filled(0),mask=list(map(and_,maskedChildBFixed.mask,maskedChildBFill.mask)))
Please note that I changed the line maskedChildAFill = np.ma.masked_array(childB, invMaskA) to fit the description of your problem. If that is not what you want, simply change it back to your original code. The last two lines should do the work for you.
output:
crossA
[[4.0 0.0 2.0 1.0 4.0]
[4.0 2.0 0.0 2.0 0.0]
[2.0 2.0 3.0 1.0 3.0]
[4.0 3.0 3.0 2.0 2.0]
[2.0 0.0 4.0 1.0 0.0]]
crossB
[[2.0 0.0 1.0 1.0 4.0]
[4.0 2.0 0.0 2.0 0.0]
[2.0 2.0 3.0 1.0 1.0]
[4.0 3.0 3.0 2.0 2.0]
[4.0 0.0 4.0 1.0 2.0]]
EDIT: Per OP's edit on question, this would work for the purpose:
maskedChildAFixed[np.where(maskA)] = maskedChildBFill.ravel()
maskedChildBFixed[np.where(maskB)] = maskedChildAFill.ravel()
Example output for maskedChildAFixed:
[[4.0 0.0 2.0 1.0 3.0]
[4.0 2.0 0.0 1.0 3.0]
[3.0 2.0 0.0 1.0 4.0]
[4.0 0.0 3.0 2.0 1.0]
[1.0 3.0 4.0 2.0 0.0]]
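For what it's worth, the same fill can probably be done without masked arrays at all, using plain boolean indexing. This is a minimal sketch under a few assumptions: only the first two rows of each parent are used, sub_jobs is hard-coded (a made-up stand-in for the rng.choice call) so the result is reproducible, and every row is a permutation so both masks select the same number of entries per row. Boolean indexing reads and writes in row-major order, which is what keeps each row's fill values in parent B's order.

```python
import numpy as np

# First two rows of each parent from the question
parent_a = np.array([[4, 0, 2, 1, 3],
                     [4, 2, 0, 1, 3]])
parent_b = np.array([[2, 0, 1, 3, 4],
                     [4, 3, 1, 2, 0]])

# Hypothetical fixed choice of jobs to swap per row (the question draws these with rng.choice)
sub_jobs = np.array([[4, 1],
                     [2, 3]])

# True where the entry belongs to the swapped jobs, i.e. the "blank" positions
mask_a = np.stack([np.isin(row, jobs) for row, jobs in zip(parent_a, sub_jobs)])
mask_b = np.stack([np.isin(row, jobs) for row, jobs in zip(parent_b, sub_jobs)])

child_a = parent_a.copy()
child_a[mask_a] = parent_b[mask_b]   # blanks of A filled with B's jobs, in B's order
child_b = parent_b.copy()
child_b[mask_b] = parent_a[mask_a]   # blanks of B filled with A's jobs, in A's order

print(child_a)
print(child_b)
```

With these inputs the first two rows of child_a come out as [1 0 2 4 3] and [4 3 0 1 2], which match the first two rows of the individual the question says it should look like.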
Here is an example:
import pandas as pd
import numpy as np
positions = np.array([[2.2,3.1],
[2.3,6.2],
[2.4,9.3]])
df = pd.DataFrame({'pos': positions})
It returns the following error
ValueError: If using all scalar values, you must pass an index
Because it is being interpreted as two columns, use tolist:
import numpy as np
import pandas as pd
positions = np.array([[2.2, 3.1],
[2.3, 6.2],
[2.4, 9.3]])
df = pd.DataFrame({'pos': positions.tolist()})
print(df)
Output
pos
0 [2.2, 3.1]
1 [2.3, 6.2]
2 [2.4, 9.3]
You can try:
pd.DataFrame(positions)
Result:
0 1
0 2.2 3.1
1 2.3 6.2
2 2.4 9.3
Hope this helps.
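The two approaches can also be combined. The sketch below assumes you want named coordinate columns plus a column holding each pair; the column names x and y are made up for the example.

```python
import numpy as np
import pandas as pd

positions = np.array([[2.2, 3.1],
                      [2.3, 6.2],
                      [2.4, 9.3]])

# One column per coordinate; 'x' and 'y' are hypothetical names
df = pd.DataFrame(positions, columns=['x', 'y'])

# Optionally keep each pair together in a single object column as well
df['pos'] = positions.tolist()
print(df)
```

Keeping separate numeric columns is usually easier to work with later, since a column of lists has object dtype and most vectorized operations won't apply to it.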
I'm trying to calculate a rolling statistic that requires all variables in a window from two input columns.
My only solution involves a for loop. Is there a more efficient way, perhaps using Pandas' rolling and apply functions?
import pandas as pd
from statsmodels.tsa.stattools import coint
def f(x):
    return coint(x['a'], x['b'])[1]
df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.rolling(2).apply(lambda x: f(x), raw=False) # KeyError: 'a'
I get KeyError: 'a' because df gets passed to f() one series (column) at a time. Specifying axis=1 sends one row and all columns to f(), but neither approach provides the required set of observations.
You could try rolling, mean and sum:
df['result'] = df.rolling(2).mean().sum(axis=1)
a b result
0 1 5 0.0
1 2 6 7.0
2 3 7 9.0
3 4 8 11.0
EDIT
Adding a different answer based upon new information in the question by OP.
Set up the function.
import pandas as pd
from statsmodels.tsa.stattools import coint
def f(x):
    return coint(x['a'], x['b'])
Create the data and dataframe:
a_data = [1,2,3,4]
b_data = [5,6,7,8]
df = pd.DataFrame(data={'a': a_data, 'b': b_data})
a b
0 1 5
1 2 6
2 3 7
3 4 8
I gather after researching coint that you are trying to pass two rolling arrays to f['a'] and f['b']. The following will create the arrays and dataframe.
n=2
arr_a = [df['a'].shift(x).values[::-1][:n] for x in range(len(df['a']))[::-1]]
arr_b = [df['b'].shift(x).values[::-1][:n] for x in range(len(df['b']))[::-1]]
df1 = pd.DataFrame(data={'a': arr_a, 'b': arr_b})
n is the size of the rolling window.
df1
a b
0 [1.0, nan] [5.0, nan]
1 [2.0, 1.0] [6.0, 5.0]
2 [3.0, 2.0] [7.0, 6.0]
3 [4, 3] [8, 7]
Then you can use apply(f) to send in the rows of arrays.
df1.iloc[(n-1):,].apply(f, axis=1)
Your output is as follows:
1 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
2 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
3 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
dtype: object
When I run this I do get an error for perfectly collinear data, but I suspect that will disappear with real data.
Also, I know a purely vectorized solution might have been faster. I wonder what the performance of this will be like, if it is what you are looking for?
Hats off to @Zero, who really had the solution for this problem here.
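Another option that might be worth trying: roll over a single column with raw=False, so each window arrives as a Series that still carries its index, and use that index to pull the matching rows of the full dataframe. In the sketch below pair_stat is a made-up stand-in for f (it just averages a - b) so that the example runs without statsmodels; the function passed to apply has to return a single number.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 4, 7], 'b': [5, 6, 9, 8]})

def pair_stat(window):
    # Hypothetical stand-in for a two-column statistic such as coint(...)[1]
    return (window['a'] - window['b']).mean()

# Each window s is a Series of column 'a'; df.loc[s.index] recovers both columns
result = df['a'].rolling(2).apply(lambda s: pair_stat(df.loc[s.index]), raw=False)
print(result)
```

This still calls a Python function once per window, so I would not expect it to be much faster than an explicit for loop; it mainly keeps the code in the rolling/apply style.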
I tried placing the sum before the rolling:
import pandas as pd
import time
df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.copy()
s = time.time()
df2.loc[:, 'mean1'] = df.sum(axis = 1).rolling(2).mean()
print(time.time() - s)
s = time.time()
df2.loc[:, 'mean2'] = df.rolling(2).mean().sum(axis=1)
print(time.time() - s)
df2
0.003737926483154297
0.005460023880004883
a b mean1 mean2
0 1 5 NaN 0.0
1 2 6 7.0 7.0
2 3 7 9.0 9.0
3 4 8 11.0 11.0
It is slightly faster than the previous answer and gives the same result (apart from the first row, which is NaN instead of 0.0). On large datasets the difference might be significant.
You can modify it to select the columns of interest only:
s = time.time()
print(df[['a', 'b']].sum(axis = 1).rolling(2).mean())
print(time.time() - s)
0 NaN
1 7.0
2 9.0
3 11.0
dtype: float64
0.0033559799194335938
I have an array that contains numbers that are distances, and another that represents certain values at that distance. How do I calculate the average of all the data at a fixed value of the distance?
e.g distances (d): [1 1 14 6 1 12 14 6 6 7 4 3 7 9 1 3 3 6 5 8]
e.g data corresponding to the entry of the distances:
therefore value=3.3 at d=1; value=2.1 at d=1; value=3.5 at d=14; etc.
[3.3 2.1 3.5 2.5 4.6 7.4 2.6 7.8 9.2 10.11 14.3 2.5 6.7 3.4 7.5 8.5 9.7 4.3 2.8 4.1]
For example, at distance d=6 I should do the mean of 2.5, 7.8, 9.2 and 4.3
I've used the following code that works, but I do not know how to store the values into a new array:
from numpy import mean
for d in set(key):
    print(d, mean([dist[i] for i in range(len(key)) if key[i] == d]))
Please help! Thanks
You've got the hard part done, just putting your results into a new list is as easy as:
result = []
for d in set(key):
    result.append(mean([dist[i] for i in range(len(key)) if key[i] == d]))
Using pandas
g = pd.DataFrame({'d':d, 'k':k}).groupby('d')
Option 1: transform to get the values in the same positions
g.transform('mean').values
Option 2: mean directly and get a dict with the mapping
g.mean().to_dict()['k']
Setup
import numpy as np
import pandas as pd
d = np.array(
    [1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3, 6, 5, 8]
)
k = np.array(
    [3.3,2.1,3.5,2.5,4.6,7.4,2.6,7.8,9.2,10.11,14.3,2.5,6.7,3.4,7.5,8.5,9.7,4.3,2.8,4.1]
)
scipy.sparse + csr_matrix
from scipy import sparse
s = d.shape[0]
r = np.arange(s+1)
m = d.max() + 1
b = np.bincount(d)
out = sparse.csr_matrix( (k, d, r), (s, m) ).sum(0).A1
(out / b)[d]
array([ 4.375, 4.375, 3.05 , 5.95 , 4.375, 7.4 , 3.05 , 5.95 ,
5.95 , 8.405, 14.3 , 6.9 , 8.405, 3.4 , 4.375, 6.9 ,
6.9 , 5.95 , 2.8 , 4.1 ])
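If the distances are small non-negative integers, as they are here, np.bincount with its weights argument should give the same group means without scipy. A minimal sketch, reusing d and k from the setup above; the names present and mean_by_distance are made up for the example.

```python
import numpy as np

d = np.array([1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3, 6, 5, 8])
k = np.array([3.3, 2.1, 3.5, 2.5, 4.6, 7.4, 2.6, 7.8, 9.2, 10.11,
              14.3, 2.5, 6.7, 3.4, 7.5, 8.5, 9.7, 4.3, 2.8, 4.1])

counts = np.bincount(d)             # how many values at each distance
sums = np.bincount(d, weights=k)    # sum of the values at each distance

present = np.flatnonzero(counts)    # distances that actually occur
means = sums[present] / counts[present]

# Store the result: distance -> mean value
mean_by_distance = dict(zip(present.tolist(), means.tolist()))
print(mean_by_distance)
```

To get the means back in the original positions, like the output above, index with the distances: (sums / np.maximum(counts, 1))[d].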
You could use array from the numpy lib in combination with where, also from the same lib.
You can define a function to get the positions of the desired distances:
from numpy import mean, array, where
def key_distances(distances, d):
    return where(distances == d)[0]
then you use it for getting the values at those positions.
Let's say you have:
d = array([1,1,14,6,1,12,14,6,6,7,4,3,7,9,1,3,3,6,5,8])
v = array([3.3,2.1,3.5,2.5,4.6,7.4,2.6,7.8,9.2,10.11,14.3,2.5,6.7,3.4,7.5,8.5,9.7,4.3,2.8,4.1])
Then you might do something like:
vs = v[key_distances(d,d[1])]
Then get your mean:
print(mean(vs))
The numpy_indexed package (disclaimer: I am its author) was designed with these use-cases in mind:
import numpy_indexed as npi
npi.group_by(d).mean(dist)
Pandas can do similar things, but its API isn't really tailored to them, and for such an elementary operation as a group-by I feel it's kind of wrong to have to hoist your data into a completely new data structure.