Repeatedly multiply 6 randomly generated numbers with data from a csv - python

I want to generate 6 random numbers (weights) that always sum to 1, multiply them with the columns of data I have imported from a CSV file, store the sum in another column (weighted average), and find the difference between the max and min of that new column (range). I want to repeat the process 1,000,000 times and get the least range plus the set of random numbers (weights) that produced it.
Here is what I have done so far:
1. Generate 6 random numbers.
2. Import the data from the CSV file.
3. Multiply the random numbers with the data from the CSV file and find the weighted average.
4. Save the weighted average in a new column F(x).
5. Find the range.
6. Repeat this 1,000,000 times and get the random numbers that give me the least range.
Here is some data from the file:
A B C D E F F(x)
0 4.9 3.9 6.3 3.4 7.3 3.4 0.0
1 4.1 3.7 7.7 2.8 5.5 3.9 0.0
2 6.0 6.0 4.0 3.1 3.7 4.3 0.0
3 5.6 6.3 6.6 4.6 8.3 4.6 0.0
Currently I am getting 0.0 for all F(x), which should not be so.
import numpy as np
import pandas as pd

arr = np.array(np.random.dirichlet(np.ones(6), size=1))
arr = pd.DataFrame(arr)
ar = arr.iloc[0]
df = pd.read_csv('weit.csv')
# ar is indexed 0-5 while df has the columns A-F, so mul() aligns on labels,
# yields all NaN, and the NaN-skipping sum() returns 0.0 for every row
df['F(x)'] = df.mul(ar).sum(1)
df
df['F(x)'].max() - df['F(x)'].min()
I am getting 0 for all my weighted averages, but I need the actual weighted averages.
I also can't loop the code to run 1,000,000 times and get me the least range.

If I understand correctly what you need:
#data from file
print (df)
A B C D E F
0 4.9 3.9 6.3 3.4 7.3 3.4
1 4.1 3.7 7.7 2.8 5.5 3.9
2 6.0 6.0 4.0 3.1 3.7 4.3
3 5.6 6.3 6.6 4.6 8.3 4.6
np.random.seed(3434)
Generate a 2d array with 6 'columns' and N 'rows', filled with random numbers (weights), like this:
N = 10
#in real data
#N = 1000000
arr = np.array(np.random.dirichlet(np.ones(6), size=N))
print (arr)
[[0.07077773 0.08042978 0.02589592 0.03457833 0.53804634 0.25027191]
[0.22174594 0.22673581 0.26136526 0.04820957 0.00976747 0.23217594]
[0.01202493 0.14247592 0.3411326 0.0239181 0.08448841 0.39596005]
[0.09354759 0.54989312 0.08893737 0.22051801 0.03850101 0.00860291]
[0.09418778 0.33345217 0.11721214 0.33480462 0.11894247 0.00140081]
[0.04285476 0.04531546 0.38105815 0.04316535 0.46902838 0.0185779 ]
[0.00441747 0.08044848 0.33383453 0.09476135 0.37568431 0.11085386]
[0.14613552 0.11260451 0.10421495 0.27880266 0.28994218 0.06830019]
[0.50747802 0.15704797 0.04410511 0.07552837 0.18744306 0.02839746]
[0.00203448 0.13225783 0.43042505 0.33410145 0.08385366 0.01732753]]
Then convert the values from the DataFrame to a 2d numpy array:
b = df.values
#pandas 0.24+
#b = df.to_numpy()
print (b)
[[4.9 3.9 6.3 3.4 7.3 3.4]
[4.1 3.7 7.7 2.8 5.5 3.9]
[6. 6. 4. 3.1 3.7 4.3]
[5.6 6.3 6.6 4.6 8.3 4.6]]
Last, multiply both arrays together into a 3d array and sum over axis 2; finally, to subtract the minimum from the maximum, use numpy.ptp:
c = np.ptp((arr * b[:, None]).sum(axis=2), axis=1)
print (c)
[2.19787892 2.08476765 1.2654273 1.45134533]
Another solution with numpy.einsum:
c = np.ptp(np.einsum('ik,jk->jik', arr, b).sum(axis=2), axis=1)
print (c)
[2.19787892 2.08476765 1.2654273 1.45134533]
A loop solution for comparison, but slow with large N:
out = []
for row in df.values:
    #print (row)
    a = np.ptp((row * arr).sum(axis=1))
    out.append(a)
print (out)
[2.197878921892329, 2.0847676512823052, 1.2654272959079576, 1.4513453259898297]
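If, as in the question, the goal is also to recover which weight set gives the least range (one range per weight set rather than one per data row), a minimal sketch of that reading, reusing df and the Dirichlet weights, could look like this:
import numpy as np
import pandas as pd

df = pd.read_csv('weit.csv')                     # columns A-F as in the question
b = df.values                                    # shape (rows, 6)

N = 1_000_000
arr = np.random.dirichlet(np.ones(6), size=N)    # N weight sets, each summing to 1

fx = arr @ b.T                                   # shape (N, rows): one F(x) column per weight set
ranges = np.ptp(fx, axis=1)                      # max - min of each F(x) column
best = ranges.argmin()

print('least range:', ranges[best])
print('weights:', arr[best])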

Related

Remove NaN from two numpy arrays with different dimensions in Python

I would like to remove NaN elements from a pair of NumPy arrays with different dimensions using Python: one array with shape (8, 3) and another with shape (8,). If at least one NaN element appears in a row, the entire row needs to be removed. However, I ran into issues because the two arrays have different dimensions.
For example,
[1.7 2.3 3.4] 4.2
[2.3 3.4 4.2] 4.6
[3.4 nan 4.6] 4.8
[4.2 4.6 4.8] 4.6
[4.6 4.8 4.6] nan
[4.8 4.6 nan] nan
[4.6 nan nan] nan
[nan nan nan] nan
I want it to become
[1.7 2.3 3.4] 4.2
[2.3 3.4 4.2] 4.6
[4.2 4.6 4.8] 4.6
This is my code which generates the sequence data:
from numpy import array

def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence) - 1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)

n_steps = 3
sequence = df_sensorRefill['sensor'].to_list()
X, y = split_sequence(sequence, n_steps)
Thanks
You could use np.isnan(), np.any() to find rows containing nan's and np.delete() to remove such rows.
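A minimal sketch of that suggestion, assuming X has shape (8, 3) and y has shape (8,) as in the question:
import numpy as np

# rows where X or y contains at least one NaN
mask = np.isnan(X).any(axis=1) | np.isnan(y)

# boolean indexing drops those rows from both arrays
X_clean, y_clean = X[~mask], y[~mask]

# equivalent using np.delete()
bad_rows = np.where(mask)[0]
X_clean2 = np.delete(X, bad_rows, axis=0)
y_clean2 = np.delete(y, bad_rows, axis=0)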

find a value based on multiple conditions within a tolerance in Pandas

I'm looking for an identification procedure using Pandas to trace the movement of some objects.
I want to assign a value (ID) to each row by comparing it with the previous frame (Time) within each group, based on multiple conditions (X, Y, Z) each being within a tolerance (X < 0.5, Y < 0.6, Z < 0.7), and write it in place. Also, if no matching value is available, give it a new value (a counter) which restarts for each Group.
Here is an example:
Before:
Group  Time  X    Y    Z
A      1.0   0.1  0.1  0.1
A      1.0   2.2  2.2  2.2
B      1.0   3.3  3.3  3.3
A      1.1   0.4  0.4  0.4
A      1.1   5.5  5.5  5.5
B      1.1   3.6  3.6  3.6
After:
Group  Time  X    Y    Z    ID
A      1.0   0.1  0.1  0.1  1
A      1.0   2.2  2.2  2.2  2
B      1.0   3.3  3.3  3.3  1
A      1.1   0.4  0.4  0.4  1
A      1.1   5.5  5.5  5.5  3
B      1.1   3.6  3.6  3.6  1
For clarification:
Row #4: the X, Y, Z change is within the tolerance, hence the same ID.
Row #5: the X, Y, Z change is NOT within the tolerance, hence a new ID in Group A.
Row #6: the X, Y, Z change is within the tolerance, hence the same ID.
I think I can trace the movement of my objects in only one direction (let's say X) using pd.merge_asof, regardless of their group and time, and find their ID. My problems are: 1. taking the group and time into account, 2. assigning a new ID, 3. different tolerances per column.
df3=pd.merge_asof(df1, df2, on="X", direction="nearest", by=["Group"], tolerance=0.5)
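For reference, a minimal sketch of what the pd.merge_asof attempt could look like when matching one frame against the previous one within each Group; prev and curr are hypothetical splits of the example above, and prev is assumed to already carry an ID column, so this only illustrates the single X tolerance (problems 1-3 above remain open):
import pandas as pd

# hypothetical frames: prev is Time 1.0 with IDs already assigned, curr is Time 1.1
prev = df[df['Time'] == 1.0].sort_values('X')    # merge_asof needs the 'on' key sorted
curr = df[df['Time'] == 1.1].sort_values('X')

matched = pd.merge_asof(curr, prev[['Group', 'X', 'ID']],
                        on='X', by='Group',
                        direction='nearest', tolerance=0.5)
# rows with no match within 0.5 get NaN in 'ID' and would need a new counter per Group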

Averaging values in array corresponding to the values of another array

I have an array that contains numbers that are distances, and another that represents certain values at that distance. How do I calculate the average of all the data at a fixed value of the distance?
e.g. distances (d): [1 1 14 6 1 12 14 6 6 7 4 3 7 9 1 3 3 6 5 8]
e.g. data corresponding to each entry of the distances
(so value=3.3 at d=1; value=2.1 at d=1; value=3.5 at d=14; etc.):
[3.3 2.1 3.5 2.5 4.6 7.4 2.6 7.8 9.2 10.11 14.3 2.5 6.7 3.4 7.5 8.5 9.7 4.3 2.8 4.1]
For example, at distance d=6 I should do the mean of 2.5, 7.8, 9.2 and 4.3
I've used the following code that works, but I do not know how to store the values into a new array:
from numpy import mean

for d in set(key):
    print d, mean([dist[i] for i in range(len(key)) if key[i] == d])
Please help! Thanks
You've got the hard part done, just putting your results into a new list is as easy as:
result = []
for d in set(key):
    result.append(mean([dist[i] for i in range(len(key)) if key[i] == d]))
Using pandas
g = pd.DataFrame({'d':d, 'k':k}).groupby('d')
Option 1: transform to get the values in the same positions
g.transform('mean').values
Option 2: mean directly and get a dict with the mapping
g.mean().to_dict()['k']
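A minimal end-to-end sketch of the pandas route, with the d and k arrays from the question spelled out:
import numpy as np
import pandas as pd

d = np.array([1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3, 6, 5, 8])
k = np.array([3.3, 2.1, 3.5, 2.5, 4.6, 7.4, 2.6, 7.8, 9.2, 10.11,
              14.3, 2.5, 6.7, 3.4, 7.5, 8.5, 9.7, 4.3, 2.8, 4.1])

g = pd.DataFrame({'d': d, 'k': k}).groupby('d')

per_distance = g.mean()['k']            # one mean per distance, e.g. 5.95 for d=6
per_element = g.transform('mean')['k']  # the same means mapped back onto every element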
Setup
d = np.array(
[1, 1, 14, 6, 1, 12, 14, 6, 6, 7, 4, 3, 7, 9, 1, 3, 3, 6, 5, 8]
)
k = np.array(
[3.3,2.1,3.5,2.5,4.6,7.4,2.6,7.8,9.2,10.11,14.3,2.5,6.7,3.4,7.5,8.5,9.7,4.3,2.8,4.1]
)
scipy.sparse + csr_matrix
from scipy import sparse

s = d.shape[0]
r = np.arange(s + 1)                 # indptr: one entry per row
m = d.max() + 1
b = np.bincount(d)                   # number of samples per distance value

# row i holds the single value k[i] in column d[i]; summing the matrix
# over rows gives the total of k per distance
out = sparse.csr_matrix((k, d, r), (s, m)).sum(0).A1
(out / b)[d]
array([ 4.375, 4.375, 3.05 , 5.95 , 4.375, 7.4 , 3.05 , 5.95 ,
5.95 , 8.405, 14.3 , 6.9 , 8.405, 3.4 , 4.375, 6.9 ,
6.9 , 5.95 , 2.8 , 4.1 ])
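For comparison, a NumPy-only sketch of the same per-distance mean using np.bincount with weights (assuming the d and k arrays from the Setup above):
sums = np.bincount(d, weights=k)       # total of k per distance value
counts = np.bincount(d)                # number of samples per distance value
with np.errstate(invalid='ignore', divide='ignore'):
    means = sums / counts              # NaN where a distance value never occurs
print(means[d])                        # per-element means, matching the output above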
You could use array from the numpy lib in combination with where, also from the same lib.
You can define a function to get the positions of the desired distances:
from numpy import mean, array, where

def key_distances(distances, d):
    return where(distances == d)[0]
then you use it for getting the values at those positions.
Let's say you have:
d = array([1,1,14,6,1,12,14,6,6,7,4,3,7,9,1,3,3,6,5,8])
v = array([3.3,2.1,3.5,2.5,4.6,7.4,2.6,7.8,9.2,10.11,14.3,2.5,6.7,3.4,7.5,8.5,9.7,4.3,2.8,4.1])
Then you might do something like:
vs = v[key_distances(d,d[1])]
Then get your mean:
print mean(vs)
The numpy_indexed package (disclaimer: I am its author) was designed with these use-cases in mind:
import numpy_indexed as npi
npi.group_by(d).mean(dist)
Pandas can do similar things; but its API isn't really tailored to them, and for such an elementary operation as a group-by I feel it's kind of wrong to have to hoist your data into a completely new data structure.

Grouping Equal Elements In An Array

I’m writing a program in Python which needs to sort through four columns of data in a text file and, for each set of identical numbers in the first column, return the four numbers of the row with the largest number in the third column.
For example:
I need:
1.0 19.3 15.5 0.1
1.0 25.0 25.0 0.1
2.0 4.8 3.1 0.1
2.0 7.1 6.4 0.1
2.0 8.6 9.7 0.1
2.0 11.0 14.2 0.1
2.0 13.5 19.0 0.1
2.0 16.0 22.1 0.1
2.0 19.3 22.7 0.1
2.0 25.0 21.7 0.1
3.0 2.5 2.7 0.1
3.0 3.5 4.8 0.1
3.0 4.8 10.0 0.1
3.0 7.1 18.4 0.1
3.0 8.6 21.4 0.1
3.0 11.0 22.4 0.1
3.0 19.3 15.9 0.1
4.0 4.8 16.5 0.1
4.0 7.1 13.9 0.1
4.0 8.6 11.3 0.1
4.0 11.0 9.3 0.1
4.0 19.3 5.3 0.1
4.0 2.5 12.8 0.1
3.0 25.0 13.2 0.1
To return:
1.0 19.3 15.5 0.1
2.0 19.3 22.7 0.1
3.0 11.0 22.4 0.1
4.0 4.8 16.5 0.1
Here, the row [1.0, 19.3, 15.5, 0.1] is returned because 15.5 is the greatest third column value that any of the rows has, out of all the rows where 1.0 is the first number. For each set of identical numbers in the first column, the function must return the rows with the greatest value in the third column.
I am struggling with actually doing this in python, because the loop iterates over EVERY row and finds a maximum, not each ‘set’ of first column numbers.
Is there something about for loops that I don’t know which could help me do this?
Below is what I have so far.
import numpy as np

C0, C1, C2, C3 = np.loadtxt("FILE.txt", dtype={'names': ('C0', 'C1', 'C2', 'C3'),
                                               'formats': ('f4', 'f4', 'f4', 'f4')},
                            unpack=True, usecols=(0, 1, 2, 3))

def FUNCTION(C_0, C_1, C_2, C_3):
    for i in range(len(C_1)):
        a = []
        a.append(C_0[i])
        for j in range(len(C_0)):
            if C_0[j] == C_0[i]:
                a.append(C_0[j])
        return a

print FUNCTION(C0, C1, C2, C3)
where C0,C1,C2, and C3 are columns in the text file, loaded as 1-D arrays.
Right now I’m just trying to isolate the indexes of the rows with equal C0 values.
An approach could be to use a dict where the value is the row, keyed by the first column item. This way you won't have to load the whole text file into memory at once; you can scan line by line and update the dict as you go.
I got a little confused by the first and second rows... I believe the 25.0 at row 2, column 3 is a mistake in your example.
My code is not a mathematical solution, but it works.
import collections

with open("INPUT.txt", "r") as datasheet:
    data = datasheet.read().splitlines()

dataset = collections.OrderedDict()
for dataitem in data:
    temp = dataitem.split(" ")
    # I just wrote this code; input and output were separated by spaces
    print(temp)
    if temp[0] in dataset.keys():
        if float(dataset[temp[0]][1]) < float(temp[2]):
            dataset[temp[0]] = [temp[1], temp[2], temp[3]]
    else:
        dataset[temp[0]] = [temp[1], temp[2], temp[3]]

# Some sort code here
with open("OUTPUT.txt", "w") as outputsheet:
    for datakey in dataset.keys():
        datavalue = dataset[datakey]
        outputsheet.write("%s %s %s %s\n" % (datakey, datavalue[0], datavalue[1], datavalue[2]))
Using Numpy and Lambda
Using the properties of a dict with some lambda functions does the trick:
data = np.loadtxt("FILE.txt",dtype={'names': ('a', 'b', 'c','d'),'formats': ('f4', 'f4', 'f4','f4')},usecols=(0,1,2,3))
# ordering by columns 1 and 3
sorted_data = sorted(data, key=lambda x: (x[0],x[2]))
# dict comprehension mapping the value of first column to a row
# this will overwrite all previous entries as mapping is 1-to-1
ret = {d[0]: list(d) for d in sorted_data}.values()
Alternatively, you can make it an (ugly) one-liner:
ret = {
    d[0]: list(d)
    for d in sorted(np.loadtxt("FILE.txt", dtype={'names': ('a', 'b', 'c', 'd'),
                                                  'formats': ('f4', 'f4', 'f4', 'f4')},
                               usecols=(0, 1, 2, 3)),
                    key=lambda x: (x[0], x[2]))
}.values()
As #Fallen pointed out, this is an inefficient method as you need to read in the whole file. However, for the purposes of this example where the data set is quite small, it's reasonably acceptable.
Reading one line at a time
The more efficient way is reading in one line at a time.
import re

# Get the data
with open('data', 'r') as f:
    str_data = f.readlines()

# Convert to dict
d = {}
for s in str_data:
    data = [float(n) for n in re.split(r'\s+', s.strip())]
    if data[0] in d:
        if data[2] >= d[data[0]][2]:
            d[data[0]] = data
    else:
        d[data[0]] = data

print d.values()
The caveat here is that there is no other sorting metric, so if you initially have a row for 1.0 such as [1.0, 2.0, 3.0, 5.0], then any subsequent line with a 1.0 whose third column is greater than or equal to 3.0 will overwrite it, e.g. [1.0, 1.0, 3.0, 1.0].
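For completeness, a short pandas sketch of the same per-group maximum, assuming the same whitespace-separated FILE.txt with no header row:
import pandas as pd

df = pd.read_csv("FILE.txt", sep=r'\s+', header=None,
                 names=['C0', 'C1', 'C2', 'C3'])

# for each C0 group, keep the row whose C2 value is largest
best = df.loc[df.groupby('C0')['C2'].idxmax()]
print(best)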

How to replace NaNs by average of preceding and succeeding values in pandas DataFrame?

If I have some missing values and I would like to replace all NaN with the average of the preceding and succeeding values, how can I do that?
I know I can use pandas.DataFrame.fillna with the method='ffill' or method='bfill' options to replace the NaN values by the preceding or succeeding values; however, I would like to apply the average of those values to the dataframe instead of iterating over rows and columns.
Try DataFrame.interpolate(). Example from the pandas docs:
In [65]: df = pd.DataFrame({'A': [1, 2.1, np.nan, 4.7, 5.6, 6.8],
....: 'B': [.25, np.nan, np.nan, 4, 12.2, 14.4]})
....:
In [66]: df
Out[66]:
A B
0 1.0 0.25
1 2.1 NaN
2 NaN NaN
3 4.7 4.00
4 5.6 12.20
5 6.8 14.40
In [67]: df.interpolate()
Out[67]:
A B
0 1.0 0.25
1 2.1 1.50
2 3.4 2.75
3 4.7 4.00
4 5.6 12.20
5 6.8 14.40
Maybe I'm late, but I just had the same question and the only answer on this page did not satisfy my expectations; that's why I am answering now.
Your post states that you want to replace the NaNs with averages; however, interpolation is not the right answer for me, because it fills the empty cells with a linear equation. If you want to fill them with the average of the preceding and succeeding rows, this code helped me:
dfb = df.fillna(method='bfill')   # fill with the next valid value
dff = df.fillna(method='ffill')   # fill with the previous valid value
dfmeans = (dfb + dff) / 2         # element-wise average of the two fills
dfmeans
For the dataframe of the example above, the result looks like:
A B
0 1.0 0.250
1 2.1 2.125
2 3.4 2.125
3 4.7 4.000
4 5.6 12.200
5 6.8 14.400
As you can see, at index 2 of column A both methods produce 3.4, because there the interpolation is (2.1 + 4.7)/2, but in column B the values differ.
For a one-line script and it's application to time series, you can see this post: Average between values with unevenly distributed time in Pandas DataFrame
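As a side note, the bfill/ffill averaging above can be written in one line with the method shortcuts; this is a sketch with the same behaviour, and note that a NaN at the very start or end of a column stays NaN, since one of the two fills has nothing to propagate there:
dfmeans = (df.ffill() + df.bfill()) / 2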
