find a value based on multiple conditions within a tolerance in Pandas - python

I'm looking for an identification procedure in Pandas to trace the movement of some objects.
For each group, I want to assign each row the ID of a row from the previous frame (Time) whose coordinates (X, Y, Z) are all within a tolerance (X < 0.5, Y < 0.6, Z < 0.7), and write it in place. If no such row exists, assign a new ID from a counter (Count) that restarts for each Group.
Here is an example:
Before:
Group  Time    X    Y    Z
    A   1.0  0.1  0.1  0.1
    A   1.0  2.2  2.2  2.2
    B   1.0  3.3  3.3  3.3
    A   1.1  0.4  0.4  0.4
    A   1.1  5.5  5.5  5.5
    B   1.1  3.6  3.6  3.6
After:
Group  Time    X    Y    Z  ID
    A   1.0  0.1  0.1  0.1   1
    A   1.0  2.2  2.2  2.2   2
    B   1.0  3.3  3.3  3.3   1
    A   1.1  0.4  0.4  0.4   1
    A   1.1  5.5  5.5  5.5   3
    B   1.1  3.6  3.6  3.6   1
For clarification:
Row #4: the X, Y, Z change is within the tolerance, hence the same ID (1) in group A.
Row #5: the X, Y, Z change is NOT within the tolerance, hence a new ID (3) in group A.
Row #6: the X, Y, Z change is within the tolerance, hence the same ID (1) in group B.
I think I can trace the movement of my objects along only one axis (say X) using pd.merge_asof, regardless of their group and time, and find their ID. My problems are: 1. taking the group and time into account, 2. assigning a new ID when there is no match, 3. using a different tolerance per axis.
df3 = pd.merge_asof(df1, df2, on="X", direction="nearest", by=["Group"], tolerance=0.5)
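For what it's worth, a minimal sketch of a direct approach under the stated tolerances (not a merge_asof solution; assign_ids and TOL are made-up names): walk each group in time order, match every row against the previous frame, and take a fresh per-group counter value when nothing matches.
import pandas as pd

TOL = {"X": 0.5, "Y": 0.6, "Z": 0.7}   # per-axis tolerances from the question

def assign_ids(df):
    df = df.sort_values(["Group", "Time"]).copy()
    df["ID"] = 0
    for _, g in df.groupby("Group"):
        count = 0      # per-group ID counter, restarts for every group
        prev = None    # rows of the previous frame, with their IDs
        for _, frame in g.groupby("Time"):
            ids = []
            for _, row in frame.iterrows():
                match = None
                if prev is not None:
                    ok = ((prev["X"] - row["X"]).abs() < TOL["X"]) \
                       & ((prev["Y"] - row["Y"]).abs() < TOL["Y"]) \
                       & ((prev["Z"] - row["Z"]).abs() < TOL["Z"])
                    if ok.any():
                        match = prev.loc[ok, "ID"].iloc[0]   # reuse the matched ID
                if match is None:
                    count += 1         # unseen object: hand out a new ID
                    match = count
                ids.append(match)
            df.loc[frame.index, "ID"] = ids
            prev = df.loc[frame.index]
    return df
Running assign_ids on the Before table reproduces the After table above; note it keeps the first match when several previous rows fall within the tolerance.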

Related

Only want to consider a dataframe up to the present point

I have a dataframe and I am trying to do something along the lines of
df['foo'] = np.where(myfunc(df) == 1, 10, 20)
but I only want to consider the dataframe up to the present. For example, if my dataframe looked like
A B C
1 0.3 0.3 1.6
2 0.6 0.6 0.4
3 0.9 0.9 1.2
4 1.2 1.2 0.8
and I was generating the value of 'foo' for the third row, I would be looking at the dataframe's first through third rows, but not the fourth row. Is it possible to accomplish this?
It is certainly possible. The dataframe up to the present is given by
df.iloc[:present]
(note that iloc is positional and end-exclusive, so when generating the value for the third row, present would be 3 and you would see rows one through three), and you can do whatever you want with it, in particular use where, as described here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html
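For example, a minimal sketch of the row-by-row evaluation (myfunc here is a hypothetical stand-in for the asker's function):
import pandas as pd

df = pd.DataFrame({"A": [0.3, 0.6, 0.9, 1.2],
                   "B": [0.3, 0.6, 0.9, 1.2],
                   "C": [1.6, 0.4, 1.2, 0.8]},
                  index=[1, 2, 3, 4])

def myfunc(prefix):
    # hypothetical stand-in: returns 1 when the mean of C so far exceeds 1
    return 1 if prefix["C"].mean() > 1 else 0

# evaluate myfunc on the expanding prefix df.iloc[:i + 1] for every row i,
# so the third row only ever sees the first three rows
df["foo"] = [10 if myfunc(df.iloc[:i + 1]) == 1 else 20
             for i in range(len(df))]
print(df)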

Why are the values different when iterating them in a for loop, than when printing the whole array? [duplicate]

This question already has answers here:
Is floating point math broken? (31 answers)
In Python, using NumPy, why are the values different when printing them in a loop versus printing the whole array, and how can I fix this? I would like them to be e.g. 0.8 instead of 0.7999999999...
>>> import numpy as np
>>> b = np.arange(0.5,2,0.1)
>>> for value in b:
... print(value)
...
0.5
0.6
0.7
0.7999999999999999
0.8999999999999999
0.9999999999999999
1.0999999999999999
1.1999999999999997
1.2999999999999998
1.4
1.4999999999999998
1.5999999999999996
1.6999999999999997
1.7999999999999998
1.8999999999999997
>>> print(b)
[0.5 0.6 0.7 0.8 0.9 1. 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9]
>>>
This happens because Python and NumPy use floating point arithmetic, where some numbers, e.g. 0.1, cannot be represented exactly.
Also check the Python docs on floating-point arithmetic: issues and limitations.
You can use NumPy's np.around for this:
>>> b = np.around(b, 1)  # first arg is the array, second is the number of decimal places
>>> for value in b:
...     print(value)
...
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
To print, you can use a format string, "%.1f":
>>> for value in b:
... print("%.1f" %value)
...
0.5
0.6
0.7
0.8
0.9
1.0
1.1
1.2
1.3
1.4
1.5
1.6
1.7
1.8
1.9
You can also use the built-in round to round each value to a given number of decimal places:
for value in b:
    print(round(value, 2))
I think the __str__ method for the ndarray is simply implemented that way. There is nothing strange about that behaviour: when you use
print(b)
the function __str__ is called on the ndarray for readability, and it rounds the display. When you print inside the for loop, you use the __str__ of the individual float, which prints the number at full precision.
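A quick demonstration (a minimal sketch):
import numpy as np

b = np.arange(0.5, 2, 0.1)
print(str(b[3]))   # float's __str__: 0.7999999999999999
print(str(b))      # ndarray's __str__ rounds the display: [0.5 0.6 0.7 0.8 ...]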
Hope that is clear. :)

Repetitively multiply 6 randomly generated numbers with data from csv

I want to generate 6 random numbers (weights) that always sum to 1 and multiply them with the columns of data I have imported from a csv file, store the weighted sum in another column (weighted average), and find the difference between the max and min of the new column (the range). I want to repeat the process 1000000 times and keep the least range together with the set of random numbers (weights) that produced it.
Here is what I have done so far:
1. Generate 6 random numbers.
2. Import data from the csv.
3. Multiply the random numbers with the data from the csv file and find the average (weighted average).
4. Save the weighted average in a new column F(x).
5. Find the range.
6. Repeat this 1000000 times and get the random numbers that give me the least range.
Here is some data from the file:
A B C D E F F(x)
0 4.9 3.9 6.3 3.4 7.3 3.4 0.0
1 4.1 3.7 7.7 2.8 5.5 3.9 0.0
2 6.0 6.0 4.0 3.1 3.7 4.3 0.0
3 5.6 6.3 6.6 4.6 8.3 4.6 0.0
Currently I am getting 0.0 for all F(x), which should not be so.
arr = np.array(np.random.dirichlet(np.ones(6), size=1))
arr = pd.DataFrame(arr)
ar = arr.iloc[0]
df = pd.read_csv('weit.csv')
df['F(x)'] = df.mul(ar).sum(1)
df
df['F(x)'].max() - df['F(x)'].min()
I am getting 0 for all my weighted averages, but I need the actual weighted average.
Also, I can't loop the code to run 1000000 times and return the least range.
If I understand correctly what you need:
#data from file
print (df)
A B C D E F
0 4.9 3.9 6.3 3.4 7.3 3.4
1 4.1 3.7 7.7 2.8 5.5 3.9
2 6.0 6.0 4.0 3.1 3.7 4.3
3 5.6 6.3 6.6 4.6 8.3 4.6
np.random.seed(3434)
Generate a 2d array with 6 'columns' and N 'rows', filled with random numbers that sum to 1 per row:
N = 10
# with the real data:
# N = 1000000
arr = np.random.dirichlet(np.ones(6), size=N)
print (arr)
print (arr)
[[0.07077773 0.08042978 0.02589592 0.03457833 0.53804634 0.25027191]
[0.22174594 0.22673581 0.26136526 0.04820957 0.00976747 0.23217594]
[0.01202493 0.14247592 0.3411326 0.0239181 0.08448841 0.39596005]
[0.09354759 0.54989312 0.08893737 0.22051801 0.03850101 0.00860291]
[0.09418778 0.33345217 0.11721214 0.33480462 0.11894247 0.00140081]
[0.04285476 0.04531546 0.38105815 0.04316535 0.46902838 0.0185779 ]
[0.00441747 0.08044848 0.33383453 0.09476135 0.37568431 0.11085386]
[0.14613552 0.11260451 0.10421495 0.27880266 0.28994218 0.06830019]
[0.50747802 0.15704797 0.04410511 0.07552837 0.18744306 0.02839746]
[0.00203448 0.13225783 0.43042505 0.33410145 0.08385366 0.01732753]]
Then convert the values from the DataFrame to a 2d numpy array:
b = df.values
#pandas 0.24+
#b = df.to_numpy()
print (b)
[[4.9 3.9 6.3 3.4 7.3 3.4]
[4.1 3.7 7.7 2.8 5.5 3.9]
[6. 6. 4. 3.1 3.7 4.3]
[5.6 6.3 6.6 4.6 8.3 4.6]]
Finally, multiply both arrays together with broadcasting to a 3d array and sum along axis 2; then, to subtract the minimum from the maximum, use numpy.ptp:
c = np.ptp((arr * b[:, None]).sum(axis=2), axis=1)
print (c)
[2.19787892 2.08476765 1.2654273 1.45134533]
Another solution with numpy.einsum:
c = np.ptp(np.einsum('ik,jk->jik', arr, b).sum(axis=2), axis=1)
print (c)
[2.19787892 2.08476765 1.2654273 1.45134533]
A loop solution for comparison, but slow with large N:
out = []
for row in df.values:
    # print (row)
    a = np.ptp((row * arr).sum(axis=1))
    out.append(a)
print (out)
[2.197878921892329, 2.0847676512823052, 1.2654272959079576, 1.4513453259898297]
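If the goal is instead the range across the data rows for each weight set, together with the weights that minimise it (as the question ultimately asks), a hedged sketch reusing arr and b from above:
# one range per weight set: ptp over the data rows (axis 0), shape (N,)
ranges = np.ptp((arr * b[:, None]).sum(axis=2), axis=0)
best = arr[ranges.argmin()]   # the 6 weights with the least range
print (ranges.min(), best)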

How to replace values in a 4-dimensional array?

Let's say I have a 4D array that is
(600, 1, 3, 3)
If you take the first 2 elements, they may look like this:
0 1 0
1 1 1
0 1 0

2 2 2
3 3 3
1 1 1
etc.
I have a list that contains certain weights with which I want to replace specific values in the array. My intention is to use the index of each list element to match the corresponding value in the array. Therefore, this list
[0.1 1.1 1.2 1.3]
when applied against my array would give this result:
0.1 1.1 0.1
1.1 1.1 1.1
0.1 1.1 0.1
1.2 1.2 1.2
1.3 1.3 1.3
1.1 1.1 1.1
etc
This method would have to run through the entire 600 elements of the array.
I can do this in a clunky way using a for loop with array[array == x] = y or np.place, but I wanted to avoid a loop and instead use a method that replaces all values at once. Is there such an approach?
Quoting from #Divakar's solution in the comments, which solves the issue in a very efficient manner:
Simply index into the array version: np.asarray(vals)[idx], where vals is the list and idx is the array. Or use np.take(vals, idx) to do the array conversion under the hood.
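A small demonstration on a (2, 1, 3, 3) slice shaped like the example above (a sketch, not from the original thread); the same single expression covers the full (600, 1, 3, 3) array at once:
import numpy as np

idx = np.array([[[[0, 1, 0], [1, 1, 1], [0, 1, 0]]],
                [[[2, 2, 2], [3, 3, 3], [1, 1, 1]]]])   # shape (2, 1, 3, 3)
vals = [0.1, 1.1, 1.2, 1.3]

out = np.asarray(vals)[idx]   # or: np.take(vals, idx)
print(out[0, 0])
# [[0.1 1.1 0.1]
#  [1.1 1.1 1.1]
#  [0.1 1.1 0.1]]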

How to load a dataset's examples into different arrays for a decision tree classification?

I have a dataset containing 15 examples. It has 3 features and a target label. How do I load the values corresponding to the 3 features into an array in Python (Pandas)?
I want to train a decision tree classifier on the dataset. For this, I have to load the examples into arrays such that all the data points are in an array X and the corresponding labels are in another array Y. How should I proceed?
The dataset looks like following:
x1 x2 x3 z
0 5.5 0.5 4.5 2
1 7.4 1.1 3.6 0
2 5.9 0.2 3.4 2
3 9.9 0.1 0.8 0
4 6.9 -0.1 0.6 2
5 6.8 -0.3 5.1 2
6 4.1 0.3 5.1 1
7 1.3 -0.2 1.8 1
8 4.5 0.4 2.0 0
9 0.5 0.0 2.3 1
10 5.9 -0.1 4.4 0
11 9.3 -0.2 3.2 0
12 1.0 0.1 2.8 1
13 0.4 0.1 4.3 1
14 2.7 -0.5 4.2 1
I already have the dataset loaded into a dataframe :
import pandas as pd
df = pd.read_csv(r'C:\Users\Dell\Downloads\dataset.csv')  # raw string, otherwise the backslashes are treated as escapes
print(df.to_string())
I need to know how to load the values corresponding to the features x1, x2 and x3 into X (as training examples) and the values corresponding to the label z into Y (as the labels for the training examples).
Thanks.
First you load the data into a data.frame.
Since the original had very strange formatting, I changed it to a normal .csv to make this example easier to understand.
x1,x2,x3,z
5.5,0.5,4.5,2
7.4,1.1,3.6,0
5.9,0.2,3.4,2
9.9,0.1,0.8,0
6.9,-0.1,0.6,2
6.8,-0.3,5.1,2
4.1,0.3,5.1,1
1.3,-0.2,1.8,1
4.5,0.4,2.0,0
0.5,0.0,2.3,1
5.9,-0.1,4.4,0
9.3,-0.2,3.2,0
1.0,0.1,2.8,1
0.4,0.1,4.3,1
2.7,-0.5,4.2,1
Once you have the data in the data.frame, half of the work is already done. Here is an example using the "caret" package with a linear regression model.
library("caret")
my.dataframe <- read.csv("myExample.csv", header = T, sep =",")
fit <- train(z ~ ., data = my.dataframe, method = "lm")
fit
Basically, you just have to replace the "lm" in method to train all kinds of other models.
Here is a list where you can choose from:
http://topepo.github.io/caret/available-models.html
For training a random forest model, you would type:
library("caret")
my.dataframe <- read.csv("myExample.csv", header = T, sep =",")
fit <- train(z ~ ., data = my.dataframe, method = "rf")
fit
But also be careful: you have very limited data, and not every model makes sense for just 15 data points.
A random forest model, for example, will give you this warning:
45: In randomForest.default(x, y, mtry = param$mtry, ...) :
The response has five or fewer unique values. Are you sure you want to do regression?
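For completeness, since the question asked for Python/pandas: a minimal sketch of loading the features into X and the labels into Y (the file path is the asker's; the scikit-learn step is an assumption about what comes next):
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv(r'C:\Users\Dell\Downloads\dataset.csv')

X = df[['x1', 'x2', 'x3']].values   # shape (15, 3): the training examples
Y = df['z'].values                  # shape (15,): the corresponding labels

clf = DecisionTreeClassifier().fit(X, Y)   # train the decision tree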
