How to replace values in a 4-dimensional array? - python

Let's say I have a 4D array that is
(600, 1, 3, 3)
If you take the first 2 elements they may look like this:
0 1 0
1 1 1
0 1 0
2 2 2
3 3 3
1 1 1
etc
I have a list of weights that I want to use to replace specific values in the array: each value in the array should be replaced by the list element at that index. Therefore, this list
[0.1 1.1 1.2 1.3]
when applied against my array would give this result:
0.1 1.1 0.1
1.1 1.1 1.1
0.1 1.1 0.1
1.2 1.2 1.2
1.3 1.3 1.3
1.1 1.1 1.1
etc
This method would have to run through the entire 600 elements of the array.
I can do this in a clunky way using a for loop with array[array==x] = y or np.place, but I wanted to avoid a loop and perhaps use a method that replaces all values at once. Is there such an approach?

Quoting from @Divakar's solution in the comments, which solves the issue in a very efficient manner:
Simply index into the array version: np.asarray(vals)[idx], where vals is the list and idx is the array. Or use np.take(vals, idx) to do the array conversion under the hood.
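A minimal sketch of that approach, assuming the array holds small non-negative integer labels that index into the weight list (the sample data below is made up):

import numpy as np

idx = np.random.randint(0, 4, size=(600, 1, 3, 3))   # made-up labels 0-3
vals = [0.1, 1.1, 1.2, 1.3]                          # weight for label 0, 1, 2, 3

weighted = np.asarray(vals)[idx]    # every element v becomes vals[v]; shape stays (600, 1, 3, 3)
weighted_take = np.take(vals, idx)  # same result, list-to-array conversion done under the hood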

Related

Pandas Slice Columns and select subsets based on between condition

I have a dataframe as follows:
                     100  105  110
timestamp
2020-11-01 12:00:00  0.2  0.5  0.1
2020-11-01 12:01:00  0.3  0.8  0.2
2020-11-01 12:02:00  0.8  0.9  0.4
2020-11-01 12:03:00  1    0    0.4
2020-11-01 12:04:00  0    1    0.5
2020-11-01 12:05:00  0.5  1    0.2
I want to select the columns of the dataframe where the values are greater than or equal to 0.5 and less than or equal to 1, and I want the index/timestamp at which these occurrences happened. Each column could have multiple such occurrences. So, column 100 can be between 0.5 and 1 from 12:00 to 12:03 and then again from 12:20 to 12:30. It needs to reset when it hits 0. The column names are variable.
I also want the time difference during which the column value was between 0.5 and 1, so from the above it would be 3 minutes and 10 minutes.
The expected output would be this, plus a dict of the ranges the indexes appeared in:
                     100  105  110
timestamp
2020-11-01 12:00:00  NaN  0.5  NaN
2020-11-01 12:01:00  NaN  0.8  NaN
2020-11-01 12:02:00  0.8  0.9  NaN
2020-11-01 12:03:00  1    NaN  NaN
2020-11-01 12:04:00  NaN  1    0.5
2020-11-01 12:05:00  0.5  1    NaN
and probably a way to calculate the minutes, which could be in a dict/list of dicts:
{"105":
 [{"from": "2020-11-01 12:00:00", "to": "2020-11-01 12:02:00"},
  {"from": "2020-11-01 12:04:00", "to": "2020-11-01 12:05:00"}],
 ...
}
Essentially, the dicts at the end are what I want to evaluate.
Basically, it would be best if you got the ordered sequence of timestamps; then, you can manipulate it to get the differences. If the question is only about Pandas slicing and not about timestamp operations, then you need an operation like the following (combining the two conditions into one boolean mask and reading the timestamps off the index):
df[(df["100"] >= 0.5) & (df["100"] <= 1)].index.values
Pandas data frame comparison operations
For Pandas data frames, the normal comparison operations are overridden. If you do dataframe_instance >= 0.5, the result is a sequence of boolean values. An individual value in the sequence results from comparing an individual data frame value to 0.5.
Pandas data frame slicing
This sequence can be used to filter a subsequence from your data frame. It is possible because Pandas slicing is overridden and implemented as a filtering operation.
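A minimal sketch of that masking idea, assuming the timestamps form the index and the column labels are strings (the frame below just mirrors the question's sample data):

import pandas as pd

df = pd.DataFrame(
    {"100": [0.2, 0.3, 0.8, 1.0, 0.0, 0.5],
     "105": [0.5, 0.8, 0.9, 0.0, 1.0, 1.0],
     "110": [0.1, 0.2, 0.4, 0.4, 0.5, 0.2]},
    index=pd.date_range("2020-11-01 12:00:00", periods=6, freq="1min"),
)

masked = df.where((df >= 0.5) & (df <= 1))                # values outside [0.5, 1] become NaN
hits = df[(df["100"] >= 0.5) & (df["100"] <= 1)].index    # timestamps for a single column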

Multiply each column by their relative factor using prefix column name

I have a matrix like
id |v1_m1 v2_m1 v3_m1 f_m1 v1_m2 v2_m2 v3_m2 f_m2|
1 | 0 .5 .5 4 0.1 0.3 0.6 4 |
2 | 0.3 .3 .4 8 0.2 0.4 0.4 7 |
What I want is to multiply each of the v columns in m1 by the f_m1 column, and all the v columns with the suffix "_m2" by the f_m2 column.
The output that I expect is something like this:
id |v1_m1 v2_m1 v3_m1 v1_m2 v2_m2 v3_m2 |
1 | 0 2 2 0.4 1.2 2.4 |
2 | 2.4 2.4 3.2 1.4 2.8 2.8 |
My current loop-based approach is:
for m in range(1, maxm):
    for i in range(1, maxv):
        df["v{}_m{}".format(i, m)] = df["v{}_m{}".format(i, m)] * df["f_m{}".format(m)]
for m in range(1, maxm):
    df = df.drop(columns=["f_m{}".format(m)])
You could do this with some fancy dataframe reshaping:
df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.str.split('_')))
df=df.stack()
df_mul = df.filter(like='v').mul(df.filter(like='f').squeeze(), axis=0)
df_mul = df_mul.unstack().sort_index(level=1, axis=1)
df_mul.columns = [f'{i}_{j}' for i, j in df_mul.columns]
df_mul
Output:
v1_m1 v2_m1 v3_m1 v1_m2 v2_m2 v3_m2
id
1 0.0 2.0 2.0 0.4 1.2 2.4
2 2.4 2.4 3.2 1.4 2.8 2.8
Details:
- Create MultiIndex column headers by splitting the original headers on '_'
- Reshape the dataframe by stacking the m# level to rows, leaving four columns: f and the three v's
- Using filter, select the v columns and multiply by the f series, which is created by selecting the single f column and using squeeze to turn a one-column dataframe into a pd.Series
- unstack the m# level back to columns
- Flatten the MultiIndex column header back to a single level using an f-string in a list comprehension
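If you want to reproduce the output above, here is a minimal sketch of an input frame the reshaping snippet can be run on (the construction itself is assumed, with 'id' as the index):

import pandas as pd

df = pd.DataFrame(
    {"v1_m1": [0, 0.3], "v2_m1": [0.5, 0.3], "v3_m1": [0.5, 0.4], "f_m1": [4, 8],
     "v1_m2": [0.1, 0.2], "v2_m2": [0.3, 0.4], "v3_m2": [0.6, 0.4], "f_m2": [4, 7]},
    index=pd.Index([1, 2], name="id"),
)
# The reshaping snippet above can now be run on df as-is.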
Assuming that your matrix is a pandas dataframe called df, I would like to nominate a list comprehension approach, if you enjoy them.
import itertools

items = [(i[0][0], i[0][1].multiply(i[1][1]))
         for i in itertools.product(df.items(), repeat=2)
         if (i[0][0][-2:] == i[1][0][-2:])
         and i[1][0][:1] == 'f'
         and i[0][0][:1] != 'f']
df_mul = pd.DataFrame.from_dict({i[0]: i[1] for i in items})
It should be superfast on larger versions of this problem.
Explanation -
- Creates a generator for the cross-product between the columns, as (c1, c2) tuples
- Keeps only the pairs where the last 2 characters are the same for both c1 and c2, AND c2 starts with 'f', AND c1 doesn't start with 'f' (leaving you with the column pairs you want to operate on as individual tuples). Something like this - [('v1_m1', 'f_m1'), ('v2_m1', 'f_m1'), ('v1_m2', 'f_m2')]
- Multiplies the columns, attaches a column name and saves them as items (similar structure to df.items())
- Turns the items into a dataframe

Only want to consider a dataframe up to the present point

I have a dataframe and I am trying to do something along the lines of
df['foo'] = np.where(myfunc(df) == 1, 10, 20)
but I only want to consider the dataframe up to the present, for example if my dataframe looked like
A B C
1 0.3 0.3 1.6
2 0.6 0.6 0.4
3 0.9 0.9 1.2
4 1.2 1.2 0.8
and I was generating the value of 'foo' for the third row, I would be looking at the dataframe's first through third rows, but not the fourth row. Is it possible to accomplish this?
It is certainly possible. The dataframe up to the present row is given by
df.iloc[:present]
and you can do whatever you want with it; in particular, use where, as described here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.where.html
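A minimal sketch of that idea, assuming myfunc takes a dataframe and returns a single value (both myfunc and the threshold inside it are placeholders):

import pandas as pd

df = pd.DataFrame({"A": [0.3, 0.6, 0.9, 1.2],
                   "B": [0.3, 0.6, 0.9, 1.2],
                   "C": [1.6, 0.4, 1.2, 0.8]})

def myfunc(frame):
    # Placeholder: 1 if the running mean of column A exceeds 0.5, else 0
    return 1 if frame["A"].mean() > 0.5 else 0

# Evaluate myfunc for each row using only the rows up to and including it
df["foo"] = [10 if myfunc(df.iloc[:i + 1]) == 1 else 20 for i in range(len(df))]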

Python overlapping sliding dataframe

I'm implementing a machine learning algorithm and I'm extracting features out of a dataframe. I obviously need overlapping windows. Suppose the DataFrame looks like this
x y z
12.1 11 0.5
12.2 10 0.3
12.4 11 0.5
12.8 12 0.4
13.1 13 0.4
14.7 14 0.5
15.2 14 0.6
15.3 13 0.5
17.3 14 0.5
18.2 15 0.4
16.1 16 0.2
15.0 17 0.1
But in reality it is a lot larger (thousands of samples). I now want a list of dataframes where each DataFrame has length ws (here 150) with a step (stride) of 60.
This is what I have:
r = np.arange(len(df))
s = r[::step]
return [df.iloc[k:k+ws] for k in s]
This works reasonably well, but there's still one problem: the last 1, 2 or 3 frames might not have length ws. I also cannot just discard the last 3, since sometimes only one of them is shorter than ws. The s variable currently keeps all the start indices; I'd need a way to keep only the start indices where start_index + ws <= len(df). Unless, of course, there are better and/or faster ways to do this (maybe a library). All existing documentation only talks about simple arrays.
You might only need to change s:
s = r[:len(df)-ws+1:step]
In this way you only find the start indexes of frames with length ws.
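As a quick sketch of the fix in context (window and step sizes scaled down to fit the sample data; the function name is just for illustration):

import numpy as np
import pandas as pd

def sliding_windows(df, ws, step):
    # Keep only start indices that leave room for a full window of length ws
    r = np.arange(len(df))
    s = r[:len(df) - ws + 1:step]
    return [df.iloc[k:k + ws] for k in s]

df = pd.DataFrame(np.random.rand(12, 3), columns=["x", "y", "z"])
windows = sliding_windows(df, ws=5, step=3)   # 3 windows, each of length 5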

How to compare if any value is similar to any other using numpy

I have many pairs of coordinate arrays like so
a=[(1.001,3),(1.334, 4.2),...,(17.83, 3.4)]
b=[(1.002,3.0001),(1.67, 5.4),...,(17.8299, 3.4)]
c=[(1.00101,3.002),(1.3345, 4.202),...,(18.6, 12.511)]
Any coordinate in any of the pairs can be a duplicate of another coordinate in another array of pairs. The arrays are also not the same size.
The duplicates will vary slightly in their value and for an example, I would consider the first value in a, b and c to be duplicates.
I could iterate through each array and compare the values one by one using numpy.isclose, however that will be slow.
Is there an efficient way to tackle this problem, hopefully using numpy to keep computing times low?
You might want to try the round() function, which will round off the numbers in your lists to the nearest integers.
The next thing I'd suggest might be too extreme: concat the arrays, put them into a pandas dataframe and drop_duplicates().
This might not be the solution you want.
You might want to take a look at numpy.testing if you allow for AssertionError handling.
import numpy as np
from numpy import testing as ts

a = np.array((1.001, 3))
b = np.array((1.000101, 3.002))
ts.assert_array_almost_equal(a, b, decimal=1)  # output None
but
ts.assert_array_almost_equal(a, b, decimal=3)
results in
AssertionError:
Arrays are not almost equal to 3 decimals
Mismatch: 50%
Max absolute difference: 0.002
Max relative difference: 0.00089891
x: array([1.001, 3. ])
y: array([1. , 3.002])
There are some more interesting functions from numpy.testing. Make sure to take a look at the docs.
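If you would rather get a plain True/False out of that check than an exception, here is a small sketch of the AssertionError handling mentioned above (the helper name is just illustrative):

import numpy as np
from numpy import testing as ts

def almost_equal(p, q, decimal=2):
    # True when the two coordinates match to the given number of decimals
    try:
        ts.assert_array_almost_equal(np.asarray(p), np.asarray(q), decimal=decimal)
        return True
    except AssertionError:
        return False

almost_equal((1.001, 3), (1.002, 3.0001))   # True
almost_equal((1.001, 3), (1.67, 5.4))       # False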
I'm using pandas to give you an intuitive result, rather than just numbers. Of course you can expand the solution to your needs.
Say you create a pd.DataFrame from each array, and tag each with the array it belongs to. I am rounding the results to 2 decimal places; you may use whatever tolerance you want.
dfa = pd.DataFrame(a).round(2)
dfa['arr'] = 'a'
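dfb and dfc would presumably be built the same way from b and c (assuming pandas is imported as pd and a, b, c are the lists from the question):

dfb = pd.DataFrame(b).round(2)
dfb['arr'] = 'b'
dfc = pd.DataFrame(c).round(2)
dfc['arr'] = 'c'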
Then, by concatenating, using duplicated and sorting, you may find an intuitive Dataframe that might fulfill your needs
df = pd.concat([dfa, dfb, dfc])
df[df.duplicated(subset=[0,1], keep=False)].sort_values(by=[0,1])
yields
x y arr
0 1.00 3.0 a
0 1.00 3.0 b
0 1.00 3.0 c
1 1.33 4.2 a
1 1.33 4.2 c
2 17.83 3.4 a
2 17.83 3.4 b
The indexes are duplicated, so you can simply use reset_index() at the end and use the newly-generated column as a parameter that indicates the corresponding index on each array. I.e.:
index x y arr
0 0 1.00 3.0 a
1 0 1.00 3.0 b
2 0 1.00 3.0 c
3 1 1.33 4.2 a
4 1 1.33 4.2 c
5 2 17.83 3.4 a
6 2 17.83 3.4 b
So, for example, line 0 indicates a duplicate coordinate, and is found on index 0 of arr a. Line 1 also indicates a dupe coordinate, found on index 0 of arr b, etc.
Now, if you just want to delete the duplicates and get one final array with only non-duplicate values, you may use drop_duplicates:
df.drop_duplicates(subset=[0,1])[[0,1]].to_numpy()
which yields
array([[ 1. , 3. ],
[ 1.33, 4.2 ],
[17.83, 3.4 ],
[ 1.67, 5.4 ],
[18.6 , 12.51]])
