How to Recurrently Transpose a Series/List/Array - python

I have an array/list/pandas series:
np.arange(15)
Out[11]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
What I want is:
[[0,1,2,3,4,5],
[1,2,3,4,5,6],
[2,3,4,5,6,7],
...
[10,11,12,13,14]]
That is, recurrently transpose this column into a 5-column matrix.
The reason is that I am doing feature engineering on a column of temperature data. I want to use the last 5 data points as features and the next one as the target.
What's the most efficient way to do that? My data is large.

If the array is formatted like this:
arr = np.array([1,2,3,4,5,6,7,8,....])
You could try it like this:
recurr_transpose = np.matrix([arr[i:i+5] for i in range(len(arr)-4)])
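For large inputs, a windowed view is usually faster and lighter than building the rows one slice at a time. A minimal sketch, assuming NumPy 1.20+ for sliding_window_view, and using a window of 6 so that each row carries the 5 features plus the next value as the target:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(15)
windows = sliding_window_view(arr, 6)   # shape (10, 6): [0..5], [1..6], ..., [9..14]
X, y = windows[:, :5], windows[:, 5]    # last 5 values as features, the next as target

Here windows is a read-only view into arr, so nothing is copied until you write to or copy a slice.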

Related

Why do these two numpy.divide operations give such different results?

I would like to correct the values in hyperspectral readings from a camera using the formula described over here:
the dark reference is subtracted from the captured data, and the result is divided by the white reference minus the dark reference.
In the original example the task is rather simple: the white and dark references have the same shape as the main data, so the formula is executed as:
corrected_nparr = np.divide(np.subtract(data_nparr, dark_nparr),
np.subtract(white_nparr, dark_nparr))
However, the main data is much larger. The shapes in my case are as follows:
$ white_nparr.shape, dark_nparr.shape, data_nparr.shape
((100, 640, 224), (100, 640, 224), (4300, 640, 224))
That's why I repeat the reference arrays:
white_nparr_rep = white_nparr.repeat(43, axis=0)
dark_nparr_rep = dark_nparr.repeat(43, axis=0)
return np.divide(np.subtract(data_nparr, dark_nparr_rep), np.subtract(white_nparr_rep, dark_nparr_rep))
And it works almost perfectly, as can be seen in the image on the left. But this approach requires an enormous amount of memory, so I decided to traverse the large array and replace the original values with corrected ones on the go instead:
ref_scale = dark_nparr.shape[0]
data_scale = data_nparr.shape[0]
for i in range(int(data_scale / ref_scale)):
    data_nparr[i*ref_scale:(i+1)*ref_scale] = np.divide(
        np.subtract(data_nparr[i*ref_scale:(i+1)*ref_scale], dark_nparr),
        np.subtract(white_nparr, dark_nparr)
    )
But that traversal approach gives me the ugliest of results, as can be seen on the right. I'd appreciate any idea that would help me fix this.
Note: I apply 20-times co-adding (mean of 20 readings) to obtain the images below.
EDIT: the dtype of each array is as follows:
$ white_nparr.dtype, dark_nparr.dtype, data_nparr.dtype
(dtype('float32'), dtype('float32'), dtype('float32'))
Your two methods don't agree because in the first method you used
white_nparr_rep = white_nparr.repeat(43, axis=0)
but the second method corresponds to using
white_nparr_rep = np.tile(white_nparr, (43, 1, 1))
If the first method is correct, you'll have to adjust the second method to act accordingly. Perhaps:
for i in range(int(data_scale / ref_scale)):
    data_nparr[i*ref_scale:(i+1)*ref_scale] = np.divide(
        np.subtract(data_nparr[i*ref_scale:(i+1)*ref_scale], dark_nparr[i]),
        np.subtract(white_nparr[i], dark_nparr[i])
    )
A simple example with 2-d arrays that shows the difference between repeat and tile:
In [146]: z
Out[146]:
array([[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15]])
In [147]: np.repeat(z, 3, axis=0)
Out[147]:
array([[ 1, 2, 3, 4, 5],
[ 1, 2, 3, 4, 5],
[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15],
[11, 12, 13, 14, 15],
[11, 12, 13, 14, 15]])
In [148]: np.tile(z, (3, 1))
Out[148]:
array([[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15],
[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15],
[ 1, 2, 3, 4, 5],
[11, 12, 13, 14, 15]])
Off topic postscript: I don't know why the author of the page that you linked to writes NumPy expressions as (for example):
corrected_nparr = np.divide(
np.subtract(data_nparr, dark_nparr),
np.subtract(white_nparr, dark_nparr))
NumPy allows you to write that as
corrected_nparr = (data_nparr - dark_nparr) / (white_nparr - dark_nparr)
which looks much nicer to me.
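Under the repeat interpretation, where data slice j pairs with reference slice j // 43, you can also drop both the repeated arrays and the Python loop: reshape the big array into a view grouped by reference slice and let broadcasting do the alignment. A sketch under that assumption; the in-place operations keep peak memory near the size of a single reference array:

n_ref = white_nparr.shape[0]                          # 100
n_rep = data_nparr.shape[0] // n_ref                  # 43
denom = white_nparr - dark_nparr                      # computed once
blocks = data_nparr.reshape(n_ref, n_rep, 640, 224)   # a view: blocks[k, r] is data_nparr[43*k + r]
blocks -= dark_nparr[:, None]                         # each reference slice broadcasts over its 43 frames
blocks /= denom[:, None]                              # data_nparr is now corrected in place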

Drop consecutive duplicates whose timestamps differ only by milliseconds - Python

The dataframe looks like this:
0, 3710.968017578125, 2012-01-07T03:13:43.859Z
1, 3710.968017578125, 2012-01-07T03:13:48.890Z
2, 3712.472900390625, 2012-01-07T03:13:53.906Z
3, 3712.472900390625, 2012-01-07T03:13:58.921Z
4, 3713.110107421875, 2012-01-07T03:14:03.900Z
5, 3713.110107421875, 2012-01-07T03:14:03.937Z
6, 3713.89892578125, 2012-01-07T03:14:13.900Z
7, 3713.89892578125, 2012-01-07T03:14:13.968Z
8, 3713.89892578125, 2012-01-07T03:14:19.000Z
9, 3714.64990234375, 2012-01-07T03:14:24.000Z
10, 3714.64990234375, 2012-01-07T03:14:24.015Z
11, 3714.64990234375, 2012-01-07T03:14:29.000Z
12, 3714.64990234375, 2012-01-07T03:14:29.031Z
Some rows have timestamps that differ only by milliseconds; I want to drop those and keep only the rows whose timestamps differ at the second level. But there are also runs where the same value spans both millisecond-apart and second-apart rows, like rows 9 to 12, so I can't simply use a.loc[a.shift() != a].
The desired output would be:
0, 3710.968017578125, 2012-01-07T03:13:43.859Z
1, 3710.968017578125, 2012-01-07T03:13:48.890Z
2, 3712.472900390625, 2012-01-07T03:13:53.906Z
3, 3712.472900390625, 2012-01-07T03:13:58.921Z
4, 3713.110107421875, 2012-01-07T03:14:03.900Z
6, 3713.89892578125, 2012-01-07T03:14:13.900Z
8, 3713.89892578125, 2012-01-07T03:14:19.000Z
9, 3714.64990234375, 2012-01-07T03:14:24.000Z
11, 3714.64990234375, 2012-01-07T03:14:29.000Z
Try:
df.groupby(pd.to_datetime(df[2]).astype('datetime64[s]')).head(1)
Casting the timestamps to second resolution truncates the millisecond part, so rows that differ only by milliseconds fall into the same group, and head(1) keeps the first row of each group.
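A small illustration of the truncation, using dt.floor('s'), which rounds down the same way:

import pandas as pd

ts = pd.to_datetime(pd.Series(['2012-01-07T03:14:24.000Z',
                               '2012-01-07T03:14:24.015Z']))
print(ts.dt.floor('s').nunique())   # 1 -- both rows land in the same group, so head(1) keeps only the first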
You can use the script below. I didn't have your dataframe's column names, so I invented the columns ['x', 'date_time']:
df = pd.DataFrame([
(3710.968017578125, pd.to_datetime('2012-01-07T03:13:43.859Z')),
(3710.968017578125, pd.to_datetime('2012-01-07T03:13:48.890Z')),
(3712.472900390625, pd.to_datetime('2012-01-07T03:13:53.906Z')),
(3712.472900390625, pd.to_datetime('2012-01-07T03:13:58.921Z')),
(3713.110107421875, pd.to_datetime('2012-01-07T03:14:03.900Z')),
(3713.110107421875, pd.to_datetime('2012-01-07T03:14:03.937Z')),
(3713.89892578125, pd.to_datetime('2012-01-07T03:14:13.900Z')),
(3713.89892578125, pd.to_datetime('2012-01-07T03:14:13.968Z')),
(3713.89892578125, pd.to_datetime('2012-01-07T03:14:19.000Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:24.000Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:24.015Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:29.000Z')),
(3714.64990234375, pd.to_datetime('2012-01-07T03:14:29.031Z'))],
columns=['x', 'date_time'])
# create a column 'time_diff' holding the difference between the datetime
# of each row and the previous row within the same 'x' group
df['time_diff'] = df.groupby('x')['date_time'].diff()
# keep only the rows where that difference is either NaT or more than one second
df = df[(df['time_diff'].isnull()) | (df['time_diff'].map(lambda x: x.seconds > 1))]
# drop the temporary column 'time_diff'
df = df.drop(['time_diff'], axis=1)
df

Adding values from for-loop into an array

I'm trying to collect the lengths of the values in the states array into a separate array and then sort them in descending order, but I'm having trouble getting all the length values into the array; instead I'm left with a single value after the iteration.
states = ["Abia", "Adamawa", "Anambra", "Akwa Ibom", "Bauchi", "Bayelsa", "Benue", "Borno", "Cross River", "Delta", "Ebonyi", "Enugu", "Edo", "Ekiti", "Gombe", "Imo", "Jigawa", "Kaduna", "Kano", "Katsina", "Kebbi", "Kogi", "Kwara", "Lagos", "Nasarawa", "Niger", "Ogun", "Ondo", "Osun", "Oyo", "Plateau", "Rivers", "Sokoto", "Taraba", "Yobe", "Zamfara"]
for i in states:
    a = [len(i)]
print(a)
Since you want the lengths sorted in descending order, use sorted with reverse=True and a list comprehension:
states = ["Abia", "Adamawa", "Anambra", "Akwa Ibom", "Bauchi", "Bayelsa", "Benue", "Borno", "Cross River", "Delta", "Ebonyi", "Enugu", "Edo", "Ekiti", "Gombe", "Imo", "Jigawa", "Kaduna", "Kano", "Katsina", "Kebbi", "Kogi", "Kwara", "Lagos", "Nasarawa", "Niger", "Ogun", "Ondo", "Osun", "Oyo", "Plateau", "Rivers", "Sokoto", "Taraba", "Yobe", "Zamfara"]
a = sorted([len(i) for i in states], reverse=True)
print (a)
Output
[11, 9, 8, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3]
To get the indices of the sorted list without resorting to NumPy arrays, there are many ways: see here. I personally prefer to directly make use of NumPy's argsort. As the name suggests, it returns an array of indices corresponding to the sorted array/list in ascending order. To get the indices for descending order, you can just reverse the array returned by argsort by using [::-1]. Following is a solution to your problem:
import numpy as np
states = ["Abia", "Adamawa", "Anambra", "Akwa Ibom", "Bauchi", "Bayelsa", "Benue", "Borno", "Cross River", "Delta", "Ebonyi", "Enugu", "Edo", "Ekiti", "Gombe", "Imo", "Jigawa", "Kaduna", "Kano", "Katsina", "Kebbi", "Kogi", "Kwara", "Lagos", "Nasarawa", "Niger", "Ogun", "Ondo", "Osun", "Oyo", "Plateau", "Rivers", "Sokoto", "Taraba", "Yobe", "Zamfara"]
a = [len(i) for i in states]
indices_sorted = np.argsort(a)[::-1] # [::-1] gives you indices for decreasing order
Output
array([ 8, 3, 24, 35, 19, 1, 2, 30, 5, 4, 10, 16, 17, 33, 32, 31, 22,
13, 6, 7, 9, 11, 14, 25, 23, 20, 21, 26, 27, 34, 28, 18, 0, 12,
15, 29])
Now as you can see, the first index in the above output is 8, which refers to the 9th element of states, namely Cross River. Similarly you can access and verify the other elements.
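A quick check, indexing states with the first few entries of indices_sorted:

print([states[i] for i in indices_sorted[:3]])
# ['Cross River', 'Akwa Ibom', 'Nasarawa']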
You can use a list comprehension:
lengths = [len(state) for state in states]
If you need to use a for loop, create a list and append to it:
lengths = []
for i in states:
    lengths.append(len(i))
You can also do this using the map function without using a for loop:
a = list(map(len,states))
Via a list comprehension:
lens = [len(a) for a in states]

How to exclude elements from a numpy matrix

Suppose we have a matrix:
mat = np.random.randn(5,5)
array([[-1.3979852 , -0.37711369, -1.99509723, -0.6151796 , -0.78780951],
[ 0.12491113, 0.90526669, -0.18217331, 1.1252506 , -0.31782889],
[-3.5933008 , -0.17981343, 0.91469733, -0.59719805, 0.12728085],
[ 0.6906646 , 0.2316733 , -0.2804641 , 1.39864598, -0.09113139],
[-0.38012856, -1.7230821 , -0.5779237 , 0.30610451, -1.30015299]])
Suppose also that we have an index array:
idx = np.array([0,4,3,1,3])
While we can extract elements from the matrix using the following:
mat[idx, range(len(idx))]
array([-1.3979852 , -1.7230821 , -0.2804641 , 1.1252506 , -0.09113139])
What I want to know is how we can use the index to exclude elements from the matrix, i.e. how do I obtain the following result:
array([[0.12491113 , -0.37711369, -1.99509723, -0.6151796 , -0.78780951],
[-3.5933008 , 0.90526669, -0.18217331, -0.59719805, -0.31782889],
[0.6906646 , -0.17981343, 0.91469733, 1.39864598, 0.12728085],
[-0.38012856, 0.2316733 , -0.5779237 , 0.30610451, -1.30015299]])
Thought it would be as simple as doing mat[-idx, range(len(idx))] but that doesn't work. I've also tried np.delete() but that doesn't seem to do it either. Any solutions out there that don't require looping or list comprehensions? Would appreciate any insight. Thanks.
EDIT: data must be in the same columns post processing.
When you say 'delete' does not work, what do you mean? What does it do? That might be diagnostic.
Let's first look at the selection that does work:
In [484]: mat=np.arange(25).reshape(5,5) # I like this better than random
In [485]: mat[idx,range(5)]
Out[485]: array([ 0, 21, 17, 8, 19])
This can also be used on a flattened version of the array:
In [486]: mat.flat[idx*5+np.arange(5)]
Out[486]: array([ 0, 21, 17, 8, 19])
Now try the same with the default flat delete:
In [487]: np.delete(mat,idx*5+np.arange(5)).reshape(5,4)
Out[487]:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 9],
[10, 11, 12, 13],
[14, 15, 16, 18],
[20, 22, 23, 24]])
np.delete isn't an in-place operation; it returns a new array. And if you specify an axis, delete removes whole rows or columns, not selected items.
mat[-idx, range(len(idx))] isn't going to work, since negative indexes already have a meaning: they count from the end.
This delete ends up doing boolean indexing, thus:
In [498]: mat1=mat.ravel()
In [499]: idx1=idx*5+np.arange(5)
In [500]: ii=np.ones(mat1.shape, bool)
In [501]: ii[idx1]=False
In [502]: mat1[ii]
Out[502]:
array([ 1, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 15, 16, 17, 18, 20, 21, 22, 23, 24])
This sort of indexing/delete works even if you delete a different number of items from each row. Of course in that case you couldn't count on reshaping the matrix back to a rectangular matrix.
In general when dealing with different indexes for different rows, the operation ends up acting on the flat or raveled version of the matrix. 'Irregular' operations usually make more sense when dealing with 1d arrays than with 2d.
Looking more carefully at your example, I see that when you remove an item, you move the other column values up to fill the gap. In my version, I moved values along rows. Let's try this with F ('column-major') ordering.
In [522]: idx2=idx+np.arange(5)*5   # flat indices of the selected elements in F order
In [523]: mat2=mat.flatten('F')
In [524]: np.delete(mat2,idx2).reshape(5,4).T
Out[524]:
array([[ 5, 1, 2, 3, 4],
[10, 6, 7, 13, 9],
[15, 11, 12, 18, 14],
[20, 16, 22, 23, 24]])
where I removed a value from each column:
In [525]: mat2[idx2]
Out[525]: array([ 0, 21, 17, 8, 19])
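The same column-wise removal can also be phrased as a boolean mask over the original matrix, without computing flat indices by hand. A sketch; the transposes make the surviving values pack down each column, which is exactly the "values move up to fill the gap" behaviour of the desired output:

mask = np.ones(mat.shape, dtype=bool)
mask[idx, np.arange(mat.shape[1])] = False        # mark one element per column for removal
out = mat.T[mask.T].reshape(mat.shape[1], -1).T   # (4, 5), columns kept in place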

Delete rows at select indexes from a numpy array

In my dataset I have close to 200 rows, but for a minimal working example let's assume the following array:
arr = np.array([[1,2,3,4], [5,6,7,8],
[9,10,11,12], [13,14,15,16],
[17,18,19,20], [21,22,23,24]])
I can take a random sampling of 3 of the rows as follows:
indexes = np.random.choice(np.arange(arr.shape[0]), int(arr.shape[0]/2), replace=False)
Using these indexes, I can select my test cases as follows:
testing = arr[indexes]
I want to delete the rows at these indexes and I can use the remaining elements for my training set.
From the post here, it seems that training = np.delete(arr, indexes) ought to do it. But I get a 1d array instead.
I also tried the suggestion here using training = arr[indexes.astype(np.bool)] but it did not give a clean separation. I get element [5,6,7,8] in both the training and testing sets.
training = arr[indexes.astype(np.bool)]
testing
Out[101]:
array([[13, 14, 15, 16],
[ 5, 6, 7, 8],
[17, 18, 19, 20]])
training
Out[102]:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
Any idea what I am doing wrong? Thanks.
To delete the indexed rows from a numpy array:
arr = np.delete(arr, indexes, axis=0)
One approach would be to get the remaining row indices with np.setdiff1d and then use those row indices to get the desired output -
out = arr[np.setdiff1d(np.arange(arr.shape[0]), indexes)]
Or use np.in1d to leverage boolean indexing -
out = arr[~np.in1d(np.arange(arr.shape[0]), indexes)]
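If you want both halves of the split in one pass, the same idea can be written as a boolean row mask built directly from the sampled indexes, along the lines of the in1d answer above:

mask = np.ones(arr.shape[0], dtype=bool)
mask[indexes] = False
training = arr[mask]     # the rows that were not sampled
testing = arr[~mask]     # the sampled rows, in original row order rather than sampling order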
