How do I better perform this numpy calculation - python

I have text file something like this:
0 0 0 1 2
0 0 1 3 1
0 1 0 4 1
0 1 1 2 3
1 0 0 5 3
1 0 1 1 3
1 1 0 4 5
1 1 1 6 1
Let's label these columns as:
s1 a s2 r t
I also have another array with dummy values (for simplicity)
>>> V = np.array([10.,20.])
I want to perform a certain calculation on these numbers with good performance. The calculation is: for each s1, I want the maximum over a of the sum of t*(r+V[s1]).
For example,
for s1=0, a=0, we will have sum = 2*(1+10)+1*(3+10) = 35
for s1=0, a=1, we will have sum = 1*(4+10)+3*(2+10) = 50
So the max of these is 50, which is what I want to obtain as the output for s1=0.
Also note that, in the above calculation, 10 is V[s1].
If I didn't have the last three lines in the file, then for s1=1 I would simply return 3*(5+20)=75, where 20 is V[s1]. So the final desired result is [50, 75].
So I thought it would be good to have numpy load it as follows (considering only the values for s1=0 for simplicity):
>>> c1=[[ [ [0,1,2],[1,3,1] ],[ [0,4,1],[1,2,3] ] ]]
>>> import numpy as np
>>> c1arr = np.array(c1)
>>> c1arr #when I actually load from file, its not loading as this (check Q2 below)
array([[[[0, 1, 2],
         [1, 3, 1]],

        [[0, 4, 1],
         [1, 2, 3]]]])
>>> np.sum(c1arr[0,0][:,2]*(c1arr[0,0][:,1]+V)) #sum over t*(r+V)
45.0
Q1. I cannot figure out how to modify the above to get the numpy array [45.0, 80.0], so that I can take numpy.max over it.
Q2. When I actually load the file, I am not able to load it as c1arr, as stated in the comment above. Instead, I am getting it as follows:
>>> type(a) #a is populated by parsing file
<class 'list'>
>>> print(a)
[[[[0, -0.9, 0.3], [1, 0.9, 0.6]], [[0, -0.2, 0.6], [1, 0.7, 0.3]]], [[[1, 0.2, 1.0]], [[0, -0.8, 1.0]]]]
>>> np.array(a) #note that this is not same as c1arr above
<string>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
array([[list([[0, -0.9, 0.3], [1, 0.9, 0.6]]),
list([[0, -0.2, 0.6], [1, 0.7, 0.3]])],
[list([[1, 0.2, 1.0]]),
list([[0, -0.8, 1.0]])]], dtype=object)
How can I fix this?
Q3. Is there any overall better approach, say by laying out the numpy array differently? (Given I am not allowed to use pandas, but only numpy)

In my opinion, the most intuitive and maintainable approach
is to use Pandas, where you can assign names to columns.
Another important factor is that grouping is much easier in Pandas.
As your input sample contains only integers, I defined V
also as an array of integers:
V = np.array([10, 20])
I read your input file as follows:
df = pd.read_csv('Input.txt', sep=' ', names=['s1', 'a', 's2', 'r', 't'])
(print it to see what has been read).
Then, to get results for each combination of s1 and a,
you can run:
result = df.groupby(['s1', 'a']).apply(lambda grp:
    (grp.t * (grp.r + V[grp.s1])).sum())
Note that as you refer to named columns, this code is easy to read.
The result is:
s1  a
0   0     35
    1     50
1   0    138
    1    146
dtype: int64
Each result is an integer because V is also an array of int type. But if you define it just as in your post (an array of floats), the results will also be of float type (your choice).
If you want the max result for each s1, run:
result.max(level=0)
This time the result is:
s1
0 50
1 146
dtype: int64
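If your version of pandas no longer accepts the level argument of max (it was deprecated and later removed), grouping by the index level gives the same result:
result.groupby(level=0).max()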
The Numpy version
If you really are restricted to Numpy, there is also a solution,
although more difficult to read and update.
Read your input file:
data = np.genfromtxt('Input.txt')
Initially I tried int type, just like in the pandasonic solution,
but one of your comments states that the 2 rightmost columns are floats.
So, because a Numpy array must be of a single type, the whole
array must be of float type.
Run the following code:
res = []
# First level grouping - by "s1" (column 0)
for s1 in np.unique(data[:, 0]).astype(int):
    dat1 = data[np.where(data[:, 0] == s1)]
    res2 = []
    # Second level grouping - by "a" (column 1)
    for a in np.unique(dat1[:, 1]):
        dat2 = dat1[np.where(dat1[:, 1] == a)]
        # t - column 4, r - column 3
        res2.append((dat2[:, 4] * (dat2[:, 3] + V[s1])).sum())
    res.append([s1, max(res2)])
result = np.array(res)
The result (a Numpy array) is:
array([[  0.,  50.],
       [  1., 146.]])
The left column contains the s1 values and the right column the maximum
group values from the second-level grouping.
The Numpy version with a structured array
Actually, you can also use a Numpy structured array.
Then the code is at least more readable, because you refer to column names,
not to column numbers.
Read the array, passing a dtype with column names and types (this variant reads the data from a string txt, so it needs import io; you can equally pass 'Input.txt' directly as above):
import io
data = np.genfromtxt(io.StringIO(txt), dtype=[('s1', '<i4'), ('a', '<i4'),
                                              ('s2', '<i4'), ('r', '<f8'), ('t', '<f8')])
Then run:
res = []
# First level grouping - by "s1"
for s1 in np.unique(data['s1']):
    dat1 = data[np.where(data['s1'] == s1)]
    res2 = []
    # Second level grouping - by "a"
    for a in np.unique(dat1['a']):
        dat2 = dat1[np.where(dat1['a'] == a)]
        res2.append((dat2['t'] * (dat2['r'] + V[s1])).sum())
    res.append([s1, max(res2)])
result = np.array(res)
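If you want to avoid the Python-level loops entirely, one possible fully vectorized sketch (assuming s1 and a are small non-negative integer codes; any (s1, a) combination absent from the file is simply left as 0 in the table) is:
import numpy as np

V = np.array([10., 20.])
data = np.genfromtxt('Input.txt')      # columns: s1 a s2 r t
s1 = data[:, 0].astype(int)
a = data[:, 1].astype(int)
r, t = data[:, 3], data[:, 4]

# accumulate the sum of t*(r + V[s1]) into an (s1, a) table, then take the max over "a"
sums = np.zeros((s1.max() + 1, a.max() + 1))
np.add.at(sums, (s1, a), t * (r + V[s1]))
result = sums.max(axis=1)              # array([ 50., 146.])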

Related

Is it possible to use pandas.DataFrame.rolling with a step greater than 1?

In R you can compute a rolling mean with a specified window that can shift by a specified amount each time.
However, maybe I just haven't found it, but it doesn't seem like you can do this in pandas or any other Python library.
Does anyone know of a way around this? I'll give you an example of what I mean:
Here we have bi-weekly data, and I am computing the two month moving average that shifts by 1 month which is 2 rows.
So in R I would do something like: two_month__movavg=rollapply(mydata,4,mean,by = 2,na.pad = FALSE)
Is there no equivalent in Python?
EDIT1:
DATE A DEMAND ... AA DEMAND A Price
0 2006/01/01 00:30:00 8013.27833 ... 5657.67500 20.03
1 2006/01/01 01:00:00 7726.89167 ... 5460.39500 18.66
2 2006/01/01 01:30:00 7372.85833 ... 5766.02500 20.38
3 2006/01/01 02:00:00 7071.83333 ... 5503.25167 18.59
4 2006/01/01 02:30:00 6865.44000 ... 5214.01500 17.53
So, I know it has been a long time since the question was asked, but I bumped into this same problem, and when dealing with long time series you really want to avoid unnecessary calculation of the values you are not interested in. Since Pandas' rolling method does not implement a step argument, I wrote a workaround using numpy.
It is basically a combination of the solution in this link and the indexing proposed by BENY.
def apply_rolling_data(data, col, function, window, step=1, labels=None):
    """Perform a rolling window analysis at the column `col` from `data`

    Given a dataframe `data` with time series, call `function` at
    sections of length `window` at the data of column `col`. Append
    the results to `data` at new columns named after `labels`.

    Parameters
    ----------
    data : DataFrame
        Data to be analyzed; the dataframe must store time series
        columnwise, i.e., each column represents a time series and each
        row a time index
    col : str
        Name of the column from `data` to be analyzed
    function : callable
        Function to be called to calculate the rolling window
        analysis; the function must receive as input an array or
        pandas series. Its output must be either a number or a pandas
        series
    window : int
        Length of the window to perform the analysis
    step : int
        Step to take between two consecutive windows
    labels : str
        Name of the column for the output; if None it defaults to
        'MEASURE'. It is only used if `function` outputs a number; if
        it outputs a Series then each index of the series is going to
        be used as the name of its respective column in the output

    Returns
    -------
    data : DataFrame
        Input dataframe with added columns holding the result of the
        analysis performed
    """
    x = _strided_app(data[col].to_numpy(), window, step)
    rolled = np.apply_along_axis(function, 1, x)

    if labels is None:
        labels = [f"metric_{i}" for i in range(rolled.shape[1])]
    for col in labels:
        data[col] = np.nan
    data.loc[
        data.index[
            [False] * (window - 1)
            + list(np.arange(len(data) - (window - 1)) % step == 0)],
        labels] = rolled
    return data


def _strided_app(a, L, S):  # Window len = L, Stride len/stepsize = S
    """Return a strided (rolling window) view of the array"""
    nrows = ((a.size - L) // S) + 1
    n = a.strides[0]
    return np.lib.stride_tricks.as_strided(
        a, shape=(nrows, L), strides=(S * n, n))
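A minimal usage sketch (the column name 'Price', the toy data, and the per-window statistics below are just placeholders; here the function returns a Series, so each of its index entries becomes one output column):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Price': np.arange(20, dtype=float)})

def stats(w):
    # per-window statistics returned as a Series -> one output column per entry
    return pd.Series({'win_mean': np.mean(w), 'win_std': np.std(w)})

df = apply_rolling_data(df, 'Price', stats, window=4, step=2,
                        labels=['win_mean', 'win_std'])
# rows 3, 5, 7, ... now hold the window results; all other rows are NaN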
You can use rolling again; you just need a little bit of work with how you assign the index.
Here by = 2:
by = 2
df.loc[df.index[np.arange(len(df))%by==1],'New']=df.Price.rolling(window=4).mean()
df
Price New
0 63 NaN
1 92 NaN
2 92 NaN
3 5 63.00
4 90 NaN
5 3 47.50
6 81 NaN
7 98 68.00
8 100 NaN
9 58 84.25
10 38 NaN
11 15 52.75
12 75 NaN
13 19 36.75
If the data size is not too large, here is an easy way:
by = 2
win = 4
start = 3 ## it is the index of your 1st valid value.
df.rolling(win).mean()[start::by] ## calculate all, choose what you need.
Now this is a bit of overkill for a 1D array of data, but you can simplify it and pull out what you need. Since pandas relies on numpy, you might want to check to see how their rolling/strided functions are implemented.
Results for 20 sequential numbers, with a 7-day window, striding/sliding by 2:
z = np.arange(20)
z #array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
s = stride(z, (7,), (2,))
np.mean(s, axis=1) # array([ 3., 5., 7., 9., 11., 13., 15.])
Here is the code I use, without the major portion of the documentation. It is derived from the many implementations of strided functions in numpy that can be found on this site. There are variants and incarnations; this is just another.
def stride(a, win=(3, 3), stepby=(1, 1)):
    """Provide a 2D sliding/moving view of an array.

    There is no edge correction for outputs. Use the `pad_` function first."""
    err = """Array shape, window and/or step size error.
    Use win=(3,) with stepby=(1,) for 1D array
    or win=(3,3) with stepby=(1,1) for 2D array
    or win=(1,3,3) with stepby=(1,1,1) for 3D
    ----    a.ndim != len(win) != len(stepby) ----
    """
    from numpy.lib.stride_tricks import as_strided
    a_ndim = a.ndim
    if isinstance(win, int):
        win = (win,) * a_ndim
    if isinstance(stepby, int):
        stepby = (stepby,) * a_ndim
    assert (a_ndim == len(win)) and (len(win) == len(stepby)), err
    shp = np.array(a.shape)    # array shape (r, c) or (d, r, c)
    win_shp = np.array(win)    # window      (3, 3) or (1, 3, 3)
    ss = np.array(stepby)      # step by     (1, 1) or (1, 1, 1)
    newshape = tuple(((shp - win_shp) // ss) + 1) + tuple(win_shp)
    newstrides = tuple(np.array(a.strides) * ss) + a.strides
    a_s = as_strided(a, shape=newshape, strides=newstrides, subok=True).squeeze()
    return a_s
I failed to point out that you can create an output that you could append as a column into pandas. Going back to the original definitions used above:
nans = np.full_like(z, np.nan, dtype='float') # z is the 20 number sequence
means = np.mean(s, axis=1) # results from the strided mean
# assign the means to the output array skipping the first and last 3 and striding by 2
nans[3:-3:2] = means
nans # array([nan, nan, nan, 3., nan, 5., nan, 7., nan, 9., nan, 11., nan, 13., nan, 15., nan, nan, nan, nan])
Since pandas 1.5.0, there is a step parameter to rolling() that should do the trick. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html
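For example, assuming a DataFrame df with a Price column as in the answers above, the mean over a 4-row window evaluated only at every second row would be:
df.Price.rolling(window=4, step=2).mean()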
Using Pandas.asfreq() after rolling

Replace a column of values with a column of vectors in pandas

I'm using python pandas to organize some measurements values in a DataFrame.
One of the columns holds values that I want to convert into 2D-vectors, so let's say the column contains values like these:
col1
25
12
14
21
I want to have the values of this column changed one by one (in a for loop):
for value in values:
    df['col1'][value] = convert2Vector(df['col1'][value])
So that the column col1 becomes:
col1
[-1. 21.]
[-1. -2.]
[-15. 54.]
[11. 2.]
The values are only examples and the function convert2Vector() converts the angle to a 2D-vector.
With the for-loop that I wrote, it doesn't work. I get the error:
ValueError: setting an array element with a sequence.
Which I can understand.
So the question is: How to do it?
That exception comes from the fact that you want to insert a list or array into a column (array) that stores ints. Arrays in Pandas and NumPy can't have a "ragged shape", so you can't have 2 elements in one row and 1 element in all the others (except maybe with masking).
To make it work you need to store "general" objects. For example:
import pandas as pd
df = pd.DataFrame({'col1' : [25, 12, 14, 21]})
df.col1[0] = [1, 2]
# ValueError: setting an array element with a sequence.
But this works:
>>> df.col1 = df.col1.astype(object)
>>> df.col1[0] = [1, 2]
>>> df
col1
0 [1, 2]
1 12
2 14
3 21
Note: I wouldn't recommend doing that because object columns are much slower than specifically typed columns. But since you're iterating over the Column with a for loop it seems you don't need the performance so you can also use an object array.
What you should be doing if you want it fast is vectorize the convert2vector function and assign the result to two columns:
import pandas as pd
import numpy as np
def convert2Vector(angle):
    """I don't know what your function does so this is just something that
    calculates the sin and cos of the input..."""
    ret = np.zeros((angle.size, 2), dtype=float)
    ret[:, 0] = np.sin(angle)
    ret[:, 1] = np.cos(angle)
    return ret
>>> df = pd.DataFrame({'col1' : [25, 12, 14, 21]})
>>> df['col2'] = [0]*len(df)
>>> df[['col1', 'col2']] = convert2Vector(df.col1)
>>> df
col1 col2
0 -0.132352 0.991203
1 -0.536573 0.843854
2 0.990607 0.136737
3 0.836656 -0.547729
You should call a method like df.apply or df.transform, which creates a new column that you then assign back:
In [1022]: df.col1.apply(lambda x: [x, x // 2])
Out[1022]:
0 [25, 12]
1 [12, 6]
2 [14, 7]
3 [21, 10]
Name: col1, dtype: object
In your case, you would do:
df['col1'] = df.col1.apply(convert2Vector)

Split a pandas dataframe by a list of values from another data frame

I'm pretty sure there's a really simple solution for this and I'm just not realising it. However...
I have a data frame of high-frequency data. Call this data frame A. I also have a separate list of far lower frequency demarcation points, call this B. I would like to append a column to A that would display 1 if A's timestamp column is between B[0] and B[1], 2 if it is between B[1] and B[2], and so on.
As said, it's probably incredibly trivial, and I'm just not realising it at this late an hour.
Here is a quick and dirty approach using a list comprehension.
>>> df = pd.DataFrame({'A': np.arange(1, 3, 0.2)})
>>> A = df.A.values.tolist()
A: [1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.5, 2.6, 2.8]
>>> B = np.arange(0, 3, 1).tolist()
B: [0, 1, 2]
>>> BA = [k for k in range(0, len(B)-1) for a in A if (B[k]<=a) & (B[k+1]>a) or (a>max(B))]
BA: [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
Use searchsorted:
A['group'] = B['timestamp'].searchsorted(A['timestamp'])
For each value in A['timestamp'], an index value is returned. That index indicates where amongst the sorted values in B['timestamp'] that value from A would be inserted into B in order to maintain sorted order.
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
N = 10
A = pd.DataFrame({'timestamp':np.random.uniform(0, 1, size=N).cumsum()})
B = pd.DataFrame({'timestamp':np.random.uniform(0, 3, size=N).cumsum()})
# timestamp
# 0 1.739869
# 1 2.467790
# 2 2.863659
# 3 3.295505
# 4 5.106419
# 5 6.872791
# 6 7.080834
# 7 9.909320
# 8 11.027117
# 9 12.383085
A['group'] = B['timestamp'].searchsorted(A['timestamp'])
print(A)
yields
timestamp group
0 0.896705 0
1 1.626945 0
2 2.410220 1
3 3.151872 3
4 3.613962 4
5 4.256528 4
6 4.481392 4
7 5.189938 5
8 5.937064 5
9 6.562172 5
Thus, the timestamp 0.896705 is in group 0 because it comes before B['timestamp'][0] (i.e. 1.739869). The timestamp 2.410220 is in group 1 because it is larger than B['timestamp'][0] (i.e. 1.739869) but smaller than B['timestamp'][1] (i.e. 2.467790).
You should also decide what to do if a value in A['timestamp'] is exactly equal to one of the cutoff values in B['timestamp']. Use
B['timestamp'].searchsorted(A['timestamp'], side='left')
if you want a timestamp exactly equal to B['timestamp'][i] to be assigned index i, and use
B['timestamp'].searchsorted(A['timestamp'], side='right')
if you want it to be assigned index i+1 in that situation. If you don't specify side, then side='left' is used by default.
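A tiny illustration of the difference (hypothetical values, not the random data above):
import numpy as np
b = np.array([1.0, 2.0, 3.0])
np.searchsorted(b, 2.0, side='left')   # 1 - a tie goes to the earlier group
np.searchsorted(b, 2.0, side='right')  # 2 - a tie goes to the later group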

Create a vector with values from a Numpy array selected according to criteria in a Pandas DataFrame

I am working with a pandas df that contains two columns of integers. For each row of the df, I would like to select these two integers, use them as a [row, column] pair to extract a value from a np.array, and collect the extracted values into a new np.array.
In more detail, my df contains the following entries:
State FutureState
DATE
1947-10-01 0 0
1948-01-01 0 1
1948-04-01 1 1
1948-07-01 1 1
For each Date, I would like to select the [State,FutureState] pair and extract the corresponding [row,column] item from the following np.array, called P:
array([[ 0.7, 0.3],
[ 0.4, 0.6]])
With these values, I would like to create a new np.array called Transition, which contains of the following values:
[P[0,0],P[0,1],P[1,1],P[1,1]] = [0.7, 0.3, 0.6, 0.6]
The pairs [0,0], [0,1], [1,1] [1,1] used as index for the array P are the values for [State,FutureState] for each date ( 1947-10-01, 1948-01-01 , 1948-04-01, 1948-07-01 ).
I already tried to solve my problem in a lot of different ways but to no avail. Can somebody kindly suggest how to successfully create the Transition vector?
try this:
p[df.State, df.FutureState]
Here is the full code:
import io
import pandas as pd
import numpy as np
txt = """ State FutureState
1947-10-01 0 0
1948-01-01 0 1
1948-04-01 1 1
1948-07-01 1 1"""
df = pd.read_csv(io.BytesIO(txt), delim_whitespace=True)
p = np.array([[ 0.7, 0.3], [ 0.4, 0.6]])
p[df.State, df.FutureState]
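Given the P and the State/FutureState columns from the question, this fancy-indexing call should return the desired Transition values in row order:
array([ 0.7,  0.3,  0.6,  0.6])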
How about this?
df.apply(lambda x:P[x[0],x[1]], axis=1)
It does what you describe, go row-wise (so apply over axis=1) along df and use the entries as index for selecting in P.

how to read from an array without a particular column in python

I have a numpy array of dtype = object (the elements are actually lists of various data types), so it effectively forms a 2D structure because I have an array of lists. I want to copy every row but only certain columns of this array to another array. I stored data in this array from a csv file. This csv file contains several fields (columns) and a large number of rows. Here's the code chunk I used to store data into the array:
data = np.zeros((401125,), dtype=object)
for i, row in enumerate(csv_file_object):
    data[i] = row
data can be basically depicted as follows
column1 column2 column3 column4 column5 ....
1 none 2 'gona' 5.3
2 34 2 'gina' 5.5
3 none 2 'gana' 5.1
4 43 2 'gena' 5.0
5 none 2 'guna' 5.7
..... .... ..... ..... ....
..... .... ..... ..... ....
..... .... ..... ..... ....
There're unwanted fields in the middle that I want to remove. Suppose I don't want column3.
How do I remove only that column from my array? Or copy only relevant columns to another array?
Use pandas. It also seems to me that for mixed-type data like yours, a pandas.DataFrame may be a better fit.
from StringIO import StringIO
from pandas import *
import numpy as np
data = """column1 column2 column3 column4 column5
1 none 2 'gona' 5.3
2 34 2 'gina' 5.5
3 none 2 'gana' 5.1
4 43 2 'gena' 5.0
5 none 2 'guna' 5.7"""
data = StringIO(data)
print read_csv(data, delim_whitespace=True).drop('column3',axis =1)
out:
column1 column2 column4 column5
0 1 none 'gona' 5.3
1 2 34 'gina' 5.5
2 3 none 'gana' 5.1
3 4 43 'gena' 5.0
4 5 none 'guna' 5.7
If you need an array instead of DataFrame, use the to_records() method:
df.to_records(index = False)
#output:
rec.array([(1L, 'none', "'gona'", 5.3),
(2L, '34', "'gina'", 5.5),
(3L, 'none', "'gana'", 5.1),
(4L, '43', "'gena'", 5.0),
(5L, 'none', "'guna'", 5.7)],
dtype=[('column1', '<i8'), ('column2', '|O4'),
('column4', '|O4'), ('column5', '<f8')])
Assuming you're reading the CSV rows and sticking them into a numpy array, the easiest and best solution is almost definitely preprocessing the data before it gets to the array, as Maciek D.'s answer shows. (If you want to do something more complicated than "remove column 3" you might want something like [value for i, value in enumerate(row) if i not in (1, 3, 5)], but the idea is still the same.)
However, if you've already imported the array and you want to filter it after the fact, you probably want take or delete:
>>> d=np.array([[1,None,2,'gona',5.3],[2,34,2,'gina',5.5],[3,None,2,'gana',5.1],[4,43,2,'gena',5.0],[5,None,2,'guna',5.7]])
>>> np.delete(d, 2, 1)
array([[1, None, gona, 5.3],
[2, 34, gina, 5.5],
[3, None, gana, 5.1],
[4, 43, gena, 5.0],
[5, None, guna, 5.7]], dtype=object)
>>> np.take(d, [0, 1, 3, 4], 1)
array([[1, None, gona, 5.3],
[2, 34, gina, 5.5],
[3, None, gana, 5.1],
[4, 43, gena, 5.0],
[5, None, guna, 5.7]], dtype=object)
For the simple case of "remove column 3", delete makes more sense; for a more complicated case, take probably makes more sense.
If you haven't yet worked out how to import the data in the first place, you could either use the built-in csv module and something like Maciek D.'s code and process as you go, or use something like pandas.read_csv and post-process the result, as root's answer shows.
But it might be better to use a native numpy data format in the first place instead of CSV.
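For instance, a minimal sketch using numpy's own .npy format (the file name is just an example; allow_pickle is needed on load because the array holds Python lists):
import numpy as np
np.save('data.npy', data)                      # data is the object array built above
data = np.load('data.npy', allow_pickle=True)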
You can use slicing (range selection). E.g., to remove column3, you can use:
data = np.zeros((401125,), dtype=object)
for i, row in enumerate(csv_file_object):
    data[i] = row[:2] + row[3:]
This will work, assuming that csv_file_object yields lists. If it is e.g. a simple file object created with csv_file_object = open("file.csv"), add a split in your loop:
data = np.zeros((401125,), dtype=object)
for i, row in enumerate(csv_file_object):
    row = row.split()
    data[i] = row[:2] + row[3:]
