More efficient way to build a dataset than using lists - Python

I am building a dataset for a sequence-to-point conv network, where each window is moved by one timestep.
Basically this loop is doing it:
x_train = []
y_train = []
for i in range(window, len(input_train)):
    x_train.append(input_train[i-window:i].tolist())
    y = target_train[i-window:i]
    y = y[int(len(y) / 2)]
    y_train.append(y)
When I'm using a big value for window, e.g. 500, I get a memory error.
Is there a way to build the training dataset more efficiently?

You should use pandas. It still might take too much space, but you can try:
import pandas as pd

# if input_train isn't a pd.Series already
input_train = pd.Series(input_train)
rolling_data = (w.reset_index(drop=True) for w in input_train.rolling(window))
x_train = pd.DataFrame(rolling_data).iloc[window - 1:]
# one midpoint target per window; the windows step by one, so the targets do too
y_train = target_train[window // 2 : window // 2 + len(x_train)]
Some explanations with an example:
Assuming a simple series:
>>> input_train = pd.Series([1, 2, 3, 4, 5])
>>> input_train
0    1
1    2
2    3
3    4
4    5
dtype: int64
We can create a dataframe with the windowed data like so:
>>> pd.DataFrame(input_train.rolling(2))
     0    1    2    3    4
0  1.0  NaN  NaN  NaN  NaN
1  1.0  2.0  NaN  NaN  NaN
2  NaN  2.0  3.0  NaN  NaN
3  NaN  NaN  3.0  4.0  NaN
4  NaN  NaN  NaN  4.0  5.0
The problem with this is that the values in each window keep their original indices (0 has 0, 1 has 1, etc.), so they end up in the corresponding columns. We can fix this by resetting the index of each window:
>>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2))
     0    1
0  1.0  NaN
1  1.0  2.0
2  2.0  3.0
3  3.0  4.0
4  4.0  5.0
The only thing left to do is to remove the first window - 1 rows, because those windows are incomplete (that is just how rolling works):
>>> pd.DataFrame(w.reset_index(drop=True) for w in input_train.rolling(2)).iloc[2-1:] # .iloc[1:]
     0    1
1  1.0  2.0
2  2.0  3.0
3  3.0  4.0
4  4.0  5.0
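As an alternative (not part of the original answer): if input_train is a numpy array, numpy 1.20+ offers sliding_window_view, which returns a zero-copy view of the overlapping windows, so nothing is duplicated in memory. A minimal sketch, assuming input_train and target_train are 1-D numpy arrays:

import numpy as np

# Zero-copy view: each row is input_train[j:j+window], no data duplicated.
# [:-1] drops the final window, which the question's loop never produces.
x_train = np.lib.stride_tricks.sliding_window_view(input_train, window)[:-1]
# Matching targets: the midpoint of each window, one per row of x_train.
y_train = target_train[window // 2 : window // 2 + len(x_train)]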

Related

number of time periods between the previous two periods where (non-zero) demand occurs in Python

I'm trying to get the time interval between the last two periods of non-zero demand. The final column should be as shown in nonzero_interval. TIA.
Edit:
I've added a link to the paper that motivated this question.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'y': [34, 12, 2, 0, 0, 0, 23, 0, 10, 0],
     'nonzero_interval': [np.nan, np.nan, 1, 1, 1, 1, 1, 4, 4, 2]})
print(df)
The idea comes from Forecasting Intermittent Demand Patterns with Time Series and Machine Learning Methodologies.
One method with numpy:
n = 2
s = df[df.y.ne(0)].index                           # positions of non-zero demand
a = np.diag(s.values - s.values[:, None], k=n-1)   # gaps between consecutive positions
df['New'] = pd.Series(a, index=s[n-1:])
df.New = df.New.shift(n-1).ffill()
df
y nonzero_interval New
0 34 NaN NaN
1 12 NaN NaN
2 2 1.0 1.0
3 0 1.0 1.0
4 0 1.0 1.0
5 0 1.0 1.0
6 23 1.0 1.0
7 0 4.0 4.0
8 10 4.0 4.0
9 0 2.0 2.0
IIUC, you can do it with groupby.transform with count. The groups are created with cumsum over y.ne(0), shifted by one, so a new group starts right after each non-zero value; transform('count') then gives each group's size, which is exactly the gap to the next non-zero value. Finally, the rows where y equals 0 are masked to NaN with where, and shift plus ffill spread each interval to the following rows.
df['nonzero_interval'] = (df.groupby(df['y'].ne(0).cumsum().shift())['y']
                            .transform('count')
                            .where(df['y'].ne(0))
                            .shift().ffill())
print(df)
y nonzero_interval
0 34 NaN
1 12 NaN
2 2 1.0
3 0 1.0
4 0 1.0
5 0 1.0
6 23 1.0
7 0 4.0
8 10 4.0
9 0 2.0
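To see why this works, one can print the intermediates (a sketch using the question's df; the values shown in the comments are what these steps produce for that data):

key = df['y'].ne(0).cumsum().shift()   # a new group starts after each non-zero value
print(key.tolist())                    # [nan, 1.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0]
counts = df.groupby(key)['y'].transform('count')   # size of each run
print(counts.tolist())                 # [nan, 1.0, 1.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 1.0]

Masking the zero rows, shifting, and forward-filling then turns these run lengths into the desired intervals.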

How to assign each element in an array column its ordered position?

I have a dataframe that looks like this:
df = pd.DataFrame({'group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4],
                   'x': [np.nan, np.nan, 3, np.nan, 2, np.nan, 3, 3, 4, 2, 1, 1, 3],
                   'y': [np.nan, np.nan, 2, np.nan, 1, np.nan, 1, 1, 5, 1, 5, 1, 1]})
group x y
1 nan nan
1 nan nan
1 3.0 2.0
1 nan nan
1 2.0 1.0
2 nan nan
2 3.0 1.0
2 3.0 1.0
2 4.0 5.0
3 2.0 1.0
3 1.0 5.0
4 1.0 1.0
4 3.0 1.0
Basically, let's say I have 4 groups and each group contains points with x, y coordinates. Points can have the same coordinates; for example, (3, 1) exists (twice) in group 2 and also in group 4. Furthermore, if x is NaN then y is also NaN.
I want to assign each pair (x, y) its position in the sorted list of unique pairs. If x = y = NaN then zero should be returned.
Hence the output should be:
group x y label_global
1 nan nan 0
1 nan nan 0
1 3.0 2.0 5
1 nan nan 0
1 2.0 1.0 3
2 nan nan 0
2 3.0 1.0 4
2 3.0 1.0 4
2 4.0 5.0 6
3 2.0 1.0 3
3 1.0 5.0 2
4 1.0 1.0 1
4 3.0 1.0 4
What I have done is the following:
centroids = sorted(set(zip(df.dropna().x.values, df.dropna().y.values)))
df['label_global'] = [centroids.index(d) + 1 if d[1] == d[1] else 0
                      for d in zip(df.x.values, df.y.values)]
Is there a better way to do this, please? My dataframe is about 2 million lines long and the task takes around 3 minutes to complete.
As a side note: in the last list comprehension, the expression if d[1]==d[1] else is meant to filter out tuples with NaN, since np.nan == np.nan evaluates to False. I had initially tried if np.nan not in d else, i.e.:
df['label_global'] = [centroids.index(d) + 1 if np.nan not in d else 0 for d in zip(df.x.values, df.y.values)]
but that doesn't work and I have no idea why. It returns a ValueError:
ValueError: (nan, nan) is not in list
which to me indicates that the if/else hasn't worked as intended. Any insights are very much welcome.
I also find it a bit strange that
(np.nan, np.nan)==(np.nan, np.nan) returns True
or even
(np.nan,)==(np.nan,) returns True
but
np.nan==np.nan returns False
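A short demonstration of what is going on here (not from the original post): Python's `in` operator and tuple `==` first compare elements by identity and only fall back to `==`, and np.nan is a single object, while values pulled out of a DataFrame are fresh float objects:

import numpy as np

print(np.nan == np.nan)        # False: IEEE 754 says NaN != NaN
print((np.nan,) == (np.nan,))  # True: tuple comparison short-circuits on identity
v = float('nan')               # a *different* NaN object, like those from df.x.values
print(np.nan in (v, v))        # False -> `np.nan not in d` is True -> .index() raises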
Sort by x, y pairs, setting NaN first, and use cumsum to set group numbers:
df['label_global'] = (df.sort_values(['x', 'y'], na_position='first')
                        [['x', 'y']].fillna(0).diff().ne([0, 0]).any(axis=1)
                        .cumsum() - 1)
group x y label_global
0 1 NaN NaN 0
1 1 NaN NaN 0
2 1 3.0 2.0 5
3 1 NaN NaN 0
4 1 2.0 1.0 3
5 2 NaN NaN 0
6 2 3.0 1.0 4
7 2 3.0 1.0 4
8 2 4.0 5.0 6
9 3 2.0 1.0 3
10 3 1.0 5.0 2
11 4 1.0 1.0 1
12 4 3.0 1.0 4
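If list.index is the bottleneck on 2 million rows, a dictionary lookup gives the same labels in roughly linear time instead of one linear scan per row. A sketch, not from the original answers (the NaN pairs never appear as keys, so .get falls back to the 0 default for them):

pairs = list(zip(df['x'], df['y']))
centroids = sorted({p for p in pairs if p[1] == p[1]})   # drop NaN pairs (NaN != NaN)
lookup = {p: i + 1 for i, p in enumerate(centroids)}     # pair -> 1-based sorted position
df['label_global'] = [lookup.get(p, 0) for p in pairs]   # 0 for the NaN pairs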

Perform arithmetic operations on null values

When I am trying to do an arithmetic operation involving two or more columns, I am facing a problem with null values.
One more thing I want to mention: I don't want to fill the missing/null values.
Actually, I want something like 1 + np.nan = 1, but it gives np.nan. I tried to solve it with np.nansum, but it didn't work.
df = pd.DataFrame({"a":[1,2,3,4],"b":[1,2,np.nan,np.nan]})
df
Out[6]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN NaN
3 4 NaN NaN
And,
df["d"] = np.nansum([df.a + df.b])
df
Out[13]:
a b d
0 1 1.0 6.0
1 2 2.0 6.0
2 3 NaN 6.0
3 4 NaN 6.0
But what I actually want is:
df
Out[10]:
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
The np.nansum here calculated the sum of the entire column. You do not want that; you probably want to call np.nansum on the two columns, like:
df['d'] = np.nansum((df.a, df.b), axis=0)
This then yields the expected:
>>> df
a b d
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
Simply use DataFrame.sum over axis=1 (NaN values are skipped by default):
df['c'] = df.sum(axis=1)
Output
a b c
0 1 1.0 2.0
1 2 2.0 4.0
2 3 NaN 3.0
3 4 NaN 4.0
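Another option worth knowing (not in the original answers) is Series.add with fill_value, which treats a missing operand as 0 but leaves rows where both values are NaN as NaN:

df['c'] = df['a'].add(df['b'], fill_value=0)   # 1 + NaN -> 1.0; NaN + NaN stays NaN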

Interpolate / fillna with a decay formula in pandas

Let's say I have the following pandas dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([1,2,4, None, None, None, None, -1, 1, None, None])
>>> df
0
0 1.0
1 2.0
2 4.0
3 NaN
4 NaN
5 NaN
6 NaN
7 -1.0
8 1.0
9 NaN
10 NaN
I want to fill the missing values with an exponential decay starting from the previous value, giving:
>>> df_result
0
0 1.0
1 2.0
2 4.0
3 4.0 # NaN replaced with previous value
4 2.0 # NaN replaced with previous value / 2
5 1.0 # NaN replaced with previous value / 2
6 0.5 # NaN replaced with previous value / 2
7 -1.0
8 1.0
9 1.0 # NaN replaced with previous value
10 0.5 # NaN replaced with previous value / 2
With fillna, I have method='pad', but I cannot fit my formula here.
With interpolate, I'm not sure I can give a specific exponential decay formula that takes only the last non-NaN value into account.
I'm thinking of creating a separate dataframe df_replacements initialised with 0.5 instead of the NaN and 0 elsewhere, doing a cumprod (somehow I'd need to reset the running product to 1 for every first NaN), and then df_result = df.fillna(df_replacements)
Is there a simple way to achieve this replacement with pandas?
In your case, fill the NaN forward, then groupby the runs of consecutive NaN and get the cumcount:
s = df[0].ffill()
df[0].fillna(s[df[0].isnull()].mul((1 / 2) ** (df[0].groupby(df[0].notnull().cumsum()).cumcount() - 1), 0))
Out[655]:
0 1.0
1 2.0
2 4.0
3 4.0
4 2.0
5 1.0
6 0.5
7 -1.0
8 1.0
9 1.0
10 0.5
Name: 0, dtype: float64
Edit by OP: same solution with more explicit variable names:
ffilled = df[0].ffill()
is_na = df[0].isnull()
group_ids = df[0].notnull().cumsum()
mul_factors = (1 / 2) ** (df[0].groupby(group_ids).cumcount() - 1)
result = df[0].fillna(ffilled[is_na].mul(mul_factors, 0))
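For readability, the same decay rule as a plain Python loop (a sketch, not from the answer; `out` and the `filled` column are hypothetical names, and it assumes the series starts with a real value): the first NaN in a run repeats the previous value, every later NaN halves it.

out = []
decay = 1.0
for v in df[0]:
    if v == v:             # not NaN (NaN != NaN)
        out.append(v)
        decay = 1.0        # reset: the first NaN after a value repeats it
    else:
        out.append(out[-1] * decay)
        decay = 0.5        # subsequent NaNs halve the running value
df['filled'] = out         # hypothetical column name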

Calculate the two rows following a row with a certain value

I have a dataframe with ones and NaN values and would like to set the two rows following each one to two and three.
import pandas as pd
df=pd.DataFrame({"b" : [1,None,None,None,None,1,None,None,None]})
print(df)
b
0 1.0
1 NaN
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
Like this:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN
I know I can use df.loc[df['b'] == 1] to retrieve the ones, but I don't know how to calculate the two rows below.
You can create a group variable where each 1 in b starts a new group, then forward fill 2 rows for each group, and do a cumsum:
g = (df.b == 1).cumsum()               # group id: each 1 starts a new group
df.b.groupby(g).apply(lambda s: s.ffill(limit=2).cumsum())
#0 1.0
#1 2.0
#2 3.0
#3 NaN
#4 NaN
#5 1.0
#6 2.0
#7 3.0
#8 NaN
#Name: b, dtype: float64
One without groupby:
temp = df.ffill(limit=2).cumsum()                    # running sum over the filled rows
temp - temp.mask(df.b.isnull()).ffill(limit=2) + 1   # subtract each run's starting cumsum
Out[91]:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN
Using your current line of thinking, you simply need the indices of the rows after the 1s and to set them to the appropriate values:
df.loc[np.where(df['b'] == 1)[0] + 1, 'b'] = 2
df.loc[np.where(df['b'] == 1)[0] + 2, 'b'] = 3
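One caveat with this approach (my note, not the answerer's): if a 1 sits in the last one or two rows, index + 1 or + 2 points past the end of the index and .loc will enlarge the DataFrame with new rows. Clipping the positions first guards against that:

import numpy as np

idx = np.where(df['b'] == 1)[0]
df.loc[idx[idx + 1 < len(df)] + 1, 'b'] = 2   # only positions still inside the frame
df.loc[idx[idx + 2 < len(df)] + 2, 'b'] = 3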
