I have to create a sliding window of input data with window size = 3
Dataframe
0 1
0 1 2
1 3 4
2 5 6
3 7 8
Desired output:
0 1 2 3 4 5
0 1 2 NA NA NA NA
1 3 4 1 2 NA NA
2 5 6 3 4 1 2
3 7 8 5 6 3 4
I used data.values.flatten(), but it converts all the rows of the dataframe into a nested list.
How can I create a sliding window of the input data (with a desired window length) from a dataframe?
You can just concat:
new_df = pd.concat([df.shift(i) for i in range(3)], axis=1)
# rename columns
new_df.columns = np.arange(new_df.shape[1])
output:
0 1 2 3 4 5
0 1 2 NaN NaN NaN NaN
1 3 4 1.0 2.0 NaN NaN
2 5 6 3.0 4.0 1.0 2.0
3 7 8 5.0 6.0 3.0 4.0
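For a self-contained check, the same idea can be wrapped in a small helper. This is only a sketch: the function name sliding_window and its window parameter are made up for illustration, and the sample frame is rebuilt from the question.

```python
import numpy as np
import pandas as pd


def sliding_window(df, window):
    """Place df next to its row-shifted copies (shift 0 .. window-1)."""
    shifted = [df.shift(i) for i in range(window)]
    out = pd.concat(shifted, axis=1)
    # Replace the repeated column labels with a flat 0..n-1 range.
    out.columns = np.arange(out.shape[1])
    return out


# Sample data from the question.
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]])
new_df = sliding_window(df, 3)
print(new_df)
```

The shifted columns become float because NaN is introduced in the first rows. If only complete windows are wanted, new_df.dropna() removes the rows that still contain NaN.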
I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each value is the count of non-NaN values up to and including that row in the same column. The resulting new DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following code:
for i in range(len(df)):
df.iloc[i] = df.iloc[0:i].isna().sum()
However, I can only do so with an individual column. My real DataFrame contains thousands of columns, so iterating over them is impractical because it is too slow. What can I do? Maybe it should be something related to the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna for the non-NaN values and cumsum for the counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
Use notna with cumsum:
out = df.notna().cumsum()
Out[220]:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
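To try this on the sample data, here is a minimal runnable sketch. The frame is rebuilt by hand from the question; the variable before is an extra, hypothetical variant that counts only the rows strictly above the current one, in case that reading of "before" is the one needed.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "a": [nan, nan, nan, 7, 3, 5, 7, 8, nan, nan],
    "b": [8, 7, 5, 3, 5, 4, 1, 9, 5, 6],
    "c": [nan] * 7 + [3, 5, 4],
})

# notna() gives True/False per cell; cumsum() counts the True values
# column by column, including the current row.
out = df.notna().cumsum()

# Variant: count only rows strictly before the current one.
before = out.shift(fill_value=0)

print(out)
```

Both steps work on the whole DataFrame at once, so no loop over columns is needed.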
I have a df like this:
time data
0 1
1 1
2 nan
3 nan
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 nan
Is there a way to use pd.Series.ffill() to forward fill only for certain values? Specifically, I want to forward fill only if the value in df.data is 1 or 4. The result should look like this:
time data
0 1
1 1
2 1
3 1
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 4
One option would be to forward fill (ffill) the column, then only populate where the values are 1 or 4 using (isin) and (mask):
s = df['data'].ffill()
df['data'] = df['data'].mask(s.isin([1, 4]), s)
df:
time data
0 0 1.0
1 1 1.0
2 2 1.0
3 3 1.0
4 4 6.0
5 5 NaN
6 6 NaN
7 7 NaN
8 8 5.0
9 9 4.0
10 10 4.0
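The two lines above can be checked end to end with the sample data. This is a sketch with the frame rebuilt by hand from the question; the list [1, 4] holds the values that are allowed to propagate.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "time": range(11),
    "data": [1, 1, nan, nan, 6, nan, nan, nan, 5, 4, nan],
})

# Forward fill everything first, so each gap knows which value precedes it.
s = df["data"].ffill()

# Where the filled value is 1 or 4, take it; elsewhere keep the original
# (so gaps that follow 6 or 5 stay NaN).
df["data"] = df["data"].mask(s.isin([1, 4]), s)

print(df)
```

Because the mask is built from the forward-filled series, the decision depends on the last non-NaN value before each gap, not on the gap itself.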
I have 2 dataframes:
df1 =
item sale
0 7 10.0
1 4 10.0
2 6 10.0
3 5 10.0
4 5 10.0
5 6 10.0
6 4 10.0
df2 =
item sale
0 1 7
1 2 6
2 3 5
3 4 4
4 5 3
I want to replace the values in df1's sale column with the values from df2's sale column, matched on item.
I use the code:
df1.loc[df1.item.isin(df2.item), ['sale']] = df2[['sale']]
And I get
df1 =
item sale
0 7 10.0
1 4 6.0
2 6 10.0
3 5 4.0
4 5 3.0
5 6 10.0
6 4 NaN
The output I wanted was:
df1 =
item sale
0 7 10.0
1 4 4.0
2 6 10.0
3 5 3.0
4 5 3.0
5 6 10.0
6 4 4.0
The two dataframes are related by item number. So set the item number as the index on both dataframes, run the update method on df1 with df2, and reset the index:
df1 = df1.set_index("item")
df1.update(df2.set_index("item"))
df1.reset_index()
item sale
0 7 10.0
1 4 4.0
2 6 10.0
3 5 3.0
4 5 3.0
5 6 10.0
6 4 4.0
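The original attempt misbehaves because the .loc assignment aligns df2 to df1 on the row index, not on item, which is why rows 1, 3 and 4 received df2's rows 1, 3 and 4 and row 6 became NaN. As an alternative to update, the lookup can also be written with map. This is a sketch of a different technique, not the method from the answer above; the name lookup is made up for illustration.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({"item": [7, 4, 6, 5, 5, 6, 4], "sale": [10.0] * 7})
df2 = pd.DataFrame({"item": [1, 2, 3, 4, 5], "sale": [7, 6, 5, 4, 3]})

# Series indexed by item, used as an item -> sale lookup table.
lookup = df2.set_index("item")["sale"]

# Map each item to its new sale; items missing from df2 give NaN,
# which fillna replaces with the existing df1 value.
df1["sale"] = df1["item"].map(lookup).fillna(df1["sale"])

print(df1)
```

This keeps df1's original index and column order, so no set_index/reset_index round trip is needed. It assumes item is unique in df2.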
Suppose I have the following dataset:
Stock_id Week Stock_value
1 1 2
1 2 4
1 4 7
1 5 1
2 3 8
2 4 6
2 5 5
2 6 3
I want to shift the values of the Stock_value column by one position so that I get the following:
Stock_id Week Stock_value
1 1 NA
1 2 2
1 4 4
1 5 7
2 3 NA
2 4 8
2 5 6
2 6 5
What I am doing is the following:
df = pd.read_csv('C:/Users/user/Desktop/test.txt', keep_default_na=True, sep='\t')
df = df.groupby('Store_id', as_index=False)['Waiting_time'].transform(lambda x:x.shift(periods=1))
But then this gives me:
Waiting_time
0 NaN
1 2.0
2 4.0
3 7.0
4 NaN
5 8.0
6 6.0
7 5.0
So it gives me the values shifted but it does not retain all the columns of the dataframe.
How do I also retain all the columns of the dataframe along with shifting the values of one column?
You can simplify the solution by using DataFrameGroupBy.shift and assigning the result back to a new column:
df['Waiting_time'] = df.groupby('Stock_id')['Stock_value'].shift()
This works the same as:
df['Waiting_time']=df.groupby('Stock_id')['Stock_value'].transform(lambda x:x.shift(periods=1))
print (df)
Stock_id Week Stock_value Waiting_time
0 1 1 2 NaN
1 1 2 4 2.0
2 1 4 7 4.0
3 1 5 1 7.0
4 2 3 8 NaN
5 2 4 6 8.0
6 2 5 5 6.0
7 2 6 3 5.0
When you do df.groupby('Store_id', as_index=False)['Waiting_time'], you select the single column 'Waiting_time', so the result of transform contains only that column. You can't recover the other columns from it.
As suggested by jezrael in the comment, you should do
df['new col'] = df.groupby('Store_id...
to add this new column to your previously existing DataFrame.
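If the goal is exactly the output shown in the question (the shifted values replacing Stock_value, with every other column kept), the result can be assigned back to the same column. A minimal sketch, with the frame rebuilt by hand from the question instead of read from the file:

```python
import pandas as pd

df = pd.DataFrame({
    "Stock_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "Week": [1, 2, 4, 5, 3, 4, 5, 6],
    "Stock_value": [2, 4, 7, 1, 8, 6, 5, 3],
})

# Shift within each Stock_id group; the result keeps df's row index,
# so it can be assigned straight back without losing other columns.
df["Stock_value"] = df.groupby("Stock_id")["Stock_value"].shift()

print(df)
```

The first row of each group becomes NaN, so the column is converted to float.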
My dataset looks like this (the first row is the header):
0 1 2 3 4 5
1 3 4 6 2 3
3 8 9 3 2 4
2 2 3 2 1 2
I want to select a range of columns in each row based on the value in column 5, e.g.:
1 3 4
3 8 9 3
2 2
I have tried the following, but it did not work:
df.iloc[:,0:df['5'].values]
Let's try:
df.apply(lambda x: x.iloc[:x.iloc[5]], axis=1)
Output:
0 1 2 3
0 1.0 3.0 4.0 NaN
1 3.0 8.0 9.0 3.0
2 2.0 2.0 NaN NaN
Alternatively, recreate your dataframe from the trimmed rows, filling the gaps with 0:
df=pd.DataFrame([x[:x[5]] for x in df.values]).fillna(0)
df
Out[184]:
0 1 2 3
0 1 3 4.0 0.0
1 3 8 9.0 3.0
2 2 2 0.0 0.0
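For larger frames, the per-row slicing can also be expressed as one boolean mask built with NumPy broadcasting. This is a sketch of an alternative technique, not taken from the answers above; it assumes the column labels are the strings '0' to '5' (as df['5'] in the question suggests), and the names counts, positions, keep and trimmed are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 3, 4, 6, 2, 3],
     [3, 8, 9, 3, 2, 4],
     [2, 2, 3, 2, 1, 2]],
    columns=list("012345"),
)

counts = df["5"].to_numpy()            # how many leading columns to keep per row
positions = np.arange(df.shape[1])     # column positions 0..5

# keep[i, j] is True when column j lies inside row i's range.
keep = positions < counts[:, None]

# Blank out everything outside the range, then drop columns that are empty.
trimmed = df.where(keep).dropna(axis=1, how="all")

print(trimmed)
```

The mask is computed for the whole frame at once, which avoids calling a Python function per row.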