I have big data in a CSV file, which I managed to open in a Jupyter Notebook.
The data in the CSV looks like this, for example: 1 2 3 4 5 6 7 8 9 10
I want to view it in the notebook as a rolling window of size 3, without applying any aggregation (sum, mean, for example).
The output I want in the notebook is each row showing one window of three consecutive values.
First, open the CSV to get the first column:
import pandas as pd
df = pd.read_csv("filename.csv")
I will use io only to simulate reading the data from a file:
text = """first
1
2
3
4
5
6
7
8
9
10"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
Result
first
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
Next, you can use shift to create the other columns:
df['second'] = df['first'].shift(-1)
df['third'] = df['first'].shift(-2)
Result
first second third
0 1 2.0 3.0
1 2 3.0 4.0
2 3 4.0 5.0
3 4 5.0 6.0
4 5 6.0 7.0
5 6 7.0 8.0
6 7 8.0 9.0
7 8 9.0 10.0
8 9 10.0 NaN
9 10 NaN NaN
Finally, you can remove the last two rows (which contain NaN) and convert everything to integers:
df = df[:-2].astype(int)
or, if you don't have NaN anywhere else,
df = df.dropna().astype(int)
Result:
first second third
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
Minimal working code
text = """first
1
2
3
4
5
6
7
8
9
10"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
#df = pd.DataFrame(range(1,11), columns=['first'])
print(df)
df['second'] = df['first'].shift(-1) #, fill_value=0)
df['third'] = df['first'].shift(-2)
print(df)
#df = df.dropna().astype(int)
df = df[:-2].astype(int)
print(df)
EDIT:
The same using a for-loop to create any number of columns:
text = """col 1
1
2
3
4
5
6
7
8
9
10"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
#df = pd.DataFrame(range(1,11), columns=['col 1'])
print(df)
number = 5
for x in range(1, number+1):
    df[f'col {x+1}'] = df['col 1'].shift(-x)
print(df)
#df = df.dropna().astype(int)
df = df[:-number].astype(int)
print(df)
Result
col 1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
col 1 col 2 col 3 col 4 col 5 col 6
0 1 2.0 3.0 4.0 5.0 6.0
1 2 3.0 4.0 5.0 6.0 7.0
2 3 4.0 5.0 6.0 7.0 8.0
3 4 5.0 6.0 7.0 8.0 9.0
4 5 6.0 7.0 8.0 9.0 10.0
5 6 7.0 8.0 9.0 10.0 NaN
6 7 8.0 9.0 10.0 NaN NaN
7 8 9.0 10.0 NaN NaN NaN
8 9 10.0 NaN NaN NaN NaN
9 10 NaN NaN NaN NaN NaN
col 1 col 2 col 3 col 4 col 5 col 6
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
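For completeness, newer numpy versions (1.20+) provide sliding_window_view, which builds all the windows in one call. A minimal sketch, assuming the column is already numeric:
import numpy as np
import pandas as pd
values = np.arange(1, 11)  # same data as the 'first' column
windows = np.lib.stride_tricks.sliding_window_view(values, 3)  # shape (8, 3), one row per window
df = pd.DataFrame(windows, columns=['first', 'second', 'third'])
print(df)
This avoids the NaN handling entirely, because only complete windows are produced.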
I have a dataframe with three columns. Two of them are group and subgroup, and the third one is a value. I have some NaN values in the value column, and I need to fill them with median values according to group and subgroup.
I made a pivot table with a double index and the median of the target column, but I don't understand how to get these values and put them back into the original dataframe.
import pandas as pd
df=pd.DataFrame(data=[
[1,1,'A',1],
[2,1,'A',3],
[3,3,'B',8],
[4,2,'C',1],
[5,3,'A',3],
[6,2,'C',6],
[7,1,'B',2],
[8,1,'C',3],
[9,2,'A',7],
[10,3,'C',4],
[11,2,'B',6],
[12,1,'A'],
[13,1,'C'],
[14,2,'B'],
[15,3,'A']],columns=['id','group','subgroup','value'])
print(df)
id group subgroup value
0 1 1 A 1
1 2 1 A 3
2 3 3 B 8
3 4 2 C 1
4 5 3 A 3
5 6 2 C 6
6 7 1 B 2
7 8 1 C 3
8 9 2 A 7
9 10 3 C 4
10 11 2 B 6
11 12 1 A NaN
12 13 1 C NaN
13 14 2 B NaN
14 15 3 A NaN
df_struct=df.pivot_table(index=['group','subgroup'],values='value',aggfunc='median')
print(df_struct)
value
group subgroup
1 A 2.0
B 2.0
C 3.0
2 A 7.0
B 6.0
C 3.5
3 A 3.0
B 8.0
C 4.0
I will be thankful for any help.
Use pandas.DataFrame.groupby.transform, then fillna. Here the example data is slightly modified so that one value in group 1, subgroup A is NaN:
id group subgroup value
0 1 1 A 1.0
1 2 1 A NaN # < Value with nan
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
df['value'] = df['value'].fillna(df.groupby(['group', 'subgroup'])['value'].transform('median'))
print(df)
Output:
id group subgroup value
0 1 1 A 1.0
1 2 1 A 1.0
2 3 3 B 8.0
3 4 2 C 1.0
4 5 3 A 3.0
5 6 2 C 6.0
6 7 1 B 2.0
7 8 1 C 3.0
8 9 2 A 7.0
9 10 3 C 4.0
10 11 2 B 6.0
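If you prefer to keep the pivot table you already built (df_struct from the question), you can also merge its medians back and fill from the merged column; a sketch under that assumption:
medians = df_struct.rename(columns={'value': 'median_value'}).reset_index()
df = df.merge(medians, on=['group', 'subgroup'], how='left')
df['value'] = df['value'].fillna(df['median_value'])
df = df.drop(columns='median_value')
This gives the same result as the transform approach, just with an explicit lookup table.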
Let's suppose I have the following dataset:
Stock_id Week Stock_value
1 1 2
1 2 4
1 4 7
1 5 1
2 3 8
2 4 6
2 5 5
2 6 3
I want to shift the values of the Stock_value column by one position so that I get the following:
Stock_id Week Stock_value
1 1 NA
1 2 2
1 4 4
1 5 7
2 3 NA
2 4 8
2 5 6
2 6 5
What I am doing is the following:
df = pd.read_csv('C:/Users/user/Desktop/test.txt', keep_default_na=True, sep='\t')
df = df.groupby('Store_id', as_index=False)['Waiting_time'].transform(lambda x:x.shift(periods=1))
But then this gives me:
Waiting_time
0 NaN
1 2.0
2 4.0
3 7.0
4 NaN
5 8.0
6 6.0
7 5.0
So it gives me the values shifted but it does not retain all the columns of the dataframe.
How do I also retain all the columns of the dataframe along with shifting the values of one column?
You can simplify the solution with DataFrameGroupBy.shift and assign the result back to a new column:
df['Waiting_time'] = df.groupby('Stock_id')['Stock_value'].shift()
This works the same as:
df['Waiting_time'] = df.groupby('Stock_id')['Stock_value'].transform(lambda x: x.shift(periods=1))
print (df)
Stock_id Week Stock_value Waiting_time
0 1 1 2 NaN
1 1 2 4 2.0
2 1 4 7 4.0
3 1 5 1 7.0
4 2 3 8 NaN
5 2 4 6 8.0
6 2 5 5 6.0
7 2 6 3 5.0
When you do df.groupby('Store_id', as_index=False)['Waiting_time'], you obtain a DataFrame with only the single column 'Waiting_time'; you can't recover the other columns from that.
As suggested by jezrael in the comment, you should do
df['new col'] = df.groupby('Store_id...
to add this new column to your previously existing DataFrame.
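As a side note, groupby shift also accepts fill_value (pandas 0.24+) if you would rather not have NaN in the first row of each group; a sketch:
df['Waiting_time'] = df.groupby('Stock_id')['Stock_value'].shift(fill_value=0)
Here 0 is an arbitrary placeholder; pick whatever sentinel suits your data.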
I have a dataframe similar to below
id A B C D E
1 2 3 4 5 5
1 NaN 4 NaN 6 7
2 3 4 5 6 6
2 NaN NaN 5 4 1
I want to impute the null values in columns A, B, and C by forward filling, but within each group. That means I want the forward filling to be applied per id. How can I do that?
Use GroupBy.ffill to forward fill all columns per group. If the first values in a group are NaN there is nothing to fill from, so you can use fillna for those and finally cast to integers:
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 NaN 4.0 NaN 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 NaN NaN 5.0 4 1.0
cols = ['A','B','C']
df[cols] = df.groupby('id')[cols].ffill().fillna(0).astype(int)
print (df)
id A B C D E
0 1 2 3 4 5 NaN
1 1 2 4 4 6 NaN
2 2 3 4 5 6 6.0
3 2 3 4 5 4 1.0
Detail:
print (df.groupby('id')[cols].ffill().fillna(0).astype(int))
A B C
0 2 3 4
1 2 4 4
2 3 4 5
3 3 4 5
Or:
cols = ['A','B','C']
df.update(df.groupby('id')[cols].ffill().fillna(0))
print (df)
id A B C D E
0 1 2.0 3.0 4.0 5 NaN
1 1 2.0 4.0 4.0 6 NaN
2 2 3.0 4.0 5.0 6 6.0
3 2 3.0 4.0 5.0 4 1.0
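If writing 0 into the leading NaNs is not acceptable, another option (a sketch, assuming pandas 0.24+ for the nullable integer dtype) is to keep them as missing while still getting integer values:
cols = ['A','B','C']
df[cols] = df.groupby('id')[cols].ffill().astype('Int64')
The capitalized 'Int64' is the nullable integer dtype, so rows with no earlier value in their group stay missing instead of becoming 0.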
My dataset looks like this (the first row is the header):
0 1 2 3 4 5
1 3 4 6 2 3
3 8 9 3 2 4
2 2 3 2 1 2
For each row, I want to select a range of columns based on the value in column 5, e.g.:
1 3 4
3 8 9 3
2 2
I have tried the following, but it did not work:
df.iloc[:,0:df['5'].values]
Let's try:
df.apply(lambda x: x[:x.iloc[5]], axis=1)
Output:
0 1 2 3
0 1.0 3.0 4.0 NaN
1 3.0 8.0 9.0 3.0
2 2.0 2.0 NaN NaN
Recreate your dataframe:
df = pd.DataFrame([x[:x[5]] for x in df.values]).fillna(0)
df
Out[184]:
0 1 2 3
0 1 3 4.0 0.0
1 3 8 9.0 3.0
2 2 2 0.0 0.0
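A vectorized alternative applied to the original dataframe (a sketch, assuming the column labels are the strings '0' through '5', as in the question's df['5']) builds a boolean mask from column 5 and blanks out everything past each row's range:
import numpy as np
mask = np.arange(df.shape[1]) < df['5'].values[:, None]  # True for the positions each row keeps
out = df.where(mask)  # positions outside the range become NaN
This keeps the original shape, which can be convenient if the frame is large.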