I have a big dataset in a CSV file, which I managed to open in a Jupyter Notebook.
Example of the data in the CSV: 1 2 3 4 5 6 7 8 9 10
I want to view it in the notebook as a rolling window of 3 values, without applying any aggregation (sum, mean, etc.).
The output I want in the notebook is (each row is one window):
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8
7 8 9
8 9 10
First, open the CSV to get the first column:
import pandas as pd
df = pd.read_csv("filename.csv")
I will use io only to simulate reading the data from a file:
text = """first
1
2
3
4
5
6
7
8
9
10"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
Result
first
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
Next, you can use shift to create the other columns:
df['second'] = df['first'].shift(-1)
df['third'] = df['first'].shift(-2)
Result
first second third
0 1 2.0 3.0
1 2 3.0 4.0
2 3 4.0 5.0
3 4 5.0 6.0
4 5 6.0 7.0
5 6 7.0 8.0
6 7 8.0 9.0
7 8 9.0 10.0
8 9 10.0 NaN
9 10 NaN NaN
At the end, you can remove the last two rows with NaN and convert everything back to integers:
df = df[:-2].astype(int)
or, if you don't have NaN anywhere else:
df = df.dropna().astype(int)
Result:
first second third
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
7 8 9 10
Minimal working code
text = """first
1
2
3
4
5
6
7
8
9
10"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
#df = pd.DataFrame(range(1,11), columns=['first'])
print(df)
df['second'] = df['first'].shift(-1) #, fill_value=0)
df['third'] = df['first'].shift(-2)
print(df)
#df = df.dropna().astype(int)
df = df[:-2].astype(int)
print(df)
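For completeness, numpy can build all the windows in one call instead of creating the columns one shift at a time. A minimal sketch (not part of the original answer), assuming numpy >= 1.20 for sliding_window_view:
import numpy as np
import pandas as pd

values = np.arange(1, 11)  # stand-in for df['first'].to_numpy()
# Each row of `windows` is one 3-value window: [1 2 3], [2 3 4], ..., [8 9 10]
windows = np.lib.stride_tricks.sliding_window_view(values, 3)
df = pd.DataFrame(windows, columns=['first', 'second', 'third'])
print(df)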
EDIT:
The same approach, using a for-loop to create any number of columns:
text = """col 1
1
2
3
4
5
6
7
8
9
10"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
#df = pd.DataFrame(range(1,11), columns=['col 1'])
print(df)
number = 5
for x in range(1, number+1):
    df[f'col {x+1}'] = df['col 1'].shift(-x)
print(df)
#df = df.dropna().astype(int)
df = df[:-number].astype(int)
print(df)
Result
col 1
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
col 1 col 2 col 3 col 4 col 5 col 6
0 1 2.0 3.0 4.0 5.0 6.0
1 2 3.0 4.0 5.0 6.0 7.0
2 3 4.0 5.0 6.0 7.0 8.0
3 4 5.0 6.0 7.0 8.0 9.0
4 5 6.0 7.0 8.0 9.0 10.0
5 6 7.0 8.0 9.0 10.0 NaN
6 7 8.0 9.0 10.0 NaN NaN
7 8 9.0 10.0 NaN NaN NaN
8 9 10.0 NaN NaN NaN NaN
9 10 NaN NaN NaN NaN NaN
col 1 col 2 col 3 col 4 col 5 col 6
0 1 2 3 4 5 6
1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
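The for-loop can also be collapsed into a single pd.concat of shifted columns. A sketch that should be equivalent to the loop above (same `text` input):
import io
import pandas as pd

df = pd.read_csv(io.StringIO(text))
number = 5

# 'col 1' is shift(0), 'col 2' is shift(-1), ..., 'col 6' is shift(-5)
parts = [df['col 1'].shift(-x).rename(f'col {x+1}') for x in range(number + 1)]
df = pd.concat(parts, axis=1).dropna().astype(int)
print(df)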
Related
I have a df like this:
time data
0 1
1 1
2 nan
3 nan
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 nan
Is there a way to use pd.Series.ffill() to forward fill only certain occurrences of values? Specifically, I want to forward fill only if the values in df.data are == 1 or 4. It should look like this:
time data
0 1
1 1
2 1
3 1
4 6
5 nan
6 nan
7 nan
8 5
9 4
10 4
One option is to forward fill the column with ffill, then populate only the rows where the filled value is 1 or 4, using isin and mask:
s = df['data'].ffill()
df['data'] = df['data'].mask(s.isin([1, 4]), s)
df:
time data
0 0 1.0
1 1 1.0
2 2 1.0
3 3 1.0
4 4 6.0
5 5 NaN
6 6 NaN
7 7 NaN
8 8 5.0
9 9 4.0
10 10 4.0
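The same result can be written with where, which is the mirror image of mask (where keeps the caller's value when the condition is True, mask replaces it):
s = df['data'].ffill()
# Keep the filled value when it is 1 or 4, otherwise fall back to the original column
df['data'] = s.where(s.isin([1, 4]), df['data'])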
Here are my dataframe and my lists:
X Y Z X1
1 2 3 3
2 7 2 6
3 10 5 4
4 3 7 9
5 3 3 4
list1=[3,5,6]
list2=[4,3,7,4]
I want to append the lists to the dataframe, extending columns X and X1. I have tried some code, but it gives an error and does not work.
#Expected Output
X   Y   Z   X1
1   2   3    3
2   7   2    6
3   10  5    4
4   3   7    9
5   3   3    4
3            4
5            3
6            7
             4
#here is my code
list1 = [3,5, 6]
df_length = len(df1)
df1.loc[df_length] = list1
Please help me to solve this problem.
Thanks in advance.
Use series.append() to create the new series (X and X1), then build the output df with pd.concat(). (Note: Series.append was deprecated in pandas 1.4 and removed in 2.0; a concat-only version follows the output below.)
s1 = df.X.append(pd.Series(list1)).reset_index(drop=True)
s2 = df.X1.append(pd.Series(list2)).reset_index(drop=True)
df = pd.concat([s1, df.Y, df.Z, s2], axis=1).rename(columns={0: 'X', 1: 'X1'})
df
X Y Z X1
0 1.0 2.0 3.0 3
1 2.0 7.0 2.0 6
2 3.0 10.0 5.0 4
3 4.0 3.0 7.0 9
4 5.0 3.0 3.0 4
5 3.0 NaN NaN 4
6 5.0 NaN NaN 3
7 6.0 NaN NaN 7
8 NaN NaN NaN 4
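Since Series.append is gone in pandas 2.0, here is a concat-only sketch of the same idea:
import pandas as pd

# pandas >= 2.0: pd.concat replaces Series.append
s1 = pd.concat([df.X, pd.Series(list1)], ignore_index=True).rename('X')
s2 = pd.concat([df.X1, pd.Series(list2)], ignore_index=True).rename('X1')
df = pd.concat([s1, df.Y, df.Z, s2], axis=1)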
Another option: build a second dataframe from the lists and concatenate. Copy the table below to the clipboard so that pd.read_clipboard() can pick it up:
'''
X Y Z X1
1 2 3 3
2 7 2 6
3 10 5 4
4 3 7 9
5 3 3 4
'''
list1=[3,5,6]
list2=[4,3,7,4]
ls_empty=[]
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df1 = pd.DataFrame([list1, ls_empty, ls_empty, list2])
df1 = df1.T
df1.columns = df.columns
df2 = pd.concat([df, df1]).replace(np.nan, '', regex=True).reset_index(drop=True).astype({'X1': int})
print(df2)
Output:
X Y Z X1
0 1 2 3 3
1 2 7 2 6
2 3 10 5 4
3 4 3 7 9
4 5 3 3 4
5 3 4
6 5 3
7 6 7
8 4
I am trying to find a way to count how many values have been randomly removed from a dataframe, and how many of them were removed consecutively (one after another).
The code I have so far is:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#Sampledata
x=[1,2,3,4,5,6,7,8,9,10]
y=[1,2,3,4,5,6,7,8,9,10]
df = pd.DataFrame({'col_1':y,'col_2':x})
drop_indices = np.random.choice(df.index, 5, replace=False)
df_subset = df.drop(drop_indices)
print(df_subset)
print(df)
This randomly removes 5 rows from the dataframe and gives as output:
col_1 col_2
0 1 1
1 2 2
2 3 3
5 6 6
8 9 9
col_1 col_2
0 1 1
1 2 2
2 3 3
3 4 4
4 5 5
5 6 6
6 7 7
7 8 8
8 9 9
9 10 10
I would like to turn this into the following data frame:
   col_1  col_2  col_2  N_removedvalues  N_consecutive
0      1      1      1                0              0
1      2      2      2                0              0
2      3      3      3                0              0
3      4      4    NaN                1              1
4      5      5    NaN                2              2
5      6      6      6                2              0
6      7      7    NaN                3              1
7      8      8    NaN                4              2
8      9      9      9                4              0
9     10     10    NaN                5              1
# Rows dropped from df_subset show up as NaN in 'col_2' after the merge
res = df.merge(df_subset, on='col_1', suffixes=['_1', ''], how='left')
# Running total of removed values: number the NaN rows, then forward fill
res["N_removedvalues"] = np.where(res['col_2'].isna(),
                                  res.groupby(res['col_2'].isna()).cumcount().add(1), np.nan)
res["N_removedvalues"] = res["N_removedvalues"].ffill().fillna(0)
# Mark the first NaN of each consecutive run; interior NaNs are set back to NaN
res['N_consecutive'] = np.logical_and(res['col_2'].isna(),
                                      np.logical_or(~res['col_2'].shift().isna(), res.index == res.index[0]))
res.loc[np.logical_and(res['N_consecutive'] == 0, res['col_2'].isna()), 'N_consecutive'] = np.nan
# Number the runs, spread each run id across its rows, then count within each run
res['N_consecutive'] = res.groupby('N_consecutive')['N_consecutive'].cumsum().ffill()
res.loc[res['N_consecutive'].gt(0), 'N_consecutive'] = (
    res.loc[res['N_consecutive'].gt(0)].groupby('N_consecutive').cumcount().add(1))
Outputs:
col_1 col_2_1 col_2 N_removedvalues N_consecutive
0 1 1 1.0 0.0 0.0
1 2 2 2.0 0.0 0.0
2 3 3 NaN 1.0 1.0
3 4 4 4.0 1.0 0.0
4 5 5 NaN 2.0 1.0
5 6 6 NaN 3.0 2.0
6 7 7 7.0 3.0 0.0
7 8 8 8.0 3.0 0.0
8 9 9 NaN 4.0 1.0
9 10 10 NaN 5.0 2.0
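For what it's worth, both columns can also be derived directly from the NaN mask; a shorter sketch that produces the same numbers:
na = res['col_2'].isna()
res['N_removedvalues'] = na.cumsum()  # running count of removed rows
# (~na).cumsum() is constant within each NaN run, so grouping by it and
# taking a cumulative sum numbers the positions inside each run
res['N_consecutive'] = na.groupby((~na).cumsum()).cumsum()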
I want to ignore NaN values in my selected dataframe columns when normalizing with sklearn.preprocessing.normalize. Example column:
0 12.0
1 12.0
2 3.0
3 NaN
4 3.0
5 3.0
6 NaN
7 NaN
8 3.0
9 3.0
10 3.0
11 4.0
12 10.0
You can make use of the dropna() function. It returns the same data with the rows containing NaN removed:
>>> a.dropna()
0     12.0
1     12.0
2      3.0
4      3.0
5      3.0
8      3.0
9      3.0
10     3.0
11     4.0
12    10.0
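dropna changes the length, though. If you want to keep the NaN rows in place, one sketch is to normalize only the non-NaN entries with plain numpy; dividing by np.linalg.norm matches sklearn's default l2 normalization of the column taken as a single vector:
import numpy as np
import pandas as pd

s = pd.Series([12, 12, 3, np.nan, 3, 3, np.nan, np.nan, 3, 3, 3, 4, 10])

mask = s.notna()
s[mask] = s[mask] / np.linalg.norm(s[mask])  # NaN positions stay NaN
print(s)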
I have a dataframe like this, and I want to add a new column that is the equivalent of applying shift n times. For example, let n = 2:
df = pd.DataFrame(numpy.random.randint(0, 10, (10, 2)), columns=['a','b'])
a b
0 0 3
1 7 0
2 6 6
3 6 0
4 5 0
5 0 7
6 8 0
7 8 7
8 4 4
9 2 2
df['c'] = df['b'].shift(1) + df['b'].shift(2)
a b c
0 0 3 NaN
1 7 0 NaN
2 6 6 3.0
3 6 0 6.0
4 5 0 6.0
5 0 7 0.0
6 8 0 7.0
7 8 7 7.0
8 4 4 7.0
9 2 2 11.0
In this manner, column c gets the sum of the previous n values from column b.
Other than a loop, is there a better way to accomplish this for a large n?
You can use the rolling() method with a window of 2 and then shift the result:
df['c'] = df.b.rolling(window = 2).sum().shift()
df
a b c
0 0 3 NaN
1 7 0 NaN
2 6 6 3.0
3 6 0 6.0
4 5 0 6.0
5 0 7 0.0
6 8 0 7.0
7 8 7 7.0
8 4 4 7.0
9 2 2 11.0
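The same pattern generalizes to any n; only the window size changes:
n = 4  # sum of the previous n values of column b
df['c'] = df['b'].rolling(window=n).sum().shift()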