GroupBy and transform do not keep all columns of the dataframe - python

Let's suppose that I have the following dataset:
Stock_id Week Stock_value
1 1 2
1 2 4
1 4 7
1 5 1
2 3 8
2 4 6
2 5 5
2 6 3
I want to shift the values of the Stock_value column by one position so that I get the following:
Stock_id Week Stock_value
1 1 NA
1 2 2
1 4 4
1 5 7
2 3 NA
2 4 8
2 5 6
2 6 5
What I am doing is the following:
df = pd.read_csv('C:/Users/user/Desktop/test.txt', keep_default_na=True, sep='\t')
df = df.groupby('Stock_id', as_index=False)['Stock_value'].transform(lambda x: x.shift(periods=1))
But then this gives me:
Stock_value
0 NaN
1 2.0
2 4.0
3 7.0
4 NaN
5 8.0
6 6.0
7 5.0
So it gives me the shifted values, but it does not retain the other columns of the dataframe.
How do I retain all the columns of the dataframe while shifting the values of one column?

You can simplify the solution by using DataFrameGroupBy.shift and assigning the result back to a new column:
df['Waiting_time'] = df.groupby('Stock_id')['Stock_value'].shift()
This works the same as:
df['Waiting_time'] = df.groupby('Stock_id')['Stock_value'].transform(lambda x: x.shift(periods=1))
print(df)
Stock_id Week Stock_value Waiting_time
0 1 1 2 NaN
1 1 2 4 2.0
2 1 4 7 4.0
3 1 5 1 7.0
4 2 3 8 NaN
5 2 4 6 8.0
6 2 5 5 6.0
7 2 6 3 5.0

When you do df.groupby('Stock_id', as_index=False)['Stock_value'], you select only that single column, so the transform returns just the shifted values; the other columns cannot be generated from that result.
As suggested by jezrael in the comment, you should do
df['new col'] = df.groupby('Stock_id'...
to add this new column to your previously existing DataFrame.
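Putting the pieces together, a minimal runnable sketch (the sample data comes from the question; Waiting_time is simply the name chosen here for the new column):
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Stock_id': [1, 1, 1, 1, 2, 2, 2, 2],
                   'Week': [1, 2, 4, 5, 3, 4, 5, 6],
                   'Stock_value': [2, 4, 7, 1, 8, 6, 5, 3]})

# Shift Stock_value down by one position within each Stock_id group;
# assigning the result to a new column keeps every original column
df['Waiting_time'] = df.groupby('Stock_id')['Stock_value'].shift()
print(df)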

Related

How to remove or drop all rows after first occurrence of `NaN` from the entire DataFrame

I am looking to remove/drop all rows after the first occurrence of NaN in any DataFrame column.
I have created two sample DataFrames as illustrated below. In the first DataFrame the dtypes of the initial two columns are object while the last one is int, while in the second DataFrame they are float, object and int.
First:
>>> df = pd.DataFrame({"A": (1,2,3,4,5,6,7,'NaN','NaN','NaN','NaN'),"B": (1,2,3,'NaN',4,5,6,7,'NaN',"9","10"),"C": range(11)})
>>> df
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 NaN 3
4 5 4 4
5 6 5 5
6 7 6 6
7 NaN 7 7
8 NaN NaN 8
9 NaN 9 9
10 NaN 10 10
Dtypes:
>>> df.dtypes
A object
B object
C int64
dtype: object
Carrying out the index-based approach below on a particular column works just fine as long as the dtypes are object and int, but I'm looking for a DataFrame-level action, not one limited to a single column.
>>> df[:df[df['A'] == 'NaN'].index[0]]
A B C
0 1 1 0
1 2 2 1
2 3 3 2
3 4 NaN 3
4 5 4 4
5 6 5 5
6 7 6 6
>>> df[:df[df['B'] == 'NaN'].index[0]]
A B C
0 1 1 0
1 2 2 1
2 3 3 2
Second:
Another interesting fact: when the DataFrame is created with np.nan we get different dtypes, and then the index-based approach fails even for a single-column operation.
>>> df = pd.DataFrame({"A": (1,2,3,4,5,6,7,np.nan,np.nan,np.nan,np.nan),"B": (1,2,3,np.nan,4,5,6,7,np.nan,"9","10"),"C": range(11)})
>>> df
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
3 4.0 NaN 3
4 5.0 4 4
5 6.0 5 5
6 7.0 6 6
7 NaN 7 7
8 NaN NaN 8
9 NaN 9 9
10 NaN 10 10
Dtypes:
>>> df.dtypes
A float64
B object
C int64
dtype: object
Error:
>>> df[:df[df['B'] == 'NaN'].index[0]]
IndexError: index 0 is out of bounds for axis 0 with size 0
>>> df[:df[df['A'] == 'NaN'].index[0]]
IndexError: index 0 is out of bounds for axis 0 with size 0
Expected output for the second DataFrame:
>>> df
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
So, I am looking for a way to check across the entire DataFrame regardless of dtype and drop all rows from the first occurrence of NaN onward.
You can try:
out = df.iloc[:df.isna().any(axis=1).idxmax()]
OR
via replace(), convert the 'NaN' strings to real NaN values, then check for missing values and filter the rows:
df = df.replace({'NaN': float('NaN'), 'nan': float('NaN')})
out = df.iloc[:df.isna().any(axis=1).idxmax()]
Output of out:
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
Just for posterity ...
>>> df.iloc[:df.isna().any(axis=1).argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
>>> df.iloc[:df.isnull().any(axis=1).argmax()]
A B C
0 1.0 1 0
1 2.0 2 1
2 3.0 3 2
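For reference, a self-contained sketch of the combined approach, using the string-'NaN' DataFrame from the question; argmax is used because it returns a position, which is what iloc expects:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": (1, 2, 3, 4, 5, 6, 7, 'NaN', 'NaN', 'NaN', 'NaN'),
                   "B": (1, 2, 3, 'NaN', 4, 5, 6, 7, 'NaN', "9", "10"),
                   "C": range(11)})

# Turn the 'NaN' strings into real missing values, then keep every row
# before the first row that contains any NaN
df = df.replace({'NaN': np.nan, 'nan': np.nan})
out = df.iloc[:df.isna().any(axis=1).argmax()]
print(out)
# Caveat: if the frame contained no NaN at all, argmax would return 0
# and the slice would drop every row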

Create a sliding window of input data from a pandas dataframe

I have to create a sliding window of input data with window size = 3
Dataframe
0 1
0 1 2
1 3 4
2 5 6
3 7 8
Desired output:
0 1 2 3 4 5
0 1 2 NA NA NA NA
1 3 4 1 2 NA NA
2 5 6 3 4 1 2
3 7 8 5 6 3 4
I used data.values.flatten(), but it converts all rows of the dataframe into a single flat array.
How can I create a sliding window of input data (of the desired window length) from the dataframe?
You can just concat:
new_df = pd.concat([df.shift(i) for i in range(3)], axis=1)
# rename columns
new_df.columns = np.arange(new_df.shape[1])
Output:
0 1 2 3 4 5
0 1 2 NaN NaN NaN NaN
1 3 4 1.0 2.0 NaN NaN
2 5 6 3.0 4.0 1.0 2.0
3 7 8 5.0 6.0 3.0 4.0
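The same idea generalizes to any window size; a sketch with the size as a parameter (the variable names here are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [1, 3, 5, 7], 1: [2, 4, 6, 8]})

window = 3  # number of copies, including the unshifted one
# shift(i) moves the rows down by i, so concatenating shift(0), shift(1), ...
# side by side places each row's predecessors next to it
new_df = pd.concat([df.shift(i) for i in range(window)], axis=1)
new_df.columns = np.arange(new_df.shape[1])
print(new_df)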

Fill a dataframe column with values from another dataframe's column

I have 2 dataframes:
df1 =
item sale
0 7 10.0
1 4 10.0
2 6 10.0
3 5 10.0
4 5 10.0
5 6 10.0
6 4 10.0
df2 =
item sale
0 1 7
1 2 6
2 3 5
3 4 4
4 5 3
I want to change the values of the df1 sale column, taking the values from the df2 sale column.
I use the code:
df1.loc[df1.item.isin(df2.item), ['sale']] = df2[['sale']]
And I get
df1 =
item sale
0 7 10.0
1 4 6.0
2 6 10.0
3 5 4.0
4 5 3.0
5 6 10.0
6 4 NaN
The output I wanted was:
df1 =
item sale
0 7 10.0
1 4 4.0
2 6 10.0
3 5 3.0
4 5 3.0
5 6 10.0
6 4 4.0
The two dataframes are related by item number. So, set the item number as the index on both dataframes, run the update method on df1 with df2, and reset the index:
df1 = df1.set_index("item")
df1.update(df2.set_index("item"))
df1.reset_index()
item sale
0 7 10.0
1 4 4.0
2 6 10.0
3 5 3.0
4 5 3.0
5 6 10.0
6 4 4.0
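update aligns on the index and only overwrites labels that match, which is why setting item as the index first is essential. An equivalent sketch (assuming, as in df2 here, that the lookup item numbers are unique) maps item to sale and keeps the original value where there is no match:
import pandas as pd

df1 = pd.DataFrame({'item': [7, 4, 6, 5, 5, 6, 4], 'sale': [10.0] * 7})
df2 = pd.DataFrame({'item': [1, 2, 3, 4, 5], 'sale': [7, 6, 5, 4, 3]})

# Look up each item in df2's item -> sale mapping; rows with no match
# (item 7 here) fall back to their original sale value
df1['sale'] = df1['item'].map(df2.set_index('item')['sale']).fillna(df1['sale'])
print(df1)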

Indexing new dataframes into new columns with pandas

I need to create a new dataframe from an existing one by selecting multiple columns and stacking those columns' values into a single new column, with each value's corresponding column index stored in a second new column.
So, let's say I have this as a dataframe:
A B C D E F
0 1 2 3 4 0
0 7 8 9 1 0
0 4 5 2 4 0
Transform into this by selecting columns B through E:
A index_value
1 1
7 1
4 1
2 2
8 2
5 2
3 3
9 3
2 3
4 4
1 4
4 4
So, for the new dataframe, column A would be all of the values from columns B through E in the old dataframe, and column index_value would correspond to the index value [starting from zero] of the selected columns.
I've been scratching my head for hours. Any help would be appreciated, thanks!
Python 3, using the pandas & numpy libraries.
#Another way
A B C D E F
0 0 1 2 3 4 0
1 0 7 8 9 1 0
2 0 4 5 2 4 0
# Select columns to include
start_column = 'B'
end_column = 'E'
index_column_name = 'A'
# Re-stack the dataframe
df = df.loc[:, start_column:end_column].stack().sort_index(level=1).reset_index(level=0, drop=True).to_frame()
# Create the "index_value" column
df['index_value'] = pd.Categorical(df.index).codes + 1
df.rename(columns={0: index_column_name}, inplace=True)
df.set_index(index_column_name, inplace=True)
df
index_value
A
1 1
7 1
4 1
2 2
8 2
5 2
3 3
9 3
2 3
4 4
1 4
4 4
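Note that this leaves A as the index rather than a regular column; if the requested layout is wanted literally, a final df.reset_index() restores A as a column.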
This is just melt:
df.columns = range(df.shape[1])
# keep only non-zero values, which drops the all-zero columns A and F
s = df.melt().loc[lambda x: x.value != 0]
s
variable value
3 1 1
4 1 7
5 1 4
6 2 2
7 2 8
8 2 5
9 3 3
10 3 9
11 3 2
12 4 4
13 4 1
14 4 4
Try using:
df = pd.melt(df[['B', 'C', 'D', 'E']])
# Or: df = df[['B', 'C', 'D', 'E']].melt()
df['variable'] = df['variable'].shift().eq(df['variable'].shift(-1)).cumsum().shift(-1).ffill()
print(df)
Output:
variable value
0 1.0 1
1 1.0 7
2 1.0 4
3 2.0 2
4 2.0 8
5 2.0 5
6 3.0 3
7 3.0 9
8 3.0 2
9 4.0 4
10 4.0 1
11 4.0 4
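For completeness, a compact sketch that produces exactly the requested two-column layout (the 1-based numbering of index_value is inferred from the desired output in the question):
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0], 'B': [1, 7, 4], 'C': [2, 8, 5],
                   'D': [3, 9, 2], 'E': [4, 1, 4], 'F': [0, 0, 0]})

cols = ['B', 'C', 'D', 'E']
out = df[cols].melt(var_name='index_value', value_name='A')
# Replace each column letter with its 1-based position among the selection
out['index_value'] = out['index_value'].map({c: i + 1 for i, c in enumerate(cols)})
out = out[['A', 'index_value']]
print(out)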

GroupBy and Shift based on the values of a column

Let's suppose that I have the following dataset:
Stock_id Week Stock_value
1 1 2
1 2 4
1 4 7
1 5 1
2 3 8
2 4 6
2 5 5
2 6 3
I want to shift the values of the Stock_value column but only for consecutive weeks.
This should give the following output:
Stock_id Week Stock_value
1 1 NA
1 2 2
1 4 NA
1 5 7
2 3 NA
2 4 8
2 5 6
2 6 5
So, for example, at stock 1 the Stock_value of week 2 should not be shifted over to week 4 (since I want a one-week shift for now).
How can I do this easily?
IIUC, use Week with its diff to create another group key:
df.groupby([df.Stock_id,df.Week.diff().ne(1).cumsum()]).Stock_value.shift()
Out[157]:
0 NaN
1 2.0
2 NaN
3 7.0
4 NaN
5 8.0
6 6.0
7 5.0
Name: Stock_value, dtype: float64
# df['Stock_value2'] = df.groupby([df.Stock_id, df.Week.diff().ne(1).cumsum()]).Stock_value.shift()
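A runnable version that assigns the result to the Stock_value2 column sketched in the comment above (data from the question):
import pandas as pd

df = pd.DataFrame({'Stock_id': [1, 1, 1, 1, 2, 2, 2, 2],
                   'Week': [1, 2, 4, 5, 3, 4, 5, 6],
                   'Stock_value': [2, 4, 7, 1, 8, 6, 5, 3]})

# A new sub-group starts whenever the week gap is not exactly 1, so the
# shift never carries a value across a gap in the weeks
df['Stock_value2'] = df.groupby([df.Stock_id, df.Week.diff().ne(1).cumsum()]).Stock_value.shift()
print(df)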
