Python pandas: Applying rolling sum on pivot table

I created a DataFrame with the pivot_table command. It has 351 rows and 120 columns and looks like this:
RY 2011 ... 2020
Month 1 2 3 4 5 6 7 8 9 10 ... 3 4 5 6 7 8 9 10 11 12
ID
AB10 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AB1286 0 0 0 0 2 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AB1951 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AB2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AB2338 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
Now I want to calculate a rolling 12-month sum for each ID. I wrote the following command to calculate the rolling sum:
df.groupby('ID').rolling(12,on='Month').sum()
However, it gave the following error:
ValueError: invalid on specified as Month, must be a column (of DataFrame), an Index or None
Could anyone help me in fixing the issue?

Try running that code before creating the pivot table, while Year and Month are still ordinary columns. First, create a datetime column with something like:
df['Date'] = pd.to_datetime(df['Year'].astype(str) + '-' + df['Month'].astype(str) + '-01')
and then:
df.groupby('ID').rolling(12,on='Date').sum()

What does "ID" contain? Have you tried transposing the pivot table first, using the following?
df.T.groupby('ID').rolling(12,on='Month').sum()
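A minimal sketch of one way to roll over the pivoted table directly, without going back to the long data. It assumes the pivot's columns are a (RY, Month) MultiIndex in chronological order with no missing months; the sample data and the Count column name are made up for illustration. Transposing turns the months into rows, so a plain rolling(12) sums the previous 12 months for every ID column.

```python
import pandas as pd

# Hypothetical long-format data: 14 consecutive months for two IDs.
long_df = pd.DataFrame({
    "ID": ["AB10"] * 14 + ["AB2"] * 14,
    "RY": ([2011] * 12 + [2012] * 2) * 2,
    "Month": (list(range(1, 13)) + [1, 2]) * 2,
    "Count": [1] * 14 + [2] * 14,
})

# Same shape as in the question: one row per ID, (RY, Month) columns.
pivot = long_df.pivot_table(
    index="ID", columns=["RY", "Month"], values="Count",
    aggfunc="sum", fill_value=0,
)

# Months become rows after the transpose, so rolling(12) runs over time;
# the second transpose restores the one-row-per-ID layout.
rolled = pivot.T.rolling(12).sum().T
print(rolled)
```

The first 11 months of each row are NaN because the window is not yet full; passing min_periods=1 to rolling would give partial sums instead.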

Related

Python panda's dataframe boolean Series/Column based on conditional next columns

I'm having trouble describing exactly what I want to achieve. I've looked here on Stack Overflow for others with the same problem but was unable to find any, so I will try to describe exactly what I want and give you some sample setup code.
I would like a function that gives me a new column (a pd.Series). This new column holds boolean True values (or ints) based on a certain condition.
The condition is as follows. There are N columns (8 in the example), each with the same name but ending in a different number, i.e. column_1, column_2, etc. The function I need is twofold:
If N is given, look at each row of each column and check whether it and the same row in the next N columns are also True/1.
If N is NOT given, look at each row of each column and check whether the same row in all following columns is also True/1, using the numbers as IDs to identify the columns.
import numpy as np
import pandas as pd

def get_df_series(df: pd.DataFrame, columns_ids: list, n: int = 8) -> pd.DataFrame:
    for i in columns_ids:
        # missing code here .. I don't know if this would be the way to go
        pass
    return df

def create_dataframe(numbers: list) -> pd.DataFrame:
    df = pd.DataFrame()  # empty df
    # create a column for each number, with the number as ID and random boolean values as ints
    for i in numbers:
        df[f'column_{i}'] = np.random.randint(2, size=20)
    return df

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5, 6, 7, 8]
    df = create_dataframe(numbers=numbers)
    df = get_df_series(df=df, columns_ids=numbers, n=3)
I have some experience with Pandas dataframes and know how to create IF/ELSE things with np.select for example.
(function) select(condlist: Sequence[ArrayLike], choicelist: Sequence[ArrayLike], default: ArrayLike = ...) -> NDArray
The problem I'm running into is that I don't know how to make a conditional statement if I don't know how many columns are ahead. For example, if I want to know for column_5 if the next 3 are also true, I can hardcode this, but I have columns up to id 20 and would love to not have to hardcode everything from column_1 to column_20 if I want to know if all conditions in all those columns are true.
Now the problem is that I don't know if this is even possible, so any feedback would be appreciated, even just a hint on where to look for a way to do this.
EDIT: What I forgot to mention is that there will be other columns in between that obviously cannot be taken into account. For example, there will be main_column_1, main_column_2, main_column_3, side_column_1, side_column_2, right_column_1, main_column_3, main_column_4 etc...
The answer Corralien gave is correct, but I should have made my question clearer.
I need to be able to, say, look at main_column and, for that one, look ahead N columns of the same type: main_column.
Try:
n = 3
out = (df.rolling(n, min_periods=1, axis=1).sum()
         .shift(-n+1, fill_value=0, axis=1).eq(n).astype(int)
         .rename(columns=lambda x: 'result_' + x.split('_')[1]))
Output:
>>> out
result_1 result_2 result_3 result_4 result_5 result_6 result_7 result_8
0 1 1 1 1 1 1 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 1 1 1 0 0 0 0
9 0 0 0 0 0 1 0 0
10 0 0 0 0 0 0 0 0
11 0 0 0 0 1 0 0 0
12 0 0 0 0 0 0 0 0
13 0 0 0 1 1 0 0 0
14 0 0 0 0 0 1 0 0
15 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0
17 0 0 1 0 0 0 0 0
18 0 0 1 0 0 0 0 0
19 0 0 0 0 0 0 0 0
Input:
>>> df
column_1 column_2 column_3 column_4 column_5 column_6 column_7 column_8
0 1 1 1 1 1 1 1 1
1 0 1 0 0 0 1 1 0
2 1 1 0 1 0 1 1 0
3 1 0 1 0 0 0 0 0
4 1 0 0 1 1 1 0 1
5 1 1 0 1 0 1 1 0
6 1 0 1 0 0 0 0 1
7 0 0 1 0 0 0 0 0
8 0 1 1 1 1 1 0 0
9 1 0 1 1 0 1 1 1
10 0 0 1 1 0 0 1 1
11 1 0 1 0 1 1 1 0
12 0 1 1 0 1 0 1 0
13 0 0 0 1 1 1 1 0
14 0 0 1 1 0 1 1 1
15 1 0 0 1 0 1 0 0
16 1 0 0 0 0 0 0 1
17 0 0 1 1 1 0 0 1
18 0 0 1 1 1 0 0 1
19 0 0 1 0 0 0 1 0
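The EDIT in the question asks for the look-ahead to be limited to one family of columns. A minimal sketch, assuming the relevant columns share a prefix such as main_column and already appear in left-to-right order; look_ahead and its parameters are hypothetical names, and this is a plain NumPy loop rather than the rolling approach in the answer. As in that answer, n counts the current column plus the following ones, and n=None means "all remaining columns of that family".

```python
import numpy as np
import pandas as pd


def look_ahead(df: pd.DataFrame, prefix: str, n=None) -> pd.DataFrame:
    """Flag cells where a column and the columns after it (same prefix) are all 1."""
    # Keep only the columns of the requested family, in their existing order.
    cols = [c for c in df.columns if c.startswith(prefix + "_")]
    vals = df[cols].to_numpy()
    width = vals.shape[1]
    res = np.zeros_like(vals)
    for start in range(width):
        # With n given, the window is n columns wide; otherwise it runs to the end.
        stop = width if n is None else start + n
        if stop <= width:  # windows that would run past the last column stay 0
            res[:, start] = vals[:, start:stop].all(axis=1)
    return pd.DataFrame(res, index=df.index, columns=["result_" + c for c in cols])


# Made-up sample with an unrelated column in between.
demo = pd.DataFrame({
    "main_column_1": [1, 0],
    "main_column_2": [1, 1],
    "side_column_1": [0, 0],
    "main_column_3": [1, 1],
    "main_column_4": [0, 1],
})
print(look_ahead(demo, "main_column", n=2))
print(look_ahead(demo, "main_column"))
```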

Get a value from a DataFrame in pandas by Row Name, NOT by Row Index

I have the following Pandas DataFrame:
1 2 3 4 5 6 7 8 9 10
1 10 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 0 0
As you can see, the row names are numbers that do not match the positional row index.
There is no row named "6".
I would like to set and get a cell value by row name and column name, not by row index.
I'm using:
pandas 1.0.1
Python 3.7.6
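A minimal sketch of label-based access, assuming the row and column names are the integers shown above. In pandas, .loc and .at select by label, while .iloc and .iat select by position, so the row named 7 is reached with 7 even though it sits at position 5.

```python
import pandas as pd

# Rebuild the frame from the question: row labels skip 6.
df = pd.DataFrame(0, index=[1, 2, 3, 4, 5, 7], columns=range(1, 11))
df.loc[1, 1] = 10

# Set and get a single cell by row name and column name.
df.at[7, 3] = 42
by_label = df.at[7, 3]

# The same cell by position: sixth row, third column.
by_position = df.iat[5, 2]

print(by_label, by_position)
```

.at is the scalar-only counterpart of .loc; .loc also accepts slices and lists of labels.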

How to concatenate bit columns in Python Pandas?

Seems like an easy question but I'm running into an odd error. I have a large dataframe with 24+ columns that all contain 1s or 0s. I wish to concatenate each field to create a binary key that'll act as a signature.
However, when the number of columns exceeds 12, the whole process falls apart.
a = np.zeros(shape=(3,12))
df = pd.DataFrame(a)
df = df.astype(int) # This converts each 0.0 into just 0
df[2]=1 # Changes one column to all 1s
#result
0 1 2 3 4 5 6 7 8 9 10 11
0 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0 0
Concatenating function...
df['new'] = df.astype(str).sum(1).astype(int).astype(str) # Concatenate
df['new'] = df['new'].apply('{0:0>12}'.format) # Pad leading zeros
# result
0 1 2 3 4 5 6 7 8 9 10 11 new
0 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
1 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
2 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
This is good. However, if I increase the number of columns to 13, I get...
a = np.zeros(shape=(3,13))
# ...same intermediate steps as above...
0 1 2 3 4 5 6 7 8 9 10 11 12 new
0 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
1 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
2 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
Why am I getting -2147483648? I was expecting 0010000000000
Any help is appreciated!
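The value -2147483648 is the smallest 32-bit signed integer. That suggests, as an assumption about the asker's platform, that the .astype(int) round trip went through a 32-bit integer type, which cannot hold the 13-digit number 10000000000. A minimal sketch that sidesteps the problem by never converting the key to an integer, so no padding step is needed either:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 13), dtype=int))
df[2] = 1  # one column of all 1s, as in the question

# Join the digits of each row as text; leading zeros survive because
# the key is never parsed as a number.
df["new"] = df.astype(str).agg("".join, axis=1)

print(df["new"].tolist())
```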

Pandas sum every other column by index where names, and index size changes

Here is my current dataframe named out
Date David_Added David_Removed Malik_Added Malik_Removed Meghan_Added Meghan_Removed Sucely_Added Sucely_Removed
02/19/2019 3 1 39 41 1 6 14 24
02/18/2019 0 0 8 6 0 3 0 0
02/16/2019 0 0 0 0 0 0 0 0
02/15/2019 0 0 0 0 0 0 0 0
02/14/2019 0 0 0 0 0 0 0 0
02/13/2019 0 0 0 0 0 0 0 0
02/12/2019 0 0 0 0 0 0 0 0
02/11/2019 0 0 0 0 0 0 0 0
02/08/2019 0 0 0 0 0 0 0 0
02/07/2019 0 0 0 0 0 0 0 0
I need to sum every person's data by date, skipping the Date column. I would like each total to sit next to the columns it sums: "User_Added, User_Removed, User_Total", as shown below. The issue I face is that the prefix names won't always be the same, and the total number of users changes.
My thought was to count the total number of columns, then loop through them doing the math and writing the results to a new column for every user, then sort the columns alphabetically so they are grouped together.
Something along the lines of:
loops = out.shape[1]
while loop < loops:
    out['User_Total'] = out['User_Added']+out['User_Removed']
    loop += 1
out.sort_index(axis=1, inplace=True)
However I'm not sure how to call an entire column by index, or if this is even a good way to handle it.
Here is what I'd like the output to look like.
Date David_Added David_Removed David_Total Malik_Added Malik_Removed Malik_Total Meghan_Added Meghan_Removed Meghan_Total Sucely_Added Sucely_Removed Sucely_Total
2/19/2019 3 1 4 39 41 80 1 6 7 14 24 38
2/18/2019 0 0 0 8 6 14 0 3 3 0 0 0
2/16/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/15/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/14/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/13/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/12/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/11/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/8/2019 0 0 0 0 0 0 0 0 0 0 0 0
2/7/2019 0 0 0 0 0 0 0 0 0 0 0 0
Any help is much appreciated!
Use groupby on the split column names:
s = df.groupby(df.columns.str.split('_').str[0], axis=1).sum().drop(columns='Date').add_suffix('_Total')
yourdf = pd.concat([df, s], axis=1).sort_index(level=0, axis=1)
yourdf
Out[455]:
Date David_Added ... Sucely_Removed Sucely_Total
0 02/19/2019 3 ... 24 38
1 02/18/2019 0 ... 0 0
2 02/16/2019 0 ... 0 0
3 02/15/2019 0 ... 0 0
4 02/14/2019 0 ... 0 0
5 02/13/2019 0 ... 0 0
6 02/12/2019 0 ... 0 0
7 02/11/2019 0 ... 0 0
8 02/08/2019 0 ... 0 0
9 02/07/2019 0 ... 0 0
[10 rows x 13 columns]
Alternatively:
df.join(df.T.groupby(df.T.index.str.split("_").str[0]).sum().T.iloc[:,1:].add_suffix('_Total'))
Date David_Added David_Removed Malik_Added Malik_Removed \
0 02/19/2019 3 1 39 41
1 02/18/2019 0 0 8 6
2 02/16/2019 0 0 0 0
3 02/15/2019 0 0 0 0
4 02/14/2019 0 0 0 0
5 02/13/2019 0 0 0 0
6 02/12/2019 0 0 0 0
7 02/11/2019 0 0 0 0
8 02/08/2019 0 0 0 0
9 02/07/2019 0 0 0 0
Meghan_Added Meghan_Removed Sucely_Added Sucely_Removed David_Total \
0 1 6 14 24 4
1 0 3 0 0 0
2 0 0 0 0 0
3 0 0 0 0 0
4 0 0 0 0 0
5 0 0 0 0 0
6 0 0 0 0 0
7 0 0 0 0 0
8 0 0 0 0 0
9 0 0 0 0 0
Malik_Total Meghan_Total Sucely_Total
0 80 7 38
1 14 3 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
I'm aware that this is not an answer to the question the OP posed; it is advice on better practice that would solve the problem he is facing.
You have a structural problem. If your dataframe were modeled like this:
Date User_Name User_Added User_Removed User_Total
then the code you've posted would be the solution to your problem, and it would also handle the variable number of users.
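A minimal sketch of how the wide table could be reshaped into that long layout, assuming every non-Date column is named Name_Added or Name_Removed. The two-row sample is copied from the question; the intermediate names column, count and Action are made up for illustration.

```python
import pandas as pd

wide = pd.DataFrame({
    "Date": ["02/19/2019", "02/18/2019"],
    "David_Added": [3, 0],
    "David_Removed": [1, 0],
    "Malik_Added": [39, 8],
    "Malik_Removed": [41, 6],
})

# One row per (Date, original column), then split "David_Added" into its parts.
long_df = wide.melt(id_vars="Date", var_name="column", value_name="count")
long_df[["User_Name", "Action"]] = long_df["column"].str.split("_", n=1, expand=True)

# One row per (Date, User_Name) with Added/Removed side by side.
tidy = (
    long_df.pivot_table(index=["Date", "User_Name"], columns="Action",
                        values="count", aggfunc="sum")
    .rename(columns={"Added": "User_Added", "Removed": "User_Removed"})
    .reset_index()
)
tidy["User_Total"] = tidy["User_Added"] + tidy["User_Removed"]
print(tidy)
```

With this layout the total is a single column expression, no matter how many users appear.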

Convert the last non-zero value to 0 for each row in a pandas DataFrame

I'm trying to modify my data frame so that the last nonzero value of a label-encoded feature is converted to 0 in each row. For example, I have this data frame, with the top row being the labels and the first column the index:
df
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 1
1 0 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1 0
Columns 1-10 are the ones that have been encoded. What I want to convert this data frame to, without changing anything else is:
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
So the last nonzero value in each row should be converted to 0. I was thinking of using the last_valid_index method, but that would also take in the other remaining columns and change them, which I don't want. Any help is appreciated.
You can use cumsum to build a boolean mask, and set to zero.
v = df.cumsum(axis=1)
df[v.lt(v.max(axis=1), axis=0)].fillna(0, downcast='infer')
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
Another similar option is to reverse the columns before calling cumsum, which lets you do this in a single line.
df[~df.iloc[:, ::-1].cumsum(1).le(1)].fillna(0, downcast='infer')
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
If you have more columns, apply these operations to a slice of the encoded columns and then assign the result back.
u = df.iloc[:, :10]
df[u.columns] = u[~u.iloc[:, ::-1].cumsum(1).le(1)].fillna(0, downcast='infer')
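The same cumsum mask can also be applied with DataFrame.where, which replaces the masked cells with 0 directly instead of going through NaN and fillna. A minimal sketch on the sample data from the question, assuming the encoded columns contain only 0s and 1s:

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]],
    columns=range(1, 11),
)

# The running total only reaches its row maximum at the last 1,
# so "less than the maximum" marks every cell before it.
v = df.cumsum(axis=1)
keep = v.lt(v.max(axis=1), axis=0)

# Keep values where the mask is True; everything else becomes 0.
result = df.where(keep, 0)
print(result)
```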
