Fill missing rows with zeros from a data frame - python

Now I have a DataFrame as below:
video_id 0 1 2 3 4 5 6 7 8 9 ... 53 54 55 56
user_id ...
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45 ... 3 352 6 0
2 0 0 0 0 0 0 0 11 0 0 ... 0 0 0 0
3 4 13 0 8 0 0 5 9 12 11 ... 14 17 0 6
4 0 0 4 13 25 4 0 33 0 39 ... 5 7 4 3
6 2 0 0 0 12 0 0 0 2 0 ... 19 4 0 0
7 33 59 52 59 113 53 29 32 59 82 ... 60 119 57 39
9 0 0 0 0 5 0 0 1 0 4 ... 16 0 0 0
10 0 0 0 0 40 0 0 0 0 0 ... 26 0 0 0
11 2 2 32 3 12 3 3 11 19 10 ... 16 3 3 9
12 0 0 0 0 0 0 0 7 0 0 ... 7 0 0 0
We can see that some rows of the DataFrame are missing, such as user_id 5 and user_id 8. What I want to do is fill these rows with 0, like this:
video_id 0 1 2 3 4 5 6 7 8 9 ... 53 54 55 56
user_id ...
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45 ... 3 352 6 0
2 0 0 0 0 0 0 0 11 0 0 ... 0 0 0 0
3 4 13 0 8 0 0 5 9 12 11 ... 14 17 0 6
4 0 0 4 13 25 4 0 33 0 39 ... 5 7 4 3
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
6 2 0 0 0 12 0 0 0 2 0 ... 19 4 0 0
7 33 59 52 59 113 53 29 32 59 82 ... 60 119 57 39
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
9 0 0 0 0 5 0 0 1 0 4 ... 16 0 0 0
10 0 0 0 0 40 0 0 0 0 0 ... 26 0 0 0
11 2 2 32 3 12 3 3 11 19 10 ... 16 3 3 9
12 0 0 0 0 0 0 0 7 0 0 ... 7 0 0 0
Is there any solution to this issue?

You could use np.arange + reindex:
import numpy as np
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), fill_value=0)
This assumes your index is meant to be a contiguous, monotonically increasing range of integers.
df
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45
2 0 0 0 0 0 0 0 11 0 0
3 4 13 0 8 0 0 5 9 12 11
4 0 0 4 13 25 4 0 33 0 39
6 2 0 0 0 12 0 0 0 2 0
7 33 59 52 59 113 53 29 32 59 82
9 0 0 0 0 5 0 0 1 0 4
10 0 0 0 0 40 0 0 0 0 0
11 2 2 32 3 12 3 3 11 19 10
12 0 0 0 0 0 0 0 7 0 0
df.reindex(np.arange(df.index.min(), df.index.max() + 1), fill_value=0)
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45
2 0 0 0 0 0 0 0 11 0 0
3 4 13 0 8 0 0 5 9 12 11
4 0 0 4 13 25 4 0 33 0 39
5 0 0 0 0 0 0 0 0 0 0 # <-----
6 2 0 0 0 12 0 0 0 2 0
7 33 59 52 59 113 53 29 32 59 82
8 0 0 0 0 0 0 0 0 0 0 # <-----
9 0 0 0 0 5 0 0 1 0 4
10 0 0 0 0 40 0 0 0 0 0
11 2 2 32 3 12 3 3 11 19 10
12 0 0 0 0 0 0 0 7 0 0
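A self-contained sketch of the same idea, assuming an integer index. The toy frame and the names full_index, added and filled are illustrative and not from the original post; the sketch also reports which labels reindex had to create:

```python
import numpy as np
import pandas as pd

# Toy frame with user_id 2 missing (illustrative data only).
df = pd.DataFrame(
    {0: [0, 2, 4], 1: [0, 0, 13]},
    index=pd.Index([0, 1, 3], name="user_id"),
)

# Every integer label from the smallest to the largest existing one.
full_index = np.arange(df.index.min(), df.index.max() + 1)

# Labels that are absent from the current index.
added = pd.Index(full_index).difference(df.index)

# Rows for the new labels are filled with 0 instead of NaN.
filled = df.reindex(full_index, fill_value=0)

print(added.tolist())
print(filled)
```

If the missing ids can also lie beyond the current minimum or maximum, the range would have to come from the known set of ids rather than from the index itself.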

Related

How to calculate an accumulated value conditionally?

This question is based on this thread.
I have the following dataframe:
diff_hours stage sensor
0 0 20
0 0 21
0 0 21
1 0 22
5 0 21
0 0 22
0 1 20
7 1 23
0 1 24
0 3 25
0 3 28
6 0 21
0 0 22
I need to calculate an accumulated value of diff_hours while stage is growing. When stage drops to 0, the accumulated value acc_hours should restart at 0 even though diff_hours might not be equal to 0.
The proposed solution is this one:
blocks = df['stage'].diff().lt(0).cumsum()
df['acc_hours'] = df['diff_hours'].groupby(blocks).cumsum()
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 6
12 0 0 22 6
In row 11, the value of acc_hours is equal to 6. I need it to restart at 0, because stage dropped from 3 back to 0 in that row.
The expected output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 0
How can I implement this logic?
The expected output is unclear. What about a simple mask?
Masking only the value during the change:
m = df['stage'].diff().lt(0)
df['acc_hours'] = (df.groupby(m.cumsum())
                   ['diff_hours'].cumsum()
                   .mask(m, 0)
                  )
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 6
13 3 0 22 9
14 0 0 22 9
Or ignoring the value completely by masking before the groupby:
m = df['stage'].diff().lt(0)
df['acc_hours'] = (df['diff_hours'].mask(m, 0)
                   .groupby(m.cumsum())
                   .cumsum()
                  )
Output:
diff_hours stage sensor acc_hours
0 0 0 20 0
1 0 0 21 0
2 0 0 21 0
3 1 0 22 1
4 5 0 21 6
5 0 0 22 6
6 0 1 20 6
7 7 1 23 13
8 0 1 24 13
9 0 3 25 13
10 0 3 28 13
11 6 0 21 0
12 0 0 22 0
13 3 0 22 3
14 0 0 22 3
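To make the difference between the two variants concrete, here is a small sketch that runs both on the 13 rows from the question. The variable names after and before are only illustrative:

```python
import pandas as pd

# The diff_hours and stage columns from the question.
df = pd.DataFrame({
    "diff_hours": [0, 0, 0, 1, 5, 0, 0, 7, 0, 0, 0, 6, 0],
    "stage":      [0, 0, 0, 0, 0, 0, 1, 1, 1, 3, 3, 0, 0],
})

# True exactly on the rows where stage drops.
m = df["stage"].diff().lt(0)
blocks = m.cumsum()

# Variant 1: accumulate first, then zero only the row of the drop.
after = df.groupby(blocks)["diff_hours"].cumsum().mask(m, 0)

# Variant 2: zero diff_hours on the drop row first, then accumulate.
before = df["diff_hours"].mask(m, 0).groupby(blocks).cumsum()

print(after.tolist())
print(before.tolist())
```

On this data only the second variant keeps row 12 at 0, which matches the expected output in the question; the first variant resumes from 6 there because the drop row's diff_hours still counts toward the new block.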

Trying to merge dictionaries together to create new df but dictionary values aren't showing up in df

[Image: Jupyter notebook issue]
For my quarters, instead of values such as 1,0,0,0 showing up, I get NaN.
How do I fix the code below so that the values appear in my dataframe?
qrt_1 = {'q1':[1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]}
qrt_2 = {'q2':[0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0]}
qrt_3 = {'q3':[0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]}
qrt_4 = {'q4':[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]}
year = {'year': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9]}
value = data_1['Sales']
data = [year, qrt_1, qrt_2, qrt_3, qrt_4]
dataframes = []
for x in data:
    dataframes.append(pd.DataFrame(x))
df = pd.concat(dataframes)
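I am expecting a dataframe that contains qrt_1, qrt_2, etc. under their corresponding column names.
By default pd.concat stacks the frames vertically (axis=0), so each column is NaN in the rows that came from the other frames. Try axis=1 in pd.concat to place them side by side: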
df = pd.concat(dataframes, axis=1)
print(df)
Prints:
year q1 q2 q3 q4
0 1 1 0 0 0
1 1 0 1 0 0
2 1 0 0 1 0
3 1 0 0 0 1
4 2 1 0 0 0
5 2 0 1 0 0
6 2 0 0 1 0
7 2 0 0 0 1
8 3 1 0 0 0
9 3 0 1 0 0
10 3 0 0 1 0
11 3 0 0 0 1
12 4 1 0 0 0
13 4 0 1 0 0
14 4 0 0 1 0
15 4 0 0 0 1
16 5 1 0 0 0
17 5 0 1 0 0
18 5 0 0 1 0
19 5 0 0 0 1
20 6 1 0 0 0
21 6 0 1 0 0
22 6 0 0 1 0
23 6 0 0 0 1
24 7 1 0 0 0
25 7 0 1 0 0
26 7 0 0 1 0
27 7 0 0 0 1
28 8 1 0 0 0
29 8 0 1 0 0
30 8 0 0 1 0
31 8 0 0 0 1
32 9 1 0 0 0
33 9 0 1 0 0
34 9 0 0 1 0
35 9 0 0 0 1
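A minimal sketch of why the default produces NaN, using two shortened frames (the four-row data is made up for illustration):

```python
import pandas as pd

year = pd.DataFrame({"year": [1, 1, 1, 1]})
q1 = pd.DataFrame({"q1": [1, 0, 0, 0]})

# Default axis=0: rows are stacked, columns are unioned, gaps become NaN.
stacked = pd.concat([year, q1])

# axis=1: frames are aligned on the row index and placed side by side.
side_by_side = pd.concat([year, q1], axis=1)

print(stacked.shape, int(stacked.isna().sum().sum()))
print(side_by_side.shape, int(side_by_side.isna().sum().sum()))
```

Since every list in the question has the same length, building a single frame from one merged dictionary, for example pd.DataFrame({**year, **qrt_1}), would give the same side-by-side layout without concat.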

Finding Occurrences SUM using Dataframe

I have a data frame in which I need to group each run containing at least one value greater than 0 and put the sum of that run on its last occurrence. My code is below.
data = {
    'id': [7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
    'timeatAcc': [0,0,0,0,0,0,0,0,1,1,1,0,0,1,1,0,0,1,1,1,1,1,1,0,0,0,0,0,1,1,1,0,0,1,1,0,0,0,0,0]
}
df = pd.DataFrame(data, columns=['id', 'timeatAcc'])
df['consecutive'] = (df['id'].groupby((df['timeatAcc'] != df['timeatAcc'].shift()).cumsum())
                     .transform('size') * df['timeatAcc'])
print(df)
Current Output
Expected output
I need help. Thanks in advance.
Let's try groupby().diff():
df['Occurences'] = df.groupby('id')['timeatAcc'].diff(-1).eq(1).astype(int)
Output:
id timeatAcc Occurences
0 7 0 0
1 7 0 0
2 7 0 0
3 7 0 0
4 7 0 0
5 7 0 0
6 7 0 0
7 7 0 0
8 7 1 0
9 7 1 0
10 7 1 1
11 7 0 0
12 7 0 0
13 7 1 0
14 7 1 1
15 7 0 0
16 7 0 0
17 7 1 0
18 7 1 0
19 7 1 0
20 1 1 0
21 1 1 0
22 1 1 1
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 1 0 0
28 1 1 0
29 1 1 0
30 1 1 1
31 1 0 0
32 1 0 0
33 1 1 0
34 1 1 1
35 1 0 0
36 1 0 0
37 1 0 0
38 1 0 0
39 1 0 0
Update: to get the sum instead of 1:
df['Occurences'] = df.groupby(['id', df['timeatAcc'].eq(0).cumsum()])['timeatAcc'].transform('sum')
df['Occurences'] = np.where(df.groupby('id')['timeatAcc'].diff(-1).eq(1),
                            df['Occurences'], 0)
Output:
id timeatAcc Occurences
0 7 0 0
1 7 0 0
2 7 0 0
3 7 0 0
4 7 0 0
5 7 0 0
6 7 0 0
7 7 0 0
8 7 1 0
9 7 1 0
10 7 1 3
11 7 0 0
12 7 0 0
13 7 1 0
14 7 1 2
15 7 0 0
16 7 0 0
17 7 1 0
18 7 1 0
19 7 1 0
20 1 1 0
21 1 1 0
22 1 1 3
23 1 0 0
24 1 0 0
25 1 0 0
26 1 0 0
27 1 0 0
28 1 1 0
29 1 1 0
30 1 1 3
31 1 0 0
32 1 0 0
33 1 1 0
34 1 1 2
35 1 0 0
36 1 0 0
37 1 0 0
38 1 0 0
39 1 0 0
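The two steps of the update can be checked on a shorter frame. This is only a sketch on made-up data; run_sum, run_end and occ are illustrative names, and Series.where is used in place of np.where:

```python
import pandas as pd

df = pd.DataFrame({
    "id":        [7, 7, 7, 7, 7, 7, 1, 1, 1, 1, 1],
    "timeatAcc": [0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0],
})

# Each 0 starts a new block, so a run of 1s shares a block with the 0 before it.
block = df["timeatAcc"].eq(0).cumsum()
run_sum = df.groupby(["id", block])["timeatAcc"].transform("sum")

# A run ends where the next value within the same id is 0 (1 - 0 == 1).
run_end = df.groupby("id")["timeatAcc"].diff(-1).eq(1)

# Keep the run total on the last row of each run, 0 everywhere else.
occ = run_sum.where(run_end, 0)
print(occ.tolist())
```

As in the output above (rows 17 to 19), a run that reaches the end of an id is not followed by a 0 inside that group, so it gets no total.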

Python Dictionary: Simple division of massive DataFrames stored under each key

I have two dictionaries, each holding ten 3000 by 3000 DataFrames under the keys 0 to 9. All the values in the DataFrames are ints, and I simply want to divide them element-wise. The first loop below only sets the entries where index equals column (the diagonal) to 0, and I do not think this loop is what slows the process down. I believe the second loop is the runtime problem, since there is so much data to compute. Please see the code below.
for a in range(10):
    for aa in range(len(dict_cat4[a])):
        dict_cat4[a].iloc[aa,aa] = 0
        dict_amt4[a].iloc[aa,aa] = 0
for b in range(10):
    temp_df3 = dict_amt4[b] / dict_cat4[b]
    temp_df3.replace(np.nan,0.0,inplace=True)
    dict_div4[b] = temp_df3
The problem is that this takes forever to compute because the data set is very big. Is there a more efficient way to write this code? It has been running for 60+ minutes and is still computing. Please let me know! Thanks
-----------------edit------------------
Below is sample input and output of first loop
Output:dict_amt4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 62 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 13 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
And second loop is below
Input:dict_amt4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 186 174 0 4 46 46 14 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 186 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 130 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
Input:dict_cat4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 62 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 13 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
Output:dict_div4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 3 1 0 1 1 1 2 1 1 ... 1 1 1 1 1 0 0 0 0 0
B 3 0 1 0 0 1 1 1 1 1 ... 1 1 1 1 0 0 0 0 0 0
C 1 1 0 0 0 10 1 1 1 1 ... 0 1 1 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 1 0 0 0 0 1 1 1 1 1 ... 0 0 0 0 0 0 1 0 0 0
F 1 1 1 0 1 0 1 1 0 0 ... 0 1 1 0 0 0 1 1 0 0
.............
I made the sample data by hand, so please disregard any typos. As you can see, the first loop just sets dict_cat4[0].iloc[i,i] = 0 (and likewise for dict_amt4). The second loop divides every value in dict_amt4[0] by the corresponding value in dict_cat4[0]. I hope that makes more sense.
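One possible way to express the two loops without per-element .iloc writes or label alignment is to work on the underlying NumPy arrays. This is a sketch under assumptions, not a measured fix: it assumes each pair of frames has the same shape and the same row and column order, it uses tiny made-up frames, and it leaves the original dictionaries unmodified. It also returns 0 wherever the divisor is 0, whereas the original code would leave inf for a nonzero value divided by 0.

```python
import numpy as np
import pandas as pd

# Tiny stand-ins for the 3000 x 3000 frames (illustrative values only).
labels = ["A", "B", "C"]
dict_amt4 = {0: pd.DataFrame([[5, 186, 4], [186, 7, 0], [4, 0, 9]],
                             index=labels, columns=labels)}
dict_cat4 = {0: pd.DataFrame([[1, 62, 4], [62, 1, 0], [4, 0, 3]],
                             index=labels, columns=labels)}
dict_div4 = {}

for key, amt_df in dict_amt4.items():
    # Float copies of the raw values; the original frames are left untouched.
    amt = amt_df.to_numpy(dtype=float, copy=True)
    cat = dict_cat4[key].to_numpy(dtype=float, copy=True)

    # Replaces the element-by-element .iloc[aa, aa] = 0 loop.
    np.fill_diagonal(amt, 0)
    np.fill_diagonal(cat, 0)

    # Divide only where the divisor is nonzero; all other cells stay 0.
    out = np.zeros_like(amt)
    np.divide(amt, cat, out=out, where=cat != 0)

    dict_div4[key] = pd.DataFrame(out, index=amt_df.index, columns=amt_df.columns)

print(dict_div4[0])
```

Working on arrays sidesteps index alignment entirely, which may matter here because the sample column header shows a repeated label (F appears twice).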

How to add new columns by reindex in pivot table in python?

I have a very long original dataframe:
ID cols event1 event2 event3 event4 event5 event6
1 1 0 0 0 0 1 1
1 16 9 1 0 0 7 11
2 2 3 3 0 0 68 36
2 25 1 0 1 1 97 27
2 59 3 0 0 0 38 38
2 118 4 0 1 1 33 10
2 150 3 1 0 0 4 7
.....
One ID maps to multiple records in the original dataframe.
Then I convert it to a pivot table:
df = pd.pivot_table(df, index='ID', columns='cols', fill_value=0)
event1 \ ... event2 \
cols 1 2 3 5 7 8 ... 1 2 3 5 7 8 ...
ID ... ...
1 0 77 0 2 0 0 ... 2 4 1 0 0 12 ...
2 0 0 0 1 0 0 ... 0 3 3 0 11 2 ...
3 0 0 0 3 0 0 ... 1 2 6 0 4 5 ...
4 0 1 0 6 0 1 ... 9 0 0 0 1 6 ...
... event6
cols 8 9 10 ... 236 249
ID ...
1 0 0 0 ... 0 0
2 0 0 0 ... 0 0
3 0 0 0 ... 0 0
4 0 0 0 ... 0 0
5 0 0 0 ... 0 0
It seems some of the columns between 1 and 249 are missing, so I tried to reindex the columns like this:
df.columns=df.columns.droplevel()
df.reindex(columns=list(range(1,249))).fillna(0)
But reindexing raises an error:
ValueError: cannot reindex from a duplicate axis
Does anyone know how to fix this problem?
The final dataframe should look similar to this:
event1 \ ... event2
cols 1 2 3 4 5 6 7 8 ... 1 2 3 4 5 6 7 8 ...
ID
1 0 77 0 0 2 0 0 0 ... 2 4 1 0 0 0 0 12
2 0 0 0 0 1 0 0 0 ... 0 3 3 0 0 0 11 2 ...
3 0 0 0 0 3 0 0 0 ... 1 2 6 0 0 0 4 5 ...
4 0 1 0 0 6 0 0 1 ... 9 0 0 0 0 0 1 6 ...
...
... event6
cols ... 247 248 249
ID ... 0 0 0
1 ... 0 0 0
2 ... 0 0 0
3 ... 0 0 0
4 ... 0 0 0
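The ValueError most likely comes from dropping the top column level: afterwards the same cols label appears once per event, so the column axis contains duplicates. One possible workaround, sketched here on made-up data with only two event columns and a cols range of 1 to 5, is to keep both levels and reindex against the full product of event names and the desired range. The names pivot, full_columns and result are illustrative:

```python
import pandas as pd

# Shortened stand-in for the long original frame (illustrative values only).
df = pd.DataFrame({
    "ID":     [1, 1, 2, 2],
    "cols":   [1, 3, 1, 5],
    "event1": [0, 9, 3, 1],
    "event2": [0, 1, 3, 0],
})

# Columns become a two-level MultiIndex: (event name, cols).
pivot = df.pivot_table(index="ID", columns="cols", fill_value=0)

# Every event paired with every wanted cols value (1..5 in this toy case).
full_columns = pd.MultiIndex.from_product(
    [pivot.columns.levels[0], range(1, 6)],
    names=pivot.columns.names,
)

# Missing (event, cols) pairs are created and filled with 0.
result = pivot.reindex(columns=full_columns, fill_value=0)
print(result)
```

For the real data the range would be range(1, 250) if 249 should be included, since range excludes its upper bound.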
