Python dictionary: simple division with massive DataFrame values at each index - python
I have two dictionaries, each holding ten 3000 x 3000 DataFrames under the keys 0 to 9. All the values in the DataFrames are ints, and I simply want to divide them element-wise. The first loop below only sets the entries where index equals column (the diagonal) to 0, and I do not think this loop is what slows the process down. I believe the second loop is the run-time problem, since there is so much data to compute. Please see the code below.
for a in range(10):
    for aa in range(len(dict_cat4[a])):
        dict_cat4[a].iloc[aa,aa] = 0
        dict_amt4[a].iloc[aa,aa] = 0
for b in range(10):
    temp_df3 = dict_amt4[b] / dict_cat4[b]
    temp_df3.replace(np.nan,0.0,inplace=True)
    dict_div4[b] = temp_df3
The problem is that this takes forever to compute because the data set is very big. Is there a more efficient way to write these loops? It has now been running for 60+ minutes and is still computing. Please let me know! Thanks
-----------------edit------------------
Below is sample input and output of first loop
Output:dict_amt4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 62 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 13 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
And second loop is below
Input:dict_amt4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 186 174 0 4 46 46 14 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 186 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 130 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
Input:dict_cat4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 62 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 13 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
Output:dict_div4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 3 1 0 1 1 1 2 1 1 ... 1 1 1 1 1 0 0 0 0 0
B 3 0 1 0 0 1 1 1 1 1 ... 1 1 1 1 0 0 0 0 0 0
C 1 1 0 0 0 10 1 1 1 1 ... 0 1 1 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 1 0 0 0 0 1 1 1 1 1 ... 0 0 0 0 0 0 1 0 0 0
F 1 1 1 0 1 0 1 1 0 0 ... 0 1 1 0 0 0 1 1 0 0
.............
I made this sample data by hand, so please disregard any typos. As you can see, the first loop just sets dict_cat4[0].iloc[i,i] = 0 (and does the same for dict_amt4[0]). The second loop divides every value in dict_amt4[0] by the corresponding value in dict_cat4[0]. I hope that makes more sense.
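One thing that may help with the run time is to avoid the per-cell .iloc assignments and do both steps on the underlying NumPy arrays. The following is only a minimal sketch, not a tested fix for the real data: zero_diag_divide is a hypothetical helper name, the two small frames are made up, and it assumes that the amount and count frames have the same shape and the same label order.

```python
import numpy as np
import pandas as pd


def zero_diag_divide(amt: pd.DataFrame, cat: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: zero both diagonals, then divide amt by cat.

    Assumes amt and cat have the same shape and the same label order.
    """
    num = amt.to_numpy(dtype=float, copy=True)  # copies, so the inputs are not modified
    den = cat.to_numpy(dtype=float, copy=True)
    np.fill_diagonal(num, 0.0)
    np.fill_diagonal(den, 0.0)
    out = np.zeros_like(num)
    # Divide only where the denominator is non-zero; every other cell stays 0.
    np.divide(num, den, out=out, where=den != 0)
    return pd.DataFrame(out, index=amt.index, columns=amt.columns)


# Tiny made-up frames standing in for dict_amt4[b] and dict_cat4[b].
amt = pd.DataFrame([[5, 6], [8, 0]], index=["A", "B"], columns=["A", "B"])
cat = pd.DataFrame([[1, 2], [4, 0]], index=["A", "B"], columns=["A", "B"])
result = zero_diag_divide(amt, cat)
print(result)
```

With the dictionaries from the question this would be called as something like dict_div4 = {b: zero_diag_divide(dict_amt4[b], dict_cat4[b]) for b in range(10)}. Two differences from the original loops are worth noting: the sketch does not modify dict_cat4 or dict_amt4 in place, and a non-zero value divided by 0 becomes 0 here, whereas the original code would leave inf in that cell because it only replaces NaN.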
Related
Trying to merge dictionaries together to create new df but dictionaries values arent showing up in df
For my quarters, instead of values such as 1,0,0,0 showing up, I get NaN. How do I fix the code below so that it returns the values in my dataframe?
qrt_1 = {'q1':[1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]}
qrt_2 = {'q2':[0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0]}
qrt_3 = {'q3':[0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]}
qrt_4 = {'q4':[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]}
year = {'year': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9]}
value = data_1['Sales']
data = [year, qrt_1, qrt_2, qrt_3, qrt_4]
dataframes = []
for x in data:
    dataframes.append(pd.DataFrame(x))
df = pd.concat(dataframes)
I am expecting a dataframe that contains qrt_1, qrt_2, etc. under their corresponding column names.
Try using axis=1 in pd.concat:
df = pd.concat(dataframes, axis=1)
print(df)
Prints:
    year  q1  q2  q3  q4
0      1   1   0   0   0
1      1   0   1   0   0
2      1   0   0   1   0
3      1   0   0   0   1
4      2   1   0   0   0
5      2   0   1   0   0
6      2   0   0   1   0
7      2   0   0   0   1
8      3   1   0   0   0
9      3   0   1   0   0
10     3   0   0   1   0
11     3   0   0   0   1
12     4   1   0   0   0
13     4   0   1   0   0
14     4   0   0   1   0
15     4   0   0   0   1
16     5   1   0   0   0
17     5   0   1   0   0
18     5   0   0   1   0
19     5   0   0   0   1
20     6   1   0   0   0
21     6   0   1   0   0
22     6   0   0   1   0
23     6   0   0   0   1
24     7   1   0   0   0
25     7   0   1   0   0
26     7   0   0   1   0
27     7   0   0   0   1
28     8   1   0   0   0
29     8   0   1   0   0
30     8   0   0   1   0
31     8   0   0   0   1
32     9   1   0   0   0
33     9   0   1   0   0
34     9   0   0   1   0
35     9   0   0   0   1
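To see where the NaN values come from: by default pd.concat stacks the frames on top of each other (axis=0) and takes the union of their columns, so each block of rows has NaN in every column it did not bring along. A small sketch with two made-up two-row frames (the names a and b are only for illustration):

```python
import pandas as pd

a = pd.DataFrame({"year": [1, 1]})
b = pd.DataFrame({"q1": [1, 0]})

# Default axis=0: rows are stacked, so there are 4 rows and NaN
# wherever a frame does not have the column.
stacked = pd.concat([a, b])

# axis=1: frames are placed side by side and aligned on the index.
side_by_side = pd.concat([a, b], axis=1)

print(stacked)
print(side_by_side)
```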
How can I group by ID and add columns to each other
id volume location_ 10 location_ 100 location_ 1000 location_ 1002 location_ 1005
0 14121 19 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 14121 19 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 14121 19 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 14121 19 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 9320 200 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 9320 116 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 9320 200 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 9320 116 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
I have a df like the one above. I want to group by id and end up with something like the table below. There are 4 rows with id 14121, and the sum of their volumes is 76. How can I do that?
id 0 1 2 3 4 5 6 7 ... vol
0 14121 0 0 0 0 0 0 0 0 ... 76
1 9329 0 0 0 0 0 0 0 0 ... 632
2 14934 0 0 0 0 0 0 0 0 ... 4
I am not sure what the location columns are. Here's how you would get the sum of the volumes for each id.
import pandas as pd
df = pd.DataFrame({'id':[14121,14121,14121,14121,9320,9320,9320,9320,14934,14934,14934,14934],
                   'volume':[19,19,19,19,200,116,200,116,1,1,1,1]})
print (df)
print (df.groupby('id')['volume'].sum())
Input DataFrame:
       id  volume
0   14121      19
1   14121      19
2   14121      19
3   14121      19
4    9320     200
5    9320     116
6    9320     200
7    9320     116
8   14934       1
9   14934       1
10  14934       1
11  14934       1
Output DataFrame:
id
9320     632
14121     76
14934      4
Or you can also use:
print (df.groupby('id').agg(vol_sum = ('volume','sum')).reset_index())
Output will be:
      id  vol_sum
0   9320      632
1  14121       76
2  14934        4
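If the location columns are meant to be added up per id as well, which is a guess at what the question wants, the same groupby can sum every remaining column at once. A minimal sketch with made-up data; the two location columns here are hypothetical stand-ins for the real ones:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [14121, 14121, 9320, 9320],
    "volume": [19, 19, 200, 116],
    "location_10": [0, 1, 0, 0],   # hypothetical indicator column
    "location_100": [1, 0, 0, 1],  # hypothetical indicator column
})

# Sum volume and every location column within each id.
totals = df.groupby("id", as_index=False).sum()
print(totals)
```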
Fill missing rows with zeros from a data frame
I have a DataFrame like the one below:
video_id   0   1   2   3   4   5   6   7   8   9  ...  53  54  55  56
user_id                                           ...
0          0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0
1          2   0   4  13  16   2   0  10   6  45  ...   3 352   6   0
2          0   0   0   0   0   0   0  11   0   0  ...   0   0   0   0
3          4  13   0   8   0   0   5   9  12  11  ...  14  17   0   6
4          0   0   4  13  25   4   0  33   0  39  ...   5   7   4   3
6          2   0   0   0  12   0   0   0   2   0  ...  19   4   0   0
7         33  59  52  59 113  53  29  32  59  82  ...  60 119  57  39
9          0   0   0   0   5   0   0   1   0   4  ...  16   0   0   0
10         0   0   0   0  40   0   0   0   0   0  ...  26   0   0   0
11         2   2  32   3  12   3   3  11  19  10  ...  16   3   3   9
12         0   0   0   0   0   0   0   7   0   0  ...   7   0   0   0
We can see that some rows of the DataFrame are missing, namely user_id 5 and user_id 8. What I want to do is fill these rows with 0, like this:
video_id   0   1   2   3   4   5   6   7   8   9  ...  53  54  55  56
user_id                                           ...
0          0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0
1          2   0   4  13  16   2   0  10   6  45  ...   3 352   6   0
2          0   0   0   0   0   0   0  11   0   0  ...   0   0   0   0
3          4  13   0   8   0   0   5   9  12  11  ...  14  17   0   6
4          0   0   4  13  25   4   0  33   0  39  ...   5   7   4   3
5          0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0
6          2   0   0   0  12   0   0   0   2   0  ...  19   4   0   0
7         33  59  52  59 113  53  29  32  59  82  ...  60 119  57  39
8          0   0   0   0   0   0   0   0   0   0  ...   0   0   0   0
9          0   0   0   0   5   0   0   1   0   4  ...  16   0   0   0
10         0   0   0   0  40   0   0   0   0   0  ...  26   0   0   0
11         2   2  32   3  12   3   3  11  19  10  ...  16   3   3   9
12         0   0   0   0   0   0   0   7   0   0  ...   7   0   0   0
Is there any solution to this issue?
You could use arange + reindex:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), fill_value=0)
This assumes your index is meant to be a monotonically increasing index.
df
     0   1   2   3   4   5   6   7   8   9
0    0   0   0   0   0   0   0   0   0   0
1    2   0   4  13  16   2   0  10   6  45
2    0   0   0   0   0   0   0  11   0   0
3    4  13   0   8   0   0   5   9  12  11
4    0   0   4  13  25   4   0  33   0  39
6    2   0   0   0  12   0   0   0   2   0
7   33  59  52  59 113  53  29  32  59  82
9    0   0   0   0   5   0   0   1   0   4
10   0   0   0   0  40   0   0   0   0   0
11   2   2  32   3  12   3   3  11  19  10
12   0   0   0   0   0   0   0   7   0   0
df.reindex(np.arange(df.index.min(), df.index.max() + 1), fill_value=0)
     0   1   2   3   4   5   6   7   8   9
0    0   0   0   0   0   0   0   0   0   0
1    2   0   4  13  16   2   0  10   6  45
2    0   0   0   0   0   0   0  11   0   0
3    4  13   0   8   0   0   5   9  12  11
4    0   0   4  13  25   4   0  33   0  39
5    0   0   0   0   0   0   0   0   0   0   # <-----
6    2   0   0   0  12   0   0   0   2   0
7   33  59  52  59 113  53  29  32  59  82
8    0   0   0   0   0   0   0   0   0   0   # <-----
9    0   0   0   0   5   0   0   1   0   4
10   0   0   0   0  40   0   0   0   0   0
11   2   2  32   3  12   3   3  11  19  10
12   0   0   0   0   0   0   0   7   0   0
How to add new columns by reindex in pivot table in python?
I have a very long original dataframe:
ID  cols  event1  event2  event3  event4  event5  event6
1      1       0       0       0       0       1       1
1     16       9       1       0       0       7      11
2      2       3       3       0       0      68      36
2     25       1       0       1       1      97      27
2     59       3       0       0       0      38      38
2    118       4       0       1       1      33      10
2    150       3       1       0       0       4       7
.....
One user ID maps to multiple records in the original dataframe. I then convert it to a pivot table:
df = df.pivot_table(df, index='ID', columns='cols', fill_value='0')
     event1                     ... event2                     ... \
cols      1   2   3   5   7   8 ...      1   2   3   5   7   8 ...
ID                              ...                            ...
1         0  77   0   2   0   0 ...      2   4   1   0   0  12 ...
2         0   0   0   1   0   0 ...      0   3   3   0  11   2 ...
3         0   0   0   3   0   0 ...      1   2   6   0   4   5 ...
4         0   1   0   6   0   1 ...      9   0   0   0   1   6 ...
...
     event6
cols      8   9  10 ... 236 249
ID                  ...
1         0   0   0 ...   0   0
2         0   0   0 ...   0   0
3         0   0   0 ...   0   0
4         0   0   0 ...   0   0
5         0   0   0 ...   0   0
It seems some of the column values between 1 and 249 are missing, so I tried to reindex the columns like this:
df.columns=df.columns.droplevel()
df.reindex(columns=list(range(1,249))).fillna(0)
But reindexing gives me an error:
ValueError: cannot reindex from a duplicate axis
Does anyone know how to fix this problem? The final dataframe should look similar to this:
     event1                             ... event2                             ... \
cols      1   2   3   4   5   6   7   8 ...      1   2   3   4   5   6   7   8 ...
ID
1         0  77   0   0   2   0   0   0 ...      2   4   1   0   0   0   0  12
2         0   0   0   0   1   0   0   0 ...      0   3   3   0   0   0  11   2 ...
3         0   0   0   0   3   0   0   0 ...      1   2   6   0   0   0   4   5 ...
4         0   1   0   0   6   0   0   1 ...      9   0   0   0   0   0   1   6 ...
...
     ... event6
cols ... 247 248 249
ID   ...   0   0   0
1    ...   0   0   0
2    ...   0   0   0
3    ...   0   0   0
4    ...   0   0   0
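The error most likely comes from dropping the event level: after droplevel(), the same cols value appears once per event, so the column labels are no longer unique and reindex refuses to work. One possible way around it, sketched here on a small made-up frame, is to keep both levels and reindex against the full product of event names and column numbers. The data and the 1 to 4 range are placeholders; for the real data the range would presumably be range(1, 250), since range(1, 249) stops at 248.

```python
import pandas as pd

# Made-up stand-in for the long original frame.
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "cols": [1, 3, 3],
    "event1": [0, 2, 1],
    "event2": [5, 0, 7],
})

wide = df.pivot_table(index="ID", columns="cols", fill_value=0)

# Build every (event, col) pair, keeping the two-level column index
# so that labels stay unique.
full_cols = pd.MultiIndex.from_product(
    [wide.columns.levels[0], range(1, 5)], names=wide.columns.names
)
wide = wide.reindex(columns=full_cols, fill_value=0)
print(wide)
```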
Leave blocks of 1 of size >= k in Pandas data frame
I need to keep only the blocks of consecutive 1s whose length is >= k. All other blocks of 1s should be set to zero. For example, with k=2:
df=
    a  b
0   1  1
1   1  1
2   0  0
3   1  0
4   0  0
5   1  0
6   0  0
7   1  0
8   0  0
9   1  1
10  1  1
11  1  1
12  0  0
13  0  0
14  1  0
15  0  0
16  1  1
17  1  1
18  0  0
19  1  0
where column a is the original sequence and column b is the desired result.
z = df.a.eq(0)
g = z.cumsum().mask(z, -1)
k = 2
df['b'] = df.a.groupby(g).transform('size').ge(k).mask(z, 0)
    a  b
0   1  1
1   1  1
2   0  0
3   1  0
4   0  0
5   1  0
6   0  0
7   1  0
8   0  0
9   1  1
10  1  1
11  1  1
12  0  0
13  0  0
14  1  0
15  0  0
16  1  1
17  1  1
18  0  0
19  1  0
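For anyone wondering why this works: the running count of zeros stays constant across a block of consecutive 1s, so it can serve as a block id, and the number of 1s per block is then a grouped sum. Below is a slightly different variant of the same idea, written as a sketch on a short made-up series. It assumes the column holds only 0 and 1, and the names run_id and run_len are just for illustration.

```python
import pandas as pd

a = pd.Series([1, 1, 0, 1, 0, 1, 1, 1, 0])
k = 2

is_zero = a.eq(0)
# The cumulative count of zeros does not change inside a run of 1s,
# so every run of 1s gets its own id.
run_id = is_zero.cumsum()
# Because a is 0/1, the sum within a run is the number of 1s in it.
run_len = a.groupby(run_id).transform("sum")
# Keep a 1 only if it belongs to a run of at least k ones.
b = ((run_len >= k) & a.eq(1)).astype(int)
print(pd.DataFrame({"a": a, "b": b}))
```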