How to add new columns by reindexing a pivot table in Python?

I have a very long original dataframe:
ID cols event1 event2 event3 event4 event5 event6
1 1 0 0 0 0 1 1
1 16 9 1 0 0 7 11
2 2 3 3 0 0 68 36
2 25 1 0 1 1 97 27
2 59 3 0 0 0 38 38
2 118 4 0 1 1 33 10
2 150 3 1 0 0 4 7
.....
Each user ID maps to multiple records in the original dataframe.
Then I convert it to a pivot table:
df = df.pivot_table(index='ID', columns='cols', fill_value=0)
event1 \ ... event2 \
cols 1 2 3 5 7 8 ... 1 2 3 5 7 8 ...
ID ... ...
1 0 77 0 2 0 0 ... 2 4 1 0 0 12 ...
2 0 0 0 1 0 0 ... 0 3 3 0 11 2 ...
3 0 0 0 3 0 0 ... 1 2 6 0 4 5 ...
4 0 1 0 6 0 1 ... 9 0 0 0 1 6 ...
... event6
cols 8 9 10 ... 236 249
ID ...
1 0 0 0 ... 0 0
2 0 0 0 ... 0 0
3 0 0 0 ... 0 0
4 0 0 0 ... 0 0
5 0 0 0 ... 0 0
It seems some of the columns from 1 to 249 are missing, so I tried to reindex the columns with:
df.columns=df.columns.droplevel()
df.reindex(columns=list(range(1,249))).fillna(0)
But reindexing raises an error:
ValueError: cannot reindex from a duplicate axis
Does anyone know how to fix this problem?
The final dataframe should look something like:
event1 \ ... event2
cols 1 2 3 4 5 6 7 8 ... 1 2 3 4 5 6 7 8 ...
ID
1 0 77 0 0 2 0 0 0 ... 2 4 1 0 0 0 0 12
2 0 0 0 0 1 0 0 0 ... 0 3 3 0 0 0 11 2 ...
3 0 0 0 0 3 0 0 0 ... 1 2 6 0 0 0 4 5 ...
4 0 1 0 0 6 0 0 1 ... 9 0 0 0 0 0 1 6 ...
...
... event6
cols ... 247 248 249
ID
1 ... 0 0 0
2 ... 0 0 0
3 ... 0 0 0
4 ... 0 0 0
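The ValueError comes from the droplevel step: once the event level is dropped, each cols label repeats once per event, and reindex refuses duplicate axis labels. One possible workaround (a sketch with a hypothetical miniature pivot, not the real data) is to keep the two-level columns and reindex against the full (event, cols) grid built with pd.MultiIndex.from_product:

```python
import pandas as pd

# hypothetical miniature pivot: two events, a few scattered cols values
df = pd.DataFrame(
    [[77, 2, 4, 12], [1, 0, 3, 2]],
    index=pd.Index([1, 2], name='ID'),
    columns=pd.MultiIndex.from_tuples(
        [('event1', 2), ('event1', 5), ('event2', 1), ('event2', 8)],
        names=[None, 'cols']))

# build every (event, col) pair for cols 1..249 and reindex in one shot;
# missing columns are created and filled with 0 - no droplevel needed
full = pd.MultiIndex.from_product(
    [df.columns.levels[0], range(1, 250)], names=df.columns.names)
df = df.reindex(columns=full, fill_value=0)
print(df.shape)  # (2, 498): 2 events x 249 cols each
```

With the event level kept, every event block ends up with the same 1-249 column range, which matches the final dataframe shown above.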

Related

Trying to merge dictionaries together to create a new df, but the dictionary values aren't showing up in the df

For my quarters, instead of values like 1,0,0,0 showing up, I get NaN.
How do I fix the code below so that values are returned in my dataframe?
qrt_1 = {'q1':[1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]}
qrt_2 = {'q2':[0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0]}
qrt_3 = {'q3':[0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]}
qrt_4 = {'q4':[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]}
year = {'year': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9]}
value = data_1['Sales']
data = [year, qrt_1, qrt_2, qrt_3, qrt_4]
dataframes = []
for x in data:
    dataframes.append(pd.DataFrame(x))
df = pd.concat(dataframes)
I am expecting a dataframe that contains qrt_1, qrt_2, etc. under their corresponding column names.
Try using axis=1 in pd.concat:
df = pd.concat(dataframes, axis=1)
print(df)
Prints:
year q1 q2 q3 q4
0 1 1 0 0 0
1 1 0 1 0 0
2 1 0 0 1 0
3 1 0 0 0 1
4 2 1 0 0 0
5 2 0 1 0 0
6 2 0 0 1 0
7 2 0 0 0 1
8 3 1 0 0 0
9 3 0 1 0 0
10 3 0 0 1 0
11 3 0 0 0 1
12 4 1 0 0 0
13 4 0 1 0 0
14 4 0 0 1 0
15 4 0 0 0 1
16 5 1 0 0 0
17 5 0 1 0 0
18 5 0 0 1 0
19 5 0 0 0 1
20 6 1 0 0 0
21 6 0 1 0 0
22 6 0 0 1 0
23 6 0 0 0 1
24 7 1 0 0 0
25 7 0 1 0 0
26 7 0 0 1 0
27 7 0 0 0 1
28 8 1 0 0 0
29 8 0 1 0 0
30 8 0 0 1 0
31 8 0 0 0 1
32 9 1 0 0 0
33 9 0 1 0 0
34 9 0 0 1 0
35 9 0 0 0 1

Transpose Pandas dataframe preserving the index

I have a problem while transposing a Pandas DataFrame that has the following structure:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
foo 0 4 0 0 0 0 0 0 0 0 14 1 0 1 0 0 0
bar 0 6 0 0 4 0 5 0 0 0 0 0 0 0 1 0 0
lorem 1 3 0 0 0 1 0 0 2 0 3 0 1 2 1 1 0
ipsum 1 2 0 1 0 0 1 0 0 0 0 0 4 0 6 0 0
dolor 1 2 4 0 1 0 0 0 0 0 2 0 0 1 0 0 2
..
With index:
foo,bar,lorem,ipsum,dolor,...
And this is basically a term-document matrix, where rows are terms and the column headers (0-16) are document indexes.
Since my purpose is clustering documents and not terms, I want to transpose the dataframe and use this to perform a cosine-distance computation between documents themselves.
But when I transpose with:
df.transpose()
I get:
foo bar ... pippo lorem
0 0 0 ... 0 0
1 4 6 ... 0 0
2 0 0 ... 0 0
3 0 0 ... 0 0
4 0 4 ... 0 0
..
16 0 2 ... 0 1
With index:
0 , 1 , 2 , 3 , ... , 15, 16
What would I like?
I'm looking for a way to make this operation preserve the dataframe index; basically, the first row of my new df should be the index.
Thank you
We can use a chain of unstack operations:
df2 = df.unstack().to_frame().unstack(1).droplevel(0,axis=1)
print(df2)
foo bar lorem ipsum dolor
0 0 0 1 1 1
1 4 6 3 2 2
2 0 0 0 0 4
3 0 0 0 1 0
4 0 4 0 0 1
5 0 0 1 0 0
6 0 5 0 1 0
7 0 0 0 0 0
8 0 0 2 0 0
9 0 0 0 0 0
10 14 0 3 0 2
11 1 0 0 0 0
12 0 0 1 4 0
13 1 0 2 0 1
14 0 1 1 6 0
15 0 0 1 0 0
16 0 0 0 0 2
Assuming the data is a square (n x n) matrix, and if I understand the question correctly:
df = pd.DataFrame([[0, 4, 0], [0, 6, 0], [1, 3, 0]],
                  index=['foo', 'bar', 'lorem'],
                  columns=[0, 1, 2])
df_T = pd.DataFrame(df.values.T, index=df.index, columns=df.columns)
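For what it's worth, plain df.T already preserves the labels on both axes, so the term names become the new column labels without any manual reconstruction. A tiny sketch with the same toy data:

```python
import pandas as pd

df = pd.DataFrame([[0, 4, 0], [0, 6, 0], [1, 3, 0]],
                  index=['foo', 'bar', 'lorem'],
                  columns=[0, 1, 2])

# .T swaps the axes and keeps the labels: terms are now the columns,
# document indexes 0..2 are now the rows
df_T = df.T
print(list(df_T.columns))  # ['foo', 'bar', 'lorem']
```

The transposed frame can then be fed straight into a cosine-distance computation between documents (rows).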

Python Dictionary: Simple division with massive DataFrame values in each indexes

So I have two dictionaries, each composed of ten 3000 x 3000 DataFrames at indexes 0-9. All the values in the DataFrames are ints, and I just want to divide the values element-wise. The first loop below only replaces the values where index == column with 0, and personally I do not think this loop is slowing the process. The second loop is the problem with run time (I believe), since there is too much data to compute. Please see the code below.
for a in range(10):
    for aa in range(len(dict_cat4[a])):
        dict_cat4[a].iloc[aa, aa] = 0
        dict_amt4[a].iloc[aa, aa] = 0
for b in range(10):
    temp_df3 = dict_amt4[b] / dict_cat4[b]
    temp_df3.replace(np.nan, 0.0, inplace=True)
    dict_div4[b] = temp_df3
One problem is that the process takes forever since the data set is very big. Is there an efficient way to rewrite my loops? It has now been 60+ minutes and it is still computing. Please let me know! Thanks
-----------------edit------------------
Below is the sample output of the first loop.
Output:dict_amt4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 62 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 13 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
And the second loop's sample inputs and output are below.
Input:dict_amt4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 186 174 0 4 46 46 14 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 186 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 130 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
Input:dict_cat4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15 ... 4 17 45 1 2 0 0 0 0 0
B 62 0 27 0 0 12 61 2 4 11 ... 6 9 14 1 0 0 0 0 0 0
C 174 27 0 0 0 13 22 5 2 4 ... 0 2 8 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5 ... 0 0 0 0 0 0 2 0 0 0
F 46 12 13 0 10 0 4 5 0 0 ... 0 33 2 0 0 0 2 3 0 0
.............
Output:dict_div4[0]
DNA Cat2 ... Func
Item A B C D E F F H I J ... ZDF11 ZDF12 ZDF13 ZDF14 ZDF15 ZDF16 ZDF17 ZDF18 ZDF19 ZDF20
DNA Item
Cat2 A 0 3 1 0 1 1 1 2 1 1 ... 1 1 1 1 1 0 0 0 0 0
B 3 0 1 0 0 1 1 1 1 1 ... 1 1 1 1 0 0 0 0 0 0
C 1 1 0 0 0 10 1 1 1 1 ... 0 1 1 0 0 0 0 0 0 0
D 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
E 1 0 0 0 0 1 1 1 1 1 ... 0 0 0 0 0 0 1 0 0 0
F 1 1 1 0 1 0 1 1 0 0 ... 0 1 1 0 0 0 1 1 0 0
.............
I made the sample data by hand, so please disregard typos. As you can see, the first loop just sets dict_cat4[0].iloc[i,i] = 0. The second loop divides every value in dict_amt4[0] by the corresponding value in dict_cat4[0]. Hope it makes more sense.
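The element-wise .iloc assignments are the slow part: each one goes through pandas indexing machinery. A sketch of a vectorised alternative (using small hypothetical frames in place of the 3000 x 3000 ones): zero the diagonals with np.fill_diagonal, then divide the whole arrays at once, keeping cells where the divisor is 0 at 0 instead of inf/NaN:

```python
import numpy as np
import pandas as pd

# small hypothetical stand-ins for the ten 3000 x 3000 frames
rng = np.random.default_rng(0)
dict_amt4 = {i: pd.DataFrame(rng.integers(0, 9, (4, 4))) for i in range(2)}
dict_cat4 = {i: pd.DataFrame(rng.integers(0, 9, (4, 4))) for i in range(2)}

dict_div4 = {}
for k in dict_amt4:
    amt = dict_amt4[k].to_numpy(dtype=float)
    cat = dict_cat4[k].to_numpy(dtype=float)
    # zero each diagonal in one call instead of an element-wise .iloc loop
    np.fill_diagonal(amt, 0.0)
    np.fill_diagonal(cat, 0.0)
    # vectorised division; where cat == 0 the output keeps its initial 0.0,
    # so no replace() pass over NaN/inf is needed afterwards
    out = np.divide(amt, cat, out=np.zeros_like(amt), where=cat != 0)
    dict_div4[k] = pd.DataFrame(out, index=dict_amt4[k].index,
                                columns=dict_amt4[k].columns)
```

This replaces the per-element diagonal loop with one call per frame and folds the division and the NaN cleanup into a single array operation.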

Fill missing rows with zeros from a data frame

Now I have a DataFrame as below:
video_id 0 1 2 3 4 5 6 7 8 9 ... 53 54 55 56
user_id ...
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45 ... 3 352 6 0
2 0 0 0 0 0 0 0 11 0 0 ... 0 0 0 0
3 4 13 0 8 0 0 5 9 12 11 ... 14 17 0 6
4 0 0 4 13 25 4 0 33 0 39 ... 5 7 4 3
6 2 0 0 0 12 0 0 0 2 0 ... 19 4 0 0
7 33 59 52 59 113 53 29 32 59 82 ... 60 119 57 39
9 0 0 0 0 5 0 0 1 0 4 ... 16 0 0 0
10 0 0 0 0 40 0 0 0 0 0 ... 26 0 0 0
11 2 2 32 3 12 3 3 11 19 10 ... 16 3 3 9
12 0 0 0 0 0 0 0 7 0 0 ... 7 0 0 0
We can see that parts of the DataFrame are missing, like user_id 5 and user_id 8. What I want to do is fill these rows with 0, like:
video_id 0 1 2 3 4 5 6 7 8 9 ... 53 54 55 56
user_id ...
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45 ... 3 352 6 0
2 0 0 0 0 0 0 0 11 0 0 ... 0 0 0 0
3 4 13 0 8 0 0 5 9 12 11 ... 14 17 0 6
4 0 0 4 13 25 4 0 33 0 39 ... 5 7 4 3
5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
6 2 0 0 0 12 0 0 0 2 0 ... 19 4 0 0
7 33 59 52 59 113 53 29 32 59 82 ... 60 119 57 39
8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0
9 0 0 0 0 5 0 0 1 0 4 ... 16 0 0 0
10 0 0 0 0 40 0 0 0 0 0 ... 26 0 0 0
11 2 2 32 3 12 3 3 11 19 10 ... 16 3 3 9
12 0 0 0 0 0 0 0 7 0 0 ... 7 0 0 0
Is there any solution to this issue?
You could use np.arange + reindex:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), fill_value=0)
This assumes your index is meant to be a monotonically increasing integer index.
df
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45
2 0 0 0 0 0 0 0 11 0 0
3 4 13 0 8 0 0 5 9 12 11
4 0 0 4 13 25 4 0 33 0 39
6 2 0 0 0 12 0 0 0 2 0
7 33 59 52 59 113 53 29 32 59 82
9 0 0 0 0 5 0 0 1 0 4
10 0 0 0 0 40 0 0 0 0 0
11 2 2 32 3 12 3 3 11 19 10
12 0 0 0 0 0 0 0 7 0 0
df.reindex(np.arange(df.index.min(), df.index.max() + 1), fill_value=0)
0 1 2 3 4 5 6 7 8 9
0 0 0 0 0 0 0 0 0 0 0
1 2 0 4 13 16 2 0 10 6 45
2 0 0 0 0 0 0 0 11 0 0
3 4 13 0 8 0 0 5 9 12 11
4 0 0 4 13 25 4 0 33 0 39
5 0 0 0 0 0 0 0 0 0 0 # <-----
6 2 0 0 0 12 0 0 0 2 0
7 33 59 52 59 113 53 29 32 59 82
8 0 0 0 0 0 0 0 0 0 0 # <-----
9 0 0 0 0 5 0 0 1 0 4
10 0 0 0 0 40 0 0 0 0 0
11 2 2 32 3 12 3 3 11 19 10
12 0 0 0 0 0 0 0 7 0 0

Convert a list of values to a time series in python

I want to convert the following data:
jan_1 jan_15 feb_1 feb_15 mar_1 mar_15 apr_1 apr_15 may_1 may_15 jun_1 jun_15 jul_1 jul_15 aug_1 aug_15 sep_1 sep_15 oct_1 oct_15 nov_1 nov_15 dec_1 dec_15
0 0 0 0 0 1 1 2 2 2 2 2 2 3 3 3 3 3 0 0 0 0 0 0
into an array of length 365, where each value is repeated until the next date, e.g. 0 is repeated from January 1 to January 15...
I could do something like numpy.repeat, but that is not date-aware, so it would not account for the fact that fewer than 15 days pass between feb_15 and mar_1.
Any pythonic solution for this?
You can use resample:
#add the last value - dec 31 - copied from the last column of df
df['dec_31'] = df.iloc[:,-1]
#convert column names to datetime - see http://strftime.org/
df.columns = pd.to_datetime(df.columns, format='%b_%d')
#transpose and resample by day
df1 = df.T.resample('d').ffill()
df1.columns = ['col']
print (df1)
col
1900-01-01 0
1900-01-02 0
1900-01-03 0
1900-01-04 0
1900-01-05 0
1900-01-06 0
1900-01-07 0
1900-01-08 0
1900-01-09 0
1900-01-10 0
1900-01-11 0
1900-01-12 0
1900-01-13 0
1900-01-14 0
1900-01-15 0
1900-01-16 0
1900-01-17 0
1900-01-18 0
1900-01-19 0
1900-01-20 0
1900-01-21 0
1900-01-22 0
1900-01-23 0
1900-01-24 0
1900-01-25 0
1900-01-26 0
1900-01-27 0
1900-01-28 0
1900-01-29 0
1900-01-30 0
..
1900-12-02 0
1900-12-03 0
1900-12-04 0
1900-12-05 0
1900-12-06 0
1900-12-07 0
1900-12-08 0
1900-12-09 0
1900-12-10 0
1900-12-11 0
1900-12-12 0
1900-12-13 0
1900-12-14 0
1900-12-15 0
1900-12-16 0
1900-12-17 0
1900-12-18 0
1900-12-19 0
1900-12-20 0
1900-12-21 0
1900-12-22 0
1900-12-23 0
1900-12-24 0
1900-12-25 0
1900-12-26 0
1900-12-27 0
1900-12-28 0
1900-12-29 0
1900-12-30 0
1900-12-31 0
[365 rows x 1 columns]
#if you need a Series
print (df1.col)
1900-01-01 0
1900-01-02 0
1900-01-03 0
1900-01-04 0
1900-01-05 0
1900-01-06 0
1900-01-07 0
1900-01-08 0
1900-01-09 0
1900-01-10 0
1900-01-11 0
1900-01-12 0
1900-01-13 0
1900-01-14 0
1900-01-15 0
1900-01-16 0
1900-01-17 0
1900-01-18 0
1900-01-19 0
1900-01-20 0
1900-01-21 0
1900-01-22 0
1900-01-23 0
1900-01-24 0
1900-01-25 0
1900-01-26 0
1900-01-27 0
1900-01-28 0
1900-01-29 0
1900-01-30 0
..
1900-12-02 0
1900-12-03 0
1900-12-04 0
1900-12-05 0
1900-12-06 0
1900-12-07 0
1900-12-08 0
1900-12-09 0
1900-12-10 0
1900-12-11 0
1900-12-12 0
1900-12-13 0
1900-12-14 0
1900-12-15 0
1900-12-16 0
1900-12-17 0
1900-12-18 0
1900-12-19 0
1900-12-20 0
1900-12-21 0
1900-12-22 0
1900-12-23 0
1900-12-24 0
1900-12-25 0
1900-12-26 0
1900-12-27 0
1900-12-28 0
1900-12-29 0
1900-12-30 0
1900-12-31 0
Freq: D, Name: col, dtype: int64
#transpose and convert to a NumPy array
print (df1.T.values)
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
IIUC you can do it this way:
In [194]: %paste
# transpose DF, rename columns
x = df.T.reset_index().rename(columns={'index':'date', 0:'val'})
# parse dates
x['date'] = pd.to_datetime(x['date'], format='%b_%d')
# group by month and resample('D') each group
result = (x.groupby(x['date'].dt.month)
           .apply(lambda x: x.set_index('date').resample('1D').ffill()))
# rename index levels
result.index.names = ['month','date']
## -- End pasted text --
In [212]: result
Out[212]:
val
month date
1 1900-01-01 0
1900-01-02 0
1900-01-03 0
1900-01-04 0
1900-01-05 0
1900-01-06 0
1900-01-07 0
1900-01-08 0
1900-01-09 0
1900-01-10 0
1900-01-11 0
1900-01-12 0
1900-01-13 0
1900-01-14 0
1900-01-15 0
2 1900-02-01 0
1900-02-02 0
1900-02-03 0
1900-02-04 0
1900-02-05 0
1900-02-06 0
1900-02-07 0
1900-02-08 0
1900-02-09 0
1900-02-10 0
1900-02-11 0
1900-02-12 0
1900-02-13 0
1900-02-14 0
1900-02-15 0
... ...
11 1900-11-01 0
1900-11-02 0
1900-11-03 0
1900-11-04 0
1900-11-05 0
1900-11-06 0
1900-11-07 0
1900-11-08 0
1900-11-09 0
1900-11-10 0
1900-11-11 0
1900-11-12 0
1900-11-13 0
1900-11-14 0
1900-11-15 0
12 1900-12-01 0
1900-12-02 0
1900-12-03 0
1900-12-04 0
1900-12-05 0
1900-12-06 0
1900-12-07 0
1900-12-08 0
1900-12-09 0
1900-12-10 0
1900-12-11 0
1900-12-12 0
1900-12-13 0
1900-12-14 0
1900-12-15 0
[180 rows x 1 columns]
or using reset_index():
In [213]: result.reset_index().head(20)
Out[213]:
month date val
0 1 1900-01-01 0
1 1 1900-01-02 0
2 1 1900-01-03 0
3 1 1900-01-04 0
4 1 1900-01-05 0
5 1 1900-01-06 0
6 1 1900-01-07 0
7 1 1900-01-08 0
8 1 1900-01-09 0
9 1 1900-01-10 0
10 1 1900-01-11 0
11 1 1900-01-12 0
12 1 1900-01-13 0
13 1 1900-01-14 0
14 1 1900-01-15 0
15 2 1900-02-01 0
16 2 1900-02-02 0
17 2 1900-02-03 0
18 2 1900-02-04 0
19 2 1900-02-05 0
