Transpose Pandas dataframe preserving the index - python

I have a problem while transposing a Pandas DataFrame that has the following structure:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
foo 0 4 0 0 0 0 0 0 0 0 14 1 0 1 0 0 0
bar 0 6 0 0 4 0 5 0 0 0 0 0 0 0 1 0 0
lorem 1 3 0 0 0 1 0 0 2 0 3 0 1 2 1 1 0
ipsum 1 2 0 1 0 0 1 0 0 0 0 0 4 0 6 0 0
dolor 1 2 4 0 1 0 0 0 0 0 2 0 0 1 0 0 2
..
With index:
foo,bar,lorem,ipsum,dolor,...
And this is basically a terms-documents matrix, where rows are terms and the headers (0-16) are document indexes.
Since my purpose is clustering documents and not terms, I want to transpose the dataframe and use this to perform a cosine-distance computation between documents themselves.
But when I transpose with:
pd.transpose()
I get:
foo bar ... pippo lorem
0 0 0 ... 0 0
1 4 6 ... 0 0
2 0 0 ... 0 0
3 0 0 ... 0 0
4 0 4 ... 0 0
..
16 0 2 ... 0 1
With index:
0 , 1 , 2 , 3 , ... , 15, 16
What I would like?
I'm looking for a way to make this operation preserving the dataframe index. Basically the first row of my new df should be the index.
Thank you

We can use a series of unstack
df2 = df.unstack().to_frame().unstack(1).droplevel(0,axis=1)
print(df2)
foo bar lorem ipsum dolor
0 0 0 1 1 1
1 4 6 3 2 2
2 0 0 0 0 4
3 0 0 0 1 0
4 0 4 0 0 1
5 0 0 1 0 0
6 0 5 0 1 0
7 0 0 0 0 0
8 0 0 2 0 0
9 0 0 0 0 0
10 14 0 3 0 2
11 1 0 0 0 0
12 0 0 1 4 0
13 1 0 2 0 1
14 0 1 1 6 0
15 0 0 1 0 0
16 0 0 0 0 2

Assuming data is square matrix (n x n) and if I understand the question correctly
df = pd.DataFrame([[0, 4,0], [0,6,0], [1,3,0]],
index =['foo', 'bar', 'lorem'],
columns=[0, 1, 2]
)
df_T = pd.DataFrame(df.values.T, index=df.index, columns=df.columns)

Related

Trying to merge dictionaries together to create new df but dictionaries values arent showing up in df

image of jupter notebook issue
For my quarters instead of values for examples 1,0,0,0 showing up I get NaN.
How do I fix the code below so I return values in my dataframe
qrt_1 = {'q1':[1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0]}
qrt_2 = {'q2':[0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0]}
qrt_3 = {'q3':[0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0]}
qrt_4 = {'q4':[0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]}
year = {'year': [1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,6,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9]}
value = data_1['Sales']
data = [year, qrt_1, qrt_2, qrt_3, qrt_4]
dataframes = []
for x in data:
dataframes.append(pd.DataFrame(x))
df = pd.concat(dataframes)
I am expecting a dataframe that returns the qrt_1, qrt_2 etc with their corresponding column names
Try to use axis=1 in pd.concat:
df = pd.concat(dataframes, axis=1)
print(df)
Prints:
year q1 q2 q3 q4
0 1 1 0 0 0
1 1 0 1 0 0
2 1 0 0 1 0
3 1 0 0 0 1
4 2 1 0 0 0
5 2 0 1 0 0
6 2 0 0 1 0
7 2 0 0 0 1
8 3 1 0 0 0
9 3 0 1 0 0
10 3 0 0 1 0
11 3 0 0 0 1
12 4 1 0 0 0
13 4 0 1 0 0
14 4 0 0 1 0
15 4 0 0 0 1
16 5 1 0 0 0
17 5 0 1 0 0
18 5 0 0 1 0
19 5 0 0 0 1
20 6 1 0 0 0
21 6 0 1 0 0
22 6 0 0 1 0
23 6 0 0 0 1
24 7 1 0 0 0
25 7 0 1 0 0
26 7 0 0 1 0
27 7 0 0 0 1
28 8 1 0 0 0
29 8 0 1 0 0
30 8 0 0 1 0
31 8 0 0 0 1
32 9 1 0 0 0
33 9 0 1 0 0
34 9 0 0 1 0
35 9 0 0 0 1

is there any way to convert the columns in Pandas Dataframe using its mirror image Dataframe structure

the df I have is :
0 1 2
0 0 0 0
1 0 0 1
2 0 1 0
3 0 1 1
4 1 0 0
5 1 0 1
6 1 1 0
7 1 1 1
I wanted to obtain a Dataframe with columns reversed/mirror image :
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Is there any way to do that
You can check
df[:] = df.iloc[:,::-1]
df
Out[959]:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
Here is a bit more verbose, but likely more efficient solution as it doesn't require to rewrite the data. It only renames and reorders the columns:
cols = df.columns
df.columns = df.columns[::-1]
df = df.loc[:,cols]
Or shorter variant:
df = df.iloc[:,::-1].set_axis(df.columns, axis=1)
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1
There are other ways, but here's one solution:
df[df.columns] = df[reversed(df.columns)]
Output:
0 1 2
0 0 0 0
1 1 0 0
2 0 1 0
3 1 1 0
4 0 0 1
5 1 0 1
6 0 1 1
7 1 1 1

Creating week flags from DOW

I have a dataframe:
DOW
0 0
1 1
2 2
3 3
4 4
5 5
6 6
This corresponds to the dayof the week. Now I want to create this dataframe-
DOW MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0
2 2 0 1 0 0 0 0
3 3 0 0 1 0 0 0
4 4 0 0 0 1 0 0
5 5 0 0 0 0 1 0
6 6 0 0 0 0 0 1
7 0 0 0 0 0 0 0
8 1 1 0 0 0 0 0
Depending on the DOW column for example its 1 then MON_FLAG will be 1 if its 2 then TUES_FLAG will be 1 and so on. I have kept Sunday as 0 that's why all the flag columns are zero in that case.
Use get_dummies with rename columns by dictionary:
d = {0:'SUN_FLAG',1:'MON_FLAG',2:'TUE_FLAG',
3:'WED_FLAG',4:'THUR_FLAG',5: 'FRI_FLAG',6:'SAT_FLAG'}
df = df.join(pd.get_dummies(df['DOW']).rename(columns=d))
print (df)
DOW SUN_FLAG MON_FLAG TUE_FLAG WED_FLAG THUR_FLAG FRI_FLAG SAT_FLAG
0 0 1 0 0 0 0 0 0
1 1 0 1 0 0 0 0 0
2 2 0 0 1 0 0 0 0
3 3 0 0 0 1 0 0 0
4 4 0 0 0 0 1 0 0
5 5 0 0 0 0 0 1 0
6 6 0 0 0 0 0 0 1
7 0 1 0 0 0 0 0 0
8 1 0 1 0 0 0 0 0

How to concatenate bit columns in Python Pandas?

Seems like an easy question but I'm running into an odd error. I have a large dataframe with 24+ columns that all contain 1s or 0s. I wish to concatenate each field to create a binary key that'll act as a signature.
However, when the number of columns exceeds 12, the whole process falls apart.
a = np.zeros(shape=(3,12))
df = pd.DataFrame(a)
df = df.astype(int) # This converts each 0.0 into just 0
df[2]=1 # Changes one column to all 1s
#result
0 1 2 3 4 5 6 7 8 9 10 11
0 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0 0
Concatenating function...
df['new'] = df.astype(str).sum(1).astype(int).astype(str) # Concatenate
df['new'].apply('{0:0>12}'.format) # Pad leading zeros
# result
0 1 2 3 4 5 6 7 8 9 10 11 new
0 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
1 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
2 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
This is good. However, if I increase the number of columns to 13, I get...
a = np.zeros(shape=(3,13))
# ...same intermediate steps as above...
0 1 2 3 4 5 6 7 8 9 10 11 12 new
0 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
1 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
2 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
Why am I getting -2147483648? I was expecting 0010000000000
Any help is appreciated!

Convert a list of values to a time series in python

I want to convert the foll. data:
jan_1 jan_15 feb_1 feb_15 mar_1 mar_15 apr_1 apr_15 may_1 may_15 jun_1 jun_15 jul_1 jul_15 aug_1 aug_15 sep_1 sep_15 oct_1 oct_15 nov_1 nov_15 dec_1 dec_15
0 0 0 0 0 1 1 2 2 2 2 2 2 3 3 3 3 3 0 0 0 0 0 0
into a array of length 365, where each element is repeated till the next date days e.g. 0 is repeated from january 1 to january 15...
I could do something like numpy.repeat, but that is not date aware, so would not take into account that less than 15 days happen between feb_15 and mar_1.
Any pythonic solution for this?
You can use resample:
#add last value - 31 dec by value of last column of df
df['dec_31'] = df.iloc[:,-1]
#convert to datetime - see http://strftime.org/
df.columns = pd.to_datetime(df.columns, format='%b_%d')
#transpose and resample by days
df1 = df.T.resample('d').ffill()
df1.columns = ['col']
print (df1)
col
1900-01-01 0
1900-01-02 0
1900-01-03 0
1900-01-04 0
1900-01-05 0
1900-01-06 0
1900-01-07 0
1900-01-08 0
1900-01-09 0
1900-01-10 0
1900-01-11 0
1900-01-12 0
1900-01-13 0
1900-01-14 0
1900-01-15 0
1900-01-16 0
1900-01-17 0
1900-01-18 0
1900-01-19 0
1900-01-20 0
1900-01-21 0
1900-01-22 0
1900-01-23 0
1900-01-24 0
1900-01-25 0
1900-01-26 0
1900-01-27 0
1900-01-28 0
1900-01-29 0
1900-01-30 0
..
1900-12-02 0
1900-12-03 0
1900-12-04 0
1900-12-05 0
1900-12-06 0
1900-12-07 0
1900-12-08 0
1900-12-09 0
1900-12-10 0
1900-12-11 0
1900-12-12 0
1900-12-13 0
1900-12-14 0
1900-12-15 0
1900-12-16 0
1900-12-17 0
1900-12-18 0
1900-12-19 0
1900-12-20 0
1900-12-21 0
1900-12-22 0
1900-12-23 0
1900-12-24 0
1900-12-25 0
1900-12-26 0
1900-12-27 0
1900-12-28 0
1900-12-29 0
1900-12-30 0
1900-12-31 0
[365 rows x 1 columns]
#if need serie
print (df1.col)
1900-01-01 0
1900-01-02 0
1900-01-03 0
1900-01-04 0
1900-01-05 0
1900-01-06 0
1900-01-07 0
1900-01-08 0
1900-01-09 0
1900-01-10 0
1900-01-11 0
1900-01-12 0
1900-01-13 0
1900-01-14 0
1900-01-15 0
1900-01-16 0
1900-01-17 0
1900-01-18 0
1900-01-19 0
1900-01-20 0
1900-01-21 0
1900-01-22 0
1900-01-23 0
1900-01-24 0
1900-01-25 0
1900-01-26 0
1900-01-27 0
1900-01-28 0
1900-01-29 0
1900-01-30 0
..
1900-12-02 0
1900-12-03 0
1900-12-04 0
1900-12-05 0
1900-12-06 0
1900-12-07 0
1900-12-08 0
1900-12-09 0
1900-12-10 0
1900-12-11 0
1900-12-12 0
1900-12-13 0
1900-12-14 0
1900-12-15 0
1900-12-16 0
1900-12-17 0
1900-12-18 0
1900-12-19 0
1900-12-20 0
1900-12-21 0
1900-12-22 0
1900-12-23 0
1900-12-24 0
1900-12-25 0
1900-12-26 0
1900-12-27 0
1900-12-28 0
1900-12-29 0
1900-12-30 0
1900-12-31 0
Freq: D, Name: col, dtype: int64
#transpose and convert to numpy array
print (df1.T.values)
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
IIUC you can do it this way:
In [194]: %paste
# transpose DF, rename columns
x = df.T.reset_index().rename(columns={'index':'date', 0:'val'})
# parse dates
x['date'] = pd.to_datetime(x['date'], format='%b_%d')
# group resampled DF by the month and resample(`D`) each group
result = (x.groupby(x['date'].dt.month)
.apply(lambda x: x.set_index('date').resample('1D').ffill()))
# rename index names
result.index.names = ['month','date']
## -- End pasted text --
In [212]: result
Out[212]:
val
month date
1 1900-01-01 0
1900-01-02 0
1900-01-03 0
1900-01-04 0
1900-01-05 0
1900-01-06 0
1900-01-07 0
1900-01-08 0
1900-01-09 0
1900-01-10 0
1900-01-11 0
1900-01-12 0
1900-01-13 0
1900-01-14 0
1900-01-15 0
2 1900-02-01 0
1900-02-02 0
1900-02-03 0
1900-02-04 0
1900-02-05 0
1900-02-06 0
1900-02-07 0
1900-02-08 0
1900-02-09 0
1900-02-10 0
1900-02-11 0
1900-02-12 0
1900-02-13 0
1900-02-14 0
1900-02-15 0
... ...
11 1900-11-01 0
1900-11-02 0
1900-11-03 0
1900-11-04 0
1900-11-05 0
1900-11-06 0
1900-11-07 0
1900-11-08 0
1900-11-09 0
1900-11-10 0
1900-11-11 0
1900-11-12 0
1900-11-13 0
1900-11-14 0
1900-11-15 0
12 1900-12-01 0
1900-12-02 0
1900-12-03 0
1900-12-04 0
1900-12-05 0
1900-12-06 0
1900-12-07 0
1900-12-08 0
1900-12-09 0
1900-12-10 0
1900-12-11 0
1900-12-12 0
1900-12-13 0
1900-12-14 0
1900-12-15 0
[180 rows x 1 columns]
or using reset_index():
In [213]: result.reset_index().head(20)
Out[213]:
month date val
0 1 1900-01-01 0
1 1 1900-01-02 0
2 1 1900-01-03 0
3 1 1900-01-04 0
4 1 1900-01-05 0
5 1 1900-01-06 0
6 1 1900-01-07 0
7 1 1900-01-08 0
8 1 1900-01-09 0
9 1 1900-01-10 0
10 1 1900-01-11 0
11 1 1900-01-12 0
12 1 1900-01-13 0
13 1 1900-01-14 0
14 1 1900-01-15 0
15 2 1900-02-01 0
16 2 1900-02-02 0
17 2 1900-02-03 0
18 2 1900-02-04 0
19 2 1900-02-05 0

Categories

Resources