How to melt multiple columns into one column? - python

I have this table:
a b c d e f 19-08-06 19-08-07 19-08-08 g h i
1 2 3 4 5 6 7 8 9 10 11 12
I have 34 columns of the date, so I want to melt the date columns to be into one column only.
How can I do this in pyhton?
Thanks in advance

You can use pd.Series.fullmatch to create a boolean mask for extracting date columns, then use df.melt
m = df.columns.str.fullmatch("\d{2}-\d{2}-\d{2}")
cols = df.columns[m]
df.melt(value_vars=cols, var_name='date', value_name='vals')
date vals
0 19-08-06 7
1 19-08-07 8
2 19-08-08 9
If you want to melt while keeping other columns then try this.
df.melt(
id_vars=df.columns.difference(cols), var_name="date", value_name="vals"
)
a b c d e f g h i date vals
0 1 2 3 4 5 6 10 11 12 19-08-06 7
1 1 2 3 4 5 6 10 11 12 19-08-07 8
2 1 2 3 4 5 6 10 11 12 19-08-08 9
Here I did not use value_vars=cols as it's done implicitly
value_vars: tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are
not set as id_vars.

Related

How to map nested dictionaries to dataframe columns in python?

I have nested dictionaries like this:
X = {A:{col1:12,col-2:13},B:{col1:12,col-2:13},C:{col1:12,col-2:13},D:{col1:12,col-2:13}}
Y = {A:{col1:3,col-2:5},B:{col1:1,col-2:2},C:{col1:4,col-2:7},D:{col1:8,col-2:7}}
Z = {A:{col1:6,col-2:7},B:{col1:4,col-2:7},C:{col1:5,col-2:7},D:{col1:4,col-2:9}}
I also have a data frame with a single column like this:
Df: data frame([A,B,C,D],columns = ['Names'])
For every row in the data frame, I want to map the given dictionaries in this format:
Names Col-1_X Col-1_Y Col-1_Z Col-2_X Col-2_Y Col-2_Z
A 12 3 6 13 5 7
B 12 1 4 13 2 7
C 12 4 5 13 7 7
D 12 8 4 13 7 9
Can anyone help me get the data in this format?
Transpose and concat them:
dfX = pd.DataFrame(X).T.add_suffix('_X')
dfY = pd.DataFrame(Y).T.add_suffix('_Y')
dfZ = pd.DataFrame(Z).T.add_suffix('_Z')
output = pd.concat([dfX, dfY,dfZ], axis=1))
output :
col1_X col-2_X col1_Y col-2_Y col1_Z col-2_Z
A 12 13 3 5 6 7
B 12 13 1 2 4 7
C 12 13 4 7 5 7
D 12 13 8 7 4 9

Sort part of DataFrame in Python Panda, return new column with order depending on row values

My first question here, I hope this is understandable.
I have a Panda DataFrame:
order_numbers
x_closest_autobahn
0
34
3
1
11
3
2
5
3
3
8
12
4
2
12
I would like to get a new column with the order_number per closest_autobahn in ascending order:
order_numbers
x_closest_autobahn
order_number_autobahn_x
2
5
3
1
1
11
3
2
0
34
3
3
4
2
12
1
3
8
12
2
I have tried:
df['order_number_autobahn_x'] = ([df.loc[(df['x_closest_autobahn'] == 3)]].sort_values(by=['order_numbers'], ascending=True, inplace=True))
I have looked at slicing, sort_values and reset_index
df.sort_values(by=['order_numbers'], ascending=True, inplace=True)
df = df.reset_index() # reset index to the order after sort
df['order_numbers_index'] = df.index
but I can't seem to get the DataFrame I am looking for.
Use DataFrame.sort_values by both columns and for counter use GroupBy.cumcount:
df = df.sort_values(['x_closest_autobahn','order_numbers'])
df['order_number_autobahn_x'] = df.groupby('x_closest_autobahn').cumcount().add(1)
print (df)
order_numbers x_closest_autobahn order_number_autobahn_x
2 5 3 1
1 11 3 2
0 34 3 3
4 2 12 1
3 8 12 2

Finding difference between two columns of a dataframe along with groupby

I saw a primitive version of this question here
but i my dataframe has diffrent names and i want to calculate separately for them
A B C
0 a 3 5
1 a 6 9
2 b 3 8
3 b 11 19
i want to groupby A and then find diffence between alternate B and C.something like this
A B C dA
0 a 3 5 6
1 a 6 9 NaN
2 b 3 8 16
3 b 11 19 NaN
i tried doing
df['dA']=df.groupby('A')(['C']-['B'])
df['dA']=df.groupby('A')['C']-df.groupby('A')['B']
none of them helped
what mistake am i making?
IIUC, here is one way to perform the calculation:
# create the data frame
from io import StringIO
import pandas as pd
data = '''idx A B C
0 a 3 5
1 a 6 9
2 b 3 8
3 b 11 19
'''
df = pd.read_csv(StringIO(data), sep='\s+', engine='python').set_index('idx')
Now, compute dA. I look last value of C less first value of B, as grouped by A. (Is this right? Or is it max(C) less min(B)?). If you're guaranteed to have the A values in pairs, then #BenT's shift() would be more concise.
dA = (
(df.groupby('A')['C'].transform('last') -
df.groupby('A')['B'].transform('first'))
.drop_duplicates()
.rename('dA'))
print(pd.concat([df, dA], axis=1))
A B C dA
idx
0 a 3 5 6.0
1 a 6 9 NaN
2 b 3 8 16.0
3 b 11 19 NaN
I used groupby().transform() to preserve index values, to support the concat operation.

Calculate Quantiles on Groupby Dataframe and add value back to DF

What I'm looking to do is group my Dataframe on a Categorical column, compute quantiles using second column, and store the result in a 3rd column. For simplicity lets just do the P50. Example below:
Original DF:
Col1 Col2
A 2
B 4
C 2
A 6
B 12
C 10
Desired DF:
Col1 Col2 Col3_P50
A 2 4
B 4 8
C 2 6
A 6 4
B 12 8
C 10 6
One easy way would be to create a small dataframe of each Category (A,B,C) and compute quantile and merge back to existing DF, but my actual dataset has 100s of category so this isn't an option. Any suggestions would be much appreciated!
You can do transform with quantile
df['Col3_P50'] = df.groupby("Col1")['Col2'].transform('quantile',0.5)
print(df)
Col1 Col2 Col3_P50
0 A 2 4
1 B 4 8
2 C 2 6
3 A 6 4
4 B 12 8
5 C 10 6
If you have multiple values, one way is creating a dictionary and set the keys as column names and values inside the groupby:
d = {'P_50':0.5,'P_90':0.9}
for k,v in d.items():
df[k]=df.groupby("Col1")['Col2'].transform('quantile',v)
print(df)
Col1 Col2 P_50 P_90
0 A 2 4 5.6
1 B 4 8 11.2
2 C 2 6 9.2
3 A 6 4 5.6
4 B 12 8 11.2
5 C 10 6 9.2

How to remove ugly row in pandas.dataframe

so I am filling dataframes from 2 different files. While those 2 files should have the same structure (the values should be different thought) the resulting dataframes look different. So when printing those I get:
a b c d
0 70402.14 70370.602112 0.533332 98
1 31362.21 31085.682726 1.912552 301
... ... ... ... ...
753919 64527.16 64510.008206 0.255541 71
753920 58077.61 58030.943621 0.835758 152
a b c d
index
0 118535.32 118480.657338 0.280282 47
1 49536.10 49372.999416 0.429902 86
... ... ... ... ...
753970 52112.95 52104.717927 0.356051 116
753971 37044.40 36915.264944 0.597472 165
So in the second dataframe there is that "index" row that doesnt make any sense for me and it causes troubles in my following code. I did neither write the code to fill the files into the dataframes nor I did create those files. So I am rather interested in checking if such a row exists and how I might be able to remove it. Does anyone have an idea about this?
The second dataframe has an index level named "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [130]: df.index.name = None
In [131]: df
Out[131]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
index 0 1 2
0 0 0 1 2
1 1 3 4 5
2 2 6 7 8
3 3 9 10 11
4 4 12 13 14
In [140]: df.reset_index().set_index('index')
Out[140]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
Index is just the first column - it's numbering the rows by default, but you can change it a number of ways (e.g. filling it with values from one of the columns)

Categories

Resources