Pandas Flatten a dataframe to a single column - python

I have dataset in the following format:
df = pd.DataFrame({'x':[1,2,3], 'y':[10,20,30], 'v1':[3,2,3] , 'v2':[13,25,31] })
>> v1 v2 x y
3 13 1 10
2 25 2 20
3 31 3 30
Setting the index column with x, I want to flatten the data combining v1 and v2 (V), The expected output is like:
>> x y V
1 10 3
1 10 13
2 20 2
2 20 25
3 30 3
3 30 31
And again bringing to the original format of df. I tried reshaping using stack and unstack, but I couldn't get it the way, which I was expecting.
Many Thanks!

pd.lreshape can reformat wide data to long format:
In [55]: pd.lreshape(df, {'V':['v1', 'v2']})
Out[57]:
x y V
0 1 10 3
1 2 20 2
2 3 30 3
3 1 10 13
4 2 20 25
5 3 30 31
lreshape is an undocumented "experimental" feature. To learn more about lreshape see help(pd.lreshape).
If you need reversible operations, use jezrael's pd.melt solution to go from wide to long format, and use pivot_table to go from long to wide format:
In [72]: melted = pd.melt(df, id_vars=['x', 'y'], value_name='V'); melted
Out[72]:
x y variable V
0 1 10 v1 3
1 2 20 v1 2
2 3 30 v1 3
3 1 10 v2 13
4 2 20 v2 25
5 3 30 v2 31
In [74]: df2 = melted.pivot_table(index=['x','y'], columns=['variable'], values='V').reset_index(); df2
Out[74]:
variable x y v1 v2
0 1 10 3 13
1 2 20 2 25
2 3 30 3 31
Notice that you must hang on to the variable column if you wish to return to df2. Also keep in mind that it is more efficient to simply retain a reference to df than to recompute it using melted and pivot_table.

You can use stack with set_index. Last drop column level_2:
print (df.set_index(['x','y']).stack().reset_index(name='V').drop('level_2', axis=1))
x y V
0 1 10 3
1 1 10 13
2 2 20 2
3 2 20 25
4 3 30 3
5 3 30 31
Another solution with melt and sort_values:
print (pd.melt(df, id_vars=['x','y'], value_name='V')
.drop('variable', axis=1)
.sort_values('x'))
x y V
0 1 10 3
3 1 10 13
1 2 20 2
4 2 20 25
2 3 30 3
5 3 30 31

Related

How to group dataframe by column and receive new column for every group

I have the following dataframe:
df = pd.DataFrame({'timestamp' : [10,10,10,20,20,20], 'idx': [1,2,3,1,2,3], 'v1' : [1,2,4,5,1,9], 'v2' : [1,2,8,5,1,2]})
timestamp idx v1 v2
0 10 1 1 1
1 10 2 2 2
2 10 3 4 8
3 20 1 5 5
4 20 2 1 1
5 20 3 9 2
I'd like to group data by timestamp and calculate the following cumulative statistic:
np.sum(v1*v2) for every timestamp. I'd like to see the following result:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
I'm trying to do the following:
def calc_some_stat(d):
return np.sum(d.v1 * d.v2)
df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But for stat columns I receive all NaN values - what is wrong in my code?
We want groupby transform here not groupby apply:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function we need to join back to scale up the aggregated DataFrame:
def calc_some_stat(d):
return np.sum(d.v1 * d.v2)
df = df.join(
df.groupby('timestamp').apply(calc_some_stat)
.rename('stat'), # Needed to use join but also sets the col name
on='timestamp'
)
df:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
The issue is that groupby apply is producing summary information:
timestamp
10 37
20 44
dtype: int64
This does not assign back to the DataFrame naturally as there are only 2 rows when the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this using groupby transform which is designed to produce a:
like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values

Merge two Series according to their index

After a long time of googling and not finding a solution to my, probably often asked, problem.
I have two Dataframes:
DF1: DF2:
val val
index index
1 3 2 5
3 10 4 15
5 20 7 35
6 30 8 40
and need an output like this:
DF_out:
val
index
1 3
2 5
3 10
4 15
5 20
6 30
7 35
8 40
DF1 and DF2 should be combined and sorted according to ther indices.
Side notes:
DF1 and DF2 never have the same index twice
The values of the dataframes are always sequel
I would very much appreciate your help!
Use concat with Series.sort_index:
df = pd.concat([DF1, DF2]).sort_index()
print (df)
val
index
1 3
2 5
3 10
4 15
5 20
6 30
7 35
8 40

Concatenating two Pandas DataFrames while maintaining index order

Basic question - I am trying to concatenate two DataFrames, with the resulting DataFrame preserving the index in order of the original two. For example:
df = pd.DataFrame({'Houses':[10,20,30,40,50], 'Cities':[3,4,7,6,1]}, index = [1,2,4,6,8])
df2 = pd.DataFrame({'Houses':[15,25,35,45,55], 'Cities':[1,8,11,14,4]}, index = [0,3,5,7,9])
Using pd.concat([df, df2]) simply appends df2 to the end of df1. I am trying to instead concatenate them to produce correct index order (0 through 9).
Use concat with parameter sort for avoid warning and then DataFrame.sort_index:
df = pd.concat([df, df2], sort=False).sort_index()
print(df)
Cities Houses
0 1 15
1 3 10
2 4 20
3 8 25
4 7 30
5 11 35
6 6 40
7 14 45
8 1 50
9 4 55
Try using:
print(df.T.join(df2.T).T.sort_index())
Output:
Cities Houses
0 1 15
1 3 10
2 4 20
3 8 25
4 7 30
5 11 35
6 6 40
7 14 45
8 1 50
9 4 55

Dynamically reshape the dataframe in pandas

I am having a dataframe which has 4 columns and 4 rows. I need to reshape it into 2 columns and 4 rows. The 2 new columns are result of addition of values of col1 + col3 and col2 +col4. I do not wish to create any other memory object for it.
I am trying
df['A','B'] = df['A']+df['C'],df['B']+df['D']
Can it be achieved by using drop function only? Is there any other simpler method for this?
The dynamic way of summing two columns at a time is to use groupby:
df.groupby(np.arange(len(df.columns)) % 2, axis=1).sum()
Out[11]:
0 1
0 2 4
1 10 12
2 18 20
3 26 28
You can use rename afterwards if you want to change column names but that would require a logic.
Consider the sample dataframe df
df = pd.DataFrame(np.arange(16).reshape(4, 4), columns=list('ABCD'))
df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
One line of code
pd.DataFrame(
df.values.reshape(4, 2, 2).transpose(0, 2, 1).sum(2),
columns=df.columns[:2]
)
A B
0 2 4
1 10 12
2 18 20
3 26 28
Another line of code
df.iloc[:, :2] + df.iloc[:, 2:4].values
A B
0 2 4
1 10 12
2 18 20
3 26 28
Yet another
df.assign(A=df.A + df.C, B=df.B + df.D).drop(['C', 'D'], 1)
A B
0 2 4
1 10 12
2 18 20
3 26 28
This works for me:
df['A'], df['B'] = df['A'] + df['C'], df['B'] + df['D']
df.drop(['C','D'], axis=1)

How to remove ugly row in pandas.dataframe

so I am filling dataframes from 2 different files. While those 2 files should have the same structure (the values should be different thought) the resulting dataframes look different. So when printing those I get:
a b c d
0 70402.14 70370.602112 0.533332 98
1 31362.21 31085.682726 1.912552 301
... ... ... ... ...
753919 64527.16 64510.008206 0.255541 71
753920 58077.61 58030.943621 0.835758 152
a b c d
index
0 118535.32 118480.657338 0.280282 47
1 49536.10 49372.999416 0.429902 86
... ... ... ... ...
753970 52112.95 52104.717927 0.356051 116
753971 37044.40 36915.264944 0.597472 165
So in the second dataframe there is that "index" row that doesnt make any sense for me and it causes troubles in my following code. I did neither write the code to fill the files into the dataframes nor I did create those files. So I am rather interested in checking if such a row exists and how I might be able to remove it. Does anyone have an idea about this?
The second dataframe has an index level named "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [130]: df.index.name = None
In [131]: df
Out[131]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
index 0 1 2
0 0 0 1 2
1 1 3 4 5
2 2 6 7 8
3 3 9 10 11
4 4 12 13 14
In [140]: df.reset_index().set_index('index')
Out[140]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
Index is just the first column - it's numbering the rows by default, but you can change it a number of ways (e.g. filling it with values from one of the columns)

Categories

Resources