Combing dataframe values to a new dataframe - python

I am combining two dataframe values from an Excel file to a new dataframe but the combined values changed to decimal number. Here are my codes:
My dataframe that I wish to combine:
cable_block pair
1 10
1 11
3 123
3 222
I insert a dataframe to have those two combined with a delimiter of /, so here is my code:
df['new_col'] = df[['cable_block', 'pair']].apply(lambda x: '/'.join(x.astype(str), axis=1))
The result I get is:
cable_block pair new_col
1 10 1.0/10.0
1 11 1.0/11.0
3 123 3.0/123.0
3 222 3.0/222.0
After searching, I found good answer by
here Psidom and Skirrebattie. So I tried:
df['new_col'] = df['new_col'].applymap(str)
and
df['new_col'] = df['new_col'].astype(str)
But it doesn't work the way it should. Looking by the codes, it should work and I find it weird that it doesn't.
Is there another work around?

First, to remove the trailing .0 ensure that data is int:
df = df.astype(int)
Then you can do:
df['cable_block'].astype(str) + '/' + df['pair'].astype(str)
0 1/10
1 1/11
2 3/123
3 3/222
dtype: object
Another option to ensure a correct formatting could be:
df.apply(lambda x: "%d/%d" %(x['cable_block'], x['pair']), axis=1)
0 1/10
1 1/11
2 3/123
3 3/222
dtype: object

Why not using astype
df.astype(str).apply('/'.join,1)
Out[604]:
0 1/10
1 1/11
2 3/123
3 3/222
dtype: object

df['cable_block'].astype(int).astype(str) + '/' + df['pair'].astype(int).astype(str)
The data in your dataframe is probably floats, not ints.

You can use a list comprehension and f-strings:
df['new_col'] = [f'{cable_block}/{pair}' for cable_block, pair in df.values]
print(df)
cable_block pair new_col
0 1 10 1/10
1 1 11 1/11
2 3 123 3/123
3 3 222 3/222
The approach compares reasonably well versus the alternatives:
df = pd.concat([df]*10000, ignore_index=True)
%timeit df['cable_block'].astype(str) + '/' + df['pair'].astype(str) # 62.8 ms
%timeit [f'{cable_block}/{pair}' for cable_block, pair in df.values] # 85.1 ms
%timeit list(map('/'.join, map(list, df.values.astype(str)))) # 157 ms
%timeit df.astype(str).apply('/'.join,1) # 1.11 s

Related

Find where word is present in string with where statement [duplicate]

I having replace issue while I try to replace a string with value from another column.
I want to replace 'Length' with df['Length'].
df["Length"]= df["Length"].replace('Length', df['Length'], regex = True)
Below is my data
Input:
**Formula** **Length**
Length 5
Length+1.5 6
Length-2.5 5
Length 4
5 5
Expected Output:
**Formula** **Length**
5 5
6+1.5 6
5-2.5 5
4 4
5 5
However, with the code I used above, it will replace my entire cell instead of Length only.
I getting below output:
I found it was due to df['column'] is used, if I used any other string the behind offset (-1.5) will not get replaced.
**Formula** **Length**
5 5
6 6
5 5
4 4
5 5
May I know is there any replace method for values from other columns?
Thank you.
If want replace by another column is necessary use DataFrame.apply:
df["Formula"]= df.apply(lambda x: x['Formula'].replace('Length', str(x['Length'])), axis=1)
print (df)
Formula Length
0 5 5
1 6+1.5 6
2 5-2.5 5
3 4 4
4 5 5
Or list comprehension:
df["Formula"]= [x.replace('Length', str(y)) for x, y in df[['Formula','Length']].to_numpy()]
Just wanted to add, that list comprehension is much faster of course:
df = pd.DataFrame({'a': ['aba'] * 1000000, 'c': ['c'] * 1000000})
%timeit df.apply(lambda x: x['a'].replace('b', x['c']), axis=1)
# 1 loop, best of 5: 11.8 s per loop
%timeit [x.replace('b', str(y)) for x, y in df[['a', 'c']].to_numpy()]
# 1 loop, best of 5: 1.3 s per loop

Get all row values from single column

This code is working and retrieving the exact data i need. I just need to output all the row values into a single value;
HERE IS THE bit of CODE
dfs = pd.read_html(str(tables))
df = dfs[0].iloc[[1,3,5,7,9,11,13,15,17,19],[1]]
s = df['score'].str.split('-',expand=True).astype(int)
df['team_win'] = np.where(s[0] == s[1], 0,s.idxmax(1) + 1)
df = df['team_win'].drop([9, 11])
HERE IS THE OUTPUT
Name: team_win, dtype: int64
1 1
3 0
5 2
7 0
13 1
15 2
17 2
19 2
I need the team_win to be outputed like this... 10201222 so i can copy into google sheet
Ranked from best performance to worst performance, you could:
1)
''.join(map(str, df['team_win']))
Out[319]: '10201222'
2)
''.join(df['team_win'].map(str))
Out[320]: '10201222'
3)
''.join([str(i) for i in df['team_win']])
Out[313]: '10201222'
Thanks to #piRSquared for the additional suggestions

Finding the common columns when comparing two rows in a dataframe in python

I have a dataframe of the below structure. I want to get the column numbers which has the same value (for a specific value) when i compare two rows.
1 1 0 1 1
0 1 0 1 0
0 1 0 0 1
1 0 0 0 1
0 0 0 0 0
1 0 0 0 1
So for example when I use the above sample df to compare two rows to get the columns which has 1 in it, I should get col(1) and col(3) when I compare row(0) and row(1). Similarly, when I compare row(1) and row(2), I should get col(1). I want to know if there is a more efficient solution in python.
NB: I want only the matching column numbers and also I will specify the rows to compare.
Consider the following dataframe:
import numpy as np
df = pd.DataFrame(np.random.binomial(1, 0.2, (2, 10000)))
It will be a binary matrix of size 2x10000.
np.where((df.iloc[0] * df.iloc[1]))
Or,
np.where((df.iloc[0]) & (df.iloc[1]))
returns the columns that have 1s in both rows. Multiplication seems to be faster:
%timeit np.where((df.iloc[0]) & (df.iloc[1]))
1000 loops, best of 3: 400 µs per loop
%timeit np.where((df.iloc[0] * df.iloc[1]))
1000 loops, best of 3: 269 µs per loop
Here's a simple function. You can modify it as needed, depending on how you represent your data. I'm assuming a list of lists:
df = [[1,1,0,1,1],
[0,1,0,1,0],
[0,1,0,0,1],
[1,0,0,0,1],
[0,0,0,0,0],
[1,0,0,0,1]]
def compare_rows(df,row1,row2):
"""Returns the column numbers in which both rows contain 1's"""
column_numbers = []
for i,_ in enumerate(df[0]):
if (df[row1][i] == 1) and (df[row2][i] ==1):
column_numbers.append(i)
return column_numbers
compare_rows(df,0,1) produces the output:
[1,3]

Pandas dataframe total row

I have a dataframe, something like:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
and I would like to add a 'total' row to the end of dataframe:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 total 18 9.47
I've tried to use the sum command but I end up with a Series, which although I can convert back to a Dataframe, doesn't maintain the data types:
tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo object
bar object
qux object
I would like to maintain the data types from the original data frame as I need to apply other operations to the total row, something like:
baz = 2*tot_row['qux'] + 3*tot_row['bar']
Update June 2022
pd.append is now deprecated. You could use pd.concat instead but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!
Append a totals row with
df.append(df.sum(numeric_only=True), ignore_index=True)
The conversion is necessary only if you have a column of strings or objects.
It's a bit of a fragile solution so I'd recommend sticking to operations on the dataframe, though. eg.
baz = 2*df['qux'].sum() + 3*df['bar'].sum()
df.loc["Total"] = df.sum()
works for me and I find it easier to remember. Am I missing something?
Probably wasn't possible in earlier versions.
I'd actually like to add the total row only temporarily though.
Adding it permanently is good for display but makes it a hassle in further calculations.
Just found
df.append(df.sum().rename('Total'))
This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.
New Method
To get both row and column total:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})
df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)
print(df)
a b c Row_Total
0 10.0 100.0 a 110.0
1 20.0 200.0 b 220.0
Column_Total 30.0 300.0 NaN 330.0
Use DataFrame.pivot_table with margins=True:
import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))
Original df:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:
df.pivot_table(index='foo',
margins=True,
margins_name='total', # defaults to 'All'
aggfunc=sum)
Voilà!
bar qux
foo
a 1 3.14
b 3 2.72
c 2 1.62
d 9 1.41
e 3 0.58
total 18 9.47
Alternative way (verified on Pandas 0.18.1):
import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)
Result:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 tot 18 9.47
Building on JMZ answer
df.append(df.sum(numeric_only=True), ignore_index=True)
if you want to continue using your current index you can name the sum series using .rename() as follows:
df.append(df.sum().rename('Total'))
This will add a row at the bottom of the table.
This is the way that I do it, by transposing and using the assign method in combination with a lambda function. It makes it simple for me.
df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T
Building on answer from Matthias Kauer.
To add row total:
df.loc["Row_Total"] = df.sum()
To add column total,
df.loc[:,"Column_Total"] = df.sum(axis=1)
New method [September 2022]
TL;DR:
Just use
df.style.concat(df.agg(['sum']).style)
for a solution that won't change you dataframe, works even if you have an "sum" in your index, and can be styled!
Explanation
In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show the total (or any other statistics), because it is not changing the original dataframe, and works even if you have an index named "sum" in your original dataframe.
For example:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)
and it will return a formatted table that is visible in jupyter as this:
Styling
with a little longer code, you can even make the last row look different:
df.style.concat(
df.agg(['sum']).style
.set_properties(**{'background-color': 'yellow'})
)
to get:
see other ways to style (such as bold font, or table lines) in the docs
Following helped for me to add a column total and row total to a dataframe.
Assume dft1 is your original dataframe... now add a column total and row total with the following steps.
from io import StringIO
import pandas as pd
#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")
#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")
## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)
## add a row total to dft1 with the following steps
sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe
dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total
dft1.append(dft1_sum) # append the row to dft1
Actually all proposed solutions render the original DataFrame unusable for any further analysis and can invalidate following computations, which will be easy to overlook and could lead to false results.
This is because you add a row to the data, which Pandas cannot differentiate from an additional row of data.
Example:
import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df
df.describe()
yields
0
0
1
1
5
2
6
3
8
4
9
0
count
5
mean
5.8
std
3.11448
min
1
25%
5
50%
6
75%
8
max
9
After
df.loc['Totals']= df.sum(numeric_only=True, axis=0)
the dataframe looks like this
0
0
1
1
5
2
6
3
8
4
9
Totals
29
This looks nice, but the new row is treated as if it was an additional data item, so df.describe will produce false results:
0
count
6
mean
9.66667
std
9.87252
min
1
25%
5.25
50%
7
75%
8.75
max
29
So: Watch out! and apply this only after doing all other analyses of the data or work on a copy of the DataFrame!
When the "totals" need to be added to an index column:
totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)
e.g.
(Pdb) df
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200 67412.0 368733992.0 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000 85380.0 692782132.0 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200 67412.0 379484173.0 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200 85392.0 328063972.0 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800 67292.0 383487021.0 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600 112309.0 379483824.0 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600 664144.0 358486985.0 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400 67300.0 593141462.0 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800 215002028.0 327493141.0 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800 202248016.0 321657935.0 2.684668e+08 1.865470e+07 9.632590e+13
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
0 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) df.append(totals)
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200.0 67412.0 3.687340e+08 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000.0 85380.0 6.927821e+08 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200.0 67412.0 3.794842e+08 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200.0 85392.0 3.280640e+08 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800.0 67292.0 3.834870e+08 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600.0 112309.0 3.794838e+08 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600.0 664144.0 3.584870e+08 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400.0 67300.0 5.931415e+08 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800.0 215002028.0 3.274931e+08 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800.0 202248016.0 3.216579e+08 2.684668e+08 1.865470e+07 9.632590e+13
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
Since i generally want to do this at the very end as to avoid breaking the integrity of the dataframe (right before printing). I created a summary_rows_cols method which returns a printable dataframe:
def summary_rows_cols(df: pd.DataFrame,
column_sum: bool = False,
column_avg: bool = False,
column_median: bool = False,
row_sum: bool = False,
row_avg: bool = False,
row_median: bool = False
) -> pd.DataFrame:
ret = df.copy()
if column_sum: ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
if column_avg: ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
if column_median: ret.loc['Median'] = df.median(numeric_only=True, axis=0)
if row_sum: ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
if row_median: ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
if row_avg: ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)
ret.fillna('-', inplace=True)
return ret
This allows me to enter a generic (numeric) df and get a summarized output such as:
a b c Sum Median
0 1 4 7 12 4
1 2 5 8 15 5
2 3 6 9 18 6
Sum 6 15 24 - -
from:
data = {
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
}
df = pd.DataFrame(data)
printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)

How to change the order of DataFrame columns?

I have the following DataFrame (df):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
I add more column(s) by assignment:
df['mean'] = df.mean(1)
How can I move the column mean to the front, i.e. set it as first column leaving the order of the other columns untouched?
One easy way would be to reassign the dataframe with a list of the columns, rearranged as needed.
This is what you have now:
In [6]: df
Out[6]:
0 1 2 3 4 mean
0 0.445598 0.173835 0.343415 0.682252 0.582616 0.445543
1 0.881592 0.696942 0.702232 0.696724 0.373551 0.670208
2 0.662527 0.955193 0.131016 0.609548 0.804694 0.632596
3 0.260919 0.783467 0.593433 0.033426 0.512019 0.436653
4 0.131842 0.799367 0.182828 0.683330 0.019485 0.363371
5 0.498784 0.873495 0.383811 0.699289 0.480447 0.587165
6 0.388771 0.395757 0.745237 0.628406 0.784473 0.588529
7 0.147986 0.459451 0.310961 0.706435 0.100914 0.345149
8 0.394947 0.863494 0.585030 0.565944 0.356561 0.553195
9 0.689260 0.865243 0.136481 0.386582 0.730399 0.561593
In [7]: cols = df.columns.tolist()
In [8]: cols
Out[8]: [0L, 1L, 2L, 3L, 4L, 'mean']
Rearrange cols in any way you want. This is how I moved the last element to the first position:
In [12]: cols = cols[-1:] + cols[:-1]
In [13]: cols
Out[13]: ['mean', 0L, 1L, 2L, 3L, 4L]
Then reorder the dataframe like this:
In [16]: df = df[cols] # OR df = df.ix[:, cols]
In [17]: df
Out[17]:
mean 0 1 2 3 4
0 0.445543 0.445598 0.173835 0.343415 0.682252 0.582616
1 0.670208 0.881592 0.696942 0.702232 0.696724 0.373551
2 0.632596 0.662527 0.955193 0.131016 0.609548 0.804694
3 0.436653 0.260919 0.783467 0.593433 0.033426 0.512019
4 0.363371 0.131842 0.799367 0.182828 0.683330 0.019485
5 0.587165 0.498784 0.873495 0.383811 0.699289 0.480447
6 0.588529 0.388771 0.395757 0.745237 0.628406 0.784473
7 0.345149 0.147986 0.459451 0.310961 0.706435 0.100914
8 0.553195 0.394947 0.863494 0.585030 0.565944 0.356561
9 0.561593 0.689260 0.865243 0.136481 0.386582 0.730399
You could also do something like this:
df = df[['mean', '0', '1', '2', '3']]
You can get the list of columns with:
cols = list(df.columns.values)
The output will produce:
['0', '1', '2', '3', 'mean']
...which is then easy to rearrange manually before dropping it into the first function
Just assign the column names in the order you want them:
In [39]: df
Out[39]:
0 1 2 3 4 mean
0 0.172742 0.915661 0.043387 0.712833 0.190717 1
1 0.128186 0.424771 0.590779 0.771080 0.617472 1
2 0.125709 0.085894 0.989798 0.829491 0.155563 1
3 0.742578 0.104061 0.299708 0.616751 0.951802 1
4 0.721118 0.528156 0.421360 0.105886 0.322311 1
5 0.900878 0.082047 0.224656 0.195162 0.736652 1
6 0.897832 0.558108 0.318016 0.586563 0.507564 1
7 0.027178 0.375183 0.930248 0.921786 0.337060 1
8 0.763028 0.182905 0.931756 0.110675 0.423398 1
9 0.848996 0.310562 0.140873 0.304561 0.417808 1
In [40]: df = df[['mean', 4,3,2,1]]
Now, 'mean' column comes out in the front:
In [41]: df
Out[41]:
mean 4 3 2 1
0 1 0.190717 0.712833 0.043387 0.915661
1 1 0.617472 0.771080 0.590779 0.424771
2 1 0.155563 0.829491 0.989798 0.085894
3 1 0.951802 0.616751 0.299708 0.104061
4 1 0.322311 0.105886 0.421360 0.528156
5 1 0.736652 0.195162 0.224656 0.082047
6 1 0.507564 0.586563 0.318016 0.558108
7 1 0.337060 0.921786 0.930248 0.375183
8 1 0.423398 0.110675 0.931756 0.182905
9 1 0.417808 0.304561 0.140873 0.310562
For pandas >= 1.3 (Edited in 2022):
df.insert(0, 'mean', df.pop('mean'))
How about (for Pandas < 1.3, the original answer)
df.insert(0, 'mean', df['mean'])
https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#column-selection-addition-deletion
In your case,
df = df.reindex(columns=['mean',0,1,2,3,4])
will do exactly what you want.
In my case (general form):
df = df.reindex(columns=sorted(df.columns))
df = df.reindex(columns=(['opened'] + list([a for a in df.columns if a != 'opened']) ))
import numpy as np
import pandas as pd
df = pd.DataFrame()
column_names = ['x','y','z','mean']
for col in column_names:
df[col] = np.random.randint(0,100, size=10000)
You can try out the following solutions :
Solution 1:
df = df[ ['mean'] + [ col for col in df.columns if col != 'mean' ] ]
Solution 2:
df = df[['mean', 'x', 'y', 'z']]
Solution 3:
col = df.pop("mean")
df = df.insert(0, col.name, col)
Solution 4:
df.set_index(df.columns[-1], inplace=True)
df.reset_index(inplace=True)
Solution 5:
cols = list(df)
cols = [cols[-1]] + cols[:-1]
df = df[cols]
solution 6:
order = [1,2,3,0] # setting column's order
df = df[[df.columns[i] for i in order]]
Time Comparison:
Solution 1:
CPU times: user 1.05 ms, sys: 35 µs, total: 1.08 ms Wall time: 995 µs
Solution 2:
CPU times: user 933 µs, sys: 0 ns, total: 933 µs
Wall time: 800 µs
Solution 3:
CPU times: user 0 ns, sys: 1.35 ms, total: 1.35 ms
Wall time: 1.08 ms
Solution 4:
CPU times: user 1.23 ms, sys: 45 µs, total: 1.27 ms
Wall time: 986 µs
Solution 5:
CPU times: user 1.09 ms, sys: 19 µs, total: 1.11 ms
Wall time: 949 µs
Solution 6:
CPU times: user 955 µs, sys: 34 µs, total: 989 µs
Wall time: 859 µs
You need to create a new list of your columns in the desired order, then use df = df[cols] to rearrange the columns in this new order.
cols = ['mean'] + [col for col in df if col != 'mean']
df = df[cols]
You can also use a more general approach. In this example, the last column (indicated by -1) is inserted as the first column.
cols = [df.columns[-1]] + [col for col in df if col != df.columns[-1]]
df = df[cols]
You can also use this approach for reordering columns in a desired order if they are present in the DataFrame.
inserted_cols = ['a', 'b', 'c']
cols = ([col for col in inserted_cols if col in df]
+ [col for col in df if col not in inserted_cols])
df = df[cols]
Suppose you have df with columns A B C.
The most simple way is:
df = df.reindex(['B','C','A'], axis=1)
If your column names are too-long-to-type then you could specify the new order through a list of integers with the positions:
Data:
0 1 2 3 4 mean
0 0.397312 0.361846 0.719802 0.575223 0.449205 0.500678
1 0.287256 0.522337 0.992154 0.584221 0.042739 0.485741
2 0.884812 0.464172 0.149296 0.167698 0.793634 0.491923
3 0.656891 0.500179 0.046006 0.862769 0.651065 0.543382
4 0.673702 0.223489 0.438760 0.468954 0.308509 0.422683
5 0.764020 0.093050 0.100932 0.572475 0.416471 0.389390
6 0.259181 0.248186 0.626101 0.556980 0.559413 0.449972
7 0.400591 0.075461 0.096072 0.308755 0.157078 0.207592
8 0.639745 0.368987 0.340573 0.997547 0.011892 0.471749
9 0.050582 0.714160 0.168839 0.899230 0.359690 0.438500
Generic example:
new_order = [3,2,1,4,5,0]
print(df[df.columns[new_order]])
3 2 1 4 mean 0
0 0.575223 0.719802 0.361846 0.449205 0.500678 0.397312
1 0.584221 0.992154 0.522337 0.042739 0.485741 0.287256
2 0.167698 0.149296 0.464172 0.793634 0.491923 0.884812
3 0.862769 0.046006 0.500179 0.651065 0.543382 0.656891
4 0.468954 0.438760 0.223489 0.308509 0.422683 0.673702
5 0.572475 0.100932 0.093050 0.416471 0.389390 0.764020
6 0.556980 0.626101 0.248186 0.559413 0.449972 0.259181
7 0.308755 0.096072 0.075461 0.157078 0.207592 0.400591
8 0.997547 0.340573 0.368987 0.011892 0.471749 0.639745
9 0.899230 0.168839 0.714160 0.359690 0.438500 0.050582
Although it might seem like I'm just explicitly typing the column names in a different order, the fact that there's a column 'mean' should make it clear that new_order relates to actual positions and not column names.
For the specific case of OP's question:
new_order = [-1,0,1,2,3,4]
df = df[df.columns[new_order]]
print(df)
mean 0 1 2 3 4
0 0.500678 0.397312 0.361846 0.719802 0.575223 0.449205
1 0.485741 0.287256 0.522337 0.992154 0.584221 0.042739
2 0.491923 0.884812 0.464172 0.149296 0.167698 0.793634
3 0.543382 0.656891 0.500179 0.046006 0.862769 0.651065
4 0.422683 0.673702 0.223489 0.438760 0.468954 0.308509
5 0.389390 0.764020 0.093050 0.100932 0.572475 0.416471
6 0.449972 0.259181 0.248186 0.626101 0.556980 0.559413
7 0.207592 0.400591 0.075461 0.096072 0.308755 0.157078
8 0.471749 0.639745 0.368987 0.340573 0.997547 0.011892
9 0.438500 0.050582 0.714160 0.168839 0.899230 0.359690
The main problem with this approach is that calling the same code multiple times will create different results each time, so one needs to be careful :)
This question has been answered before but reindex_axis is deprecated now so I would suggest to use:
df = df.reindex(sorted(df.columns), axis=1)
For those who want to specify the order they want instead of just sorting them, here's the solution spelled out:
df = df.reindex(['the','order','you','want'], axis=1)
Now, how you want to sort the list of column names is really not a pandas question, that's a Python list manipulation question. There are many ways of doing that, and I think this answer has a very neat way of doing it.
You can reorder the dataframe columns using a list of names with:
df = df.filter(list_of_col_names)
I think this is a slightly neater solution:
df.insert(0, 'mean', df.pop("mean"))
This solution is somewhat similar to #JoeHeffer 's solution but this is one liner.
Here we remove the column "mean" from the dataframe and attach it to index 0 with the same column name.
I ran into a similar question myself, and just wanted to add what I settled on. I liked the reindex_axis() method for changing column order. This worked:
df = df.reindex_axis(['mean'] + list(df.columns[:-1]), axis=1)
An alternate method based on the comment from #Jorge:
df = df.reindex(columns=['mean'] + list(df.columns[:-1]))
Although reindex_axis seems to be slightly faster in micro benchmarks than reindex, I think I prefer the latter for its directness.
This function avoids you having to list out every variable in your dataset just to order a few of them.
def order(frame,var):
if type(var) is str:
var = [var] #let the command take a string or list
varlist =[w for w in frame.columns if w not in var]
frame = frame[var+varlist]
return frame
It takes two arguments, the first is the dataset, the second are the columns in the data set that you want to bring to the front.
So in my case I have a data set called Frame with variables A1, A2, B1, B2, Total and Date. If I want to bring Total to the front then all I have to do is:
frame = order(frame,['Total'])
If I want to bring Total and Date to the front then I do:
frame = order(frame,['Total','Date'])
EDIT:
Another useful way to use this is, if you have an unfamiliar table and you're looking with variables with a particular term in them, like VAR1, VAR2,... you may execute something like:
frame = order(frame,[v for v in frame.columns if "VAR" in v])
Simply do,
df = df[['mean'] + df.columns[:-1].tolist()]
Here's a way to move one existing column that will modify the existing dataframe in place.
my_column = df.pop('column name')
df.insert(3, my_column.name, my_column) # Is in-place
Just type the column name you want to change, and set the index for the new location.
def change_column_order(df, col_name, index):
cols = df.columns.tolist()
cols.remove(col_name)
cols.insert(index, col_name)
return df[cols]
For your case, this would be like:
df = change_column_order(df, 'mean', 0)
You could do the following (borrowing parts from Aman's answer):
cols = df.columns.tolist()
cols.insert(0, cols.pop(-1))
cols
>>>['mean', 0L, 1L, 2L, 3L, 4L]
df = df[cols]
Moving any column to any position:
import pandas as pd
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})
cols = df.columns.tolist()
column_to_move = "C"
new_position = 1
cols.insert(new_position, cols.pop(cols.index(column_to_move)))
df = df[cols]
I wanted to bring two columns in front from a dataframe where I do not know exactly the names of all columns, because they are generated from a pivot statement before.
So, if you are in the same situation: To bring columns in front that you know the name of and then let them follow by "all the other columns", I came up with the following general solution:
df = df.reindex_axis(['Col1','Col2'] + list(df.columns.drop(['Col1','Col2'])), axis=1)
Here is a very simple answer to this(only one line).
You can do that after you added the 'n' column into your df as follows.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
df['mean'] = df.mean(1)
df
0 1 2 3 4 mean
0 0.929616 0.316376 0.183919 0.204560 0.567725 0.440439
1 0.595545 0.964515 0.653177 0.748907 0.653570 0.723143
2 0.747715 0.961307 0.008388 0.106444 0.298704 0.424512
3 0.656411 0.809813 0.872176 0.964648 0.723685 0.805347
4 0.642475 0.717454 0.467599 0.325585 0.439645 0.518551
5 0.729689 0.994015 0.676874 0.790823 0.170914 0.672463
6 0.026849 0.800370 0.903723 0.024676 0.491747 0.449473
7 0.526255 0.596366 0.051958 0.895090 0.728266 0.559587
8 0.818350 0.500223 0.810189 0.095969 0.218950 0.488736
9 0.258719 0.468106 0.459373 0.709510 0.178053 0.414752
### here you can add below line and it should work
# Don't forget the two (()) 'brackets' around columns names.Otherwise, it'll give you an error.
df = df[list(('mean',0, 1, 2,3,4))]
df
mean 0 1 2 3 4
0 0.440439 0.929616 0.316376 0.183919 0.204560 0.567725
1 0.723143 0.595545 0.964515 0.653177 0.748907 0.653570
2 0.424512 0.747715 0.961307 0.008388 0.106444 0.298704
3 0.805347 0.656411 0.809813 0.872176 0.964648 0.723685
4 0.518551 0.642475 0.717454 0.467599 0.325585 0.439645
5 0.672463 0.729689 0.994015 0.676874 0.790823 0.170914
6 0.449473 0.026849 0.800370 0.903723 0.024676 0.491747
7 0.559587 0.526255 0.596366 0.051958 0.895090 0.728266
8 0.488736 0.818350 0.500223 0.810189 0.095969 0.218950
9 0.414752 0.258719 0.468106 0.459373 0.709510 0.178053
You can use a set which is an unordered collection of unique elements to do keep the "order of the other columns untouched":
other_columns = list(set(df.columns).difference(["mean"])) #[0, 1, 2, 3, 4]
Then, you can use a lambda to move a specific column to the front by:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.random.rand(10, 5))
In [4]: df["mean"] = df.mean(1)
In [5]: move_col_to_front = lambda df, col: df[[col]+list(set(df.columns).difference([col]))]
In [6]: move_col_to_front(df, "mean")
Out[6]:
mean 0 1 2 3 4
0 0.697253 0.600377 0.464852 0.938360 0.945293 0.537384
1 0.609213 0.703387 0.096176 0.971407 0.955666 0.319429
2 0.561261 0.791842 0.302573 0.662365 0.728368 0.321158
3 0.518720 0.710443 0.504060 0.663423 0.208756 0.506916
4 0.616316 0.665932 0.794385 0.163000 0.664265 0.793995
5 0.519757 0.585462 0.653995 0.338893 0.714782 0.305654
6 0.532584 0.434472 0.283501 0.633156 0.317520 0.994271
7 0.640571 0.732680 0.187151 0.937983 0.921097 0.423945
8 0.562447 0.790987 0.200080 0.317812 0.641340 0.862018
9 0.563092 0.811533 0.662709 0.396048 0.596528 0.348642
In [7]: move_col_to_front(df, 2)
Out[7]:
2 0 1 3 4 mean
0 0.938360 0.600377 0.464852 0.945293 0.537384 0.697253
1 0.971407 0.703387 0.096176 0.955666 0.319429 0.609213
2 0.662365 0.791842 0.302573 0.728368 0.321158 0.561261
3 0.663423 0.710443 0.504060 0.208756 0.506916 0.518720
4 0.163000 0.665932 0.794385 0.664265 0.793995 0.616316
5 0.338893 0.585462 0.653995 0.714782 0.305654 0.519757
6 0.633156 0.434472 0.283501 0.317520 0.994271 0.532584
7 0.937983 0.732680 0.187151 0.921097 0.423945 0.640571
8 0.317812 0.790987 0.200080 0.641340 0.862018 0.562447
9 0.396048 0.811533 0.662709 0.596528 0.348642 0.563092
Just flipping helps often.
df[df.columns[::-1]]
Or just shuffle for a look.
import random
cols = list(df.columns)
random.shuffle(cols)
df[cols]
You can use reindex which can be used for both axis:
df
# 0 1 2 3 4 mean
# 0 0.943825 0.202490 0.071908 0.452985 0.678397 0.469921
# 1 0.745569 0.103029 0.268984 0.663710 0.037813 0.363821
# 2 0.693016 0.621525 0.031589 0.956703 0.118434 0.484254
# 3 0.284922 0.527293 0.791596 0.243768 0.629102 0.495336
# 4 0.354870 0.113014 0.326395 0.656415 0.172445 0.324628
# 5 0.815584 0.532382 0.195437 0.829670 0.019001 0.478415
# 6 0.944587 0.068690 0.811771 0.006846 0.698785 0.506136
# 7 0.595077 0.437571 0.023520 0.772187 0.862554 0.538182
# 8 0.700771 0.413958 0.097996 0.355228 0.656919 0.444974
# 9 0.263138 0.906283 0.121386 0.624336 0.859904 0.555009
df.reindex(['mean', *range(5)], axis=1)
# mean 0 1 2 3 4
# 0 0.469921 0.943825 0.202490 0.071908 0.452985 0.678397
# 1 0.363821 0.745569 0.103029 0.268984 0.663710 0.037813
# 2 0.484254 0.693016 0.621525 0.031589 0.956703 0.118434
# 3 0.495336 0.284922 0.527293 0.791596 0.243768 0.629102
# 4 0.324628 0.354870 0.113014 0.326395 0.656415 0.172445
# 5 0.478415 0.815584 0.532382 0.195437 0.829670 0.019001
# 6 0.506136 0.944587 0.068690 0.811771 0.006846 0.698785
# 7 0.538182 0.595077 0.437571 0.023520 0.772187 0.862554
# 8 0.444974 0.700771 0.413958 0.097996 0.355228 0.656919
# 9 0.555009 0.263138 0.906283 0.121386 0.624336 0.859904
Hackiest method in the book
df.insert(0, "test", df["mean"])
df = df.drop(columns=["mean"]).rename(columns={"test": "mean"})
A pretty straightforward solution that worked for me is to use .reindex on df.columns:
df = df[df.columns.reindex(['mean', 0, 1, 2, 3, 4])[0]]
Here is a function to do this for any number of columns.
def mean_first(df):
ncols = df.shape[1] # Get the number of columns
index = list(range(ncols)) # Create an index to reorder the columns
index.insert(0,ncols) # This puts the last column at the front
return(df.assign(mean=df.mean(1)).iloc[:,index]) # new df with last column (mean) first
A simple approach is using set(), in particular when you have a long list of columns and do not want to handle them manually:
cols = list(set(df.columns.tolist()) - set(['mean']))
cols.insert(0, 'mean')
df = df[cols]
How about using T?
df = df.T.reindex(['mean', 0, 1, 2, 3, 4]).T
I believe #Aman's answer is the best if you know the location of the other column.
If you don't know the location of mean, but only have its name, you cannot resort directly to cols = cols[-1:] + cols[:-1]. Following is the next-best thing I could come up with:
meanDf = pd.DataFrame(df.pop('mean'))
# now df doesn't contain "mean" anymore. Order of join will move it to left or right:
meanDf.join(df) # has mean as first column
df.join(meanDf) # has mean as last column

Categories

Resources