Concatenate all columns in a pandas dataframe - python

I have multiple pandas dataframes which may have different numbers of columns; the number of columns typically varies from 50 to 100. I need to create a final column that is simply the concatenation of all the other columns. Basically, the string in the first row of that column should be the concatenation of the strings in the first row of all the columns. I wrote the loop below, but I feel there might be a better, more efficient way to do this. Any ideas on how to do this?
num_columns = df.columns.shape[0]
col_names = df.columns.values.tolist()
df.loc[:, 'merged'] = ""
for each_col_ind in range(num_columns):
    print('Concatenating', col_names[each_col_ind])
    df.loc[:, 'merged'] = df.loc[:, 'merged'] + df[col_names[each_col_ind]]

A solution with sum, but the output is float, so a conversion to int and str is necessary:
df['new'] = df.sum(axis=1).astype(int).astype(str)
Another solution with apply and join, but it is the slowest:
df['new'] = df.apply(''.join, axis=1)
Last, a very fast numpy solution - convert to a numpy array and then sum:
df['new'] = df.values.sum(axis=1)
Timings:
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)
cols = list('ABC')
#not_a_robot solution
In [259]: %timeit df['concat'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')
100 loops, best of 3: 17.4 ms per loop
In [260]: %timeit df['new'] = df[cols].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 386 ms per loop
In [261]: %timeit df['new1'] = df[cols].values.sum(axis=1)
100 loops, best of 3: 6.5 ms per loop
In [262]: %timeit df['new2'] = df[cols].astype(str).sum(axis=1).astype(int).astype(str)
10 loops, best of 3: 68.6 ms per loop
EDIT: If the dtypes of some columns are not object (i.e., not strings), cast them with DataFrame.astype:
df['new'] = df.astype(str).values.sum(axis=1)
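One further option, not from the original answer: since pandas 0.23, Series.str.cat accepts a DataFrame as others, so the first column can be concatenated with all the rest in a single vectorized call. A minimal sketch, assuming all columns are already strings:
cols = list(df.columns)
# Row-wise concatenation of the first column with every remaining column;
# na_rep='' treats missing values as empty strings
df['new'] = df[cols[0]].str.cat(df[cols[1:]], sep='', na_rep='')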

df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
Gives us:
df
Out[6]:
A B C concat
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
To select a given set of columns:
df['concat'] = pd.Series(df[['A', 'B']].fillna('').values.tolist()).str.join('')
df
Out[8]:
A B C concat
0 1 4 7 14
1 2 5 8 25
2 3 6 9 36
However, I've noticed that approach can sometimes result in NaNs being populated where they shouldn't, so here's another way:
>>> from functools import reduce
>>> df['concat'] = df[cols].apply(lambda x: reduce(lambda a, b: a + b, x), axis=1)
>>> df
A B C concat
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
Although it should be noted that this approach is a lot slower:
$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[df[["a", "b"]].apply(lambda x: reduce(lambda a, b: a + b, x)) for _ in range(10)]'
10 loops, best of 3: 451 msec per loop
Versus
$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[pd.Series(df[["a", "b"]].fillna("").values.tolist()).str.join(" ") for _ in range(10)]'
10 loops, best of 3: 98.5 msec per loop

I don't have enough reputation to comment, so I'm building my answer off of blacksite's response.
For clarity, LunchBox commented that it failed for Python 3.7.0. It also failed for me on Python 3.6.3. Here is the original answer by blacksite:
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
Here is my modification for Python 3.6.3:
df['concat'] = pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str,x)))

The solutions given above that use numpy arrays have worked great for me.
However, one thing to be careful about is the indexing when you get the numpy.ndarray from df.values, since the axis labels are removed from df.values.
So to take one of the solutions offered above (the one that I use most often) as an example:
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
This portion:
df.fillna('').values
does not preserve the indices of the original DataFrame. Not a problem when the DataFrame has the common 0, 1, 2, ... row indexing scheme, but this solution will not work when the DataFrame is indexed in any other way. You can fix this by adding an index= argument to pd.Series():
df['concat'] = pd.Series(df.fillna('').values.tolist(),
index=df.index).str.join('')
I always add the index= argument just to be safe, even when I'm sure the DataFrame is row-indexed as 0, 1, 2, ...
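To make the pitfall concrete, here is a small toy example of my own. With a non-default index, the version without index= produces NaN, because pandas aligns on row labels during assignment:
df2 = pd.DataFrame({'A': ['1', '2'], 'B': ['3', '4']}, index=[10, 20])
# The bare Series is indexed 0, 1, which matches nothing in df2 -> all NaN
df2['bad'] = pd.Series(df2.fillna('').values.tolist()).str.join('')
# Passing index=df2.index lines the rows up correctly -> '13', '24'
df2['good'] = pd.Series(df2.fillna('').values.tolist(), index=df2.index).str.join('')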

This lambda approach offers some flexibility with columns chosen and separator type:
Setup:
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Concatenate All Columns - no separator:
cols = ['A', 'B', 'C']
df['combined'] = df[cols].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
A B C combined
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
Concatenate Two Columns A and C with '_' separator:
cols = ['A', 'C']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
A B C combined
0 1 4 7 1_7
1 2 5 8 2_8
2 3 6 9 3_9

As a solution to @Gary Dorman's question in the comments -
"I would want to have a delimiter in place so when you're looking at your overall column, you can see how it's broken out." -
you might use:
df_tmp=df.astype(str) + ','
df_tmp.sum(axis=1).str.rstrip(',')
before:
1.2.3.480tcp
6.6.6.680udp
7.7.7.78080tcp
8.8.8.88080tcp
9.9.9.98080tcp
after:
1.2.3.4,80,tcp
6.6.6.6,80,udp
7.7.7.7,8080,tcp
8.8.8.8,8080,tcp
9.9.9.9,8080,tcp
which looks better (like CSV :)
This additional separator step is about 30% slower on my machine.
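A tidier way to get the same CSV-style output (my addition, assuming pandas >= 0.20, where DataFrame.agg supports axis=1): str.join only places the separator between values, so no rstrip is needed:
df['merged'] = df.astype(str).agg(','.join, axis=1)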

Related

Renaming values in a column from lists within a dataframe

I have a data frame which looks like this,
df=pd.DataFrame({'col1':[1,2,3,4,5,6], 'col2':list('AASOSP')})
df
and I have two lists,
lis1=['A']
lis2=['S','O']
I need to replace the values in col2 based on lis1 and lis2.
So I used np.where to do so.
like this,
df['col2'] = np.where(df.col2.isin(lis1),'PC',df.col2.isin(lis2),'Ln','others')
But it's throwing me the following error:
TypeError: function takes at most 3 arguments (5 given)
Any suggestion is very much appreciated!
At the end I am aiming to have the values replace in col2 of my data frame as,
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
Use double numpy.where:
lis1=['A']
lis2=['S','O']
df['col2'] = np.where(df.col2.isin(lis1),'PC',
np.where(df.col2.isin(lis2),'Ln','others'))
print (df)
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
Timings:
#[60000 rows x 2 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
In [257]: %timeit np.where(df.col2.isin(lis1),'PC',np.where(df.col2.isin(lis2),'Ln','others'))
100 loops, best of 3: 8.15 ms per loop
In [258]: %timeit in1d_based(df, lis1, lis2)
100 loops, best of 3: 4.98 ms per loop
Here's one approach -
a = df.col2.values
df.col2 = np.take(['others','PC','Ln'], np.in1d(a,lis1) + 2*np.in1d(a,lis2))
Sample step-by-step run -
# Input dataframe
In [206]: df
Out[206]:
col1 col2
0 1 A
1 2 A
2 3 S
3 4 O
4 5 S
5 6 P
# Extract out col2 values
In [207]: a = df.col2.values
# Form an indexing array based on where we have matches in lis1 or lis2 or neither
In [208]: idx = np.in1d(a,lis1) + 2*np.in1d(a,lis2)
In [209]: idx
Out[209]: array([1, 1, 2, 2, 2, 0])
# Index into a list of new strings with those indices
In [210]: newvals = np.take(['others','PC','Ln'], idx)
In [211]: newvals
Out[211]:
array(['PC', 'PC', 'Ln', 'Ln', 'Ln', 'others'],
dtype='|S6')
# Finally assign those into col2
In [212]: df.col2 = newvals
In [213]: df
Out[213]:
col1 col2
0 1 PC
1 2 PC
2 3 Ln
3 4 Ln
4 5 Ln
5 6 others
Runtime test -
In [251]: df=pd.DataFrame({'col1':[1,2,3,4,5,6], 'col2':list('AASOSP')})
In [252]: df = pd.concat([df]*10000).reset_index(drop=True)
In [253]: lis1
Out[253]: ['A']
In [254]: lis2
Out[254]: ['S', 'O']
In [255]: def in1d_based(df, lis1, lis2):
...: a = df.col2.values
...: return np.take(['others','PC','Ln'], np.in1d(a,lis1) + 2*np.in1d(a,lis2))
...:
# @jezrael's solution
In [256]: %timeit np.where(df.col2.isin(lis1),'PC', np.where(df.col2.isin(lis2),'Ln','others'))
100 loops, best of 3: 3.78 ms per loop
In [257]: %timeit in1d_based(df, lis1, lis2)
1000 loops, best of 3: 1.89 ms per loop
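For completeness, not from either answer: numpy.select expresses the same branching without nesting, which reads better once there are more than two conditions:
conds = [df.col2.isin(lis1), df.col2.isin(lis2)]
choices = ['PC', 'Ln']
# Conditions are checked in order; rows matching neither get the default
df['col2'] = np.select(conds, choices, default='others')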

pandas sort lambda function

Given a dataframe a with 3 columns, A, B, C, and 3 rows of numerical values, how does one sort all the rows with a comparison operator using only the product A[i]*B[i]? It seems that the pandas sort only takes columns and then a sort method.
I would like to use a comparison function like below.
f = lambda i,j: a['A'][i]*a['B'][i] < a['A'][j]*a['B'][j]
There are at least two ways:
Method 1
Say you start with
In [175]: df = pd.DataFrame({'A': [1, 2], 'B': [1, -1], 'C': [1, 1]})
You can add a column which is your sort key
In [176]: df['sort_val'] = df.A * df.B
Finally sort by it and drop it
In [190]: df.sort_values('sort_val').drop('sort_val', 1)
Out[190]:
A B C
1 2 -1 1
0 1 1 1
Method 2
Use numpy.argsort and then use .iloc on the resulting indices:
In [197]: import numpy as np
In [198]: df.iloc[np.argsort(df.A * df.B).values]
Out[198]:
A B C
0 1 1 1
1 2 -1 1
Another way, adding it here because this is the first result on Google:
df.loc[(df.A * df.B).sort_values().index]
This works well for me and is pretty straightforward. @Ami Tavory's answer gave strange results for me with a categorical index; not sure whether that was the cause, though.
Just adding, on top of @srs's super elegant answer, an iloc option with some time comparisons against loc and the naive solution.
(iloc is preferred when your index is position-based, vs label-based for loc.)
import numpy as np
import pandas as pd
N = 10000
df = pd.DataFrame({
'A': np.random.randint(low=1, high=N, size=N),
'B': np.random.randint(low=1, high=N, size=N)
})
%%timeit -n 100
df['C'] = df['A'] * df['B']
df.sort_values(by='C')
naive: 100 loops, best of 3: 1.85 ms per loop
%%timeit -n 100
df.loc[(df.A * df.B).sort_values().index]
loc: 100 loops, best of 3: 2.69 ms per loop
%%timeit -n 100
df.iloc[(df.A * df.B).sort_values().index]
iloc: 100 loops, best of 3: 2.02 ms per loop
df['C'] = df['A'] * df['B']
df1 = df.sort_values(by='C')
df2 = df.loc[(df.A * df.B).sort_values().index]
df3 = df.iloc[(df.A * df.B).sort_values().index]
print(np.array_equal(df1.index, df2.index))
print(np.array_equal(df2.index, df3.index))
testing results (comparing the entire index order) between all options:
True
True
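One more variant, my addition (assuming pandas >= 0.24 for to_numpy): argsort on the raw values yields integer positions, so iloc is safe even when the index is not 0, 1, 2, ...:
import numpy as np
order = np.argsort((df.A * df.B).to_numpy())
df_sorted = df.iloc[order]  # purely positional, so it works with any index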

Accessing every 1st element of Pandas DataFrame column containing lists

I have a Pandas DataFrame with a column containing list objects:
A
0 [1,2]
1 [3,4]
2 [8,9]
3 [2,6]
How can I access the first element of each list and save it into a new column of the DataFrame? To get a result like this:
A new_col
0 [1,2] 1
1 [3,4] 3
2 [8,9] 8
3 [2,6] 2
I know this could be done by iterating over each row, but is there any "pythonic" way?
As always, remember that storing non-scalar objects in frames is generally disfavoured, and should really only be used as a temporary intermediate step.
That said, you can use the .str accessor even though it's not a column of strings:
>>> df = pd.DataFrame({"A": [[1,2],[3,4],[8,9],[2,6]]})
>>> df["new_col"] = df["A"].str[0]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
>>> df["new_col"]
0 1
1 3
2 8
3 2
Name: new_col, dtype: int64
You can use map and a lambda function
df.loc[:, 'new_col'] = df.A.map(lambda x: x[0])
Use apply with x[0]:
df['new_col'] = df.A.apply(lambda x: x[0])
print(df)
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
You can use the method str.get:
df['A'].str.get(0)
You can just use a conditional list comprehension which takes the first value of any iterable or else uses None for that item. List comprehensions are very Pythonic.
df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
Timings
df = pd.concat([df] * 10000)
%timeit df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
100 loops, best of 3: 13.2 ms per loop
%timeit df["new_col"] = df["A"].str[0]
100 loops, best of 3: 15.3 ms per loop
%timeit df['new_col'] = df.A.apply(lambda x: x[0])
100 loops, best of 3: 12.1 ms per loop
%timeit df.A.map(lambda x: x[0])
100 loops, best of 3: 11.1 ms per loop
Removing the safety check that ensures an iterable:
%timeit df['new_col'] = [val[0] for val in df["A"]]
100 loops, best of 3: 7.38 ms per loop
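Following the note above that list columns should only be a temporary intermediate step, the whole column can also be unpacked in one go. A sketch of mine, assuming every list has two elements (the column names here are illustrative):
expanded = pd.DataFrame(df['A'].tolist(), index=df.index, columns=['first', 'second'])
df['new_col'] = expanded['first']  # or keep working with the expanded frame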

Fast way to convert strings into lists of ints in a Pandas column?

I'm trying to compute the Hamming distance between all strings in a column in a large dataframe. I have over 100,000 rows in this column, so with all pairwise combinations there are about 10x10^9 comparisons. These strings are short DNA sequences. I would like to quickly convert every string in the column to a list of integers, where a unique integer represents each character in the string. E.g.
"ACGTACA" -> [0, 1, 2, 3, 1, 2, 1]
Then I use scipy.spatial.distance.pdist to quickly and efficiently compute the Hamming distance between all of these. Is there a fast way to do this in Pandas?
I have tried using apply, but it is pretty slow:
mapping = {"A":0, "C":1, "G":2, "T":3}
df.apply(lambda x: np.array([mapping[char] for char in x]))
get_dummies and other categorical operations don't apply because they operate at a per-row level, not within the row.
Since Hamming distance doesn't care about magnitude differences, I can get about a 40-60% speedup just by replacing df.apply(lambda x: np.array([mapping[char] for char in x])) with df.apply(lambda x: map(ord, x)) on made-up datasets.
Create your test data
In [39]: pd.options.display.max_rows=12
In [40]: N = 100000
In [41]: chars = np.array(list('ABCDEF'))
In [42]: s = pd.Series(np.random.choice(chars, size=4 * np.prod(N)).view('S4'))
In [45]: s
Out[45]:
0 BEBC
1 BEEC
2 FEFA
3 BBDA
4 CCBB
5 CABE
...
99994 EEBC
99995 FFBD
99996 ACFB
99997 FDBE
99998 BDAB
99999 CCFD
dtype: object
These don't actually have to be the same length the way we are doing it.
In [43]: maxlen = s.str.len().max()
In [44]: result = pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)
In [47]: result
Out[47]:
0 1 2 3
0 1 4 1 2
1 1 4 4 2
2 5 4 5 0
3 1 1 3 0
4 2 2 1 1
5 2 0 1 4
... .. .. .. ..
99994 4 4 1 2
99995 5 5 1 3
99996 0 2 5 1
99997 5 3 1 4
99998 1 3 0 1
99999 2 2 5 3
[100000 rows x 4 columns]
So you get a factorization according to the same categories (i.e., the codes are meaningful).
And pretty fast
In [46]: %timeit pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)
10 loops, best of 3: 118 ms per loop
I didn't test the performance of this, but you could also try something like:
atest = "ACGTACA"
alist = atest.replace('A', '3.').replace('C', '2.').replace('G', '1.').replace('T', '0.').split('.')
anumlist = [int(x) for x in alist if x.isdigit()]
results in:
[3, 2, 1, 0, 3, 2, 3]
Edit: OK, so testing it with atest = "ACTACA"*100000 takes a while :/
Maybe not the best idea...
Edit 5:
Another improvement:
import datetime
import numpy as np

class Test(object):
    def __init__(self):
        self.mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def char2num(self, astring):
        return [self.mapping[c] for c in astring]

def main():
    now = datetime.datetime.now()
    atest = "AGTCAGTCATG" * 10000000
    t = Test()
    alist = t.char2num(atest)
    testme = np.array(alist)
    print(testme, len(testme))
    print(datetime.datetime.now() - now)

if __name__ == "__main__":
    main()
Takes about 16 seconds for 110,000,000 characters and keeps your processor busy instead of your RAM:
[0 2 3 ..., 0 3 2] 110000000
0:00:16.866659
There doesn't seem to be much difference between using ord or a dictionary-based lookup that maps exactly A->0, C->1, etc.:
import pandas as pd
import numpy as np
bases = ['A', 'C', 'T', 'G']
rowlen = 4
nrows = 1000000
dna = pd.Series(np.random.choice(bases, nrows * rowlen).view('S%i' % rowlen))
lookup = dict(zip(bases, range(4)))
%timeit dna.apply(lambda row: map(lookup.get, row))
# 1 loops, best of 3: 785 ms per loop
%timeit dna.apply(lambda row: map(ord, row))
# 1 loops, best of 3: 713 ms per loop
Jeff's solution is also not far off in terms of performance:
%timeit pd.concat([dna.str[i].astype('category', categories=bases).cat.codes for i in range(rowlen)], axis=1)
# 1 loops, best of 3: 1.03 s per loop
A major advantage of this approach over mapping the rows to lists of ints is that the categories can then be viewed as a single (nrows, rowlen) uint8 array via the .values attribute, which could then be passed directly to pdist.
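To close the loop on the original goal, here is a sketch of mine (assuming equal-length sequences): the uint8 codes can be handed straight to scipy's pdist, whose 'hamming' metric returns the fraction of mismatching positions:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
dna = pd.Series(['ACGT', 'ACGA', 'TCGA'])
rowlen = dna.str.len().iloc[0]
# View each character as its byte value; Hamming distance only needs
# the codes to be distinct, not to lie in 0..3
codes = np.frombuffer(''.join(dna).encode('ascii'), dtype=np.uint8)
codes = codes.reshape(len(dna), rowlen)
dists = pdist(codes, metric='hamming') * rowlen  # mismatch counts: [1. 2. 1.]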

How to change the order of DataFrame columns?

I have the following DataFrame (df):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
I add more column(s) by assignment:
df['mean'] = df.mean(1)
How can I move the column mean to the front, i.e. set it as first column leaving the order of the other columns untouched?
One easy way would be to reassign the dataframe with a list of the columns, rearranged as needed.
This is what you have now:
In [6]: df
Out[6]:
0 1 2 3 4 mean
0 0.445598 0.173835 0.343415 0.682252 0.582616 0.445543
1 0.881592 0.696942 0.702232 0.696724 0.373551 0.670208
2 0.662527 0.955193 0.131016 0.609548 0.804694 0.632596
3 0.260919 0.783467 0.593433 0.033426 0.512019 0.436653
4 0.131842 0.799367 0.182828 0.683330 0.019485 0.363371
5 0.498784 0.873495 0.383811 0.699289 0.480447 0.587165
6 0.388771 0.395757 0.745237 0.628406 0.784473 0.588529
7 0.147986 0.459451 0.310961 0.706435 0.100914 0.345149
8 0.394947 0.863494 0.585030 0.565944 0.356561 0.553195
9 0.689260 0.865243 0.136481 0.386582 0.730399 0.561593
In [7]: cols = df.columns.tolist()
In [8]: cols
Out[8]: [0L, 1L, 2L, 3L, 4L, 'mean']
Rearrange cols in any way you want. This is how I moved the last element to the first position:
In [12]: cols = cols[-1:] + cols[:-1]
In [13]: cols
Out[13]: ['mean', 0L, 1L, 2L, 3L, 4L]
Then reorder the dataframe like this:
In [16]: df = df[cols]  # or df = df.loc[:, cols]
In [17]: df
Out[17]:
mean 0 1 2 3 4
0 0.445543 0.445598 0.173835 0.343415 0.682252 0.582616
1 0.670208 0.881592 0.696942 0.702232 0.696724 0.373551
2 0.632596 0.662527 0.955193 0.131016 0.609548 0.804694
3 0.436653 0.260919 0.783467 0.593433 0.033426 0.512019
4 0.363371 0.131842 0.799367 0.182828 0.683330 0.019485
5 0.587165 0.498784 0.873495 0.383811 0.699289 0.480447
6 0.588529 0.388771 0.395757 0.745237 0.628406 0.784473
7 0.345149 0.147986 0.459451 0.310961 0.706435 0.100914
8 0.553195 0.394947 0.863494 0.585030 0.565944 0.356561
9 0.561593 0.689260 0.865243 0.136481 0.386582 0.730399
You could also do something like this:
df = df[['mean', '0', '1', '2', '3']]
You can get the list of columns with:
cols = list(df.columns.values)
The output will produce:
['0', '1', '2', '3', 'mean']
...which is then easy to rearrange manually before dropping it into the first function
Just assign the column names in the order you want them:
In [39]: df
Out[39]:
0 1 2 3 4 mean
0 0.172742 0.915661 0.043387 0.712833 0.190717 1
1 0.128186 0.424771 0.590779 0.771080 0.617472 1
2 0.125709 0.085894 0.989798 0.829491 0.155563 1
3 0.742578 0.104061 0.299708 0.616751 0.951802 1
4 0.721118 0.528156 0.421360 0.105886 0.322311 1
5 0.900878 0.082047 0.224656 0.195162 0.736652 1
6 0.897832 0.558108 0.318016 0.586563 0.507564 1
7 0.027178 0.375183 0.930248 0.921786 0.337060 1
8 0.763028 0.182905 0.931756 0.110675 0.423398 1
9 0.848996 0.310562 0.140873 0.304561 0.417808 1
In [40]: df = df[['mean', 4,3,2,1]]
Now, the 'mean' column comes out in front:
In [41]: df
Out[41]:
mean 4 3 2 1
0 1 0.190717 0.712833 0.043387 0.915661
1 1 0.617472 0.771080 0.590779 0.424771
2 1 0.155563 0.829491 0.989798 0.085894
3 1 0.951802 0.616751 0.299708 0.104061
4 1 0.322311 0.105886 0.421360 0.528156
5 1 0.736652 0.195162 0.224656 0.082047
6 1 0.507564 0.586563 0.318016 0.558108
7 1 0.337060 0.921786 0.930248 0.375183
8 1 0.423398 0.110675 0.931756 0.182905
9 1 0.417808 0.304561 0.140873 0.310562
For pandas >= 1.3 (Edited in 2022):
df.insert(0, 'mean', df.pop('mean'))
How about (for Pandas < 1.3, the original answer)
df.insert(0, 'mean', df['mean'])
https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#column-selection-addition-deletion
In your case,
df = df.reindex(columns=['mean',0,1,2,3,4])
will do exactly what you want.
In my case (general form):
df = df.reindex(columns=sorted(df.columns))
df = df.reindex(columns=(['opened'] + list([a for a in df.columns if a != 'opened']) ))
import numpy as np
import pandas as pd
df = pd.DataFrame()
column_names = ['x','y','z','mean']
for col in column_names:
    df[col] = np.random.randint(0, 100, size=10000)
You can try out the following solutions :
Solution 1:
df = df[ ['mean'] + [ col for col in df.columns if col != 'mean' ] ]
Solution 2:
df = df[['mean', 'x', 'y', 'z']]
Solution 3:
col = df.pop("mean")
df.insert(0, col.name, col)  # insert works in place and returns None; assigning its result would clobber df
Solution 4:
df.set_index(df.columns[-1], inplace=True)
df.reset_index(inplace=True)
Solution 5:
cols = list(df)
cols = [cols[-1]] + cols[:-1]
df = df[cols]
Solution 6:
order = [1,2,3,0] # setting column's order
df = df[[df.columns[i] for i in order]]
Time Comparison:
Solution 1:
CPU times: user 1.05 ms, sys: 35 µs, total: 1.08 ms
Wall time: 995 µs
Solution 2:
CPU times: user 933 µs, sys: 0 ns, total: 933 µs
Wall time: 800 µs
Solution 3:
CPU times: user 0 ns, sys: 1.35 ms, total: 1.35 ms
Wall time: 1.08 ms
Solution 4:
CPU times: user 1.23 ms, sys: 45 µs, total: 1.27 ms
Wall time: 986 µs
Solution 5:
CPU times: user 1.09 ms, sys: 19 µs, total: 1.11 ms
Wall time: 949 µs
Solution 6:
CPU times: user 955 µs, sys: 34 µs, total: 989 µs
Wall time: 859 µs
You need to create a new list of your columns in the desired order, then use df = df[cols] to rearrange the columns in this new order.
cols = ['mean'] + [col for col in df if col != 'mean']
df = df[cols]
You can also use a more general approach. In this example, the last column (indicated by -1) is inserted as the first column.
cols = [df.columns[-1]] + [col for col in df if col != df.columns[-1]]
df = df[cols]
You can also use this approach for reordering columns in a desired order if they are present in the DataFrame.
inserted_cols = ['a', 'b', 'c']
cols = ([col for col in inserted_cols if col in df]
+ [col for col in df if col not in inserted_cols])
df = df[cols]
Suppose you have df with columns A B C.
The simplest way is:
df = df.reindex(['B','C','A'], axis=1)
If your column names are too-long-to-type then you could specify the new order through a list of integers with the positions:
Data:
0 1 2 3 4 mean
0 0.397312 0.361846 0.719802 0.575223 0.449205 0.500678
1 0.287256 0.522337 0.992154 0.584221 0.042739 0.485741
2 0.884812 0.464172 0.149296 0.167698 0.793634 0.491923
3 0.656891 0.500179 0.046006 0.862769 0.651065 0.543382
4 0.673702 0.223489 0.438760 0.468954 0.308509 0.422683
5 0.764020 0.093050 0.100932 0.572475 0.416471 0.389390
6 0.259181 0.248186 0.626101 0.556980 0.559413 0.449972
7 0.400591 0.075461 0.096072 0.308755 0.157078 0.207592
8 0.639745 0.368987 0.340573 0.997547 0.011892 0.471749
9 0.050582 0.714160 0.168839 0.899230 0.359690 0.438500
Generic example:
new_order = [3,2,1,4,5,0]
print(df[df.columns[new_order]])
3 2 1 4 mean 0
0 0.575223 0.719802 0.361846 0.449205 0.500678 0.397312
1 0.584221 0.992154 0.522337 0.042739 0.485741 0.287256
2 0.167698 0.149296 0.464172 0.793634 0.491923 0.884812
3 0.862769 0.046006 0.500179 0.651065 0.543382 0.656891
4 0.468954 0.438760 0.223489 0.308509 0.422683 0.673702
5 0.572475 0.100932 0.093050 0.416471 0.389390 0.764020
6 0.556980 0.626101 0.248186 0.559413 0.449972 0.259181
7 0.308755 0.096072 0.075461 0.157078 0.207592 0.400591
8 0.997547 0.340573 0.368987 0.011892 0.471749 0.639745
9 0.899230 0.168839 0.714160 0.359690 0.438500 0.050582
Although it might seem like I'm just explicitly typing the column names in a different order, the fact that there's a column 'mean' should make it clear that new_order relates to actual positions and not column names.
For the specific case of OP's question:
new_order = [-1,0,1,2,3,4]
df = df[df.columns[new_order]]
print(df)
mean 0 1 2 3 4
0 0.500678 0.397312 0.361846 0.719802 0.575223 0.449205
1 0.485741 0.287256 0.522337 0.992154 0.584221 0.042739
2 0.491923 0.884812 0.464172 0.149296 0.167698 0.793634
3 0.543382 0.656891 0.500179 0.046006 0.862769 0.651065
4 0.422683 0.673702 0.223489 0.438760 0.468954 0.308509
5 0.389390 0.764020 0.093050 0.100932 0.572475 0.416471
6 0.449972 0.259181 0.248186 0.626101 0.556980 0.559413
7 0.207592 0.400591 0.075461 0.096072 0.308755 0.157078
8 0.471749 0.639745 0.368987 0.340573 0.997547 0.011892
9 0.438500 0.050582 0.714160 0.168839 0.899230 0.359690
The main problem with this approach is that the positions are relative to the current column order, so calling the same code multiple times will keep producing different results each time - one needs to be careful :)
This question has been answered before but reindex_axis is deprecated now so I would suggest to use:
df = df.reindex(sorted(df.columns), axis=1)
For those who want to specify the order they want instead of just sorting them, here's the solution spelled out:
df = df.reindex(['the','order','you','want'], axis=1)
Now, how you want to sort the list of column names is really not a pandas question, that's a Python list manipulation question. There are many ways of doing that, and I think this answer has a very neat way of doing it.
You can reorder the dataframe columns using a list of names with:
df = df.filter(list_of_col_names)
I think this is a slightly neater solution:
df.insert(0, 'mean', df.pop("mean"))
This solution is somewhat similar to @JoeHeffer's, but it is a one-liner.
Here we remove the column "mean" from the dataframe and attach it at index 0 with the same column name.
I ran into a similar question myself, and just wanted to add what I settled on. I liked the reindex_axis() method for changing column order. This worked:
df = df.reindex_axis(['mean'] + list(df.columns[:-1]), axis=1)
An alternate method based on the comment from #Jorge:
df = df.reindex(columns=['mean'] + list(df.columns[:-1]))
Although reindex_axis seems to be slightly faster in micro benchmarks than reindex, I think I prefer the latter for its directness.
This function avoids you having to list out every variable in your dataset just to order a few of them.
def order(frame, var):
    if type(var) is str:
        var = [var]  # let the command take a string or list
    varlist = [w for w in frame.columns if w not in var]
    frame = frame[var + varlist]
    return frame
It takes two arguments, the first is the dataset, the second are the columns in the data set that you want to bring to the front.
So in my case I have a data set called Frame with variables A1, A2, B1, B2, Total and Date. If I want to bring Total to the front then all I have to do is:
frame = order(frame,['Total'])
If I want to bring Total and Date to the front then I do:
frame = order(frame,['Total','Date'])
EDIT:
Another useful way to use this is when you have an unfamiliar table and you're looking for variables with a particular term in them, like VAR1, VAR2, ...; you may execute something like:
frame = order(frame,[v for v in frame.columns if "VAR" in v])
Simply do,
df = df[['mean'] + df.columns[:-1].tolist()]
Here's a way to move one existing column that will modify the existing dataframe in place.
my_column = df.pop('column name')
df.insert(3, my_column.name, my_column) # Is in-place
Just pass the column name you want to move, and the index of its new location.
def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]
For your case, this would be like:
df = change_column_order(df, 'mean', 0)
You could do the following (borrowing parts from Aman's answer):
cols = df.columns.tolist()
cols.insert(0, cols.pop(-1))
cols
>>>['mean', 0L, 1L, 2L, 3L, 4L]
df = df[cols]
Moving any column to any position:
import pandas as pd
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})
cols = df.columns.tolist()
column_to_move = "C"
new_position = 1
cols.insert(new_position, cols.pop(cols.index(column_to_move)))
df = df[cols]
I wanted to bring two columns in front from a dataframe where I do not know exactly the names of all columns, because they are generated from a pivot statement before.
So, if you are in the same situation: To bring columns in front that you know the name of and then let them follow by "all the other columns", I came up with the following general solution:
df = df.reindex_axis(['Col1','Col2'] + list(df.columns.drop(['Col1','Col2'])), axis=1)
Here is a very simple answer to this (only one line).
You can do that after you have added the 'mean' column to your df, as follows.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
df['mean'] = df.mean(1)
df
0 1 2 3 4 mean
0 0.929616 0.316376 0.183919 0.204560 0.567725 0.440439
1 0.595545 0.964515 0.653177 0.748907 0.653570 0.723143
2 0.747715 0.961307 0.008388 0.106444 0.298704 0.424512
3 0.656411 0.809813 0.872176 0.964648 0.723685 0.805347
4 0.642475 0.717454 0.467599 0.325585 0.439645 0.518551
5 0.729689 0.994015 0.676874 0.790823 0.170914 0.672463
6 0.026849 0.800370 0.903723 0.024676 0.491747 0.449473
7 0.526255 0.596366 0.051958 0.895090 0.728266 0.559587
8 0.818350 0.500223 0.810189 0.095969 0.218950 0.488736
9 0.258719 0.468106 0.459373 0.709510 0.178053 0.414752
### here you can add the line below and it should work
# Don't forget the two brackets (()) around the column names. Otherwise, it'll give you an error.
df = df[list(('mean', 0, 1, 2, 3, 4))]
df
mean 0 1 2 3 4
0 0.440439 0.929616 0.316376 0.183919 0.204560 0.567725
1 0.723143 0.595545 0.964515 0.653177 0.748907 0.653570
2 0.424512 0.747715 0.961307 0.008388 0.106444 0.298704
3 0.805347 0.656411 0.809813 0.872176 0.964648 0.723685
4 0.518551 0.642475 0.717454 0.467599 0.325585 0.439645
5 0.672463 0.729689 0.994015 0.676874 0.790823 0.170914
6 0.449473 0.026849 0.800370 0.903723 0.024676 0.491747
7 0.559587 0.526255 0.596366 0.051958 0.895090 0.728266
8 0.488736 0.818350 0.500223 0.810189 0.095969 0.218950
9 0.414752 0.258719 0.468106 0.459373 0.709510 0.178053
You can use a set, which is an unordered collection of unique elements, to collect "the other columns" (note, however, that since sets are unordered, the remaining columns are not guaranteed to keep their original order):
other_columns = list(set(df.columns).difference(["mean"])) #[0, 1, 2, 3, 4]
Then, you can use a lambda to move a specific column to the front by:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.random.rand(10, 5))
In [4]: df["mean"] = df.mean(1)
In [5]: move_col_to_front = lambda df, col: df[[col]+list(set(df.columns).difference([col]))]
In [6]: move_col_to_front(df, "mean")
Out[6]:
mean 0 1 2 3 4
0 0.697253 0.600377 0.464852 0.938360 0.945293 0.537384
1 0.609213 0.703387 0.096176 0.971407 0.955666 0.319429
2 0.561261 0.791842 0.302573 0.662365 0.728368 0.321158
3 0.518720 0.710443 0.504060 0.663423 0.208756 0.506916
4 0.616316 0.665932 0.794385 0.163000 0.664265 0.793995
5 0.519757 0.585462 0.653995 0.338893 0.714782 0.305654
6 0.532584 0.434472 0.283501 0.633156 0.317520 0.994271
7 0.640571 0.732680 0.187151 0.937983 0.921097 0.423945
8 0.562447 0.790987 0.200080 0.317812 0.641340 0.862018
9 0.563092 0.811533 0.662709 0.396048 0.596528 0.348642
In [7]: move_col_to_front(df, 2)
Out[7]:
2 0 1 3 4 mean
0 0.938360 0.600377 0.464852 0.945293 0.537384 0.697253
1 0.971407 0.703387 0.096176 0.955666 0.319429 0.609213
2 0.662365 0.791842 0.302573 0.728368 0.321158 0.561261
3 0.663423 0.710443 0.504060 0.208756 0.506916 0.518720
4 0.163000 0.665932 0.794385 0.664265 0.793995 0.616316
5 0.338893 0.585462 0.653995 0.714782 0.305654 0.519757
6 0.633156 0.434472 0.283501 0.317520 0.994271 0.532584
7 0.937983 0.732680 0.187151 0.921097 0.423945 0.640571
8 0.317812 0.790987 0.200080 0.641340 0.862018 0.562447
9 0.396048 0.811533 0.662709 0.596528 0.348642 0.563092
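If preserving the order of the remaining columns matters, a list comprehension (as in earlier answers) is the safer building block than a set, since it keeps the columns in their original sequence:
move_col_to_front = lambda df, col: df[[col] + [c for c in df.columns if c != col]]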
Often, just flipping the column order helps:
df[df.columns[::-1]]
Or just shuffle for a look.
import random
cols = list(df.columns)
random.shuffle(cols)
df[cols]
You can use reindex which can be used for both axis:
df
# 0 1 2 3 4 mean
# 0 0.943825 0.202490 0.071908 0.452985 0.678397 0.469921
# 1 0.745569 0.103029 0.268984 0.663710 0.037813 0.363821
# 2 0.693016 0.621525 0.031589 0.956703 0.118434 0.484254
# 3 0.284922 0.527293 0.791596 0.243768 0.629102 0.495336
# 4 0.354870 0.113014 0.326395 0.656415 0.172445 0.324628
# 5 0.815584 0.532382 0.195437 0.829670 0.019001 0.478415
# 6 0.944587 0.068690 0.811771 0.006846 0.698785 0.506136
# 7 0.595077 0.437571 0.023520 0.772187 0.862554 0.538182
# 8 0.700771 0.413958 0.097996 0.355228 0.656919 0.444974
# 9 0.263138 0.906283 0.121386 0.624336 0.859904 0.555009
df.reindex(['mean', *range(5)], axis=1)
# mean 0 1 2 3 4
# 0 0.469921 0.943825 0.202490 0.071908 0.452985 0.678397
# 1 0.363821 0.745569 0.103029 0.268984 0.663710 0.037813
# 2 0.484254 0.693016 0.621525 0.031589 0.956703 0.118434
# 3 0.495336 0.284922 0.527293 0.791596 0.243768 0.629102
# 4 0.324628 0.354870 0.113014 0.326395 0.656415 0.172445
# 5 0.478415 0.815584 0.532382 0.195437 0.829670 0.019001
# 6 0.506136 0.944587 0.068690 0.811771 0.006846 0.698785
# 7 0.538182 0.595077 0.437571 0.023520 0.772187 0.862554
# 8 0.444974 0.700771 0.413958 0.097996 0.355228 0.656919
# 9 0.555009 0.263138 0.906283 0.121386 0.624336 0.859904
Hackiest method in the book
df.insert(0, "test", df["mean"])
df = df.drop(columns=["mean"]).rename(columns={"test": "mean"})
A pretty straightforward solution that worked for me is to use .reindex on df.columns:
df = df[df.columns.reindex(['mean', 0, 1, 2, 3, 4])[0]]
Here is a function to do this for any number of columns.
def mean_first(df):
    ncols = df.shape[1]          # get the number of columns
    index = list(range(ncols))   # create an index to reorder the columns
    index.insert(0, ncols)       # this puts the last column at the front
    return df.assign(mean=df.mean(1)).iloc[:, index]  # new df with the last column (mean) first
A simple approach is using set(), in particular when you have a long list of columns and do not want to handle them manually (again, note that set() does not preserve the order of the remaining columns):
cols = list(set(df.columns.tolist()) - set(['mean']))
cols.insert(0, 'mean')
df = df[cols]
How about using T?
df = df.T.reindex(['mean', 0, 1, 2, 3, 4]).T
I believe @Aman's answer is the best if you know the location of the other column.
If you don't know the location of mean, but only have its name, you cannot resort directly to cols = cols[-1:] + cols[:-1]. The following is the next-best thing I could come up with:
meanDf = pd.DataFrame(df.pop('mean'))
# now df doesn't contain "mean" anymore. Order of join will move it to left or right:
meanDf.join(df) # has mean as first column
df.join(meanDf) # has mean as last column
