Reshape a column to multiple columns - Python

I'm rather new to Python and I'm having some trouble. I have the following dataframe:
import pandas as pd
data = {'v1':('Belgium[country]', 'Antwerp[city]', 'Gent[city]', 'France[country]', 'Paris[city]', 'Marseille[city]', 'Toulouse[city]', 'Spain[country]', 'Madrid[city]')}
df = pd.DataFrame(data)
df
v1
0 Belgium[country]
1 Antwerp[city]
2 Gent[city]
3 France[country]
4 Paris[city]
5 Marseille[city]
6 Toulouse[city]
7 Spain[country]
8 Madrid[city]
I want to map this to the following format:
v1 v2
0 Belgium[country] Antwerp[city]
1 Belgium[country] Gent[city]
2 France[country] Paris[city]
3 France[country] Marseille[city]
4 France[country] Toulouse[city]
5 Spain[country] Madrid[city]
I found a way to do this using a dictionary, but since I want to maintain the order, I'm looking for a way to do this with a list or similar.
I tried it both based on the indexes and on the values themselves (specifically [country] and [city]), but I failed with both. Any help is much appreciated!

This will work:
counter = df['v1'].str.contains('country').cumsum()
result = df.groupby(counter).apply(lambda g: g[1:]).reset_index(level=1, drop=True)
result = result.rename(columns={'v1': 'v2'}).reset_index(drop=False)
result['v1'] = result['v1'].replace(df.groupby(counter).first().squeeze())
The idea is to build a counter that increments at each new country. You can then group by this counter to access the information you need.
Specifically, the first step keeps only the cities (g[1:] for each group g). Then do some renaming and reindexing. Finally, use the result of another groupby (which yields the first row, i.e. the country, of each group) to replace the values in column v1.
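To make the counter concrete, here is a minimal sketch (an editor's illustration using the question's data) of what the intermediate series looks like:
counter = df['v1'].str.contains('country').cumsum()
print(counter.tolist())
# [1, 1, 1, 2, 2, 2, 2, 3, 3]
Each '[country]' row bumps the cumulative sum, so all rows belonging to the same country share one counter value.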

Solution without groupby:
#rename columns
df = df.rename(columns={'v1':'v2'})
#get counter
counter = df.v2.str.contains('country').cumsum()
#mask is True on each country row (where the counter changes)
df.insert(0, 'v1', df.loc[counter.ne(counter.shift()), 'v2'])
#forward-fill the NaNs with the last seen country
df.v1 = df.v1.ffill()
#remove rows where v1 == v2 (the country rows themselves)
df = df[df.v1.ne(df.v2)].reset_index(drop=True)
print (df)
v1 v2
0 Belgium[country] Antwerp[city]
1 Belgium[country] Gent[city]
2 France[country] Paris[city]
3 France[country] Marseille[city]
4 France[country] Toulouse[city]
5 Spain[country] Madrid[city]
Timings:
In [189]: %timeit (jez(df))
100 loops, best of 3: 2.47 ms per loop
In [191]: %timeit (IanS(df1))
100 loops, best of 3: 5.06 ms per loop
Code for timings:
def jez(df):
    df = df.rename(columns={'v1':'v2'})
    counter = df.v2.str.contains('country').cumsum()
    df.insert(0, 'v1', df.loc[counter.ne(counter.shift()), 'v2'])
    df.v1 = df.v1.ffill()
    df = df[df.v1.ne(df.v2)].reset_index(drop=True)
    return df

def IanS(df):
    counter = df['v1'].str.contains('country').cumsum()
    result = df.groupby(counter).apply(lambda g: g[1:]).reset_index(level=1, drop=True)
    result = result.rename(columns={'v1': 'v2'}).reset_index(drop=False)
    result['v1'] = result['v1'].replace(df.groupby(counter).first().squeeze())
    return result


Pandas count NAs with a groupby for all columns [duplicate]

This question was closed as a duplicate of: Pandas count null values in a groupby function; Groupby class and count missing values in features.
This question shows how to count NAs in a dataframe for a particular column C. How do I count NAs for all columns (that aren't the groupby column)?
Here is some test code that doesn't work:
#!/usr/bin/env python3
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 1, 2, 2],
                   'b': [1, np.nan, 2, np.nan],
                   'c': [1, np.nan, 2, 3]})
# result = df.groupby('a').isna().sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method
# result = df.groupby('a').transform('isna').sum()
# AttributeError: Cannot access callable attribute 'isna' of 'DataFrameGroupBy' objects, try using the 'apply' method
result = df.isna().groupby('a').sum()
print(result)
# result:
# b c
# a
# False 2.0 1.0
result = df.groupby('a').apply(lambda _df: df.isna().sum())
print(result)
# result:
# a b c
# a
# 1 0 2 1
# 2 0 2 1
Desired output:
b c
a
1 1 1
2 1 0
It's best to avoid groupby.apply in favor of the basic cythonized functions, which scale far better with many groups and give a great increase in performance. In this case, call isnull() on the entire DataFrame first, then groupby + sum.
df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
# b c
#a
#1 1 1
#2 1 0
To illustrate the performance gain:
import pandas as pd
import numpy as np
N = 50000
df = pd.DataFrame({'a': [*range(N//2)]*2,
                   'b': np.random.choice([1, np.nan], N),
                   'c': np.random.choice([1, np.nan], N)})
%timeit df[df.columns.difference(['a'])].isnull().groupby(df.a).sum().astype(int)
#7.89 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
#9.47 s ± 111 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Your question already contains the answer (you just mistyped _df as df):
result = df.groupby('a')[['b', 'c']].apply(lambda _df: _df.isna().sum())
result
b c
a
1 1 1
2 1 0
Using apply with isna and sum, and selecting the relevant columns so we don't get the unnecessary a column:
Note: apply can be slow; it's recommended to use one of the vectorized solutions, see the answers of WenYoBen, Anky or ALollz.
df.groupby('a')[['b', 'c']].apply(lambda x: x.isna().sum())
Output
b c
a
1 1 1
2 1 0
Another way would be set_index() on a and groupby on the index and sum:
df.set_index('a').isna().groupby(level=0).sum()*1
Or:
df.set_index('a').isna().groupby(level=0).sum().astype(int)
Or without groupby, courtesy of #WenYoBen (note that sum(level=...) has since been deprecated in favor of the groupby form):
df.set_index('a').isna().sum(level=0).astype(int)
b c
a
1 1 1
2 1 0
I will use count and then subtract it from value_counts; I avoided apply because it usually has poor performance:
df.groupby('a')[['b','c']].count().rsub(df.a.value_counts(dropna=False),axis=0)
Out[78]:
b c
1 1 1
2 1 0
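To see why this works (an editor's note): value_counts(dropna=False) on a gives each group's total row count, while count() gives the per-group non-NA count, so their difference is exactly the number of NAs:
sizes = df.a.value_counts(dropna=False)       # rows per group
non_na = df.groupby('a')[['b', 'c']].count()  # non-NA cells per group
na_counts = non_na.rsub(sizes, axis=0)        # size - non_na = NA count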
Alternative
df.isna().drop('a',1).astype(int).groupby(df['a']).sum()
Out[83]:
b c
a
1 1 1
2 1 0
You need to drop the column after using apply.
df.groupby('a').apply(lambda x: x.isna().sum()).drop('a',1)
Output:
b c
a
1 1 1
2 1 0
Another quick-and-dirty option:
df.set_index('a').isna().astype(int).groupby(level=0).sum()
Output:
b c
a
1 1 1
2 1 0
You could write your own aggregation function as follows:
df.groupby('a').agg(lambda x: x.isna().sum())
which results in
b c
a
1 1.0 1.0
2 1.0 0.0
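The float dtype comes from the lambda aggregation; to get integer counts like the other answers, append .astype(int):
df.groupby('a').agg(lambda x: x.isna().sum()).astype(int)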

pandas rolling max with groupby

I have a problem getting the rolling function of Pandas to do what I wish. For each row, I want to calculate the maximum so far within its group. Here is an example:
df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])
which looks like:
id value
0 1 3
1 1 6
2 1 3
3 2 2
4 2 1
Now I wish to obtain the following DataFrame:
id value
0 1 3
1 1 6
2 1 6
3 2 2
4 2 2
The problem is that when I do
df.groupby('id')['value'].rolling(1).max()
I get the same DataFrame back. And when I do
df.groupby('id')['value'].rolling(3).max()
I get a DataFrame with Nans. Can someone explain how to properly use rolling or some other Pandas function to obtain the DataFrame I want?
It looks like you need cummax() instead of .rolling(N).max()
In [29]: df['new'] = df.groupby('id').value.cummax()
In [30]: df
Out[30]:
id value new
0 1 3 3
1 1 6 6
2 1 3 6
3 2 2 2
4 2 1 2
Timing (using brand new Pandas version 0.20.1):
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [4]: df.shape
Out[4]: (50000, 2)
In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop
In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop
NOTE: from Pandas 0.20.0 what's new
Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
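If you specifically want the window machinery the question tried, an expanding window gives the same running maximum, at the cost of a MultiIndex you then have to drop (an editor's sketch, not from the original answer):
df['new'] = (df.groupby('id')['value']
               .expanding().max()
               .reset_index(level=0, drop=True))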
In this benchmark, using apply comes out a tiny bit faster:
# Using apply
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop
Other method:
df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop

python pandas - input values into new column

I have a small dataframe below with the spending of 4 people.
There is an empty column called 'Grade'.
I would like to rate those who spent more than $100 as grade A, and everyone else as grade B.
What is the most efficient way to fill in the 'Grade' column, assuming it is a big dataframe?
import pandas as pd
df = pd.DataFrame({'Customer': ['Bob','Ken','Steve','Joe'],
                   'Spending': [130,22,313,46]})
df['Grade'] = ''
You can use numpy.where:
import numpy as np
df['Grade'] = np.where(df['Spending'] > 100, 'A', 'B')
print (df)
Customer Spending Grade
0 Bob 130 A
1 Ken 22 B
2 Steve 313 A
3 Joe 46 B
Timings:
df = pd.DataFrame({'Customer': ['Bob','Ken','Steve','Joe'],
                   'Spending': [130,22,313,46]})
# 400000 rows
df = pd.concat([df]*100000).reset_index(drop=True)
In [129]: %timeit df['Grade']= np.where(df['Spending'] > 100 ,'A','B')
10 loops, best of 3: 21.6 ms per loop
In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)
1 loop, best of 3: 7.08 s per loop
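If you later need more than two grades, numpy.where's cousin numpy.select generalizes the same idea (a sketch with assumed cut-offs):
conditions = [df['Spending'] > 200, df['Spending'] > 100]
df['Grade'] = np.select(conditions, ['A+', 'A'], default='B')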
Another option is a lambda with apply, although as the timings above show, it is far slower on large dataframes:
df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)

cogroup like operation for pandas

I was trying to use pandas to analyze a fairly large data set (~5 GB). I wanted to divide the data set into groups, then perform a Cartesian product on each group, and then aggregate the result.
The apply operation of pandas is quite expressive: I could first group, then do the Cartesian product on each group using apply, and then aggregate the result using sum. The problem with this approach, however, is that apply is not lazy; it computes all the intermediate results before the aggregation, and the intermediate results (the Cartesian product of each group) are very large.
I was looking at Apache Spark and found one very interesting operator called cogroup. The definition is here:
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
This seems to be exactly what I want. If I could first cogroup and then do a sum, the intermediate results wouldn't be expanded (assuming cogroup works in the same lazy fashion as group).
Is there an operation similar to cogroup in pandas, or is there another way to achieve my goal efficiently?
Here is my example:
I want to group the data by id, and then do a Cartesian product for each group, and then group by cluster_x and cluster_y and aggregate the count_x and count_y using sum. The following code works, but is extremely slow and consumes too much memory.
# add dummy_key to do the Cartesian product via merge
df['dummy_key'] = 1

def join_group(g):
    return pandas.merge(g, g, on='dummy_key')\
        [['cache_cluster_x', 'count_x', 'cache_cluster_y', 'count_y']]

df_count_stats = df.groupby(['id'], as_index=True).apply(join_group)\
    .groupby(['cache_cluster_x', 'cache_cluster_y'], as_index=False)\
    [['count_x', 'count_y']].sum()
A toy data set
id cluster count
0 i1 A 2
1 i1 B 3
2 i2 A 1
3 i2 B 4
Intermediate result after the apply (can be large)
cluster_x count_x cluster_y count_y
id
i1 0 A 2 A 2
1 A 2 B 3
2 B 3 A 2
3 B 3 B 3
i2 0 A 1 A 1
1 A 1 B 4
2 B 4 A 1
3 B 4 B 4
The desired final result
cluster_x cluster_y count_x count_y
0 A A 3 3
1 A B 3 7
2 B A 7 3
3 B B 7 7
My first attempt failed, sort of: while I was able to limit the memory use (by summing over the Cartesian product within each group), it was considerably slower than the original. But for your particular desired output, I think we can simplify the problem considerably: since each id here contributes one row per cluster, the final sums depend only on the per-cluster totals, so we can aggregate first and take the Cartesian product of the much smaller aggregated frame:
import numpy as np, pandas as pd

def fake_data(nids, nclusters, ntile):
    ids = ["i{}".format(i) for i in range(1, nids+1)]
    clusters = ["A{}".format(i) for i in range(nclusters)]
    df = pd.DataFrame(index=pd.MultiIndex.from_product([ids, clusters], names=["id", "cluster"]))
    df = df.reset_index()
    df = pd.concat([df]*ntile)
    df["count"] = np.random.randint(0, 10, size=len(df))
    return df

def join_group(g):
    m = pd.merge(g, g, on='dummy_key')
    return m[['cluster_x', 'count_x', 'cluster_y', 'count_y']]

def old_method(df):
    df["dummy_key"] = 1
    h1 = df.groupby(['id'], as_index=True).apply(join_group)
    h2 = h1.groupby(['cluster_x', 'cluster_y'], as_index=False)
    h3 = h2[['count_x', 'count_y']].sum()
    return h3

def new_method1(df):
    m1 = df.groupby("cluster", as_index=False)["count"].sum()
    m1["dummy_key"] = 1
    m2 = m1.merge(m1, on="dummy_key")
    m2 = m2.sort_index(axis=1).drop(["dummy_key"], axis=1)
    return m2
which gives (with df as your toy frame):
>>> new_method1(df)
cluster_x cluster_y count_x count_y
0 A A 3 3
1 A B 3 7
2 B A 7 3
3 B B 7 7
>>> df2 = fake_data(100, 100, 1)
>>> %timeit old_method(df2)
1 loops, best of 3: 954 ms per loop
>>> %timeit new_method1(df2)
100 loops, best of 3: 8.58 ms per loop
>>> (old_method(df2) == new_method1(df2)).all().all()
True
and even
>>> df2 = fake_data(100, 100, 100)
>>> %timeit new_method1(df2)
10 loops, best of 3: 88.8 ms per loop
Whether this will be enough of an improvement to handle your actual case, I'm not sure.
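If you really want a cogroup-style primitive, you can emulate one by grouping each frame on the same key and walking the keys in step. This is an editor's sketch, not part of the original answer; it still materializes the groups themselves, just never their Cartesian product:
def cogroup(df1, df2, key):
    # yield (key, group1, group2) triples, like Spark's cogroup;
    # a group missing from one side comes back as None
    g1 = dict(tuple(df1.groupby(key)))
    g2 = dict(tuple(df2.groupby(key)))
    for k in sorted(set(g1) | set(g2)):
        yield k, g1.get(k), g2.get(k)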

How to change the order of DataFrame columns?

I have the following DataFrame (df):
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
I add more column(s) by assignment:
df['mean'] = df.mean(1)
How can I move the column mean to the front, i.e. set it as first column leaving the order of the other columns untouched?
One easy way would be to reassign the dataframe with a list of the columns, rearranged as needed.
This is what you have now:
In [6]: df
Out[6]:
0 1 2 3 4 mean
0 0.445598 0.173835 0.343415 0.682252 0.582616 0.445543
1 0.881592 0.696942 0.702232 0.696724 0.373551 0.670208
2 0.662527 0.955193 0.131016 0.609548 0.804694 0.632596
3 0.260919 0.783467 0.593433 0.033426 0.512019 0.436653
4 0.131842 0.799367 0.182828 0.683330 0.019485 0.363371
5 0.498784 0.873495 0.383811 0.699289 0.480447 0.587165
6 0.388771 0.395757 0.745237 0.628406 0.784473 0.588529
7 0.147986 0.459451 0.310961 0.706435 0.100914 0.345149
8 0.394947 0.863494 0.585030 0.565944 0.356561 0.553195
9 0.689260 0.865243 0.136481 0.386582 0.730399 0.561593
In [7]: cols = df.columns.tolist()
In [8]: cols
Out[8]: [0L, 1L, 2L, 3L, 4L, 'mean']
Rearrange cols in any way you want. This is how I moved the last element to the first position:
In [12]: cols = cols[-1:] + cols[:-1]
In [13]: cols
Out[13]: ['mean', 0L, 1L, 2L, 3L, 4L]
Then reorder the dataframe like this:
In [16]: df = df[cols]  # OR df = df.loc[:, cols] (.ix is deprecated and has been removed)
In [17]: df
Out[17]:
mean 0 1 2 3 4
0 0.445543 0.445598 0.173835 0.343415 0.682252 0.582616
1 0.670208 0.881592 0.696942 0.702232 0.696724 0.373551
2 0.632596 0.662527 0.955193 0.131016 0.609548 0.804694
3 0.436653 0.260919 0.783467 0.593433 0.033426 0.512019
4 0.363371 0.131842 0.799367 0.182828 0.683330 0.019485
5 0.587165 0.498784 0.873495 0.383811 0.699289 0.480447
6 0.588529 0.388771 0.395757 0.745237 0.628406 0.784473
7 0.345149 0.147986 0.459451 0.310961 0.706435 0.100914
8 0.553195 0.394947 0.863494 0.585030 0.565944 0.356561
9 0.561593 0.689260 0.865243 0.136481 0.386582 0.730399
You could also do something like this:
df = df[['mean', '0', '1', '2', '3']]
You can get the list of columns with:
cols = list(df.columns.values)
The output will produce:
['0', '1', '2', '3', 'mean']
...which is then easy to rearrange manually before using it to reindex the dataframe, as in the first example.
Just select the column names in the order you want them:
In [39]: df
Out[39]:
0 1 2 3 4 mean
0 0.172742 0.915661 0.043387 0.712833 0.190717 1
1 0.128186 0.424771 0.590779 0.771080 0.617472 1
2 0.125709 0.085894 0.989798 0.829491 0.155563 1
3 0.742578 0.104061 0.299708 0.616751 0.951802 1
4 0.721118 0.528156 0.421360 0.105886 0.322311 1
5 0.900878 0.082047 0.224656 0.195162 0.736652 1
6 0.897832 0.558108 0.318016 0.586563 0.507564 1
7 0.027178 0.375183 0.930248 0.921786 0.337060 1
8 0.763028 0.182905 0.931756 0.110675 0.423398 1
9 0.848996 0.310562 0.140873 0.304561 0.417808 1
In [40]: df = df[['mean', 4,3,2,1]]
Now the 'mean' column comes out in front (note that column 0 was omitted from the list, and is therefore dropped):
In [41]: df
Out[41]:
mean 4 3 2 1
0 1 0.190717 0.712833 0.043387 0.915661
1 1 0.617472 0.771080 0.590779 0.424771
2 1 0.155563 0.829491 0.989798 0.085894
3 1 0.951802 0.616751 0.299708 0.104061
4 1 0.322311 0.105886 0.421360 0.528156
5 1 0.736652 0.195162 0.224656 0.082047
6 1 0.507564 0.586563 0.318016 0.558108
7 1 0.337060 0.921786 0.930248 0.375183
8 1 0.423398 0.110675 0.931756 0.182905
9 1 0.417808 0.304561 0.140873 0.310562
For pandas >= 1.3 (Edited in 2022):
df.insert(0, 'mean', df.pop('mean'))
How about this (the original answer, for Pandas < 1.3):
df.insert(0, 'mean', df['mean'])  # note: raises ValueError if a 'mean' column already exists
https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#column-selection-addition-deletion
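As a quick note on why the pop/insert pattern works: pop removes the column and returns it as a Series, and insert splices it back in at the given position, modifying the frame in place:
col = df.pop('mean')       # df no longer contains 'mean'
df.insert(0, 'mean', col)  # re-inserts it at position 0; returns None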
In your case,
df = df.reindex(columns=['mean',0,1,2,3,4])
will do exactly what you want.
In my case (general form):
df = df.reindex(columns=sorted(df.columns))
df = df.reindex(columns=['opened'] + [a for a in df.columns if a != 'opened'])
import numpy as np
import pandas as pd
df = pd.DataFrame()
column_names = ['x','y','z','mean']
for col in column_names:
    df[col] = np.random.randint(0, 100, size=10000)
You can try the following solutions:
Solution 1:
df = df[ ['mean'] + [ col for col in df.columns if col != 'mean' ] ]
Solution 2:
df = df[['mean', 'x', 'y', 'z']]
Solution 3:
col = df.pop("mean")
df.insert(0, col.name, col)  # insert modifies df in place and returns None, so don't reassign
Solution 4:
df.set_index(df.columns[-1], inplace=True)
df.reset_index(inplace=True)
Solution 5:
cols = list(df)
cols = [cols[-1]] + cols[:-1]
df = df[cols]
Solution 6:
order = [3,0,1,2] # column positions; this moves 'mean' (position 3) to the front
df = df[[df.columns[i] for i in order]]
Time Comparison:
Solution 1:
CPU times: user 1.05 ms, sys: 35 µs, total: 1.08 ms
Wall time: 995 µs
Solution 2:
CPU times: user 933 µs, sys: 0 ns, total: 933 µs
Wall time: 800 µs
Solution 3:
CPU times: user 0 ns, sys: 1.35 ms, total: 1.35 ms
Wall time: 1.08 ms
Solution 4:
CPU times: user 1.23 ms, sys: 45 µs, total: 1.27 ms
Wall time: 986 µs
Solution 5:
CPU times: user 1.09 ms, sys: 19 µs, total: 1.11 ms
Wall time: 949 µs
Solution 6:
CPU times: user 955 µs, sys: 34 µs, total: 989 µs
Wall time: 859 µs
You need to create a new list of your columns in the desired order, then use df = df[cols] to rearrange the columns in this new order.
cols = ['mean'] + [col for col in df if col != 'mean']
df = df[cols]
You can also use a more general approach. In this example, the last column (indicated by -1) is inserted as the first column.
cols = [df.columns[-1]] + [col for col in df if col != df.columns[-1]]
df = df[cols]
You can also use this approach for reordering columns in a desired order if they are present in the DataFrame.
inserted_cols = ['a', 'b', 'c']
cols = ([col for col in inserted_cols if col in df]
+ [col for col in df if col not in inserted_cols])
df = df[cols]
Suppose you have a df with columns A, B, and C.
The simplest way is:
df = df.reindex(['B','C','A'], axis=1)
If your column names are too long to type, you can instead specify the new order through a list of integers giving the positions:
Data:
0 1 2 3 4 mean
0 0.397312 0.361846 0.719802 0.575223 0.449205 0.500678
1 0.287256 0.522337 0.992154 0.584221 0.042739 0.485741
2 0.884812 0.464172 0.149296 0.167698 0.793634 0.491923
3 0.656891 0.500179 0.046006 0.862769 0.651065 0.543382
4 0.673702 0.223489 0.438760 0.468954 0.308509 0.422683
5 0.764020 0.093050 0.100932 0.572475 0.416471 0.389390
6 0.259181 0.248186 0.626101 0.556980 0.559413 0.449972
7 0.400591 0.075461 0.096072 0.308755 0.157078 0.207592
8 0.639745 0.368987 0.340573 0.997547 0.011892 0.471749
9 0.050582 0.714160 0.168839 0.899230 0.359690 0.438500
Generic example:
new_order = [3,2,1,4,5,0]
print(df[df.columns[new_order]])
3 2 1 4 mean 0
0 0.575223 0.719802 0.361846 0.449205 0.500678 0.397312
1 0.584221 0.992154 0.522337 0.042739 0.485741 0.287256
2 0.167698 0.149296 0.464172 0.793634 0.491923 0.884812
3 0.862769 0.046006 0.500179 0.651065 0.543382 0.656891
4 0.468954 0.438760 0.223489 0.308509 0.422683 0.673702
5 0.572475 0.100932 0.093050 0.416471 0.389390 0.764020
6 0.556980 0.626101 0.248186 0.559413 0.449972 0.259181
7 0.308755 0.096072 0.075461 0.157078 0.207592 0.400591
8 0.997547 0.340573 0.368987 0.011892 0.471749 0.639745
9 0.899230 0.168839 0.714160 0.359690 0.438500 0.050582
Although it might seem like I'm just explicitly typing the column names in a different order, the fact that there's a column 'mean' should make it clear that new_order relates to actual positions and not column names.
For the specific case of OP's question:
new_order = [-1,0,1,2,3,4]
df = df[df.columns[new_order]]
print(df)
mean 0 1 2 3 4
0 0.500678 0.397312 0.361846 0.719802 0.575223 0.449205
1 0.485741 0.287256 0.522337 0.992154 0.584221 0.042739
2 0.491923 0.884812 0.464172 0.149296 0.167698 0.793634
3 0.543382 0.656891 0.500179 0.046006 0.862769 0.651065
4 0.422683 0.673702 0.223489 0.438760 0.468954 0.308509
5 0.389390 0.764020 0.093050 0.100932 0.572475 0.416471
6 0.449972 0.259181 0.248186 0.626101 0.556980 0.559413
7 0.207592 0.400591 0.075461 0.096072 0.308755 0.157078
8 0.471749 0.639745 0.368987 0.340573 0.997547 0.011892
9 0.438500 0.050582 0.714160 0.168839 0.899230 0.359690
The main problem with this approach is that calling the same code multiple times will create different results each time, so one needs to be careful :)
This question has been answered before, but reindex_axis is deprecated now, so I would suggest using:
df = df.reindex(sorted(df.columns), axis=1)
For those who want to specify the order they want instead of just sorting them, here's the solution spelled out:
df = df.reindex(['the','order','you','want'], axis=1)
Now, how you want to sort the list of column names is really not a pandas question, that's a Python list manipulation question. There are many ways of doing that, and I think this answer has a very neat way of doing it.
You can reorder the dataframe columns using a list of names with:
df = df.filter(list_of_col_names)
I think this is a slightly neater solution:
df.insert(0, 'mean', df.pop("mean"))
This solution is somewhat similar to #JoeHeffer's solution, but it is a one-liner.
Here we pop the column "mean" out of the dataframe and insert it back at position 0, keeping the same column name.
I ran into a similar question myself, and just wanted to add what I settled on. I liked the reindex_axis() method for changing column order. This worked:
df = df.reindex_axis(['mean'] + list(df.columns[:-1]), axis=1)
An alternate method based on the comment from #Jorge:
df = df.reindex(columns=['mean'] + list(df.columns[:-1]))
Although reindex_axis seems to be slightly faster in micro benchmarks than reindex, I think I prefer the latter for its directness.
This function saves you from having to list out every variable in your dataset just to reorder a few of them.
def order(frame, var):
    if type(var) is str:
        var = [var]  # let the command take a string or a list
    varlist = [w for w in frame.columns if w not in var]
    frame = frame[var + varlist]
    return frame
It takes two arguments, the first is the dataset, the second are the columns in the data set that you want to bring to the front.
So in my case I have a data set called Frame with variables A1, A2, B1, B2, Total and Date. If I want to bring Total to the front then all I have to do is:
frame = order(frame,['Total'])
If I want to bring Total and Date to the front then I do:
frame = order(frame,['Total','Date'])
EDIT:
Another useful way to use this: if you have an unfamiliar table and you're looking for variables with a particular term in their names, like VAR1, VAR2, ..., you may execute something like:
frame = order(frame,[v for v in frame.columns if "VAR" in v])
Simply do:
df = df[['mean'] + df.columns[:-1].tolist()]
Here's a way to move one existing column that will modify the existing dataframe in place.
my_column = df.pop('column name')
df.insert(3, my_column.name, my_column) # Is in-place
Just pass the column name you want to move and the index of its new location.
def change_column_order(df, col_name, index):
    cols = df.columns.tolist()
    cols.remove(col_name)
    cols.insert(index, col_name)
    return df[cols]
For your case, this would be like:
df = change_column_order(df, 'mean', 0)
You could do the following (borrowing parts from Aman's answer):
cols = df.columns.tolist()
cols.insert(0, cols.pop(-1))
cols
>>>['mean', 0L, 1L, 2L, 3L, 4L]
df = df[cols]
Moving any column to any position:
import pandas as pd
df = pd.DataFrame({"A": [1,2,3],
"B": [2,4,8],
"C": [5,5,5]})
cols = df.columns.tolist()
column_to_move = "C"
new_position = 1
cols.insert(new_position, cols.pop(cols.index(column_to_move)))
df = df[cols]
I wanted to bring two columns to the front of a dataframe whose column names I did not know exactly, because they were generated by a preceding pivot statement.
So, if you are in the same situation (bring columns you know the names of to the front, followed by "all the other columns"), I came up with the following general solution:
df = df.reindex_axis(['Col1','Col2'] + list(df.columns.drop(['Col1','Col2'])), axis=1)
(On modern pandas, where reindex_axis has been removed, use df.reindex(columns=...) instead.)
Here is a very simple answer (only one line).
You can do this after you have added the 'mean' column to your df, as follows.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
df['mean'] = df.mean(1)
df
0 1 2 3 4 mean
0 0.929616 0.316376 0.183919 0.204560 0.567725 0.440439
1 0.595545 0.964515 0.653177 0.748907 0.653570 0.723143
2 0.747715 0.961307 0.008388 0.106444 0.298704 0.424512
3 0.656411 0.809813 0.872176 0.964648 0.723685 0.805347
4 0.642475 0.717454 0.467599 0.325585 0.439645 0.518551
5 0.729689 0.994015 0.676874 0.790823 0.170914 0.672463
6 0.026849 0.800370 0.903723 0.024676 0.491747 0.449473
7 0.526255 0.596366 0.051958 0.895090 0.728266 0.559587
8 0.818350 0.500223 0.810189 0.095969 0.218950 0.488736
9 0.258719 0.468106 0.459373 0.709510 0.178053 0.414752
### here you can add the line below and it should work
# Note the double brackets: the indexer must be a list of column names
# (here built with list(...) from a tuple); otherwise it'll give you an error.
df = df[list(('mean', 0, 1, 2, 3, 4))]
df
mean 0 1 2 3 4
0 0.440439 0.929616 0.316376 0.183919 0.204560 0.567725
1 0.723143 0.595545 0.964515 0.653177 0.748907 0.653570
2 0.424512 0.747715 0.961307 0.008388 0.106444 0.298704
3 0.805347 0.656411 0.809813 0.872176 0.964648 0.723685
4 0.518551 0.642475 0.717454 0.467599 0.325585 0.439645
5 0.672463 0.729689 0.994015 0.676874 0.790823 0.170914
6 0.449473 0.026849 0.800370 0.903723 0.024676 0.491747
7 0.559587 0.526255 0.596366 0.051958 0.895090 0.728266
8 0.488736 0.818350 0.500223 0.810189 0.095969 0.218950
9 0.414752 0.258719 0.468106 0.459373 0.709510 0.178053
You can use a set, an unordered collection of unique elements, to gather "the other columns". Note, however, that since a set is unordered, this does not actually keep the order of the other columns untouched; they may come back in any order:
other_columns = list(set(df.columns).difference(["mean"])) #e.g. [0, 1, 2, 3, 4]
Then, you can use a lambda to move a specific column to the front:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(np.random.rand(10, 5))
In [4]: df["mean"] = df.mean(1)
In [5]: move_col_to_front = lambda df, col: df[[col]+list(set(df.columns).difference([col]))]
In [6]: move_col_to_front(df, "mean")
Out[6]:
mean 0 1 2 3 4
0 0.697253 0.600377 0.464852 0.938360 0.945293 0.537384
1 0.609213 0.703387 0.096176 0.971407 0.955666 0.319429
2 0.561261 0.791842 0.302573 0.662365 0.728368 0.321158
3 0.518720 0.710443 0.504060 0.663423 0.208756 0.506916
4 0.616316 0.665932 0.794385 0.163000 0.664265 0.793995
5 0.519757 0.585462 0.653995 0.338893 0.714782 0.305654
6 0.532584 0.434472 0.283501 0.633156 0.317520 0.994271
7 0.640571 0.732680 0.187151 0.937983 0.921097 0.423945
8 0.562447 0.790987 0.200080 0.317812 0.641340 0.862018
9 0.563092 0.811533 0.662709 0.396048 0.596528 0.348642
In [7]: move_col_to_front(df, 2)
Out[7]:
2 0 1 3 4 mean
0 0.938360 0.600377 0.464852 0.945293 0.537384 0.697253
1 0.971407 0.703387 0.096176 0.955666 0.319429 0.609213
2 0.662365 0.791842 0.302573 0.728368 0.321158 0.561261
3 0.663423 0.710443 0.504060 0.208756 0.506916 0.518720
4 0.163000 0.665932 0.794385 0.664265 0.793995 0.616316
5 0.338893 0.585462 0.653995 0.714782 0.305654 0.519757
6 0.633156 0.434472 0.283501 0.317520 0.994271 0.532584
7 0.937983 0.732680 0.187151 0.921097 0.423945 0.640571
8 0.317812 0.790987 0.200080 0.641340 0.862018 0.562447
9 0.396048 0.811533 0.662709 0.596528 0.348642 0.563092
Often, simply reversing the column order helps:
df[df.columns[::-1]]
Or shuffle them for a quick look:
import random
cols = list(df.columns)
random.shuffle(cols)
df[cols]
You can use reindex which can be used for both axis:
df
# 0 1 2 3 4 mean
# 0 0.943825 0.202490 0.071908 0.452985 0.678397 0.469921
# 1 0.745569 0.103029 0.268984 0.663710 0.037813 0.363821
# 2 0.693016 0.621525 0.031589 0.956703 0.118434 0.484254
# 3 0.284922 0.527293 0.791596 0.243768 0.629102 0.495336
# 4 0.354870 0.113014 0.326395 0.656415 0.172445 0.324628
# 5 0.815584 0.532382 0.195437 0.829670 0.019001 0.478415
# 6 0.944587 0.068690 0.811771 0.006846 0.698785 0.506136
# 7 0.595077 0.437571 0.023520 0.772187 0.862554 0.538182
# 8 0.700771 0.413958 0.097996 0.355228 0.656919 0.444974
# 9 0.263138 0.906283 0.121386 0.624336 0.859904 0.555009
df.reindex(['mean', *range(5)], axis=1)
# mean 0 1 2 3 4
# 0 0.469921 0.943825 0.202490 0.071908 0.452985 0.678397
# 1 0.363821 0.745569 0.103029 0.268984 0.663710 0.037813
# 2 0.484254 0.693016 0.621525 0.031589 0.956703 0.118434
# 3 0.495336 0.284922 0.527293 0.791596 0.243768 0.629102
# 4 0.324628 0.354870 0.113014 0.326395 0.656415 0.172445
# 5 0.478415 0.815584 0.532382 0.195437 0.829670 0.019001
# 6 0.506136 0.944587 0.068690 0.811771 0.006846 0.698785
# 7 0.538182 0.595077 0.437571 0.023520 0.772187 0.862554
# 8 0.444974 0.700771 0.413958 0.097996 0.355228 0.656919
# 9 0.555009 0.263138 0.906283 0.121386 0.624336 0.859904
Hackiest method in the book
df.insert(0, "test", df["mean"])
df = df.drop(columns=["mean"]).rename(columns={"test": "mean"})
A pretty straightforward solution that worked for me is to use .reindex on df.columns:
df = df[df.columns.reindex(['mean', 0, 1, 2, 3, 4])[0]]
Here is a function to do this for any number of columns.
def mean_first(df):
    ncols = df.shape[1]          # number of columns before adding 'mean'
    index = list(range(ncols))   # an index to reorder the columns
    index.insert(0, ncols)       # puts the last column (the new 'mean') first
    return df.assign(mean=df.mean(1)).iloc[:, index]  # new df with mean first
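Usage would then be, for example (an editor's sketch reusing the question's setup):
df = mean_first(pd.DataFrame(np.random.rand(10, 5)))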
A simple approach is using set(), in particular when you have a long list of columns and do not want to handle them manually; note, though, that a set does not preserve the order of the remaining columns:
cols = list(set(df.columns.tolist()) - set(['mean']))
cols.insert(0, 'mean')
df = df[cols]
How about using T?
df = df.T.reindex(['mean', 0, 1, 2, 3, 4]).T
(Keep in mind that transposing copies the data and can upcast mixed dtypes, so this is best suited to small, homogeneous frames.)
I believe #Aman's answer is the best if you know the position of the column you want to move.
If you don't know the position of mean, but only have its name, you cannot resort directly to cols = cols[-1:] + cols[:-1]. The following is the next-best thing I could come up with:
meanDf = pd.DataFrame(df.pop('mean'))
# now df doesn't contain "mean" anymore. Order of join will move it to left or right:
meanDf.join(df) # has mean as first column
df.join(meanDf) # has mean as last column
