Index match in Pandas? - python

I am trying to match a value based on its row and column keys. In Excel I have used INDEX & MATCH to fetch the correct values, but I am struggling to do the same in Pandas.
Example:
I want to add the highlighted value (saved in df2) to my df['Cost'] column.
I have got df['Weight'] & df['Country'] as keys but I don't know how to use them to look up the highlighted value in df2.
How can I fetch the yellow value into df3['Postage'], which I can then add to my df['Cost'] column?
I hope this makes sense. Let me know if I should provide more info.
Edit - more info (sorry, I could not figure out how to copy the output from Jupyter):
When I run [93] I get the following error:
ValueError: Row labels must have same size as column labels
Thanks!

To get the highlighted value 1.75, simply use:
df2.loc[df2['Country']=='B', 3]
So generalizing the above and using country-weight key pairs from df1:
cost = []
for i in range(df1.shape[0]):
    country = df1.loc[i, 'Country']
    weight = df1.loc[i, 'Weight']
    cost.append(df2.loc[df2['Country'] == country, weight].values[0])
df1['Cost'] = cost
Or much better:
df1['Cost'] = df1.apply(lambda x: df2.loc[df2['Country'] == x['Country'], x['Weight']].values[0], axis=1)
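For example, with made-up data shaped like the screenshots (the weight column labels in df2 are assumptions here), the lookup would look like this:
import pandas as pd

# hypothetical stand-ins for the screenshots
df1 = pd.DataFrame({'Country': ['A', 'B', 'B'], 'Weight': [2, 3, 1]})
df2 = pd.DataFrame({'Country': ['A', 'B'],
                    1: [1.00, 1.20],
                    2: [1.50, 1.60],
                    3: [1.70, 1.75]})

df1['Cost'] = df1.apply(
    lambda x: df2.loc[df2['Country'] == x['Country'], x['Weight']].values[0],
    axis=1)
print(df1)
#   Country  Weight  Cost
# 0       A       2  1.50
# 1       B       3  1.75
# 2       B       1  1.20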

For your case, use the following (note that [0] is needed to index into the underlying array):
row = df1.iloc[1]
df2[df2.Country == row.Country][row.Weight].values[0]
Hope this helps with .iloc and .loc
d = {chr(ord('A')+r):[c+r*10 for c in range(5)] for r in range(5)}
df = pd.DataFrame(d).transpose()
df.columns=['a','b','c','d','e']
print(df)
print("--------")
print(df.loc['B']['c'])
print(df.iloc[1][2])
output
a b c d e
A 0 1 2 3 4
B 10 11 12 13 14
C 20 21 22 23 24
D 30 31 32 33 34
E 40 41 42 43 44
--------
12
12

Related

Pandas: Search and match based on two conditions

I am using the code below to search a .csv file, match a column in both files, and grab a different column that I want and add it as a new column. However, I am trying to make the match based on two columns instead of one. Is there a way to do this?
import pandas as pd
df1 = pd.read_csv("matchone.csv")
df2 = pd.read_csv("comingfrom.csv")
def lookup_prod(ip):
    for row in df2.itertuples():
        if ip in row[1]:
            return row[3]
    else:
        return '0'
df1['want'] = df1['name'].apply(lookup_prod)
df1[df1.want != '0']
print(df1)
#df1.to_csv('file_name.csv')
The code above makes a search based on the column 'name' (which has the same name in both files) and gets the column I request ([3]) from df2. I want the code to match on both the 'name' column and another column, 'price', and only take the value from ([3]) when both columns match in df1 and df2.
df 1 :
name price value
a 10 35
b 10 21
c 10 33
d 10 20
e 10 88
df 2 :
name price want
a 10 123
b 5 222
c 10 944
d 10 104
e 5 213
When the code is run (asking for the want column from df2, matching only on df1 name = df2 name), the produced result is:
name price value want
a 10 35 123
b 10 21 222
c 10 33 944
d 10 20 104
e 10 88 213
However, what I want is: only if both df1 name = df2 name and df1 price = df2 price, take the value from the df2 want column, so the desired result is:
name price value want
a 10 35 123
b 10 21 0
c 10 33 944
d 10 20 104
e 10 88 0
You need to use pandas.DataFrame.merge() method with multiple keys:
df1.merge(df2, on=['name','price'], how='left').fillna(0)
The merge represents missing values as NaN, so the column's dtype changes to float64, but you can change it back after filling the missing values with 0.
Also please be aware that duplicated combinations of name and price in df2 will appear several times in the result.
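A minimal sketch of that dtype round-trip, using the frames from the question:
import pandas as pd

df1 = pd.DataFrame({'name': list('abcde'), 'price': [10] * 5,
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': list('abcde'), 'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})

merged = df1.merge(df2, on=['name', 'price'], how='left')
# unmatched rows (b, e) get NaN, which turns 'want' into float64
merged['want'] = merged['want'].fillna(0).astype(int)
print(merged)
#   name  price  value  want
# 0    a     10     35   123
# 1    b     10     21     0
# 2    c     10     33   944
# 3    d     10     20   104
# 4    e     10     88     0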
If you are matching the two dataframes based on the name and the price, you can use df.where and df.isin
df1['want'] = df2['want'].where(df1[['name','price']].isin(df2).all(axis=1)).fillna('0')
df1
name price value want
0 a 10 35 123.0
1 b 10 21 0
2 c 10 33 944.0
3 d 10 20 104.0
4 e 10 88 0
Expanding on https://stackoverflow.com/a/73830294/20110802:
You can add the validate option to the merge in order to avoid duplication on one side (or both):
pd.merge(df1, df2, on=['name','price'], how='left', validate='1:1').fillna(0)
Also, if the float conversion is a problem for you, one option is to do an inner join first and then pd.concat the result with the "leftover" df1 to which you have already added a constant-valued column. It would look something like this:
df_inner = pd.merge(df1, df2, on=['name', 'price'], how='inner', validate='1:1')
merged_pairs = set(zip(df_inner.name, df_inner.price))
df_anti = df1.loc[~pd.Series(zip(df1.name, df1.price)).isin(merged_pairs)].copy()
df_anti['want'] = 0
df_result = pd.concat([df_inner, df_anti]) # perhaps ignore_index=True ?
Looks complicated, but it should be quite performant because it filters by set. I think there might be a possibility to set name and price as the index, merge on the index and then filter by index to avoid the zip-set shenanigans, but I'm no expert on multiindex handling.
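A rough sketch of that index-based variant (an untested idea, so treat the details as assumptions):
import pandas as pd

# assuming df1 and df2 as in the question
left = df1.set_index(['name', 'price'])
right = df2.set_index(['name', 'price'])

df_inner = left.join(right, how='inner')                  # matching (name, price) pairs only
df_anti = left.loc[~left.index.isin(right.index)].copy()  # the leftover rows of df1
df_anti['want'] = 0

df_result = pd.concat([df_inner, df_anti]).reset_index()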
Try this code, it will give you the expected results:
import pandas as pd
df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 10, 10, 10, 10],
                    'value': [35, 21, 33, 20, 88]})
df2 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'price': [10, 5, 10, 10, 5],
                    'want': [123, 222, 944, 104, 213]})
new = pd.merge(df1,df2, how='left', left_on=['name','price'], right_on=['name','price'])
print(new.fillna(0))

Pandas Dataframe creating a unique column

I have this dataframe:
I want to add each pair of columns together, e.g. duration + credit_amount, so I have created the following algorithm:
def automate_add(add):
    for i, column in enumerate(df):
        for j, operando in enumerate(df):
            if column != operando:
                columnName = column + '_sum_' + operando
                add[columnName] = df[column] + df[operando]
with the output:
duration_sum_credit_amount
duration_sum_installment_commitment
credit_amount_sum_duration
credit_amount_sum_installment_commitment
installment_commitment_sum_duration
installment_commitment_sum_credit_amount
However, since duration + credit_amount = credit_amount + duration, I wouldn't like to have repeated columns.
Expecting this result from the function:
duration_sum_credit_amount
duration_sum_installment_commitment
credit_amount_sum_installment_commitment
How can I do it?
I am trying to use hash sets but they seem to work only on pandas Series [1].
EDIT:
Dataframe: https://www.openml.org/d/31
Use the below; it should work faster:
import itertools
my_list = [pd.Series(df.loc[:, list(i)].sum(axis=1),
                     name='_sum_'.join(df.loc[:, list(i)].columns))
           for i in itertools.combinations(df.columns, 2)]
final_df = pd.concat(my_list, axis=1)
print(final_df)
duration_sum_credit_amount duration_sum_installment_commitment \
0 1175 10
1 5999 50
2 2108 14
3 7924 44
4 4894 27
credit_amount_sum_installment_commitment
0 1173
1 5953
2 2098
3 7884
4 4873
Explanation:
print(list(itertools.combinations(df.columns,2))) gives:
[('duration', 'credit_amount'),
('duration', 'installment_commitment'),
('credit_amount', 'installment_commitment')]
After that, do:
for i in list(itertools.combinations(df.columns, 2)):
    print(df.loc[:, list(i)])
    print("---------------------------")
This prints the combinations of columns together, so I just summed them on axis=1, wrapped the result in a pd.Series, and gave it a name by joining the column names.
After this, just append them to the list and concat them on axis=1 to get the final result. :)
You have already been pointed to itertools.combinations, which is the right tool here and will save you some for loops and the issue of repeated columns. See the documentation for more details about permutations, combinations etc.
First, let's create the DataFrame so we can reproduce the example:
import pandas as pd
from itertools import combinations
df = pd.DataFrame({
'a': [1,2,3],
'b': [4,5,6],
'c': [7,8,9]
})
>>> df
a b c
0 1 4 7
1 2 5 8
2 3 6 9
Now let's get to work. The idea is to get all the combinations of the columns, then do a dictionary comprehension to return something like {column_name: sum}. Here it is:
>>> pd.DataFrame({c1 + '_sum_' + c2: df[c1] + df[c2]
for c1, c2 in combinations(df.columns, 2)})
a_sum_b a_sum_c b_sum_c
0 5 8 11
1 7 10 13
2 9 12 15
Notice you can replace sum with any other function that operates on two pd.Series.
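For instance, continuing with the same df, a small sketch with a product instead of a sum (the _prod_ naming is just an illustration):
>>> pd.DataFrame({c1 + '_prod_' + c2: df[c1] * df[c2]
                  for c1, c2 in combinations(df.columns, 2)})
   a_prod_b  a_prod_c  b_prod_c
0         4         7        28
1        10        16        40
2        18        27        54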
The function can have one more if condition to check whether the commuted addition has already been added as a column to the dataframe, like below:
def automate_add(add):
    columnLst = []
    # list where we will add column names to avoid the associated sum columns
    for i, column in enumerate(df):
        for j, operando in enumerate(df):
            if column != operando:
                if operando + '_sum_' + column not in columnLst:
                    columnName = column + '_sum_' + operando
                    add[columnName] = df[column] + df[operando]
                    columnLst.append(columnName)
I haven't tested this on your data. Try it and let me know if it doesn't work.

NaNs after merging two dataframes

I have two dataframes like the following:
df1
id name
-------------------------
0 43 c
1 23 t
2 38 j
3 9 s
df2
user id
--------------------------------------------------
0 222087 27,26
1 1343649 6,47,17
2 404134 18,12,23,22,27,43,38,20,35,1
3 1110200 9,23,2,20,26,47,37
I want to split all the ids in df2 into multiple rows and join the resultant dataframe to df1 on "id".
I do the following:
b = pd.DataFrame(df2['id'].str.split(',').tolist(), index=df2.user_id).stack()
b = b.reset_index()[[0, 'user_id']] # var1 variable is currently labeled 0
b.columns = ['Item_id', 'user_id']
When I try to merge, I get NaNs in the resultant dataframe.
pd.merge(b, df1, on = "id", how="left")
id user name
-------------------------------------
0 27 222087 NaN
1 26 222087 NaN
2 6 1343649 NaN
3 47 1343649 NaN
4 17 1343649 NaN
So, I tried doing the following:
b['name'] = np.nan
for i in range(0, len(df1)):
    b['name'][(b['id'] == df1['id'][i])] = df1['name'][i]
It still gives the same result as above. I am confused as to what could cause this because I am sure both of them should work!
Any help would be much appreciated!
I read similar posts on SO but none seemed to have a concrete answer. I am also not sure whether this is related to my code at all.
Thanks in advance!
The problem is that you need to convert the split id values to int, because the output of string functions is always string, even when the values look numeric:
b['Item_id'] = b['Item_id'].astype(int)
Another solution is to convert df1.id to string:
df1.id = df1.id.astype(str)
You get NaNs because there is no match: str values don't match int values.
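A minimal sketch of the whole split-and-merge with the dtype fix, using DataFrame.explode (a newer alternative to the stack approach above) on a trimmed-down version of the question's data:
import pandas as pd

df1 = pd.DataFrame({'id': [43, 23, 38, 9], 'name': ['c', 't', 'j', 's']})
df2 = pd.DataFrame({'user_id': [222087, 404134],
                    'id': ['27,26', '18,12,23,43']})

b = df2.assign(id=df2['id'].str.split(',')).explode('id')
b['id'] = b['id'].astype(int)   # without this, the merge keys are str vs int
print(b.merge(df1, on='id', how='left'))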

python pandas remove duplicate columns

What is the easiest way to remove duplicate columns from a dataframe?
I am reading a text file that has duplicate columns via:
import pandas as pd
df=pd.read_table(fname)
The column names are:
Time, Time Relative, N2, Time, Time Relative, H2, etc...
All the Time and Time Relative columns contain the same data. I want:
Time, Time Relative, N2, H2
All my attempts at dropping, deleting, etc such as:
df=df.T.drop_duplicates().T
Result in uniquely valued index errors:
Reindexing only valid with uniquely valued index objects
Sorry for being a Pandas noob. Any suggestions would be appreciated.
Additional Details
Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)
data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):
Time Time Relative [s] N2[%] Time Time Relative [s] H2[ppm]
2/12/2013 9:20:55 AM 6.177 9.99268e+001 2/12/2013 9:20:55 AM 6.177 3.216293e-005
2/12/2013 9:21:06 AM 17.689 9.99296e+001 2/12/2013 9:21:06 AM 17.689 3.841667e-005
2/12/2013 9:21:18 AM 29.186 9.992954e+001 2/12/2013 9:21:18 AM 29.186 3.880365e-005
... etc ...
2/12/2013 2:12:44 PM 17515.269 9.991756e+001 2/12/2013 2:12:44 PM 17515.269 2.800279e-005
2/12/2013 2:12:55 PM 17526.769 9.991754e+001 2/12/2013 2:12:55 PM 17526.769 2.880386e-005
2/12/2013 2:13:07 PM 17538.273 9.991797e+001 2/12/2013 2:13:07 PM 17538.273 3.131447e-005
Here's a one line solution to remove columns based on duplicate column names:
df = df.loc[:,~df.columns.duplicated()].copy()
How it works:
Suppose the columns of the data frame are ['alpha','beta','alpha']
df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False then the column name is unique up to that point, if it is True then the column name is duplicated earlier. For example, using the given example, the returned value would be [False,False,True].
Pandas allows one to index using boolean values whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (ie [True, True, False] = ~[False,False,True])
Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.
The final .copy() is there to copy the dataframe to (mostly) avoid getting errors about trying to modify an existing dataframe later down the line.
Note: the above only checks columns names, not column values.
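A quick sketch of that behaviour on a toy frame with columns ['alpha', 'beta', 'alpha'] (made-up data):
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['alpha', 'beta', 'alpha'])
print(df.columns.duplicated())   # [False False  True]
df = df.loc[:, ~df.columns.duplicated()].copy()
print(df.columns.tolist())       # ['alpha', 'beta']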
To remove duplicated indexes
Since it is similar enough, do the same thing on the index:
df = df.loc[~df.index.duplicated(),:].copy()
To remove duplicates by checking values without transposing
Update and caveat: please be careful in applying this. Per the counter-example provided by DrWhat in the comments, this solution may not have the desired outcome in all cases.
df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()
This avoids the issue of transposing. Is it fast? No. Does it work? In some cases. Here, try it on this:
# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0,100,size= (736334,1312)))
#to see size in gigs
#ldf.memory_usage().sum()/1e9 #it's about 3 gigs
# duplicate a column
ldf.loc[:,'dup'] = ldf.loc[:,101]
# take out duplicated columns by values
ldf = ldf.loc[:,~ldf.apply(lambda x: x.duplicated(),axis=1).all()].copy()
It sounds like you already know the unique column names. If that's the case, then df = df[['Time', 'Time Relative', 'N2']] would work.
If not, your solution should work:
In [101]: vals = np.random.randint(0,20, (4,3))
vals
Out[101]:
array([[ 3, 13, 0],
[ 1, 15, 14],
[14, 19, 14],
[19, 5, 1]])
In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
df
Out[106]:
Time H1 N2 Time Relative N2 Time
0 3 13 0 3 13 0
1 1 15 14 1 15 14
2 14 19 14 14 19 14
3 19 5 1 19 5 1
In [107]: df.T.drop_duplicates().T
Out[107]:
Time H1 N2
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
You probably have something specific to your data that's messing it up. We could give more help if there's more details you could give us about the data.
Edit:
Like Andy said, the problem is probably with the duplicate column titles.
For a sample table file 'dummy.csv' I made up:
Time H1 N2 Time N2 Time Relative
3 13 13 3 13 0
1 15 15 1 15 14
14 19 19 14 19 14
19 5 5 19 5 1
using read_table gives unique columns and works properly:
In [151]: df2 = pd.read_table('dummy.csv')
df2
Out[151]:
Time H1 N2 Time.1 N2.1 Time Relative
0 3 13 13 3 13 0
1 1 15 15 1 15 14
2 14 19 19 14 19 14
3 19 5 5 19 5 1
In [152]: df2.T.drop_duplicates().T
Out[152]:
Time H1 Time Relative
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
If your version doesn't let you, you can hack together a solution to make them unique:
In [169]: df2 = pd.read_table('dummy.csv', header=None)
df2
Out[169]:
0 1 2 3 4 5
0 Time H1 N2 Time N2 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [171]: from collections import defaultdict
col_counts = defaultdict(int)
col_ix = df2.first_valid_index()
In [172]: cols = []
for col in df2.ix[col_ix]:
    cnt = col_counts[col]
    col_counts[col] += 1
    suf = '_' + str(cnt) if cnt else ''
    cols.append(col + suf)
cols
Out[172]:
['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
Time H1 N2 Time_1 N2_1 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [178]: df2.T.drop_duplicates().T
Out[178]:
Time H1 Time Relative
1 3 13 0
2 1 15 14
3 14 19 14
4 19 5 1
Transposing is inefficient for large DataFrames. Here is an alternative:
def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")
        vs = list(dcols.values())
        ks = list(dcols.keys())
        lvs = len(vs)
        for i in range(lvs):
            for j in range(i + 1, lvs):
                if vs[i] == vs[j]:
                    dups.append(ks[i])
                    break
    return dups
Use it like this:
dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)
Edit
A memory efficient version that treats nans like any other value:
from pandas.core.common import array_equivalent
def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            ia = vs.iloc[:, i].values
            for j in range(i + 1, lcs):
                ja = vs.iloc[:, j].values
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break
    return dups
If I'm not mistaken, the following does what was asked without the memory problems of the transpose solution and with fewer lines than #kalu's function, keeping the first of any similarly named columns.
Cols = list(df.columns)
for i, item in enumerate(df.columns):
    if item in df.columns[:i]:
        Cols[i] = "toDROP"
df.columns = Cols
df = df.drop("toDROP", axis=1)
It looks like you were on the right path. Here is the one-liner you were looking for:
df.reset_index().T.drop_duplicates().T
But since there is no example data frame that produces the referenced error message Reindexing only valid with uniquely valued index objects, it is tough to say exactly what would solve the problem. If restoring the original index is important to you, do this:
original_index = df.index.names
df.reset_index().T.drop_duplicates().T.set_index(original_index)
Note that Gene Burinsky's answer (at the time of writing the selected answer) keeps the first of each duplicated column. To keep the last:
df=df.loc[:, ~df.columns[::-1].duplicated()[::-1]]
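Equivalently, a small sketch relying on the keep parameter of Index.duplicated:
# keep the last occurrence of each duplicated column name
df = df.loc[:, ~df.columns.duplicated(keep='last')]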
An update on #kalu's answer, which uses the latest pandas:
def find_duplicated_columns(df):
    dupes = []
    columns = df.columns
    for i in range(len(columns)):
        col1 = df.iloc[:, i]
        for j in range(i + 1, len(columns)):
            col2 = df.iloc[:, j]
            # break early if dtypes aren't the same (helps deal with
            # categorical dtypes)
            if col1.dtype is not col2.dtype:
                break
            # otherwise compare values
            if col1.equals(col2):
                dupes.append(columns[i])
                break
    return dupes
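Usage would presumably mirror the original:
dupes = find_duplicated_columns(df)
df = df.drop(columns=dupes)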
Although #Gene Burinsky answer is great, it has a potential problem in that the reassigned df may be either a copy or a view of the original df.
This means that subsequent assignments like df['newcol'] = 1 generate a SettingWithCopy warning and may fail (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing). The following solution prevents that issue:
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
I ran into this problem where the one liner provided by the first answer worked well. However, I had the extra complication where the second copy of the column had all of the data. The first copy did not.
The solution was to create two data frames by splitting the one data frame by toggling the negation operator. Once I had the two data frames, I ran a join statement using the lsuffix. This way, I could then reference and delete the column without the data.
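A rough sketch of what that split-and-join might look like (the details here are assumptions, not the original code):
# split on the duplicated-name mask, then join the two halves with a suffix
first = df.loc[:, ~df.columns.duplicated()]
second = df.loc[:, df.columns.duplicated()]
joined = first.join(second, lsuffix='_first')
# the empty '<name>_first' copies can then be inspected and dropped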
March 2021 update
The subsequent post by #CircArgs may have provided a succinct one-liner to accomplish what I described here.
First step: read only the first row, i.e. the column names, and remove the duplicate columns.
Second step: read the file again using only those columns.
cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)
The approach below will identify duplicate columns, so you can review what went wrong when the dataframe was originally built.
dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]
Just in case somebody is still looking for an answer on how to find duplicated values in the columns of a Pandas DataFrame in Python, I came up with this solution:
def get_dup_columns(m):
    '''
    This will check every column in the data frame
    and verify whether you have duplicated columns.
    It can help whenever you are cleaning big data sets of 50+ columns
    and clean things up a little bit for you.
    The result will be a list of tuples showing which columns are duplicates,
    for example
    (column A, column C)
    meaning that column A is duplicated with column C.
    More info at https://wanatux.com
    '''
    headers_list = [x for x in m.columns]
    duplicate_col2 = []
    y = 0
    while y <= len(headers_list) - 1:
        for x in range(1, len(headers_list) - 1):
            if m[headers_list[y]].equals(m[headers_list[x]]) == False:
                continue
            else:
                duplicate_col2.append((headers_list[y], headers_list[x]))
        headers_list.pop(0)
    return duplicate_col2
And you can call the function like this:
duplicate_col = get_dup_columns(pd_excel)
It will show a result like the following:
[('column a', 'column k'),
('column a', 'column r'),
('column h', 'column m'),
('column k', 'column r')]
I am not sure why Gene Burinsky's answer did not work for me. I was getting the same original dataframes with duplicated columns. My workaround was to force the selection over the ndarray and get back the dataframe.
df = pd.DataFrame(df.values[:,~df.columns.duplicated()], columns=df.columns[~df.columns.duplicated()])
A simple column-wise comparison is an efficient way (in terms of memory and time) to check for duplicated columns by value. Here is an example:
import numpy as np
import pandas as pd
from itertools import combinations as combi
df = pd.DataFrame(np.random.uniform(0,1, (100,4)), columns=['a','b','c','d'])
df['a'] = df['d'].copy() # column 'a' is equal to column 'd'
# to keep the first
dupli_cols = [cc[1] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
# to keep the last
dupli_cols = [cc[0] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
df = df.drop(columns=dupli_cols)
In case you want to check for duplicate columns, this code can be useful
columns_to_drop = []
for cname in sorted(list(df)):
    for cname2 in sorted(list(df))[::-1]:
        if df[cname].equals(df[cname2]) and cname != cname2 and cname not in columns_to_drop:
            columns_to_drop.append(cname2)
            print(cname, cname2, 'Are equal')
df = df.drop(columns_to_drop, axis=1)
Fast and easy way to drop the duplicated columns by their values:
df = df.T.drop_duplicates().T
More info: the Pandas DataFrame drop_duplicates manual.

Sorting columns in pandas dataframe based on column name [duplicate]

This question already has answers here:
How to change the order of DataFrame columns?
(41 answers)
Closed 3 years ago.
I have a dataframe with over 200 columns. The issue is as they were generated the order is
['Q1.3','Q6.1','Q1.2','Q1.1',......]
I need to sort the columns as follows:
['Q1.1','Q1.2','Q1.3',.....'Q6.1',......]
Is there some way for me to do this within Python?
df = df.reindex(sorted(df.columns), axis=1)
This assumes that sorting the column names will give the order you want. If your column names won't sort lexicographically (e.g., if you want column Q10.3 to appear after Q9.1), you'll need to sort differently, but that has nothing to do with pandas.
You can also do this more succinctly:
df.sort_index(axis=1)
Make sure you assign the result back:
df = df.sort_index(axis=1)
Or, do it in-place:
df.sort_index(axis=1, inplace=True)
You can just do:
df[sorted(df.columns)]
Edit: Shorter is
df[sorted(df)]
For several columns, you can put the columns in whatever order you want:
# ['A', 'B', 'C']  <- this is your current column order
df = df[['C', 'B', 'A']]
This example shows sorting and slicing columns:
d = {'col1':[1, 2, 3], 'col2':[4, 5, 6], 'col3':[7, 8, 9], 'col4':[17, 18, 19]}
df = pandas.DataFrame(d)
You get:
col1 col2 col3 col4
1 4 7 17
2 5 8 18
3 6 9 19
Then do:
df = df[['col3', 'col2', 'col1']]
Resulting in:
col3 col2 col1
7 4 1
8 5 2
9 6 3
Tweet's answer can be passed to BrenBarn's answer above with
data.reindex_axis(sorted(data.columns, key=lambda x: float(x[1:])), axis=1)
So for your example, say:
vals = randint(low=16, high=80, size=25).reshape(5,5)
cols = ['Q1.3', 'Q6.1', 'Q1.2', 'Q9.1', 'Q10.2']
data = DataFrame(vals, columns = cols)
You get:
data
Q1.3 Q6.1 Q1.2 Q9.1 Q10.2
0 73 29 63 51 72
1 61 29 32 68 57
2 36 49 76 18 37
3 63 61 51 30 31
4 36 66 71 24 77
Then do:
data.reindex_axis(sorted(data.columns, key=lambda x: float(x[1:])), axis=1)
resulting in:
data
Q1.2 Q1.3 Q6.1 Q9.1 Q10.2
0 63 73 29 51 72
1 32 61 29 68 57
2 76 36 49 18 37
3 51 63 61 30 31
4 71 36 66 24 77
If you need an arbitrary sequence instead of sorted sequence, you could do:
sequence = ['Q1.1','Q1.2','Q1.3',.....'Q6.1',......]
your_dataframe = your_dataframe.reindex(columns=sequence)
I tested this in Python 2.7.10 and it worked for me.
Don't forget to add "inplace=True" to Wes' answer or set the result to a new DataFrame.
df.sort_index(axis=1, inplace=True)
The quickest method is:
df.sort_index(axis=1)
Be aware that this creates a new instance. Therefore you need to store the result in a new variable:
sortedDf=df.sort_index(axis=1)
The sort method and sorted function allow you to provide a custom function to extract the key used for comparison:
>>> ls = ['Q1.3', 'Q6.1', 'Q1.2']
>>> sorted(ls, key=lambda x: float(x[1:]))
['Q1.2', 'Q1.3', 'Q6.1']
One use-case is that you have named (some of) your columns with some prefix, and you want the columns sorted with those prefixes all together and in some particular order (not alphabetical).
For example, you might start all of your features with Ft_, labels with Lbl_, etc, and you want all unprefixed columns first, then all features, then the label. You can do this with the following function (I will note a possible efficiency problem using sum to reduce lists, but this isn't an issue unless you have a LOT of columns, which I do not):
import re

def sortedcols(df, groups=['Ft_', 'Lbl_']):
    patterns = ['^(?!(%s))' % '|'.join(groups)] + ['^%s' % g for g in groups]
    return df[sum([list(filter(re.compile(p).search, df.columns)) for p in patterns], [])]
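For example, with hypothetical column names (and pandas imported as pd):
df = pd.DataFrame(columns=['Lbl_default', 'id', 'Ft_income', 'Ft_age'])
print(sortedcols(df).columns.tolist())
# ['id', 'Ft_income', 'Ft_age', 'Lbl_default']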
print(df.sort_values(by='Frequency', ascending=False))
where by is the name of the column, if you want to sort the dataset based on the values in that column.
