How to get the position of certain columns in a dataframe - Python [duplicate]

In R when you need to retrieve a column index based on the name of the column you could do
idx <- which(names(my_data)==my_colum_name)
Is there a way to do the same with pandas dataframes?

Sure, you can use .get_loc():
In [45]: df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
In [46]: df.columns
Out[46]: Index([apple, orange, pear], dtype=object)
In [47]: df.columns.get_loc("pear")
Out[47]: 2
although to be honest I don't often need this myself. Usually access by name does what I want it to (df["pear"], df[["apple", "orange"]], or maybe df.columns.isin(["orange", "pear"])), although I can definitely see cases where you'd want the index number.

Here is a solution through list comprehension. cols is the list of columns to get index for:
[df.columns.get_loc(c) for c in cols if c in df]

DSM's solution works, but if you want a direct equivalent to which you could do (df.columns == name).nonzero().
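A minimal sketch of what that returns, assuming the same df and column order as above; nonzero() hands back a tuple of arrays, so the positions sit in its first element:
(df.columns == "pear").nonzero()
# Out: (array([2]),)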

For returning multiple column indices, I recommend using the pandas.Index method get_indexer, if you have unique labels:
df = pd.DataFrame({"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]})
df.columns.get_indexer(['pear', 'apple'])
# Out: array([0, 1], dtype=int64)
If you have non-unique labels in the index (columns only support unique labels), use get_indexer_for. It takes the same arguments as get_indexer:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, 1, 1])
df.index.get_indexer_for([0, 1])
# Out: array([0, 1, 2], dtype=int64)
Both methods also support non-exact indexing, e.g. for float values, taking the nearest value within a tolerance. If two indices have the same distance to the specified label, or are duplicates, the index with the larger index value is selected:
df = pd.DataFrame(
{"pear": [1, 2, 3], "apple": [2, 3, 4], "orange": [3, 4, 5]},
index=[0, .9, 1.1])
df.index.get_indexer([0, 1])
# array([ 0, -1], dtype=int64)
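The call above uses exact matching, so the label 1 has no match and -1 is returned. A sketch of the nearest-match behaviour described in the paragraph, assuming the same df (method and tolerance are standard get_indexer arguments):
df.index.get_indexer([0, 1], method='nearest')
# Out: array([0, 2], dtype=int64)  -- 0.9 and 1.1 tie for label 1, the larger index wins
df.index.get_indexer([0, 1], method='nearest', tolerance=0.05)
# Out: array([ 0, -1], dtype=int64)  -- no label lies within 0.05 of 1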

When you are looking to find multiple column matches, a vectorized solution using the searchsorted method can be used. With df as the dataframe and query_cols as the column names to be searched for, an implementation would be:
import numpy as np

def column_index(df, query_cols):
    cols = df.columns.values
    sidx = np.argsort(cols)
    return sidx[np.searchsorted(cols, query_cols, sorter=sidx)]
Sample run -
In [162]: df
Out[162]:
   apple  banana  pear  orange  peach
0      8       3     4       4      2
1      4       4     3       0      1
2      1       2     6       8      1
In [163]: column_index(df, ['peach', 'banana', 'apple'])
Out[163]: array([4, 1, 0])

Update: "Deprecated since version 0.25.0: Use np.asarray(..) or DataFrame.values() instead." pandas docs
In case you want the column name from the column location (the other way around to the OP question), you can use:
>>> df.columns.values()[location]
Using #DSM Example:
>>> df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
>>> df.columns
Index(['apple', 'orange', 'pear'], dtype='object')
>>> df.columns.values[1]
'orange'
Other ways:
df.iloc[:,1].name
df.columns[location]  # (thanks to @roobie-nuby for pointing that out in the comments)

To modify DSM's answer a bit: as of the current version of Pandas (1.1.5), get_loc has some weird properties depending on the type of index, so you might get back an integer position, a boolean mask, or a slice. This is somewhat frustrating because I don't want to modify the entire columns object just to extract one variable's index. Much simpler is to avoid the function altogether:
list(df.columns).index('pear')
Very straightforward and probably fairly quick.
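To illustrate those quirks, here is a small sketch (my own example, not from the answer above) of how get_loc's return type changes with the index contents:
import pandas as pd
pd.Index(['a', 'b', 'c']).get_loc('a')  # unique labels -> integer position: 0
pd.Index(['a', 'a', 'b']).get_loc('a')  # monotonic duplicates -> slice(0, 2, None)
pd.Index(['a', 'b', 'a']).get_loc('a')  # scattered duplicates -> boolean mask: array([ True, False,  True])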

how about this:
df = DataFrame({"pear": [1,2,3], "apple": [2,3,4], "orange": [3,4,5]})
out = np.argwhere(df.columns.isin(['apple', 'orange'])).ravel()
print(out)
[1 2]

When the column might or might not exist, the following variant of the above works.
ix = None
try:
    ix = list(df.columns).index('Col_X')
except ValueError:
    ix = None

if ix is None:
    ...  # do something
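A more compact sketch of the same guard ('Col_X' is just a placeholder name), if you prefer a membership test:
ix = df.columns.get_loc('Col_X') if 'Col_X' in df.columns else None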

import random
import pandas as pd

def char_range(c1, c2):  # question 7001144
    for c in range(ord(c1), ord(c2) + 1):
        yield chr(c)

df = pd.DataFrame()
for c in char_range('a', 'z'):
    df[f'{c}'] = random.sample(range(10), 3)  # random data
rearranged = random.sample(range(26), 26)     # random column order
df = df.iloc[:, rearranged]
print(df.iloc[:, :15])                        # 15-column view
for col in df.columns:                        # list each column's index and name
    print(str(df.columns.get_loc(col)) + '\t' + col)

Related

Is there a python function to fill missing data with consecutive value

I want to fill in the missing numbers in column b with the consecutive values 1 and 2.
This is what I have done:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 4, 7, 8, 4],
                   'b': [1, np.nan, 3, np.nan, 5]})
df['b'].fillna({'b':[1,2]}, inplace=True)
but nothing is done.
One way is to use loc with an array:
df.loc[df['b'].isnull(), 'b'] = [1, 2]
What you're attempting is possible but cumbersome with fillna:
nulls = df['b'].isnull()
df['b'] = df['b'].fillna(pd.Series([1, 2], index=nulls[nulls].index))
You may be looking for interpolate but the above solutions are generic given an input list or array.
If, on the other hand, you want to fill nulls with a sequence 1, 2, 3, etc, you can use cumsum:
# fillna solution
df['b'] = df['b'].fillna(df['b'].isnull().cumsum())
# loc solution
nulls = df['b'].isnull()
df.loc[nulls, 'b'] = nulls.cumsum()
You can't feed fillna a list of values, as stated here and in the documentation. Also, if you're selecting the column, no need to tell fillna which column to use. You could do:
df.fillna({'b':1}, inplace=True)
Or
df['b'].fillna(1, inplace=True)
By the way, inplace is on its way to deprecation in Pandas; the preferred way to do this is, for example:
df = df.fillna({'b':1})
You can interpolate. Example:
s = pd.Series([0, 1, np.nan, 3])
s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
If I understand the wording "consecutive values 1 and 2" correctly, the solution may be:
from itertools import islice, cycle

filler = [1, 2]
nans = df.b.isna()
df.loc[nans, 'b'] = list(islice(cycle(filler), sum(nans)))

How to compare two dataframes ignoring column names?

Suppose I want to compare the content of two dataframes, but not the column names (or index names). Is it possible to achieve this without renaming the columns?
For example:
df = pd.DataFrame({'A': [1,2], 'B':[3,4]})
df_equal = pd.DataFrame({'a': [1,2], 'b':[3,4]})
df_diff = pd.DataFrame({'A': [1,2], 'B':[3,5]})
In this case, df equals df_equal but differs from df_diff, because the values in df_equal have the same content while the ones in df_diff do not. Notice that the column names in df_equal are different, but I still want to get a true value.
I have tried the following:
equals:
# Returns false because of the column names
df.equals(df_equal)
eq:
# doesn't work as it compares four columns (A, B, a, b), assuming nulls for the ones that don't exist
df.eq(df_equal).all().all()
pandas.testing.assert_frame_equal:
# same as equals
pd.testing.assert_frame_equal(df, df_equal, check_names=False)
I thought that it was going to be possible to use the assert_frame_equal, but none of the parameters seem to work to ignore column names.
pd.DataFrame is built around pd.Series, so it's unlikely you will be able to perform comparisons without column names.
But the most efficient way would be to drop down to numpy:
assert_equal = (df.values == df_equal.values).all()
To deal with np.nan, you can use np.testing.assert_equal and catch AssertionError, as suggested by @Avaris:
import numpy as np
def nan_equal(a, b):
    try:
        np.testing.assert_equal(a, b)
    except AssertionError:
        return False
    return True
assert_equal = nan_equal(df.values, df_equal.values)
I just needed to get the values (numpy array) from the data frame, so the column names won't be considered.
df.eq(df_equal.values).all().all()
I would still like to see a parameter on equals, or assert_frame_equal. Maybe I am missing something.
An advantage of this compared to @jpp's answer is that I can see which columns do not match by calling all() only once:
df.eq(df_diff.values).all()
Out[24]:
A True
B False
dtype: bool
One problem is that with eq, np.nan is not equal to np.nan, in which case the following expression would serve well:
(df.eq(df_equal.values) | (df.isnull().values & df_equal.isnull().values)).all().all()
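For purely numeric frames there is also a shorter NaN-aware sketch, assuming NumPy 1.19+ where array_equal accepts equal_nan:
np.array_equal(df.values, df_equal.values, equal_nan=True)  # NaN == NaN counts as equal; column names ignored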
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
for i in range(df1.shape[0]):
    for j in range(df1.shape[1]):
        print(df1.iloc[i, j] == df2.iloc[i, j])
Will return:
True
True
True
True
Same thing for:
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
One obvious issue is that column names matter in Pandas when sorting dataframes. For example:
df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = pd.DataFrame({'a': [1, 2], 'B': [3, 4]})
print(df1)
print(df2)
renders as ('B' is before 'a' in df2):
   a  b
0  1  3
1  2  4

   B  a
0  3  1
1  4  2

pandas drop_duplicates using comparison function

Is it somehow possible to use pandas.drop_duplicates with a comparison operator which compares two objects in a particular column in order to identify duplicates? If not, what is the alternative?
Here is an example where it could be used:
I have a pandas DataFrame which has lists as values in a particular column and I would like to have duplicates removed based on column A
import pandas as pd
df = pd.DataFrame( {'A': [[1,2],[2,3],[1,2]]} )
print df
giving me
A
0 [1, 2]
1 [2, 3]
2 [1, 2]
Using pandas.drop_duplicates
df.drop_duplicates( 'A' )
gives me a TypeError
[...]
TypeError: type object argument after * must be a sequence, not itertools.imap
However, my desired result is
A
0 [1, 2]
1 [2, 3]
My comparison function would be:
def cmp(x, y):
    return x == y
But in principle it could be something else, e.g.,
def cmp(x, y):
    return x == y and len(x) > 1
How can I remove duplicates based on the comparison function in an efficient way? Furthermore, what could I do if I had more columns to compare, each using a different comparison function?
Option 1
df[~pd.DataFrame(df.A.values.tolist()).duplicated()]
Option 2
df[~df.A.apply(pd.Series).duplicated()]
IIUC, your question is how to use an arbitrary function to determine what is a duplicate. To emphasize this, let's say that two lists are duplicates if the sum of the first item plus the square of the second item is the same in each case.
In [118]: df = pd.DataFrame({'A': [[1,2],[4,1],[2,3]]})
(Note that the first and second lists are equivalent under this rule, although not the same.)
Python typically prefers key functions to comparison functions, so here we need a function to say what is the key of a list; in this case, it is lambda l: l[0] + l[1]**2.
We can use groupby + first to group by the values of the key function, then take the first of each group:
In [119]: df.groupby(df.A.apply(lambda l: l[0] + l[1]**2)).first()
Out[119]:
A
A
5 [1, 2]
11 [2, 3]
Edit
Following further edits in the question, here are a few more examples using
df = pd.DataFrame( {'A': [[1,2],[2,3],[1,2], [1], [1], [2]]} )
Then for
def cmp(x, y):
    return x == y
this could be
In [158]: df.groupby(df.A.apply(tuple)).first()
Out[158]:
A
A
(1,) [1]
(1, 2) [1, 2]
(2,) [2]
(2, 3) [2, 3]
for
def cmp(x, y):
    return x == y and len(x) > 1
this could be
In [184]: class Key(object):
.....: def __init__(self):
.....: self._c = 0
.....: def __call__(self, l):
.....: if len(l) < 2:
.....: self._c += 1
.....: return self._c
.....: return tuple(l)
.....:
In [187]: df.groupby(df.A.apply(Key())).first()
Out[187]:
A
A
1 [1]
2 [1]
3 [2]
(1, 2) [1, 2]
(2, 3) [2, 3]
Alternatively, this could also be done much more succinctly via
In [190]: df.groupby(df.A.apply(lambda l: np.random.rand() if len(l) < 2 else tuple(l))).first()
Out[190]:
A
A
0.112012068449 [2]
0.822889598152 [1]
0.842630848774 [1]
(1, 2) [1, 2]
(2, 3) [2, 3]
but some people don't like these Monte-Carlo things.
Lists are unhashable in nature. Try converting them to hashable types such as tuples and then you can continue to use drop_duplicates:
df['A'] = df['A'].map(tuple)
df.drop_duplicates('A').applymap(list)
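If you would rather leave the column as lists, a non-destructive sketch of the same idea filters on a temporary tuple view instead of overwriting the column:
df[~df['A'].map(tuple).duplicated()]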
One way of implementing it with a function is to compute value_counts of the series object: duplicated values get aggregated, and we are interested only in the index part (which, by the way, is unique), not in the actual counts.
def series_dups(col_name):
    ser = df[col_name].map(tuple).value_counts(sort=False)
    return pd.Series(data=ser.index.values, name=col_name).map(list)
series_dups('A')
0 [1, 2]
1 [2, 3]
Name: A, dtype: object
If you do not want to convert the values to tuple but rather process the values as they are, you could do:
Toy data:
df = pd.DataFrame({'A': [[1,2], [2,3], [1,2], [3,4]],
                   'B': [[10,11,12], [11,12], [11,12,13], [10,11,12]]})
df
def series_dups_hashable(frame, col_names):
    for col in col_names:
        ser, indx = np.unique(frame[col].values, return_index=True)
        frame[col] = pd.Series(data=ser, index=indx, name=col)
    return frame.dropna(how='all')
series_dups_hashable(df, ['A', 'B']) # Apply to subset/all columns you want to check

Insert list of lists into single column of pandas df

I am trying to place multiple lists into a single column of a Pandas df. My list of lists is very long, so I cannot do so manually.
The desired output would look like this:
list_of_lists = [[1,2,3],[3,4,5],[5,6,7],...]
df = pd.DataFrame(list_of_lists)
>>> df
0
0 [1,2,3]
1 [3,4,5]
2 [5,6,7]
3 ...
Thank you for the assistance.
You can assign it by wrapping it in a Series vector if you're trying to add to an existing df:
In [7]:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,3), columns=list('abc'))
df
Out[7]:
a b c
0 -1.675422 -0.696623 -1.025674
1 0.032192 0.582190 0.214029
2 -0.134230 0.991172 -0.177654
3 -1.688784 1.275275 0.029581
4 -0.528649 0.858710 -0.244512
In [9]:
df['new_col'] = pd.Series([[1,2,3],[3,4,5],[5,6,7]])
df
Out[9]:
a b c new_col
0 -1.675422 -0.696623 -1.025674 [1, 2, 3]
1 0.032192 0.582190 0.214029 [3, 4, 5]
2 -0.134230 0.991172 -0.177654 [5, 6, 7]
3 -1.688784 1.275275 0.029581 NaN
4 -0.528649 0.858710 -0.244512 NaN
What about
df = pd.DataFrame({0: [[1,2,3],[3,4,5],[5,6,7]]})
The above solutions were helpful, but I wanted to add a little bit in case they didn't quite do the trick for someone...
pd.Series will not accept a np.ndarray that looks like a list-of-lists, e.g. one-hot labels array([[1, 0, 0], [0, 1, 0], ..., [0, 0, 1]]).
So in this case one can wrap the variable with list():
df['new_col'] = pd.Series(list(one_hot_labels))
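A self-contained sketch of that pattern; one_hot_labels and the frame below are made up for illustration:
import numpy as np
import pandas as pd
one_hot_labels = np.eye(3, dtype=int)            # rows [1,0,0], [0,1,0], [0,0,1]
df = pd.DataFrame({'id': [10, 20, 30]})
df['new_col'] = pd.Series(list(one_hot_labels))  # each row of the array becomes one list-like cell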
