Getting the integer index of a Pandas DataFrame row fulfilling a condition? - python

I have the following DataFrame:
   a  b  c
b
2  1  2  3
5  4  5  6
As you can see, column b is used as an index. I want to get the ordinal number of the row fulfilling ('b' == 5), which in this case would be 1.
The column being tested can be either an index column (as with b in this case) or a regular column, e.g. I may want to find the index of the row fulfilling ('c' == 6).

Use Index.get_loc.
Reusing unutbu's setup code, you'll get the same result.
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(1,7).reshape(2,3),
...                   columns=list('abc'),
...                   index=pd.Series([2,5], name='b'))
>>> df
   a  b  c
b
2  1  2  3
5  4  5  6
>>> df.index.get_loc(5)
1
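The same trick extends to a regular (non-index) column, such as the ('c' == 6) case from the question, by wrapping the column in an Index first. A minimal sketch reusing the setup above; note that get_loc only returns a single integer when the wrapped values are unique:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 7).reshape(2, 3),
                  columns=list('abc'),
                  index=pd.Series([2, 5], name='b'))

# Wrapping a regular column in an Index gives it get_loc too
print(pd.Index(df['c']).get_loc(6))  # -> 1
```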

You could use np.where like this:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,7).reshape(2,3),
                  columns=list('abc'),
                  index=pd.Series([2,5], name='b'))
print(df)
#    a  b  c
# b
# 2  1  2  3
# 5  4  5  6
print(np.where(df.index==5)[0])
# [1]
print(np.where(df['c']==6)[0])
# [1]
The value returned is an array since there could be more than one row with a particular index or value in a column.
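np.flatnonzero is a slightly shorter spelling of the same idea, collapsing the np.where(...)[0] step into one call. A sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 7).reshape(2, 3),
                  columns=list('abc'),
                  index=pd.Series([2, 5], name='b'))

# flatnonzero returns the positions of the True entries directly
print(np.flatnonzero(df.index == 5))   # [1]
print(np.flatnonzero(df['c'] == 6))    # [1]
```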

With Index.get_loc and a general condition:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.arange(1,7).reshape(2,3),
...                   columns=list('abc'),
...                   index=pd.Series([2,5], name='b'))
>>> df
   a  b  c
b
2  1  2  3
5  4  5  6
>>> df.index.get_loc(df.index[df['b'] == 5][0])
1

The other answers based on Index.get_loc() do not give a consistent result: the function returns an integer if the index values are all unique, but a boolean mask array (or a slice, when the duplicates are contiguous) if they are not. A more consistent approach that returns a list of integer positions every time is the following, shown here for an index with non-unique values:
import pandas as pd

df = pd.DataFrame([
    {"A": 1, "B": 2}, {"A": 2, "B": 2},
    {"A": 3, "B": 4}, {"A": 1, "B": 3}
], index=[1, 2, 3, 1])
If searching based on index value:
[i for i,v in enumerate(df.index == 1) if v]
[0, 3]
If searching based on a column value:
[i for i,v in enumerate(df["B"] == 2) if v]
[0, 1]
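The inconsistency described above can be seen directly: on this non-unique, non-monotonic index, get_loc returns a boolean mask instead of an integer. A quick sketch (exact return types may vary slightly between pandas versions):

```python
import pandas as pd

df = pd.DataFrame([
    {"A": 1, "B": 2}, {"A": 2, "B": 2},
    {"A": 3, "B": 4}, {"A": 1, "B": 3}
], index=[1, 2, 3, 1])

# Unique label -> plain integer position
print(df.index.get_loc(2))  # -> 1

# Duplicated label -> boolean mask rather than an integer
print(df.index.get_loc(1))  # -> [ True False False  True]
```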

Related

Pandas dataframe selecting with index and condition on a column

I have been trying to solve this problem for a while:
I have a dataframe like this:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.array([['A', 2, 3], ['B', 5, 6], ['C', 8, 9]]),columns=['a', 'b', 'c'])
j=[0,2]
But when I try to select just a part of it, filtering by a list of indices and a condition on a column, I get an error:
df[df.loc[j]['a']=='A']
Something is wrong, but I don't see what the problem is. Can you help me?
This is the error message:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
The DataFrame is filtered before the comparison, so its index differs from the original's, and that is why the error is raised.
You need to run the comparison against the filtered DataFrame:
df1 = df.loc[j]
print (df1)
   a  b  c
0  A  2  3
2  C  8  9
out = df1[df1['a']=='A']
print(out)
   a  b  c
0  A  2  3
Your original approach also works if you reindex the filtered mask back to the original index with Series.reindex:
out = df[(df.loc[j, 'a']=='A').reindex(df.index, fill_value=False)]
print(out)
   a  b  c
0  A  2  3
Or, a nicer solution:
out = df[(df['a'] == 'A') & (df.index.isin(j))]
print(out)
   a  b  c
0  A  2  3
A boolean mask and the DataFrame should have the same length. Here your df has length 3, but the boolean array df.loc[j]['a']=='A' has length 2.
You should do:
>>> df.loc[j][df.loc[j]['a']=='A']
   a  b  c
0  A  2  3
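The two working approaches, masking the filtered frame directly and reindexing the partial mask back to the full index, produce the same frame, which a quick sketch can confirm:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['A', 2, 3], ['B', 5, 6], ['C', 8, 9]]),
                  columns=['a', 'b', 'c'])
j = [0, 2]

# Approach 1: filter the rows first, then apply the column condition
out1 = df.loc[j][df.loc[j]['a'] == 'A']

# Approach 2: align the partial mask back to the full index
out2 = df[(df.loc[j, 'a'] == 'A').reindex(df.index, fill_value=False)]

print(out1.equals(out2))  # True
```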

Calling a specific range of rows in a pandas dataframe with specific columns

I'm looking to select a certain range of rows [25:100] and a certain list of indexed columns [1,3,6] from a python pandas dataframe using the subscript option.
So far I am using the following
df[25:100][[1, 3, 6]]
Use the .iloc (integer location) attribute:
df.iloc[25:100, [1, 3, 6]]
Note that 25:100 selects zero-based numbered rows from 25 (inclusive) to 100 (exclusive). If you want to select row 100 too, use 25:101 instead.
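On a small frame the same pattern looks like this, a sketch with a 5x7 frame of consecutive integers standing in for the real data (rows 25:100 would need a larger frame):

```python
import numpy as np
import pandas as pd

# 5 rows x 7 columns of consecutive integers
df = pd.DataFrame(np.arange(35).reshape(5, 7))

# rows 1 (inclusive) to 4 (exclusive), columns 1, 3 and 6
sub = df.iloc[1:4, [1, 3, 6]]
print(sub.shape)  # (3, 3)
```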
The df.loc will do the task. However, for simple copies, there are other ways.
Import pandas
>>> import pandas as pd
Create dataframe
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": ["a", "b", "c"]})
>>> df
   A  B
0  1  a
1  2  b
2  3  c
Copy rows from one column only
>>> df1 = df["B"][1:]
>>> df1
1    b
2    c
Name: B, dtype: object
Copy rows from more than one column
>>> df2 = df[["A","B"]][1:]
>>> df2
   A  B
1  2  b
2  3  c
Copy specific rows and columns (df.loc)
>>> df3 = df.loc[[0,2] , ["A", "B"]]
>>> df3
   A  B
0  1  a
2  3  c

Accessing a Non-Numerical Index in a DataFrame [duplicate]

I'm simply trying to access named pandas columns by an integer.
You can select a row by location using df.ix[3].
But how to select a column by integer?
My dataframe:
df=pandas.DataFrame({'a':np.random.rand(5), 'b':np.random.rand(5)})
Two approaches that come to mind:
>>> df
          A         B         C         D
0  0.424634  1.716633  0.282734  2.086944
1 -1.325816  2.056277  2.583704 -0.776403
2  1.457809 -0.407279 -1.560583 -1.316246
3 -0.757134 -1.321025  1.325853 -2.513373
4  1.366180 -1.265185 -2.184617  0.881514
>>> df.iloc[:, 2]
0    0.282734
1    2.583704
2   -1.560583
3    1.325853
4   -2.184617
Name: C
>>> df[df.columns[2]]
0    0.282734
1    2.583704
2   -1.560583
3    1.325853
4   -2.184617
Name: C
Edit: The original answer suggested the use of df.ix[:,2], but .ix is now deprecated (and has since been removed from pandas). Use df.iloc[:,2] instead.
You can also use df.icol(n) to access a column by integer.
Update: icol is deprecated and the same functionality can be achieved by:
df.iloc[:, n] # to access the column at the nth position
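A minimal sketch of the iloc-based replacement on a small throwaway frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])

# positional replacement for the removed icol(1)
col = df.iloc[:, 1]
print(col.name)   # b
print(list(col))  # [1, 4]
```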
You can use the label-based .loc method or the position-based .iloc method to do column slicing, including column ranges:
In [50]: import pandas as pd
In [51]: import numpy as np
In [52]: df = pd.DataFrame(np.random.rand(4,4), columns = list('abcd'))
In [53]: df
Out[53]:
          a         b         c         d
0  0.806811  0.187630  0.978159  0.317261
1  0.738792  0.862661  0.580592  0.010177
2  0.224633  0.342579  0.214512  0.375147
3  0.875262  0.151867  0.071244  0.893735
In [54]: df.loc[:, ["a", "b", "d"]]  ### selecting columns by a list of labels
Out[54]:
          a         b         d
0  0.806811  0.187630  0.317261
1  0.738792  0.862661  0.010177
2  0.224633  0.342579  0.375147
3  0.875262  0.151867  0.893735
In [55]: df.loc[:, "a":"c"]  ### label-based column range slicing
Out[55]:
          a         b         c
0  0.806811  0.187630  0.978159
1  0.738792  0.862661  0.580592
2  0.224633  0.342579  0.214512
3  0.875262  0.151867  0.071244
In [56]: df.iloc[:, 0:3]  ### position-based column range slicing
Out[56]:
          a         b         c
0  0.806811  0.187630  0.978159
1  0.738792  0.862661  0.580592
2  0.224633  0.342579  0.214512
3  0.875262  0.151867  0.071244
You can access multiple columns by passing a list of column indices to DataFrame.ix (note that .ix is deprecated in modern pandas; .iloc provides the same positional access).
For example:
>>> df = pandas.DataFrame({
...     'a': np.random.rand(5),
...     'b': np.random.rand(5),
...     'c': np.random.rand(5),
...     'd': np.random.rand(5)
... })
>>> df
          a         b         c         d
0  0.705718  0.414073  0.007040  0.889579
1  0.198005  0.520747  0.827818  0.366271
2  0.974552  0.667484  0.056246  0.524306
3  0.512126  0.775926  0.837896  0.955200
4  0.793203  0.686405  0.401596  0.544421
>>> df.ix[:,[1,3]]
          b         d
0  0.414073  0.889579
1  0.520747  0.366271
2  0.667484  0.524306
3  0.775926  0.955200
4  0.686405  0.544421
The method .transpose() converts columns to rows and rows to columns, hence you could even write
df.transpose().ix[3]
Most of the answers cover taking a contiguous range of columns starting from an index. But in some scenarios you need to pick specific, non-adjacent columns, and for that you can use the solution below.
Say you have columns A, B and C. To select only columns A and C, you can use:
df = df.iloc[:, [0,2]]
where 0, 2 specifies that you want only the 1st and 3rd columns.
You can use the method take. For example, to select first and last columns:
df.take([0, -1], axis=1)
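For example, on a small four-column frame (a quick sketch with made-up column names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('wxyz'))

# take works on positions; -1 counts from the end
first_last = df.take([0, -1], axis=1)
print(list(first_last.columns))  # ['w', 'z']
```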

How to assign values of series to column names of dataframe

I have a Series with these values:
0    1_AA
1    2_BB
2    3_CC
3    4_DD
and I want to use its values as the column names of a dataframe. It should look like this:
1_AA 2_BB 3_CC 4_DD
0
Is it possible?
One can simply use the columns argument of DataFrame:
>>> import pandas as pd
>>> s = pd.Series(['a', 'b', 'c'])
>>> pd.DataFrame(columns=s)
Empty DataFrame
Columns: [a, b, c]
Index: []
or pass the names in directly as a list:
>>> pd.DataFrame(columns=['1_AA', '2_BB', '3_CC', '4_DD'])
Empty DataFrame
Columns: [1_AA, 2_BB, 3_CC, 4_DD]
Index: []
You could use dict.fromkeys:
>>> import pandas as pd
>>> s = pd.Series(['1_AA', '2_BB', '3_CC', '4_DD'])
>>> pd.DataFrame(dict.fromkeys(s, [0])) # each column containing one zero - [0]
   1_AA  2_BB  3_CC  4_DD
0     0     0     0     0
Or collections.OrderedDict, which guarantees that the order of your values is always kept:
>>> from collections import OrderedDict
>>> pd.DataFrame(OrderedDict.fromkeys(s, [0]))
   1_AA  2_BB  3_CC  4_DD
0     0     0     0     0
You could also use an empty list as the second argument for fromkeys:
>>> pd.DataFrame(dict.fromkeys(s, []))
Empty DataFrame
Columns: [1_AA, 2_BB, 3_CC, 4_DD]
Index: []
But that creates an empty dataframe - with the correct columns.

How to check if there exists a row with a certain column value in pandas dataframe

Very new to pandas.
Given a pandas dataframe, is there a way to check whether a row with a certain column value exists? Say I have a column 'Name' and I need to check whether a certain name occurs in it.
Once I do this, I will need to make a similar query, but with a bunch of values at a time.
I read that there is isin, but I'm not sure how to use it. I need a query that returns all rows whose 'Name' column matches any of the values in a big array of names.
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.arange(8).reshape(4,2), columns=['name', 'value'])
Result:
>>> df
   name  value
0     0      1
1     2      3
2     4      5
3     6      7
>>> any(df.name == 4)
True
>>> any(df.name == 5)
False
Second Part:
my_data = np.arange(8).reshape(4,2)
my_data[0,0] = 4
df = pd.DataFrame(data = my_data, columns=['name', 'value'])
Result:
>>> df.loc[df.name == 4]
   name  value
0     4      1
2     4      5
Update:
my_data = np.arange(8).reshape(4,2)
my_data[0,0] = 4
df = pd.DataFrame(data = my_data, index=['a', 'b', 'c', 'd'], columns=['name', 'value'])
Result:
>>> df.loc[df.name == 4]  # gives relevant rows
   name  value
a     4      1
c     4      5
>>> df.loc[df.name == 4].index  # gives "row names" of relevant rows
Index([u'a', u'c'], dtype=object)
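For the second part of the question, matching a column against a whole list of values at once, Series.isin builds the boolean mask in one call. A sketch reusing the same data:

```python
import numpy as np
import pandas as pd

my_data = np.arange(8).reshape(4, 2)
my_data[0, 0] = 4
df = pd.DataFrame(data=my_data, columns=['name', 'value'])

# True for every row whose 'name' is in the given list
mask = df['name'].isin([4, 6])
print(df.loc[mask])
```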
If you want to extract a set of values given a sequence of row labels and column labels, the lookup method allows this and returns a NumPy array.
Here is my snippet and output:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.rand(20,4), columns = ['A','B','C','D'])
>>> df
           A         B         C         D
0   0.121190  0.360813  0.500082  0.817546
1   0.304313  0.773412  0.902835  0.440485
2   0.700338  0.733342  0.196394  0.364041
3   0.385534  0.078589  0.181256  0.440475
4   0.151840  0.956841  0.422713  0.018626
5   0.995875  0.110973  0.149234  0.543029
6   0.274740  0.745955  0.420808  0.020774
7   0.305654  0.580817  0.580476  0.210345
8   0.726075  0.801743  0.562489  0.367190
9   0.567987  0.591544  0.523653  0.133099
10  0.795625  0.163556  0.594703  0.208612
11  0.977728  0.751709  0.976577  0.439014
12  0.967853  0.214956  0.126942  0.293847
13  0.189418  0.019772  0.618112  0.643358
14  0.526221  0.276373  0.947315  0.792088
15  0.714835  0.782455  0.043654  0.966490
16  0.760602  0.487120  0.747248  0.982081
17  0.050449  0.666720  0.835464  0.522671
18  0.382314  0.146728  0.666722  0.573501
19  0.392152  0.195802  0.919299  0.181929
>>> df.lookup([0,2,4,6], ['B', 'C', 'A','D'])
array([ 0.36081287, 0.19639367, 0.15184046, 0.02077381])
>>> df.lookup([0,2,4,6], ['A', 'B', 'C','D'])
array([ 0.12119047, 0.73334194, 0.4227131 , 0.02077381])
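Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. The replacement suggested in the pandas docs combines Index.get_indexer with plain NumPy fancy indexing; a sketch on the same kind of random frame (assuming default integer row labels):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 4), columns=['A', 'B', 'C', 'D'])
rows, cols = [0, 2, 4, 6], ['B', 'C', 'A', 'D']

# Map labels to positions, then index the underlying array directly,
# equivalent to the removed df.lookup(rows, cols)
values = df.to_numpy()[df.index.get_indexer(rows), df.columns.get_indexer(cols)]
print(values)
```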
