Assume I have a dataframe in Pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
The dataframe df looks like
A B C D
0 foo one 0 0
1 bar one 1 2
2 foo two 2 4
3 bar three 3 6
4 foo two 4 8
5 bar two 5 10
6 foo one 6 12
7 foo three 7 14
How can I get the value of D where C equals 1? In other words, how can I return the D value, which is 2, when C = 1?
You can filter and use .loc:
result = df.loc[df.C == 1, "D"]
An alternative, equivalent syntax, as noted in the comments:
result = df.loc[df['C'].eq(1), 'D']
While chained indexing such as df[df.C == 1]["D"] will get you the correct result, it makes two separate selections instead of one, and you may encounter poor performance as your data scales.
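As a minimal sketch of the single-step selection described above, assuming the df from the question: the names matches and value are illustrative and not part of the original answer, and the empty-result guard is an added assumption about what the caller wants.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# One .loc call selects the rows (boolean mask) and the column together.
matches = df.loc[df['C'] == 1, 'D']

# matches is a Series; take its first element when a scalar is wanted.
# The guard avoids an IndexError if no row satisfies the condition.
value = matches.iloc[0] if not matches.empty else None
print(value)
```

If several rows can match, keep working with the Series rather than taking only the first element.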
You can filter the data frame on a condition (here C == 1) and then select the column by name.
This returns a Series; use .values to get the underlying NumPy array.
>>> df
A B C D
0 foo one 0 0
1 bar one 1 2
2 foo two 2 4
3 bar three 3 6
4 foo two 4 8
5 bar two 5 10
6 foo one 6 12
7 foo three 7 14
>>> df[df["C"] == 1]["D"].values
array([2])
>>> df[df["C"] == 1]["D"].values[0]
2
References
Selecting rows in pandas DataFrame based on conditions
Related
Assume I have a dataframe in Pandas:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': '0 1 2 3 4 5 6 7'.split(),
'D': '0 2 4 6 8 10 12 14'.split()})
The dataframe df looks like
A B C D
0 foo one 0 0
1 bar one 1 2
2 foo two 2 4
3 bar three 3 6
4 foo two 4 8
5 bar two 5 10
6 foo one 6 12
7 foo three 7 14
Note that the type of numbers in C and D columns is string.
I'm thinking about two conditions:
(1) Consecutive search
I want to return the D values when I'm searching C=2:5. If I use df.loc[df['C'] == '2:5', "D"], it returns an error. How can I do this part?
(2) Discrete search
I'd like to return the D values when I'm searching C=0,3,6. Again, if I use df.loc[df['C'] == '0,3,6', "D"], it returns an error. How should I write this code?
Consecutive Search:
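The consecutive search over a range of numeric strings can be handled with the isin() method. The list comprehension in the code below builds the list of strings from '2' to '5'. zfill() pads each string with leading zeros up to the given width, so zfill(1) leaves these single-digit values unchanged; it only matters if the column holds zero-padded values: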
df.D.loc[df.C.isin([str(i).zfill(1) for i in range(2,6)])]
Discrete Search:
The discrete search can be done with the isin() method:
df.D.loc[df.C.isin(['0','3','6'])]
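A hedged alternative sketch for the consecutive case: instead of listing every string in the range, the string column can be converted to numbers just for the comparison. This swaps in pd.to_numeric and Series.between, which the answer above does not use; the names c_num, consecutive and discrete are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'C': '0 1 2 3 4 5 6 7'.split(),
                   'D': '0 2 4 6 8 10 12 14'.split()})

# Convert the string column to numbers only for the comparison;
# df itself keeps its string columns.
c_num = pd.to_numeric(df['C'])

# between() is inclusive on both ends by default, so this covers 2..5.
consecutive = df.loc[c_num.between(2, 5), 'D']

# The discrete case stays a plain membership test on the strings.
discrete = df.loc[df['C'].isin(['0', '3', '6']), 'D']

print(consecutive.tolist())
print(discrete.tolist())
```

The numeric conversion assumes every value in C parses as a number; pd.to_numeric raises an error otherwise.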
Say I have the pandas DataFrame below:
A B C D
1 foo one 0 0
2 foo one 2 4
3 foo two 4 8
4 cat one 8 4
5 bar four 6 12
6 bar three 7 14
7 bar four 7 14
I would like to select all the rows that have equal values in A but differing values in B. So I would like the output of my code to be:
A B C D
1 foo one 0 0
3 foo two 4 8
5 bar three 7 14
6 bar four 7 14
What's the most efficient way to do this? I have approximately 11,000 rows with a lot of variation in the column values, but this situation comes up a lot. In my dataset, if elements in column A are equal then the corresponding column B values should also be equal. However, due to mislabeling this is not the case. I would like to fix this, and it would be impractical for me to do it one by one.
You can try groupby() + filter + drop_duplicates():
>>> df.groupby('A').filter(lambda g: len(g) > 1).drop_duplicates(subset=['A', 'B'], keep="first")
A B C D
0 foo one 0 0
2 foo two 4 8
4 bar four 6 12
5 bar three 7 14
Or, if you only want to drop duplicates over the subset of columns A and B, you can use the code below, but the result will also include the row containing cat.
>>> df.drop_duplicates(subset=['A', 'B'], keep="first")
A B C D
0 foo one 0 0
2 foo two 4 8
3 cat one 8 4
4 bar four 6 12
5 bar three 7 14
Use groupby + filter + head:
result = df.groupby('A').filter(lambda g: len(g) > 1).groupby(['A', 'B']).head(1)
print(result)
Output
A B C D
0 foo one 0 0
2 foo two 4 8
4 bar four 6 12
5 bar three 7 14
The first groupby and filter remove the rows whose A value is not duplicated (i.e. cat); the second creates groups with the same A and B and takes the first row of each.
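Note that filtering on len(g) > 1 keeps any A group with more than one row, even if all its B values agree. A sketch of a variant that tests for differing B values directly, using groupby with transform('nunique'); this is a swapped-in technique, not the method from the answers above, and the names distinct_b and conflicts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
    'C': [0, 2, 4, 8, 6, 7, 7],
    'D': [0, 4, 8, 4, 12, 14, 14],
})

# Number of distinct B values within each A group, broadcast to every row.
distinct_b = df.groupby('A')['B'].transform('nunique')

# Keep only groups whose B labels disagree, then one row per (A, B) pair.
conflicts = df[distinct_b > 1].drop_duplicates(subset=['A', 'B'])
print(conflicts)
```

On this sample data the result has the same rows as the groupby + filter approach; the two differ only when a repeated A value always carries the same B.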
The current answers are correct and may be more sophisticated too. If you have complex criteria, the filter function will be very useful. If you are like me and want to keep things simple, I feel the following is a more beginner-friendly way:
>>> df = pd.DataFrame({
'A': ['foo', 'foo', 'foo', 'cat', 'bar', 'bar', 'bar'],
'B': ['one', 'one', 'two', 'one', 'four', 'three', 'four'],
'C': [0,2,4,8,6,7,7],
'D': [0,4,8,4,12,14,14]
}, index=[1,2,3,4,5,6,7])
>>> df = df.drop_duplicates(['A', 'B'], keep='last')
A B C D
2 foo one 2 4
3 foo two 4 8
4 cat one 8 4
6 bar three 7 14
7 bar four 7 14
>>> df = df[df.duplicated(['A'], keep=False)]
A B C D
2 foo one 2 4
3 foo two 4 8
6 bar three 7 14
7 bar four 7 14
keep='last' is optional here
This question is closely related to two other questions (another and this one), and I'll even use the example from the very helpful accepted solution on that question. Here's the example from the accepted solution (credit to unutbu):
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
yields
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
But I want to have all rows of A and only the rows in B that have 'two' in them. My attempt is:
print(df.loc[df['A']) & df['B'] == 'two'])
This does not work, unfortunately. Can anybody suggest a way to implement something like this? It would be a great help if the solution were somewhat general, for example where column A does not hold the single value 'foo' but has different values and you still want the whole column.
Easy. If you do
df[['A','B']][df['B']=='two']
you will get:
A B
2 foo two
4 foo two
5 bar two
To filter on both A and B:
df[['A','B']][(df['B']=='two') & (df['A']=='foo')]
You get:
A B
2 foo two
4 foo two
And if you want all the columns:
df[df['B']=='two']
you will get:
A B C D
2 foo two 2 4
4 foo two 4 8
5 bar two 5 10
I think I understand your modified question. After sub-selecting on a condition on B, you can select the columns you want, such as:
In [1]: df.loc[df.B =='two'][['A', 'B']]
Out[1]:
A B
2 foo two
4 foo two
5 bar two
For example, if I wanted to concatenate all the strings of column A for which column B had the value 'two', then I could do:
In [2]: df.loc[df.B =='two'].A.sum() # <-- use .mean() for your quarterly data
Out[2]: 'foofoobar'
You could also groupby the values of column B and get such a concatenation result for every different B-group from one expression:
In [3]: df.groupby('B').apply(lambda x: x.A.sum())
Out[3]:
B
one foobarfoo
three barfoo
two foofoobar
dtype: object
To filter on A and B use numpy.logical_and:
In [1]: df.loc[np.logical_and(df.A == 'foo', df.B == 'two')]
Out[1]:
A B C D
2 foo two 2 4
4 foo two 4 8
Row subsetting: isn't this what you are looking for?
df.loc[(df['A'] == 'foo') & (df['B'] == 'two')]
A B C D
2 foo two 2 4
4 foo two 4 8
You can also add .reset_index() at the end to renumber the index from zero; pass drop=True if you do not want the old index kept as a column.
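A small sketch of the index renumbering, assuming the same df as in the question; the names subset and renumbered are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Rows where both conditions hold keep their original labels (2 and 4).
subset = df.loc[(df['A'] == 'foo') & (df['B'] == 'two')]

# drop=True discards the old labels; without it they are kept
# as an extra column named 'index'.
renumbered = subset.reset_index(drop=True)
print(renumbered)
```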
In the example below, I am trying to generate a column 'E' that is assigned either 1 or 2 depending on a conditional statement on column A.
I've tried various options but they throw a slicing error. Should it not be something like this to assign a value to the new column 'E'?
df2= df.loc[df['A'] == 'foo']['E'] = 1
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print('Filter the content')
df2= df.loc[df['A'] == 'foo']
print(df2)
# A B C D E
# 0 foo one 0 0 1
# 2 foo two 2 4 1
# 4 foo two 4 8 1
# 6 foo one 6 12 1
# 7 foo three 7 14 1
df3= df.loc[df['A'] == 'bar']
print(df3)
# A B C D E
# 1 bar one 1 2 2
# 3 bar three 3 6 2
# 5 bar two 5 10 2
# Combine df2 and df3 back to df and print df
print(df)
# A B C D E
# 0 foo one 0 0 1
# 1 bar one 1 2 2
# 2 foo two 2 4 1
# 3 bar three 3 6 2
# 4 foo two 4 8 1
# 5 bar two 5 10 2
# 6 foo one 6 12 1
# 7 foo three 7 14 1
What about simply this?
df['E'] = np.where(df['A'] == 'foo', 1, 2)
This does what I think you're trying to do: it creates a column E in your dataframe that is 1 if A == 'foo' and 2 otherwise.
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
df['E']=np.ones([df.shape[0],])*2
df.loc[df.A=='foo','E']=1
df.E=df.E.astype(int)
print(df)
Note: Your suggested solution df2= df.loc[df['A'] == 'foo']['E'] = 1 uses chained indexing (two successive selections) rather than taking advantage of loc, so the assignment can end up on a temporary copy instead of df. To select the rows matching the condition and the column E in one step, you should instead use df.loc[df['A']=='foo','E']
Note II: If you have more than one conditional, you could also use .replace() and pass in a dictionary, in this case mapping foo to 1, bar to 2, and so on.
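A sketch of the dictionary idea, using Series.map rather than the .replace() named in the note (a deliberate substitution: map builds the new column directly from the lookup table). The dictionary name codes is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Hypothetical lookup table: one entry per distinct value in column A.
codes = {'foo': 1, 'bar': 2}

# map() looks each A value up in the dictionary; values missing from
# the dictionary would become NaN, so the table should be complete.
df['E'] = df['A'].map(codes)
print(df)
```

Adding a third category is then a one-line change to the dictionary rather than another conditional.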
For brevity (characters):
df.assign(E=df.A.ne('foo')+1)
A B C D E
0 foo one 0 0 1
1 bar one 1 2 2
2 foo two 2 4 1
3 bar three 3 6 2
4 foo two 4 8 1
5 bar two 5 10 2
6 foo one 6 12 1
7 foo three 7 14 1
For brevity (time):
df.assign(E=(df.A.values != 'foo') + 1)
A B C D E
0 foo one 0 0 1
1 bar one 1 2 2
2 foo two 2 4 1
3 bar three 3 6 2
4 foo two 4 8 1
5 bar two 5 10 2
6 foo one 6 12 1
7 foo three 7 14 1
This question is very related to another, and I'll even use the example from the very helpful accepted solution on that question. Here's the example from the accepted solution (credit to unutbu):
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df)
# A B C D
# 0 foo one 0 0
# 1 bar one 1 2
# 2 foo two 2 4
# 3 bar three 3 6
# 4 foo two 4 8
# 5 bar two 5 10
# 6 foo one 6 12
# 7 foo three 7 14
print(df.loc[df['A'] == 'foo'])
yields
A B C D
0 foo one 0 0
2 foo two 2 4
4 foo two 4 8
6 foo one 6 12
7 foo three 7 14
But what if I want to pick out all rows that include both 'foo' and 'one'? Here that would be rows 0 and 6. My attempt is:
print(df.loc[df['A'] == 'foo' and df['B'] == 'one'])
This does not work, unfortunately. Can anybody suggest a way to implement something like this? Ideally it would be general enough that there could be a more complex set of conditions in there involving and and or, though I don't actually need that for my purposes.
Only a very small change is needed in your code: replace and with & (and add parentheses so that the comparisons are evaluated before the &):
In [104]: df.loc[(df['A'] == 'foo') & (df['B'] == 'one')]
Out[104]:
A B C D
0 foo one 0 0
6 foo one 6 12
The reason you have to use & is that it performs the comparison element-wise on arrays, while and expects two expressions that each evaluate to a single True or False.
Similarly, when you want an or comparison, you can use | in this case.
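A short sketch combining the element-wise operators, assuming the df from the question. The mask names is_foo and is_one are illustrative, and the ~ (not) operator is an addition not mentioned in the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8), 'D': np.arange(8) * 2})

# Naming the masks avoids the parentheses needed around inline comparisons.
is_foo = df['A'] == 'foo'
is_one = df['B'] == 'one'

both = df.loc[is_foo & is_one]          # A is foo AND B is one
either = df.loc[is_foo | is_one]        # A is foo OR B is one
foo_not_one = df.loc[is_foo & ~is_one]  # A is foo AND B is NOT one

print(both.index.tolist())
print(either.index.tolist())
print(foo_not_one.index.tolist())
```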
You can do this with a tiny alteration to your code:
print(df[df['A'] == 'foo'][df['B'] == 'one'])
Output:
A B C D
0 foo one 0 0
6 foo one 6 12