I am trying to understand how to use .all, for example:
import pandas as pd
df = pd.DataFrame({
"user_id": [1,1,1,1,1,2,2,2,3,3,3,3],
"score": [1,2,3,4,5,3,4,5,5,6,7,8]
})
When I try:
df.groupby("user_id").all(lambda x: x["score"] > 2)
I get:
score
user_id
1 True
2 True
3 True
But I expect:
score
user_id
1 False # Since for first group of users the score is not greater than 2 for all
2 True
3 True
In fact it doesn't even matter what value I pass instead of 2, the result DataFrame always has True for the score column.
Why do I get the result that I get? How can I get my expected result?
I looked at the documentation: https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.all.html, but it is very brief and did not help me.
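The line
df.groupby("user_id").all(lambda x: x["score"] > 2)
is not asking "are all scores in the group greater than 2?". The lambda is never applied as a condition, so in effect the call asks "are all values in the group truthy (non-zero)?", which holds for every score here.
To ask what you really want, build the boolean condition first and then group: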
df['score'].gt(2).groupby(df['user_id']).all()
Out
user_id
1 False
2 True
3 True
groupby.all does not take a function as a parameter. Its only parameter (skipna) accepts a boolean and changes how NaN values are interpreted.
You probably want:
df['score'].gt(2).groupby(df['user_id']).all()
Which can also be written as:
df.assign(flag=df['score'].gt(2)).groupby('user_id')['flag'].all()
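To see why the original call always returns True, it may help to compare the two reductions side by side. The sketch below reuses the question's data; the names truthy and above_two are illustrative and do not come from the original posts.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "score": [1, 2, 3, 4, 5, 3, 4, 5, 5, 6, 7, 8],
})

# With no condition applied, all() only tests whether each value is truthy.
# Every score is non-zero, so every group comes back True.
truthy = df.groupby("user_id")["score"].all()

# Build the boolean condition first, then reduce it per group.
above_two = df["score"].gt(2).groupby(df["user_id"]).all()

print(truthy.tolist())     # [True, True, True]
print(above_two.tolist())  # [False, True, True]
```

Group 1 fails only in the second version because its scores 1 and 2 are not greater than 2.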
I have a DataFrame like this:
x = ['3.13.1.7-2.1', '3.21.1.8-2.2', '4.20.1.6-2.1', '4.8.1.2-2.0', '5.23.1.10-2.2']
df = pd.DataFrame(data = x, columns = ['id'])
id
0 3.13.1.7-2.1
1 3.21.1.8-2.2
2 4.20.1.6-2.1
3 4.8.1.2-2.0
4 5.23.1.10-2.2
I need to split each id string on the periods, and then I need to know when the second part is 13 and the third part is 1. Ideally, I would have one additional column that is a boolean value (in the above example, index 0 would be TRUE and all others would be FALSE). But I could live with multiple additional columns, where one or more contain individual string parts, and one is for said boolean value.
I first tried to just split the string into parts:
df['id_split'] = df['id'].apply(lambda x: str(x).split('.'))
This worked, however if I try to isolate only the second part of the string like this...
df['id_split'] = df['id'].apply(lambda x: str(x).split('.')[1])
...I get an error that the list index is out of range.
Yet, if I check any individual index in the DataFrame like this...
df['id_split'][0][1]
...this works, producing only the second item in the list of strings.
I guess I'm not familiar enough with what the .apply() method is doing to know why it won't accept list indices. But anyway, I'd like to know how I can isolate just the second and third parts of each string, check their values, and output a boolean based on those values, in a scalable manner (the actual dataset is millions of rows). Thanks!
Let's use str.split to get the parts, then you can compare:
parts = df['id'].str.split('.', expand=True)
(parts[[1, 2]] == ['13', '1']).all(axis=1)
Output:
0 True
1 False
2 False
3 False
4 False
dtype: bool
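The split-and-compare idea above can also be written so that it tolerates rows with too few parts, which is one plausible (but unconfirmed) cause of the "list index out of range" error on the full dataset. This is a minimal sketch; the "malformed" row and the flag column name are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({"id": [
    "3.13.1.7-2.1", "3.21.1.8-2.2", "4.20.1.6-2.1",
    "4.8.1.2-2.0", "5.23.1.10-2.2",
    "malformed",  # hypothetical row with no periods at all
]})

# expand=True pads short rows with missing values instead of raising,
# and n=3 stops splitting once the parts we need are available.
parts = df["id"].str.split(".", n=3, expand=True)

# A missing part compares as False, so malformed rows are simply not flagged.
df["flag"] = parts[1].eq("13") & parts[2].eq("1")

print(df["flag"].tolist())
```

Because the whole column is split and compared at once, this avoids calling a Python lambda per row, which tends to matter on millions of rows.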
You can do something like this
df['flag'] = df['id'].apply(lambda x: True if x.split('.')[1] == '13' and x.split('.')[2]=='1' else False)
Output
id flag
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
You can do it directly, like below:
df['new'] = df['id'].apply(lambda x: str(x).split('.')[1]=='13' and str(x).split('.')[2]=='1')
>>> print(df)
id new
0 3.13.1.7-2.1 True
1 3.21.1.8-2.2 False
2 4.20.1.6-2.1 False
3 4.8.1.2-2.0 False
4 5.23.1.10-2.2 False
I need to adapt an existing function, that essentially performs a Series.str.contains and returns the resulting Series, to be able to handle SeriesGroupBy as input.
As suggested by the pandas error message
Cannot access attribute 'str' of 'SeriesGroupBy' objects, try using the 'apply' method
I have tried to use apply() on the SeriesGroupBy object, which works in a way, but results in a Series object. I would now like to apply the same grouping as before, to this Series.
Original function
def contains(series, expression):
    return series.str.contains(expression)
My attempt so far
>>> import pandas as pd
... from functools import partial
...
... def _f(series, expression):
...     return series.str.contains(expression)
...
... def contains(grouped_series, expression):
...     result = grouped_series.apply(partial(_f, expression=expression))
...     return result
>>> df = pd.DataFrame(zip([1,1,2,2], ['abc', 'def', 'abq', 'bcq']), columns=['group', 'text'])
>>> gdf = df.groupby('group')
>>> gs = gdf['text']
>>> type(gs)
<class 'pandas.core.groupby.generic.SeriesGroupBy'>
>>> r = contains(gdf['text'], 'b')
>>> r
0 True
1 False
2 True
3 True
Name: text, dtype: bool
>>> type(r)
<class 'pandas.core.series.Series'>
The desired result would be a boolean series grouped by the same indices as the original grouped_series.
The actual result is a Series object without any grouping.
EDIT / CLARIFICATION:
The initial answers make me think I didn't stress the core of the problem enough. For the sake of the question, let's assume I cannot change anything outside of the contains(grouped_series, expression) function.
I think I know how to solve my problem if I approach it from another angle, and if I don't that would then become another question. The real world context makes it very complicated to change code outside of that one function. So I would really appreciate suggestions that work within that constraint.
So, let me rephrase the question as follows:
I'm looking for a function contains(grouped_series, expression), so that the following code works:
>>> df = pd.DataFrame(zip([1,1,2,2], ['abc', 'def', 'abq', 'bcq']), columns=['group', 'text'])
>>> grouped_series = contains(df.groupby('group')['text'], 'b')
>>> grouped_series.sum()
group
1 1.0
2 2.0
Name: text, dtype: float64
groupby is not needed unless you want to do something with the "group", like calculating its sum or checking whether all rows in the group contain the letter b. When you call apply on a GroupBy object, you can pass additional arguments to the applied function as keywords:
def contains(frame, expression):
    return frame['text'].str.contains(expression).all()
df.groupby('group').apply(contains, expression='b')
Result:
group
1 False
2 True
dtype: bool
I like to think that the first parameter to the function being applied (frame) is a smaller view of the original dataframe, being chopped up by the groupby clause.
That said, apply is pretty slow compared to specialized aggregate functions like min, max or sum. Use these as much as possible and save apply for complex cases.
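As an illustration of preferring built-in aggregations over apply, the same per-group answers can be obtained by computing the element-wise test once and then reducing it. This is only a sketch on the question's sample data; has_b, all_have_b and count_with_b are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({"group": [1, 1, 2, 2], "text": ["abc", "def", "abq", "bcq"]})

# Compute the element-wise test once on the whole column...
has_b = df["text"].str.contains("b")

# ...then reduce per group with built-in aggregations instead of apply.
all_have_b = has_b.groupby(df["group"]).all()
count_with_b = has_b.groupby(df["group"]).sum()

print(all_have_b.tolist())  # [False, True]
print(count_with_b)
```

Grouping a Series by another Series works here because both share the same row index.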
Following the advice of the error message, you could use apply:
df.groupby('group').apply(lambda x : x.text.str.contains('b'))
Out[10]:
group
1 0 True
1 False
2 2 True
3 True
Name: text, dtype: bool
If you want to put these indices into your data set and return a DataFrame, use reset_index:
df.groupby('group').apply(lambda x : x.text.str.contains('b')).reset_index()
Out[11]:
group level_1 text
0 1 0 True
1 1 1 False
2 2 2 True
3 2 3 True
_f has absolutely no relationship to the groups. The way to deal with this is to define a column before grouping (not a separate function), and then group. That column (called 'to_sum') is then available on the resulting GroupBy object as a SeriesGroupBy.
df.assign(to_sum = _f(df['text'], 'b')).groupby('group').to_sum.sum()
#group
#1 1.0
#2 2.0
#Name: to_sum, dtype: float64
If you don't need the entire DataFrame for your subsequent operations, you can sum the Series returned by _f using df to group (as they will share the same index)
_f(df['text'], 'b').groupby(df['group']).sum()
You can compute the flag as a regular column first; there is no need to call str.contains on the grouped object:
df['eval']= df['text'].str.contains('b')
eval is the name of the column you want to add. You can name it whatever you like. Then group and sum:
df.groupby('group')['eval'].sum()
The result is:
group
1 1.0
2 2.0
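For the constraint stated in the clarified question (only the body of contains may change), one possible approach is to compute the flags with transform and then regroup them inside the function. This is a sketch under assumptions, not a confirmed solution from the thread: it relies on the public groups mapping of the grouped object and on the row index labels being unique.

```python
import pandas as pd

def contains(grouped_series, expression):
    """Return the per-row str.contains result, regrouped like the input."""
    # transform keeps the original row index, unlike an aggregation.
    flags = grouped_series.transform(lambda s: s.str.contains(expression))
    # Rebuild a row-label -> group-key mapping from the public .groups dict.
    # Assumes the row index labels are unique.
    keys = pd.Series({
        label: name
        for name, labels in grouped_series.groups.items()
        for label in labels
    })
    return flags.groupby(keys.reindex(flags.index))

df = pd.DataFrame(list(zip([1, 1, 2, 2], ["abc", "def", "abq", "bcq"])),
                  columns=["group", "text"])
result = contains(df.groupby("group")["text"], "b")
print(result.sum())
```

The returned object is a SeriesGroupBy, so callers can still chain sum(), all() and similar reductions on it. The regrouped index loses the original "group" name in this sketch.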
I have a dataset as follows:
I want to filter the rows where the count value equals 1.
index count
1 4
2 5
3 1
4 1
This is my code:
booleans = []
for number in df1.count:
    if number == 1:
        booleans.append(True)
    else:
        booleans.append(False)
but it has this error:
'method' object is not iterable
I also tried this:
df[df.count==1]
but I had the following error:
KeyError: False
Any suggestions?
The problem in your code is df1.count. pandas DataFrames have a count() method, which counts the non-NA/null observations along the given axis, and attribute access resolves to that method rather than to your column.
So in your code df1.count evaluates to something like this:
<bound method DataFrame.count of index count
0 1 4
1 2 5
2 3 1
3 4 1>
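Instead, use bracket indexing: df[df['count']=='1'] gets what you were looking for. The example below stores the counts as strings; if your column holds integers, compare against 1 instead of '1'.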
import pandas as pd
data = {"index":['1','2','3','4'],
"count":['4','5','1','1']}
df = pd.DataFrame(data)
indexes = df[df['count']=='1']
print(indexes)
Output
index count
2 3 1
3 4 1
count is also a method of pandas DataFrame.
When you write df.count, pandas resolves it to the count() method, not to your column that happens to have the same name. Writing df["count"] would solve your issue.
The standard way to do this is the following (compare against 1 rather than '1' if the column holds integers):
Solution 1
df1[df1["count"]=='1']
Solution 2
However, if you really do want to get a list of booleans you might want to use lambdas:
booleans = list(df1['count'].apply(lambda x:x=='1').values)
You can then use this list to get the result you want like so:
df1[booleans]
This is basically the same thing as solution 1.
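The difference between attribute access and bracket access can be checked directly. The sketch below uses a hypothetical frame that mirrors the question, with integer counts rather than the strings used in the answers above.

```python
import pandas as pd

# Hypothetical frame mirroring the question, with integer counts.
df1 = pd.DataFrame({"index": [1, 2, 3, 4], "count": [4, 5, 1, 1]})

# Attribute access resolves to the DataFrame.count method, not the column.
print(callable(df1.count))  # True

# Bracket access always returns the column, so the comparison is element-wise.
mask = df1["count"] == 1
filtered = df1[mask]
print(filtered)
```

This also accounts for the KeyError: False in the question: df.count == 1 compares a method with 1, which is simply False, and df[False] then looks for a column named False.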
I want to assign a boolean value to a column: True if the first column contains only one period and False if it contains more than one period.
This is what I've gotten to at this point and I am completely stuck:
for index, row in qbstats.iterrows():
    if qbstats['qb'].count(".") > 1
...... so if it's greater than one I want to set the column labeled "num_periods_in_name" to False; otherwise it should be set to True.
I would appreciate any help, thanks.
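You can use np.where() (after import numpy as np):
df['New Col'] = np.where(df['qb'].str.count(r'\.') > 1, False, True)
Note that str.count treats its pattern as a regular expression, so you need to escape the . with a \ to match a literal period.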
Below is an example:
qb
0 Hello.
1 helloo...
2 hello...ooo
3 Hell.o
And applying the code above gives:
qb New Col
0 Hello. True
1 helloo... False
2 hello...ooo False
3 Hell.o True
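Since the comparison already produces booleans, the same column can be built without np.where. This is a minimal sketch using the example data above and the column name from the question; period_counts is an illustrative name.

```python
import pandas as pd

qbstats = pd.DataFrame({"qb": ["Hello.", "helloo...", "hello...ooo", "Hell.o"]})

# str.count takes a regular expression, so the period is escaped in a raw string.
period_counts = qbstats["qb"].str.count(r"\.")

# The comparison already yields booleans, so np.where is optional here.
qbstats["num_periods_in_name"] = period_counts <= 1

print(qbstats)
```

Note that <= 1 also marks names with no period at all as True, just like the np.where version; use == 1 if exactly one period is required.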
I wanted to use a boolean indexing, checking for rows of my data frame where a particular column does not have NaN values. So, I did the following:
import pandas as pd
my_df.loc[pd.isnull(my_df['col_of_interest']) == False].head()
to see a snippet of that data frame, including only the values that are not NaN (most values are NaN).
It worked, but seems less-than-elegant. I'd want to type:
my_df.loc[!pd.isnull(my_df['col_of_interest'])].head()
However, that generated an error. I also spend a lot of time in R, so maybe I'm confusing things. In Python, I usually put in the syntax "not" where I can. For instance, if x is not None:, but I couldn't really do it here. Is there a more elegant way? I don't like having to put in a senseless comparison.
In general with pandas (and numpy), we use the bitwise NOT ~ instead of ! or not (whose behaviour can't be overridden by types).
While in this case we have notnull, ~ can come in handy in situations where there's no special opposite method.
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, np.nan, 3]})
>>> df.a.isnull()
0 False
1 False
2 True
3 False
Name: a, dtype: bool
>>> ~df.a.isnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
>>> df.a.notnull()
0 True
1 True
2 False
3 True
Name: a, dtype: bool
(For completeness I'll note that -, the unary negative operator, will also work on a boolean Series, but ~ is the canonical choice, and - is not supported on NumPy boolean arrays: it was deprecated and now raises a TypeError.)
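The options discussed here can be compared on a small example. The frame below is hypothetical; dropna with subset is a further built-in alternative that the posts above do not mention.

```python
import numpy as np
import pandas as pd

# Hypothetical data: half of the values are missing.
my_df = pd.DataFrame({"col_of_interest": [1.0, np.nan, 3.0, np.nan]})

negated = my_df.loc[~my_df["col_of_interest"].isnull()]
direct = my_df.loc[my_df["col_of_interest"].notnull()]
dropped = my_df.dropna(subset=["col_of_interest"])

# Python's `not` asks for a single truth value, which a Series refuses to give.
try:
    not my_df["col_of_interest"].isnull()
except ValueError as exc:
    print("not raised:", type(exc).__name__)

print(negated.index.tolist())  # [0, 2]
```

All three selections return the same rows, so the choice is mostly about readability.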
Instead of using pandas.isnull(), you should use pandas.notnull() to find the rows where the column has non-null values. Example:
import pandas as pd
my_df.loc[pd.notnull(my_df['col_of_interest'])].head()
pandas.notnull() is the boolean inverse of pandas.isnull(), as given in the documentation:
See also
pandas.notnull
boolean inverse of pandas.isnull