Count groups of consecutive values in pandas - python

I have a dataframe of 0s and 1s and I would like to count groups of consecutive 1s (ignoring the 0s) with a pandas solution (not itertools, not plain Python iteration).
Other SO posts suggest methods based on shift()/diff()/cumsum(), which seem not to work when the sequence in the dataframe starts with 0.
df = pandas.Series([0,1,1,1,0,0,1,0,1,1,0,1,1]) # should give 4
df = pandas.Series([1,1,0,0,1,0,1,1,0,1,1]) # should also give 4
df = pandas.Series([1,1,1,1,1,0,1]) # should give 2
Any idea?

If you only have 0/1, you can use:
s = pd.Series([0,1,1,1,0,0,1,0,1,1,0,1,1])
count = s.diff().fillna(s).eq(1).sum()
output: 4 (4 and 2 for the other two)
The fillna ensures that a Series starting with 1 is counted correctly.
Faster alternative: use diff, count the 1s, and correct the result with the first item:
count = s.diff().eq(1).sum()+(s.iloc[0]==1)
[A benchmark plot comparing the different pandas approaches appeared here in the original answer.]

Let us identify the different groups of 1s using cumsum, then use nunique to count the number of unique groups:
m = df.eq(0)
m.cumsum()[~m].nunique()
Result
case 1: 4
case 2: 4
case 3: 2
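For reference, a minimal self-contained sketch (my addition, assuming pandas imported as pd) that runs both approaches against the three example series from the question:

import pandas as pd

def count_groups_diff(s):
    # a group starts wherever the value rises from 0 to 1 (fillna handles a leading 1)
    return s.diff().fillna(s).eq(1).sum()

def count_groups_cumsum(s):
    # label each run of 1s by the cumulative count of 0s seen so far, then count the labels
    m = s.eq(0)
    return m.cumsum()[~m].nunique()

examples = [pd.Series([0,1,1,1,0,0,1,0,1,1,0,1,1]),  # expected 4
            pd.Series([1,1,0,0,1,0,1,1,0,1,1]),      # expected 4
            pd.Series([1,1,1,1,1,0,1])]              # expected 2
for s in examples:
    print(count_groups_diff(s), count_groups_cumsum(s))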

Count the number of elements in a list where the list contains the empty string

I'm having difficulties counting the number of elements in a list within a DataFrame's column. My problem comes from the fact that, after importing my input csv file, the rows that are supposed to contain an empty list [] are actually parsed as lists containing the empty string [""]. Here's a reproducible example to make things clearer:
import pandas as pd
df = pd.DataFrame({"ID": [1, 2, 3], "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]})
print(df)
ID NETWORK
0 1 []
1 2 [OPE, GSR, REP]
2 3 [MER]
Even though one might think that the list for the row where ID = 1 is empty, it's not. It actually contains the empty string [""] which took me a long time to figure out.
So whatever standard method I try to use to calculate the number of elements within each list, I get a wrong value of 1 for those that are supposed to be empty:
df["COUNT"] = df["NETWORK"].str.len()
print(df)
ID NETWORK COUNT
0 1 [] 1
1 2 [OPE, GSR, REP] 3
2 3 [MER] 1
I searched and tried a lot of things before posting here but I couldn't find a solution to what seems to be a very simple problem. I should also note that I'm looking for a solution that doesn't require me to modify my original input file nor modify the way I'm importing it.
You just need to write a custom apply function that ignores the empty string '':
df['COUNT'] = df['NETWORK'].apply(lambda x: sum(1 for w in x if w!=''))
Another way:
df['NETWORK'].apply(lambda x: len([y for y in x if y]))
Using apply is probably more straightforward. Alternatively, explode, filter, then group by count.
_s = df['NETWORK'].explode()
_s = _s[_s != '']
df['count'] = _s.groupby(level=0).count()
This yields:
NETWORK count
ID
1 [] NaN
2 [OPE, GSR, REP] 3.0
3 [MER] 1.0
Fill NA with zeroes if needed.
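For instance, a self-contained sketch of the whole explode/filter/count pipeline with that final fill step (my addition; note that it sets ID as the index, which the printed output above assumes):

import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]}).set_index("ID")

_s = df["NETWORK"].explode()                      # one row per list element
_s = _s[_s != ""]                                 # drop the empty strings
df["count"] = _s.groupby(level=0).count()         # count remaining elements per original row
df["count"] = df["count"].fillna(0).astype(int)   # rows with no elements become 0 instead of NaN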
df["COUNT"] = df["NETWORK"].apply(lambda x: len(x))
Use a lambda function on each row and in the lambda function return the length of the array

How to iterate and calculate over a pandas multi-index dataframe

I have a pandas multi-index dataframe:
>>> df
0 1
first second
A one 0.991026 0.734800
two 0.582370 0.720825
B one 0.795826 -1.155040
two 0.013736 -0.591926
C one -0.538078 0.291372
two 1.605806 1.103283
D one -0.617655 -1.438617
two 1.495949 -0.936198
I'm trying to find an efficient way to divide each number in column 0 by the maximum number in column 1 that shares the same group under index "first", and make this into a third column. Is there a simple, efficient method for doing something like this that doesn't require multiple for loops?
Use Series.div with max(level=0) to get the maximal values per first level:
print (df[1].max(level=0))
first
A 0.734800
B -0.591926
C 1.103283
D -0.936198
Name: 1, dtype: float64
df['new'] = df[0].div(df[1].max(level=0))
print (df)
0 1 new
first second
A one 0.991026 0.734800 1.348702
two 0.582370 0.720825 0.792556
B one 0.795826 -1.155040 -1.344469
two 0.013736 -0.591926 -0.023206
C one -0.538078 0.291372 -0.487706
two 1.605806 1.103283 1.455480
D one -0.617655 -1.438617 0.659748
two 1.495949 -0.936198 -1.597898
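Note that the level= argument to Series.max was deprecated and has been removed in recent pandas (2.0+). A sketch of an equivalent that should work there (my construction of a toy frame; it uses groupby(...).transform('max') so the per-group maxima stay aligned with the original MultiIndex):

import numpy as np
import pandas as pd

# toy MultiIndex frame shaped like the one in the question
idx = pd.MultiIndex.from_product([list('ABCD'), ['one', 'two']],
                                 names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=idx, columns=[0, 1])

# per-group maximum of column 1, broadcast back onto every row of the group
df['new'] = df[0] / df[1].groupby(level='first').transform('max')
print(df)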

Returning date that corresponds with maximum value in pandas dataframe [duplicate]

How can I find the row for which the value of a specific column is maximal?
df.max() will give me the maximal value for each column, but I don't know how to get the corresponding row.
Use the pandas idxmax function. It's straightforward:
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].idxmax()
3
>>> df['B'].idxmax()
4
>>> df['C'].idxmax()
1
Alternatively you could also use numpy.argmax, such as numpy.argmax(df['A']) -- it provides the same thing, and appears at least as fast as idxmax in cursory observations.
idxmax() returns index labels, not integer positions.
Example: if you have string values as your index labels, like rows 'a' through 'e', you might want to know that the max occurs in row 4 (not in row 'd').
If you want the integer position of that label within the Index, you have to get it manually (which can be tricky now that duplicate row labels are allowed).
HISTORICAL NOTES:
idxmax() used to be called argmax() prior to 0.11.
The label-returning argmax was deprecated before 1.0.0 and that behavior was removed entirely in 1.0.0 (Series.argmax now returns the positional maximum).
As of pandas 0.16, argmax still existed and performed the same function as idxmax (though it appeared to run more slowly).
That argmax function returned the integer position within the index of the row holding the maximum element.
pandas then moved to using row labels instead of integer indices. Positional integer indices used to be very common, more common than labels, especially in applications where duplicate row labels are common.
For example, consider this toy DataFrame with a duplicate row label:
In [19]: dfrm
Out[19]:
A B C
a 0.143693 0.653810 0.586007
b 0.623582 0.312903 0.919076
c 0.165438 0.889809 0.000967
d 0.308245 0.787776 0.571195
e 0.870068 0.935626 0.606911
f 0.037602 0.855193 0.728495
g 0.605366 0.338105 0.696460
h 0.000000 0.090814 0.963927
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
In [20]: dfrm['A'].idxmax()
Out[20]: 'i'
In [21]: dfrm.loc[dfrm['A'].idxmax()] # .ix instead of .loc in older versions of pandas
Out[21]:
A B C
i 0.688343 0.188468 0.352213
i 0.879000 0.105039 0.900260
So here a naive use of idxmax is not sufficient, whereas the old form of argmax would correctly provide the positional location of the max row (in this case, position 9).
This is exactly one of those nasty kinds of bug-prone behaviors in dynamically typed languages that makes this sort of thing so unfortunate, and worth beating a dead horse over. If you are writing systems code and your system suddenly gets used on some data sets that are not cleaned properly before being joined, it's very easy to end up with duplicate row labels, especially string labels like a CUSIP or SEDOL identifier for financial assets. You can't easily use the type system to help you out, and you may not be able to enforce uniqueness on the index without running into unexpectedly missing data.
So you're left with hoping that your unit tests covered everything (they didn't, or more likely no one wrote any tests) -- otherwise (most likely) you're just left waiting to see if you happen to smack into this error at runtime, in which case you probably have to go drop many hours worth of work from the database you were outputting results to, bang your head against the wall in IPython trying to manually reproduce the problem, finally figuring out that it's because idxmax can only report the label of the max row, and then being disappointed that no standard function automatically gets the positions of the max row for you, writing a buggy implementation yourself, editing the code, and praying you don't run into the problem again.
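For what it is worth, in current pandas the positional answer is easy to get. A small sketch (my addition, not part of the original answer) that sidesteps duplicate labels entirely:

import numpy as np
import pandas as pd

# frame with a duplicate row label, as in the example above
dfrm = pd.DataFrame(np.random.rand(10, 3),
                    columns=list('ABC'),
                    index=list('abcdefghii'))

pos = dfrm['A'].to_numpy().argmax()   # integer position of the max; labels are ignored
row = dfrm.iloc[pos]                  # exactly one row, even with duplicate labels

In pandas 1.0 and later, dfrm['A'].argmax() returns the same positional integer.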
You might also try idxmax:
In [5]: df = pandas.DataFrame(np.random.randn(10,3),columns=['A','B','C'])
In [6]: df
Out[6]:
A B C
0 2.001289 0.482561 1.579985
1 -0.991646 -0.387835 1.320236
2 0.143826 -1.096889 1.486508
3 -0.193056 -0.499020 1.536540
4 -2.083647 -3.074591 0.175772
5 -0.186138 -1.949731 0.287432
6 -0.480790 -1.771560 -0.930234
7 0.227383 -0.278253 2.102004
8 -0.002592 1.434192 -1.624915
9 0.404911 -2.167599 -0.452900
In [7]: df.idxmax()
Out[7]:
A 0
B 8
C 7
e.g.
In [8]: df.loc[df['A'].idxmax()]
Out[8]:
A 2.001289
B 0.482561
C 1.579985
Both of the above answers return only one index if there are multiple rows that take the maximum value. If you want all of those rows, there does not seem to be a built-in function.
But it is not hard to do. Below is an example for Series; the same can be done for a DataFrame:
In [1]: from pandas import Series, DataFrame
In [2]: s=Series([2,4,4,3],index=['a','b','c','d'])
In [3]: s.idxmax()
Out[3]: 'b'
In [4]: s[s==s.max()]
Out[4]:
b 4
c 4
dtype: int64
df.iloc[df['columnX'].argmax()]
argmax() provides the integer position of the max value in columnX (in current pandas); iloc can then be used to get that row of the DataFrame df.
A more compact and readable solution using query() is like this:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(5,3), columns=['A','B','C'])
print(df)
# find row with maximum A
df.query('A == A.max()')
It also returns a DataFrame instead of Series, which would be handy for some use cases.
Very simple: we have df as below and we want to print the row with the max value in C:
A B C
x 1 4
y 2 10
z 5 9
In:
df.loc[df['C'] == df['C'].max()] # condition check
Out:
A B C
y 2 10
If you want the entire row instead of just the index, you can use df.nlargest: pass in how many 'top' rows you want, and the column(s) you want them ranked by.
df.nlargest(2,['A'])
will give you the rows corresponding to the top 2 values of A.
use df.nsmallest for min values.
The direct ".argmax()" solution does not work for me.
The previous example provided by @ely
>>> import pandas
>>> import numpy as np
>>> df = pandas.DataFrame(np.random.randn(5,3),columns=['A','B','C'])
>>> df
A B C
0 1.232853 -1.979459 -0.573626
1 0.140767 0.394940 1.068890
2 0.742023 1.343977 -0.579745
3 2.125299 -0.649328 -0.211692
4 -0.187253 1.908618 -1.862934
>>> df['A'].argmax()
3
>>> df['B'].argmax()
4
>>> df['C'].argmax()
1
returns the following message :
FutureWarning: 'argmax' is deprecated, use 'idxmax' instead. The behavior of 'argmax'
will be corrected to return the positional maximum in the future.
Use 'series.values.argmax' to get the position of the maximum now.
So my solution is:
df['A'].values.argmax()
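To pull the full row from that positional result (my addition, not part of the original answer):
df.iloc[df['A'].values.argmax()]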
mx.iloc[0].idxmax()
This one line of code gives the column label holding the maximum value in the first row of the DataFrame mx; iloc[0] selects the row at position 0.
Considering this dataframe
[In]: df = pd.DataFrame(np.random.randn(4,3),columns=['A','B','C'])
[Out]:
A B C
0 -0.253233 0.226313 1.223688
1 0.472606 1.017674 1.520032
2 1.454875 1.066637 0.381890
3 -0.054181 0.234305 -0.557915
Assuming one wants to know the rows where column "C" is max, the following will do the work:
[In]: df[df['C']==df['C'].max()]
[Out]:
A B C
1 0.472606 1.017674 1.520032
The idxmax of the DataFrame returns the label index of the row with the maximum value, and the behavior of argmax depends on the version of pandas (at the time of writing it raised a warning). If you want the positional index, you can do the following:
max_row = df['A'].values.argmax()
or
import numpy as np
max_row = np.argmax(df['A'].values)
Note that np.argmax(df['A']) behaves the same as df['A'].argmax().
Use:
data.loc[data['A'].idxmax()]
data['A'].idxmax() - finds the row label of the max value
data.loc[] - returns that row (idxmax returns a label, so .loc is the safe choice; .iloc only works here with a default integer index)
If there are ties in the maximum values, then idxmax returns the index of only the first max value. For example, in the following DataFrame:
A B C
0 1 0 1
1 0 0 1
2 0 0 0
3 0 1 1
4 1 0 0
idxmax returns
A 0
B 3
C 0
dtype: int64
Now, if we want all indices corresponding to max values, then we could use max + eq to create a boolean DataFrame, then use it on df.index to filter out indexes:
out = df.eq(df.max()).apply(lambda x: df.index[x].tolist())
Output:
A [0, 4]
B [3]
C [0, 1, 3]
dtype: object
What worked for me is:
df[df['colX'] == df['colX'].max()]
You then get the row(s) in your df with the maximum value of colX.
If you just want the index, you can add .index at the end of the query.

pandas groupby select cells on condition

I want to group my DataFrame and then compute the mean number of dummy occurrences per group.
df3 = pd.DataFrame({'Number': ['001','001','001','002','002','002','002'],
                    'name': ['peter','chris','meg','albert','cathrine','leo','leo'],
                    'dummy': [0,1,1,0,0,1,1]})
I could calculate the mean number of unique occurrences (based on names) per group using this code:
test = df3.groupby('Number')
test_1 = []
for name, group in test:
    x = len(group.name.unique())
    test_1.append(x)
pd.Series(test_1).mean()
Now I want to calculate how often the dummy equals 1 on average per group, counting each name only once.
So for this example the calculation would be (2 + 1) / 2 = 1.5,
where (unique dummy counts from group 1 (2) + unique dummy counts from group 2 (1)) divided by the number of groups (2) = 1.5 unique dummy counts on average per group.
Note that if there is no dummy in a group, the number of groups in the denominator should still increase by 1.
Please comment if I didn't express the task clearly!
OK, I just found an answer to my question, even though it is a bit of a workaround:
df3 = pd.DataFrame({'Number': ['001','001','001','002','002','002','002'],
                    'name': ['peter','chris','meg','albert','cathrine','leo','leo'],
                    'dummy': [0,1,0,0,0,1,1]})
df4 = df3.loc[df3.dummy.isin([1])]   # new dataframe with only the rows where dummy == 1
test = df4.groupby('Number')         # group it by the Number column
test_1 = []
for name, group in test:
    x = len(group.name.unique())     # take only the unique names in each group
    test_1.append(x)
pd.Series(test_1).sum() / len(test)  # divide the value count by the number of groups
s = df3.groupby('Number').agg({"name":["nunique"], "dummy": ["sum"]})
sum(s["name"]["nunique"]/s["dummy"]["sum"])
If I understand correctly what you meant
And in a more elegant implementation -
def my_func(x):
    n = x['name'].nunique()
    s = x['dummy'].sum()
    return n/s

df3.groupby('Number').apply(my_func).mean()
Edit
I finally think that I understood after seeing the solution suggested by the question asker:
df4 = df3[df3.dummy == 1]
df4.groupby('Number').apply(lambda x: x["name"].nunique()).sum()/df4.Number.nunique()
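One caveat (my reading, not stated in the answer): dividing by df4.Number.nunique() only counts groups that contain at least one dummy == 1, whereas the question says groups with no dummy should still appear in the denominator. A sketch that divides by the full number of groups instead:

df4 = df3[df3.dummy == 1]
# unique names with dummy == 1, summed over groups, divided by the total number of groups
result = df4.groupby('Number')['name'].nunique().sum() / df3['Number'].nunique()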

Pandas Cumulative Sum using Current Row as Condition

I've got a fairly large data set of about 2 million records, each of which has a start time and an end time. I'd like to insert a field into each record that counts how many records there are in the table where:
Start time is less than or equal to "this row"'s start time
AND end time is greater than "this row"'s start time
So basically each record ends up with a count of how many events, including itself, are "active" concurrently with it.
I've been trying to teach myself pandas to do this, but I am not even sure where to start looking. I can find lots of examples of summing rows that meet a given condition like "> 2", but can't seem to grasp how to iterate over rows to conditionally sum a column based on values in the current row.
You can try the code below to get the final result.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[2,10],[5,8],[3,8],[6,9]]), columns=["start","end"])
active_events = {}
for i in df.index:
    active_events[i] = len(df[(df["start"] <= df.loc[i,"start"]) & (df["end"] > df.loc[i,"start"])])
last_columns = pd.DataFrame({'No. active events': pd.Series(active_events)})
df.join(last_columns)
Here goes. This is going to be SLOW.
Note that this counts each row as overlapping with itself, so the results column will never be 0. (Subtract 1 from the result to do it the other way.)
import pandas as pd

df = pd.DataFrame({'start_time': [4,3,1,2], 'end_time': [7,5,3,8]})
df = df[['start_time','end_time']]  # just changing the order of the columns for aesthetics

def overlaps_with_row(row, frame):
    starts_before_mask = frame.start_time <= row.start_time
    ends_after_mask = frame.end_time > row.start_time
    return (starts_before_mask & ends_after_mask).sum()

df['number_which_overlap'] = df.apply(overlaps_with_row, frame=df, axis=1)
Yields:
In [8]: df
Out[8]:
start_time end_time number_which_overlap
0 4 7 3
1 3 5 2
2 1 3 1
3 2 8 2
[4 rows x 3 columns]
def counter(s: pd.Series):
    return ((df["start"] <= s["start"]) & (df["end"] >= s["start"])).sum()

df["count"] = df.apply(counter, axis=1)
This is a simpler approach using the apply method. It doesn't compromise much on speed: apply is not as fast as native vectorized functions like cumsum(), but it should be faster than using an explicit for loop.
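For a table of around 2 million records, any row-wise apply is still O(n^2) and will be slow. A vectorized sketch using numpy.searchsorted (my addition, not from the answers above; it assumes start_time <= end_time in every row and the column names from the earlier example):

import numpy as np
import pandas as pd

df = pd.DataFrame({'start_time': [4, 3, 1, 2], 'end_time': [7, 5, 3, 8]})

starts = np.sort(df['start_time'].to_numpy())
ends = np.sort(df['end_time'].to_numpy())

# rows with start <= s, minus rows with end <= s, leaves rows with start <= s < end
started = np.searchsorted(starts, df['start_time'].to_numpy(), side='right')
finished = np.searchsorted(ends, df['start_time'].to_numpy(), side='right')
df['number_which_overlap'] = started - finished
print(df)   # 3, 2, 1, 2 -- matches the apply-based result above

The two sorts make this O(n log n), which scales to millions of rows.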
