How to do the classification and count of DataFrame columns? [duplicate] - python

This question already has answers here:
GroupBy Pandas Count Consecutive Zero's
(2 answers)
Closed 1 year ago.
I want to count runs of consecutive 0s. Every row in a run of 0s should get the length of that run in a count column, every row with a 1 should get 0, and the count should restart after each 1.
An example of my Dataframe is as follows:
import numpy as np
import pandas as pd
np.random.seed(2021)
a = np.random.randint(0, 2, 20)
df = pd.DataFrame(a, columns=['No.'])
print(df)
No.
0 0
1 1
2 1
3 0
4 1
5 0
6 0
7 0
8 1
9 0
10 1
11 1
12 1
13 1
14 0
15 0
16 0
17 0
18 0
19 0
The result I need:
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 3
6 0 3
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 6
15 0 6
16 0 6
17 0 6
18 0 6
19 0 6
I tried the following methods, but none of them achieved my results. What should I do?
groups = df['No.'].ne(0).cumsum()
df['count'] = df['No.'].eq(0).groupby(groups).count()
df['count'] = df['No.'].eq(0).groupby(groups).agg(len)
df['count'] = df['No.'].groupby(groups).agg(len)
df['count'] = df['No.'].groupby(groups).count()

For your groups variable, compute diff first, so that each consecutive run of identical values gets its own id. Then, to get a count Series with the same length as the original DataFrame (so it can be assigned back), use transform instead of agg:
df['count'] = 0
groups = df['No.'].diff().ne(0).cumsum()
df.loc[df['No.'] == 0, 'count'] = df['No.'].groupby(groups).transform('size')
df
No. count
0 0 1
1 1 0
2 1 0
3 0 1
4 1 0
5 0 3
6 0 3
7 0 3
8 1 0
9 0 1
10 1 0
11 1 0
12 1 0
13 1 0
14 0 6
15 0 6
16 0 6
17 0 6
18 0 6
19 0 6
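As a self-contained check, the same idea can be sketched with the question's 20 values typed in directly, so the result does not depend on the random seed. The names run_id and run_len are illustrative and not part of the original answer; shift/ne is used here in place of diff, which should label the runs the same way.

```python
import pandas as pd

# The 20 values shown in the question, entered by hand.
df = pd.DataFrame(
    {"No.": [0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]}
)

# A new run starts wherever the value differs from the previous row.
run_id = df["No."].ne(df["No."].shift()).cumsum()

# transform("size") broadcasts each run's length back to every row in it.
run_len = df.groupby(run_id)["No."].transform("size")

# Keep the run length only for rows that are 0; rows with 1 get 0.
df["count"] = run_len.where(df["No."].eq(0), 0)
print(df)
```

The key difference from the attempts in the question is transform, which returns one value per row instead of one value per group.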

Related

Pandas add column on condition: If value of cell is True set value of largest number in Period to true

I have a pandas dataframe with, let's say, two columns, for example:
value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0
Now I want to add a third column (new_boolean) with the following criteria:
I specify a period, for this example period = 4.
Now I take a look at all rows where boolean == 1.
new_boolean will be 1 for the row that holds the maximum value among the last period rows (the current row included).
For example, I have boolean == 1 for row 2 (counting rows from 1, i.e. index 1). So I look at the last period rows; only two exist so far, with values [1, 5]. 5 is the maximum, so new_boolean in row 2 will be 1.
Second example: row 8 (value = 7, index 7). I get the values [7, 4, 12, 9]; 12 is the maximum, so new_boolean will be 1 in the row with value 12.
result:
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
How can I do this algorithmically?
Compute the rolling max of the 'value' column
>>> rolling_max_value = df.rolling(window=4, min_periods=1)['value'].max()
>>> rolling_max_value
0 1.0
1 5.0
2 5.0
3 5.0
4 9.0
5 12.0
6 12.0
7 12.0
8 12.0
9 8.0
10 17.0
11 17.0
12 17.0
Name: value, dtype: float64
Select only the relevant values, i.e. where 'boolean' = 1
>>> on_values = rolling_max_value[df.boolean == 1].unique()
>>> on_values
array([ 5., 9., 12., 17.])
The rows where 'new_boolean' = 1 are the ones where 'value' belongs to on_values
>>> df['new_boolean'] = df.value.isin(on_values).astype(int)
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
EDIT:
OP raised a good point
Does this also work if I have multiple columns with the same value and they have different booleans?
The previous solution doesn't account for that. To solve this, instead of computing the rolling max, we gather the row labels associated with the rolling max values, i.e. the rolling argmax or idxmax. To my knowledge, Rolling objects don't have an idxmax method, but we can easily compute it via apply.
def idxmax(values):
    # values is the window as a Series, so idxmax returns a row label
    return values.idxmax()
rolling_idxmax_value = (
    df.rolling(min_periods=1, window=4)['value']
    .apply(idxmax)
    .astype(int)
)
on_idx = rolling_idxmax_value[df.boolean == 1].unique()
df['new_boolean'] = 0
df.loc[on_idx, 'new_boolean'] = 1
Results:
>>> rolling_idxmax_value
0 0
1 1
2 1
3 1
4 4
5 5
6 5
7 5
8 5
9 8
10 10
11 10
12 10
Name: value, dtype: int64
>>> on_idx
[ 1 4 5 10]
>>> df
value boolean new_boolean
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 1
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 1
11 15 1 0
12 6 0 0
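To make the duplicate-value concern concrete, here is a minimal sketch on a small made-up frame (the data and the names by_value and by_label are hypothetical, not from the question). The value 5 appears twice, but only the first occurrence lies inside the window of the row where boolean == 1.

```python
import pandas as pd

# Hypothetical data: the value 5 appears at index 0 and again at index 6.
df = pd.DataFrame(
    {"value": [5, 1, 2, 1, 1, 1, 5, 3], "boolean": [0, 0, 1, 0, 0, 0, 0, 0]}
)
period = 4
flagged = df["boolean"] == 1

# Value-based approach: flags every row whose value equals a selected max.
rolling_max = df["value"].rolling(window=period, min_periods=1).max()
by_value = df["value"].isin(rolling_max[flagged]).astype(int)

# Label-based approach: flags only the row where the window's max sits.
rolling_idx = (
    df["value"]
    .rolling(window=period, min_periods=1)
    .apply(lambda w: w.idxmax(), raw=False)
    .astype(int)
)
by_label = pd.Series(0, index=df.index)
by_label.loc[rolling_idx[flagged].unique()] = 1

print(by_value.tolist())
print(by_label.tolist())
```

In this sketch the value-based version also marks index 6, which is outside the window, while the label-based version marks only index 0.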
I did this in 2 steps, but I think the solution is much clearer:
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO('''
id value boolean
0 1 0
1 5 1
2 0 0
3 3 0
4 9 1
5 12 0
6 4 0
7 7 1
8 8 1
9 2 0
10 17 0
11 15 1
12 6 0'''), sep=r'\s+', index_col=0)
df['new_bool'] = df['value'].rolling(min_periods=1, window=4).max()
df['new_bool'] = df.apply(lambda x: 1 if ((x['value'] == x['new_bool']) & (x['boolean'] == 1)) else 0, axis=1)
df
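Result (rows 5 and 10 are 0 here, unlike the expected output in the question, because this version only flags a row when its own boolean is 1):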
value boolean new_bool
id
0 1 0 0
1 5 1 1
2 0 0 0
3 3 0 0
4 9 1 1
5 12 0 0
6 4 0 0
7 7 1 0
8 8 1 0
9 2 0 0
10 17 0 0
11 15 1 0
12 6 0 0

Pandas - changing rows where less than n subsequent values are equal

I have the following dataframe:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,1,1,0,0,1,1,1,0,1,1,1,1,0,0,0]})
Now I would like to set to zero all the rows where fewer than four 1s appear "in a row", i.e. I would like to have the following resulting DataFrame:
df = pd.DataFrame({"col":[0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,1,0,0,0]})
I was not able to find a way to achieve this nicely...
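Try groupby and where: label each run of equal values, sum within each run (a run of 1s sums to its length, a run of 0s sums to 0), and keep only the rows whose run sum is at least 4: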
streaks = df.groupby(df["col"].ne(df["col"].shift()).cumsum()).transform("sum")
output = df.where(streaks.ge(4), 0)
>>> output
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
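Alternatively, start a new group at every 0 and count the rows per group. A group with fewer than 5 rows (the leading 0 plus fewer than four 1s) is set to 0: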
df.loc[df.groupby(df.col.eq(0).cumsum()).transform('count')['col']<5,'col'] = 0
df
Out[77]:
col
0 0
1 0
2 1
3 1
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 1
17 1
18 1
19 1
20 0
21 0
22 0
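The run-length idea can also be sketched by measuring each run with size instead of sum, which would still work if the column held values other than 0 and 1. This is only a variant under that assumption; the names run_id, run_len and min_len are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"col": [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0]}
)
min_len = 4  # shortest run of 1s that should survive

# Label consecutive runs of equal values.
run_id = df["col"].ne(df["col"].shift()).cumsum()

# Length of the run each row belongs to.
run_len = df.groupby(run_id)["col"].transform("size")

# Keep a 1 only if it sits in a run of at least min_len ones.
keep = df["col"].eq(1) & run_len.ge(min_len)
df["col"] = keep.astype(int)
print(df["col"].tolist())
```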

Multiple condition count across two dataframes

I've tried several solutions from similar problems, but so far, no luck. I know it's probably simple.
I have two pandas dataframes. One contains temperatures and months, df1. The other contains months and a possible range of temperatures, df2. I would like to count how many times a temperature for a particular month occurs based on df2.
df1:
Month Temp
1 10
1 10
1 20
2 5
2 10
2 15
df2:
Month Temp
1 0
1 5
1 10
1 15
1 20
1 25
2 0
2 5
2 10
2 15
2 20
2 25
Desired output with a new column, Count, in df2:
Month Temp Count
1 0 0
1 5 0
1 10 2
1 15 0
1 20 1
1 25 0
2 0 0
2 5 1
2 10 1
2 15 1
2 20 0
2 25 0
import pandas as pd
df1 = pd.DataFrame({'Month': [1]*3 + [2]*3,
'Temp': [10,10,20,5,10,15]})
df2 = pd.DataFrame({'Month': [1]*6 + [2]*6,
'Temp': [0,5,10,15,20,25]*2})
df2['Count'] =
An approach using value_counts and reindex:
new_index = pd.MultiIndex.from_frame(df2)
new_df = (
df1.value_counts(["Month", "Temp"])
.reindex(new_index, fill_value=0)
.rename("Count")
.reset_index()
)
Month Temp Count
0 1 0 0
1 1 5 0
2 1 10 2
3 1 15 0
4 1 20 1
5 1 25 0
6 2 0 0
7 2 5 1
8 2 10 1
9 2 15 1
10 2 20 0
11 2 25 0
Try this:
(df2.join(
    # count each (Month, Temp) pair in df1, then look it up from df2
    df1.groupby(['Month','Temp']).size().rename('count'),
    on=['Month','Temp'])
 .fillna(0))
Another solution:
x = (
df1.assign(Count=1)
.merge(df2, on=["Month", "Temp"], how="outer")
.fillna(0)
.groupby(["Month", "Temp"], as_index=False)
.sum()
.astype(int)
)
print(x)
Prints:
Month Temp Count
0 1 0 0
1 1 5 0
2 1 10 2
3 1 15 0
4 1 20 1
5 1 25 0
6 2 0 0
7 2 5 1
8 2 10 1
9 2 15 1
10 2 20 0
11 2 25 0
try:
res = (df2.set_index(['Month', 'Temp'])
.join(df1.value_counts().to_frame(name='count'))
.reset_index().fillna(0).astype(int))
OR
di = df1.value_counts().to_dict()
df2['count'] = df2.apply(lambda x: 0 if tuple(x) not in di.keys() else di[tuple(x)], axis=1)
Month Temp count
0 1 0 0
1 1 5 0
2 1 10 2
3 1 15 0
4 1 20 1
5 1 25 0
6 2 0 0
7 2 5 1
8 2 10 1
9 2 15 1
10 2 20 0
11 2 25 0
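The answers above share one pattern: count the (Month, Temp) pairs in df1, then align those counts to the rows of df2. A minimal sketch of that pattern with a left merge, which should keep df2's row order, could look like this (the names counts and out are illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({"Month": [1] * 3 + [2] * 3,
                    "Temp": [10, 10, 20, 5, 10, 15]})
df2 = pd.DataFrame({"Month": [1] * 6 + [2] * 6,
                    "Temp": [0, 5, 10, 15, 20, 25] * 2})

# One row per (Month, Temp) pair seen in df1, with its number of occurrences.
counts = df1.groupby(["Month", "Temp"]).size().reset_index(name="Count")

# A left merge keeps every row of df2; pairs missing from df1 become NaN.
out = df2.merge(counts, on=["Month", "Temp"], how="left")
out["Count"] = out["Count"].fillna(0).astype(int)
print(out)
```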

Replace values in df col - pandas

I'm aiming to replace values in a df column Num. Specifically:
Wherever Num is 1, I want to backfill the preceding 0s with 1, working backwards up to and including the nearest row where Item is 1.
where Num == 1, the corresponding row in Item will always be 0.
Also, Num == 0 will always follow Num == 1.
Input and code:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Item' : [0,1,2,3,4,4,0,1,2,3,1,1,2,3,4,0],
'Num' : [0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0]
})
df['Num'] = np.where((df['Num'] == 1) & (df['Item'].shift() > 1), 1, 0)
Item Num
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 0
12 2 0
13 3 0
14 4 1
15 0 0
intended output:
Item Num
0 0 0
1 1 1
2 2 1
3 3 1
4 4 1
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 1
12 2 1
13 3 1
14 4 1
15 0 0
First, create groups of the rows according to the two start and end conditions using cumsum. Then we can group by this new column and sum over the Num column. In this way, all groups that contain a 1 in the Num column will get the value 1 while all other groups will get 0.
groups = ((df['Num'].shift() == 1) | (df['Item'] == 1)).cumsum()
df['Num'] = df.groupby(groups)['Num'].transform('sum')
Result:
Item Num
0 0 0
1 1 1
2 2 1
3 3 1
4 4 1
5 4 1
6 0 0
7 1 0
8 2 0
9 3 0
10 1 0
11 1 1
12 2 1
13 3 1
14 4 1
15 0 0
You could try:
for a, b in zip(df[df['Item'] == 0].index, df[df['Num'] == 1].index):
    df.loc[(df.loc[a+1:b-1, 'Item'] == 1)[::-1].idxmax():b-1, 'Num'] = 1
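The cumsum grouping described in the first answer can be checked end to end with a short sketch. It deviates slightly from that answer by using transform("max") instead of "sum", on the assumption that the result should stay 0/1 even if a block ever contained more than one 1. The names starts_block, block_id and Num_filled are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Item": [0, 1, 2, 3, 4, 4, 0, 1, 2, 3, 1, 1, 2, 3, 4, 0],
    "Num":  [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
})

# A block starts right after a row with Num == 1, or at any row with Item == 1.
starts_block = df["Num"].shift().eq(1) | df["Item"].eq(1)
block_id = starts_block.cumsum()

# Every row in a block gets 1 if the block contains a 1, else 0.
df["Num_filled"] = df.groupby(block_id)["Num"].transform("max")
print(df)
```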

Create Pandas DataFrame from (row, column, value) data

I have a Pandas Dataframe with three columns: row, column, value. The row values are all integers below some N, and the column values are all integers below some M. The values are all positive integers.
How do I efficiently create a Dataframe with N rows and M columns, with at index i, j the value val if (i, j , val) is a row in my original Dataframe, and some default value (0) otherwise? Furthermore, is it possible to create a sparse Dataframe immediately, since the data is already quite large, but N*M is still about 10 times the size of my data?
A NumPy solution would suit here for performance -
a = df.values
m,n = a[:,:2].max(0)+1
out = np.zeros((m,n),dtype=a.dtype)
out[a[:,0], a[:,1]] = a[:,2]
df_out = pd.DataFrame(out)
Sample run -
In [58]: df
Out[58]:
row col val
0 7 1 30
1 3 3 0
2 4 8 30
3 5 8 18
4 1 3 6
5 1 6 48
6 0 2 6
7 4 7 6
8 5 0 48
9 8 1 48
10 3 2 12
11 6 8 18
In [59]: df_out
Out[59]:
0 1 2 3 4 5 6 7 8
0 0 0 6 0 0 0 0 0 0
1 0 0 0 6 0 0 48 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 12 0 0 0 0 0 0
4 0 0 0 0 0 0 0 6 30
5 48 0 0 0 0 0 0 0 18
6 0 0 0 0 0 0 0 0 18
7 0 30 0 0 0 0 0 0 0
8 0 48 0 0 0 0 0 0 0
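The answer infers the output shape from the largest row and column index in the data. If N and M are known up front, as the question suggests, a sketch that passes them explicitly could look like the following. The sample rows are copied from the run above, and N = M = 9 is an assumption chosen to fit that sample. This sketch builds a dense array only and does not address the sparse part of the question.

```python
import numpy as np
import pandas as pd

# (row, col, val) triples from the sample run above.
df = pd.DataFrame({
    "row": [7, 3, 4, 5, 1, 1, 0, 4, 5, 8, 3, 6],
    "col": [1, 3, 8, 8, 3, 6, 2, 7, 0, 1, 2, 8],
    "val": [30, 0, 30, 18, 6, 48, 6, 6, 48, 48, 12, 18],
})
N, M = 9, 9  # assumed known dimensions

# Start from the default value 0, then scatter the known values in one step.
dense = np.zeros((N, M), dtype=df["val"].dtype)
dense[df["row"].to_numpy(), df["col"].to_numpy()] = df["val"].to_numpy()

df_out = pd.DataFrame(dense)
print(df_out)
```

Passing N and M explicitly keeps trailing all-zero rows or columns that the max-based shape would drop.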
