Column location of last matching value in dataframe row (Python)

I have the following dataframe:
df1 = pd.DataFrame({1:[1,2,3,4], 2:[1,2,4,5], 3:[8,1,5,6]})
df1
Out[7]:
1 2 3
0 1 1 8
1 2 2 1
2 3 4 5
3 4 5 6
and I would like to create a new column that shows the distance of the last column containing a particular value (2 in this case) from a reference column (3 in this example), or NaN if no such value is found in the row. The output would be something like:
df1
Out[11]:
1 2 3 dist
0 1 1 8 NaN
1 2 2 1 1
2 3 4 5 NaN
3 4 5 6 NaN
What would be an effective way of accomplishing this task?

I think you need to subtract the column name of the last matching 2 from 3, the reference (last) column:
df1.columns = df1.columns.astype(int)
print((df1.columns.max() - df1.eq(2).iloc[:,::-1].idxmax(axis=1)).mask(lambda x: x == 0))
0 NaN
1 1.0
2 NaN
3 NaN
dtype: float64
Details:
Compare by 2:
print (df1.eq(2))
1 2 3
0 False False False
1 True True False
2 False False False
3 False False False
Inverse order of columns:
print (df1.eq(2).iloc[:,::-1])
3 2 1
0 False False False
1 False True True
2 False False False
3 False False False
Get the column name of the first True with idxmax (because the columns are reversed, this is the last match):
print (df1.eq(2).iloc[:,::-1].idxmax(axis=1))
0 3
1 2
2 3
3 3
dtype: int64
Subtract from the maximum column name; note this also returns 0 both when the match is in the reference column itself and when no value matches at all, which is why the zeros are masked to NaN above:
print (df1.columns.max() - df1.eq(2).iloc[:,::-1].idxmax(axis=1))
0 0
1 1
2 0
3 0
dtype: int64
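Putting it together, a minimal runnable sketch that assigns the dist column from the expected output (assuming the integer column labels above):
import pandas as pd

df1 = pd.DataFrame({1: [1, 2, 3, 4], 2: [1, 2, 4, 5], 3: [8, 1, 5, 6]})
df1.columns = df1.columns.astype(int)
# Column label of the last 2 per row (idxmax on the reversed columns).
last = df1.eq(2).iloc[:, ::-1].idxmax(axis=1)
# Distance from the reference column; mask the ambiguous zeros to NaN.
df1['dist'] = (df1.columns.max() - last).mask(lambda x: x == 0)
print(df1)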

Related

Cumulative count resetting to and staying 0 based on a condition in pandas

I am trying to create a column 'Count' on a pandas DataFrame that cumulatively counts when the field 'Boolean' is True but resets and stays at 0 when 'Boolean' is False. It also needs to be grouped by the ID column, so the count resets when a new ID begins. No loops please, as I am working with a big data set.
I used the code from the following question, which works, but I need to add a groupby to include the ID column grouping:
Pandas Dataframe - Row Iteration with Resetting Count-Value by Condition without loop
Expected output below: (ID, Boolean columns already exist, just need to create Count)
ID Boolean Count
1 True 1
1 True 2
1 True 3
1 True 4
1 True 5
1 False 0
1 False 0
1 False 0
1 False 0
1 True 1
1 True 2
1 True 3
2 True 1
2 True 2
2 True 3
2 True 4
2 False 0
2 False 0
2 False 0
2 True 1
2 True 2
2 True 3
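For a runnable setup, the sample data can be reconstructed from the expected output above (a sketch; the question does not show construction code):
import pandas as pd

df = pd.DataFrame({
    'ID': [1] * 12 + [2] * 10,
    'Boolean': [True] * 5 + [False] * 4 + [True] * 3
             + [True] * 4 + [False] * 3 + [True] * 3,
})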
Identify blocks by using cumsum on the inverted boolean mask, then group the dataframe by ID and blocks, and use cumsum on Boolean to create the counter:
b = (~df['Boolean']).cumsum()
df['Count'] = df.groupby(['ID', b])['Boolean'].cumsum()
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3
An alternative: use diff to flag where Boolean changes within each ID, turn those change points into block labels with cumsum, and then count within each (ID, block):
df['Count'] = df.groupby('ID')['Boolean'].diff()  # True where Boolean flips within an ID
df = df.fillna(False)  # the first row of each ID has no previous value
df['Count'] = df.groupby('ID')['Count'].cumsum()  # running label for each block
df['Count'] = df.groupby(['ID', 'Count'])['Boolean'].cumsum()  # counter within each block
df
ID Boolean Count
0 1 True 1
1 1 True 2
2 1 True 3
3 1 True 4
4 1 True 5
5 1 False 0
6 1 False 0
7 1 False 0
8 1 False 0
9 1 True 1
10 1 True 2
11 1 True 3
12 2 True 1
13 2 True 2
14 2 True 3
15 2 True 4
16 2 False 0
17 2 False 0
18 2 False 0
19 2 True 1
20 2 True 2
21 2 True 3
You can use a column shift for ID and Boolean columns to identify the groups to do the groupby on. Then do a cumsum for each of those groups.
groups = ((df['ID']!=df['ID'].shift()) | (df['Boolean']!=df['Boolean'].shift())).cumsum()
df.assign(Count2=df.groupby(groups)['Boolean'].cumsum())
Result
ID Boolean Count Count2
0 1 True 1 1
1 1 True 2 2
2 1 True 3 3
3 1 True 4 4
4 1 True 5 5
5 1 False 0 0
6 1 False 0 0
7 1 False 0 0
8 1 False 0 0
9 1 True 1 1
10 1 True 2 2
11 1 True 3 3
12 2 True 1 1
13 2 True 2 2
14 2 True 3 3
15 2 True 4 4
16 2 False 0 0
17 2 False 0 0
18 2 False 0 0
19 2 True 1 1
20 2 True 2 2
21 2 True 3 3

Python pandas pivot

I have a pandas dataframe like below.
id A B C
0 1 1 1 1
1 1 5 7 2
2 2 6 9 3
3 3 1 5 4
4 3 4 6 2
After evaluating the conditions, I get:
id A B C a_greater_than_b b_greater_than_c c_greater_than_a
0 1 1 1 1 False False False
1 1 5 7 2 False True False
2 2 6 9 3 False True False
3 3 1 5 4 False True True
4 3 4 6 2 False True False
Then I want to aggregate the results per id:
id a_greater_than_b b_greater_than_c c_greater_than_a
1 False False False
2 False True False
3 False True False
The logic is not fully clear, but you can combine pandas.get_dummies and per-group aggregation. Here I am assuming min, as your example shows that 1/1/0 -> 0 and 1/1/1 -> 1; you could use other logics, e.g. last if you want the last row per group after sorting by date. Note that the snippet below uses an illustrative frame with color and size columns, not the question's columns:
out = (pd
.get_dummies(df[['color', 'size']])
.groupby(df['id'])
.min()
)
print(out)
Output:
color_blue color_yellow size_l
id
A1 0 0 1
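Applied to this question's actual columns rather than that illustrative color/size frame, the same idea might look like this (a sketch; plain comparisons build the condition columns, so get_dummies is not needed):
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3, 3],
                   'A': [1, 5, 6, 1, 4],
                   'B': [1, 7, 9, 5, 6],
                   'C': [1, 2, 3, 4, 2]})
# Evaluate the row-wise conditions.
df['a_greater_than_b'] = df['A'] > df['B']
df['b_greater_than_c'] = df['B'] > df['C']
df['c_greater_than_a'] = df['C'] > df['A']
# Aggregate per id; min mirrors the "all rows per id" reading above.
cols = ['a_greater_than_b', 'b_greater_than_c', 'c_greater_than_a']
out = df.groupby('id')[cols].min().reset_index()
print(out)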

Pandas: Filter a data-frame, and assign values to top n number of rows

import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,2,5,6,7,1,8,9,2], 'city':[1,2,3,4,2,5,6,7,1,8,9,2]})
# The following code, creates a boolean filter,
filter = df.city==2
# Assigns True to all rows where filter is True
df.loc[filter,'selected']= True
What I need is a change in the code so that it assigns True to a given number n of rows. The actual data frame has more than 3 million rows. Sometimes I would want
df.loc[filter, 'selected'] = True for only 100 rows (the actual count could be more or less than 100).
I believe you need to filter by the values defined in a list first with isin, and then take the top 2 rows per city with GroupBy.head (df1 below refers to the question's original frame):
cities= [2,3]
df = df1[df1.city.isin(cities)].groupby('city').head(2)
print (df)
col1 city
1 2 2
2 3 3
4 2 2
If you need to assign True in a new column:
cities= [2,3]
idx = df1[df1.city.isin(cities)].groupby('city').head(2).index
df1.loc[idx, 'selected'] = True
print (df1)
col1 city selected
0 1 1 NaN
1 2 2 True
2 3 3 True
3 4 4 NaN
4 2 2 True
5 5 5 NaN
6 6 6 NaN
7 7 7 NaN
8 1 1 NaN
9 8 8 NaN
10 9 9 NaN
11 2 2 NaN
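For the single-condition case in the question (True for only the first n matching rows), a minimal self-contained sketch (n = 2 is a placeholder for e.g. 100):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 2, 5, 6, 7, 1, 8, 9, 2],
                   'city': [1, 2, 3, 4, 2, 5, 6, 7, 1, 8, 9, 2]})
n = 2
idx = df.index[df.city == 2][:n]  # labels of the first n matching rows
df.loc[idx, 'selected'] = True
print(df)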
Define a list of elements to be checked and pass it to the city column, creating a new column of True/False booleans:
>>> check
[2, 3]
>>> df['Citis'] = df.city.isin(check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
OR
>>> df['Citis'] = df['city'].apply(lambda x: x in check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
If in fact you only need the starting rows evaluated (say the first 5 values):
df['Citis'] = df.city.isin(check).head(5)
OR
df['Citis'] = df['city'].apply(lambda x: x in check).head(5)
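Note that .head(5) yields a 5-row Series, so on assignment the remaining rows become NaN; a sketch that fills them with False instead (assuming that is the desired behaviour):
df['Citis'] = df.city.isin(check).head(5).reindex(df.index, fill_value=False)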

compare multiple specific columns of all rows

I want to compare particular columns across each row; if they all hold the same value, extract that value to a new column, otherwise 0.
The example dataframe is as follows:
A B C D E F
13348 judte 1 1 1 1
54871 kfzef 1 1 0 1
89983 hdter 4 4 4 4
7543 bgfd 3 4 4 4
The result should be as follows:
A B C D E F Result
13348 judte 1 1 1 1 1
54871 kfzef 1 1 0 1 0
89983 hdter 4 4 4 4 4
7543 bgfd 3 4 4 4 0
I would be pleased to hear some suggestions.
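For reference, the sample frame can be reconstructed as follows (a sketch; numpy is imported for the np.where answers below):
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [13348, 54871, 89983, 7543],
                   'B': ['judte', 'kfzef', 'hdter', 'bgfd'],
                   'C': [1, 1, 4, 3],
                   'D': [1, 1, 4, 4],
                   'E': [1, 0, 4, 4],
                   'F': [1, 1, 4, 4]})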
Use:
cols = ['C','D','E','F']
df['Result'] = np.where(df[cols].eq(df[cols[0]], axis=0).all(axis=1), df[cols[0]], 0)
print (df)
A B C D E F Result
0 13348 judte 1 1 1 1 1
1 54871 kfzef 1 1 0 1 0
2 89983 hdter 4 4 4 4 4
3 7543 bgfd 3 4 4 4 0
Detail:
First compare all the columns in the cols list using eq against the first of them, df[cols[0]]:
print (df[cols].eq(df[cols[0]], axis=0))
C D E F
0 True True True True
1 True True False True
2 True True True True
3 True False False False
Then check whether each row is all True with all:
print (df[cols].eq(df[cols[0]], axis=0).all(axis=1))
0 True
1 False
2 True
3 False
dtype: bool
And last, use numpy.where to assign the first column's value where True and 0 where False.
I think you can use apply with nunique:
df['Result'] = df[['C','D','E','F']].apply(lambda x: x.iloc[0] if x.nunique()==1 else 0, axis=1)
Or using np.where:
df['Result'] = np.where(df[['C','D','E','F']].nunique(axis=1)==1, df['C'], 0)
print(df)
A B C D E F Result
0 13348 judte 1 1 1 1 1
1 54871 kfzef 1 1 0 1 0
2 89983 hdter 4 4 4 4 4
3 7543 bgfd 3 4 4 4 0

dataframe merger column operation

I got a dataframe like this:
A B C
1 1 1
2 2 2
3 3 3
4 1 1
I want to 'merge' the three columns to form a D column. The rule is: if there is at least one '1' in the row, then the value of D is '1', else it is '0'. How can I achieve this?
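For a runnable setup (a sketch reconstructing the sample above):
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [1, 2, 3, 1],
                   'C': [1, 2, 3, 1]})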
Use DataFrame.eq to compare values, DataFrame.any to check for at least one True per row, and finally cast the boolean mask to integers:
df['D'] = df.eq(1).any(axis=1).astype(int)
print (df)
A B C D
0 1 1 1 1
1 2 2 2 0
2 3 3 3 0
3 4 1 1 1
Detail:
print (df.eq(1))
A B C
0 True True True
1 False False False
2 False False False
3 False True True
print (df.eq(1).any(axis=1))
0 True
1 False
2 False
3 True
dtype: bool
