Pandas, create column using previous new column value - python

I am using Python and have the following Pandas Dataframe:
idx
result
grouping
1
False
2
True
3
True
4
False
5
True
6
True
7
True
8
False
9
True
10
True
11
True
12
True
What I would like is to do the following logic...
if the result is False then I want grouping to be the idx value.
if the result is True then I want the grouping to be the previous grouping value
So the end result will be:
idx
result
grouping
1
False
1
2
True
1
3
True
1
4
False
4
5
True
4
6
True
4
7
True
4
8
False
8
9
True
8
10
True
8
11
True
8
12
True
8
I have tried all sorts to get this working from using the Pandas shift() command to using lambda, but I am just not getting it.
I know I could iterate through the dataframe and perform the calculation but there has to be a better method.
examples of what I have tried and failed with are:
df['grouping'] = df['idx'] if not df['result'] else df['grouping'].shift(1)
df['grouping'] = df.apply(lambda x: x['idx'] if not x['result'] else x['grouping'].shift(1), axis=1)
Many Thanks for any assistance you can provide.

mask true values then forward fill
df['grouping'] = df['idx'].mask(df['result']).ffill(downcast='infer')
idx result grouping
0 1 False 1
1 2 True 1
2 3 True 1
3 4 False 4
4 5 True 4
5 6 True 4
6 7 True 4
7 8 False 8
8 9 True 8
9 10 True 8
10 11 True 8
11 12 True 8

Related

Python pandas pivot

I have a pandas dataframe like below.
id A B C
0 1 1 1 1
1 1 5 7 2
2 2 6 9 3
3 3 1 5 4
4 3 4 6 2
After evaluating conditions,
id A B C a_greater_than_b b_greater_than_c c_greater_than_a
0 1 1 1 1 False False False
1 1 5 7 2 False True False
2 2 6 9 3 False True False
3 3 1 5 4 False True True
4 3 4 6 2 False True False
And after evaluating conditions, want to aggregate the results per id.
id a_greater_than_b b_greater_than_c c_greater_than_a
1 False False False
2 False True False
3 False True False
The logic is not fully clear, but you can combine pandas.get_dummies and aggregation per group (here I am assuming the min as your example showed that 1/1/0 -> 0 and 1/1/1 -> 1, but you can use other logics, e.g. last if you want to get the last row per group after sorting by date):
out = (pd
.get_dummies(df[['color', 'size']])
.groupby(df['id'])
.min()
)
print(out)
Output:
color_blue color_yellow size_l
id
A1 0 0 1

Group boolean values in Pandas Dataframe

i have a Dataframe with a random series of True, False in a column:
import pandas as pd
df = pd.DataFrame(data={'A':[True, False, True, True, False, False, True, False, True, False, False]})
df
A
0
True
1
False
2
True
3
True
4
False
5
False
6
True
7
False
8
True
9
False
10
False
and i want this: (Dont know how to explain it with easy words)
A
B
0
True
1
1
False
2
2
True
2
3
True
2
4
False
3
5
False
3
6
True
3
7
False
4
8
True
4
9
False
5
10
False
5
I've tried something with the following commands, but without success:
df[A].shift()
df[A].diff()
df[A].eq()
Many thanks for your help.
Matthias
IIUC, you can try:
df['B'] = (df.A.shift() & ~df.A).cumsum() + 1
# OR df['B'] = (df.A.shift() & ~df.A).cumsum().add(1)
OUTPUT:
A B
0 True 1
1 False 2
2 True 2
3 True 2
4 False 3
5 False 3
6 True 3
7 False 4
8 True 4
9 False 5
10 False 5
A little bit logic with diff
(~df.A.astype(int).diff().ne(-1)).cumsum()+1
Out[234]:
0 1
1 2
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 5
10 5
Name: A, dtype: int32

Trying to sort a pandas dataframe by a number column, but getting strange output

I have a dataframe (called df) with a length of 460 that looks like this
index Position T/F
0 1 True
1 2 False
4 3 False
8 4 False
9 18 True
13 5 False
And I would like to sort it by 'position' so that the whole dataframe looks like this
index Position T/F
0 1 True
1 2 False
4 3 False
8 4 False
13 5 False
20 6 False
28 7 True
I have attempted to use
df = df.sort_values('Position', ascending=True)
However, that outputs a rather bizarre dataframe with this form
index Position T/F
0 1 True
52 10 False
456 100 False
470 101 False
477 102 False
...
59 11 False
666 110 False
644 111 True
...
1 2 False
You get the idea. I'm not sure why it's sorting it like this, but I would like to figure out how to fix this issue so that I can output the desired DataFrame
Position seems to be string.
df['position'] = df['position'].astype(int)
Then do sorting.
df = df.sort_values('Position', ascending=True)
Output:
index Position T/F
0 1 True
1 2 False
4 3 False
8 4 False
13 5 False
20 6 False
28 7 True

Pandas: Filter a data-frame, and assign values to top n number of rows

import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4,2,5,6,7,1,8,9,2], 'city':[1,2,3,4,2,5,6,7,1,8,9,2]})
# The following code, creates a boolean filter,
filter = df.city==2
# Assigns True to all rows where filter is True
df.loc[filter,'selected']= True
What I need, is a change in the code so that it assigns True to given n number of rows.
The actual data frame has more than 3 million rows. Sometimes, I would want
df.loc[filter,'selected']= True for only 100 rows [Actual rows could be more or less than 100].
I believe you need filter by values defined in list first with isin and then for top 2 values use GroupBy.head:
cities= [2,3]
df = df1[df1.city.isin(cities)].groupby('city').head(2)
print (df)
col1 city
1 2 2
2 3 3
4 2 2
If need assign True in new column:
cities= [2,3]
idx = df1[df1.city.isin(cities)].groupby('city').head(2).index
df1.loc[idx, 'selected'] = True
print (df1)
col1 city selected
0 1 1 NaN
1 2 2 True
2 3 3 True
3 4 4 NaN
4 2 2 True
5 5 5 NaN
6 6 6 NaN
7 7 7 NaN
8 1 1 NaN
9 8 8 NaN
10 9 9 NaN
11 2 2 NaN
define a list of elements to be checked and pass it to city columns creating a new column with True & False booleans ..
>>> check
[2, 3]
>>> df['Citis'] = df.city.isin(check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
OR
>>> df['Citis'] = df['city'].apply(lambda x: x in check)
>>> df
col1 city Citis
0 1 1 False
1 2 2 True
2 3 3 True
3 4 4 False
4 2 2 True
5 5 5 False
6 6 6 False
7 7 7 False
8 1 1 False
9 8 8 False
10 9 9 False
11 2 2 True
Matter of fact indeed you need to the starting (lets say 5 values to be read)
df['Citis'] = df.city.isin(check).head(5)
OR
df['Citis'] = df['city'].apply(lambda x: x in check).head(5)

Get list of column names for columns that contain negative values

This is a simple question but I have found "slicing" DataFrames in Pandas frustrating, coming from R.
I have a DataFrame df below with 7 columns:
df
Out[77]:
fld1 fld2 fld3 fld4 fld5 fld6 fld7
0 8 8 -1 2 1 7 4
1 6 6 1 7 5 -1 3
2 2 5 4 2 2 8 1
3 -1 -1 7 2 3 2 0
4 6 6 4 2 0 5 2
5 -1 5 7 1 5 8 2
6 7 1 -1 0 1 8 1
7 6 2 4 1 2 6 1
8 3 4 4 5 8 -1 4
9 4 4 3 7 7 4 5
How do I slice df in such a way that it produces a list of columns that contain at least one negative number?
You can select them by building an appropriate Series and then using it to index into df:
>>> df < 0
fld1 fld2 fld3 fld4 fld5 fld6 fld7
0 False False True False False False False
1 False False False False False True False
2 False False False False False False False
3 True True False False False False False
4 False False False False False False False
5 True False False False False False False
6 False False True False False False False
7 False False False False False False False
8 False False False False False True False
9 False False False False False False False
>>> (df < 0).any()
fld1 True
fld2 True
fld3 True
fld4 False
fld5 False
fld6 True
fld7 False
dtype: bool
and then
>>> df.columns[(df < 0).any()]
Index(['fld1', 'fld2', 'fld3', 'fld6'], dtype='object')
or
>>> df.columns[(df < 0).any()].tolist()
['fld1', 'fld2', 'fld3', 'fld6']
depending on what data structure you want. We can also use this io index into df directly:
>>> df.loc[:,(df < 0).any()]
fld1 fld2 fld3 fld6
0 8 8 -1 7
1 6 6 1 -1
2 2 5 4 8
3 -1 -1 7 2
4 6 6 4 5
5 -1 5 7 8
6 7 1 -1 8
7 6 2 4 6
8 3 4 4 -1
9 4 4 3 4

Categories

Resources