Get next value from a row that satisfies a condition in pandas - python

I have a DataFrame that looks something like this:
|    | event_type | object_id |
|---:|:-----------|----------:|
|  0 | A          |         1 |
|  1 | D          |         1 |
|  2 | A          |         1 |
|  3 | D          |         1 |
|  4 | A          |         2 |
|  5 | A          |         2 |
|  6 | D          |         2 |
|  7 | A          |         3 |
|  8 | D          |         3 |
|  9 | A          |         3 |
What I want to do is get the index of the next row where the event_type is A and the object_id is still the same; as an additional column, it would look like this:
|    | event_type | object_id | next_A |
|---:|:-----------|----------:|-------:|
|  0 | A          |         1 |      2 |
|  1 | D          |         1 |      2 |
|  2 | A          |         1 |    NaN |
|  3 | D          |         1 |    NaN |
|  4 | A          |         2 |      5 |
|  5 | A          |         2 |    NaN |
|  6 | D          |         2 |    NaN |
|  7 | A          |         3 |      9 |
|  8 | D          |         3 |      9 |
|  9 | A          |         3 |    NaN |
and so on.
I want to avoid using .apply() because my DataFrame is quite large; is there a vectorized way to do this?
EDIT: for multiple A/D pairs for the same object_id, I'd like it to always use the next index of A, like this:
|    | event_type | object_id | next_A |
|---:|:-----------|----------:|-------:|
|  0 | A          |         1 |      2 |
|  1 | D          |         1 |      2 |
|  2 | A          |         1 |      4 |
|  3 | D          |         1 |      4 |
|  4 | A          |         1 |    NaN |

You can do it with groupby like this:
def populate_next_a(object_df):
    # Record the index of each 'A' row, then back-fill so every row
    # knows the index of the next 'A' at or after it.
    object_df['a_index'] = pd.Series(object_df.index, index=object_df.index)[object_df.event_type == 'A']
    object_df['a_index'].fillna(method='bfill', inplace=True)
    # 'A' rows should point past themselves to the following 'A'.
    object_df['next_A'] = object_df['a_index'].where(object_df.event_type != 'A', object_df['a_index'].shift(-1))
    object_df = object_df.drop('a_index', axis=1)  # drop the helper column
    return object_df

result = df.groupby(['object_id']).apply(populate_next_a)
print(result)
event_type object_id next_A
0 A 1 2.0
1 D 1 2.0
2 A 1 NaN
3 D 1 NaN
4 A 2 5.0
5 A 2 NaN
6 D 2 NaN
7 A 3 9.0
8 D 3 9.0
9 A 3 NaN
GroupBy.apply will not have as much overhead as a simple apply.
Note that you cannot (yet) store integers alongside NaN: http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na, so the indices end up as float values.
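If you want integer output anyway, newer pandas versions (0.24+) offer the nullable Int64 extension dtype, which can hold missing values alongside integers; a minimal sketch:
# Nullable integer dtype keeps missing values (shown as <NA>) while
# displaying whole numbers instead of floats (requires pandas >= 0.24):
result['next_A'] = result['next_A'].astype('Int64')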

Related

How to unmerge the features of a dataframe from one column into several single columns separated by "\" via pandas?

More visually, I would like to move from this dataframe:
|    | A\B\C\D | Unnamed:1 | Unnamed:2 | Unnamed:3 | Unnamed:4 |
|---:|:--------|:----------|:----------|:----------|:----------|
|  0 | 1\2\3\4 | NaN       | NaN       | NaN       | NaN       |
|  1 | 1\2\3\4 | NaN       | NaN       | NaN       | NaN       |
|  2 | a\2\7\C | NaN       | NaN       | NaN       | NaN       |
|  3 | d\2\u\4 | NaN       | NaN       | NaN       | NaN       |
to this one:
|    | A | B | C | D |
|---:|:--|:--|:--|:--|
|  0 | 1 | 2 | 3 | 4 |
|  1 | 1 | 2 | 3 | 4 |
|  2 | a | 2 | 7 | C |
|  3 | d | 2 | u | 4 |
Thanks!
Try splitting the values first and then split the column name:
df2 = df.iloc[:, 0].str.split('\\', expand=True)
df2.columns = df.columns[0].split('\\')
df2
result:
A B C D
0 1 2 3 4
1 1 2 3 4
2 a 2 7 C
3 d 2 u 4
You can use the DataFrame constructor:
out = pd.DataFrame(df.iloc[:, 0].str.split('\\').tolist(),
                   columns=df.columns[0].split('\\'))
print(out)
# Output
A B C D
0 1 2 3 4
1 1 2 3 4
2 a 2 7 C
3 d 2 u 4
The question is: why do you have such input in the first place? Are you reading your data from a CSV file without using the right separator?
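If that is the case, passing the right separator to read_csv avoids the fix-up entirely; a sketch, assuming a hypothetical file name data.csv:
import pandas as pd

# Hypothetical file name; if the file really is backslash-delimited,
# reading it with the right separator produces the columns directly:
df = pd.read_csv('data.csv', sep='\\')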

For each ID in column 1, check whether two given values appear in column 2 in any order, and write the boolean result to column 3 - python

So I have the following dataframe
df = pd.DataFrame(data={'ID': [1,1,1,2,2,2,2,3,3,3,4,4,4],
                        'value': ["a","b","NA","a","a","NA","NA","a","NA","b","NA","b","NA"]})
| | ID | value |
|---:|-----:|:--------|
| 0 | 1 | a |
| 1 | 1 | b |
| 2 | 1 | NA |
| 3 | 2 | a |
| 4 | 2 | a |
| 5 | 2 | NA |
| 6 | 2 | NA |
| 7 | 3 | a |
| 8 | 3 | NA |
| 9 | 3 | b |
| 10 | 4 | NA |
| 11 | 4 | b |
| 12 | 4 | NA |
I want to check, for each ID in the "ID" column, whether both values "a" and "b" appear in the "value" column, and write the result to a "result" column, as shown in the table below. In the example only IDs "1" and "3" have both "a" and "b" in the "value" column, so only they get "yes" in the "result" column.
df = pd.DataFrame(data={'ID': [1,1,1,2,2,2,2,3,3,3,4,4,4],
                        'value': ["a","b","NA","a","a","NA","NA","a","NA","b","NA","b","NA"],
                        'result': ["yes","yes","yes","no","no","no","no","yes","yes","yes","no","no","no"]})
| | ID | value | result |
|---:|-----:|:--------|:---------|
| 0 | 1 | a | yes |
| 1 | 1 | b | yes |
| 2 | 1 | NA | yes |
| 3 | 2 | a | no |
| 4 | 2 | a | no |
| 5 | 2 | NA | no |
| 6 | 2 | NA | no |
| 7 | 3 | a | yes |
| 8 | 3 | NA | yes |
| 9 | 3 | b | yes |
| 10 | 4 | NA | no |
| 11 | 4 | b | no |
| 12 | 4 | NA | no |
Any suggestions? Thank you very much in advance.
One solution is this:
df["result"] = df.groupby("ID")["value"].transform(
    lambda x: "yes" if 'a' in x.values and 'b' in x.values else "no")
ID value result
0 1 a yes
1 1 b yes
2 1 NA yes
3 2 a no
4 2 a no
5 2 NA no
6 2 NA no
7 3 a yes
8 3 NA yes
9 3 b yes
10 4 NA no
11 4 b no
12 4 NA no
Let us correct the 'NA' strings to NaN, then transform with nunique:
import numpy as np

df.value = df.value.replace('NA', np.nan)
df['new'] = df.groupby('ID')['value'].transform('nunique') == 2
df
Out[135]:
ID value new
0 1 a True
1 1 b True
2 1 None True
3 2 a False
4 2 a False
5 2 None False
6 2 None False
7 3 a True
8 3 None True
9 3 b True
10 4 None False
11 4 b False
12 4 None False
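If you need the literal "yes"/"no" strings the question asks for, you can map the boolean column afterwards, e.g.:
# Convert the boolean flag into the requested labels:
df['result'] = df['new'].map({True: 'yes', False: 'no'})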
Try this. This will work if there are values other than just a and b in the df.
l = ['a','b']
df['result'] = df['ID'].map(df.groupby(['ID','value']).size().loc[(slice(None),l)].unstack().gt(0).all(axis=1))
or
df['ID'].map(df.groupby('ID')['value'].agg(set).ge({'a', 'b'}).map({True: 'yes', False: 'no'}))

Count the number of duplicate grouped by ID pandas

I'm not sure if this is a duplicate question, but here it goes.
Assuming I have the following table:
import pandas as pd

lst = [1,1,1,2,2,3,3,4,5]
lst2 = ['A','A','B','D','E','A','A','A','E']
df = pd.DataFrame(list(zip(lst, lst2)), columns=['ID', 'val'])
This will output the following table:
+----+-----+
| ID | Val |
+----+-----+
| 1 | A |
+----+-----+
| 1 | A |
+----+-----+
| 1 | B |
+----+-----+
| 2 | D |
+----+-----+
| 2 | E |
+----+-----+
| 3 | A |
+----+-----+
| 3 | A |
+----+-----+
| 4 | A |
+----+-----+
| 5 | E |
+----+-----+
The goal is to count the duplicates on Val grouped by ID:
+----+-----+--------------+
| ID | Val | is_duplicate |
+----+-----+--------------+
| 1 | A | 1 |
+----+-----+--------------+
| 1 | A | 1 |
+----+-----+--------------+
| 1 | B | 0 |
+----+-----+--------------+
| 2 | D | 0 |
+----+-----+--------------+
| 2 | E | 0 |
+----+-----+--------------+
| 3 | A | 1 |
+----+-----+--------------+
| 3 | A | 1 |
+----+-----+--------------+
| 4 | A | 0 |
+----+-----+--------------+
| 5 | E | 0 |
+----+-----+--------------+
I tried the following code, but it's counting the overall duplicates:
df_grouped = df.groupby(['val']).size().reset_index(name='count')
while the following code only flags duplicates, again ignoring the ID:
df.duplicated(subset=['val'])
What would be the best approach for this?
Let us try duplicated:
df['is_dup'] = df.duplicated(subset=['ID','val'], keep=False).astype(int)
df
Out[21]:
ID val is_dup
0 1 A 1
1 1 A 1
2 1 B 0
3 2 D 0
4 2 E 0
5 3 A 1
6 3 A 1
7 4 A 0
8 5 E 0
You can use .groupby on the relevant columns and get the count. Comparing the count with > 1 then yields a boolean that is True when the group contains duplicates. Finally, use .astype(int) to convert the boolean to an int, which changes True to 1 and False to 0:
df['is_duplicate'] = (df.groupby(['ID','val'])['val'].transform('count') > 1).astype(int)
Out[7]:
ID val is_duplicate
0 1 A 1
1 1 A 1
2 1 B 0
3 2 D 0
4 2 E 0
5 3 A 1
6 3 A 1
7 4 A 0

How do you specify pandas groupby operations that operate on previous records?

I have a pandas dataframe as follows, which has to be sorted by Col_2:
+----+-------+-------+
| id | Col_1 | Col_2 |
+----+-------+-------+
| 1 | 0 | 21 |
| 1 | 1 | 24 |
| 1 | 1 | 32 |
| 1 | 0 | 35 |
| 1 | 1 | 37 |
| 2 | 0 | 2 |
| 2 | 0 | 5 |
+----+-------+-------+
How can I create two new columns:
Col_1_sum: the sum of values in the previous rows for each id.
Col_2_max: the max value of Col_2 over the previous rows in which Col_1 was one (for each id).
For example, for the above dataframe the result should be:
+----+-------+-------+-----------+-----------+
| id | Col_1 | Col_2 | Col_1_Sum | Col_2_Max |
+----+-------+-------+-----------+-----------+
| 1 | 0 | 21 | 0 | 0 |
| 1 | 1 | 24 | 0 | 0 |
| 1 | 1 | 32 | 1 | 24 |
| 1 | 0 | 35 | 2 | 32 |
| 1 | 1 | 37 | 2 | 32 |
| 2 | 0 | 2 | 0 | 0 |
| 2 | 0 | 5 | 0 | 0 |
+----+-------+-------+-----------+-----------+
You have two questions. One at a time.
Your first question is answered with groupby, shift, and cumsum:
df.groupby('id').Col_1.apply(lambda x: x.shift().cumsum())
0 NaN
1 0.0
2 1.0
3 2.0
4 2.0
5 NaN
6 0.0
Name: Col_1, dtype: float64
Or, if you prefer cleaner output,
df.groupby('id').Col_1.apply(lambda x: x.shift().cumsum()).fillna(0).astype(int)
0 0
1 0
2 1
3 2
4 2
5 0
6 0
Name: Col_1, dtype: int64
Your second is similar, using groupby, shift, cummax, and ffill:
df.Col_2.where(df.Col_1.eq(1)).groupby(df.id).apply(
    lambda x: x.shift().cummax().ffill()
)
0 NaN
1 NaN
2 24.0
3 32.0
4 32.0
5 NaN
6 NaN
Name: Col_2, dtype: float64
In both cases, the essential ingredient is a groupby followed by a subsequent shift call. Note that these problems are difficult to solve sans apply because there are multiple operations to be carried out on sub-groups.
Consider taking the lambda out by defining a custom function. You'll save a few cycles on larger data.
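For example, the first computation with the lambda factored into a named function (same shift-then-cumsum logic as above):
# Defining the function once avoids re-creating the lambda per call
# and makes the intent explicit:
def shifted_cumsum(s):
    return s.shift().cumsum()

df.groupby('id').Col_1.apply(shifted_cumsum).fillna(0).astype(int)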

append pandas dataframe to column

I'm stuck and need some help. I have the following dataframe:
+-----+---+---+
|     | A | B |
+-----+---+---+
| 288 | 1 | 4 |
+-----+---+---+
| 245 | 2 | 3 |
+-----+---+---+
| 543 | 3 | 6 |
+-----+---+---+
| 867 | 1 | 9 |
+-----+---+---+
| 345 | 2 | 7 |
+-----+---+---+
| 122 | 3 | 8 |
+-----+---+---+
| 233 | 1 | 1 |
+-----+---+---+
| 346 | 2 | 6 |
+-----+---+---+
| 765 | 3 | 3 |
+-----+---+---+
Column A has repeating values as shown. Every time a value repeats in column A, I want a new column C containing the value of column B from that value's next occurrence, as shown below:
+-----+---+---+-----+
| | A | B | C |
+-----+---+---+-----+
| 288 | 1 | 4 | 9 |
+-----+---+---+-----+
| 245 | 2 | 3 | 7 |
+-----+---+---+-----+
| 543 | 3 | 6 | 8 |
+-----+---+---+-----+
| 867 | 1 | 9 | 1 |
+-----+---+---+-----+
| 345 | 2 | 7 | 6 |
+-----+---+---+-----+
| 122 | 3 | 8 | 3 |
+-----+---+---+-----+
| 233 | 1 | 1 | NaN |
+-----+---+---+-----+
| 346 | 2 | 6 | NaN |
+-----+---+---+-----+
| 765 | 3 | 3 | NaN |
+-----+---+---+-----+
Thanks.
Assuming that val is one of the repeated values,
slice = df.loc[df.A == val, 'B'].shift(-1)
will create a Series whose values are shifted up one position within the slice, still labeled with the slice's original index.
Since the slices' index labels never overlap, you can use pandas.concat to stitch the different slices together without fear of losing data. Then just attach the result as a new column:
df['C'] = pd.concat([df.loc[df['A'] == x, 'B'].shift(-1) for x in [1, 2, 3]])
When the column is assigned, the index values will make everything line up:
A B C
0 1 4 9.0
1 2 3 7.0
2 3 6 8.0
3 1 9 1.0
4 2 7 6.0
5 3 8 3.0
6 1 1 NaN
7 2 6 NaN
8 3 3 NaN
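To avoid hardcoding the repeated values, you could iterate over the distinct values instead (same logic, just generalized):
# One shifted slice per distinct value of A; index alignment in
# pd.concat puts every piece back in its original row:
df['C'] = pd.concat([df.loc[df['A'] == x, 'B'].shift(-1) for x in df['A'].unique()])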
Reverse the dataframe order, do a groupby transform with the shift function, and reverse it back:
df = df[::-1]
df['C'] = df.groupby(df.columns[0]).transform('shift')
df = df[::-1]
df
A B C
0 1 4 9.0
1 2 3 7.0
2 3 6 8.0
3 1 9 1.0
4 2 7 6.0
5 3 8 3.0
6 1 1 NaN
7 2 6 NaN
8 3 3 NaN
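Equivalently, you can skip the reversing entirely by shifting with a negative period within each group; a minimal sketch:
# shift(-1) pulls the next occurrence's B value up within each A-group,
# leaving NaN at each value's final occurrence:
df['C'] = df.groupby('A')['B'].shift(-1)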
