recombine string columns based on another column in pandas - python

I have a pandas DataFrame with 3 columns:

id  product_id  is_opt
 1           1  False
 1           2  False
 1           3  True
 1           4  True
 2           5  False
 2           6  False
 2           7  False
 3           8  False
 3           9  False
 3          10  True
I want to transform this DataFrame this way:
For a set of rows that share the same id, if all rows have is_opt = False, the set stays unchanged. For example, the rows with id = 2 do not change.
For a set of rows that share the same id, if at least one row has is_opt = True, we apply this transformation:
All rows that are is_opt = True stay unchanged.
Each row that is is_opt = False appends to its product_id the product_id of every row that is is_opt = True. If there are n rows with is_opt = True, then 1 row with is_opt = False yields n rows. For example, the first row [1, 1, False] yields the 2 rows [1, 1-3, False] and [1, 1-4, False].
The expected output for the example is:

id  product_id
 1  1-3
 1  1-4
 1  2-3
 1  2-4
 1  3
 1  4
 2  5
 2  6
 2  7
 3  8-10
 3  9-10
 3  10
The is_opt column has been dropped in the expected result.
Can you help me find an efficient set of operations to get this result? It is straightforward with some for loops, but I would like something efficient because the DataFrames in production are huge.
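For reference, a minimal sketch to reproduce the example frame (column names as in the question):

import pandas as pd

df = pd.DataFrame({
    'id':         [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
    'product_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'is_opt':     [False, False, True, True, False, False, False,
                   False, False, True],
})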

You can use a custom function and itertools.product:
from itertools import product
import pandas as pd

def combine(df):
    # groups with at least one is_opt=True row: pair each False row with each True row
    if df['is_opt'].any():
        a = df.loc[~df['is_opt'], 'product_id']
        b = df.loc[df['is_opt'], 'product_id']
        l = ['-'.join(map(str, p)) for p in product(a, b)]
        return pd.Series(l + b.tolist())
    # groups with no is_opt=True row stay unchanged
    return df['product_id']

out = df.groupby('id').apply(combine).droplevel(1).reset_index(name='product_id')
output:
    id product_id
0    1        1-3
1    1        1-4
2    1        2-3
3    1        2-4
4    1          3
5    1          4
6    2          5
7    2          6
8    2          7
9    3       8-10
10   3       9-10
11   3         10


Adding a boolean column to a pandas dataframe where one row being true should make all of that user's rows true

I am having trouble adding a boolean column to a pandas DataFrame. The data has users who can open projects in several places. I need to flag the users who have worked with the same project in several places: if a user has opened the same project in different places even once, shared_projects should be true, and then all rows with that user_id should be true.
Here is an example df:

user_id  project_id_x  project_id_y
      1             1             2
      1             3             4
      2             5             6
      2             7             7
      2             8             9
      3            10            11
      3            12            10
This is a simple example of what I would like to get out. If the condition is true on one row, it should be true on all rows with that user_id.

user_id  project_id_x  project_id_y  shared_projects
      1             1             2  false
      1             3             4  false
      2             5             6  true
      2             7             7  true
      2             8             9  true
      3            10            11  true
      3            12            10  true
I can get boolean values per row, but I am stuck on how to make the flag true for all of a user's rows when it is true on any one of them.
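A minimal sketch to build the example frame above:

import pandas as pd

df = pd.DataFrame({
    'user_id':      [1, 1, 2, 2, 2, 3, 3],
    'project_id_x': [1, 3, 5, 7, 8, 10, 12],
    'project_id_y': [2, 4, 6, 7, 9, 11, 10],
})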
Assuming you want to match on the same row:
df['shared_projects'] = (df['project_id_x'].eq(df['project_id_y'])
                           .groupby(df['user_id']).transform('any'))
If you want to match on any value x/y for a given user, you can use a set intersection:
s = df.groupby('user_id').apply(lambda g: bool(set(g['project_id_x'])
                                               .intersection(g['project_id_y'])))
df.merge(s.rename('shared_projects'), left_on='user_id', right_index=True)
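As a side note, since s is indexed by user_id, the same column can be attached without a merge (a small sketch of the equivalent):

df['shared_projects'] = df['user_id'].map(s)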
output:
   user_id  project_id_x  project_id_y  shared_projects
0        1             1             2            False
1        1             3             4            False
2        2             5             6             True
3        2             7             7             True
4        2             8             9             True
5        3            10            11             True
6        3            12            10             True
First, do a selection to find the rows where the user has worked on the same project in both columns:

df['shared_projects'] = (df['project_id_x'] == df['project_id_y'])

That will create a new boolean column, as you've already done. Then you can use the index of the True values to propagate the flag to the rest, assuming that "user_id" is the index of the dataframe:

for index in df[df['shared_projects'] == True].index.unique():
    df.loc[index, 'shared_projects'] = True
Update
Another approach without apply, using value_counts.
user_id = df.melt('user_id', var_name='project', value_name='project_id') \
            .value_counts(['user_id', 'project_id']) \
            .loc[lambda x: x > 1].index.get_level_values('user_id')
df['shared_projects'] = df['user_id'].isin(user_id)
Output:

>>> df
   user_id  project_id_x  project_id_y  shared_projects
0        1             1             2            False
1        1             3             4            False
2        2             5             6             True
3        2             7             7             True
4        2             8             9             True
# Intermediate result
>>> df.melt('user_id', var_name='project', value_name='project_id') \
      .value_counts(['user_id', 'project_id'])
user_id  project_id
2        7             2   # <- project 7 in multiple places for user 2
1        1             1
         2             1
         3             1
         4             1
2        5             1
         6             1
         8             1
         9             1
dtype: int64
Old answer
You can use melt:
shared_projects = lambda x: len(set(x)) != len(x)
user_id = df.melt('user_id').groupby('user_id')['value'].apply(shared_projects)
df['shared_projects'] = df['user_id'].isin(user_id[user_id].index)
Output:
>>> df
user_id project_id_x project_id_y shared_projects
0 1 1 2 False
1 1 3 4 False
2 2 5 6 True
3 2 7 7 True
4 2 8 9 True

drop row if one column is greater than another

I have the following data frame:
order_id  amount  records
       1       2        1
       2       5       10
       3      20        5
       4       1        3
I want to remove rows where the amount is greater than the records; the output should be:

order_id  amount  records
       2       5       10
       4       1        3
Here is what I've attempted:

df = df.drop(
    df[df.amount > df.records].index, inplace=True)

This is removing all rows; any suggestions are welcome.
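A quick sketch to rebuild the sample frame:

import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'amount':   [2, 5, 20, 1],
    'records':  [1, 10, 5, 3],
})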
Your attempt removes everything because drop(..., inplace=True) returns None, which you then assign back to df. Simply filter instead:

df = df[df['amount'] <= df['records']]

and you get the desired result:

   order_id  amount  records
1         2       5       10
3         4       1        3
df.loc[~df.amount.gt(df.records)]

   order_id  amount  records
1         2       5       10
3         4       1        3

Explanation: comparisons return a boolean Series:

~df.amount.gt(df.records)

0    False
1     True
2    False
3     True
dtype: bool

This returns True where amount is not greater than records.
You can use this boolean Series to index into the dataframe and get the desired rows.
Alternatively, you can avoid the negation (~) altogether:

df.loc[df.amount.le(df.records)]

remove NaNs only if a value already exists for the corresponding Id in pandas

I have this dataframe
Id,ProductId,Product
1,100,a
1,100,x
1,100,NaN
2,150,NaN
3,150,NaN
4,100,a
4,100,x
4,100,NaN
Here I want to remove some of the rows that contain NaN, but not all of them.
The removal criterion is as follows:
I want to remove only those NaN rows whose Id already has a value in the Product column.
For example, Id 1 already has values in the Product column and still contains a NaN, so I want to remove that row.
But for Id 2, there is only NaN in the Product column, so I don't want to remove that one. Similarly for Id 3, there are only NaN values in the Product column, and I want to keep that one too.
The final output would be like this:
Id,ProductId,Product
1,100,a
1,100,x
2,150,NaN
3,150,NaN
4,100,a
4,100,x
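A minimal sketch of the input frame, using np.nan for the missing products:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Id':        [1, 1, 1, 2, 3, 4, 4, 4],
    'ProductId': [100, 100, 100, 150, 150, 100, 100, 100],
    'Product':   ['a', 'x', np.nan, np.nan, np.nan, 'a', 'x', np.nan],
})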
Don't use groupby if an alternative exists, because it is slow.
vals = df.loc[df['Product'].notnull(), 'Id'].unique()
df = df[~(df['Id'].isin(vals) & df['Product'].isnull())]
print(df)

   Id  ProductId Product
0   1        100       a
1   1        100       x
3   2        150     NaN
4   3        150     NaN
5   4        100       a
6   4        100       x
Explanation:
First, get all Id values with at least one non-missing Product:

print(df.loc[df['Product'].notnull(), 'Id'].unique())
[1 4]
Then flag the rows in those groups that have missing values:

print(df['Id'].isin(vals) & df['Product'].isnull())
0    False
1    False
2     True
3    False
4    False
5    False
6    False
7     True
dtype: bool
Invert the boolean mask:

print(~(df['Id'].isin(vals) & df['Product'].isnull()))
0     True
1     True
2    False
3     True
4     True
5     True
6     True
7    False
dtype: bool
And finally filter by boolean indexing:

print(df[~(df['Id'].isin(vals) & df['Product'].isnull())])

   Id  ProductId Product
0   1        100       a
1   1        100       x
3   2        150     NaN
4   3        150     NaN
5   4        100       a
6   4        100       x
You can group the dataframe by Id and drop the NaNs when a group has more than one row (this assumes Id is the index, and that any multi-row group has at least one non-NaN Product):

>>> df.groupby(level='Id', group_keys=False
   ).apply(lambda x: x.dropna() if len(x) > 1 else x)

    ProductId Product
Id
1         100       a
1         100       x
2         150     NaN
3         150     NaN
4         100       a
4         100       x
Calculate groups (Id) where values (Product) are all null, then remove required rows via Boolean indexing with loc accessor:
nulls = df.groupby('Id')['Product'].apply(lambda x: x.isnull().all())
nulls_idx = nulls[nulls].index
df = df.loc[~(~df['Id'].isin(nulls_idx) & df['Product'].isnull())]
print(df)

   Id  ProductId Product
0   1        100       a
1   1        100       x
3   2        150     NaN
4   3        150     NaN
5   4        100       a
6   4        100       x
Use groupby + transform with 'count', which counts non-NaN values per group, then boolean indexing using isnull on the Product column:

count = df.groupby('Id')['Product'].transform('count')
df = df[~(count.ne(0) & df.Product.isnull())]
print(df)

   Id  ProductId Product
0   1        100       a
1   1        100       x
3   2        150     NaN
4   3        150     NaN
5   4        100       a
6   4        100       x

Retain only duplicated rows in a pandas dataframe

I have a dataframe with two columns: "Agent" and "Client".
Each row corresponds to an interaction between an agent and a client.
I want to keep only the rows where a client had interactions with at least 2 agents.
How can I do that?
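For a concrete toy example (hypothetical data, since the question gives none):

import pandas as pd

df = pd.DataFrame({
    'Agent':  ['a1', 'a2', 'a1', 'a3', 'a2'],
    'Client': ['c1', 'c1', 'c2', 'c2', 'c3'],
})
# c1 and c2 each interacted with 2 agents; c3 with only 1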
Worth adding that now you can use df.duplicated:

df = df.loc[df.duplicated(subset='Client', keep=False)]

Use groupby and transform by value_counts:

df[df.Client.groupby(df.Client).transform('value_counts') > 1]

Note that, as mentioned here, you might have one agent interacting with the same client multiple times; such rows would be retained as false positives. If you do not want this, you could add a drop_duplicates call before filtering:

df = df.drop_duplicates()
df = df[df.Client.groupby(df.Client).transform('value_counts') > 1]

A demonstration on a small generic frame, with column B playing the role of Client:

print(df)
   A  B
0  1  2
1  2  5
2  3  1
3  4  1
4  5  5
5  6  1

mask = df.B.groupby(df.B).transform('value_counts') > 1
print(mask)

0    False
1     True
2     True
3     True
4     True
5     True
Name: B, dtype: bool

df = df[mask]
print(df)

   A  B
1  2  5
2  3  1
3  4  1
4  5  5
5  6  1

Select row in DataFrame based on values in multiple rows

I've got a DataFrame and I'd like to select rows where one column has a certain value AND the row above has a certain value in another column. How do I do this without a for loop?
For example:
df = pd.DataFrame({'one': [1,2,3,4,1,2,3,4], 'two': [1,2,3,4,5,6,7,8]})
I'd like to find the row where df.one equals 1 and df.two on the row above equals 4, so in the example row 4, with values [1, 5].
You can try shift with boolean indexing:
print(df)

   one  two
0    1    1
1    2    2
2    3    3
3    4    4
4    1    5
5    2    6
6    3    7
7    4    8

print((df.one == 1) & (df.two.shift() == 4))

0    False
1    False
2    False
3    False
4     True
5    False
6    False
7    False
dtype: bool

print(df[(df.one == 1) & (df.two.shift() == 4)])

   one  two
4    1    5
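If you only need the matching row labels rather than the rows themselves, the same mask can be reused (a small sketch):

mask = (df.one == 1) & (df.two.shift() == 4)
print(df.index[mask].tolist())  # [4]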
