I have a dataframe where new columns need to be added based on conditions on existing column values, and I am looking for an efficient way of doing this.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': ['x', 'y', 'x'],
                   's': ['proda', 'prodb', 'prodc'],
                   'r': ['oz1', '0z2', 'oz3']})
I need to create 2 new columns ['c', 'd'] based on the following conditions:
if df['b'] == 'x':
    df['c'] = df['s']
    df['d'] = df['r']
elif df['b'] == 'y':
    # assign different values to the c, d columns
We can use numpy.where and apply the condition for each new column, like:
df['c'] = np.where(condition, value_if_true, value_if_false)
df['d'] = np.where(condition, value_if_true, value_if_false)
But I am looking for a way to do this in a single statement, without a for loop or multiple numpy/pandas apply calls.
The exact output is unclear, but you can use numpy.where with 2D data.
For example:
cols = ['c', 'd']
df[cols] = np.where(df['b'].eq('x').to_numpy()[:, None],
                    df[['s', 'r']], np.nan)
output:
a b s r c d
0 1 x proda oz1 proda oz1
1 2 y prodb 0z2 NaN NaN
2 3 x prodc oz3 prodc oz3
If you want multiple conditions, use np.select:
cols = ['c', 'd']
df[cols] = np.select([df['b'].eq('x').to_numpy()[:, None],
                      df['b'].eq('y').to_numpy()[:, None]],
                     [df[['s', 'r']],
                      df[['r', 'a']]],
                     np.nan)
It is, however, easier to use a loop over the conditions if you have many:
cols = ['c', 'd']
df[cols] = np.select([df['b'].eq(c).to_numpy()[:, None] for c in ['x', 'y']],
                     [df[repl] for repl in (['s', 'r'], ['r', 'a'])],
                     np.nan)
output:
a b s r c d
0 1 x proda oz1 proda oz1
1 2 y prodb 0z2 0z2 2
2 3 x prodc oz3 prodc oz3
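If you end up with many conditions, a variation on the loop above is to drive np.select from a mapping of 'b' value to replacement columns. This is just a sketch reusing the example frame; the mapping dict is illustrative, not something from the question:
cols = ['c', 'd']
# hypothetical mapping: value of 'b' -> columns whose values go into ['c', 'd']
mapping = {'x': ['s', 'r'], 'y': ['r', 'a']}
df[cols] = np.select([df['b'].eq(key).to_numpy()[:, None] for key in mapping],
                     [df[repl] for repl in mapping.values()],
                     np.nan)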
I have a dataframe with a unique index and columns 'users', 'tweet_times' and 'tweet_id'.
I want to count the number of duplicate tweet_times values per user.
import pandas as pd

users = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
tweet_times = ['01-01-01 01:00', '02-02-02 02:00', '03-03-03 03:00', '09-09-09 09:00',
               '04-04-04 04:00', '04-04-04 04:00', '05-05-05 05:00', '09-09-09 09:00',
               '06-06-06 06:00', '06-06-06 06:00', '07-07-07 07:00', '07-07-07 07:00']
d = {'users': users, 'tweet_times': tweet_times}
df = pd.DataFrame(data=d)
Desired Output
A: 0
B: 1
C: 2
I managed to get the desired output (except for the A: 0) using the code below, but is there a more pythonic / efficient way to do this?
# group by both columns
df2 = pd.DataFrame(df.groupby(['users', 'tweet_times']).tweet_id.count())
# filter out values < 2
df3 = df2[df2.tweet_id > 1]
# turn multi-index level 1 into column
df3.reset_index(level=[1], inplace=True)
# final groupby
df3.groupby('users').tweet_times.count()
We can use crosstab to create a frequency table, then check for counts greater than 1 to create a boolean mask, then sum this mask along axis=1:
pd.crosstab(df['users'], df['tweet_times']).gt(1).sum(1)
users
A 0
B 1
C 2
dtype: int64
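To see what the one-liner works on, here is the intermediate step spelled out (a sketch reusing the example df from the question):
# frequency table: one row per user, one column per tweet time, values are counts
ct = pd.crosstab(df['users'], df['tweet_times'])
# times posted more than once become True; summing each row counts them per user
ct.gt(1).sum(axis=1)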
This works:
df1 = pd.DataFrame(df.groupby(['users'])['tweet_times'].value_counts()).reset_index(level = 0)
df1.groupby('users')['tweet_times'].apply(lambda x: sum(x>1))
users
A 0
B 1
C 2
Name: tweet_times, dtype: int64
You can use a custom boolean mask with your groupby.
With keep=False, duplicated returns True when a value is duplicated and False if not.
# df['tweet_times'] = pd.to_datetime(df['tweet_times'],errors='coerce')
df.groupby([df.duplicated(subset=['tweet_times'],keep=False),'users']
).nunique().loc[True]
tweet_times
users
A 0
B 1
C 2
There might be a simpler way, but this is all I can come up with for now :)
df.groupby("users")["tweet_times"].agg(lambda x: x.count() - x.nunique()).rename("count_dupe")
Output:
users
A 0
B 1
C 2
Name: count_dupe, dtype: int64
This looks quite pythonic to me:
df.groupby("users")["tweet_times"].count() - df.groupby("users")["tweet_times"].nunique()
Output:
users
A 0
B 1
C 2
Name: tweet_times, dtype: int64
I want to replace the value in the PF column with a value from another DataFrame if a correspondence exists (where there is no correspondence, highlighted in yellow in the screenshot, the value is left as it is):
and here is the DataFrame with the old value and the corresponding new value:
I tried this, but it does not work:
unicite['CustomerID'] = np.where(unicite['CustomerId'] == Fidclients['CustomerId'],Fidclients['Newvalue'] , unicite['CustomerID'])
If I'm understanding the question correctly, you want to replace the values in CustomerID in the table unicite with the values in the column Newvalue if they exist in the column CustomerID within the table Fidclients.
I believe you'll have to merge the tables to achieve this. For example,
unicite = pd.DataFrame({'CustomerID': ['a', 'b', 'c']})
print(unicite)
CustomerID
0 a
1 b
2 c
Fidclients = pd.DataFrame({'CustomerID': ['c', 'f', 'g'], 'Newvalue': ['x', 'y', 'z']})
print(Fidclients)
CustomerID Newvalue
0 c x
1 f y
2 g z
merged = unicite.merge(Fidclients, on='CustomerID', how='left')
# where the merge found a Newvalue, overwrite CustomerID with it
merged.loc[merged.Newvalue.notnull(), 'CustomerID'] = merged.Newvalue
merged.drop('Newvalue', axis=1)
CustomerID
0 a
1 b
2 x
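Note that drop returns a new DataFrame rather than modifying merged in place, so to keep the cleaned result you would assign it back, e.g. to the unicite name from the question:
# keep the updated CustomerID column and discard the helper column
unicite = merged.drop('Newvalue', axis=1)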
I have two dataframes like this
import pandas as pd
df1 = pd.DataFrame(
{
'A': list('abcaewar'),
'B': list('ghjglmgb'),
'C': list('lkjlytle'),
'ignore': ['stuff'] * 8
}
)
df2 = pd.DataFrame(
{
'A': list('abfu'),
'B': list('ghio'),
'C': list('lkqw'),
'stuff': ['ignore'] * 4
}
)
and I would like to remove all rows in df1 where A, B and C are identical to values in df2, so in the above case the expected outcome is
A B C ignore
0 c j j stuff
1 e l y stuff
2 w m t stuff
3 r b e stuff
One way of achieving this would be
comp_columns = ['A', 'B', 'C']
df1 = df1.set_index(comp_columns)
df2 = df2.set_index(comp_columns)
keep_ind = [
ind for ind in df1.index if ind not in df2.index
]
new_df1 = df1.loc[keep_ind].reset_index()
Does anyone see a more straightforward way of doing this which avoids the reset_index() operations and the loop to identify non-overlapping indices, e.g. by a smart way of masking? Ideally, I don't have to hardcode the columns, but can define them in a list as above, as I sometimes need 2, sometimes 3, or sometimes 4 or more columns for the removal.
Use DataFrame.merge with optional parameter indicator=True, then use boolean masking to filter the rows in df1:
df3 = df1.merge(df2[['A', 'B', 'C']], on=['A', 'B', 'C'], indicator=True, how='left')
df3 = df3[df3.pop('_merge').eq('left_only')]
Result:
# print(df3)
A B C ignore
2 c j j stuff
4 e l y stuff
5 w m t stuff
7 r b e stuff
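Since you mentioned you would rather not hardcode the columns, the same anti-join works with the column list kept in a variable (a sketch reusing comp_columns from the question, applied to the original, unindexed df1 and df2):
comp_columns = ['A', 'B', 'C']
# left merge with indicator, then keep only rows that exist solely in df1
df3 = df1.merge(df2[comp_columns], on=comp_columns, indicator=True, how='left')
df3 = df3[df3.pop('_merge').eq('left_only')]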
I am starting out with pandas categorical dataframes.
Let's say I have (1):
A B C
-------------
3 Z M
O X T
4 A B
I filtered the dataframe like this: df[df['B'] != "X"]
So I would get as result (2):
A B C
-------------
3 Z M
4 A B
In (1) df['B'].cat.categories #would equal to ['Z', 'X', 'A']
In (2) df['B'].cat.categories #still equal to ['Z', 'X', 'A']
How do I update the categories of all columns of the dataframe after this kind of filtering operation?
BONUS: if you want to clean up the index after the filtering:
df = df.reset_index(drop=True)
Use remove_unused_categories on the columns after filtering.
As piRSquared points out, you can do this succinctly, given that every column is a categorical dtype:
df = df.query('B != "X"').apply(lambda s: s.cat.remove_unused_categories())
This loops over the columns after filtering.
print(df)
# A B C
#0 3 Z M
#1 O X T
#2 4 A B
df['B'].cat.categories
#Index(['A', 'X', 'Z'], dtype='object')
df = df[ df['B'] != 'X']
# Update all category columns
for col in df.dtypes.loc[lambda x: x == 'category'].index:
df[col] = df[col].cat.remove_unused_categories()
df['B'].cat.categories
#Index(['A', 'Z'], dtype='object')
df['C'].cat.categories
#Index(['B', 'M'], dtype='object')
Pandas stores the categories separately and does not remove them if they are unused; if you want to do that, you can use this method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.cat.remove_unused_categories.html#pandas.Series.cat.remove_unused_categories
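A minimal sketch of that method on the example above (assuming df is the original, unfiltered example frame and column B is categorical):
df2 = df[df['B'] != 'X'].copy()
df2['B'] = df2['B'].cat.remove_unused_categories()
df2['B'].cat.categories   # Index(['A', 'Z'], dtype='object')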
I have two Pandas dataframes, df1 and df2. I would like to combine these into a single dataframe (df), but drop any row whose value in the 'A' column does not appear in the 'A' column of both dataframes.
Input:
[in] df1 = A B
0 i y
1 ii y
[in] df2 = A B
0 ii x
1 i y
2 iii z
3 iii z
Desired output:
[out] df = A B
0 i y
1 ii y
2 ii x
3 i y
In the example above, all rows were added to df except those in df2 with 'iii' in the 'A' column, because 'iii' does not appear anywhere in column 'A' of df1.
To take this a step further, the initial number of dataframes is not limited to two. There could be three or more, and I would want to drop any column 'A' values that do not appear in ALL of the dataframes.
How can I make this happen?
Thanks in advance!
This will work for any generic list of dataframes. Also, order of dataframes does not matter.
import pandas as pd

df1 = pd.DataFrame([['i', 'y'], ['ii', 'y']], columns=['A', 'B'])
df2 = pd.DataFrame([['ii', 'x'], ['i', 'y'], ['iii', 'z'], ['iii', 'z']], columns=['A', 'B'])

dfs = [df1, df2]
# values of 'A' that appear in every dataframe
set_A = set.intersection(*[set(dfi.A.tolist()) for dfi in dfs])
# keep only rows whose 'A' value is in that intersection, then stack the frames
df = pd.concat([dfi[dfi.A.isin(set_A)] for dfi in dfs])
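For the two example frames this keeps exactly the rows whose 'A' value appears in both, and after resetting the index it matches the desired output from the question:
df = df.reset_index(drop=True)
print(df)
#     A  B
# 0   i  y
# 1  ii  y
# 2  ii  x
# 3   i  y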