Pandas replace values of a column with comparison to another Dataframe - python

I want to replace the value in the PF column with a value from another DataFrame if a match exists (where there is no correspondence, leave the value as it is). I have a second DataFrame mapping each old value to its new value.
I tried this, but it does not work:
unicite['CustomerID'] = np.where(unicite['CustomerId'] == Fidclients['CustomerId'], Fidclients['Newvalue'], unicite['CustomerID'])

If I'm understanding the question correctly, you want to replace the values in CustomerID in the table unicite with the values in the column Newvalue if they exist in the column CustomerID within the table Fidclients.
I believe you'll have to merge the tables to achieve this. For example,
unicite = pd.DataFrame({'CustomerID': ['a', 'b', 'c']})
print(unicite)
CustomerID
0 a
1 b
2 c
Fidclients = pd.DataFrame({'CustomerID': ['c', 'f', 'g'], 'Newvalue': ['x', 'y', 'z']})
print(Fidclients)
CustomerID Newvalue
0 c x
1 f y
2 g z
merged = unicite.merge(Fidclients, on='CustomerID', how='left')
merged.loc[merged.Newvalue.notnull(), 'CustomerID'] = merged.Newvalue
merged.drop('Newvalue', axis=1)
CustomerID
0 a
1 b
2 x
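As a merge-free alternative, a lookup with Series.map can do the same replacement (a minimal sketch using the sample frames above):

```python
import pandas as pd

unicite = pd.DataFrame({'CustomerID': ['a', 'b', 'c']})
Fidclients = pd.DataFrame({'CustomerID': ['c', 'f', 'g'], 'Newvalue': ['x', 'y', 'z']})

# Build a lookup Series indexed by old IDs; map it over the column and
# fall back to the original value where no replacement exists.
lookup = Fidclients.set_index('CustomerID')['Newvalue']
unicite['CustomerID'] = unicite['CustomerID'].map(lookup).fillna(unicite['CustomerID'])
```

This avoids creating and then dropping the helper column from the merge.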


assign 0 when value_count() is not found

I have a column that looks like this:
group
A
A
A
B
B
C
The value C exists sometimes but not always. The following works fine when C is present; however, if C does not occur in the column, it throws a KeyError:
value_counts = df.group.value_counts()
new_df["C"] = value_counts.C
I want to check whether C has a count or not; if not, I want to assign new_df["C"] a value of 0. I tried this, but I still get a KeyError. What else can I try?
value_counts = df.group.value_counts()
new_df["C"] = value_counts.C
if (df.group.value_counts()['consents']):
    new_df["C"] = value_counts.consents
else:
    new_df["C"] = 0
One way of doing it is by converting the series into a dictionary and looking up the key, returning a default value (0 in your case) when it is not found:
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'D']})
new_df = {}
character = "C"
new_df[character] = df.group.value_counts().to_dict().get(character, 0)
output of new_df
{'C': 0}
However, I am not sure what new_df should be, it seems that it is a dictionary? Or it might be a new dataframe object?
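If new_df really is a plain dictionary, the round-trip through to_dict can also be skipped: pandas' Series.get takes a default just like dict.get (a small sketch with the same sample data):

```python
import pandas as pd

df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'D']})

# Series.get returns the count for the label, or the default when the label is absent.
missing = df['group'].value_counts().get('C', 0)
present = df['group'].value_counts().get('A', 0)
```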
One way could be to convert the group column to Categorical type with specified categories. eg:
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B']})
print(df)
# group
# 0 A
# 1 A
# 2 A
# 3 B
# 4 B
categories = ['A', 'B', 'C']
df['group'] = pd.Categorical(df['group'], categories=categories)
df['group'].value_counts()
[out]
A 3
B 2
C 0
Name: group, dtype: int64
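Run end to end, the categorical approach guarantees the label always has an entry, so the lookup from the question can no longer raise a KeyError (a sketch with the same sample data):

```python
import pandas as pd

df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B']})
df['group'] = pd.Categorical(df['group'], categories=['A', 'B', 'C'])

# Zero-count categories are included in value_counts, so 'C' is always present.
counts = df['group'].value_counts()
```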

Assign multiple columns different values based on conditions in Panda dataframe

I have a dataframe where new columns need to be added based on conditions on existing column values, and I am looking for an efficient way of doing it.
For Ex:
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': ['x', 'y', 'x'],
                   's': ['proda', 'prodb', 'prodc'],
                   'r': ['oz1', '0z2', 'oz3']})
I need to create 2 new columns ['c', 'd'] based on the following conditions:
if df['b'] == 'x':
    df['c'] = df['s']
    df['d'] = df['r']
elif df['b'] == 'y':
    # assign different values to the c, d columns
We can use numpy where and apply a condition for each new column, like:
df['c'] = np.where(condition, value)
df['d'] = np.where(condition, value)
But I am looking if there is a way to do this in a single statement or without using for loop or multiple numpy or panda apply.
The exact output is unclear, but you can use numpy.where with 2D data.
For example:
cols = ['c', 'd']
df[cols] = np.where(df['b'].eq('x').to_numpy()[:, None],
                    df[['s', 'r']], np.nan)
output:
a b s r c d
0 1 x proda oz1 proda oz1
1 2 y prodb 0z2 NaN NaN
2 3 x prodc oz3 prodc oz3
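For reference, a self-contained version of this np.where approach, using the frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': ['x', 'y', 'x'],
                   's': ['proda', 'prodb', 'prodc'],
                   'r': ['oz1', '0z2', 'oz3']})

# Broadcast the boolean mask to a column vector so a single np.where
# fills both new columns at once.
mask = df['b'].eq('x').to_numpy()[:, None]
df[['c', 'd']] = np.where(mask, df[['s', 'r']], np.nan)
```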
If you want multiple conditions, use np.select:
cols = ['c', 'd']
df[cols] = np.select([df['b'].eq('x').to_numpy()[:, None],
                      df['b'].eq('y').to_numpy()[:, None]],
                     [df[['s', 'r']],
                      df[['r', 'a']]],
                     np.nan)
It is, however, easier to use a loop over the conditions if you have many:
cols = ['c', 'd']
df[cols] = np.select([df['b'].eq(c).to_numpy()[:, None] for c in ['x', 'y']],
                     [df[repl] for repl in (['s', 'r'], ['r', 'a'])],
                     np.nan)
output:
a b s r c d
0 1 x proda oz1 proda oz1
1 2 y prodb 0z2 0z2 2
2 3 x prodc oz3 prodc oz3

pandas groupby expanding df based on unique values

I have df below:
df = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'b', 'c', 'c'],
    'V1': [False, False, True, True, False, True],
    'V2': ['A', 'B', 'C', 'B', 'B', 'C']
})
I want to achieve the following. For each unique ID, the last row has V1 == True. I want to count how many times each unique value of V2 occurs where V1 == True. This part would be achieved by something like:
df.groupby('V2').V1.sum()
However, I also want to add, for each unique value of V2, a column indicating how many times that value occurred after the point where V1 == True for the V2 value indicated by the row. I understand this might sound confusing; here's how the output would look in this example:
df
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 2 0
It is important that the solution is general enough to be applicable on a similar case with more unique values than just A, B and C.
UPDATE
As a bonus, I am also interested in how, instead of the count, one can instead return the sum of some value column, under the same conditions, divided by the corresponding "count" in the rows. Example: suppose we now depart from df below instead:
df = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'b', 'c', 'c'],
    'V1': [False, False, True, True, False, True],
    'V2': ['A', 'B', 'C', 'B', 'B', 'C'],
    'V3': [1, 2, 3, 4, 5, 6],
})
The output would need to sum V3 for the cases indicated by the counts in the solution by @jezrael, and divide that number by the corresponding count. The output would instead look like:
df
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 3.5 0
First aggregate sum:
df1 = df.groupby('V2').V1.sum().astype(int).reset_index()
print (df1)
V2 V1
0 A 0
1 B 1
2 C 2
Then group by ID and create a helper column with the last value per group via GroupBy.transform('last'); remove the last row of each ID via Series.duplicated; use crosstab for the counts; reindex to add all possible unique values of V2; and finally append to df1 with DataFrame.join:
val = df['V2'].unique()
df['new'] = df.groupby('ID').V2.transform('last')
df = df[df.duplicated('ID', keep='last')]
df = pd.crosstab(df['new'], df['V2']).reindex(columns=val, index=val, fill_value=0)
df = df1.join(df, on='V2')
print (df)
V2 V1 A B C
0 A 0 0 0 0
1 B 1 0 0 0
2 C 2 1 2 0
UPDATE
The updated part of the question should be possible to achieve by replacing the crosstab part with a pivot table (using the variables new and val defined above):
df = df.pivot_table(
    index='new',
    columns='V2',
    aggfunc={'V3': 'mean'}
).V3.reindex(columns=val, index=val, fill_value=0)
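Putting the pieces together, a self-contained version of the updated (mean-of-V3) solution might look like this, with the variable names from the answer above:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['a', 'a', 'a', 'b', 'c', 'c'],
    'V1': [False, False, True, True, False, True],
    'V2': ['A', 'B', 'C', 'B', 'B', 'C'],
    'V3': [1, 2, 3, 4, 5, 6],
})

df1 = df.groupby('V2').V1.sum().astype(int).reset_index()
val = df['V2'].unique()

# Helper column: the V2 value of each ID's last row; then drop those last rows.
df['new'] = df.groupby('ID').V2.transform('last')
df = df[df.duplicated('ID', keep='last')]

# Mean of V3 per (new, V2) pair instead of a plain count.
out = (df.pivot_table(index='new', columns='V2', values='V3', aggfunc='mean')
         .reindex(columns=val, index=val, fill_value=0))
out = df1.join(out, on='V2')
```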

Pandas combine dataframes, drop rows where a value does not appear in all initial dataframes

I have two Pandas dataframes, df1 and df2. I would like to combine these into a single dataframe (df), but drop any row whose 'A' value does not appear in the 'A' column of every initial dataframe (e.g. a value present in df2 but absent from df1).
Input:
[in] df1 = A B
0 i y
1 ii y
[in] df2 = A B
0 ii x
1 i y
2 iii z
3 iii z
Desired output:
[out] df = A B
0 i y
1 ii y
2 ii x
3 i y
In the example above, all rows were added to df except those in df2 with 'iii' in the 'A' column, because 'iii' does not appear anywhere in column 'A' of df1.
To take this a step further, the initial number of dataframes is not limited to two. There could be three or more, and I would want to drop any column 'A' values that do not appear in ALL of the dataframes.
How can I make this happen?
Thanks in advance!
This will work for any generic list of dataframes. Also, order of dataframes does not matter.
df1 = pd.DataFrame([['i', 'y'], ['ii', 'y']], columns=['A', 'B'])
df2 = pd.DataFrame([['ii', 'x'], ['i', 'y'], ['iii', 'z'], ['iii', 'z']], columns=['A', 'B'])
dfs = [df1, df2]
set_A = set.intersection(*[set(dfi.A.tolist()) for dfi in dfs])
df = pd.concat([dfi[dfi.A.isin(set_A)] for dfi in dfs])
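For reference, running the same approach with ignore_index=True reproduces the 0-3 index from the desired output:

```python
import pandas as pd

df1 = pd.DataFrame([['i', 'y'], ['ii', 'y']], columns=['A', 'B'])
df2 = pd.DataFrame([['ii', 'x'], ['i', 'y'], ['iii', 'z'], ['iii', 'z']], columns=['A', 'B'])
dfs = [df1, df2]

# Keep only 'A' values present in every frame, then stack the survivors.
set_A = set.intersection(*[set(dfi['A']) for dfi in dfs])
df = pd.concat([dfi[dfi['A'].isin(set_A)] for dfi in dfs], ignore_index=True)
```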

Sort or groupby dataframe in python using given string

I have given dataframe
Id Direction Load Unit
1 CN05059815 LoadFWD 0,0 NaN
2 CN05059815 LoadBWD 0,0 NaN
4 ....
....
and the given list.
list =['CN05059830','CN05059946','CN05060010','CN05060064' ...]
I would like to sort or group the data by the given elements of the list.
For example, the new data would be sorted exactly like the list: the first rows would start with CN05059815, which doesn't belong to the list, and then come CN05059830, CN05059946, ..., which all belong to the list, keeping the remaining data as well.
One way is to use Categorical Data. Here's a minimal example:
# sample dataframe
df = pd.DataFrame({'col': ['A', 'B', 'C', 'D', 'E', 'F']})
# required ordering
lst = ['D', 'E', 'A', 'B']
# convert to categorical
df['col'] = df['col'].astype('category')
# set order, adding values not in lst to the front
order = list(set(df['col']) - set(lst)) + lst
# attach ordering information to categorical series
df['col'] = df['col'].cat.reorder_categories(order)
# apply ordering
df = df.sort_values('col')
print(df)
col
2 C
5 F
3 D
4 E
0 A
1 B
Consider below approach and example:
df = pd.DataFrame({
'col': ['a', 'b', 'c', 'd', 'e']
})
list_ = ['d', 'b', 'a']
print(df)
Output:
col
0 a
1 b
2 c
3 d
4 e
Then in order to sort the df with the list and its ordering:
df.reindex(
    df.assign(dummy=df['col'])['dummy']
      .apply(lambda x: list_.index(x) if x in list_ else -1)
      .sort_values()
      .index
)
Output:
col
2 c
4 e
3 d
1 b
0 a
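A reusable sketch packaging the second approach as a helper function (sort_by_list is a name chosen here, not from the question; a stable sort keeps ties in their original order):

```python
import pandas as pd

def sort_by_list(df, col, order):
    # Values not in `order` get rank -1, so they come first;
    # the rest follow the ordering given by the list.
    rank = df[col].apply(lambda x: order.index(x) if x in order else -1)
    return df.reindex(rank.sort_values(kind='stable').index)

df = pd.DataFrame({'col': ['a', 'b', 'c', 'd', 'e']})
result = sort_by_list(df, 'col', ['d', 'b', 'a'])
```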
