Replace column values using mask and multiple mappings - python

I have two dataframes. One is v_df and looks like this:
VENDOR_ID  VENDOR_NAME
123        APPLE
456        GOOGLE
987        FACEBOOK
The other is n_df and looks like this:
Vendor_Name   GL_Transaction_Description
AMEX          HELLO 345
Not assigned  BYE 456
Not assigned  THANKS 123
I want to populate the 'Vendor_Name' column in n_df on the condition that the 'GL_Transaction_Description' on the same row contains any VENDOR_ID value from v_df. So the resulting n_df would be this:
Vendor_Name  GL_Transaction_Description
AMEX         HELLO 345
GOOGLE       BYE 456
APPLE        THANKS 123
So far I have written this code:
v_list = v_df['VENDOR_ID'].to_list()
mask_id = list(map((lambda x: any([(y in x) for y in v_list])), n_df['GL_Transaction_Description']))
n_df['Vendor_Name'].mask(mask_id, other='Solution Here', inplace=True)
I am just not able to grasp what to write in the 'other' condition of the final masking. Any ideas? (n_df has more than 100k rows, so the execution speed of the solution is of high importance)

Series.str.extract + map
i = v_df['VENDOR_ID'].astype(str)
m = v_df.set_index(i)['VENDOR_NAME']
s = n_df['GL_Transaction_Description'].str.extract(r'(\d+)', expand=False)
n_df['Vendor_Name'].update(s.map(m))
Explanations
Create a mapping series m from the v_df dataframe by setting the VENDOR_ID column (cast to string, so it matches the extracted text) as the index and selecting the VENDOR_NAME column:
>>> m
VENDOR_ID
123       APPLE
456      GOOGLE
987    FACEBOOK
Name: VENDOR_NAME, dtype: object
Now extract the vendor id from the strings in the GL_Transaction_Description column:
>>> s
0    345
1    456
2    123
Name: GL_Transaction_Description, dtype: object
Map the extracted vendor ids through the mapping series m and update the matched values in the Vendor_Name column:
>>> n_df
  Vendor_Name GL_Transaction_Description
0        AMEX                  HELLO 345
1      GOOGLE                    BYE 456
2       APPLE                 THANKS 123
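For reference, here is a minimal end-to-end sketch (assuming the sample frames above; note that the regex r'(\d+)' grabs only the first run of digits, so descriptions containing several numbers may need a stricter pattern):

import pandas as pd

v_df = pd.DataFrame({'VENDOR_ID': [123, 456, 987],
                     'VENDOR_NAME': ['APPLE', 'GOOGLE', 'FACEBOOK']})
n_df = pd.DataFrame({'Vendor_Name': ['AMEX', 'Not assigned', 'Not assigned'],
                     'GL_Transaction_Description': ['HELLO 345', 'BYE 456', 'THANKS 123']})

# mapping series: VENDOR_ID (as string) -> VENDOR_NAME
m = v_df.set_index(v_df['VENDOR_ID'].astype(str))['VENDOR_NAME']
# first run of digits in each description
s = n_df['GL_Transaction_Description'].str.extract(r'(\d+)', expand=False)
# overwrite Vendor_Name only where the extracted id maps to a vendor
n_df['Vendor_Name'].update(s.map(m))
print(n_df)

Because every step is vectorized, this scales well to the 100k-row frame mentioned in the question.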

Related

If col is empty string, make adjacent column empty as well

Consider this sample df:
colAnum  colB   colCnum  colD
123      House  456      Book
         Car    789      Table
891      Chair           Porch
I am trying to iterate through this df: if a "num" column holds an empty string, make the adjacent column to the right empty as well.
This is the expected output:
colAnum  colB   colCnum  colD
123      House  456      Book
                789      Table
891      Chair
I attempted this with variations on this:
for idx, col in enumerate(df.columns):
    if df.iloc[idx, col] == '':
        df[idx+1, col] == ''
I am sure I am missing something simple to make this occur, but cannot work my way around it.
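For reference, a corrected version of this loop idea would pair each "num" column with the column to its right (a sketch assuming the columns alternate as in the sample; the vectorized answer below is preferable for larger frames):

# walk the columns in (num, adjacent) pairs
for i in range(0, len(df.columns) - 1, 2):
    num_col, adj_col = df.columns[i], df.columns[i + 1]
    # blank the right-hand column wherever the "num" column is empty
    df.loc[df[num_col] == '', adj_col] = ''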
Try shift with mask:
out = df.mask(df.eq('').shift(axis=1).fillna(False), '')

  colAnum   colB  colCnum   colD
0   123.0  House    456.0   Book
1                   789.0  Table
2   891.0  Chair
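To see why this works, the sketch below (assuming the sample frame above) materializes the intermediate mask: df.eq('') flags the empty cells, and shifting one column to the right moves each flag onto the adjacent column:

mask = df.eq('').shift(axis=1).fillna(False)
print(mask)
#   colAnum   colB  colCnum   colD
# 0   False  False    False  False
# 1   False   True    False  False
# 2   False  False    False   True
out = df.mask(mask, '')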

I want to count occurrences of a subset in a pandas dataframe

If in a data frame I have data like below:
Name    Id
Alex    123
John    222
Alex    123
Kendal  333
So I want to add a column with this result:
Name    Id   Subset Count
Alex    123  2
John    222  1
Alex    123  2
Kendal  333  1
I used the code below but didn't get the expected output:
df['Subset Count'] = df.value_counts(subset=['Name','Id'])
Try via groupby():
df['Subset Count'] = df.groupby(['Name', 'Id'])['Name'].transform('count')
Or via droplevel() and map():
df['Subset Count'] = df['Name'].map(df.value_counts(subset=['Name', 'Id']).droplevel(1))
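A quick end-to-end sketch of the groupby variant (assuming the sample frame above); transform('count') broadcasts each group's size back onto its member rows:

import pandas as pd

df = pd.DataFrame({'Name': ['Alex', 'John', 'Alex', 'Kendal'],
                   'Id': [123, 222, 123, 333]})
df['Subset Count'] = df.groupby(['Name', 'Id'])['Name'].transform('count')
print(df)
#      Name   Id  Subset Count
# 0    Alex  123             2
# 1    John  222             1
# 2    Alex  123             2
# 3  Kendal  333             1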

Dataframe replace with another row, based on condition

I have a dataframe like the following:
ean  product_resource_id  shop
------------------------------
123  abc                  xxl
245  bed                  xxl
456  dce                  xxl
123  0                    conr
245  0                    horec
I want to replace the 0 values in "product_resource_id" with the id from the row that has the same "ean".
I want to get a result like:
ean  product_resource_id  shop
------------------------------
123  abc                  xxl
245  bed                  xxl
456  dce                  xxl
123  abc                  conr
245  bed                  horec
Any help would be really helpful. Thanks in advance!
The idea is to filter the rows that have a real value in product_resource_id, drop duplicates by the ean column if any exist, and create a mapping Series with DataFrame.set_index. Rows with no match keep their original value via Series.fillna, because non-matching values return NaN:
# if ids are strings: mask = df['product_resource_id'].ne('0')
# if 0 is an integer:
mask = df['product_resource_id'].ne(0)
s = df[mask].drop_duplicates('ean').set_index('ean')['product_resource_id']
df['product_resource_id'] = df['ean'].map(s).fillna(df['product_resource_id'])
print(df)
   ean product_resource_id   shop
0  123                 abc    xxl
1  245                 bed    xxl
2  456                 dce    xxl
3  123                 abc   conr
4  245                 bed  horec
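A self-contained sketch (assuming product_resource_id stores the integer 0 for missing ids, as in the mask above):

import pandas as pd

df = pd.DataFrame({'ean': [123, 245, 456, 123, 245],
                   'product_resource_id': ['abc', 'bed', 'dce', 0, 0],
                   'shop': ['xxl', 'xxl', 'xxl', 'conr', 'horec']})

# keep rows with a real id, one per ean, indexed by ean
mask = df['product_resource_id'].ne(0)
s = df[mask].drop_duplicates('ean').set_index('ean')['product_resource_id']
# map each ean to its id; rows without a match keep their original value
df['product_resource_id'] = df['ean'].map(s).fillna(df['product_resource_id'])
print(df)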

Compare columns and generate duplicate rows in MySQL (or) Python pandas

I am new to MySQL and just getting started with some basic concepts. I have been trying to solve this for a while now; any help is appreciated.
I have a list of users with two phone numbers. I want to compare the two phone-number columns and generate a new row if the data differs between them, else retain the row and make no changes.
The processed data would look like the second table below.
Is there any way to achieve this in MySQL?
I also don't mind doing the transformation in a dataframe and then loading it into a table.
id  username  primary_phone  landline
1   John      222            222
2   Michael   123            121
3   lucy      456            456
4   Anderson  900            901

id  username  phone
1   John      222
2   Michael   123
2   Michael   121
3   lucy      456
4   Anderson  900
4   Anderson  901
Thanks!!!
Use DataFrame.melt, remove the variable column, and apply DataFrame.drop_duplicates:
df = (df.melt(['id', 'username'], value_name='phone')
        .drop('variable', axis=1)
        .drop_duplicates()
        .sort_values('id'))
print(df)
   id  username  phone
0   1      John    222
1   2   Michael    123
5   2   Michael    121
2   3      lucy    456
3   4  Anderson    900
7   4  Anderson    901
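For reference, a self-contained sketch (assuming the sample user table above):

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'username': ['John', 'Michael', 'lucy', 'Anderson'],
                   'primary_phone': [222, 123, 456, 900],
                   'landline': [222, 121, 456, 901]})

# stack both phone columns into one, then drop exact duplicates
out = (df.melt(['id', 'username'], value_name='phone')
         .drop('variable', axis=1)
         .drop_duplicates()
         .sort_values('id'))
print(out)

Rows where both numbers are identical (John, lucy) collapse to a single row, while differing numbers (Michael, Anderson) keep one row each.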

How to retrieve rows based on duplicate value in specific column in Pandas Python?

Let's say we have data as follows:
A    B
123  John
456  Mary
102  Allen
456  Nickolan
123  Richie
167  Daniel
We want to retrieve rows based on column A: if a value is duplicated, the rows sharing it should be stored in a separate dataframe named after that value.
[123 John, 123 Richie], These both will be stored in df_123
[456 Mary, 456 Nickolan], These both will be stored in df_456
[102 Allen] will be stored in df_102
[167 Daniel] will be stored in df_167
Thanks in Advance
Group and then use a list comprehension, which returns a list of dataframes, one per group:
group = df.groupby('A')
dfs = [group.get_group(x) for x in group.groups]
[     A      B
 2  102  Allen,
      A       B
 0  123    John
 4  123  Richie,
      A       B
 5  167  Daniel,
      A         B
 1  456      Mary
 3  456  Nickolan]
groupby + tuple + dict
Creating a variable number of variables is not recommended. You can use a dictionary:
dfs = dict(tuple(df.groupby('A')))
And that's it. To access the dataframe where A == 123, use dfs[123], etc.
Note your dataframes are now distinct objects. You can no longer perform operations on dfs and have them applied to each dataframe value without a Python-level loop.
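A short usage sketch (assuming the sample frame above):

dfs = dict(tuple(df.groupby('A')))

df_123 = dfs[123]   # rows where A == 123
print(df_123)
#      A       B
# 0  123    John
# 4  123  Richie

# an explicit loop is needed to operate on every group
sizes = {key: len(frame) for key, frame in dfs.items()}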
