I have a dataframe like the following:
ean    product_resource_id    shop
-----------------------------------
123    abc                    xxl
245    bed                    xxl
456    dce                    xxl
123    0                      conr
245    0                      horec
I want to replace the 0 values in "product_resource_id" with the id from the row that has the same "ean".
I want to get a result like:
ean    product_resource_id    shop
-----------------------------------
123    abc                    xxl
245    bed                    xxl
456    dce                    xxl
123    abc                    conr
245    bed                    horec
Any help would be really helpful. Thanks in advance!
The idea is to filter out the rows with 0 in product_resource_id, drop any duplicates by the ean column, and create a Series with DataFrame.set_index to use for mapping. Rows with no match return NaN, so those are filled back with the original values via Series.fillna:
# if 0 is stored as a string:
# mask = df['product_resource_id'].ne('0')
# if 0 is an integer:
mask = df['product_resource_id'].ne(0)
s = df[mask].drop_duplicates('ean').set_index('ean')['product_resource_id']
df['product_resource_id'] = df['ean'].map(s).fillna(df['product_resource_id'])
print(df)
ean product_resource_id shop
0 123 abc xxl
1 245 bed xxl
2 456 dce xxl
3 123 abc conr
4 245 bed horec
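For a fully self-contained run, the sample frame can be rebuilt like this (a sketch of the data from the question; here the missing ids are stored as the integer 0, matching the ne(0) mask):

import pandas as pd

# sketch of the sample data from the question
df = pd.DataFrame({
    'ean': [123, 245, 456, 123, 245],
    'product_resource_id': ['abc', 'bed', 'dce', 0, 0],
    'shop': ['xxl', 'xxl', 'xxl', 'conr', 'horec'],
})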
I have a dataframe containing columns in the below format
df =
ID      Folder Name    Country
300     ABC 12345      CANADA
1000    NaN            USA
450     AML 2233       USA
111     ABC 2234       USA
550     AML 3312       AFRICA
Output needs to be in the below format
ID      Folder Name    Country    Folder Name - ABC    Folder Name - AML
300     ABC 12345      CANADA     ABC 12345            NaN
1000    NaN            USA        NaN                  NaN
450     AML 2233       USA        NaN                  AML 2233
111     ABC 2234       USA        ABC 2234             NaN
550     AML 3312       AFRICA     NaN                  AML 3312
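For reference, the input frame can be reconstructed like this (a sketch built from the table above):

import numpy as np
import pandas as pd

# sketch of the input data, values taken from the table above
df = pd.DataFrame({
    'ID': [300, 1000, 450, 111, 550],
    'Folder Name': ['ABC 12345', np.nan, 'AML 2233', 'ABC 2234', 'AML 3312'],
    'Country': ['CANADA', 'USA', 'USA', 'USA', 'AFRICA'],
})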
I tried using the below Python code:
df_['Folder Name - ABC'] = df['Folder Name'].apply(lambda x: x.str.startswith('ABC',na = False))
Can you please help me figure out where I am going wrong?
You should not use apply here; use boolean indexing instead:
df.loc[df['Folder Name'].str.startswith('ABC', na=False),
       'Folder Name - ABC'] = df['Folder Name']
However, a better approach, which does not require you to loop over all possible codes, is to extract the code, pivot_table, and merge:
out = df.merge(
    df.assign(col=df['Folder Name'].str.extract(r'(\w+)'))
      .pivot_table(index='ID', columns='col',
                   values='Folder Name', aggfunc='first')
      .add_prefix('Folder Name - '),
    on='ID', how='left'
)
Output:
ID Folder Name Country Folder Name - ABC Folder Name - AML
0 300 ABC 12345 CANADA ABC 12345 NaN
1 1000 NaN USA NaN NaN
2 450 AML 2233 USA NaN AML 2233
3 111 ABC 2234 USA ABC 2234 NaN
4 550 AML 3312 AFRICA NaN AML 3312
If you have a list with the substrings to be matched at the start of each string in df['Folder Name'], you could also achieve the result as follows:
lst = ['ABC','AML']
pat = f'^({".*)|(".join(lst)}.*)'
# '^(ABC.*)|(AML.*)'
df[[f'Folder Name - {x}' for x in lst]] = \
    df['Folder Name'].str.extract(pat, expand=True)
print(df)
ID Folder Name Country Folder Name - ABC Folder Name - AML
0 300 ABC 12345 CANADA ABC 12345 NaN
1 1000 NaN USA NaN NaN
2 450 AML 2233 USA NaN AML 2233
3 111 ABC 2234 USA ABC 2234 NaN
4 550 AML 3312 AFRICA NaN AML 3312
If you do not already have this list, you can simply create it first by doing:
lst = df['Folder Name'].dropna().str.extract('^([A-Z]{3})')[0].unique()
# this will be an array, not a list,
# but that doesn't affect the functionality here
N.B. If your list contains items that won't match, you'll end up with extra columns filled completely with NaN values. You can get rid of these at the end. E.g.:
lst = ['ABC','AML','NON']
# 'NON' won't match
pat = f'^({".*)|(".join(lst)}.*)'
df[[f'Folder Name - {x}' for x in lst]] = \
    df['Folder Name'].str.extract(pat, expand=True)
df = df.dropna(axis=1, how='all')
# dropping column `Folder Name - NON` with only `NaN` values
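Alternatively (a small sketch of the same idea), you can drop the non-matching items from lst up front instead of removing the all-NaN columns afterwards:

# keep only the prefixes that actually occur at the start of some 'Folder Name'
lst = [x for x in lst if df['Folder Name'].str.startswith(x, na=False).any()]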
The startswith method returns True or False, so your column would contain only boolean values. Instead, you can try this:
df['Folder Name - ABC'] = df['Folder Name'].apply(
    lambda x: x if pd.notna(x) and x.startswith('ABC') else float('nan'))
Does this code do the trick?
df['Folder Name - ABC'] = df['Folder Name'].where(df['Folder Name'].str.startswith('ABC', na=False))
I have a table, which I did not create, in my SQL Server database that looks like this:
manager id      employee info
123567890123    [{'emp_name':'ash','emp_id':'123'},{'emp_name':'brad','emp_id':'234'}]
235678901234    [{'emp_name':'sarah','emp_id':'345'},{'emp_name':'ryan','emp_id':'456'},{'emp_name':'chris','emp_id':'567'}]
I queried this table and pulled it into a pandas dataframe. I want to get each emp_name and emp_id for each manager.
Below is my desired result.
manager id      emp_name    emp_id
123567890123    ash         123
123567890123    brad        234
235678901234    sarah       345
235678901234    ryan        456
235678901234    chris       567
You can use .explode() to expand the list of JSON records into one record per row. Then, use pd.Series to convert each record into columns.
df2 = df.explode('employee info').reset_index(drop=True)
df_out = df2.join(df2['employee info'].apply(pd.Series)).drop('employee info', axis=1)
For better performance, you can use pd.DataFrame() instead of pd.Series to convert the json into columns, as follows:
pd.DataFrame(df2['employee info'].tolist())
emp_name emp_id
0 ash 123
1 brad 234
2 sarah 345
3 ryan 456
4 chris 567
The whole set of code is as follows:
df2 = df.explode('employee info').reset_index(drop=True)
df_out = df2.join(pd.DataFrame(df2['employee info'].tolist())).drop('employee info', axis=1)
Data Input
data = {'manager id': [123567890123, 235678901234],
        'employee info': [[{'emp_name':'ash','emp_id':'123'}, {'emp_name':'brad','emp_id':'234'}],
                          [{'emp_name':'sarah','emp_id':'345'}, {'emp_name':'ryan','emp_id':'456'}, {'emp_name':'chris','emp_id':'567'}]]}
df = pd.DataFrame(data)
Output:
print(df_out)
manager id emp_name emp_id
0 123567890123 ash 123
1 123567890123 brad 234
2 235678901234 sarah 345
3 235678901234 ryan 456
4 235678901234 chris 567
You can use ast.literal_eval to get the expected result:
import ast
out = df['employee info'].apply(ast.literal_eval).explode().apply(pd.Series)
emp_name emp_id
0 ash 123
0 brad 234
1 sarah 345
1 ryan 456
1 chris 567
out = pd.concat([df['manager id'], out], axis='columns')
Output:
>>> out
manager id emp_name emp_id
0 123567890123 ash 123
0 123567890123 brad 234
1 235678901234 sarah 345
1 235678901234 ryan 456
1 235678901234 chris 567
I slightly modified your dataframe:
data = {'manager id': [123567890123, 235678901234],
        'employee info': ["[{'emp_name':'ash','emp_id':'123'},{'emp_name':'brad','emp_id':'234'}]",
                          "[{'emp_name':'sarah','emp_id':'345'},{'emp_name':'ryan','emp_id':'456'},{'emp_name':'chris','emp_id':'567'}]"]}
df = pd.DataFrame(data)
I want to calculate Levenshtein distance for all rows of a single column of a Pandas DataFrame. I am getting MemoryError when I cross-join my DataFrame containing ~115,000 rows. In the end, I want to keep only those rows where Levenshtein distance is either 1 or 2. Is there an optimized way to do the same?
Here's my brute force approach:
import pandas as pd
from textdistance import levenshtein
# from itertools import product
# original df
df = pd.DataFrame({'Name':['John', 'Jon', 'Ron'], 'Phone':[123, 456, 789], 'State':['CA', 'GA', 'MA']})
# create another df containing all rows and a few columns needed for further checks
name = df['Name']
phone = df['Phone']
dic_ = {'Name_Match':name,'Phone_Match':phone}
df_match = pd.DataFrame(dic_, index=range(len(name)))
df['key'] = 1
df_match['key'] = 1
# cross join df containing all columns with another df containing some of its columns
df_merged = pd.merge(df, df_match, on='key').drop("key", axis=1)
# keep only rows where distance = 1 or distance = 2
df_merged['distance'] = df_merged.apply(lambda x: levenshtein.distance(x['Name'], x['Name_Match']), axis=1)
Original DataFrame:
Out[1]:
Name Phone State
0 John 123 CA
1 Jon 456 GA
2 Ron 789 MA
New DataFrame from original DataFrame:
df_match
Out[2]:
Name_Match Phone_Match
0 John 123
1 Jon 456
2 Ron 789
Cross-join:
df_merged
Out[3]:
Name Phone State Name_Match Phone_Match distance
0 John 123 CA John 123 0
1 John 123 CA Jon 456 1
2 John 123 CA Ron 789 2
3 Jon 456 GA John 123 1
4 Jon 456 GA Jon 456 0
5 Jon 456 GA Ron 789 1
6 Ron 789 MA John 123 2
7 Ron 789 MA Jon 456 1
8 Ron 789 MA Ron 789 0
Final output:
df_merged[df_merged['distance'].isin([1, 2])]
Out[4]:
Name Phone State Name_Match Phone_Match distance
1 John 123 CA Jon 456 1
2 John 123 CA Ron 789 2
3 Jon 456 GA John 123 1
5 Jon 456 GA Ron 789 1
6 Ron 789 MA John 123 2
7 Ron 789 MA Jon 456 1
Your problem is not related to Levenshtein distance; your main problem is that you are running out of memory (RAM) while doing the operations (you can check this with the Task Manager on Windows, or the top or htop commands on Linux/macOS).
One solution would be to partition your dataframe into smaller chunks before starting the apply operation, run it on each chunk, and discard the rows you don't need BEFORE processing the next chunk, as sketched below.
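A rough sketch of that idea, reusing the df, df_match and 'key' columns from your code above (the chunk size is an assumption, tune it to your RAM):

chunk_size = 5_000  # assumption: adjust to the memory you have available
kept = []

for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    # cross join only this chunk against df_match (both already carry the 'key' column)
    merged = chunk.merge(df_match, on='key').drop(columns='key')
    merged['distance'] = merged.apply(
        lambda x: levenshtein.distance(x['Name'], x['Name_Match']), axis=1)
    # keep only distance 1 or 2 before moving on, so memory stays bounded
    kept.append(merged[merged['distance'].isin([1, 2])])

df_merged = pd.concat(kept, ignore_index=True)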
If you are running it on the cloud, you can get a machine with more memory instead.
Bonus: I'd suggest you parallelize the apply operation using something like Pandarallel to make it way faster.
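If you go that route, the per-row apply becomes a parallel_apply, roughly like this (a sketch; assumes pandarallel is installed via pip install pandarallel):

from pandarallel import pandarallel

pandarallel.initialize()  # uses all available CPU cores by default

df_merged['distance'] = df_merged.parallel_apply(
    lambda x: levenshtein.distance(x['Name'], x['Name_Match']), axis=1)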
I have two dataframes. One is v_df and looks like this:
VENDOR_ID    VENDOR_NAME
123          APPLE
456          GOOGLE
987          FACEBOOK
The other is n_df and looks like this:
Vendor_Name     GL_Transaction_Description
AMEX            HELLO 345
Not assigned    BYE 456
Not assigned    THANKS 123
I want to populate the 'Vendor_Name' column in n_df on the condition that the 'GL_Transaction_Description' on the same row contains any VENDOR_ID value from v_df. So the resulting n_df would be this:
Vendor_Name    GL_Transaction_Description
AMEX           HELLO 345
GOOGLE         BYE 456
APPLE          THANKS 123
So far I have written this code:
v_list = v_df['VENDOR_ID'].to_list()
mask_id = list(map((lambda x: any([(y in x) for y in v_list])), n_df['GL_Transaction_Description']))
n_df['Vendor_Name'].mask((mask_id), other = 'Solution Here', inplace=True)
I am just not able to grasp what to pass as the 'other' argument of the final masking. Any ideas? (n_df has more than 100k rows, so the execution speed of the solution is of high importance.)
Series.str.extract + map
i = v_df['VENDOR_ID'].astype(str)
m = v_df.set_index(i)['VENDOR_NAME']
s = n_df['GL_Transaction_Description'].str.extract(r'(\d+)', expand=False)
n_df['Vendor_Name'].update(s.map(m))
Explanations
Create a mapping series m from the v_df dataframe by setting the VENDOR_ID column as the index and selecting the VENDOR_NAME column:
>>> m
VENDOR_ID
123 APPLE
456 GOOGLE
987 FACEBOOK
Name: VENDOR_NAME, dtype: object
Now extract the vendor id from the strings in the GL_Transaction_Description column:
>>> s
0 345
1 456
2 123
Name: GL_Transaction_Description, dtype: object
Map the extracted vendor ids with the mapping series m and update the mapped values in the Vendor_Name column:
>>> n_df
Vendor_Name GL_Transaction_Description
0 AMEX HELLO 345
1 GOOGLE BYE 456
2 APPLE THANKS 123
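For completeness, the mask-based attempt from the question could also be finished along these lines (a sketch with a hypothetical match_vendor helper; it loops over the vendor list for every row, so the extract + map approach above should scale better on 100k+ rows):

# build an id -> name lookup from v_df (ids cast to str so they can be searched in the text)
v_map = dict(zip(v_df['VENDOR_ID'].astype(str), v_df['VENDOR_NAME']))

def match_vendor(desc):
    # return the first vendor name whose id appears in the description, else None
    for vid, name in v_map.items():
        if vid in desc:
            return name
    return None

matched = n_df['GL_Transaction_Description'].map(match_vendor)
n_df['Vendor_Name'] = n_df['Vendor_Name'].mask(matched.notna(), matched)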
I get a Pandas series:
countrypat = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0).head(3)
The output looks like this:
China abc 1055
def 778
ghi 612
Malaysia def 554
abc 441
ghi 178
[...]
How do I insert a new column (do I have to make this a dataframe?) containing the ratio of each count to the sum of the counts for that country? For China, for example, the first row of the new column would contain 1055/(1055+778+612). I have tried unstack() and to_df() but was unsure of the next steps.
I created a dataframe on my side, but excluded the .head(3) of your assignment:
countrypat = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0)
The following will give you the proportions with a simple apply to your groupby object:
countrypat.apply(lambda x: x / float(x.sum()))
The only 'problem' is that doing so returns a series, so I would store the intermediate results in two different series and combine them at the end:
series1 = asiaselect.groupby('Country')['Pattern'].value_counts()
series2 = asiaselect.groupby('Country')['Pattern'].value_counts().groupby(level=0).apply(lambda x: x / float(x.sum()))
pd.DataFrame([series1, series2]).T
China abc 1055.0 0.431493
def 778.0 0.318200
ghi 612.0 0.250307
Malaysia def 554.0 0.472293
abc 441.0 0.375959
ghi 178.0 0.151748
To get the top three rows per country, you can simply apply .groupby(level=0).head(3) to each of series1 and series2:
series1_top = series1.groupby(level=0).head(3)
series2_top = series2.groupby(level=0).head(3)
pd.DataFrame([series1_top, series2_top]).T
I tested with a dataframe containing more than 3 rows per country, and it seems to work. I started with the following df:
China abc 1055
def 778
ghi 612
yyy 5
xxx 3
zzz 3
Malaysia def 554
abc 441
ghi 178
yyy 5
xxx 3
zzz 3
and ends like this:
China abc 1055.0 0.429560
def 778.0 0.316775
ghi 612.0 0.249186
Malaysia def 554.0 0.467905
abc 441.0 0.372466
ghi 178.0 0.150338
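For reference, the count and ratio can also be built in one pass with transform('sum') (a sketch on the same asiaselect frame; head(3) keeps the top three patterns per country, as in the question):

counts = asiaselect.groupby('Country')['Pattern'].value_counts()
out = counts.rename('count').to_frame()
# divide each count by the total count of its country (level 0 of the MultiIndex)
out['ratio'] = counts / counts.groupby(level=0).transform('sum')
out = out.groupby(level=0).head(3)
print(out)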