I have a dataframe column with typos.
ID  Bankname
1   Bank of America
2   bnk of America
3   Jp Morg
4   Jp Morgan
And I have a list with the correct bank names:
["Bank of America", "JPMorgan Chase"]
I want to check the bank names and replace the wrong ones with the correct names from the list, using Levenshtein distance.
Here is one simple way to do it using the difflib module from the Python standard library, which provides helpers for computing deltas.
from difflib import SequenceMatcher

# Helper: return the closest value scoring above `threshold`, or the input unchanged
def match(x, values, threshold):
    def ratio(a, b):
        return SequenceMatcher(None, a, b).ratio()
    results = {
        value: ratio(value, x) for value in values if ratio(value, x) > threshold
    }
    return max(results, key=results.get) if results else x
And then:
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "Bankname": ["Bank of America", "bnk of America", "Jp Morg", "Jp Morgan"],
    }
)
names = ["Bank of America", "JPMorgan Chase"]
df["Bankname"] = df["Bankname"].apply(lambda x: match(x, names, 0.4))
So that:
print(df)
# Output
ID Bankname
0 1 Bank of America
1 2 Bank of America
2 3 JPMorgan Chase
3 4 JPMorgan Chase
Of course, you can replace the inner ratio function with any other, more appropriate sequence matcher.
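For instance, since the question asks for Levenshtein distance specifically, a normalized Levenshtein similarity can be dropped in; a minimal sketch, assuming the third-party rapidfuzz package is installed:
# Assumes rapidfuzz is installed: pip install rapidfuzz
from rapidfuzz.distance import Levenshtein

def ratio(a, b):
    # Normalized Levenshtein similarity in [0, 1]; 1.0 means identical strings
    return Levenshtein.normalized_similarity(a, b)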
I have a dataframe that I would like to split into multiple dataframes using the values in my Date column. Ideally, I would like to split my dataframe by decade. Do I need to use the np.array_split method, or is there a way that does not require NumPy?
My dataframe looks like a larger version of this:
Date Name
0 1746-06-02 Borcke (#p1)
1 1746-09-02 Jordan (#p31)
2 1747-06-02 Sa Majesté (#p32)
3 1752-01-26 Maupertuis (#p4)
4 1755-06-02 Jordan (#p31)
And so, in this scenario, I would ideally want two dataframes like these:
Date Name
0 1746-06-02 Borcke (#p1)
1 1746-09-02 Jordan (#p31)
2 1747-06-02 Sa Majesté (#p32)
Date Name
0 1752-01-26 Maupertuis (#p4)
1 1755-06-02 Jordan (#p31)
Building on mozway's answer for getting the decades:
import pandas as pd
import math

d = {
    "Date": [
        "1746-06-02",
        "1746-09-02",
        "1747-06-02",
        "1752-01-26",
        "1755-06-02",
    ],
    "Name": [
        "Borcke (#p1)",
        "Jordan (#p31)",
        "Sa Majesté (#p32)",
        "Maupertuis (#p4)",
        "Jordan (#p31)",
    ],
}

df = pd.DataFrame(d)
df["years"] = df['Date'].str.extract(r'(^\d{4})', expand=False).astype(int)
df["decades"] = (df["years"] / 10).apply(math.floor) * 10
dfs = [g for _, g in df.groupby(df['decades'])]
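Each element of dfs is then one decade's frame; for a quick look (note that the original index is kept unless you reset it):
for frame in dfs:
    print(frame[["Date", "Name"]], end="\n\n")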
Using groupby, you can generate a list of DataFrames:
dfs = [g for _, g in df.groupby(df['Date'].str.extract(r'(^\d{3})', expand=False))]
Or, validating the dates:
dfs = [g for _,g in df.groupby(pd.to_datetime(df['Date']).dt.year//10)]
If you prefer a dictionary for indexing by decade:
dfs = dict(list(df.groupby(pd.to_datetime(df['Date']).dt.year//10*10)))
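With the sample data above, the dictionary is keyed by the start of each decade:
list(dfs)    # [1740, 1750]
dfs[1740]    # the rows from the 1740s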
NB. I initially missed that you wanted decades, not years. I updated the answer. The logic remains unchanged.
I am facing an issue and don't know how to approach it.
I have a large dataset with two columns, i.e. country and city name. There are multiple entries where the country and city names are spelled incorrectly due to human error, e.g. England is written as Egnald.
Can anybody please guide me on how to check and correct them in Python?
I was able to find the incorrect entries using the code below, but I am not sure how to correct them with an automated process, as I cannot do it manually.
Thanks
Here is what I have done so far:
import pycountry as pc

# Convert billing country to lower case (assigned back so it takes effect)
df['Billing Country'] = df['Billing Country'].str.lower()
input_country_list = list(df['Billing Country'])
input_country_list = [element.upper() for element in input_country_list]

def country_name_check():
    pycntrylst = list(pc.countries)
    alpha_2 = []
    alpha_3 = []
    name = []
    common_name = []
    official_name = []
    invalid_countrynames = []
    tobe_deleted = ['IRAN', 'SOUTH KOREA', 'NORTH KOREA', 'SUDAN', 'MACAU',
                    'REPUBLIC OF IRELAND']
    for i in pycntrylst:
        alpha_2.append(i.alpha_2)
        alpha_3.append(i.alpha_3)
        name.append(i.name)
        if hasattr(i, "common_name"):
            common_name.append(i.common_name)
        else:
            common_name.append("")
        if hasattr(i, "official_name"):
            official_name.append(i.official_name)
        else:
            official_name.append("")
    for j in input_country_list:
        if (j not in map(str.upper, alpha_2)
                and j not in map(str.upper, alpha_3)
                and j not in map(str.upper, name)
                and j not in map(str.upper, common_name)
                and j not in map(str.upper, official_name)):
            invalid_countrynames.append(j)
    invalid_countrynames = list(set(invalid_countrynames))
    invalid_countrynames = [item for item in invalid_countrynames
                            if item not in tobe_deleted]
    return print(invalid_countrynames)
By running the above code I was able to get the names of the misspelled countries. Can anyone please guide me on how to replace them with the correct ones now?
You can use SequenceMatcher from difflib (see here). It has a ratio() method that lets you compare the similarity of two strings (a higher number means higher similarity; 1.0 means identical strings):
>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None,'Dog','Cat').ratio()
0.0
>>> SequenceMatcher(None,'Dog','Dogg').ratio()
0.8571428571428571
>>> SequenceMatcher(None,'Cat','Cta').ratio()
0.6666666666666666
My idea is to have a list of correct country names, compare each record in your dataframe to each item in this list, and select the most similar one; that way you should get the correct country name. You can then put this into a function and apply it over all records in the country column of your dataframe:
>>> #let's say we have following dataframe
>>> df
number country
0 1 Austria
1 2 Autrisa
2 3 Egnald
3 4 Sweden
4 5 England
5 6 Swweden
>>>
>>> #let's specify correct names
>>> correct_names = {'Austria','England','Sweden'}
>>>
>>> #let's specify the function that select most similar word
>>> def get_most_similar(word,wordlist):
... top_similarity = 0.0
... most_similar_word = word
... for candidate in wordlist:
... similarity = SequenceMatcher(None,word,candidate).ratio()
... if similarity > top_similarity:
... top_similarity = similarity
... most_similar_word = candidate
... return most_similar_word
...
>>> #now apply this function over 'country' column in dataframe
>>> df['country'].apply(lambda x: get_most_similar(x,correct_names))
0 Austria
1 Austria
2 England
3 Sweden
4 England
5 Sweden
Name: country, dtype: object
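Note that apply returns a new Series; to actually correct the column, assign the result back:
>>> df['country'] = df['country'].apply(lambda x: get_most_similar(x, correct_names))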
df.replace(['Egnald', 'Cihna'], ['England', 'China'])
This will find and replace across the entire DataFrame.
Use df.replace(['Egnald', 'Cihna'], ['England', 'China'], inplace=True) if you want to do this in place.
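Equivalently, replace accepts a mapping, which keeps each wrong/right pair together:
df.replace({'Egnald': 'England', 'Cihna': 'China'})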
I would like to add the regional information to the main table that contains entity and account columns, so that each row in the main table is duplicated once per region, just like the append tool in Alteryx.
Is there a way to do this operation with Pandas in Python?
Thanks!
Unfortunately, no built-in method exists; you'll need to build the Cartesian product of those DataFrames yourself. Check that fancy explanation of merging DataFrames in pandas.
But for your specific problem, try this:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(columns=['Entity', 'Account'])
df1.Entity = ['Entity1', 'Entity1']
df1.Account = ['Sales', 'Cost']

df2 = pd.DataFrame(columns=['Region'])
df2.Region = ['North America', 'Europa', 'Asia']

def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
resultdf = cartesian_product_simplified(df1, df2)
print(resultdf)
output:
0 1 2
0 Entity1 Sales North America
1 Entity1 Sales Europa
2 Entity1 Sales Asia
3 Entity1 Cost North America
4 Entity1 Cost Europa
5 Entity1 Cost Asia
as expected.
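As an aside, on pandas 1.2+ the same result can be obtained with a cross merge, which also preserves the original column names:
resultdf = df1.merge(df2, how='cross')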
By the way, please provide the DataFrame as code next time, not as a screenshot or a link. It helps us save time (please check how to ask).
I have a dataframe which has a column called regional_codes. Now I need to add a new column to the dataframe where the regional codes are replaced by the list of countries attributed to that region.
For example, if regional_codes contains ['asia'], then I need my new column to contain the list of Asian countries, like ['china','japan','india','bangladesh', ...].
Currently, I have created a separate list for each region and I use something like this code:
asia_list = ['asia', 'china', 'japan', 'india', ...]
output_list = []
output_list+= [asia_list for w in regional_codes if w in asia_list]
output_list+= [africa_list for w in regional_codes if w in africa_list]
and so on, until all the regional lists are exhausted.
With the code I have provided above, my results are exactly what I need, and it is efficient in terms of running time as well. However, I feel like I am doing it in a very long-winded way, so I am looking for suggestions that can help me shorten my code.
One way I found to do this is to create a DataFrame with all the needed data for your regional_codes and the regional lists:
import pandas as pd
import itertools
import numpy as np
# DF is your dataframe
# df is the dataframe containing the association between regional codes and regional lists
df = pd.DataFrame({'regional_code': ['asia', 'africa', 'europe'], 'regional_list': [['China', 'Japan'], ['Morocco', 'Nigeria', 'Ghana'], ['France', 'UK', 'Germany', 'Spain']]})
#   regional_code                 regional_list
# 0          asia                [China, Japan]
# 1        africa     [Morocco, Nigeria, Ghana]
# 2        europe  [France, UK, Germany, Spain]
df2 = pd.DataFrame({'regional_code': [['asia', 'africa'], ['africa', 'europe']], 'regional_list': [1, 2]})
#       regional_code  regional_list
# 0    [asia, africa]              1
# 1  [africa, europe]              2
df2['list'] = df2.apply(lambda x: list(itertools.chain.from_iterable((df.loc[df['regional_code'] == i, 'regional_list'] for i in x.loc['regional_code']))), axis=1)
# In [95]: df2
# Out[95]:
#       regional_code  regional_list                                         list
# 0    [asia, africa]              1  [[China, Japan], [Morocco, Nigeria, Ghana]]
# 1  [africa, europe]              2  [[Morocco, Nigeria, Ghana], [France, UK, Germa...
Now we flatten df2['list']:
df2['list'] = df2['list'].apply(np.concatenate)
#       regional_code  regional_list                                     list
# 0    [asia, africa]              1   [China, Japan, Morocco, Nigeria, Ghana]
# 1  [africa, europe]              2  [Morocco, Nigeria, Ghana, France, UK, Germany,...
I guess this answers your question?
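For what it's worth, if the region-to-countries mapping can live in a plain dict, the whole thing reduces to one lookup per code; a minimal sketch, where region_map and the regional_codes column are assumed names:
region_map = {
    'asia': ['china', 'japan', 'india'],
    'africa': ['morocco', 'nigeria', 'ghana'],
}
DF['countries'] = DF['regional_codes'].apply(
    lambda codes: [c for code in codes for c in region_map.get(code, [])]
)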
I have a dataframe like this:
Filtered_data
['defence possessed russia china', 'factors driving china modernise']
['force bolster pentagon', 'strike capabilities pentagon congress detailing china']
['missiles warheads', 'deterrent face continued advances']
......
......
I just want to split each list element into sub-elements (tokenized words). So the output I'm looking for is:
Filtered_data
[defence, possessed, russia, factors, driving, china, modernise]
[force, bolster, strike, capabilities, pentagon, congress, detailing, china]
[missiles, warheads, deterrent, face, continued, advances]
Here is the code I have tried:
for text in df['Filtered_data'].iteritems():
    for i in text.split():
        print(i)
Use a list comprehension with split and flattening:
df['Filtered_data'] = df['Filtered_data'].apply(lambda x: [z for y in x for z in y.split()])
print (df)
Filtered_data
0 [defence, possessed, russia, china, factors, d...
1 [force, bolster, pentagon, strike, capabilitie...
2 [missiles, warheads, deterrent, face, continue...
EDIT:
For unique values, the standard way is to use sets:
df['Filtered_data'] = df['Filtered_data'].apply(lambda x: list(set([z for y in x for z in y.split()])))
print (df)
Filtered_data
0 [russia, factors, defence, driving, china, mod...
1 [capabilities, detailing, china, force, pentag...
2 [deterrent, advances, face, warheads, missiles...
But if the ordering of values is important, use pandas.unique:
df['Filtered_data'] = df['Filtered_data'].apply(lambda x: pd.unique([z for y in x for z in y.split()]).tolist())
print (df)
Filtered_data
0 [defence, possessed, russia, china, factors, d...
1 [force, bolster, pentagon, strike, capabilitie...
2 [missiles, warheads, deterrent, face, continue...
You can use itertools.chain + toolz.unique. The benefit of toolz.unique versus set is that it preserves ordering.
from itertools import chain
from toolz import unique
df = pd.DataFrame({'strings': [['defence possessed russia china','factors driving china modernise'],
['force bolster pentagon','strike capabilities pentagon congress detailing china'],
['missiles warheads', 'deterrent face continued advances']]})
df['words'] = df['strings'].apply(lambda x: list(unique(chain.from_iterable(i.split() for i in x))))
print(df.iloc[0]['words'])
['defence', 'possessed', 'russia', 'china', 'factors', 'driving', 'modernise']
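If you would rather avoid the toolz dependency, dict.fromkeys (insertion-ordered since Python 3.7) gives the same ordered de-duplication with the standard library alone:
df['words'] = df['strings'].apply(
    lambda x: list(dict.fromkeys(chain.from_iterable(i.split() for i in x)))
)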