Well, I'm cleaning a dataset using Pandas.
I have a column called "Country" where some rows contain numbers or other information in parentheses that I need to remove, for example:
Australia1,
Perú (country),
3Costa Rica, etc. To do this, I take the column and run an extraction over it:
pattern = "([a-zA-Z]+[\s]*[a-aZ-Z]+)(?:[(]*.*[)]*)"
df['Country'] = df['Country'].str.extract(pattern)
But I have a problem with this regex: I cannot match names like "United States of America", because it only captures "United ". How can I repeat the pattern of the first group indefinitely to match the whole name?
Thanks!
In this situation, I will clean the data step by step.
import io
import pandas as pd

df_str = '''
Country
Australia1
Perú (country)
3Costa Rica
United States of America
'''
df = pd.read_csv(io.StringIO(df_str.strip()))  # single column, so the default separator is fine
# handle the data
(df['Country']
 .str.replace(r'\d+', '', regex=True)  # remove numbers
 .str.split('(').str[0]                # keep the part before `(`
 .str.strip()                          # strip surrounding spaces
)
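For reference, here is the same chain as a self-contained run, assigned back to the column (the sample frame below is built directly, just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Australia1', 'Perú (country)',
                               '3Costa Rica', 'United States of America']})
df['Country'] = (df['Country']
                 .str.replace(r'\d+', '', regex=True)  # remove numbers
                 .str.split('(').str[0]                # keep the part before `(`
                 .str.strip())                         # strip surrounding spaces
print(df['Country'].tolist())
# ['Australia', 'Perú', 'Costa Rica', 'United States of America']
```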
Thanks for your answer, it worked!
I found another solution: matching the parts I don't want in the df.
pattern = r"([\s]*[(][\w ]*[)][\s]*)|([\d]*)"  # select the info I don't want
df['Country'] = df['Country'].replace(pattern, "", regex=True)  # replace that information with an empty string
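For what it's worth, a quick self-contained check of that alternative pattern on the sample values from the question:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Australia1', 'Perú (country)',
                               '3Costa Rica', 'United States of America']})
pattern = r"([\s]*[(][\w ]*[)][\s]*)|([\d]*)"
df['Country'] = df['Country'].replace(pattern, "", regex=True)
print(df['Country'].tolist())
# ['Australia', 'Perú', 'Costa Rica', 'United States of America']
```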
I would like to use something like a vlookup/map function in Python.
I have only a portion of the full name of some companies. I would like to know whether each company is in the dataset, as in the following example.
Thank you
I can recreate the results by checking one list against another. It's not very clear or logical what your match criteria are. "john usa" is a successful match with "aviation john" on the basis that "john" appears in both. But would "john usa" also constitute a match with "usa mark sas", since "usa" appears in both? What about hyphens, commas, etc.?
It would help if this was cleared up.
In any case, I hope the following will help, good luck:-
# create two lists of tuples based on the existing dataframes
check_list = list(df_check.to_records(index=False))
full_list = list(df_full.to_records(index=False))

# create a set - entries in a set are unique
results = set()
for check in check_list:  # for each record to check...
    for search_word in check[0].split(" "):  # take the first column and split it into words on spaces
        found = any(search_word in rec[0] for rec in full_list)  # is the word a substring of any record in full_list? True or False
        results.add((check[0], found))  # add the checked record and its result (the set avoids duplicate entries)

# build a dataframe based on the results
df_results = pd.DataFrame(results, columns=["check", "found"])
df1['in DATASET'] = df1['NAME'].isin(df2['FULL DATASET'])
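Worth noting: isin does an exact, whole-value comparison, unlike the substring loop above, so partial names won't be flagged. A minimal sketch with made-up frames (df1 and df2 here are assumptions):

```python
import pandas as pd

# hypothetical frames: df1 holds the names to check, df2 the full dataset
df1 = pd.DataFrame({'NAME': ['john usa', 'acme ltd']})
df2 = pd.DataFrame({'FULL DATASET': ['john usa', 'aviation john', 'usa mark sas']})

df1['in DATASET'] = df1['NAME'].isin(df2['FULL DATASET'])
print(df1['in DATASET'].tolist())
# [True, False]
```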
I'm using pandas to analyze data from 3 different sources, which are imported into dataframes. The data was all entered by humans and contains errors, so it requires modification.
Specifically, I'm working with street names. Until now, I have been using .str.replace() to remove street types (st., street, blvd., ave., etc.), as shown below. This isn't working well enough, and I decided I would like to use regex to match a pattern, and transform that entire column from the original street name, to the pattern matched by regex.
df['street'] = df['street'].str.replace(r' avenue+', '', regex=True)
I've decided I would like to use regex to identify (and remove all other characters from the address column's fields): any number of digits, followed by a space, and then the first x alphabetic characters.
For example, "3762 pearl street" might become "3762 pea" if x is 3 with the following regex:
(\d+ )+\w{0,3}
How can I use pandas' .str.replace to do this? I don't want to specify WHAT I want to replace with the second argument. I want to replace the original string with the pattern matched by the regex.
Something that, in my mind, might work like this:
df['street'] = df['street'].str.replace(ORIGINAL STRING, r'(\d+ )+\w{0,3}', regex=True)
which might turn "43 milford st." into "43 mil".
Thank you, please let me know if I'm being unclear.
You could use the extract method to overwrite the column with its own matched content:
pat = r'(\d+\s[a-zA-Z]{3})'
df['street'] = df['street'].str.extract(pat)
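As a sketch of how that behaves on a few made-up rows (expand=False keeps the result a Series; rows with no match become NaN):

```python
import pandas as pd

df = pd.DataFrame({'street': ['3762 pearl street', '43 milford st.', 'no number here']})
pat = r'(\d+\s[a-zA-Z]{3})'
df['street'] = df['street'].str.extract(pat, expand=False)
print(df['street'].tolist())
# ['3762 pea', '43 mil', nan]
```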
Just an observation: the regex you shared, (\d+ )+\w{0,3}, matches the following patterns and returns some funky stuff as well:
1131 1313 street
121 avenue
1 1 1 1 1 1 avenue
42
I've changed it up a bit based on what you described, but I'm not sure if that works for all your data points.
There are several countries with numbers and/or parentheses in my list. How do I remove these?
e.g.
'Bolivia (Plurinational State of)' should be 'Bolivia',
'Switzerland17' should be 'Switzerland'.
Just run:
df.Country.replace(r'\d+|\s*\([^)]*\)', '', regex=True, inplace=True)
Assuming that the initial content of your DataFrame is:
Country
0 Bolivia (Plurinational State of)
1 Switzerland17
2 United Kingdom
after the above replace you will have:
Country
0 Bolivia
1 Switzerland
2 United Kingdom
The above pattern contains:
- the first alternative: a non-empty sequence of digits,
- the second alternative:
  - an optional sequence of "white" chars,
  - an opening parenthesis (quoted),
  - a sequence of chars other than ) (between brackets no quotation is needed),
  - a closing parenthesis (also quoted).
Use Series.str.replace with a regex for the replacement: \s* matches possible spaces before (, \(.*\) matches () and any values between them, | is the regex "or", and \d+ matches numbers of 1 or more digits:
df = pd.DataFrame({'a':['Bolivia (Plurinational State of)','Switzerland17']})
df['a'] = df['a'].str.replace(r'(\s*\(.*\)|\d+)', '', regex=True)
print (df)
a
0 Bolivia
1 Switzerland
You can remove the strings this way:-
Remove numbers:-
import re
a = 'Switzerland17'
pattern = '[0-9]'
res = re.sub(pattern, '', a)
print(res)
Output:-
'Switzerland'
Remove parenthesis:-
b = 'Bolivia (Plurinational State of)'
pattern2 = r'(\s*\(.*\))'
res2 = re.sub(pattern2, '', b)
print(res2)
Output:-
'Bolivia'
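If it helps, both removals can be folded into a single re.sub call with an alternation (a sketch reusing the patterns above):

```python
import re

pattern = r'\d+|\s*\(.*\)'  # digits, or an optionally space-prefixed parenthesized chunk
print(re.sub(pattern, '', 'Switzerland17'))
# Switzerland
print(re.sub(pattern, '', 'Bolivia (Plurinational State of)'))
# Bolivia
```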
Using Regex and a simple List Operation
Go through the list items, find the regex match in each item, and replace the values in place. The regex "[a-zA-Z]{2,}" matches only letter sequences of two or more characters, so digits and parentheses end the match. The better approach with regex is to match based on your input domain (i.e. countries in your case): a country name cannot have a number or parenthesis in it, so you should use the following.
import re
list_of_country_strings = ["Switzerland17", "America290","Korea(S)"]
for index in range(len(list_of_country_strings)):
    x = re.match("[a-zA-Z]{2,}", string=list_of_country_strings[index])
    if x:
        list_of_country_strings[index] = list_of_country_strings[index][x.start():x.end()]
print(list_of_country_strings)
Output
['Switzerland', 'America', 'Korea']
I have import/export trade data for a country. In the initial data, some country names have a weird symbol: ��.
For this reason, I am struggling to replace those strings.
Currently, I am replacing country names to their 3 letter country code. For example, China = CHI, Russian Federation = RUS. My code works fine for most of the country names.
Except: C��ina, ��etnam, Turk��, T��rkey, Uzbekist��n, Uzb��kistan etc.
I can format it manually the first time; however, this data is updated every month, and it is now almost 2 billion rows.
for i, j in all_3n.items():
    df['Country'] = df['Country'].str.replace(j, i)
This is how I am doing the replacement now. Furthermore, how can I replace the whole string, not only the found substring?
For example, my lookup has Russia but the string in the database is Russian Federation, so it returns RUSn Federation. Any ideas on how to overcome these two challenges? Thanks
You should use the code '\uFFFD' for the replacement character �:
df['Country'] = df['Country'].str.replace('\uFFFD', '')
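For the second challenge (Russia vs. Russian Federation), one option — a sketch, assuming a lookup dict shaped like your all_3n — is to anchor the pattern so the entire cell is replaced whenever it contains the lookup name:

```python
import re
import pandas as pd

# hypothetical lookup: code -> name fragment
all_3n = {'RUS': 'Russia', 'CHI': 'China'}

df = pd.DataFrame({'Country': ['Russian Federation', 'China', 'Japan']})
for code, name in all_3n.items():
    # ^.*name.*$ spans the whole cell, so the full string is replaced
    df['Country'] = df['Country'].str.replace(
        rf'^.*{re.escape(name)}.*$', code, regex=True)
print(df['Country'].tolist())
# ['RUS', 'CHI', 'Japan']
```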
I have the following Pandas code where I am trying to replace the names of countries with the string <country>.
df['title_type2'] = df['title_type']
countries = open(r'countries.txt').read().splitlines() # Reads all lines into a list and removes \n.
countries = [country.replace(' ', r'\s') for country in countries]
pattern = r'\b' + '|'.join(countries) + r'\b'
df['title_type2'].str.replace(pattern, '<country>')
However I can't get countries with spaces (like South Korea) to work correctly, since they do not get replaced. The problem seems to be that my \s is turning into \\s. How can I avoid this or how can I fix the issue?
There is no need to replace spaces with \s.
Your pattern should rather include:
\b - "starting" word boundary,
(?:...|...|...) a non-capturing group with country names (alternatives),
\b - "ending" word boundary,
something like:
pattern = r'\b(?:China|South Korea|Taiwan)\b'
Then you can do the replacement:
df['title_type2'].str.replace(pattern, '<country>', regex=True)
I created test data as follows:
df = pd.DataFrame(['Abc Taiwan', 'Xyz China', 'Zxx South Korea', 'No country name'],
columns=['title_type'])
df['title_type2'] = df['title_type']
and got:
0 Abc <country>
1 Xyz <country>
2 Zxx <country>
3 No country name
Name: title_type2, dtype: object