How to remove extra characters from a column value using Python

I am trying to map values from a dictionary: if a field value matches an entry in the dictionary, all the extra characters should be stripped from it. I can match the values, but how can I remove the extra characters from the column?
Input Data
col_data
Indi8
United states / 08
UNITED Kindom (55)
ITALY 22
israel
Expected Output:
col_data
India
United States
United Kindom
Italy
Israel
Script I am using:
import numpy as np
from difflib import SequenceMatcher

match_val = ['India', 'United Kingdom', 'Israel', 'United States', 'Italy']
lower = [x.lower() for x in match_val]

def nearest(s):
    # index of the canonical name with the highest similarity ratio
    idx = np.argmax([SequenceMatcher(None, s.lower(), i).ratio() for i in lower])
    return match_val[idx]

df['col_data'] = df['col_data'].apply(nearest)
The above script matches the values against the list, but it does not remove the extra characters. How can I modify the script so that it also removes the extra characters after mapping?

I like this str.extract approach:
df['col_data'] = df['col_data'].str.extract(r'([A-Za-z]+(?: [A-Za-z]+)*)', expand=False).str.title()
The regex ([A-Za-z]+(?: [A-Za-z]+)*) matches the leading run of all-letter words, dropping the trailing content you want to remove. (expand=False makes str.extract return a Series, so .str.title() can be chained.)
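Note that extraction alone leaves 'Indi8' as 'Indi'. A sketch combining the extraction with the fuzzy nearest() matcher from the question (column and list names taken from the question) snaps each cleaned value to its canonical spelling:

```python
import numpy as np
import pandas as pd
from difflib import SequenceMatcher

match_val = ['India', 'United Kingdom', 'Israel', 'United States', 'Italy']
lower = [x.lower() for x in match_val]

def nearest(s):
    # pick the canonical name with the highest similarity ratio
    idx = np.argmax([SequenceMatcher(None, s.lower(), i).ratio() for i in lower])
    return match_val[idx]

df = pd.DataFrame({'col_data': ['Indi8', 'United states / 08',
                                'UNITED Kindom (55)', 'ITALY 22', 'israel']})
# strip trailing digits/symbols first, then snap to the nearest canonical name
cleaned = df['col_data'].str.extract(r'([A-Za-z]+(?: [A-Za-z]+)*)', expand=False)
df['col_data'] = cleaned.apply(nearest)
print(df['col_data'].tolist())
# → ['India', 'United States', 'United Kingdom', 'Italy', 'Israel']
```

This fixes misspellings such as 'Indi8' and 'UNITED Kindom (55)' in one pass, at the cost of always mapping to the canonical list.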

Related

Find value from string using characters from a list using Python

I have been working on an Excel sheet using Python, where I have to extract only a specific value from a column, using a list with a set of characters.
I need to check every value in the column against the list; if it matches, the matched value should be returned into the dataframe so it can be used for further analysis.
Input Data :
text-value
19 Freezeland Lane, United Kingdom BD23 0UN
44 Bishopthorpe Road, United States LL55 1EU
Worthy Lane Denmark, LN11 9LP
88 Carriers Road, Mexico , DG3 1LB
HongKong
Expected Output:
text_value
United Kingdom
United States
Denmark
Mexico
HongKong
Code Snippet:
import pandas as pd
import re
countries=['United Kingdom','Denmark','India','United States','Mexico','HongKong']
df['text_value'] = re.findall(countries, df.text_value)
But it didn't work.
I also tried:
if re.compile('|'.join(countries), re.IGNORECASE).search(df['text_value']):
    df['text_value']
You can use
df['country_list'] = df['text_value'].str.findall(r'(?i)\b(?:{})\b'.format('|'.join(countries)))
Here, Series.str.findall stores all matches found in each cell into the country_list column, and the pattern, which looks like (?i)\b(?:Country1|Country2|...)\b, matches:
(?i) - case insensitive inline modifier option
\b - a word boundary
(?:Country1|Country2|...) - a list of countries
\b - a word boundary
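A runnable sketch using the sample rows from the question (taking the first match per row to produce the flat expected output is an addition, not part of the original answer):

```python
import pandas as pd

countries = ['United Kingdom', 'Denmark', 'India', 'United States',
             'Mexico', 'HongKong']
df = pd.DataFrame({'text_value': [
    '19 Freezeland Lane, United Kingdom BD23 0UN',
    '44 Bishopthorpe Road, United States LL55 1EU',
    'Worthy Lane Denmark, LN11 9LP',
    '88 Carriers Road, Mexico , DG3 1LB',
    'HongKong']})
# find every country name, case-insensitively, as a whole word
df['country_list'] = df['text_value'].str.findall(
    r'(?i)\b(?:{})\b'.format('|'.join(countries)))
# keep only the first match per row for a flat column
df['text_value'] = df['country_list'].str[0]
print(df['text_value'].tolist())
# → ['United Kingdom', 'United States', 'Denmark', 'Mexico', 'HongKong']
```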

split column regex dataframe python

I have a column in a dataframe where in some rows I have the state, and sometimes just the city. For example in some rows I just have: 'Los Angeles', but in other rows I may have 'CA Los Angeles'.
I want to split that column into two new ones: states and cities, and if the state is not specified, then it can be blank. Something like this:
COLUMN   | STATE | CITY
FL Miami | FL    | Miami
Houston  | null  | Houston
I was thinking of splitting with a regex like '[A-Z][A-Z]\s' or something like that, but I cannot make it work. Any ideas?
You can use
^(?:([A-Z]{2})\s+)?(.*)
See the regex demo. Details:
^ - start of string
(?:([A-Z]{2})\s+)? - an optional occurrence of
([A-Z]{2}) - Group 1: two uppercase ASCII letters
\s+ - one or more whitespaces
(.*) - Group 2: any zero or more chars other than line break chars as many as possible.
If you are using Pandas use
df[['STATE','CITY']] = df['COLUMN'].str.extract(r'^(?:([A-Z]{2})\s+)?(.*)', expand=False)
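A quick runnable check, using the two sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame({'COLUMN': ['FL Miami', 'Houston']})
# optional two-letter state goes to STATE, the remainder to CITY
df[['STATE', 'CITY']] = df['COLUMN'].str.extract(r'^(?:([A-Z]{2})\s+)?(.*)')
print(df)
```

Rows without a state get NaN in STATE; if a literal null placeholder is wanted, it can be filled afterwards with `df['STATE'].fillna('null')`.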

How to replace string if some characters are the same on pandas?

I have import/export trade data for a country. In the initial data, some country names contain a weird symbol: ��.
For this reason, I am struggling to replace those strings.
Currently, I am replacing country names with their 3-letter country codes, for example China = CHI, Russian Federation = RUS. My code works fine for most country names.
Except: C��ina, ��etnam, Turk��, T��rkey, Uzbekist��n, Uzb��kistan etc.
I can format these manually the first time; however, this data updates every month, and the size is now almost 2 billion rows.
for i, j in all_3n.items():
    df['Country'] = df['Country'].str.replace(j, i)
This is how I am replacing now. Furthermore, how can I replace the whole string, not only the matched substring? For example, if the lookup value is Russia and the string in the database is Russian Federation, it returns RUSn Federation. Any ideas on how to overcome these two challenges? Thanks
You should use the code '\uFFFD' for the replacement character �:
df['Country'] = df['Country'].str.replace('\uFFFD', '')
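For the second challenge, replacing the whole cell rather than just the matched fragment, the lookup name can be anchored so the pattern consumes the rest of the string. A minimal sketch, assuming an all_3n-style mapping with illustrative entries:

```python
import re
import pandas as pd

# hypothetical code -> lookup-name mapping, in the spirit of all_3n from the question
all_3n = {'RUS': 'Russia', 'CHN': 'China'}

df = pd.DataFrame({'Country': ['Russian Federation', 'China', 'Japan']})
for code, name in all_3n.items():
    # anchor at the start and consume the rest of the cell,
    # so 'Russian Federation' becomes 'RUS', not 'RUSn Federation'
    df['Country'] = df['Country'].str.replace(
        rf'^{re.escape(name)}.*$', code, regex=True)
print(df['Country'].tolist())
# → ['RUS', 'CHN', 'Japan']
```

re.escape guards against lookup names that happen to contain regex metacharacters.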

Matching content and creating a new column

Hello, I have a dataset where I want to match my keywords with the location. The problem I am having is that the locations "Afghanistan", "Kabul" or "Helmund" appear in my dataset in over 150 combinations, including spelling mistakes, different capitalization, and the city or town attached to the name. What I want to do is create a separate column that returns the value 1 if any of the substrings "afg", "Afg", "kab" or "helm" are contained in the location. I am not sure if upper or lower case makes a difference.
For instance, there are hundreds of location combinations like: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan.
I have tried this code, and it is fine if it matches the phrase exactly, but there is too much variation to write every exception down:
keywords = ['Afghanistan', 'Kabul', 'Herat', 'Jalalabad', 'Kandahar', 'Mazar-i-Sharif',
            'Kunduz', 'Lashkargah', 'mazar', 'afghanistan', 'kabul', 'herat',
            'jalalabad', 'kandahar']

# how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0

taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)
# below will return the rows with value 1 (isin must test the int 1, not '1')
taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)
Just need to replace this logic where all results will be put into column "keyword_solution" that matches either "Afg" or "afg" or "kab" or "Kab" or "kund" or "Kund"
Given the following:
Sentences from the New York Times
Remove all non-alphanumeric characters
Change everything to lowercase, thereby removing the need for different word variations
Split the sentence into a list or set. I used set because of the long sentences.
Add to the keywords list as needed
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword, in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords= ['jalalabad',
'kunduz',
'lashkargah',
'mazar',
'herat',
'mazar',
'afgh',
'kab',
'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# substitute non-alphanumeric characters
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))
# create a new column with a list of all the words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))
# final
print(df.location)
0 True
1 False
2 False
3 True
4 True
5 True
Name: location, dtype: bool
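If the 1/0 keyword_solution column from the question is still wanted, the same substring check can be cast to int. A small sketch with illustrative rows:

```python
import re
import pandas as pd

keywords = ['afgh', 'kab', 'kand']
df = pd.DataFrame({'location': ['Kabul/Afghanistan',
                                'Jegdalak, Afghanistan',
                                'Paris']})
# normalize: strip non-alphanumerics, lowercase, split into a set of words
df['word_list'] = df['location'].apply(
    lambda x: set(re.sub(r'[\W_]+', ' ', x).lower().split()))
# 1 if any keyword is a substring of any word, else 0
df['keyword_solution'] = df['word_list'].apply(
    lambda s: int(any(word in y for y in s for word in keywords)))
print(df['keyword_solution'].tolist())
# → [1, 1, 0]
```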

How to split by a character in Python yet maintain that character?

Google Maps results are often displayed thus:
'\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
Another variation:
'Clayton Village Shopping Center, 14856 Clayton Rd\nChesterfield, MO, United States\n(636) 227-2844'
And another:
'Wildwood, MO\nUnited States\n(636) 458-7707'
Notice the variation in the placement of the \n characters.
I'm looking to extract the first X lines as the address, and the last line as the phone number. A regex such as (.*\n.*)\n(.*) would suffice for the first example, but falls short for the other two. The only thing I can rely on is that the phone number will be in the form (ddd) ddd-dddd.
I think a regex that allows for each and every possible variation will be hard to come by. Is it possible to use split(), but keep the character we split by? So in this example, split by "(" to separate the address and phone number, but retain this character in the phone number? I could concatenate the "(" back onto split("(")[1], but is there a neater way?
Don't use regex for the whole string. Just split the string on '\n'. The last index is the phone number; the other indexes are the address.
import re

# assumed pattern for the (ddd) ddd-dddd phone format
REGEX_PHONE_US = r'\(\d{3}\) \d{3}-\d{4}'

# strip surrounding newlines so the last line is really the phone number
lines = inputString.strip('\n').split('\n')
phone = lines[-1] if re.fullmatch(REGEX_PHONE_US, lines[-1]) else None
address = '\n'.join(lines[:-1]) if phone else inputString
Python has a lot of great built-in tools for manipulating strings in a more... human way... than regex allows.
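As a quick check, the idea above can be wrapped in a function (REGEX_PHONE_US is left undefined in the answer; the pattern below is an assumed (ddd) ddd-dddd form) and run against all three examples:

```python
import re

REGEX_PHONE_US = r'\(\d{3}\) \d{3}-\d{4}'  # assumed (ddd) ddd-dddd pattern

def split_address_phone(s):
    # drop surrounding newlines so the last line is the phone number
    lines = s.strip('\n').split('\n')
    phone = lines[-1] if re.fullmatch(REGEX_PHONE_US, lines[-1]) else None
    address = '\n'.join(lines[:-1]) if phone else s
    return address, phone

for s in ['\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n',
          'Clayton Village Shopping Center, 14856 Clayton Rd\n'
          'Chesterfield, MO, United States\n(636) 227-2844',
          'Wildwood, MO\nUnited States\n(636) 458-7707']:
    address, phone = split_address_phone(s)
    print(phone)
# → (636) 938-9310
# → (636) 227-2844
# → (636) 458-7707
```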
If I understand you correctly, you want to "extract the first X lines as address". Assuming that all the addresses you need are in the US, this regex code should work for you. In any case, it works on the 3 examples you provided:
import re
x = 'Wildwood, MO\nUnited States\n(636) 458-7707'
print(re.findall(r'.*\n+.*States', x))
The output is:
['Wildwood, MO\nUnited States']
If you want to print it later without the \n, you can do it this way:
x = '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
y = re.findall(r'.*\n+.*States', x)
y = y[0].rstrip()
When you print y, the output is:
113 W 5th St
Eureka, MO, United States
And if you want to extract the phone number separately, you can do this:
tel = '\n113 W 5th St\nEureka, MO, United States\n(636) 938-9310\n'
num = re.findall(r'.*\d+-\d+', tel)
num = num[0].rstrip()
When you print num, the output is:
(636) 938-9310
