Replacing certain values in a column with messy data - python

I have a very lengthy dataset stored as a dataframe. The column I am looking at is called "Country". This column has quite a few countries within it. The issue is that I want to change various values to "USA".
The values I am trying to change are:
U.S
United States
United states
etc.
There are too many variations and typos (more than 100) for me to go through by hand. Is there any simpler way to change these values? Since there are other countries in the dataset, I cannot just change all the values to USA.

One thing you can do is key off the first letter of each word: in every variant, the first word starts with U and the second word (if you split the whole string) starts with S. Here I am using the regular expressions package (re), which is the usual tool when you are working with text.
import re
Split_parts = [re.split(r'[^A-Za-z]', i) for i in df['country']]
The above line splits each string on any non-alphabetic character (e.g. period, comma, semicolon, space).
After splitting, you can run a loop that records True or False for each row depending on whether the first characters of the first two parts are U and S respectively:
value = []
for i in Split_parts:
    # the len check guards against one-word entries such as "Canada"
    if len(i) > 1 and i[0][:1] in ['u', 'U'] and i[1][:1] in ['s', 'S']:
        value.append(True)
    else:
        value.append(False)
After that you can replace the string with what you need (i.e. USA):
for i in range(len(value)):
    if value[i]:
        df.loc[i, 'country'] = 'USA'  # .loc avoids pandas' chained-assignment warning
The only country in the world whose words start with U and S respectively is the United States. The solution here is not something that can be reused for every problem you face; for each one you have to look for a distinguishing pattern.
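The loop above can also be written with pandas' vectorized string methods; a minimal sketch against a small hypothetical frame (the regex encodes the same "first word starts with U, second word starts with S" rule):

```python
import pandas as pd

df = pd.DataFrame({'country': ['U.S', 'United states', 'Canada', 'united-states']})

# First word starts with u/U, then at least one non-alphabetic separator,
# then a second word starting with s/S.
mask = df['country'].str.match(r'[Uu][A-Za-z]*[^A-Za-z]+[Ss]', na=False)
df.loc[mask, 'country'] = 'USA'
```

str.match is anchored at the start of each string, so names like "Russia" or "Australia" are left alone.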

Related

Search for specific strings in rows of dataframe and if strings exist then mark in another column in python

I have a dataframe with two columns
Current Dataframe
SE# Response COVID Words Mentioned
123456 As a merchant I appreciated your efforts in pricing with Covid-19
456789 you guys suck and didn't handle our relationship during this pandemic
347896 I love your company
Desired Dataframe
SE# Response COVID Words Mentioned
123456 As a merchant I appreciated your efforts in pricing with Covid-19 Y
456789 you guys suck and didn't handle our relationship during this pandemic Y
347896 I love your company N
terms = ['virus', 'Covid-19','covid19','flu','covid','corona','Corona','COVID-19','co-vid19','Coronavirus','Corona Virus','COVID','purell','pandemic','epidemic','coronaviru','China','Chinese','chinese','crona','korona']
These are the list of strings that need to be checked in each response. The goal is to be able to add or remove elements from the list of terms.
The above are just example records. I have a list of strings related to covid-19 that need to be searched for in each response. If any of the strings exist in a response, mark a "Y" in the 'COVID Words Mentioned' column, and an "N" if none of the words show up.
How do I code this in python?
Much appreciated!
For each search term, set up a result vector:
d = {}
for i in terms:
    d[i] = df['Response'].str.contains(i, na=False)
I pass na=False because otherwise Pandas returns NA wherever the string column itself is NA, and we don't want that behaviour. The cost grows with the number of search terms. Also consider changing this if you want to match whole words, because contains matches sub-strings.
Regardless, take the results and reduce them with bit-wise or, since a row should count if it contains any of the terms. You need two imports:
from functools import reduce
from operator import or_
mask = reduce(or_, d.values())
mask is a boolean Series that is True for every row mentioning at least one of the words. You can map it from {True, False} to {'Y', 'N'} with np.where to build the new column.
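Putting the pieces together on a small hypothetical frame (term list trimmed for the example; case=False folds the upper/lower-case variants together, and regex=False keeps each term a literal sub-string):

```python
import numpy as np
import pandas as pd
from functools import reduce
from operator import or_

terms = ['covid', 'corona', 'pandemic', 'virus']  # trimmed, lower-cased term list

df = pd.DataFrame({
    'SE#': [123456, 456789, 347896],
    'Response': ['appreciated your efforts in pricing with Covid-19',
                 "didn't handle our relationship during this pandemic",
                 'I love your company'],
})

# One boolean Series per term, then a single bit-wise or across all of them.
masks = [df['Response'].str.contains(t, case=False, regex=False, na=False)
         for t in terms]
df['COVID Words Mentioned'] = np.where(reduce(or_, masks), 'Y', 'N')
```

Adding or removing search terms is then just editing the terms list.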

Extract dollar amount at multiple places in sentence in python

I have a sentence, shown below, from which I need to extract the dollar amounts (with commas) so I can populate them into a dictionary. I have tried a few options but couldn't succeed. Please guide.
For par\n
$3,500 single /$7,000 group
For nonpar\n
$7,000 single /$14,000 group
Expected output is :
"rates":{
"single" : "$3,500 (par) / $7,000 (nonpar)",
"group" : "$7,000 (par) / $14,000 (nonpar)"
}
\n here is on a new line
Amount might have decimal points and commas after every 3 values as below.
I was able to write a regex for the amount alone, but I am not finding the right approach to extend it to my requirement.
re.search(r'^\$\d{1,3}(,\d{3})*(\.\d+)?$','$1,212,500.23')
Edit1:
Went ahead with one more step:
re.findall(r'\$\d{1,3}(?:\,\d{3})*(?:\.\d{2})?', str)
This gets all the values into a list (but I need a strategy to know which value corresponds to what).
Edit2:
re.findall(r'For par\W*(\$\d{1,3}(?:\,\d{3})*(?:\.\d{2})?\W*single)\s*\W*(\$\d{1,3}(?:\,\d{3})*(?:\.\d{2})?\W*group)', str)
Please help me to refine this and make it more generic.
Thanks
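Building on the Edit1 amount regex, one way to keep track of which value corresponds to what is to search per label rather than globally; a sketch, assuming the par/nonpar layout shown in the question:

```python
import re

text = ('For par\n$3,500 single /$7,000 group\n'
        'For nonpar\n$7,000 single /$14,000 group')

amount = r'\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?'   # the Edit1 amount pattern
rates = {}
for label in ('par', 'nonpar'):
    # Anchor the search on the "For par" / "For nonpar" header line.
    m = re.search(r'For %s\s*(%s) single\s*/\s*(%s) group' % (label, amount, amount),
                  text)
    if m:
        single, group = m.groups()
        rates.setdefault('single', []).append('%s (%s)' % (single, label))
        rates.setdefault('group', []).append('%s (%s)' % (group, label))

rates = {k: ' / '.join(v) for k, v in rates.items()}
```

Because each search is tied to its label, the amounts land in the right slots even though the same pattern matches all four numbers.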

Extract a specific number from a string

I have this string 553943040 21% 50.83MB/s 0:00:39
The length of the numbers can vary
The percent can contain one or two numbers
The spaces between the start of the string and the first number may vary
I need to extract the first number, in this case 553943040
I was thinking that the method could be to:
1) Replace the percent with a separator, something like:
string = string.replace("..%", "|")  # where "." would represent any character, even a space
2) Get the first part of the new string by cutting everything after the separator.
string=string.split("|")
string=string[0]
3) Remove the spaces.
string=string.strip()
I know that stages 2 and 3 work, but I'm stuck on the first (str.replace does only literal matching, no wildcards). Also, if there is any better method of doing it, it would be great to know!
Too much work.
>>> '553943040 21% 50.83MB/s 0:00:39'.split()[0]
'553943040'
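If you want to be defensive about the leading whitespace specifically, a regex that grabs the first run of digits also works (re.match is anchored at the start, so \s* skips any leading spaces):

```python
import re

line = '   553943040 21% 50.83MB/s 0:00:39'
m = re.match(r'\s*(\d+)', line)
first_number = m.group(1) if m else None
```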

Breaking 1 String into 2 Strings based on special characters using python

I am working with python and I am new to it. I am looking for a way to take a string and split it into two smaller strings. An example of the string is below
wholeString = '102..109'
And what I am trying to get is:
a = '102'
b = '109'
The information will always be separated by two periods as shown above, but the number of characters before and after can range anywhere from 1 to 10. I am writing a loop that counts characters before and after the periods and then slices based on those counts, but I was wondering if someone knew a more elegant way.
Thanks!
Try this:
a, b = wholeString.split('..')
It'll put each value into the corresponding variables.
Look at the string.split method.
split_up = [s.strip() for s in wholeString.split("..")]
This code will also strip off leading and trailing whitespace so you are just left with the values you are looking for. split_up will be a list of these values.
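Both answers on the example string, for reference:

```python
wholeString = '102..109'

a, b = wholeString.split('..')                            # direct unpacking
split_up = [s.strip() for s in wholeString.split('..')]   # list, whitespace stripped
```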

How can I uniquely shorten a list of strings so that they are at most x characters long

I'm looking for an algorithm that will take a vector of strings v1 and return a similar vector of strings v2 where each string is less than x characters long and is unique. The strings in v1 may not be unique.
While I need to accept ASCII in v1, I'd prefer to only insert alphanumeric characters ([A-Za-z0-9]) when insertion of new characters is required.
Obviously there are three caveats here:
For some values of v1 and x, there is no possible unique v2. For example, when v1 has 37 elements and x == 1.
"Similar" as specified in the question is subjective. The strings will be user facing, and presumably short natural language phrases (e.g. "number of colours"). I want a human to be able to map the original to the shortened string as easily as possible. This probably means taking advantage of heuristics such as disemvoweling. Because there is probably no objective measure of my similarness construct (string distance probably won't be the most useful here, although it might) my judgement on what is good will be arbitrary. The method should be suitable for English - other languages are irrelevant.
Obviously this is a (programming) language-agnostic problem, but I'd look favourably on an implementation in python (because I find its string processing language straight-forward).
A few notes / pointers about doing this in Python:
Use the bisect module to keep your result array sorted, so you can easily spot potential non-uniques. This is helpful even if v1 is already sorted (e.g. name and enemy will collide after disemvoweling).
Disemvoweling can be achieved by simply calling .translate(None, "aeiouyAEIOUY") on the string (Python 2; in Python 3 it is .translate(str.maketrans('', '', "aeiouyAEIOUY"))).
In case of duplicates, you could try to resolve the collisions first by lowercasing all results and using swapcase as a "bitmask", i.e. multiple occurrences of aaa become ["aaa", "aaA", "aAa", "aAA"] and so on. If this is not enough, "increment" characters starting from the end until a non-colliding identifier is found, e.g. ["aa"]*7 would become ["aa", "aA", "Aa", "AA", "ab", "aB", "Ab"].
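As a quick check of the Python 3 disemvoweling call (str.maketrans('', '', chars) builds a table that deletes every character in chars):

```python
# Build a deletion table for all vowels (plus y) in both cases.
table = str.maketrans('', '', 'aeiouyAEIOUY')
shortened = 'number of colours'.translate(table)
```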
Sketch -
Develop a list of functions that reduce the size of an english string. Order the functions from least to most obscuring.
For each string in v1 repeatedly apply an obscuring function until it can no longer reduce the size of the string and then move on to the next function.
When desired size x is achieved, verify reduced string is unique with respect to strings already in v2. If so, add it to v2, if not, continue to apply obscuring functions.
Following are some ideas for size reducing functions subjectively ordered from least to most obscuring. (The random selections are intended to increase the probability of the reduced string being unique.)
Replace a random occurrence of two white spaces characters with a single space
Replace a random occurrence of punctuation followed by space with a single space
Remove a single character word at random that is also a member of a kill list (eg "I", "a")
Remove a two character word at random that is also a member of a kill list (eg "an", "of")
Remove a three character word at random that is also a member of a kill list (eg "the", "and")
Replace a five or more character word with a word composed of its first three characters and last character (eg "number" becomes "numr", "colours" becomes "cols")
Remove a vowel at random
Remove a word that occurs in a large number of strings in v1. The idea being that very common words have low value.
Translate a word/phrase to a shorter "vanity license plate" word based on a dictionary(thesaurus) ( such as http://www.baac.net/michael/plates/index.html )
(Note: the last two functions would require access to the initial unaltered string, and the correspondences between the unaltered and altered words.)
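The sketch can be turned into a deliberately simple implementation; the reduction functions and kill list below are illustrative only, and collisions are resolved with a plain trailing counter rather than the swapcase trick:

```python
def disemvowel(s):
    # Delete vowels (and y) in both cases.
    return s.translate(str.maketrans('', '', 'aeiouyAEIOUY'))

def drop_stopwords(s, kill=('a', 'an', 'of', 'the', 'and', 'I')):
    # Remove low-value words from an illustrative kill list.
    return ' '.join(w for w in s.split() if w not in kill)

def shorten_unique(v1, x, funcs=(drop_stopwords, disemvowel)):
    """Apply obscuring functions, least obscuring first, until each string
    fits in x characters, then disambiguate collisions with a counter."""
    v2, seen = [], set()
    for s in v1:
        for f in funcs:
            if len(s) <= x:
                break
            s = f(s)
        s = s[:x]                      # hard truncation as a last resort
        base, n = s, 1
        while s in seen:               # resolve duplicates
            s = base[:x - len(str(n))] + str(n)
            n += 1
        seen.add(s)
        v2.append(s)
    return v2
```

A fuller version would slot the rest of the obscuring functions above into funcs, and could use the more similarity-preserving collision tricks in place of the counter.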
def split_len(seq, length):
    return [seq[i:i+length] for i in range(0, len(seq), length)]

newListOfString = []
for item in listOfStrings:
    newListOfString.append(split_len(item, 8)[0])
This returns the first eight characters of each string (equivalent to item[:8]); note it does nothing to guarantee uniqueness.
