Remove specific words from column using Python - python

The data was originally extracted from a PDF for further analysis. There is an [identity] column where some of the values are wrong, i.e. they contain misspellings or special characters.
I am looking to remove the unwanted characters from the column.
Input Data:
identity
UK25463AC
ID:- UN67342OM
#ID!?
USA5673OP
Expected Output:
identity
UK25463AC
UN67342OM
NAN
USA5673OP
Script I have Tried so far:
stop_words = ['#ID!?','ID:-']
pat = '|'.join(r"\b{}\b".format(x) for x in stop_words)
df['identity'] = df['identity'].str.replace(pat, '')
So I have no clue how to handle this problem

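To get the expected output, remove the word boundaries \b...\b, and because the stop words contain special regex characters, escape them with re.escape. Then use Series.replace with regex=True to replace the matches with an empty string, strip the whitespace that is left behind, and convert values that end up empty to missing values:
import re
import numpy as np
stop_words = ['#ID!?','ID:-']
pat = '|'.join(re.escape(x) for x in stop_words)
# strip() removes the space left after 'ID:-'; empty strings become NaN
df['identity'] = df['identity'].replace(pat, '', regex=True).str.strip().replace('', np.nan)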
print (df)
identity
0 UK25463AC
1 UN67342OM
2 NaN
3 USA5673OP
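To see why the word boundaries get in the way, here is a minimal sketch that uses only the re module on two of the sample values. The names naive, escaped and samples are made up for this illustration. \b only matches between a word character and a non-word character, so it cannot match before '#' or after '-', and the unescaped '?' is read as a quantifier:

```python
import re

stop_words = ["#ID!?", "ID:-"]
samples = ["ID:- UN67342OM", "#ID!?"]

# Pattern from the question: unescaped stop words wrapped in \b...\b
naive = "|".join(r"\b{}\b".format(w) for w in stop_words)
# Pattern from the answer: escaped stop words, no word boundaries
escaped = "|".join(re.escape(w) for w in stop_words)

naive_out = [re.sub(naive, "", s) for s in samples]
escaped_out = [re.sub(escaped, "", s).strip() for s in samples]

print(naive_out)    # nothing is removed
print(escaped_out)  # stop words are removed
```

With the naive pattern both strings come back unchanged, while the escaped pattern leaves 'UN67342OM' and an empty string.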

Related

Problem with strip, replace functions in pandas dataframe

I am trying to strip all the special characters from a pandas dataframe column of words with the strip() and replace() functions.
However, it does not work. The special characters are not stripped from the words.
Can somebody enlighten me, please?
import pandas as pd
import datetime
df = pd.read_csv("2022-12-08_word_selection.csv")
for n in df.index:
    i = str(df.loc[n, "words"])
    if len(i) > 12:
        df.loc[n, "words"] = ""
df["words"] = df["words"].str.replace("$", "s")
df["words"] = df["words"].str.strip('[,:."*+-#/\^`#}{~&%’àáâæ¢ß¥£™©®ª×÷±²³¼½¾µ¿¶·¸º°¯§…¤¦≠¬ˆ¨‰øœšÞùúûý€')
df["words"] = df["words"].str.strip("\n")
df = df.groupby(["words"]).mean()
print(df)
Firstly, the program replaces all words in the "words" column longer than 12 characters with an empty string. Then, I was hoping it would strip all the special characters from the "words" column.
First, avoid using a loop and instead use transform() to replace words longer than 12 characters with an empty string. Second, replace() can be called on the Series directly, without the .str accessor; note that Series.replace() only matches whole cell values unless it is given a regular expression. Third, strip() only removes leading and trailing characters, so it is not what you want here. Use a regular expression with replace() instead. Finally, to remove special characters, it is cleaner to use a negated character class that matches and removes only the characters that are not letters or digits. This looks like: "[^A-Za-z0-9]".
Here is some example data and code that works:
import pandas as pd
import re
df = pd.DataFrame(
{
"words": [
123,
"abcd",
"efgh",
"abcdefghijklmn",
"lol%",
"Hornbæk",
"10:03",
"$999¼",
]
}
)
# Convert to strings so len() works on every value (123 is an int)
df["words"] = df["words"].astype(str)
# Faster and more concise than a loop
df["words"] = df["words"].transform(lambda x: "" if len(x) > 12 else x)
# Not sure why you do this but okay; regex=True makes replace() work on substrings
df["words"] = df["words"].replace(r"\$", "s", regex=True)
# Use a negated character class to keep only letters and digits
df["words"] = df["words"].replace("[^A-Za-z0-9]", "", regex=True)
print(df)
outputs:
words
0 123
1 abcd
2 efgh
3
4 lol
5 Hornbk
6 1003
7 s999
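The difference between the three methods discussed in this thread can be checked on a tiny Series. This is only a sketch with made-up sample values, not the asker's data:

```python
import pandas as pd

s = pd.Series(["$999", "$", "*lol%*", "a*b"])

# Series.replace without regex=True only swaps cells whose whole value matches
whole_value = s.replace("$", "s").tolist()

# Series.str.replace works on substrings; regex=False treats "$" literally
substring = s.str.replace("$", "s", regex=False).tolist()

# Series.str.strip only trims the listed characters from both ends
stripped = s.str.strip("*%").tolist()

print(whole_value)
print(substring)
print(stripped)
```

Only the cell that is exactly "$" changes in the first call, and the "*" inside "a*b" survives strip(), which is why a regular expression is the better tool for removing characters anywhere in the string.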

Lambda not working even when the condition is satisfied in python

I want to print 1 if the word is in the paragraph, and 0 if it is not.
The first line contains the word "bestselling", yet the lambda prints 0.
A good way to do that is to use the built-in any() function and cast the result to int.
text = "this is a text used for an example."
first_list = ["word", "second_word", "example"]
second_list = ["word", "second_word", "third_word"]
is_in = int(any(k in text for k in first_list))
print(is_in) # print 1
not_in = int(any(k in text for k in second_list))
print(not_in) # print 0
A way to search in a DataFrame is to use the contains method of the str accessor (see the pandas documentation). In your case you want to check whether any of several words is in the text, so a regular expression can be used:
df["sun"].str.contains("brilliant|bestselling|best|best-selling|loved|great|amazing", regex=True)
If you also want to match the words regardless of whether they are in lowercase or uppercase, you can add the flags argument:
import re
df["sun"].str.contains("brilliant|bestselling|best|best-selling|loved|great|amazing", flags=re.IGNORECASE, regex=True)
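Putting this together for the 1/0 output the question asks for, a small sketch could look like the following. The column name "sun" comes from the snippet above; the two sentences and the "flag" column are invented for illustration:

```python
import re
import pandas as pd

# Hypothetical data; only the column name "sun" is taken from the thread
df = pd.DataFrame(
    {"sun": ["A Bestselling novel, loved by critics.", "An ordinary weekday report."]}
)

pattern = "brilliant|bestselling|best|best-selling|loved|great|amazing"

# str.contains returns booleans; astype(int) turns them into 1 / 0
df["flag"] = (
    df["sun"].str.contains(pattern, flags=re.IGNORECASE, regex=True).astype(int)
)
print(df["flag"].tolist())
```

The first row is flagged even though "Bestselling" is capitalised, because of re.IGNORECASE.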

How to get only the word from a string in python?

I am new to pandas, I have an issue with strings. So I have a string s = "'hi'+'bikes'-'cars'>=20+'rangers'" I want only the words from the string, not the symbols or the integers. How can I do it?
My input:
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
Expected Output:
s = "'hi','bikes','cars','rangers'"
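Try this using regex:
import re
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
# [a-zA-Z] matches letters only; the range A-z would also match [ \ ] ^ _ and `
samp = re.compile('[a-zA-Z]+')
word = samp.findall(s)
I am not sure about pandas, but you can also do it with a regex that keeps the quotes. Here is the solution: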
import re
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
words = re.findall("(\'.+?\')", s)
output = ','.join(words)
print(output)
For pandas I would convert the column in the dataframe to string first:
df
a b
0 'hi'+'bikes'-'cars'>=20+'rangers' 1
1 random_string 'with'+random,# 4
2 more,weird/stuff=wrong 6
df["a"] = df["a"].astype("string")
df["a"]
0 'hi'+'bikes'-'cars'>=20+'rangers'
1 random_string 'with'+random,#
2 more,weird/stuff=wrong
Name: a, dtype: string
Now you can see that the dtype is string, which means you can do string operations on it,
including translate and split (pandas string methods). But first you have to make a translation table from the punctuation and digits constants, imported from the string module:
from string import digits, punctuation
Then make a dictionary mapping each of the digits and punctuation to whitespace
from itertools import chain
t = {k: " " for k in chain(punctuation, digits)}
Create the translation table using str.maketrans (a built-in method, so no import is necessary) and apply translate and then split to the column, each through the str accessor:
t = str.maketrans(t)
df["a"] = df["a"].str.translate(t).str.split()
df
a b
0 [hi, bikes, cars, rangers] 1
1 [random, string, with, random] 4
2 [more, weird, stuff, wrong] 6
As you can see you only have the words now.
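For convenience, here are the steps above as one self-contained sketch, using the same three example rows. The "words" and "alt" column names are introduced here for illustration, and the str.findall line is an alternative I am adding, not part of the answer above:

```python
from string import digits, punctuation

import pandas as pd

df = pd.DataFrame(
    {
        "a": [
            "'hi'+'bikes'-'cars'>=20+'rangers'",
            "random_string 'with'+random,#",
            "more,weird/stuff=wrong",
        ],
        "b": [1, 4, 6],
    }
)

# Map every punctuation character and digit to a space
table = str.maketrans({ch: " " for ch in punctuation + digits})

# translate() blanks out the unwanted characters, split() keeps the words
df["words"] = df["a"].str.translate(table).str.split()

# Alternative: collect runs of ASCII letters directly with a regex
df["alt"] = df["a"].str.findall(r"[A-Za-z]+")

print(df["words"].tolist())
```

For this data both approaches give the same lists; they differ for non-ASCII letters, which translate() keeps and the [A-Za-z]+ pattern splits on.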

How to replace string in python string with specific character?

For example, I have a column named Children in a pandas data frame.
A few of the names are [ tom (peter) , lily, fread, gregson (jaeson 123)] etc.
What code should I write to remove the part of each name that starts with a bracket '('? For example, tom (peter) should become tom in my column and gregson (jaeson 123) should become gregson. There are thousands of names with a bracketed part, and I want to remove the part of the string starting at the bracket '(' and ending at the bracket ')'. The data frame has many columns, but I want to do this editing in one specific column named CHILDREN in my dataframe named DF.
As suggested by @Ruslan S., you can use pandas.Series.str.replace, or you could also use re.sub (and there are other methods as well):
import pandas as pd
df = pd.DataFrame({"name":["tom (peter)" , "lily", "fread", "gregson (jaeson 123)"]})
# OPTION 1 with str.replace :
df["name"] = df["name"].str.replace(r"\([a-zA-Z0-9\s]+\)", "", regex=True).str.strip()
# OPTION 2 :with re sub
import re
r = re.compile(r"\([a-zA-Z0-9\s]+\)")
df["name"] = df["name"].apply(lambda x: r.sub("", x).strip())
And the result in both cases:
name
0 tom
1 lily
2 fread
3 gregson
Note that I also use strip to remove leading and trailing whitespaces here. For more info on the regular expression to use, see re doc for instance.
You can try:
# to remove text between () (regex=True is required in pandas 2.0 and later)
df['columnname'] = df['columnname'].str.replace(r'\((.*)\)', '', regex=True)
# to remove text between %%
df['columnname'] = df['columnname'].str.replace(r'%(.*)%', '', regex=True)
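One thing to watch with the .* patterns is that they are greedy. The sketch below compares a greedy and a non-greedy variant; the last sample name, with two bracketed parts, is made up to show the difference, and the leading \s* is my addition to also swallow the space before the bracket:

```python
import pandas as pd

names = pd.Series(
    ["tom (peter)", "lily", "gregson (jaeson 123)", "ann (a) lee (b)"]
)

# Non-greedy: each (...) group is removed on its own
lazy = names.str.replace(r"\s*\(.*?\)", "", regex=True).str.strip()

# Greedy: everything from the first "(" to the last ")" is removed
greedy = names.str.replace(r"\s*\(.*\)", "", regex=True).str.strip()

print(lazy.tolist())
print(greedy.tolist())
```

If a name can contain more than one bracketed part, the non-greedy form is probably what you want; for names with a single bracketed part both behave the same.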

Pandas to match column contents to keywords (with spaces and brackets )

A column in the data frame contains text that I want to match against a list of keywords.
I want to check whether each row contains any of the keywords. If yes, print them.
Tried below:
import pandas as pd
import re
Keywords = [
"Caden(S, A)",
"Caden(a",
"Caden(.A))",
"Caden.Q",
"Caden.K",
"Caden"
]
data = {'People' : ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]}
df = pd.DataFrame(data)
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
print(df["found"])
It returns NaN. I guess the challenge lies in the spaces and brackets in the keywords.
What's the right way to get the ideal output below? Thank you.
Caden(S, A); Caden.K
Caden(a
Caden(.A))
Caden.Q
Since you do not need to find every keyword, but only the longest ones when they overlap, you can use a regex with the findall approach.
The point here is that you need to sort the keywords by length in descending order first (because there are whitespaces in them), then you need to escape these values as they contain special characters, and then you must replace the word boundaries with unambiguous word boundaries, (?<!\w) and (?!\w) (note that \b is context dependent).
Use
pat = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
See an online Python test:
import re
Keywords = ["Caden(S, A)", "Caden(a","Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
rx = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(map(re.escape, sorted(Keywords,key=len,reverse=True))))
# => (?<!\w)(?:Caden\(S,\ A\)|Caden\(\.A\)\)|Caden\(a|Caden\.Q|Caden\.K|Caden)(?!\w)
strs = ["Caden(S, A) Charlotte.A, Caden.K;", "Emily.P Ethan.B; Caden(a", "Grayson.Q, Lily; Caden(.A))", "Mason, Emily.Q Noah.B; Caden.Q - Riley.P"]
for s in strs:
    print(re.findall(rx, s))
Output
['Caden(S, A)', 'Caden.K']
['Caden(a']
['Caden(.A))']
['Caden.Q']
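Applied back to the DataFrame from the question, the same pattern can be passed to str.findall. This is a sketch that reuses the question's data; the variable name alternation is introduced here only for readability:

```python
import re

import pandas as pd

keywords = ["Caden(S, A)", "Caden(a", "Caden(.A))", "Caden.Q", "Caden.K", "Caden"]
df = pd.DataFrame(
    {
        "People": [
            "Caden(S, A) Charlotte.A, Caden.K;",
            "Emily.P Ethan.B; Caden(a",
            "Grayson.Q, Lily; Caden(.A))",
            "Mason, Emily.Q Noah.B; Caden.Q - Riley.P",
        ]
    }
)

# Longest keywords first, each one escaped so ( ) . are matched literally
alternation = "|".join(map(re.escape, sorted(keywords, key=len, reverse=True)))
pat = r"(?<!\w)(?:{})(?!\w)".format(alternation)

# findall returns a list per row; join turns it into one string
df["found"] = df["People"].str.findall(pat).str.join("; ")
print(df["found"].tolist())
```

The group is non-capturing, so str.findall returns the whole match rather than a captured fragment.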
Hey, I don't know if this solution is optimal, but it works. I just replaced '.' with 8, '(' with 6 and ')' with 9. These characters are not ignored by str.findall; they are regex metacharacters, so an unescaped pattern does not match them literally.
It is a kind of bijection between {8,6,9} and {'.','(',')'}, so it only works if the data does not already contain those digits.
for i in range(len(Keywords)):
    Keywords[i] = Keywords[i].replace('(','6').replace(')','9').replace('.','8')
for i in range(len(df['People'])):
    df['People'][i] = df['People'][i].replace('(','6').replace(')','9').replace('.','8')
And then you apply your function
pat = '|'.join(r"\b{}\b".format(x) for x in Keywords)
df["found"] = df['People'].str.findall(pat).str.join('; ')
Final step: get back the {'.','(',')'}
for i in range(len(df['found'])):
    df['found'][i] = df['found'][i].replace('6','(').replace('9',')').replace('8','.')
    df['People'][i] = df['People'][i].replace('6','(').replace('9',')').replace('8','.')
Voilà
