I am trying to strip all the special characters from a pandas DataFrame column of words with the strip() and replace() functions.
However, it does not work: the special characters are not stripped from the words.
Can somebody enlighten me, please?
import pandas as pd
import datetime

df = pd.read_csv("2022-12-08_word_selection.csv")
for n in df.index:
    i = str(df.loc[n, "words"])
    if len(i) > 12:
        df.loc[n, "words"] = ""
df["words"] = df["words"].str.replace("$", "s")
df["words"] = df["words"].str.strip('[,:."*+-#/\^`#}{~&%’àáâæ¢ß¥£™©®ª×÷±²³¼½¾µ¿¶·¸º°¯§…¤¦≠¬ˆ¨‰øœšÞùúûý€')
df["words"] = df["words"].str.strip("\n")
df = df.groupby(["words"]).mean()
print(df)
First, the program replaces all words in the "words" column longer than 12 characters with an empty string. Then, I was hoping it would strip all the special characters from the "words" column.
First, avoid using a loop and instead use transform() to replace words longer than 12 characters with an empty string. Second, the Series.str conversion is not necessary prior to calling replace(). Third, strip() only removes leading and trailing characters, so it is not what you want; use a regular expression with replace() instead. Finally, to remove special characters, it is cleaner to use a regex negated set to match and remove only the characters that are not letters or numbers. That looks like: "[^A-Za-z0-9]".
Here is some example data and code that works:
import pandas as pd
import re

df = pd.DataFrame(
    {
        "words": [
            123,
            "abcd",
            "efgh",
            "abcdefghijklmn",
            "lol%",
            "Hornbæk",
            "10:03",
            "$999¼",
        ]
    }
)
# Faster and more concise than a loop; str() guards against non-string values like 123
df["words"] = df["words"].transform(lambda x: "" if len(str(x)) > 12 else x)
# Not sure why you do this, but okay
df["words"] = df["words"].replace("$", "s")
# Use a regex negated set to keep only letters and numbers
df["words"] = df["words"].replace(re.compile("[^A-Za-z0-9]"), "")
print(df)
outputs:
    words
0     123
1    abcd
2    efgh
3        
4     lol
5  Hornbk
6    1003
7     999
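One caveat worth noting: the negated set [^A-Za-z0-9] keeps only ASCII letters and digits, which is why Hornbæk comes out as Hornbk above. If you would rather preserve accented and other Unicode letters, a minimal sketch (assuming \w is acceptable, which also keeps digits and underscores):

import pandas as pd

s = pd.Series(["Hornbæk", "lol%"])
# \w matches Unicode letters, digits and the underscore by default in Python 3
print(s.str.replace(r"[^\w]", "", regex=True))
# 0    Hornbæk
# 1        lol
# dtype: object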
Related
The data is originally derived from a PDF for further analysis. There is an [identity] column where some of the values are wrong, i.e. they contain misspellings or special characters.
I am looking to remove the unwanted characters from the column.
Input Data:
identity
UK25463AC
ID:- UN67342OM
#ID!?
USA5673OP
Expected Output:
identity
UK25463AC
UN67342OM
NaN
USA5673OP
Script I have tried so far:
stop_words = ['#ID!?','ID:-']
pat = '|'.join(r"\b{}\b".format(x) for x in stop_words)
df['identity'] = df['identity'].str.replace(pat, '')
So I have no clue how to handle this problem.
Based on the expected output, you need to remove the word boundaries \b...\b, and because the stop words contain special regex characters, apply re.escape. Then use Series.replace to replace matches with an empty string, and convert values that end up as only an empty string to missing values:
import re
import numpy as np

stop_words = ['#ID!?','ID:-']
pat = '|'.join(re.escape(x) for x in stop_words)
df['identity'] = df['identity'].replace(pat, '', regex=True).replace('', np.nan)
print(df)
identity
0 UK25463AC
1 UN67342OM
2 NaN
3 USA5673OP
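As a side note, re.escape is what keeps characters like #, ? and - from being read as regex syntax; a quick standalone sanity check (hypothetical snippet, same stop_words as above):

import re

stop_words = ['#ID!?', 'ID:-']
pat = '|'.join(re.escape(x) for x in stop_words)
# the special characters come out backslash-escaped, e.g. \#ID!\?|ID:\-
# (exactly which characters get escaped varies by Python version)
print(pat)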
I am new to pandas and I have an issue with strings. I have a string s = "'hi'+'bikes'-'cars'>=20+'rangers'" and I want only the words from the string, not the symbols or the integers. How can I do it?
My input:
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
Expected Output:
s = "'hi','bikes','cars','rangers'"
Try this using regex:
import re

s = "'hi'+'bikes'-'cars'>=20+'rangers'"
samp = re.compile('[a-zA-Z]+')  # note the capital Z; [a-zA-z] would also match some punctuation between Z and a
words = samp.findall(s)
print(words)  # ['hi', 'bikes', 'cars', 'rangers']
I am not sure about pandas, but you can also do it with regex; here is the solution:
import re
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
words = re.findall("(\'.+?\')", s)
output = ','.join(words)
print(output)
For pandas I would convert the column in the dataframe to string first:
df
a b
0 'hi'+'bikes'-'cars'>=20+'rangers' 1
1 random_string 'with'+random,# 4
2 more,weird/stuff=wrong 6
df["a"] = df["a"].astype("string")
df["a"]
0 'hi'+'bikes'-'cars'>=20+'rangers'
1 random_string 'with'+random,#
2 more,weird/stuff=wrong
Name: a, dtype: string
Now you can see that the dtype is string, which means you can do string operations on it, including translate and split (the pandas string methods). But first you have to build a translation table from the punctuation and digits imported from the string module:
from string import digits, punctuation
Then make a dictionary mapping each of the digits and punctuation to whitespace
from itertools import chain
t = {k: " " for k in chain(punctuation, digits)}
Create the translation table using str.maketrans (no import necessary with Python 3.8, but it may be a bit different with other versions) and apply translate and split (with .str in between) to the column:
t = str.maketrans(t)
df["a"] = df["a"].str.translate(t).str.split()
df
a b
0 [hi, bikes, cars, rangers] 1
1 [random, string, with, random] 4
2 [more, weird, stuff, wrong] 6
As you can see you only have the words now.
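If you would rather end up with a single cleaned string per row instead of a list of words, a small variation on the last step (reusing the translation table t from above):

df["a"] = df["a"].str.translate(t).str.split().str.join(" ")
# row 0 becomes "hi bikes cars rangers", and so on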
I have a dataframe, and a lot of the values in one of the columns contain python-unfriendly characters, like &.
I wanted to make a dictionary and then loop through it with finds and replacements,
sort of like this:
replacements = {
" ": ""
,"&": "and"
,"/":""
,"+":"plus"
,"(":""
,")":""
}
df['VariableName']=df['VariableName'].replace(replacements,regex=True)
However, this brings up the following error:
error: nothing to repeat at position 0
I think you need to escape the special regex characters (the leading + has nothing to repeat, and ( opens a group, hence the error), which you can do in a dictionary comprehension:
import re
import pandas as pd

df = pd.DataFrame({'VariableName': ['ss dd +', '(aa)']})
replacements = {re.escape(k): v for k, v in replacements.items()}
df['VariableName'] = df['VariableName'].replace(replacements, regex=True)
print(df)
VariableName
0 ssddplus
1 aa
I'm using the code below to remove special characters and punctuation from a column in a pandas dataframe. But this method of using re.sub is not time efficient. Are there other options I could try for better time efficiency when removing punctuation and special characters? Or is the way I'm removing special characters and writing them back to the column of the pandas dataframe causing me a major computation burn?
for n, string in data['text'].iteritems():
    data['text'] = re.sub('([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])', '', string)
One way would be to keep only alphanumeric characters. Consider this dataframe:
df=pd.DataFrame({'Text':['#^#346fetvx#!.,;:', 'fhfgd54#!#><?']})
Text
0 #^#346fetvx#!.,;:
1 fhfgd54#!#><?
You can use
df['Text'] = df['Text'].str.extract(r'(\w+)', expand=False)
Text
0 346fetvx
1 fhfgd54
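Note that extract only returns the first alphanumeric run, so anything after the next special character is dropped. If your strings can contain several runs and you want them all, a sketch with findall and join (hypothetical example data):

import pandas as pd

df = pd.DataFrame({'Text': ['ab#cd12']})
print(df['Text'].str.extract(r'(\w+)', expand=False))  # 0    ab
print(df['Text'].str.findall(r'\w+').str.join(''))     # 0    abcd12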
Use a regex and a lambda function:
import re
data['PROD_NAME'] = data['PROD_NAME'].apply(lambda x: re.sub('[^A-Za-z0-9]', ' ', x))
This removes all characters except letters and digits, replacing each of them with a space.
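Since the question is about time efficiency, note that the vectorized string method is usually faster than apply with re.sub on large frames; a minimal sketch (assuming the same data['PROD_NAME'] column):

# vectorized: avoids calling a Python lambda per row
data['PROD_NAME'] = data['PROD_NAME'].str.replace('[^A-Za-z0-9]', ' ', regex=True)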
Say I've got a column in my Pandas Dataframe that looks like this:
s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])
I would like to use this column for fuzzy matching, and therefore I want to remove the characters ('.', '/', '-'), but only at the end of each string, so it looks like this:
s = pd.Series(["ab-cd", "abc", "abc-def", "ab.cde", "abcd"])
So far I started out easy: instead of generating a list of characters I want removed, I just repeated commands for different characters, like:
if s.str[-1] == '.':
    s.str[-1].replace('.', '')
But this simply produces an error. How do I get the result I want, that is, strings without those characters at the end (characters in the rest of the string need to be preserved)?
Series.replace with a regex will get you the output:
s.replace(r'[./-]$', '', regex=True)
or, with the help of apply, in case you are looking for an alternative:
s.apply(lambda x: x[:-1] if x[-1] in './-' else x)
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
You can use str.replace with a regex:
>>> s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])
>>> s.str.replace(r"\.$|/$|\-$", "", regex=True)
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
which can be reduced to this:
>>> s.str.replace(r"[./-]$", "", regex=True)
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
You can use str.replace with a regular expression:
s.str.replace(r'[./-]$', '', regex=True)
Substitute inside [./-] any characters you want to replace; $ means the match must be at the end of the string.
To replace "in-place" use Series.replace
s.replace(r'[./-]$','', inplace=True, regex=True)
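If a string can end in more than one of these characters (say "abcd-."), adding + to the pattern strips the whole trailing run; a small sketch with made-up data:

import pandas as pd

s = pd.Series(["ab-cd.", "abcd-."])
print(s.replace(r"[./-]+$", "", regex=True))
# 0    ab-cd
# 1     abcd
# dtype: object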
I was able to remove characters from the end of strings in a column of a pandas DataFrame with the following line of code:
s.replace(r'[./-]$','',regex=True)
where the entries between the brackets ([./-]) indicate the characters to be removed, and $ indicates they should be removed only from the end of the string.