How to get only the words from a string in Python? - python

I am new to pandas and I have an issue with strings. I have the string s = "'hi'+'bikes'-'cars'>=20+'rangers'" and I want only the words from it, not the symbols or the integers. How can I do that?
My input:
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
Expected Output:
s = "'hi','bikes','cars','rangers'"

Try this using regex:
import re

s = "'hi'+'bikes'-'cars'>=20+'rangers'"
samp = re.compile('[a-zA-Z]+')  # note the capital Z; [a-zA-z] would also match a few punctuation characters
word = samp.findall(s)
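This yields ['hi', 'bikes', 'cars', 'rangers']; join them with ','.join(...) if you need a single string.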

Not sure about pandas, but you can also do it with a regex; here is a solution:
import re

s = "'hi'+'bikes'-'cars'>=20+'rangers'"
words = re.findall(r"('.+?')", s)
output = ','.join(words)
print(output)
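This prints 'hi','bikes','cars','rangers', matching the expected output (the quotes are kept because the capture group includes them).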

For pandas I would convert the column in the dataframe to string first:
df
                                   a  b
0  'hi'+'bikes'-'cars'>=20+'rangers'  1
1      random_string 'with'+random,#  4
2             more,weird/stuff=wrong  6
df["a"] = df["a"].astype("string")
df["a"]
0    'hi'+'bikes'-'cars'>=20+'rangers'
1        random_string 'with'+random,#
2               more,weird/stuff=wrong
Name: a, dtype: string
Now you can see that the dtype is string, which means you can do string operations on it, including translate and split (see the pandas string methods). But first you have to build a translation table from the punctuation and digits constants in the string module:
from string import digits, punctuation
Then make a dictionary mapping each digit and punctuation character to whitespace:
from itertools import chain
t = {k: " " for k in chain(punctuation, digits)}
Create the translation table with str.maketrans (no extra import needed on Python 3.8, though this may differ in other versions) and apply translate and split (through the .str accessor) to the column:
t = str.maketrans(t)
df["a"] = df["a"].str.translate(t).str.split()
df
                                a  b
0      [hi, bikes, cars, rangers]  1
1  [random, string, with, random]  4
2     [more, weird, stuff, wrong]  6
As you can see, only the words are left now.
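If you need the single comma-separated, quoted string from the question's expected output rather than per-row lists, one way (a sketch; re-quoting each word is my own addition) is to join the lists back up:
# Rebuild one comma-separated string per row, re-wrapping each word in quotes
df["a"] = df["a"].apply(lambda words: ",".join(f"'{w}'" for w in words))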

Related

Problem with strip, replace functions in pandas dataframe

I am trying to strip all the special characters from a pandas dataframe column of words with the strip() and replace() functions.
However, it does not work: the special characters are not stripped from the words.
Can somebody enlighten me, please?
import pandas as pd
import datetime
df = pd.read_csv("2022-12-08_word_selection.csv")
for n in df.index:
    i = str(df.loc[n, "words"])
    if len(i) > 12:
        df.loc[n, "words"] = ""
df["words"] = df["words"].str.replace("$", "s")
df["words"] = df["words"].str.strip('[,:."*+-#/\^`#}{~&%’àáâæ¢ß¥£™©®ª×÷±²³¼½¾µ¿¶·¸º°¯§…¤¦≠¬ˆ¨‰øœšÞùúûý€')
df["words"] = df["words"].str.strip("\n")
df = df.groupby(["words"]).mean()
print(df)
Firstly, the program replaces all words in the "words" column longer than 12 characters with an empty string. Then, I was hoping it would strip all the special characters from the "words" column.
First, avoid using a loop and instead use transform() to replace words longer than 12 characters with an empty string. Second, the Series.str conversion is not necessary prior to calling replace(). Third, strip() only removes leading and trailing characters, so it is not what you want. Use a regular expression with replace() instead. Finally, to remove special characters, it is cleaner to use a regex negative set to match and remove only the characters that are not letters or numbers. This looks like: "[^A-Za-z0-9]".
Here is some example data and code that works:
import pandas as pd
import re
df = pd.DataFrame(
    {
        "words": [
            123,
            "abcd",
            "efgh",
            "abcdefghijklmn",
            "lol%",
            "Hornbæk",
            "10:03",
            "$999¼",
        ]
    }
)
# Faster and more concise than a loop; str() guards against non-string
# entries such as the integer 123
df["words"] = df["words"].transform(lambda x: "" if len(str(x)) > 12 else x)
# Not sure why you do this but okay
df["words"] = df["words"].replace("$", "s")
# Use a regex negative set to keep only letters and numbers
df["words"] = df["words"].replace(re.compile("[^A-Za-z0-9]"), "")
display(df)
outputs:
    words
0     123
1    abcd
2    efgh
3
4     lol
5  Hornbk
6    1003
7     999
(row 3 is blank because "abcdefghijklmn" is longer than 12 characters)

Remove specific words from column using Python

The data originally comes from a PDF for further analysis. There is an [identity] column where some of the values are wrong, i.e. they contain misspellings or special characters.
I am looking to remove the unwanted characters from the column.
Input Data:
identity
UK25463AC
ID:- UN67342OM
#ID!?
USA5673OP
Expected Output:
identity
UK25463AC
UN67342OM
NAN
USA5673OP
Script I have tried so far:
stop_words = ['#ID!?','ID:-']
pat = '|'.join(r"\b{}\b".format(x) for x in stop_words)
df['identity'] = df['identity'].str.replace(pat, '')
So I have no clue how to handle this problem.
Based on the expected output, the word boundaries \b...\b have to be removed; and because the stop words contain special regex characters, re.escape is applied. Then Series.replace replaces the matches with an empty string, and any value that ends up as an empty string is converted to a missing value:
import re
import numpy as np

stop_words = ['#ID!?','ID:-']
pat = '|'.join(re.escape(x) for x in stop_words)
# strip() removes the whitespace left behind after "ID:- " is deleted
df['identity'] = df['identity'].replace(pat, '', regex=True).str.strip().replace('', np.nan)
print(df)
    identity
0  UK25463AC
1  UN67342OM
2        NaN
3  USA5673OP

How to remove a substring from a string in a pandas column that contains both numbers and chars

I know that if we have strings like this
May21
James
Adi22
Hello
Girl90
zt411
We can use a regex with \d+ to remove all the numbers. But how would I also remove the entire string when it mixes digits with letters, so that the only things returned from the list above would be James and Hello?
I can do this for just one string:
c = 'xterm has been replaced new mac 008064c79202'
c = ' '.join(w for w in c.split() if not any(x.isdigit() for x in w))
c
How would I apply this across an entire dataframe?
You can apply your function to a column as follows:
df = pd.DataFrame(['May21', 'James', 'Adi22', 'Hello', 'Girl90', 'zt411'], columns=['word'])

def remove_semi_nums(c):
    return ' '.join(w for w in c.split() if not any(x.isdigit() for x in w))

# option A: list comprehension (I like this better)
df['word'] = [remove_semi_nums(x) for x in df.word]
# option B: use `apply`, which I don't recommend for big data sets because it's
# slow (also cumbersome for functions that use multiple columns as args)
df['word'] = df['word'].apply(remove_semi_nums)
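As a side note (my own variation, not part of the answer above): since every cell in this example holds a single token, a vectorized alternative is to blank out any string containing a digit with Series.str.contains:
# Vectorized sketch: blank out any string in the original column that
# contains at least one digit
df.loc[df['word'].str.contains(r'\d'), 'word'] = ""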
Use a regular expression like
(?:[A-Za-z]+\d|\d+[A-Za-z])[A-Za-z\d]+$
with Series.str.match. Details:
- ^ (implicit in .match) - start of string
- (?:[A-Za-z]+\d|\d+[A-Za-z]) - either one or more letters and then a digit, or one or more digits and then a letter
- [A-Za-z\d]+ - one or more letters or digits
- $ - end of string.
See the Pandas test:
df = pd.DataFrame(['May21', 'James', 'Adi22', 'Hello', 'Girl90', 'zt411'], columns=['word'])
df[df['word'].str.match(r'(?:[A-Za-z]+\d|\d+[A-Za-z])[A-Za-z\d]+$')] = ""
>>> df
    word
0
1  James
2
3  Hello
4
5
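If you would rather drop those rows entirely instead of blanking them (my variation, not the answerer's code), invert the match as a boolean mask:
# Keep only the rows whose word does not match the mixed letters+digits pattern
df = df[~df['word'].str.match(r'(?:[A-Za-z]+\d|\d+[A-Za-z])[A-Za-z\d]+$')]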

separate upper case chars with digits from lower case chars with digits

I have a column Name with data in the format below:
                Name     Name2
0   MORR1223ldkeha12  ldkeha12
1   FRAN2771yetg4fq1  yetg4fq1
2  MORR56333gft4tsd1  gft4tsd1
I want to separate Name as shown in column Name2. There is a pattern of 4 upper case chars followed by 4-5 digits, and I'm interested in what follows those digits.
Is there any way to achieve this?
You can try the below logic:
import re

_names = ['MORR1223ldkeha12', 'FRAN2771yetg4fq1', 'MORR56333gft4tsd1']
result = []
for _name in _names:
    m = re.search('^[A-Z]{4}[0-9]{4,5}(.+)', _name)
    result.append(m.group(1))
print(result)
Using str.extract
import pandas as pd
df = pd.DataFrame({"Name": ['MORR1223ldkeha12', 'FRAN2771yetg4fq1', 'MORR56333gft4tsd1']})
df["Name2"] = df["Name"].str.extract(r"\d{4,5}(.*)")
print(df)
Output:
                Name     Name2
0   MORR1223ldkeha12  ldkeha12
1   FRAN2771yetg4fq1  yetg4fq1
2  MORR56333gft4tsd1  gft4tsd1
You could use a regex to find out whether there are 4 or 5 digits and then remove either the first 8 or 9 characters. If the pattern ^[A-Z]{4}[0-9]{5}.* matches, there are 5 digits; otherwise there are 4. A sketch of this idea follows.
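Here is a minimal sketch of that approach (my own illustration; the helper name strip_prefix is hypothetical):
import re

def strip_prefix(name):
    # 4 upper-case letters followed by 5 digits -> drop 9 leading characters;
    # otherwise assume 4 digits and drop 8
    if re.match(r'^[A-Z]{4}[0-9]{5}', name):
        return name[9:]
    return name[8:]

print(strip_prefix('MORR56333gft4tsd1'))  # gft4tsd1
print(strip_prefix('FRAN2771yetg4fq1'))   # yetg4fq1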
If you change your regex like this: '(^[A-Z]{4})([0-9]{4,5})(.+)', you can access the different parts through the submatches of the match result.
So in Anil's code above, group(0) returns the whole match, group(1) the first group, group(2) the second, and group(3) the rest.
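For example (a small illustration using the grouped pattern):
import re

m = re.search(r'(^[A-Z]{4})([0-9]{4,5})(.+)', 'MORR1223ldkeha12')
print(m.group(0))  # MORR1223ldkeha12 (whole match)
print(m.group(1))  # MORR
print(m.group(2))  # 1223
print(m.group(3))  # ldkeha12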

conditional replacement within strings of pandas dataframe column

Say I've got a column in my Pandas Dataframe that looks like this:
s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])
I would like to use this column for fuzzy matching, and therefore I want to remove the characters ('.', '/', '-'), but only at the end of each string, so it looks like this:
s = pd.Series(["ab-cd", "abc", "abc-def", "ab.cde", "abcd"])
So far I started out easy: instead of generating a list of the characters I want removed, I just repeated commands for different characters, like:
if s.str[-1] == '.':
    s.str[-1].replace('.', '')
But this simply produces an error. How do I get the result I want, that is, strings without those trailing characters (characters in the rest of the string need to be preserved)?
Replace with a regex will help you get the output:
s.replace(r'[./-]$', '', regex=True)
or, with the help of apply, in case you are looking for an alternative:
s.apply(lambda x: x[:-1] if x[-1] in './-' else x)
0      ab-cd
1        abc
2    abc-def
3     ab.cde
4       abcd
dtype: object
You can use str.replace with a regex (in recent pandas versions you need regex=True, as the default changed to literal replacement):
>>> s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])
>>> s.str.replace(r"\.$|/$|-$", "", regex=True)
0      ab-cd
1        abc
2    abc-def
3     ab.cde
4       abcd
dtype: object
which can be reduced to this:
>>> s.str.replace(r"[./-]$", "", regex=True)
0      ab-cd
1        abc
2    abc-def
3     ab.cde
4       abcd
dtype: object
You can use str.replace with a regular expression:
s.str.replace(r'[./-]$', '', regex=True)
Inside [./-], substitute any characters you want to remove; $ means the match must be at the end of the string.
To replace "in-place" use Series.replace
s.replace(r'[./-]$','', inplace=True, regex=True)
I was able to remove characters from the end of strings in a column of a pandas DataFrame with the following line of code:
s.replace(r'[./-]$', '', regex=True)
where the entries between the brackets ([./-]) indicate the characters to be removed and $ indicates they should be removed only from the end.
