I want to extract all the words that are completely in uppercase (so not just the first letter, but every letter in the word) from the strings in columnY in dataset X.
I have the following script:
X['uppercase'] = X['columnY'].str.extract('([A-Z][A-Z]+)')
But that only extracts the first uppercase word in each string.
Then I tried extractall:
X['uppercase'] = X['columnY'].str.extractall('([A-Z][A-Z]+)')
But I got the following error:
TypeError: incompatible index of inserted column with frame index
What am I doing wrong?
We can use regular expressions with apply, as below:
import re
def extract_uppercase_words(text):
    return re.findall(r'\b[A-Z]+\b', text)
X['columnY'].apply(extract_uppercase_words)
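If you want the result stored back in the frame, as in the question, you can assign the result of apply directly. A minimal sketch with a made-up frame (X and columnY are the names from the question; the function is repeated so the snippet runs on its own, and it assumes columnY has no missing values):
import re
import pandas as pd

def extract_uppercase_words(text):
    return re.findall(r'\b[A-Z]+\b', text)

# Made-up sample frame; X and columnY are the names used in the question.
X = pd.DataFrame({"columnY": ["the QUICK brown FOX", "no caps here"]})
X['uppercase'] = X['columnY'].apply(extract_uppercase_words)
print(X['uppercase'].tolist())  # [['QUICK', 'FOX'], []]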
Try this,
X['uppercase'] = X['columnY'].str.findall(r'\b[A-Z]+\b')
This will give you a list of all the UPPERCASE words.
And if you want all these words concatenated into a single string, you can use the code below.
X['uppercase'] = X['columnY'].str.findall(r'\b[A-Z]+\b').str.join(' ')
Assuming you only have words in the column, you could try:
X["uppercase"] = X["columnY"].str.replace(r'\s*\b\w*[a-z]\w*\b\s*', ' ', regex=True)
.str.replace(r'\s{2,}', ' ', regex=True)
.str.strip()
The first replacement targets words that are not all-uppercase (defined as any word containing at least one lowercase letter), along with any surrounding spaces, and replaces each match with a single space. The second replacement collapses any runs of excess spaces into a single space.
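A quick sketch with a made-up frame to show the two replacements in action (X and columnY are the names from the question; the sample strings are invented):
import pandas as pd

X = pd.DataFrame({"columnY": ["the QUICK brown FOX", "no caps here", "ALL CAPS"]})
X["uppercase"] = (
    X["columnY"]
    .str.replace(r'\s*\b\w*[a-z]\w*\b\s*', ' ', regex=True)  # drop non-uppercase words
    .str.replace(r'\s{2,}', ' ', regex=True)                 # collapse leftover spaces
    .str.strip()
)
print(X["uppercase"].tolist())  # ['QUICK FOX', '', 'ALL CAPS']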
I am parsing some user input to make a basic Discord bot assigning roles and such. I am trying to generalize some code to reuse for different similar tasks (doing similar things in different categories/channels).
Generally, I am looking for a substring (the category), then taking the string after it as that category's value. I am looking line by line for my category, replacing the "category" substring and returning a stripped version. However, what I have now also replaces any space in the "value" string.
Originally the string looks like this:
Gamertag : 00test gamertag
What I want to do is preserve the spaces in the value. The regex I am trying to write should match all non-alphanumeric chars up to the first letter.
My pattern already matches non-alpha characters, but I can't figure out how to match just the first group; it looks like it should simply be a matter of adding a ? to make the quantifier lazy, but I'm not sure. Example code and string below (the regex I want to fix is used in the final print statement).
String I am working with:
- 00test Gamertag #(or any non-alpha delimiter)
Desired Results (by matching and stripping the extra characters)
00test Gamertag #(remove the leading space and any non-alpha characters before the first word)
The regex I am trying to write should match all non-alphanumeric chars up to the first letter. It should be something like the following, which is close to what I use to strip non-alphas now, but that strips all of them rather than just the first group. I want to match only the leading group of non-alphas in the string so I can strip that part using re.sub:
\W+?
https://www.online-python.com/gDVhZrnmlq
Thank you!
Your regex will substitute the non-alphanumeric characters anywhere in the input string. If you only need this to happen at the start of the string, use the start-of-string anchor (i.e. ^):
^\W+
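For example, a minimal sketch using the string shape from the question:
import re

line = " - 00test Gamertag"
print(re.sub(r"^\W+", "", line))  # 00test Gamertag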
It depends on your inputs. You can use two regexes to achieve your goal: the first removes all non-alphanumeric characters from your string (including the ones between words), and the second collapses the whitespace between words when there is more than one space between any two words:
import re
gamer_tag = "ยต& - 00test - Gamertag"
gamer_tag = re.sub(r"[^a-zA-Z0-9\s]", "", gamer_tag)
gamer_tag = re.sub(r" +", " ", gamer_tag)
print(gamer_tag.strip())
# Output: 00test Gamertag
You can remove the second re.sub() if you are sure that there will be no more than one space between words.
gamer_tag = "- 00test Gamertag "
gamer_tag = re.sub(r"[^a-zA-Z0-9\s]", "", gamer_tag)
print(gamer_tag.strip())
# Output: 00test Gamertag
I have a dataset consisting of tuples of words. I want to remove words that contain fewer than 4 characters, but I could not figure out how to make my code iterate over the words.
Here is a sample of my data:
   content                   clean4Char
0  [yes, no, never]          [never]
1  [to, every, contacts]     [every, contacts]
2  [words, tried, describe]  [words, tried, describe]
3  [word, you, go]           [word]
Here is the code that I'm working with (it keeps showing me error warning).
def remove_single_char(text):
    text = [word for word in text]
    return re.sub(r"\b\w{1,3}\b", " ", word)
df['clean4Char'] = df['content'].apply(lambda x: remove_single_char(x))
df.head(3)
The problem is with your remove_single_char function. Also, there is no need to use a lambda, since you are already passing a function to apply. This will do the job:
def remove(input):
    return list(filter(lambda x: len(x) >= 4, input))

df['clean4Char'] = df['content'].apply(remove)
df.head(3)
We can use str.replace here for a Pandas option:
df["clean4Char"] = df["content"].str.replace(r'\b\w{1,3}\b,?\s*', '', regex=True)
The regex used here says to match:
\b a word boundary (only match entire words)
\w{1,3} a word with no more than 3 characters
\b closing word boundary
,? optional comma
\s* optional whitespace
We then replace with the empty string to remove the matching words of 3 letters or fewer, along with the optional trailing comma and whitespace.
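A quick check against data shaped like the sample, assuming the column holds comma-separated strings (the optional comma in the pattern suggests that form) rather than Python lists:
import pandas as pd

df = pd.DataFrame({"content": ["yes, no, never", "to, every, contacts"]})
df["clean4Char"] = df["content"].str.replace(r'\b\w{1,3}\b,?\s*', '', regex=True)
print(df["clean4Char"].tolist())  # ['never', 'every, contacts']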
How would I be able to split a word containing punctuation into 2 separate words without punctuation? For example, if I have the string "half-attained", how would I remove the "-" and split it into "half" and "attained"?
This is what I have so far, and it only removes the punctuation and joins the words together.
for n in range(0, len(test_list)):
    no_punct = ""
    for char in test_list[n]:
        if char not in punctuations:
            no_punct = no_punct + char
    no_puclist.append(no_punct)
split() returns the list of substrings produced by splitting the string on the given separator.
In your case:
"half-attained".split("-")
# ["half", "attained"]
split() does it well
print( "half-attained".split("-") )
# output :
# ["half", "attained"]
"half-attained".split("-") # ["half", "attained"]
Works fine. Read the docs next time.
You should use split():
string = "half-attained"
array = string.split("-")
print(array)
str.split(sep=None, maxsplit=-1) : Return a list of the words in the
string, using sep as the delimiter string. If maxsplit is given, at
most maxsplit splits are done (thus, the list will have at most
maxsplit+1 elements). If maxsplit is not specified or -1, then there
is no limit on the number of splits (all possible splits are made).
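A quick illustration of the maxsplit parameter described above:
print("half-attained-word".split("-"))     # ['half', 'attained', 'word']
print("half-attained-word".split("-", 1))  # ['half', 'attained-word']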
split() splits the string on the separator characters and helps you get rid of them.
print("please-split-me".split("-"))
# ["please" , "split" , "me"]
Are you trying to create an array or just another string with both words? Can you show your expected output?
You might simply need: test_list.replace('-', ' ')
Given the index of a word in a string, starting at zero ("index" is position two in this sentence), and a word being defined as anything separated by whitespace, I need to find the index of the first char of that word.
My whitespace regex pattern is "( +|\t+)+", just to cover all my bases (except newline chars, which are excluded). I used split() to separate the string into words, and then summed the lengths of each of those words. However, I need to account for the possibility that more than one whitespace character is used between words, so I can't simply add the number of words minus one to that figure and still be accurate every time.
Example:
>>> example = "This is an example sentence"
>>> get_word_index(example, 2)
8
Change your regular expression to include the whitespace around each word to prevent it from being lost. The expression \s*\S+\s* will first consume leading whitespace, then the actual word, then trailing spaces, so only the first word in the resulting list might have leading spaces (if the string itself started with whitespace). The rest consist of the word itself potentially followed by whitespace. After you have that list, simply find the total length of all the words before the one you want, and account for any leading spaces the string may have.
import re

def get_word_index(s, idx):
    words = re.findall(r'\s*\S+\s*', s)
    return sum(map(len, words[:idx])) + len(words[idx]) - len(words[idx].lstrip())
Testing:
>>> example = "This is an example sentence"
>>> get_word_index(example, 2)
8
>>> example2 = ' ' + example
>>> get_word_index(example2, 2)
9
Maybe you could try with:
your_string.index(your_word)
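A minimal sketch combining this with split(), using the example string from the question; note that index() returns the position of the first matching substring, which can be wrong if the word (or a fragment of it) also appears earlier in the string:
example = "This is an example sentence"
word = example.split()[2]   # 'an'
print(example.index(word))  # 8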
I want to split strings only by suffixes. For example, I would like to be able to split "dord word" into ["dor", "wor"].
I thought that \wd would match words that end with d. However, this does not produce the expected results:
>>> import re
>>> re.split(r'\wd', "dord word")
['do', ' wo', '']
How can I split by suffixes?
x = 'dord word'
import re
print(re.split(r"d\b", x))
or
print([i for i in re.split(r"d\b", x) if i])  # if you don't want empty strings
Try this.
As a better way, you can use re.findall with r'\b(\w+)d\b' as your regex to find the rest of the word before the d:
>>> import re
>>> s = 'dord word'
>>> re.findall(r'\b(\w+)d\b', s)
['dor', 'wor']
Since \w also matches digits and the underscore, I would define a word consisting of just letters with a [a-zA-Z] character class:
print([x.group(1) for x in re.finditer(r"\b([a-zA-Z]+)d\b", "dord word")])
If you're wondering why your original approach didn't work,
re.split(r'\wd',"dord word")
It finds all instances of a letter/number/underscore before a "d" and splits on what it finds. So it did this:
do[rd] wo[rd]
and split on the strings in brackets, removing them.
Also note that this could split in the middle of words, so:
re.split(r'\wd', "said tendentious")
would split the second word in two.
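For reference, running that split shows the second word broken in two:
>>> import re
>>> re.split(r'\wd', "said tendentious")
['sa', ' te', 'entious']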