NLTK RegexpTokenizer: Regex to retain just characters in Random text [duplicate]

This question already has answers here:
Using explicitly numbered repetition instead of question mark, star and plus
(4 answers)
Closed 5 years ago.
I used tokenizer = RegexpTokenizer(r'\w+'), which keeps runs of alphanumeric characters.
How do I write a regular expression that discards everything else and keeps only tokens longer than 2 characters?
Below is one row of the dataframe, which contains random text:
0 [ANOTHER 2'' F/P SAMPLE 01:52 ...A13232 / AS OUTPUT MSG...

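I think you need to find words with length greater than 2:
RegexpTokenizer(r'\w{3,}')
Note that \w also matches digits and underscores, so this keeps tokens such as A13232.
Or, if you need only letters:
RegexpTokenizer(r'[a-zA-Z]{3,}')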
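
As a rough illustration of the two patterns, the sketch below applies them with re.findall from the standard library, so it runs without NLTK installed. To my understanding, RegexpTokenizer with its default gaps=False returns the same matches for the same pattern. The sample string is adapted from the row in the question with the elided parts left out, so treat it as an assumption rather than the real data.

```python
import re

# Hypothetical sample, adapted from the row shown in the question.
sample_text = "ANOTHER 2'' F/P SAMPLE 01:52 A13232 / AS OUTPUT MSG"

# \w{3,} keeps runs of 3 or more letters, digits, or underscores.
word_tokens = re.findall(r'\w{3,}', sample_text)

# [a-zA-Z]{3,} keeps runs of 3 or more ASCII letters only.
letter_tokens = re.findall(r'[a-zA-Z]{3,}', sample_text)

print(word_tokens)    # ['ANOTHER', 'SAMPLE', 'A13232', 'OUTPUT', 'MSG']
print(letter_tokens)  # ['ANOTHER', 'SAMPLE', 'OUTPUT', 'MSG']
```

The only difference on this sample is A13232: it passes the \w version, but it contains no run of three letters, so the letters-only version drops it.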

Related

How to get the numbers from a string (contains no spaces between letters and numbers)? [duplicate]

This question already has answers here:
How to extract numbers from a string in Python?
(19 answers)
Closed 3 years ago.
So, I have a string "AB256+74POL". I want to extract only the numbers into a list, say num = [256,74]. How can I do this in Python?
I have tried string.split('+'), then iterating over the two parts and collecting the characters that satisfy isdigit(). But is there an easier way to do that?

import re
a = 'AB256+74POL'
array = re.findall(r'[0-9]+', a)

"".join([c if c.isdigit() else " " for c in mystring]).split()
Explanation
Strings are iterable in Python. So we iterate over each character in the string, replace non-digits with spaces, and then split the result to get all sequences of digits in a list.
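
Both answers above return the digit runs as strings, while the question asked for integers such as [256, 74]. A minimal sketch of the extra conversion step follows. The function names are made up for illustration, and the input is assumed to contain only ASCII digits.

```python
import re

def extract_numbers(text):
    """Return every run of ASCII digits in text as an int (regex version)."""
    return [int(run) for run in re.findall(r'[0-9]+', text)]

def extract_numbers_no_regex(text):
    """Same result without re: blank out non-digits, then split on whitespace.

    Assumes ASCII input; str.isdigit() also accepts some characters
    (for example superscript digits) that int() rejects.
    """
    spaced = "".join(c if c.isdigit() else " " for c in text)
    return [int(token) for token in spaced.split()]

print(extract_numbers('AB256+74POL'))           # [256, 74]
print(extract_numbers_no_regex('AB256+74POL'))  # [256, 74]
```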

how to fix ''nothing to repeat at position 2'' [duplicate]

This question already has answers here:
Escaping regex string
(4 answers)
Closed 3 years ago.
I am trying to stem the words in a text column of a dataframe.
data is a dataframe, KARMA is the text column, and zargan is a dict that maps each word to its root.
for a in range(1,100000):
    for j in data.KARMA[a].split():
        pattern = r'\b'+j+r'\b'
        data.KARMA[a] = re.sub(pattern, str(zargan.get(j,j)),data.KARMA[a])
print(data.KARMA[1])
I want to replace each word in the texts with its root.

It looks like j contains a regular expression special character such as *. If you want it to be interpreted as literal text, you can write
pattern = r'\b'+re.escape(j)+r'\b'
The replacement string may need similar care if it can contain backslashes, since re.sub interprets escape sequences there as well.
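
To make the fix concrete, here is a minimal sketch under stated assumptions: stem_text, sample_roots, and the sample sentence are hypothetical stand-ins for the question's loop, its zargan dict, and one KARMA row. It escapes each word before building the pattern and passes the replacement as a function, which is one way to have re.sub insert it verbatim. The last part tries to reproduce the error by compiling an unescaped pattern for a word that is just *.

```python
import re

def stem_text(text, roots):
    """Replace each whitespace-separated word in text with its root from roots."""
    for word in text.split():
        root = str(roots.get(word, word))
        # re.escape makes characters such as * or + match literally.
        pattern = r'\b' + re.escape(word) + r'\b'
        # A function replacement is inserted as-is, so backslashes in root
        # are not treated as escape sequences.
        text = re.sub(pattern, lambda match, root=root: root, text)
    return text

# Hypothetical mapping and text standing in for zargan and one KARMA row.
sample_roots = {"running": "run"}
stemmed = stem_text("running * fast", sample_roots)
print(stemmed)  # run * fast

# Without re.escape, the lone * is parsed as a quantifier applied to \b,
# and the pattern fails to compile.
try:
    re.compile(r'\b' + '*' + r'\b')
    error_message = None
except re.error as exc:
    error_message = str(exc)
print(error_message)
```

Note that \b only matches next to a letter, digit, or underscore, so a token made purely of punctuation is left unchanged by this loop even after escaping.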

List element case conversion [duplicate]

This question already has answers here:
Convert a list with strings all to lowercase or uppercase
(13 answers)
Closed 4 years ago.
I have a list that has 12 elements. I am getting an input and matching that input with the value of another variable. Now that means that case-sensitivity will be a problem. I know how to go through the list with a loop but how can I convert every character in each element to a lowercase character?
for i in sa:
    # something here to convert element in sa to lowercase

A simple one-liner:
lowercase_list = [i.lower() for i in input_list]
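
A short sketch of how this might be used for the case-insensitive match described in the question. The list contents and the user_input value are made-up placeholders, and sa is the list name from the question.

```python
# Hypothetical list contents; the question's list has 12 elements.
sa = ["Apple", "BANANA", "Cherry"]

# Build a lowercase copy once, leaving the original list untouched.
lowercase_list = [item.lower() for item in sa]

# Lowercase the input as well, so both sides are compared in the same case.
user_input = "banana"
is_match = user_input.lower() in lowercase_list

print(lowercase_list)  # ['apple', 'banana', 'cherry']
print(is_match)        # True
```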

Capture repeated characters and split using Python [duplicate]

This question already has answers here:
How can I tell if a string repeats itself in Python?
(13 answers)
Closed 3 years ago.
I need to split a string on its repeated substring.
For example:
My string is "howhowhow"
I need the output 'how,how,how'.
I can't use 'how' directly in my regex because my input varies. I need to check whether the string consists of a repeating substring and, if so, split it on that substring.

import re
string = "howhowhow"
print(','.join(re.findall(re.search(r"(.+?)\1", string).group(1), string)))
OUTPUT
howhowhow -> how,how,how
howhowhowhow -> how,how,how,how
testhowhowhow -> how,how,how # not clearly defined by OP
The pattern is non-greedy so that howhowhowhow doesn't map to howhow,howhow, which is also legitimate. Remove the ? if you prefer the longest match.
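
The one-liner above raises AttributeError when nothing repeats, because re.search returns None, and it reuses the captured text as a pattern without escaping it. A slightly more defensive sketch of the same idea follows. The function name split_repeats is hypothetical, and returning the whole string when no repetition is found is an assumption about the desired behaviour.

```python
import re

def split_repeats(text):
    """Return every occurrence of the shortest immediately repeated substring."""
    match = re.search(r"(.+?)\1", text)
    if match is None:
        # No repetition found; fall back to the whole string (an assumption).
        return [text]
    unit = match.group(1)
    # Escape the unit so characters like . or + are matched literally.
    return re.findall(re.escape(unit), text)

print(','.join(split_repeats("howhowhow")))  # how,how,how
print(split_repeats("abc"))                  # ['abc']
print(split_repeats("a.a.ab"))               # ['a.', 'a.']
```

In the last call, an unescaped a. would also match ab, which is why the unit goes through re.escape first.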

unit_length = 3
str1 = 'howhowhow'
repeat_count = len(str1) // unit_length
','.join([str1[:unit_length]] * repeat_count)
'how,how,how'
This works when you know the length of the repeated substring.

Find last occurrence of character in string Python [duplicate]

This question already has answers here:
Find index of last occurrence of a substring in a string
(12 answers)
Closed 8 years ago.
How would I find the last occurrence of a character in a string?
string = "abcd}def}"
string = string.find('}',last) # Want something like this

You can use rfind:
>>> "abcd}def}".rfind('}')
8
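
For completeness, a small sketch of how rfind behaves when the character is missing, alongside the related rindex method. The variable names are illustrative only.

```python
text = "abcd}def}"

# rfind returns the highest index where the substring occurs.
last_index = text.rfind('}')

# When the substring is absent, rfind returns -1 instead of raising.
missing_index = text.rfind('#')

# rindex behaves like rfind but raises ValueError when nothing is found.
try:
    text.rindex('#')
    raised = False
except ValueError:
    raised = True

print(last_index, missing_index, raised)  # 8 -1 True
```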
