Regex expression for words with length of even number - python

I want to write a regex expression for words with even-numbered length.
For example, the output I want from the list containing the words:
{"blue", "ah", "sky", "wow", "neat"} is {"blue", "ah", "neat"}.
I know that the expression \w{2} or \w{4} would match 2-letter or 4-letter words, but what I want is something that works for all even lengths. I tried using \w{%2==0} but it doesn't work.

You can repeat a group of 2 word characters between the anchors ^ (start of the string) and $ (end of the string), or between word boundaries \b:
^(?:\w{2})+$
import re
strings = [
"blue",
"ah",
"sky",
"wow",
"neat"
]
for s in strings:
    m = re.match(r"(?:\w{2})+$", s)
    if m:
        print(m.group())
Output
blue
ah
neat
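The word-boundary variant mentioned above is not shown in the example. A minimal sketch, assuming the words sit inside one longer string (the sample text below is made up for illustration), could use re.findall:

```python
import re

# Hypothetical input: the words from the question joined into one string.
text = "blue ah sky wow neat"

# \b asserts a word boundary, so each match must span a whole word
# made of one or more pairs of word characters.
even_words = re.findall(r"\b(?:\w{2})+\b", text)
print(even_words)
```

Odd-length words such as "sky" are skipped entirely, because no boundary exists after their second character.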

If you need no extra validation for the strings in your set, you can simply use
words = {"blue", "ah", "sky", "wow", "neat"}
print( list(w for w in words if len(w) % 2 == 0) )
# => ['ah', 'blue', 'neat']
If you want to make sure the words you return are made of letters, you can use
import re
words = {"blue", "ah", "sky", "wow", "neat"}
rx = re.compile(r'(?:[^\W\d_]{2})+') # For any Unicode letter words
# rx = re.compile(r'(?:[a-zA-Z]{2})+') # For ASCII only letter words
print( [w for w in words if rx.fullmatch(w)] )
# => ['blue', 'ah', 'neat']
A (?:[^\W\d_]{2})+ pattern matches one or more occurrences of any two Unicode letters. Together with fullmatch, it requires a string to consist of an even number of letters.
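To illustrate the difference between the letters-only class and plain \w, here is a small sketch. The sample words are assumptions chosen to include digits, an underscore and a non-ASCII letter; they are not from the question.

```python
import re

# Hypothetical sample words (not from the question).
words = ["blue", "ab12", "no_w", "na\u00eff", "sky"]

letters_only = re.compile(r"(?:[^\W\d_]{2})+")  # pairs of letters only
word_chars = re.compile(r"(?:\w{2})+")          # pairs of letters, digits or underscores

only_letters = [w for w in words if letters_only.fullmatch(w)]
any_word_chars = [w for w in words if word_chars.fullmatch(w)]

print(only_letters)
print(any_word_chars)
```

The letters-only pattern rejects "ab12" and "no_w", while the \w version accepts them because digits and underscores count as word characters.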

Related

Check for exact number of consecutive repetitions with a regex

With regex only, how to match an exact number of consecutive repetitions of an arbitrary single token? For example, matching the "aaa" in "ttaaabbb" but not the "aaaa" in "ttaaaabbb", given the desired number of repetitions is 3.
Clarification: Note that I was using "a" as an example; the token can be an arbitrary character, number, or symbol. That is, given the desired number of repetitions is 3, the desired match of "aaaa**!!!cccc333**" only gives "!!!" and "333".
In short, I want to find a list of tokens "X" where YXXXY appeared in the given string (Y is some other tokens that are different from X, Y can also be the start of the string or the end of the string). Note there can be repeated tokens in the list, e.g., "aaabbbbaaa" should give ["a", "a"].
Some other examples:
Input: "aaabbbbbbaaa****ccc", output: ["a", "a", "c"]
Input: "!!! aaaabbbaaa ccc!!!", output: ["!", "b", "a", "c", "!"].
What I have tried: I tried (.)\1{2} but unfortunately, it matches "aaaa" and "ccccc" as well in the example above. I further changed it to (?!\1)(.)\1{2}(?!\1) such that the prefix and postfix of the repeating pattern differ from it. However, I got an error in this case since the first \1 is undefined when being referred to.
You might use a pattern with 2 capture groups and a repeated backreference.
First match 4 or more times the same repeated character that you want to avoid, then match 3 times the same character.
The single characters that you want are in capture group 2, which you can get using re.finditer for example.
(\S)\1{3,}|(\S)\2{2}
The pattern matches:
(\S)\1{3,} Capture group 1, match a non whitespace char and repeat the backreference 3 or more times
| Or
(\S)\2{2} Capture group 2, match a non whitespace char and repeat the backreference 2 times
For example:
import re
strings = [
"aaaa**!!!cccc333**",
"aaabbbbaaa",
"aaabbbbbbaaa****ccc",
"!!! aaaabbbaaa ccc!!!"
]
pattern = r"(\S)\1{3,}|(\S)\2{2}"
for s in strings:
    result = []
    for match in re.finditer(pattern, s):
        # Group 2 is only set for runs of exactly 3 characters
        if match.group(2):
            result.append(match.group(2))
    print(result)
Output
['!', '3']
['a', 'a']
['a', 'a', 'c']
['!', 'b', 'a', 'c', '!']
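The same two-group idea can be generalized to other repetition counts. The following is only a sketch: the helper name exact_runs and its parameter n are hypothetical, and it builds the pattern with an f-string under the assumption that n is at least 1.

```python
import re

def exact_runs(text, n):
    """Return the characters that occur in runs of exactly n (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # First alternative swallows runs of n + 1 or more characters (group 1);
    # second alternative captures runs of exactly n characters (group 2).
    pattern = rf"(\S)\1{{{n},}}|(\S)\2{{{n - 1}}}"
    return [m.group(2) for m in re.finditer(pattern, text) if m.group(2) is not None]

print(exact_runs("aaaa**!!!cccc333**", 3))
print(exact_runs("aabbbcc", 2))
```

For n = 3 the generated pattern is identical to (\S)\1{3,}|(\S)\2{2}. Testing group 2 against None rather than for truthiness is a small design choice that keeps the check explicit.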
You can do something like this using a regex and a loop:
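import re
def exact_re_match(string, length):
    # (.)\1* matches a maximal run of one repeated character
    regex = re.compile(r'(.)\1*')
    for match in regex.finditer(string):
        elm = match.group()
        # Keep only runs whose length is exactly the requested one
        if len(elm) == length:
            yield elm
string = "aaaa!!!cccc333"
out = list(exact_re_match(string, 3))
print(out)
# ['!!!', '333']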
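The run-length idea behind this loop does not strictly need a regex. As an alternative sketch (a different technique from the answer above, with a hypothetical function name), itertools.groupby can group consecutive equal characters:

```python
from itertools import groupby

def exact_runs_groupby(text, length):
    """Yield the character of every run whose length is exactly `length`."""
    for char, run in groupby(text):
        # `run` is an iterator over the consecutive equal characters.
        if sum(1 for _ in run) == length:
            yield char

print(list(exact_runs_groupby("!!! aaaabbbaaa ccc!!!", 3)))
```

Unlike the \S-based pattern, this version would also report a run of exactly three spaces, so whitespace may need filtering depending on the input.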

Convert every line in the string into dictionary key

Hi, I am new to Python and don't know whether I can ask this basic question on this site or not.
I want to convert every line in the string into a key and assign 0 as its value.
My string is:
s = '''
sarika
santha
#
akash
nice
'''
I tried the approaches from https://www.geeksforgeeks.org/ways-to-convert-string-to-dictionary/ but they did not fit my requirement.
Please help, anyone. Thanks in advance.
Edit:
Actually, I asked about a basic string, but the string I really have is the following:
s="""
san
francisco
Santha
Kumari
this one
"""
Here it should give {sanfrancisco: 0, santha kumari: 0, this one: 0}.
This is the challenge I am facing.
In my string, if there is a gap of more than one newline, the next line should be treated as one word and converted into a key.
You can do it in the following way:
>>> s="""
... hello
... #
... world
...
... vk
... """
>>> words = s.split("\n")
>>> words
['', 'hello', '#', 'world', '', 'vk', '']
>>> words = words[1:len(words)-1]
>>> words
['hello', '#', 'world', '', 'vk']
>>> word_dic = {}
>>> for word in words:
...     if word not in word_dic:
...         word_dic[word] = 0
...
>>> word_dic
{'': 0, 'world': 0, '#': 0, 'vk': 0, 'hello': 0}
>>>
Please let me know if you have any question.
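Note that the loop above also turns the empty line into a key. A compact sketch of the same idea that skips empty lines, using dict.fromkeys (the filtering step is an addition, not part of the answer above), might look like this:

```python
s = "\nhello\n#\nworld\n\nvk\n"

# splitlines() breaks the text at newlines; the filter drops empty lines
# so that '' does not become a key.
word_dic = dict.fromkeys((line for line in s.splitlines() if line.strip()), 0)
print(word_dic)
```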
You could continuously match either all lines followed by 2 newlines, or match all lines followed by a single newline.
^(?:\S.*(?:\n\n\S.*)+|\S.*(?:\n\S.*)*)
The pattern matches
^ Start of string
(?: Non capture group
\S.* Match a non whitespace char and the rest of the line
(?:\n\n\S.*)+ Repeat matching 1+ times 2 newlines, a non whitespace char and the rest of the line
| Or
\S.* Match a single non whitespace char and the rest of the line
(?:\n\S.*)* Optionally match a newline, a non whitespace char and the rest of the line
) Close non capture group
For those matches, replace 2 newlines with a space and replace a single newline with an empty string.
Then from the values, create a dictionary and initialize all values with 0.
Example
import re
s="""
san
francisco
Santha
Kumari
this one
"""
pattern = r"^(?:\S.*(?:\n\n\S.*)+|\S.*(?:\n\S.*)*)"
my_dict = dict.fromkeys(
    [
        re.sub(
            r"(\n\n)|\n",
            lambda n: " " if n.group(1) else "",
            s.lower()
        ) for s in re.findall(pattern, s, re.MULTILINE)
    ],
    0
)
print(my_dict)
Output
{'sanfrancisco': 0, 'santha kumari': 0, 'this one': 0}
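Since the result depends on exactly how many newlines separate the lines, it can help to spell them out. The sketch below assumes one particular layout (a single newline inside "san/francisco", one blank line between "Santha" and "Kumari", and two blank lines between separate entries); the real input may differ.

```python
import re

# Assumed layout, written with explicit \n escapes so the blank lines are visible.
s = "\nsan\nfrancisco\n\n\nSantha\n\nKumari\n\n\nthis one\n"

pattern = r"^(?:\S.*(?:\n\n\S.*)+|\S.*(?:\n\S.*)*)"

keys = []
for block in re.findall(pattern, s, re.MULTILINE):
    # A double newline becomes a space, a single newline is dropped.
    key = re.sub(r"(\n\n)|\n", lambda m: " " if m.group(1) else "", block.lower())
    keys.append(key)

my_dict = dict.fromkeys(keys, 0)
print(my_dict)
```

With this layout, three consecutive newlines end a block, because neither alternative of the pattern can continue across them.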
You could do it like this:
# Split the string into a list of words, dropping the "#" separator
l = [word for word in s.split() if word != "#"]
dictionary = {}
# Join each pair of consecutive words and assign a value of 0 to the resulting key
for n in range(0, len(l) - 1, 2):
    dictionary[l[n] + l[n + 1]] = 0
print(dictionary)
Steps:
Remove punctuation from the string via translate.
Split the text wherever two \n characters occur in a row.
Remove the empty strings from the list.
Remove the remaining \n characters and use a dict comprehension to generate the required dict.
import string
s = '''
sarika
santha
#
akash
nice
'''
s = s.translate(str.maketrans('', '', string.punctuation))
word_list = s.split('\n\n')
while '' in word_list:
    word_list.remove('')
result = {word.replace('\n', ''): 0 for word in word_list}
print(result)

Splitting string using different scenarios using regex

I have 2 scenarios for splitting this string:
"##$hello?? getting good.<li>hii"
I want it to be split as
'hello','getting','good.<li>hii' (Scenario 1)
'hello','getting','good','li','hi' (Scenario 2)
Any ideas, please?
Something like this should work:
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
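Both results start with an empty string because the input begins with delimiter characters. One way to drop it, shown here as a small sketch with made-up variable names, is to filter out empty parts:

```python
import re

s = "##$hello?? getting good.<li>hii"

# Keep only the non-empty pieces produced by re.split.
scenario_1 = [part for part in re.split(r"[^\w<>.]+", s) if part]
scenario_2 = [part for part in re.split(r"\W+", s) if part]

print(scenario_1)
print(scenario_2)
```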
This might be what you're looking for: \w+ matches one or more word characters (letters, digits, or underscore), as many as possible. Here is a working JavaScript example:
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
new RegExp("\\w+", "gi")
);
console.log(matches)
It works by using \w+, which matches word characters as many times as possible. You could also use [A-Za-z] to match only letters and not digits, as shown here:
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
new RegExp("[A-Za-z]+", "gi")
);
console.log(matches)
It matches what is in the brackets 1 to n times, as many as possible. In this case, that is the range a-z of lowercase characters and the range A-Z of uppercase characters. Hope this is what you want.
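Since the other answers here use Python, the same two patterns can be sketched with re.findall. This is a rough Python counterpart of the JavaScript snippets, not a line-by-line translation:

```python
import re

value = "##$hello?? getting good.<li>hii777bloop"

# \w+ keeps letters, digits and underscores together in one match.
words = re.findall(r"\w+", value)
# [A-Za-z]+ matches ASCII letters only, so digits split the last word.
letters = re.findall(r"[A-Za-z]+", value)

print(words)
print(letters)
```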
For the first scenario, just use a regex to find all words that consist of word characters plus <, > and .:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For the second one, you need to replace the repeated characters only if the words are not valid English words; you can do this using the nltk corpus and re.sub:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']
In case you are looking for a solution without regex: string.punctuation gives you a string of all special characters.
Use this string with a list comprehension to achieve your desired result:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: below is a step-by-step walkthrough of how it works:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']
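The character-by-character replacement can also be expressed with a translation table. This is an alternative sketch using str.maketrans and str.translate, not the approach shown above:

```python
import string

my_string = "##$hello?? getting good.<li>hii"

# Map every punctuation character to a space, then split on whitespace.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
result = my_string.translate(table).split()
print(result)
```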

Finding groups of letters, numbers, or symbols

How can I split a string into substrings based on the characters contained in the substrings. For example, given a string "ABC12345..::", I would like to get a list like ['ABC', '12345', '..::']. I know the valid characters for each substring, but I don't know the lengths. So the string could also look like "CC123:....:", in which case I would like to have ['CC', '123', ':....:'] as the result.
Judging by your example, there is nothing to split on (e.g. nothing between C and 1), but you do have a well-formed pattern that you can match. So simply create a pattern that groups the strings you want matched:
>>> import re
>>> s = "ABC12345..::"
>>> re.match(r'([A-Z]*)([0-9]*)([.:]*)', s).groups()
('ABC', '12345', '..::')
Alternatively, compile the pattern into a reusable regex object and do this:
>>> patt = re.compile(r'([A-Z]*)([0-9]*)([.:]*)')
>>> patt.match(s).groups()
('ABC', '12345', '..::')
>>> patt.match("CC123:....:").groups()
('CC', '123', ':....:')
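Because every group uses *, the pattern above also matches strings where a part is missing and simply returns empty groups. If stricter validation is wanted, one possible sketch (the helper name split_parts is hypothetical) uses + together with fullmatch:

```python
import re

# Each part must be present at least once, and the whole string must match.
PARTS = re.compile(r"([A-Z]+)([0-9]+)([.:]+)")

def split_parts(text):
    """Return the three parts as a tuple, or None if the shape does not fit."""
    match = PARTS.fullmatch(text)
    return match.groups() if match else None

print(split_parts("ABC12345..::"))
print(split_parts("123abc"))
```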
Match each group with the following regex:
[0-9]+|[a-zA-Z]+|[.:]+
[0-9]+ one or more digits, or
[a-zA-Z]+ one or more letters, or
[.:]+ one or more dots or colons
This allows you to match groups in any order, e.g. "123...xy::ab..98765PQRS".
import re
print(re.findall( r'[0-9]+|[a-zA-Z]+|[.:]+', "ABC12345..::"))
# => ['ABC', '12345', '..::']
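As a quick check of the any-order claim, the same pattern can be run against the mixed example string quoted above:

```python
import re

# The alternation picks whichever character class matches at each position.
tokens = re.findall(r"[0-9]+|[a-zA-Z]+|[.:]+", "123...xy::ab..98765PQRS")
print(tokens)
```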
If you want a non-regex approach:
value = 'ABC12345..::'
indexes = [i for i, char in enumerate(value) if char.isdigit()] # Collect indexes of any digits
arr = [ value[:indexes[0]], value[indexes[0]:indexes[-1]+1], value[indexes[-1]+1:] ] # Use slicing to build list
Output:
['ABC', '12345', '..::']
Another string:
value = "CC123:....:"
indexes = [i for i, char in enumerate(value) if char.isdigit()] # Collect indexes of any digits
arr = [ value[:indexes[0]], value[indexes[0]:indexes[-1]+1], value[indexes[-1]+1:] ] # Use slicing to build list
Output:
['CC', '123', ':....:']
EDIT:
Just did a benchmark, metatoaster's method is slightly faster than this :)
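The index-based approach assumes exactly one run of digits in the middle of the string. A more general non-regex sketch, using itertools.groupby with a hypothetical char_class helper, splits wherever the kind of character changes:

```python
from itertools import groupby

def char_class(char):
    """Classify a character (hypothetical helper for this sketch)."""
    if char.isdigit():
        return "digit"
    if char.isalpha():
        return "letter"
    return "symbol"

def split_by_class(value):
    # groupby starts a new group whenever the class of the character changes.
    return ["".join(group) for _, group in groupby(value, key=char_class)]

print(split_by_class("ABC12345..::"))
print(split_by_class("CC123:....:"))
```

This treats everything that is neither a digit nor a letter as a symbol, which is an assumption that happens to fit the examples in the question.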

match a list of words in a line using regex in python

I am looking for an expression to match strings against a list of words like ["xxx", "yyy", "zzz"]. The strings need to contain all three words but they do not need to be in the same order.
E.g., the following strings should be matched:
'"yyy" string of words and then "zzz" string of words "xxx"'
or
'string of words "yyy""xxx""zzz" string of words'
Simple string operation:
mywords = ("xxx", "yyy", "zzz")
all(x in mystring for x in mywords)
If word boundaries are relevant (i.e., you want to match zzz but not Ozzzy):
import re
all(re.search(r"\b" + re.escape(word) + r"\b", mystring) for word in mywords)
I'd use all and re.search for finding matches.
>>> import re
>>> words = ('xxx', 'yyy', 'zzz')
>>> text = "sdfjhgdsf zzz sdfkjsldjfds yyy dfgdfgfd xxx"
>>> all([re.search(w, text) for w in words])
True
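If a single regular expression is required, as the question asks, one possible sketch chains a lookahead per word. The helper name contains_all_words is hypothetical, and the \b anchors assume that each word starts and ends with a word character.

```python
import re

def contains_all_words(text, words):
    """Check that every word occurs in text as a whole word, in any order."""
    # One lookahead per word; each one scans from the start of the text.
    pattern = "".join(rf"(?=.*\b{re.escape(word)}\b)" for word in words)
    return re.match(pattern, text, re.DOTALL) is not None

words = ("xxx", "yyy", "zzz")
print(contains_all_words("sdfjhgdsf zzz sdfkjsldjfds yyy dfgdfgfd xxx", words))
print(contains_all_words("Ozzzy xxx yyy", words))
```

re.DOTALL lets .* cross line breaks, so the words may appear on different lines of a multi-line string.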
