find all characters NOT in regex pattern - python

Let's say I have a regex of legal characters
legals = re.compile("[abc]")
I can return a list of legal characters in a string like this:
finder = re.finditer(legals, "abcdefg")
[match.group() for match in finder]
>>>['a', 'b', 'c']
How can I use regex to find a list of the characters NOT in the regex? I.e., in my case it would return
['d','e','f','g']
Edit: To clarify, I'm hoping to find a way to do this without modifying the regex itself.

Negate the character class:
>>> illegals = re.compile("[^abc]")
>>> finder = re.finditer(illegals, "abcdefg")
>>> [match.group() for match in finder]
['d', 'e', 'f', 'g']
If you can't do that (and you're only dealing with one-character length matches), you could remove the legal characters with sub() and keep what remains:
>>> legals = re.compile("[abc]")
>>> remains = legals.sub("", "abcdefg")
>>> [char for char in remains]
['d', 'e', 'f', 'g']
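If the compiled pattern has to stay untouched and only ever matches single characters, another option is to test each character against it. This is only a minimal sketch, assuming Python 3.4+ (for fullmatch); the helper name chars_not_matching is made up for illustration:

```python
import re

def chars_not_matching(pattern, text):
    # Keep every character that the (single-character) pattern rejects.
    return [ch for ch in text if pattern.fullmatch(ch) is None]

legals = re.compile("[abc]")
illegal_chars = chars_not_matching(legals, "abcdefg")
print(illegal_chars)  # ['d', 'e', 'f', 'g']
```

This keeps the original order of the characters and does not need to know what is inside the pattern.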

Related

How to split a string by spaces and remove non-ASCII characters?

When I am given a string like "Ready[[[, steady, go!", I want to turn it into a list like this: [Ready, steady, go!]. Currently, the best I could do are two list comprehensions but I couldn't figure out a way to combine them.
text_list = [i for i in text.split()]
output: ['Ready[[[,', 'steady,', 'go!']
clean_list = [x for x in list(text) if x in string.ascii_letters]
output: ['R', 'e', 'a', 'd', 'y', 's', 't', 'e', 'a', 'd', 'y', 'g', 'o']
clean_list does succeed in removing non-ASCII letters but literally turns every single character into a list element. text_list keeps the format intact but does not remove non-ASCII characters. How do I combine the two logics to give me the output that I want?

This should work:
import re, string
# filter out all unwanted characters using regex
pattern = re.compile(f"[^{string.ascii_letters} !]")
filtered = pattern.sub('', "Ready[[[, steady, go!")
# split
result = filtered.split()
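The two list comprehensions from the question can also be combined directly, without regex. A rough sketch, assuming "!" should be kept because it appears in the desired output; the names allowed and words are illustrative only:

```python
import string

# Assumption: letters plus "!" are the characters worth keeping.
allowed = set(string.ascii_letters + "!")
text = "Ready[[[, steady, go!"

# Split on whitespace first, then filter the characters inside each word.
words = ["".join(ch for ch in word if ch in allowed) for word in text.split()]
# Drop words that became empty after filtering.
words = [w for w in words if w]
print(words)  # ['Ready', 'steady', 'go!']
```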

A translator that replaces vowels with a string

For those that don't know, replacing vowels with 'ooba' has become a popular trend on https://reddit.com/r/prequelmemes . I would like to automate this process by making a program with python 2.7 that replaces vowels with 'ooba'. I have no idea where to get started

You could use a simple regular expression:
import re
my_string = 'Hello!'
my_other_string = re.sub(r'[aeiou]', 'ooba', my_string)
print(my_other_string) # Hooballooba!

The following method is fine if the line is short; otherwise I would prefer regex. It assumes that your text is in s.
s = ''.join(['ooba' if i in ['a', 'e', 'i', 'o', 'u'] else i for i in s])
Regex approach:
import re
s = re.sub(r'a|e|i|o|u', "ooba", s)
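Note that both patterns above only touch lowercase vowels. If uppercase vowels should be replaced too, one possible variation is to compile the character class with re.IGNORECASE. This is a sketch, not part of the original answers; the function name oobafy is invented here:

```python
import re

# Case-insensitive character class, so 'A' is treated like 'a'.
VOWELS = re.compile(r"[aeiou]", re.IGNORECASE)

def oobafy(text):
    # Replace every vowel with the literal string 'ooba'.
    return VOWELS.sub("ooba", text)

print(oobafy("Hello!"))  # Hooballooba!
print(oobafy("Anakin"))  # oobanoobakooban
```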

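For a quick and simple answer without regex, you could feed the string meme into a loop like this:
result = ''
for c in meme:
    if c in 'aeiou':
        result += 'ooba'  # replace the vowel
    else:
        result += c  # keep every other character
It goes over each character in the string and checks whether it is a vowel. If it is, 'ooba' is appended to the result in its place; otherwise the character is copied unchanged. (Strings are immutable in Python, so a slice of meme cannot be assigned to directly.)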

Using regex to findall lowercase letters in string append to list. Python

I'm looking for a way to get the lowercase values out of a string that has both uppercase and potentially lowercase letters
here's an example
sequences = ['CABCABCABdefgdefgdefgCABCAB','FEGFEGFEGwowhelloFEGFEGonemoreFEG','NONEARELOWERCASE'] #sequences with uppercase and potentially lowercase letters
this is what i want to output
upper_output = ['CABCABCABCABCAB','FEGFEGFEGFEGFEGFEG','NONEARELOWERCASE'] #the upper case letters joined together
lower_output = [['defgdefgdefg'],['wowhello','onemore'],[]] #the lower case letters in lists within lists
lower_indx = [[9],[9,23],[]] #where the lower case values occur in the original sequence
so i want the lower_output list be a LIST of SUBLISTS. the SUBLISTS would have all the strings of lowercase letters .
i was thinking of using regex . . .
import re
lower_indx = []
for seq in sequences:
    lower_indx.append(re.findall("[a-z]", seq).start())
print lower_indx
for the lowercase lists i was trying:
lower_output = []
for seq in sequences:
    temp = ''
    temp = re.findall("[a-z]", seq)
    lower_output.append(temp)
print lower_output
but the values are not in separate lists (i still need to join them)
[['d', 'e', 'f', 'g', 'd', 'e', 'f', 'g', 'd', 'e', 'f', 'g'], ['w', 'o', 'w', 'h', 'e', 'l', 'l', 'o', 'o', 'n', 'e', 'm', 'o', 'r', 'e'], []]

Sounds like (I may be misunderstanding your question) you just need to capture runs of lowercase letters, rather than each individual lowercase letter. This is easy: just add the + quantifier to your regular expression.
for seq in sequences:
    lower_output.append(re.findall("[a-z]+", seq)) # add substrings
The + quantifier specifies that you want "at least one, and as many as you can find in a row" of the preceding expression (in this case '[a-z]'). So this will capture your full runs of lowercase letters all in one group, which should cause them to appear as you want them to in your output lists.
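To see the difference the quantifier makes, here is a small side-by-side comparison on a made-up string (the variable names are just for illustration):

```python
import re

seq = "ABcdEf"

# Without +, every lowercase letter is its own match.
single_chars = re.findall("[a-z]", seq)
# With +, adjacent lowercase letters are grouped into one match.
runs = re.findall("[a-z]+", seq)

print(single_chars)  # ['c', 'd', 'f']
print(runs)          # ['cd', 'f']
```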
It gets a little bit uglier if you want to preserve your list-of-lists structure and get the indices as well, but it's still very simple:
for seq in sequences:
    matches = list(re.finditer("[a-z]+", seq)) # list of Match objects (finditer alone returns a one-shot iterator)
    lower_output.append([match.group(0) for match in matches]) # add substrings
    lower_indx.append([match.start(0) for match in matches]) # add indices
print lower_output
>>> [['defgdefgdefg'], ['wowhello', 'onemore'], []]
print lower_indx
>>> [[9], [9, 23], []]
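Putting the pieces together, all three lists asked for in the question can be produced in one pass per sequence. This is only a sketch written for Python 3; the helper name split_case and the constant LOWER_RUN are hypothetical:

```python
import re

LOWER_RUN = re.compile("[a-z]+")

def split_case(seq):
    # Materialize the matches so they can be traversed more than once.
    matches = list(LOWER_RUN.finditer(seq))
    upper = LOWER_RUN.sub("", seq)            # everything except lowercase runs
    lowers = [m.group(0) for m in matches]    # the lowercase runs themselves
    starts = [m.start() for m in matches]     # where each run begins
    return upper, lowers, starts

sequences = ['CABCABCABdefgdefgdefgCABCAB',
             'FEGFEGFEGwowhelloFEGFEGonemoreFEG',
             'NONEARELOWERCASE']
results = [split_case(seq) for seq in sequences]
upper_output = [r[0] for r in results]
lower_output = [r[1] for r in results]
lower_indx = [r[2] for r in results]
print(upper_output)
print(lower_output)
print(lower_indx)
```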

Apart from regex you can also use itertools.groupby here:
In [38]: from itertools import groupby
In [39]: sequences = ['CABCABCABdefgdefgdefgCABCAB','FEGFEGFEGwowhelloFEGFEGonemoreFEG','NONEARELOWERCASE'] #sequences with uppercase and potentially lowercase letters
In [40]: lis=[["".join(v) for k,v in groupby(x,key=lambda z:z.islower())] for x in sequences]
In [41]: upper_output=["".join(x[::2]) for x in lis]
In [42]: lower_output=[x[1::2] for x in lis]
In [43]: upper_output
Out[43]: ['CABCABCABCABCAB', 'FEGFEGFEGFEGFEGFEG', 'NONEARELOWERCASE']
In [44]: lower_output
Out[44]: [['defgdefgdefg'], ['wowhello', 'onemore'], []]
In [45]: lower_indx=[[sequences[i].index(y) for y in x] for i,x in enumerate(lower_output)]
In [46]: lower_indx
Out[46]: [[9], [9, 23], []]
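One caveat with the index() lookup: str.index returns the first occurrence, so if the same lowercase run appeared twice in a sequence, both entries would get the same index. A possible way around that, sketched here with a hypothetical helper lowercase_runs, is to track the offset while walking the groups:

```python
from itertools import groupby

def lowercase_runs(seq):
    runs = []
    pos = 0  # offset of the current group within seq
    for is_lower, group in groupby(seq, key=str.islower):
        chunk = "".join(group)
        if is_lower:
            runs.append((pos, chunk))
        pos += len(chunk)
    return runs

repeated = lowercase_runs("ABabCDab")
print(repeated)  # [(2, 'ab'), (6, 'ab')]
```

It also does not rely on the sequence starting with an uppercase letter, which the [::2] and [1::2] slicing assumes.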

Split string based on a regular expression

I have the output of a command in tabular form. I'm parsing this output from a result file and storing it in a string. Each element in one row is separated by one or more whitespace characters, thus I'm using regular expressions to match 1 or more spaces and split it. However, a space is being inserted between every element:
>>> str1="a    b     c      d" # spaces are irregular
>>> str1
'a    b     c      d'
>>> str2=re.split("( )+", str1)
>>> str2
['a', ' ', 'b', ' ', 'c', ' ', 'd'] # 1 space element between!!!
Is there a better way to do this?
After each split str2 is appended to a list.

By using ( ), you are creating a capturing group, and re.split includes captured text in its result. If you simply remove the parentheses, you will not have this problem.
>>> str1 = "a b c d"
>>> re.split(" +", str1)
['a', 'b', 'c', 'd']
However there is no need for regex, str.split without any delimiter specified will split this by whitespace for you. This would be the best way in this case.
>>> str1.split()
['a', 'b', 'c', 'd']
If you really want a regex, you can use this ('\s' matches any whitespace character, which is clearer):
>>> re.split(r"\s+", str1)
['a', 'b', 'c', 'd']
or you can find all non-whitespace characters
>>> re.findall(r'\S+',str1)
['a', 'b', 'c', 'd']

The str.split method will automatically remove all white space between items:
>>> str1 = "a b c d"
>>> str1.split()
['a', 'b', 'c', 'd']
Docs are here: http://docs.python.org/library/stdtypes.html#str.split

When you use re.split and the split pattern contains capturing groups, the groups are retained in the output. If you don't want this, use a non-capturing group instead.
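A short illustration of the difference, using a made-up input string with irregular spacing (variable names are for illustration only):

```python
import re

str1 = "a  b   c    d"

# Capturing group: the last captured space of each separator is kept.
with_capture = re.split("( )+", str1)
# Non-capturing group (?:...): separators are discarded.
without_capture = re.split("(?: )+", str1)

print(with_capture)     # ['a', ' ', 'b', ' ', 'c', ' ', 'd']
print(without_capture)  # ['a', 'b', 'c', 'd']
```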

It's very simple actually. Try this:
str1="a b c d"
splitStr1 = str1.split()
print splitStr1

Regular expression group capture with multiple matches

Quick regular expression question.
I'm trying to capture multiple instances of a capture group in python (I don't think it's python specific), but the subsequent captures seem to overwrite the previous ones.
In this over-simplified example, I'm essentially trying to split a string:
x = 'abcdef'
r = re.compile(r'(\w){6}')
m = r.match(x)
m.groups() # = ('f',) ?!?
I want to get ('a', 'b', 'c', 'd', 'e', 'f'), but because regex overwrites subsequent captures, I get ('f',)
Is this how regex is supposed to behave? Is there a way to do what I want without having to repeat the syntax six times?
Thanks in advance!
Andrew

You can't use groups for this, I'm afraid. A repeated group keeps only its last match, and as far as I know most regex engines work this way. A possible solution is to use findall() or similar.
r=re.compile(r'\w')
r.findall(x)
# ['a', 'b', 'c', 'd', 'e', 'f']
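For comparison, here is what the two approaches return for the string from the question. This is a small sketch; converting to a tuple is only there to match the output the question asked for:

```python
import re

x = "abcdef"

# The repeated group matches six times, but only the last capture survives.
m = re.match(r"(\w){6}", x)
last_only = m.groups()

# findall returns every match, one per character here.
all_chars = tuple(re.findall(r"\w", x))

print(last_only)  # ('f',)
print(all_chars)  # ('a', 'b', 'c', 'd', 'e', 'f')
```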

The regex module can do this.
> m = regex.match(r'(\w){6}', "abcdef")
> m.captures(1)
['a', 'b', 'c', 'd', 'e', 'f']
Also works with named captures:
> m = regex.match(r'(?P<letter>\w){6}', "abcdef")
> m.capturesdict()
{'letter': ['a', 'b', 'c', 'd', 'e', 'f']}
The regex module is a third-party package (installed separately from the standard library) that aims to be backwards-compatible with the 're' module, while offering many more features and capabilities.

To find all matches in a given string use re.findall(regex, string). Also, if you want to obtain every letter here, your regex should be either '(\w){1}' or just '(\w)'.
See:
r = re.compile(r'(\w)')
l = re.findall(r, x)
l == ['a', 'b', 'c', 'd', 'e', 'f']
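Keep in mind that what findall returns depends on the groups in the pattern: with no group it returns whole matches, with one group it returns that group's text, and with several groups it returns tuples. A quick sketch on a made-up string:

```python
import re

text = "a1b2"

no_group = re.findall(r"\w\d", text)        # whole matches
one_group = re.findall(r"(\w)\d", text)     # only the captured group
two_groups = re.findall(r"(\w)(\d)", text)  # one tuple per match

print(no_group)    # ['a1', 'b2']
print(one_group)   # ['a', 'b']
print(two_groups)  # [('a', '1'), ('b', '2')]
```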

I suppose your question is a simplified version of what you actually need.
So here is a slightly more complex example:
import re
pat = re.compile('[UI][bd][ae]')
ch = 'UbaUdeIbaIbeIdaIdeUdeUdaUdeUbeIda'
print [mat.group() for mat in pat.finditer(ch)]
Result:
['Uba', 'Ude', 'Iba', 'Ibe', 'Ida', 'Ide', 'Ude', 'Uda', 'Ude', 'Ube', 'Ida']
