Cut within a pattern using Python regex - python

Objective: I am trying to perform a cut in Python RegEx where split doesn't quite do what I want. I need to cut within a pattern, but between characters.
What I am looking for:
I need to recognize the pattern below in a string, and split the string at the location of the pipe. The pipe isn't actually in the string, it just shows where I want to split.
Pattern: CDE|FG
String: ABCDEFGHIJKLMNOCDEFGZYPE
Results: ['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
What I have tried:
I seems like using split with parenthesis is close, but it doesn't keep the search pattern attached to the results like I need it to.
re.split('CDE()FG', 'ABCDEFGHIJKLMNOCDEFGZYPE')
Gives,
['AB', 'HIJKLMNO', 'ZYPE']
When I actually need,
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
Motivation:
Practicing with RegEx, and wanted to see if I could use RegEx to make a script that would predict the fragments of a protein digestion using specific proteases.

A non regex way would be to replace the pattern with the piped value and then split.
>>> pattern = 'CDE|FG'
>>> s = 'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> s.replace('CDEFG',pattern).split('|')
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']

You can solve it with re.split() and positive "look arounds":
>>> re.split(r"(?<=CDE)(\w+)(?=FG)", s)
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
Note that if one of the cut sequences is an empty string, you would get an empty string inside the resulting list. You can handle that "manually", sample (I admit, it is not that pretty):
import re
s = "ABCDEFGHIJKLMNOCDEFGZYPE"
cut_sequences = [
["CDE", "FG"],
["FGHI", ""],
["", "FGHI"]
]
for left, right in cut_sequences:
items = re.split(r"(?<={left})(\w+)(?={right})".format(left=left, right=right), s)
if not left:
items = items[1:]
if not right:
items = items[:-1]
print(items)
Prints:
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
['ABCDEFGHI', 'JKLMNOCDEFGZYPE']
['ABCDE', 'FGHIJKLMNOCDEFGZYPE']

To keep the splitting pattern when you split with re.split, or parts of it, enclose them in parentheses.
>>> data
'ABCDEFGHIJKLMNOCDEFGZYPE'
>>> pieces = re.split(r"(CDE)(FG)", data)
>>> pieces
['AB', 'CDE', 'FG', 'HIJKLMNO', 'CDE', 'FG', 'ZYPE']
Easy enough. All the parts are there, but as you can see they have been separated. So we need to reassemble them. That's the trickier part. Look carefully and you'll see you need to join the first two pieces, the last two pieces, and the rest in triples. I simplify the code by padding the list, but you could do it with the original list (and a bit of extra code) if performance is a problem.
>>> pieces = [""] + pieces
>>> [ "".join(pieces[i:i+3]) for i in range(0,len(pieces), 3) ]
['ABCDE', 'FGHIJKLMNOCDE', 'FGZYPE']
re.split() guarantees a piece for every capturing (parenthesized) group, plus a piece for what's between. With more complex regular expressions that need their own grouping, use non-capturing groups to keep the format of the returned data the same. (Otherwise you'll need to adapt the reassembly step.)
PS. I also like Bhargav Rao's suggestion to insert a separator character in the string. If performance is not an issue, I guess it's a matter of taste.
Edit: Here's a (less transparent) way to do it without adding an empty string to the list:
pieces = re.split(r"(CDE)(FG)", data)
result = [ "".join(pieces[max(i-3,0):i]) for i in range(2,len(pieces)+2, 3) ]

A safer non-regex solution could be this:
import re
def split(string, pattern):
"""Split the given string in the place indicated by a pipe (|) in the pattern"""
safe_splitter = "####SPLIT_HERE####"
safe_pattern = pattern.replace("|", safe_splitter)
string = string.replace(pattern.replace("|", ""), safe_pattern)
return string.split(safe_splitter)
s = "ABCDEFGHIJKLMNOCDEFGZYPE"
print(split(s, "CDE|FG"))
print(split(s, "|FG"))
print(split(s, "FGH|"))
https://repl.it/C448

Related

In python, find tokens in line

long time ago I wrote a tool for parsing text files, line by line, and do some stuff, depending on commands and conditions in the file.
I used regex for this, however, I was never good in regex.
A line holding a condition looks like this:
[type==STRING]
And the regex I use is:
re.compile(r'^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*$', re.MULTILINE)
This regex would result me the keyword "type" and the value "STRING".
However, now I need to update my tool to have more conditions in one line, e.g.
[type==STRING][amount==0]
I need to update my regex to get me two pairs of results, one pair type/STRING and one pair amount/0.
But I'm lost on this. My regex above gets me zero results with this line.
Any ideas how to do this?
You could either match a second pair of groups:
^[^\[\]]*\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*(?:\[([^\]\[=]*)==([^\]\[=]*)\][^\]\[]*)?$
Regex demo
Or you can omit the anchors and the [^\[\]]* part to get the group1 and group 2 values multiple times:
\[([^\]\[=]*)==([^\]\[=]*)\]
Regex demo
Is it a requirement that you use regex? You can alternatively accomplish this pretty easily using the split function twice and stripping the first opening and last closing bracket.
line_to_parse = "[type==STRING]"
# omit the first and last char before splitting
pairs = line_to_parse[1:-1].split("][")
for pair in pairs:
x, y = pair.split("==")
Rather depends on the precise "rules" that describe your data. However, for your given data why not:
import re
text = '[type==STRING][amount==0]'
words = re.findall('\w+', text)
lst = []
for i in range(0, len(words), 2):
lst.append((words[i], words[i+1]))
print(lst)
Output:
[('type', 'STRING'), ('amount', '0')]

Replace Items with regex (kind of)

I am handed a bunch of data and trying to get rid of certain characters. The data contains multiple instances of "^{number}" → "^0", "^1", "^2", etc.
I am trying to set all of these instances to an empty string, "", is there a better way to do this than
string.replace("^0", "").replace("^1", "").replace("^2", "")
I understand you can use a dictionary, but it seems a little overkill considering each item will be replaced with "".
I understand that the digits are always at the end of the string, have a look at the solutions below.
with regex:
import re
text = 'xyz125'
s = re.sub("\d+$",'', text)
print(s)
it should print:
'xyz'
without regex, keep in mind that this solution removes all digits and not only the ones at the end of a string:
text = 'xyz125'
result = ''.join(i for i in text if not i.isdigit())
print(result)
it should print:
'xyz'

Regex in python: combining 2 regex expressions into one

Suppose I have the following list:
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','persons']
I want to remove all elements, that contain numbers and elements, that end with dots.
So I want to delete '35','7,000','10,000','mr.','rev.'
I can do it separately using the following regex:
regex = re.compile('[a-zA-Z\.]')
regex2 = re.compile('[0-9]')
But when I try to combine them I delete either all elements or nothing.
How can I combine two regex correctly?
This should work:
reg = re.compile('[a-zA-Z]+\.|[0-9,]+')
Note that your first regex is wrong because it deletes any string within a dot inside it.
To avoid this, I included [a-zA-Z]+\. in the combined regex.
Your second regex is also wrong as it misses a "+" and a ",", which I included in the above solution.
Here a demo.
Also, if you assume that elements which end with a dot might contain some numbers the complete solution should be:
reg = re.compile('[a-zA-Z0-9]+\.|[0-9,]+')
If you don't need to capture the result, this matches any string with a dot at the end, or any with a number in it.
\.$|\d
You could use:
(?:[^\d\n]*\d)|.*\.$
See a demo on regex101.com.
Here is a way to do the job:
import re
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','per.sons']
b = []
for s in a:
if not re.search(r'^(?:[\d,]+|.*\.)$', s):
b.append(s)
print b
Output:
['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every', 'per.sons']
Demo & explanation

Python - How to use regex to find multiple words and extract them at the same time

Using Regular Expression, I want to find all the match words in a sentence and extract the wanted part in the matches words at the same time.
I use the API "findall" from "re" module to find the match words and plus the brackets to extract the parts I want.
For example I have a string "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C".
I only want the remaining two words after "0xQQ" or "0xWW", which will result in a list ["1A", "2B, "4C"].
Here is my code:
import re
MyString = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
MySearch = re.compile("0xQQ(\w{2})|0xWW(\w{2})")
MyList = MySearch.findall(MyString)
print MyList
So my expected result is ["1A", "2B, "4C"].
But the actual result is [('1A', ''), ('', '2B'), ('4C', '')]
I think I might have used the combination of "()" and "|" in the wrong way.
Thx for the help!
Two different capturing groups will result in two items in the output (whatever matched each).
Instead, use a single capturing group and put your | (OR) earlier:
re.compile("0x(?:QQ|WW)(\w{2})")
((?:...) is a non-capturing group that matches ... - used to limit the effects of the | to only the QQ/WW split, without adding another capture to the output.)
You can try this:
import re
string = "0xQQ1A, 0xWW2B, 0xEE3C, 0xQQ4C"
pattern = re.compile(r"(0xQQ|0xWW)(\w{2})")
result = [match[2] for match in pattern.finditer(string)]
result will be:
['1A', '2B', '4C']

Parsing and reformatting CSV/text data using Python

sorry if this a bit of a beginner's question, but I haven't had much experience with python, and could really use some help in figuring this out. If there is a better programming language for tackling this, I'd be more than open to hearing it
I'm working on a small project, and I have two blocks of data, formatted differently from each other. They're all spreadsheets saved as CSV files, and I'd really like to make one group match the other without having to manually edit all the data.
What I need to do is go through a CSV, and format any data saved like this:
10W
20E
15-16N
17-18S
To a format like this (respective line to respective format):
10,W
20,E
,,15,16,N
,,17,18,S
So that they can just be copied over when opened as spreadsheets
I'm able to get the files into a string in python, but I'm unsure of how to properly write something to search for a number-hyphen-number-letter format.
I'd be immensely grateful for any help I can get. Thanks
This sounds like a good use-case for regular expressions. Once you've split the lines up into individual strings and stripped the whitespace (using s.strip()) these should work (I'm assuming those are cardinal directions; you'll need to change [NESW] to something else if that assumption is incorrect.):
>>> import re
>>> re.findall('\A(\d+)([NESW])', '16N')
[('16', 'N')]
>>> re.findall('\A(\d+)([NESW])', '15-16N')
[]
>>> re.findall('\A(\d+)-(\d+)([NESW])', '15-16N')
[('15', '16', 'N')]
>>> re.findall('\A(\d+)-(\d+)([NESW])', '16N')
[]
The first regex '\A(\d+)([NESW])' matches only a string that begins with a sequence of digits followed by a capital letter N, E, S, or W. The second matches only a string that begins with a sequence of digits followed by a hyphen, followed by another sequence of digits, followed by a capital letter N, E, S, or W. Forcing it to match at the beginning ensures that these regexes don't match a suffix of a longer string.
Then you can do something like this:
>>> vals = re.findall('\A(\d+)([NESW])', '16N')[0]
>>> ','.join(vals)
'16,N'
>>> vals = re.findall('(\d+)-(\d+)([NESW])', '15-16N')[0]
>>> ',,' + ','.join(vals)
',,15,16,N'
This is a whole solution that uses regexs. #senderle has beat me to the answer, so feel free to tick his response. This is just added here as I know how difficult it was to wrap my head around re in my code at first.
import re
dash = re.compile('(\d{2})-(\d{2})([WENS])')
no_dash = re.compile( '(\d{2})([WENS])' )
raw = '''10W
20E
15-16N
17-18S'''
lines = raw.split('\n')
data = []
for l in lines:
if '-' in l:
match = re.search(dash, l).groups()
data.append( ',,%s,%s,%s' % (match[0], match[1], match[2] ) )
else:
match = re.search(no_dash, l).groups()
data.append( '%s,%s' % (match[0], match[1] ) )
print '\n'.join(data)
In your case, I think the quick solution would involve regexps
You can either use the match method to extract your different tokens when they match a given regular expression, or the split method to split your string into tokens given a separator.
However, in your case, the separator would be a single character, so you can use the split method from the str class.

Categories

Resources