How do I extract certain parts of strings in Python? - python

Say I have three strings:
abc534loif
tvd645kgjf
tv96fjbd_gfgf
and three lists:
beginning captures just the first part of the string "the name"
middle captures just the number
end contains only the rest of the characters that are after the number portion
How do I accomplish this in the most efficient way?

Use regular expressions?
>>> import re
>>> strings = 'abc534loif tvd645kgjf tv96fjbd_gfgf'.split()
>>> for s in strings:
...     for match in re.finditer(r'\b([a-z]+)(\d+)(.+?)\b', s):
...         print(match.groups())
...
('abc', '534', 'loif')
('tvd', '645', 'kgjf')
('tv', '96', 'fjbd_gfgf')

This is a language-agnostic approach that aims for higher efficiency:
find first digit in the string and save its position p0
find last digit in the string and save its position p1
extract substring from 0 to p0-1 into beginning
extract substring from p0 to p1 into middle
extract substring from p1+1 to length-1 into end
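The steps above could be sketched in Python roughly as follows. The function name split_on_digits is made up for illustration, and the sketch assumes each string contains a single run of digits. Note that Python slices exclude the end index, so the inclusive positions in the steps become p0 and p1 + 1.

```python
def split_on_digits(s):
    """Split s into (beginning, middle, end) around its digits."""
    # Collect the index of every digit character in the string.
    digit_positions = [i for i, ch in enumerate(s) if ch.isdigit()]
    if not digit_positions:
        # No digits at all: everything counts as the beginning.
        return s, "", ""
    p0, p1 = digit_positions[0], digit_positions[-1]
    # Slices are half-open, so p1 + 1 includes the last digit.
    return s[:p0], s[p0:p1 + 1], s[p1 + 1:]


for s in ["abc534loif", "tvd645kgjf", "tv96fjbd_gfgf"]:
    print(split_on_digits(s))
```

If a string had digits in more than one place, everything between the first and last digit would end up in middle, so this is only a fit for inputs shaped like the ones in the question.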

I guess you're looking for re.findall:
strs = """
abc534loif
tvd645kgjf
tv96fjbd_gfgf
"""
import re
print(re.findall(r'\b(\w+?)(\d+)(\w+)', strs))
# Output: [('abc', '534', 'loif'), ('tvd', '645', 'kgjf'), ('tv', '96', 'fjbd_gfgf')]

>>> import itertools as it
>>> s="abc534loif"
>>> [''.join(j) for i,j in it.groupby(s, key=str.isdigit)]
['abc', '534', 'loif']

I'd do something like this:
>>> import re
>>> l = ['abc534loif', 'tvd645kgjf', 'tv96fjbd_gfgf']
>>> regex = re.compile(r'([a-z_]+)(\d+)([a-z_]+)')
>>> beginning, middle, end = zip(*[regex.match(s).groups() for s in l])
>>> beginning
('abc', 'tvd', 'tv')
>>> middle
('534', '645', '96')
>>> end
('loif', 'kgjf', 'fjbd_gfgf')

I would use a regular expression like:
(?P<beginning>[^0-9]*)(?P<middle>[0-9]*)(?P<end>[^0-9]*)
and pull out the three matching sections.
import re
m = re.match(r"(?P<beginning>[^0-9]*)(?P<middle>[0-9]*)(?P<end>[^0-9]*)", "abc534loif")
m.group('beginning')  # 'abc'
m.group('middle')     # '534'
m.group('end')        # 'loif'

import re  # You want to match a string against a pattern, so you import the regular expressions module 're'
mystring = "abc1234def"  # Just a string to test with
# Our regular expression. Everything between parentheses is 'captured', meaning that it is
# accessible as one of the 'groups' in the returned match object. The ^ sign matches at the
# beginning of a string, while the $ matches the end. The characters between the square
# brackets are a character range, so [0-9] matches any digit character; \D is any non-digit character.
match = re.match(r"^(\D+)([0-9]+)(\D+)$", mystring)
if match:  # match will be None if the string didn't match the pattern, so we need to check for that, as None.group doesn't exist.
    beginning = match.group(1)
    middle = match.group(2)
    end = match.group(3)

Related

Combining two patterns with named capturing group in Python?

I have a regular expression which uses the before pattern like so:
>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:)([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:I118uailfriedx151201005423521">>')
>>> x.group('sid')
'I118uailfriedx151201005423521'
and another like so:
>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:<<")([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:<<"I118uailfriedx151201005423521')
>>> x.group('sid')
'I118uailfriedx151201005423521'
How can I combine these two patterns so that parsing these two different lines:
sid:A111uancalual2626x151130185758596
sid:<<"I118uailfriedx151201005423521">>
returns only the corresponding id in each case?
RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9]+))')
Use this; I've just tested it and it works for me. I moved the sid: prefix and the optional <<" out of the named group.
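As a quick check, the pattern above can be run against the two lines from the question. This is only a small sketch; the lines and sids variable names are made up for illustration.

```python
import re

# The combined pattern: the optional <<" is consumed outside the named group.
RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9]+))')

lines = [
    'sid:A111uancalual2626x151130185758596',
    'sid:<<"I118uailfriedx151201005423521">>',
]

# search() returns None when nothing matches; both lines here do match.
sids = [RE_SID.search(line).group('sid') for line in lines]
print(sids)
# ['A111uancalual2626x151130185758596', 'I118uailfriedx151201005423521']
```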
Instead of tweaking your regex, you can make your strings easier to parse by just removing any characters except alphanumeric and a colon. Then, just split by colon and get the last item:
>>> import re
>>>
>>> test_strings = ['sid:I118uailfriedx151201005423521">>', 'sid:<<"I118uailfriedx151201005423521']
>>> pattern = re.compile(r"[^A-Za-z0-9:]")
>>> for test_string in test_strings:
...     print(pattern.sub("", test_string).split(":")[-1])
...
I118uailfriedx151201005423521
I118uailfriedx151201005423521
You can achieve what you want with a single regex:
\bsid:\W*(?P<sid>\w+)
The regex breakdown:
\bsid - whole word sid
: - a literal colon
\W* - zero or more non-word characters
(?P<sid>\w+) - one or more word characters captured into a group named "sid"
Python demo:
import re
p = re.compile(r'\bsid:\W*(?P<sid>\w+)')
#test_str = "sid:I118uailfriedx151201005423521\">>" # => I118uailfriedx151201005423521
test_str = "sid:<<\"I118uailfriedx151201005423521" # => I118uailfriedx151201005423521
m = p.search(test_str)
if m:
    print(m.group("sid"))

Add [] around numbers in strings

I'd like to add [] around any sequence of digits in a string, e.g.
"pixel1blue pin10off output2high foo9182bar"
should convert to
"pixel[1]blue pin[10]off output[2]high foo[9182]bar"
I feel there must be a simple way, but it's eluding me :(
Yes, there is a simple way, using re.sub():
result = re.sub(r'(\d+)', r'[\1]', inputstring)
Here \d matches a digit, \d+ matches 1 or more digits. The (...) around that pattern groups the match so we can refer to it in the second argument, the replacement pattern. That pattern simply replaces the matched digits with [...] around the group.
Note that I used r'..' raw string literals; if you don't you'd have to double all the \ backslashes; see the Backslash Plague section of the Python Regex HOWTO.
Demo:
>>> import re
>>> inputstring = "pixel1blue pin10off output2high foo9182bar"
>>> re.sub(r'(\d+)', r'[\1]', inputstring)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
You can use re.sub :
>>> s="pixel1blue pin10off output2high foo9182bar"
>>> import re
>>> re.sub(r'(\d+)',r'[\1]',s)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
Here (\d+) matches any run of digits, and re.sub replaces each match with the first group wrapped in brackets, r'[\1]'.
You can start here to learn regular expressions: http://www.regular-expressions.info/

python 3 regular expression match string meta-character

I want to write a regular expression that matches strings like "(2000)", with a year in parentheses, so that I can check whether any string contains the substring "(2000)".
For example, I want the regex to match (2000) but not 2000, (20000), or (200).
That is to say: a match must have exactly four digits, the first digit must be 1 or 2, and the parentheses must be included.
Also, 2000 is just an example; really I want the regex to cover all possible years.
You have to escape the opening and closing parentheses:
>>> import re
>>> str = """foo(2000)bar(1000)foobar2000"""
>>> regex = r'\(2000\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)']
OR
>>> import re
>>> str = """foo(2000)bar(1000)foobar(2014)barfoo(2020)"""
>>> regex = r'\([0-9]{4}\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)', '(1000)', '(2014)', '(2020)']
It matches all four-digit numbers (years) that appear within parentheses.
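The question also asks for the first digit to be 1 or 2, which [0-9]{4} does not enforce. One possible way to tighten it, sketched here with a made-up test string and a made-up YEAR_IN_PARENS name, is to use [12] for the first digit followed by exactly three more digits:

```python
import re

# \( and \) are literal parentheses; [12] restricts the first digit,
# and [0-9]{3} requires exactly three more digits before the closing ).
YEAR_IN_PARENS = re.compile(r'\([12][0-9]{3}\)')

text = "foo(2000)bar(1000)x(20000)y(200)z2000w(3000)"
years = YEAR_IN_PARENS.findall(text)
print(years)
# ['(2000)', '(1000)']
```

Here (20000) and (200) are rejected because the closing parenthesis must come right after the fourth digit, the bare 2000 is rejected because it has no parentheses, and (3000) is rejected by the first-digit restriction.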
Special characters need to be escaped with a backslash. A parenthesis ( becomes \(. Therefore (2000) becomes \(2000\).
Then you can do something like:
if re.search(r"\(2000\)", subject):
    pass  # Successful match
else:
    pass  # Match attempt failed
>>> import re
>>> x = re.match(r'\((\d*?)\)', "(2000)")
>>> x.group(1)
'2000'

How to find a non-alphanumeric character and move it to the end of a string in Python

I have the following string:
"string.isnotimportant"
I want to find the dot (it could be any non-alphanumeric character), and move it to the end of the string.
The result should look like:
"stringisnotimportant."
I am looking for a regular expression to do this job.
import re
inp = "string.isnotimportant"
re.sub(r'(\w*)(\W+)(\w*)', r'\1\3\2', inp)
>>> import re
>>> string = "string.isnotimportant"
>>> # The parentheses in the regex mean that the item, if matched, is stored as a group
>>> # (I explain a bit more about the regex at the end).
>>> regex = r'\w*(\W+)\w*'
>>> x = re.search(regex, string)
>>> x.groups()  # Remember the stored group above? This accesses that group.
('.',)
>>> # If there were more than one group above, there would be more items in the tuple.
>>> # Here I reassign string to a modified version where the '.' is replaced with '' (nothing).
>>> string = string.replace('.', '')
>>> string += x.groups()[0]  # Here I append the captured character to the end of string.
>>> print(string)
stringisnotimportant.
The += operator appends a character to the end of a string. Since strings don't have an .append method like lists do, this is a handy feature. x.groups()[0] refers to the first item (the only item in this case) of the tuple above.
To understand the re module properly, I think your best bet is to read some docs; the links are at the end of this post.
About the regex:
"\w" Matches any alphanumeric character and the underscore: a through z, A through Z, 0 through 9, and '_'.
"\W" Matches any non-alphanumeric character. Examples for this include '&', '$', '#', etc.
https://developers.google.com/edu/python/regular-expressions?csw=1
http://python.about.com/od/regularexpressions/a/regexprimer.htm

How to substitute chars using unicode regex range

I am trying to remove chars from a unicode string. I have a whitelist of allowed unicode chars and I would like to remove everything that is not on the list.
allowed_list = ur'[\u0041-\u005A]|[\u0061-\u007A]|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u012F]|\u0131|[\u0386]|[\u0388-\u038A]'
negated_list = ur'[^\u0041-\u005A]|[^\u0061-\u007A]|[^\u00C0-\u00D6]|[^\u00D8-\u00F6]|[^\u00F8-\u012F]|^\u0131|[^\u0386]|[^\u0388-\u038A]'
I am testing it with a subset of my list and I don't get why it is not working.
This removes all but lowercase latin chars:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0061-\u007A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
'rugg'
This removes all but uppercase latin chars:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0041-\u005A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
'AT'
But when I combine them, all chars get removed:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0041-\u005A]|[^\u0061-\u007A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
''
When I tested the regex [^\u0041-\u005A]|[^\u0061-\u007A] on https://pythex.org/ it did what I expected, but when I attempt to use it in my code, it does not do what I want. What am I missing?
Thank you in advance!
Your regex is not correct: you are using |, which matches if either alternative matches.
You need to create one expression with multiple ranges:
[^\u0041-\u005A\u0061-\u007A] will match any character that is outside both the range \u0041-\u005A and the range \u0061-\u007A.
import re
regex = r"[^\u0041-\u005A\u0061-\u007A]"
test_str = "Arugg^]T"
myre = re.compile(regex, re.UNICODE)
result = myre.sub('', test_str)
print(result)
# Output:
# AruggT
In a positive regex class, the items are implicitly OR'd together.
Your allowed list is then the same as
[\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u012f\u0131\u0386\u0388-\u038a]
But for the Negative regex class [^], items are individually negated then AND'ed together.
That regex is then
[^\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u012f\u0131\u0386\u0388-\u038a]
which is logically the same as
[^\u0041-\u005A] and [^\u0061-\u007A] and [^\u00C0-\u00D6] and [^\u00D8-\u00F6] and [^\u00F8-\u012F] and [^\u0131] and [^\u0386] and [^\u0388-\u038A]
What you tried to do was negate each item, then OR them together
which is not the same.
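The difference can be seen with a small sketch (Python 3 syntax, using only the two Latin ranges from the question; the variable names are made up for illustration). Every character is either not uppercase or not lowercase, so the OR of two negated classes matches everything, while a single negated class holding both ranges matches only what lies outside both:

```python
import re

text = "Arugg^]T"

# Negate each range, then OR them: matches any char outside A-Z OR outside a-z,
# which is true of every character.
either_negated = re.compile(r'[^\u0041-\u005A]|[^\u0061-\u007A]')

# One negated class: matches only chars outside A-Z AND outside a-z.
one_negated_class = re.compile(r'[^\u0041-\u005A\u0061-\u007A]')

removed_everything = either_negated.sub('', text)
kept_letters = one_negated_class.sub('', text)

print(repr(removed_everything))  # ''
print(repr(kept_letters))        # 'AruggT'
```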
You are replacing all characters that match
[^\u0041-\u005A] or [^\u0061-\u007A], i.e. characters that are not in the first range or not in the second (due to the ^).
Every character satisfies at least one of these, so everything gets replaced by '', no matter what you have.
Use ur'[^\u0041-\u005A\u0061-\u007A]' instead (both ranges inside one [...]).
