How to use * or + with brackets in regular expressions in Python?


The input contains multiple space-separated characters, e.g. string = "a b c d a s e "
What should the pattern be so that, when I run re.search on the input, I get the j-th character together with the space following it by calling .group(j)?
I tried something like "^(([a-zA-Z])\s)+", but it does not work. What should I do?
EDIT
My actual question is in the heading and the body described only a special case of it:
Here's the general version of the question: if I have to take in all patterns of a specific type (initial question had the pattern "[a-zA-Z]\s") from a string, what should I do?

Use findall() instead and get the j-th match by index:
>>> j = 2
>>> re.findall(r"[a-zA-Z]\s", string)[j]
'c '
where [a-zA-Z]\s would match a lower or upper case letter followed by a single space character.

Why use regex when you can simply use the str.split() method and access the characters by index?
>>> new = string.split()
>>> new
['a', 'b', 'c', 'd', 'a', 's', 'e']

You could do:
>>> string = "a b c d a s e "
>>> j=2
>>> re.search(r'([a-zA-Z]\s){%i}' % j, string).group(1)
'b '
Explanation:
With the pattern ([a-zA-Z]\s) you capture a letter and then the whitespace character after it;
With the repetition {2} added, the group is matched twice and keeps only the last repetition, in this case the second pair. Note that j counts from 1 here, whereas list indexing counts from 0.
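To illustrate why the original attempt "^(([a-zA-Z])\s)+" cannot expose every pair through .group(j), here is a minimal sketch. It assumes the sample string from the question; the variable names m and pairs are only illustrative.

```python
import re

string = "a b c d a s e "

# A capture group inside a repetition is overwritten on every iteration,
# so after the match only the last captured pair is still available.
m = re.search(r"^(([a-zA-Z])\s)+", string)
print(repr(m.group(1)))  # the last pair, not the first

# finditer produces one match object per occurrence, so every pair is kept.
pairs = [mo.group(0) for mo in re.finditer(r"[a-zA-Z]\s", string)]
print(repr(pairs[2]))    # third pair, using 0-based indexing
```

In other words, the number of groups is fixed by the pattern, not by how many times a group repeats, which is why findall() or finditer() is the usual way to collect every occurrence.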

Related

Extract Information with brackets using python

I got a badly managed log, and need to extract into a dictionary using Python.
# Pattern: (keys are not kw1, kw2 ,etc... no pattern in key)
"para1=a, kw2=b, (b, b=b), bb, kw3=c, t4=..."
# where
# - para1=a
# - kw2=b, (b, b=b), bb
# - kw3=c
# - and so on
# extract into a dict:
out = {"para1": "a", "kw2": "b, (b, b=b), bb", "kw3": "c", "t4": ...}
# Note several important features
'''
1. all 'kw=value' are joined with certain spliter: ', '
2. value and kw themselves may contain spliter. i.e. 'f(x, y)=3, f(x=3, y=2, z=1)=g(x=1, t=2)'
3. all brackets must be in pair, (therefore we can identify spliters in kw or value).
4. all message must be part of kw or value.
'''
Q1: Is there a regex (or some Python code) that helps me get the keys and values above?
There's no pattern in the keys; kwN is just a placeholder for a key.
Q2 (update: thanks to Laurent, I already know why this doesn't work): I got an unexpected result. ', (.*?)=' should give me the shortest match between ',' and '=', right?
msg = 'a, a, b=b, c=c'
re.findall(', (.*?)=', msg)
>>> ['a, b', 'c']
# I was expecting ['b','c']
# shouldn't ', (.*?)=' give me the shortest matching between ',' and '='? which is 'b' instead of 'a, b'
(New) Q3: Since I'm working with large volumes of data, efficiency is my first priority. I've written Python code that achieves the goal, but it doesn't feel quick enough. Could you help me make it better?
import re

def my_not_efficient_solution(msg):
    '''
    Notes:
    1. all 'kw=value' are joined with certain spliter: ', '
    2. value and kw themselves may contain spliter. i.e. 'f(x, y)=3, f(x=3, y=2, z=1)=g(x=1, t=2)'
    3. all brackets must be in pair, (therefore we can identify spliters in kw or value).
    4. all message must be part of kw or value.
    Solution:
    1. split message with spliter -> get entries
    2. check each entry's brackets and equal sign
    3. for each entry: append to last one or serve as part of next one or good with itself
    '''
    spliter = ', '
    eq_sign = ['=']
    first = False
    bracket_map = {'(': 1, ")": -1, "[": 1, "]": -1}
    # True when opening and closing brackets balance out
    pair_chk_func = lambda s: not sum([bracket_map.get(i, 0) for i in s])
    # non-zero when the text contains an equal sign
    eq_chk_func = lambda s: sum([i in s for i in eq_sign])
    assert pair_chk_func(msg), 'msg pair check fail.'
    res = msg.split(spliter)
    # step1: split entry
    entries = []
    do_pre = ''  # last entry is not complete(lack bracket)
    do_first = ''  # last entry is not complete(lack eq sign)
    while len(res) > 0:
        if first and len(entries) == 2:
            entries.pop(-1)
            break
        if do_first and len(entries) == 0:
            do = do_first + res.pop(0)
        else:
            do_first = ''
            do = res.pop(0)
        eq_chk = eq_chk_func(do_pre + do)
        pair_chk = pair_chk_func(do_pre + do)
        # case1: not valid entry, no eq sign
        # case2: previous entry not complete
        # case3: current entry not valid(no eq sign, will drop) and pair incomplete(may be part of next entry)
        if not eq_chk or do_pre:
            if len(entries) > 0:
                entries[-1] += spliter + do
                pair_chk = pair_chk_func(entries[-1])
                if pair_chk:
                    do_pre = ''
                else:
                    do_pre = entries[-1]
            elif not pair_chk:
                do_first = do
        # case4: current entry good to go
        elif eq_chk and pair_chk:
            entries.append(do)
            do_pre = ''
        # case5: current entry not complete(pair not complete)
        else:
            entries.append(do)
            do_pre = do
    # step2: split each into dict
    output = {}
    split_mark = '|'.join(eq_sign)
    for entry in entries:
        splits = re.split(split_mark, entry)
        if len(splits) < 2:
            raise ValueError('split fail for message')
        kw = splits.pop(0)
        while not pair_chk_func(kw):
            kw += '=' + splits.pop(0)
        output[kw] = '='.join(splits)
    return output
msg = 'B_=a, kw2=b, f(A=3, k=2)=g(t=3, v=5), mark[(blabla), f(xx tt)=33]'
my_not_efficient_solution(msg)
>>> {'B_': 'a',
'kw2': 'b',
'f(A=3, k=2)': 'g(t=3, v=5), mark[(blabla), f(xx tt)=33]'}

Answer to Q1:
Here is my suggestion:
import re
s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=..."
pattern = r'(?=(kw.)=(.*?)(?:, kw.=|$))'
result = dict(re.findall(pattern, s))
print(result) # {'kw1': 'a', 'kw2': 'b, (b, b=b), bb', 'kw3': 'c', 'kw4': '...'}
To explain the regex:
the (?=...) is a lookahead assertion to let you find overlapping matches
the ? in (.*?) makes the quantifier * (asterisk) non-greedy
the ?: makes the group (?:, kw.=|$) non-capturing
the |$ at the end accounts for the last value in your string
Answer to Q2:
No, this is wrong. The quantifier *? is non-greedy, but the regex engine still reports the leftmost match: it starts at the first ', ' and extends only as far as needed to reach an '='. Moreover, there is no search for overlapping matches, which could be done with (?=...). So your observed result is the expected one.
I suggest this simple solution:
msg = 'a, a, b=b, c=c'
result = re.findall(', ([^,]*?)=', msg)
print(result) # ['b', 'c']
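To make the "leftmost first, then lazy" behaviour visible, the following sketch prints where each match starts and ends. It reuses msg and the two patterns from this answer; the helper name show_spans is hypothetical.

```python
import re

msg = 'a, a, b=b, c=c'

def show_spans(pattern, text):
    """Return (start, end, captured text) for every match of pattern in text."""
    return [(m.start(), m.end(), m.group(1)) for m in re.finditer(pattern, text)]

# Lazy quantifier: the first match still begins at the first ', ' (index 1)
# and only stops growing once an '=' is reached.
lazy_spans = show_spans(r', (.*?)=', msg)

# Forbidding commas inside the group makes the attempt at index 1 fail,
# so the engine moves on to the next ', ' in the string.
no_comma_spans = show_spans(r', ([^,]*?)=', msg)

print(lazy_spans)
print(no_comma_spans)
```

The lazy quantifier only controls how far a match extends from a given starting point; it never causes the engine to skip a starting point that can succeed.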

Q1: Is there a regex expression that helps me get above key and value?
To get the key/value pairs into a dictionary, say your string is
s = "kw1=a, kw2=b, (b, b=b), bb, kw3=c, kw4=dd, kw10=jndn"
Using the following regex gives you the keys and values as a list of tuples
results = re.findall(r'(\bkw\d+)=(.*?)(?=,+\s*\bkw\d+=|$)', s)
[('kw1', 'a'), ('kw2', 'b, (b, b=b), bb'), ('kw3', 'c'), ('kw4', 'dd'), ('kw10', 'jndn')]
You can convert it to a dictionary as
dict(results)
Output :
{
'kw1': 'a',
'kw2': 'b, (b, b=b), bb',
'kw3': 'c',
'kw4': 'dd',
'kw10': 'jndn'
}
Explanation :
\b is a word boundary, so kw is matched only at the start of a word and not inside something like XYZkw
\bkw\d+= Match the word kw followed by 1+ digits and =
.*? (Lazy match) Match as few chars as possible
(?= Positive lookahead, assert to the right
,+\s*\bkw\d+= Match 1+ commas, optional whitespace chars, then kw, 1+ digits and =
| Or
$ Assert the end of the string for the last part
) Close the lookahead
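The regexes above assume that every key looks like kwN, while the question states that keys follow no pattern. As an alternative, here is a minimal sketch of a single-pass parser that tracks bracket depth. It is not taken from any of the answers; the names parse_pairs and top_level_eq are hypothetical, and it assumes the rules listed in the question: the separator is ', ', brackets are ( ) and [ ] and always balanced, and a chunk without a top-level '=' belongs to the previous value.

```python
def top_level_eq(piece):
    """Index of the first '=' that is not inside brackets, or -1 if none."""
    depth = 0
    for i, ch in enumerate(piece):
        if ch in '([':
            depth += 1
        elif ch in ')]':
            depth -= 1
        elif ch == '=' and depth == 0:
            return i
    return -1

def parse_pairs(msg, sep=', '):
    """Parse 'key=value' pairs, ignoring separators and '=' inside brackets."""
    # Pass 1: cut the message at separators that occur at bracket depth 0.
    pieces, depth, start, i = [], 0, 0, 0
    while i < len(msg):
        ch = msg[i]
        if ch in '([':
            depth += 1
        elif ch in ')]':
            depth -= 1
        elif depth == 0 and msg.startswith(sep, i):
            pieces.append(msg[start:i])
            i += len(sep)
            start = i
            continue
        i += 1
    pieces.append(msg[start:])

    # Pass 2: a piece with a top-level '=' starts a new pair;
    # any other piece is glued back onto the previous value.
    result, last_key = {}, None
    for piece in pieces:
        pos = top_level_eq(piece)
        if pos == -1:
            if last_key is None:
                raise ValueError('message does not start with key=value')
            result[last_key] += sep + piece
        else:
            last_key = piece[:pos]
            result[last_key] = piece[pos + 1:]
    return result

print(parse_pairs("para1=a, kw2=b, (b, b=b), bb, kw3=c"))
```

Each character is visited a small constant number of times, which may help with the efficiency concern in Q3, though that would need to be measured on real data.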

Splitting string using different scenarios using regex

I have 2 scenarios for splitting a string:
"##$hello?? getting good.<li>hii"
I want it to be split as
'hello','getting','good.<li>hii' (Scenario 1)
'hello','getting','good','li','hi' (Scenario 2)
Any ideas, please?

Something like this should work:
>>> re.split(r"[^\w<>.]+", s) # or re.split(r"[##$? ]+", s)
['', 'hello', 'getting', 'good.<li>hii']
>>> re.split(r"[^\w]+", s)
['', 'hello', 'getting', 'good', 'li', 'hii']
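The leading empty string appears because the input starts with delimiter characters. A small sketch of dropping it, assuming the sample string from the question is stored in s (the names parts and words are illustrative):

```python
import re

s = "##$hello?? getting good.<li>hii"

# re.split returns '' wherever the string begins or ends with a delimiter;
# filtering on truthiness removes those empty pieces.
parts = [p for p in re.split(r"[^\w<>.]+", s) if p]
words = [p for p in re.split(r"\W+", s) if p]

print(parts)
print(words)
```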

This might be what you're looking for: \w+ matches any letter, digit, or underscore one or more times, as many times as possible. Here is a working JavaScript example:
var value = "##$hello?? getting good.<li>hii";
var matches = value.match(
new RegExp("\\w+", "gi")
);
console.log(matches)
It works by using \w+, which matches word characters as many times as possible. You could also use [A-Za-z] to match only letters and not numbers, as shown here:
var value = "##$hello?? getting good.<li>hii777bloop";
var matches = value.match(
new RegExp("[A-Za-z]+", "gi")
);
console.log(matches)
It matches what is in the brackets one or more times, as many times as possible. In this case that is the range a-z of lowercase characters and the range A-Z of uppercase characters. Hope this is what you want.

For the first scenario, just use a regex to find all words made up of word characters and the characters <, > and .:
In [60]: re.findall(r'[\w<>.]+', s)
Out[60]: ['hello', 'getting', 'good.<li>hii']
For the second one, you need to replace the repeated characters only if the word is not a valid English word. You can do this using the nltk words corpus and a re.sub regex:
In [61]: import nltk
In [62]: english_vocab = set(w.lower() for w in nltk.corpus.words.words())
In [63]: repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
In [64]: [repeat_regexp.sub(r'\1\2\3', word) if word not in english_vocab else word for word in re.findall(r'[^\W]+', s)]
Out[64]: ['hello', 'getting', 'good', 'li', 'hi']

In case you are looking for a solution without regex: string.punctuation gives you a string containing all punctuation characters.
Use it with a list comprehension to get your desired result:
>>> import string
>>> my_string = '##$hello?? getting good.<li>hii'
>>> ''.join([(' ' if s in string.punctuation else s) for s in my_string]).split()
['hello', 'getting', 'good', 'li', 'hii'] # desired output
Explanation: step by step, it works as follows:
import string # Importing the 'string' module
special_char_string = string.punctuation
# Value of 'special_char_string': '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
my_string = '##$hello?? getting good.<li>hii'
# Generating list of character in sample string with
# special character replaced with whitespace
my_list = [(' ' if item in special_char_string else item) for item in my_string]
# Join the list to form string
my_string = ''.join(my_list)
# Split it based on space
my_desired_list = my_string.strip().split()
The value of my_desired_list will be:
['hello', 'getting', 'good', 'li', 'hii']

Program to make an acronym with a period in between each letter

I'm trying to write a program in Python PyScripter 3.3 that takes input and converts it into an acronym. This is what I'm looking for:
your input: center of earth
program's output: C.O.E.
I don't really know how to go about doing this. I am looking not just for the right answer, but for an explanation of why certain code is used. Thanks.
What I have tried so far:
def first_letters(lst):
    return [s[:1] for s in converted]
def main():
    lst = input("What is the phrase you wish to convert into an acronym?")
    converted = lst.split().upper()
Beyond here I am not really sure where to go. So far I know I need to capitalize the input and split it into separate words, but beyond that I'm not sure where to go.

I like Python 3.
>>> s = 'center of earth'
>>> print(*(word[0] for word in s.upper().split()), sep='.', end='.\n')
C.O.E.
s = 'center of earth' - Assign the string.
s.upper() - Make the string uppercase. This goes before split() because split() returns a list and upper() doesn't work on lists.
.split() - Split the uppercased string into a list.
for word in - Iterate through each element of the created list.
word[0] - The first letter of each word.
* - Unpack this generator and pass each element as an argument to the print function.
sep='.' - Specify a period to separate each printed argument.
end='.\n' - Specify a period and a newline to print after all the arguments.
print - Print it.
As an alternative:
>>> s = 'center of earth'
>>> '.'.join(filter(lambda x: x.isupper(), s.title())) + '.'
'C.O.E.'
s = 'center of earth' - Assign the string.
s.title() - Change the string to Title Case.
filter - Filter the string, retaining only those elements that are approved by a predicate (the lambda below).
lambda x: x.isupper() - Define an anonymous inline function that takes an argument x and returns whether x is uppercase.
'.'.join - Join all the filtered elements with a '.'.
+ '.' - Add a period to the end.
Note that this one returns a string instead of simply printing it to the console.

>>> import re
>>> s = "center of earth"
>>> re.sub('[a-z ]+', '.', s.title())
'C.O.E.'
>>> "".join(i[0].upper() + "." for i in s.split())
'C.O.E.'

Since you want an explanation and not just an answer:
>>> s = 'center of earth'
>>> s = s.split() # split it into words
>>> s
['center', 'of', 'earth']
>>> s = [i[0] for i in s] # get only the first letter of each word
>>> s
['c', 'o', 'e']
>>> s = [i.upper() for i in s] # convert the letters to uppercase
>>> s
['C', 'O', 'E']
>>> s = '.'.join(s) # join the letters into a string
>>> s
'C.O.E'
>>> s = s + '.' # add the dot at the end
>>> s
'C.O.E.'
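The steps above can be collected into one reusable function. This is only a sketch; the function name acronym is hypothetical and it assumes words are separated by whitespace.

```python
def acronym(phrase):
    """Return the upper-cased first letter of every word, each followed by a period."""
    words = phrase.split()  # split on any run of whitespace
    return ''.join(word[0].upper() + '.' for word in words)

print(acronym('center of earth'))
```

Because split() with no arguments never yields empty strings, word[0] is safe even when the phrase has leading, trailing, or repeated spaces.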

python parse string into individual characters

In Python 2.7 how do I parse 'abc' into 'a b c' for a very long string (like 1000 chars)?
Or how would I convert 'abccda' to '1 2 3 3 4 1'? (where each unique letter maps to a unique digit, 1-4)
I imagine I could pop the chars off, one by one, but I'm new to Python and wonder if there is a simple function that does it.

For the first one use join():
>>> s = 'abc'
>>> ' '.join(s)
'a b c'
For the second one:
>>> s = 'abccda'
>>> ' '.join(chr(ord(c)-ord('a')+ord('1')) for c in s)
'1 2 3 3 4 1'
or you could simply use a dictionary to map letters to numbers:
>>> s = 'abccda'
>>> d = dict(a=1, b=2, c=3, d=4)
>>> ' '.join(str(d[c]) for c in s)
'1 2 3 3 4 1'
And yet another way is to use string.translate():
>>> from string import maketrans
>>> s = 'abccda'
>>> ' '.join(s.translate(maketrans('abcd', '1234')))
'1 2 3 3 4 1'
translate() would be the preferred one since, as opposed to the naive dict lookup, it handles unmapped characters without errors:
>>> s='abcdefgh'
>>> ' '.join(s.translate(maketrans('abcd', '1234')))
'1 2 3 4 e f g h'

import re
x = "abc"
print(re.sub(r"(?<!^)(.)", r" \1", x))
For a simple conversion you can try this. For a mapping, you can define your own repl function and pass it to re.sub. An example:
def repl(matchobj):
    if matchobj.group() == 'b':
        return " " + str(1)
    elif matchobj.group() == 'c':
        return " " + str(2)
    # any other character is kept as it is, preceded by a space
    return " " + matchobj.group()
x = "abc"
print(re.sub(r"(?<!^)(.)", repl, x))

Do you mean the list method?
s='abccda'
list(s) # ['a', 'b', 'c', 'c', 'd', 'a']

To convert each letter into a number, you can use str.translate. This is probably overkill in this simple case, but it's worth learning.
The details are different in Python 2 and Python 3.
For Python 3, you can just use a mapping from Unicode ordinals to replacement strings, like this:
mapping = {ord(letter): str(number) for number, letter in enumerate(string.ascii_lowercase[:4], 1)}
translated = x.translate(mapping)
For Python 2, you need a special translation table, which in this case is a little less convenient (and will only let you translate characters to single characters, not to arbitrary strings like the Python 3 version—not a problem here, but if you wanted to convert 'j' to '10' it wouldn't work):
mapping = string.maketrans(string.ascii_lowercase[:4],
                           ''.join(str(i) for i in range(1, 5)))
translated = x.translate(mapping)
Then, to add spaces, use mhawke's solution:
result = ' '.join(translated)
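If the letters are not known in advance, one possible reading of "each unique letter maps to a unique digit" is to number the letters in order of first appearance. The sketch below makes that assumption; the function name encode is hypothetical, and the code runs on both Python 2.7 and Python 3.

```python
def encode(s):
    """Replace each character by a number assigned on its first appearance."""
    codes = {}  # character -> number as a string
    out = []
    for ch in s:
        if ch not in codes:
            codes[ch] = str(len(codes) + 1)  # next unused number
        out.append(codes[ch])
    return ' '.join(out)

print(encode('abccda'))
```

Unlike the fixed 'abcd' to '1234' table, this handles any alphabet, and it makes a single pass over the string, so 1000 characters is not a concern.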

python regular expression, pulling all letters out

Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print num[0]
print num[len(num)-1]
output
A
F
note: the digits are of unknown length

You don't have to use regular expressions, or re at all. Assuming you want just the letters to remain, you could do something like this (the join makes it return a string on both Python 2 and Python 3):
a = "A13:F20"
a = ''.join(filter(str.isalpha, a))  # 'AF'

I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']
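If the references can contain multi-letter columns such as AA13, matching single letters would split them apart. A small variation, assuming that case matters for your data (the function name column_letters is hypothetical):

```python
import re

def column_letters(cell_range):
    """Return each run of consecutive letters, e.g. the column parts of a cell range."""
    return re.findall(r'[A-Za-z]+', cell_range)

print(column_letters('A13:F20'))
print(column_letters('AA1:BC200'))
```

The + keeps adjacent letters together, and the digits can be of any length because they are simply skipped.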

Use a simple list comprehension as a filter to keep only the letters from the string.
print [char for char in input_string if char.isalpha()]
# ['A', 'F']

You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>

If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print a[0], a[4]
More on python slicing in this answer:
Is there a way to substring a string in Python?
