Python Regex is not matching the first line - python

I have a text file and the content is,
Submitted By,Assigned,Closed
Name1,10,5
Name2,20,10
Name3,30,15
I have written a Regex Pattern, to extract the value between first , and second ,
^\w+,(\w+),.*$
My Python code is
import re
f=r'sample.txt'
rePat = re.compile('^\w+,(\w+),.*$', re.MULTILINE)
text = open(f, 'r').read()
output = re.findall(rePat, text)
print (f)
print (output)
Expected Output:
Assigned
10
20
30
But I am getting
10
20
30
Why it is missing the first line?

The problem is that \w+ matches only word chars (basically, letters, digits, underscores and also some diacritics), and the first field of the header line, Submitted By, contains a space. \w+ therefore never reaches the first comma on that line, so the line is skipped. I suggest matching any chars between commas with [^,\n]+ (the \n here is to make sure we stay within the same line).
You can use
rePat = re.compile(r'^[^,\n]+,([^,\n]+),.*$', re.MULTILINE)
Or, a bit simplified if you do not need to extract anything else:
rePat = re.compile(r'^[^,\n]+,([^,\n]+)', re.MULTILINE)
See this regex demo. Details:
^ - start of a line
[^,\n]+ - one or more chars other than , and LF
, - a comma
([^,\n]+) - Group 1: one or more chars other than , and LF.
See a Python demo:
import re
text = r"""Submitted By,Assigned,Closed
Name1,10,5
Name2,20,10
Name3,30,15"""
rePat = re.compile('^[^,\n]+,([^,\n]+),.*$', re.MULTILINE)
output = re.findall(rePat, text)
print (output)
# => ['Assigned', '10', '20', '30']
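To see why the header line is skipped, the two patterns can be compared on single lines. This is only a small sketch; the variable names are illustrative and not part of the original code.

```python
import re

header = "Submitted By,Assigned,Closed"
row = "Name1,10,5"

original = re.compile(r'^\w+,(\w+),.*$')
fixed = re.compile(r'^[^,\n]+,([^,\n]+)')

# \w+ stops at the space in "Submitted By", so the first comma is never reached.
header_original = original.match(header)
row_original = original.match(row)
header_fixed = fixed.match(header)

print(header_original)          # None
print(row_original.group(1))    # 10
print(header_fixed.group(1))    # Assigned
```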

You could add matching optional spaces and word characters after the first \w+ to match till the first comma.
^\w+(?: \w+)*,(\w+),.*$
^ Start of a line (because of re.MULTILINE)
\w+ Match 1+ word chars
(?: \w+)* Optionally repeat matching a space and 1+ word chars
,(\w+), Match a comma and capture 1+ word chars in group 1
.*$ ( You could omit this part)
import re
f = r'sample.txt'
rePat = re.compile(r'^\w+(?: \w+)*,(\w+),.*$', re.MULTILINE)
text = open(f, 'r').read()
output = re.findall(rePat, text)
print(output)
Output
['Assigned', '10', '20', '30']
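If the file is ordinary comma-separated data, a non-regex alternative is the standard csv module. This is a different technique from the answers above, shown as a sketch; io.StringIO stands in for the real file so the example runs on its own.

```python
import csv
import io

# Stand-in for open('sample.txt', newline=''); content copied from the question.
sample = io.StringIO(
    "Submitted By,Assigned,Closed\n"
    "Name1,10,5\n"
    "Name2,20,10\n"
    "Name3,30,15\n"
)

# Take the second field of every row that has at least two fields.
second_column = [row[1] for row in csv.reader(sample) if len(row) > 1]
print(second_column)  # ['Assigned', '10', '20', '30']
```

Unlike the regex versions, csv.reader also copes with quoted fields that contain commas.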

Related

Python Split Regex not split what I need

I have this in my file
import re
sample = """Name: #s
Owner: #a[tag=Admin]"""
target = r"#[sae](\[[\w{}=, ]*\])?"
regex = re.split(target, sample)
print(regex)
I want to split all words that start with #, so like this:
["Name: ", "#s", "\nOwner: ", "#a[tag=Admin]"]
But instead it gives this:
['Name: ', None, '\nOwner: ', '[tag=Admin]', '']
How can I separate them correctly?
I would use re.findall here:
sample = """Name: #s
Owner: #a[tag=Admin]"""
parts = re.findall(r'#\w+(?:\[.*?\])?|\s*\S+\s*', sample)
print(parts) # ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
The regex pattern used here says to match:
#\w+ a tag #some_tag
(?:\[.*?\])? followed by an optional [...] term
| OR
\s*\S+\s* any other non whitespace term,
including optional whitespace on both sides
If I understand the requirements correctly you could do that as follows:
import re
s = """Name: #s
Owner: #a[tag=Admin]
"""
rgx = r'(?=#.*)|(?=\r?\n[^#\r\n]*)'
# splitting on zero-width matches requires Python 3.7+
print(re.split(rgx, s))
#=> ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]', '\n']
The regular expression can be broken down as follows.
(?= # begin a positive lookahead
#.* # match '#' followed by >= 0 chars other than line terminators
) # end positive lookahead
| # or
(?= # begin a positive lookahead
\r?\n # match a line terminator
[^#\r\n]* # match >= 0 characters other than '#' and line terminators
) # end positive lookahead
Notice that matches are zero-width.
re.split expects the regular expression to match the delimiters in the string. The delimiters themselves are dropped from the result, except for the text of any capturing groups. In your regex, the only capturing group is the optional bracketed part, so you get [tag=Admin] where it is present and None where it is not.
If you want the whole delimiter to show up in the list, put parentheses around the whole regex:
target = r"(#[sae](\[[\w{}=, ]*\])?)"
But you are probably better off not capturing the interior group. You can change it to a non-capturing group by using (?:…) instead of (…):
target = r"(#[sae](?:\[[\w{}=, ]*\])?)"
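The difference between the two ways of grouping can be checked directly. This is a sketch using the sample from the question; the final filtering step is an addition, since re.split leaves an empty string when the text ends with a delimiter.

```python
import re

sample = """Name: #s
Owner: #a[tag=Admin]"""

# Inner group only: each delimiter is replaced by that group's text (or None).
inner_only = re.split(r"#[sae](\[[\w{}=, ]*\])?", sample)

# Whole delimiter captured, inner group made non-capturing.
whole = re.split(r"(#[sae](?:\[[\w{}=, ]*\])?)", sample)

# Drop the empty string produced by the delimiter at the very end.
parts = [p for p in whole if p]

print(inner_only)  # ['Name: ', None, '\nOwner: ', '[tag=Admin]', '']
print(parts)       # ['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
```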
In your output, you keep the [tag=Admin] as that part is in a capture group, and using split can also return empty strings.
Another option is to be specific about the allowed data format, and instead of split capture the parts in 2 groups.
(\s*\w+:\s*)(#[sae](?:\[[\w{}=, ]*])?)
The pattern matches:
( Capture group 1
\s*\w+:\s* Match 1+ word characters and : between optional whitespace chars
) Close group
( Capture group 2
#[sae] Match # followed by either s a e
(?:\[[\w{}=, ]*])? Optionally match [...]
) Close group
Example code:
import re
sample = """Name: #s
Owner: #a[tag=Admin]"""
target = r"(\s*\w+:\s*)(#[sae](?:\[[\w{}=, ]*])?)"
listOfTuples = re.findall(target, sample)
lst = [s for tpl in listOfTuples for s in tpl]
print(lst)
Output
['Name: ', '#s', '\nOwner: ', '#a[tag=Admin]']
See a regex demo and a Python demo.

Extracting a complex substring using regex with data from a string in python

I have a string say
text = "i have on 31-Dec-08 USD 5234765 which I gave it in the donation"
I tried:
pattern = r"^[\d]{2}.*,[\d]{3}$"
data = re.findall(pattern, text)
for s in data:
    print(s)
my desired output :
[31-Dec-08, USD, 5234765]
You can do it this way:
import re
regex = r"(\w+-\w+-\w+)|([A-Z]{3})|(\d+)"
test_str = "i have on 31-Dec-08 USD 5234765 which I gave it in the donation"
matches = re.findall(regex, test_str)
temp = [_ for tupl in matches for _ in tupl if _]
print(temp) #['31-Dec-08', 'USD', '5234765']
\w matches any word character (equivalent to [a-zA-Z0-9_] only with re.ASCII; by default, Python 3 str patterns also match Unicode letters and digits)
+ matches the previous token one or more times, as many times as possible, giving back as needed (greedy)
- matches the character - literally
[A-Z]{3} matches exactly three uppercase ASCII letters
\d matches a digit (equivalent to [0-9] with re.ASCII; by default, any Unicode decimal digit)
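If the three values always appear in this order, a stricter pattern with named groups is another option. This is a sketch that assumes a DD-Mon-YY date, a three-letter currency code and an integer amount; the group names are illustrative.

```python
import re

text = "i have on 31-Dec-08 USD 5234765 which I gave it in the donation"

# Assumed layout: DD-Mon-YY date, 3-letter currency code, integer amount.
pattern = re.compile(
    r"(?P<date>\d{2}-[A-Za-z]{3}-\d{2})\s+"
    r"(?P<currency>[A-Z]{3})\s+"
    r"(?P<amount>\d+)"
)

match = pattern.search(text)
fields = list(match.groups()) if match else []
print(fields)  # ['31-Dec-08', 'USD', '5234765']
```

Because each group has a fixed shape, stray capitalised words or numbers elsewhere in the sentence are not picked up, and match.group('amount') gives direct access to a single field.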

remove n before a string

I want to remove unrequired r and n at beginning of each upper-case word and number in this string. I tried regex. Not sure if regex or some other method would be helpful here.
This is the code I am trying to use:
text = "nFamily n49 new nTom"
regex_pattern = re.compile(r'.*n[A-Z][a-z]*|[0-9]*\s')
matches = regex_pattern.findall(text)
for match in matches:
    text = text.replace(match," ")
print(text)
Expected output:
Family 49 new Tom
You can use
text = re.sub(r'\bn(?=[A-Z0-9])', '', text)
See the regex demo.
Details:
\b - here, start of a word
n - the letter n
(?=[A-Z0-9]) - a positive lookahead that requires an uppercase ASCII letter or a digit to be present immediately to the right of the current location.
See the Python demo:
import re
rx = r"\bn(?=[A-Z0-9])"
text = "nFamily n49 new nTom"
print( re.sub(rx, '', text) )
# => Family 49 new Tom
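A few extra inputs show what the word boundary and the lookahead each rule out. This is a sketch; strip_leading_n is a hypothetical helper name wrapping the substitution above.

```python
import re

def strip_leading_n(text):
    # Hypothetical helper: remove a word-initial "n" only when an
    # uppercase ASCII letter or a digit follows it.
    return re.sub(r"\bn(?=[A-Z0-9])", "", text)

cases = ["nFamily n49 new nTom", "banana nX", "n nine"]
results = [strip_leading_n(c) for c in cases]
print(results)  # ['Family 49 new Tom', 'banana X', 'n nine']
```

The n inside banana is not at a word boundary, and the n in nine is followed by a lowercase letter, so both are left alone.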

regex to remove every hyphen except between two words

I am cleaning a text and I would like to remove all the hyphens and special characters. Except for the hyphens between two words such as: tic-tacs, popcorn-flavoured.
I wrote the below regex but it removes every hyphen.
text='popcorn-flavoured---'
new_text=re.sub(r'[^a-zA-Z0-9]+', '',text)
new_text
I would like the output to be:
popcorn-flavoured
You can replace matches of the regular expression
-(?!\w)|(?<!\w)-
with empty strings.
The regex will match hyphens that are not both preceded and followed by a word character.
Python's regex engine performs the following operations.
- match '-'
(?!\w) the following character is not a word character
|
(?<!\w) the previous character is not a word character
- match '-'
(?!\w) is a negative lookahead; (?<!\w) is a negative lookbehind.
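The two lookarounds can be tried on a few strings. This is a sketch; the sample inputs other than the first are made up for illustration.

```python
import re

# Remove a hyphen unless it has a word character on both sides.
pattern = re.compile(r"-(?!\w)|(?<!\w)-")

samples = ["popcorn-flavoured---", "-tic-tacs-", "--a-b--"]
cleaned = [pattern.sub("", s) for s in samples]
print(cleaned)  # ['popcorn-flavoured', 'tic-tacs', 'a-b']
```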
As an alternative, you could capture a hyphen between word characters and keep that group in the replacement. Using an alternation, you could match the hyphens that you want to remove.
(\w+-\w+)|-+
Explanation
(\w+-\w+) Capture group 1, match 1+ word chars, hyphen and 1+ word chars
| Or
-+ Match 1+ times a hyphen
Example code
import re
regex = r"(\w+-\w+)|-+"
test_str = ("popcorn-flavoured---\n"
"tic-tacs")
result = re.sub(regex, r"\1", test_str)
print (result)
Output
popcorn-flavoured
tic-tacs
You can use findall() to get that part that matches your criteria.
new_text = re.findall(r'[\w]+[-]?[\w]+', text)[0]
Play around with it with other inputs.
You can use
p = re.compile(r"(\b[-]\b)|[-]")
result = p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
Test
With:
text='popcorn-flavoured---'
Output (result):
popcorn-flavoured
Explanation
This pattern detects hyphens between two words:
(\b[-]\b)
This pattern detects all hyphens
[-]
Regex substitution
p.sub(lambda m: (m.group(1) if m.group(1) else ""), text)
When hyphen detected between two words m.group(1) exists, so we maintain things as they are
else "")
Occurs when the pattern was triggered by [-] then we substitute a "" for the hyphen removing it.

Python multiline regex delimiter

Having this multiline variable:
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
, TEST.RAW_2
, TEST.RAW_3
, TEST.RAW_4
PARALLEL = 4
'''
The structure is always TAG = CONTENT, both strings are NOT fixed and CONTENT could contain new lines.
I need a regex to get:
[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1\n , TEST.RAW_2\n , TEST.RAW_3\n , TEST.RAW_4\n'), ('PARALLEL', '4')]
Tried multiple combinations but I'm not able to stop the regex engine at the right point for TABLES tag as its content is a multiline string delimited by the next tag.
Some attempts from the interpreter:
>>> re.findall(r'(\w+?)\s=\s(.+?)', raw, re.DOTALL)
[('CONTENT', 'A'), ('TABLES', 'T'), ('PARALLEL', '4')]
>>> re.findall(r'^(\w+)\s=\s(.+)?', raw, re.M)
[('CONTENT', 'ALL'), ('TABLES', 'TEST.RAW_1'), ('PARALLEL', '4')]
>>> re.findall(r'(\w+)\s=\s(.+)?', raw, re.DOTALL)
[('CONTENT', 'ALL\nTABLES = TEST.RAW_1\n , TEST.RAW_2\n , TEST.RAW_3\n , TEST.RAW_4\nPARALLEL = 4\n')]
Thanks!
You can use a positive lookahead to make sure you lazily match the value correctly:
(\w+)\s=\s(.+?)(?=$|\n[A-Z])
               ^^^^^^^^^^^^^
To be used with a DOTALL modifier so that a . could match a newline symbol. The (?=$|\n[A-Z]) lookahead will require .+? to match up to the end of string, or up to the newline followed with an uppercase letter.
See the regex demo.
An alternative, faster regex (as it is an unrolled version of the expression above) - but the DOTALL modifier should NOT be used with it:
(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)
See another regex demo
Explanation:
(\w+) - Group 1 capturing 1+ word chars
\s*=\s* - a = symbol wrapped with optional (0+) whitespaces
(.*(?:\n(?![A-Z]).*)*) - Group 2 capturing:
.* - any 0+ characters other than a newline
(?:\n(?![A-Z]).*)* - 0+ sequences of:
\n(?![A-Z]) - a newline symbol not followed with an uppercase ASCII letter
.* - any 0+ characters other than a newline
Python demo:
import re
p = re.compile(r'(\w+)\s=\s(.+?)(?=$|\n[A-Z])', re.DOTALL)
raw = '''
CONTENT = ALL
TABLES = TEST.RAW_1
, TEST.RAW_2
, TEST.RAW_3
, TEST.RAW_4
PARALLEL = 4
'''
print(p.findall(raw))
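The unrolled variant can be tried the same way, this time without re.DOTALL. This is a sketch on a shortened input; the indentation of the continuation line is an assumption, and stripping the values is an optional extra step.

```python
import re

# Shortened version of the question's input; the continuation-line
# indentation is an assumption.
raw = (
    "CONTENT = ALL\n"
    "TABLES = TEST.RAW_1\n"
    "  , TEST.RAW_2\n"
    "PARALLEL = 4\n"
)

# No re.DOTALL here: each .* must stop at the end of its line.
unrolled = re.compile(r"(\w+)\s*=\s*(.*(?:\n(?![A-Z]).*)*)")

pairs = unrolled.findall(raw)
# The last value keeps the trailing newline, so strip each value.
settings = {tag: value.strip() for tag, value in pairs}
print(settings)
```

Both variants rely on every new tag starting with an uppercase ASCII letter at the beginning of a line; a continuation line that starts with an uppercase letter would be cut off.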
