How to parse a single line in regex? - python

I want to parse the first line of the following paragraph using regex.
text=" My name is Raj
and I am an engineer."
Can someone help me with a regex to fetch the statement ("My name is Raj") from the paragraph?

The general pattern for parsing a single line in regex is to match a specific pattern of characters that define the line, followed by a capturing group that captures the contents of the line. For example, if you wanted to parse a line that starts with "Hello," followed by a string of characters, you could create a regex pattern like so:
/^Hello,(.*)$/
The ^ character indicates the start of the line and the $ character indicates the end of the line. The .* pattern matches any characters that come after "Hello,". Finally, the parentheses around the .* create a capturing group, which captures the rest of the line. In Python, drop the / delimiters and pass re.MULTILINE so that ^ and $ anchor at line breaks rather than only at the ends of the string.
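As a minimal sketch of the pattern above in Python (the sample string is made up for illustration), the capturing group can be read back with group(1):

```python
import re

# Hypothetical sample text; only the first line starts with "Hello,".
text = "Hello, world\nand a second line."

# re.MULTILINE makes ^ and $ match at the start and end of each line.
match = re.search(r"^Hello,(.*)$", text, flags=re.MULTILINE)

# group(1) holds whatever the (.*) group captured after "Hello,".
greeting = match.group(1).strip() if match else None
print(greeting)
```

Without re.MULTILINE the same pattern would not match this sample, because $ would then only match at the end of the whole string and . does not cross a newline.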

To get the first line:
text=''' My name is Raj
and I am an engineer.'''
# Take everything before the first newline and drop the leading space
s=text.split('\n')[0].strip()
print(s)
#Output:
#My name is Raj
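Since the question asks for a regex, here is one possible sketch that does the same with re.match. It assumes that leading spaces or tabs on the first line should be dropped, which the question implies but does not state.

```python
import re

text = ''' My name is Raj
and I am an engineer.'''

# re.match anchors at the start of the string. [ \t]* skips leading blanks,
# and (.*) stops at the first newline because "." does not match "\n".
match = re.match(r"[ \t]*(.*)", text)
first_line = match.group(1)
print(first_line)
```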

Related

Regex Must Match a Word (not to replace) AND a Pattern (to replace) in a Line

With regex (PCRE or sed, or Python if you specify it), I want to remove every single-letter item from a comma-separated list (something like /,.,/g), but only on lines that contain the word "Labels:".
So for example in these lines:
Labels: K,ltemittel,System,j,Vakuum,s
Another tags: a,b,xxx,c,yyy,z
to
Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z
What I've tried:
non-capturing group ("Labels:" still also getting replaced)
lookahead and lookbehind (cannot use greedy)
grouping /(Labels:)*(,.,) (also capturing the non "Labels:")
You could potentially use:
(?i)(^(?!Labels:).*)|\b[a-z],|,[a-z]\b
(?i) - Set case-insensitive matching 'on';
( - Open 1st capture group;
^ - Start string anchor;
(?!Labels:) - Assert position is not followed by 'Labels:';
.* - Match (Greedy) 0+ characters other than newline;
) - Close 1st capture group;
| - Or;
\b[a-z], - Match a word-boundary followed by a single letter and a comma;
| - Or;
,[a-z]\b - Match a comma followed by a single letter and a word-boundary.
Now replace it with your 1st capture group.
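A minimal sketch of that replacement with Python's re.sub, using the two sample lines from the question. It assumes re.MULTILINE so that ^ anchors at every line, and relies on Python 3.5+ substituting an empty string for a group that did not take part in the match.

```python
import re

lines = """Labels: K,ltemittel,System,j,Vakuum,s
Another tags: a,b,xxx,c,yyy,z"""

# Group 1 swallows whole lines that do NOT start with "Labels:".
# The other two alternatives match single-letter items on the remaining lines.
pattern = re.compile(r"(?i)(^(?!Labels:).*)|\b[a-z],|,[a-z]\b", re.MULTILINE)

# \1 restores untouched lines; for single-letter items group 1 is unset,
# so they are replaced by an empty string.
cleaned = pattern.sub(r"\1", lines)
print(cleaned)
```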
Using sed
$ sed '/Labels:/{s/,[A-Za-z]\>//g;s/\<[A-Za-z],//}' input_file
Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z
Explanation (Added By Tripleee)
It looks for a comma followed by an alphabetic character and an end-of-word boundary, i.e. a label after a comma that is a single letter, and removes it. Then it removes any remaining single-letter label immediately before a comma by similar logic.
Another variation using gnu-awk.
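For a line that starts with Labels:, replace either a comma followed by a single letter (a-z or A-Z) and a word boundary, or a word boundary followed by a single letter and a comma, with an empty string. The \y word-boundary operator is specific to gnu-awk.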
awk '/^Labels:/{gsub(/,[a-zA-Z]\y|\y[a-zA-Z],/, "")};1' file
Output
Labels: ltemittel,System,Vakuum
Another tags: a,b,xxx,c,yyy,z
As you have tagged Python and pcre, another option is to use the \G anchor: match Labels: at the start of the string, and capture in group 1 what you want to keep.
(?:^Labels:\h*|\G(?!^))\K(?:([^\s,]{2,}(?:,(?![a-z]$))?)|,?[a-z],?)
In Python this needs the regex module from PyPI, because the built-in re module does not support \G or \K.
Using perl:
perl -lpe 's/(?:,[^,](?=,|$))+//g if s/^Labels:\s*\K(?:[^,](?:,|$))*//' file
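The condition (the substitution after "if") succeeds on any line that starts with "Labels:" (the prefix itself is \Kept) and removes any leading single-character items. Only on those lines does the other substitution run, removing all remaining single-character items. This assumes that the "Labels:" part cannot contain single characters separated by commas.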
$ cat file
Labels: K,ltemittel,a System z,j,Vakuum,s
Another tags: a,b,xxx,c,yyy,z
$ perl -lpe 's/(?:,[^,](?=,|$))+//g if s/^Labels:\s*\K(?:[^,](?:,|$))*//' file
Labels: ltemittel,a System z,Vakuum
Another tags: a,b,xxx,c,yyy,z
Note: System was changed to a System z in the above test. Solutions that rely on matching spaces or word boundaries may not deal with this input correctly.
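For Python, one way to sidestep the word-boundary problem described in the note is to skip regex and split on commas instead. This is a swapped-in technique, not a translation of the perl one-liner, and the helper name strip_single_char_labels is hypothetical. It assumes an item counts as "single character" when it has length 1 after trimming spaces.

```python
def strip_single_char_labels(line: str, prefix: str = "Labels:") -> str:
    """Drop one-character items from a comma-separated "Labels:" line."""
    if not line.startswith(prefix):
        return line  # other lines are left untouched
    items = line[len(prefix):].strip().split(",")
    # Keep only items that are not exactly one character long.
    kept = [item for item in items if len(item.strip()) != 1]
    return f"{prefix} " + ",".join(kept)


labels = strip_single_char_labels("Labels: K,ltemittel,a System z,j,Vakuum,s")
other = strip_single_char_labels("Another tags: a,b,xxx,c,yyy,z")
print(labels)
print(other)
```

Because the item "a System z" is compared as a whole, its inner single letters survive.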
This might work for you (GNU sed):
sed -E '/Labels/{s/( )\S,|(,)\S,|,\S$/\1\2/g;s//\1\2/g}' file
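If a line contains Labels, the substitution tries three alternatives: a single character after a space, a single character between commas, or a single character after the final comma. For the first two, the match is replaced by the captured space or comma (back references \1 and \2). The substitution is then repeated with the same regex (the empty pattern //) to catch overlapping matches.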

python regex expression to match (first multipart or simple part) rar archive

I would like to match either
the first element of a multipart rar archive,
regex (.*.)part0*1.rar
or
a single-part rar archive,
i.e. a name that does not match ^.*(part\d+).rar$
I use this regex:
regex = r"(.*)(?:part0*1|.*[^(part\d+)])\.rar"
I've got some issues:
apps.rar matches, but apps2.rar doesn't match and should
LA460.6.7.rar doesn't match and should
apps.rar should match with group(1)="apps", not group(1)="app"
You can check the snippet on regex101.
Could you find the error in the regex?
Thanks
The reason that you sometimes match the last character is because the pattern (.*)(?:part0*1|.*[^(part\d+)])\.rar that you tried, first captures the whole line in capture group 1.
That capture group is followed by an alternation matching either part0*1 or .*[^(part\d+)]
You can see that the lines that have part followed by a digit at the end are matched.
But, when there is no match for part0*1 the next alternative is tried which is .*[^(part\d+)].
The second alternative matches until the end of the string (where it already is), and then matches a single character of [^(part\d+)] because using the square brackets makes it a character class without a quantifier.
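The effect described above can be reproduced with a short sketch. It shows that [^(part\d+)] behaves as a negated set of single characters and that this is what shortens group 1 for apps.rar.

```python
import re

original = r"(.*)(?:part0*1|.*[^(part\d+)])\.rar"

# The bracket expression is a character class: it matches exactly ONE
# character that is not "(", "p", "a", "r", "t", a digit, "+" or ")".
class_accepts_s = re.fullmatch(r"[^(part\d+)]", "s") is not None
class_accepts_p = re.fullmatch(r"[^(part\d+)]", "p") is not None

# The class consumes the final "s" of "apps", so group 1 loses it.
captured = re.match(original, "apps.rar").group(1)
print(class_accepts_s, class_accepts_p, captured)
```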
One option is a negative lookahead asserting that the string does not contain part followed by optional zeroes and then either a digit 2-9 with optional further digits, or a digit 1-9 followed by one or more digits.
^(?!.*part0*(?:[2-9]\d*|[1-9]\d+)\.rar)(.+)\.rar$
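A quick check of this pattern against the file names from the question. The movie.partNN.rar names are made-up examples added to exercise the multipart case.

```python
import re

FIRST_OR_SINGLE = re.compile(r"^(?!.*part0*(?:[2-9]\d*|[1-9]\d+)\.rar)(.+)\.rar$")

names = [
    "apps.rar",
    "apps2.rar",
    "LA460.6.7.rar",
    "movie.part01.rar",  # hypothetical: first volume, should match
    "movie.part02.rar",  # hypothetical: later volume, should not match
    "movie.part10.rar",  # hypothetical: later volume, should not match
]

results = {}
for name in names:
    m = FIRST_OR_SINGLE.match(name)
    # Store group 1 (the name without ".rar") or None when rejected.
    results[name] = m.group(1) if m else None

for name, stem in results.items():
    print(name, "->", stem)
```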
You can search for filenames that "Either have word 'part' followed by 01/1 or don't have the word 'part' at all"
Please try the regex below:
(.*part0?1|^(?!.*part.*).*)\.rar

Python regex to identify capitalised single word lines in a text abstract

I am looking for a way to extract words from text if they match the following conditions:
1) are capitalised
and
2) appear on a new line on their own (i.e. no other text on the same line).
I am able to extract all capitalised words with this code:
caps=re.findall(r"\b[A-Z]+\b", mytext)
but can't figure out how to implement the second condition. Any help will be greatly appreciated.
You can use the re.MULTILINE flag to make ^ and $ match the beginning and the end of a line, rather than the beginning and the end of a string:
re.findall(r"^[A-Z]+$", mytext, flags=re.MULTILINE)
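To see the difference between the two calls, here is a small sketch with a made-up abstract (mytext is illustrative only):

```python
import re

# Hypothetical abstract: two headings on their own lines, one inline acronym.
mytext = "INTRODUCTION\nThis abstract mentions NASA inline.\nMETHODS\nMore text here."

# Every capitalised word, wherever it appears.
all_caps = re.findall(r"\b[A-Z]+\b", mytext)

# Only capitalised words that fill a whole line.
own_line = re.findall(r"^[A-Z]+$", mytext, flags=re.MULTILINE)

print(all_caps)
print(own_line)
```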
Alternatively, try putting \r\n at the beginning of your regular expression so that it only matches directly after a line break.

How to write a Regex for centrally aligned upper case words?

            THE COMPANY
ABCD is a new company in the field of Marketing. bla bla bla.
            DESCRIPTION
xyz....
            SUMMARY
asdf.......
For text like the above, I want to pick up headings like THE COMPANY and SUMMARY. I would like to write the code in Python. I have been trying to use regex but have not found a way to write a pattern that matches centrally aligned words. I am open to any other method as long as it can be implemented in Python.
If I understand you correctly, you want to match lines composed of some sort of indentation, followed by some number of upper case words.
If so, the following regex should do the trick:
(?m)^(?: +)[A-Z\s]+$
Let's take that piece by piece.
(?m) tells the regex matcher to treat ^ and $ as the beginning and end of the line instead of the beginning and end of the string.
^ matches the beginning of the line.
(?: +) is a non-capturing group of one or more spaces. Note that a non-capturing group only avoids creating a numbered group; the spaces are still part of the overall match, so strip them from the results if you only want the text. To accept any whitespace as indentation, use \s+ instead of " +"; if you prefer tabs, use \t+.
[A-Z\s]+ matches one or more uppercase letters or whitespace characters. Because \s also matches newlines, use [A-Z ]+ if a match must stay on one line.
$ matches the end of the line.
Putting it all together (and into Python) we get:
import re
headers = re.findall(r'(?m)^(?: +)[A-Z\s]+$', your_string)
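A variant sketch that returns the headings without their indentation by wrapping the text in a capturing group. It assumes headings are indented with spaces or tabs and contain only uppercase letters and single-line spaces; the indentation in the sample is an assumption, since the question does not say how wide it is.

```python
import re

# Sample from the question, with assumed indentation on the headings.
sample = (
    "            THE COMPANY\n"
    "ABCD is a new company in the field of Marketing. bla bla bla.\n"
    "            DESCRIPTION\n"
    "xyz....\n"
    "            SUMMARY\n"
    "asdf.......\n"
)

# [ \t]+ requires indentation, the group captures the heading itself,
# and the trailing [ \t]*$ tolerates spaces before the line end.
HEADING = re.compile(r"^[ \t]+([A-Z][A-Z ]*?)[ \t]*$", re.MULTILINE)

headings = HEADING.findall(sample)
print(headings)
```

With one capturing group, re.findall returns only the captured text, so no separate stripping step is needed.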
I prefer to use re.match:
import re
example = '''            THE COMPANY
ABCD is a new company in the field of Marketing. bla bla bla.
            DESCRIPTION
xyz....
            SUMMARY
asdf.......'''
headlines = []
for line in example.split('\n'):
    # A headline is indented by at least 4 whitespace characters.
    m = re.match(r'^\s{4,}([A-Z0-9 \t\._-]+)', line)
    if m:
        headlines.append(m.group(1))
print(headlines)
Another way is re.findall:
headlines = [x.lstrip(' \n') for x in re.findall(r'^\s{4,}[A-Z0-9 \t\._-]+', example, re.M)]
print(headlines)

Python regular expressions non-greedy match

I have this code:
import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>.*?text.*?</b>', a).group())
and I am trying to match a minimal block between <b> and </b> which contains 'text' anywhere in between. This code is the best I could come up with, but it matches:
<b>1234</b><b>56text78</b>
while I need:
<b>56text78</b>
Instead of .*?, use [^<]*:
print(re.search(r'<b>[^<]*text[^<]*</b>', a).group())
Here [^<]* matches any character except <, so the match cannot cross a tag boundary.
Why do you get <b>1234</b><b>56text78</b> as the output when using the regex <b>.*?text.*?</b>?
The regex engine scans the input from left to right. It first tries to match <b> and succeeds at the very first <b> tag. It then takes the next part of the pattern, .*?text, and lazily consumes characters until it reaches the first occurrence of text; "first" matters because, if several text strings follow the <b>, .*?text stops at the earliest one. At this point <b>1234</b><b>56text has been matched. Finally the engine takes .*?</b> and matches up to the first </b>, so the overall match is <b>1234</b><b>56text78</b>.
With <b>[^<]*text[^<]*</b>, the characters between <b> and text, and between text and </b>, can be anything except <. This prevents the match from running across tags.
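The two patterns can be compared side by side on the string from the question:

```python
import re

a = r'<b>1234</b><b>56text78</b><b>9012</b>'

# Lazy quantifiers still start at the FIRST <b>, so earlier tags are included.
lazy = re.search(r'<b>.*?text.*?</b>', a).group()

# The negated class cannot step over "<", so the match stays inside one tag.
negated = re.search(r'<b>[^<]*text[^<]*</b>', a).group()

print(lazy)
print(negated)
```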
Why doesn't <b>.*?text produce the desired output?
This is what the regex engine does:
1) It takes the first character of the search pattern, which is <, and finds it in the string, then takes the second, then the third, until it has matched <b>.
2) It then takes the whole .*?text pattern and tries to find it in the string, because .*? without the text part would make no sense (it would match 0 characters). It matches the 1234</b><b>56text part and adds it to the <b> found in step 1.
It actually does produce a non-greedy output, it's just non-obvious in this case. If the string was:
<b>1234</b><b>56text78text</b><b>9012</b>
then the greedy '<b>.*text' match would be:
<b>1234</b><b>56text78text
and the non-greedy one '<b>.*?text' would produce the one I was getting:
<b>1234</b><b>56text
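The greedy and non-greedy results above can be checked directly with a short sketch:

```python
import re

a = r'<b>1234</b><b>56text78text</b><b>9012</b>'

# Greedy: .* runs as far as possible, then backs up to the LAST "text".
greedy = re.search(r'<b>.*text', a).group()

# Non-greedy: .*? grows one character at a time and stops at the FIRST "text".
lazy = re.search(r'<b>.*?text', a).group()

print(greedy)
print(lazy)
```

Both matches begin at the first <b>, which is why laziness alone cannot select the inner tag.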
So to answer the initial question, the correct solution is to exclude the '<>' characters from the search:
import re
a = r'<b>1234</b><b>56text78</b><b>9012</b>'
print(re.search(r'<b>[^<>]*text.*?</b>', a).group())
