re.sub issue when using group with \number - python

I'm trying to use a regexp to tidy up some text with re.sub.
Let's say it's an almost-CSV file that I have to clean up to make it proper CSV.
I replaced all \n with \t by doing:
t = t.replace("\n", "\t")
... and it works just fine. After that, I need to turn some of the \t back into \n, one for each of my CSV lines. For that I use this expression:
t = re.sub("\t(\d*?);", "\n\1;", t, re.U)
The problem is that it works, but only partially. The \n are added properly, but instead of being followed by my matching group, they are followed by a ^A (according to Vim).
I tried my regexp with re.findall and it works just fine, so what could be wrong?
My CSV lines are ultimately supposed to look like:
number;text;text;...;...;\n
Thanks for your help!

Your \1 is interpreted by Python as the character with code 1 (\x01, which Vim displays as ^A) before re.sub ever sees it.
Try using \\1 or r"\n\1;".

Like Scharron said, always always always use raw-string (r'') notation with regexes. Get into that habit and then you won't have to debug weird issues like this.
r'\n\1;'
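A minimal sketch of the difference, using a made-up sample line (the data is an assumption, not the asker's file). It also passes re.U as flags=, because the fourth positional argument of re.sub is count, not flags:

```python
import re

# Hypothetical sample: two records that were joined with a tab.
t = "1;foo;bar;\t2;baz;qux;"

# Non-raw replacement: Python turns "\1" into chr(1) before re.sub sees it.
broken = re.sub(r"\t(\d*?);", "\n\1;", t)

# Raw replacement: re.sub receives the backreference \1 and expands it.
fixed = re.sub(r"\t(\d*?);", r"\n\1;", t, flags=re.U)

print(repr(broken))
print(repr(fixed))
```

The first result contains \x01 where the number should be; the second keeps the captured digits.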

Related

The way to unescape escaped regex pattern Python

I'm trying to unescape an escaped regex pattern so that I can apply it to a string.
The pattern is dynamic, so I don't know exactly what it will look like, but during my testing I ran into one problem. The string with the escaped regex pattern looks like this:
\\d{4}
I've written a simple regex which replaces every combination of a backslash and a character with just the character.
I'm applying it this way:
sub(r"\\(.)", "\\1", escaped_pattern)
But what it gives me afterwards is d{4}, not \d{4} as I expect.
I've tried using raw strings for repl and escaping/unescaping it, but it still doesn't return what I expect. I would appreciate any help.
EDIT
escaped_pattern = settings.reg_exp
regexp = sub(r"\\(.)", "\\1", escaped_pattern)
search(regexp, string_to_regexp).group()[0]
Based on your update, I'm pretty sure that you would get exactly your desired output if you just stopped trying to unescape it.
import re
s1 = "1234astring"
matches = re.search("\\d{4}", s1)
matches.group(0)    # '1234' (the whole match)
matches.group()[0]  # '1' (only the first character of the match)
Try r"\\\\(.)" as the search pattern and r"\\\1" as the substitution pattern (raw, so that \1 stays a backreference instead of becoming the character \x01).
works OK here: https://regex101.com/r/M3ikqj/1
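A small sketch of that substitution in Python, assuming the stored pattern really contains two backslash characters (the value of escaped_pattern here is a stand-in for settings.reg_exp):

```python
import re

# Assumption: the stored pattern holds two literal backslashes, i.e. \\d{4}.
escaped_pattern = r"\\d{4}"

# Collapse each doubled backslash into a single one, keeping the next character.
# In the replacement, \\ yields one backslash and \1 is the captured character.
unescaped = re.sub(r"\\\\(.)", r"\\\1", escaped_pattern)

print(unescaped)
match = re.search(unescaped, "1234astring")
print(match.group(0))
```

If the setting is written as a normal Python literal such as "\\d{4}", it already holds a single backslash and needs no unescaping at all.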

Python. How to print a certain part of a line after it had been "re.searched" from a file

Could you tell me how to print this part of the line only '\w+.226.\w.+' ?
Code
VSP = input("VSP number (four digits): ")
a = re.compile(r'\w+.226.\w.+' + VSP)
b = re.search(a, open('Sample.txt').read())
print(b.group())
VSP number (four digits): 1020
10.226.27.60 1020
After I have found the line associated with my variable VSP in the text file, how can I exclude VSP from the output and print only "10.226.27.60"?
You will need to modify your regex slightly to separate the trailing characters in the IP and the spaces that separate it from VSP. Adding a capture group will let you select the portion with just the IP address. The updated regex looks like this:
r'(\d+\.226\.\S+)\s+' + VSP
\S (uppercase S) matches any non-whitespace, while \s (lowercase s) matches all whitespace. I replaced the first \w with the more specific \d (digits), and . (any character at all) with \. (actual period). The second \w is now \S, but you could use \d+\.\d+ if you wanted to be more specific.
Using the first capture group will give you the IP address:
print(b.group(1))
If you are looking for a single IP address once, not compiling your regex is fine. Also, reading the file in its entirety is OK as long as the file is small. If either is not the case, I would recommend compiling the regex and going through the file line by line. That will allow you to discard most lines much faster than using a regex would do.
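A rough sketch of the line-by-line approach. The helper name find_ip, the sample data, and the choice to anchor the VSP number at the end of the line are all assumptions made for illustration; io.StringIO stands in for open('Sample.txt'):

```python
import io
import re

def find_ip(lines, vsp):
    """Return the IP-like token from the first line that ends with vsp, else None."""
    # re.escape guards against regex metacharacters in the user's input.
    # Anchoring with \s*$ is an assumption about the file layout.
    pattern = re.compile(r'(\d+\.226\.\S+)\s+' + re.escape(vsp) + r'\s*$')
    for line in lines:
        match = pattern.search(line)
        if match:
            return match.group(1)
    return None

# Made-up file contents for the demonstration.
sample = io.StringIO("10.226.27.59 1019\n10.226.27.60 1020\n10.226.27.61 1021\n")
print(find_ip(sample, "1020"))
```

Because a file object yields one line at a time, the same function should work when passed an open file in place of the StringIO object.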
I see you already have an answer. You can also try this regex if you want to separate the two groups by the whitespace:
import re
# Edit: the ? makes the first .+ non-greedy; otherwise multiple spaces
# after the address would be captured into b.group(1), as per Mad's comment.
a = re.compile(r'(.+?)\s+(.+)')
b = re.search(a, '10.226.27.60 1020')
print(b.group(0))
print(b.group(1))
print(b.group(2))
or customize the first group regexp to your needs.
Edit:
This was not meant to be a proper answer but more of a comment, which I didn't think would be readable as such; I am only trying to show group separation using regex, which it seems the OP didn't know about or didn't use.
That is why I am not matching .226., because the OP can do that. I also removed the file read part, which isn't needed for the demonstration. Please read Mad's answer, because it's quite complete and in fact also shows how to use groups.

Replace text between parentheses in python

My string will contain () in it. What I need to do is to change the text between the brackets.
Example string: "B.TECH(CS,IT)".
I need to change the content inside the brackets to something like this: B.TECH(ECE,EEE)
Here is what I tried:
reg = r'(()([\s\S]*?)())'
a = 'B.TECH(CS,IT)'
re.sub(reg,"(ECE,EEE)",a)
But I got output like this..
'(ECE,EEE)B(ECE,EEE).(ECE,EEE)T(ECE,EEE)E(ECE,EEE)C(ECE,EEE)H(ECE,EEE)((ECE,EEE)C(ECE,EEE)S(ECE,EEE),(ECE,EEE)I(ECE,EEE)T(ECE,EEE))(ECE,EEE)'
Valid output should be like this:
B.TECH(ECE,EEE)
What am I missing, and how do I correctly replace the text?
The problem is that you're using parentheses, which have another meaning in regex. They're used as grouping characters, to capture parts of the match.
You need to escape the () where you want them as literal tokens. You can escape characters using the backslash character: \(.
Here is an example:
reg = r'\([\s\S]*\)'
a = 'B.TECH(CS,IT)'
re.sub(reg, '(ECE,EEE)', a)
# == 'B.TECH(ECE,EEE)'
The reason your regex does not work is that you are trying to match parentheses, which are metacharacters in regex. () actually captures an empty string, and since an empty match is possible at every position, the replacement is inserted everywhere. That's why you get the output that you see.
To fix this, you'll need to escape those parens, something along the lines of
\(...\)
For your particular use case, might I suggest a simpler pattern?
In [268]: re.sub(r'\(.*?\)', '(ECE,EEE)', 'B.TECH(CS,IT)')
Out[268]: 'B.TECH(ECE,EEE)'
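The two patterns above behave the same on the example string but differ once a line contains more than one pair of parentheses. A quick sketch with a made-up input (the second course name is an assumption):

```python
import re

# Hypothetical input with two parenthesised groups.
a = 'B.TECH(CS,IT) and M.TECH(CS)'

# Greedy: runs from the first "(" to the last ")".
greedy = re.sub(r'\([\s\S]*\)', '(ECE,EEE)', a)

# Non-greedy: stops at the nearest ")" and replaces each group separately.
lazy = re.sub(r'\(.*?\)', '(ECE,EEE)', a)

print(greedy)
print(lazy)
```

The greedy version swallows everything between the outermost parentheses, so the non-greedy form is usually the safer choice here.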

Negative lookahead - exclude entire match if words are found?

I am trying to parse text journals, and I am only interested in specific sections of text.
I thought that I was doing fine until I noticed I was inadvertently identifying sections.
Suppose that I want to match the following section.
Section 7 - Delivering Terminal Diagnosis's
which may also show up as
Section 7. Delivering a Terminal Diagnosis
But I don't want to match anything if the words see or under precede my string like below.
see Section 7. Delivering a Terminal Diagnosis
or
filed under Section 7. Delivering a Terminal Diagnosis
should not match anything.
I tried using a negative look-ahead, but it only excludes the words, it doesn't throw out the entire match.
((?!see )Section[\s\\n]+7[\s+]+?[-:\\n\.]+?[\s+]+?(Delivering|Deliver)(.*terminal[\s+]+Diagnosis('s)?)?[\.]?)
I don't think that I am grasping the look-around concept properly. help?
Negative look-ahead does what it says: specifies a group that cannot match after your main expression. But you don't have anything before it.
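Use negative lookbehind instead. Note that Python's re module only accepts fixed-width lookbehinds, so see and under cannot be alternatives inside a single group; chain one lookbehind per word:
(?<!see )(?<!under )
in lieu of (?!see ).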
Other comments: you have a case error (terminal should be Terminal) and if you make your entire string "raw" by prepending it with an r like r'my string' you don't need to double-escape characters like \n.
Try the following.
Whatever you are matching, I would put r in front of your regular expression: r is Python's raw string notation, which saves you from double-escaping backslashes in patterns. To avoid worrying about uppercase versus lowercase, use re.I for case-insensitive matching.
Here's a possible solution using two negative lookbehinds.
(?<!see)(?<!under)\s+(section 7[\s.:-]+(?:deliver(?:ing)?).*?terminal\s+diagnosis(?:'s)?)
As an example of using raw string notation and re.I, this is what I meant:
matches = re.findall(r"(?<!see)(?<!under)\s+(section 7[\s.:-]+(?:deliver(?:ing)?).*?terminal\s+diagnosis(?:'s)?)", s, re.I)
print(matches)
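A self-contained sketch that checks the four sample lines from the question. It is a variation on the pattern above, not the answerer's exact regex: each lookbehind includes the trailing space, and the leading \s+ is dropped so that a heading at the very start of the text can still match.

```python
import re

# Python's re only accepts fixed-width lookbehinds, hence one per excluded word.
pattern = re.compile(
    r"(?<!see )(?<!under )"
    r"(section 7[\s.:-]+deliver(?:ing)?.*?terminal\s+diagnosis(?:'s)?)",
    re.I,
)

samples = [
    "Section 7 - Delivering Terminal Diagnosis's",
    "Section 7. Delivering a Terminal Diagnosis",
    "see Section 7. Delivering a Terminal Diagnosis",
    "filed under Section 7. Delivering a Terminal Diagnosis",
]

# True where the heading is accepted, False where see/under precedes it.
results = [bool(pattern.search(s)) for s in samples]
print(results)
```

The first two samples match and the last two do not.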

Split lines by a character or whitespace python

I'm trying to split the lines in the data file I'm playing with. This was originally someone else's code, just trying to 'fix it'. They have it splitting on a semi-colon, but I realized that they actually need it to split on excess whitespace as well. I've singled my problem out to the expression in line 28. I was trying some suggestions from other users, but when I use a regex command I get an invalid literal for int() warning. This is confusing because it works if I don't use the regex. Any suggestions? Thanks.
EDIT: Edited for full code link.
No, .split with no arguments is the only form that splits on any whitespace.
Use a regex like this:
re.split(r'[\s;]+', text)
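The full code is not shown, so the cause of the int() error is a guess. One common cause is the empty string that re.split returns when a line starts or ends with a delimiter. A sketch with a made-up line:

```python
import re

# Made-up line: semicolons, runs of spaces, and a trailing delimiter.
line = "10; 20   30;40;\n"

fields = re.split(r'[\s;]+', line)

# A trailing (or leading) delimiter leaves an empty string in the result,
# and int('') raises "invalid literal for int()", so skip empty fields.
numbers = [int(field) for field in fields if field]

print(fields)
print(numbers)
```

Stripping the line first, for example with line.strip(' ;\n'), would be another way to avoid the empty field.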
