How to test filename for occurrence of substring in Python? - python

I want to test whether my filename ends in 'pos.txt' or 'neg.txt' in Python.
So for example, if my filename is 'samplepaths5000neg.txt'--how can I test for the ending? I've tried various Python regex expressions but I can't seem to get it correct. It's for a script that runs different actions depending on the filename.

In a regex, the end of the string is matched by the $ symbol.
So if you do want to use a regex for this example, it would be:
(?:pos|neg)\.txt$
The (?: creates a non-capturing group, that is, a group that isn't reported. Omitting the question mark and colon makes the group capturing, which you could use to find out whether the file name ended in pos or neg.
import re
file_name = "samplepaths5000neg.txt"
# re.search scans the whole string; re.match only tries the pattern at the start
correct_ending = re.search(r'(?:pos|neg)\.txt$', file_name) is not None
or, if you want to capture the ending:
ending_result = re.search(r'(pos|neg)\.txt$', file_name)
ending = ending_result.group(1) if ending_result is not None else ''
The pattern is shown matched against different filenames here:
https://regex101.com/r/9TiFMr/1
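A minimal sketch of the check, assuming Python 3. The helper names ending_with_regex and has_ending are made up for illustration. It uses re.search because re.match only tries the pattern at position 0 of the string, and it also shows str.endswith with a tuple as a regex-free alternative that the answer above does not mention.

```python
import re

# Capturing group so we can tell which ending was found.
ENDING = re.compile(r'(pos|neg)\.txt$')

def ending_with_regex(file_name):
    """Return 'pos' or 'neg' if file_name ends in pos.txt / neg.txt, else ''."""
    # search() scans the whole string; match() would only try position 0.
    m = ENDING.search(file_name)
    return m.group(1) if m else ''

def has_ending(file_name):
    """Yes/no test without a regex: str.endswith accepts a tuple of suffixes."""
    return file_name.endswith(('pos.txt', 'neg.txt'))

print(ending_with_regex('samplepaths5000neg.txt'))
print(has_ending('samplepaths5000neg.txt'))
```

If only a yes/no answer is needed, the endswith version is usually the easier one to read.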

Related

Getting the last occurrence of a match that is inside parenthesis using regular expressions

I want to use regular expressions to get the text inside parentheses in a sentence. But if the string has two or more occurrences, the pattern I am using gets everything in between. I googled it and some sources told me to use a negative lookahead and a backreference, but it is not working as expected.
An example of a string is:
s = "Para atuar no (GCA) do (CNPEM)"
What I want is to get just the last occurrence: "(CNPEM)"
The pattern I am using is:
pattern = "(\(.*\))(?!.*\1)"
But when I run (using python's re module) I get this:
output = (GCA) do (CNPEM)
How can I get just the last occurrence in this case?
You could use re.findall here, and then access the last match:
import re
s = "Para atuar no (GCA) do (CNPEM)"
# the lazy .*? stops at the first closing parenthesis
last = re.findall(r'\(.*?\)', s)[-1]
print(last) # (CNPEM)
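To see why the original pattern returned everything in between, here is a small sketch comparing a greedy quantifier, a lazy one, and a negated character class. The variable names are illustrative only, and the negated-class variant is an alternative not taken from the answer above.

```python
import re

s = "Para atuar no (GCA) do (CNPEM)"

# Greedy: .* runs to the LAST ')' in the string, so both groups are swallowed.
greedy = re.findall(r'\(.*\)', s)

# Lazy: .*? stops at the first ')' it can, giving one match per group.
lazy = re.findall(r'\(.*?\)', s)

# Negated class: never crosses a parenthesis, so no backtracking is involved.
negated = re.findall(r'\([^()]*\)', s)

# Guard against strings with no parentheses before indexing with [-1].
last = negated[-1] if negated else None

print(greedy)
print(lazy)
print(last)
```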

Regex Puzzle: Match a pattern only if it is between two $$ without indefinite look behind

I am writing a snippet for the Vim plugin UltiSnips which will trigger on a regex pattern (as supported by Python 3). To avoid conflicts I want to make sure that my snippet only triggers when contained somewhere inside of $$___$$. Note that the trigger pattern might contain an indefinite string in front or behind it. So as an example I might want to match all "a" in "$$ccbbabbcc$$" but not "ccbbabbcc". Obviously this would be trivial if I could simply use indefinite look behind. Alas, I may not as this isn't .NET and vanilla Python will not allow it. Is there a standard way of implementing this kind of expression? Note that I will not be able to use any python functions. The expression must be a self-contained trigger.
If what you are looking for only occurs once between each pair of '$$', then:
\$\$.*?(a)(?=.*?\$\$)
Breakdown:
\$\$ Matches '$$'
.*? Matches 0 or more characters non-greedily
(a) Matches 'a' and captures it in group 1
(?=.*?\$\$) The match must be followed by 0 or more arbitrary characters and then '$$'
This captures all 3 of the a characters found in the example below.
The code:
import re
s = "$$ccbbabbcc$$xxax$$bcaxay$$"
print(re.findall(r'\$\$.*?(a)(?=.*?\$\$)', s))
Prints:
['a', 'a', 'a']
The following should work:
re.findall(r"\${2}.+\${2}", stuff)
Breakdown:
Looks for two '$'
\${2}
Then looks for one or more of any character
.+
Then looks for two '$' again
\${2}
I believe this regex would work to match the a within the $$:
import re
text = '$$ccbbabbcc$$ccbbabbcc'
re.findall(r'\${2}.*(a).*\${2}', text)
# returns
['a']
Alternatively:
A simple approach (requiring two checks instead of one regex) would be to first find all parts enclosed in your quoting text, then check whether your search string is present within them.
Example:
text = '$$ccbbabbcc$$ccbbabbcc'
search_string = 'a'
parts = re.findall(r'\${2}.+\${2}', text)
[p for p in parts if search_string in p]
# returns
['$$ccbbabbcc$$']
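A hedged sketch of the same two-step idea for text that contains more than one $$...$$ pair. The helper name parts_containing and the sample text are made up for illustration. It swaps the greedy .+ for a lazy .+? so that each enclosed part is found separately; a greedy .+ would run from the first '$$' to the last one.

```python
import re

def parts_containing(text, search_string):
    """Return every $$...$$ part of text that contains search_string."""
    # Lazy .+? stops at the nearest closing $$, so pairs are not merged.
    parts = re.findall(r'\$\$.+?\$\$', text)
    return [p for p in parts if search_string in p]

text = '$$ccbbabbcc$$ xxax $$bcd$$'
print(parts_containing(text, 'a'))
```

Here the 'a' in the unquoted " xxax " is ignored, and '$$bcd$$' is dropped because it does not contain the search string.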

Free text parsing using long regex formula leading to error: multiple repeat in python? Screenshot included

I need to parse specific strings from a free text field in an .xlsx file. I am using Python 2.7 in Spyder.
I escaped the '.' in the regex formulas but I am still getting the same error.
To do that, I used pandas to convert the .xlsx file into a pandas dataframe:
import pandas as pd
data = "complaints_data.xlsx"
read_data = pd.read_excel(data)
read_data.dropna(inplace = False)
df = pd.DataFrame(read_data)
df['FMEA Assessment'] = df['FMEA Assessment'].replace({',':''}, regex=True)
Then, I used the extract function of pandas to extract my string fields FMEA, Rev and Line using regex patterns.
fmea_pattern = r'(FMEA\s*\d*\d*\d*\d*\d*|fmea\s*\d*\d*\d*\d*\d*|DOC\s*\-*[0]\d*\d*\d*\d*\d*|doc\s*\-*[0]\d*\d*\d*\d*\d*)'
df[['FMEA']] = df['FMEA Assessment'].str.extract(fmea_pattern, expand=True)
rev_pattern = r'(Rev\.*\s+\D{1,2}+|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'
df[['REV']] = df['FMEA Assessment'].str.extract(rev_pattern, expand=True)
line_pattern = r'(line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Lines\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|Line item\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINES\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.|LINE\.*\s*\:*\d{1,3}\d*\.*\D*\.*\d+\d*?\.)'
df[['LINE']] = df['FMEA Assessment'].str.extract(line_pattern, expand=True)
The string fields I need to parse can be entered in various ways, and I accounted for each way in the regex formulas and for each variation of a word; for example, I accounted for line, Line, LINE, lines, Lines, etc. I have tested the regex formulas individually and they work properly. However, when I combine all of them in the code above, I get the error "multiple repeat".
Also, is there another way to account for variations of the same word at the same time (lower case, upper case and title case)?
The main error in this case is that you are using a possessive quantifier instead of a regular, non-possessive quantifier.
This is a common mistake when users test their patterns in online PCRE regex testers. Always test your regexps in an environment (or with a regex engine option) that is compatible with your target environment.
Python's re module did not support possessive quantifiers until Python 3.11, so none of the following work in Python 2.7:
{5}+
{5,}+
{5,10}+
++
?+
*+
In this case, you just need to remove the trailing + from \D{1,2}+:
rev_pattern = r'(Rev\.*\s+\D{1,2}|rev\.*\s+\D{1,2}|REV\.*\s+\D{1,2}|rev\.*\s+\D{1,2})'
You can also shorten the pattern to one of the following:
rev_pattern = r'((?:[Rr]ev|REV)\.*\s+\D{1,2})' # Matches only Rev, rev and REV
or
rev_pattern = r'(?i)(Rev\.*\s+\D{1,2})' # Matches any case variation of Rev
See the regex demo at Regex101, note the Python option selected on the left.
Also, note that it is possible to make the whole pattern case insensitive by adding (?i) at the start of the pattern, or by compiling the regex with re.I or re.IGNORECASE arguments. This will "account for variations of the same word at the same time(lower case, upper case and title case)".
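A minimal sketch of the case-insensitive variant using the plain re module, so it runs without pandas. The helper extract_rev and the sample strings are made up for illustration. With pandas, the same flag can be passed through the flags parameter of Series.str.extract.

```python
import re

# One alternative instead of four: IGNORECASE covers Rev, rev, REV, rEv, ...
rev_pattern = re.compile(r'(rev\.*\s+\D{1,2})', re.IGNORECASE)

def extract_rev(text):
    """Return the first 'Rev ...' fragment in text, or None if there is none."""
    m = rev_pattern.search(text)
    return m.group(1) if m else None

for sample in ["REV C", "rev. A1", "no revision here"]:
    print(repr(sample), '->', repr(extract_rev(sample)))
```

Note that \D also matches spaces and punctuation, so the captured fragment may include a trailing non-digit character that is not part of the revision letter.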
NOTE: if you actually are looking to use a possessive quantifier you may emulate a possessive quantifier with the help of a positive lookahead and a backreference. However, in Python, you would need re.finditer to get access to the whole match values.

How to catch a string using regex in python and replace it by desired string

I am new to Python and I wrote the following code, which is supposed to catch a specific string and replace it with another specific string.
sid=\"1722407313768658\"
I used this regex: sid=(.+?)
but it also catches an irrelevant string:
https://tmobile.demdex.net/dest5.html?d_nsid=0#
Also, when I run this regex on sid=\"1722407313768658\" (replacing it with 1900117189066752), the result does not replace the string but adds to it: sid=\1900117189066752\ "1722407313768658\"
(Instead of 1722407313768658 I want to have 1900117189066752.)
This is my Python code:
import re
content = c.read()
################################################################
# change sessionid in content
replace_small_sid = str('sid=\\' + "\\"+str(sid) + "\\" + " ")
content = re.sub("sid=(.+?)", replace_small_sid, content)
As I understand it you wish to match string patterns in the form:
sid=\"1722407313768658\"
With the aim of replacing the digits.
To achieve this we can use positive lookbehinds and lookaheads as described here:
https://www.regular-expressions.info/lookaround.html
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.
In this case our lookbehind will match
sid=\"
Our lookahead will match
\"
Please see the example here: https://regex101.com/r/2pXcMI/2
Finally, we can use this to match and replace as follows:
import re
line = "sid=\"1722407313768658\" safklabsf ipashf oiasfoi asbg fasnk sid=\"65641\" asjobfaosb asbfaosb asf asfauv sid=\"651564165\"."
replace_with = '1900117189066752'
line = re.sub(r'(?<=sid=\")\d+(?=\")', replace_with, line)
line
This returns
'sid="1900117189066752" safklabsf ipashf oiasfoi asbg fasnk sid="1900117189066752" asjobfaosb asbfaosb asf asfauv sid="1900117189066752".'
Since you want to replace one specific string, you can also do it without a regex. Note that str.replace returns a new string, so assign the result:
content = content.replace("1722407313768658", "1900117189066752")
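A hedged sketch for the case where the backslashes shown in the question are literally present in the content (for example in escaped JSON). That is an assumption, since the question does not show the raw file. The helper name replace_sid is made up for illustration. It keeps the surrounding text via capture groups, and the \b prevents a match inside d_nsid.

```python
import re

# Group 1: 'sid=' plus an optional literal backslash and the opening quote.
# Group 2: an optional literal backslash and the closing quote.
SID = re.compile(r'(\bsid=\\?")\d+(\\?")')

def replace_sid(content, new_sid):
    """Replace the digits of every sid="..." or sid=\\"...\\" occurrence."""
    # A function replacement avoids any escape handling in new_sid.
    return SID.sub(lambda m: m.group(1) + new_sid + m.group(2), content)

content = r'x sid=\"1722407313768658\" dest5.html?d_nsid=0# sid="42"'
print(replace_sid(content, '1900117189066752'))
```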

Regex to match only part of certain line

I have some config file from which I need to extract only some values. For example, I have this:
PART
{
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
...
}
I need to extract value of certain param, for example description's value. How do I do this in Python3 with regex?
Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at beginning of line.
Then \s* means zero or more spaces (or tabs)
description is your anchor for finding the value part.
After that we expect = sign with optional spaces before or after by denoting \s*=\s*.
Then we capture everything after the = and optional spaces, by denoting (.*?). This expression is captured by the parentheses. Inside the parentheses we say match anything (the dot) zero or more times (the asterisk), but as few times as possible (the question mark), that is, stop as soon as the following expression is matched.
The following expression is a lookahead expression, starting with (?= which matches the thing right after the (?=.
And that thing is actually two options, separated by the vertical bar |.
The first option, to the left of the bar says // (in parenthesis to make it atomic unit for the vertical bar choice operation), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no comment // on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern, or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag, to tell the regex engine that we want ^ and $ match the start and end of lines, respectively. Without the flag they match the start and end of the entire string, which isn't what we want in this case.
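A small runnable sketch of this regex applied to the sample from the question. The helper name get_param is made up for illustration, and the "..." line of the sample is left out. The result is stripped because the lazy group otherwise keeps the space in front of the // comment.

```python
import re

txt = """PART
{
    title = Some Title
    description = Some description here. // this 2 params are needed
    tags = qwe rty // don't need this param
}
"""

def get_param(name, text):
    """Return the value of `name = value` in text, without a trailing // comment."""
    # re.escape keeps the parameter name from being read as regex syntax.
    pattern = r'^\s*' + re.escape(name) + r'\s*=\s*(.*?)(?=//|$)'
    # re.M makes ^ and $ match at the start and end of each line.
    m = re.search(pattern, text, re.M)
    return m.group(1).strip() if m else None

print(get_param('description', txt))
print(get_param('title', txt))
```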
The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
    with open(file) as f:
        lines = f.readlines()
    for line in lines:
        line = line.lstrip()
        if line.startswith(key + " ="):
            return line.split("=", 1)[1].lstrip()
You can use it with a call like: get_value_for_key("description", "myfile.txt"). The method will return None if nothing is found. It is assumed that your file will be formatted where there is a space and the equals sign after the key name, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.
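This is a pretty simple regex; you just need a positive lookbehind:
r"(?<=description = ).*"
This matches the rest of the line, including any trailing comment. To stop before a comment, make the match lazy and add a lookahead, for example r"(?<=description = ).*?(?=\s*//|$)" used with the re.M flag.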
Regex101 demo
