regex: pattern fails to match what I am looking for


I have the following code that tries to retrieve the name of a file from a directory based on a double \ character:
import re
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=*\\\\)*'
re.findall(pattern,string)
The reasoning is that the name of the file always comes after a double \ , so I try to match any string that is preceded by text ending in \\ .
Nevertheless, when I apply this code I get the following error:
error: nothing to repeat at position 4
What am I doing wrong?
Edit: The concrete output I am looking for is getting the string 'abo_st_gas_dtd_csv' as a match.

There's a couple of things going on:
You need to declare your string definition using the same r'string' notation as for the pattern; right now your string only has a single backslash, since the first one of the two is escaped.
I'm not sure you're using * correctly. It means "repeat immediately preceding group", and not just "any string" (as, e.g., in the usual shell patterns). The first * in parentheses does not have anything preceding it, meaning that the regex is invalid. Hence the error you see. I think, what you want is .*, i.e., repeating any character 0 or more times. Furthermore, it is not needed in the parentheses. A more correct regexp would be r'(?<=\\\\).*':
import re
string = r'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
pattern = r'(?<=\\\\).*'
re.findall(pattern,string)
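To see why the raw-string prefix matters here, a minimal sketch (the names plain and raw are made up for illustration) compares the two literals and runs a lookbehind pattern against each:

```python
import re

# Same literal text, with and without the r prefix.
plain = 'FO_PRE\\abo_st_gas_dtd.csv'   # '\\' is an escape: ONE backslash in the string
raw = r'FO_PRE\\abo_st_gas_dtd.csv'    # raw literal: TWO backslashes in the string

print(plain.count('\\'), raw.count('\\'))

# Lookbehind for two literal backslashes, then take the rest of the line.
two_backslashes = re.findall(r'(?<=\\\\).*', raw)
# The same idea for a single backslash (what the non-raw literal contains).
one_backslash = re.findall(r'(?<=\\).*', plain)

print(two_backslashes)
print(one_backslash)
```

Both calls find the file name; which pattern you need depends only on how many backslashes the string really contains.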

Your pattern is just a lookbehind, which, by itself, can't match anything. I would use this re.findall approach:
import re
string = 'I:/Etrmtest/PZMALIo4/ETRM841_FX_Deals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
filename = re.findall(r'\\([^.]+\.\w+)$', string)[0]
print(filename) # abo_st_gas_dtd.csv

files = 'I:E\\trm.csvest/PZMALIo4\\ETRM841_FX_.csvDeals_Restructuring/FO_PRE\\abo_st_gas_dtd.csv'
counter = -1
my_files = []
for f in files:
    counter += 1
    if ord(f) == 92:  # backslash
        temp = files[counter+1:len(files)]
        temp_file = ""
        for f1 in temp:
            temp_file += f1
            # [0-len(temp_file)] => if [char after . to num index of type file]== csv
            if f1 == '.' and temp[len(temp_file):len(temp_file)+3] == "csv":
                my_files.append(temp_file + "csv")
                break
print(my_files)  # ['trm.csv', 'ETRM841_FX_.csv', 'abo_st_gas_dtd.csv']

Related

How to print matching strings in python with regex?

I am working on a Python script that would go through a directory with a bunch of files and extract the strings that match a certain pattern.
More specifically, I'm trying to extract the values of serial number and a max-limit, and the lines look something like this:
#serial number = 642E0523D775
max-limit=50M/50M
I've got the script to go through the files, but I'm having an issue with it actually printing the values that I want it to. Instead of it printing the values, I just get the 'Nothing found' output.
I'm thinking that it probably has something to do with the regex I'm using, but I can't for the life of me figure out how to formulate this.
The script I've come up with so far:
import os
import re
#Where I'm searching
user_input = "/path/to/files/"
directory = os.listdir(user_input)
#What I'm looking for
searchstring = ['serial number', 'max-limit']
re_first = re.compile ('serial.\w.*')
re_second = re.compile ('max-limit=\w*.\w*')
#Regex combine
regex_list = [re_first, re_second]
#Looking
for fname in directory:
    if os.path.isfile(user_input + os.sep + fname):
        # Full path
        f = open(user_input + os.sep + fname, 'r')
        f_contents = f.read()
        content = fname + f_contents
        files = os.listdir(user_input)
        lines_seen = set()
        for f in files:
            print(f)
            if f not in lines_seen: # not a duplicate
                for regex in regex_list:
                    matches = re.findall(regex, content)
                    if matches != None:
                        for match in matches:
                            print(match)
                    else:
                        print('Nema')
        f.close()

Per the documentation, the regex module's match() searches for "characters at the beginning of a string [that] match the regular expression pattern". Since you are prepending your file contents with the file name in the line:
content=fname + f_contents
and then matching your pattern against the content in the line:
result=re.match(regex, content)
there will never be a match.
Since you want to locate a match anywhere in string, use search() instead.
See also: search() vs match()
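A minimal sketch of the difference, assuming a made-up file name and contents in place of fname + f_contents, and a simplified pattern chosen only for illustration:

```python
import re

# Hypothetical file name and contents, standing in for fname + f_contents.
content = "router1.cfg" + "#serial number = 642E0523D775\nmax-limit=50M/50M\n"
pattern = re.compile(r"max-limit=\S+")

# match() only tries at position 0, where the file name sits.
print(pattern.match(content))

# search() scans forward until it finds the first hit.
found = pattern.search(content)
print(found.group(0))
```

Here match() returns None because the content starts with the file name, while search() finds the max-limit line further in.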
Edit
The pattern ^[\w&.\-]+$ provided would match neither serial number = 642E0523D775 as it contains a space (" "), nor max-limit=50M/50M as it contains a forward slash ("/"). Both also contain an equals sign ("=") which cannot be matched by your pattern.
Additionally, the character class in this pattern matches the backslash ("\"), so you may want to remove it (the dash ("-") should not be escaped when it is at the end of the character class).
A pattern to match both these strings as well could be:
^[\w&. \/=\-]+$

Change Python script parameters from another Python script

I have a main-script from which I want to make a couple of temporary copies, with slight changes in each of the copies.
main.py could look like this:
import numpy as np
import module
import bar
...
foo = bar(label='C2H4', point=(1,0))
atoms = 'H4'
template = read('template.t')
size = template.lengths()
n = 4
alpha = 0.5
batch_size = 256 // (n * alpha)
dct = {
    '1': [1, 2],
    '6': [3, 4],
}
kwargs = {
    'dict': dct,
    'size': size,
    'scale': size[0] / 10,
}
...
module(atoms, kwargs, foo)
module.run()
In another script, called parameter_search.py, I make the copy and change the parameters by running through each line in main.py, searching for what should be changed.
If the variable is found, a regex-command splits the line, and then changes the float (I'm not the best at regexes, so this could probably use some work):
import re
import fileinput
import os
from shutil import copyfile

def is_num(var):
    try:
        float(var)
        return True
    except ValueError:
        return False

def replace(filename, var_expr, new_val):
    found = False
    for line in fileinput.input(filename, inplace=1):
        if var_expr in line:
            if not found:
                found = True
                lst = re.split('(=|:|,)', line)
                for i, char in enumerate(lst):
                    if is_num(char):
                        lst[i] = str(new_val)
                line = "".join(lst)
            else:
                raise NameError(f'{var_expr} ambiguous')
        print(line, end="")
    if not found:
        raise NameError(f'{var_expr} not found in {filename}')

N = [10, 20]
alpha = [0.4, 0.6, 0.8]
foo = [bar(alpha=1), baz()]
for n in N:
    for a in alpha:
        for i, f in enumerate(foo):
            newfile = 'main'
            newfile += f'_n{n}'
            newfile += f'_a{a}'
            newfile += f'_f{i}'
            newfile += '.py'
            copyfile('main.py', newfile)
            replace(newfile, 'n=', n)
            replace(newfile, 'alpha=', a)
            replace(newfile, 'foo = ', f)
This gives decent results, but problems arise if several variables are on the same line, such as bar(label='C2H4', point=(1, 0)) or the variable is a part of a dict as kwargs, the parameter is a string, a function or some other weird variable.
Is it possible to make something like replace() that is more general or in some other way makes this possible?

Okay, let me make a few assumptions here:
The parameter assignment happens only once in the .py script, and it is also the first occurrence of the parameter name.
There are exactly two ways to declare a parameter: as a single variable par = value or inside a dictionary {'par' : value} (with single or double quotes).
You can then use the re.sub function to directly substitute the value of the assignment, using what is called capture groups:
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of
pattern in string by the replacement repl. If the pattern isn’t found,
string is returned unchanged. repl can be a string or a function; if
it is a string, any backslash escapes in it are processed. That is, \n
is converted to a single newline character, \r is converted to a
carriage return, and so forth. Unknown escapes of ASCII letters are
reserved for future use and treated as errors. Other unknown escapes
such as \& are left alone. Backreferences, such as \6, are replaced
with the substring matched by group 6 in the pattern.
source: https://docs.python.org/3/library/re.html
So, your function to set a new value for a given parameter name could look like this:
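def replace_parameter_value_if_found(string, parameter, new_value):
    # \g<1> instead of \1, so that a new value starting with a digit is not read as part of the group number
    return re.sub(r"(['\"]?" + parameter + r"['\"]?\s*[=:]\s*)[.\w]*", r"\g<1>" + str(new_value), string)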
Now, let's break it down:
"([\'\"]?"+parameter+"[\'\"]?\s*[=:]\s*)[.\w]*"
Stuff enclosed in () is called a capture group - it can be
referenced later.
Stuff enclosed in [] matches any of those characters inside (the backslash serves as an escape character)
? is a quantifier matching zero or one occurrence of what it immediately succeeds
* is a quantifier matching any number of occurrences of what it immediately succeeds, including zero
any literal string matches that string
\s and \w match any whitespace character and any word character (a-z, A-Z, 0-9 and the underscore), respectively
So, say the parameter was 'alpha', the regex pattern becomes
"([\'\"]?alpha[\'\"]?\s*[=:]\s*)[\.\w]*"
and reads like this:
Open capture group 1
Match a single ' or " or neither (because of the ?)
Match the literal word alpha
Match a single ' or " or neither
Match any number of white space characters
Match a single = or :
Match any number of white space characters
Close capture group 1 (it now contains alpha= or 'alpha': or some variation of it)
Match any number of word characters or periods (this is what we will be replacing)
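All this will then be replaced with what's in capture group 1, followed by the new value. In a replacement string, group 1 can be written as \1 or \g<1>; the second form is safer because a new value that starts with a digit would otherwise be read as part of the group number (\1 followed by 0.4 becomes \10.4, a reference to group 10):
r"\g<1>" + new_value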
The string can be the entirety of the .py script, also keep in mind that what you are passing to the function are strings and they can be whatever you want.
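As a runnable sketch of this approach (the name set_parameter, the sample script text, the added \b word boundary, the re.escape call and count=1 are all assumptions for illustration, not part of the answer above), the replacement can also be given as a function, which avoids any backslash handling in the new value:

```python
import re

def set_parameter(source, parameter, new_value):
    """Replace the value of the first `parameter = value` or `'parameter': value`."""
    # \b keeps 'n' from matching the tail of a longer name; re.escape guards
    # against regex metacharacters in the parameter name.
    pattern = r"(['\"]?\b" + re.escape(parameter) + r"['\"]?\s*[=:]\s*)[.\w]*"
    # A function replacement is used so new_value is inserted verbatim.
    return re.sub(pattern, lambda m: m.group(1) + str(new_value), source, count=1)

# Hypothetical script contents, loosely modelled on main.py.
script = "n = 4\nalpha = 0.5\nkwargs = {'scale': 10}\n"

print(set_parameter(script, 'alpha', 0.8))
print(set_parameter(script, 'scale', 20))
```

Like the original pattern, this only handles simple values made of word characters and periods; tuples, quoted strings or function calls as values would need a real parser such as the ast module.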
Hope this helps.

Delete CHOOSEN special character function

Please help, because I'm losing my mind. I can find similar problems, but none of them is this specific.
-I'm trying to create a simple compilator in Tkinter, with a function to delete a chosen special character.
-I've got a button for each character (dot, colon, etc.), and I want to create a function that takes a special character as an argument and then deletes it from the ScrolledText field. Here is my best try:
import re
content = 'Test. test . .test'
special = '.'
def delchar(char):
    adjustedchar = str("'[" + char + "]'")
    p = re.compile(adjustedchar)
    newcontent = p.sub('', content)
    print(newcontent)

delchar(special)
output (nothing has changed)>>> 'Test. test . .test'
What's going on here? How to make it work? Is there a better solution?
I know that I could create a separate function for each character (I tried, and it works), but that would create 10 unnecessary functions. I want to keep it DRY. Also, my next function is going to do the same thing, just using the user input.
What doesn't work is that argument. If I would print eg. adjustedchar, I'd get:
'[.]'
It's a format that re.compile() should accept, right?

Your code nearly works; the problem is that . (a dot) is a special character in a regex, so it must be escaped when it is used as a pattern on its own.
Change your code to:
import re
content = 'Test. test . .test'
special = r'\.'
def delchar(char):
    adjustedchar = str("'[" + char + "]'")
    p = re.compile(char)
    newcontent = p.sub('', content)
    print(newcontent)

delchar(special)
You can also check by making special = 't'. In your function you can do checks for the special characters.

You need to re.compile with the pattern you want to match, not with the replace-content:
import re
content = 'Test. test . .test'
special = '.'
def delchar(char):
    adjustedchar = str("'[" + char + "]'")
    p = re.compile("["+char+"]") # replace the dots, not '.'
    newcontent = p.sub(adjustedchar, content) # with adjustedchar,change to '' if you like
    print(newcontent)

delchar(special)
Your content does not contain '.' so it does not replace. If you change the pattern to "[.]" you are looking for literal dots to be replaced - not dots flanked by '
Output:
Test'[.]' test '[.]' '[.]'test
You could just as well use string replace: 'Test. test . .test'.replace(".", "'.'")
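One way to keep a single function for every button is to let re.escape handle the special characters. This is only a sketch: passing content as a second argument and returning the result instead of printing it are assumptions made here so the function is easy to test.

```python
import re

def delchar(char, content):
    """Return content with every occurrence of char removed."""
    # re.escape makes regex metacharacters such as . ^ ] \ match literally.
    return re.sub(re.escape(char), '', content)

content = 'Test. test . .test'
print(delchar('.', content))

# For a single fixed character no regex is needed at all.
print(content.replace('.', ''))
```

Both calls print the same text; the regex version only pays off once the thing to delete becomes a real pattern rather than one character.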

Stripping Hex code from a plain text file in Python [duplicate]

I have a string. How do I remove all text after a certain character? (In this case ...)
The text after the ... will change, so that's why I want to remove all characters after a certain one.

Split on your separator at most once, and take the first piece:
sep = '...'
stripped = text.split(sep, 1)[0]
You didn't say what should happen if the separator isn't present. Both this and Alex's solution will return the entire string in that case.

Assuming your separator is '...', but it can be any string.
text = 'some string... this part will be removed.'
head, sep, tail = text.partition('...')
>>> print(head)
some string
If the separator is not found, head will contain all of the original string.
The partition function was added in Python 2.5.
S.partition(sep) -> (head, sep, tail)
Searches for the separator sep in S, and returns the part before it,
the separator itself, and the part after it. If the separator is not
found, returns S and two empty strings.

If you want to remove everything after the last occurrence of separator in a string I find this works well:
<separator>.join(string_to_split.split(<separator>)[:-1])
For example, if string_to_split is a path like root/location/child/too_far.exe and you only want the folder path, you can split by "/".join(string_to_split.split("/")[:-1]) and you'll get
root/location/child
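The same result can be had without rebuilding the string. A short sketch using the built-in str.rsplit and str.rpartition, which both work from the right-hand end (the variable names are illustrative only):

```python
path = "root/location/child/too_far.exe"

# rsplit with maxsplit=1 cuts only at the last separator.
folder = path.rsplit("/", 1)[0]

# rpartition returns (head, sep, tail) around the last separator.
head, sep, tail = path.rpartition("/")

print(folder)
print(head, tail)
```

The variants differ when the separator is missing: rsplit(...)[0] returns the whole string, whereas the join version and the head from rpartition are both empty.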

Without a regular expression (which I assume is what you want):
def remafterellipsis(text):
    where_ellipsis = text.find('...')
    if where_ellipsis == -1:
        return text
    return text[:where_ellipsis + 3]
or, with a regular expression:
import re
def remwithre(text, there=re.compile(re.escape('...')+'.*')):
    return there.sub('', text)

import re
test = "This is a test...we should not be able to see this"
res = re.sub(r'\.\.\..*',"",test)
print(res)
Output: "This is a test"

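The index method returns the position of a character in a string (find does the same, but returns -1 instead of raising ValueError when the character is absent). Then, if you want to remove everything from that character onward, do this: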
mystring = "123⋯567"
mystring[ 0 : mystring.index("⋯")]
>> '123'
If you want to keep the character, add 1 to the character position.

From a file:
import re
sep = '...'
with open("requirements.txt") as file_in:
    lines = []
    for line in file_in:
        res = line.split(sep, 1)[0]
        print(res)

This works for me in Python 3.7.
In my case I needed to remove everything after the dot in my string variable fees:
fees = "45.05"
split_string = fees.split(".", 1)
substring = split_string[0]
print(substring)

Yet another way to remove all characters after the last occurrence of a character in a string (assume that you want to remove all characters after the final '/').
path = 'I/only/want/the/containing/directory/not/the/file.txt'
while path[-1] != '/':
    path = path[:-1]

Another easy way using re:
import re
text = 'some string... this part will be removed.'
text = re.search(r'(\A.*)\.\.\..+', text, re.DOTALL | re.IGNORECASE).group(1)
# text == 'some string'

How can I replace part of a string with a pattern

For example, if the string is "abbacdeffel" and the pattern "xyyx" is to be replaced with "1234",
the result would go from "abbacdeffel" to "1234cd1234l".
I have tried to think this through but I couldn't come up with anything. At first I thought maybe a dictionary could help, but still nothing came to mind.

What you're looking to do can be accomplished with regular expressions (regex), which let you describe the part of a string you want to match and extract just that.
In your case, you want to match substrings with the pattern abba, so use the following regex:
(\w)(\w)\2\1
https://regex101.com/r/hP8lA3/1
This captures two word characters and then uses backreferences to require the second character again, followed by the first.
So implementing this in python code looks like this:
First, import the regex module in python
import re
Then, declare your variable
text = "abbacdeffel"
The re.finditer returns an iterable so you can iterate through all the groups
matches = re.finditer(r"(\w)(\w)\2\1", text)
Go through all the matches that the regexp found and replace the pattern with "1234"
for match in matches:
    text = text.replace(match.group(0), "1234")
For debugging:
print(text)
Complete Code:
import re
text = "abbacdeffel"
matches = re.finditer(r"(\w)(\w)\2\1", text)
for match in matches:
    text = text.replace(match.group(0), "1234")
print(text)
You can learn more about Regular Expressions here: https://regexone.com/references/python

New version of code (there was a bug):
def replace_with_pattern(pattern, line, replace):
    from collections import OrderedDict
    set_of_chars_in_pattern = set(pattern)
    indice_start_pattern = 0
    output_line = ""
    while indice_start_pattern < len(line):
        potential_end_pattern = indice_start_pattern + len(pattern)
        subline = line[indice_start_pattern:potential_end_pattern]
        print(subline)
        set_of_chars_in_subline = set(subline)
        if len(set_of_chars_in_subline) != len(set_of_chars_in_pattern):
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
            continue
        map_of_chars = OrderedDict()
        liste_of_chars_in_pattern = []
        for char in pattern:
            if char not in liste_of_chars_in_pattern:
                liste_of_chars_in_pattern.append(char)
        print(liste_of_chars_in_pattern)
        for subline_char in subline:
            if subline_char not in map_of_chars.values():
                map_of_chars[liste_of_chars_in_pattern.pop(0)] = subline_char
        print(map_of_chars)
        wanted_subline = ""
        for char_of_pattern in pattern:
            wanted_subline += map_of_chars[char_of_pattern]
        print("wanted_subline =" + wanted_subline)
        if subline == wanted_subline:
            output_line += replace
            indice_start_pattern += len(pattern)
        else:
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
    return output_line
Some tests:
test1 = replace_with_pattern("xyyx", "abbacdeffel", "1234")
test2 = replace_with_pattern("abbacdeffel", "abbacdeffel", "1234")
print(test1, test2)
=> 1234cd1234l 1234

Here goes my attempt:
([a-zA-Z])(?!\1)([a-zA-Z])\2\1
Assuming you want to match letters only (for other ranges, change both [a-zA-Z] as appropriate), we have:
([a-zA-Z])
Find the first character, and note it so we can later refer to it with \1.
(?!\1)
Check to see if the next character is not the same as the first, but without advancing the search pointer. This is to prevent aaaa being accepted. If aaaa is OK, just remove this subexpression.
([a-zA-Z])
Find the second character, and note it so we can later refer to it with \2.
\2\1
Now find the second again, then the first again, so we match the full abba pattern.
And finally, to do a replace operation, the full command would be:
import re
re.sub(r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1',
'1234',
'abbacdeffelzzzz')
The r at the start of the regex pattern is to prevent Python processing the backslashes. Without it, you would need to do:
import re
re.sub('([a-zA-Z])(?!\\1)([a-zA-Z])\\2\\1',
'1234',
'abbacdeffelzzzz')
Now, I see the spec has expanded to a user-defined pattern; here is some code that will build that pattern:
import re
def make_re(pattern, charset):
    result = ''
    seen = []
    for c in pattern:
        # Is this a letter we've seen before?
        if c in seen:
            # Yes, so we want to match the captured pattern
            result += '\\' + str(seen.index(c)+1)
        else:
            # No, so match a new character from the charset,
            # but first exclude already matched characters
            for i in range(len(seen)):
                result += '(?!\\' + str(i + 1) + ')'
            result += '(' + charset + ')'
            # Note we have seen this letter
            seen.append(c)
    return result
print(re.sub(make_re('xzzx', '\\d'), 'abba', 'abba1221b99999889'))
print(re.sub(make_re('xyzxyz', '[a-z]'), '123123', 'abcabc zyxzyyx zyzzyz'))
Outputs:
abbaabbab9999abba
123123 zyxzyyx zyzzyz
