Unable to manipulate string after grabbing from list - python

I am looking to remove the last statement in a rule used for parsing. The statements are encapsulated with # characters, and the rule itself is encapsulated with pattern tags.
What I want to do is just remove the last rule statement.
My current idea to achieve this goes like this:
Opens the rules file, saves each line as an element into a list.
Selects the line that contains the correct rule-id and then saves the rule pattern as a new string.
Reverses the saved rule pattern.
Removes the last rule statement.
Re-reverses the rule pattern.
Adds in the trailing pattern tag.
So the input will look like:
<pattern>#this is a statement# #this is also a statement#</pattern>
Output will look like:
<pattern>#this is a statement# </pattern>
My current attempt goes like this:
with open(rules) as f:
    lines = f.readlines()

string = ""
for line in lines:
    if ruleid in line:
        position = lines.index(line)
        string = lines[position + 2]  # the rule pattern will be two lines down
                                      # from where the rule-id is located, hence
                                      # the position + 2

def reversed_string(a_string):  # reverses the string
    return a_string[::-1]

def remove_at(x):  # removes everything until the # character
    return re.sub('^.*?#', '', x)

print(reversed_string(remove_at(remove_at(reversed_string(string)))))
This will reverse the string but not remove the last rule statement once it has been reversed.
Running just the reversed_string() function will successfully reverse the string, but trying to run that same string through the remove_at() function will not work at all.
But if you manually create the input string (set to the same rule pattern) and skip opening the file to grab it, the code successfully removes the trailing rule statement.
The successful code looks like this:
string = '<pattern>#this is a statement# #this is also a statement#</pattern>'

def reversed_string(a_string):  # reverses the string
    return a_string[::-1]

def remove_at(x):  # removes everything until the # character
    return re.sub('^.*?#', '', x)

print(reversed_string(remove_at(remove_at(reversed_string(string)))))
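# this prints '<pattern>#this is a statement# ' -- note that the closing </pattern>
# tag gets stripped along the way, which is why it has to be added back afterwards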
As well, how would I add in the pattern tag after the removal is complete?

The lines you are reading probably have a \n at the end, and that's why your replacement is not working. This question can guide you on reading the file without newlines.
Among the options, one is to remove the \n with rstrip(), like this:
string = lines[position + 2].rstrip("\n")
Now, about the replacement, I think you could simplify it by using this regular expression:
#[^#]+#(?!.*#)
It consists of the following parts:
#[^#]+# matches a # followed by one or more characters that are not a #, and then another #.
(?!.*#) is a negative lookahead that checks no # is found anywhere ahead, no matter how many other characters come before it.
Here you can see a demo of this regular expression.
This expression should match the last statement and you would not need to reverse the string:
re.sub("#[^#]+#(?!.*#)", "", string)

Related

How can I alphabetize Python functions using Sublime Text?

I installed a plugin that will alphabetize blocks. I just need a way to select all the defs in a python file. So far I've got this regex.
This doesn't select the last line because there isn't any newline. I could enter a newline at the end, but I'd like to avoid that. In fact, ideally I'd like to avoid grabbing all the newlines above.
But I'm worried that if I don't grab the newline, then it won't match functions that have a blank line in the middle.
If there's a better way than what I'm trying--by selecting the blocks and using an alphabetizer plugin--then please suggest it. Otherwise, is there some way I can get the regex to match just the defs?
def.+(\n?\n.+)+
Will accomplish what you want. (Sublime seems to follow the usual "dot is not newline" convention)
Breaking down the components of the expression:
def.+ - match the def line, up to a newline
\n?\n.+ - match a newline, followed by some characters, optionally preceded by another newline (the optional newline handles the case of an empty line in the middle of a def)
(...)+ - start a capture group, and match its pattern one or more times
(\n?\n.+)+ - combine the previous two pieces, so we match any sequence of non-empty lines with at most one empty line between any two non-empty lines (pedantically, any sequence of non-empty-line and empty-line-then-non-empty-line blocks)
The final + could be a * instead if it's permissible to match "empty" defs like
def empty():
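Outside Sublime, one quick way to sanity-check that pattern is Python's re module, which also treats . as "anything but a newline" by default. The sample functions below are made up, and the top-level defs are separated by two blank lines (PEP 8 style) so consecutive defs don't run together into one match:
import re

sample = """def beta():
    x = 1

    return x


def alpha():
    return 2
"""

# each top-level def becomes one match, even with a single blank line inside a body
blocks = re.findall(r"def.+(?:\n?\n.+)+", sample)
print(sorted(blocks))  # alpha's block sorts before beta's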
Try this
^(\s*)(def.*(?:\n\1\s+.*|\n\s*)+$)

Extract strings that start with ${ and end with }

I'm trying to extract the strings from a file that start with ${ and end with } using Python. I am using the code below to do so, but I don't get the expected result.
My input file looks like this:
Click ${SWIFT_TAB}
Click ${SEARCH_SWIFT_CODE}
and I want to get a list as below:
${SWIFT_TAB}
${SEARCH_SWIFT_CODE}
My current code looks like this:
def findStringFromFile(file):
    import os, re
    with open(file) as f:
        ans = []
        for line in f:
            matches = re.findall(r'\b\${\S+}\b', line)
            ans.extend(matches)
        print(ans)
I am expecting a list of strings that start with ${ and end with }, but all I currently get is an empty list.
The problem is that your regexp is buggy, and doesn't match the strings you want to extract. Specifically, you have two issues:
{ and } are regexp metacharacters, just like $, and also need to be escaped if you want to match them literally.
\b matches a word boundary, i.e. a position between a "word character" (a letter, a number or an underscore) and a "non-word character" (anything else), or between a word character and the beginning/end of the string. It does not match between, say, a space and $.
To fix these issues, change your line:
matches = re.findall(r'\b\${\S+}\b', line)
to:
matches = re.findall(r'\$\{\S+\}', line)
and it should work.
See the Python regular expressions documentation for more details.
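As a quick check of the corrected pattern on the question's two sample lines (the list in the comment is the output I would expect):
import re

lines = ["Click ${SWIFT_TAB}", "Click ${SEARCH_SWIFT_CODE}"]
ans = []
for line in lines:
    ans.extend(re.findall(r'\$\{\S+\}', line))

print(ans)  # ['${SWIFT_TAB}', '${SEARCH_SWIFT_CODE}']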

regex: replace hyphens with en-dashes with re.sub

I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).
The function I use adds some regex flavor to a solution in a related problem (how to replace a character INSIDE the text content of many files automatically?)
def mychanger(fileName):
    with open(fileName, 'r') as file:
        str = file.read()
        str = str.decode("utf-8")
        str = re.sub(r"[^{]{1,4}(-)", "–", str).encode("utf-8")
    with open(fileName, 'wb') as file:
        file.write(str)
I used the regular expression [^{]{1,4}(-) because the search is actually performed on LaTeX regression tables and I only want to replace the hyphens that occur around numbers.
To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine LaTeX code such as \cmidrule(lr){2-4}.
In that case there is a { close to the hyphen (within 3-4 characters at most) and to the left of it. Of course, this hyphen should not be changed into an en-dash, otherwise the LaTeX code will break.
I think the left-hand condition of the exclusion is important for writing the correct exception in the regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { close to the right of the hyphen), and in that case I do want to replace the hyphen.
A typical line in my table is
variable & -2.061\sym{***}& 4.032\sym{**} & 1.236 \\
& (-2.32) & (-2.02) & (-0.14)
However, my regex does not appear to be correct. For instance, a (-1.2) will be replaced as –1.2, dropping the parenthesis.
What is the problem here?
Thanks!
I can offer the following two step replacement:
str = "-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
str = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])","\\1$\\2", str).encode("utf-8")
str = re.sub(r"(^|[^0-9])-(\d+)","\\1$\\2", str).encode("utf-8")
print(str)
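# expected output, with the en-dash used as the replacement text above:
# –1 Hello \cmidrule(lr){2-4} range 1–5 other stuff a–5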
The first replacement targets all ranges which are not of the LaTeX form {1-9}, i.e. are not contained within curly braces. The second replacement targets all numbers preceded by a non-digit or by the start of the string.
Demo
re.sub replaces the entire match. In this case that includes the non-{ characters preceding your -. You can wrap that bit in parentheses to create a \1 group and include that in your substitution (you also don't need the parentheses around your -):
re.sub(r"([^{]{1,4})-",r"\1–", str)

Regex to match only part of certain line

I have a config file from which I need to extract only some values. For example, I have this:
PART
{
    title = Some Title
    description = Some description here. // this 2 params are needed
    tags = qwe rty // don't need this param
    ...
}
I need to extract the value of a certain param, for example description's value. How do I do this in Python 3 with regex?
Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at the beginning of a line.
Then \s* means zero or more whitespace characters (spaces or tabs).
description is your anchor for finding the value part.
After that we expect an = sign, with optional whitespace before and after, denoted by \s*=\s*.
Then we capture everything after the = and the optional whitespace with (.*?). The parentheses capture the expression. Inside them we say: match anything (the dot) as many times as possible (the asterisk) in a non-greedy manner (the question mark), that is, stop as soon as the following expression matches.
The following expression is a lookahead, starting with (?=, which asserts that whatever comes right after the (?= matches at that position without consuming it.
And that thing is actually two options, separated by the vertical bar |.
The first option, to the left of the bar, is // (in parentheses to make it an atomic unit for the alternation), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no // comment on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag to tell the regex engine that we want ^ and $ to match the start and end of each line, respectively. Without the flag they match only the start and end of the entire string, which isn't what we want here.
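For reference, here is how that search behaves on the sample from the question (note that the capture keeps any whitespace sitting before the // comment):
import re

txt = """PART
{
    title = Some Title
    description = Some description here. // this 2 params are needed
    tags = qwe rty // don't need this param
}"""

m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))  # prints 'Some description here. ' (with the trailing space before the //)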
The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
    with open(file) as f:
        lines = f.readlines()
        for line in lines:
            line = line.lstrip()
            if line.startswith(key + " ="):
                return line.split("=", 1)[1].lstrip()
You can use it with a call like get_value_for_key("description", "myfile.txt"). The function returns None if nothing is found. It assumes your file is formatted with a space between the key name and the equals sign, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.
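For completeness, a minimal configparser sketch, assuming the data could be stored in INI form rather than the PART { ... } format from the question (the [PART] section name here is made up):
import configparser

config = configparser.ConfigParser()
config.read_string("""
[PART]
title = Some Title
description = Some description here.
""")

print(config["PART"]["description"])  # Some description here.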
This is a pretty simple regex: you just need a positive lookbehind, and optionally something to remove the comments (you could do this by appending ?(//)? to the regex).
r"(?<=description = ).*"
Regex101 demo
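A small check of the lookbehind on the description line from the question; without the extra comment-stripping step, the trailing // comment stays in the match:
import re

line = "    description = Some description here. // this 2 params are needed"
print(re.search(r"(?<=description = ).*", line).group(0))
# Some description here. // this 2 params are needed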

Editing a text file using python

I have an auto-generated bibliography file which stores my references. The citekey in the generated file is of the form xxxxx:2009tb. Is there a way to make the program detect such a pattern and change the citekey form to xxxxx:2009?
It's not quite clear to me which expression you want to match, but you can build everything with regex, using import re and re.sub as shown. [0-9]{4} matches exactly four digits.
(Edit, to incorporate suggestions)
import re

inf = 'temp.txt'
outf = 'out.txt'
with open(inf) as f, open(outf, 'w') as o:
    all = f.read()
    all = re.sub(r"xxxxx:([0-9]{4})tb", r"xxxxx:\1", all)  # match your regex here
    o.write(all)
You actually just want to remove the two letters after the year in a reference. Supposing we can uniquely identify a reference as a colon followed by four numbers and two letters, then the following regular expression would work (at least it is working in this example code):
import re

s = """
according to some works (newton:2009cb), gravity is not the same that
severity (darwin:1873dc; hampton:1956tr).
"""
new_s = re.sub(r'(:[0-9]{4})\w{2}', r'\1', s)
print(new_s)
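# expected output:
# according to some works (newton:2009), gravity is not the same that
# severity (darwin:1873; hampton:1956).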
Explanation: "match a colon : followed by four numbers [0-9]{4} followed by any two "word" characters \w{2}. The parentheses catch just the part you want to keep, and r'\1' means you are replacing each whole match by a smaller part of it which is in the first (and only) group of parentheses. The r before the string is there because it is necessary to interpret \1 as a raw string, and not as an escape sequence.
Hope this helps!
