Editing a text file using python - python

I have an auto-generated bibliography file which stores my references. The citekey in the generated file is of the form xxxxx:2009tb. Is there a way to make the program detect such a pattern and change the citekey to the form xxxxx:2009?

It's not quite clear to me which expression you want to match, but you can build everything with regex, using import re and re.sub as shown. [0-9]{4} matches exactly four digits.
(Edit, to incorporate suggestions)
import re
inf = 'temp.txt'
outf = 'out.txt'
with open(inf) as f, open(outf, 'w') as o:
    text = f.read()
    # match your regex here; the group keeps "xxxxx:" plus the four-digit year and drops "tb"
    text = re.sub(r"(xxxxx:[0-9]{4})tb", r"\1", text)
    o.write(text)

You actually just want to remove the two letters after the year in a reference. Supposing we can uniquely identify a reference as a colon followed by four digits and two letters, then the following regular expression would work (at least it works in this example code):
import re
s = """
according to some works (newton:2009cb), gravity is not the same that
severity (darwin:1873dc; hampton:1956tr).
"""
new_s = re.sub(r'(:[0-9]{4})\w{2}', r'\1', s)
print(new_s)
Explanation: match a colon : followed by four digits [0-9]{4}, followed by any two "word" characters \w{2}. The parentheses capture just the part you want to keep, and r'\1' replaces each whole match with the contents of the first (and only) group. The r prefix makes the replacement a raw string, so that \1 reaches the regex engine as a backreference instead of being interpreted by Python as an escape sequence.
Hope this helps!
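If the suffix should only be stripped from things that look like citekeys, the pattern can be tightened a little. The following is a minimal sketch, assuming a citekey is a run of word characters, a colon, a four-digit year and exactly two lowercase letters; the helper name strip_citekey_suffix and the sample entry are made up for illustration.

```python
import re

# Assumed citekey shape: word characters, a colon, four digits, two lowercase letters.
CITEKEY_SUFFIX = re.compile(r'\b(\w+:[0-9]{4})[a-z]{2}\b')

def strip_citekey_suffix(text):
    """Return text with the two-letter suffix removed from every citekey."""
    return CITEKEY_SUFFIX.sub(r'\1', text)

sample = "@article{newton:2009cb,\n  year = {2009},\n}"
print(strip_citekey_suffix(sample))
```

The word boundaries keep the pattern from firing in the middle of a longer token; whether that is strict enough depends on what else the bibliography file contains.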

Related

Regex Puzzle: Match a pattern only if it is between two $$ without indefinite look behind

I am writing a snippet for the Vim plugin UltiSnips which will trigger on a regex pattern (as supported by Python 3). To avoid conflicts I want to make sure that my snippet only triggers when contained somewhere inside of $$___$$. Note that the trigger pattern might contain an indefinite string in front or behind it. So as an example I might want to match all "a" in "$$ccbbabbcc$$" but not "ccbbabbcc". Obviously this would be trivial if I could simply use indefinite look behind. Alas, I may not as this isn't .NET and vanilla Python will not allow it. Is there a standard way of implementing this kind of expression? Note that I will not be able to use any python functions. The expression must be a self-contained trigger.
If what you are looking for only occurs once between the '$$', then:
\$\$.*?(a)(?=.*?\$\$)
This allows you to match all 3 a characters in the following example:
\$\$ Matches '$$'
.*? Matches 0 or more characters non-greedily
(?=.*?\$\$) String must be followed by 0 or more arbitrary characters followed by '$$'
The code:
import re
s = "$$ccbbabbcc$$xxax$$bcaxay$$"
print(re.findall(r'\$\$.*?(a)(?=.*?\$\$)', s))
Prints:
['a', 'a', 'a']
The following should work:
re.findall(r"\${2}.+\${2}", stuff)
Breakdown:
Looks for two '$'
\${2}
Then looks for one or more of any character
.+
Then looks for two '$' again
\${2}
I believe this regex would work to match the a within the $$:
text = '$$ccbbabbcc$$ccbbabbcc'
re.findall('\${2}.*(a).*\${2}', text)
# prints
['a']
Alternatively:
A simple approach (requiring two checks instead of one regex) would be to first find all parts enclosed in your quoting text, then check whether your search string is present within them.
example
text = '$$ccbbabbcc$$ccbbabbcc'
search_string = 'a'
parts = re.findall('\${2}.+\${2}', text)
[p for p in parts if search_string in p]
# prints
['$$ccbbabbcc$$']
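Note that .+ is greedy, so with several $$...$$ pairs in one string the pattern above spans from the first $$ to the last. Here is a minimal sketch of the same two-step idea with a non-greedy quantifier, assuming the delimiters always come in pairs and do not nest (the sample text and the names inside and hits are illustrative):

```python
import re

text = '$$ccbbabbcc$$ccbbabbcc$$xyz$$'
search_string = 'a'

# Non-greedy .+? stops at the nearest closing $$, so each pair is captured separately.
inside = re.findall(r'\$\$(.+?)\$\$', text)

# Keep only the delimited parts that contain the search string.
hits = [p for p in inside if search_string in p]
print(inside)
print(hits)
```

The middle run of text, which sits between a closing and an opening $$, is not reported because each match consumes both of its delimiters.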

How to catch a string using regex in python and replace it by desired string

I am new to Python and wrote the following code, which is supposed to find a specific string and replace it with another string.
sid=\"1722407313768658\"
I used this regex: sid=(.+?)
but it also matches irrelevant strings, such as
https://tmobile.demdex.net/dest5.html?d_nsid=0#
Also, when I run this regex on sid=\"1722407313768658\" (replacing the number with 1900117189066752), the result does not replace the string but adds to it: sid=\1900117189066752\ "1722407313768658\"
(instead of 1722407313768658 I want to have 1900117189066752)
This is my Python code:
import re
content = c.read()
################################################################
# change sessionid in content
replace_small_sid = str('sid=\\' + "\\"+str(sid) + "\\" + " ")
content = re.sub("sid=(.+?)", replace_small_sid, content)
As I understand it you wish to match string patterns in the form:
sid=\"1722407313768658\"
With the aim of replacing the digits.
To achieve this we can use positive lookbehinds and lookaheads as described here:
https://www.regular-expressions.info/lookaround.html
Lookahead and lookbehind, collectively called "lookaround", are zero-length assertions just like the start and end of line, and start and end of word anchors explained earlier in this tutorial. The difference is that lookaround actually matches characters, but then gives up the match, returning only the result: match or no match. That is why they are called "assertions". They do not consume characters in the string, but only assert whether a match is possible or not.
In this case our lookbehind will match
sid=\"
Our lookahead will match
\"
Please see the example here: https://regex101.com/r/2pXcMI/2
Finally, we can use this to match and replace as follows:
import re
line = "sid=\"1722407313768658\" safklabsf ipashf oiasfoi asbg fasnk sid=\"65641\" asjobfaosb asbfaosb asf asfauv sid=\"651564165\"."
replace_with = '1900117189066752'
line = re.sub('(?<=sid=\\\")\d+(?=\\\")', replace_with, line)
line
This returns
'sid="1900117189066752" safklabsf ipashf oiasfoi asbg fasnk sid="1900117189066752" asjobfaosb asbfaosb asf asfauv sid="1900117189066752".'
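An alternative without lookaround is to capture the surrounding text and write it back in the replacement. This is a sketch under the same assumption as above (the value is a run of digits between double quotes after sid=); the sample line is shortened for illustration. Because the new value starts with a digit, the replacement uses \g<1> rather than \1, which would otherwise be read together with the next digit as a reference to group 11.

```python
import re

line = 'sid="1722407313768658" other text sid="65641"'
replace_with = '1900117189066752'

# \g<1> is the captured 'sid="' prefix and \g<2> the closing quote;
# the \g<...> form keeps the group number separate from the digits that follow.
result = re.sub(r'(sid=")\d+(")', r'\g<1>' + replace_with + r'\g<2>', line)
print(result)
```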
Since you want to replace one specific string, you can simply do:
content = content.replace("1722407313768658", "1900117189066752")

regex: replace hyphens with en-dashes with re.sub

I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).
The function I use adds some regex flavor to a solution in a related problem (how to replace a character INSIDE the text content of many files automatically?)
def mychanger(fileName):
    with open(fileName,'r') as file:
        str = file.read()
        str = str.decode("utf-8")
        str = re.sub(r"[^{]{1,4}(-)","–", str).encode("utf-8")
    with open(fileName,'wb') as file:
        file.write(str)
I used the regular expression [^{]{1,4}(-) because the search is actually performed on latex regression tables and I only want to replace the hyphens that occur around numbers.
To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine latex code such as \cmidrule(lr){2-4}.
In this case there is a { close (within 3-4 characters max) to the hyphen and to the left of it. Of course, this hyphen should not be changed into an en-dash otherwise the latex code will break.
I think the left part condition of the exclusion is important to write the correct exception in regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { on the close right of the hyphen) and in that case I do want to replace the hyphen.
A typical line in my table is
variable & -2.061\sym{***}& 4.032\sym{**} & 1.236 \\
& (-2.32) & (-2.02) & (-0.14)
However, my regex does not appear to be correct. For instance, a (-1.2) will be replaced as –1.2, dropping the parenthesis.
What is the problem here?
Thanks!
I can offer the following two step replacement:
import re
text = r"-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
# "$" marks each replaced hyphen here; substitute "–" to insert en-dashes
text = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])", r"\1$\2", text)
text = re.sub(r"(^|[^0-9])-(\d+)", r"\1$\2", text)
print(text)
The first replacement targets all ranges which are not of the LaTeX form {1-9}, i.e. are not contained within curly braces. The second replacement targets all numbers preceded by a non-digit or by the start of the string.
re.sub replaces the entire match. In this case that includes the non-{ characters preceding your -. You can wrap that part in parentheses to create a \1 group and include it in your substitution (you also don't need parentheses around the -):
re.sub(r"([^{]{1,4})-",r"\1–", str)
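The effect is easy to check on a small input. The sketch below compares the original pattern with the corrected one on the (-1.2) example from the question; the variable names are illustrative.

```python
import re

cell = "(-1.2)"

# Original pattern: the character matched before the hyphen is replaced too.
broken = re.sub(r"[^{]{1,4}(-)", "–", cell)

# Corrected pattern: group 1 holds that character and is written back.
fixed = re.sub(r"([^{]{1,4})-", r"\1–", cell)

print(broken)
print(fixed)
```

Note that this only addresses the dropped characters. As far as I can tell, [^{]{1,4} can still match the 2 in {2-4}, so the hyphen in \cmidrule(lr){2-4} is replaced as well and needs a separate check.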

Regex to match only part of certain line

I have some config file from which I need to extract only some values. For example, I have this:
PART
{
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
...
}
I need to extract value of certain param, for example description's value. How do I do this in Python3 with regex?
Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at beginning of line.
Then \s* means zero or more spaces (or tabs)
description is your anchor for finding the value part.
After that we expect = sign with optional spaces before or after by denoting \s*=\s*.
Then we capture everything after the = and optional spaces, by denoting (.*?). This expression is captured by the parentheses. Inside the parentheses we say match anything (the dot) as many times as you can find (the asterisk) in a non-greedy manner (the question mark), that is, stop as soon as the following expression is matched.
The following expression is a lookahead expression, starting with (?= which matches the thing right after the (?=.
And that thing is actually two options, separated by the vertical bar |.
The first option, to the left of the bar says // (in parenthesis to make it atomic unit for the vertical bar choice operation), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no comment // on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern, or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag, to tell the regex engine that we want ^ and $ match the start and end of lines, respectively. Without the flag they match the start and end of the entire string, which isn't what we want in this case.
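Putting it together on the sample from the question (a sketch; the contents of txt are taken from the example above, with the indentation assumed). Note that the captured value keeps the space in front of the comment, so it is stripped before printing.

```python
import re

txt = """PART
{
    title = Some Title
    description = Some description here. // this 2 params are needed
    tags = qwe rty // don't need this param
}
"""

m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
if m:
    value = m.group(1).strip()  # drop the trailing space left before "//"
    print(value)
```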
The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
    with open(file) as f:
        lines = f.readlines()
    for line in lines:
        line = line.lstrip()
        if line.startswith(key + " ="):
            return line.split("=", 1)[1].lstrip()
You can use it with a call like: get_value_for_key("description", "myfile.txt"). The method will return None if nothing is found. It is assumed that your file will be formatted where there is a space and the equals sign after the key name, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.
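For comparison, here is a minimal sketch of the configparser route. It assumes the file can be rewritten in INI form (a [PART] section header instead of the braces), which is not the format shown in the question; the ini_text contents are hypothetical. The // comments are handled through the inline_comment_prefixes argument.

```python
import configparser

# Hypothetical INI version of the question's file.
ini_text = """
[PART]
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
"""

# Treat "//" preceded by whitespace as the start of an inline comment.
parser = configparser.ConfigParser(inline_comment_prefixes=("//",))
parser.read_string(ini_text)

description = parser["PART"]["description"]
print(description)
```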
This is a pretty simple regex, you just need a positive lookbehind, and optionally something to remove the comments. (do this by appending ?(//)? to the regex)
r"(?<=description = ).*"

Matching everything after series of hyphens

I'm trying to capture all the remaining text in a file after three hyphens at the start of a line (---).
Example:
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Everything after the first set of three hyphens should be captured. The closest I've gotten is using this regex [^(---)]+$ which works slightly. It will capture everything after the hyphens, but if the user places any hyphens after that point it instead then captures after the last hyphen the user placed.
I am using this in combination with python to capture text.
If anyone can help me sort out this regex problem I'd appreciate it.
pat = re.compile(r'(?ms)^---(.*)\Z')
The (?ms) adds the MULTILINE and DOTALL flags.
The MULTILINE flag makes ^ match the beginning of lines (not just the beginning of the string.) We need this because the --- occurs at the beginning of a line, but not necessarily the beginning of the string.
The DOTALL flag makes . match any character, including newlines. We need this so that (.*) can match more than one line.
\Z matches the end of the string (as opposed to the end of a line).
For example,
import re
text = '''\
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
'''
pat = re.compile(r'(?ms)^---(.*)\Z')
print(re.search(pat, text).group(1))
prints
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Note that when you define a regex character class with brackets, [...], the stuff inside the brackets is (in general, except for hyphenated ranges like a-z) interpreted as single characters. They are not patterns. So [---] is no different from [-]. In fact, [---] is the range of characters from - to -, inclusive.
The parentheses inside the character class are interpreted as literal parentheses too, not grouping delimiters. So [(---)] is equivalent to [-()], the character class including the hyphen and left and right parentheses.
Thus the character class [^(---)]+ matches any character other than the hyphen or parentheses:
In [23]: re.search('[^(---)]+', 'foo - bar').group()
Out[23]: 'foo '
In [24]: re.search('[^(---)]+', 'foo ( bar').group()
Out[24]: 'foo '
You can see where this is going, and why it does not work for your problem.
Sorry for not directly answering your question, but I wonder if regular expressions are overcomplicating the problem? You could do something like this:
with open('myfile', 'r') as f:
    # skip lines up to and including the first one that starts with ---
    for line in f:
        if line[:3] == "---":
            break
    text = f.readlines()
Or, am I missing something?
I tend to find that regular expressions are difficult enough to maintain that if you don't need their unique capabilities for a given purpose it'll be cleaner and more readable to avoid using them entirely.
s = open(myfile).read().split('\n\n---\n\n', 1)
print(s[0])  # first part
print(s[1])  # second part after the dashes
This should work for your example. The second parameter to split specifies how many times to split the string.
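If the blank lines around the dashes are not guaranteed, a regex split anchored to the start of a line is a little more forgiving. A minimal sketch, assuming the separator is a line consisting of exactly three hyphens (the sample text and variable names are illustrative):

```python
import re

text = "above\n---\nThis is content.\n---\nmore content\n"

# Split once on the first line that is exactly "---", consuming its newline.
parts = re.split(r'(?m)^---$\n?', text, maxsplit=1)
before = parts[0]
after = parts[1] if len(parts) > 1 else ""
print(after)
```

With maxsplit=1 any later --- lines stay inside the second part.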
