Python - replace multiline string in a file - python

I'm writing a script that finds a few lines of text in a file. How can I replace exactly that text with a new string, which may be shorter or longer? I use re.compile() to build a multi-line pattern and then look for matches in the file like this:
for match in pattern.finditer(text_in_file):
    # if possible, I would like to change the text in the file here,
    # probably by replacing match.group(0)
    pass
Can it be done this way (and if so, what is the easiest way), or is my approach wrong or hard to get right (and if so, what is the right way)?

The simple solution:
Read the whole text into a variable as a string.
Use a multi-line regexp to match what you want to replace
Use output = pattern.sub('replacement', fileContent)
The complex solution:
Read the file line by line
Print any line which doesn't match the start of the pattern
If you find a match for the start of the pattern, stop printing until you see the end of the pattern.
When you see the end of the pattern, print the replacement and resume printing.
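The line-by-line steps above can be sketched as a small generator. This is only an illustration under assumptions: the helper name replace_block and the BEGIN/END markers are made up, and plain substring checks stand in for the start and end patterns.

```python
def replace_block(lines, start_marker, end_marker, replacement):
    """Yield lines, swapping every start..end block for `replacement`.

    Hypothetical helper: markers are matched with plain substring tests.
    """
    skipping = False
    for line in lines:
        if not skipping and start_marker in line:
            skipping = True        # start of the block: stop emitting lines
        if not skipping:
            yield line             # ordinary line outside any block
        elif end_marker in line:
            skipping = False       # end of the block: emit the replacement
            yield replacement


sample = ["keep 1\n", "BEGIN\n", "old text\n", "END\n", "keep 2\n"]
result = list(replace_block(sample, "BEGIN", "END", "new text\n"))
print(result)
```

Because it is a generator, the same function can be fed an open file object instead of a list, so the whole file never has to be held in memory.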

Use pattern.sub('replacement text', text_in_file) to replace matches.
You can use back references in the replacement pattern as needed. It doesn't matter if the string is shorter or longer; the method returns a new string value with the replacements made. If the text came from a file, you'll need to write back the text to that file to replace the contents.
You could use the fileinput module if you need to make the replacement in-place; the module takes care of moving the original file aside and writing a new file in its place.
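A minimal sketch of the read, substitute, and write-back approach, assuming the file fits in memory. The <start>/<end> markers, the helper name replace_in_file, and the temporary demo file are made up for illustration; subn is used instead of sub only to report how many replacements were made.

```python
import os
import re
import tempfile

# Hypothetical multi-line pattern: everything between <start> and <end>.
pattern = re.compile(r"<start>\n(.*?)\n<end>", re.DOTALL)


def replace_in_file(path, pattern, replacement):
    """Rewrite the file at `path` with all matches replaced; return the count."""
    with open(path) as f:
        text = f.read()
    new_text, count = pattern.subn(replacement, text)
    with open(path, "w") as f:
        f.write(new_text)
    return count


# Demo on a throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("head\n<start>\nold body\n<end>\ntail\n")
    demo_path = tmp.name

# \1 is a back reference to the text captured between the markers.
count = replace_in_file(demo_path, pattern, r"[\1 replaced]")
with open(demo_path) as f:
    demo_result = f.read()
os.remove(demo_path)
print(count, repr(demo_result))
```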

Related

How to convert variable to a regex string?

I am working in Python. I am looping through a large group of strings, and I want to check whether each one appears in a second list of strings.
for line in dictionary:
    line = line.replace('\r\n','').replace('\n','')
    for each in complex8list:
        txt = re.compile(.*line.*)
        if re.search(each, txt):
I need to check whether the string, with anything before it and anything after it, appears in an entry of the second list.
What is the correct syntax to do this?
If line isn't a regex, you don't even need regex for this.
if line in each:
If line is a regex, then you don't need to do anything since a leading .* is implied with re.search and a trailing .* is unnecessary.
if re.search(line, each):
BTW you seem to have the arguments to re.search backwards.
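A short sketch of both options, using made-up sample lists. It also shows re.escape, which is one way to pass a literal string to re.search safely when it may contain regex metacharacters such as a dot.

```python
import re

# Hypothetical sample data standing in for `dictionary` and `complex8list`.
words = ["cat", "a.b"]
candidates = ["concatenate", "axb", "xa.by"]

# Plain substring test: no regex needed when the word is literal text.
plain_hits = [(w, c) for w in words for c in candidates if w in c]

# Same result with re.search, escaping the word so "." stays literal.
# Note the argument order: pattern first, then the string to search.
regex_hits = [(w, c) for w in words for c in candidates
              if re.search(re.escape(w), c)]

# Without escaping, "a.b" is a regex and "." matches any character.
unescaped_hits = [c for c in candidates if re.search("a.b", c)]

print(plain_hits)
print(unescaped_hits)
```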

Python: select a line if specific characters spaced by tab at end of line

I am trying to find out how to best select specific lines from multiple txt files in Python. One way could be to use regex, but I have read that this would probably be a 'heavy' solution for a simpler selection process of lines. Another possibility may be string.split() but it seems that I would have to split all lines first before making my selection. The selection I intend to make is upon the following condition:
if a line ends with 'a tab a tab' then I select that line
in regex this would be the following:
((a\t){2}|(b\t){2})\n # character 'a' or 'b' at end of line
The function line.endswith('a a ') is also available, yet this does not recognize tabs.
if line.endswith('a a '): # tabs are not recognized at end of line
Can you please advise whether regex is a good fit or too heavy, or whether string.split or another function like line.endswith is more appropriate?
Thank you.
endswith is enough to solve your selection problem:
\t is a nice way to represent a tab in a python string:
>>> print('a\ta\t')
a a
And endswith matches it nicely:
>>> print('foobar a\ta\t'.endswith('a\ta\t'))
True
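To cover both the 'a' and 'b' cases from the question, endswith also accepts a tuple of suffixes. A minimal sketch with made-up sample lines; the trailing newline is stripped first, since lines read from a file still carry it.

```python
# Hypothetical sample lines, as they would come from iterating over a file.
lines = ["x\ta\ta\t\n", "y\tb\tb\t\n", "z\ta\tb\t\n", "w a a \n"]

# Either suffix selects the line; "\t" is the tab character.
suffixes = ("a\ta\t", "b\tb\t")

# Strip only the line ending so the trailing tab is kept for the test.
selected = [line for line in lines if line.rstrip("\r\n").endswith(suffixes)]
print(selected)
```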

python regular expression not matching file contents with re.match and re.MULTILINE flag

I'm reading in a file and storing its contents as a multiline string. Then I loop through some values I get from a Django query and run regexes based on the query result values. My regex seems like it should be working, and it works if I copy the values returned by the query, but for some reason it isn't matching when all the parts are put together.
My code is:
with open("/path_to_my_file") as myfile:
    data=myfile.read()
#read saved settings then write/overwrite them into the config
items = MyModel.objects.filter(some_id="s100009")
for item in items:
    regexString = "^\s*"+item.feature_key+":"
    print regexString #to verify its what I want it to be, ie debug
    pq = re.compile(regexString, re.M)
    if pq.match(data):
        #do stuff
So basically my problem is that the regex isn't matching. When I copy the file contents into a big old string, and copy the value(s) printed by the print regexString line, it does match, so I'm thinking there's some esoteric Python/Django thing going on (or maybe not so esoteric, as Python isn't my first language).
And for example's sake, the output of print regexString is:
^\s*productDetailOn:
File contents:
productDetailOn:true,
allOff:false,
trendingWidgetOn:true,
trendingWallOn:true,
searchResultOn:false,
bannersOn:true,
homeWidgetOn:true,
}
Running Python 2.7. Also, dumped the types of both item.feature and data, and both were unicode. Not sure if that matters? Anyway, I'm starting to hit my head off the desk after working this for a couple hours, so any help is appreciated. Cheers!
According to the documentation, re.match never matches at the beginning of any line other than the first:
Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.
You need to use re.search:
regexString = r"^\s*"+item.feature_key+":"
pq = re.compile(regexString, re.M)
if pq.search(data):
A small note on the raw string (r"^\s*"): in this case it is equivalent to "^\s*", because \s is not a recognized string escape sequence (unlike \r or \n), so Python keeps the backslash in the regular string literal. Still, it is safer to always declare regex patterns with raw string literals in Python (and with the corresponding notations in other languages, too).
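The difference between match and search in MULTILINE mode can be seen in a small sketch. The data string is a shortened, made-up version of the file contents, and re.escape is an addition not in the answer above: it guards against keys that contain regex metacharacters.

```python
import re

# Shortened sample of the file contents; the key is on the second line.
data = "productDetailOn:true,\nallOff:false,\n  trendingWidgetOn:true,\n"
key = "allOff"

# re.escape is an extra precaution in case a key contains regex characters.
pattern = re.compile(r"^\s*" + re.escape(key) + ":", re.M)

# match() only tries position 0 of the string, even with re.M.
match_result = pattern.match(data)

# search() scans the string, so ^ can match after any newline.
search_result = pattern.search(data)

print(match_result)
print(search_result.group(0))
```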

Search a delimited string in a file - Python

I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and python script :
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
    if re.search('JOL":"(.+?).tr', text):
        print >> needed, text,
I want it to find what's between two words (JOL":" and .tr) and then print it. But all it does is print all the text in "read.json".
You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
    match = re.search('JOL":"(.+?).tr', text)
    if match:
        print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
import json
d = json.loads(text[1:])
if 'JOL' in d:
    print >> needed, d['JOL']
Finally, you don't actually have anything named needed in your code; you opened a file named 'needed.txt', but you called the file object love. If your real code has a similar bug, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time…
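The two side notes can be tried together in a short Python 3 sketch. The sample value is a shortened, made-up stand-in for the real one, and the JSON part assumes the line starts with the two characters {: exactly as in the read.json sample shown in the question, so that prefix is what gets stripped here.

```python
import json
import re

# Shortened, hypothetical stand-in for one line of read.json.
text = '{:{"JOL":"abc%2Bdef.tr","LAPTOP":"error"}'

# Regex approach: raw string, with the dot escaped so it matches a literal ".".
match = re.search(r'JOL":"(.+?)\.tr', text)
between = match.group(1) if match else None

# JSON approach: assumes the "{:" prefix seen in the sample, then a JSON object.
prefix = "{:"
value = None
if text.startswith(prefix):
    d = json.loads(text[len(prefix):])
    value = d.get("JOL")   # the full value, including the trailing ".tr"

print(between)
print(value)
```

Note that the two approaches answer slightly different questions: the regex returns the text before .tr, while the JSON lookup returns the whole value stored under the key.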
If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'

Does a fast Python built-in method for reading lines and then splitting them exist?

This method works just fine in Python:
with open(file) as f:
    for line in f:
        for field in line.rstrip().split('\t'):
            continue
However, it also means I read each line twice. First I loop over each character of the file and search for newline characters and second I loop over each character of the line and search for tab spaces. Is there a built-in method for splitting lines, while avoiding looping over the same set of characters twice? Apologies if this is a stupid question.
If you're worried about this level of efficiency then you probably shouldn't be programming in Python. Most of what is happening in that loop happens in C (if you're using the CPython implementation). You're not going to find a more efficient way to process your data using a pure python approach or without creating a very complicated looping structure.
If I wanted to avoid looping over the lines and handle the whole file in one go I would go with a regular expression. Also, regular expressions should be really fast.
import re
regexp = re.compile("\n+")
with open(file) as f:
    lines = re.split(regexp, f.read())
Here \n+ matches one or more newlines and splits the file there. The result is a Python list with all the lines. If you want to split on other characters, for example any whitespace (spaces, tabs, and newlines), replace \n+ with \s+. Depending on what you want to do with the lines, this might not be faster. Timeit is your friend.
More on Python's regexp:
https://docs.python.org/2/library/re.html
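Following the timeit suggestion, here is a rough sketch of how the two approaches could be compared. The in-memory sample content and the function names are made up, io.StringIO stands in for a real file, and the timings will vary by machine, so no expected numbers are shown.

```python
import io
import re
import timeit

# Hypothetical sample: 1000 tab-separated lines held in memory.
content = "a\tb\tc\n" * 1000


def loop_split(text):
    """Iterate line by line (as with a file object) and split each on tabs."""
    return [line.rstrip("\n").split("\t") for line in io.StringIO(text)]


newlines = re.compile(r"\n+")


def regex_split(text):
    """Split the whole text on newlines with a regex, then split on tabs."""
    return [line.split("\t") for line in newlines.split(text) if line]


loop_time = timeit.timeit(lambda: loop_split(content), number=20)
regex_time = timeit.timeit(lambda: regex_split(content), number=20)
print("loop:  %.4f s" % loop_time)
print("regex: %.4f s" % regex_time)
```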
