Replacing variable text in between two known elements - python

s = """Comment=This is a comment
Name=Frank J. Lapidus
GenericName=Some name"""
replace_name = "Dr. Jack Shephard"
I have some text in a file and have been trying to figure out how to search and replace a line so that Name=Frank J. Lapidus becomes Name=Dr. Jack Shephard.
How could I do this in Python? Edited: (BTW, the second element would be a \n just in case you were wondering).
Thanks.

Use str.replace (documented under the string methods: http://docs.python.org/library/stdtypes.html#string-methods):
>>> s = """Comment=This is a comment
... Name=Frank J. Lapidus
... GenericName=Some name"""
>>> replace_name = "Dr. Jack Shephard"
>>> s.replace("Frank J. Lapidus", replace_name)
'Comment=This is a comment\nName=Dr. Jack Shephard\nGenericName=Some name'

You could use the regular expression functions from the re module. For example like this:
import re
pattern = re.compile(r"^Name=(.*)$", flags=re.MULTILINE)
re.sub(pattern, "Name=%s" % replace_name, s)
(The re.MULTILINE option makes ^ and $ match the beginning and the end of a line, respectively, in addition to the beginning and the end of the string.)
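With the s and replace_name values from the question, that call returns:
'Comment=This is a comment\nName=Dr. Jack Shephard\nGenericName=Some name'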
Edited to add: Based on your comments to Emil's answer, it seems you are manipulating Desktop Entry files. Their syntax seems to be quite close to that used by the ConfigParser module (perhaps some differences in the case-sensitivity of section names, and the expectation that comments should be preserved across a parse/serialize cycle).
An example:
import ConfigParser
parser = ConfigParser.RawConfigParser()
parser.optionxform = str # make option names case sensitive
parser.read("/etc/skel/examples.desktop")
parser.set("Desktop Entry", "Name", replace_name)
parser.write(open("modified.desktop", "w"))

As an alternative to the regular expression solution (Jukka's), if you're looking to do many of these replacements and the entire file is structured in this way, convert the entire file into a dictionary and then write it back out again after some replacements:
d = dict(x.split("=") for x in s.splitlines() if x.count("=") == 1)
d["Name"] = replace_name
new_string = "\n".join(x+"="+y for x,y in d.iteritems())
Caveats:
First, this only works if there are no '=' signs in your field names (it ignores lines that don't have exactly one = sign).
Second, converting to dict and back will not preserve the order of the fields, although you can at least sort the dictionary with some additional work.
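If field order matters, here is a sketch of the same idea using collections.OrderedDict (reusing s and replace_name from above) so the lines keep their original order:
from collections import OrderedDict

# split each line on the first '=' only, keeping the original line order
d = OrderedDict(x.split("=", 1) for x in s.splitlines() if "=" in x)
d["Name"] = replace_name
new_string = "\n".join(k + "=" + v for k, v in d.items())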

Related

Regex: Capture a line when certain columns are equal to certain values

Let's say we have this data extract:
ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10
I want to retrieve the line when from = paris, and type = member.
Which means in this example I have only:
1,paris,berlin,member,12
That satisfies these rules. I am trying to do this with regex only. I am still learning and I could only get this:
^.*(paris).*(member).*$
However, this will also give me the second line, where paris is the destination.
The idea I guess is to:
Divide the line by commas.
Check if the second item is equal to 'paris'
Check if the fourth item is equal to 'member', or even check if there is 'member' in that line as there is no confusion with this part.
Any solution where I can use only regex?
Use [^,]* instead of .* to match a sequence of characters that doesn't include the comma separator. Use this for each field you want to skip when matching the line.
^[^,]*,paris,[^,]*,member,
Note that this is a very fragile mechanism compared to using the csv module, since it will break if any field contains a comma (the csv module understands quoting a field to protect the delimiter).
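For comparison, a quoting-safe sketch of that csv approach (the filename is just a placeholder):
import csv

with open('data.csv') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    matches = [row for row in reader if row[1] == 'paris' and row[3] == 'member']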
This should do it:
^.*,(paris),.*,(member),.*$
As many have pointed out, I would read this into a dictionary using csv. However, if you insist on using regex, this should work:
[0-9]+\,paris.*[^-]member.*
Try this:
import re
regex = r"\d,paris,\w+,member,\d+"
data = """ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10"""
lines = data.split("\n")
for line in lines:
    if re.match(regex, line):
        print(line)
You can try this:
import re
s = """
ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10
"""
final_data = re.findall(r'\d+,paris,\w+,member,\d+', s)
Output:
['1,paris,berlin,member,12']
However, note that the best solution is to read the file and use a dictionary:
import csv
l = list(csv.reader(open('filename.csv')))
final_l = [dict(zip(l[0], i)) for i in l[1:]]
final_data = [','.join(i[b] for b in l[0]) for i in final_l if i['from'] == 'paris' and i['type'] == 'member']
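A slightly shorter variant of the same idea, sketched with csv.DictReader (same hypothetical filename.csv):
import csv

with open('filename.csv') as f:
    rows = [row for row in csv.DictReader(f)
            if row['from'] == 'paris' and row['type'] == 'member']
final_data = [','.join(row[field] for field in ('ID', 'from', 'to', 'type', 'duration'))
              for row in rows]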

I want to split a string on the first occurrence of any character that belongs to a list of characters. How can I do this in Python?

Basically, I have a list of special characters. I need to split a string by a character if it belongs to this list and exists in the string. Something along the lines of:
def find_char(string):
    if string.find("some_char"):
        # do xyz with some_char
    elif string.find("another_char"):
        # do xyz with another_char
    else:
        return False
and so on. The way I think of doing it is:
def find_char_split(string):
    char_list = [",", "*", ";", "/"]
    for my_char in char_list:
        if string.find(my_char) != -1:
            my_strings = string.split(my_char)
            break
    else:
        my_strings = False
    return my_strings
Is there a more Pythonic way of doing this, or would the above procedure be fine? Please help; I'm not very proficient in Python.
(EDIT): I want it to split on whichever of these characters occurs first in the string. That is to say, if the string contains multiple commas and multiple stars, and a comma comes first, then I want it to split on that first comma. Please note, if a star comes first, then it should be split on the star.
I would favor using the re module for this because the expression for splitting on multiple arbitrary characters is very simple:
r'[,*;/]'
The brackets create a character class that matches anything inside of them. The code is like this:
import re
results = re.split(r'[,*;/]', my_string, maxsplit=1)
The maxsplit argument makes it so that the split only occurs once.
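For example (a quick illustration with a made-up string):
import re
results = re.split(r'[,*;/]', 'a,b*c', maxsplit=1)
# results == ['a', 'b*c'] -- only the first delimiter found (the comma) causes a split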
If you are doing the same split many times, you can compile the regex and search on that same expression a little bit faster (but see Jon Clements' comment below):
c = re.compile(r'[,*;/]')
results = c.split(my_string)
If this speed-up is important (it probably isn't), you can use the compiled version in a function instead of having it recompile every time. Then make a separate function that stores the actual compiled expression:
def split_chars(chars, maxsplit=0, flags=0, string=None):
    # see note about the + symbol below
    c = re.compile('[{}]+'.format(''.join(chars)), flags=flags)
    def f(string, maxsplit=maxsplit):
        return c.split(string, maxsplit=maxsplit)
    return f if string is None else f(string)
Then:
special_split = split_chars(',*;/', maxsplit=1)
result = special_split(my_string)
But also:
result = split_chars(',*;/', string=my_string, maxsplit=1)
The purpose of the + character is to treat multiple consecutive delimiters as one, if that is desired (thank you Jon Clements). If this is not desired, you can just use re.compile('[{}]'.format(''.join(chars))) above. Note that even with maxsplit=1, the + still determines whether a run of adjacent delimiters at the split point is consumed as a single separator.
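A quick illustration of the difference, with a made-up string containing consecutive delimiters:
import re
re.split(r'[,;]+', 'a,,b;c')   # ['a', 'b', 'c']  (a run of delimiters is one separator)
re.split(r'[,;]', 'a,,b;c')    # ['a', '', 'b', 'c']  (each delimiter splits separately)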
Finally: have a look at this talk for a quick introduction to regular expressions in Python, and this one for a much more information packed journey.

Python: extracting text from strings using a key phrase

Struggling trying to find a way to do this, any help would be great.
I have a long string – it’s the Title field. Here are some samples.
AIR-LAP1142N-A-K
AIR-LP142N-A-K
Used Airo 802.11n Draft 2.0 SingleAccess Point AIR-LP142N-A-9
Airo AIR-AP142N-A-K9 IOS Ver 15.2
MINT Lot of (2) AIR-LA112N-A-K9 - Dual-band-based 802.11a/g/n
Genuine Airo 112N AP AIR-LP114N-A-K9 PoE
Wireless AP AIR-LP114N-A-9 Airy 50 availiable
I need to pull the part number out of the Title and assign it to a variable named 'PartNumber'. The part number will always start with the characters 'AIR-'.
So, for example:
Title = 'AIR-LAP1142N-A-K9 W/POWER CORD'
PartNumber = yourformula(Title)
print(PartNumber) will output AIR-LAP1142N-A-K9
I am fairly new to Python and would greatly appreciate help. I would like it to ONLY print the part number, not all the other text before or after it.
What you're looking for is called regular expressions and is implemented in the re module. For instance, you'd need to write something like:
>>> import re
>>> def format_title(title):
...     return re.search(r"(AIR-\S*)", title).group(1)
...
>>> Title = "Cisco AIR-LAP1142N-A-K9 W/POWER CORD"
>>> PartNumber = format_title(Title)
>>> print(PartNumber)
AIR-LAP1142N-A-K9
The \S ensures you match everything from AIR- to the next blank character.
def yourFunction(title):
    for word in title.split():
        if word.startswith('AIR-'):
            return word
>>> PartNumber = yourFunction(Title)
>>> print PartNumber
AIR-LAP1142N-A-K9
This is a sensible time to use a regular expression. It looks like the part number consists of upper-case letters, hyphens, and numbers, so this should work:
import re
def extract_part_number(title):
    return re.search(r'(AIR-[A-Z0-9\-]+)', title).groups()[0]
This will throw an error if it gets a string that doesn't contain something that looks like a part number, so you'll probably want to add some checks to make sure re.search doesn't return None and groups doesn't return an empty tuple.
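For instance, a defensive sketch of that check (returning None instead of raising when nothing matches):
import re

def extract_part_number_safe(title):
    # returns None when the title contains nothing that looks like a part number
    match = re.search(r'(AIR-[A-Z0-9\-]+)', title)
    return match.group(1) if match else None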
You could use the .split() function. What it does is split the text on spaces into a list of parts.
To do this the way you want it, I'd make a new variable (named whatever you like); for this example, let's go with titleSplitList, assigned as titleSplitList = Title.split().
From here, you know that the piece of text you're trying to retrieve is the second item of titleSplitList, so you could assign it to a new variable with:
PartNumber = titleSplitList[1]
Hope this helps.
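To make that concrete, with the Title from the earlier answer:
Title = "Cisco AIR-LAP1142N-A-K9 W/POWER CORD"
titleSplitList = Title.split()   # ['Cisco', 'AIR-LAP1142N-A-K9', 'W/POWER', 'CORD']
PartNumber = titleSplitList[1]   # 'AIR-LAP1142N-A-K9'
Note that this only works when the part number is always the second word, which is not the case for all of the sample titles above.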

Search for string in file while ignoring id and replacing only a substring

I’ve got a master .xml file generated by an external application and want to create several new .xmls by adapting and deleting some rows with python. The search strings and replace strings for these adaptions are stored within an array, e.g.:
replaceArray = [
[u'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"',
u'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'],
[u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>',
u'<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>'],
[u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="false"/>',
u'<TOOL_SELECT_LINE RowID="106871" id_tool_base="3658" use="true"/>']]
So I'd like to iterate through my file and replace all occurrences of 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"' with 'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"', and so on.
Unfortunately the ID values of "RowID", "id_tool_base" and "ref_layerid_mapping" might change occasionally. So what I need is to search for matches of the whole string in the master file, regardless of which id value is in between the quotation marks, and only replace the substring that is different between the two strings of the replaceArray (e.g. use="true" instead of use="false"). I'm not very familiar with regular expressions, but I think I need something like this for my search?
re.sub(r'<TOOL_SELECT_LINE RowID="\d+" id_tool_base="\d+" use="false"/>', "", sentence)
I'm happy about any hint that points me in the right direction! If you need any further information or if something is not clear in my question, please let me know.
One way to do this is to have a function for replacing text. The function would get the match object from re.sub and insert id captured from the string being replaced.
import re
s = 'ref_layerid_mapping="x4049" lyvis="off" toc_visible="off"'
pat = re.compile(r'ref_layerid_mapping=(.+) lyvis="off" toc_visible="off"')
def replacer(m):
    return "ref_layerid_mapping=" + m.group(1) + ' lyvis="on" toc_visible="on"'
re.sub(pat, replacer, s)
Output:
'ref_layerid_mapping="x4049" lyvis="on" toc_visible="on"'
Another way is to use back-references in replacement pattern. (see http://www.regular-expressions.info/replacebackref.html)
For example:
import re
s = "Ab ab"
re.sub(r"(\w)b (\w)b", r"\1d \2d", s)
Output:
'Ad ad'
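Applied to one of the lines from the question, the same back-reference idea might look like this (a sketch: only use="false" is rewritten, while the captured ids are pasted back unchanged):
import re

xml = '<TOOL_BUFFER RowID="106874" id_tool_base="3651" use="false"/>'
pattern = r'(<TOOL_BUFFER RowID="\d+" id_tool_base="\d+" use=)"false"(/>)'
print(re.sub(pattern, r'\1"true"\2', xml))
# <TOOL_BUFFER RowID="106874" id_tool_base="3651" use="true"/>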

Python: re.compile and re.sub

Question part 1
I got this file f1:
<something #37>
<name>George Washington</name>
<a23c>Joe Taylor</a23c>
</something #37>
and I want to use re.compile/re.sub on it so that f1 looks like this (with spaces):
George Washington Joe Taylor
I tried this code but it kind of deletes everything:
import re
file = open('f1.txt')
fixed = open('fnew.txt', 'w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ', text)
fixed.write(fixed_doc)
My guess is the re.compile line but I'm not quite sure what to do with it. I'm not supposed to use 3rd party extensions. Any ideas?
Question part 2
I had a different question about comparing 2 files, and I got this code from Alfe:
from collections import Counter
def test():
    with open('f1.txt') as f:
        contentsI = f.read()
    with open('f2.txt') as f:
        contentsO = f.read()
    tokensI = Counter(value for value in contentsI.split()
                      if value not in [])
    tokensO = Counter(value for value in contentsO.split()
                      if value not in [])
    return not (tokensI - tokensO) and not (set(tokensO) - set(tokensI))
Is it possible to implement the re.compile and re.sub in the 'if value not in []' section?
I will explain what happens with your code:
import re
file = open('f1.txt')
fixed = open('fnew.txt','w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ',text)
fixed.write(fixed_doc)
The instruction text = file.read() creates a string object and binds it to the name text.
Keep in mind the distinction between the object itself and the identifier text that refers to it.
As a consequence of the instruction for unwanted in text:, the name unwanted is successively bound to each character of the string referenced by text.
Besides, re.compile('<.*>') creates an object of type RegexObject (which I personally call a compiled regex, or simply a regex; <.*> is only the regex pattern).
You assign this compiled regex object to the identifier match: that's a very bad practice, because match is already the name of a method of regex objects in general, and of the one you created in particular, so you could then write match.match without error.
match is also the name of a function in the re module.
Using this name for your particular need is very confusing. You should avoid it.
The same flaw applies to the use of file as a name for the file handle of file f1: file is already an identifier used in the language, so you should avoid it.
Well. Now that this badly named match object is defined, the instruction fixed_doc = match.sub(r' ', text) replaces all the occurrences found by the regex match in text with the replacement r' '.
Note that it's completely superfluous to write r' ' instead of just ' ', because there's absolutely nothing in ' ' that needs to be escaped. It's a fad of some anxious people to write raw strings every time they have to write a string in a regex problem.
Because of its pattern <.*>, in which the dot-star greedily eats every character situated between a < and a > except newline characters, the occurrences caught in the text by match are each entire line up to the last > in it.
As the name unwanted doesn't appear in this instruction, the same operation is done for each character of the text, one after the other. That is to say: nothing interesting.
To analyze the execution of a program, you should put some printing instructions in your code to understand what happens. For example, if you do print repr(fixed_doc), you'll see the repeated printing of this: ' \n \n \n '. As I said: nothing interesting.
There's one more defect in your code: you open files, but you don't close them. It is important to close files, otherwise weird things can happen, which I personally observed in some of my code before I realized this need. Some people claim it isn't necessary, but that's false.
By the way, the better way to open and close files is to use the with statement. It does all the work without you having to worry about it.
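A minimal sketch of that pattern, using the f1.txt from the question:
# the file is closed automatically when the with-block exits,
# even if an exception is raised inside it
with open('f1.txt') as f:
    text = f.read()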
.
So, now I can propose code for your first problem:
import re
def ripl(mat=None, li=[]):
    if mat == None:
        li[:] = []
        return
    if mat.group(1):
        li.append(mat.span(2))
        return ''
    elif mat.span() in li:
        return ''
    else:
        return mat.group()

r = re.compile('</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\1>))',
               re.DOTALL)
text = '''<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>'''
print '1------------------------------------1'
print text
print '2------------------------------------2'
ripl()
print r.sub(ripl,text)
print '3------------------------------------3'
result
1------------------------------------1
<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>
2------------------------------------2
George <wxc>Washington
Joe </zazaza>Taylor
3------------------------------------3
The principle is as follows:
When the regex detects a tag,
- if it's an end tag, it matches
- if it's a start tag, it matches only if there is a corresponding end tag somewhere further in the text
For each match, the method sub() of the regex r calls the function ripl() to perform the replacement.
If the match is with a start tag (which is necessarily followed somewhere in the text by its corresponding end tag, by construction of the regex), then ripl() returns ''.
If the match is with an end tag, ripl() returns '' only if this end tag was previously detected as the corresponding end tag of an earlier start tag. This is made possible by recording in the list li the span of the corresponding end tag each time a start tag is detected and matches.
The recording list li is defined as a default argument so that it is always the same list that is used at each call of the function ripl() (please refer to how default arguments work to understand this, because it's subtle).
As a consequence of defining li as a parameter with a default argument, the list object li would retain all the spans recorded while analyzing earlier texts if several texts were analyzed successively. To prevent li from retaining spans of past text matches, the list must be emptied. I wrote the function so that the first parameter has the default argument None: that allows ripl() to be called without arguments before any use of it in a regex's sub() method.
So remember to call ripl() before each such use.
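If that subtlety is unfamiliar, here is a tiny generic illustration (not part of the answer's code) of how a mutable default argument keeps its state across calls:
def remember(x, seen=[]):
    # 'seen' is created once, when the def statement runs,
    # so calls that don't pass it all share the same list
    seen.append(x)
    return seen

print(remember(1))   # [1]
print(remember(2))   # [1, 2]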
.
If you want to remove the newlines of the text in order to obtain the precise result you showed in your question, the code must be modified to:
import re
def ripl(mat=None, li=[]):
    if mat == None:
        li[:] = []
        return
    if mat.group(1):
        return ''
    elif mat.group(2):
        li.append(mat.span(3))
        return ''
    elif mat.span() in li:
        return ''
    else:
        return mat.group()

r = re.compile('( *\n *)'
               '|'
               '</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\2>)) *',
               re.DOTALL)
text = '''<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>'''
print '1------------------------------------1'
print text
print '2------------------------------------2'
ripl()
print r.sub(ripl,text)
print '3------------------------------------3'
result
1------------------------------------1
<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>
2------------------------------------2
George <wxc>WashingtonJoe </zazaza>Taylor
3------------------------------------3
You can use Beautiful Soup to do this easily:
from bs4 import BeautifulSoup
file = open('f1.txt')
fixed = open('fnew.txt','w')
#now for some soup
soup = BeautifulSoup(file)
fixed.write(str(soup.get_text()).replace('\n',' '))
The output of the above line will be:
George Washington Joe Taylor
(At least this works with the sample you gave me.)
Sorry I don't understand part 2, good luck!
Don't need re.compile
import re
clean_string = ''
with open('f1.txt') as f1:
    for line in f1:
        match = re.search('.+>(.+)<.+', line)
        if match:
            clean_string += match.group(1)
            clean_string += ' '
print(clean_string) # 'George Washington Joe Taylor'
Figured the first part out: it was the missing '?'.
match = re.compile('<.*?>')
does the trick.
Anyway, still not sure about the second question. :/
For part 1, try the code snippet below. However, consider using a library like BeautifulSoup, as suggested by Moe Jan.
import re
import os
def main():
    f = open('sample_file.txt')
    fixed = open('fnew.txt', 'w')
    #pattern = re.compile(r'(?P<start_tag>\<.+?\>)(?P<content>.*?)(?P<end_tag>\</.+?\>)')
    pattern = re.compile(r'(?P<start><.+?>)(?P<content>.*?)(</.+?>)')
    output_text = []
    for text in f:
        match = pattern.match(text)
        if match is not None:
            output_text.append(match.group('content'))
    fixed_content = ' '.join(output_text)
    fixed.write(fixed_content)
    f.close()
    fixed.close()

if __name__ == '__main__':
    main()
For part 2:
I am not completely clear about what you are asking; however, my guess is that you want to do something like if re.sub(value) not in []. Note that you need to call re.compile only once, prior to initializing the Counter instance. It would be better if you clarified the second part of your question.
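For what it's worth, a sketch of how the tag-stripping could be combined with the Counter code, assuming you want tags removed before counting rather than filtered inside if value not in []:
import re
from collections import Counter

tag = re.compile('<.*?>')   # compile once, outside the comprehension

with open('f1.txt') as f:
    contents = f.read()

# strip the tags first, then count the remaining words
tokens = Counter(value for value in tag.sub(' ', contents).split())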
Actually, I would recommend using Python's built-in difflib module to find the differences between two files. This is better than writing your own diff algorithm, since the diff logic is well tested and widely used, and is not vulnerable to logical or programming errors resulting from the presence of spurious newlines, tabs and spaces.
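A minimal sketch of that approach with difflib, using the file names from the question:
import difflib

with open('f1.txt') as f1, open('f2.txt') as f2:
    diff = difflib.unified_diff(f1.readlines(), f2.readlines(),
                                fromfile='f1.txt', tofile='f2.txt')
print(''.join(diff))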
