Question part 1
I got this file f1:
<something #37>
<name>George Washington</name>
<a23c>Joe Taylor</a23c>
</something #37>
and I want to use re.compile to turn it into this (separated by spaces):
George Washington Joe Taylor
I tried this code but it kind of deletes everything:
import re
file = open('f1.txt')
fixed = open('fnew.txt', 'w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ', text)
fixed.write(fixed_doc)
My guess is the re.compile line but I'm not quite sure what to do with it. I'm not supposed to use 3rd party extensions. Any ideas?
Question part 2
I had a different question about comparing 2 files, for which I got this code from Alfe:
from collections import Counter
def test():
    with open('f1.txt') as f:
        contentsI = f.read()
    with open('f2.txt') as f:
        contentsO = f.read()
    tokensI = Counter(value for value in contentsI.split()
                      if value not in [])
    tokensO = Counter(value for value in contentsO.split()
                      if value not in [])
    return not (tokensI - tokensO) and not (set(tokensO) - set(tokensI))
Is it possible to implement the re.compile and re.sub in the 'if value not in []' section?
I will explain what happens with your code:
import re
file = open('f1.txt')
fixed = open('fnew.txt','w')
text = file.read()
match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ',text)
fixed.write(fixed_doc)
The instruction text = file.read() creates a string object and binds it to the name text.
Note that I distinguish between the OBJECT itself (the string) and text, which is only the name (the IDENTIFIER) of this object.
As a consequence of the instruction for unwanted in text:, the identifier unwanted is successively assigned to each character of the string that text refers to.
Besides, re.compile('<.*>') creates an object of type RegexObject (which I personally call a compiled regex, or simply a regex, <.*> being only the regex pattern).
You assign this compiled regex object to the identifier match: it's a very bad practice, because match is already the name of a method of regex objects in general, and of the one you created in particular, so you could then write match.match without error.
match is also the name of a function of the re module.
Using this name for your particular need is very confusing. You should avoid it.
There's the same flaw with the use of file as a name for the file handle of f1. file is already the name of a built-in in Python 2, so you should avoid it.
Well. Now that this badly named match object is defined, the instruction fixed_doc = match.sub(r' ',text) replaces every occurrence found by the regex in text with the replacement r' '.
Note that it's completely superfluous to write r' ' instead of just ' ', because there's absolutely nothing in ' ' that needs to be escaped. It's a fad of some anxious people to write raw strings every time they have to write a string in a regex problem.
Because of its pattern <.*>, in which .* means "greedily eat every character situated between a < and a >, except newline characters", the occurrences matched in the text are each whole line up to the last > in it.
As the name unwanted doesn't appear in this instruction, the same operation is done for each character of the text, one after the other. That is to say: nothing interesting.
To analyze the execution of a program, you should put some print instructions in your code to see what happens. For example, if you do print repr(fixed_doc), you'll see the repeated printing of this: ' \n \n \n '. As I said: nothing interesting.
There's one more flaw in your code: you open files, but you don't close them. You should always close files, otherwise weird things can happen, as I personally observed in some of my code before I realized this. Some people claim closing isn't necessary, but relying on that is unsafe.
By the way, the best way to open and close files is the with statement. It does all the work without you having to worry about it.
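As a minimal sketch of the with statement mentioned above (the temporary directory and the names out and src are only for illustration, not part of the original code), each file is closed automatically as soon as its block ends:

```python
import os
import tempfile

# Hypothetical scratch location so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), 'f1.txt')

# The file is closed when the with block exits, even if an exception occurs.
with open(path, 'w') as out:
    out.write('George Washington\n')

with open(path) as src:
    text = src.read()

# Both handles are already closed here; no explicit close() was needed.
print(out.closed, src.closed)
```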
So, now I can propose some code for your first problem:
import re
def ripl(mat=None, li=[]):
    # Called without a match: reset the list of recorded spans.
    if mat is None:
        li[:] = []
        return
    if mat.group(1):
        # Start tag with a matching end tag further on: record that end tag's span.
        li.append(mat.span(2))
        return ''
    elif mat.span() in li:
        # End tag already recorded as the partner of a start tag.
        return ''
    else:
        # Unpaired end tag: leave it untouched.
        return mat.group()
r = re.compile('</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\1>))',
               re.DOTALL)
text = '''<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>'''
print('1------------------------------------1')
print(text)
print('2------------------------------------2')
ripl()
print(r.sub(ripl, text))
print('3------------------------------------3')
result
1------------------------------------1
<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>
2------------------------------------2
George <wxc>Washington
Joe </zazaza>Taylor
3------------------------------------3
The principle is as follows:
When the regex detects a tag,
- if it's an end tag, it matches
- if it's a start tag, it matches only if there is a corresponding end tag somewhere further in the text
For each match, the method sub() of the regex r calls the function ripl() to perform the replacement.
If the match is with a start tag (which is necessarily followed somewhere in the text by its corresponding end tag, by construction of the regex), then ripl() returns ''.
If the match is with an end tag, ripl() returns '' only if this end tag has previously been detected as the corresponding end tag of an earlier start tag. This is made possible by recording, in a list li, the span of the corresponding end tag each time a start tag matches.
The recording list li is defined as a default argument so that the same list is used at each call of the function ripl() (please refer to how default arguments work to understand this, because it's subtle).
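The subtlety about default arguments can be seen in a small sketch (the function record and its parameter seen are made-up names for illustration): the default list is created once, when the function is defined, and is then shared by every call that does not pass its own list.

```python
def record(item, seen=[]):
    # 'seen' is the same list object on every call that omits it.
    seen.append(item)
    return seen

first = list(record('a'))    # copy the state after the first call
second = list(record('b'))   # the earlier item is still there

print(first)
print(second)
```

This is exactly the behaviour ripl() relies on to remember spans between calls, and also why the list has to be emptied explicitly before processing a new text.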
As a consequence of defining li as a parameter with a default argument, the list would retain all the spans recorded so far if several texts were analyzed one after another. To keep li from retaining spans from earlier texts, the list must be emptied. I wrote the function so that the first parameter has the default argument None: that allows calling ripl() without an argument before using it in a regex's sub() method.
So one must remember to call ripl() before each such use.
If you want to remove the newlines of the text in order to obtain the precise result you showed in your question, the code must be modified to:
import re
def ripl(mat=None, li=[]):
    # Called without a match: reset the list of recorded spans.
    if mat is None:
        li[:] = []
        return
    if mat.group(1):
        # A newline with its surrounding spaces: drop it.
        return ''
    elif mat.group(2):
        # Start tag with a matching end tag further on: record that end tag's span.
        li.append(mat.span(3))
        return ''
    elif mat.span() in li:
        return ''
    else:
        return mat.group()
r = re.compile('( *\n *)'
               '|'
               '</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\2>)) *',
               re.DOTALL)
text = '''<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>'''
print('1------------------------------------1')
print(text)
print('2------------------------------------2')
ripl()
print(r.sub(ripl, text))
print('3------------------------------------3')
result
1------------------------------------1
<something #37>
<name>George <wxc>Washington</name>
<a23c>Joe </zazaza>Taylor</a23c>
</something #37>
2------------------------------------2
George <wxc>WashingtonJoe </zazaza>Taylor
3------------------------------------3
You can use Beautiful Soup to do this easily:
from bs4 import BeautifulSoup
file = open('f1.txt')
fixed = open('fnew.txt','w')
#now for some soup
soup = BeautifulSoup(file)
fixed.write(str(soup.get_text()).replace('\n',' '))
The output of the above line will be:
George Washington Joe Taylor
(At least this works with the sample you gave me)
Sorry I don't understand part 2, good luck!
You don't need re.compile:
import re
clean_string = ''
with open('f1.txt') as f1:
    for line in f1:
        match = re.search('.+>(.+)<.+', line)
        if match:
            clean_string += (match.group(1))
            clean_string += ' '
print(clean_string) # 'George Washington Joe Taylor'
Figured the first part out: it was the missing '?'
match = re.compile('<.*?>')
does the trick.
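To see the difference the '?' makes, here is a small sketch on the sample from the question (the variable names greedy, lazy and cleaned are only for illustration). The greedy pattern swallows each line from its first < to its last >, while the non-greedy one stops at the first >:

```python
import re

text = ('<something #37>\n'
        '<name>George Washington</name>\n'
        '<a23c>Joe Taylor</a23c>\n'
        '</something #37>')

# Greedy: on each line, everything from the first '<' to the last '>' is replaced.
greedy = re.sub(r'<.*>', ' ', text)

# Non-greedy: each tag is matched on its own, so the text between tags survives.
lazy = re.sub(r'<.*?>', ' ', text)

# Collapse the leftover whitespace and newlines into single spaces.
cleaned = ' '.join(lazy.split())

print(repr(greedy))
print(repr(cleaned))
```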
Anyway, I'm still not sure about the second question. :/
For part 1, try the code snippet below. However, consider using a library like Beautiful Soup, as suggested by Moe Jan:
import re
import os
def main():
    f = open('sample_file.txt')
    fixed = open('fnew.txt','w')
    #pattern = re.compile(r'(?P<start_tag>\<.+?\>)(?P<content>.*?)(?P<end_tag>\</.+?\>)')
    pattern = re.compile(r'(?P<start><.+?>)(?P<content>.*?)(</.+?>)')
    output_text = []
    for text in f:
        match = pattern.match(text)
        if match is not None:
            output_text.append(match.group('content'))
    fixed_content = ' '.join(output_text)
    fixed.write(fixed_content)
    f.close()
    fixed.close()
if __name__ == '__main__':
    main()
For part 2:
I am not completely clear on what you are asking; my guess is that you want to do something like if re.sub(value) not in []. If so, note that you only need to call re.compile once, before building the Counter instances. It would help if you clarified the second part of your question.
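One possible reading of part 2, sketched under the assumption that the tags should simply be stripped before the tokens are counted (TAG_RE, tokens_without_tags and same_tokens are hypothetical names; the final comparison is the one from Alfe's code):

```python
import re
from collections import Counter

# Compiled once and reused for both inputs.
TAG_RE = re.compile(r'<.*?>')

def tokens_without_tags(text):
    # Replace every tag with a space, then count the remaining words.
    return Counter(TAG_RE.sub(' ', text).split())

def same_tokens(text_a, text_b):
    tokens_a = tokens_without_tags(text_a)
    tokens_b = tokens_without_tags(text_b)
    # Same comparison as in the question's test() function.
    return not (tokens_a - tokens_b) and not (set(tokens_b) - set(tokens_a))

tagged = '<name>George Washington</name>\n<a23c>Joe Taylor</a23c>'
plain = 'George Washington Joe Taylor'
print(same_tokens(tagged, plain))
```

In the real script the two strings would come from f1.txt and f2.txt, as in the original function.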
Actually, I would recommend using the standard-library difflib module to find the differences between the two files. This is better than writing your own diff algorithm, since its diff logic is well tested and widely used, and so less prone to logic errors caused by spurious newlines, tabs and spaces.
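A minimal sketch of that suggestion using difflib.unified_diff (the two line lists are made-up sample data standing in for the contents of f1.txt and f2.txt):

```python
import difflib

# Stand-ins for f1.readlines() and f2.readlines().
lines_a = ['George Washington\n', 'Joe Taylor\n']
lines_b = ['George Washington\n', 'Joe Tailor\n']

# unified_diff yields header lines, then lines prefixed with ' ', '-' or '+'.
diff = list(difflib.unified_diff(lines_a, lines_b,
                                 fromfile='f1.txt', tofile='f2.txt'))

print(''.join(diff))
```

An empty diff list means the two inputs are identical line by line.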
Related
I am searching for sentences containing characters using Python regular expressions.
But I can't find the sentence I want.
Please help me
regex.py
import re
opfile = open('file.txt', 'r')
contents = opfile.read()
opfile.close()
index = re.findall(r'\[start file\](?:.|\n)*\[end file\]', contents)
item = re.search(r'age.*', str(index))
file.txt(example)
[start file]
name: steve
age: 23
[end file]
result
<re.Match object; span=(94, 738), match='age: >
The age is not printed
There are several issues here:
str(index) returns the string representation of the whole list (brackets, quotes and escaped newlines included), which makes the result hard to process further
(?:.|\n)* is a very resource-consuming construct; use a plain . with the re.S (re.DOTALL) option instead
If you plan to find a single match, use re.search, not re.findall.
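The first point can be checked with a short sketch (contents is sample data modelled on the file in the question; blocks and as_text are illustrative names). Once the list is turned into a string, the newlines become the two characters backslash and n, so .* no longer stops at the end of the age line:

```python
import re

contents = "[start file]\nname: steve\nage: 23\n[end file]\n"

blocks = re.findall(r'\[start file\].*?\[end file\]', contents, re.S)

# str() of a list gives its repr: brackets, quotes and escaped newlines.
as_text = str(blocks)

from_repr = re.search(r'age.*', as_text).group()
from_block = re.search(r'age.*', blocks[0]).group()

print(repr(from_repr))
print(repr(from_block))
```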
Here is a possible solution:
match = re.search(r'\[start file].*\[end file]', contents, re.S)
if match:
    match2 = re.search(r"\bage:\s*(\d+)", match.group())
    if match2:
        print(match2.group(1))
Output:
23
If you want to get age in the output, use match2.group().
If you want to match the age only once between the start and end file markers, you could use a single pattern with a capture group and in between match all lines that do not start with age: or the start or end marker.
^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]
Example
import re
regex = r"^\[start file](?:\n(?!age:|\[(?:start|end) file]).*)*\nage: (\d+)(?:\n(?!\[(?:start|end) file]).*)*\n\[end file]"
s = ("[start file]\n" "name: steve \n" "age: 23\n" "[end file]")
m = re.search(regex, s)
if m:
    print(m.group(1))
Output
23
The example input looks like a list of key, value pairs enclosed between some start/end markers. For this use-case, it might be more efficient and readable to write the parsing stage as:
re.search to locate the document
splitlines() to isolate individual records
split() to extract the key and value of each record
Then, in a second step, access the extracted records.
Doing this separates the parsing from the use of the data and makes the code easier to maintain.
Additionally, a good practice is to wrap access to a file in a "context manager" (the with statement) to guarantee all resources are correctly cleaned on error.
Here is a full standalone example:
import re
# 1: Load the raw data from disk, in a context manager
with open('/tmp/file.txt') as f:
    contents = f.read()
# 2: Parse the raw data
fields = {}
if match := re.search(r'\[start file\]\n(.*)\[end file\]', contents, re.S):
    for line in match.group(1).splitlines():
        k, v = line.split(':', 1)
        fields[k.strip()] = v.strip()
# 3: Actual data exploitation
print(fields['age'])
def name():
    with open('newfile.txt') as f:
        lineno = f.readlines()
    for line in lineno:
        h = re.compile('(#DESIGNATION\ \:[\n\t]*)((.)*[\n\t]*)*?\#')
        print h.match(line)
name()
newfile.txt contains about 100 lines. When run, this program raises MemoryError. After removing the ? from '(#DESIGNATION\ \:[\n\t]*)((.)*[\n\t]*)*?\#', there is no error. Why is this happening, and what are feasible solutions?
Thanks.
If you want to match "#DESIGNATION :" followed by some lines followed by a line with a "#" at the beginning, you first need to read the text in as a single string and use re.MULTILINE to match it. Here is an example:
import re
text = '''
cat
mouse
#DESIGNATION : horse
dog
bird
lake
#
ocean
sea
#DESIGNATION : bike
box
table
#
nothing
something
'''
h = re.compile('^#DESIGNATION :(?:[^\n]|\n[^#])*\n#', re.MULTILINE)
matches = re.findall(h, text)
print(repr(matches))
which outputs
['#DESIGNATION : horse\ndog\nbird\nlake\n#', '#DESIGNATION : bike\nbox\ntable\n#']
Note that I'm using the non-capturing group (?:...) here to group sub-expressions without capturing their matched text.
With a larger file you probably wouldn't want re to match against the entire text body at once, and would iterate through the lines instead. If you do that, though, you cannot use '\n' in the expression, because you will only be working with a single line at a time. Instead you would need to keep track of whether you're in a #DESIGNATION block or not.
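A rough sketch of that line-by-line approach, assuming a block starts at a line beginning with "#DESIGNATION :" and ends at the next line beginning with "#" (designation_blocks is a hypothetical helper, not part of the original answer):

```python
def designation_blocks(lines):
    """Collect blocks from '#DESIGNATION :' up to the next line starting with '#'."""
    blocks = []
    current = None          # None means we are outside a block
    for raw in lines:
        line = raw.rstrip('\n')
        if current is None:
            if line.startswith('#DESIGNATION :'):
                current = [line]
        elif line.startswith('#'):
            # Terminator line: keep only its '#', as the regex above does.
            current.append('#')
            blocks.append('\n'.join(current))
            current = None
        else:
            current.append(line)
    return blocks

sample = ['cat\n', '#DESIGNATION : horse\n', 'dog\n', 'bird\n', '#\n',
          'ocean\n', '#DESIGNATION : bike\n', 'box\n', '#\n', 'nothing\n']
print(designation_blocks(sample))
```

Because it accepts any iterable of lines, an open file object can be passed directly, so the whole file never has to be held in one string.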
I have a custom script I want to extract data from with Python, but the only way I can think of is to take out the marked bits and leave the unmarked bits, like "go up" and "go down" in this example.
string_a = [start]go up[wait time=500]go down[p]
string_b = #onclick go up[wait time=500]go down active="False"
In trying to do so, all I managed to do was extract the marked bits, but I can't figure out a way to save the data that isn't marked! It always gets lost when I extract the other bits!
This is the function I'm using to extract them. I call it multiple times in order to whittle away the markers, but I can't choose the order they get extracted in!
class Parsers:
    @staticmethod
    def extract(line, filters='[]'):
        ##retval list
        substring=line[:]
        contents=[]
        for bracket in range(line.count(str(filters[0]))):
            startend =[]
            for f in filters:
                now= substring.find(f)
                startend.append(now)
            contents.append(substring[startend[0]+1:startend[1]])
            substring=substring[startend[1]+1:]
        return contents, substring
By the way, the order I'm calling it in at the moment is like this. I think I should go back to extracting the # first, but I don't want to break it again.
star_string, first = Parsers.extract(string_a, filters='* ')
bracket_string, substring = Parsers.extract(string_a, filters='[]')
at_string, final = Parsers.extract(substring, filters='# ')
Please excuse my bad Python; I learnt this all on my own and I'm still figuring it out.
You are doing some mighty juggling with Python string methods above, but if all you want is to extract the content within brackets and get the remainder of the string, that would be easier with regular expressions (in Python, the "re" module):
import re
string_a = "[start]go up[wait time=500]go down[p]"
# Compile the pattern once; it matches one bracketed tag at a time.
expr = re.compile(r"\[.*?\]")
contents = expr.findall(string_a)
substring = expr.sub("", string_a)
This simply tells the regexp engine to match a literal [, then whatever characters follow (.*) up to the next ] (the ? makes the match stop at the next ] rather than the last one). The findall call returns all such matches as a list of strings, and the sub call replaces all the matches with an empty string.
As nice as regular expressions are, they are less Python than a sub-programming language of their own. Check the documentation on them: https://docs.python.org/2/library/re.html
Still, a simpler way of doing what you had done is to check character by character, keeping some variables to "know" where you are in the string (inside a tag or not, for example), just as we would think about the problem if we could look at only one character at a time. I will write the code with Python 3.x in mind; if you are still using Python 2.x, please convert your strings to unicode objects before trying something like this:
def extract(line, filters='[]'):
    substring = ""
    contents = []
    inside_tag = False
    partial_tag = ""
    for char in line:
        if char == filters[0] and not inside_tag:
            inside_tag = True
        elif char == filters[1] and inside_tag:
            contents.append(partial_tag)
            partial_tag = ""
            inside_tag = False
        elif inside_tag:
            partial_tag += char
        else:
            # Characters outside any tag make up the remaining text.
            substring += char
    if partial_tag:
        print("Warning: unclosed tag '{}' ".format(partial_tag))
    return contents, substring
Notice that there is no need for complicated calculations of where each bracket falls in the line, and so on: you just get them all.
Not sure I understand this fully - you want to get [stuff in brackets] and everything else? If you are just parsing flat strings - no recursive brackets-in-brackets - you can do
import re
parse = re.compile(r"\[.*?\]|[^\[]+").findall
then
>>> parse('[start]go up[wait time=500]go down[p]')
['[start]', 'go up', '[wait time=500]', 'go down', '[p]']
>>> parse('#onclick go up[wait time=500]go down active="False"')
['#onclick go up', '[wait time=500]', 'go down active="False"']
The regex translates as "everything between two square brackets OR anything up to but not including an opening square bracket".
If this isn't what you wanted - do you want #word to be a separate chunk? - please show what string_a and string_b should be parsed as!
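If the goal is to keep both the marked and the unmarked pieces, the token list produced by that regex can be split in two. This is only a sketch under that assumption; split_marked and TOKEN_RE are made-up names:

```python
import re

# Same pattern as above: a bracketed tag, or a run of non-'[' characters.
TOKEN_RE = re.compile(r"\[.*?\]|[^\[]+")

def split_marked(line):
    marked, unmarked = [], []
    for token in TOKEN_RE.findall(line):
        # Tokens from the first alternative always start with '[' and end with ']'.
        if token.startswith('[') and token.endswith(']'):
            marked.append(token)
        else:
            unmarked.append(token)
    return marked, unmarked

marked, unmarked = split_marked('[start]go up[wait time=500]go down[p]')
print(marked)
print(unmarked)
```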
Input file:
rep_origin 607..1720
/label=Rep
Region 2643..5020
/label="region"
extra_info and stuff
I'm trying to split by the first column-esque entry. For example, I want to get a list that looks like this...
Desired Output:
['rep_origin 607..1720 /label=Rep', 'Region 2643..5020 /label="region" extra_info and stuff']
I tried splitting by ' ' but that gave me some crazy stuff. If I could add a "fuzzy" search term at the end that includes all alphabet characters but NOT a whitespace. That would solve the problem. I suppose you can do it with regex with something like ' [A-Z]' findall but I wasn't sure if there was a less complicated way.
Is there a way to add a "fuzzy" search term at the very end of string.split identifier? (i.e. original_string.' [alphabet_character]'
I'm not sure exactly what you're looking for, but the parse function below takes the text from your question and returns a list of sections, where each section is a list of its lines (with leading and trailing whitespace removed).
#!/usr/bin/env python
import re
# This is the input from your question. Section starts are indented by
# 4 spaces; the deeper indentation of the other lines is an assumption.
INPUT_TEXT = '''\
    rep_origin 607..1720
        /label=Rep
    Region 2643..5020
        /label="region"
        extra_info and stuff'''
# A regular expression that matches the start of a section. A section
# start is a line that has 4 spaces before the first non-space
# character.
match_section_start = re.compile(r'^ {4}[^ ]').match
def parse(text):
    sections = []
    section_lines = None
    def append_section_if_lines():
        if section_lines:
            sections.append(section_lines)
    for line in text.split('\n'):
        if match_section_start(line):
            # We've found the start of a new section. Unless this is
            # the first section, save the previous section.
            append_section_if_lines()
            section_lines = []
        section_lines.append(line.strip())
    # Save the last section.
    append_section_if_lines()
    return sections
sections = parse(INPUT_TEXT)
print(sections)
I'm trying to write a simple VB6 to C translator to help me port an open source game to the C language.
I want to be able to get "Npclist[NpcIndex]" from "With Npclist[NpcIndex]" using regex and to insert it everywhere it has to be inserted. ("With" works like a macro in VB6 that prepends Npclist[NpcIndex] wherever it is needed until it finds "End With")
Example:
With Npclist[NpcIndex]
.goTo(245) <-- it should be replaced with Npclist[NpcIndex].goTo(245)
End With
Is it possible to use regex to do the job?
I've tried using a function to perform another regex replace between the "With" and the "End With", but I have no way of knowing the text the "With" stands for (Npclist[NpcIndex]).
Thanks in advance
I personally wouldn't trust any single-regex solution to get it right the first time, nor would I feel like debugging it. Instead, I would parse the code line by line and cache any With expression, using it to replace any . directly preceded by whitespace or by any type of opening bracket (add use cases as needed):
(?<=[\s[({])\. - positive lookbehind for any character from the set + escaped literal dot
(?:(?<=[\s[({])|^)\. - use this non-capturing alternatives list if to-be-replaced . can occur on the beginning of line
import re
def convert_vb_to_c(vb_code_lines):
    c_code = []
    current_with = ""
    for line in vb_code_lines:
        if re.search(r'^\s*With', line) is not None:
            # Cache the With expression; strip() tolerates an indented With line.
            current_with = line.strip()[5:] + "."
            continue
        elif re.search(r'^\s*End With', line) is not None:
            current_with = "{error_outside_with_replacement}"
            continue
        line = re.sub(r'(?<=[\s[({])\.', current_with, line)
        c_code.append(line)
    return "\n".join(c_code)
example = """
With Npclist[NpcIndex]
    .goTo(245)
End With
With hatla
    .matla.tatla[.matla.other] = .matla.other2
    dont.mind.me(.do.mind.me)
    .next()
End With
"""
# use file_object.readlines() in real life
print(convert_vb_to_c(example.split("\n")))
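The two lookbehind patterns described above can be tried in isolation. In this sketch the replacement text 'obj.' and the sample strings are made up; it only illustrates which dots each pattern touches:

```python
import re

# A dot directly preceded by whitespace or an opening bracket.
AFTER_DELIM = re.compile(r'(?<=[\s[({])\.')
# Same, but also a dot at the very start of the line.
AFTER_DELIM_OR_START = re.compile(r'(?:(?<=[\s[({])|^)\.')

# The dots inside 'a.b' are left alone; the leading ones get the prefix.
indented = AFTER_DELIM.sub('obj.', '    .goTo(.x) a.b')

# With no indentation, the first pattern finds nothing to anchor on...
flush_left_plain = AFTER_DELIM.sub('obj.', '.goTo(245)')
# ...while the second one also accepts the start of the line.
flush_left_fixed = AFTER_DELIM_OR_START.sub('obj.', '.goTo(245)')

print(indented)
print(flush_left_plain)
print(flush_left_fixed)
```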
You can pass a function to the sub method:
# just to give the idea of the regex
regex = re.compile(r'''With (.+)
(the-regex-for-the-VB-expression)+?
End With''')
def repl(match):
    beginning = match.group(1) # NpcList[NpcIndex] in your example
    return ''.join(beginning + line for line in match.group(2).splitlines())
re.sub(regex, repl, the_string)
In repl you can obtain all the information about the matching from the match object, build whichever string you want and return it. The matched string will be replaced by the string you return.
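As a concrete but deliberately simplified sketch of this idea, the version below assumes non-nested With blocks whose member accesses start a line (WITH_BLOCK and expand_with are hypothetical names, and real VB code would need a more careful pattern):

```python
import re

# 'With <expr>' line, then the block body (lazily), then an 'End With' line.
WITH_BLOCK = re.compile(r'^With ([^\n]+)\n(.*?)^End With$', re.S | re.M)

def expand_with(match):
    target = match.group(1).strip()
    expanded = []
    for line in match.group(2).splitlines():
        # Prefix a leading dot with the With expression, keeping the indentation.
        expanded.append(re.sub(r'^(\s*)\.',
                               lambda m: m.group(1) + target + '.', line))
    return '\n'.join(expanded)

source = 'With Npclist[NpcIndex]\n    .goTo(245)\n    .stop()\nEnd With\n'
converted = WITH_BLOCK.sub(expand_with, source)
print(converted)
```

Using a function for the inner replacement as well avoids any surprises if the With expression ever contains a backslash.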
Note that you must be really careful when writing the regex above. In particular, using (.+) as I did matches the whole line up to, but excluding, the newline, which may or may not be what you want (I don't know VB and have no idea which regex could go there instead to catch only what you want).
The same goes for the (the-regex-for-the-VB-expression)+?. I have no idea what code could be in those lines, hence I leave the details of implementing it to you. Maybe taking the whole line is okay, but I wouldn't trust something this simple (expressions can probably span multiple lines, right?).
Also, doing everything in one big regular expression is, in general, error-prone and slow.
I'd strongly consider using regexes only to find With and End With, and something else to do the replacements.
This may do what you need in Python 2.7. I'm assuming you want to strip out the With and End With, right? You don't need those in C.
>>> import re
>>> search_text = """
... With Np1clist[Npc1Index]
... .comeFrom(543)
... End With
...
... With Npc2list[Npc2Index]
... .goTo(245)
... End With"""
>>>
>>> def f(m):
...     return '{0}{1}({2})'.format(m.group(1), m.group(2), m.group(3))
...
>>> regex = r'With\s+([^\s]*)\s*(\.[^(]+)\(([^)]+)\)[^\n]*\nEnd With'
>>> print re.sub(regex, f, search_text)
Np1clist[Npc1Index].comeFrom(543)
Npc2list[Npc2Index].goTo(245)