REGEX: parsing commands from LaTeX lines in Python

I'm trying to parse each line loaded from a .tex file and remove any \command (\textit, etc.), or other commands from LilyPond files such as \clef, \key, \time.
How could I do that?
What I've tried
import re

f = open('example.tex')
lines = f.readlines()
f.close()
pattern = '^\\*([a-z]|[0-9])' # this is the wrong regex!!
clean = []
for line in lines:
    remove = re.match(pattern, line)
    if remove:
        clean.append(remove.group())
print(clean)
Example
Input
#!/usr/bin/latex
\item More things
\subitem Anything
Expected output
More things
Anything

You could use a simple regex substitution using the pattern ^\\\S+\s? (note that [^\s] is the same as \S; the optional trailing \s? also removes the space after the command, so the output is not left with leading whitespace):
Sample code in Python:
import re

p = re.compile(r"^\\\S+\s?", re.MULTILINE)
text = r'''
\item More things
\subitem Anything
'''
print(re.sub(p, "", text))
The result would be:
More things
Anything

This will work:
r'\\\w+\s'
It searches for a backslash, then one or more word characters, then a whitespace character.
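As a sketch (assuming the lines have already been read from the file), that pattern can be applied with re.sub:

```python
import re

# Apply the pattern with re.sub: the trailing \s also consumes the space
# after the command, so no leading whitespace is left behind.
lines = [r"\item More things", r"\subitem Anything"]
cleaned = [re.sub(r"\\\w+\s", "", line) for line in lines]
print(cleaned)  # ['More things', 'Anything']
```

Because re.sub is not anchored to the start of the line, this also removes commands that occur mid-line, which matches the goal of stripping any \command.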

Related

python re.match doesn't work for a multiline text file

I would like to write a Python program that searches and saves the vocabulary definition of an English word.
I have converted a Babylon English-Italian dictionary into a text file.
I would like that re.match() matches the first word of each line, but it doesn't.
I always get 'not found', i.e. None, for any query (the word copied in my code).
The Babylon text file can be found here:
https://tempfile.io/en/ba2voaBnDJsn24P/file
Thanks
import clipboard
import json
import re

wordcopied = clipboard.paste()
print(wordcopied)
dictionary = {}
with open("/home/user/Babylon-EI.txt", "r") as source:
    lsource = source.readlines()
for line in lsource:
    #print(type(line))
    matc = re.match(wordcopied, line, re.MULTILINE | re.DOTALL)
    if matc != None:
        print(line)
        dictionary = {wordcopied: line}
    else:
        print('not found')
        break
I've also tried re.search, and multiple flag combinations.
Related questions are all answered with respect to flags and blank lines.

I wrote a regex inside of a Python script to analyse XML files, but sadly it's not working

I wrote a script to gather information out of an XML file. Inside, there are ENTITY declarations, and I need a regex to get the value out of them.
<!ENTITY ABC "123">
<!ENTITY BCD "234">
<!ENTITY CDE "345">
First, I open up the XML file and save its contents in a variable.
xml = open("file.xml", "r")
lines = xml.readlines()
Then I have a for loop:
result = "ABC"
var_search_result_list = []
var_searcher = "ENTITY\s" + result + '.*"[^"]*"\>'
for line in lines:
    var_search_result = re.match(var_searcher, line)
    if var_search_result != None:
        var_search_result_list += list(var_search_result.groups())
print(var_search_result_list)
I really want to have the value 123 inside of my var_search_result_list list. Instead, I get an empty list every time I use this. Has anybody got a solution?
Thanks in Advance - Toki
There are a few issues in the code.
You are using re.match, which has to match from the start of the string.
Your pattern is ENTITY\sABC.*"[^"]*"\>, which does not match from the start of the given example strings.
If you want to add 123 only, you have to use a capture group and append var_search_result.group(1) to the result list.
For example:
import re

xml = open("file.xml", "r")
lines = xml.readlines()
result = "ABC"
var_search_result_list = []
var_searcher = r"ENTITY\s" + result + r'.*"([^"]*)"\>'
print(var_searcher)
for line in lines:
    var_search_result = re.search(var_searcher, line)
    if var_search_result:
        var_search_result_list.append(var_search_result.group(1))
print(var_search_result_list)
Output
['123']
A bit more precise pattern could be
<!ENTITY\sABC\s+"([^"]*)"\>
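As a quick sketch, the tighter pattern can be checked against one of the sample declarations:

```python
import re

# The tighter pattern anchors on the literal "<!ENTITY" prefix and requires
# the quoted value to follow the entity name directly.
pattern = re.compile(r'<!ENTITY\sABC\s+"([^"]*)"\>')
m = pattern.search('<!ENTITY ABC "123">')
print(m.group(1))  # 123
```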

REGEX - Finding a specific XML tag and parsing through it

My xml looks like the following :
<example>
<Test_example>Author%5773637864827/Testing-75873874hdueu47.jpg</Test_example>
<Test_example>Auth0r%5773637864827/Testing245-75873874hdu6543u47.ts</Test_example>
<newtag>
This XML has 100 lines and I am interested in the tag "<Test_example>". Within this tag, I want to remove everything up to (and including) the /, and from the - onward remove everything until the full stop.
End result should be
<Test_example>Testing.jpg</Test_example>
<Test_example>Testing245.ts</Test_example>
I am a beginner and would love some help on this. I need a regex solution. The code I have running before this is a find-and-replace, as follows:
new = open('test.xml')
with open('test.xml', 'r') as f:
    onw = f.read().replace('new:', 'ext:')
Based on your sample data I came up with the following regex, and this is how I tested it:
import re

example_string = """<example>
<Test_example>Author%5773637864827/Testing-75873874hdueu47.jpg</Test_example>
<Test_example>Auth0r%5773637864827/Testing245-75873874hdu6543u47.ts</Test_example>
<newtag>"""
my_list = example_string.split('\n')
my_regex = re.compile(r'(<Test_example>)\S+%\d+/(\S+)-\S+(\.\S+)(</Test_example>)')
for line in my_list:
    match = re.search(my_regex, line)
    if match:
        print(match.group(1) + match.group(2) + match.group(3) + match.group(4))
Output:
<Test_example>Testing.jpg</Test_example>
<Test_example>Testing245.ts</Test_example>

Search for words (exact matches) in multiple texts using Python

I want to let the user choose and open multiple texts and perform a search for exact matches in the texts.
I want the encoding to be unicode.
If I search for "cat" I want it to find "cat", "cat,", ".cat" but not "catalogue".
I don't know how to let the user search for two words ("cat" OR "dog") in all of the texts at the same time.
Maybe I can use re?
So far I have just made it possible for the user to insert the path to the directory containing the text files to search in. Now I want to let the user (raw_input) search for two words in all of the texts, and then print and save the results (e.g. "search_word_1" and "search_word_2" found in document1.txt, "search_word_2" found in document4.txt) in a separate document (search_words).
import re, os

path = raw_input("insert path to directory :")
ex_library = os.listdir(path)
search_words = open("sword.txt", "w") # File or maybe list to put in the results
thelist = []
for texts in ex_library:
    f = os.path.join(path, texts)
    text = open(f, "r")
    textname = os.path.basename(texts)
    print textname
    for line in text.read():
        pass  # (loop body incomplete in the question)
    text.close()
Regular expressions are an appropriate tool in this case.
I want it to find "cat", "cat,", ".cat" but not "catalogue".
Pattern: r'\bcat\b'
\b matches at a word boundary.
how to let the user search for two words ("cat" OR "dog") in all of the texts at the same time
Pattern: r'\bcat\b|\bdog\b'
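A small sketch of how these patterns behave on the examples from the question:

```python
import re

# \b matches at a word boundary, so surrounding punctuation is fine,
# but a longer word containing "cat" is rejected.
pattern = re.compile(r'\bcat\b|\bdog\b')
for text in ["cat", "cat,", ".cat", "catalogue", "hot dog!"]:
    print(text, bool(pattern.search(text)))
# -> True for everything except "catalogue"
```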
To print "filename: <words that are found in it>":
#!/usr/bin/env python
import os
import re
import sys

def fgrep(words, filenames, encoding='utf-8', case_insensitive=False):
    findwords = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                           flags=re.I if case_insensitive else 0).findall
    for name in filenames:
        with open(name, 'rb') as file:
            text = file.read().decode(encoding)
        found_words = set(findwords(text))
        yield name, found_words

def main():
    words = [w.decode(sys.stdin.encoding) for w in sys.argv[1].split(",")]
    filenames = sys.argv[2:]  # the rest is filenames
    for filename, found_words in fgrep(words, filenames):
        print "%s: %s" % (os.path.basename(filename), ",".join(found_words))

main()
Example:
$ python findwords.py 'cat,dog' /path/to/*.txt
Alternative solutions
To avoid reading the whole file in memory:
import codecs
...
with codecs.open(name, encoding=encoding) as file:
    found_words = set(w for line in file for w in findwords(line))
You could also print found words in the context they are found e.g., print lines with highlighted words:
from colorama import init  # pip install colorama
init(strip=not sys.stdout.isatty())  # strip colors if stdout is redirected
from termcolor import colored  # pip install termcolor

highlight = lambda s: colored(s, on_color='on_red', attrs=['bold', 'reverse'])
...
regex = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                   flags=re.I if case_insensitive else 0)
for line in file:
    if regex.search(line):  # line contains words
        line = regex.sub(lambda m: highlight(m.group()), line)
    yield line
You need to split the text in each file on whitespace and punctuation. Once that's done you can simply look for the words you are searching for in the remaining list. You also need to convert everything to lowercase, unless you also want case sensitive search.
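A minimal sketch of that split-based approach, using \W+ as an approximation of "whitespace and punctuation":

```python
import re

# Split on runs of non-word characters, lowercase, and test set membership:
# "catalogue" survives as its own token, so searching for "cat" won't hit it.
text = "The Cat, the .cat and the catalogue."
tokens = set(re.split(r'\W+', text.lower()))
print('cat' in tokens)        # True
print('catalogue' in tokens)  # True
```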
Some (maybe useful) information in addition to the existing answers:
You should be aware that what the user means when he thinks of a "character" (=grapheme) is not always the same as a Unicode character, and some graphemes can be represented by Unicode characters in more than one unique way (e.g. composite character vs. base character + combining mark).
To do a search based on graphemes (=what the user expects in most cases) and not on specific Unicode character sequences, you need to normalize your strings before you search.
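For example, with the stdlib unicodedata module:

```python
import unicodedata

# "e-acute" as one code point vs. "e" plus a combining acute accent: the same
# grapheme to the user, but different strings until both are normalized.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(composed == decomposed)  # False
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True
```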

Replace section of text with only knowing the beginning and last word using Python

In Python, is it possible to cut out a section of text in a document when you only know the beginning and end words?
For example, using the bill of rights as the sample document, search for "Amendment 3" and remove all the text until you hit "Amendment 4" without actually knowing or caring what text exists between the two end points.
The reason I'm asking is I would like to use this Python script to modify my other Python programs when I upload them to the client's computer -- removing sections of code that exists between a comment that says "#chop-begin" and "#chop-end". I do not want the client to have access to all of the functions without paying for the better version of the code.
You can use Python's re module.
I wrote this example script for removing the sections of code in file:
import re
# Create regular expression pattern
chop = re.compile('#chop-begin.*?#chop-end', re.DOTALL)
# Open file
f = open('data', 'r')
data = f.read()
f.close()
# Chop text between #chop-begin and #chop-end
data_chopped = chop.sub('', data)
# Save result
f = open('data', 'w')
f.write(data_chopped)
f.close()
With data.txt
do_something_public()
#chop-begin abcd
get_rid_of_me() #chop-end
#chop-beginner this should stay!
#chop-begin
do_something_private()
#chop-end The rest of this comment should go too!
but_you_need_me() #chop-begin
last_to_go()
#chop-end
the following code
import re

class Chopper(object):
    def __init__(self, start='\\s*#ch'+'op-begin\\b', end='#ch'+'op-end\\b.*?$'):
        super(Chopper, self).__init__()
        self.re = re.compile('{0}.*?{1}'.format(start, end), flags=re.DOTALL+re.MULTILINE)

    def chop(self, s):
        return self.re.sub('', s)

    def chopFile(self, infname, outfname=None):
        if outfname is None:
            outfname = infname
        with open(infname) as inf:
            data = inf.read()
        with open(outfname, 'w') as outf:
            outf.write(self.chop(data))

ch = Chopper()
ch.chopFile('data.txt')
results in data.txt
do_something_public()
#chop-beginner this should stay!
but_you_need_me()
Use regular expressions:
import re
string = re.sub('#chop-begin.*?#chop-end', '', string, flags=re.DOTALL)
The .*? matches everything in between, non-greedily.
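A short sketch of why the non-greedy ? matters when several marker pairs appear in the same text:

```python
import re

text = "keep #chop-begin a #chop-end keep #chop-begin b #chop-end keep"
# Non-greedy: each begin pairs with the nearest end, so both sections go.
print(re.sub(r'#chop-begin.*?#chop-end', '', text, flags=re.DOTALL))
# Greedy: one match swallows everything from the first begin to the last end.
print(re.sub(r'#chop-begin.*#chop-end', '', text, flags=re.DOTALL))
```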
