I'm writing a program that can parse math papers written in .tex files. Here is what I want:
The program is supposed to detect the beginning, the end, sections, subsections, subsubsections, theorems, lemmas, definitions, conjectures, corollaries, propositions, exercises, notations and examples in a math paper, and ignore the rest of the contents to produce a summary.
At the beginning, the program is supposed to retain all characters until it reaches the token MT. At that point the lexer should preserve the token and enter the ig mode. It should then ignore all characters unless it detects a theorem/lemma/definition/conjecture/corollary/example/exercise/notation/proposition, in which case it temporarily enters the INITIAL mode and retains it, or a (sub/subsub)section, in which case it temporarily enters the sec mode.
\newtheorem{<name>}{<heading>}[<counter>] and \newtheorem{<name>}[<counter>]{<heading>} are detected as TH ptext THCC ptext THC ptext and TH ptext THCS ptext THSC ptext THC respectively, where ptext is a sequence of TEXT tokens.
import sys
import logging
from ply.lex import TOKEN

if sys.version_info[0] >= 3:
    raw_input = input

tokens = (
    'BT', 'BL', 'BD', 'BCONJ', 'BCOR', 'BE', 'ET', 'EL', 'ED', 'ECONJ', 'ECOR',
    'EE', 'SEC', 'SSEC', 'SSSEC', 'ES', 'TEXT', 'ITEXT', 'BIBS', 'MT', 'BN',
    'EN', 'BEXE', 'EEXE', 'BP', 'EP', 'TH', 'THCS', 'THSC', 'THCC', 'THC',
)

states = (
    ('ig', 'exclusive'),
    ('sec', 'exclusive'),
    ('th', 'exclusive'),
    ('tht', 'exclusive'),
    ('thc', 'exclusive'),
)

logging.basicConfig(
    level=logging.DEBUG,
    filename="lexlog.txt",
    filemode="w",
    format="%(filename)10s:%(lineno)4d:%(message)s"
)
log = logging.getLogger()

th_temp = ''
thn_temp = ''
term_dic = {'Theorem': '', 'Lemma': '', 'Corollary': '', 'Definition': '',
            'Conjecture': '', 'Example': '', 'Exercise': '', 'Notation': '',
            'Proposition': ''}
idb_list = [''] * 9
ide_list = [''] * 9

bb = r'\\begin\{'
eb = r'\\end\{'
ie = r'\}'

# The tuple order fixes which slot each environment kind occupies.
term_order = ('Theorem', 'Lemma', 'Corollary', 'Definition', 'Conjecture',
              'Example', 'Exercise', 'Notation', 'Proposition')

def finalize_terms():
    for i, name in enumerate(term_order):
        if term_dic[name] != '':
            idb_list[i] = bb + term_dic[name] + ie
            ide_list[i] = eb + term_dic[name] + ie
    print(idb_list)
    print(ide_list)
Here are some of the parsing functions:
def t_TH(t):
    r'\\newtheorem\{'
    t.lexer.begin('th')
    return t

def t_th_THCS(t):
    r'\}\['
    t.lexer.begin('thc')
    return t

def t_tht_THC(t):
    r'\}'
    global th_temp, thn_temp
    if thn_temp not in term_dic:
        print(f"{thn_temp} is unknown!")
    elif len(th_temp) == 0:
        print(f"No abbreviation for {thn_temp} is found!")
    else:
        term_dic[thn_temp] = th_temp
        print(f"The abbreviation for {thn_temp} is {th_temp}!")
    th_temp = ''
    thn_temp = ''
    t.lexer.begin('INITIAL')
    return t

def t_th_THCC(t):
    r'\}\{'
    t.lexer.begin('tht')
    return t

def t_thc_THSC(t):
    r'\]\{'
    t.lexer.begin('tht')
    return t

#TOKEN(idb_list[0])
def t_ig_BT(t):
    t.lexer.begin('INITIAL')
    return t

#TOKEN(ide_list[0])
def t_ET(t):
    t.lexer.begin('ig')
    return t

def t_INITIAL_sec_thc_TEXT(t):
    r'[\s\S]'
    return t

def t_th_TEXT(t):
    r'[\s\S]'
    global th_temp
    th_temp = th_temp + t.value
    return t

def t_tht_TEXT(t):
    r'[\s\S]'
    global thn_temp
    thn_temp = thn_temp + t.value
    return t

def t_ig_ITEXT(t):
    r'[\s\S]'
    pass

import ply.lex as lex
lex.lex(debug=True, debuglog=log)
Here is the error:
ERROR: /Users/CatLover/Documents/Python_Beta/TexExtractor/texlexparse.py:154: No regular expression defined for rule 't_ET'
I don't know why the regular expressions defined for 't_ET' etc. via #TOKEN do not work.
Ply is a parser generator. It takes your parser/lexer description and compiles a parser/lexer from it. You cannot change the description of the language during the parse.
In this particular case, you might be better off writing a streaming ("online") scanner. But if you want to use Ply, then you will be better off not trying to modify the grammar to ignore parts of the input. Just parse the entire input and ignore the parts you're not interested in. You'll probably find that the code is much simpler.
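To illustrate the whole-input idea, here is a minimal sketch (my own illustration, not part of PLY or the original lexer) that collects the \newtheorem abbreviations in one regex pass and then pulls out every matching environment; the function and sample names are illustrative:

```python
import re

# A sketch of the "scan everything, ignore what you don't need" approach.
# The abbreviation table is plain data, so the scanner's rules never change.
def extract_environments(tex: str) -> dict:
    # Both \newtheorem{abbrev}{Heading}[counter] and
    # \newtheorem{abbrev}[counter]{Heading} open with \newtheorem{abbrev}.
    abbrevs = re.findall(r'\\newtheorem\{([^}]*)\}', tex)
    found = {}
    for name in abbrevs:
        env = re.compile(r'\\begin\{' + re.escape(name) + r'\}(.*?)'
                         r'\\end\{' + re.escape(name) + r'\}', re.DOTALL)
        found[name] = env.findall(tex)
    return found

sample = r"""
\newtheorem{thm}{Theorem}[section]
\begin{thm}Every finite integral domain is a field.\end{thm}
"""
print(extract_environments(sample))
```

A real version would also need the sectioning commands, but the point stands: the scanner's description stays fixed while the data (the abbreviation table) varies.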
I wrote some code to let the user check what percentage of the money they earned they have spent. Almost every step performs normally, until the final part.
a_c[('L'+row_t)].value returns:
=<Cell 'Sheet1'.B5>/<Cell 'Sheet1'.J5>
yet I expected it to be a value.
Code:
st_column = st_column_r.capitalize()
row_s = str(a_c.max_row)
row_t = str(a_c.max_row + 1)
row = int(row_t)
a_c[('J'+row_t)] = ('=SUM(I2,J'+row_s+')')  # total income
errorprevention = a_c[('J'+row_t)].value
a_c[(st_column+row_t)] = ('=SUM('+(st_column+'2')+','+(st_column+row_s)+')')
a_c['L'+row_t].number_format = FORMAT_PERCENTAGE_00
if errorprevention != 0:
    a_c[('L'+row_t)] = ('='+str(a_c[(st_column+row_t)])+'/'+str(a_c[('J'+row_t)]))
    print('The ratio of past spending in category ' + inputtype[st_column] + ' to total income is: ' + a_c[('L'+row_t)].value)
Try changing the formula creation to:
a_c[('L' + row_t)].value = '=' + a_c[(st_column + row_t)].coordinate + '/' + a_c[('J' + row_t)].coordinate
or use an f string
a_c[('L' + row_t)].value = f"={a_c[(st_column + row_t)].coordinate}/{a_c[('J' + row_t)].coordinate}"
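One caveat worth knowing (my addition, not from the question): openpyxl never evaluates formulas, it only stores the strings, so .value on a formula cell returns the formula text. A cached result is only available via data_only=True after the workbook has been recalculated and saved by Excel. A small sketch with illustrative file names:

```python
from openpyxl import Workbook, load_workbook

# Build a tiny workbook (file name is illustrative).
wb = Workbook()
ws = wb.active
ws['A1'] = 10
ws['B1'] = '=SUM(A1,5)'
wb.save('demo.xlsx')

# Reading normally returns the formula text itself...
ws2 = load_workbook('demo.xlsx')['Sheet']
print(ws2['B1'].value)   # '=SUM(A1,5)'

# ...while data_only=True returns the cached result, which stays None
# until Excel (or another engine) has recalculated and saved the file.
ws3 = load_workbook('demo.xlsx', data_only=True)['Sheet']
print(ws3['B1'].value)   # None, since Excel never touched this file
```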
[Edit: apparently this file looks similar to h5 format]
I am trying to extract metadata from a file with the extension .dm3 using hyperspy in Python. I am able to get all the data, but it is saved as a treeview, and I need the data in JSON. I tried to write my own parser to convert it, which worked for most cases but then failed:
TreeView data generated
Is there a library or package I can use to convert the treeview to JSON in Python?
My parser:
import os
import hyperspy.api as hs

def writearray(file, string):
    k = string.split('=')
    file.write('"' + k[0] + '":' + '[')
    for char in k[1]:
        file.write(char)
    file.write(']')

def writenum(file, string):
    k = string.split('=')
    file.write('"' + k[0] + '":' + k[1])

def writestr(file, string):
    k = string.split('=')
    file.write('"' + k[0] + '":' + '"' + k[1] + '"')

def startnew(file, string):
    file.write('"' + string + '":' + '{\n')

def closenum(file, string):
    k = string.split('=')
    file.write('"' + k[0] + '":' + k[1] + '\n')
    file.write('},\n')

def closestr(file, string):
    k = string.split('=')
    file.write('"' + k[0] + '":' + '"' + k[1] + '"' + '\n')
    file.write('},\n')

def closearr(file, string):
    k = string.split('=')
    file.write('"' + k[0] + '":' + '[')
    for char in k[1]:
        file.write(char)
    file.write(']\n')
    file.write('},\n')

def strfix(string):
    temp = ''
    for char in string:
        if char != ' ':
            temp += char
    return temp

def writethis(file, string):
    stripped = strfix(string)
    if "=" in stripped:
        temp = stripped.split("=")
        if ',' in temp[1]:
            writearray(file, stripped)
        elif temp[1].isdigit() or temp[1].isdecimal():
            writenum(file, stripped)
        else:
            writestr(file, stripped)

def createMetaData(dm3file):
    txtfile = os.path.splitext(dm3file)[0] + '.txt'
    jsonfile = os.path.splitext(dm3file)[0] + '.json'
    s = hs.load(dm3file)
    s.original_metadata.export(txtfile)
    file1 = open(txtfile, 'r', encoding="utf-8")
    Lines = file1.readlines()
    k = []
    for line in Lines:
        k.append(line)
    L = []
    for string in k:
        temp = ''
        for char in string:
            if (char.isalpha() or char.isdigit() or char == '=' or char == ' '
                    or char == '<' or char == '>' or char == ',' or char == '.'
                    or char == '-' or char == ':'):
                temp += char
        L.append(temp)
    file2 = open(jsonfile, 'w', encoding="utf-8")
    file2.write('{\n')
    for i in range(0, len(L) - 1):
        currentspaces = len(L[i]) - len(L[i].lstrip())
        nextspaces = len(L[i + 1]) - len(L[i + 1].lstrip())
        sub = nextspaces - currentspaces
        if i != len(L) - 2:
            if sub == 0:
                writethis(file2, L[i])
                if '=' in L[i]:
                    file2.write(',\n')
                else:
                    file2.write('\n')
            elif sub > 0:
                startnew(file2, L[i])
            else:
                if sub == -3:
                    writethis(file2, L[i])
                    file2.write('\n},\n')
                elif sub == -7:
                    writethis(file2, L[i])
                    file2.write('\n}\n},\n')
        else:
            writethis(file2, L[i])
            file2.write('\n}\n}\n}\n}')
    file1.close()
    os.remove(txtfile)
I wrote a parser for the tree-view format:
from ast import literal_eval
from collections import abc

from more_itertools import peekable

def parse_literal(x: str):
    try:
        return literal_eval(x)
    except Exception:
        return x.strip()

def _treeview_parse_list(lines: peekable) -> list:
    list_as_dict = {}
    for line in (x.strip() for x in lines):
        raw_k, raw_v = line.split(' = ')
        list_as_dict[int(raw_k.split()[-1][1:-1])] = parse_literal(raw_v)
        peek = lines.peek(None)
        if '╚' in line or (peek is not None and '├' in peek):
            break
    list_as_list = [None] * (max(list_as_dict) + 1)
    for idx, v in list_as_dict.items():
        list_as_list[idx] = v
    return list_as_list

def _treeview_parse_dict(lines: peekable) -> dict:
    node = {}
    for line in (x.strip() for x in lines):
        if ' = ' in line:
            raw_k, raw_v = line.split(' = ')
            node[raw_k.split()[-1]] = parse_literal(raw_v)
        elif '<list>' in line:
            node[line.split()[-2]] = _treeview_parse_list(lines)
        else:
            try:
                idx = line.index('├')
            except ValueError:
                idx = line.index('└')
            peek = lines.peek(None)
            if peek is not None and '├' in peek and idx == peek.index('├'):
                node[line.split()[-1]] = {}
            else:
                node[line.split()[-1]] = _treeview_parse_dict(lines)
        if '└' in line:
            break
    return node

def treeview_to_dict(lines: abc.Iterable) -> dict:
    return _treeview_parse_dict(peekable(lines))
Usage:
with open('meta.txt') as f:
    d = treeview_to_dict(f)
You can obtain the metadata as a JSON file using Python's built-in json library:
import json

with open('meta.txt') as txt_file:
    with open('meta.json', 'w') as json_file:
        json.dump(treeview_to_dict(txt_file), json_file, indent=4)
I've added indent=4 to make the JSON file more human-readable, so that you can verify it against the original format. As far as I can tell they match up in a sensible way.
As I've written this, it uses the third-party more_itertools.peekable class. If you can't use more_itertools, it shouldn't be too hard to implement that functionality yourself, or just refactor the code so that it is no longer necessary to look ahead.
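If more_itertools is unavailable, a minimal stand-in covering the only operations the parser uses (iteration plus peek with a default) might look like this sketch; the class name is my own:

```python
class Peekable:
    """Minimal look-ahead iterator: supports iteration and peek(default)."""
    _sentinel = object()

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._cache = self._sentinel

    def __iter__(self):
        return self

    def __next__(self):
        # Serve the cached item first, if peek() fetched one.
        if self._cache is not self._sentinel:
            value, self._cache = self._cache, self._sentinel
            return value
        return next(self._it)

    def peek(self, default=_sentinel):
        # Fetch and cache the next item without consuming it.
        if self._cache is self._sentinel:
            try:
                self._cache = next(self._it)
            except StopIteration:
                if default is self._sentinel:
                    raise
                return default
        return self._cache
```

Swapping `peekable` for `Peekable` in the functions above should be the only change needed.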
License:
This is free and unencumbered software released into the public domain.
Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.
In jurisdictions that recognize copyright laws, the author or authors of this software dedicate any and all copyright interest in the software to the public domain. We make this dedication for the benefit of the public at large and to the detriment of our heirs and successors. We intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
For more information, please refer to https://unlicense.org
A more straightforward approach is to use the as_dictionary method to convert the metadata to a Python dictionary; then you can convert it to JSON.
import hyperspy.api as hs
s = hs.load('file.dm3')
metadata_dictionary = s.original_metadata.as_dictionary()
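From there, dumping the dictionary to a JSON file is one call to the standard library's json module (the stand-in dictionary and file name below are illustrative):

```python
import json

# metadata_dictionary comes from as_dictionary() above; a stand-in
# dictionary with made-up values shows the conversion itself.
metadata_dictionary = {'ImageList': {'TagGroup0': {'Name': 'dm3 image'}}}
with open('file.json', 'w') as json_file:
    json.dump(metadata_dictionary, json_file, indent=4)
```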
A different approach is to use the new RosettaSciIO library, which has been split out of hyperspy, to extract the metadata; for more information see the documentation: https://hyperspy.org/rosettasciio/
I am trying to make a simple text editor using python. I am now trying to make a find function. This is what I've got:
def Find():
    text = textArea.get('1.0', END+'-1c').lower()
    input = simpledialog.askstring("Find", "Enter text to find...").lower()
    startindex = []
    endindex = []
    lines = 0
    if input in text:
        text = textArea.get('1.0', END+'-1c').lower().splitlines()
        for var in text:
            character = text[lines].index(input)
            start = str(lines + 1) + '.' + str(character)
            startindex.append(start)
            end = str(lines + 1) + '.' + str(character + int(len(input)))
            endindex.append(end)
            textArea.tag_add('select', startindex[lines], endindex[lines])
            lines += 1
    textArea.tag_config('select', background = 'green')
This will successfully highlight words that match the user's input with a green background. But the problem is that it only highlights the first match on every line, as you can see here.
I want it to highlight all matches.
Full code here: https://pastebin.com/BkuXN5pk
I recommend using the text widget's built-in search capability. Shown using Python 3.
from tkinter import *

root = Tk()
textArea = Text(root)
textArea.grid()
textArea.tag_config('select', background = 'green')

f = open('mouse.py', 'r')
content = f.read()
f.close()
textArea.insert(END, content)

def Find(input):
    start = 1.0
    length = len(input)
    while 1:
        pos = textArea.search(input, start, END)
        if not pos:
            break
        end_tag = pos + '+' + str(length) + 'c'
        textArea.tag_add('select', pos, end_tag)
        start = pos + '+1c'

Find('display')
root.mainloop()
I am currently stuck on one point in a Python program. I want to compare the event's WindowName against a list, and launch actions when it matches.
Example:
import win32api
import pyHook

liste = ["Google", "Task"]
if event.WindowName == liste:
    Screenshot()
    return True
else:
    return False
Complete code (this works):
def OnMouseEvent(event):
    global interval
    data = '\n[' + str(time.ctime().split(' ')[3]) + ']' \
        + ' WindowName : ' + str(event.WindowName)
    data += '\n\tButton:' + str(event.MessageName)
    data += '\n\tClicked in (Position):' + str(event.Position)
    data += '\n===================='
    global t, start_time, pics_names
    """
    Code Edit
    """
    t = t + data
    if len(t) > 300:
        ScreenShot()
    """
    Finish
    """
    if len(t) > 500:
        f = open('Logfile.txt', 'a')
        f.write(t)
        f.close()
        t = ''
    if int(time.time() - start_time) == int(interval):
        Mail_it(t, pics_names)
        start_time = time.time()
        t = ''
        return True
    else:
        return False
When I edit the code between the """ markers, it doesn't work:
t = t + data
liste = ["Google", "Task"]
if event.WindowName == liste:
    ScreenShot()
It returns:
File "C:\Python26\lib\site-packages\pyHook\HookManager.py", line 324, in MouseSwitch
    func = self.mouse_funcs.get(msg)
TypeError: an integer is required
I tried changing this in HookManager.py:
func = self.keyboard_funcs.get(msg) to: func = self.keyboard_funcs.get(int(str(msg)))
But it doesn't work. I think I have described the whole problem.
Thanks in advance for your help :)
This small script does exactly what I need.
#!/usr/bin/python
import os
import fileinput
import sys
import shutil
import glob
import time

def replaceAll1(files, searchExp, replaceExp):
    for line in fileinput.input(files, inplace=1):
        if searchExp in line:
            line = line.replace(searchExp, replaceExp)
        sys.stdout.write(line)

param1 = [1, 2, 3]
param2 = [1, 2, 3]
param3 = [1, 2, 3]

for i in xrange(len(param1)):
    for ii in xrange(len(param2)):
        for iii in xrange(len(param3)):
            os.system("cp -a cold.in input.in")
            old_param1 = "param1 = 1"
            old_param2 = "param2 = 1"
            old_param3 = "param3 = 1"
            new_param1 = "param1 = " + str(param1[i])
            new_param2 = "param2 = " + str(param2[ii])
            new_param3 = "param3 = " + str(param3[iii])
            replaceAll1('input.in', old_param1, new_param1)
            replaceAll1('input.in', old_param2, new_param2)
            replaceAll1('input.in', old_param3, new_param3)
            time.sleep(4)
It goes into a configuration file and sequentially replaces the input parameters according to the lists accessed by the loop indexes; it is simply a combination of all three parameters with each other.
# Input file
param1 = 1 # --- Should be [1,2,3]
param2 = 1 # --- Should be [1,2,3]
param3 = 1 # --- Should be [1,2,3]
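replaceAll1's in-place substring replacement can be checked in isolation; here is a small Python 3 sketch of what it does, using a throwaway file name:

```python
import fileinput
import sys

def replaceAll1(files, searchExp, replaceExp):
    # inplace=1 redirects stdout into the file being read,
    # so every line must be written back, changed or not.
    for line in fileinput.input(files, inplace=1):
        if searchExp in line:
            line = line.replace(searchExp, replaceExp)
        sys.stdout.write(line)

# Demo on a throwaway file name:
with open('demo.in', 'w') as f:
    f.write('param1 = 1\nparam2 = 1\n')
replaceAll1('demo.in', 'param1 = 1', 'param1 = 3')
print(open('demo.in').read())
```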
The problem is that its big brother does not behave the same way. When it loops through the lists, it gets lost at scheme = 2 and writes dissp_scheme = 2 when it should still be dissp_scheme = 1 (it stays frozen there). I printed out every single variable that goes into the function replaceAll (marked with comments), but when I turn the other calls back on it messes everything up. Here is the script.
#!/usr/bin/python
import os
import fileinput
import sys
import shutil
import glob
import time

os.chdir(os.getcwd())

# Replaces the input file parameters
def replaceAll(files, searchExp, replaceExp):
    for line in fileinput.input(files, inplace=1):
        if searchExp in line:
            line = line.replace(searchExp, replaceExp)
        sys.stdout.write(line)

# Gets a number inside my input file.
def get_parameter(variable, file_name):
    f = open(file_name, 'r').readlines()
    for i in xrange(len(f)):
        index = f[i].find(variable)
        if index != -1:
            pre_found = f[i].split('=')[1]
            return pre_found

# Gets the discretization scheme name.
def get_sheme(number):
    if number == 1:
        return "Simple Centered Scheme"
    elif number == 2:
        return "Lax-Wendroff Scheme"
    elif number == 3:
        return "MacCormack Scheme"
    elif number == 4:
        return "Beam-Warming Scheme"
    elif number == 5:
        return "Steger-Warming 1st Order Scheme"
    elif number == 6:
        return "Steger-Warming 2nd Order Scheme"
    elif number == 7:
        return "Van Leer 1st Order Scheme"
    elif number == 8:
        return "Van Leer 2nd Order Scheme"
    elif number == 9:
        return "Roe Scheme"
    elif number == 10:
        return "AUSM Scheme"

# Gets the dissipation scheme name.
def get_dissip(number):
    if number == 1:
        return "Pullian Non-Linear dissipation"
    elif number == 2:
        return "Second difference dissipation"
    elif number == 3:
        return "Fourth difference dissipation"
    elif number == 4:
        return "B&W dissipation"

# Generates the density gnuplot scripts.
def gnuplot(variable, pressure_ratio, scheme, dissip_scheme):
    # gnuplot('Density',10,get_sheme(3),'Pullian')
    # Building name of the output file.
    outFileName = variable.lower() + '_ratio' + str(int(pressure_ratio)) + '_' + scheme.replace(" ", "") + '_dissp' + dissip_scheme.replace(" ", "") + '.tex'
    gnuFileName = variable.lower() + '_ratio' + str(int(pressure_ratio)) + '_' + scheme.replace(" ", "") + '_dissp' + dissip_scheme.replace(" ", "") + '.gnu'
    # Build title of the plot
    title = 'Analytical vs Numerical | ' + scheme
    f = open(gnuFileName, 'w')
    f.write("set term cairolatex monochrome size 15.0cm, 8cm\n")
    f.write('set output "' + outFileName + '"\n')
    f.write("set grid\n")
    f.write('set xtics font "Times-Roman, 10\n')
    f.write('set ytics font "Times-Roman, 10\n')
    f.write('set xlabel "x position" center\n')
    f.write('set ylabel "' + variable + '" center\n')
    f.write('set title "Analytical vs Numerical Results | ' + variable + '" \n')
    f.write('set pointsize 0.5\n')
    f.write('set key font ",10"\n')
    fortran_out_analytical = 'a' + variable.lower() + '.out'
    fortran_out_numerical = variable.lower() + 'Output.out'
    f.write('plot "' + fortran_out_analytical + '" u 1:2 with linespoints lt -1 lw 1 pt 4 title "Analytical",\\\n')
    f.write('"' + fortran_out_numerical + '" u 1:2 with lines lw 5 title "Numerical"\n')
    f.close()

# Generate latex code.
def generate_latex(text_image_file, caption):
    latex.write("\\begin{figure}[H]\n")
    latex.write("    \centering\n")
    latex.write("    \input{" + text_image_file + "}\n")
    latex.write("    \caption{" + caption + "}\n")
    latex.write("    \label{fig:digraph}\n")
    latex.write("\\end{figure}\n")
    latex.write("\n\n")

# -----------------------------------------------------------------------
# Main loop.
# -----------------------------------------------------------------------
pressure_ratios = [5.0]
schemes = [1, 2, 3]
dissips = [1, 2, 3]

# Define replace lines for replace all.
scheme_line = "scheme = "
dissip_line = "dissp_scheme = "

# Open Latex export file.
latex = open("bizu.txt", 'w')

i = 0
# ratios.
for i in xrange(len(pressure_ratios)):
    print "----------------------------------------"
    print " + Configuring File for pressure ratio: " + str(pressure_ratios[i])
    print "----------------------------------------\n"
    # Schemes
    for jj in xrange(len(schemes)):
        print " + Configuring file for scheme: " + get_sheme(schemes[jj]) + "\n"
        for kkk in xrange(len(dissips)):
            print " + Configuring file for dissip: " + get_dissip(dissips[kkk])
            # We always work with a brand new file.
            os.system("rm input.in")
            os.system("cp -a cold.in input.in")
            # Replace pressures.
            p1_line_old = 'p1 = 5.0d0'
            rho1_line_old = 'rho1 = 5.0d0'
            p1_line_new = 'p1 = ' + str(pressure_ratios[i]) + 'd0'
            rho1_line_new = 'rho1 = ' + str(pressure_ratios[i]) + 'd0'
            replaceAll('input.in', p1_line_old, p1_line_new)
            replaceAll('input.in', rho1_line_old, rho1_line_new)
            # Replace discretization scheme.
            old_scheme = scheme_line + "1"
            new_scheme = scheme_line + str(schemes[jj])
            #==========================================================
            # This call is messing everything up when scheme turns to 2
            #==========================================================
            replaceAll('input.in', old_scheme, new_scheme)
            # Replace dissipation scheme.
            old_dissp_scheme = dissip_line + "1"
            new_dissp_scheme = dissip_line + str(dissips[kkk])
            print p1_line_old
            print new_scheme
            print new_dissp_scheme
            replaceAll('input.in', old_dissp_scheme, new_dissp_scheme)
            time.sleep(3)
            # ### Calling program
            # os.system("./sod")
            #

latex.close()
And the input file that it works on is:
&PAR_physical
p1 = 5.0d0
p4 = 1.0d0
rho1 = 5.0d0
rho4 = 1.0d0
fgamma = 1.4d0
R_const = 287.0d0
F_Cp = 1004.5
F_Cv = 717.5
/
&PAR_geometry
total_mesh_points = 1001
start_mesh_point = -5.0d0
final_mesh_point = 5.0d0
print_step = 100
/
&PAR_numeric
scheme = 3
iterations = 10000
time_step = 0.0001d0
/
&PAR_dissip
dissp_scheme = 3
dissip_omega = 0.5d0
/
Thank you all!