General algorithm/pattern for processing text files - python

Is there a general algorithm/pattern for reading multiline text files, where some lines are dependent on preceding ones? I'm referring to data in a form like:
H0 //start header
HEADER1
H9 //end header
R0 RECORD1
R0 RECORD2
H0 //start header
HEADER2
H9 //end header
R0 RECORD3
R0 RECORD4
Where one needs to associate the current "header" info with each following record.
I realise there are countless solutions to this sort of task, but are there tried and tested patterns that more experienced developers converge on?
EDIT:
My intuition is that one should use some sort of state machine, with states like "reading header", "reading records" etc. Am I on the right path?
EDIT:
Although the example is simple, something that can handle higher degrees of nesting would be preferable

This can be looked at as a parsing problem, although the grammar of the language is very simple. It is indeed regular, so an FSM, as you correctly noted, will work. Generally speaking, any established parsing technique will work; you would avoid explicit state by using recursive descent parsing, which isn't actually recursive in the case of a regular language. The following is pseudocode:
function accept_file:
    while not_eof
        read_line;
        case prefix of
            "H0": accept_header;
            "R0": accept_record;
            otherwise: syntax_error;

function accept_record:
    make_record from substring of current_line from position 3;

function accept_header:
    read_line;
    while current_line does not start with "H9"
        add current_line to accumulated_lines;
        read_line;
    create header from accumulated_lines;
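For concreteness, here is a minimal runnable Python sketch of the same idea (the prefixes come from your sample; pairing each record with the current header is my reading of what you want):
def parse(path):
    # Return (header, record) pairs from the H0/H9/R0 format above.
    records = []
    header = None
    with open(path) as f:
        lines = iter(f)
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith("H0"):
                accumulated = []
                for line in lines:  # consume the header sub-block
                    line = line.rstrip("\n")
                    if line.startswith("H9"):
                        break
                    accumulated.append(line)
                header = "\n".join(accumulated)
            elif line.startswith("R0"):
                records.append((header, line[3:]))
            elif line.strip():
                raise ValueError("syntax error: %r" % line)
    return records
On your sample, parse("data.txt") would return pairs like ("HEADER1", "RECORD1"). Each block type gets its own little loop or function, which is where recursive descent would earn the "recursive" once blocks can actually nest.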

I agree with kkm; depending on how complex your grammar is, you may consider using a parsing library like ply.

Related

Segmenting XML mixed content (in Python)

I have a lot of XML documents with mixed content, i.e. they contain paragraphs of normal text with interspersed XML for formatting etc. (unrelated to the document structure).
I need to segment these paragraphs within the existing XML document:
identify permissible "breakpoints" based on textual criteria (sentence boundaries - full stop, tab etc.)
divide the paragraph into segments defined by adjacent pairs of breakpoints (segment start and end points), i.e. wrap everything between the two breakpoints in a <seg> tag.
The paragraph start and end are also valid breakpoints.
But a breakpoint pair cannot be used if it clashes with the XML structure.
The simplest example goes like this:
<par>Hello <x>you</x>. How are you?</par> might be segmented into <par><seg>Hello <x>you</x>.</seg> <seg>How are you?</seg></par>
But when the interspersed tags span across a potential breakpoint:
<par>Hello <x>you. How are you</x>?</par> cannot be split up and I can only make <par><seg>Hello <x>you. How are you</x>?</seg></par>
A complication is that a breakpoint, if defined simply as a text index, is ambiguous in terms of the XML structure, e.g.:
<par><x>Hello you. How are you?</x></par> can only be split with all breakpoints inside the <x> tag as <par><x><seg>Hello you.</seg> <seg>How are you?</seg></x></par>
I've been trying to do this with lxml, but that quickly became rather complicated. Each segment's start and end breakpoints have to be at the same "level" within the tree, but that could mean being in the text property of one tag and the tail of another; inserting a new tag means moving some of the surrounding text to other tags; the "level" is ambiguous for empty text/tails, etc etc. It didn't feel very natural at all.
What's a better way to do this?
Thank you so much!
The best option for transforming XML is XSLT.
Maybe take a look here as a starting point: "How to transform an XML file using XSLT in Python?"
And this question: "Tokenize mixed content in XSLT" explains some of the basics you might also want.
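For the Python side, a minimal sketch of driving an XSLT transform with lxml (the file names are placeholders, and writing the <seg>-wrapping stylesheet itself is the real work):
from lxml import etree

doc = etree.parse("document.xml")  # the mixed-content XML
transform = etree.XSLT(etree.parse("segment.xsl"))  # your segmentation stylesheet
result = transform(doc)
print(etree.tostring(result, pretty_print=True).decode())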

Best way to parse a file Python

I have a text file that I need to read, identify some parts to change, and write to a new file. Here's a snippet of what the text file (which is about 600 lines long) would look similar to:
<REAPER_PROJECT 0.1 "4.731/x64" 1431724762
RIPPLE 0
RECORD_PATH "Audio" ""
<RECORD_CFG
ZXZhdxgA
>
<APPLYFX_CFG
>
LOCK 1
<METRONOME 6 2
VOL 0.25 0.125
FREQ 800 1600 1
BEATLEN 4
SAMPLES "" ""
>
>
So, for example, I'd need to change "LOCK 1" to "LOCK 0". Right now I'm reading the file line by line, looking for when I hit the "LOCK" keyword and then instead of writing "LOCK 1", I write "LOCK 0" (all other lines are written as is). Pretty straightforward.
Part of this seems kinda messy to me, though, as sometimes when I have to use nested for loops to parse a sub-section of the text file I run into weirdness with file-pointer off-by-one errors - not a biggie and manageable, but I was kinda looking for some opinions on this. Instead, I was wondering if it would make more sense to read the entire file into a list, parse through the list looking for keywords to change, update those specific lines in the list, and then write the whole list to the new file. It seems like I would have a bit more control over things, as I wouldn't have to process the file in the linear fashion I'm kinda forced into now.
So, I guess the last sentence kinda justified why it could be advantageous to pull it all into a list, process the list, and then write it out. I'm kinda curious how others with more programming experience (as mine is somewhat limited) would tackle this kind of issue. Any other ways that would prove even more efficient?
Btw, I didn't generate this file - other software did, and I don't have any communication with the developer so I have no way of knowing what they're using to read/write the file. I'd absolutely love it if I had a neat reader that could read the file and populate it into variables and then rewrite it out, but for me to code something that would do that would be overkill for what I'm trying to accomplish.
I'm kinda tempted to rewrite my script to read it into a list as it seems like it would be a better way to go, but I thought I'd ask people what they thought before I did. My version works, but I don't mind going through the motions, either, as it's a good lesson regardless. I figured this could also be a case where there are always different ways to tackle a problem, but I'd like to try and be as efficient as possible.
UPDATE
So, I probably should have mentioned this, but I was still trying to figure out what to ask - while I need to find certain elements and change them, I can only find those elements by finding their header (i.e. "ITEM") and then replacing the element within the block. So it'll be something like this:
<METRONOME
NAME Clicky
SPEED fast
>
<ITEM
LOOP 0
NAME Mike
FILE something.wav
..
>
<ITEM
LOOP 1
NAME Joe
FILE anotherfile.wav
..
>
So the only way to identify the correct block of data is to first find the ITEM header, then keep reading until I find the NAME element, and then update the file name for that whole ITEM block. There are other elements within that block that I need to update, and the NAME element isn't the first item in the block. Also, I can't assume that the NAME element exists only in ITEM blocks.
So maybe this really has less to do with reading it into memory and more of how to properly parse this type of file? Or are there some benefits to reading it into memory and being easier to manipulate? Sorry I didn't clarify that in the original question...
If it's only ~600 lines, you can read it all into memory:
replace = [('LOCK 1', 'LOCK 0')]  # ... add more (old, new) pairs here
with open('read.txt') as r:
    read = r.read()
for old, new in replace:
    read = read.replace(old, new)
with open('write.txt', 'w') as w:
    w.write(read)
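The straight replace above won't cover the ITEM blocks from your update, where the line to change depends on which block it sits in. Here's a hedged sketch of doing that over the in-memory list (block and field names are taken from your example; it assumes ITEM blocks contain no nested chunks, which holds for the snippet shown):
def rename_item_file(lines, target_name, new_file):
    # Replace the FILE entry in each ITEM block whose NAME matches.
    out, block = [], None
    for line in lines:
        stripped = line.strip()
        if block is None:
            if stripped.startswith("<ITEM"):
                block = [line]  # start buffering this ITEM block
            else:
                out.append(line)
        else:
            block.append(line)
            if stripped == ">":  # end of the buffered block
                if any(l.strip() == "NAME %s" % target_name for l in block):
                    for i, l in enumerate(block):
                        if l.strip().startswith("FILE "):
                            indent = l[:len(l) - len(l.lstrip())]
                            block[i] = indent + "FILE %s\n" % new_file
                out.extend(block)
                block = None
    return out

with open('read.txt') as f:
    new_lines = rename_item_file(f.readlines(), "Mike", "newfile.wav")
with open('write.txt', 'w') as f:
    f.writelines(new_lines)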
Here's my answer using regex:
import re
text = """<REAPER_PROJECT 0.1 "4.731/x64" 1431724762
RIPPLE 0
RECORD_PATH "Audio" ""
<RECORD_CFG
ZXZhdxgA
>
<APPLYFX_CFG
>
LOCK 1
<METRONOME 6 2
VOL 0.25 0.125
FREQ 800 1600 1
BEATLEN 4
SAMPLES "" ""
>
>
"""
print(re.sub(r"LOCK 1\D", "LOCK 0\n", text))
If you're interested in writing the file to disk:
with open("written.txt", 'w') as f:
    f.write(re.sub(r"LOCK 1\D", "LOCK 0\n", text))
EDIT
You said that you wanted it to be more flexible?
Okay, I tried to make an example, but for that I would need more information about your setup, etc. So instead, I'll point you to some resources that could help. This will also be useful if you ever want to change or add anything, since you'll understand what to do.
https://www.youtube.com/watch?v=DRR9fOXkfRE # How regex works for Python in general.
https://regexone.com/references/python # Some information about regex and Python.
https://stackoverflow.com/a/5658439/4837005 # An example of using regex to replace a string.
I hope this helps.

Does Pyparsing Support Context-Sensitive Grammars?

Forgive me if I have the incorrect terminology; perhaps just getting the "right" words to describe what I want is enough for me to find the answer on my own.
I am working on a parser for ODL (Object Description Language), an arcane language that as far as I can tell is now used only by NASA PDS (the Planetary Data System; it's how NASA makes its data available to the public). Fortunately, PDS is finally moving to XML, but I still have to write software for a mission that fell just before the cutoff.
ODL defines objects in something like the following manner:
OBJECT = TABLE
ROWS = 128
ROW_BYTES = 512
END_OBJECT = TABLE
I am attempting to write a parser with pyparsing, and I was doing fine right up until I came to the above construction.
I have to create some rule that is able to ensure that the right-hand-value of the OBJECT line is identical to the RHV of END_OBJECT. But I can't seem to put that into a pyparsing rule. I can ensure that both are syntactically valid values, but I can't go the extra step and ensure that the values are identical.
Am I correct in my intuition that this is a context-sensitive grammar? Is that the phrase I should be using to describe this problem?
Whatever kind of grammar this is in the theoretical sense, is pyparsing able to handle this kind of construction?
If pyparsing is not able to handle it, is there another Python tool capable of doing so? How about ply (the Python implementation of lex/yacc)?
It is in fact a grammar for a context-sensitive language, classically abstracted as wcw where w is in (a|b)* (note that wcw', where ' indicates reversal, is context-free).
Parsing Expression Grammars are capable of parsing wcw-type languages by using semantic predicates. PyParsing provides the matchPreviousExpr() and matchPreviousLiteral() helper methods for this very purpose, e.g.
from pyparsing import Word, matchPreviousExpr

w = Word("ab")
s = w + "c" + matchPreviousExpr(w)
So in your case you'd probably do something like
table_name = Word(alphas, alphanums)
object = (Literal("OBJECT") + "=" + table_name + ... +
          Literal("END_OBJECT") + "=" + matchPreviousExpr(table_name))
As a general rule, parsers are built as context-free parsing engines. If there is context sensitivity, it is grafted on after parsing (or at least after the relevant parsing steps are completed).
In your case, you want to write context-free grammar rules:
head = 'OBJECT' '=' IDENTIFIER ;
tail = 'END_OBJECT' '=' IDENTIFIER ;
element = IDENTIFIER '=' value ;
element_list = element ;
element_list = element_list element ;
block = head element_list tail ;
The check that the head and tail constructs have matching identifiers isn't technically done by the parser.
Many parsers, however, allow a semantic action to occur when a syntactic element is recognized, often for the purpose of building tree nodes. In your case, you want to use this to enable additional checking. For element, you want to make sure the IDENTIFIER isn't a duplicate of something already in the block; this means that for each element encountered, you'll want to capture the corresponding IDENTIFIER and keep a block-specific list to enable duplicate checking. For block, you want to capture the head IDENTIFIER and check that it matches the tail IDENTIFIER.
This is easiest if you build a tree representing the parse as you go along and hang the various context-sensitive values on the tree in various places (e.g., attach the actual IDENTIFIER value to the tree node for the head clause). At the point where you are building the tree node for the tail construct, it should be straightforward to walk up the tree, find the head node, and compare the identifiers.
This is easier to think about if you imagine the entire tree being built first, with a post-processing pass over the tree doing the checking. Lazy people in fact do it this way :-} All we are doing is pushing work that could be done in the post-processing step into the tree-building steps attached to the semantic actions.
None of these concepts is Python specific, and the details for PyParsing will vary somewhat.
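To make that concrete in pyparsing terms, a hedged sketch: parse with context-free rules, then graft the identifier check on as a parse action (the grammar details here are assumptions for illustration, not the real ODL grammar):
from pyparsing import (Word, alphas, alphanums, Literal, Group,
                       OneOrMore, NotAny, ParseFatalException)

IDENT = Word(alphas, alphanums + "_")
value = Word(alphanums + "._-")

head = Literal("OBJECT") + "=" + IDENT("head_name")
element = NotAny(Literal("END_OBJECT")) + IDENT + "=" + value
tail = Literal("END_OBJECT") + "=" + IDENT("tail_name")

def check_names(s, loc, toks):
    # The context-sensitive part, done after the context-free parse.
    if toks["head_name"] != toks["tail_name"]:
        raise ParseFatalException(s, loc, "OBJECT/END_OBJECT names differ")

block = (head + OneOrMore(Group(element)) + tail).setParseAction(check_names)
block.parseString("OBJECT = TABLE ROWS = 128 ROW_BYTES = 512 END_OBJECT = TABLE")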

Python regex to extract data from C header file

I have a C header file with a lot of enums, typedefs and function prototypes. I want to extract this data using Python regex (re). I really need help with the syntax, because I constantly seem to forget it every time I learn.
ENUMS
-----
enum
{
(tab character)(stuff to be extracted - multiple lines)
};
TYPES
-----
typedef struct (extract1) (extract2)
FUNCTIONS
---------
(return type)
(name)
(
(tab character)(arguments - multiple lines)
);
If anyone could point me in the right direction, I would be grateful.
I imagine something like this is what you're after?
>>> import re
>>> re.findall(r'enum\s*{\s*([^}]*)};', 'enum {A,B,C};')
['A,B,C']
>>> re.findall(r"typedef\s+struct\s+(\w+)\s+(\w+);", "typedef struct blah blah;")
[('blah', 'blah')]
There are of course numerous variations on the syntax, and functions are much more complicated, so I'll leave those for you, as frankly these regexps are already fragile and inelegant enough. I would urge you to use an actual parser unless this is just a one-off project where robustness is totally unimportant and you can be sure of the format of your inputs.
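For what it's worth, here's an equally fragile sketch for the FUNCTIONS layout you describe (return type on one line, name on the next, then a parenthesized multi-line argument list; it assumes no comments, macros, or function pointers in the arguments):
import re

func_pattern = re.compile(
    r"^([A-Za-z_][\w \t\*]*)\s*\n"  # return type on its own line
    r"(\w+)\s*\n"                   # function name on the next line
    r"\(\s*([^)]*?)\s*\);",         # argument list up to ');'
    re.MULTILINE,
)

sample = "static unsigned int\nmy_func\n(\n\tint a,\n\tchar *b\n);\n"
for ret, name, args in func_pattern.findall(sample):
    print(ret, name, repr(args))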

A better way to assign list into a var

Was coding something in Python. Have a piece of code, wanted to know if it can be done more elegantly...
# Statistics format is - done|remaining|200's|404's|size
statf = open(STATS_FILE, 'r').read()
starf = statf.strip().split('|')
done = int(starf[0])
rema = int(starf[1])
succ = int(starf[2])
fails = int(starf[3])
size = int(starf[4])
...
This goes on. I wanted to know if after splitting the line into a list, is there any better way to assign each list into a var. I have close to 30 lines assigning index values to vars. Just trying to learn more about Python that's it...
done, rema, succ, fails, size, ... = [int(x) for x in starf]
Better:
labels = ("done", "rema", "succ", "fails", "size")
data = dict(zip(labels, [int(x) for x in starf]))
print data['done']
What I don't like about the answers so far is that they stick everything in one expression. You want to reduce the redundancy in your code, without doing too much at once.
If all of the items on the line are ints, then convert them all together, so you don't have to write int(...) each time:
starf = [int(i) for i in starf]
If only certain items are ints--maybe some are strings or floats--then you can convert just those:
for i in 0, 1, 2, 3, 4:
    starf[i] = int(starf[i])
Assigning in blocks is useful; if you have many items--you said you had 30--you can split it up:
done, rema, succ = starf[0:3]
fails, size = starf[3:5]
I might use the csv module with a separator of | (though that might be overkill if you're "sure" the format will always be super-simple, single-line, no-strings, etc, etc). Like your low-level string processing, the csv reader will give you strings, and you'll need to call int on each (with a list comprehension or a map call) to get integers. Other tips include using the with statement to open your file, to ensure it won't cause a "file descriptor leak" (not indispensable in current CPython version, but an excellent idea for portability and future-proofing).
But I question the need for 30 separate barenames to represent 30 related values. Why not, for example, make a collections.namedtuple type with appropriately-named fields, and initialize an instance thereof, then use qualified names for the fields, i.e., a nice namespace? Remember the last koan in the Zen of Python (import this at the interpreter prompt): "Namespaces are one honking great idea -- let's do more of those!"... barenames have their (limited;-) place, but representing dozens of related values is not one -- rather, this situation "cries out" for the "let's do more of those" approach (i.e., add one appropriate namespace grouping the related fields -- a much better way to organize your data).
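A minimal sketch combining those two suggestions (csv with a '|' delimiter, a with statement, and a namedtuple namespace; the field names come from the question's format comment, and the real code would list all ~30):
import csv
from collections import namedtuple

Stats = namedtuple("Stats", "done rema succ fails size")

with open(STATS_FILE) as f:
    row = next(csv.reader(f, delimiter="|"))

stats = Stats(*(int(x) for x in row))
print(stats.done, stats.succ)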
Using a Python dict is probably the most elegant choice.
If you put your keys in a list as such:
keys = ("done", "rema", "succ" ... )
somedict = dict(zip(keys, [int(v) for v in starf]))
That would work. :-) Looks better than 30 lines too :-)
EDIT: I think there are dict comprehensions now, so that may look even better too! :-)
EDIT Part 2: Also, for the keys collection, you'd want to break that into multiple lines.
EDIT Again: fixed buggy part :)
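For reference, the dict-comprehension version mentioned in the edit (same keys and starf as above):
somedict = {k: int(v) for k, v in zip(keys, starf)}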
Thanks for all the answers. So here's the summary -
Glenn's answer was to handle this issue in blocks, i.e. done, rema, succ = starf[0:3] etc.
Leoluk's approach was more short & sweet taking advantage of python's immensely powerful dict comprehensions.
Alex's answer was more design oriented. Loved this approach. I know it should be done the way Alex suggested but lot of code re-factoring needs to take place for that. Not a good time to do it now.
townsean - same as 2
I have taken up Leoluk's approach. I'm not sure what the speed implications of this are; I have no idea if list/dict comprehensions take a hit on execution speed. But it reduces the size of my code considerably for now. I'll optimize when the need comes :) Going by - "Premature optimization is the root of all evil"...
