I'm a newbie to Python and I'm looking at using it to write some hairy EDI stuff that our supplier requires.
Basically they need an 80-character fixed-width text file, with certain "chunks" of the record filled with data and others left blank. I have the documentation, so I know what the length of each "chunk" is. The response that I get back is easier to parse, since it will already have data and I can use Python's "slices" to extract what I need, but I can't assign to a slice - I tried that already because it sounded like a good solution, and it didn't work since Python strings are immutable :)
Like I said I'm really a newbie to Python but I'm excited about learning it :) How would I go about doing this? Ideally I'd want to be able to say that range 10-20 is equal to "Foo" and have it be the string "Foo" with 7 additional whitespace characters (assuming said field has a length of 10) and have that be a part of the larger 80-character field, but I'm not sure how to do what I'm thinking.
You don't need to assign to slices, just build the string using % formatting.
An example with a fixed format for 3 data items:
>>> fmt="%4s%10s%10s"
>>> fmt % (1,"ONE",2)
'   1       ONE         2'
>>>
Same thing, field width supplied with the data:
>>> fmt2 = "%*s%*s%*s"
>>> fmt2 % (4,1, 10,"ONE", 10,2)
'   1       ONE         2'
>>>
Separating data and field widths, and using zip() and str.join() tricks:
>>> widths=(4,10,10)
>>> items=(1,"ONE",2)
>>> "".join("%*s" % i for i in zip(widths, items))
'   1       ONE         2'
>>>
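If you ever switch to str.format, the same layout can be expressed with explicit field widths; note the > to force right-justification, which is what %s padding does by default:
>>> fmt3 = "{0:>4}{1:>10}{2:>10}"
>>> fmt3.format(1, "ONE", 2)
'   1       ONE         2'
>>>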
Hopefully I understand what you're looking for: some way to conveniently identify each part of the line by a simple variable, but output it padded to the correct width?
The snippet below may give you what you want
class FixWidthFieldLine(object):
    fields = (('foo', 10),
              ('bar', 30),
              ('ooga', 30),
              ('booga', 10))

    def __init__(self):
        self.foo = ''
        self.bar = ''
        self.ooga = ''
        self.booga = ''

    def __str__(self):
        return ''.join([getattr(self, field_name).ljust(width)
                        for field_name, width in self.fields])
f = FixWidthFieldLine()
f.foo = 'hi'
f.bar = 'joe'
f.ooga = 'howya'
f.booga = 'doin?'
print f
This yields:
hi        joe                           howya                         doin?
It works by storing a class-level variable, fields, which records the order in which each field should appear in the output, together with the number of columns that field should have. There are correspondingly-named instance variables in __init__ that are set to an empty string initially.
The __str__ method outputs these values as a string. It uses a list comprehension over the class-level fields attribute, looking up the instance value for each field by name, and then left-justifying its output according to the column width. The resulting list of fields is then joined together with the empty string.
Note this doesn't parse input, though you could easily override the constructor to take a string and parse the columns according to the field and field widths in fields. It also doesn't check for instance values that are longer than their allotted width.
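For example, a parsing helper along these lines could rebuild an instance from a fixed-width line (a sketch only; the from_string name and the rstrip of padding are my own choices, not part of the original snippet):
    @classmethod
    def from_string(cls, line):
        # hypothetical addition to FixWidthFieldLine: slice the line by the declared field widths
        obj = cls()
        pos = 0
        for field_name, width in cls.fields:
            setattr(obj, field_name, line[pos:pos + width].rstrip())
            pos += width
        return obj
With that in place, FixWidthFieldLine.from_string(str(f)).bar would give back 'joe'.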
You can use justify functions to left-justify, right-justify and center a string in a field of given width.
'hi'.ljust(10) -> 'hi        '
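For completeness, the right-justify and center variants work the same way:
'hi'.rjust(10)  -> '        hi'
'hi'.center(10) -> '    hi    '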
I know this thread is quite old, but we use a library called django-copybook. It has nothing to do with django (anymore). We use it to go between fixed-width COBOL files and Python. You create a class to define your fixed-width record layout and can easily move between typed Python objects and fixed-width files:
USAGE:
class Person(Record):
    first_name = fields.StringField(length=20)
    last_name = fields.StringField(length=30)
    siblings = fields.IntegerField(length=2)
    birth_date = fields.DateField(length=10, format="%Y-%m-%d")
>>> fixedwidth_record = 'Joe                 Smith                         031982-09-11'
>>> person = Person.from_record(fixedwidth_record)
>>> person.first_name
'Joe'
>>> person.last_name
'Smith'
>>> person.siblings
3
>>> person.birth_date
datetime.date(1982, 9, 11)
It can also handle situations similar to COBOL's OCCURS functionality, such as when a particular section is repeated X times.
I used Jarret Hardie's example and modified it slightly. This allows for selecting the type of text alignment (left, right, or centered).
class FixedWidthFieldLine(object):
    def __init__(self, fields, justify='L'):
        """ Returns line from list containing tuples of field values and lengths. Accepts
            justification parameter.

            FixedWidthFieldLine(fields[, justify])
            fields = [(value, fieldLength)[, ...]]
        """
        self.fields = fields
        if justify in ('L', 'C', 'R'):
            self.justify = justify
        else:
            self.justify = 'L'

    def __str__(self):
        if self.justify == 'L':
            return ''.join([field[0].ljust(field[1]) for field in self.fields])
        elif self.justify == 'R':
            return ''.join([field[0].rjust(field[1]) for field in self.fields])
        elif self.justify == 'C':
            return ''.join([field[0].center(field[1]) for field in self.fields])
fieldTest = [('Alex', 10),
             ('Programmer', 20),
             ('Salem, OR', 15)]
f = FixedWidthFieldLine(fieldTest)
print f
f = FixedWidthFieldLine(fieldTest,'R')
print f
Returns:
Alex      Programmer          Salem, OR
      Alex          Programmer      Salem, OR
It's a little difficult to parse your question, but I'm gathering that you are receiving a file or file-like-object, reading it, and replacing some of the values with some business logic results. Is this correct?
The simplest way to overcome string immutability is to write a new string:
# Won't work:
test_string[3:6] = "foo"
# Will work:
test_string = test_string[:3] + "foo" + test_string[6:]
Having said that, it sounds like it's important to you that you do something with this string, but I'm not sure exactly what that is. Are you writing it back to an output file, trying to edit a file in place, or something else? I bring this up because the act of creating a new string (which happens to have the same variable name as the old string) should emphasize the necessity of performing an explicit write operation after the transformation.
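For example, a minimal sketch of that explicit read-transform-write cycle (the file names and the 10-20 column range are placeholders borrowed from the question):
# minimal sketch; 'input.edi' and 'output.edi' are placeholder names
with open('input.edi') as src, open('output.edi', 'w') as dst:
    for line in src:
        line = line.rstrip('\n')
        line = line[:10] + 'Foo'.ljust(10) + line[20:]  # overwrite columns 10-20
        dst.write(line + '\n')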
You can convert the string to a list and do the slice manipulation.
>>> text = list("some text")
>>> text[0:4] = list("fine")
>>> text
['f', 'i', 'n', 'e', ' ', 't', 'e', 'x', 't']
>>> text[0:4] = list("all")
>>> text
['a', 'l', 'l', ' ', 't', 'e', 'x', 't']
>>> import string
>>> string.join(text, "")
'all text'
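Note that the same join is also available as a string method, which works on Python 3 as well (where string.join no longer exists):
>>> "".join(text)
'all text'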
It is easy to write a function to "modify" a string.
def change(string, start, end, what):
    length = end - start
    if len(what) < length:
        what = what + " " * (length - len(what))
    return string[0:start] + what[0:length] + string[end:]
Usage:
test_string = 'This is test string'
print test_string[5:7]
# is
test_string = change(test_string, 5, 7, 'IS')
# This IS test string
test_string = change(test_string, 8, 12, 'X')
# This IS X    string
test_string = change(test_string, 8, 12, 'XXXXXXXXXXXX')
# This IS XXXX string
I have a list of lists of sequences, and a corresponding list of lists of names.
testSequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
testNames = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
I also have a list of all the identifying parts of the names:
taxonNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
I am trying to produce a new list, where each item in the list will correspond to one of the "identifying parts of the names", and the string will be made up of all the sequences for that name.
If a name and sequence do not appear in one of the inner lists (e.g. no redFish or blueFish in the first list of testNames), I want to add a string of hyphens the same length as the sequences in that list. This would give me this output:
['aaaa--AAAAAA', 'cccc--CCCCCC', '----ttTTTTTT', '----ggGGGG']
I have this piece of code to do this.
complete = [''] * len(taxonNames)

for i in range(len(testSequences)):
    for j in range(len(taxonNames)):
        sequenceLength = len(testSequences[i][0])
        for k in range(len(testSequences[i])):
            if taxonNames[j] in testNames[i][k]:
                complete[j].join(testSequences[i][k])
            if taxonNames[j] not in testNames[i][k]:
                hyphenString = "-" * sequenceLength
                complete[j].join(hyphenString)

print complete
"complete" should give my final output as explained above, but it comes out looking like this:
['', '', '', '']
How can I fix my code to give me the correct answer?
The main issue with your code, which makes it very hard to understand, is you're not really leveraging the language elements that make Python so strong.
Here's a solution to your problem that works:
test_sequences = [
    ['aaaa', 'cccc'],
    ['tt', 'gg'],
    ['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]

test_names = [
    ['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
    ['>xx_redFish |zxx', '>xx_blueFish |zxx'],
    ['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]

taxon_names = ['oneFish', 'twoFish', 'redFish', 'blueFish']

def get_seqs(taxon_name, sequences_list, names_list):
    for seqs, names in zip(sequences_list, names_list):
        found_seq = None
        for seq, name in zip(seqs, names):
            if taxon_name in name:
                found_seq = seq
                break
        yield found_seq if found_seq else '-' * len(seqs[0])

result = [''.join(get_seqs(taxon_name, test_sequences, test_names))
          for taxon_name in taxon_names]

print(result)
The generator get_seqs pairs up lists from test_sequences and test_names and for each pair, tries to find the sequence (seq) for the name (name) that matches and yields it, or yields a string of the right number of hyphens for that list of sequences.
The generator (a function that yields multiple values) has code that quite literally follows the explanation above.
The result is then simply a matter of, for each taxon_name, getting all the resulting sequences that match in order and joining them together into a string, which is the result = ... line.
You could make it work with list indexing loops and string concatenation, but this is not a PHP question, now is it? :)
Note: for brevity, you could just access the global test_sequences and test_names instead of passing them in as parameters, but I think that would come back to haunt you if you were to actually use this code. Also, I think it makes semantic sense to change the order of names and sequences in the entire example, but I didn't to avoid further deviating from your example.
Here is a solution that may do what you want. It begins, not with your data structures from this post, but with the three example files from your previous post (which you used to build this post's data structures).
The only thing I couldn't figure out was how many hyphens to use for a missing sequence from a file.
differentNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
files = ['f1.txt', 'f2.txt', 'f3.txt']
data = [[] for _ in range(len(differentNames))]
final = []

for file in files:
    d = dict()
    with open(file, 'r') as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'):  # for ex., >xx_oneFish |xxx
                underscore = line.index('_')
                space = line.index(' ')
                key = line[underscore+1:space]
            else:
                d[key] = line
    for i, key in enumerate(differentNames):
        data[i].append(d.get(key, '-' * 4))

for array in data:
    final.append(''.join(array))

print(final)
Prints:
['AAAAAAAaaaa----', 'CCCCCCcccc----', 'TTTTTT----tt', 'GGGGGG----gg']
I have a question about Python 3.x. I'm able to create a nice "table" with .format, but the question is whether I can do it with tabs so that it looks similar to this one made with format:
file = open("students.csv", "r")
students = []
for i in file:
    i = i.rstrip()
    i_sublist = i.split(",")
    students.append(i_sublist)

print("Content")
print("{0:15} {1:15} {2:15}".format("Name", "Surname", "Grade"))
print("{0:15} {1:15} {2:15}".format("-----", "-------", "---"))
for i in students:
    for j in i:
        print("{0:15}".format(j), "", end="")
    print()
file.close()
Although I did not fully understand what your task is (especially about tabs), I'll try to show one way to build a similar table without using the .format method. Suppose the list of students looks as follows:
students = [('Arc', 'Circle', 'A'), ('Numan', 'Tuman', 'B'), ('Zeta', 'Beta', 'D'),
('Raptor', 'Strange', 'A'), ('Crazy', 'Bear', 'C')]
Using another method
Then the following function will print the similar table:
def context(students):
    preamble = [("Name", "Surname", "Grade"),
                ("----", "-------", "-----")]
    message = preamble + students  # join two lists together
    print("Context")
    for line in message:
        print(*(elem.ljust(15) for elem in line))  # left justify each element in sublists
From the documentation:
S.ljust(width[, fillchar]) -> str
Return S left-justified in a Unicode string of length width. Padding is done using the specified fill character (default is a space).
Your own tabulator
If your task is to implement your own tabulator, it can be done by providing a helper function in this way:
def tabulator(text, min_field=15):
    tabs_to_append = min_field - len(text)
    return_string = (text if (tabs_to_append <= 0) else text + " " * tabs_to_append)
    return return_string

def context_tabulator(students):
    preamble = [("Name", "Surname", "Grade"),
                ("----", "-------", "-----")]
    message = preamble + students  # join two lists together
    print("Context")
    for line in message:
        print(*(tabulator(elem, 15) for elem in line))
If you invoke it as context_tabulator(students), it will produce the following table:
Context
Name            Surname         Grade
----            -------         -----
Arc             Circle          A
Numan           Tuman           B
Zeta            Beta            D
Raptor          Strange         A
Crazy           Bear            C
It should be noted that all of the above examples only show the general idea, and do not validate or convert the provided arguments. Nonetheless, I hope this will help you.
I have txt files that look like this:
word, 23
Words, 2
test, 1
tests, 4
And I want them to look like this:
word, 23
word, 2
test, 1
test, 4
I want to be able to take a txt file in Python and convert plural words to singular. Here's my code:
import nltk

f = raw_input("Please enter a filename: ")

def openfile(f):
    with open(f,'r') as a:
        a = a.read()
        a = a.lower()
        return a

def stem(a):
    p = nltk.PorterStemmer()
    [p.stem(word) for word in a]
    return a

def returnfile(f, a):
    with open(f,'w') as d:
        d = d.write(a)
        #d.close()

print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))
I have also tried these 2 definitions instead of the stem definition:
def singular(a):
    for line in a:
        line = line[0]
        line = str(line)
        stemmer = nltk.PorterStemmer()
        line = stemmer.stem(line)
        return line

def stem(a):
    for word in a:
        for suffix in ['s']:
            if word.endswith(suffix):
                return word[:-len(suffix)]
    return word
Afterwards I'd like to take duplicate words (e.g. test and test) and merge them by adding up the numbers next to them. For example:
word, 25
test, 5
I'm not sure how to do that. A solution would be nice but not necessary.
If you have complex words to singularize, I don't advise you to use stemming but a proper Python package like pattern:
from pattern.text.en import singularize
plurals = ['caresses', 'flies', 'dies', 'mules', 'geese', 'mice', 'bars', 'foos',
'families', 'dogs', 'child', 'wolves']
singles = [singularize(plural) for plural in plurals]
print(singles)
returns:
['caress', 'fly', 'dy', 'mule', 'goose', 'mouse', 'bar', 'foo', 'family', 'dog', 'child', 'wolf']
It's not perfect, but it's the best I found - 96% accuracy according to the docs: http://www.clips.ua.ac.be/pages/pattern-en#pluralization
It seems like you're pretty familiar with Python, but I'll still try to explain some of the steps. Let's start with the first question of depluralizing words. When you read in a multiline file (the word, number csv in your case) with a.read(), you're going to be reading the entire body of the file into one big string.
def openfile(f):
    with open(f,'r') as a:
        a = a.read() # a will equal 'soc, 32\nsoc, 1\n...' in your example
        a = a.lower()
        return a
This is fine and all, but when you want to pass the result into stem(), it will be as one big string, and not as a list of words. This means that when you iterate through the input with for word in a, you will be iterating through each individual character of the input string and applying the stemmer to those individual characters.
def stem(a):
    p = nltk.PorterStemmer()
    a = [p.stem(word) for word in a] # ['s', 'o', 'c', ',', ' ', '3', '2', '\n', ...]
    return a
This definitely doesn't work for your purposes, and there are a few different things we can do.
1. We can change it so that we read the input file as one list of lines.
2. We can use the big string and break it down into a list ourselves.
3. We can go through and stem each line in the list of lines one at a time.
Just for expedience's sake, let's roll with #1. This will require changing openfile(f) to the following:
def openfile(f):
    with open(f,'r') as a:
        a = a.readlines() # a will now be a list of lines, e.g. ['soc, 32\n', 'soc, 1\n', ...]
        b = [x.lower() for x in a]
        return b
This should give us b as a list of lines, i.e. ['soc, 32\n', 'soc, 1\n', ...]. So the next problem becomes what do we do with the list of strings when we pass it to stem(). One way is the following:
def stem(a):
    p = nltk.PorterStemmer()
    b = []
    for line in a:
        split_line = line.split(',') # break it up so we can get access to the word
        new_line = str(p.stem(split_line[0])) + ',' + split_line[1] # put it back together
        b.append(new_line) # add it to the new list of lines
    return b
This is definitely a pretty rough solution, but should adequately iterate through all of the lines in your input, and depluralize them. It's rough because splitting strings and reassembling them isn't particularly fast when you scale it up. However, if you're satisfied with that, then all that's left is to iterate through the list of new lines, and write them to your file. In my experience it's usually safer to write to a new file, but this should work fine.
def returnfile(f, a):
    with open(f,'w') as d:
        for line in a:
            d.write(line)

print openfile(f)
print stem(openfile(f))
print returnfile(f, stem(openfile(f)))
When I have the following input.txt
soc, 32
socs, 1
dogs, 8
I get the following stdout:
Please enter a filename: input.txt
['soc, 32\n', 'socs, 1\n', 'dogs, 8\n']
['soc, 32\n', 'soc, 1\n', 'dog, 8\n']
None
And input.txt looks like this:
soc, 32
soc, 1
dog, 8
The second question regarding merging numbers with the same words changes our solution from above. As per the suggestion in the comments, you should take a look at using dictionaries to solve this. Instead of doing this all as one big list, the better (and probably more pythonic) way to do this is to iterate through each line of your input, and stemming them as you process them. I'll write up code about this in a bit, if you're still working to figure it out.
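In the meantime, a rough sketch of that dictionary approach for the merging step might look like this (the helper name and the output format are just illustrative):
def merge_counts(lines):
    # lines like ['soc, 32\n', 'soc, 1\n', 'dog, 8\n'] after stemming
    counts = {}
    for line in lines:
        word, number = line.split(',')
        counts[word] = counts.get(word, 0) + int(number)
    return ['%s, %d\n' % (word, total) for word, total in counts.items()]

print merge_counts(['soc, 32\n', 'soc, 1\n', 'dog, 8\n'])
# ['soc, 33\n', 'dog, 8\n']  (order may vary)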
The Nodebox English Linguistics library contains scripts for converting plural form to single form and vice versa. Checkout tutorial: https://www.nodebox.net/code/index.php/Linguistics#pluralization
To convert plural to single just import singular module and use singular() function. It handles proper conversions for words with different endings, irregular forms, etc.
from en import singular
print(singular('analyses'))
print(singular('planetoids'))
print(singular('children'))
>>> analysis
>>> planetoid
>>> child
I'd like to make the 'pyparsing' parsing result come out as a dictionary without needing to post-process. For this, I need to define my own key strings. The following is the best I could come up with that produces the desired results.
Line to parse:
%ADD22C,0.35X*%
Code:
import pyparsing as pyp
floatnum = pyp.Regex(r'([\d\.]+)')
comma = pyp.Literal(',').suppress()
cmd_app_def = pyp.Literal('AD').setParseAction(pyp.replaceWith('aperture-definition'))
cmd_app_def_opt_circ = pyp.Group(pyp.Literal('C') +
                                 comma).setParseAction(pyp.replaceWith('circle'))
circular_apperture = pyp.Group(cmd_app_def_opt_circ +
                               pyp.Group(pyp.Empty().setParseAction(pyp.replaceWith('diameter')) + floatnum) +
                               pyp.Literal('X').suppress())
<the grammar for the entire line>
The result is:
['aperture-definition', '20', ['circle', ['diameter', '0.35']]]
What I consider a hack here is
pyp.Empty().setParseAction(pyp.replaceWith('diameter'))
which always matches and is empty, but then I assign my desired key name to it.
Is this the best way to do this? Am I abusing pyparsing to do something it's not meant to do?
If you want to name your floatnum as "diameter", you can use named results:
cmd_app_def_opt_circ = pyp.Group(pyp.Literal('C') +
                                 comma)("circle")
circular_apperture = pyp.Group(cmd_app_def_opt_circ +
                               pyp.Group(floatnum)("diameter") +
                               pyp.Literal('X').suppress())
In this way, every time the parser encounters floatnum in the circular_apperture context, the result is named diameter. Also, as described above, you can name circle in the same fashion. Does this work for you?
See comments in the posted code.
import pyparsing as pyp
comma = pyp.Literal(',').suppress()
# use parse actions to do type conversion at parse time, so that results fields
# can immediately be used as ints or floats, without additional int() or float()
# calls
floatnum = pyp.Regex(r'([\d\.]+)').setParseAction(lambda t: float(t[0]))
integer = pyp.Word(pyp.nums).setParseAction(lambda t: int(t[0]))
# define the command keyword - I assume there will be other commands too, they
# should follow this general pattern (define the command keyword, then all the
# options, then define the overall command)
aperture_defn_command_keyword = pyp.Literal('AD')
# define a results name for the matched integer - I don't know what this
# option is, wasn't in your original post
d_option = 'D' + integer.setResultsName('D')
# shortcut for defining a results name is to use the expression as a
# callable, and pass the results name as the argument (I find this much
# cleaner and keeps the grammar definition from getting messy with lots
# of calls to setResultsName)
circular_aperture_defn = 'C' + comma + floatnum('diameter') + 'X'
# define the overall command
aperture_defn_command = aperture_defn_command_keyword("command") + d_option + pyp.Optional(circular_aperture_defn)
# use searchString to skip over '%'s and '*'s, gives us a ParseResults object
test = "%ADD22C,0.35X*%"
appData = aperture_defn_command.searchString(test)[0]
# ParseResults can be accessed directly just like a dict
print appData['command']
print appData['D']
print appData['diameter']
# or if you prefer attribute-style access to results names
print appData.command
print appData.D
print appData.diameter
# convert ParseResults to an actual Python dict, removes all unnamed tokens
print appData.asDict()
# dump() prints out the parsed tokens as a list, then all named results
print appData.dump()
Prints:
AD
22
0.35
AD
22
0.35
{'diameter': 0.34999999999999998, 'command': 'AD', 'D': 22}
['AD', 'D', 22, 'C', 0.34999999999999998, 'X']
- D: 22
- command: AD
- diameter: 0.35
Going to re-word the question.
Basically I'm wondering what is the easiest way to manipulate a string formatted like this:
Safety/Report/Image/489
or
Safety/Report/Image/490
And sectioning off each word separated by a slash (/), and storing each section (token) into a store so I can call it later. (Reading in about 1200 cells from a CSV file.)
The answer to your question:
>>> mystring = "Safety/Report/Image/489"
>>> mystore = mystring.split('/')
>>> mystore
['Safety', 'Report', 'Image', '489']
>>> mystore[2]
'Image'
>>>
If you want to store data from more than one string, then you have several options depending on how you want to organize it. For example:
liststring = ["Safety/Report/Image/489",
              "Safety/Report/Image/490",
              "Safety/Report/Image/491"]

dictstore = {}
for line, string in enumerate(liststring):
    dictstore[line] = string.split('/')

print dictstore[1][3]
print dictstore[2][3]
prints:
490
491
In this case you can use a dictionary or a list (a list of lists) for storage in the same way. If each string has a special identifier (one better than the line number), then the dictionary is the option to choose.
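For completeness, the list-of-lists version of the same idea is a one-liner:
liststore = [string.split('/') for string in liststring]
print liststore[1][3]   # prints 490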
I don't quite understand your code and don't have too much time to study it, but I thought that the following might be helpful, at least if order isn't important ...
in_strings = ['Safety/Report/Image/489',
              'Safety/Report/Image/490',
              'Other/Misc/Text/500'
              ]

out_dict = {}

for in_str in in_strings:
    level1, level2, level3, level4 = in_str.split('/')
    out_dict.setdefault(level1, {}).setdefault(
        level2, {}).setdefault(
            level3, []).append(level4)

print out_dict
{'Other': {'Misc': {'Text': ['500']}}, 'Safety': {'Report': {'Image': ['489', '490']}}}
If your csv is line separated:
# do something to load the csv
split_lines = [x.strip() for x in csv_data.split('\n')]

for line_data in split_lines:
    split_parts = [x.strip() for x in line_data.split('/')]
    # do something with individual part data
    # such as some_variable = split_parts[1] etc
    # if using indexes, I'd be sure to catch for index errors in case you
    # try to go to index 3 of something with only 2 parts
Check out the Python csv module for some importing help (I'm not too familiar with it).
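For example, a rough sketch with csv.reader (the file name 'data.csv' and the one-path-per-row layout are assumptions about your data):
import csv

with open('data.csv', 'rb') as f:          # 'rb' is the Python 2 csv idiom
    for row in csv.reader(f):
        parts = [x.strip() for x in row[0].split('/')]   # e.g. ['Safety', 'Report', 'Image', '489']
        # do something with parts here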