Parsing a series of fixed-width files [closed] - python

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
I have a series (~30) of files that are made up of rows like:
xxxxnnxxxxxxxnnnnnnnxxxnn
Where x is a char and n is a number, and each group is a different field.
This is fixed for each file so would be pretty easy to split and read with a struct or slice; however I was wondering if there's an effective way of doing it for a lot of files (with each file having different fields and lengths) without hard-coding it.
One idea I had was creating an XML file with the schema for each file, and then I could dynamically add new ones where required and the code would be more portable, however I wanted to check there are no simpler/more standard ways of doing this.
I will be outputting the data into either Redis or an ORM if this helps, and each file will only be processed once (although other files with different structures will be added at later dates).
Thanks

You could use itertools.groupby, with str.isdigit for instance (or isalpha):
>>> import itertools
>>> line = "aaa111bbb22cccc345defgh67"
>>> [''.join(i[1]) for i in itertools.groupby(line,str.isdigit)]
['aaa', '111', 'bbb', '22', 'cccc', '345', 'defgh', '67']

I think @fredtantini's answer contains a good suggestion. Here's a fleshed-out way of applying it to your problem, coupled with a minor variation of the code in my answer to a related question titled Efficient way of parsing fixed width files in Python:
from itertools import groupby
from struct import Struct

def parse_fields(filename):
    # Binary mode: in Python 3, Struct unpacks bytes rather than str.
    with open(filename, 'rb') as file:
        # Determine the layout of fields from the first line of the file.
        firstline = file.readline().rstrip().decode('ascii')
        fieldwidths = (len(list(group)) for _, group in groupby(firstline, str.isdigit))
        fmtstring = ''.join('{}s'.format(fw) for fw in fieldwidths)
        parse = Struct(fmtstring).unpack_from
        file.seek(0)  # rewind
        for line in file:
            # Each field is unpacked as bytes; decode it back to text.
            yield tuple(field.decode('ascii') for field in parse(line))

for row in parse_fields('somefile.txt'):
    print(row)
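One limitation of detecting the layout with groupby is that it only sees a boundary where letters switch to digits or back, so two adjacent fields of the same kind are merged into one. For files like that, an explicit per-file layout is one way to avoid hard-coding the parsing logic itself. The following is a minimal sketch under that assumption; the names LAYOUTS and make_parser and the sample field names are hypothetical, and the layout table could equally be loaded from a JSON or XML file.

```python
# Hypothetical layouts: one entry per file type, each a list of
# (field_name, width) pairs in the order they appear on a line.
LAYOUTS = {
    "example": [("code", 4), ("count", 2), ("name", 7)],
}

def make_parser(layout):
    """Build a function that splits one line into a dict of named fields."""
    slices = []
    start = 0
    for name, width in layout:
        slices.append((name, slice(start, start + width)))
        start += width

    def parse(line):
        return {name: line[s] for name, s in slices}

    return parse

parse = make_parser(LAYOUTS["example"])
print(parse("abcd12efghijk"))
```

Adding a new file type then only means adding a new entry to the layout table.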

Related

Splitting the paragraphs in two Python strings into lines of a maximum width [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 months ago.
I have two strings in a Python script which each contain single lines of text, blank lines and multiple paragraphs. Some of the paragraphs in the strings are very long so I would like to split them into multiple lines of text so that each line in the paragraphs is a certain maximum width. I would then like to split each string into lines so that the strings may be compared using the HtmlDiff class in the difflib module. Might someone know a quick and easy way to do this? I would greatly appreciate it. Thanks so much.
By searching, I found the following link:
How to modify list entries during for loop?
Using the information in the first answer, and the first comment to this question, I was able to achieve what I was looking for using code like the following:
import textwrap

# Wrap each line, then re-split so every wrapped line is its own list entry.
firstListOfLines = firstText.splitlines()
for index, line in enumerate(firstListOfLines):
    firstListOfLines[index] = textwrap.fill(line)
firstListOfLines = '\n'.join(firstListOfLines).splitlines()

secondListOfLines = secondText.splitlines()
for index, line in enumerate(secondListOfLines):
    secondListOfLines[index] = textwrap.fill(line)
secondListOfLines = '\n'.join(secondListOfLines).splitlines()
Thanks so much. The first comment helped me to think about what to do. Thanks again.
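The same steps can be folded into one helper and fed straight to HtmlDiff. This is only a sketch: the function name wrapped_lines, the sample texts, and the default width of 70 (the textwrap default) are assumptions for illustration.

```python
import difflib
import textwrap

def wrapped_lines(text, width=70):
    """Wrap each line of text to width and return the resulting lines."""
    result = []
    for line in text.splitlines():
        # textwrap.wrap returns [] for a blank line, so keep it explicitly.
        result.extend(textwrap.wrap(line, width=width) or [""])
    return result

# Hypothetical sample inputs: a short line, a blank line, a long paragraph.
first_text = "A short line.\n\n" + "word " * 30
second_text = "A short line!\n\n" + "word " * 30

# make_file returns a complete HTML page containing the side-by-side diff.
html = difflib.HtmlDiff().make_file(
    wrapped_lines(first_text), wrapped_lines(second_text)
)
```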

Python take the string of anything, ignoring all escape characters [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 3 years ago.
I am trying to make a function that takes typically copy-pasted text that very often includes \n characters. An example of such is as follows:
func('''This
is
some
text
that I entered''')
The problem with this function is the text can sometimes be rather large, so taking it line by line to avoid ', " or ''' isn't plausible. A piece of text that can cause issues is as follows:
func('''This
is'''
some"
text'
that I entered''')
I wanted to know if there is any way I can take the text as seen in the second example and use it as a string regardless of what it is comprised of.
Thanks!
To my knowledge, you won't be able to paste the text directly into your file. However, you could paste it into a text file.
Use regex to find triple quotes ''' and other invalid characters.
Example Python:
import re

def read_paste(file):
    # Read the pasted text from a file so it never has to be a Python literal.
    with open(file, 'r') as f:
        data = f.readlines()
    # Put a backslash in front of every single and double quote.
    for i, line in enumerate(data):
        data[i] = re.sub('("|\')', r'\\\1', line)
    return ''.join(data)
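Text read from a file is already an ordinary string, so the quote characters in it only need escaping if the result will later be pasted back into Python source. If the goal is just to work with the text, a plain read may be enough. A minimal sketch, assuming the pasted text has been saved to a file (the helper name read_text and the temporary file are hypothetical stand-ins):

```python
import os
import tempfile

def read_text(path):
    """Return the file's contents as one string, quotes and all."""
    with open(path, "r") as f:
        return f.read()

# Demonstration with a temporary file standing in for the pasted text.
pasted = "This\nis'''\nsome\"\ntext'\nthat I entered"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(pasted)

text = read_text(tmp.name)
os.remove(tmp.name)
print(text == pasted)
```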

Is there any way to retrieve file name using Python? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
In a Linux directory, I have several numbered files, such as "day1" and "day2". My goal is to write code that retrieves the numbers from the file names, adds 1 to the biggest number, and creates a new file with that number. So, for example, if there are files 'day1', 'day2' and 'day3', the code should read the list of files and add 'day4'. To do so, I at least need to know how to retrieve the numbers from the file names.
I'd use os.listdir to get all the file names, remove the "day" prefix, convert the remaining characters to integers, and take the maximum.
From there, it's just a matter of incrementing the number and appending it to the same prefix:
import os
max_file = max([int(f[3:]) for f in os.listdir('some_directory')])
new_file = 'day' + str(max_file + 1)
Get all the files with the os module (os.listdir, for example) and then use a regex (the re module) to get the numbers. If you don't want to look into regex, you could remove the letters from your string with replace() and convert what is left with int().
Glob would be good for this. It is similar to regex, but simpler and made specifically for matching file names. Basically, you use * as a wildcard, and you can match digits with [0-9] too. It can be pretty powerful, and the same pattern syntax is native to the bash shell, for example.
from glob import glob
from pathlib import Path
pattern = "day"
last_file_number = max(map(lambda f: int(f[len(pattern):]), glob(pattern + "[0-9]*")))
Path("%s%d" % (pattern, last_file_number + 1)).touch()
You can also see that I use pathlib here. This is a library for dealing with the file system in an OOP manner. Some people like it, some don't.
So, a little disclaimer: glob is not as powerful as regex. Here, daydream for example won't be matched, but day0dream will be. You can also try day*[0-9], but then daydream0 would be matched too. Of course, you can also use day[0-9] if you know the numbers stay in single digits. So, if your use case requires it, you can use glob first and then filter the results down with a regex.
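The glob-then-regex combination could look something like the sketch below. The function name next_day_file and the throwaway directory are assumptions made for the demonstration; the regex keeps only names that are exactly the prefix followed by digits.

```python
import re
import tempfile
from glob import glob
from pathlib import Path

def next_day_file(directory, prefix="day"):
    """Create and return the next numbered file, e.g. day4 after day3."""
    exact = re.compile(re.escape(prefix) + r"(\d+)")
    numbers = []
    for path in glob(str(Path(directory) / (prefix + "[0-9]*"))):
        match = exact.fullmatch(Path(path).name)
        if match:  # skips names such as day0dream
            numbers.append(int(match.group(1)))
    # With no numbered files yet, start at 1.
    new_path = Path(directory) / "{}{}".format(prefix, max(numbers, default=0) + 1)
    new_path.touch()
    return new_path

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("day1", "day2", "day3", "day0dream", "daydream"):
        (Path(tmp) / name).touch()
    created = next_day_file(tmp).name
print(created)
```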

How do I read a text file into a string variable in Python starting at the second line? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
I use the following code segment to read a file in Python:
file = open("test.txt", "rb")
data=file.readlines()[1:]
file.close()
print data
However, I need to read the entire file (apart from the first line) as a string into the variable data.
As it is, when my file's contents are test test test, my variable contains the list ['testtesttest'].
How do I read the file into a string?
I am using python 2.7 on Windows 7.
The solution is pretty simple. You just need to use a with ... as construct like this, read from line 2 onward, and then join the returned list into a string. In this particular instance, I'm using "" as the join delimiter, but you can use whatever you like.
# Text mode ("r") so the lines are str and can be joined with "".
with open("/path/to/myfile.txt", "r") as myfile:
    data_to_read = "".join(myfile.readlines()[1:])
...
The advantage of using a with ... as construct is that the file is closed automatically when the block exits, even if an error occurs, so you don't need to call myfile.close().
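A variation that avoids building the intermediate list is to read and discard the first line, then read the remainder in one call. This is a sketch only; the function name read_after_first_line and the temporary demo file are hypothetical.

```python
import os
import tempfile

def read_after_first_line(path):
    """Return everything after the first line of the file as one string."""
    with open(path, "r") as f:
        f.readline()     # read and discard the first line
        return f.read()  # the rest of the file, newlines included

# Demonstration with a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("header\ntest\ntest\ntest\n")

data = read_after_first_line(tmp.name)
os.remove(tmp.name)
print(repr(data))
```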

Need to merge information from four files together [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I have 2 big files, both of which include segments of information I want. I have extracted the information into output files, and now I have 4 files which hold the information I need.
What I want to do is merge the information from the four files into 1 file in a neat, line-by-line format with 4 columns separated by commas. I also want to be able to put something at the top of the file to let the user know what information is in the columns. Is this possible in Python?
Here is the info I want to merge:
'/usr/share/doc/HTML/es/kioslave/index.docbook'
Redhat 7.3'
Linux'
D84270022E57F1850C8464FA432ADFF99588157B'
every line is 1 line from the files I have, they go for many lines so I cannot post the whole thing, but that is an example of the info.
Python's zip function combines multiple iterables, yielding one tuple per step that holds the next item from each source.
for row in zip(file1, file2, file3, file4):
    # row holds the 4 column values for one output line
    print(row)
It is entirely possible--look at the csv module, which is most likely what you need. It's easy to use.
You'll be creating a comma-separated values file (.csv) where the first row holds headers indicating the contents of each column, i.e.
path,distro,os,serial
/usr/share/doc/HTML/es/kioslave/index.docbook,Redhat 7.3,Linux,D84270022E57F1850C8464FA432ADFF99588157B
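Putting zip and the csv module together might look like the following sketch. The function name merge_columns, the throwaway input files, and the sample values are assumptions for illustration; the header names are the ones suggested above.

```python
import csv
import os
import tempfile

def merge_columns(input_paths, output_path, header):
    """Write one CSV row per line position across the input files."""
    files = [open(path, "r") for path in input_paths]
    try:
        with open(output_path, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            # zip stops at the shortest file, so extra lines are dropped.
            for row in zip(*files):
                writer.writerow([line.rstrip("\n") for line in row])
    finally:
        for f in files:
            f.close()

# Demonstration with four small throwaway files, one value each.
columns = [["/some/path"], ["Redhat 7.3"], ["Linux"], ["ABC123"]]
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i, lines in enumerate(columns):
        path = os.path.join(tmp, "col{}.txt".format(i))
        with open(path, "w") as f:
            f.write("\n".join(lines) + "\n")
        paths.append(path)
    merged = os.path.join(tmp, "merged.csv")
    merge_columns(paths, merged, ["path", "distro", "os", "serial"])
    with open(merged, newline="") as f:
        rows = list(csv.reader(f))
print(rows)
```

Using csv.writer rather than joining with commas by hand means values that themselves contain commas or quotes are quoted correctly.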
