I have a file that looks like this:
!--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
!------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
I want to read the sections [DISK] and [CAPACITY]; there will be more sections like these, and I want to read the parameters defined under each of them.
I wrote the following code:
file_open = open(myFile, "r")
all_lines = file_open.readlines()
count = len(all_lines)
file_open.close()

my_data = {}
section = None
data = ""
for line in all_lines:
    line = line.strip()           # remove whitespace
    line = line.replace(" ", "")
    if len(line) != 0:            # remove white spaces between data
        if line[0] == "[":
            section = line.strip()[1:]
            data = ""
        if line[0] != "[":
            data += line + ","
            my_data[section] = [bit for bit in data.split(",") if bit != ""]
print my_data
key = my_data.keys()
print key
Unfortunately I am unable to get those sections and the data under them. Any ideas would be helpful.
As others already pointed out, you should be able to use the ConfigParser module.
Nonetheless, if you want to implement the reading/parsing yourself, you should split it up into two parts.
Part 1 would be the parsing at file level: splitting the file up into blocks (in your example you have two blocks: DISK and CAPACITY).
Part 2 would be parsing each block itself to get the values.
You know you can ignore the lines starting with !, so let's skip those:
with open('myfile.txt', 'r') as f:
    content = [l for l in f.readlines() if not l.startswith('!')]
Next, read the lines into blocks:
def partition_by(l, f):
    t = []
    for e in l:
        if f(e):
            if t: yield t
            t = []
        t.append(e)
    yield t
blocks = partition_by(content, lambda l: l.startswith('['))
and finally read in the values for each block:
def parse_block(block):
    gen = iter(block)
    block_name = next(gen).strip()[1:-1]
    splitted = [e.split('=') for e in gen]
    values = {t[0].strip(): t[1].strip() for t in splitted if len(t) == 2}
    return block_name, values
result = [parse_block(b) for b in blocks]
That's it. Let's have a look at the result:
for section, values in result:
    print section, ':'
    for k, v in values.items():
        print '\t', k, '=', v
output:
DISK :
	DIRECTION = 'OK'
	TYPE = 'normal'
CAPACITY :
	code = 0
	ID = 110
Are you able to make a small change to the text file? If you can make it look like this (only changed the comment character):
#--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
#------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
Then parsing it is trivial:
from ConfigParser import SafeConfigParser
parser = SafeConfigParser()
parser.read('filename')
And getting data looks like this:
(Pdb) parser
<ConfigParser.SafeConfigParser instance at 0x100468dd0>
(Pdb) parser.get('DISK', 'DIRECTION')
"'OK'"
Edit based on comments:
If you're using <= 2.7, then you're a little out of luck. The only way, really, would be to subclass ConfigParser and implement a custom _read method. In practice, you'd just have to copy/paste everything in Lib/ConfigParser.py and edit line 477 (2.7.3):
if line.strip() == '' or line[0] in '#;': # add new comment characters in the string
However, if you're running 3'ish (not sure what version it was introduced in offhand, I'm running 3.4(dev)), you may be in luck: ConfigParser added the comment_prefixes __init__ param to allow you to customize your prefix:
parser = ConfigParser(comment_prefixes=('#', ';', '!'))
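Putting that together for the original file, a minimal sketch (assuming a Python 3 release new enough to have the comment_prefixes parameter; the filename is a placeholder):

from configparser import ConfigParser

parser = ConfigParser(comment_prefixes=('#', ';', '!'))
parser.read('myfile.txt')  # placeholder filename
for section in parser.sections():
    # note: option names are lower-cased by default (optionxform)
    for key, value in parser.items(section):
        print(section, key, value)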
If the file is not big, you can load it into memory and use regexes to find the parts that are of interest to you.
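For instance, a rough sketch of that approach (the pattern and filename are my own, not the answerer's):

import re

with open('myfile.txt') as f:
    content = f.read()

# each match is (section_name, body_up_to_next_bracket)
for name, body in re.findall(r'\[(\w+)\]([^\[]*)', content):
    params = {}
    for line in body.splitlines():
        if '=' in line and not line.startswith('!'):
            k, v = line.split('=', 1)
            params[k.strip()] = v.strip()
    print(name, params)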
Related
So I am trying to go through a text file, put it into a dictionary, and then check whether a string is already in it. If it is, I want to change the "1" to a "2". Currently, if the string is already in the text file, it just makes a new line but with a "2". Is there a way to edit the text file so the number can stay in the same place but be replaced?
class Isduplicate:
    dicto = {}

    def read(self):
        f = open(r'C:\Users\jacka\OneDrive\Documents\outputs.txt', "r")
        for line in f:
            k, v = line.strip().split(':')
            self.dicto[k.strip()] = int(v.strip())
        return self.dicto

Is = Isduplicate()
while counter < 50:
    e = str(elem[counter].get_attribute("href"))
    e = e.replace("https://www.reddit.com/r/", "")
    e = e[:-1]
    if e in Is.read():
        Is.dicto[e] += 1
    else:
        Is.dicto[e] = 1
    text_file.write(e + ":" + str(Is.dicto[e]) + '\n')
    print(e)
    counter = counter + 2
You cannot rewrite part of a text file in place unless the replacement has exactly the same length; in general you have to rewrite the file as a whole.
Reading the file into a list of strings, processing that list, and writing it back to the file should solve your task.
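A minimal sketch of that read-process-write cycle, reusing the path and key:value format from the question (the updated key is a placeholder):

path = r'C:\Users\jacka\OneDrive\Documents\outputs.txt'

# read everything into memory
with open(path) as f:
    lines = f.readlines()

# process: rebuild the counts dictionary
counts = {}
for line in lines:
    k, v = line.strip().split(':')
    counts[k] = int(v)

counts['some_subreddit'] = counts.get('some_subreddit', 0) + 1  # placeholder update

# write the whole file back
with open(path, 'w') as f:
    for k in counts:
        f.write('%s:%d\n' % (k, counts[k]))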
A question regarding combining values from a text file into a single variable and printing it.
An example I can give is a .txt file such as this:
School, 234
School, 543
I want to know the necessary steps to combine both "School" entries into a single variable and end up with a value of 777.
I know that we will need to open the .txt file for reading and then split each line apart with the .split(",") method.
Code Example:
schoolPopulation = open("SchoolPopulation.txt", "r")
for line in schoolPopulation:
    line = line.split(",")
Could anyone please advise me on how to tackle this problem?
Python has a rich standard library, where you can find classes for many typical tasks. Counter is what you need in this situation:
from collections import Counter

c = Counter()
with open('SchoolPopulation.txt', 'r') as fh:
    for line in fh:
        name, val = line.split(',')
        c[name] += int(val)
print(c)
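With the two sample lines above, this prints Counter({'School': 777}), since both rows share the key 'School'.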
Something like this?
schoolPopulation = open("SchoolPopulation.txt", "r")
results = {}
for line in schoolPopulation:
    parts = line.split(",")
    name = parts[0].lower()
    val = int(parts[1])
    if name in results:
        results[name] += val
    else:
        results[name] = val
print(results)
schoolPopulation.close()
You could also use defaultdict and the with keyword.
from collections import defaultdict

with open("SchoolPopulation.txt", "r") as schoolPopulation:
    results = defaultdict(int)
    for line in schoolPopulation:
        parts = line.split(",")
        name = parts[0].lower()
        val = int(parts[1])
        results[name] += val
print(results)
If you'd like to display your results nicely you can do something like
for key in results:
    print("%s: %d" % (key, results[key]))
school = population = prev = ''
pop_count = 0
with open('SchoolPopulation.txt', 'r') as infile:
    for line in infile:
        line = line.split(',')
        school = line[0]
        population = int(line[1])
        if school == prev or prev == '':
            pop_count += population
        else:
            pass  # do something else here
        prev = school
I am processing many text files, some of which contain uuencoded data, which can be .jpg or .pdf or .zip or .xlsx etc. I don't care about the embedded uuencoded data, so I would just like to discard these passages and keep the rest of the text. I'm struggling to come up with a method that skips just enough, but not too much.
To summarize http://en.wikipedia.org/wiki/Uuencoding: each blob begins with
begin 644 filename.extension
and every line after the begin 644 seems to start with the letter
M
so this might also help. Any idea how to have a function that deletes all these lines for all .txt files in a folder (directory)?
For example, the following is a .jpg uuencoding
GRAPHIC
18
g438975g32h99a01.jpg
begin 644 g438975g32h99a01.jpg
M_]C_X``02D9)1#`!`#$`8`!#``#_[0G64&AO;=&]S:&]P(#,N,``X0DE-`^T`
M`````!``8`````$``0!#`````0`!.$))300-```````$````'CA"24T$&0``
M````!````!XX0DE-`_,```````D```````````$`.$))300*```````!```X
M0DE-)Q````````H``0`````````".$))30/U``````!(`"]F9#`!`&QF9;#`&
M```````!`"]F9#`!`*&9F#`&```````!`#(````!`%H````&```````!`#4`
M```!`"T````&```````!.$))30/X``````!P``#_____________________
M________`^#`````_____________________________P/H`````/______
M______________________\#Z`````#_____________________________
M`^#``#A"24T$"```````$`````$```)````"0``````X0DE-!!X```````0`
M````.$))300:``````!M````!#``````````````)P```+`````&`&<`,P`R
M`&#`.0`Y`````0`````````````````````````!``````````````"P````
M)P`````````````````````````````````````````````X0DE-!!$`````
M``$!`#A"24T$%```````!`````(X0DE-!`P`````!SH````!````<````!D`
M``%0```#T```!QX`&``!_]C_X``02D9)1#`!`#$`2`!(``#_[#`.061O8F4`
M9(`````!_]L`A``,"`#("0#,"0D,$0L*"Q$5#PP,#Q48$Q,5$Q,8$0P,#`P,
M#!$,#`P,#`P,#`P,#`P,#`P,#`P,#`P,#`P,#`P,`0T+"PT.#1`.#A`4##X.
M%!0.##X.%!$,#`P,#!$1#`P,#`P,$0P,#`P,#`P,#`P,#`P,#`P,#`P,#`P,
M#`P,#`S_P``1"``9`'`#`2(``A$!`Q$!_]T`!``'_\0!/P```04!`0$!`0$`
M`````````P`!`#0%!#<("0H+`0`!!0$!`0$!`0`````````!``(#!`4&!P#)
M"#L0``$$`0,"!`(%!P8(!0,,,P$``A$#!"$2,05!46$3(G&!,#84D:&Q0B;,D
M%5+!8C,T<H+10P)E\K.$P]-U
MX_-&)Y2DA;25Q-3D]*6UQ=7E]59F=H:6IK;&UN;V-T=79W>'EZ>WQ]?G]Q$`
M`#(!`#0$`P0%!#<'!#4U`0`"$0,A,1($05%A<2(3!3*!D12AL4(CP5+1\#,D
M8N%R#I)#4Q5C<S3Q)086HK*#!R8UPM)$DU2C%V1%539T9>+RLX3#TW7C\T:4
MI(6TE<34Y/2EM<75Y?569G:&EJ;:VQM;F]B
I would like to be left with just
GRAPHIC
18
g438975g32h99a01.jpg
For background, see also my earlier question How to remove weird encoding from txt file
EDIT: Here is a try:
start_marker = 'begin 644'

with open('fileWithBegin644.txt') as inf:
    ignoreLines = False
    for line in inf:
        if start_marker in line:
            print line,
            ignoreLines = True
        if not ignoreLines:
        with open("strip_" + inf, "w") as f:
            f.write(line.get_text().encode('utf-8'))
But I am getting the following error
File "removeUuencodingFromAll.py", line 10
with open("strip_" + inf, "w") as f:
^
IndentationError: expected an indented block
I coded up what was supposed to be a rather simple generator. Because the spec is slightly tedious (why two separate end markers on different lines?) it is rather bulky, but here goes. It should work as a validator for uuencode at the same time, but I have only tested it in very limited settings.
import re

def unuuencode(iterator, collector=None, ignore_length_errors=False):
    """
    Yield lines from iterator except when they are in an uuencode blob.
    If collector is not None, append to it the uuencoded blobs as a list
    of a list of lines, one for each uuencoded blob.
    """
    state = None  # one of { None, 'in_blob', 'closing', 'closed' }
    collectitem = None
    regex = re.compile(r'^begin\s+[0-7]{3,6}\s+.*?(?:\r?\n)?$')
    for line in iterator:
        if state == None:
            if regex.match(line):
                if collector != None:
                    collectitem = [line]
                state = 'in_blob'
                continue
            else:
                yield line
        else:
            stripped = line.rstrip('\r\n')
            if state == 'in_blob' and line.startswith('`'):
                state = 'closing'
            if state == 'closing':
                if stripped != '`':
                    raise ValueError('Expected "`" but got "%s"' % line)
                state = 'closed'
            elif state == 'closed':
                if stripped != 'end':
                    raise ValueError('Expected "end" but got "%s"' % line)
                state = None
            else:
                expect = ord(line[0:1]) - 32
                actual = len(stripped)
                seen = (len(stripped) - 1) * 6 / 8
                if seen != expect:
                    if not ignore_length_errors:
                        raise ValueError('Wrong prefix on line: %s '
                            '(indicated %i, 6/8 %i, actual length %i)' % (
                                line, expect, seen, actual))
                if line[0:1] != 'M':
                    state = 'closing'
            if collectitem:
                collectitem.append(line)
            if state is None:
                if collectitem:
                    collector.append(collectitem)
                collectitem = None
            continue
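As a quick sanity check of the length logic (my own worked example, not from the original answer): a full-length uuencode line starts with 'M', and ord('M') - 32 == 45, i.e. 45 decoded bytes; such a line is 61 characters long including the prefix, and (61 - 1) * 6 / 8 == 45 as well, so the two counts agree.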
Use it like this:
with open(file, 'r') as f:
    lines = [x for x in unuuencode(f)]
or like this:
with open(file, 'r') as f:
    blobs = []
    lines = [x for x in unuuencode(f, collector=blobs)]
or like this:
with open(file, 'r') as f:
    lines = f.read().split('\n')
# ... or whichever way you obtained your content as an array of lines
lines = [x for x in unuuencode(lines)]
or in the case of the code you seem to be using:
for fi in sys.argv[1:]:
    with open(fi) as markup:
        soup = BeautifulSoup(''.join(unuuencode(markup, ignore_length_errors=True)))
    with open("strip_" + fi, "w") as f:
        f.write(soup.get_text().encode('utf-8'))
The sample you linked to had an invalid length indicator in the second uuencoded blob, so I added an option to ignore that.
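To cover the "all .txt files in a folder" part of the question, the same filter could be driven by glob instead of sys.argv; a sketch (my addition, writing the filtered text directly instead of going through BeautifulSoup):

import glob

for fi in glob.glob('*.txt'):
    with open(fi) as markup:
        filtered = ''.join(unuuencode(markup, ignore_length_errors=True))
    with open("strip_" + fi, "w") as out:
        out.write(filtered)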
For example, say I have a text/log file with a very simple structure, consisting of a few parts with different formats, split by a marker line, e.g.:
0x23499 0x234234 0x234234
...
0x34534 0x353454 0x345464
$$$NEW_SECTION$$$
4345-34534-345-345345-3453
3453-34534-346-766788-3534
...
So, how can I read the file in these parts? E.g. read everything before the $$$NEW_SECTION$$$ marker into one variable and everything after it into another (without using regexps, etc.). Are there any simple solutions for that?
Here is a solution that does not read the whole file into memory:
data1 = []
pos = 0
with open('data.txt', 'r') as f:
    line = f.readline()
    while line and not line.startswith('$$$'):
        data1.append(line)
        line = f.readline()
    pos = f.tell()

data2 = []
with open('data.txt', 'r') as f:
    f.seek(pos)
    for line in f:
        data2.append(line)

print data1
print data2
The first pass can't be done with for line in f, because file iteration reads ahead into an internal buffer, which would spoil the exact position reported by f.tell().
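A single-pass variant that sidesteps the tell/seek issue entirely (my sketch, not part of the answer above):

data1, data2 = [], []
current = data1
with open('data.txt', 'r') as f:
    for line in f:
        if line.startswith('$$$'):
            current = data2  # switch target lists at the marker
        else:
            current.append(line)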
The simplest solution is str.split:
>>> s = filecontents.split("$$$NEW_SECTION$$$")
>>> s[0]
'0x23499 0x234234 0x234234\n\n0x34534 0x353454 0x345464\n'
>>> s[1]
'\n4345-34534-345-345345-3453\n3453-34534-346-766788-3534'
Solution 1:
If the file is not very big, then:
with open('your_log.txt') as f:
    parts = f.read().split('$$$NEW_SECTION$$$')
if len(parts) > 1:
    part1 = parts[0]
    ...
Solution 2:
def FileParser(filepath):
    with open(filepath) as f:
        part = ''
        for line in f:
            if line.strip() == '$$$NEW_SECTION$$$':
                yield part
                part = ''
            else:
                part += line
        yield part

for segment in FileParser('your_log.txt'):
    print segment
Note: it is untested code so please validate before using it
Solution:
def sec(file_, sentinel='$$$NEW_SECTION$$$'):
    with open(file_) as f:
        section = []
        for i in iter(f.readline, ''):
            if i.rstrip() == sentinel:
                yield section
                section = []
            else:
                section.append(i)
        yield section
and use:
>>> from pprint import pprint
>>> pprint(list(sec('file.txt')))
[['0x23499 0x234234 0x234234\n', '0x34534 0x353454 0x345464\n'],
['4345-34534-345-345345-3453\n',
'3453-34534-346-766788-3534\n',
'3453-34534-346-746788-3534\n']]
>>>
Sections to variables, or better, sections to a dict:
>>> sections = {}
>>> for n, section in enumerate(sec('file.txt')):
... sections[n] = section
>>>
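The same dict can also be built in one line with dict(enumerate(sec('file.txt'))).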
I'm trying to implement a simple helper class to interact with Java properties files. Fiddling with multiline properties, I encountered a problem that I cannot get solved; maybe you can?
The unittest in the class first writes a multiline property spanning two lines to the property file, then re-reads it and checks for equality. That just works. Now, if I use the class to add a third line to the property, it re-reads it with additional backslashes that I can't explain.
Here is my code:
#!/usr/bin/env python3
# -*- coding=UTF-8 -*-

import codecs
import os, re
import fileinput
import unittest

class ConfigParser:
    reProp = re.compile(r'^(?P<key>[\.\w]+)=(?P<value>.*?)(?P<ext>[\\]?)$')
    rePropExt = re.compile(r'(?P<value>.*?)(?P<ext>[\\]?)$')
    files = []

    def __init__(self, pathes=[]):
        for path in pathes:
            if os.path.isfile(path):
                self.files.append(path)

    def getOptions(self):
        result = {}
        key = ''
        val = ''
        with fileinput.input(self.files, inplace=False) as fi:
            for line in fi:
                m = self.reProp.match(line.strip())
                if m:
                    key = m.group('key')
                    val = m.group('value')
                    result[key] = val
                else:
                    m = self.rePropExt.match(line.rstrip())
                    if m:
                        val = '\n'.join((val, m.group('value')))
                        result[key] = val
            fi.close()
        return result

    def setOptions(self, updates={}):
        options = self.getOptions()
        options.update(updates)
        with fileinput.input(self.files, inplace=True) as fi:
            for line in fi:
                m = self.reProp.match(line.strip())
                if m:
                    key = m.group('key')
                    nval = options[key]
                    nval = nval.replace('\n', '\\\n')
                    print('{}={}'.format(key, nval))
            fi.close()

class test(unittest.TestCase):
    files = ['test.properties']
    props = {'test.m.a': 'Johnson\nTanaka'}

    def setUp(self):
        for file in self.files:
            f = codecs.open(file, encoding='utf-8', mode='w')
            for key in self.props.keys():
                val = self.props[key]
                val = re.sub('\n', '\\\n', val)
                f.write(key + '=' + val)
            f.close()

    def teardown(self):
        pass

    def test_read(self):
        c = ConfigParser(self.files)
        for file in self.files:
            for key in self.props.keys():
                result = c.getOptions()
                self.assertEqual(result[key], self.props[key])

    def test_write(self):
        c = ConfigParser(self.files)
        changes = {}
        for key in self.props.keys():
            changes[key] = self.change_value(self.props[key])
        c.setOptions(changes)
        result = c.getOptions()
        print('changes: ')
        print(changes)
        print('result: ')
        print(result)
        for key in changes.keys():
            self.assertEqual(result[key], changes[key], msg=key)

    def change_value(self, value):
        return 'Smith\nJohnson\nTanaka'

if __name__ == '__main__':
    unittest.main()
Output of the testrun:
C:\pyt>propertyfileparser.py
changes:
{'test.m.a': 'Smith\nJohnson\nTanaka'}
result:
{'test.m.a': 'Smith\nJohnson\\\nTanaka'}
Any hints welcome...
Since you are adding a backslash in front of newlines when writing, you also have to remove it when reading. Substituting the backslash-newline sequences back to plain newlines on read solves the problem, but I expect this also means the file syntax is incorrect.
This happens only with the second line break, because you separate the value into an "oval" and an "nval" where the "oval" is the first line, and the "nval" the rest, and you only do the substitution on the nval.
It's also overkill to use regexp replacing to replace something that isn't a regexp. You can use val.replace('\n', '\\n') instead.
I'd do this parser very differently. Well, first of all, I wouldn't do it at all, I'd use an existing parser, but if I did, I'd read the file, line by line, while handling the line continuation issue, so that I had exactly one value per item in a list. Then I'd parse each item into a key and a value with a regexp, and stick that into a dictionary.
You instead parse each line separately and join continuation lines to the values after parsing, which IMO is completely backwards.
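For illustration, a sketch of that structure under the same assumptions as the question's file format (a value's continuation is marked by a trailing backslash); this is my own code, not a drop-in replacement:

import re

def read_properties(path):
    # pass 1: collect logical lines, joining backslash continuations
    logical, buf = [], ''
    with open(path, encoding='utf-8') as f:
        for raw in f:
            line = raw.rstrip('\r\n')
            if line.endswith('\\'):
                buf += line[:-1] + '\n'  # keep the embedded newline in the value
            else:
                logical.append(buf + line)
                buf = ''
    if buf:
        logical.append(buf.rstrip('\n'))
    # pass 2: split each logical line into key and value
    result = {}
    for item in logical:
        m = re.match(r'(?s)(?P<key>[.\w]+)=(?P<value>.*)', item)
        if m:
            result[m.group('key')] = m.group('value')
    return result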