Input:
#example1
abcd
efg
hijklmnopq
#example2
123456789
Script:
def parser_function(f):
    name = ''
    body = ''
    for line in f:
        if len(line) >= 1:
            if line[0] == '#':
                name = line
                continue
        body = body + line
        yield name,''.join(body)
for line in parser_function(data_file):
    print line
Output:
('#example1', 'abcd')
('#example1', 'abcdefg')
('#example1', 'abcdefghijklmnopq')
('#example2', 'abcdefghijklmnopq123456789')
Desired Output:
('#example1', 'abcdefghijklmnopq')
('#example2', '123456789')
My problem: my generator yields on every line, and I'm not sure where to reset the body. I've tried a few different approaches but can't get the desired output. Any help would be greatly appreciated. I saw some other generators that used "if name:", but they were fairly complicated. I got it to work using that code, but I'm trying to keep my code as small as possible.
You need to change where you yield:
def parser_function(f):
    name = None
    body = ''
    for line in f:
        if line and line[0] == '#':
            if name:
                yield name, body
            name = line
            body = ''  # start a fresh body for the new record
        else:
            body += line
    if name:
        yield name, body
This yields once before every #... and once at the end.
P.S. I've renamed str to body to avoid shadowing a built-in.
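To check the idea against the sample input from the question, here is a minimal sketch. The name parse_records and the in-memory sample list are only for illustration, and it assumes that stripping whitespace from each line is acceptable (lines read from a real file carry a trailing newline, which the desired output does not show):

```python
def parse_records(lines):
    """Yield (name, body) pairs; a hypothetical variant of parser_function."""
    name = None
    body = ''
    for line in lines:
        # Assumption: surrounding whitespace (including the newline) is not significant.
        line = line.strip()
        if line.startswith('#'):
            if name is not None:
                yield name, body
            name = line
            body = ''  # start a fresh body for the new record
        else:
            body += line
    if name is not None:
        yield name, body

# Stand-in for the data file from the question.
sample = ["#example1\n", "abcd\n", "efg\n", "hijklmnopq\n", "#example2\n", "123456789\n"]
records = list(parse_records(sample))
```

With this input, records should hold one tuple per header, each with only its own body lines joined together.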
Related
I have files where bash string variables are gradually appended:
URI += "path \
path \
path \
"
<some other code>
#URI += "path"
URI += "path \
path"
As you may notice, there are different ways of appending, some spanning several lines. There is other code in those files as well.
Now I tried to write a function which gets the content of the variables (everything between the quotes):
def grepVar(filepath, var):
    list = []
    with open(filepath, "r") as file:
        for num, line in enumerate(file, 1):
            if var in line:
                if line.count('"') is 2:
                    list.append(line)
                    # until here it works for "URIs" over 1 line
                else:
                    num = num + 1
                    while(line.count('"') is 0):
                        list.append(line)
                        num = num + 1
    return list
print grepVar(path, "URI")
So in the else branch I try to advance the loop manually and append all lines until another quote appears (the while loop). I am not sure whether I can build on this idea or have to discard it completely. In that case, could you please give me hints on how to solve my problem? I am not sure I described it well, since it's rather specific.
Since line is supplied by the enclosing for num, line in enumerate(file, 1): loop, you cannot use a while (line...) inside that loop: line never changes inside the while, so the loop either never runs or never ends.
A common way to solve this problem is to save state between lines. Your function could become (I removed the num handling because I could not understand the requirement):
def grepVar(filepath, var):
    lst = []
    inquote = False
    with open(filepath, "r") as fil:
        for num, line in enumerate(fil, 1):
            if inquote:
                lst.append(line)
                if line.count('"') > 0:
                    inquote = False
            elif var in line:
                if line.count('"') == 2:
                    lst.append(line)
                else:
                    lst.append(line)
                    inquote = True
    return lst
You should also avoid using built-in Python names such as list or file for your own variables, because they hide the built-ins.
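If only the text between the quotes is wanted, the same "save state between lines" idea can be pushed a little further. The following is a rough sketch, not a full shell parser: the name grep_var_values is made up, it works on any iterable of lines rather than a file path, and it assumes that lines starting with # are comments and that each assignment has exactly one opening and one closing quote.

```python
def grep_var_values(lines, var):
    """Collect the quoted text of every assignment to var (hypothetical helper)."""
    values = []
    current = None  # None means we are not inside a quoted string
    for line in lines:
        stripped = line.strip()
        if current is None:
            # Assumption: lines starting with '#' are comments and are skipped.
            if stripped.startswith('#') or var not in stripped or '"' not in stripped:
                continue
            rest = stripped.split('"', 1)[1]  # text after the opening quote
            current = []
        else:
            rest = stripped
        if '"' in rest:
            # Closing quote found: keep the text before it and finish this value.
            current.append(rest.split('"', 1)[0])
            parts = [part.rstrip('\\').strip() for part in current]
            values.append(' '.join(parts).strip())
            current = None
        else:
            current.append(rest)
    return values

sample = [
    'URI += "path \\\n',
    'path \\\n',
    'path \\\n',
    '"\n',
    '<some other code>\n',
    '#URI += "path"\n',
    'URI += "path \\\n',
    'path"\n',
]
values = grep_var_values(sample, "URI")
```

The state lives in current: it is None outside a quoted string and a list of collected fragments inside one, which plays the same role as the inquote flag above.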
I was trying to implement the block of code from "Generator not working to split string by particular identifier. Python 2", but I found two bugs in it that I can’t seem to fix.
Input:
#m120204
CTCT
+
~##!
#this_one_has_an_at_sign
CTCTCT
+
#jfik9
#thisoneisempty
+
#empty line after + and then empty line to end file (2 empty lines)
The two bugs are:
(i) when a # starts the line after the ‘+’ line, as in the 2nd entry (#this_one_has_an_at_sign)
(ii) when the line following the #identification_line or the line following the ‘+’ line is empty, as in the 3rd entry (#thisoneisempty)
I would like the output to be the same as in the post that I referenced:
yield (name, body, extra)
in the case of #this_one_has_an_at_sign
name= this_one_has_an_at_sign
body= CTCTCT
quality= #jfik9
in the case of #thisoneisempty
name= thisoneisempty
body= ''
quality= ''
I tried using flags but I can’t seem to fix this issue. I know how to do it without a generator, but I’m going to be using big files, so I don’t want to go down that path. My current code is:
def organize(input_file):
    name = None
    body = ''
    extra = ''
    for line in input_file:
        line = line.strip()
        if line.startswith('#'):
            if name:
                body, extra = body.split('+',1)
                yield name, body, extra
                body = ''
            name = line
        else:
            body = body + line
    body, extra = body.split('+',1)
    yield name, body, extra
for line in organize(file_path):
    print line
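One possible way around both bugs is to stop looking at the first character altogether and rely on position instead. The sketch below assumes that every record occupies exactly four lines (name, body, the ‘+’ line, quality) and that an empty body or quality is still present as an empty line, as described in the input above. The name organize_fixed is made up for this example.

```python
from itertools import islice

def organize_fixed(lines):
    """Yield (name, body, extra), assuming every record is exactly four lines."""
    it = iter(lines)
    while True:
        record = [line.rstrip('\n') for line in islice(it, 4)]
        if not record:
            return  # input exhausted
        # Pad a truncated final record so the unpacking below always succeeds.
        record += [''] * (4 - len(record))
        name, body, _plus, extra = record
        yield name[1:], body, extra  # drop the leading marker from the name

sample = [
    "#m120204\n", "CTCT\n", "+\n", "~##!\n",
    "#this_one_has_an_at_sign\n", "CTCTCT\n", "+\n", "#jfik9\n",
    "#thisoneisempty\n", "\n", "+\n", "\n",
]
records = list(organize_fixed(sample))
```

Because the generator only ever holds four lines at a time, it should stay usable on big files. It will misalign if any record deviates from the four-line layout, so it is only appropriate when that layout is guaranteed.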
I'm having some trouble optimizing this part of my code.
It works, but seems unnecessarily slow.
The function searches for searchString in a file, starting at line line_nr, and returns the line number of the first hit.
import linecache
def searchStr(fileName, searchString, line_nr, linesInFile):
    # The above string is the input to this function
    # line_nr is needed to search after certain lines.
    # linesInFile is total number of lines in the file.
    while line_nr < linesInFile + 1:
        line = linecache.getline(fileName, line_nr)
        has_match = line.find(searchString)
        if has_match >= 0:
            return line_nr
        line_nr += 1
I've tried something along these lines, but never managed to implement the "start on a certain line number"-input.
Edit: The use case. I'm post-processing analysis files containing text and numbers that are split into different sections with headers. The headers on line_nr are used to break out chunks of the data for further processing.
Example of call:
startOnLine = searchStr(fileName, 'Header 1', 1, 10000000)
endOnLine = searchStr(fileName, 'Header 2', startOnLine, 10000000)
Why not start with the simplest possible implementation?
def search_file(filename, target, start_at = 0):
    with open(filename) as infile:
        for line_no, line in enumerate(infile):
            if line_no < start_at:
                continue
            if line.find(target) >= 0:
                return line_no
    return None
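Note that enumerate counts from 0 here, while linecache.getline in the question counts from 1. As a small sketch of how the two header searches from the question could be chained with 1-based numbering, here is a variant that works on any iterable of lines (the name find_line and the sample data are made up for illustration):

```python
def find_line(lines, target, start_at=1):
    """Return the 1-based number of the first line at or after start_at containing target."""
    for line_no, line in enumerate(lines, 1):
        if line_no >= start_at and target in line:
            return line_no
    return None

# Stand-in for the analysis file; an open file object can be passed the same way.
report = ["Header 1\n", "1 2 3\n", "4 5 6\n", "Header 2\n", "7 8 9\n"]
start_on_line = find_line(report, "Header 1")
end_on_line = find_line(report, "Header 2", start_on_line)
# Lines strictly between the two headers (list indices are 0-based).
chunk = report[start_on_line:end_on_line - 1]
```

When the argument is a file object, each call reads from the current file position, so the file would need to be reopened (or rewound with seek(0)) between calls.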
I guess your file is like:
Header1 data11 data12 data13..
name1 value1 value2 value3...
...
...
Header2 data21 data22 data23..
nameN valueN1 valueN2 valueN3..
...
Do the 'Header' lines share any constant format (e.g. do they all start with '#' or something similar)? If so, you can read each line directly, check whether it matches this format (e.g. if line[0] == '#'), and write different code for the different kinds of lines (definition lines and data lines in your example).
Record class:
class Record:
    def __init__(self):
        self.data={}
        self.header={}
    def set_header(self, line):
        ...
    def add_data(self, line):
        ...
iterate part:
def parse(p_file):
    record = None
    for line in p_file:
        if line[0] == "#":
            # a new header closes the previous record, if there is one
            if record:
                yield record
            record = Record()
            record.set_header(line)
        else:
            record.add_data(line)
    if record:
        yield record
main func:
data_file = open(...)
for rec in parse(data_file):
    ...
I have a file that looks like this
!--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
!------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
I want to read sections [DISK] and [CAPACITY].. there will be more sections like these. I want to read the parameters defined under those sections.
I wrote the following code:
file_open = open(myFile,"r")
all_lines = file_open.readlines()
count = len(all_lines)
file_open.close()
my_data = {}
section = None
data = ""
for line in all_lines:
    line = line.strip() #remove whitespace
    line = line.replace(" ", "")
    if len(line) != 0: # remove white spaces between data
        if line[0] == "[":
            section = line.strip()[1:]
            data = ""
        if line[0] !="[":
            data += line + ","
            my_data[section] = [bit for bit in data.split(",") if bit != ""]
print my_data
key = my_data.keys()
print key
Unfortunately I am unable to get those sections and the data under that. Any ideas on this would be helpful.
As others already pointed out, you should be able to use the ConfigParser module.
Nonetheless, if you want to implement the reading/parsing yourself, you should split it up into two parts.
Part 1 would be the parsing at file level: splitting the file up into blocks (in your example you have two blocks: DISK and CAPACITY).
Part 2 would be parsing the blocks themselves to get the values.
You know you can ignore the lines starting with !, so let's skip those:
with open('myfile.txt', 'r') as f:
    content = [l for l in f.readlines() if not l.startswith('!')]
Next, read the lines into blocks:
def partition_by(l, f):
    t = []
    for e in l:
        if f(e):
            if t: yield t
            t = []
        t.append(e)
    yield t
blocks = partition_by(content, lambda l: l.startswith('['))
and finally read in the values for each block:
def parse_block(block):
    gen = iter(block)
    block_name = next(gen).strip()[1:-1]
    splitted = [e.split('=') for e in gen]
    values = {t[0].strip(): t[1].strip() for t in splitted if len(t) == 2}
    return block_name, values
result = [parse_block(b) for b in blocks]
That's it. Let's have a look at the result:
for section, values in result:
    print section, ':'
    for k, v in values.items():
        print '\t', k, '=', v
output:
DISK :
    DIRECTION = 'OK'
    TYPE = 'normal'
CAPACITY :
    code = 0
    ID = 110
Are you able to make a small change to the text file? If you can make it look like this (only changed the comment character):
#--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
#------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
Then parsing it is trivial:
from ConfigParser import SafeConfigParser
parser = SafeConfigParser()
parser.read('filename')
And getting data looks like this:
(Pdb) parser
<ConfigParser.SafeConfigParser instance at 0x100468dd0>
(Pdb) parser.get('DISK', 'DIRECTION')
"'OK'"
Edit based on comments:
If you're using <= 2.7, then you're a little SOL.. The only way really would be to subclass ConfigParser and implement a custom _read method. Really, you'd just have to copy/paste everything in Lib/ConfigParser.py and edit the values in line 477 (2.7.3):
if line.strip() == '' or line[0] in '#;': # add new comment characters in the string
However, if you're running 3'ish (not sure what version it was introduced in offhand, I'm running 3.4(dev)), you may be in luck: ConfigParser added the comment_prefixes __init__ param to allow you to customize your prefix:
parser = ConfigParser(comment_prefixes=('#', ';', '!'))
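For the Python 3 route, a minimal sketch using the standard configparser module on the sample file from the question, unchanged. The text is fed in through read_string only to keep the example self-contained; parser.read('filename') would work the same way on a real file.

```python
import configparser

# The sample file from the question, with '!' comment lines left as they are.
text = """\
!--------------------------------------------------------------------------DISK
[DISK]
DIRECTION = 'OK'
TYPE = 'normal'
!------------------------------------------------------------------------CAPACITY
[CAPACITY]
code = 0
ID = 110
"""

# Treat '!' as a full-line comment prefix in addition to the defaults.
parser = configparser.ConfigParser(comment_prefixes=('#', ';', '!'))
parser.read_string(text)

sections = parser.sections()
direction = parser.get('DISK', 'DIRECTION')
disk_id = parser.get('CAPACITY', 'ID')
```

Values come back as strings, quotes included, and option names are matched case-insensitively by default.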
If the file is not big, you can load it and use Regexes to find parts that are of interest to you.
Hi I'm new to python. I am trying to add different key value pairs to a dictionary depending on different if statements like the following:
def getContent(file):
    for line in file:
        content = {}
        if line.startswith(titlestart):
            line = line.replace(titlestart, "")
            line = line.replace("]]></title>", "")
            content["title"] = line
        elif line.startswith(linkstart):
            line = line.replace(linkstart, "")
            line = line.replace("]]>", "")
            content["link"] = line
        elif line.startswith(pubstart):
            line = line.replace(pubstart, "")
            line = line.replace("</pubdate>", "")
            content["pubdate"] = line
    return content
print getContent(list)
However, this always returns the empty dictionary {}.
I thought it was variable scope issue at first but that doesn't seem to be it. I feel like this is a very simple question but I'm not sure what to google to find the answer.
Any help would be appreciated.
You reinitialize content for every line. Move the initialization outside of the loop:
def getContent(file):
    content = {}
    for line in file:
        etc.
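To see the difference in isolation, here is a small self-contained sketch. The "title:" prefix and both function names are invented for the demonstration and stand in for the real titlestart handling:

```python
def collect_inside(lines):
    for line in lines:
        content = {}  # reset on every iteration: earlier keys are lost
        if line.startswith("title:"):
            content["title"] = line[len("title:"):]
    return content

def collect_outside(lines):
    content = {}  # created once, so keys accumulate across lines
    for line in lines:
        if line.startswith("title:"):
            content["title"] = line[len("title:"):]
    return content

lines = ["title:Hello", "something else"]
inside = collect_inside(lines)
outside = collect_outside(lines)
```

Here inside ends up empty because the last line matches nothing and the dictionary was recreated just before it, while outside still holds the title.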