This might sound banal, but it has been a pain.
So I wrote code that parses lines. The .txt file has a line which matches my re.match and a line which doesn't.
cat file.txt
00.00.00 : Blabla
x
In this case I handle the second line by checking whether it starts with the letter "x".
def parser():
    path = "file.txt"
    with open (path, 'r+') as file:
        msg = {}
        list = []
        start = 0
        lines = file.readlines()
        for i in range (0,len(lines)):
            line = lines[i]
            if re.match('MY RULES', line) is not None:
                field['date'] = line[:8]
                msg['msg'] = line[start + 2:]
                print msg
            if line.startswith('x'):
                msg['msg'] += line
            list.append(msg)
        print chat
OUTPUT for 2 lines
{'date': '0.0.00', 'msg': 'BlaBla'}
{'msg': 'x'}
The problem is I can't append the second dict's message['msg'] to the last message when the line starts with "x".
The expected output is:
{'date': '0.0.00', 'msg': 'BlaBlax'}
I tried this variant to change the last appended message:
else:
    list[len(list) - 1]['msg'] += + line
but then I get the error:
IndexError: list index out of range
I also tried using next(infile) to predict the next line, but then it output every other line.
How would you trick a nested loop to append a dict entry?
Cheers
First of all, do not use list as a variable name: it is the name of a built-in type (not a keyword), and you are shadowing it.
Secondly, if I understand correctly, you would like to append to the last result.
Here:
if re.match('MY RULES', line) is not None:
    field['date'] = line[:8]
    msg['msg'] = line[start + 2:]
    print msg
if line.startswith('x'):
    msg['msg'] += line
You are analyzing the same line twice, and msg['msg'] = line[start + 2:] in the next iteration overwrites the key msg in the dictionary msg and clears the previous value. So this code
field['date'] = line[:8]
msg['msg'] = line[start + 2:]
print msg
always gets executed, even for a simple x in your input file, and clears the previous value under the key msg.
If you would like it to work you need an if/else, although I would recommend storing intermediate values in a different way than in a locally scoped variable.
Full example with code fix:
def parser():
    path = "file.txt"
    with open(path, 'r') as file:
        msg = {}
        chat = []
        for line in file:
            if line.startswith('x'):
                # continuation line: extend the message started earlier
                msg['msg'] += line
            else:
                # new message: use a fresh dict so earlier entries are not overwritten;
                # the text starts after the 8-character date and " : "
                msg = {'date': line[:8], 'msg': line[11:]}
                chat.append(msg)
    print(chat)

parser()
Result:
[{'date': '00.00.00', 'msg': 'Blabla\nx'}]
Assuming that the line if re.match('MY RULES', line) is not None:
is True for all the lines in the file that is:
00.00.00 : Blabla
x
How about this:
path = "file.txt"
with open(path, 'r') as f:
    msg = dict()
    for line in f:
        if line[0].isdigit():
            # split only at the first colon so colons in the text survive
            date, text = line.split(':', 1)
            date = date.strip()
            msg[date] = ' '.join(text.split())
        else:
            msg[date] += ' ' + ' '.join(line.split())
We go line by line: if the first character of the line is a digit we assume the line starts with a date and add it to our dict; otherwise we add the string to the last dict entry we made. str.split() makes sure you get rid of all the different whitespace characters.
You can certainly replace the if statement in the for loop with your regex. The issue I see with your implementation in general is that as soon as the input varies slightly (e.g. more whitespace characters than intended) your solution produces faulty results. Basic Python string manipulations are really powerful ;)
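The line-by-line idea described above can be sketched as a small function that uses a regex instead of the first-character check. This is only an illustration: the function name group_messages, the date pattern, and the in-memory sample lines are assumptions, not part of the original code.

```python
import re

# Assumed pattern: a date such as "00.00.00", then " : ", then the message text.
DATE_LINE = re.compile(r'^(\d{2}\.\d{2}\.\d{2})\s*:\s*(.*)$')

def group_messages(lines):
    """Attach every non-date line to the most recent dated message."""
    chat = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = DATE_LINE.match(line)
        if match:
            # A dated line starts a new message.
            chat.append({'date': match.group(1), 'msg': match.group(2)})
        elif chat:
            # A continuation line is appended to the last message.
            chat[-1]['msg'] += ' ' + line
        # A continuation line before any dated line is ignored.
    return chat

sample = ["00.00.00 : Blabla\n", "x\n", "00.00.00 : Blabla2\n", "x2\n"]
print(group_messages(sample))
```

Indexing chat[-1] only after checking that chat is non-empty avoids the IndexError mentioned in the question.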
Update
This should produce the right output:
*file.txt*
00.00.00 : Blabla
x
00.00.00 : Blabla2
x2
path = "file.txt"
with open(path, 'r') as f:
    lst = list()
    for line in f:
        if line[0].isdigit():
            # split only at the first colon so colons in the text survive
            date, text = line.split(':', 1)
            date = date.strip()
            msg = {date: ' '.join(text.split())}
            lst.append(msg)
        else:
            msg[date] += ' ' + ' '.join(line.split())
print(lst)
>>> [{'00.00.00': 'Blabla x'}, {'00.00.00': 'Blabla2 x2'}]
I missed the part that you want to store each pair separately in a dict and append it to a list.
word = "some string"
file1 = open("songs.txt", "r")
flag = 0
index = 0
for line in file1:
    index += 1
    if word in line:
        flag = 1
        break
if flag == 0:
    print(word + " not found")
else:
    #I would like to print not only the line that has the string, but also the previous and next lines
    print(?)
    print(line)
    print(?)
file1.close()
Use contents = file1.readlines(), which reads the file into a list of lines.
Then loop through contents and, if word is found at index i, print contents[i-1], contents[i], and contents[i+1]. Make sure to handle the edges: if word is in the last line, contents[i+1] raises an IndexError, and if it is in the first line, contents[i-1] silently wraps around to the last line.
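A minimal sketch of that readlines() approach with both edges guarded. The helper name lines_around and the sample list are made up for illustration; in practice contents would come from file1.readlines().

```python
def lines_around(contents, word):
    """Return (previous, match, next) for the first line containing word, or None."""
    for i, line in enumerate(contents):
        if word in line:
            # Guard both ends: index i - 1 would wrap around to the last line
            # when i == 0, and index i + 1 would raise IndexError on the last line.
            previous = contents[i - 1] if i > 0 else None
            following = contents[i + 1] if i + 1 < len(contents) else None
            return previous, line, following
    return None

contents = ["intro\n", "some string here\n", "outro\n"]
print(lines_around(contents, "some string"))
```

A missing neighbour is reported as None here; printing an empty string instead would work just as well.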
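word = "some string"
flag = 0
index = 0
previousline = ''
with open("songs.txt", "r") as file1:
    for line in file1:
        index += 1
        if flag == 0 and word in line:
            # remember the matching line and its number
            finalindex = index
            finalline = line
            flag = 1
        elif flag == 1:
            # this is the line right after the match
            print(previousline + finalline + line)
            print(finalindex - 1, finalindex, index)
            break
        else:
            previousline = line
    else:
        # the loop ended without a following line
        if flag == 1:
            print(previousline + finalline)
        else:
            print(word + " not found")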
You basically already had the main ingredients:
you have line (the line you currently evaluate)
you have the index (index)
The to-do thus becomes storing the previous and next lines in variables and then printing the results.
I have not tested it, but the code should look something like the above.
It branches three ways: the word is found in the current line, the word was found and flagged on the previous line, or nothing has been flagged yet.
I believe the else-if should not fire unless flag == 1.
Aspiring Python newb (2 months) here. I am trying to create a program that inserts information at two specific places in each line of a .txt file, creating a new file in the process.
The information in the source file is something like this:
1,340.959,859.210,0.0010,VV53
18abc,34099.9590,85989.2100,0.0010,VV53
00y46646464,34.10,859487.2970,11.4210,RP27
Output would be:
1,7340.959,65859.210,0.0010,VV53
18abc,734099.9590,6585989.2100,0.0010,VV53
00y46646464,734.10,65859487.2970,11.4210,RP27
Each line is different, and there are hundreds of lines. The specific markers I'm looking for are the first and second occurrence of a comma (,). The stuff needs to be added after the first and second comma. You'll know what I mean when you see the code.
I have gotten as far as this: the program finds the correct places and inserts what I need, but doesn't write more than 1 line to the new file. I tried debugging and seeing what's going on 'under the hood', all seemed good there.
Lots of scrapping code and chin-holding later I'm still stuck where I was a week ago.
tl;dr Code only outputs 1 line to new file, need hundreds.
f = open('test.txt', 'r')
new = open('new.txt', 'w')
first = ['7']
second = ['65']
line = f.readline()
templist = list(line)
counter = 0
while line != '':
for i, j in enumerate(templist):
if j == ',':
place = i + 1
templist1 = templist[:place]
templist2 = templist[place:]
counter += 1
if counter == 1:
for i, j in enumerate(templist2):
if j == ',':
place = i + 1
templist3 = templist2[:place]
templist4 = templist2[place:]
templist5 = templist1 + first + templist3 + second + templist4
templist6 = ''.join(templist5)
new.write(templist6)
counter += 1
break
if counter == 2:
break
break
line = f.readline()
templist = list(line)
f.close()
new.close()
If I'm understanding your samples and code correctly, this might be a valid approach:
with open('test.txt', 'r') as infd, open('new.txt', 'w') as outfd:
    for line in infd:
        # drop the trailing newline so it is not written twice
        fields = line.rstrip('\n').split(',')
        fields[1] = '7' + fields[1]
        fields[2] = '65' + fields[2]
        outfd.write('{}\n'.format(','.join(fields)))
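The same idea can be packaged as a per-line helper, which makes it easy to check against the sample rows from the question. The function name add_prefixes and its parameters are assumptions made for this sketch; it splits only at the first two commas, so the rest of the line is left untouched.

```python
def add_prefixes(line, first='7', second='65'):
    """Insert `first` after the first comma and `second` after the second comma."""
    # Split at the first two commas only; everything after them stays in one piece.
    fields = line.rstrip('\n').split(',', 2)
    if len(fields) < 3:
        return line.rstrip('\n')  # fewer than two commas: leave the line unchanged
    fields[1] = first + fields[1]
    fields[2] = second + fields[2]
    return ','.join(fields)

print(add_prefixes("1,340.959,859.210,0.0010,VV53\n"))
```

Calling it for every line of the input file and writing the result plus a newline gives one output line per input line.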
I am trying to set up a system for running various statistics on a text file. In this endeavor I need to open a file in Python (v2.7.10) and read it both as lines, and as a string, for the statistical functions to work.
So far I have this:
import csv, json, re
from textstat.textstat import textstat
file = "Data/Test.txt"
data = open(file, "r")
string = data.read().replace('\n', '')
lines = 0
blanklines = 0
word_list = []
cf_dict = {}
word_dict = {}
punctuations = [",", ".", "!", "?", ";", ":"]
sentences = 0
This sets up the file and the preliminary variables. At this point, print textstat.syllable_count(string) returns a number. Further, I have:
for line in data:
    lines += 1
    if line.startswith('\n'):
        blanklines += 1
    word_list.extend(line.split())
    for char in line.lower():
        cf_dict[char] = cf_dict.get(char, 0) + 1
for word in word_list:
    lastchar = word[-1]
    if lastchar in punctuations:
        word = word.rstrip(lastchar)
    word = word.lower()
    word_dict[word] = word_dict.get(word, 0) + 1
for key in cf_dict.keys():
    if key in '.!?':
        sentences += cf_dict[key]
number_words = len(word_list)
num = float(number_words)
avg_wordsize = len(''.join([k*v for k, v in word_dict.items()]))/num
mcw = sorted([(v, k) for k, v in word_dict.items()], reverse=True)
print( "Total lines: %d" % lines )
print( "Blank lines: %d" % blanklines )
print( "Sentences: %d" % sentences )
print( "Words: %d" % number_words )
print('-' * 30)
print( "Average word length: %0.2f" % avg_wordsize )
print( "30 most common words: %s" % mcw[:30] )
But this fails: line 22, avg_wordsize = len(''.join([k*v for k, v in word_dict.items()]))/num, raises ZeroDivisionError: float division by zero. However, if I comment out string = data.read().replace('\n', '') in the first piece of code, the second piece runs without problems and produces the expected output.
Basically, how do I set this up so that I can run the second piece of code on data, as well as textstat on string?
The call to data.read() leaves the file pointer at the end of the file, so there is nothing more to read at that point. You either have to close and reopen the file or, more simply, reset the pointer to the beginning using data.seek(0).
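A small self-contained sketch of that behaviour. The temporary file and its two lines are invented for the demonstration; only read(), iteration over the file object, and seek(0) matter here.

```python
import os
import tempfile

# Write a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("first line\nsecond line\n")
    path = tmp.name

with open(path, 'r') as data:
    text = data.read().replace('\n', '')  # moves the pointer to the end of the file
    exhausted = list(data)                # nothing is left to iterate over
    data.seek(0)                          # rewind to the beginning
    lines_after_seek = list(data)         # the lines are available again

os.remove(path)
print(text, exhausted, lines_after_seek)
```

The empty list in the middle is the same situation that leaves word_list empty and causes the division by zero in the question.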
First see the line:
string = data.read().replace('\n', '')
You are reading from data once. Now the cursor is at the end of data.
Then see the line,
for line in data:
You are trying to read it again, but you can't, because there is nothing left in data: you are at the end of it. So len(word_list) returns 0.
You are dividing by it and getting the error.
ZeroDivisionError: float division by zero.
But when you comment that line out, you read the file only once, which is valid, so the second portion of your code works.
Clear now?
So, what to do now?
Use data.seek(0) after data.read().
Demo:
>>> a = open('file.txt')
>>> a.read()
#output
>>> a.read()
#nothing
>>> a.seek(0)
>>> a.read()
#output again
Here is a simple fix. Replace the line for line in data: with:
data.seek(0)
for line in data.readlines():
    ...
It basically points back to the beginning of the file and reads it again line by line.
While this should work, you may want to simplify the code and read the file only once. Something like:
with open(file, "r") as fin:
    lines = fin.readlines()
string = ''.join(lines).replace('\n', '')
I'm trying to print out a medium sized list in Python and what I'm doing is printing out the entire list on one line to test the program to make sure the right data is being put in to the list in the right order. I read in 2 files and put all the data into 2 dictionaries. Then, I split the dictionaries into parts and put all the similar data into a list. I'm super new to Python and this is a tutorial I found on dictionaries and I'm a little stuck. This line prints the list on one line:
print '[%s]' % ', '.join(map(str, player_list))
But this line prints each value of the list on a separate line which I don't want:
print '[%s]' % ', '.join(map(str, army_list))
Here's my code if needed that adds to the list:
import collections
import operator
terridict = {}
gsdict = {}
terr_list = []
player_list = []
army_list = []
list_length = []
total_territories = 0
with open('territories.txt', 'r') as territory:
for line in territory:
terridict["territory"], terridict["numeric_id"], terridict["continent"] = line.split(',')
with open('gameState.txt', 'r') as gameState:
for line in gameState:
gsdict["numeric_id"], gsdict["player"], gsdict["num_armies"] = line.split(',')
terr_num = gsdict["numeric_id"]
player_num = gsdict["player"]
army_size = gsdict["num_armies"]
if terr_num >= 1 and player_num >= 1 and army_size >= 1:
terr_list.append(terr_num)
player_list.append(player_num)
army_list.append(army_size)
player_list.sort()
counter = collections.Counter(player_list)
print (counter)
total_territories = total_territories + 1
x = counter
sorted_x = sorted(x.items(), key=operator.itemgetter(0))
counter = sorted_x
print terr_num, player_num, army_size
print counter
print "Number of territories: %d" % total_territories
print '[%s]' % ', '.join(map(str, terr_list))
print '[%s]' % ', '.join(map(str, player_list))
print '[%s]' % ', '.join(map(str, army_list))
line, when you read it in, ends with a newline. For example (I'm guessing here):
"1 nelson2013 23\n"
When you split it by space, you get this:
["1", "nelson2013", "23\n"]
Notice that the player name does not end with a newline, but army size does. When you join army sizes together, they end up like this:
"23\n, 18\n, 121\n"
i.e. separated by newlines, which makes them print one per line.
To combat this, you want to invoke rstrip() on line immediately at the top of the loop, before you process it any further.
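As a rough illustration of that suggestion, the sketch below strips each line before splitting it. The helper name parse_row and the two sample rows are made up; the rows use commas because the question's code calls line.split(',').

```python
def parse_row(line):
    """Strip the trailing newline, then split one comma-separated row."""
    return line.rstrip().split(',')

rows = ["1,2,23\n", "2,1,18\n"]  # made-up rows in the numeric_id,player,num_armies layout
army_list = [parse_row(line)[2] for line in rows]
print('[%s]' % ', '.join(army_list))
```

Without the rstrip() call, the last field of each row would keep its "\n" and the joined list would again print one value per line.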
You probably want to reassign line, because calling line.rstrip() by itself does not change it: strings are immutable, so the method returns a new string. Building off what Amadan said:
line = line.rstrip()
This way, the newline character is removed before line is processed any further. I tried it out and this worked.
I am trying to parse a large fasta file and I am encountering out-of-memory errors. Some suggestions to improve the data handling would be appreciated. Currently the program correctly prints out the names; however, partway through the file I get a MemoryError.
Here is the generator
def readFastaEntry( fp ):
    name = ""
    seq = ""
    for line in fp:
        if line.startswith( ">" ):
            tmp = []
            tmp.append( name )
            tmp.append( seq )
            name = line
            seq = ""
            yield tmp
        else:
            seq = seq.join( line )
And here is the caller stub; more will be added after this part works:
fp = open( sys.argv[1], 'r' )
for seq in readFastaEntry( fp ) :
    print seq[0]
For those not familiar with the fasta format, here is an example:
>1 (PB2)
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC
TGCACTCAGGATGAAGTGGATGATG
>2 (PB1)
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC
ACATTTCCCTATACTGGAGACCCTCC
Each entry starts with a ">" line stating the name etc., then the next N lines are data. There is no defined ending of the data other than the next line having a ">" at the beginning.
Have you considered using BioPython? They have a sequence reader that can read fasta files. And if you are interested in coding one yourself, you can take a look at BioPython's code.
Edit: Code added
def read_fasta(fp):
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            if name: yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name: yield (name, ''.join(seq))

with open('f.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(name, seq)
A pyparsing parser for this format is only a few lines long. See the annotations in the following code:
data = """>1 (PB2)
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC
TGCACTCAGGATGAAGTGGATGATG
>2 (PB1)
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC
ACATTTCCCTATACTGGAGACCCTCC"""
from pyparsing import Word, nums, QuotedString, Combine, OneOrMore
# define some basic forms
integer = Word(nums)
key = QuotedString("(", endQuoteChar=")")
# sequences are "words" made up of the characters A, G, C, and T
# we want to match one or more of them, and have the parser combine
# them into a single string (Combine by default requires all of its
# elements to be adjacent within the input string, but we want to allow
# for the intervening end of lines, so we add adjacent=False)
sequence = Combine(OneOrMore(Word("AGCT")), adjacent=False)
# define the overall pattern to scan for - attach results names
# to each matched element
seqEntry = ">" + integer("index") + key("key") + sequence("sequence")
for seq,s,e in seqEntry.scanString(data):
    # just dump out the matched data
    print(seq.dump())
    # could also access fields as seq.index, seq.key and seq.sequence
Prints:
['>', '1', 'PB2', 'AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCTGCACTCAGGATGAAGTGGATGATG']
- index: 1
- key: PB2
- sequence: AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCTGCACTCAGGATGAAGTGGATGATG
['>', '2', 'PB1', 'AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACCACATTTCCCTATACTGGAGACCCTCC']
- index: 2
- key: PB1
- sequence: AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACCACATTTCCCTATACTGGAGACCCTCC
Without having a great understanding of what you are doing, I would have written the code like this:
def readFastaEntry( fp ):
    name = ""
    while True:
        # reuse the header already read by the inner loop, otherwise read a new line
        line = name or fp.readline()
        if not line:
            break
        seq = []
        while True:
            name = fp.readline()
            if not name or name.startswith(">"):
                break
            else:
                seq.append(name)
        yield (line, "".join(seq))
This gathers up the data after a starting line up to the next starting line. Making seq a list means that you postpone the string joining until the last possible moment. Yielding a tuple makes more sense than a list.
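A likely reason the generator in the question runs out of memory, shown here on toy strings rather than on the real file: str.join treats the string it is called on as a separator, so seq.join(line) inserts a copy of seq between every character of line instead of appending line to seq. The variable values below are made up for the demonstration.

```python
# str.join uses the string it is called on as the separator and puts it
# between the items of its argument, here the characters of `line`.
seq = "AAA"
line = "GT"
grown = seq.join(line)  # "G" + "AAA" + "T"

# Collecting the pieces in a list and joining once with an empty separator
# concatenates them instead.
pieces = ["AAA", "GT"]
joined = "".join(pieces)

print(grown, joined)
```

With 70-character sequence lines, each call multiplies the length of seq by roughly the line length, which would explain the MemoryError partway through the file.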
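def read_fasta(filename):
    name = None
    seq = ""
    with open(filename) as file:
        for line in file:
            if line[0] == ">":
                if name is not None:
                    yield (name, seq)
                # header: drop the leading ">" and keep the part before the first "|"
                name = line[1:].rstrip("\n").split("|")[0]
                seq = ""
            else:
                seq += line.rstrip("\n")
    # emit the last entry, if the file contained any
    if name is not None:
        yield (name, seq)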