I want to parse a huge file consisting of thousands of blocks, each of which contains several sub-blocks. To keep it simple, consider an input file containing the lines below:
a
2
3
4
b
3
9
2
c
7
each on a separate line. The letters name each block and the numbers are properties of that block.
I want the output to be a dictionary with the block name as key and a list of only the properties 2 and 3 (if present) as value, like this:
{a:[2,3],b:[3,2],c:[]}
I think the best way is to use two while loops to read and search the lines, like this:
dict={}
with open('sample_basic.txt','r') as file:
    line=file.readline()
    line=line.strip()
    while line:
        if line.isalpha():
            block_name=line
            line=file.readline()
            line=line.strip()
            list=[]
            while line:
                lev_1=line
                if lev_1 in ['2','3']:
                    list.append(lev_1)
                    line=file.readline()
                    line=line.strip()
                if lev_1.isalpha():
                    dict[block_name]=list
                    break
        else:
            line=file.readline()
            line=line.strip()
But it just goes into an infinite loop when executed.
I looked for the error but I can't find where it is.
I would appreciate it if anyone could give me a hint about it.
I did not check your code too closely, so I can not help you with the infinite loop, but I wrote new code without nested loops:
import collections
d = collections.defaultdict(list)
with open('sample_basic.txt') as f:
    for line in f:
        line = line.strip()
        if line.isalpha():
            blockname=line
        else:
            if line in ('2', '3'):
                d[blockname].append(int(line))
The output, using a file with the content you posted, is {'b': [3, 2], 'a': [2, 3]}.
If you want the key c included in your dictionary with an empty list, do this:
d={}
with open('sample_basic.txt') as f:
    for line in f:
        line = line.strip()
        if line.isalpha():
            blockname=line
            d[blockname] = []
        else:
            if line in ('2', '3'):
                d[blockname].append(int(line))
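As for the infinite loop in the question: reading its indentation as it appears to be intended, the inner while loop only calls readline() when the current line is '2' or '3'. A line such as '4' is neither consumed nor treated as a block name, so the loop keeps testing the same line forever. Below is a minimal single-pass sketch that avoids this; the names parse_blocks and wanted are hypothetical, and it takes any iterable of lines so it can be tried without a file on disk.

```python
def parse_blocks(lines, wanted=("2", "3")):
    """Map each block name to the wanted properties found in that block.

    parse_blocks and wanted are hypothetical names used for this sketch.
    """
    blocks = {}
    current = None  # name of the block being read; None before the first one
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines instead of stopping on them
        if line.isalpha():
            # a new block starts: register it even if it ends up empty
            current = line
            blocks[current] = []
        elif current is not None and line in wanted:
            blocks[current].append(int(line))
    return blocks


sample = ["a\n", "2\n", "3\n", "4\n", "b\n", "3\n", "9\n", "2\n", "c\n", "7\n"]
print(parse_blocks(sample))  # {'a': [2, 3], 'b': [3, 2], 'c': []}
```

An open file object can be passed in place of the list, since iterating a file yields its lines one at a time.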
Related
I want to return the length of each line as an element in a list named lst, but
my code is not working: the output is always an empty list. Please tell
me what's wrong with my code.
# this is the file
f = open("abcd.txt", 'w')
f.write("Hello How Ar3 you?")
f.write("\nHope You're doing fine")
f.write("\nI'm doing okay too.")
f.write("\nSizole!")
f.close()
This is the code I wrote to return a list of length of lines in the file:
f = open("abcd.txt", 'r')
t = f.readlines()
print(t)
lst = []
for line in f.readlines():
    lst.append(len(line))
print(lst)
Output: lst == []
Keep it simple: read the lines once and then count the lengths.
The code below uses a list comprehension.
texts = f.readlines()
lst = [len(line) for line in texts]
print(lst)
Here's the output of the above code. Hope this helps; most of the other answers are correct as well.
[19, 23, 20, 7]
When you read the file back in the second code snippet, the:
t = f.readlines()
...
reads the entire file into a list and assigns it to the variable t.
You then try to read all the lines again with the:
for line in f.readlines():
...
Which will not work because they have all been read already.
To fix it, just change the for loop to this:
for line in t:
You don't need to read the lines before the loop (that is the line t = ... is unnecessary).
In fact doing so is likely causing the problem - once you read the lines, the file pointer is at the end of the file so there's nothing left to read.
In your code, you are calling f.readlines() twice. You just need to call it once:
f = open("abcd.txt", 'r')
t = f.readlines()
print(t)
lst = []
# instead of for line in f.readlines(), we can simply use t
for line in t:
    lst.append(len(line))
print(lst)
or if the variable t is not necessary:
f = open("abcd.txt", 'r')
lst = []
for line in f.readlines():
    lst.append(len(line))
print(lst)
f.readlines() will move the file pointer to the end of file. Calling it again will return an empty list, which is not what we want.
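The behaviour described here can be checked with a small sketch. io.StringIO stands in for the open file (an assumption made so the example needs no file on disk); a real file object opened with open() behaves the same way for these calls.

```python
import io

# io.StringIO stands in for the file written in the question
f = io.StringIO(
    "Hello How Ar3 you?\nHope You're doing fine\nI'm doing okay too.\nSizole!"
)

first = f.readlines()   # consumes everything up to end of file
second = f.readlines()  # the position is already at the end, nothing is left

f.seek(0)               # rewind to the start of the file
third = f.readlines()   # the lines are available again

lengths = [len(line) for line in first]
print(lengths)  # [19, 23, 20, 7]
print(second)   # []
```

Rewinding with seek(0) works, but reusing the list from the first readlines() call avoids reading the file twice.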
I have 2 text files. I want to compare the 2 text files and return a list that has every line number that is different. Right now, I think my code returns the lines that are different, but how do I return the line number instead?
def diff(filename1, filename2):
    with open('./exercise-files/text_a.txt', 'r') as filename1:
        with open('./exercise-files/text_b.txt', 'r') as filename2:
            difference = set(filename1).difference(filename2)
    difference.discard('\n')
    with open('diff.txt', 'w') as file_out:
        for line in difference:
            file_out.write(line)
Testing on:
diff('./exercise-files/text_a.txt', './exercise-files/text_b.txt') == [3, 4, 6]
diff('./exercise-files/text_a.txt', './exercise-files/text_a.txt') == []
difference = [
    line_number + 1 for line_number, (line1, line2)
    in enumerate(zip(filename1, filename2))
    if line1 != line2
]
zip takes two (or more) iterables, such as open files, and yields tuples, where each tuple contains the corresponding entries of each iterable. enumerate takes that sequence of tuples and yields new tuples, where the first element is the index (starting at 0, hence the + 1) and the second is the original value. And it's straightforward from there.
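To make the pairing concrete, here is a small sketch that uses two in-memory lists in place of the open files (the lists a and b are made up for illustration):

```python
# two made-up "files", given as lists of lines
a = ["x\n", "y\n", "z\n"]
b = ["x\n", "Y\n", "z\n", "extra\n"]

# zip stops at the shorter input, so "extra\n" is never paired
pairs = list(enumerate(zip(a, b)))
print(pairs)
# [(0, ('x\n', 'x\n')), (1, ('y\n', 'Y\n')), (2, ('z\n', 'z\n'))]

# keep the 1-based number of every pair whose lines differ
different = [index + 1 for index, (line1, line2) in pairs if line1 != line2]
print(different)  # [2]
```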
Here is an example which will ignore any surplus lines if one file has more lines than the other. The key is to use enumerate when iterating to get the line number as well as the contents. next can be used to get a line from the file iterator which is not used directly by the for loop.
def diff(filename1, filename2):
    difference_line_numbers = []
    with open(filename1, "r") as file1, open(filename2, "r") as file2:
        for line_number, contents1 in enumerate(file1, 1):
            try:
                contents2 = next(file2)
            except StopIteration:
                break
            if contents1 != contents2:
                difference_line_numbers.append(line_number)
    return difference_line_numbers
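If surplus lines should count as differences instead of being ignored, one possible variant (not something the answers above propose) is itertools.zip_longest, which pads the shorter input with None. The function name diff_lines is hypothetical, and the sketch takes any two iterables of lines, so open file objects work as well as lists.

```python
from itertools import zip_longest


def diff_lines(lines1, lines2):
    """Return the 1-based numbers of lines that differ.

    diff_lines is a hypothetical name. A line present in only one input is
    paired with None, which never equals a string, so it counts as different.
    """
    return [
        number
        for number, (one, two) in enumerate(zip_longest(lines1, lines2), 1)
        if one != two
    ]


print(diff_lines(["a\n", "b\n"], ["a\n", "c\n", "d\n"]))  # [2, 3]
```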
I have this part of a code:
def readTXT():
    part_result = []
    '''Reading all data from text file'''
    with open('dataset/sometext.txt', 'r') as txt:
        for lines in txt:
            part = lines.split()
            part_result = [int(i) for i in part]
            #sorted([(p[0], p[14]) for p in part_result], key=lambda x: x[1])
            print(part_result)
            return part_result
I'm trying to get all the lists as a return value, but for now I only get the first one, which is quite obvious, because my return is inside the for loop. But still, shouldn't the loop go through every line and return the corresponding list?
After doing research, all I found was return list1, list2 etc. But how should I manage that if my lists are generated from a text file line by line?
It frustrates me, not being able to return multiple lists at once.
Here's my suggestion: create a 'major_array' and add 'part_result' to it on each iteration of the loop. This way, if your loop iterates 10 times, you will have 10 arrays in your 'major_array'. Finally, the array is returned when the for loop finishes. :)
def readTXT():
    #create a new array
    major_array = []
    part_result = []
    '''Reading all data from text file'''
    with open('dataset/sometext.txt', 'r') as txt:
        for lines in txt:
            part = lines.split()
            part_result = [int(i) for i in part]
            #sorted([(p[0], p[14]) for p in part_result], key=lambda x: x[1])
            print(part_result)
            major_array.append(part_result)
    return major_array
Here is a solution:
def readTXT():
    with open('dataset/sometext.txt') as lines:
        all_lists = []
        for line in lines:
            all_lists.append([int(cell) for cell in line.split()])
        return all_lists
Note that the return statement is outside of the loop. You get only one list because you return inside the loop.
For a more advanced user, this solution is shorter and more efficient, at the cost of being a little harder to understand:
def readTXT():
    with open('dataset/sometext.txt') as lines:
        return [[int(x) for x in line.split()] for line in lines]
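If the goal really is to hand back one list per line as the loop goes, a generator is another option, though none of the answers above use it: yield produces a value and lets the loop continue, whereas return ends the function. A minimal sketch, assuming a hypothetical function read_rows that accepts any iterable of lines (io.StringIO stands in for the file here):

```python
import io


def read_rows(lines):
    """Yield one list of ints per non-blank line (hypothetical helper)."""
    for line in lines:
        if line.strip():  # skip blank lines
            yield [int(cell) for cell in line.split()]


# made-up data standing in for dataset/sometext.txt
data = io.StringIO("1 2 3\n4 5 6\n\n7 8 9\n")
rows = list(read_rows(data))
print(rows)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

The caller can also loop over read_rows(...) directly and process each row as it arrives, without building the full list.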
I have a file containing numbers and 2 words : "start" and "middle"
I want to read numbers from "start" to "middle" in one array and numbers from "middle" to end of the file into another array.
This is my Python code:
with open("../MyList","r") as f:
    for x in f.readlines():
        if x == "start\n":
            continue
        if x == "middle\n":
            break
        x = x.split("\n")[0]
        list_1.append(int(x))
    print list_1
    for x in f.readlines():
        if x == "middle\n":
            continue
        list_2.append(int(x))
    print list_2
But the problem is that my program never enters the second loop and jumps straight to
print list_2
I searched older questions but could not figure out the problem.
It's because f.readlines() reads the whole file in the 1st loop. When execution reaches the 2nd loop, the file pointer is already at the end of the file, so f.readlines() returns an empty list.
You can fix that either by reopening the file or by setting the file pointer back to the beginning of the file with f.seek(0) before the 2nd for loop:
with open("../MyList","r") as f:
    for x in f.readlines():
        # process your stuff for 1st loop
        pass
    # reset file pointer to beginning of file again
    f.seek(0)
    for x in f.readlines():
        # process your stuff for 2nd loop
        pass
Reading the whole file into memory is not very efficient if you are processing a large file. You can iterate over the file object instead of reading everything into memory, as in the code below:
list1 = []
list2 = []
list1_start = False
list2_start = False
with open("../MyList","r") as f:
    for x in f:
        if x.strip() == 'start':
            list1_start = True
            continue
        elif x.strip() == 'middle':
            list2_start = True
            list1_start = False
            continue
        if list1_start:
            list1.append(x.strip())
        elif list2_start:
            list2.append(x.strip())
print(list1)
print(list2)
Your first loop is reading the entire file to the end, but processes only half of it. When the second loop hits, the file pointer is already at the end, so no new lines are read.
From the python docs:
file.readlines([sizehint])
Read until EOF using readline() and return a list containing the lines
thus read. If the optional sizehint argument is present, instead of
reading up to EOF, whole lines totalling approximately sizehint bytes
(possibly after rounding up to an internal buffer size) are read.
Objects implementing a file-like interface may choose to ignore
sizehint if it cannot be implemented, or cannot be implemented
efficiently.
Either process everything in one loop, or read line-by-line (using readline instead of readlines).
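A minimal sketch of the line-by-line idea, under the assumption that plain iteration over the file object is acceptable in place of explicit readline() calls. Iterating consumes one line at a time, so a second loop over the same object resumes where the first one stopped. io.StringIO and the sample data stand in for the real file.

```python
import io

# made-up contents standing in for ../MyList
f = io.StringIO("start\n1\n2\nmiddle\n3\n4\n")

list_1, list_2 = [], []

# first loop: stops as soon as "middle" has been consumed
for line in f:
    word = line.strip()
    if word == "start":
        continue
    if word == "middle":
        break
    list_1.append(int(word))

# second loop: continues from the line right after "middle"
for line in f:
    if line.strip():  # ignore blank lines
        list_2.append(int(line))

print(list_1)  # [1, 2]
print(list_2)  # [3, 4]
```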
You can read the whole file once into a list and slice it afterwards.
If possible, you can try this:
with open("sample.txt","r") as f:
    list_1 = []
    list_2 = []
    fulllist = []
    for x in f.readlines():
        x = x.split("\n")[0]
        fulllist.append(x)
    print(fulllist)
    start_position = fulllist.index('start')
    middle_position = fulllist.index('middle')
    list_1 = fulllist[start_position+1 :middle_position]
    # the file has no 'end' keyword, so slice to the end of the list
    list_2 = fulllist[middle_position+1 :]
    print("list1 : ", list_1)
    print("list2 : ", list_2)
Discussion
Your problem is that you read the whole file at once, and when you
start the 2nd loop there's nothing to be read...
A possible solution involves reading the file line by line, tracking
the start and middle keywords and updating one of two lists
accordingly.
This implies that your script, during the loop, has to maintain info about
its current state, and for this purpose we are going to use a
variable, code, that's either 0, 1 or 2, meaning no action,
append to list no. 1 or append to list no. 2. Because in the beginning
we don't want to do anything, its initial value must be 0
code = 0
If we want to access one of the two lists using the value of code as
a switch, we could write a test or, in place of a test, we can use a
list of lists, lists, containing a dummy list and two lists that are
updated with valid numbers. Initially all these inner lists are equal
to the empty list []
l1, l2 = [], []
lists = [[], l1, l2]
so that later we can do as follows
lists[code].append(number)
With these premises, it's easy to write the body of the loop on the
file lines,
read a number
if it's not a number, look if it is a keyword
if it is a keyword, change state
in any case, no further processing
if we have to append, append to the correct list
    try:
        n = int(line)
    except ValueError:
        if line == 'start\n' : code=1
        if line == 'middle\n': code=2
        continue
    if code: lists[code].append(n)
We have just to add a bit of boilerplate, opening the file and
looping, that's all.
Below you can see my test data, the complete source code with all the
details and a test execution of the script.
Demo
$ cat start_middle.dat
1
2
3
start
5
6
7
middle
9
10
$ cat start_middle.py
l1, l2 = [], []
code, lists = 0, [[], l1, l2]
with open('start_middle.dat') as infile:
    for line in infile.readlines():
        try:
            n = int(line)
        except ValueError:
            if line == 'start\n' : code=1
            if line == 'middle\n': code=2
            continue
        if code: lists[code].append(n)
print(l1)
print(l2)
$ python start_middle.py
[5, 6, 7]
[9, 10]
$
I am attempting to read a txt file and create a dictionary from the text. A sample txt file is:
John likes Steak
John likes Soda
John likes Cake
Jane likes Soda
Jane likes Cake
Jim likes Steak
My desired output is a dictionary with the name as the key, and the "likes" as a list of the respective values:
{'John':('Steak', 'Soda', 'Cake'), 'Jane':('Soda', 'Cake'), 'Jim':('Steak')}
I continue to run into the error of adding my stripped word to my list and have tried a few different ways:
pred = ()
prey = ()
spacedLine = inf.readline()
line = spacedLine.rstrip('\n')
while line!= "":
    line = line.split()
    pred.append = (line[0])
    prey.append = (line[2])
    spacedLine = inf.readline()
    line = spacedLine.rstrip('\n')
and also:
spacedLine = inf.readline()
line = spacedLine.rstrip('\n')
while line!= "":
    line = line.split()
    if line[0] in chain:
        chain[line[0] = [0, line[2]]
    else:
        chain[line[0]] = line[2]
    spacedLine = inf.readline()
    line = spacedLine.rstrip('\n')
any ideas?
This will do it (without needing to read the entire file into memory first):
likes = {}
for who, _, what in (line.split()
                        for line in (line.strip()
                            for line in open('likes.txt', 'rt'))):
    likes.setdefault(who, []).append(what)
print(likes)
Output:
{'Jane': ['Soda', 'Cake'], 'John': ['Steak', 'Soda', 'Cake'], 'Jim': ['Steak']}
Alternatively, to simplify things slightly you could use a temporary collections.defaultdict:
from collections import defaultdict
likes = defaultdict(list)
for who, _, what in (line.split()
                        for line in (line.strip()
                            for line in open('likes.txt', 'rt'))):
    likes[who].append(what)
print(dict(likes))  # convert to plain dictionary and print
Your input is a sequence of sequences. Parse the outer sequence first, parse each item next.
Your outer sequence is:
Statement
<empty line>
Statement
<empty line>
...
Assume that f is the open file with the data. Read each statement and return a list of them:
def parseLines(f):
    result = []
    for line in f:  # file objects iterate over text lines
        # each line keeps its trailing '\n', so strip before testing
        if line.strip():  # line is non-empty
            result.append(line)
    return result
Note that the function above accepts a much wider grammar: it allows arbitrarily many empty lines between non-empty lines, and two non-empty lines in a row. But it does accept any correct input.
Then, your statement is a triple: X likes Y. Parse it by splitting it by whitespace, and checking the structure. The result is a correct pair of (x, y).
def parseStatement(s):
    parts = s.split()  # by default, it splits by all whitespace
    assert len(parts) == 3, "Syntax error: %r is not three words" % s
    x, likes, y = parts  # unpack the list of 3 items into variables
    assert likes == "likes", "Syntax error: %r instead of 'likes'" % likes
    return x, y
Make a list of pairs for each statement:
pairs = [parseStatement(s) for s in parseLines(f)]
Now you need to group values by key. Let's use defaultdict which supplies a default value for any new key:
from collections import defaultdict
the_answer = defaultdict(list) # the default value is an empty list
for key, value in pairs:
    the_answer[key].append(value)
    # we can append because the_answer[key] is set to an empty list on first access
So here the_answer is what you need, only it uses lists as dict values instead of tuples. This must be enough for you to understand your homework.
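If tuples are really wanted as the values, as in the question's desired output, the grouped lists can be converted afterwards. A minimal sketch, with the pairs written out by hand in place of the parsed file (so it does not depend on the functions above):

```python
from collections import defaultdict

# hand-written pairs standing in for the parsed statements
pairs = [
    ("John", "Steak"), ("John", "Soda"), ("John", "Cake"),
    ("Jane", "Soda"), ("Jane", "Cake"),
    ("Jim", "Steak"),
]

grouped = defaultdict(list)
for name, food in pairs:
    grouped[name].append(food)

# convert every list to a tuple once grouping is finished
as_tuples = {name: tuple(foods) for name, foods in grouped.items()}
print(as_tuples)
# {'John': ('Steak', 'Soda', 'Cake'), 'Jane': ('Soda', 'Cake'), 'Jim': ('Steak',)}
```

Note the trailing comma in ('Steak',): a one-element tuple needs it, while ('Steak') as written in the question is just the string 'Steak'.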
dic={}
for i in f.readlines():
    # lines keep their trailing '\n', so strip before testing for emptiness
    if i.strip():
        if i.split()[0] in dic.keys():
            dic[i.split()[0]].append(i.split()[2])
        else:
            dic[i.split()[0]]=[i.split()[2]]
print(dic)
This should do it.
Here we iterate through f.readlines(), f being the file object, and for each line we fill the dictionary, using the first part of the split as the key and the last part of the split as the value.