Turning text file into dictionary when the same keys appear multiple times - python

I have a text file that looks like this:
tomato 7000
potato and pear 8000
prunes 892
tomato 8
carrot 600
prunes 3
To turn it into a dictionary that ignores the lines where there are more words (which is what I want, so potato and pear are ignored, which is fine), I wrote:
with open("C:\\path\\food.txt", encoding="utf-8") as f_skipped:
    result = {}
    for line in f_skipped:
        try:
            k, v = line.split()
        except ValueError:
            pass
        else:
            result[k] = v
But since there can't be duplicate keys, it takes the value that appears later, so tomato and prunes have values 8 and 3, respectively. Is there any way of taking only the first appearance and ignoring the later ones?
I thought of keeping my code and just turning the text around (sounds a bit silly) or detecting whether there are duplicate words (the latter is a bit risky since there are lots of rows with many words that I simply wanna ignore anyway).

Try this: the dictionary's .get(key) method returns None if the key doesn't exist and otherwise returns the value for the key, so you can use it in an if condition.
From reading your question, I hope this is what you want.
filename = "text.txt"
with open(filename, encoding="utf-8") as f_skipped:
    result = {}
    for line in f_skipped:
        try:
            k, v = line.split()
        except ValueError:
            pass
        else:
            if result.get(k) is None:
                result[k] = v
print(result)
Output
py code.py
{'tomato': '7000', 'prunes': '892', 'carrot': '600'}
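Another way to keep only the first occurrence is dict.setdefault, which stores a value only when the key is not already present. The following is a minimal sketch, assuming the same two-token line layout as in the question; io.StringIO stands in for the real file, and the names sample and first_seen are made up for illustration.

```python
import io

# Hypothetical stand-in for food.txt, using the sample from the question.
sample = io.StringIO(
    "tomato 7000\n"
    "potato and pear 8000\n"
    "prunes 892\n"
    "tomato 8\n"
    "carrot 600\n"
    "prunes 3\n"
)

first_seen = {}
for line in sample:
    parts = line.split()
    if len(parts) != 2:
        continue  # skip lines that are not exactly "key value"
    k, v = parts
    # setdefault stores v only when k is not in the dict yet,
    # so later duplicates are ignored.
    first_seen.setdefault(k, v)

print(first_seen)
```

Unlike the `result.get(k) is None` check, this does not depend on which values are stored, only on whether the key has been seen.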

Try this:-
with open('food.txt') as food:
    D = {}
    for line in food:
        t = line.rsplit(' ', 1)
        k = t[0]
        if k not in D:
            D[k] = t[1].split()
print(D)

Related

Reading a Tuple Assignment (e.g.written as such d1: p, m, h, = 20, 15, 22) from a Text File and Performing Calculations with Each Variable (e.g. p*h)

I'm reading a text file with several hundred lines of data in python. The text file contains data written as a tuple assignment. For example, the data looks exactly like this in the text file:
d1: p,h,t,m= 74.15 18 6 0.1 ign: 0.0003
d2: p,h,t,m= 54. 378 -0.14 0.1 ign: 0.0009
How can I separate the data as such:
p = 20
t = 15
etc.
Then, how can I perform calculations on the tuple assignment? For example calculate:
p*p = 20*15?
I am not sure if I should convert the tuple assignment to an array. But I was not successful. In addition, I do not know how to get rid of the d1 and d2: which is there to identify which data set I am looking at
I have read the data and picked out the lines that have the data, (ignoring the First Set line and of Data Given as line)
The results that I need would be:
p (from first set of data d1)*p(from first set of data d2) = 20*15 = 300
p (from second set of data d1)*p(from second set of data d2) = 12*5 = 60
I believe I would need to do this over some kind of loop so that I can separate the data in all the lines in the file.
I would appreciate any help on this! I couldn't find anything pertaining to my question. I would only find how to deal with tuples in the simplest manner but nothing on how to extract variables and performing calculations on a tuple assignment contained in a text file.
EDIT:
After looking at the answer given for this question by @JArunMani, I went back to try to see if I can understand each line of code. I understand that we need to create a dictionary that fills in the respective values for p, q, etc...
When I try to rewrite the code to how I understand it, I have:
with open("d.txt") as fp: # Opens the file
    # The database kinda thing here
    line = fp.readline() # Read the file's first line
    number, _,cont = line.partition(":")#separates m1 from p, m, h, n =..."
    print(cont)
    data, _,ignore = cont.partition("int") #separates int from p, m, h, n =..."
    print(data) #prints tuple assignment needed
    keys, _,values = data.partition("=")
    print(keys) #prints p, m, h, n
    print(values) #prints values (all numbers after =)
    thisdict = {} #creating an empty dictionary to fill with keys and values
    thisdict[keys] = values
    print(thisdict)
    if "m" in thisdict:
        print("Yes")
print(thisdict) gives me the Output: {' p,m,h,n': ' 76 6818 2.2 1 '}
However, if "m" in thisdict: did not print anything. I do not understand why m is not in the dictionary, yet print(thisdict) shows that thisdict = {} has been filled. Also, is it necessary to add the for loop in the answer given below?
Thank you.
EDIT 2
I am now making my second attempt at this problem. I am combining both answers to write the code, using what I understand from each:
def DataExtract(self):
    with open("muonsdata.txt") as fp: # Opens the file
        line = fp.readline() # Read the file's first line
        number, _,cont = line.partition(":")#separates m1 from pt, eta, phi, m =..."
        print(cont)
        data, _,ignore = cont.partition("dptinv") #separates dptinv from pt, eta, phi, m =..."
        print(data) #prints tuple assignment needed
        keys, _,values = data.partition("=")
        print(keys) #prints pt, eta, phi, m
        print(values) #prints values (all numbers after =)
        key = [k for k in keys.split(",")]
        value = [v for v in values.strip().split(" ")]
        print(key)
        print(value)
        thisdict = {}
        data = {}
        for k, v in zip(key, value): #creating an empty dictionary to fill with keys and values
            thisdict[k] = v
            print(thisdict)
        if "m" in thisdict:
            print("Yes")
x = DataExtract("C:/Users/username/Desktop/data.txt")
mul_p = x['m1']['p'] * x['d2']['p']
print(mul_p)
However, this gives me the error: Traceback (most recent call last):
File "read.py", line 29, in
mul_p = x['d1']['p'] * x['d2']['p']
TypeError: 'NoneType' object is not subscriptable
EDIT 3
I now have code made from a combination of answers 1 and 2, BUT...
The code is written and runs, but why doesn't the while loop continue until the end of the file? I only get one answer, calculated from the first two lines, but what about the remaining lines? Also, it seems that it is not reading the d2 data lines (or that line = fp.readline() is not doing anything), because when I try to calculate m, I get the error Traceback (most recent call last):
File "read.py", line 37, in
m = math.cosh(float(data[" m2"]["eta"])) * float(data["m1"][" pt"])
KeyError: ' m2'
Here is my code that I have:
import math
with open("d.txt") as fp: # Opens the file
    data ={} #final dictionary
    line = fp.readline() # Read the file's first line
    while line: #continues to end of file
        name, _,cont = line.partition(":")#separates d1 from p, m, h, t =..."
        #print(cont)
        numbers, _,ignore = cont.partition("ign") #separates ign from p, m, h, t =..."
        #print(numbers) #prints tuple assignment needed
        keys, _,values = numbers.partition("=")
        #print(keys) #prints p, m, h, t
        #print(values) #prints values (all numbers after =)
        key = [k for k in keys.split(",")]
        value = [v for v in values.strip().split(" ")]
        #print(key) #prints pt, eta, phi, m
        #print(value)
        thisdict = {}
        for k, v in zip(key, value): #creating an empty dictionary to fill with keys and values
            #thisdict[k] = v
            #print(thisdict)
            #data[name]=thisdict
            line = fp.readline()#read next lines, not working I think
            thisdict[k] = v
            data[name]=thisdict
            print(thisdict)
        #if " m2" in thisdict:
        #print("Yes")
        #print(data)
        #mul_p = float(data["d1"][" p"])*float(data["d1"]["m"])
        m = math.cosh(float(data[" d2"]["m"])) * float(data["m1"][" p"])
        #m1 = float(data["d1"][" p"]) * float(2)
        print(m)
        #print(mul_p)
If I replace the d2's with d1 the code runs fine, except it skips the last d1. I do not know what I am doing wrong. Would appreciate any input or guidance.
So the following function returns a dictionary with values of 'p', 'q' and other variables. But I leave it to you to find out how to multiply or perform operations on them ^^
def DataExtract(path): # 'path' is the path to the data file
    fp = open(path) # Opens the file
    data = {} # The database kinda thing here
    line = fp.readline() # Read the file's first line
    while line: # This goes on till we reach end of file (EOF)
        name, _, cont = line.partition(":") # So this gives, 'd1', ':', 'p, q, ...'
        keys, _, values = cont.partition("=") # Now we split the text into RHS and LHS
        keys = keys.split(",") # Split the variables by ',' as separator
        values = values.split(",") # Split the values
        temp_d = {} # Dict for variables
        for i in range(len(keys)):
            key = keys[i].strip() # Get the item at the index and remove left-right spaces
            val = values[i].strip() # Same
            temp_d[key] = float(val) # Store it in dictionary but as number
        data[name.strip()] = temp_d # Store the temp_d itself in main dict
        line = fp.readline() # Now read next line
    fp.close() # Close the file
    return data # Return the data
I used simple methods, to make it easy for you. Now to access data, you have to do something like this:
x = DataExtract("your_file_path")
mul_p = x['d1']['p'] * x['d2']['p']
print(mul_p) # Tadaaa !
Feel free to comment...
This answer is quite similar to @JArunMani's, but it's a bit shorter and should run successfully.
The idea is to return your data as a dictionary.
lines = "d1: p,h,t,m= 74.15 18 6 0.1 ign: 0.0003\nd2: p,h,t,m= 54. 378 -0.14 0.1 ign: 0.0009".split("\n") # lines=open("d.txt",'r').read().split("\n")
data = {}
for line in lines:
    l = line.split("ign")[0] # remove "ign:.."
    name_dict, vals_dict = l.split(":") #['d1',' p,h,t,m= 74.15 18 6 0.1']
    keys_str, values_str = vals_dict.split("=") #[' p,h,t,m',' 74.15 18 6 0.1']
    keys=[k for k in keys_str.strip().split(',')] #['p','h','t','m']
    values=[float(v) for v in values_str.strip().split(' ')] #[74.15, 18, 6, 0.1]
    sub_dict = {}
    for k,v in zip(keys, values):
        sub_dict[k]=v
    data[name_dict]=sub_dict
Result:
>>>data
{'d1': {'p': 74.15, 'h': 18.0, 't': 6.0, 'm': 0.1}, 'd2': {'p': 54.0, 'h': 378.0, 't': -0.14, 'm': 0.1}}
>>>data['d1']['p']*data['d2']['p']
4004.1000000000004
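Since the file has several hundred lines and the names d1 and d2 repeat, storing everything under data['d1'] and data['d2'] would keep only the last pair. The sketch below is one possible way around that, assuming the lines strictly alternate d1, d2, d1, d2 and so on. The helper parse_line, the list names, and the sample lines are made up for illustration; the p values were chosen to reproduce the 20*15 and 12*5 example from the question.

```python
def parse_line(line):
    """Split e.g. 'd1: p,h,t,m= 74.15 18 6 0.1 ign: 0.0003' into ('d1', {...})."""
    name, _, rest = line.partition(":")
    assignment, _, _ = rest.partition("ign")  # drop the trailing 'ign: ...' part
    keys_str, _, values_str = assignment.partition("=")
    keys = [k.strip() for k in keys_str.split(",")]  # strip so ' p' becomes 'p'
    values = [float(v) for v in values_str.split()]  # split() copes with repeated spaces
    return name.strip(), dict(zip(keys, values))

# Made-up lines in the question's layout (hypothetical data).
lines = [
    "d1: p,h,t,m= 20 18 6 0.1 ign: 0.0003",
    "d2: p,h,t,m= 15 378 -0.14 0.1 ign: 0.0009",
    "d1: p,h,t,m= 12 7 3 0.1 ign: 0.0002",
    "d2: p,h,t,m= 5 9 1 0.1 ign: 0.0001",
]

records = [parse_line(line) for line in lines]

products = []
# Assumption: even positions hold d1 records, odd positions hold d2 records.
for (_, first), (_, second) in zip(records[0::2], records[1::2]):
    products.append(first["p"] * second["p"])

print(products)
```

Stripping the names and keys also avoids lookups failing on keys with a leading space, such as ' m2' or ' p'.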

python: recursive dictionary of dictionary

I need help with a pretty simple exercise I am trying to execute; I'm just a bit lost syntactically.
Basically, I read in a very brief text file containing 15 lines of 3 elements each (essentially 2 keys and a value)
and put those elements into a dictionary composed of dictionaries.
The outer dictionary is keyed by location, and each inner dictionary maps the type of the item to how much it costs. For example,
gymnasium weights 15
market cereal 5
gymnasium shoes 50
saloon beer 3
saloon whiskey 10
market bread 5
which would result in this
{
    'gymnasium': {
        'weights': 15,
        'shoes': 50
    },
    'saloon': {
        'beer': 3,
        'whiskey': 10
    }
}
and so on for the other keys
Basically, I need to loop through this file, but I'm struggling to read in the contents as a dict of dicts.
Moreover, without that portion I can't figure out how to add to the inner dictionary when its key already occurs in the outer dictionary.
I would like to do this recursively.
location_dict = {} #row #name day weight temp
item_dict = {}
for line in file:
    line = line.strip()
    location_dict[item_dict['location']] = item_dict
this is a good use for setdefault (or defaultdict)
data = {}
for line in file:
    key1,key2,value = line.split()
    data.setdefault(key1,{})[key2] = value
print(data)
or based on your comment
from collections import defaultdict
data = defaultdict(lambda:defaultdict(int))
for line in file:
    key1,key2,value = line.split()
    data[key1][key2] += int(value) # value is read as a string, so convert before adding
print(data)
Here is another solution.
yourFile = open("yourFile.txt", "r")
yourText = yourFile.read()
yourFile.close()
textLines = yourText.split("\n")
locationDict = {}
for line in textLines:
    if not line.strip(): # skip blank lines, e.g. after a trailing newline
        continue
    k1, k2, v = line.split(" ")
    if k1 not in locationDict:
        locationDict[k1] = {}
    # no else here: the first item of a new location must be stored too
    if k2 not in locationDict[k1]:
        locationDict[k1][k2] = int(v)
    else:
        locationDict[k1][k2] += int(v)
print(locationDict)
Hope it helps!
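For reference, here is a small self-contained sketch of the dict-of-dicts approach run on the sample lines from the question. No recursion is needed, since the nesting is only two levels deep and a single loop is enough. io.StringIO stands in for the real file, and the names sample and prices are made up for illustration.

```python
import io
from collections import defaultdict

# Hypothetical stand-in for the input file, using the question's sample lines.
sample = io.StringIO(
    "gymnasium weights 15\n"
    "market cereal 5\n"
    "gymnasium shoes 50\n"
    "saloon beer 3\n"
    "saloon whiskey 10\n"
    "market bread 5\n"
)

# defaultdict(dict) creates the inner dictionary the first time a location is seen.
prices = defaultdict(dict)
for line in sample:
    fields = line.split()
    if len(fields) != 3:
        continue  # ignore blank or malformed lines
    location, item, cost = fields
    prices[location][item] = int(cost)

print(dict(prices))
```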

Calculating means of values for subgroups of keys in python dictionary

I have a dictionary which looks like this:
cq={'A1_B2M_01':2.04, 'A2_B2M_01':2.58, 'A3_B2M_01':2.80, 'B1_B2M_02':5.00,
'B2_B2M_02':4.30, 'B2_B2M_02':2.40 etc.}
I need to calculate mean of triplets, where the keys[2:] agree. So, I would ideally like to get another dictionary which will be:
new={'_B2M_01': 2.47, '_B2M_02': 3.9}
The data is/should be in triplets so in theory I could just get the means of the consecutive values, but first of all, I have it in a dictionary so the keys/values will likely get reordered, besides I'd rather stick to the names, as a quality check for the triplets assigned to names (I will later add a bit showing error message when there will be more than three per group).
I've tried creating a dictionary where the keys would be _B2M_01 and _B2M_02 and then loop through the original dictionary to first append all the values that are assigned to these groups of keys so I could later calculate an average, but I am getting errors even in the first step and anyway, I am not sure if this is the most effective way to do this...
cq={'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}
trips=set([x[2:] for x in cq.keys()])
new={}
for each in trips:
    for k,v in cq.iteritems():
        if k[2:]==each:
            new[each].append(v)
Traceback (most recent call last):
File "<pyshell#28>", line 4, in <module>
new[each].append(v)
KeyError: '_B2M_01'
I would be very grateful for any suggestions. It seems like a fairly easy operation but I got stuck.
An alternative result which would be even better would be to get a dictionary which contains all the names used as in cq, but with values being the means of the group. So the end result would be:
final={'A1_B2M_01':2.47, 'A2_B2M_01':2.47, 'A3_B2M_01':2.47, 'B1_B2M_02':3.9,
'B2_B2M_02':3.9, 'B2_B2M_02':3.9}
Something like this should work. You can probably make it a little more elegant.
cq = {'A1_B2M_01':2.04, 'A2_B2M_01':2.58, 'A3_B2M_01':2.80, 'B1_B2M_02':5.00, 'B2_B2M_02':4.30, 'B3_B2M_02':2.40 }
total = {} # named total rather than sum to avoid shadowing the built-in
count = {}
mean = {}
for k in cq:
    if k[2:] in total:
        total[k[2:]] += cq[k]
        count[k[2:]] += 1
    else:
        total[k[2:]] = cq[k]
        count[k[2:]] = 1
for k in total:
    mean[k] = total[k] / count[k]
cq={'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}
sums = dict()
for k, v in cq.items():
    _, p2 = k.split('_', 1)
    if p2 not in sums:
        sums[p2] = [0, 0]
    sums[p2][0] += v
    sums[p2][1] += 1
res = {}
for k, v in sums.items():
    res[k] = v[0]/float(v[1])
print(res)
also could be done with one iteration
Grouping:
SEPARATOR = '_'
cq={'A1_B2M_01':2.4, 'A2_B2M_01':5, 'A3_B2M_01':4, 'B1_B2M_02':3, 'B2_B2M_02':7, 'B3_B2M_02':6}
groups = {}
for key in cq:
    group_key = SEPARATOR.join(key.split(SEPARATOR)[1:])
    if group_key in groups:
        groups[group_key].append(cq[key])
    else:
        groups[group_key] = [cq[key]]
Generate means:
def means(groups):
    for group, group_vals in groups.items():
        yield (group, float(sum(group_vals)) / len(group_vals),)
print(list(means(groups)))
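The alternative result from the question, where every original key maps to the mean of its group, can be built from the grouped values in one more step. This is a sketch assuming Python 3.4 or later (for the statistics module) and the question's second sample dictionary; the names groups, group_means and final are illustrative.

```python
from collections import defaultdict
from statistics import mean

cq = {'A1_B2M_01': 2.4, 'A2_B2M_01': 5, 'A3_B2M_01': 4,
      'B1_B2M_02': 3, 'B2_B2M_02': 7, 'B3_B2M_02': 6}

# Collect the values of each group, keyed by everything after the first two characters.
groups = defaultdict(list)
for name, value in cq.items():
    groups[name[2:]].append(value)

# One mean per group, e.g. '_B2M_01' -> mean of its three values.
group_means = {suffix: mean(values) for suffix, values in groups.items()}

# Map every original name to the mean of the group it belongs to.
final = {name: group_means[name[2:]] for name in cq}

print(group_means)
print(final)
```

A check such as `len(values) != 3` inside the first loop's result would be a natural place for the triplet quality check mentioned in the question.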

Do dictionaries keep track of the point in time where an item was assigned?

I was coding a High Scores system where the user would enter a name and a score then the program would test if the score was greater than the lowest score in high_scores. If it was, the score would be written and the lowest score, deleted. Everything was working just fine, but I noticed something. The high_scores.txt file was like this:
PL1 50
PL2 50
PL3 50
PL4 50
PL5 50
PL1 was the first score added, PL2 was the second, PL3 the third and so on. Then I tried adding another score, higher than all the others (PL6 60) and what happened was that the program assigned PL1 as the lowest score. PL6 was added and PL1 was deleted. That was exactly the behavior I wanted but I don't understand how it happened. Do dictionaries keep track of the point in time where an item was assigned? Here's the code:
MAX_NUM_SCORES = 5
def getHighScores(scores_file):
    """Read scores from a file into a list."""
    try:
        cache_file = open(scores_file, 'r')
    except (IOError, EOFError):
        print("File is empty or does not exist.")
        return []
    else:
        lines = cache_file.readlines()
        high_scores = {}
        for line in lines:
            if len(high_scores) < MAX_NUM_SCORES:
                name, score = line.split()
                high_scores[name] = int(score)
            else:
                break
        return high_scores
def writeScore(file_, name, new_score):
    """Write score to a file."""
    if len(name) > 3:
        name = name[0:3]
    high_scores = getHighScores(file_)
    if high_scores:
        lowest_score = min(high_scores, key=high_scores.get)
        if new_score > high_scores[lowest_score] or len(high_scores) < 5:
            if len(high_scores) == 5:
                del high_scores[lowest_score]
            high_scores[name.upper()] = int(new_score)
        else:
            return 0
    else:
        high_scores[name.upper()] = int(new_score)
    write_file = open(file_, 'w')
    while high_scores:
        highest_key = max(high_scores, key=high_scores.get)
        line = highest_key + ' ' + str(high_scores[highest_key]) + '\n'
        write_file.write(line)
        del high_scores[highest_key]
    return 1
def displayScores(file_):
    """Display scores from file."""
    high_scores = getHighScores(file_)
    print("HIGH SCORES")
    if high_scores:
        while high_scores:
            highest_key = max(high_scores, key=high_scores.get)
            print(highest_key, high_scores[highest_key])
            del high_scores[highest_key]
    else:
        print("No scores yet.")
def resetScores(file_):
    open(file_, "w").close()
No. The results you got were due to arbitrary choices internal to the dict implementation that you cannot depend on always happening. (There is a subclass of dict that does keep track of insertion order, though: collections.OrderedDict.) I believe that with the current implementation, if you switch the order of the PL1 and PL2 lines, PL1 will probably still be deleted.
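A note for newer interpreters: since Python 3.7 the language guarantees that a plain dict iterates in insertion order (it still records order, not time), and min returns the first minimal item it encounters. Under those two conditions a tie is resolved in favour of the earliest inserted key, which is the behavior seen in the question. A small sketch with the question's scores:

```python
# Build the dictionary in the same order as the lines in high_scores.txt.
high_scores = {}
for name in ("PL1", "PL2", "PL3", "PL4", "PL5"):
    high_scores[name] = 50

# All values tie, so min() keeps the first key it sees while iterating.
lowest = min(high_scores, key=high_scores.get)
print(lowest)
```

On older versions, where iteration order is arbitrary, the same code may pick a different key.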
As others noted, the order of items in the dictionary is "up to the implementation".
This answer is more a comment to your question, "how min() decides what score is the lowest?", but is much too long and format-y for a comment. :-)
The interesting thing is that both max and min can be used this way. The reason is that they (can) work on "iterables", and dictionaries are iterable:
for i in some_dict:
loops i over all the keys in the dictionary. In your case, the keys are the user names. Further, min and max allow passing a key argument to turn each candidate in the iterable into a value suitable for a binary comparison. Thus, min is pretty much equivalent to the following python code, which includes some tracing to show exactly how this works:
def like_min(iterable, key=None):
    it = iter(iterable)
    result = it.next()
    if key is None:
        min_val = result
    else:
        min_val = key(result)
    print '** initially, result is', result, 'with min_val =', min_val
    for candidate in it:
        if key is None:
            cmp_val = candidate
        else:
            cmp_val = key(candidate)
        print '** new candidate:', candidate, 'with val =', cmp_val
        if cmp_val < min_val:
            print '** taking new candidate'
            result = candidate
            min_val = cmp_val # remember the new minimum for later comparisons
    return result
If we run the above on a sample dictionary d, using d.get as our key:
d = {'p': 0, 'ayyy': 3, 'b': 5, 'elephant': -17}
m = like_min(d, key=d.get)
print 'like_min:', m
** initially, result is ayyy with min_val = 3
** new candidate: p with val = 0
** taking new candidate
** new candidate: b with val = 5
** new candidate: elephant with val = -17
** taking new candidate
like_min: elephant
we find that we get the key whose value is the smallest. Of course, if multiple values are equal, the choice of "smallest" depends on the dictionary iteration order (and also whether min actually uses < or <= internally).
(Also, the method you use to "sort" the high scores to print them out is O(n^2): pick highest value, remove it from dictionary, repeat until empty. This traverses n items, then n-1, ... then 2, then 1 => n+(n-1)+...+2+1 steps = n(n+1)/2 = O(n^2). Deleting the high one is also an expensive operation, although it should still come in at or under O(n^2), I think. With n=5 this is not that bad (5 * 6 / 2 = 15), but ... not elegant. :-) )
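One way to avoid the repeated max-and-delete passes is to sort the items once with the built-in sorted, which takes O(n log n) comparisons. The sketch below uses made-up scores and an illustrative variable name, ranking.

```python
# Hypothetical scores, not taken from the question's file.
high_scores = {"PL1": 50, "PL2": 70, "PL3": 60}

# sorted() builds a new list of (name, score) pairs, highest score first;
# the dictionary itself is left untouched.
ranking = sorted(high_scores.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(name, score)
```

The same list could be written to the file line by line instead of printed.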
This is pretty much what http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/ is about.
Short version: Get the treap module, which works like a sorted dictionary, and keep the keys in order. Or use the nest module to get the n greatest (or least) values automatically.
collections.OrderedDict is good for preserving insertion order, but not key order.
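If installing a third-party module is not an option, the standard library's heapq.nlargest can also pick the n greatest entries. This is a sketch with made-up scores, not a replacement for the modules mentioned above; the names scores and top_five are illustrative.

```python
import heapq

# Hypothetical scores: six players competing for five slots.
scores = {"PL1": 50, "PL2": 70, "PL3": 60, "PL4": 40, "PL5": 90, "PL6": 80}

# nlargest returns the five (name, score) pairs with the highest scores,
# already ordered from highest to lowest.
top_five = heapq.nlargest(5, scores.items(), key=lambda item: item[1])

print(top_five)
```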

What is the most efficient way to match list items to lines in a large file in Python?

I have a large file (5Gb) called my_file. I have a list called my_list. What is the most efficient way to read each line in the file and, if an item from my_list matches an item from a line in my_file, create a new list called matches that contains items from the lines in my_file AND items from my_list where a match occurred. Here is what I am trying to do:
def calc(my_file, my_list):
    matches = []
    my_file.seek(0,0)
    for i in my_file:
        i = list(i.rstrip('\n').split('\t'))
        for v in my_list:
            if v[1] == i[2]:
                item = v[0], i[1], i[3]
                matches.append(item)
    return matches
here are some lines in my_file:
lion 4 blue ch3
sheep 1 red pq2
frog 9 green xd7
donkey 2 aqua zr8
here are some items in my_list
intel yellow
amd green
msi aqua
The desired output, a list of lists, in the above example would be:
[['amd', 9, 'xd7'], ['msi', 2, 'zr8']]
My code currently works, albeit really slowly. Would using a generator or serialization help? Thanks.
You could build a dictionary for looking up v. I added a few further small optimizations:
def calc(my_file, my_list):
    vd = dict( (v[1],v[0]) for v in my_list)
    my_file.seek(0,0)
    for line in my_file:
        f0, f1, f2, f3 = line[:-1].split('\t')
        v0 = vd.get(f2)
        if v0 is not None:
            yield (v0, f1, f3)
This should be much faster for a large my_list.
Using get is faster than checking if i[2] is in vd + accessing vd[i[2]]
For getting more speedup beyond these optimizations I recommend http://www.cython.org
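To make the dictionary-lookup idea concrete, here is a self-contained sketch run on the sample data from the question. It assumes tab-separated lines with exactly four fields; io.StringIO stands in for the 5 GB file, and the function name calc_lookup and the variable names are made up for illustration.

```python
import io

def calc_lookup(lines, pairs):
    """Yield (label, field1, field3) for lines whose third field is a known key."""
    # Build the lookup once: O(1) average per line instead of scanning the whole list.
    lookup = {key: label for label, key in pairs}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        _, f1, f2, f3 = fields
        label = lookup.get(f2)
        if label is not None:
            yield (label, f1, f3)

# Hypothetical stand-ins for my_file and my_list.
my_file = io.StringIO(
    "lion\t4\tblue\tch3\n"
    "sheep\t1\tred\tpq2\n"
    "frog\t9\tgreen\txd7\n"
    "donkey\t2\taqua\tzr8\n"
)
my_list = [("intel", "yellow"), ("amd", "green"), ("msi", "aqua")]

matches = list(calc_lookup(my_file, my_list))
print(matches)
```

Because the function is a generator, the matches can also be consumed one at a time rather than collected into a list.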
Keep the items in a dictionary rather than a list (let's call it items). Now iterate through your file as you're doing, pick out the key to look for (i[2]), and then check whether it's in items.
items would be.
dict (yellow = "intel", green = "amd", aqua = "msi")
So the checking part would be.
if i[2] in items:
    yield [items[i[2]], i[1], i[3]]
Since you're just creating the list and returning it, using a generator might help memory characteristics of the program rather than putting the whole thing into a list and returning it.
There isn't much you can do with the overheads of reading the file in, but based on your example code, you can speed up the matching by storing your list as a dict (with the target field as the key).
Here's an example, with a few extra optimisation tweaks:
mylist = {
    "yellow" : "intel",
    "green" : "amd",
    # ....
}
matches = []
for line in my_file:
    i = line[:-1].split("\t")
    try: # faster to ask for forgiveness than permission
        matches.append([mylist[i[2]], i[1], i[3]])
    except KeyError: # a missing dictionary key raises KeyError, not NameError
        pass
But again, do note that most of your performance bottleneck will be in the reading of the file and optimisation at this level may not have a big enough impact on the runtime.
Here's a variation on @rocksportrocker's answer using csv module:
import csv
def calc_csv(lines, lst):
    d = dict((v[1], v[0]) for v in lst) # use dict to speed up membership test
    return ((d[f2], f1, f3)
            for _, f1, f2, f3 in csv.reader(lines, dialect='excel-tab')
            if f2 in d) # assume that intersection is much less than the file
Example:
def test():
    # fields are tab-separated; written as \t so the tabs stay visible
    my_file = """\
lion\t4\tblue\tch3
sheep\t1\tred\tpq2
frog\t9\tgreen\txd7
donkey\t2\taqua\tzr8
""".splitlines()
    my_list = [
        ("intel", "yellow"),
        ("amd", "green"),
        ("msi", "aqua"),
    ]
    res = list(calc_csv(my_file, my_list))
    assert [('amd', '9', 'xd7'), ('msi', '2', 'zr8')] == res
if __name__=="__main__":
    test()
