Python: Read text file into dict and ignore comments

I am trying to put the following text file into a dictionary but I would like any section starting with '#' or empty lines ignored.
My text file looks something like this:
# This is my header info followed by an empty line
Apples 1 # I want to ignore this comment
Oranges 3 # I want to ignore this comment
#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*
Bananas 5 # I want to ignore this comment too!
My desired output would be:
myVariables = {'Apples': 1, 'Oranges': 3, 'Bananas': 5}
My Python code reads as follows:
filename = "myFile.txt"
myVariables = {}
with open(filename) as f:
    for line in f:
        if line.startswith('#') or not line:
            next(f)
        key, val = line.split()
        myVariables[key] = val
        print "key: " + str(key) + " and value: " + str(val)
The error I get:
Traceback (most recent call last):
  File "C:/Python27/test_1.py", line 11, in <module>
    key, val = line.split()
ValueError: need more than 1 value to unpack
I understand the error but I do not understand what is wrong with the code.
Thank you in advance!

Given your text:
text = """
# This is my header info followed by an empty line
Apples 1 # I want to ignore this comment
Oranges 3 # I want to ignore this comment
#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*
Bananas 5 # I want to ignore this comment too!
"""
We can do this in two ways: using regex, or using Python generators. I would choose the latter (described below), as regex is not particularly fast in such cases.
To open the file:
with open('file_name.xyz', 'r') as file:
    # everything else below; just substitute `for line in lines`
    # with `for line in file`
Now, to create something similar here, we split the text into a list of lines:
lines = text.split('\n')  # as if read from a file using `open`.
Here is how we do all you want in a couple of lines:
# Discard all comments and empty lines.
comment_less = filter(None, (line.split('#')[0].strip() for line in lines))
# Separate items and totals.
separated = {item.split()[0]: int(item.split()[1]) for item in comment_less}
Let's test:
>>> print(separated)
{'Apples': 1, 'Oranges': 3, 'Bananas': 5}
Hope this helps.
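For completeness, the regex route mentioned above might look something like this (a sketch; the pattern assumes each data line is one word followed by an integer, which holds for the sample text):

```python
import re

text = """
# This is my header info followed by an empty line
Apples 1 # I want to ignore this comment
Oranges 3 # I want to ignore this comment
#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*
Bananas 5 # I want to ignore this comment too!
"""

# Match a word and an integer at the start of each line; comment lines
# start with '#', which \w does not match, so they are skipped.
pattern = re.compile(r'^(\w+)\s+(\d+)', re.MULTILINE)
separated = {name: int(count) for name, count in pattern.findall(text)}
print(separated)  # {'Apples': 1, 'Oranges': 3, 'Bananas': 5}
```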

This doesn't exactly reproduce your error, but there's a problem with your code:
>>> x = "Apples\t1\t# This is a comment"
>>> x.split()
['Apples', '1', '#', 'This', 'is', 'a', 'comment']
>>> key, val = x.split()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack
Instead try:
key = line.split()[0]
val = line.split()[1]
Edit: I think your "need more than 1 value to unpack" is coming from the blank lines. Also, I'm not familiar with using next() like this; I would do something like:
if line.startswith('#') or line == "\n":
    pass
else:
    key = line.split()[0]
    val = line.split()[1]
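Putting that together, a corrected version of the original loop might look like this (a sketch; it recreates the sample file first so the snippet is self-contained, and it also strips inline comments with split('#'), which the original code did not attempt):

```python
# Recreate the sample file from the question.
sample = (
    "# This is my header info followed by an empty line\n"
    "\n"
    "Apples 1 # I want to ignore this comment\n"
    "Oranges 3 # I want to ignore this comment\n"
    "#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*\n"
    "Bananas 5 # I want to ignore this comment too!\n"
)
with open("myFile.txt", "w") as f:
    f.write(sample)

myVariables = {}
with open("myFile.txt") as f:
    for line in f:
        # Drop any inline comment, then surrounding whitespace.
        line = line.split('#')[0].strip()
        if not line:
            continue  # skip blank lines and comment-only lines
        key, val = line.split()
        myVariables[key] = int(val)

print(myVariables)  # {'Apples': 1, 'Oranges': 3, 'Bananas': 5}
```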

To strip comments, you could use str.partition() which works whether the comment sign is present or not in the line:
for line in file:
    line, _, comment = line.partition('#')
    if line.strip():  # non-blank line
        key, value = line.split()
line.split() may raise an exception in this code too; it happens if there is a non-blank line that does not contain exactly two whitespace-separated words. What you want to do in that case (ignore such lines, print a warning, etc.) is application dependent.
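For example, one way to skip malformed lines with a warning (a sketch, building on the partition() approach; the file is simulated here with io.StringIO, including a deliberately malformed line):

```python
import io

# Hypothetical file contents, including one malformed line.
file = io.StringIO(
    "Apples 1 # comment\n"
    "this line is malformed\n"
    "Bananas 5\n"
)

d = {}
for line in file:
    line, _, comment = line.partition('#')
    if not line.strip():  # blank or comment-only line
        continue
    try:
        key, value = line.split()
    except ValueError:
        print("skipping malformed line:", line.strip())
        continue
    d[key] = value

print(d)  # {'Apples': '1', 'Bananas': '5'}
```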

You need to ignore empty lines and lines starting with #, splitting the remaining lines after either splitting on # or using rfind as below to slice the string. An empty line still contains a newline, so you need `and line.strip()` to check for one; you cannot just split on whitespace and unpack, as you get more than two elements after splitting, including what is in the comment:
with open("in.txt") as f:
    d = dict(line[:line.rfind("#")].split() for line in f
             if not line.startswith("#") and line.strip())
print(d)
Output:
{'Apples': '1', 'Oranges': '3', 'Bananas': '5'}
Another option is to split twice and slice:
with open("in.txt") as f:
    d = dict(line.split(None, 2)[:2] for line in f
             if not line.startswith("#") and line.strip())
print(d)
Or splitting twice and unpacking using an explicit loop:
with open("in.txt") as f:
    d = {}
    for line in f:
        if not line.startswith("#") and line.strip():
            k, v, _ = line.split(None, 2)
            d[k] = v
You can also use itertools.groupby to group the lines you want.
from itertools import groupby

with open("in.txt") as f:
    grouped = groupby(f, lambda x: not x.startswith("#") and x.strip())
    d = dict(next(v).split(None, 2)[:2] for k, v in grouped if k)
print(d)
To handle cases where we have multiple words in single quotes, we can use shlex to split:
import shlex

with open("in.txt") as f:
    d = {}
    for line in f:
        if not line.startswith("#") and line.strip():
            data = shlex.split(line)
            d[data[0]] = data[1]
print(d)
So changing the Banana line to:
Bananas 'north-side disabled' # I want to ignore this comment too!
We get:
{'Apples': '1', 'Oranges': '3', 'Bananas': 'north-side disabled'}
And the same will work for the slicing:
with open("in.txt") as f:
    d = dict(shlex.split(line)[:2] for line in f
             if not line.startswith("#") and line.strip())
print(d)

If the format of the file is well defined, you can try a solution with regular expressions.
Here's just an idea:
import re

fruits = {}
with open('fruits_list.txt', mode='r') as f:
    for line in f:
        match = re.match(r"([a-zA-Z0-9]+)\s+([0-9]+).*", line)
        if match:
            fruit_name, fruit_amount = match.groups()
            fruits[fruit_name] = fruit_amount
print fruits
UPDATED:
I changed the way the lines are read to take care of large files. Now I read line by line instead of all at once, which improves memory usage.

Related

Get duplicated lines in file

Working on a file with thousands of lines, I am trying to find which lines are duplicated exactly 2 times:
from collections import Counter

with open('log.txt') as f:
    string = f.readlines()
c = Counter(string)
print c
This gives me the counts of all duplicated lines, but I need to get only the lines repeated exactly 2 times.
You're printing all the strings, not just the repeated ones. To print only the ones which are repeated twice, print the strings which have a count of two.
from collections import Counter

with open('log.txt') as f:
    string = f.readlines()
c = Counter(string)
for line, count in c.items():
    if count == 2:
        print(line)
The Counter Object also provides information about how often a line occurs.
You can filter it using e.g. list comprehension.
This will print all lines, that occur exactly two times in the file
from collections import Counter

with open('log.txt') as f:
    string = f.readlines()
print([k for k, v in Counter(string).items() if v == 2])
If you want all repeated lines (lines duplicated two or more times):
with open('log.txt') as f:
    string = f.readlines()
print([k for k, v in Counter(string).items() if v > 1])
You could use Counter.most_common, i.e.:
from collections import Counter

with open('log.txt') as f:
    c = Counter(f)
print(c.most_common(1))
This prints the Counter entry with the highest count.

how to read file line by line in python

I'm trying to read the text file below in Python. I'm struggling to get the output as key/value pairs; it's not working as expected:
test.txt
productId1 ProdName1,ProdPrice1,ProdDescription1,ProdDate1
productId2 ProdName2,ProdPrice2,ProdDescription2,ProdDate2
productId3 ProdName3,ProdPrice3,ProdDescription3,ProdDate3
productId4 ProdName4,ProdPrice4,ProdDescription4,ProdDate4
myPython.py
import sys
with open('test.txt') as f
    lines = list(line.split(' ',1) for line in f)
    for k,v in lines.items();
        print("Key : {0}, Value: {1}".format(k,v))
I'm trying to parse the text file and print the key and value separately. It looks like I'm doing something wrong here. Can someone help me fix this?
Thanks!
You're needlessly storing a list.
Loop, split and print
with open('test.txt') as f:
    for line in f:
        k, v = line.rstrip().split(' ', 1)
        print("Key : {0}, Value: {1}".format(k, v))
This should work, with a list comprehension:
with open('test.txt') as f:
    lines = [line.split(' ', 1) for line in f]
for k, v in lines:
    print("Key: {0}, Value: {1}".format(k, v))
You can make a dict right off the bat with a dict comprehension and then iterate over it to print as you wanted. What you had done was create a list, which does not have an items() method.
with open('notepad.txt') as f:
    d = {line.split(' ')[0]: line.split(' ')[1] for line in f}
for k, v in d.items():
    print("Key : {0}, Value: {1}".format(k, v))
lines is a list of lists, so the good way to finish the job is:
import sys
with open('test.txt') as f:
    lines = list(line.split(' ', 1) for line in f)
for k, v in lines:
    print("Key : {0}, Value: {1}".format(k, v))
Perhaps I am reading too much into your description, but I see one key, a space, and a comma-delimited list of other fields. If I interpret that as there being data for those fields, I would conclude you want a dictionary of dictionaries. That leads to code like:
data_keys = 'ProdName', 'ProdPrice', 'ProdDescription', 'ProdDate'
with open('test.txt') as f:
    for line in f:
        id, values = line.strip().split()  # splits on whitespace
        keyed_values = zip(data_keys, values.split(','))
        print(dict([('key', id)] + list(keyed_values)))
You can use the f.readlines() function, which returns a list of the lines in f. Note that lines is then a list of pairs, so iterate over it directly rather than calling .items():
import sys
with open('test.txt') as f:
    lines = list(line.split(' ', 1) for line in f.readlines())
for k, v in lines:
    print("Key : {0}, Value: {1}".format(k, v))

ValueError: invalid literal for int() with base 10: '2\n3'

I would like to convert my text file below into a list:
4,9,2
3,5,7
8,1,6
Here's my python code so far, but I couldn't understand why it doesn't work:
def main():
    file = str(input("Please enter the full name of the desired file (with extension) at the prompt below: \n"))
    print(parseCSV(file))

def parseCSV(file):
    file_open = open(file)
    #print(file_open.read())
    with open(file) as f:
        d = f.read().split(',')
        data = list(map(int, d))
        print(data)

main()
The error message is:
line 12, in parseCSV
    data = list(map(int, d))
ValueError: invalid literal for int() with base 10: '2\n3'
Thanks :)
With d = f.read().split(','), you're reading the entire file and splitting on commas. Since the file consists of multiple lines, it will contain newline characters. These characters are not removed by split(',').
To fix this, iterate over the lines first instead of splitting the whole thing on commas:
d = (item for line in f for item in line.split(','))
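A complete sketch of that approach (the three-line file from the question is recreated here as foo.csv so the snippet is self-contained):

```python
# Recreate the sample file from the question.
with open("foo.csv", "w") as f:
    f.write("4,9,2\n3,5,7\n8,1,6\n")

with open("foo.csv") as f:
    # Splitting per line means no item ever spans a newline;
    # a trailing '\n' on an item is fine, since int() ignores whitespace.
    d = (item for line in f for item in line.split(','))
    data = list(map(int, d))

print(data)  # [4, 9, 2, 3, 5, 7, 8, 1, 6]
```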
Read is reading the entire file (including the newlines). So your actual data looks like:
'4,9,2\n3,5,7\n8,1,6'
You can either read the content a single line at a time:
line = f.readline()
while line:
    data = list(map(int, line.strip().split(',')))
    print(data)
    line = f.readline()
Or you can read everything at once and treat the newlines ("\n") as separators too:
d = f.read().replace("\n", ",").strip(',').split(',')
f.read() will read everything including the newline character (\n) and so map(int, d) will spit out error.
with open(file) as f:
    for line in f:
        d = line.split(',')
        data = list(map(int, d))
        print(data)
for line in f is a standard way to read a file line by line in python
You also need to split on newlines ('\n'); in this case you should use the csv library.
>>> import csv
>>> with open('foo.csv') as f:
...     print [map(int, row) for row in csv.reader(f)]
...
[[4, 9, 2], [3, 5, 7], [8, 1, 6]]

Find the smallest float in a file and printing the line

I have a data file like this:
1 13.4545
2 10.5578
3 12.5578
4 5.224
I am trying to find the line with the smallest float number and print or write the entire line (including the integer) to another file. so i get this:
4 5.224
I have this, but it does not work:
with open(file) as f:
    small = map(float, line)
    mini = min(small)
print mini
I also tried using this:
with open(file) as f:
    mylist = [[line.strip(), next(f).strip()] for line in f]
minimum = min(mylist, key=lambda x: float(x[1]))
print minimum
Using your data file, we can iterate over each line of the file inside min since min takes an iterator:
>>> with open(fn) as f:
... print min(f)
...
1 13.4545
Obviously, that compares the lines as strings (by ASCII value) rather than by the float.
Python's min takes a key function:
def kf(s):
    return float(s.split()[1])

with open(fn) as f:
    print min(f, key=kf)
Or:
>>> with open(fn) as f:
... print min(f, key=lambda line: float(line.split()[1]))
...
4 5.224
The advantage (in both versions) is that the file is processed line by line -- no need to read the entire file into memory.
The entire line is printed but only the float part is used to determine the min value of that line.
To fix YOUR version: the issue is your first list comprehension. Your version has next() in it, which you probably thought meant the next number. It doesn't: it is the next line:
>>> with open(fn) as f:
... mylist = [[line.strip(),next(f).strip()] for line in f]
...
>>> mylist
[['1 13.4545', '2 10.5578'], ['3 12.5578', '4 5.224']]
The first list comprehension should be:
>>> with open(fn) as f:
... mylist=[line.split() for line in f]
...
>>> mylist
[['1', '13.4545'], ['2', '10.5578'], ['3', '12.5578'], ['4', '5.224']]
Then the rest will work OK (but you will have the split list in this case -- not the line -- to print):
>>> minimum=min(mylist, key = lambda x: float(x[1]))
>>> minimum
['4', '5.224']
You were quite near, this is the minimal edit needed
with open(fl) as f:  # don't use `file` as a variable name
    line = [i.strip().split() for i in f]       # get the lines as separate line no and value
    line = [(x[0], float(x[1])) for x in line]  # convert the second value to float
    m = min(line, key=lambda x: x[1])           # find the minimum second value
    print "{} {}".format(m[0], m[1])            # print it. Hip Hip Hurray \o/
a = open('d.txt', 'r')
d = a.readlines()
m = float(d[0].split()[1])
for x in d[1:]:
    if float(x.split()[1]) < m:
        m = float(x.split()[1])
print m
map:
map(function, iterable, ...)
Apply function to every item of iterable and return a list of the results.
Demo:
>>> map(float , ["1.9", "2.0", "3"])
[1.9, 2.0, 3.0]
>>> map(float , "1.9")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: .
Read the input file with the csv module, since the structure of the input file is fixed.
Set the small and small_row variables to None.
Read the file row by row.
Cast the second item of each row from string to float.
Check whether small is None or greater than the second item of the row.
If so, assign small and small_row respectively.
Demo:
import csv

file1 = "1_d.txt"
small = None
small_row = None
with open(file1) as fp:
    root = csv.reader(fp, delimiter=' ')
    for row in root:
        item = float(row[1])
        if small is None or small > item:
            small = item
            small_row = row
print "small_row:", small_row
Output:
$ python 3.py
small_row: ['4', '5.224']

How to convert a file into a dictionary?

I have a file comprising two columns, i.e.,
1 a
2 b
3 c
I wish to read this file to a dictionary such that column 1 is the key and column 2 is the value, i.e.,
d = {1:'a', 2:'b', 3:'c'}
The file is small, so efficiency is not an issue.
d = {}
with open("file.txt") as f:
    for line in f:
        (key, val) = line.split()
        d[int(key)] = val
This will leave the key as a string:
with open('infile.txt') as f:
    d = dict(x.rstrip().split(None, 1) for x in f)
You can also use a dict comprehension like:
with open("infile.txt") as f:
    d = {int(k): v for line in f for (k, v) in [line.strip().split(None, 1)]}
def get_pair(line):
    key, sep, value = line.strip().partition(" ")
    return int(key), value

with open("file.txt") as fd:
    d = dict(get_pair(line) for line in fd)
By dictionary comprehension:
d = {line.split()[0]: line.split()[1] for line in open("file.txt")}
Or with pandas:
import pandas as pd
d = pd.read_csv("file.txt", delimiter=" ", header=None, index_col=0)[1].to_dict()
Simple Option
Most methods for storing a dictionary use JSON, Pickle, or line reading. Provided you're not editing the dictionary outside of Python, this simple method should suffice for even complex dictionaries, although Pickle will be better for larger ones.
x = {1: 'a', 2: 'b', 3: 'c'}
f = 'file.txt'
print(x, file=open(f, 'w'))  # file.txt now contains: {1: 'a', 2: 'b', 3: 'c'}
y = eval(open(f, 'r').read())
print(x == y)  # >>> True
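For comparison, the JSON route mentioned above could look like this (a sketch; note that JSON object keys are always strings, so the integer keys need converting back on load):

```python
import json

x = {1: 'a', 2: 'b', 3: 'c'}
with open('file.json', 'w') as f:
    json.dump(x, f)

with open('file.json') as f:
    # JSON keys come back as strings; restore the int keys.
    y = {int(k): v for k, v in json.load(f).items()}

print(x == y)  # True
```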
If you love one liners, try:
d=eval('{'+re.sub('\'[\s]*?\'','\':\'',re.sub(r'([^'+input('SEP: ')+',]+)','\''+r'\1'+'\'',open(input('FILE: ')).read().rstrip('\n').replace('\n',',')))+'}')
Input FILE = Path to file, SEP = Key-Value separator character
Not the most elegant or efficient way of doing it, but quite interesting nonetheless :)
IMHO it is a bit more pythonic to use generators (you probably need Python 2.7+ for this):
with open('infile.txt') as fd:
    pairs = (line.split(None) for line in fd)
    res = {int(pair[0]): pair[1] for pair in pairs if len(pair) == 2 and pair[0].isdigit()}
This will also filter out lines not starting with an integer or not containing exactly two items.
I had a requirement to take values from a text file and use them as key/value pairs. The content of my text file is in key = value form, so I used the split method with "=" as the separator and wrote the code below:
d = {}
file = open("filename.txt")
for x in file:
    f = x.split("=")
    d.update({f[0].strip(): f[1].strip()})
By using the strip method, any spaces before or after the "=" separator are removed, and you will have the expected data in dictionary format.
import re

my_file = open('file.txt', 'r')
d = {}
for i in my_file:
    g = re.search(r'(\d+)\s+(.*)', i)  # match a line containing an int and a string
    if g:  # skip lines that do not match
        d[int(g.group(1))] = g.group(2)
Here's another option...
import csv
import os

events = {}
for line in csv.reader(open(os.path.join(path, 'events.txt'), "rb")):
    if line[0][0] == "#":  # skip comment lines
        continue
    events[line[0]] = line[1] if len(line) == 2 else line[1:]
