Opening file and putting lines into separate strings - python

I am trying to write a function that opens a file containing two lines: the first is a string giving the keys and the second is a string giving the values.
So far I have the following:
f = open('PT.txt','r')
string = ""
while 1:
    line = f.readline()
    if not line:break
    string += line
f.close()
This is the contents of 'PT.txt'
abcdefghijklmnopqrstuvwxyz
gikaclmnqrpoxzybdefhjstuvw
I get the following output when I print string
abcdefghijklmnopqrstuvwxyz
gikaclmnqrpoxzybdefhjstuvw
I am confused now how to get each line on its own string and how to create a dictionary.
I want the dictionary to look like
{
'a': 'g',
'b': 'i',
'c': 'k',
# etc
}

Try this:
fp = open('PT.txt','r')
s1 = fp.readline().strip()  # strip the trailing newline so it is not zipped in as a key
s2 = fp.readline().strip()
fp.close()
s = zip(s1, s2)
ans = {key : val for key,val in s}

with open("filename") as infile:
    lines = infile.readlines()
Note: Do not use string, or any other command, type or standard module name as a variable name.
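Putting the two answers together, a minimal sketch might look like the following. The helper name mapping_from_lines is hypothetical (it is not from either answer), and the sketch assumes the first two lines have the same length. Each line is stripped so the trailing newline does not end up as a key.

```python
def mapping_from_lines(lines):
    """Build a dict from the first two lines: keys from line 1, values from line 2."""
    it = iter(lines)
    keys = next(it).strip()      # strip() drops the trailing newline
    values = next(it).strip()
    return dict(zip(keys, values))


# An open file is an iterable of lines, so it could be passed directly:
# with open('PT.txt') as infile:
#     table = mapping_from_lines(infile)
table = mapping_from_lines(["abcdefghijklmnopqrstuvwxyz\n",
                            "gikaclmnqrpoxzybdefhjstuvw\n"])
print(table['a'], table['b'], table['c'])
```

Taking an iterable of lines rather than a path keeps the helper easy to try without a file on disk.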

Related

Skip multiple lines while parsing file in python and storing their values

I apologize for the confusing title. I'm very new to Python and here's what I'm trying to achieve:
I'm parsing a file file.txt that has data like this (and other stuff):
file.txt:
...
a = (
1
2
3 )
...
I need to store this type of data in 2 parts:
name = "a"
value = {"(", "1", "2", "3 )"}
^ each line is an element of the list
I'm parsing the file line by line as shown in the snippet below and I can't change that. I'm not sure how to store the data this way by looking ahead a few lines, storing their values and then skipping them so that they're not processed twice. I want the 2 variables name and value populated when the loop is at the first line "a = "
with open("file.txt") as fp:
    for line in fp:
        ...
Thanks for the help.
I suggest using a dictionary:
txt=open(r"file.txt","r").readlines()
dictionary=dict()
for i in range(len(txt)):
    if "=" in txt[i]:
        name,values=txt[i].split()[0],[txt[i].split()[-1]]
        dictionary[name],i={"name":name},i+1
        while True:
            values.append(txt[i])
            if ")" in txt[i]:
                break
            i=i+1
        values=[value.replace("\n","") for value in values]
        dictionary[name].update({"values":values})
        i=i-1
    i=i+1
>>dictionary["a"]
Out[40]: {'name': 'a', 'values': ['(', '1', '2', '3 )']}
>>dictionary["b"]
Out[45]: {'name': 'b', 'values': ['(', '3', '4', '6 )']}
So, you parse the file line-to-line. Whenever you find an equal sign "=" in a line, it means that the char before the "=" is the name value you want. Then the next line is the first element of the list, the line after that the second element etc... when a line has the char ")" it means that it is the last value of the list.
See the Python string.find method for this. Try to understand the concept and the coding shouldn't be hard.
[u'a']
['(', '1', '2', '3', ')']
Is this what you need?
Then you can follow these lines of code:
name = []
value = []
with open("file.txt") as fp:
    for line in fp:
        words = line.split()
        if ('(') in words:
            name.append(words[0].decode('utf-8'))
            value.append('(')
        else:
            for entry in words:
                value.append(entry)
print (name)
print (value)
If the file is not too large, read whole file into memory then use a while loop to do a finer-grained control:
# python3
with open("file.txt") as f:
    lines = f.readlines()
index = 0
while index < len(lines):
    # do something here, then advance index by however many lines were consumed
    index += 1
Else, if only last value contains ')', do this:
with open('file.txt') as f:
    pairs = []
    for line in f:
        if '=' not in line:
            continue  # skip lines that do not start a block
        values = []
        name, value = line.strip().split('=')
        name = name.strip()
        values.append(value.strip())
        while True:
            line = next(f)
            values.append(line.strip())
            if ')' in line:
                break
        pairs.append((name, values))
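The look-ahead idea above can be sketched as a small function that pulls the continuation lines from the same iterator, so the outer loop never sees them twice. The name parse_blocks and the sample data are hypothetical; the sketch assumes a block starts on a line containing "=" and ends on the first line containing ")".

```python
def parse_blocks(lines):
    """Collect (name, values) pairs from 'name = (' ... ')' blocks."""
    it = iter(lines)
    pairs = []
    for line in it:
        if '=' not in line:
            continue                      # not the start of a block
        name, _, first = line.partition('=')
        values = [first.strip()]
        # Consume continuation lines from the same iterator so the
        # outer loop does not process them again.
        while ')' not in values[-1]:
            nxt = next(it, None)
            if nxt is None:
                break                     # input ended before ')' appeared
            values.append(nxt.strip())
        pairs.append((name.strip(), values))
    return pairs


sample = ["...\n", "a = (\n", "1\n", "2\n", "3 )\n", "...\n"]
print(parse_blocks(sample))
```

Because an open file is its own iterator, passing the file object from a with block should behave the same way as the list used here.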

Creating dicts from file in python

For example, I've got file with multilines like
<<something>> 1, 5, 8
<<somethingelse>> hello
<<somethingelseelse>> 1,5,6
I need to create dict with keys
dict = { "something":[1,5,8], "somethingelse": "hello" ...}
I need to somehow read what is inside << >> and put it as a key, and also I need to check if there are a lot of elements or only 1. If only one then I put it as string. If more then one then I need to put it as a list of elements.
Any ideas how to help me?
Maybe regEx's but I'm not great with them.
I easily created def which is reading a file lines, but don't know how to separate those values:
f = open('something.txt', 'r')
lines = f.readlines()
f.close()
def finding_path():
    for line in lines:
        print line
finding_path()
Any ideas? Thanks :)
Assuming that your keys will always be single words, you can play with split(char, maxSplits). Something like below
import sys
def finding_path(file_name):
    f = open(file_name, 'r')
    my_dict = {}
    for line in f:
        # split on first occurrence of space
        key_val_pair = line.split(' ', 1)
        # if we do have a key separated by a space
        if len(key_val_pair) > 1:
            key = key_val_pair[0]
            # proceed only if the key is enclosed within '<<' and '>>'
            if key.startswith('<<') and key.endswith('>>'):
                key = key[2:-2]
                # put more than one value in list, otherwise directly a string literal
                val = key_val_pair[1].split(',') if ',' in key_val_pair[1] else key_val_pair[1]
                my_dict[key] = val
    print my_dict
    f.close()
if __name__ == '__main__':
    finding_path(sys.argv[1])
Using a file like below
<<one>> 1, 5, 8
<<two>> hello
// this is a comment, it will be skipped
<<three>> 1,5,6
I get the output
{'three': ['1', '5', '6\n'], 'two': 'hello\n', 'one': ['1', ' 5', ' 8\n']}
Please check the below code:
Used regex to get key and value
If the length of value list is 1, then converting it into string.
import re
demo_dict = {}
with open("val.txt",'r') as f:
    for line in f:
        m= re.search(r"<<(.*?)>>(.*)",line)
        if m is not None:
            k = m.group(1)
            v = m.group(2).strip().split(',')
            if len(v) == 1:
                v = v[0]
            demo_dict[k]=v
print demo_dict
Output:
C:\Users\dinesh_pundkar\Desktop>python demo.Py
{'somethingelseelse': ['1', '5', '6'], 'somethingelse': 'hello', 'something': ['1', ' 5', ' 8']}
My answer is similar to Dinesh's. I've added a function to convert the values in the list to numbers if possible, and some error handling so that if a line doesn't match, a useful warning is given.
import re
import warnings
regexp =re.compile(r'<<(\w+)>>\s+(.*)')
lines = ["<<something>> 1, 5, 8\n",
         "<<somethingelse>> hello\n",
         "<<somethingelseelse>> 1,5,6\n"]
#In real use use a file object instead of the list
#lines = open('something.txt','r')
def get_value(obj):
    """Converts an object to a number if possible,
    or a string if not possible"""
    try:
        return int(obj)
    except ValueError:
        pass
    try:
        return float(obj)
    except ValueError:
        return str(obj)
dictionary = {}
for line in lines:
    line = line.strip()
    m = re.search(regexp, line)
    if m is None:
        warnings.warn("Match failed on \n {}".format(line))
        continue
    key = m.group(1)
    value = [get_value(x) for x in m.group(2).split(',')]
    if len(value) == 1:
        value = value[0]
    dictionary[key] = value
print(dictionary)
output
{'something': [1, 5, 8], 'somethingelse': 'hello', 'somethingelseelse': [1, 5, 6]}

Python: Read text file into dict and ignore comments

I am trying to put the following text file into a dictionary but I would like any section starting with '#' or empty lines ignored.
My text file looks something like this:
# This is my header info followed by an empty line
Apples 1 # I want to ignore this comment
Oranges 3 # I want to ignore this comment
#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*
Bananas 5 # I want to ignore this comment too!
My desired output would be:
myVariables = {'Apples': 1, 'Oranges': 3, 'Bananas': 5}
My Python code reads as follows:
filename = "myFile.txt"
myVariables = {}
with open(filename) as f:
    for line in f:
        if line.startswith('#') or not line:
            next(f)
        key, val = line.split()
        myVariables[key] = val
        print "key: " + str(key) + " and value: " + str(val)
The error I get:
Traceback (most recent call last):
File "C:/Python27/test_1.py", line 11, in <module>
key, val = line.split()
ValueError: need more than 1 value to unpack
I understand the error but I do not understand what is wrong with the code.
Thank you in advance!
Given your text:
text = """
# This is my header info followed by an empty line
Apples 1 # I want to ignore this comment
Oranges 3 # I want to ignore this comment
#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*
Bananas 5 # I want to ignore this comment too!
"""
We can do this in 2 ways. Using regex, or using Python generators. I would choose the latter (described below) as regex is not particularly fast(er) in such cases.
To open the file:
with open('file_name.xyz', 'r') as file:
    # everything else below. Just substitute `for line in lines` with
    # `for line in file`
To create a similar list of lines here, we split the text:
lines = text.split('\n') # as if read from a file using `open`.
Here is how we do all you want in a couple of lines:
# Discard all comments and empty values.
comment_less = filter(None, (line.split('#')[0].strip() for line in lines))
# Separate items and totals.
separated = {item.split()[0]: int(item.split()[1]) for item in comment_less}
Let's test:
>>> print(separated)
{'Apples': 1, 'Oranges': 3, 'Bananas': 5}
Hope this helps.
This doesn't exactly reproduce your error, but there's a problem with your code:
>>> x = "Apples\t1\t# This is a comment"
>>> x.split()
['Apples', '1', '#', 'This', 'is', 'a', 'comment']
>>> key, val = x.split()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: too many values to unpack
Instead try:
key = line.split()[0]
val = line.split()[1]
Edit: and I think your "need more than 1 value to unpack" is coming from the blank lines. Also, I'm not familiar with using next() like this. I guess I would do something like:
if line.startswith('#') or line == "\n":
    pass
else:
    key = line.split()[0]
    val = line.split()[1]
To strip comments, you could use str.partition() which works whether the comment sign is present or not in the line:
for line in file:
    line, _, comment = line.partition('#')
    if line.strip(): # non-blank line
        key, value = line.split()
line.split() may raise an exception in this code too: it happens if there is a non-blank line that does not contain exactly two whitespace-separated words. What you want to do in that case (ignore such lines, print a warning, etc.) depends on the application.
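A minimal sketch of the partition approach with one possible policy for malformed lines (skipping them). The function name parse_counts is hypothetical, and the sketch assumes the second field is always an integer, as in the question's desired output.

```python
def parse_counts(lines):
    """Parse 'name count  # comment' lines into a dict of ints."""
    result = {}
    for line in lines:
        content, _, _comment = line.partition('#')
        fields = content.split()
        if not fields:
            continue                 # blank line or comment-only line
        if len(fields) != 2:
            continue                 # malformed line: skipped here by choice
        key, value = fields
        result[key] = int(value)     # assumes the value is an integer
    return result


sample = [
    "# This is my header info followed by an empty line\n",
    "\n",
    "Apples 1 # I want to ignore this comment\n",
    "Oranges 3 # I want to ignore this comment\n",
    "#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*\n",
    "Bananas 5 # I want to ignore this comment too!\n",
]
print(parse_counts(sample))
```

Swapping the second continue for a warning or an exception would be the place to change the policy.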
You need to ignore empty lines and lines starting with #, then split the remaining lines after removing the trailing comment, either by splitting on # or by slicing with rfind as below. An empty line still contains a newline, so you need "and line.strip()" to detect it. You cannot just split on whitespace and unpack, because after splitting there are more than two elements once the words of the comment are included:
with open("in.txt") as f:
    d = dict(line[:line.rfind("#")].split() for line in f
             if not line.startswith("#") and line.strip())
print(d)
Output:
{'Apples': '1', 'Oranges': '3', 'Bananas': '5'}
Another option is to split twice and slice:
with open("in.txt") as f:
    d = dict(line.split(None,2)[:2] for line in f
             if not line.startswith("#") and line.strip())
print(d)
Or splitting twice and unpacking using an explicit loop:
with open("in.txt") as f:
    d = {}
    for line in f:
        if not line.startswith("#") and line.strip():
            k, v, _ = line.split(None, 2)
            d[k] = v
You can also use itertools.groupby to group the lines you want.
from itertools import groupby
with open("in.txt") as f:
    grouped = groupby(f, lambda x: not x.startswith("#") and x.strip())
    d = dict(next(v).split(None, 2)[:2] for k, v in grouped if k)
print(d)
To handle where we have multiple words in single quotes we can use shlex to split:
import shlex
with open("in.txt") as f:
    d = {}
    for line in f:
        if not line.startswith("#") and line.strip():
            data = shlex.split(line)
            d[data[0]] = data[1]
print(d)
So changing the Banana line to:
Bananas 'north-side disabled' # I want to ignore this comment too!
We get:
{'Apples': '1', 'Oranges': '3', 'Bananas': 'north-side disabled'}
And the same will work for the slicing:
with open("in.txt") as f:
    d = dict(shlex.split(line)[:2] for line in f
             if not line.startswith("#") and line.strip())
print(d)
If the format of the file is correctly defined you can try a solution with regular expressions.
Here's just an idea:
import re
fruits = {}
with open('fruits_list.txt', mode='r') as f:
    for line in f:
        # raw string so that \s is passed to the regex engine unchanged
        match = re.match(r"([a-zA-Z0-9]+)[\s]+([0-9]+).*", line)
        if match:
            fruit_name, fruit_amount = match.groups()
            fruits[fruit_name] = fruit_amount
print fruits
UPDATED:
I changed the way of reading lines taking care of large files. Now I read line by line and not all in one. This improves the memory usage.

Why does my code write to the file only when I run it a second time?

My goal is to count the words. When I run my code I am supposed to:
read in strings from the file
split every line in words
add these words into the dictionary
sort keys and add them to the list
write the string that consists of keys and appropriate values into the file
When I run code for the first time it does not write anything in the file, but I see the result on my screen. The file is empty. Only when I run code second time I see content is recorded into the file.
Why is that happening?
#read in the file
fileToRead = open('../folder/strings.txt')
fileToWrite = open('../folder/count.txt', 'w')
d = {}
#iterate over every line in the file
for line in fileToRead:
    listOfWords = line.split()
    #iterate over every word in the list
    for word in listOfWords:
        if word not in d:
            d[word] = 1
        else:
            d[word] = d.get(word) + 1
#sort the keys
listF = sorted(d)
#iterate over sorted keys and write them in the file with appropriate value
for word in listF:
    string = "{:<18}\t\t\t{}\n".format(word, d.get(word))
    print string
    fileToWrite.write(string)
A minimalistic version:
import collections
with open('strings.txt') as f:
    d = collections.Counter(s for line in f for s in line.split())
with open('count.txt', 'a') as f:
    for word in sorted(d.iterkeys()):
        string = "{:<18}\t\t\t{}\n".format(word, d[word])
        print string,
        f.write(string)
A couple of changes: I think you meant 'a' (append to the file) instead of 'w' (overwrite the file each time), as in open('count.txt', 'a'). Please also try to use the with statement for reading and writing files, as it automatically closes the file once the block ends.
#read in the file
fileToRead = open('strings.txt')
d = {}
#iterate over every line in the file
for line in fileToRead:
    listOfWords = line.split()
    #iterate over every word in the list
    for word in listOfWords:
        if word not in d:
            d[word] = 1
        else:
            d[word] = d.get(word) + 1
#sort the keys
listF = sorted(d)
#iterate over sorted keys and write them in the file with appropriate value
with open('count.txt', 'a') as fileToWrite:
    for word in listF:
        string = "{:<18}\t\t\t{}\n".format(word, d.get(word))
        print string,
        fileToWrite.write(string)
When you do file.write(some_data), it writes the data into a buffer, not straight into the file. The buffer is written out when it fills up, when you call file.flush(), or when you do file.close().
f = open('some_temp_file.txt', 'w')
f.write("booga boo!")
# nothing written yet to disk
f.close()
# flushes the buffer and writes to disk
The better way to do this would be to store the path in the variable, rather than the file object. Then you can open the file (and close it again) on demand.
read_path = '../folder/strings.txt'
write_path = '../folder/count.txt'
This also allows you to use the with keyword, which handles file opening and closing much more elegantly.
read_path = '../folder/strings.txt'
write_path = '../folder/count.txt'
d = dict()
with open(read_path) as inf:
    for line in inf:
        for word in line.split():
            d[word] = d.get(word, 0) + 1
            # remember dict.get's default value! Saves a conditional
# since we've left the block, `inf` is closed by here
sorted_words = sorted(d)
with open(write_path, 'w') as outf:
    for word in sorted_words:
        s = "{:<18}\t\t\t{}\n".format(word, d.get(word))
        # don't shadow the stdlib `string` module
        # also: why are you using both fixed width AND tab-delimiters in the same line?
        print(s) # not sure why you're doing this, but okay...
        outf.write(s)
# since we leave the block, the file closes automagically.
That said, there's a couple things you could do to make this a little better in general. First off: counting how many of something are in a container is a job for a collections.Counter.
In [1]: from collections import Counter
In [2]: Counter('abc')
Out[2]: Counter({'a': 1, 'b': 1, 'c': 1})
and Counters can be added together with the expected behavior
In [3]: Counter('abc') + Counter('cde')
Out[3]: Counter({'c': 2, 'a': 1, 'b': 1, 'd': 1, 'e': 1})
and also sorted the same way you'd sort a dictionary with keys
In [4]: sorted((Counter('abc') + Counter('cde')).items(), key=lambda kv: kv[0])
Out[4]: [('a', 1), ('b', 1), ('c', 2), ('d', 1), ('e', 1)]
Put those all together and you could do something like:
from collections import Counter
read_path = '../folder/strings.txt'
write_path = '../folder/count.txt'
with open(read_path) as inf:
    # sum() needs an empty Counter as its start value; the default start of 0 raises TypeError
    results = sum((Counter(line.split()) for line in inf), Counter())
with open(write_path, 'w') as outf:
    for word, count in sorted(results.items(), key=lambda kv: kv[0]):
        s = "{:<18}\t\t\t{}\n".format(word, count)
        outf.write(s)
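To check that output written inside a with block is readable on the first run, here is a small self-contained sketch. The function name count_words and the temporary paths are placeholders rather than anything from the question, and Counter.update is used as an alternative to summing one Counter per line.

```python
import os
import tempfile
from collections import Counter


def count_words(read_path, write_path):
    """Count whitespace-separated words and write them sorted, one per line."""
    counts = Counter()
    with open(read_path) as inf:
        for line in inf:
            counts.update(line.split())
    with open(write_path, 'w') as outf:
        for word in sorted(counts):
            outf.write("{:<18}{}\n".format(word, counts[word]))
    # Both files are closed here, so the buffered output has been flushed.
    return counts


# Demo with temporary files (placeholder paths, not the ones in the question).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'strings.txt')
dst = os.path.join(workdir, 'count.txt')
with open(src, 'w') as f:
    f.write("b a\na c a\n")
counts = count_words(src, dst)
with open(dst) as f:
    written = f.read()
print(written)
```

Reading count.txt back right after the call shows the counts without needing a second run.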

Not working: indexing the words in a file in a dict by first letter

I have to write a function based on a open file that has one lowercase word per line. I have to return a dictionary with keys in single lowercase letters and each value is a list of the words from the file that starts with that letter. (The keys in the dictionary are from only the letters of the words that appear in the file.)
This is my code:
def words(file):
    line = file.readline()
    dict = {}
    list = []
    while (line != ""):
        list = line[:].split()
        if line[0] not in dict.keys():
            dict[line[0]] = list
        line = file.readline()
    return dict
However, when I was testing it myself, my function doesn't seem to return all the values. If there are more than two words that start with a certain letter, only the first one shows up as the values in the output. What am I doing wrong?
For example, the file should return:
{'a': ['apple'], 'p': ['peach', 'pear', 'pineapple'], \
'b': ['banana', 'blueberry'], 'o': ['orange']}, ...
... but returns ...
{'a': ['apple'], 'p': ['pear'], \
'b': ['banana'], 'o': ['orange']}, ...
Try this solution, it takes into account the case where there are words starting with the same character in more than one line, and it doesn't use defaultdict. I also simplified the function a bit:
def words(file):
    dict = {}
    for line in file:
        lst = line.split()
        dict.setdefault(line[0], []).extend(lst)
    return dict
You aren't adding to the list for each additional letter. Try:
if line[0] not in dict.keys():
    dict[line[0]] = list
else:
    dict[line[0]] += list
The specific problem is that dict[line[0]] = list only runs for a new key, so later words starting with the same letter are never added. There are many ways to fix this... I'm happy to provide one, but you asked what was wrong and that's it. Welcome to Stack Overflow.
It seems like every dictionary entry should be a list. Use the append method on the list stored under each key.
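A minimal sketch of that append idea, here using collections.defaultdict so that a missing key starts out as an empty list. The function name words_by_letter is hypothetical, and the sketch assumes one word per line as described in the question.

```python
from collections import defaultdict


def words_by_letter(lines):
    """Group words by their first letter; one word per line is assumed."""
    index = defaultdict(list)        # missing keys start as an empty list
    for line in lines:
        word = line.strip()
        if word:                     # ignore blank lines
            index[word[0]].append(word)
    return dict(index)


sample = ["apple\n", "peach\n", "pear\n", "banana\n", "pineapple\n"]
print(words_by_letter(sample))
```

An open file can be passed in place of the sample list, since iterating a file yields its lines.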
Sacrificing performance (to a certain extent) for elegance:
with open(whatever) as f: words = f.read().split()
result = {
    first: [word for word in words if word.startswith(first)]
    for first in set(word[0] for word in words)
}
Something like this should work
def words(file):
    dct = {}
    for line in file:
        word = line.strip()
        try:
            dct[word[0]].append(word)
        except KeyError:
            dct[word[0]] = [word]
    return dct
The first time a new letter is found, there will be a KeyError; subsequent occurrences of the letter will cause the word to be appended to the existing list.
Another approach would be to prepopulate the dict with the keys you need
import string
def words(file):
    # string.ascii_lowercase exists in both Python 2 and 3 (string.lowercase is Python 2 only)
    dct = dict.fromkeys(string.ascii_lowercase, [])
    for line in file:
        word = line.strip()
        dct[word[0]] = dct[word[0]] + [word]
    return dct
I'll leave it as an exercise to work out why dct[word[0]] += [word] won't work
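For readers who want to check their answer to that exercise, the following sketch shows the behaviour in isolation. The variable names are illustrative only. dict.fromkeys stores the same list object under every key, and += on a list extends it in place.

```python
# Every key created by fromkeys refers to the SAME list object.
shared = dict.fromkeys("ab", [])
shared["a"] += ["apple"]             # in-place extend of the shared list
print(shared)                        # {'a': ['apple'], 'b': ['apple']}

# A dict comprehension creates a separate list for each key.
independent = {letter: [] for letter in "ab"}
independent["a"] += ["apple"]
print(independent)                   # {'a': ['apple'], 'b': []}
```

Writing dct[k] = dct[k] + [word] avoids the problem because + builds a new list instead of mutating the shared one.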
Try this function
def words(file):
    dict = {}
    line = file.readline()
    while (line != ""):
        my_key = line[0].lower()
        dict.setdefault(my_key, []).extend(line.split() )
        line = file.readline()
    return dict
