For example, I've got a file with multiple lines like:
<<something>> 1, 5, 8
<<somethingelse>> hello
<<somethingelseelse>> 1,5,6
I need to create a dict with keys like this:
dict = { "something":[1,5,8], "somethingelse": "hello" ...}
I need to read what is inside << >> and use it as the key, and I also need to check whether a line holds several elements or only one. If there is only one, I store it as a string. If there is more than one, I store them as a list of elements.
Any ideas how to do this?
Maybe regexes, but I'm not great with them.
I easily wrote a function that reads the file's lines, but I don't know how to separate those values:
f = open('something.txt', 'r')
lines = f.readlines()
f.close()
def finding_path():
    for line in lines:
        print line
finding_path()
Any ideas? Thanks :)
Assuming that your keys will always be single words, you can use str.split(sep, maxsplit) to split each line only once. Something like this:
import sys
def finding_path(file_name):
    f = open(file_name, 'r')
    my_dict = {}
    for line in f:
        # split on the first occurrence of a space
        key_val_pair = line.split(' ', 1)
        # if we do have a key separated by a space
        if len(key_val_pair) > 1:
            key = key_val_pair[0]
            # proceed only if the key is enclosed within '<<' and '>>'
            if key.startswith('<<') and key.endswith('>>'):
                key = key[2:-2]
                # put more than one value in a list, otherwise keep a plain string
                val = key_val_pair[1].split(',') if ',' in key_val_pair[1] else key_val_pair[1]
                my_dict[key] = val
    print(my_dict)
    f.close()
if __name__ == '__main__':
    finding_path(sys.argv[1])
Using a file like below
<<one>> 1, 5, 8
<<two>> hello
// this is a comment, it will be skipped
<<three>> 1,5,6
I get the output
{'three': ['1', '5', '6\n'], 'two': 'hello\n', 'one': ['1', ' 5', ' 8\n']}
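The output above still carries the newlines and the spaces after the commas. A minimal sketch of the same split-based idea that also strips each value is shown here; the helper name parse_line and the in-memory sample list are made up for illustration and are not part of the answer above.

```python
def parse_line(line):
    """Return (key, value) for a '<<key>> v1, v2' line, or None if it does not match."""
    parts = line.strip().split(' ', 1)
    if len(parts) != 2:
        return None
    key, raw = parts
    if not (key.startswith('<<') and key.endswith('>>')):
        return None
    # strip every value so no spaces or newlines are left over
    values = [v.strip() for v in raw.split(',')]
    # several values stay a list, a single value becomes a plain string
    value = values if len(values) > 1 else values[0]
    return key[2:-2], value


# stand-in for the lines of the sample file
sample = ["<<one>> 1, 5, 8\n", "<<two>> hello\n",
          "// this is a comment, it will be skipped\n", "<<three>> 1,5,6\n"]
result = dict(pair for pair in map(parse_line, sample) if pair is not None)
print(result)
```

The values are still strings here; converting them to numbers would be a separate step.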
Please check the code below.
It uses a regex to get the key and the value.
If the value list has length 1, it is converted into a string.
import re
demo_dict = {}
with open("val.txt", 'r') as f:
    for line in f:
        m = re.search(r"<<(.*?)>>(.*)", line)
        if m is not None:
            k = m.group(1)
            v = m.group(2).strip().split(',')
            if len(v) == 1:
                v = v[0]
            demo_dict[k] = v
print(demo_dict)
Output:
C:\Users\dinesh_pundkar\Desktop>python demo.Py
{'somethingelseelse': [' 1', '5', '6'], 'somethingelse': 'hello', 'something': [
' 1', ' 5', ' 8']}
My answer is similar to Dinesh's. I've added a function to convert the values in the list to numbers if possible, and some error handling so that if a line doesn't match, a useful warning is given.
import re
import warnings
regexp = re.compile(r'<<(\w+)>>\s+(.*)')
lines = ["<<something>> 1, 5, 8\n",
         "<<somethingelse>> hello\n",
         "<<somethingelseelse>> 1,5,6\n"]
# In real use, iterate over a file object instead of the list
# lines = open('something.txt','r')
def get_value(obj):
    """Converts an object to a number if possible,
    or a string if not possible"""
    try:
        return int(obj)
    except ValueError:
        pass
    try:
        return float(obj)
    except ValueError:
        return str(obj)
dictionary = {}
for line in lines:
    line = line.strip()
    m = re.search(regexp, line)
    if m is None:
        warnings.warn("Match failed on \n {}".format(line))
        continue
    key = m.group(1)
    value = [get_value(x) for x in m.group(2).split(',')]
    if len(value) == 1:
        value = value[0]
    dictionary[key] = value
print(dictionary)
output
{'something': [1, 5, 8], 'somethingelse': 'hello', 'somethingelseelse': [1, 5, 6]}
I apologize for the confusing title. I'm very new to Python and here's what I'm trying to achieve:
I'm parsing a file file.txt that has data like this (and other stuff):
file.txt:
...
a = (
1
2
3 )
...
I need to store this type of data in 2 parts:
name = "a"
value = {"(", "1", "2", "3 )"}
^ each line is an element of the list
I'm parsing the file line by line as shown in the snippet below and I can't change that. I'm not sure how to store the data this way by looking ahead a few lines, storing their values and then skipping them so that they're not processed twice. I want the 2 variables name and value populated when the loop is at the first line "a = "
with open("file.txt") as fp:
    for line in fp:
        ...
Thanks for the help.
I suggest using a dictionary:
txt=open(r"file.txt","r").readlines()
dictionary=dict()
for i in range(len(txt)):
    if "=" in txt[i]:
        name,values=txt[i].split()[0],[txt[i].split()[-1]]
        dictionary[name],i={"name":name},i+1
        while True:
            values.append(txt[i])
            if ")" in txt[i]:
                break
            i=i+1
        values=[value.replace("\n","") for value in values]
        dictionary[name].update({"values":values})
        i=i-1
    i=i+1
>>dictionary["a"]
Out[40]: {'name': 'a', 'values': ['(', '1', '2', '3 )']}
>>dictionary["b"]
Out[45]: {'name': 'b', 'values': ['(', '3', '4', '6 )']}
So, you parse the file line by line. Whenever you find an equals sign "=" in a line, the text before the "=" is the name you want. The rest of that line is the first element of the list, the next line is the second element, and so on. When a line contains ")", it holds the last value of the list.
See the Python string.find method for this. Try to understand the concept and the coding shouldn't be hard.
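A rough sketch of that idea while still reading line by line: the inner loop pulls the following lines from the same iterator, so the outer loop never sees them a second time. The function name parse_blocks and the sample list are placeholders for illustration; with a real file you would pass the open file object instead.

```python
def parse_blocks(lines):
    """Collect {name: [values...]} for blocks of the form 'name = (' ... ')'."""
    result = {}
    it = iter(lines)
    for line in it:
        if "=" not in line:
            continue  # not the start of a block
        name, _, first = line.partition("=")
        values = [first.strip()]
        if ")" not in first:
            # consume the following lines from the same iterator
            for nxt in it:
                values.append(nxt.strip())
                if ")" in nxt:
                    break
        result[name.strip()] = values
    return result


# stand-in for the lines of file.txt
sample = ["...\n", "a = (\n", "1\n", "2\n", "3 )\n", "...\n"]
blocks = parse_blocks(sample)
print(blocks)  # {'a': ['(', '1', '2', '3 )']}
```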
[u'a']
['(', '1', '2', '3', ')']
Is this what you need?
Then you can follow these lines of code:
name = []
value = []
with open("file.txt") as fp:
    for line in fp:
        words = line.split()
        if '(' in words:
            # .decode() is Python 2 only; on Python 3 use words[0] directly
            name.append(words[0].decode('utf-8'))
            value.append('(')
        else:
            for entry in words:
                value.append(entry)
print(name)
print(value)
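If the file is not too large, read the whole file into memory, then use a while loop for finer-grained control over which line is processed next:
# python3
with open("file.txt") as f:
    lines = f.readlines()
index = 0
while index < len(lines):
    line = lines[index]
    # do something here; add more than 1 to index to skip lines already consumed
    index += 1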
Otherwise, if only the last value of each block contains ')', do this:
with open('file.txt') as f:
    pairs = []
    for line in f:
        if '=' not in line:
            continue  # not the start of a "name = (" block
        values = []
        name, value = line.strip().split('=', 1)
        name = name.strip()
        values.append(value.strip())
        while True:
            line = next(f)
            values.append(line.strip())
            if ')' in line:
                break
        pairs.append((name, values))
I am trying to write a function that opens a file containing two lines, the first with the string giving the keys and the second with the string giving the values
So far I have the following
f = open('PT.txt','r')
string = ""
while 1:
    line = f.readline()
    if not line:break
    string += line
f.close()
This is the contents of 'PT.txt'
abcdefghijklmnopqrstuvwxyz
gikaclmnqrpoxzybdefhjstuvw
I get the following output when I print string
abcdefghijklmnopqrstuvwxyz
gikaclmnqrpoxzybdefhjstuvw
I am confused now how to get each line on its own string and how to create a dictionary.
I want the dictionary to look like
{
'a': 'g',
'b': 'i',
'c': 'k',
# etc
}
Try this:
with open('PT.txt', 'r') as fp:
    # strip the trailing newline, otherwise '\n' ends up in the dict as well
    s1 = fp.readline().strip()
    s2 = fp.readline().strip()
s = zip(s1, s2)
ans = {key : val for key,val in s}
with open("filename") as infile:
    lines = infile.readlines()
Note: Do not use string, or any other command, type or standard module name as a variable name.
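For reference, here is a small self-contained sketch of the two-line approach. The function name build_mapping is made up, and the temporary file only stands in for PT.txt so that the snippet can run on its own.

```python
import os
import tempfile


def build_mapping(path):
    """Map each character of the first line to the character below it."""
    with open(path) as infile:
        keys = infile.readline().strip()    # strip() drops the trailing newline
        values = infile.readline().strip()
    return dict(zip(keys, values))


# write a stand-in for PT.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("abcdefghijklmnopqrstuvwxyz\ngikaclmnqrpoxzybdefhjstuvw\n")

mapping = build_mapping(tmp.name)
os.remove(tmp.name)
print(mapping["a"], mapping["b"], mapping["c"])  # g i k
```

zip() stops at the shorter of the two lines, so a length mismatch would silently drop the extra keys.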
Sorry for asking, but I'm kind of new to these things. I'm splitting a text into words and putting them into a dict, creating an index for each token:
import re
f = open('/Users/Half_Pint_Boy/Desktop/sentenses.txt', 'r')
a=0
c=0
e=[]
for line in f:
    b=re.split('[^a-z]', line.lower())
    a+=len(list(filter(None, b)))
    c = c + 1
    e = e + b
d = dict(zip(e, range(len(e))))
But in the end I receive a dict with spaces in it like that:
{'': 633,
'a': 617,
'according': 385,
'adjacent': 237,
'allow': 429,
'allows': 459}
How can I remove "" from the final result in dict? Also how can I change the indexing after that to not use "" in index counting? (with "" the index count is 633, without-248)
Big thanks!
How about this?
b = list(filter(None, re.split('[^a-z]', line.lower())))
As an alternative:
b = re.findall('[a-z]+', line.lower())
Either way, you can then also remove that filter from the next line:
a += len(b)
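To see where the empty strings come from, here is a small comparison on a made-up sentence (the sample text is only an illustration). re.split cuts at every single non-letter character, so two separators in a row leave an empty string between them, while re.findall only returns the runs of letters.

```python
import re

line = "Hello, world! It's 2 late."

# splitting at every non-letter leaves '' between adjacent separators
split_tokens = re.split('[^a-z]', line.lower())
# findall returns only the runs of letters
found_tokens = re.findall('[a-z]+', line.lower())

print(split_tokens)
print(found_tokens)  # ['hello', 'world', 'it', 's', 'late']
```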
EDIT
As an aside, I think what you end up with here is a dictionary mapping words to the last position in which they appear in the text. I'm not sure if that's what you intended to do. E.g.
>>> dict(zip(['hello', 'world', 'hello', 'again'], range(4)))
{'world': 1, 'hello': 2, 'again': 3}
If you instead want to keep track of all the positions a word occurs, perhaps try this code instead:
from collections import defaultdict
import re
indexes = defaultdict(list)
with open('test.txt', 'r') as f:
    for index, word in enumerate(re.findall(r'[a-z]+', f.read().lower())):
        indexes[word].append(index)
indexes then maps each word to a list of indexes at which the word appears.
EDIT 2
Based on the comment discussion below, I think you want something more like this:
import re
word_positions = {}
with open('test.txt', 'r') as f:
    index = 0
    for word in re.findall(r'[a-z]+', f.read().lower()):
        if word not in word_positions:
            word_positions[word] = index
            index += 1
print(word_positions)
# Output:
# {'hello': 0, 'goodbye': 2, 'world': 1}
Your regex is not ideal. Consider using:
line = re.sub('[^a-z]*$', '', line.strip())
b = re.split('[^a-z]+', line.lower())
Replace:
d = dict(zip(e, range(len(e))))
With:
d = {word:n for n, word in enumerate(e) if word}
Alternatively, to avoid the empty entries in the first place, replace:
b=re.split('[^a-z]', line.lower())
With:
b=re.split('[^a-z]+', re.sub('(^[^a-z]+|[^a-z]+$)', '', line.lower()))
I am trying to put the following text file into a dictionary but I would like any section starting with '#' or empty lines ignored.
My text file looks something like this:
# This is my header info followed by an empty line
Apples 1 # I want to ignore this comment
Oranges 3 # I want to ignore this comment
#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*
Bananas 5 # I want to ignore this comment too!
My desired output would be:
myVariables = {'Apples': 1, 'Oranges': 3, 'Bananas': 5}
My Python code reads as follows:
filename = "myFile.txt"
myVariables = {}
with open(filename) as f:
    for line in f:
        if line.startswith('#') or not line:
            next(f)
        key, val = line.split()
        myVariables[key] = val
        print "key: " + str(key) + " and value: " + str(val)
The error I get:
Traceback (most recent call last):
File "C:/Python27/test_1.py", line 11, in <module>
key, val = line.split()
ValueError: need more than 1 value to unpack
I understand the error but I do not understand what is wrong with the code.
Thank you in advance!
Given your text:
text = """
# This is my header info followed by an empty line
Apples 1 # I want to ignore this comment
Oranges 3 # I want to ignore this comment
#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*
Bananas 5 # I want to ignore this comment too!
"""
We can do this in two ways: using regex, or using Python generators. I would choose the latter (described below), as regex is not particularly fast(er) in such cases.
To open the file:
with open('file_name.xyz', 'r') as file:
    # everything else below. Just substitute `for line in lines` with
    # `for line in file`
To create a similar input here, we split the text into a list of lines:
lines = text.split('\n') # as if read from a file using `open`.
Here is how we do all you want in a couple of lines:
# Discard all comments and empty values.
comment_less = filter(None, (line.split('#')[0].strip() for line in lines))
# Separate items and totals.
separated = {item.split()[0]: int(item.split()[1]) for item in comment_less}
Let's test:
>>> print(separated)
{'Apples': 1, 'Oranges': 3, 'Bananas': 5}
Hope this helps.
This doesn't exactly reproduce your error, but there's a problem with your code:
>>> x = "Apples\t1\t# This is a comment"
>>> x.split()
['Apples', '1', '#', 'This', 'is', 'a', 'comment']
>>> key, val = x.split()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: too many values to unpack
Instead try:
key = line.split()[0]
val = line.split()[1]
Edit: I think your "need more than 1 value to unpack" error is coming from the blank lines. Also, I'm not familiar with using next() like this. I would do something like:
if line.startswith('#') or line == "\n":
    pass
else:
    key = line.split()[0]
    val = line.split()[1]
To strip comments, you could use str.partition() which works whether the comment sign is present or not in the line:
for line in file:
    line, _, comment = line.partition('#')
    if line.strip(): # non-blank line
        key, value = line.split()
Unpacking line.split() may raise an exception in this code too: it happens if a non-blank line does not contain exactly two whitespace-separated words. What to do in that case is application dependent (ignore such lines, print a warning, etc.).
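A possible complete version of the partition approach, as a sketch only: it assumes that malformed lines should be reported and skipped and that the values are always integers (int() would raise ValueError otherwise). The function name parse_variables and the in-memory sample are illustrative.

```python
def parse_variables(lines):
    """Build {name: int} from 'name value  # comment' lines."""
    variables = {}
    for number, line in enumerate(lines, start=1):
        # keep only the part before the comment sign, if there is one
        content, _, _comment = line.partition('#')
        fields = content.split()
        if not fields:
            continue  # blank line or comment-only line
        if len(fields) != 2:
            print("skipping malformed line {}: {!r}".format(number, line))
            continue
        key, value = fields
        variables[key] = int(value)
    return variables


# stand-in for the lines of myFile.txt
sample = [
    "# This is my header info followed by an empty line\n",
    "\n",
    "Apples 1 # I want to ignore this comment\n",
    "Oranges 3 # I want to ignore this comment\n",
    "#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*\n",
    "Bananas 5 # I want to ignore this comment too!\n",
]
my_variables = parse_variables(sample)
print(my_variables)  # {'Apples': 1, 'Oranges': 3, 'Bananas': 5}
```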
You need to ignore empty lines and lines starting with #, then split what remains of each line after cutting off the comment, either by splitting on # or by using rfind to slice the string as below. An empty line still contains a newline, so you need line.strip() to check for one. You cannot just split on whitespace and unpack, because splitting leaves more than two elements, including the words of the comment:
with open("in.txt") as f:
    # note: rfind returns -1 when a line has no "#", which drops its last character
    d = dict(line[:line.rfind("#")].split() for line in f
             if not line.startswith("#") and line.strip())
print(d)
Output:
{'Apples': '1', 'Oranges': '3', 'Bananas': '5'}
Another option is to split twice and slice:
with open("in.txt") as f:
    d = dict(line.split(None,2)[:2] for line in f
             if not line.startswith("#") and line.strip())
print(d)
Or splitting twice and unpacking using an explicit loop:
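with open("in.txt") as f:
    d = {}
    for line in f:
        if not line.startswith("#") and line.strip():
            # assumes every data line has a trailing comment;
            # without one, the three-way unpacking raises ValueError
            k, v, _ = line.split(None, 2)
            d[k] = v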
You can also use itertools.groupby to group the lines you want.
from itertools import groupby
with open("in.txt") as f:
    grouped = groupby(f, lambda x: not x.startswith("#") and x.strip())
    d = dict(next(v).split(None, 2)[:2] for k, v in grouped if k)
print(d)
To handle where we have multiple words in single quotes we can use shlex to split:
import shlex
with open("in.txt") as f:
    d = {}
    for line in f:
        if not line.startswith("#") and line.strip():
            data = shlex.split(line)
            d[data[0]] = data[1]
print(d)
So changing the Banana line to:
Bananas 'north-side disabled' # I want to ignore this comment too!
We get:
{'Apples': '1', 'Oranges': '3', 'Bananas': 'north-side disabled'}
And the same will work for the slicing:
with open("in.txt") as f:
    d = dict(shlex.split(line)[:2] for line in f
             if not line.startswith("#") and line.strip())
print(d)
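As a variation on the shlex approach, shlex.split also accepts comments=True, which makes it drop everything from a # onward, so the comment check and the slicing are not needed. This is only a sketch on an in-memory sample that mirrors the file shown earlier, with the modified Bananas line.

```python
import shlex

# stand-in for the lines of in.txt
sample = [
    "# This is my header info followed by an empty line\n",
    "\n",
    "Apples 1 # I want to ignore this comment\n",
    "Oranges 3 # I want to ignore this comment\n",
    "#~*~*~*~*~*~*~*Another comment~*~*~*~*~*~*~*~*~*~*\n",
    "Bananas 'north-side disabled' # I want to ignore this comment too!\n",
]

d = {}
for line in sample:
    # comments=True removes the '#...' part before the tokens are returned
    tokens = shlex.split(line, comments=True)
    if len(tokens) >= 2:  # blank and comment-only lines give an empty list
        d[tokens[0]] = tokens[1]
print(d)
```

Note that this treats every # as the start of a comment, including one inside a quoted value.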
If the format of the file is correctly defined you can try a solution with regular expressions.
Here's just an idea:
import re
fruits = {}
with open('fruits_list.txt', mode='r') as f:
    for line in f:
        # raw string so that \s is not treated as a string escape
        match = re.match(r"([a-zA-Z0-9]+)\s+([0-9]+).*", line)
        if match:
            fruit_name, fruit_amount = match.groups()
            fruits[fruit_name] = fruit_amount
print(fruits)
UPDATED:
I changed the way of reading lines taking care of large files. Now I read line by line and not all in one. This improves the memory usage.
I have a text file with tuples in it that I would like to convert to a list with indices as follows:
2, 60;
3, 67;
4, 67;
5, 60;
6, 60;
7, 67;
8, 67;
Needs to become:
60, 2 5 6
67, 3 4 7 8
And so on with many numbers...
I've made it as far as reading in the file and getting rid of the punctuation and casting it as ints, but I'm not quite sure how to iterate through and add multiple items at a given index of a list. Any help would be much appreciated!
Here is my code so far:
with open('cues.txt') as f:
    lines = f.readlines()
arr = []
for i in lines:
    i = i.replace(', ', ' ')
    i = i.replace(';', '')
    i = i.replace('\n', '')
arr.append(i)
array = []
for line in arr: # read rest of lines
    array.append([int(x) for x in line.split()])
arr = []
#make array of first values 40 to 80
for i in range(40, 81):
    arr.append(i)
print arr
for j in range(0, len(array)):
    for i in array:
        if (i[0] == arr[j]):
            arr[i[0]].extend(i[1])
Do you need it in a list? If not, you can simply collect them into a dict:
i = {}
with open('cues.txt') as f:
    # strip the newline first, otherwise the ';' is not at the end of the string
    for (x, y) in (l.strip().rstrip(';').split(', ') for l in f):
        i.setdefault(y, []).append(x)
for k, v in i.items():
    print("{0}, {1}".format(k, " ".join(v)))
You could use the defaultdict class from the collections module.
from collections import defaultdict
with open('file') as f:
    l = []
    for line in f:
        l.append(tuple(line.replace(';','').strip().split(', ')))
m = defaultdict(list)
for i in l:
    m[i[1]].append(i[0])
for j in m:
    print(j + ", " + ' '.join(m[j]))
You can use a dict to store the index:
results = {}
with open("cues.txt") as f:
    for line in f:
        value, index = line.strip()[:-1].split(", ")
        if index not in results:
            results[index] = [value]
        else:
            results[index].append(value)
for index in results:
    print("{0}, {1}".format(index, " ".join(results[index])))
1) This code is wrong on many levels. See the inline comment:
arr = []
for i in lines:
    i = i.replace(', ', ' ')
    i = i.replace(';', '')
    i = i.replace('\n', '') # Wrong indentation. You will only get the last line in arr
arr.append(i)
You can simply do:
arr = []
for i in lines:
    i = i.strip().replace(';', '').split(", ")
    arr.append(i)
This removes the newline character, removes the ; and splits each line into an [index, value] list.
2) This code can be simplified to one line:
arr = [] # It should not be named `arr` because it overwrites the arr created in stage 1
for i in range(40, 81):
    arr.append(i)
print arr
becomes:
result = range(40, 81)
But it is not an ideal data structure for your problem. You should use a dictionary instead. In other words, you can drop this bit of code altogether.
3) Finally you are ready to iterate over arr and build the result:
from collections import defaultdict
result = defaultdict(list)
for a in arr:
    result[a[1]].append(a[0])
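Put together, the three steps could look roughly like this. It is a sketch on an in-memory copy of the sample data; the function name group_by_value is made up, and converting to int is an assumption so that the groups can be printed in numeric order.

```python
from collections import defaultdict


def group_by_value(lines):
    """Group the first number of each 'a, b;' line under its second number."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip().rstrip(';')
        if not line:
            continue  # skip blank lines
        first, second = (int(part) for part in line.split(','))
        groups[second].append(first)
    return groups


# stand-in for the lines of cues.txt
sample = ["2, 60;\n", "3, 67;\n", "4, 67;\n", "5, 60;\n",
          "6, 60;\n", "7, 67;\n", "8, 67;\n"]
groups = group_by_value(sample)
for value in sorted(groups):
    print("{0}, {1}".format(value, " ".join(str(i) for i in groups[value])))
# 60, 2 5 6
# 67, 3 4 7 8
```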
You should use a dict to store the data, as in the following code:
d = {}
with open('cues.txt') as f:
    lines = f.readlines()
for line in lines:
    line = line.split(',')
    key = line[1].strip()[0:-1]
    # dict.has_key() no longer exists in Python 3; `in` works everywhere
    if key in d:
        d[key].append(line[0])
    else:
        d[key] = [line[0]]
for key, value in d.items():
    print("{0}, {1}".format(key, " ".join(value)))