convert file to python dict - python

Here is my file that I want to convert to a python dict:
#
# DATABASE
#
Database name FooFileName
Database file FooDBFile
Info file FooInfoFile
Database ID 3
Total entries 8888
I have tried several things and I can't get it to convert to a dict. I ultimately want to be able to pick off the 'Database file' as a string. Thanks in advance.
Here is what I have tried already and the errors:
# ValueError: need more than 1 value to unpack
#d = {}
#for line in json_dump:
#for k,v in [line.strip().split('\n')]:
# for k,v in [line.strip().split(None, 1)]:
# d[k] = v.strip()
#print d
#print d['Database file']
# IndexError: list index out of range
#d = {}
#for line in json_dump:
# line = line.strip()
# parts = [p.strip() for p in line.split('/n')]
# d[parts[0]] = (parts[1], parts[2])
#print d

First, isolate the text that follows the last # line. One way to do that is a regular expression with re.search:
>>> import re
>>> s="""#
... # DATABASE
... #
... Database name FooFileName
... Database file FooDBFile
... Info file FooInfoFile
... Database ID 3
... Total entries 8888"""
>>> re.search(r'#\n([^#]+)',s).group(1)
'Database name FooFileName\nDatabase file FooDBFile\nInfo file FooInfoFile\nDatabase ID 3\nTotal entries 8888'
In this case you can also just use split: split the text on # and take the last element:
>>> s2=s.split('#')[-1]
Then build the dictionary with a dictionary comprehension over a list comprehension. re.split is a good choice here because the pattern r' {2,}' splits on runs of two or more spaces, so keys that contain a single space stay intact:
>>> {k:v for k,v in [re.split(r' {2,}',i) for i in s2.split('\n') if i]}
{'Database name': 'FooFileName', 'Total entries': '8888', 'Database ID': '3', 'Database file': 'FooDBFile', 'Info file': 'FooInfoFile'}

When each line is split on whitespace, the result is a list of three values, so we need three variables to hold them. We join the first and second values with a space to form the key, and the third value becomes its value. This may be the simplest approach; it only works while every data line has exactly three words, but it gets the job done and is easy to understand:
d = {}
for line in json_dump:
    if line.startswith('#'):
        continue
    # unpack the three words directly; looping over the list would try to unpack each word
    u, k, v = line.strip().split()
    d[u + " " + k] = v
print d
print d['Database file']

EDITED to reflect a line-wise regular expression approach.
Since it appears your file is not tab-delimited, you could use a regular expression to isolate the columns:
import re
#
# The rest of your code that loads up json_dump
#
d = {}
for line in json_dump:
    if line.startswith('#'): continue ## For filtering out comment lines
    line = line.strip()
    #parts = [p.strip() for p in line.split('/n')]
    try:
        (key, value) = re.split(r'\s\s+', line) ## Split the line of input using 2 or more consecutive white spaces as the delimiter
    except ValueError: continue ## Skip malformed lines
    #d[parts[0]] = (parts[1], parts[2])
    d[key] = value
print d
This yields this dictionary:
{'Database name': 'FooFileName', 'Total entries': '8888', 'Database ID': '3', 'Database file': 'FooDBFile', 'Info file': 'FooInfoFile'}
Which should allow you to isolate the individual values.
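For readers on Python 3, the same line-wise idea can be sketched as a small function. This is only an illustration: the name parse_key_value_lines is made up here, and it assumes (as the answers above do) that the real file separates key and value with at least two spaces, which the sample below reproduces.

```python
import re

def parse_key_value_lines(text):
    """Map each 'key  value' line to a dict entry, skipping '#' comment lines."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Split once, on the first run of two or more whitespace characters.
        parts = re.split(r'\s{2,}', line, maxsplit=1)
        if len(parts) == 2:  # silently skip lines that do not fit the layout
            key, value = parts
            result[key] = value
    return result

# Sample input; the column spacing is an assumption about the real file.
sample = """#
# DATABASE
#
Database name    FooFileName
Database file    FooDBFile
Info file        FooInfoFile
Database ID      3
Total entries    8888
"""

settings = parse_key_value_lines(sample)
print(settings['Database file'])
```

All values stay strings, so 'Database ID' would still need int() if a number is wanted.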

Related

Extract data between two lines from text file

Say I have hundreds of text files like this example :
NAME
John Doe
DATE OF BIRTH
1992-02-16
BIO
THIS is
a PRETTY
long sentence
without ANY structure
HOBBIES
//..etc..
NAME, DATE OF BIRTH, BIO, and HOBBIES (and others) are always there, but text content and the number of lines between them can sometimes change.
I want to iterate through the file and store the string between each of these keys. For example, a variable called Name should contain the value stored between 'NAME' and 'DATE OF BIRTH'.
This is what I came up with:
lines = f.readlines()
for line_number, line in enumerate(lines):
    if "NAME" in line:
        name = lines[line_number + 1] # In all files, Name is one line long.
    elif "DATE OF BIRTH" in line:
        date = lines[line_number + 2] # Date is also always two lines after
    elif "BIO" in line:
        for x in range(line_number + 1, line_number + 20): # Length of other data can be randomly bigger
            if "HOBBIES" not in lines[x]:
                bio += lines[x]
            else:
                break
    elif "HOBBIES" in line:
        #...
This works well enough, but I feel like instead of using many double loops, there must be a smarter and less hacky way to do it.
I'm looking for a general solution where NAME would store everything until DATE OF BIRTH, BIO would store everything until HOBBIES, and so on, with the intention of cleaning up and removing extra blank lines later.
Is it possible?
Edit : While I was reading through the answers, I realized I forgot a really significant detail, the keys will sometimes be repeated (in the same order).
That is, a single text file can contain more than one person. A list of persons should be created. The key Name signals the start of a new person.
I did it storing everything in a dictionary, see code below.
f = open("test.txt")
lines = f.readlines()
dict_text = {"NAME":[], "DATEOFBIRTH":[], "BIO":[]}
for line_number, line in enumerate(lines):
    if not ("NAME" in line or "DATE OF BIRTH" in line or "BIO" in line):
        text = line.replace("\n","")
        dict_text[location].append(text)
    else:
        location = "".join((line.split()))
You could use a regular expression:
import re
keys = """
NAME
DATE OF BIRTH
BIO
HOBBIES
""".strip().splitlines()
key_pattern = '|'.join(f'{key.strip()}' for key in keys)
pattern = re.compile(fr'^({key_pattern})', re.M)
# uncomment to see the pattern
# print(pattern)
with open(filename) as f:
    text = f.read()
parts = pattern.split(text)
... process parts ...
parts will be a list of strings. The odd-indexed positions (parts[1], parts[3], ...) will be the keys ('NAME', etc.) and the even-indexed positions (parts[2], parts[4], ...) will be the text in between the keys. parts[0] will be whatever came before the first key.
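One possible way to process parts, including the repeated-keys case from the question's edit, is sketched below. The function name parse_people, the rule that a key must sit alone on its line, and the second person in the sample are assumptions made for illustration; they are not part of the answer above.

```python
import re

KEYS = ['NAME', 'DATE OF BIRTH', 'BIO', 'HOBBIES']

def parse_people(text, keys=KEYS, start_key='NAME'):
    """Split text on key lines and group the pieces into one dict per person."""
    # Assumption: a key only counts when it fills a whole line on its own.
    alternatives = '|'.join(re.escape(key) for key in keys)
    pattern = re.compile(rf'^({alternatives})[ \t]*$', re.M)
    parts = pattern.split(text)
    people = []
    # parts[0] is whatever preceded the first key; after that: key, text, key, text, ...
    for key, body in zip(parts[1::2], parts[2::2]):
        if key == start_key or not people:
            people.append({})  # start_key marks the start of a new person
        people[-1][key] = ' '.join(body.split())  # collapse line breaks and extra spaces
    return people

# Hypothetical sample with two persons in one file.
sample = """NAME
John Doe
DATE OF BIRTH
1992-02-16
BIO
THIS is
a PRETTY
long sentence
HOBBIES
chess
NAME
Jane Roe
DATE OF BIRTH
1990-01-01
BIO
short
HOBBIES
none
"""

people = parse_people(sample)
print(people[0]['BIO'])
print(people[1]['NAME'])
```

Collapsing whitespace inside each value is a design choice for this sketch; keep body.strip() instead if the original line breaks matter.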
Instead of reading lines, you could read the file into one long string. Use string.index() to find the start index of each trigger word, then store everything from the end of one trigger word to the start of the next in a variable.
Something like:
def get_data(string, previous_end_index, current_start_index):
    usable_data = string[previous_end_index:current_start_index]
    return usable_data

string = f.read()  # read the contents; str(f) would only give the file object's repr
important_words = ['NAME', 'DATE OF BIRTH']
last_phrase = None
for phrase in important_words:
    phrase_start = string.index(phrase)
    phrase_end = phrase_start + len(phrase)
    if last_phrase is not None:
        # text between the previous trigger word and this one
        data = get_data(string, last_phrase, phrase_start)
    last_phrase = phrase_end
Better/shorter variable names should probably be used.
You can just read the text in as one long string and then make use of .split().
This will only work if the categories are in order and don't repeat.
Like so:
Categories = ["NAME", "DATE OF BIRTH", "BIO"]  # in the order they appear in the text
Output = {}
Text = f.read()  # read the contents; str(f) would only give the file object's repr
for i in range(1, len(Categories)):
    SplitText = Text.split(Categories[i])
    Output.update({Categories[i-1]: SplitText[0]})
    Text = SplitText[1]
Output.update({Categories[-1]: Text})
You can try the following.
keys = ["NAME","DATE OF BIRTH","BIO","HOBBIES"]
f = open("data.txt", "r")
result = {}
for line in f:
    line = line.strip('\n')
    if any(v in line for v in keys):
        last_key = line
    else:
        result[last_key] = result.get(last_key, "") + line
print(result)
Output
{'NAME': 'John Doe', 'DATE OF BIRTH': '1992-02-16', 'BIO ': 'THIS is a PRETTY long sentence without ANY structure ', 'HOBBIES ': '//..etc..'}

Python splitting data record

I have a record as below:
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355
0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103
0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
I want to split the data into key-value pairs, ignoring the first row (29 16).
The output should be something like this:
x = A , B
y = 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
I am able to neglect the first line using the below code:
f = open(fileName, 'r')
lines = f.readlines()[1:]
Now how do I separate the rest of the record in Python?
So here's my take :D I expect you'd want to have the numbers parsed as well?
def generate_kv(fileName):
    with open(fileName, 'r') as file:
        # ignore first line
        file.readline()
        for line in file:
            if '' == line.strip():
                # empty line
                continue
            values = line.split(' ')
            try:
                yield values[0], [float(x) for x in values[1:]]
            except ValueError:
                print(f'one of the elements was not a float: {line}')

if __name__ == '__main__':
    x = []
    y = []
    for key, value in generate_kv('sample.txt'):
        x.append(key)
        y.append(value)
    print(x)
    print(y)
assumes that the values in sample.txt look like this:
% cat sample.txt
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
and the output:
% python sample.py
['A', 'B']
[[1.2595034, 0.82587254, 0.7375044, 1.1270138, -0.35065323, 0.55985355, 0.7200067, -0.889543, 0.2300735, 0.56767654, 0.2789483, 0.32296127, -0.6423197, 0.26456305, -0.07363393, -1.0788593], [1.2467299, 0.78651106, 0.4702038, 1.204216, -0.5282698, 0.13987103, 0.5911153, -0.6729466, 0.377103, 0.34090135, 0.3052503, 0.028784657, -0.39129165, 0.079238065, -0.29310825, -0.99383247]]
Alternatively, if you want a dictionary, do:
if __name__ == '__main__':
    print(dict(generate_kv('sample.txt')))
That collects the key/value pairs from the generator into a dictionary and outputs:
{'A': [1.2595034, 0.82587254, 0.7375044, 1.1270138, -0.35065323, 0.55985355, 0.7200067, -0.889543, 0.2300735, 0.56767654, 0.2789483, 0.32296127, -0.6423197, 0.26456305, -0.07363393, -1.0788593], 'B': [1.2467299, 0.78651106, 0.4702038, 1.204216, -0.5282698, 0.13987103, 0.5911153, -0.6729466, 0.377103, 0.34090135, 0.3052503, 0.028784657, -0.39129165, 0.079238065, -0.29310825, -0.99383247]}
You can use this script if your file is a text file:
filename='file.text'
with open(filename) as f:
    data = f.readlines()[1:]  # skip the first line (29 16)
x=[data[0][0],data[1][0]]
y=[data[0][1:].strip(),data[1][1:].strip()]
If you're happy to store the data in a dictionary, here is what you can do:
records = dict()
with open(filename, 'r') as f:
    f.readline() # skip the first line
    for line in f:
        key, value = line.split(maxsplit=1)
        records[key] = value.split()
The structure of records would be:
{
    'A': ['1.2595034', '0.82587254', '0.7375044', ... ],
    'B': ['1.2467299', '0.78651106', '0.4702038', ... ]
}
What's happening
with ... as f opens the file within a context manager, which automatically closes the file when the block finishes.
Because the open file keeps track of where it is in the file, we can use f.readline() to move the pointer down a line.
line.split() turns a string into a list of strings. With the maxsplit=1 argument it splits only once, at the first run of whitespace.
e.g. after x, y = 'foo bar baz'.split(maxsplit=1), x = 'foo' and y = 'bar baz'
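Putting those three steps together, and converting the values to floats along the way, might look like the sketch below. The function name read_records is hypothetical, io.StringIO merely stands in for a real file so the example runs on its own, the sample rows are shortened, and each record is assumed to sit on a single line with at least one value.

```python
import io

def read_records(handle):
    """Return {label: [floats]} from an open text file, skipping the header line."""
    handle.readline()  # discard the first line ("29 16" in the question)
    records = {}
    for line in handle:
        if not line.strip():
            continue  # ignore blank lines
        key, rest = line.split(maxsplit=1)  # label first, everything else second
        records[key] = [float(token) for token in rest.split()]
    return records

# Stand-in for the real file, with shortened rows.
sample = io.StringIO(
    "29 16\n"
    "A 1.2595034 0.82587254 0.7375044\n"
    "B 1.2467299 0.78651106 0.4702038\n"
)
records = read_records(sample)
print(records['A'])
```

With a real file, the same function can be called as read_records(f) inside a with open(filename) as f: block.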
If I understood correctly, you want the numbers to be collected in a list. One way of doing this is:
import string
text = '''
29 16
A 1.2595034 0.82587254 0.7375044 1.1270138 -0.35065323 0.55985355 0.7200067 -0.889543 0.2300735 0.56767654 0.2789483 0.32296127 -0.6423197 0.26456305 -0.07363393 -1.0788593
B 1.2467299 0.78651106 0.4702038 1.204216 -0.5282698 0.13987103 0.5911153 -0.6729466 0.377103 0.34090135 0.3052503 0.028784657 -0.39129165 0.079238065 -0.29310825 -0.99383247
'''
lines = text.split('\n')
x = [
    line[1:].strip().split()
    for i, line in enumerate(lines)
    if line and line[0].lower() in string.ascii_letters]
This will produce a list of lists where the outer list has one entry per label (A, B, etc.) and the inner lists contain the numbers associated with A, B, etc.
This code assumes that you are interested in lines starting with any single letter (case-insensitive).
For more elaborate conditions you may want to look into regular expressions.
Obviously, if your text is in a file, you could substitute lines = ... with:
with open(filepath, 'r') as lines:
    x = ...
Also, if the items in x should not be separated, but rather kept as one string, you may want to replace line[1:].strip().split() with line[1:].strip().
Instead, if you want the numbers as float and not string, you should replace line[1:].strip().split() with [float(value) for value in line[1:].strip().split()].
EDIT:
Alternatively to line[1:].strip().split() you may want to do:
line.split(maxsplit=1)[1].split()
as suggested in some other answer. This would generalize better if the first token is not a single character.

a loop that is suppose to write lines to a file isnt working

I have a very large file that looks like this:
[original file][1]
field number 7 (info) contains ~100 pairs of X=Y separated by ';'.
I first want to split all X=Y pairs.
Next I want to scan one pair at a time, and if X is one of 4 titles and Y is an int, I want to put them in a dictionary.
After going through all the pairs I want to check whether the dictionary contains all 4 of my titles, and if so, I want to calculate something and write it into a new file.
This is the part of my code which is supposed to do that:
for row in reader:
    m = re.split(';',row[7]) # split the info field by ';'
    d = {}
    nl = []
    for c in m: # for each info field, split by '=', if it is one of the 4 fields wanted and the value is int- add it to a dict
        t = re.split('=',c)
        if (t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE') and type(t[1])==int:
            d[t[0]] = t[1]
    if 'AC_MALE' in d and 'AC_FEMALE' in d and 'AN_MALE' in d and 'AN_FEMALE' in d: # if the dict contain all 4 wanted fields- make a new line for the final file
        total_ac = int(d['AC_MALE'])+ int(d['AC_FEMALE'])
        total_an = int(d['AN_MALE'])+ int(d['AN_FEMALE'])
        ac_an = total_ac/total_an
        nl.extend([row[0],row[1],row[3],row[4],total_ac,total_an, ac_an])
        writer.writerow(nl)
The code runs with no errors but isn't writing anything to the file.
Can someone figure out why?
Thanks!
type(t[1])==int is never true. t[1] is a string, always, because you just split that object from another string. It doesn't matter here if the string contains only digits and could be converted to an int.
Test if you can convert your string to an integer, and if that fails, just move on to the next. If it succeeds, add the value to your dictionary:
for c in m:
    t = re.split('=',c)
    if (t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE'):
        try:
            d[t[0]] = int(t[1])
        except ValueError:
            # string could not be converted, so move on
            pass
Note that you don't need to use re.split(); use the standard str.split() method instead. You also don't need to test whether each key is present in your dictionary afterwards; just test whether the dictionary contains 4 elements, i.e. has a length of 4. You can also simplify the key-name test by using a set:
for row in reader:
    d = {}
    for key_value in row[7].split(';'):  # the pairs are separated by ';'
        key, value = key_value.split('=')
        if key in {'AC_MALE', 'AC_FEMALE', 'AN_MALE', 'AN_FEMALE'}:
            try:
                d[key] = int(value)
            except ValueError:
                pass
    if len(d) == 4:
        total_ac = d['AC_MALE'] + d['AC_FEMALE']
        total_an = d['AN_MALE'] + d['AN_FEMALE']
        ac_an = total_ac / total_an
        writer.writerow([
            row[0], row[1], row[3], row[4],
            total_ac, total_an, ac_an])
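The per-row logic can also be pulled into a small helper that is easy to test on its own. This is a sketch rather than the answerer's code: parse_info and WANTED are names invented here, the info string is made-up sample data, and str.partition is swapped in so that entries without an '=' are skipped instead of raising.

```python
WANTED = {'AC_MALE', 'AC_FEMALE', 'AN_MALE', 'AN_FEMALE'}

def parse_info(info):
    """Collect the wanted X=Y pairs from a ';'-separated field as integers."""
    found = {}
    for pair in info.split(';'):
        key, sep, value = pair.partition('=')
        if not sep or key not in WANTED:
            continue  # entry without '=' or a title we do not need
        try:
            found[key] = int(value)
        except ValueError:
            continue  # value is not an integer, skip it
    return found

# Made-up info field for illustration.
info = 'AC=3;AC_MALE=2;AC_FEMALE=1;AN_MALE=10;AN_FEMALE=10;DB;AF=0.15'
counts = parse_info(info)
if len(counts) == len(WANTED):
    total_ac = counts['AC_MALE'] + counts['AC_FEMALE']
    total_an = counts['AN_MALE'] + counts['AN_FEMALE']
    print(total_ac, total_an, total_ac / total_an)
```

In the original loop, parse_info(row[7]) would replace the inner for loop, and the writer.writerow call would go inside the length check.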

Look for pattern in a line and print following values in the brackets

I'm trying to extract some info from a file. The file has many lines like the one below
"names":["DNSCR"],"actual_names":["RADIO_R"],"castime":[2,4,6,8,10] ......
I want to search each line for names and castime and, if found, print the value in the brackets.
The values in the brackets change from line to line. For example, in the line above names is DNSCR and castime is 2,4,6,8,10, but the length might be different in the next line.
I have tried the following code, but it always gives me 10 characters, whereas I only need whatever is inside the brackets.
c_req = 10
keywords = ['"names":','"castime":']
with open('mylogfile.log') as searchfile:
    for line in searchfile:
        for key in keywords:
            left,sep,right = line.partition(key)
            if sep:
                print key + " = " + (right[:c_req])
This looks just like JSON. Are there braces around each line?
If so, the whole content is trivial to parse:
import json
test = '{"names":["DNSCR"],"actual_names":["RADIO_R"],"castime":[2,4,6,8,10]}'
result = json.loads(test)
print(result["names"], result["castime"])
You could also use a library like pandas to read the whole file into a dataframe if it matches a whole JSON file.
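If the lines turn out not to have surrounding braces, one hedged workaround is to add them before parsing. The sketch below assumes that every line is otherwise a valid comma-separated run of "key":value pairs with nothing after the last value except perhaps a trailing comma; parse_log_line is a name chosen for this example.

```python
import json

def parse_log_line(line):
    """Parse one '"key":value,...' line by wrapping it in braces first."""
    text = line.strip().rstrip(',')
    if not text.startswith('{'):
        text = '{' + text + '}'  # assumption: the line is a brace-less JSON object
    return json.loads(text)

lines = [
    '"names":["DNSCR"],"actual_names":["RADIO_R"],"castime":[2,4,6,8,10]',
    '"names":["FOO", "BAR"],"actual_names":["RADIO_R"],"castime":[1, 2, 3]',
]
for line in lines:
    record = parse_log_line(line)
    print(record['names'], record['castime'])
```

json.loads raises json.JSONDecodeError (a ValueError subclass) on lines that do not fit this assumption, so real log files would need a try/except around the call.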
Use a regular expression:
import re
# should contain all lines
lines = ['"names":["DNSCR"],"actual_names":["RADIO_R"],"castime":[2,4,6,8,10]']
# compiling the patterns once is more efficient for large files
names_pattern = re.compile(r'"names":\["(\w+)"\]')
castime_pattern = re.compile(r'"castime":\[(.+)\],?')
names, castimes = list(), list()
for line in lines:
    names.append(re.search(names_pattern, line).group(1))
    castimes.append(
        [int(num) for num in re.search(castime_pattern, line).group(1).split(',')]
    )
You still need to add exception handling and the file opening/reading.
Given mylogfile.log:
"names":["DNSCR"],"actual_names":["RADIO_R"],"castime":[2,4,6,8,10]
"names":["FOO", "BAR"],"actual_names":["RADIO_R"],"castime":[1, 2, 3]
Using regular expressions and ast.literal_eval.
import ast
import re
keywords = ['"names":', '"castime":']
keywords_name = ['names', 'castime']
d = {}
with open('mylogfile.log') as searchfile:
    for i, line in enumerate(searchfile):
        d['line ' + str(i)] = {}
        for key, key_name in zip(keywords, keywords_name):
            d['line ' + str(i)][key_name] = ast.literal_eval(re.search(key + r'\[(.*?)\]', line).group(1))
print(d)
#{ 'line 0': {'castime': (2, 4, 6, 8, 10), 'names': 'DNSCR'},
# 'line 1': {'castime': (1, 2, 3), 'names': ('FOO', 'BAR')}}
re.search(key + r'\[(.*?)\]', line).group(1) captures everything between the [] that follows each key.
ast.literal_eval() then strips the unneeded quotes and spaces from that string and automatically creates tuples when needed.
I also used enumerate to keep track of which line of the log file each result came from.

capturing the usernames after List: tag

I am trying to create a list named "userlist" with all the usernames listed beside "List:".
My idea is to find the line with "List:", split it on "," and put the pieces in a list.
However, I am not able to capture the line. Any input on how this can be achieved?
output=""" alias: tech.sw.host
name: tech.sw.host
email: tech.sw.host
email2: tech.sw.amss
type: email list
look_elsewhere: /usr/local/mailing-lists/tech.sw.host
text: List tech SW team
list_supervisor: <username>
List: username1,username2,username3,username4,
: username5
Members: User1,User2,
: User3,User4,
: User5 """
#print output
userlist = []
for line in output :
    if "List" in line:
        print line
If it were me, I'd parse the entire input so as to have easy access to every field (output is the input string from the question):
import collections
import StringIO
inFile = StringIO.StringIO(output)
d = collections.defaultdict(list)
for line in inFile:
    line = line.partition(':')
    key = line[0].strip() or key  # an empty label means a continuation line
    d[key] += [part.strip() for part in line[2].split(',') if part.strip()]
print d['List']
Using regex, str.translate and str.split (ph holds the input text; this is Python 2, where str.translate accepts None as the table):
>>> import re
>>> from string import whitespace
>>> strs = re.search(r'List:(.*)(\s\S*\w+):', ph, re.DOTALL).group(1)
>>> strs.translate(None, ':'+whitespace).split(',')
['username1', 'username2', 'username3', 'username4', 'username5']
You can also create a dict here, which will allow you to access any attribute:
def func(lis):
    return ''.join(lis).translate(None, ':'+whitespace)

# no flags needed: the third positional argument of re.split is maxsplit, not flags
lis = [x.split() for x in re.split(r'(?<=\w):', ph.strip())]
dic = {}
for x, y in zip(lis[:-1], lis[1:-1]):
    dic[x[-1]] = func(y[:-1]).split(',')
dic[lis[-2][-1]] = func(lis[-1]).split(',')
print dic['List']
print dic['Members']
print dic['alias']
Output:
['username1', 'username2', 'username3', 'username4', 'username5']
['User1', 'User2', 'User3', 'User4', 'User5']
['tech.sw.host']
Try this:
for line in output.split("\n"):
    if "List" in line:
        print line
When Python is asked to treat a string like a collection, it'll treat each character in that string as a member of that collection (as opposed to each line, which is what you're trying to accomplish).
You can tell this by printing each line:
>>> for line in ph:
...     print line
...
a
l
i
a
s
:
t
e
...
By the way, there are far better ways of handling this. I'd recommend taking a look at Python's built-in RegEx library: http://docs.python.org/2/library/re.html
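To make the character-versus-line point concrete, and to show one way of following the ':' continuation lines without a regex, here is a short Python 3 sketch (the thread itself is Python 2). The helper collect_fields is a hypothetical name, and the sample is a trimmed copy of the question's text with its layout assumed.

```python
sample = """List: username1,username2,username3,username4,
      : username5
Members: User1,User2,
       : User3,User4,
       : User5"""

# Iterating over the string itself yields characters, not lines.
first_items = list(sample)[:5]
print(first_items)

def collect_fields(text):
    """Group comma-separated values under their label, following ':' continuation lines."""
    fields = {}
    label = None
    for line in text.splitlines():  # splitlines() yields whole lines
        head, sep, tail = line.partition(':')
        if not sep:
            continue  # no colon, nothing to collect
        label = head.strip() or label  # an empty label continues the previous one
        if label is None:
            continue
        values = [item.strip() for item in tail.split(',') if item.strip()]
        fields.setdefault(label, []).extend(values)
    return fields

fields = collect_fields(sample)
userlist = fields['List']
print(userlist)
```

Because every labelled field ends up in the dictionary, fields['Members'] is available the same way.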
Try using strip() to remove the white space and line breaks before doing the check:
if 'List:' == line.strip()[:5]:
This should capture the line you need; then you can extract the usernames using split(','):
    usernames = [i.strip() for i in line.strip()[5:].split(',') if i.strip()]
Here are my two solutions, which are essentially the same, but the first is easier to understand.
import re
output = """ ... """
# First solution: join continuation lines, then look for List
# Join lines such as username5 with previous line
# List: username1,username2,username3,username4,
# : username5
# becomes
# List: username1,username2,username3,username4,username5
lines = re.sub(r',\s*:\s*', ',', output)
for line in lines.splitlines():
    label, values = [token.strip() for token in line.split(':')]
    if label == 'List':
        userlist = [user.strip() for user in values.split(',')]
        print 'Users:', ', '.join(userlist)
# Second solution, same logic as above
# Different means
tokens, = [line for line in re.sub(r',\s*:\s*', ',', output).splitlines()
           if 'List:' in line]
label, values = [token.strip() for token in tokens.split(':')]
userlist = [user.strip() for user in values.split(',')]
print 'Users:', ', '.join(userlist)
