How to read varying data into a dictionary? - python

I need to extract the names of the constants and their corresponding values from a .txt file into a dictionary, where key = name of the constant and value = float.
The start of the file looks like this:
speed of light 299792458.0 m/s
gravitational constant 6.67259e-11 m**3/kg/s**2
Planck constant 6.6260755e-34 J*s
elementary charge 1.60217733e-19 C
How do I get the names of the constants easily?
This is my attempt:
with open('constants.txt', 'r') as infile:
    file1 = infile.readlines()
constants = {i.split()[0]: i.split()[1] for i in file1[2:]}
I'm not getting it right with split(), and I need a little correction!

{' '.join(line.split()[:-2]):' '.join(line.split()[-2:]) for line in lines}

I can't tell from your text file how many spaces separate the columns, so the code below scans each line character by character instead. It works for the file excerpt shown above.
import string
valid_char = string.ascii_letters + ' '
# 'e', '+' and '-' are needed for values in scientific notation such as 6.67259e-11
valid_numbers = string.digits + '.e+-'
constants = {}
with open('constants.txt') as file1:
    for line in file1:
        # Collect letters and spaces until the first other character: that is the name
        key = ''
        for index, char in enumerate(line):
            if char in valid_char:
                key += char
            else:
                key = key.strip()
                break
        # Collect the number that starts where the name ended
        value = ''
        for char in line[index:]:
            if char in valid_numbers:
                value += char
            else:
                break
        constants[key] = float(value)
print(constants)

Have you tried using regular expressions?
For example,
([a-zA-Z]|\s)*
matches the first part of a line, up to the point where the digits of the value begin.
Python provides a very good tutorial on regular expressions (regex):
https://docs.python.org/2/howto/regex.html
You can try out your regex online as well:
https://regex101.com/

with open('constants.txt', 'r') as infile:
    lines = infile.readlines()
constants = {' '.join(line.split()[:-2]):float(' '.join(line.split()[-2:-1])) for line in lines[2:]}
The [2:] slice skips the first two lines of the file, which are not needed.
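As a hedged alternative to counting columns from the end, here is a minimal sketch that treats the first token that parses as a float as the value and everything before it as the name. The function name parse_constants and the sample list are illustrative and not from the original post. One caveat: float() also accepts tokens such as 'inf' or 'nan', so a name containing such a word would be cut short.

```python
def parse_constants(lines):
    """Map constant name -> float value for lines shaped like 'name words value unit'."""
    constants = {}
    for line in lines:
        tokens = line.split()
        for i, token in enumerate(tokens):
            try:
                value = float(token)
            except ValueError:
                continue  # still part of the name
            if i > 0:  # ignore lines that start with a number (no name)
                constants[' '.join(tokens[:i])] = value
            break  # everything after the value is the unit
    return constants


# Sample lines copied from the question, plus a blank line that is skipped
sample = [
    "speed of light 299792458.0 m/s",
    "gravitational constant 6.67259e-11 m**3/kg/s**2",
    "",
    "Planck constant 6.6260755e-34 J*s",
    "elementary charge 1.60217733e-19 C",
]
constants = parse_constants(sample)
```

Lines without any numeric token (headers, blank lines) simply produce no entry, so no fixed slice such as [2:] is needed under these assumptions.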

This would best be solved using a regexp.
Focusing on your question (how to get the names) and your wish for something shorter:
import re
# The regular expression captures all characters
# up to the first occurrence of a digit
REGEXP = re.compile(r'^([a-zA-Z\s]+)\d.*$')
with open('tst.txt', 'r') as f:
    for line in f:
        match = REGEXP.match(line)
        if match:
            # On a match, the part between parentheses
            # is copied to the first group
            name = match.group(1).strip()
        else:
            # Raise something, or change regexp :)
            pass
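The same idea can be extended to capture the value as well. The following is only a sketch: the pattern CONSTANT_RE and the function read_constants are hypothetical names, and the number sub-pattern assumes plain decimal or scientific notation as in the sample file.

```python
import re

# Group 1: the name (letters and whitespace); group 2: the number that follows it
CONSTANT_RE = re.compile(
    r'^([A-Za-z\s]+?)\s+([-+]?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)'
)


def read_constants(lines):
    """Build {name: float(value)} from lines such as 'Planck constant 6.6260755e-34 J*s'."""
    constants = {}
    for line in lines:
        match = CONSTANT_RE.match(line)
        if match:
            constants[match.group(1).strip()] = float(match.group(2))
    return constants


# Sample lines copied from the question
sample = [
    "speed of light 299792458.0 m/s",
    "gravitational constant 6.67259e-11 m**3/kg/s**2",
    "Planck constant 6.6260755e-34 J*s",
    "elementary charge 1.60217733e-19 C",
]
constants = read_constants(sample)
```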

What about re.split?
import re
lines = open(r"C:\txt.txt",'r').readlines()
for line in lines:
    data = re.split(r'\s{3,}',line)
    print("{0} : {1}".format(data[0],''.join(data[1:])))
Or use a one-liner to build the dictionary:
{k:v.strip() for k,v in [(re.split(r'\s{3,}',line)[0],''.join(re.split(r'\s{3,}',line)[1:])) for line in open(r"C:\txt.txt",'r').readlines() ]}
Output:
gravitational constant : 6.67259e-11m**3/kg/s**2
Planck constant : 6.6260755e-34J*s
elementary charge : 1.60217733e-19C
Dictionary:
{'Planck constant': '6.6260755e-34J*s', 'elementary charge': '1.60217733e-19C', 'speed of light': '299792458.0m/s', 'gravitational constant': '6.67259e-11m**3/kg/s**2'}

Related

Adding words between lines to an array

This is the content of my file:
david C001 C002 C004 C005 C006 C007
* C008 C009 C010 C011 C016 C017 C018
* C019 C020 C021 C022 C023 C024 C025
anna C500 C521 C523 C547 C555 C556
* C557 C559 C562 C563 C566 C567 C568
* C569 C571 C572 C573 C574 C575 C576
* C578
charlie C701 C702 C704 C706 C707 C708
* C709 C712 C715 C716 C717 C718
I want my output to be:
david=[C001,C002,C004,C005,C006,C007,C008,C009,C010,C011,C016,C017,C018,C019,C020,C021,C022,C023,C024,C025]
anna=[C500,C521,C523,C547,C555,C556,C557,C559,C562,C563,C566,C567,C568,C569,C571,C572,C573,C574,C575,C576,C578]
charlie=[C701,C702,C704,C706,C707,C708,C709,C712,C715,C716,C717,C718]
I am able to create:
david=[C001,C002,C004,C005,C006,C007]
anna=[C500,C521,C523,C547,C555,C556]
charlie=[C701,C702,C704,C706,C707,C708]
I do this by counting the number of words in a line, using line[0] as the array name and adding the remaining words to the array.
However, I don't know how to add the words on the continuation lines (the ones starting with "*") to the array.
Can anyone help?
NOTE: This solution relies on dictionaries (and therefore defaultdict) preserving insertion order, which CPython has done since Python 3.6 and which the language guarantees from Python 3.7.
A somewhat naive approach:
from collections import defaultdict
# Create a dictionary of people
people = defaultdict(list)
# Open up your file in read-only mode
with open('your_file.txt', 'r') as f:
    # Iterate over all lines, stripping them and splitting them into words
    for line in filter(bool, map(str.split, map(str.strip, f))):
        # Retrieve the name of the person
        # either from the current line or use the name of the last person processed
        name, words = list(people)[-1] if line[0] == '*' else line[0], line[1:]
        # Add all remaining words to that person's record
        people[name].extend(words)
print(people['anna'])
# ['C500', 'C521', 'C523', 'C547', 'C555', 'C556', 'C557', 'C559', 'C562', 'C563', 'C566', 'C567', 'C568', 'C569', 'C571', 'C572', 'C573', 'C574', 'C575', 'C576', 'C578']
It also has the additional benefit of returning an empty list for unknown names:
print(people['matt'])
# []
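If relying on dictionary ordering is undesirable, a variant that remembers the current name in a variable avoids it. This is a sketch only; the function name group_continuations is made up for illustration, and it assumes that a continuation line always starts with a lone "*" token.

```python
def group_continuations(lines):
    """Collect the codes for each name, following '*' continuation lines."""
    groups = {}
    current = None  # name that the next '*' line belongs to
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        if tokens[0] == '*':
            if current is None:
                raise ValueError("continuation line before any name")
            groups[current].extend(tokens[1:])
        else:
            current = tokens[0]
            groups.setdefault(current, []).extend(tokens[1:])
    return groups


# Sample lines copied from the question
sample = [
    "anna C500 C521 C523 C547 C555 C556",
    "* C557 C559 C562 C563 C566 C567 C568",
    "* C569 C571 C572 C573 C574 C575 C576",
    "* C578",
    "charlie C701 C702 C704 C706 C707 C708",
    "* C709 C712 C715 C716 C717 C718",
]
groups = group_continuations(sample)
```

Unlike the defaultdict version, looking up an unknown name here raises KeyError rather than silently creating an empty entry; which behaviour is preferable depends on the use case.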
You could read the lists into a dictionary using regular expressions:
import re
with open('file_name') as file:
    contents = file.read()
    res_list = re.findall(r"[a-z]+\s+[^a-z]+",contents)
res_dict = {}
for p in res_list:
    elt = p.split()
    res_dict[elt[0]] = [e for e in elt[1:] if e != '*']
print(res_dict)
I figured out a way myself. Thanks to those who posted their own solutions; they gave me a new perspective.
Below is my code:
persons_library={}
persons=['david','anna','charlie']
for person in persons:
    persons_library[person]=[]
with open('data.txt','r') as f:
    for line in f:
        line=line.replace('*',"")
        line=line.split()
        for val in line:
            # A known name switches the current key; anything else is a code for that key
            if val in persons_library:
                key=val
            else:
                persons_library[key].append(val)
print(persons_library)

Comparing multiple file items using re

Currently I have a script that finds all the lines across multiple input files that have something in the format of
Matches: 500 (54.3 %) and prints out the top 10 highest matches in percentage.
I want to be able to have it also output the top 10 lines for score ex: Score: 4000
import re
def get_values_from_file(filename):
    f = open(filename)
    winpat = re.compile("([\d\.]+)\%")
    xinpat = re.compile("[\d]") #ISSUE, is this the right regex for it? Score: 500****
    values = []
    scores = []
    for line in f.readlines():
        if line.find("Matches") >=0:
            percn = float(winpat.findall(line)[0])
            values.append(percn)
        elif line.find("Score") >=0:
            hey = float(xinpat.findall(line)[0])
            scores.append(hey)
    return (scores,values)
all_values = []
all_scores = []
for filename in ["out0.txt", "out1.txt"]:#and so on
    values = get_values_from_file(filename)
    all_values += values
    all_scores += scores
all_values.sort()
all_values.reverse()
all_scores.sort() #also for scores
all_scores.reverse()
print(all_values[0:10])
print(all_scores[0:10])
Is my regex for the score format correct? I believe that's where I am having the issue, as it doesn't output both correctly.
Any thoughts? Should I split it into two functions?
Thank you.
Is my regex for the score format correct?
No, it should be r"\d+".
You don't need []. Those brackets establish a character class representing all of the characters inside the brackets. Since you only have one character type inside the bracket, they do nothing.
You only match a single character. You need a * or a + to match a sequence of characters.
You have an unescaped backslash in your string. Use the r prefix to allow the regular expression engine to see the backslash.
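A quick way to see the difference between the two patterns is to run both against a sample line. This is just an illustrative check; the sample string follows the "Score: 4000" format from the question.

```python
import re

line = "Score: 4000"

# A character class without a quantifier matches one digit at a time
single_digits = re.findall(r"[\d]", line)   # ['4', '0', '0', '0']

# With '+', each run of consecutive digits is one match
whole_number = re.findall(r"\d+", line)     # ['4000']

score = float(whole_number[0])              # 4000.0
```

With the original pattern, findall(line)[0] is only the first digit, so the script would record 4.0 instead of 4000.0.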
Commentary:
If it were me, I'd let the regular expression do all of the work, and skip line.find() altogether:
#UNTESTED
def get_values_from_file(filename):
    # \s* before % allows for the space in "Matches: 500 (54.3 %)"
    winpat = re.compile(r"Matches:\s*\d+\s*\(([\d\.]+)\s*%\)")
    xinpat = re.compile(r"Score:\s*(\d+)")
    values = []
    scores = []
    # Note: "with open() as f" automatically closes f
    with open(filename) as f:
        # Note: "for line in f" is more memory efficient
        # than "for line in f.readlines()"
        for line in f:
            # search() finds the pattern anywhere in the line
            win = winpat.search(line)
            xin = xinpat.search(line)
            # group(1) is the captured number; group(0) would be the whole match
            if win: values.append(float(win.group(1)))
            if xin: scores.append(float(xin.group(1)))
    return (scores,values)
Just for fun, here is a version of the routine which calls re.findall exactly once per file:
# TESTED
# Compile this only once to save time
# The flags are passed to re.compile() because newer Python versions
# reject an inline (?mx) that is not at the very start of the pattern
pat = re.compile(r'''
    (?:Matches:\s*\d+\s*\(([\d\.]+)\s*%\))  # "Matches: 300 (43.2%)"
    |
    (?:Score:\s*(\d+))                      # "Score: 4000"
    ''', re.MULTILINE | re.VERBOSE)
def get_values_from_file(filename):
    with open(filename) as f:
        values, scores = zip(*pat.findall(f.read()))
    values = [float(value) for value in values if value]
    scores = [float(score) for score in scores if score]
    return scores, values
No. xinpat will only match single digits, so findall() will return a list of single digits, which is a bit messy. Change it to
xinpat = re.compile(r"[\d]+")
Actually, you don't need the square brackets here, so you could simplify it to
xinpat = re.compile(r"\d+")
BTW, the names winpat and xinpat are a bit opaque. The pat bit is ok, but win & xin? And hey isn't great either. But I guess xin and hey are just temporary names you made up when you decided to expand the program.
Another thing I just noticed, you don't need to do
all_values.sort()
all_values.reverse()
You can (and should) do that in one hit:
all_values.sort(reverse=True)
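When only the top few entries are needed, heapq.nlargest from the standard library is another option. The sketch below also unpacks the (scores, values) tuple that the function in the question returns, which the original loop does not do. The per_file data is invented purely for illustration.

```python
import heapq


def top_n(numbers, n=10):
    """Return the n largest numbers in descending order."""
    return heapq.nlargest(n, numbers)


# Hypothetical per-file results, in the (scores, values) order
# returned by get_values_from_file() in the question
per_file = [
    ([4000.0, 1200.0], [54.3, 12.0]),
    ([3100.0], [77.5]),
]

all_scores, all_values = [], []
for scores, values in per_file:   # unpack the tuple instead of adding it whole
    all_scores += scores
    all_values += values

top_scores = top_n(all_scores, 2)
top_values = top_n(all_values, 2)
```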

Python: Finding values after searching for a string in a text files

I'm new to the world of python and I'm trying to extract values from multiple text files. I can open up the files fine with a loop, but I'm looking for a straight forward way to search for a string and then return the value after it.
My results text files look like this
SUMMARY OF RESULTS
Max tip rotation =,-18.1921,degrees
Min tip rotation =,-0.3258,degrees
Mean tip rotation =,-7.4164,degrees
Max tip displacement =,6.9956,mm
Min tip displacement =,0.7467,mm
Mean tip displacement = ,2.4321,mm
Max Tsai-Wu FC =,0.6850
Max Tsai-Hill FC =,0.6877
So I want to be able to search for say 'Max Tsai-Wu =,' and it return 0.6850
I want to be able to search for the string as the position of each variable might change at a later date.
Sorry for posting such an easy question, just can't seem to find a straight forward robust way of finding it.
Any help would be greatly appreciated!
Matt
You can make use of regex:
import re
regexp = re.compile(r'Max Tsai-Wu.*?([0-9.-]+)')
with open('input.txt') as f:
    for line in f:
        match = regexp.match(line)
        if match:
            print(match.group(1))
prints:
0.6850
UPD: getting results into the list
import re
regexp = re.compile(r'Max Tsai-Wu.*?([0-9.-]+)')
result = []
with open('input.txt') as f:
    for line in f:
        match = regexp.match(line)
        if match:
            result.append(match.group(1))
My favorite way is to test if the line starts with the desired text:
keyword = 'Max Tsai-Wu'
if line.startswith(keyword):
And then split the line at the commas and return the value:
try:
    return float(line.split(',')[1])
except ValueError:
    # treat the error
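Put together, the two fragments above might look like the following sketch. The function name find_value and the choice to return None on failure are assumptions made for illustration; the sample lines are taken from the question.

```python
def find_value(lines, keyword):
    """Return the number after the first comma on the line starting with keyword."""
    for line in lines:
        if line.startswith(keyword):
            fields = line.split(',')
            try:
                return float(fields[1])
            except (IndexError, ValueError):
                # No second field, or it is not a number
                return None
    return None  # keyword not found


# Sample lines copied from the question
sample = [
    "SUMMARY OF RESULTS",
    "Max tip rotation =,-18.1921,degrees",
    "Mean tip displacement = ,2.4321,mm",
    "Max Tsai-Wu FC =,0.6850",
    "Max Tsai-Hill FC =,0.6877",
]
tsai_wu = find_value(sample, 'Max Tsai-Wu')
```

Because the search is by keyword, the order of the lines in the results file does not matter.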
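You can use a regular expression to find both name and value:
import re
# The value is whatever follows "=," up to the next comma or the end of the line,
# so lines without a unit (e.g. 'Max Tsai-Wu FC =,0.6850') match as well
RE_VALUE = re.compile(r'(.*?)\s*=\s*,([^,\n]*)')
def test():
    line = 'Max tip rotation =,-18.1921,degrees'
    rx = RE_VALUE.search(line)
    if rx:
        print('[%s] value: [%s]' % (rx.group(1), rx.group(2)))
test()
This way, reading the file line by line, you can fill a dictionary.
My regex relies on the value following the "=," separator and ending at the next comma, or at the end of the line when there is no unit.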
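Filling a dictionary line by line, as suggested, could look like the sketch below. The names RESULT_RE and parse_results are hypothetical, and the pattern is a variant written for this example: it tolerates a space before the comma and a missing unit, both of which occur in the sample from the question.

```python
import re

# Group 1: the name before '='; group 2: the first field after the comma
RESULT_RE = re.compile(r'^(.*?)\s*=\s*,\s*([^,\s]+)')


def parse_results(lines):
    """Build {name: float(value)} from lines such as 'Max tip rotation =,-18.1921,degrees'."""
    results = {}
    for line in lines:
        match = RESULT_RE.match(line)
        if not match:
            continue  # e.g. the 'SUMMARY OF RESULTS' heading
        try:
            results[match.group(1)] = float(match.group(2))
        except ValueError:
            pass  # value field is not numeric; skip it
    return results


# Sample lines copied from the question
sample = [
    "SUMMARY OF RESULTS",
    "Max tip rotation =,-18.1921,degrees",
    "Mean tip displacement = ,2.4321,mm",
    "Max Tsai-Wu FC =,0.6850",
]
results = parse_results(sample)
```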
If the files aren't that big, you could simply do:
import re
files = ['list', 'of', 'files']  # placeholder: put your file names here
for f in files:
    with open(f) as myfile:
        print(re.search(r'Max Tsai-Wu.*?=,(.+)', myfile.read()).group(1))

Splitting lines in a file into string and hex and do operations on the hex values

I have a large file with several lines like the ones given below. I want to read in only those lines which have the _INIT pattern in them, strip the _INIT from the name, and save only the OSD_MODE_15_H part in a variable. Then I need to read the corresponding hex value, 8'h00 in this case, strip off the 8'h, replace it with 0x, and save it in a variable.
I have been trying to strip off the _INIT, the spaces and the =, and the code is becoming really messy.
localparam OSD_MODE_15_H_ADDR = 16'h038d;
localparam OSD_MODE_15_H_INIT = 8'h00
Can you suggest a lean and clean method to do this?
Thanks!
The following solution uses a regular expression (compiled to speed searching up) to match the relevant lines and extract the needed information. The expression uses named groups "id" and "hexValue" to identify the data we want to extract from the matching line.
import re
expression = r"(?P<id>\w+?)_INIT\s*?=.*?'h(?P<hexValue>[0-9a-fA-F]*)"
regex = re.compile(expression)
def getIdAndValueFromInitLine(line):
    mm = regex.search(line)
    if mm is None:
        return None # Not the ..._INIT parameter or line was empty or other mismatch happened
    else:
        return (mm.groupdict()["id"], "0x" + mm.groupdict()["hexValue"])
EDIT: If I understood the next task correctly, you need to find the hexvalues of those INIT and ADDR lines whose IDs match and make a dictionary of the INIT hexvalue to the ADDR hexvalue.
# Here "lines" is the whole file content as one string
regex = r"(?P<init_id>\w+?)_INIT\s*?=.*?'h(?P<initValue>[0-9a-fA-F]*)"
init_dict = {}
# finditer() yields match objects; findall() would return plain tuples without groupdict()
for x in re.finditer(regex, lines):
    init_dict[x.groupdict()["init_id"]] = "0x" + x.groupdict()["initValue"]
regex = r"(?P<addr_id>\w+?)_ADDR\s*?=.*?'h(?P<addrValue>[0-9a-fA-F]*)"
addr_dict = {}
for y in re.finditer(regex, lines):
    addr_dict[y.groupdict()["addr_id"]] = "0x" + y.groupdict()["addrValue"]
init_to_addr_hexvalue_dict = {init_dict[x] : addr_dict[x] for x in init_dict.keys() if x in addr_dict}
Even if this is not what you actually need, having init and addr dictionaries might help to achieve your goal easier. If there are several _INIT (or _ADDR) lines with the same ID and different hexvalues then the above dict approach will not work in a straight forward way.
Try something like this. I'm not sure what all your requirements are, but this should get you close:
with open(someFile, 'r') as infile:
    for line in infile:
        if '_INIT' in line:
            apostropheIndex = line.find("'h")
            # strip() drops the trailing newline
            clean_hex = '0x' + line[apostropheIndex + 2:].strip()
In the case of "16'h038d;", clean_hex would be "0x038d;" (need to remove the ";" somehow) and in the case of "8'h00", clean_hex would be "0x00"
Edit: if you want to guard against characters like ";" you could do this and test if a character is alphanumeric:
clean_hex = '0x' + ''.join([s for s in line[apostropheIndex + 2:] if s.isalnum()])
You can use a regular expression and the re.findall() function. For example, to generate a list of tuples with the data you want just try:
import re
lines = open("your_file").read()
regex = r"([\w]+?)_INIT\s*=\s*\d+'h([\da-fA-F]*)"
res = [(x[0], "0x"+x[1]) for x in re.findall(regex, lines)]
print(res)
The regular expression is very specific for your input example. If the other lines in the file are slightly different you may need to change it a bit.
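Since the goal is to do operations on the hex values, it may help to convert them to integers right away and format them back to 0x strings only when needed. The sketch below assumes the `<width>'h<digits>` literal format shown in the question; the names INIT_RE and parse_init_params are invented for the example, and the third sample line is hypothetical.

```python
import re

# Group 1: parameter name without the _INIT suffix; group 2: the hex digits
INIT_RE = re.compile(r"(\w+?)_INIT\s*=\s*\d+'h([0-9a-fA-F]+)")


def parse_init_params(text):
    """Map each *_INIT parameter name (suffix removed) to its integer value."""
    return {name: int(digits, 16) for name, digits in INIT_RE.findall(text)}


sample = (
    "localparam OSD_MODE_15_H_ADDR = 16'h038d;\n"
    "localparam OSD_MODE_15_H_INIT = 8'h00\n"
    "localparam OSD_MODE_16_H_INIT = 8'h1f;\n"   # hypothetical extra line
)
params = parse_init_params(sample)

# Format back to a zero-padded 0x string when a textual form is needed
as_text = {name: format(value, '#04x') for name, value in params.items()}
```

Keeping integers internally means arithmetic and bitwise operations work directly, and the _ADDR line is ignored because it does not contain _INIT.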

Help parsing text file in python

I've really been struggling with this one for some time now. I have many text files in a specific format, from which I need to extract all the data and file it into different fields of a database. The struggle is tweaking the parameters for parsing and ensuring I get all the info correctly.
The format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) getting rid of whitespace, and 2) separating the fields from each other. See my best attempt below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
    if line.isspace():
        pass
    elif line.startswith('There is'):
        pass
    elif line.startswith('Max', 2):
        pass
    elif line.startswith('K'):
        pass
    else:
        for word in line.split():
            if word.startswith('4'):
                tmp_PA = word
            elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
                tmp_K = word
            else:
                tmp_DETAILS = word
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',(tmp_PA,tmp_K,tmp_DETAILS))
At the minute, I can pull the K & PA fields no problem using this; however, my DETAILS is only pulling one word. I need the entire sentence, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
K
You are splitting the whole line into words. You need to split it into the first word, the second word and the rest, like line.split(None, 2).
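A minimal sketch of that split, using one of the sample lines from the question (the helper name split_record is made up for illustration). Note that with a maxsplit the remainder keeps its trailing whitespace, so it is stripped explicitly.

```python
def split_record(line):
    """Split a line into (k, pa, details); return None if it has fewer than three fields."""
    parts = line.split(None, 2)   # split on whitespace, at most twice
    if len(parts) != 3:
        return None
    k, pa, details = parts
    return k, pa, details.rstrip()  # drop the trailing newline/spaces


record = split_record("2 4565434 i need this sentace as one DB record\n")
blank = split_record("   \n")
```

The header line and the unwanted lines would still need to be filtered out separately, for example by checking that the first two fields are numeric.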
I would probably use regular expressions, and use the opposite logic: if the line starts with a number 1 through 5, use it, otherwise skip it. Like:
import re
pattern = re.compile(r'([12345])\s+(\d+)\s+(.*\S)')
f = open('XX.txt', 'r') # No calling readlines; lazy iteration is better
for line in f:
    m = pattern.match(line)
    if m:
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   (m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using a prepared statement. Parsing SQL is orders of magnitude slower than executing it.
If I understand correctly your file format, you can try this script
filename = 'bug.txt'
f = open(filename,'r')
foundHeaders = False
records = []
for rawline in f:
    line = rawline.strip()
    if not foundHeaders:
        # Skip everything up to and including the header line
        tokens = line.split()
        if tokens == ['K','PA','DETAILS']:
            foundHeaders = True
        continue
    else:
        tokens = line.split(None,2)
        if len(tokens) != 3:
            break
        try:
            K = int(tokens[0])
            PA = int(tokens[1])
        except ValueError:
            break
        records.append((K,PA,tokens[2]))
f.close()
for r in records:
    print(r) # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re
import re
stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    if whitey.match(line):
        pass
    elif header.match(line):
        pass
    elif juicy_info.match(line):
        result = juicy_info.search(line)
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re
stuff = open("source", "r").readlines()
# Create a regular expression using subpatterns.
# 'first', 'second' and 'third' are our own tags;
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
    result = juicy_info.search(line)
    if result: # do stuff with the data here; just use the tags we declared earlier
        print(result.group('third'))
        print(result.group('second'))
        print(result.group('first'))
import re
reg = re.compile(r'K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'
                 + 3 * r'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
    mat = reg.search(f.read())
for tripl in ((2,1,3),(5,4,6),(8,7,9)):
    cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
               mat.group(*tripl))
I prefer to use [ \t] instead of \s because \s matches the following characters:
blank, '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol that matches more than what is to be matched, at the risk of matching stray newlines in places where they shouldn't be.
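The risk can be demonstrated with a small, invented input in which the first record has no DETAILS text. With \s+ the pattern runs across the line break and swallows the next record; with [ \t]+ the incomplete line is simply not matched.

```python
import re

# Invented sample: the first line is missing its DETAILS field
text = "2 4565434\n5 4879870 as well as this one\n"

with_s = re.findall(r'^(\d)\s+(\d+)\s+(.*)$', text, re.MULTILINE)
with_tabs = re.findall(r'^(\d)[ \t]+(\d+)[ \t]+(.*)$', text, re.MULTILINE)

# with_s    -> [('2', '4565434', '5 4879870 as well as this one')]
# with_tabs -> [('5', '4879870', 'as well as this one')]
```

In the first result the second \s+ has consumed the newline, so the whole following line ends up as the details of the wrong record.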
Edit
It may be sufficient to do:
import re
reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$',re.MULTILINE)
with open('XX.txt') as f:
    for mat in reg.finditer(f.read()):
        cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
                   mat.group(2,1,3))
