Extract data using regular expressions in python

I'm trying to parse a file with serial numbers, part numbers, etc. and sort them into a structure. I would like to parse this file by keying off the identifiers, but I only really need the actual numbers/codes for my data structure. I need to assume that the numbers/codes vary in length; however, I can depend on the identifiers preceding the numbers/codes and on a line ending after each value.
//Text file with serials and information
Serial: 523524234235
Part Number: MHC-1251-A
Manufacturer: KNL-ETA
Serial: 523524281238
Part Number: QLC-851
Manufacturer: MHQ-MCE
.
.
.

On each line you can apply a regular expression to extract the desired part, like this:
>>> import re
>>> text = "Serial: 523524234235"
>>> m = re.search(r'Serial: (\d+)', text)
>>> m.group(1)
'523524234235'
You can also use split to get the two parts of each line and then check the first part to see what kind of token it is (Serial, Part Number, etc.).
Your regular expression needs some improvement: allow optional whitespace around the number.
m = re.search(r'Serial: (\d+)', text) ==> m = re.search(r'Serial:[\s]*(\d+)[\s]*', text)
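A quick check of the more tolerant pattern, as a minimal sketch. The sample strings are made up for illustration, and the pattern is a slightly shortened equivalent of the one suggested above:

```python
import re

# \s* accepts any amount of whitespace (including none) after the colon
pattern = re.compile(r'Serial:\s*(\d+)')

# Hypothetical variations of the same line
samples = ["Serial: 523524234235", "Serial:523524234235", "Serial:   523524234235  "]
serials = [pattern.search(s).group(1) for s in samples]
print(serials)
```

All three variations yield the same captured serial, which the original pattern with a single literal space would not.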

Open the file, read it line by line, and split each line on ':' to get your numbers. You can use a regex if the values are not laid out one per line.
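If the whole file is read into one string, a multiline regex can pull out every "identifier: value" pair in one pass. This is only a sketch under the assumption that each pair sits on its own line and that every record starts with Serial; the names pair and records are hypothetical:

```python
import re

text = """Serial: 523524234235
Part Number: MHC-1251-A
Manufacturer: KNL-ETA
Serial: 523524281238
Part Number: QLC-851
Manufacturer: MHQ-MCE
"""

# One "identifier: value" pair per line; re.M makes ^ and $ match at line boundaries
pair = re.compile(r'^[ \t]*([^:\n]+?)[ \t]*:[ \t]*(.+?)[ \t]*$', re.M)

records = []
for key, value in pair.findall(text):
    if key == 'Serial':
        # Assumption: a Serial line always opens a new record
        records.append({'Serial': value})
    elif records:
        records[-1][key] = value

print(records)
```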

I agree with @loki; from what you describe, a regex is not necessary. An appropriate structure extracted from a file like yours might be set up like this:
parts = {}  # data structure
entry = {}  # single set
for line in open('file.dat', 'r'):
    flds = [fld.strip() for fld in line.split(':')[:2]]
    if len(flds) > 1:
        k, v = flds
        if k == 'Serial':  # use serial number as key for corresponding entry
            entry = {}
            parts[v] = entry
        else:
            entry[k] = v  # save information in data set
Result:
{'523524234235': {'Part Number': 'MHC-1251-A', 'Manufacturer': 'KNL-ETA'}, '523524281238': {'Part Number': 'QLC-851', 'Manufacturer': 'MHQ-MCE'}, ...}

Related

Get the full word(s) by knowing only just a part of it

I am searching through a text file line by line and I want to get back all strings that contain the prefix AAAXX1234. For example, my text file has these lines:
Hello my ID is [123423819::AAAXX1234_3412] #I want that(AAAXX1234_3412)
Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212] #I want both of them(AAAXX1234_3413, AAAXX1234_4212)
Hello my ID is [123423819::XXWWF1234_3098] #I don't care about that
The code I have just checks whether the line starts with "Hello my ID is":
with open(file_hrd, 'r', encoding='utf-8') as hrd:
    hrd = hrd.readlines()
    for line in hrd:
        if line.startswith("Hello my ID is"):
            #do something

Try this:
import re
with open(file_hrd, 'r', encoding='utf-8') as hrd:
    res = []
    for line in hrd:
        # raw string so that \d reaches the regex engine unchanged
        res += re.findall(r'AAAXX1234_\d+', line)
print(res)
Output:
['AAAXX1234_3412', 'AAAXX1234_3413', 'AAAXX1234_4212']

I'd suggest parsing your lines and extracting the information into meaningful parts. That way, you can then use a simple startswith on the ID part of your line. In addition, this lets you control where you look for these prefixes, e.g. in case the lines contain additional data that could theoretically also contain something that looks like an ID.
Something like this:
if line.startswith('Hello my ID is '):
    idx_start = line.index('[')
    idx_end = line.index(']', idx_start)
    idx_separator = line.index(':', idx_start, idx_end)
    num = line[idx_start + 1:idx_separator]
    ids = line[idx_separator + 2:idx_end].split(':')
    print(num, ids)
This would give you the following output for your three example lines:
123423819 ['AAAXX1234_3412']
738281937 ['AAAXX1234_3413', 'AAAXX1234_4212']
123423819 ['XXWWF1234_3098']
With that information, you can then check the ids for a prefix:
if any(x.startswith('AAAXX1234') for x in ids):
    print('do something')
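Put together, the parse-then-filter idea could look like the sketch below. The helper name parse_line is hypothetical, and str.partition is swapped in for the index arithmetic; it assumes every relevant line has the form shown in the question:

```python
def parse_line(line):
    """Return (num, ids) for a 'Hello my ID is [num::id:id...]' line, else None."""
    if not line.startswith('Hello my ID is '):
        return None
    idx_start = line.index('[')
    idx_end = line.index(']', idx_start)
    # Split the bracket contents at the '::' between the number and the IDs
    num, _, rest = line[idx_start + 1:idx_end].partition('::')
    return num, rest.split(':')

lines = [
    "Hello my ID is [123423819::AAAXX1234_3412]",
    "Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212]",
    "Hello my ID is [123423819::XXWWF1234_3098]",
]

wanted = []
for line in lines:
    parsed = parse_line(line)
    if parsed:
        num, ids = parsed
        # Keep only the IDs carrying the prefix of interest
        wanted.extend(i for i in ids if i.startswith('AAAXX1234'))

print(wanted)
```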

Using regular expressions through the re module and its findall() function should be enough:
import re
with open('file.txt') as file:
    prefix = 'AAAXX1234'
    lines = file.read().splitlines()
    output = list()
    for line in lines:
        # rf'...' keeps \d intact while still interpolating the prefix
        output.extend(re.findall(rf'{prefix}_\d+', line))

You can do it with findall and the regex r'AAAXX1234_[0-9]+'. It finds all parts of the string that start with AAAXX1234_ and then grabs all of the digits after it. Change + to * if you want it to match 'AAAXX1234_' on its own as well.
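The difference between the two quantifiers is easy to see on a made-up line (the sample text is an assumption, not taken from the question):

```python
import re

# Hypothetical line containing a full ID, a bare prefix, and an unrelated ID
line = "ids: AAAXX1234_3413 AAAXX1234_ XXWWF1234_3098"

with_plus = re.findall(r'AAAXX1234_[0-9]+', line)  # requires at least one digit
with_star = re.findall(r'AAAXX1234_[0-9]*', line)  # digits are optional

print(with_plus)
print(with_star)
```

With +, the bare prefix is skipped; with *, it is returned as well.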

Matching variable number of occurrences of token using regex in python

I am trying to match a token multiple times, but I only get back the last occurrence, which I understand is the normal behavior as per this answer, but I haven't been able to get the solution presented there in my example.
My text looks something like this:
&{dict1_name}= key1=key1value key2=key2value
&{dict2_name}= key1=key1value
So basically multiple lines, each with a starting string, spaces, then a variable number of key pairs. If you are wondering where this comes from, it is a robot framework variables file that I am trying to transform into a python variables file.
I will be iterating per line to match the key pairs and construct a python dictionary from them.
My current regex pattern is:
&{([^ ]+)}=[ ]{2,}(?:[ ]{2,}([^\s=]+)=([^\s=]+))+
This correctly gets me the dict name but the key pairs only match the last occurrence, as mentioned above. How can I get it to return a tuple containing: ("dict1_name","key1","key1value"..."keyn","keynvalue") so that I can then iterate over this and construct the python dictionary like so:
dict1_name= {"key1": "key1value",..."keyn": "keynvalue"}
Thanks!

As you point out, you will need to work around the fact that capture groups will only catch the last match. One way to do so is to take advantage of the fact that lines in a file are iterable, and to use two patterns: one for the "line name", and one for its multiple keyvalue pairs:*
import re
dname = re.compile(r'^&{(?P<name>\w+)}=')
keyval = re.compile(r'(?P<key>\w+)=(?P<val>\w+)')
data = {}
with open('input/keyvals.txt') as f:
    for line in f:
        name = dname.search(line)
        if name:
            name = name.group('name')
            data[name] = dict(keyval.findall(line))
*Admittedly, this is a tad inefficient since you're conducting two searches per line. But for moderately sized files, you should be fine.
Result:
>>> from pprint import pprint
>>> pprint(data)
{'d5': {'key1': '28f_s', 'key2': 'key2value'},
 'name1': {'key1': '5', 'key2': 'x'},
 'othername2': {'key1': 'key1value', 'key2': '7'}}
Note that \w matches Unicode word characters.
Sample input, keyvals.txt:
&{name1}= key1=5 key2=x
&{othername2}= key1=key1value key2=7
&{d5}= key1=28f_s key2=aaa key2=key2value
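To see why a single pattern with a repeated group keeps only the last pair, and why findall sidesteps that, here is a small sketch. The pattern is a simplified stand-in for the one in the question, not the asker's exact expression:

```python
import re

line = "&{dict1_name}=  key1=key1value  key2=key2value"

# A repeated capture group is overwritten on every repetition,
# so only the final key/value pair survives in groups()
repeated = re.match(r'&\{(\w+)\}=(?:\s+(\w+)=(\w+))+', line)
print(repeated.groups())

# findall returns one tuple per non-overlapping match instead
pairs = re.findall(r'(\w+)=(\w+)', line)
print(dict(pairs))
```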

You could use two regexes, one for the names and another for the items, applying the one for the items after the first space:
import re
lines = ['&{dict1_name}= key1=key1value key2=key2value',
         '&{dict2_name}= key1=key1value']
name = re.compile(r'^&\{(\w+)\}=')
item = re.compile(r'(\w+)=(\w+)')
for line in lines:
    n = name.search(line).group(1)
    i = '{{{}}}'.format(','.join("'{}' : '{}'".format(m.group(1), m.group(2)) for m in item.finditer(' '.join(line.split()[1:]))))
    exec('{} = {}'.format(n, i))
    print(locals()[n])
Output
{'key2': 'key2value', 'key1': 'key1value'}
{'key1': 'key1value'}
Explanation
The '^&\{(\w+)\}=' matches an '&' followed by a word (\w+) surrounded by curly braces '\{', '\}'. The second regex matches any words joined by a '='. The line:
i = '{{{}}}'.format(','.join("'{}' : '{}'".format(m.group(1), m.group(2)) for m in item.finditer(' '.join(line.split()[1:]))))
creates a dictionary literal, finally you create a dictionary with the required name using exec. You can access the value of the dictionary querying locals.
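If creating variables by name is not strictly required, the same two regexes can fill an ordinary dict keyed by name, which avoids exec. This is an alternative sketch, not part of the answer above; the container name dicts is hypothetical:

```python
import re

lines = ['&{dict1_name}=  key1=key1value  key2=key2value',
         '&{dict2_name}=  key1=key1value']

name = re.compile(r'^&\{(\w+)\}=')
item = re.compile(r'(\w+)=(\w+)')

dicts = {}
for line in lines:
    m = name.search(line)
    if m:
        # Start looking for key=value pairs only after the name part
        dicts[m.group(1)] = dict(item.findall(line, m.end()))

print(dicts['dict1_name'])
```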

Use two expressions in combination with a dict comprehension:
import re
junkystring = """
lorem ipsum
&{dict1_name}= key1=key1value key2=key2value
&{dict2_name}= key1=key1value
lorem ipsum
"""
rx_outer = re.compile(r'^&{(?P<dict_name>[^{}]+)}(?P<values>.+)', re.M)
rx_inner = re.compile(r'(?P<key>\w+)=(?P<value>\w+)')
result = {m_outer.group('dict_name'): {m_inner.group('key'): m_inner.group('value')
                                       for m_inner in rx_inner.finditer(m_outer.group('values'))}
          for m_outer in rx_outer.finditer(junkystring)}
print(result)
Which produces
{'dict1_name': {'key1': 'key1value', 'key2': 'key2value'},
'dict2_name': {'key1': 'key1value'}}
With the two expressions being
^&{(?P<dict_name>[^{}]+)}(?P<values>.+)
# the outer format
See a demo on regex101.com. And the second
(?P<key>\w+)=(?P<value>\w+)
# the key/value pairs
See a demo for the latter on regex101.com as well.
The rest is simply sorting the different expressions in the dict comprehension.

Building off of Brad's answer, I made some modifications. As mentioned in my comment on his reply, it failed at empty lines or comment lines. I modified it to ignore these and continue. I also added handling of spaces: it now matches spaces in dictionary names but replaces them with underscore since python cannot have spaces in variable names. Keys are left untouched since they are strings.
import re
def robot_to_python(filename):
    """
    This function can be used to convert robot variable files containing dicts to a python
    variables file containing python dict that can be imported by both python and robot.
    """
    dname = re.compile(r"^&{(?P<name>.+)}=")
    keyval = re.compile(r"(?P<key>[\w|:]+)=(?P<val>[\w|:]+)")
    data = {}
    with open(filename + '.robot') as f:
        for line in f:
            n = dname.search(line)
            if n:  # skip empty lines and comment lines
                name = n.group("name").replace(" ", "_")
                if name:
                    data[name] = dict(keyval.findall(line))
    # the with block closes the output file, so no explicit close() is needed
    with open(filename + '.py', 'w') as file:
        for dictionary in data.items():
            dict_name = dictionary[0]
            file.write(dict_name + " = { \n")
            keyvals = dictionary[1]
            for k in sorted(keyvals.keys()):
                file.write("'%s':'%s', \n" % (k, keyvals[k]))
            file.write("}\n\n")

Parsing paragraph out of text file in Python?

I am trying to parse certain paragraphs out of multiple text files and store them in a list. All the text files have a format similar to this:
MODEL NUMBER: A123
MODEL INFORMATION: some info about the model
DESCRIPTION: This will be a description of the Model. It
could be multiple lines but an empty line at the end of each.
CONCLUSION: Sold a lot really profitable.
Now I can pull out the information where it's one line, but I am having trouble when I encounter something that spans multiple lines (like 'Description'). The description length is not known, but I know it ends with an empty line (which would mean using '\n'). This is what I have so far:
import os
dir = 'Test'
DESCRIPTION = []
for files in os.listdir(dir):
    if files.endswith('.txt'):
        with open(dir + '/' + files) as File:
            reading = File.readlines()
            for num, line in enumerate(reading):
                if 'DESCRIPTION:' in line:
                    Start_line = num
                if len(line.strip()) == 0:
I don't know if it's the best approach, but what I was trying to do with if len(line.strip()) == 0: is to create a list of blank lines and then find the first value greater than Start_line. I saw this Bisect.
In the end I would like my data to look like this if I say print Description:
['DESCRIPTION: Description from file 1',
 'DESCRIPTION: Description from file 2',
 'DESCRIPTION: Description from file 3']
Thanks.

Regular expression. Think about it this way: you have a pattern that will allow you to cut any file into pieces you will find palatable: "newline followed by capital letter"
re.split is your friend
Take a string
"THE
BEST things
in life are
free
IS
YET
TO
COME"
As a string:
import re
p = "THE\nBEST things\nin life are\nfree\nIS\nYET\nTO\nCOME"
c = re.split(r'\n(?=[A-Z])', p)
Which produces list c
['THE', 'BEST things\nin life are\nfree', 'IS', 'YET', 'TO', 'COME']
I think you can take it from there: this separates each file into a list of strings, with each string being its own section, including its sub-contents. From there you can find the "DESCRIPTION" element and store it. It is important to note that, the way I've set up the regex, it recognizes the pattern "newline and then capital letter" but cuts only at the newline; the capital letter sits in a lookahead, (?=[A-Z]), so it is not consumed and stays at the start of the next section.
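Applied to the file format from the question, the split could be used roughly like this. It is a sketch that assumes continuation lines of a description never start with a capital letter (otherwise they would be cut off as a new section) and that the sample text below matches the real files:

```python
import re

# Sample file content, reconstructed from the question
text = (
    "MODEL NUMBER: A123\n"
    "MODEL INFORMATION: some info about the model\n"
    "DESCRIPTION: This will be a description of the Model. It\n"
    "could be multiple lines but an empty line at the end of each.\n"
    "\n"
    "CONCLUSION: Sold a lot really profitable.\n"
)

# Cut at one or more newlines that are followed by a capital letter
sections = re.split(r'\n+(?=[A-Z])', text)

# Keep the DESCRIPTION section and collapse its line breaks into spaces
descriptions = [' '.join(s.split()) for s in sections if s.startswith('DESCRIPTION:')]
print(descriptions)
```

Running the same steps per file and extending one shared list would give the list of descriptions the question asks for.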

python - How to extract strings from each line in text file?

I have a text file that detects the amount of monitors that are active.
I want to extract specific data from each line and include it in a list.
The text file looks like this:
[EnumerateDevices]: Enumerating Devices.
DISPLAY\LGD03D7\4&ACE0355&1&UID68092928 : Generic PnP Monitor
DISPLAY\ABCF206\4&ACE0355&1&UID51249920 : Generic PnP Monitor
//
// here can be more monitors...
//
2 matching device(s) found.
I need to get the number after the UID in the middle of the text: 68092928, 51249920, ...
I thought of doing the following:
a. go through each line in the text
b. see if the "UID" string exists
c. if it exists: split (here I don't know how to do it: split by (" ") or ("&"))
Is there any good idea you can advise? I don't understand how I can get the numbers after the UID (if the next number is longer than the previous ones, for example).
How can I get a command that does: ("If you see the UID string, get all the data until you see the first blank")?
any idea?
Thanks

I would use a regular expression to extract the UID, e.g.
import re
regexp = re.compile(r'UID(\d+)')
# raw string, so the backslashes in the device paths are kept literally
file = r"""[EnumerateDevices]: Enumerating Devices.
DISPLAY\LGD03D7\4&ACE0355&1&UID68092928 : Generic PnP Monitor
DISPLAY\ABCF206\4&ACE0355&1&UID51249920 : Generic PnP Monitor
//
// here can be more monitors...
//
2 matching device(s) found."""
print(re.findall(regexp, file))

Use regular expressions:
import re
p = re.compile(r'.*UID(\d+)')
with open('infile') as infile:
    for line in infile:
        m = p.match(line)
        if m:
            print(m.group(1))  # first capture group: the digits after UID

You can use the split() method.
s = "hello this is a test"
words = s.split(" ")
print(words)
The output of the above snippet is a list containing: ['hello', 'this', 'is', 'a', 'test']
In your case, you can split on the substring "UID" and grab the second element in the list to get the number that you're looking for.
See docs here: https://docs.python.org/2/library/string.html#string.split
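For one of the lines from the question, the two-step split might look like this minimal sketch (the variable names after_uid and uid are only for illustration):

```python
# Raw string so the backslashes in the device path stay literal
line = r"DISPLAY\LGD03D7\4&ACE0355&1&UID68092928 : Generic PnP Monitor"

if "UID" in line:
    # Everything after the first "UID"
    after_uid = line.split("UID")[1]
    # Keep only the text up to the first blank
    uid = after_uid.split()[0]
    print(uid)
```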

This is a bit esoteric but does the trick with some list comprehension:
[this.split("UID")[1].split()[0] for this in txt.split("\n") if "UID" in this]
the output is the list you are looking for I presume: ['68092928', '51249920']
Explanations:
split the text into rows (split("\n"))
select only rows with UID inside (for this in ... if "UID" in this)
in the remaining rows, split using "UID".
You want to keep only one element after UID hence the [1]
The resulting string contains the id and some text separated by a space so, we use a second split(), defaulting to spaces.

>>> for line in s.splitlines():
...     line = line.strip()
...     if "UID" in line:
...         tmp = line.split("UID")
...         uid = tmp[1].split(':')[0]
...         print("UID " + uid)
...
UID 68092928
UID 51249920

You can use the find() method:
idx = line.find('UID')
if idx != -1:
    # skip the 3 characters of 'UID' and keep the text up to the first blank
    print(line[idx + 3:].split()[0])
Docs https://docs.python.org/2/library/string.html#string.find

If you read the whole file at once, use the loop below; if you read line by line instead, just change the first line to line.split():
for elem in file.split():
    if 'UID' in elem:
        print(elem.split('UID')[1])
The split will have already stripped the "junk", so each elem that contains the 'UID' string is ready to pass to int() or to print as a string.

Splitting lines in a file into string and hex and do operations on the hex values

I have a large file with several lines as given below. I want to read in only those lines which have the _INIT pattern in them, then strip off the _INIT from the name and save only the OSD_MODE_15_H part in a variable. Then I need to read the corresponding hex value, 8'h00 in this case, strip off the 8'h from it, replace it with 0x, and save it in a variable.
I have been trying to strip off the _INIT, the spaces, and the =, and the code is becoming really messy.
localparam OSD_MODE_15_H_ADDR = 16'h038d;
localparam OSD_MODE_15_H_INIT = 8'h00
Can you suggest a lean and clean method to do this?
Thanks!

The following solution uses a regular expression (compiled to speed searching up) to match the relevant lines and extract the needed information. The expression uses named groups "id" and "hexValue" to identify the data we want to extract from the matching line.
import re
expression = r"(?P<id>\w+?)_INIT\s*?=.*?'h(?P<hexValue>[0-9a-fA-F]*)"
regex = re.compile(expression)
def getIdAndValueFromInitLine(line):
    mm = regex.search(line)
    if mm is None:
        return None  # Not the ..._INIT parameter or line was empty or other mismatch happened
    else:
        return (mm.groupdict()["id"], "0x" + mm.groupdict()["hexValue"])
EDIT: If I understood the next task correctly, you need to find the hex values of those INIT and ADDR lines whose IDs match and make a dictionary of the INIT hex value to the ADDR hex value. Here lines is the file content as a single string; re.finditer is used because it yields match objects, which have groupdict().
regex = r"(?P<init_id>\w+?)_INIT\s*?=.*?'h(?P<initValue>[0-9a-fA-F]*)"
init_dict = {}
for x in re.finditer(regex, lines):
    init_dict[x.groupdict()["init_id"]] = "0x" + x.groupdict()["initValue"]
regex = r"(?P<addr_id>\w+?)_ADDR\s*?=.*?'h(?P<addrValue>[0-9a-fA-F]*)"
addr_dict = {}
for y in re.finditer(regex, lines):
    addr_dict[y.groupdict()["addr_id"]] = "0x" + y.groupdict()["addrValue"]
init_to_addr_hexvalue_dict = {init_dict[x] : addr_dict[x] for x in init_dict.keys() if x in addr_dict}
Even if this is not what you actually need, having init and addr dictionaries might help to achieve your goal easier. If there are several _INIT (or _ADDR) lines with the same ID and different hexvalues then the above dict approach will not work in a straight forward way.
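For the case of repeated IDs, one possible workaround is to collect every value per ID in a list instead of overwriting it. This is a sketch only; the third input line is made up to create a duplicate, and init_values is a hypothetical name:

```python
import re
from collections import defaultdict

# The last line is a made-up duplicate of the same ID with a different value
lines = """localparam OSD_MODE_15_H_ADDR = 16'h038d;
localparam OSD_MODE_15_H_INIT = 8'h00
localparam OSD_MODE_15_H_INIT = 8'h1f
"""

init_re = re.compile(r"(?P<init_id>\w+?)_INIT\s*=.*?'h(?P<initValue>[0-9a-fA-F]+)")

# Each ID maps to the list of all hex values seen for it, in file order
init_values = defaultdict(list)
for m in init_re.finditer(lines):
    init_values[m.group("init_id")].append("0x" + m.group("initValue"))

print(dict(init_values))
```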

Try something like this. I'm not sure what all your requirements are, but this should get you close:
with open(someFile, 'r') as infile:
    for line in infile:
        if '_INIT' in line:
            apostropheIndex = line.find("'h")
            clean_hex = '0x' + line[apostropheIndex + 2:]
In the case of "16'h038d;", clean_hex would be "0x038d;" (need to remove the ";" somehow) and in the case of "8'h00", clean_hex would be "0x00"
Edit: if you want to guard against characters like ";" you could do this and test if a character is alphanumeric:
clean_hex = '0x' + ''.join([s for s in line[apostropheIndex + 2:] if s.isalnum()])

You can use a regular expression and the re.findall() function. For example, to generate a list of tuples with the data you want, just try:
import re
lines = open("your_file").read()
regex = r"([\w]+?)_INIT\s*=\s*\d+'h([\da-fA-F]*)"
res = [(x[0], "0x"+x[1]) for x in re.findall(regex, lines)]
print(res)
The regular expression is very specific for your input example. If the other lines in the file are slightly different you may need to change it a bit.
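Since the goal is to do operations on the hex values, the "0x..." strings can be turned into integers with int(text, 16), which accepts the 0x prefix. A small sketch; the second tuple is made up for illustration:

```python
# Shape of the list built above; the second entry is hypothetical
res = [("OSD_MODE_15_H", "0x00"), ("OSD_MODE_16_H", "0x1f")]

# int() with base 16 accepts an optional "0x" prefix
values = {name: int(hex_text, 16) for name, hex_text in res}

print(values)
print(hex(values["OSD_MODE_16_H"] + 1))  # arithmetic works on the integers
```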
