Dynamically Read the Format of a String, Python

I have a lookup table of Scientific Names for plants. I want to use this lookup table to validate other tables where I have a data entry person entering the data. Sometimes they get the formatting of these scientific names wrong, so I am writing a script to try to flag the errors.
There's a very specific way to format each name. For example, 'Sonchus arvensis L.' specifically needs to have the S in Sonchus capitalized as well as the L at the end. I have about 1000 different plants and each one is formatted differently. Here are a few more examples:
Linaria dalmatica (L.) Mill.
Knautia arvensis (L.) Coult.
Alliaria petiolata (M. Bieb.) Cavara & Grande
Berteroa incana (L.) DC.
Aegilops cylindrica Host
As you can see, all of these strings are formatted very differently (i.e., some letters are capitalized, some aren't, and there are sometimes brackets, ampersands, periods, etc.).
My question is, is there any way to dynamically read the formatting of each string in the lookup table so that I can compare that to the value the data entry person entered to make sure it is formatted properly? In the script below, I test (first elif) to see if the value is in the lookup table by capitalizing all values in order to make the match work, regardless of formatting. In the next test (second elif) I can sort of test formatting by comparing against the lookup table value for value. This will return unmatched records based on formatting, but it doesn't specifically tell you why the unmatched record returned.
What I would like to do is read in the string values from the lookup table and somehow dynamically read the formatting of each string, so that I can specifically identify the error (e.g. a letter should have been capitalized but wasn't).
So far my code snippet looks like this:
# Determine if the field heading is in a list I built earlier
if "SCIENTIFIC_NAME" in fieldnames:
    # First, test to see if the record is empty
    if not row.SCIENTIFIC_NAME:
        weedPLineErrors.append("SCIENTIFIC_NAME record is empty")
    # Second, test to see if the value is in the lookup table, regardless of formatting.
    elif row.SCIENTIFIC_NAME.upper() not in [x.upper() for x in weedScientificTableList]:
        weedPLineErrors.append("SCIENTIFIC_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not in the domain table")
    # Third, if the second test is satisfied, we know the value is in the lookup table. We can then test the lookup table again, without capitalizing everything, to see if there is an exact match to account for formatting.
    elif row.SCIENTIFIC_NAME not in weedScientificTableList:
        weedPLineErrors.append("SCIENTIFIC_NAME (" + row.SCIENTIFIC_NAME + ")" + " is not formatted properly")
    else:
        pass
I hope my question is clear enough. I looked at string templates, but I don't think they do what I want...at least not dynamically. If anyone can point me in a better direction, I am all eyes...but maybe I am way out to lunch on this one.
Thanks,
Mike

To get around the punctuation problem, you can use regular expressions.
>>> import re
>>> def tokenize(s):
...     return re.split('[^A-Za-z]+', s)  # Split by anything that isn't a letter
...
>>> tokens = tokenize('Alliaria petiolata (M. Bieb.) Cavara & Grande')
>>> tokens
['Alliaria', 'petiolata', 'M', 'Bieb', 'Cavara', 'Grande']
To get around the capitalization problem, you can use
>>> tokens = [s.lower() for s in tokens]
From there, you could rewrite the entry in a standardized format, such as
>>> ## I'm not sure exactly what format you're looking for
>>> first, second, third = [s.capitalize() for s in tokens[:3]]
>>> "%s %s (%s)" % (first, second, third)
'Alliaria Petiolata (M)'
This probably isn't the exact formatting that you want, but maybe that will get you headed in the right direction.

You can build a dictionary of the names from the lookup table. Assuming that you have the names stored in a list (call it correctList), you can write a function that removes all formatting and converts the name to a single case, then store the result in a dictionary keyed by the unformatted name. For example, the following sample code builds the dictionary:
def removeFormatting(name):
    name = name.replace("(", "").replace(")", "")
    name = name.replace(".", "")
    ...
    return name.lower()
formattingDict = dict([(removeFormatting(i), i) for i in correctList])
Now you can compare the strings input by the data entry person. Let's say they are in a list called inputList.
for name in inputList:
    unformattedName = removeFormatting(name)
    lookedUpName = formattingDict.get(unformattedName, "")
    if not lookedUpName:
        print("Spelling mistake:", name)
    elif lookedUpName != name:
        print("Formatting error")
        print(differences(name, lookedUpName))
The differences function could be filled with rules that check for brackets, periods, and so on:
def differences(inputName, lookedUpName):
    mismatches = []
    # Check for brackets
    if "(" in lookedUpName:
        if "(" not in inputName:
            mismatches.append("Bracket missing")
    ...
    # Add more rules
    return mismatches
Does that answer your question a bit?
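As a complement to hand-written rules, a generic character-level comparison can report where an entry deviates from the lookup value. The following is only a sketch, assuming Python 3 and the standard-library difflib module; the function name describe_differences and the wording of the messages are hypothetical, not part of the answers above.

```python
import difflib

def describe_differences(entered, expected):
    """List how `entered` deviates from `expected`, character by character."""
    notes = []
    matcher = difflib.SequenceMatcher(None, entered, expected)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        got, want = entered[i1:i2], expected[j1:j2]
        if tag == "replace" and got.lower() == want.lower():
            # Same letters, different capitalization
            notes.append("wrong case at position %d: %r should be %r" % (i1, got, want))
        elif tag == "replace":
            notes.append("wrong text at position %d: %r should be %r" % (i1, got, want))
        elif tag == "delete":
            notes.append("unexpected %r at position %d" % (got, i1))
        else:  # tag == "insert": something in the lookup value is absent
            notes.append("missing %r at position %d" % (want, i1))
    return notes

print(describe_differences("sonchus arvensis L.", "Sonchus arvensis L."))
print(describe_differences("Berteroa incana L DC.", "Berteroa incana (L.) DC."))
```

This could be called only after a case-insensitive or punctuation-insensitive lookup has already found the matching entry, so that the two strings being compared are known to refer to the same plant.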

Related

pythonic method for extracting numeric digits from string

I am developing a program to read through a CSV file and create a dictionary of information from it. Each line in the CSV is essentially a new dictionary entry with the delimited objects being the values.
As one subpart of task, I need to extract an unknown number of numeric digits from within a string. I have a working version, but it does not seem very pythonic.
An example string looks like this:
variable = Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]
Here, variable is the name of the string in the Python code, and its value represents the variable name within a MODBUS. I want to extract just the digits prior to the .WORD_type[0], which relate to the number of bytes the string is packed into.
Here is my working code, note this is nested within a for statement iterating through the lines in the CSV. var_length and var_type are some of the keys, i.e. {"var_length": var_length}
if re.search(".+_ST[0-9]{1,2}\\.WORD_type.+", variable):
    var_type = "string"
    temp = re.split("\\.", variable)
    temp = re.split("_", temp[2])
    temp = temp[-1]
    var_length = int(str.lstrip(temp, "ST")) / 2

You could maybe try using matching groups like so:
import re
variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"
matches = re.match(r".+_ST(\d+)\.WORD_type.+", variable)
if matches:
    print(matches[1])
matches[0] has the full match and matches[1] contains the matched group.
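To connect this back to the question, the captured digits can be converted directly into var_length. This is a minimal sketch, assuming Python 3 and that the halving in the original code is meant as integer division; the function name string_length_from_name is hypothetical.

```python
import re

# Capture the digits between "_ST" and ".WORD_type"
ST_PATTERN = re.compile(r"_ST(\d+)\.WORD_type")

def string_length_from_name(variable):
    """Return half of the number after _ST, or None if the name does not match."""
    match = ST_PATTERN.search(variable)
    if match is None:
        return None
    return int(match.group(1)) // 2

variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"
print(string_length_from_name(variable))
```

Returning None for non-matching names lets the caller decide whether a line describes a string variable at all, which replaces the separate re.search test in the original code.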

How can I use a loop to create similarly-named strings for a number of similar columns imported from a csv?

I want to work with data from a csv in Python. I'm looking to make each column a separate string, and I am wondering if there is a way to loop through this process so that I don't have to specify the name of each string individually (as the naming conventions are very similar).
For a number of the csv columns, I am using the following code:
dot_title=str(row[0]).lower()
onet_title=str(row[1]).lower()
For [2]-[11], I would like each string to be named the same but numbered. I.e., row[2] would become a string called onet_reported_1, row[3] would be onet_reported_2, row[4] would be onet_reported_3... etc., all the way through to row[12].
Is there a way of doing this with a loop, instead of simply defining onet_reported_1, _2, _3, _4 etc. individually?
Thanks in advance!

So, first some clarity.
A string is a data type. In Python, you create a string by surrounding some text in either single or double quotes.
"This is a string"
'So is this. It can have number characters: 123. Or any characters: !##$'
Strings are values that can be assigned to a variable. So you use a string by giving it a name:
my_string = "This is a string"
another_string = "One more of these"
You can do different kinds of operations on strings like joining them with the + operator
new_string = my_string + another_string
And you can create lists of strings:
list_of_strings = [new_string, my_string, another_string]
which looks like ["This is a stringOne more of these", "This is a string", "One more of these"].
To create multiple strings in a loop, you'll need a place to store them. A list is a good candidate:
list_of_strings = []
for i in range(1, 11):
    list_of_strings.append("onet_reported_" + str(i))
But I think what you want is to name the variables "onet_reported_x" so that you end up with something equivalent to:
onet_reported_1 = row[2]
onet_reported_2 = row[3]
and so forth, without having to type out all that redundant code. That's a good instinct. One nice way to do this kind of thing is to create a dictionary where the keys are the string names that you want and the values are the corresponding row entries. You can do this in a loop:
onet_dict = {}
for i in range(1, 11):
    onet_dict["onet_reported_" + str(i)] = row[i + 1]
or with a dictionary comprehension:
onet_dict = {"onet_reported_" + str(i): row[i + 1] for i in range(1, 11)}
Both will give you the same result. Now you have a collection of strings with the names you want as the keys of the dict, mapped to the row values you want them associated with. To use them, instead of referring directly to the name onet_reported_x, you have to access the value from the dict like:
# Adding some other value to onet_reported_5. I'm assuming the values are numbers.
onet_dict["onet_reported_5"] += 20457
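The same idea can be written with enumerate over a slice of the row, which avoids index arithmetic. This is only a sketch, assuming Python 3; the sample row contents are made up for illustration, and it assumes the ten reported columns are row[2] through row[11], lowercased like the other titles in the question.

```python
# Hypothetical CSV row: two title columns followed by ten reported titles
row = ["Dot Title", "Onet Title"] + ["Reported %d" % n for n in range(1, 11)]

dot_title = str(row[0]).lower()
onet_title = str(row[1]).lower()

# Number the columns from 1 while walking row[2] .. row[11]
onet_reported = {
    "onet_reported_%d" % n: str(value).lower()
    for n, value in enumerate(row[2:12], start=1)
}

print(onet_reported["onet_reported_1"])
print(len(onet_reported))
```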

Why does "\n" appear in my string output?

I have elements that I've scraped off of a website and when I print them using the following code, they show up neatly as spaced out elements.
print("\n" + time_element)
prints like this
F
4pm-5:50pm
but when I pass time_element into a dataframe as a column and convert it to a string, the output looks like this
# b' \n F\n \n 4pm-5:50pm\n
I am having trouble understanding why it appears so and how to get rid of this "\n" character. I tried using regex to match the "F" and the "4pm-5:50pm" and I thought this way I could separate out the data I need. But using various methods including
# Define the list and the regex pattern to match
time = df['Time']
pattern = '[A-Z]+'
# Filter out all elements that match the pattern
filtered = [x for x in time if re.match(pattern, x)]
print(filtered)
I get back an empty list.
From my research, I understand the "\n" represents a new line and that there might be invisible characters. However, I'm not understanding more about how they behave so I can get rid of them/around them to extract the data that I need.
When I pass the data to csv format, it prints like this all in one cell
F
4pm-5:50pm
but I still end up in the similar place when it comes to separating out the data that I need.

You can call strip() on the text when you extract it from the website to remove the leading and trailing whitespace, including "\n".
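A short sketch of how the whitespace behaves may help here. It assumes Python 3 and that the scraped value looks like the one printed in the question; the sample value and the helper name clean are hypothetical. Note that strip() only trims the ends, while split() with no arguments also discards the newlines and spaces in the middle.

```python
import re

raw = " \n F\n \n 4pm-5:50pm\n"  # hypothetical scraped value

def clean(value):
    """Return the non-whitespace pieces of a scraped value as a list."""
    if isinstance(value, bytes):
        # Assumption: the b'...' in the question means the value is bytes
        value = value.decode("utf-8")
    return value.split()

print(repr(raw.strip()))  # inner newlines are still present
print(clean(raw))         # whitespace-separated pieces only

# re.match only matches at the start of the string, which here is a space,
# so it finds nothing; re.search scans the whole string.
print(re.match("[A-Z]+", raw))
print(re.search("[A-Z]+", raw).group())
```

The behaviour of re.match explains the empty list in the question: every value starts with whitespace, so the pattern never matches at position 0.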

Python: extracting text from strings using a key phrase

Struggling trying to find a way to do this, any help would be great.
I have a long string – it’s the Title field. Here are some samples.
AIR-LAP1142N-A-K
AIR-LP142N-A-K
Used Airo 802.11n Draft 2.0 SingleAccess Point AIR-LP142N-A-9
Airo AIR-AP142N-A-K9 IOS Ver 15.2
MINT Lot of (2) AIR-LA112N-A-K9 - Dual-band-based 802.11a/g/n
Genuine Airo 112N AP AIR-LP114N-A-K9 PoE
Wireless AP AIR-LP114N-A-9 Airy 50 availiable
I need to pull the part number out of the Title and assign it to a variable named 'PartNumber'. The part number will always start with the characters 'AIR-'.
So for example:
Title = 'AIR-LAP1142N-A-K9 W/POWER CORD'
PartNumber = yourformula(Title)
print(PartNumber) will output AIR-LAP1142N-A-K9
I am fairly new to python and would greatly appreciate help. I would like it to ONLY print the part number not all the other text before or after.

What you’re looking for are regular expressions, which are implemented in the re module. For instance, you could write something like:
>>> import re
>>> def format_title(title):
...     return re.search(r"(AIR-\S*)", title).group(1)
>>> Title = "Cisco AIR-LAP1142N-A-K9 W/POWER CORD"
>>> PartNumber = format_title(Title)
>>> print(PartNumber)
AIR-LAP1142N-A-K9
The \S* ensures you match everything from AIR- up to the next whitespace character.

def yourFunction(title):
    for word in title.split():
        if word.startswith('AIR-'):
            return word
>>> PartNumber = yourFunction(Title)
>>> print(PartNumber)
AIR-LAP1142N-A-K9

This is a sensible time to use a regular expression. It looks like the part number consists of upper-case letters, hyphens, and numbers, so this should work:
import re
def extract_part_number(title):
    return re.search(r'(AIR-[A-Z0-9\-]+)', title).groups()[0]
This will throw an error if it gets a string that doesn't contain something that looks like a part number, so you'll probably want to add some checks to make sure re.search doesn't return None and groups doesn't return an empty tuple.
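One way to add the check mentioned above is to test the result of re.search before using it. This is a sketch under the same assumption about the part number's character set, written for Python 3; returning None for titles without a part number is a design choice made here, not something the question specifies.

```python
import re

# Assumption: part numbers contain only upper-case letters, digits and hyphens
PART_NUMBER = re.compile(r"AIR-[A-Z0-9-]+")

def extract_part_number(title):
    """Return the first AIR-... part number in title, or None if there is none."""
    match = PART_NUMBER.search(title)
    return match.group(0) if match else None

print(extract_part_number("AIR-LAP1142N-A-K9 W/POWER CORD"))
print(extract_part_number("Wireless access point, no part number listed"))
```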

You could use the .split() method, which splits the text on whitespace into a list.
To do this, I'd make a new variable; for this example, let's call it titleSplitList, so that titleSplitList = Title.split().
From here, if you know the position of the part number in titleSplitList (for example, the second item), you can assign it to a new variable by index. Note that this only works when the part number is always in the same position:
PartNumber = titleSplitList[1]
Hope this helps.

Python Regex to match a string as a pattern and return number

I have some lines that represent some data in a text file. They are all of the following format:
s = 'TheBears SUCCESS Number of wins : 14'
They all begin with the name then whitespace and the text 'SUCCESS Number of wins : ' and finally the number of wins, n1. There are multiple strings each with a different name and value. I am trying to write a program that can parse any of these strings and return the name of the dataset and the numerical value at the end of the string. I am trying to use regular expressions to do this and I have come up with the following:
import re
def winnumbers(s):
    pattern = re.compile(r"""(?P<name>.*?) #starting name
                             \s*SUCCESS #whitespace and success
                             \s*Number\s*of\s*wins #whitespace and strings
                             \s*\:\s*(?P<n1>.*?)""",re.VERBOSE)
    match = pattern.match(s)
    name = match.group("name")
    n1 = match.group("n1")
    return (name, n1)
So far, my program can return the name, but the trouble comes after that. They all have the text "SUCCESS Number of wins : " so my thinking was to find a way to match this text. But I realize that my method of matching an exact substring isn't correct right now. Is there any way to match a whole substring as part of the pattern? I have been reading quite a bit on regular expressions lately but haven't found anything like this. I'm still really new to programming and I appreciate any assistance.
Eventually, I will use float() to return n1 as a number, but I left that out because it doesn't properly find the number in the first place right now and would only return an error.

Try this one out:
((\S+)\s+SUCCESS Number of wins : (\d+))
These are the results:
>>> regex = re.compile(r"((\S+)\s+SUCCESS Number of wins : (\d+))")
>>> r = regex.search(string)
>>> r
<_sre.SRE_Match object at 0xc827cf478a56b350>
>>> regex.match(string)
<_sre.SRE_Match object at 0xc827cf478a56b228>
# List the groups found
>>> r.groups()
(u'TheBears SUCCESS Number of wins : 14', u'TheBears', u'14')
# List the named dictionary objects found
>>> r.groupdict()
{}
# Run findall
>>> regex.findall(string)
[(u'TheBears SUCCESS Number of wins : 14', u'TheBears', u'14')]
# So you can do this for the name and number:
>>> fullstring, name, number = r.groups()
If you don't need the full string, just remove the surrounding parentheses.
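If the named groups from the question are preferred, they can be kept. The reason the original pattern returns an empty n1 is that a lazy .*? at the very end of a pattern is satisfied by matching nothing. The following sketch assumes Python 3, that the name contains no whitespace, and that the number is a plain integer or decimal; it returns None for lines that do not match, which is a choice made here for illustration.

```python
import re

WINS = re.compile(
    r"(?P<name>\S+)\s+SUCCESS\s+Number\s+of\s+wins\s*:\s*(?P<n1>\d+(?:\.\d+)?)"
)

def winnumbers(s):
    """Return (name, wins) parsed from one line, or None if it does not match."""
    match = WINS.match(s)
    if match is None:
        return None
    return match.group("name"), float(match.group("n1"))

print(winnumbers("TheBears SUCCESS Number of wins : 14"))
```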

I don't believe there is any actual need to use a regex here, so you could use the following code if it is acceptable for you (I'm posting it so that you have another option):
dict((line[:line.lower().index('success')].strip(), line[line.index(':') + 1:].strip()) for line in text.split('\n') if 'success' in line.lower())
Or, if you are sure that all words are separated by single spaces:
output = {}
for line in text.split('\n'):
    if 'success' in line.lower():
        words = line.strip().split(' ')
        output[words[0]] = words[-1]

If the text in the middle is always constant, there is no need for a regular expression. The inbuilt string processing functions will be more efficient and easier to develop, debug and maintain. In this case, you can just use the inbuilt split() function to get the pieces, and then clean the two pieces as appropriate:
>>> def winnumber(s):
...     parts = s.split('SUCCESS Number of wins : ')
...     return (parts[0].strip(), int(parts[1]))
...
>>> winnumber('TheBears SUCCESS Number of wins : 14')
('TheBears', 14)
Note that I have output the number of wins as an integer (as presumably this will always be a whole number), but you can easily substitute float()- or any other conversion function - for int() if you desire.
Edit: Obviously this will only work for single lines - if you call the function with several lines it will give you errors. To process an entire file, I'd use map():
>>> list(map(winnumber, open(filename, 'r')))
[('TheBears', 14), ('OtherTeam', 6)]
Also, I'm not sure of your end use for this code, but you might find it easier to work with the outputs as a dictionary:
>>> dict(map(winnumber, open(filename, 'r')))
{'OtherTeam': 6, 'TheBears': 14}
