Underscore Index in Python

Why is the underscore needed when using an index? Here is a bit of code from code academy.
animals = ["aardvark", "badger", "duck", "emu", "fennec fox"]
duck_index = animals.index("duck") # Use index() to find "duck"
animals.insert(duck_index, "cobra")# Your code here!
print animals # Observe what prints after the insert operation
It's very confusing to have duck_index and then animals.index; it really throws the whole thing off in my mind.
I tried looking on other sites to see if others have brought this up, but I can't find any answers, so it makes sense to have this answered and archived on Stack Overflow.
Why use duck_index at all? Why not just use .index for everything? And what is the major difference between the two?

An underscore is just another valid character in a variable name, just like the characters A-Z, a-z, and 0-9. You use it when you want to separate two words in a single name. It's the recommended practice in PEP 8. The underscore takes on special meaning when used at the beginning of the name, but I won't get into that here.
The . on the other hand is used to access a member of a variable. In this case you're calling the index method on the animals variable. animals and index are two different names.
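To make the distinction concrete, here is the same snippet with both uses annotated:

```python
animals = ["aardvark", "badger", "duck", "emu", "fennec fox"]

# duck_index is just a variable name; the underscore separates the two words.
# animals.index(...) is a method call: the dot accesses index() on the list.
duck_index = animals.index("duck")
print(duck_index)  # 2

animals.insert(duck_index, "cobra")
print(animals)  # ['aardvark', 'badger', 'cobra', 'duck', 'emu', 'fennec fox']
```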

animals.index("duck") looks up the index position of "duck" in animals, while duck_index is an identifier: a variable that holds the index of "duck" once it has been found. The naming makes sense.

Related

regex to find LastnameFirstname with no space between in Python

I currently have several names that look like this:
SmithJohn
smithJohn
O'BrienPeter
All of these have no spaces, but have a capital letter in between.
Is there a regex to match these types of names (but that won't match names like Smith, John, Smith John, or Smith.John)? Furthermore, how could I split the last name and first name into two different variables?
Thanks!
If all you want is a string with a capital letter in the middle and lowercase letters around it, this should work okay: [a-z][A-Z] (make sure you use re.search and not re.match). It handles "O'BrienPeter" fine, but might match names like "McCutchon" when it shouldn't. It's impossible to come up with a regex, or any program really, that does what you want for all names (see Falsehoods Programmers Believe About Names).
As Brian points out, there's a question you need to ask yourself here: What guarantees do you have about the strings you will be processing?
Do you know without a doubt that the only capitals will be the beginnings of the names? Or could something like "McCutchonBrian", or in my case "Mallegol-HansenPhilip" have found its way in there as well?
In the greater context of software in general, you need to consider the assumptions you are going in with. Otherwise you're going to be solving a problem that is in fact not the problem you have.

Wildcard in python dictionary

I am trying to create a Python dictionary to reference 'WHM1', 'WHM2', 'WHM3', 'HISPM1', 'HISPM2', 'HISPM3', etc., and other iterations, in order to create a new column with a specific string, for example White or Hispanic. Using regex seems like the right path, but I am missing something here, and I refuse to hard-code the whole thing in the dictionary.
I have tried several iterations of regex and regexdict:
d = regexdict({'W*':'White', 'H*':'Hispanic'})
eeoc_nac2_All_unpivot_df['Race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(d)
A new column will be created with 'White' or 'Hispanic' for each row based on what is in an existing column called 'EEOC_Code'.
Your regular expressions are wrong - you appear to be using glob syntax instead of proper regular expressions.
In regex, x* means "zero or more of x" and so both your regexes will trivially match the empty string. You apparently mean
d = regexdict({'^W':'White', '^H':'Hispanic'})
instead, where the regex anchor ^ matches beginning of string.
There are several third-party packages named regexdict, so you should probably point out which one you are using. I can't tell whether the ^ is necessary here, or whether the regexes need to match the input completely (I have assumed a substring match is sufficient, as is usually the case with regexes), because this sort of detail may well differ between implementations.
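Since it's unclear which regexdict implementation is in play, here is a minimal stand-in built on the standard re module that shows the intended behavior (the names patterns and classify are illustrative, not part of any package):

```python
import re

# First matching pattern wins, mimicking a regex-keyed dictionary
patterns = {r'^W': 'White', r'^H': 'Hispanic'}

def classify(code):
    for pat, label in patterns.items():
        if re.search(pat, code):
            return label
    return None  # no pattern matched

print(classify('WHM1'))    # White
print(classify('HISPM1'))  # Hispanic
```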
I'm not sure I have completely understood your problem. However, if all your labels have the structure WHM... and HISP..., then you can simply check the first character:
races = []
for race in eeoc_nac2_All_unpivot_df['EEOC_Code']:
    if race.startswith('W'):
        races.append("White")
    else:
        races.append("Hispanic")
eeoc_nac2_All_unpivot_df['Race'] = races
Note: it only works if what you have inside eeoc_nac2_All_unpivot_df['EEOC_Code'] is iterable.
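If the column is a pandas Series, here is a vectorized sketch of the same idea (the toy frame is illustrative, and it assumes every EEOC code starts with 'W' or 'H'):

```python
import pandas as pd

# Toy frame standing in for eeoc_nac2_All_unpivot_df
df = pd.DataFrame({'EEOC_Code': ['WHM1', 'HISPM1', 'WHM2']})

# Map the first character of each code to a race label
df['Race'] = df['EEOC_Code'].str[0].map({'W': 'White', 'H': 'Hispanic'})
print(df['Race'].tolist())  # ['White', 'Hispanic', 'White']
```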

How to parse names from raw text

I was wondering if anyone knew of any good libraries or methods of parsing names from raw text.
For example, let's say I've got these as examples: (note sometimes they are capitalized tuples, other times not)
James Vaynerchuck and the rest of the group will be meeting at 1PM.
Sally Johnson, Jim White and brad burton.
Mark angleman Happiness, Productivity & blocks. Mark & Evan at 4pm.
My first thought is to load some sort of part-of-speech tagger (like Python's NLTK), tag all of the words, then strip out only the nouns and compare them against a database of known words (i.e. a literal dictionary); if they aren't in the dictionary, assume they are names.
Other thoughts would be to delve into machine learning, but that might be beyond the scope of what I need here.
Any thoughts, suggestions or libraries you could point me to would be very helpful.
Thanks!
I don't know why you think you need NLTK just to rule out dictionary words; a simple dictionary (which you might have installed somewhere like /usr/share/dict/words, or you can download one off the internet) is all you need:
with open('/usr/share/dict/words') as f:
    dictwords = {word.strip() for word in f}

with open(mypath) as f:
    names = [word for line in f for word in line.rstrip().split()
             if word.lower() not in dictwords]
Your words list may include names, but if so, it will include them capitalized, so:
dictwords = {word.strip() for word in f if word.islower()}
Or, if you want to whitelist proper names instead of blacklisting dictionary words:
with open('/usr/share/dict/propernames') as f:
    namewords = {word.strip() for word in f}

with open(mypath) as f:
    names = [word for line in f for word in line.rstrip().split()
             if word.title() in namewords]
But this really isn't going to work. Look at "Jim White" from your example. His last name is obviously going to be in any dictionary, and his first name will be in many (as a short version of "jimmy", as a common romanization of the Arabic letter "jīm", etc.). "Mark" is also a common dictionary word. And the other way around, "Will" is a very common name even though you want to treat it as a word, and "Happiness" is an uncommon name, but at least a few people have it.
So, to make this work even the slightest bit, you probably want to combine multiple heuristics. First, instead of a word being either always a name or never a name, each word has a probability of being used as a name in some relevant corpus: White may be a name 13.7% of the time, Mark 41.3%, Jim 99.1%, Happiness 0.1%, etc. Next, if it's not the first word in a sentence but is capitalized, it's much more likely to be a name (how much more? I don't know; you'll need to test and tune for your particular input), and if it's lowercase, it's less likely to be a name. You could bring in more context: for example, you have a lot of full names, so if something is a possible first name and it appears right next to something that's a common last name, it's more likely to be a first name. You could even try to parse the grammar (it's OK if you bail on some sentences; they just won't get any input from the grammar rule), so if two adjacent words only parse as part of a sentence when the second one is a verb, they're probably not a first and last name, even if that same second word could be a noun (and a name) in other contexts. And so on.
I found this library quite useful for parsing names: Python Name Parser
It can also deal with names that are formatted Lastname, Firstname.

A simple regexp in python

My program is a simple calculator, so I need to parse the expression which the user types to make the input more user-friendly. I know I can do it with regular expressions, but I'm not familiar enough with them.
So I need transform a input like this:
import re
input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"
re.some_stuff( ,input_user) # ????
in this:
"23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))"
just adding these single quotes inside the parentheses. How can I do that?
UPDATE:
To be more clear, I want to add single quotes after every occurrence of "MM(" and before the ")" which comes after it, and after every occurrence of "func(" and before the "," which comes after it.
This is the sort of thing where regexes can work, but they can potentially result in major problems unless you consider exactly what your input will be like. For example, can whatever is inside MM(...) contain parentheses of its own? Can the first expression in func( contain a comma? If the answers to both questions is no, then the following could work:
input_user2 = re.sub(r'MM\(([^\)]*)\)', r"MM('\1')", input_user)
output = re.sub(r'func\(([^,]*),', r"func('\1',", input_user2)
However, this will not work if the answer to either question is yes, and even without that could cause problems depending upon what sort of inputs you expect to receive. Essentially, the first re.sub here looks for MM( ('MM('), followed by any number (including 0) of characters that aren't a close-parenthesis ('([^)]*)') that are then stored as a group (caused by the extra parentheses), and then a close-parenthesis. It replaces that section with the string in the second argument, where \1 is replaced by the first and only group from the pattern. The second re.sub works similarly, looking for any number of characters that aren't a comma.
If the answer to either question is yes, then regexps aren't appropriate for the parsing, as your language would not be regular. The answer to this question, while discussing a different application, may give more insight into that matter.
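Putting the two substitutions together on the sample input (note that the second sub must run on the result of the first):

```python
import re

input_user = "23.40*1200*(12.00-0.01)*MM(H2O)/(8.314 *func(2*x+273.15,x))"

# Quote the argument of MM(...): capture everything up to the next ')'
step1 = re.sub(r'MM\(([^)]*)\)', r"MM('\1')", input_user)
# Quote the first argument of func(...): capture everything up to the first ','
output = re.sub(r'func\(([^,]*),', r"func('\1',", step1)

print(output)
# 23.40*1200*(12.00-0.01)*MM('H2O')/(8.314 *func('2*x+273.15',x))
```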

How to format search autocompletion part lists?

I'm currently working on an AppEngine project, and I'd like to implement autocompletion of search terms. The items that can be searched for are reasonably unambiguous and short, so I was thinking of implementing it by giving each item a list of incomplete typings. So foobar would get a list like [f, fo, foo, foob, fooba, foobar]. The user's text in the searchbox is then compared to this list, and positive matches are suggested.
There are a couple of possible optimizations in this list that I was thinking of:
Removing spaces and punctuation from search terms: Foo. Bar to FooBar.
Removing capital letters
Removing leading particles like "the", "a", "an". The Guy would be guy, and indexed as [g, gu, guy].
Only adding substrings longer than 2 or 3 characters to the indexing list. So The Guy would be indexed as [gu, guy]. I thought that suggestions that only match the first letter would not be so relevant.
The user's search term would also be formatted in this way, after which the DB is searched. Upon suggesting a search term, the particles, punctuation, and capital letters would be added back according to the suggested object's full name. So searching for "the" would give no suggestions, but searching for "The Gu.." or "gu" would suggest "The Guy".
Is this a good idea? Mainly: would this formatting help, or only cause trouble?
I have already run into the same problem and the solution that I adopted was very similar to your idea. I split the items into words, convert them to lowercase, remove accents, and create a list of startings. For instance, "Báz Bar" would become ['b', 'ba', 'bar', 'baz'].
I have posted the code in this thread. The search box of this site is using it. Feel free to use it if you like.
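The prefix-list generation described above can be sketched with the standard library alone (prefixes is an illustrative name; accents are stripped via Unicode decomposition):

```python
import unicodedata

def prefixes(term, min_len=1):
    # Decompose accented characters, then drop the combining marks
    normalized = unicodedata.normalize('NFD', term)
    stripped = ''.join(c for c in normalized if not unicodedata.combining(c))
    # Collect every word prefix of at least min_len characters, lowercased
    result = set()
    for word in stripped.lower().split():
        for i in range(min_len, len(word) + 1):
            result.add(word[:i])
    return sorted(result)

print(prefixes("Báz Bar"))  # ['b', 'ba', 'bar', 'baz']
```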
