Parsing full names from a list of names

I am using namesparser to extract full names from a list of names.
from namesparser import HumanNames
names = HumanNames('Randy Heimerman, James Durham, Nate Green')
print(names.human_names[0])
Namesparser works well in most cases, but the above example is getting hung up. I believe it is because the name "Randy" includes "and", which namesparser is treating as a separator.
When I move Randy's name to the end of the string, the correct name is printed (James Durham). If I try to print either of the 2 other names, though, the wrong strings are returned.
Any ideas on how I can resolve this?

I think you should use the comma , as your delimiter.
def print_names(name_string):
    return (name.strip() for name in name_string.split(","))
This splits your string on the commas and strips leading and trailing whitespace from each piece before yielding it.
Now that you have a generator of names, you can pass it into other things for example:
humans = [HumanName(name) for name in print_names(name_string)]
But then again, I don't know what your class HumanNames / HumanName really does, and you didn't include a class definition.
If you are looking at this module: https://pypi.org/project/nameparser/ in which it takes a string consisting of a singular name, the above will still work no problem.
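As a rough illustration of the comma-splitting idea, here is a minimal sketch that uses only the standard library. The helper name split_names is made up for this example; it only shows that "Randy" survives intact when the comma is the sole separator.

```python
def split_names(name_string):
    """Split a comma-separated string into a list of trimmed full names."""
    # Only the comma is a separator, so the "and" inside "Randy" is untouched.
    return [name.strip() for name in name_string.split(",") if name.strip()]


names = split_names("Randy Heimerman, James Durham, Nate Green")
print(names[0])  # Randy Heimerman
```

Each element of the resulting list could then be handed to a single-name parser, if one is needed.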

Related

Python: retrieve substring bounded by indentations on a text file

I am having trouble on how to even identify indentations on a text file with Python (the ones that appear when you press tab). I thought that using the split function would be helpful, but it seems like there has to be a physical character that can act as the 'separator'.
Here is a sample of the text, where I am trying to retrieve the string 'John'. Assume that the spaces are the indentations:
15:50:00 John 1029384
All help is appreciated! Thanks!

Dependent on the program that you used for creating the file, what is actually inserted when you press TAB may either be a TAB character (\t) or a series of spaces.
You were actually right in thinking that split() is a way to do what you want. If you don't pass any arguments to it, it treats any run of whitespace (spaces, tabs, or a mix of both) as a single separator:
s = "15:50:00 John 1029384"
t = "15:50:00\tJohn\t1029384"
s.split() # Output: ['15:50:00', 'John', '1029384']
t.split() # Output: ['15:50:00', 'John', '1029384']

Tabs are represented by \t. See https://www.w3schools.com/python/gloss_python_escape_characters.asp for a longer list.
So we can do the following:
s = "15:50:00\tJohn\t1029384"
s.split("\t") # Output: ['15:50:00', 'John', '1029384']
If you know regex, then you can use look-ahead and look-behind as follows:
import re
re.search("(?<=\t).*?(?=\t)", s)[0] # Output: "John"
Obviously both methods will need to be made more robust by considering edge cases and error handling (e.g., what happens if there are fewer, or more, than two tabs in the string? How do you identify the name in that case?)
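One possible way to handle those edge cases is sketched below. It assumes every line has exactly three whitespace-separated fields and that the name itself contains no spaces; the function name extract_name is hypothetical.

```python
def extract_name(line):
    """Return the middle field of a 'time name id' line."""
    fields = line.split()  # no argument: any run of spaces or tabs separates
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    return fields[1]


print(extract_name("15:50:00\tJohn\t1029384"))      # John
print(extract_name("15:50:00    John    1029384"))  # John
```

Raising an error on an unexpected field count is only one choice; returning None or skipping the line may suit other data better.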

Why do we need to put apostrophes around an f string in python?

I understand what f strings are and how to use them, but I do not understand why we need to have apostrophes around the whole f string when we have already made the variables a string.
first_name = 'Chris'
last_name = 'Christie'
sentence = f'That is {first_name} {last_name}'
I understand what the expected result is going to be. But here's where I'm confused. Aren't the variables first name and last name already a string? So when we put it into the f string statement, aren't we putting two strings (the variables first name and last name) inside one big string (as the whole f string statement is surrounded by apostrophes)? Sorry if this is confusing

Do not get confused about apostrophes:
We use quotes (single or double) to define string literals in Python:
name = "Chris"
We use f-strings because they are the newer, more readable way of formatting strings in Python:
# Define two strings
name = "Chris"
surname = "Christie"
# Use f-Strings to format the overall sentence
sentence = f"Hello, {name} {surname}"
# Print the computed sentence
print(sentence)
Output: Hello, Chris Christie

aren't we putting two strings (the variables first name and last name) inside one big string
Yes. And in this example that's not a hard thing to do. As you suggest, you could just do:
first_name = 'Chris'
last_name = 'Christie'
sentence = 'That is ' + first_name + ' ' + last_name
to concatenate the strings. But this quickly gets unwieldy. Because of all those quotes and operators it's difficult to see at a glance exactly what the final string is going to look like.
So many languages have come up with a way of having the variables part of the string itself. That way you can write the string a bit more naturally. For example, in Python you can also use "%-formatting":
sentence = 'That is %s %s' % (first_name, last_name)
That's a bit better, because you can easily see where the variables are going to go in the string. But it gets messy when you have a lot of substitutions to do, because you need to match the order of the %'s with the order of the values in the tuple. What if we could just put the variables in the string itself? Well, that's what f-strings do.
So f-strings allow you to see just where the variables are going to end up in the string. As you point out, the notation can look a little odd because you end up inserting strings as variables into a string, but notations are just that - notations. You get a bit of expressiveness at the cost of a bit of obscureness.
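To make the comparison concrete, the short sketch below builds the same sentence three ways and checks that the results are identical. The variable names by_concatenation, by_percent, by_fstring, and age are introduced only for this illustration.

```python
first_name = 'Chris'
last_name = 'Christie'

by_concatenation = 'That is ' + first_name + ' ' + last_name
by_percent = 'That is %s %s' % (first_name, last_name)
by_fstring = f'That is {first_name} {last_name}'

# All three spellings build the same string.
print(by_concatenation == by_percent == by_fstring)  # True

# The braces hold expressions, not just string variables.
age = 7
print(f'{first_name.upper()} is {age + 1} next year')  # CHRIS is 8 next year
```

The outer quotes mark where the template starts and ends; the braces mark the parts that Python evaluates and converts to text.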

Search a delimited string in a file - Python

I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and python script :
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
    if re.search('JOL":"(.+?).tr', text):
        print >> needed, text,
I want it to find what's between two words (JOL":" and .tr) and then print it. But all it does is printing all the text set in "read.json".

You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
    match = re.search('JOL":"(.+?).tr', text)
    if match:
        print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special pattern in a regex, so you're actually matching anything up to any character followed by tr, not .tr. For that, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
import json
d = json.loads(text[1:])
if 'JOL' in d:
    print >> needed, d['JOL']
Finally, you don't actually have anything named needed in your code; you opened a file named 'needed.txt', but you called the file object love. If your real code has a similar bug, it's possible that you're overwriting some completely different file over and over, and then looking in needed.txt and seeing nothing changed each time…
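For readers on Python 3, the regex approach with the escaped dot might look like the sketch below. The names JOL_PATTERN and extract_jol and the shortened sample line are made up for illustration; they are not part of the original script.

```python
import re

# Escaped dot and a raw string, as recommended in the side note.
JOL_PATTERN = re.compile(r'JOL":"(.+?)\.tr')


def extract_jol(line):
    """Return the text between 'JOL":"' and '.tr', or None if absent."""
    match = JOL_PATTERN.search(line)
    return match.group(1) if match else None


sample = '{"JOL":"abc%2Bdef.tr","LAPTOP":"error"}'  # shortened, made-up sample
print(extract_jol(sample))                # abc%2Bdef
print(extract_jol('{"LAPTOP":"error"}'))  # None
```

In Python 3, writing the value to the output file would use print(value, file=needed) instead of the print >> needed form.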

If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'
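The chained split raises an IndexError when the starting marker is missing. A slightly more defensive variant, sketched here with str.partition, returns None instead; the helper name between and the shortened sample are assumptions made for this example.

```python
def between(text, start, end):
    """Return the text between the first start marker and the next end marker."""
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None  # start marker missing
    value, found_end, _ = rest.partition(end)
    if not found_end:
        return None  # end marker missing
    return value


sample = '{"JOL":"abc%2Bdef.tr","LAPTOP":"error"}'  # shortened, made-up sample
print(between(sample, 'JOL":"', '.tr'))                # abc%2Bdef
print(between('{"LAPTOP":"error"}', 'JOL":"', '.tr'))  # None
```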

Parse a string to find and remove a float

I have created a Python method that takes in a string of variable length that will always include a floating point number at the end:
"adsfasdflkdslf:asldfasf-adslfk:1.5698464586546"
OR
"asdif adfi=9393 adfkdsf:1.84938"
I need to parse the string and return the floating point number at the end. There is usually a delimiter character before the float, such as :, -, or a space.
def findFloat(stringArg):
    stringArg.rstrip()
    stringArg.replace("-",":")
    if stringArg.rfind(":"):
        locateFloat = stringArg.rsplit(":")
    #second element should be the desired float
    magicFloat = locateFloat[1]
    return magicFloat
I am receiving this error:
magicFloat = locateFloat[1]
IndexError: list index out of range
Any guidance on how to locate the float and return it would be awesome.

In Python, strings are immutable. No matter what function you call on a string, the actual text of that string does not change. Thus, methods like rstrip, replace etc. create a new string representing the modified version. (You would know this if you read the documentation.) In your code, you do not assign the results of these calls anywhere in the first two statements, so the results are lost.
Without specifying a number of splits, rsplit does the exact same thing that split does. It checks for splits from the end, sure, but it still splits at every possible point, so the net effect is the same. You need to specify that you want to split at most one time.
However, you shouldn't do that anyway; a much simpler way to get "everything after the last colon, or everything if there is no colon" is to use rpartition.
You don't actually have to remove whitespace from the end for float conversion. Although you probably should actually, you know, perform the conversion.
Finally, there is no point in assigning to a variable just to return it; just return the expression directly.
Putting that together gives us the exceptionally simple:
def findFloat(stringArg):
    return float(stringArg.replace('-', ':').rpartition(':')[2])
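The difference between rsplit without a limit, rsplit with a limit of one, and rpartition can be seen in a short experiment. The sample string s below is a shortened, made-up variant of the input from the question.

```python
s = "adsfasdflkdslf:asldfasf:1.5698"

print(s.rsplit(":"))      # ['adsfasdflkdslf', 'asldfasf', '1.5698']
print(s.rsplit(":", 1))   # ['adsfasdflkdslf:asldfasf', '1.5698']
print(s.rpartition(":"))  # ('adsfasdflkdslf:asldfasf', ':', '1.5698')

# With no colon at all, rpartition puts the whole string in the last slot.
print("1.5698".rpartition(":"))  # ('', '', '1.5698')

# float() ignores surrounding whitespace, so no rstrip() is needed.
print(float(" 1.5698 \n"))  # 1.5698
```

Because the last element of the rpartition result is always present, indexing it with [2] cannot raise an IndexError.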

re always rocks. Depending on what your floating point number looks like (leading 0?) something like:
import re
magicFloat = re.search(r'.*([0-9]\.[0-9]+)', st).group(1)
P.S. If you do this a lot, precompile the regex first:
re_float = re.compile(r'.*([0-9]\.[0-9]+)')
# later in your code
magicFloat = re_float.search(st).group(1)

You could do it in an easier manner:
def findFloat(stringArg):
    s = stringArg.rstrip()
    return s.replace('-', ':').split(':')[-1]
rstrip() returns the stripped string; you must store it somewhere.
split() takes a single separator string, not a set of characters, so replace() is still needed to turn - into : first, and its result must be used as well.
rsplit() is an optimization, but split()[-1] will always take the last element of the split list.
locateFloat is not defined if the if branch is not taken. Note also that rfind() returns -1 when nothing is found, which is truthy, so the test does not check what you intend.
If you need to check whether a character is present, you can write if ':' in stringArg: instead.
Hope these tips help you later :)

If it's always at the end you should use $ in your re:
import re
def findFloat(stringArg):
    fl = re.search(r'([\.0-9]+)$', stringArg)
    return fl and float(fl.group(1))
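A slightly stricter variant of the end-anchored pattern is sketched below. It assumes the float always has digits on both sides of the decimal point and no sign or exponent; the names TRAILING_FLOAT and find_trailing_float are hypothetical.

```python
import re

# Digits, a dot, digits, then optional trailing whitespace at the end.
TRAILING_FLOAT = re.compile(r'(\d+\.\d+)\s*$')


def find_trailing_float(text):
    """Return the float at the end of text, or None if there is none."""
    match = TRAILING_FLOAT.search(text)
    return float(match.group(1)) if match else None


print(find_trailing_float("adsfasdflkdslf:asldfasf-adslfk:1.5698464586546"))  # 1.5698464586546
print(find_trailing_float("asdif adfi=9393 adfkdsf:1.84938"))                 # 1.84938
print(find_trailing_float("no number here"))                                  # None
```

Requiring a digit on each side of the dot avoids matching a lone "." or a run such as "1.2.3", which the looser character class would accept and then fail to convert.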

You can use regular expressions.
>>> import re
>>> st = "adsfasdflkdslf:asldfasf-adslfk:1.5698464586546"
>>> float(re.split(r':|\s|-',st)[-1])
1.5698464586546
I have used re.split(pattern, string, maxsplit=0, flags=0), which splits string by the occurrences of pattern.
Here the pattern lists your delimiters: :, whitespace (\s), and -.

How to add a comma to the end of a list efficiently?

I have a list of horizontal names that is too long to open in excel. It's 90,000 names long. I need to add a comma after each name to put into my program. I tried find/replace but it freezes up my computer and crashes. Is there a clever way I can get a comma at the end of each name? My options to work with are python and excel thanks.

If you actually had a Python list, say names, then ','.join(names) would make it into a string with a comma between each name and the following one (if you need one at the end as well, just use + ',' to append one more comma to the result).
Even though you say you have "a list" I suspect you actually have a string instead, for example in a file, where the names are separated by...? You don't tell us, and therefore force us to guess. For example, if they're separated by line-ends (one name per line), your life is easiest:
with open('yourfile.txt') as f:
    result = ','.join(line.rstrip('\n') for line in f)
(again, supplement this with a + ',' after the join if you need that, of course). This works because iterating over a text file yields it line by line; each line keeps its trailing newline, so it is stripped before joining.
If the separator is something different, you'll have to read the file's contents as a string (with f.read()) and split it up appropriately then join it up again with commas.
For example, if the separator is a tab character:
with open('yourfile.txt') as f:
    result = ','.join(f.read().split('\t'))
As you see, it's not so much worse;-).
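If the goal is instead to keep one name per line and simply put a comma after each, a sketch along these lines may be enough. It assumes one name per line; the function name add_trailing_commas is made up, and io.StringIO stands in for the real file so the example runs on its own.

```python
import io


def add_trailing_commas(lines):
    """Return the names with a comma after each one, one name per line."""
    names = (line.strip() for line in lines)
    # Blank lines are skipped so they do not turn into lone commas.
    return "\n".join(name + "," for name in names if name)


fake_file = io.StringIO("Alice\nBob\n\nCarol\n")  # stand-in for open('yourfile.txt')
print(add_trailing_commas(fake_file))
# Alice,
# Bob,
# Carol,
```

With a real file, the same function can be called on the object returned by open(), and 90,000 short lines should be processed almost instantly.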
