cutting strings onto new lines - python

My string is too long to fit in Tkinter, so I'm trying to split the string onto a new line after every 15 spaces.
So far I have counted the spaces, and every time I get to 15 it adds the string '\n', which should put the rest on a new line; however, it just places '\n' literally in the string.
How can I fix this?
def stringCutter(movie):
    n = 0
    strings = []
    spaces = 0
    curFilms = db.CurrentFilm(movie)
    tempOverview = curFilms[5]
    for i in tempOverview:
        n += 1
        if i == ' ':
            spaces += 1
            if (spaces % 15) == 0:
                string = tempOverview[:n]
                tempOverview = tempOverview[n:]
                strings.append(string)
                n = 0
                spaces = 0
    if n == len(tempOverview):
        strings.append(tempOverview)
    overview = '\n'.join(strings)
    return overview
curFilms holds lots of movie info, and the 5th element is the overview, which is a long string.
I want it to return the overview like this:
After a global war the seaside kingdom known as the Valley Of The Wind remains
one of the last strongholds on Earth untouched by a poisonous jungle and the powerful
insects that guard it. Led by the courageous Princess Nausicaa the people of the Valley
engage in an epic struggle to restore the bond between humanity and Earth.
Instead of that though, it does this:
After a global war the seaside kingdom known as the Valley Of The Wind remains \none of the last strongholds on Earth untouched by a poisonous jungle and the powerful \ninsects that guard it. Led by the courageous Princess Nausicaa the people of the Valley \nengage in an epic struggle to restore the bond between humanity and Earth.
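For reference, here is a minimal sketch of the intended wrapping, assuming the goal is simply 15 words per line (the string_cutter helper below is illustrative, not the question's code):

def string_cutter(overview, words_per_line=15):
    # Split on whitespace, then rejoin 15 words at a time, one group per line.
    words = overview.split()
    lines = [' '.join(words[i:i + words_per_line])
             for i in range(0, len(words), words_per_line)]
    return '\n'.join(lines)

The standard library's textwrap.fill offers similar wrapping, but by character width rather than word count.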

How to replace second instance of character in substring?

I have the following strings:
text_one = str("\"A Ukrainian American woman who lives near Boston, Massachusetts, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Russian attacks on Ukraine and the fear these attacks have engendered.")
text_two = str("\n\nMany people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Russian soldiers overtake the area, the Boston-area woman said.\"")
I need to replace every instance of s/S with $, but not the first instance of s/S in a given word. So the input/output would look something like:
> Mississippi
> Mis$i$$ippi
My idea is to do something like 'after every " " character, skip first "s" and then replace all others up until " " character' but I have no idea how I might go about this. I also thought about creating a list to handle each word.
Solution with re:
import re

text_one = '"A Ukrainian American woman who lives near Boston, Massachusetts, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Russian attacks on Ukraine and the fear these attacks have engendered.'
text_two = '\n\nMany people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Russian soldiers overtake the area, the Boston-area woman said."'

def replace(s):
    return re.sub(
        r"(?<=[sS])(\S+)",
        lambda g: g.group(1).replace("s", "$").replace("S", "$"),
        s,
    )

print(replace(text_one))
print(replace(text_two))
Prints:
"A Ukrainian American woman who lives near Boston, Mas$achu$ett$, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Rus$ian attacks on Ukraine and the fear these attacks have engendered.
Many people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Rus$ian soldier$ overtake the area, the Boston-area woman said."
The first thing you're going to want to do is find the index of the first s.
Then, split the string into two separate variables: everything up to and including that first s, and the rest of the string.
Next, replace all of the s's in the second string with dollar signs.
Finally, join the two strings back together with an empty string.
test = "mississippi"
first_index = test.find("s")
tests = [test[:first_index+1], test[first_index+1:]]
tests[1] = tests[1].replace("s", "$")
result = ''.join(tests)
print(result)
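Since the question's texts contain many words, the same find-split-replace idea extends to whole strings by applying it per word. A hedged sketch (dollarize is an illustrative name, not from either answer):

def dollarize(text):
    # Keep the first s/S of each word, replace every later s/S with "$".
    out_words = []
    for word in text.split(" "):
        first = word.lower().find("s")
        if first == -1:
            out_words.append(word)
        else:
            head, tail = word[:first + 1], word[first + 1:]
            out_words.append(head + tail.replace("s", "$").replace("S", "$"))
    return " ".join(out_words)

print(dollarize("Mississippi sassy Sass"))  # Mis$i$$ippi sas$y Sa$$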

matching content creating new column

Hello, I have a dataset where I want to match my keywords with the location. The problem I am having is that the locations "Afghanistan", "Kabul" or "Helmund" appear in my dataset in over 150 combinations, including spelling mistakes, capitalization, and having the city or town attached to the name. What I want to do is create a separate column that returns the value 1 if any of the strings "afg", "Afg", "kab" or "helm" are contained in the location. I am not sure if upper or lower case makes a difference.
For instance there are hundreds of location combinations like so: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried the code below, and it is good if it matches the phrase exactly, but there is too much variation to write every exception down.
keywords = ['Afghanistan', 'Kabul', 'Herat', 'Jalalabad', 'Kandahar', 'Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan', 'kabul', 'herat', 'jalalabad', 'kandahar']

# how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0

taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)

# below will return the 1 values
taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)
I just need to replace this logic so that the "keyword_solution" column flags anything matching "Afg" or "afg", "Kab" or "kab", or "Kund" or "kund".
Given the following:
Sentences from the New York Times
Remove all non-alphanumeric characters
Change everything to lowercase, thereby removing the need for different word variations
Split the sentence into a list or set. I used set because of the long sentences.
Add to the keywords list as needed
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords= ['jalalabad',
'kunduz',
'lashkargah',
'mazar',
'herat',
'mazar',
'afgh',
'kab',
'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# substitute non-alphanumeric characters
df['sentences'] = df['sentences'].apply(lambda x: re.sub('[\W_]+', ' ', x))
# create a new column with a list of all the words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))
# final
print(df.location)
0 True
1 False
2 False
3 True
4 True
5 True
Name: location, dtype: bool
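If the goal is just the 0/1 column, pandas' own substring matching gets there in one line. A hedged sketch of that alternative, assuming the taleban_2 DataFrame and its 'location' column from the question (str.contains replaces the apply-based lookup):

import pandas as pd

keywords = ['afgh', 'kab', 'kand', 'kunduz', 'herat', 'jalalabad', 'lashkargah', 'mazar']
pattern = '|'.join(keywords)
# case=False makes the match case-insensitive; na=False treats missing locations as no match
taleban_2['keyword_solution'] = (
    taleban_2['location'].str.contains(pattern, case=False, na=False).astype(int)
)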

how to sort the target word by value of dictionary and count associated words?

I have two text files: one is sample.txt and the other is common.txt. First I would like to remove the common words from sample.txt; the common words are listed in common.txt, and in the code below sample.txt is modified as desired. common.txt is:
a
about
after
again
against
ago
all
along
also
always
an
and
another
any
are
around
as
at
away
back
be
because
been
before
began
being
between
both
but
by
came
can
come
could
course
day
days
did
do
down
each
end
even
ever
every
first
for
four
from
get
give
go
going
good
got
great
had
half
has
have
he
head
her
here
him
his
house
how
hundred
i
if
in
into
is
it
its
just
know
last
left
life
like
little
long
look
made
make
man
many
may
me
men
might
miles
more
most
mr
much
must
my
never
new
next
no
not
nothing
now
of
off
old
on
once
one
only
or
other
our
out
over
own
people
pilot
place
put
right
said
same
saw
say
says
see
seen
she
should
since
so
some
state
still
such
take
tell
than
that
the
their
them
then
there
these
they
thing
think
this
those
thousand
three
through
time
times
to
told
too
took
two
under
up
upon
us
use
used
very
want
was
way
we
well
went
were
what
when
where
which
while
who
will
with
without
work
world
would
year
years
yes
yet
you
young
your
sample.txt is:
THE Mississippi is well worth reading about. It is not a commonplace
river, but on the contrary is in all ways remarkable. Considering the
Missouri its main branch, it is the longest river in the world--four
thousand three hundred miles. It seems safe to say that it is also the
crookedest river in the world, since in one part of its journey it uses
up one thousand three hundred miles to cover the same ground that the
crow would fly over in six hundred and seventy-five. It discharges three
times as much water as the St. Lawrence, twenty-five times as much
as the Rhine, and three hundred and thirty-eight times as much as the
Thames. No other river has so vast a drainage-basin: it draws its water
supply from twenty-eight States and Territories; from Delaware, on the
Atlantic seaboard, and from all the country between that and Idaho on
the Pacific slope--a spread of forty-five degrees of longitude. The
Mississippi receives and carries to the Gulf water from fifty-four
subordinate rivers that are navigable by steamboats, and from some
hundreds that are navigable by flats and keels. The area of its
drainage-basin is as great as the combined areas of England, Wales,
Scotland, Ireland, France, Spain, Portugal, Germany, Austria, Italy,
and Turkey; and almost all this wide region is fertile; the Mississippi
valley, proper, is exceptionally so.
After removing the common words, I need to break the text into sentences, using "." as the full stop, and count the appearances of the target word across sentences. I also need to create a profile for the target word showing its associated words and their counts. For example, if "river" is the target word, the associated words include "commonplace", "contrary" and so on, i.e. words that occur in the same sentence (within a full stop) as "river". The desired output is listed in descending order:
river 4
ground: 1
journey: 1
longitude: 1
main: 1
world--four: 1
contrary: 1
cover: 1
...
mississippi 3
area: 1
steamboats: 1
germany: 1
reading: 1
france: 1
proper: 1
...
The three dots mean there are more associated words that are not listed here. And now here is the code so far:
def open_file(file):
    file = "/Users/apple/Documents/sample.txt"
    file1 = "/Users/apple/Documents/common.txt"
    with open(file1, "r") as f:
        common_words = {i.strip() for i in f}
    punctionmark = ":;,'\"."
    trans_table = str.maketrans(punctionmark, " " * len(punctionmark))
    word_counter = {}
    with open(file, "r") as f:
        for line in f:
            for word in line.translate(trans_table).split():
                if word.lower() not in common_words:
                    word_counter[word.lower()] = word_counter.get(word, 0) + 1
    # print(word_counter)
    print("\n".join("{} {}".format(w, c) for w, c in word_counter.items()))
And my output now is:
mississipi 1
reading 1
about 1
commonplace 1
river 4
.
.
.
So far I have counted the occurrences of the target word, but I am stuck on sorting the target words in descending order and on getting the counts for their associated words. Can anyone provide a solution without importing other modules? Thank you so much.
You can use re.findall to tokenize, filter, and group the text into sentences, and then traverse your structure of target and associated words to find the final counts:
import re, string
from collections import namedtuple
import itertools
stop_words = [i.strip('\n') for i in open('common.txt')]
text = open('sample.txt').read()
grammar = {'punctuation':string.punctuation, 'stopword':stop_words}
token = namedtuple('token', ['name', 'value'])
tokenized_file = [token((lambda x:'word' if not x else x[0])([a for a, b in grammar.items() if i.lower() in b]), i) for i in re.findall('\w+|\!|\-|\.|;|,:', text)]
filtered_file = [i for i in tokenized_file if i.name != 'stopword']
grouped_data = [list(b) for _, b in itertools.groupby(filtered_file, key=lambda x:x.value not in '!.?')]
text_with_sentences = ' '.join([' '.join([c.value for c in grouped_data[i]])+grouped_data[i+1][0].value for i in range(0, len(grouped_data), 2)])
Currently, the result of text_with_sentences is:
'Mississippi worth reading. commonplace river contrary ways remarkable. Considering Missouri main branch longest river - -. seems safe crookedest river part journey uses cover ground crow fly six seventy - five. discharges water St. Lawrence twenty - five Rhine thirty - eight Thames. river vast drainage - basin draws water supply twenty - eight States Territories ; Delaware Atlantic seaboard country Idaho Pacific slope - - spread forty - five degrees longitude. Mississippi receives carries Gulf water fifty - subordinate rivers navigable steamboats hundreds navigable flats keels. area drainage - basin combined areas England Wales Scotland Ireland France Spain Portugal Germany Austria Italy Turkey ; almost wide region fertile ; Mississippi valley proper exceptionally.'
To find the counts for the keyword profiling, you can use collections.Counter:
import collections
counts = collections.Counter(map(str.lower, re.findall('[\w\-]+', text)))
structure = [['river', ['ground', 'journey', 'longitude', 'main', 'world--four', 'contrary', 'cover']], ['mississippi', ['area', 'steamboats', 'germany', 'reading', 'france', 'proper']]]
new_structure = [{'keyword':counts.get(a, 0), 'associated':{i:counts.get(i, 0) for i in b}} for a, b in structure]
Output:
[{'associated': {'cover': 1, 'longitude': 1, 'journey': 1, 'contrary': 1, 'main': 1, 'world--four': 1, 'ground': 1}, 'keyword': 4}, {'associated': {'area': 1, 'france': 1, 'germany': 1, 'proper': 1, 'reading': 1, 'steamboats': 1}, 'keyword': 3}]
Without using any modules, str.split can be used:
words = [[i[:-1], i[-1]] if i[-1] in string.punctuation else [i] for i in text.split()]
new_words = [i for b in words for i in b if i.lower() not in stop_words]
def find_groups(d, _pivot='.'):
    current = []
    for i in d:
        if i == _pivot:
            yield ' '.join(current) + '.'
            current = []
        else:
            current.append(i)

print(list(find_groups(new_words)))

counts = {}
for i in new_words:
    if i.lower() not in counts:
        counts[i.lower()] = 1
    else:
        counts[i.lower()] += 1
structure = [['river', ['ground', 'journey', 'longitude', 'main', 'world--four', 'contrary', 'cover']], ['mississippi', ['area', 'steamboats', 'germany', 'reading', 'france', 'proper']]]
new_structure = [{'keyword':counts.get(a, 0), 'associated':{i:counts.get(i, 0) for i in b}} for a, b in structure]
Output:
['Mississippi worth reading.', 'commonplace river , contrary ways remarkable.', 'Considering Missouri main branch , longest river world--four.', 'seems safe crookedest river , part journey uses cover ground crow fly six seventy-five.', 'discharges water St.', 'Lawrence , twenty-five Rhine , thirty-eight Thames.', 'river vast drainage-basin : draws water supply twenty-eight States Territories ; Delaware , Atlantic seaboard , country Idaho Pacific slope--a spread forty-five degrees longitude.', 'Mississippi receives carries Gulf water fifty-four subordinate rivers navigable steamboats , hundreds navigable flats keels.', 'area drainage-basin combined areas England , Wales , Scotland , Ireland , France , Spain , Portugal , Germany , Austria , Italy , Turkey ; almost wide region fertile ; Mississippi valley , proper , exceptionally.']
[{'associated': {'cover': 1, 'longitude': 1, 'journey': 1, 'contrary': 1, 'main': 1, 'world--four': 1, 'ground': 1}, 'keyword': 4}, {'associated': {'area': 1, 'france': 1, 'germany': 1, 'proper': 1, 'reading': 1, 'steamboats': 1}, 'keyword': 3}]
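Neither snippet above actually prints the requested descending-order profile; here is a minimal closing sketch for that step, using only builtins and assuming the counts dict and structure list defined above:

# Print each keyword and its associated words in descending order of count.
for keyword, associated in sorted(structure, key=lambda kw: counts.get(kw[0], 0), reverse=True):
    print('{} {}'.format(keyword, counts.get(keyword, 0)))
    for word in sorted(associated, key=lambda w: counts.get(w, 0), reverse=True):
        print('{}: {}'.format(word, counts.get(word, 0)))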

Python, find words from array in string

I just want to ask how I can find words from an array in my string.
I need to make a filter that will find the words I saved in my array in text that a user types into the text window on my web page.
I need to have 30+ words in an array or list or something.
Then the user types text in the text box.
Then the script should find all the words.
Something like a spam filter, I guess.
Thanks
import re
words = ['word1', 'word2', 'word4']
s = 'Word1 qwerty word2, word3 word44'
r = re.compile('|'.join([r'\b%s\b' % w for w in words]), flags=re.I)
r.findall(s)
>> ['Word1', 'word2']
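One caveat to the pattern above: if a keyword could ever contain regex metacharacters, re.escape keeps the compiled pattern safe. A small sketch, otherwise identical in behaviour:

import re

words = ['word1', 'word2', 'word4']
# re.escape neutralizes any regex metacharacters inside the keywords
r = re.compile('|'.join(r'\b{}\b'.format(re.escape(w)) for w in words), flags=re.I)
print(r.findall('Word1 qwerty word2, word3 word44'))  # ['Word1', 'word2']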
Solution 1 uses the regex approach, which will return all instances of the keyword found in the data. Solution 2 will return the indexes of all instances of the keyword found in the data.
import re
dataString = '''Life morning don't were in multiply yielding multiply gathered from it. She'd of evening kind creature lesser years us every, without Abundantly fly land there there sixth creature it. All form every for a signs without very grass. Behold our bring can't one So itself fill bring together their rule from, let, given winged our. Creepeth Sixth earth saying also unto to his kind midst of. Living male without for fruitful earth open fruit for. Lesser beast replenish evening gathering.
Behold own, don't place, winged. After said without of divide female signs blessed subdue wherein all were meat shall that living his tree morning cattle divide cattle creeping rule morning. Light he which he sea from fill. Of shall shall. Creature blessed.
Our. Days under form stars so over shall which seed doesn't lesser rule waters. Saying whose. Seasons, place may brought over. All she'd thing male Stars their won't firmament above make earth to blessed set man shall two it abundantly in bring living green creepeth all air make stars under for let a great divided Void Wherein night light image fish one. Fowl, thing. Moved fruit i fill saw likeness seas Tree won't Don't moving days seed darkness.
'''
keyWords = ['Life', 'stars', 'seed', 'rule']
#---------------------- SOLUTION 1
print 'Solution 1 output:'
for keyWord in keyWords:
    print re.findall(keyWord, dataString)
#---------------------- SOLUTION 2
print '\nSolution 2 output:'
for keyWord in keyWords:
    index = 0
    indexes = []
    indexFound = 0
    while indexFound != -1:
        indexFound = dataString.find(keyWord, index)
        if indexFound not in indexes:
            indexes.append(indexFound)
        index += 1
    indexes.pop(-1)
    print indexes
Output:
Solution 1 output:
['Life']
['stars', 'stars']
['seed', 'seed']
['rule', 'rule', 'rule']
Solution 2 output:
[0]
[765, 1024]
[791, 1180]
[295, 663, 811]
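A slightly tighter variant of the Solution 2 loop, jumping straight past each hit instead of advancing one character at a time (a sketch; it produces the same index lists with fewer find calls):

def find_indexes(data, keyword):
    # Collect every start index of keyword in data.
    indexes = []
    i = data.find(keyword)
    while i != -1:
        indexes.append(i)
        i = data.find(keyword, i + 1)
    return indexes

for keyWord in keyWords:
    print find_indexes(dataString, keyWord)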
Try
words = ['word1', 'word2', 'word4']
s = 'word1 qwerty word2, word3 word44'
s1 = s.split(" ")
i = 0
for x in s1:
    if x.strip(",") in words:  # strip trailing commas so 'word2,' still matches
        print x
        i += 1
print "count is " + str(i)
output
word1
word2
count is 2

Extracting name from line

I have data in the following format:
Bxxxx, Mxxxx F Birmingham AL (123) 555-2281 NCC Clinical Mental Health, Counselor Education, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029 -99.8115
Axxxx, Axxxx Brown Birmingham AL (123) 555-2281 NCC Clinical Mental Health, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling English 99.52029 -99.8115
Axxxx, Bxxxx Mobile AL (123) 555-8011 NCC Childhood & Adolescence, Clinical Mental Health, Sexual Abuse Recovery, Disaster Counseling English 99.68639 -99.053238
Axxxx, Rxxxx Lunsford Athens AL (123) 555-8119 NCC, NCCC, NCSC Career Development, Childhood & Adolescence, School, Disaster Counseling, Supervision English 99.804501 -99.971283
Axxxx, Mxxxx Mobile AL (123) 555-5963 NCC Clinical Mental Health, Counselor Education, Depression/Grief/Chronically or Terminally Ill, Mental Health/Agency Counseling, Supervision English 99.68639 -99.053238
Axxxx, Txxxx Mountain Brook AL (123) 555-3099 NCC Addictions and Dependency, Career Development, Childhood & Adolescence, Corrections/Offenders, Sexual Abuse Recovery English 99.50214 -99.75557
Axxxx, Lxxxx Birmingham AL (123) 555-4550 NCC Addictions and Dependency, Eating Disorders English 99.52029 -99.8115
Axxxx, Wxxxx Birmingham AL (123) 555-2328 NCC English 99.52029 -99.8115
Axxxx, Rxxxx Mobile AL (123) 555-9411 NCC Addictions and Dependency, Childhood & Adolescence, Couples & Family, Sexual Abuse Recovery, Depression/Grief/Chronically or Terminally Ill English 99.68639 -99.053238
I need to extract only the person names. Ideally, I'd be able to use humanName to get a bunch of name objects with the fields name.first, name.middle, name.last, name.title...
I've tried iterating through until I hit the first two consecutive capital letters (representing the state), storing everything before that in a list, and then calling humanName, but that was a disaster. I don't want to continue with that method.
Is there a way to sense the starts and ends of words? That might be helpful...
Recommendations?
Your best bet is to find a different data source. Seriously. This one is farked.
If you can't do that, then I would do some work like this:
1. Replace all double spaces with single spaces.
2. Split the line by spaces.
3. Take the last 2 items in the list. Those are lat and lng.
4. Looping backwards through the list, look up each item in a list of potential languages. If the lookup fails, you are done with languages.
5. Join the remaining list items back with spaces.
6. In the line, find the first opening paren. Read about 13 or 14 characters in, replace all punctuation with empty strings, and reformat it as a normal phone number.
7. Split the remainder of the line after the phone number by commas.
8. Using that split, loop through each item in the list. If the text starts with more than 1 capital letter, add it to certifications. Otherwise, add it to areas of practice.
9. Going back to the index you found in step #6, get the line up until then. Split it on spaces, and take the last item. That's the state. All that's left is name and city!
10. Take the first 2 items in the space-split line. That's your best guess for the name, so far.
11. Look at the 3rd item. If it is a single letter, add it to the name and remove it from the list.
12. Download US.zip from here: http://download.geonames.org/export/zip/US.zip
13. In the US data file, split each line on tabs. Take the data at indexes 2 and 4, which are city name and state abbreviation. Loop through all the data and insert each row, concatenated as abbreviation + ":" + city name (i.e. AK:Sand Point), into a new list.
14. Make a combination of all possible joins of the remaining items in your line, in the same format as in step #13. So you'd end up with AL:Brown Birmingham and AL:Birmingham for the 2nd line.
15. Loop through each combination and search for it in the list you created in step #13. If you find it, remove it from the split list.
16. Add all remaining items in the string-split list to the person's name.
17. If desired, split the name on the comma. index[0] is the last name, index[1] is all remaining names. Don't make any assumptions about middle names.
Just for giggles, I implemented this. Enjoy.
import itertools

# this list of languages could be longer and should read from a file
languages = ["English", "Spanish", "Italian", "Japanese", "French",
             "Standard Chinese", "Chinese", "Hindi", "Standard Arabic", "Russian"]
languages = [language.lower() for language in languages]

# Loop through US.txt and format it. Download from geonames.org.
cities = []
with open('US.txt', 'r') as us_data:
    for line in us_data:
        line_split = line.split("\t")
        cities.append("{}:{}".format(line_split[4], line_split[2]))

# This is the dataset
with open('state-teachers.txt', 'r') as teachers:
    next(teachers)  # skip header
    for line in teachers:
        # Replace all double spaces with single spaces
        while line.find("  ") != -1:
            line = line.replace("  ", " ")

        line_split = line.split(" ")

        # Lat/Lon are the last 2 items
        longitude = line_split.pop().strip()
        latitude = line_split.pop().strip()

        # Search for potential languages and trim off the line as we find them
        teacher_languages = []
        while True:
            language_check = line_split[-1]
            if language_check.lower().replace(",", "").strip() in languages:
                teacher_languages.append(language_check)
                del line_split[-1]
            else:
                break

        # Rejoin everything and then use phone number as the special key to split on
        line = " ".join(line_split)
        phone_start = line.find("(")
        phone = line[phone_start:phone_start+14].strip()
        after_phone = line[phone_start+15:]

        # Certifications can be recognized as acronyms
        # Anything else is assumed to be an area of practice
        certifications = []
        areas_of_practice = []
        specialties = after_phone.split(",")
        for specialty in specialties:
            specialty = specialty.strip()
            if specialty[0:2].upper() == specialty[0:2]:
                certifications.append(specialty)
            else:
                areas_of_practice.append(specialty)

        before_phone = line[0:phone_start-1]
        line_split = before_phone.split(" ")

        # State is the last column before phone
        state = line_split.pop()

        # Name should be the first 2 columns, at least. This is a basic guess.
        name = line_split[0] + " " + line_split[1]
        line_split = line_split[2:]

        # Add initials
        if len(line_split[0].strip()) == 1:
            name += " " + line_split[0].strip()
            line_split = line_split[1:]

        # Combo of all potential word combinations to see if we're dealing with a city or a name
        combos = [" ".join(combo) for combo in set(itertools.permutations(line_split))] + line_split
        line = " ".join(line_split)
        city = ""

        # See if the state:city combo is valid. If so, set it and let everything else be the name
        for combo in combos:
            if "{}:{}".format(state, combo) in cities:
                city = combo
                line = line.replace(combo, "")
                break

        # Remaining data must be a name
        if line.strip() != "":
            name += " " + line

        # Clean up names
        last_name, first_name = [piece.strip() for piece in name.split(",")]
        print first_name, last_name
Not a code answer, but it looks like you could get most/all of the data you're after from the licensing board at http://www.abec.alabama.gov/rostersearch2.asp?search=%25&submit1=Search. Names are easy to get there.
