I have a list of job titles (12,000 in total) formatted in this way:
Career_List = ['1) ABLE SEAMAN', '2) ABRASIVE GRADER', '3) ABRASIVE GRINDER']
How do I remove the numbers, parentheses, and spaces from the list elements so that I end up with this output:
Career_List_Updated = ['ABLE SEAMAN', 'ABRASIVE GRADER', 'ABRASIVE GRINDER']
I know that I can't simply remove the first three characters, because the list has more than ten items, so the numeric prefix varies in length.
Take advantage of the fact that str.lstrip() and the rest of the strip functions accept multiple characters as an argument.
Career_List_Updated = [career.lstrip('0123456789) ') for career in Career_List]
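This works regardless of how many digits precede the name; a quick sketch with a made-up five-digit entry:

```python
# lstrip('0123456789) ') strips any leading run of digits, ')' and spaces,
# so prefixes of any length ('1) ', '42) ', '12000) ') are removed alike.
Career_List = ['1) ABLE SEAMAN', '42) ABRASIVE GRADER', '12000) ZEBRA CLEANER']
Career_List_Updated = [career.lstrip('0123456789) ') for career in Career_List]
print(Career_List_Updated)  # ['ABLE SEAMAN', 'ABRASIVE GRADER', 'ZEBRA CLEANER']
```

One caveat worth knowing: a title that itself starts with a digit (say, a hypothetical '7) 3D MODELER') would lose its leading '3' too, since lstrip keeps stripping while characters are in the set.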
Split each career at the first space; keep the rest of the line.
Career_List = ['1) ABLE SEAMAN', '2) ABRASIVE GRADER', '3) ABRASIVE GRINDER', '12000) ZEBRA CLEANER']
Career_List_Updated = []
for career in Career_List:
    job = career.split(' ', 1)
    Career_List_Updated.append(job[1])
print(Career_List_Updated)
Output:
['ABLE SEAMAN', 'ABRASIVE GRADER', 'ABRASIVE GRINDER', 'ZEBRA CLEANER']
One-line version:
Career_List_Updated = [career.split(' ', 1)[1]
                       for career in Career_List]
We want to find the first index that STOPS being a bad character and return the rest of the string, as follows.
def strip_bad_starting_characters_from_string(string):
    bad_chars = set("'0123456789 )")  # set of characters we don't like
    for i, char in enumerate(string):
        if char not in bad_chars:
            # we are at the first index past the "noise" prefix
            return string[i:]
    return ""  # the whole string was "noise"

Career_List_Updated = [strip_bad_starting_characters_from_string(string) for string in Career_List]
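The same idea can be expressed with itertools.dropwhile from the standard library — a sketch, not part of the original answer:

```python
from itertools import dropwhile

def strip_bad_starting_characters(string, bad_chars=frozenset("'0123456789 )")):
    # dropwhile skips characters while they are in bad_chars,
    # then yields the remainder of the string untouched
    return ''.join(dropwhile(lambda ch: ch in bad_chars, string))

print(strip_bad_starting_characters('12000) ZEBRA CLEANER'))  # ZEBRA CLEANER
```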
I have a strange problem with the following codewars Kata:
https://www.codewars.com/kata/51e056fe544cf36c410000fb/train/python
I haven't completed it, but I ran into a really strange problem with the string.count() method.
The count that I get with the method for the word "an" is 8, even though it appears only once in the string.
Any advice would be much appreciated.
Here is my code:
import re
words = "In a village of La Mancha, the name of which I have no desire to call to \
mind, there lived not long since one of those gentlemen that keep a lance \
in the lance-rack, an old buckler, a lean hack, and a greyhound for \
coursing. An olla of rather more beef than mutton, a salad on most \
nights, scraps on Saturdays, lentils on Fridays, and a pigeon or so extra \
on Sundays, made away with three-quarters of his income."
# => ["a", "of", "on"]
def top_3_words(text: str):
    text = re.sub('[^0-9a-zA-Z]+', ' ', text).strip(' ')
    arr = text.split(' ')
    dict = {value: text.count(value) for value in arr}
    result = []
    for _ in range(3):
        max_key = max(dict, key=dict.get)
        result.append(max_key)
        dict.pop(max_key)
    print(result)
top_3_words(words)
You should first tokenize the string words into a list of words and then count.
text = text.replace(',', '').split()
The above code returns a list of the words in your string. Then you can count the number of occurrences of each word in the list.
First split your string into individual words, then count the number of occurrences of an in the resulting list.
temp_words = words.split(' ')
an_count = temp_words.count('an')
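To see why the substring count is inflated: str.count matches the two letters "an" anywhere, including inside other words, while counting on the split list matches whole tokens only. A small sketch (the sample sentence here is shortened and made up):

```python
text = "In a village of La Mancha there lived a gentleman with an old buckler and a lance"
# str.count matches the letters "an" inside "Mancha", "gentleman", "and", "lance", ...
print(text.count('an'))          # 5
# counting on the token list matches only the standalone word "an"
print(text.split().count('an'))  # 1
```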
I have a long list of many elements, each element is a string. See below sample:
data = ['BAT.A.100', 'Regulation 2020-1233', 'this is the core text of', 'the regulation referenced ',
'MOC to BAT.A.100', 'this', 'is', 'one method of demonstrating compliance to BAT.A.100',
'BAT.A.120', 'Regulation 2020-1599', 'core text of the regulation ...', ' more free text','more free text',
'BAT.A.145', 'Regulation 2019-3333', 'core text of' ,'the regulation1111',
'MOC to BAT.A.145', 'here is how you can show compliance to BAT.A.145','more free text',
'MOC2 to BAT.A.145', ' here is yet another way of achieving compliance']
My desired output is ultimately a Pandas DataFrame as follows:
As the strings may have to be concatenated, I first join all the elements into a single string, using ## to separate the pieces that have been joined.
I am going with all-regex because there would be a lot of conditions to check otherwise.
re_req = re.compile(r'##(?P<Short_ref>BAT\.A\.\d{3})'
r'##(?P<Full_Reg_ref>Regulation\s\d{4}-\d{4})'
r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))'
r'(?:##)?(?:(?P<Moc_text>.*?MOC2 to \1)(?P<MOC2>(?:##)?.*?(?=##BAT\.A\.\d{3})(?!\1)|.+)'
r'|(?P<Moc_text_temp>.*?(?=##BAT\.A\.\d{3})(?!\1)))')
final_list = []
for match in re_req.finditer("##" + "##".join(data)):
    inner_list = [match.group('Short_ref').replace("##", " "),
                  match.group('Full_Reg_ref').replace("##", " "),
                  match.group('Reg_text').replace("##", " ")]
    if match.group('Moc_text_temp'):  # just Moc_text is present
        inner_list += [match.group('Moc_text_temp').replace("##", " "), ""]
    elif match.group('Moc_text') and match.group('MOC2'):  # both Moc_text and MOC2 are present
        inner_list += [match.group('Moc_text').replace("##", " "), match.group('MOC2').replace("##", " ")]
    else:  # neither Moc_text nor MOC2 is present
        inner_list += ["", ""]
    final_list.append(inner_list)

final_df = pd.DataFrame(final_list, columns=['Short_ref', 'Full_Reg_ref', 'Reg_text', 'Moc_text', 'MOC2'])
The first and second lines of the regex are the same as the ones you posted earlier and identify the first two columns.
In the third line, r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))' matches all the text up to MOC to Short_ref, or else all the text before the next Short_ref. The (?=##BAT\.A\.\d{3})(?!\1) part takes the text up to a Short_ref pattern, provided that Short_ref is not the current one.
The fourth line covers the case where Moc_text and MOC2 are both present, and it is or-ed with the fifth line, which covers the case where just Moc_text is present. This part of the regex is similar to the third line.
Finally, the code loops over all the matches with finditer and constructs the rows of the dataframe.
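As a minimal illustration of the two regex features the pattern leans on — named groups and the \1 backreference — here is a toy pattern on made-up data:

```python
import re

# group 1 is the named group 'ref', so \1 later in the pattern
# must match the exact same text that 'ref' captured
pat = re.compile(r'##(?P<ref>BAT\.A\.\d{3})##(?P<text>.*?MOC to \1)')
sample = "##BAT.A.100##some regulation text##MOC to BAT.A.100"
m = pat.search(sample)
print(m.group('ref'))   # BAT.A.100
print(m.group('text'))  # some regulation text##MOC to BAT.A.100
```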
final_df:
I want to make sure that each sentence in a text starts with a capital letter.
E.g. "we have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. the good news is they tasted like chicken." should become
"We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. The good news is they tasted like chicken."
I tried using split() to split the sentence. Then, I capitalized the first character of each line. I appended the rest of the string to the capitalized character.
text = input("Enter the text: \n")
lines = text.split('. ')  # Split the sentences
for line in lines:
    a = line[0].capitalize()  # capitalize the first character of the sentence
    for i in range(1, len(line)):
        a = a + line[i]
    print(a)
I want to obtain "We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. The good news is they tasted like chicken."
I get "We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister
The good news is they tasted like chicken."
This code should work:
text = input("Enter the text: \n")
lines = text.split('. ')  # Split the sentences
for index, line in enumerate(lines):
    lines[index] = line[0].upper() + line[1:]
print(". ".join(lines))
The error in your code is that str.split(chars) removes the delimiter chars it splits on, which is why the periods disappear.
Sorry for not providing a thorough description as I cannot think of what to say. Please feel free to ask in comments.
EDIT: Let me try to explain what I did.
Lines 1-2: Accepts the input and splits into a list by '. '. On the sample input, this gives: ['"We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister', 'the good news is they tasted like chicken.']. Note the period is gone from the first sentence where it was split.
Line 4: enumerate iterates through an iterable, yielding a tuple of each item's index and the item itself.
Line 5: Replaces the place of line in lines with the capital of the first character plus the rest of the line.
Line 6: Prints the message. ". ".join(lines) basically reverses what you did with split. str.join(iterable) takes an iterable of strings and sticks them together with str between the items. Without this, you would be missing your periods.
When you split the string by ". ", that removes the ". "s from your string and puts the rest into a list. You need to add the lost periods back to your sentences to make this work.
Also, this can leave the last sentence with a double period, since it ends with just "." rather than ". ". We need to remove the period (if there is one) at the end of the text before splitting, to make sure we don't get double periods.
text = input("Enter the text: \n")
output = ""
if text[-1] == '.':
    # remove the final period to avoid a double period in the last sentence
    text = text[:-1]
lines = text.split('. ')  # Split the sentences
for line in lines:
    a = line[0].capitalize()  # capitalize the first character of the sentence
    for i in range(1, len(line)):
        a = a + line[i]
    a = a + '. '  # add back the removed period, plus the space between sentences
    output = output + a
print(output.rstrip())
We can also make this solution cleaner:
text = input("Enter the text: \n")
output = ""
if text[-1] == '.':
    # remove the final period to avoid a double period in the last sentence
    text = text[:-1]
lines = text.split('. ')  # Split the sentences
for line in lines:
    output = output + line[0].capitalize() + line[1:] + '. '
print(output.rstrip())
By using str[1:] you get a copy of your string with the first character removed, and str[:-1] gives you a copy with the last character removed.
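As a quick sketch of those two slices, plus the capitalization trick used above:

```python
s = "hello."
print(s[1:])                 # "ello."  - first character removed
print(s[:-1])                # "hello"  - last character removed
print(s[0].upper() + s[1:])  # "Hello." - only the first character changed
```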
split splits the string AND none of the new strings contain the delimiter - or the string/character you split by.
Change your code to this:
text = input("Enter the text: \n")
lines = text.split('. ') #Split the sentences
final_text = ". ".join([line[0].upper()+line[1:] for line in lines])
print(final_text)
The below can handle multiple sentence types (ending in ".", "!", "?", etc.) and will capitalize the first word of each sentence. Since you want to keep your existing capital letters, the capitalize function will not work (it lowercases every character after the first). You can use a lambda in the list comprehension to apply upper() to just the first letter of each sentence, leaving the rest of the sentence completely unchanged.
import re
original_sentence = 'we have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. the good news is they tasted like chicken.'
val = re.split('([.!?] *)', original_sentence)
new_sentence = ''.join([(lambda x: x[0].upper() + x[1:])(each) if len(each) > 1 else each for each in val])
print(new_sentence)
The "new_sentence" list comprehension is the same as saying:
sentence = []
for each in val:
    sentence.append((lambda x: x[0].upper() + x[1:])(each) if len(each) > 1 else each)
print(''.join(sentence))
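The key detail that makes this approach work is that re.split keeps the delimiters in the output when the pattern contains a capturing group, whereas plain str.split (and a group-less re.split) throws them away. A small sketch:

```python
import re

s = "first. second! third? done."
# without a group the delimiters are lost, like str.split
print(re.split(r'[.!?] *', s))    # ['first', 'second', 'third', 'done', '']
# with a capturing group the delimiters come back as list items
print(re.split(r'([.!?] *)', s))  # ['first', '. ', 'second', '! ', 'third', '? ', 'done', '.', '']
```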
You can use the re.sub function to replace the first character of the text, and any word character following a dot and a space, with its uppercase equivalent.
import re
original_sentence = 'we have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. the good news is they tasted like chicken.'
def replacer(match_obj):
    return match_obj.group(0).upper()

# Replace the very first character, or any character following a dot and a space, with its uppercase version.
re.sub(r"(?<=\. )(\w)|^\w", replacer, original_sentence)
>>> 'We have good news and bad news about your emissaries to our world," the extraterrestrial ambassador informed the Prime Minister. The good news is they tasted like chicken.'
I found some similar questions, but nothing from python.
The context:
I have many pdf files (text) which have a table among other texts.
The position and size of the table varies from file to file.
I already tried different libraries; pdftotext was the best so far (tabula didn't work, for example).
Solution so far:
I use pdftotext to extract all the info as one big string, find the substrings that always delimit the table, and save the table in a variable.
Unfortunately, I can't write the whole content of the table, but the first two lines:
D Staph. aureus Ps. aeruginosa E. coli ATCC Ser. Asp. Cand. albicans
a ATCC 6538, ATCC 9027, Ps. 8739, Ent. marcescens brasiliensis ATCC 10231,
since pdftotext puts a "\n" at the end of each line, I could split the table into each row
My goal here is to separate this string into substrings as columns like this:
['Staph. aureus', 'Ps. aeruginosa', 'E. coli ATCC', 'Ser.', 'Asp.', 'Cand. albicans']
and this:
['ATCC 6538, ', 'ATCC 9027, Ps. ', '8739, Ent. ', 'marcescens ', 'brasiliensis ', 'ATCC 10231,']
In the second line, for example, the columns are delimited every 15 characters.
I realized that the maximum length of a column is 15 characters, so I tried splitting it like this, with n = 15:
print([line[i: (i + n)] for i in range(0, len(line), n)])
but this is what I get:
['Staph. aureus ', 'Ps. aeruginosa ', 'E. coli ATCC Se', 'r. ', 'Asp. ', 'Cand. albicans']
The question here is: how do I cut the string into substrings without cutting words?
I already realized that if I cut at position line[i + n], then line[i + n - 1] has to be equal to " " in order not to cut a word.
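If preserving the original column boundaries is not essential, the standard library's textwrap.wrap already chunks a string to at most n characters without splitting words. Note it packs words greedily, so the pieces will not line up with the 15-character columns of the PDF — a sketch:

```python
import textwrap

line = "ATCC 6538, ATCC 9027, Ps. 8739, Ent. marcescens brasiliensis ATCC 10231,"
# wrap to at most 15 characters per chunk, breaking only at whitespace
chunks = textwrap.wrap(line, width=15)
print(chunks)
```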
You can split a string into words using str.split(). If you do not provide a delimiter, it splits on whitespace by default and returns the words of the string. See the official Python documentation.
The data seems to be tab delimited, but with the tabs replaced with spaces.
The only pattern I can spot is multiple spaces between column values. If this is the case, your code would break if there were a double space inside a value (e.g. a typo by the author).
Using the maximum column width is risky. It would break if the columns have short values (e.g. 'one', 'two').
Would this be of help?
str = "D Staph. aureus Ps. aeruginosa E. coli ATCC Ser. Asp.
Cand. albicans"
list = []
for s in str.split():
if "." in s:
list.append(s)
elif list:
list[-1] = f"{list[-1]} {s}"
print(list)
output:
['Staph. aureus', 'Ps. aeruginosa', 'E. coli ATCC', 'Ser.', 'Asp.', 'Cand. albicans']
This splits the string at whitespace, then goes through the resulting words and starts a new entry in the list whenever a word contains ".", appending the following words to that entry until the next word with "." is encountered.
I can't see a rule that we could apply to all rows, but with these two examples we could do:
line1 = "D Staph. aureus Ps. aeruginosa E. coli ATCC Ser. Asp. Cand. albicans"
line2 = "a ATCC 6538, ATCC 9027, Ps. 8739, Ent. marcescens brasiliensis ATCC 10231,"
for line in (line1, line2):
if line[0] == "D":
list = []
for s in line.split():
if "." in s:
list.append(s)
elif list:
list[-1] = f"{list[-1]} {s}"
print(list)
if line[0] == "a":
count = 0
list = []
for s in line2[3:]:
if count % 15 == 0 or count == 0:
list.append(s)
if len(list) > 1: list[-2] = list[-2].rstrip()
else:
list[-1] = f"{list[-1]}{s}"
count += 1
print(list)
output:
['Staph. aureus', 'Ps. aeruginosa', 'E. coli ATCC', 'Ser.', 'Asp.', 'Cand. albicans']
['ATCC 6538,', 'ATCC 9027, Ps.', '8739, Ent.', 'marcescens', 'brasiliensis', 'ATCC 10231,']
Looks pretty horrible, but hopefully gives some ideas. :)
So I'm trying to count the characters in anhCrawler and return the number of characters with and without spaces, along with the position of "DEATH STAR", and return it all in theReport. I can't get the numbers to count correctly. Please help!
anhCrawler = """Episode IV, A NEW HOPE. It is a period of civil war. \
Rebel spaceships, striking from a hidden base, have won their first \
victory against the evil Galactic Empire. During the battle, Rebel \
spies managed to steal secret plans to the Empire's ultimate weapon, \
the DEATH STAR, an armored space station with enough power to destroy \
an entire planet. Pursued by the Empire's sinister agents, Princess Leia\
races home aboard her starship, custodian of the stolen plans that can \
save her people and restore freedom to the galaxy."""
theReport = """
This text contains {0} characters ({1} if you ignore spaces).
There are approximately {2} words in the text. The phrase
DEATH STAR occurs and starts at position {3}.
"""
import re

def analyzeCrawler(thetext):
    numchars = 0
    nospacechars = 0
    numspacechars = 0
    anhCrawler = thetext
    word = anhCrawler.split()
    for char in word:
        numchars = word[numchars]
        if numchars == " ":
            numspacechars += 1
    anhCrawler = re.split(" ", anhCrawler)
    for char in anhCrawler:
        nospacechars += 1
    numwords = len(anhCrawler)
    pos = thetext.find("DEATH STAR")
    char_len = len("DEATH STAR")
    ds = thetext[261:271]
    dspos = "[261:271]"
    return theReport.format(numchars, nospacechars, numwords, dspos)

print(analyzeCrawler(theReport))
You're overthinking this problem.
Number of chars in string (returns 520):
len(anhCrawler)
Number of non-whitespace characters in the string: split removes all the whitespace, and join glues the pieces back together with nothing in between (returns 434):
len(''.join(anhCrawler.split()))
Finding the position of "DEATH STAR" (returns 261):
anhCrawler.find("DEATH STAR")
Here is a simplified version of your function:
import re

def analyzeCrawler2(thetext, text_to_search="DEATH STAR"):
    numchars = len(thetext)
    nospacechars = len(re.sub(r"\s+", "", thetext))
    numwords = len(thetext.split())
    dspos = thetext.find(text_to_search)
    return theReport.format(numchars, nospacechars, numwords, dspos)

print(analyzeCrawler2(anhCrawler))
This text contains 520 characters (434 if you ignore spaces).
There are approximately 87 words in the text. The phrase
DEATH STAR occurs and starts at position 261.
I think the tricky part is removing the whitespace from the string to get the non-space character count. That can be done simply with a regular expression; the rest should be self-explanatory.
First off, you need to indent the code that's inside the function. Second, your code can be simplified to the following:
theReport = """
This text contains {0} characters ({1} if you ignore spaces).
There are approximately {2} words in the text. The phrase
DEATH STAR is the {3}th word and starts at the {4}th character.
"""
def analyzeCrawler(thetext):
    numchars = len(thetext)
    nospacechars = len(thetext.replace(' ', ''))
    numwords = len(thetext.split())
    word = 'DEATH STAR'
    # .split() yields single tokens, so look up the phrase's first word
    wordPosition = thetext.split().index(word.split()[0])
    charPosition = thetext.find(word)
    return theReport.format(
        numchars, nospacechars, numwords, wordPosition, charPosition
    )
I modified the last two format arguments because it wasn't clear what you meant by dspos, although maybe it's obvious and I'm not seeing it. In any case, I included the word and char position instead. You can determine which one you really meant to include.
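For what it's worth, the two positions measure different things — a quick sketch on a shortened string:

```python
text = "the DEATH STAR, an armored space station"
# character position: an index into the raw string
print(text.find("DEATH STAR"))      # 4
# word position: an index into the whitespace-split token list;
# note the two-word phrase itself is never a single token
print(text.split().index("DEATH"))  # 1
```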