Splitting a string without cutting words - Python

I found some similar questions, but nothing for Python.
The context:
I have many PDF files (text) which contain a table among other text.
The position and size of the table vary from file to file.
I have already tried different libraries; pdftotext has been the best so far (tabula, for example, didn't work).
Solution so far:
I use pdftotext to extract all the info as one big string, find the substrings that always delimit the table, and save the table in a variable.
Unfortunately, I can't post the whole content of the table, only its first two lines:
D Staph. aureus Ps. aeruginosa E. coli ATCC Ser. Asp. Cand. albicans
a ATCC 6538, ATCC 9027, Ps. 8739, Ent. marcescens brasiliensis ATCC 10231,
Since pdftotext puts a "\n" at the end of each line, I could split the table into rows.
My goal here is to separate this string into substrings as columns like this:
['Staph. aureus', 'Ps. aeruginosa', 'E. coli ATCC', 'Ser.', 'Asp.', 'Cand. albicans']
and this:
['ATCC 6538, ', 'ATCC 9027, Ps. ', '8739, Ent. ', 'marcescens ', 'brasiliensis ', 'ATCC 10231,']
The second line, for example, is delimited every 15 characters.
Since I realized that the maximum length of a column is 15 characters, I tried splitting it like this, with n = 15:
print([line[i: (i + n)] for i in range(0, len(line), n)])
but this is what I get:
['Staph. aureus ', 'Ps. aeruginosa ', 'E. coli ATCC Se', 'r. ', 'Asp. ', 'Cand. albicans']
The question here is: how do I cut the string into substrings without cutting words?
I have already realized that if I cut at position line[i + n], then line[i + n - 1] has to be " " in order not to cut a word.
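(For what it's worth, the standard library's textwrap module implements exactly that rule: it breaks lines only at whitespace, never inside a word. Note that, unlike fixed-width slicing, it drops the trailing padding spaces.)

```python
import textwrap

line = "a ATCC 6538, ATCC 9027, Ps. 8739, Ent. marcescens brasiliensis ATCC 10231,"
cols = textwrap.wrap(line, width=15)  # breaks only at spaces, never inside a word
print(cols)
# → ['a ATCC 6538,', 'ATCC 9027, Ps.', '8739, Ent.', 'marcescens', 'brasiliensis', 'ATCC 10231,']
```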

You can split a string into words using str.split(). If you do not provide a delimiter, it splits on whitespace by default and returns the words of the string. See the official Python documentation.
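For example:

```python
text = "Staph. aureus Ps. aeruginosa"
print(text.split())  # splits on any run of whitespace
# → ['Staph.', 'aureus', 'Ps.', 'aeruginosa']
```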

The data seems to be tab-delimited, but with the tabs replaced by spaces.
The only pattern I can spot is multiple spaces between column values. If that is the case, your code would break whenever a stray double space appears inside a value (e.g. a typo by the author).
Using the maximum column width is also risky: it would break if the columns held short values (e.g. 'one', 'two').
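If the columns really are separated by runs of two or more spaces, a regex split would recover them. (The sample line below is hypothetical; the original spacing was collapsed when the question was posted.)

```python
import re

line = "Staph. aureus  Ps. aeruginosa  E. coli ATCC  Ser.  Asp.  Cand. albicans"
cols = re.split(r' {2,}', line)  # split on two or more consecutive spaces
print(cols)
# → ['Staph. aureus', 'Ps. aeruginosa', 'E. coli ATCC', 'Ser.', 'Asp.', 'Cand. albicans']
```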

Would this be of help?
text = "D Staph. aureus Ps. aeruginosa E. coli ATCC Ser. Asp. Cand. albicans"
columns = []
for s in text.split():
    if "." in s:
        columns.append(s)
    elif columns:
        columns[-1] = f"{columns[-1]} {s}"
print(columns)
output:
['Staph. aureus', 'Ps. aeruginosa', 'E. coli ATCC', 'Ser.', 'Asp.', 'Cand. albicans']
This splits the string at whitespace, then goes through the resulting list, starting a new entry whenever a word contains "." and appending the following words to that entry until the next word containing "." is encountered.
I can't see a single rule that would apply to all rows, but for these two examples we could do:
line1 = "D Staph. aureus Ps. aeruginosa E. coli ATCC Ser. Asp. Cand. albicans"
line2 = "a ATCC 6538, ATCC 9027, Ps. 8739, Ent. marcescens brasiliensis ATCC 10231,"
for line in (line1, line2):
    if line[0] == "D":
        columns = []
        for s in line.split():
            if "." in s:
                columns.append(s)
            elif columns:
                columns[-1] = f"{columns[-1]} {s}"
        print(columns)
    if line[0] == "a":
        count = 0
        columns = []
        for s in line[3:]:
            if count % 15 == 0 or count == 0:
                columns.append(s)
                if len(columns) > 1:
                    columns[-2] = columns[-2].rstrip()
            else:
                columns[-1] = f"{columns[-1]}{s}"
            count += 1
        print(columns)
output:
['Staph. aureus', 'Ps. aeruginosa', 'E. coli ATCC', 'Ser.', 'Asp.', 'Cand. albicans']
['ATCC 6538,', 'ATCC 9027, Ps.', '8739, Ent.', 'marcescens', 'brasiliensis', 'ATCC 10231,']
Looks pretty horrible, but hopefully gives some ideas. :)

Related

Translate paragraph in python

I am trying to translate a paragraph from English to my local language, for which I have written this code:
def translate(inputvalue):
    # inputvalue is an array of English paragraphs
    try:
        translatedData = []
        trans = Translator()
        for i in inputvalue:
            # add a space where one is missing after '.' or ','
            sentence = re.sub(r'(?<=[.,])(?=[^\s])', r' ', i)
            # translate from English to my local language, Urdu
            t = trans.translate(sentence, src='en', dest='ur')
            # append the result to the translatedData array
            translatedData.append(t.text)
        # finally, call DisplayOutput to print the translated data
        DisplayOutput.output(translatedData)
The problem I am facing is that my local language is written right to left,
and googletrans is not giving proper output. It puts periods, commas, and untranslated words at the beginning or the end. For example:
I am 6 years old. I love to draw cartoons, animals, and plants. I do not have ADHD.
it would translate this sentence as:
میری عمر 6 سال ہے،. مجھے کارٹون جانور اور پودے کھینچنا پسند ہےمجھے ADHD 6نہیں ہے.
As you can observe, it could not translate ADHD (it is just an abbreviation), so it puts it at the beginning of the sentence, and the same goes for periods, numbers, and commas.
How should I translate the text so that it does not conflict like that?
What if I put the sentence in another array like:
['I am', '6', 'years old', '.', 'I love to draw cartoons',',', 'animals',',', 'and plants','.', 'I do not have', 'ADHD','.']
I have no idea how to achieve this type of array, but I believe it could solve the problem, since I could translate only the parts that contain English words and then join the list back into a string.
Kindly help me generate this type of array, or suggest any other solution.
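One possible way to build such an array (my own sketch, not from the question): split the string into runs of letters-and-spaces versus everything else with re.findall. Note that this keeps abbreviations such as ADHD together with the English text; isolating them would need an extra rule.

```python
import re

string = "I am 6 years old. I love to draw cartoons, animals, and plants. I do not have ADHD."
# alternate between "letters and spaces" runs and "everything else" runs
parts = [p.strip() for p in re.findall(r"[A-Za-z ]+|[^A-Za-z ]+", string) if p.strip()]
print(parts)
```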
string = "I am 6 years old. I love to draw cartoons, animals, and plants. I do not have ADHD."
arr = []
substring = ""
for char in string:
    # is this character part of an English phrase (a letter or a space)?
    alpha = char.isalpha() or char == " "
    if substring.replace(" ", "").isalpha():
        if alpha:
            substring += char
        else:
            arr.append(substring)
            substring = char
    else:
        if alpha:
            arr.append(substring)
            substring = char
arr.append(substring)  # don't lose the final substring
while " " in arr: arr.remove(" ")
while "" in arr: arr.remove("")
print(arr)
Loop through each character in the string and check whether it is a letter with .isalpha(). Then, depending on the state of the current substring, either append the character to it or start a new substring.

Parsing complicated list of strings using regex, loops, enumerate, to produce a pandas dataframe

I have a long list of many elements, each element a string. See the sample below:
data = ['BAT.A.100', 'Regulation 2020-1233', 'this is the core text of', 'the regulation referenced ',
        'MOC to BAT.A.100', 'this', 'is', 'one method of demonstrating compliance to BAT.A.100',
        'BAT.A.120', 'Regulation 2020-1599', 'core text of the regulation ...', ' more free text', 'more free text',
        'BAT.A.145', 'Regulation 2019-3333', 'core text of', 'the regulation1111',
        'MOC to BAT.A.145', 'here is how you can show compliance to BAT.A.145', 'more free text',
        'MOC2 to BAT.A.145', ' here is yet another way of achieving compliance']
My desired output is ultimately a Pandas DataFrame as follows:
As the strings may have to be concatenated, I first join all the elements into a single string, using ## to separate the pieces that have been joined.
I am going all-regex because there would otherwise be a lot of conditions to check.
re_req = re.compile(r'##(?P<Short_ref>BAT\.A\.\d{3})'
r'##(?P<Full_Reg_ref>Regulation\s\d{4}-\d{4})'
r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))'
r'(?:##)?(?:(?P<Moc_text>.*?MOC2 to \1)(?P<MOC2>(?:##)?.*?(?=##BAT\.A\.\d{3})(?!\1)|.+)'
r'|(?P<Moc_text_temp>.*?(?=##BAT\.A\.\d{3})(?!\1)))')
final_list = []
for match in re_req.finditer("##" + "##".join(data)):
    inner_list = [match.group('Short_ref').replace("##", " "),
                  match.group('Full_Reg_ref').replace("##", " "),
                  match.group('Reg_text').replace("##", " ")]
    if match.group('Moc_text_temp'):  # just Moc_text is present
        inner_list += [match.group('Moc_text_temp').replace("##", " "), ""]
    elif match.group('Moc_text') and match.group('MOC2'):  # both Moc_text and MOC2 are present
        inner_list += [match.group('Moc_text').replace("##", " "), match.group('MOC2').replace("##", " ")]
    else:  # neither Moc_text nor MOC2 is present
        inner_list += ["", ""]
    final_list.append(inner_list)
final_df = pd.DataFrame(final_list, columns=['Short_ref', 'Full_Reg_ref', 'Reg_text', 'Moc_text', 'MOC2'])
The first and second lines of the regex are the same as those you posted earlier and identify the first two columns.
The third line, r'##(?P<Reg_text>.*?MOC to \1|.*?(?=##BAT\.A\.\d{3})(?!\1))', matches all text up to MOC to Short_ref, or else all the text before the next record. The (?=##BAT\.A\.\d{3})(?!\1) part takes the text up to the next Short_ref pattern, provided that Short_ref is not the current one.
The fourth line handles the case when both Moc_text and MOC2 are present; it is OR'd with the fifth line, which covers the case when just Moc_text is present. This part of the regex is similar to the third line.
Finally, we loop over all the matches using finditer and construct the rows of the DataFrame.
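As a simplified illustration of the lookahead idea (toy data and a stripped-down pattern, not the author's full regex):

```python
import re

# each record: a short-ref, a regulation ref, then free text up to the next short-ref
data = "##BAT.A.100##Regulation 2020-1233##core text one##BAT.A.120##Regulation 2020-1599##core text two"
rec = re.compile(r'##(?P<Short_ref>BAT\.A\.\d{3})'
                 r'##(?P<Full_Reg_ref>Regulation\s\d{4}-\d{4})'
                 r'##(?P<Reg_text>.*?)(?=##BAT\.A\.\d{3}|$)')
rows = [(m.group('Short_ref'), m.group('Full_Reg_ref'), m.group('Reg_text'))
        for m in rec.finditer(data)]
print(rows)
```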
final_df:

Replace a word in a String by indexing without "string replace function" -python

Is there a way to replace a word within a string without using a "string replace function," e.g. string.replace(string, word, replacement)?
[out] = forecast('This snowy weather is so cold.', 'cold', 'awesome')
out => 'This snowy weather is so awesome.'
Here the word cold is replaced with awesome.
This is from my MATLAB homework, which I am trying to do in Python. When doing this in MATLAB we were not allowed to use strrep().
In MATLAB, I can use strfind to find the index and work from there. However, I noticed that there is a big difference between lists and strings: strings are immutable in Python, so I would likely have to convert the string to a different data type to manipulate it the way I want without using a string replace function.
just for fun :)
st = 'This snowy weather is so cold .'.split()
given_word = 'awesome'
for i, word in enumerate(st):
    if word == 'cold':
        st[i] = given_word
        break  # stop at the first match
print(' '.join(st))
Here's another answer that might be closer to the solution you described using MATLAB:
st = 'This snowy weather is so cold.'
given_word = 'awesome'
word_to_replace = 'cold'
n = len(word_to_replace)
index_of_word_to_replace = st.find(word_to_replace)
print(st[:index_of_word_to_replace] + given_word + st[index_of_word_to_replace + n:])
You can convert your string into a list object, find the index of the word you want to replace and then replace the word.
sentence = "This snowy weather is so cold"
# Split the sentence into a list of the words
words = sentence.split(" ")
# Get the index of the word you want to replace
word_to_replace_index = words.index("cold")
# Replace the target word with the new word based on the index
words[word_to_replace_index] = "awesome"
# Generate a new sentence
new_sentence = ' '.join(words)
Using Regex and a list comprehension.
import re

def strReplace(sentence, toReplace, toReplaceWith):
    return " ".join([re.sub(toReplace, toReplaceWith, i) if re.search(toReplace, i) else i
                     for i in sentence.split()])

print(strReplace('This snowy weather is so cold.', 'cold', 'awesome'))
Output:
This snowy weather is so awesome.

Removing digits from list elements

I have a list of job titles (12,000 in total) formatted in this way:
Career_List = ['1) ABLE SEAMAN', '2) ABRASIVE GRADER', '3) ABRASIVE GRINDER']
How do I remove the numbers, parentheses, and spaces from the list elements so that I end up with this output:
Career_List_Updated = ['ABLE SEAMAN', 'ABRASIVE GRADER', 'ABRASIVE GRINDER']
I know that I can't simply remove the first three characters, because with more than ten items in my list the numeric prefixes vary in length.
Take advantage of the fact that str.lstrip() and the rest of the strip functions accept multiple characters as an argument.
Career_List_Updated = [career.lstrip('0123456789) ') for career in Career_List]
Split each career at the first space; keep the rest of the line.
Career_List = ['1) ABLE SEAMAN', '2) ABRASIVE GRADER', '3) ABRASIVE GRINDER', '12000) ZEBRA CLEANER']
Career_List_Updated = []
for career in Career_List:
    job = career.split(' ', 1)
    Career_List_Updated.append(job[1])
print(Career_List_Updated)
Output:
['ABLE SEAMAN', 'ABRASIVE GRADER', 'ABRASIVE GRINDER', 'ZEBRA CLEANER']
One-line version:
Career_List_Updated = [career.split(' ', 1)[1] \
for career in Career_List]
We want to find the first index that STOPS being a bad character and return the rest of the string, as follows.
def strip_bad_starting_characters_from_string(string):
    bad_chars = set(r"'0123456789 )")  # set of characters we don't like
    for i, char in enumerate(string):
        if char not in bad_chars:
            # we are at the first index past the "noise" prefix
            return string[i:]

career_list_updated = [strip_bad_starting_characters_from_string(string) for string in Career_List]

Problems with nested loops…

I'm going to explain in detail what I want to achieve.
I have two programs that use dictionaries.
The code for program 1 is here:
import re

words = {'i': 'jeg', 'am': 'er', 'happy': 'glad'}
text = "I am happy.".split()
translation = []
for word in text:
    word_mod = re.sub('[^a-z0-9]', '', word.lower())
    punctuation = word[-1] if word[-1].lower() != word_mod[-1] else ''
    if word_mod in words:
        translation.append(words[word_mod] + punctuation)
    else:
        translation.append(word)
translation = ' '.join(translation).split('. ')
print('. '.join(s.capitalize() for s in translation))
This program has the following advantages:
You can write more than one sentence.
The first letter after a "." is capitalized.
The program appends untranslated words to the output (translation = []).
Here is the code for program 2:
words = {('i',): 'jeg', ('read',): 'leste', ('the', 'book'): 'boka'}
max_group = len(max(words))
text = "I read the book".lower().split()
translation = []
position = 0
while text:
    for m in range(max_group - 1, -1, -1):
        word_mod = tuple(text[:position + m])
        if word_mod in words:
            translation.append(words[word_mod])
            text = text[position + m:]
    position += 1
translation = ' '.join(translation).split('. ')
print('. '.join(s.capitalize() for s in translation))
With this code you can translate idiomatic expressions, for example "the book" to "boka".
Here is how the program proceeds.
This is the output:
1
('i',)
['jeg']
['read', 'the', 'book']
0
()
1
('read', 'the')
0
('read',)
['jeg', 'leste']
['the', 'book']
1
('the', 'book')
['jeg', 'leste', 'boka']
[]
0
()
Jeg leste boka
What I want is to implement some of the code from program 1 into program 2.
I have tried many times without success…
Here is my dream…:
If I change the text to the following…:
text = "I read the book. I read the book! I read the book? I read the book.".lower().split()
I want the output to be:
Jeg leste boka. Jeg leste boka! Jeg leste boka? Jeg leste boka.
So please, tweak your brain and help me with a solution…
I appreciate any reply very much!
Thank you very much in advance!
My solution flow would be something like this:
dict = ...
max_group = len(max(dict))
input = ...
textWPunc = input.lower().split()
textOnly = [re.sub('[^a-z0-9]', '', x) for x in input.lower().split()]
translation = []
while textOnly:
    for m in [max_group..0]:
        if textOnly[:m] in words:
            check for punctuation here using textWPunc[:m]
            if punctuation present in textOnly[:m]:
                Append translated words + punctuation
            else:
                Append only translated words
            textOnly = textOnly[m:]
            textWPunc = textWPunc[m:]
join translation to finish
The key part is that you keep two parallel lists of tokens: one you check for words to translate, and the other you check for punctuation whenever the translation search comes up with a hit. To check for punctuation, I fed the word group being examined into re.sub('[a-z0-9]', '', wordGroup), which strips out all letters and digits and leaves only the punctuation.
One last thing: your indexing with that position variable looks odd to me. Since you're truncating the source list as you go, I don't think it's really necessary; just check the leftmost m words at each step instead.
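The flow above can be turned into a runnable sketch (variable names are my own; the capitalization step is adapted from program 1):

```python
import re

words = {('i',): 'jeg', ('read',): 'leste', ('the', 'book'): 'boka'}
max_group = max(len(k) for k in words)

text = "I read the book. I read the book! I read the book? I read the book."
with_punc = text.lower().split()                              # tokens with punctuation
only_words = [re.sub('[^a-z0-9]', '', w) for w in with_punc]  # tokens without punctuation

translation = []
while only_words:
    for m in range(max_group, 0, -1):
        group = tuple(only_words[:m])
        if group in words:
            # recover punctuation from the last raw token of the matched group
            punc = re.sub('[a-z0-9]', '', with_punc[m - 1])
            translation.append(words[group] + punc)
            only_words, with_punc = only_words[m:], with_punc[m:]
            break
    else:
        # no group matched: keep the untranslated token as-is
        translation.append(with_punc[0])
        only_words, with_punc = only_words[1:], with_punc[1:]

# capitalize the first letter of each sentence
out = ' '.join(translation)
out = re.sub(r'(^|[.!?] )(\w)', lambda m: m.group(1) + m.group(2).upper(), out)
print(out)
# → Jeg leste boka. Jeg leste boka! Jeg leste boka? Jeg leste boka.
```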

Categories

Resources