I'm using ocrmypdf. I'm trying to do OCR on campaign finance PDFs. Example PDFs: https://apps1.lavote.net/camp/comm.cfm?&cid=11
My client wants to parse these PDFs, as well as others (Form 496s, Form 497s). The problem is that, even with forms of the same type, the OCR results are inconsistent.
For example, one pdf (form 460) will yield these results:
Statement covers period
from 07/01/2005
through __11/30/2005
and another of the same type yields:
Statement covers period
01/01/2006
from
through 03/17/2006
Notice that in the first, the first date comes after "from", whereas in the second, the first date comes before "from". This creates complications when trying to parse the data.
I'm using what I call "checkpoints" to parse forms of similar type. Here's an example:
checkpoints = [
    ['Statement covers period from', 'Date From'],
    ['through', 'Date Thru'],
    ['Date of election if applicable:', None],
    ['\n', None],
    ['\\NUMBER Treasurer(s)\n', 'ID'],
    ['\n', None],
    ['COMMITTEE NAME (OR CANDIDATE’S NAME IF NO COMMITTEE)\n\n', 'Committee / Candidate Name'],
    ['\n', None],
    ['NAME OF TREASURER\n\n', 'Name of Treasurer'],
    ['\n', None],
    ['NAME OF OFFICEHOLDER OR CANDIDATE\n\n', 'Name of Officeholder or Candidate'],
    ['\n', None],
    ['OFFICE SOUGHT OR HELD (INCLUDE LOCATION AND DISTRICT NUMBER IF APPLICABLE)\n\n', 'Office Sought or Held'],
    ['\n', None],
]
I loop through every checkpoint, find the start index using the current checkpoint's anchor text (element [0]) and the end index using the next checkpoint's anchor, and save the contents to a key in a master object, like county_object[checkpoint[1]] = contents[start_index:end_index].
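For reference, a rough sketch of that loop as described (county_object and contents are my names from above; the .find calls assume each anchor actually occurs in the OCR text):

county_object = {}
for i, (anchor, key) in enumerate(checkpoints[:-1]):
    start_index = contents.find(anchor) + len(anchor)
    end_index = contents.find(checkpoints[i + 1][0], start_index)
    if key is not None:
        county_object[key] = contents[start_index:end_index].strip()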
This setup only works for the specific PDF I am parsing. Because ocrmypdf yields different results even for the same form type, my setup is not ideal. Can someone point me in the right direction on how I should go about this?
Thanks
I imagine the difference between "identical" Form 460s is a vertical misalignment due to one being scanned at a slight CW angle and another at a slight CCW angle. I hope you are invoking with --deskew, but even with that there may be minor aberrations that prove troublesome.

The vertical separation between the dates seems large and robust, so one date will precede the other in the proper way. Consider focusing more on the mm/dd/yyyy pattern and less on the text anchors.

You can obtain bounding box coordinates from Tesseract OCR. Use them to disambiguate dates, based on your knowledge of what appears higher or lower on the form, and by (approximately) how much.
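A minimal sketch of that idea with pytesseract (the page image filename and the date regex are assumptions for illustration):

import re
import pytesseract
from PIL import Image

page = Image.open("form460_page1.png")  # a page rendered from the PDF
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)

# Collect every mm/dd/yyyy token together with its top coordinate.
date_pat = re.compile(r"\d{2}/\d{2}/\d{4}")
dates = []
for i, word in enumerate(data["text"]):
    m = date_pat.search(word)
    if m:
        dates.append((data["top"][i], m.group()))

# The "from" date sits higher on the form than the "through" date,
# so sorting by the top coordinate disambiguates them.
dates.sort()
date_from, date_thru = dates[0][1], dates[1][1]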
I have an approximately 1 million row pandas dataframe containing data parsed from federal appellate court opinions. I need to extract the names of judges hearing the cases. Each case (one row) has an unknown number of judges, contained in a string. That string (currently stored in a single column) contains a lot of excess text and has inconsistent formatting and capitalization. I use different dictionaries of judge names (with up to 2,575 regex keys) to match the judges listed, based on multiple criteria described below. I use the dictionary with the most stringent matching criteria first and gradually loosen the criteria. I also remove the matched string from the source column. The current methods that I have tried are simply too slow (taking days, weeks, or even months).
The reason there are multiple possible dictionaries is that many judges share the same (last) names. The strings don't ordinarily include full names. I use data contained in two other columns to get the right match: year the case was decided and the court hearing the case (both integers). I also have higher and lower quality substring search terms. The dictionaries I use can be recreated at will in different formats besides regex if needed.
The fastest solution I have tried was crude and unpythonic. In the initial parsing of the data (extraction of sections and keywords from raw text files), which occurs on a case-by-case basis, I did the following: 1) removed excess text to the degree possible, 2) sorted the remaining text into a list stored within a pandas column, 3) concatenated as strings the year and court to each item in that list, and 4) matched that concatenated string to a dictionary that I had similarly prepared. That dictionary didn't use regular expressions and had approximately 800,000 keys. That process took about a day (with all of the other parsing involved as well) and was not as accurate as I would have liked (because it omitted certain name format permutations).
The code below contains my most recent attempt (which is currently running and looks to be among the slowest options yet). It creates subset dictionaries on the fly and still ends up iterating through those smaller dictionaries with regex keys. I've read through and tried to apply solutions from many stackoverflow questions, but couldn't find a workable solution. I'm open to any python-based idea. The data is real data that I've cleaned with a prior function.
import numpy as np
import pandas as pd

test_data = {'panel_judges': ['CHAGARES, VANASKIE, SCHWARTZ',
                              'Sidney R. Thomas, Barry G. Silverman, Raymond C. Fisher, Opinion by Thomas'],
             'court_num': [3, 9],
             'date_year': [2014, 2014]}
test_df = pd.DataFrame(data=test_data)

name_dict = {'full_name': ['Chagares, Michael A.',
                           'Vanaskie, Thomas Ignatius',
                           'Schwartz, Charles, Jr.',
                           'Schwartz, Edward Joseph',
                           'Schwartz, Milton Lewis',
                           'Schwartz, Murray Merle'],
             'court_num': [3, 3, 1061, 1097, 1058, 1013],
             'circuit_num': [3, 3, 5, 9, 9, 3],
             'start_year': [2006, 2010, 1976, 1968, 1979, 1974],
             'end_year': [2016, 2019, 2012, 2000, 2005, 2013],
             'hq_match': ['M(?=ICHAEL)? ?A?(?=\.)? ?CHAGARES',
                          'T(?=HOMAS)? ?I?(?=GNATIUS)? ?VANASKIE',
                          'C(?=HARLES)? SCHWARTZ',
                          'E(?=DWARD)? ?J?(?=OSEPH)? ?SCHWARTZ',
                          'M(?=ILTON)? ?L?(?=EWIS)? ?SCHWARTZ',
                          'M(?=URRAY)? ?M?(?=ERLE)? ?SCHWARTZ'],
             'lq_match': ['CHAGARES',
                          'VANASKIE',
                          'SCHWARTZ',
                          'SCHWARTZ',
                          'SCHWARTZ',
                          'SCHWARTZ']}
names = pd.DataFrame(data=name_dict)

in_col = 'panel_judges'
year_col = 'date_year'
out_col = 'fixed_panel'
court_num_col = 'court_num'
test_df[out_col] = ''

def judge_matcher(df, in_col, out_col, year_col, court_num_col,
                  size_column=None):
    general_cols = ['start_year', 'end_year', 'full_name']
    court_cols = ['court_num', 'circuit_num']
    match_cols = ['hq_match', 'lq_match']
    for match_col in match_cols:
        for court_col in court_cols:
            lookup_cols = general_cols + [court_col] + [match_col]
            judge_df = names[lookup_cols]
            for year in range(df[year_col].min(),
                              df[year_col].max() + 1):
                for court in range(df[court_num_col].min(),
                                   df[court_num_col].max() + 1):
                    lookup_subset = ((judge_df['start_year'] <= year)
                                     & (year < (judge_df['end_year'] + 2))
                                     & (judge_df[court_col] == court))
                    new_names = names.loc[lookup_subset]
                    df_subset = ((df[year_col] == year)
                                 & (df[court_num_col] == court))
                    df.loc[df_subset] = matcher(df.loc[df_subset],
                                                in_col, out_col,
                                                new_names, match_col)
    return df

def matcher(df, in_col, out_col, lookup, keys):
    patterns = dict(zip(lookup[keys], lookup['full_name']))
    for key, value in patterns.items():
        df[out_col] = (
            np.where(df[in_col].astype(str).str.upper().str.contains(key),
                     df[out_col] + value + ', ', df[out_col]))
        df[in_col] = df[in_col].astype(str).str.upper().str.replace(key, '')
    return df

df = judge_matcher(test_df, in_col, out_col, year_col, court_num_col)
The output I currently get is essentially right (although the names should be sorted and in a list). The proper "Schwartz" is picked and the matches are all correct. The problem is speed.
My goal is to have a de-duplicated, alphabetically sorted list of judges on each panel, either stored in a single column or exploded into up to 15 separate columns (I presently do that in a separate vectorized function). I then will do other lookups on those judges based upon other demographic and biographical information. The produced data will be openly available to researchers in the area and the code will be part of a free, publicly available platform usable for studying other courts as well. So accuracy and speed are both important considerations for users on many different machines.
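As a sketch of that explode step (assuming the per-row judge lists live in fixed_panel; names are illustrative):

judges = pd.DataFrame(test_df['fixed_panel'].tolist(), index=test_df.index)
judges.columns = ['judge_' + str(i + 1) for i in judges.columns]
test_df = test_df.join(judges)  # shorter panels are padded with NaN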
For anyone who stumbles across this question and has a similar complex string matching issue in pandas, this is the solution I found to be the fastest.
It isn't fully vectorized like I wanted, but I used df.apply with this method within a class:
from more_itertools import unique_everseen

def judge_matcher(self, row, in_col, out_col, year_col, court_num_col,
                  size_col=None):
    final_list = []
    raw_list = row[in_col]
    cleaned_list = [x for x in raw_list if x]
    cleaned_list = [x.strip() for x in cleaned_list]
    for name in cleaned_list:
        name1 = self.convert_judge_name(row[year_col],
                                        row[court_num_col], name, 1)
        name2 = self.convert_judge_name(row[year_col],
                                        row[court_num_col], name, 2)
        if name1 in self.names_dict_list[0]:
            final_list.append(self.names_dict_list[0].get(name1))
        elif name1 in self.names_dict_list[1]:
            final_list.append(self.names_dict_list[1].get(name1))
        elif name2 in self.names_dict_list[2]:
            final_list.append(self.names_dict_list[2].get(name2))
        elif name2 in self.names_dict_list[3]:
            final_list.append(self.names_dict_list[3].get(name2))
        elif name in self.names_dict_list[4]:
            final_list.append(self.names_dict_list[4].get(name))
    final_list = list(unique_everseen(final_list))
    final_list.sort()
    row[out_col] = final_list
    if size_col and final_list:
        row[size_col] = len(final_list)
    return row

@staticmethod
def convert_judge_name(year, court, name, dict_type):
    if dict_type == 1:
        return str(int(court) * 10000 + int(year)) + name
    elif dict_type == 2:
        return str(int(year)) + name
    else:
        return name
Basically, it concatenates three columns together and performs hashed dictionary lookups (instead of regexes) with the concatenated strings. Multiplication is used to efficiently concatenate the two numbers to be side-by-side as strings. The dictionaries had similarly prepared keys (and the values are the desired strings). By using lists and then deduplicating, I didn't have to remove the matched strings. I didn't time this specific function, but the overall module took just over 10 hours to process ~ 1 million rows. When I run it again, I will try to remember to time this applied function specifically and post the results here. The method is ugly, but fairly effective.
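For illustration, a hypothetical sketch of preparing one of those dictionaries so its keys match convert_judge_name(..., dict_type=1); the record layout is invented for the example:

judge_records = [
    # (full_name, matchable_name, court_num, start_year, end_year)
    ('Chagares, Michael A.', 'CHAGARES', 3, 2006, 2016),
    ('Vanaskie, Thomas Ignatius', 'VANASKIE', 3, 2010, 2019),
]
court_year_dict = {}
for full, short, court, start, end in judge_records:
    for year in range(start, end + 2):  # mirrors the year < end_year + 2 window
        court_year_dict[str(court * 10000 + year) + short] = full
# e.g. court_year_dict['32014CHAGARES'] == 'Chagares, Michael A.'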
I am totally new to Python. I am using a package called pyConTextNLP that takes medical text and annotates it with classifiers.
It basically takes some natural language text, adds some 'modifiers' to it, and classifies it while removing negative findings.
The problem I am having is how to add the list of modifiers as a csv or a yaml file. I have been following the basic setup instructions here:
The problem is this line:
modifiers = itemData.get_items("https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_05042016.yml")
itemData.get_items doesn't look like it exists anymore; there is a function instead called itemData.get_fileobj(). This takes a csv file, as far as I understand, and the csv is passed to the function markup.markItems(modifiers, mode="modifier"), which looks at the text and 'marks up' any concepts in the raw text that match the modifiers.
The error that I get when trying to run the example code comes from this line:
if not item.getLiteral() in compiledRegExprs:
and it gives me the error:
AttributeError: 'UnicodeReader' object has no attribute 'getLiteral'
The whole code is here, but I have also written it below:
import networkx as nx
import pyConTextNLP.itemData as itemData
import pyConTextNLP.pyConTextGraph as pyConText
reports = [
"""IMPRESSION: Evaluation limited by lack of IV contrast; however, no evidence of
bowel obstruction or mass identified within the abdomen or pelvis. Non-specific interstitial opacities and bronchiectasis seen at the right
base, suggestive of post-inflammatory changes.""",
"""DIAGNOSIS: NO SIGNIFICANT PATHOLOGY
MICRO These biopsies of large bowel mucosa show oedema of the lamina propriabut no architectural abnormality
There is no dysplasia or malignancy
There is no evidence of active inflammation
There is no increase in the inflammatory cell content of the lamina propria""" ,
"""IMPRESSION:
1. 2.0 cm cyst of the right renal lower pole. Otherwise, normal appearance
of the right kidney with patent vasculature and no sonographic evidence of
renal artery stenosis.
2. Surgically absent left kidney.""",
"""IMPRESSION: No definite pneumothorax""",
"""IMPRESSION: New opacity at the left lower lobe consistent with pneumonia."""
]
modifiers = itemData.get_fileobj("/Applications/anaconda3/lib/python3.7/site-packages/pyConTextNLP-0.6.2.0-py3.7.egg/pyConTextNLP/CSV_Modifiers.csv")
targets = itemData.get_fileobj("/Applications/anaconda3/lib/python3.7/site-packages/pyConTextNLP-0.6.2.0-py3.7.egg/pyConTextNLP/CSV_targets.csv")
def markup_sentence(s, modifiers, targets, prune_inactive=True):
    """
    """
    markup = pyConText.ConTextMarkup()
    markup.setRawText(s)
    markup.cleanText()
    markup.markItems(modifiers, mode="modifier")
    markup.markItems(targets, mode="target")
    markup.pruneMarks()
    markup.dropMarks('Exclusion')
    # apply modifiers to any targets within the modifiers scope
    markup.applyModifiers()
    markup.pruneSelfModifyingRelationships()
    if prune_inactive:
        markup.dropInactiveModifiers()
    return markup
reports[3]
markup = pyConText.ConTextMarkup()
isinstance(markup,nx.DiGraph)
markup.setRawText(reports[4].lower())
print(markup)
print(len(markup.getRawText()))
markup.cleanText()
print(markup)
print(len(markup.getText()))
markup.markItems(modifiers, mode="modifier")
print(markup.nodes(data=True))
print(type(list(markup.nodes())[0]))
markup.markItems(targets, mode="target")
for node in markup.nodes(data=True):
print(node)
markup.pruneMarks()
for node in markup.nodes(data=True):
print(node)
print(markup.edges())
markup.applyModifiers()
for edge in markup.edges():
print(edge)
The markItems function is here:
def markItems(self, items, mode="target"):
    """tags the sentence for a list of items
    items: a list of contextItems"""
    if not items:
        return
    for item in items:
        self.add_nodes_from(self.markItem(item, ConTextMode=mode),
                            category=mode)
The question is, how can I get the code to read the list in the csv file without throwing this error?
Suppose a text such as:
s = '\n\nPART I, WHERE I’M COMING FROM\n\n1\xa0My Call to Adventure: 1949–1967\n2\xa0Crossing the Threshold: 1967–1979\n3\xa0My Abyss: 1979–1982\n4\xa0My Road of Trials: 1983–1994\n5\xa0The Ultimate Boon: 1995–2010\n6\xa0Returning the Boon: 2011–2015\n7\xa0My Last Year and My Greatest Challenge: 2016–2017\n8\xa0Looking Back from a Higher Level\n\nPART II, LIFE PRINCIPLES\n\n1\xa0Embrace Reality and Deal with It\n2\xa0Use the 5-Step Process to Get What You Want Out of Life\n3\xa0Be Radically Open-Minded\n4\xa0Understand That People Are Wired Very Differently\n5\xa0Learn How to Make Decisions Effectively\nLife Principles: Putting It All Together\nSummary and Table of Life Principles\n\nPART III, WORK PRINCIPLES\n\nSummary and Table of Work Principles\nTO GET THE CULTURE RIGHT\n\nTO GET THE PEOPLE\n\nTO BUILD AND EVOLVE YOUR \nWork Principles: Putting It All Together\n\n'
Split it by the delimiter PART:
In [14]: parts = re.split(r'\n\nPART',s)
In [15]: parts
Out[15]:
['',
' I, WHERE I’M COMING FROM\n\n1\xa0My Call to Adventure: 1949–1967\n2\xa0Crossing the Threshold: 1967–1979\n3\xa0My Abyss: 1979–1982\n4\xa0My Road of Trials: 1983–1994\n5\xa0The Ultimate Boon: 1995–2010\n6\xa0Returning the Boon: 2011–2015\n7\xa0My Last Year and My Greatest Challenge: 2016–2017\n8\xa0Looking Back from a Higher Level',
' II, LIFE PRINCIPLES\n\n1\xa0Embrace Reality and Deal with It\n2\xa0Use the 5-Step Process to Get What You Want Out of Life\n3\xa0Be Radically Open-Minded\n4\xa0Understand That People Are Wired Very Differently\n5\xa0Learn How to Make Decisions Effectively\nLife Principles: Putting It All Together\nSummary and Table of Life Principles',
' III, WORK PRINCIPLES\n\nSummary and Table of Work Principles\nTO GET THE CULTURE RIGHT\n\nTO GET THE PEOPLE\n\nTO BUILD AND EVOLVE YOUR \nWork Principles: Putting It All Together\n\n']
Add the prefix PART back to the list items:
In [16]: ['PART '+ i for i in parts if i]
Out[16]:
['PART I, WHERE I’M COMING FROM\n\n1\xa0My Call to Adventure: 1949–1967\n2\xa0Crossing the Threshold: 1967–1979\n3\xa0My Abyss: 1979–1982\n4\xa0My Road of Trials: 1983–1994\n5\xa0The Ultimate Boon: 1995–2010\n6\xa0Returning the Boon: 2011–2015\n7\xa0My Last Year and My Greatest Challenge: 2016–2017\n8\xa0Looking Back from a Higher Level',
'PART II, LIFE PRINCIPLES\n\n1\xa0Embrace Reality and Deal with It\n2\xa0Use the 5-Step Process to Get What You Want Out of Life\n3\xa0Be Radically Open-Minded\n4\xa0Understand That People Are Wired Very Differently\n5\xa0Learn How to Make Decisions Effectively\nLife Principles: Putting It All Together\nSummary and Table of Life Principles',
'PART III, WORK PRINCIPLES\n\nSummary and Table of Work Principles\nTO GET THE CULTURE RIGHT\n\nTO GET THE PEOPLE\n\nTO BUILD AND EVOLVE YOUR \nWork Principles: Putting It All Together\n\n']
I would like to finish it in one step:
In [17]: parts = re.findall(r'\n\nPART.+', s)
In [18]: parts
Out[18]:
['\n\nPART I, WHERE I’M COMING FROM',
'\n\nPART II, LIFE PRINCIPLES',
'\n\nPART III, WORK PRINCIPLES']
# the dot stops at \n; I want to get past many such stops with a quantifier
In [20]: parts = re.findall(r'\n\n(?:PART.+)+', s)
In [21]: parts
Out[21]:
['\n\nPART I, WHERE I’M COMING FROM',
'\n\nPART II, LIFE PRINCIPLES',
'\n\nPART III, WORK PRINCIPLES']
#Unfortunately, it prints the same output
How can I accomplish such a task in one step?
Try splitting on a positive lookahead to retain the delimiter, using the regex module:
import regex
print(regex.split(r"(?=\n\nPART)", s, flags=regex.VERSION1))
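As an aside (worth verifying on your Python version): since Python 3.7 the standard-library re.split also accepts zero-width patterns, and a one-call findall with a lookahead works as well:

import re
parts = re.split(r'(?=\n\nPART)', s)  # Python 3.7+
# or capture each PART block up to the next PART (or end of string):
parts = re.findall(r'\n\nPART.*?(?=\n\nPART|\Z)', s, flags=re.DOTALL)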
I have data that looks like this:
owned category weight mechanics_split
28156 Environmental, Medical 2.8023 [Action Point Allowance System, Co-operative P...
9269 Card Game, Civilization, Economic 4.3073 [Action Point Allowance System, Auction/Biddin...
36707 Modern Warfare, Political, Wargame 3.5293 [Area Control / Area Influence, Campaign / Bat...
and used this function (taken from the generous answer in this question):
def owned_nums(games):
    for row in games.iterrows():
        owned_value = row[1]['owned']
        mechanics = row[1]['mechanics_split']
        for type_string in mechanics:
            game_types.loc[type_string, ['owned']] += owned_value
to iterate over the values in the dataframe and put new values in a new dataframe called game_types. It worked great. In fact, it still works great; that notebook is open, and if I change the last line of the function to just print(type_string), it prints:
Action Point Allowance System
Co-operative Play
Hand Management
Point to Point Movement
Set Collection
Trading
Variable Player Powers
Action Point Allowance System...
Okay, perfect. So, I saved my data as a csv, opened a new notebook, opened the csv with the columns with the split strings, copied and pasted the exact same function, and when I print type_string, I now get:
[
'
A
c
t
i
o
n
P
o
i
n
t
A
l
l
o
w
The only thing I could notice is that the original lists were quote-less, with [Action Point Allowance System, Co-operative...] etc., while the new dataframe opened from the new csv was rendered as ['Action Point Allowance System', 'Co-operative...'], with quotes. I used str.replace("'","") which got rid of the quotes, but it's still returning every letter. I've tried experimenting with the escapechar options in to_csv, but to no avail. Very confused as to what setting I need to tweak.
Thanks very much for any help.
The only way the code
mechanics = row[1]['mechanics_split']
for type_string in mechanics:
    game_types.loc[type_string, ['owned']] += owned_value
can have worked is if your mechanics_split column contained not a string but an iterable containing strings.
Storing non-scalar data in Series is not well-supported, and while it's sometimes useful (though slow) as an intermediate step, it's not supposed to be something you do regularly. Basically what you're doing is
>>> df = pd.DataFrame({"A": [["x","y"],["z"]]})
>>> df.to_csv("a.csv")
>>> !cat a.csv
,A
0,"['x', 'y']"
1,['z']
after which you have
>>> df2 = pd.read_csv("a.csv", index_col=0)
>>> df2
A
0 ['x', 'y']
1 ['z']
>>> df.A.values
array([['x', 'y'], ['z']], dtype=object)
>>> df2.A.values
array(["['x', 'y']", "['z']"], dtype=object)
>>> type(df.A.iloc[0])
<class 'list'>
>>> type(df2.A.iloc[0])
<class 'str'>
and you notice that what was originally a Series containing lists of strings is now a Series containing only strings. Which makes sense, if you think about it, because CSVs never claimed to be type-preserving.
If you insist on using a frame like this, you should manually encode and decode your lists via some representation (e.g. JSON strings) on reading and writing. I'm too lazy to confirm what pandas does to str-ify lists, but you might be able to get away with applying ast.literal_eval to the resulting strings to turn them back into lists.
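A minimal sketch of that decode step, using the a.csv written above:

import ast
import pandas as pd

df2 = pd.read_csv('a.csv', index_col=0)
# Parse the stringified lists back into real Python lists.
df2['A'] = df2['A'].apply(ast.literal_eval)
type(df2.A.iloc[0])  # <class 'list'>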
I've used a number of pdf-->text methods to extract text from PDF documents. For one particular type of PDF I have, neither pyPDF nor pdfMiner is doing a good job extracting the text. However, http://www.convertpdftotext.net/ does it (almost) perfectly.
I discovered that the pdf I'm using has some transparent text in it, and it is getting merged into the other text.
Some examples of the blocks of text I get back are:
12324 35th Ed. 01-MAR-12 Last LNM: 14/12 NAD 83 14/12 Corrective Object of Corrective
ChartTitle: Intracoastal Waterway Sandy Hook to Little Egg Harbor Position
C HAActRionT N Y -NJ - S A N D Y H OO K ATcO tionLI T TLE EGG HARBOR. Page/Side: N/A
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true.
Bearings RoEf LlighOCtAT seEc tors aSrehre towwsbuardry th Re ivligher Ct fhroanmn seel Lawighartde.d B Theuoy 5no minal range of lights is expressedf roin mna 4u0tic-24al -mi46les.56 0(NNM ) unless othe0r7w4is-00e n-o05te.d8.8 0 W
to 40-24-48.585N 074-00-05.967W
and
12352 33rd Ed. 01-MAR-11 Last LNM: 03/12 NAD 83 04/12 . . l . . . . Corrective Object of Corrective ChartTitle: Shinnecock Bay to East Rockaway Inlet Position C HAActRionT S H IN N E C OC K B A Y TO AcEtionAS T ROCKAWAY INLET. Page/Side: N/A (Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true. (BTeeamringp) s DoEf LlighETtE s ectors aSretat toew Baoratd Ctheh anlighnet lf Droaym beseacoawanr 3d. The nominal range of lights is expressedf roin mna 4u0tic-37al -mi11les.52 0(NNM ) unless othe0r7w3is-29e n-5o3te.d76. 0 W
and I have discovered that the "ghost text" is ALWAYS the following:
Corrective Object of Corrective Position
Action Action
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are given in degrees clockwise from 000 true.
Bearings of light sectors are toward the light from seaward. The nominal range of lights is expressed in nautical miles (NM) unless otherwise noted.
In the 2nd example I posted, the text I want (with the ghost text removed) is:
12352 33rd Ed. 01-Mar-11 Last LNM:03/12 NAD 83 04/12
Chart Title:Shinnecock Bay to East Rockaway Inlet. Page/Side:N/A
CGD01
(Temp) DELETE State Boat Channel Daybeacon 3 from 40-37-11.520N 073-29-53.760W
This problem occurs just once per document, and does not appear to be totally consistent (as seen above). I am wondering if one of you wizards could think of a way to remove the ghosted text (I don't need/want it) using python. If I had been using pyPDF, I would have used a regex to rip it out during the conversion to text. Unfortunately, since I'm starting out with a text file from the website listed above, the damage has already been done. I'm at a bit of a loss.
Thanks for reading.
EDIT:
The solution to this problem looks like it will be more complex than the rest of the application, so I'm going to withdraw my request for help.
I very much appreciate the thought put into it by those who have contributed.
Given that the ghost text can be split up in seemingly unpredictable ways, I don't think there is a simple automatic way of removing it that would not have false positives. What you need is almost human-level pattern recognition. :-)
What you could try is exploiting the format of these kinds of messages. Roughly:
<number> <number>[rn]d Ed. <date> Last LNM:<mm>/<yy> NAD <date2>
Chart Title:<text>. Page/Side:<N/A or number(s)> CGD<number> <text>
<position>
Using this you could pluck out the nonsense from the predictable elements, and then, if you have a list of chart names ('Shinnecock Bay to East Rockaway Inlet') and descriptive words (like 'State', 'Boat', 'Daybeacon'), you might be able to reconstruct the original words by finding the smallest Levenshtein distance between mangled words in the two text blocks and those in your word lists.
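A small sketch of the word-list idea, using difflib's similarity ratio from the standard library as a stand-in for Levenshtein distance (the word list here is invented for illustration):

import difflib

known_words = ['State', 'Boat', 'Channel', 'Daybeacon', 'DELETE']

def reconstruct(mangled_tokens, known_words, cutoff=0.6):
    # replace each mangled token with its closest known word, if any is close enough
    fixed = []
    for tok in mangled_tokens:
        match = difflib.get_close_matches(tok, known_words, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else tok)
    return fixed

print(reconstruct(['Sttae', 'Baot', 'Dayebacon'], known_words))
# expected: ['State', 'Boat', 'Daybeacon']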
If you can install the poppler software, you could try and use pdftotext with the -layout option to keep the formatting from the original PDF as much as possible. That might make your problem disappear.
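If you go that route, invoking it from Python is a one-liner (assumes pdftotext is on your PATH; filenames are illustrative):

import subprocess
# -layout asks pdftotext to preserve the original physical layout
subprocess.run(['pdftotext', '-layout', 'input.pdf', 'output.txt'], check=True)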
You could recursively find all possible ways that your pattern "Corrective Object of Corrective Position Action ..." can be contained within your mangled text. Then you can unmangle the text for each of these possible paths, run some sort of spellcheck over them, and choose the one with the fewest spelling mistakes. Or, since you know roughly where each substring should appear, you can use that as a heuristic.
Or you could simply use the first path.
some pseudocode (untested):
def findAllOccurences(mangledText, letter):
    # all indices in mangledText that contain letter
    return [i for i, c in enumerate(mangledText) if c == letter]

def findPaths(mangledText, pattern, path, offset=0):
    if len(pattern) == 0:  # end of pattern
        return [path]
    nextLetter = pattern[0]
    locations = findAllOccurences(mangledText, nextLetter)
    allPaths = []
    for loc in locations:
        # record the absolute index, then continue in the remainder of the text
        paths = findPaths(mangledText[loc + 1:], pattern[1:],
                          path + (offset + loc,), offset + loc + 1)
        allPaths.extend(paths)
    return allPaths  # if no locations for the next letter exist, allPaths will be empty
Then you can call it like this (optionally remove all spaces from your search pattern, unless you are certain they are all included in the mangled text):
allPossiblePaths = findPaths(YourMangledText, "Corrective Object...", ())
Then allPossiblePaths should contain a list of all the possible ways your pattern could be contained in your mangled text.
Each entry is a tuple with the same length as the pattern, containing the absolute index at which the corresponding letter of the pattern occurs in the search text.
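Building on that, a small follow-up sketch (assuming the absolute indices returned by the findPaths above) that strips one candidate path's letters out to produce an unmangled variant:

def removePath(mangledText, path):
    # drop every character index claimed by this path, keep the rest
    drop = set(path)
    return ''.join(c for i, c in enumerate(mangledText) if i not in drop)

candidates = [removePath(YourMangledText, p) for p in allPossiblePaths]
# score each candidate with a spellchecker and keep the best one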