How to solve Unicoder problem when reading csv

How to solve Unicoder problem when reading csv - python

I am totally new to python. I am using a package that takes medical text and annotates it with classifiers called pyConTextNLP
It basically takes some natural language text, adds some 'modifiers' to it and classifies it whilst removing negative findings.
The problem I am having is how to add the list of modifiers as a csv or a yaml file. I have been following the basic setup instructions here:
The problem is the line here:
modifiers = itemData.get_items("https://raw.githubusercontent.com/chapmanbe/pyConTextNLP/master/KB/lexical_kb_05042016.yml")
itemData.get_items doesn't look like it exists anymore and there is a function instead called itemData.get_fileobj(). This takes a csv file as far as I understand and the csv is passed to the function markup.markItems(modifiers, mode="modifier") which looks at the text and 'marks up' any concepts in the raw text that match the modifiers.
The error that I get when trying to run the example code is:
if not `item.getLiteral() in compiledRegExprs:`
and this gives me the error:
AttributeError: 'UnicodeReader' object has no attribute 'getLiteral'
The whole code is here: but I have also written it below
import networkx as nx
import pyConTextNLP.itemData as itemData
import pyConTextNLP.pyConTextGraph as pyConText
reports = [
"""IMPRESSION: Evaluation limited by lack of IV contrast; however, no evidence of
bowel obstruction or mass identified within the abdomen or pelvis. Non-specific interstitial opacities and bronchiectasis seen at the right
base, suggestive of post-inflammatory changes.""",
"""DIAGNOSIS: NO SIGNIFICANT PATHOLOGY
MICRO These biopsies of large bowel mucosa show oedema of the lamina propriabut no architectural abnormality
There is no dysplasia or malignancy
There is no evidence of active inflammation
There is no increase in the inflammatory cell content of the lamina propria""" ,
"""IMPRESSION:
1. 2.0 cm cyst of the right renal lower pole. Otherwise, normal appearance
of the right kidney with patent vasculature and no sonographic evidence of
renal artery stenosis.
2. Surgically absent left kidney.""",
"""IMPRESSION: No definite pneumothorax""",
"""IMPRESSION: New opacity at the left lower lobe consistent with pneumonia."""
]
modifiers = itemData.get_fileobj("/Applications/anaconda3/lib/python3.7/site-packages/pyConTextNLP-0.6.2.0-py3.7.egg/pyConTextNLP/CSV_Modifiers.csv")
targets = itemData.get_fileobj("/Applications/anaconda3/lib/python3.7/site-packages/pyConTextNLP-0.6.2.0-py3.7.egg/pyConTextNLP/CSV_targets.csv")
def markup_sentence(s, modifiers, targets, prune_inactive=True):
"""
"""
markup = pyConText.ConTextMarkup()
markup.setRawText(s)
markup.cleanText()
markup.markItems(modifiers, mode="modifier")
markup.markItems(targets, mode="target")
markup.pruneMarks()
markup.dropMarks('Exclusion')
# apply modifiers to any targets within the modifiers scope
markup.applyModifiers()
markup.pruneSelfModifyingRelationships()
if prune_inactive:
markup.dropInactiveModifiers()
return markup
reports[3]
markup = pyConText.ConTextMarkup()
isinstance(markup,nx.DiGraph)
markup.setRawText(reports[4].lower())
print(markup)
print(len(markup.getRawText()))
markup.cleanText()
print(markup)
print(len(markup.getText()))
markup.markItems(modifiers, mode="modifier")
print(markup.nodes(data=True))
print(type(list(markup.nodes())[0]))
markup.markItems(targets, mode="target")
for node in markup.nodes(data=True):
print(node)
markup.pruneMarks()
for node in markup.nodes(data=True):
print(node)
print(markup.edges())
markup.applyModifiers()
for edge in markup.edges():
print(edge)
markItems function is here:
def markItems(self, items, mode="target"):
"""tags the sentence for a list of items
items: a list of contextItems"""
if not items:
return
for item in items:
self.add_nodes_from(self.markItem(item, ConTextMode=mode),
category=mode)
The question is, how can I get the code to read the list in the csv file without throwing this error?

Related

Convert XML file to a list of lists in Python

I have a dataset from HMDB the Saliva Metabolites data.
This data is an XML file. What I want to do is to convert this XML file to a list of lists (nested lists) in Python, however, I don't want all the nodes in the list.
EDITED: AND THIS IS EXAMPLE OF PARTIAL DATA FOR ONE METABOLITE
<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
<version>4.0</version>
<creation_date>2005-11-16 15:48:42 UTC</creation_date>
<update_date>2019-01-11 19:13:56 UTC</update_date>
<accession>HMDB0000001</accession>
<status>quantified</status>
<secondary_accessions>
<accession>HMDB00001</accession>
<accession>HMDB0004935</accession>
<accession>HMDB0006703</accession>
<accession>HMDB0006704</accession>
<accession>HMDB04935</accession>
<accession>HMDB06703</accession>
<accession>HMDB06704</accession>
</secondary_accessions>
<name>1-Methylhistidine</name>
<cs_description>1-Methylhistidine, also known as 1-mhis, belongs to the class of organic compounds known as histidine and derivatives. Histidine and derivatives are compounds containing cysteine or a derivative thereof resulting from reaction of cysteine at the amino group or the carboxy group, or from the replacement of any hydrogen of glycine by a heteroatom. 1-Methylhistidine has been found in human muscle and skeletal muscle tissues, and has also been detected in most biofluids, including cerebrospinal fluid, saliva, blood, and feces. Within the cell, 1-methylhistidine is primarily located in the cytoplasm. 1-Methylhistidine participates in a number of enzymatic reactions. In particular, 1-Methylhistidine and Beta-alanine can be converted into anserine; which is catalyzed by the enzyme carnosine synthase 1. In addition, Beta-Alanine and 1-methylhistidine can be biosynthesized from anserine; which is mediated by the enzyme cytosolic non-specific dipeptidase. In humans, 1-methylhistidine is involved in the histidine metabolism pathway. 1-Methylhistidine is also involved in the metabolic disorder called the histidinemia pathway.</cs_description>
<description>One-methylhistidine (1-MHis) is derived mainly from the anserine of dietary flesh sources, especially poultry. The enzyme, carnosinase, splits anserine into b-alanine and 1-MHis. High levels of 1-MHis tend to inhibit the enzyme carnosinase and increase anserine levels. Conversely, genetic variants with deficient carnosinase activity in plasma show increased 1-MHis excretions when they consume a high meat diet. Reduced serum carnosinase activity is also found in patients with Parkinson's disease and multiple sclerosis and patients following a cerebrovascular accident. Vitamin E deficiency can lead to 1-methylhistidinuria from increased oxidative effects in skeletal muscle. 1-Methylhistidine is a biomarker for the consumption of meat, especially red meat.</description>
<synonyms>
<synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
<synonym>1-Methylhistidine</synonym>
<synonym>Pi-methylhistidine</synonym>
<synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>
<synonym>1 Methylhistidine</synonym>
<synonym>1-Methyl histidine</synonym>
</synonyms>
<chemical_formula>C7H11N3O2</chemical_formula>
<smiles>CN1C=NC(C[C#H](N)C(O)=O)=C1</smiles>
<inchikey>BRMWTNUJHUMWMS-LURJTMIESA-N</inchikey>
<diseases>
<disease>
<name>Kidney disease</name>
<omim_id/>
<references>
<reference>
<reference_text>McGregor DO, Dellow WJ, Lever M, George PM, Robson RA, Chambers ST: Dimethylglycine accumulates in uremia and predicts elevated plasma homocysteine concentrations. Kidney Int. 2001 Jun;59(6):2267-72.</reference_text>
<pubmed_id>11380830</pubmed_id>
</reference>
<reference>
<reference_text>Ehrenpreis ED, Salvino M, Craig RM: Improving the serum D-xylose test for the identification of patients with small intestinal malabsorption. J Clin Gastroenterol. 2001 Jul;33(1):36-40.</reference_text>
<pubmed_id>11418788</pubmed_id>
</reference>
<reference>
</reference>
</references>
</disease>
<disease>
Importing the file:
import xml.etree.ElementTree as et
data1 = et.parse('D:/path/to/Tal/my/HMDB/DataSets/saliva_metabolites/saliva_metabolites.xml')
root = data1.getroot()
Now, not sure how to select specific nodes. Meaning, my goal is to create a list of metabolites and each metabolite from the list will contain a list of nodes (say, <accession>, <name>, <synonyms> and <diseases_name>)
In turn, those elements will contain another list (say, inside <synonyms> there will be a list of values names, or inside <diseases_name> will be the list of names of diseases and each disease will contain a list of pub_id values).
# To access the 4'th node of the first metabolit
>> root[0][3].text
'HMDB0000001'
where root[0][3] represents the <accession> node.
Tried to run loop with print so i'll understand the output of the loop but recieved list of None
for node in root:
print(node.find('accession'))
None
None
None
None
None
.
.
.
Also tried
>> root.findall('./metabolite/accession')
[]
But received empty brackets
for list of synonyms of the first metbolite i tried:
>> root[0][9].text
'\n '
# This gave the first value of synonyms
root[0][9][0].text
'\n '
I used those questions to find an answer:
How do I parse XML in Python?
how to create a list of elements from an XML file in python
Python: XML file to pandas dataframe
Convert XML into Lists of Tags and Values with Python
Generating nested lists from XML doc
Any hints, ideas would be a help, thank you for your time

You are ignoring the namespace in the XML.
<hmdb xmlns="http://www.hmdb.ca">
means that there is no <hmdb> element. There is a <hmdb> in the http://www.hmdb.ca namespace. And since it's the default namespace for this element, all descendant elements are in the same namespace, unless they override that.
So this
root.findall('./metabolite/accession')
will not return anything because you're searching in the wrong namespace.
Let's search in the http://www.hmdb.ca namespace by giving it the handle h, for convenience:
ns = {
"h": "http://www.hmdb.ca"
}
accession = root.findall('./h:metabolite/h:accession', ns)
print(accession)
This finds one element (see how it explicitly denotes the namespace when you print it):
[<Element '{http://www.hmdb.ca}accession' at 0x03E6E7B0>]
You can use the same explicit syntax in ElementTree, but it gets unwieldy very quickly:
t.findall('./{http://www.hmdb.ca}metabolite/{http://www.hmdb.ca}accession')
The shorter (and standard) syntax with the prefix: is a lot nicer to work with.

Doc2Vec not providing adequate results in most_similar

I'm trying to use Doc2Vec to go through the classic exercise of training on Wikipedia articles, using the article title as the tag.
Here's my code and the results, is there something that I'm missing that they would not give the matching results with most_similar? Following this tutorial, but I used the wiki-english-20171001 dataset that came with gensim.
import gensim.downloader as api
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import re
def cleanText(text):
text = re.sub(r'\|\|\|', r' ', text)
text = re.sub(r'http\S+', r'<URL>', text)
text = text.lower()
text = re.sub(r'[^\w\s]','',text)
return text
wiki = api.load("wiki-english-20171001")
data = [d for d in wiki]
for i in range(10):
print(data[i])
def my_create_tagged_docs(data):
for wikiidx in range(len(data)):
yield TaggedDocument([i for i in data[wikiidx].get('section_texts') for i in cleanText(i).split()], [data[wikiidx].get('title')])
wiki_data = my_create_tagged_docs(data)
del data
del wiki
model = Doc2Vec(dm=1, dm_mean=1, size=200, window=8, min_count=19, iter =10, epochs=40)
model.build_vocab(wiki_data)
model.train(wiki_data, total_examples=model.corpus_count, epochs=model.epochs)
model.docvecs.most_similar(positive=["Lady Gaga"], topn=10)
[('Chlorothrix', 0.35521823167800903),
("A Child's Garden of Verses", 0.3533579707145691),
('Fish Mooney', 0.35129639506340027),
('2000 Paris–Roubaix', 0.3463437855243683),
('Calvin C. Chaffee', 0.3439667224884033),
('Murders of Eve Stratford and Lynne Weedon', 0.3397218585014343),
('Black Air', 0.3396576941013336),
('Turzyn', 0.3312540054321289),
('Scott Baker', 0.33018186688423157),
('Amongst the Waves', 0.3297169804573059)]
model.docvecs.most_similar(positive=["Machine learning"], topn=10)
[('Wolf Rock, Connecticut', 0.3855834901332855),
('Amália Rodrigues', 0.3349645137786865),
('Victoria Park, Leicester', 0.33312514424324036),
('List of visual anthropology films', 0.3311382532119751),
('Sadqay Teri Mout Tun', 0.3287636637687683),
('T. Damodaran', 0.32876330614089966),
('Urqu Jawira (Aroma)', 0.32281631231307983),
('Tiggy Wiggy', 0.3226730227470398),
('Frédéric Brun (cyclist, born 1988)', 0.32106447219848633),
('Unholy Crusade', 0.3200794756412506)]

It looks like your wiki_data is a single-pass generator, as returned by my_create_tagged_docs(), which can be iterated over only once - not an iterable object capable of many iterations, as the many steps of the Doc2Vec training requires.
You can test your wiki_data object for whether it's multiply-iterable, just after it's been assigned, by executing:
print(sum(1 for _ in wiki_data))
print(sum(1 for _ in wiki_data))
If you see the same number twice – the total number of documents – all's well. If the 2nd number is 0, you've created a single-use iterator instead of a multiple-use iterable.
As a result, the build_vocab() call will work to initialize the known-vocabulary & model – but then the train() will see an empty iterable, completing instantly with no real training happening. (If you run with logging at the INFO level, this may be obvious in the log timestamps for the various steps.)
Two possible fixes:
If you're lucky enough to have enough RAM to hold the whole corpus as Python objects, converting it into a in-memory list would ensure it's multiple-iterable:
wiki_data = list(my_create_tagged_docs(data))
But, most won't have that much RAM * shouldn't/needn't take that step. Instead, you can define a class for an iterable view on the data, which can return a fresh iterator every time it's needed. There's an example with further explanation in a blog post by the founder of the gensim project at:
https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/

Converting an imperative algorithm into functional style

I wrote a simple procedure to calculate the average of the test coverage of some specific packages in a Java project. The raw data in a huge html file is like this:
<body>
package pkg1 <line_coverage>11/111,<branch_coverage>44/444<end>
package pkg2 <line_coverage>22/222,<branch_coverage>55/555<end>
package pkg3 <line_coverage>33/333,<branch_coverage>66/666<end>
...
</body>
Given the specified packages "pkg1" and "pkg3", for example, the average line coverage is:
(11+33)/(111+333)
and average branch coverage is:
(44+66)/(444+666)
I wrote the follow procedure to get the result and it works well. But how to implement this calculation in a functional style? Something like "(x,y) for x in ... for b in ... if...". I know a little Erlang, Haskell and Clojure, So solutions in these languages are also appreciated. Thanks a lot!
from __future__ import division
import re
datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for line in datafile:
for pkg in core_pkgs:
ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
match = ptn.match(line)
if match is not None:
cvln, tlln, cvbh, tlbh = match.groups()
covered_lines += int(cvln)
total_lines += int(tlln)
covered_branches += int(cvbh)
total_branches += int(tlbh)
print 'Line coverage:', '{:.2%}'.format(covered_lines / total_lines)
print 'Branch coverage:', '{:.2%}'.format(covered_branches/total_branches)

Down below you can find my Haskell solution. I will try to explain the important points I went through as I wrote it.
First you will find that I created a data structure for coverage data. It's generally a good idea to create data structures to represent whatever data you want to handle. This is in part because it makes it easier to design your code when you can think in terms of whatever you are designing – closely related to functional programming philosophies, and in part because it can eliminate a few bugs where you think you are doing something but are in actuality doing something else.
Related to the point before: The first thing I do is to convert the string-represented data into my own data structure. When you are doing functional programming, you are often doing things in "sweeps." You don't have a single function that converts data to your format, filters out the unwanted data and summarises the result. You have three different functions for each of those tasks, and you do them one at a time!
This is because functions are very composable, i.e. if you have three different ones, you can stick them together to form a single one if you want to. If you start with a single one, it is very difficult to take it apart to form three different ones.
The actual workings of the conversion function is actually quite uninteresting unless you are specifically doing Haskell. All it does is try to match each string with a regex, and if it succeeds, it adds the coverage data to the resulting list.
Again, mad composition is about to happen. I don't create a function to loop over a list of coverages and sum them up. I create a single function to sum two coverages, because I know I can use it together with the specialised fold loop (which is sort of like a for loop on steroids) to summarise all coverages in a list. There's no need for me to reinvent the wheel and create a loop myself.
Besides, my sumCoverages function works with a lot of specialised loops, so I don't have to write a ton of functions, I just stick my single function into a ton of pre-made library functions!
In the main function you will see what I mean by programming in "sweeps" or "passes" over the data. First I convert it to the internal format, then I filter out the unwanted data, then I summarise the remaining data. These are completely independent computations. That's functional programming.
You will also notice that I use two specialised loops there, filter and fold. This means that I don't have to write any loops myself, I just stick in a function to those standard library loops and let those take it from there.
import Data.Maybe (catMaybes)
import Data.List (foldl')
import Text.Printf (printf)
import Text.Regex (matchRegex, mkRegex)
corePkgs = ["d", "f"]
stats = [
"d>11/23d>34/89d",
"e>25/65e>13/25e",
"f>36/92f>19/76"
]
format = mkRegex ".*(\\w+).*>([0-9]+)/([0-9]+).*>([0-9]+)/([0-9]+).*"
-- It might be a good idea to define a datatype for coverage data.
-- A bit of coverage data is defined as the name of the package it
-- came from, the lines covered, the total amount of lines, the
-- branches covered and the total amount of branches.
data Coverage = Coverage String Int Int Int Int
-- Then we need a way to convert the string data into a list of
-- coverage data. We do this by regex. We try to match on each
-- string in the list, and then we choose to keep only the successful
-- matches. Returned is a list of coverage data that was represented
-- by the strings.
convert :: [String] -> [Coverage]
convert = catMaybes . map match
where match line = do
[name, cl, tl, cb, tb] <- matchRegex format line
return $ Coverage name (read cl) (read tl) (read cb) (read tb)
-- We need a way to summarise two coverage data bits. This can of course also
-- be used to summarise entire lists of coverage data, by folding over it.
sumCoverage (Coverage nameA clA tlA cbA tbA) (Coverage nameB clB tlB cbB tbB) =
Coverage (nameA ++ nameB ++ ",") (clA + clB) (tlA + tlB) (cbA + cbB) (tbA + tbB)
main = do
-- First we need to convert the strings to coverage data
let coverageData = convert stats
-- Then we want to filter out only the relevant data
relevantData = filter (\(Coverage name _ _ _ _) -> name `elem` corePkgs) coverageData
-- Then we need to summarise it, but we are only interested in the numbers
Coverage _ cl tl cb tb = foldl' sumCoverage (Coverage "" 0 0 0 0) relevantData
-- So we can finally print them!
printf "Line coverage: %.2f\n" (fromIntegral cl / fromIntegral tl :: Double)
printf "Branch coverage: %.2f\n" (fromIntegral cb / fromIntegral tb :: Double)

Here are some quickly-hacked, untested ideas applied to your code:
import numpy as np
import re
datafile = ('abc', 'd>11/23d>34/89d', 'e>25/65e>13/25e', 'f>36/92f>19/76')
core_pkgs = ('d', 'f')
covered_lines, total_lines, covered_branches, total_branches = 0, 0, 0, 0
for pkg in core_pkgs:
ptn = re.compile('.*'+pkg+'.*'+'>(\d+)/(\d+).*>(\d+)/(\d+).*')
matches = map(datafile, ptn.match)
statsList = [map(int, match.groups()) for match in matches if matches]
# statsList is a list of [cvln, tlln, cvbh, tlbh]
stats = np.array(statsList)
covered_lines, total_lines, covered_branches, total_branches = stats.sum(axis=1)
Well, as you can see I haven't bothered to finish off the remaining loop, but I think the point is made by now. There's certainly a lot more than one way to do this; I elected to show off map() (which some will say makes this less efficient, and it probably does), as well as NumPy to get the (admittedly light) math done.

This is the corresponding Clojure solution:
(defn extract-data
"extract 4 integer from a string line according to a package name"
[pkg line]
(map read-string
(rest (first
(re-seq
(re-pattern
(str pkg ".*>(\\d+)/(\\d+).*>(\\d+)/(\\d+)"))
line)))))
(defn scan-lines-by-pkg
"scan all string lines and extract all data as integer sequences
according to package names"
[pkgs lines]
(filter seq (for [pkg pkgs
line lines]
(extract-data pkg line))))
(defn sum-data
"add all data in valid lines together"
[pkgs lines]
(apply map + (scan-lines-by-pkg pkgs lines)))
(defn get-percent
[covered all]
(str (format "%.2f" (float (/ (* covered 100) all))) "%"))
(defn get-cov
[pkgs lines]
{:line-cov (apply get-percent (take 2 (sum-data pkgs lines)))
:branch-cov (apply get-percent (drop 2 (sum-data pkgs lines)))})
(get-cov ["d" "f"] ["abc" "d>11/23d>34/89d" "e>25/65e>13/25e" "f>36/92f>19/76"])

How can I convert a regex to an NFA?

Are there any modules available in Python to convert a regular expression to corresponding NFA,
or do I have to build the code from scratch (by converting the regex from infix to postfix and then implementing Thompson's Algorithm to get the corresponding NFA)?
Is it possible in Python to get the state diagram of an NFA from the transition table?

regex=''.join(postfix)
keys=list(set(re.sub('[^A-Za-z0-9]+', '', regex)+'e'))
s=[];stack=[];start=0;end=1
counter=-1;c1=0;c2=0
for i in regex:
if i in keys:
counter=counter+1;c1=counter;counter=counter+1;c2=counter;
s.append({});s.append({})
stack.append([c1,c2])
s[c1][i]=c2
elif i=='*':
r1,r2=stack.pop()
counter=counter+1;c1=counter;counter=counter+1;c2=counter;
s.append({});s.append({})
stack.append([c1,c2])
s[r2]['e']=(r1,c2);s[c1]['e']=(r1,c2)
if start==r1:start=c1
if end==r2:end=c2
elif i=='.':
r11,r12=stack.pop()
r21,r22=stack.pop()
stack.append([r21,r12])
s[r22]['e']=r11
if start==r11:start=r21
if end==r22:end=r12
else:
counter=counter+1;c1=counter;counter=counter+1;c2=counter;
s.append({});s.append({})
r11,r12=stack.pop()
r21,r22=stack.pop()
stack.append([c1,c2])
s[c1]['e']=(r21,r11); s[r12]['e']=c2; s[r22]['e']=c2
if start==r11 or start==r21:start=c1
if end==r22 or end==r12:end=c2
print keys
print s
this is the pretty much code sample after the postfix. s contains the transition table and keys contains all the terminals used including e. e is used for Epsilon.
It's completely based on Thompson's Algorithm.

2 different blocks of text are merging together. Can I separate them if i know what 1 is?

I've used a number of pdf-->text methods to extract text from pdf documents. For one particular type of PDF I have, neither pyPDF or pdfMiner are doing a good job extracting the text. However, http://www.convertpdftotext.net/ does it (almost) perfectly.
I discovered that the pdf I'm using has some transparent text in it, and it is getting merged into the other text.
Some examples of the blocks of text I get back are:
12324 35th Ed. 01-MAR-12 Last LNM: 14/12 NAD 83 14/12 Corrective Object of Corrective
ChartTitle: Intracoastal Waterway Sandy Hook to Little Egg Harbor Position
C HAActRionT N Y -NJ - S A N D Y H OO K ATcO tionLI T TLE EGG HARBOR. Page/Side: N/A
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true.
Bearings RoEf LlighOCtAT seEc tors aSrehre towwsbuardry th Re ivligher Ct fhroanmn seel Lawighartde.d B Theuoy 5no minal range of lights is expressedf roin mna 4u0tic-24al -mi46les.56 0(NNM ) unless othe0r7w4is-00e n-o05te.d8.8 0 W
to 40-24-48.585N 074-00-05.967W
and
12352 33rd Ed. 01-MAR-11 Last LNM: 03/12 NAD 83 04/12 . . l . . . . Corrective Object of Corrective ChartTitle: Shinnecock Bay to East Rockaway Inlet Position C HAActRionT S H IN N E C OC K B A Y TO AcEtionAS T ROCKAWAY INLET. Page/Side: N/A (Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true. (BTeeamringp) s DoEf LlighETtE s ectors aSretat toew Baoratd Ctheh anlighnet lf Droaym beseacoawanr 3d. The nominal range of lights is expressedf roin mna 4u0tic-37al -mi11les.52 0(NNM ) unless othe0r7w3is-29e n-5o3te.d76. 0 W
and I have discovered that the "ghost text" is ALWAYS the following:
Corrective Object of Corrective Position
Action Action
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are given in degrees clockwise from 000 true.
Bearings of light sectors are toward the light from seaward. The nominal range of lights is expressed in nautical miles (NM) unless otherwise noted.
In the 2nd example I posted, the text I want (with the ghost text removed) is:
12352 33rd Ed. 01-Mar-11 Last LNM:03/12 NAD 83 04/12
Chart Title:Shinnecock Bay to East Rockaway Inlet. Page/Side:N/A
CGD01
(Temp) DELETE State Boat Channel Daybeacon 3 from 40-37-11.520N 073-29-53.760W
This problem occurs just once per document, and does not appear to be totally consistent (as seen above). I am wondering if one of you wizards could think of a way to remove the ghosted text (I don't need/want it) using python. If I had been using pyPDF, I would have used a regex to rip it out during the conversion to text. Unfortunately, since I'm starting out with a text file from the website listed above, the damage has already been done. I'm at a bit of a loss.
Thanks for reading.
EDIT:
The solution to this problem looks like it be more complex than the rest of the application, so I'm going to withdraw my request for help.
I very much appreciate the thought put into it by those who have contributed.

Given that the ghost text can be split up in seemingly unpredictable ways, I don't think there is a simple automatic way of removing it that would not have false positives. What you need is almost human-level pattern recognition. :-)
What you could try is exploiting the format of these kinds of messages. Roughly;
<number> <number>[rn]d Ed. <date> Last LNM:<mm>/<yy> NAD <date2>
Chart Title:<text>. Page/Side:<N/A or number(s)> CGD<number> <text>
<position>
Using this you could pluck out the nonsense from the predictable elements, and then if you have a list of chart names ('Shinnecock Bay to East Rockaway Inlet') and descriptive words (like 'State', 'Boat', 'Daybeacon') you might be able to reconstruct the original words by finding the smallest levenshtein distance between mangled words in the two text blocks and those in your word lists.
If you can install the poppler software, you could try and use pdftotext with the -layout option to keep the formatting from the original PDF as much as possible. That might make your problem disappear.

You could recursively find all possible ways that your Pattern
"Corrective Object of Corrective Position Action ..." can be contained within your mangled text,
Then you can unmangle the text for each of these possible paths, run some sort of spellcheck over them, and choose the one with the fewest spelling mistakes. Or since you know roughly where each substring should appear, you can use that as a heuristic.
Or you could simply use the first path.
some pseudocode (untested):
def findPaths(mangledText, pattern, path)
if len(pattern)==0: # end of pattern
return [path]
else:
nextLetter= pattern[0]
locations = findAllOccurences (mangledText, nextLetter) # get all indices in mangledText that contain nextLetter
allPaths = []
for loc in locations:
paths = findPaths( mangledText[loc+1:], pattern[1:], path + (loc,) )
allPaths.Extend(paths)
return allPaths # if no locations for the next letters exist, allPaths will be emtpy
Then you can call it like this (optionally remove all spaces from your search pattern, unless you are certain they are all included in the mangled text)
allPossiblePaths = findPaths ( YourMangledText, "Corrective Object...", () )
then allPossiblePaths should contain a list of all possible ways your pattern could be contained in your mangled text.
Each entry is a tuple with the same length as the pattern, containing the index at which the corresponding letter of the pattern occurs in the search text.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.