Suppose I have a text like this:
s = '\n\nPART I, WHERE I’M COMING FROM\n\n1\xa0My Call to Adventure: 1949–1967\n2\xa0Crossing the Threshold: 1967–1979\n3\xa0My Abyss: 1979–1982\n4\xa0My Road of Trials: 1983–1994\n5\xa0The Ultimate Boon: 1995–2010\n6\xa0Returning the Boon: 2011–2015\n7\xa0My Last Year and My Greatest Challenge: 2016–2017\n8\xa0Looking Back from a Higher Level\n\nPART II, LIFE PRINCIPLES\n\n1\xa0Embrace Reality and Deal with It\n2\xa0Use the 5-Step Process to Get What You Want Out of Life\n3\xa0Be Radically Open-Minded\n4\xa0Understand That People Are Wired Very Differently\n5\xa0Learn How to Make Decisions Effectively\nLife Principles: Putting It All Together\nSummary and Table of Life Principles\n\nPART III, WORK PRINCIPLES\n\nSummary and Table of Work Principles\nTO GET THE CULTURE RIGHT\n\nTO GET THE PEOPLE\n\nTO BUILD AND EVOLVE YOUR \nWork Principles: Putting It All Together\n\n'
Split it on the delimiter PART:
In [14]: parts = re.split(r'\n\nPART',s)
In [15]: parts
Out[15]:
['',
' I, WHERE I’M COMING FROM\n\n1\xa0My Call to Adventure: 1949–1967\n2\xa0Crossing the Threshold: 1967–1979\n3\xa0My Abyss: 1979–1982\n4\xa0My Road of Trials: 1983–1994\n5\xa0The Ultimate Boon: 1995–2010\n6\xa0Returning the Boon: 2011–2015\n7\xa0My Last Year and My Greatest Challenge: 2016–2017\n8\xa0Looking Back from a Higher Level',
' II, LIFE PRINCIPLES\n\n1\xa0Embrace Reality and Deal with It\n2\xa0Use the 5-Step Process to Get What You Want Out of Life\n3\xa0Be Radically Open-Minded\n4\xa0Understand That People Are Wired Very Differently\n5\xa0Learn How to Make Decisions Effectively\nLife Principles: Putting It All Together\nSummary and Table of Life Principles',
' III, WORK PRINCIPLES\n\nSummary and Table of Work Principles\nTO GET THE CULTURE RIGHT\n\nTO GET THE PEOPLE\n\nTO BUILD AND EVOLVE YOUR \nWork Principles: Putting It All Together\n\n']
Add the prefix PART back to each item in the list:
In [16]: ['PART' + i for i in parts if i]
Out[16]:
['PART I, WHERE I’M COMING FROM\n\n1\xa0My Call to Adventure: 1949–1967\n2\xa0Crossing the Threshold: 1967–1979\n3\xa0My Abyss: 1979–1982\n4\xa0My Road of Trials: 1983–1994\n5\xa0The Ultimate Boon: 1995–2010\n6\xa0Returning the Boon: 2011–2015\n7\xa0My Last Year and My Greatest Challenge: 2016–2017\n8\xa0Looking Back from a Higher Level',
'PART II, LIFE PRINCIPLES\n\n1\xa0Embrace Reality and Deal with It\n2\xa0Use the 5-Step Process to Get What You Want Out of Life\n3\xa0Be Radically Open-Minded\n4\xa0Understand That People Are Wired Very Differently\n5\xa0Learn How to Make Decisions Effectively\nLife Principles: Putting It All Together\nSummary and Table of Life Principles',
'PART III, WORK PRINCIPLES\n\nSummary and Table of Work Principles\nTO GET THE CULTURE RIGHT\n\nTO GET THE PEOPLE\n\nTO BUILD AND EVOLVE YOUR \nWork Principles: Putting It All Together\n\n']
I would like to do this in one step:
In [17]: parts = re.findall(r'\n\nPART.+', s)
In [18]: parts
Out[18]:
['\n\nPART I, WHERE I’M COMING FROM',
'\n\nPART II, LIFE PRINCIPLES',
 '\n\nPART III, WORK PRINCIPLES']
# the dot stops at \n; I want to work around that with a quantifier (repeating the match over multiple lines)
In [20]: parts = re.findall(r'\n\n(?:PART.+)+', s)
In [21]: parts
Out[21]:
['\n\nPART I, WHERE I’M COMING FROM',
'\n\nPART II, LIFE PRINCIPLES',
'\n\nPART III, WORK PRINCIPLES']
#Unfortunately, it prints the same output
How can I accomplish this in one step?
Try splitting on a positive lookahead to retain the delimiter, using the regex module:
import regex
print(regex.split(r"(?=\n\nPART)", s, flags=regex.VERSION1))
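For what it's worth, in Python 3.7 and later the standard re module can also split on a zero-width lookahead, so roughly the same one-step split works without the extra dependency (a minimal sketch):

import re

# split where "\n\nPART" begins, so each chunk keeps its "PART ..." heading
parts = [p for p in re.split(r'(?=\n\nPART)', s) if p.strip()]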
I have a dataframe that has a small number of columns but many rows (about 900K right now, and it's going to get bigger as I collect more data). It looks like this:
      Author             Title            Date        Category            Text                                                url
0     Amira Charfeddine  Wild Fadhila 01  2019-01-01  novel               الكتاب هذا نهديه لكل تونسي حس إلي الكتاب يحكي ...   NaN
1     Amira Charfeddine  Wild Fadhila 02  2019-01-01  novel               في التزغريت، والعياط و الزمامر، ليوم نتيجة الب...   NaN
2     253826             1515368_7636953  2010-12-28  /forums/forums/91/  هذا ما ينص عليه إدوستور التونسي لا رئاسة مدى ا...   https://www.tunisia-sat.com/forums/threads/151...
3     250442             1504416_7580403  2010-12-21  /forums/sports/     \n\n\n\n\n\nاعلنت الجامعة التونسية لكرة اليد ا...   https://www.tunisia-sat.com/forums/threads/150...
4     312628             1504416_7580433  2010-12-21  /forums/sports/     quel est le résultat final\n,,,,????                https://www.tunisia-sat.com/forums/threads/150...
The "Text" Column has a string of text that may be just a few words (in the case of a forum post) or it may a portion of a novel and have tens of thousands of words (as in the two first rows above).
I have code that constructs the dataframe from various corpus files (.txt and .json), then cleans the text and saves the cleaned dataframe as a pickle file.
I'm trying to run the following code to analyze how variable the spelling of different words is in the corpus. The functions seem simple enough: one counts the occurrences of a particular spelling variant in each Text row; the other takes a list of such frequencies and computes a Gini coefficient for each lemma (which is just a numerical measure of how heterogeneous the spelling is). It references a spelling_var dictionary that has a lemma as its key and the various ways of spelling that lemma as values (like {'color': ['color', 'colour']}, except not in English).
This code works, but it uses a lot of CPU time. I'm not sure how much, but I use PythonAnywhere for my coding and this code sends me into the tarpit (in other words, it makes me exceed my daily allowance of CPU seconds).
Is there a way to do this so that it's less CPU intensive? Preferably without me having to learn another package (I've spent the past several weeks learning Pandas and am liking it, and need to just get on with my analysis). Once I have the code and have finished collecting the corpus, I'll only run it a few times; I won't be running it every day or anything (in case that matters).
Here's the code:
import pickle
import pandas as pd
import re

with open('1_raw_df.pkl', 'rb') as pickle_file:
    df = pickle.load(pickle_file)

spelling_var = {
    'illi': ["الي", "اللي"],
    'besh': ["باش", "بش"],
    ...
}

spelling_df = df.copy()

def count_word(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern)

def compute_gini(freq_list):
    proportions = [f/sum(freq_list) for f in freq_list]
    squared = [p**2 for p in proportions]
    return 1-sum(squared)

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
    gini = compute_gini(count_list)
    spelling_df[w] = gini
I rewrote two lines in the last double loop; see the comments in the code below. Does this solve your issue?
gini_lst = []

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        #gini = compute_gini(count_list) # don't think you need to compute this at every iteration of the inner loop, right?
        #spelling_df[w] = gini # having this inside of the loop creates a new column at each iteration, which could crash your CPU
    gini_lst.append(compute_gini(count_list))

# this creates a df with a row for each lemma with its associated gini value
df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})
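If the repeated str.count scans are still too slow, another (untested) idea, assuming whitespace tokenization is close enough to the \b word-boundary matching for your corpus: tokenize every Text row once, then read the per-variant frequencies out of a single Counter and compute one corpus-level Gini value per lemma:

from collections import Counter

token_counts = Counter()
for text in spelling_df['Text'].dropna():
    token_counts.update(text.split())        # one pass over the whole corpus

gini_lst = []
for w, var in spelling_var.items():
    freqs = [token_counts[v] for v in var]
    # guard against lemmas that never occur, to avoid dividing by zero
    gini_lst.append(compute_gini(freqs) if sum(freqs) else 0.0)

df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()),
                                   "gini_column": gini_lst})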
I'm using ocrmypdf. I'm trying to do ocr on campaign finances pdfs. Example pdfs: https://apps1.lavote.net/camp/comm.cfm?&cid=11
My client wants to parse these pdfs, as well as others (form 496s, form 497s). The problem is, even with forms of the same type, the OCR results are inconsistent.
For example, one pdf (form 460) will yield these results:
Statement covers period
from 07/01/2005
through __11/30/2005
and another of the same type yields:
Statement covers period
01/01/2006
from
through 03/17/2006
Notice in the first, the first date comes after the from, whereas in the second, the first date comes before the from. This creates complications when trying to parse the data.
I'm using what I call "checkpoints" to parse forms of similar type. Here's an example:
checkpoints = [
    ['Statement covers period from', 'Date From'],
    ['through', 'Date Thru'],
    ['Date of election if applicable:', None],
    ['\n', None],
    ['\\NUMBER Treasurer(s)\n', 'ID'],
    ['\n', None],
    ['COMMITTEE NAME (OR CANDIDATE’S NAME IF NO COMMITTEE)\n\n', 'Committee / Candidate Name'],
    ['\n', None],
    ['NAME OF TREASURER\n\n', 'Name of Treasurer'],
    ['\n', None],
    ['NAME OF OFFICEHOLDER OR CANDIDATE\n\n', 'Name of Officeholder or Candidate'],
    ['\n', None],
    ['OFFICE SOUGHT OR HELD (INCLUDE LOCATION AND DISTRICT NUMBER IF APPLICABLE)\n\n', 'Office Sought or Held'],
    ['\n', None],
]
I loop through every checkpoint, find the start and end indices for the current iteration (using the current checkpoint and the next one, element [0] rather than [1]), and save the contents to a key in a master object, like county_object[checkpoint[1]] = contents[start_index:end_index].
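In code, the loop is roughly like this (a simplified sketch; contents is the OCR text of one form, and error handling is omitted):

county_object = {}
cursor = 0
for (anchor, field), (next_anchor, _) in zip(checkpoints, checkpoints[1:]):
    start_index = contents.find(anchor, cursor) + len(anchor)      # just past the current anchor
    end_index = contents.find(next_anchor, start_index)            # up to the next anchor
    if field is not None:
        county_object[field] = contents[start_index:end_index].strip()
    cursor = end_index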
This setup only works specifically for the pdf I am parsing. Because ocrmypdf yields different results for even same form types, my setup is not ideal. Can someone point me in the right direction on how I should go about this?
Thanks
I imagine the difference between "identical" Form 460s is a vertical misalignment due to one being scanned at a slight clockwise angle and another at a slight counter-clockwise angle. I hope you are invoking with --deskew, but even with that there may be minor aberrations that prove troublesome.
The vertical separation between the dates seems large and robust, so one date will reliably precede the other. Consider focusing more on the mm/dd/yyyy pattern and less on the text anchors. You can obtain bounding-box coordinates from Tesseract OCR. Use them to disambiguate the dates, based on your knowledge of what appears higher or lower on the form, and by (approximately) how much.
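As a rough sketch of the bounding-box idea, assuming pytesseract is available and the form page has been rendered to an image ("page.png" is a placeholder):

import re
import pytesseract
from PIL import Image

# word-level boxes from Tesseract; 'top' is each word's vertical position on the page
data = pytesseract.image_to_data(Image.open("page.png"),
                                 output_type=pytesseract.Output.DICT)
date_re = re.compile(r"\d{2}/\d{2}/\d{4}")
dates = sorted(
    (data["top"][i], data["text"][i])
    for i in range(len(data["text"]))
    if date_re.search(data["text"][i])
)
# the "from" date sits higher on the form than the "through" date, so sorting by
# vertical position disambiguates them regardless of the order the text came out in
print(dates)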
I'm attempting a similar operation to the one shown here.
I begin by reading in two columns from a CSV file that contains 2405 rows, in the format: Year, e.g. "1995", and cleaned, e.g. ["this", "is", "exemplar", "document", "contents"]; both columns use strings as their data type.
df = pandas.read_csv("ukgovClean.csv", encoding='utf-8', usecols=[0,2])
I have already pre-cleaned the data; the format of the first few rows is shown below:
[IN] df.head()
[OUT] Year cleaned
0 1909 acquaint hous receiv follow letter clerk crown...
1 1909 ask secretari state war whether issu statement...
2 1909 i beg present petit sign upward motor car driv...
3 1909 i desir ask secretari state war second lieuten...
4 1909 ask secretari state war whether would introduc...
[IN] df['cleaned'].head()
[OUT] 0 acquaint hous receiv follow letter clerk crown...
1 ask secretari state war whether issu statement...
2 i beg present petit sign upward motor car driv...
3 i desir ask secretari state war second lieuten...
4 ask secretari state war whether would introduc...
Name: cleaned, dtype: object
Then I initialise the TfidfVectorizer:
[IN] v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
Following this, calling upon the below line results in:
[IN] x = v.fit_transform(df['cleaned'])
[OUT] ValueError: np.nan is an invalid document, expected byte or unicode string.
I overcame this using the solution in the aforementioned thread:
[IN] x = v.fit_transform(df['cleaned'].values.astype('U'))
however, this resulted in a Memory Error (Full Traceback).
I've looked into using Pickle for storage to circumvent the heavy memory usage, but I'm not sure how to fit it into this scenario. Any tips would be much appreciated, and thanks for reading.
[UPDATE]
#pittsburgh137 posted a solution to a similar problem involving fitting data here, in which the training data is generated using pandas.get_dummies(). What I've done with this is:
[IN] train_X = pandas.get_dummies(df['cleaned'])
[IN] train_X.shape
[OUT] (2405, 2380)
[IN] x = v.fit_transform(train_X)
[IN] type(x)
[OUT] scipy.sparse.csr.csr_matrix
I thought I should update any readers while I see what I can do with this development. If there are any predicted pitfalls with this method, I'd love to hear them.
I believe it's the conversion to dtype('<Unn') that might be giving you trouble. Check out the size of the array on a relative basis, using just the first few documents plus an NaN:
>>> df['cleaned'].values
array(['acquaint hous receiv follow letter clerk crown',
'ask secretari state war whether issu statement',
'i beg present petit sign upward motor car driv',
'i desir ask secretari state war second lieuten',
'ask secretari state war whether would introduc', nan],
dtype=object)
>>> df['cleaned'].values.astype('U').nbytes
1104
>>> df['cleaned'].values.nbytes
48
It seems like it would make sense to drop the NaN values first (df.dropna(inplace=True)). Then, it should be pretty efficient to call v.fit_transform(df['cleaned'].tolist()).
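Roughly, a minimal sketch of that suggestion:

from sklearn.feature_extraction.text import TfidfVectorizer

df = df.dropna(subset=['cleaned'])           # drop rows whose 'cleaned' text is NaN
v = TfidfVectorizer(decode_error='replace', encoding='utf-8')
x = v.fit_transform(df['cleaned'].tolist())  # plain Python strings, no '<U..' copy of the array
print(x.shape)                               # (n_documents, n_terms)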
I have data that looks like this:
owned category weight mechanics_split
28156 Environmental, Medical 2.8023 [Action Point Allowance System, Co-operative P...
9269 Card Game, Civilization, Economic 4.3073 [Action Point Allowance System, Auction/Biddin...
36707 Modern Warfare, Political, Wargame 3.5293 [Area Control / Area Influence, Campaign / Bat...
and used this function (taken from the generous answer in this question):
def owned_nums(games):
    for row in games.iterrows():
        owned_value = row[1]['owned']
        mechanics = row[1]['mechanics_split']
        for type_string in mechanics:
            game_types.loc[type_string, ['owned']] += owned_value
to iterate over the values in the dataframe and put new values in a new dataframe called game_types. It worked great. In fact, it still works great; that notebook is open, and if I change the last line of the function to just print (type_string), it prints:
Action Point Allowance System
Co-operative Play
Hand Management
Point to Point Movement
Set Collection
Trading
Variable Player Powers
Action Point Allowance System...
Okay, perfect. So, I saved my data as a csv, opened a new notebook, opened the csv with the columns with the split strings, copied and pasted the exact same function, and when I print type_string, I now get:
[
'
A
c
t
i
o
n
P
o
i
n
t
A
l
l
o
w
The only thing I noticed is that the original lists were quote-less, with [Action Point Allowance System, Co-operative...] etc., and the new dataframe opened from the new csv was rendered as ['Action Point Allowance System', 'Co-operative...'], with quotes. I used str.replace("'","") which got rid of the quotes, but it's still returning every letter. I've tried experimenting with the escapechar option in to_csv, but to no avail. I'm very confused as to what setting I need to tweak.
Thanks very much for any help.
The only way the code
mechanics = row[1]['mechanics_split']
for type_string in mechanics:
    game_types.loc[type_string, ['owned']] += owned_value
can have worked is if your mechanics_split column contained not a string but an iterable containing strings.
Storing non-scalar data in Series is not well-supported, and while it's sometimes useful (though slow) as an intermediate step, it's not supposed to be something you do regularly. Basically what you're doing is
>>> df = pd.DataFrame({"A": [["x","y"],["z"]]})
>>> df.to_csv("a.csv")
>>> !cat a.csv
,A
0,"['x', 'y']"
1,['z']
after which you have
>>> df2 = pd.read_csv("a.csv", index_col=0)
>>> df2
A
0 ['x', 'y']
1 ['z']
>>> df.A.values
array([['x', 'y'], ['z']], dtype=object)
>>> df2.A.values
array(["['x', 'y']", "['z']"], dtype=object)
>>> type(df.A.iloc[0])
<class 'list'>
>>> type(df2.A.iloc[0])
<class 'str'>
and you notice that what was originally a Series containing lists of strings is now a Series containing only strings. Which makes sense, if you think about it, because CSVs never claimed to be type-preserving.
If you insist on using a frame like this, you should manually encode and decode your lists via some representation (e.g. JSON strings) on reading and writing. I'm too lazy to confirm what pandas does to str-ify lists, but you might be able to get away with applying ast.literal_eval to the resulting strings to turn them back into lists.
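For example, continuing the a.csv frame from above (this assumes the lists were str-ified with Python's repr, which literal_eval can parse):

>>> import ast
>>> df2["A"] = df2["A"].apply(ast.literal_eval)   # "['x', 'y']" -> ['x', 'y']
>>> type(df2.A.iloc[0])
<class 'list'>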
I've used a number of pdf-->text methods to extract text from pdf documents. For one particular type of PDF I have, neither pyPDF nor pdfMiner does a good job extracting the text. However, http://www.convertpdftotext.net/ does it (almost) perfectly.
I discovered that the pdf I'm using has some transparent text in it, and it is getting merged into the other text.
Some examples of the blocks of text I get back are:
12324 35th Ed. 01-MAR-12 Last LNM: 14/12 NAD 83 14/12 Corrective Object of Corrective
ChartTitle: Intracoastal Waterway Sandy Hook to Little Egg Harbor Position
C HAActRionT N Y -NJ - S A N D Y H OO K ATcO tionLI T TLE EGG HARBOR. Page/Side: N/A
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true.
Bearings RoEf LlighOCtAT seEc tors aSrehre towwsbuardry th Re ivligher Ct fhroanmn seel Lawighartde.d B Theuoy 5no minal range of lights is expressedf roin mna 4u0tic-24al -mi46les.56 0(NNM ) unless othe0r7w4is-00e n-o05te.d8.8 0 W
to 40-24-48.585N 074-00-05.967W
and
12352 33rd Ed. 01-MAR-11 Last LNM: 03/12 NAD 83 04/12 . . l . . . . Corrective Object of Corrective ChartTitle: Shinnecock Bay to East Rockaway Inlet Position C HAActRionT S H IN N E C OC K B A Y TO AcEtionAS T ROCKAWAY INLET. Page/Side: N/A (Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true. (BTeeamringp) s DoEf LlighETtE s ectors aSretat toew Baoratd Ctheh anlighnet lf Droaym beseacoawanr 3d. The nominal range of lights is expressedf roin mna 4u0tic-37al -mi11les.52 0(NNM ) unless othe0r7w3is-29e n-5o3te.d76. 0 W
and I have discovered that the "ghost text" is ALWAYS the following:
Corrective Object of Corrective Position
Action Action
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are given in degrees clockwise from 000 true.
Bearings of light sectors are toward the light from seaward. The nominal range of lights is expressed in nautical miles (NM) unless otherwise noted.
In the 2nd example I posted, the text I want (with the ghost text removed) is:
12352 33rd Ed. 01-Mar-11 Last LNM:03/12 NAD 83 04/12
Chart Title:Shinnecock Bay to East Rockaway Inlet. Page/Side:N/A
CGD01
(Temp) DELETE State Boat Channel Daybeacon 3 from 40-37-11.520N 073-29-53.760W
This problem occurs just once per document, and does not appear to be totally consistent (as seen above). I am wondering if one of you wizards could think of a way to remove the ghosted text (I don't need/want it) using python. If I had been using pyPDF, I would have used a regex to rip it out during the conversion to text. Unfortunately, since I'm starting out with a text file from the website listed above, the damage has already been done. I'm at a bit of a loss.
Thanks for reading.
EDIT:
The solution to this problem looks like it will be more complex than the rest of the application, so I'm going to withdraw my request for help.
I very much appreciate the thought put into it by those who have contributed.
Given that the ghost text can be split up in seemingly unpredictable ways, I don't think there is a simple automatic way of removing it that would not have false positives. What you need is almost human-level pattern recognition. :-)
What you could try is exploiting the format of these kinds of messages. Roughly:
<number> <number>[rn]d Ed. <date> Last LNM:<mm>/<yy> NAD <date2>
Chart Title:<text>. Page/Side:<N/A or number(s)> CGD<number> <text>
<position>
Using this you could pluck out the nonsense from the predictable elements, and then if you have a list of chart names ('Shinnecock Bay to East Rockaway Inlet') and descriptive words (like 'State', 'Boat', 'Daybeacon') you might be able to reconstruct the original words by finding the smallest levenshtein distance between mangled words in the two text blocks and those in your word lists.
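A rough illustration of that last idea, using difflib from the standard library as a stand-in for a true Levenshtein distance (the vocabulary here is made up from the example):

import difflib

vocabulary = ['State', 'Boat', 'Channel', 'Daybeacon', 'Shinnecock', 'Bay']

def best_guess(token, vocab=vocabulary):
    # return the closest known word, or the token unchanged if nothing is close enough
    matches = difflib.get_close_matches(token, vocab, n=1, cutoff=0.6)
    return matches[0] if matches else token

print(best_guess('Stat'))  # -> 'State'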
If you can install the poppler software, you could try and use pdftotext with the -layout option to keep the formatting from the original PDF as much as possible. That might make your problem disappear.
You could recursively find all possible ways that your pattern "Corrective Object of Corrective Position Action ..." can be contained within your mangled text. Then you can unmangle the text for each of these possible paths, run some sort of spellcheck over them, and choose the one with the fewest spelling mistakes. Or, since you know roughly where each substring should appear, you can use that as a heuristic.
Or you could simply use the first path.
Some pseudocode (untested):

def findAllOccurences(mangledText, letter, start=0):
    # all indices >= start at which `letter` occurs in mangledText
    return [i for i, c in enumerate(mangledText) if i >= start and c == letter]

def findPaths(mangledText, pattern, path, start=0):
    if len(pattern) == 0:  # end of pattern: one complete path found
        return [path]
    nextLetter = pattern[0]
    locations = findAllOccurences(mangledText, nextLetter, start)  # candidate positions for the next letter
    allPaths = []
    for loc in locations:
        paths = findPaths(mangledText, pattern[1:], path + (loc,), loc + 1)
        allPaths.extend(paths)
    return allPaths  # if no locations for the next letter exist, allPaths will be empty
Then you can call it like this (optionally remove all spaces from your search pattern, unless you are certain they are all included in the mangled text)
allPossiblePaths = findPaths ( YourMangledText, "Corrective Object...", () )
then allPossiblePaths should contain a list of all possible ways your pattern could be contained in your mangled text.
Each entry is a tuple with the same length as the pattern, containing the index at which the corresponding letter of the pattern occurs in the search text.