Detecting sections of a pdf with pdfminer [closed] - python

I am trying to transform PDFs of conference/journal papers into .txt files. I basically want a structure a bit cleaner than the current PDF: no line breaks before the end of a sentence, and the paper's sections highlighted. The problem I am currently dealing with is detecting sections automatically. That is, on the first page of a paper I want to be able to find ABSTRACT, CCS CONCEPTS, 1 INTRODUCTION, 2 THE BODY OF THE PAPER, and so on.
I currently use a simple idea which works-ish: I let pdfminer do its job and then use NLTK to find sentences.
from io import StringIO

from nltk.tokenize import sent_tokenize
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path, year):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    sentences = sent_tokenize(text)
    output = open("out.txt", 'w')
    for s in sentences:
        s = s.replace("-\n", '')  # rejoin words hyphenated across line breaks
        for line in s.split("\n"):
            if line.isupper():  # sections are only uppercase...
                # ...however, other things are also only uppercase,
                # hence my false positives
                line = "--SECTION-- " + line
                output.write("\n\n" + line + "\n")
            else:
                output.write(line)
        output.write("\n")
    output.close()
    fp.close()
    device.close()
    retstr.close()
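For reference, this is how I call it (the year argument is currently unused inside the function):

convert_pdf_to_txt("sample-sigconf-authordraft.pdf", 2018)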
On the whole file, this gives me the following output:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758SIG Proceedings Paper in LaTeX Format∗Extended Abstract†
--SECTION-- G.K.M.
Tobin§Dublin, OhioSean Fogartywebmaster#marysville-ohio.comLars Thørväld¶The Thørväld GroupHekla, Icelandlarst#affiliation.orgCharles PalmerInstitute for Clarity in DocumentationInstitute for Clarity in DocumentationBen Trovato‡Dublin, Ohiotrovato#corporation.comLawrence P. LeipunerBrookhaven Laboratorieslleipuner#researchlabs.orgJohn SmithPalmer Research LaboratoriesSan Antonio, Texascpalmer#prl.comJulius P. KumquatNASA Ames Research CenterMoffett Field, Californiafogartys#amesres.orgThe Thørväld Groupjsmith#affiliation.orgThe Kumquat Consortiumjpkumquat#consortium.netcolumns), a specified set of fonts (Arial or Helvetica and TimesRoman) in certain specified sizes, a specified live area, centeredon the page, specified size of margins, specified column width andgutter size.
--SECTION-- ABSTRACT
This paper provides a sample of a LATEX document which conforms,somewhat loosely, to the formatting guidelines for ACM SIG Proceedings.1Unpublishedworkingdraft.
Notfordistribution.
--SECTION-- CCS CONCEPTS
• Computer systems organization → Embedded systems; Redundancy; Robotics; • Networks → Network reliability;
--SECTION-- KEYWORDS
ACM proceedings, LATEX, text taggingACM Reference Format:Ben Trovato, G.K.M.
Tobin, Lars Thørväld, Lawrence P. Leipuner, SeanFogarty, Charles Palmer, John Smith, and Julius P. Kumquat.
1997.
--SECTION-- SIG
Proceedings Paper in LaTeX Format: Extended Abstract.
In Proceedings ofACM Woodstock conference (WOODSTOCK’97).
ACM, New York, NY, USA,5 pages.
https://doi.org/10.475/123_4
--SECTION-- 2 THE BODY OF THE PAPER
Typically, the body of a paper is organized into a hierarchical structure, with numbered or unnumbered headings for sections, subsections, sub-subsections, and even smaller sections.
The command\section that precedes this paragraph is part of such a hierarchy.3LATEX handles the numbering and placement of these headings foryou, when you use the appropriate heading commands aroundthe titles of the headings.
If you want a sub-subsection or smallerpart to be unnumbered in your output, simply append an asteriskto the command name.
Examples of both numbered and unnumbered headings will appear throughout the balance of this sampledocument.
Because the entire article is contained in the document environment, you can indicate the start of a new paragraph with a blankline in your input file; that is why this sentence forms a separateparagraph.
--SECTION-- 1 INTRODUCTION
The rest of the output continues in the same vein for all five pages; I have cut it here to keep the question readable. The remaining lines my script flagged as sections were:

--SECTION-- ACM ISBN 123-4567-24-567/08/06.
--SECTION-- Ø
--SECTION-- Ψ
--SECTION-- 3 CONCLUSIONS
--SECTION-- A HEADINGS IN APPENDICES
--SECTION-- A.2.5
--SECTION-- ACM
--SECTION-- ACM 39, 11
--SECTION-- III.
--SECTION-- IV.
--SECTION-- B MORE HELP FOR THE HARDY
--SECTION-- ACKNOWLEDGMENTS
--SECTION-- REFERENCES
--SECTION-- [22] IEEE 2004.
--SECTION-- DC, USA, 21–22.
--SECTION-- J.
--SECTION-- [34] S.L.
--SECTION-- [39] TUG 2017.
This output is partially right: all sections are correctly detected, but I also get a lot of false positives. Can you think of a better (less false-positive-prone) way to implement this?
PS: if you need the PDF, it is available here; the filename is sample-sigconf-authordraft.pdf

To understand why you're getting false positives, you need to understand more about PDF.
PDF (unlike what you may think) is not a WYSIWYG format. In fact, you should think of PDF as a container of instructions. It has its own programming language that tells viewing (or rendering) software what to do.
Example:
go to position 500, 30
set the font to Arial
set the fontsize to 12
draw the glyph for character 'H'
go to position 500, 35
draw the glyphs for characters 'ello'
Some blocks of instructions are grouped together in what is called an object.
Each object gets its own number, and objects can reference each other.
Objects are indexed in the cross reference table (known as the XREF table).
Some of the challenges when doing text-extraction from PDF:
objects don't need to appear in logical reading order
sometimes whitespace is explicitly written, sometimes it is achieved by simply moving the drawing cursor
PDFs do not always have logical structure embedded in them. So whilst you may be able to clearly see a 2-column layout, from the viewpoint of the software this is less clear.
I have worked at iText, a software company that focuses solely on PDF-related technology, and extracting text the way you describe, from a random document, in a foolproof way is simply not possible with current software packages.
There are tons of exceptions that may break your workflow:
formulas appearing in text
tables
tables with row span and column span
images with text flowing around them
footnotes
headers
footers
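That said, since the PDF instructions carry a font name and size for every glyph, a practical way to cut down false positives in your specific case is to use pdfminer's layout objects instead of line.isupper(): section headings in ACM templates are set in a larger font than the body text. Below is a rough sketch of that idea, assuming body text is the most frequent font size in the document; the 1.2pt threshold is a guess you would have to tune per template:

from collections import Counter

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

def detect_headings(path):
    lines = []  # (dominant font size, text) in extraction order
    for page in extract_pages(path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [round(ch.size, 1) for ch in line if isinstance(ch, LTChar)]
                if sizes:
                    dominant = Counter(sizes).most_common(1)[0][0]
                    lines.append((dominant, line.get_text().strip()))
    # assume the most common font size in the document is the body text
    body_size = Counter(size for size, _ in lines).most_common(1)[0][0]
    for size, text in lines:
        if text and size > body_size + 1.2:  # threshold: tune per template
            print("--SECTION--", text)

It will still misfire on large-font non-headings (the paper title, for instance), but it is far less false-positive prone than uppercase detection, and you can combine both signals.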

Related

Python package to extract sentence from a textfile based on keyword

I need a Python package that can extract the relevant sentence from a text, based on keywords provided.
For example, below is an excerpt from the Wikipedia page of J. Robert Oppenheimer:
Early life
Childhood and education
J. Robert Oppenheimer was born in New York City on April 22, 1904,[note 1][7] to Julius Oppenheimer, a wealthy Jewish textile importer who had immigrated to the United States from Germany in 1888, and Ella Friedman, a painter.
Julius came to the United States with no money, no baccalaureate studies, and no knowledge of the English language. He got a job in a textile company and within a decade was an executive with the company. Ella was from Baltimore.[8] The Oppenheimers were non-observant Ashkenazi Jews.[9]
The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.
Oppenheimer later remarked that it brought to mind words from the Bhagavad Gita: "Now I am become Death, the destroyer of worlds."
If my passed string is - "JJ Oppenheimer birth date", it should return "J. Robert Oppenheimer was born in New York City on April 22, 1904"
If my passed string is - "JJ Openheimer Trinity test", it should return "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico"
I tried searching a lot, but nothing comes close to what I want, and I don't know much about NLP vectorization techniques. It would be great if someone could suggest a package if they know of one.
You could use fuzzywuzzy.
fuzz.ratio(search_text, sentence).
This gives you a score of how similar two strings are.
https://github.com/seatgeek/fuzzywuzzy
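A minimal sketch of that, assuming the text is already in a Python string and using NLTK to split it into sentences first (fuzz.token_set_ratio is my choice here rather than fuzz.ratio, because it ignores word order and extra words):

from fuzzywuzzy import fuzz
from nltk.tokenize import sent_tokenize  # requires nltk.download('punkt') once

def best_sentence(query, text):
    # score every sentence against the query and return the best one
    sentences = sent_tokenize(text)
    return max(sentences, key=lambda s: fuzz.token_set_ratio(query, s))

# wiki_text is a hypothetical variable holding the Wikipedia excerpt from the question
print(best_sentence("JJ Oppenheimer birth date", wiki_text))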
I am pretty sure a module exists that could do this for you, but you could also try to make it yourself by parsing through the text and creating lists of related words like ["date of birth", "born", "birth date", etc.], one list per field. This would let you find the information that is available.
The idea is:
you grab your text or whatever you have,
you grab what you are looking for (for example, date of birth),
you then map "date of birth" to a list of similar words,
you look through your file to see if you can find a sentence that contains one of them.
Maybe I am wrong and there really is a module for this, but something like the above should work.
The task you describe looks like Information Retrieval: given a query (the keywords), the model should return a list of documents (the sentences) that best match the query.
This is essentially what the answer using fuzzywuzzy suggests. But maybe just counting the number of occurrences of the query words in each sentence is enough (and more efficient).
The next step would be to use Tf-Idf. It is a weighting scheme that gives high scores to words that are specific to a document with respect to a set of documents (the corpus).
This results in every document having an associated vector; you can then sort the documents according to their similarity to the query vector.
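A sketch of that pipeline with scikit-learn, treating each sentence as a document (the default TfidfVectorizer settings are used here; stop-word removal and similar tuning are left out):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_sentences(query, sentences):
    vectorizer = TfidfVectorizer()
    doc_vectors = vectorizer.fit_transform(sentences)  # one row per sentence
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    # sentences sorted by descending similarity to the query
    return [s for _, s in sorted(zip(scores, sentences), reverse=True)]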

Working with span objects. [spaCy, python]

I'm not sure if this is a really dumb question, but here goes.
text_corpus = '''Insurance bosses plead guilty\n\nAnother three US insurance executives have pleaded guilty to fraud charges stemming from an ongoing investigation into industry malpractice.\n\nTwo executives from American International Group (AIG) and one from Marsh & McLennan were the latest. The investigation by New York attorney general Eliot Spitzer has now obtained nine guilty pleas. The highest ranking executive pleading guilty on Tuesday was former Marsh senior vice president Joshua Bewlay.\n\nHe admitted one felony count of scheming to defraud and faces up to four years in prison. A Marsh spokeswoman said Mr Bewlay was no longer with the company. Mr Spitzer\'s investigation of the US insurance industry looked at whether companies rigged bids and fixed prices. Last month Marsh agreed to pay $850m (£415m) to settle a lawsuit filed by Mr Spitzer, but under the settlement it "neither admits nor denies the allegations".\n'''
import spacy

def get_entities(document_text, model):
    analyzed_doc = model(document_text)
    # keep only the entity types we care about
    entities = [entity for entity in analyzed_doc.ents
                if entity.label_ in ["PER", "ORG", "LOC", "GPE"]]
    return entities

model = spacy.load("en_core_web_sm")
entities_1 = get_entities(text_corpus, model)
entities_2 = get_entities(text_corpus, model)
but when I run the following,
entities_1[0] in entities_2
The output is False.
Why is that? The objects in both entity lists are the same, yet an item from one list is not in the other. That's extremely odd. Can someone please explain why?
This is due to the way ents are represented in spaCy. They are classes with specific implementations, so even entities_2[0] == entities_1[0] will evaluate to False. By the looks of it, the Span class does not implement __eq__, which, at first glance at least, is the simple reason why.
If you print out the value of entities_2[0] it will give you US, but that is simply because the Span class has a __repr__ method implemented in the same file. If you want to do a boolean comparison, one way would be to use the text property of Span and do something like:
entities_1[0].text in [e.text for e in entities_2]
edit:
As @abb pointed out, Span implements __richcmp__; however, this applies to spans over the same Doc instance, since it compares the positions of the tokens themselves.
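So if you need membership tests across result lists, compare the attributes you care about instead of the Span objects themselves. A small sketch (pairing the text with the label is my choice; drop label_ if the surface text alone is enough for you):

def span_key(span):
    # what we treat as 'the same entity': surface text + entity label
    return (span.text, span.label_)

keys_2 = {span_key(e) for e in entities_2}
print(span_key(entities_1[0]) in keys_2)  # True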

Want to extract text from a text or pdf file as different paragraphs

Check the following text piece
IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
R/CRIMINAL APPEAL NO. 251 of 2009
FOR APPROVAL AND SIGNATURE:
HONOURABLE MR.JUSTICE R.P.DHOLARIA
==========================================================
1 Whether Reporters of Local Papers may be allowed to see the judgment ?
2 To be referred to the Reporter or not ?
3 Whether their Lordships wish to see the fair copy of the judgment ?
4 Whether this case involves a substantial question of law as to the interpretation of the Constitution of India or any order made thereunder ?
========================================================== STATE OF GUJARAT,S M RAO,FOOD INSPECTOR,OFFICE OF THE Versus DHARMESHBHAI NARHARIBHAI GANDHI ========================================================== Appearance: MS HB PUNANI, APP (2) for the Appellant(s) No. 1 MR DK MODI(1317) for the Opponent(s)/Respondent(s) No. 1 ==========================================================
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
Date : 12/03/2019
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal under section 378(1)
(3) of the Code of Criminal Procedure, 1973
against the judgment and order of acquittal dated
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned 2nd Additional
Civil Judge and Judicial Magistrate, First Class,
Nadiad in Food Case No.1 of 2007.
The short facts giving rise to the
present appeal are that on 10.11.2006 at about
18.00 hours, the complainant visited the place of
the respondent accused situated at Juna
Makhanpura, Rabarivad, Nadiad along with panch
witness and the respondent was found dealing in
provisional items. The complainant identified
himself as a Food Inspector and after giving
intimation in Form No.6 has purchased muddamal
sample of mustard seeds in the presence of the
panchas for the purpose of analysis. Thereafter,
the complainant Food Inspector has divided the
said sample in equal three parts and after
completing formalities of packing and sealing
obtained signatures of the vendor and panchas and
out of the said three parts, one part was sent to
the Public Analyst, Vadodara for analysis and
remaining two parts were sent to the Local Health
Authority, Gandhinagar. Thereafter, the Public
Analyst forwarded his report. In the said report,
it is stated that the muddamal sample of mustard
seeds is misbranded which is in breach of the
provisions of the Food Adulteration Act, 1954
(for short “the Act”) and the Rules framed
thereunder. It is alleged that, therefore, the
sample of mustard seeds was misbranded and,
thereby, the accused has committed the offence.
**Page 2 of 12
R/CR.A/251/2009* JUDGMENT*
Hence, the complaint came to be lodged against
the respondent accused.
I want to be able to write a program that follows the constraints below. Be aware that this is only a single file; I have around 40k files, and the program should run on all of them. The files differ here and there, but the basic format of every file is the same.
Constraints:
It should start the text extraction only after the "metadata". The metadata is the information about the case at the start of the file, i.e. from "IN THE HIGH COURT OF GUJARAT" down to "ORAL JUDGMENT". In all the files I have, there are various numbered POINTS after that string ends, and I need each of these points as a separate paragraph (the text above has 2 points; I need them as different paragraphs).
Check the lines in italics; these are the page markers in the text/pdf file. I need to remove these, as they do not add anything to the text content I want.
These files are available in both TEXT and PDF format, so I can use either. But I am new to Python, so I don't know how or where to start; I only have basic knowledge of Python.
This data is going to be made into a "corpus" for further processing when building a large expert system, so I hope you see what needs to be done.
Read the official Python docs!
Start with Python's basic str type and its methods. One of them, find, will locate substrings in your text.
Then use Python's slicing notation to extract the portion of text you need, e.g.
text = """YOUR TEXT HERE..."""
meta_start = 'In the high court of gujarat'
meta_end = 'ORAL JUDGMENT'
pos1 = text.find(meta_start)
pos2 = text.find(meta_end)
if pos2 > pos1 and pos1 > -1:
# text is found, extract it
text1 = text[meta_start + len(meta_start):meta_end - 1]
After that, you can go ahead and save your extracted text to a database.
Of course, a better and more complicated solution would be to use regular expressions, but that's another story -- try finding the right way for yourself!
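If you do go down the regex road, here is a rough sketch for your sample: cut everything up to ORAL JUDGMENT, drop the page markers, and split the body wherever a new numbered point starts. The patterns are guesses based on the one file shown above and will certainly need adjusting across your 40k files:

import re

def split_judgment(text):
    # everything up to and including 'ORAL JUDGMENT' is metadata
    _, _, body = text.partition('ORAL JUDGMENT')
    # strip page markers such as 'Page 1 of 12' and 'R/CR.A/251/2009 JUDGMENT'
    body = re.sub(r'Page \d+ of \d+|R/CR\.A/\d+/\d+\s+JUDGMENT', '', body)
    # a new paragraph starts where a line begins with a point number like '1.'
    points = re.split(r'\n\s*(?=\d+\.\s)', body)
    return [' '.join(p.split()) for p in points if p.strip()]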
As for italics and other text formatting, you won't ever be able to detect them in plain text (unless you have some 'meta' markers, like e.g. [i] tags).

Extracting parts of emails in text files

I am trying to do some text processing on a corpus that contains emails.
I have a main directory, under which I have various folders. Each folder has many .txt files. Each txt file is basically the email conversations.
To give an example of what my text files look like, I am copying a similar-looking text file of emails from the publicly available Enron email corpus. I have the same type of text data, with multiple emails in one text file.
An example text file can look like below:
Message-ID: <3490571.1075846143093.JavaMail.evans@thyme>
Date: Wed, 8 Sep 1999 08:50:00 -0700 (PDT)
From: steven.kean@enron.com
To: kelly.kimberly@enron.com
Subject: Re: India And The WTO Services Negotiation
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Steven J Kean
X-To: Kelly Kimberly
X-cc:
X-bcc:
X-Folder: \Steven_Kean_Dec2000_1\Notes Folders\All documents
X-Origin: KEAN-S
X-FileName: skean.nsf
fyi
---------------------- Forwarded by Steven J Kean/HOU/EES on 09/08/99 03:49
PM ---------------------------
Joe Hillings@ENRON
09/08/99 02:52 PM
To: Joe Hillings/Corp/Enron@Enron
cc: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H
Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES,
Jeffrey Sherrick/Corp/Enron@Enron
Subject: Re: India And The WTO Services Negotiation
Sanjay: Some information of possible interest to you. I attended a meeting
this afternoon of the Coalition of Service Industries, one of the lead groups
promoting a wide range of services including energy services in the upcoming
WTO GATTS 2000 negotiations. CSI President Bob Vastine was in Delhi last week
and met with CII to discuss the upcoming WTO. CII apparently has a committee
looking into the WTO. Bob says that he told them that energy services was
among the CSI recommendations and he recalls that CII said that they too have
an interest.
Since returning from the meeting I spoke with Kiran Pastricha and told her
the above. She actually arranged the meeting in Delhi. She asked that I send
her the packet of materials we distributed last week in Brussels and London.
One of her associates is leaving for India tomorrow and will take one of
these items to Delhi.
Joe
Joe Hillings
09/08/99 11:57 AM
To: Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT
cc: Terence H Thorn/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Ashok
Mehta/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, John
Ambler/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Steven J Kean/HOU/EES@EES,
Jeffrey Sherrick/Corp/Enron@Enron (bcc: Joe Hillings/Corp/Enron)
Subject: India And The WTO Services Negotiation
Sanjay: First some information and then a request for your advice and
involvment.
A group of US companies and associations formed the US WTO Energy Services
Coalition in late May and has asked the US Government to include "energy
services" on their proposed agenda when the first meeting of the WTO GATTS
2000 ministerial convenes late this year in Seattle. Ken Lay will be among
the CEO speakers. These negotiations are expected to last three years and
cover a range of subjects including agriculture, textiles, e-commerce,
investment, etc.
This morning I visited with Sudaker Rao at the Indian Embassy to tell him
about our coalition and to seek his advice on possible interest of the GOI.
After all, India is a leader in data processing matters and has other
companies including ONGC that must be interested in exporting energy
services. In fact probably Enron and other US companies may be engaging them
in India and possibly abroad.
Sudaker told me that the GOI has gone through various phases of opposing the
services round to saying only agriculture to now who knows what. He agrees
with the strategy of our US WTO Energy Services Coalition to work with
companies and associations in asking them to contact their government to ask
that energy services be on their list of agenda items. It would seem to me
that India has such an interest. Sudaker and I agree that you are a key
person to advise us and possibly to suggest to CII or others that they make
such a pitch to the GOI Minister of Commerce.
I will ask Lora to send you the packet of materials Chris Long and I
distributed in Brussels and London last week. I gave these materials to
Sudaker today.
Everyone tells us that we need some developing countries with an interest in
this issue. They may not know what we are doing and that they are likely to
have an opportunity if energy services are ultimately negotiated.
Please review and advise us how we should proceed. We do need to get
something done in October.
Joe
PS Terry Thorn is moderating a panel on energy services at the upcoming World
Services Congress in Atlanta. The Congress will cover many services issues. I
have noted in their materials that Mr. Alliwalia is among the speakers but
not on energy services. They expect people from all over the world to
participate.
So, as you can see, there can be multiple emails in one text file, with no clear separation rule except the new email headers (To, From, etc.).
I can do os.walk on the main directory; it would then go through each subdirectory, parse each text file in it, and repeat for the other subdirectories, and so on.
I need to extract certain parts of each email within a text file and store them as a new row in a dataset (csv, pandas dataframe, etc.).
These are the parts that would be helpful to extract and store as the columns of a row, so that each row of the dataset is one email from one text file.
Fields:
Original Email content | From (Sender) | To (Recipient) | cc (Recipient) | Date/Time Sent | Subject of Email
Edit: I looked at the suggested duplicate question. It assumes a fixed spec and boundary, which is not the case here. I am looking for a simple regular-expression way of extracting the different fields mentioned above.
^Date:\ (?P<date>.+?$)
.+?
^From:\ (?P<sender>.+?$)
.+?
^To:\ (?P<to>.+?$)
.+?
^cc:\ (?P<cc>.+?$)
.+?
^Subject:\ (?P<subject>.+?$)
Make sure you're using dotall, multiline, and extended modes on your regex engine.
For the example you posted it works, at least: it captures everything in the different groups (you may need to enable those modes on the regex engine as well, depending on which one it is):
Group `date` 63-99 `Wed, 8 Sep 1999 08:50:00 -0700 (PDT)`
Group `sender` 106-127 `steven.kean@enron.com`
Group `to` 132-156 `kelly.kimberly@enron.com`
Group `cc` 650-714 `Sanjay Bhatnagar/ENRON_DEVELOPMENT@ENRON_DEVELOPMENT, Terence H `
Group `subject` 930-974 `Re: India And The WTO Services Negotiation `
https://regex101.com/r/gHUOLi/1
And use it to iterate over your stream of text, you mention python so there you go:
import re

def match_email(long_string):
    # verbose (re.X) pattern: unescaped whitespace inside it is ignored
    regex = r'''^Date:\ (?P<date>.+?$)
                .+?
                ^From:\ (?P<sender>.+?$)
                .+?
                ^To:\ (?P<to>.+?$)
                .+?
                ^cc:\ (?P<cc>.+?$)
                .+?
                ^Subject:\ (?P<subject>.+?$)'''
    # dotall, multiline and verbose modes, as mentioned above
    match = re.search(regex, long_string.strip(), re.S | re.M | re.I | re.X)
    # if there is no match it's over
    if match is None:
        return None, long_string
    # otherwise, get the named groups
    email = match.groupdict()
    # remove whatever matched from the original string
    long_string = long_string.strip()[match.end():]
    # return the email, and the remaining string
    return email, long_string

# now iterate over the long string
emails = []
email, tail = match_email(the_long_string)
while email is not None:
    emails.append(email)
    email, tail = match_email(tail)
print(emails)
That's adapted directly from here, with just some names changed.

Is there a simpler way to parse the xml file into a nested array?

Given an input file, e.g.
<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
<srcset>
The desired result is a nested dictionary that stores:
/setid
/docid
/segid
text
I've been using a defaultdict and reading the xml file with BeautifulSoup and nested loops, i.e.
from io import StringIO
from collections import defaultdict
from bs4 import BeautifulSoup
srcfile = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
<srcset>"""
eval_docs = defaultdict(lambda: defaultdict(dict))
with StringIO(srcfile) as fin:
    bsoup = BeautifulSoup(fin.read(), 'html5lib')
    setid = bsoup.find('srcset')['setid']
    for doc in bsoup.find_all('doc'):
        docid = doc['docid']
        for seg in doc.find_all('seg'):
            segid = seg['id']
            eval_docs[setid][docid][segid] = seg.text
[out]:
>>> eval_docs
defaultdict(<function __main__.<lambda>>,
{'newstest2015': defaultdict(dict,
{'1012-bbc': {'1': 'India and Japan prime ministers meet in Tokyo',
'2': "India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.",
'3': 'Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.',
'4': 'High on the agenda are plans for greater nuclear co-operation.',
'5': 'India is also reportedly hoping for a deal on defence collaboration between the two nations.'},
'1018-lenta.ru': {'1': 'FANO Russia will hold a final Expert Session',
'2': 'The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.',
'3': 'The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.',
'4': 'At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.'}})})
Is there a simpler way to read the file and get the same eval_docs nested dictionary?
Can it be done easily without using BeautifulSoup?
Note that in the example, there's only one setid and one docid but the actual file has more than one of those.
Since what you have is HTML with the appearance of XML, you can't go for XML-based tools. In most cases your options are:
Implement SAX parser
use BS4 (which you are already doing)
Use lxml
In any case you will end up spending more time and effort and writing more code to handle this. What you have is really sleek and easy. I wouldn't look for another solution if I were you.
PS: what could be simpler than ten lines of code!
I don't know if you'll find this simpler, but here's an alternative, using lxml as others have suggested.
Step 1: Convert the XML data into a normalized table (a list of lists)
from lxml import etree

tree = etree.parse('source.xml')
segs = tree.xpath('//seg')
normalized_list = []
for seg in segs:
    # walk up seg -> p -> doc -> srcset to collect the ids
    srcset = seg.getparent().getparent().getparent().attrib['setid']
    doc = seg.getparent().getparent().attrib['docid']
    normalized_list.append([srcset, doc, seg.attrib['id'], seg.text])
Step 2: Use defaultdict like you did in your original code
from collections import defaultdict

d = defaultdict(lambda: defaultdict(dict))
for i in normalized_list:
    d[i[0]][i[1]][i[2]] = i[3]
Depending on how you're keeping the source file, you'll have to use one of these methods to parse XML:
tree = etree.parse('source.xml'): when you want to parse a file directly - you won't need StringIO. File is closed automatically by etree.
tree = etree.fromstring(source): where source is a string object, like in your question.
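If you'd rather avoid third-party packages altogether, the same nesting can be built with the standard library's xml.etree.ElementTree. Note this sketch assumes the input is well-formed XML, i.e. the stray <srcset> at the end of the sample would have to be the proper closing tag </srcset>:

import xml.etree.ElementTree as ET
from collections import defaultdict

tree = ET.parse('source.xml')
root = tree.getroot()  # the srcset element

eval_docs = defaultdict(lambda: defaultdict(dict))
setid = root.attrib['setid']
for doc in root.iter('doc'):
    docid = doc.attrib['docid']
    for seg in doc.iter('seg'):
        eval_docs[setid][docid][seg.attrib['id']] = seg.text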
