Python: read part of a PDF page

I'm trying to read a pdf file where each page is divided into 3x3 blocks of information of the form
A | B | C
D | E | F
G | H | I
Each of the entries is broken into multiple lines. A simplified example of one entry is this card. But then there would be similar entries in the other 8 slots.
I've looked at pdfminer and PyPDF2. I haven't found pdfminer overly useful, but PyPDF2 has given me something close.
import PyPDF2

def getPDFContent(path):
    content = ""
    with open(path, "rb") as p:  # file() is Python 2 only; open() works everywhere
        pdf = PyPDF2.PdfFileReader(p)
        numPages = pdf.getNumPages()
        for i in range(numPages):
            content += pdf.getPage(i).extractText() + "\n"
    # collapse runs of whitespace (and non-breaking spaces) into single spaces
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
However, this only reads the file line by line. I'd like a solution where I can read only a portion of the page, so that I could read A, then B, then C, and so on. Also, the answer here works fairly well, but the order of the columns routinely gets distorted and I've only gotten it to read line by line.

I assume the PDF files in question are generated PDFs rather than scanned (as in the example you gave), given that you're using pdfminer and PyPDF2. If you know the size of the columns and rows in inches, you can use minecart (full disclosure: I wrote minecart). Example code:
import minecart

# minecart units are 1/72 inch, measured from the bottom-left of the page
ROW_BORDERS = (
    72 * 7,  # Top row ends 7 inches from the bottom of the page
    72 * 5,  # Top row starts 5 inches from the bottom of the page
    72 * 3,  # Second row starts 3 inches from the bottom of the page
    72 * 1,  # Bottom row starts 1 inch from the bottom of the page
)  # listed top-down so that BOXES comes out ordered A, B, C, ...
COLUMN_BORDERS = (
    72 * 2,  # First col starts 2 inches from the left of the page
    72 * 4,  # Second col starts 4 inches from the left of the page
    72 * 6,  # Third col starts 6 inches from the left of the page
    72 * 8,  # Third col ends 8 inches from the left of the page
)
BOXES = [
    (left, bot, right, top)
    for top, bot in zip(ROW_BORDERS, ROW_BORDERS[1:])
    for left, right in zip(COLUMN_BORDERS, COLUMN_BORDERS[1:])
]

def extract_output(page):
    """
    Reads the text from page and splits it into the 9 cells.

    Returns a list with 9 entries:

        [A, B, C, D, E, F, G, H, I]

    Each item in the list is a string with all of the
    text found in that cell.
    """
    res = []
    for box in BOXES:
        strings = list(page.letterings.iter_in_bbox(box))
        # Sort from top to bottom and then from left to right, based
        # on each string's top-left corner
        strings.sort(key=lambda x: (-x.bbox[3], x.bbox[0]))
        res.append(" ".join(strings).replace(u"\xa0", " ").strip())
    return res

content = []
doc = minecart.Document(open("path/to/pdf-doc.pdf", "rb"))
for page in doc.iter_pages():
    content.append(extract_output(page))
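If minecart isn't an option, the same bounding-box idea can be reproduced with pdfminer.six's layout analysis. Here is a minimal sketch reusing the BOXES list from above; assigning a whole text block to a cell by its center point is a simplification, so blocks that straddle a cell border may need per-line or per-character filtering:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

def extract_cells(path, boxes):
    """Collect the text of each text block whose center falls inside a cell box."""
    pages = []
    for page_layout in extract_pages(path):
        cells = ["" for _ in boxes]
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue
            x0, y0, x1, y1 = element.bbox  # same 1/72-inch units as above
            cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # center of the text block
            for i, (left, bot, right, top) in enumerate(boxes):
                if left <= cx <= right and bot <= cy <= top:
                    cells[i] += element.get_text()
                    break
        pages.append(cells)
    return pages

content = extract_cells("path/to/pdf-doc.pdf", BOXES)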


Code optimisation for extracting cloud points from AutoCAD DXF Python

I'm working on processing Lidar data with Python. The test data has about 150 000 data points, but the actual data will contain hundreds of millions. Initially it was exported as a .dwg file; however, since I couldn't find a way to process it, I decided to convert it to *.dxf and work from there. Then I'm trying to extract the point coordinates and layer and save them as a *.csv file for further processing. Here is the code:
import pandas as pd

PointCloud = pd.DataFrame(columns=['X', 'Y', 'Z', 'Layer'])
filename = "template"

# Using readlines()
with open(filename + ".dxf", "r") as f2:
    input = list(f2.readlines())

# Strip the data down to the datapoints only, to speed things up (see the .dxf documentation)
i = input.index('ENTITIES\n')      # find the beginning of the ENTITIES section
length = input.index('OBJECTS\n')  # find the beginning of the OBJECTS section (the end of ENTITIES)
while i < length:
    line = input[i]
    if i % 1000 == 0:
        print("Completed: " + str(round(i / length * 100, 2)) + "%")
    if line.startswith("AcDbPoi"):
        x = float(input[i + 2].strip())
        y = float(input[i + 4].strip())
        z = float(input[i + 6].strip())
        layer = input[i - 2].strip()  # strips the newline character
        point = {'X': x, 'Y': y, 'Z': z, 'Layer': layer}  # built but never used
        PointCloud.loc[PointCloud.shape[0]] = [x, y, z, layer]
        i += 14
    else:
        i += 1
PointCloud.to_csv(filename + '.csv', sep='\t', encoding='utf-8')
While it works, going line by line is not the most efficient way, hence I'm trying to find ways to optimize it. Here is the *.dxf point structure that I'm interested in extracting:
AcDbEntity
8
SU-SU-Point cloud-Z
100
AcDbPoint
10
4.0973
20
2.1156
30
-0.6154000000000001
0
POINT
5
3130F
330
2F8CD
100
AcDbEntity
Here 10, 20, and 30 are the group codes for the X, Y, and Z coordinates, and 8 is the layer. Any ideas on how to improve it would be greatly appreciated.
The slowest part is file IO, and I don't think that can be sped up much.
But the code can be made more memory efficient by really reading the (very large) DXF file line by line. It can also be made more robust by parsing only the absolute minimum of data from the POINT entities; this way the function can handle newer DXF versions and also DXF R12 and older.
import sys
from dataclasses import dataclass

@dataclass
class Point:
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    layer: str = ""

def load_points(filename: str):
    def read_tag():
        """Read the next DXF tag (group code, value)."""
        code = fp.readline()
        if code == "":
            raise EOFError()
        value = fp.readline().strip()
        return int(code), value

    def next_entity():
        """Collect entity tags, starting with the first group code 0 tag
        like (0, POINT).
        """
        tags = []
        while True:
            code, value = read_tag()
            if code == 0:
                if tags:
                    yield tags
                tags = [(code, value)]
            else:
                if tags:  # skip everything in front of the first entity
                    tags.append((code, value))

    def parse_point(tags):
        """Parse the DXF POINT entity."""
        point = Point()
        # The order of the DXF tags can differ from application to application.
        for code, value in tags:
            if code == 10:  # x-coordinate
                point.x = float(value)
            elif code == 20:  # y-coordinate
                point.y = float(value)
            elif code == 30:  # z-coordinate
                point.z = float(value)
            elif code == 8:  # layer name
                point.layer = value
        return point

    # DXF R2007 and later always use utf8 encoding; older DXF versions use
    # the encoding stored in the HEADER section. If only ASCII characters
    # are used for the layer names, the encoding can be ignored.
    fp = open(filename, mode="rt", encoding="utf8", errors="ignore")
    try:
        # find the ENTITIES section
        while read_tag() != (2, "ENTITIES"):
            pass
        # iterate over all DXF entities until tag (0, ENDSEC) appears
        for tags in next_entity():
            if tags[0] == (0, "POINT"):
                yield parse_point(tags)
            elif tags[0] == (0, "ENDSEC"):
                return
    except EOFError:
        pass
    finally:
        fp.close()

def main(files):
    for file in files:
        print(f"loading: {file}")
        csv = file.replace(".dxf", ".csv")
        with open(csv, "wt", encoding="utf8") as fp:
            fp.write("X, Y, Z, LAYER\n")
            for point in load_points(file):
                print(point)
                fp.write(f'{point.x}, {point.y}, {point.z}, "{point.layer}"\n')

if __name__ == "__main__":
    main(sys.argv[1:])
FYI: This is the simplest valid DXF R12 file containing only POINT entities:
0
SECTION
2
ENTITIES
0
POINT
8
layer name
10
1.0
20
2.0
30
3.0
0
ENDSEC
0
EOF
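As an aside, if the whole file fits comfortably in memory, the ezdxf package can do the same extraction in a few lines; the hand-rolled parser above is the better fit for very large files. A sketch, assuming ezdxf is installed and the question's "template" file name:

import ezdxf

# Loads the whole document into memory, so this suits small and
# medium files; the streaming parser above is better for huge ones.
doc = ezdxf.readfile("template.dxf")
msp = doc.modelspace()
with open("template.csv", "wt", encoding="utf8") as fp:
    fp.write("X, Y, Z, LAYER\n")
    for point in msp.query("POINT"):
        x, y, z = point.dxf.location
        fp.write(f'{x}, {y}, {z}, "{point.dxf.layer}"\n')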

Reading from more than one line after keyword?

I have an output file which prints out a matrix of numeric data. I need to search through this file for the identifier at the start of each data set, which is:
GROUP 1 FIRST 1 LAST 163
Here GROUP 1 is the first column of the matrix, FIRST 1 is the first non-zero element of this matrix in position 1, and LAST 163 is the last non-zero element of the matrix in position 163. The matrix doesn't necessarily end at this LAST value - in this case there are 172 values.
I want to read this data into a simpler form to work with. Here is an example of the first two column results:
GROUP 1 FIRST 1 LAST 163
7.150814E-02 9.866657E-03 8.500540E-04 1.818338E-03 2.410691E-03 3.284499E-03 3.011986E-03 1.612432E-03
1.674247E-03 3.436244E-03 3.655873E-03 4.056876E-03 4.560725E-03 2.462454E-03 2.567764E-03 5.359393E-03
5.457415E-03 2.679373E-03 2.600020E-03 2.491592E-03 2.365089E-03 2.228494E-03 5.792616E-03 1.623274E-03
1.475062E-03 1.331820E-03 1.195052E-03 2.832699E-03 7.298341E-04 6.301271E-04 1.377459E-03 1.048925E-03
1.677453E-04 3.580640E-04 1.575301E-04 1.150545E-04 1.197719E-04 2.950028E-05 5.380539E-05 1.228784E-05
1.627659E-05 4.522051E-05 7.736908E-06 1.758838E-05 8.161204E-06 6.103670E-06 6.431876E-06 1.585671E-06
4.110246E-06 4.512924E-07 2.775227E-06 5.107739E-07 1.219448E-06 1.653674E-07 4.429047E-07 4.837661E-07
2.036820E-07 3.449548E-07 1.457648E-07 4.494116E-07 1.629392E-07 1.300509E-07 1.730199E-07 8.130338E-08
1.591993E-08 5.457638E-08 1.713141E-08 7.806754E-09 1.154869E-08 3.545961E-09 2.862203E-09 2.289470E-09
4.324002E-09 2.243199E-09 2.627165E-09 2.273119E-09 1.973867E-09 1.710714E-09 1.468845E-09 1.772236E-09
1.764492E-09 1.004393E-09 1.044698E-09 5.201382E-10 2.660613E-10 3.012732E-10 2.630323E-10 4.381052E-10
2.521794E-10 9.213524E-11 2.619283E-10 3.591906E-11 1.449830E-10 1.867363E-11 1.230445E-10 1.108149E-11
2.775004E-11 1.156249E-11 4.393752E-11 5.318751E-11 6.815569E-12 1.817489E-11 2.044674E-11 2.044673E-11
1.931080E-11 1.931076E-11 1.817484E-11 2.044668E-11 5.486837E-12 7.681572E-12 1.536314E-11 7.132886E-12
8.230253E-12 1.426577E-11 1.426577E-11 4.389468E-12 5.925780E-12 2.853153E-12 2.853153E-12 5.706307E-12
5.706307E-12 2.194733E-12 3.292099E-12 5.267358E-12 2.194733E-12 3.072626E-12 4.828412E-12 4.389466E-12
4.389465E-12 1.097366E-11 2.194732E-12 1.316839E-11 2.194732E-12 1.608784E-11 1.674222E-11 1.778860E-11
6.993074E-12 2.622402E-12 9.090994E-12 5.769285E-12 1.573441E-12 6.861030E-12 4.782885E-12 8.768619E-13
2.311727E-12 3.188589E-12 4.393636E-12 3.844430E-12 4.256331E-12 1.235709E-12 2.746020E-12 2.746020E-12
8.238059E-13 2.608719E-12 1.445203E-12 4.817344E-13 1.445203E-12 7.609642E-14 2.536547E-13 2.000924E-13
7.075681E-14 7.075681E-14 3.056704E-14
GROUP 2 FIRST 2 LAST 168
6.740271E-02 8.310813E-03 3.609403E-03 1.307012E-03 2.949375E-03 3.605043E-03 1.612647E-03 1.640960E-03
3.597806E-03 4.022993E-03 4.289805E-03 4.480576E-03 2.352539E-03 2.415121E-03 5.018262E-03 5.188098E-03
2.589224E-03 2.546116E-03 2.472462E-03 2.374431E-03 2.260519E-03 5.981164E-03 1.700972E-03 1.556116E-03
1.410140E-03 1.273499E-03 3.061941E-03 7.995844E-04 6.967963E-04 1.553994E-03 1.216266E-03 1.997540E-04
4.426460E-04 1.990445E-04 1.470610E-04 1.539762E-04 3.814900E-05 7.024764E-05 1.611156E-05 2.136422E-05
5.984886E-05 1.035646E-05 2.363444E-05 1.105747E-05 8.308678E-06 8.789299E-06 2.257693E-06 5.807418E-06
6.248625E-07 3.822327E-06 6.987942E-07 1.660586E-06 2.240283E-07 5.983062E-07 6.513773E-07 2.735403E-07
4.614998E-07 1.940877E-07 5.895136E-07 2.081549E-07 1.662117E-07 2.316650E-07 1.101916E-07 2.162701E-08
7.493990E-08 2.341661E-08 1.072330E-08 1.606536E-08 4.945307E-09 3.936301E-09 3.147244E-09 5.945972E-09
3.108514E-09 3.682241E-09 3.210760E-09 2.795020E-09 2.436545E-09 2.118219E-09 2.612622E-09 2.586657E-09
1.432507E-09 1.457386E-09 7.264341E-10 3.803348E-10 4.514677E-10 3.959518E-10 6.541553E-10 3.707172E-10
1.334816E-10 3.875547E-10 5.294296E-11 2.294557E-10 2.790137E-11 1.719152E-10 1.408339E-11 3.526731E-11
1.469469E-11 5.583990E-11 6.759567E-11 8.766360E-12 2.337697E-11 2.629908E-11 2.629908E-11 2.483802E-11
2.483802E-11 2.337697E-11 2.629908E-11 7.112706E-12 9.957791E-12 1.991557E-11 9.246516E-12 1.066906E-11
1.849303E-11 1.849303E-11 5.690165E-12 7.681722E-12 3.698607E-12 3.698607E-12 7.397214E-12 7.397214E-12
2.845082E-12 4.267624E-12 6.828199E-12 2.845082E-12 3.983115E-12 6.259180E-12 5.690165E-12 5.690165E-12
1.422541E-11 2.845082E-12 1.707049E-11 2.845082E-12 2.095991E-11 2.193285E-11 2.330364E-11 1.096642E-11
4.112407E-12 1.425635E-11 8.906802E-12 2.429128E-12 1.106603E-11 8.097092E-12 1.484468E-12 3.913596E-12
5.398063E-12 8.624785E-12 7.546689E-12 8.355261E-12 2.425721E-12 5.390492E-12 5.390492E-12 1.617147E-12
5.120967E-12 2.710198E-12 9.033993E-13 2.710198E-12 3.744092E-13 1.248030E-12 6.614939E-13 4.359798E-13
4.359798E-13 1.364861E-13 4.856661E-15 4.856661E-15 4.856661E-15 4.856661E-15 4.856661E-15
What I have at the moment works, except it only reads in the first line after the GROUP keyword line. How can I make it continue reading the data in until it reaches the next GROUP keyword?
import re
import io

file_name = "test_data.txt"

group_pattern = re.compile(r"GROUP +\d+ FIRST +(?P<first>\d+) LAST +(?P<last>\d+)")

def read_data_from_file(file_name, start_identifier, end_identifier):
    results = []
    longest = 0
    with open(file_name) as file:
        t = file.read()
    t = t[t.find('MACRO'):]
    t = t[t.find(start_identifier) + len(start_identifier):t.find(end_identifier)]
    t = io.StringIO(t)
    for line in t:
        match = group_pattern.search(line)
        if match:
            first = int(match.group('first'))
            last = int(match.group('last'))
            data = [float(value) for value in next(t).split()]
            row = [0.0] * last
            for i, value in enumerate(data, start=first - 1):
                row[i] = value
            longest = max(longest, len(row))
            results.append(row)
    for row in results:
        if len(row) < longest:
            row.extend([0.0] * (longest - len(row)))
    return results

start_identifier = "SCATTER MOMENT 1"
end_identifier = "SCATTER MOMENT 2"
results = read_data_from_file(file_name, start_identifier, end_identifier)
print(results)
What I want the code to produce is a matrix with just the numerical data. In this case it would be size [2x168] but my full data set is [172x172]. I want every GROUP to be read in as a row of the matrix, and zeroes filled into every element not specified in the output data. The current code does almost all of this, except that it only reads the first line of data after the GROUP keyword line.
So I took a look at the data you provided in your question. I found what I think is a better and simpler way of pulling those data points out of that file. However, I noticed that you have some other code that's looking for other things in the file as well, and those weren't in the test data you posted. So you may have to adapt this a little to work with your dataset.
def read_data_from_file(file_name):
    with open(file_name) as fp:
        index = -1
        matrices = []
        # Iterate over the file line by line; this keeps memory usage down
        for line in fp:
            # Headers are always on their own line, and data lines always begin
            # with two spaces, so any line that doesn't start with two spaces
            # is a header line: add a new list to matrices and bump the index
            if not line.startswith('  '):
                index += 1
                matrices.append([])
            else:
                # Slice off the first two spaces, then split on double spaces
                # to get each data point
                str_data_points = line[2:].split('  ')
                # Convert the string data points to floats
                float_data_points = map(float, str_data_points)
                # Add those float data points to the current matrix
                matrices[index].extend(float_data_points)
    # Pad every matrix with zeros to the length of the longest one
    max_matrix_length = max(map(len, matrices))
    for matrix in matrices:
        matrix.extend([0.0] * (max_matrix_length - len(matrix)))
    return matrices
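A quick usage check for the function above (a sketch; it assumes the two-group sample from the question is saved as data.txt with its original spacing):

matrices = read_data_from_file("data.txt")
print(len(matrices))     # 2 -- one list per GROUP header
print(len(matrices[0]))  # rows are padded to the longest group (167 here)

Note that, unlike the question's version, this pads only at the end; it does not shift each row right by its FIRST offset, so if that offset matters, the enumerate(data, start=first-1) logic from the question still needs to be merged in.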
Here's my solution to read the data from the .txt file and produce a matrix-like output (0.0-padded at the end of each group):
import re

def read_data_from_file(file_path):
    GROUP_DATA = []
    MAX_ELEMENT_COUNT = 0
    with open(file_path) as f:
        for line in f.readlines():
            if 'GROUP' in line:
                # a new group starts: the last number on the header line is LAST
                GROUP_DATA.append([])
                MAX_ELEMENT_COUNT = max(MAX_ELEMENT_COUNT, int(re.findall(r'\d+', line)[-1]))
            else:
                values = line.split(' ')
                for value in values:
                    try:
                        GROUP_DATA[-1].append(float(value))
                    except ValueError:
                        pass  # skip empty strings produced by repeated spaces
    for DATA in GROUP_DATA:
        if len(DATA) < MAX_ELEMENT_COUNT:
            DATA += [0.0] * (MAX_ELEMENT_COUNT - len(DATA))
    return GROUP_DATA
For the data in the given question saved into data.txt, the output would be as follows:
>>> import numpy as np   # just to check the output shape
>>> mat = read_data_from_file('data.txt')
>>> np.shape(mat)
(2, 168)                 # output shape as expected
The output matrix's size adapts to the given data.

Python: how to read a LaTeX-generated PDF with equations

Consider the following article
https://arxiv.org/pdf/2101.05907.pdf
It's a typically formatted academic paper with only two pictures in the PDF file.
The following code was used to extract the text and equation from the paper
# Related code explanation: https://stackoverflow.com/questions/45470964/python-extracting-text-from-webpage-pdf
import io
import requests

url = "https://arxiv.org/pdf/2101.05907.pdf"
r = requests.get(url)
f = io.BytesIO(r.content)

# Related code explanation: https://stackoverflow.com/questions/45795089/how-can-i-read-pdf-in-python
import PyPDF2
fileReader = PyPDF2.PdfFileReader(f)

# Related code explanation: https://automatetheboringstuff.com/chapter13/
print(fileReader.getPage(0).extractText())
However, the result was not quite correct
Bohmpotentialforthetimedependentharmonicoscillator
FranciscoSoto-Eguibar
1
,FelipeA.Asenjo
2
,SergioA.Hojman
3
andH
´
ectorM.
Moya-Cessa
1
1
InstitutoNacionaldeAstrof´
´
OpticayElectr´onica,CalleLuisEnriqueErroNo.1,SantaMar´Tonanzintla,
Puebla,72840,Mexico.
2
FacultaddeIngenier´yCiencias,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
3
DepartamentodeCiencias,FacultaddeArtesLiberales,UniversidadAdolfoIb´aŸnez,Santiago7491169,Chile.
DepartamentodeF´FacultaddeCiencias,UniversidaddeChile,Santiago7800003,Chile.
CentrodeRecursosEducativosAvanzados,CREA,Santiago7500018,Chile.
Abstract.
IntheMadelung-Bohmapproachtoquantummechanics,weconsidera(timedependent)phasethatdependsquadrati-
callyonpositionandshowthatitleadstoaBohmpotentialthatcorrespondstoatimedependentharmonicoscillator,providedthe
timedependentterminthephaseobeysanErmakovequation.
Introduction
Harmonicoscillatorsarethebuildingblocksinseveralbranchesofphysics,fromclassicalmechanicstoquantum
mechanicalsystems.Inparticular,forquantummechanicalsystems,wavefunctionshavebeenreconstructedasisthe
caseforquantizedincavities[1]andforion-laserinteractions[2].Extensionsfromsingleharmonicoscillators
totimedependentharmonicoscillatorsmaybefoundinshortcutstoadiabaticity[3],quantizedpropagatingin
dielectricmedia[4],Casimire
ect[5]andion-laserinteractions[6],wherethetimedependenceisnecessaryinorder
totraptheion.
Timedependentharmonicoscillatorshavebeenextensivelystudiedandseveralinvariantshavebeenobtained[7,8,9,
10,11].Alsoalgebraicmethodstoobtaintheevolutionoperatorhavebeenshown[12].Theyhavebeensolvedunder
variousscenariossuchastimedependentmass[12,13,14],timedependentfrequency[15,11]andapplicationsof
invariantmethodshavebeenstudiedindi
erentregimes[16].Suchinvariantsmaybeusedtocontrolquantumnoise
[17]andtostudythepropagationoflightinwaveguidearrays[18,19].Harmonicoscillatorsmaybeusedinmore
generalsystemssuchaswaveguidearrays[20,21,22].
Inthiscontribution,weuseanoperatorapproachtosolvetheone-dimensionalSchr
¨
odingerequationintheBohm-
Madelungformalismofquantummechanics.ThisformalismhasbeenusedtosolvetheSchr
¨
odingerequationfor
di
erentsystemsbytakingtheadvantageoftheirnon-vanishingBohmpotentials[23,24,25,26].Alongthiswork,
weshowthatatimedependentharmonicoscillatormaybeobtainedbychoosingapositiondependentquadratictime
dependentphaseandaGaussianamplitudeforthewavefunction.Wesolvetheprobabilityequationbyusingoperator
techniques.Asanexamplewegivearationalfunctionoftimeforthetimedependentfrequencyandshowthatthe
Bohmpotentialhasdi
erentbehaviorforthatfunctionalitybecauseanauxiliaryfunctionneededinthescheme,
namelythefunctionsthatsolvestheErmakovequation,presentstwodi
erentsolutions.
One-dimensionalMadelung-Bohmapproach
ThemainequationinquantummechanicsistheSchrodingerequation,thatinonedimensionandforapotential
V ( x ; t ) iswrittenas(forsimplicity,weset } = 1)
i # ( x ; t ) # t = 1 2 m # 2 ( x ; t ) # x 2 + V ( x ; t ) ( x ; t ) (1)
arXiv:2101.05907v1 [quant-ph] 14 Jan 2021
As shown:
The spacing, such as in the title, disappeared, resulting in meaningless strings.
The LaTeX equations came out wrong, and it gets worse on the second page.
How can I fix this and extract text and equations correctly from a PDF file that was generated from LaTeX?
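One option worth knowing: pdfminer.six usually recovers word spacing much better than PyPDF2's old extractText. Recovering the actual LaTeX source of the equations, though, is generally not possible from a compiled PDF, because the glyph stream no longer carries the markup; tools can at best give you the rendered symbols. A minimal sketch, assuming pdfminer.six is installed:

import io
import requests
from pdfminer.high_level import extract_text

url = "https://arxiv.org/pdf/2101.05907.pdf"
r = requests.get(url)
# extract_text accepts a path or a binary file-like object;
# page_numbers is zero-based, so [0] is the first page
text = extract_text(io.BytesIO(r.content), page_numbers=[0])
print(text)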

Run a search over the items of a list, and save each result into a file

I have a data.dat file that has 3 columns; the 3rd column is just the numbers 1 to 6 repeated again and again.
(In reality, column 3 has numbers from 1 to 1917, but for a minimal working example, let's stick to 1 to 6.)
# Title
127.26 134.85 1
127.26 135.76 2
127.26 135.76 3
127.26 160.97 4
127.26 160.97 5
127.26 201.49 6
125.88 132.67 1
125.88 140.07 2
125.88 140.07 3
125.88 165.05 4
125.88 165.05 5
125.88 203.06 6
137.20 140.97 1
137.20 140.97 2
137.20 148.21 3
137.20 155.37 4
137.20 155.37 5
137.20 184.07 6
I would like to:
1) extract the lines that contain 1 in the 3rd column and save them to a file called mode_1.dat.
2) extract the lines that contain 2 in the 3rd column and save them to a file called mode_2.dat.
3) extract the lines that contain 3 in the 3rd column and save them to a file called mode_3.dat.
.
.
.
6) extract the lines that contain 6 in the 3rd column and save them to a file called mode_6.dat.
In order to accomplish this, I have:
a) defined a variable factor = 6
b) created a one_to_factor list that has numbers 1 to 6
c) used a re.search statement to extract the lines for each value of one_to_factor (the %s is the i from the one_to_factor list)
d) appended these results to an empty LINES list.
However, this does not work. I cannot manage to extract the lines that contain i in the 3rd column and save them to a file called mode_i.dat.
I would appreciate it if you could help me.
import re

factor = 6
one_to_factor = range(1, factor + 1)
LINES = []
f_2 = open('data.dat', 'r')
for line in f_2:
    for i in one_to_factor:
        if re.search(r' \b%s$' % i, line):
            print('line = ', line)
            LINES.append(line)
print('LINES =', LINES)
I would do it like this:
no regexes, just use str.split() to split on whitespace
use the last item (the digit) of the current line to generate the filename
use a dictionary to open each file the first time its name comes up, and reuse the handle for subsequent matches (write the title line when the file is opened)
close all handles at the end
code:
title_line = "# Vol \t Freq \t Mod \n"
handles = dict()

f_2 = open('data.dat', 'r')  # the input file from the question
next(f_2)  # skip title
for line in f_2:
    toks = line.split()
    filename = "mode_{}.dat".format(toks[-1])
    # create the file the first time its id is encountered
    if filename not in handles:
        handles[filename] = open(filename, "w")
        handles[filename].write(title_line)  # write title
    handles[filename].write(line)
# close all files
for v in handles.values():
    v.close()
EDIT: that's the fastest way, but the problem is that if you have too many suffixes (like in your real example), you'll get a "too many open files" exception. So for that case, there's a slightly less efficient method which works too:
import glob, os

# pre-processing: clean up old files if any
for f in glob.glob("mode_*.dat"):
    os.remove(f)

f_2 = open('data.dat', 'r')  # the input file from the question
next(f_2)  # skip title
s = set()
title_line = "# Vol \t Freq \t Mod \n"
for line in f_2:
    toks = line.split()
    filename = "mode_{}.dat".format(toks[-1])
    with open(filename, "a") as f:
        if filename not in s:
            s.add(filename)
            f.write(title_line)
        f.write(line)
It basically opens the file in append mode, writes the line, and closes the file.
(The set is used to detect the first write to each file, so the title can be written before the data.)
There's a directory cleanup first to ensure that no data is left over from a previous run (append mode expects that no file exists, and if the input data set changes, an identifier from a previous run might not be present in the new dataset, which would leave an "orphan" file behind).
First, instead of looping over your one_to_factor, you can get the index in one step:
index = line.rstrip()[-1]  # last character on the line, once the newline is dropped
Then you can check whether index is in your one_to_factor list.
You could create a dictionary of lists to store your lines, something like:
{ "1" : [line1, line7, ...],
  "2" : ... }
And then you can use the keys of the dictionary to create the files and populate them with their lines; a sketch follows below.
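A minimal sketch of that approach, assuming (as above) the data lives in data.dat with a one-line header. Note that taking the last character only works for single-digit mode numbers; for the real 1-to-1917 case, line.split()[-1] is safer:

from collections import defaultdict

lines_by_mode = defaultdict(list)  # {"1": [...], "2": [...], ...}

with open('data.dat') as f_2:
    next(f_2)  # skip the title line
    for line in f_2:
        index = line.rstrip()[-1]  # last character: the mode number
        lines_by_mode[index].append(line)

# one file per key
for index, lines in lines_by_mode.items():
    with open("mode_{}.dat".format(index), "w") as out:
        out.writelines(lines)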

Parse complex text file for data analysis in Python

I am a complete novice to Python-or programming.
I have a text file to parse into a CSV. I am not able to provide an example of the text file at this time.
The text is several (thousand) lines with no carriage returns.
There are 4 types of records in the file (A, B, C, or I).
Each record type has a specific format based on the size of the data element.
There are no delimiters.
Immediately after the last data element in the record type, the next record type appears.
I have been trying to translate from a different language what this might look like in Python.
Here is an example of what I've written (not correct format)
file = open('TestPython.txt', 'r')  # from current working directory
dataString = file.read()
records = []
i = 0
while i < len(dataString):
    curChar = dataString[i]  # in Python, indexing is dataString[i], not dataString(i)
    if curChar == "A":
        # fixed-width fields are taken with slices: dataString[start:stop]
        # (field extents follow the original sketch; adjust to the real layout)
        NPI = dataString[i + 1:i + 17].strip()
        PCN = dataString[i + 17:i + 41].strip()
        seqNo = dataString[i + 41:i + 43].strip()
        MRN = dataString[i + 43:i + 67].strip()
        records.append(("A", NPI, PCN, seqNo, MRN))
        i += 67  # the next record type starts immediately after the last field
    elif curChar == "B":
        NPI = dataString[i + 1:i + 17].strip()
        PCN = dataString[i + 17:i + 41].strip()
        seqNo = dataString[i + 41:i + 43].strip()
        RC1 = dataString[i + 43:i + 47].strip()
        RC2 = dataString[i + 47:i + 51].strip()
        RC3 = dataString[i + 51:i + 55].strip()
        records.append(("B", NPI, PCN, seqNo, RC1, RC2, RC3))
        i += 55
    elif curChar == "C":
        NPI = dataString[i + 1:i + 17].strip()
        PCN = dataString[i + 17:i + 41].strip()
        seqNo = dataString[i + 41:i + 43].strip()
        DXVer = dataString[i + 43:i + 44].strip()
        AdmitDX = dataString[i + 44:i + 51].strip()
        RVisit1 = dataString[i + 51:i + 58].strip()
        records.append(("C", NPI, PCN, seqNo, DXVer, AdmitDX, RVisit1))
        i += 58
    else:
        i += 1  # unrecognized character: skip ahead
Here's a dummied-up version of a piece of the text file.
A 63489564696474677 9845687 777 67834717467764674 TUANU TINBUNIU 47 ERTYNU TDFGH UU748897764 66762589668777486U6764467467774767 7123609989 9 O
B 79466945684634677 676756787344786474634890 7746.66 7 96 4 7 7 9 7 774666 44969 494 7994 99666 77478 767766
B 098765477 64697666966667 9 99 87966 47798 797499
C 63489564696474677 6747494 7494 7497 4964 4976 N7469 4769 N9784 9677
I 79466944696474677 677769U6 8888 67764674
A 79466945684634677 6767994 777 696789989 6464467464764674 UIIUN UITTI 7747 NUU 9 ATU 4 UANU OSASDF NU67479 66567896667697487U6464467476777967 7699969978 7699969978 9 O
As you can see, there can be several of each type in the file. The way this example pastes, it looks like the type is the first character on a line. This is not the case in the actual file (I made this sample in Word).
You might take a look at pyparsing.
You'd be better off processing the file as you read it.
First, do a file.read(1) to determine which type of record is up next.
Then, depending on the type, read the fields, which, if I understand you correctly, are fixed width. For type 'A' this would look like this:
def processA(file):
    NPI = file.read(16).strip()   # assuming the NPI is 16 bytes long
    PCN = file.read(23).strip()   # assuming the PCN is 23 bytes long
    seqNo = file.read(1).strip()  # assuming seqNo is 1 byte long
    MRN = file.read(23).strip()   # assuming MRN is 23 bytes long
    return {"NPI": NPI, "PCN": PCN, "seqNo": seqNo, "MRN": MRN}
If the file is not ASCII, there's a bit more work to get the encoding right and read characters instead of bytes.
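The answer shows only processA; a sketch of the surrounding dispatch loop it describes might look like this (processB, processC, and processI are hypothetical handlers built on the same pattern):

def process_file(path):
    """Read records sequentially, dispatching on the one-character type code."""
    records = []
    with open(path, 'r') as f:
        handlers = {"A": processA}  # add "B", "C", and "I" handlers here
        while True:
            record_type = f.read(1)
            if not record_type:
                break  # end of file
            handler = handlers.get(record_type)
            if handler is None:
                raise ValueError("unknown record type: %r" % record_type)
            record = handler(f)
            record["type"] = record_type
            records.append(record)
    return records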
