I am a complete novice to Python, and to programming in general.
I have a text file to parse into a CSV. I am not able to provide an example of the text file at this time.
The text is several thousand lines' worth of data with no carriage returns.
There are 4 types of records in the file (A, B, C, or I).
Each record type has a specific format based on the size of the data element.
There are no delimiters.
Immediately after the last data element in the record type, the next record type appears.
I have been trying to translate from a different language what this might look like in Python.
Here is an example of what I've written (the syntax is not correct):
file=open('TestPython.txt'), 'r' # from current working directory
dataString=file.read()
data=()
i=0
while i < len(dataString):
    i = i+2
    curChar = dataString(i)
    # Need some help on the next line var curChar = dataString[i]
    if curChar = "A"
        NPI = dataString(i+1, 16) # Need to verify that is how it is done in python inside ()
        NPI.strip()
        PCN = datastring(i+17, 40)
        PCN.strip()
        seqNo = dataString(i+41, 42)
        seqNo.strip()
        MRN = dataString(i+43, 66)
        MRN.strip()
    if curChar = "B"
        NPI = dataString(i+1, 16) # Need to verify that is how it is done in python inside ()
        NPI.strip()
        PCN = datastring(i+17, 40)
        PCN.strip()
        seqNo = dataString(i+41, 42)
        seqNo.strip()
        RC1 = (i+43, 46)
        RC1.strip()
        RC2 = (i+47, 50)
        RC2.strip()
        RC3 = (i+51, 54)
        RC3.strip()
    if curChar = "C"
        NPI = dataString(i+1, 16) # Need to verify that is how it is done in python inside ()
        NPI.strip()
        PCN = datastring(i+17, 40)
        PCN.strip()
        seqNo = dataString(i+41, 42)
        seqNo.strip()
        DXVer = (i=43, 43)
        DXVer.strip()
        AdmitDX = (i+44, 50)
        AdmitDX.strip()
        RVisit1 = (i+51, 57)
        RVisit1.strip()
Here's a dummied-up version of a piece of the text file.
A 63489564696474677 9845687 777 67834717467764674 TUANU TINBUNIU 47 ERTYNU TDFGH UU748897764 66762589668777486U6764467467774767 7123609989 9 O
B 79466945684634677 676756787344786474634890 7746.66 7 96 4 7 7 9 7 774666 44969 494 7994 99666 77478 767766
B 098765477 64697666966667 9 99 87966 47798 797499
C 63489564696474677 6747494 7494 7497 4964 4976 N7469 4769 N9784 9677
I 79466944696474677 677769U6 8888 67764674
A 79466945684634677 6767994 777 696789989 6464467464764674 UIIUN UITTI 7747 NUU 9 ATU 4 UANU OSASDF NU67479 66567896667697487U6464467476777967 7699969978 7699969978 9 O
As you can see, there can be several of each type in the file. The way this example pastes, it looks like the type is the first character on a line. This is not the case in the actual file (I made this sample in Word).
You might take a look at pyparsing.
You'd be better off processing the file as you read it.
First, do a file.read(1) to determine which type of record is up next.
Then, depending on the type, read the fields, which if I understand you correctly are fixed width. So for type 'A' this would look like this:
def processA(file):
    NPI = file.read(16).strip()   # assuming the NPI is 16 bytes long
    PCN = file.read(23).strip()   # assuming the PCN is 23 bytes long
    seqNo = file.read(1).strip()  # assuming seqNo is 1 byte long
    MRN = file.read(23).strip()   # assuming MRN is 23 bytes long
    return {"NPI": NPI, "PCN": PCN, "seqNo": seqNo, "MRN": MRN}
If the file is not ASCII, there's a bit more work to get the encoding right and read characters instead of bytes.
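Putting that together, the main loop might read one type character and then dispatch to the matching reader. This is just a sketch built on processA above; processB, processC and processI are hypothetical and would follow the same pattern with your real field widths.

def process_file(path):
    records = []
    with open(path, 'r') as f:
        while True:
            recType = f.read(1)       # one character telling us which record is next
            if not recType:           # empty string means end of file
                break
            if recType == 'A':
                records.append(processA(f))
            elif recType == 'B':
                records.append(processB(f))   # hypothetical, same pattern as processA
            elif recType == 'C':
                records.append(processC(f))   # hypothetical
            elif recType == 'I':
                records.append(processI(f))   # hypothetical
            else:
                raise ValueError('unexpected record type: %r' % recType)
    return records

Each returned dict can then be handed to csv.DictWriter (or simply joined with commas) to produce the CSV rows.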
How do you split a line by spaces? With the code below, I read a file with 4 lines of 7 numbers separated by spaces. When it takes a line and splits it, it seems to split character by character, so if I print item[0], 5 prints instead of 50. Here is the code:
def main():
    filename = input("Enter the name of the file: ")
    infile = open(filename, "r")
    for i in range(4):
        data = infile.readline()
        print(data)
        item = data.split()
        print(data[0])
main()
The file looks like this:
50 60 15 100 60 15 40 /n
100 145 20 150 145 20 45 /n
50 245 25 120 245 25 50 /n
100 360 30 180 360 30 55 /n
split() takes as an optional argument the character you want to split your string on; called with no argument, it splits on runs of whitespace.
I invite you to read the documentation of methods you are using. :)
EDIT: By the way, readline returns a string, not a **list**.
However, split does return a list.
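For example, splitting one of the lines from your file gives a list of number strings (interactive session, values taken from the sample data):

>>> line = "50 60 15 100 60 15 40"
>>> items = line.split()
>>> items
['50', '60', '15', '100', '60', '15', '40']
>>> items[0]
'50'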
import nltk
tokens = nltk.word_tokenize(TextInTheFile)
Try this once you have opened the file and read its contents into the variable TextInTheFile.
There's not a lot wrong with what you are doing, except that you are printing the wrong thing.
Instead of
print(data[0])
use
print(item[0])
data[0] is the first character of the string you read from the file. You split this string into a variable called item, so that's what you should print.
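So the loop from the question, with just that one line changed (and the file closed at the end for tidiness), becomes:

def main():
    filename = input("Enter the name of the file: ")
    infile = open(filename, "r")
    for i in range(4):
        data = infile.readline()
        item = data.split()
        print(item[0])   # first number on the line, e.g. '50'
    infile.close()
main()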
I have a file that has a number of headers of binary data (I suppose that is what it is) and, after that, lines of text. I'm just starting to work with it, but I noticed that if I use Python's enumerate function it doesn't appear to read the lines I want it to (I'm using Python 2.7.8): it is not returning the lines I'm interested in. In my text editor I can see the data I want, but the result suggests the header may be serialized data. There is more of the same binary content at the end of the file.
Sample from Data File (I'm hoping to skip the first 8 lines):
I want to start with the line that starts with "curve".
ÿÿÿÿ ENetDeedPlotter, Version=5.6.1.0, Culture=neutral, PublicKeyToken=null QSystem.Drawing, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a Net_Deed_Plotter.SerializeData! LinesOfDataNumberOfTractsSeditortextSedLineStract
SNoteArraySNorthArrow
Slandscape
SPaperSizeSPaperBounds
SPrinterScaleSPrinterScaleStrSAllTractsMouseOffsetNSAllTractsMouseOffsetESAllTractsNOffsetSAllTractsEOffsetSImageScroll_YSImageScroll_XSImage_YSImage_XSImageFilePath
SUpDateMapSttcSttStbSboSnb
STitleText SDateText SPOBLines
SLabelCornersSNAmountTract0HasBeenMovedSEAmountTract0HasBeenMoved Net_Deed_Plotter.LineData[] Net_Deed_Plotter.TractData[] System.Collections.ArrayList+Net_Deed_Plotter.PaperForm+NorthArrowStruct !System.Drawing.Printing.PaperSize System.Drawing.Rectangle ' Ân40.4635w 191.02
curve right radius 953.50 arc 361.84 chord n60.5705e 359.07
s56.3005e 3.81
s19.4515w 170.63
s13.4145w 60.67
s51.0250w 155.35
n40.4635w 191.02
curve left radius 615.16 arc 202.85 chord s45.19w 201.94
Sample Script
# INPUTS TO BE UPDATED
inputNDP = r"N:\Parcels\Parcels2012\57-11-115.ndp"
outputTXT = r"N:\Parcels\Parcels2012\57-11-115.txt"
# END OF INPUTS TO BE UPDATED
fileNDP = open(inputNDP, 'r')
for line in enumerate(fileNDP, 9):
    print line
Result
(9, '\x00\x01\x00\x00\x00\xff\xff\xff\xff\x01\x00\x00\x00\x00\x00\x00\x00\x0c\x02\x00\x00\x00ENetDeedPlotter, Version=5.6.1.0, Culture=neutral, PublicKeyToken=null\x0c\x03\x00\x00\x00QSystem.Drawing, Version=2.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a\x05\x01\x00\x00\x00\x1eNet_Deed_Plotter.SerializeData!\x00\x00\x00\x0bLinesOfData\x0eNumberOfTracts\x0bSeditortext\x07SedLine\x06Stract\n')
(10, 'SNoteArray\x0bSNorthArrow\n')
(11, 'Slandscape\n')
(12, 'SPaperSize\x0cSPaperBounds\rSPrinterScale\x10SPrinterScaleStr\x16SAllTractsMouseOffsetN\x16SAllTractsMouseOffsetE\x11SAllTractsNOffset\x11SAllTractsEOffset\x0eSImageScroll_Y\x0eSImageScroll_X\x08SImage_Y\x08SImage_X\x0eSImageFilePath\n')
(13, 'SUpDateMap\x04Sttc\x03Stt\x03Stb\x03Sbo\x03Snb\n')
(14, 'STitleText\tSDateText\tSPOBLines\rSLabelCorners')
>>>
Be aware that enumerate takes a start parameter that only sets the initial value of the number. It does not cause it to skip over any contents.
If you want to skip lines, you'll need to filter your enumeration:
>>> x = xrange(20)
>>> for num, text in (tpl for tpl in enumerate(x) if tpl[0] > 8):
...     print num, text
...
9 9
10 10
11 11
12 12
13 13
14 14
15 15
16 16
17 17
18 18
19 19
I figured out that since the file was in a binary format, I needed to read it in that way with open('myfile', 'rb') rather than with open('myfile', 'r') and I got a lot of help from this question.
The re-write looks like this ...
# TODO: write output file
# INPUTS TO BE UPDATED
inputNDP = r"N:\Parcels\Parcels2012\57-11-115.ndp"
# END OF INPUTS TO BE UPDATED

fileNDP = open(inputNDP, 'rb')

def strip_nonascii(b):
    return b.decode('ascii', errors='ignore')

n = 0
for line in fileNDP:
    if n > 5:
        if '|' in line:
            break
        print(strip_nonascii(line)).strip('\n')  # + str(n)
    n += 1
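To take care of the TODO at the top, the same loop can write the cleaned lines to the outputTXT path from the first script instead of printing them. A sketch, assuming you want everything from the skipped header down to the trailing binary block:

fileNDP = open(inputNDP, 'rb')
fileTXT = open(outputTXT, 'w')

n = 0
for line in fileNDP:
    if n > 5:                      # same header skip as above
        if '|' in line:
            break
        fileTXT.write(strip_nonascii(line).strip('\n') + '\n')
    n += 1

fileNDP.close()
fileTXT.close()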
I am new to programming but I have started looking into both Python and Perl.
I am looking for data in two input files that are partly CSV, selecting some of it, and putting it into a new output file.
Maybe Python CSV or Pandas can help here, but I'm a bit stuck when it comes to skipping/keeping rows and columns.
Also, I don't have any headers for my columns.
Input file 1:
-- Some comments
KW1
'Z1' 'F' 30 26 'S'
KW2
'Z1' 30 26 1 1 5 7 /
'Z1' 30 26 2 2 6 8 /
'Z1' 29 27 4 4 12 13 /
Input file 2:
-- Some comments
-- Some more comments
KW1
'Z2' 'F' 40 45 'S'
KW2
'Z2' 40 45 1 1 10 10 /
'Z2' 41 45 2 2 14 15 /
'Z2' 41 46 4 4 16 17 /
Desired output file:
KW_NEW
'Z_NEW' 1000 30 26 1 /
'Z_NEW' 1000 30 26 2 /
'Z_NEW' 1000 29 27 4 /
'Z_NEW' 1000 40 45 1 /
'Z_NEW' 1000 41 45 2 /
'Z_NEW' 1000 41 46 4 /
So what I want to do is:
Do not include anything in either of my two input files before I reach KW2
Replace KW2 with KW_NEW
Replace either 'Z1' or 'Z2' with 'Z_NEW' in the first column
Add a new second column with a constant value e.g. 1000
Copy the next three columns as they are
Leave out any remaining columns before printing the slash / at the end
Could anyone give me at least some general hints/tips on how to approach this?
Your files are not "partly csv" (there is not a comma in sight); they are (partly) space delimited. You can read the files line-by-line, use Python's .split() method to convert the relevant strings into lists of substrings, and then re-arrange the pieces as you please. The splitting and re-assembly might look something like this:
input_line = "'Z1' 30 26 1 1 5 7 /" # test data
input_items = input_line.split()
output_items = ["'Z_NEW'", '1000']
output_items.append(input_items[1])
output_items.append(input_items[2])
output_items.append(input_items[3])
output_items.append('/')
output_line = ' '.join(output_items)
print(output_line)
The final print() statement shows that the resulting string is
'Z_NEW' 1000 30 26 1 /
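Extending that idea to the whole job (skip everything up to KW2 in each file, transform the data rows, and write them under a KW_NEW header), a sketch might look like this; the file names are placeholders for your real inputs:

def convert(paths, out_path):
    output_lines = ['KW_NEW']
    for path in paths:
        with open(path) as f:
            seen_kw2 = False
            for raw in f:
                line = raw.strip()
                if not seen_kw2:
                    seen_kw2 = (line == 'KW2')   # skip everything up to and including KW2
                    continue
                if not line.endswith('/'):
                    continue                     # only transform the data rows
                items = line.split()
                output_lines.append(' '.join(["'Z_NEW'", '1000'] + items[1:4] + ['/']))
    with open(out_path, 'w') as out:
        out.write('\n'.join(output_lines) + '\n')

convert(['in1.txt', 'in2.txt'], 'out.txt')   # placeholder file names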
Is your file format static? (It is not actually CSV, by the way. :P) You might want to investigate a standardized file format like JSON or strict CSV to store your data, so that you can use already-existing tools to parse your input files. Python has great JSON and CSV libraries that can do all the hard stuff for you.
If you're stuck with this file format, I would try something along these lines.
path = '<input_path>'
kws = ['KW1', 'KW2']
desired_kw = kws[1]
def parse_columns(line):
    array = line.split()
    if array and array[-1] == '/':
        # get rid of trailing slash
        array = array[:-1]
    return array

def is_kw(cols):
    if len(cols) > 0 and cols[0] in kws:
        return cols[0]

# to parse the section denoted by the desired keyword
with open(path, 'r') as input_fp:
    matrix = []
    reading_file = False
    for line in input_fp:
        cols = parse_columns(line)
        line_is_kw = is_kw(cols)
        if line_is_kw:
            if not reading_file:
                if line_is_kw == desired_kw:
                    reading_file = True
                continue          # don't add the keyword line itself
            else:
                break             # next keyword reached: stop reading
        if reading_file and cols:
            matrix.append(cols)

print(matrix)
From there you can use stuff like slice notation and basic list manipulation to get your desired array. Good luck!
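For example, once matrix holds the rows that followed the desired keyword, slicing can rebuild the lines in the requested layout (building on the snippet above):

for cols in matrix:
    # cols is e.g. ["'Z1'", '30', '26', '1', '1', '5', '7']
    print(' '.join(["'Z_NEW'", '1000'] + cols[1:4] + ['/']))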
Here is a way to do it with Perl:
#!/usr/bin/perl
use strict;
use warnings;
# initialize output array
my @output = ('KW_NEW');

# process first file
open my $fh1, '<', 'in1.txt' or die "unable to open file1: $!";
while (<$fh1>) {
    # consider only lines after KW2
    if (/KW2/ .. eof) {
        # don't treat the KW2 line itself
        next if /KW2/;
        # split the current line on spaces and keep only the first four elements
        my @l = (split ' ', $_)[0..3];
        # change the first element
        $l[0] = "'Z_NEW'";
        # insert 1000 at second position
        splice @l, 1, 0, 1000;
        # push into output array, with the trailing slash
        push @output, "@l /";
    }
}

# process second file
open my $fh2, '<', 'in2.txt' or die "unable to open file2: $!";
while (<$fh2>) {
    if (/KW2/ .. eof) {
        next if /KW2/;
        my @l = (split ' ', $_)[0..3];
        $l[0] = "'Z_NEW'";
        splice @l, 1, 0, 1000;
        push @output, "@l /";
    }
}

# write array to output file
open my $fh3, '>', 'out.txt' or die "unable to open file3: $!";
print $fh3 $_, "\n" for @output;
I have a column in Excel 2013 containing letters and the digits 1,2,3 and 4 (representing pinyin pronunciations and tone values). They are all in the same font & format, but I would like to convert the numbers only to superscript. It does not seem that I can use any of Excel's built-in find-and-replace functionality to replace a single character in a cell with its superscript version: the entire cell format gets changed. I saw a thread Format individual characters in a single Excel cell with python which apparently holds a solution, but that was the first time I had heard of Python or xlwt.
Since I have never used Python and xlwt, can someone give me a basic step-by-step set of instructions to install those utilities, customize the script and run it?
Sample:
Li1Shi4
Qin3Fat1
Gon1Lin3
Den1Choi3
Xin1Nen3
Script from other thread:
import xlwt
wb = xlwt.Workbook()
ws = wb.add_sheet('Sheet1')
font0 = xlwt.easyfont('')
font1 = xlwt.easyfont('bold true')
font2 = xlwt.easyfont('color_index red')
style = xlwt.easyxf('font: color_index blue')
seg1 = ('bold', font1)
seg2 = ('red', font2)
seg3 = ('plain', font0)
seg4 = ('boldagain', font1)
ws.write_rich_text(2, 5, (seg1, seg2, seg3, seg4))
ws.write_rich_text(4, 1, ('xyz', seg2, seg3, '123'), style)
wb.save('rich_text.xls')
What is the syntax that will achieve the "find numbers and replace with superscript"? Is it a font or a style? The code from the other thread seems to specify the segments "seg1", "seg2", "seg3" etc. manually. Or am I misunderstanding the code?
Thanks in advance. I am using Windows 8, 64 bit, Excel 2013.
I'm bored and in a teaching mood, so, here's a long "answer" that also explains a little bit about how you can figure these things out for yourself in the future :)
I typed abc123def into a cell, and recorded a macro using the macro recorder.
This is where you should always start if you don't know what the correct syntax is.
In any case, I selected the numeric part of this cell, right-clicked, chose Format Cells, and changed the font to superscript.
This is what the macro recorder gives me. This is a lot of code. Fortunately, it's a lot of junk.
Sub Macro2()
    With ActiveCell.Characters(Start:=1, Length:=3).Font 'Applies to the first 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = False
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
    With ActiveCell.Characters(Start:=4, Length:=3).Font 'Applies to the middle 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = True
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
    With ActiveCell.Characters(Start:=7, Length:=3).Font 'Applies to the last 3 characters
        .Name = "Calibri"
        .FontStyle = "Regular"
        .Size = 11
        .Strikethrough = False
        .Superscript = False
        .Subscript = False
        .OutlineFont = False
        .Shadow = False
        .Underline = xlUnderlineStyleNone
        .ThemeColor = xlThemeColorLight1
        .TintAndShade = 0
        .ThemeFont = xlThemeFontMinor
    End With
End Sub
What it represents is three blocks of formatting: the first is the first 3 characters that aren't changed, then the 3 that we applied superscript to, and then the last three characters.
Almost all of this is default properties, since I made no other changes, so I can revise it to this:
Sub Macro2()
    With ActiveCell.Characters(Start:=4, Length:=3).Font
        .Superscript = True
    End With
End Sub
Now we can see that there are two important parts to this. The first part is how to specify which characters to format. This is done by referring to a cell's .Characters:
ActiveCell.Characters(Start:=4, Length:=3).Font
So we can see that this macro refers to the characters at positions 4-6 in the string "abc123def", or "123".
The next, obvious part is to set the .Font.Superscript property to True.
Now you want to generalize this so that you can apply it anywhere. The above code has the Start and Length arguments hardcoded; we need to make it dynamic. The easiest way to do this is to go one character at a time and, if the character is numeric, apply the superscript.
Sub ApplySuperscriptToNumbers()
    Dim i As Long
    Dim str As String
    Dim rng As Range
    Dim cl As Range
    '## Generally should work on any contiguous "Selection" of cell(s)
    Set rng = Range(Selection.Address)
    '## Iterate over each cell in this selection
    For Each cl In rng.Cells
        str = cl.Value
        '## Iterate over each character in the cell
        For i = 1 To Len(str)
            '## Check if this character is numeric
            If IsNumeric(Mid(str, i, 1)) Then
                '## Apply superscript to this 1 character
                cl.Characters(Start:=i, Length:=1).Font.Superscript = True
            End If
        Next
    Next
End Sub
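If you do end up going the Python/xlwt route mentioned in the question instead, the same go-character-by-character idea could look roughly like the sketch below. Note the assumption that xlwt's easyfont() accepts an 'escapement superscript' fragment; also, xlwt writes a new .xls file rather than editing an existing workbook in place.

import xlwt

wb = xlwt.Workbook()
ws = wb.add_sheet('Sheet1')

plain = xlwt.easyfont('')
superscript = xlwt.easyfont('escapement superscript')  # assumption: see note above

def as_rich_segments(text):
    """Split text into (substring, font) segments, digits getting the superscript font."""
    segments = []
    for ch in text:
        font = superscript if ch.isdigit() else plain
        if segments and segments[-1][1] is font:
            segments[-1] = (segments[-1][0] + ch, font)   # extend the current run
        else:
            segments.append((ch, font))
    return segments

data = ['Li1Shi4', 'Qin3Fat1', 'Gon1Lin3', 'Den1Choi3', 'Xin1Nen3']
for row, value in enumerate(data):
    ws.write_rich_text(row, 0, as_rich_segments(value))

wb.save('pinyin_superscript.xls')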
I am trying to write a Python program that reads each line from an infile. This infile is a list of dates. I want to test each line with a function isValid(), which returns true if the date is valid, and false if it is not. If the date is valid, it is written into an output file. If it is not, invalid is written into the output file. I have the function, and all I want to know is the best way to test each line with the function. I know this should be done with a loop, I'm just uncertain how to set up the loop to test each line in the file one-by-one.
Edit: I now have a program that basically works, but I'm getting strange results in the output file. Perhaps someone with Python 3 experience can explain why.
def main():
    datefile = input("Enter filename: ")
    t = open(datefile, "r")
    c = t.readlines()
    ofile = input("Enter filename: ")
    o = open(ofile, "w")
    for line in c:
        b = line.split("/")
        e = b[0]
        f = b[1]
        g = b[2]
        text = str(e) + " " + str(f) + ", " + str(g)
        text2 = "The date " + text + " is invalid"
        if isValid(e,f,g) == True:
            o.write(text)
        else:
            o.write(text2)

def isValid(m, d, y):
    if m == 1 or m == 3 or m == 5 or m == 7 or m == 8 or m == 10 or m == 12:
        if d is range(1, 31):
            return True
    elif m == 2:
        if d is range(1,28):
            return True
    elif m == 4 or m == 6 or m == 9 or m == 11:
        if d is range(1,30):
            return True
    else:
        return False
This is the output I'm getting.
The date 5 19, 1998
is invalidThe date 7 21, 1984
is invalidThe date 12 7, 1862
is invalidThe date 13 4, 2000
is invalidThe date 11 40, 1460
is invalidThe date 5 7, 1970
is invalidThe date 8 31, 2001
is invalidThe date 6 26, 1800
is invalidThe date 3 32, 400
is invalidThe date 1 1, 1111
is invalid
In the most recent versions of Python you can use the context management features that are implicit for files:
results = list()
with open(some_file) as f:
    for line in f:
        if isValid(line, date):
            results.append(line)
... or even more tersely with a list comprehension:
with open(some_file) as f:
    results = [line for line in f if isValid(line, date)]
For progressively older versions of Python you might need to open and close the file explicitly (while still using simple implicit iteration, for line in file:), or use more explicit iteration over the file: f.readline() or f.readlines() (plural), depending on whether you want to "slurp" in the entire file, with the memory overhead that implies, or iterate line by line.
Also note that you may wish to strip the trailing newlines off these file contents (perhaps by calling line.rstrip('\n') --- or possibly just line.strip() if you want to eliminate all leading and trailing whitespace from each line).
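For example, an explicit variant with the newline stripping might look like this (isValid's two-argument form here is the same assumption used above):

f = open(some_file)
results = []
while True:
    line = f.readline()
    if not line:                 # empty string means end of file
        break
    line = line.rstrip('\n')     # drop the trailing newline before validating
    if isValid(line, date):
        results.append(line)
f.close()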
(Edit based on additional comment to previous answer):
The function signature isValid(m, d, y) suggests that you're passing date components to this function (month, day, year), but that doesn't make sense given that you must also, somehow, pass in the data to be validated (a line of text, a string, etc.).
To help you further you'll have to provide more information (preferably the source, or a relevant portion of the source, of this isValid() function).
In my initial answer I was assuming that your "isValid()" function was merely scanning for any valid date in its single argument. I've modified my code examples to show how one might pass a specific date, as a single argument, to a function which used this calling signature: "isValid(somedata, some_date)."
with open(fname) as f:
    for line in f.readlines():
        test(line)
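Given the isValid(m, d, y) signature from the question, a minimal loop that parses each line into its three components before validating might look like this. It is only a sketch: it assumes the dates are written as month/day/year and that numeric comparison is wanted, so the pieces are converted to int.

with open(fname) as f:
    for line in f:
        parts = line.strip().split("/")
        if len(parts) != 3:
            continue                       # skip malformed lines
        m, d, y = (int(p) for p in parts)  # convert so isValid can compare numbers
        if isValid(m, d, y):
            print(line.strip())
        else:
            print("The date " + line.strip() + " is invalid")

Note that the isValid shown in the question would also need its "d is range(...)" checks changed to "d in range(...)" before it ever returns True for a valid date.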
test(line)