Parsing text in python

Parsing text in python - python

I've got a text file that is structured like so
1\t 13249\n
2\t 3249\n
3\t 43254\n
etc...
It's a very simple list. I've got the file opened and I can read the lines. I have the following code:
count = 0
for x in open(filename):
count += 1
return count
What I want to do is to assign the first number of each line to a variable (say xi) and to assign the second number of each line to another variable (yi). The goal is to be able to run some statistics on these numbers.
Many thanks in advance.

No need to reinvent the wheel..
import numpy as np
for xi, yi in np.loadtxt('blah.txt'):
print(xi)
print(yi)

count = 0
for x in open(filename):
# strip removes all whitespace on the right (so the newline in this case)
# split will break a string in two based on the passed parameter
xi, yi = x.rstrip().split("\t") # multiple values can be assigned at once
count += 1
return count

>>> with open('blah.txt') as f:
... for i,xi,yi in ([i]+map(int,p.split()) for i,p in enumerate(f)):
... print i,xi,yi
...
0 1 13249
1 2 3249
2 3 43254
note that int(' 23\n') = 23
this is clearer:
Note that enumerate provides a generator which includes a counter for you.
>>> with open('blah.txt') as f:
... for count,p in enumerate(f):
... xi,yi=map(int,p.split()) #you could prefer (int(i) for i in p.split())
... print count,xi,yi
...
0 1 13249
1 2 3249
2 3 43254

with regular expression:
import re
def FUNC(path):
xi=[]
yi=[]
f=open(path).read().split("\n") # spliting file's content into a list
patt=re.compile("^\s*(\d)\t\s(\d).*") # first some whitespaces then first number
#then a tab or space second number and other characters
for iter in f:
try:
t=patt.findall(iter)[0]
xi.append(t[0])
yi.append(t[1])
except:
pass
print xi,yi
#-----------------------------
if __name__=="__main__":
FUNC("C:\\data.txt")
a simpler code:
def FUNC(path):
x=[]
y=[]
f=open(path).read().split("\n")
for i in f:
i=i.split(" ")
try:
x.append(i[0][0])
y.append(i[1][0])
except:pass
print x,y

Related

Add new line after finding last string in a region

I have this input test.txt file with the output interleaved as #Expected in it (after finding the last line containing 1 1 1 1 within a *Title region
and this code in Python 3.6
index = 0
insert = False
currentTitle = ""
testfile = open("test.txt","r")
content = testfile.readlines()
finalContent = content
testfile.close()
# Should change the below line of code I guess to adapt
#titles = ["TitleX","TitleY","TitleZ"]
for line in content:
index = index + 1
for title in titles:
if line in title+"\n":
currentTitle = line
print (line)
if line == "1 1 1 1\n":
insert = True
if (insert == True) and (line != "1 1 1 1\n"):
finalContent.insert(index-1, currentTitle[:6] + "2" + currentTitle[6:])
insert = False
f = open("test.txt", "w")
finalContent = "".join(finalContent)
f.write(finalContent)
f.close()
Update:
Actual output with the answer provided
*Title Test
12125
124125
asdas 1 1 1 1
rthtr 1 1 1 1
asdasf 1 1 1 1
asfasf 1 1 1 1
blabla 1 1 1 1
#Expected "*Title Test2" here <-- it didn't add it
124124124
*Title Dunno
12125
124125
12763125 1 1 1 1
whatever 1 1 1 1
*Title Dunno2
#Expected "*Title Dunno2" here <-- This worked great
214142122
#and so on for thousands of them..
Also is there a way to overwrite this in the test.txt file?

Because you are already reading the entire file into memory anyway, it's easy to scan through the lines twice; once to find the last transition out of a region after each title, and once to write the modified data back to the same filename, overwriting the previous contents.
I'm introducing a dictionary variable transitions where the keys are the indices of the lines which have a transition, and the value for each is the text to add at that point.
transitions = dict()
in_region = False
reg_end = -1
current_title = None
with open("test.txt","r") as testfile:
content = testfile.readlines()
for idx, line in enumerate(content):
if line.startswith('*Title '):
# Commit last transition before this to dict, if any
if current_title:
transitions[reg_end] = current_title
# add suffix for printing
current_title = line.rstrip('\n') + '2\n'
elif line.strip().endswith(' 1 1 1 1'):
in_region = True
# This will be overwritten while we remain in the region
reg_end = idx
elif in_region:
in_region = False
if current_title:
transitions[reg_end] = current_title
with open("test.txt", "w") as output:
for idx, line in enumerate(content):
output.write(line)
if idx in transitions:
output.write(transitions[idx])
This kind of "remember the last time we saw something" loop is very common, but takes some time getting used to. Inside the loop, keep in mind that we are looping over all the lines, and remembering some things we saw during a previous iteration of this loop. (Forgetting the last thing you were supposed to remember when you are finally out of the loop is also a very common bug!)
The strip() before we look for 1 1 1 1 normalizes the input by removing any surrounding whitespace. You could do other kinds of normalizations, too; normalizing your data is another very common technique for simplifying your logic.
Demo: https://ideone.com/GzNUA5

try this, using itertools.zip_longest
from itertools import zip_longest
with open("test.txt","r") as f:
content = f.readlines()
results, title = [], ""
for i, j in zip_longest(content, content[1:]):
# extract title.
if i.startswith("*"):
title = i
results.append(i)
# compare value in i'th index with i+1'th (if mismatch add title)
if "1 1 1 1" in i and "1 1 1 1" not in j:
results.append(f'{title.strip()}2\n')
print("".join(results))

Finding missing lines in file

I have a 7000+ lines .txt file, containing description and ordered path to image. Example:
abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png
abnormal /Users/alex/Documents/X-ray-classification/data/images/2.png
normal /Users/alex/Documents/X-ray-classification/data/images/3.png
normal /Users/alex/Documents/X-ray-classification/data/images/4.png
Some lines are missing. I want to somehow automate the search of missing lines. Intuitively i wrote:
f = open("data.txt", 'r')
lines = f.readlines()
num = 1
for line in lines:
if num in line:
continue
else:
print (line)
num+=1
But of course it didn't work, since lines are strings.
Is there any elegant way to sort this out? Using regex maybe?
Thanks in advance!

the following should hopefully work - it grabs the number out of the filename, sees if it's more than 1 higher than the previous number, and if so, works out all the 'in-between' numbers and prints them. Printing the number (and then reconstructing the filename later) is needed as line will never contain the names of missing files during iteration.
# Set this to the first number in the series -1
num = lastnum = 0
with open("data.txt", 'r') as f:
for line in f:
# Pick the digit out of the filename
num = int(''.join(x for x in line if x.isdigit()))
if num - lastnum > 1:
for i in range(lastnum+1, num):
print("Missing: {}.png".format(str(i)))
lastnum = num
The main advantage of working this way is that as long as your files are sorted in the list, it can handle starting at numbers other than 1, and also reports more than one missing number in the sequence.

You can try this:
lines = ["abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png","normal /Users/alex/Documents/X-ray-classification/data/images/3.png","normal /Users/alex/Documents/X-ray-classification/data/images/4.png"]
maxvalue = 4 # or any other maximum value
missing = []
i = 0
for num in range(1, maxvalue+1):
if str(num) not in lines[i]:
missing.append(num)
else:
i += 1
print(missing)
Or if you want to check for the line ending with XXX.png:
lines = ["abnormal /Users/alex/Documents/X-ray-classification/data/images/1.png","normal /Users/alex/Documents/X-ray-classification/data/images/3.png","normal /Users/alex/Documents/X-ray-classification/data/images/4.png"]
maxvalue = 4 # or any other maximum value
missing = []
i = 0
for num in range(1, maxvalue+1):
if not lines[i].endswith(str(num) + ".png"):
missing.append(num)
else:
i += 1
print(missing)
Example: here

Python: read line after string is found

I have a file which contains blocks of lines that I would like to separate. Each block contains a number identifier in the block's header: "Block X" is the header line for the X-th block of lines. Like this:
Block X
#L E C A F X M N
11.2145 15 27 29.444444 7.6025229 1539742 29.419783
11.21451 13 28 24.607143 6.8247935 1596787 24.586264
...
Block Y
#L E C A F X M N
11.2145 15 27 29.444444 7.6025229 1539742 29.419783
11.21451 13 28 24.607143 6.8247935 1596787 24.586264
...
I can use "enumerate" to find the header line of the block as follows:
with open(filename,'r') as indata:
for num, line in enumerate(indata):
if 'Block X' in line:
startblock=num
print startblock
This will yield the line number of the first line of block #X.
However, my problem is identifying the last line of the block. To do that, I could find the next occurrence of a header line (i.e., the next block) and subtract a few numbers.
My question: how can I find the line number of a the next occurrence of a condition (i.e., right after a certain condition was met)?
I tried using enumerate again, this time indicating the starting value, like this:
with open(filename,'r') as indata:
for num, line in enumerate(indata,startblock):
if 'Block Y ' in line:
endscan=num
break
print endscan
That doesn't work, because it still begins reading the file from line 0, NOT from the line number "startblock". Instead, by starting the "enumerate" counter from a different number, the resulting value of the counter, in this case "endscan" is shifted from 0 by the amount "startblock".
Please, help! How can tell python to disregard the lines previous to "startblock"?

If you want the groups using Block as the delimiter for each section, you can use itertools.groupby:
from itertools import groupby
with open('test.txt') as f:
grp = groupby(f,key=lambda x: x.startswith("Block "))
for k,v in grp:
if k:
print(list(v) + list(next(grp, ("", ""))[1]))
Output:
['Block X\n', '#L E C A F X M N \n', '11.2145 15 27 29.444444 7.6025229 1539742 29.419783\n', '11.21451 13 28 24.607143 6.8247935 1596787 24.586264\n']
['Block Y\n', '#L E C A F X M N \n', '11.2145 15 27 29.444444 7.6025229 1539742 29.419783\n', '11.21451 13 28 24.607143 6.8247935 1596787 24.586264']
If Block can appear elsewhere but you want it only when followed by a space and a single char:
import re
with open('test.txt') as f:
r = re.compile("^Block \w$")
grp = groupby(f, key=lambda x: r.search(x))
for k, v in grp:
if k:
print(list(v) + list(next(grp, ("", ""))[1]))

You can use the .tell() and .seek() methods of file objects to move around. So for example:
with open(filename, 'r') as infile:
start = infile.tell()
end = 0
for line in infile:
if line.startswith('Block'):
end = infile.tell()
infile.seek(start)
# print all the bytes in the block
print infile.read(end - start)
# now go back to where we were so we iterate correctly
infile.seek(end)
# we finished a block, mark the start
start = end

If the difference between the header lines is uniform throughout the file, just use the distance to increase the indexing variable accordingly.
file1 = open('file_name','r')
lines = file1.readlines()
numlines = len(lines)
i=0
for line in file:
if line == 'specific header 1':
line_num1 = i
if line == 'specific header 2':
line_num2 = i
i+=1
diff = line_num2 - line_num1
Now that we know the difference between the line numbers we use for loops to acquire the data.
k=0
array = np.zeros([numlines, diff])
for i in range(numlines):
if k % diff == 0:
for j in range(diff):
array[i][j] = lines[i+j]
k+=1
% is the mod operator which returns 0 only when k is a multiple of the difference in line numbers between the two header lines in the file, which will only occur when the line corresponds to the a header line. Once the line is fixed we go on to the second for loop that fills the array so that we have a matrix that is numlines number of rows and a diff number of columns. The nonzeros rows will contain the data inbetween the header lines.
I have not tried this out, I am just writing off the top of my head. Hopefully it helps!

Reading coordinates from text file in python

I have a text file (coordinates.txt):
30.154852145,-85.264584254
15.2685169645,58.59854265854
...
I have a python script and there is a while loop inside it:
count = 0
while True:
count += 1
c1 =
c2 =
For every run of the above loop, I need to read each line (count) and set c1,c2 to the numbers of each line (separated by the comma). Can someone please tell me the easiest way to do this?
============================
import csv
count = 0
while True:
count += 1
print 'val:',count
for line in open('coords.txt'):
c1, c2 = map(float, line.split(','))
break
print 'c1:',c1
if count == 2: break

The best way to do this would be, as I commented above:
import csv
with open('coordinates.txt') as f:
reader = csv.reader(f)
for count, (c1, c2) in enumerate(reader):
# Do what you want with the variables.
# You'll probably want to cast them to floats.
I have also included a better way to use the count variable using enumerate, as pointed out by #abarnert.

f=open('coordinates.txt','r')
count =0
for x in f:
x=x.strip()
c1,c2 = x.split(',')
count +=1

Read in file contents to Arrays

I am a bit familiar with Python. I have a file with information that I need to read in a very specific way. Below is an example...
1
6
0.714285714286
0 0 1.00000000000
0 1 0.61356352337
...
-1 -1 0.00000000000
0 0 5.13787636499
0 1 0.97147643932
...
-1 -1 0.00000000000
0 0 5.13787636499
0 1 0.97147643932
...
-1 -1 0.00000000000
0 0 0 0 5.13787636499
0 0 0 1 0.97147643932
....
So every file will have this structure (tab delimited).
The first line must be read in as a variable as well as the second and third lines.
Next we have four blocks of code separated by a -1 -1 0.0000000000. Each block of code is 'n' lines long. The first two numbers represent the position/location that the 3rd number in the line is to be inserted in an array. Only the unique positions are listed (so, position 0 1 would be the same as 1 0 but that information would not be shown).
Note: The 4th block of code has a 4-index number.
What I need
The first 3 lines read in as unique variables
Each block of data read into an array using the first 2 (or 4 ) column of numbers as the array index and the 3rd column as the value being inserted into an array.
Only unique array elements shown. I need the mirrored position to be filled with the proper value as well (a 0 1 value should also appear in 1 0).
The last block would need to be inserted into a 4-dimensional array.

I rewrote the code. Now it's almost what you need. You only need fine tuning.
I decided to leave the old answer - perhaps it would be helpful too.
Because the new is feature-rich enough, and sometimes may not be clear to understand.
def the_function(filename):
"""
returns tuple of list of independent values and list of sparsed arrays as dicts
e.g. ( [1,2,0.5], [{(0.0):1,(0,1):2},...] )
on fail prints the reason and returns None:
e.g. 'failed on text.txt: invalid literal for int() with base 10: '0.0', line: 5'
"""
# open file and read content
try:
with open(filename, "r") as f:
data_txt = [line.split() for line in f]
# no such file
except IOError, e:
print 'fail on open ' + str(e)
# try to get the first 3 variables
try:
vars =[int(data_txt[0][0]), int(data_txt[1][0]), float(data_txt[2][0])]
except ValueError,e:
print 'failed on '+filename+': '+str(e)+', somewhere on lines 1-3'
return
# now get arrays
arrays =[dict()]
for lineidx, item in enumerate(data_txt[3:]):
try:
# for 2d array data
if len(item) == 3:
i, j = map(int, item[:2])
val = float(item[-1])
# check for 'block separator'
if (i,j,val) == (-1,-1,0.0):
# make new array
arrays.append(dict())
else:
# update last, existing
arrays[-1][(i,j)] = val
# almost the same for 4d array data
if len(item) == 5:
i, j, k, m = map(int, item[:4])
val = float(item[-1])
arrays[-1][(i,j,k,m)] = val
# if value is unparsable like '0.00' for int or 'text'
except ValueError,e:
print 'failed on '+filename+': '+str(e)+', line: '+str(lineidx+3)
return
return vars, arrays

As i anderstand what did you ask for..
# read data from file into list
parsed=[]
with open(filename, "r") as f:
for line in f:
# # you can exclude separator here with such code (uncomment) (1)
# # be careful one zero more, one zero less and it wouldn work
# if line == '-1 -1 0.00000000000':
# continue
parsed.append(line.split())
# a simpler version
with open(filename, "r") as f:
# # you can exclude separator here with such code (uncomment, replace) (2)
# parsed = [line.split() for line in f if line != '-1 -1 0.00000000000']
parsed = [line.split() for line in f]
# at this point 'parsed' is a list of lists of strings.
# [['1'],['6'],['0.714285714286'],['0', '0', '1.00000000000'],['0', '1', '0.61356352337'] .. ]
# ALT 1 -------------------------------
# we do know the len of each data block
# get the first 3 lines:
head = parsed[:3]
# get the body:
body = parsed[3:-2]
# get the last 2 lines:
tail = parsed[-2:]
# now you can do anything you want with your data
# but remember to convert str to int or float
# first3 as unique:
unique0 = int(head[0][0])
unique1 = int(head[1][0])
unique2 = float(head[2][0])
# cast body:
# check each item of body has 3 inner items
is_correct = all(map(lambda item: len(item)==3, body))
# parse str and cast
if is_correct:
for i, j, v in body:
# # you can exclude separator here (uncomment) (3)
# # * 1. is the same as float(1)
# if (i,j,v) == (0,0,1.):
# # here we skip iteration for line w/ '-1 -1 0.0...'
# # but you can place another code that will be executed
# # at the point where block-termination lines appear
# continue
some_body_cast_function(int(i), int(j), float(v))
else:
raise Exception('incorrect body')
# cast tail
# check each item of body has 5 inner items
is_correct = all(map(lambda item: len(item)==5, tail))
# parse str and cast
if is_correct:
for i, j, k, m, v in body: # 'l' is bad index, because similar to 1.
some_tail_cast_function(int(i), int(j), int(k), int(m), float(v))
else:
raise Exception('incorrect tail')
# ALT 2 -----------------------------------
# we do NOT know the len of each data block
# maybe we have some array?
array = dict() # your array may be other type
v1,v2,v2 = parsed[:3]
unique0 = int(v1[0])
unique1 = int(v2[0])
unique2 = float(v3[0])
for item in parsed[3:]:
if len(item) == 3:
i,j,v = item
i = int(i)
j = int(j)
v = float(v)
# # yo can exclude separator here (uncomment) (4)
# # * 1. is the same as float(1)
# # logic is the same as in 3rd variant
# if (i,j,v) == (0,0,1.):
# continue
# do your stuff
# for example,
array[(i,j)]=v
array[(j,i)]=v
elif len(item) ==5:
i, j, k, m, v = item
i = int(i)
j = int(j)
k = int(k)
m = int(m)
v = float(v)
# do your stuff
else:
raise Exception('unsupported') # or, maybe just 'pass'

To read lines from a file iteratively, you can use something like:
with open(filename, "r") as f:
var1 = int(f.next())
var2 = int(f.next())
var3 = float(f.next())
for line in f:
do some stuff particular to the line we are on...
Just create some data structures outside the loop, and fill them in the loop above. To split strings into elements, you can use:
>>> "spam ham".split()
['spam', 'ham']
I also think you want to take a look at the numpy library for array datastructures, and possible the SciPy library for analysis.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parsing text in python - python

No need to reinvent the wheel.. import numpy as np for xi, yi in np.loadtxt('blah.txt'): print(xi) print(yi)

count = 0 for x in open(filename): # strip removes all whitespace on the right (so the newline in this case) # split will break a string in two based on the passed parameter xi, yi = x.rstrip().split("\t") # multiple values can be assigned at once count += 1 return count

Related

Add new line after finding last string in a region

Finding missing lines in file

Python: read line after string is found

Reading coordinates from text file in python

Read in file contents to Arrays

Categories

Resources