Importing data from a text file using python - python

I have a text file containing data in rows and columns (~17000 rows in total). Each column is a uniform number of characters long, with the 'unused' characters filled in by spaces. For example, the first column is 11 characters long, but the last four characters in that column are always spaces (so that it appears to be a nice column when viewed with a text editor). Sometimes it's more than four if the entry is less than 7 characters.
The columns are not otherwise separated by commas, tabs, or spaces. They are also not all the same number of characters (the first two are 11, the next two are 8 and the last one is 5 - but again, some are spaces).
What I want to do is import the entries (which are numbers) in the last two columns if the second column contains the string 'OW' somewhere in it. Any help would be greatly appreciated.

Python's struct.unpack is probably the quickest way to split fixed-length fields. Here's a function that will lazily read your file and return tuples of numbers that match your criteria:
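import struct
def parsefile(filename):
    # Binary mode, because struct.unpack works on bytes in Python 3.
    with open(filename, 'rb') as myfile:
        for line in myfile:
            line = line.rstrip(b'\r\n')
            # Widths 11, 11, 8, 8 and 5: the line must be exactly 43 bytes long.
            fields = struct.unpack('11s11s8s8s5s', line)
            if b'OW' in fields[1]:
                yield (int(fields[3]), int(fields[4]))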
Usage:
if __name__ == '__main__':
    for field in parsefile('file.txt'):
        print(field)
Test data:
1234567890a1234567890a123456781234567812345
something  maybe OW d 111111118888888855555
aaaaa      bbbbb      1234    1212121233333
other thinganother OW 121212  6666666644444
Output:
(88888888, 55555)
(66666666, 44444)

In Python you can extract a substring at known positions using a slice - this is normally done with the list[start:end] syntax. However you can also create slice objects that you can use later to do the indexing.
So you can do something like this:
columns = [slice(11, 22), slice(30, 38), slice(38, 43)]
with open('some/file/path') as myfile:
    for line in myfile:
        fields = [line[column].strip() for column in columns]
        if "OW" in fields[0]:
            value1 = int(fields[1])
            value2 = int(fields[2])
            # ....
Separating out the slices into a list makes it easy to change the code if the data format changes, or you need to do stuff with the other fields.
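Here is a self-contained sketch of the slice approach, assuming the column widths given in the question (11, 11, 8, 8 and 5). The names matching_entries and make_line, and the sample rows, are made up for illustration only.

```python
# Column positions assumed from the question: widths 11, 11, 8, 8, 5.
SECOND = slice(11, 22)
FOURTH = slice(30, 38)
FIFTH = slice(38, 43)

def matching_entries(lines, marker="OW"):
    """Yield (fourth, fifth) as ints for lines whose second column contains marker."""
    for line in lines:
        if marker in line[SECOND]:
            yield int(line[FOURTH]), int(line[FIFTH])

def make_line(a, b, c, d, e):
    """Build a hypothetical sample row by padding each field to its width."""
    return a.ljust(11) + b.ljust(11) + c.ljust(8) + d.ljust(8) + e.ljust(5) + "\n"

sample = [
    make_line("something", "maybe OW d", "11111111", "88888888", "55555"),
    make_line("aaaaa", "bbbbb", "1234", "12121212", "33333"),
    make_line("other thing", "another OW", "121212", "66666666", "44444"),
]
print(list(matching_entries(sample)))
```

Because matching_entries accepts any iterable of lines, an open file object can be passed in directly instead of the sample list.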

Here's a function which might help you:
def rows(f, columnSizes):
    while True:
        row = {}
        for (key, size) in columnSizes:
            value = f.read(size)
            if len(value) < size: # EOF
                return
            row[key] = value
        yield row
For an example of how it's used:
from io import StringIO  # on Python 2: from StringIO import StringIO
# Each row is 10 characters long, newline included.
sample = StringIO("aaabbbccc\n"
                  "d  e  f  \n"
                  "g  h  i  \n")
for row in rows(sample, [('first', 3),
                         ('second', 3),
                         ('third', 4)]):
    print(repr(row))
Note that, unlike the other answers, this example is not line-delimited: it uses the file purely as a provider of characters, not as an iterator of lines. Since you specifically mentioned that the fields were not separated, I assumed that the rows might not be either. The newline is accounted for explicitly, as the last character of the last field.
You can test if one string is a substring of another with the 'in' operator. For example,
>>> 'OW' in 'hello'
False
>>> 'OW' in 'helOWlo'
True
So in this case, you might do
if 'OW' in row['third']:
    stuff()
but you can obviously test any field for any value as you see fit.

entries = ((float(line[30:38]), float(line[38:43])) for line in myfile if "OW" in line[11:22])
for num1, num2 in entries:
    pass  # whatever

entries = []
with open('my_file.txt', 'r') as f:
    for line in f.read().splitlines():
        # split() only works if every column is separated by whitespace
        line = line.split()
        if line[1].find('OW') >= 0:
            entries.append((int(line[-2]), int(line[-1])))
entries is a list containing tuples of the last two entries.

Related

A loop that is supposed to write lines to a file isn't working

I have a very large file that looks like this:
[original file][1]
field number 7 (info) contains ~100 pairs of X=Y separated by ';'.
I first want to split all X=Y pairs.
Next I want to scan one pair at a time, and if X is one of 4 titles and Y is an int, I want to put them in a dictionary.
After finishing going through the pairs I want to check if the dictionary contains all 4 of my titles, and if so, I want to calculate something and write it into a new file.
This is the part of my code which is supposed to do that:
for row in reader:
    m = re.split(';',row[7]) # split the info field by ';'
    d = {}
    nl = []
    for c in m: # for each info field, split by '=', if it is one of the 4 fields wanted and the value is int- add it to a dict
        t = re.split('=',c)
        if (t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE') and type(t[1])==int:
            d[t[0]] = t[1]
    if 'AC_MALE' in d and 'AC_FEMALE' in d and 'AN_MALE' in d and 'AN_FEMALE' in d: # if the dict contain all 4 wanted fields- make a new line for the final file
        total_ac = int(d['AC_MALE'])+ int(d['AC_FEMALE'])
        total_an = int(d['AN_MALE'])+ int(d['AN_FEMALE'])
        ac_an = total_ac/total_an
        nl.extend([row[0],row[1],row[3],row[4],total_ac,total_an, ac_an])
        writer.writerow(nl)
The code runs with no errors but isn't writing anything to the file.
Can someone figure out why?
Thanks!
type(t[1])==int is never true. t[1] is a string, always, because you just split that object from another string. It doesn't matter here if the string contains only digits and could be converted to an int.
Test if you can convert your string to an integer, and if that fails, just move on to the next. If it succeeds, add the value to your dictionary:
for c in m:
    t = re.split('=',c)
    if (t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE'):
        try:
            d[t[0]] = int(t[1])
        except ValueError:
            # string could not be converted, so move on
            pass
Note that you don't need to use re.split(); use the standard str.split() method instead. You don't need to test if all keys are present in your dictionary afterwards, just test if the dictionary contains 4 elements, so has a length of 4. You can also simplify the code to test the key name:
for row in reader:
    d = {}
    for key_value in row[7].split(';'):  # the pairs are separated by ';'
        # partition() also copes with entries that contain no '='
        key, _, value = key_value.partition('=')
        if key in {'AC_MALE', 'AC_FEMALE', 'AN_MALE', 'AN_FEMALE'}:
            try:
                d[key] = int(value)
            except ValueError:
                pass
    if len(d) == 4:
        total_ac = d['AC_MALE'] + d['AC_FEMALE']
        total_an = d['AN_MALE'] + d['AN_FEMALE']
        ac_an = total_ac / total_an
        writer.writerow([
            row[0], row[1], row[3], row[4],
            total_ac, total_an, ac_an])
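To try the parsing step on its own, without the CSV reader and writer, something like the following sketch may help. The function name parse_info and the sample info string are invented for illustration; only the four key names come from the question.

```python
WANTED = {"AC_MALE", "AC_FEMALE", "AN_MALE", "AN_FEMALE"}

def parse_info(info, wanted=WANTED):
    """Return {key: int(value)} for wanted keys whose values parse as integers."""
    result = {}
    for pair in info.split(";"):
        key, sep, value = pair.partition("=")
        if sep and key in wanted:
            try:
                result[key] = int(value)
            except ValueError:
                pass  # value is not an integer: skip this pair
    return result

# Hypothetical info field, including a non-integer value and a bare flag.
info = "AC_MALE=3;AC_FEMALE=5;AN_MALE=10;AN_FEMALE=6;DP=abc;FLAG"
d = parse_info(info)
if len(d) == len(WANTED) and d["AN_MALE"] + d["AN_FEMALE"] != 0:
    total_ac = d["AC_MALE"] + d["AC_FEMALE"]
    total_an = d["AN_MALE"] + d["AN_FEMALE"]
    print(total_ac, total_an, total_ac / total_an)
```

The extra check on the denominator is an addition of this sketch; it avoids a ZeroDivisionError when both AN values are 0.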

eliminate text after certain character in python pipeline- with slice?

This is a short script I've written to refine and validate a large dataset that I have.
# The purpose of this script is the refinement of the job data attained from the
# JSI as it is rendered by the `csv generator` contributed by Luis for purposes
# of presentation on the dashboard map.
import csv
# The number of columns
num_headers = 9
# Remove invalid characters from records
def url_escaper(data):
    for line in data:
        yield line.replace('&','&amp;')
# Be sure to configure input & output files
with open("adzuna_input_THRESHOLD.csv", 'r') as file_in, open("adzuna_output_GO.csv", 'w') as file_out:
    csv_in = csv.reader( url_escaper( file_in ) )
    csv_out = csv.writer(file_out)
    # Get rid of rows that have the wrong number of columns
    # and rows that have only whitespace for a columnar value
    for i, row in enumerate(csv_in, start=1):
        if not [e for e in row if not e.strip()]:
            if len(row) == num_headers:
                csv_out.writerow(row)
        else:
            print("line %d is malformed" % i)
I have one field that is structured like so:
finance|statistics|lisp
I've seen ways to do this using other utilities like R, but I want to ideally achieve the same effect within the scope of this python code.
Maybe I can iterate over all the characters of all the columnar values, perhaps as a list, and if I see a | I can dispose of the | and all the text that follows it within the scope of the column value.
I think surely it can be achieved with slices as they do here but I don't quite understand how the indices with slices work- and I can't see how I could include this process harmoniously within the cascade of the current script pipeline.
With regex I guess it's something like this
(?:|)(.*)
Why not use string's split method?
In[4]: 'finance|statistics|lisp'.split('|')[0]
Out[4]: 'finance'
It also does not raise an exception when the separator character is missing from the string:
In[5]: 'finance/statistics/lisp'.split('|')[0]
Out[5]: 'finance/statistics/lisp'
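To fit this into a row-by-row pipeline like the one in the question, one option is a small generator that cuts a single column at the first separator. This is only a sketch: first_token, truncate_column and the sample rows are hypothetical names and data, and the column index has to be adapted to the real file.

```python
def first_token(value, sep="|"):
    """Return the text before the first sep, or the whole value if sep is absent."""
    return value.split(sep, 1)[0]

def truncate_column(rows, index, sep="|"):
    """Yield a copy of each row with the column at `index` cut at the first sep."""
    for row in rows:
        row = list(row)
        if index < len(row):
            row[index] = first_token(row[index], sep)
        yield row

# Hypothetical rows; column 1 holds the '|'-separated field.
rows = [["1", "finance|statistics|lisp", "London"],
        ["2", "engineering", "Leeds"]]
print(list(truncate_column(rows, 1)))
```

Since csv.reader yields lists of strings, the reader object could be wrapped as truncate_column(csv_in, 1) before the validation loop, with 1 standing in for the real column position.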

Python - Splitting a large string by number of delimiter occurrences

I'm still learning Python, and I have a question I haven't been able to solve. I have a very long string (millions of lines long) which I would like to split into a smaller string based on a specified number of occurrences of a delimiter.
For instance:
ABCDEF
//
GHIJKLMN
//
OPQ
//
RSTLN
//
OPQR
//
STUVW
//
XYZ
//
In this case I would want to split based on "//" and return a string of all lines before the nth occurrence of the delimiter.
So an input of splitting the string by // by 1 would return:
ABCDEF
an input of splitting the string by // by 2 would return:
ABCDEF
//
GHIJKLMN
an input of splitting the string by // by 3 would return:
ABCDEF
//
GHIJKLMN
//
OPQ
And so on... However, the length of the original 2 million line string appeared to be a problem when I simply tried to split the entire string by "//" and just work with the individual indexes (I was getting a memory error). Perhaps Python can't handle so many lines in one split? So I can't do that.
I'm looking for a way that doesn't require splitting the entire string into a hundred thousand pieces when I may only need 100, but instead starts from the beginning, stops at a certain point and returns everything before it, which I assume may also be faster. I hope my question is as clear as possible.
Is there a simple or elegant way to achieve this? Thanks!
If you want to work with files instead of strings in memory, here is another answer.
This version is written as a function that reads lines and immediately prints them out until the specified number of delimiters have been found (no extra memory needed to store the entire string).
def file_split(file_name, delimiter, n=1):
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip() # use .rstrip("\n") to only strip newlines
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            print(line)
file_split('data.txt', '//', 3)
You can use this to write the output to a new file like this:
python split.py > newfile.txt
With a little extra work, you can use argparse to pass parameters to the program.
A more efficient way is to read only the first N entries. If you are sure that every entry is a single line followed by a delimiter line, the first N entries occupy the first 2*N-1 lines, and you can use itertools.islice to do the job:
from itertools import islice
with open('filename') as f:
    # list() reads the lines before the file is closed
    lines = list(islice(f, 0, 2*N-1))
The method that comes to my mind when I read your question uses a for loop:
you cut the string up into several pieces (for example the 100 you mentioned) and iterate through each piece.
thestring = "" # your string
steps = 100 # length of the strings you are going to use for iteration
log = 0
substring = thestring[:log+steps] # this is the string you will split and iterate through
thelist = substring.split("//")
for element in thelist:
    if wanted(element): # wanted() is a placeholder for your own test
        pass # do your thing with the line
    else:
        log = log+steps
        # and go again from the start, only with this offset
Now you can go through all the elements of the whole 2 million(!) line string.
The best thing to do here is actually to make a recursive function from this (if that is what you want):
thestring = "" # your string
steps = 100 # length of the strings you are going to use for iteration
def iterateThroughHugeString(beginning):
    substring = thestring[:beginning+steps] # this is the string you will split and iterate through
    thelist = substring.split("//")
    for element in thelist:
        if wanted(element): # wanted() is a placeholder for your own test
            pass # do your thing with the line
        else:
            iterateThroughHugeString(beginning+steps)
            # and go again from the start, only with this offset
For instance:
i = 0
s = ""
fd = open("...")
for l in fd:
    if l[:-1] == delimiter: # skip last '\n'
        i += 1
    if i >= max_split:
        break
    s += l
fd.close()
Since you are learning Python, modelling a complete dynamic solution would be a good challenge. Here's a notion of how you can model one.
Note: The following code snippet only works for files in the given format (see the 'For instance' in the question). Hence, it is a static solution.
num = (int(input("Enter number of delimiters: ")) * 2)
with open("./data.txt") as myfile:
    print ([next(myfile) for x in range(num-1)])
Now that you have the idea, you can use pattern matching and so on.
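If the text really is one big string in memory, as in the question, str.find can locate the nth delimiter without splitting everything. The sketch below assumes the delimiter never occurs inside the data itself; the function name before_nth and the sample data are made up for illustration.

```python
def before_nth(text, delimiter, n):
    """Return the part of text before the n-th occurrence of delimiter.

    If text holds fewer than n delimiters, the whole text is returned.
    """
    pos = 0
    start = 0
    for _ in range(n):
        pos = text.find(delimiter, start)
        if pos == -1:
            return text
        start = pos + len(delimiter)  # continue searching after this match
    return text[:pos]

data = "ABCDEF\n//\nGHIJKLMN\n//\nOPQ\n//\n"
print(before_nth(data, "//", 2))
```

Only the prefix that is returned gets copied, so nothing is built for the part of the string after the nth delimiter.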

How do I format the output of a list of lists into a text file properly?

I am really new to Python and I am struggling with some problems while working on a student project. Basically I try to read data from a text file which is formatted in columns. I store the data in a list of lists, sort and manipulate the data, and write it into a file again. My problem is aligning the written data in proper columns. I found some approaches like
"%i, %f, %e" % (1000, 1000, 1000)
but I don't know how many columns there will be. So I wonder if there is a way to set all columns to a fixed width.
This is what the input data looks like:
2 232.248E-09 74.6825 2.5 5.00008 499.482
5 10. 74.6825 2.5 -16.4304 -12.3
This is how I store the data in a list of lists:
filename = getInput('MyPath', workdir)
lines = []
f = open(filename, 'r')
while 1:
    line = f.readline()
    if line == '':
        break
    splitted = line.split()
    lines.append(splitted)
f.close()
To write the data, I first join all the elements of each row into one string, with a fixed gap between the elements. What I need instead is a fixed total width per column, including the element itself. I also don't know the number of columns in the file in advance.
for k in xrange(len(lines)):
    stringlist=""
    for i in lines[k]:
        stringlist = stringlist+str(i)+' '
    lines[k] = stringlist+'\n'
f = open(workdir2, 'w')
for i in range(len(lines)):
    f.write(lines[i])
f.close()
This code works basically, but sadly the output isn't formatted properly.
Thank you very much in advance for any help on this issue!
You are absolutely right about being able to format widths as you have above using string formatting. But as you correctly point out, the tricky bit is doing this for a variable-sized output list. Instead, you could use the join() function:
output = ['a', 'b', 'c', 'd', 'e']
# format each column (len(output)) with a width of 10 spaces
width = [10]*len(output)
# write it out, using the join() function
with open('output_example', 'w') as f:
    f.write(''.join('%*s' % i for i in zip(width, output)))
will write out:
'         a         b         c         d         e'
As you can see, the length of the format array width is determined by the length of the output, len(output). This is flexible enough that you can generate it on the fly.
Hope this helps!
String formatting might be the way to go:
>>> print("%10s%9s" % ("test1", "test2"))
     test1    test2
Though you might want to first create strings from those numbers and then format them as I showed above.
I cannot fully follow your writing code, but try something along these lines:
# enumerate() is a builtin, so no import is needed
with open(workdir2, 'w') as datei:
    for key, item in enumerate(zeilen):  # zeilen: your list of lines
        # adjust the format specification to your data
        line = "%4i %s\n" % (key, item)
        datei.write(line)
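When the number of columns is not known in advance, the widths can be computed from the data itself. The following is a minimal sketch, assuming every cell can be turned into text with str(); format_rows and the padding parameter are names chosen here, and the sample rows are the two input lines from the question.

```python
def format_rows(rows, padding=2):
    """Right-align every column to the width of its longest entry plus padding."""
    if not rows:
        return []
    n_cols = max(len(row) for row in rows)
    widths = [0] * n_cols
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(str(cell)))
    lines = []
    for row in rows:
        cells = [str(cell).rjust(widths[i] + padding) for i, cell in enumerate(row)]
        lines.append("".join(cells))
    return lines

rows = [["2", "232.248E-09", "74.6825", "2.5", "5.00008", "499.482"],
        ["5", "10.", "74.6825", "2.5", "-16.4304", "-12.3"]]
for line in format_rows(rows):
    print(line)
```

Each returned line can be written to the output file followed by '\n'. For one width shared by all columns, replace the computed widths with a constant.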

How can I count the number of lines between two markers in a file with Python?

Hi,
I'm new to Python and I'm using Python 3.2.
I have a file which has some sort of format like this:
Number of segment pairs = 108570; number of pairwise comparisons = 54234
'+' means given segment; '-' means reverse complement
Overlaps Containments No. of Constraints Supporting Overlap
******************* Contig 1 ********************
E_180+
E_97-
******************* Contig 2 ********************
E_254+
E_264+ is in E_254+
E_276+
******************* Contig 3 ********************
E_256-
E_179-
I want to count the number of non-empty lines between the ***** Contig # ***** markers,
and I want to get a result like this:
contig1=2
contig2=3
contig3=2
Probably, it's best to use regular expressions here. You can try the following:
import re
with open(filename) as f:
    text = f.read()
pairs = re.findall(r'\*+ (Contig \d+) \*+\n([^*]*)', text)
pairs is a list of tuples, where the tuples have the form ('Contig x', '...').
The second component of each tuple contains the text after the marker.
Afterwards, you could count the number of '\n' in those texts; most easily this can be done via a list comprehension:
[(contig, txt.count('\n')) for (contig,txt) in pairs]
(edit: if you don't want to count empty lines you can try:
[(contig, txt.count('\n')-txt.count('\n\n')) for (contig,txt) in pairs]
)
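As a runnable sketch of the regular-expression approach, the version below counts non-empty lines by splitting each captured block rather than counting '\n'. The sample text is a shortened, made-up copy of the file in the question, and count_contig_lines is a name chosen for illustration. It assumes that no data line contains a '*'.

```python
import re

sample = """header line
******************* Contig 1 ********************
E_180+
E_97-

******************* Contig 2 ********************
E_254+
E_264+ is in E_254+
E_276+
******************* Contig 3 ********************
E_256-
E_179-
"""

def count_contig_lines(text):
    """Return [(contig name, number of non-empty lines below its header), ...]."""
    pairs = re.findall(r'\*+ (Contig \d+) \*+\n([^*]*)', text)
    return [(name, sum(1 for line in body.splitlines() if line.strip()))
            for name, body in pairs]

for name, count in count_contig_lines(sample):
    # Turn 'Contig 1' into 'contig1' to match the requested output format.
    print("%s=%d" % (name.replace(" ", "").lower(), count))
```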
def give(filename):
    with open(filename) as f:
        for line in f:
            if 'Contig' in line:
                category = line.strip('* \r\n')
                break
        cnt = 0
        aim = []
        for line in f:
            if 'Contig' in line:
                yield (category+'='+str(cnt),aim)
                category = line.strip('* \r\n')
                cnt = 0
                aim= []
            elif line.strip():
                cnt+=1
                if 'is in' in line:
                    aim.append(line.strip())
        yield (category+'='+str(cnt),aim)
for a,b in give('input.txt'):
    print(a)
    if b: print(b)
result
Contig 1=2
Contig 2=3
['E_264+ is in E_254+']
Contig 3=2
The function give() isn't a normal function, it is a generator function. See the docs, and if you have questions, I will answer.
strip() is a method that removes characters from the beginning and the end of a string.
When used without an argument, strip() removes whitespace (that is to say \f \n \r \t \v and the blank space). When a string is passed as an argument, every character of that argument found at either end of the treated string is removed. The order of the characters in the argument doesn't matter: the argument doesn't designate a string but a set of characters to be removed.
line.strip() is a way to test whether a line contains any non-whitespace characters.
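A few calls illustrate this behaviour; the header string is just an example value taken from the file format above.

```python
header = "******** Contig 2 *********\n"

# Without an argument, only surrounding whitespace (here the newline) is removed.
print(repr(header.strip()))

# With an argument, every listed character is removed from both ends.
print(repr(header.strip('* \r\n')))

# Characters in the middle of the string are never touched.
print(repr("a*b".strip('*')))

# A whitespace-only line strips down to the empty string, which is falsy.
print(bool("   \n".strip()))
```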
The fact that elif line.strip(): comes after the line if 'Contig' in line: , and that it is written elif and not if, is important: otherwise, line.strip() would also be true for a header line such as
******** Contig 2 *********\n
I suppose that you will be interested in the content of lines like this one:
E_264+ is in E_254+
because it is this kind of line that makes a difference in the counts.
So I edited my code so that the function give() also returns the content of such lines.
