I have a file that looks like this
N1 1.023 2.11 3.789
Cl1 3.124 2.4534 1.678
Cl2 # # #
Cl3 # # #
Cl4
Cl5
N2
Cl6
Cl7
Cl8
Cl9
Cl10
N3
Cl11
Cl12
Cl13
Cl14
Cl15
The three numbers continue down throughout.
What I would like to do is pretty much a permutation. These are 3 data sets: set 1 is N1-Cl5, set 2 is N2-Cl10, and set 3 is N3 to the end.
I want every combination of N's and Cl's. For example, the first output would be
Cl1
N1
Cl2
then everything else the same. The next set would be Cl1, Cl2, N1, Cl3... and so on.
I have some code, but it won't do what I want because it wouldn't know that there are three individual data sets. Should I put the three data sets in three different files and then combine them, using code like:
list1 = ['Cl1', 'Cl2', 'Cl3', 'Cl4', 'Cl5']
out_file = open('file.txt', 'w')
for line in file1:
    line = line.replace('N1', list1[0])  # replace() returns a new string
    list1.pop(0)
    print >> out_file, line,
Or is there a better way? Thanks in advance.
This should do the trick:
from itertools import permutations
def print_permutations(in_file):
    separators = ['N1', 'N2', 'N3']
    cur_separator = None
    related_elements = []
    with open(in_file, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split Nx and Clx from the numbers.
            value = line.split()[0]
            if value in separators:
                # Found a new Nx. Print the previous group's permutations.
                if cur_separator and related_elements:
                    for perm in permutations([cur_separator] + related_elements):
                        print perm
                cur_separator = value
                related_elements = []
            else:
                # Found a new Clx. Append it to the current group.
                related_elements.append(value)
    # Don't forget to flush the final group.
    if cur_separator and related_elements:
        for perm in permutations([cur_separator] + related_elements):
            print perm
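A hypothetical call, assuming the data shown above is saved as data.txt:
print_permutations('data.txt')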
You could use regex to find the line numbers of the "N" patterns and then slice the file using those line numbers:
import re

n_pat = re.compile(r'N\d')
N_matches = []
with open(sample, 'r') as f:
    for num, line in enumerate(f):
        match = n_pat.match(line)
        if match:
            N_matches.append((num, match.group()))
>>> N_matches
[(0, 'N1'), (12, 'N2'), (24, 'N3')]
After you figure out the line numbers where these patterns appear, you can then use itertools.islice to break the file up into a list of lists:
import itertools

first = N_matches[0][0]
final = N_matches[-1][0]
step = N_matches[1][0] - N_matches[0][0]  # lines per data set
dataset = []
locallist = []
while first < final + step:
    with open(sample, 'r') as f:
        for item in itertools.islice(f, first, first + step):
            if item.strip():
                locallist.append(item.strip())
    dataset.append(locallist)
    locallist = []
    first += step
itertools.islice is a really nice way to take a slice of an iterable. Here's the result of the above on a sample:
>>> dataset
[['N1 1.023 2.11 3.789', 'Cl1 3.126 2.6534 1.878', 'Cl2 3.124 2.4534 1.678', 'Cl3 3.924 2.1134 1.1278', 'Cl4', 'Cl5'], ['N2', 'Cl6 3.126 2.6534 1.878', 'Cl7 3.124 2.4534 1.678', 'Cl8 3.924 2.1134 1.1278', 'Cl9', 'Cl10'], ['N3', 'Cl11', 'Cl12', 'Cl13', 'Cl14', 'Cl15']]
After that, I'm a bit hazy on what you're seeking to do, but I think you want permutations of each sublist of the dataset? If so, you can use itertools.permutations to find permutations on various sublists of dataset:
for item in itertools.permutations(dataset[0]):
    print(item)
etc.
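To run it over all three data sets rather than just dataset[0], a small loop works (a sketch, assuming dataset is built as above):
for sublist in dataset:
    for item in itertools.permutations(sublist):
        print(item)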
Final Note:
Assuming I understand correctly what you're doing, the number of permutations is going to be huge. You can calculate how many permutations there are by taking the factorial of the number of items: anything over 10 items (10!) is going to produce over 3,000,000 permutations.
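To see how quickly that grows, math.factorial gives the exact counts:
import math
print(math.factorial(6))   # 720
print(math.factorial(10))  # 3628800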
Related
I'm a beginner programmer, and I'm trying to figure out how to create a 2d nested list (grid) from a particular text file. For example, the text file would look like this:
3
3
150
109
80
892
123
982
0
98
23
The first two lines in the text file would be used to create the grid, meaning that it is 3x3. The next 9 lines would be used to populate the grid, with the first 3 making up the first row, the next 3 making up the middle row, and the final 3 making up the last row. So the nested list would look like this:
[[150, 109, 80], [892, 123, 982], [0, 98, 23]]
How do I go about doing this? I was able to make a list of all of the contents, but I can't figure out how to use the first 2 lines to define the size of the inner lists within the outer list:
lineContent = []
innerList = ?
for lines in open('document.txt', 'r'):
    value = int(lines)
    lineContent.append(value)
From here, where do I go to turn it into a nested list using the given values on the first 2 lines?
Thanks in advance.
You can make this quite neat using list comprehension.
def txt_grid(your_txt):
    with open(your_txt, 'r') as f:
        # Find columns and rows
        columns = int(f.readline())
        rows = int(f.readline())
        your_list = [[f.readline().strip() for i in range(rows)] for j in range(columns)]
    return your_list

print(txt_grid('document.txt'))
strip() just clears the newline characters (\n) from each line before storing them in the list.
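If you want integers rather than strings, matching the [[150, 109, 80], ...] output in the question, a minimal variant (hypothetical name txt_grid_ints) casts each value as it is read:
def txt_grid_ints(your_txt):
    # Same idea as txt_grid, but cast each value to int
    with open(your_txt, 'r') as f:
        columns = int(f.readline())
        rows = int(f.readline())
        return [[int(f.readline()) for i in range(rows)] for j in range(columns)]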
Edit: a modified version with logic for the case where your txt file doesn't have enough rows for the defined dimensions.
def txt_grid(your_txt):
    with open(your_txt, 'r') as f:
        # Find columns and rows
        columns = int(f.readline())
        rows = int(f.readline())
        dimensions = columns * rows
        # Read the remaining lines once (the first two have already been consumed,
        # and reading f again later would find an exhausted iterator)
        lines = [line.strip() for line in f if line.strip()]
        # Test to see if there are enough values, creating the grid if there are
        if len(lines) < dimensions:
            # Either raise an error
            # raise ValueError("Insufficient non-empty rows in text file for given dimensions")
            # Or return something that's not a list
            your_list = None
        else:
            # Creating grid
            your_list = [lines[j * rows:(j + 1) * rows] for j in range(columns)]
    return your_list

print(txt_grid('document.txt'))
def parse_txt(filepath):
    lineContent = []
    with open(filepath, 'r') as txt:  # The with statement closes the txt file after it's been used
        nrows = int(txt.readline())
        ncols = int(txt.readline())
        for i in range(nrows):  # For each row
            row = []
            for j in range(ncols):  # Grab each value in the row
                row.append(int(txt.readline()))
            lineContent.append(row)
    return lineContent

grid_2d = parse_txt('document.txt')
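For the sample document.txt above, this should give:
>>> grid_2d
[[150, 109, 80], [892, 123, 982], [0, 98, 23]]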
lineContent = []
for lines in open('testQuestion.txt', 'r'):
    value = int(lines)
    lineContent.append(value)

rowSz = lineContent[0]  # row size
colSz = lineContent[1]  # column size
# Delete the first two entries so lineContent holds just the matrix values
# (you could also just start currentLine at 2). Index 0 is deleted twice
# because the second element shifts down after the first delete.
del lineContent[0], lineContent[0]
# Ensure there are enough entries to fill an array of rowSz * colSz elements
assert rowSz * colSz == len(lineContent), 'not enough values for array'
arr = []
currentLine = 0
for x in range(rowSz):
    arr.append([])
    for y in range(colSz):
        arr[x].append(lineContent[currentLine])
        currentLine += 1
print(arr)
I have a text file with twenty car prices and their serial numbers; there are 50 lines in this file. I would like to find the max car price and its serial number for every 10 lines.
priceandserial.txt
102030 4000.30
102040 5000.40
102080 5500.40
102130 4000.30
102140 5000.50
102180 6000.50
102230 2000.60
102240 4000.30
102280 6000.30
102330 9000.70
102340 1000.30
102380 3000.30
102430 4000.80
102440 5000.30
102480 7000.30
When I tried NumPy's max function, I got 102480 as the max value, because the serial numbers are larger than any of the prices:
x = np.loadtxt('carserial.txt', unpack=True)
print('Max:', np.max(x))
Desired result:
102330 9000.70
102480 7000.30
There are 50 lines in the file, so I should get a 5-line result with the serial and max price of each 10-line block.
Respectfully, I think the first solution is over-engineered. You don't need numpy or math for this task, just a dictionary. As you loop through, you update the dictionary if the latest value is greater than the current value, and do nothing if it isn't. Every 10th item, you append the values from the dictionary to an output list and reset the buffer.
with open('filename.txt', 'r') as opened_file:
    data = opened_file.read()

rowsplitdata = data.split('\n')
colsplitdata = [u.split(' ') for u in rowsplitdata if u]  # skip empty lines
x = [[int(j[0]), float(j[1])] for j in colsplitdata]

output = []
buffer = {"max": 0, "index": 0}
count = 0
# This assumes x is a list of lists, not a numpy array
for u in x:
    count += 1
    if u[1] > buffer["max"]:
        buffer["max"] = u[1]
        buffer["index"] = u[0]
    if count == 10:
        output.append([buffer["index"], buffer["max"]])
        buffer = {"max": 0, "index": 0}
        count = 0
# Append the remainder of the buffer in case you didn't get to ten in the final pass
if count:
    output.append([buffer["index"], buffer["max"]])

output
[[102330, 9000.7], [102480, 7000.3]]
You should iterate over it and extract the maximum of each 10 lines:
# New empty list for collecting the results
max_list = []
# Iterate through x in chunks of 10
for i in range(0, len(x), 10):
    # Append the max of the next chunk; the slice safely
    # handles a final chunk with fewer than 10 elements
    max_list.append(np.max(x[i:i+10]))
This should do your job.
number_list = [[], []]
with open('filename.txt', 'r') as opened_file:
    for line in opened_file:
        if len(line.split()) == 0:
            continue
        else:
            a, b = line.split(" ")
            # Convert to numbers so max() compares numerically,
            # not lexicographically on strings
            number_list[0].append(int(a))
            number_list[1].append(float(b))

col1_max, col2_max = max(number_list[0]), max(number_list[1])
col1_max, col2_max
Just change the filename. col1_max and col2_max hold the respective column's max values. You can edit the code to accommodate more columns.
You can transpose your input first, then use np.split, and for each submatrix calculate its max:
import numpy as np

x = np.genfromtxt('carserial.txt', unpack=True).T
print(x)
for submatrix in np.split(x, len(x)//10):
    print(max(submatrix, key=lambda l: l[1]))
I'm new to programming and Python, and I'm looking for a way to distinguish between two input formats in the same input text file. For example, let's say I have an input file like so, where values are comma-separated:
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
Where the format is N followed by N lines of Data1, and M followed by M lines of Data2. I tried opening the file, reading it line by line, and storing it in one single list, but I'm not sure how to go about producing 2 lists for Data1 and Data2, such that I would get:
Data1 = ["Washington,A,10", "New York,B,20", "Seattle,C,30", "Boston,B,20", "Atlanta,D,50"]
Data2 = ["New York,5", "Boston,10"]
My initial idea was to iterate through the list until I found an integer i, remove the integer from the list and continue for the next i iterations all while storing the subsequent values in a separate list, until I found the next integer and then repeat. However, this would destroy my initial list. Is there a better way to separate the two data formats in different lists?
You could use itertools.islice and a list comprehension:
from itertools import islice

string = """
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
"""

result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
          for parts in [string.split("\n")]
          for idx, line in enumerate(parts)
          if line.isdigit()]

print(result)
This yields
[['Washington,A,10', 'New York,B,20', 'Seattle,C,30', 'Boston,B,20', 'Atlanta,D,50'], ['New York,5', 'Boston,10']]
For a file, you need to change it to:
with open("testfile.txt", "r") as f:
result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
for parts in [f.read().split("\n")]
for idx, line in enumerate(parts)
if line.isdigit()]
print(result)
You're definitely on the right track.
If you want to preserve the original list here, you don't actually have to remove integer i; you can just go on to the next item.
Code:
originalData = []
formattedData = []

with open("data.txt", "r") as f:
    f = list(f)
    originalData = f
    i = 0
    while i < len(f):  # Iterate through every line
        try:
            n = int(f[i])  # See if line can be cast to an integer
            originalData[i] = n  # Change string to int in original
            formattedData.append([])
            for j in range(n):
                i += 1
                item = f[i].replace('\n', '')
                originalData[i] = item  # Remove newline char in original
                formattedData[-1].append(item)
        except ValueError:
            print("File has incorrect format")
        i += 1

print(originalData)
print(formattedData)
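For the sample file above, this should print something like:
[5, 'Washington,A,10', 'New York,B,20', 'Seattle,C,30', 'Boston,B,20', 'Atlanta,D,50', 2, 'New York,5', 'Boston,10']
[['Washington,A,10', 'New York,B,20', 'Seattle,C,30', 'Boston,B,20', 'Atlanta,D,50'], ['New York,5', 'Boston,10']]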
The following code will produce a list results which is equal to [Data1, Data2].
The code assumes that the number of entries specified is exactly the number of lines that follow; for a file like this, it will not work:
2
New York,5
Boston,10
Seattle,30
The code:
# Get the data from the text file
with open('filename.txt', 'r') as file:
    lines = file.read().splitlines()

results = []
index = 0
while index < len(lines):
    # Find the start and end values
    start = index + 1
    end = start + int(lines[index])
    # Everything from start up to and excluding end gets added
    results.append(lines[start:end])
    # Update the index
    index = end
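For the sample input above, results holds both lists:
>>> results
[['Washington,A,10', 'New York,B,20', 'Seattle,C,30', 'Boston,B,20', 'Atlanta,D,50'], ['New York,5', 'Boston,10']]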
I have a target file called TARGFILE of the form:
10001000020002002001100100200000111
10201001020000120210101100110010011
02010010200000011100012021001012021
00102000012001202100101202100111010
My idea here was to leave this as a string, and use slicing in python to remove the indices.
The removal will occur based on a list of integers called INDICES like so:
[1, 115654, 115655, 115656, 2, 4, 134765, 134766, 18, 20, 21, 23, 24, 17659, 92573, 30, 32, 88932, 33, 35, 37, 110463, 38, 18282, 46, 18458, 48, 51, 54]
I want to remove every position of every line in TARGFILE that matches with INDICES. For instance, the first digit in INDICES is 1, so the first column of TARGFILE, containing 1, 1, 0, 0, would be removed. However, I am wary of doing this incorrectly due to off-by-one errors and changing index positions if everything is not removed at the same time.
Thus, a solution that removed every column from each row at the same time would likely be both much faster and safer than using a nested loop, but I am unsure of how to code this.
My code so far is here:
#!/usr/bin/env python

SRC_FILES = open('YCP.txt', 'r')
for line in SRC_FILES:
    EUR_YRI_ADM = line.strip('\n')
    EUR, YRI, ADM = EUR_YRI_ADM.split(' ')
    ADMFO = open(ADM, 'r')
    lines = ADMFO.readlines()
    INDICES = [int(val) for val in lines[0].split()]
    TARGFILE = open(EUR, 'r')
It seems to me that a solution using enumerate might be possible, but I have not found it, and that might be suboptimal in the first place...
EDIT: in response to concerns about memory: the longest lines are ~180,000 items, but I should be able to get this into memory without a problem, I have access to a cluster.
I like the simplicity of Peter's answer, even though it's currently off-by-one. My thought is that you can get rid of the index-shifting problem, by sorting INDICES, and doing the process from the back to the front. That led to remove_indices1, which is really inefficient. I think 2 is better, but simplest is 3, which is Peter's answer.
I may do timing in a bit for some large numbers, but my intuition says that my remove_indices2 will be faster than Peter's remove_indices3 if INDICES is very sparse. (Because you don't have to iterate over each character, but only over the indices that are being deleted.)
BTW - If you can sort INDICES once, then you don't need to make the local copy to sort/reverse, but I didn't know if you could do that.
rows = [
    '0000000001111111111222222222233333333334444444444555555555566666666667',
    '1234567890123456789012345678901234567890123456789012345678901234567890',
]

def remove_nth_character(row, n):
    # n is 1-based
    return row[:n-1] + row[n:]

def remove_indices1(row, indices):
    local_indices = indices[:]
    retval = row
    local_indices.sort()
    local_indices.reverse()
    for i in local_indices:
        retval = remove_nth_character(retval, i)
    return retval

def remove_indices2(row, indices):
    local_indices = indices[:]
    local_indices.sort()
    local_indices.reverse()
    front = row
    chunks = []
    for i in local_indices:
        chunks.insert(0, front[i:])
        front = front[:i-1]
    chunks.insert(0, front)
    return "".join(chunks)

def remove_indices3(row, indices):
    return ''.join(c for i, c in enumerate(row) if i+1 not in indices)

indices = [1, 11, 4, 54, 33, 20, 7]

for row in rows:
    print remove_indices1(row, indices)
print ""
for row in rows:
    print remove_indices2(row, indices)
print ""
for row in rows:
    print remove_indices3(row, indices)
EDIT: Adding timing info, plus a new winner!
As I suspected, my algorithm (remove_indices2) wins when there aren't many indices to remove. It turns out that the enumerate-based one, though, gets worse even faster as there are more indices to remove. Here's the timing code (each bigrows row has 2,100,000 characters):
import time

bigrows = []
for row in rows:
    bigrows.append(row * 30000)

for indices_len in [10, 100, 1000, 10000, 100000]:
    print "indices len: %s" % indices_len
    indices = range(indices_len)
    #for func in [remove_indices1,remove_indices2,remove_indices3,remove_indices4]:
    for func in [remove_indices2, remove_indices4]:
        start = time.time()
        for row in bigrows:
            func(row, indices)
        print "%s: %s" % (func.__name__, (time.time() - start))
And here are the results:
indices len: 10
remove_indices1: 0.0187089443207
remove_indices2: 0.00184297561646
remove_indices3: 1.40601491928
remove_indices4: 0.692481040955
indices len: 100
remove_indices1: 0.0974130630493
remove_indices2: 0.00125503540039
remove_indices3: 7.92742991447
remove_indices4: 0.679095029831
indices len: 1000
remove_indices1: 0.841033935547
remove_indices2: 0.00370812416077
remove_indices3: 73.0718669891
remove_indices4: 0.680690050125
So, why does 3 do so much worse? Well, it turns out that the in operator isn't efficient on a list: it has to iterate through all of the list items to check. remove_indices4 is just remove_indices3, but converting indices to a set first, so the inner loop can do a fast hash lookup instead of iterating through the list:
def remove_indices4(row, indices):
    indices_set = set(indices)
    return ''.join(c for i, c in enumerate(row) if i+1 not in indices_set)
And, as I originally expected, this does better than my algorithm for high densities:
indices len: 10
remove_indices2: 0.00230097770691
remove_indices4: 0.686790943146
indices len: 100
remove_indices2: 0.00113391876221
remove_indices4: 0.665997982025
indices len: 1000
remove_indices2: 0.00296902656555
remove_indices4: 0.700706005096
indices len: 10000
remove_indices2: 0.074893951416
remove_indices4: 0.679219007492
indices len: 100000
remove_indices2: 6.65899395943
remove_indices4: 0.701599836349
If you've got fewer than 10000 indices to remove, 2 is fastest (even faster if you do the indices sort/reverse once outside the function). But, if you want something that is pretty stable in time, no matter how many indices, use 4.
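If you can do that one-time sort, here is a sketch of the idea: a hypothetical remove_indices2_presorted takes indices already sorted in descending order, so the per-call copy/sort/reverse disappears:
def remove_indices2_presorted(row, desc_indices):
    # Assumes desc_indices is already sorted in descending order
    front = row
    chunks = []
    for i in desc_indices:
        chunks.insert(0, front[i:])
        front = front[:i-1]
    chunks.insert(0, front)
    return "".join(chunks)

desc_indices = sorted(indices, reverse=True)
for row in rows:
    print remove_indices2_presorted(row, desc_indices)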
The simplest way I can see would be something like:
>>> for line in TARGFILE:
... print ''.join(c for i,c in enumerate(line) if (i+1) not in INDICES)
...
100000200020020100200001
100010200001202010110001
010102000000111021001021
000000120012021012100110
(Substituting print for writing to your output file etc)
This relies on being able to load each line into memory which may or may not be reasonable given your data.
Edit: explanation:
The first line is straightforward:
>>> for line in TARGFILE:
Just iterates through each line in TARGFILE. The second line is a bit more complex:
''.join(...) concatenates a list of strings together with an empty joiner (''). join is often used with a comma like: ','.join(['a', 'b', 'c']) == 'a,b,c', but here we just want to join each item to the next.
enumerate(...) takes an iterable and returns pairs of (index, item) for each item in the iterable. For example, enumerate('abc') yields (0, 'a'), (1, 'b'), (2, 'c').
So the line says: join together each character of line whose index is not found in INDICES.
However, as John pointed out, Python indexes are zero-based, so we add 1 to the value from enumerate.
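A quick illustration of that one-based adjustment in the interpreter:
>>> INDICES = [2, 4]
>>> ''.join(c for i, c in enumerate('abcdef') if (i + 1) not in INDICES)
'acef'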
The script I ended up using is the following:
#!/usr/bin/env python

def remove_indices(row, indices):
    indices_set = set(indices)
    # Keep only the characters whose 1-based position is not in the index set
    return ''.join(c for i, c in enumerate(row) if (i+1) not in indices_set)

SRC_FILES = open('YCP2.txt', 'r')
CEUDIR = '/USER/ScriptsAndLists/LAMP/LAMPLDv1.1/IN/aps/4bogdan/omni/CEU/PARSED/'
YRIDIR = '/USER/ScriptsAndLists/LAMP/LAMPLDv1.1/IN/aps/4bogdan/omni/YRI/PARSED/'
i = 0
for line in SRC_FILES:
    i += 1
    EUR_YRI_ADM = line.strip('\n')
    EUR, YRI, ADM = EUR_YRI_ADM.split('\t')
    ADMFO = open(ADM, 'r')
    lines = ADMFO.readlines()
    INDICES = [int(val) for val in lines[0].split()]
    INDEXSORT = sorted(INDICES, key=int)
    EURF = open(EUR, 'r')
    EURFOUT = open(CEUDIR + 'chr' + str(i) + 'anc.hap.txt', 'a')
    for haplotype in EURF:
        TRIMLINE = remove_indices(haplotype, INDEXSORT)
        EURFOUT.write(TRIMLINE + '\n')
    EURFOUT.close()
    AFRF = open(YRI, 'r')
    AFRFOUT = open(YRIDIR + 'chr' + str(i) + 'anc.hap.txt', 'a')
    for haplotype2 in AFRF:
        TRIMLINE = remove_indices(haplotype2, INDEXSORT)
        AFRFOUT.write(TRIMLINE + '\n')
    AFRFOUT.close()
This is for homework, so I must try to use as few Python functions as possible, but still allow a computer to process a list of 1 million numbers efficiently.
#!/usr/bin/python3
# Find the 10 largest integers
# Don't store the whole list

import sys
import heapq

def fOpen(fname):
    try:
        fd = open(fname, "r")
    except:
        print("Couldn't open file.")
        sys.exit(0)
    all = fd.read().splitlines()
    fd.close()
    return all

words = fOpen(sys.argv[1])
numbs = map(int, words)
print(heapq.nlargest(10, numbs))

li = []
count = 1

# Make the list from the first 10 values
for x in words:
    li.append(int(x))
    count += 1
    if len(li) == 10:
        break

# Selection sort, largest-to-smallest
for each in range(0, len(li)-1):
    pos = each
    for x in range(each+1, 10):
        if li[x] > li[pos]:
            pos = x
    if pos != each:
        li[each], li[pos] = li[pos], li[each]

for each in words:
    print(li)
    each = int(each)
    if each > li[9]:
        for x in range(0, 9):
            pos = x
            if each > li[x]:
                li[x] = each
                for i in range(x+1, 10):
                    li[pos], li[i] = li[i], li[pos]
                break

# Selection sort, largest-to-smallest
for each in range(0, len(li)-1):
    pos = each
    for x in range(each+1, 10):
        if li[x] > li[pos]:
            pos = x
    if pos != each:
        li[each], li[pos] = li[pos], li[each]

print(li)
The code is working ALMOST the way that I want it to. I tried to create a list from the first 10 numbers and sort it into descending order, and then have Python only update the list when a number is larger than the smallest entry (instead of reading through the whole list 10 * len(x) times).
This is the output I should be getting:
>>>[9932, 9885, 9779, 9689, 9682, 9600, 9590, 9449, 9366, 9081]
This is the output I am getting:
>>>[9932, 9689, 9885, 9779, 9682, 9025, 9600, 8949, 8612, 8575]
If you only need the 10 top numbers and don't care to sort the whole list, and if "must try to use as little python functions as possible" means that you (or your teacher) prefer to avoid heapq, another way could be to keep track of the 10 top numbers while you parse the whole file only once:
top = []
with open('numbers.txt') as f:
    # the first ten numbers go directly in
    for line in f:
        top.append(int(line.strip()))
        if len(top) == 10:
            break
    # for the rest, replace the current minimum whenever a bigger number shows up
    for line in f:
        num = int(line.strip())
        min_top = min(top)
        if num > min_top:  # check if the new number is a top one
            top.remove(min_top)
            top.append(num)

print(sorted(top, reverse=True))  # descending, to match the desired output
Update: If you don't really need an in-place sort, and since you're going to sort only 10 numbers, I'd avoid the pain of reordering.
I'd just build a new list, example:
sorted_top = []
while top:
    max_top = max(top)
    sorted_top.append(max_top)
    top.remove(max_top)
Well, by reading in the entire file, splitting it, and then using map(), you are keeping a lot of data in memory.
As Adrien pointed out, files are iterators in py3k, so you can just use a generator expression to provide the iterable for nlargest:
nums = (int(x) for x in open(sys.argv[1]))
Then, using
heapq.nlargest(10, nums)
should get you what you need, and you haven't stored the entire list even once.
The program is even shorter than the original, as well!
#!/usr/bin/env python3
from heapq import nlargest
import sys
nums = (int(x) for x in open(sys.argv[1]))
print(nlargest(10, nums))
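A hypothetical invocation, assuming the script is saved as top10.py and the input file holds one integer per line, would print the ten largest values, matching the desired output from the question:
$ python3 top10.py numbers.txt
[9932, 9885, 9779, 9689, 9682, 9600, 9590, 9449, 9366, 9081]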