Selecting an inputted number of files from a folder - python

I have a folder with a few hundred thousand data files in it. What I want to do is separate the data into groups of n (where n is entered by the user) in a way that allows me to manipulate just that group of data, then start back where I left off in the folder and take another group. For example, if n was 5 I would want files 1-5 to be read and manipulated, then files 6-10, and so on and so forth.
def fileavg(path,n):
    import numpy as np
    import xlsxwriter
    from glob import glob
    workbook = xlsxwriter.Workbook('Test.xlsx')
    worksheet = workbook.add_worksheet()
    row=0
    b=glob.iglob(path) #when inputting path name begin with r' and end with a'
    for i in range(0,len(b),n):
        f=yield [i:i +n]
        A=np.mean(f(1),axis=1)
        for col, data in enumerate(A):
            worksheet.write_column(row, col, data)
        row +=1
I have tried using a for loop and the yield keyword but I was having a problem with a generator error. I would like to continue using the for loop with just a different technique of grabbing the data.
Updated Code
def fileavg(path,n):
    import numpy as np
    import xlsxwriter
    import glob
    workbook = xlsxwriter.Workbook('Test.xlsx')
    worksheet = workbook.add_worksheet()
    row=0
    more_files=True
    b=glob.iglob(path) #when inputting path name begin with r'and end with a '
    while more_files:
        for i in range(n):
            try:
                next_file=next(b)
                print(row,next_file)
                A=np.mean(next_file(1))
            except StopIteration:
                more_files=False
                break
        for col, data in enumerate(A):
            worksheet.write_column(row, col, data)
        row +=1

iglob returns an iterator, not a list, so it does not have a length. The easy way to grab the next n items from an iterator is to use itertools.islice.
import glob
from itertools import islice
num_items = 5
b=glob.iglob(path) #when inputting path name begin with r' and end with a'
next_n_files = list(islice(b, num_items))
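If you want every group of n rather than just the first, you can keep calling islice on the same iterator until it comes back empty. A minimal sketch, where process_group is a hypothetical stand-in for whatever you do with each batch:
import glob
from itertools import islice

def in_batches(path, n):
    """Yield successive lists of up to n filenames matched by the pattern."""
    it = glob.iglob(path)
    while True:
        batch = list(islice(it, n))
        if not batch:          # iterator exhausted
            return
        yield batch

# hypothetical usage: handle each group of 5 files in turn
# for group in in_batches(r'C:\data\*.txt', 5):
#     process_group(group)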
If you really want to use a for loop for this, then you can use next() to grab the next item. You'll have to maintain a flag that triggers once the next() function runs out of files and raises a StopIteration exception. When this exception is raised, we need to break out of the for loop that grabs the n files and also change the flag to exit the while loop. For grouping, we use the row variable to keep track of each group.
import glob
from itertools import islice

num_items = 5
b = glob.iglob(path)
row = 0
more_files = True
while more_files:
    for i in range(num_items):
        try:
            next_file = next(b)
            print(row, next_file)
            # Do other stuff with file here
        except StopIteration:
            more_files = False
            break
    row += 1
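To tie this back to the xlsxwriter logic from the question, one option is to collect each batch, compute a mean per file, and write one spreadsheet row per group. This is only a sketch under the assumption that every file is plain numeric text that np.loadtxt can read (adjust the loader to your actual format):
import glob
import numpy as np
import xlsxwriter
from itertools import islice

def fileavg(path, n):
    workbook = xlsxwriter.Workbook('Test.xlsx')
    worksheet = workbook.add_worksheet()
    files = glob.iglob(path)
    row = 0
    while True:
        batch = list(islice(files, n))   # next group of up to n filenames
        if not batch:
            break
        # one mean per file in the group (np.loadtxt is an assumed loader)
        means = [np.mean(np.loadtxt(f)) for f in batch]
        for col, value in enumerate(means):
            worksheet.write(row, col, value)
        row += 1
    workbook.close()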

Related

Python : using a loop, read and save data to different lists

I want to create a loop which reads data from multiple .txt files and saves it to different lists (one per .txt file).
I don't know how to create the following lists (or np.arrays) in a loop: time1, time2, ... timeN, where N is the number of .txt files to analyze.
I used the following to connect the 'constant' time and the 'variable' number: obj['time'+str(shot)], but it is not saved as a variable.
import numpy as np
import os

for shot in range(1,10):
    #get the first txt:
    txt_file = os.path.join(path,'shot'+str(shot)+'.txt')
    #get data from txt:
    data = np.genfromtxt(txt_file, skip_header = 4)
    # now save it to the list (or np.array):
    obj['time'+str(shot)] = np.array([row[0] for row in data])
As an output I want to have 10 lists or arrays with time to work with them in future.
You can use either an empty list or an empty dictionary to save your results.
In the first case, create an empty list first and then append the data to it. The i-th element in the list then corresponds to the (i+1)-th source file.
import numpy as np
import os

my_data = []
for shot in range(1,10):
    #get the first txt:
    txt_file = os.path.join(path,'shot'+str(shot)+'.txt')
    #get data from txt:
    data = np.genfromtxt(txt_file, skip_header = 4)
    #append data
    my_data.append(data)

first_source_data = my_data[0]
The second option is useful if you want to access the data by filename.
import numpy as np
import os

my_data = {}
for shot in range(1,10):
    #gen filename
    filename = 'shot'+str(shot)+'.txt'
    #get the first txt:
    txt_file = os.path.join(path, filename)
    #get data from txt:
    data = np.genfromtxt(txt_file, skip_header = 4)
    # now save it to the dictionary:
    my_data[filename] = data

first_source_data = my_data['shot1.txt']
Your question is a little confusing, but if I understand correctly, you actually don't want a list but a dictionary. A dictionary is (oversimplified) a list in which you use keywords to access the saved data.
If that is the case, you need to put obj = dict() before the for loop, or obj = [] if you want to use a list.
The correct last line for the dict version is
obj['time'+str(shot)] = data[0]
or, with a list,
obj[shot] = data[0]
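Putting the dictionary suggestion together with the question's code, a minimal sketch (it assumes path is already defined and that the time values sit in the first column of each multi-column file):
import os
import numpy as np

obj = {}                                   # dictionary keyed by 'time1', 'time2', ...
for shot in range(1, 10):
    txt_file = os.path.join(path, 'shot' + str(shot) + '.txt')
    data = np.genfromtxt(txt_file, skip_header=4)
    obj['time' + str(shot)] = data[:, 0]   # first column = time values

# later: obj['time3'] is the time array read from shot3.txt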

How to use multiprocessing module to iterate a list and match it with a key in dictionary?

I have a list named master_lst created from a CSV file using the following code
import sys

infile = open(sys.argv[1], "r")
lines = infile.readlines()[1:]
master_lst = ["read"]
for line in lines:
    line = line.strip().split(',')
    fourth_field = line[3]
    master_lst.append(fourth_field)
This master list has the unique set of sequences. Now I have to loop over 30 collapsed FASTA files to count the number of occurrences of each of these sequences in the master list. The file format of the 30 files is as follows:
>AAAAAAAAAAAAAAA
7451
>AAAAAAAAAAAAAAAA
4133
>AAAAAAAAAAAAAAAAA
2783
For counting the number of occurrences, I looped through each of the 30 files and created a dictionary with sequences as keys and numbers of occurrences as values. Then I iterated over each element of the master_lst and matched it with a key in the dictionary created in the previous step. If there is a match, I appended the value of the key to a new list (ind_lst). If not, I appended 0 to the ind_lst. The code for that is as follows:
for file in files:
    ind_lst = []
    if file.endswith('.fa'):
        first = file.split(".")
        first_field = first[0]
        ind_lst.append(first_field)
        fasta = open(file)
        individual_dict = {}
        for line in fasta:
            line = line.strip()
            if line == '':
                continue
            if line.startswith('>'):
                header = line.lstrip('>')
                individual_dict[header] = ''
            else:
                individual_dict[header] += line
    for key in master_lst[1:]:
        a = 0
        if key in individual_dict.keys():
            a = individual_dict[key]
        else:
            a = 0
        ind_lst.append(a)
Then I write the master_lst and the ind_lst to a CSV file using the code explained here: How to append a new list to an existing CSV file?
The final output should look like this:
Read                 file1   file2   ...   file30
AAAAAAAAAAAAAAA      7451    4456
AAAAAAAAAAAAAAAA     4133    3624
AAAAAAAAAAAAAAAAA    2783    7012
This code works perfectly fine when I use a smaller master_lst, but when the size of the master_lst increases, the execution time increases too much. The master_lst I am working with right now has 35,718,501 sequences (elements). When I subset 50 sequences and run the code, the script takes 2 hours to execute. So for 35,718,501 sequences it will take forever to complete.
Now I don't know how to speed up the script. I am not quite sure what improvements could be made to this script to make it execute in a shorter time. I am running my script on a Linux server which has 16 CPU cores. When I use the command top, I can see that the script uses only one CPU. But I am not an expert in Python and I don't know how to make it run on all available CPU cores using the multiprocessing module. I checked this webpage: Learning Python's Multiprocessing Module.
But I wasn't quite sure what should come under def and if __name__ == '__main__':. I am also not quite sure what arguments I should pass to the function. I was getting an error when I tried the first code from Douglas, without passing any arguments, as follows:
File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
I have been working on this for the last few days and I haven't been successful in generating my desired output. If anyone can suggest an alternative code that could run fast or if anyone could suggest how to run this code on multiple CPUs, that would be awesome. Any help to resolve this issue would be much appreciated.
Here's a multiprocessing version. It uses a slightly different approach than the one in your code, which does away with the need to create the ind_lst.
The essence of the difference is that it first produces a transpose of the desired data, and then transposes that into the desired result.
In other words, instead of creating this directly:
Read,file1,file2
AAAAAAAAAAAAAAA,7451,4456
AAAAAAAAAAAAAAAA,4133,3624
AAAAAAAAAAAAAAAAA,2783,7012
It first produces:
Read,AAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAA,AAAAAAAAAAAAAAAAA
file1,7451,4133,2783
file2,4456,3624,7012
...and then transposes that with the built-in zip() function to obtain the desired format.
Besides not needing to create the ind_lst, it also allows creating one row of data per file rather than one column of it (which is simpler and more efficient).
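For reference, zip(*rows) is all the transpose step amounts to; a tiny illustration:
rows = [['Read', 'AAA', 'AAAA'],
        ['file1', 7451, 4133],
        ['file2', 4456, 3624]]

transposed = zip(*rows)   # wrap in list() under Python 3 to materialize it
# [('Read', 'file1', 'file2'), ('AAA', 7451, 4456), ('AAAA', 4133, 3624)]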
Here's the code:
from __future__ import print_function
import csv
from functools import partial
from glob import glob
from itertools import izip  # Python 2
import operator
import os
from multiprocessing import cpu_count, Pool, Queue
import sys

def get_master_list(filename):
    with open(filename, "rb") as csvfile:
        reader = csv.reader(csvfile)
        next(reader)  # ignore first row
        sequence_getter = operator.itemgetter(3)  # retrieves fourth column of each row
        return map(sequence_getter, reader)

def process_fa_file(master_list, filename):
    fa_dict = {}
    with open(filename) as fa_file:
        for line in fa_file:
            if line and line[0] != '>':
                fa_dict[sequence] = int(line)
            elif line:
                sequence = line[1:-1]
    get = fa_dict.get  # local var to expedite access
    basename = os.path.basename(os.path.splitext(filename)[0])
    return [basename] + [get(key, 0) for key in master_list]

def process_fa_files(master_list, filenames):
    pool = Pool(processes=4)  # "processes" is the number of worker processes to
                              # use. If processes is None then the number returned
                              # by cpu_count() is used.
    # Only one argument can be passed to the target function using Pool.map(),
    # so create a partial to pass the first argument, which doesn't vary.
    results = pool.map(partial(process_fa_file, master_list), filenames)
    header_row = ['Read'] + master_list
    return [header_row] + results

if __name__ == '__main__':
    master_list = get_master_list('master_list.csv')
    fa_files_dir = '.'  # current directory
    filenames = glob(os.path.join(fa_files_dir, '*.fa'))
    data = process_fa_files(master_list, filenames)
    rows = zip(*data)  # transpose
    with open('output.csv', 'wb') as outfile:
        writer = csv.writer(outfile)
        writer.writerows(rows)
    # show data written to file
    for row in rows:
        print(','.join(map(str, row)))
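One thing to keep in mind with this pattern: Pool.map pickles its arguments, so the master_list bound into the partial is serialized and shipped to every worker process. With tens of millions of sequences that copy is itself costly in time and memory, so it is worth timing the approach on a subset before running it on the full data.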

Python: Looping through multiple csv files and making multiple new csv files

I am starting out in Python, and I am looking at csv files.
Basically my situation is this:
I have coordinates X, Y, Z in a csv.
X Y Z
1 1 1
2 2 2
3 3 3
and I want to go through and add a user-defined offset value to all Z values and make a new file with the edited Z values.
Here is my code so far, which I think is right:
import csv

# list of lists we store all data in
allCoords = []

# get offset from user
offset = int(input("Enter an offset value: "))

# read all values into memory
with open('in.csv', 'r') as inFile:  # input csv file
    reader = csv.reader(inFile, delimiter=',')
    for row in reader:
        # do not add the first row to the list
        if row[0] != "X":
            # create a new coord list
            coord = []
            # get a row and put it into new list
            coord.append(int(row[0]))
            coord.append(int(row[1]))
            coord.append(int(row[2]) + offset)
            # add list to list of lists
            allCoords.append(coord)

# write all values into new csv file
with open('out.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    firstRow = ['X', 'Y', 'Z']
    allCoords.insert(0, firstRow)
    writer.writerows(allCoords)
But now comes the hard part. How would I go about going through a bunch of csv files (in the same location) and producing a new file for each of them?
I am hoping to have something like "filename.csv" turn into "filename_offset.csv", using the original file name as a starter for the new filename and appending "_offset" to the end.
I think I need to use "os." functions, but I am not sure how to, so any explanation would be much appreciated along with the code! :)
Sorry if I didn't make much sense, let me know if I need to explain more clearly. :)
Thanks a bunch! :)
shutil.copy2(src, dst)
Similar to shutil.copy(), but metadata is copied as well.
shutil
The glob module finds all the pathnames matching a specified pattern
according to the rules used by the Unix shell. No tilde expansion is
done, but *, ?, and character ranges expressed with [] will be correctly matched
glob
import glob
import os
from shutil import copy2
import shutil

files = glob.glob(os.path.join(cvs_DIR, '*.csv'))  # cvs_DIR holds the folder path
for file in files:
    try:
        # glob already returns the path including cvs_DIR
        oldName = file
        newName = os.path.splitext(file)[0] + '_offset.csv'
        copy2(oldName, newName)
    except shutil.Error as e:
        print('Error: {}'.format(e))
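Note that copy2 only duplicates the files; to actually apply the offset while producing the new names, one way is to wrap the read/shift/write logic from your question in a loop over the glob results. A minimal sketch, assuming every input csv has the same X,Y,Z header and integer values:
import csv
import glob
import os

offset = int(input("Enter an offset value: "))

for inName in glob.glob('*.csv'):
    base, ext = os.path.splitext(inName)
    outName = base + '_offset' + ext              # filename.csv -> filename_offset.csv
    with open(inName, 'r', newline='') as inFile, \
         open(outName, 'w', newline='') as outFile:
        reader = csv.reader(inFile)
        writer = csv.writer(outFile)
        writer.writerow(next(reader))             # copy the X,Y,Z header row
        for x, y, z in reader:
            writer.writerow([x, y, int(z) + offset])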
BTW, you can write ...
for row in reader:
    if row[0] == "X":
        break
for row in reader:
    coord = []
    ...
... instead of ...
for row in reader:
    if row[0] != "X":
        coord = []
        ...
This stops checking for 'X'es after the first line.
It works because you don't work with a real list here but with a self-consuming iterator, which you can stop and resume.
See also: Detecting if an iterator will be consumed.
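A tiny illustration of that stop-and-resume behaviour, using a plain iterator in place of the csv reader:
it = iter(['X,Y,Z', '1,1,1', '2,2,2'])

for line in it:              # consume up to and including the header
    if line.startswith('X'):
        break

for line in it:              # resumes right after the header
    print(line)              # -> '1,1,1' then '2,2,2'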

How to read last line of CSV file into a list that I can manipulate (python)

I wrote a small function that will return the last row of a csv file. I can't for the life of me figure out how to store that row into a list or array to use after the function has completed.
import os
import csv

hello = []

def getLastFile(filename):
    distance = 1024
    with open(filename, 'rb') as f:
        # seek/tell/readlines belong to the file object, not to csv.reader
        f.seek(0, os.SEEK_END)
        if f.tell() < distance:
            f.seek(0, os.SEEK_SET)
            lines = f.readlines()
            lastline = lines[-1]
        else:
            f.seek(-1 * distance, os.SEEK_END)
            lines = f.readlines()
            lastline = lines[-1]
    return lastline
I have tried defining an empty list at the top, and appending lastline to that list, but I found that that was wrong. Right now the function returns a csv row, ex: 'a','b','c', how can I make the output of my function save this into a global list or array that I can then use? Thank you!
If you want the result of a function to be used elsewhere you have to actually call the function. For instance:
def print_last_line(filename):
    print getLastFile(filename)
Additionally there's the problem of scope, where you must define anything you want to use within the function in the function itself. For instance:
test = []
def use_last_line(filename):
    test.append(getLastFile(filename))  # This will error, because test is not in scope

def use_last_line(filename):
    test = []
    test.append(getLastFile(filename))  # This will succeed because test is in the function.
Specifically, for what I'm guessing you're trying to do above, you would just call the function and assign the result to your hello list:
hello = getLastFile(filename)
You could open the CSV file as an array with numpy's genfromtxt and then slice the last row; or, if you know the length of your file (it looks like 1024 in your example), you can use the skip_header keyword, i.e.
import numpy as np
data = np.genfromtxt(filename, delimiter=",",skip_header=1024)
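For the first suggestion (read everything, then slice), a minimal sketch:
import numpy as np

data = np.genfromtxt(filename, delimiter=",")   # whole file as a 2-D array
last_row = data[-1]                             # final row, ready to work with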

Python- Read from Multiple Files

I have 125 data files containing two columns and 21 rows of data. Please see the image below:
and I'd like to import them into a single .csv file (as 250 columns and 21 rows).
I am fairly new to Python, but this is what I have been advised, code-wise:
import glob

Results = [open(f) for f in glob.glob("*.data")]

fout = open("res.csv", 'w')
for row in range(21):
    for f in Results:
        fout.write( f.readline().strip() )
        fout.write(',')
    fout.write('\n')
fout.close()
However, there is a slight problem with the code, as I only get 125 columns (i.e. the force and displacement columns are written in one column). Please refer to the image below:
I'd very much appreciate it if anyone could help me with this !
import glob

results = [open(f) for f in glob.glob("*.data")]

sep = ","
# Uncomment if your Excel formats decimal numbers like 3,14 instead of 3.14
# sep = ";"

with open("res.csv", 'w') as fout:
    for row in range(21):
        iterator = (f.readline().strip().replace("\t", sep) for f in results)
        line = sep.join(iterator)
        fout.write("{0}\n".format(line))
To explain what went wrong with your code: your source files use tab as the field separator, but your code uses a comma to separate the values it reads from those files. If your Excel uses a period as the decimal separator, it uses a comma as the default field separator; the whitespace is ignored unless enclosed in quotes, and that is the result you see.
If you use the text import feature of Excel (Data ribbon => From Text) you can ask it to consider both comma and tab as valid field separators, and then I'm pretty sure your original output would work too.
In contrast, the above code should produce a file that will open correctly when double clicked.
You don't need to write your own program to do this, in python or otherwise. You can use an existing unix command (if you are in that environment):
paste *.data > res.csv
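By default paste joins the files with tabs; if you want commas between the pasted files instead, set the delimiter with -d:
paste -d, *.data > res.csv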
Try this:
import glob, csv
from itertools import cycle, islice, count

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    pending = len(iterables)
    nexts = cycle(iter(it).next for it in iterables)
    while pending:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            pending -= 1
            nexts = cycle(islice(nexts, pending))

Results = [open(f).readlines() for f in glob.glob("*.data")]

fout = csv.writer(open("res.csv", 'wb'), dialect="excel")
row = []
for line, c in zip(roundrobin(*Results), cycle(range(len(Results)))):
    row.extend(line.split())        # split the whitespace-separated fields of this line
    if c == len(Results) - 1:       # one line taken from every file: flush the row
        fout.writerow(row)
        row = []
del fout
It should loop over each line of your input files and stitch them together into one row, which the csv library will write in the listed dialect.
I suggest getting used to the csv module. The reason is that if the data is not that simple (plain strings in the headings, then numbers only), it is difficult to implement all the parsing again yourself. Try the following:
import csv
import glob
import os

datapath = './data'
resultpath = './result'
if not os.path.isdir(resultpath):
    os.makedirs(resultpath)

# Initialize the empty rows. It does not check how many rows are
# in the file.
rows = []

# Read data from the files to the above matrix.
for fname in glob.glob(os.path.join(datapath, '*.data')):
    with open(fname, 'rb') as f:
        reader = csv.reader(f)
        for n, row in enumerate(reader):
            if len(rows) < n+1:
                rows.append([])       # add another row
            rows[n].extend(row)       # append the elements from the file

# Write the data from memory to the result file.
fname = os.path.join(resultpath, 'result.csv')
with open(fname, 'wb') as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow(row)
The with construct for opening a file can be replaced by the pair:
f = open(fname, 'wb')
...
f.close()
The csv.reader and csv.writer are simply wrappers that parse or compose the lines of the file. The docs say the file should be opened in binary mode (in Python 2).
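For completeness, under Python 3 the csv module expects text-mode files opened with newline='' rather than binary mode; a one-line sketch of the difference:
import csv

# Python 3: text mode with newline='' (a binary-mode file would raise a TypeError)
with open('result.csv', 'w', newline='') as f:
    csv.writer(f).writerow(['X', 'Y', 'Z'])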
