Extract a value from each text file obeying a naming convention - how? - python

I need to extract the last number in the last line of each text file in a directory. Can someone get me started on this in Python? The data is formatted as follows:
# time 'A' 'B'
0.000000E+00 10000 0
1.000000E+05 7742 2263
where the '#' column is empty in each file. The filenames obey the following naming convention:
for i in `seq 1 100`; for j in `seq 1 101`; for letter in {A..D};
filename = $letter${j}_${i}.txt
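(For illustration, here is the same convention spelled out in Python; seq endpoints are inclusive, hence the ranges below:)

filenames = ["{}{}_{}.txt".format(letter, j, i)
             for i in range(1, 101)
             for j in range(1, 102)
             for letter in "ABCD"]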
These files contain the resulting data from running simulations in KaSim (Kappa language). I want to take the averages of subsets of the extracted numbers and plot some results.
Matlab can't handle the set of 50,000 files I'm dealing with. I'm relatively new to Python but I have experience in Matlab and R. I want to do the data extraction through Python and the analysis in Matlab or R.
Thanks for any help.

This code should get you started. As long as the directory contains only the files you need the last number from, the naming convention can be ignored: you can simply look up every file in that directory.
import glob

last_numbers = []
for filename in glob.glob("/path/to/directory/*"):  # don't forget the trailing * (it's a wildcard)
    # if the last line is empty ('\n') and you want the second-to-last line, use [-2] instead of [-1]
    last_number = open(filename).readlines()[-1].split()[-1]
    last_numbers.append(last_number)
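Since the plan is to do the averaging and plotting in Matlab or R, it may help to dump the extracted values to a single CSV on the way out. A minimal sketch, building on the loop above and assuming a hypothetical output file results.csv:

import csv
import glob
import os

with open("results.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["filename", "last_number"])  # header row
    for filename in glob.glob("/path/to/directory/*.txt"):
        with open(filename) as f:
            # last whitespace-separated token on the last line
            writer.writerow([os.path.basename(filename), f.readlines()[-1].split()[-1]])

A file like this should load in one call with read.csv("results.csv") in R or readtable('results.csv') in Matlab.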

Related

Problem in reading in a file in R - the number of observations is wrongly interpreted as number of variables

What do I want to do:
I want to read a file into R that contains 2 variables and 2500 observations for each variable.
The data is the output (a list) of a function from a Python project. I have one list for each variable (2 lists with 2500 data points each).
I first copied the data into an Excel file, transformed it to a csv file, and read it into R.
Since that strategy did not work, I copied the lists into a text file.
What is my output/problem?
When I read the file into R with read.csv() (obviously with the csv file), I get 1 observation but 2500 variables (it should be 2 variables with 2500 observations each).
When I read the file into R with read.table(), I get this error and this warning message:
Error in read.table(file = "dataset.txt", header = TRUE) :
more columns than column names
In addition: Warning message:
In read.table(file = "dataset.txt", header = TRUE) :
incomplete final line found by readTableHeader in 'dataset.txt'
What do I think is the problem?
The data points are side by side and not one below the other.
Example:
A=[0.25, 0.67, ...,0.1]
B=[0.03, 0.14, ..., 0.09]
My guess is that R treats elements that are side by side as variables and data points below them as observations, perhaps with the first line as a heading (so the first line is read as the header, the second line as one observation, and every data point in that second line as a separate variable).
What did I try:
1. I tried to separate the data points with the sep = ',' argument, but that did not change the number of observations or variables (I tried it with a bunch of other separators like ';' and '\').
2. I tried to copy the data points (out of the Python print output) into an Excel file, but it always put the data points side by side and gave me an error that Excel only works with a maximum of ~800 data points.
3. I tried to create a csv file in the Python program with csv.writer() that starts a new line after each comma. This gave me an empty file as output; I don’t know why.
import csv
with open('dataset.txt', 'w', newline='') as csvfile:
    simval_similar = csv.writer(csvfile, delimiter=' ', quotechar=',', quoting=csv.QUOTE_ALL)
print(dataset.txt)
To test my guess (explained in the ‘what do I think is the problem’ section), I manually rearranged 200 data points so that they are listed one below the other in a text file. That suddenly gave me 200 observations, which means my guess was probably right (or at least partly).
But doing that manually would mean doing it for 5000 data points, and this strategy is error-prone.
I don’t know how to continue and would really appreciate help….
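For what it's worth, attempt 3 can work once rows are actually written: pair the two lists with zip() and write one row per observation, plus a header row that becomes the variable names in R. A minimal sketch, assuming the two Python lists are called A and B as in the example above:

import csv

A = [0.25, 0.67, 0.1]   # placeholders; substitute the real 2500-point lists
B = [0.03, 0.14, 0.09]

with open('dataset.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['A', 'B'])    # header row -> variable names in R
    writer.writerows(zip(A, B))    # one row per observation

read.csv("dataset.csv") should then report 2 variables with 2500 observations each.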

Iteratively replace two strings with values from numpy array

I'm currently trying to make an automation script for writing new files from a master where there are two strings I want to replace (x1 and x2) with values from a 21 x 2 array of numbers (namely, [[0,1000],[50,950],[100,900],...,[1000,0]]). Additionally, with each double replacement, I want to save that change as a unique file.
Here's my script as it stands:
import numpy
lines = []
x1x2 = numpy.array([[0,1000],[50,950],[100,900],...,[1000,0]])
for i,j in x1x2:
    with open("filenamexx.inp") as infile:
        for line in infile:
            linex1 = line.replace('x1',str(i))
            linex2 = line.replace('x2',str(j))
            lines.append(linex1)
            lines.append(linex2)
    with open("filename"+str(i)+str(j)+".inp", 'w') as outfile:
        for line in lines:
            outfile.write(line)
With my current script there are a few problems. First, the string replacements are being done separately, i.e. I end up with a new file that contains the contents of the master file twice where one line has the first change and then the next will reflect the second separately. Second, with each subsequent iteration, the new files have the contents of the previous file prepended (i.e. filename100900.inp will contain its unique contents as well as the contents of both filename01000.inp and filename50950.inp before it). Anyone think they can take a crack at solving my problem?
Note: I've looked at using regex module solutions (something like this: https://www.safaribooksonline.com/library/view/python-cookbook-2nd/0596007973/ch01s19.html) in order to do multiple replacements in a single pass, but I'm not sure if the way I'm indexing is translatable to a dictionary object.
I'm not sure I understood the second issue, but for the first one: you can use replace more than once on the same string, so:
s = "x1 stuff x2"
s = s.replace('x1',str(1)).replace('x2',str(2))
print(s)
will output:
1 stuff 2
No need to do this twice for two different variables. As for the second issue, it seems you're not resetting the "lines" variable before starting to write a new file. So once you finish writing a file, just add:
lines = []
That should be enough to solve both issues.
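Putting both fixes together, a sketch of the corrected loop (same filenames as above; the array is truncated here for brevity):

import numpy

x1x2 = numpy.array([[0, 1000], [50, 950], [100, 900], [1000, 0]])  # truncated example values

for i, j in x1x2:
    lines = []  # reset for every output file
    with open("filenamexx.inp") as infile:
        for line in infile:
            # replace both placeholders in a single pass over each line
            lines.append(line.replace('x1', str(i)).replace('x2', str(j)))
    with open("filename" + str(i) + str(j) + ".inp", 'w') as outfile:
        outfile.writelines(lines)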

Using Matlab Regex to insert "disclaimer" at beginning of multiple codes within multiple subfolders

I have a folder with multiple subfolders that all contain several files. I am looking to write a matlab code that will insert a commented-out "disclaimer" at the top of every relevant code file [C, Python (.py not .pyc), .urdf, .xml (.launch, .xacro, .config)].
My current thought process is to first list every subfolder within the main folder, then search each subfolder for the relevant code files. If a relevant file is found, the disclaimer is inserted as a comment at the top (each language has a different disclaimer).
I am having a hard time piecing this all together... any help?
data_dir = 'C:thedirectorytomainfolder';
topLevelFolder = data_dir;
if topLevelFolder == 0
    return;
end
% Get list of all subfolders.
allSubFolders = genpath(topLevelFolder);
remain = allSubFolders;
listOfFolderNames = {};
while true
    [singleSubFolder, remain] = strtok(remain, ';');
    if isempty(singleSubFolder)
        break;
    end
    listOfFolderNames = [listOfFolderNames singleSubFolder];
end
numberOfFolders = length(listOfFolderNames)
%% Process all (wanted) files in those folders
for k = 1 : numberOfFolders
    % Get this folder and print it out.
    thisFolder = listOfFolderNames{k};
    fprintf('Processing folder %s\n', thisFolder);
    % Get .xml files.
    filePattern = sprintf('%s/*.xml', thisFolder);
    baseFileNames = dir(filePattern);
    filePattern = sprintf('%s/*.c', thisFolder);
    baseFileNames = [baseFileNames; dir(filePattern)];
    numberOfImageFiles = length(baseFileNames)
I'm having a hard time reading each relevant file and inserting the correct comment code at the beginning of the file... any help?
Most of matlab's methods for reading text files assume you are trying to load primarily numeric data, but one of them might still work for you.
Sometimes it's easier to fopen the file and then read lines with fgetl or fread. Because you're doing low-level I/O you also have to test for the end of file, with while ~feof or some such. You could store each line in a cell array, prepend it with a cell array of your disclaimer, and then write back out with fwrite, converting the cells back to strings with char.
It'll be pretty cumbersome. Does it have to be matlab? If you have the option, it might be quicker to do this in a different language: it would be less than twenty lines in shell, and ruby/python/perl are all more geared up for text processing, which isn't matlab's strongest point.
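For example, a rough Python equivalent of the whole task might look like the sketch below (the disclaimer text and the extension-to-comment mapping are made up for illustration):

import os

# hypothetical disclaimers keyed by file extension
DISCLAIMERS = {
    '.py':  '# DISCLAIMER: your text here\n',
    '.c':   '/* DISCLAIMER: your text here */\n',
    '.xml': '<!-- DISCLAIMER: your text here -->\n',
}

for root, dirs, files in os.walk('path/to/mainfolder'):
    for name in files:
        ext = os.path.splitext(name)[1]
        if ext in DISCLAIMERS:
            path = os.path.join(root, name)
            with open(path) as f:
                contents = f.read()
            with open(path, 'w') as f:  # rewrite with the disclaimer prepended
                f.write(DISCLAIMERS[ext] + contents)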

Comparing a file containing a chromosomal region and another containing point coordinates

Could I please be advised on the following problem? I have csv files which I would like to compare. The first contains coordinates of specific points in the genome (e.g. chr3: 987654 – 987654). The other csv files contain coordinates of genomic regions (e.g. chr3: 135596 – 123456789). I would like to cross-compare my first file with the other files to see if any point locations in the first file overlap with any regional coordinates in the other files, and to write this set of overlaps into a separate file. To make things simple for a start, I have drafted a simple piece of code to cross-compare between 2 csv files. Strangely, my code runs and prints the coordinates but does not write the point coordinates into a separate file. My first question is whether my approach (from my code) to comparing these two files is optimal, or is there a better way of doing this? Secondly, why is it not writing to a separate file?
import csv

Region = open('Region_test1.csv', 'rt', newline='')
reader_Region = csv.reader(Region, delimiter=',')
DMC = open('DMC_test.csv', 'rt', newline='')
reader_DMC = csv.reader(DMC, delimiter=',')
DMC_testpoint = open('DMC_testpoint.csv', 'wt', newline='')
writer_Exon = csv.writer(DMC_testpoint, delimiter=',')

for col in reader_Region:
    Chr_region = col[0]
    Start_region = int(col[1])
    End_region = int(col[2])
    for col in reader_DMC:
        Chr_point = col[0]
        Start_point = int(col[1])
        End_point = int(col[2])
        if Chr_region == Chr_point and Start_region <= Start_point and End_region >= End_point:
            print(True, col)
        else:
            print(False, col)
            writer_Exon.writerow(col)

Region.close()
DMC.close()
A couple of things are wrong, not the least of which is that you never check whether your files opened successfully. The most glaring is that you never close your writer.
That said, this is an incredibly non-optimal way to go about the program. File I/O is slow. You don't want to keep rereading everything in a quadratic fashion. Given that your search requires all possible comparisons, you'll want to store at least one of the two files completely in memory, and potentially use a generator/iterator over the other if you don't wish to store both complete sets of data in memory.
Once you have both sets loaded, proceed to do your intersection checks.
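A minimal sketch of that approach, assuming both files have three columns (chromosome, start, end) as the code above implies:

import csv

# load every region into memory once
with open('Region_test1.csv', newline='') as f:
    regions = [(row[0], int(row[1]), int(row[2])) for row in csv.reader(f)]

with open('DMC_test.csv', newline='') as f_in, \
     open('DMC_testpoint.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    for row in csv.reader(f_in):
        chrom, start, end = row[0], int(row[1]), int(row[2])
        # write the point out if it falls inside any region
        if any(c == chrom and s <= start and e >= end for (c, s, e) in regions):
            writer.writerow(row)

Both files are closed automatically by the with blocks, which also fixes the unclosed-writer problem.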
I'd suggest you take a look at http://docs.python.org/2/library/csv.html for how to use a csv reader, because what you are doing doesn't appear to make any sense: col[0], col[1] and col[2] aren't going to be what you think they are.
These are style and readability things, but:
The names of some iteration variables seem a bit off; for col in ... should probably be for token in ..., because you are processing token by token, not column by column/line by line, etc.
Additionally, it would be nice to pick something consistent to stick to for your variable names; sometimes you start with uppercase, sometimes you save the uppercase for after the '_'.
That you put spaces between some objects and function names and not others is also very odd. But again, these don't change the functionality of your code.

Python - Import txt in a sequential pattern

In the directory I have, say, 30 txt files, each containing two columns of numbers with roughly 6000 numbers in each column. What I want to do is import the first 3 txt files, process the data (which gives me the desired output), then move on to the next 3 txt files.
The directory looks like:
file0a
file0b
file0c
file1a
file1b
file1c ... and so on.
I don't want to import all of the txt files simultaneously; I want to import the first 3, process the data, then the next 3, and so forth. I was thinking of making a dictionary, though I have a feeling this might involve writing each file name in the dictionary, which would take far too long.
EDIT:
For those that are interested, I think I have come up with a workaround. Any feedback would be greatly appreciated, since I'm not sure if this is the quickest way to do things or the most pythonic.
import glob
import numpy as np

def chunks(l, n):
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

Data = []
txt_files = glob.iglob("./*.txt")
for data in txt_files:
    d = np.loadtxt(data, dtype=np.float64)
    Data.append(d)

Data_raw_all = list(chunks(Data, 3))
Here the list 'Data' holds all of the text files from the directory, and 'Data_raw_all' uses the function 'chunks' to group the elements of 'Data' into sets of 3. This way, selecting one element of Data_raw_all selects the corresponding 3 text files in the directory.
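One caveat with the workaround above: glob.iglob returns names in arbitrary order, so the groups of 3 may not line up with file0a/file0b/file0c and so on. Sorting first makes the grouping deterministic:

txt_files = sorted(glob.glob("./*.txt"))  # sorted, so chunks of 3 match the naming scheme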
First of all, I have nothing original to include here and I definitely do not want to claim credit for it at all because it all comes from the Python Cookbook 3rd Ed and from this wonderful presentation on generators by David Beazley (one of the co-authors of the aforementioned Cookbook). However, I think you might really benefit from the examples given in the slideshow on generators.
What Beazley does is chain a bunch of generators together in order to do the following:
yields filenames matching a given filename pattern,
yields open file objects from a sequence of filenames,
concatenates a sequence of generators into a single sequence, and
greps a series of lines for those that match a regex pattern.
All of these code examples are located here. The beauty of this method is that the chained generators simply chew up the next pieces of information: they don't load all files into memory in order to process all the data. It's really a nice solution.
Anyway, if you read through the slideshow, I believe it will give you a blueprint for exactly what you want to do: you just have to adapt it to the information you are seeking.
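For a taste of what that chaining looks like, here is a condensed sketch in the spirit of the slides (the function names follow Beazley's examples; this is not the asker's exact processing):

import os
import fnmatch

def gen_find(pattern, top):
    # yield filenames matching a pattern while walking a directory tree
    for path, dirs, files in os.walk(top):
        for name in fnmatch.filter(files, pattern):
            yield os.path.join(path, name)

def gen_open(filenames):
    # yield open file objects from a sequence of filenames
    for name in filenames:
        with open(name) as f:
            yield f

def gen_cat(sources):
    # concatenate a sequence of iterables into one stream of items
    for source in sources:
        for item in source:
            yield item

# chain them: every line of every .txt file under the current directory
lines = gen_cat(gen_open(gen_find("*.txt", ".")))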
I'm presuming you want to hardcode as few of the file names as possible. Therefore most of this code is for generating the filenames. The files are then opened with a with statement.
Example code:
from itertools import count

root = "UVF2CNa"
for n in count(1):
    for char in "abc":  # note: itertools.cycle("abc") here would repeat a, b, c forever without n ever advancing
        first_part = "{}{}{}".format(root, n, char)
        try:
            with open(first_part + "i") as i, \
                 open(first_part + "j") as j, \
                 open(first_part + "k") as k:
                # do stuff with files i, j and k here
                pass
        except FileNotFoundError:
            # deal with this however (e.g. break once the files run out)
            pass
