Iteratively replace two strings with values from numpy array - python

I'm currently trying to make an automation script for writing new files from a master where there are two strings I want to replace (x1 and x2) with values from a 21 x 2 array of numbers (namely, [[0,1000],[50,950],[100,900],...,[1000,0]]). Additionally, with each double replacement, I want to save that change as a unique file.
Here's my script as it stands:
import numpy
lines = []
x1x2 = numpy.array([[0,1000],[50,950],[100,900],...,[1000,0]])
for i,j in x1x2:
    with open("filenamexx.inp") as infile:
        for line in infile:
            linex1 = line.replace('x1',str(i))
            linex2 = line.replace('x2',str(j))
            lines.append(linex1)
            lines.append(linex2)
    with open("filename"+str(i)+str(j)+".inp", 'w') as outfile:
        for line in lines:
            outfile.write(line)
With my current script there are a few problems. First, the string replacements are being done separately: I end up with a new file that contains the contents of the master file twice, where one line has the first replacement and the next line has the second. Second, with each subsequent iteration, the new file has the contents of the previous files prepended (i.e. filename100900.inp contains its unique contents as well as the contents of both filename01000.inp and filename50950.inp before it). Anyone think they can take a crack at solving my problem?
Note: I've looked at using regex module solutions (something like this: https://www.safaribooksonline.com/library/view/python-cookbook-2nd/0596007973/ch01s19.html) in order to do multiple replacements in a single pass, but I'm not sure if the way I'm indexing is translatable to a dictionary object.
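(For reference, the recipe's dictionary-based single-pass replacement could look roughly like this; the mapping and sample line below are made up for illustration, built from one row of the array:)

import re

# hypothetical mapping for one row, e.g. i=50, j=950
mapping = {'x1': '50', 'x2': '950'}

# build one pattern that matches any key, then look up the replacement per match
pattern = re.compile('|'.join(re.escape(k) for k in mapping))
line = "load x1 kN at node A, x2 kN at node B"
print(pattern.sub(lambda m: mapping[m.group(0)], line))
# -> load 50 kN at node A, 950 kN at node B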

I'm not sure I understood the second issue, but you can call replace more than once on the same string, so:
s = "x1 stuff x2"
s = s.replace('x1',str(1)).replace('x2',str(2))
print(s)
will output:
1 stuff 2
There is no need to do this twice into two different variables. As for the second issue, it seems you are not resetting the lines variable before starting to write a new file. So once you finish writing a file, just add:
lines = []
That should be enough to solve both issues.
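Putting both fixes together, a minimal sketch of the corrected script (assuming the same file names as in your post, chaining the two replace calls, and resetting lines on every iteration) might look like this:

import numpy

x1x2 = numpy.array([[0, 1000], [50, 950], [100, 900], [1000, 0]])  # fill in the remaining pairs

for i, j in x1x2:
    lines = []  # reset for every output file
    with open("filenamexx.inp") as infile:
        for line in infile:
            # do both substitutions on the same line before storing it
            lines.append(line.replace('x1', str(i)).replace('x2', str(j)))
    with open("filename" + str(i) + str(j) + ".inp", 'w') as outfile:
        outfile.writelines(lines)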

Related

Reading selected parts of a file of unknown name

I have a lot of data files with unknown names. I have figured out a way to read and print them all, but I want to make graphs of them, so I need the data in a workable form.
The data files are very neatly arranged (every line of the header contains information on what is stored there), but I am having trouble making a script that selects the data I need. The first 50+ lines of the file contain headers, of which I need only a few; this poses no problem when using something like:
for filename in glob.glob(fullpath):
    with open(filename, 'r') as f:
        for line in f:
            if 'xx' in line:
                Do my thing
            if 'yy' in line:
                Do my thing etc.
But below the headers there is a block of data with an undetermined number of columns and an undetermined number of lines (the number of columns and what each column is are specified in the headers). I can't get this read in a way that a graph can be made of it by, for example, matplotlib. (I can get it right by manually copying the data to a separate file and reading that into a plottable format, but that is not something I want to do by hand for every file...) The line before the data starts contains the very useful #eoh, but I can't figure out a way to combine the selective reading of the first 50+ lines with then switching to reading everything into an array. If there are methods to do what I want in a better way (including selecting the directory and seeing which files are there and readable), I am open to suggestions.
Update:
The solution proposed by @ImportanceOfBeingErnest seems very useful, but I can't get it to work.
So I'll start with the data mentioned as missing in the answer.
Column names are given in the following format:
#COLUMNINFO= NUMBER1, UNIT, MEASUREMENT, NUMBER2
In this format, NUMBER1 is the column number, UNIT is the unit of the measurement, MEASUREMENT is what is measured, and NUMBER2 is a numeric code for what is measured.
The data is separated by spaces but that won't be a problem, I suspect.
I tried to implement the reading of the headers in the loop that determines the end of the header, but it had no visible effect; even the print commands I added to check intermediate results showed nothing.
Once I put 'print line' after 'for line in f:' I expected to see what went wrong, but it appears the whole loop is ignored, including the break command, which causes an error later because the file has already been read to the end and no data is left for the other parts...
Any help would be appreciated.
First of all, if the header has a certain character at the beginning of each line, this can be used to filter the header out automatically. Using numpy you could use numpy.loadtxt(filename, delimiter=";", comments="#") to load the data, and every line starting with # would simply be ignored.
I don't know if this is the case here?!
In the case that you describe, where you have a header-ending flag #eoh you could first read in the header line by line to find out how many lines you later need to ignore and then use that number when loading the file.
I have assembled a little example, how it could work.
import glob
import numpy as np
import matplotlib.pyplot as plt

def readFile(filename):
    # first find the number of lines to skip in the header
    eoh = 0
    with open(filename, "r") as f:
        for line in f:
            eoh = eoh + 1
            if "#eoh" in line:
                break
    # at this point we would need to find out about the column names,
    # but as no data is given as an example, this is impossible
    columnnames = []
    # load the file by skipping eoh lines,
    # the rest should be well behaving
    a = np.genfromtxt(filename, skip_header=eoh, delimiter=";")
    return a, columnnames

def plot(a, columnnames, show=True, save=False, filename="something"):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # plot the fourth column against the second
    ax.plot(a[:, 1], a[:, 3])
    # if we had some column names, we could also plot
    # the column named "ulf" against the one named "alf"
    #ax.plot(a[:, columnnames.index("alf")], a[:, columnnames.index("ulf")])
    # now save and/or show
    if save:
        plt.savefig(filename + ".png")
    if show:
        plt.show()

if __name__ == "__main__":
    fullpath = "path/to/files/file*.txt"  # or whatever
    for filename in glob.glob(fullpath):
        a, columnnames = readFile(filename)
        plot(a, columnnames, show=True, save=False, filename=filename[:-4])
One remaining problem is the names of the columns. Since you did not provide any example data, it's hard to say exactly how to handle that.
This all assumes that you do not have any missing data in between or anything of that kind. If that were the case, you'd need to use the additional arguments to numpy.genfromtxt() to handle the data accordingly, as sketched below.
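(As a rough illustration of what such arguments could look like; the file name, skip count, and column choices here are made up, assuming ";"-separated data with possibly empty fields:)

import numpy as np

# hypothetical call: skip the header, treat empty fields as NaN,
# and only load columns 1 and 3
a = np.genfromtxt("file1.txt", delimiter=";", skip_header=52,
                  missing_values="", filling_values=np.nan,
                  usecols=(1, 3))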

Extract a value from each text file obeying a naming convention - how?

I need to extract the last number in the last line of each text file in a directory. Can someone get me started on this in Python? The data is information formatted as follows:
# time 'A' 'B'
0.000000E+00 10000 0
1.000000E+05 7742 2263
where the '#' column is empty in each file. The filenames obey the following naming convention:
for i in `seq 1 100`; for j in `seq 1 101`; for letter in {A..D};
filename = $letter${j}_${i}.txt
These files contain the resulting data from running simulations in KaSim (Kappa language). I want to take the averages of subsets of the extracted numbers and plot some results.
Matlab can't handle the set of 50,000 files I'm dealing with. I'm relatively new to Python but I have experience in Matlab and R. I want to do the data extraction through Python and the analysis in Matlab or R.
Thanks for any help.
This code should get you started. As long as the directory contains only the files whose last number you need, the naming convention can be ignored, because you can simply pick up every file in that directory.
import glob

last_numbers = []
for filename in glob.glob("/path/to/directory/*"):  # don't forget the trailing * (it's a wildcard)
    last_number = open(filename).readlines()[-1].split(" ")[-1]
    # if the last line is an empty '\n' and you want the second-to-last line,
    # use .readlines()[-2].split(" ")[-1] instead
    last_numbers.append(last_number)
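(If you then want to hand the extracted values to Matlab or R, one possible sketch, assuming whitespace-separated columns and an output file name of your choosing, is to collect the numbers keyed by filename and write them to a single CSV:)

import csv
import glob
import os

rows = []
for filename in glob.glob("/path/to/directory/*.txt"):
    with open(filename) as f:
        last_line = f.readlines()[-1]
    # last whitespace-separated token of the last line, converted to a number
    rows.append((os.path.basename(filename), float(last_line.split()[-1])))

# one CSV that Matlab or R can read directly
with open("last_numbers.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "last_number"])
    writer.writerows(rows)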

Filter RDD based on row_number

sc.textFile(path) allows reading an HDFS file, but it does not accept parameters (like skipping a number of rows, has_headers, ...).
In the "Learning Spark" O'Reilly e-book, it's suggested to use the following function to read a CSV (Example 5-12. Python load CSV example):
import csv
import StringIO

def loadRecord(line):
    """Parse a CSV line"""
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"])
    return reader.next()

input = sc.textFile(inputFile).map(loadRecord)
My question is about how to be selective about the rows "taken":
How to avoid loading the first row (the headers)
How to remove a specific row (for instance, row 5)
I see some decent solutions here: select range of elements but I'd like to see if there is anything more simple.
Thx!
Don't worry about loading the rows/lines you don't need. When you do:
input = sc.textFile(inputFile)
you are not loading the file. You are just getting an object that will allow you to operate on the file. So to be efficient, it is better to think in terms of getting only what you want. For example:
header = input.take(1)[0]
rows = input.filter(lambda line: line != header)
Note that here I am not using an index to refer to the line I want to drop but rather its value. This has the side effect that other lines with this value will also be ignored, but it is more in the spirit of Spark: Spark will distribute your text file in parts across the nodes, and the concept of line numbers gets lost in each partition. This is also the reason why this is not easy to do in Spark (Hadoop), as each partition should be considered independent and a global line number would break this assumption.
If you really need to work with line numbers, I recommend that you add them to the file outside of Spark (see here) and then just filter by this column inside of Spark.
Edit: Added a zipWithIndex solution as suggested by @Daniel Darabos.
(sc.textFile('test.txt')
   .zipWithIndex()                # [(u'First', 0), (u'Second', 1), ...]
   .filter(lambda x: x[1] != 5)   # drop the line with index 5
   .map(lambda x: x[0])           # [u'First', u'Second', ...]
   .collect())
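(The same index-based filter can also drop the header, assuming it is the first line of the file; a small sketch, where sc is your existing SparkContext:)

# keep everything except the header (index 0) and the line with index 5
rows = (sc.textFile('test.txt')
          .zipWithIndex()
          .filter(lambda x: x[1] not in (0, 5))
          .map(lambda x: x[0]))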

Comparing a file containing a chromosomal region and another containing point coordinates

Could I please be advised on the following problem. I have csv files which I would like to compare. The first contains coordinates of specific points in the genome (e.g. chr3: 987654 – 987654). The other csv files contain coordinates of genomic regions (e.g. chr3: 135596 – 123456789). I would like to cross-compare my first file with my other files to see whether any point location in the first file overlaps with any regional coordinates in the other files, and to write this set of overlaps into a separate file. To keep things simple for a start, I have drafted a simple piece of code to cross-compare two csv files. Strangely, my code runs and prints the coordinates but does not write the point coordinates into a separate file. My first question is whether my approach (in the code below) to comparing these two files is optimal, or is there a better way of doing this? Secondly, why is it not writing to a separate file?
import csv
Region = open ('Region_test1.csv', 'rt', newline = '')
reader_Region = csv.reader (Region, delimiter = ',')
DMC = open ('DMC_test.csv', 'rt', newline = '')
reader_DMC = csv.reader (DMC, delimiter = ',')
DMC_testpoint = open ('DMC_testpoint.csv', 'wt', newline ='')
writer_Exon = csv.writer (DMC_testpoint, delimiter = ',')
for col in reader_Region:
    Chr_region = col[0]
    Start_region = int(col[1])
    End_region = int(col [2])
    for col in reader_DMC:
        Chr_point = col[0]
        Start_point = int(col [1])
        End_point = int(col[2])
        if Chr_region == Chr_point and Start_region <= Start_point and End_region >= End_point:
            print (True, col)
        else:
            print (False, col)
        writer_Exon.writerow(col)
Region.close()
DMC.close()
A couple of things are wrong, not the least of which is that you never check whether your files opened successfully. The most glaring is that you never close your writer.
That said, this is an incredibly non-optimal way to go about the program. File I/O is slow. You don't want to keep rereading everything in a factorial fashion. Given that your search requires all possible comparisons, you'll want to store at least one of the two files completely in memory, and potentially use a generator/iterator over the other if you don't wish to store both complete sets of data in memory.
Once you have both sets loaded, proceed to do your intersection checks, as in the sketch below.
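(A minimal sketch of that idea, assuming both files have the columns chromosome, start, end and reusing the file names from your code:)

import csv

# read all regions into memory once
with open('Region_test1.csv', newline='') as region_file:
    regions = [(row[0], int(row[1]), int(row[2]))
               for row in csv.reader(region_file)]

# stream the points and write every point that falls inside any region
with open('DMC_test.csv', newline='') as dmc_file, \
     open('DMC_testpoint.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    for row in csv.reader(dmc_file):
        chr_point, start, end = row[0], int(row[1]), int(row[2])
        if any(chr_point == c and s <= start and end <= e
               for c, s, e in regions):
            writer.writerow(row)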
I'd suggest you take a look at http://docs.python.org/2/library/csv.html for how to use a csv reader, because what you are doing doesn't appear to make any sense: col[0], col[1] and col[2] aren't going to be what you think they are.
These are style and readability things but:
The names of some iteration variables seem a bit off: for col in ... should probably be for token in ..., because you are processing token by token, not column by column or line by line.
Additionally, it would be nice to pick something consistent for your variable names; sometimes you start with uppercase, sometimes you save the uppercase for after the '_'.
Putting a space between some objects and their function names but not others is also very odd. But again, these don't change the functionality of your code.

Python 3: Extracting Data from a .txt File?

So, I have this file that has data set up like this:
Bob 5 60
Carl 7 80
Rick 8 100
Santiago 7 30
I need to separate each part into three different lists. One for the name, one for the first number, and one for the second number.
But I don't really understand how exactly I extract those parts. Also, let's say I want to make a tuple from the first line, with each of the different parts (the name, first number, and second number) combined into a single tuple?
I just don't get how I extract that information.
I just learned how to read and write text files...so I'm pretty clueless.
EDIT: As a note, the text file already exists. The program I'm working on needs to read the text file, which has its data formatted in the way I listed.
You can split each line on whitespace:
with open(yourfile) as f:
    rows = [l.split() for l in f]

names, firstnums, secondnums = zip(*rows)
zip(*rows) transposes the rows, giving you the 3 columns as 3 separate sequences.
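(If you also want the numbers as ints and a tuple for the first line, here is a small variation on the same idea; yourfile is the same placeholder as above:)

with open(yourfile) as f:
    rows = [(name, int(a), int(b)) for name, a, b in (l.split() for l in f)]

names, firstnums, secondnums = zip(*rows)  # three tuples, one per column
first_line = rows[0]                       # e.g. ('Bob', 5, 60)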
Would the pickle module not be ideal here? Pickle gives Python the ability to save and load objects that need to be 'usable' in Python, so instead of just importing a string from a text file and having to parse it, pickle can load it and give you the actual container you're trying to work with.
example:
import pickle

myList = ["Bob", 1, 2]
listToBeSaved = pickle.dumps(myList)  # bytes to write to your save file (open it in 'wb' mode)
# insert code where you work with the file and save it
# .........
# upon needing to open and work with this file
with open(fileYouWroteTo, 'rb') as f:
    listTranslated = pickle.load(f)  # turns the saved bytes back into a proper list
