I have a lot of data files with unknown names. I have figured out a way to read and print them all, but I want to make graphs of them, so I need the data in a workable form.
The data files are very neatly arranged (every header line says what is stored there), but I am having trouble writing a script that selects the data I need. The first 50+ lines of each file are header lines, of which I need only a few. That poses no problem when using something like:
import glob

for filename in glob.glob(fullpath):
    with open(filename, 'r') as f:
        for line in f:
            if 'xx' in line:
                pass  # do my thing
            if 'yy' in line:
                pass  # do my thing, etc.
But below the headers there is a block of data with an undetermined number of columns and lines (the number of columns, and what each column is, is specified in the headers). I can't get this block read in a form from which matplotlib, for example, can make a graph. (I can get it right by manually copying the data to a separate file and reading that into a plottable format, but I don't want to do that every time for every file...) The line before the data starts contains the very useful #eoh, but I can't figure out how to combine the selective reading of the first 50+ lines with then switching to reading everything into an array. If there are better ways to do what I want (including selecting the directory and seeing which files are there and readable), I am open to suggestions.
Update:
The solution proposed by @ImportanceOfBeingErnest seems very useful, but I can't get it to work.
So I'll start with the data mentioned as missing in the answer.
Column names are given in the following format:
#COLUMNINFO= NUMBER1, UNIT, MEASUREMENT, NUMBER2
In this format, NUMBER1 is the column number, UNIT is the unit of the measurement, MEASUREMENT is what is measured, and NUMBER2 is a numeric code for what is measured.
The data is separated by spaces but that won't be a problem, I suspect.
I tried to implement the reading of the headers in the loop to determine the end of the headers, which failed to have any visible effects, even the print commands to check intermediate results did not show.
Once I put 'print line' after 'for line in f:' I thought I could see what went wrong but it appears as if the whole loop is ignored, including the break command which causes an error since the file is done reading and no data is left to read for the other parts...
Any help would be appreciated.
First of all, if every header line starts with a certain character, this can be used to filter the header out automatically. With numpy you could use numpy.loadtxt(filename, delimiter=";", comments="#") to load the data, and every line starting with # would simply be ignored.
I don't know if this is the case here?!
In the case that you describe, where you have a header-ending flag #eoh, you could first read the header line by line to find out how many lines you later need to ignore, and then use that number when loading the file.
I have put together a small example of how it could work.
import glob

import numpy as np
import matplotlib.pyplot as plt

def readFile(filename):
    # first find the number of lines to skip in the header
    eoh = 0
    with open(filename, "r") as f:
        for line in f:
            eoh = eoh + 1
            if "#eoh" in line:
                break
    # now at this point we need to find out about the column names
    # but as no data is given as example, this is impossible
    columnnames = []
    # load the file by skipping eoh lines,
    # the rest should be well behaved
    a = np.genfromtxt(filename, skip_header=eoh, delimiter=";")
    return a, columnnames

def plot(a, columnnames, show=True, save=False, filename="something"):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # plot the fourth column against the second
    ax.plot(a[:, 1], a[:, 3])
    # if we had some column names, we could also plot
    # the column named "ulf" against the one named "alf"
    # ax.plot(a[:, columnnames.index("alf")], a[:, columnnames.index("ulf")])
    # now save and/or show
    if save:
        plt.savefig(filename + ".png")
    if show:
        plt.show()

if __name__ == "__main__":
    fullpath = "path/to/files/file*.txt"  # or whatever
    for filename in glob.glob(fullpath):
        a, columnnames = readFile(filename)
        plot(a, columnnames, show=True, save=False, filename=filename[:-4])
One remaining problem is the names of the columns. Since you did not provide any example data, it's hard to estimate how to exactly do that.
This all assumes that you do not have any missing data in between or anything of that kind. If this was the case, then you'd need to use all the arguments to numpy.genfromtxt() to filter the data accordingly.
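Given the header format described in the question's update, the column names could probably be collected in the same pass that looks for #eoh. The following is only a minimal sketch under stated assumptions: every #COLUMNINFO line has exactly four comma-separated fields, the data lines after #eoh are whitespace-separated numbers, and the function name parse_datafile and the dictionary keys are made up for illustration.

```python
def parse_datafile(lines):
    """Split an iterable of text lines into column metadata and numeric rows.

    Assumes '#COLUMNINFO= NUMBER1, UNIT, MEASUREMENT, NUMBER2' header lines
    and whitespace-separated numbers after the '#eoh' line.
    """
    columns = []      # one dict per #COLUMNINFO header line
    rows = []         # one list of floats per data line
    in_data = False
    for line in lines:
        line = line.strip()
        if not in_data:
            if line.startswith('#COLUMNINFO='):
                # Assumption: exactly four comma-separated fields.
                number, unit, measurement, code = [
                    part.strip() for part in line.split('=', 1)[1].split(',')
                ]
                columns.append({'number': int(number), 'unit': unit,
                                'measurement': measurement, 'code': code})
            elif line.startswith('#eoh'):
                in_data = True   # everything after this line is data
            continue             # header lines never reach the data branch
        if line:                 # skip blank lines inside the data block
            rows.append([float(value) for value in line.split()])
    return columns, rows
```

Because the function only needs an iterable of lines, an open file can be passed directly, for example `with open(filename) as f: columns, rows = parse_datafile(f)`. Calling `np.array(rows)` then gives a 2-D array that can be sliced by column for plotting, as in the plot function above.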
Related
What do I want to do:
I want to read a file into R that contains 2 variables and 2500 observations for each variable.
The data is the output (a list) of a function from a Python project. I have one list for each variable (2 lists with 2500 data points each).
I first copied the data into an Excel file, converted it to a CSV file, and read it into R.
Since that strategy did not work, I copied the lists into a text file.
What is my output/problem?
When I read the file into R with read.csv() (obviously with the CSV file), I get 1 observation but 2500 variables (it should be 2 variables with 2500 observations each).
When I read the file into R with read.table(), I get this error and this warning message:
Error in read.table(file = "dataset.txt", header = TRUE) :
more columns than column names
In addition: Warning message:
In read.table(file = "dataset.txt", header = TRUE) :
incomplete final line found by readTableHeader in 'dataset.txt'
What do I think is the problem?
The data points are side by side and not one below the other.
Example:
A=[0.25, 0.67, ...,0.1]
B=[0.03, 0.14, ..., 0.09]
My guess is that R sees elements that are side by side as variables and all data points that are under the variables as observations and perhaps the first line as heading (so the first line is seen as heading, the second line as 1 observation and all data points that are in the second line are seen as variables).
What did I try:
1. I tried to separate the data points with the sep=',' argument, but that did not change the number of observations or variables (I also tried a bunch of other separators, like ';' and '\').
2. I tried to copy the data points (from the Python print output) into an Excel file, but it always put the data points next to one another and gave me an error that Excel only works with a maximum of ~800 data points.
3. I tried to create a CSV file in the Python program with csv.writer() that starts a new line after each comma. This gave me an empty file as output; I don't know why.
import csv
with open('dataset.txt','w', newline= '') as csvfile:
    simval_similar = csv.writer(csvfile, delimiter= ' ', quotechar=',', quoting=csv.QUOTE_ALL)
print(dataset.txt)
To test my guess (explained in the 'What do I think is the problem?' section), I manually rearranged 200 data points in a text file so that they are listed one below the other. That suddenly gave me 200 observations, which means that my guess was probably right (or at least partly).
But doing that manually would mean repeating it for 5000 data points, and the strategy is error-prone.
I don't know how to continue and would really appreciate help.
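One possible way around the manual rearranging is to let Python write the two lists as columns, one observation per row. The csv.writer in the attempt above is created but no row is ever written through it, which would explain the empty file. The sketch below is only an illustration under assumptions: the function name columns_to_csv is made up, and the lists are assumed to have equal length.

```python
import csv
import io


def columns_to_csv(columns):
    """Return CSV text with one column per key and one row per observation.

    `columns` maps a column name to a list of values; all lists are
    assumed to have the same length.
    """
    names = list(columns)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(names)                       # header line: A,B
    # zip(*lists) turns side-by-side lists into one tuple per observation
    writer.writerows(zip(*(columns[name] for name in names)))
    return buffer.getvalue()
```

The returned text could then be saved with something like `open('dataset.csv', 'w', newline='').write(text)` and read in R with `read.csv("dataset.csv")`, which should then see 2 variables with one observation per line.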
I have a log record like this (millions of rows):
previous_status>SERVICE</previous_status><reason>1</>device_id>SENSORS</device_id><DEVICE>ISCS</device_type><status>OK
I would like to extract all the words in capitals into individual columns in Excel using Python, to look like this:
SERVICE SENSORS DEVICE
As per the comments from @peter-wood, it isn't clear what your input is. However, assuming that your input is as you posted, here is a minimal solution that works off the given structure. If it is not quite right, you should be able to easily change it to search on whatever your structure really is.
import csv

# You need to change this path.
with open('/path/to/log.txt') as log:
    lines = [row.strip() for row in log]

# You need to change this path to where you want to write the file.
with open('/path/to/write/to/mydata.csv', 'w', newline='') as fh:
    # If you want a different delimiter, like tabs '\t', change it here.
    writer = csv.writer(fh, delimiter=',')
    for l in lines:
        # You can cut and paste the tokens that start and stop the pieces you are looking for here.
        service = l[l.find('previous_status>')+len('previous_status>'):l.find('</previous_status')]
        sensor = l[l.find('device_id>')+len('device_id>'):l.find('</device_id>')]
        device = l[l.find('<DEVICE>')+len('<DEVICE>'):l.find('</device_type>')]
        writer.writerow([service, sensor, device])
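The three slicing expressions above repeat the same pattern, so they could be folded into a small helper. This is only a sketch: the name between is made up, and returning an empty string when a token is missing is a choice made here (the raw slicing would instead return an arbitrary slice, because str.find gives -1 for a missing token).

```python
def between(line, start, stop):
    """Return the text between the first `start` token and the next `stop` token.

    Returns '' if either token is missing from the line.
    """
    begin = line.find(start)
    if begin == -1:
        return ''
    begin += len(start)
    end = line.find(stop, begin)   # only look for `stop` after `start`
    if end == -1:
        return ''
    return line[begin:end]
```

With this helper, each row could be built as `[between(l, 'previous_status>', '</previous_status'), between(l, 'device_id>', '</device_id>'), between(l, '<DEVICE>', '</device_type>')]`.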
I'm currently trying to make an automation script for writing new files from a master where there are two strings I want to replace (x1 and x2) with values from a 21 x 2 array of numbers (namely, [[0,1000],[50,950],[100,900],...,[1000,0]]). Additionally, with each double replacement, I want to save that change as a unique file.
Here's my script as it stands:
import numpy
lines = []
x1x2 = numpy.array([[0,1000],[50,950],[100,900],...,[1000,0]])
for i,j in x1x2:
    with open("filenamexx.inp") as infile:
        for line in infile:
            linex1 = line.replace('x1',str(i))
            linex2 = line.replace('x2',str(j))
            lines.append(linex1)
            lines.append(linex2)
    with open("filename"+str(i)+str(j)+".inp", 'w') as outfile:
        for line in lines:
            outfile.write(line)
With my current script there are a few problems. First, the string replacements are being done separately, i.e. I end up with a new file that contains the contents of the master file twice where one line has the first change and then the next will reflect the second separately. Second, with each subsequent iteration, the new files have the contents of the previous file prepended (i.e. filename100900.inp will contain its unique contents as well as the contents of both filename01000.inp and filename50950.inp before it). Anyone think they can take a crack at solving my problem?
Note: I've looked at solutions using the regex module (something like this: https://www.safaribooksonline.com/library/view/python-cookbook-2nd/0596007973/ch01s19.html) in order to do multiple replacements in a single pass, but I'm not sure whether the way I'm indexing translates to a dictionary object.
I'm not sure I understood the second issue but you can use replace more than one time on the same string, so:
s = "x1 stuff x2"
s = s.replace('x1',str(1)).replace('x2',str(2))
print(s)
will output:
1 stuff 2
There is no need to do this twice with two different variables. As for the second issue, it seems you are not resetting the "lines" variable before starting to write a new file. So once you finish writing a file, just add:
lines = []
It should be enough to solve these issues.
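Putting both fixes together, a version of the script might look like the sketch below. It is not the only way to do it: the function names render and write_variants are made up, the master file is read once instead of once per pair, and no shared list is kept between files, so nothing can leak from one output into the next.

```python
from pathlib import Path


def render(template, x1, x2):
    """Return the template text with both placeholders replaced."""
    return template.replace('x1', str(x1)).replace('x2', str(x2))


def write_variants(master_path, out_dir, pairs):
    """Write one output file per (x1, x2) pair and return the written paths."""
    template = Path(master_path).read_text()     # read the master only once
    written = []
    for x1, x2 in pairs:
        # Same naming scheme as in the question: filename<x1><x2>.inp
        out_path = Path(out_dir) / f"filename{x1}{x2}.inp"
        out_path.write_text(render(template, x1, x2))
        written.append(out_path)
    return written
```

A call such as `write_variants("filenamexx.inp", ".", [(0, 1000), (50, 950), (100, 900), (1000, 0)])` would use only the pairs quoted in the question; the elided pairs would be added to the list in the same way.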
sc.textFile(path) lets you read an HDFS file, but it does not accept parameters (such as skipping a number of rows, has_headers, ...).
In the "Learning Spark" O'Reilly e-book, it's suggested to use the following function to read a CSV (Example 5-12. Python load CSV example):
import csv
import StringIO

def loadRecord(line):
    """Parse a CSV line"""
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"])
    return reader.next()

input = sc.textFile(inputFile).map(loadRecord)
My question is about how to be selective about the rows "taken":
How to avoid loading the first row (the headers)
How to remove a specific row (for instance, row 5)
I see some decent solutions here: select range of elements but I'd like to see if there is anything more simple.
Thx!
Don't worry about loading the rows/lines you don't need. When you do:
input = sc.textFile(inputFile)
you are not loading the file. You are just getting an object that will allow you to operate on the file. So to be efficient, it is better to think in terms of getting only what you want. For example:
header = input.take(1)[0]
rows = input.filter(lambda line: line != header)
Note that here I am not using an index to refer to the line I want to drop, but rather its value. This has the side effect that other lines with the same value will also be dropped, but it is more in the spirit of Spark: Spark distributes your text file in parts across the nodes, and the concept of a line number gets lost within each partition. This is also why this is not easy to do in Spark (Hadoop): each partition should be considered independent, and a global line number would break this assumption.
If you really need to work with line numbers, I recommend that you add them to the file outside of Spark (see here) and then just filter by this column inside Spark.
Edit: Added a zipWithIndex solution as suggested by @Daniel Darabos.
(sc.textFile('test.txt')
    .zipWithIndex()                 # [(u'First', 0), (u'Second', 1), ...]
    .filter(lambda x: x[1] != 5)    # drop the row with index 5
    .map(lambda x: x[0])            # [u'First', u'Second', ...]
    .collect())
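As a side note, the loadRecord function quoted in the question is Python 2 code: the StringIO module and reader.next() do not exist in Python 3. A minimal Python 3 sketch of the same per-line parser, assuming the same two field names as in the book example, might look like this (the names load_record and FIELDNAMES are made up here):

```python
import csv
import io

# Field names taken from the example quoted in the question.
FIELDNAMES = ["name", "favouriteAnimal"]


def load_record(line):
    """Parse one CSV line into a dict keyed by FIELDNAMES."""
    reader = csv.DictReader(io.StringIO(line), fieldnames=FIELDNAMES)
    return next(reader)   # Python 3 replacement for reader.next()
```

It would be applied after the header and unwanted rows have been filtered out, for example by mapping load_record over the filtered lines instead of over the raw sc.textFile(inputFile) result.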
Could I please be advised on the following problem? I have CSV files which I would like to compare. The first contains coordinates of specific points in the genome (e.g. chr3: 987654 – 987654). The other CSV files contain coordinates of genomic regions (e.g. chr3: 135596 – 123456789). I would like to cross-compare my first file with the other files to see whether any point location in the first file overlaps with any regional coordinates in the other files, and to write this set of overlaps into a separate file. To keep things simple for a start, I have drafted some code to cross-compare two CSV files. Strangely, my code runs and prints the coordinates but does not write the point coordinates into a separate file. My first question is: is my approach (from my code) to comparing these two files optimal, or is there a better way of doing this? Secondly, why is it not writing into a separate file?
import csv
Region = open ('Region_test1.csv', 'rt', newline = '')
reader_Region = csv.reader (Region, delimiter = ',')
DMC = open ('DMC_test.csv', 'rt', newline = '')
reader_DMC = csv.reader (DMC, delimiter = ',')
DMC_testpoint = open ('DMC_testpoint.csv', 'wt', newline ='')
writer_Exon = csv.writer (DMC_testpoint, delimiter = ',')
for col in reader_Region:
    Chr_region = col[0]
    Start_region = int(col[1])
    End_region = int(col [2])
    for col in reader_DMC:
        Chr_point = col[0]
        Start_point = int(col [1])
        End_point = int(col[2])
        if Chr_region == Chr_point and Start_region <= Start_point and End_region >= End_point:
            print (True, col)
        else:
            print (False, col)
        writer_Exon.writerow(col)
Region.close()
DMC.close()
A couple of things are wrong, not the least of which is that you never check whether your files opened successfully. The most glaring is that you never close the output file (DMC_testpoint), so rows buffered by the writer may never be flushed to disk.
That said, this is an incredibly non-optimal way to go about the program. File I/O is slow. You don't want to reread a file for every row of the other one. Given that your search requires all possible comparisons, you'll want to store at least one of the two files completely in memory, and potentially use a generator/iterator over the other if you don't wish to store both complete sets of data in memory.
Once you have both sets loaded, proceed to do your intersection checks.
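As a rough sketch of that approach (not the only way to do it), the regions can be loaded into a dictionary keyed by chromosome while the points are streamed row by row. The function names below are made up, and the code assumes that neither CSV file has a header row and that the first three fields of every row are chromosome, start and end, as in the question's code.

```python
import csv


def load_regions(region_rows):
    """Group (start, end) region tuples by chromosome name."""
    regions = {}
    for row in region_rows:
        regions.setdefault(row[0], []).append((int(row[1]), int(row[2])))
    return regions


def points_in_regions(regions, point_rows):
    """Return the point rows that lie fully inside a region on the same chromosome."""
    hits = []
    for row in point_rows:
        chrom, start, end = row[0], int(row[1]), int(row[2])
        for region_start, region_end in regions.get(chrom, []):
            if region_start <= start and end <= region_end:
                hits.append(row)
                break   # one match is enough; avoid recording the point twice
    return hits


def write_overlaps(region_path, point_path, out_path):
    """Read both CSV files and write the overlapping points to out_path."""
    with open(region_path, newline='') as region_file:
        regions = load_regions(csv.reader(region_file))
    # The with-blocks close every file, so the output is flushed to disk.
    with open(point_path, newline='') as point_file, \
            open(out_path, 'w', newline='') as out_file:
        csv.writer(out_file).writerows(
            points_in_regions(regions, csv.reader(point_file)))
```

With the file names from the question, this would be called as `write_overlaps('Region_test1.csv', 'DMC_test.csv', 'DMC_testpoint.csv')`.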
I'd suggest you take a look at http://docs.python.org/2/library/csv.html for how a csv reader behaves: it yields one row per iteration as a list of strings, and it can only be iterated once. After the first pass of your outer loop, reader_DMC is exhausted, so the inner loop does nothing for the remaining regions.
These are style and readability things, but:
The names of some iteration variables seem a bit off: for col in ... should probably be for row in ..., because csv.reader hands you one row at a time, not one column.
Additionally, it would be nice to pick one naming convention for your variables and stick to it; sometimes you start with uppercase, sometimes you save the uppercase for after your '_'.
Putting ' ' between some function names and their parentheses but not others is also very odd. But again, these don't change the functionality of your code.