Saving results of regular expression to csv or xls - python

I have a log record like this (millions of rows):
previous_status>SERVICE</previous_status><reason>1</>device_id>SENSORS</device_id><DEVICE>ISCS</device_type><status>OK
I would like to extract all the words in capitals into individual columns in Excel using Python, to look like this:
SERVICE SENSORS DEVICE

As per the comments from @peter-wood, it isn't clear what your input is. However, assuming that your input is as you posted, here is a minimal solution that works off the given structure. If it is not quite right, you should be able to easily change it to search on whatever your real structure is.
import csv

# You need to change this path.
lines = [row.strip() for row in open('/path/to/log.txt').readlines()]

# You need to change this path to where you want to write the file.
with open('/path/to/write/to/mydata.csv', 'w') as fh:
    # If you want a different delimiter, like tabs '\t', change it here.
    writer = csv.writer(fh, delimiter=',')
    for l in lines:
        # You can cut and paste the tokens that start and stop the pieces you are looking for here.
        service = l[l.find('previous_status>') + len('previous_status>'):l.find('</previous_status')]
        sensor = l[l.find('device_id>') + len('device_id>'):l.find('</device_id>')]
        device = l[l.find('<DEVICE>') + len('<DEVICE>'):l.find('</device_type>')]
        writer.writerow([service, sensor, device])
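
Since the question title mentions regular expressions: a shorter alternative is a sketch using re.findall, assuming the words you want are always runs of two or more uppercase letters. Note that on the sample line this would also capture ISCS and OK, so you may need to filter or slice the matches. The paths are placeholders to change.

import csv
import re

with open('/path/to/log.txt') as fh, \
        open('/path/to/write/to/mydata.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for line in fh:
        # Every run of two or more capital letters becomes a column.
        writer.writerow(re.findall(r'\b[A-Z]{2,}\b', line))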

Related

write dicom header to csv

I've got a bunch of .dcm files (DICOM files) from which I would like to extract the header and save the information there in a CSV file.
As you can see in the following picture, I've got a problem with the delimiters:
For example when looking at the second line in the picture: I'd like to split it like this:
0002 | 0000 | File Meta Information Group Length | UL | 174
But as you can see, not only are there multiple delimiters, but sometimes a space ' ' is one and sometimes it isn't. Also, the length of the 3rd column varies, so sometimes there is only a shorter text there, e.g. Image Type further down in the picture.
Does anyone have a clever idea, how to write it in a CSV file?
I use pydicom to read and display the files in my IDE.
I'd be very thankful for any advice :)
I would suggest going back to the data elements themselves and working from there, rather than from the string output (which is really meant for exploring in interactive sessions).
The following code should work for a dataset with no Sequences; it would need some modification to work with sequences:
import csv

import pydicom
from pydicom.data import get_testdata_file

filename = get_testdata_file("CT_small.dcm")  # substitute your own filename here
ds = pydicom.dcmread(filename)

with open('my.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow("Group Elem Description VR value".split())
    for elem in ds:
        writer.writerow([
            f"{elem.tag.group:04X}", f"{elem.tag.element:04X}",
            elem.description(), elem.VR, str(elem.value)
        ])
It may also require a bit of change to make the elem.value part look how you want it, or you may want to set the CSV writer to use quotes around items, etc.
Output looks like:
Group,Elem,Description,VR,value
0008,0005,Specific Character Set,CS,ISO_IR 100
0008,0008,Image Type,CS,"['ORIGINAL', 'PRIMARY', 'AXIAL']"
0008,0012,Instance Creation Date,DA,20040119
0008,0013,Instance Creation Time,TM,072731
...
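
If your datasets do contain Sequences, one possible modification is to recurse into the sequence items. This is only a sketch under that assumption (the helper name write_elements is mine), reusing the same row format as above:

def write_elements(writer, ds):
    # Write each data element; descend into sequence items recursively.
    for elem in ds:
        if elem.VR == "SQ":
            # Flag the sequence element itself with its item count...
            writer.writerow([
                f"{elem.tag.group:04X}", f"{elem.tag.element:04X}",
                elem.description(), elem.VR, f"{len(elem.value)} item(s)"
            ])
            # ...then write the elements of each item.
            for item in elem.value:
                write_elements(writer, item)
        else:
            writer.writerow([
                f"{elem.tag.group:04X}", f"{elem.tag.element:04X}",
                elem.description(), elem.VR, str(elem.value)
            ])

You would call write_elements(writer, ds) in place of the for elem in ds loop above.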

python script to exclude specific field of csv cell data

Using Python I'm trying to create a summary from existing CSV data, and I'm finding difficulties in extracting data from one of the cells.
the input csv file
I want to include only the city name and file path from the 'info 4' column, and I expect the summary to look like: AlexxxxxyyyyzzzzzNewyork\Folder1\Folder2\Test.txt
the code
csv_data_out[csv_line_out].append(conten['Name'])
csv_data_out[csv_line_out].append(conten['info 1'])
csv_data_out[csv_line_out].append(conten['info 2'])
csv_data_out[csv_line_out].append(conten['info 3'])
csv_data_out[csv_line_out].append(conten['info 4'])
csv_summary = ("".join(csv_data_out[csv_line_out]))

with open(outputfile, 'wb') as newfile:
    writer = csv.writer(newfile, delimiter=';')
    writer.writerow(csv_columns_out[:])
    writer.writerows(csv_data_out)
newfile.close()
Any idea how to fetch only the required details from the 'info 4' column?
Essentially you have a CSV inside a CSV. There's not enough info posted to give a fully complete answer, but here's most of it.
You can take a string and process it as a csv using io.StringIO (or io.BytesIO if a byte string).
#! /usr/bin/env python
# -*- coding: utf-8 -*-

import csv
from io import StringIO

# Create somewhere to put the inputs in case needed later
stored_items = []

with open('data.csv', 'r') as csvfile:
    inputs = csv.reader(csvfile)
    # skip the header row
    next(inputs)
    for row in inputs:
        # Extract the Info 4 column for processing
        f = StringIO(row[4])
        string_file = csv.reader(f, quotechar='"')
        build_string = ""
        for string_row in string_file:
            build_string = f"{string_row[0]}{string_row[1]}"
        # Merge everything into a summary
        summary_string = f"{row[0]}{row[1]}{row[2]}{row[3]}{build_string}"
        # Add all the data back to storage
        stored_items.append((row[0], row[1], row[2], row[3], row[4], summary_string))
        print(summary_string)
The reason I say there's not enough information posted is because, for example: will the location marker always be (a), which allows a fixed text replacement, or will it be conditional, e.g. it could be (a) or (b), in which case it would possibly require regex? (My preference is not to use regex unless absolutely necessary.) Also, is it always the first two terms you are after from Info 4, or will the terms be found in different places in the text? Without seeing more samples of the data it's impossible to answer definitively.
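
If it does turn out to be conditional, a tiny sketch of the regex route; the field value here is hypothetical, since the real data is only shown as a picture:

import re

# Hypothetical 'info 4' value; the real data is only shown as a picture.
field = r"(a) Newyork,\Folder1\Folder2\Test.txt"
# Strip a leading parenthesised marker such as "(a)" or "(b)".
cleaned = re.sub(r'^\([a-z]\)\s*', '', field)
print(cleaned)  # Newyork,\Folder1\Folder2\Test.txt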

Not getting the full output out of a list

Objective
I'm trying to extract the GPS "Latitude" and "Longitude" data from a bunch of JPGs, and I have been successful so far. My main problem is that when I try to write the coordinates to a text file, only one set of coordinates is written, while my console output shows that every image was extracted. Here is an example: Console Output, and here is my text file that is supposed to mirror my console output: Text file.
I don't fully understand what the problem is and why it won't just write all of them instead of one. I believe the data is being overwritten somehow, or the 'GPSPhoto' module is causing some issues.
Code
from glob import glob
from GPSPhoto import gpsphoto

# Scan jpg's that are located in the same directory.
data = glob("*.jpg")

# Scan contents of images and GPS values.
for x in data:
    data = gpsphoto.getGPSData(x)
    data = [data.get("Latitude"), data.get("Longitude")]
    print("\nsource: {}".format(x), "\n ↪ {}".format(data))

# Write coordinates to a text file.
with open('output.txt', 'w') as f:
    print('Coordinates:', data, file=f)
I have tried pretty much everything that I can think of including: changing the write permissions, not using glob, no loops, loops, lists, no lists, different ways to write to the file, etc.
Any help is appreciated because I am completely lost at this point. Thank you.
You're replacing the data variable each time through the loop, not appending to a list.
all_coords = []
for x in data:
    gps = gpsphoto.getGPSData(x)  # use a new name; don't clobber the list you're looping over
    all_coords.append([gps.get("Latitude"), gps.get("Longitude")])

with open('output.txt', 'w') as f:
    print('Coordinates:', all_coords, file=f)
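
If you want output.txt to mirror the console output with one line per image, you can open the file once and write inside the loop. A minimal variation of the fix above, using the same modules as the question:

from glob import glob
from GPSPhoto import gpsphoto

with open('output.txt', 'w') as f:
    for x in glob("*.jpg"):
        gps = gpsphoto.getGPSData(x)
        coords = [gps.get("Latitude"), gps.get("Longitude")]
        # One line per image, matching the console output.
        print("source: {} -> {}".format(x, coords), file=f)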

Reading selected parts of a file of unknown name

I have a lot of data files with unknown names. I have figured out a way to get them all read and printed, but I want to make graphs of them, so I need the data in a workable form.
The data files are very neatly arranged (every line of the header contains information on what is stored there), but I am having trouble making a script that selects the data I need. The first 50+ lines of the file contain headers, of which I need only a few; this poses no problem when using something like:
for filename in glob.glob(fullpath):
    with open(filename, 'r') as f:
        for line in f:
            if 'xx' in line:
                ...  # Do my thing
            if 'yy' in line:
                ...  # Do my thing etc.
But below the headers there is a block of data with an undetermined number of columns and an undetermined number of lines (the number of columns, and what each column is, is specified in the headers). I can't get this read in a way that lets a graph be made by, for example, matplotlib. (I can get it right by manually copying the data to a separate file and reading that into a plottable format, but that is not what I want to do every time for every file...) The line before the data starts contains the very useful #eoh, but I can't figure out a way to combine the selective reading of the first 50+ lines with then switching to reading everything into an array. If there are methods to do what I want in a better way (including selecting the folder and seeing which files are there and readable), I am open to suggestions.
Update:
The solution proposed by @ImportanceOfBeingErnest seems very useful, but I can't get it to work.
So I'll start with the data mentioned as missing in the answer.
Column names are given in the following format:
#COLUMNINFO= NUMBER1, UNIT, MEASUREMENT, NUMBER2
In this format, NUMBER1 is the column number, UNIT is the unit of the measurement, MEASUREMENT is what is measured, and NUMBER2 is a numeric code for what is measured.
The data is separated by spaces but that won't be a problem, I suspect.
I tried to implement the reading of the headers in the loop to determine the end of the headers, which failed to have any visible effect; even the print commands to check intermediate results did not show anything.
Once I put 'print line' after 'for line in f:' I thought I could see what went wrong, but it appears as if the whole loop is ignored, including the break command, which causes an error since the file is done reading and no data is left to read for the other parts...
Any help would be appreciated.
First of all, if the header has a certain character at the beginning of each line, this can be used to filter the header out automatically. Using numpy you could use numpy.loadtxt(filename, delimiter=";", comments="#") to load the data, and every line starting with # would simply be ignored.
I don't know if this is the case here?!
In the case that you describe, where you have a header-ending flag #eoh you could first read in the header line by line to find out how many lines you later need to ignore and then use that number when loading the file.
I have assembled a little example, how it could work.
import glob

import matplotlib.pyplot as plt
import numpy as np

def readFile(filename):
    # First find the number of lines to skip in the header.
    eoh = 0
    with open(filename, "r") as f:
        for line in f:
            eoh = eoh + 1
            if "#eoh" in line:
                break
    # At this point we would need to find out about the column names,
    # but as no example data is given, this is impossible.
    columnnames = []
    # Load the file, skipping eoh lines; the rest should be well behaved.
    a = np.genfromtxt(filename, skip_header=eoh, delimiter=";")
    return a, columnnames

def plot(a, columnnames, show=True, save=False, filename="something"):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # Plot the fourth column against the second.
    ax.plot(a[:, 1], a[:, 3])
    # If we had some column names, we could also plot
    # the column named "ulf" against the one named "alf":
    # ax.plot(a[:, columnnames.index("alf")], a[:, columnnames.index("ulf")])
    # Now save and/or show.
    if save:
        plt.savefig(filename + ".png")
    if show:
        plt.show()

if __name__ == "__main__":
    fullpath = "path/to/files/file*.txt"  # or whatever
    for filename in glob.glob(fullpath):
        a, columnnames = readFile(filename)
        plot(a, columnnames, show=True, save=False, filename=filename[:-4])
One remaining problem is the names of the columns. Since you did not provide any example data, it's hard to estimate exactly how to do that.
This all assumes that you do not have any missing data in between or anything of that kind. If this was the case, then you'd need to use all the arguments to numpy.genfromtxt() to filter the data accordingly.
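
Given the #COLUMNINFO= format described in the update, here is a sketch of how the column names could be collected while scanning the header. This is an assumption-laden sketch: the helper name read_column_names is mine, and it assumes the fields are comma-separated exactly as shown.

def read_column_names(filename):
    # Collect the MEASUREMENT field from every #COLUMNINFO header line.
    columnnames = []
    with open(filename, "r") as f:
        for line in f:
            if "#eoh" in line:
                break
            if line.startswith("#COLUMNINFO="):
                # Fields: NUMBER1, UNIT, MEASUREMENT, NUMBER2
                fields = [p.strip() for p in line[len("#COLUMNINFO="):].split(",")]
                columnnames.append(fields[2])
    return columnnames

readFile could then return read_column_names(filename) as its columnnames value.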

Comparing a file containing a chromosomal region and another containing point coordinates

Could I please be advised on the following problem. I have CSV files which I would like to compare. The first contains coordinates of specific points in the genome (e.g. chr3: 987654 – 987654). The other CSV files contain coordinates of genomic regions (e.g. chr3: 135596 – 123456789). I would like to cross-compare my first file with my other files to see if any point locations in the first file overlap with any regional coordinates in the other files, and to write this set of overlaps into a separate file. To make things simple for a start, I have drafted a simple code to cross-compare between 2 CSV files. Strangely, my code runs and prints the coordinates but does not write the point coordinates into a separate file. My first question is whether my approach (from my code) to comparing these two files is optimal, or is there a better way of doing this? Secondly, why is it not writing into a separate file?
import csv

Region = open ('Region_test1.csv', 'rt', newline = '')
reader_Region = csv.reader (Region, delimiter = ',')
DMC = open ('DMC_test.csv', 'rt', newline = '')
reader_DMC = csv.reader (DMC, delimiter = ',')
DMC_testpoint = open ('DMC_testpoint.csv', 'wt', newline ='')
writer_Exon = csv.writer (DMC_testpoint, delimiter = ',')

for col in reader_Region:
    Chr_region = col[0]
    Start_region = int(col[1])
    End_region = int(col [2])
    for col in reader_DMC:
        Chr_point = col[0]
        Start_point = int(col [1])
        End_point = int(col[2])
        if Chr_region == Chr_point and Start_region <= Start_point and End_region >= End_point:
            print (True, col)
        else:
            print (False, col)
        writer_Exon.writerow(col)

Region.close()
DMC.close()
A couple of things are wrong, not the least of which is that you never check to see if your files opened successfully. The most glaring is that you never close your writer.
That said, this is an incredibly non-optimal way to go about the program. File I/O is slow. You don't want to keep rereading everything in quadratic fashion. Given that your search requires all possible comparisons, you'll want to store at least one of the two files completely in memory, and potentially use a generator/iterator over the other if you don't wish to store both complete sets of data in memory. Note also that reader_DMC is exhausted after the first pass through the outer loop, so the inner comparisons only happen for the first region.
Once you have both sets loaded, proceed to do your intersection checks.
I'd suggest you take a look at http://docs.python.org/2/library/csv.html for how to use a csv reader, because what you are doing doesn't appear to make any sense: col[0], col[1] and col[2] aren't going to be what you think they are.
These are style and readability things, but:
The names of some iteration variables seem a bit off; for col in ... should probably be for row in ..., because you are processing row by row, not column by column.
Additionally, it would be nice to pick something consistent to stick to for your variable names; sometimes you start with uppercase, sometimes you save the uppercase for after the '_'.
That you are putting spaces between your objects and some function names and not others is also very odd. But again, these don't change the functionality of your code.
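
For what it's worth, here is a minimal sketch of the "load one file into memory, then stream the other" approach described above, assuming the same three-column layout (chromosome, start, end) and the file names from the question:

import csv

# Load every region once, keyed by chromosome.
regions = {}
with open('Region_test1.csv', 'rt', newline='') as region_file:
    for chrom, start, end in csv.reader(region_file):
        regions.setdefault(chrom, []).append((int(start), int(end)))

# Stream the points and write only those that fall inside some region.
with open('DMC_test.csv', 'rt', newline='') as dmc_file, \
        open('DMC_testpoint.csv', 'wt', newline='') as out_file:
    writer = csv.writer(out_file)
    for chrom, start, end in csv.reader(dmc_file):
        start, end = int(start), int(end)
        if any(r_start <= start and r_end >= end
               for r_start, r_end in regions.get(chrom, [])):
            writer.writerow([chrom, start, end])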
