Reading more than one line after a keyword - python

I have an output file which prints out a matrix of numeric data. I need to search through this file for the identifier at the start of each data set, which is:
GROUP 1 FIRST 1 LAST 163
Here GROUP 1 is the first column of the matrix, FIRST 1 is the first non-zero element of this matrix in position 1, and LAST 163 is the last non-zero element of the matrix in position 163. The matrix doesn't necessarily end at this LAST value - in this case there are 172 values.
I want to read this data into a simpler form to work with. Here is an example of the results for the first two columns:
GROUP 1 FIRST 1 LAST 163
7.150814E-02 9.866657E-03 8.500540E-04 1.818338E-03 2.410691E-03 3.284499E-03 3.011986E-03 1.612432E-03
1.674247E-03 3.436244E-03 3.655873E-03 4.056876E-03 4.560725E-03 2.462454E-03 2.567764E-03 5.359393E-03
5.457415E-03 2.679373E-03 2.600020E-03 2.491592E-03 2.365089E-03 2.228494E-03 5.792616E-03 1.623274E-03
1.475062E-03 1.331820E-03 1.195052E-03 2.832699E-03 7.298341E-04 6.301271E-04 1.377459E-03 1.048925E-03
1.677453E-04 3.580640E-04 1.575301E-04 1.150545E-04 1.197719E-04 2.950028E-05 5.380539E-05 1.228784E-05
1.627659E-05 4.522051E-05 7.736908E-06 1.758838E-05 8.161204E-06 6.103670E-06 6.431876E-06 1.585671E-06
4.110246E-06 4.512924E-07 2.775227E-06 5.107739E-07 1.219448E-06 1.653674E-07 4.429047E-07 4.837661E-07
2.036820E-07 3.449548E-07 1.457648E-07 4.494116E-07 1.629392E-07 1.300509E-07 1.730199E-07 8.130338E-08
1.591993E-08 5.457638E-08 1.713141E-08 7.806754E-09 1.154869E-08 3.545961E-09 2.862203E-09 2.289470E-09
4.324002E-09 2.243199E-09 2.627165E-09 2.273119E-09 1.973867E-09 1.710714E-09 1.468845E-09 1.772236E-09
1.764492E-09 1.004393E-09 1.044698E-09 5.201382E-10 2.660613E-10 3.012732E-10 2.630323E-10 4.381052E-10
2.521794E-10 9.213524E-11 2.619283E-10 3.591906E-11 1.449830E-10 1.867363E-11 1.230445E-10 1.108149E-11
2.775004E-11 1.156249E-11 4.393752E-11 5.318751E-11 6.815569E-12 1.817489E-11 2.044674E-11 2.044673E-11
1.931080E-11 1.931076E-11 1.817484E-11 2.044668E-11 5.486837E-12 7.681572E-12 1.536314E-11 7.132886E-12
8.230253E-12 1.426577E-11 1.426577E-11 4.389468E-12 5.925780E-12 2.853153E-12 2.853153E-12 5.706307E-12
5.706307E-12 2.194733E-12 3.292099E-12 5.267358E-12 2.194733E-12 3.072626E-12 4.828412E-12 4.389466E-12
4.389465E-12 1.097366E-11 2.194732E-12 1.316839E-11 2.194732E-12 1.608784E-11 1.674222E-11 1.778860E-11
6.993074E-12 2.622402E-12 9.090994E-12 5.769285E-12 1.573441E-12 6.861030E-12 4.782885E-12 8.768619E-13
2.311727E-12 3.188589E-12 4.393636E-12 3.844430E-12 4.256331E-12 1.235709E-12 2.746020E-12 2.746020E-12
8.238059E-13 2.608719E-12 1.445203E-12 4.817344E-13 1.445203E-12 7.609642E-14 2.536547E-13 2.000924E-13
7.075681E-14 7.075681E-14 3.056704E-14
GROUP 2 FIRST 2 LAST 168
6.740271E-02 8.310813E-03 3.609403E-03 1.307012E-03 2.949375E-03 3.605043E-03 1.612647E-03 1.640960E-03
3.597806E-03 4.022993E-03 4.289805E-03 4.480576E-03 2.352539E-03 2.415121E-03 5.018262E-03 5.188098E-03
2.589224E-03 2.546116E-03 2.472462E-03 2.374431E-03 2.260519E-03 5.981164E-03 1.700972E-03 1.556116E-03
1.410140E-03 1.273499E-03 3.061941E-03 7.995844E-04 6.967963E-04 1.553994E-03 1.216266E-03 1.997540E-04
4.426460E-04 1.990445E-04 1.470610E-04 1.539762E-04 3.814900E-05 7.024764E-05 1.611156E-05 2.136422E-05
5.984886E-05 1.035646E-05 2.363444E-05 1.105747E-05 8.308678E-06 8.789299E-06 2.257693E-06 5.807418E-06
6.248625E-07 3.822327E-06 6.987942E-07 1.660586E-06 2.240283E-07 5.983062E-07 6.513773E-07 2.735403E-07
4.614998E-07 1.940877E-07 5.895136E-07 2.081549E-07 1.662117E-07 2.316650E-07 1.101916E-07 2.162701E-08
7.493990E-08 2.341661E-08 1.072330E-08 1.606536E-08 4.945307E-09 3.936301E-09 3.147244E-09 5.945972E-09
3.108514E-09 3.682241E-09 3.210760E-09 2.795020E-09 2.436545E-09 2.118219E-09 2.612622E-09 2.586657E-09
1.432507E-09 1.457386E-09 7.264341E-10 3.803348E-10 4.514677E-10 3.959518E-10 6.541553E-10 3.707172E-10
1.334816E-10 3.875547E-10 5.294296E-11 2.294557E-10 2.790137E-11 1.719152E-10 1.408339E-11 3.526731E-11
1.469469E-11 5.583990E-11 6.759567E-11 8.766360E-12 2.337697E-11 2.629908E-11 2.629908E-11 2.483802E-11
2.483802E-11 2.337697E-11 2.629908E-11 7.112706E-12 9.957791E-12 1.991557E-11 9.246516E-12 1.066906E-11
1.849303E-11 1.849303E-11 5.690165E-12 7.681722E-12 3.698607E-12 3.698607E-12 7.397214E-12 7.397214E-12
2.845082E-12 4.267624E-12 6.828199E-12 2.845082E-12 3.983115E-12 6.259180E-12 5.690165E-12 5.690165E-12
1.422541E-11 2.845082E-12 1.707049E-11 2.845082E-12 2.095991E-11 2.193285E-11 2.330364E-11 1.096642E-11
4.112407E-12 1.425635E-11 8.906802E-12 2.429128E-12 1.106603E-11 8.097092E-12 1.484468E-12 3.913596E-12
5.398063E-12 8.624785E-12 7.546689E-12 8.355261E-12 2.425721E-12 5.390492E-12 5.390492E-12 1.617147E-12
5.120967E-12 2.710198E-12 9.033993E-13 2.710198E-12 3.744092E-13 1.248030E-12 6.614939E-13 4.359798E-13
4.359798E-13 1.364861E-13 4.856661E-15 4.856661E-15 4.856661E-15 4.856661E-15 4.856661E-15
What I have at the moment works, except it only reads in the first line after the GROUP keyword line. How can I make it continue reading the data in until it reaches the next GROUP keyword?
file_name = "test_data.txt"
import re
import io
group_pattern = re.compile(r"GROUP +\d+ FIRST +(?P<first>\d+) LAST +(?P<last>\d+)")
def read_data_from_file(file_name, start_identifier, end_identifier):
results = []
longest = 0
with open(file_name) as file:
t = file.read()
t=t[t.find('MACRO'):]
t=t[t.find(start_identifier)+len(start_identifier):t.find(end_identifier)]
t=io.StringIO(t)
for line in t:
match = group_pattern.search(line)
if match:
first = int(match.group('first'))
last = int(match.group('last'))
data = [float(value) for value in next(t).split()]
row = [0.0] * last
for i, value in enumerate(data, start=first-1):
row[i] = value
longest = max(longest, len(row))
results.append(row)
for row in results:
if len(row) < longest:
row.extend([0.0] * (longest-len(row)))
return results
start_identifier = "SCATTER MOMENT 1"
end_identifier = "SCATTER MOMENT 2"
results = read_data_from_file(file_name, start_identifier, end_identifier)
print(results)
What I want the code to produce is a matrix with just the numerical data. In this case it would be size [2x168] but my full data set is [172x172]. I want every GROUP to be read in as a row of the matrix, and zeroes filled into every element not specified in the output data. The current code does almost all of this, except that it only reads the first line of data after the GROUP keyword line.
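For reference, a minimal sketch (not from the original post) of one way to keep consuming lines until the next GROUP header; it assumes the section handed to it contains only GROUP headers and rows of floats:

import io
import re

group_pattern = re.compile(r"GROUP +\d+ FIRST +(?P<first>\d+) LAST +(?P<last>\d+)")

def read_groups(text):
    # Accumulate every data line under a GROUP header until the next header.
    results = []
    in_group = False
    for line in io.StringIO(text):
        match = group_pattern.search(line)
        if match:
            first = int(match.group('first'))
            results.append([0.0] * (first - 1))  # zero-fill positions before FIRST
            in_group = True
        elif in_group:
            results[-1].extend(float(v) for v in line.split())
    longest = max((len(row) for row in results), default=0)
    for row in results:
        row.extend([0.0] * (longest - len(row)))
    return results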

So I took a look at the data you provided in your question, and I found what I think is a better and simpler way of pulling those data points out of that file. However, I noticed that you have some other code that's looking for other things in the file as well, but those weren't in the test data you posted, so you may have to adapt this a little to work with your dataset.
def read_data_from_file(file_name):
    with open(file_name) as fp:
        index = -1
        matrices = []
        # Iterate over the file line by line; this keeps memory usage low.
        for line in fp:
            # Headers are always on their own line and data lines always begin
            # with two spaces, so we can just look for lines that start with a space.
            # If we find a line without leading spaces, it is a header line: add a
            # new list to matrices and add one to index.
            if not line.startswith(' '):
                index += 1
                matrices.append([])
            else:
                # Slice the string at index 2 to skip the first two spaces,
                # then split on two spaces to get each data point.
                str_data_points = line[2:].split('  ')
                # Convert the string data points to floats.
                float_data_points = map(float, str_data_points)
                # Add those float data points to the list in matrices via index.
                matrices[index].extend(float_data_points)
    max_matrix_length = max(map(len, matrices))
    for matrix in matrices:
        matrix.extend([0.0] * (max_matrix_length - len(matrix)))
    return matrices
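A hypothetical usage check, assuming the sample from the question is saved as test_data.txt:

matrices = read_data_from_file("test_data.txt")
print(len(matrices), len(matrices[0]))  # expect 2 rows, each padded to the longest group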

Here's my solution to read the data from the .txt file and produce a matrix-like output (0.0-padded at the end of each group):
import re

def read_data_from_file(file_path):
    GROUP_DATA = []
    MAX_ELEMENT_COUNT = 0
    with open(file_path) as f:
        for line in f.readlines():
            if 'GROUP' in line:
                GROUP_DATA.append([])
                MAX_ELEMENT_COUNT = max(MAX_ELEMENT_COUNT, int(re.findall(r'\d+', line)[-1]))
            else:
                values = line.split(' ')
                for value in values:
                    try:
                        GROUP_DATA[-1].append(float(value))
                    except ValueError:
                        pass
    for DATA in GROUP_DATA:
        if len(DATA) < MAX_ELEMENT_COUNT:
            DATA += [0.0] * (MAX_ELEMENT_COUNT - len(DATA))
    return GROUP_DATA
For the data in the given question saved into data.txt, the output would be as follows:
>>> import numpy as np  # just to check the output shape
>>> mat = read_data_from_file('data.txt')
>>> np.shape(mat)
(2, 168)  # output shape as expected
The output matrix's size adapts to the given data.

Related

How to transform a multi dimensional array from a CSV file into a list

[screenshot of the CSV file]
Hi (sorry if this is a dumb question). I have a data set as a CSV file. Every row contains 44 cells, and every cell contains 44 float numbers separated by two spaces, as in the screenshot. I tried csv readline/readlines plus numpy, and none of them worked.
I want to take every row as a list with 1936 values (44*44), and then combine the whole data set into a 2D array: my_data[n_of_samples][1936].
As stated by user ybl, this is not a CSV. It's not even close to being a CSV.
This means that you have to implement some processing to turn it into something usable. I put the screenshot through an OCR to extract the actual text values, but next time please provide the input file; screenshots of data are annoying to work with.
The processing you need to do is to find the start and end of the rows, using the [ and ] characters respectively. Then you split this data with the basic string.split(), which doesn't care about the number of spaces.
Try the code below and see if it works for the input file.
rows = []
current_row = ""
with open("somefile.txt") as infile:
    for line in infile.readlines():
        cleaned = line.replace('"', '').replace("\n", " ")
        if "]" in cleaned:
            # close the current row at ']' and start collecting the next one
            current_row = f"{current_row} {cleaned.split(']')[0]}"
            rows.append(current_row.split())
            current_row = ""
            cleaned = cleaned.split(']')[1]
        if "[" in cleaned:
            cleaned = cleaned.split("[")[1]
        current_row = f"{current_row} {cleaned}"

for row in rows:
    print(len(row))
output
44
44
44
input file:
"[ 1.79619717e+04 1.09988207e+02 4.13270009e+01 1.72227906e+01
1.06178751e+01 5.20957856e+00 7.50891645e+00 4.57943370e+00
2.65572713e+00 2.96725867e-01 2.43040664e+00 1.32822091e+00
4.09853169e-01 1.18412873e+00 6.43398990e-01 1.23796528e+00
9.63975374e-02 2.95295579e-01 7.68998970e-01 4.98040980e-01
2.84036936e-01 1.76004564e-01 1.43527613e-01 1.64765236e-01
1.51171075e-01 1.02586637e-01 3.27835810e-02 1.21872869e-02
-7.59824907e-02 8.48217334e-02 7.29953754e-02 4.89750588e-02
5.89426950e-02 5.05485266e-02 2.34761263e-02 -2.41095452e-02
5.15952510e-02 1.39933210e-02 2.12354074e-02 3.40820680e-03
-2.57466949e-03 -1.06481222e-02 -8.35155410e-03 1.21653512e-12]","[-6.12189619e+02 1.03584744e+04 2.34417495e+02 7.01761526e+01
3.92495170e+01 1.81609738e+01 2.58114624e+01 1.52275550e+01
8.59676934e+00 9.45036161e-01 7.71943506e+00 4.17516432e+00
1.27920413e+00 3.68862368e+00 1.99582544e+00 3.82999035e+00
2.96068511e-01 9.06341796e-01 2.35621065e+00 1.52094079e+00
8.64565916e-01 5.34605108e-01 4.35456793e-01 4.99450615e-01
4.57778770e-01 3.10324997e-01 9.90860520e-02 3.68281889e-02
-2.29532895e-01 2.56108491e-01 2.20284123e-01 1.47727878e-01
1.77724506e-01 1.52350751e-01 7.07318164e-02 -7.26252404e-02
1.55364050e-01 4.21222079e-02 6.39113311e-02 1.02558665e-02
-7.74736016e-03 -3.20368093e-02 -2.51241082e-02 1.21653512e-12]","[-5.03959282e+02 -5.64452044e+02 7.90433958e+03 1.94146598e+02
1.06178751e+01 5.20957856e+00 7.50891645e+00 4.57943370e+00
2.65572713e+00 2.96725867e-01 2.43040664e+00 1.32822091e+00
4.09853169e-01 1.18412873e+00 6.43398990e-01 1.23796528e+00
9.63975374e-02 2.95295579e-01 7.68998970e-01 4.98040980e-01
2.84036936e-01 1.76004564e-01 1.43527613e-01 1.64765236e-01
1.51171075e-01 1.02586637e-01 3.27835810e-02 1.21872869e-02
-7.59824907e-02 8.48217334e-02 7.29953754e-02 4.89750588e-02
5.89426950e-02 5.05485266e-02 2.34761263e-02 -2.41095452e-02
5.15952510e-02 1.39933210e-02 2.12354074e-02 3.40820680e-03
-2.57466949e-03 -1.06481222e-02 -8.35155410e-03 1.21653512e-12]"
Another option is this:
import numpy as np
import csv

c = np.array([n_of_samples])
with open('cocacola_sick.csv') as f:
    p = csv.reader(f)  # read the file as CSV
    for s in p:
        a = ','.join(s)  # concatenate all lines into one line
        a = a.replace("\n", "")  # remove line breaks
        b = np.array(np.mat(a))
        my_data = np.vstack((c, b))
print(my_data)
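If the end goal is just the my_data[n_of_samples][1936] array from the question, here is a hedged sketch that builds on the bracket parser above; the rows list is a stand-in, since the real one comes from parsing the file:

import numpy as np

# Stand-in for the parsed output: each CSV row yields 44 bracketed groups
# of 44 floats each; here we fake one sample's worth of groups.
rows = [["1.0"] * 44 for _ in range(44)]
flat = np.array(rows, dtype=float)   # shape (44, 44)
my_data = flat.reshape(-1, 44 * 44)  # one 1936-wide row per sample
print(my_data.shape)                 # (1, 1936)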

Extracting lines from multiline string with various variable length sections

I'm working with a pandas dataframe containing a large block of plain text for each row. The block of text has the following format:
Year 1
... (variable # of lines)
7. Stuff
... (variable # of lines, can be 0)
TOTAL Stuff
(single line, numeric)
... (variable # of lines)
Services
(single line)
... (variable # of lines)
Year 2
... (same format as prev)
<repeats for n years>
TOTAL
... (same format as years)
Justification
... (variable number of lines)
<repeat m times>
I'm trying to extract the plain text under the "7. Stuff" and "Justification" headings, as well as the numerical values for "TOTAL Stuff". My current code creates a list based on the line breaks and iterates through them, but I feel like this is not efficient. My current implementation also only works when there is one cycle of years -> TOTAL -> Justification (not m).
Here is my parse_budget_text function. Any help on making it more 'pythonic' or just more efficient in general is greatly appreciated.
import pandas as pd

def parse_budget_text(row):
    stuff_value = 0
    stuff_txt = ''
    justification_txt = ''
    # ensure text is not hidden within a list
    text = row['text_raw']
    # parse and sum equipment lines
    line_iter = iter([line.strip() for line in text.split("\n")])
    cumulative_flag = False
    justification_flag = False
    for line in line_iter:
        # find each yearly section
        if line.startswith("YEAR"):
            while not line.startswith("7. Stuff"):
                line = next(line_iter)
            line = next(line_iter)
            while not line.startswith("Services"):
                # is_number is a helper defined elsewhere in the original script
                if ("TOTAL Stuff" not in line) and (not is_number(line)) and (line[0] != "$"):
                    stuff_txt += line + '; '
                line = next(line_iter)
        # find total summary
        elif line.startswith("TOTAL"):
            cumulative_flag = True
            while not line.startswith("TOTAL Stuff"):
                line = next(line_iter)
            stuff_value += int(next(line_iter).replace(',', ''))
        # find Justification line
        elif line.startswith("Justification") and cumulative_flag:
            justification_flag = True
        # extract justification
        elif justification_flag:
            justification_txt += line
    return pd.Series({'raw_text': text, 'Stuff_val': stuff_value, 'Stuff_txt': stuff_txt})
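As an aside, here is a hedged sketch of one way to avoid the manual next() bookkeeping: split the raw text on the heading lines with a regex and then process each (heading, body) pair. The heading spellings are assumptions based on the format sketch above:

import re

def split_sections(text):
    # Split on "Year N", "TOTAL", or "Justification" heading lines, keeping
    # the headings so each chunk of body text is self-describing.
    parts = re.split(r'^(Year \d+|TOTAL|Justification)\s*$', text, flags=re.MULTILINE)
    # parts alternates [preamble, heading, body, heading, body, ...]
    return [(head, body.strip()) for head, body in zip(parts[1::2], parts[2::2])]

sections = split_sections("Year 1\nsome lines\nTOTAL\n42\nJustification\nbecause")
print(sections)  # [('Year 1', 'some lines'), ('TOTAL', '42'), ('Justification', 'because')]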

How to write regular expressions to match white space delimited multi-line column data

So I have log files that come in the form of
n400_108tb_48gb 2 G 1,3-7 1 20G / 286T (< 1% )
n400_108tb_48gb:1 1 D 1-3:bay1-6 - 2.1G / 48T (< 1% )
n400_108tb_48gb:3 3 D 1-3:bay7-12 - 1.9G / 48T (< 1% )
n400_108tb_48gb:4 4 D 1-3:bay13-18 - 10G / 48T (< 1% )
n400_108tb_48gb:5 5 D 1-3:bay19-24 - 2.0G / 48T (< 1% )
n400_108tb_48gb:6 6 D 1-3:bay25-30 - 2.2G / 48T (< 1% )
n400_108tb_48gb:7 7 D 1-3:bay31-36 - 1.7G / 48T (< 1% )
That seems nice and simple to deal with, so I can just write regular expressions to handle it one line at a time.
([0-9a-z_:]*)\s*([1-9])\s*([DGPTE])\s*([0-9a-z_:,-]*)\s*([1-9])\s*([0-9.]+[KMGTPE]).*?([0-9]*[KMGTPE])
I mean, that's ugly, but I can simplify it to:
_name = r"([0-9a-z_:]*)\s*"
_id = r"([1-9])
_type = r"([DGPTE])"
_members = r"([0-9a-z_:,-]*)"
_vhs = r"([1-9-])"
_used = r"([0-9.]*[KMGTPE])"
_size = r"([0-9.]*[KMGTPE])"
_disk_protections_regex_string = r"{0}\s*{1}\s*{2}\s*{3}\s*{4}\s*{5}.*?{6}".format(
_name,
_id,
_type,
_members,
_vhs,
_used,
_size,)
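As a quick sanity check (my addition, not from the original post), the combined pattern can be compiled and run over the first sample line:

import re

# _disk_protections_regex_string is the pattern assembled in the snippet above.
disk_protections_regex = re.compile(_disk_protections_regex_string)
sample = "n400_108tb_48gb 2 G 1,3-7 1 20G / 286T (< 1% )"
match = disk_protections_regex.search(sample)
if match:
    print(match.groups())
    # ('n400_108tb_48gb', '2', 'G', '1,3-7', '1', '20G', '286T')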
Then I discovered that I have to parse files with this format.
s200_13tb_400gb 1 +3 system, vhs_de 1:0-23, 1 53T / 218T (25% )
-ssd_48gb-ram ny_writes, vhs 2:0-23, 3:0-
_hide_spare, 1,3-19,21-25
ssd_metadata , 4:0-23,
5:0-23,
6:0-23,
7:0-23,
8:0-23,
9:0-23,
10:0-23,
11:0-23,
12:0-23,
13:0-23,
14:0-23,
15:0-23,
16:0-23,
17:0-23,
18:2-25
and suddenly the expected values are:
s200_13tb_400gb-ssd_48gb-ram
system vhs_deny_writes, vhs_hide_spare, ssd_metadata
1:0-23, 2:0-23, 3:0-1,3-19,21-25, 4:0-23, 5:0-23, 6:0-23, 7:0-23, 8:0-23, 9:0-23, 10:0-23, 11:0-23, 12:0-23, 13:0-23, 14:0-23, 15:0-23, 16:0-23, 17:0-23, 18:0-23,
All this in addition to the original formatting I presented. I don't even know where to start with whitespace-delimited, column-separated values.
I've created a more dynamic method, which finds the column definitions itself.
Explanation
1) The script first looks for character columns that hold a whitespace in every line of the file.
2) It then defines the data columns as the spans between those whitespace columns. + [len(content[0])] adds an additional whitespace column at the end, making the last data column accessible if needed.
3) The data is extracted with the defined columns.
4) The data is printed if it matches the specified patterns. Warning: if you have multiple records per file, you will have to change this step.
Code
import re
from collections import Counter

# Patterns to save in the end: [name, attr, values]
patterns = [r"^([0-9a-z_-]{4,}$)", r"^([a-z_,\s]*$)", r"([0-9:,\s-]{4,})$"]

# Get file content, remove any trailing empty line.
with open('/path/to/my/file') as f:
    content = f.read().split('\n')
if not content[-1]:
    content = content[:-1]

# 1) Find all single-character columns in content with only whitespaces.
no_lines = len(content)
whitespaces = [i for l in content for i, char in enumerate(l) if char == ' ']
whi_columns = [k for k, v in Counter(whitespaces).iteritems() if v == no_lines]
# .items() in python3

# 2) Get all real columns that are between whitespace columns.
columns_defs = []
for i, whi_col in enumerate(whi_columns + [len(content[0])]):
    if whi_col and not i:  # special first column
        columns_defs.append(slice(whi_col))
    if whi_col > whi_columns[i - 1] + 1:
        columns_defs.append(slice(whi_columns[i - 1] + 1, whi_col))

# 3) Extract columns from file content.
data_columns = [[line[col].strip() for line in content] for col in columns_defs]

# 4) Save columns fitting patterns.
for data_col in data_columns:
    data = ''.join(data_col)
    if re.match(r'|'.join(patterns), data):
        print data
Output
s200_13tb_400gb-ssd_48gb-ram
system, vhs_deny_writes, vhs_hide_spare,ssd_metadata
1:0-23,2:0-23, 3:0-1,3-19,21-25, 4:0-23,5:0-23,6:0-23,7:0-23,8:0-23,9:0-23,10:0-23,11:0-23,12:0-23,13:0-23,14:0-23,15:0-23,16:0-23,17:0-23,18:2-25
Define slices for the columns, then aggregate the data in each line:
col_1 = slice(17)
col_2 = slice(25, 40)
col_3 = slice(41, 54)
col_4 = slice(55, None)

one, two, three, four = list(), list(), list(), list()
with open('file.txt') as f:
    for line in f:
        one.append(line[col_1])
        two.append(line[col_2])
        three.append(line[col_3])
        four.append(line[col_4])

print ''.join(item.strip() for item in one)
print ''.join(item.strip() for item in two)
print ''.join(item.strip() for item in three)
print ''.join(item.strip() for item in four)
>>>
s200_13tb_400gb-ssd_48gb-ram
system, vhs_deny_writes, vhs_hide_spare,ssd_metadata
1:0-23,2:0-23, 3:0-1,3-19,21-25, 4:0-23,5:0-23,6:0-23,7:0-23,8:0-23,9:0-23,10:0-23,11:0-23,12:0-23,13:0-23,14:0-23,15:0-23,16:0-23,17:0-23,18:2-25
53T / 218T (25% )
>>>
This will extract data from the columnar format shown in the example. If there are multiple records in a file, the record delimiter needs to be determined.

How do you make tables with previously stored strings?

So the question basically gives me 19 DNA sequences and wants me to make a basic text table. The first column has to be the sequence ID, the second column the length of the sequence, the third the number of "A"s, the 4th the number of "G"s, the 5th the number of "C"s, the 6th the number of "T"s, the 7th the %GC, and the 8th whether or not the sequence contains "TGA". Then I get all these values and write the table to "dna_stats.txt".
Here is my code:
fh = open("dna.fasta","r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq=0
alllines = fh.readlines()
for line in alllines:
if line.startswith(">"):
seq+=1
continue
Acount+=line.count("A")
Ccount+=line.count("C")
Gcount+=line.count("G")
Tcount+=line.count("T")
genomeSize=Acount+Gcount+Ccount+Tcount
percentGC=(Gcount+Ccount)*100.00/genomeSize
print "sequence", seq
print "Length of Sequence",len(line)
print Acount,Ccount,Gcount,Tcount
print "Percent of GC","%.2f"%(percentGC)
if "TGA" in line:
print "Yes"
else:
print "No"
fh2 = open("dna_stats.txt","w")
for line in alllines:
splitlines = line.split()
lenstr=str(len(line))
seqstr = str(seq)
fh2.write(seqstr+"\t"+lenstr+"\n")
I found that you have to convert the variables into strings. I have all of the values calculated correctly when I print them out in the terminal. However, I keep getting only 19 for the first column, when it should go 1, 2, 3, 4, 5, etc. to represent all of the sequences. I tried it with the other variables and it just got the totals for the whole file. I started trying to make the table but have not finished it.
So my biggest issue is that I don't know how to get the values for the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7
Well, your biggest issue:
for line in alllines: #1
    ...
    fh2 = open("dna_stats.txt","w")
    for line in alllines: #2
        ....

Indentation matters. This says "for every line (#1), open a file and then loop over every line again (#2)..."
De-indent those things.
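That is, the corrected shape looks roughly like this:

for line in alllines:  #1
    ...  # per-line counting stays here

# de-indented: open the output file once, after the first loop finishes
fh2 = open("dna_stats.txt", "w")
for line in alllines:  #2
    ...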
This puts the info in a dictionary as you go and allows DNA sequences to span multiple lines:
from __future__ import division  # ensure things like 1/2 are 0.5 rather than 0
from collections import defaultdict

fh = open("dna.fasta", "r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt", "w")
seq = 0
data = dict()
for line in alllines:
    if line.startswith(">"):
        seq += 1
        # default value will be zero if a key is not present, hence we can
        # do += 1 without initializing to zero first
        data[seq] = defaultdict(int)
        data[seq]['seq'] = seq
        previous_line_end = ""  # "TGA" might be split across lines
        continue
    data[seq]['Acount'] += line.count("A")
    data[seq]['Ccount'] += line.count("C")
    data[seq]['Gcount'] += line.count("G")
    data[seq]['Tcount'] += line.count("T")
    data[seq]['genomeSize'] = data[seq]['Acount'] + data[seq]['Gcount'] + data[seq]['Ccount'] + data[seq]['Tcount']
    line_over = previous_line_end + line[:3]
    data[seq]['hasTGA'] = data[seq]['hasTGA'] or ("TGA" in line) or ("TGA" in line_over)
    previous_line_end = str.strip(line[-4:])  # save previous_line_end for the next line, removing the newline character
for seq in data.keys():
    data[seq]['percentGC'] = (data[seq]['Gcount'] + data[seq]['Ccount']) * 100.00 / data[seq]['genomeSize']
    s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Gcount)d, %(Ccount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s\n'
    fh2.write(s % data[seq])
fh.close()
fh2.close()

How do I convert integers into high-resolution times in Python? Or how do I keep Python from dropping zeros?

Currently, I'm using this to calculate the time between two messages and listing the times if they are above 20 seconds.
def time_deltas(infile):
    entries = (line.split() for line in open(INFILE, "r"))
    ts = {}
    for e in entries:
        if " ".join(e[2:5]) == "OuchMsg out: [O]":
            ts[e[8]] = e[0]
        elif " ".join(e[2:5]) == "OuchMsg in: [A]":
            in_ts, ref_id = e[0], e[7]
            out_ts = ts.pop(ref_id, None)
            yield (float(out_ts), ref_id[1:-1], (float(in_ts) * 10000 - float(out_ts) * 10000))
            n = (float(in_ts) * 10000 - float(out_ts) * 10000)
            if n > 20:
                print float(out_ts), ref_id[1:-1], n

INFILE = 'C:/Users/klee/Documents/text.txt'

import csv
with open('output_file1.csv', 'w') as f:
    csv.writer(f).writerows(time_deltas(INFILE))
However, there are two major errors. First, Python drops leading zeros when the time is before 10:00, i.e. 0900. Second, it drops trailing zeros, making the time difference inaccurate.
It looks like:
130203.08766
when it should be:
130203.087660
You are yielding floats, so the csv writer turns those floats into strings as it pleases.
If you want your output values to be a certain format, yield a string in that format.
Perhaps something like this?
print "%04.0f" % (900) # prints 0900
