Reading a txt file into Python

Python rookie here. I have multiple text files, each with lines in the following format: two floats followed by a list of floats, e.g.
0.0551 1500.0 [273.639, 273.331, 273.021, 272.711, 272.399, 272.087, 271.773, 271.46, 271.145]
0.0553 1532.5 [272.422, 273.96, 273.021, 273.321, 272.494, 273.129, 271.12, 271.23, 271.889]
0.0555 1560.0 [273.234, 273.44, 273.133, 272.065, 272.234, 272.012, 271.942, 271.43, 271.145]
0.0558 1582.5 [272.45, 273.011, 273.45, 273.331, 272.321, 273.234, 271.34, 271.531, 271.932]
I would like to read them in as columns, as follows, to be able to plot them:
column1 = [0.0551,0.0553,0.0555,0.0558,....]
column2 = [1500.0,1532.5,1560.0,1582.5,....]
column3 = [[273.639, 273.331, 273.021, 272.711, 272.399, 272.087, 271.773, 271.46, 271.145],[272.422, 273.96, 273.021, 273.321, 272.494, 273.129, 271.12, 271.23, 271.889],[273.234, 273.44, 273.133, 272.065, 272.234, 272.012, 271.942, 271.43, 271.145],[272.45, 273.011, 273.45, 273.331, 272.321, 273.234, 271.34, 271.531, 271.932]]
I tried numpy.loadtxt and numerous other functions but was never able to read them successfully. What is the best way to read the text file into the desired format?

Your file structure is somewhat unusual; ideally you would clean it up upstream.
Anyway, here's a function to load your data. If the file structure changes too much, the function may stop working.
def load_data(file):
    cols = [[] for _ in range(3)]
    to_remove = ['[', ']', '\n']
    with open(file, 'r') as f:
        for line in f:
            if len(line) > 1:  # skip blank lines
                split_line = line
                for x in to_remove:
                    split_line = split_line.replace(x, '')
                # split into at most three fields: col1, col2, and the rest
                split_line = split_line.split(' ', 2)
                cols[0].append(float(split_line[0]))
                cols[1].append(float(split_line[1]))
                cols[2].append([float(i) for i in split_line[2].split(',')])
    return cols
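For reference, the same parsing idea can be sketched without the replace loop by using str.partition on the [ character; the sample lines below are inlined so the snippet is self-contained:

```python
# Sketch: parse "float float [floats...]" lines with str.partition.
sample = (
    "0.0551 1500.0 [273.639, 273.331, 273.021]\n"
    "\n"
    "0.0553 1532.5 [272.422, 273.96, 273.021]\n"
)

col1, col2, col3 = [], [], []
for line in sample.splitlines():
    if not line.strip():        # skip blank lines
        continue
    head, _, rest = line.partition('[')   # head: "0.0551 1500.0 "
    a, b = head.split()
    col1.append(float(a))
    col2.append(float(b))
    col3.append([float(x) for x in rest.rstrip(']').split(',')])

print(col1)  # [0.0551, 0.0553]
```

Either approach yields the three columns ready for plotting.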

Related

How to transform a multi dimensional array from a CSV file into a list

[screenshot of the CSV file]
Hi (sorry if this is a dumb question). I have a data set as a CSV file. Every row contains 44 cells, and every cell contains 44 float numbers separated by two spaces, like in the screenshot. I tried csv readline/s plus numpy, and none of them worked.
I want to read every row as a list with 1936 values (44 * 44),
and then combine the whole data set into a 2-D array: my_data[n_of_samples][1936].
As stated by user ybl, this is not a CSV. It's not even close to being a CSV.
This means you have to implement some processing to turn it into something usable. I put the screenshot through an OCR to extract the actual text values, but next time please provide the input file; screenshots of data are annoying to work with.
The processing you need to do is to find the start and end of each row, using the [ and ] characters respectively. Then you split the data with the basic str.split(), which doesn't care about the number of spaces.
Try the code below and see if it works for the input file.
rows = []
current_row = ""
with open("somefile.txt") as infile:
    for line in infile.readlines():
        cleaned = line.replace('"', '').replace("\n", " ")
        if "]" in cleaned:
            current_row = f"{current_row} {cleaned.split(']')[0]}"
            rows.append(current_row.split())
            current_row = ""
            cleaned = cleaned.split(']')[1]
        if "[" in cleaned:
            cleaned = cleaned.split("[")[1]
        current_row = f"{current_row} {cleaned}"

for row in rows:
    print(len(row))
output
44
44
44
input file:
"[ 1.79619717e+04 1.09988207e+02 4.13270009e+01 1.72227906e+01
1.06178751e+01 5.20957856e+00 7.50891645e+00 4.57943370e+00
2.65572713e+00 2.96725867e-01 2.43040664e+00 1.32822091e+00
4.09853169e-01 1.18412873e+00 6.43398990e-01 1.23796528e+00
9.63975374e-02 2.95295579e-01 7.68998970e-01 4.98040980e-01
2.84036936e-01 1.76004564e-01 1.43527613e-01 1.64765236e-01
1.51171075e-01 1.02586637e-01 3.27835810e-02 1.21872869e-02
-7.59824907e-02 8.48217334e-02 7.29953754e-02 4.89750588e-02
5.89426950e-02 5.05485266e-02 2.34761263e-02 -2.41095452e-02
5.15952510e-02 1.39933210e-02 2.12354074e-02 3.40820680e-03
-2.57466949e-03 -1.06481222e-02 -8.35155410e-03 1.21653512e-12]","[-6.12189619e+02 1.03584744e+04 2.34417495e+02 7.01761526e+01
3.92495170e+01 1.81609738e+01 2.58114624e+01 1.52275550e+01
8.59676934e+00 9.45036161e-01 7.71943506e+00 4.17516432e+00
1.27920413e+00 3.68862368e+00 1.99582544e+00 3.82999035e+00
2.96068511e-01 9.06341796e-01 2.35621065e+00 1.52094079e+00
8.64565916e-01 5.34605108e-01 4.35456793e-01 4.99450615e-01
4.57778770e-01 3.10324997e-01 9.90860520e-02 3.68281889e-02
-2.29532895e-01 2.56108491e-01 2.20284123e-01 1.47727878e-01
1.77724506e-01 1.52350751e-01 7.07318164e-02 -7.26252404e-02
1.55364050e-01 4.21222079e-02 6.39113311e-02 1.02558665e-02
-7.74736016e-03 -3.20368093e-02 -2.51241082e-02 1.21653512e-12]","[-5.03959282e+02 -5.64452044e+02 7.90433958e+03 1.94146598e+02
1.06178751e+01 5.20957856e+00 7.50891645e+00 4.57943370e+00
2.65572713e+00 2.96725867e-01 2.43040664e+00 1.32822091e+00
4.09853169e-01 1.18412873e+00 6.43398990e-01 1.23796528e+00
9.63975374e-02 2.95295579e-01 7.68998970e-01 4.98040980e-01
2.84036936e-01 1.76004564e-01 1.43527613e-01 1.64765236e-01
1.51171075e-01 1.02586637e-01 3.27835810e-02 1.21872869e-02
-7.59824907e-02 8.48217334e-02 7.29953754e-02 4.89750588e-02
5.89426950e-02 5.05485266e-02 2.34761263e-02 -2.41095452e-02
5.15952510e-02 1.39933210e-02 2.12354074e-02 3.40820680e-03
-2.57466949e-03 -1.06481222e-02 -8.35155410e-03 1.21653512e-12]"
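Once rows holds lists of numeric strings (44 values each, per the output above), converting them into the 2-D float array the question asks for is a single numpy call. A minimal sketch, with two made-up short rows standing in for the real 44-element ones:

```python
import numpy as np

# Made-up short rows standing in for the 44-element rows parsed above.
rows = [
    ["1.79619717e+04", "1.09988207e+02"],
    ["-6.12189619e+02", "1.03584744e+04"],
]
my_data = np.array(rows, dtype=float)  # shape: (n_of_samples, row_length)
print(my_data.shape)  # (2, 2)
```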
Another option is this:
import numpy as np
import csv

rows = []
with open('cocacola_sick.csv') as f:
    p = csv.reader(f)                # read the file as CSV
    for s in p:
        a = ','.join(s)              # concatenate all cells into one line
        a = a.replace("\n", "")      # remove line breaks
        b = np.array(np.mat(a))      # parse the numeric string into an array
        rows.append(b)
my_data = np.vstack(rows)            # stack rows into my_data[n_of_samples][...]
print(my_data)

IndexError: list index out of range in Python Script

I'm new to Python, so I apologize if this question has already been answered. I've used this script before and it's worked, so I'm not at all sure what is wrong.
I'm trying to transform a MALLET output document into a long list of topic, weight, and value rather than a wide list of topics, documents, and weights.
Here's what the original file I'm trying to convert looks like, except there are 30 topics in it (it's a text file called mb_composition.txt):
0 file:/Users/mandyregan/Dropbox/CPH-DH/MiningtheSurge/txt/Abizaid.txt 6.509147794508226E-6 1.8463345214533957E-5 3.301298069640119E-6 0.003825178550032757 0.15240841618294929 0.03903974304065183 0.10454783676528623 0.1316719812119471 1.8018057013225344E-5 4.869261713020613E-6 0.0956868156114931 1.3521101623203115E-5 9.514591058923748E-6 1.822741355900598E-5 4.932324961835634E-4 2.756817586271138E-4 4.039186874601744E-5 1.0503346606335033E-5 1.1466132458804392E-5 0.007003443189848799 6.7094360963952E-6 0.2651753488982284 0.011727025879070194 0.11306132549594633 4.463460490946615E-6 0.0032751230536005056 1.1887304822238514E-5 7.382714572306351E-6 3.538808652077042E-5 0.07158823129977483
1 file:/Users/mandyregan/Dropbox/CPH-DH/MiningtheSurge/txt/Jeffrey,%20Jim%20-%20Chk5-%20ASC%20-%20FINAL%20-%20Sept%202017.docx.txt 4.296636200313062E-6 1.218750594272488E-5 1.5556725986514498E-4 0.043172816021532695 0.04645757277949794 0.01963429696910822 0.1328206370818606 0.116826297071711 1.1893574776047563E-5 3.2141605637859693E-6 0.10242945223692496 0.010439315937573735 0.2478814493196687 1.2031769351093548E-5 0.010142417179693447 2.858721603853616E-5 2.6662348272204834E-5 6.9331747684835E-6 7.745091995495631E-4 0.04235638910274044 4.428844900369446E-6 0.0175105406405736 0.05314379308820005 0.11788631730736487 2.9462944350793084E-6 4.746133386282654E-4 7.846714475661223E-6 4.873270616886766E-6 0.008919869163605806 0.02884824479155971
And here's the python script I'm trying to use to convert it:
infile = open('mallet_output_files/mb_composition.txt', 'r')
outfile = open('mallet_output_files/weights.csv', 'w+')
outfile.write('file,topicnum,weight\n')
for line in infile:
    tokens = line.split('\t')
    fn = tokens[1]
    topics = tokens[2:]
    # outfile.write(fn[46:] + ",")
    for i in range(0, 59):
        outfile.write(fn[46:] + ",")
        outfile.write(topics[i*2] + ',' + topics[i*2+1] + '\n')
I'm running this in the terminal with python reshape.py and I get this error:
Traceback (most recent call last):
  File "reshape.py", line 12, in <module>
    outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')
IndexError: list index out of range
Any idea what I'm doing wrong here? I can't seem to figure it out, and I'm frustrated because I know I've used this script many times before with success! If it helps, I'm on Mac OS X with Python 2.7.10.
The problem is that you're trying to read 59 topic/weight pairs per line of your file, but each line only has about 30 values.
If you just want to print out the topics in the list up to the nth topic per line, you should probably define your range by the actual number of topics per line:
for i in range(len(topics) // 2):
    outfile.write(fn[46:] + ",")
    outfile.write(topics[i*2] + ',' + topics[i*2+1] + '\n')
Stated more pythonically, it would look something like this:
# Group the topics into tuple-pairs for easier management
paired_topics = [tuple(topics[i:i+2]) for i in range(0, len(topics), 2)]

# Iterate the paired topics and print each on a line of output
for topic in paired_topics:
    outfile.write(fn[46:] + ',' + ','.join(topic) + '\n')
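An equivalent pairing can also be written with zip over two stride-2 slices; the topics values below are made up for illustration:

```python
# Illustrative topics list: alternating topic-number / weight strings.
topics = ['0', '6.5e-6', '1', '1.8e-5', '2', '3.3e-6']

# zip two stride-2 slices to get the same pairing in one expression.
paired_topics = list(zip(topics[0::2], topics[1::2]))
print(paired_topics)  # [('0', '6.5e-6'), ('1', '1.8e-5'), ('2', '3.3e-6')]
```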
You need to debug your code. Try printing out variables.
infile = open('mallet_output_files/mb_composition.txt', 'r')
outfile = open('mallet_output_files/weights.csv', 'w+')
outfile.write('file,topicnum,weight\n')
for line in infile:
    tokens = line.split('\t')
    fn = tokens[1]
    topics = tokens[2:]
    # outfile.write(fn[46:] + ",")
    for i in range(0, 59):
        # Add a print statement like this (.format() works on your Python 2.7)
        print('Topics {}: {} and {}'.format(i, i * 2, i * 2 + 1))
        outfile.write(fn[46:] + ",")
        outfile.write(topics[i*2] + ',' + topics[i*2+1] + '\n')
Does your 'topics' list only have 30 elements? It looks like you're trying to access items far outside the available range, i.e. topics[x] where x > 30.

Concatenate multiple text files of DNA sequences in Python or R?

I was wondering how to concatenate exon/DNA fasta files using Python or R.
Example files:
So far I have really liked using the R ape package for its cbind method, solely because of the fill.with.gaps=TRUE option. I really need gaps inserted when a species is missing an exon.
My code:
library(ape)  # for read.dna / write.dna / cbind on DNAbin objects

ex1 <- read.dna("exon1.txt", format = "fasta")
ex2 <- read.dna("exon2.txt", format = "fasta")
output <- cbind(ex1, ex2, fill.with.gaps = TRUE)
write.dna(output, "Output.txt", format = "fasta")
Example:
exon1.txt
>sp1
AAAA
>sp2
CCCC
exon2.txt
>sp1
AGG-G
>sp2
CTGAT
>sp3
CTTTT
Output file:
>sp1
AAAAAGG-G
>sp2
CCCCCTGAT
>sp3
----CTTTT
So far I'm having trouble applying this technique when I have multiple exon files (I'm trying to figure out a loop that opens and runs the cbind method for all files ending with .fa in the directory), and sometimes not all files have exons of identical length, hence DNAbin stops working.
So far I have:
file_list <- list.files(pattern = ".fa")
myFunc <- function(x) {
  for (file in file_list) {
    x <- read.dna(file, format = "fasta")
    out <- cbind(x, fill.with.gaps = TRUE)
    write.dna(out, "Output.txt", format = "fasta")
  }
}
However, when I run this and check my output text file, it's missing many exons. I think that's because not all files have the same exon length... or my script is failing somewhere and I can't figure out where :(
Any ideas? I can also try Python.
If you prefer using Linux one-liners, you have:
cat exon1.txt exon2.txt > outfile
If you want only the unique records from the outfile, use:
awk '/^>/{f=!d[$1];d[$1]=1}f' outfile > sorted_outfile
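For anyone who prefers to stay in Python, here is a rough equivalent of that awk filter; it keeps only the first record seen for each header. The function name and sample lines are mine, not from the original answer:

```python
def unique_records(lines):
    """Keep a FASTA record only the first time its header is seen."""
    seen = set()
    keep = False
    out = []
    for line in lines:
        if line.startswith(">"):
            header = line.split()[0]
            keep = header not in seen   # keep record only on first sighting
            seen.add(header)
        if keep:
            out.append(line)
    return out

lines = [">sp1", "AAAA", ">sp1", "GGGG", ">sp2", "CCCC"]
result = unique_records(lines)
print(result)  # ['>sp1', 'AAAA', '>sp2', 'CCCC']
```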
I just came up with this answer in Python 3:
def read_fasta(fasta):  # function that reads the files
    output = {}
    for line in fasta.split("\n"):
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            active_sequence_name = line[1:]
            if active_sequence_name not in output:
                output[active_sequence_name] = []
            continue
        sequence = line
        output[active_sequence_name].append(sequence)
    return output

with open("exon1.txt", 'r') as file:  # read exon1.txt
    file1 = read_fasta(file.read())
with open("exon2.txt", 'r') as file:  # read exon2.txt
    file2 = read_fasta(file.read())

finaldict = {}  # concatenate the contents of both files
for i in list(file1.keys()) + list(file2.keys()):
    if i not in file1.keys():
        file1[i] = ["-" * len(file2[i][0])]
    if i not in file2.keys():
        file2[i] = ["-" * len(file1[i][0])]
    finaldict[i] = file1[i] + file2[i]

with open("output.txt", 'w') as file:  # write the result to output.txt
    for k, i in finaldict.items():
        file.write(">{}\n{}\n".format(k, "".join(i)))  # proper formatting
It's pretty hard to comment on and explain it completely, and it might not help you, but this is better than nothing :P
I used Łukasz Rogalski's code from the answer to Reading a fasta file format into Python dict.
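To generalize the two-file approach above to any number of exon files (which is what the question ultimately asks for), the per-species gap padding can be factored into one function. This is a sketch under the assumption that sequences within each file share a common length; concat_exons is a name I made up:

```python
def concat_exons(exon_dicts):
    """Merge {species: sequence} dicts, padding missing species with gaps
    (mimicking ape's cbind(..., fill.with.gaps=TRUE))."""
    names = []
    for d in exon_dicts:
        for n in d:
            if n not in names:
                names.append(n)
    out = {n: "" for n in names}
    for d in exon_dicts:
        width = len(next(iter(d.values())))  # exon length in this file
        for n in names:
            out[n] += d.get(n, "-" * width)  # gap-fill missing species
    return out

exons = [{"sp1": "AAAA", "sp2": "CCCC"},
         {"sp1": "AGG-G", "sp2": "CTGAT", "sp3": "CTTTT"}]
merged = concat_exons(exons)
print(merged)  # {'sp1': 'AAAAAGG-G', 'sp2': 'CCCCCTGAT', 'sp3': '----CTTTT'}
```

In practice you would build exon_dicts by calling a FASTA reader on every file matched by, say, glob.glob('*.fa').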

Writing list to text file not making new lines python

I'm having trouble writing to text file. Here's my code snippet.
ram_array = map(str, ram_value)
cpu_array = map(str, cpu_value)
iperf_ba_array = map(str, iperf_ba)
iperf_tr_array = map(str, iperf_tr)

# with open(ram, 'w') as f:
#     for s in ram_array:
#         f.write(s + '\n')
# with open(cpu, 'w') as f:
#     for s in cpu_array:
#         f.write(s + '\n')

with open(iperf_b, 'w') as f:
    for s in iperf_ba_array:
        f.write(s + '\n')
with open(iperf_t, 'w') as f:
    for s in iperf_tr_array:
        f.write(s + '\n')  # no f.close() needed: 'with' closes the file
The ram and cpu files both come out flawlessly; however, when writing iperf_ba and iperf_tr, the files always come out looking like this:
[45947383.0, 47097609.0, 46576113.0, 47041787.0, 47297394.0]
Instead of
1
2
3
They're both read from global lists. The cpu and ram values are appended one by one, but otherwise the lists look exactly the same before processing.
Here's how they're made:
filename = "iperfLog_2015_03_12_20:45:18_123_____tag_33120L06.csv"
write_location = self.tempLocation()
location = str(write_location) + str(filename)
df = pd.read_csv(location, names=list('abcdefghi'))

transfer = df.h
transfer = transfer[~transfer.isnull()]  # uses pandas to remove NaN
transfer = transfer.tolist()
length = int(len(transfer))
extra = length - 1
del transfer[extra]

bandwidth = df.i
bandwidth = bandwidth[~bandwidth.isnull()]
bandwidth = bandwidth.tolist()
del bandwidth[extra]

iperf_tran.append(transfer)
iperf_band.append(bandwidth)
[from a comment]
You need to use .extend(list) if you want to add a list's elements to another list, and don't worry: we all spend hours debugging classy-stupid-me mistakes sometimes ;)
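The difference between the two calls in a nutshell (values are illustrative):

```python
bandwidth = [45947383.0, 47097609.0]  # illustrative values

nested = []
nested.append(bandwidth)   # adds the list itself as one element
flat = []
flat.extend(bandwidth)     # adds each element individually

print(nested)  # [[45947383.0, 47097609.0]]
print(flat)    # [45947383.0, 47097609.0]
```

Writing `flat` line by line produces one number per line, which is the output the question wanted.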

Change two lines in text

I have a Python script mostly coded so far for a project I'm currently working on, and I've hit a roadblock. I run a program that spits out the following output file (called big.dmp):
)O+_05 Big-body initial data (WARNING: Do not delete this line!!)
) Lines beginning with `)' are ignored.
)---------------------------------------------------------------------
style (Cartesian, Asteroidal, Cometary) = Cartesian
epoch (in days) = 1365250.
)---------------------------------------------------------------------
COMPSTAR r=5.00000E-01 d=3.00000E+00 m= 0.160000000000000E+01
4.570923967127310E-01 1.841433531828977E+01 0.000000000000000E+00
-6.207379670518027E-03 1.540861575481520E-04 0.000000000000000E+00
0.000000000000000E+00 0.000000000000000E+00 0.000000000000000E+00
Now, with this file, I need to edit both the epoch line and the line beginning with COMPSTAR, while keeping the rest of the information constant from integration to integration, since the last three lines contain the Cartesian coordinates of my object and are essentially what the program outputs.
I know how to use f = open('big.dmp', 'w') and f.write('text here') to create the initial file, but how would one go about carrying these final three lines over into a new big.dmp file for the next integration?
Something like this perhaps?
infile = open('big1.dmp')
outfile = open('big2.dmp', 'w')

for line in infile:
    if line.startswith(')'):
        # ignore comments
        pass
    elif 'epoch' in line:
        # do something with the line
        line = line.replace('epoch', 'EPOCH')
    elif line.startswith('COMPSTAR'):
        # do something with the line
        line = line.replace('COMPSTAR', 'comparison star')
    outfile.write(line)
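The same branching logic can be sanity-checked against an in-memory sample without touching real files; the sample text below just mimics the big.dmp excerpt:

```python
# Sample text standing in for big.dmp (abbreviated).
sample = (
    ") comment line\n"
    " epoch (in days) =    1365250.\n"
    "COMPSTAR r=5.00000E-01 d=3.00000E+00\n"
)

out_lines = []
for line in sample.splitlines(keepends=True):
    if line.startswith(')'):
        pass                                    # comments pass through untouched
    elif 'epoch' in line:
        line = line.replace('epoch', 'EPOCH')
    elif line.startswith('COMPSTAR'):
        line = line.replace('COMPSTAR', 'comparison star')
    out_lines.append(line)

print(''.join(out_lines))
```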
Here is a somewhat more change-tolerant version:
import re

reg_num = r'\d+'
reg_sci = r'[-+]?\d*\.?\d+([eE][+-]?\d+)?'

def update_config(s, finds=None, replaces=None, **kwargs):
    if finds is None:
        finds = update_config.finds
    if replaces is None:
        replaces = update_config.replaces
    for name, value in kwargs.items():  # use iteritems() on Python 2
        s = re.sub(finds[name], replaces[name].format(value), s)
    return s

update_config.finds = {
    'epoch': r'epoch \(in days\) =\s*' + reg_num + r'\.',
    'r': r' r\s*=\s*' + reg_sci,
    'd': r' d\s*=\s*' + reg_sci,
    'm': r' m\s*=\s*' + reg_sci
}

update_config.replaces = {
    'epoch': 'epoch (in days) ={:>11d}.',
    'r': ' r={:1.5E}',
    'd': ' d={:1.5E}',
    'm': ' m= {:1.15E}'
}

def main():
    with open('big.dmp') as inf:
        s = inf.read()
    s = update_config(s, epoch=1365252, r=0.51, d=2.99, m=1.1)
    with open('big.dmp', 'w') as outf:
        outf.write(s)

if __name__ == "__main__":
    main()
On the off-chance that the format of your file is fixed with regard to line numbers, this solution will change only the two lines:
with open('big.dmp') as inf, open('out.txt', 'w') as outf:
    data = inf.readlines()
    data[4] = ' epoch (in days) = 9999.\n'        # line with epoch
    data[6] = 'COMPSTAR r=2201 d=3330 m= 12\n'    # line with COMPSTAR
    outf.writelines(data)
resulting in this output file:
)O+_05 Big-body initial data (WARNING: Do not delete this line!!)
) Lines beginning with `)' are ignored.
)---------------------------------------------------------------------
style (Cartesian, Asteroidal, Cometary) = Cartesian
epoch (in days) = 9999.
)---------------------------------------------------------------------
COMPSTAR r=2201 d=3330 m= 12
4.570923967127310E-01 1.841433531828977E+01 0.000000000000000E+00
-6.207379670518027E-03 1.540861575481520E-04 0.000000000000000E+00
0.000000000000000E+00 0.000000000000000E+00 0.000000000000000E+00
Clearly this will not work if the line numbers aren't consistent, but I thought I'd offer it up just in case your data format is consistent in terms of line numbers.
Also, since it reads the whole file into memory at once, it won't be an ideal solution for truly huge files.
The advantage of opening files using with is that they are automatically closed for you when you are done with them, or if you encounter an exception.
There are more flexible solutions (searching for the strings, processing the file line by line), but if your data is fixed and small, there's no downside to taking advantage of those facts. Somebody smart once said, "Simple is better than complex." (The Zen of Python)
It's a little hard to understand what you want, but assuming that you only want to keep the lines not starting with ):
text = open(filename).read()
lines = text.split("\n")
result = [line for line in lines if not line.startswith(")")]
or, as a one-liner:
[line for line in open(file_name).read().split("\n") if not line.startswith(")")]
