IndexError: list index out of range in Python script

I'm new to Python, so I apologize if this question has already been answered. I've used this script before and it's worked, so I'm not at all sure what is wrong.
I'm trying to transform a MALLET output document into a long list of topic, weight, value rather than a wide list of topics, documents, and weights.
Here's what the original csv I'm trying to convert looks like, except there are 30 topics in it (it's a text file called mb_composition.txt):
0 file:/Users/mandyregan/Dropbox/CPH-DH/MiningtheSurge/txt/Abizaid.txt 6.509147794508226E-6 1.8463345214533957E-5 3.301298069640119E-6 0.003825178550032757 0.15240841618294929 0.03903974304065183 0.10454783676528623 0.1316719812119471 1.8018057013225344E-5 4.869261713020613E-6 0.0956868156114931 1.3521101623203115E-5 9.514591058923748E-6 1.822741355900598E-5 4.932324961835634E-4 2.756817586271138E-4 4.039186874601744E-5 1.0503346606335033E-5 1.1466132458804392E-5 0.007003443189848799 6.7094360963952E-6 0.2651753488982284 0.011727025879070194 0.11306132549594633 4.463460490946615E-6 0.0032751230536005056 1.1887304822238514E-5 7.382714572306351E-6 3.538808652077042E-5 0.07158823129977483
1 file:/Users/mandyregan/Dropbox/CPH-DH/MiningtheSurge/txt/Jeffrey,%20Jim%20-%20Chk5-%20ASC%20-%20FINAL%20-%20Sept%202017.docx.txt 4.296636200313062E-6 1.218750594272488E-5 1.5556725986514498E-4 0.043172816021532695 0.04645757277949794 0.01963429696910822 0.1328206370818606 0.116826297071711 1.1893574776047563E-5 3.2141605637859693E-6 0.10242945223692496 0.010439315937573735 0.2478814493196687 1.2031769351093548E-5 0.010142417179693447 2.858721603853616E-5 2.6662348272204834E-5 6.9331747684835E-6 7.745091995495631E-4 0.04235638910274044 4.428844900369446E-6 0.0175105406405736 0.05314379308820005 0.11788631730736487 2.9462944350793084E-6 4.746133386282654E-4 7.846714475661223E-6 4.873270616886766E-6 0.008919869163605806 0.02884824479155971
And here's the python script I'm trying to use to convert it:
infile = open('mallet_output_files/mb_composition.txt', 'r')
outfile = open('mallet_output_files/weights.csv', 'w+')
outfile.write('file,topicnum,weight\n')

for line in infile:
    tokens = line.split('\t')
    fn = tokens[1]
    topics = tokens[2:]
    #outfile.write(fn[46:] + ",")
    for i in range(0,59):
        outfile.write(fn[46:] + ",")
        outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')
I'm running this in the terminal with python reshape.py and I get this error:
Traceback (most recent call last):
  File "reshape.py", line 12, in <module>
    outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')
IndexError: list index out of range
Any idea what I'm doing wrong here? I can't seem to figure it out and am frustrated because I know I've used this script many times before with success! If it helps, I'm on Mac OS X with Python 2.7.10.

The problem is that your loop assumes nearly 60 topic/weight pairs per line of your CSV, and each line doesn't have that many values.
If you just want to print out the topics in the list up to the nth topic per line, you should probably define your range by the actual number of topics per line:
for i in range(len(topics) // 2):
    outfile.write(fn[46:] + ",")
    outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')
Stated more Pythonically, it would look something like this:
# Group the topics into tuple-pairs for easier management
paired_topics = [tuple(topics[i:i+2]) for i in range(0, len(topics), 2)]

# Iterate the paired topics and print them each on a line of output
for topic in paired_topics:
    outfile.write(fn[46:] + ',' + ','.join(topic) + '\n')
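As a side note, the last field on each line keeps its trailing newline after split('\t'), so it can help to strip the line first. A minimal sketch using the csv module (same file paths and slicing as the question; an untested variation, not the original script):

import csv

with open('mallet_output_files/mb_composition.txt') as infile, \
     open('mallet_output_files/weights.csv', 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['file', 'topicnum', 'weight'])
    for line in infile:
        tokens = line.strip().split('\t')
        fn = tokens[1]
        topics = tokens[2:]
        # pair up the remaining fields instead of assuming a fixed count
        for i in range(len(topics) // 2):
            writer.writerow([fn[46:], topics[i*2], topics[i*2+1]])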

You need to debug your code. Try printing out variables.
infile = open('mallet_output_files/mb_composition.txt', 'r')
outfile = open('mallet_output_files/weights.csv', 'w+')
outfile.write('file,topicnum,weight\n')

for line in infile:
    tokens = line.split('\t')
    fn = tokens[1]
    topics = tokens[2:]
    # outfile.write(fn[46:] + ",")
    for i in range(0,59):
        # Add a print statement like this (str.format so it also works on your Python 2.7)
        print('Topics {}: {} and {}'.format(i, i*2, i*2+1))
        outfile.write(fn[46:] + ",")
        outfile.write(topics[i*2]+','+topics[i*2+1]+'\n')

Your 'topics' list only has 30 elements, right? It looks like you're trying to access items far outside of the available range, i.e., you're trying to access topics[x] where x >= 30.
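A quick way to confirm (a sketch reusing the question's variable names): print the list length before indexing. With 30 topics, range(0, 59) tries to read up to topics[117].

for line in infile:
    tokens = line.split('\t')
    topics = tokens[2:]
    print(len(topics))  # if this prints 30, any index above 29 raises IndexError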

Related

Is there a way to print data from a log file between two endpoints in python

I have a log file and am trying to print the data between two dates.
2020-01-31T20:12:38.1234Z, asdasdasdasdasdasd,...\n
2020-01-31T20:12:39.1234Z, abcdef,...\n
2020-01-31T20:12:40.1234Z, ghikjl,...\n
2020-01-31T20:12:41.1234Z, mnopqrstuv,...\n
2020-01-31T20:12:42.1234Z, wxyzdsasad,...\n
This is the sample log file, and I want to print the lines from 2020-01-31T20:12:39 up to 2020-01-31T20:12:41.
So far I have managed to find and print the starting date line. I have passed the starting date as start.
linenum = 0
with open("logfile.log") as myFile:
    for line in myFile:
        linenum += 1
        if line.find(start) != -1:
            print("Line " + str(linenum) + ": " + line.rstrip('\n'))
but how do I keep printing till the end date?
Not an answer in Python, but in bash:
sed -n '/2020-01-31T20:12:38.1234Z/,/2020-01-31T20:12:41.1234Z/p' file.log
Output:
2020-01-31T20:12:38.1234Z, asdasdasdasdasdasd,...\n
2020-01-31T20:12:39.1234Z, abcdef,...\n
2020-01-31T20:12:40.1234Z, ghikjl,...\n
2020-01-31T20:12:41.1234Z, mnopqrstuv,...\n
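The same start/stop behavior can be sketched in Python with a flag, assuming start and end hold the two date prefixes you're matching (e.g. '2020-01-31T20:12:38' and '2020-01-31T20:12:41'):

printing = False
with open("logfile.log") as myFile:
    for linenum, line in enumerate(myFile, 1):
        if start in line:
            printing = True  # start line found: begin printing from here
        if printing:
            print("Line " + str(linenum) + ": " + line.rstrip('\n'))
            if end in line:
                break  # end line found: stop after printing it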
Since the time string is already structured nicely in your file, you can just do a simple string comparison between the times you're interested in without converting the string to a datetime object.
Use the csv module to read in the file, using the default comma delimiter, and then the filter() function to filter between two dates.
import csv

reader = csv.reader(open("logfile.log"))
filtered = filter(lambda p: '2020-01-31T20:12:39' <= p[0].split('.')[0] <= '2020-01-31T20:12:41', reader)

for l in filtered:
    print(','.join(l))
Edit:
I used split() to remove the fractional part of the time string in the string comparison, since you're interested in times to the nearest second, e.g. 2020-01-31T20:12:39.
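This works because ISO-8601 timestamps compare correctly as plain strings; for example:

>>> '2020-01-31T20:12:39' <= '2020-01-31T20:12:40' <= '2020-01-31T20:12:41'
True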
If you want it in Python:
import time
from datetime import datetime as dt

def to_timestamp(date, forma='%Y-%m-%dT%H:%M:%S'):
    return time.mktime(dt.strptime(date, forma).timetuple())

# startdate and enddate are the boundary strings, e.g. '2020-01-31T20:12:39'
start = to_timestamp(startdate)
end = to_timestamp(enddate)

logs = {}
with open("logfile.log") as f:
    for line in f:
        date = line.split(', ')[0].split('.')[0]
        logline = line.split(', ')[1].strip('\n')
        if start <= to_timestamp(date) <= end:
            logs[date] = logline
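Usage would look something like this (hypothetical boundary values matching the sample log):

startdate = '2020-01-31T20:12:39'
enddate = '2020-01-31T20:12:41'
# ... after running the loop above, logs maps each timestamp to its message:
for date in sorted(logs):
    print(date, logs[date])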

How to solve problem decoding from wrong json format

Hi everyone. I need help opening and reading the file.
Got this txt file - https://yadi.sk/i/1TH7_SYfLss0JQ
It is a dictionary
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}
But it was written using json into a txt file.
#This is how I dump the data into a txt
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
So, the file structure is
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}
And it is all one string....
I need to open it, check for repeated IDs, delete the duplicates, and save the file again.
But I'm getting json.loads ValueError: Extra data.
Tried these:
How to read line-delimited JSON from large file (line by line)
Python json.loads shows ValueError: Extra data
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)
But still getting that error, just in different place.
Right now I got as far as:
with open('111111111.txt', 'r') as log:
    before_log = log.read()

before_log = before_log.replace('}{', ', ').split(', ')

mu_dic = []
for i in before_log:
    mu_dic.append(i)
This eliminates the problem of several {}{}{} dictionaries/JSONs in a row.
Maybe there is a better way to do this?
P.S. This is how the file is made:
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
Your file size is 9.5 MB, so it'll take you a while to open and debug it manually.
So, using the head and tail tools (normally found in any GNU/Linux distribution) you'll see that:
# You can use Python as well to read chunks from your file
# and see the nature of it and what is causing the decode problem,
# but I prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
So the first guess is that your file is a malformed series of JSON data, and the best guess is to separate each }{ pair with a \n for further manipulation.
So, here is an example of how you can solve your problem using Python:
import json

input_file = '111111111.txt'
output_file = 'new_file.txt'

data = ''
with open(input_file, mode='r', encoding='utf8') as f_file:
    # this with statement part can be replaced by
    # using sed under your OS like this example:
    # sed -i 's/}{/}\n{/g' 111111111.txt
    data = f_file.read()
    data = data.replace('}{', '}\n{')

seen, total_keys, to_write = set(), 0, {}
# split the lines of the in-memory data
for elm in data.split('\n'):
    # convert the line to a valid Python dict
    converted = json.loads(elm)
    # loop over the keys
    for key, value in converted.items():
        total_keys += 1
        # if the key has not been seen yet, keep it for further manipulation,
        # else ignore it
        if key not in seen:
            seen.add(key)
            to_write.update({key: value})

# write the dict's keys & values into a new file as JSON
with open(output_file, mode='a+', encoding='utf8') as out_file:
    out_file.write(json.dumps(to_write) + '\n')

print(
    'found duplicated key(s): {seen} from {total}'.format(
        seen=total_keys - len(seen),
        total=total_keys
    )
)
Output:
found duplicated key(s): 43836 from 45367
And finally, the output file will be a valid JSON file and the duplicated keys will be removed with their values.
The basic difference between the file's structure and actual JSON format is the missing commas between objects, and that the objects are not enclosed within [ and ]. So the same can be achieved with the code snippet below:
import json

with open('json_file.txt') as f:
    # Read the complete file
    a = f.read()

# Convert into a single-line string
b = ''.join(a.splitlines())

# Add , after each object
b = b.replace("}", "},")

# Add opening and closing brackets and drop the trailing comma added in the previous step
b = '[' + b[:-1] + ']'

x = json.loads(b)
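An alternative that avoids the string surgery altogether is json.JSONDecoder.raw_decode, which parses one object at a time and reports where it stopped. A sketch under the same assumptions (concatenated top-level objects in a file named 111111111.txt); note that update() keeps the last value for a duplicated key, not the first:

import json

decoder = json.JSONDecoder()

def iter_json_objects(text):
    # yield each top-level object from a string like {...}{...}{...}
    pos = 0
    while pos < len(text):
        obj, end = decoder.raw_decode(text, pos)
        yield obj
        pos = end
        while pos < len(text) and text[pos].isspace():
            pos += 1  # skip any whitespace between objects

with open('111111111.txt', encoding='utf8') as f:
    merged = {}
    for chunk in iter_json_objects(f.read()):
        merged.update(chunk)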

Python: Having trouble replacing lines from file

I'm trying to build a translator for subtitles using DeepL, but it isn't running perfectly. I managed to translate the subtitles, but I'm having problems replacing the lines. I can see that the lines are translated because it prints them, but it doesn't replace them: whenever I run the program, the output file is the same as the original.
This is the code responsible for it:
def translate(input, output, languagef, languaget):
    file = open(input, 'r').read()
    fileresp = open(output, 'r+')
    subs = list(srt.parse(file))
    for sub in subs:
        try:
            linefromsub = sub.content
            translationSentence = pydeepl.translate(linefromsub, languaget.upper(), languagef.upper())
            print(str(sub.index) + ' ' + translationSentence)
            for line in fileresp.readlines():
                newline = fileresp.write(line.replace(linefromsub, translationSentence))
        except IndexError:
            print("Error parsing data from deepl")
This is how the file looks:
1
00:00:02,470 --> 00:00:04,570
- Yes, I do.
- (laughs)
2
00:00:04,605 --> 00:00:07,906
My mom doesn't want
to babysit everyday
3
00:00:07,942 --> 00:00:09,274
or any day.
4
00:00:09,310 --> 00:00:11,977
But I need
my mom's help sometimes.
5
00:00:12,013 --> 00:00:14,046
She's just gonna
have to be grandma today.
Help will be appreciated :)
Thanks.
You are opening fileresp with r+ mode. When you call readlines(), the file's position will be set to the end of the file. Subsequent calls to write() will then append to the file. If you want to overwrite the original contents as opposed to append, you should try this instead:
allLines = fileresp.readlines()
fileresp.seek(0)     # Set position to the beginning
fileresp.truncate()  # Delete the contents
for line in allLines:
    fileresp.write(...)
Update
It's difficult to see what you're trying to accomplish with r+ mode here, but it seems you have two separate input and output files. If that's the case, consider:
def translate(input, output, languagef, languaget):
    file = open(input, 'r').read()
    fileresp = open(output, 'w')  # Use w mode instead
    subs = list(srt.parse(file))
    for sub in subs:
        try:
            linefromsub = sub.content
            translationSentence = pydeepl.translate(linefromsub, languaget.upper(), languagef.upper())
            print(str(sub.index) + ' ' + translationSentence)
            fileresp.write(translationSentence)  # Write the translated sentence
        except IndexError:
            print("Error parsing data from deepl")

Writing list to text file not making new lines python

I'm having trouble writing to text file. Here's my code snippet.
ram_array = map(str, ram_value)
cpu_array = map(str, cpu_value)
iperf_ba_array = map(str, iperf_ba)
iperf_tr_array = map(str, iperf_tr)

#with open(ram, 'w') as f:
#    for s in ram_array:
#        f.write(s + '\n')
#with open(cpu, 'w') as f:
#    for s in cpu_array:
#        f.write(s + '\n')

with open(iperf_b, 'w') as f:
    for s in iperf_ba_array:
        f.write(s + '\n')
    f.close()

with open(iperf_t, 'w') as f:
    for s in iperf_tr_array:
        f.write(s + '\n')
    f.close()
The ram and cpu files both work flawlessly; however, when writing to a file, iperf_ba and iperf_tr always come out looking like this:
[45947383.0, 47097609.0, 46576113.0, 47041787.0, 47297394.0]
Instead of
1
2
3
They're both reading from global lists. The cpu and ram values are appended one by one, but otherwise they look exactly the same before processing.
Here's how they're made:
filename= "iperfLog_2015_03_12_20:45:18_123_____tag_33120L06.csv"
write_location= self.tempLocation()
location=(str(write_location) + str(filename));
df = pd.read_csv(location, names=list('abcdefghi'))
transfer = df.h
transfer=transfer[~transfer.isnull()]#uses pandas to remove nan
transfer=transfer.tolist()
length= int(len(transfer))
extra= length-1
del transfer[extra]
bandwidth= df.i
bandwidth=bandwidth[~bandwidth.isnull()]
bandwidth=bandwidth.tolist()
del bandwidth[extra]
iperf_tran.append(transfer)
iperf_band.append(bandwidth)
[from comment]
You need to use .extend(list) if you want to add the elements of one list to another - and don't worry: we all spend hours debugging and chasing silly mistakes sometimes ;)
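A minimal illustration of the difference (hypothetical values):

values = [45947383.0, 47097609.0]
out = []
out.append(values)  # out == [[45947383.0, 47097609.0]] -> one nested list, written with brackets
out = []
out.extend(values)  # out == [45947383.0, 47097609.0] -> individual numbers, one per line when written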

How do you make tables with previously stored strings?

So the question basically gives me 19 DNA sequences and wants me to make a basic text table. The first column has to be the sequence ID, the second column the length of the sequence, the third the number of "A"s, the 4th the number of "G"s, the 5th "C"s, the 6th "T"s, the 7th %GC, and the 8th whether or not the sequence contains "TGA". Then I get all these values and write a table to "dna_stats.txt".
Here is my code:
fh = open("dna.fasta","r")
Acount = 0
Ccount = 0
Gcount = 0
Tcount = 0
seq=0
alllines = fh.readlines()
for line in alllines:
if line.startswith(">"):
seq+=1
continue
Acount+=line.count("A")
Ccount+=line.count("C")
Gcount+=line.count("G")
Tcount+=line.count("T")
genomeSize=Acount+Gcount+Ccount+Tcount
percentGC=(Gcount+Ccount)*100.00/genomeSize
print "sequence", seq
print "Length of Sequence",len(line)
print Acount,Ccount,Gcount,Tcount
print "Percent of GC","%.2f"%(percentGC)
if "TGA" in line:
print "Yes"
else:
print "No"
fh2 = open("dna_stats.txt","w")
for line in alllines:
splitlines = line.split()
lenstr=str(len(line))
seqstr = str(seq)
fh2.write(seqstr+"\t"+lenstr+"\n")
I found that you have to convert the variables into strings. I have all of the values calculated correctly when I print them out in the terminal. However, I keep getting only 19 in the first column, when it should go 1, 2, 3, 4, 5, etc. to represent each of the sequences. I tried it with the other variables and just got the totals for the whole file. I started trying to make the table but have not finished it.
So my biggest issue is that I don't know how to get the values of the variables for each specific line.
I am new to python and programming in general so any tips or tricks or anything at all will really help.
I am using python version 2.7
Well, your biggest issue:

for line in alllines: #1
    ...
    fh2 = open("dna_stats.txt","w")
    for line in alllines: #2
        ....

Indentation matters. This says "for every line (#1), open a file and then loop over every line again (#2)..."
De-indent those things.
This puts the info in a dictionary as you go, and allows for DNA sequences that span multiple lines:
from __future__ import division  # ensure things like 1/2 are 0.5 rather than 0
from collections import defaultdict

fh = open("dna.fasta","r")
alllines = fh.readlines()
fh2 = open("dna_stats.txt","w")

seq = 0
data = dict()
for line in alllines:
    if line.startswith(">"):
        seq += 1
        data[seq] = defaultdict(int)  # default value is zero if a key is not present, so += 1 works without initializing
        data[seq]['seq'] = seq
        previous_line_end = ""  # "TGA" might be split across lines
        continue
    data[seq]['Acount'] += line.count("A")
    data[seq]['Ccount'] += line.count("C")
    data[seq]['Gcount'] += line.count("G")
    data[seq]['Tcount'] += line.count("T")
    line_over = previous_line_end + line[:3]
    data[seq]['hasTGA'] = data[seq]['hasTGA'] or ("TGA" in line) or ("TGA" in line_over)
    previous_line_end = str.strip(line[-4:])  # save this line's end for the next line, removing the newline character

for seq in data.keys():
    # compute the per-sequence totals once, from the final counts
    data[seq]['genomeSize'] = data[seq]['Acount'] + data[seq]['Gcount'] + data[seq]['Ccount'] + data[seq]['Tcount']
    data[seq]['percentGC'] = (data[seq]['Gcount'] + data[seq]['Ccount']) * 100.00 / data[seq]['genomeSize']
    s = '%(seq)d, %(genomeSize)d, %(Acount)d, %(Gcount)d, %(Ccount)d, %(Tcount)d, %(percentGC).2f, %(hasTGA)s'
    fh2.write(s % data[seq] + '\n')

fh.close()
fh2.close()
