I get files that have NTFS audit permissions and I'm using Python to parse them. The raw CSV files list the path and then which groups have which access, such as this type of pattern:
E:\DIR A, CREATOR OWNER FullControl
E:\DIR A, Sales FullControl
E:\DIR A, HR Full Control
E:\DIR A\SUBDIR, Sales FullControl
E:\DIR A\SUBDIR, HR FullControl
My code parses the file to output this:
File Access for: E:\DIR A
CREATOR OWNER,FullControl
Sales,FullControl
HR,FullControl
File Access For: E:\DIR A\SUBDIR
Sales,FullControl
HR,FullControl
I'm new to generators but I'd like to use them to optimize my code. Nothing I've tried seems to work, so here is the original code (I know it's ugly). It works, but it's very slow. The only way I could get this to work is by parsing out the paths first, putting them in a list, making a set so that they're unique, then iterating over that set and matching each path against the first field of the entries in the second list, writing out all of the items it finds. Like I said, it's ugly, but it works.
import os, codecs, sys

reload(sys)
sys.setdefaultencoding('utf8')  # to prevent cp-932 errors on screen

file = "aud.csv"
outfile = "access-2.csv"

filelist = []
accesslist = []

with codecs.open(file,"r",'utf-8-sig') as infile:
    for line in infile:
        newline = line.split(',')
        folder = newline[0].replace("\"","")
        user = newline[1].replace("\"","")
        filelist.append(folder)
        accesslist.append(folder+","+user)

newfl = sorted(set(filelist))

def makeFile():
    print "Starting, please wait"
    for i in range(1,len(newfl)):
        searchItem = str(newfl[i])
        with codecs.open(outfile,"a",'utf-8-sig') as output:
            outtext = ("\r\nFile access for: "+ searchItem + "\r\n")
            output.write(outtext)
            for item in accesslist:
                searchBreak = item.split(",")
                searchTarg = searchBreak[0]
                if searchItem == searchTarg:
                    searchBreaknew = searchBreak[1].replace("FSA-INC01S\\","")
                    searchBreaknew = str(searchBreaknew)
                    # print(searchBreaknew)
                    searchBreaknew = searchBreaknew.replace(" ",",")
                    searchBreaknew = searchBreaknew.replace("CREATOR,OWNER","CREATOR OWNER")
                    output.write(searchBreaknew)
How should I optimize this?
EDIT:
Here is an edited version. It works MUCH faster, though I'm sure it can still be fixed:
import os, codecs, sys, csv

reload(sys)
sys.setdefaultencoding('utf8')

file = "aud.csv"
outfile = "access-3.csv"

filelist = []
accesslist = []

with codecs.open(file,"r",'utf-8-sig') as csvinfile:
    auditfile = csv.reader(csvinfile, delimiter=",")
    for line in auditfile:
        folder = line[0]
        user = line[1].replace("FSA-INC01S\\","")
        filelist.append(folder)
        accesslist.append(folder+","+user)

newfl = sorted(set(filelist))

def makeFile():
    print "Starting, please wait"
    for i in xrange(1,len(newfl)):
        searchItem = str(newfl[i])
        outtext = ("\r\nFile access for: "+ searchItem + "\r\n")
        accessUserlist = ""
        for item in accesslist:
            searchBreak = item.split(",")
            if searchItem == searchBreak[0]:
                searchBreaknew = str(searchBreak[1]).replace(" ",",")
                searchBreaknew = searchBreaknew.replace("R,O","R O")
                accessUserlist += searchBreaknew+"\r\n"
        with codecs.open(outfile,"a",'utf-8-sig') as output:
            output.write(outtext)
            output.write(accessUserlist)
I was misled by your use of the .csv file extension.
Your expected output isn't really CSV, since a CSV record can't contain a newline.
Here is a proposal using a generator that returns one record at a time:
class Audit(object):

    def __init__(self, fieldnames):
        self.fieldnames = fieldnames
        self.__access = {}

    def append(self, row):
        folder = row[self.fieldnames[0]]
        access = row[self.fieldnames[1]].strip(' ')
        access = access.replace("FSA-INC01S\\", "")
        access = access.split(' ')
        if len(access) == 3:
            if access[0] == 'CREATOR':
                access[0] += ' ' + access[1]
                del access[1]
            elif access[1] == 'Full':
                access[1] += ' ' + access[2]
                del access[2]
        if folder not in self.__access:
            self.__access[folder] = []
        self.__access[folder].append(access)

    # Generator for class Audit
    def __iter__(self):
        record = ''
        for folder in sorted(self.__access):
            record = folder + '\n'
            for access in self.__access[folder]:
                record += '%s\n' % (','.join(access))
            yield record + '\n'
How to use it:
def main():
    import io, csv

    audit = Audit(['Folder', 'Accesslist'])
    with io.open(file, "r", encoding='utf-8') as csc_in:
        for row in csv.DictReader(csc_in, delimiter=","):
            audit.append(row)

    with io.open(outfile, 'w', newline='', encoding='utf-8') as txt_out:
        for record in audit:
            txt_out.write(record)
Tested with Python:3.4.2 - csv:1.0
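Assuming the input CSV starts with a Folder,Accesslist header row (which DictReader needs for these fieldnames), each yielded record for the sample rows above would look roughly like the sketch below; note the "File access for:" prefix from the desired output would still need to be added when the record is written out.

E:\DIR A
CREATOR OWNER,FullControl
Sales,FullControl
HR,Full Control

E:\DIR A\SUBDIR
Sales,FullControl
HR,FullControl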
Related
I'm new to Python and I'm trying to count the words in a directory of text files and write the output to a separate text file. However, I want to specify conditions: if the word count is > 0, I would like to write the count and file path to one file, and if the count is == 0, I would like to write the count and file path to a separate file. Below is my code so far. I think I'm close, but I'm hung up on how to do the conditions and separate files. Thanks.
import sys
import os
from collections import Counter
import glob

stdoutOrigin = sys.stdout
sys.stdout = open("log.txt", "w")

def count_words_in_dir(dirpath, words, action=None):
    for filepath in glob.iglob(os.path.join("path", '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key, val in words.items():
                #print("key is " + key + "\n")
                ct = data.count(key)
                words[key] = ct
            if action:
                action(filepath, words)

def print_summary(filepath, words):
    for key, val in sorted(words.items()):
        print(filepath)
        if val > 0:
            print('{0}:\t{1}'.format(key, val))

filepath = sys.argv[1]
keys = ["x", "y"]
words = dict.fromkeys(keys, 0)
count_words_in_dir(filepath, words, action=print_summary)

sys.stdout.close()
sys.stdout = stdoutOrigin
I would strongly urge you to not repurpose stdout for writing data to a file as part of the normal course of your program. I also wonder how you can ever have a word "count < 0". I assume you meant "count == 0".
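For example, here is a minimal sketch of writing results directly to a file handle instead of redirecting sys.stdout (reusing the log.txt name from your code):

# Minimal sketch: write to an explicit file object instead of reassigning sys.stdout.
with open("log.txt", "w") as log:
    print("some result line", file=log)   # print can target a specific file
    log.write("or call write() directly\n")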
The main problem that your code has is in this line:
for filepath in glob.iglob(os.path.join("path", '*.txt')):
The string constant "path" I'm pretty sure doesn't belong there. I think you want dirpath there instead. I would think that this problem would prevent your code from working at all.
Here's a version of your code where I fixed these issues and added the logic to write to two different output files based on the count:
import sys
import os
import glob

out1 = open("/tmp/so/seen.txt", "w")
out2 = open("/tmp/so/missing.txt", "w")

def count_words_in_dir(dirpath, words, action=None):
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key, val in words.items():
                # print("key is " + key + "\n")
                ct = data.count(key)
                words[key] = ct
            if action:
                action(filepath, words)

def print_summary(filepath, words):
    for key, val in sorted(words.items()):
        whichout = out1 if val > 0 else out2
        print(filepath, file=whichout)
        print('{0}: {1}'.format(key, val), file=whichout)

filepath = sys.argv[1]
keys = ["country", "friend", "turnip"]
words = dict.fromkeys(keys, 0)
count_words_in_dir(filepath, words, action=print_summary)

out1.close()
out2.close()
Result:
file seen.txt:
/Users/steve/tmp/so/dir/data2.txt
friend: 1
/Users/steve/tmp/so/dir/data.txt
country: 2
/Users/steve/tmp/so/dir/data.txt
friend: 1
file missing.txt:
/Users/steve/tmp/so/dir/data2.txt
country: 0
/Users/steve/tmp/so/dir/data2.txt
turnip: 0
/Users/steve/tmp/so/dir/data.txt
turnip: 0
(excuse me for using some search words that were a bit more interesting than yours)
Hello, I hope I understood your question correctly. This code will count how many different words are in each file and, depending on the conditions, do what you want.
import os

all_words = {}

def count(file_path):
    with open(file_path, "r") as f:
        # for better performance it is a good idea to go through the file line by line
        for line in f:
            # single out all the words by splitting the string around spaces
            words = line.split(" ")
            # and check whether each word already exists in the all_words dictionary...
            for word in words:
                try:
                    # ...if it does, increment its number of repetitions
                    all_words[word.replace(",", "").replace(".", "").lower()] += 1
                except Exception:
                    # ...if it doesn't, create it with a count of 1
                    all_words[word.replace(",", "").replace(".", "").lower()] = 1

if __name__ == '__main__':
    # for every text file in your current directory count how many words it has
    for file in os.listdir("."):
        if file.endswith(".txt"):
            all_words = {}
            count(file)
            n = len(all_words)
            # depending on the number of words do something
            if n > 0:
                with open("count1.txt", "a") as f:
                    f.write(file + "\n" + str(n) + "\n")
            else:
                with open("count2.txt", "a") as f:
                    f.write(file + "\n" + str(n) + "\n")
If you want to count the same word multiple times, you can add up all the values in the dictionary, or you can eliminate the try-except block and count every word there.
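For instance, a minimal sketch of totaling the repetitions from the dictionary built above:

# Minimal sketch: sum all per-word counts to get the total number of words,
# including repetitions, from the all_words dictionary.
total = sum(all_words.values())
print(file, total)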
I am still new to Python, and have been working on this for work, along with a few side projects for automating my Plex media management tasks.
I am trying to write a Python script that takes a set list of domains from a csv file and matches each one to its DNS name server. Example: Plex.tv with an 'NS' lookup would return jeremy.ns.cloudflare.com.
My main goal is to:
read in the list of domains from a csv,
run my code to match those domains to a DNS resolver name,
write those to a new CSV file, and then zip the two together, which is what I have in my code.
I am having a few problems along the way.
Visual Code doesn't allow import dns.resolver (not a huge issue, but if you know the fix for that it would save me from having to run it from command line)
Matching Domains to their DNS resolver is throwing the error "AttributeError: 'list' object has no attribute 'is_absolute'"
import csv
import socket
import dns.resolver
import os
from os.path import dirname, abspath

# Setting Variables
current_path = dirname(abspath(__file__))
domainFName = '{0}/domains.csv'.format(current_path)
outputFile = '{0}/output.csv'.format(current_path)
dnsList = '{0}/list2.csv'.format(current_path)
case_list = []
fields = ['Domains', 'DNS Resolvers']
caseList = []
dnsResolve = []

# Read in all domains from csv into list
with open(domainFName, 'r') as file:
    for line in csv.reader(file):
        case_list.append(line)
print(case_list)

# Match domains to the DNS Resolver Name
for domains in case_list:
    answer = dns.resolver.resolve(domains, 'NS')
    server = answer.target
    dnsResolve.append(server)

# Write the dns Resolver names into a new csv file
with open(dnsList, 'w', newline="") as r:
    writers = csv.writer(r)
    writers.writerows(caseList)

# Write the domains and dns resolvers to new output csv
with open(outputFile, 'w', newline="") as f:
    writer = csv.writer(f)
    writer.writerow(fields)
    writer.writerow(zip(case_list, caseList))

exit()
Thanks for any help
After a discussion with a co-worker, I was able to resolve my issue. For the sake of completeness, in case anyone wants to use this code for a similar need (we use it for DMARC), here is the whole thing:
import dns.resolver
import csv
import os
from os.path import dirname, abspath

# Setting Variables
current_path = dirname(abspath(__file__))
domainFName = '{0}/domains.csv'.format(current_path)
outputFile = '{0}/output.csv'.format(current_path)
dnsList = '{0}/dnslist.csv'.format(current_path)
backupCSV = '{0}/backup-output.csv'.format(current_path)
case_list = []
dns_list = []
fields = ['Domains', 'DNS Resolvers']
csv_output = zip(case_list, dns_list)
domainAmount = 0
rd = 00
dnresolve = 00
part = 0
percentL = []
percents = [10,20,30,40,50,60,70,80,90,95,96,97,98,99]
percentList = []
floatingList = []
floatPart = []
x = 00
keyAzure = 'azure'
keyCSC = 'csc'

while x < .99:
    x += .01
    floatingList.append(x)

# THIS IS THE CODE FOR WRITING CSV FILES INTO LISTS - LABELED AS #1
print("FILES MUST BE CSV, WILL RETURN AN ERROR IF NOT. LEAVE OFF .CSV")

# Here we gather the input of which csv file to use. If none is entered, it will use domains.csv
print("Enter your input file name (if blank will use default):")
UserFile = str(input("Enter your filename: ") or "domains")
fullFile = UserFile + '.csv'
domainFName = fullFile.format(current_path)

# Here we specify the output file name. If the file does not exist, it will be created.
# If the user enters no data, the default will be used, output.csv
print("Enter your output file name (if blank will use default):")
UserOutput = str(input("Enter your filename: ") or "output")
fullOutput = UserOutput + '.csv'
outputFile = fullOutput.format(current_path)

# Read in all domains from csv into list
with open(domainFName, 'r') as file:
    for line in csv.reader(file):
        case_list.append(line)
        domainAmount += 1

print("Starting the resolver:")
print("You have " + str(domainAmount) + " Domains to resolve:")
# THIS IS THE END OF THE CODE FOR WRITING CSV FILES INTO LISTS - LABELED AS #1

# THE CODE BELOW IS WORKING FOR FINDING THE DNS RESOLVERS - LABELED AS #2
# Function for matching domains to DNS resolvers
def dnsResolver(domain):
    try:
        answers = dns.resolver.resolve(domain, 'NS')
        for server in answers:
            dns_list.append(server.target)
    except:
        dns_list.append("Did Not Resolve")

print("Now resolving domains to their DNS name:")
print("This will take a few minutes. Check out the progress bar for your status:")
print("I have resolved 0% Domains:")

# This code finds the percentages of the total amount of domains, used for progress status
def percentageFinder(percent, whole):
    return (percent * whole) / 100

def percentGetter(part, whole):
    return (100 * int(part)/int(whole))

for x in percents:
    percentList.append(int(percentageFinder(x, domainAmount)))
    percentL = percentList
# End code for percentage finding

for firstdomain in case_list:
    for domain in firstdomain:
        dnsResolver(domain)
        if dns_list[-1] != "Did Not Resolve":   # check the result that was just appended
            rd += 1
        else:
            dnresolve += 1

        # Using w+ to overwrite all Domain Names
        with open(dnsList, 'w+', newline="") as r:
            writers = csv.writer(r)
            writers.writerows(dns_list)

        # This is used for showing the percentage of the matching you have done
        part += 1
        if part in percentL:
            total = int(percentGetter(part, domainAmount))
            print("I Have Resolved {}".format(total) + "%" + " Domains:")
        else:
            pass

print("Resolving has completed. Statistics Below:")
print("------------------------------------------")
print("You had " + str(rd) + " domains that resolved.")
print("You had " + str(dnresolve) + " domains that did NOT resolve")
# THIS IS THE END OF THE WORKING CODE - LABELED AS #2

# Write the dns Resolver names into a new csv file
print("Now writing your domains & their DNS Name to an Output File:")
with open(outputFile, 'w+', newline="\n") as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(fields)
    for row in csv_output:
        writer.writerow(row)

print("Writing a backup CSV File")
# Using this to create a backup CSV that contains all domains and all resolvers.
# If someone runs the script with a small list of domains, we still want to keep a
# running list of everything in case any questions arise.
# This is done by using 'a' instead of 'w' or 'w+' done above.
with open(backupCSV, 'w', newline="") as f:
    writer = csv.writer(f, dialect='excel')
    writer.writerow(fields)
    for row in csv_output:
        writer.writerow(row)
print("Your backup is now done processing. Exiting program")

# Sort the files by keyword, in this case the domain being azure or csc
for r in dns_list:
    if keyAzure in r:
        for x in keyAzure:
            FileName = x
            print(FileName)

exit()
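For anyone hitting the same AttributeError: the original loop passed each whole csv row (a list) to dns.resolver.resolve and then read .target off the answer object itself; resolve() needs a single domain name, and the returned answer has to be iterated. A minimal sketch of just that corrected lookup, assuming the same one-domain-per-row layout:

# Minimal sketch of the corrected lookup (assumes one domain per csv row).
for row in case_list:                       # each row from csv.reader is a list of fields
    for domain in row:                      # pass the domain string, not the whole row
        try:
            answers = dns.resolver.resolve(domain, 'NS')
            for server in answers:          # iterate the answer set
                dns_list.append(server.target)
        except Exception:
            dns_list.append("Did Not Resolve")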
I am trying to extract values from JSON-LD to CSV, keeping them as they are in the file. There are a couple of issues I am facing.
1. The values being read for different fields are getting truncated in most cases. In the remaining cases, the value of one field shows up under another field.
2. I am also getting an error - 'Additional data' - after some 4,000 lines.
The file is quite big (half a GB). I am attaching a shortened version of my code. Please tell me where I am going wrong.
The input file - I have shortened it and put it here, since there was no way of including it in the post:
https://github.com/Architsi/json-ld-issue
I tried writing this script and I tried multiple online converters too
import csv, sys, math, operator, re, os, json, ijson
from pprint import pprint

filelist = []
for file in os.listdir("."):
    if file.endswith(".json"):
        filelist.append(file)

for input in filelist:
    newCsv = []
    splitlist = input.split(".")
    output = splitlist[0] + '.csv'
    newFile = open(output, 'w', newline='')  # wb for windows, else you'll see newlines added to csv

    # initialize csv writer
    writer = csv.writer(newFile)

    # Name of the columns
    header_row = ('Format', 'Description', 'Object', 'DataProvider')
    writer.writerow(header_row)

    with open(input, encoding="utf8") as json_file:
        data = ijson.items(json_file, 'item')
        # passing all the values through try except
        for s in data:
            source = s['_source']
            try:
                source_resource = source['sourceResource']
            except:
                print("Warning: No source resource in record ID: " + id)
            try:
                data_provider = source['dataProvider'].encode()
            except:
                data_provider = "N/A"
            try:
                _object = source['object'].encode()
            except:
                _object = "N/A"
            try:
                descriptions = source_resource['description']
                string = ""
                for item in descriptions:
                    if len(descriptions) > 1:
                        description = item.encode()  # + " | "
                    else:
                        description = item.encode()
                    string = string + description
                description = string.encode()
            except:
                description = "N/A"
            created = ""

            # writing it to csv
            write_tuple = ('format', description, _object, data_provider)
            writer.writerow(write_tuple)

    print("File written to " + output)
    newFile.close()
The error that I am getting is this: raise common.JSONError('Additional Data')
The expected result is a CSV file with all the columns and correct values.
I'm a relative novice at Python but, somehow, I managed to build a scraper for Instagram. I now want to take this one step further and output the 5 most commonly used hashtags from an IG profile into my CSV output file.
Current output:
I've managed to isolate the 5 most commonly used hashtags, but I get this result in my csv:
[('#striveforgreatness', 3), ('#jamesgang', 3), ('#thekidfromakron', 2), ('#togetherwecanchangetheworld', 1), ('#halloweenchronicles', 1)]
Desired output:
What I'm looking to end up with is five columns at the end of my .CSV, each holding the X-th most commonly used hashtag.
So something along these lines:

Top 1 | Top 2 | Top 3 | Top 4 | Top 5
#striveforgreatness | #jamesgang | #thekidfromakron | #togetherwecanchangetheworld | #halloweenchronicles

I've Googled for a while and managed to isolate them separately, but I always end up with '('#thekidfromakron', 2)' as an output. I seem to be missing some part of the puzzle :(.
Here is what I'm working with at the moment:
import csv
import requests
from bs4 import BeautifulSoup
import json
import re
import time
from collections import Counter

ts = time.gmtime()

def get_csv_header(top_numb):
    fieldnames = ['USER','MEDIA COUNT','FOLLOWERCOUNT','TOTAL LIKES','TOTAL COMMENTS','ER','ER IN %', 'BIO', 'ALL CAPTION TEXT','HASHTAGS COUNTED','MOST COMMON HASHTAGS']
    return fieldnames

def write_csv_header(filename, headers):
    with open(filename, 'w', newline='') as f_out:
        writer = csv.DictWriter(f_out, fieldnames=headers)
        writer.writeheader()
    return

def read_user_name(t_file):
    with open(t_file) as f:
        user_list = f.read().splitlines()
    return user_list

if __name__ == '__main__':

    # HERE YOU CAN SPECIFY YOUR USERLIST FILE NAME,
    # which contains a list of usernames, BY DEFAULT <current working directory>/userlist.txt
    USER_FILE = 'userlist.txt'

    # HERE YOU CAN SPECIFY YOUR DATA FILE NAME, BY DEFAULT (data.csv), where your final result stays
    DATA_FILE = 'users_with_er.csv'
    MAX_POST = 12  # MAX POST

    print('Starting the engagement calculations... Please wait until it finishes!')

    users = read_user_name(USER_FILE)

    """ Writing data to csv file """
    csv_headers = get_csv_header(MAX_POST)
    write_csv_header(DATA_FILE, csv_headers)

    for user in users:

        post_info = {'USER': user}
        url = 'https://www.instagram.com/' + user + '/'

        # for troubleshooting, un-comment the next two lines:
        # print(user)
        # print(url)

        try:
            r = requests.get(url)
            if r.status_code != 200:
                print(timestamp, ' user {0} not found or page unavailable! Skipping...'.format(user))
                continue
            soup = BeautifulSoup(r.content, "html.parser")
            scripts = soup.find_all('script', type="text/javascript", text=re.compile('window._sharedData'))
            stringified_json = scripts[0].get_text().replace('window._sharedData = ', '')[:-1]
            j = json.loads(stringified_json)['entry_data']['ProfilePage'][0]
            timestamp = time.strftime("%d-%m-%Y %H:%M:%S", ts)
        except ValueError:
            print(timestamp, 'ValueError for username {0}...Skipping...'.format(user))
            continue
        except IndexError as error:
            # Output expected IndexErrors.
            print(timestamp, error)
            continue

        if j['graphql']['user']['edge_followed_by']['count'] <= 0:
            print(timestamp, 'user {0} has no followers! Skipping...'.format(user))
            continue

        if j['graphql']['user']['edge_owner_to_timeline_media']['count'] < 12:
            print(timestamp, 'user {0} has less than 12 posts! Skipping...'.format(user))
            continue

        if j['graphql']['user']['is_private'] is True:
            print(timestamp, 'user {0} has a private profile! Skipping...'.format(user))
            continue

        media_count = j['graphql']['user']['edge_owner_to_timeline_media']['count']
        accountname = j['graphql']['user']['username']
        followercount = j['graphql']['user']['edge_followed_by']['count']
        bio = j['graphql']['user']['biography']

        i = 0
        total_likes = 0
        total_comments = 0
        all_captiontext = ''
        while i <= 11:
            total_likes += j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_liked_by']['count']
            total_comments += j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_media_to_comment']['count']
            captions = j['graphql']['user']['edge_owner_to_timeline_media']['edges'][i]['node']['edge_media_to_caption']
            caption_detail = captions['edges'][0]['node']['text']
            all_captiontext += caption_detail
            i += 1

        engagement_rate_percentage = '{0:.4f}'.format((((total_likes + total_comments) / followercount) / 12) * 100) + '%'
        engagement_rate = (((total_likes + total_comments) / followercount) / 12 * 100)

        # isolate and count hashtags
        hashtags = re.findall(r'#\w*', all_captiontext)
        hashtags_counted = Counter(hashtags)
        most_common = hashtags_counted.most_common(5)

        with open('users_with_er.csv', 'a', newline='', encoding='utf-8') as data_out:

            print(timestamp, 'Writing Data for user {0}...'.format(user))
            post_info["USER"] = accountname
            post_info["FOLLOWERCOUNT"] = followercount
            post_info["MEDIA COUNT"] = media_count
            post_info["TOTAL LIKES"] = total_likes
            post_info["TOTAL COMMENTS"] = total_comments
            post_info["ER"] = engagement_rate
            post_info["ER IN %"] = engagement_rate_percentage
            post_info["BIO"] = bio
            post_info["ALL CAPTION TEXT"] = all_captiontext
            post_info["HASHTAGS COUNTED"] = hashtags_counted

            csv_writer = csv.DictWriter(data_out, fieldnames=csv_headers)
            csv_writer.writerow(post_info)

    """ Done with the script """
    print('ALL DONE !!!! ')
The code that goes before this simply scrapes the webpage, and compiles all the captions from the last 12 posts into "all_captiontext".
Any help to solve this (probably simple) issue would be greatly appreciated as I've been struggling with this for days (again, I'm a noob :') ).
Replace line
post_info["MOST COMMON HASHTAGS"] = most_common
with:
for i, counter_tuple in enumerate(most_common):
    tag_name = counter_tuple[0].replace('#','')
    label = "Top %d" % (i + 1)
    post_info[label] = tag_name
There's also a bit of code missing. For example, your code doesn't include a csv_headers variable, which I suppose would be
csv_headers = post_info.keys()
It also seems that you're opening a file to write just one row. I don't think that's intended, so what you would like to do is collect the results into a list of dictionaries. A cleaner solution would be to use a pandas DataFrame, which you can write straight to a csv file.
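A minimal sketch of that idea, assuming the per-user dicts are collected into a list called rows inside the loop:

# Minimal sketch (assumes `rows` is a list of per-user dicts collected in the loop).
import pandas as pd

rows = []                      # collect the post_info dicts here instead of writing one row at a time
# rows.append(post_info)       # inside the user loop
df = pd.DataFrame(rows)
df.to_csv('users_with_er.csv', index=False)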
Since most_common is the output of the call to hashtags_counted.most_common, I had a look at the docs here: https://docs.python.org/2/library/collections.html#collections.Counter.most_common
The output is formatted as follows: [(key, value), (key, value), ...], ordered by decreasing number of occurrences.
Hence, to get only the name and not the number of occurrences, you should replace:
post_info["MOST COMMON HASHTAGS"] = most_common
by
post_info["MOST COMMON HASHTAGS"] = [x[0] for x in most_common]
You have a list of tuples. This statement builds, on the fly, the list of the first element of each tuple, keeping the sort order.
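For instance, with the output shown earlier (truncated here):

most_common = [('#striveforgreatness', 3), ('#jamesgang', 3), ('#thekidfromakron', 2)]
[x[0] for x in most_common]    # -> ['#striveforgreatness', '#jamesgang', '#thekidfromakron']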
I am using python 2.4.4 (old machine, can't do anything about it) on a UNIX machine. I am extremely new to python/programming and have never used a UNIX machine before. This is what I am trying to do:
extract a single sequence from a FASTA file (proteins + nucleotides) to a temporary text file.
Give this temporary file to a program called 'threader'
Append the output from threader (called tempresult.out) to a file called results.out
Remove the temporary file.
Remove the tempresult.out file.
Repeat using the next FASTA sequence.
Here is my code so far:
import os
from itertools import groupby

input_file = open('controls.txt', 'r')
output_file = open('results.out', 'a')

def fasta_parser(fasta_name):
    input = fasta_name
    parse = (x[1] for x in groupby(input, lambda line: line[0] == ">"))
    for header in parse:
        header = header.next()[0:].strip()
        seq = "\n".join(s.strip() for s in parse.next())
        yield (header, '\n', seq)

parsedfile = fasta_parser(input_file)

mylist = list(parsedfile)

index = 0
while index < len(mylist):
    temp_file = open('temp.txt', 'a+')
    temp_file.write(' '.join(mylist[index]))
    os.system('threader' + ' temp.txt' + ' tempresult.out' + ' structures.txt')
    os.remove('temp.txt')
    f = open('tempresult.out', 'r')
    data = str(f.read())
    output_file.write(data)
    os.remove('tempresult.out')
    index += 1

output_file.close()
temp_file.close()
input_file.close()
When I run this script I get the error 'Segmentation fault'. From what I gather, this has to do with me messing with memory I shouldn't be messing with (???). I assume it is something to do with the temporary files, but I have no idea how I would get around it.
Any help would be much appreciated!
Thanks!
Update 1:
Threader works fine when I give it the same sequence multiple times like this:
import os

input_file = open('control.txt', 'r')
output_file = open('results.out', 'a')

x = 0
while x < 3:
    os.system('threader' + ' control.txt' + ' tempresult.out' + ' structures.txt')
    f = open('tempresult.out', 'r')
    data = str(f.read())
    output_file.write(data)
    os.remove('result.out')
    x += 1

output_file.close()
input_file.close()
Update 2: In case anyone else gets this error: I forgot to close temp.txt before invoking the threader program.
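A minimal sketch of that fix inside the loop (same threader command line as above), closing the temp file before the external call so the sequence is actually flushed to disk:

# Minimal sketch: write the sequence, close the file, then call threader.
temp_file = open('temp.txt', 'a+')
temp_file.write(' '.join(mylist[index]))
temp_file.close()                      # close so the data is flushed before threader reads it
os.system('threader temp.txt tempresult.out structures.txt')
os.remove('temp.txt')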