If statement and writing to file - python

I am using the KEGG API to download genomic data and write it to a file. There are 26 files total and some of them contain the dictionary key 'COMPOUND'. I would like to assign these to compData and write them to the output file. I tried writing it as an if True statement, but this does not work.
# Read in hsa links
hsa = []
with open('/users/skylake/desktop/pathway-HSAs.txt', 'r') as file:
    for line in file:
        line = line.strip()
        hsa.append(line)

# Import KEGG API Bioservices | Create KEGG Variable
from bioservices.kegg import KEGG
k = KEGG()

# Data Parsing | Writing to File
# for i in range(len(hsa)):
data = k.get(hsa[2])
dict_data = k.parse(data)
if dict_data['COMPOUND'] == True:
    compData = str(dict_data['COMPOUND'])
nameData = str(dict_data['NAME'])
geneData = str(dict_data['GENE'])
f = open('/Users/Skylake/Desktop/pathway-info/' + nameData + '.txt', 'w')
f.write("Genes\n")
f.write(geneData)
f.write("\nCompounds\n")
f.write(compData)
f.close()

I guess that by

if dict_data['COMPOUND'] == True:

you are (wrongly) testing for the existence of the key 'COMPOUND' in dict_data. What you want in this case is

if 'COMPOUND' in dict_data:

Furthermore, note that the variable compData won't be defined if the key is not present, which will raise an error when you try to write its value. This means you should always define it no matter what, e.g. via

compData = str(dict_data.get('COMPOUND', 'undefined'))

The above line means that if the key exists you get its value, and if it does not exist you get 'undefined' instead. Note that you can choose whatever fallback value you want, or even give none at all, which results in None by default.
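Putting it all together, a minimal sketch of the corrected block (same paths and variable names as in your code; the with statement for closing the file is just one stylistic choice):

data = k.get(hsa[2])
dict_data = k.parse(data)
# Only write an output file for pathways that actually have a COMPOUND entry
if 'COMPOUND' in dict_data:
    compData = str(dict_data['COMPOUND'])
    nameData = str(dict_data['NAME'])
    geneData = str(dict_data['GENE'])
    # with closes the file automatically, even on errors
    with open('/Users/Skylake/Desktop/pathway-info/' + nameData + '.txt', 'w') as f:
        f.write("Genes\n")
        f.write(geneData)
        f.write("\nCompounds\n")
        f.write(compData)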

Rename '.tbl' files in directory using string from the first line of file python

I have a directory filled with '.tbl' files. The file set up is as follows:
\STAR_ID = "HD 74156"
\DATA_CATEGORY = "Planet Radial Velocity Curve"
\NUMBER_OF_POINTS = "82"
\TIME_REFERENCE_FRAME = "JD"
\MINIMUM_DATE = "2453342.23249"
\DATE_UNITS = "days"
\MAXIMUM_DATE = "2454231.60002"
....
I need to rename every file in the directory using the STAR_ID, so in this case the file's name would be 'HD 74156.tbl'. I have been able to do it for about 20 of the ~600 files. I am not sure why it will not continue through the rest of the files. My current code is:
for i in os.listdir(path):
    with open(i) as f:
        first_line = f.readline()
        system = first_line.split('"')[1]
        new_file = system + ".tbl"
        os.rename(file, new_file)
and the error message is:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-37-5883c060a977> in <module>
3 with open(i) as f:
4 first_line = f.readline()
----> 5 system = first_line.split('"')[1]
6 new_file = system + ".tbl"
7 os.rename(file, new_file)
IndexError: list index out of range
This error occurs because first_line.split('"') is returning a list with fewer than 2 items.
You can try

first_line_ls = first_line.split('"')
if len(first_line_ls) > 1:
    system = first_line_ls[1]
else:
    # other method
    ...

This code can help you prevent the error and handle cases where the first_line str has fewer than 2 '"' characters.
It looks like these .tbl files are not as uniform as you might have hoped. If this line:
----> 5 system = first_line.split('"')[1]
fails on some files, it's because their first line is not formatted as you expected, as @Leo Arad noted. You also want to make sure you're actually using the STAR_ID field. Perhaps these files usually put all the fields in the same order (as an aside, what are these .tbl files? What software did they come from? I've never seen it before), but since you've already found other inconsistencies in the format, better safe than sorry.
I might write a little helper function to parse the fields in this file. It takes a single line and returns a (key, value) tuple for the field. If the line does not look like a valid field it returns (None, None):
import re

# Dissection of this regular expression:
#   ^\\ : line begins with \
#   (?P<key>\w+) : extract the key, which is one or more letters, numbers or underscores
#   \s*=\s* : an equal sign surrounded by any amount of white space
#   "(?P<value>[^"]*)" : extract the value, which is between a pair of double-quotes
#       and contains any characters other than double-quotes
# (Note: I don't know if this file format has a mechanism for escaping
# double-quotes inside the value; if so that would have to be handled as well)
_field_re = re.compile(r'^\\(?P<key>\w+)\s*=\s*"(?P<value>[^"]*)"')

def parse_field(line):
    # match the line against the regular expression
    match = _field_re.match(line)
    # if it doesn't match, return (None, None)
    if match is None:
        return (None, None)
    else:
        # return the key and value pair
        return match.groups()
Now open your file, loop over all the lines, and perform the rename once you find STAR_ID. If not, print a warning (this is mostly the same as your code with some slight variations):
import os
import sys

for filename in os.listdir(path):
    filename = os.path.join(path, filename)
    star_id = None
    # NOTE: Do the rename outside the with statement so that the
    # file is closed; on Linux it doesn't matter but on Windows
    # the rename will fail if the file is not closed first
    with open(filename) as fobj:
        for line in fobj:
            key, value = parse_field(line)
            if key == 'STAR_ID':
                star_id = value
                break

    if star_id is not None:
        os.rename(filename, os.path.join(path, star_id + '.tbl'))
    else:
        print(f'WARNING: STAR_ID key missing from {filename}', file=sys.stderr)
If you are not comfortable with regular expressions (and really, who is?) it would be good to learn the basics as it's an extremely useful tool to have in your belt. However, this format is simple enough that you could get away with using simple string parsing methods like you were doing. Though I would still enhance it a bit to make sure you're actually getting the STAR_ID field. Something like this:
def parse_field(line):
    if '=' not in line:
        return (None, None)
    key, value = [part.strip() for part in line.split('=', 1)]
    if key[0] != '\\':
        return (None, None)
    else:
        key = key[1:]
    if value[0] != '"' or value[-1] != '"':
        # still not a valid line assuming quotes are required
        return (None, None)
    else:
        return (key, value.split('"')[1])
This is similar to what you were doing, but a little more robust (and returns the key as well as the value). But you can see this is more involved than the regular expression version. It's actually more-or-less implementing the exact same logic as the regular expression, but more slowly and verbosely.
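As a quick sanity check (a hypothetical REPL session using the first line of the sample file), both versions of parse_field should agree:

>>> parse_field('\\STAR_ID = "HD 74156"\n')
('STAR_ID', 'HD 74156')
>>> parse_field('this line is not a field\n')
(None, None)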

How to read off of a specific line in a text file using python

Looking to have my code read one text file, store the line number of a user input as num, and then use the variable num to read the same line in another file.
Currently, the code for the first step of reading the first text file is working and has been tested, but the second part doesn't display anything after being executed. I have changed multiple things but am still stuck. Help would be much appreciated.
Here is my code:
print("Check Stock")
ca = input("Check all barcodes?")
if ca == "y":
for x in range(0,5):
with open ("stockbarcodes.txt") as f:
linesa = f.readlines()
print(linesa[x])
with open ("stockname.txt") as f:
linesb = f.readlines()
print(linesb[x])
print(" ")
else:
bc = input("Scan barcode: ")
f1 = open ("stockname.txt")
for num, line in enumerate(f1, 1):
if bc in line:
linesba = f1.readlines()
print(linesba[num])
As user Ikriemer points out, it seems that you want to retrieve the stock name based on the barcode. For that kind of task you would rather create a normalized database, which describes entities, properties and relationships. As you can see here, there are a lot of things to take into account.
This code was tested on macOS, but considering the OP's comment (they seem to be using Windows), it is OK if the dtype is not specified.
Considering that the above solution may not be as quick as you would like, you also have two options.
First option
As I cannot check the content of your example files, the strategy you show in your code makes me believe you're assuming both files are ordered, such that the first line of the barcode file corresponds to the first item in the stock name file. Given that, you can query the index of an element (barcode) in an array-like data structure, and retrieve the element of another array (name) stored at the same position. Code below:
import numpy as np

print("Check Stock")
ca = input("Check all barcodes? (y/n): ")
if ca == "y":
    for x in range(0, 5):
        with open("stockbarcodes.txt") as f:
            linesa = f.readlines()
            print(linesa[x], sep="")
        with open("stockname.txt") as f:
            linesb = f.readlines()
            print(linesb[x], sep="")
        print(" ")
else:
    try:
        codes = np.genfromtxt("stockbarcodes.txt").tolist()
        names = np.genfromtxt("stockname.txt", dtype=str).tolist()
        bc = input("Scan barcode: ")
        index = codes.index(int(bc))
        print(names[index])
    except ValueError:
        # note: list.index raises ValueError (not IndexError) when the item is absent
        print("Bar code {} not found".format(bc))
Second option
This option could be considered a workaround: a database-like file. You need to store your data in some way that lets you search for the values associated with a specific entry. That kind of task can be done with a dictionary. Just replace the else clause with this:
else:
    try:
        codes = np.genfromtxt("stockbarcodes.txt").tolist()
        names = np.genfromtxt("stockname.txt", dtype=str).tolist()
        table = {k: v for k, v in zip(codes, names)}
        bc = input("Scan barcode: ")
        print(table[int(bc)])
    except KeyError:
        print("Bar code {} not found".format(bc))
Again, in the dictionary comprehension we are assuming both files are ordered. I strongly suggest you validate this assumption, to guarantee that the first bar code corresponds to the first stock name, the second to the second, and so on. Only after that, you may want to store the dictionary as a file, so you can load it and query it as you please. Check this answer for that purpose.
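A minimal sketch of that last idea, assuming the table dictionary from above and using json for storage (the file name table.json and the barcode "123456" are just examples):

import json

# Save the barcode -> name table once...
# (JSON keys must be strings, so the barcodes are stored as str)
with open("table.json", "w") as f:
    json.dump({str(k): v for k, v in table.items()}, f)

# ...then load and query it later without re-reading the source files
with open("table.json") as f:
    table = json.load(f)
print(table.get("123456", "Bar code not found"))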

Parsing massive XML files to JSON

I am working on a project that requires me to parse massive XML files to JSON. I have written code, but it is too slow. I have looked at using lxml and BeautifulSoup but am unsure how to proceed.
I have included my code below. It works exactly how it is supposed to, except that it is too slow: it took around 24 hours to go through a sub-100MB file and parse its 100,000 records.
# (mapped_nodes, single_record_dict, final_list and the counters
# are assumed to be defined earlier in the script)
product_data = open('productdata_29.xml', 'r')
read_product_data = product_data.read()

def record_string_to_dict(record_string):
    '''This function takes a single record in string form and iterates through
    it, and sorts it as a dictionary. Only the nodes present in the parent_rss dict
    are appended to the new dict (single_record_dict). After each record,
    single_record_dict is flushed to final_list and is then emptied.'''
    # Iterating through the string to find keys and values to put into
    # single_record_dict.
    while record_string != record_string[::-1]:
        try:
            k = record_string.index('<')
            l = record_string.index('>')
            temp_key = record_string[k + 1:l]
            record_string = record_string[l+1:]
            m = record_string.index('<')
            temp_value = record_string[:m]
            # Cleaning the keys and values of unnecessary characters and symbols.
            if '\n' in temp_value:
                temp_value = temp_value[3:]
            if temp_key[-1] == '/':
                temp_key = temp_key[:-1]
            n = record_string.index('\n')
            record_string = record_string[n+2:]
            # Checking parent_rss dict to see if the key from the record is present. If it is,
            # the key is replaced with keys and added to single_record_dictionary.
            if temp_key in mapped_nodes.keys():
                temp_key = mapped_nodes[temp_key]
                single_record_dict[temp_key] = temp_value
        except Exception:
            break

while len(read_product_data) > 10:
    # Goes through read_product_data to create blocks, each of which is a single
    # record.
    i = read_product_data.index('<record>')
    j = read_product_data.index('</record>') + 8
    single_record_string = read_product_data[i:j]
    single_record_string = single_record_string[9:-10]
    # Runs previous function with the input being the single string found previously.
    record_string_to_dict(single_record_string)
    # Flushes single_record_dict to final_list, and empties the dict for the next
    # record.
    final_list.append(single_record_dict)
    single_record_dict = {}
    # Removes the record that was previously processed.
    read_product_data = read_product_data[j:]
    # For keeping track / ease of use.
    print('Record ' + str(break_counter) + ' has been appended.')
    # Keeps track of the number of records. Once the set value is reached
    # in the if block, it is flushed to a new file.
    break_counter += 1
    flush_counter += 1
    if break_counter == 100 or flush_counter == break_counter:
        record_list = open('record_list_' + str(file_counter) + '.txt', 'w')
        record_list.write(str(final_list))
        # file_counter keeps track of how many files have been created, so the next
        # file has a different int at the end.
        file_counter += 1
        record_list.close()
        # Resets break counter.
        break_counter = 0
        final_list = []
    # For testing purposes. Causes execution to stop once the number of files written
    # matches the integer.
    if file_counter == 2:
        break

print('All records have been appended.')
Any reason why you are not considering packages such as xml2json and xmltodict? See this post for working examples:
How can i convert an xml file into JSON using python?
Relevant code reproduced from above post:
xml2json
import xml2json
s = '''<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>'''
print xml2json.xml2json(s)
xmltodict
import xmltodict, json
o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>')
json.dumps(o) # '{"e": {"a": ["text", "text"]}}'
See this post if working in Python 3:
https://pythonadventures.wordpress.com/2014/12/29/xml-to-dict-xml-to-json/
import json
import xmltodict

def convert(xml_file, xml_attribs=True):
    with open(xml_file, "rb") as f:  # notice the "rb" mode
        d = xmltodict.parse(f, xml_attribs=xml_attribs)
        return json.dumps(d, indent=4)
You definitely don't want to be hand-parsing the XML. As well as the libraries others have mentioned, you could use an XSLT 3.0 processor. To go above 100Mb you would benefit from a streaming processor such as Saxon-EE, but up to that kind of level the open source Saxon-HE should be able to hack it. You haven't shown the source XML or target JSON, so I can't give you specific code - the assumption in XSLT 3.0 is that you probably want a customized transformation rather than an off-the-shelf one, so the general idea is to write template rules that define how different parts of your input XML should be handled.
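Since the question mentions lxml: the usual fix for "massive XML, too slow, too much memory" is streaming with lxml.etree.iterparse, which never holds the whole document in memory. Here is a rough sketch under some assumptions: the records really are <record> elements as in the question's code, mapped_nodes is the tag-to-key dict from that script, and records.json is a made-up output name:

import json
from lxml import etree

final_list = []

# iterparse streams the file and fires an 'end' event for every <record>
for event, elem in etree.iterparse('productdata_29.xml', tag='record'):
    record = {}
    for child in elem:
        # keep only the nodes you care about, as in the original code
        if child.tag in mapped_nodes:
            record[mapped_nodes[child.tag]] = child.text
    final_list.append(record)
    # clear the processed element (and finished siblings) to keep memory flat
    elem.clear()
    while elem.getprevious() is not None:
        del elem.getparent()[0]

with open('records.json', 'w') as f:
    json.dump(final_list, f, indent=4)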

Vigenere Cipher in Python bug

I'm trying to implement Vigenere's Cipher. I want to be able to obfuscate every single character in a file, not just alphabetic characters.
I think I'm missing something with the different types of encoding. I have made some test cases and some characters are getting replaced badly in the final result.
This is one test case:
,.-´`1234678abcde^*{}"¿?!"·$%&/\º
end
And this is the result I'm getting:
).-4`1234678abcde^*{}"??!"7$%&/:
end
As you can see, ',' is being replaced badly with ')' as well as some other characters.
My guess is that the others (for example, '¿' being replaced with '?') come from the original character not being in the range [0, 127], so it's normal that those are changed. But I don't understand why ',' is failing.
My intent is to obfuscate CSV files, so the ',' problem is the one I'm mainly concerned about.
In the code below, I'm using modulus 128, but I'm not sure if that's correct. To execute it, put a file named "OriginalFile.txt" in the same folder with the content to cipher and run the script. Two files will be generated, Ciphered.txt and Deciphered.txt.
"""
Attempt to implement Vigenere cipher in Python.
"""
import os
key = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
fileOriginal = "OriginalFile.txt"
fileCiphered = "Ciphered.txt"
fileDeciphered = "Deciphered.txt"
# CIPHER PHASE
if os.path.isfile(fileCiphered):
os.remove(fileCiphered)
keyToUse = 0
with open(fileOriginal, "r") as original:
with open(fileCiphered, "a") as ciphered:
while True:
c = original.read(1) # read char
if not c:
break
k = key[keyToUse]
protected = chr((ord(c) + ord(k))%128)
ciphered.write(protected)
keyToUse = (keyToUse + 1)%len(key)
print("Cipher successful")
# DECIPHER PHASE
if os.path.isfile(fileDeciphered):
os.remove(fileDeciphered)
keyToUse = 0
with open(fileCiphered, "r") as ciphered:
with open(fileDeciphered, "a") as deciphered:
while True:
c = ciphered.read(1) # read char
if not c:
break
k = key[keyToUse]
unprotected = chr((128 + ord(c) - ord(k))%128) # +128 so that we don't get into negative numbers
deciphered.write(unprotected)
keyToUse = (keyToUse + 1)%len(key)
print("Decipher successful")
Assumption: you're trying to produce a new, valid CSV with the contents of cells enciphered via Vigenere, not to encipher the whole file.
In that case, you should check out the csv module, which will handle properly reading and writing CSV files for you (including cells that contain commas in the value, which might happen after you encipher a cell's contents, as you see). Very briefly, you can do something like:
with open("...", "r") as fpin, open("...", "w") as fpout:
reader = csv.reader(fpin)
writer = csv.writer(fpout)
for row in reader:
# row will be a list of strings, one per column in the row
ciphered = [encipher(cell) for cell in row]
writer.writerow(ciphered)
When using the csv module you should be aware of the notion of "dialects" -- ways that different programs (usually spreadsheet-like things, think Excel) handle CSV data. csv.reader() usually does a fine job of inferring the dialect you have in the input file, but you might need to tell csv.writer() what dialect you want for the output file. You can get the list of built-in dialects with csv.list_dialects() or you can make your own by creating a custom Dialect object.
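For completeness, here is a hypothetical encipher helper for the sketch above, reusing the mod-128 scheme from the question. Restarting the key at the start of every cell is my own assumption; it keeps each cell independently decipherable:

key = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

def encipher(text):
    # Vigenere over the 0-127 range, as in the question's script
    out = []
    for i, c in enumerate(text):
        k = key[i % len(key)]
        out.append(chr((ord(c) + ord(k)) % 128))
    return ''.join(out)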

Parse key value pairs in a text file

I am a newbie with Python and I am searching for a way to parse a .txt file.
My .txt file is a namelist with computation information, like:
myfile.txt
var0 = 16
var1 = 1.12434E10
var2 = -1.923E-3
var3 = 920
How to read the values and put them in myvar0, myvar1, myvar2, myvar3 in python?
I suggest storing the values in a dictionary instead of in separate local variables:
myvars = {}
with open("namelist.txt") as myfile:
    for line in myfile:
        name, var = line.partition("=")[::2]
        myvars[name.strip()] = float(var)
Now access them as myvars["var1"]. If the names are all valid python variable names, you can put this below:
names = type("Names", [object], myvars)
and access the values as e.g. names.var1.
I personally solved this by creating a .py file that just contains all the parameters as variables, then did:

import PARAMETERS

in the program modules that need the parameters.
It's a bit ugly, but VERY simple and easy to work with.
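For instance (PARAMETERS.py is a made-up example here):

# PARAMETERS.py
var0 = 16
var1 = 1.12434E10

# any module that needs the parameters
import PARAMETERS
print(PARAMETERS.var0)  # 16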
Dict comprehensions (PEP 274) can be used for a shorter expression (60 characters):
d = {k:float(v) for k, v in (l.split('=') for l in open(f))}
EDIT: shortened from 72 to 60 characters thanks to @jmb's suggestion (avoid .readlines()).
As @kev suggests, the configparser module is the way to go.
However, in some scenarios a (admittedly a bit ugly) but very simple and effective way to do this is to rename myfile.txt to myfile.py and do a from myfile import * (after you fix the typo var 0 -> var0).
However, this is very insecure, so if the file is from an external source or can be written by a malicious attacker, use something that validates the data instead of executing it blindly.
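For reference, a minimal sketch of the configparser route; note that configparser requires a section header, so this assumes you add a line like [params] at the top of myfile.txt:

import configparser

config = configparser.ConfigParser()
config.read("myfile.txt")  # file must start with a [params] section header

myvar0 = config.getint("params", "var0")    # 16
myvar1 = config.getfloat("params", "var1")  # 11243400000.0
myvar2 = config.getfloat("params", "var2")  # -0.001923
myvar3 = config.getint("params", "var3")    # 920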
If there are multiple comma-separated values on a single line, here's code to parse that out:
res = {}
pairs = args.split(", ")
for p in pairs:
    var, val = p.split("=")
    res[var] = val
Use pandas.read_csv when the file format becomes fancier (for example, with comments).

import pandas
from io import StringIO

val = u'''var0 = 16
var1 = 1.12434E10
var2 = -1.923E-3
var3 = 920'''

print(pandas.read_csv(StringIO(val),  # or read_csv('myfile.txt',
                      delimiter=r'\s*=\s*',
                      header=None,
                      names=['key', 'value'],
                      dtype=dict(key=object, value=object),  # or float
                      index_col=['key'],
                      engine='python').to_dict()['value'])
# prints {u'var1': u'1.12434E10', u'var0': u'16', u'var3': u'920', u'var2': u'-1.923E-3'}
Similar to @lauritz-v-thaulow's answer, but with a simple line-by-line read into a variable.
Here is a simple copy-paste example so you can understand it a bit more, as the config file has to follow a specific format.
import os

# Example: creating a valid temp test file to get a better result.
MY_CONFIG = os.path.expanduser('~/.test_credentials')
with open(MY_CONFIG, "w") as f:
    f.write("API_SECRET_KEY=123456789")
    f.write(os.linesep)
    f.write("API_SECRET_CONTENT=12345678")

myvars = {}
with open(MY_CONFIG, "r") as myfile:
    for line in myfile:
        line = line.strip()
        name, var = line.partition("=")[::2]
        myvars[name.strip()] = str(var)

# Iterate through all the items created.
for k, v in myvars.items():
    print("{} | {}".format(k, v))
# API_SECRET_KEY | 123456789
# API_SECRET_CONTENT | 12345678

# Access the API_SECRET_KEY item directly
print("{}".format(myvars['API_SECRET_KEY']))
# 123456789

# Access the API_SECRET_CONTENT item directly
print("{}".format(myvars['API_SECRET_CONTENT']))
# 12345678
