I am a newbie with Python and I'm looking for a way to parse a .txt file.
My .txt file is a namelist with computation information, like:
myfile.txt
var0 = 16
var1 = 1.12434E10
var2 = -1.923E-3
var3 = 920
How do I read the values and put them in myvar0, myvar1, myvar2, myvar3 in Python?
I suggest storing the values in a dictionary instead of in separate local variables:
myvars = {}
with open("namelist.txt") as myfile:
    for line in myfile:
        name, var = line.partition("=")[::2]
        myvars[name.strip()] = float(var)
Now access them as myvars["var1"]. If the names are all valid Python identifiers, you can put this below (note that the bases argument to type() must be a tuple):
names = type("Names", (object,), myvars)
and access the values as e.g. names.var1.
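A quick sketch of what that looks like (the values here are made up to match the question):

myvars = {"var0": 16.0, "var1": 11243400000.0}  # as parsed above
names = type("Names", (object,), myvars)
print(names.var1)  # 11243400000.0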
I personally solved this by creating a .py file that just contains all the parameters as variables, then did:
import PARAMETERS
in the program modules that need the parameters.
It's a bit ugly, but VERY simple and easy to work with.
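A minimal sketch of that approach (the file and module names here are placeholders):

# PARAMETERS.py -- plain Python assignments, one per parameter
var0 = 16
var1 = 1.12434E10

# any module that needs the parameters
import PARAMETERS
print(PARAMETERS.var1)  # 11243400000.0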
Dict comprehensions (PEP 274) can be used for a shorter expression (60 characters):
d = {k:float(v) for k, v in (l.split('=') for l in open(f))}
(Note that the keys keep surrounding whitespace from the input; use k.strip() if that matters.)
EDIT: shortened from 72 to 60 characters thanks to @jmb's suggestion (avoid .readlines()).
As @kev suggests, the configparser module is the way to go.
However, in some scenarios a very simple and effective (if admittedly a bit ugly) way to do this is to rename myfile.txt to myfile.py and do a from myfile import *.
Be warned, though: this is very insecure, so if the file is from an external source or can be written by a malicious attacker, use something that validates the data instead of executing it blindly.
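For completeness, here is roughly what the configparser route might look like (a sketch: configparser expects a section header, which this file lacks, so a dummy one is prepended):

import configparser

parser = configparser.ConfigParser()
with open("myfile.txt") as f:
    # "[params]" is an arbitrary dummy section name added for configparser's sake
    parser.read_string("[params]\n" + f.read())

myvar0 = parser.getfloat("params", "var0")  # 16.0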
If there are multiple comma-separated values on a single line, here's code to parse that out:
res = {}
pairs = args.split(", ")
for p in pairs:
    var, val = p.split("=")
    res[var] = val
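For example, with a hypothetical input string:

args = "var0=16, var1=1.12434E10"  # hypothetical input
# after the loop above, res == {'var0': '16', 'var1': '1.12434E10'}
# the values stay strings; call float() on them if you need numbers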
Use pandas.read_csv when the file format becomes fancier (e.g. when it has comments):
import pandas
from io import StringIO

val = u'''var0 = 16
var1 = 1.12434E10
var2 = -1.923E-3
var3 = 920'''

print(pandas.read_csv(StringIO(val),  # or read_csv('myfile.txt', ...
                      delimiter=r'\s*=\s*',
                      engine='python',  # regex delimiters need the python engine
                      header=None,
                      names=['key', 'value'],
                      dtype=dict(key=object, value=object),  # or float for the values
                      index_col=['key']).to_dict()['value'])
# prints {'var1': '1.12434E10', 'var0': '16', 'var3': '920', 'var2': '-1.923E-3'}
Similar to @lauritz-v-thaulow's answer, but reading line by line into a variable.
Here is a simple copy-paste example so you can see how it works; note that the config file has to follow a specific format.
import os

# Example: create a valid temp test file for a reproducible result.
MY_CONFIG = os.path.expanduser('~/.test_credentials')
with open(MY_CONFIG, "w") as f:
    f.write("API_SECRET_KEY=123456789")
    f.write(os.linesep)
    f.write("API_SECRET_CONTENT=12345678")

myvars = {}
with open(MY_CONFIG, "r") as myfile:
    for line in myfile:
        line = line.strip()
        name, var = line.partition("=")[::2]
        myvars[name.strip()] = str(var)

# Iterate through all the items created.
for k, v in myvars.items():
    print("{} | {}".format(k, v))
# API_SECRET_KEY | 123456789
# API_SECRET_CONTENT | 12345678

# Access the API_SECRET_KEY item directly.
print("{}".format(myvars['API_SECRET_KEY']))
# 123456789

# Access the API_SECRET_CONTENT item directly.
print("{}".format(myvars['API_SECRET_CONTENT']))
# 12345678
I have a very big TSV file: 1.5 GB. I want to parse this file. I'm using the following function:
def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = {}
    for row in eval:
        ids = row.split("\t")
        if ids[0] not in evalIDs.keys():
            evalIDs[ids[0]] = []
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
It has been running for more than 10 hours and it is still going. I don't know how to speed this step up, or whether there is another method to parse such a file.
Several issues here:
testing for keys with if ids[0] not in evalIDs.keys() takes forever in Python 2, because keys() is a list. .keys() is rarely useful anyway. A better way already is if ids[0] not in evalIDs, but...
why not use a collections.defaultdict instead?
why not use the csv module?
overriding the eval built-in (well, not really an issue seeing how dangerous it is)
My proposal:
import csv, collections

def readEvalFileAsDictInverse(evalFile):
    with open(evalFile, "r") as handle:
        evalIDs = collections.defaultdict(list)
        cr = csv.reader(handle, delimiter='\t')
        for ids in cr:
            evalIDs[ids[0]].append(ids[1])
        return evalIDs
The magic of evalIDs[ids[0]].append(ids[1]) is that the defaultdict creates the list if it doesn't already exist. It's also portable and very fast whatever the Python version, and it saves an if.
I don't think it could be faster with default libraries, but a pandas solution probably would be.
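A sketch of what that pandas version might look like (untested against your data; it assumes only the first two tab-separated columns matter):

import pandas as pd

def readEvalFileAsDictInverse(evalFile):
    # the C engine parses the tab-separated columns in one pass
    df = pd.read_csv(evalFile, sep="\t", header=None, usecols=[0, 1], dtype=str)
    # group column 1 by column 0 and collect each group as a list
    return df.groupby(0)[1].apply(list).to_dict()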
Some suggestions:
Use a defaultdict(list) instead of creating inner lists yourself or using dict.setdefault().
dict.setdefault() will create the default value every time, and that's a time burner; defaultdict(list) does not, it is optimized:
from collections import defaultdict

def readEvalFileAsDictInverse(evalFile):
    eval = open(evalFile, "r")
    evalIDs = defaultdict(list)
    for row in eval:
        ids = row.split("\t")
        evalIDs[ids[0]].append(ids[1])
    eval.close()
    return evalIDs
If your keys are valid file names, you might want to investigate awk for much more performance than doing this in Python.
Something along the lines of
awk -F $'\t' '{print > $1}' file1
will create your split files much faster, and you can simply use the latter part of the following code to read from each file (assuming your keys are valid filenames) to construct your lists. (Attribution: here.) You would need to grab the created files with os.walk or similar means. Each line inside the files will still be tab-separated and will contain the ID in front.
If your keys are not valid filenames in their own right, consider storing your different lines in different files and only keeping a dictionary of (key, filename) around.
After splitting the data, load the files as lists again:
Create a test file:
with open ("file.txt","w") as w:
w.write("""
1\ttata\ti
2\tyipp\ti
3\turks\ti
1\tTTtata\ti
2\tYYyipp\ti
3\tUUurks\ti
1\ttttttttata\ti
2\tyyyyyyyipp\ti
3\tuuuuuuurks\ti
""")
Code:
# f.e. https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename
def make_filename(k):
    """In case your keys contain non-filename characters, make it a valid name."""
    return k  # assuming k is a valid file name, else modify it

evalFile = "file.txt"
files = {}

with open(evalFile, "r") as eval_file:
    for line in eval_file:
        if not line.strip():
            continue
        key, value, *rest = line.split("\t")  # omit *rest if you only have 2 values
        fn = files.setdefault(key, make_filename(key))
        # This will open and close files _a lot_; you might want to keep file
        # handles in your dict instead. That depends on the key/data/lines ratio
        # in your data: if you have few keys, file handles ought to be better;
        # if you have many, it does not matter.
        with open(fn, "a") as f:
            f.write(value + "\n")
# Create your list data from your files:
data = {}
for key, fn in files.items():
    with open(fn) as r:
        data[key] = [x.strip() for x in r]

print(data)
Output:
# for my data: loaded from files called '1', '2' and '3'
{'1': ['tata', 'TTtata', 'tttttttata'],
'2': ['yipp', 'YYyipp', 'yyyyyyyipp'],
'3': ['urks', 'UUurks', 'uuuuuuurks']}
Change evalIDs to a collections.defaultdict(list). You can then avoid the if that checks whether a key is already there.
Consider splitting the file externally using split(1), or even inside Python using a read offset, and then use a multiprocessing.Pool to parallelise the loading.
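A rough sketch of that idea, assuming the file has already been split into chunks (the chunk names below are hypothetical, e.g. from split -l):

import collections
import multiprocessing

def parse_chunk(path):
    ids = collections.defaultdict(list)
    with open(path) as f:
        for row in f:
            parts = row.rstrip("\n").split("\t")
            ids[parts[0]].append(parts[1])
    return ids

if __name__ == "__main__":
    chunks = ["chunk_aa", "chunk_ab", "chunk_ac"]  # hypothetical split(1) output
    merged = collections.defaultdict(list)
    with multiprocessing.Pool() as pool:
        # parse the chunks in parallel, then merge the partial dicts
        for partial in pool.map(parse_chunk, chunks):
            for key, values in partial.items():
                merged[key].extend(values)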
Maybe you can make it somewhat faster: change this
if ids[0] not in evalIDs.keys():
    evalIDs[ids[0]] = []
evalIDs[ids[0]].append(ids[1])
to
evalIDs.setdefault(ids[0],[]).append(ids[1])
The first version searches the "evalIDs" dictionary three times.
I am using the KEGG API to download genomic data and write it to a file. There are 26 files in total and some of them contain the dictionary 'COMPOUND'. I would like to assign these to compData and write them to the output file. I tried writing it as an if True statement, but this does not work.
# Read in hsa links
hsa = []
with open('/users/skylake/desktop/pathway-HSAs.txt', 'r') as file:
    for line in file:
        line = line.strip()
        hsa.append(line)

# Import KEGG API Bioservices | Create KEGG Variable
from bioservices.kegg import KEGG
k = KEGG()

# Data Parsing | Writing to File
# for i in range(len(hsa)):
data = k.get(hsa[2])
dict_data = k.parse(data)
if dict_data['COMPOUND'] == True:
    compData = str(dict_data['COMPOUND'])
    nameData = str(dict_data['NAME'])
    geneData = str(dict_data['GENE'])
    f = open('/Users/Skylake/Desktop/pathway-info/' + nameData + '.txt', 'w')
    f.write("Genes\n")
    f.write(geneData)
    f.write("\nCompounds\n")
    f.write(compData)
    f.close()
I guess that by
if dict_data['COMPOUND'] == True:
you are testing (wrongly) for the existence of the string key 'COMPOUND' in dict_data. In that case what you want is
if 'COMPOUND' in dict_data:
Furthermore, note that the assignment to compData won't happen if the key is not present, which will raise an error later when you try to write its value. This means you should always define it, whatever happens, e.g. via
compData = str(dict_data.get('COMPOUND', 'undefined'))
The above line of code means that if the key exists its value is used, and if it does not exist, 'undefined' is used instead. Note that you can choose whatever fallback value you want, or even give none at all, which results in None by default.
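For example (a tiny sketch with made-up data):

dict_data = {'NAME': 'Glycolysis'}  # hypothetical parse result without 'COMPOUND'
dict_data.get('COMPOUND', 'undefined')  # -> 'undefined'
dict_data.get('COMPOUND')               # -> None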
I have a file with 2 columns:
Anzegem Anzegem
Gijzelbrechtegem Anzegem
Ingooigem Anzegem
Aalst Sint-Truiden
Aalter Aalter
The first column is a town and the second column is the district of that town.
I made a dictionary of that file like this:
def readTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict = {}
    verzameling = set()
    for line in file:
        tmp = line.split()
        dict[tmp[0]] = tmp[1]
    return dict
If I set a variable writeTowns equal to readTowns(text) and do writeTowns['Anzegem'], I want to get a collection of {'Anzegem', 'Gijzelbrechtegem', 'Ingooigem'}.
Does anybody know how to do this?
I think you can just create another function that builds the appropriate data structure for what you need, because otherwise you will end up writing code that manipulates the dictionary returned by readTowns anyway. Why not keep the code clean and create another function for that? Just create a name-to-list dictionary and you are all set.
def writeTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict = {}
    for line in file:
        tmp = line.split()
        dict[tmp[1]] = dict.get(tmp[1]) or []
        dict.get(tmp[1]).append(tmp[0])
    return dict

writeTown = writeTowns('file.txt')
print(writeTown['Anzegem'])
And if you are concerned about reading the same file twice, you can do something like this as well,
def readTowns(text):
    input = open(text, 'r')
    file = input.readlines()
    dict2town = {}
    town2dict = {}
    for line in file:
        tmp = line.split()
        dict2town[tmp[0]] = tmp[1]
        town2dict[tmp[1]] = town2dict.get(tmp[1]) or []
        town2dict.get(tmp[1]).append(tmp[0])
    return dict2town, town2dict

dict2town, town2dict = readTowns('file.txt')
print(town2dict['Anzegem'])
You could do something like this, although please have a look at @ubadub's answer; there are better ways to organise your data.
[town for town, region in dic.items() if region == 'Anzegem']
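For example, with the {town: district} dictionary built by readTowns from the question:

dic = readTowns('file.txt')  # the {town: district} mapping from the question
print([town for town, region in dic.items() if region == 'Anzegem'])
# ['Anzegem', 'Gijzelbrechtegem', 'Ingooigem']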
It sounds like you want to make a dictionary where the keys are the districts and the values are a list of towns.
A basic way to do this is:
def readTowns(text):
    with open(text, 'r') as f:
        file = f.readlines()
    my_dict = {}
    for line in file:
        tmp = line.split()
        if tmp[1] in my_dict:
            my_dict[tmp[1]].append(tmp[0])
        else:
            my_dict[tmp[1]] = [tmp[0]]
    return my_dict
The if/else blocks can also be achieved using Python's defaultdict subclass (docs here), but I've used the if/else statements here for readability; a defaultdict version is sketched below.
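For reference, a defaultdict version of the same function might look like this (a sketch under the same two-column assumption):

from collections import defaultdict

def readTowns(text):
    my_dict = defaultdict(list)
    with open(text, 'r') as f:
        for line in f:
            tmp = line.split()
            my_dict[tmp[1]].append(tmp[0])  # district -> list of towns
    return my_dict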
Also some other points: dict and file are names of built-in Python types, so it is bad practice to shadow them with your own local variables (notice I've changed dict to my_dict in the code above).
If you build your dictionary as {town: district}, so the town is the key and the district is the value, you can't do this easily*, because a dictionary is not meant to be used in that way. Dictionaries let you easily find the values associated with a given key. So if you want to find all the towns in a district, you are better off building your dictionary as:
{district: [list_of_towns]}
So for example the district Anzegem would appear as {'Anzegem': ['Anzegem', 'Gijzelbrechtegem', 'Ingooigem']}
And of course the value is your collection.
*you could probably do it by iterating through the entire dict and checking where your matches occur, but this isn't very efficient.
I've got an old Informix database that was written for COBOL. All the fields are in code, so my SQL queries look like:
SELECT uu00012 FROM uu0001;
This is pretty hard to read.
I have a text file with the field definitions like
uu00012 client
uu00013 date
uu00014 f_name
uu00015 l_name
I would like to swap out the codes for the more readable English names; run a Python script on it, maybe, and have a file with the English names saved.
What's the best way to do this?
If each piece is definitely a separate word, re.sub is definitely the way to go here:
import re

# Create a mapping of old vars to new vars.
with open('definitions') as f:
    d = dict([x.split() for x in f])

def my_replace(match):
    # If the match is in the dictionary, replace it; otherwise return it unchanged.
    return d.get(match.group(), match.group())

with open('inquiry') as f:
    for line in f:
        print(re.sub(r'\w+', my_replace, line))
Conceptually, I would probably first build a mapping of codings -> English names (in memory or on disk).
Then, for each coding in the map, scan your file and replace the code with its mapped English equivalent.
infile = open('filename.txt', 'r')
namelist = []
for each in infile.readlines():
    namelist.append((each.split(' ')[0], each.split(' ')[1]))
This will give you a list of (key, value) pairs; dict(namelist) would turn it into a dictionary.
I don't know what you want to do with the results from there, though; you'd need to be more explicit.
dictionary = '''uu00012 client
uu00013 date
uu00014 f_name
uu00015 l_name'''

dictionary = dict(map(lambda x: (x[1], x[0]), [x.split() for x in dictionary.split('\n')]))

def process_sql(sql, d):
    for k, v in d.items():
        sql = sql.replace(k, v)
    return sql

sql = process_sql('SELECT f_name FROM client;', dictionary)
This builds the dictionary:
{'date': 'uu00013', 'l_name': 'uu00015', 'f_name': 'uu00014', 'client': 'uu00012'}
and then runs through your SQL, replacing the human-readable names with the coded ones. The result is:
SELECT uu00014 FROM uu00012;
import re

f = open("dictfile.txt")
d = {}
for mapping in f.readlines():
    l, r = mapping.split(" ")
    d[re.compile(l)] = r.strip("\n")

sql = open("orig.sql")
out = open("translated.sql", "w")
for line in sql.readlines():
    for r in d.keys():
        line = r.sub(d[r], line)
    out.write(line)
Going to re-word the question.
Basically I'm wondering what is the easiest way to manipulate a string formatted like this:
Safety/Report/Image/489
or
Safety/Report/Image/490
And sectioning off each word separated by a slash (/), storing each section (token) into a store so I can call it later. (I'm reading in about 1200 cells from a CSV file.)
The answer to your question:
>>> mystring = "Safety/Report/Image/489"
>>> mystore = mystring.split('/')
>>> mystore
['Safety', 'Report', 'Image', '489']
>>> mystore[2]
'Image'
>>>
If you want to store data from more than one string, then you have several options depending on how you want to organize it. For example:
liststring = ["Safety/Report/Image/489",
              "Safety/Report/Image/490",
              "Safety/Report/Image/491"]

dictstore = {}
for line, string in enumerate(liststring):
    dictstore[line] = string.split('/')

print(dictstore[1][3])
print(dictstore[2][3])
prints:
490
491
In this case you can use a dictionary or a list (a list of lists) for storage in the same way. If each string has a special identifier (one better than the line number), then the dictionary is the option to choose.
I don't quite understand your code and don't have too much time to study it, but I thought that the following might be helpful, at least if order isn't important ...
in_strings = ['Safety/Report/Image/489',
              'Safety/Report/Image/490',
              'Other/Misc/Text/500']

out_dict = {}
for in_str in in_strings:
    level1, level2, level3, level4 = in_str.split('/')
    out_dict.setdefault(level1, {}).setdefault(
        level2, {}).setdefault(
            level3, []).append(level4)

print(out_dict)
{'Other': {'Misc': {'Text': ['500']}}, 'Safety': {'Report': {'Image': ['489', '490']}}}
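Accessing the nested structure then looks like:

print(out_dict['Safety']['Report']['Image'])
# ['489', '490']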
If your CSV is line separated:
# do something to load the csv
split_lines = [x.strip() for x in csv_data.split('\n')]
for line_data in split_lines:
    split_parts = [x.strip() for x in line_data.split('/')]
    # do something with the individual part data,
    # such as some_variable = split_parts[1], etc.
    # If using indexes, be sure to catch IndexError in case you
    # try to go to index 3 of something with only 2 parts.
Check out the Python csv module for some importing help (I'm not too familiar with it).
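A minimal sketch with the csv module (the file name and column layout are assumptions):

import csv

with open('cells.csv', newline='') as f:  # hypothetical file name
    for row in csv.reader(f):
        for cell in row:
            parts = cell.strip().split('/')
            # parts is e.g. ['Safety', 'Report', 'Image', '489']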