tensorflow.dataset.shuffle = tensorflow.dataset.prefetch + then shuffle internally? - python

According to the documentation of tf.data.Dataset.shuffle, it fills a buffer of size k and then shuffles inside it. I don't want the order of the data to be changed, though; I just want it to be buffered. Then I found tf.data.Dataset.prefetch, which says "This allows later elements to be prepared while the current element is being processed."
From the description I guess prefetch is what I want (i.e. pre-loading the data while the previous data are being used in training), but while trying to look into the code of tf.data.Dataset.shuffle to see whether it actually calls prefetch, I got stuck in these lines (pasted below) and cannot find where shuffle_dataset_v3 is defined.
variant_tensor = gen_dataset_ops.shuffle_dataset_v3(
    input_dataset._variant_tensor,  # pylint: disable=protected-access
    buffer_size=self._buffer_size,
    seed=self._seed,
    seed2=self._seed2,
    seed_generator=gen_dataset_ops.dummy_seed_generator(),
    reshuffle_each_iteration=self._reshuffle_each_iteration,
    **self._flat_structure)
My main question is whether prefetch is the replacement for shuffle in terms of buffering the data, and it would also be nice if someone could point me to where shuffle_dataset_v3 is implemented.

Yes, prefetch is what you want for buffering data.
gen_dataset_ops and the other gen_xxx_ops modules are not included in the source repository because they are automatically generated by Bazel to wrap the C++ implementations for use in Python. You should be able to find these gen_xxx_ops files in your local installation, for example ${PYTHON_ROOT}/site-packages/tensorflow/python/ops/gen_dataset_ops.py
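To illustrate, here is a minimal sketch of prefetch used on its own; the toy dataset and the map step are placeholders, and tf.data.AUTOTUNE assumes a reasonably recent TF 2.x (older versions expose it as tf.data.experimental.AUTOTUNE):
import tensorflow as tf

# prefetch keeps the element order intact; it only overlaps producing the
# next elements with consuming the current one.
dataset = (
    tf.data.Dataset.range(10)
    .map(lambda x: x * 2)        # stand-in for real preprocessing
    .prefetch(tf.data.AUTOTUNE)  # let tf.data choose the buffer size
)

for element in dataset:
    print(element.numpy())       # 0, 2, 4, ... in the original order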

Related

Extract list of variables with start attribute from Modelica model

Is there an easy way to extract a list of all variables with a start attribute from a Modelica model? The ultimate goal is to run a simulation until it reaches steady state, then run a Python script that compares the values of the start attributes against the steady-state values, so that I can identify start values that were chosen badly.
In the Dymola Python interface I could not find such functionality. Another approach could be to generate the modelDescription.xml and parse it; I assume the information is available somewhere in there, but for that approach I also feel I need help getting started.
Similar to this answer, you can extract that info easily from the modelDescription.xml inside an FMU with FMPy.
Here is a small runnable example:
from fmpy import read_model_description
from fmpy.util import download_test_file
from pprint import pprint

# download an example FMU and read its modelDescription.xml
fmu_filename = 'CoupledClutches.fmu'
download_test_file('2.0', 'CoSimulation', 'MapleSim', '2016.2', 'CoupledClutches', fmu_filename)
model_description = read_model_description(fmu_filename)

# keep only the local variables that define a start value
start_vars = [v for v in model_description.modelVariables if v.start and v.causality == 'local']
pprint(start_vars)
The files dsin.txt and dsfinal.txt might help you with this. They have the same structure, with values at the start and at the end of the simulation; by renaming dsfinal.txt to dsin.txt you can start your simulation from the (e.g. steady-state) values you computed in a previous run.
It might be worthwhile working with these two files if you already plan to use such values for running other simulations.
They also give you information about solver/simulation settings that you won't find in the .mat result files (if that is of any interest for your case).
However, if it is only a comparison between the start and final values of variables that are present in the result files anyway, a better choice might be to use Python and a library that reads the result.mat file (DyMat, ModelicaRes, etc.). It is then a matter of comparing the start and end values of the signals of interest.
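For example, a minimal sketch of that comparison with DyMat; the result file name and the tolerance are just illustrative:
import DyMat

res = DyMat.DyMatFile('result.mat')   # Dymola result file

for name in res.names():
    values = res.data(name)           # trajectory of this variable
    start, final = values[0], values[-1]
    # flag variables whose start value is far from the steady-state value
    if abs(final - start) > 1e-3 * max(abs(final), 1e-12):
        print(f"{name}: start={start:g}, final={final:g}")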
After some trial and error, I came up with this python code snippet to get that information from modelDescription.xml:
import xml.etree.ElementTree as ET

root = ET.parse('modelDescription.xml').getroot()
for ScalarVariable in root.findall('ModelVariables/ScalarVariable'):
    # the start value sits on the type element (Real, Integer, ...) below ScalarVariable
    varStart = ScalarVariable.find('*[@start]')
    if varStart is not None:
        name = ScalarVariable.get('name')
        value = varStart.get('start')
        print(f"{name} = {value};")
To generate the modelDescription.xml file, run Dymola translation with the flag
Advanced.FMI.GenerateModelDescriptionInterface2 = true;
The Python standard library has several modules for processing XML:
https://docs.python.org/3/library/xml.html
This snippet uses ElementTree.
This is just a first step; I am not sure whether I missed something basic.

Define context variables in behave python

Sometimes you need to define values dynamically (like the current datetime, random strings, random integers, file contents, etc.) and use them across different steps without being explicit or hard-coding the value.
So my question is: how can I define variables inside steps (the correct way to do it) and then use these variables in the following steps?
Some example
Given A random string of length "100" as "my_text"
And I log in to my platform
And I ask to add the following post:
| title | description |
| Some example of title | {{my_text}} |
When I submit the post form
Then The posts table shows these posts:
| title | description |
| Some example of title | {{my_text}} |
And I delete any post containing in the description "{{my_text}}"
This is a basic example trying to explain why I would like to define variables in steps and save them in the context, to use them in the following steps.
My idea was to modify the before_step and after_step hooks to set a variable on the context that stores my custom variables, like this:
def before_step(context, step):
    if not hasattr(context, 'vars'):
        context.vars = {}
    if hasattr(context, 'table') and context.table:
        parse_table(context)

def parse_table(context):
    # Use a regex to check each cell for "{{<identifier>}}" and, on a match,
    # replace the cell value with context.vars[identifier]. The step
    # "The posts table shows these posts" then never sees "{{my_text}}",
    # only the substituted random string.
    pass
Scenario Outlines do something like this, defining variables like "<some_identifier>" and then, for each example row, replacing the value in the step.
It's basically reproducing that behaviour, but for any kind of step, simple or with tables.
Is this the right way to do something like this?
From Behave docs on the context:
When behave launches into a new feature or scenario it adds a new layer to the context, allowing the new activity level to add new values, or overwrite ones previously defined, for the duration of that activity. These can be thought of as scopes:
@given('I request a new widget for an account via SOAP')
def step_impl(context):
    client = Client("http://127.0.0.1:8000/soap/")
    # method client.Allocate(...) returns a dict
    context.response = client.Allocate(customer_first='Firstname',
                                       customer_last='Lastname', colour='red')
    # context vars can be set more directly
    context.new_var = "My new variable!"

@then('I should receive an OK SOAP response')
def step_impl(context):
    eq_(context.response['ok'], 1)
    cnv = str(context.new_var)
    print(f"This is my new variable: '{cnv}'")
So the value can be set using dot notation and retrieved the same way.
To answer this question, one needs to consider:
Does the test data need to be controlled externally? For example, test data could be supplied from the command line so that the value can be chosen explicitly.
If the answer is no, then we probably should not bother hard-coding anything in the feature file. We can generate the data in one step, save it in the context, and access it again in any following step.
The example I can think of is exactly like the one described in the question. Do we care what random text content we generated, posted and verified? Probably not. Then we should not expose such detail to the user (i.e. the feature file), since it is not important to the behaviour we are testing.
If the answer is yes, we do need a bit of a hack to make it happen. I am experiencing a case like this: I want to change the test data when I run the test, so I don't have to hard-code it in the feature files in a table or a scenario outline. How can I do this?
I can use the -D option on the command line to pass in as many user data values as I like, which can then be accessed through the context.config.userdata dictionary in any step. If the amount of test data is very limited, this approach is an easy way to go. But if the test data set contains many values that nobody wants to type one by one on the command line, it can be stored externally, for example in an ini file with section names like testdata_1 ... testdata_n; a single string can then be passed in from the command line to select the section in that config file. The test data can be read in before_all, before_scenario, etc., and used in all steps.
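A minimal sketch of that -D approach; the key name testdata, the ini file name and its layout are just examples:
# run with:  behave -D testdata=testdata_1
# environment.py
import configparser

def before_all(context):
    # pick the section named on the command line, with a fallback default
    section = context.config.userdata.get('testdata', 'testdata_1')
    cfg = configparser.ConfigParser()
    cfg.read('testdata.ini')
    # expose the chosen test data to every step via the context
    context.testdata = dict(cfg[section])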
In my experience, you cannot create a dynamic value in a feature file.
For example, this step:
Given A random string of length "100" as "my_text"
I don't see any way to change {my_text} each time you run the scenario (I don't count using behave -D to pass the value into context.config.userdata; I think that is the wrong approach here too).
Even with a Scenario Outline, which actually splits into many scenarios, each scenario will have a different value, but the value of {my_text} is already defined in the Examples table for each scenario.
The way to make a step dynamic is in the step definition (code layer).
You can generate a random string in the step definition @given('A random string of length "100" as "{my_text}"')
and use context.my_text to store the created value and pass it around.
I also agree with Murphy Meng that you don't need to expose the generated random string explicitly in the feature file. You know which step will use that value; simply use context.my_text in that step to get it. That's it.
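A minimal sketch of such a step definition; the step wording matches the example above, and the random helper is just illustrative:
import random
import string

from behave import given

@given('A random string of length "{length:d}" as "{var_name}"')
def step_random_string(context, length, var_name):
    # generate the value once and stash it on the context under the given name
    value = ''.join(random.choices(string.ascii_letters, k=length))
    setattr(context, var_name, value)   # later steps read context.my_text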

How to write back to a PDB file after doing Superimposer for atoms of a protein in PDB.BIO python

I read and extracted information about atoms from a PDB file and used Superimposer() to align a mutant to the wild type. How can I write the aligned atom coordinates back to a PDB file? I tried to use the PDBIO class, but it doesn't work since it doesn't accept a list as input. Does anyone have an idea how to do it?
mutantAtoms = []
mutantStructure = PDBParser().get_structure("name", pdbFile)
mutantChain = mutantStructure[0]["B"]

# Extract the atoms
for residue in mutantChain:
    for atom in residue:
        mutantAtoms.append(atom)

# Do the alignment (wildtypeAtoms is built the same way from the wild-type structure)
si = Superimposer()
si.set_atoms(wildtypeAtoms, mutantAtoms)
si.apply(mutantAtoms)
Now mutantAtoms holds the atoms aligned to the wild-type atoms. I need to write this information to a PDB file. My question is how to convert a list of aligned atoms into a structure and use PDBIO() or some other way to write it to a PDB file.
As I see in an example from the PDBIO documentation in the Biopython docs:
p = PDBParser()
s = p.get_structure("1fat", "1fat.pdb")
io = PDBIO()
io.set_structure(s)
io.save("out.pdb")
It seems the PDBIO module needs an object of class Structure to work with, which is in principle what I understand Superimposer works on. When you say it does not accept a list, do you mean you have a list of structures? In that case you could simply do it by iterating through the structures, as in:
for i, s in enumerate(my_results_list):
    io.set_structure(s)
    io.save(f"out_{i}.pdb")   # one file per structure, so they don't overwrite each other
If what you have is a list of atoms, I guess you could create a Structure object from it and then pass that to PDBIO.
However, it is difficult to tell more without knowing more about your problem. You could add to your question the lines of code where you run into the problem.
Edit: Now I have understood better what you want to do. I have seen some information about the Structure class in the interesting Biopython Structural Bioinformatics FAQ; the class is apparently a little complex. At first sight I do not see a very easy way to create Structure objects from scratch, but what you could do is take the structure you get from PDBParser, substitute its atoms with the result you get from Superimposer, and then write the .pdb file using that same modified structure. So you could try to put your mutantAtoms list back into the mutantStructure object you already have.
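A minimal sketch of that idea; since Superimposer.apply moves atoms in place, applying it to the atoms of the parsed mutant structure lets PDBIO write the aligned coordinates directly (file names and the chain ID are just illustrative):
from Bio.PDB import PDBParser, PDBIO, Superimposer

parser = PDBParser(QUIET=True)
wildtypeStructure = parser.get_structure("wildtype", "wildtype.pdb")
mutantStructure = parser.get_structure("mutant", "mutant.pdb")

# collect matching atom lists (here: all atoms of chain B, as in the question)
wildtypeAtoms = [atom for residue in wildtypeStructure[0]["B"] for atom in residue]
mutantAtoms = [atom for residue in mutantStructure[0]["B"] for atom in residue]

si = Superimposer()
si.set_atoms(wildtypeAtoms, mutantAtoms)   # fixed atoms first, moving atoms second
si.apply(mutantStructure.get_atoms())      # transform the whole structure in place

io = PDBIO()
io.set_structure(mutantStructure)          # the structure now carries the aligned coordinates
io.save("mutant_aligned.pdb")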

updating a feature class from a feature layer in arcpy

I'm writing a Python script to find errors in attribute codes in a feature class. In order to find some of these errors I need to use the Select By Location tool. But the Select By Location tool only takes layers as inputs, so I have to create a layer from the feature class. So if I update the error code field in the layer, how do I then populate the error code field in the original feature class?
Update
One can use the UpdateCursor from the arcpy data access module (arcpy.da), which is newer and faster than the original form of the UpdateCursor I initially described.
error_code = -1
with arcpy.da.UpdateCursor('lulcTV', ['error_field', 'VALUE']) as coverCSR:
    for tree in coverCSR:
        species = tree[1]         # returns 'VALUE'. Not really needed, but good to know about
        tree[0] = error_code      # sets the first requested field, "error_field"
        coverCSR.updateRow(tree)
Original answer
Seems like you could use an UpdateCursor. Example:
coverCSR = arcpy.UpdateCursor('lulcTV')
error_code = -1
for tree in coverCSR:
    species = tree.getValue('VALUE')   # not really needed, but good to know about
    tree.setValue('error_field', error_code)
    coverCSR.updateRow(tree)
This iterates over all rows, one by one.
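To connect this back to the question: edits made through a cursor on a feature layer are written to the underlying feature class, so a sketch of the whole workflow might look like the following (the paths, layer name, field name and spatial relationship are all placeholders):
import arcpy

fc = r"C:\data\work.gdb\landcover"        # hypothetical feature class
error_code = -1

# create an in-memory layer so Select By Location can be used
arcpy.management.MakeFeatureLayer(fc, "landcover_lyr")
arcpy.management.SelectLayerByLocation("landcover_lyr", "INTERSECT", r"C:\data\work.gdb\rivers")

# update only the selected rows; the cursor on the layer edits the source feature class
with arcpy.da.UpdateCursor("landcover_lyr", ["error_field"]) as cursor:
    for row in cursor:
        row[0] = error_code
        cursor.updateRow(row)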

Most efficient way in Python to iterate over a large file (10GB+)

I'm working on a Python script to go through two files: one containing a list of UUIDs, the other containing a large number of log entries, each line containing one of the UUIDs from the first file. The purpose of the program is to create a list of the UUIDs from file1 and then increment the associated value each time that UUID is found in the log file.
So, long story short, count how many times each UUID appears in the log file.
At the moment I have a list which is populated with the UUIDs as keys and 'hits' as the values. Then there is another loop which iterates over each line of the log file and checks whether the UUID in the log matches a UUID in the UUID list. If it matches, it increments the value.
for i, logLine in enumerate(logHandle): # start matching UUID entries in log file to UUIDs from rulebase
    if logFunc.progress(lineCount, logSize): # check progress
        print logFunc.progress(lineCount, logSize) # print progress in 10% intervals
    for uid in uidHits:
        if logLine.count(uid) == 1: # for each UUID, check the current line of the log for a match in the UUID list
            uidHits[uid] += 1 # if matched, increment the relevant value in the uidHits list
            break # as we've already found the match, don't process the rest
    lineCount += 1
It works as it should, but I'm sure there is a more efficient way of processing the file. I've been through a few guides and found that using 'count' is faster than using a compiled regex. I thought reading the file in chunks rather than line by line would improve performance by reducing disk I/O time, but the performance difference on a ~200MB test file was negligible. If anyone has any other methods I would be very grateful :)
Think functionally!
Write a function which will take a line of the log file and return the uuid. Call it uuid, say.
Apply this function to every line of the log file. If you are using Python 3 you can use the built-in function map; otherwise, you need to use itertools.imap.
Pass this iterator to a collections.Counter.
collections.Counter(map(uuid, open("log.txt")))
This will be pretty much optimally efficient.
A couple comments:
This completely ignores the list of UUIDs and just counts the ones that appear in the log file. You will need to modify the program somewhat if you don't want this.
Your code is slow because you are using the wrong data structures. A dict is what you want here.
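A minimal sketch of that approach, restricted afterwards to the UUIDs from the first file; the uuid() extraction and the file names are placeholders, so adapt them to the actual formats:
import collections

def uuid(line):
    # placeholder: assume the UUID is the first whitespace-separated field
    return line.split()[0]

with open("uuids.txt") as f:
    known_uuids = {line.strip() for line in f}

with open("log.txt") as log:
    counts = collections.Counter(uuid(line) for line in log)

# keep only the UUIDs from the first file, defaulting to 0 hits
uid_hits = {u: counts.get(u, 0) for u in known_uuids}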
This is not a 5-line answer to your question, but there was an excellent tutorial given at PyCon'08 called Generator Tricks for System Programmers. There is also a followup tutorial called A Curious Course on Coroutines and Concurrency.
The Generator tutorial specifically uses big log file processing as its example.
Like folks above have said, with a 10GB file you'll probably hit the limits of your disk pretty quickly. For code-only improvements, the generator advice is great. In python 2.x it'll look something like
uuid_generator = (line.split(SPLIT_CHAR)[UUID_FIELD] for line in file)
It sounds like this doesn't actually have to be a python problem. If you're not doing anything more complex than counting UUIDs, Unix might be able to solve your problems faster than python can.
cut -d${SPLIT_CHAR} -f${UUID_FIELD} log_file.txt | sort | uniq -c
Have you tried mincemeat.py? It is a Python implementation of the MapReduce distributed computing framework. I'm not sure whether you'll see a performance gain, since I haven't processed 10GB of data with it yet, but you might explore this framework.
Try measuring where most of the time is spent, using a profiler: http://docs.python.org/library/profile.html
Where best to optimise will depend on the nature of your data. If the list of UUIDs isn't very long, you may find, for example, that a large proportion of the time is spent on the "if logFunc.progress(lineCount, logSize)" check. If the list is very long, it could help to save the result of uidHits.keys() to a variable outside the loop and iterate over that instead of the dictionary itself, but Rosh Oxymoron's suggestion of finding the id first and then checking for it in uidHits would probably help even more.
In any case, you can eliminate the lineCount variable and use i instead. And find(uid) != -1 might be better than count(uid) == 1 if the lines are very long.
