I was simulating the solar system (Sun, Earth and Moon). When I first started working on the project, I used the base SI units: meters for distance, seconds for time, and meters per second for velocity. Because I was dealing with the solar system, the numbers were pretty big; for example, the distance between the Earth and the Sun is about 150·10⁹ m.
When I numerically integrated the system with scipy.solve_ivp, the results were completely wrong. Here is an example of Earth and Moon trajectories.
But then I got a suggestion from a friend that I should use standardised units: astronomical unit (AU) for distance and years for time. And the simulation started working flawlessly!
My question is: Why is this generally valid advice for problems such as mine? (Mind that this is not about my specific problem, which has already been solved, but rather about why the solution worked.)
Most, if not all integration modules work best out of the box if:
your dynamical variables have the same order of magnitude;
that order of magnitude is 1;
the smallest time scale of your dynamics also has the order of magnitude 1.
This typically fails for astronomical simulations where the orders of magnitude vary and values as well as time scales are often large in typical units.
The reason for the above behaviour of integrators is that they use step-size adaption, i.e., the integration step is adjusted to keep the estimated error at a defined level.
The step-size adaption in turn is governed by a lot of parameters like absolute tolerance, relative tolerance, minimum time step, etc.
You can usually tweak these parameters, but if you don't, some default values must be used, and these defaults are chosen with the above setup in mind.
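To make this concrete, here is a minimal sketch (not the asker's code) of a Sun–Earth two-body problem nondimensionalised to astronomical units and years, so that the dynamical variables and the orbital time scale are all of order 1; the gravitational parameter 4π² AU³/yr² follows from Kepler's third law:

import numpy as np
from scipy.integrate import solve_ivp

GM_SUN = 4 * np.pi**2  # AU^3 / yr^2, from Kepler's third law

def rhs(t, state):
    x, y, vx, vy = state
    r3 = (x**2 + y**2) ** 1.5
    return [vx, vy, -GM_SUN * x / r3, -GM_SUN * y / r3]

# Earth-like initial conditions of order 1: 1 AU out, circular speed 2*pi AU/yr.
y0 = [1.0, 0.0, 0.0, 2 * np.pi]
sol = solve_ivp(rhs, (0.0, 1.0), y0, rtol=1e-9, atol=1e-12)

If you insist on SI units instead, you would at least have to rescale the tolerances yourself, since solve_ivp's defaults (rtol=1e-3, atol=1e-6) were chosen with order-1 problems in mind.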
Digression
You might ask yourself: Can these parameters not be chosen more dynamically? As a developer and maintainer of an integration module, I would roughly expect that introducing such automatisms has the following consequences:
About twenty in a thousand users will not run into problems like yours.
About fifty in a thousand users (including the above) will miss an opportunity to learn rudimentary knowledge about how integrators work and how to read documentation.
About one in a thousand users will run into a horrible problem with the automatisms that is much more difficult to solve than the above.
I will need to introduce new parameters governing the automatisms that are even harder to grasp for the average user.
I will have to spend a lot of time devising and implementing the automatisms.
I am an engineering student and I have to turn in projects that run code for simulations. The calculations are all carried out in Python, but the derivations/setup are done on paper and included in the report. I sometimes find myself copying and pasting code for repeated graphs or numerical simulations, so I want to write functions for these (trying to be better about DRY).
When writing functions to do these things, I'm worried that I'm "hiding" too much of the code behind the functions. For example, the function below would plot two simulations differing only by tolerance level. This would be used to check for convergence in a numerical simulation of an ODE.
from scipy.integrate import solve_ivp
import matplotlib.pyplot as plt

def plot_visual_convergence_check(odefunc, time, reltol, y0=(0, 0)):
    # Solve the same ODE twice, the second time with a 100x tighter tolerance.
    sol_1 = solve_ivp(odefunc, time, y0, rtol=reltol)
    sol_2 = solve_ivp(odefunc, time, y0, rtol=reltol / 100)
    # Plot both solutions together and state the tolerances on the figure.
    plt.plot(sol_1.t, sol_1.y[0], label=f"rtol = {reltol:g}")
    plt.plot(sol_2.t, sol_2.y[0], "--", label=f"rtol = {reltol / 100:g}")
    plt.legend()
    plt.show()
Here, all the business of running solve_ivp for two scenarios that differ only in relative tolerance, and plotting them, is wrapped up into one function. I don't imagine that people would want the specifics, just the output and the confirmation that "yes", the simulation has converged. I even write both rtol values on the graph to make it clearer what values were used without showing the code.
Is wrapping these types of operations up okay (because I think it looks cleaner that way), or would it be better as an engineer to have everything laid out for everyone to see without them having to scroll over to the function definition?
In my experience, the DRY principle matters most when writing libraries and code that you want to reuse. When you are producing a report, however, making things too DRY can actually make them harder to maintain: sometimes you want to be able to change an individual graph, plot, or piece of data in one place without it affecting the rest of the report. It takes a bit of practice and experience to find out what works best for you, but in this particular use case you can be less focused on following the DRY rule than when writing a library or an application.
Additionally, if I had to make such a report and the situation allowed it, I would make it in a Jupyter notebook, where you can nicely mix code with text and graphical output.
So I am fairly new to machine learning and I am trying to create a Python script to analyse an energy dataset of a computer.
The script should in the end determine the different states of the computer (like idle, standby, working, etc.) and how much energy those states use on average.
I was wondering if this task could be done by some clustering method like k-means or DBSCAN.
I tinkered a bit with some clustering methods in scikit-learn, but the results so far were not as good as I expected.
I researched a lot about clustering methods but I could never find a scenario similar to mine.
So my question is: is it even worth the trouble, and if so, which clustering method (or machine learning algorithm in general) would be best suited for that task? Or are there better ways to do it?
The energy dataset is just a single column table with one cell being one energy value per second of a few days.
You will not be able to apply supervised learning for this dataset as you do not have labels for your dataset (there is no known state given an energy value). This means that models like SVM, decision trees, etc. are not feasible given your dataset.
What you have is a timeseries with a single variable output. As I understand it, your goal is to determine whether or not there are different energy states, and what the average value is for those state(s).
I think it would be incredibly helpful to plot the timeseries using something like matplotlib or seaborn. After plotting the data, you can have a better feel for whether your hypothesis is reasonable and how you might further want to approach the problem. You may be able to solve your problem by just plotting the timeseries and observing that there are, say, four distinct energy states (e.g. idle, standby, working, etc.), avoiding any complex statistical techniques, machine learning, etc.
To answer your question, you can in principle use k-means for one dimensional data. However, this is probably not recommended as these techniques are usually used on multidimensional data.
I would recommend that you look into Jenks natural breaks optimization or kernel density estimation. Similar questions to yours can be found here and here, and should help you get started.
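As a rough illustration of both routes, here is a sketch (the file name, data layout, and cluster count are assumptions) that clusters the one-dimensional readings with k-means and, alternatively, looks for valleys in a kernel density estimate to place state boundaries:

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.cluster import KMeans

# One power reading per second; file name and format are hypothetical.
energy = np.loadtxt("energy.csv")

# k-means on 1-D data just needs a column vector.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(energy.reshape(-1, 1))

# Alternatively, estimate the density and use its local minima as thresholds.
kde = gaussian_kde(energy)
grid = np.linspace(energy.min(), energy.max(), 512)
density = kde(grid)
is_min = (density[1:-1] < density[:-2]) & (density[1:-1] < density[2:])
print("candidate state boundaries:", grid[1:-1][is_min])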
Don't ignore time.
First of all, if your signal is noisy, temporal smoothing will likely help.
Secondly, you'll want to perform some feature extraction first. For example, by using segmentation to cut your time series into separate states. You can then try to cluster these states, but I am not convinced that clustering is applicable here at all. You probably will want to use a histogram, or a density plot. It's one dimensional data - you can visualize this, and choose thresholds manually instead of hoping that some automated technique may work (because it may not...)
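For instance, a smoothed histogram with hand-picked thresholds could look roughly like this (the window length and threshold values are purely illustrative):

import numpy as np
import matplotlib.pyplot as plt

energy = np.loadtxt("energy.csv")   # hypothetical per-second readings

# Temporal smoothing: a one-minute moving average to suppress noise.
window = 60
smoothed = np.convolve(energy, np.ones(window) / window, mode="valid")

# Distinct states should show up as separate modes in the histogram.
plt.hist(smoothed, bins=200)
plt.xlabel("power")
plt.ylabel("count")
plt.show()

# Manually chosen thresholds, then the average energy per state.
thresholds = [5.0, 20.0, 60.0]
states = np.digitize(smoothed, thresholds)
for s in np.unique(states):
    print(f"state {s}: mean energy {smoothed[states == s].mean():.1f}")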
I am testing out a few clustering algorithms on a dataset of text documents (with word frequencies as features). Running some of the methods of scikit-learn's clustering module one after the other, below is how long they take on ~50,000 files with 26 features per file. There are big differences in how long each takes to converge, and these differences get more extreme the more data I put in; some of them (e.g. MeanShift) just stop working after the dataset grows to a certain size.
(Times given below are totals from the start of the script, i.e. KMeans took 0.004 minutes, Meanshift (2.56 - 0.004) minutes, etc. )
shape of input: (4957, 26)
KMeans: 0.00491824944814
MeanShift: 2.56759268443
AffinityPropagation: 4.04678163528
SpectralClustering: 4.1573699673
DBSCAN: 4.16347868443
Gaussian: 4.16394021908
AgglomerativeClustering: 5.52318491936
Birch: 5.52657626867
I know that some clustering algorithms are inherently more computationally intensive (e.g. the chapter here outlines that k-means' demand is linear in the number of data points while hierarchical models are O(m² log m)).
So I was wondering:
1. How can I determine how many data points each of these algorithms can handle, and are the number of input files and the number of input features equally relevant in this equation?
2. How much does the computational intensity depend on the clustering settings, e.g. the distance metric in k-means or the eps in DBSCAN?
3. Does clustering success influence computation time? Some algorithms such as DBSCAN finish very quickly, maybe because they don't find any clustering in the data; MeanShift does not find clusters either and still takes forever (I'm using the default settings here). Might that change drastically once they discover structure in the data?
4. How much is raw computing power a limiting factor for these kinds of algorithms? Will I be able to cluster ~300,000 files with ~30 features each on a regular desktop computer, or does it make sense to use a computer cluster for these kinds of things?
Any help is greatly appreciated! The tests were run on a Mac mini, 2.6 GHz, 8 GB. The data input is a NumPy array.
This is too broad a question.
In fact, most of these questions are unanswered.
For example k-means is not simply linear O(n), but because the number of iterations needed until convergence tends to grow with data set size, it's more expensive than that (if run until convergence).
Hierarchical clustering can be anywhere from O(n log n) to O(n^3) mostly depending on the way it is implemented and on the linkage. If I recall correctly, the sklearn implementation is the O(n^3) algorithm.
Some algorithms have parameters to stop early, before they are actually finished! For k-means, you should use tol=0 if you want to really run the algorithm to completion. Otherwise, it stops early if the relative improvement is less than this factor, which can be much too early. MiniBatchKMeans never converges at all: because it only looks at random parts of the data each time, it would just go on forever unless you choose a fixed number of iterations.
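With scikit-learn that would mean something like the following (cluster counts and iteration caps are arbitrary placeholders):

from sklearn.cluster import KMeans, MiniBatchKMeans

# Run k-means to actual convergence instead of stopping on a loose tolerance.
km = KMeans(n_clusters=10, tol=0, max_iter=10_000, n_init=10)

# MiniBatchKMeans never converges in the strict sense; you just cap the work.
mbk = MiniBatchKMeans(n_clusters=10, max_iter=500, batch_size=1024)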
Never try to draw conclusions from small data sets. You need to go to your limits, i.e. what is the largest data set you can still process within, say, 1, 2, 4, and 12 hours with each algorithm?
To get meaningful results, your runtimes should be hours, except if the algorithms simply run out of memory before that - then you might be interested in predicting how far you could scale before you run out of memory: assuming you had 1 TB of RAM, how large a data set could you still process?
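One way to probe those limits is a simple scaling run on growing subsets of your data; the sizes, feature count, and cluster count below are arbitrary:

import time
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300_000, 30)  # stand-in for the real feature matrix

for n in (10_000, 30_000, 100_000, 300_000):
    start = time.perf_counter()
    KMeans(n_clusters=8, n_init=10).fit(X[:n])
    print(f"n={n}: {time.perf_counter() - start:.1f} s")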
The problem is, you can't simply use the same parameters for data sets of different sizes. If you do not choose the parameters well (e.g. DBSCAN puts everything into noise, or everything into one cluster), then you cannot draw conclusions from that either.
And then, there might simply be an implementation error. DBSCAN in sklearn has become a lot lot lot faster recently. It's still the same algorithm. So most results done 2 years ago were simply wrong, because the implementation of DBSCAN in sklearn was bad... now it is much better, but is it optimal? Probably not. And similar problems might be in any of these algorithms!
Thus, doing a good benchmark of clustering is really difficult. In fact, I have not seen a good benchmark in a looong time.
I made a heightmap generator which uses gradient/value noise to generate terrain. The problem is that the heightmap is too chaotic to look realistic.
Here's what I am talking about:
Here's the map without the colors:
I used a 257x257 grid of blocks with 17x17 gradients.
As you can see, there are too many islands, as well as some random small beach islands in the middle of the ocean.
Also, there are a lot of sharp edges, especially in the mountain terrain (dark gray).
What I would like is a smoother and less chaotic terrain, such as a large island, etcetera. How do I do that?
In games, the most common noise generator for textures and heightmaps is Perlin noise.
I can't tell from your question whether you actually want to create the noise generator yourself or just use one in your application.
If you are looking to create your own Perlin Noise Generator, this would be a good starting point.
I would however recommend using the noise (https://pypi.python.org/pypi/noise/) library available through pip using:
pip install noise
You can then use the noise.snoise2(x, y, a, b, c) function and fiddle with the different parameters.
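For example, a 257x257 heightmap could be generated roughly like this (the scale, octave count, and persistence are just starting values to fiddle with):

import numpy as np
import noise

size, scale = 257, 64.0
heightmap = np.zeros((size, size))
for i in range(size):
    for j in range(size):
        heightmap[i, j] = noise.snoise2(i / scale, j / scale,
                                        octaves=6, persistence=0.5,
                                        lacunarity=2.0)
# snoise2 returns values roughly in [-1, 1]; rescale before mapping to terrain.
heightmap = (heightmap + 1) / 2

More octaves add fine detail, while a larger scale stretches features out, which is usually what you want for large islands rather than speckle.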
I would recommend reading this article: http://simblob.blogspot.ch/2010/01/simple-map-generation.html if you want to learn more about terrain generation.
Look at this article where Amit walks through some map generation techniques. He even has sample code online.
In the article, he takes perlin noise as a randomization parameter to his terrain generator, but doesn't use it as the whole generator. The result looks really good. (I'd post a picture of the result, but I don't know of copyright issues just yet.)
While you're at it: Amit has written and curated on all things game programming for years and years. Here and here are a few more articles of his on the subject. I hope this doesn't become a time sink for you; I've certainly spent many hours on his blog. :)
(PS. I prefer simplex noise over perlin noise. Same inventor, simpler implementation, and looks better to me.)
From what I see, your sample may lack octaves and interpolation.
Depending on the implementation you are using, you may play with octave number, frequency, persistence / lacunarity, various interpolation techniques, etc...
Try playing / mixing with turbulence too (easy way to add fancy features to your height maps).
Many simplex noise implementations (also by Ken Perlin, but it scales better/faster in higher dimensions) expose a pretty complete set of parameters for you to play with when generating your heightmaps.
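In case it helps, this is roughly what octave summation (and a simple turbulence variant) looks like; base_noise stands in for whatever gradient or simplex noise function you already have, assumed to return values in [-1, 1]:

def fbm(x, y, base_noise, octaves=5, persistence=0.5, lacunarity=2.0):
    total, amplitude, frequency, norm = 0.0, 1.0, 1.0, 0.0
    for _ in range(octaves):
        total += amplitude * base_noise(x * frequency, y * frequency)
        norm += amplitude
        amplitude *= persistence   # each octave contributes less height...
        frequency *= lacunarity    # ...but finer detail
    return total / norm            # keep the result roughly in [-1, 1]

def turbulence(x, y, base_noise, octaves=5):
    # Summing absolute values gives the billowy look often used for ridges.
    return sum(abs(base_noise(x * 2**o, y * 2**o)) / 2**o for o in range(octaves))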
So, my first entry on Stack Overflow! I really hope someone answers. This goes out to anyone generally using Meep or more specifically Python Meep for FDTD simulations.
Is it possible to include a complex value of conductivity as well as a real value (we are trying to model graphene, which has a complex component as well as a real one)? If not, I guess I could approximate it with just the real component, but I'd rather know. Also, is there a way to add surface charges in Meep? And finally, is it capable of handling strictly 2D structures without any thickness whatsoever? I think so, but I just want to check...