(Conceptual question) Tensorflow dataset... why use it? [closed] - python

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I'm taking a MOOC on TensorFlow 2, and the assignments insist that we use tf.data Datasets. However, all the hoops you have to jump through to do anything with Datasets make everything feel much harder than using a Pandas DataFrame or a NumPy array... so why use them?

The things you mention are generally meant for small data, i.e. a dataset that fits entirely in RAM (usually a few GB).
In practice, datasets are often much bigger than that. One way of dealing with this is to keep the dataset on a hard drive; there are other, more complicated storage solutions as well. TF Datasets let you interface with these storage backends easily. You create the Dataset object, which represents the storage, in your script, and from then on you can train a model on it as usual. Behind the scenes, TF repeatedly reads chunks of the data into RAM, uses them, and discards them.
TF Datasets also provide many helpful methods for handling big data, such as prefetching (overlapping disk reads and preprocessing with training), multithreading (running preprocessing on several examples in parallel), shuffling (which is harder when you can't simply reorder the whole dataset in RAM), and batching (grouping examples into batches to feed the model). All of this would be a pain to implement yourself in an optimised way with Pandas or NumPy.
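A minimal tf.data sketch of such a pipeline (the in-memory arrays here are just a stand-in for a large on-disk source such as TFRecord files):

```python
import numpy as np
import tensorflow as tf

# Toy data standing in for a large on-disk source.
features = np.random.rand(1000, 10).astype("float32")
labels = np.random.randint(0, 2, size=(1000,))

ds = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=256)            # shuffle within a bounded buffer
    .map(lambda x, y: (x * 2.0, y),      # preprocessing, parallelised
         num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)                           # group examples into batches
    .prefetch(tf.data.AUTOTUNE)          # overlap I/O with training
)

for batch_x, batch_y in ds.take(1):
    print(batch_x.shape)  # (32, 10)
```

You can pass `ds` straight to `model.fit(ds)`; swapping `from_tensor_slices` for a file-based source changes nothing downstream.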

Related

Using machine learning to detect fish spawning in audio files [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 months ago.
My friend is doing his thesis on fish spawning in rivers. For this, he collects hours of audio that he then analyses manually in Audacity, looking for specific patterns in the spectrograms that might indicate the sound of fish spawning.
Since he has days' worth of data, I set myself a challenge: to create an algorithm that might help him detect these patterns.
I am fairly new to machine learning, but a junior programmer, and this sounds like a fun learning experience.
I identify the main problems as:
samples are 1 hour in length.
noise in the background (such as cars and the rivers)
Is this achievable with machine learning or should I look into other options? If yes which ones?
Thank you for taking the time to read!
The first step would be to convert the sound signals into features that machines can understand; look into MFCCs for that.
Given an appropriate feature representation of your problem domain, the main thing to consider is what kind of machine learning algorithm to apply. Unless you would like to sit and annotate hours of data, naive supervised learning is off the table.
I think your best bet would be to adapt VAD (voice activity detection) algorithms or, better yet, speaker recognition/identification models.
You could also approach it by first building a rich enough representation that lets you "see" the sound, then comparing it against every frame of a given length in the test data. It might be useful to check out DTW (dynamic time warping).
If you have not designed such models before, it will be a bit difficult and might take quite a long time.
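As a first step toward such features, here is a dependency-light sketch using SciPy's spectrogram (MFCCs, e.g. via `librosa.feature.mfcc`, would be the usual refinement; the sine wave is just a hypothetical stand-in for a river recording):

```python
import numpy as np
from scipy import signal

sr = 22050                           # sample rate in Hz (hypothetical)
t = np.arange(sr * 2) / sr           # 2 seconds of audio
y = np.sin(2 * np.pi * 440 * t)      # stand-in for a real recording

# Short-time spectrogram: rows are frequency bins, columns are time frames.
f, times, Sxx = signal.spectrogram(y, fs=sr, nperseg=1024, noverlap=512)
log_S = 10 * np.log10(Sxx + 1e-10)   # log power, close to what Audacity shows
print(log_S.shape)
```

Each column of `log_S` is a feature vector for one short time frame; a classifier (or a DTW template match) would then operate on windows of these columns.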

Pandas Dataframe consumes too much memory. Any alternatives? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers. We don’t allow questions seeking recommendations for books, tools, software libraries, and more.
Closed 1 year ago.
Despite following best practices for reducing Pandas DataFrame memory usage, I still find that the usage is too high. I've tried chunking, converting dtypes, reading less data, etc.
For example, even though the CSV file I'm reading in is only 2.7 GB, when I use pd.read_csv, Task Manager shows 25 GB of RAM in use. I've tried converting object columns to category, but some columns are not suitable for the conversion, so the object dtype is the only choice I have.
Does anyone have advice on how to reduce the memory usage, or alternative Python libraries with low-memory dataframe objects? I've tried PySpark, but the lazy evaluation kills me every time I want to run a simple show statement.
Why use Dask DataFrame? From the Dask documentation ("Common Uses and Anti-Uses", Best Practices):

Dask DataFrame is used in situations where Pandas is commonly needed, usually when Pandas fails due to data size or computation speed. For data that fits into RAM, Pandas can often be faster and easier to use than Dask DataFrame. While “Big Data” tools can be exciting, they are almost always worse than normal data tools while those remain appropriate.

Should One Use Pandas or Sklearn for Imputation/Normalization etc.? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I am currently trying to get more familiar with sklearn, though I am still kind of a newbie to ML. While working through a couple of tutorials, I stumbled across sklearn implementations of techniques I had already been doing in pandas, like tools for normalization, imputation of missing values, etc.
My current workflow looks like this: load and preprocess the data with pandas, doing normalization, imputation, etc., mostly in a notebook. Then I export the cleaned version to a CSV file and do my ML work in separate Python files on this cleaned and processed dataset. Is there anything wrong with this workflow?
I'd really like to hear from people who have spent more time in the field than me about the advantages/disadvantages of using pandas versus sklearn for preprocessing. Maybe you have already hit roadblocks I haven't?
In my opinion, if after imputation, normalization, and other data cleaning / preprocessing steps you plan on doing some actual machine learning, you should use scikit-learn. The main advantage is that you can easily chain all your preprocessing steps and your final estimator in a single Pipeline object.
This is very convenient when you want to be sure that the exact same steps fitted on the training data are applied to new data. Your code will also be more compact and readable.
Also take a look at Column Transformers (that can be included in a Pipeline), if you need to apply different preprocessing steps to different columns of your original data.
On the other hand, if your project does not involve Machine Learning, and you only need to preprocess and clean your data to visualize it and to calculate some statistics, you may decide to only use Pandas, so your project will have less dependencies.
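A minimal sketch of such a pipeline, combining a ColumnTransformer with imputation, scaling, and a final estimator (the toy frame and column names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with a numeric and a categorical column, both with gaps.
X = pd.DataFrame({
    "age": [25, np.nan, 31, 40, 22, 35],
    "city": ["a", "b", "a", np.nan, "b", "a"],
})
y = np.array([0, 1, 0, 1, 0, 1])

# Different preprocessing per column type, all inside the pipeline.
pre = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["city"]),
])

model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(X, y)  # the same fitted steps are re-applied at predict time
print(model.predict(X).shape)
```

Calling `model.predict(new_df)` re-applies the imputers and scaler fitted on the training data, which is exactly the guarantee the notebook-plus-CSV workflow lacks.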

Finding a model for a machine learning problem with a sensor [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I'm doing a project where I have data from 100 sensors and their cycles until they break. Each sensor shows a lot of characteristics up to its failure, and the data then continues for the replacement sensor. With this data, I have to build a model that predicts how long a sensor will keep working before it fails, using only part of a cycle rather than the full one. I have no idea what machine learning model is suitable for this.
The type of problem you are describing is known as survival analysis. A wide range of both statistical and machine learning methods is available to help you solve these kinds of problems.
What is great about these methods is that they also allow you to use data points where the event you are interested in has not yet occurred. In your example, this means you can extend your dataset by including data from sensors that have not failed yet (so-called censored observations).
When you look at the methods, I suggest you also spend some time examining how to evaluate these types of models, since the evaluation metrics differ slightly from those in typical machine learning problems.
A comprehensive range of techniques is available at: http://dmkd.cs.vt.edu/TUTORIAL/Survival/Slides.pdf
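To make the censoring idea concrete, here is a from-scratch sketch of the Kaplan-Meier survival estimator (libraries such as lifelines provide production versions; the five sensors below are hypothetical):

```python
import numpy as np

def kaplan_meier(durations, observed):
    """Kaplan-Meier survival estimate.

    durations: time until failure (or until censoring) for each sensor
    observed:  1 if the failure was seen, 0 if the sensor was still
               working when data collection stopped (censored)
    """
    durations = np.asarray(durations, dtype=float)
    observed = np.asarray(observed, dtype=int)
    times = np.sort(np.unique(durations[observed == 1]))
    surv, s = [], 1.0
    for t in times:
        at_risk = np.sum(durations >= t)   # still running just before t
        failures = np.sum((durations == t) & (observed == 1))
        s *= 1.0 - failures / at_risk      # survival drops at each event
        surv.append(s)
    return times, np.array(surv)

# 5 sensors: three observed failures, two censored (still working).
t, s = kaplan_meier([10, 20, 20, 30, 35], [1, 1, 1, 0, 0])
print(t, s)
```

Note that the two censored sensors still contribute to the at-risk counts, which is exactly the extra information a plain regression on observed failures would throw away.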

Is it beneficial to use OOP on large datasets in Python? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I'm implementing a Kalman filter on two types of measurements: a GPS measurement every second (1 Hz) and 100 acceleration measurements per second (100 Hz).
So basically I have two huge tables that have to be fused at some point. My aim: I really want to write readable and maintainable code.
My first approach was a class for each of the data tables (so an object is a data table), with bulk calculations in the class methods (so almost all of my methods contain a for loop) until I get to the actual filter. I found this approach a bit too stiff. It works, but there is a lot of data-type conversion, and it is just not that convenient.
Now I want to change my code. If I were to stick with OOP, my second try would be: every single measurement is an object of either the GPS_measurement or the acceleration_measurement class. This approach seems cleaner, but thousands of objects would be created this way.
My third try would be a data-driven design, but I'm not really familiar with this approach.
Which paradigm should I use? Or perhaps it should be solved by some kind of mixture of the above paradigms? Or should I just use procedural programming with the use of pandas dataframes?
It sounds like you would want to use pandas. OOP is a concept, by the way, not something you have to code in rigidly. Generally speaking, you only want to define your own classes if you plan on extending them or encapsulating certain features. pandas and numpy are two modules that already do almost everything you could ask for with regard to data, and their vectorised operations execute much faster than Python-level loops over objects.
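For the fusion step specifically, a vectorised sketch using `pd.merge_asof` to attach the most recent 1 Hz GPS fix to each 100 Hz acceleration sample (the tables and column names are hypothetical stand-ins):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two measurement tables.
gps = pd.DataFrame({
    "t": np.arange(0, 10, 1.0),             # 1 Hz timestamps (seconds)
    "lat": 47.0 + np.cumsum(np.random.randn(10)) * 1e-5,
})
accel = pd.DataFrame({
    "t": np.arange(0, 10, 0.01),            # 100 Hz timestamps
    "a": np.random.randn(1000),
})

# Attach the most recent GPS fix to every acceleration sample,
# instead of looping over thousands of measurement objects.
fused = pd.merge_asof(accel, gps, on="t", direction="backward")
print(fused.shape)
```

One table-level operation replaces the per-measurement object design entirely; the fused frame can then be iterated row by row only inside the actual filter loop.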
