I know that all of the datasets that can be loaded with sns.load_dataset() are example datasets used for Seaborn's documentation, but do these example datasets use actual data?
I'm asking because I want to know whether it's useful to pay attention to the results I get as I play around with these datasets, or whether I should see them solely as a means of learning the module.
The data does appear to be real. This is not formally documented by Seaborn, but:
Several of the datasets are "real", well-known datasets that can be verified elsewhere, such as the Iris dataset hosted on UCI's Machine Learning Repository.
All of the data is sourced from https://github.com/mwaskom/seaborn-data, and in turn, it appears, from actual CSVs on the local drive of Michael Waskom (a core Seaborn developer). If the data were random/fake, it would more likely have been generated with Python libraries such as NumPy.
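For instance, you can load the Iris dataset yourself and compare it against the copy hosted by UCI; a minimal check, assuming seaborn is installed and its example data is reachable:

import seaborn as sns

# Load the example Iris dataset shipped with seaborn
iris = sns.load_dataset("iris")

# 150 rows of four measurements plus a species label, matching the
# classic UCI Iris dataset
print(iris.shape)               # (150, 5)
print(iris["species"].unique())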
I have begun to fall in love with a Python visualization library called Altair, and I use it with every small data science project that I've done.
Now, in terms of industry use cases, does it make sense to visualize big data, or should we just take a random sample?
Short answer: no. If you're trying to visualize data with tens of thousands of rows or more, Altair is probably not the right tool. But there are efforts in progress to add support for larger datasets in the Vega ecosystem; see https://github.com/vega/scalable-vega.
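In the meantime, if you do take the sampling route, pandas makes that a one-liner before the data ever reaches Altair; a sketch, where df and the encoded column names are placeholders:

import altair as alt

# Down-sample to a size Altair handles comfortably; random_state
# makes the sample reproducible
sample = df.sample(n=5000, random_state=0)

alt.Chart(sample).mark_point().encode(x="x:Q", y="y:Q")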
I am trying to histogram counts of a large (300,000 records) temporal dataset. For now I am just trying to histogram by month, which is only 6 data points, but doing this with either json or altair_data_server storage makes the page crash. Is this impossible to handle well with pure Altair? I could of course preprocess in pandas, but that ruins the wonderful declarative nature of Altair.
If so, is this a missing feature of Altair, or is it out of scope? I'm learning that Vega-Lite stores the entire underlying data and applies the transformation at run time, but it seems like Altair could (and maybe does) have a way to store only the relevant data for the chart.
import altair as alt

# Count records per month; the aggregation is declared here but
# applied by Vega-Lite at run time, on the full dataset
alt.Chart(df).mark_bar().encode(
    x=alt.X('month(timestamp):T'),
    y='count()'
)
Altair charts work by sending the entire dataset to your browser and processing it in the frontend; for this reason it does not work well for larger datasets, no matter how the dataset is served to the frontend.
In cases like yours, where you are aggregating the data before displaying it, it would in theory be possible to do that aggregation in the backend, and only send aggregated data to the frontend renderer. There are some projects that hope to make this more seamless, including scalable Vega and altair-transform, but neither approach is very mature yet.
In the meantime, I'd suggest doing your aggregations in Pandas, and sending the aggregated data to Altair to plot.
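A minimal sketch of that workaround, assuming df has a datetime64 timestamp column as in your snippet:

import altair as alt

# Aggregate counts per month in pandas, so only ~6 rows (not 300,000)
# are embedded in the chart spec
counts = (
    df.groupby(df["timestamp"].dt.to_period("M").dt.to_timestamp())
      .size()
      .reset_index(name="count")
      .rename(columns={"timestamp": "month"})
)

alt.Chart(counts).mark_bar().encode(
    x=alt.X("month:T"),
    y=alt.Y("count:Q"),
)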
Edit 2023-01-25: VegaFusion addresses this problem by automatically pre-aggregating the data on the server and is mature enough for production use. Version 1.0 is available under the same license as Altair.
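With VegaFusion installed, enabling it is a one-liner (API as of VegaFusion 1.x; newer Altair releases expose it as a data transformer instead, so check the docs for your versions):

import vegafusion as vf

# Evaluate Vega-Lite transforms (like the month/count aggregation)
# in Python, so only the aggregated rows reach the browser
vf.enable()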
Try the below:
# Lift Altair's default 5,000-row limit on embedded data
alt.data_transformers.enable('default', max_rows=None)
and then
alt.Chart(df).mark_bar().encode(
x=alt.X('month(timestamp):T'),
y='count()'
)
You will get the chart, but make sure to save all of your work first, in case the browser crashes.
Using the following works for me:
alt.data_transformers.enable('data_server')
I am trying to implement U-Net, and I am using https://github.com/jakeret/tf_unet/tree/master/scripts as a reference. I don't understand which dataset they used. Please give me some idea of, or a link to, a dataset I can use.
On their GitHub README.md they show three different datasets that they applied their implementation to. Their implementation is dataset agnostic, so it shouldn't matter too much what data they use if you're trying to solve your own problem with your own data. But if you're looking for a toy dataset to play around with, check out their demos. There you'll see two readily available examples and how they can be used:
demo_radio_data.ipynb which uses an astronomic radio data example set from here: http://people.phys.ethz.ch/~ast/cosmo/bgs_example_data/
demo_toy_problem.ipynb which uses their built-in data generator of a noisy image with circles that are to be detected.
The latter is probably the easiest one when it comes to just having something to play with. To see how the data is generated, check out the class:
image_gen.py -> GrayScaleDataProvider
(with an IDE like PyCharm you can jump straight into the relevant classes from the demo source code)
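A minimal sketch of using that generator (argument values follow the toy demo; adjust as needed):

from tf_unet import image_gen

# Synthesizes noisy grayscale images containing circles, along with
# the ground-truth masks to be detected
generator = image_gen.GrayScaleDataProvider(nx=572, ny=572, cnt=20)

# Request a batch of 4 image/label pairs
x, y = generator(4)
print(x.shape, y.shape)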
We have a research project that we are doing as part of our college coursework, in which we need to analyse Twitter data.
We have already built the prototype for classification and analysis using pandas and NLTK, reading the comments from a CSV file and then processing them. The problem now is that we want to scale it up so we can read and analyse a big comments file as well. But we don't have anybody who could guide us (the majority of them being from a biology background) on what technologies to use for this massive amount of data.
Our issues are:
1. How do we store a massive comments file (5 GB, offline data)? Until now we had only 5,000-10,000 lines of comments, which we processed using pandas. But how do we store and process such a huge file? Which database should we use for it?
2. Also, since we plan to use NLTK and machine learning on this data, what should our approach be along the pipeline CSV -> pandas -> NLTK/machine learning -> model -> prediction? That is, where in this path do we need changes, and with what technologies should we replace each stage to handle the huge data?
Generally speaking, there are two types of scaling:
Scale up
Scale out
Scaling up, most of the time, means taking what you already have and running it on a bigger machine (more CPU, RAM, disk throughput).
Scaling out generally means partitioning your problem and handling the parts on separate threads/processes/machines.
Scaling up is much easier: keep the code you already have and run it on a big machine (possibly on Amazon EC2 or Rackspace, if you don't have one available).
If scaling up is not enough, you will need to scale out. Start by identifying what parts of your problem can be partitioned. Since you're processing twitter comments, there's a good chance you can simply partition your file into multiple ones, and train N independent models.
Since you're just processing text data, there isn't a big advantage to using a database over plain text files (for storing the input data, at least). Simply split your file into multiple files and distribute each one to a different processing unit.
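For example, pandas can stream the big CSV in chunks instead of loading it whole, writing each chunk out as its own partition; a sketch, where the file name and chunk size are placeholders:

import pandas as pd

# Stream the 5 GB comments file in pieces that fit in memory
for i, chunk in enumerate(pd.read_csv("comments.csv", chunksize=100_000)):
    # Each chunk is an ordinary DataFrame; write it out as its own
    # partition so independent workers can pick up separate parts
    chunk.to_csv(f"comments_part_{i}.csv", index=False)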
Depending on the specific machine learning techniques you're using, it may be easy to merge the independent models into a single one, but it will likely require expert knowledge.
If you're using K-nearest-neighbors, for example, it's trivial to join the independent models.
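A toy illustration of why: a fitted K-nearest-neighbors "model" is essentially its stored training set, so merging two independently built models amounts to pooling their data and refitting (random data here, purely to show the mechanics):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Two partitions that were processed independently
X1, y1 = rng.random((100, 5)), rng.integers(0, 2, 100)
X2, y2 = rng.random((100, 5)), rng.integers(0, 2, 100)

# "Joining" the models is just concatenating their reference data
merged = KNeighborsClassifier(n_neighbors=5)
merged.fit(np.vstack([X1, X2]), np.concatenate([y1, y2]))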
Is it possible to save a predictive model in Python?
Background: I'm building regression models in SPSS and I need to share those when done. Unfortunately, no one else has SPSS.
Idea: My idea is to build the model in Python, do something XYZ, and use another library to convert XYZ into an exe that will pick up a CSV file with data and spit out the model-fit results on that data. This way, I can share the model with anyone I want without the need for SPSS or other expensive software.
Challenge: I need to find out XYZ: how do I save the instance once the model is built? For example, in the case of linear/logistic regression, it would be the set of coefficients.
PS: I'm using linear/logistic as examples; in reality, I need to share more complex models like SVMs etc.
Using FOSS (Free and Open Source Software) is a great way to facilitate collaboration. Consider using R or Sage (which has a Python backbone and includes R) so that you can freely share programs and data. Or even use SageMath Cloud so that you can work collaboratively in real time.
Yes, this is possible. What you're looking for is scikit-learn in combination with joblib. A working example of your problem can be found in this question.
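A minimal sketch of that combination (the SVM and file name are arbitrary):

import joblib
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Fit any scikit-learn model (an SVM here, since those need sharing too)
X, y = load_iris(return_X_y=True)
model = SVC().fit(X, y)

# Persist the fitted model to disk...
joblib.dump(model, "model.joblib")

# ...then load it elsewhere (with matching library versions) and predict
restored = joblib.load("model.joblib")
print(restored.predict(X[:5]))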