Do we Visualize Big Data - python

I have fallen in love with a Python visualization library called Altair, and I use it in every small data science project I do.
Now, in terms of industry use cases: does it make sense to visualize big data directly, or should we just take a random sample?

Short answer: no, if you're trying to visualize data with tens of thousands of rows or more, Altair is probably not the right tool. But there are efforts in progress to add support for larger datasets in the vega ecosystem; see https://github.com/vega/scalable-vega.
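If a random sample is good enough for the question being asked, drawing it before the data reaches the plotting library is straightforward. The following is a minimal sketch using only the standard library; the `sample_rows` helper and the generated `rows` are hypothetical stand-ins, and the sample size of 5,000 is chosen to match Altair's default row limit.

```python
import random

def sample_rows(rows, k, seed=0):
    """Return up to k rows chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample is drawn on every run
    if k >= len(rows):
        return list(rows)
    return rng.sample(rows, k)

# Hypothetical data: 200,000 (x, y) points standing in for a large table.
rows = [(i, i % 97) for i in range(200_000)]
subset = sample_rows(rows, 5_000)
```

A uniform sample preserves the overall shape of a distribution but can miss rare outliers, so whether it is appropriate depends on what the chart is meant to show.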

Related

How to visualize real-time data candlestick chart for cryptocurrencies?

I'm trying to code a small program in Python to visualize the current price of a crypto asset with real-time data. I already have the data (historical and live data updated every second). I just want to find a good Python library (as optimized as possible) to show the candlestick chart and eventually some indicators or lines/curves on the same graph. I did some quick research and it seems like "plotly" (used with "cufflinks") or "bokeh" are good choices. Which one would you recommend, and why? I'm also open to suggestions of other libraries if they are good and optimized!
Thank you in advance :)
Take a look at https://github.com/highfestiva/finplot, where you can find examples of fetching real-time data from a crypto exchange. The author notes that the library is designed with speed and crypto in mind.
It looks very nice.

Realtime Data Visualization Task with LSL

I have to develop a real-time data visualization module in python that is relatively simple, but I don't know where to begin or what tools to use.
Essentially, I would have two images drawn on either side of the screen, and depending on values streamed through lab streaming layer (LSL), the images would change size. That's it.
Any pointers would be extremely appreciated.
Maybe this would help: BrainStreamingLayer. It's a higher-level implementation around pyLSL. https://github.com/bsl-tools/bsl
It has a real-time visualization module, however, the initial use case is EEG amplifiers, so some adaptation may be required.

Trying to work out how to produce a synthetic data set using python or javascript in a repeatable way

I have a reasonably technical background and have done a fair bit of node development, but I’m a bit of a novice when it comes to statistics and a complete novice with python, so any advice on a synthetic data generation experiment I’m trying my hand at would be very welcome :)
I’ve set myself the problem of generating some realistic(ish) sales data for a bricks and mortar store (old school, I know).
I’ve got a smallish real-world transactional dataset (~500k rows) from the internet that I was planning on analysing with a tool of some sort, to provide the input to a PRNG.
Hopefully if I explain my thinking across a couple of broad problem domains, someone(s?!) can help me:
PROBLEM 1
I think I should be able to use the real data I have to either:
a) generate a probability distribution curve or
b) identify an ‘out of the box’ distribution that’s the closest match to the actual data
I’m assuming there’s a tool or library in Python or Node that will do one or both of those things if fed the data and, further, give me the right values to plug in to a PRNG to produce a series of data points that are not only distributed like the original's, but also within the same sort of ranges.
I suspect b) would be less expensive computationally and, also, better supported by tools - my need for absolute ‘realness’ here isn’t that high - it’s only an experiment :)
Which leads me to…
QUESTION 1: What tools could I use to do the analysis and generate the data points? As I said, my maths is ok, but my statistics isn't great (and the docs for the tools I’ve seen are a little dense and, to me at least, somewhat impenetrable), so some guidance on using the tool would also be welcome :)
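One possible reading of option a) is to skip the named distribution entirely and draw from the empirical distribution of the real data with a seeded PRNG. The sketch below assumes that reading; `make_empirical_sampler` is a hypothetical helper, and the `prices` list simply reuses a few unit prices from the source dataset.

```python
import random

def make_empirical_sampler(observations, seed=42):
    """Build a function that draws values from the empirical distribution of the data."""
    ordered = sorted(observations)
    if not ordered:
        raise ValueError("need at least one observation")
    rng = random.Random(seed)  # seeded PRNG: the same seed gives the same series

    def draw():
        # Pick a random position along the sorted data and interpolate
        # between the two neighbouring observations (inverse-CDF sampling).
        pos = rng.random() * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        frac = pos - lo
        return ordered[lo] + frac * (ordered[hi] - ordered[lo])

    return draw

# A few unit prices from the source dataset; the real input would be the full column.
prices = [2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85]
draw_price = make_empirical_sampler(prices)
synthetic = [round(draw_price(), 2) for _ in range(1000)]
```

Because every draw is interpolated between observed values, the output always stays inside the original range. For option b), the distributions in `scipy.stats` expose `fit` and `rvs` methods, which is probably the closest thing to an off-the-shelf tool for fitting a named distribution and sampling from it.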
And then there’s my next, I think more fundamental, problem, which I’m not even sure how to approach…
PROBLEM 2
While I think the approach above will work well for generating timestamps for each row, I’m going round in circles a little bit on how to model what the transaction is actually for.
I’d like each transaction to be relatable to a specific product from a list of products.
Now the products don’t need to be ‘real’ (I reckon I can just use something like Faker to generate random words for the brand, product name etc), but ideally the distribution of what is being purchased should be a bit real-ey (if that’s a word).
My first thought was just to do the same analysis for price as I’m doing for timestamp and then ‘make up’ a product for each price that’s generated, but I discarded that for a couple of reasons: It might be consistent ‘within’ a produced dataset, but not ‘across’ data sets. And I imagine on largish sets would double count quite a bit.
So my next thought was I would create some sort of lookup table with a set of pre-defined products that persists across generation jobs, but I’m struggling with two aspects of that:
I’d need to generate the list itself. I would imagine I could filter the original dataset to unique products (it has stock codes) and then use the spread of unit costs in that list to do the same thing as I would have done with the timestamp (i.e. generate a set of products that have a similar spread of unit cost to the original data and then Faker the rest of the data).
QUESTION 2: Is that a sensible approach? Is there something smarter I could do?
When generating the transactions, I would also need some way to work out what product to select. I thought maybe I could generate some sort of bucketed histogram to work out what the frequency of purchases was within a range of costs (say $0-1, $1-2 etc). I could then use that frequency to define the probability that a given transaction's cost would fall within one of those ranges, and then randomly select a product whose cost falls within that range...
QUESTION 3: Again, is that a sensible approach? Is there a way I could do that lookup with a reasonably easy to understand tool (or at least one that’s documented in plain English :))
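The bucketed lookup described above can be sketched with the standard library alone. Everything named here is an assumption for illustration: the `CATALOGUE` of product IDs and costs is made up, the bucket width of 1.0 mirrors the $0-1, $1-2 example, and a fixed seed is used so that repeated generation jobs give the same output.

```python
import random
from collections import Counter, defaultdict

# Hypothetical persistent product catalogue: product_id -> unit cost.
CATALOGUE = {
    "P001": 0.85, "P002": 1.25, "P003": 1.85, "P004": 2.55,
    "P005": 2.75, "P006": 3.39, "P007": 4.25, "P008": 7.65,
}

def bucket_of(cost, width=1.0):
    """Index of the cost bucket: 0 for [0, 1), 1 for [1, 2), and so on."""
    return int(cost // width)

def build_lookup(observed_costs, catalogue):
    """Return (buckets, weights, products_by_bucket) from observed transaction costs."""
    products_by_bucket = defaultdict(list)
    for pid, cost in sorted(catalogue.items()):
        products_by_bucket[bucket_of(cost)].append(pid)
    counts = Counter(bucket_of(c) for c in observed_costs)
    # Keep only buckets that contain at least one product to choose from.
    buckets = sorted(b for b in counts if b in products_by_bucket)
    weights = [counts[b] for b in buckets]
    return buckets, weights, products_by_bucket

def generate(n, observed_costs, catalogue, seed=42):
    """Generate n (product_id, item_cost) pairs following the observed bucket frequencies."""
    rng = random.Random(seed)  # same seed, same transactions
    buckets, weights, by_bucket = build_lookup(observed_costs, catalogue)
    rows = []
    for _ in range(n):
        b = rng.choices(buckets, weights=weights, k=1)[0]  # pick a cost range
        pid = rng.choice(by_bucket[b])                     # pick a product in that range
        rows.append((pid, catalogue[pid]))
    return rows

# Unit prices taken from the source dataset's sample rows.
observed = [2.55, 3.39, 2.75, 3.39, 3.39, 7.65, 4.25, 1.85]
transactions = generate(1000, observed, CATALOGUE)
```

Within a bucket this sketch picks products uniformly; weighting products by their own purchase counts would be a natural refinement if the source data supports it.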
This is all quite high level I know, but any help anyone could give me would be greatly appreciated as I’ve hit a wall with this.
Thanks in advance :)
The synthesised dataset would simply have timestamp, product_id and item_cost columns.
The source dataset looks like this:
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850,United Kingdom
536365,22752,SET 7 BABUSHKA NESTING BOXES,2,12/1/2010 8:26,7.65,17850,United Kingdom
536365,21730,GLASS STAR FROSTED T-LIGHT HOLDER,6,12/1/2010 8:26,4.25,17850,United Kingdom
536366,22633,HAND WARMER UNION JACK,6,12/1/2010 8:28,1.85,17850,United Kingdom

Using Altair on data aggregated from large datasets

I am trying to histogram counts of a large (300,000 records) temporal data set. I am for now just trying to histogram by month which is only 6 data points, but doing this with either json or altair_data_server storage makes the page crash. Is this impossible to handle well with pure Altair? I could of course preprocess in pandas, but that ruins the wonderful declarative nature of altair.
If so is this a missing feature of altair or is it out of scope? I'm learning that vegalite stores the entire underlying data and applies the transformation at run time, but it seems like altair could (and maybe does) have a way to store only the relevant data for the chart.
alt.Chart(df).mark_bar().encode(
    x=alt.X('month(timestamp):T'),
    y='count()'
)
Altair charts work by sending the entire dataset to your browser and processing it in the frontend; for this reason it does not work well for larger datasets, no matter how the dataset is served to the frontend.
In cases like yours, where you are aggregating the data before displaying it, it would in theory be possible to do that aggregation in the backend, and only send aggregated data to the frontend renderer. There are some projects that hope to make this more seamless, including scalable Vega and altair-transform, but neither approach is very mature yet.
In the meantime, I'd suggest doing your aggregations in Pandas, and sending the aggregated data to Altair to plot.
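A minimal sketch of that pre-aggregation step, assuming a `timestamp` column as in the question. The generated `df` is a hypothetical stand-in for the real 300,000-row frame, and the `monthly` and `count` names are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the 300,000-row frame from the question.
df = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=300_000, freq="min"),
})

# Count rows per calendar month, so only a handful of rows reach the chart.
monthly = (
    df.groupby(df["timestamp"].dt.to_period("M"))
      .size()
      .rename("count")
      .reset_index()
)
# Convert the monthly periods back to timestamps so they plot as temporal values.
monthly["timestamp"] = monthly["timestamp"].dt.to_timestamp()
```

The small `monthly` frame can then be passed to `alt.Chart(monthly)` with `x='timestamp:T'` and `y='count:Q'`, which keeps the chart specification declarative while sending only one row per month to the browser.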
Edit 2023-01-25: VegaFusion addresses this problem by automatically pre-aggregating the data on the server and is mature enough for production use. Version 1.0 is available under the same license as Altair.
Try the following:
alt.data_transformers.enable('default', max_rows=None)
and then
alt.Chart(df).mark_bar().encode(
    x=alt.X('month(timestamp):T'),
    y='count()'
)
You will get the chart, but make sure to save all of your work first in case the browser crashes.
Using the following works for me:
alt.data_transformers.enable('data_server')

Machine learning - generate new data from current dataset

I have created a dataset from some sensor measurements and some labels and did some classification on it with good results. However, since the amount of data in my dataset is relatively small (1400 examples), I want to generate more data based on this data. Each row from my dataset consists of 32 numeric values and a label.
Which would be the best approach to generate more data based on the existing dataset I have? So far I have looked at Generative Adversarial Networks and Autoencoders, but I don't think these methods are suitable in my case.
Until now I have worked in Scikit-learn but I could use other libraries as well.
The keyword here is data augmentation. You take your available data and modify it slightly to generate additional data that is a little different from your source data.
Please take a look at this link. The author uses data augmentation to rotate and flip a cat image, generating 6 additional images with different perspectives from a single source image.
If you transfer this idea to your sensor data, you can add some kind of random noise to your data to enlarge the dataset. You can find a simple example of data augmentation for time series data here.
Another approach is to window the data and move the window by a small step, so that the data in each window is a little different.
There is also a discussion of this on the Statistics Stack Exchange. Please check it for additional information.
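Both ideas can be sketched in a few lines of standard-library Python. The function names, the noise level `sigma`, and the toy data below are assumptions for illustration, not values taken from the question.

```python
import random

def jitter(row, sigma, rng):
    """Return a copy of one sample with Gaussian noise added to every feature."""
    return [x + rng.gauss(0.0, sigma) for x in row]

def augment(samples, labels, copies=3, sigma=0.01, seed=0):
    """Append `copies` noisy variants of each (sample, label) pair to the originals."""
    rng = random.Random(seed)
    out_x, out_y = list(samples), list(labels)
    for row, label in zip(samples, labels):
        for _ in range(copies):
            out_x.append(jitter(row, sigma, rng))
            out_y.append(label)  # assumes small noise does not change the label
    return out_x, out_y

def sliding_windows(series, size, step):
    """Split a 1-D series into overlapping windows of `size`, advancing by `step`."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, step)]

# Hypothetical data: 4 samples with 32 features each, plus a short raw series.
X = [[0.1 * i + j for j in range(32)] for i in range(4)]
y = [0, 1, 0, 1]
X_aug, y_aug = augment(X, y)
windows = sliding_windows(list(range(10)), size=4, step=2)
```

The noise scale should be small relative to the real sensor noise, and augmentation should be applied only to the training split so that noisy copies of test samples do not leak into training.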
