Sampling from a 6 GB CSV file without loading it into memory in Python [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 7 months ago.
I have a training data set in CSV format, 6 GB in size, which I need to analyze and apply machine learning to. My system has 6 GB of RAM, so it is not possible for me to load the whole file into memory. I need to perform random sampling and load the samples from the data set. The number of samples may vary according to requirements. How can I do this?

Something to start with:
with open('dataset.csv') as f:
    for line in f:
        sample_foo(line.split(","))
This will load only one line at a time into memory, not the whole file.
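Since the question asks for random sampling specifically, reservoir sampling lets you draw a uniform sample of k rows in a single pass, without knowing the file length or keeping more than k rows in memory. A minimal sketch, assuming the file has a header row; the helper name sample_csv and the sample size are illustrative, not from the original answer:

import csv
import random

def sample_csv(path, k, seed=None):
    """Return the header plus k uniformly random rows, via reservoir sampling."""
    rng = random.Random(seed)
    reservoir = []
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)              # set the header row aside
        for i, row in enumerate(reader):
            if i < k:
                reservoir.append(row)      # fill the reservoir first
            else:
                j = rng.randint(0, i)      # keep row with probability k / (i + 1)
                if j < k:
                    reservoir[j] = row
    return header, reservoir

header, rows = sample_csv('dataset.csv', k=10_000)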

Related

Data from a Pandas DataFrame to plot a graph in Power BI using an external IDE [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 7 days ago.
I can't figure out how to use the data from a Pandas DataFrame to create a new file in Power BI and plot a graph with that data from an external IDE. The graph has to be in Power BI, hence the issue I am currently facing.
So I am wondering whether it is possible to create a Power BI file and plot the graph using only code from an external Python IDE. (I am not well versed in Power BI, so there may be many solutions I am not aware of.)
Any suggestions would also be greatly appreciated.

Using Python to convert a WAV file to a CSV file before feeding the data into an FFT for an audio spectrum analyzer [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
I am working on a simple audio spectrum analyzer using an FPGA. For the preprocessing part, my idea is to use Python to convert a WAV file to a CSV file and then feed the data into a fast Fourier transform module. Is it possible to get this to work?
There are plenty of open source modules available that do this, and GitHub repositories for the same: just open GitHub, search for "wav to csv", and you'll find quite a lot of them. A quick Google search will also turn up many answers on the topic.
One small query, though: you basically want to convert the .wav file into time series data, right?
In that case, I highly recommend going through KDnuggets' post on the same.
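For a concrete starting point, here is a minimal sketch of the conversion, assuming a PCM WAV file that scipy.io.wavfile can read; the file names are placeholders:

import numpy as np
from scipy.io import wavfile

# Read the WAV file: rate is the sample rate in Hz, data holds the samples
# (shape (n,) for mono, (n, channels) otherwise).
rate, data = wavfile.read('input.wav')
if data.ndim > 1:
    data = data[:, 0]                      # keep one channel for the FFT

# Write the time series to CSV: one sample per row with its timestamp.
t = np.arange(len(data)) / rate
np.savetxt('output.csv', np.column_stack((t, data)),
           delimiter=',', header='time,amplitude', comments='')

# Quick magnitude spectrum, e.g. to sanity-check the FPGA FFT output later.
spectrum = np.abs(np.fft.rfft(data))
freqs = np.fft.rfftfreq(len(data), d=1.0 / rate)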

How can I reduce the financial cost of working in Databricks? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I was just wondering whether anyone has any thoughts on best practices when working in Databricks. Developing within Databricks is costing a lot financially, so I would like to know where else it would be best to develop Python code. With collaborative work also in mind, is there a set-up similar to Databricks that is free or low-cost to use?
Any suggestions greatly appreciated!
The cost of Databricks is really related to the size of the clusters you are running (1 worker and 1 driver, or 1 driver and 32 workers?), the spec of the machines in the cluster (low RAM and CPU or high RAM and CPU), and how long you leave them running (always running, or a short time to live, aka "Terminate after x minutes of inactivity"). I am also assuming you are not running the always-on High Concurrency cluster mode.
Some general recommendations would be:
work with smaller datasets in dev, e.g. representative samples, which would enable you to...
work with smaller clusters in dev, e.g. instead of large 32-node clusters, work with small 2-node clusters
set the time to live to something short, e.g. 15 minutes
which together would reduce your cost.
Obviously there is a trade-off in assembling representative samples and making sure your outputs are still accurate and useful, but that's up to you.
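As an illustration of those recommendations, here is a hedged sketch of a small development cluster specification. The field names are assumed to follow the Databricks Clusters API (num_workers, autotermination_minutes, and so on), and the node type and runtime version are placeholders, so verify everything against your workspace's documentation:

# Illustrative only: a small dev cluster spec with a short time to live.
dev_cluster_spec = {
    "cluster_name": "dev-small",
    "spark_version": "<your-runtime-version>",          # placeholder
    "node_type_id": "<smallest-available-node-type>",   # placeholder
    "num_workers": 1,               # 1 driver + 1 worker rather than 32 workers
    "autotermination_minutes": 15,  # terminate after 15 minutes of inactivity
}
# This dictionary could then be submitted via the Clusters REST API or the
# databricks CLI when creating the dev cluster.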

Handling large binary files in Python [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I have a binary file (>1 GB in size) which contains single-precision data, created in MATLAB.
I am new to Python and would like to read the same file structure in Python.
Any help would be much appreciated.
From MATLAB, I can load the file as follows:
fid = fopen('file.dat','r');
my_data = fread(fid,[117276,1794],'single');
Many thanks
InP
Using numpy is easiest, with numpy.fromfile (https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromfile.html):
np.fromfile('file.dat', dtype=np.dtype('single')).reshape((117276, 1794))
where np.dtype('single') is the same as np.dtype('float32').
Note that the result may be transposed from what you want, since MATLAB fills arrays in column-major order while numpy reshapes in row-major order by default.
Also, I'm assuming that using numpy is OK since you are coming from MATLAB and will probably end up using it if you want to keep MATLAB-like functions rather than deal with pure Python, as in the answers to Reading binary file and looping over each byte.
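If it helps, here is a minimal sketch building on that one-liner, assuming the layout from the question (117276 x 1794 single-precision values written column-major by MATLAB); adjust the shape for your own data:

import numpy as np

# Read everything and reshape in column-major (Fortran) order to match MATLAB.
my_data = np.fromfile('file.dat', dtype=np.float32).reshape((117276, 1794), order='F')

# For a file this large, a memory map avoids loading it all at once; the
# transpose gives the same (117276, 1794) MATLAB-style view without copying.
my_data_mm = np.memmap('file.dat', dtype=np.float32, mode='r',
                       shape=(1794, 117276)).T

print(my_data.shape, my_data_mm.shape)   # both (117276, 1794)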

What is the average length of a Python module? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I have bits and bobs of code and I'm thinking of putting them in a Python module, but I might need a Python package.
I know it mostly has to do with how I want to divide my code.
But I still need to know what the average length (in lines) of a Python module is.
Using the following numbers, please select small, average, or big:
1,000 lines of Python
10,000 lines
50,000 lines
100,000 lines
1,000,000 lines
Please help.
A module should be the smallest independently usable unit of code. That's what modules are for: modularity, independence, take only what you need. A package should be a set of modules that functionally belong together to cover a certain problem area, e.g. statistical computations or 3D graphics.
So the number of lines is not really important. Still, I think modules of 10,000+ lines are rare. But there's no lower bound.
