Easiest way to validate data between two CSV files using Python [closed]

I have two CSV files, and I would like to validate (find the differences and similarities in) the data between these two files.
I am retrieving this data from Vertica, and because the data is so large I would like to do the validation at the CSV level.

csvdiff allows you to compare the semantic contents of two CSV files, ignoring things like row and column ordering in order to get to what’s actually changed. This is useful if you’re comparing the output of an automatic system from one day to the next, so that you can look at just what’s changed.

I don't think you can directly compare sheets using openpyxl without manually looping over each row and writing your own validation code.
Whether that matters depends on your performance requirements; if speed is not a concern it can work, but it will take some additional effort.
Instead, I would use pandas DataFrames for any CSV validation needs. If you can add this dependency, comparing the files becomes much easier while performance stays high.
Here is a link to complete example:
http://pbpython.com/excel-diff-pandas.html
However, use read_csv() instead of read_excel() to read data from your files.
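For a flavor of what the pandas route can look like, here is a minimal sketch; the file names and the "id" key column are assumptions, so adjust them to your own exports:

import pandas as pd

# Assumed file names and a shared key column -- adjust to your Vertica exports
old = pd.read_csv("export_old.csv")
new = pd.read_csv("export_new.csv")
key = "id"  # hypothetical primary-key column

# Rows present in only one of the files
only_in_old = old[~old[key].isin(new[key])]
only_in_new = new[~new[key].isin(old[key])]

# Rows whose key matches but whose other columns differ
merged = old.merge(new, on=key, suffixes=("_old", "_new"))
changed_mask = pd.Series(False, index=merged.index)
for col in [c for c in old.columns if c != key]:
    changed_mask = changed_mask | (merged[col + "_old"] != merged[col + "_new"])
changed = merged[changed_mask]

print(len(only_in_old), "rows removed,", len(only_in_new), "rows added,",
      len(changed), "rows changed")

This assumes both files share a unique key column; if they don't, sorting both frames and comparing positionally (or using DataFrame.compare on equally shaped frames) is an alternative.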

Related

Using Python to convert a WAV file to a CSV file before feeding the data into an FFT for an audio spectrum analyzer [closed]

I am working on a simple audio spectrum analyzer using an FPGA. For the preprocessing part, my idea is to use Python to convert a WAV file to a CSV file, and then feed the data to a fast Fourier transform module. Is it possible to get this to work?
There are plenty of open-source modules available that do this conversion.
Just open GitHub and search for "wav to csv" and you'll find quite a lot of them, or Google a bit and you'll find plenty of answers on the same topic.
One small query, though: you basically want to convert the .wav file into time-series data, right?
In that case, I'd highly recommend going through the KDnuggets post on that subject.
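If you would rather do it yourself than pull in one of those repositories, a minimal sketch looks like this; the file names are placeholders, and it assumes an uncompressed PCM WAV that scipy.io.wavfile can read:

import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("input.wav")  # sample rate in Hz, samples as a numpy array

# Keep only the first channel if the file is stereo
if samples.ndim > 1:
    samples = samples[:, 0]

# Write a two-column time series: time in seconds, amplitude
t = np.arange(len(samples)) / rate
np.savetxt("output.csv", np.column_stack((t, samples)),
           delimiter=",", header="time_s,amplitude", comments="")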

Assign multiple datasets as one variable [closed]

I am extracting multiple datasets into one CSV file.
data = Dataset(r'C:/path/2011.daily_rain.nc', 'r')
I successfully assigned one dataset, but I still have ten more to work with in the same way. Are there any methods or functions that would allow me to assign or combine multiple datasets as one variable?
From what you've described, it sounds like you want to perform the same task on each set of data. If that is the case, then consider storing your dataset paths in an array, then using a for .. in loop to iterate through each path.
Consider the following sample code:
dataset_paths = [
    "C:/path/some_data_file-0.nc",
    "C:/path/some_data_file-1.nc",
    "C:/path/some_data_file-2.nc",
    "C:/path/some_data_file-3.nc",
    # ... and the rest of your dataset file paths
]

for path in dataset_paths:
    data = Dataset(path, 'r')
    # Code that uses the data here
Everything in the for .. in block will be run for each path defined in the dataset_paths array. This will allow you to work with each dataset in the same way.
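If you also want all of the open datasets reachable from a single variable afterwards (rather than only processing them one at a time), a small variation is to collect them in a dictionary keyed by path. This sketch assumes Dataset is netCDF4.Dataset, as in your snippet, and the extra file names are placeholders:

from netCDF4 import Dataset

dataset_paths = [
    "C:/path/2011.daily_rain.nc",
    "C:/path/2012.daily_rain.nc",  # hypothetical additional files
    # ...
]

# One variable holding every open dataset, keyed by its file path
datasets = {path: Dataset(path, "r") for path in dataset_paths}

# Later, pick out any single dataset by its path
rain_2011 = datasets["C:/path/2011.daily_rain.nc"]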

Converge multiple rows of a CSV into one [closed]

I have a CSV in which a student appears on multiple lines. The goal is to obtain a CSV where each student's name appears only once and a "Sports" column is created that gathers all the sports practiced by that student, separated by a space (like the images below).
[image: original CSV]
[image: final CSV]
I'm not going to post a full solution, as this sounds like a homework problem. If this is in fact for a school assignment, please edit your question to include that information.
From your description, the problem can be broken into three steps, each of which can be independently written as code in your solution.
Parse a CSV file
Create a new data structure that reduces the number of rows and adds a new column
Output the data to a new CSV file.
Steps 1 and 3 are the simplest. You will want to use things like with open('file', 'r'), str.split(), and ",".join().
For step 2, the problem is easier to understand if you think in terms of dictionaries. If you can turn your original data (which is a list of rows) into a dictionary of rows, then it becomes easier to detect duplicates. Every dictionary key must be unique, and you already know that you have a field (the student name) that you would like to be unique but isn't.
Your code for step 2 will iterate over the list of rows, adding each one to a dictionary using student_name as the key. If that key already exists, then instead of adding a new entry, you will need to modify the existing entry's "sports" field.
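To make the dictionary idea concrete without giving away the whole exercise, here is a minimal sketch of just the step-2 grouping, using made-up rows that stand in for the already-parsed CSV lines:

# Stand-in for the parsed CSV: one (student_name, sport) pair per original row
rows = [
    ("Alice", "tennis"),
    ("Bob", "football"),
    ("Alice", "swimming"),
]

by_student = {}
for name, sport in rows:
    # First time we see a student, start a new list; afterwards just append
    by_student.setdefault(name, []).append(sport)

# One output row per student, with the sports joined by a space
merged_rows = [(name, " ".join(sports)) for name, sports in by_student.items()]
print(merged_rows)  # [('Alice', 'tennis swimming'), ('Bob', 'football')]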

Is it beneficial to use OOP on large datasets in Python? [closed]

I'm implementing a Kalman filter on two types of measurements. I have a GPS measurement every second (1 Hz) and 100 acceleration measurements per second (100 Hz).
So basically I have two huge tables, and they have to be fused at some point. My aim is to write readable and maintainable code.
My first approach was: there is a class for each of the data tables (so an object is a data table), and I do bulk calculations in the class methods (so almost all of my methods contain a for loop), until I get to the actual filter. I found this approach a bit too stiff. It works, but there is so much data-type transformation, and it is just not that convenient.
Now I want to change my code. If I wanted to stick with OOP, my second try would be: every single measurement is an object of either GPS_measurement or acceleration_measurement. This approach seems better, but this way thousands of objects would be created.
My third try would be a data-driven design, but I'm not really familiar with this approach.
Which paradigm should I use? Or perhaps it should be solved by some kind of mixture of the above paradigms? Or should I just use procedural programming with the use of pandas dataframes?
It sounds like you would want to use pandas. OOP is a concept, by the way, not something you are forced to code in rigidly. Generally speaking, you only want to define your own classes if you plan on extending them or encapsulating certain features. pandas and numpy are two libraries that already do almost everything you could ask for with regard to data handling, and they are faster in execution.
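As an illustration of why pandas fits this problem, time-aligning the two measurement streams is essentially one library call. This is only a sketch, not your filter: it assumes both tables carry a timestamp column named t in seconds and are sorted by it, and the toy values are placeholders.

import pandas as pd

# Toy stand-ins for the two measurement tables
gps = pd.DataFrame({"t": [0.0, 1.0, 2.0], "lat": [47.0, 47.1, 47.2]})
accel = pd.DataFrame({"t": [i / 100 for i in range(300)], "ax": [0.01] * 300})

# Attach the most recent GPS fix to every acceleration sample
fused = pd.merge_asof(accel, gps, on="t", direction="backward")
print(fused.head())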

pandas dataframe to R using pyRserve [closed]

A large data frame (a couple of million rows, a few thousand columns) is created with pandas in Python. This data frame is to be passed to R using pyRserve. This has to be quick: a few seconds at most.
There is a to_json function in pandas. Is converting to and from JSON the only way for such large objects? Is it OK for objects this large?
I can always write it to disk and read it back (fast using fread, and that is what I have done), but what is the best way to do this?
Without having tried it out, to_json seems to be a very bad idea, getting worse with larger dataframes as this has a lot of overhead, both in writing and reading the data.
I'd recommend using rpy2 (which is supported directly by pandas) or, if you want to write something to disk (maybe because the dataframe is only generated once) you can use HDF5 (see this thread for more information on interfacing pandas and R using this format).
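For the rpy2 route, a minimal sketch looks like the following; pandas2ri.activate() is the classic global switch for automatic pandas/R data.frame conversion (newer rpy2 releases prefer explicit converter contexts), and the toy frame just stands in for your real one:

import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()  # enable automatic pandas <-> R data.frame conversion

df = pd.DataFrame({"x": range(5), "y": [0.1, 0.2, 0.3, 0.4, 0.5]})

# Push the frame into the embedded R session and work on it there
ro.globalenv["df"] = df
ro.r("print(summary(df))")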
