Data (excel) prepping before analyzing with Python

Data (excel) prepping before analyzing with Python - python

Disclaimer: I'm very new to DA, like super new.
I'm working on my first project from scratch. It's not that big of a data set, it's a little bit over a thousand rows in excel, but very messy. They are arbitrarily combined from several sheets but the format is not unified. Which makes me wonder what is the data prepping process like before working in python. And is it more economically efficient to just manually clean up in excel first or it could be done in Python with less time?
I mean of course it depends on the size of the data and how skillful the analyst is. But I wanted to know what are the normal routes analysts take as in preparation, in consideration of efficiency and data integrity.
Many thanks!

Related

Are temporary files generated while working with pandas data frames

Until now, I have always used SAS to work with sensitive data. It would be nice to change to Python instead. However, I realize I do not understand how the data is handled during processing in pandas.
While running SAS, one knows exactly where all the temporary files are stored (hence it is easy to keep these in an encrypted container). But what happens when I use pandas data frames? I think I would not even notice, if the data left my computer during processing.
The size of the mere flat files, of which I typically have dozens to merge, are a couple of Gb. Hence I cannot simply rely on the hope, that everything will be kept in the RAM during processing - or can I? I am currently using a desktop with 64 Gb RAM.

If it's a matter of life and death, I would write the data merging function in C. This is the only way to be 100% sure of what happens with the data. The general philosophy of Python is to hide whatever happens "under the hood", this does not seem to fit your particular use case.

Most efficient way to store pandas Dataframe on disk for frequent access?

I am working on an application which generates a couple of hundred datasets every ten minutes. These datasets consist of a timestamp, and some corresponding values from an ongoing measurement.
(Almost) Naturally, I use pandas dataframes to manage the data in memory.
Now I need to do some work with history data (eg. averaging or summation over days/weeks/months etc. but not limited to that), and I need to update those accumulated values rather frequently (ideally also every ten minutes), so I am wondering which would be the most access-efficient way to store the data on disk?
So far I have been storing the data for every ten minute interval in a separate csv-file and then read the relevant files into a new dataframe as needed. But I feel that there must be a more efficient way, especially when it comes to working with a larger amount of datasets. Although computation cost and memory are not the central issue, as I am running the code on a comparatively powerful machine, but I still don't want to (and most likely, can't afford to) read all the data into memory every time.
It seems to me that the answer should lie within the built-in serialization functions of pandas, but from the docs and my google findings I honestly can't really tell which would fit my needs best.
Any ideas how I could manage my data better?

Save periodically gathered data with python

I periodically receive data (every 15 minutes) and have them in an array (numpy array to be precise) in python, that is roughly 50 columns, the number of rows varies, usually is somewhere around 100-200.
Before, I only analyzed this data and tossed it, but now I'd like to start saving it, so that I can create statistics later.
I have considered saving it in a csv file, but it did not seem right to me to save high amounts of such big 2D arrays to a csv file.
I've looked at serialization options, particularly pickle and numpy's .tobytes(), but in both cases I run into an issue - I have to track the amount of arrays stored. I've seen people write the number as the first thing in the file, but I don't know how I would be able to keep incrementing the number while having the file still opened (the program that gathers the data runs practically non-stop). Constantly opening the file, reading the number, rewriting it, seeking to the end to write new data and closing the file again doesn't seem very efficient.
I feel like I'm missing some vital information and have not been able to find it. I'd love it if someone could show me something I can not see and help me solve the problem.

Saving on a csv file might not be a good idea in this case, think about the accessibility and availability of your data. Using a database will be better, you can easily update your data and control the size amount of data you store.

Editing a big database in Excel - any easy to learn language that provides array manipulation except for VBA? Python? Which library?

I am currently trying to develop macros/programs to help me edit a big database in Excel.
Just recently I successfully wrote a custom macro in VBA, which stores two big arrays into memory, in memory it compares both arrays by only one column in each (for example by names), then the common items that reside in both arrays are copied into another temporary arrays TOGETHER with other entries in the same row of the array. So if row(11) name was "Tom", and it is common for both arrays, and next to Tom was his salary of 10,000 and his phone number, the entire row would be copied.
This was not easy, but I got to it somehow.
Now, this works like a charm for arrays as big as 10,000 rows x 5 columns + another array of the same size 10,000 rows x 5 columns. It compares and writes back to a new sheet in a few seconds. Great!
But now I tried a much bigger array with this method, say 200,000 rows x 10 columns + second array to be compared 10,000 rows x 10 columns...and it took a lot of time.
Problem is that Excel is only running at 25% CPU - I checked that online it is normal.
Thus, I am assuming that to get a better performance I would need to use another 'tool', in this case another programming language.
I heard that Python is great, Python is easy etc. but I am no programmer, I just learned a few dozen object names and I know some logic so I got around in VBA.
Is it Python? Or perhaps changing the programming language won't help? It is really important to me that the language is not too complicated - I've seen C++ and it stings my eyes, I literally have no idea what is going on in those codes.
If indeed python, what libraries should I start with? Perhaps learn some easy things first and then go into those arrays etc.?
Thanks!

I have no intention of condescending but anything I say would sound like condescending, so so be it.
The operation you are doing is called join. It's a common operation in any kind of database. Unfortunately, Excel is not a database.
I suspect that you are doing NxM operation in Excel. 200,000 rows x 10,000 rows operation quickly explodes. Pick a key in N, search a row in M, and produce result. When you do this, regardless of computer language, the computation order becomes so large that there is no way to finish the task in reasonable amount of time.
In this case, 200,000 rows x 10,000 rows require about 5,000 lookup per every row on average in 200,000 rows. That's 1,000,000,000 times.
So, how do the real databases do this in reasonable amount of time? Use index. When you look into this 10,000 rows of table, what you are looking for is indexed so searching a row becomes log2(10,000). The total order of computation becomes N * log2(M) which is far more manageable. If you hash the key, the search cost is almost O(1) - meaning it's constant. So, the computation order becomes N.
What you are doing probably is, in real database term, full table scan. It is something to avoid for real database because it is slow.
If you use any real (SQL) database, or programming language that provides a key based search in dataset, your join will become really fast. It's nothing to do with any programming language. It is really a 101 of computer science.
I do not know anything about what Excel can do. If Excel provides some facility to lookup a row based on indexing or hashing, you may be able to speed it up drastically.

Ideally you want to design a database (there are many such as SQLite, PostgreSQL, MySQL etc.) and stick your data into it. SQL is the language of talking to a database (DML data manipulation language) or creating/editing the structure of the database (DDL data definition language).
Why a database? You’ll get data validation and the ability to query data with many relationships (such as One to Many, e.g. one author can have many books but you’ll have an Author table and a Book table and will need to join these).
Pandas works not just with databases but CSV and text files, Microsoft Excel, HDF5 and is great for reading and writing to these in memory structures as well as merging, joining, slicing the data. The quickest way to what you want is likely read the data you have into panda dataframes and then manipulate from there. This makes a database optional though recommended. See Pandas Merging 101 for an idea of what you can do with pandas.
Another python tool you could use is SQLAlchemy which is an ORM object relational mapper (converts say a row in an Author table to an Author class object in python). Whilst it’s important to know SQL and database principles you don’t need to use SQL statements directly when using SQLAlchemy.
Each of these areas are huge like the ocean. You can dip your toes into each but if you wade in too deep you’ll want to know how to swim. I have fist-sized books on each to give (that I’ve not finished) you a rough idea what I mean by this.
A possible roadmap may look like:
Database (optional but recommended):
Learn about relational data
Learn database design
Learn SQL
Pandas (highly recommended):
Learn to read and write data (to excel / database)
Learn to merge, join, concatenate and update a DataFrame

Database or Table Solution for Temporary Numpy Arrays

I am creating a Python desktop application that allows users to select different distributional forms to model agricultural yield data. I have the time series agricultural data - close to a million rows - saved in a SQLite database (although this is not set in stone if someone knows of a better choice). Once the user selects some data, say corn yields from 1990-2010 in Illinois, I want them to select a distributional form from a drop-down. Next, my function fits the distribution to the data and outputs 10,000 points drawn from that fitted distributional form in a Numpy array. I would like this data to be temporary during the execution of the program.
In an attempt to be efficient, I would only like to make this fit and the subsequent drawing of numbers one time for a specified region and distribution. I have been researching temporary files in Python, but I am not sure that is the best approach for saving many different Numpy arrays. PyTables also looks like an interesting approach and seems to be compatible with Numpy, but I am not sure it is good for handling temporary data. No SQL solutions, like MongoDB, seem to be very popular these days as well, which also interests me from a resume building perspective.
Edit: After reading the comment below and researching it, I am going to go with PyTables, but I am trying to find the best way to tackle this. Is it possible to create a table like below, where instead of Float32Col I can use createTimeSeriesTable() from the scikits time series class or do I need to create a datetime column for the date and a boolean column for the mask, in addition to the Float32Col below to hold the data. Or is there a better way to be going about this problem?
class Yield(IsDescription):
geography_id = UInt16Col()
data = Float32Col(shape=(50, 1)) # for 50 years of data
Any help on the matter would be greatly appreciated.

What's your use case for the temporary data? Are you just going to be reading it all in at once (and never wanting to just read in a subset)?
If so, just save the array to a temporary file (e.g. with numpy.save, or equivalently, pickle with a binary protocol). There's no need for fancier solutions in that case.
On a side note, I'd highly recommend PyTables over SQLite for storing your original time series data.
Based on what it sounds like you're doing, you're not going to need the "relational" parts of a relational database (e.g. joins). If you don't need to join or relate tables, you just need fast simple queries, and you want the data in memory as a numpy array, PyTables is an excellent option. PyTables uses HDF to store your data, which can be much more compact on disk than a SQLite database. PyTables is also considerably faster for loading large chunks of data into memory as numpy arrays.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.