I have several thousand Excel documents. All of these documents are 95% the same in terms of column headings. However, since they are not 100% identical, I cannot simply merge them together and upload them into a database without messing up the data.
Would anyone happen to have a library or an example that they've run into that would help?
If a large proportion of them are similar and this is a one-off operation, it may be worth your while coding the solution for the majority and handling the other documents (or groups of them, if they are similar) separately. If you use Python to do this, you could simply build a dynamic query where the columns that are present in a given Excel sheet are built into the INSERT statements. Of course, this assumes that your database table allows NULLs, or that a default value is defined, for the columns that aren't in a given document.
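As a rough illustration of the dynamic INSERT idea, here is a minimal sketch using Python's built-in sqlite3 module. The table name `people`, the column list `TABLE_COLUMNS` and the helper `insert_rows` are made up for the example, and each spreadsheet row is assumed to have already been read into a dict keyed by column heading.

```python
import sqlite3

# Hypothetical target schema: the union of columns seen across all workbooks.
TABLE_COLUMNS = ["name", "salary", "phone"]


def insert_rows(conn, rows):
    """Insert dict rows, naming only the columns each row actually has."""
    for row in rows:
        # Keep only known columns, so SQL identifiers never come from file data.
        cols = [c for c in TABLE_COLUMNS if c in row]
        if not cols:
            continue
        placeholders = ", ".join("?" for _ in cols)
        sql = f"INSERT INTO people ({', '.join(cols)}) VALUES ({placeholders})"
        conn.execute(sql, [row[c] for c in cols])
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, salary REAL, phone TEXT)")

# Two "sheets" with slightly different headings.
insert_rows(conn, [{"name": "Tom", "salary": 10000, "phone": "555-0100"}])
insert_rows(conn, [{"name": "Ann", "salary": 12000}])  # this sheet has no phone column

result = conn.execute(
    "SELECT name, salary, phone FROM people ORDER BY name"
).fetchall()
```

Columns missing from a sheet are simply left out of the statement, so the database fills them with NULL (or with the column default, if one is defined).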
Related
I have multiple Excel files with different sheets in each file. These files were made by people, so each one has a different format, a different number of columns and a different structure to represent the data.
For example, in one sheet the dataframe/table starts at the 8th row, second column. In another it starts at row 122, etc.
I want to retrieve what these Excel files have in common, namely variable names and information.
However, I don't know how I could possibly retrieve all this information without needing to parse each individual file. This is not an option because there are a lot of these files, with lots of sheets in each file.
I have been thinking about using regex as well as edit distance between words, but I don't know if that is the best option.
Any help is appreciated.
I will divide my answer into what I think you can do now, and suggestions for the future (if feasible).
An attempt to "solve" the problem you have with existing files.
Without regularity on your input files (such as at least a common name in the column), I think what you're describing is among the best solutions. Having said that, perhaps a "fancier" similarity metric between column names would be more useful than using regular expressions.
If you believe that there will be some regularity in the column names, you could look at string distances such as the Hamming distance (defined only for strings of equal length) or the Levenshtein distance, and use a threshold on the distance that works for you. As an example, let's say that you have a function d(a: str, b: str) -> float that calculates a distance between column names; you could do something like this:
# This variable is a small sample of "expected" column names.
plausible_columns = [
    'interesting column',
    'interesting',
    'interesting-column',
    'interesting_column',
]

for f in excel_files:
    # Process the file until you find the columns.
    # I'm assuming you can put the column names into
    # a variable `columns` here.
    for c in columns:
        for p in plausible_columns:
            if d(c, p) < threshold:
                # Do something to process the column:
                # add it to a pandas DataFrame, calculate the mean,
                # etc.
                break  # one match is enough; avoid processing `c` twice
If the data itself can tell you something on whether you should process it (such as having a particular distribution, or being in a particular range), you can use such features to decide on whether you should be using that column or not. Even better, you can use many of these characteristics to make a finer decision.
Having said this, I don't think a fully automated solution exists without inspecting some of the data manually and studying the distribution of the data, the variability in the names of the columns, etc.
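To make the distance function d from the snippet above more concrete, here is one possible sketch based on the Levenshtein distance, normalised by the longer string's length so that a single threshold works for short and long names. The normalisation, the lower-casing, the helper name `match_columns` and the threshold of 0.25 are all assumptions for illustration and would need tuning on real headers.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]


def d(a: str, b: str) -> float:
    """Normalised distance in [0, 1], ignoring case and outer whitespace."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def match_columns(columns, plausible_columns, threshold=0.25):
    """Return the column names that are close to any plausible name."""
    return [c for c in columns
            if any(d(c, p) < threshold for p in plausible_columns)]


headers = ["Interesting Column", "intresting_column", "date", "notes"]
matched = match_columns(headers, ["interesting column", "interesting_column"])
```

Here the misspelled header "intresting_column" is one edit away from a plausible name and is kept, while unrelated headers such as "date" fall well above the threshold.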
For the future
Even with fancy methods to calculate features and some data analysis on the data you have right now, I think it would be impossible to ensure that you will always get the data you need (by the very nature of the problem). A reasonable way to solve this, in my opinion (and if this is feasible in whatever context you're working in), is to impose a stricter format at the data generation end (I suppose this is a manual thing, with people entering data into Excel directly). I would argue that the best solution is to get rid of the problem at the root: create a unified form, or Excel sheet format, and distribute it to the people who will fill the files with data, so that the data can then be ingested automatically, minimizing the risk of errors.
I am currently trying to develop macros/programs to help me edit a big database in Excel.
Just recently I successfully wrote a custom macro in VBA that loads two big arrays into memory and compares them on a single column in each (for example, by name). Items common to both arrays are then copied into another temporary array TOGETHER with the other entries in the same row. So if the name in row(11) was "Tom", and it is common to both arrays, and next to Tom were his salary of 10,000 and his phone number, the entire row would be copied.
This was not easy, but I got to it somehow.
Now, this works like a charm for arrays as big as 10,000 rows x 5 columns + another array of the same size 10,000 rows x 5 columns. It compares and writes back to a new sheet in a few seconds. Great!
But now I tried a much bigger array with this method, say 200,000 rows x 10 columns + second array to be compared 10,000 rows x 10 columns...and it took a lot of time.
The problem is that Excel only runs at 25% CPU; I checked online and apparently this is normal.
Thus, I am assuming that to get a better performance I would need to use another 'tool', in this case another programming language.
I heard that Python is great, Python is easy etc. but I am no programmer, I just learned a few dozen object names and I know some logic so I got around in VBA.
Is it Python? Or perhaps changing the programming language won't help? It is really important to me that the language is not too complicated - I've seen C++ and it stings my eyes, I literally have no idea what is going on in those codes.
If indeed python, what libraries should I start with? Perhaps learn some easy things first and then go into those arrays etc.?
Thanks!
I have no intention of condescending but anything I say would sound like condescending, so so be it.
The operation you are doing is called join. It's a common operation in any kind of database. Unfortunately, Excel is not a database.
I suspect that you are doing an N x M operation in Excel: pick a key in N, search for the matching row in M, and produce a result. With 200,000 rows against 10,000 rows, this quickly explodes. When you do it this way, regardless of programming language, the amount of computation becomes so large that the task cannot finish in a reasonable amount of time.
In this case, each of the 200,000 rows needs about 5,000 comparisons on average (a linear search scans half of the 10,000 rows before it finds a match). That's 200,000 x 5,000 = 1,000,000,000 comparisons.
So, how do real databases do this in a reasonable amount of time? They use an index. When the 10,000-row table is indexed on the column you are looking up, finding a row takes about log2(10,000) ≈ 13 steps. The total cost becomes N * log2(M), which is far more manageable. If you hash the key, the search cost is almost O(1), meaning it is constant, so the total cost becomes proportional to N.
What you are doing is probably what databases call a full table scan. It is something real databases avoid because it is slow.
If you use any real (SQL) database, or a programming language that provides key-based lookup in a dataset, your join will become really fast. It has nothing to do with any particular programming language. It is really computer science 101.
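The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration with made-up data and function names, not the asker's macro: `nested_loop_join` does the N x M scan, while `hash_join` builds a dict (a hash index) on the smaller table once and then does one constant-time lookup per row of the bigger table.

```python
def nested_loop_join(big, small, key):
    """Full scan: compare every row of `big` with every row of `small`."""
    out = []
    for row in big:
        for other in small:
            if row[key] == other[key]:
                out.append({**other, **row})
    return out


def hash_join(big, small, key):
    """Build a dict (hash index) on `small`, then probe it once per row of `big`."""
    index = {}
    for other in small:
        index.setdefault(other[key], []).append(other)
    out = []
    for row in big:
        for other in index.get(row[key], []):
            out.append({**other, **row})
    return out


staff = [{"name": "Tom", "salary": 10000}, {"name": "Ann", "salary": 12000}]
phones = [{"name": "Tom", "phone": "555-0100"}, {"name": "Bob", "phone": "555-0199"}]

joined = hash_join(staff, phones, "name")
```

Both functions return the same rows; only the amount of work differs. In VBA, a Scripting.Dictionary keyed on the name column can play the same role as the Python dict.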
I do not know anything about what Excel can do. If Excel provides some facility to lookup a row based on indexing or hashing, you may be able to speed it up drastically.
Ideally you want to design a database (there are many, such as SQLite, PostgreSQL, MySQL etc.) and stick your data into it. SQL is the language for talking to a database (DML, data manipulation language) or creating/editing the structure of the database (DDL, data definition language).
Why a database? You’ll get data validation and the ability to query data with many relationships (such as One to Many, e.g. one author can have many books but you’ll have an Author table and a Book table and will need to join these).
Pandas works not just with databases but also with CSV and text files, Microsoft Excel and HDF5. It is great for reading these into in-memory structures and writing them back, as well as for merging, joining and slicing the data. The quickest way to what you want is likely to read the data you have into pandas DataFrames and then manipulate it from there. This makes a database optional, though recommended. See Pandas Merging 101 for an idea of what you can do with pandas.
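As a small sketch of what that looks like, the comparison described in the question (keep the rows whose name appears in both tables, together with the rest of each row) is an inner merge in pandas. The data below is made up; with real workbooks the two frames would typically come from pd.read_excel instead.

```python
import pandas as pd

# Made-up stand-ins for the two sheets.
staff = pd.DataFrame({
    "name": ["Tom", "Ann", "Bob"],
    "salary": [10000, 12000, 9000],
})
phones = pd.DataFrame({
    "name": ["Tom", "Bob", "Eve"],
    "phone": ["555-0100", "555-0199", "555-0142"],
})

# Inner join: keep only names present in both frames, with all their columns.
common = staff.merge(phones, on="name", how="inner")
```

Internally, merge matches keys by hashing rather than by comparing every pair of rows, which is why it stays fast on hundreds of thousands of rows.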
Another Python tool you could use is SQLAlchemy, which is an ORM (object-relational mapper): it converts, say, a row in an Author table to an Author class object in Python. Whilst it’s important to know SQL and database principles, you don’t need to write SQL statements directly when using SQLAlchemy.
Each of these areas is huge, like the ocean. You can dip your toes into each, but if you wade in too deep you’ll want to know how to swim. I have fist-sized books on each (that I’ve not finished), to give you a rough idea of what I mean by this.
A possible roadmap may look like:
Database (optional but recommended):
    Learn about relational data
    Learn database design
    Learn SQL
Pandas (highly recommended):
    Learn to read and write data (to Excel / database)
    Learn to merge, join, concatenate and update a DataFrame
I generally use Pandas to extract data from MySQL into a dataframe. This workflow works well for me and allows me to manipulate the data before analysis.
I'm in a situation where I have a large MySQL database (multiple tables that will yield several million rows). I want to extract the data where one of the columns matches a value in a Pandas series. This series could be of variable length and may change frequently. How can I extract data from the MySQL database where one of the columns of data is found in the Pandas series? The two options I've explored are:
1. Extract all the data from MySQL into a Pandas dataframe (using pymysql, for example) and then keep only the rows I need (using df.isin()).
or
2. Query the MySQL database using a query with multiple WHERE ... OR ... OR statements (and load this into a Pandas dataframe). This query could be generated using Python to join items of a list with ORs.
I guess both these methods would work but they both seem to have high overheads. Method 1 downloads a lot of unnecessary data (which could be slow and is, perhaps, a higher security risk) whilst method 2 downloads only the desired records but it requires an unwieldy query that contains potentially thousands of OR statements.
Is there a better alternative? If not, which of the two above would be preferred?
I am not familiar with pandas, but strictly speaking from a database point of view you could just insert your pandas values into a PANDA_VALUES table and then join that PANDA_VALUES table with the table(s) you want to grab your data from.
Assuming you have indexes in place on both the PANDA_VALUES table and the table with your column, the JOIN would be quite fast.
Of course, you will need a process in place to keep the PANDA_VALUES table updated as the business needs change.
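A minimal sketch of that idea follows, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The `measurements` table, its columns and the `wanted_ids` list are invented for the example; with MySQL the same pattern would go through a driver such as pymysql, and the SQL may need small dialect changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the large table that lives in the database.
conn.execute("CREATE TABLE measurements (sensor_id INTEGER, reading REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(1, 0.5), (2, 1.5), (3, 2.5), (4, 3.5)],
)

# Values that would come from the pandas Series,
# e.g. series.dropna().unique().tolist()
wanted_ids = [2, 4]

# Load the values into a temporary, indexed table (the PRIMARY KEY is the index).
conn.execute("CREATE TEMP TABLE panda_values (sensor_id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO panda_values VALUES (?)", [(v,) for v in wanted_ids]
)

# Let the database do the filtering with a join instead of thousands of ORs.
rows = conn.execute(
    "SELECT m.sensor_id, m.reading "
    "FROM measurements AS m "
    "JOIN panda_values AS p ON p.sensor_id = m.sensor_id "
    "ORDER BY m.sensor_id"
).fetchall()
```

Only the matching rows leave the database, and the query text stays the same no matter how many values the series holds.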
Hope it helps.
So I'm trying to store Pandas DataFrames in HDF5 and getting strange errors, rather inconsistently. At least half the time, some part of the read-process-move-write cycle fails, often with no clearer explanation than "HDF5 Read Error". Even worse, sometimes the table ends up with nonsense/corrupted data that doesn't stop things until downstream -- either values that are off by orders of magnitude (and not even correlated with the correct ones) or dates that don't make sense (recent data mismarked as being dated in the 1750s...etc).
I thought I'd go through the current process and then the things that I suspect might be causing problems or that might help. Here's what it looks like:
Read some of the tables (call them "QUERY1" and "QUERY2") to see if they're up to date, and if they aren't,
Take the table that had been in the HDF5 store as "QUERY1" and store it as "QUERY1_YYYY_MM_DD" in the HDF5 store instead
Run the associated query on external database for that table. Each one is between 100 and 1500 columns of daily data back to 1980.
Store the result of query 1 as the new "QUERY1" in the HDF5 store
Compute several transformations of one or more of QUERY1, QUERY2,...QUERYn which will have hierarchical (Pandas MultiIndex) columns. Overwrite item "Derived_Frame1"...etc with its update/replacement in the HDF5 store
Multiple people with access to the relevant .h5 file on a Windows network drive run this routine -- potentially sometimes, but not usually, at the same time.
Some things I suspect could be part of the problem:
Using the default format (df.to_hdf(store, key)) instead of insisting on "Table" format with df.to_hdf(store, key, format='table'). I do this because the default format is between 2 and 5x faster on both the read and the write according to %timeit
Using a network drive to allow several users to run this routine and access at least the derived frames. Not much I can do about this requirement, especially for read access to the derived dataframes at any time.
From the docs, it sounds like repeatedly deleting and re-writing an item in the HDF5 store can do weird things (at least gradually increasing the file size, not sure what else). Maybe I should be storing query archives in another file? Maybe I should be nuking and replacing the whole main file upon update?
Storing dataframes with MultiIndex columns in HDF5 in the first place -- this seems to be what gets me a "warning" under the default format, although it seems like the warning goes away if I use format='table'.
Edit: it is also possible/likely that different users running the routine above are using different versions of Pandas and different versions of PyTables.
Any ideas?
I have an old excel spreadsheet with a lot of data in a relational database type format, with one main primary key that I need to go through.
I want to compare some rows, but there are many entries (thousands of rows, dozens of columns) and Excel doesn't really have built-in features to do this. After looking around I found out that the best way to extract the data is a Python script, but I have no programming skills in Python or any language for that matter. I need to look for duplicates in the key column, check whether the rows sharing a key are duplicate rows, and if so merge them into a new row. Then I need a new Excel file/sheet separating the merged rows from the non-merged rows.
I don't know if this sounds too complicated or not and I am new here so I did do some research scouring the internet to see if I can find any scripts to do it but no luck really... Here are the closest posts I found that may have something to do with what I want but what I found usually is about people wanting to merge 2 different excel files together:
http://pbpython.com/excel-file-combine.html
Looking to merge two Excel files by ID into one Excel file using Python 2.7
(I have more links but could only post two.)
Basically I'm looking for duplicate rows and want to merge them together into a new file or spreadsheet in Excel, separating them from the non-dupes and putting it all back together.
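One way this could be approached with pandas is sketched below. The column names and data are invented, and "merging" is assumed here to mean taking the first non-empty value in each column for every duplicated key; a different merge rule would need a different aggregation.

```python
import pandas as pd

# Made-up stand-in for the sheet; "id" plays the role of the key column.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["Tom", "Ann", None, "Bob"],
    "phone": [None, None, "555-0100", "555-0199"],
})

# True for every row whose key occurs more than once.
is_dup = df["id"].duplicated(keep=False)

# Rows with a unique key are left untouched.
unique_rows = df[~is_dup]

# Rows sharing a key are collapsed into one row per key,
# keeping the first non-missing value found in each column.
merged_rows = df[is_dup].groupby("id", as_index=False).first()
```

The two frames can then be written to separate sheets of one workbook with pd.ExcelWriter and DataFrame.to_excel, which needs an Excel engine such as openpyxl to be installed.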