Importing TSV data with Python into an ArangoDB database - python

I'm currently working on a project in which I have big TSV files that I need to import into a database. I need a NoSQL database, so I chose ArangoDB. From Python I can't import TSV files into Arango, only JSON documents, although TSV files can be imported with the PowerShell tooling.
The files are around 1 GB, and once they are imported I will need to do daily updates to the database with the same TSV files, which may have been modified.
What are the best options?
1. Convert the TSV files to JSON with pandas in my Python program, then bulk import (I think this would cause memory issues).
2. Just use open() and insert the documents line by line (again a memory issue, but you may have a solution; a chunked sketch combining options 1 and 2 is shown after this question).
3. Use the PowerShell way to import the data. The only problem with that is that I use Docker, so I can't simply run a PowerShell script.
4. Use another database.
Also, what would be my best bet for the daily updates, taking into consideration the memory issues?
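
For what it's worth, here is a minimal sketch combining options 1 and 2: stream the TSV in fixed-size chunks with pandas and push each chunk through python-arango's bulk import, so the full 1 GB file never sits in memory. The host, credentials, database, collection, file name and chunk size are all assumptions for illustration, not taken from the question.

import pandas as pd
from arango import ArangoClient  # python-arango driver

# Assumed connection details -- replace with your own.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("mydb", username="root", password="password")
collection = db.collection("documents")

# Stream the ~1 GB TSV in chunks so it never loads fully into memory.
for chunk in pd.read_csv("data.tsv", sep="\t", chunksize=50_000):
    docs = chunk.to_dict(orient="records")
    # on_duplicate="update" makes re-running this usable for the daily refresh,
    # provided each document carries a stable _key derived from a unique column.
    collection.import_bulk(docs, on_duplicate="update")

For the daily updates, the upsert behaviour only helps if each row maps to a stable _key; without one, every run simply re-inserts the data.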

Related

Read and write a single cell in Excel using Python

I am looking to replace the SQL database (around 50,000 rows by 50 columns) backing my app with Excel. I need to update a single cell in Excel without loading the whole workbook and then saving it again (I am using openpyxl), as that is computationally very expensive. I need an alternative that will help me save execution time.
I have tried Excel APIs like xlwings, but I need an alternative to those APIs.
I cannot comment yet, so I will "answer": why would you replace a database with Excel? Sounds crazy to me. There are plenty of other persistent storage formats out there to use: pickle, HDF5, pyarrow stuff, CSV, etc. I used the feather format for a while; it is super fast and pandas can use it natively.
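A tiny sketch of the feather suggestion (the file name is made up, and pandas needs pyarrow installed for this):

import pandas as pd

df = pd.DataFrame({"row": [1, 2], "value": ["a", "b"]})
df.to_feather("table.feather")         # fast columnar write
df = pd.read_feather("table.feather")  # fast read, supported natively by pandas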

What is the correct way to upload files to a SQL Server inside my web application?

I am developing a web application in which users can upload Excel files. I know I can use the OPENROWSET function to read data from Excel into SQL Server, but I am refraining from doing so because this function requires a file path.
It seems kind of indirect as I am uploading a file to a directory and then telling SQL Server go look in that directory for the file instead of just giving SQL Server the file.
The other option would be to read the Excel file into a pandas DataFrame and then use the to_sql function, but pandas' read_excel function is quite slow, and I am sure the other method would be faster.
Which of these two methods is "correct" when handling file uploads from a web application?
If the first method is not frowned upon or "incorrect", then I am almost certain it is faster and will use that. I just want an experienced developer's thoughts or opinions. The web app's backend is Python and Flask.
If I am understanding your question correctly, you are trying to load the contents of an xls(x) file into a SQL Server database. This is actually not trivial to do, as depending on what is in the Excel file you might want one table, or more probably multiple tables based on the data. So I would step back for a bit and ask three questions:
1. What is the data I need to save, and how should that data be structured in my SQL tables? Forget about Excel at this point -- maybe just examine the first row of data and see how you need to save it.
2. How do I get the file into my web application? For example, when the user uploads a file you would want to use a POST form, send the file data to your server, and have your server save that file (for example, on S3, in a /tmp folder, or in memory for temporary processing).
3. Now that you know what your input is (the xls(x) file and its location) and how you need to save your data (the SQL schema), it's time to decide what the best tool for the job is. pandas is probably not going to be a good tool, unless you literally just want to load the file and dump it as-is, with minimal (if any) changes, into a single table. At this point I would suggest something like xlrd if you only have xls files, or openpyxl for xlsx files. This way you can shape your data any way you want and handle, for example, malformed dates, empty cells (should they default to something?), mismatched types, etc.
In other words, the task you're describing is not trivial at all. It will take quite a bit of planning and designing, and then a good deal of Python code once you have your design decided. Feel free to ask more specific questions here if you need to (for example, how to capture the POST data from a file upload, or whatever else you need help with); a rough sketch of that flow is below.
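
To make the "capture the POST data and shape it" part more concrete, here is a hedged sketch, assuming a Flask route, openpyxl for parsing, and pyodbc for the SQL Server insert; the form field name, target table, columns and connection string are all invented for illustration, not taken from the question.

import os
import tempfile
import pyodbc
from flask import Flask, request
from werkzeug.utils import secure_filename
from openpyxl import load_workbook

app = Flask(__name__)

# Hypothetical connection string -- adjust driver, server, database and credentials.
CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=localhost;DATABASE=mydb;UID=user;PWD=secret")

@app.route("/upload", methods=["POST"])
def upload():
    # 1. Capture the uploaded workbook from the POST form (field name "file" is an assumption).
    f = request.files["file"]
    path = os.path.join(tempfile.gettempdir(), secure_filename(f.filename))
    f.save(path)

    # 2. Shape the rows with openpyxl instead of dumping the sheet as-is
    #    (skip the header row, keep only the two columns we care about).
    ws = load_workbook(path, read_only=True).active
    rows = [(r[0], r[1]) for r in ws.iter_rows(min_row=2, values_only=True)]

    # 3. Insert into a hypothetical target table with a parameterised batch insert.
    conn = pyodbc.connect(CONN_STR)
    conn.cursor().executemany(
        "INSERT INTO uploaded_data (col_a, col_b) VALUES (?, ?)", rows)
    conn.commit()
    conn.close()
    return "ok"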

How to import and read multiple large .csv files into a table in an SQL database using Python?

So, I have a large amount of data that I wish to upload to a table in MySQL. I can use MySQL's built-in data import wizard to upload each .csv file (around 90 files, ~150 MB each) into the table, but each file takes too long, and it would take months to upload all this data.
So instead, I want to use the LOAD DATA INFILE command in MySQL (which is apparently faster, according to the internet), but this is generally done for individual files. So I was wondering if I can 'loop' this SQL command using Python (Python has a module that connects to MySQL, named mysqlclient) and run it over the directory containing my data files, so that all the data gets uploaded to my table automatically, one file at a time. Unfortunately, I have not been able to come up with a syntactically correct way to do this in Python 3.6. Maybe you can help?
Other methods/commands to perform this task are also welcome.
Versions: Python 3.6; MySQL 5.7; Win 8.1;
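
A minimal sketch of looping LOAD DATA LOCAL INFILE over a directory with the mysqlclient package (imported as MySQLdb); the connection details, directory, table name and CSV layout are assumptions, and both the server and the connection need local_infile enabled.

import glob
import MySQLdb  # provided by the mysqlclient package

# local_infile=1 lets this connection send local files to the server.
conn = MySQLdb.connect(host="localhost", user="root", passwd="secret",
                       db="mydb", local_infile=1)
cur = conn.cursor()

# Loop over every CSV in the (assumed) data directory; one LOAD DATA statement per file.
for path in sorted(glob.glob(r"C:\data\*.csv")):
    # The path comes from our own directory listing, so inlining it into the SQL is fine here.
    sql = ("LOAD DATA LOCAL INFILE '{}' INTO TABLE my_table "
           "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n' "
           "IGNORE 1 LINES"  # assumes each CSV has a header row
           ).format(path.replace("\\", "/"))
    cur.execute(sql)
    conn.commit()

conn.close()

If the table created by the import wizard already matches the CSV column order this works as-is; otherwise add an explicit column list to the statement.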

Embed CSV in Excel and import the data

I wrote a tool that extracts data from a large DB and outputs it to an Excel file along with (conditional) formatting to improve readability. For this I use Python with openpyxl on a Linux machine. It works great, but this package is rather slow for writing Excel.
It seems to be a lot quicker to dump the table as (compressed) CSV, import that into Excel, and apply the formatting there using a macro/VBA.
To automate the process I'd like to create an empty Excel file pre-loaded with the required VBA to do the formatting; a template. For every data dump, the data is embedded (compressed using deflate) into the Excel file and loaded into the Workbook upon opening the document (or using a "LOAD" button to circumvent macro related security things).
However, simply adding a file into the Excel file raises an error when it is opened:
We found a problem with some content in 'Werkmap1_test_embed.xlsx'. Do you want us to try to recover as much as we can? If you trust the source of this workbook, click Yes.
Clicking Yes opens the file and shows some tracing information as XML:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<logFileName>Repair Result to Werkmap1_OLE_Word0.xml</logFileName>
<summary>Errors were detected in file '/Users/joostk/mnt/cluster/Werkmap1_OLE_Word.xlsx'</summary>
<additionalInfo>
<info>Excel completed file level validation and repair. Some parts of this workbook may have been repaired or discarded.</info>
</additionalInfo>
</recoveryLog>
Is it possible to avoid this? How would I embed a file into the Excel ZIP? Do I need to update some file table (which I could not find easily)?
When that's done, I'd like to import the data. Can I access files in the Excel ZIP from VBA? I guess not, and I need to extract the data to some temporary path and load it from there.
I have found these helpful answers elsewhere to load ZIP and plain text:
https://stackoverflow.com/a/35781621/4998990
https://stackoverflow.com/a/11267603/4998990
Many thanks for sharing your thoughts!
So my "answer" here is that this is caused by using Named Ranges, an underlying table, or an embedded Query/Connection. When you start manipulating this file you will get the error that you are talking about.
There is no harm to the file if you click "yes" and open. Excel will open this in Repaired Mode which will require you to re-save the file.
The way I've worked around this is to re-read the "repaired" file in Python and save it as another file, or replace it. Essentially you just do an extra step of re-reading the data into memory and writing it to a new file, and the error will go away. As always, test this method before deploying to production to ensure no records are lost. The way I solve it is with two lines of pandas:
import pandas as pd
# re-read the "repaired" workbook into memory
repair = pd.read_excel('PATH_TO_REPAIR_FILE')
# write it back out as a clean file (to_excel returns None, so there is nothing to assign)
repair.to_excel('PATH_TO_WHERE_NEW_FILE_GOES')

XLRD vs Win32 COM performance comparison

I have this huge Excel (xls) file that I have to read data from. I tried using the xlrd library, but it is pretty slow. I then found out that converting the Excel file to a CSV file manually and reading the CSV file is orders of magnitude faster.
But I cannot ask my client to save the xls as csv manually every time before importing the file. So I thought of converting the file on the fly, before reading it.
Has anyone done any benchmarking as to which procedure is faster:
1. Open the Excel file with the xlrd library and save it as a CSV file, or
2. Open the Excel file with the win32com library and save it as a CSV file?
I am asking because the slowest part is opening the file, so if I can get a performance boost from using win32com I would gladly try it.
If you need to read the file frequently, I think it is better to save it as CSV; otherwise, just read it on the fly.
For performance, I think win32com outperforms xlrd. However, considering cross-platform compatibility, xlrd is better.
win32com is more powerful. With it, one can handle Excel in all ways (e.g. reading/writing cells or ranges).
However, if you are seeking a quick file conversion, I think pandas.read_excel also works.
I am using another package, xlwings, so I am also interested in a comparison among these packages.
In my opinion:
1. I would use pandas.read_excel for a quick file conversion.
2. If more processing in Excel is demanded, I would choose win32com. (A rough sketch of both conversion routes follows below.)
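
For a rough sense of the two conversion routes mentioned above (file paths are made up; the win32com route only works on Windows with Excel installed):

import pandas as pd

# Route 1: pandas (cross-platform) -- read the sheet and dump it as CSV.
pd.read_excel(r"C:\data\big_file.xls").to_csv(r"C:\data\big_file_pandas.csv", index=False)

# Route 2: win32com (Windows + Excel only) -- let Excel itself do the conversion.
import win32com.client
excel = win32com.client.Dispatch("Excel.Application")
wb = excel.Workbooks.Open(r"C:\data\big_file.xls")
wb.SaveAs(r"C:\data\big_file_com.csv", FileFormat=6)  # 6 == xlCSV
wb.Close(False)
excel.Quit()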
