Is there a way to iterate over table rows using Ibis (Impala) in Python?

I have a fairly large Ibis TableExpr for which I would like to iterate over the rows to produce a specialized file output (FASTA nucleotide sequences). Is there any way to do this with Ibis, or should I just call execute to create a pandas DataFrame for which I can call iterrows?
I cannot find anything in the API or tutorials.

You should iterate over the pandas DataFrame, as you say.
Alternatively, you should be able to get the Impyla cursor that the backend generates by calling lower-level functions than .execute(). But those functions are likely to change when we release Ibis 2.0, so your code would probably break.
Happy to receive feedback if that's something you'd be interested in. You can open an issue in the project's GitHub.
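
For the supported route, a minimal sketch might look like the following (the connection details, the table name, and the seq_id / sequence columns are assumptions for illustration, not Ibis-prescribed names):

    import ibis

    # Hypothetical connection and table; substitute your own table expression
    client = ibis.impala.connect(host="impala-host", database="genomics")
    expr = client.table("sequences")

    # execute() materializes the result set as a pandas DataFrame,
    # which can then be walked row by row to write FASTA records
    df = expr.execute()

    with open("sequences.fasta", "w") as fasta:
        for _, row in df.iterrows():
            fasta.write(f">{row['seq_id']}\n{row['sequence']}\n")

If the table is too large to fit in memory, you may want to pull it down in chunks (for example by slicing the expression with limit()) rather than in a single execute() call.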

Related

What is the correct way to upload files to a SQL Server inside my web application?

I am developing a web application in which users can upload Excel files. I know I can use the OPENROWSET function to read data from Excel into SQL Server, but I am refraining from doing so because this function requires a file path.
It seems kind of indirect, as I am uploading a file to a directory and then telling SQL Server to go look in that directory for the file, instead of just giving SQL Server the file.
The other option would be to read the Excel file into a pandas DataFrame and then use the to_sql function, but pandas' read_excel function is quite slow, and I am fairly sure the other method would be faster.
Which of these two methods is "correct" when handling file uploads from a web application?
If the first method is not frowned upon or "incorrect", then I am almost certain it is faster and will use that. I just want an experienced developer's thoughts or opinions. The web app's backend is Python and Flask.
If I am understanding your question correctly, you are trying to load the contents of an xls(x) file into a SQL Server database. This is actually not trivial to do, as depending on what is in the Excel file you might want one table, or more probably multiple tables based on the data. So I would step back for a bit and ask three questions:
What is the data I need to save, and how should that data be structured in my SQL tables? Forget about Excel at this point -- maybe just examine the first row of data and see how you need to save it.
How do I get the file into my web application? For example, when the user uploads a file you would want to use a POST form to send the file data to your server and have the server save that file (for example on S3, in a /tmp folder, or in memory for temporary processing).
Now that you know what your input is (the xls(x) file and its location) and how you need to save your data (the SQL schema), it's time to decide what the best tool for the job is. Pandas is probably not going to be a good tool unless you literally just want to load the file and dump it as-is, with minimal (if any) changes, into a single table. At this point I would suggest something like xlrd if you only have legacy .xls files, or openpyxl for .xlsx files. This way you can shape your data any way you want and handle, for example, malformed dates, empty cells (should they default to something?), mismatched types, and so on; see the sketch after this answer.
In other words, the task you're describing is not trivial at all. It will take quite a bit of planning and designing, and then quite a good deal of Python code once you have your design decided. Feel free to ask more specific questions here if you need to (for example, how to capture the POST data in a file upload, or whatever else you need help with).
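
As a rough sketch of the second and third points (the route, form-field name, upload directory, and sheet layout below are all assumptions, not a recommended design):

    import os
    from flask import Flask, request
    from openpyxl import load_workbook  # reads .xlsx; use xlrd for legacy .xls

    app = Flask(__name__)
    UPLOAD_DIR = "/tmp/uploads"  # assumed temporary location; could also be S3

    @app.route("/upload", methods=["POST"])
    def upload():
        # "file" is an assumed form-field name in the client's POST request
        uploaded = request.files["file"]
        os.makedirs(UPLOAD_DIR, exist_ok=True)
        path = os.path.join(UPLOAD_DIR, uploaded.filename)
        uploaded.save(path)

        # Walk the rows yourself so you can validate/clean before touching SQL
        wb = load_workbook(path, read_only=True)
        ws = wb.active
        rows = []
        for row in ws.iter_rows(min_row=2, values_only=True):  # skip header row
            # handle empty cells, coerce dates, check types, etc. here
            rows.append(row)

        # ...insert `rows` into SQL Server with your driver of choice (e.g. pyodbc)...
        return f"received {len(rows)} rows"

The point is that once the rows are plain Python tuples, you are free to reshape and validate them against whatever schema you settled on in the first step.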

How to improve the write speed to a SQL database using Python

I'm trying to find a better way to push data to a SQL database using Python. I have tried the
dataframe.to_sql() method and cursor.fast_executemany(),
but they don't seem to increase the speed with the data (the data is in CSV files) I'm working with right now. Someone suggested that I could use named tuples and generators to load data much faster than pandas can.
[Generally the CSV files are at least 1 GB in size, and it takes around 10-17 minutes to push one file.]
I'm fairly new to many Python concepts, so please suggest a method or at least point me to an article with more info. Thanks in advance.
If you are trying to insert the CSV as-is into the database (i.e. without doing any processing in pandas), you could use SQLAlchemy in Python to execute a "BULK INSERT [params, file, etc.]" statement. Alternatively, I've found that reading the CSVs, processing them, writing back to CSV, and then bulk inserting can be an option.
Otherwise, feel free to specify a bit more what you want to accomplish, how you need to process the data before inserting into the DB, etc.
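
As a sketch of the first suggestion, assuming SQL Server 2017+ for the FORMAT = 'CSV' option (the connection string, table name, and file path are placeholders, and the file must be readable from the SQL Server host itself):

    from sqlalchemy import create_engine, text

    # Assumed DSN and object names; adjust the driver/connection to your setup
    engine = create_engine("mssql+pyodbc://user:password@my_dsn")

    bulk_insert = text("""
        BULK INSERT dbo.my_table
        FROM 'C:\\data\\my_file.csv'     -- path as seen by the SQL Server host
        WITH (FORMAT = 'CSV', FIRSTROW = 2)
    """)

    with engine.begin() as conn:  # begin() opens a transaction and commits on success
        conn.execute(bulk_insert)

If the CSVs live on the application server rather than the database host, you would need a share the database server can read, or fall back to fast_executemany with large batches.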

Parsing a CSV into a database for an API using Python?

I'm going to use data from a .csv to train a model to predict user activity on Google Ads (impressions, clicks) in relation to the weather for a given day. I have a .csv that contains 6000+ records of this info and want to parse it into a database using Python.
I tried making a DataFrame in pandas, but for some reason the whole table isn't shown. The middle columns (there are about 7 columns, I think) and rows (numbered over 6000, as I mentioned) are replaced with '...' when I print the table, so I'm not sure whether all of the information is being stored and whether it will be usable.
My next attempt will possibly be SQLite, but since it's stored locally, will this interfere with someone else making requests to my API endpoint if I don't have the DB actively open at all times?
Thanks in advance.
If you used pd.read_csv(), I can assure you all of the info is there; pandas is just not displaying all of it.
You can check by doing something like print(df['Column_name_you_are_interested_in'].tolist()) just to make sure, though. You can also use the various count-type methods in pandas to make sure all of your lines are there.
Pandas is pretty versatile, so it shouldn't have trouble with 6,000 lines.
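
A quick way to check, assuming a hypothetical weather.csv:

    import pandas as pd

    df = pd.read_csv("weather.csv")   # assumed file name

    print(df.shape)              # (rows, columns) -- confirms everything loaded
    print(df.count())            # non-null values per column
    print(df.columns.tolist())   # every column name, nothing hidden

    # The '...' is only pandas truncating the display; widen it if you like:
    pd.set_option("display.max_columns", None)
    print(df.head())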

Python: IDE support to output a database query like a pandas DataFrame

(This may be a stupid question due to my ignorance.)
Is it possible in Visual Studio Code or PyCharm (perhaps with a plugin) to automatically output a database query, say from an SQLite source, nicely formatted like a pandas DataFrame? (So when I run the code, it will be displayed in a nicely formatted table.)
You can use .format(); there are a few different ways you could do this. I'd normally do something like this:
print('{:>{width}}'.format(i, width=len(longestResult)))
If you iterate through all your results to find the longest one, use its length as the width above, and then iterate through your results again, it'll give you a nicely padded table.
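
Putting that together, a minimal sketch with a hypothetical SQLite table and column names:

    import sqlite3

    conn = sqlite3.connect("example.db")  # assumed database file
    rows = conn.execute("SELECT name, city FROM users").fetchall()  # assumed table

    # First pass: find the widest value so every column lines up
    width = max(len(str(value)) for row in rows for value in row)

    # Second pass: print each row with right-padded cells
    for row in rows:
        print(" | ".join("{:>{w}}".format(value, w=width) for value in row))

(For a genuinely DataFrame-like display you could also load the query straight into pandas with pd.read_sql_query, but the padding approach above needs nothing beyond the standard library.)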

XML to Postgres via python/psycopg2

I have an existing Python script that loops through a directory of XML files, parsing each file using etree and inserting data at different points into a Postgres database schema using the psycopg2 module. This hacked-together script worked just fine, but now the amount of data (number and size of XML files) is growing rapidly, and the number of INSERT statements just isn't scaling. The largest table in my final database has grown to about ~50 million records from about 200,000 XML files. So my question is: what is the most efficient way to:
Parse data out of XMLs
Assemble row(s)
Insert row(s) to Postgres
Would it be faster to write all the data to a CSV in the correct format and then bulk load the final CSV tables into Postgres using the COPY FROM command?
Otherwise, I was thinking about populating some sort of temporary data structure in memory that I could insert into the DB once it reaches a certain size. I am just having trouble pinning down the specifics of how this would work.
Thanks for any insight on this topic, and please let me know if more information is needed to answer my question.
copy_from is the fastest way I have found to do bulk inserts. You might be able to get away with streaming the data through a generator, to avoid writing temporary files while keeping memory usage low.
A generator function could assemble rows out of the XML data, and you then consume that generator with copy_from. You may even want multiple levels of generators, such that one yields records from a single file and another composes those from all 200,000 files. You'd end up with a single query, which will be much faster than 50,000,000 individual INSERTs.
I wrote an answer here with links to example and benchmark code for setting up something similar.
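
A minimal sketch of that idea (the element names, columns, directory layout, and connection details are all assumptions): a generator yields one tab-separated line per record across every XML file, and the lines are fed to copy_from in large chunks so memory stays bounded and no temporary CSV is ever written.

    import glob
    import io
    import itertools

    import psycopg2
    from xml.etree import ElementTree as ET

    def records(xml_dir):
        """Yield one tab-separated line per record across all XML files."""
        for path in glob.glob(f"{xml_dir}/*.xml"):
            root = ET.parse(path).getroot()
            for rec in root.iter("record"):               # assumed element name
                yield "\t".join([
                    rec.findtext("id", default=""),        # assumed fields
                    rec.findtext("value", default=""),
                ]) + "\n"

    def copy_in_chunks(conn, lines, chunk_size=100_000):
        """Stream the generator into COPY in bounded-memory chunks."""
        with conn.cursor() as cur:
            while True:
                chunk = list(itertools.islice(lines, chunk_size))
                if not chunk:
                    break
                cur.copy_from(io.StringIO("".join(chunk)),
                              "my_table", columns=("id", "value"))  # assumed table
        conn.commit()

    conn = psycopg2.connect("dbname=mydb user=me")         # assumed DSN
    copy_in_chunks(conn, records("xml_files"))

Wrapping the generator in a small file-like object that exposes read() would take this all the way down to a single COPY statement, as described above.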
