I have an API wrapper that pulls data from a specific product. My problem is how to map the JSON data to the database (PostgreSQL). I have read up on the Pandas DataFrame, but I am unsure whether it is the right way to go. I have a few questions that I need help with.
1) Is it possible to choose which rows get into the dataframe?
2) Every row inside the dataframe needs to be inserted into two different database tables. I would need to insert ten columns into TableA, get the id of the newly inserted row, and then insert five columns (including the returned id) into TableB. How would I go about this?
3) Is it possible to specify the data types for each column in the dataframe?
4) Is it possible to rename the column names to the database field names?
5) Is it possible to iterate through specific columns and replace certain data?
Is there a specific term for what I am trying to accomplish which I can search for?
Many thanks!
1) Yes, you can. You can follow this tutorial
2) You can achieve this by following the same tutorial as before; there is also a rough sketch of the insert-and-return-id flow after this list.
3) There are three main options for converting data types in pandas:
3.1) to_numeric() - provides functionality to safely convert non-numeric types (e.g. strings) to a suitable numeric type. (See also to_datetime() and to_timedelta().)
3.2) astype() - convert (almost) any type to (almost) any other type (even if it's not necessarily sensible to do so). Also allows you to convert to categorical types (very useful).
3.3) infer_objects() - a utility method to convert object columns holding Python objects to a pandas type if possible.
4) You can simply call the .rename() method, as explained here.
5) There are at least five ways to iterate over data in pandas. Some are faster than others, but the best one depends on the case. There's a very good post on GeeksForGeeks about it.
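To make point 2 (and points 3-5) concrete, here is a rough sketch of the whole flow using psycopg2 with a RETURNING clause. All table names, column names, and connection details are made up and only illustrate the pattern, not your actual schema:

    import pandas as pd
    import psycopg2

    # Stand-in for the JSON the API wrapper returns.
    records = [
        {"productName": "Widget", "qty": "3", "price": "9.99",
         "created_at": "2020-01-01T00:00:00Z", "status": "N/A"},
    ]
    df = pd.DataFrame(records)

    # 3) force the dtypes you expect
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df["qty"] = pd.to_numeric(df["qty"])
    df["created_at"] = pd.to_datetime(df["created_at"])

    # 4) rename the dataframe columns to the database field names
    df = df.rename(columns={"productName": "product_name", "qty": "quantity"})

    # 5) replace specific values in a column
    df["status"] = df["status"].replace({"N/A": "unknown"})

    # 2) insert each row into table_a, grab the generated id, then insert into table_b
    conn = psycopg2.connect("dbname=mydb user=me")  # made-up connection string
    with conn, conn.cursor() as cur:
        for row in df.itertuples(index=False):
            cur.execute(
                "INSERT INTO table_a (product_name, quantity, price, created_at) "
                "VALUES (%s, %s, %s, %s) RETURNING id",
                # psycopg2 does not adapt NumPy scalars, so cast to plain Python types
                (row.product_name, int(row.quantity), float(row.price), row.created_at),
            )
            new_id = cur.fetchone()[0]
            cur.execute(
                "INSERT INTO table_b (table_a_id, status) VALUES (%s, %s)",
                (new_id, row.status),
            )
    conn.close()

For a plain bulk load you could also use df.to_sql with SQLAlchemy, but since you need the generated id from TableA for every row going into TableB, the explicit row loop (or doing the second insert in SQL with INSERT ... SELECT) is the simpler route.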
I hope I could help you somehow =)
Related
I have a few Pandas dataframes with several million rows each. The dataframes have columns containing JSON objects, each with 100+ fields. I have a set of 24 functions that run sequentially on the dataframes, process the JSON (for example, compute some string distance between two fields in the JSON), and return a JSON with some new fields added. After all 24 functions execute, I get a final JSON that is then usable for my purposes.
I am wondering what the best ways are to speed up processing for this dataset. A few things I have considered and read up on:
It is tricky to vectorize because many operations are not as straightforward as "subtract this column's values from another column's values".
I read some of the Pandas documentation, and a couple of options it suggests are Cython (it may be tricky to convert the string edit distance to Cython, especially since I am using an external Python package) and Numba/JIT (but that is said to be best suited to numerical computations only).
Possibly controlling the number of threads could be an option. The 24 functions can mostly operate without any dependencies on each other.
You are asking for advice, and this is not the best site for general advice, but I will nevertheless try to point a few things out.
The ideas you have already considered are not going to be helpful: neither Cython, Numba, nor threading will address the main problem, which is that the format of your data is not conducive to fast operations on it.
I suggest that you first "unpack" the JSONs that you store in the column(s?) of your dataframe. Preferably, each field of the JSON (mandatory or optional; deal with empty values at this stage) ends up being a column of the dataframe. If there are nested dictionaries, you may want to consider splitting the dataframe (particularly if the 24 functions work separately on separate nested JSON dicts). Alternatively, you should strive to flatten the JSONs.
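For the unpacking step, a minimal sketch with pd.json_normalize, assuming a hypothetical column named "payload" that already holds parsed dicts (run json.loads on it first if it holds strings):

    import pandas as pd

    # Hypothetical input: one column of parsed JSON objects.
    df = pd.DataFrame({
        "payload": [
            {"name": "a", "price": "9.99", "details": {"colour": "red", "weight": 1.2}},
            {"name": "b", "price": "19.99", "details": {"colour": "blue", "weight": 2.0}},
        ]
    })

    # Flatten each JSON object into ordinary columns; nested dicts become dotted column names.
    flat = pd.json_normalize(df["payload"].tolist())
    print(flat.columns.tolist())  # ['name', 'price', 'details.colour', 'details.weight']

From there each of the 24 functions can read plain columns such as flat["details.colour"] instead of re-parsing JSON on every call.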
Convert to the data format that gives you the best performance. JSON stores all data as text, while numbers are best handled in their binary format. You can do that column-wise on the columns you suspect should be converted, using df['col'].astype(...) (it works on the whole dataframe too).
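Continuing the hypothetical example above, the text-to-binary conversion could look like this:

    # Numbers that arrived as JSON text become real numeric dtypes;
    # errors="coerce" turns unparseable values into NaN instead of raising.
    flat["price"] = pd.to_numeric(flat["price"], errors="coerce")
    flat["details.weight"] = flat["details.weight"].astype("float64")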
Update the 24 functions to operate not on JSON strings stored in the dataframe but on the fields (columns) of the dataframe.
Recombine the JSONs for storage (I assume you need them in this format). At this stage the implicit conversion from numbers to strings will occur.
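On the hypothetical flattened frame from above, the recombination could be as simple as serialising each row back into a JSON string (Series.to_json copes with NumPy scalar types that json.dumps would reject):

    # One JSON object per row. The dotted keys stay flat, so rebuild nested
    # dicts first if the original structure has to be preserved exactly.
    flat["payload_json"] = flat.apply(lambda row: row.to_json(), axis=1)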
Given the level of detail you provided in the question, the suggestions are necessarily brief. Should you have more detailed questions on any of the above points, it would be best to ask a maximally simple question about each of them (preferably containing a self-sufficient MWE).
I'm writing a function that (hopefully) simplifies a complex operation for other users. As part of this, the user passes in some dataframes and an arbitrary boolean Column expression computing something from those dataframes, e.g.
(F.col("first")*F.col("second").getItem(2) < F.col("third")) & (F.col("fourth").startswith("a")).
The dataframes may have dozens of columns each, but I only need the result of this expression, so it should be more efficient to select only the relevant columns before the tables are joined. Is there a way, given an arbitrary Column, to extract the names of the source columns that the Column is computed from, i.e.
["first", "second", "third", "fourth"]?
I'm using PySpark, so an ideal solution would be contained only in Python, but some sort of hack that requires Scala would also be interesting.
Alternatives I've considered would be to require the users to pass the names of the source columns separately, or to simply join the entire tables instead of selecting the relevant columns first. (I don't have a good understanding of Spark internals, so maybe the efficiency loss isn't as big as I think.) I might also be able to do something by cross-referencing the string representation of the column with the list of column names in each dataframe, but I suspect that approach would be unreliable.
Context: I was passing what I thought was a DataFrame, df.iloc[n], into a function. Thanks to the dialogue here, I figured out this was causing errors because Pandas automatically converts a single row or column of a DataFrame into a Series, and that it is easily solved by using df.iloc[[n]] instead of df.iloc[n].
Question: My question is why does Pandas do this? Is there some performance benefit in using Series instead of DataFrames? What is the reasoning behind this automatic conversion to a Series?
As per the pandas documentation, under "Why more than one data structure?":
The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
So there is no conversion happening here; rather, an object with different properties/methods is being retrieved.
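A quick illustration of the difference:

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

    print(type(df.iloc[0]))    # <class 'pandas.core.series.Series'>, one row as the 1-D container
    print(type(df.iloc[[0]]))  # <class 'pandas.core.frame.DataFrame'>, a one-row 2-D container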
So, I am trying to pivot this data (link here) so that all the metrics/numbers are in one column, with another column being the ID column. Obviously, having a ton of data/metrics spread across a bunch of columns is much harder to compare and do calculations on than having it all in one column.
So, I know what tools I need for this: Pandas, findall, wide_to_long (or melt), and maybe stack. However, I am having a bit of difficulty putting them all in the right place.
I can easily import the data into the df and view it, but when it comes to using findall with wide_to_long to pivot the data, I get pretty confused. I am basing my idea on this example (about halfway down, where they use findall / a regex to define new column names). I am looking to create a new column for each category of the metrics (i.e. population estimate is one column and % change is another; they should not all be one column).
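Here is roughly the pattern I think I need, with made-up column names standing in for the real data behind the link:

    import pandas as pd

    # Made-up wide data: one ID column plus "<metric>_<year>" columns.
    wide = pd.DataFrame({
        "id": [1, 2],
        "pop_estimate_2020": [100, 200],
        "pop_estimate_2021": [110, 210],
        "pct_change_2020": [0.5, 0.7],
        "pct_change_2021": [10.0, 5.0],
    })

    # Pull the metric "stubs" out of the column names with a regex
    # instead of listing them by hand.
    stubs = sorted({m[0] for m in wide.columns.str.findall(r"^(.+)_\d{4}$") if m})
    # stubs == ['pct_change', 'pop_estimate']

    long = pd.wide_to_long(wide, stubnames=stubs, i="id", j="year", sep="_")
    print(long.reset_index())

As far as I understand, melt would put everything into a single value column, whereas wide_to_long keeps each metric category as its own column, which is what I'm after.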
Can someone help me set up the syntax correctly for this part? I am not good with regular expressions and pattern matching.
Thank you.
I have column-based data in a CSV file and I would like to manipulate it in several ways. People have pointed me to R because it gives you easy access to both rows and columns, but I am already familiar with Python and would rather use it.
For example, I want to be able to delete all the rows that have a certain value in one of the columns. Or I want to change all the values of one column (e.g., trim the strings). I also want to be able to aggregate rows based on common values (like a SQL GROUP BY).
Is there a way to do this in python without having to write a loop to iterate over all of the rows each time?
Look at the pandas library. It provides a DataFrame type similar to R's dataframe that lets you do the kind of thing you're talking about.
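For example (the file and column names below are made up):

    import pandas as pd

    df = pd.read_csv("data.csv")

    # Delete all rows that have a certain value in one of the columns.
    df = df[df["status"] != "obsolete"]

    # Change all the values of one column (e.g. trim the strings).
    df["name"] = df["name"].str.strip()

    # Aggregate rows on common values, like a SQL GROUP BY.
    totals = df.groupby("category", as_index=False)["amount"].sum()

None of these require writing an explicit loop over the rows.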