Converting Complex SQL to Python Pandas Merge and/or Join - python

I currently have a Python script that converts two pandas DataFrames to tables in an in-memory SQLite database, then reads the tables back and runs SQL against them. I would like the script to be more "Pythonic", merging and/or joining the DataFrames directly, but I am having a difficult time finding Python code examples for the equivalent of SELECTing specific (not all) columns from both tables, along with FROM, WHERE and ORDER BY clauses. I am fairly new to Python and am the guinea pig of my department, so if I can get this working, it will become a template for MANY more scripts from my partners in my work group. Actual element names have been changed due to proprietary information, but the structure is the same. Thanks in advance for the help!
SELECT
dfE.Element05 AS [Alt Element05],
dfE.Element03 AS [Alt Element03],
dfE.Element04 AS [Alt Element04],
dfN.Element03,
dfN.Element04,
dfN.Element08,
dfN.Element09,
dfN.Element10,
dfN.Element17,
dfN.Element18,
dfN.Element19,
dfN.Element20,
dfN.Element23,
dfN.Element26,
dfN.Element13
FROM dfE INNER JOIN dfN ON (dfE.Element17 = dfN.Element17) AND (dfE.Element20 = dfN.Element20)
WHERE (((dfN.Element03)<>dfE.Element03))
GROUP BY
dfE.Element05,
dfE.Element03,
dfE.Element04,
dfN.Element03,
dfN.Element04,
dfN.Element08,
dfN.Element09,
dfN.Element10,
dfN.Element17,
dfN.Element18,
dfN.Element19,
dfN.Element20,
dfN.Element23,
dfN.Element26,
dfN.Element13
ORDER BY
dfE.Element03,
dfN.Element03,
dfN.Element08

I would start by copying the DataFrames that you want to join and selecting the specific columns there. I have included "Element17" & "Element20" from dfE because you need them when joining.
Ex.
df1 = dfE[['Element05','Element03','Element04','Element17','Element20']].copy()
In order to rename the columns use the following:
df1.rename(columns={'Element05':'Alt Element05','Element03':'Alt Element03','Element04':'Alt Element04'},inplace=True)
Once you have the other df set up (let's name it df2), you would use pd.merge() to join them as you would in SQL. *When using pd.merge, the columns on which you are joining must have the same name in both DataFrames or it won't work! (Let's say df1['A'] shares the same data as df2['B'] and you want to join the DataFrames. You would have to rename the column in one of the DataFrames so that it matches the column name in the DataFrame you are joining to.)
Ex.
df3 = pd.merge(df1,df2,how='inner',on=['Element17','Element20'])
For the WHERE clause I would do the following.
df3 = df3[df3['Alt Element03']!=df3['Element03']]
For ORDER BY you could use .sort_values(), but I'm not comfortable giving you advice on how to use it as I haven't used it much.
I hope this helps! Let me know if you have questions.
*** This might not be the best way to do things. I apologize beforehand if I'm leading you to develop bad habits. I'm new to python as well!
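Putting the steps above together, here is a minimal runnable sketch of the whole query (the column names follow the question; the tiny stand-in data is made up for illustration). Note that a GROUP BY with no aggregate functions, as in the original SQL, only de-duplicates rows, so drop_duplicates() stands in for it:

```python
import pandas as pd

# Tiny stand-in frames with the columns the query touches (names from the question).
dfE = pd.DataFrame({
    'Element03': ['A', 'B'],
    'Element04': [1, 2],
    'Element05': ['x', 'y'],
    'Element17': [10, 20],
    'Element20': [100, 200],
})
dfN = pd.DataFrame({
    'Element03': ['B', 'B'],
    'Element04': [3, 4],
    'Element08': [5, 6],
    'Element17': [10, 20],
    'Element20': [100, 200],
})

# SELECT specific columns and rename them (the AS [...] aliases)
left = dfE[['Element05', 'Element03', 'Element04', 'Element17', 'Element20']].rename(
    columns={'Element05': 'Alt Element05',
             'Element03': 'Alt Element03',
             'Element04': 'Alt Element04'})

# FROM dfE INNER JOIN dfN ON Element17 AND Element20
out = pd.merge(left, dfN, how='inner', on=['Element17', 'Element20'])

# WHERE dfN.Element03 <> dfE.Element03
out = out[out['Element03'] != out['Alt Element03']]

# GROUP BY with no aggregates is just DISTINCT
out = out.drop_duplicates()

# ORDER BY dfE.Element03, dfN.Element03, dfN.Element08
out = out.sort_values(['Alt Element03', 'Element03', 'Element08'])
print(out)
```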

Related

How to parallelize a python code that has two different pandas dataframes?

I have two dataframes and have a code to extract some data from one of the dataframes and add to the other dataframe:
sales= pd.read_excel("data.xlsx", sheet_name = 'sales', header = 0)
born= pd.read_excel("data.xlsx", sheet_name = 'born', header = 0)
bornuni = born.number.unique()
for babies in bornuni:
    baby_rows = born[born["number"] == babies]
    for i, r in sales.iterrows():
        if r["number"] == babies:
            sales.loc[i, 'ini_weight'] = baby_rows["weight"].iloc[0]
            sales.loc[i, 'ini_date'] = baby_rows["date of birth"].iloc[0]
This is pretty inefficient with bigger data sets, so I want to parallelize this code, but I don't have a clue how to do it. Any help would be great. Here is a link to a mock dataset.
So before worrying about parallelizing, I can't help but notice that you're using lots of for loops to deal with the dataframes. Dataframes are pretty fast when you use their vectorized capabilities.
I see a lot of inefficient use of pandas here, so maybe we first fix that and then worry about throwing more CPU cores at it.
It seems to me you want to accomplish the following:
For each unique baby id number in the born dataframe, you want to update the ini_weight and ini_date fields of the corresponding entry in the sales dataframe.
There's a good chance that you can use some dataframe merging / joining to help you with that, as well as using the pivot table functionality:
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
I strongly suggest you take a look at those, try using the ideas from these articles, and then reframe your question in terms of these operations, because as you correctly notice, looping over all the rows repeatedly to find the row with some matching index is very inefficient.
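As a concrete illustration of the merge-based idea those articles describe, the whole nested loop collapses into a single left merge. This is a hedged sketch: the column names number, weight and date of birth come from the question's code, while the stand-in data and the name first_per_baby are made up:

```python
import pandas as pd

# Stand-in data shaped like the question's two sheets.
born = pd.DataFrame({
    'number': [1, 2],
    'weight': [3.2, 2.9],
    'date of birth': ['2020-01-01', '2020-02-01'],
})
sales = pd.DataFrame({'number': [2, 1, 2], 'price': [10, 20, 30]})

# Keep one row per baby number, mirroring the .iloc[0] in the loop,
# and rename to the target column names.
first_per_baby = (born.drop_duplicates(subset='number')
                      [['number', 'weight', 'date of birth']]
                      .rename(columns={'weight': 'ini_weight',
                                       'date of birth': 'ini_date'}))

# One vectorized left merge replaces both for loops.
sales = sales.merge(first_per_baby, on='number', how='left')
print(sales)
```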

Having Difficulty Merging Dataframes on Pandas

Trying to merge two dataframes of hockey data; both have player names, which is what I am trying to merge on. Mind you, the one with salary data only has 500 rows or so and the primary dataframe has 2000+ (if that makes a difference). Trying to merge them on name where applicable, but the new df created has no rows of data in it.
Wanted to merge wherever it made sense to (ie. where both had salary data for a given player)
Let me know if something is not clear, or how to upload more info as needed; I'm not seeing an option that makes uploading the tables possible, but I can otherwise include more insight/info that may make my situation clearer when you're trying to help.
Thanks for what input you can kindly provide, enjoy your weekend.
Dataframes I am looking to merge on player names
When trying to merge the dataframes, I am simply trying to do so as follows
df = pd.merge(hdf, sdf, on='Player')
First reset sdf's index, because right now the player name is the index, not a column:
df = pd.merge(hdf, sdf.reset_index(), on='Player')
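A minimal illustration of why the reset matters (hdf and sdf here are made-up stand-ins for the hockey and salary frames):

```python
import pandas as pd

hdf = pd.DataFrame({'Player': ['Crosby', 'Ovechkin'], 'Goals': [40, 50]})
# sdf keeps player names in the index, not in a 'Player' column.
sdf = pd.DataFrame({'Salary': [8.7, 9.5]},
                   index=pd.Index(['Crosby', 'Ovechkin'], name='Player'))

# reset_index() turns the named index back into a regular column,
# so merging on 'Player' can find it.
df = pd.merge(hdf, sdf.reset_index(), on='Player')
print(df)
```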

Is it efficient to join multiple dataframes using chained "join", rather than merge or concat?

Not being an expert in code efficiency (yet) & best pythonic code writing (yet), I would like to ask the experts here if the following code is the best to join dataframes that have a common Date Index, or if merge or concat may be better:
data = df1.join(df2).join(df3).join(df4).join(df5).dropna()
I used the .dropna() suffix at the end to cancel out rows where a single NaN occurs.
NB: the reason why NaN occurs in this dataset is because I have created dataframes that are in fact shifted versions of other dataframes (using .shift(n) ), which means that NaNs creep in at the head of the shifted dataframes.
I intend to use this code in many other applications, so wanted to use the best possible methodology (i.e. not make unnecessary use of memory, take too much time to process, use the correct join/merg/concat constructs).
It should be more efficient to do:
data = df1.join([df2, df3, df4, df5], how='inner')
This will merge all the dataframes in one go. It will also exclude any row that does not have values across all dataframes (so no need for dropna()). The default for how is 'left', which produces a row for every row in the calling dataframe, filling in any missing values with NaN. However, if any of the dataframes had NaN values in them before the join then you will still need to use dropna().
You can also use on=... to choose which column(s) to join the dataframes on if you don't want to use the dataframes indexes.
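A small runnable sketch of the single-call join; the date index and the shifted frame are assumptions matching the question's setup:

```python
import pandas as pd

idx = pd.date_range('2021-01-01', periods=4, freq='D')
df1 = pd.DataFrame({'a': range(4)}, index=idx)
df2 = pd.DataFrame({'b': range(4)}, index=idx)
df3 = pd.DataFrame({'c': range(4)}, index=idx).shift(1)  # NaN creeps in at the head

# One call joins everything on the shared index; the shift-induced NaN
# still needs dropna(), since it was in df3 before the join.
data = df1.join([df2, df3], how='inner').dropna()
print(data)
```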

pyspark: join using schema? Or converting the schema to a list?

I am using the following code to join two data frames:
new_df = df_1.join(df_2, on=['field_A', 'field_B', 'field_C'], how='left_outer')
The above code works fine, but sometimes df_1 and df_2 have hundreds of columns. Is it possible to join using the schema instead of manually adding all the columns? Or is there a way that I can transform the schema into a list? Thanks a lot!
You can't join on schema, if what you meant was somehow having join incorporate the column dtypes. What you can do is extract the column names out first, then pass them through as the list argument for on=, like this:
join_cols = df_1.columns
df_1.join(df_2, on=join_cols, how='left_outer')
Now obviously you will have to edit the contents of join_cols to make sure it only has the names you actually want to join df_1 and df_2 on. But if there are hundreds of valid columns that is probably much faster than adding them one by one. You could also make join_cols an intersection of df_1 and df_2 columns, then edit from there if that's more suitable.
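The intersection idea can be sketched with plain lists (the field names here are made up; in practice you would read them from df_1.columns and df_2.columns):

```python
# Hypothetical column lists standing in for df_1.columns / df_2.columns.
cols_1 = ['field_A', 'field_B', 'field_C', 'only_in_1']
cols_2 = ['field_A', 'field_B', 'field_C', 'only_in_2']

# Keep df_1's order; keep only names present in both frames.
join_cols = [c for c in cols_1 if c in cols_2]
print(join_cols)  # then: df_1.join(df_2, on=join_cols, how='left_outer')
```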
Edit: Although I should add that the Spark 2.0 release is literally any day now, and I haven't versed myself in all the changes yet. So that might be worth looking into as well; it may provide a future solution.

Python Pandas - Main DataFrame, want to drop all columns in smaller DataFrame

I have a DataFrame ('main') that has about 300 columns. I created a smaller DataFrame ('public') and have been working on this.
I now want to delete the columns contained within 'public' from the larger DataFrame ('main').
I've tried the following instructions:
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.drop.html
Python Pandas - Deleting multiple series from a data frame in one command
without any success, along with various other statements that have been unsuccessful.
The columns that make up 'public' are not consecutive - i.e. they are taken from various points in the larger DataFrame 'main'. All of the columns have the same Index. [Not sure if this is important, but 'public' was created using the 'join' function].
Yes, I'm being lazy - I don't want to have to type out the names of every column! I'm hoping there's a way to use the DataFrame 'public' in a statement that will allow deletion of these columns en masse. If anyone has any suggestions and/or guidance I'd be most grateful.
(Have Python 2.7 and am using Pandas, numpy, math, pylab etc.)
Thanks in advance.
Ignore my question - Murphy's Law prevails and I've just solved it.
I was using the statement from the stackoverflow question mentioned below:
df.drop(df.columns[1:], axis=1)
and this was not working. I have instead dropped the columns of the smaller frame by name:
df = df.drop(df2.columns, axis=1)
and this worked (df = main, df2 = public). Simple really once you don't overthink it.
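For reference, a tiny runnable sketch of the same idea (main and public here are made-up stand-ins for the 300-column frames in the question):

```python
import pandas as pd

main = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})
public = main[['b', 'd']]  # smaller frame built from selected, non-consecutive columns

# Drop every column of `public` from `main` by name, en masse.
main = main.drop(columns=public.columns)
print(list(main.columns))  # ['a', 'c']
```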
