Trying to merge two dataframes of hockey data, both of which have player names (the column I am trying to merge on). Mind you, the one with salary data only has 500 rows or so while the primary dataframe has 2000+, if that makes a difference. When I try to merge them on name, the new df created has no rows of data in it.
I wanted to merge wherever it made sense to (i.e. where both dataframes had salary data for a given player).
Let me know if something is not clear, or how to upload more info as needed; I'm not seeing an option to upload the tables, but I can otherwise include more insight/info that may make my situation clearer when you are trying to help.
Thanks for whatever input you can kindly provide, and enjoy your weekend.
Dataframes I am looking to merge on player names
When trying to merge the dataframes, I am simply doing so as follows:
df = pd.merge(hdf, sdf, on='Player')
First reset the sdf index, because the player name is currently the index rather than a column:
df = pd.merge(hdf, sdf.reset_index(), on='Player')
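If the merge still comes back empty after that, the names themselves probably don't match exactly; trailing whitespace or different capitalization are common culprits. A minimal sketch, with made-up rows, of normalizing the keys before merging:

import pandas as pd

# Hypothetical stand-ins for hdf (stats) and sdf (salaries).
hdf = pd.DataFrame({'Player': ['Sidney Crosby', 'Connor McDavid '],
                    'Goals': [30, 40]})
sdf = pd.DataFrame({'Player': ['sidney crosby', 'Connor McDavid'],
                    'Salary': [8700000, 12500000]})

# Normalize the keys so 'Connor McDavid ' still matches 'connor mcdavid'.
for frame in (hdf, sdf):
    frame['Player'] = frame['Player'].str.strip().str.lower()

# The default merge is inner: only players present in both frames survive.
df = pd.merge(hdf, sdf, on='Player')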
EDIT: Using advanced filtering in Excel (under the Data tab) I have been able to create a list of unique company names, and am now able to SUMIF based on the cell containing the company's name!
Disclaimer: Any python solutions would be greatly appreciated as well, pandas specifically!
I have 60,000 rows of data, containing information about grants awarded to companies.
I am planning on creating a python dictionary to store each unique company name, with their total grant $ given (agreemen_2), and location coordinates. Then, I want to display this using Dash (Plotly) on a live MapBox map of Canada.
First thing first, how do I calculate and store the total value that was awarded to each company?
I have seen SUMIF in other solutions, but am unsure how to output this to a new column, if that makes sense.
One potential solution I thought of was to create a new column of unique company names, and next to it SUMIF all the appropriate cells in col D. A rough sketch of that idea in pandas follows.
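This is only a sketch of what I have in mind, using the column names described below (recipien_2 for the company name, agreemen_2 for the grant value) on toy data:

import pandas as pd

# Toy rows; purely illustrative.
df = pd.DataFrame({'recipien_2': ['Acme', 'Folia Biotech', 'Acme'],
                   'agreemen_2': [1000.0, 2500.0, 500.0]})

# SUMIF-style new column: every row carries its company's total.
df['total_grants'] = df.groupby('recipien_2')['agreemen_2'].transform('sum')

# Or a standalone summary table with one row per company.
totals = df.groupby('recipien_2', as_index=False)['agreemen_2'].sum()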
PYTHON STUFF SO FAR
So with the below code, I take a much messier-looking spreadsheet, drop duplicates, sort based on company name, and create a new pandas dataframe with the relevant data columns:
corp_df is the cleaned up new dataframe that I want to work with.
recipien_4 is the company's unique ID number; as you can see, it repeats with each grant awarded. Folia Biotech in the screenshot shows a duplicate grant, as proven by a column I did not include in the screenshot. There are quite a few duplicates, as seen in the screenshot.
import pandas as pd

in_file = '2019-20 Grants and Contributions.csv'

# create dataframe
df = pd.read_csv(in_file)

# sort by company name
df.sort_values("recipien_2", inplace=True)

# remove duplicate grants (agreemen_1 is the grant's unique ID)
df.drop_duplicates(subset='agreemen_1', keep='first', inplace=True)

# full name, id, grant $, longitude, latitude
corp_df = df[['recipien_2', 'recipien_4', 'agreemen_2', 'longitude', 'latitude']]

# empty dict with one entry per corporation name, all values 0
corp_dict = {}
for name in corp_df['recipien_2']:
    if name not in corp_dict:
        corp_dict[name] = 0
Any tips or tricks would be greatly appreciated. .itertuples() didn't seem like a good solution, as I am unsure how to filter and compare data, or whether datatypes are preserved. But feel free to prove me wrong haha.
I thought perhaps there was a better way to tackle this problem, straight in Excel vs. iterating through rows of a pandas dataframe. This is a pretty open question so thank you for any help or direction you think is best!
I can see that you are using pandas to read the csv file, so you can use the groupby method.
You can create a new dataframe by grouping on the company name and summing the grant values, like this:
dfnew = df.groupby('recipien_2')['agreemen_2'].sum()
Then dfnew holds the totals.
Pandas groupby documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
The use of groupby followed by a sum may be the best for you:
corp_df = df.groupby(by=['recipien_2', 'longitude', 'latitude'])['agreemen_2'].sum()
# if you want to turn the index back into columns, add this afterwards:
corp_df = corp_df.reset_index()
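Putting both suggestions together, a rough end-to-end sketch (column names taken from the question; not tested against the actual csv):

import pandas as pd

df = pd.read_csv('2019-20 Grants and Contributions.csv')

# One row per company: total grant dollars plus the first coordinates seen.
corp_df = (df.groupby('recipien_2', as_index=False)
             .agg(total_grants=('agreemen_2', 'sum'),
                  longitude=('longitude', 'first'),
                  latitude=('latitude', 'first')))

# A company -> total dict, matching the corp_dict idea in the question.
corp_dict = dict(zip(corp_df['recipien_2'], corp_df['total_grants']))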
Below is the code where 5 dataframes are generated, and I want to combine them all into one. Since they have different column headers, I think appending them to the list is not retaining the header names; instead it is providing numbers.
Is there any other solution to combine the dataframes while keeping the header names as they are?
Thanks in advance!!
import pandas as pd

frames = []  # avoid shadowing the built-in name `list`
i = 0
while i < 5:
    # pytrend is assumed to be an existing pytrends session
    df = pytrend.interest_over_time()
    frames.append(df)
    i = i + 1
df_concat = pd.concat(frames, axis=1)
Do you have a common column in the dataframes that you can merge on? In that case, use the DataFrame merge function.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
I've had to do this recently with two dataframes I had, and I merged on the date column.
Are you trying to add additional columns, or append each dataframe on top of each other?
https://www.datacamp.com/community/tutorials/joining-dataframes-pandas
This link will give you an overview of the different functions you might need to use.
You can also rename the columns, if they do contain the same sort of data. Without an example of the dataframe it's tricky to know.
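If the goal is simply to keep each frame's own column names while combining, pd.concat does that as long as you concatenate the DataFrames themselves. A sketch, as I understand the pytrends API, with made-up keyword lists:

import pandas as pd
from pytrends.request import TrendReq  # assuming the pytrends package

pytrend = TrendReq()
keywords = [['python'], ['pandas'], ['numpy'], ['scipy'], ['dash']]  # made up

frames = []
for kw in keywords:
    pytrend.build_payload(kw_list=kw)
    frames.append(pytrend.interest_over_time())

# axis=1 joins on the shared date index and keeps every frame's column
# names; axis=0 would instead stack the frames on top of each other.
df_concat = pd.concat(frames, axis=1)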
I currently have a Python script which converts two pandas DataFrames to tables in an in-memory SQLite database, before reading the tables back and running SQL code on them. I would like the script to be more "Pythonic", merging and/or joining the DataFrames, but I am having a difficult time finding Python code examples for the equivalent of SELECTing specific, and not all, elements from both tables, along with FROM, WHERE and ORDER BY clauses. I am fairly new to Python, and am the Guinea Pig of my department, so if I can get this working, it will become a template for MANY more scripts from my partners in my work group. Actual element names have been changed due to proprietary information, but the structure is the same. Thanks in advance for the help!
SELECT
dfE.Element05 AS [Alt Element05],
dfE.Element03 AS [Alt Element03],
dfE.Element04 AS [Alt Element04],
dfN.Element03,
dfN.Element04,
dfN.Element08,
dfN.Element09,
dfN.Element10,
dfN.Element17,
dfN.Element18,
dfN.Element19,
dfN.Element20,
dfN.Element23,
dfN.Element26,
dfN.Element13
FROM dfE INNER JOIN dfN ON (dfE.Element17 = dfN.Element17) AND (dfE.Element20 = dfN.Element20)
WHERE (((dfN.Element03)<>dfE.Element03))
GROUP BY
dfE.Element05,
dfE.Element03,
dfE.Element04,
dfN.Element03,
dfN.Element04,
dfN.Element08,
dfN.Element09,
dfN.Element10,
dfN.Element17,
dfN.Element18,
dfN.Element19,
dfN.Element20,
dfN.Element23,
dfN.Element26,
dfN.Element13
ORDER BY
dfE.Element03,
dfN.Element03,
dfN.Element08
I would start by copying the DataFrames that you want to join and selecting the specific columns there. I have included "Element17" and "Element20" from dfE because you need them when joining.
Ex.
df1 = dfE[['Element05','Element03','Element04','Element17','Element20']].copy()
In order to rename the columns use the following:
df1.rename(columns={'Element05':'Alt Element05','Element03':'Alt Element03','Element04':'Alt Element04'},inplace=True)
Once you have the other df set up (let's name it df2) you would use pd.merge() to join them as you would in SQL. *When using pd.merge, the columns on which you are going to join have to have the same name, or you have to name them explicitly with the left_on and right_on arguments. (Let's say df1['A'] shares the same data as df2['B'] and you want to join the DataFrames. You would either rename one column so the names match, or pass left_on='A', right_on='B'.)
Ex.
df3 = pd.merge(df1, df2, how='inner', on=['Element17','Element20'])
For the WHERE clause I would do the following:
df3 = df3[df3['Alt Element03'] != df3['Element03']]
For ORDER BY you could use .sort_values(), but I'm not comfortable giving you advice on how to use it as I haven't used it much.
I hope this helps! Let me know if you have questions.
*** This might not be the best way to do things. I apologize beforehand if I'm leading you to develop bad habits. I'm new to python as well!
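Putting the pieces together, a rough sketch of the whole query (assuming dfE and dfN are the two DataFrames; since the GROUP BY has no aggregates it acts like DISTINCT, so drop_duplicates stands in for it):

import pandas as pd

# Columns pulled from each side, as in the SELECT list.
df1 = dfE[['Element05', 'Element03', 'Element04', 'Element17', 'Element20']].copy()
df1 = df1.rename(columns={'Element05': 'Alt Element05',
                          'Element03': 'Alt Element03',
                          'Element04': 'Alt Element04'})
df2 = dfN[['Element03', 'Element04', 'Element08', 'Element09', 'Element10',
           'Element17', 'Element18', 'Element19', 'Element20',
           'Element23', 'Element26', 'Element13']].copy()

# INNER JOIN on the two keys, then the WHERE, the GROUP BY, the ORDER BY.
df3 = pd.merge(df1, df2, how='inner', on=['Element17', 'Element20'])
df3 = df3[df3['Alt Element03'] != df3['Element03']]
df3 = df3.drop_duplicates()
df3 = df3.sort_values(['Alt Element03', 'Element03', 'Element08'])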
I have a master dataframe with anywhere between 750 to 3000 rows of data.
I have a daily order dataframe with anywhere from 3000 to 5000 rows of data.
If the product code of the daily order dataframe is found in the master dataframe, I get the item cost. Otherwise, it is marked as invalid and deleted.
I currently do this via 2 for loops, but I will have to do many more such comparisons and data updates (other fields to compare, other values to copy).
What is the most efficient way to do this?
I cannot make the column I am comparing the index column of the master dataframe.
In this case, the product code may be unique in the master and I could do a merge, but there are other cases where I may have to compare other values like supplier city which may not be unique.
I seem to be doing this repeatedly in all my Python codes and I want to learn the most efficient way to do this.
Order DF (screenshot of the order csv from which the Order DF is created)
Master DF (screenshot of the master csv from which the Master DF is created)
def fillVol(orderDF, mstrDF, paramC, paramF, notFound):
    orderDF['ttlVol'] = 0
    for i in range(len(orderDF)):
        found = False
        for row in mstrDF.itertuples():
            if orderDF.loc[i, paramC] == getattr(row, paramC):
                orderDF.loc[i, paramF[0]] = getattr(row, paramF[0])  # mtrl cbf
                found = True
                break
        if not found:
            notFound.append(orderDF.loc[i, paramC])
    orderDF['ttlVol'] = orderDF[paramF[0]] * orderDF[paramF[2]]
    return notFound
I am passing along the column names I am comparing and the column names I am filling with data because there are minor variations in naming across the csvs. In the data I have shared, the material volume is CBF; in some cases it is CBM.
The data columns cannot be the index because there is no unique data in any single column; it is always a combination of values that makes a row unique.
The data, in this case, is a float, so numpy could be used, but in other cases, like copying city names from a master, the data is a string. numpy was the suggestion given to other people with a similar issue.
I don't know if this is the most efficient way of doing it; as someone who started programming with Fortran and then C, I am always for basic datatypes, and this solution does not utilise them. But it is definitely a highly Pythonic solution.
orderDF = orderDF[orderDF[paramC].isin(mstrDF[paramC])]
orderDF = orderDF.reset_index(drop=True)
I use a left merge on the orderDF and mstrDF data frames to copy all relevant values:
orderDF = orderDF.merge(mstrDF.drop_duplicates(paramC, keep='last')[[paramC, paramF[0]]], on=paramC, how='left', validate='m:1')
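For the non-unique cases mentioned earlier (e.g. supplier city), the same pattern works with a list of key columns. A toy sketch with made-up frames and column names:

import pandas as pd

# Rows are only unique as a (product_code, supplier_city) combination.
master = pd.DataFrame({'product_code': ['A1', 'A1', 'B2'],
                       'supplier_city': ['Pune', 'Delhi', 'Pune'],
                       'item_cost': [10.0, 12.0, 7.5]})
orders = pd.DataFrame({'product_code': ['A1', 'B2', 'C3'],
                       'supplier_city': ['Delhi', 'Pune', 'Pune'],
                       'qty': [5, 3, 2]})

keys = ['product_code', 'supplier_city']

# Left merge on the composite key; unmatched rows come back with NaN cost.
merged = orders.merge(master.drop_duplicates(keys, keep='last'), on=keys, how='left')
not_found = merged[merged['item_cost'].isna()]  # the "invalid" rows
valid = merged.dropna(subset=['item_cost']).reset_index(drop=True)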
I am using the following code to join two data frames:
new_df = df_1.join(df_2, on=['field_A', 'field_B', 'field_C'], how='left_outer')
The above code works fine, but sometimes df_1 and df_2 have hundreds of columns. Is it possible to join using the schema instead of manually adding all the columns? Or is there a way that I can transform the schema into a list? Thanks a lot!
You can't join on a schema, if what you meant was somehow having join incorporate the column dtypes. What you can do is extract the column names first, then pass them through as the list argument for on=, like this:
join_cols = df_1.columns
df_1.join(df_2, on=join_cols, how='left_outer')
Now obviously you will have to edit the contents of join_cols to make sure it only has the names you actually want to join df_1 and df_2 on. But if there are hundreds of valid columns that is probably much faster than adding them one by one. You could also make join_cols an intersection of df_1 and df_2 columns, then edit from there if that's more suitable.
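For instance, a one-liner for that intersection (preserving df_1's column order; assuming PySpark DataFrames as in the question):

# Columns present in both frames, in df_1's order.
join_cols = [c for c in df_1.columns if c in set(df_2.columns)]
new_df = df_1.join(df_2, on=join_cols, how='left_outer')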
Edit: Although I should add that the Spark 2.0 release is literally any day now, and I haven't versed myself in all the changes yet. So that might be worth looking into as well, or it may provide a future solution.