I have attached the data here as an Excel file.
I need to return a DataFrame containing a list of all employees (EmployeeID, first name, middle name, last name) along with their manager's first and last name. The columns in the output DataFrame should be: EmployeeID, FirstName, MiddleName, LastName, ManagerFirstName, ManagerLastName.
Hint: Consider joining the table to itself, as managers are employees themselves.
This is the code I have so far, which is giving me duplicate records:
import numpy as np
import pandas as pd

# Creating data frame from Excel file. Enter the appropriate file path
df = pd.read_excel("Employees.xlsx")
df_new = df[['EmployeeID', 'ManagerID', 'FirstName', 'MiddleName', 'LastName']].copy()
# Convert ManagerID from object to int64, treating missing managers as 0
df_new['ManagerID'] = pd.to_numeric(df_new['ManagerID'], errors='coerce').fillna(0)
df_new['ManagerID'] = df_new['ManagerID'].astype(np.int64)
result = df_new.merge(df_new, left_on='EmployeeID', right_on='ManagerID')
print(result.head())
Any help on this would be greatly appreciated.
I think this will work:
# Sample data standing in for the Excel sheet
df = pd.DataFrame({"EmployeeID": [259, 278, 204, 78, 255],
                   "ManagerID": [278, 204, 78, 255, 259],
                   "FirstName": ["ben", "garret", "gabe", "reuben", "gordon"],
                   "MiddleName": ["T", "R", "B", "H", "L"],
                   "LastName": ["miller", "vargas", "mares", "dsa", "hee"]})
df['ManagerID'] = pd.to_numeric(df['ManagerID'], errors='coerce').fillna(0)
# Build a manager lookup: rename EmployeeID to ManagerID so the
# self-join becomes a plain key match
df_ = df[["EmployeeID", "FirstName", "LastName"]]
df_ = df_.rename(columns={"EmployeeID": "ManagerID",
                          "FirstName": "ManagerFirstName",
                          "LastName": "ManagerLastName"})
# Left join keeps every employee, even one without a manager
out = pd.merge(df, df_, on=["ManagerID"], how="left")
out = out.drop(["ManagerID"], axis=1)
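For what it's worth, the duplicates in the original attempt come from the merge direction: merging with left_on='EmployeeID', right_on='ManagerID' pairs each employee with every one of their direct reports, so an employee shows up once per report. Joining the employee's ManagerID against the lookup's key, as above, yields exactly one row per employee.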
The end goal is to read multiple .csv files into multiple DataFrames with certain names.
I want to be able to refer to each DataFrame by the name of its city for further analysis and to manipulate them separately, so it is important to achieve that rather than keeping them in a dictionary. But what ends up happening is that the last item in the dict gets assigned to every variable, so I get differently named dfs, but they all hold the same data.
lst0 = ['/User1/Research/comp_dataset/yutas_tg/Annapolis__MD_ALL.csv',
'/User1/Research/comp_dataset/yutas_tg/Apalachicola__FL_ALL.csv',
'/User1/Research/comp_dataset/yutas_tg/Atlantic_City__NJ_ALL.csv']
names_3 = ['annapolis','apalachicola','atlantic_city']
d = {}
for fname in lst0:
    d[fname] = pd.read_csv(fname)

for nm in names_3:
    for fname in lst0:
        globals()[nm] = d[fname]
What am I doing wrong?
Thank you!
Your variable naming makes no sense to me. Please name them something relevant to the values they hold.
As to your problem: the inner loop runs to completion for every nm, so each name ends up bound to d[lst0[-1]], the last file's frame; you never pair a name with its own file. Build the dict keyed by city instead:
paths = [
"/User1/Research/comp_dataset/yutas_tg/Annapolis__MD_ALL.csv",
"/User1/Research/comp_dataset/yutas_tg/Apalachicola__FL_ALL.csv",
"/User1/Research/comp_dataset/yutas_tg/Atlantic_City__NJ_ALL.csv",
]
cities = ["annapolis", "apalachicola", "atlantic_city"]
# Create one dataframe per CSV file
d = {
city: pd.read_csv(path) for path, city in zip(paths, cities)
}
# Join the frames together, adding the new `city` column
df = (
pd.concat(d.values(), keys=d.keys(), names=["city", None])
.reset_index(level=0)
.reset_index(drop=True)
)
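If you then need a per-city frame for separate manipulation, you can slice the combined frame rather than minting variables. A small sketch, reusing the city names above:

# Rows for one city, with the helper column dropped again
annapolis = df[df["city"] == "annapolis"].drop(columns="city")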
Ok, I figured it out.
Combining what Code Different suggested above, but skipping the concatenation part, I finally get variables (which are dataframes) named after the cities.
paths = [
"/User1/Research/comp_dataset/yutas_tg/Annapolis__MD_ALL.csv",
"/User1/Research/comp_dataset/yutas_tg/Apalachicola__FL_ALL.csv",
"/User1/Research/comp_dataset/yutas_tg/Atlantic_City__NJ_ALL.csv",
]
cities = ["annapolis", "apalachicola", "atlantic_city"]
# Create one dataframe per CSV file
d = {
city: pd.read_csv(path) for path, city in zip(paths, cities)
}
# Bind each city name to its dataframe
for k in d.keys():
    exec(f"{k} = d['{k}']")
I've joined or concatenated two series into a dataframe. However, one of the issues I'm now facing is that I have no column headings on the actual data that would help me do a sort.
hist_a = pd.crosstab(category_a, category, normalize=True)
hist_b = pd.crosstab(category_b, category, normalize=True)
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index])
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index])
df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
The data looks like the following:
                      0             1
category
0017817703277  0.000516  5.384341e-04
0017817703284  0.000516  5.384341e-04
0017817731348  0.000216  2.856169e-04
0017817731355  0.000216  2.856169e-04
and I'd like to do a sort, but there are no proper column headings:
df_plots = df_plots.sort_values(by=['0?'])
But the dataframe seems to be in two parts. How could I better structure the dataframe to have 'proper' columns such as '0' or 'plot a', rather than being indexable by an integer, which seems hard to work with? Something like:
category       plot a    plot b
0017817703277  0.000516  5.384341e-04
0017817703284  0.000516  5.384341e-04
0017817731348  0.000216  2.856169e-04
0017817731355  0.000216  2.856169e-04
Just rename the columns of the dataframe, for example:
df = pd.DataFrame({0:[1,23]})
df = df.rename(columns={0:'new name'})
If you have a lot of columns, you can rename all of them at once, like:
df = pd.DataFrame({0:[1,23]})
rename_dict = {key: f'Col {key}' for key in df.keys() }
df = df.rename(columns=rename_dict)
You can also define each series with its name when you create it, so you avoid renaming afterwards. Note index=hist_a.index rather than [hist_a.index]: wrapping the index in a list creates a one-level MultiIndex, which is why your frame looked like it was in two parts:
counts_a = pd.Series(np.diag(hist_a), index=hist_a.index, name='counts_a')
counts_b = pd.Series(np.diag(hist_b), index=hist_b.index, name='counts_b')
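With the names set at construction, concat picks them up as the column headers, so the sort works directly. A minimal sketch continuing the example above:

df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
print(df_plots.columns.tolist())  # ['counts_a', 'counts_b']
df_plots = df_plots.sort_values(by='counts_a')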
I am using a pandas DataFrame to store some information in my code.
Initial state of the CSV:
...............
ID,Name
...............
Adding data into the dataframe:
name_desc = {"ID": 23523223, "Name": BlahBlah}
df = df.append(name_desc, ignore_index=True)
This was my pandas dataframe upon creating the database:
....................
,ID,Name
0,23523223,BlahBlah
....................
Below is my code that searches through the ID column to locate the row with the stated ID (name_desc["ID"]).
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
The problem I encountered was that after I edited the name, I get a resultant db that looks like:
................................
Unnamed: 0 ID Name
0 0 23523223 BlahBlah
................................
If I continuously execute:
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
I get this db:
..................................
,Unnamed: 0,Unnamed: 0.1,ID,Name
0,0,0,235283335,Dinese
..................................
I can't figure out why I have extra columns being added at the front of my database as I make edits.
I think you have a problem related to the df creation. The example you provided here does not return what you are showing:
BlahBlah = 'foo'
name_desc = {"ID": 23523223, "Name": BlahBlah}
df = pd.DataFrame(data=name_desc, index=[0])
print(df.columns) # it returns an Index(['ID', 'Name'], dtype='object')
print(len(df.columns)) # it returns 2, the number of your df columns
If you can, try to find which instruction in your code adds the extra column. Otherwise, you can remove the column whose name is the empty string '' using drop. inplace=True is used to actually modify the dataframe; without it, drop returns a modified copy and leaves the original untouched:
df.drop(columns = [''], inplace = True)
Finally, I post the full example below. My assumption is that your df is somehow created with the empty column at the beginning, so I also add it to the dictionary:
BlahBlah = 'foo'
name_desc = {'':'',"ID": 23523223, "Name": BlahBlah} # I added an empty column
df = pd.DataFrame(data=name_desc, index = [0])
print(df.columns) # Index(['', 'ID', 'Name'], dtype='object')
df.drop(columns = [''],inplace = True)
df.loc[df["ID"] == name_desc["ID"], "Name"] = name_desc["Name"]
print(df.columns) # Index(['ID', 'Name'], dtype='object')
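As an aside, and this is an assumption about code we can't see: the classic source of an Unnamed: 0 column is round-tripping through CSV with the index included. If the dataframe is saved and re-read between edits, write it without the index (filename hypothetical):

df.to_csv("names.csv", index=False)

Each save-and-reload cycle that includes the index adds one more unnamed column, which matches the Unnamed: 0, Unnamed: 0.1 pattern above.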
I have a data frame like the one shown below:
df = pd.DataFrame({'subject_id':[11,11,11,12,12,12],
'test_date':['02/03/2012 10:24:21','05/01/2019 10:41:21','12/13/2011 11:14:21','10/11/1992 11:14:21','02/23/2002 10:24:21','07/19/2005 10:24:21'],
'original_enc':['A742','B963','C354','D563','J323','G578']})
hash_file = pd.DataFrame({'source_enc':['A742','B963','C354','D563','J323','G578'],
'hash_id':[1,2,3,4,5,6]})
cols = ["subject_id","test_date","enc_id","previous_enc_id"]
test_df = pd.DataFrame(columns=cols)
test_df.head()
I would like to do two things here:
1. Map original_enc to its corresponding hash_id and store it in enc_id.
2. Find the previous hash_id for each subject based on their current hash_id and store it in previous_enc_id.
I tried the below
test_df['subject_id'] = df['subject_id']
test_df['test_date'] = df['test_date']
# Translate original_enc to hash_id via a lookup Series keyed on source_enc
test_df['enc_id'] = df['original_enc'].map(hash_file.set_index('source_enc')['hash_id'])
test_df = test_df.sort_values(['subject_id','test_date'],ascending=True)
test_df['previous_enc_id'] = test_df.groupby(['subject_id','test_date'])['enc_id'].shift(1)
However, I don't get the expected output for the previous_enc_id column: it is all NA.
I expect my output to be as shown below. You see NA in the first row of every subject because that's their first encounter; there is no info to look back to.
Use only one column for groupby: grouping by both subject_id and test_date puts each row in its own group, since the pairs are unique, so shift has no earlier row to pull from and returns all NA. Group by the subject alone:
test_df['previous_enc_id'] = test_df.groupby('subject_id')['enc_id'].shift()
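One caveat on the sample data, separate from the groupby fix: test_date holds strings, so sort_values orders them lexicographically, which does not match chronological order for MM/DD/YYYY values. Converting before sorting keeps shift aligned with real time:

test_df['test_date'] = pd.to_datetime(test_df['test_date'])
test_df = test_df.sort_values(['subject_id', 'test_date'])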
I am trying to join a lot of dataframes in order to do the correlation matrix in pandas.
So it seems that I have to keep adding columns on the right-hand side, with "Date" as the index.
But when I try to do this with just 50 dataframes, it ends with a memory error.
Does anyone know what is happening?
def taking_and_combining_data_from_mysql_to_excel(root):
    saved_path = root + "\main_df.xlsx"
    main_df = pd.DataFrame()
    mycursor = mydb.cursor(buffered=True)
    for key, value in stock_dic.items():
        mycursor.execute("""SELECT date, Adj_close
                            FROM hk_stock
                            WHERE date >= '2020-03-13 00:00:00' and stock_number = '{}'""".format(key))
        row_result = mycursor.fetchall()
        df = pd.DataFrame(row_result)
        df.columns = ['Date', value]
        df.set_index('Date', inplace=True)
        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how="outer")
    with pd.ExcelWriter(saved_path) as writer:
        main_df.to_excel(writer, sheet_name="raw_data")
        main_df.corr().to_excel(writer, sheet_name="correlation")
    return main_df
Pandas is not designed for such incremental joins: each main_df.join(df) copies the whole accumulated frame, so memory use grows quadratically with the number of stocks. Append the per-stock frames to a list instead and combine them once at the end:
frames = []
for key, value in stock_dic.items():
    # ... fetch rows and build df exactly as in your function ...
    frames.append(df)
main_df = pd.concat(frames, axis=1)
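Note that pd.concat with axis=1 aligns the frames on their Date index and defaults to an outer join, so the result matches what the repeated join(..., how="outer") calls were building, just without the intermediate copies.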