I have another problem with joining two dataframes using pandas. I want to merge a complete dataframe into a column/field of another dataframe where the foreign key field of DF2 matches the unique key of DF1.
The input data are two CSV files that look roughly like this:
CSV 1 / DF 1:
cid;name;surname;address
1;Mueller;Hans;42553
2;Meier;Peter;42873
3;Schmidt;Micha;42567
4;Pauli;Ulli;98790
5;Dick;Franz;45632
CSV 2 / DF 2:
OID;ticketid;XID;message
1;9;1;fgsgfs
2;8;2;gdfg
3;7;3;gfsfgfg
4;6;4;fgsfdgfd
5;5;5;dgsgd
6;4;5;dfgsgdf
7;3;1;dfgdhfd
8;2;2;dfdghgdh
I want each row of DF2 whose XID matches a cid of DF1 to become a single field in DF1. My final goal is to convert the above input files into a nested JSON format.
Edit 1:
Something like this:
[
{
"cid": 1,
"name": "Mueller",
"surname": "Hans",
"address": 42553,
"ticket" :[{
"OID": 1,
"ticketid": 9,
"XID": 1,
"message": "fgsgfs"
}]
},
...]
Edit 2:
Some further thoughts: Would it be possible to create a dictionary of each row in dataframe 2 and then append this dictionary to a new column in dataframe 1 where some value (xid) of the dictionary matches with a unique id in a row (cid) ?
Some pseudo code I have in my mind:
Add new column "ticket" in DF1
Iterate over rows in DF2:
    row to dictionary
    iterate over DF1:
        find row where cid == dict.XID
        append dictionary to field in "ticket"
convert DF1 to JSON
Non-Python solutions are also acceptable.
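For what it's worth, the pseudocode above can be translated almost literally (a sketch with the sample data built inline; slow for large frames, since it loops row by row):

```python
import json
import pandas as pd

# Sample data mirroring the two CSVs
df1 = pd.DataFrame({
    "cid": [1, 2, 3, 4, 5],
    "name": ["Mueller", "Meier", "Schmidt", "Pauli", "Dick"],
    "surname": ["Hans", "Peter", "Micha", "Ulli", "Franz"],
    "address": [42553, 42873, 42567, 98790, 45632],
})
df2 = pd.DataFrame({
    "OID": [1, 2, 3, 4, 5, 6, 7, 8],
    "ticketid": [9, 8, 7, 6, 5, 4, 3, 2],
    "XID": [1, 2, 3, 4, 5, 5, 1, 2],
    "message": ["fgsgfs", "gdfg", "gfsfgfg", "fgsfdgfd",
                "dgsgd", "dfgsgdf", "dfgdhfd", "dfdghgdh"],
})

# Add a new column "ticket" in DF1, one empty list per row
df1["ticket"] = [[] for _ in range(len(df1))]

# Iterate over rows in DF2: row to dictionary, append where cid == XID
for d in df2.to_dict(orient="records"):
    for idx in df1.index[df1["cid"] == d["XID"]]:
        df1.at[idx, "ticket"].append(d)

# Convert DF1 to nested JSON
records = df1.to_dict(orient="records")
nested_json = json.dumps(records)
```

Each DF1 row then carries its matching DF2 rows as a list of dicts, which serializes directly into the nested shape shown in Edit 1.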
Not sure what you expect as output, but check merge:
df1.merge(df2, left_on="cid", right_on="XID", how="left")
[EDIT based on the expected output]
Maybe something like this:
(
    df1.merge(
        df2.groupby("XID").apply(lambda g: g.to_dict(orient="records")).reset_index(name="ticket"),
        how="left", left_on="cid", right_on="XID")
    .drop(["XID"], axis=1)
    .to_json(orient="records")
)
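As a sanity check, that expression can be run against small frames built from the sample CSVs (on recent pandas the groupby(...).apply(...) step may emit a deprecation warning about the grouping columns; the result is the same):

```python
import json
import pandas as pd

df1 = pd.DataFrame({
    "cid": [1, 2, 3],
    "name": ["Mueller", "Meier", "Schmidt"],
    "surname": ["Hans", "Peter", "Micha"],
    "address": [42553, 42873, 42567],
})
df2 = pd.DataFrame({
    "OID": [1, 2, 7],
    "ticketid": [9, 8, 3],
    "XID": [1, 2, 1],
    "message": ["fgsgfs", "gdfg", "dfgdhfd"],
})

# Collapse df2 to one row per XID holding a list of record dicts,
# merge that list into df1, and serialize to nested JSON.
nested = (
    df1.merge(
        df2.groupby("XID")
           .apply(lambda g: g.to_dict(orient="records"))
           .reset_index(name="ticket"),
        how="left", left_on="cid", right_on="XID")
    .drop(["XID"], axis=1)
    .to_json(orient="records")
)
result = json.loads(nested)
```

Rows of df1 without any matching XID come back with a null ticket field rather than an empty list.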
I have the dataframe:
And I would like to obtain this result using pivot_table or an alternative function:
I am trying to transform the rows of the Custom Field column into columns with pandas' pivot_table function, and I get an error:
import pandas as pd
data = {
"Custom Field": ["CF1", "CF2", "CF3"],
"id": ["RSA", "RSB", "RSC"],
"Name": ["Wilson", "Junior", "Otavio"]
}
### create the dataframe ###
df = pd.DataFrame(data)
print(df)
df2 = df.pivot_table(columns=['Custom Field'], index=['Name'])
print(df2)
I suspect it is because I am working with strings.
Any suggestions?
Thanks in advance.
You need pivot, not pivot_table. The latter does aggregation on possibly repeating values whereas the former is just a rearrangement of the values and fails for duplicate values.
df.pivot(columns=['Custom Field'], index=['Name'])
Update as per comment: if there are multiple values per cell, you need to use pivot_table and specify an appropriate aggregate function, e.g. concatenating the string values. You can also specify a fill value for empty cells (instead of NaN):
df = pd.DataFrame({"Custom Field": ["CF1", "CF2", "CF3", "CF1"],
"id": ["RSA", "RSB", "RSC", "RSD"],
"Name": ["Wilson", "Junior", "Otavio", "Wilson"]})
df.pivot_table(columns=['Custom Field'], index=['Name'], aggfunc=','.join, fill_value='-')
id
Custom Field CF1 CF2 CF3
Name
Junior - RSB -
Otavio - - RSC
Wilson RSA,RSD - -
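The contrast can be seen in a short runnable sketch: the duplicate ('Wilson', 'CF1') pair makes pivot raise, while pivot_table aggregates it:

```python
import pandas as pd

df = pd.DataFrame({"Custom Field": ["CF1", "CF2", "CF3", "CF1"],
                   "id": ["RSA", "RSB", "RSC", "RSD"],
                   "Name": ["Wilson", "Junior", "Otavio", "Wilson"]})

# pivot refuses duplicate index/column pairs
pivot_failed = False
try:
    df.pivot(columns=['Custom Field'], index=['Name'])
except ValueError as e:
    pivot_failed = True
    print("pivot failed:", e)

# pivot_table aggregates the duplicates instead
wide = df.pivot_table(columns=['Custom Field'], index=['Name'],
                      aggfunc=','.join, fill_value='-')
print(wide)
```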
I have a dataframe in which I need to convert the rows of the Custom Field column into columns of a second dataframe. This part I have managed to do, and it works fine.
The problem is that I need to add the corresponding values from the id column to the respective columns of the second dataframe.
Here is an example:
This is the first dataframe:
This is the second dataframe, with the columns already converted.
But I would like to add the values corresponding to the id column of the first dataframe to the second dataframe:
Attached is the code:
import pandas as pd
data = {
"Custom Field": ["CF1", "CF2", "CF3"],
"id": [50, 40, 45],
"Name": ["Wilson", "Junior", "Otavio"]
}
### create the dataframe ###
df = pd.DataFrame(data)
print(df)
### add new columns from a list ###
columns_list = []
for x in df['Custom Field']:
    ### create multiple columns with x ##
    columns_list.append(x)
### convert list to new columns ###
df2 = pd.DataFrame(df,columns=columns_list)
df2["Name"] = df["Name"]
print(df2)
### If Name of df2 equals Name of df and the column matches Custom Field of df, then get the id from df and insert the value into the corresponding column of df2. ###
#### First unsuccessful attempt ###
df2_columns_names = list(df2.columns.values)
for df2_name in df2['Name']:
    for df2_cf in df2_columns_names:
        for df_name in df['Name']:
            for df_cf in df['Custom Field']:
                for df_id in df['id']:
                    if df2_name == df_name and df2_cf == df_cf:
                        df2.loc[df2_name, df2_cf] = df_id
print(df2)
Any suggestions?
Thanks in advance.
Use pivot_table
df.pivot_table(index=['Name'], columns=['Custom Field'])
As a general rule of thumb, if you are doing for loops and changing cells manually, you're using pandas wrong. Explore the methods of the framework in the docs, it can be very powerful :)
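A self-contained version of that suggestion (with numeric ids, as in the question, pivot_table's default mean aggregation simply carries each single value through):

```python
import pandas as pd

df = pd.DataFrame({
    "Custom Field": ["CF1", "CF2", "CF3"],
    "id": [50, 40, 45],
    "Name": ["Wilson", "Junior", "Otavio"],
})

# Rows of "Custom Field" become columns; "id" supplies the cell values
wide = df.pivot_table(index=['Name'], columns=['Custom Field'])
print(wide)
```

Cells with no matching (Name, Custom Field) pair come out as NaN.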
I would want to find a way in python to merge the files on 'seq' but return all the ones with the same id, in this example only the lines with id 2 would be removed.
File one:
seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSDLILYYEQYF,2
CASSDLILYYTQYF,2
CASSGSYEQYF,3
CASSGSYEQYY,3
File two:
seq
CSVGPPNNEQFF
CASRGEAAGFYEQYF
CASSGSYEQYY
Output:
seq,id
CSVGPPNNEQFF,0
CTVGPPNNEQFF,0
CTVGPPNNERFF,0
CASRGEAAGFYEQYF,1
RASRGEAAGFYEQYF,1
CASRGGAAGFYEQYF,1
CASSGSYEQYF,3
CASSGSYEQYY,3
I have tried:
df3 = df1.merge(df2.groupby('seq',as_index=False)[['seq']].agg(','.join),how='right')
output:
seq,id
CASRGEAAGFYEQYF,1
CASSGSYEQYY,3
CSVGPPNNEQFF,0
Does anyone have any advice how to solve this?
Do you want to merge the two dataframes, or just take the subset of the first dataframe whose ids are included in the second dataframe (by seq)? Either way, this gives the required result.
df1 = pd.DataFrame({
'seq': [
'CSVGPPNNEQFF',
'CTVGPPNNEQFF',
'CTVGPPNNERFF',
'CASRGEAAGFYEQYF',
'RASRGEAAGFYEQYF',
'CASRGGAAGFYEQYF',
'CASSDLILYYEQYF',
'CASSDLILYYTQYF',
'CASSGSYEQYF',
'CASSGSYEQYY'
],
'id': [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
})
df2 = pd.DataFrame({
'seq': [
'CSVGPPNNEQFF',
'CASRGEAAGFYEQYF',
'CASSGSYEQYY'
]
})
df3 = df1.loc[df1['id'].isin(df1['id'][df1['seq'].isin(df2['seq'])])]
Explanation: df1['id'][df1['seq'].isin(df2['seq'])] takes those values of id from df1 that contain at least one seq that is included in df2. Then all rows with those values of id are taken from df1.
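The same logic, broken into intermediate steps on a smaller version of the data:

```python
import pandas as pd

df1 = pd.DataFrame({
    'seq': ['CSVGPPNNEQFF', 'CTVGPPNNEQFF', 'CASSDLILYYEQYF',
            'CASSGSYEQYF', 'CASSGSYEQYY'],
    'id': [0, 0, 2, 3, 3],
})
df2 = pd.DataFrame({'seq': ['CSVGPPNNEQFF', 'CASSGSYEQYY']})

# Step 1: ids of df1 rows whose seq appears in df2
matched_ids = df1['id'][df1['seq'].isin(df2['seq'])]

# Step 2: keep every row carrying one of those ids; here id 2 is dropped
df3 = df1.loc[df1['id'].isin(matched_ids)]
print(df3)
```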
You can use the isin() pandas method; the code looks as follows:
df1.loc[df1['seq'].isin(df2['seq'])]
This assumes both objects are pandas dataframes and 'seq' is a column.
I am trying to generate a for loop dynamically based on the number of columns in a dataframe.
For e.g if my columns in dataframe is 5, then I generate the for loop and assign variables accordingly.
if
df_cols = ['USER_ID', 'BLID', 'PACKAGE_NAME', 'PACKAGE_PRICE', 'ENDED_DATE']
and brics is my dataframe
Then
for index, row in brics.iterrows():
    analytics.track(row['USER_ID'], 'Cancelled Subscription', {
        df_cols[1]: row['BLID'],
        df_cols[2]: row['PACKAGE_NAME'],
        df_cols[3]: row['PACKAGE_PRICE'],
        df_cols[4]: row['ENDED_DATE'],
    })
The df_cols and the row[value] should be generated based on the number of columns in dataframe.
For e.g, if there are only 2 columns in data frame the below is how the code should look like.
if
df_cols = ['USER_ID', 'BLID']
Then
for index, row in brics.iterrows():
    analytics.track(row['USER_ID'], 'Cancelled Subscription', {
        df_cols[1]: row['BLID']
    })
I searched SO for a solution but couldn't find one related to dataframes (though an R one is available). Any pointers will be helpful. Thank you.
df_cols = ['USER_ID', 'BLID', 'PACKAGE_NAME', 'PACKAGE_PRICE', 'ENDED_DATE']
for index, row in brics.iterrows():
    analytics.track(row['USER_ID'], 'Cancelled Subscription', {
        df_cols[i]: row[df_cols[i]] for i in range(1, len(df_cols))
    })
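A self-contained sketch of that pattern; since analytics.track is an external call from the question, a stand-in that just records its arguments is used here:

```python
import pandas as pd

tracked = []

def track(user_id, event, properties):
    """Stand-in for the external analytics.track call from the question."""
    tracked.append((user_id, event, properties))

# Hypothetical frame with a USER_ID column followed by any number of others
brics = pd.DataFrame({
    'USER_ID': ['u1', 'u2'],
    'BLID': ['b1', 'b2'],
    'PACKAGE_NAME': ['basic', 'pro'],
})

# Build the properties dict from whatever columns the frame has,
# skipping column 0 (USER_ID), which is passed separately.
df_cols = list(brics.columns)
for index, row in brics.iterrows():
    track(row['USER_ID'], 'Cancelled Subscription',
          {df_cols[i]: row[df_cols[i]] for i in range(1, len(df_cols))})
```

Because the comprehension runs over the column list, the same loop works unchanged whether the frame has two columns or five.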
I have two pandas dataframes that I'm trying to merge on their ID number. However, in df1 the ID is used multiple times, while in df2 it is used only once. Therefore I want the final dataframe to include all the results separated by commas, each preceded by an index value. I made a simple example that will help me explain what I'm asking.
df1:
df2:
Merged Goal:
I've tried merging them the way I usually do:
MergedGoal= pd.merge(df1, df2, on='ID', how='left')
But I get a key error for ID, probably because there are duplicates. How can I add them together? And if anyone could also give me some insight on how to add an index in front of each value added, that would be amazing. But if it's not possible to add the index numbers, that's totally fine; I just need all of the values in the same entry separated by commas.
I created df1 the following way:
df1 = pd.DataFrame(data=[
[ 1, 'Manchester', 'NH', 3108 ],
[ 1, 'Bedford', 'NH', 3188 ],
[ 6, 'Boston', 'MA', 23718 ],
[ 1, 'Austin', 'TX', 20034 ]],
columns=['ID', 'City', 'State', 'Zip'])
df1.Zip = df1.Zip.astype(str).str.zfill(5)
Note that I changed the source Zips (which, as I see, are "plain" integers) to strings, because you want to keep the leading zeroes.
To create df2 I used:
df2 = pd.DataFrame(data=[[ 1, 'Best Cities', 'xxx' ], [ 6, 'Worst Cities', 'yyy' ]],
columns=['ID', 'Title', 'Description'])
As a preparation step, let's define a function, which will be used
to aggregate columns from df1:
def fn(src):
    lst = [f'{idx}) {val}' for idx, val in enumerate(src, start=1)]
    return ', '.join(lst)
The first step of this function is a list comprehension, where
enumerate iterates over src (the content of the current column
in the current group) and yields:
idx - the current element index, starting from 1,
val - the current element itself.
An f-string formats each resulting item, so the result is a list of
e.g. city names with numbers before them.
The return statement joins this list into a single string, inserting ", "
between the items.
So e.g. for the group with ID == 1 and the City column, the source values are
['Manchester', 'Bedford', 'Austin'] and the result is:
1) Manchester, 2) Bedford, 3) Austin.
And the actual processing can be performed with a single instruction:
pd.merge(df2, df1.groupby('ID').agg(fn), how='left',
left_on='ID', right_index=True).fillna('')
As you can see:
I reversed the order of the merged DataFrames. This way the result
contains the columns from df2 first, then those from df1.
City, State and Zip columns from df1 are first
grouped by ID and aggregated, using fn function.
Then they are merged with df2.
I added fillna('') to replace NaN values with an empty string,
which would occur in case of IDs present only in df2.
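Put together, the whole answer runs as one self-contained snippet:

```python
import pandas as pd

df1 = pd.DataFrame(data=[
    [1, 'Manchester', 'NH', 3108],
    [1, 'Bedford', 'NH', 3188],
    [6, 'Boston', 'MA', 23718],
    [1, 'Austin', 'TX', 20034]],
    columns=['ID', 'City', 'State', 'Zip'])
df1.Zip = df1.Zip.astype(str).str.zfill(5)  # keep leading zeroes

df2 = pd.DataFrame(data=[[1, 'Best Cities', 'xxx'], [6, 'Worst Cities', 'yyy']],
                   columns=['ID', 'Title', 'Description'])

def fn(src):
    # number the values within the group, then join them with commas
    lst = [f'{idx}) {val}' for idx, val in enumerate(src, start=1)]
    return ', '.join(lst)

# aggregate df1 per ID, then attach the result to df2
merged = pd.merge(df2, df1.groupby('ID').agg(fn), how='left',
                  left_on='ID', right_index=True).fillna('')
print(merged)
```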