Merging on a non-unique column in pandas (Python)

I have been trying to merge two DataFrames (df and df_details) in a similar fashion to an Excel VLOOKUP, but I am getting strange results. Below I show the structure of the two DataFrames, without real data, for simplicity.
df_details:
Abstract_Title | Abstract_URL | Session_No_v2 | Session_URL | Session_ID
-------------------------------------------------------------------------
Abstract_Title1 Abstract_URL1 1 Session_URL1 12345
Abstract_Title2 Abstract_URL2 1 Session_URL1 12345
Abstract_Title3 Abstract_URL3 1 Session_URL1 12345
Abstract_Title4 Abstract_URL4 2 Session_URL2 22222
Abstract_Title5 Abstract_URL5 2 Session_URL2 22222
Abstract_Title6 Abstract_URL6 3 Session_URL3 98765
Abstract_Title7 Abstract_URL7 3 Session_URL3 98765
df:
Session_Title | Session_URL | Sponsors | Type | Session_ID
-------------------------------------------------------------------------------
Session_Title1 Session_URL1 x, y z Paper 12345
Session_Title2 Session_URL2 x, y Presentation 22222
Session_Title3 Session_URL3 a, b ,c Presentation 98765
Session_Title4 Session_URL4 c Talk 12121
Session_Title5 Session_URL5 a, x Paper 33333
I want to merge on Session_ID, so that the final DataFrame is df_details with the Session_Title, Sponsors, and Type columns from df appended.
I've tried the following, which yields a DataFrame that duplicates certain rows several times and does other strange things. For example, df_details has 7,046 rows and df has 1,856 rows, yet when I run the merge code below, final_df ends up with 21,148 rows:
final_df = pd.merge(df_details, df, how = 'outer', on = 'Session_ID')
Please help!

To generate your final output table I used the following code:
final_df = pd.merge(df_details,
                    df[['Session_ID', 'Session_Title', 'Sponsors', 'Type']],
                    left_on='Session_ID', right_on='Session_ID',
                    how='outer')

Use 'left' instead of 'outer'.
final_df = pd.merge(df_details,
                    df[['Session_ID', 'Session_Title', 'Sponsors', 'Type']],
                    left_on='Session_ID', right_on='Session_ID',
                    how='left')
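The row blow-up in the question is the expected result of merging on a key that is duplicated on both sides: within each Session_ID group, merge pairs every left row with every right row. A minimal sketch with made-up frames (not the asker's data) shows the effect, and how merge's validate argument can catch it early:

```python
import pandas as pd

# Toy frames with a key duplicated on both sides (hypothetical data)
left = pd.DataFrame({'Session_ID': [1, 1, 2], 'a': ['a1', 'a2', 'a3']})
right = pd.DataFrame({'Session_ID': [1, 1], 'b': ['b1', 'b2']})

# The two left rows with Session_ID == 1 each pair with both right rows:
# 2 x 2 = 4 rows, plus the unmatched Session_ID == 2 row from the left.
merged = pd.merge(left, right, on='Session_ID', how='outer')
print(len(merged))  # 5

# validate='m:1' raises MergeError when the right side is not unique on the key
try:
    pd.merge(left, right, on='Session_ID', how='left', validate='m:1')
except pd.errors.MergeError:
    print('right side has duplicate Session_ID values')
```

In the asker's case, 7,046 rows growing to 21,148 suggests df itself has repeated Session_ID values; validate='m:1' would surface that immediately.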

Related

How to combine two pandas dataset based on multiple conditions?

I want to combine two datasets in Python based on multiple conditions using pandas.
The two datasets have different numbers of rows: the first contains almost 300k entries, while the second contains almost 1,000 entries.
More specifically, the first dataset, "A", has the following columns:
Path | Line | Severity | Vulnerability | Name | Text | Title
An instance of the content of "A" is this:
src.bla.bla.class.java | 24 | medium | Logging found | hr.kravarscan.enchantedfortress_15 | description | Enchanted Fortress
While the second dataset: "B" contains the following information:
Class | Path | DTWC | DR | DW | IDFP
An instance of the content in "B" is this:
y.x.bla.MainActivity | com.lucao.limpazap_11| 0 | 0 | 0 | 0
I want to combine these two datasets as follows:
If A['Name'] is equal to B['Path'] and B['Class'] is contained in A['Path'],
then merge the two rows into another data frame, "C".
An output example is the following:
Suppose that A contains:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress|
and B contains:
com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
the output should be the following:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress| com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
I'm not sure if this is the best or most efficient way, but I have tested it and it works. My answer is straightforward: loop over the two dataframes and apply the desired conditions.
Suppose the dataset A is df_a and dataset B is df_b.
First we add a suffix to every column of df_a and df_b so that rows from both can be combined later.
df_a.columns = [c + '_A' for c in df_a.columns]
df_b.columns = [c + '_B' for c in df_b.columns]
And then we can apply this for loop
rows = []
# Iterate through df_a
for idx_A, v_A in df_a.iterrows():
    # Iterate through df_b
    for idx_B, v_B in df_b.iterrows():
        # Apply the condition
        if v_A['Name_A'] == v_B['Path_B'] and v_B['Class_B'] in v_A['Path_A']:
            # Cast both Series to dicts and merge them into one row
            rows.append({**v_A.to_dict(), **v_B.to_dict()})
# Build df_c once at the end (DataFrame.append was removed in pandas 2.x)
df_c = pd.DataFrame(rows)
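The nested loop works but scales as rows(A) × rows(B), which matters with 300k rows. As a sketch (column names taken from the suffixed frames above, data made up), the exact-match half of the condition can be handled by a merge, leaving only the substring check as a row-wise filter:

```python
import pandas as pd

# Hypothetical stand-ins for the suffixed frames built above
df_a = pd.DataFrame({'Path_A': ['src.bla.bla.class.java', 'other.path.java'],
                     'Name_A': ['hr.kravarscan.enchantedfortress_15',
                                'com.lucao.limpazap_11']})
df_b = pd.DataFrame({'Class_B': ['bla.class', 'x.y.z'],
                     'Path_B': ['hr.kravarscan.enchantedfortress_15',
                                'com.lucao.limpazap_11']})

# Equi-join on the exact-match condition first ...
merged = df_a.merge(df_b, left_on='Name_A', right_on='Path_B')
# ... then keep only rows where Class_B occurs inside Path_A
mask = [cls in path for cls, path in zip(merged['Class_B'], merged['Path_A'])]
df_c = merged[mask].reset_index(drop=True)
```

The merge reduces the candidate pairs to exact Name/Path matches, so the Python-level containment check only runs over those rows instead of over the full cross product.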

Compare a row substring in one dataframe column with row substring of another dataframe column, and remove non-matching ones

I have two dataframes with different row counts.
df1 has the problems and count
problems | count
broken, torn | 10
torn, faded | 15
worn-out, broken | 25
faded | 5
df2 has the order_id and problems
order_id | problems
123 | broken
594 | torn
811 | worn-out, broken
I need to remove all rows from df1 whose problems do not match any of the individual problems listed in df2, while keeping df1's count column.
The final df1 data frame would look like this:
problems | count
broken | 10
torn | 15
worn-out, broken | 25
Can someone please help?
IIUC, try this if your problems are strings:
df1 = pd.DataFrame({'problems': ['broken, torn', 'torn, faded',
                                 'worn-out, broken', 'faded'],
                    'count': [10, 15, 25, 5]})
df2 = pd.DataFrame({'order_id': [123, 594, 811],
                    'problems': ['broken', 'torn', 'worn-out, broken']})

prob_df2 = df2['problems'].str.split(r',\s?').explode()
df1_prob = df1.assign(prob_exp=df1['problems'].str.split(r',\s?'))
df1_exp = df1_prob.explode('prob_exp')
df1_out = (df1_exp[df1_exp['prob_exp'].isin(prob_df2)]
           .groupby(level=0)
           .agg({'count': 'first', 'prob_exp': ', '.join}))
df1_out
Output:
   count          prob_exp
0     10      broken, torn
1     15              torn
2     25  worn-out, broken
Details:
Create a list of problems from df2 using .str.split with a regex that captures the optional space after the comma, then explode
Create a new column in df1 with the exploded problems
Check whether the problems in df1 are in the df2 problem list
Use groupby to combine the matching problems for each original index
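An alternative sketch of the same idea (not from the answer above): build a plain Python set of the individual problems in df2, then filter each df1 row's split problems against it:

```python
import pandas as pd

df1 = pd.DataFrame({'problems': ['broken, torn', 'torn, faded',
                                 'worn-out, broken', 'faded'],
                    'count': [10, 15, 25, 5]})
df2 = pd.DataFrame({'order_id': [123, 594, 811],
                    'problems': ['broken', 'torn', 'worn-out, broken']})

# Set of individual problems that appear anywhere in df2
wanted = set(df2['problems'].str.split(r',\s*').explode())

# Keep only the matching pieces of each df1 row; drop rows with no match
kept = (df1['problems'].str.split(r',\s*')
        .apply(lambda ps: ', '.join(p for p in ps if p in wanted)))
out = df1.assign(problems=kept)[kept != '']
```

This avoids the explode/groupby round trip at the cost of a Python-level apply, which is usually fine at this scale.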

Pandas - pivoting multiple columns into fewer columns with some level of detail kept

Say I have the following code that generates a dataframe:
df = pd.DataFrame({"customer_code": ['1234', '3411', '9303'],
                   "main_purchases": [3, 10, 5],
                   "main_revenue": [103.5, 401.5, 99.0],
                   "secondary_purchases": [1, 2, 4],
                   "secondary_revenue": [43.1, 77.5, 104.6]})
df.head()
There's the customer_code column that's the unique ID for each client.
There are two columns for the purchases made and revenue generated at main branches by those clients, and another two columns for the purchases/revenue at secondary branches.
I want to pivot the data into a long format with a new column that differentiates main vs secondary, while keeping the purchases and revenue columns separate:
The obvious solution is just to split this into two dataframes and then simply concatenate them, but I'm wondering whether there's a built-in way to do this in a line or two; this strikes me as the kind of thing someone might have baked in a solution for.
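For reference, the split-and-concat baseline described above might look like this (a sketch built from the question's column names):

```python
import pandas as pd

df = pd.DataFrame({"customer_code": ['1234', '3411', '9303'],
                   "main_purchases": [3, 10, 5],
                   "main_revenue": [103.5, 401.5, 99.0],
                   "secondary_purchases": [1, 2, 4],
                   "secondary_revenue": [43.1, 77.5, 104.6]})

# One slice per branch type, renamed to common column names
frames = []
for kind in ['main', 'secondary']:
    part = (df[['customer_code', f'{kind}_purchases', f'{kind}_revenue']]
            .rename(columns={f'{kind}_purchases': 'purchases',
                             f'{kind}_revenue': 'revenue'})
            .assign(type=kind))
    frames.append(part)

long_df = (pd.concat(frames, ignore_index=True)
           .sort_values(['customer_code', 'type'])
           .reset_index(drop=True))
```

The answers below achieve the same shape more compactly with wide_to_long, pivot_longer, or melt.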
With a little column renaming to put "revenue" and "purchases" first in the column names (using a regular expression and str.replace), we can use pd.wide_to_long to convert these new stubnames from columns to rows:
# Reorder column names so stubnames are first
df.columns = [df.columns[0],
*df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]
# Convert wide_to_long
df = (
pd.wide_to_long(
df,
i='customer_code',
stubnames=['purchases', 'revenue'],
j='type',
sep='_',
suffix='.*'
)
.sort_index() # Optional sort to match expected output
.reset_index() # retrieve customer_code from the index
)
df:
   customer_code       type  purchases  revenue
0           1234       main          3    103.5
1           1234  secondary          1     43.1
2           3411       main         10    401.5
3           3411  secondary          2     77.5
4           9303       main          5     99.0
5           9303  secondary          4    104.6
What does reordering the column headers do?
df.columns = [df.columns[0],
*df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]
Produces:
Index(['customer_code', 'purchases_main', 'revenue_main',
'purchases_secondary', 'revenue_secondary'],
dtype='object')
The "type" column is now the suffix of the column header which allows wide_to_long to process the table as expected.
You can abstract the reshaping process with pivot_longer from pyjanitor, which is a collection of convenience wrappers around pandas:
# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(index='customer_code',
                names_to=('type', '.value'),
                names_sep='_',
                sort_by_appearance=True)
customer_code type purchases revenue
0 1234 main 3 103.5
1 1234 secondary 1 43.1
2 3411 main 10 401.5
3 3411 secondary 2 77.5
4 9303 main 5 99.0
5 9303 secondary 4 104.6
The .value in names_to tells the function that that part of the column name should remain a header; the other part goes under the type column. The split is determined here by names_sep (there is also a names_pattern option that allows a regular-expression split); if you do not care about the order of appearance, you can set sort_by_appearance to False.
You can also use the melt() and concat() functions to solve this problem.
import pandas as pd
df1 = df.melt(id_vars='customer_code',
              value_vars=['main_purchases', 'secondary_purchases'],
              var_name='type',
              value_name='purchases',
              ignore_index=True)
df2 = df.melt(id_vars='customer_code',
              value_vars=['main_revenue', 'secondary_revenue'],
              var_name='type',
              value_name='revenue',
              ignore_index=True)
Then we use concat() with axis=1 to join them side by side, and sort_values(by='customer_code') to sort the data by customer.
result = pd.concat([df1, df2['revenue']],
                   axis=1,
                   ignore_index=False).sort_values(by='customer_code')
Using replace() with a regex to align the type names (assignment avoids the deprecated inplace-on-a-column pattern):
result['type'] = result['type'].replace(r'_.*$', '', regex=True)
The above code will output the below dataframe:
   customer_code       type  purchases  revenue
0           1234       main          3    103.5
3           1234  secondary          1     43.1
1           3411       main         10    401.5
4           3411  secondary          2     77.5
2           9303       main          5     99.0
5           9303  secondary          4    104.6

Merge two dataframes in PySpark

I have two dataframes, DF1 and DF2. DF1 is the master, which accumulates any additional information from DF2.
Let's say DF1 has the following format:
Item Id | item | count
---------------------------
1 | item 1 | 2
2 | item 2 | 3
1 | item 3 | 2
3 | item 4 | 5
DF2 contains two items that are already present in DF1 and two new entries. (Item Id and item together form the key for the join.)
Item Id | item | count
---------------------------
1 | item 1 | 2
3 | item 4 | 2
4 | item 4 | 4
5 | item 5 | 2
I need to combine the two dataframes such that the existing items count are incremented and new items are inserted.
The result should be like:
Item Id | item | count
---------------------------
1 | item 1 | 4
2 | item 2 | 3
1 | item 3 | 2
3 | item 4 | 7
4 | item 4 | 4
5 | item 5 | 2
I have one way to achieve this, but I am not sure if it is efficient or the right way to do it:
temp1 = df1.join(temp, ['item_id', 'item'], 'full_outer') \
           .na.fill(0)
temp1.groupby("item_id", "item") \
     .agg(F.sum(temp1["count"] + temp1["newcount"])) \
     .show()
Since the schema of the two dataframes is the same, you can perform a union and then group by the key and aggregate the counts:
from pyspark.sql import functions as F
df3 = df1.union(df2)
df3.groupBy("Item Id", "item").agg(F.sum("count").alias("count"))
There are several ways to do it.
Based on what you describe, the most straightforward solution would be RDDs with SparkContext.union:
rdd1 = sc.parallelize(DF1)
rdd2 = sc.parallelize(DF2)
union_rdd = sc.union([rdd1, rdd2])
An alternative solution would be to use DataFrame.union from pyspark.sql.
Note: I have suggested unionAll previously but it is deprecated in Spark 2.0
#wandermonk's solution is recommended as it does not use a join. Avoid joins as much as possible, since they trigger shuffling (also known as a wide transformation), which transfers data over the network and is expensive and slow.
You also have to look at your data sizes (both tables large, or one small and one large, etc.) and tune the performance side accordingly.
I have shown the group-by solution using Spark SQL, since it does the same thing but is easier to understand and manipulate.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

list_1 = [[1, "item 1", 2], [2, "item 2", 3], [1, "item 3", 2], [3, "item 4", 5]]
list_2 = [[1, "item 1", 2], [3, "item 4", 2], [4, "item 4", 4], [5, "item 5", 2]]

my_schema = StructType([StructField("Item_ID", IntegerType(), True),
                        StructField("Item_Name", StringType(), True),
                        StructField("Quantity", IntegerType(), True)])

df1 = spark.createDataFrame(list_1, my_schema)
df2 = spark.createDataFrame(list_2, my_schema)
# Register each frame under its own view name
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

df3 = df2.union(df1)
df3.createOrReplaceTempView("df3")
df4 = spark.sql("select Item_ID, Item_Name, sum(Quantity) as Quantity "
                "from df3 group by Item_ID, Item_Name")
df4.show(10)
Now if you look at the Spark UI, you can see the shuffle operation and the number of stages, even for such a small data set.
I also recommend looking at the SQL plan to understand the cost; Exchange represents the shuffle here.
== Physical Plan ==
*(2) HashAggregate(keys=[Item_ID#6, Item_Name#7], functions=[sum(cast(Quantity#8 as bigint))], output=[Item_ID#6, Item_Name#7, Quantity#32L])
+- Exchange hashpartitioning(Item_ID#6, Item_Name#7, 200)
+- *(1) HashAggregate(keys=[Item_ID#6, Item_Name#7], functions=[partial_sum(cast(Quantity#8 as bigint))], output=[Item_ID#6, Item_Name#7, sum#38L])
+- Union
:- Scan ExistingRDD[Item_ID#6,Item_Name#7,Quantity#8]
+- Scan ExistingRDD[Item_ID#0,Item_Name#1,Quantity#2]

Pandas: Storing Dataframe in Dataframe

I am rather new to pandas and am currently running into a problem when trying to store a DataFrame inside another DataFrame.
What I want to do:
I have multiple simulations and corresponding signal files, and I want all of them in one big DataFrame: one that has all my simulation parameters and also my signals as nested DataFrames. It should look something like this:
SimName | Date | Parameter 1 | Parameter 2 | Signal 1 | Signal 2 |
Name 1 | 123 | XYZ | XYZ | DataFrame | DataFrame |
Name 2 | 456 | XYZ | XYZ | DataFrame | DataFrame |
Here SimName is my index for the big DataFrame, and every entry in Signal 1 and Signal 2 is an individual DataFrame.
My idea was to implement this like this:
big_DataFrame['Signal 1'].loc['Name 1']
But this results in a ValueError:
Incompatible indexer with DataFrame
Is it possible to have this nested DataFrames in Pandas?
Nico
The 'pointers' referred to at the end of ns63sr's answer could be implemented as a class, e.g...
Definition:
class df_holder:
    def __init__(self, df):
        self.df = df
Set:
df.loc[0,'df_holder'] = df_holder(df)
Get:
df.loc[0].df_holder.df
The docs say that only Series can be stored within a DataFrame; however, passing DataFrames seems to work as well. Here is an example, assuming that none of the columns uses a MultiIndex:
import pandas as pd

signal_df = pd.DataFrame({'X': [1, 2, 3],
                          'Y': [10, 20, 30]})
big_df = pd.DataFrame({'SimName': ['Name 1', 'Name 2'],
                       'Date': [123, 456],
                       'Parameter 1': ['XYZ', 'XYZ'],
                       'Parameter 2': ['XYZ', 'XYZ'],
                       'Signal 1': [signal_df, signal_df],
                       'Signal 2': [signal_df, signal_df]})

big_df.loc[0, 'Signal 1']
big_df.loc[0, 'Signal 1']['X']
This results in:
out1: X Y
0 1 10
1 2 20
2 3 30
out2: 0 1
1 2
2 3
Name: X, dtype: int64
In case nested dataframes are not properly working, you may implement some sort of pointers that you store in big_df that allow you to access the signal dataframes stored elsewhere.
Instead of big_DataFrame['Signal 1'].loc['Name 1'] you should use
big_DataFrame.loc['Name 1','Signal 1']
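If the nesting proves fragile, a simple alternative (a common pattern, not from the answers above) is to keep only scalar parameters in the big DataFrame and hold the signal DataFrames in a dict keyed by the same SimName index:

```python
import pandas as pd

# Scalar parameters only, indexed by SimName
params = pd.DataFrame({'Date': [123, 456],
                       'Parameter 1': ['XYZ', 'XYZ'],
                       'Parameter 2': ['XYZ', 'XYZ']},
                      index=pd.Index(['Name 1', 'Name 2'], name='SimName'))

# Signal frames live outside the table, keyed by the same names
signals = {
    'Name 1': pd.DataFrame({'X': [1, 2, 3], 'Y': [10, 20, 30]}),
    'Name 2': pd.DataFrame({'X': [4, 5, 6], 'Y': [40, 50, 60]}),
}

# Parameters and the matching signal frame are looked up with one key
row = params.loc['Name 1']
sig = signals['Name 1']
```

This keeps the big frame cheap to filter and serialize, while the signals remain ordinary DataFrames.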
