How to combine (merge) two datatable Frames in Python

Given two datatable Frames, how can they be combined (merged) into one frame?
dt_f_A =
+--------+--------+--------+-----+--------+
| A_at_1 | A_at_2 | A_at_3 | ... | A_at_m |
+--------+--------+--------+-----+--------+
| v_1    |        |        |     |        |
+--------+--------+--------+-----+--------+
| ...    |        |        |     |        |
+--------+--------+--------+-----+--------+
| v_N    |        |        |     |        |
+--------+--------+--------+-----+--------+
dt_f_B =
+--------+--------+--------+-----+--------+
| B_at_1 | B_at_2 | B_at_3 | ... | B_at_k |
+--------+--------+--------+-----+--------+
| w_1    |        |        |     |        |
+--------+--------+--------+-----+--------+
| ...    |        |        |     |        |
+--------+--------+--------+-----+--------+
| w_N    |        |        |     |        |
+--------+--------+--------+-----+--------+
The expected result (dt_f_A concatenated/combined/merged with dt_f_B):
+--------+--------+--------+-----+--------+--------+--------+--------+-----+--------+
| A_at_1 | A_at_2 | A_at_3 | ... | A_at_m | B_at_1 | B_at_2 | B_at_3 | ... | B_at_k |
+--------+--------+--------+-----+--------+--------+--------+--------+-----+--------+
| v_1    |        |        |     |        | w_1    |        |        |     |        |
+--------+--------+--------+-----+--------+--------+--------+--------+-----+--------+
| ...    |        |        |     |        | ...    |        |        |     |        |
+--------+--------+--------+-----+--------+--------+--------+--------+-----+--------+
| v_N    |        |        |     |        | w_N    |        |        |     |        |
+--------+--------+--------+-----+--------+--------+--------+--------+-----+--------+
We consider three cases:
Case 1: a) The two frames have exactly the same number of rows, and b) the column names (attributes) are unique.
Case 2: The number of rows is different.
Case 3: The attributes are not unique (there is duplication).

Case 1: a) The two frames have exactly the same number of rows, and b) the column names (attributes) are unique
1- Use cbind: dt_f_A.cbind(dt_f_B)
or
2- Use assignment: dt_f_A[:, dt_f_B.names] = dt_f_B
Example:
import datatable as dt
dt_f_A = dt.Frame({"a":[1,2,3,4],"b":['a','b','c','d']})
dt_f_B = dt.Frame({"c":[1.1, 2.2, 3.3, 4.4], "d":['aa', 'bb', 'cc', 'dd']})
dt_f_A.cbind(dt_f_B)
# dt_f_A[:, dt_f_B.names] = dt_f_B  # this works fine as well
print(dt_f_A)
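As a side note, if you prefer not to modify dt_f_A in place, the module-level dt.cbind() builds a new frame instead (a minimal sketch; this would replace the dt_f_A.cbind(dt_f_B) call above):
dt_f_AB = dt.cbind(dt_f_A, dt_f_B)  # returns a new frame; dt_f_A and dt_f_B stay unchanged
print(dt_f_AB.names)  # ('a', 'b', 'c', 'd')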
Case 2: The number of rows is different
dt_f_A.cbind(dt_f_B) gives InvalidOperationError: Cannot cbind frame with X rows to a frame with Y rows. (X ≠ Y)
dt_f_A[:, dt_f_B.names] gives ValueError: Frame has X rows, and cannot be used in an expression where Y are expected. (X ≠ Y)
The solution: use dt_f_A.cbind(dt_f_B, force=True)
Example:
import datatable as dt
dt_f_A = dt.Frame({"a":[1, 2, 3, 4, 5,6], "b":['a', 'b', 'c', 'd', 'e','f']})
dt_f_B = dt.Frame({"c":[1.1, 2.2, 3.3, 4.4], "d":['aa', 'bb', 'cc', 'dd']})
dt_f_A.cbind(dt_f_B,force=True)
print(dt_f_A)
The missing values will then be filled with NA.
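If those NAs need to be replaced afterwards, something along these lines should work (a sketch using dt.update, dt.ifelse and dt.isna; the fill values 0.0 and '' are placeholders):
# replace the NAs introduced by the forced cbind (placeholder fill values)
dt_f_A[:, dt.update(c=dt.ifelse(dt.isna(dt.f.c), 0.0, dt.f.c),
                    d=dt.ifelse(dt.isna(dt.f.d), '', dt.f.d))]
print(dt_f_A)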
Case 3: The attributes are not unique (there is duplication)
dt_f_A.cbind(dt_f_B): works but emits a warning and renames the duplicated column to a unique name: DatatableWarning: Duplicate column name found, and was assigned a unique name: 'a' -> 'a.0'
dt_f_A[:, dt_f_B.names] = dt_f_B: doesn't raise any error. It replaces the duplicated column in dt_f_A with the column from dt_f_B.
Example:
import datatable as dt
dt_f_A = dt.Frame({"a":[1,2,3,4],"b":['a','b','c','d']})
dt_f_B = dt.Frame({"a":[1.1, 2.2, 3.3, 4.4], "d":['aa', 'bb', 'cc', 'dd']})
dt_f_A.cbind(dt_f_B)  # renames the duplicated column ('a' -> 'a.0')
# dt_f_A[:, dt_f_B.names] = dt_f_B  # keeps dt_f_B's values for the duplicated column
print(dt_f_A)
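If the automatic rename is not what you want, you can rename the clashing column yourself before the cbind (a sketch, starting again from the frames defined just above; the name 'a_B' is an arbitrary choice):
dt_f_B.names = {"a": "a_B"}  # rename only the clashing column
dt_f_A.cbind(dt_f_B)         # no warning now
print(dt_f_A.names)          # ('a', 'b', 'a_B', 'd')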
@sammywemmy Thank you for your valuable comment :)

Related

pandas: aggregate rows by creating dictionary key-value pairs based on a column

Let's say I have a pandas dataframe:
| id1 | id2 | attr1 | combo_id | perm_id |
| --- | --- | --- | --- | --- |
| 1 | 2 | [9606] | [1,2] | AB |
| 2 | 1 | [9606] | [1,2] | BA |
| 3 | 4 | [9606] | [3,4] | AB |
| 4 | 3 | [9606] | [3,4] | BA |
I'd like to aggregate rows with the same combo_id together, and store information from both rows using the perm_id of that row. So the resulting dataframe would look like:
| attr1 | combo_id |
| --- | --- |
| {'AB':[9606], 'BA': [9606]} | [1,2] |
| {'AB':[9606], 'BA': [9606]} | [3,4] |
How would I use groupby and aggregate functions for these operations?
I tried converting attr1 to a dict keyed by perm_id.
df['attr1'] = df.apply(lambda x: {x['perm_id']: x['attr1']})
Then I planned to use something to combine dictionaries in the same group.
df.groupby(['combo_id']).agg({ 'attr1': lambda x: {x**})
But this resulted in KeyError: perm_id
Any suggestions?
Try:
from ast import literal_eval
x = (
    df.groupby(df["combo_id"].astype(str))
    .apply(lambda x: dict(zip(x["perm_id"], x["attr1"])))
    .reset_index(name="attr1")
)
# convert combo_id back to list (if needed)
x["combo_id"] = x["combo_id"].apply(literal_eval)
print(x)
Prints:
combo_id attr1
0 [1, 2] {'AB': [9606], 'BA': [9606]}
1 [3, 4] {'AB': [9606], 'BA': [9606]}
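An alternative that avoids the string round-trip is to group on the lists converted to tuples (a sketch under the same data assumptions):
x = (
    df.assign(combo_id=df["combo_id"].apply(tuple))  # lists are unhashable, tuples are not
    .groupby("combo_id")
    .apply(lambda g: dict(zip(g["perm_id"], g["attr1"])))
    .reset_index(name="attr1")
)
x["combo_id"] = x["combo_id"].apply(list)  # back to lists if needed
print(x)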

print multi-index dataframe with tabulate

How can one print a multi-index Dataframe such as the one below:
import numpy as np
import tabulate
import pandas as pd
df = pd.DataFrame(np.random.randn(4, 3),
                  index=pd.MultiIndex.from_product([["foo", "bar"],
                                                    ["one", "two"]]),
                  columns=list("ABC"))
so that the two levels of the MultiIndex show as separate columns, much the same way pandas itself prints it:
In [16]: df
Out[16]:
                A         B         C
foo one -0.040337  0.653915 -0.359834
    two  0.271542  1.328517  1.704389
bar one -1.246009  0.087229  0.039282
    two -1.217514  0.721025 -0.017185
However, tabulate prints like this:
In [28]: print(tabulate.tabulate(df, tablefmt="github", headers="keys", showindex="always"))
|                |          A |         B |          C |
|----------------|------------|-----------|------------|
| ('foo', 'one') | -0.0403371 |  0.653915 |  -0.359834 |
| ('foo', 'two') |   0.271542 |   1.32852 |    1.70439 |
| ('bar', 'one') |   -1.24601 | 0.0872285 |   0.039282 |
| ('bar', 'two') |   -1.21751 |  0.721025 | -0.0171852 |
MultiIndexes are represented by tuples internally, so tabulate is showing you the right thing.
If you want column-like display, the easiest is to reset_index first:
print(tabulate.tabulate(df.reset_index().rename(columns={'level_0':'', 'level_1': ''}), tablefmt="github", headers="keys", showindex=False))
Output:
|     |     |         A |         B |         C |
|-----|-----|-----------|-----------|-----------|
| foo | one | -0.108977 |   2.03593 |   1.11258 |
| foo | two |   0.65117 |  -1.48314 |  0.391379 |
| bar | one | -0.660148 |   1.34875 |  -1.10848 |
| bar | two |  0.561418 |  0.762137 |  0.723432 |
Alternatively, you can rework the MultiIndex to a single index:
df2 = df.copy()
df2.index = df.index.map(lambda x: '|'.join(f'{e:>5} ' for e in x))
print(tabulate.tabulate(df2.rename_axis('index'), tablefmt="github", headers="keys", showindex="always"))
Output:
| index | A | B | C |
|------------|-----------|-----------|-----------|
| foo | one | -0.108977 | 2.03593 | 1.11258 |
| foo | two | 0.65117 | -1.48314 | 0.391379 |
| bar | one | -0.660148 | 1.34875 | -1.10848 |
| bar | two | 0.561418 | 0.762137 | 0.723432 |
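If you want the two index columns to carry headers (rather than the blank ones in the reset_index approach above), you can also name the index levels before resetting them (a sketch; the level names 'group' and 'item' are arbitrary):
df3 = df.rename_axis(index=["group", "item"]).reset_index()
print(tabulate.tabulate(df3, tablefmt="github", headers="keys", showindex=False))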

Create a dataframe by iterating over column of list in another dataframe

In pyspark, I have a DataFrame with a column that contains a list of ordered nodes to go through:
osmDF.schema
Out[1]:
StructType(List(StructField(id,LongType,true),
StructField(nodes,ArrayType(LongType,true),true),
StructField(tags,MapType(StringType,StringType,true),true)))
osmDF.head(3)
Out[2]:
| id        | nodes                                               | tags                |
|-----------|-----------------------------------------------------|---------------------|
| 62960871  | [783186590,783198852]                               | "{""foo"":""bar""}" |
| 211528816 | [2215187080,2215187140,2215187205,2215187256]       | "{""foo"":""boo""}" |
| 62960872  | [783198772,783183397,783167527,783169067,783198772] | "{""foo"":""buh""}" |
I need to create a dataframe with a row for each consecutive pair of nodes in the list of nodes, then save it as Parquet.
The expected result will have n-1 rows per input row, where n is len(nodes). It would look like this (with other columns that I'll add):
| id                    | from       | to         | tags                |
|-----------------------|------------|------------|---------------------|
| 783186590_783198852   | 783186590  | 783198852  | "{""foo"":""bar""}" |
| 2215187080_2215187140 | 2215187080 | 2215187140 | "{""foo"":""boo""}" |
| 2215187140_2215187205 | 2215187140 | 2215187205 | "{""foo"":""boo""}" |
| 2215187205_2215187256 | 2215187205 | 2215187256 | "{""foo"":""boo""}" |
| 783198772_783183397   | 783198772  | 783183397  | "{""foo"":""buh""}" |
| 783183397_783167527   | 783183397  | 783167527  | "{""foo"":""buh""}" |
| 783167527_783169067   | 783167527  | 783169067  | "{""foo"":""buh""}" |
| 783169067_783198772   | 783169067  | 783198772  | "{""foo"":""buh""}" |
I tried to start with the following:
from pyspark.sql.functions import udf

def split_ways_into_arcs(row):
    arcs = []
    for node in range(len(row['nodes']) - 1):
        arc = dict()
        arc['id'] = str(row['nodes'][node]) + "_" + str(row['nodes'][node + 1])
        arc['from'] = row['nodes'][node]
        arc['to'] = row['nodes'][node + 1]
        arc['tags'] = row['tags']
        arcs.append(arc)
    return arcs

# Declare function as udf
split = udf(lambda row: split_ways_into_arcs(row.asDict()))
The issue I'm having is I don't know how many nodes there are in each row of the original DataFrame.
I know how to apply a udf to add a column to an existing DataFrame, but not to create a new one from lists of dicts.
Iterate over the nodes array using transform and explode the array afterwards:
from pyspark.sql import functions as F
df = ...
df.withColumn("nodes", F.expr("transform(nodes, (n,i) -> named_struct('from', nodes[i], 'to', nodes[i+1]))")) \
.withColumn("nodes", F.explode("nodes")) \
.filter("not nodes.to is null") \
.selectExpr("concat_ws('_', nodes.from, nodes.to) as id", "nodes.*", "tags") \
.show(truncate=False)
Output:
+---------------------+----------+----------+-----------------+
|id                   |from      |to        |tags             |
+---------------------+----------+----------+-----------------+
|783186590_783198852  |783186590 |783198852 |{""foo"":""bar""}|
|2215187080_2215187140|2215187080|2215187140|{""foo"":""boo""}|
|2215187140_2215187205|2215187140|2215187205|{""foo"":""boo""}|
|2215187205_2215187256|2215187205|2215187256|{""foo"":""boo""}|
|783198772_783183397  |783198772 |783183397 |{""foo"":""buh""}|
|783183397_783167527  |783183397 |783167527 |{""foo"":""buh""}|
|783167527_783169067  |783167527 |783169067 |{""foo"":""buh""}|
|783169067_783198772  |783169067 |783198772 |{""foo"":""buh""}|
+---------------------+----------+----------+-----------------+
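If you'd rather keep the UDF approach you started with, a completed version might look like this. It is a sketch: the function name split_into_arcs_udf and the output path "arcs.parquet" are mine, the schema is taken from the question, and osmDF is the question's DataFrame.
from pyspark.sql import functions as F, types as T

arc_schema = T.ArrayType(T.StructType([
    T.StructField("id", T.StringType()),
    T.StructField("from", T.LongType()),
    T.StructField("to", T.LongType()),
    T.StructField("tags", T.MapType(T.StringType(), T.StringType())),
]))

@F.udf(returnType=arc_schema)
def split_into_arcs_udf(nodes, tags):
    # one dict per consecutive pair of nodes
    return [
        {"id": f"{nodes[i]}_{nodes[i + 1]}",
         "from": nodes[i],
         "to": nodes[i + 1],
         "tags": tags}
        for i in range(len(nodes) - 1)
    ]

arcs_df = (
    osmDF
    .select(F.explode(split_into_arcs_udf("nodes", "tags")).alias("arc"))
    .select("arc.*")
)
arcs_df.write.parquet("arcs.parquet")  # example output path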

vlookup on text field using pandas

I need to use vlookup functionality in pandas.
DataFrame 2: (FEED_NAME has no duplicate rows)
+-----------+--------------------+---------------------+
| FEED_NAME | Name               | Source              |
+-----------+--------------------+---------------------+
| DMSN      | DMSN_YYYYMMDD.txt  | Main hub            |
| PCSUS     | PCSUS_YYYYMMDD.txt | Basement            |
| DAMJ      | DAMJ_YYYYMMDD.txt  | Eiffel Tower router |
+-----------+--------------------+---------------------+
DataFrame 1:
+-------------+
| SYSTEM_NAME |
+-------------+
| DMSN        |
| PCSUS       |
| DAMJ        |
| :           |
| :           |
+-------------+
DataFrame 1 contains many more rows; it is actually a column in a much larger table. I need to merge df1 with df2 to make it look like:
+-------------+--------------------+---------------------+
| SYSTEM_NAME | Name               | Source              |
+-------------+--------------------+---------------------+
| DMSN        | DMSN_YYYYMMDD.txt  | Main Hub            |
| PCSUS       | PCSUS_YYYYMMDD.txt | Basement            |
| DAMJ        | DAMJ_YYYYMMDD.txt  | Eiffel Tower router |
| :           |                    |                     |
| :           |                    |                     |
+-------------+--------------------+---------------------+
In Excel I would just have done VLOOKUP(,,1,TRUE) and then copied it across all cells.
I have tried various combinations of merge and join, but I keep getting KeyError: 'SYSTEM_NAME'.
Code:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"})
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
You missed the inplace=True argument in the line _df2.rename(columns={'FEED_NAME': 'SYSTEM_NAME'}), so _df2's column names haven't changed. Try this instead:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}, inplace=True)
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
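Equivalently, you can skip inplace altogether by assigning the result of rename back (a minimal sketch of the same merge):
import pandas as pd

_df1 = df1[['SYSTEM_NAME']]
_df2 = df2[['FEED_NAME', 'Name', 'Source']].rename(columns={'FEED_NAME': 'SYSTEM_NAME'})
_df3 = pd.merge(_df1, _df2, how='left', on='SYSTEM_NAME')
_df3.head()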

Merge two spark dataframes based on a column

I have 2 dataframes which I need to merge based on a column (Employee code). Please note that the dataframes have about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using Databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001   |   B   |       |       |       |       |       |       |       |       |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001   |       |       |       |       |   C   |       |       |       |       |
|B10001   |       |       |       |       |       |       |       |       |  T2   |
|A10001   |       |       |       |       |       |       |       |   B   |       |
|A10001   |       |       |   C   |       |       |       |       |       |       |
|C10001   |       |       |       |       |       |   C   |       |       |       |
|-----------------------------------------------------------------------------------|
I need to merge the 2 dataframes based on EMP_CODE, basically joining dataframe1 with dataframe2 on emp_code. I am getting duplicate columns when I do a join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001   |   B   |       |   C   |       |   C   |       |       |   B   |       |
|B10001   |       |       |       |       |       |       |       |       |  T2   |
|C10001   |       |       |       |       |       |   C   |       |       |       |
|-----------------------------------------------------------------------------------|
There are 3 rows with emp_code A10001 in dataframe2, and 1 row in dataframe1. All data should be merged into one record without any duplicate columns.
Thanks much
You can use an inner join:
output = df1.join(df2, ['EMP_CODE'], how='inner')
You can also apply distinct at the end to remove duplicates:
output = df1.join(df2, ['EMP_CODE'], how='inner').distinct()
You can do that in Scala, if both dataframes have the same columns, with:
output = df1.union(df2)
First you need to aggregate the individual dataframes.
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)))
You have to write this for all columns and for all dataframes.
Then you'll have to use the union function on all dataframes:
df1.union(df2)
and then repeat the same aggregation on that union dataframe (a sketch that loops over all columns follows below).
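A compact way to apply that idea to all value columns at once might look like this (a sketch, assuming df1 and df2 both have the EMP_CODE plus COLUMN1..COLUMN9 layout shown in the question):
from pyspark.sql import functions as F

value_cols = [c for c in df1.columns if c != "EMP_CODE"]  # COLUMN1 ... COLUMN9

def collapse(df):
    # concatenate the non-null values of every column per EMP_CODE
    return df.groupBy("EMP_CODE").agg(
        *[F.concat_ws(" ", F.collect_list(c)).alias(c) for c in value_cols]
    )

# aggregate each frame, union them, then aggregate the union again
output = collapse(collapse(df1).unionByName(collapse(df2)))
output.show()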
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()
