Drop duplicates based on first level column in MultiIndex DataFrame - python

I have a MultiIndex Pandas DataFrame like so:
+---+------------------+----------+---------+--------+-----------+------------+---------+-----------+
|   |                  |          |           OVERALL            |            INDIVIDUAL            |
|   | VECTOR           | SEGMENTS | TIP X   | TIP Y  | CURVATURE | TIP X      | TIP Y   | CURVATURE |
+---+------------------+----------+---------+--------+-----------+------------+---------+-----------+
| 0 | (TOP, TOP)       | 2        | 3.24    | 1.309  | 44        | 1.62       | 0.6545  | 22        |
| 1 | (TOP, BOTTOM)    | 2        | 3.495   | 0.679  | 22        | 1.7475     | 0.3395  | 11        |
| 2 | (BOTTOM, TOP)    | 2        | 3.495   | -0.679 | -22       | 1.7475     | -0.3395 | -11       |
| 3 | (BOTTOM, BOTTOM) | 2        | 3.24    | -1.309 | -44       | 1.62       | -0.6545 | -22       |
+---+------------------+----------+---------+--------+-----------+------------+---------+-----------+
How can I drop duplicates based on all of the columns contained under 'OVERALL' or 'INDIVIDUAL'? For example, if I choose 'INDIVIDUAL', the values of TIP X, TIP Y, and CURVATURE under INDIVIDUAL must all match for a row to count as a duplicate.
Furthermore, as you can see from the table, rows 1 and 2 are duplicates that are simply mirrored about the x-axis. These must also be dropped.
Also, can I center the OVERALL and INDIVIDUAL headings?
EDIT: frame.drop_duplicates(subset=['INDIVIDUAL'], inplace=True) produces KeyError: Index(['INDIVIDUAL'], dtype='object')

You can pass pandas .drop_duplicates a subset of tuples for multi-indexed columns:
df.drop_duplicates(subset=[
    ('INDIVIDUAL', 'TIP X'),
    ('INDIVIDUAL', 'TIP Y'),
    ('INDIVIDUAL', 'CURVATURE')
])
Or, if your row indices are unique, you could use the following approach that saves some typing:
df.loc[df['INDIVIDUAL'].drop_duplicates().index]
Update:
As you suggested in the comments, if you want to do operations on the dataframe you can do that in-line:
df.loc[df['INDIVIDUAL'].abs().drop_duplicates().index]
Or for non-pandas functions, you can use .transform:
df.loc[df['INDIVIDUAL'].transform(np.abs).drop_duplicates().index]
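For completeness, here is a minimal, self-contained sketch of both approaches, rebuilding a small frame with the same two-level column layout as the table above (the exact dtypes are an assumption):
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ('VECTOR', ''), ('SEGMENTS', ''),
    ('OVERALL', 'TIP X'), ('OVERALL', 'TIP Y'), ('OVERALL', 'CURVATURE'),
    ('INDIVIDUAL', 'TIP X'), ('INDIVIDUAL', 'TIP Y'), ('INDIVIDUAL', 'CURVATURE'),
])
df = pd.DataFrame([
    [('TOP', 'TOP'),       2, 3.240,  1.309,  44, 1.6200,  0.6545,  22],
    [('TOP', 'BOTTOM'),    2, 3.495,  0.679,  22, 1.7475,  0.3395,  11],
    [('BOTTOM', 'TOP'),    2, 3.495, -0.679, -22, 1.7475, -0.3395, -11],
    [('BOTTOM', 'BOTTOM'), 2, 3.240, -1.309, -44, 1.6200, -0.6545, -22],
], columns=columns)

# exact duplicates under 'INDIVIDUAL' only
exact = df.drop_duplicates(subset=[
    ('INDIVIDUAL', 'TIP X'),
    ('INDIVIDUAL', 'TIP Y'),
    ('INDIVIDUAL', 'CURVATURE'),
])

# mirrored rows count as duplicates too: compare absolute values,
# then keep only the surviving row labels
mirrored = df.loc[df['INDIVIDUAL'].abs().drop_duplicates().index]
print(mirrored)  # rows 0 and 1 remain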

How to find the sum of a dataframe?

While finding the sum as follows:
g.loc[g.index[0], 'sum'] = g[RDM].sum()
where RDM is
RDM = [f"R_Dist_meas_{i}" for i in range(48)]
the error was as follows:
KeyError: "None of [Index(['R_Dist_meas_0', 'R_Dist_meas_1', 'R_Dist_meas_2',\n .........................'R_Dist_meas_45', 'R_Dist_meas_46', 'R_Dist_meas_47'],\n dtype='object')] are in the [columns]"
The sample dataframe is as follows; it has many other columns besides the distance ones (angle, velocity, etc.).
The format of the dataframe is A0B0C0 A1B1C1 A2B2C2 A3B3C3 ....... A47B47C47
| R_Dist_meas_0 | R_vel_meas_0 | R_Dist_meas_1 | R_vel_meas_1 | R_Dist_meas_2 | R_vel_meas_2 | ... | R_Dist_meas_47 | R_vel_meas_47 |
|---------------|--------------|---------------|--------------|---------------|--------------|-----|----------------|---------------|
| 5             |              |               |              |               |              |     |                |               |
|               |              |               |              | 10            |              |     |                |               |
|               |              |               |              | 8             |              |     |                |               |
| 2             |              | 8             |              |               |              |     |                |               |
The expected sum is 33.
How can I solve it?
Your list comprehension will go out of bounds when you index into the dataframe, since the sample only has columns up to R_Dist_meas_2. If you use RDM as header keys you will be looking up columns, not rows.
sum(g.iloc[:,:2].sum())
The inner .sum() adds up the rows of each column separately, and the outer sum then adds those column totals together for the final result. This should give you the sum you are looking for.
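Alternatively, if the real dataframe only has some of the 48 distance columns, a hedged sketch that builds the key list from the columns which actually exist (using the g and the naming scheme from the question) avoids the KeyError entirely:
# keep only the distance columns that are really present in g
dist_cols = [c for c in g.columns if c.startswith("R_Dist_meas_")]
total = g[dist_cols].sum().sum()      # sum each column, then add the column totals
g.loc[g.index[0], 'sum'] = total      # 33 for the sample data above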

Is there a way to set the indices of a multi index DataFrame?

I am attempting to create a multi index dataframe which contains every possible index combination, even ones that currently contain no values. I wish to set these missing values to 0. To achieve this, I used the following:
index_levels = ['Channel', 'Duration', 'Designation', 'Manufacturing Class']
grouped_df = df.groupby(by = index_levels)[['Total Purchases', 'Sales', 'Cost']].agg('sum')
grouped_df = grouped_df.reindex(pd.MultiIndex.from_product(grouped_df.index.levels), fill_value = 0)
The expected result:
___________________________________________________________________________________________
|Chan. | Duration | Designation| Manufact. |Total Purchases| Sales | Cost |
|______|____________|____________|______________|_______________|_____________|_____________|
| | Month | Special | Brand | 0 | 0.00 | 0.00 |
| | | |______________|_______________|_____________|_____________|
| | | | Generic | 1567 | 16546.07 | 16000.00 |
|Retail|____________|____________|______________|_______________|_____________|_____________|
| | Season |Not Special | Brand | 351 | 13246.00 | 15086.26 |
| | | |______________|_______________|_____________|_____________|
| | | | Generic | 0 | 0.00 | 0.00 |
|______|____________|____________|______________|_______________|_____________|_____________|
This result is produced when every value of each index level appears at least once in the data. However, if a value of one of the levels never appears, then the following result is produced instead.
___________________________________________________________________________________________
|Chan. | Duration | Designation| Manufact. |Total Purchases| Sales | Cost |
|______|____________|____________|______________|_______________|_____________|_____________|
| | Monthly | Special | Generic | 1567 | 16546.07 | 16000.00 |
|Retail|____________|____________|______________|_______________|_____________|_____________|
| | Season |Not Special | Brand | 351 | 13246.00 | 15086.26 |
|______|____________|____________|______________|_______________|_____________|_____________|
For some reason, the missing combinations keep getting dropped. How can I fix the indices so that the desired result is always produced and I can reliably use these indices for calculations, even when some index values have no data?
You should change the reindex part: pd.MultiIndex.from_product() should be fed the level values from the original dataframe (by passing grouped_df.index.levels you only pass the values that remain after the groupby).
This is a solution that would work:
full_idx = [df[col].dropna().unique() for col in index_levels]
grouped_df = grouped_df.reindex(pd.MultiIndex.from_product(full_idx), fill_value = 0)
If you are also interested in NaN categories, you should remove dropna() when defining the full indexes.
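Putting it together, a minimal sketch of the full flow (assuming df is the raw dataframe with the column names from the question; names= is added so the rebuilt levels stay labelled):
import pandas as pd

index_levels = ['Channel', 'Duration', 'Designation', 'Manufacturing Class']

grouped_df = (df.groupby(by=index_levels)[['Total Purchases', 'Sales', 'Cost']]
                .agg('sum'))

# build the cartesian product from the original data, not from the
# levels that survive the groupby
full_idx = pd.MultiIndex.from_product(
    [df[col].dropna().unique() for col in index_levels],
    names=index_levels,
)
grouped_df = grouped_df.reindex(full_idx, fill_value=0)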

Iterate pyspark dataframe rows and apply UDF

I have a dataframe that looks like this:
+--------------+----------+----------+
| partitionCol | orderCol | valueCol |
+--------------+----------+----------+
| A | 1 | 201 |
| A | 2 | 645 |
| A | 3 | 302 |
| B | 1 | 335 |
| B | 2 | 834 |
+--------------+----------+----------+
I want to group by partitionCol, then within each partition iterate over the rows, ordered by orderCol, and apply some function to calculate a new column based on valueCol and a cached value.
e.g.
def foo(col_value, cached_value):
    tmp = <some value based on a condition between col_value and cached_value>
    <update the cached_value using some logic>
    return tmp
I understand I need to group by partitionCol and apply a UDF that will operate on each chunk separately, but I am struggling to find a good way to iterate the rows and apply the logic I described, to get the desired output of:
+--------------+----------+----------+---------------+
| partitionCol | orderCol | valueCol | calculatedCol |
+--------------+----------+----------+---------------+
| A | 1 | 201 | C1 |
| A | 2 | 645 | C1 |
| A | 3 | 302 | C2 |
| B | 1 | 335 | C1 |
| B | 2 | 834 | C2 |
+--------------+----------+----------+---------------+
I think the best way for you to do that is to apply a UDF on the whole set of data:
from pyspark.sql import functions as F
# first, you create a struct with the order col and the value col
df = df.withColumn("my_data", F.struct(F.col('orderCol'), F.col('valueCol')))
# then you create an array of that new column
df = df.groupBy("partitionCol").agg(F.collect_list('my_data').alias("my_data"))
# finally, you apply your function on that array
df = df.withColumn("calculatedCol", my_udf(F.col("my_data")))
But without knowing exactly what you want to do, that is all I can offer.
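To make that concrete, here is a hedged sketch of what my_udf could look like, continuing the snippet above: it sorts the collected structs by orderCol, walks through them while carrying a cached value, and returns one label per element. The condition and the cache update are placeholders, not the real logic from the question:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

@F.udf(returnType=ArrayType(StringType()))
def my_udf(rows):
    rows = sorted(rows, key=lambda r: r['orderCol'])   # restore the ordering
    cached_value = 0
    out = []
    for r in rows:
        # placeholder condition and cache update -- replace with the real rules
        label = 'C1' if r['valueCol'] >= cached_value else 'C2'
        cached_value = r['valueCol']
        out.append(label)
    return out

# one array of labels per partition; explode it afterwards if you need
# one row per original record
df = df.withColumn("calculatedCol", my_udf(F.col("my_data")))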

Merge two spark dataframes based on a column

I have 2 dataframes which I need to merge based on a column (Employee code). Please note that the dataframes have about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using Databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | | | | | | | |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | | | | | C | | | | |
|B10001 | | | | | | | | |T2 |
|A10001 | | | | | | | | B | |
|A10001 | | | C | | | | | | |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
I need to merge the 2 dataframes based on EMP_CODE, basically joining dataframe1 with dataframe2 on emp_code. I am getting duplicate columns when I do the join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | C | | C | | | B | |
|B10001 | | | | | | | | |T2 |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
There are 3 rows with emp_code A10001 in dataframe2, and 1 row in dataframe1. All data should be merged as one record without any duplicate columns.
Thanks much
You can use an inner join:
output = df1.join(df2,['EMP_CODE'],how='inner')
You can also apply distinct at the end to remove duplicates:
output = df1.join(df2,['EMP_CODE'],how='inner').distinct()
You can do that in Scala, if both dataframes have the same columns, by
output = df1.union(df2)
First you need to aggregate the individual dataframes.
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)))
You have to write this for all columns and for all dataframes.
Then you'll have to use the union function on all dataframes.
df1.union(df2)
and then repeat the same aggregation on that unioned dataframe.
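As a hedged sketch of that idea without spelling out every column by hand (assuming the blanks in the sample are really nulls), you can union first and then take the first non-null value per EMP_CODE for every other column:
from pyspark.sql import functions as F

unioned = df1.unionByName(df2)
agg_exprs = [F.first(c, ignorenulls=True).alias(c)
             for c in unioned.columns if c != 'EMP_CODE']
output = unioned.groupBy('EMP_CODE').agg(*agg_exprs)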
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()

How to overwrite identical column names when performing an "outer" join in Pandas?

I am trying to merge/join two CSVs, based on a unique city/country/state column combination, using Pandas. However, when I try to do this using an outer join, I am getting extra columns, when instead I would prefer to have the "right" side of my join overwrite the columns in the "left" side of the join. Any suggestions?
Here is my attempt, with an example:
These are my csv's:
My "left" csv file:
| city | country | state | pop | lat | long |
|--------------+---------+-------+----------+---------+---------|
| beijing | cn | 22 | 456 | 456 | 456 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 123 | 123 | 123 |
My "right" csv file:
| city | country | state | pop | lat | long |
|-------------+---------+-------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 |
| beijing | cn | 22 | 11716620 | 39.907 | 116.397 |
| mexico city | mx | 9 | 12294193 | 19.428 | -99.128 |
and I want this result:
| city | country | state | pop | lat | long |
|--------------+---------+-------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 |
| beijing | cn | 22 | 11716620 | 39.907 | 116.397 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 12294193 | 19.428 | -99.128 |
Note that mexico city and beijing are considered matches, based on their city, country, and state columns. Also note that on these matching rows, each column from my "left" csv is overwritten by the matching column from my "right" csv.
So here is my attempt using Pandas and dataframes:
left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')
result = pd.merge(left, right, on=['city', 'country', 'state'], how='outer')
Unfortunately, here is my result:
| city | country | state | pop_x | lat_x | long_x | pop_y | lat_y | long_y |
|--------------+---------+-------+----------+-----------+------------+----------+-----------+------------|
| adamsville | us | al | 4400 | 33.60575 | -86.97465 | 4400 | 33.60575 | -86.97465 |
| alabaster | us | al | 32707 | 33.219442 | -86.823907 | 32707 | 33.219442 | -86.823907 |
| albertville | us | al | | 34.26313 | -86.21066 | | 34.26313 | -86.21066 |
| beijing | cn | 22 | 456 | 456 | 456 | 11716620 | 39.907 | 116.397 |
| buenos aires | ar | 7 | 13076300 | -34.613 | -58.377 | 13076300 | -34.613 | -58.377 |
| mexico city | mx | 9 | 123 | 123 | 123 | 12294193 | 19.428 | -99.128 |
| mumbai | in | 16 | 12691836 | 19.073 | 72.883 | 12691836 | 19.073 | 72.883 |
| shanghai | cn | 23 | 22315474 | 31.222 | 121.458 | 22315474 | 31.222 | 121.458 |
As shown above, the columns that are not being used for the join, and which have the same name, are renamed with a _x suffix for the "left" dataframe and a _y suffix for the "right" dataframe.
Is there a simple way to make the columns from the "right" dataframe to overwrite the columns from the "left" dataframe when matched?
Although there seem to be similar questions already out there, I still can't seem to find an answer. For example, I tried implementing the solution based on this question:
left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')
left = left.set_index(['city','country','state'])
right = right.set_index(['city','country','state'])
left.update(right)
But update only performs a left join, so the resulting dataframe only has the rows from the left dataframe and is missing cities like adamsville and alabaster above.
Since the column names for both dataframes are the same, you could stack them and then do a drop_duplicates or a groupby.
For example:
result = pd.concat([left, right]).reset_index()
result.drop_duplicates(['city','country','state'], keep='first', inplace=True)
or:
df_stacked = pd.concat([left, right]).reset_index()
result = df_stacked.groupby(['city','country','state']).first()
Calling first will take the values from the "left" df over the "right" df, because we're stacking the "left" df on top of the "right" df and resetting the index.
Using groupby will allow you to perform more complex selects on the aggregated records if you don't want to just take the first or last record.
EDIT:
Just realized you want the "right" df to overwrite the "left" df, in that case...
df_stacked = pd.concat([right, left]).reset_index()
result = df_stacked.groupby(['city','country','state']).first()
This methodology only works if the "left" and "right" dataframes don't contain duplicate records to start.
And for the record, to get to the csv solution in the example above, we can perform the following:
result = result.reset_index()
# sort by descending population, and if populations are equal (or NaN), sort by ascending city name
result = result.sort_values(['pop', 'city'], ascending=[False, True])
result.drop('index', axis=1, inplace=True)
result.to_csv('result.csv', index=False)
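Another compact option is combine_first, a sketch under the assumption that both files share the same column names: it keeps the union of rows and prefers the right-hand values wherever both frames have an entry (a NaN on the right still falls back to the left value, which differs subtly from a hard overwrite):
import pandas as pd

left = pd.read_csv('left.csv')
right = pd.read_csv('right.csv')

keys = ['city', 'country', 'state']
# rows from both frames are kept; non-null values from `right` win
result = (right.set_index(keys)
               .combine_first(left.set_index(keys))
               .reset_index())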
Try:
res = pd.concat([left, right], ignore_index=True)
res = res.drop(res[res['city'].duplicated(keep='last')].index, axis=0)
Try this (DataFrame.append has been removed in recent pandas, so use pd.concat):
result = pd.concat([left, right]).drop_duplicates(['city'], keep='last')
