Append duplicate rows as single record - pandas - python

I am trying to append duplicated rows into a single row. Duplication is based on a set of columns that must match exactly for rows to be considered duplicates. The problem is that we can't know in advance how many repeated rows there will be, so the solution needs to be generic.
I have tried some other solutions, but they rely on a user_id or key column. In my case, all columns except one have to match exactly, and the appended columns should keep the same column names.
I have tried concat, merge, and join; nothing works.
I used the following code to find duplicated rows:
col = ['TITLE', 'ISSN', 'e-ISSN', 'ISBN', 'e-ISBN']
duplicated_data = data[data.duplicated(col, keep=False)]
Here is the input:
| DealName | Target | Category |
| -------- | ------ | -------- |
| ABC-XYZ  | ABC    | A        |
| ABC-XYZ  | ABC    | B        |
| ABC-XYZ  | None   | C        |
| ABC-XYZ  | ABC    | C        |
| ABC-XYZ  | None   | D        |
| ABC-XYZ  | Zab    | C        |
and I want to convert it to something like this:
| DealName | Target | Category | DealName | Target | Category | DealName | Target | Category |
| -------- | ------ | -------- | -------- | ------ | -------- | -------- | ------ | -------- |
| ABC-XYZ  | ABC    | A        | ABC-XYZ  | ABC    | B        | ABC-XYZ  | ABC    | C        |
| ABC-XYZ  | None   | C        | ABC-XYZ  | None   | D        |          |        |          |
Suppose I have set the condition that the first two columns need to match in order to group rows: then the first, second, and fourth rows get appended as a single record, and the third and fifth rows get appended as another record.
Any idea how to do this?
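A minimal sketch of one possible approach (not taken from the original thread), assuming pandas 1.1+ for `dropna=False` in `groupby` and using the example's `DealName`/`Target` as the match columns: number the duplicates inside each group with `cumcount`, then pivot that counter into side-by-side blocks of columns.

```python
import pandas as pd

# Toy frame mirroring the example input
df = pd.DataFrame({
    'DealName': ['ABC-XYZ'] * 6,
    'Target':   ['ABC', 'ABC', None, 'ABC', None, 'Zab'],
    'Category': ['A', 'B', 'C', 'C', 'D', 'C'],
})

key_cols = ['DealName', 'Target']              # columns that must match exactly
all_cols = ['DealName', 'Target', 'Category']

# Keep only rows that have at least one duplicate on the key columns
dup = df[df.duplicated(key_cols, keep=False)].copy()

# Number the rows inside each duplicate group and give every group an id
grouped = dup.groupby(key_cols, dropna=False)
dup['occurrence'] = grouped.cumcount()
dup['grp_id'] = grouped.ngroup()

# Pivot so every occurrence becomes its own block of columns
wide = dup.pivot(index='grp_id', columns='occurrence', values=all_cols)

# Reorder so each occurrence forms a contiguous block, then drop the
# occurrence level to keep the repeated original column names
wide = wide.reorder_levels([1, 0], axis=1)
order = [(i, c) for i in range(dup['occurrence'].max() + 1) for c in all_cols]
wide = wide[order]
wide.columns = [c for _, c in wide.columns]

print(wide)
```

Missing slots in the shorter groups come out as NaN, which corresponds to the blank cells in the desired layout.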

Related

Is there a way to set the indices of a multi index DataFrame?

I am attempting to create a multi-index DataFrame that contains every possible index combination, even ones that currently contain no values. I wish to set these non-existent values to 0. To achieve this, I used the following:
index_levels = ['Channel', 'Duration', 'Designation', 'Manufacturing Class']
grouped_df = df.groupby(by = index_levels)[['Total Purchases', 'Sales', 'Cost']].agg('sum')
grouped_df = grouped_df.reindex(pd.MultiIndex.from_product(grouped_df.index.levels), fill_value = 0)
The expected result:
___________________________________________________________________________________________
|Chan. | Duration | Designation| Manufact. |Total Purchases| Sales | Cost |
|______|____________|____________|______________|_______________|_____________|_____________|
| | Month | Special | Brand | 0 | 0.00 | 0.00 |
| | | |______________|_______________|_____________|_____________|
| | | | Generic | 1567 | 16546.07 | 16000.00 |
|Retail|____________|____________|______________|_______________|_____________|_____________|
| | Season |Not Special | Brand | 351 | 13246.00 | 15086.26 |
| | | |______________|_______________|_____________|_____________|
| | | | Generic | 0 | 0.00 | 0.00 |
|______|____________|____________|______________|_______________|_____________|_____________|
This result is produced when at least one of the index levels contains a value. However, if the index level does not contain any value, then the following result is produced below.
___________________________________________________________________________________________
|Chan. | Duration | Designation| Manufact. |Total Purchases| Sales | Cost |
|______|____________|____________|______________|_______________|_____________|_____________|
| | Monthly | Special | Generic | 1567 | 16546.07 | 16000.00 |
|Retail|____________|____________|______________|_______________|_____________|_____________|
| | Season |Not Special | Brand | 351 | 13246.00 | 15086.26 |
|______|____________|____________|______________|_______________|_____________|_____________|
For some reason, the missing combinations keep getting dropped. How can I fix the indices so that the desired result is always produced, and so that I can reliably use these indices for calculations even when they contain no values?
You should change the reindex part: pd.MultiIndex.from_product() should take the original dataframe's values as input (by passing grouped_df.index.levels you pass only the index values that remain after the groupby).
This is a solution that would work:
full_idx = [df[col].dropna().unique() for col in index_levels]  # categories taken from the original data
grouped_df = grouped_df.reindex(pd.MultiIndex.from_product(full_idx, names=index_levels), fill_value=0)  # names= keeps the level names
If you are also interested in NaN categories, remove dropna() when building the full indexes.
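A compact, self-contained illustration of that pattern (toy column names, not the asker's data), showing a missing (Duration, Designation) combination being filled with zeros:

```python
import pandas as pd

df = pd.DataFrame({
    'Duration':    ['Month', 'Season', 'Season'],
    'Designation': ['Special', 'Not Special', 'Special'],
    'Sales':       [10.0, 20.0, 5.0],
})

index_levels = ['Duration', 'Designation']
grouped_df = df.groupby(by=index_levels)[['Sales']].agg('sum')
# grouped_df has 3 rows; the (Month, Not Special) combination is missing

full_idx = [df[col].dropna().unique() for col in index_levels]
grouped_df = grouped_df.reindex(
    pd.MultiIndex.from_product(full_idx, names=index_levels), fill_value=0
)
print(grouped_df)   # all 4 combinations are now present, missing ones filled with 0
```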

Python Pandas datasets - Having integer values in a new column by making a dictionary

I am trying to output integer values (labels/classes) in a new column, based on the labels of another column in my dataset. I actually did it by creating a new column (with a numerical column heading) of boolean values for each class, and then using those to build the new class column with numerical values. But I was trying to do it with a dictionary instead, which I think is a cleaner and faster way.
If I run code like this:
x = df['Item_Type'].value_counts()
item_type_mapping = {}
item_list = x.index
for i in range(0, len(item_list)):
    item_type_mapping[item_list[i]] = i
It generates the dictionary, but then if I run:
df['Item_Type']=df['Item_Type'].map(lambda x:item_type_mapping[x])
or
df['New_column']=[item_type_mapping[item] for item in data.Item_Type]
It raises KeyError: None.
Does anybody know why this occurs? I find it strange, since the dictionary has been created and I can see it in my variables.
Thanks
Edit 1
#Fourier
Simply put, I have this column:
| Item_type|
| -------- |
| Nino |
| Nino |
| Nino |
| Pasquale |
| Franco |
| Franco |
and then I need the same column or a new one to display:
| Item_type| New_column |
| -------- | ---------- |
| Nino | 1 |
| Nino | 1 |
| Nino | 1 |
| Pasquale | 2 |
| Franco | 3 |
| Franco | 3 |
Your code works for me, but what you're trying to do is already provided by pandas as Categorical data.
df = pd.DataFrame({'Item_Type': list('abca')})
df['New_column'] = df.Item_Type.astype('category').cat.codes
Result:
Item_Type New_column
0 a 0
1 b 1
2 c 2
3 a 0
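If the codes should start at 1 and follow order of first appearance, as in the expected output above, pd.factorize is another option (a sketch, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'Item_Type': ['Nino', 'Nino', 'Nino', 'Pasquale', 'Franco', 'Franco']})

# factorize assigns codes in order of first appearance, starting at 0
codes, uniques = pd.factorize(df['Item_Type'])
df['New_column'] = codes + 1   # shift to start at 1, matching the expected output
print(df)
```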

How to replace values in DataFrame with values from second DataFrame with condition that it selects different column?

I have two DataFrames.
df1:
A | B | C
-----|---------|---------|
25zx | b(50gh) | |
50tr | a(70lc) | c(50gh) |
df2:
A | B
-----|-----
25zx | T
50gh | K
50tr | K
70lc | T
I want to replace values in df1. The column I'm matching on is df2['A'], but the value I want to put into df1 is the corresponding value from df2['B'].
So the final table would look like:
df3:
A | B | C
-----|---------|---------|
T | b(K) | |
K | a(T) | c(K) |
Cast df2 to dict and use replace:
print(df1.replace(df2.set_index("A")["B"].to_dict(), regex=True))
A B C
0 T b(K) None
1 K a(T) c(K)
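A self-contained version of the same idea (just a sketch reconstructing the two frames from the question):

```python
import pandas as pd

df1 = pd.DataFrame({
    'A': ['25zx', '50tr'],
    'B': ['b(50gh)', 'a(70lc)'],
    'C': [None, 'c(50gh)'],
})
df2 = pd.DataFrame({
    'A': ['25zx', '50gh', '50tr', '70lc'],
    'B': ['T', 'K', 'K', 'T'],
})

# Build a {code: replacement} mapping and replace substrings everywhere (regex=True)
mapping = df2.set_index('A')['B'].to_dict()
df3 = df1.replace(mapping, regex=True)
print(df3)
```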

PySpark search inside very large dataframe

I have a very large dataframe in PySpark. It has over 10 million rows and over 30 columns.
What is the most efficient way to search the entire dataframe for a given list of values and remove the rows that contain any of them?
The given list of values:
list=['1097192','10727550','1098754']
The dataframe(df) is :
+---------+--------------+---------------+---------+------------+
| id | first_name | last_name | Salary | Verifycode |
+---------+--------------+---------------+---------+------------+
| 1986 | Rollie | Lewin | 1097192 | 42254172 | -Remove Row
| 289743 | Karil | Sudron | 2785190 | 3703538 |
| 3864 | Massimiliano | Dallicott | 1194553 | 23292573 |
| 49074 | Gerry | Grinnov | 1506584 | 62291161 |
| 5087654 | Nat | Leatherborrow | 1781870 | 55183252 |
| 689 | Thaine | Tipple | 2150105 | 40583249 |
| 7907 | Myrlene | Croley | 2883250 | 70380540 |
| 887 | Nada | Redier | 2676139 | 10727550 | -Remove Row
| 96533 | Sonny | Bosden | 1050067 | 13110714 |
| 1098754 | Dennie | McGahy | 1804487 | 927935 | -Remove Row
+---------+--------------+---------------+---------+------------+
If it were a smaller dataframe, I could use the collect() or toLocalIterator() functions, then iterate over the rows and remove them based on the list values.
Since it is a very large dataframe, what is the best way to solve this?
I have come up with this solution for now, but is there a better way:
from pyspark.sql.functions import col

column_names = df.schema.names
for name in column_names:
    df = df.filter(~col(name).isin(list))
You got the correct approach: filtering the DataFrame with filter and the isin function. isin is fine as long as the list is small (a few thousand values, not millions). Also make sure your dataframe is partitioned into at least 3 × the number of CPU cores on the executors; without enough partitions, parallelism will suffer.
I am more comfortable with Scala, so please take the concept from the code below (shown here in PySpark form). You need to build a single Column object by OR-ing together the isin test for every column to be filtered on, then pass that final column object to dataframe.filter:
from pyspark.sql.functions import col

column_names = df.schema.names
# Initialize with the first column, then OR in the remaining ones
col_final = col(column_names[0]).isin(list)
for name in column_names[1:]:
    col_final = col_final | col(name).isin(list)
df = df.filter(~col_final)  # negate: keep only rows where no column matches
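For reference, a runnable miniature of the same filter, assuming a local SparkSession and a three-row stand-in for the data (toy values only; the cast to string is an assumption, since the question's list holds strings while the sample columns look numeric):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Tiny stand-in for the real 10-million-row frame
df = spark.createDataFrame(
    [(1986, 'Rollie', 1097192), (289743, 'Karil', 2785190), (887, 'Nada', 10727550)],
    ['id', 'first_name', 'Verifycode'],
)
values = ['1097192', '10727550', '1098754']   # values to purge, as in the question

# OR together an isin() test over every column, then keep the non-matching rows;
# casting to string makes the comparison explicit even though the list holds strings
mask = None
for name in df.columns:
    test = col(name).cast('string').isin(values)
    mask = test if mask is None else (mask | test)

df.filter(~mask).show()   # drops the Rollie and Nada rows
```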

Merge two spark dataframes based on a column

I have 2 dataframes which I need to merge based on a column (Employee code). Please note that the dataframes have about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using Databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | | | | | | | |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | | | | | C | | | | |
|B10001 | | | | | | | | |T2 |
|A10001 | | | | | | | | B | |
|A10001 | | | C | | | | | | |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
I need to merge the 2 dataframes based on EMP_CODE, basically join dataframe1 with dataframe2 on emp_code. I am getting duplicate columns when I do a join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | C | | C | | | B | |
|B10001 | | | | | | | | |T2 |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
There is 1 row with emp_code A10001 in dataframe1 and 3 rows in dataframe2; all of that data should be merged into one record, without any duplicate columns.
Thanks much
You can use an inner join:
output = df1.join(df2,['EMP_CODE'],how='inner')
You can also apply distinct() at the end to remove duplicates:
output = df1.join(df2,['EMP_CODE'],how='inner').distinct()
You can do that in Scala, if both dataframes have the same columns, by
output = df1.union(df2)
First you need to aggregate the individual dataframes.
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)))
You have to write this for all columns and for all dataframes.
Then you'll have to use the union function on all the dataframes:
df1.union(df2)
and then repeat the same aggregation on that union dataframe.
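As a sketch of how that could be written without repeating the line for all 75 columns (an illustration, not the original answer; it assumes the columns are strings and the blank cells are nulls, and it unions first and aggregates once, which should give the same collapsed result):

```python
from pyspark.sql import functions as F

def collapse(df, key='EMP_CODE'):
    # Concatenate the collected non-null values of every non-key column per key
    aggs = [
        F.concat_ws(' ', F.collect_list(F.col(c))).alias(c)
        for c in df.columns if c != key
    ]
    return df.groupBy(key).agg(*aggs)

# Union the two frames by column name, then aggregate once per EMP_CODE
combined = collapse(df1.unionByName(df2))
```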
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()
