Remove duplicate values from particular columns of a pandas dataframe - python

I have a data frame with multiple columns, and I want to select a subset of columns and remove the duplicate values from it.
I do not want to remove rows; I only want to blank out the duplicate values in particular columns.
My data frame looks like:
I want to remove duplicates from these columns ["PLACEMENT # NAME", "IMPRESSIONS", "ENGAGEMENTS", "DPEENGAGEMENTS"], so that my output looks like:

Here's some of your data:
import pandas as pd
df = pd.DataFrame({'PLACEMENT # NAME': ['Blend of Vdx Display', 'Blend of Vdx Display',
                                        'Blend of Vdx Display', 'Blend of Vdx Display'],
                   'PRODUCT': ['Display', 'Display', 'Mobile', 'Mobile'],
                   'VIDEONAME': ['Features', 'TVC', 'video1', 'video2'],
                   'COST_TYPE': ['CPE', 'CPE', 'CPE', 'CPE'],
                   'Views': [1255, 10479, 156, 20],
                   '50_pc_video': [388, 2402, 38, 10],
                   '75_pc_cideo_10': ['', '', '', ''],
                   'IMPRESSIONS': [778732, 778732, 778732, 778732],
                   'ENGAGEMENTS': [13373, 13373, 13373, 13373],
                   'DPEENGAGEMENTS': [7142, 7142, 7142, 7142]})
You can accomplish what you want with .loc + .duplicated()
dup_cols = ['PLACEMENT # NAME', 'IMPRESSIONS', 'ENGAGEMENTS', 'DPEENGAGEMENTS']
df.loc[df.duplicated(dup_cols), dup_cols] = ''
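One thing to watch for: some of these columns hold integers, and recent pandas releases may warn when a string such as '' is written into an integer column. A minimal sketch of a more defensive variant, assuming it is acceptable for the affected columns to become object dtype (the trimmed-down frame and the name is_repeat are only for illustration):

```python
import pandas as pd

# Trimmed-down version of the sample data (illustrative subset of columns).
df = pd.DataFrame({
    'PLACEMENT # NAME': ['Blend of Vdx Display'] * 4,
    'VIDEONAME': ['Features', 'TVC', 'video1', 'video2'],
    'IMPRESSIONS': [778732] * 4,
    'ENGAGEMENTS': [13373] * 4,
})
dup_cols = ['PLACEMENT # NAME', 'IMPRESSIONS', 'ENGAGEMENTS']

# True for every row that repeats an earlier combination of the dup_cols values.
is_repeat = df.duplicated(dup_cols)

# Cast to object first so the numeric columns can hold the '' placeholder.
df = df.astype({col: object for col in dup_cols})
df.loc[is_repeat, dup_cols] = ''

print(df)
```

Note that duplicated() flags a repeat anywhere in the frame, not only in consecutive rows, so sort or group first if that matters for your data.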

Related

Convert a multi-valued dict into a pandas dataframe

I want to convert this dict into a pandas dataframe where each key becomes a column and values in the list become the rows:
my_dict:
{'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
'Name': ['Bitcoin', 'Binance Coin', 'XRP', 'Cardano', 'Binance USD'],
'Rank': [1, 3, 7, 4, 25],
}
The lists in my_dict can also have some missing values, which should appear as NaNs in dataframe.
This is how I'm currently trying to append it into my dataframe:
df = pd.DataFrame(columns = ['Last updated',
'Symbol',
'Name',
'Rank',])
df = df.append(my_dict, ignore_index=True)
#print(df)
df.to_excel(r'\walletframe.xlsx', index = False, header = True)
But my output only has a single row containing all the values.
The answer was pretty simple. Instead of using
df = df.append(my_dict)
I used
df = pd.DataFrame.from_dict(my_dict, orient='index').T
This builds one row per key, pads the shorter lists with missing values, and then transposes the result so that each key becomes a column.
Credits to @Ank, who helped me find the solution!
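For lists of unequal length, as in my_dict, the dict cannot be passed to the DataFrame constructor directly, because pandas requires all columns to have the same length. Here is a small sketch of two ways around that, using a shortened copy of the dict from the question; the names df_a and df_b are just for illustration:

```python
import pandas as pd

my_dict = {
    'Last updated': ['2021-05-18T15:24:19.000Z', '2021-05-18T15:24:19.000Z'],
    'Symbol': ['BTC', 'BNB', 'XRP', 'ADA', 'BUSD'],
    'Rank': [1, 3, 7, 4, 25],
}

# Option 1: one row per key (short rows are padded), then transpose
# so that the keys become columns again.
df_a = pd.DataFrame.from_dict(my_dict, orient='index').T

# Option 2: wrap each list in a Series; pandas aligns the columns on the
# row index and fills the gaps with NaN.
df_b = pd.DataFrame({key: pd.Series(values) for key, values in my_dict.items()})

print(df_a)
print(df_b)
```

Option 2 tends to keep a numeric dtype for complete columns such as Rank, while the transpose in option 1 leaves every column as object.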

Create a dictionary of unique values of a column in a dataframe in pandas

I have a dataframe:
import pandas as pd
df = pd.DataFrame({
'ID': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ'],
'value': [100, 120, 130, 200, 190, 210],
'value2': [2100, 2120, 2130, 2200, 2190, 2210],
'state': ['init','mid', 'final', 'init', 'mid', 'final'],
})
I want to create a dictionary of the unique values of the column 'ID'. I can extract the unique values with:
df.ID.unique()
But that gives me a list. I want the output to be a dictionary, which looks like this:
dict = {0:'ABC', 1: 'XYZ'}
If the number of unique entries in the column is n, then the keys should run from 0 to n-1. The values should be the unique entries in the column.
The actual dataframe has thousands of rows and is updated often, so I cannot maintain the dict manually.
Try this:
dict(enumerate(df.ID.unique()))
{0: 'ABC', 1: 'XYZ'}
If you want to get unique values for a particular column in dict, try:
val_dict = {idx:value for idx , value in enumerate(df["ID"].unique())}
Printing val_dict gives:
{0: 'ABC', 1: 'XYZ'}
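If you also need the integer code for every row (for example, to store it in a new column), pd.factorize gives you both pieces at once. This is an alternative to the answers above rather than what they use, and the names codes, uniques and id_by_code are just illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ'],
    'value': [100, 120, 130, 200, 190, 210],
})

# codes: one integer per row; uniques: the distinct IDs in order of first appearance.
codes, uniques = pd.factorize(df['ID'])

# Same mapping as dict(enumerate(df.ID.unique())).
id_by_code = dict(enumerate(uniques))
print(id_by_code)
```

Since the dataframe is updated often, rerun this after each update rather than caching the dict.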

Pandas - Extracting data from Series

I am trying to extract a set of data from a column that is of type pandas.core.series.Series.
I tried
df['col1'] = df['details'].astype(str).str.findall(r'name\=(.*?),')
but the above returns null.
Below is how the data looks in the column df['details']:
[{'id': 101, 'name': 'Name1', 'state': 'active', 'boardId': 101, 'goal': '', 'startDate': '2019-01-01T12:16:20.296Z', 'endDate': '2019-02-01T11:16:00.000Z'}]
I am trying to extract the value corresponding to the name field.
Expected output: Name1
Try this simple example and adapt it to your needs:
import pandas as pd
df = pd.DataFrame([{'id': 101, 'name': 'Name1', 'state': 'active', 'boardId': 101, 'goal': '', 'startDate': '2019-01-01T12:16:20.296Z', 'endDate': '2019-02-01T11:16:00.000Z'}])
print(df['name'][0])
# or, if the dict is stored inside a 'details' column of your own dataframe:
# df['details'][0]['name']
NOTE: this assumes, as you mentioned, that details is one of the columns in your existing dataset.
import pandas as pd
df = pd.DataFrame([{'id': 101, 'name': 'Name1', 'state': 'active', 'boardId': 101, 'goal': '', 'startDate': '2019-01-01T12:16:20.296Z', 'endDate': '2019-02-01T11:16:00.000Z'}])
# The name column
print(df['name'])
# Find specific values in the Series
mask = df['name'].str.contains("Name")  # boolean mask, True where name contains "Name"
print(df[mask])  # all columns of the rows whose name contains "Name"
print(df.loc[mask, 'name'])  # only the matching values from the name column
I hope this example helps.
EDIT:
Your data frame has a column 'details', which contains a dict {'id':101, ...}
>>> df['details']
0 {'id': 101, 'name': 'Name1', 'state': 'active'...
And you want to get value from field 'name', so just try:
>>> df['details'][0]['name']
'Name1'
The structure in your series is a dictionary.
[{'id': 101, 'name': 'Name1', 'state': 'active', 'boardId': 101, 'goal': '', 'startDate': '2019-01-01T12:16:20.296Z', 'endDate': '2019-02-01T11:16:00.000Z'}]
You can just point to the element 'name' from that dict with the following command
df['details'][0]['name']
If the field name can vary, you can get the list of keys in the dictionary and apply your regex to that list to find the right field name.
I hope that helps.
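One caveat about the sample in the question: each cell is shown as a list wrapping a single dict, so an extra [0] may be needed before ['name']. The original regex probably matches nothing because the string form of the cell contains 'name': 'Name1' rather than name=Name1. A rough sketch of both routes, assuming every cell has the list-of-one-dict shape (the second row and the column names col1 and col2 are made up for illustration):

```python
import pandas as pd

# Two sample rows shaped like the data in the question; the second is invented.
df = pd.DataFrame({
    'details': [
        [{'id': 101, 'name': 'Name1', 'state': 'active'}],
        [{'id': 102, 'name': 'Name2', 'state': 'closed'}],
    ]
})

# Each cell is a list holding one dict: take element 0, then the 'name' key.
df['col1'] = df['details'].apply(lambda cell: cell[0]['name'])

# Regex alternative on the string form of the cell, matching "'name': '...'".
df['col2'] = df['details'].astype(str).str.extract(r"'name': '([^']*)'")[0]

print(df[['col1', 'col2']])
```

The apply version is usually the safer choice, since it does not depend on how Python prints the dict.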

Separate pd DataFrame Rows that are dictionaries into columns

I am extracting some data from an API and having challenges transforming it into a proper dataframe.
The resulting DataFrame df is arranged as such:
Index Column
0 {'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]}
1 {'different-email#email.com': [{'action': 'data', 'date': 'date'}]}
I am trying to split the emails into one column and the list into a separate column:
Index Column1 Column2
0 email#email.com [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]
Ideally, each 'action'/'date' pair would have its own separate row; however, I believe I can do the further unpacking myself.
After looking around, I tried (and failed with) lots of solutions, such as:
df.apply(pd.Series) # does nothing
pd.DataFrame(df['column'].values.tolist()) # makes each dictionary key a separate column,
# where most of the rows are NaN except the one that has the paired value
Edit:
Since many of the comments asked about the initial format of the data from the API, it's a list of dictionaries:
[{'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]},{'different-email#email.com': [{'action': 'data', 'date': 'date'}]}]
Thanks
One naive way of doing this is as below:
import pandas as pd
inp = [{'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]},
       {'different-email#email.com': [{'action': 'data', 'date': 'date'}]}]
rows = []
for each in inp:  # iterate through the list of dicts
    for k, v in each.items():  # take each key/value pair
        for eachv in v:  # the value is a list, so iterate through it
            print(str(eachv))
            # set_value() no longer exists in current pandas, so collect the rows instead
            rows.append({'Column1': k, 'Column2': str(eachv)})
df = pd.DataFrame(rows)
I am sure there is a better way of writing this. Hope this helps :)
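Assuming you have already read it into a dataframe, you can use the following:
import ast
# Parse the cells first if they were read in as strings (skip this line if they are already dicts)
df['Column'] = df['Column'].apply(ast.literal_eval)
# In Python 3, keys() and values() are views, so convert them to lists before indexing
df['email'] = df['Column'].apply(lambda x: list(x.keys())[0])
df['value'] = df['Column'].apply(lambda x: list(x.values())[0])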
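If you still have the raw list of dictionaries from the API, you could also flatten it before building the dataframe, which gives each action/date pair its own row. A minimal sketch using the sample data from the question; the names data and records and the column name email are illustrative choices:

```python
import pandas as pd

data = [
    {'email#email.com': [{'action': 'data', 'date': 'date'},
                         {'action': 'data', 'date': 'date'}]},
    {'different-email#email.com': [{'action': 'data', 'date': 'date'}]},
]

# One flat record per event, with the email repeated on every row.
records = [
    {'email': email, **event}
    for item in data
    for email, events in item.items()
    for event in events
]
df = pd.DataFrame(records)

print(df)
```

Building the frame once from a list of records avoids growing it row by row inside a loop.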

Pandas indexed dataframe display: use top left empty box

Is there a way to put text in the top left box of a dataframe display? Does that field have a name? See below:
import pandas as pd
raw_data = {'Regiment': ['Nighthawks', 'Raptors'],
'Company': ['1st', '2nd'],
'preTestScore': [4, 24],
'postTestScore': [25, 94]}
pd.DataFrame(raw_data, columns = ['Regiment', 'Company', 'preTestScore', 'postTestScore']).set_index('Regiment')
Yes. That space is used for the name of the columns. It can be filled in by doing
df.columns.name = 'your name'
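Put together with the example from the question, that could look like the sketch below. The labels 'Scores' and 'Metrics' are made up, and rename_axis is shown only as an equivalent that returns a new dataframe instead of modifying df in place:

```python
import pandas as pd

raw_data = {'Regiment': ['Nighthawks', 'Raptors'],
            'Company': ['1st', '2nd'],
            'preTestScore': [4, 24],
            'postTestScore': [25, 94]}
df = pd.DataFrame(raw_data).set_index('Regiment')

# Name the columns axis; the label is displayed in the header row.
df.columns.name = 'Scores'

# Equivalent that leaves df untouched and returns a renamed copy.
df2 = df.rename_axis(columns='Metrics')

print(df)
print(df2)
```

The index name ('Regiment') is kept in both cases, so the two labels appear together in the display.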
