Removing index which is not common between 2 dataframes - python

I have the following code:
import pandas.util.testing as testing
df = testing.makeDataFrame()
df
This this I have created 2 dataframes with one dataframe have 2 less lines than the original one.
This is df - Original
A B C D
OdhGFPa5Kw -0.686378 -1.210838 1.160708 0.903309
gelZFj4BG5 1.603112 1.852592 -0.065482 0.684566
mp3Aq5ueGD 0.254211 -0.788877 -0.626789 0.109116
pBtz9DHxUZ -0.970632 0.982661 -0.463984 -0.123727
K28pzbdYcX -1.311220 -2.121306 1.209484 -1.695901
71ZFgWaeDE 1.887420 0.337702 -0.176539 0.149089
alWOjkQ2eZ 1.997701 -0.354276 1.997802 -0.086803
This is df1 - with 2 less lines
A B C D
OdhGFPa5Kw -0.686378 -1.210838 1.160708 0.903309
gelZFj4BG5 1.603112 1.852592 -0.065482 0.684566
mp3Aq5ueGD 0.254211 -0.788877 -0.626789 0.109116
pBtz9DHxUZ -0.970632 0.982661 -0.463984 -0.123727
K28pzbdYcX -1.311220 -2.121306 1.209484 -1.695901
What I am trying to do is to remove all the rows which are not common between the two dataframes. To do this, we find the duplicate index in the two columns.
duplicates = set(df.index).intersection(df1.index)
Could you please advise how can I remove rows where index is not in the duplicates ?

If you want to remove the indices in place:
idx = df.index.difference(df1.index)
df.drop(idx, inplace=True)
If you want to create a new object:
idx = df.index.intersection(df1.index)
new_df = df.loc[idx]

Related

Pandas create two new columns based on 2 existing columns

I have a dataframe like the below:
dummy_dict_existing = {'Email':['joblogs#gmail.com', 'joblogs#gmail.com'],
'Ticket_Category': ['Tier1', 'Tier2'],
'Quantity_Purchased': [5,2],
'Total_Price_Paid':[1345.45, 10295.88]}
Email Ticket_Category Quantity_Purchased Total_Price_Paid
0 joblogs#gmail.com Tier1 5 1345.45
1 joblogs#gmail.com Tier2 2 10295.88
What I'm trying to do is to create 2 new columns "Tier1_Quantity_Purchased" and "Tier2_Quantity_Purchased" based on the existing dataframe, and sum the total of "Total_Price_Paid" as below:
dummy_dict_desired = {'Email':['joblogs#gmail.com'],
'Tier1_Quantity_Purchased': [5],
'Tier2_Quantity_Purchased':[2],
'Total_Price_Paid':[11641.33]}
Email Tier1_Quantity_Purchased Tier2_Quantity_Purchased Total_Price_Paid
0 joblogs#gmail.com 5 2 11641.33
Any help would be greatly appreciated. I know there is an easy way to do this, just can't figure out how without writing some silly for loop!
What you want to do is to pivot your table, and then add a column with aggregated data from the original table.
df = pd.DataFrame(dummy_dict_existing)
pivot_df = df.pivot(index='Email', columns='Ticket_Category', values='Quantity_Purchased')
pivot_df['total'] = df.groupby('Email')['Total_Price_Paid'].sum()
Email
Tier1
Tier2
total
joblogs#gmail.com
5
2
11641.33
For more details on pivoting, take a look at How can I pivot a dataframe?
import pandas as pd
dummy_dict_existing = {'Email':['joblogs#gmail.com', 'joblogs#gmail.com'],
'Ticket_Category': ['Tier1', 'Tier2'],
'Quantity_Purchased': [5,2],
'Total_Price_Paid':[1345.45, 10295.88]}
df = pd.DataFrame(dummy_dict_existing)
df2 = df[['Ticket_Category', 'Quantity_Purchased']]
df_transposed = df2.T
df_transposed.columns = ['Tier1_purchased', 'Tier2_purchased']
df_transposed = df_transposed.iloc[1:]
df_transposed = df_transposed.reset_index()
df_transposed = df_transposed[['Tier1_purchased', 'Tier2_purchased']]
df = df.groupby('Email')[['Total_Price_Paid']].sum()
df = df.reset_index()
df.join(df_transposed)
output

How to use wide_to_long (Pandas)

I have this code which I thought would reformat the dataframe so that the columns with the same column name would be replaced by their duplicates.
# Function that splits dataframe into two separate dataframes, one with all unique
# columns and one with all duplicates
def sub_dataframes(dataframe):
# Extract common prefix -> remove trailing digits
columns = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
# Split columns
unq_cols = columns[columns == 1].index
dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)] # All columns from
dataframe that is not in unq_cols
return dataframe[unq_cols], dataframe[dup_cols]
unq_df = sub_dataframes(df)[0]
dup_df = sub_dataframes(df)[1]
print("Unique columns:\n\n{}\n\nDuplicate
columns:\n\n{}".format(unq_df.columns.tolist(), dup_df.columns.tolist()))
Output:
Unique columns:
['total_tracks', 'popularity']
Duplicate columns:
['t_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 't_energy1', 't_energy2',
't_key0', 't_key1', 't_key2', 't_speech0', 't_speech1', 't_speech2', 't_acous0', 't_acous1', 't_acous2',
't_ins0', 't_ins1', 't_ins2', 't_live0', 't_live1', 't_live2', 't_val0', 't_val1', 't_val2', 't_tempo0',
't_tempo1', 't_tempo2']
Then I tried to use wide_to_long to combine columns with the same name:
cols = unq_df.columns.tolist()
temp = pd.wide_to_long(dataset.reset_index(), stubnames=['t_dur','t_dance', 't_energy', 't_key', 't_mode',
't_speech', 't_acous', 't_ins', 't_live', 't_val',
't_tempo'], i=['index'] + cols, j='temp', sep='t_')
.reset_index().groupby(cols, as_index=False).mean()
temp
Which gave me this output:
I tried to look at this question, but the dataframe that's returned has "Nothing to show". What am I doing wrong here? How do I fix this?
EDIT
Here is an example of how I've done it "by-hand", but I am trying to do it more efficiently using the already defined built-in functions.
The desired output is the dataframe that is shown last.

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string which I defined it as "string1" and I would like to loop through all 4 columns in the dataframe defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
# Objective: Replace "&FolderCTID", delete all string after
string1 = "&FolderCTID"
# Method 1
df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
# Method 2
df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
# Method 3
df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I did search and google and found similar solutions which were used but none of them work.
Can any guru shed some light on this? Any assistance is appreciated.
Added below is a few example rows in column A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how i would do it,
first declare a variable with your target columns.
Then use stack() and str.split to get your target output.
finally, unstack and reapply the output to your original df.
cols_to_slice = ['ColumnA','ColumnB','ColumnC','ColumnD']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1,expand=True)[1].unstack(1)
if you want to replace these columns in your target df then simply do -
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1,expand=True)[1].unstack(1)
You should first get the index of string using
indexes = len(string1) + df_MasterData[i].str.find(string1)
# This selected the final location of this string
# if you don't want to add string in result just use below one
indexes = len(string1) + df_MasterData[i].str.find(string1)
Now do
df_MasterData[i] = df_MasterData[i].str[:indexes]

Split dataframe by certain condition but keep the original dataframe

I have a dataframe "bb" like this:
Response Unique Count
I love it so much! 246_0 1
This is not bad, but can be better. 246_1 2
Well done, let's do it. 247_0 1
If count is lager than 1, I would like to split the string and make the dataframe "bb" become this: (result I expected)
Response Unique
I love it so much! 246_0
This is not bad 246_1_0
but can be better. 246_1_1
Well done, let's do it. 247_0
My code:
bb = DataFrame(bb[bb['Count'] > 1].Response.str.split(',').tolist(), index=bb[bb['Count'] > 1].Unique).stack()
bb = bb.reset_index()[[0, 'Unique']]
bb.columns = ['Response','Unique']
bb=bb.replace('', np.nan)
bb=bb.dropna()
print(bb)
But the result is like this:
Response Unique
0 This is not bad 246_1
1 but can be better. 246_1
How can I keep the original dataframe in this case?
First split only values per condition with to new helper Series and then add counter values by GroupBy.cumcount only per duplicated index values by Index.duplicated:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
mask = df1.index.duplicated(keep=False)
df1.loc[mask, 'Unique'] += df1[mask].groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done! 247_0
EDIT: If need _0 for all another values remove mask:
s = df.loc[df.pop('Count') > 1, 'Response'].str.split(',', expand=True).stack()
df1 = df.join(s.reset_index(drop=True, level=1).rename('Response1'))
df1['Response'] = df1.pop('Response1').fillna(df1['Response'])
df1['Unique'] += df1.groupby(level=0).cumcount().astype(str).radd('_')
df1 = df1.reset_index(drop=True)
print (df1)
Response Unique
0 I love it so much! 246_0_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done! 247_0_0
Step wise we can solve this problem the following:
Split your dataframes by count
Use this function to explode the string to rows
We groupby on index and use cumcount to get the correct unique column values.
Finally we concat the dataframes together again.
df1 = df[df['Count'].ge(2)] # all rows which have a count 2 or higher
df2 = df[df['Count'].eq(1)] # all rows which have count 1
df1 = explode_str(df1, 'Response', ',') # explode the string to rows on comma delimiter
# Create the correct unique column
df1['Unique'] = df1['Unique'] + '_' + df1.groupby(df1.index).cumcount().astype(str)
df = pd.concat([df1, df2]).sort_index().drop('Count', axis=1).reset_index(drop=True)
Response Unique
0 I love it so much! 246_0
1 This is not bad 246_1_0
2 but can be better. 246_1_1
3 Well done! 247_0
Function used from linked answer:
def explode_str(df, col, sep):
s = df[col]
i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
return df.iloc[i].assign(**{col: sep.join(s).split(sep)})

Parsing JSON in Pandas

I need to extract the following json:
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Status":"SMART Passed","Name":"/dev/sda"},{"Status":"SMART Passed","Name":"/dev/sdb"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"},{"Name":"disk1","Status":"Passed"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Failed"},{"Name":"disk1","Status":"not supported"}]}
{"PhysicalDisks":[{"Name":"disk0","Status":"Passed"}]}
Name: raw_results, dtype: object
Into separate columns. I don't know how many disks per result there might be in future. What would be the best way here?
I tried the following:
d = raw_res['raw_results'].map(json.loads).apply(pd.Series).add_prefix('raw_results.')
Gives me:
Example output might be something like
Better way would be to add each disk check as an additional row into dataframe with the same checkid as the row it was extracted from. So for 3 disks in results it will generate 3 rows 1 per disk
UPDATE
This code
# This works
dfs = []
def json_to_df(row, json_col):
json_df = pd.read_json(row[json_col])
dfs.append(json_df.assign(**row.drop(json_col)))
df['raw_results'].replace("{}", pd.np.nan, inplace=True)
df = df.dropna()
df.apply(json_to_df, axis=1, json_col='raw_results')
df = pd.concat(dfs)
df.head()
Adds an extra row for each disk (sda, sdb etc.)
So now I would need to split this column into 2: Status and Name.
df1 = df["PhysicalDisks"].apply(pd.Series)
df_final = pd.concat([df, df1], axis = 1).drop('PhysicalDisks', axis = 1)
df_final.head()

Categories

Resources