How to remove duplicates based on partial match - python

I don't even know how to approach this, as it feels too complex for my level.
Imagine courier tracking numbers: I am receiving some duplicated updates from an upstream system in the following format (see the attached image, or the small piece of code below that creates the same table):
import pandas as pd

incoming_df = pd.DataFrame({
    'Tracking ID': ['4845', '24345', '8436474', '457453', '24345-S2'],
    'Previous': ['Paris', 'Lille', 'Paris', 'Marseille', 'Dijon'],
    'Current': ['Nantes', 'Dijon', 'Dijon', 'Marseille', 'Lyon'],
    'Next': ['Lyone', 'Lyon', 'Lyon', 'Rennes', 'NICE']
})
incoming_df
Obviously, tracking ID 24345-S2 (green arrow) is a duplicate of 24345 (red arrow); however, it is not a full duplicate but newer, updated location information (with history) for the parcel. How do I delete the old row 24345 and keep the new row 24345-S2 in the data set?
The tracking ID can be from 4 to 20 characters long, but '-S2' is always helpfully appended.
Thank you!

Edit: New solution:
# extract the base IDs of the '-S2' duplicates
duplicates = incoming_df['Tracking ID'].str.extract(r'(.+)-S2').dropna()
# remove the older entry if one exists
incoming_df = incoming_df[~incoming_df['Tracking ID'].isin(duplicates[0].unique())]
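With the sample data above, that leaves only the suffixed row for the duplicated parcel:
print(incoming_df['Tracking ID'].tolist())
# ['4845', '8436474', '457453', '24345-S2']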
If the 1234-S2 entry always appears lower in the DataFrame than the 1234 entry, you could do something like:
# remove the suffix from all entries
incoming_df['Tracking ID'] = incoming_df['Tracking ID'].apply(lambda x: x.split('-')[0])
# keep only the last entry of the duplicates
incoming_df = incoming_df.drop_duplicates(subset='Tracking ID', keep='last')
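If the '-S2' row can appear anywhere (not necessarily below the original), here is a minimal sketch that keeps the update regardless of row order, starting from the original incoming_df and assuming '-S2' always marks the newer record:
base = incoming_df['Tracking ID'].str.replace(r'-S2$', '', regex=True)
is_update = incoming_df['Tracking ID'].str.endswith('-S2')
deduped = (
    incoming_df
    .assign(_base=base, _update=is_update)
    .sort_values('_update')                      # update rows sort after originals
    .drop_duplicates(subset='_base', keep='last')
    .drop(columns=['_base', '_update'])
    .sort_index()
)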

Related

Google Sheets API and Pandas. Inconsistent data length from the API

I'm using the Google Sheets API to get data which I then pass to Pandas so I can easily work with it.
Let's say I want to get a sheet with the following data (depicted as a JSON object, since tables don't render well here):
{
    columns: ['Name', 'Age', 'Tlf.', 'Address'],
    data: ['Julie', '35', '12345', '8 Leafy Street']
}
The sheets API will return something along the lines of this:
{
    'range': 'Cases!A1:AE999',
    'majorDimension': 'ROWS',
    'values': [
        ['Name', 'Age', 'Tlf.', 'Address'],
        ['Julie', '35', '12345', '8 Leafy Street']
    ]
}
This is great and allows me to easily pass the column headings and data to Pandas without much fuss. I do this in the following manner:
values = sheets_api_result["values"]
df = pd.DataFrame(values[1:], columns=values[0])
My Problem
If I have a G Suite sheet that looks like the table below, again depicted as key:value data,
{
    columns: ['Name', 'Age', 'Tlf.', 'Address'],
    data: ['Julie', '35', '', '']
}
I will receive the following response
{
    'range': 'Cases!A1:AE999',
    'majorDimension': 'ROWS',
    'values': [
        ['Name', 'Age', 'Tlf.', 'Address'],
        ['Julie', '35']
    ]
}
Note that the lengths of the two arrays are unequal, and that instead of None or null values being returned, the data is simply not present in the response.
When working with this data in my code, I end up with an error that looks like this
ValueError: 4 columns passed, passed data had 2 columns
So as far as I can tell I have two options:
1. Come up with a clever way to pad my response with None where necessary.
2. If possible, instruct the API to return a null value in the JSON where null values exist, especially when the last column(s) have no data at all.
With regards to point 1, I think I can append x None values to the list, where x equals length_of_column_heading_array - length_of_data_array. This does, however, seem ugly, and perhaps there is a more elegant way of doing it.
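For point 1, a minimal sketch of that padding idea, using the values list from above (values[0] holds the column headings):
header, *rows = values
padded = [row + [None] * (len(header) - len(row)) for row in rows]
df = pd.DataFrame(padded, columns=header)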
And with regards to point 2, I haven't managed to find an answer that helps me.
If anyone has any ideas on how I can solve this, I'd be very grateful.
Cheers!
If anyone is interested, here is how I solved the issue.
First, we need to get all the data from the Sheets API.
# define the names of the tabs I want to get
ranges = ['tab1', 'tab2']
# Call the Sheets API
request = service.spreadsheets().values().batchGet(spreadsheetId=document, ranges=ranges,)
response = request.execute()
Now I want to go through every row and ensure that it contains the same number of elements as the first row, which holds the column headings.
# response is the response from the Google Sheets API, from the code
# above. It contains the column headings and the data from every row.
# 'valueRanges' is the key used to access the data.
def extract_case_data(response, keyword):
    for obj in response["valueRanges"]:
        if keyword in obj["range"]:
            values = pad_data(obj["values"])
            df = pd.DataFrame(values[1:], columns=values[0])
            return df
    return None
And finally, the method to pad the data
def pad_data(data: list):
    # build a new list seeded with the column-heading row;
    # this is the list which we will return
    return_data = [data[0]]
    for row in data[1:]:
        difference = len(data[0]) - len(row)
        new_row = row
        # append None to the rows which are shorter than the
        # column-heading list
        for count in range(1, difference + 1):
            new_row.append(None)
        return_data.append(new_row)
    return return_data
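For example, calling it for one of the requested tabs might look like this (a sketch; 'tab1' is one of the range names defined above):
df_tab1 = extract_case_data(response, 'tab1')  # padded DataFrame, or None if no range matches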
I'm certainly not saying that this is the best or most elegant solution, but it has done the trick for me.
Hope this helps someone.
Same idea, maybe simpler look:
Get raw values
result = service.spreadsheets().values().get(spreadsheetId=spreadsheet_id, range=data_range).execute()
raw_values = result.get('values', [])
Then pad each row in place while iterating:
expected_length = len(raw_values[0])  # the header row defines the expected number of columns
for row in raw_values:
    row += [''] * (expected_length - len(row))  # += extends the list in place, unlike row = row + ...
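Alternatively, pandas can do the padding itself; a minimal sketch, assuming raw_values[0] holds the column headings:
import pandas as pd

header, *rows = raw_values
# rows shorter than the header come back with NaN in the missing cells
df = pd.DataFrame(rows).reindex(columns=range(len(header)))
df.columns = header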

Add keys from dicts (in column) to new column

I have a DataFrame with a 'budgetYearMap' column, which has 1-3 key-value pairs for each record. I'm a bit stuck as to how I'm supposed to make a new column containing only the keys of the "budgetYearMap" column.
Sample data below:
df_sample = pd.DataFrame({
    'identifier': ['BBI-2016-D02', 'BBI-2016-D03', 'BBI-2016-D04', 'BBI-2016-D05', 'BBI-2016-D06'],
    'callIdentifier': ['H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016'],
    'budgetYearMap': [{'0': 188650000}, {'2017': 188650000}, {'2015': 188650000}, {'2014': 188650000}, {'2020': 188650000, '2014': 188650000, '2012': 188650000}]
})
First I tried to extract the keys by position, then make a list out of them and add the list to the dataframe. As some records contained multiple keys (I then found out), this approach failed.
all_keys = [i for s in [list(d.keys()) for d in df_sample.budgetYearMap] for i in s]
df_TD_selected['budgetYear'] = all_keys
My problem is that extracting the keys by "name" wouldn't work either, given that the names of the keys are variable, and I do not know the set of years in advance. The data set will keep growing. It can be either 0 or a year within the 2000 range now, but in the future more years will be added.
My desired output would be:
df_output = pd.DataFrame({
    'identifier': ['BBI-2016-D02', 'BBI-2016-D03', 'BBI-2016-D04', 'BBI-2016-D05', 'BBI-2016-D06'],
    'callIdentifier': ['H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016', 'H2020-BBI-JTI-2016'],
    'Year': ['0', '2017', '2015', '2014', '2020, 2014, 2012']
})
Any idea how I should approach this?
Perfect pipeline use-case.
df = (
    df_sample
    .assign(Year=df_sample['budgetYearMap'].apply(lambda s: list(s.keys())))
    .drop(columns=['budgetYearMap'])
)
.assign creates a new column which takes the 'budgetYearMap' Series and applies the lambda function to it. This returns the dictionary's keys in a list. If you prefer a string (as in your desired output), simply replace the lambda function with
lambda s: ', '.join(list(s.keys()))
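For example, the full pipeline with that string variant would look like this (a sketch):
df = (
    df_sample
    .assign(Year=df_sample['budgetYearMap'].apply(lambda s: ', '.join(s.keys())))
    .drop(columns=['budgetYearMap'])
)
# df['Year'] now holds '0', '2017', '2015', '2014', '2020, 2014, 2012'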

Create different files based off value in dataframe column A and save to different existing folders based off value in dataframe column A

First, I would like to create different files based on the value in dataframe column A, 'FTP_FOLDER_PATH'.
Second, I would like to save these files to different folders depending on the value in dataframe column A, 'FTP_FOLDER_PATH'. These folders already exist and do not need to be created.
I am struggling with how to do this through looping. I have done something similar in the past for the first part, where I just create different files, but I could only figure out how to save them to one folder; I am stuck on saving them to multiple folders. In the code below, I have included:
the dataframe,
what I have attempted (which only solves the first part of the problem), and
the desired output, all of which needs to go to the correct FTP folders.
import pandas as pd
import os

FTP_Master_Folder = 'C:/FTP'
df = pd.DataFrame({'FTP_FOLDER_PATH': ['C:\FTP1', 'C:\FTP2', 'C:\FTP2', 'C:\FTP2', 'C:\FTP3', 'C:\FTP3'],
                   'NAME': ['Jon', 'Kat', 'Kat', 'Kat', 'Joe', 'Joe'],
                   'CARS': ['Honda', 'Lexus', 'Porsche', 'Saleen s7', 'Tesla', 'Tesla']})
df
for i, x in df.groupby('FTP_FOLDER_PATH'):
    # How do I change the line below to loop through and change the
    # directory based on the value of 'FTP_FOLDER_PATH'?
    os.chdir(f'{FTP_Master_Folder}')
    p = os.path.join(os.getcwd(), i + '.csv')
    x.to_csv(p, index=False)
# Desired output to a specific FTP folder based on the rows of the dataframe
df_FTP1 = pd.DataFrame({'FTP_FOLDER_PATH': ['C:\FTP1'],
                        'NAME': ['Jon'],
                        'CARS': ['Honda']})
df_FTP1

df_FTP2 = pd.DataFrame({'FTP_FOLDER_PATH': ['C:\FTP2', 'C:\FTP2', 'C:\FTP2'],
                        'NAME': ['Kat', 'Kat', 'Kat'],
                        'CARS': ['Lexus', 'Porsche', 'Saleen s7']})
df_FTP2

df_FTP3 = pd.DataFrame({'FTP_FOLDER_PATH': ['C:\FTP3', 'C:\FTP3'],
                        'NAME': ['Joe', 'Joe'],
                        'CARS': ['Tesla', 'Tesla']})
df_FTP3
I discovered a minor, basic error: I should have included /{i} in line 2 of the loop. i is the subfolder of the master folder in this case, so adding it sends each file to its destination, which solves part two of my problem quite easily.
for i, x in df_joined.groupby('FTP_FOLDER_PATH'):
    os.chdir(f'{FTP_Master_Folder}/{i}')
    p = os.path.join(os.getcwd(), i + '.csv')
    x.to_csv(p, index=False)
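For completeness, a sketch that skips os.chdir entirely and writes each group straight to its folder, assuming (as in the sample df above) that 'FTP_FOLDER_PATH' holds the full Windows path of an existing folder:
for folder, group in df.groupby('FTP_FOLDER_PATH'):
    file_name = os.path.basename(folder) + '.csv'   # e.g. 'FTP1.csv'
    group.to_csv(os.path.join(folder, file_name), index=False)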

Run function on cell in column, based on another column

I have a dataframe full of scientific paper information.
My Dataframe:
database authors title
0 sciencedirect [{'surname': 'Sharafaldin', 'first_name': 'Iman'}, An eval...
{'surname': 'Lashkari', 'first_name': 'Arash Habibi'}]
1 sciencedirect [{'surname': 'Srinivas', 'first_name': 'Jangirala'}, Governmen...
{'surname': 'Das', 'first_name': 'Ashok Kumar'}]
2 sciencedirect [{'surname': 'Bongiovanni', 'first_name': 'Ivano'}] The last...
3 ieeexplore [Igor Kotenko, Andrey Chechulin] Cyber Attac...
As you can see, the authors column contains a list of dictionaries, but only where the database is sciencedirect. In order to perform some analysis, I need to clean my data. Therefore, my goal is to put the names into plain lists, as in the last row (ieeexplore).
What I want:
# From:
[{'surname': 'Sharafaldin', 'first_name': 'Iman'}, {'surname': 'Lashkari', 'first_name': 'Arash Habibi'}]
# To:
[Iman Sharafaldin, Arash Habibi Lashkari]
My approach is to create two masks: one for the database column, extracting only sciencedirect papers, and one for the whole authors column. From these masks a new dataframe is created, and on its "authors" column I run the code shown below. It extracts the author names of each cell and stores them in a list, just as I want:
scidir_mask = df["database"] == 'sciencedirect'
authors_col = df["authors"].notna()  # rows where authors is present
only_scidir = df[authors_col & scidir_mask]

for cell in only_scidir["authors"]:
    # get each list from the cell
    cell_list = []
    for dictionary in cell:
        # get the values from the dict and reverse them into a list
        name_as_list = [*dictionary.values()][::-1]
        # join first name and surname into a single string
        author = ' '.join(name_as_list)
        cell_list.append(author)
So at the end of the above code, cell_list contains the author names in the way I want.
But I can't get my head around how to store these cell_lists back into the original dataframe.
So, how do I get the authors cell where the database is sciencedirect, perform my little function, and store its output back into the cell?
The idea is to create a custom function with f-strings and apply it only to the filtered rows:
scidir_mask = df["database"] == 'sciencedirect'
f = lambda x: [f"{y['first_name']} {y['surname']}" for y in x]
df.loc[scidir_mask, 'authors'] = df.loc[scidir_mask, 'authors'].apply(f)
print (df)
database authors title
0 sciencedirect [Iman Sharafaldin, Arash Habibi Lashkari] An eval
1 sciencedirect [Jangirala Srinivas, Ashok Kumar Das] Governmen
2 sciencedirect [Ivano Bongiovanni] The last
3 ieeexplore [Igor Kotenko, Andrey Chechulin] Cyber Attac

Python/Pandas: creating new dataframe, gets error "unalignable boolean Series provided as indexer"

I am trying to compare two dataframes and return different result sets based on whether a value from one dataframe is present in the other.
Here is my sample code:
pmdf = pd.DataFrame(
    {
        'Journal': ['US Drug standards.', 'Acta veterinariae.', 'Bulletin of big toe science.', 'The UK journal of dermatology.', 'Journal of Hypothetical Journals'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963', '8675-309J'],
    }
)
pmdf = pmdf[['Journal'] + pmdf.columns[:-1].tolist()]
jcrdf = pd.DataFrame(
    {
        'Full Journal Title': ['Drug standards.', 'Acta veterinaria.', 'Bulletin of marine science.', 'The British journal of dermatology.'],
        'Abbreviated Title': ['DStan', 'Avet', 'Marsci', 'BritSkin'],
        'Total Cites': ['223', '444', '324', '166'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963'],
        'All_ISSNs': ['0096-0225,0096-0225', '0567-8315,1820-7448,0567-8315', '0007-4977,0007-4977', '0007-0963,0007-0963,0366-077X,1365-2133']
    })
jcrdf = jcrdf.set_index('Full Journal Title')
pmdf_issn = pmdf['ISSN'].values.tolist()
This line gets me the rows from dataframe jcrdf that contain the ISSNs from dataframe pmdf:
pmjcrmatch = jcrdf[jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]
I wanted the following line to create a new dataframe of values from pmdf where the ISSN is not in jcrdf, so I negated the previous statement and chose the first dataframe.
pmjcrnomatch = pmdf[~jcrdf['All_ISSNs'].str.contains('|'.join(pmdf_issn))]
I get an error: "Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)".
I don't find a lot about this specific error, at least nothing that is helping me toward a solution.
Is "str.contains" not the best way of sorting items that are and aren't in the second dataframe?
You are trying to apply the boolean index of one dataframe to another. This is only possible if the lengths of both dataframes match. In your case you should use isin.
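To see why the error appears, a quick sketch: the boolean mask built from jcrdf carries jcrdf's index ('Full Journal Title'), which pandas cannot align with pmdf's default integer index:
mask = jcrdf['All_ISSNs'].str.contains('|'.join(pmdf['ISSN']))  # indexed by 'Full Journal Title'
# pmdf[~mask] raises IndexingError: Unalignable boolean Series provided as indexer ...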
# get all rows from jcrdf where `ALL_ISSNs` contains any of the `ISSN` in `pmdf`.
pmjcrmatch = jcrdf[jcrdf.All_ISSNs.str.contains('|'.join(pmdf.ISSN))]
# assign all remaining rows from `jcrdf` to a new dataframe.
pmjcrnomatch = jcrdf[~jcrdf.ISSN.isin(pmjcrmatch.ISSN)]
EDIT
Let's try another approach:
First I'd create a lookup for all your ISSNs and then create the diff by isolating the matches:
import pandas as pd

pmdf = pd.DataFrame(
    {
        'Journal': ['US Drug standards.', 'Acta veterinariae.', 'Bulletin of big toe science.', 'The UK journal of dermatology.', 'Journal of Hypothetical Journals'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963', '8675-309J'],
    }
)
pmdf = pmdf[['Journal'] + pmdf.columns[:-1].tolist()]
jcrdf = pd.DataFrame(
    {
        'Full Journal Title': ['Drug standards.', 'Acta veterinaria.', 'Bulletin of marine science.', 'The British journal of dermatology.'],
        'Abbreviated Title': ['DStan', 'Avet', 'Marsci', 'BritSkin'],
        'Total Cites': ['223', '444', '324', '166'],
        'ISSN': ['0096-0225', '0567-8315', '0007-4977', '0007-0963'],
        'All_ISSNs': ['0096-0225,0096-0225', '0567-8315,1820-7448,0567-8315', '0007-4977,0007-4977', '0007-0963,0007-0963,0366-077X,1365-2133']
    })
jcrdf = jcrdf.set_index('Full Journal Title')

# create a lookup from all ISSNs to avoid expensive string matching
jcrdf_lookup = pd.DataFrame(jcrdf['All_ISSNs'].str.split(',').tolist(),
                            index=jcrdf.ISSN).stack(level=0).reset_index(level=0)

# compare the ISSNs extracted from All_ISSNs with pmdf.ISSN
matches = jcrdf_lookup[jcrdf_lookup[0].isin(pmdf.ISSN)]

jcrdfmatch = jcrdf[jcrdf.ISSN.isin(matches.ISSN)]
jcrdfnomatch = pmdf[~pmdf.ISSN.isin(matches[0])]
