Generate ID based on string in excel - python

So, I have data consisting of people's names, and I want to assign a unique numeric ID to each of them based on their first name. The catch is that people sharing a first name must get the same ID. For example, if there are two people named John, they will have the same numeric ID value. Note that I need to assign these IDs dynamically, because new people are added constantly: every time new data comes in, I have to check whether I already have an ID for that name or whether I have to generate a new one. I want to do this in Excel with a formula or macro.
Also, if anyone knows how to do this in Python, i.e. generate the same numeric ID for the same string, that would help too. I tried using Python's uuid module but didn't find a proper solution.
ID Name
1 John
2 Michelle
1 John
3 Hasan
2 Michelle
As you can see, 'John' always gets the same numeric ID, which is '1', and likewise 'Michelle'.

This UDF is a bit shaky but will work depending on how many names you have and the spread of the names ...
Public Function GenerateId(ByVal strText As String) As Long
    Dim i As Long
    Dim strChar As String
    For i = 1 To Len(strText)
        strChar = UCase(Mid(strText, i, 1))
        GenerateId = GenerateId + Asc(strChar)
    Next
End Function
... there is a chance it will double up (since it just sums character codes, any two anagrams, e.g. "AMY" and "MAY", collide) and it's not easy to predict otherwise. You'd have to run all names through and check all outcomes.
Also, I know it's not a sequential ID approach starting from 1 but you didn't specify that so I used some creative licence. :-)
Also, this will ensure that a name retains its ID if the data is sorted differently; not sure if that's a requirement, but it's a consideration.
Worth a potential shot anyway.
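On the Python side, one simple approach (a minimal sketch; the function and variable names are my own, not from any library) is to keep a persistent name-to-ID mapping and hand out the next sequential integer only for names that haven't been seen before:

```python
def assign_ids(names, id_map=None):
    """Assign a stable numeric ID per distinct name; repeats reuse the same ID.

    id_map persists existing assignments, so when new people are added later,
    old names keep their IDs and only genuinely new names get a fresh one.
    """
    if id_map is None:
        id_map = {}
    ids = []
    for name in names:
        if name not in id_map:
            id_map[name] = len(id_map) + 1  # next sequential ID, starting at 1
        ids.append(id_map[name])
    return ids, id_map

ids, mapping = assign_ids(["John", "Michelle", "John", "Hasan", "Michelle"])
# ids == [1, 2, 1, 3, 2], matching the table above
```

Saving `mapping` (e.g. to a JSON or CSV file) between runs covers the "data gets added constantly" requirement, since `assign_ids` can be called again with the stored mapping.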

Related

How could I find specific texts in one column of another dataset? Python

I have 2 datasets. One contains a column of company names, and the other contains a column of news headlines. The aim I want to achieve is to find all the news whose headline contains one of the companies from the other dataset. Basically the two datasets look like this, and I want to select the news with specific company names.
I have tried using a for loop to achieve my goal, but it takes too much time, and I think pandas or some other library can do this in an easier way.
I am a beginner in Python.
If I understand correctly, you have 2 datasets with different columns. First, you need to loop through the dataset that contains the company names and search for each name in the headlines; you could use str.find("search") (or the in operator) to find matches.
Also, if the data is stored in CSV format, you could use the split() function to get just the column you want to use.
Supposing that you have saved your company names in a pd.Series called company and headlines and texts in a pd.DataFrame called df, this will be what you are looking for:
# it will add a column called "company" to your initial df
for org in company:
    matches = df['headline'].str.contains(org, regex=False)
    df.loc[matches, 'company'] = org
You should pay attention to lower and upper case letters, as this will only find the corresponding company if the exact same word appears in the headline.
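A vectorized alternative (a sketch assuming the same `company` Series and `df` as above, with made-up sample values) is to build a single regex alternation from all company names and let pandas extract the first match per headline:

```python
import re
import pandas as pd

# hypothetical sample data standing in for the two datasets
company = pd.Series(["Acme", "Globex"])
df = pd.DataFrame({"headline": [
    "Acme posts record profit",
    "Weather update",
    "Globex expands",
]})

# one alternation pattern over all (escaped) company names, e.g. "(Acme|Globex)"
pattern = "(" + "|".join(map(re.escape, company)) + ")"

# rows with no matching company get NaN in the new column
df["company"] = df["headline"].str.extract(pattern, expand=False)
```

This avoids the Python-level loop entirely; case sensitivity still applies unless the pattern is made case-insensitive.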

Tools or Python libraries to detect duplicate records

I'm trying to find duplicates in a single CSV file with Python. Through my search I found dedupe.io, a platform that uses Python and machine-learning algorithms to detect duplicate records, but it's not a free tool. However, I don't want to use the traditional method, in which the columns to compare must be specified explicitly. I would like to find a way to detect duplicates with high accuracy. So, is there any tool or Python library to find duplicates in text datasets?
Here is an example which could clarify that:
Title, Authors, Venue, Year
1- Clustering validity checking methods: part II, Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis, ACM SIGMOD Record, 2002
2- Cluster validity methods: part I, Yannis Batistakis, Michalis Vazirgiannis, ACM SIGMOD Record, 2002
3- Book reviews, Karl Aberer, ACM SIGMOD Record, 2003
4- Book review column, Karl Aberer, ACM SIGMOD Record, 2003
5- Book reviews, Leonid Libkin, ACM SIGMOD Record, 2003
So, we can decide that records 1 and 2 are not duplicates even though they contain almost the same data, differing slightly in the Title column. Records 3 and 4 are duplicates, but record 5 does not refer to the same entity.
Pandas provides a very straightforward way to achieve this: pandas.DataFrame.drop_duplicates.
Given the following file (data.csv) stored in the current working directory.
name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
John Doe,25,50000
Louise Jones,25,50000
The following script can be used to remove duplicate records, writing the processed data to a csv file in the current working directory (processed_data.csv).
import pandas as pd
df = pd.read_csv("data.csv")
df = df.drop_duplicates()
df.to_csv("processed_data.csv", index=False)
The resultant output in this example looks like:
name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
Louise Jones,25,50000
pandas.DataFrame.drop_duplicates also allows dropping rows that are duplicated in specific columns (instead of just duplicates of entire rows); the column names are specified via the subset argument.
e.g.
import pandas as pd
df = pd.read_csv("data.csv")
df = df.drop_duplicates(subset=["age"])
df.to_csv("processed_data.csv", index=False)
This will remove all rows with a duplicate value in the age column, keeping only the first record for each value that reappears in the age field of later rows.
In this example case the output would be:
name,age,salary
John Doe,25,50000
Jayne Doe,20,80000
Tim Smith,40,100000
Thanks #JPI93 for your answer, but some duplicates still exist and weren't removed. I think this method only works for exact duplicates; if that's the case, it's not what I'm looking for. I want to apply record linkage, which identifies the records that refer to the same entity so they can then be removed.
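For the fuzzy, non-exact case, a minimal sketch with the standard-library difflib can illustrate the idea on records 3-5 from the example above. The 0.7 similarity threshold is a guess that would need tuning per dataset, and a dedicated record-linkage tool (such as the dedupe or recordlinkage libraries) will be far more robust than this:

```python
from difflib import SequenceMatcher

def is_fuzzy_duplicate(a, b, threshold=0.7):
    """True when two strings are similar enough to count as duplicates."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# records 3-5 from the example: (title, author)
records = [
    ("Book reviews", "Karl Aberer"),
    ("Book review column", "Karl Aberer"),
    ("Book reviews", "Leonid Libkin"),
]

deduped = []
for title, author in records:
    # treat as a duplicate only when the title is similar AND the author matches
    if not any(is_fuzzy_duplicate(title, t) and author == a for t, a in deduped):
        deduped.append((title, author))
# record 4 is dropped as a fuzzy duplicate of record 3; record 5 is kept
```

Note this pairwise comparison is O(n²), so for a large CSV a blocking strategy (comparing only within groups sharing, say, the same year) is usually needed.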

How to do a nested loop type operation in Python Pandas?

I need to find a subset of a subset iteratively, calculate a value at each instance of this sub-subset, and then save it in a new outputs table.
To explain better I have a data frame similar to the one shown in the pic below;
I need to iterate through the dataset and sum the costs for all person 1 (of Group 1) for Team A (of Group 2).
Then move to person 1 in Team B and do the same, and so on until Person 1 is done.
Then move to Person 2 and do the same for all the teams again.
My understanding was to use a nested loop something like:
for Person in Group1:
    for Team in Group2:
        Newcost = sum(cost)
        output.append((Person, Team, Newcost))
However, I am new to Python and pandas in particular and I am finding it difficult to use the same method I would usually due to having a data frame setting and a different syntax.
I have read about using .groupby and .loc to make my data frame smaller and group by my conditions, but I would need to do it iteratively and with two conditions at the same time, and then finally calculate my value and I am not sure how that would work.
Any suggestion would be much-appreciated Thanks!
I think it will be easier for you to create a new dataframe instead of operating on the same one. You can use DataFrame.loc with a boolean condition to get the rows for Person 1 or Person 2 within the group you want; the pandas documentation on .loc gives a good introduction.
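The groupby approach the asker mentions can replace the nested loop entirely; a sketch with assumed column names ("Person", "Team", "cost") and made-up sample data:

```python
import pandas as pd

# hypothetical data shaped like the question's description
df = pd.DataFrame({
    "Person": [1, 1, 1, 2, 2],
    "Team":   ["A", "A", "B", "A", "B"],
    "cost":   [10, 20, 5, 7, 3],
})

# one output row per (Person, Team) pair with the summed cost,
# covering every combination the nested loop would have visited
output = df.groupby(["Person", "Team"], as_index=False)["cost"].sum()
```

This yields the whole outputs table in one call, with no explicit iteration.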

How to calculate the sum of conditional cells in excel, populate another column with results

EDIT: Using Advanced Filter in Excel (under the Data tab) I have been able to create a list of unique company names, and am now able to SUMIF based on the cell containing the company's name!
Disclaimer: Any python solutions would be greatly appreciated as well, pandas specifically!
I have 60,000 rows of data, containing information about grants awarded to companies.
I am planning on creating a python dictionary to store each unique company name, with their total grant $ given (agreemen_2), and location coordinates. Then, I want to display this using Dash (Plotly) on a live MapBox map of Canada.
First thing first, how do I calculate and store the total value that was awarded to each company?
I have seen SUMIF in other solutions, but am unsure how to output this to a new column, if that makes sense.
One potential solution I thought was to create a new column of unique company names, and next to it SUMIF all the appropriate cells in col D.
PYTHON STUFF SO FAR
So with the below code, I take a much messier-looking spreadsheet, drop duplicates, sort by company name, and create a new pandas dataframe with the relevant data columns:
corp_df is the cleaned up new dataframe that I want to work with.
and recipien_4 is the company's unique ID number; as you can see, it repeats with each grant awarded. Folia Biotech in the screenshot shows a duplicate grant, as proven by a column I did not include in the screenshot. There are quite a few duplicates, as seen in the screenshot.
import pandas as pd

in_file = '2019-20 Grants and Contributions.csv'

# create dataframe
df = pd.read_csv(in_file)

# sort by company name
df.sort_values("recipien_2", inplace=True)

# remove duplicate grants (same agreement number)
df.drop_duplicates(subset='agreemen_1', keep='first', inplace=True)

# full name, id, grant $, longitude, latitude
corp_df = df[['recipien_2', 'recipien_4', 'agreemen_2', 'longitude', 'latitude']]

# creates an empty dict with only 1 copy of all corporation names, all values 0
corp_dict = {}
for name in corp_df['recipien_2']:
    if name not in corp_dict:
        corp_dict[name] = 0
Any tips or tricks would be greatly appreciated. .itertuples() didn't seem like a good solution, as I am unsure how to filter and compare data, or whether datatypes are preserved. But feel free to prove me wrong haha.
I thought perhaps there was a better way to tackle this problem, straight in Excel vs. iterating through rows of a pandas dataframe. This is a pretty open question so thank you for any help or direction you think is best!
I can see that you are using pandas to read the CSV file, so you can use the method:
Group by
So you can create a new dataframe by grouping on the company name like this:
dfnew = df.groupby('recipien_2', as_index=False)['agreemen_2'].sum()
Then dfnew holds the total grant value per company.
Documentation Pandas Group by:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
The use of groupby followed by a sum may be the best for you:
corp_df = df.groupby(by=['recipien_2', 'longitude', 'latitude'])['agreemen_2'].sum()
# if you want to turn the index back into columns you can add this after as well:
corp_df = corp_df.reset_index()
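If the goal is a SUMIF-style result (each company's total repeated next to every one of its rows, in a new column), groupby plus transform does exactly that; a sketch reusing the question's column names with made-up values:

```python
import pandas as pd

# hypothetical data: recipien_2 = company name, agreemen_2 = grant $
df = pd.DataFrame({
    "recipien_2": ["Acme", "Folia Biotech", "Acme"],
    "agreemen_2": [100, 250, 50],
})

# like SUMIF: compute the per-company total, broadcast back onto every row
df["total_grant"] = df.groupby("recipien_2")["agreemen_2"].transform("sum")
```

transform keeps the original row count, so the new column lines up with the existing dataframe, whereas a plain groupby().sum() collapses to one row per company.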

Insert blank column in Excel with values based on other column data

I have a spreadsheet that comes to me with a column containing FQDNs of computers. Filtering on it is difficult because of the unique names, so I ended up inserting a new column next to the FQDN column and entering a less unique value based on each name. An example of this would be:
dc01spmkt.domain.com
new column value = "MARKETING"
All of the hosts will have a 3 letter designation so people can filter on the new column with the more generic titles.
My question is: Is there a way that I can script this so that when the raw sheet comes I can run the script and it will look for values in the old column to populate the new one? So if it finds 'mkt' together in the hostname field it writes MARKETING, or if it finds 'sls' it writes SALES?
If I understand you correctly, you should be able to do this with an IF/ISNUMBER/SEARCH formula as follows:
=IF(ISNUMBER(SEARCH("mkt",A1)),"Marketing",IF(ISNUMBER(SEARCH("sls",A1)),"Sales",""))
which would yield you the following:
asdfamkt Marketing
sls Sales
aj;sldkjfa
a;sldkfja
mkt Marketing
sls Sales
What this does: SEARCH returns the position at which the text you are searching for begins within the field. ISNUMBER then returns TRUE or FALSE depending on whether SEARCH returned a number, i.e. whether it found your three letters. Finally, IF says that when ISNUMBER is TRUE, the cell should read "Marketing" (or whatever label you want).
You can chain as many IF arguments as you need, one for each three-letter code.
Hope this helped!
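Since the question also asks about scripting this, a pandas sketch could do the same substring-to-label mapping (the code-to-department dictionary and column names here are assumptions, not from the original sheet):

```python
import pandas as pd

# hypothetical mapping of 3-letter designations to department labels
CODES = {"mkt": "MARKETING", "sls": "SALES"}

def classify(fqdn):
    """Return the department label for the first code found in the hostname, else ''."""
    for code, dept in CODES.items():
        if code in fqdn.lower():
            return dept
    return ""

df = pd.DataFrame({"fqdn": [
    "dc01spmkt.domain.com",
    "srv02sls.domain.com",
    "other.domain.com",
]})
df["department"] = df["fqdn"].apply(classify)
```

Running this once per incoming raw sheet (read_csv in, to_csv or to_excel out) would populate the new column automatically, and adding a department is just a new dictionary entry.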
