Duplicate Information - python

I have a df that contains the columns [CPF, name, age].
I need to find the CPFs that are repeated in the data and return each person's name together with the CPF.
So far I've done this:
TrueDuplicat = base.groupby(['CPF']).size().reset_index(name='count')
TrueDuplicat = TrueDuplicat[TrueDuplicat['count']>1]
When I put:
TrueDuplicat = TrueDuplicat[['name','CPF']]
I get the error "['name'] not in index".
How do I get the duplicate CPF with the person's name?
Example of the DF:
CPF name age
38445675455 Alex 15
54785698574 Ana 25
38445675455 Bento 22
65878584558 Caio 33

After your groupby, you do not have a name column in TrueDuplicat. For the example you have posted, TrueDuplicat is:
CPF count
0 38445675455 2
If you're looking for the names corresponding to the CPF values in TrueDuplicat, you can do something like
df[df['CPF'].isin(TrueDuplicat['CPF'].tolist())]
which, for your example, will yield
CPF name age
0 38445675455 Alex 15
2 38445675455 Bento 22
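As an alternative sketch (assuming the same three columns), pandas' duplicated with keep=False marks every row whose CPF appears more than once, which skips the intermediate groupby entirely:

```python
import pandas as pd

base = pd.DataFrame({
    'CPF': [38445675455, 54785698574, 38445675455, 65878584558],
    'name': ['Alex', 'Ana', 'Bento', 'Caio'],
    'age': [15, 25, 22, 33],
})

# keep=False flags all occurrences of a duplicated CPF, not just the later ones
dups = base[base.duplicated('CPF', keep=False)]
print(dups[['CPF', 'name']])
```

This yields the same Alex/Bento rows as the isin approach above.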

How to group data by count of columns in Pandas?

I have a CSV file with a lot of rows and a different number of columns in each row.
How can I group the data by column count and show it in different frames?
File CSV has the following data:
1 OLEG US FRANCE BIG
1 OLEG FR 18
1 NATA 18
Because I have a different number of columns in each row, I have to group the rows by column count and show 3 frames so that I can set the headers:
FR1:
ID NAME STATE COUNTRY HOBBY
1 OLEG US FRANCE BIG
FR2:
ID NAME COUNTRY AGE
1 OLEG FR 18
FR3:
ID NAME AGE
1 NATA 18
In other words, I need to group rows by count of columns and show them in different dataframes.
Since pandas doesn't allow columns of different lengths within one DataFrame, just don't use it to import your data. Your goal is to create three separate df's, so first import the data as lists, and then deal with the different lengths.
One way to solve this is to read the data with csv.reader and create the df's with a list comprehension together with a condition on the length of the lists.
import csv
import pandas as pd

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)
df1 = pd.DataFrame([item for item in data if len(item) == 3], columns='ID NAME AGE'.split())
df2 = pd.DataFrame([item for item in data if len(item) == 4], columns='ID NAME COUNTRY AGE'.split())
df3 = pd.DataFrame([item for item in data if len(item) == 5], columns='ID NAME STATE COUNTRY HOBBY'.split())
print(df1, df2, df3, sep='\n\n')
ID NAME AGE
0 1 NATA 18
ID NAME COUNTRY AGE
0 1 OLEG FR 18
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
If you would otherwise have to hardcode too many lines for the same step (e.g. too many df's), you should consider using a loop to create them and storing each dataframe as a key/value pair in a dictionary.
EDIT
Here is a slightly optimized way of creating those df's. I don't think you can get around creating a list of the columns you want to use for the separate df's, so you need to know which numbers of columns occur in your data (unless you want to create those df's without naming the columns).
import csv
import pandas as pd

col_list = [['ID', 'NAME', 'AGE'],
            ['ID', 'NAME', 'COUNTRY', 'AGE'],
            ['ID', 'NAME', 'STATE', 'COUNTRY', 'HOBBY']]
with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)
dict_of_dfs = {}
for cols in col_list:
    dict_of_dfs[f'df_{len(cols)}'] = pd.DataFrame([item for item in data if len(item) == len(cols)], columns=cols)
for key, val in dict_of_dfs.items():
    print(f'{key=}: \n {val} \n')
key='df_3':
ID NAME AGE
0 1 NATA 18
key='df_4':
ID NAME COUNTRY AGE
0 1 OLEG FR 18
key='df_5':
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
Now you don't have variables for your df's; instead you have them in a dictionary as keys. (I named each df with the number of columns it has: df_3 is the df with three columns.)
If you need to import the data with pandas, you could have a look at this post.

Best practice to store the order of table rows?

I have table which has columns like these
class AgeAndName(models.Model):
    name = m.CharField(max_length=20)
    age = m.IntegerField()
name age
---- --
John 22
Hitch 38
Heiku 25
Taro 36
Cho 40
Now I want to allow the user to sort the rows as they like, and keep that order.
I can think of two ways.
1. Make a new column that keeps the order:
class AgeAndName(models.Model):
    name = m.CharField(max_length=20)
    age = m.IntegerField()
    order = m.IntegerField()
name age order
---- -- -----
John 22 1
Hitch 38 5
Heiku 25 3
Taro 36 4
Cho 40 2
2. Make one property on the model and keep the order there:
class AgeAndName(models.Model):
    # class member??? (I am not sure I have this kind of thing)
    order = (0, 4, 2, 3, 1)
    name = m.CharField(max_length=20)
    age = m.IntegerField()
Which one is best practice for Django? Or is there any other good way?
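For what it's worth, option 1 (a dedicated order column) is the usual pattern. Here is a minimal sketch in plain Python, not Django, where the dicts and the hypothetical move_row helper stand in for model instances, of how the order values would be maintained when a user moves a row:

```python
# Each dict stands in for a saved model instance with an `order` field.
rows = [
    {'name': 'John', 'age': 22, 'order': 1},
    {'name': 'Cho', 'age': 40, 'order': 2},
    {'name': 'Heiku', 'age': 25, 'order': 3},
    {'name': 'Taro', 'age': 36, 'order': 4},
    {'name': 'Hitch', 'age': 38, 'order': 5},
]

def move_row(rows, name, new_order):
    """Move one row to a new 1-based position and renumber the rest."""
    rows = sorted(rows, key=lambda r: r['order'])
    moving = next(r for r in rows if r['name'] == name)
    rows.remove(moving)
    rows.insert(new_order - 1, moving)
    for i, r in enumerate(rows, start=1):  # in Django this would be a bulk update
        r['order'] = i
    return rows

reordered = move_row(rows, 'Hitch', 1)
print([r['name'] for r in reordered])
```

In a real Django model, the renumbering loop would save the changed instances (e.g. with a bulk update) inside a transaction.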

Getting an error when checking if values in a list match a column PANDAS

I'm just wondering how one might overcome the below error.
AttributeError: 'list' object has no attribute 'str'
What I am trying to do is create a new column "PrivilegedAccess"; in this column I want to write "True" if any of the names in the first_name column match the ones outlined in the "Search_for_These_values" list, and "False" if they don't.
Code
## Create list of Privileged accounts
Search_for_These_values = ['Privileged','Diagnostics','SYS','service account'] #creating list
pattern = '|'.join(Search_for_These_values) # joining list for comparision
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF.columns=[['first_name']].str.contains(pattern)
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF['PrivilegedAccess'].map({True: 'True', False: 'False'})
SAMPLE DATA:
uid last_name first_name language role email_address department
0 121 Chad Diagnostics English Team Lead Michael.chad#gmail.com Data Scientist
1 253 Montegu Paulo Spanish CIO Paulo.Montegu#gmail.com Marketing
2 545 Mitchel Susan English Team Lead Susan.Mitchel#gmail.com Data Scientist
3 555 Vuvko Matia Polish Marketing Lead Matia.Vuvko#gmail.com Marketing
4 568 Sisk Ivan English Supply Chain Lead Ivan.Sisk#gmail.com Supply Chain
5 475 Andrea Patrice Spanish Sales Graduate Patrice.Andrea#gmail.com Sales
6 365 Akkinapalli Cherifa French Supply Chain Assistance Cherifa.Akkinapalli#gmail.com Supply Chain
Note that the dtype of the first_name column is "object" and the dataframe is multi index (not sure how to change from multi index)
Many thanks
It seems you need to select one column for str.contains and then use map or convert the booleans to strings:
Search_for_These_values = ['Privileged','Diagnostics','SYS','service account'] #creating list
pattern = '|'.join(Search_for_These_values)
PrivilegedAccounts_DF = pd.DataFrame({'first_name':['Privileged 111',
'aaa SYS',
'sss']})
print (PrivilegedAccounts_DF.columns)
Index(['first_name'], dtype='object')
print (PrivilegedAccounts_DF.loc[0, 'first_name'])
Privileged 111
print (type(PrivilegedAccounts_DF.loc[0, 'first_name']))
<class 'str'>
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF['first_name'].str.contains(pattern).astype(str)
print (PrivilegedAccounts_DF)
first_name PrivilegedAccess
0 Privileged 111 True
1 aaa SYS True
2 sss False
EDIT:
The problem is a one-level MultiIndex in the columns; you need:
PrivilegedAccounts_DF = pd.DataFrame({'first_name':['Privileged 111',
'aaa SYS',
'sss']})
#simulate problem
PrivilegedAccounts_DF.columns = [PrivilegedAccounts_DF.columns.tolist()]
print (PrivilegedAccounts_DF)
first_name
0 Privileged 111
1 aaa SYS
2 sss
#check columns
print (PrivilegedAccounts_DF.columns)
MultiIndex([('first_name',)],
)
Solution is join values, e.g. by empty string:
PrivilegedAccounts_DF.columns = PrivilegedAccounts_DF.columns.map(''.join)
So now columns names are correct:
print (PrivilegedAccounts_DF.columns)
Index(['first_name'], dtype='object')
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF['first_name'].str.contains(pattern).astype(str)
print (PrivilegedAccounts_DF)
There might be a more elegant solution, but this should work without using patterns (note that isin tests for exact matches, not substrings):
PrivilegedAccounts_DF.loc[PrivilegedAccounts_DF['first_name'].isin(Search_for_These_values), "PrivilegedAccess"]=True
PrivilegedAccounts_DF.loc[~PrivilegedAccounts_DF['first_name'].isin(Search_for_These_values), "PrivilegedAccess"]=False
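A related sketch (assuming the column holds plain strings): str.contains can feed numpy.where directly, and case=False makes the match case-insensitive; unlike isin, contains matches substrings anywhere in the name:

```python
import numpy as np
import pandas as pd

Search_for_These_values = ['Privileged', 'Diagnostics', 'SYS', 'service account']
pattern = '|'.join(Search_for_These_values)

df = pd.DataFrame({'first_name': ['Diagnostics', 'Paulo', 'sys admin']})

# case=False lets 'sys admin' match 'SYS'; na=False guards against missing values
df['PrivilegedAccess'] = np.where(
    df['first_name'].str.contains(pattern, case=False, na=False), 'True', 'False')
print(df)
```

This produces the string labels 'True'/'False' in one step, with no separate map call.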

Finding value of another attribute given an attribute

I have a CSV that has multiple lines, and I am looking to find the JobTitle of a person, given their name. The CSV is now in a DataFrame sal as such:
id employee_name job_title
1 SOME NAME SOME TITLE
I'm trying to find the JobTitle for a given person's name, but am having a hard time doing this. I am currently learning pandas by doing crash courses, and I know I can get a list of job titles by using sal['job_title'], but that gives me the entire list of job titles.
How can I find the value of a specific person?
You need boolean indexing:
sal[sal.employee_name == 'name']
If you need to select only one column, use loc with boolean indexing:
sal.loc[sal.employee_name == 'name', 'job_title']
Sample:
sal = pd.DataFrame({'id':[1,2,3],
'employee_name':['name','name1','name2'],
'job_title':['titleA','titleB','titleC']},
columns=['id','employee_name','job_title'])
print (sal)
id employee_name job_title
0 1 name titleA
1 2 name1 titleB
2 3 name2 titleC
print (sal[sal.employee_name == 'name'])
id employee_name job_title
0 1 name titleA
print (sal.loc[sal.employee_name == 'name', 'job_title'])
0 titleA
Name: job_title, dtype: object
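If you want the title as a plain string rather than a one-element Series, a sketch (assuming each employee_name appears once) is to take the first match with .iloc[0]:

```python
import pandas as pd

sal = pd.DataFrame({'id': [1, 2, 3],
                    'employee_name': ['name', 'name1', 'name2'],
                    'job_title': ['titleA', 'titleB', 'titleC']},
                   columns=['id', 'employee_name', 'job_title'])

# boolean indexing returns a Series; .iloc[0] extracts the first matching value
title = sal.loc[sal.employee_name == 'name1', 'job_title'].iloc[0]
print(title)
```

Note that .iloc[0] raises IndexError when no row matches, so you may want to check the selection is non-empty first.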

comparing column values based on other column values in pandas

I have a dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([['M',2014,'Seth',5],
['M',2014,'Spencer',5],
['M',2014,'Tyce',5],
['F',2014,'Seth',25],
['F',2014,'Spencer',23]],columns =['sex','year','name','number'])
print(df)
I would like to find the most gender ambiguous name for 2014. I have tried many ways but haven't had any luck yet.
NOTE: I do write a function at the end of my answer, but I decided to run through the code part by part for better understanding.
Obtaining Gender Ambiguous Names
First, you would want to get the list of gender ambiguous names. I would suggest using set intersection:
>>> male_names = df[df.sex == "M"].name
>>> female_names = df[df.sex == "F"].name
>>> gender_ambiguous_names = list(set(male_names).intersection(set(female_names)))
Now, you want to actually subset the data to show only gender ambiguous names in 2014. You would want to use membership conditions and chain the boolean conditions as a one-liner:
>>> gender_ambiguous_data_2014 = df[(df.name.isin(gender_ambiguous_names)) & (df.year == 2014)]
Aggregating the Data
Now you have this as gender_ambiguous_data_2014:
>>> gender_ambiguous_data_2014
sex year name number
0 M 2014 Seth 5
1 M 2014 Spencer 5
3 F 2014 Seth 25
4 F 2014 Spencer 23
Then you just have to aggregate by number:
>>> gender_ambiguous_data_2014.groupby('name').number.sum()
name
Seth 30
Spencer 28
Name: number, dtype: int64
Extracting the Name(s)
Now, the last thing you want is to get the name with the highest numbers. But in reality you might have gender ambiguous names that have the same total numbers. We should apply the previous result to a new variable gender_ambiguous_numbers_2014 and play with it:
>>> gender_ambiguous_numbers_2014 = gender_ambiguous_data_2014.groupby('name').number.sum()
>>> # get the max and find the list of names:
>>> gender_ambiguous_max_2014 = gender_ambiguous_numbers_2014[gender_ambiguous_numbers_2014 == gender_ambiguous_numbers_2014.max()]
Now you get this:
>>> gender_ambiguous_max_2014
name
Seth 30
Name: number, dtype: int64
Cool, let's extract the index names then!
>>> gender_ambiguous_max_2014.index
Index([u'Seth'], dtype='object')
Wait, what the heck is this type? (HINT: it's pandas.core.index.Index)
No problem, just apply list coercion:
>>> list(gender_ambiguous_max_2014.index)
['Seth']
Let's Write This in a Function!
So, in this case, our list has only one element. But maybe we want to write a function that returns a string for the sole contender, or a list of strings if several gender ambiguous names have the same total number in that year.
In the wrapper function below, I abbreviated my variable names with ga to shorten the code. Of course, this is assuming the data set is in the same format you have shown and is named df. If it's named otherwise just change the df accordingly.
def get_most_popular_gender_ambiguous_name(year):
    """Get the gender ambiguous name with the most numbers in a certain year.

    Returns:
        a string, or a list of strings

    Note:
        'gender_ambiguous' will be abbreviated as 'ga'
    """
    # get the gender ambiguous names
    male_names = df[df.sex == "M"].name
    female_names = df[df.sex == "F"].name
    ga_names = list(set(male_names).intersection(set(female_names)))
    # filter by year
    ga_data = df[(df.name.isin(ga_names)) & (df.year == year)]
    # aggregate to get total numbers
    ga_total_numbers = ga_data.groupby('name').number.sum()
    # find the max number
    ga_max_number = ga_total_numbers.max()
    # subset the Series to only those that have max numbers
    ga_max_data = ga_total_numbers[ga_total_numbers == ga_max_number]
    # get the index (the names) for those satisfying the conditions
    most_popular_ga_names = list(ga_max_data.index)  # list coercion
    # if the list only contains one element, return that element
    if len(most_popular_ga_names) == 1:
        return most_popular_ga_names[0]
    return most_popular_ga_names
Now, calling this function is as easy as it gets:
>>> get_most_popular_gender_ambiguous_name(2014) # assuming df is dataframe var name
'Seth'
Not sure what you mean by 'most gender ambiguous', but you can start from this:
>>> dfy = (df.year == 2014)
>>> dfF = df[(df.sex == 'F') & dfy][['name', 'number']]
>>> dfM = df[(df.sex == 'M') & dfy][['name', 'number']]
>>> pd.merge(dfF, dfM, on=['name'])
name number_x number_y
0 Seth 25 5
1 Spencer 23 5
If you want just the name with highest total number then:
>>> dfT = pd.merge(dfF, dfM, on=['name'])
>>> dfT
name number_x number_y
0 Seth 25 5
1 Spencer 23 5
>>> dfT['total'] = dfT['number_x'] + dfT['number_y']
>>> dfT.sort_values('total', ascending=False).head(1)
name number_x number_y total
0 Seth 25 5 30
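If 'most ambiguous' instead means the smallest gap between the male and female counts, the merged frame can be extended with one more column; a sketch reusing this answer's merge (number_x/number_y are the suffixes pd.merge produces for the female and male counts):

```python
import pandas as pd

df = pd.DataFrame([['M', 2014, 'Seth', 5],
                   ['M', 2014, 'Spencer', 5],
                   ['M', 2014, 'Tyce', 5],
                   ['F', 2014, 'Seth', 25],
                   ['F', 2014, 'Spencer', 23]],
                  columns=['sex', 'year', 'name', 'number'])

dfy = (df.year == 2014)
dfF = df[(df.sex == 'F') & dfy][['name', 'number']]
dfM = df[(df.sex == 'M') & dfy][['name', 'number']]
dfT = pd.merge(dfF, dfM, on=['name'])  # number_x = F counts, number_y = M counts

# smallest absolute F/M difference = most evenly split name
dfT['gap'] = (dfT['number_x'] - dfT['number_y']).abs()
most_ambiguous = dfT.sort_values('gap').iloc[0]['name']
print(most_ambiguous)
```

Under this reading, Spencer (gap 18) edges out Seth (gap 20), which shows how much the answer depends on the chosen definition of ambiguity.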
