comparing column values based on other column values in pandas - python

I have a dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame([['M',2014,'Seth',5],
['M',2014,'Spencer',5],
['M',2014,'Tyce',5],
['F',2014,'Seth',25],
['F',2014,'Spencer',23]],columns =['sex','year','name','number'])
print(df)
I would like to find the most gender ambiguous name for 2014. I have tried many ways but haven't had any luck yet.

NOTE: I do write a function at the end of my answer, but I decided to run through the code part by part for better understanding.
Obtaining Gender Ambiguous Names
First, you would want to get the list of gender ambiguous names. I would suggest using set intersection:
>>> male_names = df[df.sex == "M"].name
>>> female_names = df[df.sex == "F"].name
>>> gender_ambiguous_names = list(set(male_names).intersection(set(female_names)))
Now, you want to actually subset the data to show only gender ambiguous names in 2014. You would want to use membership conditions and chain the boolean conditions as a one-liner:
>>> gender_ambiguous_data_2014 = df[(df.name.isin(gender_ambiguous_names)) & (df.year == 2014)]
Aggregating the Data
Now you have this as gender_ambiguous_data_2014:
>>> gender_ambiguous_data_2014
sex year name number
0 M 2014 Seth 5
1 M 2014 Spencer 5
3 F 2014 Seth 25
4 F 2014 Spencer 23
Then you just have to group by name and sum the numbers:
>>> gender_ambiguous_data_2014.groupby('name').number.sum()
name
Seth 30
Spencer 28
Name: number, dtype: int64
Extracting the Name(s)
Now, the last thing you want to do is get the name with the highest number. But in reality you might have gender ambiguous names that share the same total number. We should assign the previous result to a new variable gender_ambiguous_numbers_2014 and play with it:
>>> gender_ambiguous_numbers_2014 = gender_ambiguous_data_2014.groupby('name').number.sum()
>>> # get the max and find the list of names:
>>> gender_ambiguous_max_2014 = gender_ambiguous_numbers_2014[gender_ambiguous_numbers_2014 == gender_ambiguous_numbers_2014.max()]
Now you get this:
>>> gender_ambiguous_max_2014
name
Seth 30
Name: number, dtype: int64
Cool, let's extract the index names then!
>>> gender_ambiguous_max_2014.index
Index([u'Seth'], dtype='object')
Wait, what the heck is this type? (HINT: it's pandas.core.index.Index)
No problem, just apply list coercion:
>>> list(gender_ambiguous_max_2014.index)
['Seth']
Let's Write This in a Function!
So, in this case, our list has only one element. But maybe we want to write a function that returns a string for the sole contender, or a list of strings if several gender ambiguous names have the same total number in that year.
In the wrapper function below, I abbreviated my variable names with ga to shorten the code. Of course, this assumes the data set is in the same format you have shown and is named df. If it's named otherwise, just change df accordingly.
def get_most_popular_gender_ambiguous_name(year):
    """Get the gender ambiguous name with the most numbers in a certain year.

    Returns:
        a string, or a list of strings

    Note:
        'gender_ambiguous' will be abbreviated as 'ga'
    """
    # get the gender ambiguous names
    male_names = df[df.sex == "M"].name
    female_names = df[df.sex == "F"].name
    ga_names = list(set(male_names).intersection(set(female_names)))
    # filter by year
    ga_data = df[(df.name.isin(ga_names)) & (df.year == year)]
    # aggregate to get total numbers
    ga_total_numbers = ga_data.groupby('name').number.sum()
    # find the max number
    ga_max_number = ga_total_numbers.max()
    # subset the Series to only those that have max numbers
    ga_max_data = ga_total_numbers[ga_total_numbers == ga_max_number]
    # get the index (the names) for those satisfying the conditions
    most_popular_ga_names = list(ga_max_data.index)  # list coercion
    # if the list only contains one element, return that element
    if len(most_popular_ga_names) == 1:
        return most_popular_ga_names[0]
    return most_popular_ga_names
Now, calling this function is as easy as it gets:
>>> get_most_popular_gender_ambiguous_name(2014) # assuming df is dataframe var name
'Seth'

Not sure what you mean by 'most gender ambiguous', but you can start from this:
>>> dfy = (df.year == 2014)
>>> dfF = df[(df.sex == 'F') & dfy][['name', 'number']]
>>> dfM = df[(df.sex == 'M') & dfy][['name', 'number']]
>>> pd.merge(dfF, dfM, on=['name'])
name number_x number_y
0 Seth 25 5
1 Spencer 23 5
If you want just the name with the highest total number, then:
>>> dfT = pd.merge(dfF, dfM, on=['name'])
>>> dfT
name number_x number_y
0 Seth 25 5
1 Spencer 23 5
>>> dfT['total'] = dfT['number_x'] + dfT['number_y']
>>> dfT.sort_values('total', ascending=False).head(1)
name number_x number_y total
0 Seth 25 5 30
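If you only need the name itself as a string rather than the whole row, a small follow-up like this should work (just a sketch on top of the dfT built above):
>>> dfT.loc[dfT['total'].idxmax(), 'name']
'Seth'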


Command to call a non-numeric value from a dataset?

I'm new to Python. As part of an assessment I need to collect data samples from a dataset; the information has been put through a LabelEncoder with:
le = LabelEncoder()
for i in columns:
    # print(i)
    data[i] = le.fit_transform(data[i])
data.head()
This displays the encoded table.
If I use the command:
data['native-country'].value_counts()
I get numerical values, but at this point I want to see the actual country rather than the numerical value assigned. How do I do this?
Thanks.
The function value_counts returns a Series with the values as index entries. Since the values in the dataframe are numbers, that's what you get.
You can use the library phone_iso3166 to look up the numeric country codes (I assume they're telephone prefixes) and update the index:
df
# Out:
# col_1 native_country
# 0 3 39
# 1 2 39
# 2 1 20
vc = df['native_country'].value_counts()
vc
# 39 2
# 20 1
# Name: native_country, dtype: int64
Import the library and look up the country codes:
from phone_iso3166.country import *
vc.to_frame().set_index(vc.index.map(phone_country))
# Out:
# native_country
# IT 2
# EG 1
vc.to_frame().set_index(vc.index.map(phone_country)) \
.rename(columns={'native_country':'count'})
# Out:
# count
# IT 2
# EG 1
Or just use any other feasible dictionary/library for converting the codes to country names.
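For instance, if you do have a plain dictionary mapping the encoded values to country names, a minimal sketch could look like this (the code_to_country mapping below is made up for illustration):
code_to_country = {39: 'Italy', 20: 'Egypt'}  # hypothetical mapping, for illustration only
vc.rename(index=code_to_country)
# Out:
# Italy    2
# Egypt    1
# Name: native_country, dtype: int64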

Replacing wrong country name with the correct one in dataset

Sample Dataset
I am facing an issue and don't know how to approach it.
I have a large dataset with two columns, country and city name. There are multiple entries where the country and city names are spelled incorrectly due to human error, e.g. England is written as Egnald.
Can anybody please guide me on how to check and correct them in Python?
I was able to find the incorrect entries using the code below, but I am not sure how to correct them with an automated process, as I cannot do it manually.
Thanks.
Here is what I have done so far:
import pycountry as pc

# converting billing country to lower string
df['Billing Country'].str.lower()
input_country_list = list(df['Billing Country'])
input_country_list = [element.upper() for element in input_country_list]

def country_name_check():
    pycntrylst = list(pc.countries)
    alpha_2 = []
    alpha_3 = []
    name = []
    common_name = []
    official_name = []
    invalid_countrynames = []
    tobe_deleted = ['IRAN', 'SOUTH KOREA', 'NORTH KOREA', 'SUDAN', 'MACAU',
                    'REPUBLIC OF IRELAND']
    for i in pycntrylst:
        alpha_2.append(i.alpha_2)
        alpha_3.append(i.alpha_3)
        name.append(i.name)
        if hasattr(i, "common_name"):
            common_name.append(i.common_name)
        else:
            common_name.append("")
        if hasattr(i, "official_name"):
            official_name.append(i.official_name)
        else:
            official_name.append("")
    for j in input_country_list:
        if (j not in map(str.upper, alpha_2) and j not in map(str.upper, alpha_3)
                and j not in map(str.upper, name) and j not in map(str.upper, common_name)
                and j not in map(str.upper, official_name)):
            invalid_countrynames.append(j)
    invalid_countrynames = list(set(invalid_countrynames))
    invalid_countrynames = [item for item in invalid_countrynames if item not in tobe_deleted]
    return print(invalid_countrynames)
By running the above code I was able to get the misspelled country names. Can anyone please guide me on how to replace them with the correct ones now?
You can use SequenceMatcher from difflib. It has a ratio() method that allows you to compare the similarity of two strings (a higher number means higher similarity, 1.0 means identical words):
>>> from difflib import SequenceMatcher
>>> SequenceMatcher(None,'Dog','Cat').ratio()
0.0
>>> SequenceMatcher(None,'Dog','Dogg').ratio()
0.8571428571428571
>>> SequenceMatcher(None,'Cat','Cta').ratio()
0.6666666666666666
My idea is to have a list of correct country names, compare each record in your dataframe to each item in this list, and select the most similar one; this should give you the correct country name. Then you can put this into a function and apply it over all records in the country column of your dataframe:
>>> #let's say we have following dataframe
>>> df
number country
0 1 Austria
1 2 Autrisa
2 3 Egnald
3 4 Sweden
4 5 England
5 6 Swweden
>>>
>>> #let's specify correct names
>>> correct_names = {'Austria','England','Sweden'}
>>>
>>> #let's specify the function that select most similar word
>>> def get_most_similar(word,wordlist):
... top_similarity = 0.0
... most_similar_word = word
... for candidate in wordlist:
... similarity = SequenceMatcher(None,word,candidate).ratio()
... if similarity > top_similarity:
... top_similarity = similarity
... most_similar_word = candidate
... return most_similar_word
...
>>> #now apply this function over 'country' column in dataframe
>>> df['country'].apply(lambda x: get_most_similar(x,correct_names))
0 Austria
1 Austria
2 England
3 Sweden
4 England
5 Sweden
Name: country, dtype: object
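To actually keep the corrected values, you would assign the result back to the column (a one-line follow-up to the sketch above):
>>> df['country'] = df['country'].apply(lambda x: get_most_similar(x, correct_names))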
df.replace(['Egnald', 'Cihna'], ['England', 'China'])
This will find and replace in the entire DF
Use df.replace(['Egnald', 'Cihna'], ['England', 'China'], inplace=True) if you want to do this inplace.

How to select specific data from the DataFrame after using value_counts()?

I used Python to read a file which contains babies' names, genders and birth years. Now I want to find the names which are used both by boys and girls. I used value_counts() to get the number of appearances of each name, but now I don't know how to extract those names from all the names.
Here is my code:
def names_both(year):
    names = []
    path = 'babynames/yob%d.txt' % year
    columns = ['name', 'sex', 'birth']
    frame = pd.read_csv(path, names=columns)
    frame = frame['name'].value_counts()
    print(frame)
    """if len(names) != 0:
        print(names)
    else:
        print('None')"""
The frame now is like this:
Lou 2
Willie 2
Erie 2
Cora 2
..
Perry 1
Coy 1
Adolphus 1
Ula 1
Emily 1
Name: name, Length: 1889, dtype: int64
Here is the csv:
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288
Annie,F,1258
Clara,F,1226
Ella,F,1156
Florence,F,1063
...
Thanks for helping!
Here is how to find the names given to both girls and boys and count their totals:
common_girl_and_boys_names = (
# work name by name
frame.groupby('name')
# count the number of sexes recorded for the name and keep only names given to both sexes; this boolean ends up in a column called 0
.apply(lambda x: len(x['sex'].unique()) == 2)
# the name are now in the index, reset it in order to get the names
.reset_index()
# keep only names with the column 0 with True value
.loc[lambda x: x[0], 'name']
)
final_df = (
# keep only the names common to boys and girls (the series built before)
frame.loc[frame['name'].isin(common_girl_and_boys_names), :]
# sex is now useless
.drop(['sex'], axis='columns')
# work name by name and sum the number of birth
.groupby('name')
.sum()
)
You can put those lines after the read_csv call in your function. I hope this is what you want.
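For completeness, here is a rough sketch of how those lines could be slotted into your names_both function; the file path and column names are carried over from your question as assumptions:
import pandas as pd

def names_both(year):
    # read one year of data, same path pattern and columns as in the question
    path = 'babynames/yob%d.txt' % year
    frame = pd.read_csv(path, names=['name', 'sex', 'birth'])
    # names that appear for both sexes
    common_names = (
        frame.groupby('name')
        .apply(lambda x: len(x['sex'].unique()) == 2)
        .reset_index()
        .loc[lambda x: x[0], 'name']
    )
    # total number of births per common name
    return (
        frame.loc[frame['name'].isin(common_names), :]
        .drop(['sex'], axis='columns')
        .groupby('name')
        .sum()
    )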

How to match input data with data in a df, and subtract in a for loop

I want an input string to be matched against the strings in a file that has a fixed number of rows, and then subtract 1 from the score column of the matching row.
1!! == I think this is the for loop to find the matching string line by line, from first to last.
2!! == this is where, once the input string has matched, it should decrease the score of the matched row by 1.
CSV file:
article = pd.read_csv('Customer_List.txt', delimiter=',',
                      names=['ID', 'NAME', 'LASTNAME', 'SCORE', 'TEL', 'PASS'])
y = len(article.ID)
line = article.readlines()
for x in range(0, y):  # 1!!
    if word in line:
        newarticle = int(article.SCORE[x]) - 1  # 2!!
        print(newarticle)
    else:
        x = x + 1
P.S. I have only studied Python for 5 days, so please give me a suggestion. Thank you.
Since I see you using pandas, I will give a solution without any loops as it is much easier.
You have, for example:
df = pd.DataFrame()
df['ID'] = [216, 217]
df['NAME'] = ['Chatchai', 'Bigm']
df['LASTNAME'] = ['Karuna', 'Koratuboy']
df['SCORE'] = [25, 15]
You need to do:
lookfor = str(input("Enter the name: "))
df.loc[df.NAME == lookfor, 'SCORE'] -= 1
What happens in the lines above is, you look for the name entered in the NAME column of your dataframe, and reduce the score by 1 if there is a match, which is what you want if I understand your question.
Example:
Now, let's say you are looking for a person named Alex; since there is no such person, you get the same dataframe back.
Enter the name: Alex
ID NAME LASTNAME SCORE
0 216 Chatchai Karuna 25
1 217 Bigm Koratuboy 15
Now, let's say you are looking for a person named Chatchai; since there is a match and you want the score to be reduced, you will get:
Enter the name: Chatchai
ID NAME LASTNAME SCORE
0 216 Chatchai Karuna 24
1 217 Bigm Koratuboy 15

Get value from pandas series object

I have a bunch of data files with columns 'Names', 'Gender', 'Count', one file per year. I need to concatenate all the files for some period, sum all the counts for each unique name, and add a new column with the number of consonants. I can't extract the string value from 'Names'. How can I implement that?
Here is my code:
import os
import re
import pandas as pd

PATH = ...

def consonants_dynamics(years):
    names_by_year = {}
    for year in years:
        names_by_year[year] = pd.read_csv(PATH + "\\yob{}.txt".format(year),
                                          names=['Names', 'Gender', 'Count'])
    names_all = pd.concat(names_by_year, names=['Year', 'Pos'])
    dynamics = names_all.groupby('Names').sum().sort_values(by='Count', ascending=False).unstack('Names')
    dynamics['Consonants'] = dynamics.apply(count_vowels(dynamics.Names), axis=1)
    return dynamics.head(10)

def count_vowels(name):
    vowels = re.compile('A|E|I|O|U|a|e|i|o|u')
    return len(name) - len(vowels.findall(name))
If I run something like
a = consonants_dynamics(i for i in range (1900, 2001, 10))
I get the following error message
<ipython-input-9-942fc155267e> in consonants_dynamcis(years)
...
---> 12 dynamics['Consonants'] = dynamics.apply(count_vowels(dynamics.Names), axis = 1)
AttributeError: 'Series' object has no attribute 'Names'
I tried various ways but all failed. How can it be done?
After doing unstack you converted dynamics to a Series object, where you no longer have a Names column (dynamics.Names). I think it can be fixed by removing .unstack('Names').
After that, use the names from the index:
dynamics['Consonants'] = dynamics.reset_index()['Names'].apply(count_vowels).values
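Put back into your function, that fix could look roughly like this (same PATH and file layout as in your code, and count_vowels unchanged):
def consonants_dynamics(years):
    names_by_year = {}
    for year in years:
        names_by_year[year] = pd.read_csv(PATH + "\\yob{}.txt".format(year),
                                          names=['Names', 'Gender', 'Count'])
    names_all = pd.concat(names_by_year, names=['Year', 'Pos'])
    # no .unstack('Names'), so dynamics stays a DataFrame indexed by name
    dynamics = names_all.groupby('Names').sum().sort_values(by='Count', ascending=False)
    dynamics['Consonants'] = dynamics.reset_index()['Names'].apply(count_vowels).values
    return dynamics.head(10)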
Convert the index to_series and apply the function:
print (dynamics)
Count
Names
James 2
John 3
Robert 10
def count_vowels (name):
vowels = re.compile('A|E|I|O|U|a|e|i|o|u')
return len(name) - len (vowels.findall(name))
dynamics['Consonants'] = dynamics.index.to_series().apply(count_vowels)
Solution without a function, using str.len and subtracting only the vowels counted by str.count:
pat = 'A|E|I|O|U|a|e|i|o|u'
s = dynamics.index.to_series()
dynamics['Consonants_new'] = s.str.len() - s.str.count(pat)
print (dynamics)
Count Consonants_new Consonants
Names
James 2 3 3
John 3 3 3
Robert 10 4 4
EDIT:
A solution without to_series: add as_index=False to groupby so it returns a DataFrame:
names_all = pd.DataFrame({
    'Names': ['James', 'James', 'John', 'John', 'Robert', 'Robert'],
    'Count': [10, 20, 10, 30, 80, 20]
})

dynamics = (names_all.groupby('Names', as_index=False).sum()
                     .sort_values(by='Count', ascending=False))

pat = 'A|E|I|O|U|a|e|i|o|u'
dynamics['Consonants'] = dynamics['Names'].str.len() - dynamics['Names'].str.count(pat)
print (dynamics)
Names Count Consonants
2 Robert 100 4
1 John 40 3
0 James 30 3
