List unique identifier of missing rows in a series - python

Is it possible to return the row number of missing values within a given series?
Name Age
Fred 25
John 38
Chris
I want to return the row number or some unique identifier of any rows where 'Age' is missing, i.e. Chris.

Use:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Age': [25.0, 38.0, np.nan]}, index=['Fred', 'John', 'Chris'])
print (df)
Age
Fred 25.0
John 38.0
Chris NaN
m = df['Age'].isnull()
print (df.index[m].tolist())
['Chris']

You can do df.index[pd.isnull(df['Age'])]
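If by "row number" you mean the integer position rather than the index label, here is a minimal sketch (assuming the same df as in the first answer) using numpy.flatnonzero:
import numpy as np
# Integer positions of rows where 'Age' is missing
positions = np.flatnonzero(df['Age'].isnull())
print(positions.tolist())
# [2]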

How to choose a certain number of characters from a column in Python?

For example, there is a column 'ID' in a dataframe.
One of the entries is, for example, '13245993, 3004992'.
I only want to get '13245993'.
That applies to every row in column 'ID'.
How do I change the data in each row of column 'ID'?
You can try it like this: apply slicing to the 'ID' column to get the required result. I am using 3 characters as the number of characters here.
import pandas as pd
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
'ID':[90877, 10909, 12223, 12334]}
df=pd.DataFrame(data)
print('Before change')
print(df)
df["ID"]=df["ID"].apply(lambda x: (str(x)[:3]))
print('After change')
print(df)
Output:
Before change
Name ID
0 Tom 90877
1 nick 10909
2 krish 12223
3 jack 12334
After change
Name ID
0 Tom 908
1 nick 109
2 krish 122
3 jack 123
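If the 'ID' values are actually comma-separated strings such as '13245993, 3004992' (as in the question), a hedged alternative is to split on the comma and keep the first piece instead of slicing a fixed number of characters:
# Assumes each 'ID' entry is a string like '13245993, 3004992'
df["ID"] = df["ID"].astype(str).str.split(',').str[0].str.strip()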
You could do something like
data[data['ID'] == '13245993']
This will give you the rows where ID is 13245993.
I hope this answers your question; if not, please let me know.
Best regards.

Apply a function on elements in a Pandas column, grouped on another column

I have a dataset with several columns.
What I want is to calculate a similarity score based on a particular column ("name"), but grouped on the "id" column.
_id fName lName age
0 ABCD Andrew Schulz
1 ABCD Andreww 23
2 DEFG John boy
3 DEFG Johnn boy 14
4 CDGH Bob TANNA 13
5 ABCD. Peter Parker 45
6 DEFGH Clark Kent 25
So what I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score value. For example, if I run it for the column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
5 ABCD Peter Parker 45
6 DEFG Clark Kent 25
I intend to use pyjarowinkler.
If I had two independent columns to check (without all the group-by stuff), this is how I would use it:
df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'],df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
So, I have tried using the recordlinkage library for this, and I have ended up with a dataframe called 'matches' containing pairs of indexes that are similar. Now I just want to combine the data.
import recordlinkage
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is what matches looks like:
index1 index2 fName
0 1 1.0
2 3 1.0
I need someone to suggest a way to combine the similar rows so that the combined row takes data from the similar rows.
I just wanted to clear some doubts regarding your question; I couldn't ask them in comments due to low reputation.
For example, if I run it for the column "fName", I should be able to
reduce this dataframe, based on a score threshold, to:
So basically your function would return the DataFrame containing the first row in each group (by ID)? That would result in the DataFrame below:
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
I hope this code answers your question:
r0 =['ABCD','Andrew','Schulz', '' ]
r1 =['ABCD','Andrew', '' , '23' ]
r2 =['DEFG','John' ,'boy' , '' ]
r3 =['DEFG','John' ,'boy' , '14' ]
r4 =['CDGH','Bob' ,'TANNA' , '13' ]
Rx =[r0,r1,r2,r3,r4]
print(Rx)
print()
Dict = dict()
for i in Rx:
    if i[0] in Dict:
        # Fill in missing lName / age from later rows with the same id
        if i[2] != '':
            Dict[i[0]][2] = i[2]
        if i[3] != '':
            Dict[i[0]][3] = i[3]
    else:
        Dict[i[0]] = i
Rx[:] = Dict.values()
print(Rx)
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data. Then drop duplicates, keeping the first occurrence of each Id. fillna fills each value from the next value found in the column, which may belong to another Id, but since you will discard the duplicated rows, I believe drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
import numpy as np
import pandas as pd
data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
Hope this helps and apologies if I misunderstood the question.
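For the jaro-winkler part of the question, here is a minimal sketch (assuming pyjarowinkler is installed; drop_similar_names is a hypothetical helper, and it keeps the first row of each set of near-duplicate names rather than merging them):
from pyjarowinkler import distance
def drop_similar_names(group, col='fName', threshold=0.87):
    # Hypothetical helper: within one _id group, keep a row only if its
    # name is not too similar to a name that was already kept.
    kept_idx, kept_names = [], []
    for idx, name in group[col].items():
        if any(distance.get_jaro_distance(name, kept) > threshold for kept in kept_names):
            continue
        kept_idx.append(idx)
        kept_names.append(name)
    return group.loc[kept_idx]
deduped = df.groupby('_id', group_keys=False).apply(drop_similar_names)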

Python: Sum values in DataFrame if other values match between DataFrames

I have two dataframes of different lengths, like these:
DataFrame A:
FirstName LastName
Adam Smith
John Johnson
DataFrame B:
First Last Value
Adam Smith 1.2
Adam Smith 1.5
Adam Smith 3.0
John Johnson 2.5
Imagine that what I want to do is to create a new column in "DataFrame A" summing all the values with matching last names, so the output in "A" would be:
FirstName LastName Sums
Adam Smith 5.7
John Johnson 2.5
If I were in Excel, I'd use
=SUMIF(dfB!B:B, B2, dfB!C:C)
In Python I've tried multiple solutions using np.where, df.sum(), dropping indexes, etc., but I'm lost. The code below returns "ValueError: Can only compare identically-labeled Series objects", but I don't think it's written correctly anyway.
df_a['Sums'] = df_a[df_a['LastName'] == df_b['Last']].sum()['Value']
Huge thanks in advance for any help.
Use boolean indexing with Series.isin for filtering and then aggregate sum:
df = (df_b[df_b['Last'].isin(df_a['LastName'])]
.groupby(['First','Last'], as_index=False)['Value']
.sum())
If you want to match on both first and last name:
df = (df_b.merge(df_a, left_on=['First','Last'], right_on=['FirstName','LastName'])
.groupby(['First','Last'], as_index=False)['Value']
.sum())
df_b_a = (pd.merge(df_b, df_a, left_on=['First', 'Last'], right_on=['FirstName', 'LastName'], how='left')
          .groupby(by=['First', 'Last'], as_index=False)['Value'].sum())
print(df_b_a)
First Last Value
0 Adam Smith 5.7
1 John Johnson 2.5
Use DataFrame.merge + DataFrame.groupby:
new_df=( dfa.merge(dfb.groupby(['First','Last'],as_index=False).Value.sum() ,
left_on='LastName',right_on='Last',how='left')
.drop('Last',axis=1) )
print(new_df)
To join on both columns:
new_df=( dfa.merge(dfb.groupby(['First','Last'],as_index=False).Value.sum() ,
left_on=['FirstName','LastName'],right_on=['First','Last'],how='left')
.drop(['First','Last'],axis=1) )
print(new_df)
Output:
FirstName LastName Value
0 Adam Smith 5.7
1 John Johnson 2.5
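The Excel SUMIF in the question matches on last name only; a hedged pandas equivalent (a sketch assuming last names uniquely identify people in df_b, as that formula does) uses Series.map on a groupby sum:
# Sum Value per last name in df_b, then map it onto df_a by LastName
sums_by_last = df_b.groupby('Last')['Value'].sum()
df_a['Sums'] = df_a['LastName'].map(sums_by_last)
print(df_a)
#   FirstName LastName  Sums
# 0      Adam    Smith   5.7
# 1      John  Johnson   2.5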

Pandas - Replacing Nulls with the most frequent value from groups

I have a dataset containing the following columns:
['sex', 'age', 'relationship_status']
There are some NaN values in 'relationship_status' column and I want to replace them with the most common value in each group based on age and gender.
I know how to groupby and count the values:
df2.groupby(['age','sex'])['relationship_status'].value_counts()
and it returns:
age sex relationship_status
17.0 female Married with kids 1
18.0 female In relationship 5
Married 4
Single 4
Married with kids 2
male In relationship 9
Single 5
Married 4
Married with kids 4
Divorced 3
.
.
.
86.0 female In relationship 1
92.0 male Married 1
97.0 male In relationship 1
So again, what I need to achieve is that whenever "relationship_status" is empty, the program replaces it with the most frequent value for that person's age and gender.
Can anyone suggest how can I do it?
Kind regards.
Something like this:
mode = df2.groupby(['age','sex'])['relationship_status'].transform(lambda x: x.mode()[0])
df2['relationship_status'].fillna(mode, inplace=True)
Using transform keeps the per-group mode aligned with the original index, so fillna fills each missing row from its own (age, sex) group.
Check this; it returns 'ALL_NAN' when an (age, sex) subgroup contains only NaNs:
import numpy as np
import pandas as pd
df = pd.DataFrame(
{'age': [25, 25, 25, 25, 25, 25,],
'sex': ['F', 'F', 'F', 'M', 'M', 'M', ],
'status': ['married', np.nan, 'married', np.nan, np.nan, 'single']
})
df.loc[df['status'].isna(), 'status'] = df.groupby(['age','sex'])['status'].transform(lambda x: x.mode()[0] if any(x.mode()) else 'ALL_NAN')
Output:
age sex status
0 25 F married
1 25 F married
2 25 F married
3 25 M single
4 25 M single
5 25 M single

pandas: Group by splitting string value in all rows (a column) and aggregation function

If I have a dataset like this:
id person_name salary
0 [alexander, william, smith] 45000
1 [smith, robert, gates] 65000
2 [bob, alexander] 56000
3 [robert, william] 80000
4 [alexander, gates] 70000
If we sum that salary column, we get 316000.
I want to know how much each distinct person ('alexander', 'smith', etc.) makes in salary if we sum the salaries of every row whose person_name list contains that name.
output:
group sum_salary
alexander 171000 #sum from id 0 + 2 + 4 (which contain 'alexander')
william 125000 #sum from id 0 + 3
smith 110000 #sum from id 0 + 1
robert 145000 #sum from id 1 + 3
gates 135000 #sum from id 1 + 4
bob 56000 #sum from id 2
As you can see, the sum of the sum_salary column is not the same as in the initial dataset, because each salary is counted once for every name in its row.
This seemed similar to counting strings, but what confuses me is how to combine it with an aggregation function. I tried creating a list of the distinct values in the person_name column, but then I got stuck.
Any help is appreciated. Thank you very much.
Solutions working with lists in column person_name:
#if necessary
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')
print (type(df.loc[0, 'person_name']))
<class 'list'>
The first idea is to use defaultdict to store the summed values in a loop:
from collections import defaultdict
d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
    for x in p:
        d[x] += int(s)
print (d)
defaultdict(<class 'int'>, {'alexander': 171000,
'william': 125000,
'smith': 110000,
'robert': 145000,
'gates': 135000,
'bob': 56000})
And then:
df1 = pd.DataFrame({'group':list(d.keys()),
'sum_salary':list(d.values())})
print (df1)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another solution: repeat the salary values by the length of the lists and aggregate with sum:
from itertools import chain
df1 = pd.DataFrame({
'group' : list(chain.from_iterable(df['person_name'].tolist())),
'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})
df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
group sum_salary
0 alexander 171000
1 william 125000
2 smith 110000
3 robert 145000
4 gates 135000
5 bob 56000
Another solution:
df_new=(pd.DataFrame({'person_name':np.concatenate(df.person_name.values),
'salary':df.salary.repeat(df.person_name.str.len())}))
print(df_new.groupby('person_name')['salary'].sum().reset_index())
person_name salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000
Can be done concisely with dummies though performance will suffer due to all of the .str methods:
df.person_name.str.join('*').str.get_dummies('*').multiply(df.salary, 0).sum()
#alexander 171000
#bob 56000
#gates 135000
#robert 145000
#smith 110000
#william 125000
#dtype: int64
I parsed this as strings of lists, by copying OP's data and using pandas.read_clipboard(). In case this was indeed the case (a series of strings of lists), this solution would work:
df = df.merge(df.person_name.str.split(',', expand=True), left_index=True, right_index=True)
df = df[[0, 1, 2, 'salary']].melt(id_vars = 'salary').drop(columns='variable')
# Some cleaning up, then a simple groupby
df.value = df.value.str.replace('[', '', regex=False)
df.value = df.value.str.replace(']', '', regex=False)
df.value = df.value.str.replace(' ', '', regex=False)
df.groupby('value')['salary'].sum()
Output:
value
alexander 171000
bob 56000
gates 135000
robert 145000
smith 110000
william 125000
Another way you can do this is with iterrows(). This will not be as fast as jezrael's solution, but it works:
ids = []
names = []
salarys = []
# Iterate over the rows and extract the names from the lists in person_name column
for ix, row in df.iterrows():
    for name in row['person_name']:
        ids.append(row['id'])
        names.append(name)
        salarys.append(row['salary'])
# Create a new 'unnested' dataframe
df_new = pd.DataFrame({'id':ids,
'names':names,
'salary':salarys})
# Groupby on person_name and get the sum
print(df_new.groupby('names').salary.sum().reset_index())
Output:
names salary
0 alexander 171000
1 bob 56000
2 gates 135000
3 robert 145000
4 smith 110000
5 william 125000
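If you are on pandas 0.25 or newer, DataFrame.explode gives a more concise variant of the same idea (a sketch assuming person_name already holds lists, as in the first answer):
out = (df.explode('person_name')
         .groupby('person_name', as_index=False, sort=False)['salary']
         .sum()
         .rename(columns={'person_name': 'group', 'salary': 'sum_salary'}))
print(out)
#        group  sum_salary
# 0  alexander      171000
# 1    william      125000
# 2      smith      110000
# 3     robert      145000
# 4      gates      135000
# 5        bob       56000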
