I am cleaning up a dataframe about apples. I am supposed to put the values from the "Age" column into categorical bins, but when I get to actually placing the values into bins and labeling them, either all of my values end up in the first category (code 1), or every value that doesn't fit the first bin gets dropped (code 2).
Try 1
import pandas as pd
import numpy as np
data = {'Fav': ['Gala', 'Fuji', 'GALA', 'GRANNY SMITH', 'Red Delicious',
'All of them!', 'Pink lady', 'IDK', 'granny smith', 'Honey Crisp',
'Fuji', 'Golden delish', 'McIntosh', 'Empire', 'Gala' ],
'Age': [10, 'Old enough', '30+', 'No', 21, 19, 43, 37, 29, 7, 28, 70, 60, 52, 49]}
apples = pd.DataFrame(data)
# create True/False index
B = apples['Age'].str.isnumeric()
# for the index, fill missing values with False
B = B.fillna(False)
# select Age column for only those False values from index and code as missing
apples.loc[~B, 'Age'] = np.nan
# change strings to floats (when I test value counts up to this point everything is there and correct)
apples.loc[~B, 'Age'] = apples.loc[~B, 'Age'].str.replace('%', '', regex=True).astype(float)
# binning (at this point it puts ALL values in the first bin)
bins = [0,1, 18, 26, 36, 46, 56]
labels = ['17 and under', '18-25', '26-35', '36-45', '46-55', '56+']
apples['Age'] = pd.cut(B, bins, labels)
# check my result
apples['Age'].value_counts()
Try 2
# create True/False index
B = apples['Age'].str.isnumeric()
# for the index, fill missing values with False
B = B.fillna(False)
# select Age column for only those False values from index and code as missing
apples.loc[~B, 'Age'] = np.nan
# change strings to floats (when I test value counts up to this point everything is there and correct)
apples.loc[~B, 'Age'] = apples.loc[~B, 'Age'].str.replace('%', '', regex=True).astype(float)
# binning (at this point I lose all values except the "unknown" bin)
apples['Age'] = pd.cut(~B, bins=[0, 1, 18, 26, 36, 46, 56, 100],
                       labels=['unknown', '17 and under', '18-25', '26-35', '36-45', '46-55', '56+'])
# check my result
apples['Age'].value_counts()
Any other way that I attempt to format this code gives me the type error "'<' not supported between instances of 'int' and 'str'"
The main problem in both tries is that you pass the boolean mask (B or ~B) to pd.cut instead of the Age values themselves; coerce the column to numeric first and cut that. Also, although not explicitly listed in the "categories", NaN values do get their own category in a categorical, as seen by their distinct code -1:
# coerce the non-numeric entries ('Old enough', '30+', 'No') to NaN
apples.Age = pd.to_numeric(apples.Age, errors='coerce')
bins = [1, 18, 26, 36, 46, 56, 100]
labels = ['17 and under', '18-25', '26-35', '36-45', '46-55', '56+']
# cut the numeric column itself; rows that are NaN keep the category code -1
apples.Age = pd.cut(apples.Age, bins=bins, labels=labels)
print(apples)
print(apples.Age.cat.codes)
# Output:
Fav Age
0 Gala 17 and under
1 Fuji NaN
2 GALA NaN
3 GRANNY SMITH NaN
4 Red Delicious 18-25
5 All of them! 18-25
6 Pink lady 36-45
7 IDK 36-45
8 granny smith 26-35
9 Honey Crisp 17 and under
10 Fuji 26-35
11 Golden delish 56+
12 McIntosh 56+
13 Empire 46-55
14 Gala 46-55
0 0
1 -1
2 -1
3 -1
4 1
5 1
6 3
7 3
8 2
9 0
10 2
11 5
12 5
13 4
14 4
dtype: int8
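If you want the missing ages to show up as an explicit 'unknown' bin (as in your second try) instead of NaN, a minimal sketch building on the result above (the add_categories/fillna step is my suggestion, not something the original code did):
# register 'unknown' as a valid category, then fill the NaN entries with it
apples.Age = apples.Age.cat.add_categories('unknown').fillna('unknown')
apples.Age.value_counts()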
Related
I have the below dataset. How can I create a new column that shows the difference of money for each person, for each expiry?
The column in yellow (in my screenshot) is what I want: the difference in money at each expiry point for that person. I highlighted the other rows in colors so it is clearer.
Thanks a lot.
Example
import pandas as pd
import numpy as np
example = pd.DataFrame(data={'Day': ['2020-08-30', '2020-08-30', '2020-08-30', '2020-08-30',
                                     '2020-08-29', '2020-08-29', '2020-08-29', '2020-08-29'],
                             'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
                             'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
                             'Expiry': ['1Y', '1Y', '2Y', '2Y', '1Y', '1Y', '2Y', '2Y']})
example_0830 = example[example['Day'] == '2020-08-30'].reset_index()
example_0829 = example[example['Day'] == '2020-08-29'].reset_index()
# build a merge key from name + expiry
example_0830['key'] = example_0830['Name'] + example_0830['Expiry']
example_0829['key'] = example_0829['Name'] + example_0829['Expiry']
example_0829 = pd.DataFrame(example_0829, columns=['key', 'Money'])
example_0830 = pd.merge(example_0830, example_0829, on='key')
example_0830['Difference'] = example_0830['Money_x'] - example_0830['Money_y']
example_0830 = example_0830.drop(columns=['key', 'Money_y', 'index'])
Result:
Day Name Money_x Expiry Difference
0 2020-08-30 John 100 1Y 50
1 2020-08-30 Mike 950 1Y 900
2 2020-08-30 John 200 2Y -50
3 2020-08-30 Mike 1000 2Y -200
If the difference is always taken from the previous date, you can define a date variable at the beginning to find today (t) and the previous day (t-1) and use those to filter the original dataframe.
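A minimal sketch of that idea, assuming example is the DataFrame from the question and the dates are consecutive days:
import pandas as pd

example['Day'] = pd.to_datetime(example['Day'])
t = example['Day'].max()  # today (t)
today = example[example['Day'] == t]
previous = example[example['Day'] == t - pd.Timedelta(days=1)]  # previous day (t-1)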
You can solve it with groupby.diff
Take the dataframe
df = pd.DataFrame({
'Day': [30, 30, 30, 30, 29, 29, 28, 28],
'Name': ['John', 'Mike', 'John', 'Mike', 'John', 'Mike', 'John', 'Mike'],
'Money': [100, 950, 200, 1000, 50, 50, 250, 1200],
'Expiry': [1, 1, 2, 2, 1, 1, 2, 2]
})
print(df)
Which looks like
Day Name Money Expiry
0 30 John 100 1
1 30 Mike 950 1
2 30 John 200 2
3 30 Mike 1000 2
4 29 John 50 1
5 29 Mike 50 1
6 28 John 250 2
7 28 Mike 1200 2
And the code
# make sure the dates are in the order we want (latest first)
df = df.sort_values('Day', ascending=False)
# groupby and get the difference from the next row in each group:
# diff(1) calculates the difference from the previous row, so diff(-1) points to the next
df['Difference'] = df.groupby(['Name', 'Expiry']).Money.diff(-1)
Output
Day Name Money Expiry Difference
0 30 John 100 1 50.0
1 30 Mike 950 1 900.0
2 30 John 200 2 -50.0
3 30 Mike 1000 2 -200.0
4 29 John 50 1 NaN
5 29 Mike 50 1 NaN
6 28 John 250 2 NaN
7 28 Mike 1200 2 NaN
I have a dataset with several columns.
Now what I want is to calculate a similarity score based on a particular column ("fName"), but grouped on the "_id" column.
_id fName lName age
0 ABCD Andrew Schulz
1 ABCD Andreww 23
2 DEFG John boy
3 DEFG Johnn boy 14
4 CDGH Bob TANNA 13
5 ABCD. Peter Parker 45
6 DEFGH Clark Kent 25
So what I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score value. For example, if I run it for column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
5 ABCD Peter Parker 45
6 DEFG Clark Kent 25
I intend to use pyjarowinkler. If I had two independent columns to check (without all the group-by stuff), this is how I use it:
from pyjarowinkler import distance

df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'], df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
So, I have tried using the recordlinkage library for this, and I have ended up with a dataframe called 'matches' containing pairs of indexes that are similar. Now I just want to combine the data.
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is how matches looks:
index1 index2 fName
0 1 1.0
2 3 1.0
I need a way to combine the similar rows so that the merged row takes its data from all of the matched rows.
Just wanted to clear some doubts regarding your question; I couldn't ask them in the comments due to low reputation.
Like here if I run it for col "fName", I should be able to reduce this dataframe, based on a score threshold, to:
So basically your function would return the DataFrame containing the first row in each group (by ID)? That would produce the DataFrame listed below.
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
I hope this code answers your question:
r0 = ['ABCD', 'Andrew', 'Schulz', '']
r1 = ['ABCD', 'Andrew', '', '23']
r2 = ['DEFG', 'John', 'boy', '']
r3 = ['DEFG', 'John', 'boy', '14']
r4 = ['CDGH', 'Bob', 'TANNA', '13']
Rx = [r0, r1, r2, r3, r4]
print(Rx)
print()
# keep one row per id; fill in any non-empty lName/age fields seen in later rows
Dict = dict()
for i in Rx:
    if i[0] in Dict:
        if i[2] != '':
            Dict[i[0]][2] = i[2]
        if i[3] != '':
            Dict[i[0]][3] = i[3]
    else:
        Dict[i[0]] = i
Rx[:] = Dict.values()
print(Rx)
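For reference, the final print on the rows above gives:
[['ABCD', 'Andrew', 'Schulz', '23'], ['DEFG', 'John', 'boy', '14'], ['CDGH', 'Bob', 'TANNA', '13']]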
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in the data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data. Then drop duplicates, keeping the first occurrence of each Id. fillna fills each gap from the next value found in the column, which may belong to another Id, but since you discard the duplicated rows afterwards, drop_duplicates keeping the first occurrence should do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
Hope this helps and apologies if I misunderstood the question.
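As a stricter variant, if you want to guarantee that fills never cross Id boundaries, a minimal sketch (my assumption, not part of the answer above) that backfills within each Id group before dropping duplicates:
filled = df.copy()
# backfill only within each Id group, so values cannot leak between Ids
filled[['fName', 'lName', 'Age']] = df.groupby('Id')[['fName', 'lName', 'Age']].bfill()
df_filled = filled.drop_duplicates('Id', keep='first')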
I have a dataset containing the following columns:
['sex', 'age', 'relationship_status']
There are some NaN values in the 'relationship_status' column, and I want to replace them with the most common value in each group, based on age and gender.
I know how to groupby and count the values:
df2.groupby(['age','sex'])['relationship_status'].value_counts()
and it returns:
age sex relationship_status
17.0 female Married with kids 1
18.0 female In relationship 5
Married 4
Single 4
Married with kids 2
male In relationship 9
Single 5
Married 4
Married with kids 4
Divorced 3
.
.
.
86.0 female In relationship 1
92.0 male Married 1
97.0 male In relationship 1
So again, what I need to achieve is: whenever "relationship_status" is empty, replace it with the most frequent value among people of the same age and gender.
Can anyone suggest how I can do it?
Kind regards.
Something like this:
mode = df2.groupby(['age','sex'])['relationship_status'].transform(lambda x: x.mode()[0])
df2['relationship_status'] = df2['relationship_status'].fillna(mode)
transform returns a result aligned to the original index, so fillna matches each row against the mode of its own (age, sex) group.
Check this; it returns 'ALL_NAN' when an (age, sex) subgroup contains only NaNs:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'age': [25, 25, 25, 25, 25, 25],
     'sex': ['F', 'F', 'F', 'M', 'M', 'M'],
     'status': ['married', np.nan, 'married', np.nan, np.nan, 'single']
     })
df.loc[df['status'].isna(), 'status'] = (
    df.groupby(['age', 'sex'])['status']
      .transform(lambda x: x.mode()[0] if len(x.mode()) else 'ALL_NAN')
)
Output:
age sex status
0 25 F married
1 25 F married
2 25 F married
3 25 M single
4 25 M single
5 25 M single
I have a dataframe called passenger_details which is shown below
Passenger Age Gender Commute_to_work Commute_mode Commute_time ...
Passenger1 32 Male I drive to work car 1 hour
Passenger2 26 Female I take the metro train NaN ...
Passenger3 33 Female NaN NaN 30 mins ...
Passenger4 29 Female I take the metro train NaN ...
...
I want to apply a function that will turn missing (NaN) values into 0 and present values into 1, in every column whose heading contains the string 'Commute'.
This is basically what I'm trying to achieve
Passenger Age Gender Commute_to_work Commute_mode Commute_time ...
Passenger1 32 Male 1 1 1
Passenger2 26 Female 1 1 0 ...
Passenger3 33 Female 0 0 1 ...
Passenger4 29 Female 1 1 0 ...
...
However, I'm struggling with how to phrase my code. This is what I have done
passenger_details = passenger_details.filter(regex = 'Location_', axis = 1).apply(lambda value: str(value).replace('value', '1', 'NaN','0'))
But I get a Type Error of
'replace() takes at most 3 arguments (4 given)'
Any help would be appreciated
Select the columns with DataFrame.columns.str.contains, test for non-missing values with DataFrame.notna, and finally cast the booleans to integers to map True/False to 1/0:
c = df.columns.str.contains('Commute')
df.loc[:, c] = df.loc[:, c].notna().astype(int)
print (df)
Passenger Age Gender Commute_to_work Commute_mode Commute_time
0 Passenger1 32 Male 1 1 1
1 Passenger2 26 Female 1 1 0
2 Passenger3 33 Female 0 0 1
3 Passenger4 29 Female 1 1 0
I have a list called gender, and I counted all the occurrences of its values with Counter:
gender = ['2',
'Female,',
'All Female Group,',
'All Male Group,',
'Female,',
'Couple,',
'Mixed Group,'....]
from collections import Counter

gender_count = Counter(gender)
gender_count
Counter({'2': 1,
'All Female Group,': 222,
'All Male Group,': 119,
'Couple,': 256,
'Female,': 1738,
'Male,': 2077,
'Mixed Group,': 212,
'NA': 16})
I want to put this dict into a pandas DataFrame. I have used pd.Series (Convert Python dict into a dataframe):
s = pd.Series(gender_count, name='gender count')
s.index.name = 'gender'
s.reset_index()
This displays the dataframe I want, but I don't know how to save the result of these steps as a pandas DataFrame.
I also tried using DataFrame.from_dict()
s2 = pd.DataFrame.from_dict(gender_count, orient='index')
But this creates a dataframe with the categories of gender as the index.
I eventually want to use the gender categories and their counts for a pie chart.
Skip the intermediate step
gender = ['2',
'Female',
'All Female Group',
'All Male Group',
'Female',
'Couple',
'Mixed Group']
pd.value_counts(gender)
Female 2
2 1
Couple 1
Mixed Group 1
All Female Group 1
All Male Group 1
dtype: int64
In [21]: df = pd.Series(gender_count).rename_axis('gender').reset_index(name='count')
In [22]: df
Out[22]:
gender count
0 2 1
1 All Female Group, 222
2 All Male Group, 119
3 Couple, 256
4 Female, 1738
5 Male, 2077
6 Mixed Group, 212
7 NA 16
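Since the stated end goal is a pie chart, a minimal sketch using this df (assuming matplotlib is installed):
import matplotlib.pyplot as plt

# one slice per gender category, sized by its count
df.set_index('gender')['count'].plot.pie()
plt.ylabel('')
plt.show()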
What about just:
s = pd.DataFrame(list(gender_count.items()), columns=['gender', 'count'])