Cleaning a column of strings in a pandas dataframe with string methods - python

I have a dataframe (df1) constructed from a survey in which participants entered their gender as a string and so there is a gender column that looks like:
id gender age
1 Male 19
2 F 22
3 male 20
4 Woman 32
5 female 26
6 Male 22
7 make 24
etc.
I've been using
df1.replace('male', 'Male')
for example, but this is really clunky and involves knowing the exact format of each response to fix it.
I've been trying to use various pandas string methods, such as .split(), .replace(), and .capitalize(), together with np.where(), to try to get:
id gender age
1 Male 19
2 Female 22
3 Male 20
4 Female 32
5 Female 26
6 Male 22
7 Male 24
I'm sure there must be a way to use regex to do this but I can't seem to get the code right.
I know that it is probably a multi-step process of removing " ", then capitalising the entry, then replacing the capitalised values.
Any guidance would be much appreciated pythonistas!
Kev

Adapt the code in my comment to replace every record that starts with an F with the word Female:
import re

df1["gender"] = df1["gender"].apply(lambda s: re.sub(
    r"(^F)([A-Za-z]+)*",   # pattern
    "Female",              # replacement
    s.strip().title())     # cleaned string
)
Similarly, swap F for M in the pattern and use Male as the replacement to handle the male variants.
Relevant regex docs
Regex help
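Putting the two substitutions together, here is a minimal self-contained sketch (sample data reconstructed from the question; the anchored patterns and the small Woman/Man mapping are my additions, since a prefix rule alone cannot turn "Woman" into "Female"):

```python
import re
import pandas as pd

# Sample data reconstructed from the question
df1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "gender": ["Male", "F", "male", "Woman", "female", "Male", "make"],
    "age": [19, 22, 20, 32, 26, 22, 24],
})

def normalize_gender(s):
    s = s.strip().title()                              # trim and capitalise
    s = {"Woman": "Female", "Man": "Male"}.get(s, s)   # cases the prefix rule misses
    s = re.sub(r"^F[A-Za-z]*$", "Female", s)           # F, Female, ...
    s = re.sub(r"^M[A-Za-z]*$", "Male", s)             # M, Male, Make (typo), ...
    return s

df1["gender"] = df1["gender"].apply(normalize_gender)
```

After this, the gender column reads Male/Female throughout, including the "make" typo row.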

Related

How do I append a column in a dataframe and give each unique string a number?

I'm looking to append a column in a pandas data frame that is similar to the following "Identifier" column:
Name Age Identifier
Peter Pan 13 PanPe
James Jones 24 JonesJa
Peter Pan 22 PanPe
Chris Smith 19 SmithCh
I need the "Identifier" column to look like:
Identifier
PanPe01
JonesJa01
PanPe02
SmithCh01
How would I number each original string with 01? And if there are duplicates (for example Peter Pan), then the following duplicate strings (after the original 01) will have 02, 03, and so forth?
I've been referred to the following theory:
combo = "PanPe"
counts = {}
if combo in counts:
    count = counts[combo]
    counts[combo] = count + 1
else:
    counts[combo] = 1
However, getting a good example of code would be ideal, as I am relatively new to Python, and would love to know the syntax as how to implement an entire column iterated through this process, instead of just one string as shown above with "PanPe".
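The counting idea sketched above can be written out as a plain loop over the whole column (a sketch using the sample data from the question; the zero-padded f-string suffix is my addition):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["Peter Pan", "James Jones", "Peter Pan", "Chris Smith"],
    "Age": [13, 24, 22, 19],
    "Identifier": ["PanPe", "JonesJa", "PanPe", "SmithCh"],
})

counts = {}
new_ids = []
for combo in df["Identifier"]:
    counts[combo] = counts.get(combo, 0) + 1        # running count per identifier
    new_ids.append(f"{combo}{counts[combo]:02d}")   # append zero-padded suffix

df["Identifier"] = new_ids
```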
You can use cumcount here:
df['new_Identifier'] = df['Identifier'] + (df.groupby('Identifier').cumcount() + 1).astype(str).str.pad(2, 'left', '0')  # thanks to dm2 for the str.pad part
Output:
Name Age Identifier new_Identifier
0 Peter Pan 13 PanPe PanPe01
1 James Jones 24 JonesJa JonesJa01
2 Peter Pan 22 PanPe PanPe02
3 Chris Smith 19 SmithCh SmithCh01
Thank you @dm2 and @Bushmaster

Create sorting dictionary

Write a function dataframe that takes a dictionary as input, creates a dataframe from it, and sorts the dataframe.
Instructions
1. Create a dataframe with the input dictionary
2. Columns should be Name Age
3. Print "Before Sorting"
4. Print a Newline
5. Print the dataframe before sorting.
Note: Printing the dataframe must not contain index.
6. Print a Newline
7. Sort the dataframe in ascending order based on Age column
8. Print "After Sorting"
9. Print a Newline
10. Print the dataframe after sorting. Note: Printing the dataframe must not contain index.
Sample Input (it may change according to use cases, so the input below cannot be hard-coded):
{'William': 42, 'George': 10, 'Joseph': 22, 'Henry': 15, 'Samuel': 32, 'David': 18}
Sample Output
Before Sorting

Name Age
William 42
George 10
Joseph 22
Henry 15
Samuel 32
David 18

After Sorting

Name Age
George 10
Henry 15
David 18
Joseph 22
Samuel 32
William 42
import pandas as pd
import ast

#Enter your code here. Read input from STDIN. Print output to STDOUT
def dataframe(data):
    # Build the dataframe from the input dictionary
    df = pd.DataFrame(data.items(), columns=['Name', 'Age'])
    print("Before Sorting\n")
    print(df.to_string(index=False))
    print()
    df = df.sort_values(by='Age')
    print("After Sorting\n")
    print(df.to_string(index=False))

# Read the input dictionary from STDIN, e.g. {'William': 42, 'George': 10, ...}
dataframe(ast.literal_eval(input()))
Output :
Before Sorting

   Name  Age
William   42
 George   10
 Joseph   22
  Henry   15
 Samuel   32
  David   18

After Sorting

   Name  Age
 George   10
  Henry   15
  David   18
 Joseph   22
 Samuel   32
William   42

How to leave certain values (which have a comma in them) intact when separating list-values in strings in pandas?

From the original dataframe I create a new one, in which the string values from the "Select activity" column are split into lists and then exploded into new rows. But there is one value, "Nothing, just walking", which I need to leave unchanged. Please tell me how I can do this?
The original dataframe looks like this:
Name Age Select activity Profession
0 Ann 25 Cycling, Running Saleswoman
1 Mark 30 Nothing, just walking Manager
2 John 41 Cycling, Running, Swimming Accountant
My code looks like this:
df_new = df.loc[:, ['Name', 'Age']]
df_new['Activity'] = df['Select activity'].str.split(', ')
df_new = df_new.explode('Activity').reset_index(drop=True)
I get this result:
Name Age Activity
0 Ann 25 Cycling
1 Ann 25 Running
2 Mark 30 Nothing
3 Mark 30 just walking
4 John 41 Cycling
5 John 41 Running
6 John 41 Swimming
In order for the value "Nothing, just walking" not to be divided by 2 values, I added the following line:
if df['Select activity'].isin(['Nothing, just walking']) is False:
But it throws an error.
Then let's look ahead after the comma to require a capital letter, and only then split. So instead of splitting on ", " we split on ", (?=[A-Z])":
df_new = df.loc[:, ["Name", "Age"]]
df_new["Activity"] = df["Select activity"].str.split(", (?=[A-Z])")
df_new = df_new.explode("Activity", ignore_index=True)
I only changed the splitter, and passed ignore_index=True to explode instead of resetting the index afterwards (also switched to double quotes).
to get
>>> df_new
Name Age Activity
0 Ann 25 Cycling
1 Ann 25 Running
2 Mark 30 Nothing, just walking
3 John 41 Cycling
4 John 41 Running
5 John 41 Swimming
Or as one line, as usual:
df_new = (df.loc[:, ["Name", "Age"]]
.assign(Activity=df["Select activity"].str.split(", (?=[A-Z])"))
.explode("Activity", ignore_index=True))
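For completeness, here is a self-contained version of the same approach (sample data reconstructed from the question; I pass regex=True to make the intent explicit, which requires pandas >= 1.4 — older versions already treat a multi-character pattern as a regex):

```python
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    "Name": ["Ann", "Mark", "John"],
    "Age": [25, 30, 41],
    "Select activity": ["Cycling, Running",
                        "Nothing, just walking",
                        "Cycling, Running, Swimming"],
    "Profession": ["Saleswoman", "Manager", "Accountant"],
})

df_new = df.loc[:, ["Name", "Age"]]
# Split only on ", " that is followed by a capital letter,
# so "Nothing, just walking" stays in one piece
df_new["Activity"] = df["Select activity"].str.split(r", (?=[A-Z])", regex=True)
df_new = df_new.explode("Activity", ignore_index=True)
```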

Pandas return column data as list without duplicates

This is just an oversimplification but I have this large categorical data.
Name Age Gender
John 12 Male
Ana 24 Female
Dave 16 Female
Cynthia 17 Non-Binary
Wayne 26 Male
Hebrew 29 Non-Binary
Suppose that it is assigned as df and I want it to return as a list with non-duplicate values:
'Male','Female','Non-Binary'
I tried it with this code, but this returns the gender with duplicates
list(df['Gender'])
How can I code it in pandas so that it can return values without duplicates?
In these cases, remember that df["Gender"] is a pandas Series, so you can use .drop_duplicates() to get another Series with the duplicate values removed, or .unique() to get a NumPy array containing the unique values.
>>> df["Gender"].drop_duplicates()
0          Male
1        Female
3    Non-Binary
Name: Gender, dtype: object
>>> df["Gender"].unique()
array(['Male', 'Female', 'Non-Binary'], dtype=object)
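Since the question asks specifically for a plain Python list, you can chain .tolist() onto either result; a minimal sketch with the question's data:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Name": ["John", "Ana", "Dave", "Cynthia", "Wayne", "Hebrew"],
    "Age": [12, 24, 16, 17, 26, 29],
    "Gender": ["Male", "Female", "Female", "Non-Binary", "Male", "Non-Binary"],
})

# .unique() preserves order of first appearance; .tolist() converts to a list
genders = df["Gender"].unique().tolist()
print(genders)  # ['Male', 'Female', 'Non-Binary']
```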

pandas - Can't merge df/series and groupby then count

TL;DR:
Have 2 dataframes of different sizes, but both with an 'id' column that is supposed to act as the index. Need to merge them, group by 'sector' and 'gender', and count the entries in each group.
Long version:
I have a dataframe with 'id' and 'sector', among other columns, from company personnel, and another dataframe with 'id' and 'gender'. Examples below:
df1:
row* id sector other columns
1 0 Operational ...
2 0 Administrative ...
3 1 Sales ...
4 2 IT ...
5 3 Operational ...
6 3 IT ...
7 4 Sales ...
[...]
150 100 Operational ...
151 100 Sales ...
152 101 IT ...
*I don't really have a 'row' column, it's there just to make it easier to understand my problem.
df2:
row* id gender
1 0 Male
2 1 Female
3 2 Female
4 3 Male
5 4 Male
[...]
101 100 Male
102 101 Female
As you can see, one person can be in more than one sector (which seems to make my problem more complicated).
I need to merge them together and then count how many males and females are in each sector.
FIRST PROBLEM
Decided to make a new df to get only the columns 'id' and 'sector'.
df3 = df1[['id','sector']]
df3 = df3.merge(df2)
I get:
No common columns to perform merge on. Merge options: left_on=None,
right_on=None, left_index=False, right_index=False
Tried using .join() instead of .merge() and I get:
"['id'] not in index"
Tried now with reset_index(), found in some of the answers around here, but it didn't really solve my issue.
df1 = df1.reset_index()
df3 = df1[['id','sector']]
df3 = df3.join(df2)
What I got was this:
row* id sector gender
1 0 Operational Male
2 0 Administrative Female
3 1 Sales Female
4 2 IT Male
5 3 Operational Male
6 3 IT ...
7 4 Sales ...
[...]
150 100 Operational NaN
151 100 Sales NaN
152 101 IT NaN
It didn't respect the 'id' column and just concatenated the columns side by side. Since df2 only has 102 rows, I got NaN in the other rows (103 to 152), aside from the fact that the 'gender' values were no longer accurate.
SECOND PROBLEM
Decided to power through that in order to get the rest of the work done. I tried this:
df3 = df3.groupby('sector','gender').size()
It raises:
No axis named gender for object type <class 'pandas.core.frame.DataFrame'>
This doesn't really make sense to me, because I can call df3.gender and get the (entire) expected Series. If I remove 'gender' from the line above, it actually groups, but that alone doesn't work for me. I also tried passing the column names before groupby, to no avail.
Expected result should be something like this:
sector gender sum
operational male 20
operational female 5
administrative male 10
administrative female 17
sales male 12
sales female 13
IT male 1
IT female 11
Not sure if I can answer my own question, but I think I should since the issue is resolved.
The solutions were very simple, even though I don't understand some of the issues I hit.
First problem: added on='id' in the merge:
df3 = df1[['id','sector']].merge(df2, on='id')
Second problem: just a missing list, as pointed out by @DYZ:
df3.groupby(['sector','gender']).size()
Feeling quite stupid right now... Must be tired. Thanks DYZ and sorry for the trouble.
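Putting both fixes together, here is a runnable sketch (minimal versions of the two frames reconstructed from the question; reset_index(name='count') turns the grouped sizes back into a flat table like the expected result):

```python
import pandas as pd

# Minimal versions of the two frames from the question
df1 = pd.DataFrame({
    "id": [0, 0, 1, 2, 3, 3, 4],
    "sector": ["Operational", "Administrative", "Sales", "IT",
               "Operational", "IT", "Sales"],
})
df2 = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "gender": ["Male", "Female", "Female", "Male", "Male"],
})

# Align rows by the shared 'id' column, not by position
df3 = df1[["id", "sector"]].merge(df2, on="id")

# Count entries per (sector, gender) pair
counts = df3.groupby(["sector", "gender"]).size().reset_index(name="count")
```

With these sample frames, the Operational/Male group gets a count of 2, because id 0 and id 3 (both Male) each have an Operational row.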
