How to convert all columns to lowercase except a few specific ones? - python

I would like to convert all columns within a dataframe to lowercase except two. To convert the whole dataframe I usually do
df=df.apply(lambda x: x.astype(str).str.lower())
My dataset is
Time  Name     Surname    Age  Notes    Comments
12    Mirabel  Gutierrez  23   None     Already Paid
09    Kim      Stuart     45   In debt  Should refund 100 EUR
and so on.
I would like to transform all the columns into lowercase except Notes and Comments.
Time  Name     Surname    Age  Notes    Comments
12    mirabel  gutierrez  23   None     Already Paid
09    kim      stuart     45   In debt  Should refund 100 EUR
What can I try?

You probably simply want to create a list of the relevant columns:
lowerify_cols = [col for col in df if col not in ['Notes','Comments']]
Then you can use your code snippet:
df[lowerify_cols] = df[lowerify_cols].apply(lambda x: x.astype(str).str.lower())
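For reference, a minimal end-to-end sketch, rebuilding the sample data from the question as a hypothetical frame:
import pandas as pd

df = pd.DataFrame({
    "Time": ["12", "09"],
    "Name": ["Mirabel", "Kim"],
    "Surname": ["Gutierrez", "Stuart"],
    "Age": [23, 45],
    "Notes": ["None", "In debt"],
    "Comments": ["Already Paid", "Should refund 100 EUR"],
})

# every column except the two to keep intact
lowerify_cols = [col for col in df if col not in ['Notes', 'Comments']]
# lowercase only the selected columns, leaving Notes and Comments untouched
df[lowerify_cols] = df[lowerify_cols].apply(lambda x: x.astype(str).str.lower())
print(df)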

Related

How to leave certain values (which have a comma in them) intact when separating list-values in strings in pandas?

From the dataframe I create a new dataframe in which the values from the "Select activity" column contain lists, which I split and transform into new rows. But there is one value, "Nothing, just walking", which I need to leave unchanged. How can I do this?
The original dataframe looks like this:
  Name  Age  Select activity             Profession
0 Ann   25   Cycling, Running            Saleswoman
1 Mark  30   Nothing, just walking       Manager
2 John  41   Cycling, Running, Swimming  Accountant
My code looks like this:
df_new = df.loc[:, ['Name', 'Age']]
df_new['Activity'] = df['Select activity'].str.split(', ')
df_new = df_new.explode('Activity').reset_index(drop=True)
I get this result:
  Name  Age  Activity
0 Ann   25   Cycling
1 Ann   25   Running
2 Mark  30   Nothing
3 Mark  30   just walking
4 John  41   Cycling
5 John  41   Running
6 John  41   Swimming
In order for the value "Nothing, just walking" not to be split into two values, I added the following line:
if df['Select activity'].isin(['Nothing, just walking']) is False:
But it throws an error.
Let's look ahead after the comma for a capital letter, and only then split. So instead of splitting on ", " the separator becomes the regex ", (?=[A-Z])":
df_new = df.loc[:, ["Name", "Age"]]
df_new["Activity"] = df["Select activity"].str.split(", (?=[A-Z])")
df_new = df_new.explode("Activity", ignore_index=True)
I only changed the splitter, used ignore_index=True in explode instead of resetting the index afterwards (and switched the quote style)
to get
>>> df_new
   Name  Age               Activity
0   Ann   25                Cycling
1   Ann   25                Running
2  Mark   30  Nothing, just walking
3  John   41                Cycling
4  John   41                Running
5  John   41               Swimming
Or, as a one-liner:
df_new = (df.loc[:, ["Name", "Age"]]
.assign(Activity=df["Select activity"].str.split(", (?=[A-Z])"))
.explode("Activity", ignore_index=True))
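To see why the lookahead keeps "Nothing, just walking" intact, here is a quick check with Python's re module (str.split treats a multi-character pattern as a regex in the same way):
import re

pattern = r", (?=[A-Z])"  # split on ", " only when a capital letter follows

print(re.split(pattern, "Cycling, Running, Swimming"))
# ['Cycling', 'Running', 'Swimming']
print(re.split(pattern, "Nothing, just walking"))
# ['Nothing, just walking'] -- "just" is lowercase, so no split happens
Note this relies on every real activity starting with a capital letter; an activity written in lowercase would not be split off.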

Pandas return column data as list without duplicates

This is a simplified example, but I have this large categorical dataset.
Name     Age  Gender
John     12   Male
Ana      24   Female
Dave     16   Female
Cynthia  17   Non-Binary
Wayne    26   Male
Hebrew   29   Non-Binary
Suppose it is assigned as df and I want it to return a list with no duplicate values:
'Male','Female','Non-Binary'
I tried this code, but it returns the genders with duplicates:
list(df['Gender'])
How can I code it in pandas so that it can return values without duplicates?
In these cases remember that df["Gender"] is a pandas Series, so you can use .drop_duplicates() to get another Series with the duplicate values removed, or .unique() to get a NumPy array containing the unique values.
>> df["Gender"].drop_duplicates()
0 Male
1 Female
3 Non-Binary
4 Male
Name: Gender, dtype: object
>> df["Gender"].unique()
['Male ' 'Female' 'Non-Binary' 'Male']
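Since the question asks for a plain Python list, either result can be converted with list() or .tolist():
genders = df["Gender"].unique().tolist()
print(genders)  # ['Male', 'Female', 'Non-Binary']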

How to drop rows in one DataFrame based on one similar column in another Dataframe that has a different number of rows

I have two DataFrames that are completely dissimilar except for certain values in one particular column:
df
  First   Last   Email             Age
0 Adam    Smith  email1@email.com  30
1 John    Brown  email2@email.com  35
2 Joe     Max    email3@email.com  40
3 Will    Bill   email4@email.com  25
4 Johnny  Jacks  email5@email.com  50
df2
  ID    Location  Contact
0 5435  Austin    email5@email.com
1 4234  Atlanta   email1@email.com
2 7896  Toronto   email3@email.com
How would I go about finding the matching values in the Email column of df and the Contact column of df2, and then dropping the whole row in df based on that match?
Output I'm looking for (index numbering doesn't matter):
df
  First  Last   Email             Age
1 John   Brown  email2@email.com  35
3 Will   Bill   email4@email.com  25
I've been able to identify matches using a few different methods like:
Changing the column names to be identical and merging:
common = df.merge(df2, on=['Email'])
df3 = df[~df['Email'].isin(common['Email'])]
But df3 still shows all the rows from df.
I've also tried:
common = df['Email'].isin(df2['Contact'])
df.drop(df[common].index, inplace = True)
And again, it identifies the matches, but df still contains all the original rows.
So the main thing I'm having difficulty with is updating df with the matches dropped or creating a new DataFrame that contains only the rows with dissimilar values when comparing the Email column from df and the Contact column in df2. Appreciate any suggestions.
As mentioned in the comments (@Arkadiusz), it is enough to filter your data as follows:
df3 = df[~df['Email'].isin(df2.Contact)].copy()
print(df3)
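A runnable sketch of that filter, rebuilding the sample frames from the question:
import pandas as pd

df = pd.DataFrame({
    "First": ["Adam", "John", "Joe", "Will", "Johnny"],
    "Last": ["Smith", "Brown", "Max", "Bill", "Jacks"],
    "Email": ["email1@email.com", "email2@email.com", "email3@email.com",
              "email4@email.com", "email5@email.com"],
    "Age": [30, 35, 40, 25, 50],
})
df2 = pd.DataFrame({
    "ID": [5435, 4234, 7896],
    "Location": ["Austin", "Atlanta", "Toronto"],
    "Contact": ["email5@email.com", "email1@email.com", "email3@email.com"],
})

# keep only the rows whose Email does NOT appear in df2's Contact column
df3 = df[~df["Email"].isin(df2["Contact"])].copy()
print(df3)
#   First   Last             Email  Age
# 1  John  Brown  email2@email.com   35
# 3  Will   Bill  email4@email.com   25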

How to Multiply a column value based on a list of ID numbers?

I have a dataset and want to multiply a value column by 2.5 based on the ID values in a list.
My data frame looks like this
Name   ID  Salary
James  21  25,000
Sam    12  15,000
My list, call it s = ["21", "36"], holds the ID numbers.
How do I multiply the salary by 2.5 for the matching ID numbers?
The goal is to have something like this
Name   ID  Salary
James  21  62,500
Sam    12  15,000
First convert Salary to numeric, then convert the values of ID to strings, test membership with Series.isin, and multiply via DataFrame.loc, selecting the rows by the mask and the column by the name Salary:
s = ["21", "36"]
#if values of Salary are strings
#df = pd.read_csv(file, thousands=',')
#or
#df['Salary'] = df['Salary'].str.replace(',','').astype(int)
#ID are converted to strings by `astype`, because valus in list s are strings
df.loc[df['ID'].astype(str).isin(s), 'Salary'] *= 2.5
#if s are numeric
#df.loc[df['ID'].isin(s), 'Salary'] *= 2.5
print (df)
    Name  ID   Salary
0  James  21  62500.0
1    Sam  12  15000.0
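Note that Salary comes out as float because the multiplier 2.5 is a float; if integer output is preferred, cast back afterwards (safe here, since the results are whole numbers):
df['Salary'] = df['Salary'].astype(int)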

Replacing values from one dataframe to another

I'm trying to fix discrepancies in a column of one df using a column in another.
The tables are not sorted either.
How can I do this using Python? Example:
df1
Age  Name
40   Sid Jones
50   Alex, Bot
32   Tony Jar
65   Fred, Smith
24   Brad, Mans
df2
Age  Name
24   Brad Mans
32   Tony Jar
40   Sid Jones
65   Fred Smith
50   Alex Bot
I need to replace the values in df2 so that they match those in df1; as you can see in my example, the discrepancies are the commas in the names.
Expected outcome for df2:
Age  Name
24   Brad, Mans
32   Tony Jar
40   Sid Jones
65   Fred, Smith
50   Alex, Bot
The values in df2 should be changed to match df1's values.
Create a column in df1 with commas removed from the Name column:
df1['Name_nocomma'] = df1.Name.str.replace(',', '')
Merge df1 into df2 using Name_nocomma and Name to get the corrected Name in a new version of df2:
df2_out = df2.merge(df1, left_on='Name', right_on='Name_nocomma', how='left')[['Age_x', 'Name_x', 'Name_y']]
Use combine_first to coalesce Name_y and Name_x into a new column Name:
df2_out['Name'] = df2_out.Name_y.combine_first(df2_out.Name_x)
Drop/rename the intermediate columns:
del df1['Name_nocomma']
del df2_out['Name_x']
del df2_out['Name_y']
df2_out.rename({'Age_x': 'Age'}, axis=1, inplace=True)
df2_out
#outputs:
   Age         Name
0   24   Brad, Mans
1   32     Tony Jar
2   40    Sid Jones
3   65  Fred, Smith
4   50    Alex, Bot
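An alternative sketch (assuming each df2 name matches exactly one df1 name once commas are stripped) maps df2's names through a lookup keyed by the comma-less form:
import pandas as pd

# build a lookup: comma-less name -> original df1 name (with commas)
lookup = pd.Series(df1['Name'].values, index=df1['Name'].str.replace(',', ''))
df2['Name'] = df2['Name'].map(lookup)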
You could also sort both frames and concatenate them:
df1 = df1.sort_values(by=['Age'])
df2 = df2.sort_values(by=['Age'])
result_df = pd.concat([df1, df2])
