If you have two columns (A = 'Name', B = 'Name_Age'), is there a quick way to remove 'Name' from 'Name_Age' to get 'Age', like a reversed concatenation?
I've thought about a string split, but in some cases (when there is no separator to split on) I really need a method that removes the strings of one column from the strings of another.
#example data below:
import pandas as pd
data = {'Name':['Mark','Matt','Michael'], 'Name_Age':['Mark 14','Matt 29','Michael 18']}
df = pd.DataFrame(data)
You can try using pandas apply function, which lets you define your own function to be passed to every row of the dataframe:
def age_from_name_age(name, name_age):
    return name_age.replace(name, '').strip()
df['Age'] = df.apply(lambda x: age_from_name_age(x['Name'], x['Name_Age']),
                     axis='columns')
age_from_name_age takes two strings (a name and a name_age), and returns just the age. Then, in the apply statement, I define an anonymous lambda function that just takes in a row and passes the correct fields to age_from_name_age.
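One caveat with replace() is that it removes every occurrence of the name, not just the leading one. A minimal sketch of a prefix-only variant, assuming the name always appears at the start of Name_Age (strip_prefix is a hypothetical helper name, not part of pandas):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mark', 'Matt', 'Michael'],
                   'Name_Age': ['Mark 14', 'Matt 29', 'Michael 18']})

def strip_prefix(prefix, text):
    # Hypothetical helper: drop `prefix` only when `text` starts with it,
    # then trim the leftover whitespace.
    if text.startswith(prefix):
        return text[len(prefix):].strip()
    return text.strip()

# zip walks both columns in parallel, so no row-wise apply is needed.
df['Age'] = [strip_prefix(name, name_age)
             for name, name_age in zip(df['Name'], df['Name_Age'])]
print(df)
```

The result is still a string column; add .astype(int) if numeric ages are needed.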
Using string slicing:
df['Age'] = df.apply(lambda row: row['Name_Age'][len(row['Name']):], axis=1).astype(int)
You can use str.split() with a space separator to split the values, then rename the resulting columns.
1) Using str.split()
>>> df['Name_Age'].str.split(" ", expand=True).rename(columns={0:'Name', 1:'Age'})
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
OR
>>> df = df['Name_Age'].str.split(" ", expand=True).rename(columns={0:'Name', 1:'Age'})
>>> df
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
OR, by converting the split lists into a new DataFrame:
>>> pd.DataFrame(df.Name_Age.str.split().tolist(), columns="Name Age".split())
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
2) Another option using str.partition
>>> df['Name_Age'].str.partition(" ", True).rename(columns={0:'Name', 2:'Age'}).drop(1, axis=1)
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
3) Another option using df.assign with a lambda
Use split() with the default separator and assign the values back to a new column, Age.
>>> df.assign(Age = df.Name_Age.apply(lambda x: x.split()[1]))
Name Name_Age Age
0 Mark Mark 14 14
1 Matt Matt 29 29
2 Michael Michael 18 18
OR
>>> df.Name_Age.apply(lambda x: pd.Series(str(x).split())).rename({0:"Name",1:"Age"}, axis=1)
Name Age
0 Mark 14
1 Matt 29
2 Michael 18
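The split-based options above assume the name itself contains no spaces. A hedged sketch for the case where it might, splitting only once from the right with str.rsplit (the 'Mary Ann' row is made-up sample data, not from the question):

```python
import pandas as pd

# 'Mary Ann 29' is a hypothetical row added to show a name containing a space.
df = pd.DataFrame({'Name_Age': ['Mark 14', 'Mary Ann 29', 'Michael 18']})

# n=1 limits rsplit to a single split, counted from the right-hand side,
# so only the last space separates the name from the age.
parts = df['Name_Age'].str.rsplit(' ', n=1, expand=True)
parts.columns = ['Name', 'Age']
parts['Age'] = parts['Age'].astype(int)
print(parts)
```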
data = [['Tom', '5-123g'], ['Max', '6-745.0d'], ['Bob', '5-900.0e'], ['Ben', '2-345',], ['Eva', '9-712.x']]
df = pd.DataFrame(data, columns=['Person', 'Action'])
I want to shorten the "Action" column to a length of 5. My current df has two columns:
['Person'] and ['Action']
I need it to look like this:
Person Action Action_short
0 Tom 5-123g 5-123
1 Max 6-745.0d 6-745
2 Bob 5-900.0e 5-900
3 Ben 2-345 2-345
4 Eva 9-712.x 9-712
What I've tried:
Checking the dtype of the column:
df['Action'].dtypes
The output is:
dtype('O')
Then I tried:
df['Action'] = df['Action'].map(str)
df['Action_short'] = df.Action.str.slice(start=0, stop=5)
I also tried it with:
df['Action'] = df['Action'].astype(str)
df['Action'] = df['Action'].values.astype(str)
df['Action'] = df['Action'].map(str)
df['Action'] = df['Action'].apply(str)
and with:
df['Action_short'] = df.Action.str.slice(0, 5)
df['Action_short'] = df.Action.apply(lambda x: x[:5])
df['pos'] = df['Action'].str.find('.')
df['new_var'] = df.apply(lambda x: x['Action'][0:x['pos']],axis=1)
The output from all my versions was:
Person Action Action_short
0 Tom 5-123g 5-12
1 Max 6-745.0d 6-745
2 Bob 5-900.0e 5-90
3 Ben 2-345 2-34
4 Eva 9-712.x 9-712
The lambda function does not work for 3-222; it slices it to 3-22.
I don't understand why it works for some values and not for others.
Try this:
df['Action_short'] = df['Action'].str.slice(0, 5)
By using .str on a single column of a DataFrame (which is a pd.Series), you can access pandas string methods that are designed to mirror the string operations on standard Python strings.
# slice by specifying the length you need
df['Action_short']=df['Action'].str[:5]
df
Person Action Action_short
0 Tom 5-123g 5-123
1 Max 6-745.0d 6-745
2 Bob 5-900.0e 5-900
3 Ben 2-345 2-345
4 Eva 9-712.x 9-712
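One practical difference between the .str accessor and a plain lambda is how missing values are handled. A small sketch, assuming the column may contain NaN (the NaN row is made-up sample data, not from the question):

```python
import numpy as np
import pandas as pd

# The NaN entry is a hypothetical missing value added for illustration.
df = pd.DataFrame({'Action': ['5-123g', '6-745.0d', np.nan, '2-345']})

# .str slicing passes missing values through instead of raising.
df['Action_short'] = df['Action'].str[:5]

# A plain lambda tries to slice the float NaN and raises a TypeError.
try:
    df['Action'].apply(lambda x: x[:5])
    lambda_failed = False
except TypeError:
    lambda_failed = True

print(df)
print(lambda_failed)
```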
I have a dataframe where most of the values are mapped to the wrong columns.
Here is my dataframe.
df:
Name Age Cust_Id
alex 47 1923894833
I need to re-map every value to its correct column.
df_output:
Name Age Phn_No
alex 47 1923894833
For fun, here is a hack way to perform the task (I really wouldn't use this in real-life):
df.apply(lambda row:
         pd.Series(sorted(row, key=lambda x: (not str(x).isalpha())*(1+(len(str(x))>2))),
                   index=row.index), axis=1)
How it works:
convert as string
if all letters -> 0
if length > 2 -> 2
else 1
use the above number to sort the values and generate a new Series
first field will be all letters, second 2 characters, third the longer string
output:
Name Age Phn_No
0 alex 47 1923894833
1 Ross 23 17293883222
2 mike 34 8738272882
3 stefy 39 19298388392
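The same idea may be easier to follow with the sort key spelled out as a named function. This is only a sketch of the hack above: the sample rows are made up to show values landing in the wrong columns, and sort_key and reorder are hypothetical helper names.

```python
import pandas as pd

# Hypothetical sample in which each row's values sit in the wrong columns.
df = pd.DataFrame({
    'Name': ['alex', 23, '8738272882'],
    'Age': [47, 'Ross', 'mike'],
    'Phn_No': ['1923894833', '17293883222', 34],
})

def sort_key(value):
    # 0 for all-letter values (names), 1 for short values (ages),
    # 2 for anything longer than two characters (phone numbers).
    text = str(value)
    if text.isalpha():
        return 0
    if len(text) > 2:
        return 2
    return 1

def reorder(row):
    # Sort the row's values by the key and put them back under the
    # original column labels.
    return pd.Series(sorted(row, key=sort_key), index=row.index)

fixed = df.apply(reorder, axis=1)
print(fixed)
```

Like the one-liner, this relies on every row holding exactly one value of each kind, and on ages having at most two digits.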
My pandas dataframe currently has a column titled BinLocation that contains the location of a material in a warehouse. For example:
If a part is located in column A02, row 33, and then level B21 then the BinLocation ID is A02033B21.
For some columns, the format may be A0233B21. The naming convention is not consistent, but that was not up to me, and now I have to clean the data up.
I want to split the string such that for any given input for the BinLocation, I can return the column, row and level. Ultimately, I want to create 3 new columns for the dataframe (column, row, level).
In case it is not clear, the general structure of the ID is ColumnChar_ColumnInt_RowInt_LevelChar_LevelInt
Now, for some BinLocations, the ID is separated by hyphens, so I wrote this code for those:
def forHyphenRow(s):
    return s.split('-')[1]
def forHyphenColumn(s):
    return s.split('-')[0]
def forHyphenLevel(s):
    return s.split('-')[2]
How do I do the same but for the other IDs?
Also, is there any way to group the rows of the dataframe together by column? (so A02 are all grouped together, CB-22 are all grouped together, etc.)
Here is an answer that:
uses Python regular expression syntax to parse your ID (handles cases with and without hyphens and can be tweaked to accommodate other quirks of historical IDs if needed)
puts the ID in a regularized format
adds columns for the ID components
sorts based on the ID components so rows are "grouped" together (though not in the "groupby" sense of pandas)
import pandas as pd
df = pd.DataFrame({'BinLocation':['A0233B21', 'A02033B21', 'A02-033-B21', 'A02-33-B21', 'A02-33-B15', 'A02-30-B21', 'A01-33-B21']})
print(df)
print()
df['RawBinLocation'] = df['BinLocation']
import re
def parse(s):
    m = re.match('^([A-Z])([0-9]{2})-?([0-9]+)-?([A-Z])([0-9]{2})$', s)
    if not m:
        return None
    tup = m.groups()
    colChar, colInt, rowInt, levelChar, levelInt = tup[0], int(tup[1]), int(tup[2]), tup[3], int(tup[4])
    tup = (colChar, colInt, rowInt, levelChar, levelInt)
    return pd.Series(tup)
df[['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt']] = df['BinLocation'].apply(parse)
df['BinLocation'] = df.apply(lambda x: f"{x.ColChar}{x.ColInt:02}-{x.RowInt:03}-{x.LevChar}{x.LevInt:02}", axis=1)
df.sort_values(by=['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt'], inplace=True, ignore_index=True)
print(df)
Output:
BinLocation
0 A0233B21
1 A02033B21
2 A02-033-B21
3 A02-33-B21
4 A02-33-B15
5 A02-30-B21
6 A01-33-B21
BinLocation RawBinLocation ColChar ColInt RowInt LevChar LevInt
0 A01-033-B21 A01-33-B21 A 1 33 B 21
1 A02-030-B21 A02-30-B21 A 2 30 B 21
2 A02-033-B15 A02-33-B15 A 2 33 B 15
3 A02-033-B21 A0233B21 A 2 33 B 21
4 A02-033-B21 A02033B21 A 2 33 B 21
5 A02-033-B21 A02-033-B21 A 2 33 B 21
6 A02-033-B21 A02-33-B21 A 2 33 B 21
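As a possible alternative to the row-wise parse function, the same kind of pattern can be applied in one vectorized call with Series.str.extract and named groups. This is a sketch that swaps in a different technique, not the approach above; it assumes every ID matches the pattern (unmatched rows would come back as NaN and make the astype(int) step fail), and the group names Column, Row and Level are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'BinLocation': ['A0233B21', 'A02033B21', 'A02-033-B21', 'A01-33-B21']})

# Named groups become the column names of the extracted DataFrame.
pattern = r'^(?P<Column>[A-Z][0-9]{2})-?(?P<Row>[0-9]+)-?(?P<Level>[A-Z][0-9]{2})$'
parts = df['BinLocation'].str.extract(pattern)

# Assumes every row matched; a NaN here would raise on conversion.
parts['Row'] = parts['Row'].astype(int)

result = df.join(parts)

# Grouping in the pandas sense: count how many bins share each column ID.
counts = result.groupby('Column').size()
print(result)
print(counts)
```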
If there will always be the first three characters of a string as Column, and last three as Level (and therefore Row as everything in-between):
def forNotHyphenColumn(s):
    return s[:3]
def forNotHyphenLevel(s):
    return s[-3:]
def forNotHyphenRow(s):
    return s[3:-3]
Then, you could sort your DataFrame by Column by creating separate DataFrame columns for the BinLocation items and using df.sort_values():
df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)
df
#Out:
# BinLocation Column Row Level
#0 A02033B21 A02 033 B21
#2 A0233B21 A02 33 B21
#1 C02044C12 C02 044 C12
EDIT:
To also use the hyphen functions in the apply():
df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21", "A01-33-C13"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forHyphenColumn(x) if "-" in x else forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forHyphenRow(x) if "-" in x else forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forHyphenLevel(x) if "-" in x else forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)
df
#Out:
# BinLocation Column Row Level
#3 A01-33-C13 A01 33 C13
#0 A02033B21 A02 033 B21
#2 A0233B21 A02 33 B21
#1 C02044C12 C02 044 C12
I want to cut a certain part of a string (applied to multiple columns and differs on each column) when one column contains a particular substring
Example:
Assume the following dataframe
import pandas as pd
df = pd.DataFrame({'name':['Allan2','Mike39','Brenda4','Holy5'], 'Age': [30,20,25,18],'Zodiac':['Aries','Leo','Virgo','Libra'],'Grade':['A','AB','B','AA'],'City':['Aura','Somerville','Hendersonville','Gannon'], 'pahun':['a_b_c','c_d_e','f_g','h_i_j']})
print(df)
Out:
name Age Zodiac Grade City pahun
0 Allan2 30 Aries A Aura a_b_c
1 Mike39 20 Leo AB Somerville c_d_e
2 Brenda4 25 Virgo B Hendersonville f_g
3 Holy5 18 Libra AA Gannon h_i_j
For example if one entry of column City ends with 'e', cut the last three letters of column 'City' and the last two letters of column 'name'.
What I tried so far is something like this:
df['City'] = df['City'].apply(lambda x: df['City'].str[:-3] if df.City.str.endswith('e'))
That doesn't work and I also don't really know how to cut letters on other columns while having the same if clause.
I'm happy for any help I get.
Thank you
You can record the rows with City ending with e then use loc update:
mask = df['City'].str[-1] == 'e'
df.loc[mask, 'City'] = df.loc[mask, 'City'].str[:-3]
df.loc[mask, 'name'] = df.loc[mask, 'name'].str[:-2]
Output:
name Age Zodiac Grade City pahun
0 Allan2 30 Aries A Aura a_b_c
1 Mike 20 Leo AB Somervi c_d_e
2 Brend 25 Virgo B Hendersonvi f_g
3 Holy5 18 Libra AA Gannon h_i_j
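Since the question asks for a different cut length per column, the loc update can be driven by a small dictionary. A sketch under the same example data (reduced to the two relevant columns); the cuts mapping is a hypothetical name introduced here, and str.endswith with na=False is used so missing cities count as non-matching:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Allan2', 'Mike39', 'Brenda4', 'Holy5'],
                   'City': ['Aura', 'Somerville', 'Hendersonville', 'Gannon']})

# Compute the mask once, before any column is modified.
mask = df['City'].str.endswith('e', na=False)

# Hypothetical mapping: column name -> number of trailing characters to cut.
cuts = {'City': 3, 'name': 2}

for column, n_chars in cuts.items():
    df.loc[mask, column] = df.loc[mask, column].str[:-n_chars]

print(df)
```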
import pandas as pd
df = pd.DataFrame({'name':['Allan2','Mike39','Brenda4','Holy5'], 'Age': [30,20,25,18],'Zodiac':['Aries','Leo','Virgo','Libra'],'Grade':['A','AB','B','AA'],'City':['Aura','Somerville','Hendersonville','Gannon'], 'pahun':['a_b_c','c_d_e','f_g','h_i_j']})
def func(row):
    index = row.name
    if row['City'][-1] == 'e': #check the last letter of column City for each row, implement your condition here.
        df.at[index, 'City'] = df['City'][index][:-3]
        df.at[index, 'name'] = df['name'][index][:-2]
df.apply(lambda x: func(x), axis =1 )
print (df)
I have a pandas dataframe and in one column I have a string where words are separated by '_', I would like to extract the last element of this string (which is a number) and make a new column with this.
I tried the following
df = pd.DataFrame({'strings':['some_string_25','a_different_one_13','and_a_last_one_40']})
df.assign(number = lambda x: x.strings.str.split('_')[0])
but it gives me this in my last column
number
some
string
25
but I would like to get this
number
25
13
40
How can I do this?
Use Series.str.split to split and select the last value of each list by indexing, or use Series.str.extract to capture the trailing integer of each string: (\d+) matches the digits and $ anchors the match to the end of the string:
df['last'] = df['strings'].str.split('_').str[-1]
df['last1'] = df['strings'].str.extract(r'(\d+)$')
print (df)
strings last last1
0 some_string_25 25 25
1 a_different_one_13 13 13
2 and_a_last_one_40 40 40
The difference between the two shows up with changed data:
df = pd.DataFrame({'strings':['some_string_25','a_different_one_13','and_a_last_one_40',
'aaaa', 'sss58']})
df['last'] = df['strings'].str.split('_').str[-1]
df['last1'] = df['strings'].str.extract(r'(\d+)$')
print (df)
strings last last1
0 some_string_25 25 25
1 a_different_one_13 13 13
2 and_a_last_one_40 40 40
3 aaaa aaaa NaN
4 sss58 sss58 58
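Both columns above hold strings. If actual numbers are wanted, one option is to convert the extracted digits afterwards. A sketch assuming the nullable 'Int64' dtype is acceptable for rows without a trailing number:

```python
import pandas as pd

df = pd.DataFrame({'strings': ['some_string_25', 'a_different_one_13', 'aaaa', 'sss58']})

# expand=False returns a Series instead of a one-column DataFrame.
digits = df['strings'].str.extract(r'(\d+)$', expand=False)

# to_numeric parses the digit strings; Int64 keeps integers alongside
# missing values for rows that had no trailing number.
df['number'] = pd.to_numeric(digits).astype('Int64')
print(df)
```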
You can do:
df['number']=df['strings'].apply(lambda row: row.split('_')[-1])
or:
df['number']=[row[-1] for row in df['strings'].str.split('_')]
Please try this
df = pd.DataFrame({'strings':['some_string_25','a_different_one_13','and_a_last_one_40']})
df['number'] = df.strings.apply(lambda x: x.split('_')[-1])
df