Cut substring on multiple columns when one column contains a particular substring - python

I want to cut part of a string in multiple columns (the part to cut differs per column) whenever one column contains a particular substring.
Example:
Assume the following dataframe
import pandas as pd
df = pd.DataFrame({'name':['Allan2','Mike39','Brenda4','Holy5'], 'Age': [30,20,25,18],'Zodiac':['Aries','Leo','Virgo','Libra'],'Grade':['A','AB','B','AA'],'City':['Aura','Somerville','Hendersonville','Gannon'], 'pahun':['a_b_c','c_d_e','f_g','h_i_j']})
print(df)
Out:
name Age Zodiac Grade City pahun
0 Allan2 30 Aries A Aura a_b_c
1 Mike39 20 Leo AB Somerville c_d_e
2 Brenda4 25 Virgo B Hendersonville f_g
3 Holy5 18 Libra AA Gannon h_i_j
For example, if an entry in column 'City' ends with 'e', cut the last three letters of column 'City' and the last two letters of column 'name'.
What I tried so far is something like this:
df['City'] = df['City'].apply(lambda x: df['City'].str[:-3] if df.City.str.endswith('e'))
That doesn't work, and I also don't really know how to cut letters in other columns while using the same condition.
I'm happy for any help I get.
Thank you

You can record the rows where City ends with 'e' in a boolean mask, then update with loc:
mask = df['City'].str[-1] == 'e'
df.loc[mask, 'City'] = df.loc[mask, 'City'].str[:-3]
df.loc[mask, 'name'] = df.loc[mask, 'name'].str[:-2]
Output:
name Age Zodiac Grade City pahun
0 Allan2 30 Aries A Aura a_b_c
1 Mike 20 Leo AB Somervi c_d_e
2 Brend 25 Virgo B Hendersonvi f_g
3 Holy5 18 Libra AA Gannon h_i_j
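Note that the mask could equivalently be written with str.endswith, which reads closer to the original condition:
mask = df['City'].str.endswith('e')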

import pandas as pd
df = pd.DataFrame({'name':['Allan2','Mike39','Brenda4','Holy5'], 'Age': [30,20,25,18],'Zodiac':['Aries','Leo','Virgo','Libra'],'Grade':['A','AB','B','AA'],'City':['Aura','Somerville','Hendersonville','Gannon'], 'pahun':['a_b_c','c_d_e','f_g','h_i_j']})
def func(row):
    index = row.name
    if row['City'][-1] == 'e':  # check the last letter of City for each row; implement your condition here
        df.at[index, 'City'] = df['City'][index][:-3]
        df.at[index, 'name'] = df['name'][index][:-2]

df.apply(func, axis=1)
print (df)
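As a side note, the same update can be done without apply; a minimal vectorized sketch using numpy.where, assuming df as defined above:
import numpy as np
ends_e = df['City'].str.endswith('e')
df['City'] = np.where(ends_e, df['City'].str[:-3], df['City'])
df['name'] = np.where(ends_e, df['name'].str[:-2], df['name'])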


Filter pd.DataFrame by string in header and string in column of the found header

I am trying to work out a way to select/filter the columns whose header contains a specific string, and then the rows where those columns contain another string.
I am a little confused about how to quickly select the columns and then the rows within the selected columns.
Assume the following dataframe df:
Country/Region Record ID
0 France 118
1 France 110
2 United Kingdom 146
3 United Kingdom 836
4 France 944
and I am thinking something like:
condition_1 --> filter the columns that contain "Country" in the header
condition_2 --> filter the rows where the country is "France"
Is it possible to do it with one .loc[] and/or with a def or a lambda function? I need to use it multiple times for several combinations and conditions within my process.
I have tried to combine the following somehow without success:
country_condition = lambda df, string: df.filter(regex=string)
df.loc[country_condition==True, :] or df[df.filter(regex='Country') == 'France']
so any help will be appreciated.
I want to be able to give the string that the header needs to include (here 'Country') and the string that the rows of this column need to include (here 'France'), so that I get:
Country/Region Record ID
0 France 118
1 France 110
4 France 944
A possible solution, which should work with multiple columns with Country in the header:
df.loc[df.filter(like='Country').eq('France').all(axis=1), :]
Output:
Country/Region Record ID
0 France 118
1 France 110
4 France 944
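Since you mention reusing this for several combinations and conditions, the same idea can be wrapped in a small helper; a sketch (the name filter_rows is just an illustration):
def filter_rows(df, header_substr, value):
    # keep rows where every column whose header contains header_substr equals value
    return df.loc[df.filter(like=header_substr).eq(value).all(axis=1), :]

filter_rows(df, 'Country', 'France')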
The simplest thing to do would be to keep your data as is and query it in the standard way:
df[df['Country/Region'] == 'France']
What df.filter(regex=...) does is select the data frame columns whose names match the regex. It is roughly equivalent to running df[[c for c in df.columns if re.search(..., c)]]. But you say you have multiple columns that might not all start with the same name: df.filter may then run into the issue of multiple matches.
>>> import pandas as pd
>>> df0 = pd.DataFrame({'country/adjfkl': ['A', 'B', 'C']})
>>> df1 = pd.DataFrame({'country/a1395d': ['B', 'C', 'D']})
>>> pd.concat([d[lambda e: e.filter(regex='^country').eq('B').any(axis=1)] for d in [df0, df1]])
country/adjfkl country/a1395d
1 B NaN
0 NaN B
The filter result can be dimensionally reduced with any across columns (i.e. horizontally). I can't entirely see why you would do this, however, since you'll be left with inconsistent columns.
It would be better to just rename them ab initio:
>>> import re
>>> pd.concat(
...     [d.rename(
...         columns=lambda s: re.sub('^country.*', 'country_name', s)).loc[
...             lambda e: e['country_name'] == 'B', :
...     ] for d in [df0, df1]]
... )
country_name
1 B
0 B
If you already know the column name and value in that column you want to look up, you can simply use df[df['Country/Region'] == 'France']
But I think your question is more complicated than that.
For the first condition, we need the list of matching column names.
cols = [x for x in df.columns if 'Country' in x]
Next, let's check for 'France' in each of these columns. Since cols is a list, we check the columns one by one and concatenate the results.
new_df = pd.DataFrame()
for col in cols:
    new_df = pd.concat([new_df, df[df[col] == 'France']])
Full code example:
import pandas as pd
df = pd.DataFrame({'Country/Region': ['France', 'France', 'Spain'], 'Record ID': [120, 240, 360]})
cols = [x for x in df.columns if 'Country' in x]
new_df = pd.DataFrame()
for col in cols:
    new_df = pd.concat([new_df, df[df[col] == 'France']])
print(new_df)
prints:
Country/Region Record ID
0 France 120
1 France 240
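A note on the loop: if several matching columns could contain 'France' in the same row, the concat approach may produce duplicate rows. A vectorized alternative (a sketch, assuming cols as built above) keeps each row at most once:
new_df = df[df[cols].eq('France').any(axis=1)]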

How can I split an ID that is of type string in python according to the position of the integers in the ID?

My pandas dataframe currently has a column titled BinLocation that contains the location of a material in a warehouse. For example:
If a part is located in column A02, row 33, and level B21, then the BinLocation ID is A02033B21.
For some entries, the format may be A0233B21. The naming convention is not consistent, but that was not up to me, and now I have to clean the data up.
I want to split the string such that for any given input for the BinLocation, I can return the column, row and level. Ultimately, I want to create 3 new columns for the dataframe (column, row, level).
In case it is not clear, the general structure of the ID is ColumnChar_ColumnInt_RowInt_LevelChar_LevelInt.
Now,for some BinLocations, the ID is separated by a hyphen so I wrote this code for those:
def forHyphenRow(s):
    return s.split('-')[1]

def forHyphenColumn(s):
    return s.split('-')[0]

def forHyphenLevel(s):
    return s.split('-')[2]
How do I do the same but for the other IDs?
Also, is there any way to group the rows of the dataframe by bin column (so all A02 rows are grouped together, all CB-22 rows are grouped together, etc.)?
Here is an answer that:
- uses Python regular expression syntax to parse your ID (handles cases with and without hyphens and can be tweaked to accommodate other quirks of historical IDs if needed)
- puts the ID in a regularized format
- adds columns for the ID components
- sorts based on the ID components so rows are "grouped" together (though not in the "groupby" sense of pandas)
import pandas as pd
df = pd.DataFrame({'BinLocation':['A0233B21', 'A02033B21', 'A02-033-B21', 'A02-33-B21', 'A02-33-B15', 'A02-30-B21', 'A01-33-B21']})
print(df)
print()
df['RawBinLocation'] = df['BinLocation']
import re

def parse(s):
    m = re.match('^([A-Z])([0-9]{2})-?([0-9]+)-?([A-Z])([0-9]{2})$', s)
    if not m:
        return None
    tup = m.groups()
    colChar, colInt, rowInt, levelChar, levelInt = tup[0], int(tup[1]), int(tup[2]), tup[3], int(tup[4])
    tup = (colChar, colInt, rowInt, levelChar, levelInt)
    return pd.Series(tup)
df[['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt']] = df['BinLocation'].apply(parse)
df['BinLocation'] = df.apply(lambda x: f"{x.ColChar}{x.ColInt:02}-{x.RowInt:03}-{x.LevChar}{x.LevInt:02}", axis=1)
df.sort_values(by=['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt'], inplace=True, ignore_index=True)
print(df)
Output:
BinLocation
0 A0233B21
1 A02033B21
2 A02-033-B21
3 A02-33-B21
4 A02-33-B15
5 A02-30-B21
6 A01-33-B21
BinLocation RawBinLocation ColChar ColInt RowInt LevChar LevInt
0 A01-033-B21 A01-33-B21 A 1 33 B 21
1 A02-030-B21 A02-30-B21 A 2 30 B 21
2 A02-033-B15 A02-33-B15 A 2 33 B 15
3 A02-033-B21 A0233B21 A 2 33 B 21
4 A02-033-B21 A02033B21 A 2 33 B 21
5 A02-033-B21 A02-033-B21 A 2 33 B 21
6 A02-033-B21 A02-33-B21 A 2 33 B 21
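For a quick sanity check, parse can also be called on a single ID string (hypothetical usage, relying on the parse function above):
print(parse('A02-33-B21').tolist())  # ['A', 2, 33, 'B', 21]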
If there will always be the first three characters of a string as Column, and last three as Level (and therefore Row as everything in-between):
def forNotHyphenColumn(s):
    return s[:3]

def forNotHyphenLevel(s):
    return s[-3:]

def forNotHyphenRow(s):
    return s[3:-3]
Then, you could sort your DataFrame by Column by creating separate DataFrame columns for the BinLocation items and using df.sort_values():
df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)
df
#Out:
# BinLocation Column Row Level
#0 A02033B21 A02 033 B21
#2 A0233B21 A02 33 B21
#1 C02044C12 C02 044 C12
EDIT:
To also use the hyphen functions in the apply():
df = pd.DataFrame(data={"BinLocation": ["A02033B21", "C02044C12", "A0233B21", "A01-33-C13"]})
# Create dataframe columns for BinLocation items
df["Column"] = df["BinLocation"].apply(lambda x: forHyphenColumn(x) if "-" in x else forNotHyphenColumn(x))
df["Row"] = df["BinLocation"].apply(lambda x: forHyphenRow(x) if "-" in x else forNotHyphenRow(x))
df["Level"] = df["BinLocation"].apply(lambda x: forHyphenLevel(x) if "-" in x else forNotHyphenLevel(x))
# Sort values
df.sort_values(by=["Column"], ascending=True, inplace=True)
df
#Out:
# BinLocation Column Row Level
#3 A01-33-C13 A01 33 C13
#0 A02033B21 A02 033 B21
#2 A0233B21 A02 33 B21
#1 C02044C12 C02 044 C12
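If you would rather have a single entry point than six small functions, the hyphen dispatch can live inside one helper; a minimal sketch (split_bin is a hypothetical name):
def split_bin(s):
    # hyphenated IDs split directly; otherwise assume a 3-char column and a 3-char level
    parts = s.split('-') if '-' in s else [s[:3], s[3:-3], s[-3:]]
    return pd.Series(parts, index=['Column', 'Row', 'Level'])

df[['Column', 'Row', 'Level']] = df['BinLocation'].apply(split_bin)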

I want to replace every value in the age column with its middle value

I have a column that looks like this:
Age
[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)
and I want to remove the "[", "-" and ")". Instead of showing a range such as 0-10, I would like to show the middle value for every row in the column.
Yet another solution:
The dataframe:
df = pd.DataFrame({'Age':['[0-10)','[10-20)','[20-30)','[30-40)','[40-50)','[50-60)','[60-70)','[70-80)']})
df
Age
0 [0-10)
1 [10-20)
2 [20-30)
3 [30-40)
4 [40-50)
5 [50-60)
6 [60-70)
7 [70-80)
The code:
df['Age'] = df.Age.str.extract(r'(\d+)-(\d+)').astype('int').mean(axis=1).astype('int')
The result:
df
Age
0 5
1 15
2 25
3 35
4 45
5 55
6 65
7 75
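For readability, the one-liner can be unpacked into two steps; an equivalent sketch, starting from the original string column:
bounds = df.Age.str.extract(r'(\d+)-(\d+)').astype(int)  # two integer columns: lower and upper bound
df['Age'] = bounds.mean(axis=1).astype(int)              # midpoint of the two bounds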
If you want to explode a row into multiple rows where each row carries a value from the range, you can do this:
data = '''[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)'''
df = pd.DataFrame({'Age': data.splitlines()})
df['Age'] = df['Age'].str.extract(r'\[(\d+)-(\d+)\)').astype(int).apply(lambda r: list(range(r[0], r[1])), axis=1)
df.explode('Age')
Note that I assume your Age column is string-typed, so I used extract to get the boundaries of each range and converted them to a real list of integers. Finally, exploding the dataframe on the modified Age column gets you a new row for each integer in the list; values in the other columns are copied accordingly.
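For reference, the first few exploded rows would look like this (the original index repeats for each generated value):
  Age
0   0
0   1
0   2
...
7  78
7  79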
I tried this:
import pandas as pd
import re
data = {
    'age_range': [
        '[0-10)',
        '[10-20)',
        '[20-30)',
        '[30-40)',
        '[40-50)',
        '[50-60)',
        '[60-70)',
        '[70-80)',
    ]
}
df = pd.DataFrame(data=data)
def get_middle_age(age_range):
    pattern = r'(\d+)'
    ages = re.findall(pattern, age_range)
    return int((int(ages[0]) + int(ages[1])) / 2)
df['age'] = df.apply(lambda row: get_middle_age(row['age_range']), axis=1)
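For reference, this should produce midpoints of 5, 15, ..., 75 in the new column; a sketch of the expected result:
print(df)
  age_range  age
0    [0-10)    5
1   [10-20)   15
...
7   [70-80)   75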

Replace values in a dataframe with values from another dataframe - Regex

I have input data as shown below. Here 'gender' and 'ethderived' are two columns. I would like to replace their numeric values like 1, 2, 3, etc. with categorical values, e.g. 1 with Male, 2 with Female.
The mapping file looks as shown below (a sample of 2 columns).
The input data looks as shown below.
I expect my output dataframe to look like this:
I have tried to do this using the below code. Though the code works fine, I don't see any replace happening. Can you please help me with this?
mapp = pd.read_csv('file2.csv')
data = pd.read_csv('file1.csv')
for col in mapp:
    if col in data.columns:
        print(col)
        s = list(mapp.loc[(mapp[col].str.contains('^\d')==True)].index)
        print("s is",s)
        for i in s:
            print("i is",i)
            try:
                value = mapp[col][i].split('. ')
                print("value 0 is",value[0])
                print("value 1 is",value[1])
                if value[0] in data[col].values:
                    data.replace({col:{value[0]:value[1]}})
            except:
                print("column not present")
    else:
        print("No")
Please note that I have shown only two columns, but in reality there may be more than 600 columns. Any elegant approach/suggestions to make it simple would be helpful. As I have two separate csv files, any suggestions on merge/join etc. would also be helpful, but please note that my mapping file contains values such as "1. Male", "2. Female"; hence I used regex.
Also note that, several other column values can also have mapping values that start with 1. ex: 1. Single, 2. Married, 3. Divorced etc
Looking forward to your help
Use DataFrame.replace with nested dictionaries - the outer key defines the column name to replace in, and the inner dictionary holds the values to replace, created with Series.str.extract:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Gender':['1.Male','2.Female', np.nan],
                   'Ethnicity':['1.Chinese','2.Indian','3.Malay']})
print (df)
Gender Ethnicity
0 1.Male 1.Chinese
1 2.Female 2.Indian
2 NaN 3.Malay
d={x:df[x].str.extract(r'(\d+)\.(.+)').dropna().set_index(0)[1].to_dict() for x in df.columns}
print (d)
{'Gender': {'1': 'Male', '2': 'Female'},
'Ethnicity': {'1': 'Chinese', '2': 'Indian', '3': 'Malay'}}
df1 = pd.DataFrame({'Gender':[2,1,2,1],
                    'Ethnicity':[1,2,3,1]})
print (df1)
Gender Ethnicity
0 2 1
1 1 2
2 2 3
3 1 1
#convert to strings before replace
df2 = df1.astype(str).replace(d)
print (df2)
Gender Ethnicity
0 Female Chinese
1 Male Indian
2 Female Malay
3 Male Chinese
If the entries are always in order (1.XXX, 2.XXX, ...), use:
m = df1.apply(lambda x: x.str[2:])
n = df2.sub(1).replace(m)
print(n)
gender ethderived
0 Female Chinese
1 Male Indian
2 Male Malay
3 Female Chinese
4 Male Chinese
5 Female Indian

Arrange DataFrame columns by column header

I have two pandas data-frames of different sizes, each with over 1 million records.
I am looking to compare these two data-frames and identify the differences.
DataFrameA
ID Name Age Sex
1A1 Cling 21 M
1B2 Roger 22 M
1C3 Stew 23 M
DataFrameB
ID FullName Gender Age
1B2 Roger M 21
1C3 Rick M 23
1D4 Ash F 21
DataFrameB will always have more records than DataFrameA, but records found in DataFrameA may no longer be in DataFrameB.
The column names in the DataFrameA and DataFrameB are different. I have the mapping stored in a different dataframe.
MappingDataFrame
DataFrameACol DataFrameBCol
ID ID
Name FullName
Age Age
Sex Gender
I am looking to compare these two and add a column next to it with the result.
Col Name Adder for DataFrame A = "_A_Txt"
Col Name Adder for DataFrame B = "_B_Txt"
ExpectedOutput
ID Name_A_Txt FullName_B_Txt Result_Name Age_A_Txt Age_B_Txt Result_Age
1B2 Roger Roger Match ... ...
1C3 Stew Rick No Match ... ...
The column names should have the text added before this.
I am using a for loop at the moment to build this logic, but 1 million records take ages to complete. I left the program running for more than 50 minutes and it wasn't complete; in the real data, I am building this for more than 100 columns.
I will open a bounty for this question and award it even if it is answered before the bounty opens, as a reward, because I have been really struggling with the performance of the for-loop iteration.
To start with DataFrameA and DataFrameB, use the below code,
import pandas as pd
d = {
    'ID':['1A1', '1B2', '1C3'],
    'Name':['Cling', 'Roger', 'Stew'],
    'Age':[21, 22, 23],
    'Sex':['M', 'M', 'M']
}
DataFrameA = pd.DataFrame(d)

d = {
    'ID':['1B2', '1C3', '1D4'],
    'FullName':['Roger', 'Rick', 'Ash'],
    'Gender':['M', 'M', 'F'],
    'Age':[21, 23, 21]
}
DataFrameB = pd.DataFrame(d)
I believe this question is a bit different from the suggestion (on joins) provided by Coldspeed, as it also involves looking up different column names and adding a new result column. Also, the column names need to be transformed on the result side.
The output DataFrame looks as below.
For the readers' better understanding, I am listing the column names in order:
Col 1 - ID (Coming from DataFrameA)
Col 2 - Name_X (Coming from DataFrameA)
Col 3 - FullName_Y (Coming from DataFrameB)
Col 4 - Result_Name (Name is what is there in DataFrameA and this is a comparison between Name_X and FullName_Y)
Col 5 - Age_X (Coming from DataFrameA)
Col 6 - Age_Y (Coming From DataFrameB)
Col 7 - Result_Age (Age is what is there in DataFrameA and this is a result between Age_X and Age_Y)
Col 8 - Sex_X (Coming from DataFrameA)
Col 9 - Gender_Y (Coming from DataFrameB)
Col 10 - Result_Sex (Sex is what is there in DataFrameA and this is a result between Sex_X and Gender_Y)
m = list(mapping_df.set_index('DataFrameACol')['DataFrameBCol']
         .drop('ID')
         .items())
m[m.index(('Age', 'Age'))] = ('Age_x', 'Age_y')
m
# [('Name', 'FullName'), ('Age_x', 'Age_y'), ('Sex', 'Gender')]
Start with an inner merge:
df3 = (df1.merge(df2, how='inner', on=['ID'])
          .reindex(columns=['ID', *(v for V in m for v in V)]))
df3
ID Name FullName Age_x Age_y Sex Gender
0 1B2 Roger Roger 22 21 M M
1 1C3 Stew Rick 23 23 M M
Now, compare the columns and mark each pair as Match or No Match:
l, r = map(list, zip(*m))
matches = (df3[l].eq(df3[r].rename(dict(zip(r, l)), axis=1))
                 .add_prefix('Result_')
                 .replace({True: 'Match', False: 'No Match'}))
for k, v in m:
    name = f'Result_{k}'
    df3.insert(df3.columns.get_loc(v)+1, name, matches[name])
df3.columns
# Index(['ID', 'Name', 'FullName', 'Result_Name', 'Age_x', 'Age_y',
# 'Result_Age_x', 'Sex', 'Gender', 'Result_Sex'],
# dtype='object')
df3.filter(like='Result_')
Result_Name Result_Age_x Result_Sex
0 Match No Match Match
1 No Match Match Match
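To get the exact _A_Txt/_B_Txt column suffixes the question asks for, a rename pass could be applied at the end; a sketch under the assumption that m and df3 are as built above (rename_map is a hypothetical name):
rename_map = {}
for a, b in m:
    rename_map[a] = f"{a.split('_')[0]}_A_Txt"  # e.g. Name -> Name_A_Txt, Age_x -> Age_A_Txt
    rename_map[b] = f"{b.split('_')[0]}_B_Txt"  # e.g. FullName -> FullName_B_Txt, Age_y -> Age_B_Txt
df3 = df3.rename(columns=rename_map)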
