String manipulation within a column (pandas): split, replace, join - python

I'd like to create a new column based on the following conditions:
if the row contains dogs/dog/chien/chiens, then add -00
if the row contains cats/cat/chat/chats, then add 00-
A sample of data is as follows:
Animal
22 dogs
1 dog
1 cat
3 dogs
32 chats
and so forth.
I'd like as output a column with only numbers (numerical):
Animal New column
22 dogs 22-00
1 dog 1-00
1 cat 00-1
3 dogs 3-00
32 chats 00-32
I think I should use an if condition to check the words, then .split and .join. It's a string-manipulation problem, but I'm having trouble breaking it down.

PRES = set(("cats", "cat", "chat", "chats"))
POSTS = set(("dogs", "dog", "chien", "chiens"))

def fun(words):
    # words will come as e.g. "22 dogs"
    num, ani = words.split()
    if ani in PRES:
        return "00-" + num
    elif ani in POSTS:
        return num + "-00"
    else:
        # you might want to handle this..
        return "unexpected"

df["New Column"] = df["Animal"].apply(fun)
where df is your dataframe. For a fast lookup, we turn the condition lists into sets. Then we apply a function to values of the Animal column of df and act accordingly.
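For illustration, here is a self-contained run of the same approach against the sample data from the question (the dataframe construction is assumed, since the original df isn't shown):

```python
import pandas as pd

PRES = {"cats", "cat", "chat", "chats"}     # prefix "00-"
POSTS = {"dogs", "dog", "chien", "chiens"}  # suffix "-00"

def fun(words):
    # words come as e.g. "22 dogs": split into the number and the animal
    num, ani = words.split()
    if ani in PRES:
        return "00-" + num
    elif ani in POSTS:
        return num + "-00"
    return "unexpected"

df = pd.DataFrame({"Animal": ["22 dogs", "1 dog", "1 cat", "3 dogs", "32 chats"]})
df["New Column"] = df["Animal"].apply(fun)
print(df["New Column"].tolist())
# ['22-00', '1-00', '00-1', '3-00', '00-32']
```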

You could do this: first extract the number, then use np.where (assuming import numpy as np) to conditionally add characters to the string:
df['New Col'] = df['Animal'].str.extract(r'(\d+)', expand=False)
df['New Col'] = np.where(df['Animal'].str.contains('dogs|dog|chiens|chien'), df['New Col']+'-00', df['New Col'])
df['New Col'] = np.where(df['Animal'].str.contains('cats|cat|chat|chats'), '00-'+df['New Col'], df['New Col'])
print(df)
Animal New Col
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32

Since your data is well-formatted, you can use a basic substitution and apply it to each row:
import pandas as pd
import re

def replacer(s):
    return re.sub(r" (chiens?|dogs?)", "-00",
                  re.sub(r"(\d+) ch?ats?", r"00-\1", s))

df = pd.DataFrame({"Animal": ["22 dogs", "1 dog", "1 cat", "3 dogs", "32 chats"]})
df["New Column"] = df["Animal"].apply(replacer)
Output:
Animal New Column
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32

Using re, with the one-liner unpacked into a helper for readability:
import re

list1 = ['dogs', 'dog', 'chien', 'chiens']
list2 = ['cats', 'cat', 'chat', 'chats']

def tag(val):
    num = re.search(r'(\w+)', val).group(1).strip()
    animal = re.search(r'([a-zA-Z]+)', val).group(1).strip()
    if animal in list1:
        return num + "-00"
    if animal in list2:
        return "00-" + num
    return val

df['New_col'] = [tag(val) for val in df['Animal']]
print(df)
Output:
Animal New_col
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32

Create a tuple of search words for each animal:
dog = ('dogs', 'dog', 'chien', 'chiens')
cat = ('cats', 'cat', 'chat', 'chats')
Create conditions for each tuple with the corresponding replacements, then apply them to the column using numpy.select (assuming import numpy as np):
num = df.Animal.str.split().str[0]  # the numbers
# conditions
cond1 = df.Animal.str.endswith(dog)
cond2 = df.Animal.str.endswith(cat)
condlist = [cond1, cond2]
# what should be returned for each successful condition
choicelist = [num + "-00", "00-" + num]
df['New Column'] = np.select(condlist, choicelist)
df
Animal New Column
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
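A self-contained version of the same np.select approach, with the imports and sample frame filled in (assumed, since they aren't shown above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Animal": ["22 dogs", "1 dog", "1 cat", "3 dogs", "32 chats"]})

dog = ('dogs', 'dog', 'chien', 'chiens')
cat = ('cats', 'cat', 'chat', 'chats')

num = df.Animal.str.split().str[0]   # the numbers, as strings
cond1 = df.Animal.str.endswith(dog)  # str.endswith accepts a tuple of suffixes
cond2 = df.Animal.str.endswith(cat)
df['New Column'] = np.select([cond1, cond2], [num + "-00", "00-" + num])
print(df['New Column'].tolist())
# ['22-00', '1-00', '00-1', '3-00', '00-32']
```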

Related

How do I create a list of items (attributes) based on the dummy variable value?

Let's say I have a dataframe in python with a range of animals, and a range of attributes, with dummy variables for whether the animal has that attribute. I'm interested in creating lists, both vertically and horizontally based on dummy variable value. e.g. I'd like to:
a) create a list of animals that have hair
b) create a list of all the attributes that a dog has.
Could anyone please assist with how I would do this in Python? Thanks very much!
Name
Hair
Eyes
Dog
1
1
Fish
0
1
You could use a dictionary to store the animals' attribute values, with the first element of each list holding the 0 or 1 that denotes whether the animal has hair:
animals = { "Dog": [ 1, 1 ], "Fish": [ 0, 1 ] }
(a)
df[ df['Hair'] == 1 ]['Name'].to_list()
df.loc[ df['Hair'] == 1, 'Name'].to_list()
(b)
You may need to transpose the dataframe (to convert rows into columns) and set the column names.
Then you can use similar code:
df[ df['Dog'] == 1 ].index.to_list()
Minimal working code
text = '''Name,Hair,Eyes
Dog,1,1
Fish,0,1'''
import pandas as pd
import io
df = pd.read_csv(io.StringIO(text))
print(df)
print('---')
print('Hair 1:', df[ df['Hair'] == 1 ]['Name'].to_list())
print('hair 2:', df.loc[ df['Hair'] == 1, 'Name'].to_list())
print('---')
# transpose
#new_df = df.transpose() #
new_df = df.T # shorter name - without `()`
# convert first row into column's names
new_df.columns = new_df.loc['Name']
new_df = new_df[1:]
print(new_df)
print('---')
print('Dog :', new_df[ new_df['Dog'] == 1 ].index.to_list())
print('Fish:', new_df[ new_df['Fish'] == 1 ].index.to_list())
Result:
Name Hair Eyes
0 Dog 1 1
1 Fish 0 1
---
Hair 1: ['Dog']
hair 2: ['Dog']
---
Name Dog Fish
Hair 1 0
Eyes 1 1
---
Dog : ['Hair', 'Eyes']
Fish: ['Eyes']

Loop all columns for value in any column

I'm trying to loop through all columns in a dataframe to find where a "Feature" condition is met in order to alter the FeatureValue. So if my dataframe(df) looks like below:
Feature FeatureValue Feature2 Feature2Value
Cat 1 Dog 3
Fish 2 Cat 1
I want to find where Feature=Cat or Feature2=Cat and change FeatureValue and Feature2Value to 20. I tried the below to get started, but am struggling.
for column in df:
    if df.loc[df[column] == "Cat"]:
        print(column)
The solution would look like:
Feature FeatureValue Feature2 Feature2Value
Cat 20 Dog 3
Fish 2 Cat 20
Here is a way to do it:
# First we construct a dictionary linking each feature to its value column
feature_value = {'Feature': 'FeatureValue', 'Feature2': 'Feature2Value'}
# We iterate over each feature column
for feature in feature_value:
    df.loc[df[feature] == 'Cat', feature_value[feature]] = 20
You currently have a wide data structure. To solve your problem in an elegant way, you should convert to a long data structure. I don't know what you are doing with your data, but the long form is often much easier to deal with.
You can do it like this
import pandas as pd
from itertools import chain
# set up your sample data
dta = {'Feature': ['Cat', 'Fish'], 'FeatureValue': [1, 2], 'Feature2': ['Dog', 'Cat'], 'Feature2Value': [3, 1]}
df = pd.DataFrame(data=dta)
# relabel your columns to be able to apply method `wide_to_long`
# this is a little ugly here only because your column labels are not wisely chosen
# if you had [Feature1,FeatureValue1,Feature2,FeatureValue2] as column labels,
# you could get rid of this part
columns = ['Feature', 'FeatureValue'] * int(len(df.columns)/2)
identifier = zip(range(int(len(df.columns)/2)), range(int(len(df.columns)/2)))
identifier = list(chain(*identifier))
columns = ['{}{}'.format(i,j) for i, j in zip(columns, identifier)]
df.columns = columns
# generate result
df['feature'] = df.index
df_long = pd.wide_to_long(df, stubnames=['Feature', 'FeatureValue'], i='feature', j='id')
Now, you converted your data from
Feature FeatureValue Feature2 Feature2Value
0 Cat 1 Dog 3
1 Fish 2 Cat 1
to this
Feature FeatureValue
feature id
0 0 Cat 1
1 0 Fish 2
0 1 Dog 3
1 1 Cat 1
This allows you to answer your problem in a single line, no loops:
df_long.loc[df_long['Feature'] == 'Cat', 'FeatureValue'] = 20
This yields
Feature FeatureValue
feature id
0 0 Cat 20
1 0 Fish 2
0 1 Dog 3
1 1 Cat 20
You can easily go back to your wide format using the same method.
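A sketch of that round trip back to wide format, assuming the df_long built above. unstack is one way to invert wide_to_long; the column flattening shown is one of several options:

```python
import pandas as pd

# rebuild the long frame from the example above
dta = {'Feature0': ['Cat', 'Fish'], 'FeatureValue0': [20, 2],
       'Feature1': ['Dog', 'Cat'], 'FeatureValue1': [3, 20]}
df = pd.DataFrame(data=dta)
df['feature'] = df.index
df_long = pd.wide_to_long(df, stubnames=['Feature', 'FeatureValue'],
                          i='feature', j='id')

# pivot the id level back into columns
df_wide = df_long.unstack('id')
# flatten the MultiIndex columns back to Feature0, Feature1, FeatureValue0, ...
df_wide.columns = [f'{stub}{i}' for stub, i in df_wide.columns]
print(df_wide)
```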

String Matching and get more than 1 column in Pandas

I need to match Name from df1 to Item_Name from df2. Wherever the name matches I also need Item_Id and Material_Name from df2.
I have two data frames:
Df1:
Original df has 1000+ Name
Id Name
1 Paper
2 Paper Bag
3 Scissors
4 Mat
5 Cat
6 Good Cat
2nd Df:
Original df has 1000+ Item_Name
Item_ID Item_Name Material_Name
1 Paper Bag Office
2 wallpaper Decor
3 paper Office
4 cat cage Animal Misc
5 good cat Animal
Expected Output:
Id Name Item_ID Material_Name
1 Paper 1,2,3 Office,Decor,Office
2 Paper Bag 1,2,3 Office,Decor,Office
3 Scissors NA NA
4 Mat NA NA
5 Cat 4,5 Animal Misc, Animal
6 Good Cat 4,5 Animal Misc,Animal
Code (the df2 column is Item_Name, per the frame above):
def matcher(query):
    matches = [i['Item_ID'] for i in df2[['Item_ID', 'Item_Name']].to_dict('records')
               if any(q in i['Item_Name'].lower() for q in query.lower().split())]
    if matches:
        return ','.join(map(str, matches))
    else:
        return 'NA'

df1['Item_ID'] = df1['Name'].apply(matcher)
This worked properly when I needed a single column, but currently I am running the code twice to get Item_ID and Material_Name.
Ask: Is there a way to avoid running the function twice, so that I can get 2 or 3 columns in one go?
Here's one way using pd.DataFrame.loc and reusing Boolean masks:
def matcher(x):
    # construct 2-way mask
    m1 = df2['Item_Name'].str.contains(x, regex=False, case=False)
    m2 = [any(w in i.lower() for w in x.lower().split()) for i in df2['Item_Name']]
    # apply 2-way mask
    res_id = df2.loc[m1 | m2, 'Item_ID']
    res_mat = df2.loc[m1 | m2, 'Material_Name']
    return ','.join(res_id.astype(str)), ','.join(res_mat.astype(str))

df1[['Item_ID', 'Material_Name']] = pd.DataFrame(df1['Name'].apply(matcher).tolist())
print(df1)
print(df1)
Id Name Item_ID Material_Name
0 1 Paper 1,2,3 Office,Decor,Office
1 2 Paper Bag 1,2,3 Office,Decor,Office
2 3 Scissors
3 4 Mat
4 5 Cat 4,5 Animal Misc,Animal
5 6 Good Cat 4,5 Animal Misc,Animal
You can try getting both Item_ID and Material_Name as a tuple from your query, then fill each column with [i[0] for i in matches] or [i[1] for i in matches]:
def matcher(query):
    matches = [(i['Item_ID'], i['Material_Name'])
               for i in df2.to_dict('records')
               if any(q in i['Item_Name'].lower() for q in query.lower().split())]
    if matches:
        return (','.join(str(i[0]) for i in matches),
                ','.join(str(i[1]) for i in matches))
    return ('NA', 'NA')

df1[['Item_ID', 'Material_Name']] = df1['Name'].apply(matcher).tolist()

GroupBy Value in DataFrame and getting a list of words separated by comma

I have a pandas dataframe as shown here. There are many more columns in that frame that are not important concerning the task.
id pos value sente
1 a I 21
2 b have 21
3 b a 21
4 a cat 21
5 d ! 21
1 a My 22
2 a cat 22
3 b is 22
4 a cute 22
5 d . 22
I now want to group all rows with the same sente value and join the words in value to form a sentence in a list. So the output should look something like this (a list of strings separated by commas):
["I have a cat!", "My cat is cute."]
I suppose the first step is to use groupby("sente"):
fill = (df.groupby("sente").apply(lambda df: df["value"].values)).reset_index().rename(columns={0: "content"})
fill = [word for word in fill["content"]]
However doing so I get this output:
print(fill):
[array(['I','have','a','cat','!'],dtype=object), array(['My','cat','is','cute','.'],dtype=object)]
Is there any way to join all the words of a sentence without treating each as a separate string, and to remove the array and dtype parts?
You need to join all values except the last with a space, then append the last one:
L = (df.groupby("sente")['value']
.apply(lambda x: ' '.join(x.iloc[:-1]) + x.iloc[-1])
.tolist())
print (L)
['I have a cat!', 'My cat is cute.']
because otherwise there is an unnecessary space before ! and .:
print (df.groupby("sente")['value'].apply(' '.join).tolist())
['I have a cat !', 'My cat is cute .']
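A hedged alternative sketch (not from the answer above): join everything with spaces first, then strip the space before punctuation with a regex afterwards:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "value": ["I", "have", "a", "cat", "!", "My", "cat", "is", "cute", "."],
    "sente": [21] * 5 + [22] * 5,
})

# join with spaces, then delete any space sitting directly before punctuation
L = (df.groupby("sente")["value"]
       .apply(lambda x: re.sub(r" (?=[!?.,])", "", " ".join(x)))
       .tolist())
print(L)
# ['I have a cat!', 'My cat is cute.']
```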

Pandas dataframe column value case insensitive replace where <condition>

Is there a case insensitive version for pandas.DataFrame.replace? https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.replace.html
I need to replace string values in a column subject to a case-insensitive condition of the form "where label == a or label == b or label == c".
The issue with some of the other answers is that they only work with a Series, or with DataFrames that can be implicitly converted to a Series. I understand this is because the .str accessor exists on the Series class but not on the DataFrame class.
To work with Dataframes, you can make your regular expression case insensitive with the (?i) extension. I don't believe this is available in all flavors of RegEx but it works with Pandas.
d = {'a':['test', 'Test', 'cat'], 'b':['CAT', 'dog', 'Cat']}
df = pd.DataFrame(data=d)
a b
0 test CAT
1 Test dog
2 cat Cat
Then use replace as you normally would but with the (?i) extension:
df.replace('(?i)cat', 'MONKEY', regex=True)
a b
0 test MONKEY
1 Test dog
2 MONKEY MONKEY
I think you need to convert to lowercase and then replace by condition with isin:
d = {'a':['test', 'Test', 'cat', 'CAT', 'dog', 'Cat']}
df = pd.DataFrame(data=d)
m = df['a'].str.lower().isin(['cat','test'])
df.loc[m, 'a'] = 'baby'
print (df)
a
0 baby
1 baby
2 baby
3 baby
4 dog
5 baby
Another solution (this needs import re for the flag, and regex=True on newer pandas):
import re
df['b'] = df['a'].str.replace('test', 'baby', flags=re.I, regex=True)
print (df)
a b
0 test baby
1 Test baby
2 cat cat
3 CAT CAT
4 dog dog
5 Cat Cat
