I have a dataframe "data" which has a column called "Description" containing text such as "the IN678I78 is delivered". Every row has some code that starts with 'IN'.
Now I need to pull that IN------ code out separately into a new column.
Please help.
Thanks
When asking a question, always include a sample of your dataframe so we can visualize your problem and try some solutions.
IIUC, you can use apply on your Description column with simple string splitting to extract the desired feature. You can try the following:
def extr(x):
    lis = x.split(' ')
    for string in lis:
        if string[:2] == 'IN':
            return string

data['New col'] = data.Description.apply(extr)
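A vectorized alternative, as a minimal sketch: assuming each code is 'IN' followed by letters or digits, Series.str.extract can pull it out without a Python-level loop. The sample rows and the exact pattern below are assumptions, since the real dataframe was not shown.

```python
import pandas as pd

# Hypothetical sample data; the real "data" frame was not posted in the question.
data = pd.DataFrame({
    "Description": [
        "the IN678I78 is delivered",
        "order IN12AB34 was shipped",
        "no code in this row",
    ]
})

# \b anchors at a word boundary so "IN" in the middle of a word is not matched;
# the capturing group is the part that str.extract returns.
data["New col"] = data["Description"].str.extract(r"\b(IN\w+)", expand=False)
print(data)
```

Rows without a matching code end up as NaN in the new column, which mirrors the None that the apply version returns.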
I have a dataset. In the column 'Tags' I want to extract from each row all the content that contains the word player. It could be repeated or appear alone in the same cell. Something like this:
'view_snapshot_hi:hab,like_hi:hab,view_snapshot_foinbra,completed_profile,view_page_investors_landing,view_foinbra_inv_step1,view_foinbra_inv_step2,view_foinbra_inv_step3,view_snapshot_acium,player,view_acium_inv_step1,view_acium_inv_step2,view_acium_inv_step3,player_acium-ronda-2_r1,view_foinbra_rinv_step1,view_page_makers_landing'
expected output:
'player,player_acium-ronda-2_r1'
And I need both.
df["Tags"] = df["Tags"].str.ectract(r'*player'*,?\s*')
I tried this but it's not working.
You need to use Series.str.extract, keeping in mind that the pattern must contain a capturing group enclosing the part you need to extract.
The pattern you need is player[^,]*:
df["Tags"] = df["Tags"].str.extract(r'(player[^,]*)', expand=False)
Passing expand=False returns a Series/Index rather than a DataFrame.
Note that Series.str.extract finds and fetches the first match only. To get all matches, use either of the two solutions below with Series.str.findall (which takes no expand argument):
df["Tags"] = df["Tags"].str.findall(r'player[^,]*')  # list of matches per row
df["Tags"] = df["Tags"].str.findall(r'player[^,]*').str.join(", ")  # matches joined into one string
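To make the difference between the two methods concrete, here is a small sketch run on a shortened version of the tag string from the question. The variable names first and all_matches are only for illustration.

```python
import pandas as pd

# Shortened version of the tag string from the question.
df = pd.DataFrame({
    "Tags": [
        "view_snapshot_acium,player,view_acium_inv_step1,"
        "player_acium-ronda-2_r1,view_page_makers_landing"
    ]
})

# extract returns only the first match in each row.
first = df["Tags"].str.extract(r"(player[^,]*)", expand=False)

# findall returns a list of every match; str.join turns it back into a string.
all_matches = df["Tags"].str.findall(r"player[^,]*").str.join(",")

print(first.iloc[0])
print(all_matches.iloc[0])
```

Joining with "," rather than ", " reproduces the expected output format given in the question.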
This simple list comprehension also gives what you want:
words_with_players = [item for item in your_str.split(',') if 'player' in item]
players = ','.join(words_with_players)
I have been trying for a good while now and cannot find an answer online, so... I'm sure someone can help.
I have a dataframe with a column that contains descriptive text, e.g.
"BALANCE SHRINKER - CORE"
Each row has a different text value.
I need to check for the existence of any of multiple words:
['LOB','LIFE','SHRINKER'] say.
And from the result (True/False), create a new column set to 999 if any phrase is found in the text column being searched, or set to 0 otherwise.
I have tried this kind of approach but nothing works for me:
df['rule1'] = 999 if any(x in df['textcolumn'].str for x in ['LOB','LIFE','SHRINKER']) else 0
I've tried .find() and .contains() but to no avail.
So, I'm sure someone can advise!
Thanks for looking.
DT
Use Series.str.contains to check if each row of 'textcolumn' contains any of the words, producing a boolean Series. Then use Series.map to map the True values to 999, and the False values to 0.
# list of words to find in 'textcolumn'
words = ['LOB','LIFE','SHRINKER']
# regex pattern to search in 'textcolumn'
# '|' stands for OR. Read pat as "match 'LOB' OR 'LIFE' OR 'SHRINKER'"
pat = "|".join(words)
df['rule1'] = df['textcolumn'].str.contains(pat).map({True: 999, False: 0})
Another option is to use numpy.where
import numpy as np
words = ['LOB','LIFE','SHRINKER']
pat = "|".join(words)
df['rule1'] = np.where(df['textcolumn'].str.contains(pat), 999, 0)
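If the text column can contain missing values, str.contains returns NaN for those rows by default. The sketch below, on made-up rows, passes na=False so that missing text simply counts as "not found"; whether that is the right treatment for your data is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration, including one missing value.
df = pd.DataFrame({
    "textcolumn": ["BALANCE SHRINKER - CORE", "SAVINGS - BASIC", None]
})

words = ["LOB", "LIFE", "SHRINKER"]
pat = "|".join(words)

# na=False makes missing text evaluate to False instead of propagating NaN.
found = df["textcolumn"].str.contains(pat, na=False)
df["rule1"] = np.where(found, 999, 0)
print(df)
```

Note that this is plain substring matching, so a word such as LIFETIME would also trigger the LIFE rule.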
I have a column containing strings that are made up of different words but always have a similar structure. E.g.:
2cm off ORDER AGAIN (191 1141)
I want to extract the sub-string that starts after the second space and ends at the space before the opening bracket/parenthesis. So in this example I want to extract ORDER AGAIN.
Is this possible?
You could use str.extract here:
df["out"] = df["col"].str.extract(r'^\w+ \w+ (.*?)(?: \(|$)')
Note that this answer is robust even if the string doesn't have a (...) term at the end.
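A quick way to check the pattern is to run it on a couple of rows. The second row here is a made-up example without a trailing (...) term, and expand=False is added so that the result is a Series.

```python
import pandas as pd

# Hypothetical rows: one with a trailing "(...)" term and one without.
df = pd.DataFrame({
    "col": ["2cm off ORDER AGAIN (191 1141)", "5cm on SEND BACK"]
})

# Skip the first two words, then lazily capture up to " (" or the end of the string.
df["out"] = df["col"].str.extract(r"^\w+ \w+ (.*?)(?: \(|$)", expand=False)
print(df)
```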
You can try the following:
r"2cm off ORDER AGAIN (191 1141)".split(r"(")[0].split(" ", maxsplit=2)[-1].strip()
#Out[3]: 'ORDER AGAIN'
If the pattern of data is similar to what you have posted then I think the below code snippet should work for you:
import re
data = "2cm off ORDER AGAIN (191 1141)"
extr = re.match(r".*?\s.*?\s(.*)\s\(.*", data)
if extr:
    print(extr.group(1))
You can try the following code:
s = '2cm off ORDER AGAIN (191 1141)'
second_space = s.find(' ', s.find(' ') + 1)
openparenthesis = s.find('(')
# start one past the second space and strip the space before the parenthesis
substring = s[second_space + 1 : openparenthesis].strip()
print(substring) #ORDER AGAIN
So, I have a simple doubt but I am new to regex. I am working with a Pandas DataFrame. One of the columns contains the names. However, some names are written like "John Doe" but some are written like "John.Doe" and I need to write all of them like "John Doe". I need to run this on the whole dataframe. What is the regex query to fix this and in an efficient manner. Col Name = 'Customer_Name'. Let me know if more details are needed.
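Try running this to replace every . with a space, if that is your only condition. Passing regex=False makes sure the dot is treated as a literal character rather than a regex wildcard:
df['Customer_Name'] = df['Customer_Name'].str.replace('.', ' ', regex=False)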
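As a small sketch on made-up names, the two safe ways of writing this replacement are a literal replace and a regex replace with the dot escaped. Spelling out the regex argument avoids depending on the default of whichever pandas version is installed.

```python
import pandas as pd

# Hypothetical names for illustration.
df = pd.DataFrame({"Customer_Name": ["John Doe", "John.Doe", "Jane.Q.Public"]})

# Literal replacement: regex=False treats "." as a plain character.
literal = df["Customer_Name"].str.replace(".", " ", regex=False)

# Regex replacement: the dot is escaped, otherwise it would match any character.
escaped = df["Customer_Name"].str.replace(r"\.", " ", regex=True)

print(literal.tolist())
```

Both variants are vectorized, so they run over the whole column without an explicit loop.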
All you need is the apply function from pandas, which applies a function to every value in a column. You do not need regex for this, but below is an example that shows both:
import pandas as pd
import re

# Function that does the replacement (defined before it is used)
def format_name(val):
    return val.replace('.', ' ')
    # return re.sub(r'\.', ' ', val) # If you would like to use regex

# Read CSV File
df = pd.read_csv(<PATH TO CSV FILE>)

# Apply Function to Column
df['NewCustomerName'] = df['Customer_Name'].apply(format_name)
I have a list of words, negative, that has 4783 elements. I want to use the following code:
tweets3 = tweets2[tweets2['full_text'].str.contains('|'.join(negative))]
But it gives an error like this: error: multiple repeat at position 4193.
I do not understand this error. Apparently, if I use a single word in str.contains, such as str.contains("deal"), I am able to get results.
All I need is a new dataframe that keeps only those rows of tweets2 whose full_text column contains any of the words.
As a matter of choice I would also like to see if I can have a boolean column for present and absent values as 0 or 1.
I arrived at using the following code with the help of @wp78de:
tweets2['negative'] = tweets2.loc[tweets2['full_text'].str.contains(r'(?:{})'.format('|'.join(negative)), regex=True, na=False)].copy()
For arbitrary literal strings that may contain regular expression metacharacters, you can use the re.escape() function. Escape each word individually before joining, so that the | separators keep their meaning as alternation. Something along these lines should be sufficient:
.str.contains(r'(?:{})'.format('|'.join(map(re.escape, words))), regex=True, na=False)]
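Putting the pieces together, here is a minimal sketch on made-up data. The short negative list stands in for the real 4783-word list, and it includes entries such as "**" of the kind that can trigger a "multiple repeat" error when left unescaped. It builds the escaped pattern, adds a 0/1 column, and keeps only the matching rows.

```python
import re
import pandas as pd

# Hypothetical data; "negative" stands in for the real word list.
tweets2 = pd.DataFrame({
    "full_text": ["what a bad deal", "lovely day", "so-so (meh) result", None]
})
negative = ["bad", "(meh)", "**"]

# Escape each word separately so "|" still acts as the alternation operator.
pattern = "|".join(re.escape(word) for word in negative)

# na=False treats missing text as "no match".
mask = tweets2["full_text"].str.contains(pattern, regex=True, na=False)

# Boolean presence as a 0/1 column, then the filtered dataframe.
tweets2["negative"] = mask.astype(int)
tweets3 = tweets2[mask].copy()
print(tweets3)
```

Keeping the mask in its own variable means the regex is only evaluated once for both the new column and the filter.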