I am working on a thesis project about smart working. I downloaded some tweets using Python and I wanted to get rid of users/mentions before building word clouds. However, I can't delete the users; with the commands shown I only delete the hashtags ("#").
df['token']=df['token'].apply(lambda x:re.sub(r"#mention","", x))
df['token']=df['token'].apply(lambda x:re.sub(r"#[A-Za-z0-9]+","", x))
Your second pattern should work for hashtags, but note that mentions start with @, so use @ instead of # there. For efficiency, use str.replace:
df['token2'] = df['token'].str.replace(r'#[A-Za-z0-9]+\s?', '', regex=True)
# or use \w, which matches [A-Za-z0-9_]
# df['token2'] = df['token'].str.replace(r'#\w+\s?', '', regex=True)
example:
                  token          token2
0  this is a #test case  this is a case
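Since the actual goal in the question is removing users/mentions (which begin with @, not #), here is a minimal runnable sketch of the same str.replace approach with an @ pattern, using toy data in place of the downloaded tweets:

```python
import pandas as pd

# Toy stand-in for the tweet tokens (hypothetical data)
df = pd.DataFrame({'token': ['hello @user1 this is a #test', '@user2 nice #day']})

# Mentions start with '@'; \w covers letters, digits and underscores,
# which is what Twitter usernames consist of
df['token'] = df['token'].str.replace(r'@\w+\s?', '', regex=True)

print(df['token'].tolist())  # ['hello this is a #test', 'nice #day']
```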
Looked for a while on here, but couldn't find the answer.
df['Products'] = ['CC: buns', 'people', 'CC: help me']
Trying to get only text after colon or keep text if no colon is in the string.
Tried a lot of things, but this was my final attempt.
x['Product'] = x['Product'].apply(lambda i: i.extract(r'(?i):(.+)') if ':' in i else i)
I get this error:
Might take two steps, I assume.
I tried this:
x['Product'] = x['Product'].str.extract(r'(?i):(.+)')
Got me everything after the colon and a bunch of NaN, so my regex is working. I am assuming my lambda sucks.
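If you want it in one step, a possible sketch (not from the thread) is .str.extract followed by fillna, which restores the rows where no colon was found:

```python
import pandas as pd

x = pd.DataFrame({'Product': ['CC: buns', 'people', 'CC: help me']})

# Rows without a colon yield NaN from extract; fillna puts the original back
extracted = x['Product'].str.extract(r':\s*(.+)')[0]
x['Product'] = extracted.fillna(x['Product'])

print(x['Product'].tolist())  # ['buns', 'people', 'help me']
```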
Use str.split and get the last item:
df['Products'] = df['Products'].str.split(': ').str[-1]
Out[9]:
Products
0 buns
1 people
2 help me
Try this
df['Products'] = df.Products.apply(lambda x: x.split(': ')[-1] if ':' in x else x)
print(df)
Output:
Products
0 buns
1 people
2 help me
I would like to know how to convert the first letter of each word in this column:
Test
There is a cat UNDER the table
The pen is working WELL.
Into lower case, in order to have
Test
there is a cat uNDER the table
the pen is working wELL.
I know there is capitalize() but I would need a function which does the opposite.
Many thanks
Please note that the strings are within a column.
I don't believe there is a builtin for this, but I could be mistaken. It is, however, quite easy to do with a generator expression!
" ".join(i[0].lower()+i[1:] for i in line.split(" "))
Where line is each individual line.
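Applied to the column in the question, a minimal sketch (assuming the column is named Test, as above) looks like:

```python
import pandas as pd

df = pd.DataFrame({'Test': ['There is a cat UNDER the table',
                            'The pen is working WELL.']})

def decapitalize_words(line):
    # Lowercase only the first character of each word, leaving the rest intact
    return " ".join(i[0].lower() + i[1:] for i in line.split(" "))

df['Test'] = df['Test'].apply(decapitalize_words)
print(df['Test'].tolist())
# ['there is a cat uNDER the table', 'the pen is working wELL.']
```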
According to this solution you can do :
>>> func = lambda s: s[:1].lower() + s[1:] if s else ''
>>> sent = "There is a cat UNDER the table "
>>> res = " ".join(list(map(func , sent.split())))
>>> res
'there is a cat uNDER the table'
You can use .str.lower, .str.split and ' '.join:
s=df.Test.str.split()
df.Test=s.str[0].str.lower()+' '+s.str[1:].agg(' '.join)
Same as splitting the words with .str.split and then modifying with apply:
df.Test=df.Test.str.split().apply(lambda x: [x[0].lower()]+x[1:] ).agg(' '.join)
Both outputs:
df
Test
0 there is a cat UNDER the table
1 the pen is working WELL.
I have a field that looks like the following, and I want to remove the dashes:
field1
231-206-2222
231-206-2344
231-206-1111
231-206-1111
I tried regexing it but to no avail. I am new to this, so any ideas would help. Any suggestions?
It seems like a dataframe to me, if so try this:
df['field1'].apply(lambda x: x.replace("-",""))
There are many ways of doing it.
Demo:
import re
import pandas as pd

df = pd.DataFrame({'field1': ['123-456-999', '333-222-111']})
1) # re.sub replaces each hyphen with an empty string
df['field1'] = df['field1'].apply(lambda x: re.sub(r'-', '', x))
2) # \D+ matches one or more non-digits and removes them
df['field1'] = df['field1'].str.replace(r'\D+', '', regex=True)
3) # plain (non-regex) replacement of '-' with an empty string
df['field1'] = df['field1'].str.replace('-', '', regex=False)
Result:
field1
0 123456999
1 333222111
All,
To replace one string in one particular column I have done this and it worked fine:
dataUS['sec_type'].str.strip().str.replace("LOCAL","CORP")
I would now like to replace multiple strings, say ["LOCAL", "FOREIGN", "HELLO"], with the single string "CORP".
How can I make it work? The code below didn't work:
dataUS['sec_type'].str.strip().str.replace(["LOCAL", "FOREIGN", "HELLO"], "CORP")
You can perform this task by forming a |-separated string. This works because pd.Series.str.replace accepts regex:
Replace occurrences of pattern/regex in the Series/Index with some
other string. Equivalent to str.replace() or re.sub().
This avoids the need to create a dictionary.
import pandas as pd
df = pd.DataFrame({'A': ['LOCAL TEST', 'TEST FOREIGN', 'ANOTHER HELLO', 'NOTHING']})
pattern = '|'.join(['LOCAL', 'FOREIGN', 'HELLO'])
df['A'] = df['A'].str.replace(pattern, 'CORP', regex=True)
# A
# 0 CORP TEST
# 1 TEST CORP
# 2 ANOTHER CORP
# 3 NOTHING
replace can accept a dict, so we just create a dict for the values that need to be replaced:
dataUS['sec_type'].str.strip().replace(dict(zip(["LOCAL", "FOREIGN", "HELLO"], ["CORP"]*3)),regex=True)
The dict looks like this:
dict(zip(["LOCAL", "FOREIGN", "HELLO"], ["CORP"]*3))
Out[585]: {'FOREIGN': 'CORP', 'HELLO': 'CORP', 'LOCAL': 'CORP'}
The reason why you receive the error is that str.replace is different from replace.
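Put together, a self-contained sketch of the dict-based replace (with toy data, including one value that should stay unchanged):

```python
import pandas as pd

dataUS = pd.DataFrame({'sec_type': [' LOCAL ', 'FOREIGN', 'HELLO', 'BOND']})

mapping = dict(zip(["LOCAL", "FOREIGN", "HELLO"], ["CORP"] * 3))
# Series.replace (not str.replace) accepts a dict; regex=True makes it
# match the keys as patterns, including as substrings
result = dataUS['sec_type'].str.strip().replace(mapping, regex=True)
print(result.tolist())  # ['CORP', 'CORP', 'CORP', 'BOND']
```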
The answer of @Rakesh is very neat, but does not allow for substrings. With a small change, however, it does.
1) Use a replacement dictionary, because it makes the solution much more generic.
2) Add the keyword argument regex=True to Series.replace() (not Series.str.replace). This does two things: it switches to regex replacement, which is much more powerful, though you will have to escape special characters, so beware of that. Secondly, it makes the replacement work on substrings instead of only exact full-string matches. Which is really cool!
replacement = {
"LOCAL": "CORP",
"FOREIGN": "CORP",
"HELLO": "CORP"
}
dataUS['sec_type'].replace(replacement, regex=True)
Full code example
dataUS = pd.DataFrame({'sec_type': ['LOCAL', 'Sample text LOCAL', 'Sample text LOCAL sample FOREIGN']})
replacement = {
"LOCAL": "CORP",
"FOREIGN": "CORP",
"HELLO": "CORP"
}
dataUS['sec_type'].replace(replacement, regex=True)
Output
0 CORP
1 CORP
2 Sample text CORP
3 Sample text CORP sample CORP
Name: sec_type, dtype: object
@JJP's answer is a good one if you have a long list. But if you have just two or three values, you can simply use '|' within the pattern. Make sure to add the regex=True parameter.
Clearly .str.strip() is not a requirement, but it is good practice.
import pandas as pd
df = pd.DataFrame({'A': ['LOCAL TEST', 'TEST FOREIGN', 'ANOTHER HELLO', 'NOTHING']})
df['A'] = df['A'].str.strip().str.replace("LOCAL|FOREIGN|HELLO", "CORP", regex=True)
output
A
0 CORP TEST
1 TEST CORP
2 ANOTHER CORP
3 NOTHING
Function to replace multiple values in a pandas Series:
def replace_values(series, to_replace, value):
for i in to_replace:
series = series.str.replace(i, value)
return series
Hope this helps someone
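A quick usage sketch of the helper above with toy data, with regex=False made explicit since the values here are literal substrings (the default for str.replace changed across pandas versions):

```python
import pandas as pd

def replace_values(series, to_replace, value):
    # Apply str.replace once per pattern; each call returns a new Series
    for i in to_replace:
        series = series.str.replace(i, value, regex=False)
    return series

s = pd.Series(['LOCAL A', 'B FOREIGN', 'HELLO C', 'D'])
print(replace_values(s, ['LOCAL', 'FOREIGN', 'HELLO'], 'CORP').tolist())
# ['CORP A', 'B CORP', 'CORP C', 'D']
```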
Try:
dataUS.replace({"sec_type": { 'LOCAL' : "CORP", 'FOREIGN' : "CORP"}})
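One thing to note, shown here with a toy frame: without regex=True, DataFrame.replace with a nested dict only replaces cells whose value matches exactly, so substrings are left alone:

```python
import pandas as pd

dataUS = pd.DataFrame({'sec_type': ['LOCAL', 'FOREIGN', 'Sample LOCAL']})

# Exact-match replacement: 'Sample LOCAL' is not touched
out = dataUS.replace({"sec_type": {'LOCAL': "CORP", 'FOREIGN': "CORP"}})
print(out['sec_type'].tolist())  # ['CORP', 'CORP', 'Sample LOCAL']
```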