I have a dataframe, and a lot of the values in one of the columns contain characters that are awkward in Python, like &.
I wanted to build a dictionary and then loop through it doing find-and-replace,
sort of like this:
replacements = {
    " ": "",
    "&": "and",
    "/": "",
    "+": "plus",
    "(": "",
    ")": "",
}
df['VariableName'] = df['VariableName'].replace(replacements, regex=True)
However, this raises the following error:
error: nothing to repeat at position 0
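The error happens because replace(..., regex=True) treats each dictionary key as a regular expression, and + (like ( and )) is a regex metacharacter. A minimal standalone sketch that reproduces the same message:

```python
import re

# "+" is a regex quantifier with nothing before it, so the pattern is invalid
try:
    re.compile("+")
except re.error as exc:
    msg = str(exc)

print(msg)  # nothing to repeat at position 0
```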
I think you need to escape the special regex characters, which you can do in a dictionary comprehension:
import re
import pandas as pd

df = pd.DataFrame({'VariableName': ['ss dd +', '(aa)']})
replacements = {re.escape(k): v for k, v in replacements.items()}
df['VariableName'] = df['VariableName'].replace(replacements, regex=True)
print(df)
VariableName
0 ssddplus
1 aa
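Alternatively, since every key in the dictionary is meant as a literal substring, you can sidestep escaping entirely by looping with str.replace and regex=False. A sketch of the same cleanup, assuming the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'VariableName': ['ss dd +', '(aa)']})
replacements = {" ": "", "&": "and", "/": "", "+": "plus", "(": "", ")": ""}

# regex=False treats each key as a plain substring, so no escaping is needed
for old, new in replacements.items():
    df['VariableName'] = df['VariableName'].str.replace(old, new, regex=False)

print(df)
```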
I am trying to strip all the special characters from a pandas DataFrame column of words with the strip() and replace() functions.
However, it does not work: the special characters are not stripped from the words.
Can somebody enlighten me, please?
import pandas as pd
import datetime

df = pd.read_csv("2022-12-08_word_selection.csv")
for n in df.index:
    i = str(df.loc[n, "words"])
    if len(i) > 12:
        df.loc[n, "words"] = ""
df["words"] = df["words"].str.replace("$", "s")
df["words"] = df["words"].str.strip('[,:."*+-#/\^`#}{~&%’àáâæ¢ß¥£™©®ª×÷±²³¼½¾µ¿¶·¸º°¯§…¤¦≠¬ˆ¨‰øœšÞùúûý€')
df["words"] = df["words"].str.strip("\n")
df = df.groupby(["words"]).mean()
print(df)
Firstly, the program replaces all words in the "words" column longer than 12 characters with an empty string. Then, I was hoping it would strip all the special characters from the "words" column.
First, avoid using a loop and instead use transform() to replace words longer than 12 characters with an empty string. Second, the Series.str conversion is not necessary prior to calling replace(). Third, strip() only removes leading and trailing characters, so it is not what you want; use a regular expression with replace() instead. Finally, to remove special characters, it is cleaner to use a regex negative set to match and remove only the characters that are not letters or numbers. That looks like: "[^A-Za-z0-9]".
Here is some example data and code that works:
import pandas as pd
import re

df = pd.DataFrame(
    {
        "words": [
            123,
            "abcd",
            "efgh",
            "abcdefghijklmn",
            "lol%",
            "Hornbæk",
            "10:03",
            "$999¼",
        ]
    }
)
# Faster and more concise than a loop; str() guards against non-string values like 123
df["words"] = df["words"].transform(lambda x: "" if len(str(x)) > 12 else x)
# Not sure why you do this but okay
df["words"] = df["words"].replace("$", "s")
# Use a regex negative set to keep only letters and numbers
df["words"] = df["words"].replace(re.compile("[^A-Za-z0-9]"), "")
display(df)
outputs:
words
0 123
1 abcd
2 efgh
3
4 lol
5 Hornbk
6 1003
7 999
I am new to pandas and I have an issue with strings. I have the string s = "'hi'+'bikes'-'cars'>=20+'rangers'" and I want only the words from the string, not the symbols or the integers. How can I do it?
My input:
s = "'hi'+'bikes'-'cars'>=20+'rangers'"
Expected Output:
s = "'hi','bikes','cars','rangers'"
Try this using a regex:
import re

s = "'hi'+'bikes'-'cars'>=20+'rangers'"
samp = re.compile('[a-zA-Z]+')
words = samp.findall(s)
Not sure about pandas, but you can also do it with a regex; here is the solution:
import re

s = "'hi'+'bikes'-'cars'>=20+'rangers'"
words = re.findall(r"('.+?')", s)
output = ','.join(words)
print(output)
For pandas I would convert the column in the dataframe to string first:
df
a b
0 'hi'+'bikes'-'cars'>=20+'rangers' 1
1 random_string 'with'+random,# 4
2 more,weird/stuff=wrong 6
df["a"] = df["a"].astype("string")
df["a"]
0 'hi'+'bikes'-'cars'>=20+'rangers'
1 random_string 'with'+random,#
2 more,weird/stuff=wrong
Name: a, dtype: string
Now you can see that the dtype is string, which means you can do string operations on it,
including translate and split (pandas string methods). But first you have to build a translation table, with punctuation and digits imported from the string module:
from string import digits, punctuation
Then make a dictionary mapping each digit and punctuation character to a space:
from itertools import chain
t = {k: " " for k in chain(punctuation, digits)}
Create the translation table using str.maketrans (a built-in str method, so no import is necessary), and apply translate and split (via the .str accessor) to the column:
t = str.maketrans(t)
df["a"] = df["a"].str.translate(t).str.split()
df
a b
0 [hi, bikes, cars, rangers] 1
1 [random, string, with, random] 4
2 [more, weird, stuff, wrong] 6
As you can see you only have the words now.
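Putting the pieces above together, a self-contained version of this approach might look like (using the same example data):

```python
from itertools import chain
from string import digits, punctuation

import pandas as pd

df = pd.DataFrame({"a": ["'hi'+'bikes'-'cars'>=20+'rangers'",
                         "random_string 'with'+random,#",
                         "more,weird/stuff=wrong"]})

# Map every punctuation character and digit to a space, then split on whitespace
t = str.maketrans({k: " " for k in chain(punctuation, digits)})
df["a"] = df["a"].astype("string").str.translate(t).str.split()
print(df["a"].tolist())
```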
I have a string which includes ":". It looks like this:
: SHOES
I want to split on the colon and then make a variable that contains only "SHOES".
I have split them using df.split(':'), but how should I create a variable with "SHOES" only?
You can use list indexing, and then lstrip and rstrip to remove excess spaces before and after the word.
df=": shoes"
d=df.split(":")[-1].lstrip().rstrip()
print(d)
You can use the apply() method to loop over the whole column and split each value with split().
This is an example:
import pandas as pd
df=pd.DataFrame({'A':[':abd', ':cda', ':vfe', ':brg']})
# First, create a new column: df['new_column']
# Second, loop over the column with apply
# Third, execute a lambda with split(), keeping only the text after ':'
df['new_column'] = df['A'].apply(lambda x: x.split(':')[1])
df
A new_column
0 :abd abd
1 :cda cda
2 :vfe vfe
3 :brg brg
If your original strings always start with ": " then you could just remove the first two characters using:
myString[2:]
Here is a small working sample. Both stripValue[1] and newString hold the same value; it is a matter of cleaner vs. more verbose code:
# set the initial string
myString = "string : value"
# split it, which returns a list indexed [0, 1, 2, ...]
stripValue = myString.split(":")
# you can create a new var with the value you want/need from the list
newString = stripValue[1]
# or you can shorthand it
print(stripValue[1])
# calling the new var
print(newString)
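Another standard-library option, as a small sketch: str.partition splits on the first occurrence of the separator and always returns three parts, so there is no list indexing to think about.

```python
# partition returns (before, separator, after); take the part after ":"
my_string = ": SHOES"
value = my_string.partition(":")[2].strip()
print(value)  # SHOES
```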
I have a dictionary with keys that look like this:
['NAME', 'ID', 'COURSE', 'DUE', 'SUBMITTED', 'MINUTESLATE', 'LATEDEDUCTION', 'P1', 'P1COMMENTS', 'P2', 'P2COMMENTS', 'SUBTOTAL', 'TOTAL']
My goal is to go through a file and replace occurrences of these keys with values that I've read in from another file. For instance:
Problem 1: <<P1>>/35 <<P1COMMENTS>>
would be replaced with something like:
Problem 1: 30/35 comment
However, I'm having issues with doing this, as the keys can be overlapping. I wrote this method using some code that I looked up previously for another assignment:
def replace_all(text, dic):
    for i, j in dic.items():
        text = text.replace(i, j)
    return text
However, this is the first time I've had overlapping keys in my dictionary, so I'm having a tough time tweaking this method to work properly. Currently, this is what my output looks like:
Problem 1: 30/35 30COMMENTS
Any ideas on a better way to approach this problem?
You could use re.sub() to find each << key >> and then replace it with the corresponding value from the dictionary.
import re
dct = {
'P1': 30,
'P1COMMENTS': 'comment'
}
print(dct)
s = 'Problem 1: <<P1>>/35 <<P1COMMENTS>>'
s = re.sub(r'<<(.*?)>>', lambda x: str(dct[x.group(1)]), s)
print(s)
Output:
Problem 1: 30/35 comment
Explanation:
<<(.*?)>>:
<< // matches <<
( // start of group 1
.*? // matches any number (0 or more) of characters (lazy)
) // end of group 1
>> // matches >>
re.sub() takes a pattern, a replacement value, and a string, and replaces any pattern matches in the string with the replacement value. The function:
lambda x: str(dct[x.group(1)])
will look up the match in the dictionary and return the value for the key.
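If the file might contain placeholders that are missing from the dictionary (an assumption on my part, not part of the original question), a small variant is to use dict.get so unknown keys are left as-is instead of raising a KeyError:

```python
import re

dct = {'P1': 30, 'P1COMMENTS': 'comment'}
s = 'Problem 1: <<P1>>/35 <<P2>>'

# Fall back to the full match (m.group(0)) when the key is unknown
result = re.sub(r'<<(.*?)>>', lambda m: str(dct.get(m.group(1), m.group(0))), s)
print(result)  # Problem 1: 30/35 <<P2>>
```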
What you need is a customizable template engine. Luckily, Python ships with string.Template.
import string

class CustomTemplate(string.Template):
    # string.Template's substitute() expects groups named 'named', 'braced',
    # 'escaped' and 'invalid' to exist in the pattern; the second branch can
    # never match, so only <<NAME>> is recognized
    pattern = r'''
        <<(?P<named>[^>]+)>>
        | (?P<braced>(?!))(?P<escaped>(?!))(?P<invalid>(?!))
    '''

template = '<<FOO>> 123456 <<FOOBAR>>tail'
print(CustomTemplate(template).substitute(
    FOO='foo_content',
    FOOBAR='foobar_stuff',
))
Output: foo_content 123456 foobar_stufftail
Say I've got a column in my Pandas Dataframe that looks like this:
s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])
I would like to use this column for fuzzy matching, and therefore I want to remove the characters '.', '/', and '-', but only at the end of each string, so it looks like this:
s = pd.Series(["ab-cd", "abc", "abc-def", "ab.cde", "abcd"])
So far I started out simply: instead of generating a list of characters to remove, I just repeated commands for different characters, like:
if s.str[-1] == '.':
    s.str[-1].replace('.', '')
But this simply produces an error. How do I get the result I want, i.e. strings without those characters at the end (characters in the rest of the string need to be preserved)?
replace() with a regex will get you the output:
s.replace(r'[./-]$', '', regex=True)
or, with the help of apply(), in case you are looking for an alternative:
s.apply(lambda x: x[:-1] if x[-1] in './-' else x)
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
You can use str.replace with a regex:
>>> s = pd.Series(["ab-cd.", "abc", "abc-def/", "ab.cde", "abcd-"])
>>> s.str.replace(r"\.$|/$|\-$", "", regex=True)
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
>>>
which can be reduced to this:
>>> s.str.replace(r"[./-]$", "", regex=True)
0 ab-cd
1 abc
2 abc-def
3 ab.cde
4 abcd
dtype: object
>>>
You can use str.replace with a regular expression
s.str.replace(r'[./-]$', '', regex=True)
Inside [./-], substitute any characters you want to replace; $ means the match must be at the end of the string.
To replace "in-place" use Series.replace
s.replace(r'[./-]$','', inplace=True, regex=True)
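One caveat worth noting: Series.str.rstrip("./-") looks like an alternative here, but it removes every trailing character from the set, while the anchored [./-]$ regex removes exactly one. A sketch of the difference:

```python
import pandas as pd

s = pd.Series(["ab-cd.", "abc-def/", "abcd--"])

# rstrip strips all trailing characters found in the set
stripped = s.str.rstrip("./-")
# the anchored regex removes at most one trailing character
replaced = s.str.replace(r"[./-]$", "", regex=True)

print(stripped.tolist())   # ['ab-cd', 'abc-def', 'abcd']
print(replaced.tolist())   # ['ab-cd', 'abc-def', 'abcd-']
```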
I was able to remove characters from the end of strings in a column in a pandas DataFrame with the following line of code:
s.replace(r'[./-]$', '', regex=True)
where everything between the brackets ([./-]) indicates a character to be removed, and $ anchors the match to the end of the string.