I have a Pandas DataFrame (df) where some of the words contain encoding replacement characters. I want to replace these words with replacement words from a dictionary (translations).
translations = {'gr�nn': 'gronn', 'm�nst': 'menst'}
df = pd.DataFrame(["gr�nn Y", "One gr�nn", "Y m�nst/line X"])
df.replace(translations, regex=True, inplace=True)
However, it doesn't seem to capture all the instances.
Current output:
0
0 gronn Y
1 One gr�nn
2 Y m�nst/line X
Do I need to specify any regex patterns to enable the replacement to also capture partial words within a string?
Expected output:
0
0 gronn Y
1 One gronn
2 Y menst/line X
Turn your translations into regex find/replace strings:
translations = {r'(.*)gr�nn(.*)': r'\1gronn\2', r'(.*)m�nst(.*)': r'\1menst\2'}
df = pd.DataFrame(["gr�nn Y", "One gr�nn", "Y m�nst/line X"])
df.replace(translations, regex=True)
Returns:
0
0 gronn Y
1 One gronn
2 Y menst/line X
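For reference, a loop over Series.str.replace also works, since it substitutes every occurrence inside each string (a minimal sketch, assuming the strings live in column 0):
for bad, good in translations.items():
    # plain substring replacement; no regex needed
    df[0] = df[0].str.replace(bad, good, regex=False)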
I want to improve the performance of a loop that counts word occurrences in text; it currently takes around 5 minutes for just 5 records.
DataFrame
No Text
1 I love you forever...*500 other words
2 No , i know that you know xxx *100 words
My word list
wordlist = ['i', 'love', 'David', 'Mary', ...]
My code to count words:
for i in wordlist:
    df[i] = df['Text'].str.count(i)
Result :
No Text I love other_words
1 I love you ... 1 1 4
2 No, i know ... 1 0 5
You can do this by making a Counter from the words in each Text value, converting that into columns (via pd.Series), summing the columns that aren't in wordlist into other_words, and then dropping those columns:
import re
import pandas as pd
from collections import Counter
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(re.findall(r'\b[a-z]+\b', t.lower())))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
other_words = list(set(df.columns) - set(wordlist) - {'No', 'Text'})
df['other_words'] = df[other_words].sum(axis=1)
df = df.drop(other_words, axis=1)
Output (for the sample data in your question):
No Text i love other_words
0 1 I love you forever... other words 1 1 4
1 2 No , i know that you know xxx words 1 0 7
Note:
I've converted all the words to lower-case so you're not counting I and i separately.
I've used re.findall rather than the more obvious split() so that forever... gets counted as the word forever rather than forever...
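A quick illustration of the difference on a made-up string:
import re
re.findall(r'\b[a-z]+\b', 'i love you forever...')  # ['i', 'love', 'you', 'forever']
'i love you forever...'.split()                     # ['i', 'love', 'you', 'forever...']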
If you only want to count the words in wordlist (and don't want an other_words count), you can simplify this to:
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
Output:
No Text i love
0 1 I love you forever... other words 1 1
1 2 No , i know that you know xxx words 1 0
Another way of also generating the other_words value is to generate two sets of counters: one of all the words, and one only of the words in wordlist. These can then be subtracted from each other to find the count of words in the text that are not in the wordlist:
wordlist = list(map(str.lower, wordlist))
counters = df['Text'].apply(lambda t: Counter(w for w in re.findall(r'\b[a-z]+\b', t.lower()) if w in wordlist))
df = pd.concat([df, counters.apply(pd.Series).fillna(0).astype(int)], axis=1)
c2 = df['Text'].apply(lambda t: Counter(re.findall(r'\b[a-z]+\b', t.lower())))
df['other_words'] = (c2 - counters).apply(lambda d: sum(d.values()))
Output of this is the same as for the first code sample. Note that in Python 3.10 and later, you should be able to use the new Counter.total method:
(c2 - counters).apply(Counter.total)
As an alternative, you could try this:
counts = (df['Text'].str.lower().str.findall(r'\b[a-z]+\b')
          .apply(lambda x: pd.Series(x).value_counts())
          .filter(map(str.lower, wordlist)).fillna(0))
df[counts.columns] = counts
print(df)
'''
   No                                 Text    i  love
0   1    I love you forever... other words  1.0   1.0
1   2  No , i know that you know xxx words  1.0   0.0
'''
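Note that .fillna(0) leaves the counts as floats (1.0, 0.0); if you prefer integer counts, you can append .astype(int) to the chain.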
In Pandas (df), I have a column with the following strings. I'm looking to pad with a 0 when the number within the string is < 100.
Freq
XXX100KHz
XYC200KHz
YYY80KHz
YYY50KHz
to:
Freq
XXX100KHz
XYC200KHz
YYY080KHz
YYY050KHz
The following doesn't work, because \1 followed by a literal 0 is parsed as the group reference \10, which doesn't exist:
df.replace({'Freq': r'^([A-Za-z]+)(\d\d[A-Za-z]*)$'}, {'Freq': r'\10\2'}, regex=True, inplace=True)
Try:
df["Freq"] = df["Freq"].str.replace(
r"(?<=\D)\d{1,2}(?=KHz)",
lambda g: "{:0>3}".format(g.group()),
regex=True,
)
print(df)
Prints:
Freq
0 XXX100KHz
1 XYC200KHz
2 YYY080KHz
3 YYY050KHz
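For what it's worth, your original df.replace approach can also be rescued by writing the ambiguous \10 with Python's unambiguous group syntax \g<1> (a sketch using your original pattern):
df.replace({'Freq': r'^([A-Za-z]+)(\d\d[A-Za-z]*)$'},
           {'Freq': r'\g<1>0\2'}, regex=True, inplace=True)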
dfcolumn = ['PUEF2CarmenXFc034DpEd', 'PUEF2BalulanFc034CamH', 'CARF1BalulanFc013Baca', ...]
My output should be:
dfnewcolumn1 = [PUEF2, PUEF2, CARF1]
dfnewcolumn2 = [CarmenXFc034DpEd, BalulanFc034CamH, BalulanFc013Baca]
Assuming your split criterion is a fixed number of characters (e.g. 5 here), you can use:
df['dfnewcolumn1'] = df['dfcolumn'].str[:5]
df['dfnewcolumn2'] = df['dfcolumn'].str[5:]
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
If your split criterion is the first digit in the string, you can use:
# str.split drops the matched separator, so each split yields one usable
# column plus one throwaway column (dfnewcolumnX):
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Using the following modified original data with more test cases:
dfcolumn
0 PUEF2CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH
2 CARF1BalulanFc013Baca
3 CAF1BalulanFc013Baca
4 PUEFA2BalulanFc034CamH
Running the same code:
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
3 CAF1BalulanFc013Baca CAF1 BalulanFc013Baca
4 PUEFA2BalulanFc034CamH PUEFA2 BalulanFc034CamH
Assuming your prefix consists of a sequence of letters followed by a sequence of digits, both of variable length, a regex split function can be constructed and applied to each cell.
Solution
import pandas as pd
import re

# data
df = pd.DataFrame()
df["dfcolumn"] = ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]

def f_split(s: str):
    """Split into two parts by regex."""
    # letter(s) followed by digit(s), then the remainder
    o = re.match(r"^([A-Za-z]+\d+)(.*)$", s)
    # may add exception handling here if there is no match
    return o.group(1), o.group(2)

df[["dfnewcolumn1", "dfnewcolumn2"]] = df["dfcolumn"].apply(f_split).to_list()
Note the .to_list(), which converts the Series of tuples into a list of tuples; this is required for the two-column assignment to work.
Result
print(df)
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
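If you'd rather avoid a Python-level function, the same pattern should also work with str.extract, which returns one column per capturing group (a sketch under the same prefix assumption):
df[["dfnewcolumn1", "dfnewcolumn2"]] = df["dfcolumn"].str.extract(r"^([A-Za-z]+\d+)(.*)$")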
How about this compact solution:
import pandas as pd
df = pd.DataFrame({"original": ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]})
df2 = pd.DataFrame(df.original.str.split(r"(\d)", n=1).to_list(), columns=["part1", "separator", "part2"])
df2.part1 = df2.part1 + df2.separator.astype(str)
df2
part1 separator part2
0 PUEF2 2 CarmenXFc034DpEd
1 PUEF2 2 BalulanFc034CamH
2 CARF1 1 BalulanFc013Baca
I use:
- Series.str.split with a regex pattern and the n=1 kwarg to specify that it should only split on the first match
- in the regex pattern, a group (the parentheses in (\d)) to capture the separating character
- to_list() to output the split as a list of lists
- the DataFrame constructor to build a new DataFrame from that list
- string concatenation of the two columns
I have used re.search to get strings of uniqueID from larger strings.
Example:
import re
string = 'example string with this uniqueID: 300-350'
combination = r'(\d+)[-](\d+)'
m = re.search(combination, string)
print (m.group(0))
Out: '300-350'
I have created a dataframe with the UniqueID and the Combination as columns.
uniqueID combinations
0 300-350 (\d+)[-](\d+)
1 off-250 (\w+)[-](\d+)
2 on-stab (\w+)[-](\w+)
And a dictionary meaning_combination relating the combination with the variable meaning it represents:
meaning_combination={'(\\d+)[-](\\d+)': 'A-B',
'(\\w+)[-](\\d+)': 'C-A',
'(\\w+)[-](\\w+)': 'C-D'}
I want to create new columns for each variable (A, B, C, D) and fill them with their corresponding values.
The final result should look like this:
uniqueID combinations A B C D
0 300-350 (\d+)[-](\d+) 300 350
1 off-250 (\w+)[-](\d+) 250 off
2 on-stab (\w+)[-](\w+) stab on
I would fix your regexes to:
meaning_combination={'(\d+-\d+)': 'A-B',
'([^0-9\W]+\-\d+)': 'C-A',
'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}
This captures the full match as a single group instead of three capturing groups.
I.e. ('300-350', '300', '350') --> ('300-350',)
You don't need the two extra capturing groups: if a specific pattern matches, you already know where the word and digit characters sit (based on how you defined the pattern), and you can split on '-' to access them individually.
I.e.:
s = 'example string with this uniqueID: 300-350'
values = re.findall(r'(\d+-\d+)', s)
>>> ['300-350']
# first digit chunk:
values[0].split('-')[0]
>>> '300'
If you use this approach, you can loop over the dictionary keys and your list of strings, testing whether each pattern matches the string (len(re.findall(pattern, string)) != 0). When it does, grab the corresponding dictionary value, split both it and the match on '-', and in a new dictionary built inside the loop assign dictionary_value.split('-')[0]: match[0].split('-')[0] and dictionary_value.split('-')[1]: match[0].split('-')[1]; also assign the full match to 'Unique ID' and the matched pattern to 'Combination'. Then use pandas to make a DataFrame.
Altogether:
import re
import pandas as pd
stri = ['example string with this uniqueID: 300-350', 'example string with this uniqueID: off-250', 'example string with this uniqueID: on-stab']
meaning_combination = {r'(\d+-\d+)': 'A-B',
                       r'([^0-9\W]+\-\d+)': 'C-A',
                       r'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}
values = [{'Unique ID': re.findall(x, st)[0],
           'Combination': x,
           y.split('-')[0]: re.findall(x, st)[0].split('-')[0],
           y.split('-')[1]: re.findall(x, st)[0].split('-')[1]}
          for st in stri
          for x, y in meaning_combination.items()
          if len(re.findall(x, st)) != 0]
df = pd.DataFrame.from_dict(values)
# just to sort the columns in order, since the default is alphabetical
col_val = ['Unique ID', 'Combination', 'A', 'B', 'C', 'D']
df = df.reindex(sorted(df.columns, key=lambda x: col_val.index(x)), axis=1)
print(df)
output:
Unique ID Combination A B C D
0 300-350 (\d+-\d+) 300 350 NaN NaN
1 off-250 ([^0-9\W]+\-\d+) 250 NaN off NaN
2 on-stab ([^0-9\W]+\-[^0-9\W]+) NaN NaN on stab
Also note: I think you have a typo in your expected output, because you have:
'(\\w+)[-](\\d+)': 'C-A'
which would match off-250, but in your final result you have:
uniqueID combinations A B C D
1 off-250 (\w+)[-](\d+) 250 off
when, based on your key, these values should be in C and A.
Can anyone help me understand this piece of code?
def remove_digit(data):
    # drop every digit character from the string
    newData = ''.join([i for i in data if not i.isdigit()])
    # cut the string at the first '(' if one is present
    i = newData.find('(')
    if i > -1:
        newData = newData[:i]
    # trim surrounding whitespace
    return newData.strip()
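For example, with a made-up input, remove_digit('abc123 (note) 45') returns 'abc'.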
Why don't you use a regex? The character class [0-9()] matches the digits 0-9 and the characters ( and ):
import re
newData = re.sub(r'[0-9()]', '', data)
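Note that this only deletes those characters, whereas the original function also truncates everything after the first '('. A closer regex-based equivalent might be this sketch:
import re

def remove_digit(data):
    # strip digits, cut at the first '(' (if any), trim whitespace
    return re.sub(r'\d', '', data).split('(')[0].strip()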
Given this df:
data
0 a43
1 b((
2 cr3r3
3 d
You can remove digits and parenthesis from the column in this way:
df['data'] = df['data'].str.replace(r'\d|\(|\)', '', regex=True)
Output:
data
0 a
1 b
2 crr
3 d