I have a dataframe with a column called "Utterances", which contains strings (e.g. its first row is "I wanna have a beer").
What I need is to create a new dataframe that contains, for every row of "Utterances", the position in the alphabet of each of its letters.
For example, in the case of "I wanna have a beer", I need to get the following row: 9 23114141 81225 1 25518, since "I" is the 9th letter of the alphabet, "w" the 23rd, and so on. Notice that I want the spaces (" ") to be maintained.
What I have done so far is the following:
new = []
for word in df2[['Utterances']]:
    for character in word:
        new.append(ord(character.lower()) - 96)
str1 = ''.join(str(e) for e in new)
The above returns the concatenated string. However, the loop only iterates once, and the string returned in str1 does not contain the required spaces (" "). And of course, I cannot find a way to append these lines to a new dataframe.
Any help would be greatly appreciated.
Thanks.
You can do
In [5572]: df
Out[5572]:
Utterances
0 I wanna have a beer
In [5573]: df['Utterances'].apply(lambda x: ' '.join([''.join(str(ord(c)-96) for c in w)
      ...:                                            for w in x.lower().split()]))
Out[5573]:
0 9 23114141 81225 1 25518
Name: Utterances, dtype: object
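If you also want to keep the result in the dataframe (the "append these lines into a new dataframe" part of the question), you can assign the output of apply to a new column. A minimal sketch, where the column name Numbers is just a choice made here:

import pandas as pd

df = pd.DataFrame({'Utterances': ['I wanna have a beer']})

# each word becomes the concatenated alphabet positions of its letters;
# ' '.join restores the spaces between words
df['Numbers'] = df['Utterances'].apply(
    lambda x: ' '.join(''.join(str(ord(c) - 96) for c in w)
                       for w in x.lower().split()))
print(df)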
new = []
for word in ['I ab c def']:
    for character in word:
        if character == ' ':
            new.append(' ')
        else:
            new.append(ord(character.lower()) - 96)
str1 = ''.join(str(e) for e in new)
print(str1)
Output
9 12 3 456
Let's use a dictionary and get, which works if the strings contain only letters and spaces, i.e.:
import string

dic = {j: i + 1 for i, j in enumerate(string.ascii_lowercase)}
dic[' '] = ' '
df['new'] = df['Ut'].apply(lambda x: ''.join([str(dic.get(i)) for i in str(x).lower()]))
Output:

                    Ut                       new
0  I wanna have a beer  9 23114141 81225 1 25518
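One caveat with dic.get(i): any character that is neither a lowercase letter nor a space (a digit, punctuation, etc.) returns None, which str() turns into the literal text 'None'. A minimal sketch of a safer variant (my adjustment, not part of the answer above) that keeps unknown characters as they are:

import string
import pandas as pd

dic = {j: str(i + 1) for i, j in enumerate(string.ascii_lowercase)}
dic[' '] = ' '

df = pd.DataFrame({'Ut': ['I wanna have a beer!']})
# fall back to the character itself when it has no mapping (e.g. '!')
df['new'] = df['Ut'].apply(lambda x: ''.join(dic.get(c, c) for c in x.lower()))
print(df['new'][0])   # 9 23114141 81225 1 25518!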
Related
Hi guys, I have a problem. I wrote a Twitter scraper for my thesis in order to obtain some texts and hashtags to process. The problem is the following: in the hashtag column, all the rows look like this:
['covid19', 'croazia', 'slovenia']
Now, in order to cluster this text data, I want to join each row into a single string, so as to have something like this:
covid19 croazia slovenia
Since these hashtags are in a pandas column called "Hashtag", I used this line of code to do what I want:
df["Hashtag_united"] = df["Hashtag"].apply(lambda x: " ".join(x))
But this way I didn't get the rows I expected as written above; instead I got:
[ ' c o v i d 1 9 ' , ' c r o a z i a ' , ' s l o v e n i a ' ]
What do I have to do in order to obtain what I want? Thank you for the time you spent on me.
I apologize for the stupid question. Have a good day!
Since you have "['covid19', 'croazia', 'slovenia']" in your Hashtag column, you can use:
import ast
df["Hashtag_united"] = df["Hashtag"].apply(lambda x: " ".join(ast.literal_eval(x)))
The ast.literal_eval(x) call parses the stringified list back into an actual list of strings, and " ".join(...) then makes a single string out of it.
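A minimal sketch of what happens at each step, using the example row from the question:

import ast

x = "['covid19', 'croazia', 'slovenia']"   # the cell holds a string, not a list
tags = ast.literal_eval(x)                 # -> ['covid19', 'croazia', 'slovenia']
print(' '.join(tags))                      # -> covid19 croazia slovenia

Joining the raw string instead of the parsed list is what produced the spaced-out characters: a string is itself an iterable of single characters, so " ".join(x) inserts a space between every character.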
I have two things that I would like to replace in my text files.
Add spaces between the letters of strings ending with '#' (e.g. ABC# becomes A B C)
Ignore strings ending with 'H' or matching 'xx:xx:xx' (e.g. 1111H is ignored), but spell out plain numbers (e.g. 1111 becomes 'ONE ONE ONE ONE')
So far this is my code:
import os
import re

dest1 = r"C:\Users\CL\Desktop\Folder"
files = os.listdir(dest1)

# dictionary to map digits to words
numbers = {"0": "ZERO ", "1": "ONE ", "2": "TWO ", "3": "THREE ", "4": "FOUR ",
           "5": "FIVE ", "6": "SIX ", "7": "SEVEN ", "8": "EIGHT ", "9": "NINE "}

for f in files:
    text = open(dest1 + "\\" + f, "r")
    text_read = text.read()
    text.close()
    # num sub pattern
    result = re.sub('[%s]\s?' % ''.join(numbers), lambda x: numbers[x.group().strip()] + ' ', text_read)
    # write result back to the file
    out = open(dest1 + "\\" + f, "w")
    out.write(result)
    out.close()
sample .txt
1111H I have 11 ABC# apples
11:12:00 I went to my# room
output required
1111H I have ONE ONE A B C apples
11:12:00 I went to M Y room
Also, I realized that when I write the new result, the formatting gets 'messy' without the line breaks. Not sure why.
#current output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES
ONE ONE ONE TWO H - I WENT TO MY# ROOM
#overwritten output
ONE ONE ONE ONE H - I HAVE ONE ONE ABC# APPLES ONE ONE ONE TWO H - I WENT TO MY# ROOM
You can use
def process_match(x):
    if x.group(1):
        return " ".join(x.group(1).upper())
    elif x.group(2):
        return numbers[x.group(2)]
    else:
        return x.group()

print(re.sub(r'\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b|\b([A-Za-z]+)#|([0-9])', process_match, text_read))
# => 1111H I have ONE ONE A B C apples
# 11:12:00 I went to M Y room
See the regex demo. The main idea behind this approach is to parse the string only once capturing or not parts of it, and process each match on the go, either returning it as is (if it was not captured) or converted chunks of text (if the text was captured).
Regex details:
\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b - a word boundary, and then either one or more digits and one or more uppercase letters, or three occurrences of colon-separated double digits, and then a word boundary
| - or
\b([A-Za-z]+)# - Group 1: words with # at the end: a word boundary, then one or more letters, and a #
| - or
([0-9]) - Group 2: an ASCII digit.
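To tie this back to the file loop in the question, here is a minimal sketch (my own arrangement, not part of the original answer) that applies the substitution to each file and writes the result to a separate output file; since re.sub leaves newlines untouched, the line breaks are preserved:

import os
import re

# digit-to-word mapping from the question
numbers = {"0": "ZERO ", "1": "ONE ", "2": "TWO ", "3": "THREE ", "4": "FOUR ",
           "5": "FIVE ", "6": "SIX ", "7": "SEVEN ", "8": "EIGHT ", "9": "NINE "}

pattern = re.compile(r'\b(?:\d+[A-Z]+|\d{2}:\d{2}:\d{2})\b|\b([A-Za-z]+)#|([0-9])')

def process_match(x):
    if x.group(1):                    # word ending in '#': space out its letters
        return " ".join(x.group(1).upper())
    elif x.group(2):                  # lone digit: spell it out
        return numbers[x.group(2)]
    return x.group()                  # 1111H / 11:12:00: leave as-is

dest1 = r"C:\Users\CL\Desktop\Folder"  # folder path from the question
for name in os.listdir(dest1):
    path = os.path.join(dest1, name)
    with open(path, "r") as fh:
        text_read = fh.read()
    # writing to a new '.out' file (a choice made here) keeps the source intact
    with open(path + ".out", "w") as fh:
        fh.write(pattern.sub(process_match, text_read))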
I have a bunch of strings in a pandas dataframe that contain numbers in them. I could run the code below and replace them all:
df.feature_col = df.feature_col.str.replace('\d+', ' NUM ')
But what I need to do is replace any 10 digit number with a string like masked_id, any 16 digit numbers with account_number, or any three-digit numbers with yet another string, and so on.
How do I go about doing this?
PS: since my data size is less, a less optimal way is also good enough for me.
Another way is replace with the option regex=True and a dictionary. You can also use somewhat more relaxed match patterns (applied in order) than Tim's:
# test data
df = pd.DataFrame({'feature_col': ['this has 1234567',
                                   'this has 1234',
                                   'this has 123',
                                   'this has none']})

# patterns in decreasing length order
# these of course would replace '12345' with 'ID45' :-)
df['feature_col'] = df.feature_col.replace({r'\d{7}': 'ID7',
                                            r'\d{4}': 'ID4',
                                            r'\d{3}': 'ID3'},
                                           regex=True)
Output:

     feature_col
0   this has ID7
1   this has ID4
2   this has ID3
3  this has none
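The partial-match caveat in the comment above ('12345' becoming 'ID45') can be avoided by anchoring each pattern with word boundaries, so that only digit runs of exactly the given length are replaced. A small sketch of that variant (same idea, stricter patterns):

import pandas as pd

df = pd.DataFrame({'feature_col': ['this has 12345', 'this has 1234567']})

# \b...\b stops a shorter pattern from eating part of a longer digit run
df['feature_col'] = df.feature_col.replace({r'\b\d{7}\b': 'ID7',
                                            r'\b\d{4}\b': 'ID4',
                                            r'\b\d{3}\b': 'ID3'},
                                           regex=True)
print(df)
# 'this has 12345' is left alone; 'this has 1234567' becomes 'this has ID7'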
You could do a series of replacements, one for each length of number (writing the placeholder strings out literally):
df.feature_col = df.feature_col.str.replace(r'\b\d{3}\b', ' 3mask ')
df.feature_col = df.feature_col.str.replace(r'\b\d{10}\b', ' masked_id ')
df.feature_col = df.feature_col.str.replace(r'\b\d{16}\b', ' account_number ')
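If you would rather make a single pass over each string, one option (a sketch of an alternative, not part of the answer above) is a callable replacement that picks the placeholder from the length of the matched digit run; pandas' str.replace accepts a callable when regex=True:

import pandas as pd

# hypothetical placeholder labels, mirroring the question's examples
labels = {16: ' account_number ', 10: ' masked_id ', 3: ' 3mask '}

def mask(m):
    # choose a label by digit-run length; leave runs of other lengths alone
    return labels.get(len(m.group()), m.group())

df = pd.DataFrame({'feature_col': ['id 5551234567 card 4111111111111111 code 123']})
df.feature_col = df.feature_col.str.replace(r'\b\d+\b', mask, regex=True)
print(df.feature_col[0])
# id  masked_id  card  account_number  code  3mask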
I've been trying for hours to figure out and put together a somewhat complicated (for me) piece of syntax using the .join function, but I just can't get it to work.
The task is to remove all duplicate words from a string obtained through a scraping process, but leave all duplicate numbers and digits intact.
Example Code:
from collections import OrderedDict
examplestring = 'Your Brand22 For Awesome Product 1 Year 1 User Subscription Brand22'
print(' '.join(OrderedDict((w,w) for w in examplestring.split()).keys()))
>>> Your Brand22 For Awesome Product 1 Year User Subscription
Note that the above code works, but it also removes the duplicated 1 (1 Year 1 User), which I need to keep. I'm trying to leave the numbers intact by checking each word with the isdigit() function as .split() goes through the string word by word, but I cannot figure out the proper syntax for it.
result = ' '.join(OrderedDict((w,w) for w in examplestring.split()).keys() if w not isdigit())
result = ([' '.join(OrderedDict((w,w) for w in examplestring.split()).keys())] if w not isdigit())
result = ' '.join([(OrderedDict((w,w) for w in examplestring.split()).keys()] if w not isdigit()))
I tried many more variations of the above one-liner and might even be missing an if statement, but all these brackets confuse me, so I'd be grateful if anyone could help me out.
Goal: Remove duplicate words but keep repeated digits/numbers inside the string
You can solve the problem by modifying the keys when the key is a number. Here I'm using enumerate to append the word's position to numeric keys, which makes each of them unique, so duplicate numbers survive the deduplication while .values() still returns the original words:
from collections import OrderedDict

examplestring = 'Your Brand22 For Awesome Product 1 Year 1 User Subscription Brand22'
res = ' '.join(OrderedDict(((word + str(idx) if word.isnumeric() else word), word)
                           for idx, word in enumerate(examplestring.split())).values())
print(res)
Output:
Your Brand22 For Awesome Product 1 Year 1 User Subscription
Does this work for you?
example_str = 'Your Brand22 For Awesome Product 1 Year 1 User Subscription Brand22'
words_list = example_str.split()
numeric_flags_list = [all(char.isnumeric() for char in word) for word in words_list]

unique_words = []
for word, numeric_flag in zip(words_list, numeric_flags_list):
    if numeric_flag:
        unique_words.append(word)        # always keep numbers, even duplicates
    elif word not in unique_words:
        unique_words.append(word)        # keep only the first occurrence of a word

print(' '.join(unique_words))
I am writing a function which I want to apply to a dataframe later.
def get_word_count(text, df):
    # text is a lowercase list of words
    # df is a dataframe with 2 columns: word and count
    # this function updates the word counts
    #f = open('stopwords.txt', 'r')
    #stopwords = f.read()
    stopwords = 'in the and an - '
    for word in text:
        if word not in stopwords:
            if df['word'].str.contains(word).any():
                df.loc[df['word'] == word, 'count'] = df['count'] + 1
            else:
                df.loc[0] = [word, 1]
                df.index = df.index + 1
    return df
Then I check it:
word_df=pd.DataFrame(columns=['word','count'])
sentence1='[first] - missing "" in the text [first] word'.split()
y=get_word_count(sentence1, word_df)
sentence2="error: wrong word in the [second] text".split()
y=get_word_count(sentence2, word_df)
y
I get the following results:
Word      Count
[first]       2
missing       1
""            1
text          2
word          2
error:        1
wrong         1
So where is [second] from sentence2?
If I omit one of the square brackets, I get an error message. How do I handle words with special characters? Note that I don't want to get rid of them, because if I did, I would miss "" in sentence1.
The problem comes from the line:
if df['word'].str.contains(word).any():
str.contains interprets the word as a regular expression, not as a literal string. So [second] is treated as a character class matching any one of the letters s, e, c, o, n, d, and it matches the already-stored word [first] (which contains an s). Since .any() is then True, the branch that would add [second] as a new row is never executed.
For a quick fix, I changed the line to:
if word in df['word'].tolist():
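A minimal sketch of the behaviour, plus an alternative fix (my addition, not from the answer above): str.contains also accepts regex=False, which turns it into a plain substring test:

import pandas as pd

s = pd.Series(['[first]'])
print(s.str.contains('[second]').any())               # True: '[second]' acts as a regex
print(s.str.contains('[second]', regex=False).any())  # False: plain substring test
print('[second]' in s.tolist())                       # False: the exact-match fix above

Note that even with regex=False, contains still does a substring match, while the tolist() fix tests for exact equality.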
Creating a DataFrame in a loop like that is not recommended; you should do something like this:
stopwords = 'in the and an - '
words = sentence1 + sentence2   # both are already lists of words
df = pd.DataFrame(words, columns=['Words'])
df = df.groupby(by=['Words'])['Words'].size().reset_index(name='counts')
df = df[~df['Words'].isin(stopwords.split())]
print(df)
       Words  counts
0         ""       1
2    [first]       2
3   [second]       1
4     error:       1
6    missing       1
7       text       2
9       word       2
10     wrong       1
I have rebuilt it in a way that lets you add sentences and see the frequencies grow:
from collections import Counter
from collections import defaultdict
import pandas as pd

def terms_frequency(corpus, stop_words=None):
    '''
    Takes in texts and returns a pandas DataFrame of word frequencies
    '''
    corpus_ = corpus.split()
    # remove stop words (split the stop-word string so only whole words are dropped)
    stop_list = stop_words.split() if stop_words else []
    terms = [word for word in corpus_ if word not in stop_list]
    terms_freq = pd.DataFrame.from_dict(Counter(terms), orient='index').reset_index()
    terms_freq = terms_freq.rename(columns={'index': 'word', 0: 'count'}).sort_values('count', ascending=False)
    terms_freq.reset_index(inplace=True)
    terms_freq.drop('index', axis=1, inplace=True)
    return terms_freq

def get_sentence(sentence, storage, stop_words=None):
    storage['sentences'].append(sentence)
    corpus = ' '.join(s for s in storage['sentences'])
    return terms_frequency(corpus, stop_words)

# tests
STOP_WORDS = 'in the and an - '
storage = defaultdict(list)

S1 = '[first] - missing "" in the text [first] word'
print(get_sentence(S1, storage, STOP_WORDS))

print('\nNext S2')
S2 = 'error: wrong word in the [second] text'
print(get_sentence(S2, storage, STOP_WORDS))