I am trying to replace certain words that occur at the very beginning of the sentence in each row of the dataframe. However, passing 1 as the count replaces the first occurrence anywhere in the string, not just at the beginning. Why is passing 1 to replace not working as expected? Is there a different way to do this?
Thanks!
Initial:
df_test = pd.read_excel('sample.xlsx')
print('Initial: \n',df_test)
Initial:
some_text
0 ur goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 he way is clear … he is going to library
Tried:
df_test['some_text'] = df_test['some_text'] \
.str.replace('ur ','Our ',1) \
.str.replace('he ','The ',1)
print('Tried:\n',df_test)
Tried: (Incorrect Results)
some_text
0 Our goal is to finish shopping for books today
1 OOur goal is to finish shopping for books today
2 TThe help is on the way
3 The way is clear … he is going to library
Final output needed:
some_text
0 Our goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 The way is clear … he is going to library
Not sure why the other answer got deleted; it was much more concise and did the job. (Sorry, I don't remember who posted it. I tried the answer and it worked, but it had certain limitations.)
df.some_text.str.replace('^ur','Our ').str.replace('^he','The ')
However, as pointed out in the comments, this would also match words that merely start with 'ur' (e.g. 'ursula') or 'he' (e.g. 'helen').
The corrected code is:
df.some_text.str.replace(r'^ur\s', 'Our ', regex=True).str.replace(r'^he\s', 'The ', regex=True)
The '^' anchors the match at the start of the line, so only an incomplete word at the beginning of the line is replaced. The '\s' requires whitespace after the fragment, so the pattern only matches the standalone word.
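As a quick sanity check, here is a minimal, self-contained demo of the anchored pattern (the sample strings are made up to include the 'ursula' edge case):

```python
import pandas as pd

s = pd.Series(['ur goal is clear', 'ursula is here', 'he way is clear'])
# '^' anchors at the start; r'\s' requires whitespace, so 'ursula' is untouched
fixed = (s.str.replace(r'^ur\s', 'Our ', regex=True)
          .str.replace(r'^he\s', 'The ', regex=True))
print(fixed.tolist())
# ['Our goal is clear', 'ursula is here', 'The way is clear']
```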
Programming languages, including Python, don't read text the way human beings do. You need to tell Python to split on whitespace, for example via str.split:
import pandas as pd

df = pd.DataFrame({'some_text': ['ur goal is to finish shopping for books today',
                                 'Our goal is to finish shopping for books today',
                                 'The help is on the way',
                                 'he way is clear … he is going to library']})

d = {'ur': 'Our', 'he': 'The'}
df['result'] = [' '.join((d.get(i, i), j)) for i, j in df['some_text'].str.split(n=1)]
print(df)
some_text \
0 ur goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 he way is clear … he is going to library
result
0 Our goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 The way is clear … he is going to library
Related
I've been working on a job description parser and have been trying to extract the entire sentence that contains the required number of years of experience.
I have tried to use a regex, which gives me the number of years but not the entire sentence.
def extract_years(self, resume_text):
    resume_text = str(resume_text.split('.'))
    exp = []
    rx = re.compile(r"(\d+(?:-\d+)?\+?)\s*(years?)", re.I)
    for word in resume_text:
        exp_temp = rx.search(resume_text)
        if exp_temp:
            exp.append(exp_temp[0])
    exp = list(set(exp))
    return exp
Output:
['5-7 years']
Desired Output:
['5-7 years of experience in journalism, communications, or content creation preferred']
Try: (\d+(?:-\d+)?\+?)\s*(years?).*
Though I'm somewhat new to regex, I believe you can get what you want by appending ".*" to the end of your match terms, and possibly prepending it as well if "5-7 years" comes after other characters, as in "needs 5-7 years of experience".
Adding ".*" at the end matches any run of characters (zero or more, stopping at a line break) after your initial match, so the pattern captures the rest of the sentence.
Hope this helps.
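To illustrate, appending ".*" makes the match run to the end of the line (the sample sentence below is taken from the desired output in the question):

```python
import re

text = "5-7 years of experience in journalism, communications, or content creation preferred"
# same pattern as before, with .* appended to consume the rest of the sentence
rx = re.compile(r"(\d+(?:-\d+)?\+?)\s*years?.*", re.I)
m = rx.search(text)
print(m.group(0))
# 5-7 years of experience in journalism, communications, or content creation preferred
```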
So I have a review dataset having reviews like
Simply the best. I bought this last year. Still using. No problems
faced till date.Amazing battery life. Works fine in darkness or broad
daylight. Best gift for any book lover.
(This is from the original dataset, I have removed all punctuation and have all lower case in my processed dataset)
What I want to do is replace some words with 1 (as per my dictionary) and others with 0.
My dictionary is
dict = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}
I want my output like:
0010000000000001000000000100000
I have used this code:
df['newreviews'] = df['reviews'].map(dict).fillna("0")
This always returns 0 as output. I did not want this so I took 1s and 0s as strings, but despite that I'm getting the same result.
Any suggestions how to solve this?
First, don't use dict as a variable name, because it shadows a Python builtin. Then use a list comprehension with dict.get to replace unmatched words with '0'.
Notice:
If the data contain strings like date.Amazing (no space after the punctuation), it is necessary to replace the punctuation with whitespace first.
df = pd.DataFrame({'reviews':['Simply the best. I bought this last year. Still using. No problems faced till date.Amazing battery life. Works fine in darkness or broad daylight. Best gift for any book lover.']})

d = {"amazing":"1","super":"1","good":"1","useful":"1","nice":"1","awesome":"1","quality":"1","resolution":"1","perfect":"1","revolutionary":"1","and":"1","good":"1","purchase":"1","product":"1","impression":"1","watch":"1","quality":"1","weight":"1","stopped":"1","i":"1","easy":"1","read":"1","best":"1","better":"1","bad":"1"}

df['reviews'] = df['reviews'].str.replace(r'[^\w\s]+', ' ', regex=True).str.lower()
df['newreviews'] = [''.join(d.get(y, '0') for y in x.split()) for x in df['reviews']]
Alternative:
df['newreviews'] = df['reviews'].apply(lambda x: ''.join(d.get(y, '0') for y in x.split()))
print (df)
reviews \
0 simply the best i bought this last year stil...
newreviews
0 0011000000000001000000000100000
You can do:
# clean the sentence (sent is the review string, mydict the word dictionary)
import re
sent = re.sub(r'\.', '', sent)
# convert to a lowercase list of words
sent = sent.lower().split()
# get values from the dict using a comprehension
new_sent = ''.join(['1' if x in mydict else '0' for x in sent])
print(new_sent)
print(new_sent)
'001100000000000000000000100000'
You can do it with
df.replace(repl, regex=True, inplace=True)
where df is your DataFrame and repl is your dictionary.
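A minimal sketch of this approach with made-up sample data; note that with regex=True the dictionary keys are treated as patterns and replaced as substrings, and words missing from the dictionary are left unchanged rather than becoming '0':

```python
import pandas as pd

df = pd.DataFrame({'reviews': ['amazing product but bad battery']})
repl = {'amazing': '1', 'product': '1', 'bad': '1'}
# each key is treated as a regex and replaced inside every cell
df.replace(repl, regex=True, inplace=True)
print(df['reviews'][0])
# 1 1 but 1 battery
```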
I want to find the frequencies of certain words in wanted, and while my code finds the frequencies, the displayed result contains lots of unnecessary data.
Code:
from collections import Counter
import re

wanted = "whereby also thus"
cnt = Counter()
words = re.findall(r'\w+', open('C:/Users/user/desktop/text.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
Results:
Counter({'e': 131, 'a': 119, 'by': 38, 'where': 16, 's': 14, 'also': 13, 'he': 4, 'whereby': 2, 'al': 2, 'b': 2, 'o': 1, 't': 1})
Questions:
How do I omit all those 'e', 'a', 'by', 'where', etc.?
If I then wanted to sum up the frequencies of the words (also, thus, whereby) and divide them by the total number of words in the text, would that be possible?
Disclaimer: this is not a school assignment. I just have lots of free time at work now, and since I spend a lot of time reading texts, I decided to do this little project to remind myself of what I was taught a couple of years ago.
Thanks in advance for any help.
As others have pointed out, you need to change your string wanted to a list. I just hardcoded a list, but you could use str.split(" ") if you were passed a string in a function. I also implemented the frequency counter for you. As a note, make sure you close your files; it's easier (and recommended) to open them with a with statement.
from collections import Counter
import re

wanted = ["whereby", "also", "thus"]
cnt = Counter()

with open('C:/Users/user/desktop/text.txt', 'r') as fp:
    fp_contents = fp.read().lower()

words = re.findall(r'\w+', fp_contents)
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)

total_cnt = sum(cnt.values())
print(float(total_cnt) / len(cnt))
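For the second question (combined frequency of the wanted words divided by the total word count of the text, rather than by the number of distinct matched words), a sketch with a made-up sample text:

```python
import re
from collections import Counter

text = "thus it was, and thus whereby it also came to pass".lower()
wanted = ["whereby", "also", "thus"]

words = re.findall(r'\w+', text)
cnt = Counter(w for w in words if w in wanted)

# sum of the wanted-word frequencies over the total number of words
ratio = sum(cnt.values()) / len(words)
print(cnt, ratio)
```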
Reading from the web
I made this little mod of Axel's code to read from a txt file on the web (Alice in Wonderland) so I could try the code, since I don't have your txt file. I'm publishing it here in case someone needs something like this.
from collections import Counter
import re
from urllib.request import urlopen

testo = str(urlopen("https://www.gutenberg.org/files/11/11.txt").read())
wanted = ["whereby", "also", "thus", "Alice", "down", "up", "cup"]
cnt = Counter()
words = re.findall(r'\w+', testo)
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)

total_cnt = sum(cnt.values())
print(float(total_cnt) / len(cnt))
output
Counter({'Alice': 334, 'up': 97, 'down': 90, 'also': 4, 'cup': 2})
105.4
How many times the same word is found in adjacent sentences
This answers the request (from the author of the question) to count how many times a word is found in adjacent sentences. If a sentence contains the same word more than once (e.g. 'had') and the next sentence also contains it, I counted that as 1 repetition; that is why I used the wordfound list.
from collections import Counter
import re
testo = """There was nothing so VERY remarkable in that; nor did Alice think it so? Thanks VERY much. Out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed. Quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS? WAISTCOAT-POCKET, and looked at it, and then hurried on.
Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit. with either a waistcoat-pocket, or a watch to take out of it! and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop? Down a large rabbit-hole under the hedge.
Alice opened the door and found that it led into a small passage, not much larger than a rat-hole: she knelt down and looked along the passage into the loveliest garden you ever saw. How she longed to get out of that dark hall, and wander about among those beds of bright flowers and those cool fountains, but she could not even get her head through the doorway; 'and even if my head would go through,' thought poor Alice, 'it would be of very little use without my shoulders. Oh, how I wish I could shut up like a telescope! I think I could, if I only knew how to begin.'For, you see, so many out-of-the-way things had happened lately, that Alice had begun to think that very few things indeed were really impossible. There seemed to be no use in waiting by the little door, so she went back to the table, half hoping she might find another key on it, or at any rate a book of rules for shutting people up like telescopes: this time she found a little bottle on it, ('which certainly was not here before,' said Alice,) and round the neck of the bottle was a paper label, with the words 'DRINK ME' beautifully printed on it in large letters. It was all very well to say 'Drink me,' but the wise little Alice was not going to do THAT in a hurry. 'No, I'll look first,' she said, 'and see whether it's marked "poison" or not'; for she had read several nice little histories about children who had got burnt, and eaten up by wild beasts and other unpleasant things, all because they WOULD not remember the simple rules their friends had taught them: such as, that a red-hot poker will burn you if you hold it too long; and that if you cut your finger VERY deeply with a knife, it usually bleeds; and she had never forgotten that, if you drink much from a bottle marked 'poison,' it is almost certain to disagree with you, sooner or later. 
However, this bottle was NOT marked 'poison,' so Alice ventured to taste it, and finding it very nice, (it had, in fact, a sort of mixed flavour of cherry-tart, custard, pine-apple, roast turkey, toffee, and hot buttered toast,) she very soon finished it off. """
frasi = re.findall(r"[A-Z].*?[\.!?]", testo, re.MULTILINE | re.DOTALL)

print("How many times this words are repeated in adjacent sentences:")
cnt2 = Counter()
for n, s in enumerate(frasi):
    words = re.findall(r"\w+", s)
    wordfound = []
    for word in words:
        try:
            if word in frasi[n + 1]:
                wordfound.append(word)
                if wordfound.count(word) < 2:
                    cnt2[word] += 1
        except IndexError:
            pass

for k, v in cnt2.items():
    print(k, v)
output
How many times this words are repeated in adjacent sentences:
had 1
hole 1
or 1
as 1
little 2
that 1
hot 1
large 1
it 5
to 5
a 6
not 3
and 2
s 1
me 1
bottle 1
is 1
no 1
the 6
how 1
Oh 1
she 2
at 1
marked 1
think 1
VERY 1
I 2
door 1
red 1
of 1
dear 1
see 1
could 2
in 2
so 1
was 1
poison 1
A 1
Alice 3
all 1
nice 1
rabbit 1
I read my dataset with this:
dataset = pd.read_csv('lyrics.csv', delimiter = '\t', quoting = 3)
I print my dataset in this fashion:
lyrics,classification
0 "I should have known better with a girl like you
1 That I would love everything that you do
2 And I do, hey hey hey, and I do
3 Whoa, whoa, I
4 Never realized what I kiss could be
5 This could only happen to me
6 Can't you see, can't you see
7 That when I tell you that I love you, oh
8 You're gonna say you love me too, hoo, hoo, ho...
9 And when I ask you to be mine
10 You're gonna say you love me too
11 So, oh I never realized what I kiss could be
12 Whoa whoa I never realized what I kiss could be
13 You love me too
14 You love me too",0
but what I really need is to have everything between the double quotes in a single row. How do I make this conversion in pandas?
Solution that worked for OP (from comments):
Fixing the problem at its source (in read_csv):
@nbeuchat is probably right; just try:
dataset = pd.read_csv('lyrics.csv', quoting = 2)
That should give you a dataframe with one row and two columns: lyrics (with embedded line returns in the string) and classification (0).
General solution for collapsing series of strings:
You want to use pd.Series.str.cat:
import pandas as pd

dataset = pd.DataFrame({'lyrics': pd.Series(['happy birthday to you',
                                             'happy birthday to you',
                                             'happy birthday dear outkast',
                                             'happy birthday to you'])})

dataset['lyrics'].str.cat(sep=' / ')
# 'happy birthday to you / happy birthday to you / happy birthday dear outkast / happy birthday to you'
The default sep is None, which would give you 'happy birthday to youhappy birthday to youhappy ...' so pick the sep value that works for you. Above I used slashes (padded with spaces) since that's what you typically see in quotations of songs and poems.
You can also try print(dataset['lyrics'].str.cat(sep='\n')) which maintains the line breaks but stores them all in one string instead of one string per line.
I have a text that goes like this:
text = "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood."
How do I write a function hedging(text) that processes my text and produces a new version that inserts the word "like" after every third word of the text?
The outcome should be like that:
text2 = "All human beings like are born free like and equal in like..."
Thank you!
Instead of giving you something like
solution = ' like '.join(map(' '.join, zip(*[iter(text.split())]*3)))
I'm posting general advice on how to approach the problem. The "algorithm" is not particularly "pythonic", but it is hopefully easy to understand:
words = split text into words
number of words processed = 0
for each word in words:
    output word
    number of words processed += 1
    if number of words processed is divisible by 3:
        output "like"
Let us know if you have questions.
You could go with something like that:
' '.join([n + ' like' if i % 3 == 2 else n for i, n in enumerate(text.split())])
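Expanded into an explicit function (a minimal sketch applying the same insertion rule as the one-liner above):

```python
def hedging(text):
    # Append "like" after every third word of the text.
    out = []
    for count, word in enumerate(text.split(), start=1):
        out.append(word)
        if count % 3 == 0:
            out.append("like")
    return ' '.join(out)

print(hedging("All human beings are born free and equal in dignity and rights"))
# All human beings like are born free like and equal in like dignity and rights like
```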