I would like to find a selected word and take everything from the first period (.) before it up to the first period (.) after it.
example:
inside a file called 'text.php'
'The price of blueberries has gone way up. In the year 2038 blueberries have
almost tripled in price from what they were ten years ago. Economists have
said that berries may going up 300% what they are worth today.'
Code example: (I know that if I use code like this I can find up to 5 words before the word ['that'] and up to 5 after it, but I would like to find everything between the period before and the period after a word.)
import re
text = ('The price of blueberries has gone way up, that might cause trouble for farmers. '
        'In the year 2038 blueberries have almost tripled in price from what they were ten years '
        'ago. Economists have said that berries may going up 300% what they are worth today.')
find = re.search(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,5}that(?:[^a-zA-Z'-]+[a-zA-Z'-]+){0,5}", text)
done = find.group()
print(done)
return:
'blueberries has gone way up, that might cause trouble for farmers'
I would like it to return every sentence with ['that'] in it.
Example return (what I'm looking to get):
'The price of blueberries has gone way up, that might cause trouble for farmers',
'Economists have said that berries may going up 300% what they are worth today'
I would do it like this:
text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
for sentence in text.split('.'):
    if 'that' in sentence:
        print(sentence.strip())
.strip() is there simply to trim extra spaces, because I'm splitting on '.'.
If you do want to use the re module, I would use something like this:
text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
results = re.findall(r"[^.]+that[^.]+", text)
results = [x.strip() for x in results]
print(results)
To get the same results.
Things to keep in mind:
If you have words like thatcher in the sentence, the sentence will be printed too. In the first solution, you could use if 'that' in sentence.split(): instead so as to split the string into words, and in the second solution, you could use re.findall(r"[^.]+\bthat\b[^.]+", text) (note the \b tokens; these represent word boundaries).
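To see the difference concretely, here is a small self-contained comparison (the sample sentence is made up for illustration):

```python
import re

text = 'Margaret Thatcher said that taxes would fall.'

# Without word boundaries, 'that' also matches inside 'Thatcher'
loose = re.findall(r'that', text, flags=re.IGNORECASE)

# With \b on both sides, only the standalone word matches
strict = re.findall(r'\bthat\b', text, flags=re.IGNORECASE)

print(len(loose))   # 2 (one of them inside 'Thatcher')
print(len(strict))  # 1
```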
The script relies on the period (.) to delimit sentences. If the sentences themselves contain words that use periods, then the results might not be what you expect (e.g. for the sentence Dr. Tom is sick yet again today, so I'm substituting for him., the script will find Dr as one sentence and Tom is sick yet again today, so I'm substituting for him as another sentence)
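If you only have a handful of known abbreviations, a partial workaround is to split with a negative lookbehind (sketched below for 'Dr' only; a robust solution would use a real sentence tokenizer such as NLTK's sent_tokenize):

```python
import re

text = "Dr. Tom is sick yet again today, so I'm substituting for him."

# Split on periods that are NOT immediately preceded by 'Dr'
sentences = [s.strip() for s in re.split(r'(?<!Dr)\.', text) if s.strip()]
print(sentences)  # ["Dr. Tom is sick yet again today, so I'm substituting for him"]
```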
EDIT: To answer your question in the comments, I would make the following changes:
Solution 1:
text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
sentences = text.split('.')
for i, sentence in enumerate(sentences):
    if 'almost' in sentence:
        before = '' if i == 0 else sentences[i-1].strip()
        middle = sentence.strip()
        after = '' if i == len(sentences)-1 else sentences[i+1].strip()
        print(". ".join([before, middle, after]))
Solution 2:
text = 'The price of blueberries has gone way up, that might cause trouble for farmers. In the year 2038 blueberries have almost tripled in price from what they were ten years ago. Economists have said that berries may going up 300% what they are worth today.'
results = re.findall(r"(?:[^.]+\. )?[^.]+almost[^.]+(?:\. [^.]+)?", text)
results = [x.strip() for x in results]
print(results)
Note that these can potentially give overlapping results. E.g. if the text is a. b. b. c., and you are trying to find sentences containing b, you will get a. b. b and b. b. c.
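Applying the first solution's neighbour logic to that example makes the overlap visible:

```python
text = 'a. b. b. c.'
sentences = text.split('.')  # ['a', ' b', ' b', ' c', '']

results = []
for i, sentence in enumerate(sentences):
    if 'b' in sentence:
        before = '' if i == 0 else sentences[i-1].strip()
        middle = sentence.strip()
        after = '' if i == len(sentences)-1 else sentences[i+1].strip()
        results.append('. '.join([before, middle, after]))

print(results)  # ['a. b. b', 'b. b. c'] (the middle 'b' sentence appears in both)
```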
This function should do the job:
old_text = 'test 1: test friendly, test 2: not friendly, test 3: test friendly, test 4: not friendly, test 5: not friendly'
replace_dict={'test 1':'tested 1','not':'very'}
The function:
def replace_me(text, replace_dict):
    for key in replace_dict.keys():
        text = text.replace(str(key), str(replace_dict[key]))
    return text
result:
print(replace_me(old_text,replace_dict))
Out: 'tested 1: test friendly, test 2: very friendly, test 3: test friendly, test 4: very friendly, test 5: very friendly'
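One caveat worth noting: the replacements are applied sequentially, so a later key can rewrite text that an earlier replacement just produced. A small illustration (the words here are made up):

```python
def replace_me(text, replace_dict):
    for key in replace_dict.keys():
        text = text.replace(str(key), str(replace_dict[key]))
    return text

# 'cat' -> 'dog' runs first, then 'dog' -> 'wolf' also hits the new 'dog'
print(replace_me('cat and dog', {'cat': 'dog', 'dog': 'wolf'}))
# wolf and wolf
```

If that chaining is unwanted, a single-pass re.sub over an alternation of the keys avoids it.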
Hello, I have a dataset where I want to match my keyword with the location. The problem I am having is that the location "Afghanistan" or "Kabul" or "Helmund" appears in my dataset in over 150 combinations, including spelling mistakes, capitalization, and having the city or town attached to its name. What I want to do is create a separate column that returns the value 1 if any of the character sequences "afg", "Afg", "kab" or "helm" are contained in the location. I am not sure if upper or lower case makes a difference.
For instance there are hundreds of location combinations like so: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried this code, and it works if it matches the phrase exactly, but there is too much variation to write every exception down:
keywords= ['Afghanistan','Kabul','Herat','Jalalabad','Kandahar','Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan','kabul','herat','jalalabad','kandahar']
#how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0
taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)
# below will return the 1 values
taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)
I just need to replace this logic so that the "keyword_solution" column matches either "Afg" or "afg", "kab" or "Kab", "kund" or "Kund".
Given the following:
Sentences from the New York Times
Remove all non-alphanumeric characters
Change everything to lowercase, thereby removing the need for different word variations
Split the sentence into a list or set. I used set because of the long sentences.
Add to the keywords list as needed
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword, in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords= ['jalalabad',
'kunduz',
'lashkargah',
'mazar',
'herat',
'mazar',
'afgh',
'kab',
'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# substitute non-alphanumeric characters
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))
# create a new column with a list of all the words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check the list against the keywords
df['location'] = df.word_list.apply(lambda x: any([True if word in y else False for y in x for word in keywords]))
# final
print(df.location)
0 True
1 False
2 False
3 True
4 True
5 True
Name: location, dtype: bool
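For comparison, here is a shorter sketch of the same substring-matching idea (not the code above, just an alternative): join the keywords into one alternation pattern and let str.contains with case=False handle capitalization:

```python
import pandas as pd

keywords = ['jalalabad', 'kunduz', 'lashkargah', 'mazar',
            'herat', 'afgh', 'kab', 'kand']

df = pd.DataFrame({'sentences': ['Troops left Afghanistan today.',
                                 'No relevant locations here.',
                                 'The Kabul administration responded.']})

# One regex alternation: 'jalalabad|kunduz|...'; case=False ignores capitalization
pattern = '|'.join(keywords)
df['location'] = df['sentences'].str.contains(pattern, case=False)
print(df['location'].tolist())  # [True, False, True]
```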
I am relatively new to Python and very new to nltk and regex. I have searched for guidance but not figuring it out. I am simply trying to remove any x or X that falls after an integer (should always be an integer) in text to ultimately get just the number. I have code that does what I need it to do once the X or x is removed so now I am trying to add to the code to remove that x or X from the numbers but NOT the normal text (words like exited and matrix below).
For example, if I have a text string of: 'It was a beautiful day and 710x birds exited their habitats and flew overhead. 130X of them dove down and landed on the grass while 21X of them were shot by 7 hunters. 9x birds vanished into the matrix. The remaining 550x birds kept flying away.'
I would like this:
'It was a beautiful day and 710 birds exited their habitats and flew overhead. 130 of them dove down and landed on the grass while 21 of them were shot by 7 hunters. 9 birds vanished into the matrix. The remaining 550 birds kept flying away.'
So I don't know if this is best handled by regex (Regular Expression) or nltk (Natural Language Toolkit) or simply some if statement somehow. I tokenize all the text, which can be upwards of 20,000 to 30,000 tokens/words from the pdf files I extract the text from, but I would be happy to remove those x's while it is still one huge string or after it has been made into tokens. Either is fine by me. Thank you very much for any assistance ...
This matches x with a look behind assertion that the prior character is a digit and replaces the x with nothing.
re.sub(r'(?<=\d)[xX]', '', s)
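Wrapped into a runnable snippet (the sample string is shortened from the question):

```python
import re

s = ('It was a beautiful day and 710x birds exited their habitats. '
     '9x birds vanished into the matrix.')

# (?<=\d) only strips an x/X whose preceding character is a digit,
# so 'exited' and 'matrix' are left untouched
cleaned = re.sub(r'(?<=\d)[xX]', '', s)
print(cleaned)
```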
Try this.
import re
text = 'It was a beautiful day and 710x birds exited their habitats and flew overhead. 130X of them dove down and landed on the grass while 21X of them were shot by 7 hunters. 9x birds vanished into the matrix. The remaining 550x birds kept flying away.'
re.sub(r'(\d+)[xX]', r'\1', text)
# >>> 'It was a beautiful day and 710 birds exited their habitats and flew overhead. 130 of them dove down and landed on the grass while 21 of them were shot by 7 hunters. 9 birds vanished into the matrix. The remaining 550 birds kept flying away.'
What's this?
re.sub performs substitution by regular expression. The first parameter is the pattern to find, and the second is the replacement (which may reference captured groups).
r'(\d+)[xX]' is made of
\d+ <= one or more digits
[xX] <= a single x or X
() <= capture the digits for use afterwards
r'\1' refers to the first captured group.
def parseNumeric(data):
    for each in data:
        noX = ''
        for i in each:
            if i.isdigit():
                noX += i
        if noX != '':
            data[data.index(each)] = noX
    return " ".join(str(x) for x in data)
theData = "It was a beautiful day and 710x birds exited their habitats and flew overhead. 130X of them dove down and landed on the grass while 21X of them were shot by 7 hunters. 9x birds vanished into the matrix. The remaining 550x birds kept flying away."
print("\n BEFORE \n")
print(theData)
print("\n AFTER \n")
print(parseNumeric(theData.split()))
I know it's not the best solution, but I hope it helps.
I am trying to replace certain words that occur at the very start of the statement in each row of the dataframe. However, passing in '1' as the position is replacing everything. Why is passing '1' to replace not working? Is there a different way to do this?
Thanks!
Initial:
df_test = pd.read_excel('sample.xlsx')
print('Initial: \n',df_test)
Initial:
some_text
0 ur goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 he way is clear … he is going to library
Tried:
df_test['some_text'] = df_test['some_text'] \
    .str.replace('ur ', 'Our ', 1) \
    .str.replace('he ', 'The ', 1)
print('Tried:\n',df_test)
Tried: (Incorrect Results)
some_text
0 Our goal is to finish shopping for books today
1 OOur goal is to finish shopping for books today
2 TThe help is on the way
3 The way is clear … he is going to library
Final output needed:
some_text
0 Our goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 The way is clear … he is going to library
Not sure why the other answer got deleted, it was much more concise and did the job. (Sorry, I don't remember who posted it. I tried the answer and it worked but had certain limitations)
df.some_text.str.replace('^ur','Our ').str.replace('^he','The ')
However, as pointed out in the comments, this would replace all the starting characters starting with 'ur' ('ursula') or 'he' ('helen').
The corrected code is:
df.some_text.str.replace(r'^ur\s', 'Our ', regex=True).str.replace(r'^he\s', 'The ', regex=True)
The '^' indicates the start of the string, so the pattern only replaces at the beginning of the line. The '\s' requires whitespace after the first word, so it only matches the complete word.
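A quick check of that pattern with plain re (using a made-up name, 'ursula', to show why the \s matters):

```python
import re

# '^ur\s' only matches 'ur' as a complete first word
print(re.sub(r'^ur\s', 'Our ', 'ur goal is clear'))    # Our goal is clear
print(re.sub(r'^ur\s', 'Our ', 'ursula is shopping'))  # ursula is shopping (unchanged)
```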
Programming languages, including Python, don't read like human beings. You need to tell Python to split by whitespace. For example, via str.split:
df = pd.DataFrame({'some_text': ['ur goal is to finish shopping for books today',
'Our goal is to finish shopping for books today',
'The help is on the way',
'he way is clear … he is going to library']})
d = {'ur': 'Our', 'he': 'The'}
df['result'] = [' '.join((d.get(i, i), j)) for i, j in df['some_text'].str.split(n=1)]
print(df)
some_text \
0 ur goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 he way is clear … he is going to library
result
0 Our goal is to finish shopping for books today
1 Our goal is to finish shopping for books today
2 The help is on the way
3 The way is clear … he is going to library
I have some text which is not clean and has many tags and ASCII escapes, as follows:
val = ("\nRated\xa0\n  I have been to this place for dinner tonight. "
       "\nWell I didn't found anything extraordinary there but indeed a meal worth "
       "the price. The number of barbeque item and other both were good.\n\nFood: 3.5/5\"")
So to clean this up I am using:
val.replace('\t', '').replace('\n', '').encode('ascii', 'ignore').decode('utf-8').replace('Rated', '').replace('  ', '')
and by using replace multiple times I got my output as:
I have been to this place for dinner tonight. Well I didn't found anything extraordinary there but indeed a meal worth the price. The number of barbeque item and other both were good. Food: 3.5/5
I want to know whether there is a way to use replace just once for similar kinds of replacements, like in this case:
replace('\t', '').replace('\n', '').replace('  ', '')
You can use .translate to delete \n\t and then use your replacement for the runs of spaces:
>>> val.translate(str.maketrans('', '', '\n\t')).replace('  ', '')
"Rated I have been to this place for dinner tonight.Well I didn't found anything extraordinary there but indeed a meal worth the price. The number of barbeque item and other both were good.Food: 3.5/5"
The replace('  ', '') will be problematic with runs of even numbers of spaces (they will just be deleted entirely). You might consider a regex that collapses runs of spaces instead:
>>> re.sub(r' +', ' ', val.translate(str.maketrans('', '', '\n\t')))
"Rated I have been to this place for dinner tonight.Well I didn't found anything extraordinary there but indeed a meal worth the price. The number of barbeque item and other both were good.Food: 3.5/5"
Even though I am not using replace, I still think this is the best way:
import string
val = """\nRated\xa0\n  I have been to this place for dinner tonight.
\nWell I didn't found anything extraordinary there but indeed a meal worth
the price. The number of barbeque item and other both were good.\n\nFood: 3.5/5\""""
print(''.join([i for i in ' '.join(val.split()) if i in string.ascii_letters+' ']))
Output:
Rated I have been to this place for dinner tonight Well I didnt found anything extraordinary there but indeed a meal worth the price The number of barbeque item and other both were good Food
I read my dataset with this:
dataset = pd.read_csv('lyrics.csv', delimiter = '\t', quoting = 3)
I print my dataset in this fashion:
lyrics,classification
0 "I should have known better with a girl like you
1 That I would love everything that you do
2 And I do, hey hey hey, and I do
3 Whoa, whoa, I
4 Never realized what I kiss could be
5 This could only happen to me
6 Can't you see, can't you see
7 That when I tell you that I love you, oh
8 You're gonna say you love me too, hoo, hoo, ho...
9 And when I ask you to be mine
10 You're gonna say you love me too
11 So, oh I never realized what I kiss could be
12 Whoa whoa I never realized what I kiss could be
13 You love me too
14 You love me too",0
but what I really need is to have everything that's between the quotes ("") in one row. How do I make this conversion in pandas?
Solution that worked for OP (from comments):
Fixing the problem at its source (in read_csv):
@nbeuchat is probably right, just try
dataset = pd.read_csv('lyrics.csv', quoting = 2)
That should give you a dataframe with one row and two columns: lyrics (with embedded line returns in the string) and classification (0).
General solution for collapsing series of strings:
You want to use pd.Series.str.cat:
import pandas as pd
dataset = pd.DataFrame({'lyrics':pd.Series(['happy birthday to you',
'happy birthday to you',
'happy birthday dear outkast',
'happy birthday to you'])})
dataset['lyrics'].str.cat(sep=' / ')
# 'happy birthday to you / happy birthday to you / happy birthday dear outkast / happy birthday to you'
The default sep is None, which would give you 'happy birthday to youhappy birthday to youhappy ...' so pick the sep value that works for you. Above I used slashes (padded with spaces) since that's what you typically see in quotations of songs and poems.
You can also try print(dataset['lyrics'].str.cat(sep='\n')) which maintains the line breaks but stores them all in one string instead of one string per line.