Say I have df as follows:
MyCol
Red Motor
Blue Taxi
Green Taxi-1
Light blue small Taxi-1
Light blue big Taxi-2
I would like to split the color and the vehicle into two columns.
The last word (which can contain any characters, like Taxi or Taxi-1) refers to the vehicle. Sometimes there is a 'big' or 'small' associated with the vehicle name. The first few words (one or more) refer to the color.
This is what I have tried. It only works when the last word has no special characters. How can I also handle the case where the last word contains special characters?
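df['MyCol'].str.extract(r'^(.*?)\s((?:small|big)?\s?\w+).*$')
Matching the last word with \S+ (any run of non-whitespace characters) instead of \w+, and anchoring it at the end of the string, also covers names such as Taxi-1:
df['MyCol'].str.extract(r'^(.*?)\s((?:small|big|)\s?\S+)$')
resulting in: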
            0             1
0         Red         Motor
1        Blue          Taxi
2       Green        Taxi-1
3  Light blue  small Taxi-1
4  Light blue    big Taxi-2
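The behaviour of the second pattern can be checked without pandas. The following is a minimal sketch using only the standard re module; the list of rows simply repeats the sample column from the question, and the variable names are illustrative assumptions:

```python
import re

# Pattern from the text: a lazy color prefix, one whitespace character, then an
# optional size word (small/big) followed by the last non-whitespace token.
pattern = re.compile(r'^(.*?)\s((?:small|big|)\s?\S+)$')

rows = [
    "Red Motor",
    "Blue Taxi",
    "Green Taxi-1",
    "Light blue small Taxi-1",
    "Light blue big Taxi-2",
]

# Each entry is a (color, vehicle) tuple, mirroring the two extracted columns.
pairs = [pattern.match(row).groups() for row in rows]

for color, vehicle in pairs:
    print(color, "|", vehicle)
```

The same pattern string should behave identically when passed to df['MyCol'].str.extract, since pandas uses the re module for this.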
I want to look for various types of matches on the word "car", but not if it is preceded by "Jane", "Jane's", "Janes", or "Jane(s)".
The following two regexes partially work for exclusion and inclusion, but I can't get the other variants to work:
(?<!\bJane) car
Jane car
for example
the car is red - Match
here is Jane car is red -> None
here is Janes car is red -> None
here is Jane's car is red -> None
I also want to find the cases where Jane is in the phrase
the car is red - None
here is Jane car is red - Match
here is Janes car is red - Match
here is Jane's car is red - Match
and where car is not preceded by Jane(s)
here Jane(s) car is red - None
and of course the opposite
here is Jane(s) car is red - Match
Edit
If I have a document with "red car\n and Janes car" this should be a Match as there is a reference to "car" without the word Jane/Janes/Jane's/etc. in front of it.
For additional clarity: I will be doing a re.findall for all the occurrences of "car" without the word Jane in front of them.
If you want to match lines where the different forms of Jane should not occur before car, you can exclude them with a negative lookahead and then still match car:
^(?!.*\bJane(?:'?s|\(s\))? car\b).*\bcar\b.*
^ Start of string
(?! Negative lookahead
.*\bJane(?:'?s|\(s\))? Match Jane Janes Jane's Jane(s)
car\b Match a space and the word car
) Close the lookahead
.*\bcar\b.* Match the whole line with the word car between word boundaries
If the different forms of Jane followed by car should be there, you can match it:
^.*\bJane(?:'?s|\(s\))? car\b.*
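Both line-level patterns can be tried against the sample phrases from the question. This is a minimal sketch; the names without_jane, with_jane and samples are illustrative and not part of the answer:

```python
import re

# Matches a line containing "car" only if no form of "Jane ... car" occurs in it.
without_jane = re.compile(r"^(?!.*\bJane(?:'?s|\(s\))? car\b).*\bcar\b.*")
# Matches a line only if some form of Jane directly precedes "car".
with_jane = re.compile(r"^.*\bJane(?:'?s|\(s\))? car\b.*")

samples = [
    "the car is red",
    "here is Jane car is red",
    "here is Janes car is red",
    "here is Jane's car is red",
    "here is Jane(s) car is red",
]

for s in samples:
    print(s, "|", bool(without_jane.match(s)), "|", bool(with_jane.match(s)))
```

Only the first sample is accepted by without_jane; the other four are accepted by with_jane instead.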
To match all occurrences of car except the ones that have Jane in front of them, you can match what you don't want to keep and capture what you do want to keep.
Then in Python you can use re.findall, which returns the capture group values, and remove the empty entries from the result.
\bJane(?:'?s|\(s\))? car\b|\b(car)\b
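As a sketch of the "match what you don't want, capture what you do" idea with re.findall (the helper name cars_without_jane is a hypothetical choice for illustration):

```python
import re

# First alternative consumes "Jane car" variants without capturing anything;
# second alternative captures a standalone "car".
pattern = re.compile(r"\bJane(?:'?s|\(s\))? car\b|\b(car)\b")

def cars_without_jane(text):
    """Return the captured 'car' occurrences, dropping the empty strings
    that findall yields for the Jane alternatives."""
    return [hit for hit in pattern.findall(text) if hit]

print(cars_without_jane("red car\n and Janes car"))
print(cars_without_jane("here is Jane(s) car is red"))
```

A non-empty result means the document contains at least one "car" that is not preceded by a form of Jane.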
Hello, I have a dataset where I want to match my keywords with the location. The problem is that the locations "Afghanistan", "Kabul", or "Helmund" appear in my dataset in over 150 combinations, including spelling mistakes, different capitalization, and the city or town attached to the name. What I want to do is create a separate column that returns the value 1 if any of the character sequences "afg", "Afg", "kab", or "helm" are contained in the location. I am not sure if upper or lower case makes a difference.
For instance, there are hundreds of location combinations like these: Jegdalak, Afghanistan, Afghanistan,Ghazni♥, Kabul/Afghanistan,
I have tried this code, and it works if the location matches the keyword exactly, but there is too much variation to write every exception down:
keywords = ['Afghanistan', 'Kabul', 'Herat', 'Jalalabad', 'Kandahar', 'Mazar-i-Sharif', 'Kunduz', 'Lashkargah', 'mazar', 'afghanistan', 'kabul', 'herat', 'jalalabad', 'kandahar']

# how to make a column that shows rows with a certain keyword..
def keyword_solution(value):
    strings = value.split()
    if any(word in strings for word in keywords):
        return 1
    else:
        return 0

taleban_2['keyword_solution'] = taleban_2['location'].apply(keyword_solution)
# below will return the rows where the value is 1 (an integer, not the string '1')
taleban_2[taleban_2['keyword_solution'].isin([1])].head(5)
I just need to replace this logic so that the column "keyword_solution" flags every location that contains "Afg", "afg", "kab", "Kab", "kund", or "Kund".
Given the following:
Sentences from the New York Times
Remove all non-alphanumeric characters
Change everything to lowercase, thereby removing the need for different word variations
Split the sentence into a list or set. I used set because of the long sentences.
Add to the keywords list as needed
Matching words from two lists
'afgh' in ['afghanistan']: False
'afgh' in 'afghanistan': True
Therefore, the list comprehension searches for each keyword, in each word of word_list.
[True if word in y else False for y in x for word in keywords]
This allows the list of keywords to be shorter (i.e. given afgh, afghanistan is not required)
import re
import pandas as pd
keywords = ['jalalabad',
            'kunduz',
            'lashkargah',
            'mazar',
            'herat',
            'afgh',
            'kab',
            'kand']
df = pd.DataFrame({'sentences': ['The Taliban have wanted the United States to pull troops out of Afghanistan Turkey has wanted the Americans out of northern Syria and North Korea has wanted them to at least stop military exercises with South Korea.',
'President Trump has now to some extent at least obliged all three — but without getting much of anything in return. The self-styled dealmaker has given up the leverage of the United States’ military presence in multiple places around the world without negotiating concessions from those cheering for American forces to leave.',
'For a president who has repeatedly promised to get America out of foreign wars, the decisions reflect a broader conviction that bringing troops home — or at least moving them out of hot spots — is more important than haggling for advantage. In his view, decades of overseas military adventurism has only cost the country enormous blood and treasure, and waiting for deals would prolong a national disaster.',
'The top American commander in Afghanistan, Gen. Austin S. Miller, said Monday that the size of the force in the country had dropped by 2,000 over the last year, down to somewhere between 13,000 and 12,000.',
'“The U.S. follows its interests everywhere, and once it doesn’t reach those interests, it leaves the area,” Khairullah Khairkhwa, a senior Taliban negotiator, said in an interview posted on the group’s website recently. “The best example of that is the abandoning of the Kurds in Syria. It’s clear the Kabul administration will face the same fate.”',
'afghan']})
# substitute non-alphanumeric characters with a space
df['sentences'] = df['sentences'].apply(lambda x: re.sub(r'[\W_]+', ' ', x))
# create a new column with the set of all lowercase words
df['word_list'] = df['sentences'].apply(lambda x: set(x.lower().split()))
# check whether any keyword is a substring of any word
df['location'] = df.word_list.apply(lambda x: any(word in y for y in x for word in keywords))
# final
print(df.location)
0     True
1    False
2    False
3     True
4     True
5     True
Name: location, dtype: bool
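For the substring flags the question asks for (afg, kab, kund, helm in any capitalization), a case-insensitive regular expression is another option. The following is a minimal sketch using only the standard library; the prefix list and the sample locations are illustrative assumptions, not data from the original dataset:

```python
import re

# Prefixes named in the question; extend the list as needed.
prefixes = ["afg", "kab", "kund", "helm"]
# re.IGNORECASE makes "Afg", "AFG" and "afg" equivalent.
pattern = re.compile("|".join(map(re.escape, prefixes)), re.IGNORECASE)

def keyword_solution(value):
    """Return 1 if any prefix occurs anywhere in the location, else 0."""
    return int(bool(pattern.search(str(value))))

locations = [
    "Jegdalak, Afghanistan",
    "Afghanistan,Ghazni♥",
    "Kabul/Afghanistan",
    "KABUL city",
    "Baghdad",
]
flags = [keyword_solution(loc) for loc in locations]
print(flags)
```

With a DataFrame, the function could be applied as in the question, taleban_2['location'].apply(keyword_solution). pandas also offers Series.str.contains with case=False, which should give the same flags after conversion to integers.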
I have a dataframe, and one of the columns contains a bunch of random text. Within the random text is one name per row. I would like to create a new column within the dataframe that contains only the name. All of these names start with a capital letter and are preceded by phrases like "Meet", "name is", or "hello to". I believe I should use regex, but I'm not sure beyond that.
Example texts from a dataframe cells:
"This is John. He is a rock star on tour in Australia." (desired name is John)
"Meet Randy. He probably has the best hairdo on planet Earth." (desired name is Randy)
"Say hello to Mike! His moustache won first prize at the county fair." (desired name is Mike)
I think the code should be something like:
df['name'] = df['text'].str.extract(r'____________')
First get the regex pattern. My logic, looking at your examples, is that every name:
starts with a capital letter,
has a space before it,
has a punctuation character after it (such as an exclamation mark or full stop),
has a space after that character; otherwise even Earth would be counted, which we do not want.
The regex for this is:
re1='(\\s+)' # White Space 1
re2='((?:[A-ZÀ-ÿ][a-zÀ-ÿ]+))' # Word 1
re3='([.!,?\\-])' # Any Single Character 1
re4='(\\s+)' # White Space 2
I use this website to get my regex: https://txt2re.com/
Now do:
df['name'] = df['text'].str.extract(re1+re2+re3+re4, expand=True)[1]
Output:
0      John
1     Randy
2      Mike
3    Amélie
Name: name, dtype: object
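The combined pattern can also be checked outside pandas. This is a minimal sketch with the standard re module; first_name is a hypothetical helper, and group 2 corresponds to column [1] of the str.extract result above:

```python
import re

re1 = r'(\s+)'                      # whitespace before the name
re2 = r'((?:[A-ZÀ-ÿ][a-zÀ-ÿ]+))'    # capitalized word (the name)
re3 = r'([.!,?\-])'                 # punctuation right after the name
re4 = r'(\s+)'                      # whitespace after the punctuation
name_pattern = re.compile(re1 + re2 + re3 + re4)

def first_name(text):
    """Return the first capitalized word that matches the pattern, or None."""
    match = name_pattern.search(text)
    return match.group(2) if match else None

texts = [
    "This is John. He is a rock star on tour in Australia.",
    "Meet Randy. He probably has the best hairdo on planet Earth.",
    "Say hello to Mike! His moustache won first prize at the county fair.",
]
names = [first_name(t) for t in texts]
print(names)
```

Note that the pattern returns the first capitalized word followed by punctuation and whitespace, so a text such as "In Paris, meet Anna." would yield Paris rather than Anna.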
Suppose I have the string
'apples are red. this apple is green. pears are sometimes red, but not usually. pears are green. apples are yummy. lizards are green.'
and I want to use regular expressions to pull the sentences in that string that mention either apple or pear first and then its color, red or green. So I basically want a list returned that has:
["apples are red.", "this apple is green.", "pears are sometimes red, but not usually.", "pears are green."]
I can pull a regular expression for just apples and pears or green and red with something like
re.findall(r'([^.]*?apple[^.]*|[^.]*?pear[^.]*)', string)
and
re.findall(r'([^.]*?red[^.]*|[^.]*?green[^.]*)', string)
but how do I put these two together when I want the fruit (apple/pear) to come first in the string, followed by the color at some later point in the sentence?
You can use parentheses to group sub-expressions:
re.findall(r"[^.]*\b(?:apple|pear)[^.]*\b(?:red|green)\b[^.]*\.", string)
For example:
>>> import re
>>> a = 'apples are red. this apple is green. pears are sometimes red, but not usually. pears are green. apples are yummy. lizards are green.'
>>> re.findall(r"[^.]*\b(?:apple|pear)[^.]*\b(?:red|green)\b[^.]*\.", a)
['apples are red.', ' this apple is green.',
' pears are sometimes red, but not usually.', ' pears are green.']
Alternatively, use this pattern: (?:^|\b)(?=[^.]*(?:apple|pear)[^.]*(?:red|green))([^.]+\.)
I recommend you read about NLTK (Natural Language Toolkit). It's a Python package for text processing.
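The two patterns can be compared side by side. In this minimal sketch the variable names are illustrative; the first result is stripped because the grouped pattern keeps the space that follows the previous sentence:

```python
import re

text = ('apples are red. this apple is green. pears are sometimes red, '
        'but not usually. pears are green. apples are yummy. lizards are green.')

# Grouped alternation: fruit first, then color, all inside one sentence.
grouped = re.findall(
    r"[^.]*\b(?:apple|pear)[^.]*\b(?:red|green)\b[^.]*\.", text)
stripped = [sentence.strip() for sentence in grouped]

# Lookahead variant: starts at a word boundary, so no leading space is captured.
lookahead = re.findall(
    r"(?:^|\b)(?=[^.]*(?:apple|pear)[^.]*(?:red|green))([^.]+\.)", text)

print(stripped)
print(lookahead)
```

Both lists contain the same four sentences; "apples are yummy." and "lizards are green." are skipped because they lack either a color or a fruit.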
I tried using this function on a paragraph consisting of 3 sentences with abbreviations.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import re


def splitParagraphIntoSentences(paragraph):
    ''' break a paragraph into sentences
        and return a list '''
    # to split by multiple characters
    # regular expressions are easiest (and fastest)
    sentenceEnders = re.compile(r'[.!?][\s]{1,2}[A-Z]')
    sentenceList = sentenceEnders.split(paragraph)
    return sentenceList


if __name__ == '__main__':
    p = "While other species (e.g. horse mango, M. foetida) are also grown ,Mangifera indica – the common mango or Indian mango – is the only mango tree. Commonly cultivated in many tropical and subtropical regions, and its fruit is distributed essentially worldwide.In several cultures, its fruit and leaves are ritually used as floral decorations at weddings, public celebrations and religious "
    sentences = splitParagraphIntoSentences(p)
    for s in sentences:
        print(s.strip())
The first character of the next sentence is eliminated.
Output received:
While other Mangifera species (e.g. horse mango, M. foetida) are also grown on a
more localized basis, Mangifera indica ΓÇô the common mango or Indian mango ΓÇô
is the only mango tree
ommonly cultivated in many tropical and subtropical regions, and its fruit is di
stributed essentially worldwide.In several cultures, its fruit and leaves are ri
tually used as floral decorations at weddings, public celebrations and religious.
Thus the string got split into only 2 strings, and the first character of the next sentence got eliminated. Also, some strange characters can be seen; I guess Python wasn't able to convert the hyphen.
In case I alter the regex to [.!?][\s]{1,2}, I get:
While other species (e.g
horse mango, M
foetida) are also grown ,Mangifera indica ΓÇô the common mango or Indian mango Γ
Çô is the only mango tree
Commonly cultivated in many tropical and subtropical regions, and its fruit is d
istributed essentially worldwide.In several cultures, its fruit and leaves are r
itually used as floral decorations at weddings, public celebrations and religiou
s
Thus even the abbreviations get split.
The regex you want is:
[.!?][\s]{1,2}(?=[A-Z])
You want a positive lookahead assertion, which means you want to match the pattern if it's followed by a capital letter, but not match the capital letter.
The reason only the first one got matched is you don't have a space after the 2nd period.
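A minimal sketch of the lookahead fix follows. The shortened paragraph is an illustrative stand-in for the one in the question, and the keep_punct variant with a lookbehind is an added assumption, not part of the answer:

```python
import re

# The lookahead checks for a capital letter without consuming it,
# so the next sentence keeps its first character.
sentence_enders = re.compile(r'[.!?]\s{1,2}(?=[A-Z])')

# Assumed variant: a lookbehind also leaves the terminator on the sentence.
keep_punct = re.compile(r'(?<=[.!?])\s{1,2}(?=[A-Z])')

paragraph = ("While other species (e.g. horse mango, M. foetida) are grown. "
             "Commonly cultivated worldwide.In several cultures it is used.")

parts = sentence_enders.split(paragraph)
parts_with_punct = keep_punct.split(paragraph)

for part in parts:
    print(part)
```

The abbreviations "e.g." and "M." are not split because a lowercase letter follows them, and "worldwide.In" is not split because there is no whitespace after the period.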