Python: find a specific pattern of text in a string

I would like to extract a specific type of text from a string.
Luxyry 2 bedroom apartment
Deluxe apartment 2 bedroom
Super luxyry 3 bedroom apartment
1 Bedroom studio apartment
This is the text I have, and I want to extract "1 Bedroom", "2 bedroom", or "3 bedroom" from it.
The pattern is always the same: {no_of_bedroom} bedroom.
How can I extract this in Python?

You can use a regex like this:
import re
text = """
Luxyry 2 bedroom apartment
Deluxe apartment 2 bedroom
Super luxyry 3 bedroom apartment
1 Bedroom studio apartment
"""
res = re.findall(r'\d+ [Bb]edroom', text)
print(res)
# Use 'set()' if you want unique values
# print(set(res))
# {'3 bedroom', '1 Bedroom', '2 bedroom'}
Output:
['2 bedroom', '2 bedroom', '3 bedroom', '1 Bedroom']
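Explanation:
\d+ : one or more digits
  \d : matches a digit (equivalent to [0-9])
  + : matches the previous token one or more times
(space) : matches the single space between the number and the word
[Bb] : matches a single character, either B or b
edroom : matches the rest of the word literally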
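If the number itself is what you need, a capture group together with the re.IGNORECASE flag may be more convenient than spelling out [Bb]. The following is only a sketch, assuming the count is always written in digits; the names listings and bedroom_counts are made up for illustration.

```python
import re

listings = [
    "Luxyry 2 bedroom apartment",
    "Deluxe apartment 2 bedroom",
    "Super luxyry 3 bedroom apartment",
    "1 Bedroom studio apartment",
]

# (\d+) captures only the digits; re.IGNORECASE covers "bedroom" and "Bedroom".
pattern = re.compile(r"(\d+)\s+bedroom", re.IGNORECASE)

bedroom_counts = []
for listing in listings:
    match = pattern.search(listing)
    # search() returns None when a listing has no "<n> bedroom" phrase.
    bedroom_counts.append(int(match.group(1)) if match else None)

print(bedroom_counts)  # [2, 2, 3, 1]
```

Using search() per listing keeps each result tied to its source line, which findall() on one big string does not.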

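You can make use of the re module (it is part of the standard library, so there is nothing to install):
import re
text = '''Luxyry 2 bedroom apartment
Deluxe apartment 2 bedroom
Super luxyry 3 bedroom
apartment 1 Bedroom studio apartment'''
result = re.findall(r"\d+\s[Bb]edroom", text)
print(f"Result: {result}")
\d+ will match 1 or more digits, \s matches the whitespace character after them, and [Bb] accepts either capitalisation.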

Related

How to split text based on numbers with dots in Python?

I have following simple text:
2 of 5 deliveries some text some text... 1. 3 of 5 items some text some text... 2. 1 of 5 items found in box some text...
Now I want the text to be split on the numbers [0.-9.] as follows (each row represents one entry in a list):
2 of 5 deliveries some text some text...,
3 of 5 items some text some text...,
1 of 5 items found in box some text...
This is the desired output. However, it does not really work with regex with re.split('([0\.-9\.]+)', text). It always separates by numbers only. What would be the most clever way to convert this with Python?
You can use the following pattern:
>>> re.split(r'\s+\d+\.\s+', text)
['2 of 5 deliveries some text some text...',
'3 of 5 items some text some text...',
'1 of 5 items found in box some text...']
EXPLANATION:
>>> re.split(r'''
\s+ # Matches leading spaces to the separator
\d+ # Matches one or more digits
\. # Matches '.' character
\s+ # Matches trailing spaces after the separator
''', text, flags=re.VERBOSE)
['2 of 5 deliveries some text some text...',
'3 of 5 items some text some text...',
'1 of 5 items found in box some text...']
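As a side note on why the pattern from the question splits on every number: inside square brackets, \.-9 is read as a range from "." to "9", so the class matches dots, slashes and all digits rather than "a digit followed by a dot". The sketch below illustrates this on the sample text; the variable names broken and parts are only for illustration.

```python
import re

text = ("2 of 5 deliveries some text some text... 1. "
        "3 of 5 items some text some text... 2. "
        "1 of 5 items found in box some text...")

# Inside [...], "\.-9" is a range from "." to "9", so the class matches
# ".", "/" and every digit, not "a digit followed by a dot".
broken = re.findall(r"[0\.-9\.]+", text)
print(broken[:4])  # ['2', '5', '...', '1.']

# Requiring whitespace, digits, a literal dot, then whitespace
# only hits the numbered separators.
parts = re.split(r"\s+\d+\.\s+", text)
print(parts)
```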
Try this:
import re
text = '2 of 5 deliveries some text some text... 1. 3 of 5 items some text some text... 2. 1 of 5 items found in box some text...'
print(re.split(r'[0-9]\.', text))
Output:
['2 of 5 deliveries some text some text... ', ' 3 of 5 items some text some text... ', ' 1 of 5 items found in box some text...']

Python regex, negate a set of characters in between a string

I have several sets of strings in which numbers are followed by words, or numbers and words are mixed together.
For example,
"Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"
I am trying to separate the street names from the apartment numbers.
Here is my current code:
import re
pattern_street = re.compile(r'[A-Za-z]+\s?\w+\s?[A-Za-z]+\s?[A-Za-z]+',re.X)
pattern_apartmentnumber = re.compile(r'(^\d+\s? | [A-Za-z]+[\s?]+[0-9]+$)',re.X)
for i in ["Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"]:
    match_street = pattern_street.search(i)
    match_apartmentnumber = pattern_apartmentnumber.search(i)
    fin_street = match_street[0]
    fin_apartmentnumber = match_apartmentnumber[0]
    print("street--",fin_street)
    print("apartmentnumber--",fin_apartmentnumber)
which prints:
street-- Street 50 No
apartmentnumber-- No 40
street-- saint bakers holy street
apartmentnumber-- 5
street-- Syndicate street
apartmentnumber-- 32
I want to remove the "No" from the first street name. i.e. if there is any street with No followed by a number at the end, that needs to be taken as the apartment number,
and not as the street.
How can I do this for my above example strings?
First try the case where there is a "No 123" at the end, using a positive lookahead.
If that does not match, fall back to a street pattern without it.
pattern_street = re.compile(r'[A-Za-z]+[\s\w]+(?=\s[Nn]o\s\d+$)|[A-Za-z]+[\s\w]+',re.X)
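To check the lookahead pattern against the three sample strings, something like the following could be used. It is a sketch only: the re.X flag is dropped because the pattern contains no whitespace or comments, and the names addresses and streets are illustrative.

```python
import re

# Alternative 1 stops before a trailing " No <digits>";
# alternative 2 is the plain fallback when no such suffix exists.
pattern_street = re.compile(r"[A-Za-z]+[\s\w]+(?=\s[Nn]o\s\d+$)|[A-Za-z]+[\s\w]+")

addresses = ["Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"]
streets = [pattern_street.search(address)[0] for address in addresses]
print(streets)  # ['Street 50', 'saint bakers holy street', 'Syndicate street']
```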
You can also find the street name with the following pattern, which uses a negative lookahead to stop matching as soon as "No" appears. The match may keep a trailing space, so strip the result.
pattern_street = re.compile(r'[A-Za-z]+((?!No).)+',re.X)

Remove letters in word within pandas column if the word follows specific pattern

I have used an API to download info related to companies and topics. Unfortunately, some of the topic/company names were downloaded with the letter b at the start and at the end. I do not want to replace them one by one and I am looking for a regular expression that can help me identify all the substrings that start and end with 'b' and remove the 'b'.
import pandas as pd
news = {'Text':['bNikeb invests in shoes', 'bAdidasb invests in t-shirts', 'dog drank water'], 'Source':['NYT', 'WP', 'Guardian']}
news_df = pd.DataFrame(news)
outcome = {'Text':['Nike invests in shoes', 'Adidas invests in t-shirts', 'dog drank water'], 'Source':['NYT', 'WP', 'Guardian']}
outcome_df = pd.DataFrame(outcome)
Thanks!
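How about trying this pattern? (pandas 2.0 and later need regex=True, because str.replace treats the pattern as a literal string by default.)
news_df.Text.str.replace(r'\bb(\w+)b\b', r'\1', regex=True)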
Out[1054]:
0 Nike invests in shoes
1 Adidas invests in t-shirts
2 dog drank water
Name: Text, dtype: object
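The pattern can also be tried without pandas, using re.sub from the standard library. This is a sketch with one extra, made-up input ("bob left") to show a caveat: any real word that happens to start and end with b is trimmed too.

```python
import re

texts = [
    "bNikeb invests in shoes",
    "bAdidasb invests in t-shirts",
    "dog drank water",
    "bob left",  # hypothetical input: a real word that starts and ends with "b"
]

# \b anchors at word boundaries; (\w+) keeps everything between the two b's.
pattern = re.compile(r"\bb(\w+)b\b")
cleaned = [pattern.sub(r"\1", t) for t in texts]
print(cleaned)
# ['Nike invests in shoes', 'Adidas invests in t-shirts', 'dog drank water', 'o left']
```

If such words can occur in the data, a stricter pattern (for example, requiring a capital letter after the first b) may be safer.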

Capturing only specific sections/patterns of string with Regex

I have the following strings, which always follow a standard format:
'On 10/31/2018, Sally Brown picked 25 apples at the orchard.'
'On 11/01/2018, John Smith picked 12 peaches at the orchard.'
'On 09/15/2018, Jim Roe picked 10 pears at the orchard.'
I want to extract certain data fields into a series of lists:
['10/31/2018','Sally Brown','25','apples']
['11/01/2018','John Smith','12','peaches']
['09/15/2018','Jim Roe','10','pears']
As you can see, I need some of the sentence structure to be recognized, but not captured, so the program has context for where the data is located. The Regex that I thought would work is:
(?<=On\s)\d{2}\/\d{2}\/\d{4},\s(?=[A-Z][a-z]+\s[A-Z][a-z]+)\s.+?(?=\d+)\s(?=[a-z]+)\sat\sthe\sorchard\.
But of course, that is incorrect somehow.
This may be a simple question for someone, but I'm having trouble finding the answer. Thanks in advance, and someday when I'm more skilled I'll pay it forward on here.
Use \w+ to match a word (\w is equivalent to [a-zA-Z0-9_] for ASCII text).
import re
text = '''On 10/31/2018, Sally Brown picked 25 apples at the orchard.
On 11/01/2018, John Smith picked 12 peaches at the orchard.
On 09/15/2018, Jim Roe picked 10 pears at the orchard.'''
arr = re.findall(r'On\s(.*?),\s(\w+\s\w+)\s\w+\s(\d+)\s(\w+)', text)
print(arr)
# [('10/31/2018', 'Sally Brown', '25', 'apples'),
# ('11/01/2018', 'John Smith', '12', 'peaches'),
# ('09/15/2018', 'Jim Roe', '10', 'pears')]
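Since the question asks for lists and wants the sentence structure checked but not captured, a stricter variant with named groups might look like the sketch below. It assumes every name is exactly two words and every sentence matches the template in full; the group names date, name, count and fruit are chosen here for illustration.

```python
import re

sentences = [
    "On 10/31/2018, Sally Brown picked 25 apples at the orchard.",
    "On 11/01/2018, John Smith picked 12 peaches at the orchard.",
    "On 09/15/2018, Jim Roe picked 10 pears at the orchard.",
]

# Literal parts ("On", "picked", "at the orchard.") give context but are not captured.
pattern = re.compile(
    r"On\s(?P<date>\d{2}/\d{2}/\d{4}),\s"
    r"(?P<name>\w+\s\w+)\spicked\s"
    r"(?P<count>\d+)\s(?P<fruit>\w+)\sat\sthe\sorchard\."
)

rows = []
for sentence in sentences:
    match = pattern.fullmatch(sentence)
    if match:  # skip sentences that do not follow the template
        rows.append(list(match.groups()))

print(rows)
```

match.groupdict() would return the same fields keyed by group name if a dict is preferred over a list.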

Extracting required names from a given format

I have a text file containing data as shown below. I have to extract some required names from it. I am trying the below code but not getting the required results.
The file contains data as below:
Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432
The code which I am trying:
import re
pattern = re.compile(r'(Leader|Head\\Organiser|Captain|Vice Captain).*(\w+)',re.I)
matches=pattern.findall(line)
for match in matches:
    print(match)
Expected Output:
Tim Lee
Sam Mathews
Jocey David
Jacob Green
import re
line = r'''
Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432'''
pattern = re.compile(r'(?:Leader|Head(?:\\Organiser|\\Secretary)?|Captain|Vice Captain)\W+(\w+(?:\s+\w+)?)',re.I)
matches=pattern.findall(line)
for match in matches:
    print(match)
Explanation:
(?: : start non capture group
Leader : literally
| : OR
Head : literally
(?: : start non capture group
\\Organiser : literally
| : OR
\\Secretary : literally
)? : end group, optional
| : OR
Captain : literally
| : OR
Vice Captain : literally
) : end group
\W+ : 1 or more non word character
( : start group 1
\w+ : 1 or more word char
(?: : non capture group
\s+ : 1 or more spaces
\w+ : 1 or more word char
)? : end group, optional
) : end group 1
Result for given example:
Tim Lee
Sam Mathews
Alica Mills
Maya Hill
Jocey David
Jacob Green
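The breakdown above can also be kept inside the pattern itself with re.VERBOSE. This is a sketch of the same regex, not a different method; note that a plain space is ignored in verbose mode, so "Vice Captain" has to be written as Vice[ ]Captain. The variable names data and names are illustrative.

```python
import re

# Raw string so that "\O" and "\S" in the data stay literal backslashes.
data = r'''
Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432'''

pattern = re.compile(r'''
    (?:Leader
      |Head(?:\\Organiser|\\Secretary)?
      |Captain
      |Vice[ ]Captain      # a bare space would be ignored in verbose mode
    )
    \W+                    # separator such as " : " or ":- "
    (\w+(?:\s+\w+)?)       # group 1: one or two words of the name
''', re.I | re.X)

names = pattern.findall(data)
print(names)
```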
Given:
s = r'''
Leader : Tim Lee ; 34567
Head\Organiser: Sam Mathews; 11:53 am
Head: Alica Mills; 45612
Head\Secretary: Maya Hill; #53190
Captain- Jocey David # 45123
Vice Captain:- Jacob Green; -65432'''
You can get the names like so:
>>> [e.rstrip() for e in re.findall(r'[:-]+[ \t]+(.*?)[;#]',s)]
['Tim Lee', 'Sam Mathews', 'Alica Mills', 'Maya Hill', 'Jocey David', 'Jacob Green']
Or, create a dict of the titles and associated names:
>>> {k:v.rstrip() for k,v in re.findall(r'^\s*(Leader|Head\\Organiser|Head|Head\\Secretary|Captain|Vice Captain)\s*[:-]+[ \t]+(.*?)[;#]',s, re.M)}
{'Leader': 'Tim Lee', 'Head\\Organiser': 'Sam Mathews', 'Head': 'Alica Mills', 'Head\\Secretary': 'Maya Hill', 'Captain': 'Jocey David', 'Vice Captain': 'Jacob Green'}
Which then can be restricted to the titles desired:
>>> {k:v.rstrip() for k,v in re.findall(r'^\s*(Leader|Head\\Organiser|Captain|Vice Captain)\s*[:-]+[ \t]+(.*?)[;#]',s, re.M)}
{'Leader': 'Tim Lee', 'Head\\Organiser': 'Sam Mathews', 'Captain': 'Jocey David', 'Vice Captain': 'Jacob Green'}
And if you just want the names (dicts preserve insertion order in Python 3.7+, and in CPython 3.6 as an implementation detail, so they will be in string order):
>>> {k:v.rstrip() for k,v in re.findall(r'^\s*(Leader|Head\\Organiser|Captain|Vice Captain)\s*[:-]+[ \t]+(.*?)[;#]',s, re.M)}.values()
dict_values(['Tim Lee', 'Sam Mathews', 'Jocey David', 'Jacob Green'])
