Cleaning a column using regex to remove characters based on conditions - Python

I am trying to figure out how I would deal with the following situation:
I have raw data that has been manually input and contains several unnecessary characters, and I need to clean the column.
Anything after a symbol such as (-, /, !, #) should be removed if it contains fewer than 5 letters.
Raw data
NYC USA - LND UK
GBKTG-U
DUB AE- EUUSA
USA -TY
SG !S
CNZOS !C SEA
GAGAX"T
AEU DGR# UK,GBR
Desired Output
LND UK
GBKTG
EUUSA
USA
SG
CNZOS
GAGAX
UK GBR

Split each line between origin and destination using the regex groups, adjusting the separator ([^\w\s]) as needed. Next, count the number of letters on the right side of the separator symbol, checking it against the stated minimum number of letters.
Details:
(.*?) : capture group - zero or more characters (except line ending), non-greedy
[^\w\s] : followed by any character that is not a letter, digit, underscore ([a-zA-Z0-9_]) or whitespace
(.*) : capture group - zero or more characters (except line ending)
File sample.txt used as input
NYC USA - LND UK
GBKTG-U
DUB AE- EUUSA
USA -TY
SG !S
CNZOS !C SEA
GAGAX"T
AEU DGR# UK,GBR
import re

# Read the raw data
with open("sample.txt", "r") as f:
    txt = f.read()

dest = []
# Split each line into the parts before and after the first separator symbol
r = re.findall(r"(.*?)[^\w\s](.*)", txt)
for left, right in r:
    # Keep the right side if it has at least 5 letters, otherwise keep the left side
    if sum(i.isalpha() for i in right) >= 5:
        dest.append(right.strip())
    else:
        dest.append(left.strip())
print(dest)
['LND UK', 'GBKTG', 'EUUSA', 'USA', 'SG', 'CNZOS', 'GAGAX', 'UK,GBR']

Related

How to extract all comma delimited numbers inside () brackets and ignore any text

I am trying to extract the comma delimited numbers inside () brackets from a string. I can get the numbers if they are alone on a line, but I can't seem to find a solution to get the numbers when other surrounding text is involved. Any help will be appreciated. Below is the code that I currently use in Python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
    refline = line[line.find('(')+1:line.find(')')]
    if not re.search('[a-zA-Z]', refline):
Remove the ^ and $; that is what's preventing you from getting all the numbers. Also, the gm flags won't work in Python's re.
You can change your regex to ([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The above pattern matches numbers that begin with 1-9 followed by one or more digits, but only when they are immediately preceded by an opening bracket or comma and followed by a comma or closing bracket.
Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: At least one digit from 1-9, then any further digits (would \d+ work too?)
(?:,[1-9]\d*)*: Zero or more numbers, each after a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
Using the pypi regex module you could also use capture groups:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Regex demo | Python demo
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for _, m in enumerate(matches, start=1):
print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}

Extract textual data between two strings in a text file using Python

Let's say I have a text file with the below content:
Quetiapine fumarate Drug substance This document
Povidone Binder USP
This line doesn't contain any medicine name.
This line contains Quetiapine fumarate which shouldn't be extracted as it not present at the
beginning of the line.
Dibasic calcium phosphate dihydrate Diluent USP is not present in the csv
Lactose monohydrate Diluent USNF
Magnesium stearate Lubricant USNF
Lactose monohydrate, CI 77491
0.6
Colourant
E 172
Some lines to break the group.
Silicon dioxide colloidal anhydrous
(0.004
Gliding agent
Ph Eur
Adding some random lines.
Povidone
(0.2
Lubricant
Ph Eur
I have a csv containing a list of medicine names which I want to match inside the .txt file, and I want to extract all the data that is present between 2 unique medicines (when the medicine name is at the beginning of the line). (Examples of medicines from the csv file are 'Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate', etc.)
I want to iterate each line of my text file and create groups from one medicine to another.
This should only happen if the medicine name is present at the start of the newline and is not present in between a line.
Expected output:
['Quetiapine fumarate Drug substance This document'],
['Povidone Binder USP'],
['Lactose monohydrate Diluent USNF'],
['Magnesium stearate Lubricant USNF'],
[Lactose monohydrate, CI 77491
0.6
Colourant
E 172],
[Povidone
(0.2
Lubricant
Ph Eur]
Can someone please help me do this in Python?
Attempt so far:
medicines = ('Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate')
result = []
with open('C:/Users/test1.txt', 'r', encoding='utf8') as f:
    for line in f:
        if any(line.startswith(med) for med in medicines):
            result.append(line.strip())
This captures the output up to this point, but I need the remaining part as well:
['Quetiapine fumarate Drug substance This document'],
['Povidone Binder USP'],
['Lactose monohydrate Diluent USNF'],
['Magnesium stearate Lubricant USNF']
I need to capture all the text from one medicine to another as shown in the Expected output. If there is only one medicine name present on a line, I need to capture the data from the following lines and form a group, where a number comes on the line right after the medicine, as shown in the output.
You may use this regex with the re.M option:
^\s*(?:Quetiapine fumarate|Povidone|Magnesium stearate|Lactose monohydrate).*(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?
See the regex demo
Details
^ - start of a line
\s* - 0 or more whitespaces
(?:Quetiapine fumarate|Povidone|Magnesium stearate|Lactose monohydrate) - your list of medicines
.* - rest of the line
(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})? - an optional string of
\n - newline
[^\w\n]* - 0+ chars other than word and newline chars
\d*\.?\d+ - a number
[^\w\n]* - 0+ chars other than word and newline chars
(?:\n.*){2} - two occurrences of a newline and the rest of the line
Python (see Python demo online):
import re
medicines = ['Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate']
result = []
med = r"(?:{})".format("|".join(map(re.escape, medicines)))
pattern = re.compile(r"^\s*" + med + r".*(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?", re.M)
with open('C:/Users/test1.txt', 'r', encoding='utf8') as f:
    result = pattern.findall(f.read())
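If it helps to see the end result, printing the captured groups (continuing from the snippet above) should give output matching the expected groups from the question:
for group in result:
    print(repr(group))
# 'Quetiapine fumarate Drug substance This document'
# 'Povidone Binder USP'
# 'Lactose monohydrate Diluent USNF'
# 'Magnesium stearate Lubricant USNF'
# 'Lactose monohydrate, CI 77491\n0.6\nColourant\nE 172'
# 'Povidone\n(0.2\nLubricant\nPh Eur'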

Match words only if preceded by specific pattern

I have a string from a NWS bulletin:
LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley
My aim is to extract a couple of fields with regular expressions. In the first string I want "AAD" and from the second string I want "RECHNX". I have tried:
( )\w{3} #for the first string
and
\w{6} #for the 2nd string
But these match every 3-character and 6-character string leading up to the one I want.
Assuming the fields you want to extract are always in capital letters and preceded by 6 digits and a space, this regular expression would do the trick:
(?<=\d{6}\s)[A-Z]+
Demo: https://regex101.com/r/dsDHTs/1
Edit: if you want to match up to two alpha-numeric uppercase words preceded by 6 digits, you can use:
(?<=\d{6}\s)([A-Z0-9]+\b)\s(?:([A-Z0-9]+\b))*
Demo: https://regex101.com/r/dsDHTs/5
If you have a specific list of valid fields, you could also simply use:
(AAD|TMLB|RECHNX|RR4HNX)
https://regex101.com/r/dsDHTs/3
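For reference, a minimal sketch applying the lookbehind pattern with re.findall, assuming both bulletin lines are held in one string:
import re

text = """LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley"""

# Uppercase words preceded by 6 digits and a whitespace character
print(re.findall(r'(?<=\d{6}\s)[A-Z]+', text))
# ['AAD', 'RECHNX']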
Since the substring you want to extract is a word that follows a number, separated by a space, you can use re.search with the following regex (given your input stored in s):
re.search(r'\b\d+ (\w+)', s).group(1)
To read the first groups of word chars from each line, you can use a pattern like
(\w+) (\w+) (\w+) (\w+).
Then, from the first line read group No. 4 and from the second line read group No. 3.
Look at the following program. It prints four groups from each source line:
import re
txt = """LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
KHNX 141001 RECHNX Weather Service San Joaquin Valley"""
n = 0
pat = re.compile(r'(\w+) (\w+) (\w+) (\w+)')
for line in txt.splitlines():
    n += 1
    print(f'{n:2}: {line}')
    mtch = pat.search(line)
    if mtch:
        gr = [ mtch.group(i) for i in range(1, 5) ]
        print(f'    {gr}')
The result is:
 1: LTUS41 KCAR 141558 AAD TMLB Forecast for the National Parks
    ['LTUS41', 'KCAR', '141558', 'AAD']
 2: KHNX 141001 RECHNX Weather Service San Joaquin Valley
    ['KHNX', '141001', 'RECHNX', 'Weather']

Insert a space after the second or third capital letter - Python

I have a pandas dataframe containing addresses. Some are formatted correctly like 481 Rogers Rd York ON. Others have a space missing between the city quadrant and the city name, for example: 101 9 Ave SWCalgary AB or even possibly: 101 9 Ave SCalgary AB, where SW refers to south west and S to south.
I'm trying to find a regex that will add a space between second and third capital letters if they are followed by lowercase letters, or if there are only 2 capitals followed by lower case, add a space between the first and second.
So far, I've found that ([A-Z]{2,3}[a-z]) will match the situation correctly, but I can't figure out how to look back into it and sub at position 2 or 3. Ideally, I'd like to use an index to split the match at [-2:] but I can't figure out how to do this.
I found that re.findall('(?<=[A-Z][A-Z])[A-Z][a-z].+', '101 9 Ave SWCalgary AB')
will return the last part of the string and I could use a look forward regex to find the start and then join them but this seems very inefficient.
Thanks
You may use
df['Test'] = df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
See this regex demo
Details
\b - a word boundary
([A-Z]{1,2}) - Capturing group 1 (later referred with \1 from the replacement pattern): one or two uppercase letters
([A-Z][a-z]) - Capturing group 2 (later referred with \2 from the replacement pattern): an uppercase letter + a lowercase one.
If you want to specifically match city quadrants, you may use a bit more specific regex:
df['Test'] = df['Test'].str.replace(r'\b([NS][EW]|[NESW])([A-Z][a-z])', r'\1 \2', regex=True)
See this regex demo. Here, [NS][EW]|[NESW] matches N or S followed by E or W, or a single N, E, S or W.
Pandas demo:
import pandas as pd
df = pd.DataFrame({'Test':['481 Rogers Rd York ON',
                           '101 9 Ave SWCalgary AB',
                           '101 9 Ave SCalgary AB']})
>>> df['Test'].str.replace(r'\b([A-Z]{1,2})([A-Z][a-z])', r'\1 \2', regex=True)
0      481 Rogers Rd York ON
1    101 9 Ave SW Calgary AB
2     101 9 Ave S Calgary AB
Name: Test, dtype: object
You can use
([A-Z]{1,2})(?=[A-Z][a-z])
to capture the first (or first and second) capital letters, and then use lookahead for a capital letter followed by a lowercase letter. Then, replace with the first group and a space:
re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', str)
https://regex101.com/r/TcB4Ph/1
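A minimal sketch of that substitution applied to the sample addresses from the question (plain re, outside pandas):
import re

addresses = ['481 Rogers Rd York ON',
             '101 9 Ave SWCalgary AB',
             '101 9 Ave SCalgary AB']

# Insert a space after the captured capital(s) when another capital + lowercase follows
fixed = [re.sub(r'([A-Z]{1,2})(?=[A-Z][a-z])', r'\1 ', a) for a in addresses]
print(fixed)
# ['481 Rogers Rd York ON', '101 9 Ave SW Calgary AB', '101 9 Ave S Calgary AB']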

Delete whitespace characters in quoted columns in tab-separated file?

I had a similar text file and got great help to solve it, but I have to admit that I'm too new to programming in general, and to regex in particular, to modify the great Python script below, written by steveha for a similar file.
EDIT: I want to get rid of tabs, newlines and any characters other than "normal" words, numbers, exclamation marks, question marks and dots, in order to get a clean CSV and from there do text analysis.
import re
import sys
_, infile, outfile = sys.argv
s_pat_row = r'''
"([^"]+)" # match column; this is group 1
\s*,\s* # match separating comma and any optional white space
(\S+) # match column; this is group 2
\s*,\s* # match separating comma and any optional white space
"((?:\\"|[^"])*)" # match string data that can include escaped quotes
'''
pat_row = re.compile(s_pat_row, re.MULTILINE|re.VERBOSE)
s_pat_clean = r'''[\x01-\x1f\x7f]'''
pat_clean = re.compile(s_pat_clean)
row_template = '"{}",{},"{}"\n'
with open(infile, "rt") as inf, open(outfile, "wt") as outf:
    data = inf.read()
    for m in re.finditer(pat_row, data):
        row = m.groups()
        cleaned = re.sub(pat_clean, ' ', row[2])
        words = cleaned.split()
        cleaned = ' '.join(words)
        outrow = row_template.format(row[0], row[1], cleaned)
        outf.write(outrow)
I can't figure out how to modify it to match this file, where there is \t separating the columns and text instead of a number in the second column. My objective is to have the cleaned text ready for content analysis, but I seem to have years of learning before I get to that point where I'm familiar... ;-)
Could anyone help me modify it so it works on the data file below?
"from_user" "to_user" "full_text"
"_________erik_" "systersandra gigantarmadillo kuttersmycket NULL NULL" "\"men du...? är du bi?\". \"näeh. Tyvärr\" #fikarum,Alla vi barn i bullerbyn goes #swecrime. #fjällbackamorden,Ny mobil och en väckare som ringer 0540. #fail,När jag måste välja, \"äta kakan eller ha den kvar\", så carpe diar jag kakan på sekunden. #mums,Låter RT #bobhansson: Om pessimisterna lever 7 år kortare är det ju inte alls konstigt att dom är det.
http://t.co/a1t5ht4l2h,Finskjortan på tork: Check! Dags att leta fram gå-bort skorna..."
If your CSV file uses tabs for delimiters rather than commas, then in s_pat_row you should replace the , characters with \t. Also, the second field in your sample text file includes spaces, so the (\S+) pattern in s_pat_row will not match it. You could try this instead:
s_pat_row = r'''
"([^"]+)" # match column; this is group 1
\s*\t\s* # match separating tab and any optional white space
([^\t]+) # match a string of non-tab chars; this is group 2
\s*\t\s* # match separating tab and any optional white space
"((?:\\"|[^"])*)" # match string data that can include escaped quotes
'''
That may be sufficient to solve your immediate problem.
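As a quick sanity check, here is a small self-contained sketch of the modified pattern compiled the same way as in the original script and run on one made-up tab-separated row (hypothetical data, not from your file):
import re

s_pat_row = r'''
"([^"]+)"           # match column; this is group 1
\s*\t\s*            # match separating tab and any optional white space
([^\t]+)            # match a string of non-tab chars; this is group 2
\s*\t\s*            # match separating tab and any optional white space
"((?:\\"|[^"])*)"   # match string data that can include escaped quotes
'''
pat_row = re.compile(s_pat_row, re.MULTILINE | re.VERBOSE)

# Hypothetical one-row sample with an escaped quote in the third column
row = '"user_a"\t"user_b user_c"\t"some \\"quoted\\" text! 123"'
m = pat_row.search(row)
print(m.groups())
# ('user_a', '"user_b user_c"', 'some \\"quoted\\" text! 123')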
