I had a similar text file and got great help solving it, but I have to admit that I'm too new to programming in general, and to regex in particular, to modify the great Python script below, written by steveha for a similar file.
EDIT: I want to get rid of tabs, newlines and any characters other than "normal" words, numbers, exclamation marks, question marks and dots, in order to get a clean CSV and do text analysis from there.
import re
import sys
_, infile, outfile = sys.argv
s_pat_row = r'''
"([^"]+)" # match column; this is group 1
\s*,\s* # match separating comma and any optional white space
(\S+) # match column; this is group 2
\s*,\s* # match separating comma and any optional white space
"((?:\\"|[^"])*)" # match string data that can include escaped quotes
'''
pat_row = re.compile(s_pat_row, re.MULTILINE|re.VERBOSE)
s_pat_clean = r'''[\x01-\x1f\x7f]'''
pat_clean = re.compile(s_pat_clean)
row_template = '"{}",{},"{}"\n'
with open(infile, "rt") as inf, open(outfile, "wt") as outf:
    data = inf.read()
    for m in re.finditer(pat_row, data):
        row = m.groups()
        cleaned = re.sub(pat_clean, ' ', row[2])
        words = cleaned.split()
        cleaned = ' '.join(words)
        outrow = row_template.format(row[0], row[1], cleaned)
        outf.write(outrow)
I can't figure out how to modify it to match this file, where \t separates the columns and the second column holds text instead of a number. My objective is to have the cleaned text ready for content analysis, but I seem to have years of learning ahead of me before I get to the point where I'm familiar with this... ;-)
Could anyone help me modify it so it works on the data file below?
"from_user" "to_user" "full_text"
"_________erik_" "systersandra gigantarmadillo kuttersmycket NULL NULL" "\"men du...? är du bi?\". \"näeh. Tyvärr\" #fikarum,Alla vi barn i bullerbyn goes #swecrime. #fjällbackamorden,Ny mobil och en väckare som ringer 0540. #fail,När jag måste välja, \"äta kakan eller ha den kvar\", så carpe diar jag kakan på sekunden. #mums,Låter RT #bobhansson: Om pessimisterna lever 7 år kortare är det ju inte alls konstigt att dom är det.
http://t.co/a1t5ht4l2h,Finskjortan på tork: Check! Dags att leta fram gå-bort skorna..."
If your CSV file uses tabs for delimiters rather than commas, then in s_pat_row you should replace the , characters with \t. Also, the second field in your sample text file includes spaces, so the (\S+) pattern in s_pat_row will not match it. You could try this instead:
s_pat_row = r'''
"([^"]+)" # match column; this is group 1
\s*\t\s* # match separating tab and any optional white space
([^\t]+) # match a string of non-tab chars; this is group 2
\s*\t\s* # match separating tab and any optional white space
"((?:\\"|[^"])*)" # match string data that can include escaped quotes
'''
That may be sufficient to solve your immediate problem.
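To check the tab-separated pattern without touching the real file, here is a minimal sketch that runs it on a small in-memory sample. The two sample rows and the names `sample` and `rows` are made up for illustration, and stripping the quotes from the second column is an assumption about what you want in the output.

```python
import re

# Tab-separated variant of the row pattern (VERBOSE mode ignores layout whitespace)
pat_row = re.compile(r'''
    "([^"]+)"            # group 1: first quoted column
    \s*\t\s*             # separating tab and optional white space
    ([^\t]+)             # group 2: everything up to the next tab
    \s*\t\s*             # separating tab and optional white space
    "((?:\\"|[^"])*)"    # group 3: quoted text that may contain \" escapes
''', re.VERBOSE)

# Hypothetical sample: a header row plus one row whose text field
# contains escaped quotes and an embedded newline
sample = (
    '"from_user"\t"to_user"\t"full_text"\n'
    '"erik"\t"anna bob NULL"\t"\\"hello?\\" first line\nsecond line"\n'
)

rows = []
for m in pat_row.finditer(sample):
    user, to_user, text = m.groups()
    # Collapse every run of whitespace (tabs, newlines) into a single space
    text = ' '.join(text.split())
    # Assumption: the surrounding quotes of column 2 are not wanted
    rows.append((user, to_user.strip('"'), text))

for row in rows:
    print(row)
```

Note that group 2 keeps its surrounding quotes because `[^\t]+` matches them like any other character, which is why the sketch strips them afterwards.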
Related
I am trying to extract the comma-delimited numbers inside () brackets from a string. I can get the numbers if they are alone on a line, but I can't seem to find a solution that gets the numbers when there is other surrounding text. Any help will be appreciated. Below is the code that I currently use in Python.
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
line = each.strip()
regex_criteria = r'"^([1-9][0-9]*|\([1-9][0-9]*\}|\(([1-9][0-9]*,?)+[1-9][0-9]*\))$"gm'
if (line.__contains__('(') and line.__contains__(')') and not re.search('[a-zA-Z]', refline)):
refline = line[line.find('(')+1:line.find(')')]
if not re.search('[a-zA-Z]', refline):
Remove the ^ and $ anchors; they are what prevents you from getting all the numbers. Also, the gm flags won't work in Python's re module.
You can change your regex to :([1-9][0-9]*|\([1-9][0-9]*\}|\(?:([1-9][0-9]*,?)+[1-9][0-9]*\)) if you want to get each number separately.
Or you can simplify your pattern to (?<=[(,])[1-9][0-9]+(?=[,)])
Test regex here: https://regex101.com/r/RlGwve/1
Python code:
import re
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
print(re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', line))
# ['101065', '101066', '101067', '101065']
(?<=[(,])[1-9][0-9]+(?=[,)])
The pattern above matches a digit from 1-9 followed by one or more further digits, but only when the number is directly preceded by ( or , and directly followed by , or ).
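One consequence worth knowing: because `[0-9]+` requires at least one digit after the leading 1-9, single-digit references are skipped. A small sketch of the difference, using a made-up test string (`text` is not from the question):

```python
import re

# Hypothetical input with one-digit and multi-digit references
text = "See (7,12,101065) and also (3)"

# Pattern from the answer: a digit 1-9 followed by ONE or more further digits
two_plus = re.findall(r'(?<=[(,])[1-9][0-9]+(?=[,)])', text)

# Variant: a digit 1-9 followed by ZERO or more further digits
one_plus = re.findall(r'(?<=[(,])[1-9][0-9]*(?=[,)])', text)

print(two_plus)  # ['12', '101065']
print(one_plus)  # ['7', '12', '101065', '3']
```

If your reference numbers always have several digits, as in the sample, either form gives the same result.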
Here's another option:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)*(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?<=\(): Lookbehind for (
[1-9]+\d*: A number that starts with a nonzero digit (\d+ would work too, but it would also accept leading zeros)
(?:,[1-9]\d*)*: Zero or multiple numbers after a ,
(?=\)): Lookahead for )
Result for your line:
[['101065', '101066', '101067'], ['101065']]
If you only want the comma separated numbers:
pattern = re.compile(r"(?<=\()[1-9]+\d*(?:,[1-9]\d*)+(?=\))")
results = [match[0].split(",") for match in pattern.finditer(line)]
(?:,[1-9]\d*)+: One or more numbers after a ,
Result:
[['101065', '101066', '101067']]
Now, if your line could also look like
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines ( 101065,101066, 101067 )
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
then you have to sprinkle the pattern with \s* and remove the whitespace afterwards (here with str.translate and str.maketrans):
pattern = re.compile(r"(?<=\()\s*[1-9]+\d*(?:\s*,\s*[1-9]\d*\s*)*(?=\))")
table = str.maketrans("", "", " ")
results = [match[0].translate(table).split(",") for match in pattern.finditer(line)]
Result:
[['101065', '101066', '101067'], ['101065']]
Using the pypi regex module you could also use capture groups:
\((?P<num>\d+)(?:,(?P<num>\d+))*\)
The pattern matches:
\( Match (
(?P<num>\d+) Capture group, match 1+ digits
(?:,(?P<num>\d+))* Optionally repeat matching , and 1+ digits in a capture group
\) Match )
Example code
import regex
pattern = r"\((?P<num>\d+)(?:,(?P<num>\d+))*\)"
line = """
Abuta has a history of use in the preparation of curares, an arrow poison to cause asphyxiation in hunting
It has also been used in traditional South American and Indian Ayurvedic medicines (101065,101066,101067)
The genus name Cissampelos is derived from the Greek words for ivy and vine (101065)
"""
matches = regex.finditer(pattern, line)
for m in matches:
    print(m.capturesdict())
Output
{'num': ['101065', '101066', '101067']}
{'num': ['101065']}
I have this dialogue between a sender and a receiver on Discord, and I need to remove the tags and the names of the speakers. Removing everything before the colon (:) would work, because then the name of the sender would not matter and I would always delete whoever sent the message.
This is the content of the generic_discord_talk.txt file:
Company: <#!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <#!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <#!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <#!808947310809317387> Yes, I have it in front of me
import collections
import re
import pandas as pd
import matplotlib.pyplot as plt  # to later plot the most frequent words
archivo = open('generic_discord_talk.txt', encoding="utf8")
a = archivo.read()
with open('stopwords-es.txt') as f:
    st = [word for line in f for word in line.split()]
print(st)
stops = set(st)
stopwords = stops.union(set(['you','for','the'])) #OPTIONAL
#print(stopwords)
I have created a regex to detect the tags
regex = re.compile("^(<#!.+>){,1}\s{,}(messegeA|messegeB|messegeC)(<#!.+>){,1}\s{,}$")
regex_tag = re.compile("^<#!.+>")
I need print(st) to return the words without the senders and without the tags.
You could remove both parts using an alternation (|) that matches either from the start of the line up to the first colon, or from <#! up to the first closing >.
^[^:\n]+:\s*|\s*<#![^<>]+>
The pattern matches:
^ Start of the line (with re.M)
[^:\n]+:\s* Match 1+ occurrences of any char except : or a newline, then match : and optional whitespace chars
| Or
\s*<#! Match <#! literally, preceded by optional whitespace chars
[^<>]+ Negated character class, match 1+ occurrences of any char except < and >
> Match > literally
If there can be only digits after <#!
^[^:\n]+:|<#!\d+>
For example
import re
with open('generic_discord_talk.txt', encoding="utf8") as archivo:
    a = archivo.read()
st = re.sub(r"^[^:\n]+:\s*|\s*<#![^<>]+>", "", a, flags=re.M)
If you also want to clear the leading and trailing spaces on each line, you can add this line
st = re.sub(r"^[^\S\n]*|[^\S\n]*$", "", st, flags=re.M)
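Since the goal in the question is to find the most repeated words, here is a minimal sketch of how the cleaned text could then be counted. It uses two lines from the sample dialogue as an in-memory string; the tokenisation with `\w+` and the tiny stopword set are assumptions, not something from the original code.

```python
import re
from collections import Counter

# Two lines from the sample dialogue, kept in memory instead of a file
text = (
    "Customer: Hi <#!808947310809317385>, I need you to help me with the order I have made\n"
    "Client: <#!808947310809317387> Yes, I have it in front of me\n"
)

# Remove the "Name:" prefix of every line and all <#!...> tags
cleaned = re.sub(r"^[^:\n]+:\s*|\s*<#![^<>]+>", "", text, flags=re.M)

# Assumption: a word is any run of word characters, compared in lower case
words = re.findall(r"\w+", cleaned.lower())

# Assumption: a tiny illustrative stopword set (the question loads one from a file)
stopwords = {"you", "for", "the"}

counts = Counter(w for w in words if w not in stopwords)
print(counts.most_common(3))
```

`Counter.most_common(n)` returns the n most frequent words with their counts, which can be passed on to pandas or matplotlib for plotting.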
I think this should work:
import re
data = """Company: <#!808947310809317387> Good morning, technical secretary of X-company, will Maria attend you, how can we help you?
Customer: Hi <#!808947310809317385>, I need you to help me with the order I have made
Company: Of course, she tells me that she has placed an order through the store's website and has had a problem. What exactly is Maria about?
Customer: I add the product to the shopping cart and nothing happens <#!808947310809317387>
Company: Does Maria have the website still open? So I can accompany you during the purchase process
Client: <#!808947310809317387> Yes, I have it in front of me"""
def run():
    for line in data.split("\n"):
        line = re.sub(r"^\w+: ", "", line)  # remove the customer/company part
        line = re.sub(r"<#!\d+>", "", line)  # remove tags
        print(line)
run()
I am trying to figure out how to deal with the following situation:
I have raw data that was entered manually and contains several unnecessary characters, and I need to clean the column.
Anything after a symbol such as -, /, ! or # should be removed if it has fewer than 5 letters.
Raw data
NYC USA - LND UK
GBKTG-U
DUB AE- EUUSA
USA -TY
SG !S
CNZOS !C SEA
GAGAX"T
AEU DGR# UK,GBR
Desired Output
LND UK
GBKTG
EUUSA
USA
SG
CNZOS
GAGAZ
UK GBR
Split each line into origin and destination using regex groups, adjusting the separator ([^\w\s]) as needed. Then count the letters on the right side of the separator symbol and check them against the stated number of letters.
Details:
(.*?) : capture group - zero or more characters (except line endings), non-greedy
[^\w\s] : followed by any character that is not a letter, digit, underscore ([a-zA-Z0-9_]) or whitespace
(.*) : capture group - zero or more characters (except line ending)
File sample.txt used as input
NYC USA - LND UK
GBKTG-U
DUB AE- EUUSA
USA -TY
SG !S
CNZOS !C SEA
GAGAX"T
AEU DGR# UK,GBR
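import re
with open("sample.txt", "r") as f:
    txt = f.read()
dest = []
# Each match is a (left, right) pair split at the first separator symbol
for left, right in re.findall(r"(.*?)[^\w\s](.*)", txt):
    # Keep the right side only if it contains at least 5 letters
    if sum(c.isalpha() for c in right) >= 5:
        dest.append(right.strip())
    else:
        dest.append(left.strip())
print(dest)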
['LND UK', 'GBKTG', 'EUUSA', 'USA', 'SG', 'CNZOS', 'GAGAX', 'UK,GBR']
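The result keeps the comma in 'UK,GBR', while the desired output shows 'UK GBR'. If that matters, a possible variant is to replace any leftover symbols with a space after choosing the side to keep. This is only a sketch: `clean_entry` and `min_letters` are hypothetical names, and it works on one line at a time instead of the whole file.

```python
import re

def clean_entry(line, min_letters=5):
    """Hypothetical helper: keep the right side of the first symbol if it
    has at least min_letters letters, otherwise keep the left side."""
    m = re.match(r"(.*?)[^\w\s](.*)", line)
    if m is None:
        # No separator symbol at all: return the line unchanged (trimmed)
        return line.strip()
    left, right = m.groups()
    keep = right if sum(c.isalpha() for c in right) >= min_letters else left
    # Replace leftover symbols with a space, then collapse whitespace
    return " ".join(re.sub(r"[^\w\s]", " ", keep).split())

raw = ["NYC USA - LND UK", "GBKTG-U", "USA -TY", "AEU DGR# UK,GBR"]
print([clean_entry(line) for line in raw])
```

For a pandas column the same function could be applied with `Series.apply(clean_entry)`.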
Let's say I have a text file with the content below:
Quetiapine fumarate Drug substance This document
Povidone Binder USP
This line doesn't contain any medicine name.
This line contains Quetiapine fumarate which shouldn't be extracted as it not present at the
beginning of the line.
Dibasic calcium phosphate dihydrate Diluent USP is not present in the csv
Lactose monohydrate Diluent USNF
Magnesium stearate Lubricant USNF
Lactose monohydrate, CI 77491
0.6
Colourant
E 172
Some lines to break the group.
Silicon dioxide colloidal anhydrous
(0.004
Gliding agent
Ph Eur
Adding some random lines.
Povidone
(0.2
Lubricant
Ph Eur
I have a CSV containing a list of medicine names that I want to match inside the .txt file, extracting all the data between two unique medicines (when the medicine name is at the beginning of a line). Examples of medicines from the CSV file are 'Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate', etc.
I want to iterate over each line of my text file and create groups from one medicine to the next.
This should only happen if the medicine name is at the start of a line, not in the middle of one.
Expected output:
['Quetiapine fumarate Drug substance This document'],
['Povidone Binder USP'],
['Lactose monohydrate Diluent USNF'],
['Magnesium stearate Lubricant USNF'],
[Lactose monohydrate, CI 77491
0.6
Colourant
E 172],
[Povidone
(0.2
Lubricant
Ph Eur]
Can someone please help me do this in Python?
Attempt so far:
medicines = ('Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate')
result = []
with open('C:/Users/test1.txt', 'r', encoding='utf8') as f:
    for line in f:
        if any(line.startswith(med) for med in medicines):
            result.append(line.strip())
which captures the output up to here, but I need the remaining part as well:
['Quetiapine fumarate Drug substance This document'],
['Povidone Binder USP'],
['Lactose monohydrate Diluent USNF'],
['Magnesium stearate Lubricant USNF']
I need to capture all the text from one medicine to another as shown in Expected Output. If there is only one medicine name present in a line, I need to capture data from the next four lines and form a group where a number will come in the next line after medicine as shown in the output.
You may use this regex with the re.M option:
^\s*(?:Quetiapine fumarate|Povidone|Magnesium stearate|Lactose monohydrate).*(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?
Details
^ - start of a line
\s* - 0 or more whitespaces
(?:Quetiapine fumarate|Povidone|Magnesium stearate|Lactose monohydrate) - your list of medicines
.* - rest of the line
(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})? - an optional string of
\n - newline
[^\w\n]* - 0+ chars other than word and newline chars
\d*\.?\d+ - a number
[^\w\n]* - 0+ chars other than word and newline chars
(?:\n.*){2} - two occurrences of a newline and the rest of the line
Python:
import re
medicines = ['Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate']
result = []
med = r"(?:{})".format("|".join(map(re.escape, medicines)))
pattern = re.compile(r"^\s*" + med + r".*(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?", re.M)
with open('C:/Users/test1.txt', 'r', encoding='utf8') as f:
    result = pattern.findall(f.read())
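To try the pattern without the file, here is a minimal sketch on an in-memory excerpt. The `sample` text is a shortened, slightly adapted version of the question's data, and splitting each match into a list of lines is an assumption about the output shape you want.

```python
import re

medicines = ['Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate']
med = "(?:{})".format("|".join(map(re.escape, medicines)))
pattern = re.compile(
    r"^\s*" + med + r".*(?:\n[^\w\n]*\d*\.?\d+[^\w\n]*(?:\n.*){2})?",
    re.M,
)

# Shortened sample adapted from the question's text file
sample = """Povidone Binder USP
This line mentions Povidone in the middle.
Lactose monohydrate, CI 77491
0.6
Colourant
E 172
Adding some random lines.
Povidone
(0.2
Lubricant
Ph Eur"""

# One list of lines per matched group
groups = [m.group().splitlines() for m in pattern.finditer(sample)]
for g in groups:
    print(g)
```

The second line is not matched because the medicine name does not start the line, and the four-line groups are only captured when the line after the medicine consists of a number.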
I am working with a text file (620KB) that has a list of ID#s followed by full names separated by a comma.
The working regex I've used for this is
^([A-Z]{3}\d+)\s+([^,\s]+)
I want to also capture the first name and middle initial (space delimiter between first and MI).
I tried this by doing:
^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)
Which works, but I want to remove the line break that ends up in the output file. I will be importing the two output files into a database (possibly Access) and I don't want to capture the line breaks. Also, is there a better way of writing the regex?
Full code:
import re
source = open('source.txt')
ticket_list = open('ticket_list.txt', 'w')
id_list = open('id_list.txt', 'w')
for lines in source:
    m = re.search('^([A-Z]{3}\d+)\s+([^\s]+([\D+])+)', lines)
    if m:
        x = m.group()
        print('Ticket: ' + x)
        ticket_list.write(x + "\n")
ticket_list = open('First.txt', 'r')
for lines in ticket_list:
    y = re.search('^(\d+)\s+([^\s]+([\D+])+)', lines)
    if y:
        z = y.group()
        print ('ID: ' + z)
        id_list.write(z + "\n")
source.close()
ticket_list.close()
id_list.close()
Sample Data:
Source:
ABC1000033830 SMITH, Z
100000012 Davis, Franl R
200000655 Gest, Baalio
DEF4528942681 PACO, BETH
300000233 Theo, David Alex
400000012 Torres, Francisco B.
ABC1200045682 Mo, AHMED
DEF1000006753 LUGO, G TO
ABC1200123123 de la Rosa, Maria E.
Depending on what kind of linebreak you're dealing with, a simple positive lookahead may remedy your pattern capturing the linebreak in the result. This was generated by RegexBuddy 4.2.0, and worked with all your test data.
if re.search(r"^([A-Z]{3}\d+)\s+([^,\s]+([\D])+)(?=$)", subject, re.IGNORECASE | re.MULTILINE):
    pass  # Successful match
else:
    pass  # Match attempt failed
Basically, the positive lookahead makes sure that there is a linebreak (in this case, end of line) character directly after the pattern ends. It will match, but not capture the actual end of line.
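One thing to keep in mind is that \D matches any non-digit character, and that includes the newline itself, so a greedy [\D]+ can still run over the line break when each line read from the file ends in "\n". A possible alternative, sketched here under the assumption that every record is "ID, whitespace, name" on a single line, is to capture the name with `.`, which does not match a newline by default. The names `ticket_pat`, `sample_lines` and `tickets` are made up for illustration.

```python
import re

# Assumption: "<3 letters + digits> <name...>" on one line
ticket_pat = re.compile(r"^([A-Z]{3}\d+)\s+(.+?)\s*$")

# A few lines from the sample data, with the trailing newline that
# iterating over a file object would give
sample_lines = [
    "ABC1000033830 SMITH, Z\n",
    "100000012 Davis, Franl R\n",
    "ABC1200123123 de la Rosa, Maria E.\n",
]

tickets = []
for line in sample_lines:
    m = ticket_pat.search(line)
    if m:
        # "." never matches "\n" here, so both groups are free of line breaks
        tickets.append((m.group(1), m.group(2)))

print(tickets)
```

Calling `line.rstrip("\n")` before matching would be another simple way to keep the line break out of the result.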