regex extract list of strings from file - python

I have an input file (input.txt) which contains some data that follows a standard format similar to the following lines:
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"#de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"#en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"#de .
I want to extract the English strings, which lie between " and "#en, into outputfile-en.txt, and the German strings, which lie between " and "#de, into outputfile-de.txt.
In this example outputfile-en.txt should contain:
Political inclusion
and outputfile-de.txt should contain:
Politische Inklusion
Radiologische Kampfmittel
Which regex is suitable here?

With such a simple pattern there's no need for regex at all, especially not to re-iterate over the same data to pick up different languages - you can stream parse and write your results on the fly:
with open("input.txt", "r") as f:  # open the input file
    file_handles = {}  # a map of our individual output file handles
    for line in f:  # read it line by line
        rindex = line.rfind("#")  # find the last `#` character
        language = line[rindex+1:rindex+3]  # grab the following two characters as language
        if rindex != -1:  # char found, consider the line...
            lindex = line.rfind("\"", 0, rindex-1)  # find the preceding quotation
            if lindex != -1:  # found, we have a match
                if language not in file_handles:  # add a file handle for this language:
                    file_handles[language] = open("outputfile-{}.txt".format(language), "w")
                # write the found slice between `lindex` and `rindex` + a new line
                file_handles[language].write(line[lindex+1:rindex-1] + "\n")
    for handle in file_handles.values():  # lets close our output file handles
        handle.close()
This should generally be faster than the regex approach, since it makes a single pass over the data. As a bonus, it works with any language: if you have ...#it lines, it will write outputfile-it.txt as well.

You could do something like this:
import re

# named `text` rather than `str` so the built-in str type is not shadowed
text = """<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"#de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"#en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"#de . """
# [^"]* keeps each match inside a single pair of quotes
german = re.compile('"([^"]*)"#de')
english = re.compile('"([^"]*)"#en')
print(german.findall(text))
print(english.findall(text))
This would give you
['Politische Inklusion', 'Radiologische Kampfmittel']
and
['Political inclusion'].
Now you only have to iterate over those results and write them to the appropriate file.
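For the last step, here is a minimal sketch of one way to group the matches by language and write them out. It swaps the two per-language patterns for a single pattern that also captures the language tag; the names LABEL, group_by_language and write_groups are made up for this illustration, and the output file name template is taken from the question.

```python
import re
from collections import defaultdict

# Captures the quoted text and the two-letter language tag that follows it.
LABEL = re.compile(r'"([^"]*)"#([a-z]{2})\b')

def group_by_language(text):
    """Return {language: [labels...]} for every "label"#xx occurrence."""
    groups = defaultdict(list)
    for label, language in LABEL.findall(text):
        groups[language].append(label)
    return dict(groups)

def write_groups(groups, template="outputfile-{}.txt"):
    """Write one label per line into one file per language."""
    for language, labels in groups.items():
        with open(template.format(language), "w", encoding="utf-8") as out:
            out.write("\n".join(labels) + "\n")

sample = '''<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Politische Inklusion"#de .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Political inclusion"#en .
<descriptor/nnn> <http://www.nnn.org/2004/02/skos/core#prefLabel> "Radiologische Kampfmittel"#de .
'''
groups = group_by_language(sample)
print(groups)
```

Calling write_groups(groups) would then create outputfile-de.txt and outputfile-en.txt in the current directory.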

Related

any better way to validate string pattern after the key word found in txt file?

I am trying to validate strings in an input file, using re to compose a pattern and parse the file. The input is a CSV file with multiple lines of SQL-like script. I want to validate the words of each line in order: first check keyword1, then keyword2, then keyword3. I did this with nested for loops, but I feel there must be a better way to handle it. Can anyone suggest how to tackle this?
use case
CREATE application vat4_xyz_rcc_clm1_trm_soc WITH some text
CREATE flow flow_src_vat4_xyz_rcc_clm1_trm_soc some text
CREATE stream main_stream_vat4_xyz_rcc_clm1_trm_soc with some text
CREATE OR REPLACE target comp_tgt_vat4_xyz_rcc_clm1_trm_soc some text
to handle this, I tried this:
kword=['CREATE','CREATE OR REPLACE']
with open('myinput.txt', 'r+') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        for string in line:
            for word in kword:
                if string is word:
                    atype=next(string)
                    print('type is', atype) # such as application, flow, stream
                    if atype in ['application', 'flow', 'steam']:
                        value=next(atype) ## such as vat4_xyz_rcc_clm1_trm_soc, flow_src_vat4_xyz_rcc_clm1_trm_soc
                        print("name", value)
                    else:
                        print("name not found")
                else:
                    print("type is not correct")
But this code is not efficient. I think re might do a better job here. Does anyone have a better idea?
objective:
Basically, I need to parse each line: if I spot keyword1 such as CREATE, I check the word next to keyword1; if that next word is application, I print it and check the word after it, for which I composed the following pattern:
vat4_xyz_rcc_clm1_trm_soc
pat1=r'^[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
m=re.match(pat1, curStr, re.M)
Each line has a different pattern, such as:
pat1=r'^[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
pat2=r'^\flow_src_[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
pat3=r'^\main_stream_[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
pat4=r'^\comp_tgt_[\vat\d+]_(?:xyz|abs)_rcc_[clm\d+]_trm_(?:soc|aws)'
how can we make this simple for parsing each line with re? any thoughts?
The regex seems like it can be way simpler than what you're trying. How about this:
import re

matcher = re.compile(r"^(CREATE(?: OR REPLACE)?) (\S+) (\S+).*$")
with open("test.txt", "r") as f:
    for line in f:
        if match := matcher.match(line):
            action, action_type, value = match.groups()
            print(f"{action=}, {action_type=}, {value=}")
Outputs:
action='CREATE', action_type='application', value='vat4_xyz_rcc_clm1_trm_soc'
action='CREATE', action_type='flow', value='flow_src_vat4_xyz_rcc_clm1_trm_soc'
action='CREATE', action_type='stream', value='main_stream_vat4_xyz_rcc_clm1_trm_soc'
action='CREATE OR REPLACE', action_type='target', value='comp_tgt_vat4_xyz_rcc_clm1_trm_soc'
If you want to further validate the values, I would take the results from the first regex and pass them to a more specialized regex for each case.
import re

line_matcher = re.compile(r"^(CREATE(?: OR REPLACE)?) (\S+) (\S+).*$")
value_matchers = {
    "application": re.compile(r'^vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
    "flow": re.compile(r'^flow_src_vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
    "stream": re.compile(r'^main_stream_vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
    "target": re.compile(r'^comp_tgt_vat\d+_(xyz|abs)_rcc_clm\d+_trm_(soc|aws)'),
}
with open("test.txt", "r") as file:
    for line in file:
        if not (line_match := line_matcher.match(line)):
            print(f"Invalid line: {line=}")
            continue
        action, action_type, value = line_match.groups()
        if not (value_matcher := value_matchers.get(action_type)):
            print(f"Invalid action type: {line=}")
            continue
        if not value_matcher.match(value):
            print(f"Invalid {action_type} value: {line=}")
            continue
        # Do some work on the items
        ...
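One thing to be aware of: match() only anchors at the start, so a value with extra characters after the expected name would still pass. If that matters, a variation using fullmatch() could look like the sketch below. It works on strings rather than a file so it is easy to try out; check_line, TAIL and the returned status strings are names invented for this example, and the assumption that nothing may follow soc or aws is mine, not the question's.

```python
import re

LINE = re.compile(r"^(CREATE(?: OR REPLACE)?) (\S+) (\S+)")
# Shared tail of every object name; only the prefix differs per object type.
TAIL = r"vat\d+_(?:xyz|abs)_rcc_clm\d+_trm_(?:soc|aws)"
VALUE = {
    "application": re.compile(TAIL),
    "flow": re.compile("flow_src_" + TAIL),
    "stream": re.compile("main_stream_" + TAIL),
    "target": re.compile("comp_tgt_" + TAIL),
}

def check_line(line):
    """Return 'ok' or a short reason why the line was rejected."""
    m = LINE.match(line)
    if not m:
        return "invalid line"
    _action, kind, value = m.groups()
    matcher = VALUE.get(kind)
    if matcher is None:
        return "invalid type"
    # fullmatch rejects names with extra characters after the expected tail
    if not matcher.fullmatch(value):
        return "invalid value"
    return "ok"

for example in (
    "CREATE application vat4_xyz_rcc_clm1_trm_soc WITH some text",
    "CREATE flow flow_src_vat4_xyz_rcc_clm1_trm_soc_extra some text",
):
    print(check_line(example), "<-", example)
```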

"Replace" from central file?

I am trying to extend the replace function. Instead of doing the replacements on individual lines or individual commands, I would like to use the replacements from a central text file.
That's the source:
import os
import feedparser
import pandas as pd
pd.set_option('max_colwidth', -1)
RSS_URL = "https://techcrunch.com/startups/feed/"
feed = feedparser.parse(RSS_URL)
entries = pd.DataFrame(feed.entries)
entries = entries[['title']]
entries = entries.to_string(index=False, header=False)
entries = entries.replace(' ', '\n')
entries = os.linesep.join([s for s in entries.splitlines() if s])
print(entries)
I want to be able to replace words in an RSS feed using a central "replacement" file. The file should have two columns: old word, new word, as in the replace function: replace('old','new').
Output/Print Example:
truck
rental
marketplace
D’Amelio
family
launches
to
invest
up
to
$25M
...
In most cases I want to delete the words that are unnecessary for me, so e.g. replace('to',''). But I also want to be able to change special names, e.g. replace('D'Amelio','DAmelio'). The goal is to reduce the number of words and build up a kind of keyword radar.
Is this possible? I couldn't find any help by googling, but it may well be that I don't know the right terms or can't phrase the question properly.
with open('<filepath>','r') as r:
    # if you remove the ' marks from around your words, you can remove the [1:-1] part of the below code
    words_to_replace = [word.strip()[1:-1] for word in r.read().split(',')]

def replace_words(original_text, words_to_replace):
    for word in words_to_replace:
        original_text = original_text.replace(word, '')
    return original_text
I did not fully understand your question, but as far as I can tell you have strings like cat, dog, etc., and a file containing the replacements for those strings. If that is your requirement, the solution below should work, so try running it.
If that's not what you meant, please comment below.
TXT File(Don't use '' around the strings in Text File):
papa, papi
dog, dogo
cat, kitten
Python File:
your_string = input("Type a string here: ") #string you want to replace
with open('textfile.txt',"r") as file1: #open your file
    lines = file1.readlines()
    for line in lines: #taking the lines of file in one by one using loop
        string1 = f'{line}'
        string1 = string1.split() #split the line of the file into list like ['cat,', 'kitten']
        if your_string == string1[0][:-1]: #comparing the strings of your string with the file
            your_string = your_string.replace(your_string, string1[1]) #If string matches like user has given input cat, it will replace it with kitten.
            print(your_string)
        else:
            pass
If you got the correct answer please upvote my answer as it took my time to make and test the python file.
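Since the question's output has one word per line, another option is to load the two-column file into a dict and look each word up, which replaces whole words only and lets an empty second column mean "delete". This is only a sketch under those assumptions: the file layout (old word, comma, new word), the helper names load_replacements and apply_replacements, and the in-memory stand-in for the replacement file are all illustrative.

```python
import csv
import io

def load_replacements(lines):
    """Read 'old, new' rows into a dict; an empty new word means 'delete'."""
    mapping = {}
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        old = row[0].strip()
        new = row[1].strip() if len(row) > 1 else ""
        mapping[old] = new
    return mapping

def apply_replacements(text, mapping):
    """Replace whole whitespace-separated words; drop words mapped to ''."""
    words = (mapping.get(word, word) for word in text.split())
    return "\n".join(word for word in words if word)

# Stand-in for the central replacement file (hypothetical contents).
replacement_file = io.StringIO("to,\nup,\nD’Amelio, DAmelio\n")
mapping = load_replacements(replacement_file)

feed_words = "D’Amelio\nfamily\nlaunches\nto\ninvest\nup\nto\n$25M"
result = apply_replacements(feed_words, mapping)
print(result)
```

With a real file, passing the open file object to load_replacements works the same way, because csv.reader accepts any iterable of lines.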

Get the full word(s) by knowing only just a part of it

I am searching through a text file line by line and I want to get back all strings that contain the prefix AAAXX1234. For example, my text file has these lines:
Hello my ID is [123423819::AAAXX1234_3412] #I want that(AAAXX1234_3412)
Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212] #I want both of them(AAAXX1234_3413, AAAXX1234_4212)
Hello my ID is [123423819::XXWWF1234_3098] #I don't care about that
The code I have just checks whether the line starts with "Hello my ID is":
with open(file_hrd,'r',encoding='utf-8') as hrd:
    hrd=hrd.readlines()
    for line in hrd:
        if line.startswith("Hello my ID is"):
            #do something
Try this:
import re

with open(file_hrd,'r',encoding='utf-8') as hrd:
    res = []
    for line in hrd:
        # raw string so that \d is passed to the regex engine unchanged
        res += re.findall(r'AAAXX1234_\d+', line)
    print(res)
Output:
['AAAXX1234_3412', 'AAAXX1234_3413', 'AAAXX1234_4212']
I’d suggest parsing your lines and extracting the information into meaningful parts. That way, you can use a simple startswith on the ID part of the line. This also lets you control where you look for these prefixes, e.g. in case a line contains additional data that could also contain something that looks like an ID.
Something like this:
if line.startswith('Hello my ID is '):
    idx_start = line.index('[')
    idx_end = line.index(']', idx_start)
    idx_separator = line.index(':', idx_start, idx_end)
    num = line[idx_start + 1:idx_separator]
    ids = line[idx_separator + 2:idx_end].split(':')
    print(num, ids)
This would give you the following output for your three example lines:
123423819 ['AAAXX1234_3412']
738281937 ['AAAXX1234_3413', 'AAAXX1234_4212']
123423819 ['XXWWF1234_3098']
With that information, you can then check the ids for a prefix:
if any(x.startswith('AAAXX1234') for x in ids):
    print('do something')
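Put together, the parsing approach might look like the following sketch. The function names parse_id_line and ids_with_prefix are invented here, and the sample lines are the ones from the question, including their trailing comments. Because only the text between the brackets is inspected, IDs mentioned in the trailing comment are not counted a second time.

```python
def parse_id_line(line, marker="Hello my ID is "):
    """Return (number, ids) for a matching line, or None for other lines.

    Raises ValueError if a matching line lacks the [number::id:id] part.
    """
    if not line.startswith(marker):
        return None
    start = line.index("[")
    end = line.index("]", start)
    separator = line.index("::", start, end)
    number = line[start + 1:separator]
    ids = line[separator + 2:end].split(":")
    return number, ids

def ids_with_prefix(lines, prefix="AAAXX1234"):
    """Collect every bracketed ID that starts with `prefix`."""
    found = []
    for line in lines:
        parsed = parse_id_line(line)
        if parsed is not None:
            found.extend(i for i in parsed[1] if i.startswith(prefix))
    return found

sample = [
    "Hello my ID is [123423819::AAAXX1234_3412] #I want that(AAAXX1234_3412)",
    "Hello my ID is [738281937::AAAXX1234_3413:AAAXX1234_4212] #I want both of them",
    "Hello my ID is [123423819::XXWWF1234_3098] #I don't care about that",
]
matches = ids_with_prefix(sample)
print(matches)
```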
Using regular expressions through the re module and its findall() function should be enough:
import re

with open('file.txt') as file:
    prefix = 'AAAXX1234'
    lines = file.read().splitlines()
    output = list()
    for line in lines:
        # rf'...' keeps \d intact while still interpolating the prefix
        output.extend(re.findall(rf'{prefix}_[\d]+', line))
You can do it with findall and the regex r'AAAXX1234_[0-9]+': it finds every part of the string that starts with AAAXX1234_ and then grabs all the digits after it. Change + to * if you want it to match 'AAAXX1234_' on its own as well.

Parsing paragraph out of text file in Python?

I am trying to parse certain paragraphs out of multiple text files and store them in a list. All the text files have a format similar to this:
MODEL NUMBER: A123
MODEL INFORMATION: some info about the model
DESCRIPTION: This will be a description of the Model. It
could be multiple lines but an empty line at the end of each.
CONCLUSION: Sold a lot really profitable.
Now I can pull out the information where it is one line, but I am having trouble when I encounter something that spans multiple lines (like 'Description'). The length of the description is not known, but I know it ends with an empty line (which would mean using '\n'). This is what I have so far:
import os

dir = 'Test'
DESCRIPTION = []
for files in os.listdir(dir):
    if files.endswith('.txt'):
        with open(dir + '/' + files) as File:
            reading = File.readlines()
            for num, line in enumerate(reading):
                if 'DESCRIPTION:' in line:
                    Start_line = num
                if len(line.strip()) == 0:
I don't know if it's the best approach, but what I was trying to do with if len(line.strip()) == 0: is to create a list of blank lines and then find the first value greater than Start_line. I saw the bisect module for this.
In the end, if I print DESCRIPTION, I would like my data to be:
['DESCRIPTION: Description from file 1',
'DESCRIPTION: Description from file 2',
'DESCRIPTION: Description from file 3']
Thanks.
Regular expression. Think about it this way: you have a pattern that will allow you to cut any file into pieces you will find palatable: "newline followed by capital letter"
re.split is your friend
Take a string
"THE
BEST things
in life are
free
IS
YET
TO
COME"
As a string:
p = "THE\nBEST things\nin life are\nfree\nIS\nYET\nTO\nCOME"
c = re.split('\n(?=[A-Z])', p)
Which produces list c
['THE', 'BEST things\nin life are\nfree', 'IS', 'YET', 'TO', 'COME']
I think you can take it from there: this separates each file into a list of strings, with each string being its own section, including its sub-contents. From there you can find the "DESCRIPTION" element and store it. It is important to note that the regex recognises the pattern "newline followed by a capital letter" but cuts only at the newline: the capital letter sits inside the lookahead brackets (?=...), so it is not consumed, while the newline is outside them and is removed.
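Applied to the file format from the question, this could look like the sketch below. Note that it uses a slightly stricter pattern than "newline followed by a capital letter": it splits only before lines that start with an upper-case label ending in a colon, so a wrapped description line that happens to begin with a capital letter is not cut off. The sample text (with blank lines between sections, as the question describes) and the name extract_description are assumptions made for illustration.

```python
import re

sample = (
    "MODEL NUMBER: A123\n"
    "\n"
    "MODEL INFORMATION: some info about the model\n"
    "\n"
    "DESCRIPTION: This will be a description of the Model. It\n"
    "could be multiple lines but an empty line at the end of each.\n"
    "\n"
    "CONCLUSION: Sold a lot really profitable.\n"
)

# Split at a newline that is followed by an "UPPERCASE WORDS:" label.
HEADER_SPLIT = re.compile(r"\n(?=[A-Z][A-Z ]*:)")

def extract_description(text):
    """Return the DESCRIPTION section as a single line, or None if absent."""
    for section in HEADER_SPLIT.split(text):
        if section.startswith("DESCRIPTION:"):
            # Collapse the line wraps and the trailing blank line.
            return " ".join(section.split())
    return None

description = extract_description(sample)
print(description)
```

Collecting the result of extract_description for each file in the directory would then give the list the question asks for.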

Search for words (exact matches) in multiple texts using Python

I want to let the user choose and open multiple texts and perform a search for exact matches in the texts.
I want the encoding to be unicode.
If I search for "cat" I want it to find "cat", "cat,", ".cat" but not "catalogue".
I don't know how to let the user search for two words ("cat" OR "dog") in all of the texts at the same time.
Maybe I can use RE?
So far I have just made it possible for the user to insert the path to the directory containing the text files to search in. Now I want to let the user (raw_input) search for two words in all of the texts, and then print and save the results (e.g. "search_word_1" and "search_word_2" found in document1.txt, "search_word_2" found in document4.txt) in a separate document (search_words).
import re, os

path = raw_input("insert path to directory :")
ex_library = os.listdir(path)
search_words = open("sword.txt", "w") # File or maybe list to put in the results
thelist = []
for texts in ex_library:
    f = os.path.join(path, texts)
    text = open(f, "r")
    textname = os.path.basename(texts)
    print textname
    for line in text.read():
    text.close()
Regular expressions are an appropriate tool in this case.
I want it to find "cat", "cat,", ".cat" but not "catalogue".
Pattern: r'\bcat\b'
\b matches at a word boundary.
how to let the user search for two words ("cat" OR "dog") in all of the texts at the same time
Pattern: r'\bcat\b|\bdog\b'
To print "filename: <words that are found in it>":
#!/usr/bin/env python
import os
import re
import sys

def fgrep(words, filenames, encoding='utf-8', case_insensitive=False):
    findwords = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                           flags=re.I if case_insensitive else 0).findall
    for name in filenames:
        with open(name, 'rb') as file:
            text = file.read().decode(encoding)
            found_words = set(findwords(text))
            yield name, found_words

def main():
    words = [w.decode(sys.stdin.encoding) for w in sys.argv[1].split(",")]
    filenames = sys.argv[2:] # the rest is filenames
    for filename, found_words in fgrep(words, filenames):
        print "%s: %s" % (os.path.basename(filename), ",".join(found_words))

main()
Example:
$ python findwords.py 'cat,dog' /path/to/*.txt
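The script above is written for Python 2 (print statement, str.decode). As a rough Python 3 sketch of the same word-boundary idea, reduced to a function that works on an already decoded string so it can be tried without any files (find_words is a name chosen for this example):

```python
import re

def find_words(words, text, case_insensitive=False):
    """Return the set of whole-word matches of any of `words` in `text`."""
    # \b on both sides keeps "cat" from matching inside "catalogue".
    pattern = "|".join(r"\b%s\b" % re.escape(w) for w in words)
    flags = re.IGNORECASE if case_insensitive else 0
    return set(re.findall(pattern, text, flags))

found = find_words(["cat", "dog"], "A cat, a .cat and a catalogue; no hounds.")
print(found)
```

In Python 3, reading each file with open(name, encoding='utf-8') already yields decoded text that can be passed to this function.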
Alternative solutions
To avoid reading the whole file in memory:
import codecs
...
with codecs.open(name, encoding=encoding) as file:
    found_words = set(w for line in file for w in findwords(line))
You could also print found words in the context they are found e.g., print lines with highlighted words:
from colorama import init # pip install colorama
init(strip=not sys.stdout.isatty()) # strip colors if stdout is redirected
from termcolor import colored # pip install termcolor
highlight = lambda s: colored(s, on_color='on_red', attrs=['bold', 'reverse'])
...
regex = re.compile("|".join(r"\b%s\b" % re.escape(w) for w in words),
                   flags=re.I if case_insensitive else 0)
for line in file:
    if regex.search(line): # line contains words
        line = regex.sub(lambda m: highlight(m.group()), line)
        yield line
You need to split the text in each file on whitespace and punctuation. Once that's done you can simply look for the words you are searching for in the remaining list. You also need to convert everything to lowercase, unless you also want case sensitive search.
Some (maybe useful) information in addition to the existing answers:
You should be aware that what users mean when they think of a "character" (a grapheme) is not always the same as a Unicode character, and some graphemes can be represented by Unicode characters in more than one way (e.g. a precomposed character vs. a base character plus a combining mark).
To do a search based on graphemes (=what the user expects in most cases) and not on specific Unicode character sequences, you need to normalize your strings before you search.
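As a small illustration of that normalization step, here is a sketch using the standard unicodedata module in Python 3. It normalizes both the search word and the text to NFC before matching; the function name normalized_search and the choice of NFC are assumptions for this example, and normalization handles equivalent encodings of the same character rather than every grapheme-related subtlety.

```python
import re
import unicodedata

def normalized_search(word, text, form="NFC"):
    """Find whole-word matches after normalizing both strings to `form`."""
    word = unicodedata.normalize(form, word)
    text = unicodedata.normalize(form, text)
    return re.findall(r"\b%s\b" % re.escape(word), text)

composed = "caf\u00e9"      # 'é' as one precomposed code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two spellings look identical on screen but are different strings.
print(composed == decomposed)
hits = normalized_search(composed, "un " + decomposed + " noir")
print(hits)
```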
