How can I find some words in files with regex? - python

I have many files and need to categorize them by the words that appear in them,
e.g. [..murder..murderAttempted..] or [murder, murderAttempted], etc.
I tried the code below, but not all of the matches came out. I want to find "murder" and "murderAttempted" in files when they are surrounded by "[ ]".
import os
import re

def func(root_dir):
    for files in os.listdir(root_dir):
        pattern = r'\[.+murder.+murderAttempted.+'
        if "txt" in files:
            f = open(root_dir + files, 'rt', encoding='UTF8')
            for i, line in enumerate(f):
                for match in re.finditer(pattern, line):
                    print(match.group())

This appears to work for me: use pattern = r'\[.*murder.*murderAttempted.*\]' instead of pattern = r'\[.+murder.+murderAttempted.+'. I believe it returns all occurrences of "murder" and "murderAttempted" in files that are surrounded by "[]". The + quantifier requires one or more characters, whereas * also accepts zero, so the pattern still matches when "murder" comes directly after the opening bracket. Also note the added \] at the end, which ensures you only capture strings that are enclosed in brackets.
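The difference between the two patterns can be checked on the sample strings from the question. This is only a minimal sketch: the helper name find_bracketed and the list of samples are made up for illustration.

```python
import re

# Pattern from the question: every ".+" needs at least one character.
strict = re.compile(r'\[.+murder.+murderAttempted.+')
# Pattern from the answer: ".*" also accepts nothing, and "\]" closes the bracket.
relaxed = re.compile(r'\[.*murder.*murderAttempted.*\]')

def find_bracketed(pattern, line):
    """Return every match of pattern in line as a list of strings."""
    return [m.group() for m in pattern.finditer(line)]

samples = [
    "[..murder..murderAttempted..]",
    "[murder, murderAttempted]",
]

for line in samples:
    print(line, find_bracketed(strict, line), find_bracketed(relaxed, line))
```

The strict pattern finds nothing in the second sample because there is no character between "[" and "murder". Keep in mind that .* is greedy, so on a line with several bracketed groups the relaxed pattern runs from the first "[" to the last "]".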

Related

How to print matching strings in python with regex?

I am working on a Python script that would go through a directory with a bunch of files and extract the strings that match a certain pattern.
More specifically, I'm trying to extract the values of serial number and a max-limit, and the lines look something like this:
#serial number = 642E0523D775
max-limit=50M/50M
I've got the script to go through the files, but I'm having an issue with it actually printing the values that I want. Instead of printing the values, I just get the 'Nothing found' output.
I'm thinking that it probably has something to do with the regex I'm using, but I can't for the life of me figure out how to formulate this.
The script I've come up with so far:
import os
import re

# Where I'm searching
user_input = "/path/to/files/"
directory = os.listdir(user_input)

# What I'm looking for
searchstring = ['serial number', 'max-limit']
re_first = re.compile(r'serial.\w.*')
re_second = re.compile(r'max-limit=\w*.\w*')

# Regex combine
regex_list = [re_first, re_second]

# Looking
for fname in directory:
    if os.path.isfile(user_input + os.sep + fname):
        # Full path
        f = open(user_input + os.sep + fname, 'r')
        f_contents = f.read()
        content = fname + f_contents
        files = os.listdir(user_input)
        lines_seen = set()
        for f in files:
            print(f)
            if f not in lines_seen:  # not a duplicate
                for regex in regex_list:
                    matches = re.findall(regex, content)
                    if matches != None:
                        for match in matches:
                            print(match)
                    else:
                        print('Nema')
        f.close()
Per the documentation, the re module's match() looks for "characters at the beginning of a string [that] match the regular expression pattern". Since you are prepending the file name to your file contents in the line:
content=fname + f_contents
and then matching your pattern against the content in the line:
result=re.match(regex, content)
there will never be a match.
Since you want to locate a match anywhere in the string, use search() instead.
See also: search() vs match()
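The effect of the prepended file name can be reproduced in a few lines. This is a sketch under assumptions: the file name router1.cfg and the file contents are hypothetical, and only the max-limit pattern from the question is used.

```python
import re

regex = re.compile(r'max-limit=\w*.\w*')

# Hypothetical file name and contents, joined the same way as in the question.
fname = "router1.cfg"
f_contents = "#serial number = 642E0523D775\nmax-limit=50M/50M\n"
content = fname + f_contents

# match() only looks at the start of the string, which is the file name here.
print(regex.match(content))
# search() scans the whole string and stops at the first hit.
print(regex.search(content).group())
```

The first call prints None and the second prints max-limit=50M/50M.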
Edit
The pattern ^[\w&.\-]+$ provided would match neither serial number = 642E0523D775 as it contains a space (" "), nor max-limit=50M/50M as it contains a forward slash ("/"). Both also contain an equals sign ("=") which cannot be matched by your pattern.
Additionally, the backslash before the dash in this character class is unnecessary, so you may want to remove it: a dash ("-") at the end of a character class is already literal and does not need to be escaped.
A pattern to match both these strings as well could be:
^[\w&. \/=\-]+$
Try it out here
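A quick check of the extended pattern against the two sample lines might look like the sketch below. The third entry, with the leading "#" from the question's sample, is added here as an assumption to show that a character outside the class still prevents a match.

```python
import re

pattern = re.compile(r'^[\w&. \/=\-]+$')

lines = [
    "serial number = 642E0523D775",
    "max-limit=50M/50M",
    "#serial number = 642E0523D775",  # leading "#" is not in the character class
]

# Map each line to whether the whole line consists of allowed characters.
results = {line: bool(pattern.search(line)) for line in lines}
for line, ok in results.items():
    print(ok, line)
```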

Python - Regex - combination of letters and numbers (undefined length)

I am trying to get a File-ID from a text file. In the above example the filename is d735023ds1.htm, which I want to get in order to build another URL. However, the filenames differ in length, so I need a general regular expression that covers all possibilities.
Example filenames
d804478ds1a.htm.
d618448ds1a.htm.
d618448.htm
My code
for cik in leftover_cik_list:
    r = requests.get(filing.url)
    content = str(r.content)
    fileID = None
    for line in content.split("\n"):
        if fileID == None:
            fileIDIndex = line.find("<FILENAME>")
            if fileIDIndex != -1:
                trimmedText = line[fileIDIndex:]
                result = RegEx.search(r"^[\w\d.htm]*$", trimmedText)
                if result:
                    fileID = result.group()
                    print ("fileID",fileID)
    document_link = "https://www.sec.gov/Archives/edgar/data/{0}/{1}/{2}.htm".format(cik, accession_number, fileID)
    print ("Document Link to S-1:", document_link)
import re
...
result = re.search(r'^d\d{1,6}.+\.htm$', trimmedText)
if result:
    fileID = result.group()
^d = Start with a d
\d{1,6} = Look for 1-6 digits; if there could be an unlimited number of digits, replace with \d{1,} (equivalent to \d+)
.+ = Wildcard: one or more of any character
\.htm$ = End in .htm
You should try re.match(), which looks for a pattern at the beginning of the input string. Also, your regex has a problem: you have to add a backslash before the ".", because an unescaped dot means "any character" in a regex.
import re
result = re.match(r'[\w]+\.htm', trimmedText)
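To see why the backslash matters, the sketch below compares an unescaped and an escaped dot. Both input strings are hypothetical; the second one deliberately has no literal dot before "htm".

```python
import re

# Hypothetical inputs: the second one has no literal dot before "htm".
good = "d618448ds1a.htm"
bad = "d618448xhtm"

unescaped = re.compile(r'[\w]+.htm')   # "." matches any character
escaped = re.compile(r'[\w]+\.htm')    # "\." matches only a literal dot

print(bool(unescaped.match(good)), bool(unescaped.match(bad)))
print(bool(escaped.match(good)), bool(escaped.match(bad)))
```

The unescaped version accepts both strings, while the escaped version rejects the one without a real dot.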
Try this regex:
import re
files = [
    "d804478ds1a.htm",
    "d618448ds1a.htm",
    "d618448.htm"
]
for f in files:
    match = re.search(r"d\w+\.htm", f)
    print(match.group())
d804478ds1a.htm
d618448ds1a.htm
d618448.htm
The assumptions in the above are that the file name starts with a d, ends with .htm and contains only letters, digits and underscores.
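Under the same assumptions, the name could also be pulled straight out of the line that carries the <FILENAME> tag. This is a sketch only: the exact shape of that line is assumed, and extract_file_id is a hypothetical helper, not part of the question's code.

```python
import re

# Assumed shape of the line; the real filing text may differ.
line = "<FILENAME>d735023ds1.htm"

# Capture the name after the tag, without the ".htm" extension.
file_id_pattern = re.compile(r"<FILENAME>(d\w+)\.htm")

def extract_file_id(text):
    """Return the file ID without extension, or None if the tag is absent."""
    match = file_id_pattern.search(text)
    return match.group(1) if match else None

print(extract_file_id(line))
```

Capturing the name without its extension fits the document_link format string in the question, which already appends ".htm" after the file ID.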

function to get rid of delimiters in python

I have a function where the user passes in a file and a string, and the code should get rid of the specified delimiters. I am having trouble finishing the part where I loop through the words and remove each of the replacements. The code is below.
def forReader(filename):
    try:
        # Opens up the file
        file = open(filename , "r")
        # Reads the lines in the file
        read = file.readlines()
        # closes the file
        file.close()
        # loops through the lines in the file
        for sentence in read:
            # will split each element by a space
            line = sentence.split()
            replacements = (',', '-', '!', '?', '(', ')', '<', ' = ', ';')
            # will loop through the space delimited line and get rid
            # of the replacements
            for sentences in line:
                pass  # unfinished: this is the part the question asks about
    # Exception thrown if File does not exist
    except FileNotFoundError:
        print('File is not created yet')

forReader("mo.txt")
mo.txt
for ( int i;
After running the function on mo.txt, I would like the output to look like this:
for int i
Here's a way to do this using regex. First, we create a pattern consisting of all the delimiter characters, being careful to escape them, since several of those characters have special meaning in a regex. Then we can use re.sub to replace each delimiter with an empty string. This process can leave us with two or more adjacent spaces, which we then need to replace with a single space.
The Python re module allows us to compile patterns that are used frequently. Theoretically, this can make them more efficient, but it's a good idea to test such patterns against real data to see if it does actually help. :)
import re
delimiters = ',-!?()<=;'
# Make a pattern consisting of all the delimiters
pat = re.compile('|'.join(re.escape(c) for c in delimiters))
s = 'for ( int i;'
# Remove the delimiters
z = pat.sub('', s)
#Clean up any runs of 2 or more spaces
z = re.sub(r'\s{2,}', ' ', z)
print(z)
output
for int i
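To apply the same idea to every line of a file, the two substitutions can be wrapped in small helpers. This is a sketch: clean_line and clean_lines are hypothetical names, the second sample line is invented, and the extra strip() that trims leading and trailing whitespace is an assumption about the desired output.

```python
import re

delimiters = ',-!?()<=;'
pat = re.compile('|'.join(re.escape(c) for c in delimiters))

def clean_line(line):
    """Remove the delimiters, then collapse whitespace runs and trim the ends."""
    stripped = pat.sub('', line)
    return re.sub(r'\s{2,}', ' ', stripped).strip()

def clean_lines(lines):
    """Apply clean_line to every line of an iterable such as an open file."""
    return [clean_line(line) for line in lines]

print(clean_lines(['for ( int i;\n', 'x = a - b;\n']))
```

Because clean_lines accepts any iterable of strings, an open file object (for example the handle for mo.txt) can be passed to it directly.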

Python: Extract hashtags out of a text file

So, I've written the code below to extract hashtags and also tags with '#', append them to a list, and sort them in descending order. The problem is that the text might not be perfectly formatted: there may be no spaces between individual hashtags, in which case the following can occur (as can be checked with the commented-out print statement inside the for loop):
#socality#thisismycommunity#themoderndayexplorer#modernoutdoors#mountaincultureelevated
So, the .split() method doesn't deal with those. What would be the best way to handle this issue?
Here is the .txt file
Grateful for your time.
name = input("Enter file:")
if len(name) < 1 : name = "tags.txt"
handle = open(name)
tags = dict()
lst = list()
for line in handle :
    hline = line.split()
    for word in hline:
        if word.startswith('#') : tags[word] = tags.get(word,0) + 1
        else :
            tags[word] = tags.get(word,0) + 1
        #print(word)
for k,v in tags.items() :
    tags_order = (v,k)
    lst.append(tags_order)
lst = sorted(lst, reverse=True)[:34]
print('Final Dictionary: ' , '\n')
for v,k in lst :
    print(k , v, '')
Use a regular expression. There are only a few limits; a tag must start with either # or #, and it may not contain any spaces or other whitespace characters.
This code
import re
tags = []
# Text mode in Python 3 already uses universal newlines (the old 'U' flag was removed in 3.11)
with open('../Downloads/tags.txt', 'r') as file:
    for line in file:
        tags += re.findall(r'[##][^\s##]+', line)
creates a list of all tags in the file. You can easily adjust it to store the found tags in your dictionary; instead of storing the result straight away in tags, loop over it and do with each item as you please.
The regex is built up from these two custom character classes:
[##] - either the single character # or # at the start
[^\s##]+ - a sequence of not any single whitespace character (\s matches all whitespace such as space, tab, and returns), #, or #; at least one, and as many as possible.
So findall starts matching at the start of any tag and then grabs as much as it can, stopping only when encountering any of the "not" characters.
findall returns a list of matching items, which you can immediately add to an existing list, or loop over the found items in turn:
for tag in re.findall(r'[##][^\s##]+', line):
    # process "tag" any way you want here
The source text file contains Windows-style \r\n line endings, and so I initially got a lot of empty "lines" on my Mac. Opening the text file in universal newline mode makes sure that is handled transparently by the line-reading part of Python. In Python 3 this is the default for text mode; the 'U' flag was deprecated and then removed in Python 3.11.
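Putting the pieces together, counting and sorting the tags could look like the sketch below. It assumes a plain "#" is the only tag marker, uses collections.Counter in place of the hand-built dictionary from the question, and works on two made-up sample lines instead of the real file.

```python
import re
from collections import Counter

# Assumes "#" is the only tag marker.
tag_pattern = re.compile(r'#[^\s#]+')

def count_tags(lines):
    """Count every hashtag found in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(tag_pattern.findall(line))
    return counts

sample = [
    "#socality#thisismycommunity#themoderndayexplorer\n",
    "great day out #socality #modernoutdoors\n",
]
counts = count_tags(sample)
# most_common() sorts by count, highest first.
for tag, n in counts.most_common():
    print(tag, n)
```

Since count_tags takes any iterable of lines, an open file handle can be passed in place of the sample list.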

automating regex to process multiple files

I'm trying to process some data - specifically I have to
Delete any decimals from all numbers in the file, eg 4.0 -> 4
Add a dash between any dates and any times, eg 2014-01-01 23:45:52 -> 2014-01-01-23:45:52
I've written some regexes in Sublime Text to do this using the find and replace function:
Find : "\.\d", Replace : ""
Find : "(\d{2})\s(\d)", Replace : "$1-$2"
This all works fine and gives me the right results. The problem is that I have to process hundreds of CSV files in this way. I've tried to do it in Python, but it isn't working the way I'd expect. Here's the code I used:
for file in csv_list: # csv_list is the list of all the files I need to process
    with open(file, "r") as infile:
        with open("{}EDIT.csv".format(file.split(".")[0]), "w", newline="") as outfile: # Save the processed version
            writer = csv.writer(outfile, delimiter=",")
            reader = csv.reader(infile)
            for line in reader:
                writer.writerow([re.sub("(\d{2})\s(\d)",
                                        "$1-$2", re.sub("\.\d", "", string)) for string in line])
I'm not too confident with regex, so I can't see why this isn't working the way I'd expect. If anyone could help me out that'd be great. Thanks in advance!
As requested, here is an input row, what output I was expecting, and what the actual output is:
input : 0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active
desired output : 0,2013-01-01-20:59:39,5737,english,2013-01-01-21:01:07,active
actual output : 0, 2013-01-$1-$20:59:39,5737,english,2013-01-$1-$21:01:07
You could solve your issue by replacing the replacement string "$1-$2" with r"\1-\2", since Python's re module writes backreferences as \1 and \2 rather than $1 and $2:
import re
rx = r"(\d{2})\s(\d)"
s = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active"
result = re.sub(rx, r"\1-\2", re.sub(r"\.\d", "", s))
print (result)
See the re.sub reference:
Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
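The three spellings can be compared side by side on one of the timestamps from the question. This is a small sketch; the \g<1> form is the alternative syntax that re.sub also accepts, shown here in addition to what the answer uses.

```python
import re

rx = r"(\d{2})\s(\d)"
s = "2013-01-01 20:59:39"

# "$1-$2" has no special meaning to re.sub, so it is inserted literally.
dollar = re.sub(rx, "$1-$2", s)
# \1 and \2 refer to the captured groups.
numbered = re.sub(rx, r"\1-\2", s)
# \g<1> is the unambiguous spelling, useful when a digit follows the reference.
named = re.sub(rx, r"\g<1>-\g<2>", s)

print(dollar)
print(numbered)
print(named)
```

The first line reproduces the "$1-$2" text seen in the question's actual output, while the other two give the desired dash between date and time.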
Or, to avoid that fuss with string replacement backreferences, use a single regex for that task and modify the matches inside a lambda expression:
import re
pat = r"\.\d|(\d{2})\s(\d)"
s = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active"
result = re.sub(pat, lambda m: r"{}-{}".format(m.group(1),m.group(2)) if m.group(1) else "", s)
print (result)
Note that perhaps, for better safety, you could use r'\.\d+\b' as the pattern to remove decimal parts (\d+ matches one or more digits, and \b requires a char other than letter, digit or _ after it, or the end of string). The second pattern can be spelled out for the same purpose as r'(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})'.
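With the stricter patterns, the per-cell cleanup could be factored into helpers that are easy to call from the CSV loop. This is a sketch: clean_field and clean_row are hypothetical names, and the row is the sample input from the question split on commas rather than read through csv.reader.

```python
import re

decimal_part = re.compile(r'\.\d+\b')
date_time = re.compile(r'(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})')

def clean_field(value):
    """Drop decimal parts, then join a date and a time with a dash."""
    return date_time.sub(r'\1-\2', decimal_part.sub('', value))

def clean_row(row):
    """Apply clean_field to every cell of one CSV row."""
    return [clean_field(cell) for cell in row]

row = "0.0,2013-01-01 20:59:39,5737.0,english,2013-01-01 21:01:07,active".split(",")
cleaned = ",".join(clean_row(row))
print(cleaned)
```

Inside the loop from the question, the result of clean_row(line) could be handed to writer.writerow in place of the nested re.sub calls.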
