Python - Extract Code from Text using Regex

I am a Python beginner looking for help with an extraction problem.
I have a bunch of text files and need to extract every occurrence of an expression ("C" followed by exactly 9 numerical digits) and write them to a file, together with the filename of the text file. Each occurrence of the expression I want to catch starts at the beginning of a new line and ends with a "\n".
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
What the output should look like (in the output file)
My_desired_output_file = "filename,C123456789,C987654321"
My code so far:
min_file_size = 5
def list_textfiles(directory, min_file_size): # Creates a list of all files stored in DIRECTORY ending on '.txt'
    textfiles = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            filename = os.path.join(root, name)
            if os.stat(filename).st_size > min_file_size:
                textfiles.append(filename)
for filename in list_textfiles(temp_directory, min_file_size):
    string = str(filename)
    text = infile.read()
    regex = ???
    with open(filename, 'w', encoding="utf-8") as outfile:
        outfile.write(regex)

Your regex is '^C[0-9]{9}$':
^ start of line
C exact match
[0-9] any digit
{9} 9 times
$ end of line
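As a minimal sketch of how this pattern could be applied: `^` and `$` only match at line boundaries when `re.MULTILINE` is set, so that flag is needed if the whole text is searched at once rather than line by line. The names `CODE_RE` and `find_codes` are illustrative and not from the question, and the sample is a shortened variant of the question's text with one extra, too-long line added to show that it is rejected.

```python
import re

# Compile once; MULTILINE makes ^ and $ match at every line boundary.
CODE_RE = re.compile(r"^C[0-9]{9}$", re.MULTILINE)

def find_codes(text):
    """Return every line that consists of 'C' followed by exactly 9 digits."""
    return CODE_RE.findall(text)

sample_text = """Some random text here
C123456789
some random text here
C987654321
C1234567890
and here"""

print(find_codes(sample_text))  # ['C123456789', 'C987654321']
```

The line `C1234567890` has ten digits, so the `$` anchor prevents it from matching.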

import re
# Anchor both ends so that only lines of exactly "C" + 9 digits match
regex = re.compile(r'^C\d{9}$')
matches = []
with open('file.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if regex.match(line):
            matches.append(line)
You can then write this list to a file as needed.

How about:
import re
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
k = re.findall(r'(C\d{9})', sample_text)
print(k)
This will return all occurrences of that pattern. To process several files, you can read each one and store the matches it contains. Something like:
Updated:
import glob
import os
import re
search = {}
os.chdir('/FolderWithTxTs')
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        # One flat list of matches per file
        data = re.findall(r'(C\d{9})', f.read())
        search.update({f.name: data})
print(search)
This would return a dictionary with file names as keys and, for each file, a list of the matches found in it.
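To get from there to the output format the question asks for ("filename,C123456789,C987654321"), one possible sketch is below. The function name `write_matches` and its parameters are hypothetical, and it assumes the text files are UTF-8 and sit directly in one directory. The output file should live outside that directory or not end in `.txt`, so it is not picked up as input.

```python
import re
from pathlib import Path

# "C" followed by exactly 9 digits, alone on its line.
CODE_RE = re.compile(r"^C\d{9}$", re.MULTILINE)

def write_matches(directory, output_path):
    """Write one 'filename,code1,code2,...' line per .txt file that has matches."""
    with open(output_path, "w", encoding="utf-8") as outfile:
        for path in sorted(Path(directory).glob("*.txt")):
            codes = CODE_RE.findall(path.read_text(encoding="utf-8"))
            if codes:  # skip files without any match
                outfile.write(",".join([path.name, *codes]) + "\n")
```

A call such as `write_matches("/FolderWithTxTs", "codes.csv")` would then produce one comma-separated line per file.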

Related

How to trim text from file and put it another file using python?

I have a text file called file1 like
HelloWorldTestClass
MyTestClass2
MyTestClass4
MyHelloWorld
ApexClass
*
ApexTrigger
Book__c
CustomObject
56.0
Now I want to write to file2 only the lines that contain "test" in the word, so that the output looks like this:
HelloWorldTestClass
MyTestClass2
MyTestClass4
I have code like this:
import re
import os
file_contents1 = f'{os.getcwd()}/build/testlist.txt'
file2_path = f'{os.getcwd()}/build/optestlist.txt'
with open(file_contents1, 'r') as file1:
    file1_contents = file1.read()
    # print(file1_contents)
    # output = [file1_contents.strip() for line in file1_contents if "TestClass" in line]
    # # Use a regular expression pattern to match strings that contain "test"
    test_strings = [x for x in file1_contents.split("\n") if re.search(r"test", x, re.IGNORECASE)]
    # x = test_strings.strip("['t]")
    # # Print the result
    with open(file2_path, 'w') as file2:
        # write the contents of the first file to the second file
        for test in test_strings:
            file2.write(test)
But it is outputting
HelloWorldTestClass MyTestClass2 MyTestClass4
I couldn't find a related question; if this has already been asked, please link to it. Thanks.
You can use the in operator on strings to check whether a line contains your phrase:
with open('file1.txt', "r") as f_input:
    with open('file2.txt', "w") as f_output:
        for line in f_input:
            if "test" in line.lower():
                f_output.write(line)
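The reason the question's version ran the names together is that `split("\n")` removes the line breaks, and `write()` does not add them back, whereas iterating over the file object keeps each trailing newline. A small sketch of the same filtering on an in-memory string (the helper name `filter_lines` is made up for illustration):

```python
import re

def filter_lines(text, word="test"):
    """Return the lines of `text` that contain `word`, ignoring case."""
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    return [line for line in text.split("\n") if pattern.search(line)]

contents = "HelloWorldTestClass\nMyTestClass2\nMyHelloWorld\nApexClass\n"
kept = filter_lines(contents)

# split("\n") strips the line breaks, so they must be added back on output.
output = "\n".join(kept) + "\n"
print(output, end="")
```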

How to extract string (numbers) from txt file and convert to integers using regular expressions in python

I need to read through and parse a file containing text and numbers, extract all the numbers in the file, and compute their sum. The txt file is attached.
This is for Python 3 and above.
import re
names=open("regex_sum_319771_actual.txt")
numlist = list()
for files in names:
    files = files.rstrip()
    ext =re.findall('([0-9]+)',files)
    if len(ext)!= 1 :
        continue
    num = int(ext[0])
    numlist.append(num)
print('done',sum(numlist))
#the sum should give me an output ending with 689
This will work:
import re
with open("regex_sum_319771_actual.txt", "r") as f:
    nums = re.findall(r'([0-9]+)', f.read())
print(sum([int(i) for i in nums]))
PS: do not forget to close your file after reading if you do not use a with statement.
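To see why this differs from the question's attempt: the check `len(ext) != 1` there skips every line that holds more than one number, so those numbers never reach the sum. A minimal sketch on an in-memory string instead of the attached file (the function name `sum_numbers` and the sample text are made up for illustration):

```python
import re

def sum_numbers(text):
    """Add up every run of digits found anywhere in `text`."""
    return sum(int(n) for n in re.findall(r"[0-9]+", text))

# The first line has two numbers, the second has none, the third has one.
sample = "a 10 b 20\nno numbers here\nc 5\n"
print(sum_numbers(sample))  # 35
```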
You can iterate character by character. Note that this adds up the individual digits (so 12 contributes 1 + 2), not the whole numbers.
import re
names = open("regex_sum_319771_actual.txt", 'r')
nbr = []
for line in names:
    for carac in line:
        if re.match(r'\d', carac):
            nbr.append(int(carac))
print(sum(nbr))
names.close()

Using .title() on a CSV file

I have a CSV file where I want:
the first letter of every name to be capitalized and
the other letters to be lowercase.
I have tried using .title().
The CSV file that I want to have the capital letters (CleanNames.csv) will be 'pulling' these names from another CSV (ValidNames.csv) which is 'pulling' those names from a list of disorganized names (10000DirtyNames.csv).
Here is what I have so far:
import re
import csv
with open("10000DirtyNames.csv", "r") as file:
    with open('ValidNames.csv', 'w+') as ValidNames_file:
        write = csv.writer(ValidNames_file, delimiter=',');
        data = file.read();
        pattern = "[A-Za-z]{1,}";
        search = re.findall(pattern, data);
        write.writerow(search);
        with open('CleanNames.csv', 'w') as CleanNames_file:
            write2 = csv.writer(CleanNames_file, delimiter=',');
            data2 = ValidNames_file.read();
            write2.writerow(data2.title());
It works, except that CleanNames.csv is not being populated at all. There is no error message. What am I doing wrong?
I figured it out on my own, but thought I would post my solution in case someone ever needs to solve a similar problem.
import re
import csv
with open("10000DirtyNames.csv", "r") as file:
    with open('ValidNames.csv', 'w+') as ValidNames_file:
        write = csv.writer(ValidNames_file, delimiter=',')
        data = file.read()
        pattern = "[A-Za-z]{1,}"
        search = re.findall(pattern, data)
        write.writerow(search)
with open('ValidNames.csv') as ValidNames_file, open('CleanNames.csv', 'w') as CleanNames_file:
    for name in ValidNames_file:
        CleanNames_file.write(name.title())
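One likely reason the first attempt left CleanNames.csv empty: after writing to a file opened in 'w+' mode, the file position sits at the end, so `read()` returns an empty string until the position is rewound with `seek(0)`. Reopening the file, as the solution does, has the same effect. A minimal sketch using a temporary file (the path and the sample names are made up for illustration):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "ValidNames.csv")

with open(path, "w+") as f:
    f.write("jOHN,mary,ALICE\n")
    at_end = f.read()    # position is at the end of the file: returns ''
    f.seek(0)            # rewind to the start
    rewound = f.read()   # now returns what was written

print(repr(at_end))
print(rewound.title(), end="")
```

Here `at_end` is an empty string, while `rewound.title()` gives the capitalized names.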

import filenames iteratively from a different file

I have a large number of entries in a file. Let me call it file A.
File A:
('aaa.dat', 'aaa.dat', 'aaa.dat')
('aaa.dat', 'aaa.dat', 'bbb.dat')
('aaa.dat', 'aaa.dat', 'ccc.dat')
I want to use these entries, line by line, in a program that iteratively picks an entry from file A and concatenates the files like this:
filenames = ['aaa.dat', 'aaa.dat', 'ccc.dat'] ###entry number 3
with open('out.dat', 'w') as outfile: ###the name has to be aaa-aaa-ccc.dat
    for fname in filenames:
        with open(fname) as infile:
            outfile.write(infile.read().strip())
All I need to do is to substitute the filenames iteratively and create an output in a "aaa-aaa-aaa.dat" format. I would appreciate any help-- feeling a bit lost!
Many thanks!!!
You can retrieve and modify the file names in the following way:
import re
pattern = re.compile(r'\W')
with open('fnames.txt', 'r') as infile:
    for line in infile:
        line = (re.sub(pattern, ' ', line)).split()
        # Old filenames - to concatenate contents
        content = [x + '.dat' for x in line[::2]]
        # New filename
        new_name = ('-').join(line[::2]) + '.dat'
        # Write the concatenated content to the new
        # file (first read the content all at once)
        with open(new_name, 'w') as outfile:
            for con in content:
                with open(con, 'r') as old:
                    new_content = old.read()
                    outfile.write(new_content)
This program reads your input file, here named fnames.txt and assumed to have the exact structure from your post, line by line. For each line it splits the entries using a precompiled regex (precompiling the regex is suitable here and should make things faster). This assumes that your filenames contain only alphanumeric characters (or underscores), since the regex substitutes every other character with a space.
After the split, each line is a list of strings that alternates between name stems such as 'aaa' and the 'dat' extension. The program forms the new name by joining every second entry, starting from index 0, with a - as in the post, and adding a .dat extension.
It then rebuilds the individual file names whose content it will read, storing them in the list content, again by selecting every second entry from line.
Finally, it reads each of the files in content and writes them to the common file new_name. It reads each of them all at once, which may be a problem if these files are big, and in general there may be more efficient ways of doing all this. Also, if you plan to do more with the content of the old files before writing, consider moving the old-file-specific operations to a separate function for readability and easier debugging.
Something like this:
with open(fname) as infile, open('out.dat', 'w') as outfile:
    for line in infile:
        line = line.strip()
        if line: # not empty
            filenames = eval(line.strip()) # read tuple
            filenames = [f[:-4] for f in filenames] # remove extension
            filename = '-'.join(filenames) + '.dat' # make filename
            outfile.write(filename + '\n') # write
If your problem is just calculating the new filenames, how about using os.path.splitext?
'-'.join([
f[0] for f in [os.path.splitext(path) for path in filenames]
]) + '.dat'
This is probably easier to understand written out like this:
import os
clean_fnames = []
filenames = ['aaa.dat', 'aaa.dat', 'ccc.dat']
for fname in filenames:
    name, extension = os.path.splitext(fname)
    clean_fnames.append(name)
name_without_ext = '-'.join(clean_fnames)
name_with_ext = name_without_ext + '.dat'
print(name_with_ext)
HOWEVER: if your issue is that you cannot get the filenames into a list by reading the file line by line, keep in mind that when you read files you get text (strings), NOT Python structures. You need to rebuild a list from text like: "('aaa.dat', 'aaa.dat', 'aaa.dat')\n".
You could take a look at ast.literal_eval or try to rebuild it yourself. The code below outputs a lot of messages to show what's happening:
import pprint
collected_fnames = []
with open('./fileA.txt') as f:
    for line in f:
        print("Read this (literal) line: %s" % repr(line))
        line_without_whitespaces_on_the_sides = line.strip()
        if not line_without_whitespaces_on_the_sides:
            print("line is empty... skipping")
            continue
        else:
            line_without_parenthesis = (
                line_without_whitespaces_on_the_sides
                .lstrip('(')
                .rstrip(')')
            )
            print("Cleaned parenthesis: %s" % line_without_parenthesis)
            chunks = line_without_parenthesis.split(', ')
            print("Collected %s chunks in a %s: %s" % (len(chunks), type(chunks), chunks))
            chunks_without_quotations = [chunk.replace("'", "") for chunk in chunks]
            print("Now we don't have quotations: %s" % chunks_without_quotations)
            collected_fnames.append(chunks_without_quotations)
print("collected %s lines with filenames:\n%s" %
      (len(collected_fnames), pprint.pformat(collected_fnames)))
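The ast.literal_eval route mentioned above can be sketched as follows. It parses the line as a Python literal (here a tuple of strings) without executing arbitrary code, which makes it a safer choice than eval for text read from a file. The helper name `output_name` is hypothetical, and the sketch assumes each line has the exact form shown in the question.

```python
import ast
import os

def output_name(line):
    """Parse a line like "('aaa.dat', 'aaa.dat', 'ccc.dat')" into the
    tuple of filenames and the joined output name 'aaa-aaa-ccc.dat'."""
    filenames = ast.literal_eval(line.strip())  # only accepts Python literals
    stems = [os.path.splitext(name)[0] for name in filenames]
    return filenames, "-".join(stems) + ".dat"

names, new_name = output_name("('aaa.dat', 'aaa.dat', 'ccc.dat')\n")
print(names)     # ('aaa.dat', 'aaa.dat', 'ccc.dat')
print(new_name)  # aaa-aaa-ccc.dat
```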

Python Regex And encode

I am trying to find and print all the phone numbers in this file, but the file contains a lot of unreadable text.
The file looks like this, but much bigger:
e
How can I decode this and find all the numbers? I now have the following code:
import glob
import re
path = "C:\\Users\\Joey\\Downloads\\db_sdcard\\mysql\\ibdata1"
files= glob.glob(path)
for name in files:
    with open(name, 'r') as f:
        for line in f:
            print line
            match = re.search(r'(/b/d{2}-/d{8}/b)', line)
            if match:
                found = match.group()
                print found
When I run my script I get the following output:
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
Where do I have to put the .decode('utf8')? And is the rest of my code good?
Try using the following to find your numbers:
re.findall("\d{2}-\d{8}", line)
It creates a list of all of the matching substrings that fit the format xx-xxxxxxxx, where x is a digit.
When using the last line from your question as an example:
>>> line = ' P t\xe2\x82\xac \xc5\x92 \xc3\x98p\xe2\x82\xac Q~\xc3\x80t\xc3\xb406-23423230xx06-34893646xx secure_encryptedsecure_encrypted\xe2\x82\xac -\xe2\x82\xac -\xe2\x82\xac \n'
>>> re.findall("\d{2}-\d{8}", line)
['06-23423230', '06-34893646']
Here it is in the full statement:
for name in files:
    with open(name, 'r') as f:
        for line in f:
            matches = re.findall("\d{2}-\d{8}", line)
            for mt in matches:
                print mt
This will print each match on a separate line.
You could even find all the matches in the whole file at once:
for name in files:
    with open(name, 'r') as f:
        matches = re.findall("\d{2}-\d{8}", f.read())
        for mt in matches:
            print mt
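The snippets above are Python 2. Under Python 3, one option (a sketch, not the answer's own method) is to skip decoding entirely: read the file in binary mode and search with a bytes pattern, since the digits and the hyphen are plain ASCII bytes even inside otherwise undecodable data. The names `PHONE_RE` and `find_phone_numbers` and the sample bytes are made up for illustration.

```python
import re

# A bytes pattern: \d matches only the ASCII digits 0-9 here.
PHONE_RE = re.compile(rb"\d{2}-\d{8}")

def find_phone_numbers(data):
    """Return all xx-xxxxxxxx matches in raw bytes as ordinary strings."""
    return [m.decode("ascii") for m in PHONE_RE.findall(data)]

sample = b"\xff\xff P t\xe2\x82\xac Q~\xc3\x80t06-23423230xx06-34893646xx secure_encrypted\n"
print(find_phone_numbers(sample))  # ['06-23423230', '06-34893646']
```

For a real file, the bytes would come from something like `open(name, "rb").read()` inside a with statement.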
