I am trying to find and print all the phone numbers in this file, but the file contains a lot of unreadable text.
The file looks like this, only much bigger:
e
How can I decode this and find all the numbers? I currently have the following code:
import glob
import re
path = "C:\\Users\\Joey\\Downloads\\db_sdcard\\mysql\\ibdata1"
files= glob.glob(path)
for name in files:
    with open(name, 'r') as f:
        for line in f:
            print line
            match = re.search(r'(/b/d{2}-/d{8}/b)', line)
            if match:
                found = match.group()
                print found
When I run my script I get the following output:
ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
Where do I have to put the .decode('utf8')? And is the rest of my code okay?
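Your pattern uses forward slashes (/b, /d) where the regex escapes need backslashes (\b, \d), so it will not match a phone number. Try using the following to find your numbers:
re.findall(r"\d{2}-\d{8}", line)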
It creates a list of all of the matching substrings that fit the format xx-xxxxxxxx, where x is a digit.
When using the last line from your question as an example:
>>> line = ' P t\xe2\x82\xac \xc5\x92 \xc3\x98p\xe2\x82\xac Q~\xc3\x80t\xc3\xb406-23423230xx06-34893646xx secure_encryptedsecure_encrypted\xe2\x82\xac -\xe2\x82\xac -\xe2\x82\xac \n'
>>> re.findall(r"\d{2}-\d{8}", line)
['06-23423230', '06-34893646']
Here it is in the full loop:
for name in files:
    # 'rb': ibdata1 is a binary file, so read it in binary mode
    with open(name, 'rb') as f:
        for line in f:
            matches = re.findall(r"\d{2}-\d{8}", line)
            for mt in matches:
                print mt
This prints each match on a separate line.
You could even find all the matches in the whole file at once:
for name in files:
    with open(name, 'rb') as f:
        matches = re.findall(r"\d{2}-\d{8}", f.read())
        for mt in matches:
            print mt
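On Python 3, the question of where to call .decode() mostly goes away if the file is read as bytes and searched with a bytes pattern. The following is only a minimal sketch, assuming the numbers are stored as plain ASCII digits; the name find_phone_numbers is made up for illustration and the sample bytes are a shortened copy of the line shown above.

```python
import re

# Bytes pattern: in a bytes regex, \d matches only the ASCII digits 0-9.
PHONE_RE = re.compile(rb"\d{2}-\d{8}")

def find_phone_numbers(data):
    """Return every xx-xxxxxxxx match found in a bytes object, as str."""
    return [m.decode("ascii") for m in PHONE_RE.findall(data)]

# Shortened sample of the line from the answer above, as raw bytes.
sample = b"Q~\xc3\x80t\xc3\xb406-23423230xx06-34893646xx secure_encrypted\n"
print(find_phone_numbers(sample))
```

For a real file, the same function could be fed with open(path, "rb").read(), so no decoding of the unreadable parts is needed at all.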
I'm completely new to Python and I'm currently trying to write a small script for finding and replacing different inputs in all files of the same type in a folder. It currently looks like this and works so far:
import glob
import os
z = input('filetype? (i.e. *.txt): ')
x = input('search for?: ')
y = input('replace with?: ')
replacements = {x:y}
lines = []
for filename in glob.glob(z):
    with open(filename, 'r') as infile:
        for line in infile:
            for src, target in replacements.items():
                line = line.replace(src, target)
            lines.append(line)
    with open(filename, 'w') as outfile:
        for line in lines:
            outfile.write(line)
    lines = []
My problem is that I want to replace full words only, so when trying to replace '0' with 'x', '4025' should not become '4x25'. I've tried to incorporate regex, but I couldn't make it work.
Instead of
line.replace(src, target)
try
re.sub(r'\b' + re.escape(src) + r'\b', target, line)
after importing re. \b matches a word boundary.
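As a small illustration of the \b approach, here is a sketch with a hypothetical helper named replace_whole_word. It passes the replacement as a function so that backslashes in the user's input are not interpreted as regex escapes; that detail is an addition, not part of the answer above.

```python
import re

def replace_whole_word(text, src, target):
    """Replace src with target only where src stands as a whole word."""
    pattern = r"\b" + re.escape(src) + r"\b"
    # A function replacement inserts target literally, without escape processing.
    return re.sub(pattern, lambda m: target, text)

print(replace_whole_word("0 4025 0\n", "0", "x"))
```

Note that \b relies on src starting and ending with a word character (letter, digit or underscore); for search terms made of punctuation it will not behave as a whole-word test.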
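You can also use a regular expression (re) with the ^ and $ anchors. Note that this only replaces src when it makes up the entire line, so it suits files with one value per line:
line = re.sub('^' + re.escape(src) + '$', target, line)
Don't forget to import re.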
I am a Python beginner looking for help with an extraction problem.
I have a bunch of text files and need to extract every occurrence of a particular expression ("C" followed by exactly 9 digits) and write the matches to a file together with the name of the text file. Each occurrence of the expression I want to catch starts at the beginning of a new line and ends with a "\n".
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
What the output should look like (in the output file)
My_desired_output_file = "filename,C123456789,C987654321"
My code so far:
min_file_size = 5
def list_textfiles(directory, min_file_size): # Creates a list of all files stored in DIRECTORY ending on '.txt'
    textfiles = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            filename = os.path.join(root, name)
            if os.stat(filename).st_size > min_file_size:
                textfiles.append(filename)
for filename in list_textfiles(temp_directory, min_file_size):
    string = str(filename)
    text = infile.read()
    regex = ???
    with open(filename, 'w', encoding="utf-8") as outfile:
        outfile.write(regex)
Your regex is '^C[0-9]{9}$'. When searching a whole multi-line string, pass re.MULTILINE so that ^ and $ match at every line:
^ start of line
C exact match
[0-9] any digit
{9} 9 times
$ end of line
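To see the anchors at work, here is a short sketch that runs the pattern over a multi-line string. The extra sample lines (a 10-digit code and a code preceded by another character) are invented to show what the anchors reject.

```python
import re

sample_text = """Some random text here
C123456789
C1234567890
xC987654321
C987654321
and here"""

# re.MULTILINE makes ^ and $ match at the start and end of every line.
codes = re.findall(r"^C[0-9]{9}$", sample_text, flags=re.MULTILINE)
print(codes)
```

Only the two lines that consist of exactly "C" plus nine digits are returned; the 10-digit line and the line with a leading "x" are skipped.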
import re
# raw string, and $ so that codes with more than 9 digits are rejected
regex = re.compile(r'^C\d{9}$')
matches = []
with open('file.txt', 'r') as file:
    for line in file:
        line = line.strip()
        if regex.match(line):
            matches.append(line)
You can then write this list to a file as needed.
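One way to write the list out in the "filename,C123456789,C987654321" shape asked for in the question is the csv module. This is only a sketch: append_row and the temporary output path are hypothetical names chosen for the example.

```python
import csv
import os
import tempfile

def append_row(out_path, filename, matches):
    """Append one CSV row: the source file name followed by its matches."""
    with open(out_path, "a", newline="", encoding="utf-8") as out:
        csv.writer(out).writerow([filename] + list(matches))

# Demo: write one row to a throwaway file and read it back.
out_path = os.path.join(tempfile.mkdtemp(), "matches.csv")
append_row(out_path, "file.txt", ["C123456789", "C987654321"])
with open(out_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

Opening the output in append mode means the function can be called once per input file inside the loop over the text files.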
How about:
import re
sample_text = """Some random text here
and here
and here
C123456789
some random text here
C987654321
and here
and here"""
k = re.findall(r'C\d{9}', sample_text)
print(k)
This returns all occurrences of that pattern. To process several files, read each one and store the matches it contains. Something like:
Updated:
import glob
import os
import re
search = {}
os.chdir('/FolderWithTxTs')
for file in glob.glob("*.txt"):
    with open(file, 'r') as f:
        # one flat list of matches per file
        data = re.findall(r'C\d{9}', f.read())
    search[f.name] = data
print(search)
This returns a dictionary with the file names as keys and a list of the matches found in each file as values.
I'm cleaning newspaper articles stored in separate text files.
In one of the cleaning stages, I want to remove all the text within a file that comes after the delimiter 'LOAD-DATE:'. I use a small piece of code that does the job when applied to a single string. See below.
line = 'A little bit of text. LOAD-DATE: And some redundant text'
import re
m = re.match('(.*LOAD-DATE:)', line)
if m:
    line = m.group(1)
line = re.sub('LOAD-DATE:', '', line)
print(line)
A little bit of text.
However, when I move the code into a loop to clean a whole batch of separate text files (which works fine in other stages of the script), it produces gigantic, identical text files that look nothing like the original newspaper articles. See the loop:
files = glob.glob("*.txt")
for f in files:
    with open(f, "r") as fin:
        try:
            import re
            m = re.match('(.*LOAD-DATE:)', fin)
            if m:
                data = m.group(1)
                data = re.sub('LOAD-DATE:', '', data)
        except:
            pass
    with open(f, 'w') as fout:
        fout.writelines(data)
Something clearly goes wrong in the loop, but I have no idea what.
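re.match needs a string, but your loop passes it the file object fin. The resulting TypeError is swallowed by the bare except, so presumably whatever data held before gets written to every file. Try going through the file line by line. Something like:
import glob
import re
files = glob.glob("*.txt")
for f in files:
    with open(f, "r") as fin:
        data = []
        for line in fin:
            m = re.match('(.*LOAD-DATE:)', line)
            if m:
                line = m.group(1)
                line = re.sub('LOAD-DATE:', '', line)
            data.append(line)
    with open(f, 'w') as fout:
        fout.writelines(data)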
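Since the goal in the question is to drop everything after the delimiter in the whole file, not just on one line, a regex is not strictly required. The sketch below uses str.partition on the full file contents; strip_after_marker is a hypothetical helper name and the sample article is invented.

```python
def strip_after_marker(text, marker="LOAD-DATE:"):
    """Return text up to (not including) the first marker; unchanged if absent."""
    head, _sep, _tail = text.partition(marker)
    return head

article = (
    "A little bit of text.\n"
    "Second line. LOAD-DATE: And some redundant text\n"
    "More redundant text\n"
)
cleaned = strip_after_marker(article)
print(cleaned)
```

Inside the file loop this would be applied to fin.read(), and the result written back with fout.write(cleaned).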
I made 10 txt files all containing the string:
A little bit of text. LOAD-DATE: And some redundant text
I changed the m variable as patrick suggested so that the file contents are read before matching. Note that . does not match newlines by default, so for multi-line articles you would also need to pass re.DOTALL.
m = re.match('(.*LOAD-DATE:)', fin.read())
I also found that I had to move the writelines call inside the if statement:
if m:
    data = m.group(1)
    data = re.sub('LOAD-DATE:', '', data)
    with open(f, 'w') as fout:
        fout.writelines(data)
It changed them all with no problem and very quickly.
I hope this helps.
I have a file with some lines. From those, I only want the lines that start with xxx. The lines that start with xxx follow this pattern:
xxx:(12:"pqrs",223,"rst",-90)
xxx:(23:"abc",111,"def",-80)
I want to extract only the string inside the first pair of double quotes,
i.e., "pqrs" and "abc".
Any help using regex is appreciated.
My code is as follows:
with open("log.txt","r") as f:
         f = f.readlines()
    for line in f:
        line=line.rstrip()
        for phrase in 'xxx:':
            if re.match('^xxx:',line):
                c=line
                break
This code gives me an error.
Your code is wrongly indented. Your f = f.readlines() has 9 spaces in front while for line in f: has 4 spaces. It should look like the code below.
import re
list_of_prefixes = ["xxx","aaa"]
resulting_list = []
with open("raw.txt","r") as f:
    f = f.readlines()
for line in f:
    line = line.rstrip()
    for phrase in list_of_prefixes:
        # capture the word inside the first pair of double quotes
        m = re.match(re.escape(phrase) + r':\(\d+:"(\w+)', line)
        if m is not None:
            resulting_list.append(m.group(1))
Well, you are heading in the right direction.
If the input is this simple, you can use regex groups.
import re
with open("log.txt","r") as f:
    f = f.readlines()
for line in f:
    line=line.rstrip()
    m = re.match(r'^xxx:\(\d*:("[^"]*")',line)
    if m is not None:
        print(m.group(1))
All the magic is in the regular expression.
^xxx:\(\d*:("[^"]*") means
Start from the beginning of the line, match on "xxx:(<any number of digits>:"<anything but ">"
and because the sequence "<anything but ">" is enclosed in round brackets it will be available as a group (by calling m.group(1)).
PS: Next time, make sure to include the exact error you are getting.
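The group idea can be tried out on the two sample lines from the question. In this sketch the parentheses sit inside the quotes, so the captured text comes back without the quote characters; first_quoted and the extra "yyy" line are made up for the example.

```python
import re

# Group 1 captures the text between the first pair of double quotes.
FIRST_QUOTED = re.compile(r'^xxx:\(\d+:"([^"]*)"')

def first_quoted(line):
    """Return the first quoted string of an xxx line, or None if no match."""
    m = FIRST_QUOTED.match(line)
    return m.group(1) if m else None

lines = [
    'xxx:(12:"pqrs",223,"rst",-90)',
    'yyy:(1:"skip",0,"x",0)',
    'xxx:(23:"abc",111,"def",-80)',
]
found = [s for s in map(first_quoted, lines) if s is not None]
print(found)
```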
results = []
with open("log.txt","r") as f:
    f = f.readlines()
for line in f:
    if line.startswith("xxx"):
        parts = line.split(":", 2) # parts[2] is what follows the second :
        result = parts[2].split(",")[0][1:-1] # will be pqrs
        results.append(result)
You want to look for lines that start with xxx,
then split the line on the :. The text after the second : starts with the value you want, up to the first comma. Your result is that string with the surrounding quotes removed. There is no need for regex; Python's string methods will do fine.
To check if a line starts with xxx do
line.startswith('xxx')
To find the text in first double-quotes do
re.search(r'"(.*?)"', line).group(1)
(as match.group(1) is the first parenthesized subgroup)
So the code will be
import re
with open("file") as f:
    for line in f:
        if line.startswith('xxx'):
            print(re.search(r'"(.*?)"', line).group(1))
re module docs
I have a task to replace "O" (capital O) with "0" in a text file using Python. One condition is that I have to preserve other words such as Over, NATO, etc. I only want to fix words like 9OO to 900, 2OO6 to 2006, and so on. I have tried a lot without success. My code is given below. Can anyone help? Thanks in advance.
import re
srcpatt = 'O'
rplpatt = '0'
cre = re.compile(srcpatt)
with open('myfile.txt', 'r') as file:
    content = file.read()
wordlist = re.findall(r'(\d+O|O\d+)',str(content))
print(wordlist)
for word in wordlist:
    subcontent = cre.sub(rplpatt, word)
    newrep = re.compile(word)
    newcontent = newrep.sub(subcontent,content)
with open('myfile.txt', 'w') as file:
    file.write(newcontent)
print('"',srcpatt,'" is successfully replaced by "',rplpatt,'"')
re.sub can take a replacement function, so we can pare this down pretty nicely:
import re
with open('myfile.txt', 'r') as file:
    content = file.read()
with open('myfile.txt', 'w') as file:
    file.write(re.sub(r'\d+[\dO]+|[\dO]+\d+', lambda m: m.group().replace('O', '0'), content))
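import re
srcpatt = 'O'
rplpatt = '0'
cre = re.compile(srcpatt)
reg = r'\b(\d*)O(O*\d*)\b'
with open('input', 'r') as f:
    for line in f:
        # re.search rather than re.match, so matches anywhere in the line count
        # note: a standalone word "O" also matches and becomes "0"
        while re.search(reg, line):
            line = re.sub(reg, r'\g<1>0\2', line)
        print(line, end='')
print('"',srcpatt,'" is successfully replaced by "',rplpatt,'"')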
You can probably get away with matching just a leading digit followed by O. This won't handle OO7, but it will work nicely with 8080, for example, which none of the answers here that match trailing digits will. If you want to handle that as well, you need to use a lookahead match.
re.sub(r'(\d)(O+)', lambda m: m.groups()[0] + '0'*len(m.groups()[1]), content)
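A lookahead version could look like the sketch below. It is one possible reading of the requirement, not the method of any answer above: a whole word made up only of digits and capital O's is rewritten if it contains at least one digit. The names MIXED and fix_zeros and the sample sentence are invented for the example.

```python
import re

# A whole word of digits and capital O's; the lookahead requires at least one digit.
MIXED = re.compile(r"\b(?=[\dO]*\d)[\dO]+\b")

def fix_zeros(text):
    """Replace O with 0 inside digit/O words such as 9OO, 2OO6 or OO7."""
    return MIXED.sub(lambda m: m.group().replace("O", "0"), text)

sample = "In 2OO6, NATO sent 9OO troops; agent OO7 is Over 8080."
print(fix_zeros(sample))
```

Words such as NATO and Over are left alone because they contain letters other than O, and a word made of O's only never satisfies the lookahead.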