improve search using a dict and pyahocorasick - python

I'm new to Python and I don't know how to program well. How do I edit this code so it works using pyahocorasick? My code is very slow, because I need to search for lots of strings in a very big file.
Is there any other way to improve the search?
import sys

with open('C:/dict_search.txt', 'r') as search_list:
    targets = [line.strip() for line in search_list]

with open('C:/source.txt', 'r') as source_file, open('C:/out.txt', 'w') as fout:
    for line in source_file:
        if any(target in line for target in targets):
            fout.write(line)
Dict_search.txt
509344
827276
324194
782211
772854
727246
858908
280903
377881
247333
538710
182734
701212
379326
148310
542129
315285
840427
581092
485581
867746
434527
746814
749479
252045
189668
418513
624231
620284
(...)
source.txt
1,324194,20190103,0000048632,00000000000004870,0000045054!
1,701212,20190103,0000048632,00000000000147072,0000045055!
1,581092,20190103,0000048632,00000000000032900,0000045056!
(...)
I need to find whether each "word" from dict_search.txt is in source.txt, and if a word is on a line, copy that line to another file.
The problem is that my source.txt is very big and I have more than 100k words in dict_search.txt.
My code takes too long to execute. I tried using the set() method, but I got a blank file.
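A likely reason the set() attempt came out blank, sketched under the assumption that whole lines were tested for membership: a full line of source.txt is never itself a member of a set of bare IDs, so nothing ever matches. The ID column has to be extracted first (IDs and the sample line below are taken from the files shown above):

targets = {"324194", "701212"}  # IDs as they appear in dict_search.txt

line = "1,324194,20190103,0000048632,00000000000004870,0000045054!\n"
print(line in targets)                        # False: the whole line is not a bare ID
print(line.strip().split(",")[1] in targets)  # True: compare the second column instead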

After looking at your files, it looks like each line in dict_search.txt matches the format of the second column in source.txt. If that is the case, the code below will work for you. It's a linear-time solution, so it's fast at the cost of space: it builds a dictionary of the whole source file in memory.
d = {}
with open("source.txt", 'r') as f:
    for line in f:
        l = line.strip().split(",")
        d[l[1]] = line  # key on the second column; if an ID repeats, the last line wins

with open("Dict_search.txt", 'r') as search, open('output.txt', 'w') as output:
    for line in search:
        row = line.strip()
        if row in d:
            output.write(d[row])
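To answer the pyahocorasick part of the question: because the targets are fixed strings, an Aho-Corasick automaton can scan each line once for all 100k+ targets at the same time, and unlike the dictionary above it matches a target anywhere in the line, as the original any(target in line ...) loop did. A minimal sketch using the file paths from the question (the payload stored with each word is arbitrary):

import ahocorasick

automaton = ahocorasick.Automaton()
with open('C:/dict_search.txt', 'r') as search_list:
    for word in (line.strip() for line in search_list):
        automaton.add_word(word, word)  # store the word itself as the payload
automaton.make_automaton()

with open('C:/source.txt', 'r') as source_file, open('C:/out.txt', 'w') as fout:
    for line in source_file:
        # iter() yields (end_index, payload) for every match in the line;
        # next(..., None) just asks whether at least one match exists
        if next(automaton.iter(line), None) is not None:
            fout.write(line)

As with the original loop, this is substring matching, so an ID can also match inside another column.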

Related

Python prints two lines in the same line when merging files

I am new to Python, and I'm getting this result and am not sure how to fix it efficiently.
I have n files (let's say for simplicity just two) with some info in this format:
1.250484649 4.00E-02
2.173737246 4.06E-02
... ...
This continues up to m lines. I'm trying to append all the m lines from the n files into a single file. I prepared this code:
import glob

outfile = open('temp.txt', 'w')
for inputs in glob.glob('*.dat'):
    infile = open(inputs, 'r')
    for row in infile:
        outfile.write(row)
It reads all the .dat files (the ones I am interested in) and does what I want, except that it merges the last line of the first file and the first line of the second file into a single line:
1.250484649 4.00E-02
2.173737246 4.06E-02
3.270379524 2.94E-02
3.319202217 6.56E-02
4.228424345 8.91E-03
4.335169497 1.81E-02
4.557886098 6.51E-02
5.111075901 1.50E-02
5.547288248 3.34E-02
5.685118615 3.22E-03
5.923718239 2.86E-02
6.30299944 8.05E-03
6.528018125 1.25E-020.704223685 4.98E-03
1.961058114 3.07E-03
... ...
I'd like to fix this in a smart way. I could fix it by introducing a blank line between each data line and then removing all the blank lines at the end, but that seems suboptimal.
Thank you!
There's no newline on the last line of each .dat file, so you'll need to add it:
import glob

with open('temp.txt', 'w') as outfile:
    for inputs in glob.glob('*.dat'):
        with open(inputs, 'r') as infile:
            for row in infile:
                if not row.endswith("\n"):
                    row = f"{row}\n"
                outfile.write(row)
This also uses with (context managers) to close the files automatically afterwards.
To avoid a trailing newline - there are a few ways to do this, but the simplest one that comes to mind is to load all the input data into memory as individual lines, then write it out in one go using "\n".join(lines). This puts "\n" between each pair of lines but not after the last line in the file.
import glob

lines = []
for inputs in glob.glob('*.dat'):
    with open(inputs, 'r') as infile:
        lines += [line.rstrip('\n') for line in infile.readlines()]

with open('temp.txt', 'w') as outfile:
    outfile.write('\n'.join(lines))
[line.rstrip('\n') for line in infile.readlines()] - this is a list comprehension. It makes a list of the lines in an individual input file, with the '\n' removed from the end of each line, which can then be appended to the overall list of lines with +=.
While we're here - let's use logging to give status updates:
import glob
import logging

OUT_FILENAME = 'test.txt'

lines = []
for inputs in glob.glob('*.dat'):
    logging.info(f'Opening {inputs} to read...')
    with open(inputs, 'r') as infile:
        lines += [line.rstrip('\n') for line in infile.readlines()]
    logging.info(f'Finished reading {inputs}')

logging.info(f'Opening {OUT_FILENAME} to write...')
with open(OUT_FILENAME, 'w') as outfile:
    outfile.write('\n'.join(lines))
logging.info(f'Finished writing {OUT_FILENAME}')
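Note that logging prints nothing at INFO level by default; the root logger has to be configured first, for example:

import logging

# the root logger only emits WARNING and above by default, so the
# logging.info() calls above stay silent until the level is lowered
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logging.info('now visible')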

Split each line at the backslash and write the words before and after it to two different txt files

I searched around a bit, but I couldn't find a solution that fits my needs.
I'm new to python, so I'm sorry if what I'm asking is pretty obvious.
I have a .txt file (for simplicity I will call it inputfile.txt) with a list of names of folder\files like this:
camisos\CROWDER_IMAG_1.mov
camisos\KS_HIGHENERGY.mov
camisos\KS_LOWENERGY.mov
What I need is to split the first word (the one before the \) and write it to a txt file (for simplicity I will call it outputfile.txt).
Then take the second (the one after the \) and write it in another txt file.
This is what I did so far:
with open("inputfile.txt", "r") as f:
lines = f.readlines()
with open("outputfile.txt", "w") as new_f:
for line in lines:
text = input()
print(text.split()[0])
In my mind this should print only the first word into the new txt, but I only got an empty txt file, without any error.
Any advice is much appreciated, thanks in advance for any help you could give me.
You can read the file into a list of strings and split each string to create two separate lists.
with open("inputfile.txt", "r") as f:
lines = f.readlines()
X = []
Y = []
for line in lines:
X.append(line.split('\\')[0] + '\n')
Y.append(line.split('\\')[1])
with open("outputfile1.txt", "w") as f1:
f1.writelines(X)
with open("outputfile2.txt", "w") as f2:
f2.writelines(Y)
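One caveat with the split approach: a line that contains no backslash would raise an IndexError on [1], and the last line of the input may lack a trailing newline. A variant sketch using str.partition, which tolerates both (filenames as above):

with open("inputfile.txt", "r") as f, \
     open("outputfile1.txt", "w") as f1, \
     open("outputfile2.txt", "w") as f2:
    for line in f:
        folder, sep, filename = line.rstrip("\n").partition("\\")
        if sep:  # only handle lines that actually contain a backslash
            f1.write(folder + "\n")
            f2.write(filename + "\n")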

Using Regex to search a plaintext file line by line and cherry pick lines based on matches

I'm trying to read a plaintext file line by line, cherry-pick the lines that begin with a pattern of any six digits, pass those to a list, and then write that list row by row to a .csv file.
Here's an example of a line I'm trying to match in the file:
000003 ANW2248_08_DESOLATE-WASTELAND-3. A9 C 00:55:25:17 00:55:47:12 10:00:00:00 10:00:21:20
And here is a link to two images, one showing the above line in context of the rest of the file and the expected result: https://imgur.com/a/XHjt9e1
import csv
import re

identifier = re.compile(r'^(\d\d\d\d\d\d)')
matched_line = []

with open('file.edl', 'r') as file:
    reader = csv.reader(file)
    for line in reader:
        line = str(line)
        if identifier.search(line) == True:
            matched_line.append(line)
        else:
            continue

with open('file.csv', 'w') as outputEDL:
    print('Copying EDL contents into .csv file for reformatting...')
    outputEDL.write(str(matched_line))
The expected result is that the reader gets to a line, searches it with the regex, and if the search finds a series of 6 numbers at the beginning, appends that entire line to the matched_line list.
What I actually get, once I write what the reader has read to a .csv file, is just [], so the regex search obviously isn't functioning properly the way I've written this code. Any tips on how to better form it to achieve what I'm trying to do would be greatly appreciated.
Thank you.
Some more examples of expected input/output would help with solving this problem, but from what I can see, you are trying to write each line of a text file that contains a timestamp to a csv. In that case the pseudocode below might help you solve your problem, along with a separate regex match function to make your code more readable.
import re

def match_time(line):
    # find every HH:MM:SS:FF-style timecode in the line
    pattern = re.compile(r'(?:\d+[:]\d+[:]\d+[:]\d+)+')
    result = pattern.findall(line)
    return " ".join(result)
This will return a string of the entire timecode if a match is found
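For instance, a quick usage sketch on the sample line from the question (the output is shown as a comment):

line = "000003 ANW2248_08_DESOLATE-WASTELAND-3. A9 C 00:55:25:17 00:55:47:12 10:00:00:00 10:00:21:20"
print(match_time(line))
# 00:55:25:17 00:55:47:12 10:00:00:00 10:00:21:20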
lines = []
with open('yourfile.txt', 'r') as txtfile:
    for line in txtfile:
        res = match_time(line)
        # alternatively you can test if res in line, which might be better
        if res != "":
            lines.append(line)

# use a distinct output filename; opening the input file with 'w'
# would truncate it before it could be read
with open('yourfile.csv', 'w') as csvfile:
    for item in lines:
        csvfile.write(item)
This opens the text file for reading; if a line contains a timecode, it appends that line to a list, then iterates over the list and writes each line to the csv.
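As for why the original code picked out nothing: re.search returns a Match object or None, never the literal True, so identifier.search(line) == True is always False. Testing the result's truthiness is enough, and the csv.reader/str() detour can be dropped, since str() turns each row into something like "['000003', ...]", which defeats the ^ anchor. A minimal sketch of that fix, reusing the filenames from the question:

import re

identifier = re.compile(r'^\d{6}')  # any six digits at the start of the line

with open('file.edl', 'r') as infile, open('file.csv', 'w') as outfile:
    for line in infile:
        # a Match object is truthy and None is falsy, so no comparison to True
        if identifier.search(line):
            outfile.write(line)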

How to fix Python 3 code to extract specific lines from a text file

I'm trying to extract specific lines from a 4.7 GB text file into another text file.
I'm pretty new to python 3.7.1 and this was the best code I could come up with.
Here is a sample of what the text file looks like:
C00629618|N|TER|P|201701230300133512|15C|IND|DOE, JOHN A|PLEASANTVILLE|WA|00000|PRINCIPAL|DOUBLE NICKEL ADVISORS|01032017|40|H6CA34245|SA01251735122|1141239|||2012520171368850783
C00501197|N|M2|P|201702039042410893|15|IND|DOE, JANE|THE LODGE|GA|00000|UNUM|SVP, CORPORATE COMMUNICATIONS|01312017|230||PR1890575345050|1147350||P/R DEDUCTION ($115.00 BI-WEEKLY)|4020820171370029335
C00177436|N|M2|P|201702039042410893|15|IND|DOE, JOHN|RED ROOM|ME|00000|UNUM|SVP, DEPUTY GENERAL COUNSEL, BUSINESS|01312017|384||PR2260663445050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029336
C00177436|N|M2|P|201702039042410895|15|IND|PALMER, LAURA|TWIN PEAKS|WA|00000|UNUM|EVP, GLOBAL SERVICES|01312017|384||PR2283905245050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029342
C00501197|N|M2|P|201702039042410894|15|IND|COOPER, DALE|TWIN PEAKS|WA|00000|UNUM|SVP, CORP MKTG & PUBLIC RELAT.|01312017|384||PR2283904845050|1147350||P/R DEDUCTION ($192.00 BI-WEEKLY)|4020820171370029339
And this is the code I've written:
import re

with open("data.txt", 'r') as rf:
    for line in rf:
        field_match = re.match('^(.*):(.*)$', line)
        if field_match:
            (key) = field_match.groups()
            if key == "C00501197":
                print(rec.split('|'))

with open('extracted_data.txt', 'w') as wf:
    wf.write(line)
I need to extract the full lines that contain the id C00501197 and then have the program write those extracted lines into another txt file, but as of now it only extracts one line, and that line doesn't begin with the id I want extracted.
Don't use regex if you can avoid it. csv is a good choice, or use simple string manipulation.
ans = []
with open('data.txt') as rf:
    for line in rf:
        line = line.strip()
        if line.startswith("C00501197"):
            ans.append(line)

with open('extracted_data.txt', 'w') as wf:
    for line in ans:
        wf.write(line + "\n")  # strip() removed the newline, so put it back
Your output code was a bit busted as well - always wrote out the last line in the file, not the selected records.
You should use the built-in csv module that comes standard with Python. It can easily parse each line into a list. Try something like this:
import csv

with open('text.txt', 'r') as file:
    my_reader = csv.reader(file, delimiter='|')
    for row in my_reader:
        if row[0] == 'C00501197':
            print(row)
This should output the lines you want. You can then do whatever you want to process them, and save them again.
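Following up on that, here is a sketch of writing the matching rows back out with csv.writer, keeping the '|' delimiter (opening csv files with newline='' is the csv module's recommended practice; filenames mirror the question):

import csv

with open('text.txt', 'r', newline='') as infile, \
     open('extracted_data.txt', 'w', newline='') as outfile:
    reader = csv.reader(infile, delimiter='|')
    writer = csv.writer(outfile, delimiter='|')
    for row in reader:
        if row[0] == 'C00501197':
            writer.writerow(row)  # re-joins the fields with '|' and adds the line ending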
You don't need regex here; just split the line on the separator and check the field you're interested in:
found_lines = []
with open("data.txt", 'r') as rf:
    for line_file in rf:
        line = line_file.split("|")
        if line[0] == "C00501197":
            found_lines.append(line)

with open('extracted_data.txt', 'w') as wf:
    for found_line in found_lines:
        # the trailing newline is part of the last field, so joining
        # with "|" reconstructs the original line exactly
        wf.write("|".join(found_line))
This should work.

Python: Issue with Writing over Lines?

So, this is the code I'm using in Python to remove lines, hence the name "cleanse." I have a list of a few thousand words and their parts-of-speech:
NN by
PP at
PP at
...
This is the issue. For whatever reason (one I can't figure out, and I have been trying for a few hours), the program I'm using to process the word inputs isn't clearing duplicates, so the next best thing I can do is cycle through the file and delete the duplicates on each run. However, whenever I do, this code instead takes the last line of the list and duplicates it hundreds of thousands of times.
Thoughts, please? :(
EDIT: The idea is that cleanseArchive() goes through a file called words.txt, takes any duplicate lines and deletes them. Since Python can't delete individual lines in place, though, and I haven't had luck with any other methods, I've turned to saving the non-duplicate data in a list (saveList) and then writing each item from that list into a new file (deleting the old one). However, as I said, at the moment it just repeats the final item of the original list thousands upon thousands of times.
EDIT2: This is what I have so far, taking suggestions from the replies:
def cleanseArchive():
    f = open("words.txt", "r+")
    given_line = f.readlines()
    f.seek(0)
    saveList = set(given_line)
    f.close()
    os.remove("words.txt")
    f = open("words.txt", "a")
    f.write(saveList)
but ATM it's giving me this error:
Traceback (most recent call last):
  File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 154, in <module>
    initialize()
  File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 100, in initialize
    cleanseArchive()
  File "C:\Python33\Scripts\AI\prototypal_intelligence.py", line 29, in cleanseArchive
    f.write(saveList)
TypeError: must be str, not set
for i in saveList:
    f.write(n+"\n")
You basically write the value of n over and over, because the loop variable is i but you write n. Try this:
for i in saveList:
    f.write(i+"\n")
If you just want to delete duplicate lines, I've modified your reading code:
saveList = []
duplicates = []
with open("words.txt", "r") as ins:
    for line in ins:
        if line not in duplicates:  # a set would make this membership test O(1)
            duplicates.append(line)
            saveList.append(line)
Additionally, apply the correction above!
def cleanseArchive():
    f = open("words.txt", "r+")
    f.seek(0)
    given_line = f.readlines()
    saveList = set()
    for x, y in enumerate(given_line):
        t = (y)
        saveList.add(t)
    f.close()
    os.remove("words.txt")
    f = open("words.txt", "a")
    for i in saveList:
        f.write(i)
Finished product! I ended up digging into enumerate and essentially just using that to get the strings. Man, Python has some bumpy roads when you get into sets/lists, holy shit. So much stuff not working for very ambiguous reasons! Whatever the case, fixed it up.
Let's clean up this code you gave us in your update:
def cleanseArchive():
    f = open("words.txt", "r+")
    given_line = f.readlines()
    f.seek(0)
    saveList = set(given_line)
    f.close()
    os.remove("words.txt")
    f = open("words.txt", "a")
    f.write(saveList)
We have bad names that don't respect the Style Guide for Python Code, we have superfluous code parts, we don't use the full power of Python and part of it is not working.
Let us start with dropping unneeded code while at the same time using meaningful names.
def cleanse_archive():
    infile = open("words.txt", "r")
    given_lines = infile.readlines()
    words = set(given_lines)
    infile.close()
    outfile = open("words.txt", "w")
    outfile.write(words)
The seek was not needed, the mode for opening a file to read is now just r, the mode for writing is now w, and we dropped the removal of the file because it will be overwritten anyway. Looking at this clearer code, we see that we forgot to close the file after writing. If we open the file with the with statement, Python will take care of that for us.
def cleanse_archive():
    with open("words.txt", "r") as infile:
        words = set(infile.readlines())
    with open("words.txt", "w") as outfile:
        outfile.write(words)
Now that we have clear code we'll deal with the error message that occurs when outfile.write is called: TypeError: must be str, not set. This message is clear: You can't write a set directly to the file. Obviously you'll have to loop over the content of the set.
def cleanse_archive():
    with open("words.txt", "r") as infile:
        words = set(infile.readlines())
    with open("words.txt", "w") as outfile:
        for word in words:
            outfile.write(word)
That's it.
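One caveat: a set throws away the original line order, so the deduplicated file comes out shuffled. If order matters, a variant using dict.fromkeys keeps the first occurrence of each line in place (this relies on insertion-ordered dicts, so it assumes Python 3.7+, newer than the 3.3 shown in the traceback above):

def cleanse_archive_ordered():
    with open("words.txt", "r") as infile:
        # dict keys are unique and keep insertion order in Python 3.7+
        words = dict.fromkeys(infile)
    with open("words.txt", "w") as outfile:
        for word in words:
            outfile.write(word)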
