Using Python regex to check for string in file

I'm using Python regex to check a log file that contains the output of the Windows command tasklist for anything ending with .exe. This log file contains output from multiple runs of tasklist. After I get a list of strings with .exe in them, I want to write them out to a text file, after checking whether each string already exists in the output file. Instead of the desired output, it writes out duplicates of strings already present in the text file (svchost.exe shows up several times, for example). The goal is to have a text file with a list of each unique process enumerated by tasklist, with no duplicates of processes already written in the file.
import re

file1 = open('taskinfo.txt', 'r')
strings = re.findall(r'.*.exe', file1.read())
file1.close()

file2 = open('exes.txt', 'w+')
for item in strings:
    line_to_write = re.match(item, file2.read())
    if line_to_write == None:
        file2.write(item)
        file2.write('\n')
    else:
        pass
I used print statements to debug and made sure that item is the desired output.

There are some problems with your regex. Try this:
strings = re.findall(r'\b\S*\.exe\b', file1.read())
This will take only the text connected to the .exe, starting at a word boundary (\b) and grabbing all non-space characters (\S*). Additionally, when you had .exe instead of \.exe, the . was matching as a wildcard rather than a literal period.
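The duplicates themselves come from a second bug: file2 was just opened for writing, so file2.read() returns an empty string and the re.match check never sees anything already written. A simpler route (a sketch, shown on an in-memory sample rather than your taskinfo.txt) is to deduplicate with a set before writing:

```python
import re

# Sample tasklist output (hypothetical); the real input comes from taskinfo.txt
log = """svchost.exe    456 Services
svchost.exe    789 Services
explorer.exe  1234 Console"""

strings = re.findall(r'\b\S*\.exe\b', log)
unique_exes = sorted(set(strings))   # a set keeps only one copy of each name
print(unique_exes)  # ['explorer.exe', 'svchost.exe']
```

Writing '\n'.join(unique_exes) to exes.txt then replaces the whole re.match check.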

Related

Re-formatting a text file

I am fairly new to Python. I have a text file full of common misspellings. The correct spelling of the word is prefixed with a $ character, and all misspelled versions of that word follow it, one on each line.
mispelling.txt:
$year
eyar
yera
$years
eyars
eyasr
yeasr
yeras
yersa
I want to create a new text file, based on mispelling.txt, where the format appears as this:
new_mispelling.txt:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
The correct spelling of the word is on the right-hand side of its misspelling, separated by ->; on the same line.
Question:
How do I read in the file, treat $ as the start of a new word (and thus a new line in my output file), populate the output file, and save it to disk?
The purpose of this is to have my collected data be of the same format as this open-source Wikipedia entry dataset of "all" commonly misspelled words, that doesn't contain my own entries of words and misspellings.
As you process the file line-by-line, if you find a word that starts with $, set that as the "currently active correct spelling". Then each subsequent line is a misspelling for that word, so format that into a string and write it to the output file.
current_word = ""

with open("mispelling.txt") as f_in, open("new_mispelling.txt", "w") as f_out:
    for line in f_in:
        line = line.strip()  # Remove whitespace at start and end
        if line.startswith("$"):
            # If the line starts with $, slice it from the second
            # character to the end and save that as current_word
            current_word = line[1:]
        else:
            # If it doesn't start with $, create the string we want
            # and write it.
            f_out.write(f"{line}->{current_word}\n")
With your input file, this gives:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
The f"{line}->{current_word}\n" construct is called an f-string and is used for string interpolation in Python 3.6+.
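For example, with the values from the first pair in your file:

```python
line = "eyar"
current_word = "year"
print(f"{line}->{current_word}")  # eyar->year
```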
A regex solution:
You can use the pattern r'\$(\w+)(.*?)(?=\$|$)' and join each value starting with $ to the subsequent values with ->, then join the pairs within each captured group by \n, and finally join all such groups by \n. Make sure to use the re.DOTALL flag since it's a multi-line string:
import re
txt='''$year
eyar
yera
$years
eyars
eyasr
yeasr
yeras
yersa'''
print('\n'.join(
    '\n'.join('->'.join((v, m.group(1)))
              for v in m.group(2).strip('\n').split('\n'))
    for m in re.finditer(r'\$(\w+)(.*?)(?=\$|$)', txt, re.DOTALL)))
OUTPUT:
eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years
I'm leaving the file reading and writing to you, assuming that's not the part you're asking about.

Scanning for "Download: " in a list of strings with sentences

I have a script that gets the download and upload speeds of my internet. Now I am making a script that gathers all the download data from a txt file, and averages it, puts it in an excel spreadsheet, or other things. The problem is I haven't been able to find a way to scan for the "Download: " since there is also the download data in the string. I want to be able to get the indexes of all the strings with Download: in them and also get the data after that.
I tried using any() to scan for the words, but realized it just told me if the element is in the list and also that it only checked if the entire word "Download: " was in the list as a string.
downloads_string = "Download: "
with open("file.txt", "r") as file:
    file.readlines()
    data_downloads_list = any(element in downloads_string for element in file)
    print(data_downloads_list)
I expected to get True, but always got False even though I had Download: in my txt file. I realized it was scanning for strings that were simply "Download: " rather than strings that contained the word along with data.
file.readlines() reads the lines of the file into a list, but since the result is never assigned to anything it just gets garbage collected. You've also got the logic reversed: you're checking whether your line is in the string you're looking for, not whether the string you're looking for is in the line. Try:
downloads_string = 'Download: '
with open('file.txt') as file:
    data_downloads_list = [line for line in file if downloads_string in line]
print(data_downloads_list)
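To also pull out the number that follows the marker, a capture group works. A sketch, assuming the lines look something like Download: 42.5 Mbit/s (adjust the pattern to your file's exact format), shown on an in-memory sample:

```python
import re

# Hypothetical speed-test log lines standing in for the contents of file.txt
lines = [
    "Ping: 12.3 ms",
    "Download: 42.5 Mbit/s",
    "Upload: 10.1 Mbit/s",
    "Download: 40.1 Mbit/s",
]

speeds = []
for line in lines:
    m = re.search(r'Download:\s*([\d.]+)', line)
    if m:                                # keep only lines with a download figure
        speeds.append(float(m.group(1)))

print(speeds)  # [42.5, 40.1]
average = sum(speeds) / len(speeds)      # ready for the spreadsheet step
```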

How to split last word before a certain char in python string and replace that same line in a txt file?

I am currently trying to repeatedly replace a word in a line, but there are two issues with my code. I can successfully locate the lines I want to replace, but I fail to store the specific value from those strings that I want, and then to replace the word on that same line.
The text I want to replace appears two times and looks like this in the text file:
One_Number_ = "0"
if [ One_Number_ == "0" ]
I wish to change "One" here to something else each time I run the program. What I've tried to do is the following:
import os

with open(os.path.join('file.txt'), 'r') as file:
    lines = file.readlines()

with open(os.path.join('file.txt'), 'w') as file:
    for line in lines:
        if (line.__contains__('_Number_')):
            replaceline = line.rsplit('_', 1)[0]
            line.replace(replaceline, "NewWord")
        file.write(line)
The if-statement runs, but the line does not replace "One".
The strings also do not get separated correctly, meaning that replaceline does not contain just "One".
How can I adjust my code so it successfully locates the two lines in the text file that need to be replaced with the new string ("NewWord", which I have just used as an example)?
You will need to do the .replace() like:
line = line.replace(replaceline, "NewWord")
The replace method returns a new string; it does not modify the existing string.
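Putting that fix into the loop, and also isolating just the leading word ("One") rather than everything before the last underscore, a sketch (run on an in-memory list for illustration rather than file.txt):

```python
lines = ['One_Number_ = "0"\n', 'if [ One_Number_ == "0" ]\n', 'unrelated line\n']

new_lines = []
for line in lines:
    if '_Number_' in line:
        # text before the first underscore, last whitespace-separated token -> "One"
        old_word = line.split('_', 1)[0].split()[-1]
        line = line.replace(old_word, 'NewWord')   # replace() returns a new string
    new_lines.append(line)

print(new_lines[0])  # NewWord_Number_ = "0"
```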

How to write clean data to a file in python in tabulated format

Issue: Remove the hyperlinks, numbers and signs like ^&*$ etc from twitter text. The tweet file is in CSV tabulated format as shown below:
s.No. username tweetText
1. #abc This is a test #abc example.com
2. #bcd This is another test #bcd example.com
Being a novice at Python, I searched and strung together the following code, thanks to the code given here:
import re

fileName = "path-to-file//tweetfile.csv"
fileout = open("Output.txt", "w")
with open(fileName, 'r') as myfile:
    data = myfile.read().lower()  # read the file and convert all text to lowercase
    # regular expression to strip the hashtags, symbols and links out of the text
    clean_data = ' '.join(re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", data).split())
    fileout.write(clean_data + '\n')  # write the cleaned data to a file
fileout.close()
myfile.close()
print "All done"
It does the data stripping, but the output file format is not as I desire. The output text file is in a single line like
s.no username tweetText 1 abc This is a cleaned tweet 2 bcd This is another cleaned tweet 3 efg This is yet another cleaned tweet
How can I fix this code to give me an output like given below:
s.No. username tweetText
1 abc This is a test
2 bcd This is another test
3 efg This is yet another test
I think something needs to be added in the regular expression code but I don't know what it could be. Any pointers or suggestions will be helpful.
You can read the line, clean it, and write it out in one loop. You can also use the CSV module to help you build out your result file.
import csv
import re

exp = r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"

def cleaner(row):
    return [re.sub(exp, " ", item.lower()) for item in row]

with open('input.csv', 'r') as i, open('output.csv', 'wb') as o:
    reader = csv.reader(i, delimiter=',')  # Comma is the default
    writer = csv.writer(o, delimiter=',')
    # Take the first row from the input file (the header)
    # and write it to the output file
    writer.writerow(next(reader))
    for row in reader:
        writer.writerow(cleaner(row))
The csv module knows how to add separators between items correctly, as long as you pass it a collection of items.
So what the cleaner function does is take each item (column) in the row from the input file, apply the substitution to the lowercase version of the item, and then return the results as a list.
The rest of the code simply opens the files and configures the CSV module with the separators you want for the input and output files (in the example code, the separator for both files is a comma, which is the default, but you can change the output separator).
Next, the first row of the input file is read and written out to the output file. No transformation is done on this row (which is why it is not in the loop).
Reading the row from the input file automatically puts the file pointer on the next row - so then we simply loop through the input rows (in reader), for each row apply the cleaner function - this will return a list - and then write that list back to the output file with writer.writerow().
Instead of applying the re.sub() and .lower() expressions to the entire file at once, try iterating over each line in the CSV file, like this:
for line in myfile:
    line = line.lower()
    line = re.sub("(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", line)
    fileout.write(line + '\n')
Also, when you use the with ... as myfile expression, there is no need to close the file at the end of your program; that is done automatically when the with block exits.
Try this regex:
clean_data=' '.join(re.sub("[#\^&\*\$]|#\S+|\S+[a-z0-9]\.(com|net|org)"," ",data).split()) # regular expression to strip the html out of the text
Explanation:
[#\^&\*\$] matches the characters you want to replace
#\S+ matches hashtags
\S+[a-z0-9]\.(com|net|org) matches domain names
If the URLs can't be identified by https?, you'll have to complete the list of potential TLDs.

How to Find a String in a Text File And Replace Each Time With User Input in a Python Script?

I am new to Python, so excuse my ignorance.
Currently, I have a text file with some words marked with << and >>.
My goal is to essentially build a script which runs through a text file with such marked words. Each time the script finds such a word, it would ask the user for what it wants to replace it with.
For example, if I had a text file:
Today was a <<feeling>> day.
The script would run through the text file so the output would be:
Running script...
feeling? great
Script finished.
And generate a text file which would say:
Today was a great day.
Advice?
Edit: Thanks for the great advice! I have made a script that works for the most part like I wanted. Just one thing: now I am working on the case where I have multiple variables with the same name (for instance, "I am <<feeling>>. Bob is also <<feeling>>.") so that the script would prompt feeling? only once and fill in all the variables with the same name.
Thanks so much for your help again.
import re

with open('in.txt') as infile:
    text = infile.read()

search = re.compile('<<([^>]*)>>')
text = search.sub(lambda m: raw_input(m.group(1) + '? '), text)

with open('out.txt', 'w') as outfile:
    outfile.write(text)
Basically the same solution as that offered by @phihag, but in script form:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
import re
from os import path

pattern = '<<([^>]*)>>'

def user_replace(match):
    return raw_input('%s? ' % match.group(1))

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', type=argparse.FileType('r'))
    parser.add_argument('outfile', type=argparse.FileType('w'))
    args = parser.parse_args()
    matcher = re.compile(pattern)
    for line in args.infile:
        new_line = matcher.sub(user_replace, line)
        args.outfile.write(new_line)
    args.infile.close()
    args.outfile.close()

if __name__ == '__main__':
    main()
Usage: python script.py input.txt output.txt
Note that this script does not account for non-ascii file encoding.
To open a file and loop through it, use with open(...) and iterate over the file object line by line.
Use raw_input to get input from the user.
Now, put this together and update your question if you run into problems :-)
I understand you want advice on how to structure your script, right? Here's what I would do:
Read the file at once and close it (I personally don't like to have open file objects, especially if my filesystem is remote).
Use a regular expression (phihag has suggested one in his answer, so I won't repeat it) to match the pattern of your placeholders. Find all of your placeholders and store them in a dictionary as keys.
For each word in the dictionary, ask the user with raw_input (not just input). And store them as values in the dictionary.
When done, parse your text substituting any instance of a given placeholder (key) with the user word (value). This is also done with regex.
The reason for using a dictionary is that a given placeholder could occur more than once and you probably don't want to make the user repeat the entry over and over again...
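A sketch of that dictionary idea: the ask parameter (a hypothetical name) stands in for raw_input (input in Python 3) so the prompting can be swapped out, and each distinct placeholder is asked about only once, however many times it occurs:

```python
import re

PATTERN = re.compile(r'<<([^>]*)>>')

def fill_placeholders(text, ask):
    answers = {}                      # placeholder name -> user's replacement
    def repl(match):
        name = match.group(1)
        if name not in answers:       # prompt only on the first occurrence
            answers[name] = ask('%s? ' % name)
        return answers[name]
    return PATTERN.sub(repl, text)

# Canned answer instead of a real prompt, for demonstration
result = fill_placeholders("I am <<feeling>>. Bob is also <<feeling>>.",
                           ask=lambda prompt: "happy")
print(result)  # I am happy. Bob is also happy.
```

In the real script you would call fill_placeholders with the file's contents and ask=raw_input.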
Try something like this
lines = []
with open(myfile, "r") as infile:
    lines = infile.readlines()

outlines = []
for line in lines:
    index = line.find("<<")
    if index > 0:
        word = line[index+2:line.find(">>")]
        input = raw_input(word + "? ")
        outlines.append(line.replace("<<" + word + ">>", input))
    else:
        outlines.append(line)

with open(outfile, "w") as output:
    for line in outlines:
        output.write(line)
Disclaimer: I haven't actually run this, so it might not work, but it looks about right and is similar to something I've done in the past.
How it works:
It parses the file in as a list where each element is one line of the file.
It builds the output list of lines. It iterates through the lines in the input, checking if the string << exists. If it does, it rips out the word inside the << and >> brackets, using it as the question for a raw_input query. It takes the input from that query and replaces the value inside the arrows (and the arrows) with the input. It then appends this value to the list. If it didn't see the arrows, it simply appends the line.
After running through all the lines, it writes them to the output file. You can make this whatever file you want.
Some issues:
As written, this will work for only one arrow statement per line. So if you had <<firstname>> <<lastname>> on the same line it would ignore the lastname portion. Fixing this wouldn't be too hard to implement - you could place a while loop using the index > 0 statement and holding the lines inside that if statement. Just remember to update the index again if you do that!
It iterates through the list three times. You could likely reduce this to two, but if you have a small text file this shouldn't be a huge problem.
It could be sensitive to encoding - I'm not entirely sure about that however. Worst case there you need to cast as a string.
Edit: Moved the +2 to fix the broken if statement.
