extracting specific data from a text file in python - python

If I have a text file containing:
Proto Local Address Foreign Address State PID
TCP 0.0.0.0:11 0.0.0.0:0 LISTENING 12 dns.exe
TCP 0.0.0.0:95 0.0.0.0:0 LISTENING 589 lsass.exe
TCP 0.0.0.0:111 0.0.0.0:0 LISTENING 888 svchost.exe
TCP 0.0.0.0:123 0.0.0.0:0 LISTENING 123 lsass.exe
TCP 0.0.0.0:449 0.0.0.0:0 LISTENING 2 System
Is there a way to extract ONLY the process ID names such as dns.exe, lsass.exe, etc..?
I tried using split() so I could get the info right after the string LISTENING. Then I took whats left (12 dns.exe, 589 lsass.exe, etc... ), and checked the length of each string. So if the len() of 12 dns.exe was between 17 or 20 for example, I would get the substring of that string with specific numbers. I only took into account the length of the PID numbers(which can be anywhere between 1 to 4 digits) but then forgot that the length of each process name varies (there are hundreds). Is there a simpler way to do this or am I out of luck?

split should work just fine so long you ignore the header in your file
processes = []
with open("file.txt", "r") as f:
lines = f.readlines()
# Loop through all lines, ignoring header.
# Add last element to list (i.e. the process name)
for l in lines[1:]:
processes.append(l.split()[-1])
print processes
Result:
['dns.exe', 'lsass.exe', 'svchost.exe', 'lsass.exe', 'System']

You can use pandas DataFrames to do this without getting into the hassle of split:
parsed_file = pandas.read_csv("filename", header = 0)
will automatically read this into a DataFrame for you. You can then filter by those rows containing dns.exe, etc. You may need to define your own header
Here is a more general replacement for read_csv if you want more control. I've assumed your columns are all tab separated, but you can feel free to change the splitting character however you like:
with open('filename','r') as logs:
logs.readline() # skip header so you can can define your own.
columns = ["Proto","Local Address","Foreign Address","State","PID", "Process"]
formatted_logs = pd.DataFrame([dict(zip(columns,line.split('\t'))) for line in logs])
Then you can just filter the rows by
formatted_logs = formatted_logs[formatted_logs['Process'].isin(['dns.exe','lsass.exe', ...])]
If you want just the process names, it is even simpler. Just do
processes = formatted_logs['Process'] # returns a Series object than can be iterated through

You could simply use re.split:
import re
rx = re.compile(" +")
l = rx.split(" 12 dns.exe") # => ['', '12', 'dns.exe']
pid = l[1]
it will split the string on a arbitrary number of spaces, and you take second element.

with open(txtfile) as txt:
lines = [line for line in txt]
process_names = [line.split()[-1] for line in lines[1:]]
This opens your input file and reads all the lines into a list. Next, the list is iterated over starting at the second element (because the first is the header row) and each line is split(). The last item in the resulting list is then added to process_names.

You could also use simply split and treat the line step by step, one by one like this:
def getAllExecutables(textFile):
execFiles = []
with open(textFile) as f:
fln = f.readline()
while fln:
pidname = str.strip(list(filter(None, fln.split(' ')))[-1]) #splitting the line, removing empty entry, stripping unnecessary chars, take last element
if (pidname[-3:] == 'exe'): #check if the pidname ends with exe
execFiles.append(pidname) #if it does, adds it
fln = f.readline() #read the next line
return execFiles
exeFiles = getAllExecutables('file.txt')
print(exeFiles)
Some remarks on the code above:
Filter all the unnecessary empty element in the file line by filter
stripping all the unnecessary characters in the file (such as \n) by str.strip
Get the last element of the line after split using l[-1]
Check if the last 3 chars of that element is exe. If it is, adds it to the resulting list.
Results:
['dns.exe', 'lsass.exe', 'svchost.exe', 'lsass.exe']

Related

who does IndexError: list index out of range appears ? i did some test still cant find out

Im working on a simple project python to practice , im trying to retreive data from file and do some test on a value
in my case i do retreive data as table from a file , and i do test the last value of the table if its true i add the whole line in another file
Here my data
AE300812 AFROUKH HAMZA 21 admis
AE400928 VIEGO SAN 22 refuse
AE400599 IBN KHYAT mohammed 22 admis
B305050 BOUNNEDI SALEM 39 refuse
here my code :
fichier = open("concours.txt","r")
fichier2 = open("admis.txt","w")
contenu = fichier.read()
tab = contenu.split()
for i in range(0,len(tab),5):
if tab[i+4]=="admis":
fichier2.write(tab[i]+" "+tab[i+1]+" "+tab[i+2]+" "+tab[i+3]+" "+tab[i+4]+" "+"\n")
fichier.close()
And here the following error :
if tab[i+4]=="admis":
IndexError: list index out of range
You look at tab[i+4], so you have to make sure you stop the loop before that, e.g. with range(0, len(tab)-4, 5). The step=5 alone does not guarantee that you have a full "block" of 5 elements left.
But why does this occur, since each of the lines has 5 elements? They don't! Notice how one line has 6 elements (maybe a double name?), so if you just read and then split, you will run out of sync with the lines. Better iterate lines, and then split each line individually. Also, the actual separator seems to be either a tab \t or double-spaces, not entirely clear from your data. Just split() will split at any whitespace.
Something like this (not tested):
fichier = open("concours.txt","r")
fichier2 = open("admis.txt","w")
for line in fichier:
tab = line.strip().split(" ") # actual separator seems to be tab or double-space
if tab[4]=="admis":
fichier2.write(tab[0]+" "+tab[1]+" "+tab[2]+" "+tab[3]+" "+tab[4]+" "+"\n")
Depending on what you actually want to do, you might also try this:
with open("concours.txt","r") as fichier, open("admis.txt","w") as fichier2:
for line in fichier:
if line.strip().endswith("admis"):
fichier2.write(line)
This should just copy the admis lines to the second file, with the origial double-space separator.

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
# open all 4 files with a meaningful name
file=[open(outputfile.txt","w")
for line in a:
Without regex:
for line in file:
keep = []
line = line.strip()
if line.startswith('FRAME'):
continue
first, second, *_ = line.split()
keep.append(first)
first, second = second.split('=')
keep.extend(second.split(','))
print(' '.join(keep))
My advice? Since I don't write many regex's I avoid writing big ones all at once. Since you've already done that I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
a.readline()
for line in a.readlines():
result = r.match(line)
if result:
print (result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.
Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.

Process each line of text file using Python

I am fairly new to Python. I have a text file containing many blocks of data in following format along with other unnecessary blocks.
NOT REQUIRED :: 123
Connected Part-1:: A ~$
Connected Part-3:: B ~$
Connector Location:: 100 200 300 ~$
NOT REQUIRED :: 456
Connected Part-2:: C ~$
i wish to extract the info (A,B,C, 100 200 300) corresponding to each property ( connected part-1, Connector location) and store it as list to use it later. I have prepared following code which reads file, cleans the line and store it as list.
import fileinput
with open('C:/Users/file.txt') as f:
content = f.readlines()
for line in content:
if 'Connected Part-1' in line or 'Connected Part-3' in line:
if 'Connected Part-1' in line:
connected_part_1 = [s.strip(' \n ~ $ Connected Part -1 ::') for s in content]
print ('PART_1:',connected_part_1)
if 'Connected Part-3' in line:
connected_part_3 = [s.strip(' \n ~ $ Connected Part -3 ::') for s in content]
print ('PART_3:',connected_part_3)
if 'Connector Location' in line:
# removing unwanted characters and converting into the list
content_clean_1 = [s.strip('\n ~ $ Connector Location::') for s in content]
#converting a single string item in list to a string
s = " ".join(content_clean_1)
# splitting the string and converting into a list
weld_location= s.split(" ")
print ('POSITION',weld_location)
here is the output
PART_1: ['A', '\t\tConnector Location:: 100.00 200.00 300.00', '\t\tConnected Part-3:: C~\t']
POSITION ['d', 'Part-1::', 'A', '\t\tConnector', 'Location::', '100.00', '200.00', '300.00', '\t\tConnected', 'Part-3::', 'C~\t']
PART_3: ['1:: A', '\t\tConnector Location:: 100.00 200.00 300.00', '\t\tConnected Part-3:: C~\t']
From the output of this program, i may conclude that, since 'content' is the string consisting all the characters in the file, the program is not reading an individual line. Instead it is considering all text as single string. Could anyone please help in this case?
I am expecting following output:
PART_1: ['A']
PART_3: ['C']
POSITION: ['100.00', '200.00','300.00']
(Note) When i am using individual files containing single line of data, it works fine. Sorry for such a long question
I will try to make it clear, and show how I would do it without regex. First of all, the biggest issue with the code presented is that when using the string.strip function the entire content list is being read:
connected_part_1 = [s.strip(' \n ~ $ Connected Part -1 ::') for s in content]
Content is the entire file lines, I think you want simply something like:
connected_part_1 = [line.strip(' \n ~ $ Connected Part -1 ::')]
How to parse the file is a bit subjective, but given the file format posted as input, I would do it like this:
templatestr = "{}: {}"
with open('inputreadlines.txt') as f:
content = f.readlines()
for line in content:
label, value = line.split('::')
ltokens = label.split()
if ltokens[0] == 'Connected':
print(templatestr.format(
ltokens[-1], #The last word on the label
value.split()[:-1])) #the split value without the last word '~$'
elif ltokens[0] == 'Connector':
print(value.split()[:-1]) #the split value without the last word '~$'
else: #NOT REQUIRED
pass
You can use the string.strip function to remove the funny characters '~$' instead of removing the last token as in the example.

regular expressions in python using quotes

I am attempting to create a regular expression pattern for strings similar to the below which are stored in a file. The aim is to get any column for any row, the rows need not be on a single line. So for example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns, column 4 for entity 2 would be "column\"this is, a test\"4b", column 2 for entity 3 would be "column2c". Each column begins with a quote and closes with a quote, however you must be careful because some columns have escaped quotes. Thanks in advance!
You could do like this, ie
Read the whole file.
Split the input according to the newline character which was not preceded by a comma.
Iterate over the spitted elements and again do splitting on the comma (and also the following optional newline character) which was preceded and followed by double quotes.
Code:
import re
with open(file) as f:
fil = f.read()
m = re.split(r'(?<!,)\n', fil.strip())
for i in m:
print(re.split('(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check..
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
f is the input file name and f.py is the file-name which contains the python script.
Your problem is terribly familiar to what I have to deal thrice every month :) Except I'm not using python to solve it, but I can 'translate' what I usually do:
text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''
import re
# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column
check = re.compile(r'"(?:[^"\\]*|\\.)*"')
# Read the file line by line
for line in text.split("\n"):
# If there's no stored partial line, this is a new line
if buffer == "":
# Check if we get 4 columns and print, if not, put the line
# into buffer so we store a partial line for later
if len(check.findall(line)) == columns:
print matches
else:
# use line.strip() if you need to trim whitespaces
buffer = line
else:
# Update the variable (containing a partial line) with the
# next line and recheck if we get 4 columns
# use line.strip() if you need to trim whitespaces
buffer = buffer + line
# If we indeed get 4, our line is complete and print
# We must not forget to empty buffer now that we got a whole line
if len(check.findall(buffer)) == columns:
print matches
buffer = ""
# Optional; always good to have a safety backdoor though
# If there is a problem with the csv itself like a weird unescaped
# quote, you send it somewhere else
elif len(check.findall(buffer)) > columns:
print "Error: cannot parse line:\n" + buffer
buffer = ""
ideone demo

Trouble concatenating two strings

I am having trouble concatenating two strings. This is my code:
info = infl.readline()
while True:
line = infl.readline()
outfl.write(info + line)
print info + line
The trouble is that the output appears on two different lines. For example, output text looks like this:
490250633800 802788.0 953598.2
802781.968872 953674.839355 193.811523 1 0.126805 -999.000000 -999.000000 -999.000000
I want both strings on the same line.
There must be a '\n' character at the end of info. You can remove it with:
info = infl.readline().rstrip()
You should remove line breaks in the line and info variables like this :
line=line.replace("\n","")
readline will return a "\n" at the end of the string 99.99% of the time. You can get around this by calling rstrip on the result.
info = infl.readline().rstip()
while True:
#put it both places!
line = infl.readline().rstip()
outfl.write(info + line)
print info + line
readline's docs:
Read one entire line from the file. A trailing newline character is kept in the string (but may be absent when a file ends with an incomplete line)...

Categories

Resources