I'm trying to check whether a specific string occurs in the text of a file.
I have a file that contains the following:
Active Internet connections
Proto Recv-Q Send-Q Local Address Foreign Address (state) rxbytes txbytes
tcp4 0 0 192.168.1.6.50860 72.21.91.29.http CLOSE_WAIT 892 691
tcp4 0 0 192.168.1.6.50858 www.v.dropbox.co.https ESTABLISHED 27671 7563
tcp4 0 0 192.168.1.6.50857 162.125.17.1.https ESTABLISHED 17581 3642
and here is my code:
char = ""
file = open("location")
for i, line in enumerate(file):
    addi = i + 1
    if line.strip() == char:
        print "MATCH FOUND on line " + str(addi)
print "finished"
For this to work, I have to paste the entire line into my char variable. For example, it works if I paste "Active Internet connections", but if I put "Internet", it goes straight to the print "finished" line. How would I fix this?
You need to look for contains (in) rather than equals (==). You can also use a list comprehension to get all the matches then print out the results:
char = "<search-string>"
with open("location") as file:
    results = [i for i, line in enumerate(file, 1) if char in line]
if results:
    # join() needs strings, so convert the line numbers first
    print "MATCHES FOUND on lines " + ', '.join(str(i) for i in results)
print "finished"
If you need more complicated search rules, then you may want to look at the regex module re
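As a rough illustration of what "more complicated search rules" could look like, here is a minimal Python 3 sketch that applies a regular expression to each line. The function name find_matching_lines and the sample lines are made up for this example; only re.compile and the search method of the compiled pattern come from the standard library.

```python
import re

def find_matching_lines(lines, pattern):
    """Return the 1-based numbers of the lines where `pattern` matches anywhere."""
    regex = re.compile(pattern)
    return [number for number, line in enumerate(lines, 1) if regex.search(line)]

# Sample data copied from the question; a real script would pass an open file instead.
sample = [
    "Active Internet connections",
    "tcp4 0 0 192.168.1.6.50860 72.21.91.29.http CLOSE_WAIT 892 691",
    "tcp4 0 0 192.168.1.6.50858 www.v.dropbox.co.https ESTABLISHED 27671 7563",
]

established = find_matching_lines(sample, r"\bESTABLISHED\b")
print(established)
```

Because an open file is itself an iterable of lines, the same function should work with `with open("location") as f: find_matching_lines(f, ...)`.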
You might want to use with open(...) as f for proper file handling.
And using the in keyword will work better than == because you want a match if it contains your string.
Also, using str.format is more readable IMO than "stuff" + str(value)
find = "Active Internet connections"
with open('location') as f:
    for i, line in enumerate(f, 1):
        if find in line:
            print("Match found on line {}".format(i))
print("finished")
In Python, strings are sequences of characters. To check whether one string occurs inside another, you can use the in operator.
if char in line:
    # do something
As simple as char in line.
Example usage is that "hi" in "hit" will be True, and "hi" in "hello" will be False.
You are checking whether the whole line equals char, but you should check whether char is contained in the line, since the line is usually longer than char:
for i, line in enumerate(file):
    line_index = i + 1
    if char in line:
        print "MATCH FOUND on line " + str(line_index)
print "finished"
Also, I would recommend not using char as a variable name. Try to use more explicit, less ambiguous names like pattern_to_find.
Looking for a substring in Python is a very simple task. The string methods find() and count() are very useful in this context.
# This is the string you're looking for
ip = "192.168.1.6.50860"
# You need to both open and read the file to get its content
file = open("/home/my/own/directory/here/file.txt").read()
def findLine(text, string):
    if string in text:
        # Count the newlines before the match to get its 1-based line number
        return "MATCH FOUND on line {}".format(text[:text.find(string)].count("\n") + 1)
    else:
        return "MATCH NOT FOUND"
print(findLine(file, ip)) # Prints "MATCH FOUND on line 3" for the file above
Try this:
search = "what you want to find goes here"
filename = "file to read"
with open(filename) as f:
    for i, line in enumerate(f, 1):
        if search in line:
            print "MATCH FOUND in line", i
Related
I have a huge text file which I need to read line by line for memory optimization.
I would like to get the string within two identifiers, as an example here between the identifiers '{' and '}':
input:
"
not this line
not this line
Pattern 'pattern' {
get this line
get this line
}
not this line
not this line
"
the output would be a string "get this line get this line "
There can be other identifiers ('{', '}', '[', ...) inside the string, but I need the matching ones. Ex: Pattern { something else {...} } would get something else {...} (the enclosed {...} stays inside the string).
I have written a simple counter like this, but it is quite slow. I am looking for a faster way of doing this.
currentString = ""
counter = 0
def GetStringBetweenIdentifiers(string, identifierA, identifierB):
    global currentString, counter
    for i in string:
        if (i == identifierB):
            counter -= 1
        if(counter > 0):
            currentString += i
        if(i == identifierA):
            counter += 1
    if(counter==0):
        string = currentString
        currentString = ""
        return string
    return ""
with open(filePath) as read_obj:
    for num, line in enumerate(read_obj, 1):
        String = GetStringBetweenIdentifiers(line, '{', '}')
        if (String != ""):
            "Do something with the string"
To add some examples, there can be identifiers in the middle of the line, for example:
input:
"
not this line
not this line
Pattern 'pattern' { I want this
get this line { something here }
get this line
also this part } not this part
not this line
not this line
"
the output would be a string " I want this get this line { something here } get this line also this part"
Thank you for reading!
This kind of thing can be very tricky due to ambiguous sequences. For example... Let's say that the start of a sequence of interest is '{' and the end is '}'. Now imagine that you've observed a start sentinel then, before you see an end marker, you see another start marker. What do you do then?
Anyway, here's something that will work in the perfect world (which doesn't really exist but it might give some ideas).
My input file looks like this:
not this line
not this line
Pattern 'pattern' { I want this
get this line { something here }
get this line
also this part } not this part
not this line
not this line
...and the code like this...
START = '{'
END = '}'
capture = 0
data = []
section = []
with open('foo.txt') as txt:
    while (c := txt.read(1)):
        if c == START:
            if (capture := capture + 1) > 1:
                section.append(c)
        elif c == END:
            if (capture := capture - 1) < 0:
                print('ERROR: unable to process (too many end tags)')
                break
            if capture:
                section.append(c)
            elif section:
                data.append(section)
                section = []
        elif capture and c not in '\r\n':
            section.append(c)
for section in data:
    print(''.join(section))
...and this output....
I want this get this line { something here }get this line also this part
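The same depth-counting idea can be wrapped in a generator so that it works on any iterable of characters. This is only a sketch under a few assumptions: the name extract_top_level is made up, line breaks inside a section are replaced by a single space (the code above drops them instead), and a stray end marker raises ValueError rather than printing a message.

```python
def extract_top_level(chars, start="{", end="}"):
    """Yield the text of each outermost start...end section, nested markers kept."""
    depth = 0
    section = []
    for c in chars:
        if c == start:
            depth += 1
            if depth > 1:      # a nested opener belongs to the captured text
                section.append(c)
        elif c == end:
            if depth == 0:
                raise ValueError("end marker without a matching start marker")
            depth -= 1
            if depth:          # a nested closer belongs to the captured text
                section.append(c)
            else:              # outermost closer: the section is complete
                yield "".join(section)
                section = []
        elif depth:
            # Assumption: line breaks inside a section become a single space.
            section.append(" " if c in "\r\n" else c)

text = "skip\nPattern 'p' { I want this\nget { nested } this\nalso } not this\n{second}"
sections = list(extract_top_level(text))
print(sections)
```

For a large file, the characters could be fed in lazily with `itertools.chain.from_iterable(open_file)`, which walks the file line by line and each line character by character, so the whole file never has to be held in memory.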
Welcome to the world of regex. It's quirky, but highly effective. This works for your situation, if in the lines you read there is only one capture-able sequence, which may contain sub sequences that might also be captured, as you show in your example. It will fail if there are independent sequences within the same input string, as it will capture the "outer most" subsequence that it finds. It would be a little more work to have it handle this case. (As they say, an exercise left to the interested reader.)
There is lots of good info in the Python docs, and this website is key for testing.
Aside: You may also want to look into the grep terminal command (not a Python solution). grep is highly effective at processing massive files and pulling out matches, and it works seamlessly with regex too.
Anyhow:
import re
with open('dummy_text.txt', 'r') as src:
    lines = src.readlines()
composite_string = ''.join(lines)
print('loaded and working with:\n')
print(composite_string)
print()
pattern = r'{((?s:.*))}'
results = re.search(pattern, composite_string)
print(f'I found: {results.group(1)}')
Produces:
loaded and working with:
not this line
not this line
Pattern 'pattern' {
get this line
get {this} line
}
not this line
not this line
I found:
get this line
get {this} line
I am very new to Python (and programming in general), and here is my issue. I would like to replace (or delete) part of a string in a txt file which contains hundreds or thousands of lines. Each line starts with the very same string, which I want to delete.
I have not found a method to delete it, so I tried to replace it with an empty string, but for some reason it doesn't work.
Here is what I have written:
file = "C:/Users/experimental/Desktop/testfile siera.txt"
siera_log = open(file)
text_to_replace = "Chart: Bar Backtest: NQU8-CME [CB] 1 Min #1 | Study: free dll = 0 |"
for each_line in siera_log:
    new_line = each_line.replace("text_to_replace", " ")
    print(new_line)
When I print it to check whether it was done, I can see that the lines are as they were before. No change was made.
Can anyone help me find out why?
each line starts with the very same string which i want to delete.
The problem is you're passing a string "text_to_replace" rather than the variable text_to_replace.
But, for this specific problem, you could just remove the first n characters from each line:
text_to_replace = "Chart: Bar Backtest: NQU8-CME [CB] 1 Min #1 | Study: free dll = 0 |"
n = len(text_to_replace)
for each_line in siera_log:
    new_line = each_line[n:]
    print(new_line)
If you quote a variable it becomes a string literal and won't be evaluated as a variable.
Change your line for replacement to:
new_line = each_line.replace(text_to_replace, " ")
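One caveat with slicing off the first n characters is that it also shortens any line that does not actually begin with the prefix. A minimal Python 3 sketch of a guarded version follows; the helper name strip_prefix and the sample lines are invented for illustration, while the prefix text is the one from the question.

```python
def strip_prefix(line, prefix):
    """Return `line` without `prefix` if it starts with it, otherwise unchanged."""
    if line.startswith(prefix):
        return line[len(prefix):]
    return line

text_to_replace = "Chart: Bar Backtest: NQU8-CME [CB] 1 Min #1 | Study: free dll = 0 |"

# Hypothetical sample lines; a real script would iterate over the open log file.
sample_lines = [
    text_to_replace + " first value\n",
    "a line without the prefix\n",
]

cleaned = [strip_prefix(line, text_to_replace) for line in sample_lines]
print(cleaned)
```

Note that the variable text_to_replace is passed without quotes, which is the fix described above.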
I have done a good many searches but have not been able to find a way to match a line with a regex and split it into two variables. I am using Python 2.6 and trying to figure out how to capture the integers into a value variable and the text into a metric variable. The output is pulled from a subprocess command running netstat -s.
The match below only handles the top 6 lines, not the bottom ones where the text comes first. I tried using an or condition within the parentheses, (?P<value>[0-9]+|\w+\s[0-9]+), and that did not work. I have been using this site, which is really helpful, but still no luck: https://regex101.com/r/yV5hA4/3#python
Any help or thoughts of using another method will be appreciated.
Code:
for line in output.split("\n"):
    match = re.search(r"(?P<value>[0-9]+)\s(?P<metric>\w+.*)", line, re.I)
    if match:
        value, metric = match.group('value', 'metric')
        print "%s => " % value + metric
The text I am trying to match:
17277 DSACKs received
4 DSACKs for out of order packets received
2 connections reset due to unexpected SYN
10294 connections reset due to unexpected data
48589 connections reset due to early user close
294 connections aborted due to timeout
TCPDSACKIgnoredOld: 15371
TCPDSACKIgnoredNoUndo: 1554
TCPSpuriousRTOs: 2
TCPSackShifted: 6330903
TCPSackMerged: 1883219
TCPSackShiftFallback: 792316
I would just forget about using re here, and just do something like this:
for line in output.split("\n"):
    value = None
    metric = ""
    for word in line.split():
        if word.isdigit():
            value = int(word)
        else:
            metric = "{} {}".format(metric, word)
    print "{} => {}".format(metric.strip(":"), value)
One slight caveat is that any line that has two or more numbers in it will only report the last one, but that's no worse than how your current approach would deal with that case...
Edit: I missed that the OP is on Python 2.6, where str.format does not accept auto-numbered {} fields. In that case, this should work:
for line in output.split("\n"):
    value = None
    metric = ""
    for word in line.split():
        if word.isdigit():
            value = int(word)
        else:
            metric = metric + " " + word
    print "%s => %s" % (metric.strip(":"), str(value))
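If a regex is still preferred, one option is to give each of the two line layouts its own alternative. The following is a sketch written for Python 3, not a drop-in for the original script; the names LINE_RE, parse_stat, metric2 and value2 are invented for this example.

```python
import re

# Either "<number> <text>" or "<Name>: <number>".
LINE_RE = re.compile(
    r"^\s*(?:(?P<value>\d+)\s+(?P<metric>\S.*?)"
    r"|(?P<metric2>\w+):\s*(?P<value2>\d+))\s*$"
)

def parse_stat(line):
    """Return (metric, value) for a recognised line, or None otherwise."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    if match.group("value") is not None:   # number-first layout
        return match.group("metric"), int(match.group("value"))
    return match.group("metric2"), int(match.group("value2"))

print(parse_stat("    17277 DSACKs received"))
print(parse_stat("TCPDSACKIgnoredOld: 15371"))
```

Two sets of group names are used because Python's re module does not allow the same group name to appear twice in one pattern.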
I have text file as follows seq.txt
>S1
AACAAGAAGAAAGCCCGCCCGGAAGCAGCTCAATCAGGAGGCTGGGCTGGAATGACAGCG
CAGCGGGGCCTGAAACTATTTATATCCCAAAGCTCCTCTCAGATAAACACAAATGACTGC
GTTCTGCCTGCACTCGGGCTATTGCGAGGACAGAGAGCTGGTGCTCCATTGGCGTGAAGT
CTCCAGGGCCAGAAGGGGCCTTTGTCGCTTCCTCACAAGGCACAAGTTCCCCTTCTGCTT
CCCCGAGAAAGGTTTGGTAGGGGTGGTGGTTTAGTGCCTATAGAACAAGGCATTTCGCTT
CCTAGACGGTGAAATGAAAGGGAAAAAAAGGACACCTAATCTCCTACAAATGGTCTTTAG
TAAAGGAACCGTGTCTAAGCGCTAAGAACTGCGCAAAGTATAAATTATCAGCCGGAACGA
GCAAACAGACGGAGTTTTAAAAGATAAATACGCATTTTTTTCCGCCGTAGCTCCCAGGCC
AGCATTCCTGTGGGAAGCAAGTGGAAACCCTATAGCGCTCTCGCAGTTAGGAAGGAGGGG
TGGGGCTGTCCCTGGATTTCTTCTCGGTCTCTGCAGAGACAATCCAGAGGGAGACAGTGG
ATTCACTGCCCCCAATGCTTCTAAAACGGGGAGACAAAACAAAAAAAAACAAACTTCGGG
TTACCATCGGGGAACAGGACCGACGCCCAGGGCCACCAGCCCAGATCAAACAGCCCGCGT
CTCGGCGCTGCGGCTCAGCCCGACACACTCCCGCGCAAGCGCAGCCGCCCCCCCGCCCCG
GGGGCCCGCTGACTACCCCACACAGCCTCCGCCGCGCCCTCGGCGGGCTCAGGTGGCTGC
GACGCGCTCCGGCCCAGGTGGCGGCCGGCCGCCCAGCCTCCCCGCCTGCTGGCGGGAGAA
ACCATCTCCTCTGGCGGGGGTAGGGGCGGAGCTGGCGTCCGCCCACACCGGAAGAGGAAG
TCTAAGCGCCGGAAGTGGTGGGCATTCTGGGTAACGAGCTATTTACTTCCTGCGGGTGCA
CAGGCTGTGGTCGTCTATCTCCCTGTTGTTC
>S2
ACACGCATTCACTAAACATATTTACTATGTGCCAGGCACTGTTCTCAGTGCTGGGGATAT
AGCAGTGAAGAAACAGAAACCCTTGCACTCACTGAGCTCATATCTTAGGGTGAGAAACAG
TTATTAAGCAAGATCAGGATGGAAAACAGATGGTACGGTAGTGTGAAATGCTAAAGAGAA
AAATAACTACGGAAAAGGGATAGGAAGTGTGTGTATCGCAGTTGACTTATTTGTTCGCGT
TGTTTACCTGCGTTCTGTCTGCATCTCCCACTAAACTGTAAGCTCTACATCTCCCATCTG
TCTTATTTACCAATGCCAACCGGGGCTCAGCGCAGCGCCTGACACACAGCAGGCAGCTGA
CAGACAGGTGTTGAGCAAGGAGCAAAGGCGCATCTTCATTGCTCTGTCCTTGCTTCTAGG
AGGCGAATTGGGAAATCCAGAGGGAAAGGAAAAGCGAGGAAAGTGGCTCGCTTTTGGCGC
TGGGGAAGAGGTGTACAGTGAGCAGTCACGCTCAGAGCTGGCTTGGGGGACACTCTCACG
CTCAGGAGAGGGACAGAGCGACAGAGGCGCTCGCAGCAGCGCGCTGTACAGGTGCAACAG
CTTAGGCATTTCTATCCCTATTTTTACAGCGAGGGACACTGGGCCTCAGAAAGGGAAGTG
CCTTCCCAAGCTCCAACTGCTCATAAGCAGTCAACCTTGTCTAAGTCCAGGTCTGAAGTC
CTGGAGCGATTCTCCACCCACCACGACCACTCACCTACTCGCCTGCGCTTCACCTCACGT
GAGGATTTTCCAGGTTCCTCCCAGTCTCTGGGTAGGCGGGGAGCGCTTAGCAGGTATCAC
CTATAAGAAAATGAGAATGGGTTGGGGGCCGGTGCAAGACAAGAATATCCTGACTGTGAT
TGGTTGAATTGGCTGCCATTCCCAAAACGAGCTTTGGCGCCCGGTCTCATTCGTTCCCAG
CAGGCCCTGCGCGCGGCAACATGGCGGGGTCCAGGTGGAGGTCTTGAGGCTATCAGATCG
GTATGGCATTGGCGTCCGGGCCCGCAAGGCG
.
.
.
.
I have to count the occurrences of a pattern in each of these sequences, and I wrote this Python script to do it:
import re
infile = open("seq.txt", 'r')
out = open("pat.txt", 'w')
pattern = re.compile("GAAAT", flags=re.IGNORECASE)
for line in infile:
    line = line.strip("\n")
    if line.startswith('>'):
        name = line
    else:
        s = re.findall(pattern,line)
        print '%s:%s' %(name,s)
        out.write('%s:\t%s\n' %(name,len(s)))
But it is giving the wrong result. The script is reading line by line.
S1 : 0
S1 : 0
S1 : 0
S1 : 0
S2 : 0
S2 : 1
S2 : 0
S2 : 1
But I want output as follows:
S1 : 0
S2 : 2
Can anybody help?
Use a hit counter, zero it if line.startswith('>'). Increment by len(s) otherwise.
This code might be helpful for you:
import re
pattern = re.compile("GAAAT", flags=re.IGNORECASE)
with open('seq.txt') as f:
    sections = f.read().split('\n\n')
for section in sections:
    lines = section.split()
    name = lines[0].lstrip('>')
    data = ''.join(lines[1:])
    print '{0}: {1}'.format(name, len(pattern.findall(data)))
Example output:
S1: 1
S2: 2
Notes:
It's assumed that two newline characters are used to separate every section as in the example.
It's assumed that every section name is preceded by a greater than (>) character as in the example.
If you already have a pattern, use pattern.findall(data) instead of re.findall(pattern, data)
You should gather input until you reach the next header (the next line starting with >). This would also solve the corner case where your pattern crosses a line boundary (not sure whether that can happen with your data, but it looks like it).
Use a counter. Also, you have your print statement inside the for loop, so it runs once for every line that reaches the else branch. Note that it's also not a good idea to use line as both the loop variable and a variable you reassign inside the loop; it makes the code more confusing.
counter_dict = {}
for line in infile:
    if line[0] == '>':
        # Drop the leading '>' and the trailing newline
        name = line[1:].strip()
        counter_dict[name] = 0
    else:
        counter_dict[name] += len(re.findall(pattern,line))
for (key, val) in counter_dict.items():
    print '%s:%s' %(key, val)
    out.write('%s:\t%s\n' %(key, val))
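The suggestions above (one counter per record, and gathering a whole record before searching) can be combined. Here is a minimal Python 3 sketch, assuming every record starts with a line beginning with > as in the question. The function name count_per_record and the short sample sequences are made up; joining the lines of a record before searching is what lets a match that spans a line break be counted.

```python
import re

def count_per_record(lines, pattern):
    """Return [(name, count)] with one entry per '>' header, in file order."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    results = []
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:   # finish the previous record
                results.append((name, len(regex.findall("".join(chunks)))))
            name, chunks = line[1:], []
        elif name is not None and line:
            chunks.append(line)
    if name is not None:           # flush the final record
        results.append((name, len(regex.findall("".join(chunks)))))
    return results

# Hypothetical miniature input; in S1 the match spans the line break.
sample = [">S1\n", "ACGTGA\n", "AATCC\n", ">S2\n", "GAAATGAAAT\n", "CC\n"]
counts = count_per_record(sample, "GAAAT")
print(counts)
```

findall counts non-overlapping matches, which is enough for a pattern such as GAAAT that cannot overlap with itself.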
I have done this operation millions of times, just using the + operator! I have no idea why it is not working this time, it is overwriting the first part of the string with the new one! I have a list of strings and just want to concatenate them in one single string! If I run the program from Eclipse it works, from the command-line it doesn't!
The list is:
["UNH+1+XYZ:08:2:1A+%CONVID%'&\r", "ORG+1A+77499505:ABC+++A+FR:EUR++123+1A'&\r", "DUM'&\r"]
I want to discard the first and the last elements, the code is:
ediMsg = ""
count = 1
print "extract_the_info, lineList ",lineList
print "extract_the_info, len(lineList) ",len(lineList)
while (count < (len(lineList)-1)):
    temp = ""
    # ediMsg = ediMsg+str(lineList[count])
    # print "Count "+str(count)+" ediMsg ",ediMsg
    print "line value : ",lineList[count]
    temp = lineList[count]
    ediMsg += " "+temp
    print "ediMsg : ",ediMsg
    count += 1
    print "count ",count
Look at the output:
extract_the_info, lineList ["UNH+1+XYZ:08:2:1A+%CONVID%'&\r", "ORG+1A+77499505:ABC+++A+FR:EUR++123+1A'&\r", "DUM'&\r"]
extract_the_info, len(lineList) 8
line value : ORG+1A+77499505:ABC+++A+FR:EUR++123+1A'&
ediMsg : ORG+1A+77499505:ABC+++A+FR:EUR++123+1A'&
count 2
line value : DUM'&
DUM'& : ORG+1A+77499505:ABC+++A+FR:EUR++123+1A'&
count 3
Why is it doing so!?
While the two answers are correct (use " ".join()), your problem (besides very ugly python code) is this:
Your strings end in "\r", which is a carriage return. Everything is fine, but when you print to the console, "\r" will make printing continue from the start of the same line, hence overwrite what was written on that line so far.
You should use the following and forget about this nightmare:
''.join(list_of_strings)
The problem is not with the concatenation of the strings (although that could use some cleaning up), but in your printing. The \r in your string has a special meaning and will overwrite previously printed strings.
Use repr(), as such:
...
print "line value : ", repr(lineList[count])
temp = lineList[count]
ediMsg += " "+temp
print "ediMsg : ", repr(ediMsg)
...
to print out your result; that will make sure any special characters don't mess up the output.
'\r' is the carriage return character. When you're printing out a string, a '\r' will cause the next characters to go at the start of the line.
Change this:
print "ediMsg : ",ediMsg
to:
print "ediMsg : ",repr(ediMsg)
and you will see the embedded \r values.
And while your code works, please change it to the one-liner:
ediMsg = ' '.join(lineList[1:-1])
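To see both points together, here is a small Python 3 sketch using the three-element list from the question. The snake_case names are mine, and stripping the trailing \r is an extra step that assumes the carriage returns are not needed further downstream.

```python
line_list = [
    "UNH+1+XYZ:08:2:1A+%CONVID%'&\r",
    "ORG+1A+77499505:ABC+++A+FR:EUR++123+1A'&\r",
    "DUM'&\r",
]

# Drop the first and last elements, strip the trailing carriage returns, then join.
edi_msg = " ".join(item.rstrip("\r") for item in line_list[1:-1])

print(repr(line_list[1]))  # repr() shows the trailing \r as a visible escape
print(edi_msg)
```

With the \r removed, printing edi_msg no longer moves the cursor back to the start of the line, so nothing is overwritten on the console.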
Your problem is printing, and it is not string manipulation. Try using '\n' as last char instead of '\r' in each string in:
lineList = [
"UNH+1+TCCARQ:08:2:1A+%CONVID%'&\r",
"ORG+1A+77499505:PARAF0103+++A+FR:EUR++11730788+1A'&\r",
"DUM'&\r",
"FPT+CC::::::::N'&\r",
"CCD+CA:5132839000000027:0450'&\r",
"CPY+++AF'&\r",
"MON+712:1.00:EUR'&\r",
"UNT+8+1'\r"
]
I just gave it a quick look. It seems your problem arises when you are printing the text. I haven't done such things for a long time, but probably you only get the last line when you print. If you check the actual variable, I'm sure you'll find that the value is correct.
By last line, I'm talking about the \r you got in the text strings.