How to count occurrences of substrings in string from text file - python

I want to count the number of lines in a .txt file where a string contains two substrings.
I tried the following:
with open(filename, 'r') as file:
    for line in file:
        wordsList = line.split()
        if any("leads" and "show" in s for s in wordsList):
            repetitions +=1
print "Repetitions: %i" % (repetitions)
But it doesn't seem to be working as it should.
With the following demo input file I got 3 repetitions when it should be 2:
www.google.es/leads/hello/show
www.google.es/world
www.google.com/leads/nyc/oops
www.google.es/leads/la/show
www.google.es/leads/nope/pop
leads.
show
I also tried changing "any" to "all" but I get even stranger results.

"leads" and "show" in s is interpreted as:
"leads" and ("show" in s) because of precedence.
Python tries to interpret "leads" as a boolean. As it is a non-empty string, it evaluates to True.
So, your expression is equivalent to "show" in s
What you mean is:
"leads" in s and "show" in s
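Putting the corrected condition back into the counting loop might look like the sketch below. The function name count_lines_with_all and the in-memory demo list are assumptions made for illustration; the original reads the lines from a file instead.

```python
def count_lines_with_all(lines, substrings):
    """Count lines in which at least one whitespace-separated word
    contains every one of the given substrings."""
    count = 0
    for line in lines:
        words = line.split()
        # Each substring gets its own "in" test, so none is skipped
        if any(all(sub in word for sub in substrings) for word in words):
            count += 1
    return count


# Demo input copied from the question
demo = [
    "www.google.es/leads/hello/show",
    "www.google.es/world",
    "www.google.com/leads/nyc/oops",
    "www.google.es/leads/la/show",
    "www.google.es/leads/nope/pop",
    "leads.",
    "show",
]
print("Repetitions: %i" % count_lines_with_all(demo, ["leads", "show"]))
```

With the demo input this counts the two lines that contain both substrings, whereas the original condition also counted the lone "show" line.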

Related

Match a string which is few lines above another line where the first string was matched

So, I have this text file which is huge. I need to look for a string and when I match it, I need to go a few lines back(above the current line) and search for another string and extract some information from that line that contains the second string. How can I do this in Python using regex match?
I am trying to do something like this.
substr1 = re.compile("ACT",re.IGNORECASE)
substr2 = re.compile(vector,re.IGNORECASE)
try:
    with open (filepath, 'rt') as in_file:
        for linenum, line in enumerate(in_file):
            if substr2.search(line) != None:
                print(linenum,line)
                # Code to trace back a few lines to look for substr1
                break
except FileNotFoundError: # If the file not found,
    print("pattern not found.") # print an error message.
It is kind of like I want to read it backward when I match the first string and look for the first occurrence of the second string. The number of lines varies and I cannot thus use the dequeue option I think. I am totally new to Python.
Any help is appreciated, thank you!
Am adding an example log file that I am reading.
X 123
X 1234
X 12345
Vector1
----
-----
-----
X 1231
X 12344
X 123456
vector a
vector b
vector c
vector d
-------
-------
Vector
----
-----
-----
X 1233
X 12345
X 123451
Vector2
String 1 : Vector
String 2 : X
Output should be X 123456

You do not need to backtrack. Instead, just search forward in a smarter manner. If you search for substr1 first, the only issue that could happen is that more occurrences of substr1 will be found before you find substr2. The way to handle that is to keep updating match of substr1 as you go.
From your description, it does not appear that you need regex at all. Instead, you appear to be looking for simple string containment tests.
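substr1 = 'X'
substr2 = 'Vector'
with open(filepath, 'rt') as in_file:
    matched = None
    for linenum, line in enumerate(in_file, start=1):
        if substr1 in line:
            # Remember the most recent line containing the first string
            matched = line
        elif matched and line.strip() == substr2:
            # Found the second string: report the last line remembered before it
            print(matched)
            break
Lines read from a file keep their trailing newline, so the comparison strips the line before testing it against substr2. If you only need a prefix match, line.startswith(substr2) also tolerates trailing whitespace, but note that it would match Vector1 in your sample as well.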
Minor fixes:
start=1 will make your line numbers start with 1, which is probably what you want.
If you want to compare against None, the proper way is is not None instead of != None. Additionally, a compiled pattern's search method returns a match object when a match occurs and None otherwise. A match object is always truthy, so the idiomatic check is simply if substr2.search(line): with no comparison at all.
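The forward-search idea can be tried without a file, as in this minimal sketch. The helper name last_line_before and the use of an in-memory list of lines are assumptions for illustration; the log excerpt is taken from the question.

```python
def last_line_before(lines, target, marker):
    """Return the last line containing `target` that appears before the
    first line equal to `marker`, or None if no such pair exists."""
    matched = None
    for line in lines:
        stripped = line.strip()
        if stripped == marker:
            if matched is not None:
                return matched
        elif target in stripped:
            # Keep overwriting so only the most recent hit is remembered
            matched = stripped
    return None


log = """X 123
X 1234
X 12345
Vector1
----
X 1231
X 12344
X 123456
vector a
vector b
-------
Vector
----
X 1233
Vector2""".splitlines()

print(last_line_before(log, "X", "Vector"))
```

Because the marker is compared for equality after stripping, Vector1 and the lowercase vector lines are ignored, and the result for this excerpt is X 123456.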

Parsing paragraph out of text file in Python?

I am trying to parse certain paragraphs out of multiple text files and store them in a list. All the text files have a format similar to this:
MODEL NUMBER: A123
MODEL INFORMATION: some info about the model
DESCRIPTION: This will be a description of the Model. It
could be multiple lines but an empty line at the end of each.
CONCLUSION: Sold a lot really profitable.
Now I can pull out the information where it's one line, but I am having trouble when I encounter something that spans multiple lines (like 'Description'). The description length is not known, but I know it ends with an empty line (which would mean using '\n'). This is what I have so far:
import os
dir = 'Test'
DESCRIPTION = []
for files in os.listdir(dir):
    if files.endswith('.txt'):
        with open(dir + '/' + files) as File:
            reading = File.readlines()
            for num, line in enumerate(reading):
                if 'DESCRIPTION:' in line:
                    Start_line = num
                if len(line.strip()) == 0:
I don't know if it's the best approach, but what I was trying to do with if len(line.strip()) == 0: is to create a list of blank lines and then find the first value greater than Start_line. I saw this: Bisect.
In the end I would like my data to look like this when I print DESCRIPTION:
['DESCRIPTION: Description from file 1',
'DESCRIPTION: Description from file 2',
'DESCRIPTION: Description from file 3']
Thanks.

Regular expression. Think about it this way: you have a pattern that will allow you to cut any file into pieces you will find palatable: "newline followed by capital letter"
re.split is your friend
Take a string
"THE
BEST things
in life are
free
IS
YET
TO
COME"
As a string:
import re
p = "THE\nBEST things\nin life are\nfree\nIS\nYET\nTO\nCOME"
c = re.split('\n(?=[A-Z])', p)
Which produces list c
['THE', 'BEST things\nin life are\nfree', 'IS', 'YET', 'TO', 'COME']
I think you can take it from there. This splits each file into a list of strings, one string per section, with any continuation lines kept inside their section. From there you can find the element that starts with "DESCRIPTION" and store it. Note how the regex is set up: it recognizes the pattern "newline followed by a capital letter", but only the newline is consumed by the split. The capital letter sits inside the lookahead (?=...), so it stays at the start of the next piece.
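Applied to the file format from the question, the split might be used roughly as follows. The function name extract_section and the whitespace normalization are assumptions for illustration, and the sketch relies on continuation lines not starting with a capital letter, which holds for the sample but may not hold for real files.

```python
import re


def extract_section(text, label="DESCRIPTION:"):
    """Return the section starting with `label`, joined onto one line,
    or None if the text has no such section."""
    # Split at every newline that is followed by a capital letter
    sections = re.split(r"\n(?=[A-Z])", text)
    for section in sections:
        if section.startswith(label):
            # Collapse internal newlines and repeated spaces
            return " ".join(section.split())
    return None


sample = (
    "MODEL NUMBER: A123\n"
    "MODEL INFORMATION: some info about the model\n"
    "DESCRIPTION: This will be a description of the Model. It\n"
    "could be multiple lines but an empty line at the end of each.\n"
    "\n"
    "CONCLUSION: Sold a lot really profitable.\n"
)
print(extract_section(sample))
```

Calling this once per file and appending the result to a list would give the list of descriptions asked for in the question.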

Get some string before " not in all lines python

I have such entries in a txt file with such structure:
Some sentence.
Some other "other" sentence.
Some other smth "other" sentence.
In original:
Камиш-Бурунський залізорудний комбінат
Відкрите акціонерне товариство "Кар'єр мармуровий"
Закрите акціонерне товариство "Кар'єр мармуровий"
I want to extract everything before " and write to another file. I want the result to be:
Some other
Some other smth
Відкрите акціонерне товариство
Закрите акціонерне товариство
I have done this:
f=codecs.open('organization.txt','r+','utf-8')
text=f.read()
words_sp=text.split()
for line in text:
    before_keyword, after_keyword = line.split(u'"',1)
    before_word=before_keyword.split()[0]
    encoded=before_word.encode('cp1251')
    print encoded
But it doesn't work, since some lines in the file don't contain ". How can I improve my code to make it work?

There are two problems. First you must use the splitlines() function to break a string into lines. (What you have will iterate one character at a time.) Secondly, the following code will fail when split returns a single item:
before_keyword, after_keyword = line.split(u'"',1)
The following works for me:
for line in text.splitlines():
    if u'"' in line:
        before_keyword, after_keyword = line.split(u'"',1)
        ... etc. ...
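A self-contained version of the same idea, written for Python 3, could look like this sketch. The function name text_before_quote is an assumption for illustration, and stripping the trailing space is inferred from the expected output shown in the question.

```python
def text_before_quote(text, quote='"'):
    """Return the text before the first quote character for every line
    that contains one; lines without a quote are skipped."""
    results = []
    for line in text.splitlines():
        if quote in line:
            before, _after = line.split(quote, 1)
            # Drop the space that precedes the opening quote
            results.append(before.rstrip())
    return results


sample = (
    "Some sentence.\n"
    'Some other "other" sentence.\n'
    'Some other smth "other" sentence.'
)
for prefix in text_before_quote(sample):
    print(prefix)
```

Checking for the quote before splitting is what avoids the unpacking error on lines such as "Some sentence.".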

Get a value from a string in python

Program Details:
I am writing a program for python that will need to look through a text file for the line:
Found mode 1 of 12: EV= 1.5185449E+04, f= 19.612545, T= 0.050988.
Problem:
Then after the program has found that line, it will then store the line into an array and get the value 19.612545, from f = 19.612545.
Question:
So far I have been able to store the line in an array after finding it. However, I am not sure what to use to search through the stored string and extract the value of f. Does anyone have any suggestions or tips on how to accomplish this?

Depending upon how you want to go at it, CosmicComputer is right to refer you to Regular Expressions. If your syntax is this simple, you could always do something like:
line = 'Found mode 1 of 12: EV= 1.5185449E+04, f= 19.612545, T= 0.050988.'
splitByComma=line.split(',')
fValue = splitByComma[1].replace('f= ', '').strip()
print(fValue)
Results in 19.612545 being printed (still a string though).
Split your line by commas, grab the 2nd chunk, and break out the f value. Error checking and conversions left up to you!

Using regular expressions here is madness. Just use string.find as follows (where string is the name of the variable that holds your string):
index = string.find('f=')
index = index + 2  # skip over "f="
string = string[index:]  # cut off everything before the value
string = string.split(',')  # split the remaining string on commas
your_value = string[0].strip()  # take the first field and drop the leading space
I know it's ugly, but it's nothing compared with RE.
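For comparison, the regular-expression route mentioned in the first answer might be sketched as below. The function name extract_value and the exact number pattern are assumptions for illustration; the pattern only covers plain decimals and exponent notation like the values in the sample line.

```python
import re


def extract_value(line, name="f"):
    """Return the number that follows `name=` in the line as a float,
    or None if the name is not present."""
    number = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
    pattern = r"\b" + re.escape(name) + r"=\s*(" + number + r")"
    match = re.search(pattern, line)
    if match is None:
        return None
    return float(match.group(1))


line = "Found mode 1 of 12: EV= 1.5185449E+04, f= 19.612545, T= 0.050988."
print(extract_value(line))
print(extract_value(line, "EV"))
```

Unlike the split-based versions, this does not depend on the position of f within the line, and it converts the result to a float in the same step.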

Workarounds when a string is too long for a .join. OverflowError occurs

I'm working through some python problems on pythonchallenge.com to teach myself python and I've hit a roadblock, since the string I am to be using is too large for python to handle. I receive this error:
my-macbook:python owner1$ python singleoccurrence.py
Traceback (most recent call last):
File "singleoccurrence.py", line 32, in <module>
myString = myString.join(line)
OverflowError: join() result is too long for a Python string
What alternatives do I have for this issue? My code looks like such...
#open file testdata.txt
#for each character, check if already exists in array of checked characters
#if so, skip.
#if not, character.count
#if count > 1, repeat recursively with first character stripped off of page.
# if count = 1, add to valid character array.
#when string = 0, print valid character array.
valid = []
checked = []
myString = ""
def recursiveCount(bigString):
    if len(bigString) == 0:
        print "YAY!"
        return valid
    myChar = bigString[0]
    if myChar in checked:
        return recursiveCount(bigString[1:])
    if bigString.count(myChar) > 1:
        checked.append(myChar)
        return recursiveCount(bigString[1:])
    checked.append(myChar)
    valid.append(myChar)
    return recursiveCount(bigString[1:])
fileIN = open("testdata.txt", "r")
line = fileIN.readline()
while line:
    line = line.strip()
    myString = myString.join(line)
    line = fileIN.readline()
myString = recursiveCount(myString)
print "\n"
print myString

string.join doesn't do what you think. join is used to combine a list of words into a single string with the given separator. For example:
>>> ",".join(('foo', 'bar', 'baz'))
'foo,bar,baz'
The code snippet you posted will attempt to insert myString between every character in the variable line. You can see how that will get big quickly :-). Are you trying to read the entire file into a single string, myString? If so, the way you want to concatenate the strings is like this:
myString = myString + line
While I'm here... since you're learning Python here are some other suggestions.
There are easier ways to read an entire file into a variable. For instance:
fileIN = open("testdata.txt", "r")
myString = fileIN.read()
(This won't have the exact behaviour of your existing strip() code, but may in fact do what you want.)
Also, I would never recommend practical Python code use recursion to iterate over a string. Your code will make a function call (and a stack entry) for every character in the string. Also I'm not sure Python will be very smart about all the uses of bigString[1:]: it may well create a second string in memory that's a copy of the original without the first character. The simplest way to process every character in a string is:
for mychar in bigString:
    ... do your stuff ...
Finally, you are using the list named "checked" to see if you've ever seen a particular character before. But the membership test on lists ("if myChar in checked") is slow. In Python you're better off using a dictionary (or a set):
checked = {}
...
if myChar not in checked:
    checked[myChar] = True
...
This exercise you're doing is a great way to learn several Python idioms.
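Pulling these suggestions together, a loop-free version of the exercise might look like the sketch below. It uses collections.Counter, which the answer above does not mention, and the function name single_occurrence_chars is an assumption for illustration.

```python
from collections import Counter


def single_occurrence_chars(text):
    """Return the characters that occur exactly once in text, in order
    of first appearance."""
    # One pass over the string; no recursion and no repeated slicing
    counts = Counter(text)
    return [ch for ch in counts if counts[ch] == 1]


print(single_occurrence_chars("abcabd"))
```

For the real exercise, the text argument would be the file contents read with fileIN.read(), with newlines removed if they should not be counted.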
