Python - Splitting a large string by number of delimiter occurrences

I'm still learning Python, and I have a question I haven't been able to solve. I have a very long string (millions of lines long) that I would like to split into a shorter string based on a specified number of occurrences of a delimiter.
For instance:
ABCDEF
//
GHIJKLMN
//
OPQ
//
RSTLN
//
OPQR
//
STUVW
//
XYZ
//
In this case I would want to split based on "//" and return a string of all lines before the nth occurrence of the delimiter.
So splitting the string by // with n = 1 would return:
ABCDEF
Splitting the string by // with n = 2 would return:
ABCDEF
//
GHIJKLMN
Splitting the string by // with n = 3 would return:
ABCDEF
//
GHIJKLMN
//
OPQ
And so on. However, the length of the original 2-million-line string appeared to be a problem when I simply tried to split the entire string by "//" and work with the individual indexes (I was getting a memory error). Perhaps Python can't handle so many lines in one split? So I can't do that.
I'm looking for a way to avoid splitting the entire string into a hundred thousand pieces when I may only need 100. Instead I want to start from the beginning, stop at a certain point, and return everything before it, which I assume may also be faster. I hope my question is as clear as possible.
Is there a simple or elegant way to achieve this? Thanks!

If you want to work with files instead of strings in memory, here is another answer.
This version is written as a function that reads lines and immediately prints them out until the specified number of delimiters have been found (no extra memory needed to store the entire string).
def file_split(file_name, delimiter, n=1):
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip()  # use .rstrip("\n") to only strip newlines
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            print(line)

file_split('data.txt', '//', 3)
You can use this to write the output to a new file like this:
python split.py > newfile.txt
With a little extra work, you can use argparse to pass parameters to the program.
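If the text is already in memory as one string, the same idea (stop at the nth delimiter instead of splitting everything) can be sketched with repeated str.find calls. This is only a minimal sketch: the function name text_before_nth is made up for illustration, and it assumes the delimiter never appears inside a data line.

```python
def text_before_nth(text, delimiter, n):
    """Return everything before the n-th occurrence of delimiter in text.

    If the delimiter occurs fewer than n times, the whole text is returned.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    start = 0
    pos = -1
    for _ in range(n):
        # Search only the part of the text that has not been scanned yet.
        pos = text.find(delimiter, start)
        if pos == -1:
            return text
        start = pos + len(delimiter)
    return text[:pos]


sample = "ABCDEF\n//\nGHIJKLMN\n//\nOPQ\n//\nRSTLN\n//\n"
print(text_before_nth(sample, "//", 2), end="")
```

Only one slice of the original string is created, so no list of a hundred thousand pieces is ever built.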

A more efficient way is to read only the first N records. If you are sure that every record is a single line followed by a delimiter line, you can use itertools.islice to do the job:
from itertools import islice
with open('filename') as f:
    lines = islice(f, 0, 2*N-1)  # lazy iterator: consume it inside the with block
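A minimal sketch of how the islice result could be consumed, assuming strict alternation of one data line and one delimiter line and n of at least 1. The helper name first_n_records is hypothetical, and io.StringIO stands in for a real file here.

```python
import io
from itertools import islice


def first_n_records(lines, n):
    """Join the first 2*n - 1 lines: n records plus the n - 1 delimiters between them."""
    return "".join(islice(lines, 2 * n - 1))


# io.StringIO behaves like an open text file for the purpose of this example.
sample = io.StringIO("ABCDEF\n//\nGHIJKLMN\n//\nOPQ\n//\n")
print(first_n_records(sample, 2), end="")
```

Because islice is lazy, only the requested lines are read from the file object.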

The method that comes to my mind when I read your question uses a for loop,
where you cut the string into several pieces (for example the 100 you mentioned) and iterate through each substring.
thestring = ""  # your string
steps = 100  # length of the strings you are going to use for iteration
log = 0
substring = thestring[:log+steps]  # this is the string you will split and iterate through
thelist = substring.split("//")
for element in thelist:
    if(element you want):
        # do your thing with the line
    else:
        log = log+steps
        # and go again from the start only with this offset
Now you can go through all the elements of the whole 2 million(!) line string.
The best thing to do here is actually to make a recursive function from this (if that is what you want):
thestring = ""  # your string
steps = 100  # length of the strings you are going to use for iteration
def iterateThroughHugeString(beginning):
    substring = thestring[:beginning+steps]  # this is the string you will split and iterate through
    thelist = substring.split("//")
    for element in thelist:
        if(element you want):
            # do your thing with the line
        else:
            iterateThroughHugeString(beginning+steps)
            # and go again from the start only with this offset

For instance:
i = 0
s = ""
fd = open("...")
for l in fd:
    if l[:-1] == delimiter:  # skip last '\n'
        i += 1
        if i >= max_split:
            break
    s += l
fd.close()

Since you are learning Python, it would be a good challenge to build a complete dynamic solution yourself. Here's an idea of how you can model one.
Note: the following code snippet only works for files in the given format (see the 'For instance' in the question). Hence, it is a static solution.
num = (int(input("Enter delimiter: ")) * 2)
with open("./data.txt") as myfile:
    print([next(myfile) for x in range(num-1)])
Now that you have the idea, you can use pattern matching and so on.

Related

python 3 parsing a semicolon separated very long string to remove each second element

I'm pretty new to Python and am looking for a way to get the following result from a long string.
I am reading in lines of a text file where each line looks like this:
; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;
After processing, the data should be stored in another text file in this form.
Short example:
2:55:12;66,81;66,75;35,38;
The real string is much longer, but always follows the same pattern:
; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38; Puff2OG;30,25; Puff1OG;29,25; PuffFB;23,50; ....
So this means: remove the leading semicolon,
keep the second element,
remove the third element,
keep the fourth element,
remove the fifth element,
keep the sixth element,
and so on.
The number of elements can vary, so I guess as a first step I have to parse the string to get the number of elements, then loop through the string and assign each part that should be kept to a variable.
I have tried some variations of .split() but with no success.
Would it be easier to store all elements in a list and then loop through the list, keeping and dropping elements?
If yes, what would this look like, so that at the end I have stored a file with
lines like this:
2:55:12 ; 66,81 ; 66,75 ; 35,38 ;
2:56:12 ; 67,15 ; 74;16 ; 39,15 ;
etc. ....
best regards Stefan

This solution works independently of the content between the semicolons
One line, though it's a bit messier:
result = ' ; '.join(string.split(';')[1::2])
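Getting rid of the leading semicolon:
Just slice it off!
string = string[2:]
Note that after slicing, the elements to keep sit at even indexes, so the next step would need [::2] instead of [1::2]. The one-liner above skips the slicing: on the unsliced string, index 0 is the empty field before the leading semicolon, and [1::2] drops it.
Splitting by semicolon & every second element:
Given the original string (leading semicolon included), we can split by semicolon:
arr = string.split(';')[1::2]
The [1::2] means to take every second element, starting with index 1. This keeps all "even" elements (second, fourth, etcetera).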
Resulting string
To produce the string result you want, simply .join:
result = ' ; '.join(arr)
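Putting the steps together, here is a small sketch applied to the sample line from the question. The function name keep_every_second is only illustrative, and stripping the surrounding spaces is an assumption based on the short example output in the question.

```python
def keep_every_second(line):
    """Keep the 2nd, 4th, 6th, ... semicolon-separated fields of line."""
    fields = line.split(";")
    # Index 0 is the empty field before the leading semicolon, so start at 1.
    kept = [field.strip() for field in fields[1::2]]
    return ";".join(kept) + ";"


raw = "; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;"
print(keep_every_second(raw))  # prints 2:55:12;66,81;66,75;35,38;
```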

A regex based solution, which operates on the original input:
import re
inp = "; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;"
output = re.sub(r'\s*[A-Z][^;]*?;', '', inp)[2:]
print(output)
This prints:
2:55:12;66,81;66,75;35,38;

This shows how to do it for one line of input, if the same pattern repeats every time:
input_str = "; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;"
f = open('output.txt', 'w')  # open text file to write to
output_list = input_str.split(';')[1::2]  # create list with the values of interest
# write to file
for out in output_list:
    f.write(f"{out.strip()} ; ")
# end line
f.write("\n")
f.close()  # make sure the output is flushed to disk

Thank you very much for the quick response. You are awesome.
Your solutions are very compact.
In the meantime I found another solution, but it needs more lines of code.
best regards Stefan
I'm not familiar with how to insert code as a proper code section,
so I am adding it as plain text:
fobj = open(r"C:\Users\Stefan\AppData\Local\Programs\Python\Python38-32\Heizung_2min.log")
wobj = open(r"C:\Users\Stefan\AppData\Local\Programs\Python\Python38-32\Heizung_number_2min.log","w")
for line in fobj:
    TextLine = line  # use the line from the loop; calling fobj.readline() here would skip every other line
    print(TextLine)
    myList = TextLine.split(';')
    TextLine = ""
    for index, item in enumerate(myList):
        if index % 2 == 1:
            TextLine += item
            TextLine += ";"
    TextLine += '\n'
    print(TextLine)
    wobj.write(TextLine)
fobj.close()
wobj.close()

Match a string which is few lines above another line where the first string was matched

So, I have this text file which is huge. I need to look for a string and when I match it, I need to go a few lines back(above the current line) and search for another string and extract some information from that line that contains the second string. How can I do this in Python using regex match?
I am trying to do something like this.
substr1 = re.compile("ACT",re.IGNORECASE)
substr2 = re.compile(vector,re.IGNORECASE)
try:
    with open (filepath, 'rt') as in_file:
        for linenum, line in enumerate(in_file):
            if substr2.search(line) != None:
                print(linenum,line)
                # Code to trace back a few lines to look for substr1
                break
except FileNotFoundError:  # If the file not found,
    print("pattern not found.")  # print an error message.
In other words, when I match the first string I want to read the file backward and look for the first occurrence of the second string. The number of lines in between varies, so I don't think I can use the deque option. I am totally new to Python.
Any help is appreciated, thank you!
I am adding an example of the log file that I am reading.
X 123
X 1234
X 12345
Vector1
----
-----
-----
X 1231
X 12344
X 123456
vector a
vector b
vector c
vector d
-------
-------
Vector
----
-----
-----
X 1233
X 12345
X 123451
Vector2
String 1 : Vector
String 2 : X
Output should be X 123456

You do not need to backtrack. Instead, just search forward in a smarter manner. If you search for substr1 first, the only issue that could happen is that more occurrences of substr1 will be found before you find substr2. The way to handle that is to keep updating match of substr1 as you go.
From your description, it does not appear that you need regex at all. Instead, you appear to be looking for simple string containment tests.
substr1 = 'X'
substr2 = 'Vector'
with open (filepath, 'rt') as in_file:
    matched = None
    for linenum, line in enumerate(in_file, start=1):
        if substr1 in line:
            matched = line
        elif matched and line.strip() == substr2:
            # Process the second string
            print(matched)
            break
Lines read from a file keep their trailing newline (and any trailing whitespace, as in the sample you give), so the comparison strips the line first. If you want to match any line that merely begins with substr2, use line.startswith(substr2) instead of line.strip() == substr2.
Minor fixes:
start=1 will make your line numbers start with 1, which is probably what you want.
If you want to compare against None, the proper way is to write "is not None" instead of "!= None". Additionally, regex.search returns a match object, which is always truthy when a match occurs. The idiomatic check is therefore to test the result directly, without even writing "is not None".
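The forward search can also be wrapped in a function that works on any iterable of lines, which makes it easy to try on the sample log from the question. This is a sketch only: the name last_match_before is made up, and it assumes the second string must match a whole line exactly after stripping whitespace.

```python
def last_match_before(lines, first, second):
    """Return the last line containing `first` that precedes a line equal to `second`.

    Returns None if no such pair of lines is found.
    """
    matched = None
    for line in lines:
        stripped = line.strip()
        if first in stripped:
            # Keep overwriting, so the most recent match wins.
            matched = stripped
        elif matched is not None and stripped == second:
            return matched
    return None


log = [
    "X 123", "X 1234", "X 12345", "Vector1", "----",
    "X 1231", "X 12344", "X 123456", "vector a", "vector b",
    "-------", "Vector", "----", "X 1233", "Vector2",
]
print(last_match_before(log, "X", "Vector"))  # prints X 123456
```

An open file object can be passed as lines, so the file is still read only once and only as far as needed.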

How would you find text in a string in python and then look for a number after it?

I have a log file and at the end of each line in the file there is this string:
Line:# where # is the line number.
I am trying to get the # and compare it to the previous line's number. what would be the best way to do that in python?

I would probably use str.split because it seems easy:
with open('logfile.log') as fin:
    numbers = [ int(line.split(':')[-1]) for line in fin ]
Now you can use zip to compare one number with the next one:
for num1,num2 in zip(numbers,numbers[1:]):
    compare(num1,num2)  # do comparison here.
Of course, this isn't lazy (you store every line number in the file at once when you really only need 2 at a time), so it might take up a lot of memory if your files are HUGE. It wouldn't be hard to make it lazy though:
def elem_with_next(iterable):
    ii = iter(iterable)
    prev = next(ii)
    for here in ii:
        yield prev,here
        prev = here

with open('logfile.log') as fin:
    numbers = ( int(line.split(':')[-1]) for line in fin )
    for num1,num2 in elem_with_next(numbers):
        compare(num1,num2)

I'm assuming that you don't have something convenient to split a string on, meaning a regular expression might make more sense. That is, if the lines in your log file are structured like:
date: 1-15-2013, error: mildly_annoying, line: 121
date: 1-16-2013, error: err_something_bad, line: 123
Then you won't be able to use line.split('#') as mgilson suggested, although if there is always a colon, line.split(':') might work. In any case, a regular expression solution would look like:
import re
numbers = []
for line in log:
    digit_match = re.search(r"(\d+)$", line)
    if digit_match is not None:
        numbers.append(int(digit_match.group(1)))
Here the expression "(\d+)$" is matching some number of digits and then the end of the line. We extract the digits with the group(1) method on the returned match object and then add them to our list of line numbers.
If you're not confident that the "Line: #" will always come at the end of the log, you could replace the regular expression used above with something akin to "Line:\s*(\d+)" which checks for the string "Line:" then some (or no) whitespace, and then any number of digits.
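As a rough sketch of that variant combined with the comparison against the previous line, the code below reports every place where the number does not increase by exactly one. That rule is only an assumption for illustration, since the question does not say what the comparison should be, and the names line_numbers and gaps are made up.

```python
import re

LINE_NUMBER = re.compile(r"Line:\s*(\d+)")


def line_numbers(lines):
    """Yield the number that follows 'Line:' in each line that has one."""
    for line in lines:
        match = LINE_NUMBER.search(line)
        if match:
            yield int(match.group(1))


def gaps(lines):
    """Yield (previous, current) pairs where current is not previous + 1."""
    previous = None
    for current in line_numbers(lines):
        if previous is not None and current != previous + 1:
            yield previous, current
        previous = current


log = ["first event Line:121", "second event Line:122", "third event Line: 124"]
print(list(gaps(log)))  # prints [(122, 124)]
```

Both functions are generators, so only two numbers are held in memory at a time.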

Get a value from a string in python

Program Details:
I am writing a program for python that will need to look through a text file for the line:
Found mode 1 of 12: EV= 1.5185449E+04, f= 19.612545, T= 0.050988.
Problem:
Then after the program has found that line, it will then store the line into an array and get the value 19.612545, from f = 19.612545.
Question:
So far I have been able to store the line in an array after finding it. However, I am not sure what to use to search through the stored string and extract the value of f. Does anyone have any suggestions or tips on how to accomplish this?

Depending upon how you want to go at it, CosmicComputer is right to refer you to Regular Expressions. If your syntax is this simple, you could always do something like:
line = 'Found mode 1 of 12: EV= 1.5185449E+04, f= 19.612545, T= 0.050988.'
splitByComma=line.split(',')
fValue = splitByComma[1].replace('f= ', '').strip()
print(fValue)
Results in 19.612545 being printed (still a string though).
Split your line by commas, grab the 2nd chunk, and break out the f value. Error checking and conversions left up to you!

Using regular expressions here is madness. Just use string.find as follows (where string is the name of the variable that holds your string):
index = string.find('f=')
index = index + 2  # skip over "f="
string = string[index:]  # cut off everything you don't need
string = string.split(',')  # split the remaining string on commas
your_value = string[0].strip()  # extract the first field, without the leading space
I know it's ugly, but it's nothing compared with RE.
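The same idea can be written a little more compactly with str.partition, and converted to a number at the end. This is just a sketch: extract_f is a hypothetical name, and it assumes "f=" appears only once in the line and that the value is followed by a comma.

```python
def extract_f(line):
    """Return the value after 'f=' in line as a float."""
    _, found, rest = line.partition("f=")
    if not found:
        raise ValueError("no 'f=' field in line")
    # Everything up to the next comma is the value; float() ignores the leading space.
    return float(rest.split(",", 1)[0])


line = "Found mode 1 of 12: EV= 1.5185449E+04, f= 19.612545, T= 0.050988."
print(extract_f(line))  # prints 19.612545
```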

How can I count the line number between two character in a file with python?

Hi
I'm new to Python and I'm using Python 3.2.
I have a file with a format like this:
Number of segment pairs = 108570; number of pairwise comparisons = 54234
'+' means given segment; '-' means reverse complement
Overlaps Containments No. of Constraints Supporting Overlap
******************* Contig 1 ********************
E_180+
E_97-
******************* Contig 2 ********************
E_254+
E_264+ is in E_254+
E_276+
******************* Contig 3 ********************
E_256-
E_179-
I want to count the number of non-empty lines between the *****contig#**** markers
and I want to get a result like this:
contig1=2
contig2=3
contig3=2

Probably, it's best to use regular expressions here. You can try the following:
import re
text = open(file).read()  # file is the path to your input file
pairs = re.findall(r'\*+ (Contig \d+) \*+\n([^*]*)', text)
pairs is a list of tuples, where the tuples have the form ('Contig x', '...')
The second component of each tuple contains the text after the mark
Afterwards, you could count the number of '\n' in those texts; most easily this can be done via a list comprehension:
[(contig, txt.count('\n')) for (contig,txt) in pairs]
(edit: if you don't want to count empty lines you can try:
[(contig, txt.count('\n')-txt.count('\n\n')) for (contig,txt) in pairs]
)
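Another way to skip empty lines is to count the non-empty lines of each block directly instead of subtracting newline counts. A minimal sketch using the same regular expression follows. The function name count_contig_lines and the shortened sample text are made up for illustration.

```python
import re


def count_contig_lines(text):
    """Return (contig name, number of non-empty lines) pairs in file order."""
    pairs = re.findall(r'\*+ (Contig \d+) \*+\n([^*]*)', text)
    return [
        (contig, sum(1 for line in body.splitlines() if line.strip()))
        for contig, body in pairs
    ]


sample = (
    "******************* Contig 1 ********************\n"
    "E_180+\n"
    "E_97-\n"
    "\n"
    "******************* Contig 2 ********************\n"
    "E_254+\n"
    "E_264+ is in E_254+\n"
    "E_276+\n"
)
print(count_contig_lines(sample))  # prints [('Contig 1', 2), ('Contig 2', 3)]
```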

def give(filename):
    with open(filename) as f:
        for line in f:
            if 'Contig' in line:
                category = line.strip('* \r\n')
                break
        cnt = 0
        aim = []
        for line in f:
            if 'Contig' in line:
                yield (category+'='+str(cnt),aim)
                category = line.strip('* \r\n')
                cnt = 0
                aim = []
            elif line.strip():
                cnt += 1
                if 'is in' in line:
                    aim.append(line.strip())
        yield (category+'='+str(cnt),aim)

for a,b in give('input.txt'):
    print(a)
    if b: print(b)
result
Contig 1=2
Contig 2=3
['E_264+ is in E_254+']
Contig 3=2
The function give() isn't a normal function; it is a generator function. See the docs, and if you have questions, I will answer.
strip() is a method that removes characters from the beginning and the end of a string.
When used without an argument, strip() removes whitespace (that is to say \f \n \r \t \v and the blank space). When a string is passed as the argument, every leading and trailing character that appears in that argument is removed. The order of characters in the argument doesn't matter: the argument designates a set of characters to be removed, not a string.
line.strip() is a way to test whether a line contains any non-whitespace characters.
The fact that elif line.strip(): comes after the line if 'Contig' in line:, and that it is written elif and not if, is important: otherwise line.strip() would also be true for a header line such as
******** Contig 2 *********\n
and the header would be counted as a content line.
I suppose that you will be interested in the content of lines like this one:
E_264+ is in E_254+
because it is this kind of line that makes a difference in the counts.
So I edited my code so that the function give() also yields the content of these lines.
