Slicing a string into a list based on recurring patterns

I have a long string variable full of hex values:
hexValues = 'AA08E3020202AA08E302AA1AA08E3020101' etc..
The first 2 bytes (AA08) are a signature for the start of a frame, and the rest of the data up to the next AA08 is the contents of that frame.
I want to slice the string into a list based on the reoccurring start of frame sign, e.g:
list = [AA08, E3020202, AA08, F25S1212, AA08, 42ABC82] etc...
I'm not sure how I can split the string up like this. Some of the frames are also corrupted, where the start of the frame won't have AA08, but maybe AA01, so I'd need some kind of regex to spot these.
If I do list = hexValues.split('AA08'), the result just drops all the start-of-frame markers...
So I'm a bit stuck.
Newbie to python.
Thanks

For the case when you don't have "corrupted" data the following should do:
hex_values = 'AA08E3020202AA08E302AA1AA08E3020101'
delimiter = hex_values[:4]
hex_values = hex_values.replace(delimiter, ',' + delimiter + ',')
hex_list = hex_values.split(',')[1:]
print(hex_list)
['AA08', 'E3020202', 'AA08', 'E302AA1', 'AA08', 'E3020101']

Without considering corruptions, you may try this.
l = []
for s in hexValues.split('AA08'):
    if s:
        l += ['AA08', s]
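For the corrupted frames mentioned in the question, one possible approach is a regex split. This is only a sketch: putting the marker in a capturing group makes re.split keep it in the result, and a looser pattern can match damaged markers. The function name split_frames and the pattern AA0[0-9A-F] (a marker whose last digit is wrong) are assumptions for illustration; the question does not say what a corrupted marker looks like.

```python
import re

def split_frames(hex_string, marker_pattern=r'AA08'):
    """Split hex_string on the marker, keeping the marker itself.

    marker_pattern is a regex; wrapping it in a capturing group makes
    re.split return the matched markers alongside the payloads.
    """
    parts = re.split('(' + marker_pattern + ')', hex_string)
    # re.split yields an empty string when the input starts with a marker
    return [p for p in parts if p]

hex_values = 'AA08E3020202AA08E302AA1AA08E3020101'
print(split_frames(hex_values))

# Assumed corruption model: only the last hex digit of the marker is wrong.
print(split_frames('AA08E302AA01FFFF', r'AA0[0-9A-F]'))
```

Keep in mind that a loose pattern can also match bytes that happen to occur inside a payload, so the result may need a sanity check against the expected frame length.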

Related

How to split string with multiple delimiters in Python?

My first string is
xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv
But I want a result like this:
bonding_err_bond0-if_eth2
I tried some code, but it does not seem to work correctly:
csv = "xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv"
x = csv.rsplit('.', 4)[2]
print(x)
But the result I get is com-bonding_err_bond0-if_eth2-d, whereas what I want is bonding_err_bond0-if_eth2

If you are allowed to use a solution other than regex, you can do it with split and join.
You can break the solution into smaller parts to understand it better, and learn about join if you are not aware of it. It will come in handy.
solution= '-'.join(csv.split('.', 4)[2].split('-')[1:3])
Thanks,
Shashank

You have probably got your answer already, but if you want a generic method for any string data you can do this.
This way you won't be restricted to one string, and you can loop over the data as well.
csv = "xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv"
first_index = csv.find("-")
second_index = csv.find("-d")
result = csv[first_index+1:second_index]
print(result)
# OUTPUT:
# bonding_err_bond0-if_eth2

You can just separate the string with -, remove the beginning and end, and then join them back into a string.
csv = "xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv"
x = '-'.join(csv.split('-')[1:-1])
Output
>>> x
'bonding_err_bond0-if_eth2'
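Another way to express the same idea, as a rough sketch: str.partition and str.rpartition cut at the first and last '-' without building an intermediate list. The helper name middle_fields is made up for this example, and it assumes the wanted part always sits between the first and the last hyphen.

```python
def middle_fields(name):
    # Keep what follows the first '-' ...
    after_first = name.partition('-')[2]
    # ... and then drop the last '-' and everything after it.
    return after_first.rpartition('-')[0]

csv = "xxx.xxx.com-bonding_err_bond0-if_eth2-d.rrd.csv"
print(middle_fields(csv))
```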

Remove single quotes around array

I have data that looks like this:
minterms = [['1,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x'], ['x,x,x,x,1,x,x,x,x,x,x,x,x,x,x,x,1,x,x,x,x,x,x']]
and I want to remove the single quotes around each array to get this:
minterms = [[1,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x,x], [x,x,x,x,1,x,x,x,x,x,x,x,x,x,x,x,1,x,x,x,x,x,x]]
I have tried
mintermNew = minterms.replace("'", "")
and this doesn't work.
What am I doing wrong here?
Edit:
Here is a snippet of my code giving a bit more context.
dontcares = []
mintermAry = []
for mindata in minterms:
    for mindataIdx in mindata:
        mintermAry.append(mindataIdx.split())
print(SOPform(fullsymlst, mintermAry, dontcares))
return
I am using mindataIdx.split() to put the data into an array. mindataIdx is the data that looks like '1,x,x,x,x....'.
Using .split("") as mentioned in the comments throws this error:
mintermAry.append(mindataIdx.split(""))
ValueError: empty separator
using .split(" ") yields no changes.
Edit 2:
The data is being read into a dataframe from a file. The first 4 rows I want to discard. I am using this method to do it.
df = df.replace('-', 'x', regex=True)
dfstr = df.to_string(header=False,index=False,index_names=False).split('\n')
dfArray = np.array(dfstr)
dfArrayDel = np.delete(dfArray,range(4), 0)
dfArrayData = np.char.lstrip(dfArrayDel)
splitData = np.char.split(dfArrayData)

First of all, you're definitely doing something very wrong, as there is no reason for there to be single quotes around the contents of an array. Is this a string you're working with? Please elaborate.
I'll have to assume you want to split the string in each array into separate elements at the commas, in which case you would want this (JavaScript):
minterms.map(s => s[0].split(","));

I can't tell if you're writing in Python or JS; regardless, your problem is that each inner array of your 2D array contains only a single String, which is why it's all wrapped in quotes. If the String in your inner arrays were split into individual elements they would look like this:
[[1,'x','x','x','x','x','x','x','x','x','x','x'...], ['x','x','x','x',1,'x'...]]
1 is a Number and therefore not wrapped in quotes while x is a char or String and therefore is wrapped in quotes. These quotes are there only to visualize the variable datatype and are not part of the variable value itself. As the quotes don't exist they can't be removed (eg by using replace)
If your String, before putting it in an array looks like this.
data = '1,x,x,x,x,x,x,x,x,x,x,x'
You can split it into an array like this:
data_array = data.split(",")
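In Python specifically, a quick sketch of splitting on the comma shows that every element comes back as a string, including the '1'. The optional conversion step and the name converted are illustrative additions, assuming you want digit-only fields as integers.

```python
data = '1,x,x,x,x,x,x,x,x,x,x,x'

# Splitting on the comma gives one list item per field; all are str.
data_array = data.split(",")
print(data_array[:3])

# Optional: turn digit-only fields into int and leave 'x' as a string.
converted = [int(item) if item.isdigit() else item for item in data_array]
print(converted[:3])
```

The quotes in the printed list are only how Python displays strings; they are not characters stored in the data.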

I needed to split mindataIdx by the comma to create individual items, and then it was able to be recognized by SOPform. Thanks!
dontcares = []
mintermAry = []
for mindata in minterms:
    for mindataIdx in mindata:
        mintermAry.append(mindataIdx.split(","))
print(SOPform(fullsymlst, mintermAry, dontcares))

Python: Split between two characters

Let's say I have a ton of HTML with no newlines. I want to get each element into a list.
input = "<head><title>Example Title</title></head>"
a_list = ["<head>", "<title>Example Title</title>", "</head>"]
Something like such. Splitting between each ><.
But in Python, I don't know of a way to do that. I can only split at that string, which removes it from the output. I want to keep it, and split between the two angle brackets.
How can this be done?
Edit: Preferably, this would be done without adding the characters back in to the ends of each list item.

# initial input
a = "<head><title>Example Title</title></head>"
# split list
b = a.split('><')
# remove extra character from first and last elements
# because the split only removes >< pairs.
b[0] = b[0][1:]
b[-1] = b[-1][:-1]
# initialize new list
a_list = []
# fill new list with formatted elements
for i in range(len(b)):
    a_list.append('<{}>'.format(b[i]))
This will output the given list in Python 2.7.2, but it should work in Python 3 as well.

You can try this:
import re
a = "<head><title>Example Title</title></head>"
data = re.split("><", a)
new_data = [data[0]+">"]+["<" + i+">" for i in data[1:-1]] + ["<"+data[-1]]
Output:
['<head>', '<title>Example Title</title>', '</head>']

The shortest approach using re.findall() function on extended example:
import re
# extended html string
s = "<head><title>Example Title</title></head><body>hello, <b>Python</b></body>"
result = re.findall(r'(<[^>]+>[^<>]+</[^>]+>|<[^>]+>)', s)
print(result)
The output:
['<head>', '<title>Example Title</title>', '</head>', '<body>', '<b>Python</b>', '</body>']

Based on the answers by other people, I made this.
It isn't as clean as I had wanted, but it seems to work. I had originally wanted to not re-add the characters after split.
Here, I got rid of one extra argument by combining the two characters into a string. Anyways,
def split_between(string, chars):
    if len(chars) != 2:
        raise IndexError("Argument chars must contain two characters.")
    # re-add the closing/opening characters that split() removed
    result_list = [chars[1] + line + chars[0] for line in string.split(chars)]
    # the first and last items never lost a character, so trim the extras
    result_list[0] = result_list[0][1:]
    result_list[-1] = result_list[-1][:-1]
    return result_list
Credit goes to @cforeman and @Ajax1234.

Or even simpler, this:
input = "<head><title>Example Title</title></head>"
print(['<'+elem if elem[0]!='<' else elem for elem in [elem+'>' if elem[-1]!='>' else elem for elem in input.split('><') ]])
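If the goal is to avoid adding the characters back at all, a zero-width split is one option. This is a sketch that assumes Python 3.7 or later, where re.split accepts patterns that match the empty string; the function name split_tags is invented for the example. The lookbehind (?<=>) and lookahead (?=<) match the position between '>' and '<' without consuming either character.

```python
import re

def split_tags(html):
    # Split at every position that has '>' on its left and '<' on its
    # right; nothing is consumed, so no characters need to be re-added.
    return re.split(r'(?<=>)(?=<)', html)

text = "<head><title>Example Title</title></head>"
print(split_tags(text))
```

This only looks at adjacent angle brackets; it is not an HTML parser and will not cope with brackets inside attribute values or comments.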

Python - Splitting a large string by number of delimiter occurrences

I'm still learning Python, and I have a question I haven't been able to solve. I have a very long string (millions of lines long) which I would like to split into a shorter string based on a specified number of occurrences of a delimiter.
For instance:
ABCDEF
//
GHIJKLMN
//
OPQ
//
RSTLN
//
OPQR
//
STUVW
//
XYZ
//
In this case I would want to split based on "//" and return a string of all lines before the nth occurrence of the delimiter.
So an input of splitting the string by // by 1 would return:
ABCDEF
an input of splitting the string by // by 2 would return:
ABCDEF
//
GHIJKLMN
an input of splitting the string by // by 3 would return:
ABCDEF
//
GHIJKLMN
//
OPQ
And so on... However, the length of the original 2-million-line string appeared to be a problem when I simply tried to split the entire string by "//" and work with the individual indexes (I was getting a memory error). Perhaps Python can't handle so many lines in one split? So I can't do that.
I'm looking for a way that doesn't split the entire string into a hundred thousand pieces when I may only need 100, but instead starts from the beginning, stops at a certain point, and returns everything before it, which I assume may also be faster. I hope my question is as clear as possible.
Is there a simple or elegant way to achieve this? Thanks!

If you want to work with files instead of strings in memory, here is another answer.
This version is written as a function that reads lines and immediately prints them out until the specified number of delimiters have been found (no extra memory needed to store the entire string).
def file_split(file_name, delimiter, n=1):
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip()  # use .rstrip("\n") to only strip newlines
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            print(line)
file_split('data.txt', '//', 3)
You can use this to write the output to a new file like this:
python split.py > newfile.txt
With a little extra work, you can use argparse to pass parameters to the program.

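As a more efficient way, you can read just the first N records separated by your delimiter. If you are sure that every record is a single line followed by a delimiter line, you can use itertools.islice to do the job:
from itertools import islice
N = 3  # example value: number of records wanted
with open('filename') as f:
    # N records plus the N-1 delimiter lines between them;
    # islice is lazy, so consume it while the file is still open
    lines = list(islice(f, 0, 2*N-1))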

The method that comes to mind when I read your question uses a for loop: cut a shorter piece off the front of the string (for example the 100 you mentioned) and iterate through that substring. The snippets below are pseudocode; "element you want" stands for your own test.
thestring = ""  # your string
steps = 100  # length of the strings you are going to use for iteration
log = 0
substring = thestring[:log+steps]  # this is the string you will split and iterate through
thelist = substring.split("//")
for element in thelist:
    if(element you want):
        # do your thing with the line
    else:
        log = log+steps
        # and go again from the start only with this offset
This way you can work through the elements without splitting the whole 2 million(!) line string at once.
The best thing to do here is actually to make a recursive function from this (if that is what you want):
thestring = ""  # your string
steps = 100  # length of the strings you are going to use for iteration
def iterateThroughHugeString(beginning):
    substring = thestring[:beginning+steps]  # this is the string you will split and iterate through
    thelist = substring.split("//")
    for element in thelist:
        if(element you want):
            # do your thing with the line
        else:
            iterateThroughHugeString(beginning+steps)
            # and go again from the start only with this offset
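For a string that is already in memory, one way to avoid splitting everything is to walk forward with str.find until the nth delimiter and slice once. The following is a minimal sketch, not a tuned implementation: the function name head_before_nth is invented here, and it assumes the delimiter text never appears inside a data line.

```python
def head_before_nth(text, delimiter, n):
    """Return everything before the n-th occurrence of delimiter."""
    if n < 1:
        raise ValueError("n must be at least 1")
    start = 0
    pos = -1
    for _ in range(n):
        pos = text.find(delimiter, start)
        if pos == -1:
            return text  # fewer than n delimiters: nothing to cut off
        start = pos + len(delimiter)
    # drop the newline that precedes the delimiter line
    return text[:pos].rstrip("\n")

sample = "ABCDEF\n//\nGHIJKLMN\n//\nOPQ\n//\nRSTLN\n//\n"
print(head_before_nth(sample, "//", 2))
```

Only the part of the string up to the nth delimiter is scanned, and the single slice at the end is the only copy that gets made.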

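For instance:
delimiter = "//"  # example value
max_split = 3  # example value: stop at this many delimiters
i = 0
s = ""
with open("...") as fd:
    for l in fd:
        if l[:-1] == delimiter:  # compare without the trailing '\n'
            i += 1
            if i >= max_split:
                break
        s += l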

Since you are learning Python, modeling a complete dynamic solution would be a good challenge. Here's an idea of how you could start.
Note: The following code snippet only works for files in the given format (see the "For instance" example in the question). Hence, it is a static solution.
num = (int(input("Enter number of delimiters: ")) * 2)
with open("./data.txt") as myfile:
    print([next(myfile) for x in range(num-1)])
Now that you have the idea, you can use pattern matching and so on.

python - matching string and replacing

I have a file in which I am trying to replace parts of a line with another word.
A line looks like bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212
I need to delete everything but bob123#bobscarshop.com, but I also need to match 23rh32o3hro2rh2 with 23rh32o3hro2rh2:poniacvibe from a different text file and place poniacvibe in front of bob123#bobscarshop.com,
so it would look like this: bob123#bobscarshop.com:poniacvibe
I've had a hard time working out how to do this. I think I would have to split bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212 with data.split(":"), but some of the lines have a (:) in a spot where I don't want the line to be split, if that makes any sense...
If anyone could help I would really appreciate it.

ok, it looks to me like you are using a colon : to separate your strings.
in this case you can use .split(":") to break your strings into their component substrings
eg:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
print(firststring.split(":"))
would give:
['bobkeiser', 'bob123#bobscarshop.com', '0.0.0.0.0', '23rh32o3hro2rh2', '234212']
and assuming your substrings will always be in the same order, and the same number of substrings in the main string you could then do:
firststring = "bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212"
firstdata = firststring.split(":")
secondstring = "23rh32o3hro2rh2:poniacvibe"
seconddata = secondstring.split(":")
if firstdata[3] == seconddata[0]:
    outputdata = firstdata
    outputdata.insert(1,seconddata[1])
    outputstring = ""
    for item in outputdata:
        if outputstring == "":
            outputstring = item
        else:
            outputstring = outputstring + ":" + item
what this does is:
extract the bits of the strings into lists
see if the "23rh32o3hro2rh2" string can be found in the second list
find the corresponding part of the second list
create a list to contain the output data and put the first list into it
insert the "poniacvibe" string before "bob123#bobscarshop.com"
stitch the outputdata list back into a string using the colon as the separator
the reason your strings need to be the same length is because the index is being used to find the relevant strings rather than trying to use some form of string type matching (which gets much more complex)
if you can keep your data in this form it gets much simpler.
to protect against malformed data (lists too short) you can explicitly test for them before you start using len(list) to see how many elements are in it.
or you could let it run and catch the exception, however in this case you could end up with unintended results, as it may try to match the wrong elements from the list.
hope this helps
James
EDIT:
ok so if you are trying to match up a long list of strings from files you would probably want something along the lines of:
firstfile = open("firstfile.txt", mode = "r")
secondfile= open("secondfile.txt",mode = "r")
first_raw_data = firstfile.readlines()
firstfile.close()
second_raw_data = secondfile.readlines()
secondfile.close()
first_data = []
for item in first_raw_data:
    first_data.append(item.replace("\n","").split(":"))
second_data = []
for item in second_raw_data:
    second_data.append(item.replace("\n","").split(":"))
output_strings = []
for item in first_data:
    searchstring = item[3]
    for entry in second_data:
        if searchstring == entry[0]:
            output_data = item
            output_string = ""
            output_data.insert(1,entry[1])
            for data in output_data:
                if output_string == "":
                    output_string = data
                else:
                    output_string = output_string + ":" + data
            output_strings.append(output_string)
            break
for entry in output_strings:
    print(entry)
this should achieve what you're after and, as a proof of concept, will print the resulting list of strings for you.
if you have any questions feel free to ask.
James
Second edit:
to make this output the results into a file change the last two lines to:
outputfile = open("outputfile.txt", mode = "w")
for entry in output_strings:
    outputfile.write(entry+"\n")
outputfile.close()
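As a variation on the matching step, here is a rough sketch that uses a dictionary for the lookup and emits the email:word form shown in the question. The function name merge_lines is made up for illustration, and it assumes the same field order as above (email at index 1, key at index 3); lines with too few fields are simply skipped.

```python
def merge_lines(first_lines, second_lines):
    # Build a key -> word lookup from lines such as "23rh32o3hro2rh2:poniacvibe".
    lookup = {}
    for line in second_lines:
        key, sep, word = line.strip().partition(":")
        if sep:
            lookup[key] = word

    merged = []
    for line in first_lines:
        fields = line.strip().split(":")
        if len(fields) < 5:
            continue  # malformed line: not enough fields to index safely
        email, key = fields[1], fields[3]
        if key in lookup:
            merged.append(email + ":" + lookup[key])
    return merged

first = ["bobkeiser:bob123#bobscarshop.com:0.0.0.0.0:23rh32o3hro2rh2:234212\n"]
second = ["23rh32o3hro2rh2:poniacvibe\n"]
print(merge_lines(first, second))
```

A dictionary lookup avoids rescanning the second file's data for every line of the first, which matters once both files grow long.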
