Parsing a string in Python without knowing what is in the string - python

I'm trying to parse a random string between a set of quotation marks.
The data is always of the form:
klheafoiehaiofa"randomtextnamehere.ext"fiojeaiof;jidoa;jfioda
I know what .ext is, I want randomtextnamehere.ext, and it is always enclosed in double quotes (").
Currently I can only deal with certain cases, but I want to be able to handle any case. Ideally I would start grabbing at the first " and stop at the second ", since a line may contain more than one pair of quotes.
Thanks!

You can use the str.split method:
Docstring:
S.split([sep [,maxsplit]]) -> list of strings
Return a list of the words in the string S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are removed
from the result.
In [1]: s = 'klheafoiehaiofa"randomtextnamehere.ext"fiojeaiof;jidoa;jfioda'
In [2]: s.split('"', 2)[1]
Out[2]: 'randomtextnamehere.ext'
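The question also mentions that a line may hold more than one quoted name. A minimal sketch of one way to collect all of them with the same split idea follows; the helper name quoted_parts and the sample line are made up for illustration, and escaped quotes are assumed not to occur.

```python
def quoted_parts(line):
    """Return every substring of `line` enclosed in double quotes, in order."""
    pieces = line.split('"')
    # An even number of pieces means an odd number of quotes, so the last
    # piece follows an unmatched opening quote: drop it.
    if len(pieces) % 2 == 0:
        pieces = pieces[:-1]
    # Pieces at odd indices sit between an opening and a closing quote.
    return pieces[1::2]


line = 'abc"first.ext"def"second.ext"ghi'
print(quoted_parts(line))      # ['first.ext', 'second.ext']
print(quoted_parts(line)[0])   # first.ext
```

Taking index 0 of the result gives the text between the first and second quote, and an empty list signals that the line had no complete quoted section.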

Related

Python 3 split()

When I'm splitting a string "abac" I'm getting undesired results.
Example
print("abac".split("a"))
Why does it print:
['', 'b', 'c']
instead of
['b', 'c']
Can anyone explain this behavior and guide me on how to get my desired output?
Thanks in advance.
As @DeepSpace pointed out (referring to the docs)
If sep is given, consecutive delimiters are not grouped together and are deemed to delimit empty strings (for example, '1,,2'.split(',') returns ['1', '', '2']).
Therefore I'd suggest using a better delimiter such as a comma. If this is the formatting you're stuck with, you could use the built-in filter() function, as suggested in this answer; passing None as the function removes any empty strings.
sample = 'abac'
# In Python 3, filter() returns a lazy iterator, so wrap it in list() to see the items
filtered_sample = list(filter(None, sample.split('a')))
print(filtered_sample)
# ['b', 'c']
When you split a string in Python, you keep everything between your delimiters (even when it's an empty string!)
For example, if you had a list of letters separated by commas:
>>> "a,b,c,d".split(',')
['a', 'b', 'c', 'd']
If your list had some missing values you might leave the space in between the commas blank:
>>> "a,b,,d".split(',')
['a', 'b', '', 'd']
The start and end of the string act as delimiters themselves, so if you have a leading or trailing delimiter you will also get this "empty string" sliced out of your main string:
>>> "a,b,c,d,,".split(',')
['a', 'b', 'c', 'd', '', '']
>>> ",a,b,c,d".split(',')
['', 'a', 'b', 'c', 'd']
If you want to get rid of any empty strings in your output, you can use the filter function.
If instead you just want to get rid of this behavior near the edges of your main string, you can strip the delimiters off first:
>>> ",,a,b,c,d".strip(',')
'a,b,c,d'
>>> ",,a,b,c,d".strip(',').split(',')
['a', 'b', 'c', 'd']
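To make the difference between the two approaches concrete, here is a small sketch on a made-up input that has empty fields both at the edges and in the middle:

```python
raw = ",a,b,,d,"

# filter(None, ...) drops every empty string, including the one in the middle.
no_empties = list(filter(None, raw.split(",")))

# strip(",") only trims delimiters at the two ends, so the interior gap survives.
edges_trimmed = raw.strip(",").split(",")

print(no_empties)      # ['a', 'b', 'd']
print(edges_trimmed)   # ['a', 'b', '', 'd']
```

Which one is appropriate depends on whether an empty field in the middle is meaningful (a missing value) or just noise.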
In your example, "a" is what's called a delimiter. It acts as a boundary between the characters before it and after it. So, when you call split, it takes the characters before and after each "a" and inserts them into the list. Since there's nothing in front of the first "a" in the string "abac", it inserts an empty string into the list.
split will return the characters between the delimiters you specify (or between an end of the string and a delimiter), even if there aren't any, in which case it will return an empty string. (See the documentation for more information.)
In this case, if you don't want any empty strings in the output, you can use filter to remove them:
list(filter(lambda s: len(s) > 0, "abac".split("a")))

Turn string into a list and remove carriage returns (Python)

I have a string like this:
['过\r\n啤酒\r\n小心\r\n照顾\r\n锻炼\r\n过去\r\n忘记\r\n哭\r\n包\r\n个子\r\n瘦\r\n选择\r\n奶奶\r\n突然\r\n节目\r\n']
How do I remove all of the "\r\n", and then turn the string into a list like so:
[过, 啤酒, 小心, 照顾, 过去, etc...]
Called with no arguments, str.split splits on runs of whitespace, which includes \r and \n:
A = ['过\r\n啤酒\r\n小心\r\n照顾\r\n锻炼\r\n过去\r\n忘记\r\n哭\r\n包\r\n个子\r\n瘦\r\n选择\r\n奶奶\r\n突然\r\n节目\r\n']
res = A[0].split()
print(res)
['过', '啤酒', '小心', '照顾', '锻炼', '过去', '忘记', '哭', '包', '个子', '瘦', '选择', '奶奶', '突然', '节目']
As described in the str.split docs:
If sep is not specified or is None, a different splitting
algorithm is applied: runs of consecutive whitespace are regarded as a
single separator, and the result will contain no empty strings at the
start or end if the string has leading or trailing whitespace.
To split only on line boundaries such as \r\n, you can use .splitlines():
>>> li=['过\r\n啤酒\r\n小心\r\n照顾\r\n锻炼\r\n过去\r\n忘记\r\n哭\r\n包\r\n个子\r\n瘦\r\n选择\r\n奶奶\r\n突然\r\n节目\r\n']
>>> li[0].splitlines()
['过', '啤酒', '小心', '照顾', '锻炼', '过去', '忘记', '哭', '包', '个子', '瘦', '选择', '奶奶', '突然', '节目']
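The two approaches give the same result for the data above, but they differ once an entry contains a space. A short sketch with made-up sample text:

```python
text = "hello world\r\nfoo bar\r\n"

by_whitespace = text.split()      # splits on spaces as well as line endings
by_lines = text.splitlines()      # splits on line boundaries only
by_crlf = text.split("\r\n")      # explicit separator keeps a trailing empty string

print(by_whitespace)   # ['hello', 'world', 'foo', 'bar']
print(by_lines)        # ['hello world', 'foo bar']
print(by_crlf)         # ['hello world', 'foo bar', '']
```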
Try this:
s = "['过\r\n啤酒\r\n小心\r\n照顾\r\n锻炼\r\n过去\r\n忘记\r\n哭\r\n包\r\n个子\r\n瘦\r\n选择\r\n奶奶\r\n突然\r\n节目\r\n']"
s = s.replace('\r\n', ',').replace("'", '')
print(s)
Output:
[过,啤酒,小心,照顾,锻炼,过去,忘记,哭,包,个子,瘦,选择,奶奶,突然,节目,]
The first replace turns each \r\n into a comma, and the second removes the single quotes, which gives the output shown in the question. Note that the result is still a string, not a list.

How to get sub string from a string in python using split or regex

I have a str in Python like below. I want to extract a substring from it.
table='abc_test_01'
number=table.split("_")[1]
I am getting test as a result.
What I want is everything after the first _.
The result I want is test_01. How can I achieve that?
Here is the code already given in several other answers:
table='abc_test_01'
number=table.split("_",1)[1]
However, this fails when the delimiter does not occur in the string; you then get IndexError: list index out of range.
For example:
table='abctest01'
number=table.split("_",1)[1]
This raises IndexError because there is no underscore in the string.
A more robust version is:
table.split("_",1)[-1]
Indexing with -1 is safe here because maxsplit is 1: the list has at most two elements, so the last one is either everything after the first underscore or, if there is no underscore, the whole string.
Hope it helps :)
To get the substring (all characters after the first occurrence of underscore):
number = table[table.index('_')+1:]
# Output: test_01
You could do it like:
import re
string = "abc_test_01"
rx = re.compile(r'[^_]*_(.+)')
match = rx.match(string).group(1)
print(match)
Or with normal string functions:
string = "abc_test_01"
match = '_'.join(string.split('_')[1:])
print(match)
Nobody mentions that the split() function can take a maxsplit argument:
str.split(sep=None, maxsplit=-1)
Return a list of the words in the string, using sep as the delimiter string. If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
So the solution is simply:
table.split('_', 1)[1]
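As a quick illustration of how maxsplit limits the number of pieces, here is a sketch using a slightly longer, made-up table name. rsplit is the standard counterpart that counts splits from the right.

```python
table = "abc_test_01_final"

print(table.split("_"))       # ['abc', 'test', '01', 'final']
print(table.split("_", 1))    # ['abc', 'test_01_final']
print(table.split("_", 2))    # ['abc', 'test', '01_final']
print(table.rsplit("_", 1))   # ['abc_test_01', 'final']
```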
You can try this:
Edit: Thanks to @valtah's comment:
table = 'abc_test_01'
#final = "_".join(table.split("_")[1:])
final = table.split("_", 1)[1]
print(final)
Output:
test_01
Also the answer of @valtah in the comment is correct:
final = table.partition("_")[2]
print(final)
Will output the same result
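The approaches above behave differently when the string contains no underscore at all. The following sketch wraps partition in a helper that makes the missing case explicit; the name after_first and the choice to return None are assumptions for illustration, not part of any answer above.

```python
def after_first(text, sep="_"):
    """Return the part of `text` after the first `sep`, or None if `sep` is absent."""
    head, found, tail = text.partition(sep)
    # `found` is the separator itself when present, otherwise an empty string.
    return tail if found else None


print(after_first("abc_test_01"))   # test_01
print(after_first("abctest01"))     # None

# For comparison, on input with no underscore:
whole = "abctest01".split("_", 1)[-1]    # 'abctest01' (the whole string)
empty = "abctest01".partition("_")[2]    # '' (an empty string)
```

split("_", 1)[1] and table.index("_") both raise an exception in that situation (IndexError and ValueError respectively), so the right choice depends on whether a missing delimiter should be an error or a normal case.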

Python - Removing \n when using default split()?

I'm working on strings where I'm taking input from the command line. For example, with this input:
format driveName "datahere"
when I go string.split(), it comes out as:
>>> input.split()
['format', 'driveName', '"datahere"']
which is what I want.
However, when I specify it to be string.split(" ", 2), I get:
>>> input.split(' ', 2)
['format\n', 'driveName\n', '"datahere"']
Does anyone know why this happens and how I can resolve it? I thought it could be because I'm creating the input on Windows and running on Unix, but the same problem occurs when I use nano on Unix.
The third argument (data) could contain newlines, so I'm cautious not to use a sweeping newline remover.
The default separator in split() is any whitespace, which includes newlines (\n) and spaces.
Here is what the docs on split say:
str.split([sep[, maxsplit]])
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
When you define a new sep it only uses that separator to split the strings.
Use None to get the default whitespace splitting behaviour with a limit:
input.split(None, 2)
This leaves any whitespace at the end of the input string untouched.
Or you could strip the values afterwards; this removes whitespace from the start and end, not the middle, of each resulting string, just like input.split() would:
[v.strip() for v in input.split(' ', 2)]
The default str.split targets a number of "whitespace characters", including also tabs and others. If you do str.split(' '), you tell it to split only on ' ' (a space). You can get the default behavior by specifying None, as in str.split(None, 2).
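A small sketch of the difference, using a made-up input line in which the first two items are separated by a newline and the data item itself contains one:

```python
line = 'format\ndriveName "data\nhere"\n'

# Splitting on a literal space: the newline between the first two items is not a separator.
on_space = line.split(' ', 2)

# Splitting on any whitespace, still limited to two splits: the third item keeps its newlines.
on_whitespace = line.split(None, 2)

print(on_space)        # ['format\ndriveName', '"data\nhere"\n']
print(on_whitespace)   # ['format', 'driveName', '"data\nhere"\n']
```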
There may be a better way of doing this, depending on what your actual use-case is (your example does not replicate the problem...). As your example output implies newlines as separators, you should consider splitting on them explicitly.
inp = """
format
driveName
datahere
datathere
"""
inp.strip().split('\n', 2)
# ['format', 'driveName', 'datahere\ndatathere']
This allows you to have spaces (and tabs etc) in the first and second item as well.

Find Certain String Indices

I have this string and I need to get a specific number out of it.
E.G. encrypted = "10134585588147, 3847183463814, 18517461398"
How would I pull out only the second integer out of the string?
You are looking for the "split" method. Turn a string into a list by specifying a smaller part of the string on which to split.
>>> encrypted = '10134585588147, 3847183463814, 18517461398'
>>> encrypted_list = encrypted.split(', ')
>>> encrypted_list
['10134585588147', '3847183463814', '18517461398']
>>> encrypted_list[1]
'3847183463814'
>>> encrypted_list[-1]
'18517461398'
Then you can just access the indices as normal. Note that lists can be indexed forwards or backwards: a negative index counts from the right rather than the left, so -1 selects the last item without needing to know how long the list is. Indexing an empty list raises IndexError, though. split(', ') always returns at least one item (an empty string gives ['']), while Jon's method (below) returns an empty list if the string contains nothing but commas and whitespace.
Edited to add:
What Jon is pointing out in the comment is that if you are not sure the string will be well-formatted (e.g., always separated by exactly one comma followed by exactly one space), you can replace all the commas with spaces (encrypted.replace(',', ' ')) and then call split without arguments, which splits on any run of whitespace characters. As usual, you can chain these together:
encrypted.replace(',', ' ').split()
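Putting this together, a minimal sketch for pulling out the second number as an integer might look like the following. The inconsistent spacing in the sample string and the fallback to None are assumptions added for illustration.

```python
encrypted = "10134585588147, 3847183463814,18517461398"

# Replace commas with spaces, then split on any run of whitespace.
numbers = [int(token) for token in encrypted.replace(",", " ").split()]

# Guard the index so a short or empty string does not raise IndexError.
second = numbers[1] if len(numbers) > 1 else None
print(second)   # 3847183463814
```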
