splitting a string based on tab in the file

I have a file that contains values separated by tabs ("\t"). I am trying to create a list and store all the values from the file in it, but I have run into a problem. Here is my code:
line = "abc\tdef\tghi"
values = line.split("\t")
It works fine as long as there is only one tab between each value. But if there is more than one tab, the extra tabs show up in values as well (as empty strings). In my case the extra tab will mostly be after the last value in the file.

You can use regex here:
>>> import re
>>> strs = "foo\tbar\t\tspam"
>>> re.split(r'\t+', strs)
['foo', 'bar', 'spam']
update:
You can use str.rstrip to get rid of trailing '\t' and then apply regex.
>>> yas = "yas\t\tbs\tcda\t\t"
>>> re.split(r'\t+', yas.rstrip('\t'))
['yas', 'bs', 'cda']

Split on tab, but then remove all blank matches.
text = "hi\tthere\t\t\tmy main man"
print([splits for splits in text.split("\t") if splits])
Outputs:
['hi', 'there', 'my main man']
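Since the question is about collecting every value from a file, here is a minimal sketch of the same filtering idea applied line by line. The function name read_tab_values and the sample data are made up for illustration, and io.StringIO stands in for a real opened file.

```python
import io

def read_tab_values(lines):
    """Collect every non-empty tab-separated value from an iterable of lines."""
    values = []
    for line in lines:
        # Drop the line ending, split on tabs, and skip the empty strings
        # produced by consecutive or trailing tabs.
        values.extend(field for field in line.rstrip("\r\n").split("\t") if field)
    return values

# io.StringIO stands in for an opened file here (hypothetical sample data).
sample = io.StringIO("abc\tdef\t\tghi\t\njkl\t\tmno\n")
all_values = read_tab_values(sample)
print(all_values)
```

With a real file you would pass the file object from open(...) instead of the StringIO object.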

You can use regexp to do this:
import re
patt = re.compile("[^\t]+")
s = "a\t\tbcde\t\tef"
patt.findall(s)
['a', 'bcde', 'ef']

Another regex-based solution:
>>> strs = "foo\tbar\t\tspam"
>>> r = re.compile(r'([^\t]*)\t*')
>>> r.findall(strs)[:-1]
['foo', 'bar', 'spam']

Python has support for CSV files in the eponymous csv module. The name is somewhat misleading, since the module supports much more than just comma-separated values.
If you need to go beyond basic word splitting, you should take a look at it, for example when you have to deal with quoted values.
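As a rough illustration of that suggestion, the sketch below reads tab-separated data with csv.reader and delimiter="\t". The sample data is invented, and io.StringIO again stands in for a file.

```python
import csv
import io

# One row whose middle field is quoted and contains a tab (made-up data).
data = 'abc\t"def\tstill def"\tghi\n'

reader = csv.reader(io.StringIO(data), delimiter="\t")
rows = list(reader)
print(rows)
```

Note that csv.reader does not collapse consecutive tabs; an empty field between two tabs comes back as an empty string, so you may still want to filter those out.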

Related

How to capture specific character arrangements or a word from a line using Python

I need to read the following sample line and grab specific words from it.
Sample line:
#apple (orange3ball/345-35:;bat9cap/253-43) school=(book,pen,bottle)
Let's say I want to grab the words 'orange3ball' (between '(' and '/'), 'bat9cap', and 'bottle'. What is the best way to do it?
I tried the split() function but couldn't get it to work properly.
If that is too difficult, can I search for a specific arrangement of characters in a line?
For example, can I find the 'bat9cap' character arrangement in the above line?

This is a job for the interactive shell! Make a variable containing the line in question and experiment away. Here I did it for you to show you one slightly convoluted way to "grab" the word between ( and /.
>>> line = "#apple (orange3ball/345-35:;bat9cap/253-43) school=(book,pen,bottle)"
>>> line.split()
['#apple', '(orange3ball/345-35:;bat9cap/253-43)', 'school=(book,pen,bottle)']
>>> line.split()[1]
'(orange3ball/345-35:;bat9cap/253-43)'
>>> line.split()[1].split("/")
['(orange3ball', '345-35:;bat9cap', '253-43)']
>>> line.split()[1].split("/")[0]
'(orange3ball'
>>> line.split()[1].split("/")[0].strip("(")
'orange3ball'
Notice that I just pressed uparrow to get the code I used last and appended some stuff to it. The last line is rather unreadable though, so after finding something that works you may want to break it into several lines and use some nicely named variables to store the intermediate results.
The ideal way to do it depends on which aspects of the line you can depend on always being like they are here. (E.g. if the #apple part is optional so that it may not be there at all.) You may need to split on different characters or index into the resulting lists from the end of the list using negative indices (e.g. mylist[-1] to get the last item).
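The advice about breaking the chained expression into named steps could look something like this. The variable names are only suggestions.

```python
line = "#apple (orange3ball/345-35:;bat9cap/253-43) school=(book,pen,bottle)"

fields = line.split()                  # split on whitespace
bracketed = fields[1]                  # '(orange3ball/345-35:;bat9cap/253-43)'
first_part = bracketed.split("/")[0]   # '(orange3ball'
word = first_part.strip("(")           # remove the leading parenthesis
print(word)
```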

Use in to test membership:
>>> s='#apple (orange3ball/345-35:;bat9cap/253-43) school=(book,pen,bottle)'
>>> 'orange3ball' in s
True
>>> 'orange4ball' in s
False
>>> 'bat9cap' in s
True
>>> 'bat9ball' in s
False
You can also use a regex to break apart on word boundaries:
>>> import re
>>> re.findall(r'(?:\W*(\w+))', s)
['apple', 'orange3ball', '345', '35', 'bat9cap', '253', '43', 'school', 'book', 'pen', 'bottle']
The advantage of the second method is that only whole words appear in the resulting list, so a substring such as 'or' does not match:
>>> 'or' in s
True
>>> 'or' in re.findall(r'(?:\W*(\w+))', s)
False
Or just use a single regex to test for the whole word:
>>> re.search(r'\borange3ball\b', s)
<_sre.SRE_Match object; span=(8, 19), match='orange3ball'>
>>> re.search(r'\borange\b', s)
>>>
(A returned match object means the word was found; no output means re.search returned None and there was no match.)
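To actually pull out the three words from the question rather than just test for them, one possible sketch uses capture groups. The patterns assume the line always has the shape shown in the question (words before a "/" start after "(" or ";", and the last word sits inside "school=(...)").

```python
import re

line = "#apple (orange3ball/345-35:;bat9cap/253-43) school=(book,pen,bottle)"

# Words that follow "(" or ";" and run up to the next "/".
before_slash = re.findall(r"[(;]([^/()]+)/", line)

# Last comma-separated item inside "school=(...)".
school = re.search(r"school=\(([^)]*)\)", line)
last_item = school.group(1).split(",")[-1] if school else None

print(before_slash, last_item)
```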

Python Split With Delimiter In Field Value

I have a "CSV" in which some of the data fields happen to contain the comma delimiter, as in the second row of the following sample data.
"1","stuff","and","things"
"2","black,white","more","stuff"
I can't change the source data and I don't know how to str.split() and not split in the value "black,white".
Ways I've approached my problem:
I looked at partition() and don't see how that would benefit me.
I'm sure a regex would capture data properly but I'm not sure how to tie one into splitting.
Since every row in the source will always have the same number of fields, I thought maybe setting maxsplit would help, but I talked myself out of that: it would still split within "black,white" and I would end up losing the last value (which would be "stuff" in this case).
Certainly this is easy to overcome so I'm looking forward to learning something new!
Your help is greatly appreciated.

Using csv and io.StringIO (Python 3; in Python 2 the class lives in the StringIO module):
>>> import csv, io
>>> data = """"1","stuff","and","things"
... "2","black,white","more","stuff"
... """
>>> reader = csv.reader(io.StringIO(data))
>>> for row in reader:
...     print(row)
...
['1', 'stuff', 'and', 'things']
['2', 'black,white', 'more', 'stuff']

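If your source is not CSV and you just want to balance quotes in your string, you may try using the shlex module:
import shlex
lex = shlex.shlex('"2","black,white","more","stuff"')
# Each quoted field comes back as one token (quotes included); the commas are separate tokens.
for i in lex:
    print(i)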

Commas outside the strings are always followed by double-quotes. Just split on ," instead of just , (or even ",")
>>> x = '"2","black,white","more","stuff"'
>>> x
'"2","black,white","more","stuff"'
>>> x.split(',"')
['"2"', 'black,white"', 'more"', 'stuff"']
>>> [y.strip('"') for y in x.split(',"')]
['2', 'black,white', 'more', 'stuff']
Of course, edit for efficiency
YevgenYampolskiy's suggestion of shlex is also an alternative.
>>> x = '"2","black,white","more","stuff"'
>>> x
'"2","black,white","more","stuff"'
>>> import shlex
>>> y = shlex.shlex(x)
>>> [i.strip('"') for i in y if i != ',']
['2', 'black,white', 'more', 'stuff']

Protect commas on consecutive string.join() and string.split()

Suppose the following code (notice the commas inside the strings):
>>> a = ['1',",2","3,"]
I need to concatenate the values into a single string. Naive example:
>>> b = ",".join(a)
>>> b
'1,,2,3,'
And later I need to split the resulting object again:
>>> b.split(',')
['1', '', '2', '3', '']
However, the result I am looking for is the original list:
['1', ',2', '3,']
What's the simplest way to protect the commas in this process? The best solution I came up with looks rather ugly.
Note: the comma is just an example. The strings can contain any character. And I can choose other characters as separators.

The strings can contain any character.
If no matter what you use as a delimiter, there is a chance that the item itself contains the delimiter character, then use the csv module:
import csv
class PseudoFile(object):
    # http://stackoverflow.com/a/8712426/190597
    def write(self, string):
        return string
writer = csv.writer(PseudoFile())
This concatenates the items in a using commas:
a = ['1',",2","3,"]
line = writer.writerow(a)
print(line)
# 1,",2","3,"
This recovers a from line:
print(next(csv.reader([line])))
# ['1', ',2', '3,']
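A variation on the same idea, sketched for Python 3 with io.StringIO as the buffer instead of a pseudo-file class. The helper names join_csv and split_csv are made up for this example.

```python
import csv
import io

def join_csv(items):
    """Serialize a list of strings into one CSV-formatted line."""
    buffer = io.StringIO()
    csv.writer(buffer).writerow(items)
    # The writer appends "\r\n" by default; drop that line terminator.
    return buffer.getvalue().rstrip("\r\n")

def split_csv(line):
    """Recover the list of strings from a line produced by join_csv."""
    return next(csv.reader([line]))

a = ['1', ',2', '3,']
b = join_csv(a)
print(b)
print(split_csv(b))
```

The csv module quotes only the items that contain the delimiter or the quote character, so plain items stay readable.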

Do you have to use commas to separate the items? If not, you could use another symbol that does not occur in the items of the list.
In [1]: '|'.join(['1', ',2', '3,']).split('|')
Out[1]: ['1', ',2', '3,']
Edit: The string may apparently contain any character. Is it an option to use the json module? You could just dump and load the list.
In [2]: import json
In [3]: json.dumps(['1', ',2', '3,'])
Out[3]: '["1", ",2", "3,"]'
In [4]: json.loads('["1", ",2", "3,"]')
Out[4]: [u'1', u',2', u'3,']
Edit #2: If you may not use it, you could use str.encode('string-escape') (Python 2 only) to escape the characters in your strings, then enclose each encoded string in single quotes and separate them with commas:
In [10]: print "'example'".encode('string-escape')
\'example\'
In [11]: print r"\'example\'".decode('string-escape')
'example'
Edit #3: Running example of str.encode('string-escape'):
import re
def list_to_str(items):
    # Python 2 only: the 'string-escape' codec does not exist in Python 3.
    return ','.join("'{}'".format(s.encode('string-escape')) for s in items)
def str_to_list(text):
    return re.findall(r"'([^']*)'", text)
if __name__ == '__main__':
    a = ['1', ',2', '3,']
    b = list_to_str(a)
    print 'It is {} that this works.'.format(str_to_list(b) == a)
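Because the 'string-escape' codec is not available in Python 3, the json round trip mentioned earlier in this answer is probably the easier route there. A small sketch, with a sample list extended by a couple of made-up items that contain quotes:

```python
import json

# The last two items are invented to show that quotes survive the round trip.
a = ["1", ",2", "3,", "it's", 'say "hi"']

encoded = json.dumps(a)      # a single string, safe to store or transmit
decoded = json.loads(encoded)

print(encoded)
print(decoded == a)
```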

When you serialize a list to a string, you need to choose a separator character that doesn't appear in the list items. Can't you just replace the comma with another character?
b = ";".join(a)
b.split(';')

Does the delimiter need to be only a single character? If not, you can use a delimiter made up of a sequence of characters that definitely won't appear in your strings, like |#| or something similar.

You need to escape the comma and probably also escape the escape sequence. Here's one way:
>>> a = ['1',",2","3,"]
>>> b = ','.join(s.replace('%', '%%').replace(',', '%2c') for s in a)
>>> [s.replace('%2c', ',').replace('%%', '%') for s in b.split(',')]
['1', ',2', '3,']
>>> b
'1,%2c2,3%2c'
>>>
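One caveat with the replace-based decoding is that an item containing the literal text %2c would not survive the round trip. A possible alternative, sketched here with a backslash as the escape character and a character-by-character decoder, avoids that. The function names and the choice of backslash are assumptions for this example, not part of the answer above.

```python
def join_escaped(items, sep=",", esc="\\"):
    """Join items with sep, escaping any esc or sep inside the items."""
    escaped = (s.replace(esc, esc + esc).replace(sep, esc + sep) for s in items)
    return sep.join(escaped)

def split_escaped(text, sep=",", esc="\\"):
    """Inverse of join_escaped: split on unescaped sep and undo the escaping."""
    items, current = [], []
    chars = iter(text)
    for ch in chars:
        if ch == esc:
            # Take the next character literally, whatever it is.
            current.append(next(chars, ""))
        elif ch == sep:
            items.append("".join(current))
            current = []
        else:
            current.append(ch)
    items.append("".join(current))
    return items

a = ['1', ',2', '3,']
b = join_escaped(a)
print(b)
print(split_escaped(b))
```

An empty list and a list holding one empty string both serialize to an empty string, so this scheme cannot tell them apart.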

I would join and split using a character other than ",", e.g. ";":
>>> b = ";".join(a)
>>> b.split(';')
['1', ',2', '3,']

How to split a string by using [] in Python

So from this string:
"name[id]"
I need this:
"id"
I used str.split ('[]'), but it didn't work. Does it only take a single delimiter?

Use a regular expression:
import re
s = "name[id]"
re.search(r"\[(.*?)\]", s).group(1) # = 'id'
str.split() takes a string on which to split input. For instance:
"i,split,on commas".split(',') # = ['i', 'split', 'on commas']
The re module also allows you to split by regular expression, which can be very useful, and I think is what you meant to do.
import re
s = "name[id]"
# split by either a '[' or a ']'
re.split(r'\[|\]', s) # = ['name', 'id', '']

Either
"name[id]".split('[')[1][:-1] == "id"
or
"name[id]".split('[')[1].split(']')[0] == "id"
or
re.search(r'\[(.*?)\]',"name[id]").group(1) == "id"
or
re.split(r'[\[\]]',"name[id]")[1] == "id"

Yes, the delimiter is the whole string argument passed to split. So your example would only split a string like 'name[]id[]'.
Try eg. something like:
'name[id]'.split('[', 1)[-1].split(']', 1)[0]
'name[id]'.split('[', 1)[-1].rstrip(']')

I'm not a fan of regex, but in cases like this it often provides the best solution.
Triptych already recommended this, but I'd like to point out that the ?P<> group assignment can be used to assign a match to a dictionary key:
>>> m = re.match(r'.*\[(?P<id>\w+)\]', 'name[id]')
>>> result_dict = m.groupdict()
>>> result_dict
{'id': 'id'}
>>>
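Extending the named-group idea a little, the sketch below captures both the part before the brackets and the part inside them. The group name "name" and the use of fullmatch are choices made for this example.

```python
import re

# Two named groups: the text before "[" and the text inside the brackets.
pattern = re.compile(r"(?P<name>\w+)\[(?P<id>\w+)\]")

m = pattern.fullmatch("name[id]")
parts = m.groupdict() if m else {}
print(parts)
```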

You don't actually need regular expressions for this. The .index() function and string slicing will work fine.
Say we have:
>>> s = 'name[id]'
Then:
>>> s[s.index('[')+1:s.index(']')]
'id'
To me, this is easy to read: "start one character after the [ and finish before the ]".
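Keep in mind that .index() raises ValueError when the bracket is missing. If that can happen in your data, a variant based on .find() might be preferable. This is only a sketch, and the function name extract_bracketed is made up.

```python
def extract_bracketed(s):
    """Return the text between the first "[" and the next "]", or None."""
    start = s.find("[")
    if start == -1:
        return None
    end = s.find("]", start + 1)
    if end == -1:
        return None
    return s[start + 1:end]

print(extract_bracketed("name[id]"))
print(extract_bracketed("name"))
```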

def between_brackets(text):
    return text.partition('[')[2].partition(']')[0]
This will also work even if your string does not contain a […] construct, and it assumes an implied ] at the end in the case you have only a [ somewhere in the string.
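The edge cases described above can be checked with a few sample strings. The inputs are invented for illustration.

```python
def between_brackets(text):
    # Text after the first "[" and before the following "]".
    return text.partition("[")[2].partition("]")[0]

# A complete pair, a missing "]", and no brackets at all.
results = {s: between_brackets(s) for s in ("name[id]", "name[id", "name")}
print(results)
```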

I'm new to python and this is an old question, but maybe this?
str.split('[')[1].strip(']')

You can get a value from a list by indexing it with []. For example, create a list from a URL with split, as below.
>>> urls = 'http://quotes.toscrape.com/page/1/'
This generates a list like the one below.
>>> print( urls.split("/") )
['http:', '', 'quotes.toscrape.com', 'page', '1', '']
What if you want only the value "http:" from this list? You can index it like this:
>>> print(urls.split("/")[0])
http:
Or what if you want only the value "1" from this list? You can index it like this:
>>> print(urls.split("/")[-2])
1

str.split uses the entire parameter to split a string. Try:
str.split("[")[1].split("]")[0]

How to quickly parse a list of strings

If I want to split a list of words separated by a delimiter character, I can use
>>> 'abc,foo,bar'.split(',')
['abc', 'foo', 'bar']
But how can I easily and quickly do the same thing if I also want to handle quoted strings, which can contain the delimiter character?
In: 'abc,"a string, with a comma","another, one"'
Out: ['abc', 'a string, with a comma', 'another, one']
Related question: How can i parse a comma delimited string into a list (caveat)?

import csv
input = ['abc,"a string, with a comma","another, one"']
parser = csv.reader(input)
for fields in parser:
    for i, f in enumerate(fields):
        print(i, f)  # Python 3; in Python 2 use: print i, f
Result:
0 abc
1 a string, with a comma
2 another, one
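If you only have a single string and want a list back, as in the In/Out example from the question, a small wrapper around csv.reader might be enough. The name parse_line is made up for this sketch.

```python
import csv

def parse_line(line):
    """Split one delimited line into fields, honoring double-quoted fields."""
    return next(csv.reader([line]))

fields = parse_line('abc,"a string, with a comma","another, one"')
print(fields)
```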

The CSV module should be able to do that for you
