The code looks like this
txt = "ID:2020,Sugar:3,cost_sugar:30,ID:2021,Sugar:5,cost_sugar:50"
x = txt.split(",")
And the output is
['ID:2020', 'Sugar:3']
Now I want to perform split again such that the output should look like:
['ID', '2020', 'Sugar', '3']
I like the answers that use the re module. Here is another way to do that:
txt.replace(":", ",").split(",")
Essentially, you're treating ":" as a ",", so why not first replace the ":" with a "," and then split?
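A minimal sketch of that idea, using the txt value from the question. The variable name parts is only illustrative, and the final check simply confirms that the result agrees with splitting on both delimiters at once.

```python
import re

txt = "ID:2020,Sugar:3,cost_sugar:30,ID:2021,Sugar:5,cost_sugar:50"

# Turn every ":" into "," so a single str.split(",") handles both delimiters.
parts = txt.replace(":", ",").split(",")
print(parts[:4])  # ['ID', '2020', 'Sugar', '3']

# Same result as splitting on either delimiter with a regex.
assert parts == re.split(r"[,:]", txt)
```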
Another solution is to use regex
>>> import re
>>> re.split(",|:", txt)
['ID', '2020', 'Sugar', '3', 'cost_sugar', '30', 'ID', '2021', 'Sugar', '5', 'cost_sugar', '50']
Python's re module has a nice feature
import re
re.split(':|,',txt)
You can use either snippet below in your project; they are also good examples if you are just learning. The first one uses lists; the second uses re.
txt = "ID:2020,Sugar:3,cost_sugar:30,ID:2021,Sugar:5,cost_sugar:50"
returnlist = []
# Split into "key:value" chunks first, then split each chunk on ":".
for x in txt.split(","):
    for fullkey in x.split(":"):
        returnlist.append(fullkey)
print(returnlist)
OR
import re
txt = "ID:2020,Sugar:3,cost_sugar:30,ID:2021,Sugar:5,cost_sugar:50"
returnlist = re.split(':|,', txt)
print(returnlist)
You can use a list to first collect the keys and values, and then print the list.
I webscraped some data but it jammed it all into one place, so I'm trying to split a list of strings and the strings are composed out of string characters and out of numbers. I want to split them the moment a number appears and make myself a data table out of that.
Imagine there is a list strings:
string0 = 'string123' ; string1 = 'a12' ; string2 = 'bob69'....
Does anyone have any ideas how I can do that?
What about using regex? i.e., the re package in python, combined with the split method? Something like this could work:
import re
string = 'string01string02string23string4string500string'
strlist = re.split(r'(\d+)', string)  # raw string so \d is not treated as a string escape
print(strlist)
['string', '01', 'string', '02', 'string', '23', 'string', '4', 'string', '500', 'string']
In your case, I think you would then need to combine each pair of adjacent elements in the list, so something like this:
cmb = [i+j for i,j in zip(strlist[::2], strlist[1::2])]
print(cmb)
['string01', 'string02', 'string23', 'string4', 'string500']
You can split using a regex with only a lookbehind and lookahead (see re documentation for reference):
import re
re.split(r'(?<=\D)(?=\d)', string0)
output: ['string', '123']
NB. if you want to split on any change from non-number to number and conversely:
re.split(r'(?<=\D)(?=\d)|(?<=\d)(?=\D)', 'abc123abc123')
## OR
re.findall(r'(\D+|\d+)', 'abc123abc123')
output: ['abc', '123', 'abc', '123']
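To get from there to the data table the question asks for, one option is to match each scraped string against a "text, then digits" pattern. This is only a sketch: the names strings and rows are made up for illustration, and it assumes every entry consists of non-digits followed by digits.

```python
import re

strings = ['string123', 'a12', 'bob69']  # sample values from the question

rows = []
for s in strings:
    # One run of non-digits followed by one run of digits, covering the whole string.
    m = re.fullmatch(r'(\D+)(\d+)', s)
    if m:
        rows.append((m.group(1), int(m.group(2))))

print(rows)  # [('string', 123), ('a', 12), ('bob', 69)]
```

Entries that do not fit the pattern are skipped here; whether to skip or report them depends on your data.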
I am doing the following to get the movieID:
>>> x.split('content')
['movieID" ', '="770672122">']
>>> [item for item in x.split('content')[1] if item.isdigit()]
['7', '7', '0', '6', '7', '2', '1', '2', '2']
>>> ''.join([item for item in x.split('content')[1] if item.isdigit()])
'770672122'
What would be a better way to do this?
Without using a regular expression, you could just split by the double quotes and take the next to last field.
u="""movieID" content="7706">"""
u.split('"')[-2] # returns: '7706'
This trick is definitely the most readable, if you don't know about regular expressions yet.
Your string is a bit strange though as there are 3 double quotes. I assume it comes from an HTML file and you're only showing a small substring. In that case, you might make your code more robust by using a regular expression such as:
import re
s = re.search(r'(\d+)', u) # looks for one or more consecutive digits
s.groups() # returns: ('7706',)
You could make it even more robust (but you'll need to read more) by using a DOM-parser such as BeautifulSoup.
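As a rough sketch of the parser route, here is the same idea using the standard library's html.parser instead of BeautifulSoup, so nothing needs to be installed. The full tag is not shown in the question, so the <meta name="movieID" content="..."> markup below is an assumption, and ContentGrabber is a hypothetical helper name.

```python
from html.parser import HTMLParser

class ContentGrabber(HTMLParser):  # hypothetical helper name
    """Collect the content attribute of tags whose name attribute is movieID."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attrs arrives as a list of (name, value) pairs
        if attrs.get('name') == 'movieID' and 'content' in attrs:
            self.values.append(attrs['content'])

# Assumed markup; the question only shows a fragment of the tag.
html = '<meta name="movieID" content="770672122">'
parser = ContentGrabber()
parser.feed(html)
print(parser.values)  # ['770672122']
```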
I assume x looks like this:
x = 'movieID content="770672122">'
Regex is definitely one way to extract the content. For example:
>>> re.search(r'content="(\d+)', x).group(1)
'770672122'
The above fetches one or more consecutive digits which follow the string content=".
It seems you could do something like the following if your string is like the below:
>>> import re
>>> x = 'movieID content="770672122">'
>>> re.search(r'\d+', x).group()
'770672122'
I have a "CSV" in which some of the data fields happen to contain the comma delimiter, as in the second row of the following sample data.
"1","stuff","and","things"
"2","black,white","more","stuff"
I can't change the source data and I don't know how to str.split() and not split in the value "black,white".
Ways I've approached my problem:
I looked at partition() and don't see how that would benefit me.
I'm sure a regex would capture data properly but I'm not sure how to tie one into splitting.
Since every row in the source always has the same number of fields, I thought setting maxsplit might help, but I talked myself out of it: it would still split within "black,white", and I would end up losing the last value (which would be "stuff" in this case).
Certainly this is easy to overcome so I'm looking forward to learning something new!
Your help is greatly appreciated.
Using csv and io.StringIO (Python 3):
>>> import csv, io
>>> data = """"1","stuff","and","things"
... "2","black,white","more","stuff"
... """
>>> reader = csv.reader(io.StringIO(data))
>>> for row in reader:
...     print(row)
...
['1', 'stuff', 'and', 'things']
['2', 'black,white', 'more', 'stuff']
If your source is not CSV, and you just want to balance quotes in your string you may try using shlex module:
import shlex
lex = shlex.shlex('"2","black,white","more","stuff"')
for i in lex:
    print(i)
Commas outside the strings are always followed by double-quotes. Just split on ," instead of just , (or even ",")
>>> x = '"2","black,white","more","stuff"'
>>> x
'"2","black,white","more","stuff"'
>>> x.split(',"')
['"2"', 'black,white"', 'more"', 'stuff"']
>>> [y.strip('"') for y in x.split(',"')]
['2', 'black,white', 'more', 'stuff']
Of course, edit for efficiency
YevgenYampolskiy's suggestion of shlex is also an alternative.
>>> x = '"2","black,white","more","stuff"'
>>> x
'"2","black,white","more","stuff"'
>>> import shlex
>>> y = shlex.shlex(x)
>>> [i.strip('"') for i in y if i != ',']
['2', 'black,white', 'more', 'stuff']
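Since the question also asks how a regex would fit in: rather than splitting, you can capture whatever sits between each pair of double quotes with re.findall. This is a sketch only, assuming every field is quoted and no field contains an escaped quote; the csv module remains the safer choice for real CSV data.

```python
import re

x = '"2","black,white","more","stuff"'

# Capture the text between each pair of double quotes; commas inside stay put.
fields = re.findall(r'"([^"]*)"', x)
print(fields)  # ['2', 'black,white', 'more', 'stuff']
```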
How do I convert "1,,2'3,4'" into a list? Commas separate the individual items, unless they are within quotes. In that case, the comma is to be included in the item.
This is the desired result: ['1', '', '2', '3,4']. One regex I found on another thread to ignore the quotes is as follows:
re.compile(r'''((?:[^,"']|"[^"]*"|'[^']*')+)''')
But this gives me this output:
['', '1', ',,', "2'3,4'", '']
I can't understand where these extra empty strings are coming from, or why the two commas are being returned at all, let alone together.
I tried making this regex myself:
re.compile(r'''(, | "[^"]*" | '[^']*')''')
which ended up not detecting anything, and just returned my original list.
I don't understand why. Shouldn't it detect the commas at the very least? The same problem occurs if I add a ? after the comma.
Instead of a regular expression, you might be better off using the csv module since what you are dealing with is a CSV string:
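from io import StringIO  # Python 3 replacement for cStringIO
from csv import reader
file_like_object = StringIO("1,,2,'3,4'")
# quotechar="'" tells the reader that single quotes delimit quoted fields
csv_reader = reader(file_like_object, quotechar="'")
for row in csv_reader:
    print(row)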
This results in the following output:
['1', '', '2', '3,4']
pyparsing includes a predefined expression for comma-separated lists:
>>> from pyparsing import commaSeparatedList
>>> s = "1,,2'3,4'"
>>> print(commaSeparatedList.parseString(s).asList())
['1', '', "2'3", "4'"]
Hmm, looks like you have a typo in your data, missing a comma after the 2:
>>> s = "1,,2,'3,4'"
>>> print(commaSeparatedList.parseString(s).asList())
['1', '', '2', "'3,4'"]
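As for where the extra empty strings and the ',,' in the question come from: that output is consistent with calling split on the compiled pattern (the question does not show the call, so that part is an assumption). The pattern matches the items rather than the separators, and split with a capturing group returns both the captured matches and the text left between them. A short sketch:

```python
import re

pattern = re.compile(r'''((?:[^,"']|"[^"]*"|'[^']*')+)''')
s = "1,,2'3,4'"

# split() treats each match as a delimiter, and the capturing group puts the
# delimiters back into the result, alongside the leftover text between them.
print(pattern.split(s))    # ['', '1', ',,', "2'3,4'", '']

# findall() returns just the matched items; the empty field is lost.
print(pattern.findall(s))  # ['1', "2'3,4'"]
```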
So from this string:
"name[id]"
I need this:
"id"
I used str.split ('[]'), but it didn't work. Does it only take a single delimiter?
Use a regular expression:
import re
s = "name[id]"
re.search(r"\[(.*?)\]", s).group(1) # = 'id'
str.split() takes a string on which to split input. For instance:
"i,split,on commas".split(',') # = ['i', 'split', 'on commas']
The re module also allows you to split by regular expression, which can be very useful, and I think is what you meant to do.
import re
s = "name[id]"
# split by either a '[' or a ']'
re.split(r'\[|\]', s) # = ['name', 'id', '']
Either
"name[id]".split('[')[1][:-1] == "id"
or
"name[id]".split('[')[1].split(']')[0] == "id"
or
re.search(r'\[(.*?)\]',"name[id]").group(1) == "id"
or
re.split(r'[\[\]]',"name[id]")[1] == "id"
Yes, the delimiter is the whole string argument passed to split. So your example would only split a string like 'name[]id[]'.
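A quick illustration of that point, using the two strings mentioned above:

```python
# The whole argument is the delimiter, so '[]' only matches the
# two-character sequence "[]", not "[" or "]" on their own.
print('name[id]'.split('[]'))    # ['name[id]']
print('name[]id[]'.split('[]'))  # ['name', 'id', '']
```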
Try, e.g., something like:
'name[id]'.split('[', 1)[-1].split(']', 1)[0]
'name[id]'.split('[', 1)[-1].rstrip(']')
I'm not a fan of regex, but in cases like this it often provides the best solution.
Triptych already recommended this, but I'd like to point out that the ?P<> group assignment can be used to assign a match to a dictionary key:
>>> m = re.match(r'.*\[(?P<id>\w+)\]', 'name[id]')
>>> result_dict = m.groupdict()
>>> result_dict
{'id': 'id'}
>>>
You don't actually need regular expressions for this. The .index() function and string slicing will work fine.
Say we have:
>>> s = 'name[id]'
Then:
>>> s[s.index('[')+1:s.index(']')]
'id'
To me, this is easy to read: "start one character after the [ and finish before the ]".
def between_brackets(text):
    return text.partition('[')[2].partition(']')[0]
This will also work even if your string does not contain a […] construct, and it assumes an implied ] at the end in the case you have only a [ somewhere in the string.
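A small sketch of those edge cases; the sample strings are made up for illustration. The contrast with str.index is included because index raises an error when the bracket is missing.

```python
def between_brackets(text):
    return text.partition('[')[2].partition(']')[0]

print(between_brackets('name[id]'))           # id
print(repr(between_brackets('no brackets')))  # '' (no "[" found)
print(between_brackets('name[id'))            # id (implied closing bracket)

# By contrast, str.index raises ValueError when the bracket is missing.
try:
    'no brackets'.index('[')
except ValueError:
    print('index() raised ValueError')
```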
I'm new to python and this is an old question, but maybe this?
str.split('[')[1].strip(']')
You can get a value from the list by indexing it with []. For example, create a list from a URL like the one below with split.
>>> urls = 'http://quotes.toscrape.com/page/1/'
This generates a list like the one below.
>>> print( urls.split("/") )
['http:', '', 'quotes.toscrape.com', 'page', '1', '']
And what if you want only the value "http:" from this list? You can index it like this:
>>> print(urls.split("/")[0])
http:
Or what if you want only the value "1" from this list? You can index it like this:
>>> print(urls.split("/")[-2])
1
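If the string really is a URL, a different technique from plain split is the standard library's urllib.parse.urlsplit, which separates the scheme from the path before you index anything. A sketch under that assumption; the name segments is only illustrative.

```python
from urllib.parse import urlsplit

urls = 'http://quotes.toscrape.com/page/1/'
parts = urlsplit(urls)

print(parts.scheme)  # http
print(parts.netloc)  # quotes.toscrape.com

# Split only the path, ignoring the leading and trailing slashes.
segments = parts.path.strip('/').split('/')
print(segments)      # ['page', '1']
```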
str.split uses the entire parameter to split a string. Try:
str.split("[")[1].split("]")[0]