parsing in python - python

I have the following string:
adId:4028cb901dd9720a011e1160afbc01a3;siteId:8a8ee4f720e6beb70120e6d8e08b0002;userId:5082a05c-015e-4266-9874-5dc6262da3e0
I need only the values of adId, siteId and userId, that is:
4028cb901dd9720a011e1160afbc01a3
8a8ee4f720e6beb70120e6d8e08b0002
5082a05c-015e-4266-9874-5dc6262da3e0
I want all three in separate variables or in an array so that I can use each of them.

You can split them to a dictionary if you don't need any fancy parsing:
In [2]: dict(kvpair.split(':') for kvpair in s.split(';'))
Out[2]:
{'adId': '4028cb901dd9720a011e1160afbc01a3',
'siteId': '8a8ee4f720e6beb70120e6d8e08b0002',
'userId': '5082a05c-015e-4266-9874-5dc6262da3e0'}

You could do something like this:
# named s rather than input, to avoid shadowing the built-in input()
s = 'adId:4028cb901dd9720a011e1160afbc01a3;siteId:8a8ee4f720e6beb70120e6d8e08b0002;userId:5082a05c-015e-4266-9874-5dc6262da3e0'
result = {}
for pair in s.split(';'):
    key, value = pair.split(':')
    result[key] = value
print(result['adId'])
print(result['siteId'])
print(result['userId'])

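import re
# buf is the input string; (?:;|$) lets the last pair match even without a trailing ';'
matches = re.findall(r"([a-zA-Z0-9_]+):([a-zA-Z0-9\-]+)(?:;|$)", buf)
for m in matches:
    key = m[0]    # 'adId', 'siteId', ...
    value = m[1]  # the long ID string
You can also limit the length of a value using {32}, like
([a-zA-Z0-9]{32});
Regular expressions allow you to validate the string as well as split it into its component parts.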

There is an awesome method called split() for python that will work nicely for you. I would suggest using it twice, once for ';' then again for each one of those using ':'.
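A minimal sketch of the split-twice approach, assuming every pair is well formed (`key:value`, with pairs separated by `;`). The helper name `parse_pairs` is hypothetical, and `maxsplit=1` is an added precaution so that a value containing ':' stays intact.

```python
def parse_pairs(text):
    """Split 'k1:v1;k2:v2' into a dict; assumes every pair contains a ':'."""
    result = {}
    for pair in text.split(';'):
        # maxsplit=1 keeps any further ':' inside the value
        key, value = pair.split(':', 1)
        result[key] = value
    return result

s = ('adId:4028cb901dd9720a011e1160afbc01a3;'
     'siteId:8a8ee4f720e6beb70120e6d8e08b0002;'
     'userId:5082a05c-015e-4266-9874-5dc6262da3e0')
ids = parse_pairs(s)
# the three values in separate variables, as the question asks
ad_id, site_id, user_id = ids['adId'], ids['siteId'], ids['userId']
print(ad_id, site_id, user_id)
```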

Related

Regex: Capture a line when certain columns are equal to certain values

Let's say we have this data extract:
ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10
I want to retrieve the lines where from = paris and type = member.
In this example, only this line satisfies both rules:
1,paris,berlin,member,12
I am trying to do this with regex only. I am still learning, and this is as far as I could get:
^.*(paris).*(member).*$
However, this will give me also the second line where paris is a destination.
The idea I guess is to:
Divide the line by commas.
Check if the second item is equal to 'paris'
Check if the fourth item is equal to 'member', or even check if there is 'member' in that line as there is no confusion with this part.
Any solution where I can use only regex?
Use [^,]* instead of .* to match a sequence of characters that doesn't include the comma separator. Use this for each field you want to skip when matching the line.
^[^,]*,paris,[^,]*,member,
Note that this is a very fragile mechanism compared to using the csv module, since it will break if any field contains a comma (the csv module understands quoting, which protects a delimiter inside a field).
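As a rough illustration, the pattern above can be applied to the sample data from the question. The `re.MULTILINE` flag and the trailing `.*$` (to capture the rest of the line) are additions for this sketch; the variable names are arbitrary.

```python
import re

data = """ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10"""

# re.MULTILINE makes ^ and $ match at the start and end of every line.
# Each [^,]* skips exactly one field, so 'paris' must be the 2nd column
# and 'member' the 4th; the trailing .* picks up the rest of the line.
pattern = re.compile(r'^[^,]*,paris,[^,]*,member,.*$', re.MULTILINE)
matches = pattern.findall(data)
print(matches)  # ['1,paris,berlin,member,12']
```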
This should do it:
^.*,(paris),.*,(member),.*$
As many have pointed out, I would read this into a dictionary using csv. However, if you insist on using regex, this should work:
[0-9]+\,paris.*[^-]member.*
Try this:
import re
regex = r"\d+,paris,\w+,member,\d+"
# named data rather than str, to avoid shadowing the built-in str type
data = """ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10"""
for line in data.split("\n"):
    if re.match(regex, line):
        print(line)
You can try this:
import re
s = """
ID,from,to,type,duration
1,paris,berlin,member,12
2,berlin,paris,member,12
3,paris,madrid,non-member,10
"""
final_data = re.findall(r'\d+,paris,\w+,member,\d+', s)
Output:
['1,paris,berlin,member,12']
However, note that the best solution is to read the file and use a dictionary:
import csv
with open('filename.csv') as f:
    l = list(csv.reader(f))
final_l = [dict(zip(l[0], i)) for i in l[1:]]
final_data = [','.join(i[b] for b in l[0]) for i in final_l if i['from'] == 'paris' and i['type'] == 'member']

Why is the split() returning list objects that are empty? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert, but you could just remove the empty strings from your list:
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
time_info = list(filter(None, str_list))  # list() is needed on Python 3, where filter() returns an iterator
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
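A short sketch of the named-group variant described above. The group names `start` and `end` are illustrative choices, not something the question prescribes.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'
# 'start' and 'end' are hypothetical group names chosen for this example
m = re.search(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.', f)
# re.search returns None when nothing matches, so guard before groupdict()
time_info = m.groupdict() if m else {}
print(time_info)  # {'start': '20111007T084734', 'end': '20111008T023142'}
```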
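If the timestamps always come after the second _, you can use plain string methods. Note that str.strip(".txt") removes any of the characters '.', 't' and 'x' from both ends rather than the suffix itself; it happens to work for these names, but rsplit is safer:
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.rsplit(".", 1)[0].split("_", 2)[-1].split("-")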
['20111007T084734', '20111008T023142']
Since this came up on Google, and for completeness: try using re.findall as an alternative!
This does require a little rethinking, but it still returns a list of matches, like split does. That makes it a nice drop-in replacement for some existing code, and it gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
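The answer above gives no code, so here is one possible sketch. It assumes each timestamp has the shape shown in the question (8 digits, 'T', 6 digits); the second pattern is a looser lookaround variant, and both variable names are arbitrary.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Assumes each timestamp is exactly 8 digits, 'T', then 6 digits
time_info = re.findall(r'\d{8}T\d{6}', f)
print(time_info)  # ['20111007T084734', '20111008T023142']

# Lookaround variant: digits/T/digits preceded by '_' or '-'
# and followed by '-' or '.'; the delimiters are not part of the match
time_info_lookaround = re.findall(r'(?<=[_-])\d+T\d+(?=[-.])', f)
print(time_info_lookaround)
```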
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

Python split a string at an underscore

How do I split a string at the second underscore in Python so that I get something like this?
name = "this_is_my_name_and_its_cool"
Split name so that I get ["this_is", "my_name_and_its_cool"].
The following statement will split name into a list of strings:
a=name.split("_")
You can combine whatever strings you want using join, in this case the first two words and the rest:
b="_".join(a[:2])
c="_".join(a[2:])
Maybe you can write a small function that takes as an argument the number of words (n) after which you want to split:
def func(name, n):
    a=name.split("_")
    b="_".join(a[:n])
    c="_".join(a[n:])
    return [b,c]
Assuming that you have a string with multiple instances of the same delimiter and you want to split at the nth delimiter, ignoring the others.
Here's a solution using just split and join, without complicated regular expressions. This might be a bit easier to adapt to other delimiters and particularly other values of n.
def split_at(s, c, n):
    words = s.split(c)
    return c.join(words[:n]), c.join(words[n:])
Example:
>>> split_at('this_is_my_name_and_its_cool', '_', 2)
('this_is', 'my_name_and_its_cool')
I think you're trying to split the string at the second underscore. If so, you could use the findall function.
>>> import re
>>> s = "this_is_my_name_and_its_cool"
>>> re.findall(r'^[^_]*_[^_]*|[^_].*$', s)
['this_is', 'my_name_and_its_cool']
>>> [i for i in re.findall(r'^[^_]*_[^_]*|(?!_).*$', s) if i]
['this_is', 'my_name_and_its_cool']
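Try this. Because the pattern contains a capturing group, re.split also returns an empty string before the match, so drop the first element:
import re
print(re.split(r"(^[^_]+_[^_]+)_", "this_is_my_name_and_its_cool")[1:])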
Here's a quick & dirty way to do it:
s = 'this_is_my_name_and_its_cool'
i = s.find('_'); i = s.find('_', i+1)
print([s[:i], s[i+1:]])
Output:
['this_is', 'my_name_and_its_cool']
You could generalize this approach to split on the nth separator by putting the find() into a loop.
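One way the loop version might look, as a sketch only. The function name `split_at_nth` and the choice to return the unsplit string when there are fewer than n separators are assumptions, not part of the answer above.

```python
def split_at_nth(s, sep, n):
    """Split s at the n-th occurrence of sep (1-based) using repeated find()."""
    i = -1
    for _ in range(n):
        # continue searching just past the previous hit
        i = s.find(sep, i + 1)
        if i == -1:
            # fewer than n separators: return the string unsplit
            return [s]
    return [s[:i], s[i + len(sep):]]

print(split_at_nth('this_is_my_name_and_its_cool', '_', 2))
# ['this_is', 'my_name_and_its_cool']
```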

String concatenation produces incorrect output in Python?

I have this code:
filenames=["file1","FILE2","file3","fiLe4"]
def alignfilenames():
    #build a string that can be used to add labels to the R variables.
    #format goal: suffixes=c(".fileA",".fileB")
    filestring='suffixes=c(".'
    for filename in filenames:
        filestring=filestring+str(filename)+'",".'
    print(filestring[:-3])
    #now delete the extra characters
    filestring=filestring[-1:-4]
    filestring=filestring+')'
    print("New String")
    print(str(filestring))
alignfilenames()
I'm trying to get the string variable to look like this format: suffixes=c(".fileA",".fileB".....) but adding on the final parenthesis is not working. When I run this code as is, I get:
suffixes=c(".file1",".FILE2",".file3",".fiLe4"
New String
)
Any idea what's going on or how to fix it?
Does this do what you want?
>>> filenames=["file1","FILE2","file3","fiLe4"]
>>> c = "suffixes=c(%s)" % (",".join('".%s"' %f for f in filenames))
>>> c
'suffixes=c(".file1",".FILE2",".file3",".fiLe4")'
Using str.join is a much better way to put a common delimiter between a list of items. It removes the need to check for the last item before adding the delimiter or, as in your case, to strip off the last one added.
Also, you may want to look into list comprehensions.
It looks like you might be trying to use python to write an R script, which can be a quick solution if you don't know how to do it in R. But in this case the R-only solution is actually rather simple:
R> filenames= c("file1","FILE2","file3","fiLe4")
R> suffixes <- paste(".", tolower(filenames), sep="")
R> suffixes
[1] ".file1" ".file2" ".file3" ".file4"
R>
What's going on is that this slice returns an empty string:
filestring=filestring[-1:-4]
That is because the end comes before the start while the step is positive. Try the following on the command line:
>>> a = "hello world"
>>> a[-1:-4]
''
If you really meant to cut out the characters at positions -4 through -2 and keep the last one, you would write
filestring=filestring[:-4]+filestring[-1:]
But I think what you actually wanted was just to drop the last three characters:
filestring=filestring[:-3]
The better solution is to use the join method of strings, as sberry2A suggested.
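To make the slicing behaviour concrete, here is a small sketch. The shortened `filestring` value is made up for the example; it ends in the same three extra characters (`,".`) that the loop in the question leaves behind.

```python
a = "hello world"

# With the default step of +1 a slice runs left to right, so a start
# that lies after the stop selects nothing.
print(repr(a[-1:-4]))   # ''
print(repr(a[-4:-1]))   # 'orl': 4th-from-last up to, not including, the last
print(repr(a[:-3]))     # 'hello wo': drop the last three characters

# Shortened, hypothetical version of the string built in the question
filestring = 'suffixes=c(".file1",".FILE2",".'
fixed = filestring[:-3] + ')'
print(fixed)            # suffixes=c(".file1",".FILE2")
```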

How to split a string by using [] in Python

So from this string:
"name[id]"
I need this:
"id"
I used str.split('[]'), but it didn't work. Does it only take a single delimiter?
Use a regular expression:
import re
s = "name[id]"
re.search(r"\[(.*?)\]", s).group(1) # = 'id'
str.split() takes a string on which to split input. For instance:
"i,split,on commas".split(',') # = ['i', 'split', 'on commas']
The re module also allows you to split by regular expression, which can be very useful, and I think is what you meant to do.
import re
s = "name[id]"
# split by either a '[' or a ']'
re.split(r'\[|\]', s) # = ['name', 'id', '']
Either
"name[id]".split('[')[1][:-1] == "id"
or
"name[id]".split('[')[1].split(']')[0] == "id"
or
re.search(r'\[(.*?)\]',"name[id]").group(1) == "id"
or
re.split(r'[\[\]]',"name[id]")[1] == "id"
Yes, the delimiter is the whole string argument passed to split. So your example would only split a string like 'name[]id[]'.
Try, for example, something like:
'name[id]'.split('[', 1)[-1].split(']', 1)[0]
'name[id]'.split('[', 1)[-1].rstrip(']')
I'm not a fan of regex, but in cases like this it often provides the best solution.
Triptych already recommended this, but I'd like to point out that the ?P<> group assignment can be used to assign a match to a dictionary key:
>>> m = re.match(r'.*\[(?P<id>\w+)\]', 'name[id]')
>>> result_dict = m.groupdict()
>>> result_dict
{'id': 'id'}
>>>
You don't actually need regular expressions for this. The .index() function and string slicing will work fine.
Say we have:
>>> s = 'name[id]'
Then:
>>> s[s.index('[')+1:s.index(']')]
'id'
To me, this is easy to read: "start one character after the [ and finish before the ]".
def between_brackets(text):
    return text.partition('[')[2].partition(']')[0]
This will also work even if your string does not contain a […] construct, and it assumes an implied ] at the end in the case you have only a [ somewhere in the string.
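The edge cases mentioned above can be checked with a few calls. This sketch repeats the function so that it runs on its own; the test strings are made up for illustration.

```python
def between_brackets(text):
    # partition() always returns a 3-tuple, so the indexing never raises
    return text.partition('[')[2].partition(']')[0]

print(between_brackets('name[id]'))     # id
print(between_brackets('name[id'))      # id (implied closing bracket)
print(repr(between_brackets('name')))   # '' (no brackets at all)
```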
I'm new to Python and this is an old question, but maybe this (where s is your string)?
s.split('[')[1].strip(']')
You can get a value out of the list returned by split by indexing it with []. For example, create a list from a URL with split, as below.
>>> urls = 'http://quotes.toscrape.com/page/1/'
This generates a list like the one below.
>>> print( urls.split("/") )
['http:', '', 'quotes.toscrape.com', 'page', '1', '']
What if you want only the "http:" value from this list? You can index it like this:
>>> print(urls.split("/")[0])
http:
Or what if you want only the "1" value from this list? You can index it like this:
>>> print(urls.split("/")[-2])
1
str.split uses the entire argument as the delimiter. Try (where s is your string):
s.split("[")[1].split("]")[0]
