I have a list of this format:
["['05-Aug-13 10:17', '05-Aug-13 10:17', '05-Aug-13 15:17']"]
I am using:
for d in date:
    print d
This produces:
['05-Aug-13 10:17', '05-Aug-13 10:17', '05-Aug-13 15:17']
I then try and add this to a defaultdict, so underneath the print d I write:
myDict[text].append(date)
Later, I try and iterate through this by using:
for text, date in myDict.iteritems():
    for d in date:
        print d, '\n'
But this doesn't work; it just produces the format shown in the first line of code in this question. The format suggests a list in a list, so I tried using:
for d in date:
    for e in d:
        myDict[text].append(e)
But this included every character of the line, as opposed to each separate date. What am I doing wrong? How can I have a defaultdict with this format
text : ['05-Aug-13 10:17', '06-Aug-13 11:17']
whose values can all be reached?
Your list contains only one element: the string representation of another list. You will have to parse it into an actual list before treating it like one:
import ast
actual_list = ast.literal_eval(your_list[0])
As an alternative (though the regular expression might need tuning for your use):
import re
pattern = r"\d{2}-[A-Z][a-z]{2}-\d{1,2} \d{2}:\d{2}"
re.findall(pattern, your_list[0])
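To tie this back to the defaultdict in the question, here is a minimal sketch that parses the string first and then uses extend rather than append, so that each date becomes its own entry. The key "some text" is a placeholder of mine, and print is called as a function so the snippet also runs on Python 3.

```python
import ast
from collections import defaultdict

# The list from the question: one element, which is the string form of a list.
date = ["['05-Aug-13 10:17', '05-Aug-13 10:17', '05-Aug-13 15:17']"]
text = "some text"  # placeholder key, not from the original post

my_dict = defaultdict(list)
for d in date:
    # literal_eval turns the string into a real list of date strings
    parsed = ast.literal_eval(d)
    # extend adds each date individually; append would add the whole list
    my_dict[text].extend(parsed)

for key, dates in my_dict.items():
    for single_date in dates:
        print(key, single_date)
```

With append, the value would be a list containing one list; with extend, every date can be reached directly in the inner loop.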
How can I copy data from a changing string?
I tried to slice, but the length of the slice changes.
For example, in one case I should copy the number 128 from the string '"edge_liked_by":{"count":128}', in another I should copy 15332 from '"edge_liked_by":{"count":15332}'.
You could use a regular expression:
import re
string = '"edge_liked_by":{"count":15332}'
number = re.search(r'{"count":(\d*)}', string).group(1)
Really depends on the situation, however I find regular expressions to be useful.
To grab the numbers from the string without caring about their location, you would do as follows:
import re
def get_string(string):
    return re.search(r'\d+', string).group(0)
>>> get_string('"edge_liked_by":{"count":128}')
'128'
To get numbers only from the end of the string, you can use an anchor to ensure the result is pulled from the far end. The following example grabs any unbroken sequence of digits that is preceded by a colon and ends within 5 characters of the end of the string:
import re
def get_string(string):
    rval = None
    string_match = re.search(r':(\d+).{0,5}$', string)
    if string_match:
        rval = string_match.group(1)
    return rval
>>> get_string('"edge_liked_by":{"count":128}')
'128'
>>> get_string('"edge_liked_by":{"1321":1}')
'1'
In the above example, adding the colon will ensure that we only pick values and don't match keys such as the "1321" that I added in as a test.
If you just want anything after the last colon, but excluding the bracket, try combining split with slicing:
>>> '"edge_liked_by":{"count":128}'.split(':')[-1][0:-1]
'128'
Finally, considering this looks like a JSON object, you can add curly brackets to the string and treat it as such. Then it becomes a nested dict you can query:
>>> import json
>>> string = '"edge_liked_by":{"count":128}'
>>> string = '{' + string + '}'
>>> string = json.loads(string)
>>> string.get('edge_liked_by').get('count')
128
The regex and split approaches return a string, while the final one returns a number, because json.loads parses the value as an integer.
It looks like the type of string you are working with is read from JSON, maybe you are getting it as the output of some API you are working with?
If it is JSON, you've probably gone one step too far in atomizing it to a string like this. I'd work with the original output, if possible, if I were you.
If not, I'd make it valid JSON by wrapping it in {} and then parse it with json.loads.
import json
string = '"edge_liked_by":{"count":15332}'
string = "{"+string+"}"
json_obj = json.loads(string)
count = json_obj['edge_liked_by']['count']
count will have the desired output. I prefer this option to using regular expressions because you can rely on the structure of the data and reuse the code in case you wish to parse out other attributes, in a very intuitive way. With regular expressions, the code you use will change if the data are decimal, or negative, or contain non-numeric characters.
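To illustrate the reuse point, the same steps can be wrapped in a small helper. This is only a sketch: the function name get_count and its key parameter are my own, and it assumes the fragment always has the shape shown in the question.

```python
import json

def get_count(fragment, key="edge_liked_by"):
    """Return the count from a fragment like '"edge_liked_by":{"count":128}'."""
    # Wrap the fragment in braces so it becomes a complete JSON object.
    obj = json.loads("{" + fragment + "}")
    # json.loads has already converted the number to an int.
    return obj[key]["count"]

print(get_count('"edge_liked_by":{"count":128}'))
print(get_count('"edge_liked_by":{"count":15332}'))
```

Because the lookup goes through the parsed structure, the length of the number no longer matters, and a different attribute can be read by passing another key.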
Does this help?
a='"edge_liked_by":{"count":128}'
import re
b=re.findall(r'\d+', a)[0]
b
Out[16]: '128'
I'm just practicing basic web scraping using Python and Regex
I want to write a function that takes a string as input and returns a dictionary where each key is a date string like '2017-01-23' (without the quotes, though) and each corresponding value is the approval rating, stored as a float.
Here is what the input object (data) looks like:
As you can see, each record (one per day) is enclosed in {}, and each key:value pair is followed by ','
{"date":"2017-01-23","future":false,"subgroup":"All polls","approve_estimate":"45.46693",
"approve_hi":"50.88971","approve_lo":"40.04416","disapprove_estimate":"41.26452",
"disapprove_hi":"46.68729","disapprove_lo":"35.84175"},
{"date":"2017-01-24","future":false,"subgroup":"All polls"
...................
Here's a regex pattern for the dates:
date_pattern = r'\d{4}-\d{2}-\d{2}'
Using this,
date_pattern = r'\d{4}-\d{2}-\d{2}'
date_matcher = re.compile(date_pattern)
date_matches = date_matcher.findall(long_string) #list of all dates in string
But for the actual approval rating value, this wouldn't work because I'm not looking for a match, but the number that comes after this, which is 45.46693 in this example.
approve_pattern = r'approve_estimate\":'
#float(re.sub('[aZ]','',re.sub('["]','',re.split(approve_pattern, data) [1])))
The problem with approve_pattern is that I can only fetch one value at a time. So how can I do this for the entire data and store the approval rating values as floats?
Also, I want to keep only the records for which "future":false, discarding the predicted values marked "future":true.
Please assume all encountered dates have valid approval estimates.
Here's the desired output
date_matches=['2018-01-01','2018-01-02','2018-01-03'] # "future":true filtered out
approve_matches=[47.1,47.2,47.9]
final_dict = {k:v for k,v in zip(date_matches,approve_matches)}
final_dict #Desired Output {'2018-01-01': 47.1, '2018-01-02': 47.2, '2018-01-03': 47.9}
Your data looks very much like JSON, except that it must be enclosed in brackets to form an array. You should use a JSON parser (e.g., json.loads) to read it.
Let's say s is your original string. Then the following expression results in your dictionary:
import json
final_dict = {record['date']: float(record['approve_estimate'])
              for record in json.loads("[" + s + "]")
              if not record['future']}
# Keeps only the records with "future":false, as in your desired output
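Here is a self-contained sketch of that approach, following the desired output in the question (records with "future":true are dropped and the estimates are converted to floats). The sample string is shortened, and the second and third records are made up for illustration.

```python
import json

# Shortened sample in the same shape as the question's data; only the first
# record's values come from the question, the others are illustrative.
s = ('{"date":"2017-01-23","future":false,"approve_estimate":"45.46693"},'
     '{"date":"2017-01-24","future":false,"approve_estimate":"45.5"},'
     '{"date":"2017-01-25","future":true,"approve_estimate":"46.0"},')

# Drop a trailing comma if there is one, then wrap in brackets so the
# records form a JSON array.
records = json.loads("[" + s.strip().rstrip(",") + "]")

final_dict = {
    record["date"]: float(record["approve_estimate"])  # string -> float
    for record in records
    if not record["future"]  # keep observed values, drop predictions
}
print(final_dict)
```

Note that JSON false and true become Python False and True, so the filter can test the value directly.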
I am trying to parse a substring using re.
From the string in variable s, I would like to take everything up to the first ! (the string stored in s has two !) and store it as a substring. From this substring (stored in variable result), I wish to parse another substring.
Here is the code,
import re
s='ecNumber*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#!ecNumber*2.3.1.11#kmValue*0.081#kmValueMaximum*#!'
Data={}
result = re.search('%s(.*)%s' % ('ec', '!'), s).group(1)
print result
ecNumber = re.search('%s(.*)%s' % ('Number*', '#kmValue*'), result).group(1)
Data["ecNumber"]=ecNumber
print Data
The value corresponding to each tag in the substring (example: ecNumber) is stored between * and # (example: *2.4.1.11#). I attempted to parse the value stored for the tag ecNumber in the first substring.
The output I obtain is
result='Number*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#!ecNumber*2.3.1.11#kmValue*0.081#kmValueMaximum*#'
{'ecNumber': '*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#!ecNumber*2.3.1.11#kmValue*0.081'}
The desired output is
result= 'ecNumber*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#'
{'ecNumber': '2.4.1.11'}
I would like to store each tag and its corresponding value. For example,
{'ecNumber': '2.4.1.11','kmValue':'0.021','kmValueMaximum':'1.25'}
Although you asked for a solution using regular expressions, I would say it's much easier to use direct string operations for this problem, since the source string is well formatted.
For the information before the first !:
print dict([i.split('*') for i in s.split('!', 1)[0].split('#') if i])
For all information in s:
print [dict([i.split('*') for i in j.split('#') if i]) for j in s.split('!') if j]
You can try this:
import re
s='ecNumber*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#'
new_data = re.findall('(?<=^)[a-zA-Z]+(?=\*)|(?<=#)[a-zA-Z]+(?=\*)|(?<=\*)[-\d\.]+(?=#)', s)
final_data = dict([new_data[i:i+2] for i in range(0, len(new_data)-1, 2)])
Output:
{'kmValue': '0.57', 'kmValueMaximum': '1.25', 'ecNumber': '2.4.1.11'}
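If you do want to stay with re, the two things that went wrong in the question's code are that (.*) is greedy, so it runs to the last ! rather than the first, and that * in 'Number*' and '#kmValue*' is a quantifier instead of a literal asterisk. A minimal sketch that addresses both (the variable names pairs and data are mine):

```python
import re

s = ('ecNumber*2.4.1.11#kmValue*0.57#kmValueMaximum*1.25#!'
     'ecNumber*2.3.1.11#kmValue*0.081#kmValueMaximum*#!')

# Non-greedy (.*?) stops at the first '!' instead of the last one.
result = re.search(r'(.*?)!', s).group(1)

# re.escape makes '*' a literal character; [^#]* captures up to the next '#'.
pairs = re.findall(r'([A-Za-z]+)' + re.escape('*') + r'([^#]*)#', result)
data = dict(pairs)

print(result)
print(data)
```

Since [^#]* also matches an empty string, a tag without a value (like kmValueMaximum in the second record) would be kept with '' as its value.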
I want to remove special character '-' from date format in python. I have retrieved maximum date from a database column.
Here is my small code:
def max_date():
    Max_Date= hive_select('SELECT MAX (t_date) FROM ovisto.ovisto_log')
    value = Max_Date[0]
    print value
Here is output:
{0: '2017-02-21', '_c0': '2017-02-21'}
I want only numbers without special character '-' from output.
so, I am expecting this answer '20170221'
I have tried in different ways but could not get proper answer.
How can I get in a simple way? Thanks for your time.
Just rebuild a new dictionary using a dict comprehension, iterating over the original dictionary and stripping the unwanted characters from the values with str.replace:
d = {0: '2017-02-21', '_c0': '2017-02-21'}
new_d = {k:v.replace("-","") for k,v in d.items()}
print(new_d)
result:
{0: '20170221', '_c0': '20170221'}
If you only want to keep the values and drop the duplicates (and the order too :), use a set comprehension over the values instead:
s = {v.replace("-","") for _,v in d.items()}
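You can try strptime. Note that Max_Date[0] is a dictionary, so pick the date string out of it first, and that the input format has to match the dashes in the value:
import datetime
value = Max_Date[0]['_c0']
new_val = datetime.datetime.strptime(value, '%Y-%m-%d').strftime('%Y%m%d')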
I found it here: How to convert a date string to different format
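For comparison, here is a small sketch of both routes on the row printed in the question. The name row stands in for Max_Date[0]; hive_select itself is not called here, so the dictionary is copied by hand from the question's output.

```python
from datetime import datetime

# The row printed in the question, standing in for Max_Date[0].
row = {0: '2017-02-21', '_c0': '2017-02-21'}

raw = row['_c0']

# Plain string approach: just drop the dashes.
compact = raw.replace('-', '')

# Parsing approach: strptime raises ValueError if raw is not a real date
# in YYYY-MM-DD form, so this also validates the value.
validated = datetime.strptime(raw, '%Y-%m-%d').strftime('%Y%m%d')

print(compact, validated)
```

Both give the same string here; the strptime route is mainly worth it if you want a bad value to fail loudly.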
I have following string
adId:4028cb901dd9720a011e1160afbc01a3;siteId:8a8ee4f720e6beb70120e6d8e08b0002;userId:5082a05c-015e-4266-9874-5dc6262da3e0
I need only the values of adId, siteId and userId,
that is:
4028cb901dd9720a011e1160afbc01a3
8a8ee4f720e6beb70120e6d8e08b0002
5082a05c-015e-4266-9874-5dc6262da3e0
All three should be in separate variables or in an array so that I can use each of them.
You can split them to a dictionary if you don't need any fancy parsing:
In [2]: dict(kvpair.split(':') for kvpair in s.split(';'))
Out[2]:
{'adId': '4028cb901dd9720a011e1160afbc01a3',
'siteId': '8a8ee4f720e6beb70120e6d8e08b0002',
'userId': '5082a05c-015e-4266-9874-5dc6262da3e0'}
You could do something like this:
input='adId:4028cb901dd9720a011e1160afbc01a3;siteId:8a8ee4f720e6beb70120e6d8e08b0002;userId:5082a05c-015e-4266-9874-5dc6262da3e0'
result={}
for pair in input.split(';'):
    (key,value) = pair.split(':')
    result[key] = value
print result['adId']
print result['siteId']
print result['userId']
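import re
matches = re.findall(r"([a-z0-9A-Z_]+):([a-zA-Z0-9\-]+)(?:;|$)", buf)
for m in matches:
    # m[0] is the key (adId and so on)
    # m[1] is the long string.
    print(m[1])
You can also limit the length of the value using {32}, like
[a-zA-Z0-9]{32}
although that only fits adId and siteId here, since userId is longer and contains hyphens.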
Regular expressions allow you to validate the string and split it into component parts.
Python has a handy method called split() that will work nicely for you. I would suggest using it twice: once on ';', then again on ':' for each of the resulting pieces.
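A minimal sketch of the two-step split, unpacking the three values into separate variables (the names ad_id, site_id and user_id are mine, and it assumes the string always holds exactly these three pairs in this order):

```python
s = ('adId:4028cb901dd9720a011e1160afbc01a3;'
     'siteId:8a8ee4f720e6beb70120e6d8e08b0002;'
     'userId:5082a05c-015e-4266-9874-5dc6262da3e0')

values = []
for pair in s.split(';'):
    # maxsplit=1 keeps any later ':' inside the value intact
    key, value = pair.split(':', 1)
    values.append(value)

# Unpacking fails with ValueError if there are not exactly three pairs.
ad_id, site_id, user_id = values
print(ad_id, site_id, user_id)
```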