So I've been trying to extract the strings that follow the "dot" character (the bullet, •) in a text file, but only for lines that match the pattern below, where the bullet comes right after the date and time:
09 May 2018 10:37AM • 6PR, Perth (Mornings)
The problem is that the date and time change from line to line, so the only common pattern is that AM or PM comes right before the "dot".
However, if I search for "AM" or "PM" as whole words, those lines are not recognized, because the "AM" and "PM" are attached to the time.
This is my current code:
for i, s in enumerate(open(file)):
    for words in ['PM', 'AM']:
        if re.findall(r'\b' + words + r'\b', s):
            source = s.split('•')[0]
Any idea how to get around this problem? Thank you.
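The regex is the problem here: \b only matches between a word character and a non-word character, and in "10:37AM" both the "7" and the "A" are word characters, so r'\bAM\b' never matches. Match the digits together with AM or PM instead:
for i, s in enumerate(open(file)):
    if re.findall(r'\d{2}[AP]M', s):
        source = s.split('•')[0]
        # 09 May 2018 10:37AM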
If you are trying to extract the datetime, try using a regex.
Example:
import re
s = "09 May 2018 10:37AM • 6PR, Perth (Mornings)"
m = re.search(r"(?P<datetime>\d{2}\s+(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{4}\s+\d{2}:\d{2}(AM|PM))", s)
if m:
    print(m.group("datetime"))
Output:
09 May 2018 10:37AM
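If the goal is the text after the bullet rather than the timestamp before it, a single regex can do both the filtering and the extraction. This is only a minimal sketch: it assumes every line of interest has a time such as 10:37AM directly before the bullet, and the helper name after_bullet and the second sample line are made up for illustration.

```python
import re

# A time such as 10:37AM, then the bullet, then capture the rest of the line.
LINE_RE = re.compile(r'\d{1,2}:\d{2}[AP]M\s*•\s*(.*)')

def after_bullet(line):
    """Return the text after the bullet, or None if no timestamp precedes a bullet."""
    m = LINE_RE.search(line)
    return m.group(1).strip() if m else None

lines = [
    "09 May 2018 10:37AM • 6PR, Perth (Mornings)",
    "Some other line that mentions PM but has no timestamp",
]
sources = [after_bullet(line) for line in lines]
print(sources)
```

Lines without a timestamp come back as None, so they can be filtered out afterwards.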
Related
I have such string
Sale: \t\t\t5 Jan \u2013 10 Jan
I want to extract the start and the end of the sale. A very straightforward approach would be to make several splits, but I want to do that using regular expressions.
As the result I want to get
start = "5 Jan"
end = "10 Jan"
Is it possible to do that using regex?
This should help.
import re
s = "Sale: \t\t\t5 Jan \u2013 10 Jan"
f = re.findall(r"\d+ \w{3}", s)
print(f)
Output:
['5 Jan', '10 Jan']
This may not be the most optimised approach, but it works as long as the string pattern stays the same.
import re
s = 'Sale: \t\t\t5 Jan \u2013 10 Jan'
start, end = re.search(r'Sale:(.*)', s).group(1).strip().split(' \u2013 ')
# start <- 5 Jan
# end <- 10 Jan
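Another option is to capture both dates in one pattern with two groups, so the start and the end come out by position. A minimal sketch, assuming the separator is always the en dash (U+2013) from the sample string; the names EN_DASH and pattern are only for illustration.

```python
import re

s = "Sale: \t\t\t5 Jan \u2013 10 Jan"
EN_DASH = "\u2013"  # the separator used in the sample string

# Group 1 is the start date, group 2 the end date; \s* tolerates spacing around the dash.
pattern = r"(\d+ \w{3})\s*" + EN_DASH + r"\s*(\d+ \w{3})"
m = re.search(pattern, s)
start, end = m.groups() if m else (None, None)
print(start, end)
```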
I have a string like
"Originally Posted on 09 May, 2016. By query 3 j...."
How can I extract the date using Python?
I tried this code:
dStr = "Originally Posted on 09 May, 2016. By query 3 j...."
date_st = re.findall("(\d+\ \w+,)", dStr)
Printing date_st, I got:
['09 May,']
What should I do for the year?
You forgot to add the year after the ','. You only need to append ' \d+\.' to the pattern.
re.findall(r"(\d+\ \w+, \d+\.)", dStr)
You will get this:
['09 May, 2016.']
You were almost there. Just add 4 digits for the year after the comma. It is better to use [a-z]+ instead of \w+ to match the month names, because \w also matches _ and 0-9 in addition to letters.
re.findall(r'\d+\s[a-z]+,\s\d{4}', dStr, re.I)
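To see why the character class matters, compare the two patterns on a string that contains a date-like decoy. This is a sketch with a made-up input string; the names loose and strict are only for illustration.

```python
import re

# Hypothetical input: "12 v2_beta, 2016" looks like a date to \w+ but is not one.
text = "Build 12 v2_beta, 2016 released 09 May, 2016"

loose = re.findall(r'\d+\s\w+,\s\d{4}', text)             # \w+ accepts digits and underscores
strict = re.findall(r'\d+\s[a-z]+,\s\d{4}', text, re.I)   # letters only, case-insensitive

print(loose)
print(strict)
```

The letters-only pattern still accepts any word in the month position, so it narrows the match rather than validating the date.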
I would like to extract the dates from the following text:
Some text some more text October 12 - 2010
The result would be:
yyyy-mm-dd: 2010-10-12
How can I tell the regex that the month is a word ("january", "february", etc.), followed by a single space, a group of 1-2 characters, a space, and finally a group of four digits (\d{4})?
Writing out the actual names of the months in the regex makes for a very readable and maintainable expression, which I feel is important when it comes to regexes. Like so:
(january|february|march|april|may|june|july|august|september|october|november|december)\s\d{1,2}\s\d{4}
Using the regular expression above together with the calendar library to look up month names, you can proceed as follows.
import calendar
import re
# Map month names ("January", ...) to zero-padded month numbers ("01", ...)
month_num = {name: '%02d' % num for num, name in enumerate(calendar.month_name) if name}
apattern = r'(january|february|march|april|may|june|july|august|september|october|november|december)\s(\d{1,2})\s-\s(\d{4})'
def to_iso(m):
    month, day, year = m.groups()
    return 'yyyy-mm-dd: %s-%s-%02d' % (year, month_num[month.capitalize()], int(day))
re.sub(apattern, to_iso, 'October 12 - 2010', flags=re.I)  # 'yyyy-mm-dd: 2010-10-12'
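An alternative is to let the regex only locate the candidate and leave the month lookup to datetime.strptime. A minimal sketch, assuming English month names (strptime's %B depends on the locale) and the exact "Month D - YYYY" layout from the question; the names text and iso are only for illustration.

```python
import re
from datetime import datetime

text = "Some text some more text October 12 - 2010"

# Find "<word> <1-2 digits> - <4 digits>"; strptime then checks that the word is a month.
m = re.search(r"[A-Za-z]+\s\d{1,2}\s-\s\d{4}", text)
iso = datetime.strptime(m.group(0), "%B %d - %Y").strftime("%Y-%m-%d") if m else None
print(iso)
```

If the matched word is not a month name, strptime raises ValueError, so real input may need a try/except around the call.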
I have a string:
string = '"7807161604","Sat Jan 16 00:00:57 +0000 2010","Global focus begins tonight. Pretty interested to hear more about it.","Madison Alabama","al","17428434","81","51","Sun Nov 16 21:46:24 +0000 2008","243"'
I only want the text "Global focus begins tonight. Pretty interested to hear more about it.", which sits between the 2nd and 3rd comma delimiters.
If I use:
i = string.split(',', 2)
s = i[2]
j = s.split(',',-7)
print(j[0])
I get the desired output.
But if there is an extra comma inside the text field, as shown below:
string = '"7807161604","Sat Jan 16 00:00:57 +0000 2010","Global focus begins tonight. Pretty interested, to hear more about it.","Madison Alabama","al","17428434","81","51","Sun Nov 16 21:46:24 +0000 2008","243"'
Then this approach does not work, because the output I need gets split at that comma. Can anyone please help and suggest a different approach, or point out where I'm going wrong? Thanks!
You can use Python's built-in csv module to do this.
import csv
j = next(csv.reader([string]))
Now j is a list of the fields split on ",", and a comma inside a value wrapped in double quotes stays part of that value. See j[2].
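A short runnable check of the csv approach, using a shortened copy of the record from the question (only the first five fields are kept here for brevity; the name fields is only for illustration):

```python
import csv

# Shortened record; the third field contains a comma inside the quotes.
string = ('"7807161604","Sat Jan 16 00:00:57 +0000 2010",'
          '"Global focus begins tonight. Pretty interested, to hear more about it.",'
          '"Madison Alabama","al"')

# csv.reader takes any iterable of lines, so wrap the single record in a list.
fields = next(csv.reader([string]))
print(len(fields))
print(fields[2])
```

The quoted comma does not produce an extra field, which is what the plain str.split approach gets wrong.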
I have the following string, which I collected from the Twitter streaming API:
"""{"created_at":"Mon Mar 11 20:15:36 +0000 2013","id":311208808837951488,"id_str":"311208808837951488","text":"ALIENS ENTRATE E' IMPORTANTE!!! \n\n\n\nMTV's Musical March Madness ritorna il 18 marzo...Siete pronti A http:\/\/t.co\/ABXEfquTJw via #Hopee_dream","source":"\u003ca href=\"http:\/\/twitter.com\/download\/android\" rel=\"nofollow\"\u003eTwitter for Android\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1025970793,"id_str":"1025970793","name":"Tom's Perfection\u2665","screen_name":"_MyGreenEyes_","location":"","url":null,"description":"Angel,don't you cry,i'll meet you on the other side.\u2661","protected":false,"followers_count":387,"friends_count":520,"listed_count":1,"created_at":"Fri Dec 21 08:39:17 +0000 2012","favourites_count":174,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":false,"statuses_count":772,"lang":"it","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/a0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/si0.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/3363059730\/3d791e51eefa800150cd99917abc1d2c_normal.jpeg","profile_image_url_https":"https:\/\/si0.twimg.com\/profile_images\/3363059730\/3d791e51eefa800150cd99917abc1d2c_normal.jpeg","profile_banner_url":"https:\/\/si0.twimg.com\/profile_banners\/1025970793\/1362500832","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":nul
l,"retweet_count":0,"entities":{"hashtags":[],"urls":[{"url":"http:\/\/t.co\/ABXEfquTJw","expanded_url":"http:\/\/tl.gd\/l9f5j7","display_url":"tl.gd\/l9f5j7","indices":[101,123]}],"user_mentions":[]},"favorited":false,"retweeted":false,"possibly_sensitive":false,"filter_level":"medium"}"""
I am doing the following operations:
import json
json_string = json_string.strip()
jsn_dict = json.loads(json_string)
print(jsn_dict["text"])
gives:
ALIENS ENTRATE E' IMPORTANTE!!!
instead of:
"ALIENS ENTRATE E' IMPORTANTE!!! \n\n\n\nMTV's Musical March Madness ritorna il 18 marzo...Siete pronti A http:\/\/t.co\/ABXEfquTJw via #Hopee_dream"
It looks to me like the newline characters are causing problems when parsing this string into a Python dictionary.
But I am calling json_string.strip(), which I thought would remove such characters from my string.
What am I doing wrong?
The str.strip() method only removes whitespace characters at the beginning and the end of a string. Not anywhere in the middle.
To remove all newlines from a string, you could do:
"some\n\n\nstring".replace("\n", "")
or, on Python 2 only (in Python 3, str.translate() takes a single translation table):
"some\n\n\nstring".translate(None, "\n")
The first one is probably easier to read and understand.
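It may also help to check how the JSON text reached Python in the first place. The sketch below is an illustration, not a diagnosis of the question's data: it uses a shortened, made-up payload and shows that json.loads decodes the two-character \n escape on its own, while a payload pasted into a non-raw string literal ends up with real newline characters, which the default strict parser rejects.

```python
import json

# Raw string: backslash-n stays as two characters, as it would in text received from an API.
raw = r'{"text": "ALIENS ENTRATE\n\nMTV via #tag"}'
decoded = json.loads(raw)["text"]
print(repr(decoded))

# Non-raw literal: Python turns \n into real newlines inside the JSON string value.
cooked = '{"text": "ALIENS ENTRATE\n\nMTV via #tag"}'
try:
    json.loads(cooked)
    strict_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    strict_ok = False

# strict=False allows control characters such as newlines inside strings.
lenient = json.loads(cooked, strict=False)["text"]
print(strict_ok, lenient == decoded)
```

In both successful cases the full text comes back with its newlines intact, so nothing needs to be stripped before parsing.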
I ran into a similar problem with text being truncated after the \n symbol.
In my case, the issue was because I had accidentally prepended the # symbol to the key I was using to pull the value out of the dict:
print(jsn_dict['#text'])