I'm trying to find and extract the date and time in a column that contains text sentences. Example data is below.
df = {'Id': ['001', '002',...],
'Description': ['
THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. THE INTERUPTION ALSO INVOLVED A, B, C AND SOME OTHER TOWN AREAS. OTC AND SST SERVICES INTERRUPTED AS GENSET ALSO WORKING AT THAT TIME. WE CALL FOR SERVICE. THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM', 'today is 23/3/2013 #10:AM we have',...],
....
}
df = pd.DataFrame (df, columns = ['Id','Description'])
I have tried the datefinder library as shown below, but it returns today's date, which is wrong.
findDate = dtf.find_dates(le['Description'][0])
for dates in findDate:
    print(dates)
Does anyone know the best way to extract it and automatically put it into a new column? Or does anyone know of a library that can calculate the duration between times and dates in a text string? Thank you.
So you have two issues here.
You want to know how to apply a function on a DataFrame.
You want a function to extract a pattern from a bunch of text.
Here is how to apply a function on a Series (if you select only one column, as I did, you get a Series). Bonus points: read the DataFrame.apply() and Series.apply() documentation (30s) to become a Pandas-chad!
def do_something(x):
    # Placeholder: transform one cell value and return the result.
    return x
df['new_text_column'] = df['original_text_column'].apply(do_something)
And here is one way to extract patterns from a string using regexes. Read the regex doc (or follow a course) and play around with RegExr to become an omniscient god (that is, if you use a command line on Linux, along with your regex knowledge).
Modified from: How to extract the substring between two markers?
import re
text = 'gfgfdAAA1234ZZZuijjk'
# Search for the first run of digits.
m = re.search(r'\d+', text)
if m:
    found = m.group(0)
    # found: '1234'
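Putting the two pieces together for the question's data, a minimal sketch could look like the following. The patterns and the helper names find_dates and find_times are assumptions for illustration, tuned only to the sample sentences in the question (dates such as 27.1.2020 or 23/3/2013, times such as 9.6AM or 10:AM); real data will likely need broader patterns.

```python
import re

# Assumed date format: day, month, four-digit year, separated by '.', '/' or '-'.
DATE_RE = re.compile(r'\b\d{1,2}[./-]\d{1,2}[./-]\d{4}\b')
# Assumed time format: hour, optional '.' or ':' plus minutes, then AM/PM.
TIME_RE = re.compile(r'\b\d{1,2}(?:[.:]\d{0,2})?\s?[AP]M\b', re.IGNORECASE)

def find_dates(text):
    """Return every date-like substring found in text."""
    return DATE_RE.findall(text)

def find_times(text):
    """Return every time-like substring found in text."""
    return TIME_RE.findall(text)

sample = ('THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. '
          'THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM')
print(find_dates(sample))
print(find_times(sample))
```

With a DataFrame, the helpers would be applied per row, for example df['dates'] = df['Description'].apply(find_dates).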
I am attempting to use REGEX to extract connection strings from blocks of text in a pandas dataframe.
My REGEX works on REGEX101.com (see Screenshot below). Link to my saved test here: https://regex101.com/r/ILnpS0/1
When I try to run the REGEX in a Pandas dataframe, I don’t get any REGEX matches/extracts (and no error), despite getting matches on REGEX101. Link to my code in a Google Colab notebook: https://colab.research.google.com/drive/1WAMlGkHAOqe38Lzo_K0KHwD_ynVJyIq1?usp=sharing
Therefore the issue appears to be how pandas is interpreting my REGEX.
Can anyone identify why I am not getting any REGEX matches using pandas?
REGEX Logic
My REGEX consists of 3 groups:
(?=Source = DB2.Database)(.*?)(?=\]\))
Group 1: (?=Source = DB2.Database) is a lookahead that checks for the text “Source = DB2.Database”, i.e. the start of my connection string.
Group 2: (.*?) lazily matches any characters and acts as a span between the 1st and 3rd groups.
Group 3: (?=\]\)) is a lookahead assertion that aims to identify the end of the connection string.
Additional tests:
When I run a simplified version of the REGEX (DB2.Database) I get the match, as expected. This example is also in the notebook linked above.
My code (same as in linked Colab Notebook)
import pandas as pd
myDF = pd.DataFrame({'conn_str':['''{'expression': 'let\n Source = Snowflake.Databases("whitehouse.australia-east.azure.snowflakecomputing.com","USER"),\n WH_DW_Database = Source{[Name="WHOUSE_DW",Kind="Database"]}[Data],\n DWH_Schema = SPARK_DW_Database{[Name="DWH",Kind="Schema"]}[Data],\n D_ACCOUNT_CURR_View = DWH_Schema{[Name="D_ACCOUNT_CURR",Kind="View"]}[Data],\n #"Filtered Rows" = Table.SelectRows(D_ACCOUNT_CURR_View, each ([PAYMENT_TYPE] = "POSTPAID") and ([ACCOUNT_SEGMENT] <> "Consumer") ),\n #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"DESCRIPTION", "ACCOUNT_NUMBER"})\nin\n #"Removed Other Columns"'}''','''{'expression': 'let\n Source = DB2.Database("69.699.69.69", "WHUDB", [HierarchicalNavigation=true, Implementation="Microsoft", Query="SELECT\n base.HEAD_PARTY_NO,\n base.HEAD_PARTY_NAME,\n usg.BILL_MONTH,\n base.CUSTOMER_NUMBER,\n base.ACCOUNT_NUMBER,\n base.CHARGE_ARRANGEMENT_NUMBER,\n usg.DATA_MB,\n usg.DATA_MB/1024 as Data_GB,\n base.PRODUCT_DESCRIPTION,\nbase.LINE_DESCRIPTION\n\nFROM PRODUCT.MOBILE_ACTIVE_BASE base\nLEFT JOIN PRODUCT.MOBILE_USAGE_SUMMARY usg\n\nON\n base.CHARGE_ARRANGEMENT_NUMBER = usg.CHARGE_ARRANGEMENT_NUMBER\n\nand \nbase.CHARGE_ARRANGEMENT_ID = usg.CHARGE_ARRANGEMENT_ID\n\nWHERE base.PRODUCT_DESCRIPTION LIKE \'%Share%\' \n--AND (base.HEAD_PARTY_NO = 71474425 or base.HEAD_PARTY_NO = 73314303)\nAND usg.BILL_MONTH BETWEEN (current_date - 5 MONTHS) and CURRENT_DATE \nOrder by base.ACCOUNT_NUMBER,Data_MB desc with ur"]),\n #"Added Custom1" = Table.AddColumn(Source, "Line Number", each Text.Middle([CHARGE_ARRANGEMENT_NUMBER],1,14)),\n #"Renamed Columns" = Table.RenameColumns(#"Added Custom1",{{"LINE_DESCRIPTION", "Line Description"}, {"BILL_MONTH", "Bill Month"}}),\n #"Filtered Rows" = Table.SelectRows(#"Renamed Columns", each ([PRODUCT_DESCRIPTION] <> "Sharer Unlimited NZ & Aus mins + Unlimited NZ & Aus texts" and [PRODUCT_DESCRIPTION] <> "Sharer with Data Stretch"))\nin\n #"Filtered Rows"'}''']})
myDF
#why isn't this working?
#this regex works on REGEX 101 : https://regex101.com/r/ILnpS0/1
regex_db =r'(?=Source = DB2.Database)(.*?)(?=\]\))'
myDF['SQLDB connection2'] = myDF['conn_str'].str.extract(regex_db ,expand=True)
myDF
#This is a simplified version of the above REGEX, and works to extracts the text "DB2.Database"
#This works fine
regex_db2 =r'(DB2.Database)'
myDF['SQLDB connection1'] = myDF['conn_str'].str.extract(regex_db2 ,expand=True)
myDF
Any suggestions on what I am doing wrong?
Try running your regex in dot all mode, so that .* will match across newlines:
import re
regex_db = r'(?=Source = DB2.Database)(.*?)(?=\]\))'
myDF["SQLDB connection2"] = myDF["conn_str"].str.extract(regex_db, expand=True, flags=re.S)
myDF
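To see why the flag matters, here is a small standalone sketch using only the re module. The shortened sample string is made up for illustration; the same flags value can be passed to Series.str.extract.

```python
import re

# Made-up, shortened stand-in for the multi-line connection string.
sample = 'let\n Source = DB2.Database("host", "DB", [Query="SELECT\n a\nFROM t"]),\n next'
regex_db = r'(?=Source = DB2.Database)(.*?)(?=\]\))'

# Without DOTALL, '.' stops at the first newline, so the end marker is never reached.
print(re.search(regex_db, sample))

# With DOTALL (re.S), '.' also matches newlines and the whole span is captured.
m = re.search(regex_db, sample, flags=re.S)
print(m.group(1))
```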
In the following string
SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\nCANDICE - https://www.lovebilly.com\n\nfilmed this video in 4k on this -- http://amzn.to/2sTDnRZ\nwith this lens -- http://amzn.to/2rUJOmD\nbig drone - http://amzn.to/2o3GLX5\nSony CAMERA http://amzn.to/2nOBmnv\nOLD CAMERA; http://amzn.to/2o2cQBT\nMAIN LENS; http://amzn.to/2od5gBJ\nBIG SONY CAMERA; http://amzn.to/2nrdJRO\nBIG Canon CAMERA; on http://instagram.com/caseyneistat\non https://www.facebook.com/cneistat\non https://twitter.com/CaseyNeistat\n\namazing intro song by https://soundcloud.com/discoteeth\n\nad disclosure. THIS IS NOT AN AD. not selling or promoting anything. but samsung did produce the Shantell Video as a 'GALAXY PROJECT' which is an initiative that enables creators like Shantell and me to make projects we might otherwise not have the opportunity to make. hope that's clear. if not ask in the comments and i'll answer any specifics.
I am trying to remove any \n. This string is accessed from a pandas df. The solution I have tried is:
i = str(i).replace("\n", "")
The original code looks like:
for i in data["description"]:
    print(i)
    i = str(i).replace("\n", "")
    i = str(i).split(" ")
    for x in i:
        x = x.replace("\n", "")
        print(x)
where data is the df that stores all of the data from the csv file, and description is the column where the string is taken out of.
I suspect that the failure of replace() to work is due to the string being from a df, as when I try it with just a regular string
x = "a \n\n string"
.replace() works just fine. Any reason why taking strings from a df causes replace to fail? Thanks.
pandas keeps the string methods of a Series (a DataFrame column) a bit hidden behind the .str attribute. Something like df["column_name"].str.replace("\n", "") should work; it returns a new Series, so assign the result back. I'd recommend the pandas documentation below to learn more.
https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#string-methods
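This should work:
df["description"] = df["description"].str.replace("\n", "")
Or you could use either of the following if you want to do this for the entire df (regex=True is needed so that the substring is replaced, rather than only cells that are exactly equal to "\n"):
df = df.replace("\n", "", regex=True)
df.replace("\n", "", regex=True, inplace=True)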
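Two things are worth checking in the question's loop, sketched here with plain Python and made-up sample values. First, str.replace returns a new string, so reassigning the loop variable never changes the original container. Second, and this is an assumption about the data rather than something the question confirms: if the CSV text contains a literal backslash followed by n (two characters) instead of a real newline, the pattern has to be "\\n".

```python
# Made-up sample values standing in for the "description" column.
descriptions = ["line one\nline two", "literal\\nbackslash-n"]

# Reassigning the loop variable leaves the list itself untouched.
for d in descriptions:
    d = d.replace("\n", "")
unchanged = descriptions[0]

# Build a new list instead, handling both real newlines and literal "\n" text.
cleaned = [d.replace("\n", "").replace("\\n", "") for d in descriptions]
print(cleaned)
```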
It's my first time with regex and I have some issues, which I hope you can help me resolve. Here is an example of the data:
chartData.push({
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
});
var newDate = new Date();
newDate.setFullYear(
2007,
10,
1 );
What I want to retrieve is the date, which is in the last pair of brackets, and the corresponding description. I have no idea how to do it with one regex, so I decided to split it into two.
First part:
I retrieve the value after description:. This was managed with the following pattern: [\n\r].*description:\s*([^\n\r]*) The output gives me the result with quotes, "9710", but that is fine and no changes are required.
Second part:
Here it gets tricky. I want to retrieve the values in brackets after the text newDate.setFullYear. Unfortunately, all I have managed so far is to get the values inside any brackets. For that, I used the following pattern: \(([^)]*)\) The result is that it picks up all 3 pairs of brackets in the example:
"{
date: newDate,
visits: 9710,
color: "#016b92",
description: "9710"
}",
"()",
"2007,
10,
1 "
What I am missing is an AND operator for REGEX, which would allow me to construct a pattern that retrieves the data in brackets after the specific text.
I could, of course, pick every 3rd result, but unfortunately that doesn't work for the whole dataset.
Does anyone know how to resolve the issue in the second part?
Thanks in advance.
You can use the following expression:
res = re.search(r'description: "([^"]+)".*newDate.setFullYear\((.*)\);', text, re.DOTALL)
This will return a regex match object with two groups, that you can fetch using:
res.groups()
The result is then:
('9710', '\n2007,\n10,\n1 ')
You can of course parse these groups in any way you want. For example:
date = res.groups()[1]
[s.strip() for s in date.split(",")]
==>
['2007', '10', '1']
import re
test = r"""
chartData.push({
date: 'newDate',
visits: 9710,
color: "#016b92",
description: "9710"
})
var newDate = new Date()
newDate.setFullYear(
2007,
10,
1);"""
m = re.search(r".*newDate\.setFullYear(\(\n.*\n.*\n.*\));", test, re.DOTALL)
print(m.group(1).rstrip("\n").replace("\n", "").replace(" ", ""))
The result:
(2007,10,1)
The AND part that you are referring to is not really an operator. The pattern matches characters from left to right, so after capturing the value in group 1 you could match everything that comes before the values you want to capture in group 2.
What you could do, is repeat matching all following lines that do not start with newDate.setFullYear(
Then when you do encounter that value, match it and capture in group 2 matching all chars except parenthesis.
\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);
Example code
import re
regex = r"\r?\ndescription: \"([^\"]+)\"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(([^()]+)\);"
test_str = ("chartData.push({\n"
"date: newDate,\n"
"visits: 9710,\n"
"color: \"#016b92\",\n"
"description: \"9710\"\n"
"});\n"
"var newDate = new Date();\n"
"newDate.setFullYear(\n"
"2007,\n"
"10,\n"
"1 );")
print (re.findall(regex, test_str))
Output
[('9710', '\n2007,\n10,\n1 ')]
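If the individual numbers are needed rather than the raw group, one possible follow-up using only the standard re module is to run a second, simple pattern over group 2. This is a sketch built on the expression above; the names description, raw_args, and parts are illustrative.

```python
import re

regex = (r'\r?\ndescription: "([^"]+)"'
         r'(?:\r?\n(?!newDate\.setFullYear\().*)*'
         r'\r?\nnewDate\.setFullYear\(([^()]+)\);')

test_str = ('chartData.push({\n'
            'date: newDate,\n'
            'visits: 9710,\n'
            'color: "#016b92",\n'
            'description: "9710"\n'
            '});\n'
            'var newDate = new Date();\n'
            'newDate.setFullYear(\n'
            '2007,\n'
            '10,\n'
            '1 );')

for description, raw_args in re.findall(regex, test_str):
    # Pull the digit runs out of the captured argument list.
    parts = [int(n) for n in re.findall(r'\d+', raw_args)]
    print(description, parts)
```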
There is another option to get group 1 and the separate digits in group 2 using the regex module from PyPI:
(?:\r?\ndescription: "([^"]+)"(?:\r?\n(?!newDate\.setFullYear\().*)*\r?\nnewDate\.setFullYear\(|\G)\r?\n(\d+),?(?=[^()]*\);)
I am trying to extract elements using a regex, while needing to also distinguish which lines have "-External" at the end. The naming structure I am working with is:
<ServerName>: <Country>-<CountryCode>
or
<ServerName>: <Country>-<CountryCode>-External
For example:
test1 = 'Neo1: Brussels-BRU-External'
test2 = 'Neo1: Brussels-BRU'
match = re.search(r'(?<=: ).+', test1)
print(match.group(0))
This gives me everything after the colon (for example "Brussels-BRU" for test2). I am trying to extract "Brussels" and "BRU" separately, while not caring about anything to the left of the :.
After, I need to know when a line has "-External". Is there a way I can treat the existence of "-External" as True and without as None?
I suggest that regexes are not needed here, and that a simple split or two can get you what you are after. Here is a way to split() the line into pieces from which you can then select what you are interested in:
Code:
def split_it(a_string):
    on_colon = a_string.split(':')
    return on_colon[0], on_colon[1].strip().split('-')
Test Code:
tests = (
    'Neo1: Brussels-BRU-External',
    'Neo1: Brussels-BRU',
)
for test in tests:
    print(split_it(test))
Results:
('Neo1', ['Brussels', 'BRU', 'External'])
('Neo1', ['Brussels', 'BRU'])
Analysis:
The length of the list can be used to determine if the additional field 'External' is present.
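Building on that idea, a small sketch that returns the country, the code, and a flag could look like this. The function name parse_line is made up for illustration, the flag is True or None as the question asked, and the sketch assumes every line has the <ServerName>: <Country>-<CountryCode> shape.

```python
def parse_line(a_string):
    # Drop everything up to and including the first colon.
    _, _, rest = a_string.partition(':')
    parts = rest.strip().split('-')
    country, code = parts[0], parts[1]
    # A single trailing field equal to 'External' marks an external server.
    external = True if parts[2:] == ['External'] else None
    return country, code, external

for test in ('Neo1: Brussels-BRU-External', 'Neo1: Brussels-BRU'):
    print(parse_line(test))
```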
In Python, I need logic for the scenario below; I am using the split function for this.
I have strings that contain input as shown below.
"ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988."
"ID909900000 25-01-1986 hello 10 minutes."
And the output should be as shown below, with each date replaced by "date" and each time replaced by "time".
"ID674021384 date hello hi thanks time date."
"ID909900000 date hello time."
I also need a count of dates and times for each ID, as shown below:
ID674021384 DATE:2 TIME:1
ID909900000 DATE:1 TIME:1
>>> import re
>>> from collections import defaultdict
>>> lines = ["ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988.", "ID909900000 25-01-1986 hello 10 minutes."]
>>> pattern = r'(?P<date>\d{1,2}[/-]\d{1,2}[/-]\d{4})|(?P<time>\d+ minutes)'
>>> num_occurences = {line: defaultdict(int) for line in lines}
>>> def repl(matchobj):
...     # Count the match under its group name and use that name as the replacement.
...     num_occurences[matchobj.string][matchobj.lastgroup] += 1
...     return matchobj.lastgroup
...
>>> for line in lines:
...     text_id = line.split(' ')[0]
...     new_text = re.sub(pattern, repl, line)
...     print(new_text)
...     print('{0} DATE:{1[date]} Time:{1[time]}'.format(text_id, num_occurences[line]))
...     print('')
...
ID674021384 date heloo hi thanks time and date.
ID674021384 DATE:2 Time:1

ID909900000 date hello time.
ID909900000 DATE:1 Time:1
For parsing similar lines of text, like log files, I often use regular expressions using the re module. Though split() would work well also for separating fields which don't contain spaces and the parts of the date, using regular expressions allows you to also make sure the format matches what you expect, and if need be warn you of a weird looking input line.
Using regular expressions, you could get the individual fields of the date and time and construct date or datetime objects from them (both from the datetime module). Once you have those objects, you can compare them to other similar objects and write new entries, formatting the dates as you like. I would recommend parsing the whole input file (assuming you're reading a file) and writing a whole new output file instead of trying to alter it in place.
As for keeping track of the date and time counts, when your input isn't too large, using a dictionary is normally the easiest way to do it. When you encounter a line with a certain ID, find the entry corresponding to this ID in your dictionary, or add a new one if it is not there yet. This entry could itself be a dictionary using dates and times as keys, whose values are the counts of each one encountered.
I hope this answer will guide you on the way to a solution even though it contains no code.
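A minimal sketch of the approach described above might look like this. It assumes dates appear as DD/MM/YYYY or DD-MM-YYYY and durations as "<n> minutes", as in the question's sample lines; the names summarize, DATE_RE, and TIME_RE are illustrative.

```python
import re
from datetime import date

DATE_RE = re.compile(r'\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b')
TIME_RE = re.compile(r'\b\d+ minutes\b')

def summarize(line):
    text_id = line.split(' ', 1)[0]
    # Build real date objects so that impossible dates raise ValueError early.
    dates = [date(int(y), int(m), int(d)) for d, m, y in DATE_RE.findall(line)]
    counts = {'DATE': len(dates), 'TIME': len(TIME_RE.findall(line))}
    # Replace the times, then the dates, with their placeholder words.
    new_text = DATE_RE.sub('date', TIME_RE.sub('time', line))
    return text_id, new_text, dates, counts

lines = [
    'ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988.',
    'ID909900000 25-01-1986 hello 10 minutes.',
]
for line in lines:
    text_id, new_text, dates, counts = summarize(line)
    print(new_text)
    print('{} DATE:{} TIME:{}'.format(text_id, counts['DATE'], counts['TIME']))
```

Because the dates are real date objects, they can be compared or subtracted from one another if needed.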
You could use a couple of regular expressions:
import re
txt = 'ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988.'
retime = re.compile(r'([0-9]+) *minutes')
redate = re.compile(r'([0-9]+[/-][0-9]+[/-][0-9]{4})')
# find all dates in 'txt'
dates = redate.findall(txt)
print(dates)
# find all times in 'txt'
times = retime.findall(txt)
print(times)
# replace dates and times in the original string
# (note: str.replace swaps every occurrence of the captured text, not only the matched one)
newtxt = txt
for adate in dates:
    newtxt = newtxt.replace(adate, 'date')
for atime in times:
    newtxt = newtxt.replace(atime, 'time')
The output looks like this:
Original string:
ID674021384 25/01/1986 heloo hi thanks 5 minutes and 25-01-1988.
Found dates:['25/01/1986', '25-01-1988']
Found times: ['5']
New string:
ID674021384 date heloo hi thanks time minutes and date.
Dates and times found:
ID674021384 DATE:2 TIME:1
Chris