python regex match content crossing multiple lines - python

I have a txt file, include multiple line.My result crossing multiple lines.
for example, my data can be simplified as the following:
target_str =
x:-2.12343234
aaa:-3.05594480202
aaa:-3.01292995004
aaa:-2.383299
456:-2.232342
x:-2.53739230
aaa:-2.96875038099
aaa:-2.92326261448
aaa:-2.87628054847
bbb:-2.82755928961
456:-2.77678240323
x:-2.3433210
aaa:-2.72356707049
aaa:-2.6675072938
aaa:-2.60827106148
456:-2.3323232
x:-2.8743920
aaa:-2.433233
aaa:-2.9747893
aaa:-2.9747893
bbb:-2.43873
456:-2.43434
I want to match
x:.....
aaa:.....
aaa:.....
aaa:.....
bbb:.....
456:.....
means if there exist bbb, then I pick up the lines from x:... to 456:....
The expected results for the example data is:
x:-2.53739230
aaa:-2.96875038099
aaa:-2.92326261448
aaa:-2.87628054847
bbb:-2.82755928961
456:-2.77678240323
x:-2.8743920
aaa:-2.433233
aaa:-2.9747893
aaa:-2.9747893
bbb:-2.43873
456:-2.43434
I write:
a=re.findall(r"x:(.*\n){4}bbb:.*\n456.*",target_str)
print(a)
But the results is:
['aaa:-2.87628054847\n', 'aaa:-2.9747893\n']
This is not correct, can anyone help me? thanks a lot.

Try with following regex:
(x:(?:.*\n){4}bbb:.*\n456.*)
(?:.*\n) - ?: Makes group non capturing, so it won't be set to output.
Adding parenthesses on whole regex makes it an group which you would like to see as output

Related

Find date and time in a text column in a dataframe Python

I'm trying to find and extract the date and time in a column that contain text sentences. The example data is as below.
df = {'Id': ['001', '002',...],
'Description': ['
THERE IS AN INTERUPTION/FAILURE # 9.6AM ON 27.1.2020 FOR JB BRANCH. THE INTERUPTION ALSO INVOLVED A, B, C AND SOME OTHER TOWN AREAS. OTC AND SST SERVICES INTERRUPTED AS GENSET ALSO WORKING AT THAT TIME. WE CALL FOR SERVICE. THE TECHNICHIAN COME AT 10.30AM. THEN IT BECOME OK AROUND 10.45AM', 'today is 23/3/2013 #10:AM we have',...],
....
}
df = pd.DataFrame (df, columns = ['Id','Description'])
I have tried the datefinder library below but it gives todays date which is wrong.
findDate = dtf.find_dates(le['Description'][0])
for dates in findDate:
print(dates)
Does anyone know what is the best way to extract it and automatically put it into a new column? Or does anyone know any library that can calculate duration between time and date in a string text. Thank you.
So you have two issues here.
you want to know how to apply a function on a DataFrame.
you want a function to extract a pattern from a bunch of text
Here is how to apply a function on a Serie (if selecting only one column as I did, you get a Serie). Bonus points: Read the DataFrame.apply() and Series.apply() documentation (30s) to become a Pandas-chad!
def do_something(x):
some-code()
df['new_text_column'] = df['original_text_column'].apply(do_something)
And here is one way to extract patterns from a string using regexes. Read the regex doc (or follow a course)and play around with RegExr to become an omniscient god (that is, if you use a command-line on Linux, along with your regex knowledge).
Modified from: How to extract the substring between two markers?
import re
text = 'gfgfdAAA1234ZZZuijjk'
# Searching numbers.
m = re.search('\d+', text)
if m:
found = m.group(0)
# found: 1234

Issue Finding Substring

I'm trying to find a substring Chief in the df column. Its working fine with split() on text with spaces but not working as expected with find().
sum(df['JobTitle'].apply(lambda x :'chief' in x.lower().split() ))
sum(df['JobTitle'].apply(lambda x : x.lower().find('chief') ==1))
Can you please highlight what the issue in find usage is here?
You can try with re:
import re
# if it appears, add 1, else add 0
sum(df['JobTitle'].apply(lambda x : int(bool(re.findall(r'\bchief\b', x.lower()))))
# add the number of times the word appears
sum(df['JobTitle'].apply(lambda x : len(re.findall(r'\bchief\b', x.lower())))
EDIT
If you want to catch chief but not words with chief inside, like mischief, use r'\bchief\b'
Demo : https://regex101.com/r/jYOfM1/1

Extracting multiple text with regex and matching the existence as a condition

I am trying to extract elements using a regex, while needing to also distinguish which lines have "-External" at the end. The naming structure I am working with is:
<ServerName>: <Country>-<CountryCode>
or
<ServerName>: <Country>-<CountryCode>-External
For example:
test1 = 'Neo1: Brussels-BRU-External'
test2 = 'Neo1: Brussels-BRU'
match = re.search(r'(?<=: ).+', test1)
print match.group(0)
This gives me "Brussels-BRU". I am trying to extract "Brussels" and "BRU" separately, while not caring about anything to the left of the :.
After, I need to know when a line has "-External". Is there a way I can treat the existence of "-External" as True and without as None?
I suggest that regexs are not needed here, and that a simple split or 2 can get you what are after. Here is a way to split() the line into pieces from which you can then select what you are interested in:
Code:
def split_it(a_string):
on_colon = a_string.split(':')
return on_colon[0], on_colon[1].strip().split('-')
Test Code:
tests = (
'Neo1: Brussels-BRU-External',
'Neo1: Brussels-BRU',
)
for test in tests:
print(split_it(test))
Results:
('Neo1', ['Brussels', 'BRU', 'External'])
('Neo1', ['Brussels', 'BRU'])
Analysis:
The length of the list can be used to determine if the additional field 'External' is present.

regex to find image sequences from list of filenames

I need some help with a regex string to pull any filename that looks like it might be part of a frame sequence out of a previously generated list of filenames.
Frames in a sequence will generally have a minimum padding of 3 and will be preceeded by either a '.' or a '_' An exception is: if the filename is only made up of a number and the .jpg extension (e.g 0001.jpg, 0002.jpg, etc.). I'd like to capture all these in one line of regex, if possible.
Here's what I have so far:
(.*?)(.|_)(\d{3,})(.*)\.jpg
Now I know this doesn't do the "preceeded by . or _" bit and instead just finds a . or _ anywhere in the string to return a positive. I've tried a bit of negative lookbehind testing, but can't get the syntax to work.
A sample of data is:
test_canon_shot02.jpg
test_shot01-04.jpg
test_shot02-03.jpg
test_shot02-02.jpg
test_shot01-03.jpg
test_canon_shot03.jpg
test_shot01-02.jpg
test_shot02.jpg
test_canon_shot02.jpg
test_shot01.jpg
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
OrangeXmas2015_Print_A ct2.jpg
sh120_HF_V01-01.jpg
sh120_HF_V01-02.jpg
sh200_DMP_v04.jpg
sh120_HF_V04.jpg
sh120_HF_V03.jpg
sh120_HF_V02.jpg
blah_v02.jpg
blah_v01.jpg
blah_Capture0 4.jpg
blah_Capture03 .jpg
blah_Capture01. jpg
blah_Capture02.jpg
Wall_GraniteBlock_G rey_TC041813.jpg
Renders10_wire.jpg
Renders10.jpg
Renders09_wire.jpg
Renders09.jpg
Renders08_wire.jpg
Renders08.jpg
Renders07_wire.jpg
Renders07.jpg
Renders06_wire.jpg
Renders06.jpg
Renders05_wire.jpg
Renders05.jpg
Renders04_wire.jpg
Renders04.jpg
Renders03_wire.jpg
Renders03.jpg
Renders02_wire.jpg
Renders02.jpg
Renders01_wire.jpg
Renders01.jpg
archmodels58_057_carpinusbetulus_leaf_diffuse.jpg
archmodels58_042_bark_bump.jpg
archmodels58_023_leaf_diffuse.jpg
WINDY TECHNICZNE-reflect00.jpg
archmodels58_057_leaf_opacity.jpg
archmodels58_057_bark_reflect.jpg
archmodels58_057_bark_bump.jpg
blahC-00-oknaka.jpg
bed
debt
cab
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg
The result I'm after is 2 sequences identified:
GameAssets_.00000.jpg to GameAssets_.00024.jpg
00000.jpg to 00018.jpg
Based on the rules you specified in your question, this pattern should accomplish what you need:
(^|\r?\n|.*_|.*\.)\d{3,}.*\.jpg
for item in re.findall(r'.*?[._]?0{3,}.*',data):
print(item)
GameAssets_.00024.jpg
GameAssets_.00023.jpg
GameAssets_.00022.jpg
GameAssets_.00021.jpg
GameAssets_.00020.jpg
GameAssets_.00019.jpg
GameAssets_.00018.jpg
GameAssets_.00017.jpg
GameAssets_.00016.jpg
GameAssets_.00015.jpg
GameAssets_.00014.jpg
GameAssets_.00013.jpg
GameAssets_.00012.jpg
GameAssets_.00011.jpg
GameAssets_.00010.jpg
GameAssets_.00009.jpg
GameAssets_.00008.jpg
GameAssets_.00007.jpg
GameAssets_.00006.jpg
GameAssets_.00005.jpg
GameAssets_.00004.jpg
GameAssets_.00003.jpg
GameAssets_.00002.jpg
GameAssets_.00001.jpg
GameAssets_.00000.jpg
00018.jpg
00017.jpg
00016.jpg
00015.jpg
00014.jpg
00013.jpg
00012.jpg
00011.jpg
00010.jpg
00009.jpg
00008.jpg
00007.jpg
00006.jpg
00005.jpg
00004.jpg
00003.jpg
00002.jpg
00001.jpg
00000.jpg
Try
(.*?)(\.|_?)(000\d{0,})(.*)\.jpg
Notice that I had to escape the '.' in the second group. Also, I had to make the search for '.' and '_' optional in the second group. Finally, I had to add the minimum padding to the third group.
I used regex101.com to test and refine the regex: regex101

strip sides of a string in python

I have a list like this:
Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]
I want to strip the unwanted characters using python so the list would look like:
Tomato
Populus trichocarpa
I can do the following for the first one:
name = ">Tomato4439"
name = name.strip(">1234567890")
print name
Tomato
However, I am not sure what to do with the second one. Any suggestion would be appreciated.
given:
s='Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]'
this:
s = s.split()
[s[0].strip('0123456789,'), s[-2].replace('[',''), s[-1].replace(']','')]
will give you
['Tomato', 'Populus', 'trichocarpa']
It might be worth investigating regular expressions if you are going to do this frequently and the "rules" might not be that static as regular expressions are much more flexible dealing with the data in that case. For the sample problem you present though, this will work.
import re
a = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
re.sub(r"^([A-Za-z]+).+\[([^]]+)\]$", r"\1 \2", a)
This gives
'Tomato Populus trichocarpa'
If the strings you're trying to parse are consistent semantically, then your best option might be classifying the different "types" of strings you have, and then creating regular expressions to parse them using python's re module.
>>> import re
>>> line = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
>>> match = re.match("^([a-zA-Z]+).*\[([a-zA-Z ]+)\].*",line)
>>> match.groups()
('Tomato', 'Populus trichocarpa')
edited to not include the [] on the 2nd part... this should work for any thing that matches the pattern of your query (eg starts with name, ends with something in []) it would also match
"Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa apples]" for example
Previous answers were simpler than mine, but:
Here is one way to print the stuff that you don't want.
tag = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
import re, os
find = re.search('>(.+?) \[', tag).group(1)
print find
Gives you
gi|224089052|ref|XP_002308615.1| predicted protein
Then you can use the replace function to remove that from the original string. And the translate function to remove the extra unwanted characters.
tag2 = tag.replace(find, "")
tag3 = str.translate(tag2, None, ">[],")
print tag3
Gives you
Tomato4439 Populus trichocarpa

Categories

Resources