Find text between strings in Python

I looked at similar questions and answers but could not solve my issue.
I have a string like the following:
ecc, ecc, .....thisIsUnique("92781227-7e7e-4768-8ee3-4e1615bddf3c", ecc, ecc.......
It could be very long before and after, with no other unique text.
What I need is to get the 92781227-7e7e-4768-8ee3-4e1615bddf3c code as a string. So I'm looking for something that works like this:
when you find thisIsUnique, go ahead, start reading the code after the first (" characters, and keep reading until you find the first ", characters.
Unfortunately I'm not familiar with regex, but maybe there are different ways to solve the problem.
Thanks to all.

There are a few sites you should read to learn what regex is: https://regexone.com/ and Learning Regular Expressions. Use a site like https://regex101.com/ to test what you have tried. But to get you started, this works on exactly what you pasted as an example:
import re
text = 'ecc, ecc, .....thisIsUnique("92781227-7e7e-4768-8ee3-4e1615bddf3c", ecc, ecc.......'
match = re.search(r'thisIsUnique\("([^"]+)', text)  # raw string keeps \( intact
print(match.group(1))
result:
92781227-7e7e-4768-8ee3-4e1615bddf3c
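Since the question asks whether there are ways other than regex, here is a minimal sketch that uses only str.partition. The helper name text_between is made up for illustration, and it assumes the start marker thisIsUnique(" and the closing " both appear literally in the text.

```python
def text_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # partition splits on the first occurrence; the middle item is empty if absent
    _, found, rest = text.partition(start)
    if not found:
        return None
    value, found, _ = rest.partition(end)
    return value if found else None


text = 'ecc, ecc, .....thisIsUnique("92781227-7e7e-4768-8ee3-4e1615bddf3c", ecc, ecc.......'
code = text_between(text, 'thisIsUnique("', '"')
print(code)  # 92781227-7e7e-4768-8ee3-4e1615bddf3c
```

Returning None when either marker is missing avoids the silent wrong slices that str.find can produce when it returns -1.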

Use re.search:
In [991]: text = 'ecc, ecc, .....thisIsUnique("92781227-7e7e-4768-8ee3-4e1615bddf3c", ecc, ecc.......'
In [992]: re.search(r'(?<=thisIsUnique\(")(.*?)"', text).group(1)
Out[992]: '92781227-7e7e-4768-8ee3-4e1615bddf3c'
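The (?<=thisIsUnique\(") part of the pattern is a lookbehind: it requires that text to appear immediately before the match, without making it part of the match.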
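To make the difference between a capture group and a lookbehind concrete, here is a small comparison on the question's sample string. The variable names are illustrative only.

```python
import re

text = 'ecc, ecc, .....thisIsUnique("92781227-7e7e-4768-8ee3-4e1615bddf3c", ecc, ecc.......'

# Capture group: the prefix is part of the overall match, group(1) holds the code.
with_group = re.search(r'thisIsUnique\("([^"]+)"', text)

# Lookbehind: the prefix is checked but not consumed, so group(0) is the code itself.
with_lookbehind = re.search(r'(?<=thisIsUnique\(")[^"]+', text)

print(with_group.group(0))       # thisIsUnique("92781227-7e7e-4768-8ee3-4e1615bddf3c"
print(with_group.group(1))       # 92781227-7e7e-4768-8ee3-4e1615bddf3c
print(with_lookbehind.group(0))  # 92781227-7e7e-4768-8ee3-4e1615bddf3c
```

Note that Python's re module only accepts fixed-width lookbehind patterns, so the capture-group form is the more flexible of the two when the prefix can vary in length.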
Additional Reading
Regex HOWTO - getting started with tutorial
General documentation
Additional tutorial site - TutorialsPoint

Related

extract URL from string in python

I want to extract a full URL from a string.
My code is:
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
print(re.match(r'(ftp|http)://.*\.(jpg|png)$', data))
Output:
None
Expected Output
http://www.google.com/a.jpg
I found so many questions on StackOverflow, but none worked for me.
I have seen many posts and this is not a duplicate. Please help me! Thanks.
You were close!
Try this instead:
r'(ftp|http)://.*\.(jpg|png)'
I would also make this non-greedy, like this:
r'(ftp|http)://.*?\.(jpg|png)'
By default, .* matches as much text as possible, but here you want to match as little text as possible.
Your $ anchors the match at the end of the line, but in your example the end of the URL is not the end of the line.
Another problem is that you're using re.match() and not re.search(). re.match() only tries to match at the beginning of the string, while re.search() looks for a match anywhere in the string.
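The three points above (match vs. search, greedy vs. non-greedy, and the $ anchor) can be checked with a short sketch. The second URL on example.com is invented here so the greedy behaviour becomes visible, and the groups are made non-capturing so that findall returns whole URLs.

```python
import re

# The question's string plus a second, made-up URL.
data = "ahahahttp://www.google.com/a.jpg>hhdhd http://example.com/b.png"

pattern_greedy = r'(?:ftp|http)://.*\.(?:jpg|png)'
pattern_lazy = r'(?:ftp|http)://.*?\.(?:jpg|png)'

# re.match only tries at position 0, where the text starts with "ahaha".
print(re.match(pattern_lazy, data))              # None

# re.search scans forward; the greedy .* runs on to the last extension.
print(re.search(pattern_greedy, data).group(0))
# http://www.google.com/a.jpg>hhdhd http://example.com/b.png

# The lazy .*? stops at the first extension.
print(re.search(pattern_lazy, data).group(0))    # http://www.google.com/a.jpg

# findall collects every non-overlapping match.
print(re.findall(pattern_lazy, data))
# ['http://www.google.com/a.jpg', 'http://example.com/b.png']
```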
You should use search instead of match.
import re
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
url = re.search(r'(ftp|http)://.*\.(jpg|png)', data)
if url:
    print(url.group(0))
Find the start of the URL with find('http://') (or 'ftp://') and the end of the URL with find('.jpg') (or '.png'), then take the substring:
data = "ahahahttp://www.google.com/a.jpg>hhdhd"
start = data.find('http://')
kk = data[start:]
end = kk.find('.jpg')
print(kk[0:end+4])  # +4 keeps the '.jpg' extension itself

Using python to find specific pattern contained in a paragraph

I'm trying to use python to go through a file, find a specific piece of information and then print it to the terminal. The information I'm looking for is contained in a block that looks something like this:
\\Version=EM64L-G09RevD.01\State=1-A1\HF=-1159.6991675\RMSD=4.915e-11\RMSF=1.175e-07\ZeroPoint=0.0353317\
I would like to be able to get the information HF=-1159.6991675. More generally, I would like the script to copy and print \HF=WhateverTheNumberIs\
I've managed to make scripts that are able to copy an entire line and print it out to the terminal, but I am unsure how to accomplish this particular task.
My suggestion is to use regular expressions (regex) to catch the required pattern:
import re  # for using regular expressions
filename = "output.log"  # placeholder: replace with the path to your file
with open(filename) as f:
    s = f.read()  # read the content of the file and hold it as a string to be scanned
p = re.compile(r"\\HF=[^\\]+")  # the pattern as you described: \HF= up to (not including) the next \
print(p.findall(s))  # finds all occurrences and prints them
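As a sketch of how the value itself could be pulled out and converted to a number, the following runs the same idea on the sample block from the question instead of a file. The variable names are illustrative, and the capture group is an addition to the pattern above.

```python
import re

# The block from the question; backslashes are doubled because this is a
# plain (non-raw) string literal, which may not end in a single backslash.
block = ("\\\\Version=EM64L-G09RevD.01\\State=1-A1\\HF=-1159.6991675"
         "\\RMSD=4.915e-11\\RMSF=1.175e-07\\ZeroPoint=0.0353317\\")

# \\ matches one literal backslash; [^\\]+ matches everything up to the next one.
match = re.search(r"\\HF=([^\\]+)", block)

print(match.group(0))  # \HF=-1159.6991675
hf_text = match.group(1)
hf_value = float(hf_text)
print(hf_value)        # -1159.6991675
```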
Regular expressions are the answer, something like r'\\HF=[^\\]*' (note that the separators in your data are backslashes, which must be escaped in the pattern).
Tutorial: regex tutorial
Once you have learned regex, it is an indispensable resource.

Filter strings into list depending on position - Python

For example, this is my string:
myString = "<html><body><p>Hello World!</p><p>Hello Dennis!</p></body></html>"
and what I am trying to achieve is:
myList = ['Hello World!','Hello Dennis!']
Using regular expressions or another method, how can I filter the paragraph text out of myString while ignoring the HTML tags to get myList?
I have tried:
import re
a="<body><p>Hello world!</p><p>Hello Denniss!</p></body>"
result=re.search('<p>(.*)</p>', a)
print(result.group(1))
Which resulted in: Hello world!</p><p>Hello Denniss! and when I tried (.*)(.*) I got Hello World!
This string is just an example. The string may also be <garbage>abcdefghijk<gar<bage> depending on how the web developer coded the website.
It may be a complex regex, but I need to learn this because it is for a cyber security competition I will be participating in later this year, and I think my best bet is to develop an algorithm that searches for text between a > and a <.
How would I go about this?
Sorry if my question is not formatted properly; I have some learning problems.
Do you want to get rid of all the tags in an HTML text? I wouldn't choose regular expressions; another method is better, for example BeautifulSoup, and you will surprise everyone at that hacking meeting:
from bs4 import BeautifulSoup
myString = "<html><body><p>Hello World!</p><p>Hello Dennis!</p></body></html>"
myList = list(BeautifulSoup(myString, "html.parser").strings)
It yields:
['Hello World!', 'Hello Dennis!']
HTML parsing with regex is definitely limited; if you'd like a real solution for HTML mining, take a look at the BeautifulSoup add-on.
As for your regex, the asterisk quantifier is greedy: it will gorge until the last </p>. So make it non-greedy with .*?, optionally followed by a (?=XXX) lookahead, which asserts that XXX comes next without consuming it.
Try the following:
re.findall(r'<p>(.*?)(?=</p>)', s)
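The greedy and non-greedy behaviour can be compared side by side on the question's string. The last pattern is only a sketch of the "text between a > and a <" idea from the question; it assumes the text of interest contains no angle brackets itself.

```python
import re

my_string = "<html><body><p>Hello World!</p><p>Hello Dennis!</p></body></html>"

# Greedy: .* runs to the last </p>, so both paragraphs merge into one match.
greedy = re.findall(r'<p>(.*)</p>', my_string)
print(greedy)   # ['Hello World!</p><p>Hello Dennis!']

# Non-greedy: .*? stops at the nearest </p>.
lazy = re.findall(r'<p>(.*?)</p>', my_string)
print(lazy)     # ['Hello World!', 'Hello Dennis!']

# Any non-empty text between a '>' and the next '<', regardless of tag names.
between = re.findall(r'>([^<>]+)<', my_string)
print(between)  # ['Hello World!', 'Hello Dennis!']
```

The third pattern ignores tag names entirely, so it would also pick up text inside elements other than p.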

Extracting a string from a txt file

So I'm just experimenting, trying to parse the web using Python, and I thought I would try to make a script that searches for my favorite links to watch shows online. I'm now trying to have my program search through sidereel.com for a good link to my desired show and return the links to me. I know that the site saves the links in the following format:
watch-freeseries.mu'then some long string that I need to ignore followed by '14792088'
So what I need to do is find this string in the txt file of the site and return only the 8 numbers at the end of the string. I'm not sure how to get to the numbers, and I need them because they are the link number. Any help would be much appreciated.
You could use a regular expression to do this fairly easily.
>>> import re
>>> text = "watch-freeseries.mu=lklsflamflkasfmsaldfasmf14792088"
>>> expr = re.compile(r"watch\-freeseries\.mu.*?(\d{8})")
>>> expr.findall(text)
['14792088']
A breakdown of the expression:
watch\-freeseries\.mu - Match the literal start of the expected text. Escape any possible special characters by preceding them with \.
.*? - Match any run of characters. . means any character (except a newline, by default) and * means zero or more repetitions. The ? makes the match non-greedy, so it stops at the first run of 8 digits instead of running on into a later URL when two or more show up in the same string.
(\d{8}) - Match and capture the first run of 8 digits that follows.
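The effect of the non-greedy ? in the breakdown above shows up once two links appear in the same string. In this sketch the second link, its filler text, and its id are all invented for illustration.

```python
import re

# Two links in one string; the second one is made up.
text = ("watch-freeseries.mu=lklsflamflkasfmsaldfasmf14792088 "
        "watch-freeseries.mu=qwertyuiop25836914")

lazy = re.compile(r"watch-freeseries\.mu.*?(\d{8})")
greedy = re.compile(r"watch-freeseries\.mu.*(\d{8})")

# Lazy: each match ends at the first 8-digit run, so both ids are found.
print(lazy.findall(text))    # ['14792088', '25836914']

# Greedy: the first match swallows everything up to the last 8-digit run.
print(greedy.findall(text))  # ['25836914']
```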
Note: If you're trying to parse links out of a webpage there are easier ways. I've seen many recommendations on StackOverflow for the BeautifulSoup package in particular. I've never used it myself so YMMV.

REGEX: Parsing n digits with non numeric word boundaries

I hope this message finds you in good spirits. I am trying to find a quick tutorial on the \b expression (apologies if there is a better term). I am writing a script at the moment to parse some XML files, but have run into a bit of a speed bump. Here is an example of my XML:
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
<...> is unimportant, irrelevant XML code. Focus primarily on the CustomerId and OrderId.
My issue lies in parsing a string similar to the one above. I have a regexParse definition that works perfectly. However, it is not intuitive. I need to match only the part of the string that contains 44444444.
My current setup is:
searchPattern = '>\d{8}</CustomerId'
Great! It works, but I want to do it the right way. My thinking is: 1) find 8 digits, 2) if the word boundary after them is non-numeric and matches CustomerId, return it.
Idea:
searchPattern = '\bd{16}\b'
My issue in my tests is incorporating the search for CustomerId somewhere before and after the digits. I was wondering if any of you can either help me out with my issue, or point me in the right path (in words of a guide or something along the lines). Any help is appreciated.
Mods if this is in the wrong area apologies, I wanted to post this in the Python discussion because I am not sure if Python regex supports this functionality.
Thanks again all,
darcmasta
txt = """
<....></...><...></...><OrderId>123456</OrderId><...></...>
<CustomerId>44444444</CustomerId><...></...><...></...>
"""
import re
pattern = r"<(\w+)>(\d+)<"  # tag name, then the digits that directly follow it
print(re.findall(pattern, txt))
# output: [('OrderId', '123456'), ('CustomerId', '44444444')]
You might consider using a lookbehind assertion in your regex to make it easy for a human to read:
import re
xml = "<....></...><...></...><OrderId>123456</OrderId><...></...><CustomerId>44444444</CustomerId><...></...><...></...>"
a = re.compile(r"(?<=OrderId>)\d{6}")  # 6 digits preceded by OrderId>
print(a.findall(xml))  # ['123456']
b = re.compile(r"(?<=CustomerId>)\d{8}")  # 8 digits preceded by CustomerId>
print(b.findall(xml))  # ['44444444']
You should be using raw string literals:
searchPattern = r'\b\d{16}\b'
The escape sequence \b in a plain (non-raw) string literal represents the backspace character, so that's what the re module would be receiving (unrecognised escape sequences such as \d get passed on as-is, i.e. backslash followed by 'd').
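The difference between the plain and raw literal is easy to verify, and it also shows why \b alone does not answer the CustomerId question. This is a minimal sketch; the sample line is a shortened version of the XML from the question, and the lookaround pattern is one possible way to tie the digits to CustomerId.

```python
import re

# In a plain string literal, \b is the single backspace character.
plain = '\b'
raw = r'\b'
print(len(plain), len(raw))  # 1 2
print(plain == '\x08')       # True

line = "<CustomerId>44444444</CustomerId><OrderId>123456</OrderId>"

# \b only marks word/non-word edges, so it cannot tell the two ids apart.
all_numbers = re.findall(r'\b\d+\b', line)
print(all_numbers)           # ['44444444', '123456']

# Lookbehind and lookahead pin the digits to the CustomerId element.
customer_ids = re.findall(r'(?<=<CustomerId>)\d{8}(?=</CustomerId>)', line)
print(customer_ids)          # ['44444444']
```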
