Match text between parentheses that ends with .md - python

I need to get the text inside the parentheses where the text ends with .md, using a regex in Python (if you know another way, feel free to suggest it).
Original string:
[Romanian (Romania)](books/free-programming-books-ro.md)
Expected result:
books/free-programming-books-ro.md

This should work:
import re
s = '[Romanian (Romania)](books/free-programming-books-ro.md)'
result = re.findall(r'[^\(]+\.md(?=\))',s)
['books/free-programming-books-ro.md']
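Another way, as a rough sketch: a capture group can match the parentheses themselves and keep only what sits between them. The helper name md_link_targets is made up for illustration, and the sketch assumes the link target contains no nested parentheses.

```python
import re

# Capture whatever sits between "(" and ")" as long as it ends in ".md".
# [^()]+ keeps the match from running across other parentheses.
MD_TARGET = re.compile(r'\(([^()]+\.md)\)')

def md_link_targets(text):
    """Return every parenthesised target in text that ends with .md."""
    return MD_TARGET.findall(text)

s = '[Romanian (Romania)](books/free-programming-books-ro.md)'
print(md_link_targets(s))  # ['books/free-programming-books-ro.md']
```

Because the pattern requires both parentheses, "(Romania)" is skipped without needing a lookahead.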

Related

How to check if a line contains a string in Python

I'm trying to check if a subString exists in a string using regular expression.
RE : re_string_literal = '^"[a-zA-Z0-9_ ]+"$'
The thing is, I don't want to match just any substring. I'm reading data from a file, and one of the lines has this text:
cout<<"Hello"<<endl;
I just want to check if there's a string inside the line and if yes, store it in a list.
I have tried the re.match method but it only works if we have to match a pattern, but in this case, I just want to check if a string exists or not, if yes, store it somewhere.
re_string_lit = '^"[a-zA-Z0-9_ ]+"$'
text = 'cout<<"Hello World!"<<endl;'
re.match(re_string_lit,text)
It doesn't output anything.
In simple words,
I just want to extract everything inside ""
If you just want to extract everything inside "", then string splitting would be a much simpler way of doing things.
>>> a = 'something<<"actualString">>something,else'
>>> b = a.split('"')[1]
>>> b
'actualString'
The above example only works when the line contains a single pair of double quotes ("). For more than one pair, you could iterate over the substrings returned by split and apply a much simpler regular expression to each.
This worked for me:
re.search('"(.+?)"', 'cout<<"Hello"<<endl')
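To collect every quoted literal on a line rather than just the first, something along these lines may help. It is only a sketch: the sample line is invented, and it assumes the quotes are balanced and contain no escaped quotes.

```python
import re

text = 'cout<<"Hello World!"<<"Bye"<<endl;'

# findall with one capture group returns just the captured text,
# one entry per quoted literal found anywhere in the line.
literals = re.findall(r'"(.+?)"', text)
print(literals)  # ['Hello World!', 'Bye']

# The split-based variant: after splitting on '"', the pieces at odd
# indexes are the ones that sat between a pair of quotes.
parts = text.split('"')
from_split = parts[1::2]
print(from_split)  # ['Hello World!', 'Bye']
```

The original re.match attempt returns None because the pattern is anchored with ^ and $, so it only matches when the whole line is a quoted string.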

Python - Parsing JSON formatted text file with regex

I have a text file formatted like JSON, but everything is on a single line (it could be a MongoDB file). Could someone point me in the direction of how I could extract values using a Python regex?
The text shows up like this:
{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content":
I want to get the "fileAssetId" and "filename" values. I've tried to load the line with Python's json module, but I get an error.
For the fileAssetId I tried this regex:
regex = re.compile(r"([0-9a-f]{8})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{4})\S*-\S*([0-9a-f]{12})")
But I get the following: 034b9317, 60d9, 45c2, b6d6, 0f24b59e1991
I'm not too sure how to get the data as it's displayed.
How about using positive lookahead and lookbehind:
(?<=\"fileAssetId\":\")[a-fA-F0-9-]+?(?=\")
captures the fileAssetId and
(?<=\"filename\":\").+?(?=\")
matches the filename.
For a detailed explanation of the regex have a look at the Regex101-Example. (Note: I combined both in the example with an OR-Operator | to show both matches at once)
To get a list of all matches use re.findall or re.finditer instead of re.match.
re.findall(pattern, string) returns a list of matching strings.
re.finditer(pattern, string) returns an iterator of match objects.
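As a small illustration of the two calls with the lookaround patterns above, here is a sketch on a shortened sample string. The string is a trimmed-down stand-in for the question's data, and the variable names are invented.

```python
import re

# Shortened, hypothetical sample in the same shape as the question's data.
data = '{"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"}'

asset_id_re = re.compile(r'(?<="fileAssetId":")[a-fA-F0-9-]+?(?=")')
filename_re = re.compile(r'(?<="filename":").+?(?=")')

# findall returns the matched strings directly.
asset_ids = asset_id_re.findall(data)
print(asset_ids)  # ['034b9317-60d9-45c2-b6d6-0f24b59e1991']

# finditer yields match objects, so positions are available as well.
matches = list(filename_re.finditer(data))
filenames = [m.group(0) for m in matches]
print(filenames)  # ['Reports.pdf']
print(matches[0].span())  # start and end offsets of the filename
```

The lookbehind works here because its text has a fixed length, which Python's re module requires.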
You can use python's walk method and check each entry with re.match.
In case the string you got is not convertible to a Python dict, you can use just a regex:
print(re.match(r'.*fileAssetId\":\"([^\"]+)\".*', your_string).group(1))
Solution for your example:
import re
example_string = r'{"d":{"__type":"WikiFileNodeContent:http:\/\/samplesite.com.au\/ns\/business\/wiki","author":null,"description":null,"fileAssetId":"034b9317-60d9-45c2-b6d6-0f24b59e1991","filename":"Reports.pdf"},"createdBy":1531,"createdByUsername":"John Cash","icon":"\/Assets10.37.5.0\/pix\/16x16\/page_white_acrobat.png","id":3041,"inheritedPermissions":false,"name":"map","permissions":[23,87,35,49,65],"type":3,"viewLevel":2},{"__type":"WikiNode:http:\/\/samplesite.com.au\/ns\/business\/wiki","children":[],"content"'
regex_pattern = r'.*fileAssetId\":\"([^\"]+)\".*'
match = re.match(regex_pattern, example_string)
fileAssetId = match.group(1)
print('fileAssetId: {}'.format(fileAssetId))
Executing this yields:
fileAssetId: 034b9317-60d9-45c2-b6d6-0f24b59e1991
Try adding \n to the string that you are entering in to the file (\n means new line)
Based on the idea given here https://stackoverflow.com/a/3845829 and by following the JSON standard https://www.json.org/json-en.html, we can use Python + regex https://pypi.org/project/regex/ and do the following:
import regex
json_pattern = (
r'(?(DEFINE)'
r'(?P<whitespace>( |\n|\r|\t)*)'
r'(?P<boolean>true|false)'
r'(?P<number>-?(0|([1-9]\d*))(\.\d+)?([eE][+-]?\d+)?)'
r'(?P<string>"([^"\\]|\\("|\\|/|b|f|n|r|t|u[0-9a-fA-F]{4}))*")'
r'(?P<array>\[((?&whitespace)|(?&value)(,(?&value))*)\])'
r'(?P<key>(?&whitespace)(?&string)(?&whitespace))'
r'(?P<value>(?&whitespace)((?&boolean)|(?&number)|(?&string)|(?&array)|(?&object)|null)(?&whitespace))'
r'(?P<object>\{((?&whitespace)|(?&key):(?&value)(,(?&key):(?&value))*)\})'
r'(?P<document>(?&object)|(?&array))'
r')'
r'(?&document)'
)
json_regex = regex.compile(json_pattern)
match = json_regex.match(json_document_text)
You can change the last line of json_pattern to match individual objects rather than a whole document by replacing (?&document) with (?&object). The regex turned out simpler than I expected, but I did not run extensive tests on it. It works fine for me, and I have tested it on hundreds of files. I will try to improve my answer if I find any issue when running it.

Python - RegEx match does not write to file if it contains full-stop

I'm trying to use a RegEx expression in a Python script in order to find specific variables within a webpage. I then export this using a csv file. However, if the found group contains a full-stop, it does not export at all. How do I remedy this?
In this webpage, the item displayed changes depending on a code inputted. My script automates the inputting of codes, and then records the item produced. Here are the relevant parts of my code:
import re
regName = r'The item name is (.*?)\.'
response = opener.open(
    'http://website.com/webpage.php' + itemValues)
html = response.read()
responseDecode = html.decode('utf8')
name = re.findall(regName, responseDecode)
#Convert stuff to Unicode
uniName = name[0].encode('utf8', 'replace')
with open("readable.txt", "a") as file:
    file.write("\n"*2)
    file.write(uniName + '\n')
Of note, I convert to unicode because some of the item names contain accented characters.
EDIT: an example of an item name that would not work is R.O.B.O.T; all that would be written is R.
Try using regName = r'The item name is (.*?)\.$'. The $ anchors the match at the end of the string, so the earlier full stops can no longer end the match. Right now the lazy (.*?) stops at the first full stop it finds.
Or if the string doesn't end right there, try adding a space or some other following character. You need to specify the kind of character that marks the end of the item string.
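A quick sketch of the difference, using a made-up page fragment (the real markup is not shown in the question). The lookahead is one assumed way of saying what must follow the closing full stop.

```python
import re

# Hypothetical page fragment; the real markup is not shown in the question.
page = '<p>The item name is R.O.B.O.T. Enjoy!</p>'

# The original pattern: the lazy group stops at the very first full stop.
lazy = re.findall(r'The item name is (.*?)\.', page)
print(lazy)  # ['R']

# Require whitespace, a tag, or the end of the text after the closing
# full stop, so the dots inside the name are not treated as the end.
bounded = re.findall(r'The item name is (.*?)\.(?=\s|<|$)', page)
print(bounded)  # ['R.O.B.O.T']
```

Which follow-up character to require depends on how the page actually ends the sentence, so the lookahead would need adjusting to the real HTML.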

Strip a string if it contains a parenthesis

I have some scraped data that varies slightly in format. In order to standardise it, I need to remove anything within parentheses, including the parentheses themselves, if they exist. I have attempted to use strip in various ways, but to no avail.
Some example data:
Text (te)
Text Text (tes)
Text-Text (te)
Text Text
Text-Text (tes)
And how I need to appear after standardisation:
Text
Text Text
Text-Text
Text Text
Text-Text
Can anyone offer me a solution for this? Thanks SMNALLY
from re import sub
x = sub(r"(?s)\(.*\)", "", x)
This will remove everything between the parentheses (including newlines) as well as the parentheses themselves.
Assuming the parentheses do not nest, and that there is at most one pair per string, try this:
import re
myString = re.sub(r'\(.*\)', '', myString)
A more specific pattern might be:
myString = re.sub(r'\s*\(\w+\)\s*$', '', myString)
The above pattern deletes the whitespace that surrounds the parenthetical expression, and only deletes from the end of the line.
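Applied to the sample data from the question, the more specific pattern behaves as follows. This is just a sketch; the list name samples is invented.

```python
import re

samples = ['Text (te)', 'Text Text (tes)', 'Text-Text (te)',
           'Text Text', 'Text-Text (tes)']

# Remove one trailing "(...)" group plus the whitespace around it.
# Strings without a trailing parenthetical are returned unchanged.
cleaned = [re.sub(r'\s*\(\w+\)\s*$', '', s) for s in samples]
print(cleaned)
# ['Text', 'Text Text', 'Text-Text', 'Text Text', 'Text-Text']
```

Since \w+ only covers letters, digits and underscores, a parenthetical containing spaces or punctuation would be left in place.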

Parsing text file in python

I have an HTML file. I have to replace all text of this form: [%anytext%]. As I understand it, parsing HTML is very easy with BeautifulSoup. But what regular expression do I need, and how do I remove the text and write the data back?
Okay, here is the sample file:
<html>
[t1] [t2] ... [tood] ... [sadsada]
Sample text [i8]
[d9]
</html>
The Python script must work with all strings and replace [%] -> some other string, for example:
<html>
* * ... * ... *
Sample text *
*
</html>
What I did:
import re
import codecs
fullData = ''
for line in codecs.open(u'test.txt', encoding='utf-8'):
    line = re.sub(r"\[.*?\]", '*', line)
    fullData += line
print(fullData)
This code does exactly what I described in the sample. Thanks all.
Regex does the trick if you need to replace any text between "[%" and "%]".
The code would look something like this:
import re
newstring = re.sub(r"\[%.*?%\]", newtext, oldstring)
The regex used here is lazy, so it matches everything between an occurrence of "[%" and the next occurrence of "%]". You could make it greedy by removing the question mark. It would then match everything between the first occurrence of "[%" and the last occurrence of "%]".
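The lazy versus greedy behaviour is easy to see on a short, made-up string with two markers:

```python
import re

old = 'a [%one%] b [%two%] c'

# Lazy: each marker is matched and replaced on its own.
lazy = re.sub(r'\[%.*?%\]', '*', old)
print(lazy)  # a * b * c

# Greedy: one match runs from the first "[%" to the last "%]".
greedy = re.sub(r'\[%.*%\]', '*', old)
print(greedy)  # a * c
```

For replacing each marker separately, the lazy form is the one that fits.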
Looks like you need to parse a generic textfile, looking for that marker to replace it -- the fact that other text outside the marker is HTML, at least from the way you phrased your task, does not seem to matter.
If so, and what you want is to replace every occurrence of [%anytext%] with loremipsum, then a simple:
thenew = theold.replace('[%anytext%]', 'loremipsum')
will serve, if theold is the original string containing the file's text -- now thenew is a new string with all occurrences of that marker replaced - no need for regex, BS or anything else.
If your task is very different from this, pls edit your Question to explain it in more detail!-)
