I want to label the unmarked chapter numbers in a text without changing the unmarked page numbers (so I can remove those later). I intend to do this with a regex built from a variable holding an incrementally increasing chapter number.
The text string looks like this:
"Introduction text
1
Chapter 1 title
Chapter 1 text
192 (page number)
Chapter 1 text some more
193
2
Chapter 2 title
Chapter 2 text
194
And so on"
Using the variable chapter_fix = 1, incremented with chapter_fix = chapter_fix + 1,
I want to take the regex strings '\n1\n', '\n2\n', '\n3\n'...up to '\n365\n' and convert them to the strings '\nChapter 1\n', '\nChapter 2\n', '\nChapter 3\n'...up to '\nChapter 365\n' without modifying the page numbers, which are not part of the same number series.
Unfortunately, I am an amateur coder and despite working on this problem for several hours and having some basis in similar regex problems I just really can't figure it out. Frankly I don't even really have any usable code to present as an example but I would really appreciate help in figuring this out to learn this better.
I have included my current, profoundly incorrect, code below. I am just going to make the adjustment manually this time but I'd still like to learn to do it right if anybody can help.
###Module imports###
import re
text = '''
Intro Text
SECTION I
Section I Title
118
1
Chapter 1 Title
119
text
121
text
122
2
Chapter 2 Title
text
'''
###Function Definitions###
def chapter_correction (text):
    chapter_correction = 1
    pattern = r'\n{}\n'.format(chapter_correction)
    print(pattern)
    text = re.sub(pattern, text, chapter_correction, flags=re.IGNORECASE)
    chapter_correction += 1
    return text
text = chapter_correction (text)
print(text)
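One possible approach, offered as a minimal sketch rather than the definitive fix: let re.sub call a replacement function for every line that consists only of digits, and have that function remember which chapter number it expects next. A digit-only line is rewritten only when it equals the expected number, so page numbers like 118 or 192 are left alone. The name label_chapters and its first parameter are made up for this example, and the sketch assumes a page number never happens to equal the next expected chapter number.

```python
import re

def label_chapters(text, first=1):
    """Prefix sequential chapter numbers with 'Chapter ', leaving other numbers alone."""
    expected = first  # the next chapter number we are waiting for

    def repl(match):
        nonlocal expected
        if int(match.group(1)) == expected:
            expected += 1
            return "Chapter " + match.group(1)
        return match.group(0)  # a page number: keep it unchanged

    # With re.MULTILINE, ^ and $ match at the start and end of every line.
    return re.sub(r"^(\d+)$", repl, text, flags=re.MULTILINE)

sample = "Intro Text\n118\n1\nChapter 1 Title\n119\ntext\n122\n2\nChapter 2 Title\ntext\n"
print(label_chapters(sample))
```

re.sub processes matches from left to right, which is what lets the counter advance in step with the chapters.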
I've been working on a job description parser and I have been trying to extract the entire sentence which consists of the number of years of experience required.
I have tried to use regex which provides me the number of years but not the entire sentence.
def extract_years(self,resume_text):
    resume_text = str(resume_text.split('.'))
    exp=[]
    rx = re.compile(r"(\d+(?:-\d+)?\+?)\s*(years?)",re.I)
    for word in resume_text:
        exp_temp = rx.search(resume_text)
        if exp_temp:
            exp.append(exp_temp[0])
    exp = list(set(exp))
    return exp
Output:
['5-7 years']
Desired Output:
['5-7 years of experience in journalism, communications, or content creation preferred']
Try: (\d+(?:-\d+)?\+?)\s*(years?).*
Though I'm somewhat new to regex, I believe you can get what you want by adding ".*" to the end of your pattern, and possibly to the beginning as well if "5-7 years" comes after other characters, as in "needs 5-7 years of experience".
Adding ".*" at the end matches any run of zero or more characters after your initial match, stopping at a line break, so the match extends over the rest of the sentence.
Hope this helps.
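As a rough sketch of that idea (not tested against a real job description): the version below replaces ".*" with "[^.\n]*" so the match stops at the end of the sentence instead of running to the end of the line. The names YEARS_RX and extract_year_sentences are invented for this example, and it assumes sentences end with a period.

```python
import re

# Number of years (e.g. "5", "5-7", "5+"), then the rest of the sentence
# up to, but not including, the next period or line break.
YEARS_RX = re.compile(r"\d+(?:-\d+)?\+?\s*years?[^.\n]*", re.I)

def extract_year_sentences(text):
    """Return every 'N years ...' fragment, each running to the end of its sentence."""
    return [m.group(0).strip() for m in YEARS_RX.finditer(text)]

description = (
    "We are hiring an editor. "
    "5-7 years of experience in journalism, communications, "
    "or content creation preferred. Apply today."
)
print(extract_year_sentences(description))
```

If the phrase can appear mid-sentence, putting "[^.\n]*" in front of the number as well would pull in the start of the sentence too.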
So I wrote a program for my Kindle e-reader that searches my highlights and deletes repetitive text (usually information about the book title, author, page number, etc.). I thought it was functional, but sometimes there would be stray periods (.) on certain lines of the output. At first I thought the program was just buggy, but then I realized that the regex I'm using to match the book's title and author was also matching any sentence that ended in brackets.
This is the regex that I'm using to detect the book's title and author:
titleRegex = re.compile('(.+)\((.+)\)')
Example
Desired book title and author match: Book title (Author name)
What would also get matched: I like apples because they are green (they are sometimes red as well).
In this case it would delete everything and leave just the period at the end of the sentence. This is obviously not ideal because it deletes the text I highlighted
Here is the unformatted text file that goes into my program
The program works by finding all of the matches for the regexes I wrote, looping through those matches and one by one replacing them with empty strings.
Would there be any ways to make my title regex more specific so that it only picks up author titles and not full sentences that end in brackets? If not, what steps would I have to take to restructure this program?
I've attached my code to the bottom of this post. I would greatly appreciate any help as I'm a total coding newbie. Thanks :)
import re
titleRegex = re.compile('(.+)\((.+)\)')
titleRegex2 = re.compile(r'\ufeff (.+)\((.+)\)')
infoRegex = re.compile(r'(.) ([a-zA-Z]+) (Highlight|Bookmark|Note) ([a-zA-Z]+) ([a-zA-Z]+) ([0-9]+) (\|)')
locationRegex = re.compile(r' Location (\d+)(-\d+)? (\|)')
dateRegex = re.compile(r'([a-zA-Z]+) ([a-zA-Z]+) ([a-zA-Z]+), ([a-zA-Z]+) ([0-9]+), ([0-9]+)')
timeRegex = re.compile(r'([0-9]+):([0-9]+):([0-9]+) (AM|PM)')
newlineRegex = re.compile(r'\n')
sepRegex = re.compile('==========')
regexList = [titleRegex, titleRegex2, infoRegex, locationRegex, dateRegex, timeRegex, sepRegex, newlineRegex]
string = open("/Users/devinnagami/myclippings.txt").read()
for x in range(len(regexList)):
    newString = re.sub(regexList[x], ' ', string)
    string = newString
finalText = newString.split(' ')
with open('booknotes.txt', 'w') as f:
    for item in finalText:
        f.write('%s\n' % item)
There isn't enough information to tell if "Book title (Book Author)" is different than something like "I like Books (Good Ones)" without context. Thankfully, the text you showed has plenty of context. Instead of creating several different regular expressions, you can combine them into one expression to encode that context.
For instance:
quoteInfoRegex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n" +
    r"- Your Highlight on page (?P<page>[\d]+) \| Location (?P<location>[\d-]+) \| Added on (?P<added>.*?)\n" +
    r"\n" +
    r"(?P<quote>.*?)\n", flags=re.MULTILINE)
# data is assumed to hold the full contents of the clippings file
for m in quoteInfoRegex.finditer(data):
    print(m.groupdict())
This parses each entry as a unit, relying on the fact that the book title is on the first line after the row of equals signs, the highlight details are on the next line, and the quote itself follows the blank line after that.
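To see why the context helps, here is a small self-contained sketch that runs the same kind of pattern on a single made-up entry. The sample entry is an assumption modeled on the pattern itself, not a copy of a real clippings file, and the names quote_info_regex and sample are invented for the example. The quote ends in brackets, yet it is captured as the quote and not mistaken for a title.

```python
import re

quote_info_regex = re.compile(
    r"^=+\n(?P<title>.*?) \((?P<author>.*?)\)\n"
    r"- Your Highlight on page (?P<page>\d+) \| Location (?P<location>[\d-]+)"
    r" \| Added on (?P<added>.*?)\n"
    r"\n"
    r"(?P<quote>.*?)\n",
    flags=re.MULTILINE,
)

# Hypothetical entry, laid out the way the pattern expects.
sample = (
    "==========\n"
    "Book title (Author name)\n"
    "- Your Highlight on page 12 | Location 180-181 | Added on Monday, January 6, 2020 10:15:00 AM\n"
    "\n"
    "I like apples because they are green (they are sometimes red as well).\n"
)

quotes = [m.group("quote") for m in quote_info_regex.finditer(sample)]
print(quotes)
```

Writing only the quote group to the output file would then replace the chain of separate substitutions.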
I'm very new to coding and only know the very basics. I am using Python and trying to print everything between two sentences in a text. I only want the content in between, not what comes before or after. It's probably very easy, but I couldn't figure it out.
Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom) Ev 134 Haukelifjell Hordaland / Telemark — Icy. 10 o'clock 1 degree. Valid from: 05.01.2020 13:53 Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)
I want to collect the bold text to use on a website later. Everything except the italic text (the sentences before and after) is dynamic, if that makes any difference.
You can use split to cut the string and access the parts that you are interested in.
If you know how to get the full text already, it's easy to get the bold sentence by removing the two constant sentences before and after.
full_text = "Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom) Ev 134 Haukelifjell Hordaland / Telemark — Icy. 10 o'clock 1 degree. Valid from: 05.01.2020 13:53 Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
s1 = "Ev 39 Fursetfjellet (Oppdøl - Batnfjordsøra) No reports. Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom)"
s2 = "Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
bold_text = full_text.split(s1)[1] # Remove the left part.
bold_text = bold_text.split(s2)[0] # Remove the right part.
bold_text = bold_text.strip() # Clean up spaces on each side if needed.
print(bold_text)
It looks like a job for regular expressions, there is the re module in Python.
You should:
Open the file
Read its content in a variable
Use search or match function in the re module
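In particular, in the last step you should use your "surrounding" strings as delimiters and capture everything between them. You can achieve this using a regex pattern like re.escape(str1) + "(.*)" + re.escape(str2); the re.escape calls are needed because the delimiters here contain parentheses, which are special characters in a regex.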
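A minimal sketch of that pattern, assuming the two delimiter sentences are known in advance. The helper name between is made up for this example, and the middle part of full_text is shortened from the question's sample.

```python
import re

def between(text, before, after):
    """Return the text found between two literal delimiter strings, or None."""
    # re.escape makes characters such as ( and ) match literally.
    pattern = re.escape(before) + r"(.*?)" + re.escape(after)
    match = re.search(pattern, text, flags=re.DOTALL)  # DOTALL lets . cross line breaks
    return match.group(1).strip() if match else None

s1 = "Ev 134 Haukelifjell (Liamyrane bom - Fjellstad bom)"
s2 = "Rv 3 Kvikne (Tynset (Motrøa) - Ulsberg)"
full_text = s1 + " Icy. 10 o'clock 1 degree. " + s2
print(between(full_text, s1, s2))
```

The lazy (.*?) stops at the first occurrence of the closing delimiter; a greedy (.*) would run to the last one.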
You can take a look at the regex documentation, but just to give you an idea:
".*" matches any sequence of characters (except line breaks, by default)
"()" lets you capture the content inside the parentheses and access it later by index (e.g. re.search(pattern, original_string).group(1))
With the help of joksnet's programs here I've managed to get plaintext Wikipedia articles that I'm looking for.
The text returned includes Wiki markup for the headings, so for example, the sections of the Albert Einstein article are returned like this:
==Biography==
===Early life and education===
blah blah blah
What I'd really like to do is feed the retrieved text to a function and wrap all the top level sections in bold html tags and the second level sections in italics, like this:
<b>Biography</b>
<i>Early life and education</i>
blah blah blah
But I'm afraid I don't know how to even start, at least not without making the function dangerously naive. Do I need to use regular expressions?
Any suggestions greatly appreciated.
PS Sorry if "parsing" is too strong a word for what I'm trying to do here.
I think the best way here would be to let MediaWiki take care of the parsing. I don't know the library you're using, but basically this is the difference between
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content
which returns the raw wikitext and
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=Albert%20Einstein&rvprop=content&rvparse
which returns the parsed HTML.
You can use regex and scraping modules like Scrapy and Beautifulsoup to parse and scrape wiki pages.
Now that you clarified your question I suggest you use the py-wikimarkup module that is hosted on github. The link is https://github.com/dcramer/py-wikimarkup/ . I hope that helps.
I ended up doing this:
def parseWikiTitles(x):
    counter = 1
    while '===' in x:
        if counter == 1:
            x = x.replace('===','<i>',1)
            counter = 2
        else:
            x = x.replace('===',r'</i>',1)
            counter = 1
    counter = 1
    while '==' in x:
        if counter == 1:
            x = x.replace('==','<b>',1)
            counter = 2
        else:
            x = x.replace('==',r'</b>',1)
            counter = 1
    x = x.replace('<b> ', '<b>', 50)
    x = x.replace(r' </b>', r'</b>', 50)
    x = x.replace('<i> ', '<i>', 50)
    x = x.replace(r' </i>', r'</i>', 50)
    return x
I pass the string of text with wiki titles to that function and it returns the same text with the == and === replaced with bold and italics HTML tags. The last thing removes spaces before and after titles, for example == title == gets converted to <b>title</b> instead of <b> title </b>
Has worked without problem so far.
Thanks for the help guys,
Alex
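For comparison, the same transformation can be sketched with two regular expressions. This is only an illustration under the assumption that every heading sits on its own line; the function name wiki_titles_to_html is made up here. Third-level headings are handled first so that their === markers are not picked up by the second-level rule, and deeper levels (==== and beyond) are left untouched.

```python
import re

def wiki_titles_to_html(text):
    """Wrap ==Heading== in <b> tags and ===Heading=== in <i> tags."""
    # (?!=) refuses a further '=' so '====' headings are not half-converted.
    text = re.sub(r"^===(?!=)[ \t]*(.+?)[ \t]*===[ \t]*$", r"<i>\1</i>",
                  text, flags=re.MULTILINE)
    text = re.sub(r"^==(?!=)[ \t]*(.+?)[ \t]*==[ \t]*$", r"<b>\1</b>",
                  text, flags=re.MULTILINE)
    return text

article = "==Biography==\n===Early life and education===\nblah blah blah"
print(wiki_titles_to_html(article))
```

The [ \t]* parts absorb spaces around the title, so == title == comes out as <b>title</b> without a separate cleanup step.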
I have a text that goes like this:
text = "All human beings are born free and equal in dignity and rights. They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood."
How do I write a function hedging(text) that processes my text and produces a new version with the word "like" inserted after every third word?
The outcome should look like this:
text2 = "All human beings like are born free like and equal in like..."
Thank you!
Instead of giving you something like
solution=' like '.join(map(' '.join, zip(*[iter(text.split())]*3)))
I'm posting a general advice on how to approach the problem. The "algorithm" is not particularly "pythonic", but hopefully easy to understand:
words = split text into words
number of words processed = 0
for each word in words
    output word
    number of words processed += 1
    if number of words processed is divisible by 3 then
        output like
Let us know if you have questions.
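For reference, the steps above translate almost line for line into Python. This is just one straightforward sketch; the every and filler parameters are additions for the example and were not part of the question.

```python
def hedging(text, every=3, filler="like"):
    """Insert the filler word after every `every`-th word of the text."""
    output = []
    # enumerate(..., start=1) gives the number of words processed so far.
    for count, word in enumerate(text.split(), start=1):
        output.append(word)
        if count % every == 0:
            output.append(filler)
    return " ".join(output)

text = ("All human beings are born free and equal in dignity and rights. "
        "They are endowed with reason and conscience and should act towards "
        "one another in a spirit of brotherhood.")
print(hedging(text))
```

Unlike the zip-based one-liner, this version keeps any leftover words when the word count is not a multiple of three.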
You could go with something like that:
' '.join([n + ' like' if i % 3 == 2 else n for i, n in enumerate(text.split())])