Parsing and updating markdown file with Python - python

I'm creating a script that will traverse a markdown file and update any image tags from
![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif)
to
![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif?alt-text="Daffy Duck")
I'm new to Python, so I'm unsure about syntax and my approach. My current thinking is to create a new empty string and traverse the original markdown line by line; if an image tag is detected, splice the alt text into the correct location, then add the line to the new markdown string. The code I have so far looks like:
import markdown
from markdown.treeprocessors import Treeprocessor
from markdown.extensions import Extension
originalMarkdown = '''
## New Article
Lorem ipsum dolor sit amet, consectetur adipiscing elit. In pretium nunc ligula. Quisque bibendum vel lectus sed pulvinar. Phasellus id magna ac arcu iaculis facilisis. Curabitur tincidunt sed ipsum vel lacinia. Nulla et semper urna. Quisque ultrices hendrerit magna nec tempor.
![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif)
Quisque accumsan sem mi. Nunc orci justo, laoreet vel metus nec, interdum euismod ipsum.
![Bugs Bunny](http://www.nationalnannies.com/wp-content/uploads/2012/03/bugsbunny.png)
Suspendisse augue odio, pharetra ac erat eget, volutpat ornare velit. Sed ac luctus quam. Sed id mauris erat. Duis lacinia faucibus metus, nec vehicula metus consectetur eu.
'''
updatedMarkdown = ""
# First create the treeprocessor
class AltTextExtractor(Treeprocessor):
    def run(self, doc):
        "Find all alt_text and append to markdown.alt_text. "
        self.markdown.alt_text = []
        for image in doc.findall('.//img'):
            self.markdown.alt_text.append(image.get('alt'))
# Then traverse the markdown file and concatenate the alt text to the end of any image tags
class ImageTagUpdater(Treeprocessor):
    def run(self, doc):
        # Set a counter
        count = 0
        # Go through markdown file line by line
        for line in doc:
            # if line is an image tag
            if line > ('.//img'):
                # grab the array of the alt text
                img_ext = ImgExtractor(md)
                # find the second to last character of the line
                location = len(line) - 1
                # insert the alt text
                line += line[location] + '?' + '"' + img_ext[count] + '"'
                # add line to new markdown file
                updatedMarkdownadd.add(line)
The above code is pseudo code. I'm able to successfully extract the strings I need from the original file but I'm unable to concatenate those strings to their respective image tags and update the original file.

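Provided your files aren't huge, it is easier to read the whole file, change the lines in memory, and write everything back than to try to wedge little bits in here or there.
orig = '![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif)'
new = '![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif?alt-text="Daffy Duck")'
# filename is the path to your markdown file
with open(filename, 'r') as f:
    # splitlines() drops the trailing newline, so each line can be compared with orig directly
    lines = f.read().splitlines()
# swap in the new tag wherever a line is exactly the old tag; end the file with a newline
new_text = "\n".join(new if line == orig else line for line in lines) + "\n"
with open(filename, 'w') as f:
    f.write(new_text)
You could also use a regex with re.sub, but I suppose it's a matter of preference.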
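A minimal sketch of the re.sub route, in case it helps. It assumes every image is written as ![alt](url) with no title part and that the URL has no query string yet. The names IMAGE_TAG and add_alt_text are made up for this example.

```python
import re

# Hypothetical pattern: group 1 is the alt text, group 2 is the URL.
IMAGE_TAG = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)\)')

def add_alt_text(markdown_text):
    """Append ?alt-text="<alt>" to the URL of every image tag in the text."""
    return IMAGE_TAG.sub(r'![\1](\2?alt-text="\1")', markdown_text)

sample = 'Intro\n![Daffy Duck](http://www.nonstick.com/wp-content/uploads/2015/10/Daffy.gif)\nOutro'
updated = add_alt_text(sample)
print(updated)
```

Because the pattern captures the alt text itself, the same call handles every image in the file without listing each tag by hand.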

Related

How can I wrap text into a paragraph of x characters without importing any modules?

I have a list of words (lowercase) parsed from an article. I joined them together using .join() with a space into a long string. Punctuation will be treated like words (ie. with spaces before and after).
I want to write this string into a file with at most X characters (in this case, 90 characters) per line, without breaking any words. Each line cannot start with a space or end with a space.
As part of the assignment I am not allowed to import modules, which rules out textwrap; from my understanding, it would have helped.
I have basically a while loop nested in a for loop that goes through every 90 characters of the string, and firstly checks if it is not a space (ie. in the middle of a word). The while loop would then iterate through the string until it reaches the next space (ie. incorporates the word unto the same line). I then check if this line, minus the leading and trailing whitespaces, is longer than 90 characters, and if it is, the while loop iterates backwards and reaches the character before the word that extends over 90 characters.
x = 0
for i in range(89, len(text), 90):
    while text[i] != " ":
        i += 1
    if len(text[x:i].strip()) > 90:
        while text[i - 1] != " ":
            i = i - 1
    file.write("".join(text[x:i]).strip() + "\n")
    x = i
The code works for 90% of the file after comparing with the file with correct outputs. Occasionally there are lines where it would exceed 90 characters without wrapping the extra word into the next line.
EX:
Actual Output on one line (93 chars):
extraordinary thing , but i never read a patent medicine advertisement without being impelled
Expected Output with "impelled" on new line (84 chars + 8 chars):
extraordinary thing , but i never read a patent medicine advertisement without being\nimpelled
Are there better ways to do this? Any suggestions would be appreciated.
You could use a "buffer" to hold each output line as you build it. As you read each new word, check whether adding it to the buffer would exceed the line length. If it would, print the buffer and then reset it, starting with the word that did not fit.
data = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis a risus nisi. Nunc arcu sapien, ornare sit amet pretium id, faucibus et ante. Curabitur cursus iaculis nunc id convallis. Mauris at enim finibus, fermentum est non, fringilla orci. Proin nibh orci, tincidunt sed dolor eget, iaculis sodales justo. Fusce ultrices volutpat sapien, in tincidunt arcu. Vivamus at tincidunt tortor. Sed non cursus turpis. Sed tempor neque ligula, in elementum magna vehicula in. Duis ultricies elementum pellentesque. Pellentesque pharetra nec lorem at finibus. Pellentesque sodales ligula sed quam iaculis semper. Proin vulputate, arcu et laoreet ultrices, orci lacus pellentesque justo, ut pretium arcu odio at tellus. Maecenas sit amet nisi vel elit sagittis tristique ac nec diam. Suspendisse non lacus purus. Sed vulputate finibus facilisis."""
sentence_limit = 40
buffer = ""
for word in data.split():
    word_length = len(word)
    buffer_length = len(buffer)
    if word_length > sentence_limit:
        print(f"ERROR: the word '{word}' is longer than the sentence limit of {sentence_limit}")
        break
    if buffer_length + word_length < sentence_limit:
        if buffer:
            buffer += " "
        buffer += word
    else:
        print(buffer)
        buffer = word
print(buffer)
OUTPUT
Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Duis a risus nisi. Nunc
arcu sapien, ornare sit amet pretium id,
faucibus et ante. Curabitur cursus
iaculis nunc id convallis. Mauris at
enim finibus, fermentum est non,
fringilla orci. Proin nibh orci,
tincidunt sed dolor eget, iaculis
sodales justo. Fusce ultrices volutpat
sapien, in tincidunt arcu. Vivamus at
tincidunt tortor. Sed non cursus turpis.
Sed tempor neque ligula, in elementum
magna vehicula in. Duis ultricies
elementum pellentesque. Pellentesque
pharetra nec lorem at finibus.
Pellentesque sodales ligula sed quam
iaculis semper. Proin vulputate, arcu et
laoreet ultrices, orci lacus
pellentesque justo, ut pretium arcu odio
at tellus. Maecenas sit amet nisi vel
elit sagittis tristique ac nec diam.
Suspendisse non lacus purus. Sed
vulputate finibus facilisis.
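The same buffer idea can be sketched as a function that returns the lines instead of printing them, which makes it easier to write them to a file afterwards. This is only an illustration: the name wrap_words is made up, and it assumes a word longer than the limit is simply placed on a line of its own.

```python
def wrap_words(text, width):
    """Greedy word wrap without imports; returns a list of lines."""
    lines = []
    current = ''
    for word in text.split():
        if not current:
            # first word on a line always goes in, even if it is too long
            current = word
        elif len(current) + 1 + len(word) <= width:
            # the +1 accounts for the space that joins the words
            current += ' ' + word
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines

sample = ('extraordinary thing , but i never read a patent medicine '
          'advertisement without being impelled')
wrapped = wrap_words(sample, 90)
print('\n'.join(wrapped))
```

With the 93-character example from the question, "impelled" moves to the second line because adding it would push the first line past 90 characters.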
Using a regular expression:
import re
with open('f0.txt', 'r') as f:
    # the file must be one long single line of text
    text = f.read().rstrip()
for line in re.finditer(r'(.{1,70})(?:$|\s)', text):
    print(line.group(1))
Another approach, without regex:
# Constant
J = 70
# output list
out = []
with open('f0.txt', 'r') as f:
    # assumes file is 1 long line of text
    line = f.read().rstrip()
i = 0
while i+J < len(line):
    idx = line.rfind(' ', i, i+J)
    if idx != -1:
        out.append(line[i:idx])
        i = idx+1
    else:
        out.append(line[i:i+J] + '-')
        i += J
out.append(line[i:]) # get ending line portion
for line in out:
    print(line)
Here are the file contents (1 long single string):
I have basically a while loop nested in a for loop that goes through every 90 characters of the string, and firstly checks if it is not a space (ie. in the middle of a word). The while loop would then iterate through the string until it reaches the next space (ie. incorporates the word unto the same line). I then check if this line, minus the leading and trailing whitespaces, is longer than 90 characters, and if it is, the while loop iterates backwards and reaches the character before the word that extends over 90 characters.
Output:
I have basically a while loop nested in a for loop that goes through
every 90 characters of the string, and firstly checks if it is not a
space (ie. in the middle of a word). The while loop would then
iterate through the string until it reaches the next space (ie.
incorporates the word unto the same line). I then check if this line,
minus the leading and trailing whitespaces, is longer than 90
characters, and if it is, the while loop iterates backwards and
reaches the character before the word that extends over 90 characters.

Split chat log file at date (with regex) and count number of messages per month

I have several chat history logs and would like to count the number of messages sent and received per month. Some messages correspond to one line in the text file, but not all of them. Therefore, I want to split the messages at the date and time. Then I want to extract the month and year from each date, and count the number of messages and adjust this number in a dictionary. Finally, I want to print the month/year and the number of messages.
This is what the source file looks like (dates are d/m/Y):
09/10/2017, 10:55 - Name omitted: Lorem ipsum dolor sit amet, consectetur adipiscing elit.
09/10/2017, 11:17 - Name omitted: Pellentesque massa tellus, porttitor et iaculis vitae, sodales ac mauris.
Aliquam ullamcorper dictum laoreet. Proin ornare ultrices eros, ut fermentum ex accumsan at. Curabitur dignissim massa a nisi molestie, id hendrerit elit convallis.
Etiam tincidunt gravida arcu, vel lacinia tellus dignissim eu. Praesent ullamcorper neque eu tellus interdum, in semper nibh sagittis. Fusce dignissim sollicitudin mauris in tempus. Sed in magna ante.
09/10/2017, 11:29 - Name omitted: Nam eu risus laoreet, commodo neque eget, tincidunt risus. Suspendisse eu ullamcorper metus.
And this is my code, which unfortunately is not working. I get a long list of 1 as a result:
import os
import re
nummessages = {}
datafiles = ("file1.txt", "file2.txt")
for file in datafiles:
    with open(file, "r", encoding="utf8") as infile:
        for line in infile:
            regexdate = re.compile("([0-9]{2})(\/)([0-9]{2})(\/)([0-9]{4})(,)(\s)([0-9]{2})(:)([0-9]{2})")
            messages = regexdate.split(line)
            for message in messages:
                key = re.search("([0-9]{2})(\/)([0-9]{4})", message)
                value = message.count(message)
                if key in nummessages.keys():
                    nummessages[key].append(value)
                else:
                    nummessages[key] = [value]
for key in sorted(nummessages.items()):
    print(str(key[0]) + "\t" + str(key[1]))
My desired output looks like this:
09/2017: 45 messages
10/2017: 10 messages
...
What am I doing wrong? (FYI, I am new to Python)
Try this.
The main idea in this solution is to parse the month and year from each log line and use them as the key in the data dictionary. For every line that starts with a date in the same month and year, the dictionary's value is incremented by 1.
import re
data = {}  # defined outside the loops so the counts accumulate across files
for file in datafiles:
    with open(file, "r", encoding="utf8") as infile:
        for l in infile:
            # only lines that start with a date begin a new message
            m = re.match(r'\d{2}/(\d{2})/(\d{4})', l)
            if m:
                key = '{}/{}'.format(m.group(1), m.group(2))
                if key not in data:
                    data[key] = 0
                data[key] += 1
# printing
for k in data:
    print('{}: {} messages'.format(k, data[k]))
Here l refers to each line in your log files.
Using collections.defaultdict
Ex:
import re
from collections import defaultdict
result = defaultdict(int)
with open(file, "r", encoding="utf8") as infile:  # file is one of the paths in datafiles
    for line in infile:  # iterate over each line
        line = line.strip()
        m = re.match(r"(\d{2}/(\d{2})/(\d{4}))", line)  # check whether the line starts with a date
        if m:
            result["{}/{}".format(m.group(2), m.group(3))] += 1  # form month/year and count it
print(result)
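To get from the counts to the "09/2017: 45 messages" layout asked for, something like the following might work. It is a sketch under assumptions: the function names are made up, a message is taken to start only on lines shaped like "dd/mm/yyyy, hh:mm - ", and the sample lines are shortened from the question.

```python
import re
from collections import Counter

# A line starts a new message only if it begins with "dd/mm/yyyy, hh:mm - ".
MESSAGE_START = re.compile(r'\d{2}/(\d{2})/(\d{4}), \d{2}:\d{2} - ')

def count_messages_per_month(lines):
    """Count message-starting lines, keyed by (year, month)."""
    counts = Counter()
    for line in lines:
        m = MESSAGE_START.match(line)
        if m:
            month, year = m.groups()
            counts[(year, month)] += 1
    return counts

def format_counts(counts):
    """Return 'mm/yyyy: N messages' strings in chronological order."""
    return ['{}/{}: {} messages'.format(month, year, n)
            for (year, month), n in sorted(counts.items())]

sample_lines = [
    '09/10/2017, 10:55 - Name omitted: Lorem ipsum dolor sit amet.',
    '09/10/2017, 11:17 - Name omitted: Pellentesque massa tellus.',
    'Aliquam ullamcorper dictum laoreet.',  # continuation line, not counted
    '09/11/2017, 11:29 - Name omitted: Nam eu risus laoreet.',
]
report = format_counts(count_messages_per_month(sample_lines))
print('\n'.join(report))
```

Keying on (year, month) rather than the "mm/yyyy" string means a plain sort puts the months in chronological order.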

Python .replace() function not working correctly

I'm trying to figure out why the .replace function in Python isn't working correctly. I spent the entire day yesterday searching for an answer but have not found one.
I'm trying to open and read a file, copy it into a list, count the number of lines in the list and remove all the punctuation (ie , . ! ? etc). I can do everything except remove the punctuation (and I must use the .replace function instead of importing a module).
with open('Small_text_file.txt', 'r') as myFile: #adding lines from file to list
    contents = myFile.readlines()
fileList= []
# punctuation = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]
for i in contents:
    fileList.append(i.rstrip())
print('The Statistics are:\n','Number of lines:', len(fileList)) #first part of question
for item in fileList:
    fileList = item.replace(',', "")
    fileList = item.replace('.', "")
print(fileList)
The "Small text file" is:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Vivamus condimentum sagittis lacus? laoreet luctus ligula laoreet ut.
Vestibulum ullamcorper accumsan velit vel vehicula?
Proin tempor lacus arcu. Nunc at elit condimentum, semper nisi et, condimentum mi.
In venenatis blandit nibh at sollicitudin. Vestibulum dapibus mauris at orci maximus pellentesque.
Nullam id elementum ipsum. Suspendisse
Running the code returns the following:
The Statistics are:
Number of lines: 6
Nullam id elementum ipsum Suspendisse
So the code DOES remove the comma and period characters but it also removes the preceding 5 lines of the text and only prints the very last line. What am I doing wrong here?
Use enumerate:
for x, item in enumerate(fileList):
    fileList[x] = item.replace(',', "").replace('.', "")
Note: item.replace() returns a new string, which you need to store at the right index of the list. enumerate lets you keep track of the index while iterating over the list.
It should be
for i,item in enumerate(fileList):
    fileList[i] = item.replace(',', "").replace('.', "")
Without enumerate,
for i in range(len(fileList)):
    fileList[i] = fileList[i].replace(',', "").replace('.', "")
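To cover the whole punctuation list from the question rather than just commas and periods, one option is to loop over the marks and call .replace() for each. This is a sketch: strip_punctuation is a made-up helper name, and the sample lines are taken from the question's text file.

```python
# Punctuation list taken from the commented-out line in the question.
PUNCTUATION = ['(', ')', '?', ':', ';', ',', '.', '!', '/', '"', "'"]

def strip_punctuation(line):
    """Remove every listed punctuation mark using only str.replace()."""
    for mark in PUNCTUATION:
        # replace() returns a new string, so rebind the name each time
        line = line.replace(mark, '')
    return line

file_list = [
    'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
    'Vestibulum ullamcorper accumsan velit vel vehicula?',
]
# Build a new list so the original list is never overwritten by a string.
cleaned = [strip_punctuation(line) for line in file_list]
print(cleaned)
```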

Python long string to justified text of x characters long lines

As an assignment I have to take in a long string of text then output it justified with each line being x characters long.
The method I am currently trying is not working and I cannot figure out why; it just gets stuck in an infinite loop.
I would appreciate some help with debugging my code.
code:
words = 'Etiam rhoncus. Maecenas tempus, tellus eget condimentum rhoncus, sem quam semper libero, sit amet adipiscing sem neque sed ipsum. Nam quam nunc, blandit vel, luctus pulvinar, hendrerit id, lorem. Maecenas nec odio et ante tincidunt tempus. Donec vitae sapien ut libero venenatis faucibus. Nullam quis ante. Etiam sit amet orci eget eros faucibus tincidunt. Duis leo. Sed fringilla mauris sit amet nibh. Donec sodales sagittis magna. Sed consequat, leo eget bibendum sodales, augue velit cursus nunc, quis gravida magna mi a libero. Fusce vulputate eleifend sapien. Vestibulum purus quam, scelerisque ut, mollis sed, nonummy id, metus. Nullam accumsan lorem in dui. Cras ultricies mi eu turpis hendrerit fringilla. Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere cubilia Curae; In ac dui quis mi consectetuer lacinia.'.split()
max_len = 60
line = ''
lines = []
for word in words:
    if len(line) + len(word) <= max_len:
        line += (' ' + word)
    else:
        lines.append(line.strip())
        line = ''
import re
def JustifyLine(oline, maxLen):
    if len(oline) < maxLen:
        s = 1
        nline = oline
        while len(nline) < maxLen:
            match = '\w(\s{%i})\w' % s
            replacement = ' ' * (s + 1)
            nline = re.sub(match, replacement, nline, 1)
            if len(re.findall(match, nline)) == 0:
                s = s + 1
                replacement = s + 1
            elif len(nline) == maxLen:
                return nline
    return oline
for l in lines[:-1]:
    string = JustifyLine(l, max_len)
    print(string)
Your major problem is that you are replacing letter-whitespace-letter with more white space, deleting the letters on either side of it. So your line never gets longer, and your loop never terminates.
Put the letters in their own groups, and add references (e.g., \1) to the replacement string.
Stephen's answer gives you a bit more than I was going to give you.
Suggestions for the future:
Work out which loop isn't terminating, e.g. by adding print statements to the suspect loops, with a different character for each.
Print out the key values for the loop condition and check that they are heading the right way, in this case the length of nline. If it isn't increasing every time through the loop, you need to worry that it won't terminate.
Think carefully before having two loop exits (the condition on the loop and the return); it can make it harder to reason about the behaviour.
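The capture-group fix could look roughly like this. It is a sketch, not a drop-in replacement: justify_line is a made-up name, \S is swapped in for \w so that gaps after punctuation also count, and lines with fewer than two words are returned unchanged so the loop cannot spin forever.

```python
import re

def justify_line(line, max_len):
    """Pad gaps between words, left to right, until the line is max_len long."""
    line = ' '.join(line.split())  # normalise to single spaces first
    if ' ' not in line or len(line) >= max_len:
        return line  # nothing to pad, or no gap to put padding in
    gap = 1
    while len(line) < max_len:
        # groups 1 and 2 keep the characters on either side of the gap
        pattern = r'(\S) {%d}(\S)' % gap
        replacement = r'\1' + ' ' * (gap + 1) + r'\2'
        line, replaced = re.subn(pattern, replacement, line, count=1)
        if replaced == 0:
            gap += 1  # every gap is already wider; move to the next width
    return line

example = justify_line('Etiam rhoncus. Maecenas tempus, tellus eget', 60)
print(repr(example))
```

Each pass through the loop either makes the line one character longer or raises the gap width, so the loop condition always moves toward termination.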

get words from large file, using low memory in python

I need to iterate over the words in a file. The file could be very big (over 1TB), the lines could be very long (maybe just one line). Words are English, so reasonable in size. So I don't want to load in the whole file or even a whole line.
I have some code that works, but it may explode if lines are too long (longer than ~3GB on my machine).
def words(file):
    for line in file:
        words=re.split("\W+", line)
        for w in words:
            word=w.lower()
            if word != '': yield word
Can you tell me how I can, simply, rewrite this iterator function so that it does not hold more than needed in memory?
Don't read line by line, read in buffered chunks instead:
import re
def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        for word in (w.lower() for w in words if w):
            yield word
    if buffer:
        yield buffer.lower()
I'm using the callable-and-sentinel version of the iter() function to handle reading from the file until file.read() returns an empty string; I prefer this form over a while loop.
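For anyone unfamiliar with the two-argument form of iter(), here is a small illustration. The helper name read_chunks is made up, and the sentinel '' assumes a file opened in text mode (a binary file would need b'' instead).

```python
from io import StringIO

def read_chunks(file, size):
    """Yield successive chunks of at most `size` characters until read() returns ''."""
    # iter(callable, sentinel) keeps calling the callable until it returns the sentinel
    return iter(lambda: file.read(size), '')

chunks = list(read_chunks(StringIO('abcdefgh'), 3))
print(chunks)
```

Only one chunk is held in memory at a time, which is what keeps the words() generator's footprint small.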
If you are using Python 3.3 or newer, you can use generator delegation here:
def words(file, buffersize=2048):
    buffer = ''
    for chunk in iter(lambda: file.read(buffersize), ''):
        words = re.split(r"\W+", buffer + chunk)
        buffer = words.pop()  # partial word at end of chunk or empty
        yield from (w.lower() for w in words if w)
    if buffer:
        yield buffer.lower()
Demo using a small chunk size to demonstrate this all works as expected:
>>> from io import StringIO
>>> demo = StringIO('''\
... Lorem ipsum dolor sit amet, consectetur adipiscing elit. Pellentesque in nulla nec mi laoreet tempus non id nisl. Aliquam dictum justo ut volutpat cursus. Proin dictum nunc eu dictum pulvinar. Vestibulum elementum urna sapien, non commodo felis faucibus id. Curabitur
... ''')
>>> for word in words(demo, 32):
...     print(word)
...
lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
pellentesque
in
nulla
nec
mi
laoreet
tempus
non
id
nisl
aliquam
dictum
justo
ut
volutpat
cursus
proin
dictum
nunc
eu
dictum
pulvinar
vestibulum
elementum
urna
sapien
non
commodo
felis
faucibus
id
curabitur
