Using python-docx to iterate over a document and change text - python

The Problem:
I am supposed to create a word file with 250x5 placeholders.
The background is akin to a database export into a word document.
So I created the first placeholder
and copied the block x250.
Now I need the number to be different for each block; so the second block of 250 would read
I thought to myself, surely this is faster to automate than to individually type 1000+ numbers.
But here I am, stuck, because my Python expertise is very limited.
What I have so far:
from docx import Document
import os
document = Document("Mapping.docx")
paragraph = document.paragraphs[10]
def replaceNumbers(ReplaceNumber):
    number = 0
    for i in range(len(document.paragraphs)):
        if number == 5:
            replaceNumbers(ReplaceNumber + 1)
        y = document.paragraphs[i].text
        if "1" in y:
            document.paragraphs[i].text = document.paragraphs[i].text.replace("1", str(ReplaceNumber))
            number = number + 1
if __name__ == '__main__':
Now I am guessing there is a problem in my usage of the "while" loop, as the program gets stuck, but manually debugging seems impossible with so many imported libraries.
I hope someone can help me here
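One likely reason the program gets stuck: once number reaches 5 it is never reset, so every later paragraph triggers another recursive call, and each call rescans the document from the first paragraph. A single pass with a counter avoids the recursion. Below is a minimal sketch of that idea on plain strings. The placeholder style (<<Name_1>>), the function name renumber_blocks and the pattern are assumptions for illustration, since the real placeholder text is not shown in the question.

```python
import re

def renumber_blocks(texts, per_block=5, pattern=r"_1\b"):
    """Give the n-th group of `per_block` placeholder lines the number n."""
    renumbered = []
    seen = 0  # placeholder lines handled so far
    for text in texts:
        if re.search(pattern, text):
            block_number = seen // per_block + 1
            text = re.sub(pattern, "_" + str(block_number), text)
            seen += 1
        renumbered.append(text)
    return renumbered

# Hypothetical placeholders, two per block to keep the sample short.
sample = ["<<Name_1>>", "<<City_1>>", "", "<<Name_1>>", "<<City_1>>"]
print(renumber_blocks(sample, per_block=2))
```

With python-docx, the same function could be fed [p.text for p in document.paragraphs], with each result assigned back to the matching paragraph's text before calling document.save(). Note that assigning to a paragraph's text replaces its runs, so character-level formatting inside that paragraph may be lost.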


Changing row-height and column-width in LibreOffice Calc using Python3

I want to write a LibreOffice Calc document from within a Python3 program. Using pyoo I can do almost everything I want, including formatting and merging cells. But I cannot adjust row heights and column widths.
I found "Change the column width and row height" very helpful, and have been experimenting with it, but I can't seem to get quite the result I want. My present test file, based on the answer mentioned above, looks like this:
#! /usr/bin/python3
import os, pyoo, time, uno
s = '-'
while s != 'Y':
    s = input("Have you remembered to start Calc? ").upper()
os.popen("soffice --accept=\"socket,host=localhost,port=2002;urp;\" --norestore --nologo --nodefault")
desktop = pyoo.Desktop('localhost', 2002)
doc = desktop.create_spreadsheet()
class ofic:
    sheet_idx = 0
    row_num = 0
    sheet = None
o = ofic()
uno_localContext = uno.getComponentContext()
uno_resolver = uno_localContext.ServiceManager.createInstanceWithContext("", uno_localContext )
uno_ctx = uno_resolver.resolve( "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext" )
uno_smgr = uno_ctx.ServiceManager
uno_desktop = uno_smgr.createInstanceWithContext( "", uno_ctx)
uno_model = uno_desktop.getCurrentComponent()
uno_controller = uno_model.getCurrentController()
uno_sheet_count = 0
doc.sheets.create("Page {}".format(1), index=o.sheet_idx)
o.sheet = doc.sheets[o.sheet_idx]
o.sheet[0, 0].value = "The quick brown fox jumps over the lazy dog"
o.sheet[1, 1].value = o.sheet_idx
uno_sheet_count += 1
uno_active_sheet = uno_model.CurrentController.ActiveSheet
uno_columns = uno_active_sheet.getColumns()
uno_column = uno_columns.getByName("B")
uno_column.Width = 1000
The main problem with the above is that I have 2 Calc documents on the screen, one of which is created before the Python program gets going; the other is created from Python with a pyoo function. The first document gets the column width change, and the second receives the text input etc. I want just the second document, and of course I want the column width change applied to it.
I am sure the answer must be fairly straightforward, but after hours of experimentation I still can't find it. Could someone point me in the right direction, please?
Your code alternates between pyoo and straight Python-UNO, so it's no wonder that it's giving messy results. Pick one or the other. Personally, I use straight Python-UNO and don't see the benefit of adding the extra pyoo library.
the other is created from Python with a pyoo function
Do you mean this line of code from your question, and is this the "second document" that you want the column change applied to?
doc = desktop.create_spreadsheet()
If so, then get objects from that document instead of whichever window the desktop happens to have selected.
controller = doc.getCurrentController()
sheets = doc.getSheets()
Or perhaps you want the other document, the one that didn't get created from Python. In that case, grab a reference to that document before creating the second one.
first_doc = uno_desktop.getCurrentComponent()
second_doc = desktop.create_spreadsheet()
controller = first_doc.getCurrentController()
sheets = first_doc.getSheets()
If you don't have a reference to the document, you can find it by iterating through the open windows.
oComponents = desktop.getComponents()
oDocs = oComponents.createEnumeration()
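A rough sketch of how that enumeration could be walked to pick out the Calc documents. It assumes desktop is the UNO desktop object (uno_desktop in the question), not the pyoo wrapper; find_spreadsheets is a made-up helper name.

```python
def find_spreadsheets(desktop):
    """Return every open component that reports itself as a Calc document."""
    service = "com.sun.star.sheet.SpreadsheetDocument"
    found = []
    enumeration = desktop.getComponents().createEnumeration()
    while enumeration.hasMoreElements():
        component = enumeration.nextElement()
        # Not every open component necessarily offers supportsService.
        if hasattr(component, "supportsService") and component.supportsService(service):
            found.append(component)
    return found
```

Each returned document can then be used like first_doc above, for example find_spreadsheets(uno_desktop)[0].getSheets().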
Finally, how to resize a column. The link in your question is for Excel and VBA (both from Microsoft), so I'm not sure why you think that would be relevant. Here is a Python-UNO example of resizing columns.
oColumns = oSheet.getColumns()
oColumn = oColumns.getByName("A")
oColumn.Width = 7000
oColumn = oColumns.getByName("B")
oColumn.OptimalWidth = True

Storing Multi-dimensional Lists?

(Code below)
I'm scraping a website and the data I'm getting back is in 2 multi-dimensional arrays. I'm wanting everything to be in a JSON format because I want to save this and load it in again later when I add "tags".
So, less vague. I'm writing a program which takes in data like what characters you have and what missions are requiring you to do (you can complete multiple at once if the attributes align), and then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays, 1 for the headers of the table and one for the rows of the table. The rows contain the "Answers" for the Header's "Questions" / "Titles" ; ie Maximum Level, 50
This is true for everything but the first entry which is the Name, Pronunciation (and I just want to store the name of course).
Iterations = 0
While loop based on RowArray length / 9 (While Iterations <= that)
HeaderArray[0] gives me the name
RowArray[Iterations + 1] gives me data type 2
RowArray[Iterations + 2] gives me data type 3
Repeat until Array[Iterations + 8]
Iterations +=9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not? Because my end goal here is to send "CharacterName" and get stuff back based on that AND be able to send in "DesiredTraits" and get "CharacterNames who fit that trait" back. Which means I also need to figure out how to store that category data semi-efficiently. There's over 80 possible categories and most only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - don't want a file for each character.
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os
target_dir = r"D:\00Coding\Js\WebScraper" #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir,'TsumData.txt')
StartURL = ''
URLPrefix = ''
def make_soup(url):
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.text, 'lxml')
    return soup
def get_links(url):
    soup = make_soup(url)
    a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
    links = [urljoin(URLPrefix, a['href']) for a in a_tags] # convert relative url to absolute url
    return links
def get_tds(link):
    soup = make_soup(link)
    #tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
    tds = soup.find_all('table', class_="wikia-infobox")
    RowArray = []
    HeaderArray = []
    if tds:
        for td in tds:
            #print(td.text.strip()) #This is everything
            rows = td.findChildren('tr')#[0]
            headers = td.findChildren('th')#[0]
            for row in rows:
                cells = row.findChildren('td')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub(r'\s+', ' ', cell_content).strip()
                    if clean_content:
                        RowArray.append(clean_content) #assumed: this line was missing from the post
            for row in rows:
                cells = row.findChildren('th')
                for cell in cells:
                    cell_content = cell.getText()
                    clean_content = re.sub(r'\s+', ' ', cell_content).strip()
                    if clean_content:
                        HeaderArray.append(clean_content) #assumed: this line was missing from the post
    return(RowArray, HeaderArray)
#Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
#print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
#TempFile = open(fullname, 'w') #Read only, Write Only, Append
if __name__ == '__main__':
    links = get_links(StartURL)
    MainHeaderArray = []
    MainRowArray = []
    MaxIterations = 60
    Iterations = 0
    for link in links: #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
        #print("Getting tds calling")
        if Iterations > 38: #There are this many webpages it'll first look at that don't have the data I need
            TempRA, TempHA = get_tds(link)
            MainRowArray.append(TempRA) #assumed: the append lines were missing from the post
            MainHeaderArray.append(TempHA)
            MaxIterations -= 1
        Iterations += 1
        if MaxIterations <= 0: #I don't want to scrape the entire website for a prototype
            #print("This is the end ??")
            break #assumed: missing from the post
    #jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
    TsumName = []
    TsumSeries = []
    TsumBoxType = []
    TsumSkillDescription = []
    TsumFullCharge = []
    TsumMinScore = []
    TsumScoreIncreasePerLevel = []
    TsumMaxScore = []
    TsumFullUpgrade = []
    Iterations = 0
    MaxIterations = len(MainRowArray)
    while Iterations <= MaxIterations: #This will fire 1 time per Tsum
        print(MainHeaderArray[Iterations][0]) #Holy this gives us Mickey ;
        #At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
        #Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
        TsumSeries[Iterations] = MainRowArray[Iterations+1]
        TsumBoxType[Iterations] = MainRowArray[Iterations+2]
        TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
        TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
        TsumMinScore[Iterations] = MainRowArray[Iterations+5]
        TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
        TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
        TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
        Iterations += 9
    print("It's Over")
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
Edit 3: Alright, just got back for a short break. With some minor looking over I see what happened - my append code is saying "Append this list to the array" meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my google searches or wondering why the list stuff isn't working (as it's expecting 1 value and gets a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use the with open file as json_file , write/read (super easy).
Ultimately stored 3 json files. No big deal. Much easier than appending into one big file.
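For anyone landing here later, a minimal sketch of the "with open ... as json_file" approach. It assumes the scraped data has already been reshaped into one dictionary keyed by character name; the field names, the sample values and the helper names (save_tsums, load_tsums, rank_by_traits) are made up for illustration and are not the code I actually shipped.

```python
import json
import os
import tempfile

# Hypothetical records keyed by character name.
tsums = {
    "Mickey": {"Series": "Mickey & Friends",
               "Categories": ["Black", "White Gloves", "Mickey & Friends"]},
    "Sulley": {"Series": "Monsters, Inc.",
               "Categories": ["Blue", "Male", "Monsters, Inc."]},
}

def save_tsums(data, path):
    with open(path, "w", encoding="utf-8") as json_file:
        json.dump(data, json_file, indent=1)

def load_tsums(path):
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)

def rank_by_traits(data, desired_traits):
    """Return (name, match_count) pairs, best match first, skipping zero matches."""
    wanted = set(desired_traits)
    scored = [(name, len(wanted & set(info["Categories"])))
              for name, info in data.items()]
    return sorted([s for s in scored if s[1] > 0], key=lambda s: -s[1])

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "TsumData.json")
    save_tsums(tsums, path)
    loaded = load_tsums(path)

print(loaded["Mickey"]["Series"])
print(rank_by_traits(loaded, ["Blue", "Male", "Black"]))
```

Keying by name covers the "send CharacterName, get stuff back" lookup, and storing the categories as a list per character keeps the 80+ possible categories sparse. The ranking gives the "these match 2, these match 1" style of answer.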

Python - PseudoCode for functions and operands

I am quite new to the PseudoCode concept... I would like to get an idea of how functions and operands like modulus, floor division and the likes would be written in PseudoCode. Writing the PseudoCode for this code might actually help me understand better...
user_response=input("Input a number: ")
def string (our_input):
    if (our_input % 15) == 0 :
        return ("fizzbuzz")
    elif (our_input % 3) == 0 :
        return ("fizz")
    elif (our_input % 5) == 0 :
        return ("buzz")
    else :
        return ("null")
Your code really isn't that hard to understand assuming the knowledge of % as the modulus operation. Floor division can really just be written as a regular division. Floor the result, if really necessary.
If you don't know how to express them, then explicitly write modulus(x, y) or floor(z).
Pseudocode is meant to be readable, not complicated.
Pseudocode could even be just words, and not "code". The pseudo part of it comes from the logical expression of operations.
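To make the mapping concrete, here is a small sketch of how the pseudocode spellings modulus(x, y) and floor(z) line up with Python's own operators. The function names simply mirror the pseudocode; they are not built-ins.

```python
import math

def modulus(x, y):
    """Remainder of x divided by y (Python's % operator)."""
    return x % y

def floor_division(x, y):
    """Quotient rounded down (Python's // operator)."""
    return x // y

print(modulus(17, 5))         # 2
print(floor_division(17, 5))  # 3
print(math.floor(17 / 5))     # 3, regular division and then floor
```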
The basic idea of PseudoCode is either to
a) Make complicated code understandable, or
b) Express an idea that you are going to code/haven't yet figured out how to code.
For example, if I am going to make a tool that needs to read information in from a database, parse it into fields, get just the info that the user requests, then format the information and print it to the screen, my first draft of the code would be simple PseudoCode as so:
# Header information
# Get user input
# Connect to Database
# Read in values from database
# Gather useful information
# Format information
# Print information
This gives me a basic structure for my program, that way I don't get lost as I'm making it. Also, if someone else is working with me, we can divvy up the work (He works on the code to connect to the database, and I work on the code to get the user input.)
As the program progresses, I would be replacing PseudoCode with real, working code.
# Header information
user_input_row = int(input("Which row (1-10)? "))
user_input_column = input("Which column (A, B, C)? ")
dbase = dbconn("My_Database")
row_of_interest = dbase.getrow(user_input_row)
# Gather useful information
# Format information
# Print information
At any point I might realize there are other things to do in the code, and if I don't want to stop what I'm working on, I add them in to remind myself to come back and code them later.
# Header information #Don't forget to import the database dbconn class
user_input_row = int(input("Which row (1-10)? "))
#Protect against non-integer inputs so that the program doesn't fail
user_input_column = input("Which column (A, B, C)? ")
#Make sure the user gives a valid column before connecting to the database
dbase = dbconn("My_Database")
#Verify that we have a connection to the database and that the database is populated
row_of_interest = dbase.getrow(user_input_row)
# Separate the row by columns -- use .split()
# >> User only wants user_input_column
# Gather useful information
# Format information
# >> Make the table look like this:
#   C        C1       C2     < User's choice
#  _________|________|_______
#   Title   | Field  | Group
# Print information
After you're done coding, the old PseudoCode can even serve to be good comments to your program so that another person will know right away what the different parts of your program are doing.
PseudoCode also works really well when asking a question when you don't know how to code something but you know what you want, for example if you had a question about how to make a certain kind of loop in your program:
my_list = [0,1,2,3,4,5]
for i in range(len(my_list)) but just when i is even:
    print (my_list[i]) #How do I get it to print out when i is even?
The PseudoCode helps the reader know what you're trying to do, so they can help you more easily.
In your case, useful PseudoCode for things like explaining your way through code might look like:
user_response=input("Input a number: ") # Get a number from user as a string
our_input=float(user_response) # Change that string into a float
def string (our_input):
    if (our_input % 15) == 0 : # If our input is divisible by 15
        return ("fizzbuzz")
    elif (our_input % 3) == 0 : # If our input is divisible by 3 but not 15
        return ("fizz")
    elif (our_input % 5) == 0 : # If our input is divisible by 5 but not 15
        return ("buzz")
    else : # If our input is not divisible by 3, 5 or 15
        return ("null")
print(string(our_input)) # Print out response
user_response=input("Input a number: ")
def string (our_input):
    if (our_input % 15) == 0 :
        return ("fizzbuzz")
    elif (our_input % 3) == 0 :
        return ("fizz")
    elif (our_input % 5) == 0 :
        return ("buzz")
Request an input from the user
Ensure that the input is a number
If the input is divisible by 15
    send "fizzbuzz" to the main program
But if the input is divisible by 3 and not 15
    send "fizz" to the main program
But if the input is divisible by 5 and not 15 or 3
    send "buzz" to the main program
Display the result.
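Going the other way, the pseudocode above could be turned back into Python roughly like this. It is only a sketch: the names classify and parse_number are invented here, and returning None for "anything else" is an assumption, since the pseudocode does not say what to send in that case.

```python
def classify(number):
    if number % 15 == 0:
        return "fizzbuzz"
    elif number % 3 == 0:
        return "fizz"
    elif number % 5 == 0:
        return "buzz"
    return None  # assumption: the pseudocode does not cover this case

def parse_number(text):
    """Return text as an int, or None when it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None

# Fixed sample inputs stand in for "Request an input from the user".
for raw in ["30", "9", "10", "7", "abc"]:
    number = parse_number(raw)
    result = classify(number) if number is not None else "not a number"
    print(raw, "->", result)
```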

How to optimize the overall processing time in this code?

I have written code that takes a set of documents as a list and a set of words as another list. For each document, it checks whether the document contains any word from the word list, and I build a sentence from the words that are found.
import re
# find whether the whole word is in the sentence; returns None if it is not.
def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search
for data in dataset['abc']:
    mvc = ''
    for x in newdataset['Words']:
        y = findWholeWord(x)(data)
        if y != None:
            mvc = mvc+" "+ x
When I run this code on 10,000 documents with an average word count of 10, it takes a very long time. How can I optimize this code? Or are there alternatives that provide the same functionality?
Since you just want to check if a word exists in the set of abc, you don't need to use re.
for raw_data in dataset['abc']:
    data = raw_data.lower()
    mvc = ''
    for x in newdataset['Words']:
        if x.lower() in data:
            mvc = mvc+" "+ x
Are you sure that this code is slow? I am not. I think most of the time is spent opening files. You need to profile your code, as Will says. You can also use multiprocessing to speed it up.
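If profiling does point at the matching itself, one alternative is to split each document into a set of lowercase tokens once and then test each word with a set lookup, which keeps the whole-word behaviour of the original regex without compiling a pattern per word. This is a sketch with made-up sample data, and it assumes every entry in the word list is a single word, not a phrase.

```python
import re

def words_found(document, words):
    """Return the words (in list order) that occur as whole words in document."""
    tokens = set(re.findall(r"\w+", document.lower()))
    return [w for w in words if w.lower() in tokens]

documents = ["The quick brown fox", "A lazy dog sleeps"]
words = ["fox", "dog", "cat", "the"]
for doc in documents:
    print(" ".join(words_found(doc, words)))
```

Unlike a plain substring test, this does not report "cat" for a document that only contains "category".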

appending array breaks program

I am writing a program to analyze some of our invoice data. Basically, I need to take an array containing each individual invoice we sent out over the past year and break it down into twelve arrays, each containing the invoices for one month, using the dateSeperate() function, so that monthly_transactions[0] returns January's transactions, monthly_transactions[1] returns February's, and so forth.
I've managed to get it working so that dateSeperate returns monthly_transactions[0] as the January transactions. However, once all of the January data is entered, I attempt to append to the monthly_transactions array using line 44. This just causes the program to break and become unresponsive. The code still executes and doesn't return an error, but Python becomes unresponsive and I have to force quit out of it.
I've been writing to the global array monthly_transactions. dateSeperate runs fine as long as I don't include the last else statement; in that case, monthly_transactions[0] returns an array containing all of the January invoices. The issue arises in my last else statement, which, when added, causes Python to freeze.
Can anyone help me shed any light on this?
I have written a program that defines all of the arrays I'm going to be using (yes, I know global arrays aren't good; I'm a marketer trying to learn programming, so any input you could give me on how to improve this would be much appreciated):
import csv
line_items = []
monthly_transactions = []
accounts_seperated = []
Then I import all of my data and place it into the line_items array
def csv_dict_reader(file_obj):
    global board_info
    reader = csv.DictReader(file_obj, delimiter=',')
    for line in reader:
        item = []
        item.append(line["company id"])
        item.append(line["user id"])
        item.append(line["Transaction Date"])
        item.append(line["FIrst Transaction"])
        line_items.append(item) #assumed: this line was missing from the post
if __name__ == "__main__":
    with open("ChurnTest.csv") as f_obj:
        csv_dict_reader(f_obj) #assumed: this line was missing from the post
#formats the transaction date data to make it more readable
def dateFormat():
    for i in range(len(line_items)):
        ddmmyyyy =(line_items[i][3])
        yyyymmdd = ddmmyyyy[6:] + "-"+ ddmmyyyy[:2] + "-" + ddmmyyyy[3:5]
        line_items[i][3] = yyyymmdd
#Takes the line_items array and splits it into new array monthly_transactions, where each value holds one month of data
def dateSeperate():
    for i in range(len(line_items)):
        #if there are no values in the monthly transactions, add the first line item
        if len(monthly_transactions) == 0:
            test = []
        # check to see if the line items year & month match a value already in the monthly_transaction array.
        for j in range(len(monthly_transactions)):
            line_year = line_items[i][3][:2]
            line_month = line_items[i][3][3:5]
            array_year = monthly_transactions[j][0][3][:2]
            array_month = monthly_transactions[j][0][3][3:5]
            #print(line_year, array_year, line_month, array_month)
            #If it does, add that line item to that month
            if line_year == array_year and line_month == array_month:
            #Otherwise, create a new sub array for that month
I would really, really appreciate any thoughts or feedback you guys could give me on this code.
Based on the comments on the OP, your csv_dict_reader function seems to do exactly what you want it to do, at least inasmuch as it appends data from its argument csv file to the top-level variable line_items. You said yourself that if you print out line_items, it shows the data that you want.
"But appending doesn't work." I take it you mean that appending the line_items to monthly_transactions isn't being done. The reason for that is that you didn't tell the program to do it! The appending that you're talking about is done as part of your dateSeperate function; however, you still need to call the function.
I'm not sure exactly how you want to use your dateFormat and dateSeperate functions, but in order to use them, you need to call them from the main block, i.e. dateFormat() and dateSeperate().
EDIT: The last else: section is the likely cause of the freeze. It extends monthly_transactions by 1 whenever the line/array year/month aren't equal, and it sits inside the loop for j in range(len(monthly_transactions)):. Every existing month that doesn't match adds another sub-array, so the list keeps growing with each line item and the program appears to hang. Decide whether a matching month exists first, and create a new sub-array only once, after the inner loop has finished.
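One way to sidestep the inner loop entirely is to group the rows in a dictionary keyed by (year, month). This is a sketch only: the name separate_by_month is made up, the sample rows are invented, and it assumes the raw dates look like dd/mm/yyyy and sit at index 2 of each row.

```python
from collections import defaultdict

def separate_by_month(items, date_index=2):
    """Group rows by (year, month), assuming dates look like 'dd/mm/yyyy'."""
    by_month = defaultdict(list)
    for item in items:
        day, month, year = item[date_index].split("/")
        by_month[(int(year), int(month))].append(item)
    # One sub-list per month, in chronological order.
    return [by_month[key] for key in sorted(by_month)]

# Hypothetical rows: company id, user id, transaction date, first transaction.
line_items = [
    ["c1", "u1", "15/01/2016", "yes"],
    ["c2", "u2", "03/02/2016", "no"],
    ["c3", "u3", "28/01/2016", "no"],
]
monthly_transactions = separate_by_month(line_items)
print(len(monthly_transactions))  # 2
print(monthly_transactions[0])    # both January rows
```

Because the function returns the grouped list, there is no need for a global monthly_transactions, and each row is handled exactly once.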

