Trouble setting table width using Python Docx - python

I have to create numerous PDFs of tables every year, so I am writing a script with python-docx to build each table in Word, with each column having its own set width and left or right alignment. Right now I am working on two tables: one with 262 rows, the other with 1036 rows.
The code works great for the table with 262 rows but will not set the column widths correctly for the table with 1036 rows. Since the code is identical for both tables, I suspect the problem lies with the data itself, or possibly the size of the table. I tried creating the second, larger table without any data, and the widths are correct. I then tried creating the table below with a subset of 7 rows from the 1036, including the rows with the largest numbers of characters, in case a column was not wrapping the text and was instead forcing the column widths to change. It runs fine and the widths are correct. When I use the exact same code on the full data set of 1036 rows, the widths change. Any ideas?
Below is the code for only 7 rows of data. It works correctly: the first column is 3.5 inches, the second and third columns are 1.25 inches.
from docx import Document
from docx.shared import Cm, Pt, Inches
import docx
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.table import WD_ALIGN_VERTICAL

# NOTE: 'file' (the key of the data file currently being processed) and
# 'path' (the output folder) are set earlier in the full script.
year = '2019'
set_width_dic = {
    'PUR'+year+'_subtotals_indexed_by_site.txt': (Inches(3.50), Inches(1.25), Inches(1.25)),
    'PUR'+year+'_subtotals_indexed_by_chemical.txt': (Inches(3.50), Inches(1.25), Inches(1.25))}
set_alignment_dic = {
    'PUR'+year+'_subtotals_indexed_by_site.txt': [[0, WD_ALIGN_PARAGRAPH.LEFT, 'Commodity or site'], [1, WD_ALIGN_PARAGRAPH.RIGHT, 'Pounds Applied'], [2, WD_ALIGN_PARAGRAPH.RIGHT, 'Agricultural Applications']],
    'PUR'+year+'_subtotals_indexed_by_chemical.txt': [[0, WD_ALIGN_PARAGRAPH.LEFT, 'Chemical'], [1, WD_ALIGN_PARAGRAPH.RIGHT, 'Pounds Applied'], [2, WD_ALIGN_PARAGRAPH.RIGHT, 'Agricultural Applications']]}

# the data
list_of_lists = [['Chemical', 'Pounds Applied', 'Agricultural Applications'],
                 ['ABAMECTIN', '51,276.54', '69,659'],
                 ['ABAMECTIN, OTHER RELATED', '0.03', 'N/A'],
                 ['S-ABSCISIC ACID', '1,856.38', '230'],
                 ['ACEPHATE', '158,054.76', '11,082'],
                 ['SULFUR', '49,038,554.00', '170,396'],
                 ['BACILLUS SPHAERICUS 2362, SEROTYPE H5A5B, STRAIN ABTS 1743 FERMENTATION SOLIDS, SPORES AND INSECTICIDAL TOXINS', '11,726.29', 'N/A']]

doc = docx.Document()  # create an instance of a Word document
col_ttl = 3   # number of columns: matches the headings in the first list in list_of_lists
row_ttl = 7   # number of rows: matches the total number of lists in list_of_lists

# Create the table object
table = doc.add_table(rows=row_ttl, cols=col_ttl)
table.style = 'Light Grid Accent 1'

for r in range(len(list_of_lists)):
    row = table.rows[r]
    widths = set_width_dic[file]  # e.g. widths = (Inches(3.50), Inches(1.25), Inches(1.25))
    for c, cell in enumerate(table.rows[r].cells):  # c is the column index, cell is the empty cell of the table
        table.cell(r, c).vertical_alignment = WD_ALIGN_VERTICAL.BOTTOM
        table.cell(r, c).width = widths[c]
        par = cell.add_paragraph(str(list_of_lists[r][c]))
        for l in set_alignment_dic[file]:
            if l[0] == c:
                par.alignment = l[1]

doc.save(path + 'try.docx')
When I run the exact same code on the entire list_of_lists (a list of 1036 lists), the widths are incorrect: column 1 = 4.23", column 2 = 1.04", and column 3 = 0.89".
I printed the full 1036-row list_of_lists in my cmd window and pasted it into a text file, thinking I might be able to include it here. However, when I attempted to run the full list, it would not paste back into the cmd window: it gave an EOL error and only showed the first 65 lists in the list_of_lists. python-docx is able to make the full table, it just won't set the correct widths. I am baffled. I have looked through every Stack Exchange python-docx table-width post I can find, and many other Google results. Any thoughts much appreciated.

Just figured out the issue: I needed to set autofit to False. Now the code works for the longer table as well:
table.autofit = False
table.allow_autofit = False
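To make the fix concrete, here is a minimal sketch of where those two lines go relative to the rest of the code above (row_ttl, col_ttl, widths and list_of_lists are the names from the snippet above; this is an illustration, not the full script):

import docx
from docx.shared import Inches

doc = docx.Document()
table = doc.add_table(rows=row_ttl, cols=col_ttl)
table.style = 'Light Grid Accent 1'

# Turn autofitting off before assigning widths; otherwise Word may
# recalculate the column widths from the content of a large table.
table.autofit = False
table.allow_autofit = False

for r in range(len(list_of_lists)):
    for c, cell in enumerate(table.rows[r].cells):
        cell.width = widths[c]   # e.g. (Inches(3.50), Inches(1.25), Inches(1.25))
        cell.text = str(list_of_lists[r][c])

doc.save('try.docx')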

Related

How to divide a pandas data frame into sublists of n at a time?

I have a data frame made of tweets and their authors; there is a total of 45 authors. I want to divide the data frame into groups of 2 authors at a time so that I can later export each group to a CSV file.
I tried the following (given that the authors are in the column named 'B' and the tweets are in the column named 'A'):
I took the following from this question
df.set_index(keys=['B'],drop=False,inplace=True)
authors = df['B'].unique().tolist()
and then, in order to separate the lists:
dgroups = []
for i in range(0, len(authors)-1, 2):
    dgroups.append(df.loc[df.B == authors[i]])
    dgroups.extend(df.loc[df.B == authors[i+1]])
but instead it gives me sub-lists like this:
dgroups = [['A'],['B'],
[tweet,author],
['A'],['B'],
[tweet,author2]]
Prior to this, I was able to divide them correctly into 45 sub-lists, based on the same linked question, as follows:
for i in authors:
    groups.append(df.loc[df.B == i])
So how would I do that for 2 authors, or 3 authors, and so on?
EDIT: From @Jonathan Leon's answer, I thought I would do the following, which works but isn't a dynamic solution and is inefficient, I guess, especially if n > 3:
dgroups = []
for i in range(2, len(authors)+1, 2):
    tempset1 = []
    tempset2 = []
    tempset1 = df.loc[df.B == authors[i-2]]
    if (i-1 != len(authors)):
        tempset2 = df.loc[df.B == authors[i-1]]
        dgroups.append(tempset1.append(tempset2))
    else:
        dgroups.append(tempset1)
This imports the foreign-language text incorrectly, but the logic works to create a new CSV for every two authors.
import pandas as pd

df = pd.read_csv('TrainDataAuthorAttribution.csv')
# df.groupby('B').count()
authors = df.B.unique().tolist()
auths_in_subset = 2
for i in range(auths_in_subset, len(authors) + auths_in_subset, auths_in_subset):
    # print(authors[i - auths_in_subset:i])
    dft = df[df.B.isin(authors[i - auths_in_subset:i])]
    # print(dft)
    dft.to_csv('df' + str(i) + '.csv')
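As a side note on the garbled non-English text: that usually comes down to the file encoding rather than the grouping logic. A hedged variation of the same approach, wrapped in a function and using an explicit encoding (the file name and n = 2 are just the values from this question), might look like:

import pandas as pd

def export_in_author_groups(csv_path, n=2, author_col='B'):
    # Write one CSV per group of n authors, keeping non-ASCII text intact.
    df = pd.read_csv(csv_path, encoding='utf-8')
    authors = df[author_col].unique().tolist()
    for i in range(0, len(authors), n):
        subset = df[df[author_col].isin(authors[i:i + n])]
        subset.to_csv('df{}.csv'.format(i // n), index=False, encoding='utf-8')

export_in_author_groups('TrainDataAuthorAttribution.csv', n=2)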

How to calculate the number of occurrences between data in Excel?

I have a huge CSV table with thousands of rows, and I want to make a table of the number of times two elements occur together, divided by how many times each element appears.
For example, Bitcoin appears 8 times in these rows, 2 of them together with API. API always appears together with Bitcoin, so the value for API appearing with Bitcoin is 1, while Bitcoin appearing with API is 2/8 = 1/4.
I want something that looks like this in the end (the target layout was shown as an image in the original post).
How can I do it with Python or any other tool?
This is a sample of the file (attached as an image in the original post).
This, I think, does do the job. I typed your spreadsheet into a csv by hand (would have been nice to be able to cut and paste), and the results seem reasonable.
import itertools
import csv
import numpy as np

words = {}
for row in open('input.csv'):
    parts = row.rstrip().split(',')
    for a, b in itertools.combinations(parts, 2):
        if a not in words:
            words[a] = [b]
        else:
            words[a].append(b)
        if b not in words:
            words[b] = [a]
        else:
            words[b].append(a)
print(words)

size = len(words)
keys = list(words.keys())
track = np.zeros((size, size))
for i, k in enumerate(keys):
    track[i, i] = len(words[k])
    for j in words[k]:
        track[i, keys.index(j)] += 1
        track[keys.index(j), i] += 1
print(keys)

# Scale to [0,1].
for row in range(track.shape[0]):
    track[row, :] /= track[row, row]

# Create a csv with the results.
fout = open('corresp.csv', 'w')
print(','.join([' '] + keys), file=fout)
for row in range(track.shape[0]):
    print(keys[row], file=fout, end=',')
    print(','.join(f"{track[row,i]}" for i in range(track.shape[1])), file=fout)
Here's the first few lines of the result:
,API,Backend Development,Bitcoin,Docker,Article Rewriting,Article writing,Blockchain,Content Writing,Ghostwriting,Android,Ethereum,PHP,React.js,C Programming,C++ Programming,ASIC,Digital ASIC Coding,Embedded Software,Article Writing,Blog,Copy Typing,Affiliate Marketing,Brand Marketing,Bulk Marketing,Sales,BlockChain,Business Strategy,Non-fungible Tokens,Technical Writing,.NET,Arduino,Software Architecture,Bluetooth Low Energy (BLE),C# Programming,Ada programming,Programming,Haskell,Rust,Algorithm,Java,Mathematics,Machine Learning (ML),Matlab and Mathematica,Data Entry,HTML,Circuit Designs,Embedded Systems,Electronics,Microcontroller, C++ Programming,Python
API,1.0,0.14285714285714285,0.5714285714285714,0.14285714285714285,0.0,0.0,0.2857142857142857,0.0,0.0,0.0,0.14285714285714285,0.0,0.14285714285714285,0.2857142857142857,0.2857142857142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Backend Development,0.6666666666666666,1.0,0.6666666666666666,0.6666666666666666,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bitcoin,0.21052631578947367,0.05263157894736842,1.0,0.05263157894736842,0.0,0.0,0.2631578947368421,0.0,0.0,0.05263157894736842,0.10526315789473684,0.10526315789473684,0.05263157894736842,0.15789473684210525,0.21052631578947367,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.05263157894736842,0.0,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Docker,0.6666666666666666,0.6666666666666666,0.6666666666666666,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I had a look at this by creating a pivot table in Excel for every combination of columns there is: AB, AC, AD, BC, BD, CD. By putting the unique entries from the first column (e.g. A) in the rows, the unique entries from the second (e.g. B) in the columns, and then putting column A in the values area, I find all matches and the count of all matches.
This is a clunky method, but I note that, compared with the Python-based method that has been submitted, my answer is essentially no more or less clunky than that!
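For completeness, here is a hedged pandas/collections sketch of the same co-occurrence idea (it assumes, like the answer above, that input.csv holds one comma-separated list of skills per row); each row of the resulting matrix is divided by that skill's total count, which is the ratio described in the question:

import itertools
from collections import Counter

import pandas as pd

pair_counts = Counter()  # times two skills appear in the same row
item_counts = Counter()  # times each skill appears at all

with open('input.csv') as fh:
    for line in fh:
        parts = [p.strip() for p in line.rstrip().split(',') if p.strip()]
        item_counts.update(set(parts))
        for a, b in itertools.combinations(sorted(set(parts)), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1

keys = sorted(item_counts)
matrix = pd.DataFrame(
    [[pair_counts[(a, b)] / item_counts[a] if a != b else 1.0 for b in keys] for a in keys],
    index=keys, columns=keys)
matrix.to_csv('corresp_pandas.csv')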

arcpy - using an external table to iteratively change parameters in my code

I am working with Python 2.7 and arcpy in a Jupyter notebook environment.
I would like my code to iterate over a reference table.
This is my reference table, which contains the 3 variables that I use for the tool I am running in arcpy:
RegY HunCal CRY
1 1718 BL1
1 1112 JU1
1 1112 JU1
1 1213 JU1
This is a simple xls table which I imported into my Jupyter notebook. I have it as a visual reference for when I have to change these variables in my code.
In the beginning, I was doing it by hand because there were only a few changes to make. But now there are more than 150 changes to adapt, and this number increases with time. Therefore, I would like to modify the code so that it uses the reference table to iterate through every feature each time the reference table changes.
This is the code I am using:
# 2011
# Set geoprocessor object property to overwrite existing output
arcpy.gp.overwriteOutput = True
arcpy.env.workspace = r'C:\Users\GeoData\simSear\SBA_D.gdb'

# Process: Group Similar Features
SS.SimilaritySearch("redD_RegY_1_1112", "blackD_CRY_JU1_1112", "SS_JU1_1112", "NO_COLLAPSE",
                    "MOST_SIMILAR", "ATTRIBUTE_PROFILES", 0,
                    "Temperatur;Precipitat", 'DateFin')
How can I adapt the code in such a way that the variables from the reference table are inserted into my code in the following way:
From the reference table, the value of RegY would replace the 1 in redD_RegY_1_1112.
The value of CRY would replace the JU1 in blackD_CRY_JU1_1112 and SS_JU1_1112.
The value of HunCal would replace the 1112 in redD_RegY_1_1112, blackD_CRY_JU1_1112, and SS_JU1_1112.
Any hint or suggestion would be highly appreciated.
You should iterate through each row of the table to get your reference table values, then use them to build the unique strings for your input, candidate, and output features.
for row in table:
    regY = row[0]
    hunCal = row[1]
    cry = row[2]
    input_features_to_match = 'redD_RegY_{}_{}'.format(regY, hunCal)
    candidate_features = 'blackD_CRY_{}_{}'.format(cry, hunCal)
    output_features = 'SS_{}_{}'.format(cry, hunCal)
    SS.SimilaritySearch(
        input_features_to_match,
        candidate_features,
        output_features,
        'NO_COLLAPSE',
        'MOST_SIMILAR',
        'ATTRIBUTE_PROFILES',
        0,
        'Temperatur;Precipitat',
        'DateFin')
Or much more compactly:
for row in table:
    SS.SimilaritySearch(
        'redD_RegY_{}_{}'.format(row[0], row[1]),
        'blackD_CRY_{}_{}'.format(row[2], row[1]),
        'SS_{}_{}'.format(row[2], row[1]),
        'NO_COLLAPSE',
        'MOST_SIMILAR',
        'ATTRIBUTE_PROFILES',
        0,
        'Temperatur;Precipitat',
        'DateFin')
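The answer leaves table abstract. One way to build it from the xls reference file might be the following sketch (assuming pandas is available in the environment and that the columns are named RegY, HunCal and CRY as shown above; the file path is an illustration only):

import pandas as pd

# Read the reference table; the path is an assumption for illustration.
ref = pd.read_excel(r'C:\Users\GeoData\reference_table.xls')

# Build the (RegY, HunCal, CRY) rows that the loops above expect.
table = ref[['RegY', 'HunCal', 'CRY']].itertuples(index=False, name=None)

for row in table:
    print('redD_RegY_{}_{}'.format(row[0], row[1]))  # e.g. redD_RegY_1_1718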

Are pandas and numpy any good for manipulating non-numeric data?

I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole, trying to extract that data and then manipulate it with the numpy toolsets. The delivered data is one-dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray magically makes it all good, except that I lose the headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding based on some feedback on another post, and that's fine, but I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I would call a non-mathematical / non-numerical purpose makes me feel like I'm barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system is names, dates, and other textual data. I then have another CSV file that I need to use as a lookup, so that I can enrich the source with more textual information, which finally gets published to Excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate it with the (matching) lookup data.
    5) Export the now enhanced data to excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers

    col_list = ["WorkCode", "Version", "Principal", "Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # CSV file has got more than WorkCode, because there are different versions (as different records)
    # Only want the last one.

    # each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.

    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv('filepath', index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully, for manipulating non-numerical data. I just had to learn some basics.
The biggest lesson was to understand the pandas DataFrame as an object that has to be manipulated through various functions/tools rather than edited directly. Just because I "print" the DataFrame doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, how they can be altered/manipulated, and how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
Cheers
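For what it's worth, a hedged sketch of what that kind of "enrichment" can look like once everything stays in pandas (the column names follow the samples above, Id is assumed to correspond to WorkCode, and "keep only the last version" is handled with drop_duplicates):

import pandas as pd

# Source data straight from the GIS export, kept as a DataFrame so the headers survive.
source = pd.DataFrame(exportFile)  # exportFile from FeatureClassToNumPyArray, as above

# Lookup data, keeping only the latest Version per Id.
lookup = (pd.read_csv('lookup.csv',
                      usecols=['Id', 'Version', 'Utility/Principal', 'Principal Contractor Contact'])
            .sort_values('Version')
            .drop_duplicates('Id', keep='last'))

# Enrich: left-join the lookup onto the source by the work/permit code.
enriched = source.merge(lookup, left_on='WorkCode', right_on='Id', how='left')

enriched.to_excel('finalReport.xlsx', index=False)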

Extract information from website using Xpath, Python

Trying to extract some useful information from a website. I got part of the way, but now I'm stuck and in need of your help!
I need the information from this table
http://gbgfotboll.se/serier/?scr=scorers&ftid=57700
I wrote this code and I got the information that I wanted:
import lxml.html
from lxml.etree import XPath

url = "http://gbgfotboll.se/serier/?scr=scorers&ftid=57700"
rows_xpath = XPath("//*[@id='content-primary']/div[1]/table/tbody/tr")
name_xpath = XPath("td[1]//text()")
team_xpath = XPath("td[2]//text()")
league_xpath = XPath("//*[@id='content-primary']/h1//text()")

html = lxml.html.parse(url)
divName = league_xpath(html)[0]
for id, row in enumerate(rows_xpath(html)):
    scorername = name_xpath(row)[0]
    team = team_xpath(row)[0]
    print scorername, team
print divName
I get this error
scorername = name_xpath(row)[0]
IndexError: list index out of range
I do understand why I get the error. What I really need help with is that I only need the first 12 rows. This is what the extraction should do in these three possible scenarios:
If there are less than 12 rows: Take all the rows except THE LAST ROW.
If there are exactly 12 rows: same as above.
If there are more than 12 rows: Simply take the first 12 rows.
How can I do this?
EDIT1
It is not a duplicate. Sure, it is the same site, but I have already done what that question asked for, which was to get all the values from each row. I don't need the last row, and I don't want it to extract more than 12 rows if there are more.
I think this is what you want:
#coding: utf-8
from lxml import etree
import lxml.html

collected = []  # list of lists: [[col1, col2, ...], [col1, col2, ...]]
dom = lxml.html.parse("http://gbgfotboll.se/serier/?scr=scorers&ftid=57700")
xpatheval = etree.XPathDocumentEvaluator(dom)

# all table rows
rows = xpatheval('//div[@id="content-primary"]/div/table[1]/tbody/tr')

# If there are 12 rows or fewer: take all the rows except the last.
if len(rows) <= 12:
    rows.pop()
else:
    # If there are more than 12 rows: simply take the first 12 rows.
    rows = rows[0:12]

for row in rows:
    # all columns of the current table row (Spelare, Lag, Mal, straffmal)
    columns = row.findall("td")
    # pick textual data from each <td>
    collected.append([column.text for column in columns])

for i in collected: print i
This is how you can get the rows you need based on what you described in your post. It is just the logic, based on the idea that rows is a list; you have to incorporate it into your code as needed.
if len(rows) <= 12:
    print rows[0:-1]
elif len(rows) > 12:
    print rows[0:12]
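Putting that slicing logic back into the question's own loop might look roughly like this (a sketch in the question's Python 2 style, reusing the XPath expressions from the question):

import lxml.html
from lxml.etree import XPath

url = "http://gbgfotboll.se/serier/?scr=scorers&ftid=57700"
rows_xpath = XPath("//*[@id='content-primary']/div[1]/table/tbody/tr")
name_xpath = XPath("td[1]//text()")
team_xpath = XPath("td[2]//text()")

html = lxml.html.parse(url)
rows = rows_xpath(html)

# 12 rows or fewer: drop the last one; more than 12: keep only the first 12.
rows = rows[:-1] if len(rows) <= 12 else rows[:12]

for row in rows:
    scorername = name_xpath(row)[0]
    team = team_xpath(row)[0]
    print scorername, team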
