OpenPyXL Using Built-In Conditional Formatting ie: Duplicate and Unique Values - python

I am writing a python method that checks a specific column in Excel and highlights duplicate values in red (if any), then copy those rows onto a separate sheet that I will use to check to see why they have duplicate values. This is just for Asset Management where I want to check to make sure there are no two exact serial numbers or Asset ID numbers etc.
At this moment I just want to check the column and highlight duplicate values in red. As of now, I have this method started and it runs it just does not highlight of the cells that have duplicate values. I am using a test sheet with these values in column A,
(336,565,635,567,474,326,366,756,879,567,453,657,678,324,987,667,567,657,567)The number "567" repeats a few times.
def check_duplicate_values(self,wb):
self.wb=wb
ws=self.wb.active
dxf = DifferentialStyle(fill=self.red_fill())
rule = Rule(type="duplicateValues", dxf=dxf, stopIfTrue=None, formula=['COUNTIF($A$1:$A1,A1)>1'])
ws.conditional_formatting.add('Sheet1!$A:$A',rule) #Not sure if I need this
self.wb.save('test.xlsx')
In Excel, I can just create a Conditional Format rule to accomplish this however in OpenPyXL I am not sure if I am using their built-in methods correctly. Also, could my formula be incorrect?

Whose built-in methods are you referring to? openpyxl is a file format library and, hence, allows you manage conditional formats as they are stored in Excel worksheets. Unfortunately, the details of the rules are not very clear from the specification so form of reverse engineering from an existing is generally required, though it's probably worth noting that rules created by Excel are almost always more verbose than actually required.
I would direct further questions to the openpyxl mailing list.

Just remove the formula and you're good to go.
duplicate_rule = Rule(type="duplicateValues", dxf=dxf, stopIfTrue=None)
You can also use unique rule:
unique_rule = Rule(type="uniqueValues", dxf=dxf, stopIfTrue=None)
Check this out for more info: https://openpyxl.readthedocs.io/en/stable/_modules/openpyxl/formatting/rule.html#RuleType

Related

Splitting a DataFrame to filtered "sub - datasets"

So I have a DataFrame with several columns, some contain objects (string) and some are numerical.
I'd like to create new dataframes which are "filtered" to the combination of the objects available.
To be clear, those are my object type columns:
Index(['OS', 'Device', 'Design',
'Language'],
dtype='object')
["Design"] and ["Language"] have 3 options each.
I filtered ["OS"] and ["Device"] manually as I needed to match them.
However, now I want to create multiple variables each contains a "filtered" dataframe.
For example:
I have
"android_fltr1_d1" to represent the next filter:
["OS"]=android, ["Device"]=1,["Design"]=1
and "android_fltr3_d2" to represent:
["OS"]=android, ["Device"]=3,["Design"]=2
I tried the next code (which works perfectly fine).
android_fltr1_d1 = android_fltr1[android_fltr1["Design"]==1].drop(["Design"],axis=1)
android_fltr1_d2 = android_fltr1[android_fltr1["Design"]==2].drop(["Design"],axis=1)
android_fltr1_d3 = android_fltr1[android_fltr1["Design"]==3].drop(["Design"],axis=1)
android_fltr3_d1 = android_fltr3[android_fltr3["Design"]==1].drop(["Design"],axis=1)
android_fltr3_d2 = android_fltr3[android_fltr3["Design"]==2].drop(["Design"],axis=1)
android_fltr3_d3 = android_fltr3[android_fltr3["Design"]==3].drop(["Design"],axis=1)
android_fltr5_d1 = android_fltr5[android_fltr5["Design"]==1].drop(["Design"],axis=1)
android_fltr5_d2 = android_fltr5[android_fltr5["Design"]==2].drop(["Design"],axis=1)
android_fltr5_d3 = android_fltr5[android_fltr5["Design"]==3].drop(["Design"],axis=1)
As you can guess, I don't find it efficient and would like to use a for loop to generate those variables (as I'd need to match each ["Language"] option to each filter I created. Total of 60~ variables).
Thought about using something similar to .format() in the loop in order to be some kind of a "place-holder", couldn't find a way to do it.
It would be probably the best to use a nested loop to create all the variables, though I'd be content even with a single loop for each column.
I find it difficult to build the for loop to execute it and would be grateful for any help or directions.
Thanks!
As suggested I tried to find my answer in:How do I create variable variables?
Yet I failed to understand how I use the globals() function in my case. I also found that using '%' is not working anymore.

Read data from CSV and write data to CSV - String to integer

I have a CSV file with 100,000 rows.
Each row in column A is a sentence comprised of both chars and integers.
I want column B to contain only integers.
I want the new columns to be in the same CSV file.
How can I accomplish this?
If I'm understanding your question correctly, I would use .isdigit() to parse the data in column A. I'm frankly not sure what the format of column A is, so I don't know exactly what you would do with this (if you gave more information I could give a more specific answer). Your solution will likely come in a similar form to this:
def find(lines):
B = []
for line in lines:
numbers = [c for c in line if c.isdigit()]
current = int(''.join(numbers))
# current is the concatenation of all
# integers found in column A from left to right
B.append(current)
return B
Let me know if this makes sense or is even in the right track for your solution. Once again, without knowing what you're trying to do, and what A looks like, I'm not sure what your actual goals are.
EDIT
I'm not going to explain the csv stuff for you, mainly because there is a fantastic resource and library for it included in python here. If you have specific questions related to writing csv, definitely post them.
It sounds like you essentially want to pull int values out of column A then add them to a new column B. There are definitely many ways to solve this, but the general form of the problem is for each row you'll filter out the int, then you'll add the filtered int into the new column. I'll list a couple:
Regex: You could use a pattern such as [0-9]+ to pull the string out of A, then use int(whatever that output is) to cast to int, then store those values in B. I'm a sucker for a good regular expression and this one is fairly straight forward. Regexr is a great resource to learn about this and test your pattern.
Use an algorithm similar to above: The above algorithm worked before, but I've updated it slightly. Now that it's been updated it'll return an array of numbers correspondent to numbers in A from left to right. This is relatively sound, but it doesn't necessarily guarantee you have the right integer, given that if the title has an int in it, it'll mess some things up. It is likely one of the more clear ways of doing this, though.

how can i change the size of a table in word using python (pywin32)

ms word table with python
I am working with python on word tables, i am generating tables, but all of them are
auto fit window..
is it possible to change it to auto fit contents?
i had tried something like this:
table = location.Tables.Add(location,len(df)+1,len(df.columns)
table.AutoFit(AutoFitBehavior.AutoFitToContents)
but it keeps to raise errors
You want to change you table creation to use this:
//''#Add two ones after your columns
table = location.Tables.Add(location,len(df)+1,len(df.columns),1,1)
Information about why you need those variables can be read here:
http://msdn.microsoft.com/en-us/library/office/ff845710(v=office.15).aspx
But basically, the default behavior is to disable Cell Autofitting and Use Table Autofit to Window. The first "1" enables Cell Autofitting. From the link I posted above, the DefaultTableBehavior can either be wdWord8TableBehavior (Autofit disabled --default) or wdWord9TableBehavior (Autofit enabled). The number comes from opening up Word's VBA editor and typing in the Immediate Window:
?Word.wdWord9TableBehavior
Next, from the link, we see another option called AutoFitBehavior. This is defined as:
Sets the AutoFit rules for how Word sizes tables. Can be one of the WdAutoFitBehavior constants.
So now we have another term to look up. In the VBA editor's Immediate window again type:
?Word.wdAutoFitBehavior.
After the last dot, the possible options should appear. These will be:
wdAutoFitContent
wdAutoFitFixed
wdAutoFitWindow
AutoFitContent looks to be the option we want, so let's finish up that previous line with:
?Word.wdAutoFitBehavior.wdAutoFitContent
The result will be a "1".
Now you may ask, why do we have to go through all this trouble finding the numerical representations of the values. From my experience, with using pywin32 with Excel, is that you can't get the Built-in values, from the string, most of the time. But putting in the numerical representation works just the same.
Also, One more reason for why your code may be failing is that the table object may not have a function "Autofit".
I'm using Word 2007, and Table has the function, AutoFitBehavior.
So Change:
table.AutoFit(AutoFitBehaviour.AutoFitToContent)
To:
table.AutoFitBehavior(1)
//''Which we know the 1 means wd.wdAutoFitBehavior.wdAutoFitContent
Hope I got it right, and this helps you out.

XlDirectionDown and selecting filled cells with Python

I've already asked the root question but I thought I might see if I can get more help with this. I'm trying to work with XlDirectionDown in order to select the last filled cell in an Excel spreadsheet.
Ultimately, I'd like to use Python to select all filled cells in this sheet from A through AE. It will be copied into a text file and appended into SQL Server...so I don't want any blanks.
What I have so far:
import win32com.client as win32
excel = win32.gencache.EnsureDispatch('Excel.Application')
excel.Visible = 1;
excel.Workbooks.Open('G:/working.xlsx')
XlDirectionDown = 4
last = excel.Range("A:A").End(XlDirectionDown)
excel.Range("A1:A"+str(last)).Select()
First of all, the XlDirectionDown does not seem to work. The cursor in Excel remains on the first cell.
Secondly, I get an exception for the last line in this code (something to do with Range). Does anybody understand what's going on with this code? Also, is there ANY documentation on win32com or Pywin32 out there?? I can't find any how-to's! Thanks as always everyone.
I have used a specific cell rather than range of cells as starting point. Replace
last = excel.Range("A:A").End(XlDirectionDown)
with
last = excel.Range("A1:A1").End(XlDirectionDown)
However if there are any blank cells, this will stop just before it. You probably want to use UsedRange() instead. This will be the smallest range that contains all your cells, according to Excel: you may find (as I have) that resulting range is wider than AE (contains blank columns at end), and contains many entirely blank rows at the bottom. However, since you want to filter out blank cells anyways, those will be skipped during filtering.
As to the exception on last line of code, this is because End returns a Range object, and you can't convert a range to a string, or if you can then str(last) is a range so "A1:A"+str(last) will be an invalid range.
As to filtering out blank cells, I'm not sure what that means: when you copy the data to a text file, what will you put for blank cells? If you have "A blank C" will you put "A C"? The C will end up in wrong column of your database. Anyways just something that caught my attention.
There is no single place for documentation for win32com, although the Python on Windows book has a lot of info, and google gets you results quite useful, including SO hits. The one thing that keeps tripping me whenever I use Excel COM (this is not specific to python's win32com) is that everything in a workbook is a Range, you can't have an individual cells, even when some methods or properties might lead you to think you are getting a cell you're actually getting a range, it often requires a bit of extra thinking about how to go about getting to the desired cell.
I got started with win32com and Excel here.
In your code, what does excel.Range("A:A").End(XlDirectionDown) return? Test it. You might want to add .Select(), and then use excel.Selection.Address to get the last cell. Test it in interactive mode, it's easier to see what's going on there.
As an alternative, you can use a while loop to go through your cells. This code is looping the rows until an empty cell:
excel.Range("A1").Select()
while excel.ActiveCell.Value:
val = excel.ActiveCell.Value
print(val)
excel.ActiveCell.Offset(2,1).Select() # Move a row down
The last line is a bit funny; in VBA you should write Offset(1,0) to go one row down. However in Python you have to add one to both row and column. Maybe due to indexing?

openpyxl please do not assume text as a number when importing

There are numerous questions about how to stop Excel from interpreting text as a number, or how to output number formats with openpyxl, but I haven't seen any solutions to this problem:
I have an Excel spreadsheet given to me by someone else, so I did not create it. When I open the file with Excel, I have certain values like "5E12" (clone numbers, if anyone cares) that appear to display correctly, but there's a little green arrow next to each one warning me that "This appears to be a number stored as text". Excel then asks me if I would like to convert it to a number, and if I saw yes, I get 5000000000000, which then converts automatically to scientific notation and displays 5E12 again, only this time a text output would show the full number with zeroes. Note that before the conversion, this really is text, even to Excel, and I'm only being warned/offered to convert it.
So, when reading this file in with openpyxl (from openpyxl.reader.excel import load_workbook), the 5E12 is getting converted automatically to 5000000000000. I assume that openpyxl is making the same assumption that Excel made, only the conversion happens without a prompt or input on my part.
How can I prevent this from happening? I do not want text that look like "numbers stored as text" to convert to numbers. They are text unless I say so.
So far, the only solution I have found is to add single quotes to the front of each cell, but this is not an ideal solution, as it's manual labor rather than a programmatic solution. Also, the solution needs to be general, since I don't always know where this problem might occur (I'm reading millions of lines per day, so I don't want to have to do anything by hand).
I think this is a problem with openpyxl. There is a google group discussion from the beginning of 2011 that mentions this problem, but assumes it's too rare to matter. https://groups.google.com/forum/?fromgroups=#!topic/openpyxl-users/HZfpShMp8Tk
So, any suggestions?
If you want to use openpyxl again (for whatever reason), the following changes to the worksheet reader routine do the trick of keeping the strings as strings:
diff --git a/openpyxl/reader/worksheet.py b/openpyxl/reader/worksheet.py
--- a/openpyxl/reader/worksheet.py
+++ b/openpyxl/reader/worksheet.py
## -134,8 +134,10 ##
data_type = element.get('t', 'n')
if data_type == Cell.TYPE_STRING:
value = string_table.get(int(value))
-
- ws.cell(coordinate).value = value
+ ws.cell(coordinate).set_value_explicit(value=value,
+ data_type=Cell.TYPE_STRING)
+ else:
+ ws.cell(coordinate).value = value
# to avoid memory exhaustion, clear the item after use
element.clear()
The Cell.value is a property and on assignment call Cell._set_value, which then does a Cell.bind_value which according to the method's doc: "Given a value, infer type and display options". As the types of the values are in the XML file those should be taken (here I only do that for strings) instead of doing something 'smart'.
As you can see from the code, the test whether it is a string was already there.

Categories

Resources