Orange - how to load one file to two separate tables? - python

I want to read data from one file, load it to two Orange.data.Tables and change column names in one of these tables. Here's the code:
import Orange
data = Orange.data.Table("sample_data.tab")
data2 = Orange.data.Table(data)
data2.domain = Orange.data.Domain(data.domain)
for inst in data2.domain:
    data2.domain[inst].name = '__' + data2.domain[inst].name
but, for some reason, column names change in both tables. How to change column names in only one of the tables?

Orange reuses column descriptors. This is needed so that if you load data from two files and use one for fitting a model, you can use the other file for testing it. Generally, if a column has the same name and type, it will also have the same descriptor.
You can avoid this by adding createNewOn=0 when loading the data.
d1 = Orange.data.Table("iris")
d2 = Orange.data.Table("iris", createNewOn=0)
The whole story is described here: http://docs.orange.biolab.si/reference/rst/Orange.data.table.html and here: http://docs.orange.biolab.si/reference/rst/Orange.feature.descriptor.html#variable-descriptor-reuse.
There is an error in documentation: create_new_on should be createNewOn. (My example above is correct.)
If the data is already loaded, like in your case, changing the descriptors for one table without losing the data is more complicated. If you really need it, I can show you how, but I guess you don't. If you do so, you'd get two completely unrelated tables.
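To see why renaming through one table shows up in the other, here is a toy model of descriptor sharing in plain Python. It is not Orange code: the Descriptor and Table classes below are hypothetical stand-ins, and the only point is that two domains holding references to the same descriptor objects see each other's renames, while a table built from copied descriptors does not.

```python
import copy


class Descriptor:
    """Hypothetical stand-in for a column descriptor (not an Orange class)."""

    def __init__(self, name):
        self.name = name


class Table:
    """Hypothetical stand-in for a table; only the domain matters here."""

    def __init__(self, domain):
        # A new list, but it holds references to the same descriptor objects.
        self.domain = list(domain)


shared = [Descriptor("sepal length"), Descriptor("sepal width")]

# Two tables built on the same descriptors: a rename is visible in both.
t1 = Table(shared)
t2 = Table(shared)
t2.domain[0].name = "__" + t2.domain[0].name

# A table built on copied descriptors is unrelated to the first one.
t3 = Table(copy.deepcopy(t1.domain))
t3.domain[1].name = "__" + t3.domain[1].name
```

After running this, t1.domain[0].name has the prefix (it shares the descriptor with t2), while t1.domain[1].name is unchanged (t3 owns its own copies).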

Related

Django-Import-Export assign Field to specific column

I have a working form which exports all data to xls using Resource and Fields from the django-import-export package. But now I need to go a bit further and make it a bit more advanced. I need to export all data to an existing Excel template, assigning each field to a specific column, as in, for instance: Field 1 needs to be filled in B3, Field 2 in C4 etc. etc.
I'm thinking after_export() can be used here, but the docs don't go into detail on how to use this method. Does anyone have some experience with this package?
I'm not using the admin integration.
after_export() is a method you can override to modify data before it is returned.
It does nothing by default. However, it supplies the data object, which is a Dataset instance. If you access the dict property, then you will have a list of OrderedDict instances corresponding to each row which is returned by your queryset. Therefore, you can construct an entirely new Dataset (list of dicts) which match up with the format you need.
For example, if you wanted the first column of the first row to appear in cell B3, you could do it like this:
def after_export(self, queryset, data, *args, **kwargs):
    newdata = tablib.Dataset()
    cell1_val = data.dict[0].get("col1", "")
    for i, row in enumerate(data):
        if i == 1:
            newrow = list(row)
            # cell b3
            newrow[2] = cell1_val
            row = newrow
        newdata.append(row)
    data.dict = newdata.dict
However, you'll note that this approach is not easy to read and is quite unwieldy. Therefore if you don't need any other features of django-import-export then using a library like openpyxl is probably a better approach.
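If you do stay with the "build a new list of rows" approach, it may be easier to describe the layout as a mapping from cell addresses to values and generate the rows from that. The sketch below uses only the standard library; the helper names cell_to_index and build_grid and the sample values are made up for illustration and are not part of django-import-export or tablib.

```python
import re


def cell_to_index(address):
    """Convert an A1-style address such as 'B3' to zero-based (row, col)."""
    match = re.fullmatch(r"([A-Za-z]+)([1-9][0-9]*)", address)
    if match is None:
        raise ValueError(f"not a cell address: {address!r}")
    letters, digits = match.groups()
    col = 0
    for ch in letters.upper():
        # Column letters are a base-26 number with A=1 ... Z=26.
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return int(digits) - 1, col - 1


def build_grid(cells, fill=""):
    """Return a rectangular list of rows with each value at its address."""
    if not cells:
        return []
    positions = {cell_to_index(addr): value for addr, value in cells.items()}
    n_rows = max(r for r, _ in positions) + 1
    n_cols = max(c for _, c in positions) + 1
    grid = [[fill] * n_cols for _ in range(n_rows)]
    for (r, c), value in positions.items():
        grid[r][c] = value
    return grid


# Hypothetical values: Field 1 goes to B3, Field 2 goes to C4.
grid = build_grid({"B3": "Field 1", "C4": "Field 2"})
```

Each row of grid can then be appended to a new tablib.Dataset in the same way the answer's loop does with newdata.append(row).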

How to store data collected from different sources to create dataframe?

I am working on creating a dataframe for classification tasks.
Since my data is coming from all kinds of different sources I am wondering what the best way to collect the data step by step would be.
I am starting off with a folder of files, and want to store their path and filename and add new data, such as their label, that I get from a txtfile that is saved somewhere else.
But what is the best way to do that?
I was thinking about a list of dictionary like
data = [{"path": path_to_file_1, "filename" : filename_1, "label" : label_1},
        {"path": path_to_file_2, "filename" : filename_2, "label" : label_2},
        {"path": path_to_file_3, "filename" : filename_3, "label" : label_3}]
and so on.
My idea was to iterate through my folder, collect the information via different functions that I wrote and create a dictionary for each of my files like so:
for filename in folder:
    dict_filename={}
    label=get_label(filename)
    path=get_path(filename)
    dict_filename["label"]=label
    dict_filename["path"]=path
    dict_filename["filename"]=filename
    data.append(dict_filename)
with dict_filename being a dictionary that only contains the information of the file that I am looking at at the moment.
So at the end I would get a list containing all the dictionaries that I created for all of my files.
My questions are:
Is this a way that makes sense or is there a different way that works better/easier/smoother?
If this works, what do I do to create a new dictionary in every loop (I suppose I need a different name for each dictionary so I just don't overwrite my first one with every loop)?
This might be something pretty basic as I am new to Python, but I am grateful for everyone that can help me out!
Thanks in advance!
The dictionary is the way to go on this one. However, there is a lot of redundancy that could be eliminated depending on the structure of your data.
For example, you can use one dictionary to store all the dataframes in this manner:
dfs[filename] = pd.DataFrame(path).rename(label)
This makes accessing the information much easier later on. In addition, you can use:
df = pd.concat(dfs, axis=1)
To combine all your dataframes in the end.
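On the second question (a new dictionary in every loop): you don't need a different variable name per file. Each pass through the loop creates a fresh dict object, and the list keeps a reference to it even after the name is rebound. A minimal sketch, assuming the labels have already been read into a lookup table; LABELS, get_label and the file names here are hypothetical placeholders for your own functions and data:

```python
from pathlib import PurePosixPath

# Hypothetical labels; in the question these come from a separate text file.
LABELS = {"cat_01.png": "cat", "dog_01.png": "dog"}


def get_label(filename):
    """Look up the label for a file name, falling back to 'unknown'."""
    return LABELS.get(filename, "unknown")


def build_records(paths):
    """Return one dictionary per file, collected in a list."""
    records = []
    for path in paths:
        p = PurePosixPath(path)
        # A brand-new dict is created on every iteration, so reusing the
        # name `record` does not overwrite the dicts already in the list.
        record = {
            "path": str(p.parent),
            "filename": p.name,
            "label": get_label(p.name),
        }
        records.append(record)
    return records


records = build_records(["data/cat_01.png", "data/dog_01.png"])
```

A list of dictionaries like this can be passed straight to pd.DataFrame(records), which gives one row per file and one column per key.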

Splitting a DataFrame to filtered "sub - datasets"

So I have a DataFrame with several columns, some contain objects (string) and some are numerical.
I'd like to create new dataframes which are "filtered" to the combination of the objects available.
To be clear, those are my object type columns:
Index(['OS', 'Device', 'Design',
'Language'],
dtype='object')
["Design"] and ["Language"] have 3 options each.
I filtered ["OS"] and ["Device"] manually as I needed to match them.
However, now I want to create multiple variables each contains a "filtered" dataframe.
For example:
I have
"android_fltr1_d1" to represent the next filter:
["OS"]=android, ["Device"]=1,["Design"]=1
and "android_fltr3_d2" to represent:
["OS"]=android, ["Device"]=3,["Design"]=2
I tried the next code (which works perfectly fine).
android_fltr1_d1 = android_fltr1[android_fltr1["Design"]==1].drop(["Design"],axis=1)
android_fltr1_d2 = android_fltr1[android_fltr1["Design"]==2].drop(["Design"],axis=1)
android_fltr1_d3 = android_fltr1[android_fltr1["Design"]==3].drop(["Design"],axis=1)
android_fltr3_d1 = android_fltr3[android_fltr3["Design"]==1].drop(["Design"],axis=1)
android_fltr3_d2 = android_fltr3[android_fltr3["Design"]==2].drop(["Design"],axis=1)
android_fltr3_d3 = android_fltr3[android_fltr3["Design"]==3].drop(["Design"],axis=1)
android_fltr5_d1 = android_fltr5[android_fltr5["Design"]==1].drop(["Design"],axis=1)
android_fltr5_d2 = android_fltr5[android_fltr5["Design"]==2].drop(["Design"],axis=1)
android_fltr5_d3 = android_fltr5[android_fltr5["Design"]==3].drop(["Design"],axis=1)
As you can guess, I don't find it efficient and would like to use a for loop to generate those variables (as I'd need to match each ["Language"] option to each filter I created, for a total of about 60 variables).
Thought about using something similar to .format() in the loop in order to be some kind of a "place-holder", couldn't find a way to do it.
It would be probably the best to use a nested loop to create all the variables, though I'd be content even with a single loop for each column.
I find it difficult to build the for loop to execute it and would be grateful for any help or directions.
Thanks!
As suggested, I tried to find my answer in: How do I create variable variables?
Yet I failed to understand how I use the globals() function in my case. I also found that using '%' is not working anymore.
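One common alternative to generating variable names with globals() is to keep the filtered subsets in a single dictionary keyed by the combination of values. The sketch below shows the idea on plain Python rows so that it runs without pandas; the split_by helper, the "clicks" column and the sample rows are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the DataFrame.
rows = [
    {"OS": "android", "Device": 1, "Design": 1, "clicks": 10},
    {"OS": "android", "Device": 1, "Design": 2, "clicks": 7},
    {"OS": "android", "Device": 3, "Design": 2, "clicks": 4},
    {"OS": "android", "Device": 1, "Design": 1, "clicks": 3},
]


def split_by(rows, keys):
    """Group rows by the values of `keys`, dropping those keys from each row."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[k] for k in keys)
        rest = {k: v for k, v in row.items() if k not in keys}
        groups[key].append(rest)
    return dict(groups)


subsets = split_by(rows, ["OS", "Device", "Design"])

# What would have been the variable android_fltr1_d1:
android_fltr1_d1 = subsets[("android", 1, 1)]
```

With pandas the same dictionary can be built by iterating over df.groupby(["OS", "Device", "Design"]), which yields (key, sub-dataframe) pairs, so adding "Language" to the list of columns covers the remaining combinations without any extra loops.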

OpenPyXL Using Built-In Conditional Formatting ie: Duplicate and Unique Values

I am writing a python method that checks a specific column in Excel and highlights duplicate values in red (if any), then copy those rows onto a separate sheet that I will use to check to see why they have duplicate values. This is just for Asset Management where I want to check to make sure there are no two exact serial numbers or Asset ID numbers etc.
At this moment I just want to check the column and highlight duplicate values in red. As of now, I have this method started and it runs, but it does not highlight any of the cells that have duplicate values. I am using a test sheet with these values in column A:
(336,565,635,567,474,326,366,756,879,567,453,657,678,324,987,667,567,657,567). The number "567" repeats a few times.
def check_duplicate_values(self,wb):
    self.wb=wb
    ws=self.wb.active
    dxf = DifferentialStyle(fill=self.red_fill())
    rule = Rule(type="duplicateValues", dxf=dxf, stopIfTrue=None, formula=['COUNTIF($A$1:$A1,A1)>1'])
    ws.conditional_formatting.add('Sheet1!$A:$A',rule) #Not sure if I need this
    self.wb.save('test.xlsx')
In Excel, I can just create a Conditional Format rule to accomplish this however in OpenPyXL I am not sure if I am using their built-in methods correctly. Also, could my formula be incorrect?
Whose built-in methods are you referring to? openpyxl is a file format library and, hence, allows you to manage conditional formats as they are stored in Excel worksheets. Unfortunately, the details of the rules are not very clear from the specification, so some form of reverse engineering from an existing file is generally required, though it's probably worth noting that rules created by Excel are almost always more verbose than actually required.
I would direct further questions to the openpyxl mailing list.
Just remove the formula and you're good to go.
duplicate_rule = Rule(type="duplicateValues", dxf=dxf, stopIfTrue=None)
You can also use unique rule:
unique_rule = Rule(type="uniqueValues", dxf=dxf, stopIfTrue=None)
Check this out for more info: https://openpyxl.readthedocs.io/en/stable/_modules/openpyxl/formatting/rule.html#RuleType
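The conditional format only changes how Excel displays the cells. For the second part of the question (collecting the duplicate rows for a separate sheet), the duplicates have to be found in Python. A minimal sketch using only the standard library on the test values from the question; reading the values out of the worksheet and writing them to the second sheet are left out here:

```python
from collections import Counter

# Column A test values from the question, in row order.
values = [336, 565, 635, 567, 474, 326, 366, 756, 879, 567,
          453, 657, 678, 324, 987, 667, 567, 657, 567]

counts = Counter(values)

# (row number, value) for every cell whose value occurs more than once.
duplicate_rows = [
    (row, value)
    for row, value in enumerate(values, start=1)
    if counts[value] > 1
]

# Values that occur more than once, with their counts.
duplicates = {value: n for value, n in counts.items() if n > 1}
```

On this data, 567 occurs four times (rows 4, 10, 17 and 19) and 657 occurs twice (rows 12 and 18), so those six cells are the ones the duplicateValues rule should highlight.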

Python, how to insert value in Powerpoint template?

I want to use an existing powerpoint presentation to generate a series of reports:
In my imagination the powerpoint slides will have content in such or similar form:
Date of report: {{report_date}}
Number of Sales: {{no_sales}}
...
Then my python app opens the powerpoint, fills in the values for this report and saves the report with a new name.
I googled, but could not find a solution for this.
There is python-pptx out there, but this is all about creating a new presentation and not inserting values in a template.
Can anybody advise?
Ultimately, barring some other library which has additional functionality, you need some sort of brute force approach to iterate the Slides collection and each Slide's respective Shapes collection in order to identify the matching shape (unless there is some other library which has additional "Find" functionality in PPT). Here is brute force using only win32com:
from win32com import client

find_date = r'{{report_date}}'
find_sales = r'{{no_sales}}'
report_date = '01/01/2016' # Modify as needed
no_sales = '604' # Modify as needed
path = 'c:/path/to/file.pptx'
outpath = 'c:/path/to/output.pptx'
ppt = client.Dispatch("PowerPoint.Application")
pres = ppt.Presentations.Open(path, WithWindow=False)
for sld in pres.Slides:
    for shp in sld.Shapes:
        # Skip pictures and other shapes that cannot hold text
        if not shp.HasTextFrame:
            continue
        # COM objects are not context managers, so bind the range directly
        tr = shp.TextFrame.TextRange
        # Check each placeholder separately: one shape may contain both
        if find_date in tr.Text:
            tr.Replace(find_date, report_date)
        if find_sales in tr.Text:
            tr.Replace(find_sales, no_sales)
pres.SaveAs(outpath)
pres.Close()
ppt.Quit()
If these strings are inside other strings with mixed text formatting, it gets trickier to preserve existing formatting, but it should still be possible.
If the template file is still in design and subject to your control, I would consider giving the shape a unique identifier like a CustomXMLPart or you could assign something to the shapes' AlternativeText property. The latter is easier to work with because it doesn't require well-formed XML, and also because it's able to be seen & manipulated via the native UI, whereas the CustomXMLPart is only accessible programmatically, and even that is kind of counterintuitive. You'll still need to do shape-by-shape iteration, but you can avoid the string comparisons just by checking the relevant property value.
I tried this on a ".pptx" file I had hanging around.
A Microsoft Office PowerPoint ".pptx" file is in ".zip" format.
When I unzipped my file, I got an ".xml" file and three directories.
My ".pptx" file has 116 slides comprised of 3,477 files and 22 directories/subdirectories.
Normally, I would say it is not workable, but since you have only two short changes you probably could figure out what to change and zip the files to make a new ".pptx" file.
A warning: there are some xml blobs of binary data in one or more of the ".xml" files.
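As a rough sketch of that unzip-edit-rezip idea, the function below copies an archive member by member with the standard zipfile module and does a plain text replacement only in the slide parts (ppt/slides/*.xml), leaving everything else byte-for-byte. The function name replace_in_slides is made up, and the demo archive is a stand-in, not a valid presentation. It also assumes each placeholder sits in a single text run in the XML and that the replacement values contain no characters that need XML escaping; neither is guaranteed for a real file.

```python
import io
import zipfile


def replace_in_slides(pptx_bytes, replacements):
    """Return a new archive with placeholders replaced in slide XML parts."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            name = info.filename
            if name.startswith("ppt/slides/") and name.endswith(".xml"):
                text = data.decode("utf-8")
                for old, new in replacements.items():
                    text = text.replace(old, new)
                data = text.encode("utf-8")
            # Every other member (images, layouts, ...) is copied unchanged.
            dst.writestr(name, data)
    return out.getvalue()


# Stand-in archive for demonstration only; NOT a valid presentation.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("ppt/slides/slide1.xml",
                "<a:t>Date of report: {{report_date}}</a:t>")
    zf.writestr("ppt/media/image1.png", b"\x89PNG")

result = replace_in_slides(demo.getvalue(), {"{{report_date}}": "01/01/2016"})
```

For a real file you would read the template with open(path, "rb").read() and write the returned bytes to the output path.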
You can definitely do what you want with python-pptx, just perhaps not as straightforwardly as you imagine.
You can read the objects in a presentation, including the slides and the shapes on the slides. So if you wanted to change the text of the second shape on the second slide, you could do it like this:
slide = prs.slides[1]
shape = slide.shapes[1]
shape.text = 'foobar'
The only real question is how you find the shape you're interested in. If you can make non-visual changes to the presentation (template), you can determine the shape id or shape name and use that. Or you could fetch the text for each shape and use regular expressions to find your keyword/replacement bits.
It's not without its challenges, and python-pptx doesn't have features specifically designed for this role, but based on the parameters of your question, this is definitely a doable thing.
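The regular-expression part is independent of the library used to read the slides. A minimal sketch of a {{name}} substitution in plain Python; the function name fill_placeholders is made up, and leaving unknown placeholders untouched is a design choice, not something python-pptx prescribes:

```python
import re

# Matches {{name}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fill_placeholders(text, values):
    """Replace {{name}} with values[name]; unknown names are left as-is."""
    def substitute(match):
        name = match.group(1)
        if name in values:
            return str(values[name])
        return match.group(0)
    return PLACEHOLDER.sub(substitute, text)


report = {"report_date": "01/01/2016", "no_sales": 604}
filled = fill_placeholders(
    "Date of report: {{report_date}}\nNumber of Sales: {{ no_sales }}",
    report,
)
```

With python-pptx you would call this for every shape where shape.has_text_frame is true, reading shape.text_frame.text and assigning the result back. Keep in mind that assigning the text this way may drop character formatting inside the shape.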
