To compare 2 excel files using python


I have an Excel file with the following fields:
Software_name, Version and Count.
This file is an inventory of all software installed on an organization's network, generated using LANdesk.
I have another Excel file, a purchase inventory of this software, which is maintained manually.
I need to compare these sheets and create a report stating whether the organization is compliant or not.
How do I compare these two files?
There are inconsistencies between them: for instance, Microsoft Office is listed as just 'office', and 'server' is spelt 'svr'.
How do I go about it?

The first step as SeyZ mentions is to determine how you want to read these Excel files. I don't have experience with the libraries he refers to. Instead I use COM programming to read and write Excel files, which of course requires that you have Excel installed. This capability comes from PyWin32 which is installed by default if you use the ActiveState Python installer, or can be installed separately if you got Python from Python.org.
The next step would be to convert things into a common format for comparing, or searching for elements from one file within the other. My first thought here would be to load the contents of the LANdesk software inventory into a database table using something quick and easy like SQLite.
Then for each item of the manual purchase list, normalize the product name and search for it within the inventory table.
Normalizing the values would be a process of splitting a name into parts and replacing partial words and phrases with their full versions. For example, you could create a lookup table of conversions:
partial              full
-------------------- --------------------
svr                  server
srv                  server
SRV                  Stevie Ray Vaughan
office               Microsoft Office
etc                  et cetera
You would want to run your manual list data through the normalizing process and add partial values and their full version to this table until it handles all of the cases that you need. Then run the comparison. Here is some Pythonish pseudocode:
for each row of manual inventory excel worksheet:
    product = sh.Cells(row, 1).Value      # get contents of row n, column 1
                                          # adjust based on the structure of this sheet
    parts = product.split(" ")            # split on spaces into a list
    for n, part in enumerate(parts):
        parts[n] = Normalize(part)        # look up part in conversion table
    normalProduct = " ".join(parts)
    if LookupProduct(normalProduct):      # look up normalized name in LANdesk list
        add to compliant list
    else:
        add to non-compliant list
if len(non-compliant list) > 0:
    TimeForShopping(non-compliant list)
If you have experience with using SQLite or any other database with Python, then creating the LANdesk product table, and the normalize and lookup routines should be fairly straightforward, but if not then more pseudocode and examples would be in order. Let me know if you need those.
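As a rough illustration of the approach described above, here is a minimal sketch using the standard-library sqlite3 module. The conversion table, the function names (normalize, build_inventory, check_compliance) and the rule that "compliant" means the installed count does not exceed the purchased count are all assumptions made for the example; the pseudocode above only checks whether a product is found. It also only examines products that appear on the purchase list, so installed software that was never purchased would need a separate pass.

```python
import sqlite3

# Hypothetical conversion table; extend it until it covers your data.
CONVERSIONS = {"svr": "server", "srv": "server", "office": "Microsoft Office"}


def normalize(name):
    """Expand known abbreviations word by word and return a lower-case name."""
    words = []
    for part in name.split():
        # A replacement may be several words, e.g. "office" -> "Microsoft Office".
        for word in CONVERSIONS.get(part.lower(), part).split():
            # Skip a word that repeats the previous one ("Microsoft Microsoft").
            if not words or words[-1].lower() != word.lower():
                words.append(word)
    return " ".join(words).lower()


def build_inventory(rows):
    """Load (software_name, version, count) rows into an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (name TEXT, version TEXT, count INTEGER)")
    conn.executemany(
        "INSERT INTO inventory VALUES (?, ?, ?)",
        [(normalize(name), version, count) for name, version, count in rows],
    )
    return conn


def installed_count(conn, product):
    """Total installed copies of a product across all versions."""
    row = conn.execute(
        "SELECT COALESCE(SUM(count), 0) FROM inventory WHERE name = ?",
        (normalize(product),),
    ).fetchone()
    return row[0]


def check_compliance(conn, purchases):
    """Split (product, licences_bought) pairs into compliant and non-compliant."""
    compliant, non_compliant = [], []
    for product, bought in purchases:
        if installed_count(conn, product) <= bought:
            compliant.append(product)
        else:
            non_compliant.append(product)
    return compliant, non_compliant


inventory = build_inventory([
    ("Microsoft Office", "2010", 5),
    ("Windows Svr", "2008", 3),
])
ok, short = check_compliance(inventory, [("office", 5), ("Windows Server", 2)])
```

The rows passed to build_inventory and check_compliance would come from whichever Excel reader you settle on; the sketch deliberately leaves that part out.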

There are several libraries to manipulate .xls files.
xlrd extracts data from Excel spreadsheet files, so you can read both files and compare their contents. (read)
xlwt creates Excel files. (write)
xlutils builds on both the xlrd and xlwt packages, so you can read and write with a single library.

Related

Excel data extraction using regular expressions through Python

This is part 1 of a series of questions I will post on this forum, so bear with me on this one. I am a novice programmer who took on a large project because I like to torture myself, so please be kind.
I am writing a Python script to process an Excel document full of accounts (See example below), each one being the same format, extract specific type of data from it, and then export that data to a SQL table. This is the process flow I have in mind when illustrating the script on paper:
The input is a large Excel document containing bookkeeping accounts with this format below:
[Account format example with the data to be extracted highlighted](https://i.stack.imgur.com/Htdze.png). I believe the software used to produce this is an antiquated accounting package named "Zeus".
The data to be extracted is the account name and number (they're on the same cell so I find it easier to extract them altogether so that I can use them as a primary key in a SQL table; will talk about that on another post) and the whole table of details of the account as highlighted above. Mind you, there are thousands of bookkeeping accounts of this format on the document and multiple of these are used for the same account name and number, meaning they have the same header, but different details.
The data processing will go like this:
Use regular expressions to match, extract, and store in an array, each account name and number (so that I can keep record of every account number and use them as a primary key in a SQL table)
Extract and match the content of each account details table to their respective account name and number (haven't figured out how to do that yet, however, I will be using a relationship table to link them to their primary key once data is exported).
Export the extracted data into a database software (mySQL or MS Access... will most likely use MS Access).
After the data is extracted and processed, an Excel report is to be created, consisting of a table with the name and number of the account in the first column and the details of the account in the following columns (will post about that later on).
Part 1: Excel data extraction/"scraping"
Quick note: I have tried multiple methods (MS Access, VBA and MS Power Automate) to do this without having to code everything manually, and failed miserably, so I decided to bite the bullet and just do it.
So here's the question: after doing some research, I came across multiple methods to extract data from an Excel file, and several methods to use regex for web scraping and PDF data extraction.
Is there a way to extract data from an Excel document through Python using regex match? If so, how could I do that?
PS: I will be documenting my journey through this forum on another post in order to help other fellow data entry workers.

Look into these python modules:
import xlwt
from xlwt.Workbook import *
import xlsxwriter
import numpy as np
import pandas as pd
from pandas import ExcelWriter
Then you can use pandas dataframe like:
data = pd.read_excel('testacct.xlsx')
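This will put the first sheet of the spreadsheet into a DataFrame, which behaves much like a dict of columns. Column names are taken from the first row of the sheet.
If there are multiple sheets, pass sheet_name=None and the result will be a dict of DataFrames, one per sheet, keyed by sheet name. Each column holds that column's values for every row.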
You can traverse the rows like:
cols = data.keys()
for row in range(len(data[cols[0]])):
    for col in cols:
        print(data[col][row])
    print("--")
You can join the column data for a row into one string and strip out extra spaces.
Then you can apply a regex to any of the values, including the account header cells.
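To make the regex step concrete, here is a small sketch using only the standard-library re module. The real header layout is only visible in the screenshot, so the pattern below assumes a hypothetical header cell of the form "Account: NAME - NUMBER"; adjust ACCOUNT_HEADER to whatever the file actually contains. The function name group_by_account is also made up for the example. It takes plain row tuples, which you could obtain from a DataFrame with data.itertuples(index=False) or from openpyxl with ws.iter_rows(values_only=True).

```python
import re

# Assumed header layout, e.g. "Account: Office Supplies - 10452".
ACCOUNT_HEADER = re.compile(r"^Account:\s*(?P<name>.+?)\s*-\s*(?P<number>\d+)$")


def group_by_account(rows):
    """Group detail rows under the (name, number) header that precedes them."""
    accounts = {}
    current = None
    for row in rows:
        first = str(row[0]).strip() if row and row[0] is not None else ""
        match = ACCOUNT_HEADER.match(first)
        if match:
            # A header starts (or resumes) an account; repeats share one key.
            current = (match["name"], match["number"])
            accounts.setdefault(current, [])
        elif current is not None and any(cell is not None for cell in row):
            # Any non-empty row after a header is treated as a detail row.
            accounts[current].append(tuple(row))
    return accounts


sample_rows = [
    ("Account: Cash - 1001", None),
    ("2020-01-01", 50),
    (None, None),
    ("Account: Rent - 2002", None),
    ("2020-02-01", 900),
    ("Account: Cash - 1001", None),
    ("2020-03-01", 75),
]
grouped = group_by_account(sample_rows)
```

Because repeated headers map to the same key, details from several blocks of the same account end up in one list, which suits the primary-key idea described in the question.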

How to use Python to automate the movement of data between two Excel workbooks with specific parameters

Thanks for taking the time to read my question.
I am working on a personal project to learn python scripting for excel, and I want to learn how to move data from one workbook to another.
In this example, I am emulating a company employee ledger that has name, position, address, and more (the data is organized by row, so every employee takes up one row). The project is to have a selected number of people transferred to a new ledger (another Excel file). I have a list of emails in a .txt file (it could even be another Excel file, but I thought .txt would be easier), and I want the script to run through the .txt file, get the emails, and look for any rows that have a matching email address (all emails are in column 'B'). If any are found, it should copy that entire row to the new Excel file.
I tried a lot of ways to make this work, but I could not figure it out. I am really new to python so I am not even sure if this is possible. Would really appreciate some help!

You have essentially two packages that allow manipulation of Excel files. For reading in data and performing analysis, the standard package is pandas. You can save the results as .xlsx, but you are only really working with the underlying table data and not the file itself (i.e., you are extracting data FROM the file, not working WITH the file).
What you need, however, is to manipulate Excel files directly, which is better done with openpyxl.
You can also read files (such as your text file) using the built-in open function in a with statement; it is native to Python and not a third-party import like pandas or openpyxl.
Part of learning to program includes learning how to use documentation.
As such, here is the documentation you require with sufficient examples to learn openpyxl: https://openpyxl.readthedocs.io/en/stable/
And you can learn about pandas here: https://pandas.pydata.org/docs/user_guide/index.html
And you can learn about python with open here: https://docs.python.org/3/tutorial/inputoutput.html
Hope this helps.
EDIT: It's possible I or another person can give you a specific example using your data / code etc, but you would have to provide it fully. Since you're learning, I suggest using the documentation or youtube.
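For orientation, here is a hedged sketch of how the pieces could fit together. The helper names (load_emails, select_rows, copy_matching_rows) and the file paths you would pass in are hypothetical. The matching logic is plain Python; only copy_matching_rows touches openpyxl, using load_workbook, iter_rows(values_only=True), append and save. It assumes the emails sit in column B (index 1 of each row tuple) and that matching should ignore case and surrounding spaces.

```python
def load_emails(lines):
    """Turn an iterable of text lines into a set of lower-case emails."""
    return {line.strip().lower() for line in lines if line.strip()}


def select_rows(rows, wanted, email_col=1):
    """Yield rows whose email cell (column B is index 1) is in `wanted`."""
    for row in rows:
        cell = row[email_col] if len(row) > email_col else None
        if isinstance(cell, str) and cell.strip().lower() in wanted:
            yield row


def copy_matching_rows(src_path, emails_path, dst_path):
    """Copy every matching row of the source workbook into a new workbook."""
    # Imported here so the helpers above work without openpyxl installed.
    from openpyxl import Workbook, load_workbook

    with open(emails_path, encoding="utf-8") as handle:
        wanted = load_emails(handle)

    source = load_workbook(src_path, read_only=True)
    rows = source.active.iter_rows(values_only=True)

    target = Workbook()
    sheet = target.active
    for row in select_rows(rows, wanted):
        sheet.append(row)
    target.save(dst_path)


ledger = [
    ("Name", "Email"),
    ("Ann", "ann@x.com"),
    ("Bob", "BOB@x.com "),
    ("Cy", None),
]
chosen = list(select_rows(ledger, load_emails(["bob@x.com\n", "\n", "Ann@X.com"])))
```

Note that the header row is not copied, because its column B value is not one of the emails; append it separately if the new ledger needs it.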

Best way to import data in Python? XML?

I am very new to Python and want to make a script for a spreadsheet I use at work. Basically, I need to associate an address with multiple 5-character reference codes. There are multiple addresses, each with a corresponding group of reference codes.
i.e:
Address:
1234 E. 32nd Street,
New York, NY, 10001
Ref #'s
RL081
RL089
LA063
Address 2:
etc....
I need my script to look up a location by ref code. This information is then used to build a new spreadsheet (each row needs an address and the address is looked up using a ref code). What is the best way to use this info in python? Would it be a dictionary? Should I put the addresses / ref codes into an XML type file?
Thanks
Edit (clarification):
Basically, I have those addresses and corresponding ref codes (they could be in a plain text document, I could organize them in a spreadsheet, or whatever so python can use them). The script I'm building needs to use those ref codes to enter an address into a new spreadsheet. Basically, I input a half complete spreadsheet and the script fills in the addresses based on the ref code in each row.

Import into what?
If you have everything in a spreadsheet, Python has a very good CSV reader library. Once you've read it in, the challenge becomes what to do with it.
If you are looking at a medium-term solution, I'd recommend using SQLite to set up a simple database that can manage the information in a more structured way. SQLite scales well in the beginning stages of a project, and it is straightforward to migrate to a fully-fledged RDBMS like PostgreSQL or MySQL if that becomes necessary.
From there it becomes a case of writing the libraries you need to manipulate your data, and present it. In the initial stages this can be done using the command line but by using an SQL database this can be exposed through a webpage for multiple people down the line without worrying about managing data integrity.
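A minimal sketch of that route, using only the standard-library csv and sqlite3 modules. It assumes the spreadsheet has been exported to CSV with two columns named ref_code and address (one row per reference code); those column names, the table name refs and the helper names are invented for the example.

```python
import csv
import io
import sqlite3


def load_ref_codes(csv_file):
    """Read ref_code/address rows from an open CSV file into SQLite."""
    conn = sqlite3.connect(":memory:")  # use a file path to persist the data
    conn.execute("CREATE TABLE refs (ref_code TEXT PRIMARY KEY, address TEXT)")
    reader = csv.DictReader(csv_file)
    conn.executemany(
        "INSERT INTO refs VALUES (?, ?)",
        ((row["ref_code"], row["address"]) for row in reader),
    )
    return conn


def address_for(conn, ref_code):
    """Return the address for a reference code, or None if it is unknown."""
    row = conn.execute(
        "SELECT address FROM refs WHERE ref_code = ?", (ref_code,)
    ).fetchone()
    return row[0] if row else None


# Stand-in for open("refs.csv", newline="") so the sketch runs on its own.
sample = io.StringIO(
    "ref_code,address\n"
    'RL081,"1234 E. 32nd Street, New York, NY, 10001"\n'
    'RL089,"1234 E. 32nd Street, New York, NY, 10001"\n'
)
db = load_ref_codes(sample)
```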

I prefer to use JSON over XML for storing data that will later be used in Python. The json module is fairly robust and easy to use. Since you will be performing lookups, I would definitely load the information as a Python dictionary. Since you'll be querying by ref codes, you'll want to use those as keys and have the address as the value.
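A short sketch of what that could look like, assuming the data is stored as a JSON object that maps each address to its list of reference codes (that layout and the name build_lookup are choices made for this example, not something the question specifies). The stored form is inverted once into a dict keyed by ref code, so each lookup is a single dictionary access.

```python
import json

# Stand-in for the contents of a JSON file read with json.load().
RAW = """
{
    "1234 E. 32nd Street, New York, NY, 10001": ["RL081", "RL089", "LA063"]
}
"""


def build_lookup(addresses):
    """Invert {address: [ref codes]} into {ref code: address}."""
    lookup = {}
    for address, codes in addresses.items():
        for code in codes:
            lookup[code] = address
    return lookup


lookup = build_lookup(json.loads(RAW))
```

Using lookup.get(code) rather than lookup[code] returns None for an unknown code instead of raising KeyError, which is handy when filling in a half-complete spreadsheet.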

I need my script to look up a location by ref code
Since this is the only requirement you've stated, I would recommend using a dict where keys are ref codes and values are addresses.
I'm not sure why you are asking about "file types". It seems you already have all this information stored in a spreadsheet - no need to write a new file.

manipulating excel 2007 files using python

Using python I need to be able to do the following operations to a workbook for excel 2007:
delete rows
sorting a worksheet
getting distinct values from a column
I am looking into openpyxl; however, it seems to have limited capabilities.
Can anyone please recommend a library that can do the above tasks?

I want to preface this by letting you know this is a Windows-only solution. But if you are using Windows, I would recommend using win32com, which can be found here. This module gives Python programmatic access to any Microsoft Office application (including Excel) and uses many of the same methods used in VBA. Usually what you will do is record a macro (or recall from memory) how to do something in VBA and then use the same functions in Python.
To start we want to connect to Excel and get access to the first sheet as an example
#First we need to access the module that lets us connect to Excel
import win32com.client
# Next we want to create a variable that represents Excel
app = win32com.client.Dispatch("Excel.Application")
# Lastly we will assume that the workbook is active and get the first sheet
wbk = app.ActiveWorkbook
sheet = wbk.Sheets(1)
At this point we have a variable named sheet that represents the excel work sheet we will be working with. Of course there are multiple ways to access the sheet, this is usually the way I demo how to use win32com with excel because it is very intuitive.
Now assume I have the following values on the first sheet and I will go over one by one how to answer what you were asking:
A
1 "d"
2 "c"
3 "b"
4 "a"
5 "c"
Delete Rows:
Let's assume that you want to delete the first row in your active sheet.
sheet.Rows(1).Delete()
This creates:
A
1 "c"
2 "b"
3 "a"
4 "c"
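Next, let's sort the cells in ascending order (although I would recommend extracting the values to Python, sorting them in a list, and sending the values back).
rang = sheet.Range("A1","A4")
# Excel's Sort object needs at least one sort field (the key) before Apply
sheet.Sort.SortFields.Clear()
sheet.Sort.SortFields.Add(rang)
sheet.Sort.SetRange(rang)
sheet.Sort.Apply()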
This creates:
A
1 "a"
2 "b"
3 "c"
4 "c"
And now we will get distinct values from the column. The main thing to take away here is how to extract values from cells. You can either select many cells at once with sheet.Range("A1","A4") or access the values cell by cell with sheet.Cells(row,col). Range is orders of magnitude faster, but Cells is slightly easier for debugging.
#Get a list of all Values using Range
valLstRange = [val[0] for val in sheet.Range("A1","A4").Value]
#Get a list of all Values using Cells (rows 1 to 4; range() excludes its end)
valLstCells = [sheet.Cells(row,1).Value for row in range(1,5)]
#valLstCells and valLstRange both = ["a","b","c","c"]
#Distinct values, in sorted order
distinctVals = sorted(set(valLstRange))
#distinctVals = ["a","b","c"]
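The suggestion to pull the values into Python, work on them there and send them back can be sketched without Excel running. A multi-cell Range.Value comes back as a tuple of row tuples, and the same shape can be assigned back to a range of matching size. The helper names below are made up for the illustration, and the sample data mirrors the five-row column shown earlier.

```python
def column_values(range_value):
    """Flatten a single-column Range.Value (tuple of 1-tuples) to a list."""
    return [row[0] for row in range_value]


def distinct_in_order(values):
    """Drop duplicates while keeping the first occurrence of each value."""
    return list(dict.fromkeys(values))


def as_column(values):
    """Shape a flat list back into the tuple-of-tuples form a Range expects."""
    return tuple((value,) for value in values)


# What sheet.Range("A1","A5").Value would return for the example column.
raw = (("d",), ("c",), ("b",), ("a",), ("c",))

values = column_values(raw)
sorted_column = as_column(sorted(values))
unique_values = distinct_in_order(values)
```

With a live sheet, something like sheet.Range("A1","A5").Value = sorted_column would write the sorted values back in a single call.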
Now lastly you wanted to save the workbook and you can do this with the following:
wbk.SaveAs("C:/savedWorkbook.xlsx")
And you are done!
INFO About COM
If you have worked with Excel from VBA, .NET, VBScript or any other language, many of these Excel methods will look the same. That is because they all use the same library provided by Microsoft. This library uses COM, which is Microsoft's way of providing language-agnostic APIs to programmers. COM itself is an older technology and can be tricky to debug. If you want more information on Python and COM, I highly recommend Python Programming on Win32 by Mark Hammond. He is the guy that gets a shoutout after you install Python on Windows in the official .msi installer.
ALTERNATIVES TO WIN32COM
I also need to point out there are several fantastic open source alternatives that can be faster than COM in most situations and work on any OS (Mac, Linux, Windows, etc.). These tools all parse the zipped files that comprise a .xlsx. If you did not know that a .xlsx file is a .zip, just change the extension to .zip and you can then explore the contents (kind of interesting to do at least once in your career). Of these I recommend openpyxl, which I have used for parsing and creating Excel files on a server where performance was critical. Never use win32com for server activities, as it opens an out-of-process instance of excel.exe for each instance, which can leak resources.
RECOMMENDATION
I would recommend win32com for users who are working intimately with individual data sets (analysts, financial services, researchers, accountants, business operations, etc.) that are performing data discovery activities as it works great with open workbooks. However, developers or users that need to perform very large tasks with a small footprint or extremely large manipulations or processing in parallel must use a package such as openpyxl.

Convert .csv file into .dbf using Python?

How can I convert a .csv file into .dbf file using a python script? I found this piece of code online but I'm not certain how reliable it is. Are there any modules out there that have this functionality?

Using the dbf package you can convert a basic csv file with code similar to this:
import dbf
some_table = dbf.from_csv(csvfile='/path/to/file.csv', to_disk=True)
This will create a table with the same name as the csv file, with either Character or Memo fields and field names of f0, f1, f2, etc.
For a different filename use the filename parameter, and if you know your field names you can also use the field_names parameter.
some_table = dbf.from_csv(csvfile='data.csv', filename='mytable',
                          field_names='name age birth'.split())
Rather basic documentation is available here.
Disclosure: I am the author of this package.

You won't find anything on the net that reads a CSV file and writes a DBF file such that you can just invoke it and supply 2 file-paths. For each DBF field you need to specify the type, size, and (if relevant) number of decimal places.
Some questions:
What software is going to consume the output DBF file?
There is no such thing as "the" (one and only) DBF file format. Do you need dBase III ? dBase 4? 7? Visual FoxPro? etc?
What is the maximum length of text field that you need to write? Do you have non-ASCII text?
Which version of Python?
If your requirements are minimal (dBase III format, no non-ASCII text, text <= 254 bytes long, Python 2.X), then the cookbook recipe that you quoted should do the job.
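To illustrate why the field definitions matter, here is a small standard-library sketch that scans a CSV and proposes a type, width and decimal count for each column. It is a heuristic under stated assumptions, not part of any DBF library: the function name infer_field_specs is invented, only plain decimal numbers are treated as Numeric, everything else becomes Character, and the "name C(7); price N(6,2)" output format is the field-spec style accepted by the dbf package when creating a table. You would still need to check the result against the limits of your target format (for example the 254-byte text limit mentioned above).

```python
import csv
import io
import re

PLAIN_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def infer_field_specs(csv_text):
    """Propose 'name TYPE(width[,decimals])' specs from CSV text with a header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    count = len(header)
    text_width = [1] * count   # longest value seen, for Character fields
    int_width = [1] * count    # longest integer part (including sign)
    decimals = [0] * count     # longest fractional part
    numeric = [True] * count   # stays True only if every value is a number

    for row in reader:
        for i, value in enumerate(row[:count]):
            value = value.strip()
            if not value:
                continue
            text_width[i] = max(text_width[i], len(value))
            if PLAIN_NUMBER.fullmatch(value):
                whole, _, fraction = value.partition(".")
                int_width[i] = max(int_width[i], len(whole))
                decimals[i] = max(decimals[i], len(fraction))
            else:
                numeric[i] = False

    specs = []
    for i, name in enumerate(header):
        if numeric[i]:
            # Room for the integer part, plus the point and decimals if any.
            width = int_width[i] + (decimals[i] + 1 if decimals[i] else 0)
            specs.append(f"{name} N({width},{decimals[i]})")
        else:
            specs.append(f"{name} C({text_width[i]})")
    return "; ".join(specs)


spec = infer_field_specs("name,age,price\nAnn,31,9.5\nRoberto,7,120.25\n")
```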

Use the csv library to read your data from the csv file. The third-party dbf library can write a dbf file for you.
Edit: Originally, I listed dbfpy, but the library above seems to be more actively updated.

None that are well-polished, to my knowledge. I have had to work with xBase files many times over the years, and I keep finding myself writing code to do it when I have to do it. I have, somewhere in one of my backups, a pretty functional, pure-Python library to do it, but I don't know precisely where that is.
Fortunately, the xBase file format isn't all that complex. You can find the specification on the Internet, of course. At a glance the module that you linked to looks fine, but of course make copies of any data that you are working with before using it.
A solid, read/write, fully functional xBase library with all the bells and whistles is something that has been on my TODO list for a while... I might even get to it in what is left this year, if I'm lucky... (probably not, though, sadly).

I have created a Python script here. It should be customizable for any csv layout. You do need to know your DBF data structure before this will be possible. This script requires two csv files, one for your DBF header setup and one for your body data. Good luck.
https://github.com/mikebrennan/csv2dbf_python
