Excel data extraction using regular expressions through Python

This is part 1 of a series of questions I will post on this forum, so bear with me on this one. I am a novice programmer who took on a large project because I like to torture myself, so please be kind.
I am writing a Python script to process an Excel document full of accounts (see example below), each one in the same format, extract specific types of data from it, and then export that data to a SQL table. This is the process flow I have in mind when illustrating the script on paper:
The input is a large Excel document containing bookkeeping accounts with this format below:
[Account format example with the data to be extracted highlighted; I believe the software used to produce this is an antiquated accounting program named "Zeus"](https://i.stack.imgur.com/Htdze.png)
The data to be extracted is the account name and number (they're in the same cell, so I find it easier to extract them together so that I can use them as a primary key in a SQL table; I'll talk about that in another post) and the whole table of details of the account, as highlighted above. Mind you, there are thousands of bookkeeping accounts of this format in the document, and several of them share the same account name and number, meaning they have the same header but different details.
The data processing will go like this:
Use regular expressions to match, extract, and store each account name and number in an array (so that I can keep a record of every account number and use it as a primary key in a SQL table).
Extract the contents of each account's details table and match them to their respective account name and number (I haven't figured out how to do that yet; however, I will use a relationship table to link them to their primary key once the data is exported).
Export the extracted data into database software (MySQL or MS Access... will most likely use MS Access).
After the data is extracted and processed, an Excel report is to be created, consisting of a table with the name and number of the account in the first column and the details of the account in the following columns (will post about that later on).
Part 1: Excel data extraction/"scraping"
Quick note: I have tried multiple methods (MS Access, VBA, and MS Power Automate) to do this and avoid having to code everything manually, but ended up failing miserably, so I decided to bite the bullet and just do it.
So here's the question: after doing some research, I came across multiple methods to extract data from an Excel file, and several methods to use regex for web scraping and PDF data extraction.
Is there a way to extract data from an Excel document through Python using regex matching? If so, how could I do that?
PS: I will be documenting my journey through this forum on another post in order to help other fellow data entry workers.

Look into these Python modules:
import xlwt          # writing legacy .xls files
import xlsxwriter    # writing .xlsx files
import numpy as np   # numeric helpers
import pandas as pd  # reading and manipulating the spreadsheet
from pandas import ExcelWriter  # writing DataFrames back to Excel
Then you can read the spreadsheet into a pandas DataFrame:
data = pd.read_excel('testacct.xlsx')
This loads the first sheet into a DataFrame. By default the first row becomes the column names; pass header=None if there is no header row and pandas will assign generic column names.
If there are multiple sheets, pass sheet_name=None and read_excel will return a dict of DataFrames keyed by sheet name. Each column behaves like a list of row values.
You can traverse the rows like:
cols = data.columns
for row in range(len(data)):
    for col in cols:
        print(data[col][row])
    print("--")
You can join each row's cell values into one string and strip out extra spaces.
Then you can run a regex against any of the header values.
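For example, a minimal sketch of that idea, assuming the account header sits in its own row and follows a number-then-name pattern (the pattern and file name are assumptions; adjust them to the real "Zeus" layout):
import re
import pandas as pd

data = pd.read_excel('testacct.xlsx', header=None)

# Hypothetical header pattern: an account number followed by the account
# name, e.g. "1010-001 CASH". Adjust to the actual report layout.
ACCOUNT_RE = re.compile(r"(\d{4}-\d{3})\s+(.+)")

accounts = []
for row in range(len(data)):
    # join the row's cells and strip out extra spaces
    line = " ".join(str(v) for v in data.iloc[row] if pd.notna(v)).strip()
    match = ACCOUNT_RE.match(line)
    if match:
        accounts.append(match.groups())  # (number, name) tuples

print(accounts)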

Related

Parsing Excel sheet based on date, number of visitors, and printing email

I am trying to parse through an Excel sheet that has columns for the website name (column A), the number of visitors (F), a contact at that website's first name (B), last name (C), email (E), and the date it was last modified (L).
I want to write a python script that goes through the sheet and looks at sites that have been modified in the last 3 months and prints out the name of the website and an email.
It is pretty straightforward to do this, and a little bit of googling can help you a lot. In short, you need to use a library called Pandas, which is a really powerful tool for handling spreadsheets, datasets, and table-based files.
The Pandas documentation is very well written. You can use the tutorials provided within the documentation to work your way through the problem easily. However, I'll give you a brief overview of what you should do.
First, open the spreadsheet (Excel file) inside Python using Pandas and load it into a DataFrame (read the docs and you'll understand).
Second, using one of the methods provided by pandas called where (actually there are a couple of such methods), you can easily set a condition (like whether the date is older than some cutoff) and get back a masked data frame (which represents your spreadsheet), as sketched below.
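A minimal sketch of that filtering step, assuming header names like 'Website', 'Email', and 'Last Modified' (these names are assumptions; use your sheet's real headers). Plain boolean indexing is shown here; where() gives you a same-shaped masked frame instead:
import pandas as pd

df = pd.read_excel('sites.xlsx')  # hypothetical file name

# keep only rows modified in the last 3 months
cutoff = pd.Timestamp.now() - pd.DateOffset(months=3)
recent = df[pd.to_datetime(df['Last Modified']) >= cutoff]

for _, row in recent.iterrows():
    print(row['Website'], row['Email'])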

Best way to save an np.array or a python list object as a single record in BigQuery?

I have an ML model (text embedding) which outputs a large 1024-element vector of floats, which I want to persist in a BigQuery table.
The individual values in the vector don't mean anything on their own; the entire vector is the feature of interest. Hence, I want to store these lists in a single column in BigQuery, as opposed to one column for each float. Besides, adding 1024 extra columns to a table that originally has just 4 or 5 seems like a bad idea.
Is there a way of storing a Python list or an np.array in a column in BigQuery (maybe by converting it to JSON first, or something along those lines)?
Maybe it's not exactly what you were looking for, but the following options are the closest workarounds to what you're trying to achieve.
First of all, you can save your data in a CSV file with one column locally and then load that file into BigQuery. There are also other file formats that can be loaded into BigQuery from a local machine that might interest you. I personally would go with a CSV.
I did the experiment by creating an empty table in my dataset, without adding a field. Then I used the code mentioned in the first link, after saving a column of my random data in a CSV file.
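For reference, a minimal sketch of that kind of local-file load with the google-cloud-bigquery client (project, dataset, table, and file names are placeholders):
import csv
import json
import numpy as np
from google.cloud import bigquery

# serialize the vector as one JSON string per row (placeholder data)
vector = np.random.rand(1024)
with open('random_data.csv', 'w', newline='') as f:
    csv.writer(f).writerow([json.dumps(vector.tolist())])

client = bigquery.Client()  # assumes credentials are already configured
table_id = 'my-project.my_dataset.random_data'  # placeholder table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,  # let BigQuery infer the single STRING column
)
with open('random_data.csv', 'rb') as f:
    job = client.load_table_from_file(f, table_id, job_config=job_config)
job.result()  # wait for the load job to finish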
If you encounter the following error regarding the permissions, see this solution. It uses an authentication key instead.
google.api_core.exceptions.Forbidden: 403 GET https://bigquery.googleapis.com/bigquery/v2/projects/project-name/jobs/job-id?location=EU: Request had insufficient authentication scopes.
Also, you might find this link useful, in case you get the following error:
google.api_core.exceptions.BadRequest: 400 Provided Schema does not match Table my-project:my_dataset.random_data. Cannot add fields (field: double_field_0)
Besides loading your data from a local file, you can upload your data file to Google Cloud Storage and load the data from there. Many file formats are supported, such as Avro, Parquet, ORC, CSV, and newline-delimited JSON.
Finally, there is an option for streaming the data directly into a BigQuery table by using the API, but it is not available via the free tier.
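A minimal streaming sketch with the same client library (the table name and the 'embedding' field are placeholders; the destination table must already have a matching schema):
import json
import numpy as np
from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my-project.my_dataset.random_data'  # placeholder table

vector = np.random.rand(1024)
rows = [{'embedding': json.dumps(vector.tolist())}]  # one JSON string per record

errors = client.insert_rows_json(table_id, rows)  # streaming insert via the API
if errors:
    print('Insert failed:', errors)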

Parsing a CSV into a database for an API using Python?

I'm going to use data from a .csv to train a model to predict user activity on Google Ads (impressions, clicks) in relation to the weather for a given day. I have a .csv that contains 6000+ records of this info and want to parse it into a database using Python.
I tried making a df in pandas, but for some reason the whole table isn't shown. The middle columns (there are about 7 columns, I think) and rows (numbered over 6000, as I mentioned) are replaced with '...' when I print the table, so I'm not sure whether the entirety of the information is being stored and whether this will be usable.
My next attempt will possibly be SQLite, but since it's local storage, will this interfere with someone else making requests to my API endpoint if I don't have the db actively open at all times?
Thanks in advance.
If you used pd.read_csv(), I can assure you all of the info is there; it's just not being displayed.
You can check by doing something like print(df['Column_name_you_are_interested_in'].tolist()) just to make sure, though. You can also use the various count-type methods in pandas to make sure all of your lines are there.
Pandas is pretty versatile, so it shouldn't have trouble with 6000 lines.
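A quick sketch of both checks, plus widening the display so the '...' truncation goes away (file and column names are placeholders):
import pandas as pd

df = pd.read_csv('ads_weather.csv')  # placeholder file name

print(len(df))   # row count: should be 6000+
print(df.shape)  # (rows, columns)
print(df['impressions'].tolist()[:10])  # hypothetical column name

# show every column instead of collapsing the middle ones to '...'
pd.set_option('display.max_columns', None)
print(df.head())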

Python CSV Manipulation find and replace with the data in two different CSVs

I am new to Python.
I have a huge JSON dataset which I have scripted against, sorted, and split into two CSV files using a Python script. The data is assessment data made for educational and research purposes.
The first CSV file contains the question IDs, question text, choice IDs (referring to the answer options), and choice text. There are some other fields, but for this particular question these are enough.
The second CSV file contains student responses; the fields to consider here are questionids and responseids.
Now I want to map each questionid in the second CSV to the questionid in the first CSV, collect the question text, choice text, and choice IDs for that questionid from the first CSV, and write them to a new CSV file. Then I also need to map each responseid in the second CSV to the choice IDs in the first CSV and write that to the newly created third CSV.
How can I do this using a Python script?
I haven't written the script yet, as I am struggling with the logic for it.
I would recommend the pandas library, which contains all the functions you need, starting with read_csv.
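A minimal sketch of the two joins, assuming column names like questionid, choiceid, and responseid (adjust them to your actual headers):
import pandas as pd

questions = pd.read_csv('questions.csv')  # questionid, question text, choiceid, choice text
responses = pd.read_csv('responses.csv')  # questionid, responseid

# map each response to its question's text and choices
merged = responses.merge(questions, on='questionid', how='left')
merged.to_csv('questions_with_responses.csv', index=False)

# map responseid to the matching choiceid to recover the chosen answer
answered = merged[merged['responseid'] == merged['choiceid']]
answered.to_csv('mapped_responses.csv', index=False)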

To compare 2 excel files using python

I have an Excel file with the following fields:
Software_name, Version, and Count.
This file is an inventory of all software installed in the network of an organization, generated using LANdesk.
I have another Excel file which is a purchase inventory of this software, generated manually.
I need to compare these sheets and create a report stating whether the organization is compliant or not.
So how do I compare these two files?
There are complications: for instance, 'Microsoft Office' may be listed as just 'office', and 'server' may be spelled 'svr'.
How do I go about it?
The first step as SeyZ mentions is to determine how you want to read these Excel files. I don't have experience with the libraries he refers to. Instead I use COM programming to read and write Excel files, which of course requires that you have Excel installed. This capability comes from PyWin32 which is installed by default if you use the ActiveState Python installer, or can be installed separately if you got Python from Python.org.
The next step would be to convert things into a common format for comparing, or searching for elements from one file within the other. My first thought here would be to load the contents of the LANdesk software inventory into a database table using something quick and easy like SQLite.
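One way to do that load, sketched with pandas for brevity (the answer above uses COM instead; file and table names are placeholders):
import sqlite3
import pandas as pd

# load the LANdesk export into a SQLite table for quick lookups
inventory = pd.read_excel('landesk_inventory.xlsx')
conn = sqlite3.connect('inventory.db')
inventory.to_sql('landesk_inventory', conn, if_exists='replace', index=False)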
Then for each item of the manual purchase list, normalize the product name and search for it within the inventory table.
Normalizing the values would be a process of splitting a name into parts and replacing partial words and phrases with their full versions. For example, you could create a lookup table of conversions:
partial                full
--------------------   --------------------
svr                    server
srv                    server
SRV                    Stevie Ray Vaughan
office                 Microsoft Office
etc                    et cetera
You would want to run your manual list data through the normalizing process and add partial values and their full version to this table until it handles all of the cases that you need. Then run the comparison. Here is some Pythonish pseudocode:
compliant, non_compliant = [], []
for row in range(1, num_rows + 1):        # each row of the manual inventory worksheet
    product = sh.Cells(row, 1)            # get contents of row n, column 1
                                          # (adjust based on the structure of this sheet)
    parts = product.split(" ")            # split on spaces into a list
    for n, part in enumerate(parts):
        parts[n] = Normalize(part)        # look up part in conversion table
    normal_product = " ".join(parts)
    if LookupProduct(normal_product):     # look up normalized name in LANdesk list
        compliant.append(normal_product)
    else:
        non_compliant.append(normal_product)

if len(non_compliant) > 0:
    TimeForShopping(non_compliant)
If you have experience with using SQLite or any other database with Python, then creating the LANdesk product table, and the normalize and lookup routines should be fairly straightforward, but if not then more pseudocode and examples would be in order. Let me know if you need those.
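For instance, a minimal sketch of those two routines with SQLite (the table and column names are assumptions that match the pseudocode above):
import sqlite3

conn = sqlite3.connect('inventory.db')  # hypothetical database file

def Normalize(part):
    """Replace a partial word with its full version, if one is known."""
    row = conn.execute(
        "SELECT full FROM conversions WHERE partial = ?", (part.lower(),)
    ).fetchone()
    return row[0] if row else part

def LookupProduct(name):
    """Return True if the normalized name appears in the LANdesk inventory."""
    row = conn.execute(
        "SELECT 1 FROM landesk_inventory WHERE software_name = ?", (name,)
    ).fetchone()
    return row is not None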
There are several libraries to manipulate .xls files.
XLRD allows you to extract data from Excel spreadsheet files. So you can compare two files easily. (read)
XLWT allows you to create some Excel files. (write)
XLUtils requires both the xlrd and xlwt packages, so you can read and write easily thanks to this library.
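A minimal read sketch with xlrd (the file name is a placeholder; note that recent xlrd versions only read legacy .xls files):
import xlrd

book = xlrd.open_workbook('inventory.xls')  # placeholder file name
sheet = book.sheet_by_index(0)

for row in range(1, sheet.nrows):  # skip the header row
    name = sheet.cell_value(row, 0)     # Software_name
    version = sheet.cell_value(row, 1)  # Version
    count = sheet.cell_value(row, 2)    # Count
    print(name, version, count)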
