Swamped with real time data and tasked with building a database

Swamped with real time data and tasked with building a database - python

I work for a power company, and have been tasked with building a database. I have a pretty beginner/intermediate understanding level of python, and can fuddle decently with MSSQL. They have procured Azure for this project, and I am completely lost of how to start this task.
Here is one of the sources of data that I want to scrape every minute.
http://ets.aeso.ca/ets_web/docroot/tradingPage.html - this is a complete overview of the Alberta power market in real time.
Ideally, I would want to be able to scrape this data and other sources, and then modify it to fit into in a certain format and push it onto the SQL server.
Do I need virtual machines that are just looping over python scripts? Or do I need managed instances? This data also then needs to be able to be queried right after it is scraped. Eventually this data may feed machine learning algorithms (I don't know jack about that either but I have been told it should play friendly with that type of enviornment).
Just looking to see if anyone has any insight in how you would approach this, and can tell me what I clearly don't know and haven't thought of. Any insight is truly appreciated.
Thanks!

Related

Storing, modifying and manipulating web scraped data

I'm working on a python webscraper that pulls data from a car advertising site. I got the scraping part all done with beatifoulsoup but I've ran into many difficulties trying to store and modify it. I would really appreciate some advice on this part since I'm a lacking knowledge on this part.
So here is what I want to do:
Scrape the data each hour (done).
Store scraped data as a dictionary in a .JSON file (done).
Everytime the ad_link not found in the scraped_data.json set it to dict['Status'] = 'Inactive' (done).
If a cars price changes , print notification + add old price to dictionary. On this part I came across many challenges with the .JSON way.
I've kept using 2 .json files and comparing them to each other (scraped_data_temp , permanent_data.json) but I think this is by far not the best method.
What would you guys suggest? How should I do this? .
What would be the best way to approach manipulating this kind of data ? (Databases maybe? - got no experince with them but I'm eager to learn) and what would be a good way to represent this kind of data, pygal?
Thank you very much.

If you have larger data, I would definitely recommend using some kind of DB. If you don't have the need to use DB server, you can use sqlite. I have used it in the past to save bigger data locally. You can use sqlalchemy in python to interact with DB-s.
As for displaying data, I tend to use matplotlib. It's extremely flexible, has extensive documentation and examples, so you can adjust data to you linking, graphs, charts, etc.
I'm assuming that you are using python3.

Running a Python Script on a Website (in the background)

Firstly, apologies for the very basic question. I have looked into other answers but they haven't quite answered what I'm after. I'm confident designing a site in HTML/CSS and have very very basic knowledge of Python.
I want to run a very basic Python script on my website. It analyses tweets about a specific topic, and then posts a sentiment analysis score. I want it to run this sentiment analysis every hour and cache the score.
I have a working Python script which does this in Jupyter Notebook. Could you give me an overview of how I would make this script function online and cache the results? I've read into using Python web frameworks, but from my limited understanding, they seem like overkill?
Thank you for your help!

Could you give me an overview of how I would make this script function online
The key thing would be to uncouple the two parts of your system:
Producing the data
Showing it in a website.
So the first thing to do is have your sentiment-analysis script push its value to a database. The database could be something as simple as a csv file, or it could be a key/value store, or something like MySQL or CouchDB (or hundreds of other choices).
Over on the website you have to make a decision between:
Server-side
Client-side
If the former, you could program in Python if that is what you are most familiar with. Whatever language/framework combination you go for, there will an example tutorial of how to read a value from a database and display it: it is just about the most fundamental thing.
If client-side you will usually be programming in JavaScript. Again you need to choose a framework, but again you should easily be able to find a tutorial to follow.
(Unless you have a good reason to prefer server-side, such as familiarity with an existing framework, or security issues with accessing your database, I'd go with a client-side approach.)
I've read into using Python web frameworks... overkill?
Yes and no. You are going to need some kind of database, and some kind of framework. It would be good to understand the basics of web security, too. If the sentiment analysis is your major goal, all that is going to be a distraction, and it might be better to find a friend who already knows web programming to work with. Or just find a tutorial that is very close to what you want to do, and adapt that.
(P.S. I was going to flag your question as "too broad", but you did ask for an overview, so I hope this helps.)

I need a starting point to code an app to extract text from pdf to excel

To start I just want to state that I'm an Electrical Engineer with basic knowledge of programming.
My requirement is as follows:
I want to create an app where I can load and view PDF files that
contain tables.
These PDF files tables are of irregular shapes and in a different
position on every page. (that's why tools like tabular couldn't help
me)
Each table entry is multiline and of irregular dimensions (I cannot
select a whole row at a time it has to be each element alone. simply
copying the lines to excel won't work either because it will need a
lot of formatting)
So I want to be able to select each table entry individually from the
table (like a selection or cropping box over the required text),
delete new line if there is a new line in the text and just keep spaces.
The generated excel (or access database I do not really mind any)
should be reviewable and saveable (if those are even words XD).
I have a good knowledge of python and a very elementary knowledge of Django and I'm seeking some expert who can tell me what do I really need to learn (and if possible where to learn it) to execute my project.
Is it very much for me to execute and if I can dedicate 10 hours a week, how much would it take me to execute such a project.
Thanks all for your help in advance.

Don't use Python, use Word. Open the pdf, then step through the tables collection to collect the data and put it into excel. See this for an example

Here are the advises i can provide you :
first of all, ask internet for questions :
https://lmddgtfy.net/?q=python%20library%20tabular%20pdf
-> Camelot , which is mentioned multiple time seems to be relevant
For the use of excel sheet, i present you one of the most famous library for manipulating DataFrame: Pandas
You can use small courses on internet which will offer you a quick ability to manage your project easier.
for the application, you can easily find on youtube courses on a library made by someone who will explain you how to do a basic application. It could offer you the entry point you are talking about. Then, You can just wonder what else do you need or simply want for making it better.
for the time needed, it depends on how much time do you need to understand the basics, how much time you spend on having a deeper comprehension. I think in one week, working during your free time with a real interest, it could be working( not perfect, but working, which is a good beginning)
PS: I am not sure if your question is relevant for the aims of stackoverflow. I suggest you to read this file. ( https://stackoverflow.com/help/how-to-ask)

Python -- Create a Webform

I'm working on a project to try and create a more streamlined process to enter data into our database.
Currently, we're just using raw_input("Question: "), but this seems very outdated and is prone to mistakes. Given that there are over 100 questions we need to answer, this can take quite some time, loading only one question at a time.
I was interested in creating a web-based form to do this instead, as individuals wouldn't need to install python and various libraries on their own computers, but connect to an IP address on our network instead.
However, they're going to be using this to input medical data into our database, so I need some sort of secure-login feature, and only after being "validated", is a technician redirected to our medical form. I saw that using FlaskWTF might be able to solve this issue, but their documentation is a little confusing to me.
I'm wondering exactly how to do this. I've been looking at FlaskWTF (recommended on another post), but I don't need the ability to upload documents, only get data from a text box, or to see if someone has selected multiple boxes (i.e., if the person indicates they have both cancer AND diabetes).
Likewise, I'm wondering if I can create a google form, download it, and host it on an internal server. However, I'm curious on how I can retrieve the data with Python. I saw this post, but its only for a one question form, and I have at least 100 questions on this form.
If you think creating a webform is too difficult for someone who has no major python web experience, nor experience with HTML/PHP (I've mostly worked with databases using elasticsearch and some, albeit very basic, python-based AI/Chatterbots), would you recommend creating a form using TKinker instead? If so, how to I save form inputs as variables, and make it look a little prettier?
My apologies that I don't have code here, but rather a series of questions. As I continue to work on this project, I will probably post snippits of code to this site.
Best!

Automating IBM SPSS Data Collection survey export?

I'm so sorry for the vague question here, but I'm hoping an SPSS expert will be able to help me out here. We have some surveys that are done via SPSS, from which we extract data for an internal report. Right now the process is very cumbersome and requires going to the SPSS Data Collection Interviewer Server Administration page and manually exporting data from two different projects (which takes hours at a time!). We then take that data, massage it, and upload it to another database that drives the internal report.
My question is, does anyone out there know how to automate this process? Is there a SQL Server database behind the SPSS data? Where does the .mdd file come in to play? Can my team (who is well-versed in extracting data from various sources) tap into the SQL Server database behind SPSS to get our data? Or do we need some sort of Python script and plugin?
If I'm missing information that would be helpful in answering the question, please let me know. I'm happy to provide it; I just don't know what to provide.
Thanks so much.

As mentioned by other contributors, there are a few ways to achieve this. The simplest I can suggest is using the DMS (data management script) and windows scheduler. Ideally you should follow below steps.
Prerequisite:
1. You should have access to the server running IBM Data collection
2. Basic knowledge of windows task scheduler
3. Knowledge of DMS scripting
Approach:
1. Create a new DMS script from the template
2. If you want to perform only data extract / transformation, you only need input and output data source
3. In the input data source, create/build the connection string pointing to your survey on IBM Data collection server. Use the data source as SQL
4. In the select query: use "Select * from VDATA" if you want to export all variables
5. Set the output data connection string by selecting the output data format as SPSS (if you want to export it in SPSS)
6. run the script manually and see if the SPSS export is what is expected
7. Create batch file using text editor (save with .bat extension). Add below lines
cd "C:\Program Files\IBM\SPSS\DataCollection\6\DDL\Scripts\Data Management\DMS"
Call DMSRun YOURDMSFILENAME.dms
Then add a line to copy (using XCOPY) the data / files extracted to the location where you want to further process it.
Save the file and open windows scheduler to schedule the execution of this batch file for data extraction.
If you want to do any further processing, you create an mrs or dms file and add to the batch file.
Hope this helps!

There are a number of different ways you can accomplish easing this task and even automate it completely. However, if you are not an IBM SPSS Data Collection expert and don't have access to somebody who is or have the time to become one, I'd suggest getting in touch with some of the consultants who offer services on the platform. Internally IBM doesn't have many skilled SPSS resources available, so they rely heavily on external partners to do services on a lot of their products. This goes for IBM SPSS Data Collection in particular, but is also largely true for SPSS Statistics.
As noted by previous contributors there is an approach using Python for data cleaning, merging and other transformations and then loading that output into your report database. For maintenance reasons I'd probably not suggest this approach. Though you are most likely able to automate the export of data from SPSS Data Collection to a sav file with a simple SPSS Syntax (and an SPSS add-on data component), it is extremely error prone when upgrading either SPSS Statistics or SPSS Data Collection.
From a best practice standpoint, you ought to use the SPSS Data Collection Data Management module. It is very flexible and hardly requires any maintenance on upgrades, because you are working within the same data model framework (e.g. survey metadata, survey versions, labels etc. is handled implicitly) right until you load your transformed data into your reporting database.
Ideally the approach would be to build the mentioned SPSS Data Collection Data Management script and trigger it at the end of each completed interview. In this way your reporting will be close to real-time (you can make it actual real-time by triggering the DM script during the interview using the interview script events - just a FYI).
All scripting on the SPSS Data Collection platform including Data Management scripting is very VB-like, so for most people knowing VB, it is very easy to get started and it is documented very well in the SPSS Data Collection DDL. There you'll also be able to find examples of extracting survey data from SPSS Data Collection surveys (as well as reading and writing data to/from other databases, files etc.). There are also many examples of data manipulation and transformation.
Lastly, to answer your specific questions:
Yes, there is always an MS SQL Server behind SPSS Data Collection -
no exceptions. However, generally speaking the data model is way to
complex to read out data directly from it. If you have a look in it,
you'll quickly realize this.
The MDD file (short for Meta Data Document) is containing all survey meta
data including data source specifications, version history etc.
Without it you'll not be able to make anything of the survey data in
the database, which is the main reason I'd suggest to stay within the
SPSS Data Collection platform for as large part of your data handling
as possible. However, it is indeed just a readable XML file.
Note that the SPSS Data Collection Data Management Module requires a separate license and if the scripting needed is large or complex, you'd probably want base professional too, if that's not what you already use for developing the questionnaires and handling the surveys.
Hope that helps.

This isn't as clean as working directly with whatever database is holding the data, but you could do something with an exported data set:
There may or may not be a way for you to write and run an export script from inside your Admin panel or whatever. If not, you could write a simple Python script using Selenium WebDriver which logs into your admin panel and exports all data to a *.sav data file.
Then you can use the Python SPSS extensions to write your analysis scripts. Note that these scripts have to run on a machine that has a copy of SPSS installed.
Once you have your data and analysis results accessible to Python, you should be able to easily write that to your other database.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.