Data Quality Process - defining rules - python

I am working on a Data Quality Monitoring project, which is new to me.
I started with data profiling to analyse my data and get a global view of it.
Next, I thought about defining some data quality rules, but I'm a little confused about how to implement them.
If you could guide me a little, I'd appreciate it, as I'm totally new to this.

This is quite an ambiguous question, but I'll try to offer a few tips on how to start. Since you are new to data quality and already want implementation hints, let's start from that.
Purpose: a data quality monitoring system wants to a) recognize an error and b) trigger the next step for handling it.
First, build a data quality rule for your data set. A rule can be an attribute, record, table or cross-table rule. Let's start with an attribute-level rule. Implement a rule that recognizes when attribute content does not contain '@'. Run it against email attributes and create an error record for each row whose email attribute does not contain '@'. An error record should have these attributes:
ErrorInstanceID; ErrorName; ErrorCategory; ErrorRule; ErrorLevel; ErrorReaction; ErrorScript; SourceSystem; SourceTable; SourceRecord; SourceAttribute; ErrorDate;
"asd2321sa1"; "Email Format Invalid"; "AttributeError"; "Does not contain #"; "Warning|Alert"; "Request new email at next login"; "ScriptID x"; "Excel1"; "Sheet1"; "RowID=34"; "Column=Email"; "1.1.2022"
MONITORING SYSTEM
You need to make the above scripts configurable so that you can easily change systems, tables and columns as well as rules. When run on top of data sets, they will all populate error records into the same structures, resulting in consistent, historical storage of all errors. You should be able to build reports about existing errors in specific systems, trends of errors appearing or getting fixed, and so on.
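One way to get that configurability, sketched here with made-up rule names and tables, is to drive a small runner from plain configuration data:

```python
# Hypothetical rule configuration: each entry says where to run which check.
RULE_CONFIG = [
    {"system": "Excel1", "table": "Sheet1", "column": "Email",
     "rule": "contains", "argument": "@", "level": "Warning"},
    {"system": "CRM", "table": "Customer", "column": "Phone",
     "rule": "not_empty", "argument": None, "level": "Alert"},
]

CHECKS = {
    "contains": lambda value, arg: arg in (value or ""),
    "not_empty": lambda value, arg: bool(value and str(value).strip()),
}

def run_rules(tables, config=RULE_CONFIG):
    """tables maps (system, table) -> list of row dicts; returns one failure per breached rule/row."""
    failures = []
    for rule in config:
        check = CHECKS[rule["rule"]]
        rows = tables.get((rule["system"], rule["table"]), [])
        for row_id, row in enumerate(rows, start=1):
            if not check(row.get(rule["column"]), rule["argument"]):
                failures.append({**rule, "row": row_id})
    return failures

# Every rule writes into the same failure structure, so reporting stays uniform.
print(run_rules({("Excel1", "Sheet1"): [{"Email": "broken-address"}]}))
```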
Next, you need to start building a full-scale data quality metadata repository with a proper data model, and design suitable historical versioning for the above information. You need to store information such as which rules were run and when, which systems and tables they checked, and so on, both to detect which systems have been included in monitoring and to recognize when systems are not monitored with the correct rules. In practice, this is quality monitoring for the data quality monitoring system itself. You should have statistics on which systems are monitored with which rules, when they were last run, and aggregates of inspected tables, records and errors.
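As a rough illustration of the kind of run metadata worth keeping (the field names are assumptions, not a prescribed model):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RuleRun:
    """One execution of one rule against one table; the raw material for monitoring statistics."""
    run_id: str
    rule_id: str
    source_system: str
    source_table: str
    started_at: datetime
    finished_at: datetime
    records_checked: int = 0
    errors_found: int = 0
```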
Typically, it's more important to focus on errors that need immediate attention and "alert" an end user to go fix the issue, or that trigger a new workflow or flag in the source system. For example, plain invalid emails might be categorized as warnings and remain just aggregate statistics: we have 2,134,223 invalid emails; nobody cares. However, it is more important to recognize an invalid email for a person who has asked for his bills to be delivered as digital invoices to that email. Alert. That kind of error (Invalid Email AND Email Invoicing) should trigger an alert and set a flag in the CRM so end users can try to get the email fixed. There should not be any error records for this error, but this kind of rule should be run on top of all systems that store customer contact and billing preferences.
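A hedged sketch of such a cross-condition rule; the `crm` client object and the field names are hypothetical stand-ins for your actual CRM integration:

```python
def flag_customers_needing_email_fix(customers, crm):
    """Alert-level rule: invalid email AND digital-invoicing preference -> flag in the CRM."""
    flagged = 0
    for customer in customers:
        email = customer.get("email") or ""
        wants_email_invoices = customer.get("invoice_channel") == "email"
        if wants_email_invoices and "@" not in email:
            # No per-row error record here; just raise a flag for end users to act on.
            crm.set_flag(customer["customer_id"], "EMAIL_FIX_REQUIRED")
            flagged += 1
    return flagged
```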
For a technical person, I can recommend this book. It goes deeper into the technical and logical issues of data quality assessment and monitoring systems, and it also contains a small metadata model for data quality metadata structures. https://www.amazon.com/Data-Quality-Assessment-Arkady-Maydanchik/dp/0977140024/


Is there a library in Python that has a full list of registered domains on the internet?

I want to implement a program that basically checks whether input from the user is a genuine domain name or a name registered with ICANN.
Is there a library in Python that has a full list of registered domains on the internet?
No, not in Python or any other language, and for obvious reasons (don't you think such a list would get misused if it existed?).
First, two points:
checks whether input from the user is a genuine domain name or a name registered with ICANN
ICANN has nothing to do with your needs here. It has no operational role in the day-to-day life of domain names, in how they are registered or deleted. ICANN just defines the list of current TLDs, which are basically the topmost registries maintaining central databases with the list of domain names.
What is a "genuine" domain? But more importantly, what exactly do you need to test with domains? Is it like for people entering an email address to test if it is plausible? If so, testing the domain name is not the correct approach. But to say more really depends on your use case that you do not describe, which also makes your question offtopic as long as it has no more relationship with programming.
So just some generic ideas:
you can use the DNS to query domain names, but this has edge cases (and you need to understand how the DNS works), among them the fact that not all genuinely registered domain names are published in the DNS, for totally normal reasons
you can use whois, as John said, or better RDAP where it is available (at least for all gTLDs); this has drawbacks too: without a solid library parsing the replies you don't even have a standard way of finding out that a name does not exist, as registries return different free-form strings for such cases; plus it is not suitable for high-volume queries, as it is heavily rate limited
if you are really interested in something closer to a list of domain names, all gTLDs are required to publish their zonefile at least daily, which is basically the list of all resolving domains (a subset of the list of all registered domains; expect a few percent of difference), see https://czds.icann.org/ ; some ccTLDs do it too, but each under its own policies and rules; some have an "open data" feature that offers something similar (often with a delay), or a list of "recently" registered domain names (so if you grab that day after day, at some point you have something close to the list of all domains)
Any good programming language already has polished libraries for DNS queries, whois or RDAP queries, or parsing zonefiles.
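For example, a rough DNS-based existence check with the dnspython library, keeping in mind the caveat above that not every registered domain is published in the DNS:

```python
import dns.exception
import dns.resolver  # pip install dnspython

def domain_seems_registered(name: str) -> bool:
    """Best-effort check: does the domain have NS records in the public DNS?"""
    try:
        dns.resolver.resolve(name, "NS")
        return True
    except dns.resolver.NXDOMAIN:
        return False        # the name does not exist in the DNS
    except (dns.resolver.NoAnswer, dns.resolver.NoNameservers):
        return True         # the name exists but gave an odd answer; treat as registered
    except dns.exception.Timeout:
        raise               # network problem, not a verdict on the name

print(domain_seems_registered("example.com"))
```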

How to build a real time recommendation engine with good performance?

I am a data analyst and have just been assigned to build a real-time recommendation engine for our website.
I need to analyse the visitor behaviour and do real-time analysis on that input. Thus I have three questions about this project.
1) Users are not forced to sign-up. Is there any methodology to capture user behavior such as search and visit history?
2) The recommendation models can be pre-trained but the prediction process takes time. How can we improve the performance?
3) I only know how to write Python Scripts. How can I implement the recommendation engine with my python scripts?
Thanks.
===============
However, 90% of our customers purchase the products during their first visit and will not come back again any time soon.
We cannot have a ready-made model for new visitors.
And they prefer to use itemCF for the recommendation engine.
It sounds like a mission impossible now...
This is quite a broad question; however, I will do my best to answer:
Visit history can be tracked by enabling some form of analytics tracking on your domain. This can be a pre-built solution that you implement, which will provide a detailed overview of all visitors to your domain, usually with some form of dashboard. Most pre-built solutions provide a way to obtain the analytics that have been collected.
Another way would be to use browser cookies to store information pertaining to each visit to your domain and the visitor's search/page history. This information will be available to the website whenever the user visits it within the same browser. When the user visits your website, you could send the information to a server/REST endpoint which could analyse it (IP/geolocation/number of visits/search/page history) and make recommendations based on that. Another common method is to track past purchases etc.
To improve performance, one solution would be to always have the prediction model for a particular user ready for when they next visit the site. That way, there is no delay. However, the first time a user visits you likely won't have enough information to make detailed predictions, so you will have to resort to providing options based on geolocation (which shouldn't take too long and won't impact performance).
There is another approach that can be taken; the above mainly talked about making predictions based on a user's behaviour while browsing the website. Content-based filtering is another approach, which recommends things that are similar to an item the user is currently viewing. This approach is generally easier, as it just requires that you query a database for items that are similar in category, purpose/use etc.
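As a rough sketch of that content-based idea, item feature vectors (made-up categories here) can be compared with cosine similarity from scikit-learn; the item data is purely illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical one-hot item features: [electronics, kitchen, outdoors, budget]
items = {
    "kettle":     np.array([0, 1, 0, 1]),
    "toaster":    np.array([0, 1, 0, 1]),
    "headphones": np.array([1, 0, 0, 0]),
    "tent":       np.array([0, 0, 1, 0]),
}

def similar_items(current, k=2):
    """Rank other items by cosine similarity to the one currently being viewed."""
    names = [n for n in items if n != current]
    scores = cosine_similarity(items[current].reshape(1, -1),
                               np.array([items[n] for n in names]))[0]
    return sorted(zip(names, scores), key=lambda pair: -pair[1])[:k]

print(similar_items("kettle"))  # the toaster should rank first
```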
There is no getting around using JavaScript for the client-side stuff, however your recommendation engine can be built in Python (it could be a simple REST API endpoint with access to the items database). Most people use Flask, Django or Eve to implement REST APIs in Python.
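For instance, a minimal Flask endpoint along those lines; the payload fields and the hard-coded similarity lookup are placeholders for your real items database and model:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical lookup of similar items; in practice this would query your items database.
SIMILAR = {"kettle": ["toaster", "mug"], "tent": ["sleeping bag", "stove"]}

@app.route("/recommend", methods=["POST"])
def recommend():
    """Receive visitor context from the JavaScript front end and return recommendations."""
    payload = request.get_json(force=True)  # e.g. {"current_item": "kettle"}
    current = payload.get("current_item", "")
    return jsonify({"recommendations": SIMILAR.get(current, [])})

if __name__ == "__main__":
    app.run(port=5000)
```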

Providing visibility of periodic changes to a database

This is quite a general question, though I’ll give the specific use case for context.
I'm using a FileMaker Pro database to record personal bird observations. For each bird on the national list, I have extracted quite a lot of base data by website scraping in Python, for example conservation status, geographical range, scientific name and so on. In day-to-day use of the database, this base data remains fixed and unchanging. However, once a year or so I will want to re-scrape the base data to pick up the most recent published information on status, range, and even changes in scientific name (that happens).
I know there are options such as PyFilemaker or bBox which should allow me to write to the FileMaker database from Python, so the update mechanism itself shouldn't be a problem.
It would be rather dangerous simply to overwrite all of last year's base data with the newly scraped data, and I'm looking for general advice on how best to provide visibility of the changes before manually importing them. What I have in mind is to use pandas to generate a spreadsheet from the base data, with the changed cells highlighted. Does that sound like a sensible way of doing it? I suspect this may be a very standard requirement, and if anybody could comment on an approach that is straightforward to implement in Python, that would be most helpful.
This is not a standard requirement, and there is no easy way of doing it. The best way to track changes is a source control system like Git, but that is not applicable to FileMaker Pro, as the files are binary.
You can try your approach, or you can add the new records in FileMaker instead of updating the existing ones and flag them as current, or only ever use the most recent record.
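If you do try the pandas approach, a minimal sketch of the comparison step might look like this; the column names and the index on scientific name are assumptions about your base data:

```python
import pandas as pd

# Hypothetical base data: last year's values vs. the freshly scraped ones.
old = pd.DataFrame({"scientific_name": ["Cygnus olor", "Sturnus vulgaris"],
                    "status": ["Amber", "Red"],
                    "range": ["Resident", "Resident"]}).set_index("scientific_name")
new = pd.DataFrame({"scientific_name": ["Cygnus olor", "Sturnus vulgaris"],
                    "status": ["Amber", "Amber"],
                    "range": ["Resident", "Resident"]}).set_index("scientific_name")

# DataFrame.compare keeps only the cells that differ ("self" = old, "other" = new).
changes = old.compare(new)
print(changes)

# Export the diff for manual review before importing anything back into FileMaker.
changes.to_excel("base_data_changes.xlsx")
```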
There are some amazing guys here, but you might want to take it to one of the FileMaker forums, as the FileMaker audience there is much larger than on SO.

Best strategy for error handling in an interface to a database and web display

I decided to ask this question after going back and forth hundreds of times trying to place error handling routines to optimize data integrity, while also taking into account speed and efficiency (and wasting hundreds of hours in the process). So here's the setup.
Database   ->  python classes    ->  python code          ->  javascript
(MongoDB)      (that represent       (that serves the         (web interface)
                the data)             pages; Pyramid)
I want the data to be robust; that is the number one requirement. So right now I validate data on the JavaScript side of the page, but I also validate in the Python classes, which more or less represent data structures. While most server routines run through the Python classes, sometimes that feels inefficient, given that data has to pass through different levels of error checking.
EDIT: I guess I should clarify. I am not looking to unify validation of client- and server-side code. Sorry for the bad write-up. I'm looking more to figure out where the server-side validation should be done. Should it be in the direct interface to the database, or in the web server code where the data is received?
For instance, if I have an object with a barcode, should I validate the barcode in the code that receives the data through AJAX, or should I just call the object's class and validate there?
Again, are there some guidelines on how to do error checking in general? I want to be sort of professional, and learn, but hopefully not have to go through a whole book.
I am not a software engineer, but I hope those of you who are familiar with complex projects can tell me where I can find a few guidelines on how to model/error-check in a situation like this.
I'm not necessarily looking for an answer, but more for a pointer to a short set of guidelines for creating projects with different layers like this. Hopefully not extremely long.
I don't even know what tags to use in the post. HELP!!
Validating on the client and validating on the server serve different purposes entirely. Validating on the server is to make sure your model invariants hold, and it has to be done to maintain data integrity. Validating on the client is so the user gets a friendly error message telling him that his input would have violated data integrity, instead of having a traceback blow up in his face.
So there's a subtle difference: when validating on the server you only really care whether or not the data is valid. On the client you also care, on a finer-grained level, about why the input is invalid. (Another thing that has to be handled on the client is input format errors, i.e. entering characters where a number is expected.)
It is possible to meet in the middle a little. If your model validity constraints are specified declaratively, you can use that metadata to generate some of the client validations, but they're not really sufficient. A good example is user registration. Commonly you want two password fields and you want the input in both to match, but the model will only contain one attribute for the password. You might also want to check the password complexity, but that's not necessarily a domain-model invariant. (That is, your application will function correctly even if users have weak passwords, and the password complexity policy can change over time without data integrity breaking.)
Another problem specific to client-side validation is that you often need to express a dependency between the validation checks. E.g. you have a required field that's a number which must be lower than 100. You need to validate a) that the field has a value; b) that the field value is a valid integer; and c) that the field value is lower than 100. If any of these checks fails, you want to avoid displaying unnecessary error messages for further checks in the sequence, in order to tell the user what his specific mistake was. The model doesn't need to care about that distinction. (Aside: this is where some frameworks fail miserably - JSF or Spring MVC, or both, first attempt to do data-type conversion from the input strings to the form object properties, and if that fails, they cannot perform any further validations.)
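To illustrate that dependency between checks (sketched in Python for brevity, though the same shape applies to the client-side JavaScript), each check only runs if the previous one passed and the first failure is the message shown to the user; the field and the limit are made up:

```python
def validate_quantity(raw_value, limit=100):
    """Return the first applicable error message, or None if the input is valid."""
    if raw_value is None or str(raw_value).strip() == "":
        return "This field is required."                       # check (a)
    try:
        number = int(raw_value)                                 # check (b)
    except ValueError:
        return "Please enter a whole number."
    if number >= limit:
        return f"Please enter a number lower than {limit}."     # check (c)
    return None

for value in ["", "abc", "250", "42"]:
    print(repr(value), "->", validate_quantity(value))
```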
In conclusion, the above implies that if you care about data integrity and usability, you necessarily have to validate data at least twice, since the validations serve different purposes even if there is some overlap. Client-side validation will have more checks, and finer-grained checks, than the model-layer validation. I wouldn't really try to unify them except where your chosen framework makes it easy. (I don't know about Pyramid; Django keeps these concerns separate in that Forms are a different layer from your Models, both can be validated, and they're joined by ModelForms that let you add validations on top of the ones performed by the model.)
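A rough Django-flavoured sketch of that separation, reusing the registration example from above (it assumes a configured Django project, and the model and field names are illustrative, not a recommended design):

```python
from django import forms
from django.db import models

class Account(models.Model):
    # The model stores one password attribute; matching two typed passwords is not its concern.
    email = models.EmailField(unique=True)
    password_hash = models.CharField(max_length=128)

class RegistrationForm(forms.ModelForm):
    password = forms.CharField(widget=forms.PasswordInput)
    password_confirm = forms.CharField(widget=forms.PasswordInput)

    class Meta:
        model = Account
        fields = ["email"]

    def clean(self):
        # Form-level rule with no model counterpart: the two password fields must match.
        cleaned = super().clean()
        if cleaned.get("password") != cleaned.get("password_confirm"):
            self.add_error("password_confirm", "Passwords do not match.")
        return cleaned
```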
Not sure I fully understand your question, but error handling in pymongo is documented here:
http://api.mongodb.org/python/current/api/pymongo/errors.html
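A bare-bones sketch of catching those pymongo errors around a write; the connection string, database and collection names are placeholders:

```python
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure, DuplicateKeyError, PyMongoError

client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)

def save_item(doc):
    """Insert a document, translating low-level errors into something the web layer can report."""
    try:
        return client.mydb.items.insert_one(doc).inserted_id
    except DuplicateKeyError:
        raise ValueError("An item with this key already exists.")
    except ConnectionFailure:
        raise RuntimeError("Database is unreachable; try again later.")
    except PyMongoError as exc:
        raise RuntimeError(f"Unexpected database error: {exc}")
```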
Not sure if you're using a particular ORM - the docs have links to what's available, and each of these has its own best usages:
http://api.mongodb.org/python/current/tools.html
Do you have a particular ORM that you're using, or are you implementing your own layer on top of pymongo?

Keeping track of user habits and activities? - Django

I was working on a project a few months ago and had the need to implement an award system, similar to Stack Overflow's badge system.
I might have not implemented it in the best possible way, and I am curious what your say in it would be.
What would be a good way to track the user activities needed for badge awarding?
Stack Overflow's system needs to know a lot of information, and I also get the impression that there would be a lot of data processing complicating things.
I would assume that SO calculates badges once or twice every 24 hours, and that maybe logs are stored, or a server is dedicated to badge calculation.
Thoughts?
I don't think it's as complicated as you think. I highly doubt that SO calculates badges from some kind of user activity log (although technically the entire database is a user activity log). When I look at the lists of badges, I don't see anything that couldn't be implemented by running a SQL SELECT query.
Some of the queries could be pretty complicated, and there might be some sort of fancy caching mechanism, but I don't see any reason why you would have to calculate badges in batches.
In general, badge/point systems can be based on two things.
An activity log of interesting events. This is effectively the paper register receipt of what has happened, so that you can re-compute everything from the ground up if that is ever needed. It can be as simple as (user_id, timestamp, event_id, event_detail).
Most of the time you have pre-designed your scoring/point system, so you know exactly which counters to keep on a user. Then it's as simple as having one big record that contains all of the details: (user_id, reply_points, login_points, last_login, thumbs_up_points, etc., etc.).
Now you can slap some simple methods on that model object and have it manage/store the points as needed.
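Since the question mentions Django, here is a hedged sketch of both shapes as models (it assumes a configured Django project, and the field and badge names are made up):

```python
from django.db import models

class ActivityEvent(models.Model):
    """Option 1: an append-only log you can always re-compute badges from."""
    user = models.ForeignKey("auth.User", on_delete=models.CASCADE)
    timestamp = models.DateTimeField(auto_now_add=True)
    event_id = models.CharField(max_length=50)      # e.g. "answer_upvoted"
    event_detail = models.JSONField(default=dict)

class UserScore(models.Model):
    """Option 2: pre-designed counters kept directly on the user."""
    user = models.OneToOneField("auth.User", on_delete=models.CASCADE)
    reply_points = models.IntegerField(default=0)
    thumbs_up_points = models.IntegerField(default=0)
    last_login = models.DateTimeField(null=True, blank=True)

    def earned_badges(self):
        # Hypothetical thresholds; a nightly job or a signal handler could call this.
        badges = []
        if self.reply_points >= 100:
            badges.append("Helpful Answerer")
        if self.thumbs_up_points >= 500:
            badges.append("Crowd Favourite")
        return badges
```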
