I have a large JSON file from a web scraping project I've been working on for a while. Now I'm trying to build a web frontend on top of the JSON data, but I'm having a hard time figuring out the best way to go about it.
The JSON file looks like this:
{
    "_id": { "$oid": "55d5c85a96cc6212bdd4ca08" },
    "name": "Example",
    "url": "http://example.com/blahblah",
    "ts": { "$date": 1073423706824 }
}
I have a couple of questions:
1. The JSON file will be added to over time, so would the best solution be to regularly load it into a database, or just keep the JSON file in the cloud somewhere and pull from it when needed?
2. If I put it in a database, how could I regularly add to it without slowing down the front end of the site? I know I could use something like json_decode, but I've mostly only seen examples with a few lines of JSON; can it handle larger JSON files?
3. If I put it in a database, would a relational DB be faster/more efficient, or something like MongoDB?
After doing a lot of web scraping myself, here's what I would recommend:
1. Decide between a relational and a non-relational database. If your data is constantly changing, with an unknown number of fields, I recommend MongoDB (its documents are almost JSON, and it's fully schemaless, so it's easy to add new facets). If your data all has the same format, then a relational DB is a good step forward; PostgreSQL and MariaDB are good open source options.
2. Convert your current JSON data into the chosen DB's format and insert it.
3. Start scraping straight to the DB; try not to use JSON files any more.
4. Read from the database for your front end. If you're using Python, Flask is a good option.
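As a minimal sketch of steps 2-4 with MongoDB (the database/collection names and the local connection URL are assumptions, not from the question):

```python
from datetime import datetime, timezone

def to_document(name, url):
    """Shape one scraped record the way the existing JSON records look."""
    return {
        "name": name,
        "url": url,
        "ts": datetime.now(timezone.utc),  # a real datetime, not a raw epoch
    }

def store(records, mongo_url="mongodb://localhost:27017"):
    """Insert scraped (name, url) pairs; needs pymongo and a running MongoDB."""
    from pymongo import MongoClient  # pip install pymongo
    coll = MongoClient(mongo_url)["scraper_db"]["pages"]
    coll.insert_many([to_document(n, u) for n, u in records])
```

Because MongoDB is schemaless, adding a new facet later is just adding a key to the dict; no migration needed.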
There is also a really interesting question, Store static data in an array or in a database, posted previously, with some in-depth answers on static files vs. a database.
If you take static files out of the equation and use a database, here are the answers to your 3 questions:
1. Just use the database.
2. Adding to the database is simple. Once you've got it set up, your scraper can write straight to it with the relevant driver. Again, no need for JSON files.
3. It all depends on your data.
Related
I'm doing a web scraping + data analysis project that consists of scraping product prices every day, cleaning the data, and storing it in a PostgreSQL database. The final user won't have access to the data in this database (the daily scrapes add up to a huge amount, so eventually I won't be able to upload it to GitHub), but I want to explain how to replicate the project. The steps are basically:
Scrape with Selenium (Python) and save the raw data into CSV files (already on GitHub);
Read these CSV files, clean the data, and store it in the database (the cleaning script is already on GitHub);
Retrieve the data from the database to create dashboards and anything else I want (not yet implemented).
To clarify, my question is about how I can teach someone who sees my project to replicate it, given that this person won't have the database info (tables, columns). My idea is:
Add SQL queries in a folder, showing how to create the database skeleton (same tables and columns);
Add info to the README, such as how to create the environment variables for accessing the database.
Is it okay to do that? I'm looking for best practices in this context. Thanks!
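For the README part, a minimal sketch of reading the connection info from environment variables (the variable names here are assumptions; whichever names you choose, document them in the README):

```python
import os

def db_config():
    """Build a connection config from environment variables.

    Real credentials never land in the repo; someone replicating the
    project exports their own values before running the scripts.
    """
    return {
        "host": os.environ.get("DB_HOST", "localhost"),
        "port": int(os.environ.get("DB_PORT", "5432")),
        "dbname": os.environ.get("DB_NAME", "prices"),
        "user": os.environ.get("DB_USER", "postgres"),
        "password": os.environ.get("DB_PASSWORD", ""),
    }
```

The schema-creation SQL in a folder plus this kind of config is a common and perfectly acceptable way to make a database-backed project replicable.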
I'm migrating data from SQL Server 2017 to Postgres 10.5, i.e., all the tables, stored procedures etc.
I want to compare the data consistency between SQL Server and Postgres databases after the data migration.
All I can think of now is using Python pandas: load the tables into data frames from both SQL Server and Postgres and compare the data frames.
But the data is around 6 GB, which takes a long time to load into a data frame, and the databases are hosted on a server that is not local to where I'm running the Python script. Is there any way to efficiently compare data consistency between SQL Server and Postgres?
Yes: you can order the data by primary key and then write it to a JSON or XML file.
Then you can run diff over the two files.
You can also do this chunked by primary key, so you don't have to work with one huge file.
Log any chunk that doesn't compare as equal.
If it doesn't matter what the difference is, you can also just run MD5/SHA-1 over the two file chunks: if the hashes match, there is no difference; if they don't, there is.
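A rough Python sketch of the chunk-and-hash idea (the fetch side is left out; you'd pull equally sized, primary-key-ordered chunks from each database):

```python
import hashlib

def chunk_hash(rows):
    """Stable MD5 over one chunk of rows (tuples) ordered by primary key."""
    h = hashlib.md5()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()

def compare_chunks(chunks_a, chunks_b):
    """Yield the index of every chunk pair whose hashes differ.

    Only the differing chunks then need a row-by-row diff, so the huge
    tables never have to be compared in one piece.
    """
    for i, (a, b) in enumerate(zip(chunks_a, chunks_b)):
        if chunk_hash(a) != chunk_hash(b):
            yield i
```

You'd log each differing chunk index and re-query just those primary-key ranges for the detailed diff.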
Speaking from experience with NHibernate, what you need to watch out for is:
bit fields
text, ntext, varchar(MAX) and nvarchar(MAX) fields (they map to varchar with no length, by the way, with UTF-8 encoding)
varbinary, varbinary(MAX), image (bytea vs. large objects)
xml
making sure each primary key's serial/sequence generator is reset after you've inserted all the data into PostgreSQL.
Another thing to watch out for is which time zone CURRENT_TIMESTAMP uses.
Note: I'd actually run System.Data.DataRowComparer directly, without writing the data to a file:
static void Main(string[] args)
{
    // GetTable1/GetTable2 stand in for whatever loads the two tables.
    DataTable dt1 = GetTable1();
    DataTable dt2 = GetTable2();

    IEnumerable<DataRow> idr1 = dt1.Select();
    IEnumerable<DataRow> idr2 = dt2.Select();

    // Without an explicit comparer, Except uses DataRowComparer.Default:
    // MyDataRowComparer myComparer = new MyDataRowComparer();
    // IEnumerable<DataRow> results = idr1.Except(idr2, myComparer);
    IEnumerable<DataRow> results = idr1.Except(idr2);
}
Then you write all non-matching DataRows into a log file, one directory per table (if there are differences).
I don't know what Python uses in place of System.Data.DataRowComparer, though.
Since this would be a one-time task, you could also opt not to do it in Python and use C# instead (see the code sample above).
Also, if you have large tables, you could use a DataReader with sequential access to do the comparison. But if the simpler way cuts it, that reduces the required work considerably.
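In Python, pandas can play the role of DataRowComparer: an outer merge with indicator=True flags rows that exist on only one side. A sketch, assuming each chunk of the two tables fits in memory:

```python
import pandas as pd

def row_diff(df_mssql, df_pg):
    """Return rows present in one frame but not the other.

    The outer merge joins on all shared columns; the _merge indicator
    column marks each row as 'both', 'left_only', or 'right_only'.
    """
    merged = df_mssql.merge(df_pg, how="outer", indicator=True)
    return merged[merged["_merge"] != "both"]
```

Identical tables produce an empty result; anything else is a concrete list of mismatched rows you can log per table.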
Have you considered making your SQL Server data visible within your Postgres with a Foreign Data Wrapper (FDW)?
https://github.com/tds-fdw/tds_fdw
I haven't used this FDW tool, but overall the basic FDW setup process is simple. An FDW acts like a proxy/alias, allowing you to access remote data as though it were housed in Postgres. The tool linked above doesn't support joins, so you would have to perform your comparisons iteratively. Depending on your setup, you would also have to check whether performance is adequate.
Please report back!
I have scraped data from a website using their API in a Django application. The data is JSON (a Python dictionary once I retrieve it on my end). The data has many, many fields. I want to store it in a database so that I can create endpoints that will allow lookup and modification (updates). I need to use their fields to create the structure of my database. Any help on this issue, or on how to tackle it, would be greatly appreciated. I apologize if my question is not concise enough; please let me know if there is anything I need to specify.
I have seen many people saying to just populate it, as in this example: How to populate a Django sqlite3 database. The issue is, there are so many fields that I can't go and create the Django model fields myself. From what I have read, it seems like I may be able to use serializers.ModelSerializer, although that seems to just populate a pre-existing DB with an already defined model.
Tricky to answer without details, but I would consider doing this in two steps. First, convert your JSON data to a database schema, for example using a tool like sqlify: https://sqlify.io/convert/json/to/sqlite
Then create a database from the generated schema file, and use inspectdb to generate your Django models: https://docs.djangoproject.com/en/2.2/ref/django-admin/#inspectdb
You'll probably need to tweak the generated schema and/or models, but this should go a long way towards automating the process.
I would go for a document database, like Elasticsearch or MongoDB.
Those are made for exactly this kind of situation; look them up.
I have a Django app that uses django-piston to send out XML feeds to internal clients. Generally, these work pretty well but we have some XML feeds that currently run over 15 minutes long. This causes timeouts, and the feeds become unreliable.
I'm trying to ponder ways that I can improve this setup. If it requires some re-structuring of the data, that could be possible too.
Here is how the data collection currently looks:
class Data(models.Model):
    # fields

class MetadataItem(models.Model):
    data = models.ForeignKey(Data)

# handlers.py
data = Data.objects.filter(**kwargs)
for d in data:
    for metaitem in d.metadataitem_set.all():
        # there are usually between 55 and 95 entries in this loop
        label = metaitem.get_label()  # does some formatting here
        data_metadata[label] = metaitem.body
Obviously, the core of the program is doing much more; I'm just pointing out where the problem lies. When we have a data list of around 300, it just becomes unreliable and times out.
What I've tried:
Getting a collection of all the data IDs, then doing a single large query to get all the MetadataItems, and finally filtering those in my loop. This was meant to save some queries, and it did reduce them.
Using .values() to reduce model instance overhead, which did speed it up, but not by much.
One simpler idea I'm considering is to write to a cache in steps to avoid the timeout: write the first 50 data sets, save to cache, adjust some counter, write the next 50, and so on. I still need to think this through.
Hoping someone can help lead me in the right direction with this.
The problem in the piece of code you posted is that Django doesn't automatically fetch objects connected through a reverse relationship, so you end up making a query for each object. There's a nice way around this, as Daniel Roseman points out in his blog!
If this doesn't solve your problem well, you could also have a look at getting everything in one raw SQL query...
You could further reduce the query count by fetching the data and its metadata together. Note that select_related only follows forward foreign keys, so it won't pull in the reverse metadataitem_set relation; prefetch_related (available in newer Django versions) does handle reverse relations, batching all the metadata into one extra query. Something like:
data = Data.objects.filter(**kwargs).prefetch_related('metadataitem_set')
for d in data:
    # d.metadataitem_set.all() can now be called without querying the database
    for metaitem in d.metadataitem_set.all():
        # ...
This would greatly reduce the number of queries, though the size of the individual queries might become impractical.
However, I would suggest, if possible, to precompute the feeds from somewhere outside the webserver. Maybe you could store the result in memcache if it's smaller than 1 MB. Or you could be one of the cool new kids on the block and store the result in a "NoSQL" database like redis. Or you could just write it to a file on disk.
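A minimal sketch of the precompute-and-cache idea; the plain dict stands in for memcached or Django's cache so the sketch stays self-contained, and the key name and timeout are assumptions:

```python
FEED_TIMEOUT = 15 * 60  # seconds; tune to how stale a feed may get

def get_feed(cache, build_feed, key="piston-feed"):
    """Return the cached feed, rebuilding it only on a miss.

    build_feed is the slow function that runs the queries and renders
    the XML; it runs once per timeout window instead of per request.
    """
    feed = cache.get(key)
    if feed is None:
        feed = build_feed()
        cache[key] = feed  # with Django: cache.set(key, feed, FEED_TIMEOUT)
    return feed
```

The same shape works whether the backing store is memcached, Redis, or a file on disk.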
If you can change the structure of the data, maybe you can also change the datastore?
The "NoSQL" databases which allow some structure, like CouchDB or MongoDB could actually be useful here.
Let's say for every Data item you have a document. The document would have your normal fields. You would also add a 'metadata' field, which is a list of metadata. What about the following data structure:
{
    "id": "someid",
    "field": "value",
    "metadata": [
        { "key": "value" },
        { "key": "value" }
    ]
}
You would then be able to easily get to a data record and all its metadata. For searching, add indexes to the fields in the 'data' document.
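A small sketch of working with such a document; the flattening helper is plain Python, so it works with any driver, and the pymongo calls in the comment assume a collection and field names that are placeholders:

```python
def metadata_as_dict(doc):
    """Flatten the metadata list of {'key': 'value'} pairs into one dict."""
    out = {}
    for item in doc.get("metadata", []):
        out.update(item)
    return out

# With pymongo, the lookup and the index mentioned above would look
# roughly like this (collection name is an assumption):
#
#   coll.create_index("field")
#   doc = coll.find_one({"id": "someid"})
#   metadata = metadata_as_dict(doc)
```

One fetch per Data item replaces the per-item metadata query from the original code.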
I've worked on a system in Erlang/OTP that used Mnesia which is basically a key-value database with some indexing and helpers. We used nested records heavily to great success.
I added this as a separate answer as it's totally different than the other.
Another idea is to use Celery (www.celeryproject.com), a task queue for Python and Django. You can use it to perform any long-running tasks asynchronously without holding up your main app server.
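A sketch of what that could look like: keep the slow feed build as a plain function, then register it as a Celery task (the broker URL, names, and the returned string are assumptions for illustration):

```python
def build_feed(feed_id):
    """The slow feed generation, kept as a plain, testable function."""
    # ... run the heavy queries and render the XML here ...
    return "<feed id=%r/>" % feed_id

# With Celery installed and a broker running, this becomes a task:
#
#   from celery import Celery
#   app = Celery("feeds", broker="redis://localhost:6379/0")
#   task = app.task(build_feed)
#   task.delay("daily")   # runs asynchronously on a worker
```

The web request then just hands the ID to the worker and returns immediately; the finished XML can land in the cache discussed above.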
I'd like to start by asking for your opinion on how I should tackle this task, instead of simply how to structure my code.
Here is what I'm trying to do: I have a lot of data loaded into a MySQL table for a large number of unique names + dates (i.e., the date is a separate field). My goal is to be able to select a particular name (using raw_input, and perhaps in the future a drop-down menu) and see a monthly trend, with a moving average and perhaps other stats, for one of the fields (revenue, revenue per month, clicks, etc.). What is your advice: move this data to an Excel workbook via Python, or is there a way to display this information in Python (with charts that compare to Excel, of course)?
Thanks!
Analysis of such data (name, date) can be seen as issuing ad-hoc SQL queries to get time series information.
You will 'sample' your information by a date/time frame (day/week/month/year, or in more detail by hour/minute), depending on how large your dataset is.
I often use a query where the date field is truncated to the sample rate; in MySQL the DATE_FORMAT function is handy for that (Postgres and Oracle use date_trunc and trunc, respectively).
What you want to see in your data goes in your WHERE conditions:
SELECT DATE_FORMAT(date_field, '%Y-%m-%d') AS day,
       COUNT(*) AS nb_event
FROM yourtable
WHERE name = 'specific_value_to_analyze'
GROUP BY DATE_FORMAT(date_field, '%Y-%m-%d');
Execute this query and output it to a CSV file. You could use the mysql command-line client directly for that, but I recommend writing a Python script that executes the query; you can then use getopt options for output formatting (with or without column headers, a different separator than the default, etc.), and you can even build the query dynamically based on some options.
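A sketch of such a script, using argparse rather than getopt for the option handling; the database call itself is left out so only the option and formatting logic is shown, and the column names match the query above:

```python
import argparse
import csv

def build_parser():
    """Command-line options for the export script (names are assumptions)."""
    p = argparse.ArgumentParser(description="Dump a daily time series to CSV")
    p.add_argument("name", help="value of the name column to analyze")
    p.add_argument("--no-header", action="store_true",
                   help="omit the column header row")
    p.add_argument("--sep", default=",", help="field separator")
    return p

def write_csv(rows, out, sep=",", header=("day", "nb_event")):
    """Write (day, count) rows from the SQL query to a file-like object."""
    w = csv.writer(out, delimiter=sep)
    if header:
        w.writerow(header)
    w.writerows(rows)
```

The rows argument would come straight from the DB cursor's fetchall() after running the GROUP BY query.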
To plot this information, look at time series tools. If you have missing data (dates that won't appear in the result of such an SQL query), you should be careful with your choice. Excel is not the right tool for that, I think (or I haven't mastered it enough), but it could be a start.
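If you stay in Python, pandas handles the missing-dates problem well (this is a sketch of an alternative to the JavaScript route below): reindex onto a complete daily range, fill the gaps, and compute the moving average the question asks about.

```python
import pandas as pd

def daily_series_with_ma(days, counts, window=7):
    """Build a gap-free daily series plus its moving average.

    days/counts come from the SQL query above; dates absent from the
    result are filled with 0 so the moving average stays honest.
    """
    s = pd.Series(counts, index=pd.to_datetime(days))
    full = pd.date_range(s.index.min(), s.index.max(), freq="D")
    s = s.reindex(full, fill_value=0)
    return s, s.rolling(window, min_periods=1).mean()
```

Either series can then be dumped to CSV for Dygraphs or plotted directly with matplotlib.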
Personally, I find Dygraphs, a JavaScript library, really nice for time series plotting, and it can use a CSV file as its source. Be careful with such a configuration: due to cross-domain security constraints, the CSV file and the HTML page that displays the Dygraph object should be on the same server (or whatever the security constraints of your browser will accept).
I used to build such web apps using Django, as it's my favourite web framework, wrapping the URL calls like this:
GET /timeserie/view/<category>/<value_to_plot>
GET /timeserie/csv/<category>/<value_to_plot>
The first URL calls a view that simply outputs a template file with a variable referencing the URL for the CSV file used by the Dygraph object:
<script type="text/javascript">
    g3 = new Dygraph(
        document.getElementById("graphdiv3"),
        "{{ csv_url }}",
        {
            rollPeriod: 15,
            showRoller: true
        }
    );
</script>
The second URL calls a view that generates the SQL query and outputs the result as text/csv to be rendered by Dygraphs.
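The CSV side of that view might look roughly like this; the Django wrapping is shown only as a comment since it needs a full project to run, and the helper and header names are assumptions:

```python
import csv
import io

def timeseries_csv(rows, header=("day", "value")):
    """Render (date, value) rows as the CSV text Dygraphs expects."""
    buf = io.StringIO()
    w = csv.writer(buf)
    w.writerow(header)
    w.writerows(rows)
    return buf.getvalue()

# In the Django view:
#
#   def csv_view(request, category, value_to_plot):
#       rows = run_query(category, value_to_plot)   # the SQL from above
#       return HttpResponse(timeseries_csv(rows), content_type="text/csv")
```

Serving the CSV from the same Django app as the template page also sidesteps the cross-domain constraint mentioned earlier.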
It's "home made": it can stay simple or be extended, it runs easily on any desktop computer, and it could be extended to output JSON for use by other JavaScript libraries/frameworks.
Otherwise, there are open source tools for this kind of reporting (though their time series capabilities are often not enough for my needs), such as Pentaho, JasperReports, and SOFA. In such a tool you make the query a datasource inside a report and build a graph that outputs the time series.
I find that today's web techniques, with the right JavaScript library/framework, are really starting to challenge that old fashion of reporting with classical BI tools, and they make things interactive :-)
Your problem can be broken down into two main pieces: analyzing the data, and presenting it. I assume that you already know how to do the data analysis part, and you're wondering how to present it.
This seems like a problem that's particularly well suited to a web app. Is there a reason why you would want to avoid that?
If you're very new to web programming and programming in general, then something like web2py could be an easy way to get started. There's a simple tutorial here.
For a desktop database-heavy app, have a look at Dabo. It makes things like creating views on database tables really simple. wxPython, on which it's built, also has lots of simple graphing features.