Are there any existing batch log file aggregation solutions? - python

I want to export log files (in my case Apache access and error logs) from multiple nodes and aggregate that data in batch, as a scheduled job. I have seen multiple solutions that work with streaming data (e.g. Scribe). I would like a tool that gives me the flexibility to define the destination, because I want to use HDFS as the destination.
I have not been able to find a tool that supports this in batch. Before reinventing the wheel, I wanted to ask the StackOverflow community for their input.
If a solution already exists in Python, that would be even better.
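To make the requirement concrete, this is roughly the scheduled batch job I would otherwise hand-roll (the paths are placeholders; it assumes the rotated logs have already been pulled from each node and that the hdfs CLI is available):

import glob
import subprocess
from datetime import date

# Placeholder layout: rotated logs already pulled from each node
# (e.g. via rsync/scp) into /var/aggregated/<node>/access.log.1
local_logs = glob.glob("/var/aggregated/*/access.log.1")
target_dir = "/logs/apache/" + date.today().isoformat()

# Push everything into a dated HDFS directory for later batch processing.
subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", target_dir])
for path in local_logs:
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", path, target_dir])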

We use http://mergelog.sourceforge.net/ to merge all our Apache logs.

Take a look at zohmg; it's an aggregation/reporting system for log files using HBase and HDFS: http://github.com/zohmg/zohmg

Scribe can meet your requirements. There's a version (link) of Scribe that can aggregate logs from multiple sources, and after reaching a given threshold it stores everything in HDFS. I've used it and it works very well. Compilation is quite complicated, so if you have any problems, ask a question.

PiCloud may help.
The PiCloud Platform gives you the freedom to develop your algorithms and software without sinking time into all of the plumbing that comes with provisioning, managing, and maintaining servers.

Related

How to verify if two systems are in sync

I have a requirement to test two applications (via automation using Python).
For example, we have a system called “www.abc.com” where we develop and merge code every 2 weeks, and then we create another system called “www.xyz.com” (basically a backup of the first system). Every time we do a release and add or edit something in the main system, we update our backup system.
I need to test both systems after every release (every 2 weeks) to see whether they are in sync (identical).
How do I run a Python automation test script (multiple tests) to check whether, for example, the databases, servers, UI, front end, and code base are the same in both systems? Is this possible? If so, any help and advice would be appreciated so that I can implement a solution.
There are several ways you could approach this:
Assuming you are using some sort of source control, you could write a script to make sure the repo is up to date and then report back the results. See here and here. This probably won't cover the data in your databases, but there are numerous ways to compare database backups, and it will depend on what programs you are using.
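For the source-control check, a minimal sketch along these lines might do, assuming both systems are Git working copies reachable from the machine running the script (the repository paths are hypothetical):

import subprocess

def head_revision(repo_path):
    # Ask Git for the commit hash currently checked out in repo_path.
    return subprocess.check_output(
        ["git", "-C", repo_path, "rev-parse", "HEAD"], text=True
    ).strip()

rev_main = head_revision("/srv/abc")      # hypothetical path for www.abc.com
rev_backup = head_revision("/mnt/xyz")    # hypothetical path for www.xyz.com
print("in sync" if rev_main == rev_backup else "out of sync")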
Another (or an additional) way you might check is to write a script that gathers a list of hashes or checksums of all the files you care about on both systems and then compares the lists for differences.
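For the checksum approach, a sketch like the following could gather per-file SHA-256 digests for a directory tree (the paths are placeholders); run it with both trees reachable, or run the hashing part on each system separately and diff the output:

import hashlib
import os

def hash_tree(root):
    # Map each file's path (relative to root) to its SHA-256 hex digest.
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests

main_hashes = hash_tree("/srv/abc")     # placeholder path on the main system
backup_hashes = hash_tree("/mnt/xyz")   # placeholder path on the backup system
for rel_path in sorted(set(main_hashes) | set(backup_hashes)):
    if main_hashes.get(rel_path) != backup_hashes.get(rel_path):
        print("differs or missing:", rel_path)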

How to use pyparted to inspect and change Partition Table

I would like to use pyparted (libparted python bindings) to implement a rather complex SD-Card initialization scheme.
Currently I'm using a bash script, but that is becoming rather messy.
Unfortunately I was unable to find any specification of the libparted API (the API manual in parted's /doc/ directory is useless and the Doxygen comments are incomplete, to say the least).
What I need to do is:
retrieve current partitioning scheme (to be sure I'm dealing with the correct SD)
optionally retrieve some information from there (I know how to do this).
setup a custom partitioning scheme (>4 partitions, needs "extended")
initialize filesystems (one FAT32 + several ext4) (I am unsure if this can be done directly using pyparted or if I need to spawn mkfs instead)
Can someone suggest the right approach?
I've created some examples for pyparted that cover most of this and will hopefully be merged.
In the meantime, you can see them in the pull request at:
https://github.com/dcantrell/pyparted/pull/64
BTW, you should probably use a GPT partition table these days too. That way you don't have to worry about primary/extended partitions; they're all just partitions.
HTH
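In the meantime, here is a minimal sketch of the basic pyparted calls involved, assuming a GPT label as suggested above; the device path and geometry numbers are placeholders, and filesystem creation still has to be done by spawning mkfs, since libparted only records the filesystem type:

import parted

device = parted.getDevice("/dev/sdX")          # placeholder: your SD card device

# Inspect the current partitioning scheme.
disk = parted.newDisk(device)
for part in disk.partitions:
    print(part.number, part.geometry, part.fileSystem)

# Start over with a fresh GPT label and one partition spanning most of the disk.
disk = parted.freshDisk(device, "gpt")
geometry = parted.Geometry(device=device, start=2048,
                           length=device.getLength() - 2048 - 34)  # leave room for the GPT backup header
partition = parted.Partition(disk=disk, type=parted.PARTITION_NORMAL,
                             geometry=geometry)
disk.addPartition(partition=partition,
                  constraint=device.optimalAlignedConstraint)
disk.commit()

# pyparted/libparted does not create filesystems; call mkfs afterwards, e.g.:
#   subprocess.check_call(["mkfs.ext4", "/dev/sdX1"])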

Spark and Cassandra through Python

I have huge data stored in Cassandra and I want to process it using Spark through Python.
I just want to know how to connect Spark and Cassandra through Python.
I have seen people using sc.cassandraTable, but it isn't working for me, and fetching all the data from Cassandra at once and then feeding it to Spark doesn't make sense.
Any suggestions?
Have you tried the examples in the documentation?
Spark Cassandra Connector Python Documentation
spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="kv", keyspace="test")\
.load().show()
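Note that the reader above only works if the session was started with the Spark Cassandra Connector on the classpath and pointed at your cluster. A minimal sketch, where the contact point, package version, keyspace and table names are placeholders:

from pyspark.sql import SparkSession

# Launch with the connector package, e.g.:
#   spark-submit --packages com.datastax.spark:spark-cassandra-connector_2.12:3.4.1 app.py
spark = SparkSession.builder \
    .appName("cassandra-example") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .getOrCreate()

df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="test") \
    .load()

# Filters on partition key columns are pushed down to Cassandra,
# so you don't have to pull the whole table into Spark.
df.filter(df.key == "some_partition_key").show()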
I'll just give my "short" 2 cents. The official docs are totally fine for getting started. You might want to specify why it isn't working, e.g. did you run out of memory (perhaps you just need to increase the driver memory), or is there some specific error causing your example not to work? It would also be nice if you provided that example.
Here are some opinions/experiences I've had. Usually, not always, but most of the time, you have multiple columns in a partition. You don't always have to load all the data in a table, and you can more or less keep the processing (most of the time) within a single partition. Since the data is sorted within a partition, this usually goes pretty fast and hasn't presented any significant problem.
If you don't want the whole store-in-Cassandra, fetch-into-Spark cycle for your processing, there are really a lot of solutions out there; basically that would be Quora material. Here are some of the more common ones:
Do the processing in your application right away - this might require some sort of inter-instance communication framework like Hazelcast or, even better, Akka Cluster; this is really a wide topic.
Spark Streaming - do your processing right away in micro-batches and flush the results for reading to some persistence layer, which might be Cassandra.
Apache Flink - use a proper streaming solution and periodically flush the state of the process to, e.g., Cassandra.
Store data in Cassandra the way it's supposed to be read - this approach is the most advisable (it's just hard to say with the info you provided).
The list could go on and on: user-defined functions in Cassandra, aggregate functions if your task is something simpler, and so on.
It might also be a good idea to provide some details about your use case. What I said here is more or less general and vague, but then again putting all of this into a comment just wouldn't make sense.

Python support for Azure ML -- speed issue

We are trying to create an Azure ML web service that will receive a (.csv) data file, do some processing, and return two similar files. The Python support recently added to the Azure ML platform was very helpful, and we were able to successfully port our code, run it in experiment mode, and publish the web service.
Using the "batch processing" API, we are now able to direct a file from blob storage to the service and get the desired output. However, run-time for small files (a few KB) is significantly slower than on a local machine, and more importantly, the process never seems to return for slightly larger input data files (40MB). Processing time on my local machine for the same file is under 1 minute.
My question is whether you can see anything we are doing wrong, or whether there is a way to speed this up. Here is the DAG representation of the experiment:
Is this the way the experiment should be set up?
It looks like the problem was with the processing of a timestamp column in the input table. The successful workaround was to explicitly force the column to be processed as string values, using the "Metadata Editor" block. The final model now looks like this:

File changed event network share

What I want to do:
I have two directories. Each one contains about 90,000 xml and bak files.
I need the xml files in both folders to stay in sync when a file changes (of course the newer one should be copied).
The problem is:
Because of the huge number of files and the fact that one of the directories is a network share, I can't just loop through the directories and compare os.path.getmtime(file) values.
Even watchdog and PyQt don't work (I tried the solutions from here and here).
The question:
Is there any other way to get a file-changed event (on Windows systems) that works for this configuration without looping through all those files?
So I finally found the solution:
I changed some of my network share settings and used the FileSystemWatcher.
To prevent files from being re-synced while syncing, I use an MD5 file hash.
The code I use can be found at pastebin (it's quick and dirty code, and covers just the parts mentioned in the question here).
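A minimal sketch of the hash check mentioned above, using only hashlib (the paths are placeholders); a file is copied only when the digests differ:

import hashlib

def md5_digest(path):
    # Hash the file in chunks so large files don't need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

local_file = r"C:\data\foo.xml"               # placeholder paths
remote_file = r"\\server\share\data\foo.xml"
if md5_digest(local_file) != md5_digest(remote_file):
    print("contents differ, copy the newer file")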
The code in https://stackoverflow.com/a/12345282/976427 seems to work for me when passed a network share.
I'm risking giving an answer that is way off here (you didn't specify requirements regarding speed, etc.), but... Dropbox would do exactly that for you for free, and would require writing no code at all.
Of course it might not suit your needs if you require real-time syncing, or if you want to avoid "sharing" your files with a third party (although you can encrypt them first).
Can you use the second option on this page?
http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html
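The second option on that page uses the Win32 ReadDirectoryChangesW call via pywin32. A rough sketch of that approach (the share path is a placeholder, and whether change notifications actually arrive over a given network share depends on the file server):

import win32con
import win32file

path = r"\\server\share\data"          # placeholder network path
handle = win32file.CreateFile(
    path,
    win32con.GENERIC_READ,
    win32con.FILE_SHARE_READ | win32con.FILE_SHARE_WRITE | win32con.FILE_SHARE_DELETE,
    None,
    win32con.OPEN_EXISTING,
    win32con.FILE_FLAG_BACKUP_SEMANTICS,   # required to open a directory handle
    None,
)
while True:
    # Blocks until something changes under the directory (subtree=True).
    results = win32file.ReadDirectoryChangesW(
        handle,
        8192,
        True,
        win32con.FILE_NOTIFY_CHANGE_LAST_WRITE | win32con.FILE_NOTIFY_CHANGE_FILE_NAME,
        None,
        None,
    )
    for action, filename in results:
        print(action, filename)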
Since you mention watchdog, I assume you are running under Linux. For the local machine, inotify can help, but for the network share you are out of luck.
Mercurial's inotify extension http://hgbook.red-bean.com/read/adding-functionality-with-extensions.html has the same limitation.
In a similar situation (10K+ files) I used a cloned Mercurial repository with inotify on both the server and the local machine. They automatically committed and notified each other of changes. It had a slight delay (no problem in my case), but as a benefit it had a full history of changes and easily resynced after one of the systems had been down.
