I would like to use pyparted (libparted python bindings) to implement a rather complex SD-Card initialization scheme.
Currently I'm using a bash script, but that is becoming rather messy.
Unfortunately I was unable to find any specification of the libparted API (the API manual in the parted /doc/ directory is useless, and the Doxygen comments are incomplete to say the least).
What I need to do is:
retrieve current partitioning scheme (to be sure I'm dealing with the correct SD)
optionally retrieve some information from there (I know how to do this).
set up a custom partitioning scheme (>4 partitions, needs "extended")
initialize filesystems (one FAT32 + several ext4) (I am unsure if this can be done directly using pyparted or if I need to spawn mkfs instead)
Can someone suggest the right approach?
I've created some examples for pyparted covering most of this that will hopefully be merged.
In the meantime, you can see them in the pull request at:
https://github.com/dcantrell/pyparted/pull/64
BTW, you should probably use a GPT partition table these days too. That way you don't have to worry about primary/extended partitions; they're all just partitions.
HTH
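In case it helps, here is a rough pyparted sketch of those steps. The device path and sizes are hypothetical, "ext4" as a libparted filesystem type depends on your parted version, and as far as I know pyparted/libparted only writes the partition table, so the filesystems themselves are created by spawning mkfs:

import subprocess
import parted

DEV = "/dev/sdX"  # hypothetical SD card device

# 1. Retrieve the current partitioning scheme (sanity-check the card).
device = parted.getDevice(DEV)
disk = parted.newDisk(device)  # raises an exception if there is no recognizable label
for part in disk.partitions:
    print(part.number, part.geometry, part.fileSystem)

# 2. Set up a fresh GPT label (no primary/extended distinction to worry about).
disk = parted.freshDisk(device, "gpt")

def add_partition(disk, start, length, fs_type=None):
    geom = parted.Geometry(device=disk.device, start=start, length=length)
    fs = parted.FileSystem(type=fs_type, geometry=geom) if fs_type else None
    part = parted.Partition(disk=disk, type=parted.PARTITION_NORMAL,
                            fs=fs, geometry=geom)
    disk.addPartition(partition=part,
                      constraint=disk.device.optimalAlignedConstraint)
    return part

mib = lambda n: parted.sizeToSectors(n, "MiB", device.sectorSize)
boot = add_partition(disk, 2048, mib(256), "fat32")
root = add_partition(disk, boot.geometry.end + 1, mib(1024), "ext4")
disk.commit()

# 3. libparted does not make the filesystems; spawn mkfs for that.
subprocess.check_call(["mkfs.vfat", "-F", "32", boot.path])
subprocess.check_call(["mkfs.ext4", root.path])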
I have a huge amount of data stored in Cassandra and I want to process it using Spark through Python.
I just want to know how to interconnect Spark and Cassandra through Python.
I have seen people using sc.cassandraTable, but it isn't working for me, and fetching all the data from Cassandra at once and then feeding it to Spark doesn't make sense.
Any suggestions?
Have you tried the examples in the documentation?
Spark Cassandra Connector Python Documentation
spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table="kv", keyspace="test")\
    .load().show()
I'll just give my "short" 2 cents. The official docs are totally fine for you to get started. You might want to specify why this isn't working: did you run out of memory (perhaps you just need to increase the driver memory), or is there some specific error that is causing your example not to work? Also, it would be nice if you provided that example.
Here are some opinions/experiences of mine. Usually, not always, but most of the time, you have multiple columns within a partition. You don't always have to load all the data in a table, and more or less you can keep the processing (most of the time) within a single partition. Since the data is sorted within a partition, this usually goes pretty fast and hasn't presented any significant problems.
If you don't want the whole store-in-Cassandra, fetch-into-Spark cycle for your processing, there are really a lot of solutions out there; basically, that would be Quora material. Here are some of the more common ones:
Do the processing in your application right away - this might require some sort of inter-instance communication framework like Hazelcast, or even better an Akka cluster; this is really a wide topic
Spark Streaming - do your processing right away in micro-batches and flush the results to some persistence layer for reading - which might be Cassandra
Apache Flink - use a proper streaming solution and periodically flush the state of the process to e.g. Cassandra
Store data in Cassandra the way it's supposed to be read - this approach is the most advisable (though it's hard to say with the info you provided)
The list could go on and on ... user-defined functions in Cassandra, or aggregate functions if your task is something simpler.
It might also be a good idea to provide some details about your use case. More or less, what I said here is pretty general and vague, but then again, putting this all into a comment just wouldn't make sense.
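To make the "stay within a single partition" point concrete, here is a hedged sketch; the keyspace/table/column names reuse the kv/test example from the docs above, and it assumes the Spark Cassandra Connector package is on the classpath:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-partition-read")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(table="kv", keyspace="test")
      .load())

# The predicate on the partition key is pushed down to Cassandra,
# so Spark fetches only that partition instead of scanning the whole table.
df.filter(df.key == "some-partition-key").show()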
I'm struggling with how to store some telemetry streams. I've played with a number of things, and I find myself feeling like I've hit writer's block.
Problem Description
Via a UDP connection, I receive telemetry from different sources. Each source is decomposed into a set of devices. And for each device there are at most 5 different value types I want to store. They come in no faster than once per minute, and may be sparse. The values are transmitted with a hybrid edge/level-triggered scheme (send data for a value when it is either different enough or enough time has passed). So it's a 2 or 3 level hierarchy, with a dictionary of time series.
The thing I want to do most with the data is a) access the latest values and b) enumerate the timespans (begin/end/value). I don't really care about a lot of "correlations" between data. It's not the case that I want to compute averages, or correlate between them. Generally, I look at the latest value for a given type, across all or some hierarchy-derived subset. Or I focus on one value stream and enumerate the spans.
I'm not a database expert at all. In fact I know very little. And my three colleagues aren't either. I do python (and want whatever I do to be python3). So I'd like whatever we do to be as approachable as possible. I'm currently trying to do development using Mint Linux. I don't care much about ACID and all that.
What I've Done So Far
Our first version of this used the Gemstone Smalltalk database. Building a specialized Timeseries object worked like a charm. I've done a lot of Smalltalk, but my colleagues haven't, and the Gemstone system is NOT just a "jump in and be happy right away". And we want to move away from Smalltalk (though I wish the marketplace made it otherwise). So that's out.
Played with RRD (Round Robin Database). A novel approach, but we don't need the compression that badly, and since our capture is edge triggered, it doesn't work well for our data capture model.
A friend talked me into using sqlite3. I may try this again. My first attempt didn't work out so well. I may have been trying to be too clever. I was trying to do things the "normalized" way. I found that I got something working OK at first, but getting the "latest" value for a given field for a subset of devices was turning into some hairy (for me) SQL, and the speed of doing so was kind of disappointing. So it turned out I'd need to learn about indexing too. I found I was getting into a hole I didn't want to be in, and heading right back to where we were with the Smalltalk DB: a lot of specialized knowledge, with me as the only person who could work with it.
I thought I'd go the "roll your own" route. My data is not HUGE. Disk is cheap. And I know real well how to read/write files. And aren't filesystems hierarchical databases anyway? I'm sure that "people in the know" are rolling their eyes at this primitive approach, but this method was the most approachable. With a little bit of Python code, I used directories for my structuring, and then a 2-file scheme for each value (one file for the latest value, and an append log for the rest of the values). This has worked OK, but I'd rather not be liable for the wrinkles I haven't quite worked out yet. There's as much code involved in how the data is serialized to and from disk (just using simple strings right now). One nice thing about this approach is that while I can write Python scripts to analyze the data, some things can be done just fine with classic command line tools. E.g. (a simple query to show all the latest rssi values):
ls Telemetry/*/*/rssi | xargs cat
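For what it's worth, the core of that two-file scheme is only a handful of lines of Python; a rough sketch (the record format here is made up, just "timestamp value" per line):

import os
import time

ROOT = "Telemetry"

def record(source, device, value_type, value, ts=None):
    ts = time.time() if ts is None else ts
    d = os.path.join(ROOT, source, device)
    os.makedirs(d, exist_ok=True)
    line = "%f %s\n" % (ts, value)
    # The "latest" file (e.g. rssi) holds only the most recent sample...
    with open(os.path.join(d, value_type), "w") as f:
        f.write(line)
    # ...while the append log (e.g. rssi.log) keeps the full history of spans.
    with open(os.path.join(d, value_type + ".log"), "a") as f:
        f.write(line)

def latest(source, device, value_type):
    with open(os.path.join(ROOT, source, device, value_type)) as f:
        ts, value = f.read().split(None, 1)
        return float(ts), value.strip()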
I spent this morning looking at alternatives. Browsed the NoSQL sites. Read up on PyTables. Scanned the ZODB tutorial. PyTables looks very suited for what I'm after: a hierarchy of named tables modeling time series. But I don't think PyTables works with python3 yet (at least, there is no debian/ubuntu package for python3 yet). Ditto for ZODB. And I'm afraid I don't know enough about what the many different NoSQL databases do to even take a stab at one.
Plea for Ideas
I find myself more bewildered and confused than at the start of this. I was probably too naive in thinking I'd find something a little more "fire and forget" and that I'd be past it at this point. Any advice and direction you have would be hugely appreciated. If someone can give me a recipe that meets my needs without huge amounts of overhead/education/ingress, I'd mark that as the answer for sure.
Ok, I'm going to take a stab at this.
We use Elastic Search for a lot of our unstructured data: http://www.elasticsearch.org/. I'm no expert on this subject, but in my day-to-day, I rely on the indices a lot. Basically, you post JSON objects to the index, which lives on some server. You can query the index via the URL, or by posting a JSON object to the appropriate place. I use pyelasticsearch to connect to the indices---that package is well-documented, and the main class that you use is thread-safe.
The query language is pretty robust itself, but you could just as easily add a field to the records in the index that is "latest time" before you post the records.
Anyway, I don't feel that this deserves a check mark (even if you go that route), but it was too long for a comment.
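To make that concrete, a rough sketch of the post-and-query flow over the plain REST interface using requests (pyelasticsearch wraps the same calls); the index, type, and field names are made up, and the exact endpoints depend on your Elasticsearch version:

import json
import requests

ES = "http://localhost:9200"
HEADERS = {"Content-Type": "application/json"}

# Post one telemetry reading as a JSON document.
doc = {"device": "dev42", "type": "rssi", "value": -71, "ts": "2013-05-01T12:00:00"}
requests.post("%s/telemetry/reading" % ES, data=json.dumps(doc), headers=HEADERS)

# Ask for the newest "rssi" reading: sort by timestamp, return one hit.
query = {
    "query": {"term": {"type": "rssi"}},
    "sort": [{"ts": {"order": "desc"}}],
    "size": 1,
}
resp = requests.post("%s/telemetry/_search" % ES, data=json.dumps(query), headers=HEADERS)
print(resp.json())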
What you describe fits the database model (e.g., sqlite3).
Keep one table.
id, device_id, valuetype1, valuetype2, valuetype3, ..., valuetypen, timestamp
I assume all devices are of the same type (i.e., they have the same set of values that you care about). If they do not, consider simply setting value=NULL when it doesn't apply to a specific device type.
Each time you get an update, duplicate the last row and update the newest value:
INSERT INTO DeviceValueTable (device_id, valuetype1, valuetype2, ..., timestamp)
SELECT device_id, valuetype1, #new_value, ..., CURRENT_TIMESTAMP
FROM DeviceValueTable
WHERE device_id = #device_id
ORDER BY timestamp DESC
LIMIT 1;
To get the latest values for a specific device:
SELECT *
FROM DeviceValueTable
WHERE device_id = #device_id
ORDER BY timestamp DESC
LIMIT 1;
To get the latest values for all devices:
SELECT a.*
FROM DeviceValueTable a
INNER JOIN
    (SELECT device_id, MAX(timestamp) AS newest
     FROM DeviceValueTable
     GROUP BY device_id) b
    ON a.device_id = b.device_id
   AND a.timestamp = b.newest;
You might be worried about the cost (size) of storing the duplicate values. Rely on the database to handle compression.
Also, keep in mind simplicity over optimization. Make it work, then if it's too slow, find and fix the slowness.
Note, these queries were not tested on sqlite3 and may contain typos.
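If it helps, here is a hedged Python 3 sketch of the same single-table idea using the standard-library sqlite3 module. Table and column names are illustrative, and instead of duplicating the previous row it simply inserts whichever values arrived (leaving the rest NULL):

import sqlite3

conn = sqlite3.connect("telemetry.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS DeviceValueTable (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        device_id TEXT NOT NULL,
        rssi      REAL,
        temp      REAL,
        timestamp TEXT DEFAULT CURRENT_TIMESTAMP
    )""")

def store(device_id, **values):
    cols = ", ".join(values)
    marks = ", ".join("?" for _ in values)
    conn.execute(
        "INSERT INTO DeviceValueTable (device_id, %s) VALUES (?, %s)" % (cols, marks),
        [device_id, *values.values()])
    conn.commit()

def latest(device_id):
    # Most recent row for one device (mirrors the LIMIT 1 query above).
    return conn.execute(
        "SELECT * FROM DeviceValueTable WHERE device_id = ? "
        "ORDER BY timestamp DESC LIMIT 1", (device_id,)).fetchone()

store("dev42", rssi=-71.0)
print(latest("dev42"))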
It sounds to me like you want an on-disk, implicitly sorted data structure like a B-tree or similar.
Maybe check out:
http://liw.fi/larch/
http://www.egenix.com/products/python/mxBase/mxBeeBase/
Your issue isn't technical, it's a poor problem specification.
If you are doing anything with sensor data then the old laboratory maxim applies: "If you don't write it down, it didn't happen". In the lab, that means a notebook and pen; on a computer it means ACID.
You also seem to be prematurely optimizing the solution, which is well known to be the root of all evil. You don't say what size the data are, but you do say they "come no faster than once per minute, and may be sparse". Assuming they are an even 1.0KB in size, that's about 1.4MB/day or roughly 0.5GB/year. My StupidPhone has more storage than you would need in a year, and my laptop has sneezes that are larger.
The biggest problem is that you claim to "know very little" about databases, and that is the crux of the matter. Your data is standard old 1950s data-processing boring. You're jumping into buzzword storage technologies when SQLite would do everything you need if only you knew how to ask it. Given that you've got a Smalltalk DB down, I'd be quite surprised if it took more than a day's study to learn all the conventional RDBMS principles you need and then some.
After that, you'd be able to write a question that can be answered in more than generalities.
I'm currently trying to write a script that will find all the files changed for revisions that reference a certain # in the task description, and I have gotten the script to work for that. But now I'm trying to sort the files by whether they were added, modified or removed. I've looked through the Mercurial API, but I can't find anything that can do what I want.
My code currently uses repo[revnum].description() and parses that to find which descriptions contain the #, and if they do, adds the file context to a list.
This works fine and I can print a list of files, but I can't find a method to see what was done with each context. Can anyone help me out here, or point me to some better documentation?
Do you need to work with the Mercurial API? It is possible to do what you need by working with the output of hg log.
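For example, something along these lines (an untested sketch; the task number is hypothetical, and -k also matches usernames and filenames, so you may want to post-filter on {desc}):

import subprocess

TASK = "#1234"  # hypothetical task number
out = subprocess.check_output(
    ["hg", "log", "--keyword", TASK,
     "--template", "{rev}|{file_adds}|{file_mods}|{file_dels}\n"],
    universal_newlines=True)

for line in out.splitlines():
    rev, adds, mods, dels = line.split("|")
    print(rev, "added:", adds.split(), "modified:", mods.split(), "removed:", dels.split())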
In general, you should avoid writing scripts directly using the Mercurial API. It is better to write your scripts to use the CLI or perhaps even use hglib. As stated on the MercurialApi wiki:
For the vast majority of third party code, the best approach is to use
Mercurial's published, documented, and stable API: the command line
interface.
That being said, if you really need to use the API, you can use repo.status() to find the info you asked about:
modified, added, removed, deleted, unknown, ignored, clean = repo.status(revnum-1, revnum)
I ended up using something similar to what Tim said, although I did still use the API.
I imported commands from mercurial, and then called commands.status(repo.ui, repo, change=revnum)
I captured the output of this, using repo.ui.pushbuffer() and repo.ui.popbuffer() which was in the form
A file_path1
R file_path2
R file_path3
A file_path4
M file_path5
I parsed this output and sorted it into added, removed, modified, etc.
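For reference, roughly what that looks like (a sketch against the older, Python 2-era internal API, which is not guaranteed to stay stable; the revision number is hypothetical):

from mercurial import ui, hg, commands

repo = hg.repository(ui.ui(), ".")
revnum = 1234  # hypothetical revision number

repo.ui.pushbuffer()
commands.status(repo.ui, repo, change=str(revnum))
output = repo.ui.popbuffer()

added, modified, removed = [], [], []
for line in output.splitlines():
    code, path = line.split(" ", 1)
    if code == "A":
        added.append(path)
    elif code == "M":
        modified.append(path)
    elif code == "R":
        removed.append(path)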
I wish to export log files from multiple nodes (in my case Apache access and error logs) and aggregate that data in batch, as a scheduled job. I have seen multiple solutions that work with streaming data (e.g. think Scribe). I would like a tool that gives me the flexibility to define the destination. This requirement comes from the fact that I want to use HDFS as the destination.
I have not been able to find a tool that supports this in batch. Before re-creating the wheel I wanted to ask the StackOverflow community for their input.
If a solution exists already in python that would be even better.
We use http://mergelog.sourceforge.net/ to merge all our Apache logs.
Take a look at zohmg, it's an aggregation/reporting system for log files using HBase and HDFS: http://github.com/zohmg/zohmg
Scribe can meet your requirements, there's a version (link) of scribe that can aggregate logs from multiple sources, and after reaching given threshold it stores everything in HDFS. I've used it and it works very well. Compilation is quite complicated, so if you have any problems ask a question.
PiCloud may help.
The PiCloud Platform gives you the freedom to develop your algorithms
and software without sinking time into all of the plumbing that comes
with provisioning, managing, and maintaining servers.
I want to create a user group using python on CentOS system. When I say 'using python' I mean I don't want to do something like os.system and give the unix command to create a new group. I would like to know if there is any python module that deals with this.
Searching on the net did not reveal much about what I want, except for Python user groups, so I had to ask this.
I learned about the grp module by searching here on SO, but couldn't find anything about creating a group.
EDIT: I don't know if I have to start a new question for this, but I would also like to know how to add (existing) users to the newly created group.
Any help appreciated.
Thank you.
I don't know of a python module to do it, but the /etc/group and /etc/gshadow format is pretty standard, so if you wanted you could just open the files, parse their current contents and then add the new group if necessary.
Before you go doing this, consider:
What happens if you try to add a group that already exists on the system
What happens when multiple instances of your program try to add a group at the same time
What happens to your code when an incompatible change is made to the group format a couple releases down the line
NIS, LDAP, Kerberos, ...
If you're not willing to deal with these kinds of problems, just use the subprocess module and run groupadd. It will be way less likely to break your customers' machines.
Another thing you could do that would be less fragile than writing your own would be to wrap the code in groupadd.c (in the shadow package) in Python and do it that way. I don't see this buying you much versus just exec'ing it, though, and it would add more complexity and fragility to your build.
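For instance, a minimal sketch of that subprocess route (group and user names are made up; groupadd and usermod come from the shadow utilities and need root). It also covers the edit about adding existing users to the new group:

import grp
import subprocess

def ensure_group(name):
    try:
        grp.getgrnam(name)          # group already exists, nothing to do
    except KeyError:
        subprocess.check_call(["groupadd", name])

def add_user_to_group(user, group):
    # -a appends, keeping the user's existing supplementary groups
    subprocess.check_call(["usermod", "-a", "-G", group, user])

ensure_group("mygroup")
add_user_to_group("someuser", "mygroup")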
I think you should use the command-line programs from your program; a lot of care has gone into making sure that they don't break the groups file if something goes wrong.
However, the file format is quite straightforward, so you could write something yourself if you choose to go that way.
There are no library calls for creating a group. This is because there's really no such thing as creating a group. A GID is simply a number assigned to a process or a file. All these numbers exist already - there is nothing you need to do to start using a GID. With the appropriate privileges, you can call chown(2) to set the GID of a file to any number, or setgid(2) to set the GID of the current process (there's a little more to it than that, with effective IDs, supplementary IDs, etc).
Giving a name to a GID is done by an entry in /etc/group on basic Unix/Linux/POSIX systems, but that's really just a convention adhered to by the Unix/Linux/POSIX userland tools. Other network-based directories also exist, as mentioned by Jack Lloyd.
The man page group(5) describes the format of the /etc/group file, but it is not recommended that you write to it directly. Your distribution will have policies on how unnamed GIDs are allocated, such as reserving certain spaces for different purposes (fixed system groups, dynamic system groups, user groups, etc). The range of these number spaces differs on different distributions. These policies are usually encoded in the command-line tools that a sysadmin uses to assign unnamed GIDs.
This means the best way to add a group locally is to use the command-line tools.
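As a tiny illustration of the chown/setgid point above (the GID and path are arbitrary, and both calls need appropriate privileges):

import os

os.chown("/tmp/somefile", -1, 5000)  # -1 leaves the owner (UID) unchanged
os.setgid(5000)                      # switch the current process to GID 5000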
If you are looking at Python, then try this program. It's fairly simple to use, and the code can easily be customized: http://aleph-null.tv/downloads/mpb-adduser-1.tgz