Python alternative to csvfix

Sorry for my bad English. I work in sales and process many price lists. Most of my problems are already solved (scheduling, a processing queue, storage, converting from Excel 2003/2007 or Access, encoding, etc.). One unresolved problem is processing CSV files. I need a tool with many configuration options. What I need most is to filter/search by manufacturer, or to skip rows that don't meet some condition - it may be the amount of a product, its price, or something else. I found a tool that solves this task - csvfix - but it has the disadvantage that it can't work with non-Latin alphabets. I'm searching for an alternative; maybe you know one? Best of all would be something written in Python or Ruby. Thanks.

Do you want a programmable tool or a finished application? If the latter, take a look at csved. It offers lots of configuration and filtering options and handles non-ASCII data well (there is also a Unicode variant available). And it's cardware (free, but the author asks for a postcard).
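If the former, the kind of filtering you describe (by manufacturer, amount, or price) is only a few lines with Python's standard csv module. A minimal sketch, where the file names, delimiter, column names, and condition are all placeholder assumptions; the encoding argument is what handles non-Latin alphabets:

import csv

# Keep only rows for one manufacturer with a positive amount.
# Column names, delimiter, and the condition are hypothetical.
with open("prices.csv", encoding="utf-8", newline="") as src, \
     open("filtered.csv", "w", encoding="utf-8", newline="") as dst:
    reader = csv.DictReader(src, delimiter=";")
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, delimiter=";")
    writer.writeheader()
    for row in reader:
        if row["manufacturer"] == "Acme" and int(row["amount"]) > 0:
            writer.writerow(row)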

Related

Parse CSV files with different quoting (and a wrong one)

I have to process CSV files (around ~500,000 of them) in Python 3, all of which come from the same source. However, for unknown reasons, some of them (a small minority) don't use the same quoting mechanism as the others.
Unfortunately, I have no way to determine a file's quoting rule from the source itself (I can't make a rule based on the file name).
Here are the 3 kinds of CSV I have:

Kind 1 (everything quoted):
"field_A";"Field_B";"Field_C"
"";"john; the real one";"krasinski"

Kind 2 (nothing quoted):
field_A;Field_B;Field_C
;john "the original";krasinski

Kind 3 (mixed quoting):
"field_A";"Field_B";"Field_C";"Field_D"
"";"john;"krasinski;"2019-12-12"
I can parse the first by setting my quote character to ", and the second by having no quote character, but I have no clue how to process the last one, which includes fields that must be quoted and fields that cannot be quoted (because of the lone ").
Is there a way for me to deal with these different kinds of files?
In the end, I'd like to get the value krasinski for Field_C, for example, but I either get "krasinski" and krasinski, or krasinski and an error.
All the solutions that come to mind feel "wrong".
I'm running Python 3.8, and currently using what appears to be the recommended library, but I'm not committed to it.
Thanks for your help!
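For reference, a minimal sketch of how the first two kinds parse with the stdlib csv module; kind 3 is the unresolved one, since it mixes quoted and unquoted fields in a way no single csv dialect describes:

import csv
import io

QUOTED = '"field_A";"Field_B";"Field_C"\n"";"john; the real one";"krasinski"\n'
UNQUOTED = 'field_A;Field_B;Field_C\n;john "the original";krasinski\n'

# Kind 1: fields are quoted, so ';' inside quotes is not a separator.
for row in csv.reader(io.StringIO(QUOTED), delimiter=';', quotechar='"'):
    print(row)   # ['', 'john; the real one', 'krasinski'] on the data row

# Kind 2: no quoting at all; '"' is treated as an ordinary character.
for row in csv.reader(io.StringIO(UNQUOTED), delimiter=';', quoting=csv.QUOTE_NONE):
    print(row)   # ['', 'john "the original"', 'krasinski'] on the data row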

Scripting library for monitoring server health?

Is there a scripting library, preferably in Python/Perl/Ruby, that lets you get information on disk usage, load, running processes, and CPU usage in a standard way?
I always end up parsing the output of df, uptime, ps, etc. Given that these differ across Unix flavors and have to be done in a totally different way on Windows, I would have thought someone would have done this already.
The simplest is monit: http://mmonit.com/monit/
A step up, as @lawrencealan mentioned, is Nagios: http://nagios.org/
And here's a new interesting effort: http://amon.cx/
(Ruby) Daniel Berger maintains a lot of gems in this field. Look for sys-cpu, sys-uptime, sys-uname, sys-proctable, sys-host, sys-admin, sys-filesystem - all multi-platform, AFAIK.
Have you looked into Nagios? http://nagios.org/
There is an abundance of agents: http://exchange.nagios.org/directory/Addons/Monitoring-Agents
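On the Python side specifically, the third-party psutil library covers disks, load, processes, and CPU behind one cross-platform API (Unix and Windows). A minimal sketch:

import psutil  # third-party: pip install psutil

print(psutil.cpu_percent(interval=1))    # CPU usage, percent
print(psutil.virtual_memory().percent)   # memory usage, percent
print(psutil.disk_usage('/').percent)    # disk usage for one mount point
print(psutil.getloadavg())               # 1/5/15-minute load average
for proc in psutil.process_iter(['pid', 'name']):  # running processes
    print(proc.info)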

Create a user-group in linux using python

I want to create a user group using Python on a CentOS system. When I say 'using Python' I mean I don't want to do something like os.system and pass it the Unix command to create a new group. I would like to know if there is a Python module that deals with this.
Searching the net did not reveal much about what I want, except for Python user groups, so I had to ask here.
I learned about the grp module by searching here on SO, but couldn't find anything in it about creating a group.
EDIT: I don't know if I should start a new question for this, but I would also like to know how to add (existing) users to the newly created group.
Any help appreciated.
Thank you.
I don't know of a Python module to do it, but the /etc/group and /etc/gshadow formats are pretty standard, so if you wanted you could just open the files, parse their current contents, and then add the new group if necessary.
Before you go doing this, consider:
What happens if you try to add a group that already exists on the system
What happens when multiple instances of your program try to add a group at the same time
What happens to your code when an incompatible change is made to the group format a couple releases down the line
NIS, LDAP, Kerberos, ...
If you're not willing to deal with these kinds of problems, just use the subprocess module and run groupadd. It will be far less likely to break your customers' machines.
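A minimal sketch of that subprocess approach, with a read-only existence check through the grp module first (group and user names are placeholders; both commands need root privileges):

import grp
import subprocess

def ensure_group(name):
    # grp can only read group information; creation is left to groupadd.
    try:
        grp.getgrnam(name)                          # raises KeyError if missing
    except KeyError:
        subprocess.run(["groupadd", name], check=True)

def add_user_to_group(user, group):
    # Append an existing user to a supplementary group.
    subprocess.run(["usermod", "-a", "-G", group, user], check=True)

ensure_group("salesdata")
add_user_to_group("alice", "salesdata")

The usermod call also answers the EDIT in the question: it adds an existing user to the new group.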
Another thing you could do that would be less fragile than writing your own would be to wrap the code in groupadd.c (in the shadow package) in Python and do it that way. I don't see this buying you much versus just exec'ing it, though, and it would add more complexity and fragility to your build.
I think you should use the command-line programs from your own program; a lot of care has gone into making sure they don't break the groups file if something goes wrong.
However, the file format is quite straightforward, so writing something yourself is feasible if you choose to go that way.
There are no library calls for creating a group. This is because there's really no such thing as creating a group. A GID is simply a number assigned to a process or a file. All these numbers exist already - there is nothing you need to do to start using a GID. With the appropriate privileges, you can call chown(2) to set the GID of a file to any number, or setgid(2) to set the GID of the current process (there's a little more to it than that, with effective IDs, supplementary IDs, etc).
Giving a name to a GID is done by an entry in /etc/group on basic Unix/Linux/POSIX systems, but that's really just a convention adhered to by the Unix/Linux/POSIX userland tools. Other network-based directories also exist, as mentioned by Jack Lloyd.
The man page group(5) describes the format of the /etc/group file, but it is not recommended that you write to it directly. Your distribution will have policies on how unnamed GIDs are allocated, such as reserving certain spaces for different purposes (fixed system groups, dynamic system groups, user groups, etc). The range of these number spaces differs on different distributions. These policies are usually encoded in the command-line tools that a sysadmin uses to assign unnamed GIDs.
This means the best way to add a group locally is to use the command-line tools.
If you are looking at Python, then try this program. It's fairly simple to use, and the code can easily be customized: http://aleph-null.tv/downloads/mpb-adduser-1.tgz

Does anybody use DjVu files in their production tools?

When it comes to archiving and document portability, it's all about PDF. I heard about DjVu some years ago, and it now seems mature enough for serious use. The benefits seem to be a small file size and a fast open/read experience.
But I have absolutely no feedback on how good or bad it is in the real world:
Is it technically hard to implement in traditional information management tools?
Is it worth learning/implementing a solution to generate/parse it when you already know PDF?
Is end-user feedback good when it comes to day-to-day use?
How do you manage exchanges with the external world (the one with a PDF-only state of mind)?
As a programmer, what are the pros and cons?
And what would you use to convince your boss to (or not to) use DjVu?
And overall, what gains have you noticed after including DjVu in your workflow?
Bonus question: do you know some good Python libs for hacking together some quick and dirty scripts as a start?
EDIT: doing some research, I found that Wikimedia uses it internally to store its book collection, but I can't find any feedback about that. Anybody involved in that project around here?
I've found DjVu to be ideal for image-intensive documents. I used to sell books of highly detailed maps, and those were always in DjVu. PDF, however, works really well; it's a standard, and everybody will be able to open it without installing additional software.
There's more info at:
http://print-driver.com/news/pdf-vs-djvu-i1909.html
Personally, I'd say that unless you're dealing with graphics-rich documents, just stick to PDF.

What to write into log file?

My question is simple: what to write into a log.
Are there any conventions?
What do I have to put in?
Since my app is about to be released, I'd like to have friendly logs that most people can read without asking what they mean.
I already have some ideas, like a timestamp, a unique identifier for each function/method, etc.
I'd like to have several log levels, like tracing/debugging, information, and errors/warnings.
Do you use some pre-formatted log resources?
Thank you
It's quite pleasant, and already implemented.
Read this: http://docs.python.org/library/logging.html
Edit
"easy to parse, read," are generally contradictory features. English -- easy to read, hard to parse. XML -- easy to parse, hard to read. There is no format that achieves easy-to-read and easy-to-parse. Some log formats are "not horrible to read and not impossible to parse".
Some folks create multiple handlers so that one log has several formats: a summary for people to read and an XML version for automated parsing.
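A minimal sketch of that multiple-handler idea with the stdlib logging module (handler targets and format strings are illustrative, not prescribed):

import logging

log = logging.getLogger("myapp")
log.setLevel(logging.DEBUG)

# Short, human-readable summary on the console.
console = logging.StreamHandler()
console.setLevel(logging.INFO)
console.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

# Detailed, machine-parseable record in a file.
detailed = logging.FileHandler("myapp.log")
detailed.setLevel(logging.DEBUG)
detailed.setFormatter(logging.Formatter(
    "%(asctime)s|%(levelname)s|%(threadName)s|%(name)s|%(message)s"))

log.addHandler(console)
log.addHandler(detailed)

log.debug("written to the file only")
log.info("written to both")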
Here are some suggestions for content:
timestamp
message
log message type (such as error, warning, trace, debug)
thread id (so you can make sense of the log file from a multithreaded application)
Best practices for implementation:
Put a mutex around the write method so that you can be sure that each write is thread safe and will make sense.
Send one message at a time to the log file, and specify the type of log message each time. Then you can set what types of log messages you want at program startup.
Use no buffering on the file, or flush often, in case the program crashes.
Edit: I just noticed the question was tagged with Python, so please see S. Lott's answer before mine. It may be enough for your needs.
Since you tagged your question python, I refer you to this question as well.
As for the content, Brian's proposal is good. Remember, however, to add the program name if you are using a shared log.
The important thing about a logfile is "greppability". Try to put all your info on a single line, with proper string identifiers that are unique (also in radix) for a particular condition. As for the timestamp, use the ISO-8601 format, which sorts nicely.
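Both points map directly onto the stdlib logging module; a sketch, where the format string and the key=value identifiers are illustrative:

import logging

# One event per line, ISO-8601 timestamp, greppable key=value identifiers.
logging.basicConfig(
    format="%(asctime)s %(levelname)-8s %(name)s %(message)s",
    datefmt="%Y-%m-%dT%H:%M:%S",
    level=logging.INFO,
)
logging.getLogger("billing").info("order=1234 status=PAID amount=99.50")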
A good idea is to look at log analysis software. Unless you plan to write your own, you will probably want to exploit an existing log analysis package such as Analog. If that is the case, you will probably want to generate a log output that is similar enough to the formats that it accepts. It will allow you to create nice charts and graphs with minimal effort!
In my opinion, the best approach is to use an existing logging library such as log4j (or one of its variants for other languages). It gives you control over how your messages are formatted, and you can change the format without ever touching your code. It follows best practices, is robust, and is used by millions of users. Of course, you could write your own logging framework, but that would be a very odd thing to do unless you need something very specific. In any case, walk through their documentation and see how log statements are presented there.
Check out log4py, a Python port of log4j; I think there are several implementations for Python out there.
Python's logging library is thread-safe for single-process threads.
Timestamp, i.e. DateTime in YYYY/MM/DD:HH:mm:ss:ms format
User
Thread ID
Function Name
Message/Error Message/Success Message/Function Trace
Have this in XML format and you can then easily write a parser for it.
<log>
  <logEntry DebugLevel="0|1|2|3|4|5....">
    <TimeStamp format="YYYY/MM/DD:HH:mm:ss:ms" value="2009/04/22:14:12:33:120" />
    <ThreadId value="" />
    <FunctionName value="" />
    <Message type="Error|Success|Failure|Status|Trace">
      Your message goes here
    </Message>
  </logEntry>
</log>
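Since the question is tagged Python, here is a hedged sketch of how that layout could be produced with a custom logging.Formatter; the element names mirror the example above, and the enclosing <log> root element would still have to be written separately:

import logging
from xml.sax.saxutils import escape

class XmlFormatter(logging.Formatter):
    # Renders each record as one <logEntry> element, as in the layout above.
    def format(self, record):
        return (
            '<logEntry DebugLevel="%d">'
            '<TimeStamp value="%s" />'
            '<ThreadId value="%s" />'
            '<FunctionName value="%s" />'
            '<Message type="%s">%s</Message>'
            '</logEntry>'
        ) % (
            record.levelno,
            self.formatTime(record),
            record.thread,
            record.funcName,
            record.levelname,
            escape(record.getMessage()),
        )

handler = logging.FileHandler("app.log.xml")
handler.setFormatter(XmlFormatter())
logging.getLogger().addHandler(handler)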
