Parse CSV files with different quoting (and a wrong one)

I have to process around 500,000 CSV files in Python 3, all of which come from the same source. However, for unknown reasons, a small minority of them don't use the same quoting mechanism as the others.
Unfortunately, I have no way to tell a file's quoting rule from the source itself (I can't make a rule based on the file name).
Here are the three kinds of CSV I have:
"field_A";"Field_B";"Field_C"
"";"john; the real one";"krasinski"
field_A;Field_B;Field_C
;john "the original";krasinski
"field_A";"Field_B";"Field_C";"Field_D"
"";"john;"krasinski;"2019-12-12"
I can parse the first by setting my quotechar to ", and the second by using no quote char, but I have no clue how to process the last one, which mixes fields that must be parsed as quoted with fields that cannot be (because of the lone ").
Is there a way for me to deal with these different kinds of files?
In the end, I'd like to get the value krasinski for Field_C, for example, but I either get "krasinski" and krasinski, or krasinski and an error.
All the solutions that came to me feel "wrong".
I'm running Python 3.8, and currently using what appears to be the recommended library but I'm not committed to it.
Thanks for your help!
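One approach that might work, sketched below: try a strict quoted parse first, and fall back to splitting on every delimiter plus stripping stray quotes only when the strict parse fails. This assumes the malformed files never put a real delimiter inside a legitimately quoted field; parse_line_tolerant is a hypothetical helper, not an established recipe.

import csv

def parse_line_tolerant(line, delimiter=';'):
    try:
        # Strict pass: honours "..." quoting, so embedded delimiters
        # like "john; the real one" survive intact.
        return next(csv.reader([line], delimiter=delimiter,
                               quotechar='"', strict=True))
    except csv.Error:
        # Malformed row (e.g. "";"john;"krasinski;...): split on every
        # delimiter, then strip leftover surrounding quotes per field.
        fields = next(csv.reader([line], delimiter=delimiter,
                                 quoting=csv.QUOTE_NONE))
        return [f.strip('"') for f in fields]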

Related

Removing JSON objects that aren't correctly formatted Python

I'm building a chatbot database at the moment. I use data from pushshift.io. To deal with the big data file (I understand that json loads everything into RAM, so if you only have 16GB of RAM and are working with 30GB of data, that's a no-no), I wrote a bash script that splits the big file into smaller 3GB chunks so that I can run them through json.loads (or pd.read_json). The problem is that whenever I run my code it returns
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So I took a look at the temp JSON file I had just created, and I see this happening in my JSON file:
ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
A correctly formatted sample of the data looks like this:
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
I noticed that my bash script split the file without paying attention to the JSON object boundaries. So my question is: is there a way to write a function in Python that can detect JSON objects that are not correctly formatted and delete them?
There isn't a lot of information to go on, but I would challenge the frame a little.
There are several incremental json parsers available in Python. A quick search shows ijson should allow you to traverse your very large data structure without exploding.
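For example, a minimal sketch (assuming the dump is a single top-level JSON array; 'comments.json' is a placeholder name):

import ijson

with open('comments.json', 'rb') as f:
    # ijson streams one array element at a time instead of loading
    # the whole document into RAM.
    for comment in ijson.items(f, 'item'):   # 'item' = each array element
        print(comment['author'])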
You should also consider another data format (or a real database), or you will easily find yourself spending time reimplementing much, much slower versions of features that already exist in the right tools.
If you are using the json standard library, then calling json.loads on badly formatted data will raise a JSONDecodeError. You can put your code in a try/except block and catch this exception to make sure you only process correctly formatted data.
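A minimal sketch of that, assuming one JSON object per line (load_valid is a hypothetical name):

import json

def load_valid(path):
    # Keep only the lines that parse as JSON; skip the malformed
    # fragments that the byte-level split left behind.
    objects = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            try:
                objects.append(json.loads(line))
            except json.JSONDecodeError:
                pass
    return objects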

How do I find unique, non-English characters in a SQL script that has a lot of tables, scripts, etc. related to it?

I am getting a UnicodeDecodeError in my Python script, and I know the offending character is not Latin (or English); I also know which row it is in (there are thousands of columns). How do I go through my SQL code to find this character (or characters)?
Do a binary search. Break the file (or the scripts, or whatever) in half, and process both halves. One should fail and the other shouldn't; if both have errors, it doesn't matter, just pick one.
Continue splitting the failing half until you've narrowed it down to something more manageable that you can probe into, even down to the actual line that's failing.
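Alternatively, a short script can locate the offending bytes directly instead of bisecting by hand; a sketch, assuming the file was expected to be plain ASCII (find_non_ascii and 'script.sql' are placeholders):

def find_non_ascii(path):
    # Print the location of every byte outside the ASCII range.
    with open(path, 'rb') as f:
        for lineno, line in enumerate(f, start=1):
            for col, byte in enumerate(line, start=1):
                if byte > 127:
                    print(f'line {lineno}, col {col}: byte 0x{byte:02x}')

find_non_ascii('script.sql')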

Python-pandas with large/disordered text files

I have a large (for my experience level, anyway) text file of astrophysical data and I'm trying to get a handle on python/pandas. As a noob to python, it's comin' along slowly. The file is 145MB in total. When I try to read it in pandas I get confused, because I don't know whether to use pd.read_table('example.txt') or pd.read_csv('example.csv'). In either case I can't call on a specific column without ipython freaking out. I know I'm doing something absent-minded. Can anyone explain what that might be? I've done this same procedure with smaller files and it works great, but this one seems to be limiting its output, or just not working at all.
Thanks.
It looks like your columns are separated by varying amounts of whitespace, so you'll need to specify that as the separator. Try pd.read_csv('example.txt', sep=r'\s+'); \s+ is the regular expression for "any amount of whitespace". Also, you should remove the # character from the beginning of the first line, as it will otherwise be read as an extra column name and mess up the parsing.
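Putting that together (a sketch; 'example.txt' and 'colname' are placeholders for the real file and one of its column headers):

import pandas as pd

df = pd.read_csv('example.txt', sep=r'\s+')   # \s+ = any run of whitespace
print(df.columns)      # verify which headers pandas picked up
print(df['colname'])   # individual columns are then addressable by name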

Python alternative to csvfix

Sorry for my bad English. I work in sales and process many price lists. Most of my problems are already solved (scheduling, processing queues, storing, converting from Excel 2003/2007 or Access, encoding, etc.). One of the unresolved problems is processing CSV files. I need a tool with many configuration options, most of all filtering/searching by manufacturer, or skipping rows that don't meet some condition, be it product quantity, price, or something else. I found a tool that solves this task, csvfix, but it has a disadvantage: it can't work with non-Latin alphabets. I'm searching for an alternative; maybe you know one? Best of all would be one written in Python or Ruby. Thanks.
Do you want a programmable tool or a finished application? If the latter, take a look at csved. It offers lots of configuration and filtering options and handles non-ASCII data well (there is also a Unicode variant available). And it's cardware (free, but the author asks for a postcard).
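If you'd rather go the programmable route, the standard csv module already covers this kind of filtering, and opening files with an explicit encoding handles non-Latin alphabets. A rough sketch (the delimiter, file names, column names, and condition are all placeholders):

import csv

def filter_rows(src, dst, keep):
    # Copy only the rows for which keep(row) is true.
    with open(src, newline='', encoding='utf-8') as fin, \
         open(dst, 'w', newline='', encoding='utf-8') as fout:
        reader = csv.DictReader(fin, delimiter=';')
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                                delimiter=';')
        writer.writeheader()
        for row in reader:
            if keep(row):
                writer.writerow(row)

# e.g. keep one manufacturer above a minimum price:
# filter_rows('prices.csv', 'filtered.csv',
#             lambda r: r['manufacturer'] == 'Acme' and float(r['price']) >= 10)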

Create a user-group in linux using python

I want to create a user group using python on CentOS system. When I say 'using python' I mean I don't want to do something like os.system and give the unix command to create a new group. I would like to know if there is any python module that deals with this.
Searching on the net did not reveal much about what I want, except for Python user groups, so I had to ask this.
I learned about the grp module by searching here on SO, but couldn't find anything about creating a group.
EDIT: I don't know if I have to start a new question for this, but I would also like to know how to add (existing) users to the newly created group.
Any help appreciated.
Thank you.
I don't know of a Python module to do it, but the /etc/group and /etc/gshadow formats are pretty standard, so if you wanted you could just open the files, parse their current contents and then add the new group if necessary.
Before you go doing this, consider:
What happens if you try to add a group that already exists on the system
What happens when multiple instances of your program try to add a group at the same time
What happens to your code when an incompatible change is made to the group format a couple releases down the line
NIS, LDAP, Kerberos, ...
If you're not willing to deal with these kinds of problems, just use the subprocess module and run groupadd. It will be far less likely to break your customers' machines.
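A sketch of that route (requires root; groupadd/usermod options can differ slightly across distributions). The second function also covers the follow-up question about adding existing users to the group:

import grp
import subprocess

def ensure_group(name):
    # The stdlib grp module can at least check for existence first.
    try:
        grp.getgrnam(name)
        return  # group already exists
    except KeyError:
        subprocess.run(['groupadd', name], check=True)

def add_user_to_group(user, group):
    # usermod -aG appends the user to a supplementary group.
    subprocess.run(['usermod', '-aG', group, user], check=True)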
Another thing you could do that would be less fragile than writing your own would be to wrap the code in groupadd.c (in the shadow package) in Python and do it that way. I don't see this buying you much versus just exec'ing it, though, and it would add more complexity and fragility to your build.
I think you should use the command-line programs from your program; a lot of care has gone into making sure that they don't break the groups file if something goes wrong.
However, the file format is quite straightforward, so you could write something yourself if you choose to go that way.
There are no library calls for creating a group. This is because there's really no such thing as creating a group. A GID is simply a number assigned to a process or a file. All these numbers exist already - there is nothing you need to do to start using a GID. With the appropriate privileges, you can call chown(2) to set the GID of a file to any number, or setgid(2) to set the GID of the current process (there's a little more to it than that, with effective IDs, supplementary IDs, etc).
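For instance (a toy illustration; the path and GID are arbitrary, and it needs sufficient privileges to run):

import os

# Create a file, then hand it to an arbitrary numeric GID.
open('/tmp/example', 'w').close()
os.chown('/tmp/example', -1, 12345)   # -1 leaves the owning uid unchanged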
Giving a name to a GID is done by an entry in /etc/group on basic Unix/Linux/POSIX systems, but that's really just a convention adhered to by the Unix/Linux/POSIX userland tools. Other network-based directories also exist, as mentioned by Jack Lloyd.
The man page group(5) describes the format of the /etc/group file, but it is not recommended that you write to it directly. Your distribution will have policies on how unnamed GIDs are allocated, such as reserving certain spaces for different purposes (fixed system groups, dynamic system groups, user groups, etc). The range of these number spaces differs on different distributions. These policies are usually encoded in the command-line tools that a sysadmin uses to assign unnamed GIDs.
This means the best way to add a group locally is to use the command-line tools.
If you are looking at Python, then try this program. It's fairly simple to use, and the code can easily be customized: http://aleph-null.tv/downloads/mpb-adduser-1.tgz
