How to best flatten NDJSON data in Python

I have a huge file (>400 MB) of NDJSON-formatted data and would like to flatten it into a table format for further analysis.
I started iterating through the various objects manually, but some are rather deeply nested and might even change over time, so I was hoping for a more general approach.
I was certain the pandas library would offer something, but I could not find anything that helps my case. The few other libraries I found (e.g. flatten_json) also do not seem to fully provide what I was hoping for; it all seems very early-stage.
Is it possible that there is no good (fast and easy) solution for this at this time?
Any help is appreciated.

pandas read_json has a boolean parameter lines; set it to True to read NDJSON:
data_frame = pd.read_json('ndjson_file.json', lines=True)
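For the flattening part of the question, pandas also ships json_normalize, which expands nested objects into dot-separated columns. A minimal sketch, assuming the file name from the answer above and that each line is a standalone JSON object:

import json
import pandas as pd

# Read the NDJSON file line by line; each non-empty line is one JSON object.
with open('ndjson_file.json', encoding='utf-8') as f:
    records = [json.loads(line) for line in f if line.strip()]

# json_normalize flattens nested dicts into columns such as "address.city";
# lists of nested records can additionally be expanded via its record_path argument.
flat = pd.json_normalize(records, sep='.')
print(flat.head())

For a file this large you may want to process it in pieces instead: read_json with lines=True also accepts chunksize=..., so you can normalize each chunk and concatenate the results.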

Related

Efficient Way to Read SAS file with over 100 million rows into pandas

I have a SAS file that is roughly 112 million rows. I do not actually have access to SAS software, so I need to get this data into, preferably, a pandas DataFrame or something very similar in the Python family. I just don't know how to do this efficiently; just doing df = pd.read_sas('filename.sas7bdat') takes a few hours. I can use chunk sizes, but that doesn't really solve the underlying problem. Is there any faster way to get this into pandas, or do I just have to eat the multi-hour wait? Additionally, even once I have read in the file, I can barely do anything with it, because iterating over the df takes forever as well and usually just ends up crashing the Jupyter kernel. Thanks in advance for any advice in this regard!
Regarding the first part, I guess there is not much to be done, as the read_sas options are limited.
For the second part: (1) iterating manually through rows is slow and not the pandas philosophy; whenever possible, use vectorized operations. (2) Look into specialized solutions for large datasets, such as dask, and read the pandas guide on scaling to large DataFrames.
Maybe you don't need the entire file to work on it, so you could take, say, 10% of it. You can also change your variable types (dtypes) to reduce memory usage.
If you want to store a DataFrame and reuse it instead of re-importing the entire file each time you want to work on it, you can save it as a pickle file (.pkl) and reopen it with pandas.read_pickle; see the sketch below.
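As a rough illustration of the chunking plus pickle-caching idea (file names are placeholders, and the dtype downcasting is just one way to shave memory):

import pandas as pd

chunks = []
# read_sas accepts chunksize, so the whole file never has to be parsed in one go.
for chunk in pd.read_sas('data.sas7bdat', chunksize=500_000):
    # Downcast float columns; SAS data often arrives as float64 throughout.
    for col in chunk.select_dtypes(include='float').columns:
        chunk[col] = pd.to_numeric(chunk[col], downcast='float')
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)

# Cache the parsed result so later sessions skip the slow SAS parse entirely.
df.to_pickle('data.pkl')
# df = pd.read_pickle('data.pkl')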

Is there a good way of saving a Spacy doc in a Pandas dataframe

I'm in the process of figuring this out but wanted to document on stack overflow since this wasn't easily searchable. (Also, hopefully someone can answer this before I do).
df.loc[:,'corpus_spacy_doc'] = df['text_corpus'].apply(lambda cell: nlp(cell))
So now I can do all sorts of NLP work on corpus_spacy_doc, which is great. But I would like a good way of saving the state of this DataFrame, since df.to_csv() obviously won't work. I've been looking into whether this is possible with Parquet, but I don't think it is.
As of right now it seems my best solution is using the spacy method of serializing the list of docs (https://spacy.io/usage/saving-loading) and loading with pandas dataframe later.
To summarize, I now want a pythonic way of doing something like
df.to_something(fname = fname)
Has anyone else gone through this or have a good answer?
So this was pretty easy and seems to be solved for what I'm doing with a regular df.to_pickle(); a short sketch follows below.
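For reference, a small sketch of both options: pickling the whole frame versus serializing only the Doc objects with spaCy's DocBin, per the spaCy docs linked above. The model name, file names, and column names are placeholders.

import pandas as pd
import spacy
from spacy.tokens import DocBin

nlp = spacy.load('en_core_web_sm')  # placeholder pipeline
df = pd.DataFrame({'text_corpus': ['First document.', 'Second document.']})
df['corpus_spacy_doc'] = df['text_corpus'].apply(nlp)

# Option 1: pickle the whole DataFrame, Doc objects included.
df.to_pickle('corpus_df.pkl')
# df = pd.read_pickle('corpus_df.pkl')

# Option 2: serialize only the Docs with DocBin and store the rest of the
# frame in any format you like.
doc_bin = DocBin(docs=df['corpus_spacy_doc'])
doc_bin.to_disk('corpus_docs.spacy')
# docs = list(DocBin().from_disk('corpus_docs.spacy').get_docs(nlp.vocab))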

Can you subset while reading in a CSV in Python

I have daily weather data in CSV since 1980, >10 GB in size. The column I am interested in is the date, and I want a user to be able to select a date so that only the results from that date are returned.
I wonder whether it is possible to read in and subset at the same time to save memory and computation.
I am relatively new to Python and tried:
d=pd.read_csv('weather.csv',sep='\t')['Date' == 'yyyymmdd']
to no avail.
Is it possible to read in only the data that is present for a single day (e.g. 20011004)?
Short answer: from a CSV you will not be able to do so.
Long answer: the CSV format is very handy for humans to read, but it's the worst for machines to operate on. You will need to parse line by line until you find the lines whose date matches the requested one.
A possible solution: convert the CSV into a format more amenable to such operations. My suggestion would be something like HDF5. You can read the whole CSV with pandas and then save it as an HDF5 file with d.to_hdf('weather.h5', key='weather', format='table'). You can check the pandas HDF documentation here. This should allow you to handle the data in a much more memory- and CPU-efficient way.
Binary formats can implement indexes and sorting in such a way that you don't have to go through all the data to check for the pieces you need. The same ideas apply to databases.
Addendum: there are other binary formats, such as Parquet (which might be even better; you should test it) or Feather (if you want some level of "native" interoperability with R). You might want to check the following post for some insights regarding load/save times for different formats and their sizes.
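Building on the HDF5 suggestion, a minimal sketch (it requires the PyTables package; the column name 'Date' and the chunk size are assumptions about your file):

import pandas as pd

# One-time conversion: stream the CSV in chunks and append to an HDF5 store,
# declaring 'Date' as a data column so it can be queried later.
with pd.HDFStore('weather.h5', mode='w') as store:
    for chunk in pd.read_csv('weather.csv', sep='\t', chunksize=1_000_000):
        store.append('weather', chunk, format='table', data_columns=['Date'])

# Afterwards a single day can be pulled out without scanning the whole file.
day = pd.read_hdf('weather.h5', 'weather', where='Date == 20011004')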

Divide one "column" by another in Tab Delimited file

I have many files of three million lines each in an identical tab-delimited format. All I need to do is divide the number in the 14th "column" by the number in the 12th "column", then set the number in the 14th column to the result.
Although this is a very simple operation, I'm really struggling to work out how to achieve it. I've spent a good few hours searching this website, but unfortunately the answers I've seen have gone completely over my head, as I'm a novice coder!
The tools I have are Notepad++ and UltraEdit (which can run JavaScript, although I'm not familiar with it), and Python 3.6 (I have very basic Python knowledge). Other answers have suggested something called "awk", but when I looked it up it seems to need Unix, and I only have Windows. What's the best tool for getting this done? I'm more than willing to learn something new.
In Python there are a few ways to handle CSV data, but for your particular use case I think pandas is what you are looking for.
You can load your file with df = pandas.read_csv(...) (pass sep='\t' for a tab-delimited file and header=None if there is no header row); then performing your division and replacement will be as easy as df[13] /= df[11] (columns are zero-indexed, so the 14th column has label 13).
Finally, you can write your data back in tab-delimited format with df.to_csv().
I leave it to you to fill in the missing details of the pandas functions (a short sketch follows below), but I promise it is very easy, and you'll probably benefit from learning it for a long time.
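Putting the pieces together, a short sketch (file names are placeholders; it assumes the files have no header row, so the 14th and 12th columns are integer positions 13 and 11):

import pandas as pd

# Tab-delimited file without a header row: columns get integer labels 0, 1, ...
df = pd.read_csv('input.tsv', sep='\t', header=None)

# Divide the 14th column by the 12th and store the result back in the 14th.
df[13] = df[13] / df[11]

# Write the result back out as tab-delimited text.
df.to_csv('output.tsv', sep='\t', header=False, index=False)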
Hope this helps

Using pandas over csv library for manipulating CSV files in Python3

Forgive me if my question is too general, or if it's been asked before. I've been tasked with manipulating (e.g. copying and pasting several ranges of entries, performing calculations on them, and then saving them all to a new CSV file) several large datasets in Python 3.
What are the pros/cons of using pandas versus the built-in csv library for this?
Thanks in advance.
I have not used the csv library, but many people are enjoying the benefits of pandas. Pandas provides a lot of the tools you'll need and is built on top of NumPy, and you can then easily move on to more advanced libraries for all sorts of analysis (scikit-learn for machine learning, NLTK for NLP, etc.).
For your purposes, you'll find it easy to manage different CSVs, merge, concatenate, and do whatever you want, really.
Here's a link to a quick-start guide; there are lots of other resources out there as well.
Getting started with pandas:
http://pandas.pydata.org/pandas-docs/stable/10min.html
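As a taste of what the task described in the question looks like in pandas (the file and column names are made up for illustration):

import pandas as pd

df = pd.read_csv('input.csv')

# "Copy and paste" a range of entries: rows 100-199 here.
subset = df.iloc[100:200].copy()

# Perform a calculation on them, e.g. a derived column.
subset['total'] = subset['price'] * subset['qty']

# Save the result to a new CSV file.
subset.to_csv('output.csv', index=False)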
Hope that helps a little bit.
You should always try to make as much use as possible of the work that other people have already done for you (such as writing the pandas library); this saves you a lot of time. Pandas has a lot to offer when you want to process such files, so to me it seems the best way to deal with them. Since the question is very general, I can only give a general answer... When you use pandas you will, however, need to read more of the documentation, but I would not call that a downside.
