Forgive me if my question is too general, or if it's been asked before. I've been tasked with manipulating (e.g. copying and pasting several ranges of entries, performing calculations on them, and then saving them all to a new CSV file) several large datasets in Python 3.
What are the pros/cons of using these libraries (the built-in csv module versus pandas)?
Thanks in advance.
I have not used the csv library, but many people enjoy the benefits of Pandas. Pandas provides a lot of the tools you'll need, and it is built on top of NumPy. From there you can easily move on to more advanced libraries for all sorts of analysis (sklearn for machine learning, nltk for NLP, etc.).
For your purposes, you'll find it easy to manage different CSVs: merge, concatenate, and really whatever you want.
Here's a link to a quick-start guide; there are lots of other resources out there as well:
Getting started with pandas: http://pandas.pydata.org/pandas-docs/stable/10min.html
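For the kind of task you describe (copy ranges of rows, compute on them, save to a new CSV), a minimal pandas sketch might look like the following; the file and column names are placeholders for illustration.

import pandas as pd

# "input.csv" and the "price" column are assumptions for illustration
df = pd.read_csv('input.csv')

subset = df.iloc[100:200].copy()         # copy a range of rows
subset['price'] = subset['price'] * 1.1  # perform a calculation on them

subset.to_csv('output.csv', index=False) # save them to a new CSV file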
Hope that helps a little bit.
You should always try, as much as possible, to reuse the work other people have already done for you (such as the pandas library); it saves you a lot of time. Pandas has a lot to offer for processing such files, so it seems to me the best way to deal with them. Since the question is very general, I can only give a general answer... When you use pandas you will need to read more of the documentation, but I would not call that a downside.
I'm in the process of figuring this out, but I wanted to document it on Stack Overflow since it wasn't easily searchable. (Also, hopefully someone can answer this before I do.)
# parse each text cell into a spaCy Doc and store it in a new column
df.loc[:,'corpus_spacy_doc'] = df['text_corpus'].apply(nlp)
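(Here nlp is a loaded spaCy pipeline, e.g. nlp = spacy.load('en_core_web_sm'), and df is a DataFrame with a text_corpus column; the model name is just an example.)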
So now I can do all sorts of NLP stuff with corpus_spacy_doc, which is great. But I would like a good way of saving the state of this DataFrame, since df.to_csv() obviously won't work. I've been looking into whether this is possible with Parquet, but I don't think it is.
As of right now, it seems my best solution is using spaCy's method of serializing the list of docs (https://spacy.io/usage/saving-loading) and loading it into a pandas DataFrame later.
To summarize, I now want a pythonic way of doing something like
df.to_something(fname = fname)
Has anyone else gone through this or have a good answer?
So this was pretty easy: for what I'm doing, it seems to be solved with a regular df.to_pickle().
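A minimal round trip, with the file name assumed for illustration:

import pandas as pd

df.to_pickle('corpus_df.pkl')         # pickle keeps arbitrary Python objects in cells, spaCy Docs included
df = pd.read_pickle('corpus_df.pkl')  # restore later, in the same environment

One caveat: pickles are Python-specific and sensitive to library versions, so this suits checkpointing better than long-term storage or sharing.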
I am really struggling with a programming task I have been handed. I have been asked to read from a CSV file which shows the names of beaches and a rating, and then work out the average rating. Any help I could get with this would be great. Also, I am at a beginner level, so please don't judge.
For reading CSV files, check out Python's standard-library csv module (https://docs.python.org/3.7/library/csv.html).
Its documentation has enough examples and explanation to guide you through your assignment.
As for taking sums and averages, Python's built-in functions (https://docs.python.org/3.7/library/functions.html), such as sum() and len(), will do the rest.
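A rough sketch of how the pieces fit together; the file name "beaches.csv" and the name,rating row layout are assumptions for illustration:

import csv

ratings = []
with open('beaches.csv', newline='') as f:
    reader = csv.reader(f)
    # next(reader)  # uncomment to skip a header row, if your file has one
    for name, rating in reader:
        ratings.append(float(rating))

print('average rating:', sum(ratings) / len(ratings))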
If you are stuck with any of them, feel free to add comments.
Good luck!
I have a huge file (>400 MB) of NDJSON-formatted data and would like to flatten it into a table format for further analysis.
I started iterating through the various objects manually, but some are rather deeply nested and might even change over time, so I was hoping for a more general approach.
I was certain the pandas library would offer something, but I could not find anything that helps my case. The few other libraries I found (e.g. flatten_json) also don't seem to 'fully' provide what I was hoping for; it all seems very early-stage.
Is it possible that there is no good (fast and easy) solution for this at this time?
Any help is appreciated.
pandas' read_json has a boolean parameter, lines; set it to True to read NDJSON files:
import pandas as pd
data_frame = pd.read_json('ndjson_file.json', lines=True)
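For the flattening itself, pandas also has json_normalize, which expands nested objects into dotted column names. A sketch, assuming one JSON object per line (in pandas 1.0+ it is pd.json_normalize; older versions expose it as pandas.io.json.json_normalize):

import json
import pandas as pd

with open('ndjson_file.json') as f:
    records = [json.loads(line) for line in f]

# {"a": {"b": 1}} becomes a column named "a.b"
flat = pd.json_normalize(records)

For very large files you can also pass chunksize to read_json(..., lines=True) to process the file in pieces.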
I have many files, each with three million lines in an identical tab-delimited format. All I need to do is divide the number in the 14th "column" by the number in the 12th "column", then set the 14th column to the result.
Although this is a very simple operation, I'm really struggling to work out how to achieve it. I've spent a good few hours searching this website, but unfortunately the answers I've seen have gone completely over my head, as I'm a novice coder!
The tools I have are Notepad++ and UltraEdit (which can run JavaScript, although I'm not familiar with it), and Python 3.6 (I have very basic Python knowledge). Other answers have suggested something called "awk", but when I looked it up it seemed to need Unix, and I only have Windows. What's the best tool for getting this done? I'm more than willing to learn something new.
In Python there are a few ways to handle CSV files. For your particular use case, I think pandas is what you are looking for.
You can load a file with df = pandas.read_csv(fname, sep='\t', header=None); the sep and header arguments matter here, since your files are tab-delimited and, with no header row, the columns get integer labels. Then the division and replacement is as easy as df[13] /= df[11] (columns are 0-indexed, so 13 is the 14th column and 11 is the 12th).
Finally, you can write the data back out in the same format with df.to_csv(), passing sep='\t' again to keep it tab-delimited.
I leave it to you to fill in the remaining details of the pandas functions, but I promise it is very easy, and you'll probably benefit from learning it for a long time.
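Putting it together, a minimal sketch (file names are placeholders):

import pandas as pd

# tab-delimited, no header row, so columns are integer-labelled from 0
df = pd.read_csv('input.tsv', sep='\t', header=None)
df[13] /= df[11]  # 14th column divided by 12th, result stored back in the 14th
df.to_csv('output.tsv', sep='\t', header=False, index=False)

With three million rows per file this should run quickly, and you can wrap it in a loop over your file names.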
Hope this helps
I've been learning the ins and outs of Pandas by manipulating large CSV files obtained online; the files are time series of financial data. I have so far figured out how to use HDFStore to store and manipulate them; however, I was wondering whether there is an easier way to update the files without re-downloading the entire source file each time.
I ask because I'm working with 12 files of ~300+ MB each, which update every 15 minutes. While I don't need the updates to be continuous, it'd be swell not to download what I already have.
The Blaze library from Continuum should help you; its documentation includes an introduction.
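Separately from Blaze, the HDFStore you're already using can append new rows instead of rewriting the whole store, which covers the storage side of "don't redo what I already have". A sketch, assuming a table-format store 'prices.h5' and a freshly fetched time-indexed DataFrame update (all names are placeholders):

import pandas as pd

with pd.HDFStore('prices.h5') as store:
    nrows = store.get_storer('prices').nrows                     # rows already stored
    last_ts = store.select('prices', start=nrows - 1).index[-1]  # last stored timestamp
    new_rows = update[update.index > last_ts]                    # keep only unseen rows
    store.append('prices', new_rows, format='table')             # append rather than rewrite

Avoiding the re-download itself depends on the data source, e.g. whether its API supports ranged or incremental requests.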