Imagine that you are working with a large dataset distributed over a bunch of CSV files. You open an IPython notebook, explore the data, do some transformations, and reorder and clean it up.
Then you start doing some experiments with the data, create some more notebooks, and in the end find yourself buried under a pile of different notebooks with data transformation pipelines hidden inside them.
How do you organize the data exploration/transformation/learning process in such a way that:
complexity doesn't blow up, but grows gradually;
your codebase stays manageable and navigable;
you can reproduce and adjust the data transformation pipelines?
Well, I run into this problem now and then when working with a big set of data. Complexity is something I have learned to live with; sometimes it's hard to keep things simple.
What I think helps me a lot is putting everything in a Git repository: if you manage it well and make frequent commits with well-written messages, you can easily track the transformations applied to your data.
Every time I run an experiment, I create a new branch and do my work on it. If it leads nowhere, I just go back to my master branch and keep working from there, but the work I did is still available for reference if I need it.
If it leads to something useful, I merge it into my master branch and keep working on new experiments, creating new branches as needed.
I don't think this answers all of your questions, and I don't know whether you already use some sort of version control with your notebooks, but it is something that helps me a lot and I really recommend it when working with Jupyter notebooks.
I love PyCharm, as it is very useful for my data analysis.
However, there is a big problem that I still can't figure out.
Keeping a lot of variables around is very useful, but sometimes, especially when I want to run a piece of code that uses seaborn to create a new graph, all my variables disappear and I have to reload them again from scratch.
Do you know a way to keep the data stored and run only a piece of my code without running into this problem?
Thank you
I am not really sure what issue you are having, but from the sounds of it you just need to store a bunch of data and only use some of it.
If that's the case, I would save a file for each set of data and then import the one you want to use at the time.
If you stick to the same data names then your program can just do something like:
import data_435 as data
# now your code can always access data.whatever_var
# even though you have 435 different data sets
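Another option along the same lines is to persist each prepared data set to disk and reload it on demand, so a restarted console doesn't force you to recompute everything. Here is a minimal sketch with pickle; the file name and the contents of the dictionary are just hypothetical placeholders:
import pickle

# Run the expensive preparation once, then persist the result to disk.
prepared = {"threshold": 0.5, "columns": ["a", "b"]}  # stand-ins for your real results
with open("data_435.pkl", "wb") as f:
    pickle.dump(prepared, f)

# Later (or after the console restarts), reload instead of recomputing.
with open("data_435.pkl", "rb") as f:
    data = pickle.load(f)
# data["threshold"], data["columns"] are available again without rerunning anything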
I have written a Python script that models an academic problem, and I wish to publish it. I will put the source on GitHub, and some academics who happen to know Python may grab the source and play with it themselves. However, there are probably more academics who may be interested in the model but who are not Python programmers, and I would like them to be able to run my model too. Even though they are not programmers, they could at least try editing the values of some parameters to see how that affects the results. So my question is: how could I arrange for a non-Python programmer to run a Python program as easily (for them) as possible? I would guess that my options may be...
google colab
an online python compiler like this one
compiling the program into an exe (and letting the user set parameters via a config file)
something else?
Now, a couple of complications that make my problem trickier.
The output of the program is graphical and uses matplotlib. As I understand it, the utilities that turn python scripts into exe files struggle or fail altogether when it comes to matplotlib.
The source is split into two separate files: one small, neat file which contains the model - the user might like to have a good look at it and get the gist of it even if they're not really a Python programmer - and a separate large, ugly file which just handles the graphics. An academic would have no interest in the latter, and I'd like to spare them the gory details.
EDIT: I did ask a related question here - but that was all about programmers that won't mind doing things like installing python and using pip... this question is in relation to non-programmers who would not be comfortable doing things like that.
Colab can handle both problems, but you may need to adapt some code.
Matplotlib interface: Colab can display plots just fine, but you may want the user to interact with a slider, checkbox, or dropdown menu. For that, you need to use Colab's own form UI or ipywidgets (see the sketch after this list). See an example here
Two separate Python files: you can convert one of them to a notebook and then import the other, or you can create a new notebook that imports both files. Here's an example.
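As an illustration of the widget idea, here is a minimal sketch using ipywidgets (available in Colab by default); the model function and its growth_rate parameter are made up for the example:
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

def plot_model(growth_rate=1.0):
    # Hypothetical stand-in for the real model: one curve driven by one parameter.
    t = np.linspace(0, 10, 200)
    plt.plot(t, np.exp(growth_rate * t / 10))
    plt.xlabel("t")
    plt.ylabel("model output")
    plt.show()

# interact() builds a slider automatically from the (min, max, step) tuple,
# so a non-programmer can explore the parameter without touching the code.
interact(plot_model, growth_rate=(0.1, 5.0, 0.1))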
I have a huge amount of data stored in Cassandra and I want to process it using Spark through Python.
I just want to know how to connect Spark and Cassandra together through Python.
I have seen people using sc.cassandraTable, but it isn't working for me, and fetching all the data at once from Cassandra and then feeding it to Spark doesn't make sense.
Any suggestions?
Have you tried the examples in the documentation?
Spark Cassandra Connector Python Documentation
spark.read\
.format("org.apache.spark.sql.cassandra")\
.options(table="kv", keyspace="test")\
.load().show()
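For that snippet to work, the session has to be started with the connector on the classpath and pointed at your cluster. A minimal sketch, assuming a local Cassandra node; the connector version here is just an example and must match your Spark/Scala build:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-example")
    # Pull in the Spark Cassandra Connector; pick the artifact matching your Spark version.
    .config("spark.jars.packages",
            "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
    # Contact point of the Cassandra cluster (assumed to be local here).
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()
)

# With this session in place, the spark.read example above works unchanged.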
I'll just give my "short" 2 cents. The official docs are totally fine for getting started. You might want to specify why this isn't working: did you run out of memory (perhaps you just need to increase the driver memory), or is there some specific error that is causing your example not to work? It would also be nice if you provided that example.
Here are some opinions based on my own experience. Usually, not always, but most of the time, you have multiple columns in a partition. You don't always have to load all the data in a table; more often than not you can keep the processing within a single partition. Since the data is sorted within a partition, this usually goes pretty fast and hasn't presented any significant problems.
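To make that concrete, here is a minimal sketch with the DataFrame API. It assumes the test.kv table from the docs has a partition key column called key; an equality filter on the partition key is pushed down by the connector, so only that partition is fetched rather than the whole table:
df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(table="kv", keyspace="test")
    .load()
)

# The equality predicate on the partition key is pushed down to Cassandra,
# so Spark only reads the single matching partition.
single_partition = df.filter(df.key == "some_partition_key")
single_partition.show()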
If you don't want the whole store-in-Cassandra, fetch-to-Spark cycle to do your processing, there are really a lot of solutions out there; basically that would be Quora material. Here are some of the more common ones:
Do the processing in your application right away - this might require some sort of inter-instance communication framework like Hazelcast or, even better, Akka Cluster; this is really a wide topic.
Spark Streaming - do your processing right away in micro-batches and flush the results to some persistence layer for reading - which might be Cassandra.
Apache Flink - use a proper streaming solution and periodically flush the state of the process to, e.g., Cassandra.
Store data in Cassandra the way it's supposed to be read - this approach is the most advisable (it's just hard to say more with the info you provided).
The list could go on and on... user-defined functions in Cassandra, or aggregate functions if your task is something simpler.
It might also be a good idea to provide some details about your use case. More or less, what I said here is pretty general and vague, but putting it all into a comment just wouldn't make sense.
Is there a way to merge the change logs from several different Mercurial repositories? By "merge" here I just mean integrate into a single display; this is nothing to do with merging in the source control sense.
In other words, I want to run hg log on several different repositories at once. The entries should be sorted by date regardless of which repository they're from, but be limited to the last n days (configurable), and should include entries from all branches of all the repositories. It would also be nice to filter by author and do this in a graphical client like TortoiseHg. Does anyone know of an existing tool or script that would do this? Or, failing that, a good way to access the log entries programmatically? (Mercurial is written in Python, which would be ideal, but I can't find any information on a simple API for this.)
Background: We are gradually beginning to transition from SVN to Mercurial. The old repository was not just monolithic in the sense of one server, but also in the sense that there was one huge repository for all projects (albeit with a sensible directory structure). Our new Mercurial repositories are more focused! In general, this works much better, but we miss one useful feature from SVN: being able to use svn log at the root of the repository to see everything we have been working on recently. It's very useful for filling in timesheets, giving yourself a sense of purpose, etc.
I figured out a way of doing this myself. In short, I merge all the revisions into one mega-repo, and I can then look at this in TortoiseHG. Of course, it's a total mess, but it's good enough to get a summary of what happened recently.
I do this in three steps:
(Optional) Run hg convert on each source repository using the branchmap feature to rename each branch from original to reponame/original. This makes it easier later to identify which revision came from which source repository. (More faithful to SVN would be to use the filemap feature instead.)
On a new repository, run hg pull -f to force-pull from the individual repositories into one big one. This gets all the revisions in one place, but they show up in the wrong order.
Use the method described in this answer to create yet another repository that contains all the changes from the one created in step 2 but sorted into the right order. (Actually I use a slight variant: I get the hashes and compare against the hashes in the destination, check that the destination has a prefix of the source's, and only copy the new ones across.)
This is all done from a Python script; although Mercurial is written in Python, I just drive the command-line interface via the subprocess module. Running through the three steps only copies the new revisions without rebuilding everything from scratch, unless you add a new repo.
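A minimal sketch of step 2 driven from Python via subprocess (the repository paths are made up, and it assumes hg is on the PATH):
import subprocess

MEGA_REPO = "/path/to/mega-repo"        # hypothetical combined repository
SOURCE_REPOS = [
    "/path/to/project-a",               # hypothetical source repositories
    "/path/to/project-b",
]

# Create the combined repository once; ignore the error if it already exists.
subprocess.run(["hg", "init", MEGA_REPO], check=False)

# Force-pull every source repository into the combined one (step 2 above).
for repo in SOURCE_REPOS:
    subprocess.run(["hg", "pull", "-f", "-R", MEGA_REPO, repo], check=True)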
I have a ridiculously simple Python script that uses the arcpy module. I turned it into a script tool in ArcMap and am running it that way. It works just fine; I've tested it multiple times on small datasets. The problem is that I have a very large amount of data. I need to run the script/tool on a .dbf table with 4 columns and 490,481,440 rows, and it has already been running for days. Does anyone have any suggestions on how to speed it up? To save time I've already created the columns that will be populated in the table before I run the script. "back" is the second comma-separated value in the "back_pres_dist" column and "dist" is the fourth. All I want is for them to be in their own separate columns. The table and script look something like this:
back_pres_dist back dist
1,1,1,2345.6
1,1,2,3533.8
1,1,3,4440.5
1,1,4,3892.6
1,1,5,1292.0
import arcpy
from arcpy import env
inputTable = arcpy.GetParameterAsText(0)
back1 = arcpy.GetParameterAsText(1) #the empty back column to be populated
dist3 = arcpy.GetParameterAsText(2) #the empty dist column to be populated
arcpy.CalculateField_management(inputTable, back1, '!back_pres_dist!.split(",")[1]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("back column updated.")
arcpy.CalculateField_management(inputTable, dist3, '!back_pres_dist!.split(",")[3]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("dist column updated.")
updateMess = arcpy.AddMessage("All columns updated.")
Any suggestions would be greatly appreciated. I know that reading some parts of the data into memory might speed things up, but I'm not sure how to do that with python (when using R it took forever to read into memory and was a nightmare trying to write to a .csv).
This is a ton of data. I'm guessing that your main bottleneck is read/write operations on the disk and not CPU or memory.
Your process appears to modify each row independently according to constant input values in what's essentially a tabular operation that doesn't really require GIS functionality. As a result, I would definitely look at doing this outside of the arcpy environment to avoid that overhead. While you could dump this stuff to numpy with the new arcpy.da functionality, I think that even this might be a bottleneck. Seems you should be able to more directly read your *.dbf file with a different library.
In fact, this operation is not really tabular; it's really about iteration. You'll probably want to exploit things like the "WITH"/"AS" keywords (PEP 343, Raymond Hettinger has a good video on youtube, too) or iterators in general (see PEPs 234, 255), which only load a record at a time.
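To illustrate the record-at-a-time idea: even within arcpy, an arcpy.da.UpdateCursor iterates over one row at a time and supports the with statement, which should already be much cheaper than two full CalculateField passes. A sketch, untested against your data:
import arcpy

input_table = arcpy.GetParameterAsText(0)

fields = ["back_pres_dist", "back", "dist"]
with arcpy.da.UpdateCursor(input_table, fields) as cursor:
    for row in cursor:
        parts = row[0].split(",")   # e.g. "1,1,2,3533.8" -> ["1", "1", "2", "3533.8"]
        row[1] = parts[1]           # second value goes into "back"
        row[2] = parts[3]           # fourth value goes into "dist"
        cursor.updateRow(row)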
Beyond those general programming approaches, I'm thinking that your best bet would be to break this data into chunks, parallelize, and then reassemble the results. Part of engineering the parallelization could be to spread your data across different disk platters to avoid competing I/O requests. IPython is an add-on for Python that has a pretty easy-to-use, high-level package, "parallel", if you want an easy place to start. There are lots of pretty good videos on YouTube from PyCon 2012; there's a three-hour one where the parallel material starts at around 2:13:00.