Keep Attributes attached to dataset in Pandas and Dask - python

I use Pandas and Dask all the time. I also have a number of custom classes and functions that I rely on for different analyses, and I am constantly editing them to account for either Dask or Pandas. I keep finding myself wishing I could assign attributes to the dataset I am analyzing, which would minimize Dask compute calls and make it easier to manage functions as I switch between data types. Something effectively akin to:
import pandas as pd
import dask.dataframe as dd
from pydataset import data
df = data('titanic')
setattr(df, 'vals12', 1)
test = dd.from_pandas(df, npartitions = 2)
test.vals12 #would still contain the attribute vals12
df = test.compute()
df.vals12 #would still contain the attribute vals12
However, I do not know of a way to achieve this without editing the base packages (Pandas / Dask). So I was wondering: is there a way to achieve the above example without creating a new class (or a static version of the packages), or is there a way to "branch" the repos in a non-public way (allowing my edits to be added, but still letting me pick up future features without pain)?

In the upcoming release of Dask, you will be able to do this using the attrs feature added in pandas 1.0. For now, you can pip install dask from GitHub to use this functionality.
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame({
"a":[0,1,2],
"b":[2,3,4]
})
df.attrs["vals12"] = 1
ddf = dd.from_pandas(df, npartitions=2)
ddf.attrs
{'vals12': 1}

Related

Is it possible to change similar libraries (Data Analysis) in Python within the same code?

I use the modin library for multiprocessing.
While the library is great for faster processing, it fails at merge, and I would like to revert to default pandas in between within the same code.
I understand that, per the PEP 8 E402 convention, imports should be declared once at the top of the file, but my case seems to require otherwise.
import pandas as pd
import modin.pandas as mpd
import os
import ray
ray.init()
os.environ["MODIN_ENGINE"] = "ray"
df = mpd.read_csv()
do stuff
Then I would like to revert to default pandas within the same code,
but how would I do the lines below in pandas? There does not seem to be a clear way to switch between pd and mpd here, and unfortunately modin seems to take precedence over pandas.
df = df.loc[:, df.columns.intersection(['col1', 'col2'])]
df = df.drop_duplicates()
df = df.sort_values(['col1', 'col2'], ascending=[True, True])
Is it possible?
If yes, how?
You can simply do the following:
import modin.pandas as mpd
import pandas as pd
This way you have both modin and the original pandas in memory, and you can switch between them as needed.
Since many have posted answers: in this particular case, as pointed out by #Nin17 and this comment from the Modin GitHub, you can convert from Modin to pandas for single-core processing of operations such as df.merge like this:
import pandas as pd
import modin.pandas as mpd
import os
import ray
ray.init()
os.environ["MODIN_ENGINE"] = "ray"
df_modin = mpd.read_csv() #reading dataframe into Modin for parallel processing
df_pandas = df_modin._to_pandas() #converting Modin Dataframe into pandas for single core processing
And if you would like to convert the dataframe back to a Modin dataframe for parallel processing:
df_modin = mpd.DataFrame(df_pandas)
You can also try the pandarallel package instead of modin; it is based on a similar concept: https://pypi.org/project/pandarallel/#description
Pandarallel benchmarks: https://libraries.io/pypi/pandarallel
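A minimal usage sketch, assuming the documented initialize / parallel_apply API (the dataframe and the row function are purely illustrative):
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()  # starts the worker processes

df = pd.DataFrame({"a": range(10), "b": range(10)})

# parallel_apply mirrors DataFrame.apply but spreads the work across cores
result = df.parallel_apply(lambda row: row["a"] + row["b"], axis=1)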
As #Nin17 said in a comment on the question, this comment from the Modin GitHub describes how to convert a Modin dataframe to pandas. Once you have a pandas dataframe, you can call any pandas method on it. This other comment from the same issue describes how to convert the pandas dataframe back to a Modin dataframe.
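Putting those two comments together, a minimal sketch of the round trip (the file and column names are hypothetical):
import modin.pandas as mpd

left = mpd.read_csv("left.csv")    # hypothetical inputs read with Modin
right = mpd.read_csv("right.csv")

# Drop down to plain pandas for the operation Modin struggles with (e.g. merge)
merged_pd = left._to_pandas().merge(right._to_pandas(), on="col1")

# Convert back to a Modin dataframe to continue parallel processing
merged_modin = mpd.DataFrame(merged_pd)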

How to speed up pandas dataframe creation from huge file?

I have a file bigger than 7GB. I am trying to place it into a dataframe using pandas, like this:
df = pd.read_csv('data.csv')
But it takes too long. Is there a better way to speed up the dataframe creation? I was considering changing the parameter engine='c', since it says in the documentation:
"engine{‘c’, ‘python’}, optional
Parser engine to use. The C engine is faster while the python engine is currently more feature-complete."
But I don't see much gain in speed.
If the problem is that you are not able to create the dataframe at all because the file's size makes the operation fail, you can read it in chunks, as described in this answer.
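For illustration, a minimal chunked-read sketch (the chunk size is an assumption you would tune to your memory):
import pandas as pd

pieces = []
for chunk in pd.read_csv('data.csv', chunksize=1_000_000):  # ~1M rows at a time
    # Filter or aggregate each chunk here so only the reduced result is kept
    pieces.append(chunk)

df = pd.concat(pieces, ignore_index=True)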
If the dataframe is created at some point but you consider it too slow, you can use datatable to read the file, then convert to pandas and continue with your operations:
import pandas as pd
import datatable as dt
# Read with datatable
datatable_df = dt.fread('myfile.csv')
# Then convert the frame into a pandas dataframe
pandas_df = datatable_df.to_pandas()

Databricks: How to switch from R Dataframe to Pandas Dataframe (R to python in the same notebook)

I am writing R code in a Databricks notebook that performs several operations in R. Once the dataframe is cleaned up, I would like to invoke it in a python cell using '%python' and therefore use python code to continue operations on the dataframe.
I would thus like to transform, within the python block, my R Dataframe into a Pandas dataframe. Does anybody know how to do this? Thanks!
I think the namespaces of the different kernels are separate on Databricks, so even in the same notebook you will not see an R variable in Python or vice versa.
My understanding is that there are two methods to share data between kernels: 1) using the filesystem (csv, etc) and 2) temporary Databricks tables. I believe the latter is the more typical route[1].
Filesystem:
%r
write.csv(df, "/FileStore/tmp.csv")
%python
import pandas as pd
df = pd.read_csv("/FileStore/tmp.csv")
Temporary Databricks table:
%r
library(SparkR)
sparkR.session()
df <- read.df("path/to/original_file.csv", source="csv")
registerTempTable(df, "tmp_df")
%python
df = spark.sql("select * from tmp_df").toPandas()
[1] https://forums.databricks.com/questions/16039/use-python-and-r-variable-in-the-same-notebook-amo.html
Note: since rpy2 release 3.3.0, explicit conversion is done as follows (the pandas converter has to be active for the conversion calls, hence the localconverter context):
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

dt = pd.DataFrame()

with localconverter(ro.default_converter + pandas2ri.converter):
    # To R DataFrame
    r_dt = ro.conversion.py2rpy(dt)
    # To pandas DataFrame
    pd_dt = ro.conversion.rpy2py(r_dt)

How to move from Graphlab to pandas

I've been learning GraphLab, but I wanted to take a look at pandas as well, since it's open source and in the future I might find myself at a company that doesn't have a GL license. I was wondering how pandas would handle creating a basic model the way I can with GL.
import pandas as pd
import graphlab as gl
from sklearn.model_selection import train_test_split

data = pd.read_csv("~/Downloads/diamonds.csv")
sframe = gl.SFrame(data)
train_data, test_data = sframe.random_split(.8, seed=1)
train, test = train_test_split(data, train_size=0.75, random_state=88)
reg_model = gl.linear_regression.create(train_data, target="price", features=["carat", "cut", "color"], validation_set=None)
What would be the pandas equivalent of the last line above?
pandas itself doesn't have any predictive modeling built in (that I know of).
There are good guides online on how to leverage pandas within a statistical modeling workflow.
pandas is probably one of the best (if not the best) modules for data manipulation in Python. It'll make storing and manipulating data for modeling much easier than working with lists, raw CSV reading, and so on.
Reading in files is as easy as (notice how intuitive it is):
import pandas as pd
# Excel
df1 = pd.read_excel(PATH_HERE)
# CSV
df1 = pd.read_csv(PATH_HERE)
# JSON
df1 = pd.read_json(PATH_HERE)
and to spit it out:
# Excel
df1.to_excel(PATH_HERE)
# Need I go on again??
It also makes filtering and slicing your data very simple; the official indexing docs cover this in detail (see the short sketch below).
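For illustration, a few common filtering and slicing patterns (the column names here are hypothetical):
# Rows where a column meets a condition
adults = df1[df1["age"] >= 18]

# A subset of columns
subset = df1[["name", "age"]]

# Position-based and label-based selection
first_rows = df1.iloc[:10]
by_label = df1.loc[df1["city"] == "NYC", ["name", "age"]]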
For modeling purposes, have a look at sklearn, and at NLTK for text analysis. There are others, but those are the ones I've used.
For modelling, you can use the sklearn library. Since the GraphLab call above fits a linear regression, the equivalent of the last line, using the pandas training split train from the question, is:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(train[["carat", "cut", "color"]], train["price"])
See the sklearn linear-model docs for details.
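Note that in the diamonds data, cut and color are categorical strings, so sklearn will not accept them as raw features; one common approach (a sketch, not the only option) is to one-hot encode them first:
import pandas as pd
from sklearn.linear_model import LinearRegression

# One-hot encode the categorical columns before fitting
X = pd.get_dummies(train[["carat", "cut", "color"]], columns=["cut", "color"])
y = train["price"]

model = LinearRegression().fit(X, y)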

How to get read excel data into an array with python

In the lab that I work in, we process a lot of data produced by a 96 well plate reader. I'm trying to write a script that will perform a few calculations and output a bar graph using matplotlib.
The problem is that the plate reader outputs data into a .xlsx file. I understand that some modules like pandas have a read_excel function, can you explain how I should go about reading the excel file and putting it into a dataframe?
Thanks
Data sample of a 24 well plate (for simplicity):
0.0868 0.0910 0.0912 0.0929 0.1082 0.1350
0.0466 0.0499 0.0367 0.0445 0.0480 0.0615
0.6998 0.8476 0.9605 0.0429 1.1092 0.0644
0.0970 0.0931 0.1090 0.1002 0.1265 0.1455
I'm not exactly sure what you mean when you say array, but if you mean a matrix, you might be looking for:
import pandas as pd
df = pd.read_excel([path here])
df.to_numpy()  # formerly df.as_matrix(), which was removed in pandas 1.0
This returns a numpy.ndarray.
This task is super easy in Pandas these days.
import pandas as pd
df = pd.read_excel('file_name_here.xlsx', sheet_name='Sheet1')
or
df = pd.read_csv('file_name_here.csv')
This returns a pandas.DataFrame object which is very powerful for performing operations by column, row, over an entire df, or over individual items with iterrows. Not to mention slicing in different ways.
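Tying this back to the plate-reader use case, a minimal sketch of reading the export and producing a bar graph with matplotlib (the file name, the bare-grid layout, and the per-column mean calculation are assumptions):
import pandas as pd
import matplotlib.pyplot as plt

# header=None because the export is assumed to be a bare grid of absorbance values
df = pd.read_excel("plate_output.xlsx", header=None)

# Example calculation: mean reading per plate column
col_means = df.mean(axis=0)

# Bar graph of the per-column means
col_means.plot(kind="bar")
plt.ylabel("Mean absorbance")
plt.show()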
There is also the awesome xlrd package, which has a quick-start example in its documentation.
You can just google for code snippets. I have never used pandas' read_excel function, but xlrd covers all my needs, and can offer even more, I believe.
You could also try it with my wrapper library, which uses xlrd as well:
import pyexcel as pe # pip install pyexcel
import pyexcel.ext.xls # pip install pyexcel-xls
your_matrix = pe.get_array(file_name=path_here) # done
