Appending new column to dask dataframe - python

This is a follow-up question to Shuffling data in dask.
I have an existing dask dataframe df where I wish to do the following:
df['rand_index'] = np.random.permutation(len(df))
However, this gives the error Column assignment doesn't support type ndarray. I tried df.assign(rand_index=np.random.permutation(len(df))), which gives the same error.
Here is a minimal (not) working sample:
import pandas as pd
import dask.dataframe as dd
import numpy as np
df = dd.from_pandas(pd.DataFrame({'A':[1,2,3]*10, 'B':[3,2,1]*10}), npartitions=10)
df['rand_index'] = np.random.permutation(len(df))
Note:
The previous question mentioned using df = df.map_partitions(add_random_column_to_pandas_dataframe, ...) but I'm not sure if that is relevant to this particular case.
Edit 1
I attempted
df['rand_index'] = dd.from_array(np.random.permutation(len(df))), which executed without an issue. When I inspected df.head() it seemed that the new column was created just fine. However, when I look at df.tail(), the rand_index column is a bunch of NaNs.
In fact, just to confirm, I checked df.rand_index.max().compute(), which turned out to be smaller than len(df)-1. So this is probably where df.map_partitions comes into play, as I suspect this is an issue with dask being partitioned. In my particular case I have 80 partitions (not referring to the sample case).

You would need to turn np.random.permutation(len(df)) into a type that dask understands:
permutations = dd.from_array(np.random.permutation(len(df)))
df['rand_index'] = permutations
df
This would yield:
Dask DataFrame Structure:
                    A      B rand_index
npartitions=10
0               int64  int64      int32
3                 ...    ...        ...
...               ...    ...        ...
27                ...    ...        ...
29                ...    ...        ...
Dask Name: assign, 61 tasks
It is now up to you whether you want to call .compute() to calculate the actual results.
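If this assignment leaves NaNs in the later partitions (as described in Edit 1), the new column's divisions probably do not line up with the divisions of df. A minimal sketch of one workaround, assuming df has a default 0-based range index: build the dask array with chunk sizes that match df's partition lengths before turning it into a series.
import numpy as np
import dask.array as da
import dask.dataframe as dd
# chunk sizes that mirror df's partition lengths
chunks = tuple(df.map_partitions(len).compute())
rand = da.from_array(np.random.permutation(sum(chunks)), chunks=chunks)
# with matching chunks, the resulting series should share df's divisions
df['rand_index'] = dd.from_dask_array(rand)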

To assign a column, you should use df.assign.

I got the same problem as in Edit 1.
My workaround is to take a unique column from the existing dataframe and feed it into the dataframe that is to be appended.
import dask.dataframe as dd
import dask.array as da
import numpy as np
import pandas as pd
df = dd.from_pandas(pd.DataFrame({'A':[1,2,3]*2, 'B':[3,2,1]*2, 'idx':[0,1,2,3,4,5]}), npartitions=10)
# per-partition lengths, so the dask arrays below share df's chunking
chunks = tuple(df.map_partitions(len).compute())
size = sum(chunks)
# global permutation and the existing unique column, both chunked to match df
permutations = da.from_array(np.random.permutation(size), chunks=chunks)
idx = da.from_array(df['idx'].compute().values, chunks=chunks)
# build a two-column dask dataframe and merge it back on the unique column
ddf = dd.concat([dd.from_dask_array(c) for c in [idx, permutations]], axis=1)
ddf.columns = ['idx', 'rand_idx']
df = df.merge(ddf, on='idx')
df = df.set_index('rand_idx')
df.compute().head()

Related

DASK: Replace infinite (inf) values in single column

I have a dask dataframe in which a few inf values appear. I wish to replace these on a per-column basis, because where inf exists I can replace it with a value appropriate to the upper bound that can be expected for that column.
I'm having some trouble understanding the documentation, or rather translating it into something I can use to replace infinite values.
What I have been trying is roughly the below, replacing inf with 1000; however, the inf values seem to remain in place, unchanged.
Any advice on how to do this would be excellent. Because this is a huge dataframe (10m rows, 40 cols) I'd prefer to do it in a fashion that doesn't use lambdas or loops, which the below should basically achieve, but doesn't.
ddf['mycolumn'].replace(np.inf,1000)
Following #Enzo's comment, make sure you are assigning the replaced values back to the original column:
import numpy as np
import pandas as pd
import dask.dataframe as dd
df = pd.DataFrame([1, 2, np.inf], columns=['a'])
ddf = dd.from_pandas(df, npartitions=2)
ddf['a'] = ddf['a'].replace(np.inf, 1000)
# check results with: ddf.compute()
# a
# 0 1.0
# 1 2.0
# 2 1000.0
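Since the question mentions wanting a different cap per column across 40 columns, here is a hedged sketch of the nested-dict form of replace, which (as far as I know) dask forwards to pandas; the second column name and the cap values are hypothetical:
import numpy as np
# nested dict form: {column: {old_value: new_value}}
caps = {'mycolumn': {np.inf: 1000}, 'another_column': {np.inf: 250}}
ddf = ddf.replace(caps)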

Running a Levene's test over a list instead of separate vectors

I am analysing a number of data sets for homogeneous variance using scipy.stats.levene(). I have each one of my data sets in a separate column that I put into a pandas dataframe, and I drop NaN values.
My problem is this: I have a lot of samples and would like to run the test simultaneously on all data sets in a list (e.g. something like list_of_samples = list(df.columns.values)). But all my attempts at this give me the error message ValueError: Must enter at least two input vectors.
All help and feedback is greatly appreciated!
My code so far.
### Import modules
import scipy.stats
import csv
import pandas as pd
### Open dataframe, drop one column and NaN values
df = pd.read_csv('data.csv')
df = df.drop(['Object'], axis=1)
df = df.dropna()
### Put sample data in dataframe by column
sample1 = df['column1']
sample2 = df['column2']
sample3 = df['column3']
sample4 = df['column4']
w = scipy.stats.levene(sample1, sample2, sample3, sample4)[0]
pvalue = scipy.stats.levene(sample1, sample2, sample3, sample4)[1]
if pvalue < 0.05:
    Result = "Data shows variance"
else:
    Result = "Data shows no variance"
print(w, pvalue)
I have found a solution using the pingouin module, which is based on scipy. pingouin allows the test to be run on the data directly (organized in either long or wide format).
I found it very helpful to review the meaning of long and wide data, and also of dependent and grouping variables.
Below is an example:
import numpy as np
import pandas as pd
import pingouin as pg
data = pd.read_csv('data.csv')
test = pg.homoscedasticity(data, dv='column1', group='column2', method='levene', alpha=0.05)
The output is a dataframe with the following format:
               W      pval  equal_var
levene  1.583536  0.066545       True
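If you would rather stay with scipy, the ValueError usually just means the list of samples needs to be unpacked into separate arguments with *. A minimal sketch, assuming df contains only the sample columns:
import scipy.stats
samples = [df[col].dropna() for col in df.columns]  # one vector per column
w, pvalue = scipy.stats.levene(*samples)            # * unpacks the list into separate vectors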

How to pick the numeric columns in pd.DataFrame() [duplicate]

Let's say df is a pandas DataFrame.
I would like to find all columns of numeric type.
Something like:
isNumeric = is_numeric(df)
You could use the select_dtypes method of DataFrame. It has two parameters, include and exclude. So isNumeric would look like:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
Simple one-line answer to create a new dataframe with only numeric columns:
df.select_dtypes(include=np.number)
If you want the names of numeric columns:
df.select_dtypes(include=np.number).columns.tolist()
Complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': range(7, 10),
'B': np.random.rand(3),
'C': ['foo','bar','baz'],
'D': ['who','what','when']})
df
# A B C D
# 0 7 0.704021 foo who
# 1 8 0.264025 bar what
# 2 9 0.230671 baz when
df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only
# A B
# 0 7 0.704021
# 1 8 0.264025
# 2 9 0.230671
colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only
# ['A', 'B']
You can use the undocumented function _get_numeric_data() to filter only numeric columns:
df._get_numeric_data()
Example:
In [32]: data
Out[32]:
A B
0 1 s
1 2 s
2 3 s
3 4 s
In [33]: data._get_numeric_data()
Out[33]:
A
0 1
1 2
2 3
3 4
Note that this is a "private method" (i.e., an implementation detail) and is subject to change or total removal in the future. Use with caution.
df.select_dtypes(exclude = ['object'])
Update:
df.select_dtypes(include= np.number)
or, with newer versions of pandas:
df.select_dtypes('number')
Simple one-liner:
df.select_dtypes('number').columns
The following code will return a list of the names of the numeric columns of a data set.
cnames=list(marketing_train.select_dtypes(exclude=['object']).columns)
Here marketing_train is my data set, select_dtypes() selects columns by data type using its exclude and include arguments, and columns is used to fetch the column names of the data set.
The output of the above code will be the following:
['custAge',
'campaign',
'pdays',
'previous',
'emp.var.rate',
'cons.price.idx',
'cons.conf.idx',
'euribor3m',
'nr.employed',
'pmonths',
'pastEmail']
This is another simple way to find the numeric columns in a pandas data frame:
numeric_clmns = df.dtypes[df.dtypes != "object"].index
We can include and exclude data types as required:
train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number') #will include all the numeric types
The following is taken from the select_dtypes docstring (as shown in a Jupyter notebook):
To select all numeric types, use np.number or 'number'
To select strings you must use the object dtype, but note that this will return all object dtype columns (see the NumPy dtype hierarchy: http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html)
To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
To select Pandas categorical dtypes, use 'category'
To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
Although this is an old subject, I think the following formula is easier than the other suggestions:
df[df.describe().columns]
Since describe() only works on numeric columns, the columns of its output will all be numeric.
Please see the below code:
if dataset.select_dtypes(include=[np.number]).shape[1] > 0:
    display(dataset.select_dtypes(include=[np.number]).describe())
if dataset.select_dtypes(include=['object']).shape[1] > 0:
    display(dataset.select_dtypes(include=['object']).describe())
This way you can check whether the values are numeric, such as float and int, or strings. The second if statement checks for string columns, which are referred to by the object dtype.
Adapting this answer, you could do
df.loc[:, df.applymap(np.isreal).all(axis=0)]
Here, df.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .all(axis=0) checks whether all values in a column are True, returning a series of booleans that can be used to index the desired columns.
A lot of the posted answers are inefficient. They either return/select a subset of the original dataframe (a needless copy) or, in the case of describe(), compute needless summary statistics.
To just get the column names that are numeric, one can use a conditional list comprehension with the pd.api.types.is_numeric_dtype function:
numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
I'm not sure when this function was introduced.
def is_type(df, baseType):
    import numpy as np
    import pandas as pd
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data=test, index=df.columns, columns=["test"])

def is_float(df):
    import numpy as np
    return is_type(df, np.floating)  # np.floating instead of the removed np.float alias

def is_number(df):
    import numpy as np
    return is_type(df, np.number)

def is_integer(df):
    import numpy as np
    return is_type(df, np.integer)
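A quick usage sketch for the helpers above, using the example frame from earlier in this thread (A and B numeric, C and D object):
print(is_number(df))
#     test
# A   True
# B   True
# C  False
# D  False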

pandas sample based on criteria

I would like to use the pandas sample function, but with a criterion and without grouping or filtering the data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print(df.sample(n=100))
This will sample 100 rows, but what if I want to sample 50 rows containing 0 and 50 rows containing 1 in df['a']?
You can use the == operator to make a list* of boolean values, and when that list is put into the getter ([]) it will filter the rows. You can then use n=50 to create a sample of 50 rows.
New code
df[df['a']==1].sample(n=50)
Full code
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(low=0, high=5, size=(10000, 2)),columns=['a', 'b'])
print(df[df['a']==1].sample(n=50))
*"List" isn't literally a list in this context, but it is a convenient word for explaining how it works. It's technically a boolean Series that maps each row to True or False.
More obscure DataFrame sampling
If you want to sample 50 rows where a is 1 or 0:
print(df[(df['a']==1) | (df['a']==0)].sample(n=50))
And if you want to sample 50 of each:
df1 = df[df['a']==1].sample(n=50)
df0 = df[df['a']==0].sample(n=50)
print(pd.concat([df1,df0]))
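As an alternative, and although the question asks to avoid grouping, newer pandas versions (1.1+) can draw both samples in a single call with DataFrameGroupBy.sample; a small sketch:
# keep only the rows where a is 0 or 1, then sample 50 rows from each group
print(df[df['a'].isin([0, 1])].groupby('a').sample(n=50))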

Pandas: df.groupby() is too slow for a big data set. Any alternative methods?

I have a pandas.DataFrame with 3.8 million rows and one column, and I'm trying to group them by the index.
The index is the customer ID. I want to group the qty_liter by the index:
df = df.groupby(df.index).sum()
But it takes forever to finish the computation. Are there any alternative ways to deal with a very large data set?
Here is the df.info():
<class 'pandas.core.frame.DataFrame'>
Index: 3842595 entries, -2147153165 to \N
Data columns (total 1 columns):
qty_liter object
dtypes: object(1)
memory usage: 58.6+ MB
The data looks like this:
The problem is that your data are not numeric. Processing strings takes a lot longer than processing numbers. Try this first:
df.index = df.index.astype(int)
df.qty_liter = df.qty_liter.astype(float)
Then do groupby() again. It should be much faster. If it is, see if you can modify your data loading step to have the proper dtypes from the beginning.
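A hedged sketch of what fixing the loading step could look like if the data comes from a CSV; the file name and column names here are hypothetical:
import pandas as pd
# parse the proper dtypes up front instead of converting after the fact
df = pd.read_csv('data.csv',
                 dtype={'customer_id': 'int64', 'qty_liter': 'float64'},
                 index_col='customer_id')
df = df.groupby(df.index).sum()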
Your data is classified into too many categories, which is the main reason the groupby is so slow. I tried using Bodo to see how it would do with a groupby on a large data set. I ran the code with regular sequential Pandas and with parallelized Bodo. It took about 20 seconds for Pandas and only 5 seconds for Bodo to run. Bodo basically parallelizes your Pandas code automatically and allows you to run it on multiple processors, which you cannot do with native pandas. It is free for up to four cores: https://docs.bodo.ai/latest/source/installation_and_setup/install.html
Notes on data generation: I generated a relatively large dataset with 20 million rows and 18 numerical columns. To make the generated data more similar to your dataset, two other columns named "index" and "qty_liter" were added.
#data generation
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(20000000, 18), columns = list('ABCDEFGHIJKLMNOPQR'))
df['index'] = np.random.randint(2147400000,2147500000,20000000).astype(str)
df['qty_liter'] = np.random.randn(20000000)
df.to_parquet("data.pq")
With Regular Pandas:
import time
import pandas as pd
import numpy as np
start = time.time()
df = pd.read_parquet("data.pq")
grouped = df.groupby(['index'])['qty_liter'].sum()
end = time.time()
print("computation time: ", end - start)
print(grouped.head())
output:
computation time: 19.29292106628418
index
2147400000 29.701094
2147400001 -7.164031
2147400002 -21.104117
2147400003 7.315127
2147400004 -12.661605
Name: qty_liter, dtype: float64
With Bodo:
%%px
import numpy as np
import pandas as pd
import time
import bodo

@bodo.jit(distributed=['df'])
def group_by():
    start = time.time()
    df = pd.read_parquet("data.pq")
    df = df.groupby(['index'])['qty_liter'].sum()
    end = time.time()
    print("computation time: ", end - start)
    print(df.head())
    return df

df = group_by()
output:
[stdout:0]
computation time: 5.12944599299226
index
2147437531 6.975570
2147456463 1.729212
2147447371 26.358158
2147407055 -6.885663
2147454784 -5.721883
Name: qty_liter, dtype: float64
Disclaimer: I am a data scientist advocate working at Bodo.ai
I do not use strings, but integer values that define the groups. Still, it is very slow: about 3 minutes vs. a fraction of a second in Stata. The number of observations is about 113k, and the number of groups defined by x, y, z is about 26k.
a= df.groupby(["x", "y", "z"])["b"].describe()[['max']]
x,y,z: integer values
b: real value
Use the categorical data type if you can't convert the values to numeric:
df = df.astype('category')
Then, when you do the groupby, set observed=True:
df = df.groupby(df.index, observed=True).sum()
From the documentation:
observed bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
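A small self-contained sketch of what observed=True does with a categorical grouper (the toy column names are made up):
import pandas as pd
df = pd.DataFrame({'customer': pd.Categorical(['a', 'a', 'b'], categories=['a', 'b', 'c']),
                   'qty_liter': [1.0, 2.0, 3.0]})
# observed=True drops the unused category 'c' instead of emitting an empty group for it
print(df.groupby('customer', observed=True)['qty_liter'].sum())
# customer
# a    3.0
# b    3.0
# Name: qty_liter, dtype: float64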
