Python Dask map_partitions

Probably a continuation of this question, working from the dask docs examples for map_partitions.
import pandas as pd
import dask.dataframe as dd
from random import randint
df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)
def myadd(df):
    new_value = df.x + randint(1, 4)
    return new_value
res = ddf.map_partitions(lambda df: df.assign(z=myadd)).compute()
res
In the above code, randint is only being called once, not once per row as I would expect. How come?
Output:
X Y Z
1 1 4
2 2 5
3 3 6
4 4 7
5 5 8

If you performed the same operation (df.x + randint(1,4)) on the original pandas dataframe, you would get only one random number, added to every value of the column. Dask does exactly the same as pandas here, except that the function is called once per partition: this is what map_partitions does.
If you want a new random number for every row, first think about how you would achieve this with pandas. Two ways come to mind:
df.x.map(lambda x: x + randint(1, 4))
or
df.x + np.random.randint(1, 5, size=len(df.x))
Note that random.randint includes its upper bound while np.random.randint excludes it, hence the 5 in the second form (which also needs import numpy as np). If you replace your new_value = line with one of these, it will work as expected.
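To see the difference without a Dask cluster, here is a minimal pandas-only sketch. The helper names add_scalar_noise and add_rowwise_noise are made up for illustration; each one could be passed to map_partitions in place of myadd, assuming the partition is an ordinary pandas DataFrame.

```python
import random

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})

def add_scalar_noise(part):
    # One draw for the whole frame (or partition): every row gets the same offset.
    return part.x + random.randint(1, 4)

def add_rowwise_noise(part):
    # One draw per row; np.random.randint excludes the upper bound, so use 5.
    return part.x + np.random.randint(1, 5, size=len(part))

same_offset = add_scalar_noise(df) - df.x
row_offsets = add_rowwise_noise(df) - df.x
print(same_offset.nunique())            # a single distinct offset
print(row_offsets.between(1, 4).all())  # every offset lies in 1..4
```

The first helper reproduces the behaviour from the question; the second draws an independent value for each row.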

Related

Generating n amount new rows on a pandas dataframe based off values given in other columns

So, I have the following sample dataframe (included only one row for clarity/simplicity):
df = pd.DataFrame({'base_number': [2],
                   'std_dev': [1]})
df['amount_needed'] = 5
df['upper_bound'] = df['base_number'] + df['std_dev']
df['lower_bound'] = df['base_number'] - df['std_dev']
For each row, I would like to generate new rows so that the total per original row equals the number given by df['amount_needed'] (5 in this example). Those 5 new rows should be spread evenly between df['lower_bound'] and df['upper_bound']. So for the example above, I would like the following result as an output:
df_new = pd.DataFrame({'base_number': [1, 1.5, 2, 2.5, 3]})
Of course, this process will be done for all rows in a much larger dataframe, with many other columns which aren't relevant to this particular issue, which is why I'm trying to find a way to automate this process.

One row of df will create one series (or one data frame). Here's one way to iterate over df and create the series with the values you specified:
for row in df.itertuples():
    arr = np.linspace(row.lower_bound,
                      row.upper_bound,
                      row.amount_needed)
    s = pd.Series(arr).rename('base_number')
    print(s)
0 1.0
1 1.5
2 2.0
3 2.5
4 3.0
Name: base_number, dtype: float64

I ended up building on jsmart's answer to generate a new dataframe, keeping the original ids so that the other columns of the old dataframe can be merged onto the new one by id as needed (whole process shown below):
import itertools
import numpy as np
import pandas as pd
amount_needed = 5
df = pd.DataFrame({'base_number': [2, 4, 8, 0],
                   'std_dev': [1, 2, 3, 0]})
df['amount_needed'] = amount_needed
df['upper_bound'] = df['base_number'] + df['std_dev']
df['lower_bound'] = df['base_number'] - df['std_dev']
s1 = pd.Series([], dtype=float)
for row in df.itertuples():
    arr = np.linspace(row.lower_bound,
                      row.upper_bound,
                      row.amount_needed)
    s = pd.Series(arr).rename('base_number')
    s1 = pd.concat([s1, s])
df_new = pd.DataFrame({'base_number': s1})
# Ids 1..len(df), each repeated amount_needed times, in row order
ids_og = list(range(1, len(df) + 1))
ids_og = [ids_og] * amount_needed
ids_og = sorted(itertools.chain.from_iterable(ids_og))
df_new['id'] = ids_og
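As a possible alternative to the loop, np.linspace accepts array endpoints (NumPy 1.16 or later), so all rows can be expanded in one call. This is only a sketch and assumes the same amount_needed for every row, as in the example above.

```python
import numpy as np
import pandas as pd

amount_needed = 5
df = pd.DataFrame({'base_number': [2, 4, 8, 0],
                   'std_dev': [1, 2, 3, 0]})
lower = (df['base_number'] - df['std_dev']).to_numpy()
upper = (df['base_number'] + df['std_dev']).to_numpy()

# With array endpoints and axis=-1, row i of grid holds the points for input row i.
grid = np.linspace(lower, upper, amount_needed, axis=-1)

df_new = pd.DataFrame({
    # Original 1-based row id, repeated once per generated point.
    'id': np.repeat(np.arange(1, len(df) + 1), amount_needed),
    'base_number': grid.ravel(),
})
print(df_new.head())
```

The id column plays the same role as ids_og above and can be used to merge the remaining columns back in.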

Pandas: efficient way to get a random subset from each row within a restricted column range

I have some numerical time-series of varying lengths stored in a wide pandas dataframe. Each row corresponds to one series and each column to a measurement time point. Because of their varying length, those series can have tails of missing values (NA) on the left (first time points), on the right (last time points), or both. Each row always contains a contiguous stretch without NA of at least a minimum length.
I need to get a random subset of fixed length from each of these rows, without including any NA. Ideally, I wish to keep the original dataframe intact and to report the subsets in a new one.
I managed to obtain this output with a very inefficient for loop that goes through each row one by one, determines a start for the crop position such that NAs will not be included in the output and copies the cropped result. This works but it is extremely slow on large datasets. Here is the code:
import pandas as pd
import numpy as np
from copy import copy
def crop_random(df_in, output_length, ignore_na_tails=True):
    # Initialize new dataframe
    colnames = ['X_' + str(i) for i in range(output_length)]
    df_crop = pd.DataFrame(index=df_in.index, columns=colnames)
    # Go through all rows
    for irow in range(df_in.shape[0]):
        series = copy(df_in.iloc[irow, :])
        series = np.array(series).astype('float')
        length = len(series)
        if ignore_na_tails:
            pos_non_na = np.where(~np.isnan(series))
            # Range where the subset might start
            lo = pos_non_na[0][0]
            hi = pos_non_na[0][-1]
            left = np.random.randint(lo, hi - output_length + 2)
        else:
            # +1 because randint excludes its upper bound
            left = np.random.randint(0, length - output_length + 1)
        series = series[left : left + output_length]
        df_crop.iloc[irow, :] = series
    return df_crop
And a toy example:
df = pd.DataFrame.from_dict({'t0': [np.nan, 1, np.nan],
                             't1': [np.nan, 2, np.nan],
                             't2': [np.nan, 3, np.nan],
                             't3': [1, 4, 1],
                             't4': [2, 5, 2],
                             't5': [3, 6, 3],
                             't6': [4, 7, np.nan],
                             't7': [5, 8, np.nan],
                             't8': [6, 9, np.nan]})
# t0 t1 t2 t3 t4 t5 t6 t7 t8
# 0 NaN NaN NaN 1 2 3 4 5 6
# 1 1 2 3 4 5 6 7 8 9
# 2 NaN NaN NaN 1 2 3 NaN NaN NaN
crop_random(df, 3)
# One possible output:
# X_0 X_1 X_2
# 0 2 3 4
# 1 7 8 9
# 2 1 2 3
How could I achieve the same results in a way that scales to large dataframes?
Edit: Moved my improved solution to the answer section.

I managed to speed things up quite drastically with:
def crop_random(dataset, output_length, ignore_na_tails=True):
    # Get a random range to crop for each row
    def get_range_crop(series, output_length, ignore_na_tails):
        series = np.array(series).astype('float')
        if ignore_na_tails:
            pos_non_na = np.where(~np.isnan(series))
            start = pos_non_na[0][0]
            end = pos_non_na[0][-1]
            # +1 because randint excludes its upper bound; +1 for the selection span
            left = np.random.randint(start, end - output_length + 2)
        else:
            length = len(series)
            left = np.random.randint(0, length - output_length + 1)
        right = left + output_length
        return left, right
    # Crop the rows to random range, reset_index to do concat without recreating new columns
    range_subset = dataset.apply(get_range_crop, args=(output_length, ignore_na_tails), axis=1)
    new_rows = [dataset.iloc[irow, range_subset.iloc[irow][0]: range_subset.iloc[irow][1]]
                for irow in range(dataset.shape[0])]
    for row in new_rows:
        row.reset_index(drop=True, inplace=True)
    # Concatenate all rows
    dataset_cropped = pd.concat(new_rows, axis=1).T
    return dataset_cropped
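For very large frames, the per-row work can probably be pushed into NumPy entirely. The following is a sketch, not a drop-in replacement: crop_random_vectorized is a hypothetical name, and it assumes every row has a single contiguous non-NA stretch of at least output_length values, as described in the question.

```python
import numpy as np
import pandas as pd

def crop_random_vectorized(df_in, output_length, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    values = df_in.to_numpy(dtype=float)
    valid = ~np.isnan(values)
    n_cols = values.shape[1]
    # First and last non-NaN column position in each row.
    first = valid.argmax(axis=1)
    last = n_cols - 1 - valid[:, ::-1].argmax(axis=1)
    # Draw one start per row; the upper bound of integers() is exclusive.
    starts = rng.integers(first, last - output_length + 2)
    # Build a (rows, output_length) grid of column positions and gather.
    cols = starts[:, None] + np.arange(output_length)
    cropped = values[np.arange(len(values))[:, None], cols]
    colnames = ['X_' + str(i) for i in range(output_length)]
    return pd.DataFrame(cropped, index=df_in.index, columns=colnames)

df = pd.DataFrame({'t0': [np.nan, 1, np.nan], 't1': [np.nan, 2, np.nan],
                   't2': [np.nan, 3, np.nan], 't3': [1, 4, 1],
                   't4': [2, 5, 2], 't5': [3, 6, 3],
                   't6': [4, 7, np.nan], 't7': [5, 8, np.nan],
                   't8': [6, 9, np.nan]})
out = crop_random_vectorized(df, 3)
print(out)
```

The last row of the toy example has exactly three valid values, so its crop is always 1, 2, 3; the other rows vary from run to run.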

is it possible to use numpy to calculate on recursive data [duplicate]

I have a time-series A holding several values. I need to obtain a series B that is defined algebraically as follows:
B[t] = a * A[t] + b * B[t-1]
where we can assume B[0] = 0, and a and b are real numbers.
Is there any way to do this type of recursive computation in Pandas? Or do I have no choice but to loop in Python as suggested in this answer?
As an example of input:
> A = pd.Series(np.random.randn(10,))
0 -0.310354
1 -0.739515
2 -0.065390
3 0.214966
4 -0.605490
5 1.293448
6 -3.068725
7 -0.208818
8 0.930881
9 1.669210

As I noted in a comment, you can use scipy.signal.lfilter. In this case (assuming A is a one-dimensional numpy array), all you need is:
B = lfilter([a], [1.0, -b], A)
Here's a complete script:
import numpy as np
from scipy.signal import lfilter
np.random.seed(123)
A = np.random.randn(10)
a = 2.0
b = 3.0
# Compute the recursion using lfilter.
# [a] and [1, -b] are the coefficients of the numerator and
# denominator, resp., of the filter's transfer function.
B = lfilter([a], [1, -b], A)
print(B)
# Compare to a simple loop.
B2 = np.empty(len(A))
for k in range(0, len(B2)):
    if k == 0:
        B2[k] = a*A[k]
    else:
        B2[k] = a*A[k] + b*B2[k-1]
print(B2)
print("max difference:", np.max(np.abs(B2 - B)))
The output of the script is:
[ -2.17126121e+00 -4.51909273e+00 -1.29913212e+01 -4.19865530e+01
-1.27116859e+02 -3.78047705e+02 -1.13899647e+03 -3.41784725e+03
-1.02510099e+04 -3.07547631e+04]
[ -2.17126121e+00 -4.51909273e+00 -1.29913212e+01 -4.19865530e+01
-1.27116859e+02 -3.78047705e+02 -1.13899647e+03 -3.41784725e+03
-1.02510099e+04 -3.07547631e+04]
max difference: 0.0
Another example, in IPython, using a pandas DataFrame instead of a numpy array:
If you have
In [12]: df = pd.DataFrame([1, 7, 9, 5], columns=['A'])
In [13]: df
Out[13]:
A
0 1
1 7
2 9
3 5
and you want to create a new column, B, such that B[k] = A[k] + 2*B[k-1] (with B[k] == 0 for k < 0), you can write
In [14]: df['B'] = lfilter([1], [1, -2], df['A'].astype(float))
In [15]: df
Out[15]:
A B
0 1 1
1 7 9
2 9 27
3 5 59
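If SciPy is not available, the same recursion B[t] = a*A[t] + b*B[t-1] can be sketched with the standard library alone. The function name linear_recurrence is made up for this example, and the initial keyword of itertools.accumulate requires Python 3.8 or later. It still loops in Python, so it will likely be slower than lfilter on long series.

```python
from itertools import accumulate

def linear_recurrence(A, a, b):
    """Return B with B[t] = a*A[t] + b*B[t-1], taking the value before B[0] as 0."""
    # accumulate threads the previous B value through each step;
    # initial=0.0 stands in for the value before B[0] and is dropped afterwards.
    steps = accumulate(A, lambda prev, x: a * x + b * prev, initial=0.0)
    return list(steps)[1:]

B = linear_recurrence([1, 7, 9, 5], a=1, b=2)
print(B)  # [1.0, 9.0, 27.0, 59.0]
```

The result matches the B column computed with lfilter in the DataFrame example.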

Find location of specific value in dataframe of distances

I've been browsing for an answer to my issue but I can't seem to find a suitable solution. I have a dataframe with distances (NxN cells) and I find the minimum distance of the whole dataframe with:
min_distance = distances.values.min()
Now I need to find the location (which row and which column of the dataframe) of the min_distance. Any ideas?
EDIT
Minimal code
import numpy as np
import pandas as pd
distances = []
for i in range(5):
    distances.append([])
    for j in range(5):
        distances[i].append(np.random.randint(10))
distances = pd.DataFrame(distances)
min_distance = distances.values.min()
print("Minimum=", min_distance)
print("Location of minimum value=")

It depends on what form you want your result in, but a very straightforward approach is to use stack and idxmin.
Like so:
Setup
import pandas as pd
df = pd.DataFrame([[2, 2, 2], [2, 1, 2], [2, 2, 2]],
                  columns=list('ABC'), index=list('abc'))
print(df)
   A  B  C
a  2  2  2
b  2  1  2
c  2  2  2
We should expect the min to be 1 and its location to be row b, column B.
Solution
df.stack().idxmin()
('b', 'B')
You can then reshape this result however you need; idxmin on the stacked frame simply returns a (row, column) tuple.

Generate example:
N = 4
df = pd.DataFrame(np.random.rand(N,N))
Find minimal index of flattened dataframe:
idx_min = df.values.flatten().argmin()
Simple arithmetic to get the row and column numbers back (the flattened array is in row-major order):
row = idx_min // N
column = idx_min % N
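The same index arithmetic is available as np.unravel_index, which also handles non-square frames. A small sketch, reusing the 3x3 frame from the other answer as example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[2, 2, 2], [2, 1, 2], [2, 2, 2]],
                  columns=list('ABC'), index=list('abc'))
values = df.to_numpy()
# argmin works on the flattened array; unravel_index maps it back to (row, col).
row_pos, col_pos = np.unravel_index(values.argmin(), values.shape)
# Translate positions into the frame's own labels.
location = (df.index[row_pos], df.columns[col_pos])
print(location)  # ('b', 'B')
```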

Assign values in Pandas series based on condition?

I have a dataframe df like
A B
1 2
3 4
I then want to create 2 new series
t = pd.Series()
r = pd.Series()
I was able to assign values to t using the condition cond as below
t = "1+" + df.A.astype(str) + '+' + df.B.astype(str)
cond = df['A']<df['B']
t[cond] = "1+" + df.loc[cond,'B'].astype(str) + '+' + df.loc[cond,'A'].astype(str)
But I'm having problems with r. I just want r to contain 2 where cond is satisfied and 1 otherwise.
If I just try
r = 1
r[cond] = 2
Then I get TypeError: 'int' object does not support item assignment
I figure I could just run a for loop through df and check the cases in cond through each row of df, but I was wondering if Pandas offers a more efficient way instead?

You will laugh at how easy this is:
r = cond + 1
The reason is that cond is a boolean series, and True and False behave as 1 and 0 in arithmetic. Adding one coerces the booleans to integers, so True maps to 2 and False maps to 1.
df = pd.DataFrame({'A': [1, 3, 4],
                   'B': [2, 4, 3]})
cond = df['A'] < df['B']
>>> cond + 1
0    2
1    2
2    1
dtype: int64

When you assign 1 to r as in
r = 1
r now references the integer 1, so r[cond] tries to index an integer as if it were a series.
You want to first create a series of ones for r with the same index as cond, and then assign into it. Something like
r = pd.Series(1, index=cond.index)
r[cond] = 2
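Another common way to express "2 where the condition holds, 1 otherwise" is np.where, which generalises to values that are not simply cond + 1. A short sketch using the example frame from the other answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4],
                   'B': [2, 4, 3]})
cond = df['A'] < df['B']
# np.where picks 2 where cond is True and 1 elsewhere.
r = pd.Series(np.where(cond, 2, 1), index=df.index)
print(r.tolist())  # [2, 2, 1]
```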
