Related
This is similar to previous questions about how to expand a list-based column across several columns, but the solutions I'm seeing don't seem to work for Dask. Note that the true DFs I'm working with are too large to hold in memory, so converting to pandas first is not an option.
I have a df with column that contains lists:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'a': [np.random.randint(100, size=4) for _ in range(20)]})
dask_df = dd.from_pandas(df, chunksize=10)
dask_df['a'].compute()
0 [52, 38, 59, 78]
1 [79, 71, 13, 63]
2 [15, 81, 79, 76]
3 [53, 4, 94, 62]
4 [91, 34, 26, 92]
5 [96, 1, 69, 27]
6 [84, 91, 96, 68]
7 [93, 56, 45, 40]
8 [54, 1, 96, 76]
9 [27, 11, 79, 7]
10 [27, 60, 78, 23]
11 [56, 61, 88, 68]
12 [81, 10, 79, 65]
13 [34, 49, 30, 3]
14 [32, 46, 53, 62]
15 [20, 46, 87, 31]
16 [89, 9, 11, 4]
17 [26, 46, 19, 27]
18 [79, 44, 45, 56]
19 [22, 18, 31, 90]
Name: a, dtype: object
According to this solution, if this were a pd.DataFrame I could simply apply pd.Series to the column. Trying the same on the Dask DataFrame raises an error:
new_dask_df = dask_df['a'].apply(pd.Series)
ValueError: The columns in the computed data do not match the columns in the provided metadata
Extra: [1, 2, 3]
Missing: []
There's another solution listed here:
import dask.array as da
import dask.dataframe as dd
x = da.ones((4, 2), chunks=(2, 2))
df = dd.io.from_dask_array(x, columns=['a', 'b'])
df.compute()
So for dask I tried:
df = dd.io.from_dask_array(dask_df.values)
but that just returns the same single-column DataFrame I had before.
Not really sure why, as the types of the example 'x' and of the values in my df are the same:
print(type(dask_df.values), type(x))
<class 'dask.array.core.Array'> <class 'dask.array.core.Array'>
print(type(dask_df.values.compute()[0]), type(x.compute()[0]))
<class 'numpy.ndarray'> <class 'numpy.ndarray'>
Edit: I kind of have a working solution, but it involves iterating through each groupby object. It feels like there should be a better way:
dask_groups = dask_df.explode('a').reset_index().groupby('index')
final_df = []
for idx in dask_df.index.values.compute():
    group = dask_groups.get_group(idx).drop(columns='index').compute()
    group_size = list(range(len(group)))
    row = group.transpose()
    row.columns = group_size
    row['index'] = idx
    final_df.append(dd.from_pandas(row, chunksize=10))
final_df = dd.concat(final_df).set_index('index')
In this case dask doesn't know what to expect from the outcome, so it's best to specify meta explicitly:
# this is a short-cut to use the existing pandas df
# in actual code it is sufficient to provide an
# empty series with the expected dtype
meta = df['a'].apply(pd.Series)
new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)
new_dask_df.compute()
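If materializing the full pandas DataFrame just to build meta is not an option (as in the too-big-for-memory case), the meta frame can be constructed by hand. A minimal sketch, assuming the lists always expand to four int64 columns named 0 through 3 (both assumptions about the data):
# hand-built meta: an empty frame with the expected column names and dtypes
meta = pd.DataFrame({i: pd.Series(dtype='int64') for i in range(4)})
new_dask_df = dask_df['a'].apply(pd.Series, meta=meta)
new_dask_df.compute()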
I got a working solution. My original function created a list which resulted in the column of lists, as above. Changing the applied function to return a dask bag seems to do the trick:
import random
import dask.bag as db

def create_df_row(x):
    vals = np.random.randint(2, size=4)
    return db.from_sequence([vals], partition_size=2).to_dataframe()
test_df = dd.from_pandas(pd.DataFrame({'a':[random.choice(['a', 'b', 'c']) for _ in range(20)]}), chunksize=10)
test_df.head()
mini_dfs = [*test_df.groupby('a')['a'].apply(lambda x: create_df_row(x))]
result = dd.concat(mini_dfs)
result.compute().head()
But I'm not sure if this solves the in-memory issue, as now I'm holding a list of groupby results.
Here's how to expand a list-like column across multiple columns manually:
dask_df["a0"] = dask_df["a"].str[0]
dask_df["a1"] = dask_df["a"].str[1]
dask_df["a2"] = dask_df["a"].str[2]
dask_df["a3"] = dask_df["a"].str[3]
print(dask_df.head())
a a0 a1 a2 a3
0 [71, 16, 0, 10] 71 16 0 10
1 [59, 65, 99, 74] 59 65 99 74
2 [83, 26, 33, 38] 83 26 33 38
3 [70, 5, 19, 37] 70 5 19 37
4 [0, 59, 4, 80] 0 59 4 80
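To avoid writing one assignment per element, the same idea can be looped. A minimal sketch, assuming every list in 'a' has exactly four elements:
n_elements = 4  # assumed list length
for i in range(n_elements):
    dask_df[f"a{i}"] = dask_df["a"].str[i]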
SultanOrazbayev's answer seems more elegant.
I have a dataframe of size 700x20. My data are pixel intensity coordinates for specific locations on an image; I have 14 people, each with 50 images. I am trying to perform dimensionality reduction, and one of the steps requires me to calculate the mean of each class, where I have two classes. In my dataframe every block of 50 rows contains the features that belong to one class, so I'd have features 0 to 50 for class A, 51 to 100 for class B, 101 to 150 for class A, 151 to 200 for class B, and so on.
What I want to do is calculate the mean over every such block of rows, from row N to row M. Here's a link to the dataframe for better visualization of the problem: Dataframe pickle file
What I tried was ordering the dataframe and calculating separately, but it didn't work: it calculated the mean for every row and grouped them into 14 different classes.
class_feature_means = pd.DataFrame(columns=target_names)
for c, rows in df.groupby('class'):
    class_feature_means[c] = rows.mean()
class_feature_means
Minimal reproducible example:
import string
import numpy as np
import pandas as pd

my_array = np.asarray([[31, 25, 17, 62],
[31, 26, 19, 59,],
[31, 23, 17, 67,],
[31, 23, 19, 67,],
[31, 28, 17, 65,],
[32, 26, 19, 62,],
[32, 26, 17, 66,],
[30, 24, 17, 68],
[29, 24, 17, 68],
[33, 24, 17, 68],
[32, 52, 16, 68],
[29, 24, 17, 68],
[33, 24, 17, 68],
[32, 52, 16, 68],
[29, 24, 17, 68],
[33, 24, 17, 68],
[32, 52, 16, 68],
[30, 25, 16, 97]])
my_array = my_array.reshape(18, 4)
indices = sorted(list(range(0,int(my_array.shape[0]/3)))*3)
class_dict = dict(zip(range(0,int((my_array.shape[0]/3))), string.ascii_uppercase))
target_names = ["Index_" + c for c in class_dict.values()]
pixel_index = [1, 2, 3, 4]
X = pd.DataFrame(my_array, columns= pixel_index)
y = pd.Categorical.from_codes(indices,target_names)
df = X.join(pd.Series(y,name='class'))
df
Basically what I want to do is group classes A, C, E together, take their sum and divide by 3, thereby getting the mean value for class A, or let's call it class 0.
Then group classes B, D, F together, take their sum and divide by 3, thereby getting the mean value for class B, or class 1.
Create a helper array with integer division and modulo to label the groups, pass it to groupby to aggregate the sum, and finally divide by N:
N = 3
arr = np.arange(len(df)) // N % 2
print (arr)
[0 0 0 1 1 1 0 0 0 1 1 1 0 0 0 1 1 1]
df = df.groupby(arr).sum() / N
print (df)
1 2 3 4
0 92.666667 82.666667 51.333333 198.000000
1 94.333333 92.666667 51.333333 210.333333
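If readable labels are wanted in the result, a small variation on the same idea, starting again from the original df (the 'class_0'/'class_1' names are just placeholders):
# same grouping key, mapped to string labels; drop the categorical 'class'
# column so only the numeric pixel columns are summed
labels = np.where(np.arange(len(df)) // N % 2 == 0, 'class_0', 'class_1')
print(df.drop(columns='class').groupby(labels).sum() / N)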
Input dataframe is as below:
data = {
's_id' :[5,7,26,70.0,55,71.0,8.0,'nan','nan',4],
'r_id' : [[34, 44, 23, 11, 71], [53, 33, 73, 41], [17], [10, 31], [17], [75, 8],[7],[68],[50],[]]
}
df = pd.DataFrame.from_dict(data)
df
Out[240]:
s_id r_id
0 5 [34, 44, 23, 11, 71]
1 7 [53, 33, 73, 41]
2 26 [17]
3 70 [10, 31]
4 55 [17]
5 71 [75, 8]
6 8 [7]
7 nan [68]
8 nan [50]
9 4 []
Expected dataframe
data = {
's_id' :[5,7,26,70.0,55,71.0,8.0,'nan','nan',4],
'r_id' : [[5,34, 44, 23, 11, 71], [7,53, 33, 73, 41], [26,17], [70,10, 31], [55,17], [71,75, 8],[8,7],[68],[50],[4]]
}
df = pd.DataFrame.from_dict(data)
df
Out[241]:
s_id r_id
0 5 [5, 34, 44, 23, 11, 71]
1 7 [7, 53, 33, 73, 41]
2 26 [26, 17]
3 70 [70, 10, 31]
4 55 [55, 17]
5 71 [71, 75, 8]
6 8 [8, 7]
7 nan [68]
8 nan [50]
9 4 [4]
I need to populate the r_id list column with the value from s_id as the first element of each list. I also have nan values, and some of them are appearing as floats. Thank you.
I tried the following:
df['r_id'] = df["s_id"].apply(lambda x : x.append(df['r_id']) )
df['r_id'] = df["s_id"].apply(lambda x : [x].append(df['r_id'].values.tolist()))
If the nans are missing values, use apply to prepend s_id (converted to a one-element list of integers) and skip rows where it is missing:
data = {
's_id' :[5,7,26,70.0,55,71.0,8.0,np.nan,np.nan,4],
'r_id' : [[34, 44, 23, 11, 71], [53, 33, 73, 41],
[17], [10, 31], [17], [75, 8],[7],[68],[50],[]]
}
df = pd.DataFrame.from_dict(data)
print (df)
f = lambda x : [int(x["s_id"])] + x['r_id'] if pd.notna(x["s_id"]) else x['r_id']
df['r_id'] = df.apply(f, axis=1)
print (df)
s_id r_id
0 5.0 [5, 34, 44, 23, 11, 71]
1 7.0 [7, 53, 33, 73, 41]
2 26.0 [26, 17]
3 70.0 [70, 10, 31]
4 55.0 [55, 17]
5 71.0 [71, 75, 8]
6 8.0 [8, 7]
7 NaN [68]
8 NaN [50]
9 4.0 [4]
Another idea is to filter the column and apply the function only to non-NaN rows:
m = df["s_id"].notna()
f = lambda x : [int(x["s_id"])] + x['r_id']
df.loc[m, 'r_id'] = df[m].apply(f, axis=1)
print (df)
s_id r_id
0 5.0 [5, 34, 44, 23, 11, 71]
1 7.0 [7, 53, 33, 73, 41]
2 26.0 [26, 17]
3 70.0 [70, 10, 31]
4 55.0 [55, 17]
5 71.0 [71, 75, 8]
6 8.0 [8, 7]
7 NaN [68]
8 NaN [50]
9 4.0 [4]
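For completeness, the same result can be produced without DataFrame.apply, using a plain list comprehension over the two columns (a sketch, under the same assumption that the nans are real missing values):
# prepend s_id (as int) when it is present, otherwise leave r_id unchanged
df['r_id'] = [[int(s)] + r if pd.notna(s) else r
              for s, r in zip(df['s_id'], df['r_id'])]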
I have an Nx2 matrix such as:
M = [[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]]
I need to create an Nx3 matrix that reflects the relationship of the rows of the first matrix in the following way:
Use the right column to identify candidates for range boundaries; the condition is value >= 1000.
Applying this condition to the matrix gives:
[[10, 1000],
[20, 5000],
[32, 3000],
[35, 3500],
[50, 5000],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000],]
So far I came up with M[M[:,1] >= 1000], which works. For this new matrix I now want to check the points in the first column where the distance to the next point is <= 10, and use these as range boundaries.
What I came up with so far: np.diff(M[:,0]) <= 10 which returns:
[True, False, True, False, True, True, True, False]
This is where I'm stuck. I want to use this condition to define lower and upper boundary of a range. For example:
[[10, 1000], #<- Range 1 start
[20, 5000], #<- Range 1 end (as 32 would be 12 points away)
[32, 3000], #<- Range 2 start
[35, 3500], #<- Range 2 end
[50, 5000], #<- Range 3 start
[55, 2000], #<- Range 3 cont (as 55 is only 5 points away)
[58, 3000], #<- Range 3 cont
[66, 4000], #<- Range 3 end
[90, 5000]] #<- Range 4 start and end (as there is no point +-10)
Lastly, referring back to the very first matrix, I want to add the right-column values together for each range within (and including) the boundaries.
So I have the four ranges which define start and stop for boundaries.
Range 1: Start 10, end 20
Range 2: Start 32, end 35
Range 3: Start 50, end 66
Range 4: Start 90, end 90
The resulting matrix would look like this, where column 0 is the start boundary, column 1 the end boundary, and column 2 the summed values from the right column of matrix M between (and including) start and end.
[[10, 20, 7000], # 7000 = 1000+200+800+5000
[32, 35, 6500], # 6500 = 3000+3500
[50, 66, 14100], # 14100 = 5000+100+2000+3000+4000
[90, 90, 5000]] # 5000 = just 5000 as upper=lower boundary
I got stuck at the second step, after getting the true/false values for the range boundaries. How to create the ranges from the boolean values, and then how to add the values together within these ranges, is unclear to me. I would appreciate any suggestions. Also, I'm not sure about my approach; maybe there is a better way to get from the first to the last matrix, perhaps skipping a step?
EDIT
So, I came a bit further with the middle step, and I can now return the start and end values of the range:
start_diffs = np.diff(M[:,0]) > 10
start_indexes = np.insert(start_diffs, 0, True)
end_diffs = np.diff(M[:,0]) > 10
end_indexes = np.insert(end_diffs, -1, True)
start_values = M[:,0][start_indexes]
end_values = M[:,0][end_indexes]
print(np.array([start_values, end_values]).T)
Returns:
[[10 20]
[32 35]
[50 66]
[90 90]]
What is missing is somehow using these ranges now to calculate the sums from matrix M in the right column.
If you are open to using pandas, here's a solution that seems a bit over-thought in retrospect, but works:
# Initial array
M = np.array([[10, 1000],
[11, 200],
[15, 800],
[20, 5000],
[28, 100],
[32, 3000],
[35, 3500],
[38, 100],
[50, 5000],
[51, 100],
[55, 2000],
[58, 3000],
[66, 4000],
[90, 5000]])
# Build a DataFrame with default integer index and column labels
df = pd.DataFrame(M)
# Get a subset of rows that represent potential interval edges
subset = df[df[1] >= 1000].copy()
# If a row is the first row in a new range, flag it with 1.
# Then cumulatively sum these 1s. This labels each row with a
# unique integer, one per range
subset[2] = (subset[0].diff() > 10).astype(int).cumsum()
# Get the start and end values of each range
edges = subset.groupby(2).agg({0: ['first', 'last']})
edges
0
first last
2
0 10 20
1 32 35
2 50 66
3 90 90
# Build a pandas IntervalIndex out of these interval edges
tups = list(edges.itertuples(index=False, name=None))
idx = pd.IntervalIndex.from_tuples(tups, closed='both')
# Build a Series that maps each interval to a unique range number
mapping = pd.Series(range(len(idx)), index=idx)
# Apply this mapping to create a new column of the original df
df[2] = [mapping.loc[i] if idx.contains(i).any() else None for i in df[0]]
df
0 1 2
0 10 1000 0.0
1 11 200 0.0
2 15 800 0.0
3 20 5000 0.0
4 28 100 NaN
5 32 3000 1.0
6 35 3500 1.0
7 38 100 NaN
8 50 5000 2.0
9 51 100 2.0
10 55 2000 2.0
11 58 3000 2.0
12 66 4000 2.0
13 90 5000 3.0
# Group by this new column, get edges of each interval,
# sum values, and get the underlying numpy array
df.groupby(2).agg({0: ['first', 'last'], 1: 'sum'}).values
array([[ 10, 20, 7000],
[ 32, 35, 6500],
[ 50, 66, 14100],
[ 90, 90, 5000]])
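For comparison, here is a shorter NumPy-only sketch of the same approach, assuming M is the NumPy array defined above (not benchmarked against the pandas version):
# candidate boundary rows, plus a group id that increments whenever the gap > 10
cand = M[M[:, 1] >= 1000]
group_id = np.concatenate(([0], np.cumsum(np.diff(cand[:, 0]) > 10)))

rows = []
for g in np.unique(group_id):
    start, end = cand[group_id == g, 0][[0, -1]]      # range boundaries
    in_range = (M[:, 0] >= start) & (M[:, 0] <= end)  # rows of M inside the range
    rows.append([start, end, M[in_range, 1].sum()])
print(np.array(rows))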
I have the following dataframe:
table2 = pd.DataFrame({
'Product Type': ['A', 'B', 'C', 'D'],
'State_1_Value': [10, 11, 12, 13],
'State_2_Value': [20, 21, 22, 23],
'State_3_Value': [30, 31, 32, 33],
'State_4_Value': [40, 41, 42, 43],
'State_5_Value': [50, 51, 52, 53],
'State_6_Value': [60, 61, 62, 63],
'Lower_Bound': [-1, 1, .5, 5],
'Upper_Bound': [1, 2, .625, 15],
'sim_1': [0, 0, .61, 7],
'sim_2': [1, 1.5, .7, 9],
})
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2
0 40 50 60 1.000 0.0 1.0
1 41 51 61 2.000 0.0 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.0 9.0
And I wrote the following code to generate a new DataFrame with a modified output for each 'sim':
for i in range(1,3):
    table2['Bucket%s'%i] = 5 * (table2['sim_%s'%i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    table2['lv'] = table2['Bucket%s'%i].map(int)
    table2['hv'] = table2['Bucket%s'%i].map(int) + 1
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value'%row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value'%row['hv']], axis=1)
    table2['Final_value_%s'%i] = (table2['nHigher'] - table2['nLower'])*(table2['Bucket%s'%i]-table2['lv']) + table2['nLower']

df = table2.filter(regex="sim|Type")
Output:
>>> df
Product Type sim_1 sim_2
0 A 35.0 60.0
1 B -39.0 36.0
2 C 56.0 92.0
3 D 23.0 33.0
I want to run this on 10,000 sims, and currently each loop takes about .25 seconds. Is there any way to modify this code to avoid the loop and be more time efficient?
Edit: If you're curious what this code is trying to accomplish you can see my self-answered somewhat disorganized question here: Pandas DataFrame: Complex linear interpolation
I was able to accomplish this with no loops using the code below. As a result, on my 10k x 200 table it ran in 3 minutes instead of the previous 2 hours.
Unfortunately I now need to run it on a 10k x 4k table, and I hit a MemoryError on that one, but that may be out of the scope of this question.
df= pd.DataFrame({
'Product Type': ['A', 'B', 'C', 'D'],
'State_1_Value': [10, 11, 12, 13],
'State_2_Value': [20, 21, 22, 23],
'State_3_Value': [30, 31, 32, 33],
'State_4_Value': [40, 41, 42, 43],
'State_5_Value': [50, 51, 52, 53],
'State_6_Value': [60, 61, 62, 63],
'Lower_Bound': [-1, 1, .5, 5],
'Upper_Bound': [1, 2, .625, 15],
'sim_1': [0, 0, .61, 7],
'sim_2': [1, 1.5, .7, 9],
})
buckets = df.iloc[:, -2:].sub(df['Lower_Bound'],axis=0).div(df['Upper_Bound'].sub(df['Lower_Bound'],axis=0),axis=0) * 5 + 1
low = buckets.applymap(int)
high = buckets.applymap(int) + 1
low = low.applymap(lambda x: 1 if x < 1 else x)
low = low.applymap(lambda x: 5 if x > 5 else x)
high = high.applymap(lambda x: 6 if x > 6 else x)
high = high.applymap(lambda x: 2 if x < 2 else x)
low_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(low.shape[0])[:,None], low])
high_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(high.shape[0])[:,None], high])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
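A possible further simplification (a sketch under the same assumptions, not benchmarked): clip can bound the bucket indices in one pass each instead of four applymap calls.
# truncate to int, then clamp to the valid index ranges
low = buckets.astype(int).clip(lower=1, upper=5)
high = (buckets.astype(int) + 1).clip(lower=2, upper=6)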