Using pd.cut & pd.value_counts, then results as 2d array - python

Use case
I get random observations from a population.
Then I group them into bins using pd.cut.
Then I extract the frequencies with pd.value_counts.
I want to get the calculated interval labels and the frequency counts.
I want to 'glue' the labels column to the frequency-counts column to get a 2d array (with 2 columns and n interval rows).
I want to convert the 2d array to a list for COM interop.
I am close to the desired output, but I am a Python newbie, so my label-building code can surely be tidied up.
The constraint here is that the final output needs to be a list so it can be marshalled via the COM interop layer to Excel VBA.
import numpy as np
import pandas as pd
from scipy.stats import skewnorm

pop = skewnorm.rvs(0, size=20)
bins = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5]
bins2 = np.array(bins)
bins3 = pd.cut(pop, bins2)
bins4 = [0] * (bins2.size - 1)
# print my own labels, doh!
idx = 0
for binLoop in bins3.categories:
    intervalAsString = "(" + str(binLoop.left) + "," + str(binLoop.right) + "]"
    print(intervalAsString)
    bins4[idx] = intervalAsString
    idx = idx + 1
table = pd.value_counts(bins3, sort=False)
joined = np.vstack((bins4, table.tolist()))
print(joined)
Target output: a 2d array convertible to a list
| (-5, -4] | 0 |
| (-4, -3] | 0 |
| (-3, -2] | 0 |
| (-2, -1] | 1 |
| (-1, 0] | 3 |
| (0, 1] | 9 |
| (1, 2] | 4 |
| (2, 3] | 2 |
| (3, 4] | 1 |
| (4, 5] | 0 |

If I understand you correctly, the following should do what you are after:
pop = skewnorm.rvs(0, size=20)
bins = range(-5, 6)  # 11 edges -> 10 intervals, (-5, -4] through (4, 5]
binned = pd.cut(pop, bins)
# create the histogram data; sort=False keeps the bins in interval order
hist = pd.value_counts(binned, sort=False)
# hist is a pandas Series with a categorical index describing the bins
# `index.astype(str)` will convert the categories to strings.
hist.index = hist.index.astype(str)
# `.reset_index()` will turn the index into an ordinary column
# `.values` gives you the underlying numpy array
# `.tolist()` converts the numpy array to a native python list o' lists.
print(hist.reset_index().values.tolist())
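For completeness, here is a minimal alternative sketch (not from the original answer) that 'glues' each string label to its count directly; the counts will of course vary with the random sample:

import numpy as np
import pandas as pd
from scipy.stats import skewnorm

pop = skewnorm.rvs(0, size=20)
binned = pd.cut(pop, np.arange(-5, 6))        # bins (-5, -4] through (4, 5]
counts = pd.value_counts(binned, sort=False)  # keep interval order
# glue each interval label to its count: a list of [label, count] rows
rows = [[str(interval), int(count)] for interval, count in counts.items()]
print(rows)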

Related

How to insert an array into another one using slices

I want to create a large NumPy array (L) to hold the result of some operations. However, I can only compute one part of L at a time. Then, to have L, I need to create an array of zeros of shape L.shape, and fill it using these parts or subarrays. I'm currently able to do it, but in a very inefficient way.
If the shape of L is (x, y, z, a, b, c), then I'm creating a NumPy arrays of shape (x, y, z, 1, b, c) which correspond to the different parts of L, from part 0 to part a-1. I'm forced to create arrays of this particular shape due to the operations involved.
In order to fill the array of zeros, I'm creating one Pandas DataFrame per subarray (or part). Each dataframe contains the indices and the values of one subarray of shape (x, y, z, 1, b, c), like this:
index0 | index1 | index2 | index3 | index4 | index5 | value
------------------------------------------------------------
0 | 0 | 0 | 0 | 0 | 0 | 434.2
0 | 0 | 0 | 0 | 0 | 1 | 234.5
..., and so on.
Because of the shape (x, y, z, 1, b, c), index3 can only contain zeros. So, there's a change to make before the values can be inserted at the right index of L: the column at index3 will contain only 0s for the first subarray, only 1s for the second subarray, etc. So, df['index3'] = subarray_number, where subarray_number goes from 0 to a-1. Only the column at index3 is changed.
So, the fifth subarray represented as a dataframe would look like this:
index0 | index1 | index2 | index3 | index4 | index5 | value
------------------------------------------------------------
0 | 0 | 0 | 4 | 0 | 0 | 434.2
0 | 0 | 0 | 4 | 0 | 1 | 234.5
...
x-1 | y-1 | z-1 | 4 | b-1 | c-1 | 371.8
After this, I only have to iterate over the rows of each of the dataframes using iterrows, and assign the values to the corresponding indices of the array of zeros, like this:
for subarray_df in subarrays_dfs:
    for i, row in subarray_df.iterrows():
        index0, index1, index2, index3, index4, index5, value = row
        L[index0][index1][index2][index3][index4][index5] = value
The problem is that converting the arrays to dataframes and then assigning the values one by one is expensive, especially for large arrays. I would like to insert the subarrays in L directly without having to go through this intermediate step.
I tried using slices but the generated array is not the one I expect. This is what I'm doing:
L[:subarray.shape[0], :subarray.shape[1], :subarray.shape[2],
  subarray_number, :subarray.shape[4], :subarray.shape[5]] = subarray
What would be the right way of using slices to fill L the way I need?
Thanks!
Your example is not very clear, but maybe you can adapt something from this code snippet. It looks to me like you are generating your L array, of shape (x, y, z, a, b, c), by computing a slices, each of shape (x, y, z, b, c), equivalent to (x, y, z, 1, b, c). Let me know if I am completely wrong.
import numpy as np

L = np.zeros((10, 10, 10, 2, 10, 10))  # shape (x, y, z, a, b, c)

def compute():
    return np.random.rand(10, 10, 10, 10, 10)  # shape (x, y, z, b, c)

for k in range(L.shape[3]):
    L[:, :, :, k, :, :] = compute()  # select slice of shape (x, y, z, b, c)
Basically, it computes one slice of the array (one part) at a time and places it at the desired location.
One thing to note: an array of shape (x, y, z, a, b, c) can quickly get out of hand. For instance, I naively tried L = np.zeros((100, 100, 100, 5, 100, 100)), resulting in a ~373 GiB RAM allocation. Depending on the size of your data, maybe you could work on one slice at a time and store the others to disk while they are not in use?
Following my comment, here is a snippet to illustrate the dimension issue:
import numpy as np
L = np.zeros((10, 10, 10))
L.shape # (10, 10, 10)
L[:, 0, :].shape # (10, 10)
L[:, 0:3, :].shape # (10, 3, 10)
The slice : selects everything along an axis, the slice x:y selects everything from x to y, and indexing with a specific integer k selects only that 'line'/'column' (to use a 2D analogy), thus returning an array of dimension n-1. In 2D, a line or column would be 1D.
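If the parts really have shape (x, y, z, 1, b, c) as in the question, here is a small sketch (toy sizes assumed, not part of the original answer) of how to drop the singleton axis and place each part with a single slice assignment:

import numpy as np

x, y, z, a, b, c = 4, 3, 2, 5, 6, 7  # small toy sizes (assumed)
L = np.zeros((x, y, z, a, b, c))

for subarray_number in range(a):
    subarray = np.random.rand(x, y, z, 1, b, c)  # one part, with a length-1 axis 3
    # index the length-1 axis away so the shapes match, then place the part
    L[:, :, :, subarray_number, :, :] = subarray[:, :, :, 0, :, :]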

Converting nested dictionary to dataframe with the keys as rownames and the dictionaries in the values as columns?

I have a dataframe that consists of a large number of frequency counts, where the column labels are the features being counted and the row labels are the pages in which the features are counted. I need to find the probability of each feature occurring across all pages, so I'm trying (unsuccessfully) to iterate through each column, divide that column's sum by the sum of all columns, and save the result in a dictionary keyed by the column label.
My dataframe looks something like this:
    | Word1 | Word2 |
----|-------|-------|
pg1 |   0   |   1   |
pg2 |   3   |   2   |
pg3 |   9   |   0   |
pg4 |   1   |   6   |
pg5 |   2   |   3   |
pg6 |   0   |   2   |
And I want my output to be a dictionary with the words as the keys and the sum(column) / sum(table) as the values, like this:
{ Word1: .517 , Word2: .483 }
So far I've attempted the following:
dict = {}
for x in df.sum(axis = 0):
    dict[x] = x / sum(df.sum(axis = 0))
print(dict)
but the command never completes. I'm not sure whether I've done something wrong in my code or whether my laptop simply can't handle the size of my dataset.
Can anyone point me in the right direction?
It looks like you can take the sum of each column and then divide by the sum over the entire underlying array of the DF, e.g.:
df.sum().div(df.values.sum()).to_dict()
That'll give you:
{'Word1': 0.5172413793103449, 'Word2': 0.4827586206896552}
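As a quick check, here is a sketch that rebuilds the small frame from the question and applies the one-liner (the column values are taken from the table above):

import pandas as pd

df = pd.DataFrame({'Word1': [0, 3, 9, 1, 2, 0],
                   'Word2': [1, 2, 0, 6, 3, 2]},
                  index=['pg1', 'pg2', 'pg3', 'pg4', 'pg5', 'pg6'])

# column sums divided by the grand total, returned as a plain dict
print(df.sum().div(df.values.sum()).to_dict())
# {'Word1': 0.5172413793103449, 'Word2': 0.4827586206896552}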

Efficiently apply a function to every possible pair from a Pandas Series

I have an indexed Pandas Series with 20k entries. Each entry is an array of strings.
id | value
0 | ['abc', 'abc', 'def']
1 | ['bac', 'c', 'def', 'a']
2 | ...
...|
20k| ['aaa', 'rzt']
I want to compare each entry (a list of strings) with every other entry of the series. I have a complex comparison function which takes two lists of strings and returns a float.
The result should be a matrix.
id  |  0  |  1  |  2  | ... | 20k
0   |  1  | 0.5 | 0.4 |     |
1   | 0.5 |  1  | 0.2 |     |
2   | 0.4 | 0.2 |  1  |     |
... |     |     |     |     |
20k |     |     |     |     |
A double loop computing the result of every matrix element takes my computer more than 3 hours.
How can I efficiently apply/parallelise my comparison function? I tried broadcasting using numpy arrays, without success (no speedup).
values = df['value'].values
broadcasted = np.broadcast(values, values[:,None])
result = np.empty(broadcasted.shape)
result.flat = [compare_function(u,v) for (u,v) in broadcasted]
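One observation, offered as a sketch rather than a full answer: the example matrix is symmetric with 1 on the diagonal, so if compare_function(u, v) == compare_function(v, u), each unordered pair only needs to be evaluated once, which halves the work before any parallelisation. The list data and the overlap-style placeholder metric below are made up for illustration:

from itertools import combinations
import numpy as np

# toy stand-ins for the real Series and comparison function (assumed)
values = [['abc', 'abc', 'def'], ['bac', 'c', 'def', 'a'], ['aaa', 'rzt']]

def compare_function(u, v):
    # placeholder metric: overlap of the two string lists
    return len(set(u) & set(v)) / len(set(u) | set(v))

n = len(values)
result = np.ones((n, n))                # diagonal assumed to be 1
for i, j in combinations(range(n), 2):  # each unordered pair exactly once
    result[i, j] = result[j, i] = compare_function(values[i], values[j])
print(result)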

What is the fastest way to conditionally change the values of a dataframe in every index and column?

Is there a way to reduce, by a constant number, each element of a dataframe that satisfies a condition on its own value, without using a loop?
For instance, each cell < 2 has its value reduced by 1.
Thank you very much.
I like to do this with masking.
Here is an inefficient loop using your example:
# Example using a loop
for val in df['column']:
    if val < 2:
        val = val - 1
The following code gives the same result, but it will generally be much faster because it does not use a loop.
# Same effect using masks
mask = (df['column'] < 2) #Find entries that are less than 2.
df.loc[mask,'column'] = df.loc[mask,'column'] - 1 #Subtract 1.
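Since the question mentions every index and column, here is a whole-frame variant of the same masking idea (a sketch, not part of the original answer) that lowers every cell below 2 by 1:

import pandas as pd

df = pd.DataFrame({'x': [1, 2, 4], 'y': [2, 2, 4], 'z': [3, 2, 4]})  # toy data (assumed)

mask = df < 2            # boolean frame, True wherever a cell is below 2
df[mask] = df[mask] - 1  # subtract 1 only at those positions
print(df)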
I am not sure if this is the fastest, but you can use the .apply function:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.array([[1, 2, 3], [2, 2, 2], [4, 4, 4]]),
                  columns=['x', 'y', 'z'])

def conditional_add(x):
    if x > 2:
        return x + 2
    else:
        return x

df['x'] = df['x'].apply(conditional_add)
This will add 2 to the final row of column x.
More like this (data from Willie's answer):
df - ((df < 2) * 2)
Out[727]:
   x  y  z
0 -1  2  3
1  2  2  2
2  4  4  4
In this case I would use the np.where method from the NumPy library.
The method uses the following logic:
np.where(<condition>, <value if true>, <value if false>)
Example:
# import modules which are needed
import pandas as pd
import numpy as np
# create example dataframe
df = pd.DataFrame({'A':[3,1,5,0.5,2,0.2]})
| A |
|-----|
| 3 |
| 1 |
| 5 |
| 0.5 |
| 2 |
| 0.2 |
# apply the np.where method with conditional statement
df['A'] = np.where(df.A < 2, df.A - 1, df.A)
| A |
|------|
| 3 |
| 0.0 |
| 5 |
| -0.5 |
| 2 |
| -0.8 |

Pandas: Apply function over each pair of columns under constraints

As the title says, I'm trying to apply a function over each pair of columns of a dataframe under some conditions. I'm going to try to illustrate this. My df is of the form:
Code | 14 | 17 | 19 | ...
w1 | 0 | 5 | 3 | ...
w2 | 2 | 5 | 4 | ...
w3 | 0 | 0 | 5 | ...
The Code corresponds to a particular location in a rectangular grid, and the ws are different words. I would like to apply the cosine similarity measure between each pair of columns, but only (EDITED!) if the sum of the items in one of the columns of the pair is greater than 5.
The desired output would be something like:
    | [14,17]   | [14,19]   | [14,...]   | [17,19]   | ...
Sim | cs(14,17) | cs(14,19) | cs(14,...) | cs(17,19) | ...
cs is the result of the cosine similarity for each pair of columns.
Is there any suitable method to do this?
Any help would be appreciated :-)
To apply the cosine metric to each pair from two collections of inputs, you
could use scipy.spatial.distance.cdist. This will be much much faster than
using a double Python loop.
Let one collection be all the columns of df. Let the other collection be only those columns where the sum is greater than 5:
import pandas as pd
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
Then all the cosine distances (which equal 1 minus the cosine similarity) can be computed with one call to cdist:
import scipy.spatial.distance as SSD
values = SSD.cdist(df2.T, df.T, metric='cosine')
# array([[ 2.92893219e-01, 1.11022302e-16, 3.00000000e-01],
# [ 4.34314575e-01, 3.00000000e-01, 1.11022302e-16]])
The values can be wrapped in a new DataFrame and reshaped:
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
import pandas as pd
import scipy.spatial.distance as SSD
df = pd.DataFrame({'14':[0,2,0], '17':[5,5,0], '19':[3,4,5]})
mask = df.sum(axis=0) > 5
df2 = df.loc[:, mask]
values = SSD.cdist(df2.T, df.T, metric='cosine')
result = pd.DataFrame(values, columns=df.columns, index=df2.columns)
result = result.stack()
mask = result.index.get_level_values(0) != result.index.get_level_values(1)
result = result.loc[mask]
print(result)
yields the Series
17  14    0.292893
    19    0.300000
19  14    0.434315
    17    0.300000
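If actual similarities rather than distances are wanted, a small follow-on sketch (not part of the original answer), continuing from the snippet above:

# cosine similarity = 1 - cosine distance
similarity = 1 - result
print(similarity)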
