Building a function that divides dataframe into groups


I am interested in creating a function that does the following:
accepts 2 parameters: a DataFrame and an integer.
adds a column to the DataFrame called "group",
giving each row an integer based on its integer location. The number of groups should equal the integer given to the function.
If the number of rows is not divisible by the given integer, the remaining rows should be split as evenly as possible between the groups. This is the part I'm having problems with.
Here is a manual example I made to clarify my intentions:
I would like to get from this DataFrame:
d = {'value': [1,2,3,4,5,6,7,8,9,10,11,12,13],}
df_init = pd.DataFrame(data=d)
By this function:
wanted_function(df_init, 5)
To this final DataFrame:
s = {'value': [1,2,3,4,5,6,7,8,9,10,11,12,13],'group':[1,1,1,2,2,2,3,3,3,4,4,5,5]}
df_final = pd.DataFrame(data=s)
If I can make the question any clearer, please tell me how and I'll fix it.

Use np.array_split
In [5481]: [i for i, x in enumerate(np.array_split(np.arange(len(df)), 5), 1) for _ in x]
Out[5481]: [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5]
Assign it to a new column
In [5487]: df['group'] = [i for i, x in enumerate(np.array_split(np.arange(len(df)), 5), 1) for _ in x]
In [5488]: df
Out[5488]:
value group
0 1 1
1 2 1
2 3 1
3 4 2
4 5 2
5 6 2
6 7 3
7 8 3
8 9 3
9 10 4
10 11 4
11 12 5
12 13 5
Details
Original df
In [5491]: df
Out[5491]:
value
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
The split itself
In [5492]: np.array_split(np.arange(len(df)), 5)
Out[5492]:
[array([0, 1, 2]),
array([3, 4, 5]),
array([6, 7, 8]),
array([ 9, 10]),
array([11, 12])]
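The one-liner above can be wrapped into the function the question asks for. The following is a minimal sketch, not the answerer's own code: the name `add_group_column` and the choice to return a copy rather than modify the input are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_group_column(df, n_groups):
    """Return a copy of df with a 1-based 'group' column of near-equal sizes.

    Hypothetical helper name; the logic follows the np.array_split answer.
    """
    # Split the row positions into n_groups chunks. np.array_split gives the
    # first len(df) % n_groups chunks one extra row each.
    chunks = np.array_split(np.arange(len(df)), n_groups)
    # Repeat each 1-based group label once per row in its chunk.
    labels = np.repeat(np.arange(1, n_groups + 1), [len(c) for c in chunks])
    return df.assign(group=labels)


df_init = pd.DataFrame({"value": list(range(1, 14))})
df_final = add_group_column(df_init, 5)
print(df_final["group"].tolist())
# [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5]
```

Using `np.repeat` instead of the nested list comprehension is only a readability choice; both produce the same labels.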

Related

Check that all columns are the same when doing pd.util.hash_pandas_object

I am developing an application that takes data frames as input.
An example of one of the many data frames looks like this:
df = pd.DataFrame({'store': ['Blank_A09', 'Control_4p','13_MEG3','04_GRB10','02_PLAGL1','Control_21q','01_PLAGL1','11_KCNQ10T1','16_SNRPN','09_H19','Control_6p','06_MEST'],
'quarter': [1, 1, 2, 2, 1, 1, 2, 2,2,2,2,2],
'employee': ['Blank_A09', 'Control_4p','13_MEG3','04_GRB10','02_PLAGL1','Control_21q','01_PLAGL1','11_KCNQ10T1','16_SNRPN','09_H19','Control_6p','06_MEST'],
'foo': [1, 1, 2, 2, 1, 1, 9, 2,2,4,2,2],
'columnX': ['Blank_A09', 'Control_4p','13_MEG3','04_GRB10','02_PLAGL1','Control_21q','01_PLAGL1','11_KCNQ10T1','16_SNRPN','09_H19','Control_6p','06_MEST']})
print(df)
store quarter employee foo columnX
0 Blank_A09 1 Blank_A09 1 Blank_A09
1 Control_4p 1 Control_4p 1 Control_4p
2 13_MEG3 2 13_MEG3 2 13_MEG3
3 04_GRB10 2 04_GRB10 2 04_GRB10
4 02_PLAGL1 1 02_PLAGL1 1 02_PLAGL1
5 Control_21q 1 Control_21q 1 Control_21q
6 01_PLAGL1 2 01_PLAGL1 9 01_PLAGL1
7 11_KCNQ10T1 2 11_KCNQ10T1 2 11_KCNQ10T1
8 16_SNRPN 2 16_SNRPN 2 16_SNRPN
9 09_H19 2 09_H19 4 09_H19
10 Control_6p 2 Control_6p 2 Control_6p
11 06_MEST 2 06_MEST 2 06_MEST
I need to check that the odd columns are the same. I do this:
# Select odd columns
df_odd = df.iloc[:,::2]
# Hash each of these columns (transposed so that every column becomes a row)
pd.util.hash_pandas_object(df_odd.T, index=False)
store 18266754969677227875
employee 18266754969677227875
columnX 18266754969677227875
dtype: uint64
How can I now check that these hashes are the same?

The hash depends on the order of the values, so the same values in a different order give a different hash.
To check that all odd columns are identical you can use:
pd.util.hash_pandas_object(df.iloc[:,::2].T, index=False).nunique() == 1
output: True
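As a sanity check, the hash-based test can be compared with a direct element-wise comparison. This is a small sketch on a shortened version of the question's data; the variable names `same_by_hash` and `same_by_value` are illustrative only.

```python
import pandas as pd

# Shortened version of the frame from the question (three rows only).
df = pd.DataFrame({
    "store": ["Blank_A09", "Control_4p", "13_MEG3"],
    "quarter": [1, 1, 2],
    "employee": ["Blank_A09", "Control_4p", "13_MEG3"],
    "foo": [1, 1, 2],
    "columnX": ["Blank_A09", "Control_4p", "13_MEG3"],
})

# Every second column, starting with the first.
df_odd = df.iloc[:, ::2]

# One hash per original column: transposing turns columns into rows.
hashes = pd.util.hash_pandas_object(df_odd.T, index=False)
same_by_hash = hashes.nunique() == 1

# Direct check: compare every column against the first one, row by row.
first = df_odd.iloc[:, 0]
same_by_value = df_odd.eq(first, axis=0).all().all()

print(same_by_hash, same_by_value)
```

The direct comparison avoids hashing altogether, which may be simpler when the columns fit comfortably in memory.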

Determine if Values are within range based on pandas DataFrame column

I am trying to determine whether a given value in a row of a DataFrame is within two other columns from a separate DataFrame, or if that estimate is zero.
import pandas as pd
df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7,8], [-2, 10, 11, 13], [5, 6, 8, 9]],
columns=['lo1', 'up1','lo2', 'up2'])
lo1 up1 lo2 up2
0 -1 2 1 3
1 4 6 7 8
2 -2 10 11 13
3 5 6 8 9
df2 = pd.DataFrame([[1, 3], [4, 6] , [5, 8], [10, 2,]],
columns=['pe1', 'pe2'])
pe1 pe2
0 1 3
1 4 6
2 5 8
3 10 2
To be more clear, is it possible to develop a for-loop or use a function that can look at pe1 and its corresponding values and determine if they are within lo1 and up1, if lo1 and up1 cross zero, and if pe1=0? I am having a hard time coding this in Python.
EDIT: I'd like the output to be something like:
m1 m2
0 0 3
1 4 0
2 0 0
3 0 0
Since the only pe values that fall within their corresponding lo and up columns are in the first row, second column, and in the second row, first column.

You can also concatenate the two dataframes along the horizontal axis and then use np.where. This behaves similarly to the where used by RJ Adriaansen.
import pandas as pd
import numpy as np
# Data
df1 = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7,8], [-2, 10, 11, 13], [5, 6, 8, 9]],
columns=['lo1', 'up1','lo2', 'up2'])
df2 = pd.DataFrame([[1, 3], [4, 6] , [5, 8], [10, 2,]],
columns=['pe1', 'pe2'])
# concatenate dfs
df = pd.concat([df1, df2], axis=1)
where now df looks like
lo1 up1 lo2 up2 pe1 pe2
0 -1 2 1 3 1 3
1 4 6 7 8 4 6
2 -2 10 11 13 5 8
3 5 6 8 9 10 2
Finally we use np.where and between
for k in [1, 2]:
    df[f"m{k}"] = np.where(
        (df[f"pe{k}"].between(df[f"lo{k}"], df[f"up{k}"]) &
         df[f"lo{k}"].gt(0)),
        df[f"pe{k}"],
        0)
and the result is
lo1 up1 lo2 up2 pe1 pe2 m1 m2
0 -1 2 1 3 1 3 0 3
1 4 6 7 8 4 6 4 0
2 -2 10 11 13 5 8 0 0
3 5 6 8 9 10 2 0 0

You can create a boolean mask for the required condition. For pe1 that would be:
value in lo1 is smaller or equal to pe1
value in up1 is larger or equal to pe1
value in lo1 is larger than 0
This would make this mask:
(df['lo1'] <= df2['pe1']) & (df['up1'] >= df2['pe1']) & (df['lo1'] > 0)
which returns:
0 False
1 True
2 False
3 False
dtype: bool
Now you can use where to keep the values that match True and replace those that don't with 0:
df2['pe1'] = df2['pe1'].where((df['lo1'] <= df2['pe1']) & (df['up1'] >= df2['pe1']) & (df['lo1'] > 0), other=0)
df2['pe2'] = df2['pe2'].where((df['lo2'] <= df2['pe2']) & (df['up2'] >= df2['pe2']) & (df['lo2'] > 0), other=0)
Result:
pe1 pe2
0 0 3
1 4 0
2 0 0
3 0 0
To loop all columns:
for i in df2.columns:
    # remove the first two characters to get the number, then use that number to match the columns in the other df
    nr = i[2:]
    df2[i] = df2[i].where((df[f'lo{nr}'] <= df2[i]) & (df[f'up{nr}'] >= df2[i]) & (df[f'lo{nr}'] > 0), other=0)
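The loop overwrites df2 in place. If the original estimates should be preserved, the same mask can be applied inside a function that returns a new frame. This is a sketch under assumptions: the name `mask_estimates` and the `m1`, `m2` output column names (taken from the question's desired output) are illustrative, and the `lo > 0` condition simply mirrors the answers above.

```python
import pandas as pd


def mask_estimates(bounds, estimates):
    """Return a new frame with estimates kept where lo <= pe <= up and lo > 0.

    Hypothetical helper; column names are assumed to follow the
    'lo<N>' / 'up<N>' / 'pe<N>' pattern from the question.
    """
    out = {}
    for col in estimates.columns:
        nr = col[2:]  # 'pe1' -> '1'
        lo, up, pe = bounds[f"lo{nr}"], bounds[f"up{nr}"], estimates[col]
        # between() is inclusive on both ends by default.
        keep = pe.between(lo, up) & lo.gt(0)
        out[f"m{nr}"] = pe.where(keep, other=0)
    return pd.DataFrame(out)


df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7, 8], [-2, 10, 11, 13], [5, 6, 8, 9]],
                  columns=["lo1", "up1", "lo2", "up2"])
df2 = pd.DataFrame([[1, 3], [4, 6], [5, 8], [10, 2]], columns=["pe1", "pe2"])

result = mask_estimates(df, df2)
print(result)
```

After the call, df2 still holds the original pe1 and pe2 values, and result holds m1 = [0, 4, 0, 0] and m2 = [3, 0, 0, 0].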

Python how to add two elements of a dataframe keeping the result before

I would like to add up the values in a column of my dataframe, keeping each intermediate result along the way.
To put it simply, I would like to compute:
df['col'][0]
df['col'][0] + df['col'][1]
df['col'][0] + df['col'][1] + df['col'][2]
df['col'][0] + df['col'][1] + df['col'][2] + df['col'][3]
.
.
.
df['col'][0] + ... + df['col'][n]
I would like to put each of these values in a list.
Could you help me?
Thank you so much.

You can use cumsum:
In [679]: df = pd.DataFrame({"A":[5, 3, 6, 4],
...: "B":[11, 2, 4, 3],
...: "C":[4, 3, 8, 5],
...: "D":[5, 4, 2, 8]})
In [680]: df
Out[680]:
A B C D
0 5 11 4 5
1 3 2 3 4
2 6 4 8 2
3 4 3 5 8
In [682]: df.A.cumsum(axis=0)
Out[682]:
0 5
1 8
2 14
3 18
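Since the question asks for the running totals as a list, the result of cumsum can be converted with tolist. A minimal sketch, assuming the column is named `col` as in the question and reusing the values of column A above:

```python
import pandas as pd

# 'col' is the column name used in the question; the values are column A above.
df = pd.DataFrame({"col": [5, 3, 6, 4]})

# cumsum keeps every intermediate sum; tolist turns the Series into a plain list.
running_totals = df["col"].cumsum().tolist()
print(running_totals)
# [5, 8, 14, 18]
```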

Ranking groups based on size

Sample Data:
id cluster
1 3
2 3
3 3
4 3
5 1
6 1
7 2
8 2
9 2
10 4
11 4
12 5
13 6
What I would like to do is replace the largest cluster id with 0 and the second largest with 1 and so on and so forth. Output would be as shown below.
id cluster
1 0
2 0
3 0
4 0
5 2
6 2
7 1
8 1
9 1
10 3
11 3
12 4
13 5
I'm not quite sure where to start with this. Any help would be much appreciated.

The objective is to relabel groups defined in the 'cluster' column by the corresponding rank of that group's total value count within the column. We'll break this down into several steps:
Integer factorization. Find an integer representation where each unique value in the column gets its own integer. We'll start with zero.
We then need the counts of each of these unique values.
We need to rank the unique values by their counts.
We assign the ranks back to the positions of the original column.
Approach 1
Using Numpy's numpy.unique + argsort
TL;DR
u, i, c = np.unique(
df.cluster.values,
return_inverse=True,
return_counts=True
)
(-c).argsort()[i]
Turns out, numpy.unique performs the task of integer factorization and counting values in one go. In the process, we get unique values as well, but we don't really need those. Also, the integer factorization isn't obvious. That's because per the numpy.unique function, the return value we're looking for is called the inverse. It's called the inverse because it was intended to act as a way to get back the original array given the array of unique values. So if we let
u, i, c = np.unique(
df.cluster.values,
return_inverse=True,
return_counts=True
)
You'll see i looks like:
array([2, 2, 2, 2, 0, 0, 1, 1, 1, 3, 3, 4, 5])
And if we did u[i] we get back the original df.cluster.values
array([3, 3, 3, 3, 1, 1, 2, 2, 2, 4, 4, 5, 6])
But we are going to use it as integer factorization.
Next, we need the counts c
array([2, 3, 4, 2, 1, 1])
I'm going to propose the use of argsort but it's confusing. So I'll try to show it:
np.vstack([c, (-c).argsort()])
array([[2, 3, 4, 2, 1, 1],
[2, 1, 0, 3, 4, 5]])
What argsort does in general is fill each spot of its result with the position to draw from in the originating array, so the top spot (position 0) holds the position of the best element.
# position 2
# is best
# |
# v
# array([[2, 3, 4, 2, 1, 1],
# [2, 1, 0, 3, 4, 5]])
# ^
# |
# top spot
# from
# position 2
# position 1
# goes to
# second spot
# |
# v
# array([[2, 3, 4, 2, 1, 1],
# [2, 1, 0, 3, 4, 5]])
# ^
# |
# second spot
# from
# position 1
What this allows us to do is to slice this argsort result with our integer factorization to arrive at a remapping of the ranks.
# i is
# [2 2 2 2 0 0 1 1 1 3 3 4 5]
# (-c).argsort() is
# [2 1 0 3 4 5]
# argsort
# slice
# \ / This is our integer factorization
# a i
# [[0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [0 2] <-- 0 is second position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [2 0] <-- 2 is zeroth position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [1 1] <-- 1 is first position in argsort
# [3 3] <-- 3 is third position in argsort
# [3 3] <-- 3 is third position in argsort
# [4 4] <-- 4 is fourth position in argsort
# [5 5]] <-- 5 is fifth position in argsort
We can then drop it into the column with pd.DataFrame.assign
u, i, c = np.unique(
df.cluster.values,
return_inverse=True,
return_counts=True
)
df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
Approach 2
I'm going to leverage the same concepts. However, I'll use pandas.factorize to get the integer factorization and numpy.bincount to count values. The reason to use this approach is that numpy.unique sorts the values while factorizing and counting, and pandas.factorize does not. For larger data sets this matters: factorizing and counting remain O(n), while the numpy.unique approach is O(n log n).
i, u = pd.factorize(df.cluster.values)
c = np.bincount(i)
df.assign(cluster=(-c).argsort()[i])
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3
10 11 3
11 12 4
12 13 5
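One caveat worth checking before reusing either approach: `(-c).argsort()` lists, for each rank, the position to draw from, while the relabeling needs, for each position, its rank. In the sample data the two coincide because that particular permutation is its own inverse. The sketch below uses made-up data (not from the question) where they differ, and applies argsort a second time to get the ranks; `kind="stable"` is an added assumption to make ties deterministic.

```python
import numpy as np
import pandas as pd

# Illustrative data: cluster sizes are 10 -> 2, 20 -> 1, 30 -> 3.
df = pd.DataFrame({"id": range(1, 7),
                   "cluster": [10, 10, 20, 30, 30, 30]})

u, i, c = np.unique(df["cluster"].to_numpy(),
                    return_inverse=True, return_counts=True)

# order[k] is the position (in u) of the k-th largest group.
order = (-c).argsort(kind="stable")
# ranks[p] is the rank of the group at position p: the inverse permutation.
ranks = order.argsort()

print(order[i].tolist())  # [2, 2, 0, 1, 1, 1]  largest group is not labeled 0
print(ranks[i].tolist())  # [1, 1, 2, 0, 0, 0]  largest group is labeled 0

relabeled = df.assign(cluster=ranks[i])
```

The same second argsort can be applied to the counts from pandas.factorize and numpy.bincount in Approach 2.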

You can use groupby, transform, and rank. Note that method='dense' gives clusters of equal size the same label:
df['cluster'] = df.groupby('cluster')['cluster'].transform('count')\
    .rank(ascending=False, method='dense')\
    .sub(1).astype(int)
Output:
id cluster
0 1 0
1 2 0
2 3 0
3 4 0
4 5 2
5 6 2
6 7 1
7 8 1
8 9 1
9 10 3

By using category and value_counts:
df.cluster.map((-df.cluster.value_counts()).astype('category').cat.codes)
Out[151]:
0 0
1 0
2 0
3 0
4 2
5 2
6 1
7 1
8 1
9 3
Name: cluster, dtype: int8

This isn't the cleanest solution but it does work. Feel free to suggest improvements:
valueCounts = df.groupby('cluster')['cluster'].count()
valueCounts_sorted = valueCounts.sort_values(ascending=False)
# Compare against a copy of the original labels so that rows which were
# already relabeled are not matched again by a later cluster id
original = df['cluster'].copy()
count = 0
for i in valueCounts_sorted.index.values:
    print(i)
    idx = original[original == i].index.values
    df.loc[idx, "cluster"] = count
    count += 1

numpy.random.randint does not return a list separated by commas

I am running this code:
import numpy as np
Z=np.ones(10)
I = np.random.randint(0,len(Z),20).
print I
#[9 0 0 1 0 2 3 4 3 3 2 2 7 8 1 9 9 2 1 7]
#so this instruction does not work
print Z[I]
It returns a list whose elements are not separated by commas, unlike what is shown here: randint

The output on that page shows the interpreter (or repr) output. Also, I changed it to randint and removed the period that would have thrown a syntax error.
import numpy as np
I = np.random.randint(0, 10, 10)
print(I) # => [9 4 2 7 6 3 4 5 6 2]
print(repr(I)) # => array([9, 4, 2, 7, 6, 3, 4, 5, 6, 2])
print(type(I)) # => <type 'numpy.ndarray'>
L = list(I)
print(L) # => [9, 4, 2, 7, 6, 3, 4, 5, 6, 2]
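If a plain Python list of plain Python integers is what is wanted, ndarray.tolist is a common alternative to list(I), which keeps NumPy scalar types as elements. A short sketch; the use of `np.random.default_rng` with seed 0 is an assumption made here only so the run is repeatable:

```python
import numpy as np

# Seeded generator so repeated runs give the same numbers (seed is arbitrary).
rng = np.random.default_rng(0)
I = rng.integers(0, 10, size=10)

# tolist() converts the array to a list of built-in Python ints,
# which prints with commas between the elements.
L = I.tolist()
print(L)

# The array itself still works as an index into another array.
Z = np.arange(10)
print(Z[I].tolist())
```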

Changing the randomint to randint works for me:
Z=np.arange(10)
I = np.random.randint(0,len(Z),20)
print I
#[9 0 0 1 0 2 3 4 3 3 2 2 7 8 1 9 9 2 1 7]
#so this instruction works for me
print Z[I]
# [3 9 6 6 7 7 7 3 7 5 5 2 1 1 5 7 1 0 7 4]
