I am trying to replicate a specific method of assigning records to deciles, and the pandas.qcut() function does a good job. My only concern is that there doesn't seem to be a way to direct the uneven leftover records to specific bins, as required by the method I am trying to replicate.
This is my example:
import numpy as np
import pandas as pd

num = np.random.rand(153, 1)
my_list = map(lambda x: x[0], num)
ser = pd.Series(my_list)
bins = pd.qcut(ser, 10, labels=False)
bins.value_counts()
Which outputs:
9 16
4 16
0 16
8 15
7 15
6 15
5 15
3 15
2 15
1 15
There are seven bins with 15 records and three with 16. What I would like to do is specify which bins should receive the 16 records, marked with < below:
9 16 <
4 16
0 16
8 15
7 15
6 15
5 15 <
3 15
2 15 <
1 15
Is this possible using pd.qcut?
As there was no answer, and after asking a few people it didn't seem possible, I have cobbled together a function that does this:
def defined_qcut(df, value_series, number_of_bins, bins_for_extras, labels=False):
    if max(bins_for_extras) > number_of_bins or any(x < 0 for x in bins_for_extras):
        raise ValueError('Attempted to allocate to a bin that doesnt exist')
    base_number, number_of_values_to_allocate = divmod(df[value_series].count(), number_of_bins)
    bins_for_extras = bins_for_extras[:number_of_values_to_allocate]
    if number_of_values_to_allocate == 0:
        df['bins'] = pd.qcut(df[value_series], number_of_bins, labels=labels)
        return df
    elif number_of_values_to_allocate > len(bins_for_extras):
        raise ValueError('There are more values to allocate than the list provided, please select more bins')
    bins = {}
    for i in range(number_of_bins):
        number_of_values_in_bin = base_number
        if i in bins_for_extras:
            number_of_values_in_bin += 1
        bins[i] = number_of_values_in_bin
    df1 = df.copy()
    df1['rank'] = df1[value_series].rank()
    df1 = df1.sort_values(by=['rank'])
    df1['bins'] = 0
    row_to_start_allocate = 0
    row_to_end_allocate = 0
    for bin_number, number_in_bin in bins.items():
        row_to_end_allocate += number_in_bin
        bins.update({bin_number: [number_in_bin, row_to_start_allocate, row_to_end_allocate]})
        row_to_start_allocate = row_to_end_allocate
    conditions = [df1['rank'].iloc[v[1]: v[2]] for k, v in bins.items()]
    series_to_add = pd.Series()
    for idx, series in enumerate(conditions):
        series[series > -1] = idx
        series_to_add = series_to_add.append(series)
    df1['bins'] = series_to_add
    df1 = df1.reset_index()
    return df1
It ain't pretty, but it does the job. You pass in the dataframe, the name of the column with the values, and an ordered list of the bins where any extra values should be allocated. I'd happily take some advice as to how to improve this code.
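For reference, a hypothetical usage sketch mirroring the question's setup: 153 records into 10 bins, with the three leftover records directed to bins 9, 5 and 2 as marked above (note the function relies on Series.append, so it needs an older pandas, before 2.0):
df = pd.DataFrame({'value': np.random.rand(153)})
result = defined_qcut(df, 'value', 10, [9, 5, 2])
# bins 9, 5 and 2 now hold 16 records each, the remaining bins hold 15
print(result['bins'].value_counts())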
I want to replicate the data from the same dataframe when a certain condition is fulfilled.
Dataframe:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
I want to replicate rows while looping through the dataframe whenever there is a difference greater than 4 between successive row.Hour values.
Expected Output:
Hour,Wage
1,15
2,17
4,20
10,25
15,26
16,30
17,40
19,15
2,17
4,20
I want to replicate the rows when, iterating through all the rows, there is a difference greater than 4 in row.hour.
row.hour[0] = 1 and row.hour[1] = 2: here the difference is 1. But between row.hour[2] = 4 and row.hour[3] = 10 the difference is 6, which is greater than 4. I want to replicate the data above the index where this condition (greater than 4) is fulfilled.
I can replicate the data with df = pd.concat([df]*2, ignore_index=False), but it does not replicate when I run it with an if statement.
I tried the code below, but nothing happens.
for i in range(0, len(df)-1):
    if (df.iloc[i,0] - df.iloc[i+1,0]) > 4:
        df = pd.concat([df]*2, ignore_index=False)
My understanding is: you want to compare 'Hour' values for two successive rows.
If the difference is > 4, you want to add the previous row to the DF.
If that is what you want, try this:
Create a DF:
j = pd.DataFrame({'Hour': [1, 2, 4, 10, 15, 16, 17, 19],
                  'Wage': [15, 17, 20, 25, 26, 30, 40, 15]})
Define a function:
def f1(d):
    dn = d.copy()
    for x in range(len(d)-2):
        if abs(d.iloc[x+1].Hour - d.iloc[x+2].Hour) > 4:
            idx = x + 0.5
            dn.loc[idx] = d.iloc[x]['Hour'], d.iloc[x]['Wage']
    dn = dn.sort_index().reset_index(drop=True)
    return dn
Call the function passing your DF:
nd = f1(j)
Hour Wage
0 1 15
1 2 17
2 2 17
3 4 20
4 4 20
5 10 25
6 15 26
7 16 30
8 17 40
9 19 15
In the line
if df.iloc[i,0] - df.iloc[i+1,0] > 4
you calculate 4 - 10 instead of 10 - 4, so you check -6 > 4 instead of 6 > 4.
You have to swap the operands:
if df.iloc[i+1,0] - df.iloc[i,0] > 4
or use abs() if you want to replicate in both situations, > 4 and < -4:
if abs(df.iloc[i+1,0] - df.iloc[i,0]) > 4
If you used print(df.iloc[i,0] - df.iloc[i+1,0]) (or a debugger), you would see it.
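A quick check with the values from the question's data makes the problem visible:
print(4 - 10)            # -6, and (-6 > 4) is False, so the branch never runs
print(10 - 4)            #  6, and (6 > 4) is True
print(abs(4 - 10) > 4)   # True; abs() catches both directions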
I have the following working code that sets 1 in "new_col" at the locations covered by the intervals given by starts and ends.
import pandas as pd
import numpy as np
df = pd.DataFrame({"a": np.arange(10)})
starts = [1, 5, 8]
ends = [1, 6, 10]
value = 1
df["new_col"] = 0
for s, e in zip(starts, ends):
    df.loc[s:e, "new_col"] = value
print(df)
a new_col
0 0 0
1 1 1
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 0
8 8 1
9 9 1
I want these intervals to come from another dataframe pointer_df.
How to vectorize this?
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
Attempt:
df.loc[pointer_df["starts"]:pointer_df["ends"], "new_col"] = 2
print(df)
obviously doesn't work and gives
raise AssertionError("Start slice bound is non-scalar")
AssertionError: Start slice bound is non-scalar
EDIT:
it seems all answers use some kind of Pythonic for loop.
The question was how to vectorize the operation above.
Is this not doable without for loops/list comprehensions?
You could do:
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
rang = np.arange(len(df))
indices = [i for s, e in pointer_df.to_numpy() for i in rang[slice(s, e + 1, None)]]
df.loc[indices, 'new_col'] = value
print(df)
Output
a new_col
0 0 0
1 1 1
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 0
8 8 1
9 9 1
If you want a method that does not use any for loop or list comprehension and relies only on numpy, you could do:
def indices(start, end, ma=10):
    limits = end + 1
    lens = np.where(limits < ma, limits, end) - start
    np.cumsum(lens, out=lens)
    i = np.ones(lens[-1], dtype=int)
    i[0] = start[0]
    i[lens[:-1]] += start[1:]
    i[lens[:-1]] -= limits[:-1]
    np.cumsum(i, out=i)
    return i
pointer_df = pd.DataFrame({"starts": starts, "ends": ends})
df.loc[indices(pointer_df.starts.values, pointer_df.ends.values, ma=len(df)), "new_col"] = value
print(df)
I adapted the method to your use case from the one in this answer.
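To see what indices produces, a small worked example with the question's data (output shown as a comment):
# starts = [1, 5, 8], ends = [1, 6, 10], capped at ma = 10:
print(indices(np.array([1, 5, 8]), np.array([1, 6, 10]), ma=10))
# [1 5 6 8 9]  -> the concatenated ranges [1], [5, 6], [8, 9]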
for i, j in zip(pointer_df["starts"], pointer_df["ends"]):
    print(i, j)
Apply the same method, but with the bounds drawn from pointer_df rather than the bare lists; a minimal sketch follows.
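For instance, reusing the assignment from the question's working code:
for s, e in zip(pointer_df["starts"], pointer_df["ends"]):
    df.loc[s:e, "new_col"] = value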
from itertools import product
import pandas as pd
df = pd.DataFrame.from_records(product(range(10), range(10)))
df = df.sample(90)
df.columns = "c1 c2".split()
df = df.sort_values(df.columns.tolist()).reset_index(drop=True)
# c1 c2
# 0 0 0
# 1 0 1
# 2 0 2
# 3 0 3
# 4 0 4
# .. .. ..
# 85 9 4
# 86 9 5
# 87 9 7
# 88 9 8
# 89 9 9
#
# [90 rows x 2 columns]
How do I quickly find, identify, and remove the last duplicate of all symmetric pairs in this data frame?
An example of symmetric pair is that '(0, 1)' is equal to '(1, 0)'. The latter should be removed.
The algorithm must be fast, so it is recommended to use numpy. Converting to Python objects is not allowed.
You can sort the values, then groupby:
a = np.sort(df.to_numpy(), axis=1)
df.groupby([a[:, 0], a[:, 1]], as_index=False, sort=False).first()
Option 2: If you have a lot of pairs c1, c2, groupby can be slow. In that case, we can assign new values and filter by drop_duplicates:
a = np.sort(df.to_numpy(), axis=1)
(df.assign(one=a[:, 0], two=a[:, 1])   # one and two can be changed
   .drop_duplicates(['one', 'two'])    # taken from above
   .reindex(df.columns, axis=1)
)
One way is using np.unique with return_index=True and use the result to index the dataframe:
a = np.sort(df.values)
_, ix = np.unique(a, return_index=True, axis=0)
print(df.iloc[ix, :])
c1 c2
0 0 0
1 0 1
20 2 0
3 0 3
40 4 0
50 5 0
6 0 6
70 7 0
8 0 8
9 0 9
11 1 1
21 2 1
13 1 3
41 4 1
51 5 1
16 1 6
71 7 1
...
frozenset
mask = pd.Series(map(frozenset, zip(df.c1, df.c2))).duplicated()
df[~mask]
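This works because frozensets compare equal regardless of element order:
# Order-insensitive equality is what collapses the symmetric pairs:
print(frozenset((0, 1)) == frozenset((1, 0)))  # True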
I will do
df[~pd.DataFrame(np.sort(df.values,1)).duplicated().values]
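The same one-liner, unpacked step by step (the variable names here are mine):
sorted_pairs = np.sort(df.values, 1)   # (1, 0) becomes (0, 1), collapsing symmetric pairs
dup_mask = pd.DataFrame(sorted_pairs).duplicated().values
result = df[~dup_mask]                 # keeps the first occurrence of each pair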
Using pandas crosstab and numpy's triu:
s = pd.crosstab(df.c1, df.c2)
s = s.mask(np.triu(np.ones(s.shape)).astype(bool) & s == 0).stack().reset_index()
Here's one NumPy based one for integers -
def remove_symm_pairs(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)
    idx = np.ravel_multi_index(b.T, (b.max(0)+1))
    sidx = idx.argsort(kind='mergesort')
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]
    a_out = a[np.sort(sidx[m])]
    df_out = pd.DataFrame(a_out)
    return df_out
If you want to keep the index data as it is, use return df.iloc[np.sort(sidx[m])].
For generic numbers (ints/floats, etc.), we will use a view-based one -
# https://stackoverflow.com/a/44999009/ #Divakar
def view1D(a):  # a is array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()
and simply replace the step to get idx with idx = view1D(b) in remove_symm_pairs.
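Combining the two snippets, a sketch of the generic variant could look like this (the _generic name is mine):
def remove_symm_pairs_generic(df):
    a = df.to_numpy(copy=False)
    b = np.sort(a, axis=1)
    idx = view1D(b)                        # view-based id replaces np.ravel_multi_index
    sidx = idx.argsort(kind='mergesort')   # stable sort keeps first occurrences first
    p = idx[sidx]
    m = np.r_[True, p[:-1] != p[1:]]       # True at the first row of each group
    return pd.DataFrame(a[np.sort(sidx[m])])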
If this needs to be fast, and if your values are integers, then the following trick may help: let v, w be the two columns; construct [v+w, np.abs(v-w)] =: [x, y]; then sort this matrix lexicographically, remove duplicates, and finally map it back to [v, w] = [(x+y), (x-y)]/2.
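A minimal numpy sketch of that trick, assuming integer columns c1 and c2 as in this question:
v = df['c1'].to_numpy()
w = df['c2'].to_numpy()
xy = np.column_stack((v + w, np.abs(v - w)))      # symmetric pairs share the same (x, y)
_, ix = np.unique(xy, axis=0, return_index=True)  # first occurrence of each code
deduped = df.iloc[np.sort(ix)]
# if needed, recover the sorted pair: (x + y) // 2 is max(v, w), (x - y) // 2 is min(v, w)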
I have a dataframe with 2 columns, Column_A and Column_B, and an array of alphabets from A to P, as follows:
df = pd.DataFrame({
    'Column_A': [0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
})
df['Column_B'] = None  # Column_B starts empty and is filled below
the array is as follows:
label = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P']
Expected output is
'A':[0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
'B':['A','A','A','A','A','E','E','E','E','E','I','I','I','I','I','M']
The value in Column_B changes as soon as the value in Column_A is 1, and the new value is taken from the given array label.
I have tried using this for loop
for row in df.index:
    try:
        if df.loc[row, 'Column_A'] == 1:
            df.at[row, 'Column_B'] = label[row+4]
            print(label[row])
        else:
            df.ColumnB.fillna('ffill')
    except IndexError:
        row = (row+4) % 4
        df.at[row, 'Coumn_B'] = label[row]
I also want to loop back to the start if it reaches the last value in the label array.
A solution that should do the trick looks like this:
label = list('ABCDEFGHIJKLMNOP')
df = pd.DataFrame({
    'Column_A': [0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
    'Column_B': label
})
I'm not exactly sure what you intended with the fillna; I don't think you need it.
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row in df.index:
    if df.loc[row, 'Column_A'] == 1:
        lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
        df.at[row, 'Column_B'] = label[lookup]
        print(label[row])
I also avoid the exception handling in this case, because the "index overflow" can be handled without exception handling.
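For example, with max_index = 16 (the length of label), the conditional expression wraps instead of raising an IndexError:
lookup = 12
lookup = lookup + 4 if lookup + 4 < 16 else lookup % 4  # 16 < 16 is False, so wrap: 12 % 4
print(lookup)  # 0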
Btw. if you have a large dataframe, you can probably make the code faster by eliminating one lookup (but you'd need to verify whether it really runs faster). The solution would then look like this:
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row, record in df.iterrows():
    if record['Column_A'] == 1:
        lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
        df.at[row, 'Column_B'] = label[lookup]
        print(label[row])
Option 1
cond1 = df.Column_A == 1
cond2 = df.index == 0
mappr = lambda x: label[x]
df.assign(Column_B=np.where(cond1 | cond2, df.index.map(mappr), np.nan)).ffill()
Column_A Column_B
0 0 A
1 0 A
2 0 A
3 0 A
4 0 A
5 1 F
6 0 F
7 0 F
8 0 F
9 0 F
10 1 K
11 0 K
12 0 K
13 0 K
14 0 K
15 1 P
Option 2
a = np.append(0, np.flatnonzero(df.Column_A))
b = df.Column_A.to_numpy().cumsum()
c = np.array(label)
df.assign(Column_B=c[a[b]])
Column_A Column_B
0 0 A
1 0 A
2 0 A
3 0 A
4 0 A
5 1 F
6 0 F
7 0 F
8 0 F
9 0 F
10 1 K
11 0 K
12 0 K
13 0 K
14 0 K
15 1 P
Using groupby with transform then map
df.reset_index().groupby(df.Column_A.eq(1).cumsum())['index'].transform('first').map(dict(enumerate(label)))
Out[139]:
0 A
1 A
2 A
3 A
4 A
5 F
6 F
7 F
8 F
9 F
10 K
11 K
12 K
13 K
14 K
15 P
Name: index, dtype: object
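A step-by-step view of that one-liner, using the same df and label (a sketch; the intermediate names are mine):
groups = df.Column_A.eq(1).cumsum()                 # 0 for rows 0-4, then 1, 2, 3
first_idx = df.reset_index().groupby(groups)['index'].transform('first')
result = first_idx.map(dict(enumerate(label)))      # 0 -> 'A', 5 -> 'F', 10 -> 'K', 15 -> 'P'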
Given a set or a list (assume it's ordered):
myset = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
I want to find out how many numbers appear in each range.
Say my range is 10. Then, given the list above, I have two sets of 10.
I want the function to return [10, 10].
If my range was 15, then I should get [15, 5].
The range will change. Here is what I came up with:
myRange = 10
start = 1
current = start
next = current + myRange
count = 0
setTotal = []
for i in myset:
    if i >= current and i < next:
        count = count + 1
        print str(i) + " in " + str(len(setTotal) + 1)
    else:
        current = current + myRange
        next = myRange + current
        if next >= myset[-1]:
            next = myset[-1]
        setTotal.append(count)
        count = 0
print setTotal
Output
1 in 1
2 in 1
3 in 1
4 in 1
5 in 1
6 in 1
7 in 1
8 in 1
9 in 1
10 in 1
12 in 2
13 in 2
14 in 2
15 in 2
16 in 2
17 in 2
18 in 2
19 in 2
[10, 8]
Notice 11 and 20 were skipped. I also played around with the condition and got weird results.
EDIT: Range defines a range such that every value in the range should be counted into one chunk.
Think of a range as "from the current value to current value + range" being one chunk.
EDIT:
Wanted output:
1 in 1
2 in 1
3 in 1
4 in 1
5 in 1
6 in 1
7 in 1
8 in 1
9 in 1
10 in 1
11 in 2
12 in 2
13 in 2
14 in 2
15 in 2
16 in 2
17 in 2
18 in 2
19 in 2
[10, 10]
With the right key function, the groupby function in the itertools module makes doing this fairly simple:
from itertools import groupby
def ranger(values, range_size):
    def keyfunc(n):
        key = n/(range_size+1) + 1
        print '{} in {}'.format(n, key)
        return key
    return [len(list(g)) for k, g in groupby(values, key=keyfunc)]
myset = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
print ranger(myset, 10)
print ranger(myset, 15)
You want to use simple division and the remainder; the divmod() function gives you both:
def chunks(lst, size):
    count, remainder = divmod(len(lst), size)
    return [size] * count + ([remainder] if remainder else [])
Then, to create your desired output, use the output of chunks():
lst = range(1, 21)
size = 10
start = 0
for count, chunk in enumerate(chunks(lst, size), 1):
    for i in lst[start:start + chunk]:
        print '{} in {}'.format(i, count)
    start += chunk
count is the number of the current chunk (starting at 1; python uses 0-based indexing normally).
This prints:
1 in 1
2 in 1
3 in 1
4 in 1
5 in 1
6 in 1
7 in 1
8 in 1
9 in 1
10 in 1
11 in 2
12 in 2
13 in 2
14 in 2
15 in 2
16 in 2
17 in 2
18 in 2
19 in 2
20 in 2
If you don't care about what numbers are in a given chunk, you can calculate the size easily:
def chunk_sizes(lst, size):
    complete = len(lst) // size  # Number of `size`-sized chunks
    partial = len(lst) % size    # Last chunk
    if partial:                  # Sometimes the last chunk is empty
        return [size] * complete + [partial]
    else:
        return [size] * complete
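For example, with the question's list (same Python 2 environment as the rest of this answer):
print chunk_sizes(range(1, 21), 10)  # [10, 10]
print chunk_sizes(range(1, 21), 15)  # [15, 5]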