I have the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)
Which gives:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 1
7 1
8 2
9 2
10 5
11 9
12 1
13 2
14 3
15 3
16 3
17 5
I want to replace duplicate values in the 'ID' column with the lowest value not yet used. However, consecutive identical values should be treated as a group and changed in the same way. For example: the first two values are both 1. These are consecutive, so they form a group, and the second '1' should therefore not be replaced with a '2'. Rows 14-16 are three consecutive threes. The value 3 has already been used to replace earlier values, so these threes need to be replaced; but they are consecutive, thus a group, and should all get the same replacement value. The expected outcome below should make this clearer:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 3
7 3
8 4
9 4
10 7
11 9
12 8
13 10
14 11
15 11
16 11
17 12
df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
def fun():
    v, dub = 1, set()           # v: replacement candidate, dub: values already used
    d = yield
    while True:
        num = d.iloc[0]['ID']
        if num in dub:          # the group's value was used before: find the lowest free one
            while v in dub:
                v += 1
            d.ID = num = v
        dub.add(num)
        d = yield d             # hand the (possibly relabelled) group back to apply
f = fun()
next(f)
df = df.groupby([df['ID'].diff().ne(0).cumsum(), 'ID'], as_index=False).apply(lambda x: f.send(x))
print(df)
Output:
ID
0 1
1 1
2 2
3 5
4 5
5 6
6 3
7 3
8 4
9 4
10 7
11 9
12 8
13 10
14 11
15 11
16 11
17 12
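As a side note on how the grouping key works: df['ID'].diff().ne(0).cumsum() gives every run of consecutive identical values its own label, which is what lets groupby treat each run as a single unit. A quick check (not part of the original answer):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
# the label increases by one at every position where the value changes
groups = df['ID'].diff().ne(0).cumsum()
print(groups.tolist())
# [1, 1, 2, 3, 3, 4, 5, 5, 6, 6, 7, 8, 9, 10, 11, 11, 11, 12]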
I came up with a way to get your outcome using for loops and dictionaries. It was more difficult than I expected, to be fair; the code can seem a bit complex at first, but it isn't. There is probably a way to do it with multiple logical vectors, but I don't know one.
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
print(df)
####################
diffs = np.diff(df.ID)           # differences ID(k) - ID(k-1)
uniq = sorted(pd.unique(df.ID))  # unique values in the ID column
# dict with a range of candidate numbers, starting at the min of the ID column
a = range(uniq[0], uniq[-1]*int(df.shape[0]/len(uniq)))  # range of values
d = {a[k]: False for k in range(len(a))}                 # fill dict
d[df.ID[0]] = True               # mark the first value in the column as used
for m in range(1, df.shape[0]):
    # A value different from the previous one marks
    # the beginning of a new subgroup
    if diffs[m-1] != 0:
        # Check whether the value appeared earlier in the ID column
        if d[df.ID[m]] == True:
            # Get the lowest value which wasn't used yet
            lowest = [k for k, v in d.items() if v == False][0]
            # Loop over the subgroup (whose differences are 0)
            for n in range(m+1, df.shape[0]):
                if diffs[n-1] != 0:  # a new subgroup starts (the original `> 0` missed decreasing steps)
                    break            # so stop looping
            else:
                n = df.shape[0]      # the subgroup runs to the end of the column
            # Replace the subgroup with the lowest value
            df.ID[m:n] = lowest      # n is the first index after the subgroup
            # *Exception in case the last number is a subgroup by itself;
            # then the previous for loop doesn't run
            if m == df.shape[0]-1:
                df.ID[m] = lowest
    # Update the dict for every value retrieved from the ID column
    d[df.ID[m]] = True
print(df)
What you want, therefore, is to think of your ID column as a set of subgroups or separate arrays, check different conditions on them, and apply different operations accordingly. You can picture the column as a set of multiple arrays:
[1, 1 | 2 | 5, 5 | 6 | 1, 1 | 2, 2 | 5 | 9 | 1 | 2 | 3, 3, 3 | 5]
What you need to do is find the limits of those subgroups and check whether each one meets certain conditions (1. is it a value we have not seen before? 2. if not, what is the lowest number we haven't used yet?). We can find the subgroups by calculating the differences between each value and the previous one:
diffs = np.diff(df.ID) # differences ID(k) - ID(k-1)
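For the sample column this gives [0, 1, 3, 0, 1, -5, 0, 1, 0, 3, 4, -8, 1, 1, 0, 0, 2]: zeros mark rows that continue the previous subgroup, and any nonzero difference marks the start of a new one.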
We can track the conditions using a dictionary whose keys are the integers in the array (plus the larger values we might need) and whose values record whether we have used them or not (True or False).
To build it, we start from the max value of the ID column. However, the dictionary needs more numbers than appear in the column (in your example max(input) = 9 while max(output) = 12). You could choose the extra headroom arbitrarily; I chose to scale it by the ratio of the number of rows to the number of unique values in the column (the last input in a = range...).
uniq = sorted(pd.unique(df.ID))  # unique values in the ID column
# dict with a range of numbers covering current and future ID values
a = range(uniq[0], uniq[-1]*int(df.shape[0]/len(uniq)))
d = {a[k]: False for k in range(len(a))}
d[df.ID[0]] = True               # mark the first value in the column as used
The last part of the code is a main for loop with some if statements and another for loop inside. It works as follows:
# 1. Loop over the ID column
# 2. Check whether ID[m] differs from the previous value (diff != 0)
# 3. Check whether the ID[m] value already appeared in the ID column
# 4. Compute the lowest unused value (the first key mapped to False in the dict)
#    and replace the whole subgroup in the ID column with it
# 5. Because of how step 4 is implemented, it needs a small extra check for the
#    case where the last value is a subgroup by itself
# 6. Update the dict every time a new value shows up
I am sure there are many ways to shorten this code, but it works, and it should keep working on larger dataframes under the same conditions.
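For instance, here is one possible shorter version (a sketch of my own, not the original answer: it combines the run-labelling trick from the generator answer above with a set of used values):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 5, 5, 6, 1, 1, 2, 2, 5, 9, 1, 2, 3, 3, 3, 5]})
runs = df['ID'].diff().ne(0).cumsum()   # one label per consecutive run
used, new_vals = set(), {}              # new_vals: run label -> replacement value
for label, val in zip(runs, df['ID']):
    if label not in new_vals:
        v = val
        if v in used:                   # value already taken: find the lowest free one
            v = 1
            while v in used:
                v += 1
        new_vals[label] = v
        used.add(v)
df['ID'] = runs.map(new_vals)
print(df)   # matches the expected outcome above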
Related
Suppose I have a pandas dataframe like this:
first second third
1 2 2 1
2 2 1 0
3 3 4 5
4 4 6 3
5 5 4 3
6 8 8 4
7 3 4 2
8 5 6 6
and could be created with the code:
dataframe = pd.DataFrame(
    {
        'first': [2, 2, 3, 4, 5, 8, 3, 5],
        'second': [2, 1, 4, 6, 4, 8, 4, 6],
        'third': [1, 0, 5, 3, 3, 4, 2, 6]
    }
)
I want to select the rows in which the value of the second column is greater than the value of the first column, and at the same time the values in the third column are less than the values in the second column for k consecutive rows, where the last of these k consecutive rows sits exactly before the row in which the second column exceeds the first. Here k can be any integer between 2 and 4 (closed interval).
So, the output should be rows:
3, 7, 8
To get the above-mentioned result using conditional row selection in pandas, I know I should write code like this:
dataframe[(dataframe['first'] < dataframe['second']) & (second_condition)].index
But I don't know what to write for the second_condition which I have explained above. Can anyone help me with this?
The trick here is to calculate a rolling sum over a boolean mask to find how many of the previous k rows have the third column less than the second column:
k = 2
m1 = df['second'].gt(df['first'])
m2 = df['third'].lt(df['second']).shift(fill_value=0).rolling(k).sum().eq(k)
print(df[m1 & m2])
first second third
3 3 4 5
7 3 4 2
8 5 6 6
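If k really may be any value in [2, 4], one reading (an assumption on my part) is to accept a row when the previous-k check passes for any allowed k, which can be written by OR-ing the same mask over the window sizes:
m2_any = pd.concat([df['third'].lt(df['second']).astype(int).shift(fill_value=0).rolling(k).sum().eq(k)
                    for k in range(2, 5)], axis=1).any(axis=1)
print(df[m1 & m2_any])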
I will center my answer on the second part of your question. You need to use the shift function to compare; it allows you to shift rows.
Assuming your k is fixed at 2, you should do something like this:
import pandas as pd
df = pd.DataFrame(
    {
        'first': [2, 2, 3, 4, 5, 8, 3, 5],
        'second': [2, 1, 4, 6, 4, 8, 4, 6],
        'third': [1, 0, 5, 3, 3, 4, 2, 6]
    }
)
# this is the line
df[(df['third'] < df['second'].shift(1)) & (df['third'] < df['second'].shift(2))]
What's going on?
Start by comparing 'third' with the previous value of 'second' by shifting one row, and then shift two places for the second condition.
Note this only works for fixed values of k. What if k is variable?
In that case, you need to write your condition dynamically. The following code assumes that the condition must be met for all values of n in [1, k]:
k = 2 # pick any k > 1
df[~pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).groupby(level=0).any()].index  # .any(level=0) in pandas < 2.0
What's going on here? The long answer:
first, using the shift trick, we check which rows meet your criteria for each value of n in [1, k]:
In [1]: [df['third'] < df['second'].shift(n) for n in range(1, k+1)]
Out[1]:
[0 False
1 True
2 False
3 True
4 True
5 False
6 True
7 False
dtype: bool,
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
dtype: bool]
then, we concatenate them into one long Series, stacking one block of rows for each of the k values (note the index values repeat for each block).
In [2]: pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)])
Out[2]:
0 False
1 True
2 False
3 True
4 True
5 False
6 True
7 False
0 False
1 False
2 False
3 False
4 True
5 True
6 True
7 True
dtype: bool
finally, we group by the original row index (level 0) and check whether the criterion holds for any value of n. So: if it is true for any n, we get True for that row:
In [3]: pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).groupby(level=0).any()
Out[3]:
0 False
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool
Then, all you need to do is project this over your original dataframe and pick up the index.
In [4]: df[~pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).groupby(level=0).any()].index
Out[4]: Int64Index([0, 2], dtype='int64')
Final note
If the criteria must be met for all the values n in [1, k], then replace .any with .all.
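For instance (a sketch, written with the groupby(level=0) form used above):
df[~pd.concat([df['third'] < df['second'].shift(n) for n in range(1, k+1)]).groupby(level=0).all()].index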
# First condition is easy
cond1 = df["second"] > df["first"]
# Since the second condition compares the second and third columns, let's compute
# the result beforehand for convenience
s = df["third"] < df["second"]
# Now we are going to run a rolling window over `s`. What we want is that the
# previous `k` rows of `s` are all True.
# A rolling window always ends on the current row, but you want it to end on the
# previous row. So we increase the window size by 1 and exclude the last
# element from the comparison.
all_true = lambda arr: arr[:-1].all()
# rolling.apply returns floats (with NaN for incomplete windows), so cast back to bool
cond2_with_k_equal_2 = s.rolling(3).apply(all_true, raw=True).fillna(0).astype(bool)
cond2_with_k_equal_3 = s.rolling(4).apply(all_true, raw=True).fillna(0).astype(bool)
cond2_with_k_equal_4 = s.rolling(5).apply(all_true, raw=True).fillna(0).astype(bool)
cond2 = cond2_with_k_equal_2 | cond2_with_k_equal_3 | cond2_with_k_equal_4
# Or you can consolidate them into a loop
cond2 = pd.Series(False, index=df.index)
for k in range(2, 5):
    cond2 |= s.rolling(k+1).apply(all_true, raw=True).fillna(0).astype(bool)
# Get the result
df[cond1 & cond2]
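A possible apply-free variant of the same check (a sketch in the spirit of the rolling-sum answer above, not part of this answer): shifting the mask by one row lets a plain rolling sum cover exactly the previous k rows.
cond2 = pd.Series(False, index=df.index)
for k in range(2, 5):
    cond2 |= s.astype(int).shift(fill_value=0).rolling(k).sum().eq(k)
df[cond1 & cond2]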
I want to find the count of previous rows in a column that have a greater value than the current row, and store it in a new column. It would be like a rolling COUNTIF that goes back to the beginning of the column. The desired example output below shows the given Value column and the Count column I want to create.
Desired Output:
Value Count
5 0
7 0
4 2
12 0
3 4
4 3
1 6
I plan on using this code with a large dataframe so the fastest way possible is appreciated.
We can use np.subtract.outer from numpy, then take the lower triangle, check which values are less than 0, and sum per row:
a = np.sum(np.tril(np.subtract.outer(df.Value.values,df.Value.values), k=0)<0, axis=1)
# results in array([0, 0, 2, 0, 4, 3, 6])
df['Count'] = a
IMPORTANT: this only works with pandas < 1.0.0 and the error seems to be a pandas bug. An issue is already created at https://github.com/pandas-dev/pandas/issues/35203
We can do this with expanding and applying a function that counts the values higher than the last element of the expanding window.
import pandas as pd
import numpy as np
# setup
df = pd.DataFrame([5,7,4,12,3,4,1], columns=['Value'])
# calculate countif
df['Count'] = df.Value.expanding(1).apply(lambda x: np.sum(np.where(x > x[-1], 1, 0))).astype('int')
Input
Value
0 5
1 7
2 4
3 12
4 3
5 4
6 1
Output
Value Count
0 5 0
1 7 0
2 4 2
3 12 0
4 3 4
5 4 3
6 1 6
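If your pandas version hits the bug mentioned above, here is a variant of the same idea that should also run on current pandas (an assumption on my part; with raw=False the window arrives as a Series, so the last element is addressed positionally with iloc):
df['Count'] = (df.Value.expanding()
                 .apply(lambda x: (x.iloc[:-1] > x.iloc[-1]).sum(), raw=False)
                 .astype('int'))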
values = df['Value'].tolist()
counts = []          # counts[i] = number of earlier values greater than values[i]
for i in range(len(values)):
    c = 0            # a separate name: the original reused `count` for both the list and the counter
    for j in values[:i]:
        if values[i] < j:
            c += 1
    counts.append(c)
df['Count'] = counts
The generator below will do what you need. You may be able to optimize it further if needed.
def generator(data):
    i = 0
    count_dict = {}
    m = max(data)
    while i < len(data):
        v = data[i]
        count_dict[v] = count_dict[v] + 1 if v in count_dict else 1
        # count previously seen values strictly greater than v
        # (note m+1: range(v+1, m) would miss the maximum itself)
        t = sum(count_dict[j] if j in count_dict else 0 for j in range(v+1, m+1))
        i += 1
        yield t

d = [1, 5, 7, 3, 5, 8]
foo = generator(d)
result = [b for b in foo]
print(result)  # [0, 0, 0, 2, 1, 0]
I have a df like so:
Period Count
1 1
2 0
3 1
4 1
5 0
6 0
7 1
8 1
9 1
10 0
and I want to return an 'Event ID' in a new column: rows that are part of two or more consecutive occurrences of 1 in Count get an event number, and all other rows get 0. My desired output would then be:
Period Count Event_ID
1 1 0
2 0 0
3 1 1
4 1 1
5 0 0
6 0 0
7 1 2
8 1 2
9 1 2
10 0 0
I have researched and found solutions that let me flag consecutive groups of similar numbers (e.g. 1), but I haven't come across what I need yet. I would like to be able to use this method for any required run length, not just 2. For example, sometimes I need to count 10 consecutive occurrences; I just use 2 in the example here.
This will do the job:
import itertools
import operator

ones = df.groupby('Count').groups[1].tolist()
# creates a list of the indices with a '1': [0, 2, 3, 6, 7, 8]
event_id = [0] * len(df.index)
# creates a list of length 10 for Event_ID with all '0'
# find consecutive runs in the list of ones (yields [2, 3] and [6, 7, 8]):
event_num = 0
for k, g in itertools.groupby(enumerate(ones), lambda ix: ix[0] - ix[1]):
    sublist = list(map(operator.itemgetter(1), g))
    if len(sublist) > 1:        # runs of two or more form an event and get the next event number
        event_num += 1
        for i in sublist:
            event_id[i] = event_num
# event_id is now [0, 0, 1, 1, 0, 0, 2, 2, 2, 0]
df['Event_ID'] = event_id
The for loop is adapted from this example (using itertools, other approaches are possible too).
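For reference, here is a fully vectorized sketch of the same idea (my own variant, not part of the answer above; it uses the run-labelling trick that also appears elsewhere on this page):
import pandas as pd

df = pd.DataFrame({'Period': range(1, 11),
                   'Count': [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]})
runs = df['Count'].ne(df['Count'].shift()).cumsum()     # label consecutive runs
sizes = df.groupby(runs)['Count'].transform('size')     # length of each run
is_event = df['Count'].eq(1) & sizes.ge(2)              # runs of 1 with length >= 2
starts = is_event & ~is_event.shift(fill_value=False)   # first row of each event
df['Event_ID'] = starts.cumsum().where(is_event, 0)     # number the events 1, 2, ...
print(df)   # Event_ID is [0, 0, 1, 1, 0, 0, 2, 2, 2, 0]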
Given a pandas dataframe, I want to get the index of each row at which the running sum of the column's values (including the current row's value) becomes equal to or greater than n; at that point the sum restarts from zero. For example, if our dataframe has the values:
index colB
1 10
2 20
3 5
4 5
5 15
6 5
7 7
8 3
and say n=10, then the indices I want are [1, 2, 4, 5, 7], since at each of those rows the values of colB since the last reset add up to at least 10.
So far, I can do a for-loop on this dataframe to get the indices I want, but when there are many rows, it is very slow. Therefore, I am seeking help on a faster method. Thanks!
There may be a clever way with some combination of cumsum(), but this is a tough problem because the value needs to reset once the sum reaches n. So it's kind of like a rolling sum whose window resets every time it accumulates n.
I would probably use a custom function for this.
import pandas as pd
def go(s, n=10):
    r = []
    c = 0       # current group number
    cv = 0      # running sum since the last reset
    for v in s.tolist():
        cv += v
        r.append(c)     # the row that completes a group still belongs to it
        if cv >= n:     # the running sum reached n: reset for the next row
            c += 1
            cv = 0
    return r
df = pd.DataFrame.from_dict({
    'colB': {0: 10, 1: 20, 2: 5, 3: 5, 4: 15, 5: 5, 6: 1, 7: 3, 8: 1, 9: 12}})
df['group'] = go(df['colB'], n=10)
indices = df['group'].drop_duplicates().index
Note that I modified your example numbers a bit.
If df is this:
colB
0 10
1 20
2 5
3 5
4 15
5 5
6 1
7 3
8 1
9 12
indices is:
Int64Index([0, 1, 2, 4, 5, 9], dtype='int64')
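Under the same reset semantics, the running sum itself can also be written with itertools.accumulate, avoiding the hand-rolled loop (a sketch using the original question's numbers):
import pandas as pd
from itertools import accumulate

df = pd.DataFrame({'colB': [10, 20, 5, 5, 15, 5, 7, 3]}, index=range(1, 9))
n = 10
# restart the running total after every row at which it reached n
sums = list(accumulate(df['colB'], lambda acc, v: v if acc >= n else acc + v))
indices = df.index[pd.Series(sums, index=df.index).ge(n)].tolist()
print(indices)   # [1, 2, 4, 5, 7]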
I need to fetch random numbers from a list of values in Python. I tried using the random.choice() function, but it sometimes returns the same value consecutively. I want a new random value from the list each time. Is there any function in Python that allows me to do this?
Create a copy of the list, shuffle it, then pop items from that one by one as you need a new random value:
shuffled = origlist[:]
random.shuffle(shuffled)

def produce_random_value():
    return shuffled.pop()
This is guaranteed not to repeat elements. You can, however, run out of numbers to pick, at which point you could copy and re-shuffle again.
To do this continuously, you could make this a generator function:
def produce_randomly_from(items):
    while True:
        shuffled = list(items)
        random.shuffle(shuffled)
        while shuffled:
            yield shuffled.pop()
then use this in a loop or grab a new value with the next() function:
random_items = produce_randomly_from(inputsequence)
# grab one random value from the sequence
random_item = next(random_items)
Here is an example:
>>> random.sample(range(10), 10)
[9, 5, 2, 0, 6, 3, 1, 8, 7, 4]
Just replace the sequence given by range with the one you want to choose from. The second number is how many samples to draw; to use each element exactly once, make it the length of the input sequence.
If you just want to avoid consecutive random values, you can try this:
import random
def nonrepeating_rand(n):
    '''Generate random numbers in [0, n) such that no two consecutive numbers are equal.'''
    k = random.randrange(n)
    while True:
        yield k
        k2 = random.randrange(n - 1)
        if k2 >= k:  # skip over the previous number
            k2 += 1
        k = k2
Test:
for i, j in zip(range(25), nonrepeating_rand(3)):
    print(i, j)
prints (for example)
0 1
1 0
2 2
3 0
4 2
5 0
6 2
7 1
8 0
9 1
10 0
11 2
12 0
13 1
14 0
15 2
16 1
17 0
18 2
19 1
20 0
21 2
22 1
23 2
24 0
You can use nonrepeating_rand(len(your_list)) to get random indices for your list.
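For example, here is a small usage sketch (the item list is made up), relying on nonrepeating_rand as defined above:
items = ['red', 'green', 'blue']
gen = nonrepeating_rand(len(items))
picks = [items[next(gen)] for _ in range(6)]
print(picks)   # e.g. ['blue', 'red', 'blue', 'green', 'red', 'green']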