Duplicating rows with certain value in a column

I have to duplicate rows that have a certain value in a column and replace the value with another value.
For instance, I have this data:
import pandas as pd
df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A','B','C','D']})
Now, I want to duplicate the rows that have 2 in column 'B' and change the 2 to 4 in the copies:
df = pd.DataFrame({'Date': [1, 2, 2, 3, 4, 4], 'B': [1, 2, 4, 3, 2, 4], 'C': ['A','B','B','C','D','D']})
Please help me on this one. Thank you.

You can use pd.concat to append the rows where B == 2, which you can extract with a boolean mask, reassigning B to 4 with assign. (DataFrame.append did the same job in older pandas but was removed in pandas 2.0.) If order matters, you can then sort by C (to reproduce your desired frame):
>>> pd.concat([df, df[df.B.eq(2)].assign(B=4)]).sort_values('C', kind='stable')
   Date  B  C
0     1  1  A
1     2  2  B
1     2  4  B
2     3  3  C
3     4  2  D
3     4  4  D
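Sorting by C only works here because C happens to be in ascending order already. A minimal sketch, assuming the goal is to keep each copy directly below its original whatever the column contents, sorts on the index with a stable sort instead (the names dupes and out are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Date': [1, 2, 3, 4], 'B': [1, 2, 3, 2], 'C': ['A', 'B', 'C', 'D']})

# Copies of the rows where B == 2, with B replaced by 4.
# They keep the index labels of the rows they were copied from.
dupes = df[df['B'].eq(2)].assign(B=4)

# A stable sort on the index places each copy right after its original.
out = pd.concat([df, dupes]).sort_index(kind='stable').reset_index(drop=True)
print(out)
```

The final reset_index(drop=True) is optional; it just replaces the repeated labels with a fresh 0..n-1 index.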

Related

Python - Group(Cluster/Sort) arrays based on ranking information

I have a dataframe that looks like this:
A B C D
0 5 4 3 2
1 4 5 3 2
2 3 5 2 1
3 4 2 5 1
4 4 5 2 1
5 4 3 5 1
...
I converted the dataframe into 2D arrays like this:
[[5 4 3 2]
[4 5 3 2]
[3 5 2 1]
[4 2 5 1]
[4 5 2 1]
[4 3 5 1]
...]
Each row holds the scores (1-5) that one person gave to items A, B, C and D. I would like to identify the people who share the same ranking, for example everyone who thinks A > B > C > D, and regroup the rows by ranking like this:
2DArray1: [[5 4 3 2]]
2DArray2: [[4 5 3 2]
[3 5 2 1]
[4 5 2 1]]
2DArray3: [[4 2 5 1]
[4 3 5 1]]
For example, 2DArray2 contains the people who think B > A > C > D, and 2DArray3 the people who think C > A > B > D. I tried different sort functions in numpy but could not find a suitable one. How should I do this?

Numpy doesn't have a groupby function, because a groupby would return a list of lists of different sizes, whereas numpy mostly only deals with rectangular arrays.
A workaround would be to sort the rows so that similar rows are adjacent, then produce an array of the indices of the beginning of each group.
Since I'm too lazy to do that, here is a solution without numpy instead:
Index by the permutation directly
For each row, we compute the corresponding permutation of 'ABCD'. Then, we add the row to a dict of lists of rows, where the dictionary keys are the corresponding permutations.
from collections import defaultdict
a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]
groups = defaultdict(list)
for row in a:
    groups[tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True))].append(row)
print(groups)
Output:
defaultdict(<class 'list'>, {
(0, 1, 2, 3): [[5, 4, 3, 2]],
(1, 0, 2, 3): [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
(2, 0, 1, 3): [[4, 2, 5, 1], [4, 3, 5, 1]]
})
Note that with this solution, the results might not be what you expect if some users give the same score to two different items, because sorted doesn't record ties; instead, being a stable sort, it breaks them by order of appearance (in this case, this means ties between two items are broken alphabetically).
Index by the index of the permutation
The permutations of 'ABCD' can be ordered lexicographically: 'ABCD' comes first, then 'ABDC' comes second, then 'ACBD' comes third...
As it turns out, there is an algorithm to compute the index at which a given permutation would come in that sequence! And that algorithm is implemented in python module more_itertools:
more_itertools.permutation_index
So, we can replace our tuple key tuple(sorted(range(len(row)), key=lambda i: row[i], reverse=True)) by a simple number key permutation_index(row, sorted(row, reverse=True)).
from collections import defaultdict
from more_itertools import permutation_index
a = [[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1], [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]]
groups = defaultdict(list)
for row in a:
    groups[permutation_index(row, sorted(row, reverse=True))].append(row)
print(groups)
Output:
defaultdict(<class 'list'>, {
0: [[5, 4, 3, 2]],
6: [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
8: [[4, 2, 5, 1], [4, 3, 5, 1]]
})
Mixing permutation_index and pandas
Since the output of permutation_index is a simple number, we can easily include it in a numpy array or a pandas dataframe as a new column:
import pandas as pd
from more_itertools import permutation_index
df = pd.DataFrame({'A': [5,4,3,4,4,4], 'B': [4,5,5,2,5,3], 'C': [3,2,2,5,2,5], 'D': [2,2,1,1,1,1]})
df['perm_idx'] = df.apply(lambda row: permutation_index(row, sorted(row, reverse=True)), axis=1)
print(df)
A B C D perm_idx
0 5 4 3 2 0
1 4 5 2 2 6
2 3 5 2 1 6
3 4 2 5 1 8
4 4 5 2 1 6
5 4 3 5 1 8
for idx, sub_df in df.groupby('perm_idx'):
    print(idx)
    print(sub_df)
0
A B C D perm_idx
0 5 4 3 2 0
6
A B C D perm_idx
1 4 5 2 2 6
2 3 5 2 1 6
4 4 5 2 1 6
8
A B C D perm_idx
3 4 2 5 1 8
5 4 3 5 1 8
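For completeness, the NumPy workaround mentioned at the top of this answer (sort the rows so that equal rankings are adjacent, then locate where each group begins) could be sketched roughly as follows. The variable names are illustrative, and ties are assumed to be broken by column order, as in the pure-Python version:

```python
import numpy as np

a = np.array([[5, 4, 3, 2], [4, 5, 3, 2], [3, 5, 2, 1],
              [4, 2, 5, 1], [4, 5, 2, 1], [4, 3, 5, 1]])

# Ranking of each row: column indices ordered from highest to lowest score.
# kind='stable' breaks ties by column order.
ranks = np.argsort(-a, axis=1, kind='stable')

# Reorder the rows so that identical rankings are adjacent.
# np.lexsort treats the LAST key as the primary one, hence the reversal.
order = np.lexsort(ranks.T[::-1])
sorted_ranks = ranks[order]

# A new group starts wherever a ranking differs from the previous row's.
starts = np.flatnonzero(np.any(sorted_ranks[1:] != sorted_ranks[:-1], axis=1)) + 1

# Split the reordered rows at the group boundaries.
groups = np.split(a[order], starts)
for g in groups:
    print(g)
```

The result is a plain Python list of 2D arrays of different heights, which is why NumPy cannot return it as a single array.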

You can:
(i) transpose df and convert it to a dictionary,
(ii) sort each person's dictionary by value and get the keys,
(iii) join the sorted keys for each "person" and assign the result to df['ranks'],
(iv) collect each person's scores into a list and assign it to df['pref'],
(v) groupby(['ranks']) and create lists from pref.
import pandas as pd
df = pd.DataFrame({'A': {0: 5, 1: 4, 2: 3, 3: 4, 4: 4, 5: 4},
                   'B': {0: 4, 1: 5, 2: 5, 3: 2, 4: 5, 5: 3},
                   'C': {0: 3, 1: 3, 2: 2, 3: 5, 4: 2, 5: 5},
                   'D': {0: 2, 1: 2, 2: 1, 3: 1, 4: 1, 5: 1}})
# Steps (i)-(iii): item names of each row, ordered from highest to lowest score
df['ranks'] = pd.Series({k: ''.join(list(zip(*sorted(v.items(), key=lambda d: d[1],
                                                     reverse=True)))[0])
                         for k, v in df.T.to_dict().items()})
# Step (iv): each row's scores as a list
df['pref'] = df.loc[:, 'A':'D'].values.tolist()
# Step (v): lists of score lists, keyed by ranking string
out = df[['ranks', 'pref']].groupby('ranks').agg(list).to_dict()['pref']
Output:
{'ABCD': [[5, 4, 3, 2]],
'BACD': [[4, 5, 3, 2], [3, 5, 2, 1], [4, 5, 2, 1]],
'CABD': [[4, 2, 5, 1], [4, 3, 5, 1]]}

Pandas DataFrame filter by multiple column criteria and multiple intervals

I have checked several answers but found no luck so far.
My dataset is like this:
df = pd.DataFrame({
'Location':['A', 'A', 'A', 'B', 'C', 'C'],
'Place':[1, 2, 3, 4, 2, 3],
'Value1':[1, 1, 2, 3, 4, 5],
'Value2':[1, 1, 2, 3, 4, 5]
}, columns = ['Location','Place','Value1','Value2'])
Location Place Value1 Value2
A 1 1 1
A 2 1 1
A 3 2 2
B 4 3 3
C 2 4 4
C 3 5 5
and I have a list of intervals:
A: [0, 1]
A: [3, 5]
B: [1, 3]
C: [1, 4]
C: [6, 10]
Now I want to keep only the rows whose Place falls inside one of the intervals listed for their Location. So the desired output will be:
Location Place Value1 Value2
A 1 1 1
A 3 2 2
C 2 4 4
C 3 5 5
I know that I can chain multiple between conditions with |, but I have a really long list of intervals, so entering the conditions manually is not feasible. I also considered a for loop that slices the data by location first, but I think there could be a more efficient way.
Thank you for your help.
Edit: Currently the list of intervals is just strings like this
A 0 1
A 3 5
B 1 3
C 1 4
C 6 10
but I would like to slice them into list of dicts. Better structure for it is also welcome!

First define the dataframe df and the filters dff:
import pandas as pd
df = pd.DataFrame({
'Location':['A', 'A', 'A', 'B', 'C', 'C'],
'Place':[1, 2, 3, 4, 2, 3],
'Value1':[1, 1, 2, 3, 4, 5],
'Value2':[1, 1, 2, 3, 4, 5]
}, columns = ['Location','Place','Value1','Value2'])
dff = pd.DataFrame({'Location':['A','A','B','C','C'],
'fPlace':[[0,1], [3, 5], [1, 3], [1, 4], [6, 10]]})
dff[['p1', 'p2']] = pd.DataFrame(dff["fPlace"].to_list())
now dff is:
Location fPlace p1 p2
0 A [0, 1] 0 1
1 A [3, 5] 3 5
2 B [1, 3] 1 3
3 C [1, 4] 1 4
4 C [6, 10] 6 10
where fPlace has been split into the lower and upper bounds p1 and p2 of the filter that should be applied to Place. Next:
df.merge(dff).query('Place >= p1 and Place <= p2').drop(columns = ['fPlace','p1','p2'])
result:
Location Place Value1 Value2
0 A 1 1 1
5 A 3 2 2
7 C 2 4 4
9 C 3 5 5

Prerequisites:
# presumed setup for your intervals:
intervals = {
"A": [
[0, 1],
[3, 5],
],
"B": [
[1, 3],
],
"C": [
[1, 4],
[6, 10],
],
}
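Actual solution:
# One row per (df row, interval) pair; .str lets us index into each [low, high] list
x = df["Location"].map(intervals).explode().str
l, r = x[0], x[1]
# Test each Place against its interval bounds (index labels repeat once per interval)
res = df["Place"].loc[l.index].between(l, r)
# Keep the labels of the rows that fall inside at least one interval
res = res.loc[res].index.unique()
res = df.loc[res]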
Outputs:
>>> res
Location Place Value1 Value2
0 A 1 1 1
2 A 3 2 2
4 C 2 4 4
5 C 3 5 5
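Regarding the edit in the question (the intervals currently exist only as strings), a minimal sketch of building the intervals dict used above could look like this. It assumes each line has exactly the form "Location low high" separated by whitespace; the variable raw is hypothetical and stands in for wherever the strings come from:

```python
from collections import defaultdict

# Hypothetical input: one "Location low high" triple per line.
raw = """A 0 1
A 3 5
B 1 3
C 1 4
C 6 10"""

grouped = defaultdict(list)
for line in raw.splitlines():
    location, low, high = line.split()
    grouped[location].append([int(low), int(high)])

# Convert to a plain dict before handing it to Series.map.
intervals = dict(grouped)
print(intervals)
```

A dict of lists keyed by location keeps all intervals of one location together, which is the shape that df["Location"].map(intervals) expects.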

Use of index in pandas DataFrame for groupby and aggregation

I want to aggregate a single column DataFrame and count the number of elements. However, I always end up with an empty DataFrame:
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").count()
Out[46]:
Empty DataFrame
Columns: []
Index: [1, 2, 3, 4, 5]
If I add a second column, I get the desired result:
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5], "B":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").count()
Out[45]:
B
A
1 1
2 1
3 1
4 1
5 3
Can you explain the reason for this?

Give this a shot:
import pandas as pd
print(pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A")["A"].count())
prints
A
1 1
2 1
3 1
4 1
5 3

You have to select the grouped-by column explicitly to get it in your result:
import pandas as pd
pd.DataFrame({"A":[1, 2, 3, 4, 5, 5, 5]}).groupby("A").A.count()
Output:
A
1 1
2 1
3 1
4 1
5 3
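As for the reason: the grouping column A becomes the index of the result and is excluded from the columns being aggregated, so with a single-column frame count() has no columns left to count. A minimal sketch, assuming only the number of rows per group is needed, uses size(), which does not depend on any other column:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5, 5, 5]})

# count() aggregates the non-key columns; here there are none,
# so the result has the group labels as index but zero columns.
empty = df.groupby("A").count()
print(empty.shape)

# size() counts the rows in each group regardless of other columns.
counts = df.groupby("A").size()
print(counts)
```

Note that size() counts all rows in a group, whereas count() skips missing values in the column being counted.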

How can I return a larger dataframe from two dataframes

e.g. I have two dataframes:
a = pd.DataFrame({'A':[1,2,3],'B':[6,5,4]})
b = pd.DataFrame({'A':[3,2,1],'B':[4,5,6]})
I want to get a dataframe c consisting of the larger value in each position of a & b:
c = max_function(a,b) = pd.DataFrame(max(a.iloc[i,j], b.iloc[i,j]))
c = pd.DataFrame({'A':[3,2,3],'B':[6,5,6]})
I don't want to generate c by comparing each value in a & b one at a time, because the real dataframes in my work are very large.
So I wonder if there's a ready-made pandas function which can do this? Thanks!

You could use numpy.maximum:
import pandas as pd
import numpy as np
a = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]})
b = pd.DataFrame({'A': [3, 2, 1], 'B': [4, 5, 6]})
c = np.maximum(a, b)
print(c)
Output
A B
0 3 6
1 2 5
2 3 6
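If you prefer to stay within pandas, a possible alternative (a sketch that assumes a and b have identical row and column labels) is DataFrame.where, which keeps the values of a where the condition holds and takes them from b elsewhere:

```python
import pandas as pd

a = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]})
b = pd.DataFrame({'A': [3, 2, 1], 'B': [4, 5, 6]})

# Keep a's value where it is at least as large as b's, otherwise use b's.
c = a.where(a >= b, b)
print(c)
```

Like np.maximum, this is vectorized, so no explicit loop over the cells is needed.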

Extracting values from a dictionary for a respective key

I have a dictionary in the following pattern:
dict_one = {1: [2, 3, 4], 2: [3, 4, 4, 5],3 : [2, 5, 6, 6]}
I need to get an output such that for each key I have only one value adjacent to it and then finally I need to create a data frame out of it.
The output would be similar to:
1 2
1 3
1 4
2 3
2 4
2 4
2 5
3 2
3 5
3 6
3 6
Please help me with this.
dict_one = {1: [2, 3, 4], 2: [3, 4, 4, 5],3 : [2, 5, 6, 6]}
df_column = ['key','value']
for key in dict_one.keys():
    value = dict_one.values()
    row = (key,value)
extended_ground_truth = pd.DataFrame.from_dict(row, orient='index', columns=df_column)
extended_ground_truth.to_csv("extended_ground_truth.csv", index=None)

You can flatten the data into (key, value) pairs as you iterate over the dictionary:
df = pd.DataFrame(((key, value) for key, values in dict_one.items() for value in values),
                  columns=["key", "value"])

You can wrap the values in lists, then use DataFrame.from_dict and finally use explode to expand the lists:
pd.DataFrame.from_dict({k: [v] for k, v in dict_one.items()}, orient='index').explode(0)
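The explode approach leaves the keys in the index and the values in a column named 0. A minimal sketch, assuming you want the two named columns key and value from the question, builds a Series first and then turns the index into a column:

```python
import pandas as pd

dict_one = {1: [2, 3, 4], 2: [3, 4, 4, 5], 3: [2, 5, 6, 6]}

df = (
    pd.Series(dict_one, name="value")  # one list per key
    .explode()                         # one row per list element, key repeated in the index
    .rename_axis("key")                # name the index so it becomes the "key" column
    .reset_index()
)
print(df)
```

From here, df.to_csv("extended_ground_truth.csv", index=False) would write the file the question aims for.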
