I have a sample table like this:
Dataframe: df
Col1 Col2 Col3 Col4
A 1 10 i
A 1 11 k
A 1 12 a
A 2 10 w
A 2 11 e
B 1 15 s
B 1 16 d
B 2 21 w
B 2 25 e
B 2 36 q
C 1 23 a
C 1 24 b
I'm trying to get all records/rows of the (Col1, Col2) group that has the smallest number of records within each Col1, while skipping any Col1 value that has only one (Col1, Col2) group (in this example Col1 = 'C'). So the output would be as follows:
A 2 10 w
A 2 11 e
B 1 15 s
B 1 16 d
since group (A,2) has 2 records compared to group (A,1) which has 3 records.
I tried to approach this issue from different angles but just can't seem to get the result that I need. I am able to find the groups that I need using a combination of groupby, filter and agg, but how do I then use this as a selection filter on df? After spending a lot of time on this, I wasn't even sure the approach was correct, as it looked overly complicated. I am sure there is an elegant solution but I just can't see it.
Any advice on how to approach this would be greatly appreciated.
I had this to get the groups for which I wanted the rows displayed:
groups = df.groupby(["Col1", "Col2"])["Col2"].agg(no='count')
filteredGroups = groups.groupby(level=0).filter(lambda group: group.size > 1)
print(filteredGroups.groupby(level=0).agg('idxmin'))
The second line was to account for Col1 values that have only one group, since I don't want to consider those. Honestly, I tried so many variations and approaches, and none of them gave me the result I wanted. I see that none of the answers are one-liners, so at least I don't feel like I was overthinking the problem.
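For what it's worth, here is a sketch (my own, not taken from the answers below) of one way to turn a set of selected (Col1, Col2) groups back into a row filter on df, using a MultiIndex membership test:
import pandas as pd

# sizes of each (Col1, Col2) group
sizes = df.groupby(['Col1', 'Col2']).size()

# keep the pairs whose size is the minimum within their Col1 but not also the
# maximum (this drops Col1 values such as 'C' that have a single group)
keep = sizes[(sizes == sizes.groupby(level=0).transform('min'))
             & (sizes != sizes.groupby(level=0).transform('max'))].index

df[pd.MultiIndex.from_frame(df[['Col1', 'Col2']]).isin(keep)]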
df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")
df['rnk'] = df.groupby('Col1')['sz'].rank(method='min')
df['rnk_rev'] = df.groupby('Col1')['sz'].rank(method='min',ascending=False)
df.loc[ (df['rnk'] == 1.0) & (df['rnk_rev'] != 1.0) ]
Col1 Col2 Col3 Col4 sz rnk rnk_rev
3 A 2 10 w 2 1.0 4.0
4 A 2 11 e 2 1.0 4.0
5 B 1 15 s 2 1.0 4.0
6 B 1 16 d 2 1.0 4.0
Edit: changed "count" to "size" (as in @Marco Spinaci's answer), which doesn't matter in this example but might if there were missing values.
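For illustration (this toy frame is not part of the question's data), count ignores missing values while size counts every row:
import numpy as np
import pandas as pd

toy = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1.0, np.nan, 2.0]})
print(toy.groupby('g')['v'].count())  # x -> 1 (the NaN is excluded), y -> 1
print(toy.groupby('g')['v'].size())   # x -> 2 (every row counted),   y -> 1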
And for clarity, here's what the df looks like before dropping the selected rows.
Col1 Col2 Col3 Col4 sz rnk rnk_rev
0 A 1 10 i 3 3.0 1.0
1 A 1 11 k 3 3.0 1.0
2 A 1 12 a 3 3.0 1.0
3 A 2 10 w 2 1.0 4.0
4 A 2 11 e 2 1.0 4.0
5 B 1 15 s 2 1.0 4.0
6 B 1 16 d 2 1.0 4.0
7 B 2 21 w 3 3.0 1.0
8 B 2 25 e 3 3.0 1.0
9 B 2 36 q 3 3.0 1.0
10 C 1 23 a 2 1.0 1.0
11 C 1 24 b 2 1.0 1.0
Definitely not a nice answer, but it should work:
tmp = df.groupby(['Col1', 'Col2']).size()
df['occurrences'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]][df.Col2[i]])
df['min_occurrences'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]].min())
df['max_occurrences'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]].max())
# the max check drops Col1 values that have a single (Col1, Col2) group, such as 'C'
df[(df.occurrences == df.min_occurrences) & (df.occurrences != df.max_occurrences)]
But there must be a more clever way to use groupby than creating an auxiliary data frame...
The following solution is based on the groupby.apply methodology. Other, simpler methods are available that build helper Series, as in JohnE's answer, which I would say is superior.
The solution works by grouping the dataframe at the Col1 level and then passing apply a function that further groups the data by Col2. Each sub-group is then assessed to find the smallest one. Note that ties in size are resolved by whichever sub-group is evaluated first, which may not be desirable (a possible tie-breaking variant is sketched after the code below).
# create data
import pandas as pd

df = pd.DataFrame({
    "Col1": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
    "Col2": [1, 1, 1, 2, 2, 1, 1, 2, 2, 2],
    "Col3": [10, 11, 12, 10, 11, 15, 16, 21, 25, 36],
    "Col4": ["i", "k", "a", "w", "e", "s", "d", "w", "e", "q"]
})
Grouped = df.groupby("Col1")

def transFunc(x):
    smallest = [None, None]
    sub_groups = x.groupby("Col2")
    for group, data in sub_groups:
        if not smallest[1] or len(data) < smallest[1]:
            smallest[0] = group
            smallest[1] = len(data)
    return sub_groups.get_group(smallest[0])

Grouped.apply(transFunc).reset_index(drop=True)
Edit: to assign the result:
result = Grouped.apply(transFunc).reset_index(drop=True)
print(result)
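If a deterministic tie-break is needed, a possible variant (a sketch of my own, not part of the original answer) prefers the lowest Col2 when two sub-groups have the same size:
def transFuncTieBreak(x):
    # order sub-groups by (size, Col2) so ties in size resolve to the lowest Col2
    sizes = x.groupby("Col2").size().reset_index(name="n").sort_values(["n", "Col2"])
    return x[x["Col2"] == sizes["Col2"].iloc[0]]

result = df.groupby("Col1").apply(transFuncTieBreak).reset_index(drop=True)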
I would like to add a shorter yet readable version of JohnE's solution:
df['sz'] = df.groupby(['Col1', 'Col2'])['Col3'].transform('size')
df[df.groupby('Col1')['sz'].rank(method='min').eq(1)
   & df.groupby('Col1')['sz'].rank(method='min', ascending=False).ne(1)]
Related
I have this dataframe that looks like this
data = {'col1': ['a', 'b', 'c'],
'col2': [10, 5, 20]}
df_sample = pd.DataFrame(data=data)
I want to calculate the rank of col2. I wrote this function
def rank_by(df):
    if df.shape[0] >= 10:
        df.sort_values(by=['col2'])
        l = []
        for val in df['col2']:
            rank = (val / df['col2'].max()) * 10
            l.append(rank)
        df['rank'] = l
    return df
Please assume col2 has more than 10 values. I want to know if there is a more pythonic way of applying the function defined above.
It looks like you just want the ratio to the max value multiplied by 10:
df_sample['rank'] = df_sample['col2'].div(df_sample['col2'].max()).mul(10)
print(df_sample.sort_values(by='col2'))
Output:
col1 col2 rank
4 e 2 0.8
8 i 2 0.8
3 d 4 1.6
1 b 5 2.0
6 g 6 2.4
9 j 9 3.6
0 a 10 4.0
7 h 12 4.8
2 c 20 8.0
5 f 25 10.0
Used input:
data = {'col1': list('abcdefghij'),
'col2': [10, 5, 20,4,2,25,6,12,2,9]}
df_sample = pd.DataFrame(data=data)
Use pd.Series.rank:
df_sample['rank'] = df_sample['col2'].rank()
Output:
col1 col2 rank
0 a 10 2.0
1 b 5 1.0
2 c 20 3.0
Note, there are different methods to handle ties.
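For example (an illustrative snippet, not part of the original answer):
import pandas as pd

s = pd.Series([10, 5, 20, 10])
print(s.rank())                # ties share the average rank: 2.5, 1.0, 4.0, 2.5
print(s.rank(method='min'))    # ties take the lowest rank:   2.0, 1.0, 4.0, 2.0
print(s.rank(method='dense'))  # no gaps after ties:          2.0, 1.0, 3.0, 2.0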
Env: Python 3.9.6, Pandas 1.3.5
I have a DataFrame and a Series like below
df = pd.DataFrame({"C1" : ["A", "B", "C", "D"]})
sr = pd.Series(data = [1, 2, 3, 4, 5],
index = ["A", "A", "B", "C", "D"])
"""
[DataFrame]
C1
0 A
1 B
2 C
3 D
[Series]
A 1
A 2
B 3
C 4
D 5
"""
What I tried,
df["C2"] = df["C1"].map(sr)
But an InvalidIndexError occurred because the Series has duplicate keys ("A"):
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Is there any method to produce a DataFrame like one of the two below?
C1 C2
0 A 1
1 A 2
2 B 3
3 C 4
4 D 5
or
C1 C2
0 A 1
1 B 3
2 C 4
3 D 5
4 A 2
Row indices do not matter.
The question was heavily edited and now has a very different meaning.
You want a simple merge:
df.merge(sr.rename('C2'),
         left_on='C1', right_index=True)
Output:
C1 C2
0 A 1
0 A 2
1 B 3
2 C 4
3 D 5
old answer
First, I can't reproduce your issue (tested with 3M rows on pandas 1.3.5).
Then why do you use slicing and not map? map has the advantage of always producing the correct number of rows (NaN if the key is absent).
Example:
import numpy as np
import pandas as pd

sr = pd.Series({10: "A", 13: "B", 16: "C", 18: "D"})
df = pd.DataFrame({"C1":np.random.randint(10, 20, size=3000000)})
df['C2'] = df['C1'].map(sr)
print(df.head())
output:
C1 C2
0 10 A
1 18 D
2 10 A
3 13 B
4 15 NaN
I have a dataframe df. I want to add two new columns, 0 and 1, and fill them with data one row at a time rather than assigning the complete column at once. Using pd.Series on each row as below, I get NaN in the new columns for every row except the last one. Please provide a way to fix this.
I need to add the data one row at a time, so please suggest a solution accordingly.
df
val
1
2
3
code
for j in range(len(df)):
    for i in range(2):
        cal = df.val.iloc[j] + 10
        df[i] = pd.Series(cal, index=df.index[[j]])
output
val | 0 | 1
1 | NaN | NaN
2 | NaN | NaN
3 | 13.0 | 13.0
expected output
val | 0 | 1
1 | 11.0 | 11.0
2 | 12.0 | 12.0
3 | 13.0 | 13.0
EDIT
I had actually asked a question on Stack Overflow whose answer I could not get, which is why I tried to condense the question and present it this way. If possible, you can check the original question here.
It is not clear why you are trying to fill the columns one row at a time with inefficient methods, so I suggest not using this code and relying on vectorized solutions instead.
However, if you really want to do it for some reason, you should modify your loop like this:
for j in range(len(df)):
    for i in range(2):
        cal = df.val.iloc[j] + 10
        df.loc[j, i] = cal
# val 0 1
# 0 1 11.0 11.0
# 1 2 12.0 12.0
# 2 3 13.0 13.0
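For completeness, the vectorized approach mentioned above boils down to something like this (a minimal sketch, in line with what the other answers show):
# compute each new column in a single vectorized operation
df[0] = df["val"] + 10
df[1] = df["val"] + 10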
Use the apply function:
In [29]: df
Out[29]:
val
0 1
1 2
2 3
In [13]: df[0] = df["val"].apply(lambda x: x + 10)
In [14]: df[1] = df["val"].apply(lambda x: x + 10)
In [15]: df
Out[15]:
val 0 1
0 1 11 11
1 2 12 12
2 3 13 13
Or use iterrows
In [21]: temp = []
In [22]: for index, row in df.iterrows():
    ...:     temp.append(row["val"] + 10)
    ...:
In [23]: temp
Out[23]: [11, 12, 13]
In [24]: df[0] = temp
In [25]: df[1] = temp
In [26]: df
Out[26]:
val 0 1
0 1 11 11
1 2 12 12
2 3 13 13
Disclaimer - you should NOT use this code. It's the wrong way. But - given that you want to do it row by row, here's a solution:
df = pd.DataFrame({"val": [1,2, 3]})
for i in df.index:
val = df.loc[i, "val"]
for j in [0,1]:
df.loc[i, j] = val + 10
print(df)
==>
val 0 1
0 1 11.0 11.0
1 2 12.0 12.0
2 3 13.0 13.0
The proper way would be to do something like:
df = pd.DataFrame({"val": [1,2, 3]})
df[0] = df.val + 10
df[1] = df.val + 10
Same results basically, much better when it comes to pandas.
maybe:
for i in range(len(df)):
    df.loc[df.index[i], "val"] = df["val"].iloc[i] + 10
This question already has answers here:
add a row at top in pandas dataframe [duplicate]
(6 answers)
Closed 4 years ago.
I would like to move an entire row (index and values) from the last row to the first row of a DataFrame. Every other example I can find either uses an ordered row index (to be specific, my row index is not a numerical sequence, so I cannot simply add the row at -1 and then reindex with +1) or moves the values while maintaining the original index. My DataFrame has descriptions as the index, and the values are specific to each index description.
I'm adding a row and then would like to move it into row 1. Here is the setup:
df = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'F', 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
}).set_index('col1')

# output
In [7]: df
Out[7]:
col2 col3
col1
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
I then add a new row as follows:
df.loc["Deferred Description"] = pd.Series([''])
In [9]: df
Out[9]:
col2 col3
col1
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
Deferred Description NaN NaN
I would like the resulting output to be:
In [9]: df
Out[9]:
col2 col3
col1
Deferred Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
I've tried using df.shift() but only the values shift. I've also tried df.sort_index(), but that requires the index to be ordered (there are several SO examples using df.loc[-1] = ... and then reindexing with df.index = df.index + 1). In my case I need the Deferred Description row to be the first row.
Your problem is not one of cyclic shifting, but a simpler one: insertion (which is why I've chosen to mark this question as a duplicate).
Construct an empty DataFrame and then concatenate the two using pd.concat.
pd.concat([pd.DataFrame(columns=df.columns, index=['Deferred Description']), df])
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
If this were a column, it would have been easier. Funnily enough, pandas has a DataFrame.insert function that works for columns, but not rows.
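For example, column insertion at an arbitrary position looks like this (an illustrative snippet; 'col0' is just a made-up name):
tmp = df.copy()
tmp.insert(0, 'col0', 0)  # 'col0' becomes the left-most column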
Generalized Cyclic Shifting
If you were curious how you would cyclically shift a DataFrame, you can use np.roll:
# apply this fix to your existing DataFrame
pd.DataFrame(np.roll(df.values, 1, axis=0),
             index=np.roll(df.index, 1), columns=df.columns)
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
This, thankfully, also works when you have duplicate index values. If the index or columns aren't important, then pd.DataFrame(np.roll(df.values, 1, axis=0)) works well enough.
You can use append:
pd.DataFrame({'col2':[np.nan],'col3':[np.nan]},index=["Deferred Description"]).append(df)
Out[294]:
col2 col3
Deferred Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
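Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the equivalent with pd.concat (a sketch) would be:
import numpy as np
import pandas as pd

pd.concat([pd.DataFrame({'col2': [np.nan], 'col3': [np.nan]},
                        index=["Deferred Description"]), df])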
I have the following DataFrames:
df1 = pd.DataFrame(columns=["DATE", "QTY1", "QTY2", "ID", "CODE"])
df1["DATE"] = ["2000-05-01", "2000-05-03", "2001-01-15", "2001-02-20", "2001-02-22"]
df1["QTY1"] = [10, 11, 12, 5, 4]
df1["QTY2"] = [100, 101, 102, 15, 14]
df1["ID"] = [1, 2, 3, 4, 5]
df1["CODE"] = ["A", "B", "C", "D", "E"]

df2 = pd.DataFrame(columns=["DATE", "QTY1", "QTY2", "ID", "CODE"])
df2["DATE"] = ["2000-05-02", "2000-05-04", "2001-01-12", "2001-03-28", "2001-08-21", "2005-07-01"]
df2["QTY1"] = [9, 101, 11, 5.1, 100, 10]
df2["QTY2"] = [99, 12, 1000, 6, 3, 1]
df2["ID"] = [1, 2, 3, 8, 5, 9]
df2["CODE"] = ["F", "G", "H", "I", "L", "M"]
df1:
DATE QTY1 QTY2 ID CODE
0 2000-05-01 10 100 1 A
1 2000-05-03 11 101 2 B
2 2001-01-15 12 102 3 C
3 2001-02-20 5 15 4 D
4 2001-02-22 4 14 5 E
df2
DATE QTY1 QTY2 ID CODE
0 2000-05-02 9.0 99 1 F
1 2000-05-04 101.0 12 2 G
2 2001-01-12 11.0 1000 3 H
3 2001-03-28 5.1 6 8 I
4 2001-08-21 100.0 3 5 L
5 2005-07-01 10.0 1 9 M
My goal is to create a signature with some tolerance for each row and match the rows in the two DataFrames whose signatures fall within those tolerances.
The signature for each row is structured as follows:
DATE (with tolerance of +/- 5 days)
Qty1 (with tolerance of 10%)
Qty2 (with tolerance of 10%)
ID (perfect match).
So, for example, the matching result for the DataFrames above would return the following rows (the first row of each DataFrame), grouped by signature:
Signature1 2000-05-01 10 100 1 A
2000-05-02 9.0 99 1 F
All the other rows violate one or more of the tolerances.
Currently I am doing this with a classical for loop using iterrows() and checking all the fields (roughly like the sketch below), but for large DataFrames the performance is quite poor.
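For reference, a minimal sketch of that loop-based check (assuming the 10% tolerances are taken relative to df1's values; this is my own illustration, not an existing answer):
import pandas as pd

d1 = df1.assign(DATE=pd.to_datetime(df1["DATE"]))
d2 = df2.assign(DATE=pd.to_datetime(df2["DATE"]))

matches = []
for _, r1 in d1.iterrows():
    for _, r2 in d2.iterrows():
        if (r1["ID"] == r2["ID"]
                and abs(r1["DATE"] - r2["DATE"]) <= pd.Timedelta(days=5)
                and abs(r1["QTY1"] - r2["QTY1"]) <= 0.1 * r1["QTY1"]
                and abs(r1["QTY2"] - r2["QTY2"]) <= 0.1 * r1["QTY2"]):
            matches.append((r1["CODE"], r2["CODE"]))

print(matches)  # [('A', 'F')] for the sample data above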
I was wondering if there is a more pandas-like approach that could help me to speed it up.
Thanks