I have a DataFrame with two columns. I want to delete the first 3 rows of each id. If an id has 3 or fewer rows, delete all of its rows. In the following, ids 3 and 1 have 3 and 2 rows respectively, so all of their rows should be deleted; for ids 4 and 2, only each id's 4th and 5th rows are preserved.
import pandas as pd
df = pd.DataFrame()
df['id'] = [4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 3, 3, 3, 1, 1]
df['value'] = [2, 1, 1, 2, 3, 4, 6, -1, -2, 2, -3, 5, 7, -2, 5]
Here is the DataFrame I want:
   id  value
3   4      2
4   4      3
8   2     -2
9   2      2
Number each "id" using groupby + cumcount and filter the rows where the number is greater than 2:
out = df[df.groupby('id').cumcount() > 2]
Output:
id value
3 4 2
4 4 3
8 2 -2
9 2 2
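To see why this works, here is a quick sketch of what cumcount yields on this df (each id's rows are numbered 0, 1, 2, ... in order of appearance):
print(df.groupby('id').cumcount().tolist())
# [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 0, 1]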
Use Series.value_counts and Series.map to build a boolean mask of the ids that have more than 3 rows, and combine it with cumcount so the first 3 rows of each remaining id are dropped as well:
new_df = df[df['id'].map(df['id'].value_counts().gt(3)) & df.groupby('id').cumcount().gt(2)]
Output:
id value
3 4 2
4 4 3
8 2 -2
9 2 2
Using cumcount is the way to go, but drop works as well:
out = df.groupby('id', sort=False).apply(lambda x: x.drop(x.index[:3])).reset_index(drop=True)
Out[12]:
id value
0 4 2
1 4 3
2 2 -2
3 2 2
Initial dataframe looks as follows:
>>> df
id param
1 4
1 15
1 3
2 2
2 7
4 8
4 6
4 11
How do I achieve the following by putting only the first 2 values of each id into a new row? The resulting df should look as follows:
>>> df
col_a col_b
4 15
2 7
8 6
I tried to achieve this using transpose and iloc but did not succeed.
The column names are just for clarification; it is sufficient if only the index is displayed (e.g. 0, 1, 2, ...).
You can use a double groupby on 'id': first take the first two rows of each group, then aggregate the 'param' column into lists and expand those lists into new columns. Lastly, rename accordingly:
new = df.groupby('id').head(2).groupby('id', as_index=False).agg({'param': list}).param.apply(pd.Series)
new.columns = ['col_a', 'col_b']
Prints:
col_a col_b
0 4 15
1 2 7
2 8 6
You can first take groupby with head(2) and then split the result into chunks of 2 elements:
a = df.groupby("id")['param'].head(2).tolist()
out = pd.DataFrame([a[i:i + 2] for i in range(0, len(a), 2)],columns=['col_a','col_b'])
print(out)
col_a col_b
0 4 15
1 2 7
2 8 6
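If you prefer to stay index-aligned rather than round-tripping through a Python list, here is a sketch of the same reshape using cumcount + pivot (the 'pos' helper column is just illustrative):
tmp = df.groupby('id').head(2).copy()
tmp['pos'] = tmp.groupby('id').cumcount()  # 0 for the first row of each id, 1 for the second
out = tmp.pivot(index='id', columns='pos', values='param')
out.columns = ['col_a', 'col_b']
print(out.reset_index(drop=True))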
Is there a way to multiply each element of a row of a dataframe by an element from the same row of a particular column of another dataframe?
For example, such that:
df1:
1 2 3
2 2 2
3 2 1
and df2:
x 1 b
z 2 c
x 4 a
results in
1 2 3
4 4 4
12 8 4
So basically such that df1[i,:] * df2[i,j] = df3[i,:].
Multiply the first df by the chosen column of the second df, aligning on the row index.
Assuming your column names are 0, 1, 2:
df1.mul(df2[1], axis=0)
Output
0 1 2
0 1 2 3
1 4 4 4
2 12 8 4
Here you go.
I have created a variable that lets you select which column of the second dataframe you want to multiply with the numbers in the first dataframe.
import numpy as np

arr1 = np.array(df1)  # df1 to array
which_df2col_to_multiply = 1  # select the (numeric) column of df2
arr2 = np.array(df2)[:, which_df2col_to_multiply]  # selected column to array
print(arr1 * arr2[:, None])  # multiply each row of arr1 by the matching element of arr2
This is the output:
[[1 2 3]
[4 4 4]
[12 8 4]]
I have a small dataframe produced from value_counts() that I want to plot with a categorical x axis. It's a bit bigger than this, but:
Age Income
25-30 10
65-70 5
35-40 2
I want to be able to manually reorder the rows. How do I do this?
You can reorder rows with .reindex:
>>> df
a b
0 1 4
1 2 5
2 3 6
>>> df.reindex([1, 2, 0])
a b
1 2 5
2 3 6
0 1 4
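Applied to your Age example, a minimal sketch (assuming you want ascending age order):
df = pd.DataFrame({'Age': ['25-30', '65-70', '35-40'], 'Income': [10, 5, 2]})
print(df.reindex([0, 2, 1]))  # rows now ordered 25-30, 35-40, 65-70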
From here (link), you can create a sorting criterion and use that:
df = pd.DataFrame({'Age':['25-30','65-70','35-40'],'Income':[10,5,2]})
sort_criteria = {'25-30': 0, '35-40': 1, '65-70': 2}
df = df.loc[df['Age'].map(sort_criteria).sort_values(ascending = True).index]
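An alternative sketch uses an ordered Categorical, which bakes the desired order into the column so later sort_values calls and plots respect it (the category list below is an assumption matching the example):
df['Age'] = pd.Categorical(df['Age'], categories=['25-30', '35-40', '65-70'], ordered=True)
df = df.sort_values('Age')  # sorts by category order, not alphabetically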
I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that identifies the rows I will keep:
a = df.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
df['id'] = df['A'].astype(str) + df['B'].astype(str)
df[df['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
I had a similar situation but with a more complex column heading (e.g. "B val"), in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort the values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can easily be expanded to select the n rows with the smallest values in a specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with the other answers, to exactly match the result desired in the question, .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
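Under the same stable-order assumption, a shorter sketch that chains just as well is groupby.head itself:
out = df.sort_values('B').groupby('A').head(1).reset_index(drop=True)  # first (smallest-B) row per A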
I found an answer that is a little more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we get the min values as a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then we merge this Series back onto the original data frame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we keep only the rows where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
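For reuse, the same steps can be wrapped in a small helper; a sketch, with illustrative function and argument names:
def keep_group_min(frame, group_col, value_col):
    # attach each group's minimum, keep the matching rows, drop the helper column
    mins = frame.groupby(group_col)[value_col].min().reset_index()
    merged = frame.merge(mins, on=group_col, suffixes=('', '_min'))
    mask = merged[value_col] == merged[value_col + '_min']
    return merged[mask].drop(value_col + '_min', axis=1)

keep_group_min(data, 'A', 'B')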
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
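Note that drop_duplicates keeps only the first row per group, so ties for the minimum are broken arbitrarily rather than kept; a quick sketch with made-up tied data:
tied = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [10, 11]})
print(tied.sort_values('B').drop_duplicates('A'))  # keeps only the C=10 row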
The solution, as written above, is:
df.loc[df.groupby('A')['B'].idxmin()]
but you may then get an error like:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B, so idxmin returned NaN labels. Adding dropna() made it work:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also use boolean indexing to select the rows where column B equals its group's minimum:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
I have a pandas dataframe with index [0, 1, 2...], and a list something like this: [1, 2, 2, 0, 1...].
I'd like to add a 'count' column to the dataframe, that reflects the number of times the digit in the index is referenced in the list.
Given the example list above, the 'count' column would have the value 2 at index 2, because 2 occurs twice in the list. Is there a more efficient way to do this than iterating over the list?
Well, here is a way of doing it: first load the list into a df, then add an 'occurrence' column using value_counts, and then merge this to your orig df:
In [61]:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(10)})
l = [1, 2, 2, 0, 1]
df1 = pd.DataFrame(l, columns=['data'])
df1['occurrence'] = df1['data'].map(df1['data'].value_counts())
df1
Out[61]:
data occurrence
0 1 2
1 2 2
2 2 2
3 0 1
4 1 2
In [65]:
df.merge(df1, left_index=True, right_on='data', how='left').fillna(0).drop_duplicates().reset_index(drop=True)
Out[65]:
a data occurrence
0 0 0 1
1 1 1 2
2 2 2 2
3 3 3 0
4 4 4 0
5 5 5 0
6 6 6 0
7 7 7 0
8 8 8 0
9 9 9 0
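A shorter sketch of the same idea maps the counts straight onto df's index, skipping the merge:
counts = pd.Series(l).value_counts()
df['count'] = df.index.map(counts).fillna(0).astype(int)  # indices absent from l get 0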
Counting occurrences of numbers in a dataframe is easy in pandas.
You just use the Series.value_counts method.
Then you join the counts back onto the original dataframe using the pandas.merge function.
Setting up a DataFrame like the one you have:
import numpy as np
import pandas as pd

df = pd.DataFrame({'nomnom': np.random.choice(['cookies', 'biscuits', 'cake', 'lie'], 10)})
df is now a DataFrame with some arbitrary data in it (since you said you had more data in there).
nomnom
0 biscuits
1 lie
2 biscuits
3 cake
4 lie
5 cookies
6 cake
7 cake
8 cake
9 cake
Setting up a list like the one you have:
yourlist = np.random.choice(10, 10)
yourlist is now:
array([2, 9, 2, 3, 4, 8, 5, 8, 6, 8])
The actual code you need (TL;DR):
counts = pd.DataFrame(pd.value_counts(yourlist))  # count how often each value occurs in the list
pd.merge(left=df, left_index=True,
         right=counts, right_index=True,
         how='left').fillna(0)  # indices never seen in the list get count 0
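Since the data above is random, here is a small deterministic sketch of the same pattern with made-up values:
df = pd.DataFrame({'nomnom': ['cookies', 'cake']})  # index 0, 1
yourlist = [1, 1, 0]
counts = pd.DataFrame(pd.value_counts(yourlist))
print(pd.merge(left=df, left_index=True, right=counts, right_index=True, how='left').fillna(0))
# index 0 occurs once in yourlist, index 1 occurs twice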