I have the following dataset:
d = {'id': [1,1,1,1,3,3,3,4,4,4], 'number': [3,3,3,1,4,6,4,5,5,3]}
df = pd.DataFrame(data=d)
I want to get a new dataframe with the columns "id" and "final_number", where each id is assigned the most "popular" number within its group of ids from the table above. How can I do it?
The result should be:
The most "popular" number within each group is its mode:
df.groupby('id').number.apply(lambda x : x.mode()[0]).reset_index()
Out[1499]:
id number
0 1 3
1 3 4
2 4 5
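For completeness, here is a self-contained sketch that also renames the result column to "final_number", as the question asks (only the rename is added to the answer above):
import pandas as pd

d = {'id': [1, 1, 1, 1, 3, 3, 3, 4, 4, 4],
     'number': [3, 3, 3, 1, 4, 6, 4, 5, 5, 3]}
df = pd.DataFrame(data=d)

# mode() can return several values on ties; [0] keeps the smallest
result = (df.groupby('id').number
            .apply(lambda x: x.mode()[0])
            .reset_index()
            .rename(columns={'number': 'final_number'}))
print(result)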
Using groupby + value_counts + head:
df.groupby('id')\
  .number.value_counts()\
  .groupby(level=0)\
  .head(1)\
  .reset_index(name='count')\
  .drop(columns='count')
id number
0 1 3
1 3 4
2 4 5
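This works because value_counts sorts each group's counts in descending order, so head(1) keeps the most frequent value per group. A compact equivalent (a sketch on the same data) aggregates with idxmax directly:
# value_counts().idxmax() returns the value with the highest count
df.groupby('id')['number'].agg(lambda s: s.value_counts().idxmax()).reset_index()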
I have a dataframe which looks like the following:
How do I reconstruct this table and sum the rows where length is under 2, so that the output df looks like the following?
Any suggestion would be greatly appreciated.
Thanks,
Shei
Add a new column to indicate the length group:
df['len_group'] = df['length'].astype(str)
df.loc[df['length']<=2, 'len_group'] = '<=2'
Then group by the new column and aggregate:
df.groupby('len_group').sum()
Test Case:
df = pd.DataFrame({'length':[1,2,3,4,5], 'val':[2,3,4,5,6]})
length val
0 1 2
1 2 3
2 3 4
3 4 5
4 5 6
df['len_group'] = df['length'].astype(str)
df.loc[df['length']<=2, 'len_group'] = '<=2'
df_result = df.groupby('len_group')[['val']].sum()
           val
len_group
3            4
4            5
5            6
<=2          5
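An alternative sketch using pd.cut, which bins the lengths directly instead of building the string column by hand; the bin edges below assume lengths between 1 and 5, as in the test case:
import pandas as pd

df = pd.DataFrame({'length': [1, 2, 3, 4, 5], 'val': [2, 3, 4, 5, 6]})

# right-closed bins: (0, 2] -> '<=2', then one bin per remaining length
df['len_group'] = pd.cut(df['length'], bins=[0, 2, 3, 4, 5],
                         labels=['<=2', '3', '4', '5'])
print(df.groupby('len_group', observed=True)[['val']].sum())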
I have a sorted dataframe with an ID, and a value column, which looks like:
ID value
A 10
A 10
A 10
B 15
B 15
C 10
C 10
...
How can I create a new dataframe that counts the "new" distinct values in terms of the number of different IDs, so that it basically goes over my dataframe and looks like:
Number of ID Number of distinct values
1 1
2 2
3 2
In the case above we have 3 different IDs, but IDs A and C have the same value.
So for the first row in the new dataframe:
Number of ID = 1, because we have seen 1 different ID so far
Number of distinct values = 1, because we have seen one distinct value so far
Second row:
Number of ID = 2, because we move on to row 4 in the old dataframe (we are only interested in new IDs)
Number of distinct values = 2, because the value changed to 15, which had not occurred so far
I think you need to build a new DataFrame with DataFrame.drop_duplicates and pd.factorize:
Replace the duplicated values with NaN, forward fill them, and then call pd.factorize:
df1 = df.drop_duplicates(['ID','value']).copy()
df1['Number of ID'] = range(1, len(df1)+1)
df1['Number of distinct values'] = pd.factorize(df1['value'].mask(df1['value'].duplicated()).ffill())[0] + 1
print (df1)
ID value Number of ID Number of distinct values
0 A 10 1 1
3 B 15 2 2
5 C 10 3 2
I changed the data for better testing:
print (df)
ID value
0 A 10
1 A 10
2 A 10
3 B 15
4 B 15
5 C 10
6 C 15
df1 = df.drop_duplicates(['ID','value']).copy()
df1['Number of ID'] = range(1, len(df1)+1)
df1['Number of distinct values'] = pd.factorize(df1['value'].mask(df1['value'].duplicated()).ffill())[0] + 1
print (df1)
ID value Number of ID Number of distinct values
0 A 10 1 1
3 B 15 2 2
5 C 10 3 2
6 C 15 4 2
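To see why the mask/ffill/factorize chain produces the running distinct count, here are the intermediate steps on the modified data (a sketch, using the same df1 as above):
s = df1['value']                     # [10, 15, 10, 15]
masked = s.mask(s.duplicated())      # [10, 15, NaN, NaN] - repeats blanked out
filled = masked.ffill()              # [10, 15, 15, 15]   - carry the last new value forward
codes = pd.factorize(filled)[0] + 1  # [1, 2, 2, 2]       - running count of distinct values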
The following simpler version with factorize + cumsum works incorrectly if a value can repeat across IDs:
df = pd.DataFrame({'Number of ID': range(1, len(df1)+1),
'Number of distinct values': np.cumsum(pd.factorize(df1['value'])[0])+1})
print (df)
Number of ID Number of distinct values
0 1 1
1 2 2
2 3 2
3 4 3
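The reason this version over-counts: factorize gives a repeated value its original nonzero code every time it reappears, and cumsum keeps adding those codes (a minimal demonstration, assuming pandas and numpy are imported as pd and np):
vals = df1['value']                          # [10, 15, 10, 15]
print(pd.factorize(vals)[0])                 # [0 1 0 1]
print(np.cumsum(pd.factorize(vals)[0]) + 1)  # [1 2 2 3] - the second 15 bumps the count again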
I want to repeat a specific row of pandas data frame for a given number of times.
For example, this is my data frame
df= pd.DataFrame({
'id' : ['1','1', '2', '2','2','3'],
'val' : ['2015_11','2016_2','2011_9','2011_11','2012_2','2018_2'],
'data':['a','a','b','b','b','c']
})
print(df)
Here, "Val" column contains date in string format. It has a specific pattern 'Year_month'. For the same "id", I want the rows repeated the number of times that is equivalent to the difference between the given "val" column values. All other columns except the val column should have the duplicated value of previous row.
The output should be:
Using resample:
df.val = pd.to_datetime(df.val, format='%Y_%m')
# 'M' is the month-end frequency ('ME' in pandas >= 2.2)
out = df.set_index('val').groupby('id').data.resample('M').ffill().reset_index()
out.assign(val=out.val.dt.strftime('%Y_%m'))
id val data
0 1 2015_11 a
1 1 2015_12 a
2 1 2016_01 a
3 1 2016_02 a
4 2 2011_09 b
5 2 2011_10 b
6 2 2011_11 b
7 2 2011_12 b
8 2 2012_01 b
9 2 2012_02 b
10 3 2018_02 c
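An alternative sketch that avoids resample, expanding each id's month range explicitly with pd.period_range (assumes the same df as in the question):
import pandas as pd

df = pd.DataFrame({
    'id':   ['1', '1', '2', '2', '2', '3'],
    'val':  ['2015_11', '2016_2', '2011_9', '2011_11', '2012_2', '2018_2'],
    'data': ['a', 'a', 'b', 'b', 'b', 'c'],
})

# monthly periods spanning each id's first to last date
periods = pd.to_datetime(df['val'], format='%Y_%m').dt.to_period('M')
spans = periods.groupby(df['id']).agg(['min', 'max'])
data_by_id = df.groupby('id')['data'].first()

out = pd.DataFrame([
    {'id': i, 'val': p.strftime('%Y_%m'), 'data': data_by_id[i]}
    for i, (lo, hi) in spans.iterrows()
    for p in pd.period_range(lo, hi, freq='M')
])
print(out)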
I have a data frame df1 with data that looks like this:
Item Store Sales Dept
0 1 1 5 A
1 1 2 3 A
2 1 3 4 A
3 2 1 3 A
4 2 2 3 A
I then want to use group by to see the total sales by item:
df2 = df1.groupby(['Item']).agg({'Item':'first','Sales':'sum'})
Which gives me:
Item Sales
0 1 12
1 2 6
And then I add a column with the rank of the item in terms of number of sales:
df2['Item Rank'] = df2['Sales'].rank(ascending=False,method='min').astype(int)
So that I get:
Item Sales Item Rank
0 1 12 1
1 2 6 2
I now want to add the Dept column to df2, so that I have
Item Sales Item Rank Dept
0 1 12 1 A
1 2 6 2 A
But everything I have tried has failed.
I either get an empty column when I try to add the column in from the beginning, or a df of the wrong size when I try to concatenate the new df with the column from the original df.
df1.groupby(['Item']).agg({'Item':'first','Sales':'sum','Dept':'first'}).\
    assign(Itemrank=lambda d: d['Sales'].rank(ascending=False, method='min').astype(int))
Out[64]:
      Item  Sales Dept  Itemrank
Item
1        1     12    A         1
2        2      6    A         2
This is a bit unusual, but you can also add the Dept column while doing the groupby itself:
A simple option is just to hard code the value if you already know what it needs to be:
df2 = df1.groupby(['Item']).agg({'Item':'first',
'Sales':'sum',
'Dept': lambda x: 'A'})
Or you could take it from the dataframe itself:
df2 = df1.groupby(['Item']).agg({'Item':'first',
'Sales':'sum',
'Dept': lambda x: df1['Dept'].iloc[0]})
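Note that lambda x: df1['Dept'].iloc[0] takes the first Dept in the whole frame, which only works here because every row is in department A. To pick each group's own first department, use the group argument passed to the lambda (equivalent to the 'first' aggregation above):
df2 = df1.groupby(['Item']).agg({'Item': 'first',
                                 'Sales': 'sum',
                                 'Dept': lambda x: x.iloc[0]})  # x holds just this group's Dept values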
I have a dataframe on which I'm using pandas.groupby with a specific column and then running aggregate statistics (mean, median, count) on the produced groups. I want certain column values to be treated as members of the same group, rather than getting one distinct group per distinct value in the grouping column, and I was wondering how to accomplish this.
For example:
>> my_df
ID SUB_NUM ELAPSED_TIME
1 1 1.7
2 2 1.4
3 2 2.1
4 4 3.0
5 6 1.8
6 6 1.2
So instead of the typical behavior:
>> my_df.groupby(['SUB_NUM']).agg(['count'])
ID SUB_NUM Count
1 1 1
2 2 2
4 4 1
5 6 2
I want certain values (SUB_NUM in [1, 2]) to be computed as one group so instead something like below is produced:
>> # Some mystery pandas function calls
ID SUB_NUM Count
1 1, 2 3
4 4 1
5 6 2
Any help would be much appreciated, thanks!
This works for me:
#for join values convert values to string
df['SUB_NUM'] = df['SUB_NUM'].astype(str)
#create mapping dict by dict comprehension
L = ['1','2']
d = {x: ','.join(L) for x in L}
print (d)
{'2': '1,2', '1': '1,2'}
#replace values by dict
a = df['SUB_NUM'].replace(d)
print (a)
0 1,2
1 1,2
2 1,2
3 4
4 6
5 6
Name: SUB_NUM, dtype: object
#groupby by mapping column and aggregating `first` and `size`
print (df.groupby(a)
.agg({'ID':'first', 'ELAPSED_TIME':'size'})
.rename(columns={'ELAPSED_TIME':'Count'})
.reset_index())
SUB_NUM ID Count
0 1,2 1 3
1 4 4 1
2 6 5 2
For the difference between the two aggregations, see: What is the difference between size and count in pandas?
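In short, size counts rows (NaN included) while count counts only non-NaN values; a minimal demonstration, assuming a small frame with one missing value:
import pandas as pd
import numpy as np

t = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, np.nan, 2.0]})
print(t.groupby('g')['v'].size())   # a: 2, b: 1 - counts rows
print(t.groupby('g')['v'].count())  # a: 1, b: 1 - skips the NaN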
You can create another column mapping the SUB_NUM values to actual groups and then group by it.
my_df['SUB_GROUP'] = my_df['SUB_NUM'].apply(lambda x: 1 if x < 3 else x)
my_df.groupby(['SUB_GROUP']).agg(['count'])
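A runnable sketch of this approach on the question's data; named aggregation (an assumption, not in the original answer) is used so the output matches the asked-for Count column:
import pandas as pd

my_df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                      'SUB_NUM': [1, 2, 2, 4, 6, 6],
                      'ELAPSED_TIME': [1.7, 1.4, 2.1, 3.0, 1.8, 1.2]})

# map SUB_NUM 1 and 2 onto one group label, keep the rest as-is
my_df['SUB_GROUP'] = my_df['SUB_NUM'].apply(lambda x: 1 if x < 3 else x)

print(my_df.groupby('SUB_GROUP')
           .agg(ID=('ID', 'first'), Count=('ELAPSED_TIME', 'count'))
           .reset_index())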