Replace the 0 in a column with groupby median in pandas - python
I have a data frame as shown below. It contains sales data for two health care products from December 2016 to November 2018.
product profit bougt_date discount salary
A 50 2016-12-01 5 25
A 50 2017-01-03 4 20
B 200 2016-12-24 10 100
A 50 2017-01-18 3 0
B 200 2017-01-28 15 80
A 50 2017-01-18 6 15
B 200 2017-01-28 20 0
A 50 2017-04-18 6 0
B 200 2017-12-08 25 0
A 50 2017-11-18 6 20
B 200 2017-08-21 20 90
B 200 2017-12-28 30 110
A 50 2018-03-18 10 0
B 300 2018-06-08 45 100
B 300 2018-09-20 50 60
A 50 2018-11-18 8 45
B 300 2018-11-28 35 0
From the above, I would like to replace each 0 in the salary column with the groupby median of salary for the corresponding product.
Explanation (the medians are computed from the non-zero salaries of each product):
A : 15, 20, 20, 25, 45
So the median = 20.
B : 60, 80, 90, 100, 100, 110
So the median = 95.
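These group medians can be checked quickly by dropping the zeros before grouping (a minimal sketch, assuming df is the frame above):
df[df['salary'].ne(0)].groupby('product')['salary'].median()
# product
# A    20.0
# B    95.0
# Name: salary, dtype: float64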
Expected Output
product profit bougt_date discount salary
A 50 2016-12-01 5 25
A 50 2017-01-03 4 20
B 200 2016-12-24 10 100
A 50 2017-01-18 3 20
B 200 2017-01-28 15 80
A 50 2017-01-18 6 15
B 200 2017-01-28 20 95
A 50 2017-04-18 6 20
B 200 2017-12-08 25 95
A 50 2017-11-18 6 20
B 200 2017-08-21 20 90
B 200 2017-12-28 30 110
A 50 2018-03-18 10 20
B 300 2018-06-08 45 100
B 300 2018-09-20 50 60
A 50 2018-11-18 8 45
B 300 2018-11-28 35 95
You can try masking the 0 values with pd.Series.mask and then filling them with the per-product np.nanmedian:
import numpy as np

fill_vals = df.salary.mask(df.salary.eq(0)).groupby(df['product']).transform(np.nanmedian)
df.assign(salary=df.salary.mask(df.salary.eq(0), fill_vals))  # assign returns a new DataFrame; df itself is left unchanged
product profit bougt_date discount salary
0 A 50 2016-12-01 5 25
1 A 50 2017-01-03 4 20
2 B 200 2016-12-24 10 100
3 A 50 2017-01-18 3 20
4 B 200 2017-01-28 15 80
5 A 50 2017-01-18 6 15
6 B 200 2017-01-28 20 95
7 A 50 2017-04-18 6 20
8 B 200 2017-12-08 25 95
9 A 50 2017-11-18 6 20
10 B 200 2017-08-21 20 90
11 B 200 2017-12-28 30 110
12 A 50 2018-03-18 10 20
13 B 300 2018-06-08 45 100
14 B 300 2018-09-20 50 60
15 A 50 2018-11-18 8 45
16 B 300 2018-11-28 35 95
Or, using np.where:
df['salary'] = np.where(df['salary'] == 0,
                        df['salary'].replace(0, np.nan).groupby(df['product']).transform('median'),
                        df['salary'])
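A similar one-liner sketch, assuming the same df: replace the zeros with NaN, then fill them from the per-product median of the remaining salaries.
s = df['salary'].replace(0, np.nan)                                     # treat 0 as missing
df['salary'] = s.fillna(s.groupby(df['product']).transform('median'))  # fill per product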
First use .groupby and .transform on the salary column to get the per-product median. Then locate the salaries that are 0 with .loc and set them equal to that median salary.
# NOTE - the line below uses `median` instead of `np.nanmedian`, so the zeros are
# included in the median calculation and the results differ from the answer above.
# To anyone reading this, please pick whichever behaviour fits your situation.
df.loc[df['salary'] == 0, 'salary'] = df.groupby('product')['salary'].transform('median')
df
output:
product profit bougt_date discount salary
0 A 50 2016-12-01 5 25.0
1 A 50 2017-01-03 4 20.0
2 B 200 2016-12-24 10 100.0
3 A 50 2017-01-18 3 17.5
4 B 200 2017-01-28 15 80.0
5 A 50 2017-01-18 6 15.0
6 B 200 2017-01-28 20 80.0
7 A 50 2017-04-18 6 17.5
8 B 200 2017-12-08 25 80.0
9 A 50 2017-11-18 6 20.0
10 B 200 2017-08-21 20 90.0
11 B 200 2017-12-28 30 110.0
12 A 50 2018-03-18 10 17.5
13 B 300 2018-06-08 45 100.0
14 B 300 2018-09-20 50 60.0
15 A 50 2018-11-18 8 45.0
16 B 300 2018-11-28 35 80.0
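If you want this .loc approach to ignore the zeros as well (and so reproduce the expected output above), one possible sketch is to compute the fill values from the non-zero salaries first:
# mask the zeros before taking the median so they are not counted
nonzero_median = df['salary'].mask(df['salary'].eq(0)).groupby(df['product']).transform('median')
df.loc[df['salary'] == 0, 'salary'] = nonzero_median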
Related
Slice values of a column and calculate average in python
I have a dataframe with three columns:

     a    b   c
     0   73  12
    73   80   2
    80  100   5
   100  150  13

Values in "a" and "b" are days. I need to find the average value of "c" in each 30-day interval (slice the values inside [min(a), max(b)] into 30-day bins and calculate the average of c). As a result I want a dataframe like this:

    aa   bb  c_avg
     0   30  12
    30   60  12
    60   90  6.33
    90  120  9
   120  150  13

Another sample of data could be:

        a       b           c
0  1264.0  1629.0    0.000000
1  1629.0  1632.0  133.333333
6  1632.0  1699.0    0.000000
2  1699.0  1706.0   21.428571
7  1706.0  1723.0    0.000000
3  1723.0  1726.0   50.000000
8  1726.0  1890.0    0.000000
4  1890.0  1893.0   33.333333
1  1893.0  1994.0    0.000000

How can I get to the final table?
First create a ranges DataFrame defined by the a and b columns:

a = np.arange(0, 180, 30)
df1 = pd.DataFrame({'aa':a[:-1], 'bb':a[1:]})
#print (df1)

Then cross join all rows using a helper column tmp:

df3 = pd.merge(df1.assign(tmp=1), df.assign(tmp=1), on='tmp')
#print (df3)

And last filter - there are 2 solutions, depending on which columns are used for the filtering:

df4 = df3[df3['aa'].between(df3['a'], df3['b']) | df3['bb'].between(df3['a'], df3['b'])]
print (df4)
     aa   bb  tmp    a    b   c
0     0   30    1    0   73  12
4    30   60    1    0   73  12
8    60   90    1    0   73  12
10   60   90    1   80  100   5
14   90  120    1   80  100   5
15   90  120    1  100  150  13
19  120  150    1  100  150  13

df4 = df4.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df4)
    aa   bb     c
0    0   30  12.0
1   30   60  12.0
2   60   90   8.5
3   90  120   9.0
4  120  150  13.0

df5 = df3[df3['a'].between(df3['aa'], df3['bb']) | df3['b'].between(df3['aa'], df3['bb'])]
print (df5)
     aa   bb  tmp    a    b   c
0     0   30    1    0   73  12
8    60   90    1    0   73  12
9    60   90    1   73   80   2
10   60   90    1   80  100   5
14   90  120    1   80  100   5
15   90  120    1  100  150  13
19  120  150    1  100  150  13

df5 = df5.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df5)
    aa   bb          c
0    0   30  12.000000
1   60   90   6.333333
2   90  120   9.000000
3  120  150  13.000000
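As a side note, in pandas 1.2 and later the helper tmp column is not needed - the cross join can be written directly (a small sketch, assuming the same df1 and df):
df3 = df1.merge(df, how='cross')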
Add a column in dataframe conditionally from values in other dataframe python
I have a table in pandas, df:

id  product_1  count
 1        100     10
 2        200     20
 3        100     30
 4        400     40
 5        500     50
 6        200     60
 7        100     70

I also have another table in a dataframe, df2:

product  score
    100      5
    200     10
    300     15
    400     20
    500     25
    600     30
    700     35

I have to create a new column score in my first df, taking the values of score from df2 with respect to product_1. My final output should be:

id  product_1  count  score
 1        100     10      5
 2        200     20     10
 3        100     30      5
 4        400     40     20
 5        500     50     25
 6        200     60     10
 7        100     70      5

Any ideas how to achieve it?
Use map:

df['score'] = df['product_1'].map(df2.set_index('product')['score'].to_dict())
print (df)
   id  product_1  count  score
0   1        100     10      5
1   2        200     20     10
2   3        100     30      5
3   4        400     40     20
4   5        500     50     25
5   6        200     60     10
6   7        100     70      5

Or merge:

df = pd.merge(df,df2, left_on='product_1', right_on='product', how='left')
print (df)
   id  product_1  count  product  score
0   1        100     10      100      5
1   2        200     20      200     10
2   3        100     30      100      5
3   4        400     40      400     20
4   5        500     50      500     25
5   6        200     60      200     10
6   7        100     70      100      5

EDIT by comment:

df['score'] = df['product_1'].map(df2.set_index('product')['score'].to_dict())
df['final_score'] = (df['count'].mul(0.6).div(df.id)).add(df.score.mul(0.4))
print (df)
   id  product_1  count  score  final_score
0   1        100     10      5          8.0
1   2        200     20     10         10.0
2   3        100     30      5          8.0
3   4        400     40     20         14.0
4   5        500     50     25         16.0
5   6        200     60     10         10.0
6   7        100     70      5          8.0
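Note that map also accepts a Series keyed by its index, so the to_dict() call is optional (a small sketch, assuming the same df and df2):
df['score'] = df['product_1'].map(df2.set_index('product')['score'])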
Pandas compare 2 dataframes by specific rows in all columns
I have the following Pandas dataframe of some raw numbers:

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)
col_raw_headers = ['07_08_19 #1','07_08_19 #2','07_08_19 #2.1','11_31_19 #1','11_31_19 #1.1','11_31_19 #1.3','12_15_20 #1','12_15_20 #2','12_15_20 #2.1','12_15_20 #2.2']
col_raw_trial_info = ['Quantity1','Quantity2','Quantity3','Quantity4','Quantity5','Quantity6','TimeStamp',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
cols_raw = [[1,75,9,7,-4,0.4,'07/08/2019 05:11'],[1,11,20,-17,12,0.8,'07/08/2019 10:54'],[2,0.9,17,102,56,0.6,'07/08/2019 21:04'],[1,70,4,75,0.8,0.4,'11/31/2019 11:15'],[2,60,74,41,-36,0.3,'11/31/2019 16:50'],[3,17,12,-89,30,0.1,'11/31/2019 21:33'],[1,6,34,496,-84,0.5,'12/15/2020 01:36'],[1,3,43,12,-23,0.5,'12/15/2020 07:01'],[2,5,92,17,64,0.5,'12/15/2020 11:15'],[3,7,11,62,-11,0.5,'12/15/2020 21:45']]
both_values = [[1,2,3,4,8,4,3,8,7],[6,5,3,7,3,23,27,3,11],[65,3,6,78,9,2,45,6,7],[4,3,6,8,3,5,66,32,84],[2,3,11,55,3,7,33,65,34],[22,1,6,32,5,6,4,3,898],[1,6,3,2,6,55,22,6,23],[34,37,46,918,0,37,91,12,68],[51,20,1,34,12,59,78,6,101],[12,71,34,94,1,73,46,51,21]]
processed_cols = ['c_1trial','14_1','14_2','8_1','8_2','8_3','28_1','24_1','24_2','24_3']
df_raw = pd.DataFrame(zip(*cols_raw))
df_temp = pd.DataFrame(zip(*both_values))
df_raw = pd.concat([df_raw,df_temp])
df_raw.columns=col_raw_headers
df_raw.insert(0,'Tr_id',col_raw_trial_info)
df_raw.reset_index(drop=True,inplace=True)

It looks like this:

    Tr_id      07_08_19 #1  07_08_19 #2  07_08_19 #2.1  11_31_19 #1  11_31_19 #1.1  11_31_19 #1.3  12_15_20 #1  12_15_20 #2  12_15_20 #2.1  12_15_20 #2.2
0   Quantity1  1  1  2  1  2  3  1  1  2  3
1   Quantity2  75  11  0.9  70  60  17  6  3  5  7
2   Quantity3  9  20  17  4  74  12  34  43  92  11
3   Quantity4  7  -17  102  75  41  -89  496  12  17  62
4   Quantity5  -4  12  56  0.8  -36  30  -84  -23  64  -11
5   Quantity6  0.4  0.8  0.6  0.4  0.3  0.1  0.5  0.5  0.5  0.5
6   TimeStamp  07/08/2019 05:11  07/08/2019 10:54  07/08/2019 21:04  11/31/2019 11:15  11/31/2019 16:50  11/31/2019 21:33  12/15/2020 01:36  12/15/2020 07:01  12/15/2020 11:15  12/15/2020 21:45
7   NaN  1  6  65  4  2  22  1  34  51  12
8   NaN  2  5  3  3  3  1  6  37  20  71
9   NaN  3  3  6  6  11  6  3  46  1  34
10  NaN  4  7  78  8  55  32  2  918  34  94
11  NaN  8  3  9  3  3  5  6  0  12  1
12  NaN  4  23  2  5  7  6  55  37  59  73
13  NaN  3  27  45  66  33  4  22  91  78  46
14  NaN  8  3  6  32  65  3  6  12  6  51
15  NaN  7  11  7  84  34  898  23  68  101  21

I have a separate dataframe with a processed version of these numbers, in which some of the header rows from above have been deleted and the column names have been changed. Here is the second dataframe:

df_processed = pd.DataFrame(zip(*both_values),columns=processed_cols)
df_processed = df_processed[[3,4,9,7,0,2,1,6,8,5]]

   8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
0    4    2    12    34         1    65     6     1    51   22
1    3    3    71    37         2     3     5     6    20    1
2    6   11    34    46         3     6     3     3     1    6
3    8   55    94   918         4    78     7     2    34   32
4    3    3     1     0         8     9     3     6    12    5
5    5    7    73    37         4     2    23    55    59    6
6   66   33    46    91         3    45    27    22    78    4
7   32   65    51    12         8     6     3     6     6    3
8   84   34    21    68         7     7    11    23   101  898

Common parts of each dataframe: for each column, rows 8 onwards of the raw dataframe are the same as row 1 onwards of the processed dataframe. The order of the columns in the two dataframes is not the same.

Output combination: I am looking to compare rows 8-16 in columns 1-10 of the raw dataframe df_raw to the processed dataframe df_processed. If the columns match each other, then I would like to extract rows 1-7 of df_raw and the column header from df_processed.
Example: the values in column c_1trial only match the values in rows 8-16 of the column 07_08_19 #1. I see 2 steps: (1) find some way to determine that these 2 columns match each other, and (2) if 2 columns do match each other, select rows from the matching columns for the sample output. Here is the output I am looking to get:

Tr_id      07_08_19 #1  07_08_19 #2  07_08_19 #2.1  11_31_19 #1  11_31_19 #1.1  11_31_19 #1.3  12_15_20 #1  12_15_20 #2  12_15_20 #2.1  12_15_20 #2.2
Quantity1  1  1  2  1  2  3  1  1  2  3
Quantity2  75  11  0.9  70  60  17  6  3  5  7
Quantity3  9  20  17  4  74  12  34  43  92  11
Proc_Name  c_1trial  14_1  14_2  8_1  8_2  8_3  28_1  24_1  24_2  24_3
Quantity4  7  -17  102  75  41  -89  496  12  17  62
Quantity5  -4  12  56  0.8  -36  30  -84  -23  64  -11
Quantity6  0.4  0.8  0.6  0.4  0.3  0.1  0.5  0.5  0.5  0.5
TimeStamp  07/08/2019 05:11  07/08/2019 10:54  07/08/2019 21:04  11/31/2019 11:15  11/31/2019 16:50  11/31/2019 21:33  12/15/2020 01:36  12/15/2020 07:01  12/15/2020 11:15  12/15/2020 21:45

My attempts are giving trouble:

print (df_raw.iloc[7:,1:] == df_processed).all(axis=1)

gives

ValueError: Can only compare identically-labeled DataFrame objects

and

print (df_raw.ix[7:].values == df_processed.values) #gives False

gives False. The problem with my second attempt is that I am not selecting .all(axis=1). When I make a comparison I want to do this across all rows of every column, not just one row.

Question: Is there a way to select out the output I showed above from these 2 dataframes?
Does this look like the output you're looking for?

Raw dataframe df:

        Tr_id    07_08_19  07_08_19.1  07_08_19.2    11_31_19  11_31_19.1
0   Quantity1           1           1           2           1           2
1   Quantity2          75          11         0.9          70          60
2   Quantity3           9          20          17           4          74
3   Quantity4           7         -17         102          75          41
4   Quantity5          -4          12          56         0.8         -36
5   Quantity6         0.4         0.8         0.6         0.4         0.3
6   TimeStamp  07/08/2019  07/08/2019  07/08/2019  11/31/2019  11/31/2019
7         NaN           1           6          65           4           2
8         NaN           2           5           3           3           3
9         NaN           3           3           6           6          11
10        NaN           4           7          78           8          55
11        NaN           8           3           9           3           3
12        NaN           4          23           2           5           7
13        NaN           3          27          45          66          33
14        NaN           8           3           6          32          65
15        NaN           7          11           7          84          34

    11_31_19.2    12_15_20  12_15_20.1  12_15_20.2  12_15_20.3
0            3           1           1           2           3
1           17           6           3           5           7
2           12          34          43          92          11
3          -89         496          12          17          62
4           30         -84         -23          64         -11
5          0.1         0.5         0.5         0.5         0.5
6   11/31/2019  12/15/2020  12/15/2020  12/15/2020  12/15/2020
7           22           1          34          51          12
8            1           6          37          20          71
9            6           3          46           1          34
10          32           2         918          34          94
11           5           6           0          12           1
12           6          55          37          59          73
13           4          22          91          78          46
14           3           6          12           6          51
15         898          23          68         101          21

Processed dataframe dfp:

   8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
0    4    2    12    34         1    65     6     1    51   22
1    3    3    71    37         2     3     5     6    20    1
2    6   11    34    46         3     6     3     3     1    6
3    8   55    94   918         4    78     7     2    34   32
4    3    3     1     0         8     9     3     6    12    5
5    5    7    73    37         4     2    23    55    59    6
6   66   33    46    91         3    45    27    22    78    4
7   32   65    51    12         8     6     3     6     6    3
8   84   34    21    68         7     7    11    23   101  898

Code:

df = pd.read_csv('raw_df.csv')        # raw dataframe
dfp = pd.read_csv('processed_df.csv') # processed dataframe
dfr = df.drop('Tr_id', axis=1)
x = pd.DataFrame()
for col_raw in dfr.columns:
    for col_p in dfp.columns:
        if (dfr.tail(9).astype(int)[col_raw] == dfp[col_p]).all():
            series = dfr[col_raw].head(7).tolist()
            series.append(col_raw)
            x[col_p] = series
x = pd.concat([df['Tr_id'].head(7), x], axis=1)

Output:

        Tr_id    c_1trial        14_1        14_2         8_1         8_2
0   Quantity1           1           1           2           1           2
1   Quantity2          75          11         0.9          70          60
2   Quantity3           9          20          17           4          74
3   Quantity4           7         -17         102          75          41
4   Quantity5          -4          12          56         0.8         -36
5   Quantity6         0.4         0.8         0.6         0.4         0.3
6   TimeStamp  07/08/2019  07/08/2019  07/08/2019  11/31/2019  11/31/2019
7         NaN    07_08_19  07_08_19.1  07_08_19.2    11_31_19  11_31_19.1

           8_3        28_1        24_1        24_2        24_3
0            3           1           1           2           3
1           17           6           3           5           7
2           12          34          43          92          11
3          -89         496          12          17          62
4           30         -84         -23          64         -11
5          0.1         0.5         0.5         0.5         0.5
6   11/31/2019  12/15/2020  12/15/2020  12/15/2020  12/15/2020
7   11_31_19.2    12_15_20  12_15_20.1  12_15_20.2  12_15_20.3

I think the code could be more concise but maybe this does the job.
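If checking every column pair in nested loops becomes slow, a dict keyed on the trailing values of each raw column can do the matching in one pass (a rough sketch, assuming dfr and dfp as defined above):
# hash the last 9 values of every raw column
tail_raw = dfr.tail(9).astype(float)
lookup = {tuple(tail_raw[c]): c for c in dfr.columns}
# map each processed column name to the matching raw column name (None if no match)
matches = {p: lookup.get(tuple(dfp[p].astype(float))) for p in dfp.columns}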
Alternative solution, using the DataFrame.isin() method:

In [171]: df1
Out[171]:
   a  b  c
0  1  1  3
1  0  2  4
2  4  2  2
3  0  3  3
4  0  4  4

In [172]: df2
Out[172]:
   a  b  c
0  0  3  3
1  1  1  1
2  0  3  4
3  4  2  3
4  0  4  4

In [173]: common = pd.merge(df1, df2)

In [174]: common
Out[174]:
   a  b  c
0  0  3  3
1  0  4  4

In [175]: df1[df1.isin(common.to_dict('list')).all(axis=1)]
Out[175]:
   a  b  c
3  0  3  3
4  0  4  4

Or if you want to subtract the second data set from the first one, i.e. the Pandas equivalent of SQL's:

select col1, .., colN from tableA
minus
select col1, .., colN from tableB

in Pandas:

In [176]: df1[~df1.isin(common.to_dict('list')).all(axis=1)]
Out[176]:
   a  b  c
0  1  1  3
1  0  2  4
2  4  2  2
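A related sketch for the same subtraction, using merge with indicator=True to keep only the rows of df1 that do not appear in df2 (assuming the same df1 and df2):
diff = (df1.merge(df2, how='left', indicator=True)
           .query('_merge == "left_only"')
           .drop(columns='_merge'))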
I came up with this using loops. It is very disappointing:

holder = []
for randm,pp in enumerate(list(df_processed)):
    list1 = df_processed[pp].tolist()
    for car,rr in enumerate(list(df_raw)):
        list2 = df_raw.loc[7:,rr].tolist()
        if list1==list2:
            holder.append([rr,pp])
df_intermediate = pd.DataFrame(holder,columns=['A','B'])
df_c = df_raw.loc[:6,df_intermediate.iloc[:,0].tolist()]
df_c.loc[df_c.shape[0]] = df_intermediate.iloc[:,1].tolist()
df_c.insert(0,list(df_raw)[0],df_raw[list(df_raw)[0]])
df_c.iloc[-1,0]='Proc_Name'
df_c = df_c.reindex([0,1,2]+[7]+[3,4,5,6]).reset_index(drop=True)

Output:

   Tr_id      11_31_19 #1  11_31_19 #1.1  12_15_20 #2.2  12_15_20 #2  07_08_19 #1  07_08_19 #2.1  07_08_19 #2  12_15_20 #1  12_15_20 #2.1  11_31_19 #1.3
0  Quantity1  1  2  3  1  1  2  1  1  2  3
1  Quantity2  70  60  7  3  75  0.9  11  6  5  17
2  Quantity3  4  74  11  43  9  17  20  34  92  12
3  Proc_Name  8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
4  Quantity4  75  41  62  12  7  102  -17  496  17  -89
5  Quantity5  0.8  -36  -11  -23  -4  56  12  -84  64  30
6  Quantity6  0.4  0.3  0.5  0.5  0.4  0.6  0.8  0.5  0.5  0.1
7  TimeStamp  11/31/2019 11:15  11/31/2019 16:50  12/15/2020 21:45  12/15/2020 07:01  07/08/2019 05:11  07/08/2019 21:04  07/08/2019 10:54  12/15/2020 01:36  12/15/2020 11:15  11/31/2019 21:33

The order of the columns is different than what I required, but that is a minor problem. The real problem with this approach is using loops. I wish there was a better way to do this using some built-in Pandas functionality. If you have a better solution, please post it. Thank you.
How can I change date type for graph labels
You can see in Out[7] that the date column has values like 2015.01.02, but in Out[17] the x-axis of the graph only shows 0, 50, 100. How can I change the x-axis to show labels like 2015.01.02? Any suggestions on this, please?
You can try to first convert the column date with to_datetime, then set_index, and finally change the datetime format of the x-axis with strftime and set_major_formatter:

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

df['date'] = pd.to_datetime(df['date'])
print df
        date   a   b  c
0 2015-01-02  10  20  3
1 2015-01-05  40  50  6
2 2015-01-06  70  80  8
3 2015-01-07  80  50  9
4 2015-01-08  90  50  3
5 2015-01-09  10  20  3
6 2015-01-10  40  50  6
7 2015-01-11  70  80  8
8 2015-01-12  80  50  9
9 2015-01-13  90  50  3

a = df['a'].pct_change()
b = df['b'].pct_change()
c = df['c'].pct_change()
df['corr'] = pd.rolling_corr(a,b,3)
df = df.set_index('date')
print df
             a   b  c      corr
date
2015-01-02  10  20  3       NaN
2015-01-05  40  50  6       NaN
2015-01-06  70  80  8       NaN
2015-01-07  80  50  9  0.941542
2015-01-08  90  50  3  0.914615
2015-01-09  10  20  3  0.776273
2015-01-10  40  50  6  0.999635
2015-01-11  70  80  8  0.985112
2015-01-12  80  50  9  0.941542
2015-01-13  90  50  3  0.914615

ax = df['corr'].plot()
ticklabels = df.index.strftime('%Y.%m.%d')
ax.xaxis.set_major_formatter(ticker.FixedFormatter(ticklabels))
plt.show()

EDIT by comment: It seems like a bug, but you can change the index and then rotate the labels:

df['date'] = pd.to_datetime(df['date'])
#print df

a = df['a'].pct_change()
b = df['b'].pct_change()
c = df['c'].pct_change()
df['corr'] = pd.rolling_corr(a,b,3)
df = df.set_index('date')
print df
             a   b  c      corr
date
2015-01-02  10  20  3       NaN
2015-01-05  40  50  6       NaN
2015-01-06  70  80  8       NaN
2015-01-07  80  50  9  0.941542
2015-01-08  90  50  3  0.914615
2015-01-09  10  20  3  0.776273
2015-01-10  40  50  6  0.999635
2015-01-11  70  80  8  0.985112
2015-01-12  80  50  9  0.941542
2015-01-13  90  50  3  0.914615

df.index = df.index.strftime('%Y.%m.%d')
print df
             a   b  c      corr
2015.01.02  10  20  3       NaN
2015.01.05  40  50  6       NaN
2015.01.06  70  80  8       NaN
2015.01.07  80  50  9  0.941542
2015.01.08  90  50  3  0.914615
2015.01.09  10  20  3  0.776273
2015.01.10  40  50  6  0.999635
2015.01.11  70  80  8  0.985112
2015.01.12  80  50  9  0.941542
2015.01.13  90  50  3  0.914615

ax = df['corr'].plot()
plt.xticks(rotation=90)
plt.show()
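Note that pd.rolling_corr has since been removed from pandas; in current versions the same rolling correlation is written through the .rolling accessor (a small sketch, assuming the same a and b Series):
df['corr'] = a.rolling(3).corr(b)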
Pandas difference between groupby-size and unique
The goal here is to see how many unique values I have in my database. This is the code I have written:

apps = pd.read_csv('ConcatOwned1_900.csv', sep='\t', usecols=['appid'])
apps[('appid')] = apps[('appid')].astype(int)
apps_list=apps['appid'].unique()
b = apps.groupby('appid').size()
blist = b.unique()
print len(apps_list), len(blist), len(set(b))
>>>7672 2164 2164

Why is there a difference between those two methods?

As requested, I am posting some of my data:

    Unnamed: 0             StudID  No  appid work  work2
0            0  76561193665298433   0     10  nan      0
1            1  76561193665298433   1     20  nan      0
2            2  76561193665298433   2     30  nan      0
3            3  76561193665298433   3     40  nan      0
4            4  76561193665298433   4     50  nan      0
5            5  76561193665298433   5     60  nan      0
6            6  76561193665298433   6     70  nan      0
7            7  76561193665298433   7     80  nan      0
8            8  76561193665298433   8    100  nan      0
9            9  76561193665298433   9    130  nan      0
10          10  76561193665298433  10    220  nan      0
11          11  76561193665298433  11    240  nan      0
12          12  76561193665298433  12    280  nan      0
13          13  76561193665298433  13    300  nan      0
14          14  76561193665298433  14    320  nan      0
15          15  76561193665298433  15    340  nan      0
16          16  76561193665298433  16    360  nan      0
17          17  76561193665298433  17    380  nan      0
18          18  76561193665298433  18    400  nan      0
19          19  76561193665298433  19    420  nan      0
20          20  76561193665298433  20    500  nan      0
21          21  76561193665298433  21    550  nan      0
22          22  76561193665298433  22    620  6.0   3064
33          33  76561193665298434   0     10  nan    837
34          34  76561193665298434   1     20  nan     27
35          35  76561193665298434   2     30  nan      9
36          36  76561193665298434   3     40  nan      5
37          37  76561193665298434   4     50  nan      2
38          38  76561193665298434   5     60  nan      0
39          39  76561193665298434   6     70  nan    403
40          40  76561193665298434   7    130  nan      0
41          41  76561193665298434   8     80  nan      6
42          42  76561193665298434   9    100  nan     10
43          43  76561193665298434  10    220  nan     14
IIUC, based on the attached piece of the dataframe, it seems that you should analyze b.index, not the values of b. Just look:

b = apps.groupby('appid').size()

In [24]: b
Out[24]:
appid
10     2
20     2
30     2
40     2
50     2
60     2
70     2
80     2
100    2
130    2
220    2
240    1
280    1
300    1
320    1
340    1
360    1
380    1
400    1
420    1
500    1
550    1
620    1
dtype: int64

In [25]: set(b)
Out[25]: {1, 2}

But if you do it for b.index you'll get the same values for all 3 methods:

blist = b.index.unique()

In [30]: len(apps_list), len(blist), len(set(b.index))
Out[30]: (23, 23, 23)
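As a side note, counting the distinct appid values directly is shorter with nunique (a small sketch, assuming the same apps frame):
print(apps['appid'].nunique())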