Replace the 0 in a column with groupby median in pandas - python
I have a data frame as shown below. It contains sales data for two health care products from December 2016 to November 2018.
product profit bougt_date discount salary
A 50 2016-12-01 5 25
A 50 2017-01-03 4 20
B 200 2016-12-24 10 100
A 50 2017-01-18 3 0
B 200 2017-01-28 15 80
A 50 2017-01-18 6 15
B 200 2017-01-28 20 0
A 50 2017-04-18 6 0
B 200 2017-12-08 25 0
A 50 2017-11-18 6 20
B 200 2017-08-21 20 90
B 200 2017-12-28 30 110
A 50 2018-03-18 10 0
B 300 2018-06-08 45 100
B 300 2018-09-20 50 60
A 50 2018-11-18 8 45
B 300 2018-11-28 35 0
From the above, I would like to replace each 0 in the salary column with the groupby median of salary for the corresponding product.
Explanation (the medians are computed from the non-zero salaries of each product):
A : 15, 20, 20, 25, 45
So the median = 20.
B : 60, 80, 90, 100, 100, 110
So the median = 95.
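These group medians can be checked quickly by dropping the zeros before grouping (a minimal sketch, assuming df is the frame above):
df[df['salary'].ne(0)].groupby('product')['salary'].median()
# product
# A    20.0
# B    95.0
# Name: salary, dtype: float64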
Expected Output
product profit bougt_date discount salary
A 50 2016-12-01 5 25
A 50 2017-01-03 4 20
B 200 2016-12-24 10 100
A 50 2017-01-18 3 20
B 200 2017-01-28 15 80
A 50 2017-01-18 6 15
B 200 2017-01-28 20 95
A 50 2017-04-18 6 20
B 200 2017-12-08 25 95
A 50 2017-11-18 6 20
B 200 2017-08-21 20 90
B 200 2017-12-28 30 110
A 50 2018-03-18 10 20
B 300 2018-06-08 45 100
B 300 2018-09-20 50 60
A 50 2018-11-18 8 45
B 300 2018-11-28 35 95
You can try masking the 0 values with pd.Series.mask and then filling them with the per-product np.nanmedian:
import numpy as np

fill_vals = df.salary.mask(df.salary.eq(0)).groupby(df['product']).transform(np.nanmedian)
df.assign(salary=df.salary.mask(df.salary.eq(0), fill_vals))  # assign returns a new DataFrame; df itself is left unchanged
product profit bougt_date discount salary
0 A 50 2016-12-01 5 25
1 A 50 2017-01-03 4 20
2 B 200 2016-12-24 10 100
3 A 50 2017-01-18 3 20
4 B 200 2017-01-28 15 80
5 A 50 2017-01-18 6 15
6 B 200 2017-01-28 20 95
7 A 50 2017-04-18 6 20
8 B 200 2017-12-08 25 95
9 A 50 2017-11-18 6 20
10 B 200 2017-08-21 20 90
11 B 200 2017-12-28 30 110
12 A 50 2018-03-18 10 20
13 B 300 2018-06-08 45 100
14 B 300 2018-09-20 50 60
15 A 50 2018-11-18 8 45
16 B 300 2018-11-28 35 95
Or, using np.where:
df['salary'] = np.where(df['salary'] == 0,
                        df['salary'].replace(0, np.nan).groupby(df['product']).transform('median'),
                        df['salary'])
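A similar one-liner sketch, assuming the same df: replace the zeros with NaN, then fill them from the per-product median of the remaining salaries.
s = df['salary'].replace(0, np.nan)                                     # treat 0 as missing
df['salary'] = s.fillna(s.groupby(df['product']).transform('median'))  # fill per product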
First use .groupby and .transform on the salary column to get the per-product median. Then locate the salaries that are 0 with .loc and set them equal to that median salary.
# NOTE - the line below uses `median` instead of `np.nanmedian`, so the zeros are
# included in the median calculation and the results differ from the answer above.
# To anyone reading this, please pick whichever behaviour fits your situation.
df.loc[df['salary'] == 0, 'salary'] = df.groupby('product')['salary'].transform('median')
df
output:
product profit bougt_date discount salary
0 A 50 2016-12-01 5 25.0
1 A 50 2017-01-03 4 20.0
2 B 200 2016-12-24 10 100.0
3 A 50 2017-01-18 3 17.5
4 B 200 2017-01-28 15 80.0
5 A 50 2017-01-18 6 15.0
6 B 200 2017-01-28 20 80.0
7 A 50 2017-04-18 6 17.5
8 B 200 2017-12-08 25 80.0
9 A 50 2017-11-18 6 20.0
10 B 200 2017-08-21 20 90.0
11 B 200 2017-12-28 30 110.0
12 A 50 2018-03-18 10 17.5
13 B 300 2018-06-08 45 100.0
14 B 300 2018-09-20 50 60.0
15 A 50 2018-11-18 8 45.0
16 B 300 2018-11-28 35 80.0
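If you want this .loc approach to ignore the zeros as well (and so reproduce the expected output above), one possible sketch is to compute the fill values from the non-zero salaries first:
# mask the zeros before taking the median so they are not counted
nonzero_median = df['salary'].mask(df['salary'].eq(0)).groupby(df['product']).transform('median')
df.loc[df['salary'] == 0, 'salary'] = nonzero_median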
Related
Slice values of a column and calculate average in python
I have a dataframe with three columns:

     a    b   c
     0   73  12
    73   80   2
    80  100   5
   100  150  13

Values in "a" and "b" are days. I need to find the average value of "c" in each 30-day interval (slice the values inside [min(a), max(b)] into 30-day bins and calculate the average of c). As a result I want a dataframe like this:

    aa   bb  c_avg
     0   30  12
    30   60  12
    60   90  6.33
    90  120  9
   120  150  13

Another sample of data could be:

        a       b           c
0  1264.0  1629.0    0.000000
1  1629.0  1632.0  133.333333
6  1632.0  1699.0    0.000000
2  1699.0  1706.0   21.428571
7  1706.0  1723.0    0.000000
3  1723.0  1726.0   50.000000
8  1726.0  1890.0    0.000000
4  1890.0  1893.0   33.333333
1  1893.0  1994.0    0.000000

How can I get to the final table?
First create a ranges DataFrame defined by the a and b columns:

a = np.arange(0, 180, 30)
df1 = pd.DataFrame({'aa':a[:-1], 'bb':a[1:]})
#print (df1)

Then cross join all rows using a helper column tmp:

df3 = pd.merge(df1.assign(tmp=1), df.assign(tmp=1), on='tmp')
#print (df3)

And last filter - there are 2 solutions, depending on which columns are used for the filtering:

df4 = df3[df3['aa'].between(df3['a'], df3['b']) | df3['bb'].between(df3['a'], df3['b'])]
print (df4)
     aa   bb  tmp    a    b   c
0     0   30    1    0   73  12
4    30   60    1    0   73  12
8    60   90    1    0   73  12
10   60   90    1   80  100   5
14   90  120    1   80  100   5
15   90  120    1  100  150  13
19  120  150    1  100  150  13

df4 = df4.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df4)
    aa   bb     c
0    0   30  12.0
1   30   60  12.0
2   60   90   8.5
3   90  120   9.0
4  120  150  13.0

df5 = df3[df3['a'].between(df3['aa'], df3['bb']) | df3['b'].between(df3['aa'], df3['bb'])]
print (df5)
     aa   bb  tmp    a    b   c
0     0   30    1    0   73  12
8    60   90    1    0   73  12
9    60   90    1   73   80   2
10   60   90    1   80  100   5
14   90  120    1   80  100   5
15   90  120    1  100  150  13
19  120  150    1  100  150  13

df5 = df5.groupby(['aa','bb'], as_index=False)['c'].mean()
print (df5)
    aa   bb          c
0    0   30  12.000000
1   60   90   6.333333
2   90  120   9.000000
3  120  150  13.000000
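As a side note, in pandas 1.2 and later the helper tmp column is not needed - the cross join can be written directly (a small sketch, assuming the same df1 and df):
df3 = df1.merge(df, how='cross')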
Add a column in dataframe conditionally from values in other dataframe python
I have a table in pandas, df:

id  product_1  count
 1        100     10
 2        200     20
 3        100     30
 4        400     40
 5        500     50
 6        200     60
 7        100     70

I also have another table in a dataframe, df2:

product  score
    100      5
    200     10
    300     15
    400     20
    500     25
    600     30
    700     35

I have to create a new column score in my first df, taking the values of score from df2 with respect to product_1. My final output should be:

id  product_1  count  score
 1        100     10      5
 2        200     20     10
 3        100     30      5
 4        400     40     20
 5        500     50     25
 6        200     60     10
 7        100     70      5

Any ideas how to achieve it?
Use map:

df['score'] = df['product_1'].map(df2.set_index('product')['score'].to_dict())
print (df)
   id  product_1  count  score
0   1        100     10      5
1   2        200     20     10
2   3        100     30      5
3   4        400     40     20
4   5        500     50     25
5   6        200     60     10
6   7        100     70      5

Or merge:

df = pd.merge(df,df2, left_on='product_1', right_on='product', how='left')
print (df)
   id  product_1  count  product  score
0   1        100     10      100      5
1   2        200     20      200     10
2   3        100     30      100      5
3   4        400     40      400     20
4   5        500     50      500     25
5   6        200     60      200     10
6   7        100     70      100      5

EDIT by comment:

df['score'] = df['product_1'].map(df2.set_index('product')['score'].to_dict())
df['final_score'] = (df['count'].mul(0.6).div(df.id)).add(df.score.mul(0.4))
print (df)
   id  product_1  count  score  final_score
0   1        100     10      5          8.0
1   2        200     20     10         10.0
2   3        100     30      5          8.0
3   4        400     40     20         14.0
4   5        500     50     25         16.0
5   6        200     60     10         10.0
6   7        100     70      5          8.0
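Note that map also accepts a Series keyed by its index, so the to_dict() call is optional (a small sketch, assuming the same df and df2):
df['score'] = df['product_1'].map(df2.set_index('product')['score'])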
Pandas compare 2 dataframes by specific rows in all columns
I have the following Pandas dataframe of some raw numbers:

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 10000)
col_raw_headers = ['07_08_19 #1','07_08_19 #2','07_08_19 #2.1','11_31_19 #1','11_31_19 #1.1','11_31_19 #1.3','12_15_20 #1','12_15_20 #2','12_15_20 #2.1','12_15_20 #2.2']
col_raw_trial_info = ['Quantity1','Quantity2','Quantity3','Quantity4','Quantity5','Quantity6','TimeStamp',np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
cols_raw = [[1,75,9,7,-4,0.4,'07/08/2019 05:11'],[1,11,20,-17,12,0.8,'07/08/2019 10:54'],[2,0.9,17,102,56,0.6,'07/08/2019 21:04'],[1,70,4,75,0.8,0.4,'11/31/2019 11:15'],[2,60,74,41,-36,0.3,'11/31/2019 16:50'],[3,17,12,-89,30,0.1,'11/31/2019 21:33'],[1,6,34,496,-84,0.5,'12/15/2020 01:36'],[1,3,43,12,-23,0.5,'12/15/2020 07:01'],[2,5,92,17,64,0.5,'12/15/2020 11:15'],[3,7,11,62,-11,0.5,'12/15/2020 21:45']]
both_values = [[1,2,3,4,8,4,3,8,7],[6,5,3,7,3,23,27,3,11],[65,3,6,78,9,2,45,6,7],[4,3,6,8,3,5,66,32,84],[2,3,11,55,3,7,33,65,34],[22,1,6,32,5,6,4,3,898],[1,6,3,2,6,55,22,6,23],[34,37,46,918,0,37,91,12,68],[51,20,1,34,12,59,78,6,101],[12,71,34,94,1,73,46,51,21]]
processed_cols = ['c_1trial','14_1','14_2','8_1','8_2','8_3','28_1','24_1','24_2','24_3']
df_raw = pd.DataFrame(zip(*cols_raw))
df_temp = pd.DataFrame(zip(*both_values))
df_raw = pd.concat([df_raw,df_temp])
df_raw.columns=col_raw_headers
df_raw.insert(0,'Tr_id',col_raw_trial_info)
df_raw.reset_index(drop=True,inplace=True)

It looks like this:

    Tr_id      07_08_19 #1  07_08_19 #2  07_08_19 #2.1  11_31_19 #1  11_31_19 #1.1  11_31_19 #1.3  12_15_20 #1  12_15_20 #2  12_15_20 #2.1  12_15_20 #2.2
0   Quantity1  1  1  2  1  2  3  1  1  2  3
1   Quantity2  75  11  0.9  70  60  17  6  3  5  7
2   Quantity3  9  20  17  4  74  12  34  43  92  11
3   Quantity4  7  -17  102  75  41  -89  496  12  17  62
4   Quantity5  -4  12  56  0.8  -36  30  -84  -23  64  -11
5   Quantity6  0.4  0.8  0.6  0.4  0.3  0.1  0.5  0.5  0.5  0.5
6   TimeStamp  07/08/2019 05:11  07/08/2019 10:54  07/08/2019 21:04  11/31/2019 11:15  11/31/2019 16:50  11/31/2019 21:33  12/15/2020 01:36  12/15/2020 07:01  12/15/2020 11:15  12/15/2020 21:45
7   NaN  1  6  65  4  2  22  1  34  51  12
8   NaN  2  5  3  3  3  1  6  37  20  71
9   NaN  3  3  6  6  11  6  3  46  1  34
10  NaN  4  7  78  8  55  32  2  918  34  94
11  NaN  8  3  9  3  3  5  6  0  12  1
12  NaN  4  23  2  5  7  6  55  37  59  73
13  NaN  3  27  45  66  33  4  22  91  78  46
14  NaN  8  3  6  32  65  3  6  12  6  51
15  NaN  7  11  7  84  34  898  23  68  101  21

I have a separate dataframe with a processed version of these numbers, in which some of the header rows from above have been deleted and the column names have been changed. Here is the second dataframe:

df_processed = pd.DataFrame(zip(*both_values),columns=processed_cols)
df_processed = df_processed[[3,4,9,7,0,2,1,6,8,5]]

   8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
0    4    2    12    34         1    65     6     1    51   22
1    3    3    71    37         2     3     5     6    20    1
2    6   11    34    46         3     6     3     3     1    6
3    8   55    94   918         4    78     7     2    34   32
4    3    3     1     0         8     9     3     6    12    5
5    5    7    73    37         4     2    23    55    59    6
6   66   33    46    91         3    45    27    22    78    4
7   32   65    51    12         8     6     3     6     6    3
8   84   34    21    68         7     7    11    23   101  898

Common parts of each dataframe: for each column, rows 8 onwards of the raw dataframe are the same as row 1 onwards of the processed dataframe. The order of the columns in the two dataframes is not the same.

Output combination: I am looking to compare rows 8-16 in columns 1-10 of the raw dataframe df_raw to the processed dataframe df_processed. If the columns match each other, then I would like to extract rows 1-7 of df_raw and the column header from df_processed.
Example: the values in column c_1trial only match the values in rows 8-16 of the column 07_08_19 #1. I see 2 steps: (1) find some way to determine that these 2 columns match each other, and (2) if 2 columns do match each other, select rows from the matching columns for the sample output. Here is the output I am looking to get:

Tr_id      07_08_19 #1  07_08_19 #2  07_08_19 #2.1  11_31_19 #1  11_31_19 #1.1  11_31_19 #1.3  12_15_20 #1  12_15_20 #2  12_15_20 #2.1  12_15_20 #2.2
Quantity1  1  1  2  1  2  3  1  1  2  3
Quantity2  75  11  0.9  70  60  17  6  3  5  7
Quantity3  9  20  17  4  74  12  34  43  92  11
Proc_Name  c_1trial  14_1  14_2  8_1  8_2  8_3  28_1  24_1  24_2  24_3
Quantity4  7  -17  102  75  41  -89  496  12  17  62
Quantity5  -4  12  56  0.8  -36  30  -84  -23  64  -11
Quantity6  0.4  0.8  0.6  0.4  0.3  0.1  0.5  0.5  0.5  0.5
TimeStamp  07/08/2019 05:11  07/08/2019 10:54  07/08/2019 21:04  11/31/2019 11:15  11/31/2019 16:50  11/31/2019 21:33  12/15/2020 01:36  12/15/2020 07:01  12/15/2020 11:15  12/15/2020 21:45

My attempts are giving trouble:

print (df_raw.iloc[7:,1:] == df_processed).all(axis=1)

gives

ValueError: Can only compare identically-labeled DataFrame objects

and

print (df_raw.ix[7:].values == df_processed.values) #gives False

gives False. The problem with my second attempt is that I am not selecting .all(axis=1). When I make a comparison I want to do this across all rows of every column, not just one row.

Question: Is there a way to select out the output I showed above from these 2 dataframes?
Does this look like the output you're looking for?

Raw dataframe df:

        Tr_id    07_08_19  07_08_19.1  07_08_19.2    11_31_19  11_31_19.1
0   Quantity1           1           1           2           1           2
1   Quantity2          75          11         0.9          70          60
2   Quantity3           9          20          17           4          74
3   Quantity4           7         -17         102          75          41
4   Quantity5          -4          12          56         0.8         -36
5   Quantity6         0.4         0.8         0.6         0.4         0.3
6   TimeStamp  07/08/2019  07/08/2019  07/08/2019  11/31/2019  11/31/2019
7         NaN           1           6          65           4           2
8         NaN           2           5           3           3           3
9         NaN           3           3           6           6          11
10        NaN           4           7          78           8          55
11        NaN           8           3           9           3           3
12        NaN           4          23           2           5           7
13        NaN           3          27          45          66          33
14        NaN           8           3           6          32          65
15        NaN           7          11           7          84          34

    11_31_19.2    12_15_20  12_15_20.1  12_15_20.2  12_15_20.3
0            3           1           1           2           3
1           17           6           3           5           7
2           12          34          43          92          11
3          -89         496          12          17          62
4           30         -84         -23          64         -11
5          0.1         0.5         0.5         0.5         0.5
6   11/31/2019  12/15/2020  12/15/2020  12/15/2020  12/15/2020
7           22           1          34          51          12
8            1           6          37          20          71
9            6           3          46           1          34
10          32           2         918          34          94
11           5           6           0          12           1
12           6          55          37          59          73
13           4          22          91          78          46
14           3           6          12           6          51
15         898          23          68         101          21

Processed dataframe dfp:

   8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
0    4    2    12    34         1    65     6     1    51   22
1    3    3    71    37         2     3     5     6    20    1
2    6   11    34    46         3     6     3     3     1    6
3    8   55    94   918         4    78     7     2    34   32
4    3    3     1     0         8     9     3     6    12    5
5    5    7    73    37         4     2    23    55    59    6
6   66   33    46    91         3    45    27    22    78    4
7   32   65    51    12         8     6     3     6     6    3
8   84   34    21    68         7     7    11    23   101  898

Code:

df = pd.read_csv('raw_df.csv')        # raw dataframe
dfp = pd.read_csv('processed_df.csv') # processed dataframe
dfr = df.drop('Tr_id', axis=1)
x = pd.DataFrame()
for col_raw in dfr.columns:
    for col_p in dfp.columns:
        if (dfr.tail(9).astype(int)[col_raw] == dfp[col_p]).all():
            series = dfr[col_raw].head(7).tolist()
            series.append(col_raw)
            x[col_p] = series
x = pd.concat([df['Tr_id'].head(7), x], axis=1)

Output:

        Tr_id    c_1trial        14_1        14_2         8_1         8_2
0   Quantity1           1           1           2           1           2
1   Quantity2          75          11         0.9          70          60
2   Quantity3           9          20          17           4          74
3   Quantity4           7         -17         102          75          41
4   Quantity5          -4          12          56         0.8         -36
5   Quantity6         0.4         0.8         0.6         0.4         0.3
6   TimeStamp  07/08/2019  07/08/2019  07/08/2019  11/31/2019  11/31/2019
7         NaN    07_08_19  07_08_19.1  07_08_19.2    11_31_19  11_31_19.1

           8_3        28_1        24_1        24_2        24_3
0            3           1           1           2           3
1           17           6           3           5           7
2           12          34          43          92          11
3          -89         496          12          17          62
4           30         -84         -23          64         -11
5          0.1         0.5         0.5         0.5         0.5
6   11/31/2019  12/15/2020  12/15/2020  12/15/2020  12/15/2020
7   11_31_19.2    12_15_20  12_15_20.1  12_15_20.2  12_15_20.3

I think the code could be more concise but maybe this does the job.
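If checking every column pair in nested loops becomes slow, a dict keyed on the trailing values of each raw column can do the matching in one pass (a rough sketch, assuming dfr and dfp as defined above):
# hash the last 9 values of every raw column
tail_raw = dfr.tail(9).astype(float)
lookup = {tuple(tail_raw[c]): c for c in dfr.columns}
# map each processed column name to the matching raw column name (None if no match)
matches = {p: lookup.get(tuple(dfp[p].astype(float))) for p in dfp.columns}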
Alternative solution, using the DataFrame.isin() method:

In [171]: df1
Out[171]:
   a  b  c
0  1  1  3
1  0  2  4
2  4  2  2
3  0  3  3
4  0  4  4

In [172]: df2
Out[172]:
   a  b  c
0  0  3  3
1  1  1  1
2  0  3  4
3  4  2  3
4  0  4  4

In [173]: common = pd.merge(df1, df2)

In [174]: common
Out[174]:
   a  b  c
0  0  3  3
1  0  4  4

In [175]: df1[df1.isin(common.to_dict('list')).all(axis=1)]
Out[175]:
   a  b  c
3  0  3  3
4  0  4  4

Or if you want to subtract the second data set from the first one, i.e. the Pandas equivalent of SQL's:

select col1, .., colN from tableA
minus
select col1, .., colN from tableB

in Pandas:

In [176]: df1[~df1.isin(common.to_dict('list')).all(axis=1)]
Out[176]:
   a  b  c
0  1  1  3
1  0  2  4
2  4  2  2
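A related sketch for the same subtraction, using merge with indicator=True to keep only the rows of df1 that do not appear in df2 (assuming the same df1 and df2):
diff = (df1.merge(df2, how='left', indicator=True)
           .query('_merge == "left_only"')
           .drop(columns='_merge'))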
I came up with this using loops. It is very disappointing:

holder = []
for randm,pp in enumerate(list(df_processed)):
    list1 = df_processed[pp].tolist()
    for car,rr in enumerate(list(df_raw)):
        list2 = df_raw.loc[7:,rr].tolist()
        if list1==list2:
            holder.append([rr,pp])
df_intermediate = pd.DataFrame(holder,columns=['A','B'])
df_c = df_raw.loc[:6,df_intermediate.iloc[:,0].tolist()]
df_c.loc[df_c.shape[0]] = df_intermediate.iloc[:,1].tolist()
df_c.insert(0,list(df_raw)[0],df_raw[list(df_raw)[0]])
df_c.iloc[-1,0]='Proc_Name'
df_c = df_c.reindex([0,1,2]+[7]+[3,4,5,6]).reset_index(drop=True)

Output:

   Tr_id      11_31_19 #1  11_31_19 #1.1  12_15_20 #2.2  12_15_20 #2  07_08_19 #1  07_08_19 #2.1  07_08_19 #2  12_15_20 #1  12_15_20 #2.1  11_31_19 #1.3
0  Quantity1  1  2  3  1  1  2  1  1  2  3
1  Quantity2  70  60  7  3  75  0.9  11  6  5  17
2  Quantity3  4  74  11  43  9  17  20  34  92  12
3  Proc_Name  8_1  8_2  24_3  24_1  c_1trial  14_2  14_1  28_1  24_2  8_3
4  Quantity4  75  41  62  12  7  102  -17  496  17  -89
5  Quantity5  0.8  -36  -11  -23  -4  56  12  -84  64  30
6  Quantity6  0.4  0.3  0.5  0.5  0.4  0.6  0.8  0.5  0.5  0.1
7  TimeStamp  11/31/2019 11:15  11/31/2019 16:50  12/15/2020 21:45  12/15/2020 07:01  07/08/2019 05:11  07/08/2019 21:04  07/08/2019 10:54  12/15/2020 01:36  12/15/2020 11:15  11/31/2019 21:33

The order of the columns is different than what I required, but that is a minor problem. The real problem with this approach is using loops. I wish there was a better way to do this using some built-in Pandas functionality. If you have a better solution, please post it. Thank you.
How can I change date type for graph labels
You can see in Out[7] that the date column has values like 2015.01.02, but in Out[17] the x-axis of the graph only shows 0, 50, 100. How can I change the x-axis to show labels like 2015.01.02? Any suggestions on this, please?
You can try to first convert the column date with to_datetime, then set_index, and finally change the datetime format of the x-axis with strftime and set_major_formatter:

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

df['date'] = pd.to_datetime(df['date'])
print df
        date   a   b  c
0 2015-01-02  10  20  3
1 2015-01-05  40  50  6
2 2015-01-06  70  80  8
3 2015-01-07  80  50  9
4 2015-01-08  90  50  3
5 2015-01-09  10  20  3
6 2015-01-10  40  50  6
7 2015-01-11  70  80  8
8 2015-01-12  80  50  9
9 2015-01-13  90  50  3

a = df['a'].pct_change()
b = df['b'].pct_change()
c = df['c'].pct_change()
df['corr'] = pd.rolling_corr(a,b,3)
df = df.set_index('date')
print df
             a   b  c      corr
date
2015-01-02  10  20  3       NaN
2015-01-05  40  50  6       NaN
2015-01-06  70  80  8       NaN
2015-01-07  80  50  9  0.941542
2015-01-08  90  50  3  0.914615
2015-01-09  10  20  3  0.776273
2015-01-10  40  50  6  0.999635
2015-01-11  70  80  8  0.985112
2015-01-12  80  50  9  0.941542
2015-01-13  90  50  3  0.914615

ax = df['corr'].plot()
ticklabels = df.index.strftime('%Y.%m.%d')
ax.xaxis.set_major_formatter(ticker.FixedFormatter(ticklabels))
plt.show()

EDIT by comment: It seems like a bug, but you can change the index and then rotate the labels:

df['date'] = pd.to_datetime(df['date'])
#print df

a = df['a'].pct_change()
b = df['b'].pct_change()
c = df['c'].pct_change()
df['corr'] = pd.rolling_corr(a,b,3)
df = df.set_index('date')
print df
             a   b  c      corr
date
2015-01-02  10  20  3       NaN
2015-01-05  40  50  6       NaN
2015-01-06  70  80  8       NaN
2015-01-07  80  50  9  0.941542
2015-01-08  90  50  3  0.914615
2015-01-09  10  20  3  0.776273
2015-01-10  40  50  6  0.999635
2015-01-11  70  80  8  0.985112
2015-01-12  80  50  9  0.941542
2015-01-13  90  50  3  0.914615

df.index = df.index.strftime('%Y.%m.%d')
print df
             a   b  c      corr
2015.01.02  10  20  3       NaN
2015.01.05  40  50  6       NaN
2015.01.06  70  80  8       NaN
2015.01.07  80  50  9  0.941542
2015.01.08  90  50  3  0.914615
2015.01.09  10  20  3  0.776273
2015.01.10  40  50  6  0.999635
2015.01.11  70  80  8  0.985112
2015.01.12  80  50  9  0.941542
2015.01.13  90  50  3  0.914615

ax = df['corr'].plot()
plt.xticks(rotation=90)
plt.show()
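Note that pd.rolling_corr has since been removed from pandas; in current versions the same rolling correlation is written through the .rolling accessor (a small sketch, assuming the same a and b Series):
df['corr'] = a.rolling(3).corr(b)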
Pandas difference between groupby-size and unique
The goal here is to see how many unique values I have in my database. This is the code I have written:

apps = pd.read_csv('ConcatOwned1_900.csv', sep='\t', usecols=['appid'])
apps[('appid')] = apps[('appid')].astype(int)
apps_list=apps['appid'].unique()
b = apps.groupby('appid').size()
blist = b.unique()
print len(apps_list), len(blist), len(set(b))
>>>7672 2164 2164

Why is there a difference between those two methods?

As requested, I am posting some of my data:

    Unnamed: 0             StudID  No  appid work  work2
0            0  76561193665298433   0     10  nan      0
1            1  76561193665298433   1     20  nan      0
2            2  76561193665298433   2     30  nan      0
3            3  76561193665298433   3     40  nan      0
4            4  76561193665298433   4     50  nan      0
5            5  76561193665298433   5     60  nan      0
6            6  76561193665298433   6     70  nan      0
7            7  76561193665298433   7     80  nan      0
8            8  76561193665298433   8    100  nan      0
9            9  76561193665298433   9    130  nan      0
10          10  76561193665298433  10    220  nan      0
11          11  76561193665298433  11    240  nan      0
12          12  76561193665298433  12    280  nan      0
13          13  76561193665298433  13    300  nan      0
14          14  76561193665298433  14    320  nan      0
15          15  76561193665298433  15    340  nan      0
16          16  76561193665298433  16    360  nan      0
17          17  76561193665298433  17    380  nan      0
18          18  76561193665298433  18    400  nan      0
19          19  76561193665298433  19    420  nan      0
20          20  76561193665298433  20    500  nan      0
21          21  76561193665298433  21    550  nan      0
22          22  76561193665298433  22    620  6.0   3064
33          33  76561193665298434   0     10  nan    837
34          34  76561193665298434   1     20  nan     27
35          35  76561193665298434   2     30  nan      9
36          36  76561193665298434   3     40  nan      5
37          37  76561193665298434   4     50  nan      2
38          38  76561193665298434   5     60  nan      0
39          39  76561193665298434   6     70  nan    403
40          40  76561193665298434   7    130  nan      0
41          41  76561193665298434   8     80  nan      6
42          42  76561193665298434   9    100  nan     10
43          43  76561193665298434  10    220  nan     14
IIUC, based on the attached piece of the dataframe, it seems that you should analyze b.index, not the values of b. Just look:

b = apps.groupby('appid').size()

In [24]: b
Out[24]:
appid
10     2
20     2
30     2
40     2
50     2
60     2
70     2
80     2
100    2
130    2
220    2
240    1
280    1
300    1
320    1
340    1
360    1
380    1
400    1
420    1
500    1
550    1
620    1
dtype: int64

In [25]: set(b)
Out[25]: {1, 2}

But if you do it for b.index you'll get the same values for all 3 methods:

blist = b.index.unique()

In [30]: len(apps_list), len(blist), len(set(b.index))
Out[30]: (23, 23, 23)
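As a side note, counting the distinct appid values directly is shorter with nunique (a small sketch, assuming the same apps frame):
print(apps['appid'].nunique())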