How to separate one DataFrame into two smaller ones - python

I have a big DataFrame as below:
count mean median min max std
datet
2001-05-16 17 NaN NaN NaN NaN NaN
2001-05-17 24 8.28 8.27 8.15 8.46 0.09
2001-05-18 24 8.41 8.31 8.18 8.85 0.19
2001-05-19 24 10.44 10.64 9.03 10.98 0.60
2001-05-20 24 10.53 10.56 9.98 10.92 0.28
2001-05-21 24 10.28 10.31 9.90 10.66 0.23
2001-05-22 24 10.40 10.42 10.17 10.67 0.17
2001-05-23 24 10.04 10.03 9.87 10.17 0.08
2001-05-24 24 9.63 9.66 9.41 9.88 0.15
2001-05-25 24 9.21 9.22 9.01 9.41 0.11
How can I split this DataFrame into two smaller ones at the date '2001-05-20' (rows on or before that date in one, rows after it in the other), like below:
df1:
count mean median min max std
datet
2001-05-16 17 NaN NaN NaN NaN NaN
2001-05-17 24 8.28 8.27 8.15 8.46 0.09
2001-05-18 24 8.41 8.31 8.18 8.85 0.19
2001-05-19 24 10.44 10.64 9.03 10.98 0.60
2001-05-20 24 10.53 10.56 9.98 10.92 0.28
df2:
count mean median min max std
datet
2001-05-21 24 10.28 10.31 9.90 10.66 0.23
2001-05-22 24 10.40 10.42 10.17 10.67 0.17
2001-05-23 24 10.04 10.03 9.87 10.17 0.08
2001-05-24 24 9.63 9.66 9.41 9.88 0.15
2001-05-25 24 9.21 9.22 9.01 9.41 0.11

For a single before/after split, I think grouping by a boolean criterion is the most direct approach.
In [1]: df = pd.DataFrame(np.random.randn(10),
   ...:                   index=pd.date_range('2001-05-16', '2001-05-25'))
In [2]: grouper = df.groupby(df.index < pd.Timestamp('2001-05-21'))
In [3]: before, after = grouper.get_group(True), grouper.get_group(False)
In [4]: before
Out[4]:
0
2001-05-16 2.560516
2001-05-17 -2.207314
2001-05-18 0.646882
2001-05-19 0.660611
2001-05-20 0.437303
And after comes out right as well. Can anyone improve on my In [3]?
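One possible improvement (a hedged sketch, not from the original answer): skip groupby entirely and index with a single boolean mask and its complement.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10),
                  index=pd.date_range('2001-05-16', '2001-05-25'))

# one boolean mask; ~mask selects the complement
mask = df.index < pd.Timestamp('2001-05-21')
before, after = df[mask], df[~mask]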

In 0.11-dev you can slice by label with .loc (.ix will work equivalently):
In [16]: df.loc[:'20010520']
Out[16]:
0
2001-05-16 0.105445
2001-05-17 1.660771
2001-05-18 0.485668
2001-05-19 -0.102616
2001-05-20 -0.228228
In [17]: df.loc['20010521':]
Out[17]:
0
2001-05-21 -0.024324
2001-05-22 -1.004362
2001-05-23 2.342225
2001-05-24 1.124695
2001-05-25 -0.291302
or, positionally (.ix will work here as well; this is just more explicit):
In [27]: i = df.index.get_loc('20010520')
In [28]: df.iloc[:i+1]
Out[28]:
0
2001-05-16 0.105445
2001-05-17 1.660771
2001-05-18 0.485668
2001-05-19 -0.102616
2001-05-20 -0.228228
In [29]: df.iloc[i+1:]
Out[29]:
0
2001-05-21 -0.024324
2001-05-22 -1.004362
2001-05-23 2.342225
2001-05-24 1.124695
2001-05-25 -0.291302
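Another option (a suggestion beyond the original answer) is DataFrame.truncate, which slices a sorted index inclusively on both ends:
df1 = df.truncate(after='2001-05-20')   # rows up to and including 2001-05-20
df2 = df.truncate(before='2001-05-21')  # rows from 2001-05-21 onward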

week of the year aggregation using python (week starts from 01 01 YYYY)

I searched previous questions, but none of them resolves what I am looking for; can you please help me?
I have a dataset like this:
Date T2M Y T F H G Week_Number
0 1981-01-01 11.08 17.35 6.94 0.00 5.37 4.63 1
1 1981-01-02 10.82 16.41 7.51 0.00 5.55 2.73 1
2 1981-01-03 10.74 15.64 7.35 0.00 6.23 2.33 1
3 1981-01-04 11.17 15.99 8.46 0.00 6.16 1.66 1
4 1981-01-05 10.20 15.60 6.87 0.12 6.10 2.78 2
5 1981-01-06 10.35 16.16 5.95 0.00 6.59 3.92 2
6 1981-01-07 12.26 18.24 9.30 0.00 6.10 2.30 2
7 1981-01-08 12.76 19.23 8.72 0.00 6.29 3.96 2
8 1981-01-09 12.61 17.80 8.90 0.00 6.71 2.05 2
I already created a column with the week number using this code:
df['Week_Number'] = df['Date'].dt.week
but it assigns only the first four days of the year to week 1, presumably because the week starts from Monday. In my case I don't care whether the week starts on Monday or another day; I just want to subdivide each year into blocks of seven days (group every 7 days of each year, e.g. 1/1/1980 to 7/1/1980 is the FIRST week, and so on, with the first week of every year starting again from 1/1/xxxx).
If you want your week numbers to start from the 1st of January, irrespective of the day of the week, simply get the day of year, subtract 1, take the integer division by 7, and add 1:
df['Date'] = pd.to_datetime(df['Date'])
df['week_number'] = df['Date'].dt.dayofyear.sub(1).floordiv(7).add(1)
NB: drop the add(1) if you want the first week to be numbered 0.
output:
Date T2M Y T F H G Week_Number week_number
0 1981-01-01 11.08 17.35 6.94 0.00 5.37 4.63 1 1
1 1981-01-02 10.82 16.41 7.51 0.00 5.55 2.73 1 1
2 1981-01-03 10.74 15.64 7.35 0.00 6.23 2.33 1 1
3 1981-01-04 11.17 15.99 8.46 0.00 6.16 1.66 1 1
4 1981-01-05 10.20 15.60 6.87 0.12 6.10 2.78 2 1
5 1981-01-06 10.35 16.16 5.95 0.00 6.59 3.92 2 1
6 1981-01-07 12.26 18.24 9.30 0.00 6.10 2.30 2 1
7 1981-01-08 12.76 19.23 8.72 0.00 6.29 3.96 2 2
8 1981-01-09 12.61 17.80 8.90 0.00 6.71 2.05 2 2
Then you can use the new column to groupby, for example:
df.groupby('week_number').agg({'Date': ['min', 'max'], 'T2M': 'sum'})
output:
Date T2M
min max sum
week_number
1 1981-01-01 1981-01-07 76.62
2 1981-01-08 1981-01-09 25.37
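For reference, a minimal self-contained sketch of the same idea (assuming a Date column of strings, as in the question). Since dayofyear restarts at 1 every January 1st, the numbering automatically resets each year:
import pandas as pd

df = pd.DataFrame({'Date': ['1981-01-01', '1981-01-07', '1981-01-08', '1982-01-01']})
df['Date'] = pd.to_datetime(df['Date'])
# (doy - 1) // 7 + 1: days 1-7 -> week 1, days 8-14 -> week 2, ...
df['week_number'] = df['Date'].dt.dayofyear.sub(1).floordiv(7).add(1)
print(df['week_number'].tolist())  # [1, 1, 2, 1]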

I wish to optimize the code in a Pythonic way using lambda and pandas

I have the following Dataframe:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067
13 23 NaN NaN NaN NaN NaN NaN NaN NaN 983.5 BQ0067
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067
17 11 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 BQ0068
18 21 4.83 11.9 28.1 44.2 54.63 16.76 6.70 0.19 953.7 BQ0068
19 22 4.40 10.7 26.3 43.4 57.55 19.85 8.59 0.53 974.9 BQ0068
20 23 17.61 43.8 67.9 122.6 221.20 0.75 0.33 58.27 974.9 BQ0068
21 31 15.09 22.3 33.3 45.6 59.45 0.98 0.38 0.73 1773.7 BQ0068
I wish to do the following things:
Steps:
Whenever TEST_NUMBER 11 has NaN (null) values, I need to remove all rows of that particular
PRODUCT_NO. For example, in the given dataframe, PRODUCT_NO BQ0068 has TEST_NUMBER 11
with NaN values, hence all rows of BQ0068 should be removed.
If any TEST_NUMBER other than TEST_NUMBER 11 has NaN values, then only that particular
TEST_NUMBER's row should be removed. For example, PRODUCT_NO BQ0067 has a row of TEST_NUMBER 23 with NaN values, hence only that particular row of TEST_NUMBER 23 should be removed.
After doing the above steps, I need to do the computation. For example, for PRODUCT_NO BQ0066 I
need to compute the differences between rows in the following way:
TEST_NUMBER 21 - TEST_NUMBER 11, TEST_NUMBER 22 - TEST_NUMBER 11, TEST_NUMBER 23 - TEST_NUMBER 11, TEST_NUMBER 24 - TEST_NUMBER 11,
TEST_NUMBER 25 - TEST_NUMBER 11. And then TEST_NUMBER 31 - TEST_NUMBER 25,
TEST_NUMBER 32 - TEST_NUMBER 25, TEST_NUMBER 33 - TEST_NUMBER 25, TEST_NUMBER 34 -
TEST_NUMBER 25. The same procedure carries on for successive PRODUCT_NOs. As you can see,
the TEST_NUMBER frequency is different for each PRODUCT_NO, but in all cases every
PRODUCT_NO has only one TEST_NUMBER 11, and the other TEST_NUMBERs lie in the ranges
21 to 29 (21, 22, 23, 24, 25, 26, 27, 28, 29) and 31 to 39 (31, 32, 33, 34, 35, 36, 37, 38, 39).
PYTHON CODE
def pick_closest_sample(sample_list, sample_no):
    sample_list = sorted(sample_list)
    buffer = []
    for number in sample_list:
        if sample_no // 10 == number // 10:
            break
        else:
            buffer.append(number)
    if len(buffer) > 0:
        return buffer[-1]
    return sample_no
def add_closest_sample_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    out = []
    for product_no in unique_product_nos:
        subset = df[df['PRODUCT_NO'] == product_no]
        if subset.iloc[0].isnull().sum() == 0:
            subset.dropna(inplace=True)
            sample_list = subset['TEST_NUMBER'].to_list()
            subset['target_sample'] = subset['TEST_NUMBER'].apply(lambda x: pick_closest_sample(sample_list, x))
            out.append(subset)
    if len(out) > 0:
        out = pd.concat(out)
        out.dropna(inplace=True)
    return out
Output of above two functions:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO target_sample
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066 11
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066 11
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066 11
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066 11
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066 11
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066 11
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066 25
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066 25
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066 25
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066 25
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067 11
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067 11
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067 11
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067 22
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067 22
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067 22
As you can see, all rows of PRODUCT_NO BQ0068 are removed, as its TEST_NUMBER 11 had NaN values. Also, only the row of TEST_NUMBER 23 of PRODUCT_NO BQ0067 is removed, as it had NaN values. So the requirements mentioned in the first two steps are met. Now the computation for PRODUCT_NO BQ0067 will be TEST_NUMBER 31 - TEST_NUMBER 22, TEST_NUMBER 32 - TEST_NUMBER 22, TEST_NUMBER 33 - TEST_NUMBER 22.
PYTHON CODE
def compute_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    out = []
    for product_no in unique_product_nos:
        subset = df[df['PRODUCT_NO'] == product_no]
        target_list = list(subset['target_sample'].unique())
        for target in target_list:
            target_df = subset[subset['target_sample'] == target]
            target_subset = [subset[subset['TEST_NUMBER'] == target]] * len(target_df)
            target_subset = pd.concat(target_subset)
            if len(target_subset) > 0:
                target_subset.index = target_df.index
                diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
                for col in diff_cols:
                    target_df[col + '_diff'] = target_df[col] - target_subset[col]
                out.append(target_df)
    if len(out) > 0:
        out = pd.concat(out)
    return out
Output of the above function:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM ... target_sample D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 ... 11 -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 ... 11 -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 ... 11 -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 ... 11 -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 ... 11 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 ... 25 -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 ... 25 -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 ... 25 -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 ... 25 -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 ... 11 -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 ... 11 -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 ... 22 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 ... 22 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 ... 22 -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 790.3
Kindly help me optimize the code of the three functions I posted, so I can write them in a more Pythonic way.
Points 1 and 2 can be achieved in a few lines with pandas functions.
You can then calculate "target_sample" and your diff_col in the same loop using groupby:
# 1. Whenever TEST_NUMBER == 11 has D1 value NaN, remove all rows with this PRODUCT_NO
drop_prod_no = df[(df.TEST_NUMBER == 11) & (df.D1.isna())]["PRODUCT_NO"]
df.drop(df[df.PRODUCT_NO.isin(drop_prod_no)].index, axis=0, inplace=True)
# 2. Drop remaining rows with NaN values
df.dropna(inplace=True)
# 3. Set column "target_sample" and calculate diffs
new_df = pd.DataFrame()
diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    closest_sample = last_sample = 11
    for index, row in subset.iterrows():
        if row.TEST_NUMBER // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        subset.at[index, "target_sample"] = closest_sample
        last_sample = row.TEST_NUMBER
        for col in diff_cols:
            subset.at[index, col + "_diff"] = subset.at[index, col] - float(subset[subset.TEST_NUMBER == closest_sample][col])
    new_df = pd.concat([new_df, subset])
print(new_df)
Output:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 ... D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 ... -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 ... -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 ... -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 ... -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 ... 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 ... -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 ... -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 ... -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 ... -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 ... -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 ... -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 ... 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 ... 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 ... -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 790.3
Edit: you can avoid using iterrows by applying lambda functions like you did:
# 3. Set column "target_sample" and calculate diffs
def get_closest_sample(samples, test_no):
    closest_sample = last_sample = 11
    for smpl in samples:
        if smpl // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        if smpl == test_no:
            break
        last_sample = smpl
    return closest_sample

new_df = pd.DataFrame()
diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    sample_list = list(subset["TEST_NUMBER"])
    subset["target_sample"] = subset["TEST_NUMBER"].apply(lambda x: get_closest_sample(sample_list, x))
    for col in diff_cols:
        subset[col + "_diff"] = subset.apply(lambda row: row[col] - float(subset[subset.TEST_NUMBER == row["target_sample"]][col]), axis=1)
    new_df = pd.concat([new_df, subset])
print(new_df)
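If you want to avoid the per-row apply entirely, here is a further sketch (not part of the original answers; it assumes df already has the target_sample column computed as above). A self-merge aligns each row with its reference row, so all diff columns are computed vectorized:
# reference rows, keyed by (PRODUCT_NO, TEST_NUMBER) with TEST_NUMBER
# renamed so it joins against each row's target_sample
ref = df[['PRODUCT_NO', 'TEST_NUMBER'] + diff_cols].rename(
    columns={'TEST_NUMBER': 'target_sample'})
merged = df.merge(ref, on=['PRODUCT_NO', 'target_sample'], suffixes=('', '_ref'))
for col in diff_cols:
    merged[col + '_diff'] = merged[col] - merged[col + '_ref']
merged = merged.drop(columns=[col + '_ref' for col in diff_cols])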

How to title a pandas dataframe

I have the following code that prints out descriptive statistics with df.describe for each class of a categorical variable
for i in list(merged.Response.unique()):
    print(merged[(merged.Response == i)].describe().round(2))
and it returns
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 687.00 687.00 687.00 687.00 687.00
mean 24.75 13.45 4.56 9.61 243.91
std 7.04 3.35 0.17 1.95 107.45
min 11.00 7.00 4.13 5.85 83.27
25% 20.00 11.00 4.45 8.18 167.44
50% 24.00 13.00 4.57 9.34 213.08
75% 29.00 15.00 4.67 10.51 289.74
max 51.00 24.00 4.97 15.75 700.80
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 1099.0 1099.00 1099.00 1099.00 1099.00
mean 17.2 6.85 4.08 5.18 97.88
std 12.8 2.47 0.24 1.45 101.26
min 1.0 2.00 3.24 2.40 5.72
25% 7.0 5.00 3.89 4.12 31.38
50% 14.0 7.00 4.13 5.21 62.58
75% 24.0 8.00 4.22 5.86 130.90
max 55.0 21.00 4.91 13.46 686.46
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 392.00 392.00 392.00 392.00 392.00
mean 12.41 11.46 4.44 10.13 125.04
std 3.75 3.34 0.19 1.94 43.91
min 3.00 6.00 4.02 6.98 36.92
25% 10.00 9.00 4.31 8.71 92.68
50% 13.00 10.00 4.38 9.30 121.58
75% 15.00 13.00 4.51 11.00 148.64
max 26.00 22.00 4.94 16.25 266.56
Is there any way I can title each summary table so I know which class is which?
I tried the following with the pandas Styler, but despite titling the dataframe, it only printed one of them, and it doesn't look as good (I'm in Google Colab, btw):
for i in list(merged.Response.unique()):
    test = merged[(merged.Response == i)].describe().round(2).style.set_caption(i)
test
AmznPrime
OrderCount OrderAvgSize AvgDeliverCost AvgOrderValue CustomerValue
count 392.000000 392.000000 392.000000 392.000000 392.000000
mean 12.410000 11.460000 4.440000 10.130000 125.040000
std 3.750000 3.340000 0.190000 1.940000 43.910000
min 3.000000 6.000000 4.020000 6.980000 36.920000
25% 10.000000 9.000000 4.310000 8.710000 92.680000
50% 13.000000 10.000000 4.380000 9.300000 121.580000
75% 15.000000 13.000000 4.510000 11.000000 148.640000
max 26.000000 22.000000 4.940000 16.250000 266.560000
All help is appreciated. Thanks!
How about:
merged.groupby("Response").describe().round(2)
To match your expected output, do stack/unstack:
merged.groupby("Response").describe().stack(level=1).unstack(level=0)
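If you do want one captioned table per class, here is a sketch that should work in Colab/Jupyter. The reason only one table printed is that only the last bare expression in a cell is auto-rendered, so call IPython's display explicitly inside the loop:
from IPython.display import display

for i in merged.Response.unique():
    # display() renders every Styler, not just the cell's last expression
    display(merged[merged.Response == i].describe().round(2).style.set_caption(str(i)))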

How do I plot data from multiple CSVs each with different column numbers

The file has no header, but I would need to select, say, columns 5, 9, 13, 17, etc. against column 2 (time). How can this be achieved in the case where headers are present as well? Edit: each file contains data for one day; the time format is GPS time, i.e. the year, the day of year, and the seconds since midnight. How can I plot for, say, 1-30 January 2019?
Here is one piece of code I tried:
import numpy as np
import glob, os
import matplotlib.pyplot as plt

files = glob.glob('*.s4')
#print(files)
for file in files:
    f = np.loadtxt(file, skiprows=3)
    #print(file[0:9].upper())
    for i in range(5, 50, 4):
        t = f[:, 2] / 3600.; s4 = f[:, i]
        pos = np.where(t)[0]
        pos1 = np.where(s4[pos] < 0.15)[0]; s4[pos1] = np.nan
        plt.scatter(t, s4)
        #print(len(s4))
    plt.xticks(np.arange(0, 26, 2))
    #plt.title(str(i))
    plt.show()
The problem is that this particular code only plots one day at a time.
Here is a sample of the data.
19 001 45 11 1 0.07 214.9 37.5 8 0.08 314.5 34.2 10 0.14 102.6 14.3 11 0.07 241.2 49.6 14 0.07 152.0 50.0 18 0.05 212.7 68.0 22 0.08 226.1 33.7 27 0.06 346.0 22.0 31 0.04 63.5 47.7 32 0.06 144.3 30.4 138 0.09 282.0 17.8
19 001 105 11 1 0.05 214.9 37.9 8 0.07 314.9 33.8 10 0.24 102.2 14.1 11 0.07 241.7 49.9 14 0.06 151.9 49.6 18 0.06 213.0 68.4 22 0.12 225.7 34.0 27 0.06 346.2 21.7 31 0.04 64.1 47.9 32 0.06 144.2 30.0 138 0.09 282.0 17.8
19 001 165 11 1 0.06 214.9 38.4 8 0.11 315.3 33.5 10 0.12 101.8 13.9 11 0.06 242.3 50.1 14 0.06 151.8 49.1 18 0.05 213.4 68.9 22 0.07 225.2 34.2 27 0.11 346.5 21.3 31 0.04 64.8 48.2 32 0.10 144.0 29.6 138 0.09 282.0 17.8
19 001 225 11 1 0.06 214.9 38.8 8 0.06 315.8 33.2 10 0.10 101.4 13.7 11 0.06 242.8 50.4 14 0.05 151.7 48.6 18 0.04 213.7 69.4 22 0.06 224.8 34.4 27 0.08 346.8 20.9 31 0.05 65.5 48.4 32 0.09 143.9 29.2 138 0.09 282.0 17.8
19 001 285 11 1 0.06 215.0 39.2 8 0.11 316.2 32.9 10 0.14 100.9 13.6 11 0.05 243.4 50.6 14 0.06 151.6 48.2 18 0.06 214.1 69.8 22 0.08 224.4 34.7 27 0.07 347.0 20.5 31 0.06 66.1 48.6 32 0.09 143.7 28.8 138 0.09 282.0 17.8
19 001 345 11 1 0.06 215.0 39.7 8 0.08 316.6 32.5 10 0.10 100.5 13.4 11 0.04 244.0 50.9 14 0.06 151.5 47.7 18 0.04 214.6 70.3 22 0.07 223.9 34.9 27 0.08 347.3 20.2 31 0.07 66.8 48.9 32 0.08 143.6 28.4 138 0.09 282.0 17.8
19 001 405 11 1 0.06 215.1 40.1 8 0.07 317.0 32.2 10 0.13 100.1 13.2 11 0.05 244.6 51.1 14 0.08 151.4 47.3 18 0.05 215.0 70.8 22 0.07 223.5 35.1 27 0.12 347.5 19.8 31 0.08 67.5 49.1 32 0.12 143.4 28.0 138 0.09 282.0 17.8
19 001 465 11 1 0.06 215.1 40.5 8 0.12 317.4 31.9 10 0.10 99.7 13.0 11 0.08 245.2 51.4 14 0.05 151.3 46.8 18 0.06 215.5 71.2 22 0.06 223.0 35.4 27 0.12 347.8 19.4 31 0.03 68.2 49.3 32 0.18 143.3 27.7 138 0.09 282.0 17.8
19 001 525 11 1 0.09 215.2 40.9 8 0.12 317.9 31.5 10 0.11 99.3 12.8 11 0.04 245.8 51.6 14 0.15 151.2 46.4 18 0.06 216.0 71.7 22 0.06 222.6 35.6 27 0.08 348.0 19.1 31 0.05 68.9 49.5 32 0.08 143.1 27.3 138 0.09 282.0 17.8
19 001 585 11 1 0.07 215.2 41.4 8 0.09 318.3 31.2 10 0.12 98.9 12.6 11 0.04 246.5 51.8 14 0.06 151.1 45.9 18 0.05 216.5 72.2 22 0.06 222.1 35.8 27 0.08 348.3 18.7 31 0.07 69.6 49.7 32 0.11 143.0 26.9 138 0.09 282.0 17.8
Assuming that a space character is the column separator, you can load them into a list of lists:
data = []
with open(datafile, 'r') as file:
    for line in file:
        # split into a list based on the whitespace separator
        data.append(line.split())
Taking part of your example: to compare the values in column 2 with column 5 you could do:
for line in data:
    if line[1] == line[4]:
        print("it's a match!")
If you have a header you want to ignore, just skip the first line when you open the file:
with open(datafile, 'r') as file:
    # do nothing with this line
    header = file.readline()
    ...
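As an alternative sketch (assuming whitespace-separated .s4 files with three header rows, as in the numpy version above), pandas copes with a different number of columns per file, since each file is read independently and df.shape[1] tells you how many columns it actually has:
import glob
import pandas as pd
import matplotlib.pyplot as plt

for path in glob.glob('*.s4'):
    df = pd.read_csv(path, sep=r'\s+', header=None, skiprows=3)
    t = df[2] / 3600.0                       # seconds since midnight -> hours
    for col in range(5, df.shape[1], 4):     # every 4th column from 5 onward
        s4 = df[col].where(df[col] >= 0.15)  # mask S4 values below 0.15
        plt.scatter(t, s4, s=4)
    plt.xticks(range(0, 26, 2))
    plt.show()                               # one figure per file/day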

pandas: t-test and p-value of month over month mean difference in aggregated dataframe using groupby function

This is my first posted question, so please excuse if it doesn't look good.
I have a source data file which I transform to the following dataframe using pandas groupby aggregation
pd.read_csv('R:/Python ETL/AGG7.csv', sep=',')
Treatment Month stdev n avg
0 AAAA 1/1/2016 1.92 309 7.57
1 AAAA 2/1/2016 1.89 79 7.46
2 AAAA 3/1/2016 2.25 158 7.20
3 AAAA 4/1/2016 2.23 22 7.68
4 BBBB 1/1/2016 2.04 175 7.10
5 BBBB 2/1/2016 1.96 33 7.09
6 BBBB 3/1/2016 2.02 110 7.32
7 BBBB 4/1/2016 1.73 25 7.92
8 CCCC 1/1/2016 2.42 111 7.40
9 CCCC 2/1/2016 1.45 22 7.73
10 CCCC 3/1/2016 2.44 21 6.95
11 CCCC 4/1/2016 2.84 92 6.92
What I need is 2 additional columns with month over month difference (MoM diff) and p-value of T-tests of those differences.
MoM diff pValue
-0.11 0.35
-0.26 0.62
0.48 0.65
-0.01 0.02
0.23 0.44
0.6 0.83
0.33 0.46
-0.78 0.79
-0.03 0.04
The problem is that I cannot get them on the fly using pandas groupby with scipy.stats' ttest_ind function on the original dataset, or with the ttest_ind_from_stats function on the aggregated dataframe shown. I have tried many different approaches, but with no success. Can anyone help, please?
You can use df.shift with groupby to get the shifted values:
df[["avg_2", "n_2", "stdev_2"]] = df.groupby("Treatment")[["avg", "n", "stdev"]].shift()
df
Out[7]:
Treatment Month stdev n avg avg_2 n_2 stdev_2
0 AAAA 2016-01-01 1.92 309 7.57 NaN NaN NaN
1 AAAA 2016-01-02 1.89 79 7.46 7.57 309.0 1.92
2 AAAA 2016-01-03 2.25 158 7.20 7.46 79.0 1.89
3 AAAA 2016-01-04 2.23 22 7.68 7.20 158.0 2.25
4 BBBB 2016-01-01 2.04 175 7.10 NaN NaN NaN
5 BBBB 2016-01-02 1.96 33 7.09 7.10 175.0 2.04
6 BBBB 2016-01-03 2.02 110 7.32 7.09 33.0 1.96
7 BBBB 2016-01-04 1.73 25 7.92 7.32 110.0 2.02
8 CCCC 2016-01-01 2.42 111 7.40 NaN NaN NaN
9 CCCC 2016-01-02 1.45 22 7.73 7.40 111.0 2.42
10 CCCC 2016-01-03 2.44 21 6.95 7.73 22.0 1.45
11 CCCC 2016-01-04 2.84 92 6.92 6.95 21.0 2.44
You can filter out NaN values with pd.notnull:
df2 = df[pd.notnull(df.avg_2)].copy()
And you can get the results of the t-tests with:
import scipy.stats as ss
res = ss.ttest_ind_from_stats(df2.avg, df2.stdev, df2.n, df2.avg_2, df2.stdev_2, df2.n_2, equal_var=False)
If you want the mean differences and p-values in this dataframe:
df2["dif_avg"] = df2.avg - df2.avg_2
df2["p_value"] = res.pvalue
Out[22]:
Month stdev n avg avg_2 n_2 stdev_2 dif_avg p_value
1 2016-01-02 1.89 79 7.46 7.57 309.0 1.92 -0.11 0.646226
2 2016-01-03 2.25 158 7.20 7.46 79.0 1.89 -0.26 0.350814
3 2016-01-04 2.23 22 7.68 7.20 158.0 2.25 0.48 0.353023
5 2016-01-02 1.96 33 7.09 7.10 175.0 2.04 -0.01 0.978808
6 2016-01-03 2.02 110 7.32 7.09 33.0 1.96 0.23 0.559625
7 2016-01-04 1.73 25 7.92 7.32 110.0 2.02 0.60 0.137527
9 2016-01-02 1.45 22 7.73 7.40 111.0 2.42 0.33 0.395806
10 2016-01-03 2.44 21 6.95 7.73 22.0 1.45 -0.78 0.214270
11 2016-01-04 2.84 92 6.92 6.95 21.0 2.44 -0.03 0.961019
Line-by-line:
import csv
import scipy.stats as ss

results = []
treatment1 = ""
with open('R:/Python ETL/AGG7.csv') as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    for line in reader:
        treatment2, stdev2, n2, avg2 = line[0], float(line[2]), int(line[3]), float(line[4])
        if treatment2 == treatment1:
            ttest_res = ss.ttest_ind_from_stats(avg1, stdev1, n1, avg2, stdev2, n2, equal_var=False)
            results.append((avg2 - avg1, ttest_res.pvalue))
        treatment1, stdev1, n1, avg1 = treatment2, stdev2, n2, avg2
Is that what you need?
In [154]: df
Out[154]:
Treatment Month stdev n avg
0 AAAA 1/1/2016 1.92 309 7.57
1 AAAA 2/1/2016 1.89 79 7.46
2 AAAA 3/1/2016 2.25 158 7.20
3 AAAA 4/1/2016 2.23 22 7.68
4 BBBB 1/1/2016 2.04 175 7.10
5 BBBB 2/1/2016 1.96 33 7.09
6 BBBB 3/1/2016 2.02 110 7.32
7 BBBB 4/1/2016 1.73 25 7.92
8 CCCC 1/1/2016 2.42 111 7.40
9 CCCC 2/1/2016 1.45 22 7.73
10 CCCC 3/1/2016 2.44 21 6.95
11 CCCC 4/1/2016 2.84 92 6.92
In [155]: df.stdev.diff()
Out[155]:
0 NaN
1 -0.03
2 0.36
3 -0.02
4 -0.19
5 -0.08
6 0.06
7 -0.29
8 0.69
9 -0.97
10 0.99
11 0.40
Name: stdev, dtype: float64
let's shift it one row up:
In [156]: df.stdev.diff().shift(-1)
Out[156]:
0 -0.03
1 0.36
2 -0.02
3 -0.19
4 -0.08
5 0.06
6 -0.29
7 0.69
8 -0.97
9 0.99
10 0.40
11 NaN
Name: stdev, dtype: float64
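As an aside (a sketch beyond the original answers), the month-over-month difference of the mean itself can be had in one line, grouped so the diff does not leak across treatments:
# first month of each Treatment comes out as NaN
df['MoM diff'] = df.groupby('Treatment')['avg'].diff()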
