I have the following txt file:
Temp Hi Low Out Dew Wind Wind Wind Hi Hi Wind Heat THW THSW Rain Solar Solar Hi Solar UV UV Hi Heat Cool In In In In In In Air Wind Wind ISS Arc.
Date Time Out Temp Temp Hum Pt. Speed Dir Run Speed Dir Chill Index Index Index Bar Rain Rate Rad. Energy Rad. Index Dose UV D-D D-D Temp Hum Dew Heat EMC Density ET Samp Tx Recept Int.
01/01/16 12:30 a 13.8 13.8 13.6 88 11.9 0.0 --- 0.00 0.0 --- 13.8 13.8 13.8 12.4 1012.3 0.00 0.0 0 0.00 0 0.0 0.00 0.0 0.094 0.000 21.5 50 10.6 20.7 9.25 1.1823 0.00 702 1 100.0 30
01/01/16 1:00 a 13.6 13.8 13.2 88 11.7 0.0 --- 0.00 0.0 --- 13.6 13.6 13.6 12.2 1012.2 0.00 0.0 0 0.00 0 0.0 0.00 0.0 0.098 0.000 21.5 50 10.6 20.7 9.25 1.1823 0.00 702 1 100.0 30
01/01/16 1:30 a 14.5 14.5 13.6 81 11.3 0.0 --- 0.00 0.0 --- 14.5 14.4 14.4 12.9 1012.2 0.00 0.0 0 0.00 0 0.0 0.00 0.0 0.080 0.000 21.5 50 10.6 20.7 9.25 1.1822 0.00 703 1 100.0 30
01/01/16 2:00 a 15.2 15.2 14.5 75 10.8 0.0 --- 0.00 0.0 --- 15.2 14.9 14.9 13.4 1012.0 0.00 0.0 0 0.00 0 0.0 0.00 0.0 0.066 0.000 21.4 49 10.2 20.5 9.05 1.1829 0.00 702 1 100.0 30
01/01/16 2:30 a 14.4 15.2 14.0 79 10.8 0.0 --- 0.00 0.0 --- 14.4 14.2 14.2 12.8 1012.2 0.20 0.0 0 0.00 0 0.0 0.00 0.0 0.082 0.000 21.4 48 9.9 20.4 8.86 1.1834 0.00 703 1 100.0 30
01/01/16 3:00 a 15.1 15.1 14.1 76 10.9 0.0 --- 0.00 0.0 --- 15.1 14.8 14.8 13.4 1011.9 0.00 0.0 0 0.00 0 0.0 0.00 0.0 0.068 0.000 21.4 48 9.9 20.4 8.86 1.1830 0.00 700 1 100.0 30
01/01/16 3:30 a 14.9 15.2 14.9 73 10.1 0.0 --- 0.00 0.0 --- 14.9 14.6 14.6 13.2 1011.9 0.00 0.0 0 0.00 0 0.0 0.00 0.0 0.071 0.000 21.4 47 9.6 20.3 8.75 1.1833 0.00 702 1 100.0 30
01/01/16 4:00 a 15.2 15.3 14.9 68 9.4 0.0 --- 0.00 0.0 --- 15.2 14.8 14.8 13.3 1011.9 0.00 0.0 0 0.00 0 0.0 0.00 0.0 0.065 0.000 21.4 47 9.6 20.3 8.75 1.1833 0.00 700 1 100.0 30
01/01/16 4:30 a 14.9 15.2 14.6 72 9.9 0.0 --- 0.00 0.0 --- 14.9 14.6 14.6 13.1 1011.8 0.00 0.0 0 0.00 0 0.0 0.00 0.0 0.072 0.000 21.3 46 9.2 20.2 8.64 1.1838 0.00 703 1 100.0 30
01/01/16 5:00 a 14.1 15.1 14.0 76 9.9 0.0 --- 0.00 0.0 --- 14.1 13.8 13.8 12.3 1012.1 0.00 0.0 0 0.00 0 0.0 0.00 0.0 0.088 0.000 21.3 46 9.2 20.2 8.64 1.1842 0.00 702 1 100.0 30
and I want to import it into a DataFrame, but with one column containing the date and the time together in 24-hour format:
Time
01/01/16 12:30
.....
01/01/16 13:30
Is there an easy way to do this?
Thank you!
Try this.

For dd/mm/yy format:

import pandas as pd

def parse_dt(dt, tm, ap):
    # dt = date string, tm = time string, ap = the trailing 'a'/'p' am/pm marker
    return pd.to_datetime(dt + ' ' + tm + ap, dayfirst=True)

For mm/dd/yy format:

def parse_dt(dt, tm, ap):
    return pd.to_datetime(dt + ' ' + tm + ap)

Parse the file (the first three whitespace-separated fields are merged into a single 'ts' column):

df = pd.read_csv(filename, sep='\s+', skiprows=2, header=None,
                 parse_dates={'ts': [0, 1, 2]}, date_parser=parse_dt)
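Note that date_parser is deprecated as of pandas 2.0. If you are on a newer version, a sketch of an equivalent approach is to read the columns as plain strings and combine them afterwards; this assumes dd/mm/yy dates (swap the format string to '%m/%d/%y' otherwise):

import pandas as pd

df = pd.read_csv(filename, sep=r'\s+', skiprows=2, header=None)
# columns 0, 1, 2 hold the date, the time, and the 'a'/'p' marker
ts = df[0] + ' ' + df[1] + df[2] + 'm'   # e.g. '01/01/16' + ' ' + '12:30' + 'a' + 'm' -> '01/01/16 12:30am'
df.insert(0, 'ts', pd.to_datetime(ts, format='%d/%m/%y %I:%M%p'))
df = df.drop(columns=[0, 1, 2])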
I have the following DataFrame:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067
13 23 NaN NaN NaN NaN NaN NaN NaN NaN 983.5 BQ0067
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067
17 11 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 BQ0068
18 21 4.83 11.9 28.1 44.2 54.63 16.76 6.70 0.19 953.7 BQ0068
19 22 4.40 10.7 26.3 43.4 57.55 19.85 8.59 0.53 974.9 BQ0068
20 23 17.61 43.8 67.9 122.6 221.20 0.75 0.33 58.27 974.9 BQ0068
21 31 15.09 22.3 33.3 45.6 59.45 0.98 0.38 0.73 1773.7 BQ0068
I wish to do the following things:
Steps:
1. Whenever TEST_NUMBER 11 has NaN (null) values, I need to remove all rows of that particular PRODUCT_NO. For example, in the given dataframe, PRODUCT_NO BQ0068 has TEST_NUMBER 11 with NaN values, hence all rows of BQ0068 should be removed.
2. If any TEST_NUMBER other than TEST_NUMBER 11 has NaN values, then only that particular TEST_NUMBER's row should be removed. For example, PRODUCT_NO BQ0067 has a row of TEST_NUMBER 23 with NaN values, hence only that row of TEST_NUMBER 23 should be removed.
3. After doing the above steps, I need to do the computation. For example, for PRODUCT_NO BQ0066 I need to compute the difference between rows in the following way: TEST_NUMBER 21 - TEST_NUMBER 11, TEST_NUMBER 22 - TEST_NUMBER 11, TEST_NUMBER 23 - TEST_NUMBER 11, TEST_NUMBER 24 - TEST_NUMBER 11, TEST_NUMBER 25 - TEST_NUMBER 11. And then TEST_NUMBER 31 - TEST_NUMBER 25, TEST_NUMBER 32 - TEST_NUMBER 25, TEST_NUMBER 33 - TEST_NUMBER 25, TEST_NUMBER 34 - TEST_NUMBER 25. The same procedure carries on for each successive PRODUCT_NO. As you can see, the frequency of TEST_NUMBERs differs for each PRODUCT_NO, but in all cases every PRODUCT_NO will have exactly one TEST_NUMBER 11, and the other TEST_NUMBERs will be in the range 21 to 29 (21, 22, 23, 24, 25, 26, 27, 28, 29) and 31 to 39 (31, 32, 33, 34, 35, 36, 37, 38, 39).
PYTHON CODE
def pick_closest_sample(sample_list, sample_no):
    # walk the sorted samples and remember the last one seen before
    # reaching sample_no's own "decade" (11 -> 1x, 21-29 -> 2x, 31-39 -> 3x)
    sample_list = sorted(sample_list)
    buffer = []
    for number in sample_list:
        if sample_no // 10 == number // 10:
            break
        else:
            buffer.append(number)
    if len(buffer) > 0:
        return buffer[-1]
    return sample_no

def add_closest_sample_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    out = []
    for product_no in unique_product_nos:
        subset = df[df['PRODUCT_NO'] == product_no]
        # keep the product only if its first row (TEST_NUMBER 11) has no NaN
        if subset.iloc[0].isnull().sum() == 0:
            subset.dropna(inplace=True)
            sample_list = subset['TEST_NUMBER'].to_list()
            subset['target_sample'] = subset['TEST_NUMBER'].apply(lambda x: pick_closest_sample(sample_list, x))
            out.append(subset)
    if len(out) > 0:
        out = pd.concat(out)
        out.dropna(inplace=True)
    return out
Output of the above two functions:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO target_sample
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066 11
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066 11
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066 11
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066 11
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066 11
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066 11
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066 25
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066 25
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066 25
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066 25
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067 11
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067 11
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067 11
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067 22
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067 22
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067 22
As you can see, all rows of PRODUCT_NO BQ0068 are removed, as its TEST_NUMBER 11 had NaN values. Also, only the row of TEST_NUMBER 23 of PRODUCT_NO BQ0067 is removed, as it had NaN values. So the requirements mentioned in the first two steps are met. Now the computation for PRODUCT_NO BQ0067 will be TEST_NUMBER 31 - TEST_NUMBER 22, TEST_NUMBER 32 - TEST_NUMBER 22, TEST_NUMBER 33 - TEST_NUMBER 22.
PYTHON CODE
def compute_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    out = []
    for product_no in unique_product_nos:
        subset = df[df['PRODUCT_NO'] == product_no]
        target_list = list(subset['target_sample'].unique())
        for target in target_list:
            target_df = subset[subset['target_sample'] == target]
            # repeat the base-sample row so it lines up with every row that diffs against it
            target_subset = [subset[subset['TEST_NUMBER'] == target]] * len(target_df)
            target_subset = pd.concat(target_subset)
            if len(target_subset) > 0:
                target_subset.index = target_df.index
                diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
                for col in diff_cols:
                    target_df[col + '_diff'] = target_df[col] - target_subset[col]
                out.append(target_df)
    if len(out) > 0:
        out = pd.concat(out)
    return out
Output of the above function:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM ... target_sample D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 ... 11 -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 ... 11 -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 ... 11 -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 ... 11 -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 ... 11 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 ... 25 -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 ... 25 -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 ... 25 -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 ... 25 -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 ... 11 -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 ... 11 -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 ... 22 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 ... 22 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 ... 22 -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 790.3
Kindly help me optimize the three functions I posted, so I can write them in a more Pythonic way.
Points 1. and 2. can be achieved in a few lines with pandas functions.
You can then calculate "target_sample" and your diff columns in the same loop using groupby:
# 1. Whenever TEST_NUMBER == 11 has D1 value NaN, remove all rows with this PRODUCT_NO
drop_prod_no = df[(df.TEST_NUMBER == 11) & (df.D1.isna())]["PRODUCT_NO"]
df.drop(df[df.PRODUCT_NO.isin(drop_prod_no)].index, axis=0, inplace=True)
# 2. Drop remaining rows with NaN values
df.dropna(inplace=True)
# 3. Set column "target_sample" and calculate diffs
new_df = pd.DataFrame()
diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    closest_sample = last_sample = 11
    for index, row in subset.iterrows():
        if row.TEST_NUMBER // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        subset.at[index, "target_sample"] = closest_sample
        last_sample = row.TEST_NUMBER
        for col in diff_cols:
            subset.at[index, col + "_diff"] = subset.at[index, col] - float(subset[subset.TEST_NUMBER == closest_sample][col])
    new_df = pd.concat([new_df, subset])
print(new_df)
Output:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 ... D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 ... -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 ... -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 ... -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 ... -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 ... 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 ... -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 ... -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 ... -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 ... -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 ... -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 ... -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 ... 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 ... 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 ... -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 790.3
Edit: you can avoid using iterrows by applying lambda functions like you did:
# 3. Set column "target_sample" and calculate diffs
def get_closest_sample(samples, test_no):
    closest_sample = last_sample = 11
    for smpl in samples:
        if smpl // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        if smpl == test_no:
            break
        last_sample = smpl
    return closest_sample

new_df = pd.DataFrame()
diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    sample_list = list(subset["TEST_NUMBER"])
    subset["target_sample"] = subset["TEST_NUMBER"].apply(lambda x: get_closest_sample(sample_list, x))
    for col in diff_cols:
        subset[col + "_diff"] = subset.apply(lambda row: row[col] - float(subset[subset.TEST_NUMBER == row["target_sample"]][col]), axis=1)
    new_df = pd.concat([new_df, subset])
print(new_df)
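If performance matters, the row-wise apply in the last step can be replaced by one vectorized lookup per product. A sketch, reusing subset and diff_cols from the loop above and assuming TEST_NUMBER is unique within each PRODUCT_NO (true for the sample data):

lookup = subset.set_index("TEST_NUMBER")[diff_cols]
base = lookup.loc[subset["target_sample"]].to_numpy()  # base-sample values aligned to each row
for i, col in enumerate(diff_cols):
    subset[col + "_diff"] = subset[col].to_numpy() - base[:, i]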
I have a bunch of stock data downloaded from yahoo finance. Each dataframe looks like this:
Date Open High Low Close Adj Close Volume
0 2019-03-11 2.73 2.81 2.71 2.75 2.75 243900
1 2019-03-12 2.66 2.78 2.66 2.75 2.75 69200
2 2019-03-13 2.75 2.80 2.71 2.77 2.77 61200
3 2019-03-14 2.77 2.79 2.75 2.75 2.75 48800
4 2019-03-15 2.76 2.79 2.75 2.79 2.79 124400
.. ... ... ... ... ... ... ...
282 2020-04-22 3.61 3.75 3.61 3.71 3.71 312900
283 2020-04-23 3.74 3.77 3.66 3.76 3.76 99800
284 2020-04-24 3.78 3.78 3.63 3.63 3.63 89100
285 2020-04-27 3.70 3.70 3.55 3.64 3.64 60600
286 2020-04-28 3.70 3.74 3.64 3.70 3.70 248300
I need to concat the data so it looks like the multi-index format below, and I'm at a loss. I've tried a number of pd.concat([list of dfs], zip(cols, symbols), axis=[0,1]) combos with no luck, so any help is appreciated!
Adj Close Close High Low Open Volume
CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP
Date
2019-04-30 1.85 3.08 0.69 1.85 3.08 0.69 1.94 3.10 0.70 1.74 3.05 0.67 1.74 3.07 0.70 24800 23900 30400
2019-05-01 1.81 3.15 0.65 1.81 3.15 0.65 1.85 3.17 0.69 1.75 3.06 0.62 1.76 3.09 0.67 15500 72800 85900
2019-05-02 1.80 3.12 0.66 1.80 3.12 0.66 1.87 3.16 0.66 1.76 3.10 0.65 1.80 3.16 0.65 12900 28100 97200
2019-05-03 1.85 3.14 0.67 1.85 3.14 0.67 1.89 3.19 0.69 1.74 3.06 0.62 1.74 3.12 0.62 43200 31300 27500
2019-05-06 1.85 3.13 0.66 1.85 3.13 0.66 1.89 3.25 0.69 1.75 3.11 0.65 1.79 3.11 0.67 37000 50200 31500
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-04-22 0.93 3.71 0.73 0.93 3.71 0.73 1.04 3.75 0.73 0.93 3.61 0.69 0.93 3.61 0.72 2600 312900 14600
2020-04-23 1.01 3.76 0.74 1.01 3.76 0.74 1.01 3.77 0.77 0.94 3.66 0.73 0.94 3.74 0.73 2500 99800 15200
2020-04-24 1.05 3.63 0.76 1.05 3.63 0.76 1.05 3.78 0.77 0.92 3.63 0.74 1.05 3.78 0.74 4400 89100 1300
2020-04-27 1.03 3.64 0.76 1.03 3.64 0.76 1.07 3.70 0.77 0.92 3.55 0.76 1.07 3.70 0.77 6200 60600 3500
2020-04-28 1.00 3.70 0.77 1.00 3.70 0.77 1.07 3.74 0.77 0.96 3.64 0.75 1.07 3.70 0.77 22300 248300 26100
EDIT per Quang Hoang's suggestion:
Tried:
ret = pd.concat(stock_data.values(), keys=stocks, axis=1)
ret = ret.swaplevel(0, 1, axis=1)
Got the following output, which looks much closer but is still off a bit:
Date Open High Low Close Adj Close Volume Date Open High Low Close Adj Close Volume Date Open High Low Close Adj Close Volume
CHNR CHNR CHNR CHNR CHNR CHNR CHNR GNSS GNSS GNSS GNSS GNSS GNSS GNSS SGRP SGRP SGRP SGRP SGRP SGRP SGRP
0 2010-04-29 11.39 11.74 11.39 11.57 11.57 3100 2019-03-11 2.73 2.81 2.71 2.75 2.75 243900.0 2010-04-29 0.79 0.79 0.79 0.79 0.79 0
1 2010-04-30 11.60 11.61 11.50 11.56 11.56 5400 2019-03-12 2.66 2.78 2.66 2.75 2.75 69200.0 2010-04-30 0.79 0.79 0.79 0.79 0.79 0
2 2010-05-03 11.95 11.95 11.22 11.44 11.44 19400 2019-03-13 2.75 2.80 2.71 2.77 2.77 61200.0 2010-05-03 0.79 0.79 0.79 0.79 0.79 0
3 2010-05-04 11.20 11.49 11.20 11.46 11.46 10700 2019-03-14 2.77 2.79 2.75 2.75 2.75 48800.0 2010-05-04 0.79 0.79 0.66 0.79 0.79 9700
4 2010-05-05 11.50 11.60 11.25 11.50 11.50 13400 2019-03-15 2.76 2.79 2.75 2.79 2.79 124400.0 2010-05-05 0.69 0.80 0.67 0.80 0.80 6700
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2512 2020-04-22 0.93 1.04 0.93 0.93 0.93 2600 NaT NaN NaN NaN NaN NaN NaN 2020-04-22 0.72 0.73 0.69 0.73 0.73 14600
2513 2020-04-23 0.94 1.01 0.94 1.01 1.01 2500 NaT NaN NaN NaN NaN NaN NaN 2020-04-23 0.73 0.77 0.73 0.74 0.74 15200
2514 2020-04-24 1.05 1.05 0.92 1.05 1.05 4400 NaT NaN NaN NaN NaN NaN NaN 2020-04-24 0.74 0.77 0.74 0.76 0.76 1300
2515 2020-04-27 1.07 1.07 0.92 1.03 1.03 6200 NaT NaN NaN NaN NaN NaN NaN 2020-04-27 0.77 0.77 0.76 0.76 0.76 3500
2516 2020-04-28 1.07 1.07 0.96 1.00 1.00 22300 NaT NaN NaN NaN NaN NaN NaN 2020-04-28 0.77 0.77 0.75 0.77 0.77 26100
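The NaT/NaN blocks suggest the frames are being aligned on their default integer index rather than on dates. Setting Date as the index before concatenating should line the rows up. A sketch, assuming stock_data maps each symbol to its downloaded frame (as in the attempt above):

frames = {sym: df.set_index('Date') for sym, df in stock_data.items()}
ret = pd.concat(frames, axis=1)                       # outer column level = symbol
ret = ret.swaplevel(0, 1, axis=1).sort_index(axis=1)  # field level on top, symbols below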
I'm trying to find to what degree the chemical properties in a wine dataset influence its quality property.
The error:
ValueError: y data is not in domain of logit link function. Expected
domain: [0.0, 1.0], but found [3.0, 9.0]
The code:
import pandas as pd
from pygam import LogisticGAM
white_data = pd.read_csv("winequality-white.csv",sep=';');
X = white_data[[
"fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide",
"total sulfur dioxide","density","pH","sulphates","alcohol"
]]
print(X.describe)
y = pd.Series(white_data["quality"]);
print(white_quality.describe)
white_gam = LogisticGAM().fit(X, y)
The output of said code:
<bound method NDFrame.describe of fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.0 0.27 0.36 20.7 0.045
1 6.3 0.30 0.34 1.6 0.049
2 8.1 0.28 0.40 6.9 0.050
3 7.2 0.23 0.32 8.5 0.058
4 7.2 0.23 0.32 8.5 0.058
... ... ... ... ... ...
4893 6.2 0.21 0.29 1.6 0.039
4894 6.6 0.32 0.36 8.0 0.047
4895 6.5 0.24 0.19 1.2 0.041
4896 5.5 0.29 0.30 1.1 0.022
4897 6.0 0.21 0.38 0.8 0.020
free sulfur dioxide total sulfur dioxide density pH sulphates \
0 45.0 170.0 1.00100 3.00 0.45
1 14.0 132.0 0.99400 3.30 0.49
2 30.0 97.0 0.99510 3.26 0.44
3 47.0 186.0 0.99560 3.19 0.40
4 47.0 186.0 0.99560 3.19 0.40
... ... ... ... ... ...
4893 24.0 92.0 0.99114 3.27 0.50
4894 57.0 168.0 0.99490 3.15 0.46
4895 30.0 111.0 0.99254 2.99 0.46
4896 20.0 110.0 0.98869 3.34 0.38
4897 22.0 98.0 0.98941 3.26 0.32
alcohol
0 8.8
1 9.5
2 10.1
3 9.9
4 9.9
... ...
4893 11.2
4894 9.6
4895 9.4
4896 12.8
4897 11.8
[4898 rows x 11 columns]>
<bound method NDFrame.describe of 0 6
1 6
2 6
3 6
4 6
..
4893 6
4894 5
4895 6
4896 7
4897 6
Name: quality, Length: 4898, dtype: int64>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-e1c5720823a6> in <module>
16 print(white_quality.describe)
17
---> 18 white_gam = LogisticGAM().fit(X, y)
~/miniconda3/lib/python3.7/site-packages/pygam/pygam.py in fit(self, X, y, weights)
893
894 # validate data
--> 895 y = check_y(y, self.link, self.distribution, verbose=self.verbose)
896 X = check_X(X, verbose=self.verbose)
897 check_X_y(X, y)
~/miniconda3/lib/python3.7/site-packages/pygam/utils.py in check_y(y, link, dist, min_samples, verbose)
227 .format(link, get_link_domain(link, dist),
228 [float('%.2f'%np.min(y)),
--> 229 float('%.2f'%np.max(y))]))
230 return y
231
ValueError: y data is not in domain of logit link function. Expected domain: [0.0, 1.0], but found [3.0, 9.0]
The files: (I'm using Jupyter Notebook but I don't think you'd need to): https://drive.google.com/drive/folders/1RAj2Gh6WfdzpwtgbMaFVuvBVIWwoTUW5?usp=sharing
You probably want to use LinearGAM – LogisticGAM is for (binary) classification tasks, which is why it rejects a y whose values fall outside [0, 1].
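A minimal sketch of the swap, reusing X and y from the question:

from pygam import LinearGAM

# quality runs from 3 to 9, so fit it as a continuous response
white_gam = LinearGAM().fit(X, y)
white_gam.summary()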
I have the following DataFrame:
stock color 15M_c 60M_c mediodia 1D_c 1D-15M_c
0 PYPL rojo 0.32 0.32 0.47 -0.18 -0.50
1 MSFT verde -0.11 0.38 0.79 -0.48 -0.35
2 PYPL verde -1.44 -1.23 0.28 -1.13 0.30
3 V rojo -0.07 0.23 0.70 0.80 0.91
4 JD rojo 0.87 1.11 1.19 0.43 -0.42
5 FB verde 0.20 0.05 0.22 -0.66 -0.82
.. ... ... ... ... ... ... ...
282 GM verde 0.14 0.06 0.47 0.51 0.37
283 FB verde 0.09 -0.08 0.12 0.22 0.12
284 MSFT rojo -0.16 -0.23 -0.06 -0.01 0.14
285 PYPL verde -0.14 -0.41 -0.07 0.20 0.30
286 V verde -0.02 0.00 0.28 0.42 0.45
First I grouped by 'stock' and 'color' with the following code:

import numpy as np

marcos = ['15M_c','60M_c','mediodia','1D_c','1D-15M_c']
grouped = data.groupby(['stock','color'])
res = grouped[marcos].agg([np.size, np.sum])
So in 'res' I get the following DataFrame:
15M_c 60M_c mediodia 1D_c 1D-15M_c
size sum size sum size sum size sum size sum
stock color
AAPL rojo 10.0 -0.46 10.0 -0.20 10.0 -0.33 10.0 -0.25 10.0 0.18
verde 8.0 1.39 8.0 2.48 8.0 1.06 8.0 -1.57 8.0 -2.88
... ... .. .. .. .. .. .. .. .. .. ..
FB verde 15.0 0.92 15.0 -0.64 15.0 -0.11 15.0 -0.89 15.0 -1.80
MSFT rojo 11.0 0.47 11.0 2.07 11.0 2.71 11.0 4.37 11.0 3.83
verde 18.0 1.46 18.0 2.12 18.0 1.26 18.0 0.97 18.0 -0.54
PYPL rojo 9.0 1.06 9.0 2.68 9.0 5.02 9.0 3.98 9.0 2.84
verde 17.0 -1.57 17.0 -2.40 17.0 0.29 17.0 -0.48 17.0 1.08
V rojo 1.0 -0.22 1.0 -0.28 1.0 -0.36 1.0 -0.29 1.0 -0.06
verde 9.0 -1.01 9.0 -1.42 9.0 -0.86 9.0 0.58 9.0 1.61
And then, for each 'stock', I want to add the 'verde' row to the 'rojo' row, multiplying the rojo sums by -1. The final result I want is:
15M_c 60M_c mediodia 1D_c 1D-15M_c
size sum sum sum sum sum
stock
AAPL 18.0 1.85 2.68 1.39 -1.32 -3.06
... .. .. .. .. .. ..
FB 15.0 0.92 -0.64 -0.11 -0.89 -1.80
MSFT 29.0 0.99 0.05 -1.45 -3.40 -4.37
PYPL 26.0 -2.63 -5.08 .. .. ..
V 10.0 -0.79 -1.14 .. .. ..
Thank you very much in advance for your help.
pandas.IndexSlice
Use loc and IndexSlice to change the sign of the appropriate values, then use sum(level=0):
islc = pd.IndexSlice
res.loc[islc[:, 'rojo'], islc[:, 'sum']] *= -1
res.sum(level=0)
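Note that sum(level=0) was deprecated and removed in pandas 2.0; on recent versions the equivalent is a groupby on the index level:

res.groupby(level=0).sum()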
Convert the columns in marcos based on the value of color
import numpy as np
for m in marcos:
data[m] = np.where(data['color'] == 'rojo', -data[m], data[m])
Then you can skip grouping by color altogether:
grouped = data.groupby(['stock'])
res = grouped[marcos].agg([np.size, np.sum])
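On recent pandas versions, passing the NumPy callables to agg emits a deprecation warning; the string aliases do the same thing and are the idiomatic spelling:

res = grouped[marcos].agg(['size', 'sum'])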
Let's take as an example the following dataset:
make address all 3d our over length_total y
0 0.0 0.64 0.64 0.0 0.32 0.0 278 1
1 0.21 0.28 0.5 0.0 0.14 0.28 1028 1
2 0.06 0.0 0.71 0.0 1.23 0.19 2259 1
3 0.15 0.0 0.46 0.1 0.61 0.0 1257 1
4 0.06 0.12 0.77 0.0 0.19 0.32 749 1
5 0.0 0.0 0.0 0.0 0.0 0.0 21 1
6 0.0 0.0 0.25 0.0 0.38 0.25 184 1
7 0.0 0.69 0.34 0.0 0.34 0.0 261 1
8 0.0 0.0 0.0 0.0 0.9 0.0 25 1
9 0.0 0.0 1.42 0.0 0.71 0.35 205 1
10 0.0 0.0 0.0 0.0 0.0 0.0 23 0
11 0.48 0.0 0.0 0.0 0.48 0.0 37 0
12 0.12 0.0 0.25 0.0 0.0 0.0 491 0
13 0.08 0.08 0.25 0.2 0.0 0.25 807 0
14 0.0 0.0 0.0 0.0 0.0 0.0 38 0
15 0.24 0.0 0.12 0.0 0.0 0.12 227 0
16 0.0 0.0 0.0 0.0 0.75 0.0 77 0
17 0.1 0.0 0.21 0.0 0.0 0.0 571 0
18 0.51 0.0 0.0 0.0 0.0 0.0 74 0
19 0.3 0.0 0.15 0.0 0.0 0.15 155 0
I want to get a pivot table from the previous dataset, in which the columns (make, address, all, 3d, our, over, length_total) have their mean values grouped by the column y. The following table is the expected result:
y
1 0
make 0.048 0.183
address 0.173 0.008
all 0.509 0.098
3d 0.01 0.02
our 0.482 0.123
over 0.139 0.052
length_total 626.7 250
Is it possible to get the desired result through the pivot_table method of a pandas DataFrame? If so, how?
Is there a more effective way to do this?
Some people like using stack or unstack, but I prefer good ol' pd.melt to "flatten" or "unpivot" a frame:
>>> df_m = pd.melt(df, id_vars="y")
>>> df_m.pivot_table(index="variable", columns="y")
value
y 0 1
variable
3d 0.020 0.010
address 0.008 0.173
all 0.098 0.509
length_total 250.000 626.700
make 0.183 0.048
our 0.123 0.482
over 0.052 0.139
(If you want to preserve the original column order as the new row order, you can use .loc to index into this, something like df2.loc[df.columns].dropna()).
Melting does the flattening, and preserves y as a column, putting the old column names as a new column called "variable" (which can be changed if you like):
>>> pd.melt(df, id_vars="y").head()
y variable value
0 1 make 0.00
1 1 make 0.21
2 1 make 0.06
3 1 make 0.15
4 1 make 0.06
After that we can call pivot_table as we would ordinarily.
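For the record, a plain groupby reaches the same table: take the mean of every column per y, then transpose so the variables become rows:

>>> df.groupby("y").mean().T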