Pandas dataframe doesn't recognize values in list - python

I have a list that looks something like this:
[ deptht depthb clay silt sand OM bulk_density pH sat_hidric_cond
0 89 152 NaN NaN NaN NaN NaN NaN 0.000074
1 0 25 0.20 0.72 0.08 2.00 1.30 5.8 0.000917
2 25 89 0.34 0.58 0.08 0.25 1.48 5.0 0.000091,
deptht depthb clay silt sand OM bulk_density pH sat_hidric_cond
29 0 25 0.07 0.12 0.81 3.0 1.20 6.3 0.0055
32 25 44 0.05 0.11 0.84 1.7 1.20 6.1 0.0055
41 44 70 0.04 0.08 0.88 0.6 1.58 6.4 0.0055
50 70 203 0.02 0.03 0.95 0.3 1.60 7.2 0.0055,
deptht depthb clay silt sand OM bulk_density pH \
6 157 203 0.335 0.323 0.342 0.25 1.90 7.9
8 0 25 0.225 0.527 0.248 2.00 1.40 6.2
9 25 66 0.420 0.502 0.078 0.75 1.53 6.5
12 66 109 0.240 0.518 0.242 0.25 1.53 7.5
15 109 157 0.240 0.560 0.200 0.25 1.45 7.9
sat_hidric_cond
6 0.000074
8 0.000917
9 0.000282
12 0.000776
15 0.000776 ,
deptht depthb clay silt sand OM bulk_density pH \
0 71 109 0.100 0.234 0.666 0.25 1.68 5.8
1 109 152 0.100 0.265 0.635 0.25 1.70 8.2
3 0 23 0.085 0.237 0.678 2.00 1.45 6.2
4 23 71 0.210 0.184 0.606 0.25 1.55 5.5
sat_hidric_cond
0 0.0023
1 0.0023
3 0.0028
4 0.0009 ,
deptht depthb clay silt sand OM bulk_density pH \
3 0 25 0.11 0.230 0.660 0.75 1.55 7.2
4 25 76 0.14 0.192 0.668 0.25 1.55 7.2
6 76 152 0.14 0.556 0.304 0.00 1.75 8.2
sat_hidric_cond
3 0.002800
4 0.002800
6 0.000091 ]
When I try to transform my list into a DataFrame with soil = pd.DataFrame(data),
I get this output:
0
0 deptht depthb clay silt sand OM bul...
1 deptht depthb clay silt sand OM bul...
2 deptht depthb clay silt sand OM ...
3 deptht depthb clay silt sand OM ...
4 deptht depthb clay silt sand OM b...
Those are the five elements of my list, but it is not recognizing the values associated with each variable.
However, when I use the squeeze function, soil = soil.iloc[1].squeeze(),
I get something similar to the result I want:
deptht depthb clay silt sand OM bulk_density pH sat_hidric_cond
29 0 25 0.07 0.12 0.81 3.0 1.20 6.3 0.0055
32 25 44 0.05 0.11 0.84 1.7 1.20 6.1 0.0055
41 44 70 0.04 0.08 0.88 0.6 1.58 6.4 0.0055
50 70 203 0.02 0.03 0.95 0.3 1.60 7.2 0.0055
But I have to use the iloc function to select each element of the list individually.
What I'm looking for is a method that I can apply to the whole list to get an output like the one the pandas squeeze method gives.
Any help is greatly appreciated.
Thank you very much.

data is a list of DataFrames, so pd.DataFrame(data) stores each one as a single object in a one-column frame; it seems you want to extract the second element of the list:
soil = pd.DataFrame(data[1])
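If instead you need one result for the whole list, as the question describes, pd.concat accepts a list of DataFrames. A minimal sketch, assuming every element of data is a DataFrame with the same columns:
import pandas as pd

# Stack every DataFrame in the list into a single frame;
# ignore_index=True replaces the original, non-unique row labels.
soil = pd.concat(data, ignore_index=True)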

Related

I wish to optimize this code in a more Pythonic way using lambda and pandas

I have the following Dataframe:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067
13 23 NaN NaN NaN NaN NaN NaN NaN NaN 983.5 BQ0067
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067
17 11 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 BQ0068
18 21 4.83 11.9 28.1 44.2 54.63 16.76 6.70 0.19 953.7 BQ0068
19 22 4.40 10.7 26.3 43.4 57.55 19.85 8.59 0.53 974.9 BQ0068
20 23 17.61 43.8 67.9 122.6 221.20 0.75 0.33 58.27 974.9 BQ0068
21 31 15.09 22.3 33.3 45.6 59.45 0.98 0.38 0.73 1773.7 BQ0068
I wish to do the following things:
Steps:
1. Whenever TEST_NUMBER 11 is NaN (null values), I need to remove all rows of that particular PRODUCT_NO. For example, in the given dataframe, PRODUCT_NO BQ0068 has TEST_NUMBER 11 with NaN values, hence all rows of BQ0068 should be removed.
2. If any TEST_NUMBER other than TEST_NUMBER 11 has NaN values, then only that particular TEST_NUMBER's row should be removed. For example, PRODUCT_NO BQ0067 has a row of TEST_NUMBER 23 with NaN values, hence only that particular row of TEST_NUMBER 23 should be removed.
3. After doing the above steps, I need to do the computation. For example, for PRODUCT_NO BQ0066 I need to compute the difference between rows in the following way: TEST_NUMBER 21 - TEST_NUMBER 11, TEST_NUMBER 22 - TEST_NUMBER 11, TEST_NUMBER 23 - TEST_NUMBER 11, TEST_NUMBER 24 - TEST_NUMBER 11, TEST_NUMBER 25 - TEST_NUMBER 11. And then TEST_NUMBER 31 - TEST_NUMBER 25, TEST_NUMBER 32 - TEST_NUMBER 25, TEST_NUMBER 33 - TEST_NUMBER 25, TEST_NUMBER 34 - TEST_NUMBER 25. The same procedure carries on for each successive PRODUCT_NO. As you can see, the frequency of TEST_NUMBERs is different for each PRODUCT_NO, but in all cases every PRODUCT_NO will have only one TEST_NUMBER 11, and the other TEST_NUMBERs will be in the ranges 21 to 29 (21, 22, 23, 24, 25, 26, 27, 28, 29) and 31 to 39 (31, 32, 33, 34, 35, 36, 37, 38, 39).
PYTHON CODE
def pick_closest_sample(sample_list, sample_no):
    sample_list = sorted(sample_list)
    buffer = []
    for number in sample_list:
        if sample_no // 10 == number // 10:
            break
        else:
            buffer.append(number)
    if len(buffer) > 0:
        return buffer[-1]
    return sample_no

def add_closest_sample_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    out = []
    for product_no in unique_product_nos:
        subset = df[df['PRODUCT_NO'] == product_no]
        if subset.iloc[0].isnull().sum() == 0:
            subset.dropna(inplace=True)
            sample_list = subset['TEST_NUMBER'].to_list()
            subset['target_sample'] = subset['TEST_NUMBER'].apply(lambda x: pick_closest_sample(sample_list, x))
            out.append(subset)
    if len(out) > 0:
        out = pd.concat(out)
        out.dropna(inplace=True)
    return out
Output of above two functions:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO target_sample
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066 11
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066 11
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066 11
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066 11
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066 11
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066 11
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066 25
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066 25
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066 25
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066 25
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067 11
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067 11
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067 11
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067 22
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067 22
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067 22
As you can see, all rows of PRODUCT_NO BQ0068 are removed because TEST_NUMBER 11 had NaN values. Likewise, only the row of TEST_NUMBER 23 of PRODUCT_NO BQ0067 is removed because it had NaN values. So the requirements mentioned in the first two steps are met. Now the computation for PRODUCT_NO BQ0067 will be TEST_NUMBER 31 - TEST_NUMBER 22, TEST_NUMBER 32 - TEST_NUMBER 22, TEST_NUMBER 33 - TEST_NUMBER 22.
PYTHON CODE
def compute_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    out = []
    for product_no in unique_product_nos:
        subset = df[df['PRODUCT_NO'] == product_no]
        target_list = list(subset['target_sample'].unique())
        for target in target_list:
            target_df = subset[subset['target_sample'] == target]
            target_subset = [subset[subset['TEST_NUMBER'] == target]] * len(target_df)
            target_subset = pd.concat(target_subset)
            if len(target_subset) > 0:
                target_subset.index = target_df.index
                diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
                for col in diff_cols:
                    target_df[col + '_diff'] = target_df[col] - target_subset[col]
            out.append(target_df)
    if len(out) > 0:
        out = pd.concat(out)
    return out
Output of the above function:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM ... target_sample D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 ... 11 -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 ... 11 -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 ... 11 -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 ... 11 -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 ... 11 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 ... 25 -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 ... 25 -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 ... 25 -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 ... 25 -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 ... 11 -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 ... 11 -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 ... 22 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 ... 22 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 ... 22 -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 790.3
Kindly help me optimize the code of the three functions I posted, so I can write them in a more Pythonic way.
Points 1. and 2. can be achieved in a few lines with pandas functions.
You can then calculate "target_sample" and your diff columns in the same loop using groupby:
# 1. Whenever TEST_NUMBER == 11 has D1 value NaN, remove all rows with this PRODUCT_NO
drop_prod_no = df[(df.TEST_NUMBER == 11) & (df.D1.isna())]["PRODUCT_NO"]
df.drop(df[df.PRODUCT_NO.isin(drop_prod_no)].index, axis=0, inplace=True)
# 2. Drop remaining rows with NaN values
df.dropna(inplace=True)
# 3. Set column "target_sample" and calculate diffs
new_df = pd.DataFrame()
diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    closest_sample = last_sample = 11
    for index, row in subset.iterrows():
        if row.TEST_NUMBER // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        subset.at[index, "target_sample"] = closest_sample
        last_sample = row.TEST_NUMBER
        for col in diff_cols:
            subset.at[index, col + "_diff"] = subset.at[index, col] - float(subset[subset.TEST_NUMBER == closest_sample][col])
    new_df = pd.concat([new_df, subset])
print(new_df)
Output:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 ... D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 ... -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 ... -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 ... -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 ... -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 ... 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 ... -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 ... -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 ... -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 ... -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 ... -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 ... -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 ... 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 ... 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 ... -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 790.3
Edit: you can avoid using iterrows by applying lambda functions like you did:
# 3. Set column "target_sample" and calculate diffs
def get_closest_sample(samples, test_no):
    closest_sample = last_sample = 11
    for smpl in samples:
        if smpl // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        if smpl == test_no:
            break
        last_sample = smpl
    return closest_sample

new_df = pd.DataFrame()
diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    sample_list = list(subset["TEST_NUMBER"])
    subset["target_sample"] = subset["TEST_NUMBER"].apply(lambda x: get_closest_sample(sample_list, x))
    for col in diff_cols:
        subset[col + "_diff"] = subset.apply(lambda row: row[col] - float(subset[subset.TEST_NUMBER == row["target_sample"]][col]), axis=1)
    new_df = pd.concat([new_df, subset])
print(new_df)
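Once "target_sample" is in place, the per-column apply can also be replaced by a single merge. This is a sketch of an alternative not in the original answer, assuming the frame (here called df) already carries the target_sample column:
# Build a reference frame keyed by (PRODUCT_NO, target_sample), then
# subtract the reference values column-wise after a many-to-one merge.
ref = df[['PRODUCT_NO', 'TEST_NUMBER'] + diff_cols].rename(columns={'TEST_NUMBER': 'target_sample'})
merged = df.merge(ref, on=['PRODUCT_NO', 'target_sample'], suffixes=('', '_ref'))
for col in diff_cols:
    merged[col + '_diff'] = merged[col] - merged[col + '_ref']
merged = merged.drop(columns=[c + '_ref' for c in diff_cols])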

Reading in a .txt file to get time series from rows of years and columns of monthly values

How could I read in a txt file like the one from
https://psl.noaa.gov/data/correlation/pna.data (example below)
1960 -0.16 -0.22 -0.69 -0.07 0.99 1.20 1.11 1.85 -0.01 0.48 -0.52 1.15
1961 1.16 0.17 0.28 -1.14 -0.25 1.84 -0.52 0.47 1.10 -1.94 -0.40 -1.54
1962 -0.74 -0.54 -0.71 -1.50 -1.11 -0.97 -0.36 0.57 -0.83 1.33 0.53 -0.38
1963 0.09 0.79 -2.04 -0.79 -0.95 0.50 -1.10 -1.01 0.87 0.93 -0.31 1.46
1964 -0.44 1.36 -1.31 -1.30 -2.27 0.27 0.20 0.83 0.92 0.80 -0.78 -2.03
1965 -0.92 -1.03 -0.80 -1.07 -0.42 1.89 -1.26 0.32 0.36 1.42 -0.81 -1.56
into a pandas dataframe to plot as a time series, for example from 1960-1965, with each value column (corresponding to a month) being plotted? I rarely work with .txt files.
Here's what you can try:
import pandas as pd
import requests
import re

# Fetch the raw text, then drop the header line and the last four footer lines
aa = requests.get("https://psl.noaa.gov/data/correlation/pna.data").text
aa = aa.split("\n")[1:-4]
# Drop each line's leading character, then collapse runs of spaces into commas
aa = list(map(lambda x: x[1:], aa))
aa = "\n".join(aa)
aa = re.sub(" +", ",", aa)
with open("test.csv", "w") as f:
    f.write(aa)
df = pd.read_csv("test.csv", header=None, index_col=0).rename_axis('Year')
df.columns = list(pd.date_range(start='2021-01', freq='M', periods=12).month_name())
print(df.head())
df.to_csv("test.csv")
This is going to give you, in the test.csv file, a table with a Year index column followed by month-name columns (January, February, March, ... December), one row per year from 1948 through 2021.
Use pd.read_fwf as suggested by @SanskarSingh:
>>> pd.read_fwf('data.txt', header=None, index_col=0).rename_axis('Year')
1 2 3 4 5 6 7 8 9 10 11 12
Year
1960 -0.16 -0.22 -0.69 -0.07 0.99 1.20 1.11 1.85 -0.01 0.48 -0.52 1.15
1961 1.16 0.17 0.28 -1.14 -0.25 1.84 -0.52 0.47 1.10 -1.94 -0.40 -1.54
1962 -0.74 -0.54 -0.71 -1.50 -1.11 -0.97 -0.36 0.57 -0.83 1.33 0.53 -0.38
1963 0.09 0.79 -2.04 -0.79 -0.95 0.50 -1.10 -1.01 0.87 0.93 -0.31 1.46
1964 -0.44 1.36 -1.31 -1.30 -2.27 0.27 0.20 0.83 0.92 0.80 -0.78 -2.03
1965 -0.92 -1.03 -0.80 -1.07 -0.42 1.89 -1.26 0.32 0.36 1.42 -0.81 -1.56
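A hedged variant reading straight from the URL; the skiprows/skipfooter values are assumptions about the PSL file's current layout (one header line, four footer lines of metadata), and the 1960-1965 slice from the question can then be plotted directly:
import pandas as pd
import matplotlib.pyplot as plt

# skiprows/skipfooter are assumptions about the file's current layout
df = pd.read_fwf("https://psl.noaa.gov/data/correlation/pna.data",
                 skiprows=1, skipfooter=4, header=None, index_col=0).rename_axis('Year')
df.columns = pd.date_range(start='2021-01', freq='M', periods=12).month_name()
# One line per month across the selected years
df.loc[1960:1965].plot()
plt.show()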

How do I plot data from multiple CSVs each with different column numbers

The file has no header, but I would need to select, say, columns 5, 9, 13, 17, etc. against column 2 (time). How can this be achieved in the case where headers are present as well? Edit: Each file contains data for one day; the time format is GPS time, i.e. the year, day of year, and seconds since midnight. How can I plot for, say, 1-30 January 2019?
Here is one piece of code I tried:
import numpy as np
import glob, os
import matplotlib.pyplot as plt

files = glob.glob('*.s4')
#print(files)
for file in files:
    f = np.loadtxt(file, skiprows=3)
    #print(file[0:9].upper())
    for i in range(5, 50, 4):
        t = f[:, 2] / 3600.
        s4 = f[:, i]
        pos = np.where(t)[0]
        pos1 = np.where(s4[pos] < 0.15)[0]
        s4[pos1] = np.nan
        plt.scatter(t, s4)
        #print(len(s4))
    plt.xticks(np.arange(0, 26, 2))
    #plt.title(str(i))
    plt.show()
The problem is that this code only plots one day at a time.
Here is a sample of the data.
19 001 45 11 1 0.07 214.9 37.5 8 0.08 314.5 34.2 10 0.14 102.6 14.3 11 0.07 241.2 49.6 14 0.07 152.0 50.0 18 0.05 212.7 68.0 22 0.08 226.1 33.7 27 0.06 346.0 22.0 31 0.04 63.5 47.7 32 0.06 144.3 30.4 138 0.09 282.0 17.8
19 001 105 11 1 0.05 214.9 37.9 8 0.07 314.9 33.8 10 0.24 102.2 14.1 11 0.07 241.7 49.9 14 0.06 151.9 49.6 18 0.06 213.0 68.4 22 0.12 225.7 34.0 27 0.06 346.2 21.7 31 0.04 64.1 47.9 32 0.06 144.2 30.0 138 0.09 282.0 17.8
19 001 165 11 1 0.06 214.9 38.4 8 0.11 315.3 33.5 10 0.12 101.8 13.9 11 0.06 242.3 50.1 14 0.06 151.8 49.1 18 0.05 213.4 68.9 22 0.07 225.2 34.2 27 0.11 346.5 21.3 31 0.04 64.8 48.2 32 0.10 144.0 29.6 138 0.09 282.0 17.8
19 001 225 11 1 0.06 214.9 38.8 8 0.06 315.8 33.2 10 0.10 101.4 13.7 11 0.06 242.8 50.4 14 0.05 151.7 48.6 18 0.04 213.7 69.4 22 0.06 224.8 34.4 27 0.08 346.8 20.9 31 0.05 65.5 48.4 32 0.09 143.9 29.2 138 0.09 282.0 17.8
19 001 285 11 1 0.06 215.0 39.2 8 0.11 316.2 32.9 10 0.14 100.9 13.6 11 0.05 243.4 50.6 14 0.06 151.6 48.2 18 0.06 214.1 69.8 22 0.08 224.4 34.7 27 0.07 347.0 20.5 31 0.06 66.1 48.6 32 0.09 143.7 28.8 138 0.09 282.0 17.8
19 001 345 11 1 0.06 215.0 39.7 8 0.08 316.6 32.5 10 0.10 100.5 13.4 11 0.04 244.0 50.9 14 0.06 151.5 47.7 18 0.04 214.6 70.3 22 0.07 223.9 34.9 27 0.08 347.3 20.2 31 0.07 66.8 48.9 32 0.08 143.6 28.4 138 0.09 282.0 17.8
19 001 405 11 1 0.06 215.1 40.1 8 0.07 317.0 32.2 10 0.13 100.1 13.2 11 0.05 244.6 51.1 14 0.08 151.4 47.3 18 0.05 215.0 70.8 22 0.07 223.5 35.1 27 0.12 347.5 19.8 31 0.08 67.5 49.1 32 0.12 143.4 28.0 138 0.09 282.0 17.8
19 001 465 11 1 0.06 215.1 40.5 8 0.12 317.4 31.9 10 0.10 99.7 13.0 11 0.08 245.2 51.4 14 0.05 151.3 46.8 18 0.06 215.5 71.2 22 0.06 223.0 35.4 27 0.12 347.8 19.4 31 0.03 68.2 49.3 32 0.18 143.3 27.7 138 0.09 282.0 17.8
19 001 525 11 1 0.09 215.2 40.9 8 0.12 317.9 31.5 10 0.11 99.3 12.8 11 0.04 245.8 51.6 14 0.15 151.2 46.4 18 0.06 216.0 71.7 22 0.06 222.6 35.6 27 0.08 348.0 19.1 31 0.05 68.9 49.5 32 0.08 143.1 27.3 138 0.09 282.0 17.8
19 001 585 11 1 0.07 215.2 41.4 8 0.09 318.3 31.2 10 0.12 98.9 12.6 11 0.04 246.5 51.8 14 0.06 151.1 45.9 18 0.05 216.5 72.2 22 0.06 222.1 35.8 27 0.08 348.3 18.7 31 0.07 69.6 49.7 32 0.11 143.0 26.9 138 0.09 282.0 17.8
Assuming that a space character is the column separator, you can load them into a list of lists:
data = []
with open(datafile, 'r') as file:
    for line in file:
        # splits into a list based on whitespace separator
        data.append(line.split())
Taking part of your example: to compare the values in column 2 with column 5 you could do:
for line in data:
    if line[1] == line[4]:
        print("it's a match!")
If you have a header you want to ignore, just skip the first line when you open the file:
with open(datafile, 'r') as file:
    # do nothing with this line
    header = file.readline()
    ...
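The question also asks how to put several days on one plot. A sketch of one way to do it, under the assumption (taken from the sample rows) that the first three columns are year, day of year, and seconds since midnight:
import glob
import numpy as np
import matplotlib.pyplot as plt

for path in sorted(glob.glob('*.s4')):
    f = np.loadtxt(path, skiprows=3)
    # fractional day of year = day of year + seconds-since-midnight / 86400
    t = f[:, 1] + f[:, 2] / 86400.0
    for i in range(5, 50, 4):
        s4 = f[:, i]
        s4[s4 < 0.15] = np.nan  # mask weak values, as in the original code
        plt.scatter(t, s4, s=4)
plt.xlabel('Day of year')
plt.ylabel('S4')
plt.show()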

pyGAM `y data is not in domain of logit link function`

I'm trying to find to what degree the chemical properties of a wine dataset influence the quality property of the dataset.
The error:
ValueError: y data is not in domain of logit link function. Expected
domain: [0.0, 1.0], but found [3.0, 9.0]
The code:
import pandas as pd
from pygam import LogisticGAM
white_data = pd.read_csv("winequality-white.csv",sep=';');
X = white_data[[
"fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide",
"total sulfur dioxide","density","pH","sulphates","alcohol"
]]
print(X.describe)
y = pd.Series(white_data["quality"]);
print(white_quality.describe)
white_gam = LogisticGAM().fit(X, y)
The output of said code:
<bound method NDFrame.describe of fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.0 0.27 0.36 20.7 0.045
1 6.3 0.30 0.34 1.6 0.049
2 8.1 0.28 0.40 6.9 0.050
3 7.2 0.23 0.32 8.5 0.058
4 7.2 0.23 0.32 8.5 0.058
... ... ... ... ... ...
4893 6.2 0.21 0.29 1.6 0.039
4894 6.6 0.32 0.36 8.0 0.047
4895 6.5 0.24 0.19 1.2 0.041
4896 5.5 0.29 0.30 1.1 0.022
4897 6.0 0.21 0.38 0.8 0.020
free sulfur dioxide total sulfur dioxide density pH sulphates \
0 45.0 170.0 1.00100 3.00 0.45
1 14.0 132.0 0.99400 3.30 0.49
2 30.0 97.0 0.99510 3.26 0.44
3 47.0 186.0 0.99560 3.19 0.40
4 47.0 186.0 0.99560 3.19 0.40
... ... ... ... ... ...
4893 24.0 92.0 0.99114 3.27 0.50
4894 57.0 168.0 0.99490 3.15 0.46
4895 30.0 111.0 0.99254 2.99 0.46
4896 20.0 110.0 0.98869 3.34 0.38
4897 22.0 98.0 0.98941 3.26 0.32
alcohol
0 8.8
1 9.5
2 10.1
3 9.9
4 9.9
... ...
4893 11.2
4894 9.6
4895 9.4
4896 12.8
4897 11.8
[4898 rows x 11 columns]>
<bound method NDFrame.describe of 0 6
1 6
2 6
3 6
4 6
..
4893 6
4894 5
4895 6
4896 7
4897 6
Name: quality, Length: 4898, dtype: int64>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-e1c5720823a6> in <module>
16 print(white_quality.describe)
17
---> 18 white_gam = LogisticGAM().fit(X, y)
~/miniconda3/lib/python3.7/site-packages/pygam/pygam.py in fit(self, X, y, weights)
893
894 # validate data
--> 895 y = check_y(y, self.link, self.distribution, verbose=self.verbose)
896 X = check_X(X, verbose=self.verbose)
897 check_X_y(X, y)
~/miniconda3/lib/python3.7/site-packages/pygam/utils.py in check_y(y, link, dist, min_samples, verbose)
227 .format(link, get_link_domain(link, dist),
228 [float('%.2f'%np.min(y)),
--> 229 float('%.2f'%np.max(y))]))
230 return y
231
ValueError: y data is not in domain of logit link function. Expected domain: [0.0, 1.0], but found [3.0, 9.0]
The files: (I'm using Jupyter Notebook but I don't think you'd need to): https://drive.google.com/drive/folders/1RAj2Gh6WfdzpwtgbMaFVuvBVIWwoTUW5?usp=sharing
You probably want to use LinearGAM – LogisticGAM is for classification tasks.
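A minimal sketch of that change (and, if a classifier is really the goal, one common workaround is to threshold quality first; the cutoff below is an assumption, not from the original answer):
from pygam import LinearGAM, LogisticGAM

# quality is an ordinal score in [3, 9], so fit a regression GAM
white_gam = LinearGAM().fit(X, y)
white_gam.summary()

# Alternatively, binarize the target for LogisticGAM,
# e.g. "good" wines with quality >= 7 (hypothetical threshold)
binary_gam = LogisticGAM().fit(X, (y >= 7).astype(int))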

Sum results of pandas groupby

I have the following DataFrame:
stock color 15M_c 60M_c mediodia 1D_c 1D-15M_c
0 PYPL rojo 0.32 0.32 0.47 -0.18 -0.50
1 MSFT verde -0.11 0.38 0.79 -0.48 -0.35
2 PYPL verde -1.44 -1.23 0.28 -1.13 0.30
3 V rojo -0.07 0.23 0.70 0.80 0.91
4 JD rojo 0.87 1.11 1.19 0.43 -0.42
5 FB verde 0.20 0.05 0.22 -0.66 -0.82
.. ... ... ... ... ... ... ...
282 GM verde 0.14 0.06 0.47 0.51 0.37
283 FB verde 0.09 -0.08 0.12 0.22 0.12
284 MSFT rojo -0.16 -0.23 -0.06 -0.01 0.14
285 PYPL verde -0.14 -0.41 -0.07 0.20 0.30
286 V verde -0.02 0.00 0.28 0.42 0.45
First I grouped by 'stock' and 'color', which I do with the following code:
marcos = ['15M_c','60M_c','mediodia','1D_c','1D-15M_c']
grouped = data.groupby(['stock','color'])
res = grouped[marcos].agg([np.size, np.sum])
So in 'res' I get the following DataFrame:
15M_c 60M_c mediodia 1D_c 1D-15M_c
size sum size sum size sum size sum size sum
stock color
AAPL rojo 10.0 -0.46 10.0 -0.20 10.0 -0.33 10.0 -0.25 10.0 0.18
verde 8.0 1.39 8.0 2.48 8.0 1.06 8.0 -1.57 8.0 -2.88
... ... .. .. .. .. .. .. .. .. .. ..
FB verde 15.0 0.92 15.0 -0.64 15.0 -0.11 15.0 -0.89 15.0 -1.80
MSFT rojo 11.0 0.47 11.0 2.07 11.0 2.71 11.0 4.37 11.0 3.83
verde 18.0 1.46 18.0 2.12 18.0 1.26 18.0 0.97 18.0 -0.54
PYPL rojo 9.0 1.06 9.0 2.68 9.0 5.02 9.0 3.98 9.0 2.84
verde 17.0 -1.57 17.0 -2.40 17.0 0.29 17.0 -0.48 17.0 1.08
V rojo 1.0 -0.22 1.0 -0.28 1.0 -0.36 1.0 -0.29 1.0 -0.06
verde 9.0 -1.01 9.0 -1.42 9.0 -0.86 9.0 0.58 9.0 1.61
And then I want to sum the 'verde' row with the 'rojo' row for each 'stock', but multiplying the rojo sum by -1. The final result I want is:
15M_c 60M_c mediodia 1D_c 1D-15M_c
size sum sum sum sum sum
stock
AAPL 18.0 1.85 2.68 1.39 -1.32 -3.06
... .. .. .. .. .. ..
FB 15.0 0.92 -0.64 -0.11 -0.89 -1.80
MSFT 29.0 0.99 0.05 -1.45 -3.40 -4.37
PYPL 26.0 -2.63 -5.08 .. .. ..
V 10.0 -0.79 -1.14 .. .. ..
Thank you very much in advance for your help.
pandas.IndexSlice
Use loc and IndexSlice to change the sign of the appropriate values, then use sum(level=0):
islc = pd.IndexSlice
res.loc[islc[:, 'rojo'], islc[:, 'sum']] *= -1
res.sum(level=0)
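Note that the level argument to sum has since been deprecated and removed in recent pandas; the groupby equivalent is:
# Equivalent in pandas 2.x, where sum(level=0) is no longer available
res.groupby(level=0).sum()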
Convert the columns in marcos based on the value of color
import numpy as np
for m in marcos:
    data[m] = np.where(data['color'] == 'rojo', -data[m], data[m])
Then you can skip grouping by color altogether:
grouped = data.groupby(['stock'])
res = grouped[marcos].agg([np.size, np.sum])
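For completeness, a self-contained sketch of this second approach, assuming the frame is named data as in the question; working on a copy keeps the original values intact, since the sign flip would otherwise mutate data in place:
import numpy as np
import pandas as pd

marcos = ['15M_c', '60M_c', 'mediodia', '1D_c', '1D-15M_c']
signed = data.copy()  # avoid mutating the original frame
for m in marcos:
    # flip the sign of the 'rojo' rows once, so a plain sum does the rest
    signed[m] = np.where(signed['color'] == 'rojo', -signed[m], signed[m])
res = signed.groupby('stock')[marcos].agg([np.size, np.sum])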
