How to fill missing timestamps in pandas - python

I have a CSV file as below:
t dd hh v.amm v.alc v.no2 v.cmo aqi
0 201811170000 17 0 0.40 0.41 1.33 1.55 2.45
1 201811170002 17 0 0.40 0.41 1.34 1.51 2.46
2 201811170007 17 0 0.40 0.37 1.35 1.45 2.40
Now I have to fill in the missing minutes by last observation carried forward. Expected output:
t dd hh v.amm v.alc v.no2 v.cmo aqi
0 201811170000 17 0 0.40 0.41 1.33 1.55 2.45
1 201811170001 17 0 0.40 0.41 1.33 1.55 2.45
2 201811170002 17 0 0.40 0.41 1.34 1.51 2.46
3 201811170003 17 0 0.40 0.41 1.34 1.51 2.46
4 201811170004 17 0 0.40 0.41 1.34 1.51 2.46
5 201811170005 17 0 0.40 0.41 1.34 1.51 2.46
6 201811170006 17 0 0.40 0.41 1.34 1.51 2.46
7 201811170007 17 0 0.40 0.37 1.35 1.45 2.40
I tried following this link but was unable to achieve the expected output. Sorry, I'm new to coding.

First create a DatetimeIndex with to_datetime and DataFrame.set_index, then change the frequency with DataFrame.asfreq:
df['t'] = pd.to_datetime(df['t'], format='%Y%m%d%H%M')
df = df.set_index('t').sort_index().asfreq('Min', method='ffill')
print (df)
dd hh v.amm v.alc v.no2 v.cmo aqi
t
2018-11-17 00:00:00 17 0 0.4 0.41 1.33 1.55 2.45
2018-11-17 00:01:00 17 0 0.4 0.41 1.33 1.55 2.45
2018-11-17 00:02:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:03:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:04:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:05:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:06:00 17 0 0.4 0.41 1.34 1.51 2.46
2018-11-17 00:07:00 17 0 0.4 0.37 1.35 1.45 2.40
Or use DataFrame.resample with Resampler.ffill:
df['t'] = pd.to_datetime(df['t'], format='%Y%m%d%H%M')
df = df.set_index('t').sort_index().resample('Min').ffill()
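If you need t back as a regular column in the original YYYYMMDDHHMM format (as in the expected output), you can reset the index afterwards. A minimal, self-contained sketch built from the sample rows above (the astype(str) guards against the column being read from the CSV as integers):
import pandas as pd

df = pd.DataFrame({
    't': [201811170000, 201811170002, 201811170007],
    'dd': [17, 17, 17],
    'hh': [0, 0, 0],
    'v.amm': [0.40, 0.40, 0.40],
    'v.alc': [0.41, 0.41, 0.37],
    'v.no2': [1.33, 1.34, 1.35],
    'v.cmo': [1.55, 1.51, 1.45],
    'aqi': [2.45, 2.46, 2.40],
})

# Parse t, upsample to one row per minute and forward-fill the gaps.
df['t'] = pd.to_datetime(df['t'].astype(str), format='%Y%m%d%H%M')
filled = df.set_index('t').sort_index().resample('min').ffill()

# Restore t as a plain column in the original numeric format.
filled = filled.reset_index()
filled['t'] = filled['t'].dt.strftime('%Y%m%d%H%M')
print(filled)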

I wish to optimize the code in a pythonic way using lambda and pandas

I have the following Dataframe:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067
13 23 NaN NaN NaN NaN NaN NaN NaN NaN 983.5 BQ0067
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067
17 11 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 BQ0068
18 21 4.83 11.9 28.1 44.2 54.63 16.76 6.70 0.19 953.7 BQ0068
19 22 4.40 10.7 26.3 43.4 57.55 19.85 8.59 0.53 974.9 BQ0068
20 23 17.61 43.8 67.9 122.6 221.20 0.75 0.33 58.27 974.9 BQ0068
21 31 15.09 22.3 33.3 45.6 59.45 0.98 0.38 0.73 1773.7 BQ0068
I wish to do the following things:
Steps:
1. Whenever TEST_NUMBER 11 has NaN (null) values, I need to remove all rows for that particular PRODUCT_NO. For example, in the given dataframe, PRODUCT_NO BQ0068 has TEST_NUMBER 11 with NaN values, hence all rows of BQ0068 should be removed.
2. If any TEST_NUMBER other than TEST_NUMBER 11 has NaN values, then only that particular TEST_NUMBER's row should be removed. For example, PRODUCT_NO BQ0067 has a row of TEST_NUMBER 23 with NaN values, hence only that row of TEST_NUMBER 23 should be removed.
3. After the above steps, I need to do the computation. For example, for PRODUCT_NO BQ0066 I need to compute the difference between rows in the following way: TEST_NUMBER 21 - TEST_NUMBER 11, TEST_NUMBER 22 - TEST_NUMBER 11, TEST_NUMBER 23 - TEST_NUMBER 11, TEST_NUMBER 24 - TEST_NUMBER 11, TEST_NUMBER 25 - TEST_NUMBER 11. And then TEST_NUMBER 31 - TEST_NUMBER 25, TEST_NUMBER 32 - TEST_NUMBER 25, TEST_NUMBER 33 - TEST_NUMBER 25, TEST_NUMBER 34 - TEST_NUMBER 25. The same procedure carries on for successive PRODUCT_NOs. As you can see, the TEST_NUMBER frequency differs for each PRODUCT_NO, but in all cases every PRODUCT_NO will have exactly one TEST_NUMBER 11, and the other TEST_NUMBERs will be in the ranges 21 to 29 (21, 22, 23, 24, 25, 26, 27, 28, 29) and 31 to 39 (31, 32, 33, 34, 35, 36, 37, 38, 39).
PYTHON CODE
def pick_closest_sample(sample_list, sample_no):
    sample_list = sorted(sample_list)
    buffer = []
    for number in sample_list:
        if sample_no // 10 == number // 10:
            break
        else:
            buffer.append(number)
    if len(buffer) > 0:
        return buffer[-1]
    return sample_no

def add_closest_sample_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    out = []
    for product_no in unique_product_nos:
        subset = df[df['PRODUCT_NO'] == product_no]
        if subset.iloc[0].isnull().sum() == 0:
            subset.dropna(inplace=True)
            sample_list = subset['TEST_NUMBER'].to_list()
            subset['target_sample'] = subset['TEST_NUMBER'].apply(lambda x: pick_closest_sample(sample_list, x))
            out.append(subset)
    if len(out) > 0:
        out = pd.concat(out)
        out.dropna(inplace=True)
        return out
Output of above two functions:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO target_sample
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066 11
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066 11
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066 11
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066 11
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066 11
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066 11
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066 25
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066 25
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066 25
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066 25
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067 11
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067 11
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067 11
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067 22
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067 22
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067 22
As you can see, all rows of PRODUCT_NO BQ0068 are removed because its TEST_NUMBER 11 had NaN values. Likewise, only the TEST_NUMBER 23 row of PRODUCT_NO BQ0067 is removed, as it had NaN values. So the requirements mentioned in the first two steps are met. The computation for PRODUCT_NO BQ0067 will now be TEST_NUMBER 31 - TEST_NUMBER 22, TEST_NUMBER 32 - TEST_NUMBER 22, TEST_NUMBER 33 - TEST_NUMBER 22.
PYTHON CODE
def compute_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    out = []
    for product_no in unique_product_nos:
        subset = df[df['PRODUCT_NO'] == product_no]
        target_list = list(subset['target_sample'].unique())
        for target in target_list:
            target_df = subset[subset['target_sample'] == target]
            target_subset = [subset[subset['TEST_NUMBER'] == target]] * len(target_df)
            target_subset = pd.concat(target_subset)
            if len(target_subset) > 0:
                target_subset.index = target_df.index
                diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
                for col in diff_cols:
                    target_df[col + '_diff'] = target_df[col] - target_subset[col]
            out.append(target_df)
    if len(out) > 0:
        out = pd.concat(out)
    return out
Output of the above function:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM ... target_sample D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 ... 11 -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 ... 11 -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 ... 11 -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 ... 11 -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 ... 11 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 ... 25 -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 ... 25 -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 ... 25 -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 ... 25 -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 ... 11 -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 ... 11 -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 ... 22 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 ... 22 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 ... 22 -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 790.3
Kindly help me optimize the three functions I posted so I can write them in a more pythonic way.
Points 1. and 2. can be achieved in a few lines with pandas functions.
You can then calculate "target_sample" and your diff_cols in the same loop using groupby:
# 1. Whenever TEST_NUMBER == 11 has D1 value NaN, remove all rows with this PRODUCT_NO
drop_prod_no = df[(df.TEST_NUMBER == 11) & (df.D1.isna())]["PRODUCT_NO"]
df.drop(df[df.PRODUCT_NO.isin(drop_prod_no)].index, axis=0, inplace=True)

# 2. Drop remaining rows with NaN values
df.dropna(inplace=True)

# 3. Set column "target_sample" and calculate diffs
new_df = pd.DataFrame()
diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    closest_sample = last_sample = 11
    for index, row in subset.iterrows():
        if row.TEST_NUMBER // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        subset.at[index, "target_sample"] = closest_sample
        last_sample = row.TEST_NUMBER
        for col in diff_cols:
            subset.at[index, col + "_diff"] = subset.at[index, col] - float(subset[subset.TEST_NUMBER == closest_sample][col])
    new_df = pd.concat([new_df, subset])
print(new_df)
Output:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 ... D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 ... -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 ... -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 ... -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 ... -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 ... 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 ... -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 ... -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 ... -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 ... -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 ... -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 ... -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 ... 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 ... 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 ... 22 -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 790.3
Edit: you can avoid using iterrows by applying lambda functions like you did:
# 3. Set column "target_sample" and calculate diffs
def get_closest_sample(samples, test_no):
    closest_sample = last_sample = 11
    for smpl in samples:
        if smpl // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        if smpl == test_no:
            break
        last_sample = smpl
    return closest_sample

new_df = pd.DataFrame()
diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    sample_list = list(subset["TEST_NUMBER"])
    subset["target_sample"] = subset["TEST_NUMBER"].apply(lambda x: get_closest_sample(sample_list, x))
    for col in diff_cols:
        subset[col + "_diff"] = subset.apply(lambda row: row[col] - float(subset[subset.TEST_NUMBER == row["target_sample"]][col]), axis=1)
    new_df = pd.concat([new_df, subset])
print(new_df)
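If the per-row apply over subset becomes slow on larger frames, the diff columns can also be computed without any Python-level loop over rows by merging each row with its target row. This is only a sketch under the assumption that the target_sample column has already been built as above:
# Assumes df already carries the "target_sample" column computed above.
diff_cols = ['D1', 'D10', 'D50', 'D90', 'D99', 'Q3_15', 'Q3_10', 'l-Q3_63', 'RPM']

# Attach the target row's values to every row by joining
# (PRODUCT_NO, target_sample) against (PRODUCT_NO, TEST_NUMBER).
targets = df[['PRODUCT_NO', 'TEST_NUMBER'] + diff_cols].rename(
    columns={'TEST_NUMBER': 'target_sample'})
merged = df.merge(targets, on=['PRODUCT_NO', 'target_sample'],
                  suffixes=('', '_target'))

# Column-wise subtraction instead of cell-by-cell .at access.
for col in diff_cols:
    merged[col + '_diff'] = merged[col] - merged[col + '_target']

new_df = merged.drop(columns=[col + '_target' for col in diff_cols])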

Convert multiple dataframes into a multi-index dataframe

I have a bunch of stock data downloaded from yahoo finance. Each dataframe looks like this:
Date Open High Low Close Adj Close Volume
0 2019-03-11 2.73 2.81 2.71 2.75 2.75 243900
1 2019-03-12 2.66 2.78 2.66 2.75 2.75 69200
2 2019-03-13 2.75 2.80 2.71 2.77 2.77 61200
3 2019-03-14 2.77 2.79 2.75 2.75 2.75 48800
4 2019-03-15 2.76 2.79 2.75 2.79 2.79 124400
.. ... ... ... ... ... ... ...
282 2020-04-22 3.61 3.75 3.61 3.71 3.71 312900
283 2020-04-23 3.74 3.77 3.66 3.76 3.76 99800
284 2020-04-24 3.78 3.78 3.63 3.63 3.63 89100
285 2020-04-27 3.70 3.70 3.55 3.64 3.64 60600
286 2020-04-28 3.70 3.74 3.64 3.70 3.70 248300
I need to concat the data so it looks like the below multi-index format and I'm at a loss. I've tried a number of pd.concat([list of dfs], zip(cols,symbols), axis=[0,1]) combos with no luck so any help is appreciated!
Adj Close Close High Low Open Volume
CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP
Date
2019-04-30 1.85 3.08 0.69 1.85 3.08 0.69 1.94 3.10 0.70 1.74 3.05 0.67 1.74 3.07 0.70 24800 23900 30400
2019-05-01 1.81 3.15 0.65 1.81 3.15 0.65 1.85 3.17 0.69 1.75 3.06 0.62 1.76 3.09 0.67 15500 72800 85900
2019-05-02 1.80 3.12 0.66 1.80 3.12 0.66 1.87 3.16 0.66 1.76 3.10 0.65 1.80 3.16 0.65 12900 28100 97200
2019-05-03 1.85 3.14 0.67 1.85 3.14 0.67 1.89 3.19 0.69 1.74 3.06 0.62 1.74 3.12 0.62 43200 31300 27500
2019-05-06 1.85 3.13 0.66 1.85 3.13 0.66 1.89 3.25 0.69 1.75 3.11 0.65 1.79 3.11 0.67 37000 50200 31500
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-04-22 0.93 3.71 0.73 0.93 3.71 0.73 1.04 3.75 0.73 0.93 3.61 0.69 0.93 3.61 0.72 2600 312900 14600
2020-04-23 1.01 3.76 0.74 1.01 3.76 0.74 1.01 3.77 0.77 0.94 3.66 0.73 0.94 3.74 0.73 2500 99800 15200
2020-04-24 1.05 3.63 0.76 1.05 3.63 0.76 1.05 3.78 0.77 0.92 3.63 0.74 1.05 3.78 0.74 4400 89100 1300
2020-04-27 1.03 3.64 0.76 1.03 3.64 0.76 1.07 3.70 0.77 0.92 3.55 0.76 1.07 3.70 0.77 6200 60600 3500
2020-04-28 1.00 3.70 0.77 1.00 3.70 0.77 1.07 3.74 0.77 0.96 3.64 0.75 1.07 3.70 0.77 22300 248300 26100
EDIT per Quang Hoang's suggestion:
Tried:
ret = pd.concat(stock_data.values(), keys=stocks, axis=1)
ret = ret.swaplevel(0, 1, axis=1)
Got the following output which looks much closer but still off a bit:
Date Open High Low Close Adj Close Volume Date Open High Low Close Adj Close Volume Date Open High Low Close Adj Close Volume
CHNR CHNR CHNR CHNR CHNR CHNR CHNR GNSS GNSS GNSS GNSS GNSS GNSS GNSS SGRP SGRP SGRP SGRP SGRP SGRP SGRP
0 2010-04-29 11.39 11.74 11.39 11.57 11.57 3100 2019-03-11 2.73 2.81 2.71 2.75 2.75 243900.0 2010-04-29 0.79 0.79 0.79 0.79 0.79 0
1 2010-04-30 11.60 11.61 11.50 11.56 11.56 5400 2019-03-12 2.66 2.78 2.66 2.75 2.75 69200.0 2010-04-30 0.79 0.79 0.79 0.79 0.79 0
2 2010-05-03 11.95 11.95 11.22 11.44 11.44 19400 2019-03-13 2.75 2.80 2.71 2.77 2.77 61200.0 2010-05-03 0.79 0.79 0.79 0.79 0.79 0
3 2010-05-04 11.20 11.49 11.20 11.46 11.46 10700 2019-03-14 2.77 2.79 2.75 2.75 2.75 48800.0 2010-05-04 0.79 0.79 0.66 0.79 0.79 9700
4 2010-05-05 11.50 11.60 11.25 11.50 11.50 13400 2019-03-15 2.76 2.79 2.75 2.79 2.79 124400.0 2010-05-05 0.69 0.80 0.67 0.80 0.80 6700
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2512 2020-04-22 0.93 1.04 0.93 0.93 0.93 2600 NaT NaN NaN NaN NaN NaN NaN 2020-04-22 0.72 0.73 0.69 0.73 0.73 14600
2513 2020-04-23 0.94 1.01 0.94 1.01 1.01 2500 NaT NaN NaN NaN NaN NaN NaN 2020-04-23 0.73 0.77 0.73 0.74 0.74 15200
2514 2020-04-24 1.05 1.05 0.92 1.05 1.05 4400 NaT NaN NaN NaN NaN NaN NaN 2020-04-24 0.74 0.77 0.74 0.76 0.76 1300
2515 2020-04-27 1.07 1.07 0.92 1.03 1.03 6200 NaT NaN NaN NaN NaN NaN NaN 2020-04-27 0.77 0.77 0.76 0.76 0.76 3500
2516 2020-04-28 1.07 1.07 0.96 1.00 1.00 22300 NaT NaN NaN NaN NaN NaN NaN 2020-04-28 0.77 0.77 0.75 0.77 0.77 26100
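The remaining misalignment comes from concatenating on the default integer index rather than on Date, so the three symbols' rows don't line up. A sketch under the assumption that stock_data is a dict mapping each symbol to its downloaded DataFrame (with Date still a regular column): set Date as the index of every frame first, then concat with the symbols as keys and sort the column levels.
import pandas as pd

# stock_data: {'CHNR': df_chnr, 'GNSS': df_gnss, 'SGRP': df_sgrp}, each with a 'Date' column
frames = {symbol: data.set_index('Date') for symbol, data in stock_data.items()}

# Level 0 = symbol, level 1 = field (Open/High/.../Volume); rows aligned on Date.
ret = pd.concat(frames, axis=1)

# Swap to (field, symbol) and group the fields together, as in the target layout.
ret = ret.swaplevel(0, 1, axis=1).sort_index(axis=1)
print(ret.head())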

python add one column based on row value

I have a dataframe as below. I want to add one column based on the column 'need' (for example, in row zero the need is 1, so select the part1 value -0.17). I have pasted the dataframe output which I want below. Thanks.
df = pd.DataFrame({
    'date': [20130101, 20130101, 20130103, 20130104, 20130105, 20130107],
    'need': [1, 3, 2, 4, 3, 1],
    'part1': [-0.17, -1.03, 1.59, -0.05, -0.1, 0.9],
    'part2': [0.67, -0.03, 1.95, -3.25, -0.3, 0.6],
    'part3': [0.7, -3, 1.5, -0.25, -0.37, 0.62],
    'part4': [0.24, -0.44, 1.335, -0.45, -0.57, 0.92]
})
date need output part1 part2 part3 part4
0 20130101 1 -0.17 -0.17 0.67 0.70 0.240
1 20130101 3 -3.00 -1.03 -0.03 -3.00 -0.440
2 20130103 2 1.95 1.59 1.95 1.50 1.335
3 20130104 4 -0.45 -0.05 -3.25 -0.25 -0.450
4 20130105 3 -0.37 -0.10 -0.30 -0.37 -0.570
5 20130107 1 0.90 0.90 0.60 0.62 0.920
Use DataFrame.lookup:
df['new'] = df.lookup(df.index, 'part' + df['need'].astype(str))
print (df)
date need part1 part2 part3 part4 new
0 20130101 1 -0.17 0.67 0.70 0.240 -0.17
1 20130101 3 -1.03 -0.03 -3.00 -0.440 -3.00
2 20130103 2 1.59 1.95 1.50 1.335 1.95
3 20130104 4 -0.05 -3.25 -0.25 -0.450 -0.45
4 20130105 3 -0.10 -0.30 -0.37 -0.570 -0.37
5 20130107 1 0.90 0.60 0.62 0.920 0.90
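Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On recent versions you can get the same result with plain positional indexing; unlike the position-based solution below, this small sketch looks the columns up by name, so it does not assume part1..part4 are adjacent and ordered:
import numpy as np

# Column positions of 'part1'..'part4' chosen per row by the 'need' value.
parts = df.filter(like='part')
col_pos = parts.columns.get_indexer('part' + df['need'].astype(str))
df['new'] = parts.to_numpy()[np.arange(len(df)), col_pos]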
Numpy solution; it requires the part columns to be sorted and increasing by 1 as in the sample:
df['new'] = df.filter(like='part').values[np.arange(len(df)), df['need'] - 1]
print (df)
date need part1 part2 part3 part4 new
0 20130101 1 -0.17 0.67 0.70 0.240 -0.17
1 20130101 3 -1.03 -0.03 -3.00 -0.440 -3.00
2 20130103 2 1.59 1.95 1.50 1.335 1.95
3 20130104 4 -0.05 -3.25 -0.25 -0.450 -0.45
4 20130105 3 -0.10 -0.30 -0.37 -0.570 -0.37
5 20130107 1 0.90 0.60 0.62 0.920 0.90
This should also work:
df['new'] = df.iloc[:, 1:].apply(lambda row: row['part'+str(int(row['need']))], axis=1)
date need part1 part2 part3 part4 new
0 20130101 1 -0.17 0.67 0.70 0.240 -0.17
1 20130101 3 -1.03 -0.03 -3.00 -0.440 -3.00
2 20130103 2 1.59 1.95 1.50 1.335 1.95
3 20130104 4 -0.05 -3.25 -0.25 -0.450 -0.45
4 20130105 3 -0.10 -0.30 -0.37 -0.570 -0.37
5 20130107 1 0.90 0.60 0.62 0.920 0.90

Creating a heatmap using python and csv file

I'm trying to create a heatmap with the x axis being time, the y axis being detectors (it's for freeway speed detection), and the colour scale and numbers on the graph showing occupancy, i.e. whatever value the CSV has for that time and detector.
My first thought is to use matplotlib in conjunction with pandas and numpy.
I've been trying lots of different approaches and feel like I've hit a brick wall in terms of getting it working.
Does anyone have a good idea about using these tools?
Cheers!
Row Labels 14142OB_L1 14142OB_L2 14140OB_E1P0 14140OB_E1P1 14140OB_E2P0 14140OB_E2P1 14140OB_L1 14140OB_L2 14140OB_M1P0 14140OB_M1P1 14140OB_M2P0 14140OB_M2P1 14140OB_M3P0 14140OB_M3P1 14140OB_S1P0 14140OB_S1P1 14140OB_S2P0 14140OB_S2P1 14140OB_S3P0 14140OB_S3P1 14138OB_L1 14138OB_L2 14138OB_L3 14136OB_L1 14136OB_L2 14136OB_L3 14134OB_L1 14134OB_L2 14134OB_L3 14132OB_L1 14132OB_L2 14132OB_L3
00 - 01 hr 0.22 1.42 0.29 0.29 0.59 0.59 0.17 1.47 0.38 0.38 0.56 0.6 0.08 0.1 0.67 0.7 0.88 0.9 0.15 0.17 0.17 1.66 0.47 0.16 1.6 0.49 0.14 0.94 1.21 0.21 1.22 0.44
01 - 02 hr 0.08 0.77 0.08 0.07 0.24 0.24 0.1 0.73 0.08 0.09 0.21 0.23 0.05 0.06 0.21 0.23 0.29 0.29 0.1 0.1 0.08 0.83 0.17 0.1 0.77 0.18 0.08 0.4 0.57 0.07 0.64 0.18
02 - 03 hr 0.08 0.73 0.06 0.06 0.23 0.23 0.06 0.73 0.07 0.07 0.23 0.24 0.02 0.02 0.16 0.17 0.32 0.34 0.06 0.07 0.06 0.77 0.16 0.06 0.78 0.17 0.07 0.3 0.66 0.06 0.68 0.19
03 - 04 hr 0.05 0.85 0.06 0.06 0.22 0.23 0.04 0.86 0.05 0.05 0.2 0.21 0.1 0.11 0.11 0.12 0.32 0.33 0.15 0.16 0.03 0.93 0.14 0.03 0.89 0.15 0.03 0.41 0.61 0.02 0.73 0.21
04 - 05 hr 0.13 1.25 0.09 0.09 0.24 0.24 0.12 1.25 0.11 0.11 0.2 0.21 0.08 0.09 0.19 0.2 0.32 0.34 0.15 0.15 0.1 1.33 0.18 0.11 1.35 0.19 0.11 0.52 1 0.07 1.08 0.29
05 - 06 hr 0.91 2.87 0.08 0.08 0.66 0.69 0.8 2.96 0.15 0.17 0.43 0.45 0.32 0.33 0.39 0.41 0.76 0.82 0.47 0.49 0.59 3.27 0.51 0.58 3.19 0.56 0.45 1.85 2.19 0.43 2.52 0.79
06 - 07 hr 3.92 5.44 1.29 1.14 4.03 4.12 3.19 6.03 1.66 1.69 3.26 3.44 1.84 1.93 13.03 14.97 13.81 19.23 4.69 5.59 3.03 6.72 3.01 2.78 6.81 3.02 1.52 4.22 7.13 2.54 5.94 2.88
07 - 08 hr 4.68 6.35 1.67 1.8 5.69 5.95 4.01 6.81 2.69 2.78 3.84 4.03 3.27 4.05 24.25 24.39 28.07 36.5 15.39 15.38 3.79 7.91 4.28 3.58 7.91 4.33 1.67 6.16 8.3 3.17 6.59 3.74
08 - 09 hr 5.21 6.31 2.51 2.82 7.46 7.72 4.53 6.65 9.03 8.98 13.94 12.77 6.73 8.55 47 48.38 50.08 48.32 22.83 21.91 4.29 8.27 5.04 4.15 8.27 5.16 2.44 6.24 9.17 3.26 6.81 4.16
09 - 10 hr 4.05 6.17 1.01 0.99 4.47 4.55 3.45 6.53 1.68 1.74 3.12 3.24 1.82 1.98 16.49 16.22 15.58 20.36 4.31 5.2 3.36 7.24 3.55 3.03 7.36 3.73 1.89 5.64 6.75 2.24 5.94 3.26
10 - 11 hr 3.62 6.64 1.14 1.15 4.11 4.18 3.23 6.87 1.79 1.87 3.03 3.13 1.72 1.89 15.02 18.75 17.25 22.61 3.06 3.24 3.06 7.69 3.23 2.87 7.49 3.56 2.06 4.99 7.05 2.26 6.2 3.07
11 - 12 hr 4.31 6.74 1.29 1.3 4.91 4.97 3.79 6.88 2.25 2.35 3.97 4.29 1.84 1.98 19.58 22.5 24.92 23.14 3.27 3.46 3.65 7.67 3.96 3.43 7.74 4 2.39 5.4 7.67 2.57 6.42 3.22
12 - 13 hr 4.53 6.9 1.4 1.39 5.81 5.9 3.96 7.18 2.69 2.86 4.94 5.28 2.15 2.29 24.46 28.34 36.59 31.06 5.4 5.39 3.95 7.98 4.54 3.7 8.03 4.69 2.36 5.99 8.29 3.01 6.61 3.37
13 - 14 hr 6.13 7.29 1.57 1.55 6.02 6.11 5.34 7.74 2.67 2.76 5.2 5.56 2.04 2.16 23.74 28.31 31.01 36.89 4.15 4.6 5.22 8.83 4.77 4.96 8.84 4.92 2.65 6.56 9.77 3.96 7.23 3.88
14 - 15 hr 8.72 8.22 2.93 3.06 8.58 8.9 8.94 9.57 17.69 17.2 18.99 23.58 2.37 3.69 38.81 53.33 49.93 45.42 5.69 4.3 8.13 10.04 5.45 7.03 9.94 5.51 3.59 7.41 12.4 5.92 8.04 4.4
15 - 16 hr 13.26 9.75 15.68 18.3 22.21 23.25 10.8 9.06 35.31 37.1 36.27 35.89 3.14 2.91 47.93 54.86 51.96 50.74 6.27 5.77 11.82 12.78 7.62 12.03 12.5 6.55 4.71 9.21 17.87 9.06 9.33 4.5
16 - 17 hr 18.25 14.92 4.95 4.63 9.68 10.2 20.14 16.68 21.38 21.39 23.92 28.11 1.75 1.86 48.15 47.31 46.65 50.4 3.46 3.31 21.52 16.97 7.37 18.47 14.84 7.51 6.88 15.52 27.8 11.17 9.35 5.34
17 - 18 hr 13.82 9.76 31.23 31.46 34.89 36.06 13.72 11.14 41.24 44.5 42 47.07 1.6 1.62 57.4 58.92 57.23 62.92 3.41 8.01 20.26 20.35 15.25 21.49 20.5 9.31 12.27 17.3 34.46 22.89 20.56 12.04
18 - 19 hr 7.51 5.81 50.48 49.94 45.97 46.43 8.65 5.95 49.26 48.28 51.04 46.46 2 3.04 56.08 56.39 54.95 59.06 3.18 6.47 13.44 13.73 25.79 17.67 21.52 19.26 6.35 11.52 22.13 11.31 10.4 5.42
19 - 20 hr 3.96 5.01 2.77 2.71 6.62 6.87 3.65 5.19 7.72 7.86 9.5 10.44 1.17 1.44 23.6 30.16 28.82 30.87 1.73 1.76 3.6 6.52 4.04 3.38 6.51 4.03 1.88 5.05 7.15 2.99 5.44 3.1
20 - 21 hr 2.16 3.72 1.75 1.74 3.96 4.02 2.03 3.72 2.62 2.73 4.32 4.54 0.76 0.79 18.41 23.69 30.91 31.05 1.31 1.26 2.1 4.76 2.97 1.93 4.75 2.97 1.43 3.43 4.9 1.73 3.9 2.27
21 - 22 hr 2.03 3.81 1.49 1.47 2.97 2.99 2 3.79 2.11 2.15 3.07 3.27 0.37 0.4 12.96 14.05 15.49 17.93 0.64 0.67 1.86 4.87 2.35 1.75 4.88 2.29 1.14 3.4 4.44 1.57 3.89 1.92
22 - 23 hr 1.33 3.2 1.21 1.22 2.46 2.5 1.21 3.23 1.75 1.79 2.36 2.48 0.35 0.38 6.19 9.26 10.48 12.16 0.57 0.58 1.28 3.85 2 1.23 3.84 1.96 0.82 2.74 3.55 1.12 3.29 1.73
23 - 24 hr 0.65 2.43 0.49 0.49 1.41 1.44 0.69 2.35 0.69 0.7 1.3 1.38 0.19 0.21 1.51 1.66 2.46 2.45 0.41 0.42 0.71 2.63 1.06 0.59 2.73 1.04 0.4 1.8 2.25 0.58 2.28 0.94
Grand Total 4.57 5.26 5.23 5.32 7.64 7.85 4.36 5.56 8.54 8.73 9.83 10.29 1.49 1.74 20.68 23.05 23.71 25.17 3.78 4.1 4.84 6.98 4.5 4.79 7.21 3.98 2.39 5.29 8.59 3.84 5.63 2.97
Here is the current script I'm using.
read_occupancy = pd.read_csv (r'C:\Users\holborm\Desktop\Visualisation\dataaxisplotstuff.csv') #read the csv file (put 'r' before the path string to address any special characters, such as '\'). Don't forget to put the file name at the end of the path + ".csv"
df = DataFrame(read_occupancy) # assign column names
#create time and detector name axis
time_axis = df.index
detector_axis = df.columns
plt.plot(df)
Using Seaborn
read_occupancy = pd.read_csv (r'C:\Users\holborm\Desktop\Visualisation\dataaxisplotstuff.csv') #read the csv file (put 'r' before the path string to address any special characters, such as '\'). Don't forget to put the file name at the end of the path + ".csv"
df = DataFrame(read_occupancy) # assign column names
#create time and detector name axis
sns.heatmap(df)
Error
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-79-33a3388e21cc> in <module>()
6 #create time and detector name axis
7
----> 8 sns.heatmap(df)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py in heatmap(data, vmin, vmax, cmap, center, robust, annot, fmt, annot_kws, linewidths, linecolor, cbar, cbar_kws, cbar_ax, square, xticklabels, yticklabels, mask, ax, **kwargs)
515 plotter = _HeatMapper(data, vmin, vmax, cmap, center, robust, annot, fmt,
516 annot_kws, cbar, cbar_kws, xticklabels,
--> 517 yticklabels, mask)
518
519 # Add the pcolormesh kwargs here
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py in __init__(self, data, vmin, vmax, cmap, center, robust, annot, fmt, annot_kws, cbar, cbar_kws, xticklabels, yticklabels, mask)
166 # Determine good default values for the colormapping
167 self._determine_cmap_params(plot_data, vmin, vmax,
--> 168 cmap, center, robust)
169
170 # Sort out the annotations
~\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\matrix.py in _determine_cmap_params(self, plot_data, vmin, vmax, cmap, center, robust)
203 cmap, center, robust):
204 """Use some heuristics to set good defaults for colorbar and range."""
--> 205 calc_data = plot_data.data[~np.isnan(plot_data.data)]
206 if vmin is None:
207 vmin = np.percentile(calc_data, 2) if robust else calc_data.min()
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
You can use .set_index('Row Labels') to ensure your Row Labels column is interpreted as an axis for the heatmap, and transpose your DataFrame with .T so that you get the time along the x-axis and the detectors on the y-axis.
sns.heatmap(df.set_index('Row Labels').T)
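Putting it together, a minimal sketch assuming the CSV path from the question and that every column apart from Row Labels is numeric (the TypeError above is caused by the non-numeric Row Labels column reaching the heatmap, which setting it as the index avoids):
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv(r'C:\Users\holborm\Desktop\Visualisation\dataaxisplotstuff.csv')

# Optionally drop the summary row so it doesn't skew the colour scale.
df = df[df['Row Labels'] != 'Grand Total']

# After transposing, detectors become the rows (y-axis) and the
# hourly bins become the columns (x-axis).
occupancy = df.set_index('Row Labels').T

plt.figure(figsize=(16, 10))
sns.heatmap(occupancy, cmap='viridis')
plt.xlabel('Time of day')
plt.ylabel('Detector')
plt.tight_layout()
plt.show()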

Access single cell of pandas dataframe?

I have the following data with some missing holes. I've looked over the 'how to handle missing data' but can't find anything that applies in this situation. Here is the data:
Species GearUsed AverageFishWeight(lbs) NormalRange(lbs) Caught
0 BlackBullhead Gillnet 0.11 0.8-7.7 0.18
1 BlackCrappie Trapnet 6.22 0.7-3.4 0.30
2 NaN Gillnet 1.00 0.6-3.5 0.30
3 Bluegill Trapnet 11.56 6.1-46.6 0.14
4 NaN Gillnet 1.44 NaN 0.21
5 BrownBullhead Trapnet 0.11 0.4-2.1 1.01
6 NorthernPike Trapnet 0.22 NaN 4.32
7 NaN Gillnet 2.22 3.5-10.5 5.63
8 Pumpkinseed Trapnet 0.89 2.0-8.5 0.23
9 RockBass Trapnet 0.22 0.5-1.8 0.04
10 Walleye Trapnet 0.22 0.3-0.7 0.28
11 NaN Gillnet 1.56 1.3-5.0 2.54
12 WhiteSucker Trapnet 0.33 0.3-1.4 2.76
13 NaN Gillnet 1.78 0.5-2.7 1.32
14 YellowPerch Trapnet 1.33 0.5-3.3 0.14
15 NaN Gillnet 27.67 3.4-43.6 0.14
I need the NaNs in the Species column to just be the name above it; for example, row 2 would be BlackCrappie. I would like to iterate through the frame and manually specify the species name, but I'm not too sure how, and other answers recommend against iterating through the dataframe in the first place.
How do I access each cell of the frame individually? Thanks!
PS the column names are incorrect, there is not a 27lb yellow perch. :)
Do you want to fill the missing values in other rows as well? Seems to be what fillna() is for:
print(df.fillna(method='pad'))
Species GearUsed AverageFishWeight(lbs) NormalRange(lbs) Caught
0 BlackBullhead Gillnet 0.11 0.8-7.7 0.18
1 BlackCrappie Trapnet 6.22 0.7-3.4 0.30
2 BlackCrappie Gillnet 1.00 0.6-3.5 0.30
3 Bluegill Trapnet 11.56 6.1-46.6 0.14
4 Bluegill Gillnet 1.44 6.1-46.6 0.21
5 BrownBullhead Trapnet 0.11 0.4-2.1 1.01
6 NorthernPike Trapnet 0.22 0.4-2.1 4.32
7 NorthernPike Gillnet 2.22 3.5-10.5 5.63
8 Pumpkinseed Trapnet 0.89 2.0-8.5 0.23
9 RockBass Trapnet 0.22 0.5-1.8 0.04
10 Walleye Trapnet 0.22 0.3-0.7 0.28
11 Walleye Gillnet 1.56 1.3-5.0 2.54
12 WhiteSucker Trapnet 0.33 0.3-1.4 2.76
13 WhiteSucker Gillnet 1.78 0.5-2.7 1.32
14 YellowPerch Trapnet 1.33 0.5-3.3 0.14
15 YellowPerch Gillnet 27.67 3.4-43.6 0.14
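If you only want the Species column filled (leaving the NaNs in NormalRange(lbs) untouched), forward-fill just that column. A small sketch on a few of the rows above, using ffill(), which newer pandas versions prefer over method='pad':
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Species': ['BlackBullhead', 'BlackCrappie', np.nan, 'Bluegill', np.nan],
    'GearUsed': ['Gillnet', 'Trapnet', 'Gillnet', 'Trapnet', 'Gillnet'],
    'AverageFishWeight(lbs)': [0.11, 6.22, 1.00, 11.56, 1.44],
})

# Forward-fill only the Species column; other columns keep their NaNs.
df['Species'] = df['Species'].ffill()
print(df)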
