I have the following Dataframe:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067
13 23 NaN NaN NaN NaN NaN NaN NaN NaN 983.5 BQ0067
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067
17 11 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 BQ0068
18 21 4.83 11.9 28.1 44.2 54.63 16.76 6.70 0.19 953.7 BQ0068
19 22 4.40 10.7 26.3 43.4 57.55 19.85 8.59 0.53 974.9 BQ0068
20 23 17.61 43.8 67.9 122.6 221.20 0.75 0.33 58.27 974.9 BQ0068
21 31 15.09 22.3 33.3 45.6 59.45 0.98 0.38 0.73 1773.7 BQ0068
I wish to do the following:
Steps:
1. Whenever the row for TEST_NUMBER 11 contains NaN (null) values, I need to remove all rows of that PRODUCT_NO. For example, in the given dataframe, PRODUCT_NO BQ0068 has NaN values for TEST_NUMBER 11, hence all rows of BQ0068 should be removed.
2. If any TEST_NUMBER other than 11 has NaN values, then only that particular row should be removed. For example, PRODUCT_NO BQ0067 has NaN values in the row for TEST_NUMBER 23, hence only that row should be removed.
3. After the above steps, I need to compute differences between rows. For example, for PRODUCT_NO BQ0066 I need
TEST_NUMBER 21 - TEST_NUMBER 11, TEST_NUMBER 22 - TEST_NUMBER 11, TEST_NUMBER 23 - TEST_NUMBER 11, TEST_NUMBER 24 - TEST_NUMBER 11,
TEST_NUMBER 25 - TEST_NUMBER 11, and then TEST_NUMBER 31 - TEST_NUMBER 25,
TEST_NUMBER 32 - TEST_NUMBER 25, TEST_NUMBER 33 - TEST_NUMBER 25, TEST_NUMBER 34 - TEST_NUMBER 25.
The same procedure applies to each successive PRODUCT_NO. As you can see, the number of TEST_NUMBERs differs between PRODUCT_NOs, but in all cases every PRODUCT_NO has exactly one TEST_NUMBER 11, and the other TEST_NUMBERs fall in the ranges 21 to 29 and 31 to 39.
PYTHON CODE
import pandas as pd

def pick_closest_sample(sample_list, sample_no):
    # Collect every sample that comes before the first one in the same block of ten.
    sample_list = sorted(sample_list)
    buffer = []
    for number in sample_list:
        if sample_no // 10 == number // 10:
            break
        else:
            buffer.append(number)
    if len(buffer) > 0:
        return buffer[-1]
    return sample_no

def add_closest_sample_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    out = []
    for product_no in unique_product_nos:
        # .copy() avoids SettingWithCopyWarning when the subset is modified
        subset = df[df['PRODUCT_NO'] == product_no].copy()
        # keep the product only if its first row (TEST_NUMBER 11) has no NaN
        if subset.iloc[0].isnull().sum() == 0:
            subset.dropna(inplace=True)
            sample_list = subset['TEST_NUMBER'].to_list()
            subset['target_sample'] = subset['TEST_NUMBER'].apply(lambda x: pick_closest_sample(sample_list, x))
            out.append(subset)
    if len(out) > 0:
        out = pd.concat(out)
        out.dropna(inplace=True)
    return out
Output of above two functions:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO target_sample
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066 11
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066 11
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066 11
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066 11
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066 11
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066 11
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066 25
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066 25
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066 25
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066 25
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067 11
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067 11
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067 11
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067 22
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067 22
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067 22
As you can see, all rows of PRODUCT_NO BQ0068 are removed because TEST_NUMBER 11 had NaN values. Also, only the row for TEST_NUMBER 23 of PRODUCT_NO BQ0067 is removed, as it had NaN values. So the requirements of the first two steps are met. The computation for PRODUCT_NO BQ0067 will now be TEST_NUMBER 31 - TEST_NUMBER 22, TEST_NUMBER 32 - TEST_NUMBER 22, TEST_NUMBER 33 - TEST_NUMBER 22.
PYTHON CODE
def compute_df(df):
    unique_product_nos = list(df['PRODUCT_NO'].unique())
    diff_cols = ['D1','D10','D50','D90','D99','Q3_15','Q3_10','l-Q3_63','RPM']
    out = []
    for product_no in unique_product_nos:
        subset = df[df['PRODUCT_NO'] == product_no]
        target_list = list(subset['target_sample'].unique())
        for target in target_list:
            # rows that share this reference sample; .copy() avoids SettingWithCopyWarning
            target_df = subset[subset['target_sample'] == target].copy()
            # repeat the reference row once per row of target_df
            target_subset = [subset[subset['TEST_NUMBER'] == target]]*len(target_df)
            target_subset = pd.concat(target_subset)
            if len(target_subset) > 0:
                target_subset.index = target_df.index
                for col in diff_cols:
                    target_df[col + '_diff'] = target_df[col] - target_subset[col]
                out.append(target_df)
    if len(out) > 0:
        out = pd.concat(out)
    return out
Output of the above function:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM ... target_sample D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 ... 11 -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 ... 11 -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 ... 11 -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 ... 11 -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 ... 11 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 ... 25 -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 ... 25 -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 ... 25 -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 ... 25 -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 ... 11 -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 ... 11 -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 ... 22 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 ... 22 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 ... 22 -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 790.3
Kindly help me optimize the three functions I posted so that they are written in a more Pythonic way.
Points 1. and 2. can be achieved in a few lines with pandas functions.
You can then calculate "target_sample" and your diff columns in the same loop using groupby:
# 1. Whenever TEST_NUMBER == 11 has D1 value NaN, remove all rows with this PRODUCT_NO
drop_prod_no = df[(df.TEST_NUMBER==11) & (df.D1.isna())]["PRODUCT_NO"]
df.drop(df[df.PRODUCT_NO.isin(drop_prod_no)].index, axis=0, inplace=True)
# 2. Drop remaining rows with NaN values
df.dropna(inplace=True)
# 3. set column "target_sample" and calculate diffs
new_df = pd.DataFrame()
diff_cols = ['D1','D10','D50','D90','D99','Q3_15','Q3_10','l-Q3_63','RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    closest_sample = last_sample = 11
    for index, row in subset.iterrows():
        # entering a new block of ten: the last row of the previous block becomes the reference
        if row.TEST_NUMBER // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        subset.at[index, "target_sample"] = closest_sample
        last_sample = row.TEST_NUMBER
        for col in diff_cols:
            # .iloc[0] picks the single reference value (float() on a Series is deprecated)
            subset.at[index, col + "_diff"] = subset.at[index, col] - subset.loc[subset.TEST_NUMBER==closest_sample, col].iloc[0]
    new_df = pd.concat([new_df, subset])
print(new_df)
Output:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 ... D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 ... -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 ... -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 ... -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 ... -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 ... 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 ... -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 ... -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 ... -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 ... -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 ... -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 ... -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 ... 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 ... 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 ... -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 79
Edit: you can avoid using iterrows by applying lambda functions like you did:
# 3. set column "target_sample" and calculate diffs
def get_closest_sample(samples, test_no):
    closest_sample = last_sample = 11
    for smpl in samples:
        # entering a new block of ten: the last sample of the previous block becomes the reference
        if smpl // 10 > closest_sample // 10 + 1:
            closest_sample = last_sample
        if smpl == test_no:
            break
        last_sample = smpl
    return closest_sample

new_df = pd.DataFrame()
diff_cols = ['D1','D10','D50','D90','D99','Q3_15','Q3_10','l-Q3_63','RPM']
for _, subset in df.groupby("PRODUCT_NO"):
    sample_list = list(subset["TEST_NUMBER"])
    subset["target_sample"] = subset["TEST_NUMBER"].apply(lambda x: get_closest_sample(sample_list, x))
    for col in diff_cols:
        # .iloc[0] picks the single reference value (float() on a Series is deprecated)
        subset[col + "_diff"] = subset.apply(lambda row: row[col] - subset.loc[subset.TEST_NUMBER==row["target_sample"], col].iloc[0], axis=1)
    new_df = pd.concat([new_df, subset])
print(new_df)
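As a loop-free alternative, the reference row can also be looked up with a groupby and a merge. This is only a sketch, assuming that the NaN rows have already been dropped (points 1. and 2.) and that the reference for each block of ten (1x, 2x, 3x) is the last remaining row of the block before it, with TEST_NUMBER 11 referencing itself. The names `add_reference_diffs`, `decade` and the `_ref` suffix are made up for illustration, and the small `demo` frame is invented data.

```python
import pandas as pd

def add_reference_diffs(df, diff_cols):
    """Add target_sample and <col>_diff columns without Python-level row loops."""
    out = df.sort_values(["PRODUCT_NO", "TEST_NUMBER"]).copy()
    out["decade"] = out["TEST_NUMBER"] // 10  # 11 -> 1, 21..29 -> 2, 31..39 -> 3
    cols = ["TEST_NUMBER"] + list(diff_cols)

    # Last row of every (product, decade) block, e.g. test 25 for decade 2.
    last = out.groupby(["PRODUCT_NO", "decade"], as_index=False)[cols].last()
    # Shift within each product so a block sees the last row of the block before it.
    ref = last.groupby("PRODUCT_NO")[cols].shift(1)
    # The first block (test 11) has no predecessor, so it references itself.
    ref = ref.fillna(last[cols]).add_suffix("_ref")
    ref = pd.concat([last[["PRODUCT_NO", "decade"]], ref], axis=1)

    merged = out.merge(ref, on=["PRODUCT_NO", "decade"], how="left")
    merged["target_sample"] = merged["TEST_NUMBER_ref"].astype(int)
    for col in diff_cols:
        merged[col + "_diff"] = merged[col] - merged[col + "_ref"]
    return merged.drop(columns=["decade"] + [c + "_ref" for c in cols])

# Invented toy data, only to show the call.
demo = pd.DataFrame({
    "TEST_NUMBER": [11, 21, 22, 31, 11, 21, 31],
    "D1": [1.0, 3.0, 5.0, 10.0, 2.0, 3.0, 7.0],
    "PRODUCT_NO": ["A", "A", "A", "A", "B", "B", "B"],
})
result = add_reference_diffs(demo, ["D1"])
print(result)
```

Note that the merge returns a fresh 0..n-1 index, so the original row labels are not preserved in this variant.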
I am new to Python and currently using BeautifulSoup with Python to try and pull some table data. I cannot get the individual elements out of the td. What I have so far is:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/').text
soup = BeautifulSoup(source, 'lxml')
td = soup.find_all('td', {'class': 'text-center'})
print(td)
This does display all of the td elements that I want to extract, but I am unable to figure out how to get each individual element out of them.
Thank you in advance for the help, it is much appreciated.
Try this:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/').text
soup = BeautifulSoup(source, 'lxml')
td = soup.find_all('td', {'class': 'text-center'})
print(*[text.get_text(strip=True) + '\n' for text in td])
Prints:
S10
NA
14
35.7%
0.91
1744
-48
33:19
11.2
12.4
5.5
7.0
50.0
64.3
2.71
54.2
1.00
57.1
1.14
and so on....
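If the values should stay grouped by table row rather than come out as one flat list, one option is to loop over the `tr` elements first and collect the cells of each row. The snippet below is a small sketch on a hand-written HTML fragment (not the real page markup, which may differ), so it runs without a network request; `html` and `rows` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup, standing in for the downloaded page.
html = """
<table>
  <tr><td class="text-center">S10</td><td class="text-center">NA</td><td class="text-center">14</td></tr>
  <tr><td class="text-center">S10</td><td class="text-center">NA</td><td class="text-center">35.7%</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# One inner list per <tr>, holding the stripped text of each matching <td>.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td", class_="text-center")]
    for tr in soup.find_all("tr")
]

for row in rows:
    print(row)
```

Each entry of `rows` can then be indexed like a normal list, for example `rows[0][2]` for the third cell of the first row.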
The following script extracts the data and saves the data to a csv file.
import requests
from bs4 import BeautifulSoup
import pandas as pd
res = requests.get('https://gol.gg/teams/list/season-ALL/split-ALL/region-ALL/tournament-LCS%20Summer%202020/week-ALL/')
soup = BeautifulSoup(res.text, 'html.parser')
table = soup.find("table", class_="table_list playerslist tablesaw trhover")
columns = [i.get_text(strip=True) for i in table.find("thead").find_all("th")]
data = []
table.find("thead").extract()
for tr in table.find_all("tr"):
    data.append([td.get_text(strip=True) for td in tr.find_all("td")])
df = pd.DataFrame(data, columns=columns)
df.to_csv("data.csv", index=False)
Output:
Name Season Region Games Win rate K:D GPM GDM Game duration Kills / game Deaths / game Towers killed Towers lost FB% FT% DRAPG DRA% HERPG HER% DRA#15 TD#15 GD#15 NASHPG NASH% CSM DPM WPM VWPM WCPM
0 100 Thieves S10 NA 14 35.7% 0.91 1744 -48 33:19 11.2 12.4 5.5 7.0 50.0 64.3 2.71 54.2 1.00 57.1 1.14 0.4 -378 0.64 42.9 33.2 1937 3.0 1.19 1.31
1 CLG S10 NA 14 35.7% 0.81 1705 -120 35:25 10.6 13.2 4.9 7.9 28.6 28.6 1.93 31.5 0.57 28.6 0.64 -0.6 -1297 0.57 30.4 32.6 1826 3.2 1.17 1.37
2 Cloud9 S10 NA 14 78.6% 1.91 1922 302 28:52 15.0 7.9 8.3 3.1 64.3 64.3 3.07 72.5 1.43 71.4 1.29 0.7 2410 1.00 78.6 33.3 1921 3.0 1.10 1.26
3 Dignitas S10 NA 14 28.6% 0.86 1663 -147 32:44 8.9 10.4 3.9 8.1 42.9 35.7 2.14 41.7 0.57 28.6 0.79 -0.7 -796 0.36 25.0 32.5 1517 3.1 1.28 1.23
4 Evil Geniuses S10 NA 14 50.0% 0.85 1738 -0 34:09 11.1 13.1 6.5 6.0 64.3 57.1 2.36 48.5 1.00 53.6 1.00 0.5 397 0.50 46.5 32.3 1895 3.2 1.36 1.34
5 FlyQuest S10 NA 14 57.1% 1.28 1770 65 34:55 13.4 10.4 6.5 5.2 71.4 35.7 2.86 53.4 1.00 50.0 0.79 -0.1 69 0.71 69.2 32.7 1801 3.2 1.16 1.72
6 Golden Guardians S10 NA 14 50.0% 0.96 1740 6 36:13 10.7 11.1 6.3 6.1 50.0 35.7 3.29 62.8 0.86 42.9 1.43 0.1 711 0.50 43.6 33.7 1944 3.2 1.27 1.53
7 Immortals S10 NA 14 21.4% 0.54 1609 -246 33:54 7.5 14.0 4.3 7.9 35.7 35.7 2.29 39.9 1.00 53.6 0.79 -0.4 -1509 0.36 25.0 31.4 1734 3.3 1.37 1.47
8 Team Liquid S10 NA 14 78.6% 1.31 1796 135 35:07 11.4 8.6 7.9 4.4 42.9 64.3 2.36 43.6 0.93 50.0 1.14 0.2 522 1.21 78.6 33.1 1755 3.5 1.27 1.42
9 TSM S10 NA 14 64.3% 1.12 1768 52 34:20 11.6 10.4 7.2 5.7 50.0 78.6 2.79 51.9 1.21 64.3 0.93 0.1 -129 0.86 57.1 32.6 1729 3.2 1.33 1.33
I have a bunch of stock data downloaded from yahoo finance. Each dataframe looks like this:
Date Open High Low Close Adj Close Volume
0 2019-03-11 2.73 2.81 2.71 2.75 2.75 243900
1 2019-03-12 2.66 2.78 2.66 2.75 2.75 69200
2 2019-03-13 2.75 2.80 2.71 2.77 2.77 61200
3 2019-03-14 2.77 2.79 2.75 2.75 2.75 48800
4 2019-03-15 2.76 2.79 2.75 2.79 2.79 124400
.. ... ... ... ... ... ... ...
282 2020-04-22 3.61 3.75 3.61 3.71 3.71 312900
283 2020-04-23 3.74 3.77 3.66 3.76 3.76 99800
284 2020-04-24 3.78 3.78 3.63 3.63 3.63 89100
285 2020-04-27 3.70 3.70 3.55 3.64 3.64 60600
286 2020-04-28 3.70 3.74 3.64 3.70 3.70 248300
I need to concat the data so it looks like the below multi-index format and I'm at a loss. I've tried a number of pd.concat([list of dfs], zip(cols,symbols), axis=[0,1]) combos with no luck so any help is appreciated!
Adj Close Close High Low Open Volume
CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP CHNR GNSS SGRP
Date
2019-04-30 1.85 3.08 0.69 1.85 3.08 0.69 1.94 3.10 0.70 1.74 3.05 0.67 1.74 3.07 0.70 24800 23900 30400
2019-05-01 1.81 3.15 0.65 1.81 3.15 0.65 1.85 3.17 0.69 1.75 3.06 0.62 1.76 3.09 0.67 15500 72800 85900
2019-05-02 1.80 3.12 0.66 1.80 3.12 0.66 1.87 3.16 0.66 1.76 3.10 0.65 1.80 3.16 0.65 12900 28100 97200
2019-05-03 1.85 3.14 0.67 1.85 3.14 0.67 1.89 3.19 0.69 1.74 3.06 0.62 1.74 3.12 0.62 43200 31300 27500
2019-05-06 1.85 3.13 0.66 1.85 3.13 0.66 1.89 3.25 0.69 1.75 3.11 0.65 1.79 3.11 0.67 37000 50200 31500
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2020-04-22 0.93 3.71 0.73 0.93 3.71 0.73 1.04 3.75 0.73 0.93 3.61 0.69 0.93 3.61 0.72 2600 312900 14600
2020-04-23 1.01 3.76 0.74 1.01 3.76 0.74 1.01 3.77 0.77 0.94 3.66 0.73 0.94 3.74 0.73 2500 99800 15200
2020-04-24 1.05 3.63 0.76 1.05 3.63 0.76 1.05 3.78 0.77 0.92 3.63 0.74 1.05 3.78 0.74 4400 89100 1300
2020-04-27 1.03 3.64 0.76 1.03 3.64 0.76 1.07 3.70 0.77 0.92 3.55 0.76 1.07 3.70 0.77 6200 60600 3500
2020-04-28 1.00 3.70 0.77 1.00 3.70 0.77 1.07 3.74 0.77 0.96 3.64 0.75 1.07 3.70 0.77 22300 248300 26100
EDIT per Quang Hoang's suggestion:
Tried:
ret = pd.concat(stock_data.values(), keys=stocks, axis=1)
ret = ret.swaplevel(0, 1, axis=1)
Got the following output which looks much closer but still off a bit:
Date Open High Low Close Adj Close Volume Date Open High Low Close Adj Close Volume Date Open High Low Close Adj Close Volume
CHNR CHNR CHNR CHNR CHNR CHNR CHNR GNSS GNSS GNSS GNSS GNSS GNSS GNSS SGRP SGRP SGRP SGRP SGRP SGRP SGRP
0 2010-04-29 11.39 11.74 11.39 11.57 11.57 3100 2019-03-11 2.73 2.81 2.71 2.75 2.75 243900.0 2010-04-29 0.79 0.79 0.79 0.79 0.79 0
1 2010-04-30 11.60 11.61 11.50 11.56 11.56 5400 2019-03-12 2.66 2.78 2.66 2.75 2.75 69200.0 2010-04-30 0.79 0.79 0.79 0.79 0.79 0
2 2010-05-03 11.95 11.95 11.22 11.44 11.44 19400 2019-03-13 2.75 2.80 2.71 2.77 2.77 61200.0 2010-05-03 0.79 0.79 0.79 0.79 0.79 0
3 2010-05-04 11.20 11.49 11.20 11.46 11.46 10700 2019-03-14 2.77 2.79 2.75 2.75 2.75 48800.0 2010-05-04 0.79 0.79 0.66 0.79 0.79 9700
4 2010-05-05 11.50 11.60 11.25 11.50 11.50 13400 2019-03-15 2.76 2.79 2.75 2.79 2.79 124400.0 2010-05-05 0.69 0.80 0.67 0.80 0.80 6700
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2512 2020-04-22 0.93 1.04 0.93 0.93 0.93 2600 NaT NaN NaN NaN NaN NaN NaN 2020-04-22 0.72 0.73 0.69 0.73 0.73 14600
2513 2020-04-23 0.94 1.01 0.94 1.01 1.01 2500 NaT NaN NaN NaN NaN NaN NaN 2020-04-23 0.73 0.77 0.73 0.74 0.74 15200
2514 2020-04-24 1.05 1.05 0.92 1.05 1.05 4400 NaT NaN NaN NaN NaN NaN NaN 2020-04-24 0.74 0.77 0.74 0.76 0.76 1300
2515 2020-04-27 1.07 1.07 0.92 1.03 1.03 6200 NaT NaN NaN NaN NaN NaN NaN 2020-04-27 0.77 0.77 0.76 0.76 0.76 3500
2516 2020-04-28 1.07 1.07 0.96 1.00 1.00 22300 NaT NaN NaN NaN NaN NaN NaN 2020-04-28 0.77 0.77 0.75 0.77 0.77 26100
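The mismatch in the attempt above comes from `Date` being an ordinary column: `concat` then aligns the frames on the integer index 0, 1, 2, ... and carries a separate `Date` column per symbol. A possible fix, sketched here on two tiny frames built from a few of the values shown above, is to move `Date` into the index before concatenating. It assumes `stock_data` is a dict mapping symbol to dataframe, which the `stock_data.values()` call suggests but the post does not confirm.

```python
import pandas as pd

# Two cut-down frames using values from the tables above (assumed dict layout).
stock_data = {
    "CHNR": pd.DataFrame({
        "Date": pd.to_datetime(["2020-04-27", "2020-04-28"]),
        "Open": [1.07, 1.07],
        "Close": [1.03, 1.00],
    }),
    "GNSS": pd.DataFrame({
        "Date": pd.to_datetime(["2020-04-28"]),
        "Open": [3.70],
        "Close": [3.70],
    }),
}

# Use Date as the index so concat aligns rows by date instead of by position.
frames = {symbol: frame.set_index("Date") for symbol, frame in stock_data.items()}

ret = pd.concat(frames, axis=1)          # column levels: (symbol, field)
ret = ret.swaplevel(0, 1, axis=1)        # column levels: (field, symbol)
ret = ret.sort_index(axis=1)             # group the columns by field
print(ret)
```

Dates that exist for one symbol but not another show up as NaN for the missing symbol, which matches the outer-join behaviour of `concat`.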
id MaxS MaxA MaxD
43290 9.511364 2.70 0.27
43290 7.547727 2.56 0.34
43290 7.465909 2.66 0.48
43290 7.404545 3.90 0.60
43290 7.772727 2.38 0.11
43290 7.936364 2.62 0.97
43290 7.650000 4.20 1.64
43290 3.088636 1.79 0.06
43290 4.377273 2.19 0.05
43290 6.750000 4.65 1.90
43290 5.461364 2.82 0.19
43290 7.363636 4.13 1.48
43290 11.270455 3.72 0.41
43290 10.186364 3.88 1.17
43290 3.109091 2.05 0.02
43290 7.834091 3.38 0.01
43290 3.252273 2.31 0.03
43290 7.854545 3.00 0.70
43290 9.756818 3.26 0.54
43290 6.954545 2.93 0.24
43291 4.070455 1.21 0.21
43291 6.034091 3.42 0.42
43291 8.018182 2.41 0.66
43291 7.956818 3.55 0.62
43291 8.161364 2.74 0.64
43291 8.263636 4.11 0.13
43291 2.618182 1.80 0.08
43291 2.168182 2.12 0.04
43291 6.095455 3.04 0.11
43291 9.061364 2.91 0.33
45880 5.236364 2.43 0.15
45880 14.972727 4.86 0.23
45880 9.593182 4.48 1.36
45880 4.459091 3.67 0.14
45880 17.325000 4.21 0.44
45880 11.086364 3.30 1.00
45880 5.277273 2.25 0.12
45880 7.547727 2.92 0.34
45880 11.270455 3.33 0.03
45880 13.990909 3.21 0.50
45880 9.122727 3.86 1.14
45880 6.790909 4.24 1.30
45880 8.100000 4.31 0.80
45880 5.809091 3.22 0.94
45881 6.565909 3.50 0.86
45881 10.452273 4.64 0.85
45881 7.281818 3.47 0.71
45881 9.347727 3.67 0.02
45881 14.318182 3.97 0.51
45881 5.481818 3.99 0.21
45881 7.425000 3.93 1.65
45881 8.836364 3.50 0.26
45881 5.277273 2.21 0.57
45881 12.865909 4.38 0.94
45881 7.200000 2.86 0.45
45881 7.138636 4.39 1.18
45881 8.815909 4.34 0.34
45881 9.490909 4.53 0.28
45881 17.652273 4.59 0.05
45881 11.106818 2.64 0.31
45881 9.511364 3.83 1.14
45881 8.284091 3.90 0.20
45881 9.306818 3.54 0.22
45881 5.195455 2.66 0.14
45881 3.477273 2.50 0.16
45881 7.179545 3.70 0.08
45881 8.447727 3.19 0.32
45881 4.990909 2.32 0.86
45881 16.465909 4.28 0.25
Hi all, as you can see I have the table above in a pandas dataframe. For every id, I would like to take the 3rd largest to 7th largest values of MaxS, MaxA and MaxD and average them. I know you can use head or nlargest to get the top values in these columns, but I am not sure how to get the third to seventh largest for each id. Also, if you try nlargest on multiple columns, pandas throws an error, so I'm not sure how to proceed with this problem.
I would really appreciate it if someone could help me find the average of the 3rd largest to 7th largest values in each of the three columns (MaxS, MaxA and MaxD) for each id.
Thank you!
Converting the data frame column to a numpy array can work:
import pandas as pd
import numpy as np

def _get_average(column):
    """ sort descending and average the 3rd to 7th largest values """
    # positions 2..6 of the descending array are the 3rd to 7th largest
    return np.mean(np.sort(column.to_numpy())[::-1][2:7])

def average_csv():
    """ read data as csv and average the desired fields """
    # raw string so that \s is passed to the regex engine unchanged
    my_df = pd.read_csv("my_csv.csv", delimiter=r"\s+")
    return _get_average(my_df["MaxS"]), _get_average(my_df["MaxA"]), _get_average(my_df["MaxD"])

if __name__ == "__main__":
    print("MaxS: {}, MaxA: {}, MaxD: {}".format(*average_csv()))
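Since the question asks for the average per id, a grouped variant may be closer to what is needed. The following is a sketch rather than a solution verified on the full dataset: `mean_of_ranks` is a hypothetical helper, and the two-id frame is made-up data chosen so the result is easy to check by hand.

```python
import pandas as pd

def mean_of_ranks(series, first=3, last=7):
    """Mean of the first-th to last-th largest values of a Series."""
    # nlargest(last) returns the `last` biggest values in descending order;
    # dropping the leading first-1 entries leaves ranks first..last.
    return series.nlargest(last).iloc[first - 1:].mean()

# Made-up data: two ids with eight rows each.
demo = pd.DataFrame({
    "id": [1] * 8 + [2] * 8,
    "MaxS": [8, 7, 6, 5, 4, 3, 2, 1, 10, 20, 30, 40, 50, 60, 70, 80],
    "MaxA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0] * 2,
})

# agg calls the helper once per (id, column) pair.
per_id = demo.groupby("id")[["MaxS", "MaxA"]].agg(mean_of_ranks)
print(per_id)
```

If an id has fewer than seven rows, `nlargest` simply returns what is available, so the mean is taken over fewer values; whether that is acceptable depends on the data.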
I have a dataframe, df which looks like this
Open High Low Close Volume
Date
2007-03-22 2.65 2.95 2.64 2.86 176389
2007-03-23 2.87 2.87 2.78 2.78 63316
2007-03-26 2.83 2.83 2.51 2.52 54051
2007-03-27 2.61 3.29 2.60 3.28 589443
2007-03-28 3.65 4.10 3.60 3.80 1114659
2007-03-29 3.91 3.91 3.33 3.57 360501
2007-03-30 3.70 3.88 3.66 3.71 185787
I'm trying to create a new column that takes the df.Open value 5 days ahead of each row and subtracts the current df.Open value from it.
So the loop I'm using is this:
for i in range(0, len(df.Open)): #goes through indexes values
    df['5days'][i]=df.Open[i+5]-df.Open[i] #I use those index values to locate
However, this loop is yielding an error.
KeyError: '5days'
Not sure why. I got this to work temporarily by removing the df['5days'][i], but it seems awfully slow. Is there a more efficient way to do this?
Thank you.
Using diff
df['5Days'] = df.Open.diff(5)
print(df)
Open High Low Close Volume 5Days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 0.83
However, per your code, you may want to look ahead and align the results back. In that case:
df['5Days'] = -df.Open.diff(-5)
print(df)
Open High Low Close Volume 5Days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 1.26
2007-03-23 2.87 2.87 2.78 2.78 63316 0.83
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 NaN
2007-03-30 3.70 3.88 3.66 3.71 185787 NaN
I think you need shift with sub:
df['5days'] = df.Open.shift(5).sub(df.Open)
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 -1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 -0.83
Or maybe you need to subtract the shifted column from Open:
df['5days'] = df.Open.sub(df.Open.shift(5))
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 NaN
2007-03-23 2.87 2.87 2.78 2.78 63316 NaN
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 1.26
2007-03-30 3.70 3.88 3.66 3.71 185787 0.83
df['5days'] = -df.Open.sub(df.Open.shift(-5))
print (df)
Open High Low Close Volume 5days
Date
2007-03-22 2.65 2.95 2.64 2.86 176389 1.26
2007-03-23 2.87 2.87 2.78 2.78 63316 0.83
2007-03-26 2.83 2.83 2.51 2.52 54051 NaN
2007-03-27 2.61 3.29 2.60 3.28 589443 NaN
2007-03-28 3.65 4.10 3.60 3.80 1114659 NaN
2007-03-29 3.91 3.91 3.33 3.57 360501 NaN
2007-03-30 3.70 3.88 3.66 3.71 185787 NaN
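For reference, the original loop fails because `df['5days'][i] = ...` first looks up the column `'5days'`, which does not exist yet, hence the `KeyError`; `df.Open[i+5]` would also run past the end of the data for the last five rows. The sketch below rebuilds the `Open` column from the question and checks that the look-ahead `shift` and `diff` variants agree. The variable names `ahead_shift` and `ahead_diff` are illustrative.

```python
import pandas as pd

dates = pd.to_datetime([
    "2007-03-22", "2007-03-23", "2007-03-26", "2007-03-27",
    "2007-03-28", "2007-03-29", "2007-03-30",
])
df = pd.DataFrame({"Open": [2.65, 2.87, 2.83, 2.61, 3.65, 3.91, 3.70]}, index=dates)

# Value 5 rows ahead minus the current value; the last 5 rows have no partner and become NaN.
ahead_shift = df["Open"].shift(-5) - df["Open"]
# The same quantity written with diff: diff(-5) is current minus 5 rows ahead, so negate it.
ahead_diff = -df["Open"].diff(-5)

# Assigning a whole Series creates the column in one step, no loop needed.
df["5days"] = ahead_shift
print(df)
```

Both `shift` and `diff` count rows, not calendar days, so "5 days ahead" here means five trading rows later in the frame.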