Union of several DataFrames stored in the same variable - python

I have imported price data for several stocks by looping over the tickers with the MetaTrader5 module.
import MetaTrader5 as mt5
import pandas as pd
tickers = ['Apple', 'Amazon', 'Facebook', 'Microsoft']
results = {}
for ticker in tickers:
    # inicio and fin are the start and end of the requested date range
    rates = mt5.copy_rates_range(ticker, mt5.TIMEFRAME_M1, inicio, fin)
    results[ticker] = pd.DataFrame(rates).set_index('time')
The data is stored in results[ticker]. For example, when ticker = 'Apple':
results['Apple']
{'Apple': open high low close tick_volume spread real_volume
time
1606149300 117.33 117.55 117.31 117.47 126 12 0
1606149360 117.48 117.54 117.31 117.39 134 12 0
1606149420 117.38 117.54 117.36 117.41 95 12 0
1606149480 117.43 117.47 117.32 117.33 90 12 0
1606149540 117.32 117.33 117.24 117.26 123 12 0
... ... ... ... ... ... ... ...
when ticker = 'Amazon'
results['Amazon']
open high low close tick_volume spread real_volume
time
1606149300 3114.25 3132.43 3114.25 3131.28 44 429 0
1606149360 3131.28 3133.25 3122.69 3131.52 83 450 0
1606149420 3131.52 3132.12 3122.69 3130.11 61 449 0
1606149480 3127.53 3135.92 3122.69 3127.05 80 448 0
1606149540 3129.77 3135.54 3123.50 3131.98 49 441 0
... ... ... ... ... ... ... ...
My question is how can I join all these tables into a single DataFrame? For example, the 'close' column for each of the tickers in a single DataFrame as in the example below
CLOSE Apple Amazon Microsoft ETC...
time
1606149300 3114.25 3132.43 3114.25
1606149360 3131.28 3133.25 3122.69
1606149420 3131.52 3132.12 3122.69
1606149480 3127.53 3135.92 3122.69
1606149540 3129.77 3135.54 3123.50
... ... ... ... ... ... ... ...
Thanks in advance for the help

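You could try the merge function in pandas, renaming each ticker's 'close' column to the ticker name so the columns do not collide:
merged_df = results[tickers[0]][['close']].rename(columns={'close': tickers[0]})
for t in tickers[1:]:
    closes = results[t][['close']].rename(columns={'close': t})
    merged_df = merged_df.merge(closes, left_index=True, right_index=True)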
I hope it solves your problem!
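As an alternative to merging in a loop, pd.concat can assemble the table in one call. This is only a sketch: the two small frames below are hypothetical stand-ins for the results dict, filled with a couple of the 'close' values shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for the `results` dict built in the question's loop
results = {
    'Apple': pd.DataFrame(
        {'time': [1606149300, 1606149360], 'close': [117.47, 117.39]}
    ).set_index('time'),
    'Amazon': pd.DataFrame(
        {'time': [1606149300, 1606149360], 'close': [3131.28, 3131.52]}
    ).set_index('time'),
}

# Passing a dict to pd.concat uses its keys as the new column names;
# axis=1 aligns the 'close' Series on their shared 'time' index.
close_df = pd.concat({t: df['close'] for t, df in results.items()}, axis=1)
print(close_df)
```

Timestamps that are missing for one ticker end up as NaN in that ticker's column, whereas the default inner merge drops those rows.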

Related

How to web scrape Rotowire iframe table

I am trying to scrape tables from Rotowire. pd.read_html is only returning the headers.
import pandas as pd
url = pd.read_html("http://www.rotowire.com/daily/mlb/optimizer.htm?site=DraftKings&sport=MLB")
# for idx, table in enumerate(url):
#     print("***************************")
#     print(idx)
#     print(table)
url[5]
Output:
Player Team Position Salary Fpts. Val Min. % Max. % Exposure
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
I don't know which table you want, but you won't get anything from the static HTML response because the page is rendered with JavaScript. The site does have a data endpoint you can query directly, though you will have to work out the parameters yourself:
import pandas as pd
import requests
url = 'https://www.rotowire.com/daily/tables/optimizer-mlb.php'
payload = {
    'siteID': '1',
    'slateID': '6441',
    'projSource': 'RotoWire',
    'rst': 'RotoWire'}
jsonData = requests.get(url, params=payload).json()
df = pd.DataFrame(jsonData)
Output:
print(df)
id playerID rotoPlayerID ... ie_green_lights ie_matchup_notes ie_volatility
0 12739 11095 12739 ... 0 0
1 10510 4081 10510 ... 0 0
2 16036 5163 16036 ... 0 0
3 14194 10827 14194 ... 0 0
4 14865 15463 14865 ... 0 0
.. ... ... ... ... ... ... ...
687 14444 11330 14444 ... 0 0
688 14440 18894 14440 ... 0 0
689 14439 18905 14439 ... 0 0
690 14435 5058 14435 ... 0 0
691 17921 18828 17921 ... 0 0
[692 rows x 99 columns]

Euclidean Distance over 2 dataframes

I have 2 Dataframes
DF1-
Name X Y
0 Astonished 0.430 0.890
1 Excited 0.700 0.720
2 Expectant 0.320 0.067
3 Passionate 0.333 0.127
[47 rows * 3 columns]
DF2-
Id X Y
0 1 -0.288453 0.076105
1 4 -0.563453 -0.498895
2 5 -0.788453 -0.673895
3 6 -0.063453 -0.373895
4 7 0.311547 0.376105
[767 rows * 3 columns]
Now what I want to achieve is:
Take the X,Y from the first entry of DF2, iterate over DF1, and calculate the Euclidean distance to each X,Y in DF1.
Find the minimum of the Euclidean distances obtained, and save that minimum somewhere along with the corresponding entry in the Name column.
Example:
Say that for some X,Y in DF2 the minimum Euclidean distance corresponds to the X,Y in row 0 of DF1; then the result should be that distance and the name Astonished.
My Attempt-
import pandas as pd
import numpy as np
import csv
mood = pd.read_csv("C:/Users/Desktop/DF1.csv")
song_value = pd.read_csv("C:/Users/Desktop/DF2.csv")
df_temp = mood.loc[:, ['Arousal','Valence']]
df_temp1 = song_value.loc[:, ['Arousal','Valence']]
import scipy
from scipy import spatial
ary = scipy.spatial.distance.cdist(mood.loc[:, ['Arousal','Valence']], song_value.loc[:, ['Arousal','Valence']], metric='euclidean')
print (ary)
Result Obtained -
[[1.08563344 1.70762362 1.98252253 ... 0.64569366 0.47426051 0.83656989]
[1.17967807 1.75556794 2.03922435 ... 0.59326275 0.2469077 0.79334076]
[0.60852124 1.04915517 1.33326431 ... 0.1848471 0.53293637 0.08394834]
...
[1.26151359 1.5500629 1.81168766 ... 0.74070027 0.70209658 0.75277205]
[0.69085994 1.03764923 1.31608627 ... 0.33265268 0.61928227 0.21397822]
[0.84484398 1.11428893 1.38222899 ... 0.48330291 0.69288125 0.3886008 ]]
I have no clue how I should proceed now.
Please suggest something.
EDIT - 1
I converted the array in another data frame using
new_series = pd.DataFrame(ary)
print (new_series)
Result -
0 1 2 ... 764 765 766
0 1.085633 1.707624 1.982523 ... 0.645694 0.474261 0.836570
1 1.179678 1.755568 2.039224 ... 0.593263 0.246908 0.793341
2 0.608521 1.049155 1.333264 ... 0.184847 0.532936 0.083948
3 0.623534 1.093331 1.378075 ... 0.124156 0.479393 0.109057
4 0.791926 1.352785 1.636748 ... 0.197403 0.245908 0.398619
5 0.740038 1.260768 1.545785 ... 0.092072 0.304926 0.281791
6 0.923284 1.523395 1.803676 ... 0.415540 0.293217 0.611312
7 1.202447 1.679660 1.962823 ... 0.554256 0.247391 0.703298
8 0.824898 1.343684 1.628727 ... 0.177560 0.222666 0.360980
9 1.191411 1.604942 1.883150 ... 0.570771 0.395957 0.668736
10 0.822236 1.456863 1.708469 ... 0.706252 0.787271 0.823542
11 0.741683 1.371996 1.618916 ... 0.704496 0.835235 0.798964
12 0.346244 0.967891 1.240839 ... 0.376504 0.715617 0.359700
13 0.526096 1.163209 1.421820 ... 0.520190 0.748265 0.579333
14 0.435992 0.890291 1.083229 ... 0.937048 1.254437 0.884499
15 0.600338 1.162469 1.375755 ... 0.876228 1.116301 0.891714
16 0.634254 1.059083 1.226407 ... 1.088393 1.373536 1.058550
17 0.712227 1.284502 1.498187 ... 0.917272 1.117806 0.956957
18 0.194387 0.799728 1.045745 ... 0.666713 1.013563 0.597524
19 0.456000 0.708741 0.865870 ... 1.068296 1.420654 0.973234
20 0.633776 0.632060 0.709202 ... 1.277083 1.645173 1.157765
21 0.192291 0.597749 0.826602 ... 0.831713 1.204117 0.716746
22 0.522033 0.526969 0.645998 ... 1.170316 1.546040 1.041762
23 0.668148 0.504480 0.547920 ... 1.316602 1.698041 1.176933
24 0.718440 0.285718 0.280984 ... 1.334008 1.727796 1.166364
25 0.759187 0.265412 0.217165 ... 1.362786 1.757580 1.190132
26 0.598326 0.113459 0.380513 ... 1.087573 1.479296 0.896239
27 0.676841 0.263613 0.474246 ... 1.074911 1.456515 0.875707
28 0.865641 0.365394 0.462742 ... 1.239941 1.612779 1.038790
29 0.463623 0.511737 0.786284 ... 0.719525 1.099122 0.519226
30 0.780386 0.550483 0.750532 ... 0.987863 1.336760 0.788449
31 1.077559 0.711697 0.814205 ... 1.274933 1.602953 1.079529
32 1.020408 0.497152 0.522999 ... 1.372444 1.736938 1.170889
33 0.963911 0.367018 0.336035 ... 1.398444 1.778496 1.198905
34 1.092763 0.759612 0.873457 ... 1.256086 1.574565 1.063570
35 0.903631 0.810449 1.018501 ... 0.921287 1.219046 0.740134
36 0.728728 0.795942 1.045868 ... 0.695317 1.009043 0.512147
37 0.738314 0.600405 0.822742 ... 0.895225 1.239125 0.697393
38 1.206901 1.151385 1.343654 ... 1.064721 1.273002 0.922962
39 1.248530 1.293525 1.508517 ... 0.988508 1.137608 0.880669
40 0.988777 1.205968 1.463036 ... 0.622495 0.776919 0.541414
41 0.941001 1.043940 1.285215 ... 0.732293 0.960420 0.595174
42 1.242508 1.321327 1.544222 ... 0.947970 1.080069 0.851396
43 1.262534 1.399453 1.633948 ... 0.900340 0.989603 0.830024
44 1.261514 1.550063 1.811688 ... 0.740700 0.702097 0.752772
45 0.690860 1.037649 1.316086 ... 0.332653 0.619282 0.213978
46 0.844844 1.114289 1.382229 ... 0.483303 0.692881 0.388601
[47 rows x 767 columns]
Moreover, is this the best approach? I am not sure, which is why I am asking.
Say df_1 and df_2 are your dataframes, first extract your pairs as shown below:
pairs_1 = list(zip(df_1.X, df_1.Y))
pairs_2 = list(zip(df_2.X, df_2.Y))
Then iterate over pairs as per your use case and get the index of minimum distance for the iterated points:
from scipy.spatial.distance import cdist
min_distances = []
closest_pairs = []
names = []
for point in pairs_2:
    # Distances from this DF2 point to every DF1 point, computed once
    distances = cdist([point], pairs_1, metric='euclidean')
    index_min = distances.argmin()
    min_distances.append(distances.min())
    # .loc with a positional result assumes df_1 has the default 0..n-1 index
    closest_pairs.append(df_1.loc[index_min, ['X', 'Y']])
    names.append(df_1.loc[index_min, 'Name'])
Insert results to df_2:
df_2['min_distance'] = min_distances
df_2['closest_pairs'] = [tuple(i.values) for i in closest_pairs]
df_2['name'] = names
df_2
Output:
Id X Y min_distance closest_pairs name
0 1 -0.288453 0.076105 0.608521 (0.32, 0.067) Expectant
1 4 -0.563453 -0.498895 1.049155 (0.32, 0.067) Expectant
2 5 -0.788453 -0.673895 1.333264 (0.32, 0.067) Expectant
3 6 -0.063453 -0.373895 0.584316 (0.32, 0.067) Expectant
4 7 0.311547 0.376105 0.250027 (0.33, 0.127) Passionate
I have added min_distance and closest_pairs as well; you can exclude these columns if you want to.
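The loop can also be avoided by computing the whole distance matrix at once and taking the row-wise minimum. The sketch below uses NumPy broadcasting in place of cdist (for the Euclidean metric the two give the same matrix) and rebuilds only the few rows of df_1 and df_2 shown in the question, so it is an illustration rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

# Only the rows visible in the question; the real frames are larger
df_1 = pd.DataFrame({
    'Name': ['Astonished', 'Excited', 'Expectant', 'Passionate'],
    'X': [0.430, 0.700, 0.320, 0.333],
    'Y': [0.890, 0.720, 0.067, 0.127],
})
df_2 = pd.DataFrame({
    'Id': [1, 4, 5, 6, 7],
    'X': [-0.288453, -0.563453, -0.788453, -0.063453, 0.311547],
    'Y': [0.076105, -0.498895, -0.673895, -0.373895, 0.376105],
})

a = df_2[['X', 'Y']].to_numpy()  # shape (len(df_2), 2)
b = df_1[['X', 'Y']].to_numpy()  # shape (len(df_1), 2)

# dist[i, j] is the Euclidean distance from row i of df_2 to row j of df_1
dist = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))

nearest = dist.argmin(axis=1)  # position of the closest df_1 row per df_2 row
df_2['min_distance'] = dist.min(axis=1)
df_2['name'] = df_1['Name'].to_numpy()[nearest]
print(df_2)
```

Because nearest holds positions rather than index labels, this version does not depend on df_1 having the default integer index.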

Getting date field from JSON url as pandas DataFrame

I am trying to load this API URL into a pandas DataFrame. I get the values, but I still need to add the date as a column alongside them:
import pandas as pd
from pandas.io.json import json_normalize
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
df = pd.read_json("https://covidapi.info/api/v1/country/DOM")
df = pd.DataFrame(df['result'].values.tolist())
print (df)
Getting this output:
confirmed deaths recovered
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
.. ... ... ...
72 1488 68 16
73 1488 68 16
74 1745 82 17
75 1828 86 33
76 1956 98 36
You need to pass the index from your dataframe as well as the data itself:
df = pd.DataFrame(index=df.index, data=df['result'].values.tolist())
The line above creates the same columns, but keeps the original date index from the API call.
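To make the idea concrete without calling the API, here is a small offline sketch. The frame raw is a hypothetical stand-in for what pd.read_json returned (dates as the index, one dict per row in the 'result' column); the dates are made up for illustration, and the final step that moves the dates into a regular column is an addition of mine, not part of the answer above.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_json:
# the index holds the dates, the 'result' column holds one dict per date.
raw = pd.Series(
    {
        '2020-04-05': {'confirmed': 1745, 'deaths': 82, 'recovered': 17},
        '2020-04-06': {'confirmed': 1828, 'deaths': 86, 'recovered': 33},
    },
    name='result',
).to_frame()

# Expand the dicts into columns while keeping the original date index
df = pd.DataFrame(index=raw.index, data=raw['result'].values.tolist())

# Optionally turn the index into an ordinary 'date' column
df = df.rename_axis('date').reset_index()
print(df)
```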

How to create DataFrame from json data - dicts, lists and arrays within an array

I am only getting the headers, not the data, from my JSON data.
I have tried json_normalize, which creates a DataFrame from JSON data, but when I loop over the races and append the data, I only get the headers.
import pandas as pd
import json
import requests
from pandas.io.json import json_normalize
import numpy as np
# importing json data
def get_json(file_path):
    r = requests.get('https://www.atg.se/services/racinginfo/v1/api/games/V75_2019-09-29_5_6')
    jsonResponse = r.json()
    with open(file_path, 'w', encoding='utf-8') as outfile:
        json.dump(jsonResponse, outfile, ensure_ascii=False, indent=None)
# Run the function and choose where to save the json file
get_json('../trav.json')
# Open the json file and print a list of the keys
with open('../trav.json', 'r') as json_data:
    d = json.load(json_data)
print(list(d.keys()))
[Out]:
['#type', 'id', 'status', 'pools', 'races', 'currentVersion']
To get all data for the starts in one race I can use json_normalize function
race_1_starts = json_normalize(d['races'][0]['starts'])
race_1_starts_df = race_1_starts.drop('videos', axis=1)
print(race_1_starts_df)
[Out]:
distance driver.birth ... result.prizeMoney result.startNumber
0 1640 1984 ... 62500 1
1 1640 1976 ... 11000 2
2 1640 1968 ... 500 3
3 1640 1953 ... 250000 4
4 1640 1968 ... 500 5
5 1640 1962 ... 18500 6
6 1640 1961 ... 7000 7
7 1640 1989 ... 31500 8
8 1640 1960 ... 500 9
9 1640 1954 ... 500 10
10 1640 1977 ... 125000 11
11 1640 1977 ... 500 12
Above we get a DataFrame with data on all starts from one race. However, when I try to loop through all races in range in order to get data on all starts for all races, then I only get the headers from each race and not the data on starts for each race:
all_starts = []
for t in range(len(d['races'])):
    all_starts.append([t+1, json_normalize(d['races'][t]['starts'])])
all_starts_df = pd.DataFrame(all_starts, columns = ['race', 'starts'])
print(all_starts_df)
[Out]:
race starts
0 1 distance ... ...
1 2 distance ... ...
2 3 distance ... ...
3 4 distance ... ...
4 5 distance ... ...
5 6 distance ... ...
6 7 distance ... ...
As output I want a single DataFrame that combines the data on all starts from all races. Note that the number of columns can differ between races. If one race has 21 columns and another has 20, all_starts_df should contain all of the columns, with 'NaN' wherever a race has no data for a column.
Expected result:
[Out]:
race distance driver.birth ... result.column_20 result.column_22
1 1640 1984 ... 12500 1
1 1640 1976 ... 11000 2
2 2140 1968 ... NaN 1
2 2140 1953 ... NaN 2
3 3360 1968 ... 1500 NaN
3 3360 1953 ... 250000 NaN
If you want all columns, you can try this. (I find a lot more than 20 columns, so I might have something wrong.)
all_starts = []
headers = []
for idx, race in enumerate(d['races'], start=1):
    # Drop 'videos' before recording the columns so it is not re-added below
    df = pd.json_normalize(race['starts']).drop('videos', axis=1)
    df['race'] = idx  # 1-based race number, as in the expected result
    all_starts.append(df)
    headers.append(set(df.columns))
# Create set of all columns for all races
columns = set.union(*headers)
# If columns are missing from one dataframe add it (as np.nan)
for df in all_starts:
    for c in columns - set(df.columns):
        df[c] = np.nan
# Concatenate all dataframes for each race to make one dataframe
df_all_starts = pd.concat(all_starts, axis=0, sort=True)
Alternatively, if you know the names of the columns you want to keep, try this:
columns = ['race', 'distance', 'driver.birth', 'result.prizeMoney']
all_starts = []
for idx, race in enumerate(d['races'], start=1):
    df = pd.json_normalize(race['starts'])
    df['race'] = idx  # 1-based race number
    all_starts.append(df[columns])
# Concatenate all dataframes for each race to make one dataframe
df_all_starts = pd.concat(all_starts, axis=0)
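It may also help to know that pd.concat already takes the union of the columns and fills the gaps with NaN, so the missing columns do not strictly have to be added by hand. The sketch below shows this on a tiny, hypothetical stand-in for d['races'] (the nested keys only mimic the shape of the real response), using pd.json_normalize, which is available as a top-level function in pandas 1.0 and later.

```python
import pandas as pd

# Hypothetical, heavily trimmed stand-in for d['races']
races = [
    {'starts': [{'distance': 1640, 'driver': {'birth': 1984},
                 'result': {'prizeMoney': 62500}}]},
    {'starts': [{'distance': 2140, 'driver': {'birth': 1968}}]},  # no 'result'
]

frames = []
for number, race in enumerate(races, start=1):
    starts = pd.json_normalize(race['starts'])  # nested keys become 'a.b' columns
    starts.insert(0, 'race', number)            # 1-based race number up front
    frames.append(starts)

# concat keeps every column seen in any frame and fills the gaps with NaN
all_starts_df = pd.concat(frames, ignore_index=True, sort=False)
print(all_starts_df)
```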

Using pd.to_datetime to convert "object" column into %HH:MM:SS

I am doing some exploratory data analysis using finish-time data scraped from the 2018 KONA IRONMAN. I used JSON to format the data and pandas to read into csv. The 'swim','bike','run' columns should be formatted as %HH:MM:SS to be operable, however, I am receiving a ValueError: ('Unknown string format:', '--:--:--').
print(data.head(2))
print(kona.info())
print(kona.describe())
Name div_rank ... bike run
0 Avila, Anthony 2470 138 ... 05:27:59 04:31:56
1 Lindgren, Mikael 1050 151 ... 05:17:51 03:49:20
swim 2472 non-null object
bike 2472 non-null object
run 2472 non-null object
Name div_rank ... bike run
count 2472 2472 ... 2472 2472
unique 2472 288 ... 2030 2051
top Jara, Vicente 986 -- ... --:--:-- --:--:--
freq 1 165 ... 122 165
How should I use pd.to_datetime to properly format the 'bike','swim','run' column and for future use sum these columns and append a 'Total Finish Time' column? Thanks!
The error occurs because pandas can't parse a time from '--:--:--'. You could convert all of those to '00:00:00', but that implies they did the event in zero time. The other option is to convert only the times that are present, leaving a null where there is no time. Converting to datetime also attaches a date of 1900-01-01, so I added .dt.time so that only the time is displayed.
timed_events = ['bike', 'swim', 'run']
for event in timed_events:
    result[event] = pd.to_datetime(result[result[event] != '--:--:--'][event], format="%H:%M:%S").dt.time
The problem with this, though, is that you wanted to sum those times, which would require some extra conversions. So I suggest using pd.to_timedelta() instead. It works the same way, in that you still need to exclude the '--:--:--' entries, but then you can sum the times. I also added a column with the number of events completed, so that if you want to sort by best times, you can filter out anyone who hasn't competed in all three events, as they would obviously have better times because they are missing entire events.
I'll also add, regarding this comment from mad_:
"You think providing all the code will be helpful but it does not. You will get a quicker and more useful response if you keep the code minimum that can replicate your issue. stackoverflow.com/help/mcve"
I'll give him the benefit of the doubt: he probably saw the whole code and did not realize that what you provided was the minimal code needed to replicate your issue, since no one wants to write code to generate your data before they can work with it. Sometimes it helps to state that explicitly in your question.
ie:
Here's the code to generate my data:
CODE PART 1
import bs4
import pandas as pd
code...
But now that I have the data, here's where I'm having trouble:
df = pd.to_timedelta()...
...
Luckily I remembered helping you earlier on this so knew I could go back and get that code. So the code you originally had was fine.
But here's the full code I used, which stores the csv in a different way than you originally had. You can change that part, but the end part is what you'll need:
from bs4 import BeautifulSoup, Comment
from collections import defaultdict
import requests
import pandas as pd
sauce = 'http://m.ironman.com/triathlon/events/americas/ironman/world-championship/results.aspx'
r = requests.get(sauce)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
def parse_table(soup):
    result = defaultdict(list)
    my_table = soup.find('tbody')
    for node in my_table.children:
        if isinstance(node, Comment):
            # Get content and strip comment "<!--" and "-->"
            # Wrap the rows in "table" tags as well.
            data = '<table>{}</table>'.format(node[4:-3])
            break
    table = BeautifulSoup(data, 'html.parser')
    for row in table.find_all('tr'):
        name, _, swim, bike, run, div_rank, gender_rank, overall_rank = [col.text.strip() for col in row.find_all('td')[1:]]
        result[name].append({
            'div_rank': div_rank,
            'gender_rank': gender_rank,
            'overall_rank': overall_rank,
            'swim': swim,
            'bike': bike,
            'run': run,
        })
    return result
jsonObj = parse_table(soup)
frames = []
for k, v in jsonObj.items():
    temp_df = pd.DataFrame.from_dict(v)
    temp_df['name'] = k
    frames.append(temp_df)
# DataFrame.append was removed in pandas 2.0; concatenate the collected pieces instead
result = pd.concat(frames).reset_index(drop=True)
result.to_csv('C:/data.csv', index=False)
# However you read in your csv/dataframe, use the code below on it to get those times
timed_events = ['bike', 'swim', 'run']
for event in timed_events:
    result[event] = pd.to_timedelta(result[result[event] != '--:--:--'][event])
result['total_events_participated'] = 3 - result.isnull().sum(axis=1)
result['total_times'] = result[timed_events].sum(axis=1)
Output:
print (result)
bike div_rank ... total_events_participated total_times
0 05:27:59 138 ... 3 11:20:06
1 05:17:51 151 ... 3 10:16:17
2 06:14:45 229 ... 3 14:48:28
3 05:13:56 162 ... 3 10:19:03
4 05:19:10 6 ... 3 09:51:48
5 04:32:26 25 ... 3 08:23:26
6 04:49:08 155 ... 3 10:16:16
7 04:50:10 216 ... 3 10:55:47
8 06:45:57 71 ... 3 13:50:28
9 05:24:33 178 ... 3 10:21:35
10 06:36:36 17 ... 3 14:36:59
11 NaT -- ... 0 00:00:00
12 04:55:29 100 ... 3 09:28:53
13 05:39:18 72 ... 3 11:44:40
14 04:40:41 -- ... 2 05:35:18
15 05:23:18 45 ... 3 10:55:27
16 05:15:10 3 ... 3 10:28:37
17 06:15:59 78 ... 3 11:47:24
18 NaT -- ... 0 00:00:00
19 07:11:19 69 ... 3 15:39:51
20 05:49:02 29 ... 3 10:32:36
21 06:45:48 4 ... 3 13:39:17
22 04:39:46 -- ... 2 05:48:38
23 06:03:01 3 ... 3 11:57:42
24 06:24:58 193 ... 3 13:52:57
25 05:07:42 116 ... 3 10:01:24
26 04:44:46 112 ... 3 09:29:22
27 04:46:06 55 ... 3 09:32:43
28 04:41:05 69 ... 3 09:31:32
29 05:27:55 68 ... 3 11:09:37
... ... ... ... ...
2442 NaT -- ... 0 00:00:00
2443 05:26:40 3 ... 3 11:28:53
2444 05:04:37 19 ... 3 10:27:13
2445 04:50:45 74 ... 3 09:15:14
2446 07:17:40 120 ... 3 14:46:05
2447 05:26:32 45 ... 3 10:50:48
2448 05:11:26 186 ... 3 10:26:00
2449 06:54:15 185 ... 3 14:05:16
2450 05:12:10 22 ... 3 11:21:37
2451 04:59:44 45 ... 3 09:29:43
2452 06:03:59 96 ... 3 12:12:35
2453 06:07:27 16 ... 3 12:47:11
2454 04:38:06 91 ... 3 09:52:27
2455 04:41:56 14 ... 3 08:58:46
2456 04:38:48 85 ... 3 09:18:31
2457 04:42:30 42 ... 3 09:07:29
2458 04:40:54 110 ... 3 09:32:34
2459 06:08:59 37 ... 3 12:15:23
2460 04:32:20 -- ... 2 05:31:05
2461 04:45:03 96 ... 3 09:30:06
2462 06:14:29 95 ... 3 13:38:54
2463 06:00:20 164 ... 3 12:10:03
2464 05:11:07 22 ... 3 10:32:35
2465 05:56:06 188 ... 3 13:32:48
2466 05:09:26 2 ... 3 09:54:55
2467 05:22:15 7 ... 3 10:26:14
2468 05:53:14 254 ... 3 12:34:21
2469 05:00:29 156 ... 3 10:18:29
2470 04:30:46 7 ... 3 08:38:23
2471 04:34:59 39 ... 3 09:04:13
[2472 rows x 9 columns]
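A slightly shorter variant of the conversion step, offered as a sketch rather than a replacement for the code above: pd.to_timedelta accepts errors='coerce', which turns unparseable strings such as '--:--:--' into NaT, so the explicit filter is not needed. The two-row frame below is a made-up sample, not the scraped data.

```python
import pandas as pd

# Made-up sample with the same column names as the scraped results
result = pd.DataFrame({
    'swim': ['01:20:11', '--:--:--'],
    'bike': ['05:27:59', '04:40:41'],
    'run': ['04:31:56', '--:--:--'],
})

timed_events = ['swim', 'bike', 'run']
for event in timed_events:
    # Strings that cannot be parsed (here '--:--:--') become NaT
    result[event] = pd.to_timedelta(result[event], errors='coerce')

# Count only the timed columns, so nulls elsewhere cannot skew the total
result['total_events_participated'] = result[timed_events].notna().sum(axis=1)
# NaT entries are skipped when summing across the row
result['total_times'] = result[timed_events].sum(axis=1)
print(result)
```

To rank finishers, filtering on total_events_participated == 3 first keeps partial results from appearing as the fastest times.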
