I have the script below, which returns kline data for each quote symbol i. I set up an empty list, query the API with get_kline_data, and append each symbol and its result to klines_list with .extend:
klines_list = []
a = ["REQ-ETH", "REQ-BTC", "XLM-BTC"]
for i in a:
    klines = client.get_kline_data(i, '5min', 1619317366, 1619317606)
    klines_list.extend([i, klines])
klines_list
klines_list then contains data in this format:
['REQ-ETH',
[['1619317500',
'0.0000491',
'0.0000491',
'0.0000491',
'0.0000491',
'5.1147',
'0.00025113177']],
'REQ-BTC',
[['1619317500',
'0.00000219',
'0.00000219',
'0.00000219',
'0.00000219',
'19.8044',
'0.000043371636']],
'XLM-BTC',
[['1619317500',
'0.00000863',
'0.00000861',
'0.00000863',
'0.00000861',
'653.5693',
'0.005629652673']]]
I then try to convert it into a DataFrame:
import pandas as pd
df = pd.DataFrame(klines_list)
And this is the result:
0
0 REQ-ETH
1 [[1619317500, 0.0000491, 0.0000491, 0.0000491,...
2 REQ-BTC
3 [[1619317500, 0.00000219, 0.00000219, 0.000002...
4 XLM-BTC
5 [[1619317500, 0.00000863, 0.00000861, 0.000008..
The structure of the DataFrame is incorrect, and it seems to be due to the way I have put my list together.
I would like the quantitative data in a column corresponding to the correct entry in list a, not in rows. The ticker data from list a ("REQ-ETH", "REQ-BTC", etc.) should also be in a separate column. What would be a good way to restructure this?
Edit (@Ynjxsjmh): this is the output when following the suggestion below to build a dictionary inside the for loop:
REQ-ETH REQ-BTC XLM-BTC
0 [1619317500, 0.0000491, 0.0000491, 0.0000491, ... NaN NaN
1 NaN [1619317500, 0.00000219, 0.00000219, 0.0000021... NaN
2 NaN NaN [1619317500, 0.00000863, 0.00000861, 0.0000086...
pandas.DataFrame() can accept a dict. It uses the dict keys as column headers and the dict values as column values.
import pandas as pd

a = ["REQ-ETH", "REQ-BTC", "XLM-BTC"]

klines_data = {}
for i in a:
    klines = client.get_kline_data(i, '5min', 1619317366, 1619317606)
    klines_data[i] = klines[0]   # add a key to klines_data

df = pd.DataFrame(klines_data)
print(df)
REQ-ETH REQ-BTC XLM-BTC
0 1619317500 1619317500 1619317500
1 0.0000491 0.00000219 0.00000863
2 0.0000491 0.00000219 0.00000861
3 0.0000491 0.00000219 0.00000863
4 0.0000491 0.00000219 0.00000861
5 5.1147 19.8044 653.5693
6 0.00025113177 0.000043371636 0.005629652673
If the lengths of the klines lists are not equal, you can use:
df = pd.DataFrame.from_dict(klines_data, orient='index').T
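Since client.get_kline_data needs a live exchange connection, here is a self-contained sketch of the dict-to-DataFrame step, using the sample kline rows from the question as stand-ins for the API responses:

```python
import pandas as pd

# Stand-in for the API responses: one kline row per ticker, taken from
# the question's sample output.
sample_klines = {
    "REQ-ETH": ['1619317500', '0.0000491', '0.0000491', '0.0000491',
                '0.0000491', '5.1147', '0.00025113177'],
    "REQ-BTC": ['1619317500', '0.00000219', '0.00000219', '0.00000219',
                '0.00000219', '19.8044', '0.000043371636'],
}

# Dict keys become column headers, dict values become column contents.
df = pd.DataFrame(sample_klines)
print(df)

# With unequal list lengths, build row-wise and transpose instead.
df2 = pd.DataFrame.from_dict(sample_klines, orient='index').T
```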
I am extracting a DataFrame in pandas and want to extract only rows where the date is after a variable.
I can do this in multiple steps, but would like to know if it is possible to apply all the logic in one call, for best practice.
Here is my code
import pandas as pd
self.min_date = "2020-05-01"
#Extract DF from URL
self.df = pd.read_html("https://webgate.ec.europa.eu/rasff-window/portal/index.cfm?event=notificationsList")[0]
#Here is where the error lies, I want to extract the columns ["Subject","Reference","Date of case"] but where the date is after min_date.
self.df = self.df.loc[["Date of case" < self.min_date], ["Subject","Reference","Date of case"]]
return(self.df)
I keep getting the error: "IndexError: Boolean index has wrong length: 1 instead of 100"
I cannot find the solution online because every answer is too specific to the scenario of the person who asked the question.
For example, this solution only works if you are filtering on one column's values: How to select rows from a DataFrame based on column values?
I appreciate any help.
Replace this:
["Date of case" < self.min_date]
with this:
self.df["Date of case"] < self.min_date
That is:
self.df = self.df.loc[self.df["Date of case"] < self.min_date,
                      ["Subject", "Reference", "Date of case"]]
(Note that < keeps rows before min_date; use > if you want rows after it, as the question describes.)
You have a slight syntax issue.
Keep in mind that it's best practice to convert string dates into pandas datetime objects using pd.to_datetime.
min_date = pd.to_datetime("2020-05-01")
#Extract DF from URL
df = pd.read_html("https://webgate.ec.europa.eu/rasff-window/portal/index.cfm?event=notificationsList")[0]
#Here is where the error lies, I want to extract the columns ["Subject","Reference","Date of case"] but where the date is after min_date.
df['Date of case'] = pd.to_datetime(df['Date of case'])
df = df.loc[df["Date of case"] > min_date, ["Subject","Reference","Date of case"]]
Output:
Subject Reference Date of case
0 Salmonella enterica ser. Enteritidis (presence... 2020.2145 2020-05-22
1 migration of primary aromatic amines (0.4737 m... 2020.2131 2020-05-22
2 celery undeclared on green juice drink from Ge... 2020.2118 2020-05-22
3 aflatoxins (B1 = 29.4 µg/kg - ppb) in shelled ... 2020.2146 2020-05-22
4 too high content of E 200 - sorbic acid (1772 ... 2020.2125 2020-05-22
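The same pattern can be checked without the network call; a minimal sketch with a small hypothetical DataFrame (the row values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "Subject": ["a", "b", "c"],
    "Reference": ["2020.1", "2020.2", "2020.3"],
    "Date of case": pd.to_datetime(["2020-04-30", "2020-05-10", "2020-05-22"]),
})
min_date = pd.to_datetime("2020-05-01")

# The boolean Series (one True/False per row) selects rows;
# the list of names selects columns.
out = df.loc[df["Date of case"] > min_date,
             ["Subject", "Reference", "Date of case"]]
print(out)
```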
I am not entirely sure if this is possible but I thought I would go ahead and ask. I currently have a string that looks like the following:
myString =
"{"Close":175.30,"DownTicks":122973,"DownVolume":18639140,"High":177.47,"Low":173.66,"Open":177.32,"Status":29,"TimeStamp":"\/Date(1521489600000)\/","TotalTicks":245246,"TotalVolume":33446771,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":122273,"UpVolume":14807630,"OpenInterest":0}
{"Close":175.24,"DownTicks":69071,"DownVolume":10806836,"High":176.80,"Low":174.94,"Open":175.24,"Status":536870941,"TimeStamp":"\/Date(1521576000000)\/","TotalTicks":135239,"TotalVolume":19649350,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":66168,"UpVolume":8842514,"OpenInterest":0}"
The data can contain a varying number of records (this example has 2, but there could be more); however, the parameters will always be the same (Close, DownTicks, DownVolume, etc.).
Is there a way to create a dataframe from this string that takes the parameters as the index, and the numbers as the values in the columns? So the dataframe would look something like this:
df =
0 1
index
Close 175.30 175.24
DownTicks 122973 69071
DownVolume 18639140 10806836
High 177.47 176.80
Low 173.66 174.94
Open 177.32 175.24
(etc)...
It looks like there are some issues with your input. As mentioned by @lmiguelvargasf, there is a missing comma at the end of the first dictionary. Additionally, the two dictionaries are separated by a newline, which a simple str.replace can fix.
Once those issues have been solved, the process is pretty simple.
import ast
import pandas as pd

myString = '''{"Close":175.30,"DownTicks":122973,"DownVolume":18639140,"High":177.47,"Low":173.66,"Open":177.32,"Status":29,"TimeStamp":"\/Date(1521489600000)\/","TotalTicks":245246,"TotalVolume":33446771,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":122273,"UpVolume":14807630,"OpenInterest":0}
{"Close":175.24,"DownTicks":69071,"DownVolume":10806836,"High":176.80,"Low":174.94,"Open":175.24,"Status":536870941,"TimeStamp":"\/Date(1521576000000)\/","TotalTicks":135239,"TotalVolume":19649350,"UnchangedTicks":0,"UnchangedVolume":0,"UpTicks":66168,"UpVolume":8842514,"OpenInterest":0}'''

myString = myString.replace('\n', ',')
list_of_dicts = list(ast.literal_eval(myString))
df = pd.DataFrame.from_dict(list_of_dicts).T
df
0 1
Close 175.3 175.24
DownTicks 122973 69071
DownVolume 18639140 10806836
High 177.47 176.8
Low 173.66 174.94
Open 177.32 175.24
OpenInterest 0 0
Status 29 536870941
TimeStamp \/Date(1521489600000)\/ \/Date(1521576000000)\/
TotalTicks 245246 135239
TotalVolume 33446771 19649350
UnchangedTicks 0 0
UnchangedVolume 0 0
UpTicks 122273 66168
UpVolume 14807630 8842514
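As an alternative to ast.literal_eval: each line of the string is itself valid JSON (\/ is a legal JSON escape for /), so json.loads can parse the lines one by one without the comma fix. A sketch with a shortened sample:

```python
import json
import pandas as pd

# Shortened two-record sample in the same shape as the question's string.
myString = '''{"Close":175.30,"DownTicks":122973,"High":177.47}
{"Close":175.24,"DownTicks":69071,"High":176.80}'''

# Parse each line as a JSON object, then transpose so the
# parameters become the index and each record becomes a column.
records = [json.loads(line) for line in myString.splitlines()]
df = pd.DataFrame(records).T
print(df)
```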
I would like to write a Python script that checks whether any day is missing. If one is, it should take the price from the latest preceding day and create a new row for the missing day, as shown below. My data is in CSV files. Any ideas how it can be done?
Before:
MSFT,5-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,29-May-07,243.31
After:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,253.28
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,249.95
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,243.31
MSFT,29-May-07,243.31
My solution:
import pandas as pd
df = pd.read_csv("path/to/file/file.csv", names=list("abc")) # read CSV with no header; name columns a, b, c
cols = df.columns # store column order
df.b = pd.to_datetime(df.b) # convert col Date to datetime
df.set_index("b",inplace=True) # set col Date as index
df = df.resample("D").ffill().reset_index() # resample Days and fill values
df = df[cols] # revert order
df.sort_values(by="b",ascending=False,inplace=True) # sort by date
df["b"] = df["b"].dt.strftime("%-d-%b-%y") # revert date format (%-d is Linux/macOS; use %#d on Windows)
df.to_csv("data.csv",index=False,header=False) #specify outputfile if needed
print(df.to_string())
Using the pandas library, this operation can be done in essentially a single line. But first we need to read your data into the right format:
import io
import pandas as pd
s = u"""name,Date,Close
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,23-Dec-16,789.91
MSFT,16-Dec-16,790.8
MSFT,15-Dec-16,797.85
MSFT,14-Dec-16,797.07"""
#df = pd.read_csv("path/to/file.csv") # read from file
df = pd.read_csv(io.StringIO(s)) # read string as file
cols = df.columns # store column order
df.Date = pd.to_datetime(df.Date) # convert col Date to datetime
df.set_index("Date",inplace=True) # set col Date as index
df = df.resample("D").ffill().reset_index() # resample Days and fill values
df
Returns:
Date name Close
0 2016-12-14 MSFT 797.07
1 2016-12-15 MSFT 797.85
2 2016-12-16 MSFT 790.80
3 2016-12-17 MSFT 790.80
4 2016-12-18 MSFT 790.80
5 2016-12-19 MSFT 790.80
6 2016-12-20 MSFT 790.80
7 2016-12-21 MSFT 790.80
8 2016-12-22 MSFT 790.80
9 2016-12-23 MSFT 789.91
10 2016-12-24 MSFT 789.91
11 2016-12-25 MSFT 789.91
12 2016-12-26 MSFT 789.91
13 2016-12-27 MSFT 791.55
14 2016-12-28 MSFT 785.05
15 2016-12-29 MSFT 782.79
16 2016-12-30 MSFT 771.82
Return back to csv with:
df = df[cols] # revert order
df.sort_values(by="Date",ascending=False,inplace=True) # sort by date
df["Date"] = df["Date"].dt.strftime("%-d-%b-%y") # revert date format
df.to_csv(index=False,header=False) #specify outputfile if needed
Output:
MSFT,30-Dec-16,771.82
MSFT,29-Dec-16,782.79
MSFT,28-Dec-16,785.05
MSFT,27-Dec-16,791.55
MSFT,26-Dec-16,789.91
MSFT,25-Dec-16,789.91
MSFT,24-Dec-16,789.91
MSFT,23-Dec-16,789.91
...
To do this, you would need to iterate through the rows of your dataframe. That would look something like:
for index, row in df.iterrows():
    do_something()
To give you an idea, the
do_something()
part of your code would probably be something like checking if there was a gap between the dates. Then you would copy the other columns from the row above and insert a new row using:
df.loc[row] = [2, 3, 4]   # adding a row
df.index = df.index + 1   # shifting index
df = df.sort_index()      # sorting by index
Hope this helped give you an idea of how you would solve this. Let me know if you want some more code!
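A runnable sketch of the insert-and-reorder pattern above, with hypothetical values (here the new row is inserted at position 0 by shifting the existing index first):

```python
import pandas as pd

df = pd.DataFrame({"price": [259.16, 253.28]}, index=[0, 1])

df.index = df.index + 1   # shift existing rows down
df.loc[0] = [249.95]      # insert a new row at position 0
df = df.sort_index()      # restore index order
print(df)
```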
This code uses standard routines.
from datetime import datetime, timedelta
Input lines will have to be split on commas, and the dates parsed in two places in the main part of the code. I have therefore put this work in a single function.
def glean(s):
    msft, date_part, amount = s.split(',')
    if date_part.find('-') == 1:   # pad single-digit days
        date_part = '0' + date_part
    date = datetime.strptime(date_part, '%d-%b-%y')
    return date, amount
Similarly, dates will have to be formatted for output with other pieces of data in a number of places in the main code.
def out(date, amount):
    date_str = date.strftime('%d-%b-%y')
    print(('%s,%s,%s' % ('MSFT', date_str, amount)).replace('MSFT,0', 'MSFT,'))
with open('before.txt') as before:
I read the initial line of data on its own to establish the first date for comparison with the date in the next line.
    previous_date, previous_amount = glean(before.readline().strip())
    out(previous_date, previous_amount)
    for line in before.readlines():
        date, amount = glean(line.strip())
I calculate the elapsed time between the current line and the previous line, to know how many lines to output in place of missing lines.
        elapsed = previous_date - date
setting_date is decremented from previous_date for each day that elapsed without data, and one line is emitted per missing day, if there were any.
        setting_date = previous_date
        for i in range(-1 + elapsed.days):
            setting_date -= timedelta(days=1)
            out(setting_date, previous_amount)
Now the available line of data is output.
        out(date, amount)
Now previous_date and previous_amount are reset to reflect the new values, for use against the next line of data, if any.
        previous_date, previous_amount = date, amount
Output:
MSFT,5-Jun-07,259.16
MSFT,4-Jun-07,259.16
MSFT,3-Jun-07,253.28
MSFT,2-Jun-07,253.28
MSFT,1-Jun-07,249.95
MSFT,31-May-07,248.71
MSFT,30-May-07,248.71
MSFT,29-May-07,243.31
I am trying to calculate the diff_chg of S&P sectors for 4 different dates (given in start_return):
start_return = [-30, -91, -182, -365]
for date in start_return:
    diff_chg = closingprices[-1].divide(closingprices[date])
    for i in sectors:  # sectors is XLK, XLY, etc.
        diff_chg[i] = diff_chg[sectordict[i]].mean()  # finds the % chg of all sectors
    diff_df = diff_chg.to_frame
My expected output is a df with 4 columns, each holding the returns of every sector for the given period (-30, -91, -182, -365).
As of now, this code returns the sum of the returns of all 4 periods in diff_df. I would like it to create a new column in the df for each period.
My code returns:
XLK 1.859907
XLI 1.477272
XLF 1.603589
XLE 1.415377
XLB 1.526237
but I want it to return:
      1mo (-30)   3mo (-91)   6mo (-182)   1yr (-365)
XLK 1.086547 values here etc etc
XLI 1.0334
XLF 1.07342
XLE .97829
XLB 1.0281
Try something like this:
start_return = [-30, -91, -182, -365]

diff_chg = pd.DataFrame()
for date in start_return:
    diff_chg[date] = closingprices[-1].divide(closingprices[date])
This adds a column for each date in start_return to a single DataFrame created before the loop.
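A self-contained sketch of the idea, with a made-up closingprices DataFrame (one column per ticker, one row per day; iloc is used here for positional row access):

```python
import numpy as np
import pandas as pd

# Hypothetical daily closing prices for three sector ETFs.
rng = np.random.default_rng(0)
closingprices = pd.DataFrame(
    rng.uniform(90, 110, size=(365, 3)),
    columns=["XLK", "XLI", "XLF"],
)

start_return = [-30, -91, -182, -365]

diff_chg = pd.DataFrame()
for date in start_return:
    # latest close divided by the close `date` days back, per ticker
    diff_chg[date] = closingprices.iloc[-1].divide(closingprices.iloc[date])
print(diff_chg)
```

The result has one row per sector and one column per look-back period, matching the expected output in the question.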