I am iterating over the available rows, but it doesn't seem to be the optimal way to do it -- it takes forever.
Is there a better way to do this in Pandas?
INIT_TIME = datetime.datetime.strptime(date + ' ' + time, "%Y-%B-%d %H:%M:%S")
#NEED TO ADD DATA FROM THAT COLUMN
df = pd.read_csv(dataset_path, delimiter=',',skiprows=range(0,1),names=['TCOUNT','CORE','COUNTER','EMPTY','NAME','TSTAMP','MULT','STAMPME'])
df = df.drop('MULT',1)
df = df.drop('EMPTY',1)
df = df.drop('TSTAMP', 1)
for index, row in df.iterrows():
    TMP_TIME = INIT_TIME + datetime.timedelta(seconds=row['TCOUNT'])
    df['STAMPME'] = TMP_TIME.strftime("%s")
In addition, the datetime I am adding is in the following format
2017-05-11 11:12:37.100192 1494493957
2017-05-11 11:12:37.200541 1494493957
and therefore the Unix timestamp is the same for both rows (which is correct), but is there a better way to represent it?
Assuming the datetimes are correctly reflecting what you're trying to do, with respect to Pandas you should be able to do:
df['STAMPME'] = df['TCOUNT'].apply(lambda x: (datetime.timedelta(seconds=x) + INIT_TIME).strftime("%s"))
As noted here you should not use iterrows() to modify the DF you are iterating over. If you need to iterate row by row (as opposed to using the apply method) you can use another data object, e.g. a list, to retain the values you're calculating, and then create a new column from that.
Also, for future reference, the itertuples() method is faster than iterrows(). It returns namedtuples, so you access fields by attribute or position (e.g. row.NAME or row[1]) rather than with row['name'].
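For example, a minimal sketch of that list-based approach with itertuples, assuming the same df and INIT_TIME as above (note that strftime("%s") is a platform-specific extension and not portable to Windows):
stamps = []
for row in df.itertuples():
    # accumulate values in a plain list instead of writing to df inside the loop
    stamps.append((INIT_TIME + datetime.timedelta(seconds=row.TCOUNT)).strftime("%s"))
df['STAMPME'] = stamps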
I'd rewrite your code like this
INIT_TIME = datetime.datetime.strptime(date + ' ' + time, "%Y-%B-%d %H:%M:%S")
INIT_TIME = pd.to_datetime(INIT_TIME)
df = pd.read_csv(
    dataset_path, delimiter=',', skiprows=range(0, 1),
    names=['TCOUNT','CORE','COUNTER','EMPTY','NAME','TSTAMP','MULT','STAMPME']
)
df = df.drop(columns=['MULT', 'EMPTY', 'TSTAMP'])
df['STAMPME'] = pd.to_timedelta(df['TCOUNT'], unit='s') + INIT_TIME
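If you want the Unix-timestamp representation rather than datetimes (an addition on my part, not in the original answer), you can convert the resulting datetime64 column to epoch seconds:
df['STAMPME'] = (pd.to_timedelta(df['TCOUNT'], unit='s') + INIT_TIME).astype('int64') // 10**9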
How can I build a function that creates these dataframes?
buy_orders_1h = pd.DataFrame(
    {'Date_buy': buy_orders_date_1h,
     'Name_buy': buy_orders_name_1h
    })
sell_orders_1h = pd.DataFrame(
    {'Date_sell': sell_orders_date_1h,
     'Name_sell': sell_orders_name_1h
    })
I have 10 dataframes like this that I create very manually, and every time I want to add a new column I have to do it in all of them, which is time-consuming. If I could build a function, I would only have to do it once.
The difference between the two dataframes above is of course that one is for buy signals and the other is for sell signals.
I guess the inputs to the function should be:
_buy/_sell - for the Column name
buy_ / sell_ - for the Column input
I'm thinking input to the function could be something like:
def create_dfs(col, col_input, hour):
    df = pd.DataFrame(
        {'Date' + col: col_input + "orders_date_" + hour,
         'Name' + col: col_input + "orders_name_" + hour
        })
    return df
buy_orders_1h = create_dfs("_buy", "buy_", "1h")
sell_orders_1h = create_dfs("_sell", "sell_", "1h")
A DataFrame needs an index, so you can either pass an index manually or enter your row values in list form:
def create_dfs(col, col_input, hour):
    df = pd.DataFrame(
        {'Date' + col: [col_input + "orders_date_" + hour],
         'Name' + col: [col_input + "orders_name_" + hour]})
    return df
buy_orders_1h = create_dfs("_buy", "buy_", "1h")
sell_orders_1h = create_dfs("_sell", "sell_", "1h")
Edit: Updated due to new information:
To look up a global variable by the string form of its name, index globals() with that string, in the following manner:
'Date' + col: globals()[col_input + "orders_date_" + hour]
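Putting it together, a minimal sketch (assuming buy_orders_date_1h, buy_orders_name_1h, etc. exist as module-level variables):
def create_dfs(col, col_input, hour):
    # look up the existing lists by building their variable names as strings
    df = pd.DataFrame(
        {'Date' + col: globals()[col_input + "orders_date_" + hour],
         'Name' + col: globals()[col_input + "orders_name_" + hour]})
    return df

buy_orders_1h = create_dfs("_buy", "buy_", "1h")
sell_orders_1h = create_dfs("_sell", "sell_", "1h")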
Please check the output to see if this is what you want. You first create two dictionaries, and depending on the buy=True condition the function appends either to buying_df or to selling_df. I created two sample lists of dates and column names and appended them iteratively to the desired dictionaries. Only after the dicts have been filled is the pandas.DataFrame created; you do not need to build it iteratively, just once at the end, when your dates and names have been collected.
from collections import defaultdict
import pandas as pd
buying_df=defaultdict(list)
selling_df=defaultdict(list)
def add_column_to_df(date, name, buy=True):
    if buy:
        buying_df["Date_buy"].append(date)
        buying_df["Name_buy"].append(name)
    else:
        selling_df["Date_sell"].append(date)
        selling_df["Name_sell"].append(name)
dates=["1900","2000","2010"]
names=["Col_name1","Col_name2","Col_name3"]
for date, name in zip(dates, names):
    add_column_to_df(date, name)
#print(buying_df)
df=pd.DataFrame(buying_df)
print(df)
I have a dataframe in which I need to calculate the mean of the ozone values for every 8 hours. The problem is that the column I resample on ('readable time') disappears and cannot be referenced after the resampling.
import pandas as pd
data = pd.read_csv("o3_new.csv")
del data['latitude']
del data['longitude']
del data['altitude']
sensor_name = "o3"
data['readable time'] = pd.to_datetime(data['readable time'], dayfirst=True)
data = data.resample('480min', on='readable time').mean() # 8h mean
data[str(sensor_name) + "_aqi"] = ""
for i in range(len(data)):
    data[str(sensor_name) + "_aqi"][i] = calculate_aqi(sensor_name, data[sensor_name][i])
print(data['readable time']) #throws KeyError
Where o3_new.csv is like this:
,time,latitude,longitude,altitude,o3,readable time,day
0,1591037392,45.645893,25.599471,576.38,39.4,1/6/2020 21:49,1/6/2020
1,1591037452,45.645893,25.599471,576.64,48.4,1/6/2020 21:50,1/6/2020
2,1591037512,45.645893,25.599471,576.56,53.4,1/6/2020 21:51,1/6/2020
3,1591037572,45.645893,25.599471,576.64,36.4,1/6/2020 21:52,1/6/2020
4,1591037632,45.645893,25.599471,576.73,50.4,1/6/2020 21:53,1/6/2020
5,1591037692,45.645893,25.599471,577.09,37.4,1/6/2020 21:54,1/6/2020
What to do to keep referencing the 'readable time' column after resampling?
What would you like the column to contain? mean makes no particular sense for time columns. Also, the resampler makes your on column the index, so data.reset_index(inplace=True) may be all you need.
Alternatively, you can still access the values via data.index directly after the resample.
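A minimal sketch of that fix, using the question's column names:
data = data.resample('480min', on='readable time').mean()  # 8h mean
data.reset_index(inplace=True)  # 'readable time' moves from the index back to a column
print(data['readable time'])    # no longer raises KeyError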
I am trying to compare two different values in a dataframe. I wasn't able to make use of the questions/answers I've found.
import pandas as pd
# from datetime import timedelta
"""
read csv file
clean date column
convert date str to datetime
sort for equity options
replace date str column with datetime column
"""
trade_reader = pd.read_csv('TastyTrades.csv')
trade_reader['Date'] = trade_reader['Date'].replace({'T': ' ', '-0500': ''}, regex=True)
date_converter = pd.to_datetime(trade_reader['Date'], format="%Y-%m-%d %H:%M:%S")
options_frame = trade_reader.loc[(trade_reader['Instrument Type'] == 'Equity Option')]
clean_frame = options_frame.replace(to_replace=['Date'], value='date_converter')
# Separate opening transaction from closing transactions, combine frames
opens = clean_frame[clean_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_OPEN'])]
closes = clean_frame[clean_frame['Action'].isin(['BUY_TO_CLOSE', 'SELL_TO_CLOSE'])]
open_close_set = set(opens['Symbol']) & set(closes['Symbol'])
open_close_frame = clean_frame[clean_frame['Symbol'].isin(open_close_set)]
'''
convert Value to float
sort for trade readability
write
'''
ocf_float = open_close_frame['Value'].astype(float)
ocf_sorted = open_close_frame.sort_values(by=['Date', 'Call or Put'], ascending=True)
# for readability, revert back to ocf_sorted below
ocf_list = ocf_sorted.drop(
['Type', 'Instrument Type', 'Description', 'Quantity', 'Average Price', 'Commissions', 'Fees', 'Multiplier'], axis=1
)
ocf_list.reset_index(drop=True, inplace=True)
ocf_list['Strategy'] = ''
# ocf_list.to_csv('Sorted.csv')
# create strategy list
debit_single = []
debit_vertical = []
debit_calendar = []
credit_vertical = []
iron_condor = []
# shift columns
ocf_list['Symbol Shift'] = ocf_list['Underlying Symbol'].shift(1)
ocf_list['Symbol Check'] = ocf_list['Underlying Symbol'] == ocf_list['Symbol Shift']
# compare symbols, append depending on criteria met
for row in ocf_list:
    if row['Symbol Shift'] is row['Underlying Symbol']:
        debit_vertical.append(row)
print(type(ocf_list['Underlying Symbol']))
ocf_list.to_csv('Sorted.csv')
print(debit_vertical)
# delta = timedelta(seconds=10)
The error I get is:
line 51, in <module>
if row['Symbol Check'][-1] is row['Underlying Symbol'][-1]:
TypeError: string indices must be integers
I am trying to compare the newly created shifted column to the original, and if they are the same, append to a list. Is there a way to compare two string values at all in Python? I've tried checking whether Symbol Check is true and it still returns an error about string indices needing to be integers. .iterrows() didn't work either.
Here you are actually iterating over the column labels of your DataFrame, not its rows, which is why row ends up being a string and indexing it with another string raises the TypeError:
for row in ocf_list:
    if row['Symbol Shift'] is row['Underlying Symbol']:
        debit_vertical.append(row)
You can use the iterrows or itertuples methods to iterate through the rows instead. iterrows yields (index, Series) pairs, so row['Symbol Shift'] works there; itertuples yields namedtuples, whose fields you access by attribute or position (row.field or row[x]) rather than with row['name'].
Second, you should use == instead of is since you are probably comparing values, not identities.
Lastly, I would skip iterating over the rows entirely, as pandas is made for selecting rows based on a condition. You should be able to replace the aforementioned code with this:
debit_vertical = ocf_list[ocf_list['Symbol Shift'] == ocf_list['Underlying Symbol']].values.tolist()
The Excel file created from Python is extremely slow to open, even though the file is only about 50 MB.
I have tried both pandas and openpyxl.
def to_file(list_report, list_sheet, strip_columns, Name):
    i = 0
    wb = ExcelWriter(path_output + '\\' + Name + dateformat + '.xlsx')
    while i <= len(list_report) - 1:
        try:
            df = pd.DataFrame(pd.read_csv(path_input + '\\' + list_report[i] + reportdate + '.csv'))
            for column in strip_column:
                try:
                    df[column] = df[column].str.strip('=("")')
                except:
                    pass
            df = adjust_report(df, list_report[i])
            df = df.apply(pd.to_numeric, errors='ignore', downcast='integer')
            df.to_excel(wb, sheet_name=list_sheet[i], index=False)
        except:
            print('Missing report: ' + list_report[i])
        i += 1
    wb.save()
Is there any way to speed it up?
idiom
Let us rename list_report to reports.
Then your while loop is usually expressed as simply: for i in range(len(reports)):
You access the i-th element several times. The loop could bind that for you, with: for i, report in enumerate(reports):.
But it turns out the only other use of i is list_sheet[i], so if you zip the two lists together you never need the index at all. Most folks would write this as: for report, sheet in zip(reports, sheets): (see the sketch below).
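A minimal sketch of the loop with that idiom applied (the body is otherwise unchanged apart from the renames):
for report, sheet in zip(reports, sheets):
    try:
        df = pd.read_csv(path_input + '\\' + report + reportdate + '.csv')
        # ... same stripping, adjusting and numeric conversion as before ...
        df.to_excel(wb, sheet_name=sheet, index=False)
    except:
        print('Missing report: ' + report)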
code organization
This bit of code is very nice:
for column in strip_column:
    try:
        df[column] = df[column].str.strip('=("")')
    except:
        pass
I recommend you bury it in a helper function, e.g. def strip_punctuation, as sketched below.
(The list should be plural, I think? strip_columns?)
Then you would have a simple sequence of df assignments.
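A minimal sketch of that helper (the name strip_punctuation and its signature are my suggestion, not from the original code):
def strip_punctuation(df, columns):
    # Remove the ="..." wrapper characters that some report exporters add around values.
    for column in columns:
        try:
            df[column] = df[column].str.strip('=("")')
        except (KeyError, AttributeError):
            pass  # column missing, or not a string column in this report
    return df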
timing
Profile elapsed time with time.time(). Surround each df assignment with code like this:
from time import time

t0 = time()
df = ...
print(time() - t0)
That will show you which part of your processing pipeline takes the longest and therefore should receive the most effort for speeding it up.
I suspect adjust_report() uses the bulk of the time,
but without seeing it that's hard to say.
This may be a slightly insane question...
I've got a single Pandas DF of articles which I have split into multiple DFs, so each DF only contains the articles from a particular year. I have then put these variables into a list called box_of_years.
indexed_df = article_db.set_index('date')
indexed_df = indexed_df.sort_index()
year_2004 = indexed_df.truncate(before='2004-01-01', after='2004-12-31')
year_2005 = indexed_df.truncate(before='2005-01-01', after='2005-12-31')
year_2006 = indexed_df.truncate(before='2006-01-01', after='2006-12-31')
year_2007 = indexed_df.truncate(before='2007-01-01', after='2007-12-31')
year_2008 = indexed_df.truncate(before='2008-01-01', after='2008-12-31')
year_2009 = indexed_df.truncate(before='2009-01-01', after='2009-12-31')
year_2010 = indexed_df.truncate(before='2010-01-01', after='2010-12-31')
year_2011 = indexed_df.truncate(before='2011-01-01', after='2011-12-31')
year_2012 = indexed_df.truncate(before='2012-01-01', after='2012-12-31')
year_2013 = indexed_df.truncate(before='2013-01-01', after='2013-12-31')
year_2014 = indexed_df.truncate(before='2014-01-01', after='2014-12-31')
year_2015 = indexed_df.truncate(before='2015-01-01', after='2015-12-31')
year_2016 = indexed_df.truncate(before='2016-01-01', after='2016-12-31')
box_of_years = [year_2004, year_2005, year_2006, year_2007,
year_2008, year_2009, year_2010, year_2011,
year_2012, year_2013, year_2014, year_2015,
year_2016]
I've written various functions to tokenize, clean up and convert the tokens into a FreqDist object and wrapped those up into a single function called year_prep(). This works fine when I do
year_2006 = year_prep(year_2006)
...but is there a way I can iterate across every year variable, apply the function and have it transform the same variable, short of just repeating the above for every year?
I know repeating myself would be the simplest way, but not necessarily the cleanest. Perhaps I have this backwards and should do the slicing later on, but at that point I feel the layers of lists will get out of hand: I'd be going from a list of years to a list of years containing a list of articles, each containing a list of every word in the article.
I think you can use groupby by year with a custom function:
import pandas as pd
start = pd.to_datetime('2004-02-24')
rng = pd.date_range(start, periods=30, freq='50D')
df = pd.DataFrame({'Date': rng, 'a':range(30)})
#print (df)
def f(x):
    print (x)
    #return year_prep(x)
    #some custom output
    return x.a + x.Date.dt.month
print (df.groupby(df['Date'].dt.year).apply(f))
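Applied to the question's data, a sketch (assuming indexed_df has a DatetimeIndex and year_prep accepts a DataFrame) that replaces the year_2004 ... year_2016 variables with a dict keyed by year:
freq_by_year = {year: year_prep(group)
                for year, group in indexed_df.groupby(indexed_df.index.year)}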