I'm having an issue that I can't seem to understand. I've written a function that takes a dataframe as the input and then performs a number of cleaning steps on it. When I run the function I get the error message KeyError: ('amount', 'occurred at index date'). This doesn't make sense to me because amount is a column in my dataframe .
Here is some code with a subset of the data created:
data = pd.DataFrame.from_dict({"date": ["10/31/2019","10/27/2019"], "amount": [-13.3, -6421.25], "vendor": ["publix","verizon"]})
#create cleaning function for dataframe
def cleaning_func(x):
#convert the amounts to positive numbers
x['amount'] = x['amount'] * -1
#convert dates to datetime for subsetting purposes
x['date'] = pd.to_datetime(x['date'])
#begin removing certain strings
x['vendor'] = x['vendor'].str.replace("PURCHASE AUTHORIZED ON ","")
x['vendor'] = x['vendor'].str.replace("[0-9]","")
x['vendor'] = x['vendor'].str.replace("PURCHASE WITH CASH BACK $ . AUTHORIZED ON /","")
#build table of punctuation and remove from vendor strings
table = str.maketrans(dict.fromkeys(string.punctuation)) # OR {key: None for key in string.punctuation}
x['vendor'] = x['vendor'].str.translate(table)
return x
clean_data = data.apply(cleaning_func)
If someone could shed some light on why this error appears I would appreciate it.
Don't use apply here, it's slow and basically loops over your dataframe. Just pass the function your data and let it return a cleaned up dataframe, this way it will use the vectorized methods over the whole column.
def cleaning_func(df):
#convert the amounts to positive numbers
df['amount'] = df['amount'] * -1
#convert dates to datetime for subsetting purposes
df['date'] = pd.to_datetime(df['date'])
#begin removing certain strings
df['vendor'] = df['vendor'].str.replace("PURCHASE AUTHORIZED ON ","")
df['vendor'] = df['vendor'].str.replace("[0-9]","")
df['vendor'] = df['vendor'].str.replace("PURCHASE WITH CASH BACK $ . AUTHORIZED ON /","")
#build table of punctuation and remove from vendor strings
table = str.maketrans(dict.fromkeys(string.punctuation)) # OR {key: None for key in string.punctuation}
df['vendor'] = df['vendor'].str.translate(table)
return df
clean_df = cleaning_func(data)
Related
This script needs to query the DC server for events. Since this is done live, each time the server is queried, it returns query results of varying lengths. The log file is long and messy, as most logs are. I need to filter only the event names and their codes and then create a DataFrame. Additionally, I need to add a third column that counts the number of times each event took place. I've done most of it but can't figure out how to fix the error I'm getting.
After doing all the filtering from Elasticsearch, I get two lists - action and code - which I have emulated here.
action_list = ['logged-out', 'logged-out', 'logged-out', 'Directory Service Access', 'Directory Service Access', 'Directory Service Access', 'logged-out', 'logged-out', 'Directory Service Access', 'created-process', 'created-process']
code_list = ['4634', '4634', '4634', '4662', '4662', '4662', '4634', '4634', '4662','4688']
I then created a list that contains only the codes that need to be filtered out.
event_code_list = ['4662', '4688']
My script is as follows:
import pandas as pd
from collections import Counter
#Create a dict that combines action and code
lists2dict = {}
lists2dict = dict(zip(action_list,code_list))
# print(lists2dict)
#Filter only wanted eventss
filtered_events = {k: v for k, v in lists2dict.items() if v in event_code_list}
# print(filtered_events)
index = 1 * pd.RangeIndex(start=1, stop=2) #add automatic index to DataFrame
df = pd.DataFrame(filtered_events,index=index)#Create DataFrame from filtered events
#Create Auto Index
count = Counter(df)
action_count = dict(Counter(count))
action_count_values = action_count.values()
# print(action_count_values)
#Convert Columns to Rows and Add Index
new_df = df.melt(var_name="Event",value_name="Code")
new_df['Count'] = action_count_values
print(new_df)
Up until this point, everything works as it should. The problem is what comes next. If there are no events, the script outputs an empty DataFrame. This works fine. However, if there are events, then we should see the events, the codes, and the number of times each event occurred. The problem is that it always outputs 1. How can I fix this? I'm sure it's something ridiculous that I'm missing.
#If no alerts, create empty DataFrame
if new_df.empty:
empty_df = pd.DataFrame(columns=['Event','Code','Count'])
empty_df['Event'] = ['-']
empty_df['Code'] = ['-']
empty_df['Count'] = ['-']
empty_df.to_html()
html = empty_df.to_html()
with open('alerts.html', 'w') as f:
f.write(html)
else: #else, output alerts + codes + count
new_df.to_html()
html = new_df.to_html()
with open('alerts.html', 'w') as f:
f.write(html)
Any help is appreciated.
It is because you are collecting the result as dictionary - the repeated records are ignored. You lost the record count here: lists2dict = dict(zip(action_list,code_list)).
You can do all these operations very easily on dataframe. Just construct a pandas dataframe from given lists, then filter by code, groupby, and aggregate as count:
df = pd.DataFrame({"Event": action_list, "Code": code_list})
df = df[df.Code.isin(event_code_list)] \
.groupby(["Event", "Code"]) \
.agg(Count = ("Code", len)) \
.reset_index()
print(df)
Output:
Event Code Count
0 Directory Service Access 4662 4
1 created-process 4688 2
I am having trouble in filtering databased on a multiple conditions.
[dataframe image][1]
[1]: https://i.stack.imgur.com/TN9Nd.png
When I filter it based on multiple condition, I am getting empty DataFrame.
user_ID_existing = input("Enter User ID:")
print("Available categories are:\n Vehicle\tGadgets")
user_Category_existing = str(input("Choose from the above category:"))
info = pd.read_excel("Test.xlsx")
data = pd.DataFrame(info)
df = data[((data.ID == user_ID_existing) & (data.Category == user_Category_existing))]
print(df)
if I replace the variables user_ID_existing and user_Category_existing with values, I am getting the rows. I even tried with numpy and only getting empty dataframe
filtered_values = np.where((data['ID'] == user_ID_existing) & (data['Category'].str.contains(user_Category_existing)))
print(filtered_values)
print(data.loc[filtered_values])
input always returs a string but since the column ID read by pandas has a number dtype, when you filter it by a string, you're then getting an empty dataframe.
You need to use int to convert the value/ID (entered by the user) to a number.
Try this :
user_ID_existing = int(input("Enter User ID:"))
print("Available categories are:\n Vehicle\tGadgets")
user_Category_existing = input("Choose from the above category:")
data = pd.read_excel("Test.xlsx")
df = data[(data["ID"].eq(user_ID_existing))
& (data["Category"].eq(user_Category_existing))].copy()
print(df)
Noob here. PLEASE FORGIVE ABYSMAL FORMATTING as I am still learning. I am trying to create a time series (a dataframe, I think?) that consists of three columns. One is a date column, the next is an inventory column, and the last is a price column.
I have pulled two separate series (date & inventory; date & price) and I want to meld the two series so that I can see three columns instead of two sets of two. This is my code.
import json
import numpy as np
import pandas as pd
from urllib.error import URLError, HTTPError
from urllib.request import urlopen
class EIAgov(object):
def __init__(self, token, series):
'''
Purpose:
Initialise the EIAgov class by requesting:
- EIA token
- id code(s) of the series to be downloaded
Parameters:
- token: string
- series: string or list of strings
'''
self.token = token
self.series = series
def __repr__(self):
return str(self.series)
def Raw(self, ser):
# Construct url
url = 'http://api.eia.gov/series/?api_key=' + self.token + '&series_id=' + ser.upper()
try:
# URL request, URL opener, read content
response = urlopen(url);
raw_byte = response.read()
raw_string = str(raw_byte, 'utf-8-sig')
jso = json.loads(raw_string)
return jso
except HTTPError as e:
print('HTTP error type.')
print('Error code: ', e.code)
except URLError as e:
print('URL type error.')
print('Reason: ', e.reason)
def GetData(self):
# Deal with the date series
date_ = self.Raw(self.series[0])
date_series = date_['series'][0]['data']
endi = len(date_series) # or len(date_['series'][0]['data'])
date = []
for i in range (endi):
date.append(date_series[i][0])
# Create dataframe
df = pd.DataFrame(data=date)
df.columns = ['Date']
# Deal with data
lenj = len(self.series)
for j in range (lenj):
data_ = self.Raw(self.series[j])
data_series = data_['series'][0]['data']
data = []
endk = len(date_series)
for k in range (endk):
data.append(data_series[k][1])
df[self.series[j]] = data
return df
if __name__ == '__main__':
tok = 'mytoken'
# Natural Gas - Weekly Storage
#
ngstor = ['NG.NW2_EPG0_SWO_R48_BCF.W'] # w/ several series at a time ['ELEC.REV.AL-ALL.M', 'ELEC.REV.AK-ALL.M', 'ELEC.REV.CA-ALL.M']
stordata = EIAgov(tok, ngstor)
print(stordata.GetData())
# Natural Gas - Weekly Prices
#
ngpx = ['NG.RNGC1.W'] # w/ several series at a time ['ELEC.REV.AL-ALL.M', 'ELEC.REV.AK-ALL.M', 'ELEC.REV.CA-ALL.M']
pxdata = EIAgov(tok, ngpx)
print(pxdata.GetData())
Note that 'mytoken' needs to be replaced by an eia.gov API key. I can get this to successfully create an output of two lists...but then to get the lists merged I tried to add this at the end:
joined_frame = pd.concat([ngstor, ngpx], axis = 1, sort=False)
print(joined_frame.GetData())
But I get an error
("TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid")
because apparently I don't know the difference between a list and a series.
How do I merge these lists by date column? Thanks very much for any help. (Also feel free to advise why I am terrible at formatting code correctly in this post.)
If you want to manipulate them as DataFrames in the rest of your code, you can transform ngstor and ngpx into DataFrames as follows:
import pandas as pd
# I create two lists that look like yours
ngstor = [[1,2], ["2020-04-03", "2020-05-07"]]
ngpx = [[3,4] , ["2020-04-03", "2020-05-07"]]
# I transform them to DataFrames
ngstor = pd.DataFrame({"value1": ngstor[0],
"date_col": ngstor[1]})
ngpx = pd.DataFrame({"value2": ngpx[0],
"date_col": ngpx[1]})
Then you can either use pandas.merge or pandas.concat :
# merge option
joined_framed = pd.merge(ngstor, ngpx, on="date_col",
how="outer")
# concat option
ngstor = ngstor.set_index("date_col")
ngpx = ngpx.set_index("date_col")
joined_framed = pd.concat([ngstor, ngpx], axis=1,
join="outer").reset_index()
The result will be:
date_col value1 value2
0 2020-04-03 1 3
1 2020-05-07 2 4
I am trying to compare two different values in a dataframe. The questions/answers I've found I wasn't able to utilize.
import pandas as pd
# from datetime import timedelta
"""
read csv file
clean date column
convert date str to datetime
sort for equity options
replace date str column with datetime column
"""
trade_reader = pd.read_csv('TastyTrades.csv')
trade_reader['Date'] = trade_reader['Date'].replace({'T': ' ', '-0500': ''}, regex=True)
date_converter = pd.to_datetime(trade_reader['Date'], format="%Y-%m-%d %H:%M:%S")
options_frame = trade_reader.loc[(trade_reader['Instrument Type'] == 'Equity Option')]
clean_frame = options_frame.replace(to_replace=['Date'], value='date_converter')
# Separate opening transaction from closing transactions, combine frames
opens = clean_frame[clean_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_OPEN'])]
closes = clean_frame[clean_frame['Action'].isin(['BUY_TO_CLOSE', 'SELL_TO_CLOSE'])]
open_close_set = set(opens['Symbol']) & set(closes['Symbol'])
open_close_frame = clean_frame[clean_frame['Symbol'].isin(open_close_set)]
'''
convert Value to float
sort for trade readability
write
'''
ocf_float = open_close_frame['Value'].astype(float)
ocf_sorted = open_close_frame.sort_values(by=['Date', 'Call or Put'], ascending=True)
# for readability, revert back to ocf_sorted below
ocf_list = ocf_sorted.drop(
['Type', 'Instrument Type', 'Description', 'Quantity', 'Average Price', 'Commissions', 'Fees', 'Multiplier'], axis=1
)
ocf_list.reset_index(drop=True, inplace=True)
ocf_list['Strategy'] = ''
# ocf_list.to_csv('Sorted.csv')
# create strategy list
debit_single = []
debit_vertical = []
debit_calendar = []
credit_vertical = []
iron_condor = []
# shift columns
ocf_list['Symbol Shift'] = ocf_list['Underlying Symbol'].shift(1)
ocf_list['Symbol Check'] = ocf_list['Underlying Symbol'] == ocf_list['Symbol Shift']
# compare symbols, append depending on criteria met
for row in ocf_list:
if row['Symbol Shift'] is row['Underlying Symbol']:
debit_vertical.append(row)
print(type(ocf_list['Underlying Symbol']))
ocf_list.to_csv('Sorted.csv')
print(debit_vertical)
# delta = timedelta(seconds=10)
The error I get is:
line 51, in <module>
if row['Symbol Check'][-1] is row['Underlying Symbol'][-1]:
TypeError: string indices must be integers
I am trying to compare the newly created shifted column to the original, and if they are the same, append to a list. Is there a way to compare two string values at all in python? I've tried checking if Symbol Check is true and it still returns an error about str indices must be int. .iterrows() didn't work
Here, you will actually iterate through the columns of your DataFrame, not the rows:
for row in ocf_list:
if row['Symbol Shift'] is row['Underlying Symbol']:
debit_vertical.append(row)
You can use one of the methods iterrows or itertuples to iterate through the rows, but they return rows as lists and tuples respectively, which means you can't index them using the column names, as you did here.
Second, you should use == instead of is since you are probably comparing values, not identities.
Lastly, I would skip iterating over the rows entirely, as pandas is made for selecting rows based on a condition. You should be able to replace the aforementioned code with this:
debit_vertical = ocf_list[ocf_list['Symbol Shift'] == ocf_list['Underlying Symbol']].values.tolist()
I am fairly new to python and coding in general.
I have a big data file that provides daily data for the period 2011-2018 for a number of stock tickers (300~).
The data is a .csv file with circa 150k rows and looks as follows (short example):
Date,Symbol,ShortExemptVolume,ShortVolume,TotalVolume
20110103,AAWW,0.0,28369,78113.0
20110103,AMD,0.0,3183556,8095093.0
20110103,AMRS,0.0,14196,18811.0
20110103,ARAY,0.0,31685,77976.0
20110103,ARCC,0.0,177208,423768.0
20110103,ASCMA,0.0,3930,26527.0
20110103,ATI,0.0,193772,301287.0
20110103,ATSG,0.0,23659,72965.0
20110103,AVID,0.0,7211,18896.0
20110103,BMRN,0.0,21740,213974.0
20110103,CAMP,0.0,2000,11401.0
20110103,CIEN,0.0,625165,1309490.0
20110103,COWN,0.0,3195,24293.0
20110103,CSV,0.0,6133,25394.0
I have a function that allows me to filter for a specific symbol and get 10 observations before and after a specified date (could be any date between 2011 and 2018).
import pandas as pd
from datetime import datetime
import urllib
import datetime
def get_data(issue_date, stock_ticker):
df = pd.read_csv (r'D:\Project\Data\Short_Interest\exampledata.csv')
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
d = df
df = pd.DataFrame(d)
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
return [short_data]
I want to create a script that iterates a list of 'issue_dates' and 'stock_tickers'. The list (a .csv) looks as following:
ARAY,07/08/2017
ARAY,24/04/2014
ACETQ,16/11/2015
ACETQ,16/11/2015
NVLNA,15/08/2014
ATSG,29/09/2017
ATI,24/05/2016
MDRX,18/06/2013
MDRX,18/06/2013
AMAGX,10/05/2017
AMAGX,14/02/2014
AMD,14/09/2016
To break down my problem and question I would like to know how to do the following:
First, how do I load the inputs?
Second, how do I call the function on each input?
And last, how do I accumulate all the function returns in one dataframe?
To load the inputs and call the function for each row; iterate over the csv file and pass each row's values to the function and accumulate the resulting Seriesin a list.
I modified your function a bit: removed the DataFrame creation so it is only done once and added a try/except block to account for missing dates or tickers (your example data didn't match up too well). The dates in the second csv look like they are day/month/year so I converted them for that format.
import pandas as pd
import datetime, csv
def get_data(df, issue_date, stock_ticker):
'''Return a Series for the ticker centered on the issue date.
'''
short = df.loc[df.Symbol.eq(stock_ticker)]
# get the index of the row of interest
try:
ix = short[short.Date.eq(issue_date)].index[0]
# get the item row for that row's index
iloc_ix = short.index.get_loc(ix)
# get the +/-1 iloc rows (+2 because that is how slices work), basically +1 and -1 trading days
short_data = short.iloc[iloc_ix-10: iloc_ix+11]
except IndexError:
msg = f'no data for {stock_ticker} on {issue_date}'
#log.info(msg)
print(msg)
short_data = None
return short_data
df = pd.read_csv (datafile)
df['Date'] = pd.to_datetime(df['Date'], format="%Y%m%d")
results = []
with open('issues.csv') as issues:
for ticker,date in csv.reader(issues):
day,month,year = map(int,date.split('/'))
# dt = datetime.datetime.strptime(date, r'%d/%m/%Y')
date = datetime.date(year,month,day)
s = get_data(df,date,ticker)
results.append(s)
# print(s)
Creating a single DataFrame or table for all that info may be problematic especially since the date ranges are all different. Probably should ask a separate question regarding that. Its mcve should probably just include a few minimal Pandas Series with a couple of different date ranges and tickers.