I am querying an API that lets you request n items in a single API call. So I am breaking the list of items I am querying into "sublists" of n items, passing each one to a function that returns the API data, and then concatenating the data into a DataFrame.
But when I loop through the "sublists", the final DataFrame only contains the last "sublist", rather than every "sublist". So instead of:
       netIncome  sharesOutstanding
BRK.B         20                 40
V             50                 60
MSFT          30                 10
ORCL          12                 24
AMZN          33                 55
GOOGL         66                 88
I get:
       netIncome  sharesOutstanding
AMZN          33                 55
GOOGL         66                 88
Here is the full code; can someone tell me what I'm doing wrong?
import os
from iexfinance.stocks import Stock
import pandas as pd

# Set IEX Finance API token (public sandbox version)
os.environ['IEX_API_VERSION'] = 'iexcloud-sandbox'
os.environ['IEX_TOKEN'] = 'XXXXXX'

def fetch_company_info(group):
    """Function to query API data"""
    batch = Stock(group, output_format='pandas')
    # Get income from last 4 quarters, sum it, and store to temp DataFrame
    df_income = batch.get_income_statement(period="quarter", last='4')
    df_income = df_income.T.sum(level=0)
    income_ttm = df_income.loc[:, ['netIncome']]
    # Get number of shares, and store to temp DataFrame
    df_shares = batch.get_key_stats(period="quarter")
    shares_outstanding = df_shares.loc['sharesOutstanding']
    return income_ttm, shares_outstanding

# Full list to query via API
tickers = ['BRK.B', 'V', 'MSFT', 'ORCL', 'AMZN', 'GOOGL']

# Chunk ticker list into sublists of n tickers
n = 2
batch_tickers = [tickers[i * n:(i + 1) * n] for i in range((len(tickers) + n - 1) // n)]

# Loop through each chunk of tickers
for group in batch_tickers:
    company_info = fetch_company_info(group)
    output_df = pd.concat(company_info, axis=1, sort=True)

print(output_df)
You need to do another pd.concat. The first one concatenates the income_ttm and shares_outstanding columns, but you then need a second pd.concat in the row direction to add each new chunk's rows to output_df.
First create output_df from the first sublist, then concat each subsequent sublist onto it. Note that the second concat needs axis=0 rather than axis=1, because you want to concatenate in the row direction, not the column direction.
Try something like this at the end of your code:
# Loop through each chunk of tickers
for i in range(len(batch_tickers)):
    group = batch_tickers[i]
    company_info = fetch_company_info(group)
    # concat income and shares outstanding column-wise
    company_df = pd.concat(company_info, axis=1, sort=True)
    # instantiate output_df from the first chunk
    if i == 0:
        output_df = company_df
    # for later chunks, concat new rows onto output_df
    else:
        output_df = pd.concat([output_df, company_df], axis=0)
Try a list comprehension first and concatenate afterward, using the modified fetch_company_info below, which returns a single DataFrame per group:
company_info = [fetch_company_info(group) for group in batch_tickers]
output_df = pd.concat(company_info, axis=0, sort=False)
def fetch_company_info(group):
    """Function to query API data"""
    batch = Stock(group, output_format='pandas')
    # Get income from last 4 quarters, sum it, and store to temp DataFrame
    df_income = batch.get_income_statement(period="quarter", last='4')
    df_income = df_income.T.sum(level=0)
    income_ttm = df_income.loc[:, ['netIncome']]
    # Get number of shares, and store to temp DataFrame
    df_shares = batch.get_key_stats(period="quarter")
    shares_outstanding = df_shares.loc['sharesOutstanding']
    df = pd.concat([income_ttm, shares_outstanding], ignore_index=True, axis=1)
    return df
.......
from functools import reduce

# Loop through each chunk of tickers, collecting one DataFrame per chunk
dataframes = []
for group in batch_tickers:
    company_info = fetch_company_info(group)
    dataframes.append(company_info)

df = reduce(lambda top, bottom: pd.concat([top, bottom], sort=False), dataframes)
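As a side note, the functools.reduce is not strictly necessary here, since pd.concat accepts the whole list of frames at once:

df = pd.concat(dataframes, sort=False)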
Related
I have a data frame of roughly 8 million rows consisting of sales for 615 products across 16 stores each day for five years.
I need to make new columns that contain the sales shifted back from 1 to 7 days. I've decided to sort the data frame by date, product and location. Then I concatenate product and location as its own column.
Using that column, I loop through each unique product/location concatenation and make the shifted sales columns. This code is below:
import pandas as pd

# sort values by date, product, location
df = df.sort_values(['date', 'product', 'location'])
df['sort_values'] = df['product'] + "_" + df['location']

df1 = pd.DataFrame()
z = 0
for i in list(df['sort_values'].unique()):
    df_ = df[df['sort_values'] == i]
    df_ = df_.sort_values('ORD_DATE')
    df_['eaches_1'] = df_['eaches'].shift(-1)
    df_['eaches_2'] = df_['eaches'].shift(-2)
    df_['eaches_3'] = df_['eaches'].shift(-3)
    df_['eaches_4'] = df_['eaches'].shift(-4)
    df_['eaches_5'] = df_['eaches'].shift(-5)
    df_['eaches_6'] = df_['eaches'].shift(-6)
    df_['eaches_7'] = df_['eaches'].shift(-7)
    df1 = pd.concat((df1, df_))
    z += 1
    if z % 100 == 0:
        print(z)
The above code gets me exactly what I want, but takes FOREVER to complete. Is there a faster way to accomplish what I want?
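One pattern that usually helps here is to let groupby do the per-group shifting in one vectorized call per offset, instead of looping over groups in Python. A minimal sketch, assuming 'eaches' is the sales column and 'ORD_DATE' the date column as in the code above:

import pandas as pd

# sort once so rows are ordered within each product/location group
df = df.sort_values(['product', 'location', 'ORD_DATE'])
grouped = df.groupby(['product', 'location'])['eaches']
# one vectorized shift per offset, applied group-wise
for k in range(1, 8):
    df['eaches_{}'.format(k)] = grouped.shift(-k)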
How can I create a single row that gives the data type, maximum column length and count for each column of a data frame, as shown in the desired output section at the bottom?
import pandas as pd

table = 'sample_data'
idx = 0

# Create a dictionary of Series
d = {'Name': pd.Series(['Tom', 'Ricky', 'Vin', 'Steve', 'Smith', 'Jack',
                        'Lee', 'David', 'Gasper', 'Betina', 'Andres']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 'NULL', 40, 30, 51, 46]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65]),
     'new_column': pd.Series([])}

# Create a DataFrame from the data above
sdf = pd.DataFrame(d)

# Create a summary description
desired_data = sdf.describe(include='all').T
desired_data = desired_data.rename(columns={'index': 'Variable'})

# Get data type
dtype = sdf.dtypes

# Get total count of records (need to work on)
counts = sdf.shape[0]  # gives number of rows

# Get maximum length of values
maxcollen = []
for col in range(len(sdf.columns)):
    maxcollen.append(max(sdf.iloc[:, col].astype(str).apply(len)))

# Construct the final data frame
desired_data = desired_data.assign(data_type=dtype.values)
desired_data = desired_data.assign(total_count=counts)
desired_data = desired_data.assign(max_col_length=maxcollen)
final_df = desired_data
final_df = final_df.reindex(columns=['data_type', 'max_col_length', 'total_count'])
final_df.insert(loc=idx, column='table_name', value=table)
final_df.to_csv('desired_data.csv')
# print(final_df)
Output of above code:
The desired output I am looking for is :
In : sdf
Out:
  table_name  Name_data_type  Name_total_count  Name_max_col_length  Age_data_type  Age_total_count  Age_max_col_length  Rating_data_type  Rating_total_count  Rating_max_col_length
 sample_data          object                12                    6         object               12                   4           float64                  12                      4
As you may have noticed, I want to print a single row where I create column_name_data_type, column_name_total_count and column_name_max_col_length columns and get the respective values for each.
Here's a solution:
df = final_df
df = df.drop("new_column").drop("table_name", axis=1)
df = df.reset_index()
df.melt(id_vars=["index"]).set_index(["index", "variable"]).sort_index().transpose()
The result is:
index            Age                                       Name                    \
variable   data_type  max_col_length  total_count     data_type  max_col_length ...
value         object               4           12        object               6 ...
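To get the exact single-row layout with underscore-joined names, you could flatten the resulting MultiIndex columns; a small sketch building on the code above:

flat = df.melt(id_vars=["index"]).set_index(["index", "variable"]).sort_index().transpose()
# join the two column levels, e.g. ('Age', 'data_type') -> 'Age_data_type'
flat.columns = ["{}_{}".format(a, b) for a, b in flat.columns]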
Can you try this?
The code below iterates over the entire dataframe, so it may take some time. It is not the optimal solution, but it is a working one for the problem above.
from collections import OrderedDict

# store key-value pairs
result_dic = OrderedDict()
unique_table_names = final_df["table_name"].unique()

# remove unwanted rows
final_df.drop("new_column", inplace=True)
cols_name = final_df.columns

# for every unique table name, generate a row
for unique_table_name in unique_table_names:
    result_dic["table_name"] = unique_table_name
    filtered_df = final_df[final_df["table_name"] == unique_table_name]
    for row in filtered_df.iterrows():
        for cols in cols_name:
            if cols != "table_name":
                result_dic[row[0] + "_" + cols] = row[1][cols]
Then convert the dict to a dataframe:
# convert the dict to a dataframe
result_df = pd.DataFrame([result_dic])
result_df
The expected output is:
    table_name  Name_data_type  Name_max_col_length  Name_total_count  Age_data_type  Age_max_col_length  Age_total_count  Rating_data_type  Rating_max_col_length  Rating_total_count
0  sample_data          object                    6                12         object                   4               12           float64                      4                  12
I'm trying to add rows and columns to a pandas DataFrame incrementally. I have a lot of data stored across multiple datastores and a heuristic to determine a value. As I navigate across these datastores, I'd like to be able to incrementally update a dataframe, where in some cases either names or days will be missing.
import random

import pandas as pd

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            df = df.append({col: value, 'name': name}, ignore_index=True)
    df.set_index('name', inplace=True, drop=True)
    print(df.loc['Bill'])
This produces the following results:
      2016_1  2016_2  2016_3
name
Bill    15.0     NaN     NaN
Bill     NaN    12.0     NaN
I've created a heatmap of the data and it's blocky due to duplicate names, so the output I'm looking for is:
      2016_1  2016_2  2016_3
name
Bill    15.0    12.0     NaN
How can I combine these rows?
Is there a more efficient means of creating this dataframe?
Try this:
df.groupby('name')[df.columns.values].sum()
Try this:
df.pivot_table(index='name', aggfunc='sum', dropna=False)
After you run your foo() function, you can use any aggregation function (if you have only one value per column and all the others are null) with a groupby on df.
First, use reset_index to get back your name column. Then use groupby and apply. Here I propose a custom function which checks that there is only one value per column, and raises a ValueError if not.
df.reset_index(inplace=True)

def aggdata(x):
    if all([i <= 1 for i in x.count()]):
        return x.mean()
    else:
        raise ValueError

ddf = df.groupby('name').apply(aggdata)
If all the values of the column are null but one, x.mean() will return that value (actually, you can use almost any aggregator, since there is only one value, that is the one returned).
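A quick illustration of that property, with hypothetical values:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 12.0, np.nan])
print(s.mean())  # 12.0 -- with a single non-null value, mean just returns it
print(s.max())   # 12.0 -- as would almost any other aggregator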
It would be easier to have the names as columns and the date as the index instead. Plus, you can work with lists within the loop and create the pd.DataFrame afterwards.
e.g.
import random

import numpy as np
import pandas as pd

year = 2016
names = ['Bill', 'Bob', 'Ryan']
index = []
valueBill = []
valueBob = []
valueRyan = []
for day in range(1, 4):
    if random.choice([True, False]):  # sometimes a name will be missing
        valueBill.append(random.randrange(0, 20))
        valueBob.append(random.randrange(0, 90))
        valueRyan.append(random.randrange(0, 200))
        index.append('{}-0{}'.format(year, day))  # index label
    else:
        valueBill.append(np.nan)
        valueBob.append(np.nan)
        valueRyan.append(np.nan)
        index.append(np.nan)

df = pd.DataFrame({})
for name, value in zip(names, [valueBill, valueBob, valueRyan]):
    df[name] = value
df = df.set_index(pd.to_datetime(index))
You can append the entries whose names do not already exist, and then do an update to refresh the existing entries.
import pandas as pd
import random

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            new_df = pd.DataFrame({col: value, 'name': name}, index=[1]).set_index('name')
            # append only rows whose names are not yet in df, then update existing ones
            df = pd.concat([df, new_df[~new_df.index.isin(df.index)].dropna()])
            df.update(new_df)
    # df.set_index('name', inplace=True, drop=True)
    print(df)
I am working with a forex dataset, trying to fill my dataframe with open, high, low, close, updated every tick.
Here is my code:
import pandas as pd

# pandas display settings
pd.set_option('display.max_columns', 320)
pd.set_option('display.max_rows', 320)
pd.set_option('display.width', 320)

# creating dataframe
df = pd.read_csv('https://www.dropbox.com/s/tcek3kmleklgxm5/eur_usd_lastweek.csv?dl=1',
                 names=['timestamp', 'ask', 'bid', 'avol', 'bvol'], parse_dates=[0], header=0)
df['spread'] = df.ask - df.bid
df['symbol'] = 'EURUSD'
times = pd.DatetimeIndex(df.timestamp)

# grouping keys for df.groupby()
df['date'] = times.date
df['hour'] = times.hour

# 1h candles updated every tick
df['candle_number'] = '...'
df['1h_open'] = '...'
df['1h_high'] = '...'
df['1h_low'] = '...'
df['1h_close'] = '...'
# print(df)

grouped = df.groupby(['date', 'hour'])
for idx, x in enumerate(grouped):
    print(idx)
    print(x)
So as you can see, the for loop gives me the groups.
Now I want to fill the following columns in my dataframe:
idx should become my df['candle_number']
df['1h_open'] must be equal to the very first df.bid in the group
df['1h_high'] = the highest df.bid up to the current row (so, for instance, if there are 350 rows in the group, for the 20th value we take the highest number over the 0-20 span, and at the 215th value we take the highest value over the 0-215 span, which can be completely different)
df['1h_low'] = the lowest value up to the current row (same approach as above)
I hope it's not too confusing =)
Cheers
It's convenient to reindex on date and hour:
df_new = df.set_index(['date', 'hour'])
Then apply groupby functions aggregating by index:
df_new['candle_number'] = df_new.groupby(level=[0,1]).ngroup()
df_new['1h_open'] = df_new.groupby(level=[0,1])['bid'].first()
df_new['1h_high'] = df_new.groupby(level=[0,1])['bid'].cummax()
df_new['1h_low'] = df_new.groupby(level=[0,1])['bid'].cummin()
Then you can reset_index() to get back to a flat dataframe.
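Since the same grouping is used four times, a slightly tidier sketch (same logic, reusing a single GroupBy object) would be:

g = df_new.groupby(level=[0, 1])
df_new['candle_number'] = g.ngroup()
df_new['1h_open'] = g['bid'].transform('first')  # broadcast the first bid to every tick
df_new['1h_high'] = g['bid'].cummax()            # running high within the hour
df_new['1h_low'] = g['bid'].cummin()             # running low within the hour
df_new = df_new.reset_index()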
I am having trouble reformatting a dataframe.
My input is day-value rows by symbol columns (each symbol has different dates with its values):
Input
Code to generate the input:
data = [("01-01-2010", 15, 10), ("02-01-2010", 16, 11), ("03-01-2010", 16.5, 10.5)]
labels = ["date", "AAPL", "AMZN"]
df_input = pd.DataFrame.from_records(data, columns=labels)
The needed output is (a month row, with a new row for each month):
Needed output
Code to generate the output:
data = [("01-01-2010","29-01-2010", "AAPL", 15, 20), ("01-01-2010","29-01-2010", "AMZN", 10, 15),("02-02-2010","30-02-2010", "AAPL", 20, 32)]
labels = ['bd start month', 'bd end month','stock', 'start_month_value', "end_month_value"]
df = pd.DataFrame.from_records(data, columns=labels)
Meaning (pseudo code):
1. For each row, take only the non-NaN values to create a new "row" (maybe a dictionary with the date as the index and the [stock, value] as the value).
2. Take only rows that are the business start of month or the business end of month.
3. Write those rows to a new dataframe.
I have read several posts like this and this and several more.
All deal with dataframes of the same "type" and just resample, while I need to change the structure...
My code so far:
import collections

# creating the new index with business days
df1 = pd.DataFrame(range(10000), index=pd.date_range(df.iloc[0].name, periods=10000, freq='D'))
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df2 = df1.resample(bmth_us).mean()

# intersecting my old (daily) index with the monthly index
new_index = df.index.intersection(df2.index)

# selecting only the rows I want
df = df.loc[new_index]

# creating a dict that will be my new dataset
new_dict = collections.OrderedDict()

# iterating over the rows and adding to the dictionary
for index, row in df.iterrows():
    date = df.loc[index].name
    # values are the non-null values
    values = df.loc[index][~df.loc[index].isnull().values]
    new_dict[date] = values

# from dict to list
data = []
for key, values in new_dict.items():
    for i in range(0, len(values)):
        date = key
        stock_name = str(values.index[i])
        stock_value = values.iloc[i]
        row = (key, stock_name, stock_value)
        data.append(row)

# from the list to a df
labels = ['date', 'stock', 'value']
df = pd.DataFrame.from_records(data, columns=labels)
df.to_excel("migdal_format.xls")
Current output I get:
One big problem: I only get the value of the stock on the start-of-month day. I need both the start and the end so I can calculate the stock's gain for the month.
One smaller problem: I am sure this is not the cleanest and fastest code :)
Thanks a lot!
So I have found a way:
1. loop through each column
2. group by month
3. take the first and last value I have in that month
4. calculate the return
import re

df_migdal = pd.DataFrame()
for col in df_input.columns[0:]:
    stock_position = df_input.loc[:, col]
    name = stock_position.name
    name = re.sub('[^a-zA-Z]+', '', name)
    name = name[0:-4]
    stock_position = stock_position.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])
    stock_position["name"] = name
    stock_position["return"] = ((stock_position["last"] / stock_position["first"]) - 1) * 100
    stock_position.dropna(inplace=True)
    df_migdal = df_migdal.append(stock_position)

df_migdal = df_migdal.round(decimals=2)
I tried a way cooler way, but did not know how to handle the MultiIndex I got... I needed, for each column, to take the two sub-columns and create a third one from some lambda function:
df_input.groupby([pd.TimeGrouper('M')]).agg(['first', 'last'])
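For what it's worth, a sketch of one way to handle that MultiIndex (assuming df_input has a DatetimeIndex; pd.Grouper is the replacement for the deprecated pd.TimeGrouper):

res = df_input.groupby(pd.Grouper(freq='M')).agg(['first', 'last'])
# columns are now a MultiIndex of (stock, first/last); stack the stock level into rows
res = res.stack(level=0)
res['return'] = (res['last'] / res['first'] - 1) * 100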