As you can tell I'm new to Python, and I'm struggling with the correct way/syntax to iterate over date ranges in my Google Analytics API call. The master function already iterates over specific Website View IDs (I have it loop through the 'client_id'), but I'm wondering what the correct way to have it iterate over date ranges would be, and where I need to put that in my code.
Right now I am running it using the GA date dimension to get daily results, but I also need to run it for MTD, 7-day rolling, and 30-day rolling windows. Here is a sample of the code I am using.
def APmain(client_id):
    base_df = pd.DataFrame()
    columns = traffic_columns
    start_date = datetime.strptime(master_startdate, '%Y-%m-%d').date()
    end_date = datetime.strptime(master_enddate, '%Y-%m-%d').date()
    filters = 'ga:screenResolution!=0x0'
    print('Running Google Analytics Report.')
    # change based on the type and number of dimensions pulled. These are set
    dimensions = 'ga:' + (',ga:'.join(columns[:3]))
    metrics = 'ga:' + (',ga:'.join(columns[3:]))
    results = analytics_get_report(client_id, str(start_date), str(end_date), dimensions, metrics, filters)
    df = pd.DataFrame(results, columns=columns)
    df['view_id'] = client_id
    df['filter_type'] = 'New'
    df['dimension'] = 'SourceMedium'
    df['DateType'] = 'Daily'
    base_df = base_df.append(df)
    base_df = base_df.applymap(str)
    base_df = base_df[base_df.date != '(other)']  # remove rows where the date column is '(other)'
    base_df['date'] = pd.to_datetime(base_df['date'], format='%Y%m%d')  # format date
    # PowerShell output
    print(base_df.head())
    # BigQuery output
    base_df.to_gbq(
        '',  # dataset name + table name
        '',  # project name
        chunksize=10000,
        reauth=False,
        if_exists='append',
        credentials=credentials
    )
I have variables named 'master_startdate' and 'master_enddate' where I define the date range it currently pulls data for. I've messed around with something like this but can't get it to work correctly; I feel like I'm on the right track and/or thinking about it correctly.
import calendar
from datetime import datetime, timedelta, date

master_start_date = '2019-01-01'
start_date = datetime.strptime(master_start_date, '%Y-%m-%d').date()

wholeyear = range(0, 365)
for x in wholeyear:
    end_date = start_date + timedelta(days=1) * x
    print(end_date)

end_date2 = datetime.strptime(master_enddate, '%Y-%m-%d').date()
start_date2 = date.today()
start_date2 = date(end_date2.year, end_date2.month, 1)
print(start_date2)
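For illustration, here is roughly the direction I have in mind: compute a (start, end) pair for each report window on a given anchor day, then feed each pair into the APmain-style call. This is only a sketch; the window labels mirror the DateType values above, and the exact definitions of the windows (MTD from the 1st of the month, rolling windows inclusive of the anchor day) are my assumption:

from datetime import date, timedelta

def date_windows(anchor):
    # Labels mirror the DateType values set in APmain; the window
    # definitions are assumptions, not confirmed requirements.
    return [
        ('Daily', anchor, anchor),
        ('MTD', anchor.replace(day=1), anchor),
        ('7DayRolling', anchor - timedelta(days=6), anchor),
        ('30DayRolling', anchor - timedelta(days=29), anchor),
    ]

# Walk every day in the master range; in the real script each
# (start, end) pair would go to analytics_get_report.
day = date(2019, 1, 1)
master_end = date(2019, 12, 31)
while day <= master_end:
    for label, start, end in date_windows(day):
        print(label, start, end)
    day += timedelta(days=1)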
Any help for a beginner here would be greatly appreciated!
I'm having a hard time trying to make this code show more than one page of orders.
I have already tried different methods, such as loops, and also the one below (which is just a workaround) where I tried to fetch page 2.
I just need it to bring me all the orders created on a specific day, but I got completely stuck.
import requests
import pandas as pd
from datetime import datetime, timedelta
# Set the API token for the Shopify API
api_token = 'MYTOKEN'
# Get the current date and subtract one day
today = datetime.now()
yesterday = today - timedelta(days=1)
# Format the date strings for the API request
start_date = yesterday.strftime('%Y-%m-%dT00:00:00Z')
end_date = yesterday.strftime('%Y-%m-%dT23:59:59Z')
# Set the initial limit to 1
limit = 1
page_info = 2
# Set the initial URL for the API endpoint you want to access, including the limit and date range parameters
url = f'https://MYSTORE.myshopify.com/admin/api/2020-04/orders.json?page_info={page_info}&limit={limit}&created_at_min={start_date}&created_at_max={end_date}&'
# Set the API token as a header for the request
headers = {'X-Shopify-Access-Token': api_token}
# Make the GET request
response = requests.get(url, headers=headers)
# Check the status code of the response
if response.status_code == 200:
    # Parse the JSON response directly
    orders = response.json()['orders']
    # Flatten the JSON response into a Pandas DataFrame, including the 'name' column (order number) and renaming the 'id' column to 'order_id'
    df = pd.json_normalize(orders, sep='_', record_path='line_items', meta=['name', 'id'], meta_prefix='meta_')
    # Flatten the line_items data into a separate DataFrame
    line_items_df = pd.json_normalize(orders, 'line_items', ['id'], meta_prefix='line_item_')
    # Flatten the 'orders' data into a separate DataFrame | Added in Dec.26-2022
    orders_df = pd.json_normalize(orders, sep='_', record_path='line_items', meta=['created_at', 'id'], meta_prefix='ordersDTbs_')
    # Merge the 'df' and 'orders_df' DataFrames | Added in Dec.26-2022
    df = pd.merge(df, orders_df[['id', 'ordersDTbs_created_at']], on='id')
    # Convert created_at to a DATE only | Added in Dec.26-2022
    df['ordersDTbs_created_at'] = df['ordersDTbs_created_at'].apply(lambda x: datetime.strptime(x, '%Y-%m-%dT%H:%M:%S%z').date())
    # Merge in the line-item columns
    df = pd.merge(df, line_items_df[['id', 'sku', 'quantity']], on='id')
    # Calculate the amount paid after discount and add it as a new column in the dataframe
    df['price_set_shop_money_amount'] = pd.to_numeric(df['price_set_shop_money_amount'])
    df['total_discount_set_shop_money_amount'] = pd.to_numeric(df['total_discount_set_shop_money_amount'])
    df = df.assign(paid_afterdiscount=df['price_set_shop_money_amount'] - df['total_discount_set_shop_money_amount'])
    # Print the DataFrame
    print(df[['meta_name', 'ordersDTbs_created_at', 'sku_y', 'title', 'fulfillable_quantity', 'quantity_x', 'quantity_y', 'paid_afterdiscount']])
# Check whether the API call ran smoothly
else:
    print('Something went wrong.')
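For what it's worth, on recent Shopify REST API versions (2019-07 onward, including the 2020-04 version used above) the orders endpoint is cursor-paginated: page_info is an opaque cursor that Shopify returns in the Link response header, not a page number you can set to 2 yourself. A minimal sketch of following that cursor, reusing the token/store placeholders from above:

import requests
from datetime import datetime, timedelta

api_token = 'MYTOKEN'  # placeholder
headers = {'X-Shopify-Access-Token': api_token}
yesterday = datetime.now() - timedelta(days=1)
start_date = yesterday.strftime('%Y-%m-%dT00:00:00Z')
end_date = yesterday.strftime('%Y-%m-%dT23:59:59Z')

# The first request carries the filters; each next-page URL comes back
# fully formed in the Link response header.
url = ('https://MYSTORE.myshopify.com/admin/api/2020-04/orders.json'
       f'?limit=250&created_at_min={start_date}&created_at_max={end_date}')

all_orders = []
while url:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    all_orders.extend(response.json()['orders'])
    # Link header looks like: <https://...page_info=abc>; rel="next"
    url = None
    for part in response.headers.get('Link', '').split(','):
        if 'rel="next"' in part:
            url = part.split(';')[0].strip().strip('<>')

print(len(all_orders), 'orders fetched')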
How do I specify a start date and end date? For example, I want to extract the daily close price for AAPL from 01-01-2020 to 30-06-2021.
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
api_key = ''
ts = TimeSeries(key=api_key, output_format='pandas')
data = ts.get_daily('AAPL', outputsize='full')
print(data)
Based on their documentation, there doesn't seem to be an option to filter by date directly in the API call.
Once you retrieve the data you can filter it within the dataframe. (With the pandas output format, get_daily returns a (dataframe, metadata) tuple, which is why the snippets below index data[0].)
from alpha_vantage.timeseries import TimeSeries
import pandas as pd
api_key = ''
ts = TimeSeries(key = api_key, output_format = 'pandas')
data = ts.get_daily('AAPL', outputsize = 'full')
data[0][(data[0].index >= '2020-01-01') & (data[0].index <= '2021-06-30')]
You can also use .loc to slice by date strings, since the date is the index (the pandas deprecation warnings around string-based date indexing apply to plain [] lookups, not to .loc slicing, which remains supported).
data[0].loc['2020-01-01':'2021-06-30']
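Since the question asks specifically for the daily close, the relevant column can be selected from the same slice. A small sketch reusing api_key from above; the '4. close' name follows Alpha Vantage's numbered column scheme in the pandas output, and sort_index() guards against the index arriving newest-first:

from alpha_vantage.timeseries import TimeSeries

ts = TimeSeries(key=api_key, output_format='pandas')
df, meta = ts.get_daily('AAPL', outputsize='full')
closes = df.sort_index().loc['2020-01-01':'2021-06-30', '4. close']
print(closes.head())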
I am working on a personal project collecting data on Covid-19 cases. The data set only shows the cumulative total of Covid-19 cases per state. I would like to add a column that contains the new cases added each day. This is what I have so far:
import pandas as pd
from datetime import date
from datetime import timedelta
import numpy as np
#read the CSV from github
hist_US_State = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
#some code to get yesterday's date and the day before which is needed later.
today = date.today()
yesterday = today - timedelta(days = 1)
yesterday = str(yesterday)
day_before_yesterday = today - timedelta(days = 2)
day_before_yesterday = str(day_before_yesterday)
#Extract yesterday's and the day before's cases and combine them into one dataframe
yesterday_cases = hist_US_State[hist_US_State["date"] == yesterday]
day_before_yesterday_cases = hist_US_State[hist_US_State["date"] == day_before_yesterday]
total_cases = pd.DataFrame()
total_cases = day_before_yesterday_cases.append(yesterday_cases)
#Adding a new column called "new_cases" and this is where I get into trouble.
total_cases["new_cases"] = yesterday_cases["cases"] - day_before_yesterday_cases["cases"]
Can you please point out what I am doing wrong?
Because you defined total_cases as a concatenation (via append) of yesterday_cases and day_before_yesterday_cases, its number of rows is the sum of the other two dataframes'. It looks like yesterday_cases and day_before_yesterday_cases both have 55 rows, so total_cases has 110 rows, and your last line tries to assign 55 values to a column of 110 rows. On top of that, the subtraction itself aligns on the row index, and the two slices have disjoint row labels, so it produces NaN for every row.
You may either want to reshape your data so that each date is its own column, or align the two days on state before subtracting.
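As another route (not what the answer above describes, just an alternative): since the file is already long-format with one row per state per date, pandas can compute day-over-day deltas for the whole history at once with groupby/diff. A sketch:

import pandas as pd

hist = pd.read_csv(
    "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"
)
# One row per state per date; sort so diff() compares consecutive days
hist = hist.sort_values(["state", "date"])
# New cases = today's cumulative total minus yesterday's, per state
hist["new_cases"] = hist.groupby("state")["cases"].diff()
print(hist.tail())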
I have a csv file with data roughly every minute over 2 years, and I want to calculate 24-hour averages. Ideally the code would iterate over the data, calculate averages, standard deviations, and the R^2 between dataA and dataB for every 24-hour period, and then output this new data into a new csv file (with a datestamp and the calculated values for each 24-hour period).
The data has an unusual timestamp which I think might be tripping me up slightly. I've been trying different for loops to iterate over the data, but I'm not sure how to specify that I want the averages, etc. for each 24-hour period.
This is the code I have so far, but I'm not sure how to complete the for loop to achieve what I want. If anyone can help that would be great!
import math
import pandas as pd
import os
import numpy as np
from datetime import timedelta, date
# read the file in csv
data = pd.read_csv("Jacaranda_data_HST.csv")
# Extract the data columns from the csv, parsing the timestamps as dates
data_date = pd.to_datetime(data.iloc[:, 1])
dataA = data.iloc[:, 2]
dataB = data.iloc[:, 3]
# set the start and end dates of the data
start_date = data_date.iloc[0]
end_date = data_date.iloc[-1]
# for loop to run over every 24 hours of data
day_count = (end_date - start_date).days + 1
for single_date in [d for d in (start_date + timedelta(n) for n in range(day_count)) if d <= end_date]:
    print(np.mean(dataA), np.mean(dataB), np.std(dataA), np.std(dataB))
# output new csv file - **unsure how to call the data**
csvfile = "Jacaranda_new.csv"
outdf = pd.DataFrame()
# outdf['dataA_mean'] = ??
# outdf['dataB_mean'] = ??
# outdf['dataA_stdev'] = ??
# outdf['dataB_stdev'] = ??
outdf.to_csv(csvfile, index=False)
A simplified approach could be to group by calendar day in a dict. I don't have much experience with pandas time handling in DataFrames, so this could serve as an alternative.
You could create a dict where the keys are the dates of the data (without the time part), so you can later calculate the mean of all the data points that fall under each key.
data_date = data.iloc[:, 1]
data_a = data.iloc[:, 2]
data_b = data.iloc[:, 3]
import collections
dd_a = collections.defaultdict(list)
dd_b = collections.defaultdict(list)
for date_str, data_point_a, data_point_b in zip(data_date, data_a, data_b):
    # we split the string on the first space, so we keep only the date part
    date_part, _ = date_str.split(' ', maxsplit=1)
    dd_a[date_part].append(data_point_a)
    dd_b[date_part].append(data_point_b)
Now you can calculate the averages:
for date, v_list in dd_a.items():
    if len(v_list) > 0:
        print(date, 'mean:', sum(v_list) / len(v_list))
for date, v_list in dd_b.items():
    if len(v_list) > 0:
        print(date, 'mean:', sum(v_list) / len(v_list))
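If pandas ends up being an option after all, resample can produce all of the requested per-day statistics, including an R^2 between the two columns, in a few lines. A sketch; the column positions and a parseable timestamp in column 1 are assumptions about the CSV:

import pandas as pd

# Assumes column 1 is the timestamp and columns 2/3 are dataA/dataB,
# matching the iloc positions used above.
data = pd.read_csv("Jacaranda_data_HST.csv", parse_dates=[1])
data = data.set_index(data.columns[1])
a_col, b_col = data.columns[1], data.columns[2]  # dataA, dataB after indexing

daily = data.resample("D").agg({a_col: ["mean", "std"], b_col: ["mean", "std"]})
daily.columns = ["dataA_mean", "dataA_stdev", "dataB_mean", "dataB_stdev"]

# R^2 per day = squared Pearson correlation between the two columns
daily["r_squared"] = data.groupby(pd.Grouper(freq="D")).apply(
    lambda g: g[a_col].corr(g[b_col]) ** 2
)

daily.to_csv("Jacaranda_new.csv")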
import datetime
import numpy as np
import pandas as pd

all_data = {}
for ticker in ['TWTR', 'SNAP', 'FB']:
    all_data[ticker] = np.array(pd.read_csv(
        'https://www.google.com/finance/getprices?i=60&p=10d&f=d,o,h,l,c,v&df=cpct&q={}'.format(ticker),
        skiprows=7, header=None))
date = []
for i in np.arange(0, len(all_data['SNAP'])):
    if all_data['SNAP'][i][0][0] == 'a':
        t = datetime.datetime.fromtimestamp(int(all_data['SNAP'][i][0].replace('a', '')))
        date.append(t)
    else:
        date.append(t + datetime.timedelta(minutes=int(all_data['SNAP'][i][0])))
Hi, what this code does is create a dictionary (all_data) and then pull intraday data for Twitter, Snapchat, and Facebook into the dictionary from the URL. The dates are in epoch time format, so there is a second for loop that converts them.
I was only able to do this for one of the tickers (SNAP), and I was wondering if anyone knew how to iterate over all the data to do the same.
With pandas, you normally convert a timestamp to datetime using:
df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit="s")
Note:
Your script seems to contain other errors, which are outside the scope of the question.
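To run the same date decoding for every ticker instead of just SNAP, the epoch/offset logic can be wrapped in a loop over the dictionary keys. A sketch, assuming all_data from the question and that each ticker's first column uses the same 'a'-prefixed-epoch-then-minute-offset format:

import datetime

dates = {}
for ticker, rows in all_data.items():
    ticker_dates = []
    t = None
    for row in rows:
        stamp = row[0]
        if str(stamp)[0] == 'a':
            # an 'a'-prefixed value is an absolute epoch timestamp
            t = datetime.datetime.fromtimestamp(int(str(stamp).replace('a', '')))
            ticker_dates.append(t)
        else:
            # a bare value is a minute offset from the last absolute timestamp
            ticker_dates.append(t + datetime.timedelta(minutes=int(stamp)))
    dates[ticker] = ticker_dates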