I'm trying to download weekly Sentinel-2 data for one year, i.e. one Sentinel-2 dataset for each week of the year. I can create a list of datasets using this code:
from sentinelsat import SentinelAPI
api = SentinelAPI(user, password, 'https://scihub.copernicus.eu/dhus')
products = api.query(footprint,
                     date=('20211001', '20221031'),
                     platformname='Sentinel-2',
                     processinglevel='Level-2A',
                     cloudcoverpercentage=(0, 10))
products_gdf = api.to_geodataframe(products)
products_gdf_sorted = products_gdf.sort_values(['beginposition'], ascending=[False])
products_gdf_sorted
This creates a list of all the datasets available within the year, and since data is captured roughly once every five days, you could argue I can just work off this list. But instead I would like to have just one option for each week (Mon - Sun). I thought I could create a dataframe with a start date and an end date for each week and loop that through the api.query code, but I'm not sure how I would do this.
I have created a dataframe using:
import pandas as pd
dates_df = pd.DataFrame({'StartDate': pd.date_range(start='20211001', end='20221030', freq='W-MON'),
                         'EndDate': pd.date_range(start='20211004', end='20221031', freq='W-SUN')})
print (dates_df)
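A sketch of the kind of loop I have in mind (untested; it assumes one api.query call per row of dates_df and keeps the least cloudy scene per week, using the cloudcoverpercentage column from the query metadata):

# Untested sketch: one query per week, keeping the least cloudy product
weekly = []
for start, end in zip(dates_df['StartDate'], dates_df['EndDate']):
    products = api.query(footprint,
                         date=(start.strftime('%Y%m%d'), end.strftime('%Y%m%d')),
                         platformname='Sentinel-2',
                         processinglevel='Level-2A',
                         cloudcoverpercentage=(0, 10))
    if products:  # some weeks may have no scene under 10% cloud cover
        gdf = api.to_geodataframe(products)
        weekly.append(gdf.sort_values('cloudcoverpercentage').head(1))

weekly_gdf = pd.concat(weekly)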
Any tips or advice is greatly appreciated. Thanks!
Related
I'm currently trying to build a time series forecast using an LSTM, and I'm still at the early stage where I want to make sure my data is clean.
For the background:
I'm trying to make LSTM models for temperature, rainfall, and humidity for 3 stations, so if I'm correct there will be 9 models, 3 for each station. As of now I'm doing an experiment using 1 year's worth of data.
The problem:
I named my files based on the index of the month: Jan as 1, Feb as 2, Mar as 3, and so on.
Using the os library I managed to loop through the folder, reading each file and cleaning it: dropping columns, filling missing values, etc.
But when I append them, the order of the months is not correct; it starts from month 11, then goes to 8, etc. What am I doing wrong?
Also, how do I print a full dataframe? Currently I succeed in printing the full dataframe using this method.
Here is the code:
import os
import pandas as pd

Dir_data = '/content/DATA'
excel_clean = pd.DataFrame()
train_data = []

for i in os.listdir(Dir_data):
    excel_test = pd.read_excel(os.path.join(Dir_data, i))
    # drop columns
    excel_test.drop(columns=['ff_avg', 'ddd_x', 'ddd_car', 'ff_avg', 'ff_x', 'ss', 'Tn', 'Tx'], inplace=True)
    # start cleaning
    excel_test = excel_test.replace(8888, 'x').replace(9999, 'x').replace('', 'x')
    excel_test['RR'] = pd.to_numeric(excel_test['RR'], errors='coerce').astype('float64')
    excel_test['RH_avg'] = pd.to_numeric(excel_test['RH_avg'], errors='coerce').astype('int64')
    excel_test['Tavg'] = pd.to_numeric(excel_test['Tavg'], errors='coerce').astype('float64')
    #excel_test.dtypes
    # filling missing values with the column mean
    excel_test['RR'] = excel_test['RR'].fillna(excel_test['RR'].mean())
    excel_test['RH_avg'] = excel_test['RH_avg'].fillna(excel_test['RH_avg'].mean())
    excel_test['Tavg'] = excel_test['Tavg'].fillna(excel_test['Tavg'].mean())
    excel_test['RR'] = excel_test['RR'].round(decimals=1)
    excel_test['Tavg'] = excel_test['Tavg'].round(decimals=1)
    excel_clean = excel_clean.append(excel_test)

# print the full dataframe
pd.set_option('max_rows', 99999)
pd.set_option('max_colwidth', 400)
pd.describe_option('max_colwidth')

excel_clean.reset_index(drop=True, inplace=True)
excel_clean
It's only for 1 station, as this is an experiment.
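One thing that might explain the ordering is that os.listdir() makes no guarantee about order, and even a plain sort would put '11' before '8' because the names are strings. A rough sketch of sorting the files numerically first (assuming the file names are just the month number, e.g. 1.xlsx, 2.xlsx, ...):

import os
import pandas as pd

# Sketch: sort the directory listing numerically by the month number in the
# file name, since os.listdir() returns names in arbitrary order.
files = sorted(os.listdir(Dir_data), key=lambda name: int(os.path.splitext(name)[0]))
for i in files:
    excel_test = pd.read_excel(os.path.join(Dir_data, i))
    # ... same cleaning steps as above ...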
I am using Alpha Vantage extended intraday data for a project, and it gives you 2 years' worth of data. The way it works is that you pull month by month through the slice parameter; this is done to make it faster and more efficient. For the purposes of my study I need to pull the whole two years and save them in a dataframe, and I was wondering if there is a way to do that with a for loop, because the issue with the slice is that I am only allowed 1 month per request. It goes like this: the trailing 2 years of intraday data is evenly divided into 24 "slices" - year1month1, year1month2, year1month3, ..., year1month11, year1month12, year2month1, year2month2, year2month3, ..., year2month11, year2month12. Each slice is a 30-day window, with year1month1 being the most recent and year2month12 being the farthest from today. By default, slice=year1month1.
# replace the "demo" apikey below with your own key from https://www.alphavantage.co/support/#api-key
import csv
import requests
import pandas as pd

api_key = 'demo'
symbol = 'tsla'
CSV_URL = f'https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED&symbol={symbol}&interval=1min&slice=year1month2&apikey={api_key}'

with requests.Session() as s:
    download = s.get(CSV_URL)
    decoded_content = download.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    df = pd.DataFrame(my_list)
    print(df)
I only have basic knowledge of Python, so I'm not even sure if this is possible?
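For what it's worth, the loop I have in mind would build the 24 slice names and append each month's CSV. A rough, untested sketch (it assumes api_key is set as above, and pauses between requests in case the API rate-limits them):

import time
import pandas as pd

symbol = 'tsla'
frames = []
# year1month1 ... year1month12, year2month1 ... year2month12
slices = [f'year{y}month{m}' for y in (1, 2) for m in range(1, 13)]

for s in slices:
    url = ('https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY_EXTENDED'
           f'&symbol={symbol}&interval=1min&slice={s}&apikey={api_key}')
    frames.append(pd.read_csv(url))  # each slice is returned as CSV
    time.sleep(15)                   # pause in case the API rate-limits requests

full_df = pd.concat(frames, ignore_index=True)
print(full_df)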
I have a csv that looks like this (screenshot): https://i.stack.imgur.com/8clYM.png
(This is dummy data, the real one is about 30K rows.)
I need to find the most recent job title for each employee (unique id) and then calculate how long (= how many days) the employee has been on the same job title.
What I have done so far:
import csv
import datetime
from datetime import *
data = open("C:\\Users\\User\\PycharmProjects\\pythonProject\\jts.csv",encoding="utf-8")
csv_data = csv.reader(data)
data_lines = list(csv_data)
print(data_lines)
for i in data_lines:
    for j in i[0]:
But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(
I also know that at one point I will need:
datetime.strptime(data_lines[1][2] , '%Y/%M/%d').date()
Could somebody help, please? I just need a new list saying something like:
id jt days
500 plumber 370
Edit to clarify: The dates are the data points taken. I need to calculate back from the most recent of those until the job title was something else. So in my example, for employee 5000, from 04/07/2021 back to 01/03/2020.
Let's consider sample data as follows:
id,jtitle,date
5000,plumber,01/01/2020
5000,senior plumber,02/03/2020
6000,software engineer,01/02/2020
6000,software architecture,06/02/2021
7000,software tester,06/02/2019
The following code works.
import pandas as pd
import datetime
# load data
data = pd.read_csv('data.csv')
# convert to datetime object
data.date = pd.to_datetime(data.date, dayfirst=True)
print(data)
# group employees by ID
latest = data.sort_values('date', ascending=False).groupby('id').nth(0)
print(latest)
# find the latest point in time where there is a change in job title
prev_date = data.sort_values('date', ascending=False).groupby('id').nth(1).date
print(prev_date)
# calculate the difference in days
latest['days'] = latest.date - prev_date
print(latest)
Output:
jtitle date days
id
5000 senior plumber 2020-03-02 61 days
6000 software architecture 2021-02-06 371 days
7000 software tester 2019-02-06 NaT
But then I haven't got anywhere because I can't even conceptualise how to structure this. :-(
Have a map (dict) of employee to (date, title).
For every row, check if you already have an entry for the employee. If you don't just put the information in the map, otherwise compare the date of the row and that of the entry. If the row has a more recent date, replace the entry.
Once you've gone through all the rows, you can just go through the map you've collected and compute the difference between the date you ended up with and "today".
Incidentally, your pattern is not correct: the sample data uses either a %d/%m/%Y (day/month/year) or %m/%d/%Y (month/day/year) format (the sample is not sufficient to say which), but it certainly is not year/month/day, and %M means minutes rather than month.
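A rough sketch of that approach (assuming the id,jtitle,date column order shown above, day/month/year dates, and the jts.csv file from the question):

import csv
from datetime import datetime, date

latest = {}  # employee id -> (most recent date, job title)
with open('jts.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for emp_id, jtitle, date_str in reader:
        d = datetime.strptime(date_str, '%d/%m/%Y').date()
        # keep only the entry with the most recent date per employee
        if emp_id not in latest or d > latest[emp_id][0]:
            latest[emp_id] = (d, jtitle)

today = date.today()
for emp_id, (d, jtitle) in latest.items():
    print(emp_id, jtitle, (today - d).days)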
Seems like I'm too late... Nevertheless, in case you're interested, here's a suggestion in pure Python (nothing wrong with Pandas, though!):
import csv
import datetime as dt
from operator import itemgetter
from itertools import groupby
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # Discard header row
    # Read, transform (date), and sort in reverse (id first, then date):
    data = sorted(((i, jtitle, dt.datetime.strptime(date, '%d/%m/%Y'))
                   for i, jtitle, date in reader),
                  key=itemgetter(0, 2), reverse=True)

# Process data grouped by id
result = []
for i, group in groupby(data, key=itemgetter(0)):
    _, jtitle, end = next(group)  # Fetch the most recent job title and its date
    # Search for first occurrence of a different job title:
    start = end
    for _, jt, start in group:
        if jt != jtitle:
            break
    # Collect results in a list, with datetimes transformed back
    result.append((i, jtitle, end.strftime('%d/%m/%Y'), (end - start).days))
result = sorted(result, key=itemgetter(0))
The result for the input data
id,jtitle,date
5000,plumber,01/01/2020
5000,plumber,01/02/2020
5000,senior plumber,01/03/2020
5000,head plumber,01/05/2020
5000,head plumber,02/09/2020
5000,head plumber,05/01/2021
5000,head plumber,04/07/2021
6000,electrician,01/02/2018
6000,qualified electrician,01/06/2020
7000,plumber,01/01/2004
7000,plumber,09/11/2020
7000,senior plumber,05/06/2021
is
[('5000', 'head plumber', '04/07/2021', 490),
('6000', 'qualified electrician', '01/06/2020', 851),
('7000', 'senior plumber', '05/06/2021', 208)]
I have two dataframes, df_inv and df_sales.
I need to add a column to df_inv with the sales person's name based on the doctor he is tagged to in df_sales. This would be a simple merge, I guess, if the sales-person-to-doctor relationship in df_sales were unique. But doctors change ownership among sales persons, and a row is added with each transfer with an updated date.
So if the invoice date is earlier than the updated date, the previous tagging should be used, and if there is no previous tagging it should show NaN. In other words, for each invoice_date in df_inv, the previous maximum updated_date in df_sales should be used for tagging.
The resulting table should look like this: [Final Table]
I am relatively new to programming, but I can usually find my way through problems. This one I cannot figure out. Any help is appreciated.
import pandas as pd
import numpy as np

df_inv = pd.read_excel(r'C:\Users\joy\Desktop\sales indexing\consolidated report.xlsx')
df_sales1 = pd.read_excel(r'C:\Users\joy\Desktop\sales indexing\Sales Person tagging.xlsx')
df_sales2 = df_sales1.sort_values('Updated Date', ascending=False)
df_sales = df_sales2.reset_index(drop=True)

sales_tag = []
sales_dup = []
counter = 0

for inv_dt, doc in zip(df_inv['Invoice_date'], df_inv['Doctor_Name']):
    # walk the sales records (newest first) and take the first tagging for
    # this doctor whose update date is on or before the invoice date
    for sal, ref, update in zip(df_sales['Sales Person'], df_sales['RefDoctor'], df_sales['Updated Date']):
        if ref == doc:
            if update <= inv_dt and sal not in sales_dup:
                sales_tag.append(sal)
                sales_dup.append(ref)
                break
    sales_dup = []
    counter = counter + 1
    if len(sales_tag) < counter:
        # no earlier tagging found for this invoice
        sales_tag.append('none')

df_inv['sales_person'] = sales_tag
This appears to work.
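For reference, the same "previous maximum updated date" lookup can usually be done without the nested loops using pandas.merge_asof. A sketch assuming the column names used above and that both date columns are already datetimes:

import pandas as pd

# Both frames must be sorted by the date key for merge_asof.
inv = df_inv.sort_values('Invoice_date')
sales = df_sales1.sort_values('Updated Date')

# For each invoice, take the row whose 'Updated Date' is the most recent one
# at or before the invoice date, matching doctors; unmatched rows get NaN.
tagged = pd.merge_asof(inv,
                       sales[['RefDoctor', 'Sales Person', 'Updated Date']],
                       left_on='Invoice_date', right_on='Updated Date',
                       left_by='Doctor_Name', right_by='RefDoctor',
                       direction='backward')
tagged = tagged.rename(columns={'Sales Person': 'sales_person'})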
I am following this blog to identify seasonal customers in my time series data:
https://www.kristenkehrer.com/seasonality-code
My code is shamelessly nearly identical to the blog's, with some small tweaks; the code is below. I am able to run the code in its entirety for 2000 customers, but a few hours later, 0 customers were flagged as seasonal in my results.
Manually looking at customers data over time, I do believe I have many examples of seasonal customers that should have been picked up. Below is a sample of the data I am using.
Am I missing something stupid? Am I in way over my head to even try this, being very new to Python?
Note that I am adding the "0 months" in my data source, but I don't think it would hurt anything for that function to check again. I'm also not including the data source credentials step.
Thank you
import pandas as pa
import numpy as np
import pyodbc as py

cnxn = py.connect('DRIVER='+driver+';SERVER='+server+';PORT=1433;DATABASE='+database+';UID='+username+';PWD='+ password)
original = pa.read_sql_query('SELECT s.customer_id, s.yr, s.mnth, Case when s.usage<0 then 0 else s.usage end as usage FROM dbo.Seasonal s Join ( Select Top 2000 customer_id, SUM(usage) as usage From dbo.Seasonal where Yr!=2018 Group by customer_id ) t ON s.customer_id = t.customer_id Where yr!= 2018 Order by customer_id, yr, mnth', cnxn)

grouped = original.groupby(by='customer_id')

def yearmonth_to_justmonth(year, month):
    return year * 12 + month - 1

def fillInForOwner(group):
    min = group.head(1).iloc[0]
    max = group.tail(1).iloc[0]
    minMonths = yearmonth_to_justmonth(min.yr, min.mnth)
    maxMonths = yearmonth_to_justmonth(max.yr, max.mnth)
    filled_index = pa.Index(np.arange(minMonths, maxMonths, 1), name="filled_months")
    group['months'] = group.yr * 12 + group.mnth - 1
    group = group.set_index('months')
    group = group.reindex(filled_index)
    group.customer_id = min.customer_id
    group.yr = group.index // 12
    group.mnth = group.index % 12 + 1
    group.usage = np.where(group.usage.isnull(), 0, group.usage).astype(int)
    return group

filledIn = grouped.apply(fillInForOwner)
newIndex = pa.Index(np.arange(filledIn.customer_id.count()))

import rpy2 as r
from rpy2.robjects.packages import importr
from rpy2.robjects import r, pandas2ri, globalenv

pandas2ri.activate()
base = importr('base')
colorspace = importr('colorspace')
forecast = importr('forecast')
times = importr('timeSeries')
stats = importr('stats')

outfile = 'results.csv'
df_list = []

for customerid, dataForCustomer in filledIn.groupby(by=['customer_id']):
    startYear = dataForCustomer.head(1).iloc[0].yr
    startMonth = dataForCustomer.head(1).iloc[0].mnth
    endYear = dataForCustomer.tail(1).iloc[0].yr
    endMonth = dataForCustomer.tail(1).iloc[0].mnth
    customerTS = stats.ts(dataForCustomer.usage.astype(int),
                          start=base.c(startYear, startMonth),
                          end=base.c(endYear, endMonth),
                          frequency=12)
    r.assign('customerTS', customerTS)
    try:
        seasonal = r('''
            fit <- tbats(customerTS, seasonal.periods = 12,
                         use.parallel = TRUE)
            fit$seasonal
            ''')
    except:
        seasonal = 1
    df_list.append({'customer_id': customerid, 'seasonal': seasonal})
    print(f' {customerid} | {seasonal} ')

seasonal_output = pa.DataFrame(df_list)
print(seasonal_output)
seasonal_output.to_csv(outfile)
Kristen here (that's my code). A 1 actually means that the customer is not seasonal (or that seasonality couldn't be picked up), and NULL also means not seasonal. If they have a seasonal usage pattern (with a period of 12 months, which is what the code is looking for), it'll output [12].
You can always confirm by inspecting a graph of a single customer's behavior and then putting it through the algorithm. I also like to cross-check with a seasonal decomposition algorithm in either Python or R.
Here is some R code for looking at the decomposition of your time series. If there is no seasonal window in the plot, your results are not seasonal:
library(forecast)
myts<-ts(mydata$SENDS, start=c(2013,1),end=c(2018,2),frequency = 12)
plot(decompose(myts))
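If you'd rather do that cross-check in Python, here's a rough equivalent sketch with statsmodels (assuming usage is a single customer's monthly series with a DatetimeIndex):

import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

# usage: one customer's monthly usage as a pandas Series with a DatetimeIndex;
# period=12 looks for a yearly pattern in monthly data
result = seasonal_decompose(usage, model='additive', period=12)
result.plot()
plt.show()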
Also, you mentioned having problems with some of the 0's not filling in (from your Twitter conversation). I haven't had this problem, but my customers have varying lengths of tenure, from 2 years to 13 years, so I'm not sure what the issue is there.
Let me know if I can help with anything else :)
Circling back to answer: the way I got this to work was by just passing the "original" dataframe into the for loop. My data already had the empty $0 months, so I didn't need that part of the code to run. Thank you all for your help.
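In other words, the only change was which dataframe feeds the loop, roughly:

# Same loop body as before, just grouping 'original' instead of 'filledIn',
# since the data already contains the zero-usage months.
for customerid, dataForCustomer in original.groupby(by=['customer_id']):
    # ... unchanged body from the code above ...
    pass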