I am working with a CSV sheet that contains data from a brewery, e.g. "Data required", "Quantity order", etc.
I want to write a module that reads the CSV file structure and loads the data into a suitable data structure
in Python. I have to interpret the data by calculating the average growth rate and the ratio of sales for
different beers, and use these values to predict sales for a given week or month in the future.
I have no idea where to start. The only lines of code I have so far are:
df = pd.read_csv (r'file location')
print (df)
To illustrate, I have downloaded data on the US employment level (https://fred.stlouisfed.org/series/CE16OV) and population (https://fred.stlouisfed.org/series/POP).
import pandas as pd
employ = pd.read_csv('/home/brb/bugs/data/CE16OV.csv')
employ = employ.rename(columns={'DATE':'date'})
employ = employ.rename(columns={'CE16OV':'employ'})
employ = employ[employ['date']>='1952-01-01']
pop = pd.read_csv('/home/brb/bugs/data/POP.csv')
pop = pop.rename(columns={'DATE':'date'})
pop = pop.rename(columns={'POP':'pop'})
pop = pop[pop['date']<='2019-10-01']
df = pd.merge(employ,pop)
df['employ_monthly'] = df['employ'].pct_change()
df['employ_yoy'] = df['employ'].pct_change(periods=12)
df['employ_pop'] = df['employ']/df['pop']
df.head()
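The same pattern could be applied to the brewery data. The following is only a minimal sketch: the column names `date`, `beer` and `quantity` and the sample figures are made up for illustration, since the real CSV layout is not shown, and the forecast simply compounds the average weekly growth rate, which is a deliberately naive assumption.

```python
import pandas as pd

# Hypothetical sample data; replace with pd.read_csv(...) on the real file.
sales = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-01-01",
                            "2024-01-08", "2024-01-08",
                            "2024-01-15", "2024-01-15"]),
    "beer": ["Pilsner", "Stout"] * 3,
    "quantity": [100.0, 50.0, 110.0, 55.0, 121.0, 60.5],
})

# Total quantity per week, then the average week-over-week growth rate.
weekly = sales.groupby("date")["quantity"].sum()
avg_growth = weekly.pct_change().mean()

# Share of total sales for each beer.
share = sales.groupby("beer")["quantity"].sum() / sales["quantity"].sum()

def predict_total(weeks_ahead):
    """Naive forecast: compound the average growth from the last observed week."""
    return weekly.iloc[-1] * (1 + avg_growth) ** weeks_ahead

def predict_by_beer(weeks_ahead):
    """Split the forecast total using the historical sales shares."""
    return share * predict_total(weeks_ahead)

print(avg_growth)
print(share)
print(predict_by_beer(2))
```

For monthly figures, the same steps would work after aggregating by month instead of by week.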
I'm trying to download weekly Sentinel 2 data for one year. So, one Sentinel dataset within each week of the year. I can create a list of datasets using the code:
from sentinelsat import SentinelAPI
api = SentinelAPI(user, password, 'https://scihub.copernicus.eu/dhus')
products = api.query(footprint,
                     date = ('20211001', '20221031'),
                     platformname = 'Sentinel-2',
                     processinglevel = 'Level-2A',
                     cloudcoverpercentage = (0,10)
                     )
products_gdf = api.to_geodataframe(products)
products_gdf_sorted = products_gdf.sort_values(['beginposition'], ascending=[False])
products_gdf_sorted
This creates a list of all datasets available within the year, and as the data capture is around one in every five days you could argue I can work off this list. But instead I would like to have just one option each week (Mon - Sun). I thought I could create a dataframe with a startdate and an enddate for each week and loop that through the api.query code. But not sure how I would do this.
I have created a dataframe using:
import pandas as pd
dates_df = pd.DataFrame({'StartDate':pd.date_range(start='20211001', end='20221030', freq = 'W-MON'),'EndDate':pd.date_range(start='20211004', end='20221031', freq = 'W-SUN')})
print (dates_df)
Any tips or advice is greatly appreciated. Thanks!
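One possible approach avoids looping over api.query altogether: run the query once for the whole year, then keep a single row per Monday-to-Sunday week from the resulting table. The sketch below uses a small made-up DataFrame in place of products_gdf, and it assumes the table has `beginposition` and `cloudcoverpercentage` columns; picking the least cloudy scene in each week is also just one possible selection rule.

```python
import pandas as pd

def one_product_per_week(products, time_col="beginposition",
                         cloud_col="cloudcoverpercentage"):
    """Keep the least cloudy product in each Monday-to-Sunday week."""
    df = products.copy()
    df[time_col] = pd.to_datetime(df[time_col])
    # A 'W-SUN' period ends on Sunday, so it starts on the Monday before.
    df["week_start"] = df[time_col].dt.to_period("W-SUN").dt.start_time
    df = df.sort_values(["week_start", cloud_col, time_col])
    return df.drop_duplicates("week_start", keep="first").reset_index(drop=True)

# Hypothetical stand-in for products_gdf.
demo = pd.DataFrame({
    "beginposition": ["2021-10-04", "2021-10-06", "2021-10-10", "2021-10-11"],
    "cloudcoverpercentage": [5.0, 2.0, 8.0, 1.0],
})
weekly = one_product_per_week(demo)
print(weekly)
```

The index of the selected rows in the real products_gdf could then be used to decide which products to download.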
I'm currently trying to do time series forecasting with an LSTM, and I'm still at the early stage where I want to make sure my data is clean.
Background:
I'm trying to build LSTM models for temperature, rainfall, and humidity for 3 stations, so if I'm correct there will be 9 models, 3 for each station. For now I'm experimenting with 1 year's worth of data.
The problem:
I named my files by the index of the month: Jan as 1, Feb as 2, Mar as 3, and so on.
Using the os library, I managed to loop through the folder and clean each file (dropping columns, filling missing values, etc.).
But when I append the files, the order of the months is not correct: it starts from month 11, then goes to 8, etc. What am I doing wrong?
Also, how do I print a full dataframe? Currently I succeed in printing the full dataframe using the method below.
Here is the code:
Dir_data = '/content/DATA'
excel_clean = pd.DataFrame()
train_data=[]
for i in os.listdir(Dir_data):
    excel_test = pd.read_excel(i)
    #drop column
    excel_test.drop(columns=['ff_avg', 'ddd_x', 'ddd_car', 'ff_avg', 'ff_x','ss','Tn','Tx'],inplace = True)
    #Start Cleaning
    excel_test = excel_test.replace(8888,'x').replace(9999,'x').replace('','x')
    excel_test['RR'] = pd.to_numeric(excel_test['RR'], errors='coerce').astype('float64')
    excel_test['RH_avg'] = pd.to_numeric(excel_test['RH_avg'], errors='coerce').astype('int64')
    excel_test['Tavg'] = pd.to_numeric(excel_test['Tavg'], errors='coerce').astype('float64')
    #excel_test.dtypes
    #Filling Missing Values
    excel_test['RR'] = excel_test['RR'].fillna(excel_test['RR'].mean())
    excel_test['RH_avg'] = excel_test['RH_avg'].fillna(excel_test['RH_avg'].mean())
    excel_test['Tavg'] = excel_test['Tavg'].fillna(excel_test['Tavg'].mean())
    excel_test['RR'] = excel_test['RR'].round(decimals=1)
    excel_test['Tavg'] = excel_test['Tavg'].round(decimals=1)
    excel_clean = excel_clean.append(excel_test)
pd.set_option('max_rows', 99999)
pd.set_option('max_colwidth', 400)
pd.describe_option('max_colwidth')
excel_clean.reset_index(drop=True,inplace=True)
excel_clean
This is only for 1 station, as this is an experiment.
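A likely cause of the month order is that os.listdir returns names in arbitrary order, and even a plain sort would be alphabetical, placing '10' and '11' before '2'. A minimal sketch of one fix is to sort the names by their numeric value before reading them. It assumes the files are named like 1.xlsx, 2.xlsx and so on (the extension is a guess); the helper names `month_sorted` and `load_all_months` are made up for illustration. It also joins the folder path to each file name and collects the frames with pd.concat, since DataFrame.append was removed in pandas 2.0.

```python
import os
import pandas as pd

def month_sorted(filenames):
    """Sort names such as '1.xlsx' and '11.xlsx' by their numeric stem."""
    return sorted(filenames, key=lambda name: int(os.path.splitext(name)[0]))

def load_all_months(data_dir):
    """Read every monthly file in numeric order and stack them into one frame."""
    frames = []
    for name in month_sorted(os.listdir(data_dir)):
        # read_excel needs the full path, not just the bare file name.
        frames.append(pd.read_excel(os.path.join(data_dir, name)))
    return pd.concat(frames, ignore_index=True)

# Demonstration of the ordering on a made-up listing.
ordered = month_sorted(["11.xlsx", "8.xlsx", "1.xlsx", "10.xlsx", "2.xlsx"])
print(ordered)
```

The per-file cleaning steps would go inside the loop in load_all_months. To show every row, pd.set_option('display.max_rows', None) or print(excel_clean.to_string()) should work.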
I am trying to complete a script to store all the trail reports my company gets from various clearing houses. As part of this script I rip the data from multiple Excel sheets (over 20 a month) and amalgamate it in a series of pandas dataframes (organized in a timeline). Unfortunately, when I try to output a new spreadsheet with the amalgamated summaries, I get a 'number stored as text' error from Excel.
FinalFile = Workbook()
FinalFile.create_sheet(title='Summary') ### This will hold a summary table eventually
for i in Timeline:
    index = Timeline.index(i)
    sheet = FinalFile.create_sheet(title=i)
    sheet[i].number_format = 'Currency'
    df = pd.DataFrame(Output[index])
    df.columns = df.iloc[0]
    df = df.iloc[1:].reset_index(drop=True)
    df.head()
    df = df.set_index('Payment Type')
    for r in dataframe_to_rows(df, index=True,header=True):
        sheet.append(r)
    for cell in sheet['A'] + sheet[1]:
        cell.style='Pandas'
SavePath = SaveFolder+'/'+CurrentDate+'.xlsx'
FinalFile.save(SavePath)
Using number_format = 'Currency' to format as currency did not resolve this, nor did my attempt to use the write-only method from the openpyxl documentation page:
https://openpyxl.readthedocs.io/en/stable/pandas.html
Fundamentally, this code outputs the right index, headers, and sheet names, and the formatting is fine; the only issue is the numbers stored as text in B3:D7.
Attached is an example month Output
example dataframe of the same month
0                       Total Paid    Net   GST
Payment Type
Adjustments                  -2800  -2546  -254
Agency Upfront               23500  21363  2135
Agency Trail                 46980  42708  4270
Referring Office Trail       16003  14548  1454
NilTrailPayment                  0      0     0
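One plausible cause, assuming the values in Output arrive as strings: promoting the first row to the header leaves every column with object dtype, so openpyxl writes the cells as text regardless of number_format. A minimal sketch of converting the columns to numbers before the rows are appended is shown below. The `raw` list is a made-up stand-in for one entry of Output.

```python
import pandas as pd

# Hypothetical stand-in for Output[index]: header row followed by string values.
raw = [
    ["Payment Type", "Total Paid", "Net", "GST"],
    ["Adjustments", "-2800", "-2546", "-254"],
    ["Agency Upfront", "23500", "21363", "2135"],
]

df = pd.DataFrame(raw)
df.columns = df.iloc[0]
df = df.iloc[1:].reset_index(drop=True).set_index("Payment Type")

# Convert every remaining column to a numeric dtype; anything that cannot
# be parsed becomes NaN instead of staying as text.
numeric_df = df.apply(pd.to_numeric, errors="coerce")
print(numeric_df.dtypes)
```

Passing numeric_df to dataframe_to_rows instead of df should then give Excel real numbers to format.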
I have an excel file with stock symbols and many other columns. I have a simplified version of the excel file below:
   Symbol  Industry
0  AAPL    Technology Manufacturing
1  MSFT    Technology Manufacturing
2  TSLA    Electric Car Manufacturing
Essentially, I am trying to get the Industry based on the Symbol.
For example, if I use 'AAPL' I want to get 'Technology Manufacturing'. Here is my code so far.
import pandas as pd
excel_file1 = 'file.xlsx'
df = pd.read_excel(excel_file1)
stock = 'AAPL'
row_index = df[df['Symbol'] == stock].index.item()
industry = df['Industry'][row_index]
print(industry)
When trying to get row_index, I get an error: "ValueError: can only convert an array of size 1 to a Python scalar"
Can someone solve this? Also, supposing row_index works, is the code below correct?
industry = df['Industry'][row_index]
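Use:
stock = 'AAPL'
industry = df[df['Symbol'] == stock]['Industry'].iloc[0]
Or, if you want to search using the index, use df.loc:
stock = 'AAPL'
industry = df.loc[df[df['Symbol'] == stock].index, 'Industry'].iloc[0]
The first one is much simpler. Note that .iloc[0] takes the first match by position; a plain [0] looks up the index label 0, so it only works when the matching row happens to be row 0.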
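The ValueError in the question comes from .item(), which requires exactly one match, so it is raised when the symbol is missing (for instance because of stray whitespace in the sheet) or appears more than once. A minimal sketch of a lookup that tolerates both cases follows. The helper name `industry_for` is made up, and the DataFrame is rebuilt inline from the simplified table instead of being read from file.xlsx.

```python
import pandas as pd

df = pd.DataFrame({
    "Symbol": ["AAPL", "MSFT", "TSLA"],
    "Industry": ["Technology Manufacturing",
                 "Technology Manufacturing",
                 "Electric Car Manufacturing"],
})

def industry_for(frame, symbol):
    """Return the industry of the first row matching symbol, or None if absent."""
    matches = frame.loc[frame["Symbol"] == symbol, "Industry"]
    if matches.empty:
        return None
    return matches.iloc[0]

print(industry_for(df, "MSFT"))
print(industry_for(df, "NOPE"))
```

If lookups are frequent, df.set_index('Symbol')['Industry'] gives a Series that can be indexed directly by symbol.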
Here is the question:
Write a program that computes the average learning coverage (the second column, labeled LC) and the highest Unique learner (the third column, labeled UL).
Both should be computed only for the period from June 2018 through May 2019.
Save the results in the variables mean_LC and max_UL.
The content of the .txt file is as below:
Date,LC,UL
1-01-2018,20045,687
1-02-2018,4536,67
1-03-2018,6783,209
1-04-2018,3465,2896
1-05-2018,456,27
1-06-2018,3458,986
1-07-2018,6895,678
1-08-2018,5678,345
1-09-2018,4576,654
1-10-2018,456,98
1-11-2018,456,8
1-12-2018,456,789
1-01-2019,876,98
1-02-2019,3468,924
1-03-2019,46758,973
1-04-2019,678,345
1-05-2019,345,90
1-06-2019,34,42
1-07-2019,35,929
1-08-2019,243,931
# Import the pandas package.
import pandas as pd
# Read the CSV-formatted file using the read_csv function.
df = pd.read_csv('content.txt')
# Parse the dates, assuming day-month-year; comparing the raw strings
# would not order them chronologically.
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
# Retain only the data from June 2018 through May 2019.
df = df[(df['Date'] >= '2018-06-01') & (df['Date'] <= '2019-05-31')]
# Use the pandas mean function to find the average of the LC column.
mean_LC = df['LC'].mean()
# Use the pandas max function to find the highest value in the UL column.
max_UL = df['UL'].max()
This link will give you an idea of how the code is actually working : https://www.learnpython.org/en/Pandas_Basics
Cracked it!
with open("LearningData.txt", "r") as fileref:
    lines = fileref.read().split()

# lines[0] is the header, so lines[6:18] covers June 2018 through May 2019.
LC_total = 0.0
UL_list = []
for line in lines[6:18]:
    fields = line.split(",")
    LC_total += float(fields[1])
    # Convert to int so the values compare numerically, not as strings.
    UL_list.append(int(fields[2]))

mean_LC = LC_total / len(UL_list)
max_UL = max(UL_list)
print(mean_LC)
print(max_UL)