Creating new df columns for each iteration of "for" loop - python

I am trying to calculate the diff_chg of S&P sectors for 4 different dates (given in start_return):
start_return = [-30, -91, -182, -365]
for date in start_return:
    diff_chg = closingprices[-1].divide(closingprices[date])
    for i in sectors:  # sectors is XLK, XLY, etc.
        diff_chg[i] = diff_chg[sectordict[i]].mean()  # finds the % chg of all sectors
    diff_df = diff_chg.to_frame()
My expected output is to have 4 columns in the df, one with each sector's return for each of the given periods (-30, -91, -182, -365).
As of now, when I run this code it returns the sum of the returns of all 4 periods in diff_df. I would like it to create a new column in the df for each period.
My code returns:
XLK 1.859907
XLI 1.477272
XLF 1.603589
XLE 1.415377
XLB 1.526237
but I want it to return:
1mo (-30) 3mo (-91) 6mo (-182) 1yr (-365)
XLK 1.086547 values here etc etc
XLI 1.0334
XLF 1.07342
XLE .97829
XLB 1.0281

Try something like this:
import pandas as pd

start_return = [-30, -91, -182, -365]
diff_chg = pd.DataFrame()
for date in start_return:
    diff_chg[date] = closingprices[-1].divide(closingprices[date])
This adds a column for each date in start_return to a single DataFrame created before the loop, instead of overwriting diff_chg on every iteration.
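If the per-sector averaging from the original code is still needed, here is a minimal sketch that keeps one column per look-back period. It assumes, as in the question, that closingprices is a DataFrame of closing prices with one column per ticker, sectors is the list of sector ETFs, and sectordict maps each sector to its member tickers; .iloc is used on the assumption that the offsets are positional row offsets.
import pandas as pd

start_return = [-30, -91, -182, -365]
diff_df = pd.DataFrame()
for date in start_return:
    # ratio of the latest close to the close `date` rows back, per ticker
    diff_chg = closingprices.iloc[-1].divide(closingprices.iloc[date])
    # average the ratios of the tickers that belong to each sector
    for sector in sectors:
        diff_chg[sector] = diff_chg[sectordict[sector]].mean()
    diff_df[date] = diff_chg[sectors]

# optional friendlier labels for the four periods
diff_df.columns = ['1mo', '3mo', '6mo', '1yr']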

Related

How to filter dataframe only by month and year?

I want to select many cells which are filtered only by month and year. For example, there are cells with 01.01.2017, 15.01.2017, 03.02.2017 and 15.02.2017. I want to group these cells looking only at the month and year information. If they are in January, they should be grouped together.
Output Expectation:
01.01.2017 ---- 1
15.01.2017 ---- 1
03.02.2017 ---- 2
15.02.2017 ---- 2
Edit: I have two datasets in different Excel files, as you can see below.
first data
second data
What I'm trying to do is get the 'Su Seviye' data for every 'DH_ID' separately from the first dataset, and then paste that data into the 'Kuyu Yüksekliği' column in the second dataset. The problems are that every 'DH_ID' is in a different sheet, and while the first dataset only has month and year information, the second dataset also has the day. How can I write this kind of code?
import pandas as pd
df = pd.read_excel('...Gözlem kuyu su seviyeleri- 2017.xlsx', sheet_name= 'GÖZLEM KUYULARI1', header=None)
df2 = pd.read_excel('...YERALTI SUYU GÖZLEM KUYULARI ANALİZ SONUÇLAR3.xlsx', sheet_name= 'HJ-8')
HJ8 = df.iloc[:, [0,5,7,9,11,13,15,17,19,21,23,25,27,29]]
##writer = pd.ExcelWriter('yıllarsuseviyeler.xlsx')
##HJ8.to_excel(writer)
##writer.save()
rb = pd.read_excel('...yıllarsuseviyeler.xlsx')
rb.loc[0,7]='01.2022'
rb.loc[0,9]='02.2022'
rb.loc[0,11]='03.2022'
rb.loc[0,13]='04.2022'
rb.loc[0,15]='05.2021'
rb.loc[0,17]='06.2022'
rb.loc[0,19]='07.2022'
rb.loc[0,21]='08.2022'
rb.loc[0,23]='09.2022'
rb.loc[0,25]='10.2022'
rb.loc[0,27]='11.2022'
rb.loc[0,29]='12.2022'
You can see what I have done above.
First, convert the date column to a datetime object, then get the year and month part with to_period('M'), and finally get the group number with ngroup().
df['group'] = df.groupby(pd.to_datetime(df['date'], format='%d.%m.%Y').dt.to_period('M')).ngroup() + 1
date group
0 01.01.2017 1
1 15.01.2017 1
2 03.02.2017 2
3 15.02.2017 2
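For the edited part of the question (matching month-level values from the first workbook to day-level dates in the second), one possible sketch is to reduce both date columns to a monthly Period and merge on it. The frame and column names monthly, daily and 'date' below are placeholders; 'DH_ID', 'Su Seviye' and 'Kuyu Yüksekliği' come from the question:
import pandas as pd

# `monthly`: month-level 'Su Seviye' readings per 'DH_ID' (first workbook, dates like '01.2022')
# `daily`: day-level rows that need a 'Kuyu Yüksekliği' value (second workbook, dates like '15.01.2022')
monthly['month'] = pd.to_datetime(monthly['date'], format='%m.%Y').dt.to_period('M')
daily['month'] = pd.to_datetime(daily['date'], format='%d.%m.%Y').dt.to_period('M')

merged = daily.merge(monthly[['DH_ID', 'month', 'Su Seviye']],
                     on=['DH_ID', 'month'], how='left')
merged = merged.rename(columns={'Su Seviye': 'Kuyu Yüksekliği'})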

Column Lost when Concatenating Dataframes with Pandas?

I'm writing a program to scrape through a folder of PDFs, extracting a table from each that contains the same data fields. Sample screenshot of the table in one of the PDFs:
The goal of the program is to produce a spreadsheet with all of the data from each PDF in a single row with the date of the PDF, and the common fields as the column headers. The date in the first column should be the date in the PDF filename. It should look like this:
When I extract out the data into a dataframe and add column headers for "Field" and the date of the report, it looks like this:
Field 2021-12-04
0 Radiation: 5.22 kWh/m2
1 Energy: 116356 kWh
2 PR: 0.79
3 Month to Date NaN
4 Total radiation: 21.33 kWh/m2
5 Total energy: 464478 kWh
6 Max daily energy: 116478 kWh
7 Max daily occurred on: 2021-12-03
Then I set the index as the first column, since those are the common fields that I'll concat on. When I do that, the date column header seems to be at a different level than the Field header? I'm not sure what happens here:
2021-12-20
Field
Radiation: 3.76 kWh/m2
Energy: 89175 kWh
PR: 0.84
Month to Date NaN
Total radiation: 84.66 kWh/m2
Total energy: 1960868 kWh
Max daily energy: 126309 kWh
Max daily occurred on: 2021-12-17
Then I transpose, and the result looks OK:
Field Radiation: Energy: ... Max daily energy: Max daily occurred on:
2021-12-13 0.79 kWh/m2 19193 kWh ... 124933 kWh 2021-12-12
Then I concatenate, and the result looks good except for some reason the first column with the dates is lost. Any suggestions?
import tabula as tb
import os
import glob
import pandas as pd
import datetime
import re

begin_time = datetime.datetime.now()
User_Profile = os.environ['USERPROFILE']
Canadian_Combined = User_Profile + '\Combined.csv'
CanadianReportsPDF = User_Profile + '\Canadian Reports (PDF)'
CanadianDailySummaryTable = (72, 144, 230, 465)
CanadianDailyDF = pd.DataFrame()

def CanadianScrape():
    global CanadianDailyDF
    for pdf in glob.glob(CanadianReportsPDF + '/*Daily*'):
        # try:
        dfs = tb.read_pdf(os.path.abspath(pdf), area=CanadianDailySummaryTable, lattice=True, pages=1)
        df = dfs[0]
        date = re.search("([0-9]{4}\-[0-9]{2}\-[0-9]{2})", pdf)
        df.columns = ["Field", date.group(0)]
        df.set_index("Field", inplace=True)
        # print(df.columns)
        print(df)
        df_t = df.transpose()
        # print(df_t)
        CanadianDailyDF = pd.concat([df_t, CanadianDailyDF], ignore_index=False)
        # print(CanadianDailyDF)
        # except:
        #     continue
    # print(CanadianDailyDF)
    CanadianDailyDF.to_csv(Canadian_Combined, index=False)

CanadianScrape()
print(datetime.datetime.now() - begin_time)
EDIT:
Added an insert() line after the transpose to add back the date column, per Ezer K's suggestion, and that seems to have solved it.
df.columns = ["Field",date.group(0)]
df.set_index("Field",inplace=True)
df_t = df.transpose()
df_t.insert(0, "Date:", date.group(0))
It is hard to say exactly since your examples are hard to reproduce, but it seems that instead of adding a field you are changing the column names.
Try switching these rows in your function:
df.columns = ["Field",date.group(0)]
df.set_index("Field",inplace=True)
df_t = df.transpose()
with these:
df_t = df.transpose()
df_t.insert(0, "Date:", date.group(0))
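A minimal sketch of why this helps (the values here are made up): after set_index("Field") and transpose, the report date ends up as the index of df_t, and to_csv(..., index=False) then drops it; insert makes it an ordinary column again, so it survives the concat and the CSV export.
import pandas as pd

# hypothetical single-report frame as returned by tabula
df = pd.DataFrame({0: ['Radiation:', 'Energy:'], 1: ['5.22 kWh/m2', '116356 kWh']})
date = '2021-12-04'

df.columns = ['Field', date]
df.set_index('Field', inplace=True)
df_t = df.transpose()            # the date is now the index, not a column
df_t.insert(0, 'Date:', date)    # add it back as a real column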

Python rows iteration over dates with a SQL server connection. Iterating over multiple columns in a python data frame

I have a data frame with a primary index and two date columns. I am retrieving data from a SQL server using the primary index; the retrieved data has a dollar value and a date.
The initial data frame looks like this:
Primary_index Issue_date Experience_date
abc101 08/01/2018 08/01/2020
abc102 02/01/2018 02/01/2020
abc103 04/13/2017 04/13/2018
abc104 07/27/2019 07/27/2020
The SQL data is this:
Primary_index Paid_date Amount
abc101 07/01/2017 $50
abc102 02/13/2018 $100
abc101 05/23/2019 $500
abc104 07/02/2020 $175
abc104 09/02/2017 $175
I need to iterate over the Primary_index along with Issue_date and Experience_date, to make sure the Paid_date is in between the Issue_date and Experience_date.
I am iterating over the primary index as shown below:
df['Primary_Index'] = df['Primary_Index'].astype(str).str.strip()
Primary_list = df['Primary_Index'].apply(lambda x: "'{}'".format(x)).tolist()
list_split = [Primary_list[x:x+10000] for x in range(0, len(Primary_list), 10000)]
filter_list = []
for chunk in list_split:
    filter_list.append(','.join(chunk))

df_final = pd.DataFrame()
for i in filter_list:
    sql = """
    SELECT
        DB.tbl.Primary_Index,
        DB.tbl.Paid_date
    FROM
        DB.tbl
    WHERE
        DB.tbl.Primary_Index IN (""" + i + """)
    AND
        DB.tbl.Paid_date BETWEEN '2017-01-02 00:00:00' AND '2020-06-30 00:00:00'
    GROUP BY
        DB.tbl.Primary_Index,
        DB.tbl.Paid_date
    """
    df_final = df_final.append(pd.read_sql(sql, con))
The problem with this is that I have hard-coded the minimum and maximum Paid_date, which returns millions of rows.
Is there a way to iterate over the dates in the initial data frame along with the primary_index?
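One possible sketch (not from the original code): bound the query with the overall date range taken from the initial frame instead of hard-coded literals, then enforce each row's own window in pandas with a merge. This assumes df is the initial frame and df_final is the combined SQL result from the code above, with the column names used there:
import pandas as pd

# overall bounds from the initial frame, to use in the BETWEEN clause
min_issue = pd.to_datetime(df['Issue_date']).min()
max_exp = pd.to_datetime(df['Experience_date']).max()
# ... build the SQL string with Paid_date BETWEEN min_issue AND max_exp ...

# then keep only the payments that fall inside each row's own window
merged = df_final.merge(df[['Primary_Index', 'Issue_date', 'Experience_date']],
                        on='Primary_Index', how='inner')
for col in ['Paid_date', 'Issue_date', 'Experience_date']:
    merged[col] = pd.to_datetime(merged[col])
in_window = merged[merged['Paid_date'].between(merged['Issue_date'],
                                               merged['Experience_date'])]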

Creating a conditional loop in Numpy/Pandas

Absolute newbie here....
I have a dataset with a list of expenses (data 1).
I would like to create a loop to identify the dates on which the person spends more than the previous day and also more than the next day. In doing so, I would like it to either print the date and amount (expenses) or create a new column reading true/false.
Should I use NumPy or Pandas?
I was thinking of something along the lines of today = i, yesterday = i-1 and tomorrow = i+1,
...and then proceeding to create a loop.
Are you looking for something like this?
import numpy as np
import pandas as pd

# sample data
np.random.seed(4)
df = pd.DataFrame({'Date': pd.date_range('2020-01-01', '2020-01-10'),
                   'Name': ['Some Name', 'Another Name']*5,
                   'Price': np.random.randint(100, 1000, 10)})
# group by name
g = df.groupby('Name')['Price']
# create a mask to filter your dataframe where the current price is greater than the price above and below
mask = (g.shift(0) > g.shift(1)) & (g.shift(0) > g.shift(-1))
df[mask]
Date Name Price
3 2020-01-04 Another Name 809
4 2020-01-05 Some Name 997
7 2020-01-08 Another Name 556
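If the expenses belong to a single person, the groupby is not needed and the same shift comparison can be applied to the Price column directly (a sketch, assuming the rows are already in date order):
price = df['Price']
peak = (price > price.shift(1)) & (price > price.shift(-1))
df[peak]  # rows where spending is higher than both the previous and the next day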

Optimization of date subtraction on large dataframe - Pandas

I'm a beginner learning Python. I have a very large dataset, and I'm having trouble optimizing my code to make it run faster.
My goal is to optimize all of this (my current code works, but it is slow):
Subtract two date columns
Create new column with the result of that subtraction
Remove original two columns
Do all of this in a fast manner
Random finds:
Thinking about changing the initial file read method...
https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file
I have parse_dates=True when reading the CSV file - so could this be a slowdown? I have 50+ columns but only 1 timestamp column and 1 year column.
This column:
saledate
1 3/26/2004 0:00
2 2/26/2004 0:00
3 5/19/2011 0:00
4 7/23/2009 0:00
5 12/18/2008 0:00
Subtracted by (Should this be converted to a format like 1/1/1996?):
YearMade
1 1996
2 2001
3 2001
4 2007
5 2004
Current code:
mean_YearMade = dfx[dfx['YearMade'] > 1000]['YearMade'].mean()

def age_at_sale(df, mean_YearMade):
    '''
    INPUT: DataFrame
    OUTPUT: DataFrame
    Add a column called Age_at_Sale
    '''
    df.loc[:, 'YearMade'][df['YearMade'] == 1000] = mean_YearMade
    # Column has tons of erroneous years with 1000
    df['saledate'] = pd.to_datetime(df['saledate'])
    df['saleyear'] = df['saledate'].dt.year
    df['Age_at_Sale'] = df['saleyear'] - df['YearMade']
    df = df.drop('saledate', axis=1)
    df = df.drop('YearMade', axis=1)
    df = df.drop('saleyear', axis=1)
    return df
Any optimization tricks would be much appreciated...
You can use sub for the subtraction, and to select by condition use loc with a mask like dfx['YearMade'] > 1000. Also, creating the saleyear column is not necessary.
dfx['saledate'] = pd.to_datetime(dfx['saledate'])
mean_YearMade = dfx.loc[dfx['YearMade'] > 1000, 'YearMade'].mean()

def age_at_sale(df, mean_YearMade):
    '''
    INPUT: DataFrame
    OUTPUT: DataFrame
    Add a column called Age_at_Sale
    '''
    df.loc[df['YearMade'] == 1000, 'YearMade'] = mean_YearMade
    df['Age_at_Sale'] = df['saledate'].dt.year.sub(df['YearMade'])
    df = df.drop(['saledate', 'YearMade'], axis=1)
    return df
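A usage sketch, assuming the data comes from a CSV (the file name is a placeholder); parsing only the saledate column at read time avoids the separate to_datetime call:
dfx = pd.read_csv('sales.csv', parse_dates=['saledate'])  # 'sales.csv' is a placeholder
mean_YearMade = dfx.loc[dfx['YearMade'] > 1000, 'YearMade'].mean()
dfx = age_at_sale(dfx, mean_YearMade)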
