Problem:
my dataframe shows forward contract settlements on a daily basis. I want to create a column that returns all the January values. So in November, M0 is the November contract and I want to return M2, which is the January contract; then in December I would return M1. The normal code for doing this would be:
df['Jan'] = df.loc[df.Month == 12, 'M1']
This only works for one value of the month though and I want to loop through the values in the Month column of the dataframe to pull the right 'Mx' column for the given month and have a single column with values for January.
I have tried various loops but continually get errors. The latest I have below with the dataframe:
[DataFrame]
[Code]
[Error message]
Any help appreciated; I have googled all day. Happy to hear any better ways of doing this.
You can pass the whole dataframe and then access the columns within your function.
Try this.
def create_jan(x):
    if x['Month'] == 11:
        # in November, the January contract is two months out (M2)
        return x['M2']
    elif x['Month'] == 12:
        # in December, the January contract is one month out (M1)
        return x['M1']
    else:
        return 0

df['Jan'] = df.apply(create_jan, axis=1)
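For larger frames, a vectorized alternative with np.select avoids the row-by-row apply. A minimal sketch, assuming the same Month/M1/M2 columns as the question (the sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: month of each settlement plus rolling contract columns.
df = pd.DataFrame({'Month': [11, 12, 10],
                   'M1': [50.0, 61.0, 40.0],
                   'M2': [51.0, 62.0, 41.0]})

# In November (Month == 11) January is two contracts out (M2);
# in December (Month == 12) it is one contract out (M1).
df['Jan'] = np.select(
    [df['Month'] == 11, df['Month'] == 12],
    [df['M2'], df['M1']],
    default=0,
)
```

np.select evaluates the whole column at once, so this scales much better than apply(axis=1).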
I'm trying to pull some data from yfinance in Python for different funds from different exchanges. In pulling my data I just set-up the start and end dates through:
start = '2002-01-01'
end = '2022-06-30'
and pulling it through:
assets = ['GOVT', 'IDNA.L', 'IMEU.L', 'EMMUSA.SW', 'EEM', 'IJPD.L', 'VCIT',
'LQD', 'JNK', 'JNKE.L', 'IEF', 'IEI', 'SHY', 'TLH', 'IGIB',
'IHYG.L', 'TIP', 'TLT']
assets.sort()
data = yf.download(assets, start = start, end = end)
I guess you've noticed that the "assets" or the ETFs come from different exchanges such as ".L" or ".SW".
Now the result looks like this:
[Resulting DataFrame with a date-and-time index]
It seems to me that there is no overlap for a single instrument (i.e. two prices for the same day). So I don't think the data will be disturbed if any scrubbing or clean-up is done.
So my goal is to harmonize or consolidate the prices to its date index rather than date-and-time index so that each price for each instrument is firmly side-by-side each other for a particular date.
Thanks!
If you want the daily last closing price from the yahoo-finance api you could use the interval argument,
yf.download(assets, start=start, end=end, interval="1d")
Solution with Pandas:
Transforming the Index
You have an index where each row is a string representing the datetime. First, transform those strings into an actual DatetimeIndex, where each row is of type datetime64; this lets you work with dates in your dataset using functions from the datetime library. Then keep only the date part of each datetime64:
data.index = pd.to_datetime(data.index).date
Groupby
Now that you have an index of dates, you can group by that index. First, deal with NaN values. If a closing price should only be used to fill values within its own date, apply:
data = data.groupby(data.index).ffill()
Otherwise, if you think the closing price of, e.g., October 1st can also be used to fill NaN values on October 2nd and 3rd, simply apply ffill() without the groupby:
data = data.ffill()
Lastly, take the last observed record per date (the index). Note that you can apply any function you want here, even a custom lambda:
data = data.groupby(data.index).last()
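Putting the steps above together on a tiny stand-in frame (the tickers and prices are invented; a real run would use the yf.download result):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the yfinance download: two rows per calendar
# date (one per exchange session), each filling only some tickers.
idx = pd.to_datetime(['2022-06-01 00:00:00', '2022-06-01 16:00:00',
                      '2022-06-02 00:00:00', '2022-06-02 16:00:00'])
data = pd.DataFrame({'GOVT': [25.0, np.nan, 26.0, np.nan],
                     'IDNA.L': [np.nan, 7.0, np.nan, 8.0]}, index=idx)

# Keep only the calendar date, forward-fill within each date,
# then collapse to one row per date.
data.index = pd.to_datetime(data.index).date
data = data.groupby(data.index).ffill()
data = data.groupby(data.index).last()
```

After this, each date has a single row with the prices from both exchanges side by side.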
I'm making a pivot table from a CSV file (cl_total_data.csv) using pandas pd.pivot_table() and need to fix values that end up in the wrong rows.
[Original CSV File]
The error occurs when the year has 53 weeks (i.e., 53 values) instead of 52: the first value of a 53-week year is set as the last value in the pivot table.
[Pivot Table with wrong values top]
[Pivot Table with wrong values bottom]
[Original CSV 2021 w/ 53 values]
The last value in the pivot table's 2021 column, row 53 (1123544), is actually the first value of the year (2021-01-01, 1123544) in the original CSV table for 2021.
I figured out how to fix this in the pivot table after making it. I use
Find columns with 53 values:
cl_total_p.columns[~cl_total_p.isnull().any()]
Then I take the values from the original CSV file for the corresponding year and replace the values in the pivot table:
cl_total_p[2021] = cl_total_data.loc['2021'].Quantity.values
My problem is:
I can't figure out what I'm coding wrong in the pivot table function that causes this misplacement of values. Is there a better way to code it?
Using my manual solution takes a lot of time especially when I'm using multiple CSV files 10+ and having to fix every single misplacement in columns with 53 weeks. Is there a for loop I can code to loop through all columns with 53 weeks and replace them with their corresponding year?
I tried
import numpy as np
import pandas as pd

year_range = np.arange(1982, 2023)
week_range = np.arange(54)

for i in year_range:
    for y in week_range:
        cl_total_p[i] = cl_total_data.loc['y'].Quantity.values
But I get an error :( How can I fix the pivot table value misplacement? and/or find a for loop to take the original values and replace them in the pivot table?
I can't figure out what I'm coding wrong in the pivot table function that causes this misplacement of values. Is there a better way to code it?
The problem here lies in the definition of the ISO week number. Let's look at this line of code:
cl_total_p = pd.pivot_table(cl_total_data, index = cl_total_data.index.isocalendar().week, columns = cl_total_data.index.year, values = 'Quantity')
This line uses the ISO week number to determine the row position, and the non-ISO year to determine the column position.
The ISO week number counts weeks starting from the first week that has a majority of its days in that year. This means the first ISO week may not line up with the first day of the year. For that reason, the ISO week number is used alongside the ISO year number, which assigns the part of the year before the first ISO week to the previous year.
For that reason, January 1st, 2021 was not the first week of 2021 in the ISO system. It was the 53rd week of 2020. When you mix the ISO week with the non-ISO year, you get the result that it was the 53rd week of 2021, a date which is a year off.
Here's an example of how to show this with the linux program date:
$ date -d "Jan 1 2021" "+%G-%V"
2020-53
You have a few options:
Use both the ISO week and the ISO year for consistency. The isocalendar() function can provide both the ISO week and ISO year.
If you don't want the ISO system, you can come up with your own definition of "week" which avoids having the year's first day belong to the previous year. One approach you could take is to take the day of year, divide by seven, and round down. Unfortunately, this does mean that the week will start on a different day each year.
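The first option might look like this; the column and value names mirror the question's cl_total_data, but the sample dates and quantities are invented:

```python
import pandas as pd

# Small stand-in for cl_total_data: weekly quantities around new year.
idx = pd.to_datetime(['2020-12-25', '2021-01-01', '2021-01-08'])
cl_total_data = pd.DataFrame({'Quantity': [100, 200, 300]}, index=idx)

# isocalendar() gives a matching ISO year and ISO week for every date,
# so Jan 1 2021 lands in (year 2020, week 53) instead of (2021, 53).
iso = cl_total_data.index.isocalendar()
cl_total_p = pd.pivot_table(cl_total_data,
                            index=iso.week,
                            columns=iso.year,
                            values='Quantity')
```

Because week and year now come from the same calendar system, the 53rd week of 2020 stays in the 2020 column and no value spills into the wrong year.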
I am working on a COVID-19 dataset with total cases and total deaths on the last day of each month for each city since March. But I would like to create a column which tells me the number of new cases for every city in each of these months.
My logic is: if the value in the cell from the 'city_ibge_code' column in position p is the same as the value in position p-1, it should create a new column that is the difference between the number of cases in two months. And if the values are different (that shows that are different cities), just pass that value to the new column.
casos_full: the dataframe with the cities and the number of cases and deaths in March, April, May, June, July, August and September.
city_ibge_code: is the code for each city in the dataframe - each city has a unique code.
And there also is a "date" column - which represents the last day of the month
for rows in casos_full:
    if rows['city_ibge_code'] == rows['city_ibge_code'].shift(1):
        rows['New Cases'] = rows['last_available_confirmed'] - rows['last_available_confirmed'].shift(1)
    else:
        rows['New Cases'] = rows['last_available_confirmed']
rows here is just a view of the row; assigning to it does not update the actual dataframe. If I understood your problem correctly, you need to write back into the dataframe itself, e.g. with .loc (assuming a default RangeIndex):
for i in range(len(casos_full)):
    if i > 0 and casos_full.loc[i, 'city_ibge_code'] == casos_full.loc[i - 1, 'city_ibge_code']:
        casos_full.loc[i, 'New Cases'] = casos_full.loc[i, 'last_available_confirmed'] - casos_full.loc[i - 1, 'last_available_confirmed']
    else:
        casos_full.loc[i, 'New Cases'] = casos_full.loc[i, 'last_available_confirmed']
Please give more details on your problem so we can help.
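A vectorized sketch of the same logic with groupby/diff, assuming rows are sorted by city and date (column names taken from the question, sample values invented):

```python
import pandas as pd

# Minimal stand-in for casos_full: two cities, two month-end rows each.
casos_full = pd.DataFrame({
    'city_ibge_code': [100, 100, 200, 200],
    'date': ['2020-03-31', '2020-04-30', '2020-03-31', '2020-04-30'],
    'last_available_confirmed': [10, 25, 5, 9],
})

# diff() within each city gives month-over-month new cases; the first
# row of each city has no previous month, so fall back to the total.
grouped = casos_full.groupby('city_ibge_code')['last_available_confirmed']
casos_full['New Cases'] = grouped.diff().fillna(casos_full['last_available_confirmed'])
```

groupby ensures the difference is never taken across two different cities, which is exactly what the shift(1) comparison in the question was trying to guard against.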
I have a dataframe that has Date as its index. The dataframe has stock market related data, so the dates are not continuous. If I want to move, let's say, 120 rows up in the dataframe, how do I do that? For example:
If I want to get the data starting from 120 trading days before the start of year 2018, how do I do that below?
df['2018-01-01':'2019-12-31']
Thanks
Try this:
df.iloc[df.index.get_loc('2018-01-01'):df.index.get_loc('2019-12-31') + 1]
Get the location of both dates in the index and slice between them to get the desired range.
UPDATE :
Based on your requirement, make some little modifications of above.
Yearly Indexing
>>> df.iloc[df.index.get_loc('2018').start:df.index.get_loc('2019').stop]
Above, df.index.get_loc('2018') gives a slice object covering all rows of 2018; its .start attribute is the position of the first element of 2018, and similarly .stop on get_loc('2019') is the position just past the last element of 2019.
Monthly Indexing
Now consider you want data for First 6 months of 2018 (without knowing what is the first day), the same can be done using:
>>> df.iloc[df.index.get_loc('2018-01').start:df.index.get_loc('2018-06').stop]
As you can see above we have indexed the first 6 months of 2018 using the same logic.
Assuming you are using pandas and the dataframe is sorted by its date index, a very simple way would be:
initial_date = '2018-01-01'
initial_date_index = df.index.get_loc(initial_date)
offset = 120
start_index = initial_date_index - offset
new_df = df.iloc[start_index:]
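If the exact start date might not be a trading day, Index.searchsorted finds the position of the first row on or after it. A sketch on an invented business-day frame:

```python
import numpy as np
import pandas as pd

# Toy frame with a sorted, non-continuous DatetimeIndex (business days only).
idx = pd.bdate_range('2017-06-01', '2019-12-31')
df = pd.DataFrame({'close': np.arange(len(idx), dtype=float)}, index=idx)

# Position of the first row on/after 2018-01-01, then step back 120 rows
# (clamped at 0 in case there are fewer than 120 earlier rows).
pos = df.index.searchsorted(pd.Timestamp('2018-01-01'))
start = max(pos - 120, 0)
new_df = df.iloc[start:]
```

Unlike get_loc, searchsorted never raises when the date is absent from the index, which makes it convenient for holidays and weekends.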
I am trying to convert a dataframe column with a date and timestamp to a year-weeknumber format, i.e., 01-05-2017 03:44 = 2017-1. This is pretty easy; however, I am stuck on dates that fall in a new year yet whose week number is still the last week of the previous year.
I did the following:
df['WEEK_NUMBER'] = df.date.dt.year.astype(str).str.cat(df.date.dt.week.astype(str), sep='-')
Where df['date'] is a very large column with date and times, ranging over multiple years.
A date which gives a problem is for example:
Timestamp('2017-01-01 02:11:27')
The output of my code will be 2017-52, while it should be 2016-52. Since the data covers multiple years, and week numbers and their corresponding dates change every year, I cannot simply subtract a few days.
Does anybody have an idea of how to fix this? Thanks!
Replace df.date.dt.year by this:
(df.date.dt.year - ((df.date.dt.week > 50) & (df.date.dt.month == 1)))
Basically, it subtracts 1 from the year value whenever the week number is greater than 50 and the month is January.
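Alternatively, with pandas 1.1+, dt.isocalendar() returns the matching ISO year and week directly, which sidesteps the manual adjustment. A sketch using the same df['date'] column as the question (sample timestamps invented):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2017-01-01 02:11:27',
                                           '2017-05-01 03:44:00'])})

# isocalendar() pairs each date with its ISO year, so the first row
# comes out as 2016-52 rather than 2017-52.
iso = df['date'].dt.isocalendar()
df['WEEK_NUMBER'] = iso.year.astype(str).str.cat(iso.week.astype(str), sep='-')
```

This also replaces the deprecated Series.dt.week accessor used in the question.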