I have population data for a city divided into 10 zones, along with the rate of population increase. I want to calculate the population of each zone for each of the next ten years and append every year's population as a separate column. I can append one column, but after that I am not able to append the next column using the latest appended column. I can append the columns one by one, but that is not a good way to do this:
data['zone_pop'] = data['zone_pop'].apply(lambda zone_pop: population(zone_pop))
Please help me with this.
Try the concat function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html).
In the for loop, you should have something like this:
new_data = pandas.DataFrame(population(data[label]), columns=[label_incr])
data = pandas.concat([data, new_data], axis=1)
Here label and label_incr are str variables used to get the current year's data and to name the new year's calculation.
Edit (detailed syntax)
I guess you already have a dataframe data containing a single column 'population_zone' with 10 indexes (one for each zone), and the rate of change r.
The code below should work (at least, I tested it on fake local data):
import pandas as pd

current_label = 'population_zone'
for i in range(1, 11):
    new_label = 'population_zone_year' + str(i)
    # grow the previous year's column by r percent
    new_data = pd.DataFrame((data[current_label] * (1 + r / 100.)).values,
                            columns=[new_label])
    data = pd.concat([data, new_data], axis=1)
    current_label = new_label
If it doesn't work, I probably misunderstood how your data is stored.
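For reference, here is a self-contained version of that loop on made-up data (the ten zone populations and r = 2 are invented for illustration; each year's column is grown from the previous year's):

```python
import pandas as pd

# Invented starting populations for the 10 zones; r is an assumed growth rate in %
data = pd.DataFrame({'population_zone': [100, 200, 300, 400, 500,
                                         600, 700, 800, 900, 1000]})
r = 2.0

current_label = 'population_zone'
for i in range(1, 11):
    new_label = 'population_zone_year' + str(i)
    # grow the previous year's column by r percent
    new_data = pd.DataFrame((data[current_label] * (1 + r / 100.)).values,
                            columns=[new_label])
    data = pd.concat([data, new_data], axis=1)
    current_label = new_label

# data now has the original column plus ten yearly columns
```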
Related
Using the ff_monthly.csv data set (https://github.com/alexpetralia/fama_french),
use the first column as an index
(this contains the year and month of the data as a string).
Create a new column ‘Mkt’ as ‘Mkt-RF’ + ‘RF’
Create two new columns in the loaded DataFrame, ‘Month’ and ‘Year’ to
contain the year and month of the dataset extracted from the index column.
Create a new DataFrame with columns ‘Mean’ and ‘Standard
Deviation’ and the full set of years from (b) above.
Write a function which accepts (r_m, s_m), the monthly mean and standard
deviation of a return series, and returns a tuple (r_a, s_a), the annualised
mean and standard deviation. Use the formulae: r_a = (1+r_m)^12 - 1, and
s_a = s_m * 12^0.5.
Loop through each year in the data, and calculate the annualised mean and
standard deviation of the new 'Mkt' column, storing each in the newly
created DataFrame. Note that the values in the input file are % returns and
need to be divided by 100 to obtain decimals (i.e. the value for August 2022
represents a return of -3.78%).
Print the DataFrame and output it to a csv file.
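The annualisation function itself is a direct transcription of the formulae above; a minimal sketch (the function name is my own choice):

```python
def annualise(r_m, s_m):
    # r_a = (1 + r_m)^12 - 1, s_a = s_m * 12^0.5
    r_a = (1 + r_m) ** 12 - 1
    s_a = s_m * 12 ** 0.5
    return r_a, s_a

# e.g. a 1% monthly mean and 4% monthly std (already as decimals)
r_a, s_a = annualise(0.01, 0.04)
```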
Workings so far:
import pandas as pd
ff_monthly = pd.read_csv(r"file path", index_col=0)
Mkt=ff_monthly['Mkt-RF']+ff_monthly['RF']
ff_monthly= ff_monthly.assign(Mkt=Mkt)
df=pd.DataFrame(ff_monthly)
There are a few things to pay attention to.
The Date is the index of your DataFrame. This is treated in a special way compared to the normal columns, which is why df.Date raises an AttributeError: Date is not an attribute but the index. Instead, try df.index.
df.Date.str.split("_", expand=True) would work if your Date would look like 22_10. However according to your picture it doesn't contain an underscore and also contains the day, so this cannot work
In fact, the format you have does not follow any standard. The best way to deal with it properly is to parse it into a datetime64[ns] type that pandas will understand, with df.index = pd.to_datetime(df.index, format='%y%m%d'). See the Python documentation for supported format strings.
If all this works, it should be rather straightforward to create the columns:
df['Year'] = df.index.year
df['Month'] = df.index.month
(A DatetimeIndex exposes .year and .month directly; the .dt accessor is only for Series, so df.index.dt.year would fail.)
In fact, this part has been asked before.
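Putting those pieces together on toy data (the yymmdd index strings here are invented to mimic the file):

```python
import pandas as pd

# Toy frame with a yymmdd-style string index, standing in for ff_monthly
df = pd.DataFrame({'Mkt': [1.2, -3.78]}, index=['220731', '220831'])

# Parse the string index into a DatetimeIndex, then extract the parts
df.index = pd.to_datetime(df.index, format='%y%m%d')
df['Year'] = df.index.year
df['Month'] = df.index.month
```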
I have a database that includes monthly time series data on around 15 different indicators. The data is all in the same format, year-to-date values and year-to-date growth. January data is missing, with data for each indicator starting with the year-to-date total as of February.
For each indicator I want to turn the year-to-date data into monthly values. The code below does that.
But I want to be able to run this as a loop over all 15 indicators, and then automatically rename each resulting dataframe to include a reference to the category it belongs to. For example, one category of data is sales in value terms, so when I apply the code to that category, I want the output df_m to be renamed sales_m, and df_yoy to be renamed sales_yoy.
I thought I could do this by defining a list of the 15 indicators to start with, and then somehow assigning that list to the dataframes produced by the loop. But I can't make that work.
category = ['sales', 'construction']
df_m = df.loc[:, df.columns.str.contains('Monthly')]
df_ytd = df.drop(df.filter(regex='Monthly').columns, axis=1)
df_ytd = df_ytd.fillna(method='bfill', limit=1)
df_ytd.loc[df_ytd.index.month.isin([1,2]), :] = df_ytd / 2
df_ytd.columns = df_ytd.columns.str.replace(', YTD', '')
df_m.columns = df_m.columns.str.replace('YTD, ', '').str.replace(', Monthly', '')
df_m = df_m.fillna(df_ytd)
df_yoy = df_m.pct_change(periods=12) * 100
sales_m = df_m
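One common way to avoid inventing variable names like sales_m at runtime is to hold the results in dictionaries keyed by category; a sketch (process_category is a hypothetical stand-in for the YTD-to-monthly code above, and the input frames are made up):

```python
import pandas as pd

def process_category(df):
    # stand-in for the YTD-to-monthly conversion; returns the monthly
    # frame and its 12-period percentage change
    df_m = df
    df_yoy = df_m.pct_change(periods=12) * 100
    return df_m, df_yoy

# made-up per-category input frames
categories = {'sales': pd.DataFrame({'x': range(1, 25)}),
              'construction': pd.DataFrame({'x': range(1, 25)})}

monthly, yoy = {}, {}
for name, frame in categories.items():
    monthly[name], yoy[name] = process_category(frame)

# monthly['sales'] plays the role of sales_m, yoy['sales'] of sales_yoy
```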
I'm new to Python and have researched to find an answer; I am most likely not asking the right question. I am streaming data from an exchange into a dataframe (and will later stream the data into a database). My problem is that when I do a calculation on a column to create a new column containing the result, all of the values in every row of the new column change to the last result.
I am streaming in the open, high, low, close of a stock. In one column I am calculating the range for a candle during the timeframe, like on a one hour chart.
src = candles.close
ohlc = candles
ohlc = ohlc.rename(columns=str.lower)
candles['SMA_21'] = TA.SSMA(ohlc, period)
candles['EMA_21'] = TA.EMA(ohlc, period)
candles['WMA'] = TA.WMA(ohlc, 10)
candles['Range'] = src - candles['open']
candles['AvgRange'] = candles['Range'].tail(21).mean()
The range column works and has correct information which is not changed by each calculation. But the column for 'AvgRange' ends up with all values changed with each new mean value calculated.
The following also writes the last data entry to the whole column stream['EMA_Dir']
if stream['EMA'].iloc[-1] > stream['EMA'].iloc[-2]:
    stream['EMA_Dir'] = "Ascending"
I only want the last entry in the last, most recent, row of the dataframe.
Tried several things, but the last calculation changes all values in 'AvgRange' column.
Thanks in advance. Sorry if I didn't ask the question correctly, but that is probably why I haven't found the answer.
candles['AvgRange'] = candles['Range'].rolling(
    window=3,
    center=False
).mean()
This will give you a 3-row rolling average (use window=21 to match the 21-candle average you computed with tail(21)).
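For the EMA_Dir part: stream['EMA_Dir'] = "Ascending" broadcasts the scalar to every row. To write only into the most recent row, assign through .loc with the last index label; a sketch on toy data:

```python
import pandas as pd

# toy stream with an ascending EMA
stream = pd.DataFrame({'EMA': [1.0, 1.5, 2.0]})
stream['EMA_Dir'] = None  # create the column once, empty

if stream['EMA'].iloc[-1] > stream['EMA'].iloc[-2]:
    # .loc with the last index label touches only that row,
    # instead of broadcasting to the whole column
    stream.loc[stream.index[-1], 'EMA_Dir'] = 'Ascending'
```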
I have a pandas dataframe where observations are broken out per every two days. The values in the 'Date' column each describe a range of two days (eg 2020-02-22 to 2020-02-23).
I want to split those Date values into individual days, with a row for each day. The closest I got was newdf = df_day.set_index(df_day.columns.drop('Date',1).tolist()).Date.str.split(' to ', expand=True).stack().reset_index().loc[:, df_day.columns]
The problem here is that the new date values are returned as NaNs. Is there a way to achieve this data broken out by individual day?
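If the Date values really are two-day ranges stored as strings, one way to get a row per day is str.split followed by explode (available since pandas 0.25); a sketch on a made-up frame:

```python
import pandas as pd

# made-up frame with a two-day range per row
df_day = pd.DataFrame({'Date': ['2020-02-22 to 2020-02-23'],
                       'value': [10]})

# split each range into a list of days, then give each day its own row
newdf = (df_day.assign(Date=df_day['Date'].str.split(' to '))
               .explode('Date')
               .reset_index(drop=True))
```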
I might not be understanding, but based on the image it's a single date per row as is, just poorly labeled. I would manipulate the index strings, and if I can't do that I would create a new date column, or a new df with a clean date and merge it.
You should be able to chop off the first 14 characters with a lambda -- leaving you with second listed date in index.
I can't reproduce this, so bear with me.
df.rename(index=lambda s: s[14:])
#should remove first 14 characters from each row label.
#leaving just '2020-02-23' in row 2.
#If you must skip row 1, idx = df.index[1:]
#or df.iloc[1:].rename(index=lambda s: s[1:])
Otherwise, I would just replace it with a new datetime index.
didx = pd.date_range(start='2000-01-10', freq='D', end='2020-02-26')
#Make sure same length as df
df.set_index(didx)
#Or
#df['new_date'] = didx.values
#df.set_index('new_date').drop(columns=['Date'])
#(df.append(didx, axis=1) would not work here: DataFrame.append has no
#axis argument and appends rows, so use one of the options above)
I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I organized the dataset in alphabetical order by country name, and created an alphabetical list of the individual countries. new_data.tail() shows Zimbabwe listed last, and there are 80336 rows, hence the sorting.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame with 76 columns (country names), with the specific 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
    pink = new_data.loc[new_data['country'] == country]
    df[country] = pink.trust
As you can see, the data does not get included for the rest of the columns after the first. I believe this is due to the fact that the number of rows of 'trust' data for each country varies. While the first column has 1000 rows, there are some with as many as 2500 data points, and as little as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have utilizes this exact data structure for the template data, which is why I'm attempting to put it in a dataframe. Besides, I can't make it work, so I want to know how to do it.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = pink.trust.combine_first(df)
which ensures that your index is always union of all added columns.
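A tiny illustration of that index-union behaviour, with two made-up series of different lengths:

```python
import pandas as pd

a = pd.Series([1.0, 2.0], index=[0, 1])
b = pd.Series([3.0, 4.0, 5.0], index=[2, 3, 4])

# combine_first keeps a's values where present and fills the rest
# from b, so the result's index is the union of both indexes
c = a.combine_first(b)
```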
I think in this case pd.pivot(columns='var', values='val') will work for you, especially since you already have a dataframe. This function transfers the values of a particular column into column names. See the documentation for additional info. I hope that helps.
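A small sketch of what pivot does here, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'var': ['a', 'a', 'b'],
                   'val': [1, 2, 3]})

# values of 'var' become column names; 'val' entries move under them,
# with NaN where a row has no value for that column
wide = df.pivot(columns='var', values='val')
```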