I have a database that includes monthly time series data on around 15 different indicators. The data is all in the same format: year-to-date values and year-to-date growth. January data is missing; data for each indicator starts with the year-to-date total as of February.
For each indicator I want to turn the year-to-date data into monthly values. The code below does that.
But I want to be able to run this as a loop over all 15 indicators, and then automatically rename each resulting dataframe to include a reference to the category it belongs to. For example, one category of data is sales in value terms, so when I apply the code to that category, I want the output df_m to be renamed sales_m, and df_yoy to be renamed sales_yoy.
I thought I could do this by defining a list of the 15 indicators to start with, and then somehow assigning that list to the dataframes produced by the loop. But I can't make that work.
import pandas as pd

category = ['sales', 'construction']  # ...the full list of 15 indicators

# split the monthly columns from the year-to-date columns
df_m = df.loc[:, df.columns.str.contains('Monthly')]
df_ytd = df.drop(df.filter(regex='Monthly').columns, axis=1)
# back-fill January's missing YTD with February's total,
# then split that total evenly across the two months
df_ytd = df_ytd.fillna(method='bfill', limit=1)
df_ytd.loc[df_ytd.index.month.isin([1, 2]), :] = df_ytd / 2
# align the column names of the two frames
df_ytd.columns = df_ytd.columns.str.replace(', YTD', '')
df_m.columns = df_m.columns.str.replace('YTD, ', '').str.replace(', Monthly', '')
# fill gaps in the monthly data from the YTD-derived values
df_m = df_m.fillna(df_ytd)
df_yoy = df_m.pct_change(periods=12) * 100
sales_m = df_m  # the manual rename I want to automate
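One common way to make such a loop work without creating variable names on the fly is to wrap the transformation in a function and store the results in dictionaries keyed by category. This is a minimal sketch, not a tested solution: the frames lookup is hypothetical and stands for however each category's raw data is loaded.

import pandas as pd

def ytd_to_monthly(df):
    # the transformation from above, wrapped so it can be looped
    df_m = df.loc[:, df.columns.str.contains('Monthly')]
    df_ytd = df.drop(df.filter(regex='Monthly').columns, axis=1)
    df_ytd = df_ytd.fillna(method='bfill', limit=1)
    df_ytd.loc[df_ytd.index.month.isin([1, 2]), :] = df_ytd / 2
    df_ytd.columns = df_ytd.columns.str.replace(', YTD', '')
    df_m.columns = df_m.columns.str.replace('YTD, ', '').str.replace(', Monthly', '')
    df_m = df_m.fillna(df_ytd)
    df_yoy = df_m.pct_change(periods=12) * 100
    return df_m, df_yoy

categories = ['sales', 'construction']  # ...all 15 indicators

monthly, yoy = {}, {}
for cat in categories:
    raw = frames[cat]  # hypothetical: however you load each category's data
    monthly[cat], yoy[cat] = ytd_to_monthly(raw)

sales_m = monthly['sales']  # the 'renamed' frame, without touching globals()
sales_yoy = yoy['sales']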
I downloaded historical price data for the ^GSPC stock market index (S&P 500) and several other global indices. Date is set as the index.
Selecting rows by date with .loc works as expected when the date is the index:
# S&P500 DataFrame = spx_df
spx_df.loc['2010-01-04']
Open 1.116560e+03
High 1.133870e+03
Low 1.116560e+03
Close 1.132990e+03
Volume 3.991400e+09
Dividends 0.000000e+00
Stock Splits 0.000000e+00
Name: 2010-01-04 00:00:00-05:00, dtype: float64
I then concatenated several global stock-index histories into a single DataFrame for further use. In effect, any date in range will appear five times when historical data for five stock indices are linked in one time series.
markets = pd.concat(ticker_list, axis=0)
I want to reference a single date in the concatenated df and set it as a variable. I would prefer that the variable not be a datetime object, because I want to access it with .loc inside a function definition. How does concatenation affect accessing rows via the date index when the same date repeats several times in the linked time series?
This is what I attempted so far:
# markets = concatenated DataFrame
Reference_date = markets.loc['2010-01-04']
# KeyError: '2010-01-04'
Reference_date = markets.loc[markets.Date == '2010-01-04']
# This doesn't work because Date is not an attribute of the DataFrame
Since you have set the date as the index, you should be able to do:
Reference_date = markets.loc[markets.index == '2010-01-04']
To access a specific date in the concatenated DataFrame, you can use boolean indexing instead of .loc. This will return a DataFrame that contains all rows where the date equals the reference date:
reference_date = markets[markets.index == '2010-01-04']
You may also want to use the query() method to search for specific data:
reference_date = markets.query('index == "2010-01-04"')
Keep in mind that the resulting variable reference_date is still a DataFrame and contains all rows that match the reference date across all the concatenated DataFrames. If you want to extract only specific columns, you can use the column name like this:
reference_date_Open = markets.query('index == "2010-01-04"')["Open"]
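If you also need to tell the five markets apart when a date repeats, one option (a sketch; the 'Market' column and the ticker labels are hypothetical) is to tag each frame before concatenating:

import pandas as pd

# hypothetical labels for the five frames in ticker_list
labels = ['SPX', 'DJI', 'NDX', 'FTSE', 'DAX']
for label, frame in zip(labels, ticker_list):
    frame['Market'] = label

markets = pd.concat(ticker_list, axis=0)

# all five markets' rows for the reference date
reference_date = markets[markets.index == '2010-01-04']
# only the S&P 500 row for that date
spx_row = reference_date[reference_date['Market'] == 'SPX']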
I am trying to drop specific rows in a dataframe where the index is a date with 1hr intervals during specific times of the day. (It is hourly intervals of stock market data).
For instance, 2021-10-26 09:30:00-4:00,2021-10-26 10:30:00-4:00,2021-10-26 11:30:00-4:00, 2021-10-26 12:30:00-4:00 etc.
I want to be able to specify the row to keep by hh:mm (e.g. keep just the 6:30, 10:30 data each day), and drop all the rest.
I'm pretty new to programming so have absolutely no idea how to do this.
Your dates are in the index, so if the index holds datetime objects (not strings), you can build a boolean mask from the index's hour and minute attributes and keep only the times you want:
import pandas as pd

# ...input data into df, with a DatetimeIndex...

# keep only the 06:30 and 10:30 rows each day, drop all the rest
mask = df.index.hour.isin([6, 10]) & (df.index.minute == 30)
df = df[mask]
See the section on working with time in pandas, about halfway down this page:
https://www.dataquest.io/blog/python-datetime-tutorial/
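Alternatively, a short sketch assuming the dates are in a DatetimeIndex: pandas' built-in at_time selector picks out one time of day per call, so two calls and a concat avoid the manual mask entirely.

import pandas as pd

# keep only the 06:30 and 10:30 rows each day
df = pd.concat([df.at_time('06:30'), df.at_time('10:30')]).sort_index()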
I have population data for a city divided into 10 zones. The rate of population increase is given. I want to calculate the population of each zone for the next ten years and append each year's population as a separate column. I am able to append one column, but after that I am not able to append the next column using the latest appended column. Appending the columns one by one by hand works, but that is not a good way to do this:
data['zone_pop'] = data['zone_pop'].apply(lambda zone_pop: population(zone_pop))
Please help me with this.
Try the concat function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html).
In the for loop, you should have something like this:
new_data = pandas.DataFrame(population(data[label]), columns=[label_incr])
data = pandas.concat([data, new_data], axis=1)
where label and label_incr are str variables naming the current year's column and the new year's column.
Edit (detailed syntax)
I assume you already have a dataframe data containing a single column 'population_zone' with 10 rows (one per zone), and the rate of change r.
The code below should work (at least, it worked on fake local data):
current_label = 'population_zone'
for i in range(1, 11):
    new_label = 'population_zone_year' + str(i)
    # grow last year's column by one year's rate; using (1 + r/100.)**i here
    # would compound twice, because current_label already includes earlier growth
    new_data = pd.DataFrame((data[current_label] * (1 + r / 100.)).values, columns=[new_label])
    data = pd.concat([data, new_data], axis=1)
    current_label = new_label
If it doesn't work, I probably misunderstood how your data is stored.
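As an alternative to the loop, a sketch on the same assumptions (a 'population_zone' column and a percentage rate r): build all ten year columns at once, compounding from the base year, and concatenate a single time.

import pandas as pd

# one column per future year, compounded directly from the base population
new_cols = {
    'population_zone_year' + str(i): data['population_zone'] * (1 + r / 100.) ** i
    for i in range(1, 11)
}
data = pd.concat([data, pd.DataFrame(new_cols)], axis=1)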
I have a transaction sales dataframe:
print(df)
dt_op quantity cod_id
20/01/18 1 100
20/01/18 8 102
21/01/18 1 100
21/01/18 10 102
...
And I would like to define a new variable "speed" as "cumulative_sales / days_elapsed_since_the_launch_of_that_product", for every distinct item in "cod_id".
I tried with:
start = min(df["dt_op"])
df["running_days"] = (df["dt_op"] - start).astype('timedelta64[D]')
df["csum"] = df.quantity.cumsum()
df["speed"] = df["csum"] / df["running_days"]
But it does not compute the value separately for each item, and I would like to avoid for-loops because of slow running time.
Save the first launch date for every cod_id in a new column with groupby:
df2 = df.groupby('cod_id').dt_op.min().rename('launch_date').reset_index()
and merge it back into your dataframe:
df = pd.merge(df, df2, on='cod_id', how='left')
Then create a new column holding the day difference between each row's date and that launch date. The cumulative sum must also be computed per cod_id (with groupby), and divided by that day difference.
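Putting that together, a sketch assuming dt_op uses day/month/year as in the sample and rows are already sorted by date; the +1 is my assumption to avoid dividing by zero on the launch day itself:

import pandas as pd

df['dt_op'] = pd.to_datetime(df['dt_op'], format='%d/%m/%y')

# first (launch) date per product
launch = df.groupby('cod_id')['dt_op'].min().rename('launch_date').reset_index()
df = pd.merge(df, launch, on='cod_id', how='left')

# days elapsed since launch; +1 so the launch day itself doesn't divide by zero
df['running_days'] = (df['dt_op'] - df['launch_date']).dt.days + 1

# cumulative sales per product, not across the whole frame
df['csum'] = df.groupby('cod_id')['quantity'].cumsum()
df['speed'] = df['csum'] / df['running_days']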
I have census data that looks like this for a full month, and I want to find out how many unique inmates there were over the month. The census is taken daily, so the same person appears multiple times.
_id,Date,Gender,Race,Age at Booking,Current Age
1,2016-06-01,M,W,32,33
2,2016-06-01,M,B,25,27
3,2016-06-01,M,W,31,33
My method now is to group them by day and then add the ones that are not yet accounted for into the DataFrame. My question is how to account for two people with the same info: wouldn't one of them not get added to the new DataFrame because the other already exists? I'm trying to figure out how many people in total were in the jail during this time.
_id is incremental; for example, here is some data from the second day:
2323,2016-06-02,M,B,20,21
2324,2016-06-02,M,B,44,45
2325,2016-06-02,M,B,22,22
2326,2016-06-02,M,B,38,39
link to the dataset here: https://data.wprdc.org/dataset/allegheny-county-jail-daily-census
You could use df.drop_duplicates(), which returns the DataFrame with only unique rows, and then count the entries.
Something like this should work:
import pandas as pd
df = pd.read_csv('inmates_062016.csv', index_col=0, parse_dates=True)
uniqueDF = df.drop_duplicates()
countUniques = len(uniqueDF.index)
print(countUniques)
Result:
>> 11845
Pandas drop_duplicates Documentation
Inmates June 2016 CSV
The problem with this approach/data is that there could be many distinct inmates with the same age/gender/race who would be filtered out.
I think the trick here is to groupby as much as possible and check the differences in those (small) groups through the month:
import pandas as pd

inmates = pd.read_csv('inmates.csv')
# group by everything except _id and count the number of entries
grouped = inmates.groupby(
    ['Gender', 'Race', 'Age at Booking', 'Current Age', 'Date']).count()
# pivot the dates out and transpose - this gives us the number of each
# combination for each day
grouped = grouped.unstack().T.fillna(0)
# get the difference between each day of the month - the assumption here
# being that a negative number means someone left, 0 means that nothing
# has changed and positive means that someone new has come in. As you
# mentioned yourself, that isn't necessarily true
diffed = grouped.diff()
# replace the first day of the month with the grouped numbers to give
# the number in each group at the start of the month
diffed.iloc[0, :] = grouped.iloc[0, :]
# sum only the positive numbers in each row to count those that have
# arrived but ignore those that have left
diffed['total'] = diffed.apply(lambda x: x[x > 0].sum(), axis=1)
# sum total column
diffed['total'].sum() # 3393