My dataset is in an odd format and I have no clue how to fix it (I have tried and read a lot of similar questions but to no avail). Each column is a firm name (e.g. AAPL, AMZN, FB) and the first row is a list of each category of data. Basically each column has a firm name, then the entry below is a code (e.g. trading volume, market value, price), and then the respective data with an index of dates (monthly). How can I appropriately manipulate this so I can filter the data for a panel regression? Example: using each column of trading volume regressed on each column of earnings per share?
It sounds like you may need to learn how to select columns from Pandas MultiIndex, and perhaps how to create a MultiIndex. You may also benefit from learning how to reshape your data in order to run your panel regression.
If you provide a small sample of your data with the correct format, it will be easier to provide more specifics.
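For instance, a minimal sketch of reading the two header rows (firm name on top, data code below) into a column MultiIndex and reshaping it. The file name and the field codes 'volume' and 'eps' are assumptions standing in for your actual data:

import pandas as pd

# two header rows become a (firm, field) column MultiIndex; dates form the index
df = pd.read_csv('firms.csv', header=[0, 1], index_col=0, parse_dates=True)
df.columns.names = ['firm', 'field']

# select one field across all firms, e.g. trading volume
volume = df.xs('volume', axis=1, level='field')

# stack the firm level into the index: one row per (date, firm), one column per field
panel = df.stack(level='firm')

With the stacked frame, regressing trading volume on earnings per share becomes a regression of panel['volume'] on panel['eps'].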
I have a pandas DataFrame with an index and just one column. The index holds dates and the column holds values. I would like to find the NewValue for a NewDate that is not in the index. To do that, I suppose I could use an interpolation function such as NewValue = InterpolationFunction(NewDate, Index, Column, method, etc.). So what is the InterpolationFunction? It seems that most interpolation functions are used for padding, i.e. filling in missing values. This is not what I want. I just want the NewValue, not to build a new DataFrame or anything like that.
Thank you very very much for taking the time to read this.
I am afraid that you cannot find the missing values without constructing a base for your data. Here is the answer to your question, assuming you make a base dataframe:
You need to construct a panel in order to set up your data for proper interpolation. For instance, in the case of dates, let's say that you have yearly data and you want to add information for a missing year in between, or generate data at quarterly intervals.
You need to construct a base for the time series, i.e.:
import pandas as pd

# quarterly date grid covering the whole observation window
dates = pd.date_range(start="1987-01-01", end="2021-01-01", freq="Q").values
panel = pd.DataFrame({'date_Q': dates})
Now you can join your data to this structure, and the dates without information will appear as missing. You then need a proper interpolation algorithm to fill the missing values. The pandas .interpolate() method offers some basic interpolation methods, such as polynomial and linear, documented in the pandas reference.
However, many more ways of interpolating are offered by SciPy, covered in the SciPy interpolation tutorials.
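To get just the one NewValue the question asks for, here is a minimal sketch: insert the new date into the index, interpolate on the time axis, and read the value off. The series values below are made up for illustration:

import pandas as pd

s = pd.Series([1.0, 2.0, 4.0],
              index=pd.to_datetime(['2020-01-01', '2020-02-01', '2020-04-01']))
new_date = pd.Timestamp('2020-03-01')

new_value = (
    s.reindex(s.index.union([new_date]))   # add the new date with a NaN value
     .interpolate(method='time')           # time-weighted linear interpolation
     .loc[new_date]                        # read off just the one value
)

The intermediate reindexed series is thrown away; only the single interpolated value is kept.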
I am trying to build a binary classifier to predict the propensity of customers transitioning from one account to another.
I have age, gender and customer-segment data, but also a time series of their bank balances for the last 18 months on a monthly basis, and a lot of high-cardinality categorical variables.
So, what I want to know is: how do I transform the time-series data so it's in a more compact, static form like the rest of the data points (age, gender, etc.)? Or can I just throw it into the algorithm as is?
Sample data may look like this:
customer number, age, gender, marital status code, 18mth-bal, 17mth-bal,...,3mth-bal, postcode-segment ..
Any help would be fantastic! Thank you.
I would generate descriptive statistics for each time series. Standard deviation seems interesting, but you could also use percentiles, mean, min and max... or all of them.
import numpy as np

# add a column with the standard deviation of each customer's balance history;
# np.std is applied to the list/array stored in each 'balance' cell
df['standard_deviation'] = df['balance'].apply(np.std)
This works if the content of a 'balance' cell is a list or an array (of the last 18 amounts known for that customer, for example).
Sorry I can't be more specific as I can't see your dataframe, but hope it helps!
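If instead the balances sit in wide columns as in your sample row (18mth-bal, 17mth-bal, ..., 3mth-bal), a row-wise aggregation gives the same kind of summary. A sketch, assuming the balance columns all end in 'mth-bal':

# collect the wide balance columns and summarise them per customer (row-wise)
bal_cols = [c for c in df.columns if c.endswith('mth-bal')]
df['bal_mean'] = df[bal_cols].mean(axis=1)
df['bal_std'] = df[bal_cols].std(axis=1)
df['bal_min'] = df[bal_cols].min(axis=1)
df['bal_max'] = df[bal_cols].max(axis=1)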
I'm working on a big data project for my school project. My dataset looks like this: https://github.com/gindeleo/climate/blob/master/GlobalTemperatures.csv
I'm trying to predict the next values of "LandAverageTemperature".
I've asked another question about this topic earlier (it's here: How to predict correctly in sklearn RandomForestRegressor?) but couldn't get any answer to it. After not getting anything in my first question and then failing for another day, I've decided to start from scratch.
Right now, I want to know which value in my dataset is "x", so I can make the prediction correctly. I read that y is the dependent variable that I want to predict and x is the independent variable that I should use as the "predictor" to help the prediction process. In that case my y variable is "LandAverageTemperature". I don't know what the x value is. I was using date values for x at first, but I'm not sure that is right.
And if I shouldn't use RandomForestRegressor or sklearn for this dataset (I started this project with Spark), please let me know. Thanks in advance.
You only have one variable (LandAverageTemperature), so obviously that's what you're going to use. What you're looking for is the df.shift() function, which shifts your values. With this function, you'll be able to add columns of past values to your dataframe. You will then be able to use the temperature 1 month/day ago, 2 months/days ago, etc., as predictors of another day/month's temperature.
You can use it like this:
for i in range(1, 15):
    df.loc[:, 'T-%s' % i] = df.loc[:, 'LandAverageTemperature'].shift(i)
Your columns will then be the temperature, and the temperature at T-1, T-2, and so on, up to 14 time periods back.
For your question about what is a proper model for time series forecasting, it would be off-topic for this site, but many resources exist on https://stats.stackexchange.com.
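To tie this back to the question, a minimal sketch of fitting a RandomForestRegressor on those lagged columns (assuming the shift loop above has already been run on the dataframe):

from sklearn.ensemble import RandomForestRegressor

# shift() leaves NaNs in the first 14 rows, so drop incomplete rows first
lag_cols = ['T-%s' % i for i in range(1, 15)]
lagged = df[['LandAverageTemperature'] + lag_cols].dropna()

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(lagged[lag_cols], lagged['LandAverageTemperature'])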
In the general case, you can use all data columns except your target column as the X feature matrix. But in your case there are several complications:
You have missing (empty) data in most of the columns for many years. You can exclude such rows/years from the training data, or exclude the columns with missing data (which would be almost all of your columns, so that's not a good option).
A regression model can't use date fields directly; you should translate the date field into one or more numerical fields, for example "months past the first observation": something like (year - 1750) * 12 + month. Or/and you can keep year and month in separate columns (which is better if there is some "seasonality" in your data); see the sketch after this list.
You have sequential time data here, so maybe you should not use a simple regression at all. Consider ARIMA/SARIMA/SARIMAX and other so-called time-series (TS) models, which predict the target sequentially, one value after another, month after month in your case. It's a hard topic to learn, but you should definitely take a look at TS, because you will need it some time in the future if not today.
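A sketch of the date translation from the second point, assuming the date column in GlobalTemperatures.csv is named 'dt':

import pandas as pd

df['dt'] = pd.to_datetime(df['dt'])
df['months_since_1750'] = (df['dt'].dt.year - 1750) * 12 + df['dt'].dt.month
df['year'] = df['dt'].dt.year
df['month'] = df['dt'].dt.month   # a separate month column helps capture seasonality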
I am working on a transactions data frame using Python (Anaconda), and I was told to aggregate the data to a weekly level so that there is one row per product-week combination.
I want to make sure the following code is correct, because I don't think I fully understood what I need to do:
dataset.groupby(['id', dataset['history_date'].dt.strftime('%W')])['sales'].sum()
Note that my dataset contains the following columns:
id history_date item_id price inventory sales category_id
Aggregating data means combining rows based on a certain criterion, to narrow the data down.
For example, it sounds like your dataset may be broken down by daily dates, where each row corresponds to a specific date.
What you need to do is aggregate the data into weekly segments, instead of having it broken down on a daily basis.
This is achieved by grouping your dataset by the date and by the most granular/detailed/specific identifier in it, which here is the product; see the sketch below.
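Since the goal is one row per product-week, a sketch grouping by item_id rather than id (assuming history_date is already a datetime column):

import pandas as pd

# one row per product-week: sum sales within each weekly bucket per item
weekly = (
    dataset
    .groupby(['item_id', pd.Grouper(key='history_date', freq='W')])['sales']
    .sum()
    .reset_index()
)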
Excel newbie and aspiring data analyst here. I have this data and I want to find the distribution of city-wise shopping experience. Column M has the shopping experience rated from 1 to 5.
What I tried
I am not able to find anything by googling how to do this. I tried running a correlation, but the built-in Excel data-analysis tool does not let me run it on non-numeric data, and I am not able to group the city cells either. I thought of replacing every city with a numeric alias, but I don't know how to do that either. How should I search for, or approach, this problem?
Update: I was thinking of some way to get this out of the cities column.
I am thinking this is better done in Python.
How about something like this? I have just taken the cities and data to show AVERAGEIF, SUMIF and COUNTIF:
I used Data Validation to provide the list to select from.
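Since you mention Python, here is a pandas sketch of the same distribution. The file name and the column names 'City' and 'Experience' are assumptions standing in for your actual data:

import pandas as pd

df = pd.read_excel('survey.xlsx')   # hypothetical file name

# count of each rating (1-5) per city
dist = df.groupby('City')['Experience'].value_counts().unstack(fill_value=0)

# average rating and number of responses per city
stats = df.groupby('City')['Experience'].agg(['mean', 'count'])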