I have a dataframe X, where each row is a data point in time and each column is a feature. The label/target variable Y is univariate. One of the columns of X is the lagged values of Y.
The RNN input has the shape (batch_size, n_timesteps, n_features).
From what I've been reading on this site, batch_size should be as big as possible without running out of memory. My main doubt is about n_timesteps and n_features.
I think n_features is the number of columns in the X dataframe.
What about n_timesteps?
Consider the following dataframe with the features temperature, pressure, and humidity:
import pandas as pd
import numpy as np
X = pd.DataFrame(data={
    'temperature': np.random.random(20),
    'pressure': np.random.random(20),
    'humidity': np.random.random(20),
})
print(X.to_markdown())
| | temperature | pressure | humidity |
|---:|--------------:|-----------:|-----------:|
| 0 | 0.205905 | 0.0824903 | 0.629692 |
| 1 | 0.280732 | 0.107473 | 0.588672 |
| 2 | 0.0113955 | 0.746447 | 0.156373 |
| 3 | 0.205553 | 0.957509 | 0.184099 |
| 4 | 0.741808 | 0.689842 | 0.0891679 |
| 5 | 0.408923 | 0.0685223 | 0.317061 |
| 6 | 0.678908 | 0.064342 | 0.219736 |
| 7 | 0.600087 | 0.369806 | 0.632653 |
| 8 | 0.944992 | 0.552085 | 0.31689 |
| 9 | 0.183584 | 0.102664 | 0.545828 |
| 10 | 0.391229 | 0.839631 | 0.00644447 |
| 11 | 0.317618 | 0.288042 | 0.796232 |
| 12 | 0.789993 | 0.938448 | 0.568106 |
| 13 | 0.0615843 | 0.704498 | 0.0554465 |
| 14 | 0.172264 | 0.615129 | 0.633329 |
| 15 | 0.162544 | 0.439882 | 0.0185174 |
| 16 | 0.48592 | 0.280436 | 0.550733 |
| 17 | 0.0370098 | 0.790943 | 0.592646 |
| 18 | 0.371475 | 0.976977 | 0.460522 |
| 19 | 0.493215 | 0.381539 | 0.995716 |
Now, if you want to use this kind of data for time series prediction with an RNN model, you usually consider one row in the dataframe as one timestep. Converting the dataframe into an array might also help you understand what the timesteps are:
print(np.expand_dims(X.to_numpy(), axis=1).shape)
# (20, 1, 3)
First, I obtain an array of shape (20, 3), or in other words, 20 samples where each sample has three features. I then explicitly add a time dimension to the array, resulting in the shape (20, 1, 3): the data set consists of 20 samples, each sample has one timestep, and each timestep has 3 features. Now you can use this data directly as input for an RNN.
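If you instead want each sample to span several consecutive rows (n_timesteps > 1), a common approach is a sliding window over the rows. A minimal sketch continuing the example above (the window length of 5 is my own choice for illustration):
n_timesteps = 5
arr = X.to_numpy()  # shape (20, 3)
windows = np.stack([arr[i:i + n_timesteps]
                    for i in range(len(arr) - n_timesteps + 1)])
print(windows.shape)
# (16, 5, 3) -> 16 samples, 5 timesteps each, 3 features per timestep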
Here is sample 1:
| district_id | date |
| -------- | ----------- |
| 18 | 1995-03-24 |
| 1 | 1993-02-26 |
Sample 2:
| link_id | type |
| -------- | ----------- |
| 9 | gold |
| 19 | classic |
I want to gather sample 1's date column and sample 2's type column and output them as data.csv.
You can concatenate the two columns side by side (axis=1) and then write the result to CSV:
df3 = pandas.concat([df1['date'], df2['type']], axis=1)
df3.to_csv('data.csv')
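For reference, a self-contained sketch that rebuilds the two samples so the snippet above runs as-is (the names df1 and df2 are assumed):
import pandas
df1 = pandas.DataFrame({'district_id': [18, 1],
                        'date': ['1995-03-24', '1993-02-26']})
df2 = pandas.DataFrame({'link_id': [9, 19],
                        'type': ['gold', 'classic']})
df3 = pandas.concat([df1['date'], df2['type']], axis=1)
df3.to_csv('data.csv')  # two columns, date and type, written to data.csv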
Can anyone help me sort these rows into the order the pages were viewed?
I have a dataframe that I am attempting to sort by the previous page viewed, and I am having a really hard time coming up with an efficient method using Pandas.
For example from this:
+------------+------------------+----------+
| Customer | previousPagePath | pagePath |
+------------+------------------+----------+
| 1051471580 | A | D |
| 1051471580 | C | B |
| 1051471580 | A | exit |
| 1051471580 | B | A |
| 1051471580 | D | A |
| 1051471580 | entrance | C |
+------------+------------------+----------+
To this:
+------------+------------------+----------+
| Customer | previousPagePath | pagePath |
+------------+------------------+----------+
| 1051471580 | entrance | C |
| 1051471580 | C | B |
| 1051471580 | B | A |
| 1051471580 | A | D |
| 1051471580 | D | A |
| 1051471580 | A | exit |
+------------+------------------+----------+
However it could be millions of rows long for thousands of different customers so I really need to think how to make this efficient.
import pandas as pd
df = pd.DataFrame({
    'Customer': '1051471580',
    'previousPagePath': ['entrance', 'C', 'B', 'A', 'D', 'A'],
    'pagePath': ['C', 'B', 'A', 'D', 'A', 'exit'],
})
Thanks!
What you're trying to do is topological sorting, which can be achieved with networkx. Note that I had to change some values in your dataframe to prevent it from throwing a cycle error, so I hope the paths in your real data don't revisit pages:
import networkx as nx
import pandas as pd
data = [
    [1051471580, "Z", "D"],
    [1051471580, "C", "B"],
    [1051471580, "A", "exit"],
    [1051471580, "B", "Z"],
    [1051471580, "D", "A"],
    [1051471580, "entrance", "C"],
]
df = pd.DataFrame(data, columns=['Customer', 'previousPagePath', 'pagePath'])
edges = df[df.pagePath != df.previousPagePath].reset_index()
dg = nx.from_pandas_edgelist(edges, source='previousPagePath', target='pagePath', create_using=nx.DiGraph())
order = list(nx.lexicographical_topological_sort(dg))
result = df.set_index('previousPagePath').loc[order[:-1], :].dropna().reset_index()
result = result[['Customer', 'previousPagePath', 'pagePath']]
Output:
| | Customer | previousPagePath | pagePath |
|---:|-----------:|:-------------------|:-----------|
| 0 | 1051471580 | entrance | C |
| 1 | 1051471580 | C | B |
| 2 | 1051471580 | B | Z |
| 3 | 1051471580 | Z | D |
| 4 | 1051471580 | D | A |
| 5 | 1051471580 | A | exit |
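Since you mention millions of rows across thousands of customers, here is a hedged sketch of running the same routine per customer with groupby (the helper name sort_by_path is my own; this still assumes each customer's path is acyclic):
def sort_by_path(group):
    dg = nx.from_pandas_edgelist(group, source='previousPagePath',
                                 target='pagePath', create_using=nx.DiGraph())
    order = list(nx.lexicographical_topological_sort(dg))
    return (group.set_index('previousPagePath')
                 .loc[order[:-1], :].dropna().reset_index())

result = (df.groupby('Customer', group_keys=False)
            .apply(sort_by_path)[['Customer', 'previousPagePath', 'pagePath']])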
You can sort your DataFrame by a column like this:
df = pd.DataFrame({
    'Customer': '1051471580',
    'previousPagePath': ['entrance', 'C', 'B', 'A', 'D', 'A'],
    'pagePath': ['C', 'B', 'A', 'D', 'A', 'exit'],
})
df.sort_values(by='previousPagePath')
You can find the documentation here: pandas.DataFrame.sort_values
I have a MultiIndex Pandas DataFrame like so:
+---+------------------+----------+----------------------------+------------------------------+
|   | VECTOR           | SEGMENTS |          OVERALL           |          INDIVIDUAL          |
|   |                  |          | TIP X | TIP Y  | CURVATURE | TIP X  | TIP Y   | CURVATURE |
+---+------------------+----------+-------+--------+-----------+--------+---------+-----------+
| 0 | (TOP, TOP)       |        2 | 3.24  |  1.309 |        44 | 1.62   |  0.6545 |        22 |
| 1 | (TOP, BOTTOM)    |        2 | 3.495 |  0.679 |        22 | 1.7475 |  0.3395 |        11 |
| 2 | (BOTTOM, TOP)    |        2 | 3.495 | -0.679 |       -22 | 1.7475 | -0.3395 |       -11 |
| 3 | (BOTTOM, BOTTOM) |        2 | 3.24  | -1.309 |       -44 | 1.62   | -0.6545 |       -22 |
+---+------------------+----------+-------+--------+-----------+--------+---------+-----------+
How can I drop duplicates based on all columns contained under 'OVERALL' or 'INDIVIDUAL'? That is, if I choose 'INDIVIDUAL' to drop duplicates from, the values of TIP X, TIP Y, and CURVATURE under INDIVIDUAL must all match for a row to count as a duplicate.
Further, as you can see from the table, rows 1 and 2 are duplicates that are simply mirrored about the x-axis. These must also be dropped.
Also, can I center the OVERALL and INDIVIDUAL headings?
EDIT: frame.drop_duplicates(subset=['INDIVIDUAL'], inplace=True) produces KeyError: Index(['INDIVIDUAL'], dtype='object')
You can pass a subset of tuples to pandas' .drop_duplicates for multi-indexed columns:
df.drop_duplicates(subset=[
('INDIVIDUAL', 'TIP X'),
('INDIVIDUAL', 'TIP Y'),
('INDIVIDUAL', 'CURVATURE')
])
Or, if your row indices are unique, you could use the following approach that saves some typing:
df.loc[df['INDIVIDUAL'].drop_duplicates().index]
Update:
As you suggested in the comments, if you want to do operations on the dataframe you can do that in-line:
df.loc[df['INDIVIDUAL'].abs().drop_duplicates().index]
Or for non-pandas functions, you can use .transform:
df.loc[df['INDIVIDUAL'].transform(np.abs).drop_duplicates().index]
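To make the mirrored-duplicates requirement concrete, here is a self-contained sketch that rebuilds the frame from the question and combines the two ideas above:
import pandas as pd

cols = pd.MultiIndex.from_tuples([
    ('VECTOR', ''), ('SEGMENTS', ''),
    ('OVERALL', 'TIP X'), ('OVERALL', 'TIP Y'), ('OVERALL', 'CURVATURE'),
    ('INDIVIDUAL', 'TIP X'), ('INDIVIDUAL', 'TIP Y'), ('INDIVIDUAL', 'CURVATURE'),
])
df = pd.DataFrame([
    [('TOP', 'TOP'),       2, 3.24,  1.309,  44, 1.62,    0.6545,  22],
    [('TOP', 'BOTTOM'),    2, 3.495, 0.679,  22, 1.7475,  0.3395,  11],
    [('BOTTOM', 'TOP'),    2, 3.495, -0.679, -22, 1.7475, -0.3395, -11],
    [('BOTTOM', 'BOTTOM'), 2, 3.24,  -1.309, -44, 1.62,   -0.6545, -22],
], columns=cols)

# .abs() folds the mirrored rows onto each other, so drop_duplicates removes
# exact duplicates and sign-flipped duplicates of the INDIVIDUAL block alike
print(df.loc[df['INDIVIDUAL'].abs().drop_duplicates().index])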
I am using the Agate library to create a table.
Using the command:
table = agate.Table(cpi_rows, cpi_types, cpi_titles)
Sample values are as below:
cpi_rows[0]
[1.0,'Denmark','DNK',128.0,'EU',1.0,91.0,7.0,2.2,87.0,95.0,83.0,98.0,0.0,97.0,0.0,96.0,98.0,0.0,87.0,89.0,88.0,83.0,0.0,0.0,0.0]
cpi_titles
['Country Rank','Country / Territory','WB Code','IFS Code','Region','Country Rank','CPI 2013 Score', 'Surveys Used','Standard Error', '90% Confidence interval Lower', 'Upper','Scores range MIN','MAX','Data sources AFDB','BF (SGI)','BF (BTI)','IMD','ICRG','WB','WEF','WJP','EIU','GI','PERC','TI','FH']
When I run the command, I get the error:
ValueError: Column names must be strings or None.
Though all the names in cpi_titles are strings, I am unable to find the cause of the error.
Just tried your code, and apart from a few corrections to names this worked without a problem. The likely cause of your error: agate.Table's signature is Table(rows, column_names, column_types), so your call passes cpi_types where agate expects the column names, and the type objects fail the "strings or None" check.
cpi_rows = [[]]
cpi_rows[0] =[1.0,'Denmark','DNK',128.0,'EU',1.0,91.0,7.0,2.2,87.0,95.0,83.0,98.0,0.0,97.0,0.0,96.0,98.0,0.0,87.0,89.0,88.0,83.0,0.0,0.0,0.0]
cpi_titles = ['Country Rank','Country / Territory','WB Code','IFS Code','Region','Country Rank','CPI 2013 Score', 'Surveys Used','Standard Error', '90% Confidence interval Lower', 'Upper','Scores range MIN','MAX','Data sources AFDB','BF (SGI)','BF (BTI)','IMD','ICRG','WB','WEF','WJP','EIU','GI','PERC','TI','FH']
table = agate.Table(cpi_rows, cpi_titles)
table.print_structure()
The output is
| column | data_type |
| ----------------------------- | --------- |
| Country Rank | Number |
| Country / Territory | Text |
| WB Code | Text |
| IFS Code | Number |
| Region | Text |
| Country Rank_2 | Number |
| CPI 2013 Score | Number |
| Surveys Used | Number |
| Standard Error | Number |
| 90% Confidence interval Lower | Number |
| Upper | Number |
| Scores range MIN | Number |
| MAX | Number |
| Data sources AFDB | Number |
| BF (SGI) | Number |
| BF (BTI) | Number |
| IMD | Number |
| ICRG | Number |
| WB | Number |
| WEF | Number |
| WJP | Number |
| EIU | Number |
| GI | Number |
| PERC | Number |
| TI | Number |
| FH | Number |
Obviously, I don't have your definition of the types you want to apply to this data. The only other thing to note is that you have defined Country Rank twice in your column titles, so agate warns you about this and relabels it.
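If you do want to supply explicit types, keep the argument order rows, names, types. A minimal sketch, assuming the inferred types printed above are what you intend (this cpi_types list is my own construction, not from the question):
import agate
text = agate.Text()
number = agate.Number()
# 26 columns: Number, Text, Text, Number, Text, then 21 Numbers
cpi_types = [number, text, text, number, text] + [number] * 21
table = agate.Table(cpi_rows, cpi_titles, cpi_types)
table.print_structure()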
This is a complicated one, but I suspect there's some principle I can apply to make it simple - I just don't know what it is.
I need to parcel out presentation slots to a class full of students for the semester. There are multiple possible dates, and multiple presentation types. I conducted a survey where students could rank their interest in the different topics. What I'd like to do is get the best (or at least a good) distribution of presentation slots to students.
So, what I have:
List of 12 dates
List of 18 students
CSV file where each student (row) has a rating 1-5 for each date
What I'd like to get:
Each student should have one of presentation type A (intro), one of presentation type B (figures) and 3 of presentation type C (aims)
Each date should have at least 1 of each type of presentation
Each date should have no more than 2 of type A or type B
Try to give students presentations that they rated highly (4 or 5)
I should note that I realize this looks like a homework problem, but it's real life :-). I was thinking that I might make a Student class for each student that contains the dates for each presentation type, but I wasn't sure what the best way to populate it would be. Actually, I'm not even sure where to start.
TL;DR: I think you're giving your students too much choice :D
But I had a shot at this problem anyway. Pretty fun exercise actually, although some of the constraints were a little vague. Most of all, I had to guess what the actual students' preference distribution would look like. I went with uniformly distributed, independent variables, although that's probably not very realistic. Still I think it should work just as well on real data as it does on my randomly generated data.
I considered brute forcing it, but a rough analysis gave me an estimate of over 10^65 possible configurations. That's kind of a lot. And since we don't have a trillion trillion years to consider all of them, we'll need a heuristic approach.
Because of the size of the problem, I tried to avoid doing any backtracking. But this meant that you could get stuck; there might not be a solution where everyone only gets dates they gave 4's and 5's.
I ended up implementing a double-edged Iterative Deepening-like search, where both the best case we're still holding out hope for (i.e., assign students to a date they gave a 5) and the worst case scenario we're willing to accept (some student might have to live with a 3) are gradually lowered until a solution is found. If we get stuck, reset, lower expectations, and try again. Tasks A and B are assigned first, and C is done only after A and B are complete, because the constraints on C are far less stringent.
I also used a weighting factor to model the trade-off between maximizing student happiness and satisfying the types-of-presentations-per-day limits.
Currently it seems to find a solution for pretty much every randomly generated set of preferences. I included an evaluation metric: the ratio between the sum of the preference values of all assigned student/date combos and the sum of each student's ideal (top 3) preference values. For example, if student X had two fives, one four and the rest threes on his list, and is assigned to one of his fives and two threes, he gets 5+3+3=11 but could ideally have gotten 5+5+4=14; he is 11/14 = 78.6% satisfied.
After some testing, it seems that my implementation tends to produce an average student satisfaction of around 95%, a lot better than I expected :) But again, that is with fake data. Real preferences are probably more clumped and harder to satisfy.
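As a quick illustration of that metric, a minimal sketch (the helper name and the top-3 assumption simply follow the worked example above):
def satisfaction(assigned_ratings, all_ratings):
    # achieved preference sum over the best possible (top 3) preference sum
    ideal = sum(sorted(all_ratings, reverse=True)[:3])
    return sum(assigned_ratings) / float(ideal)

# student X: two fives, one four, the rest threes; assigned one 5 and two 3s
print(satisfaction([5, 3, 3], [5, 5, 4] + [3] * 9))  # 0.7857... -> 78.6%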
Below is the core of the algorithm. The full script is ~250 lines and a bit too long for here, I think. Check it out at Github.
...
# Assign a date for a given task to each student,
# preferring a date that they like and is still free.
def fill(task, lowest_acceptable, spread_weight=0.1, tasks_to_spread="ABC"):
    random_order = list(range(nStudents))  # randomize student order, so everyone
    random.shuffle(random_order)           # has an equal chance to get their first pick
    for i in random_order:
        student = students[i]
        if student.dates[task]:  # student is already assigned for this task?
            continue
        # get available dates ordered by preference and how fully booked they are
        preferred = get_favorite_day(student, lowest_acceptable,
                                     spread_weight, tasks_to_spread)
        for date_nr in preferred:
            date = dates[date_nr]
            if date.is_available(task, student.count, lowest_acceptable == 1):
                date.set_student(task, student.count)
                student.dates[task] = date
                break

# attempt to "fill()" the schedule while gradually lowering expectations
start_at = 5
while start_at > 1:
    lowest_acceptable = start_at
    while lowest_acceptable > 0:
        fill("A", lowest_acceptable, spread_weight, "AAB")
        fill("B", lowest_acceptable, spread_weight, "ABB")
        if lowest_acceptable == 1:
            fill("C", lowest_acceptable, spread_weight_C, "C")
        lowest_acceptable -= 1
    start_at -= 1  # still stuck: lower the best case we hope for and retry
And here is an example result as printed by the script:
Date
================================================================================
Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
================================================================================
1 | | A | B | | C | | | | | | | |
2 | | | | | A | | | | | B | C | |
3 | | | | | B | | | C | | A | | |
4 | | | | A | | C | | | | | | B |
5 | | | C | | | | A | B | | | | |
6 | | C | | | | | | | A | B | | |
7 | | | C | | | | | B | | | | A |
8 | | | A | | C | | B | | | | | |
9 | C | | | | | | | | A | | | B |
10 | A | B | | | | | | | C | | | |
11 | B | | | A | | C | | | | | | |
12 | | | | | | A | C | | | | B | |
13 | A | | | B | | | | | | | | C |
14 | | | | | B | | | | C | | A | |
15 | | | A | C | | B | | | | | | |
16 | | | | | | A | | | | C | B | |
17 | | A | | C | | | B | | | | | |
18 | | | | | | | C | A | B | | | |
================================================================================
Total student satisfaction: 250/261 = 95.79%