Pandas creating data frame from matrix in dictionary - python

I have a dictionary Dict1 with keys as Dates and Sims.
Dates is an array with shape 100x1 and Sims has shape 100x5
I am trying:
import pandas as pd
df = pd.DataFrame.from_dict(Dict1)
but it errors out because of the shape of Sims. Is there a way to create the DataFrame so that each row of the Sims column is stored as a list or array of size 5?
Edit:
Dict1['Dates']
array([datetime.datetime(2016, 11, 1, 0, 0),
datetime.datetime(2016, 11, 1, 1, 0),
datetime.datetime(2016, 11, 1, 2, 0), ...,
datetime.datetime(2025, 12, 31, 21, 0),
datetime.datetime(2025, 12, 31, 22, 0),
datetime.datetime(2025, 12, 31, 23, 0)], dtype=object)
Dict1['Sims']
array([[63.89694316, 35.8551162 , 40.36134283, 57.23648392, 35.96607425, 61.166471  ],
       [47.94894386, 53.95396849, 48.94336457, 51.04541849, 28.69973176, 49.78683505],
       [63.90314179, 43.29467789, 36.97811714, 52.33639618, 45.24190878, 69.9059308 ],
       ...])
Edit2:
I am looking to create the dataframe such that I can perform the following operation:
print(df[datetime.datetime(2016, 11, 1, 0, 0)])
[63.89694316, 35.8551162, 40.36134283, 57.23648392, 35.96607425, 61.166471]

You can use your Dict1['Dates'] as the index.
df = pd.DataFrame(Dict1['Sims'], index=Dict1['Dates'])
df.loc[datetime.datetime(2016, 11, 1, 0, 0)]
Note that you should use the df.loc[key] indexer (the older df.ix has been deprecated and removed from pandas), since df[key] defaults to looking up a column, not a row.
Alternatively, if you really want a single column containing lists, make sure that Dict1['Sims'] is a Python list, not a Numpy array before creating your data frame.
df = pd.DataFrame({'Sims': Dict1['Sims'].tolist()}, index=Dict1['Dates'])
The {'Sims': ...} construct tricks Pandas into interpreting the data as a single series of lists, rather than a multi-dimensional array.
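A runnable sketch of both approaches, using small made-up numbers in place of the real 100-row arrays:

```python
import datetime
import numpy as np
import pandas as pd

# Small hypothetical stand-ins for the real Dict1.
dates = np.array([datetime.datetime(2016, 11, 1, h, 0) for h in range(3)], dtype=object)
sims = np.array([[63.9, 35.9, 40.4, 57.2, 36.0],
                 [47.9, 54.0, 48.9, 51.0, 28.7],
                 [63.9, 43.3, 37.0, 52.3, 45.2]])
Dict1 = {'Dates': dates, 'Sims': sims}

# Approach 1: one column per simulation, dates as the index.
df = pd.DataFrame(Dict1['Sims'], index=Dict1['Dates'])
row = df.loc[datetime.datetime(2016, 11, 1, 0, 0)]  # a Series of length 5

# Approach 2: a single 'Sims' column whose cells are Python lists.
df2 = pd.DataFrame({'Sims': Dict1['Sims'].tolist()}, index=Dict1['Dates'])
cell = df2.loc[datetime.datetime(2016, 11, 1, 0, 0), 'Sims']  # a list of 5 floats
```

Either way, a row lookup by datetime returns the 5 simulation values for that timestamp.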

Related

Values in numpy matrix based on an array using index and value of each element

I would like to use a numpy array to index a numpy matrix, with the values of the array indicating the columns and their indices indicating the corresponding row numbers. As an example, I have a numpy matrix,
a = np.tile(np.arange(1920), (41, 1))
>>> [[0, 1, 2, ..., 1919]
[0, 1, 2, ..., 1919]
...
[0, 1, 2, ..., 1919]]
b = np.arange(40, -1, -1) # We want to do a[b] in the most efficient way.
What I would like to get is an array c which is,
c = [40, 39, 38, ..., 0]
That is, I want to use b to get the following indices from a,
[(0, b[0]), (1, b[1]), ... (40, b[40])] # 0th row b[0]th column, 1st row b[1]th column...
How do I do this, and what is the most efficient way to do this?
You can use advanced indexing with 2 integer arrays. For the 0th axis ("the rows"), you simply use 0..n (can be generated using np.arange) with n the length of b. For the 1st axis ("the columns"), you use b:
import numpy as np
# Setup:
a = np.tile(np.arange(1920), (41, 1))
b = np.arange(40, -1, -1)
# Solution:
c = a[np.arange(len(b)), b]
c:
array([40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24,
23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7,
6, 5, 4, 3, 2, 1, 0])
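An equivalent way to express the same per-row pick, if you prefer a named API over fancy indexing, is np.take_along_axis; a sketch (not faster, just a different spelling):

```python
import numpy as np

a = np.tile(np.arange(1920), (41, 1))
b = np.arange(40, -1, -1)

# Fancy indexing: pair row i with column b[i].
c = a[np.arange(len(b)), b]

# Same pick via take_along_axis: b reshaped to one column index per row.
c2 = np.take_along_axis(a, b[:, None], axis=1).ravel()
```

Both produce the array [40, 39, ..., 0] shown above.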

How to Recurrently Transpose A Series/List/Array

I have an array/list/pandas Series:
np.arange(15)
Out[11]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
What I want is:
[[0,1,2,3,4],
[1,2,3,4,5],
[2,3,4,5,6],
...
[10,11,12,13,14]]
That is, repeatedly slide a window over this column to build a 5-column matrix.
The reason is that I am doing feature engineering on a column of temperature data. I want to use the last 5 data points as features and the next one as the target.
What's the most efficient way to do that? My data is large.
If the array is formatted like this :
arr = np.array([1,2,3,4,5,6,7,8,....])
You could try it like this (np.array rather than np.matrix, which only supports 2-D and would choke on the extra pair of brackets in any case):
recurr_transpose = np.array([arr[i:i+5] for i in range(len(arr)-4)])
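For large data, a stride-based view avoids the Python-level loop entirely; numpy 1.20+ exposes this as sliding_window_view. A sketch:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

arr = np.arange(15)

# All length-5 windows as a zero-copy view: shape (11, 5).
windows = sliding_window_view(arr, 5)

# If the feature/target split is wanted: first 4 columns as
# features, last column as target (one possible convention).
X, y = windows[:, :-1], windows[:, -1]
```

Because the result is a view, no data is copied until you write to or reshape it.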

How to change the date format of the whole column?

I am analyzing a .csv file whose first column is a datetime in the format "2016-09-15T00:00:13", and I want to change this format to a standard python datetime object. I can do the conversion for a single date, but not for the whole column.
My code that I am using:
import numpy
import dateutil.parser
mydate = dateutil.parser.parse(numpy.mydata[1:,0])
print(mydate)
I am getting the error:
'module' object has no attribute 'mydata'
Here is the column for which I want the format to be changed.
print(mydata[1:,0])
['2016-09-15T00:00:13'
'2016-09-15T00:00:38'
'2016-09-15T00:00:53'
...,
'2016-09-15T23:59:28'
'2016-09-15T23:59:37'
'2016-09-15T23:59:52']
from datetime import datetime
for date in mydata:
    date_object = datetime.strptime(date, '%Y-%m-%dT%H:%M:%S')
Here's a link to the method I'm using. That same link also lists the format arguments.
Oh and about the
'module' object has no attribute 'mydata'
You call numpy.mydata, which is a reference to a "mydata" attribute of the numpy module you imported. The problem is that "mydata" is just one of your variables, not something included with numpy.
Unless you have a compelling reason to avoid it, pandas is the way to go with this kind of analysis. You can simply do
import pandas
df = pandas.read_csv('myfile.csv', index_col=0, parse_dates=True)
This will use the first column as the index and parse the dates in it (parse_dates=True parses the index, so index_col is needed here). This is probably what you want.
Assuming you've dealt with that numpy.mydata[1:,0] attribute error
Your data looks like:
In [268]: mydata=['2016-09-15T00:00:13' ,
...: '2016-09-15T00:00:38' ,
...: '2016-09-15T00:00:53' ,
...: '2016-09-15T23:59:28' ,
...: '2016-09-15T23:59:37' ,
...: '2016-09-15T23:59:52']
or in array form it is a 1-d array of strings
In [269]: mydata=np.array(mydata)
In [270]: mydata
Out[270]:
array(['2016-09-15T00:00:13', '2016-09-15T00:00:38', '2016-09-15T00:00:53',
'2016-09-15T23:59:28', '2016-09-15T23:59:37', '2016-09-15T23:59:52'],
dtype='<U19')
numpy has its own datetime dtype (datetime64), stored as a 64-bit integer, which can be used numerically. Your dates readily convert to it with astype (your format is standard):
In [271]: D = mydata.astype(np.datetime64); D
Out[271]:
array(['2016-09-15T00:00:13', '2016-09-15T00:00:38', '2016-09-15T00:00:53',
'2016-09-15T23:59:28', '2016-09-15T23:59:37', '2016-09-15T23:59:52'],
dtype='datetime64[s]')
tolist converts this array to a list - and the dates to datetime objects:
In [274]: D.tolist()
Out[274]:
[datetime.datetime(2016, 9, 15, 0, 0, 13),
datetime.datetime(2016, 9, 15, 0, 0, 38),
datetime.datetime(2016, 9, 15, 0, 0, 53),
datetime.datetime(2016, 9, 15, 23, 59, 28),
datetime.datetime(2016, 9, 15, 23, 59, 37),
datetime.datetime(2016, 9, 15, 23, 59, 52)]
which could be turned back into an array of dtype object:
In [275]: np.array(D.tolist())
Out[275]:
array([datetime.datetime(2016, 9, 15, 0, 0, 13),
datetime.datetime(2016, 9, 15, 0, 0, 38),
datetime.datetime(2016, 9, 15, 0, 0, 53),
datetime.datetime(2016, 9, 15, 23, 59, 28),
datetime.datetime(2016, 9, 15, 23, 59, 37),
datetime.datetime(2016, 9, 15, 23, 59, 52)], dtype=object)
These objects couldn't be used in array calculations. The list would be just as useful.
If your string format wasn't standard you'd have to use the datetime parser in a list comprehension as #staples shows.

How do I pad values with dates as the index using python pandas?

I have the following list with a group of dicts:
results = [{'timestamp': datetime.datetime(2016, 1, 15, 0, 0, tzinfo=<UTC>),
'value1':100,
'value2':200}, ... ]
I'm using pandas to pad these results between two utc dates, from_date and to_date with a week of the year frequency. The rest of the values should be 0:
generated_dates = pd.date_range(start=from_date, end=to_date,
freq=freq['W'], tz='UTC', normalize=True)
I'm trying to combine the two lists now, in order. So I create two DataFrames so I can do that:
results_df = pd.DataFrame(results)
generated_dates_df = pd.DataFrame(generated_dates, columns=results_df, index=generated_dates)
generated_dates_df.fillna(0, inplace=True)
I then concatenate the data:
pd.concat([results_df, generated_dates_df])
I'm expecting this:
[{'timestamp': datetime.datetime(2015, 1, 1, 0, 0, tzinfo=<UTC>),
'value1':0,
'value2':0},
...
{'timestamp': datetime.datetime(2016, 1, 15, 0, 0, tzinfo=<UTC>),
'value1':100,
'value2':200},
{'timestamp': datetime.datetime(2016, 1, 22, 0, 0, tzinfo=<UTC>),
'value1':0,
'value2':0},
...,
]
But I keep getting TypeError: data type not understood
Is there something I'm missing?
If by "order" you mean sorted by a date index, I think you want join on the two DataFrames (with the timestamps as the index):
ndf = results_df.join(generated_dates_df, how="outer")
You might have to use "lsuffix" or "rsuffix" if column names overlap.
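A simpler route to the padded result is reindex with fill_value=0; a sketch, assuming results can be indexed by its timestamp column (dates and frequency below are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the real results list.
results = [{'timestamp': pd.Timestamp('2016-01-15', tz='UTC'),
            'value1': 100, 'value2': 200}]

results_df = pd.DataFrame(results).set_index('timestamp')
generated_dates = pd.date_range(start='2016-01-01', end='2016-01-29',
                                freq='W-FRI', tz='UTC', normalize=True)

# Rows present in results keep their values; missing weeks become 0.
padded = results_df.reindex(results_df.index.union(generated_dates),
                            fill_value=0)
```

This avoids building a zero-filled DataFrame by hand and concatenating.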

Pandas : first datetime field gets automatically converted to timestamp type

When creating a pandas dataframe object (python 2.7.9, pandas 0.16.2), the first datetime field gets automatically converted into a pandas timestamp. Why? Is it possible to prevent this so as to keep the field in the original type?
Please see code below:
import numpy as np
import datetime
import pandas
create a dict:
x = {'cusip': np.array(['10553M10', '67085120', '67085140'], dtype='|S8'),
'vstart':np.array([datetime.datetime(2001, 11, 16, 0, 0),
datetime.datetime(2012, 2, 28, 0, 0), datetime.datetime(2014, 12, 22, 0, 0)],
dtype=object),
'vstop': np.array([datetime.datetime(2012, 2, 28, 0, 0),
datetime.datetime(2014, 12, 22, 0, 0), datetime.datetime(9999, 12, 31, 0, 0)],
dtype=object),
'id': np.array(['EQ0000000000041095', 'EQ0000000000041095', 'EQ0000000000041095'],
dtype='|S18')}
So, the vstart and vstop keys are datetime so far. However, after:
df = pandas.DataFrame(data = x)
the vstart becomes a pandas Timestamp automatically while vstop remains a datetime
type(df.vstart[0])
#class 'pandas.tslib.Timestamp'
type(df.vstop[0])
#type 'datetime.datetime'
I don't understand why the first datetime column that the constructor comes across gets converted to Timestamp by pandas. And how to tell pandas to keep the data types as they are. Can you help? Thank you.
Actually, I've noticed something in your data; it has nothing to do with the column being first or second. In your vstop column there is a datetime with the value dt.datetime(9999, 12, 31, 0, 0). That date is beyond what pandas' nanosecond-resolution Timestamp can represent (pd.Timestamp.max falls in the year 2262), so pandas leaves the whole column as plain datetime objects instead of converting it. If you change the year on this date to a normal year, like 2020, both columns are treated the same.
just note that I'm importing datetime module as dt
x = {'cusip': np.array(['10553M10', '67085120', '67085140'], dtype='|S8'),
'vstop': np.array([dt.datetime(2012, 2, 28, 0, 0), dt.datetime(2014, 12, 22, 0, 0), dt.datetime(2020, 12, 31, 0, 0)], dtype=object),
'vstart': np.array([dt.datetime(2001, 11, 16, 0, 0),dt.datetime(2012, 2, 28, 0, 0), dt.datetime(2014, 12, 22, 0, 0)], dtype=object),
'id': np.array(['EQ0000000000041095', 'EQ0000000000041095', 'EQ0000000000041095'], dtype='|S18')}
In [27]:
df = pd.DataFrame(x)
df
Out[27]:
cusip id vstart vstop
10553M10 EQ0000000000041095 2001-11-16 2012-02-28
67085120 EQ0000000000041095 2012-02-28 2014-12-22
67085140 EQ0000000000041095 2014-12-22 2020-12-31
In [25]:
type(df.vstart[0])
Out[25]:
pandas.tslib.Timestamp
In [26]:
type(df.vstop[0])
Out[26]:
pandas.tslib.Timestamp
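The cutoff can be checked directly; a minimal sketch of the fallback behavior described above (dates chosen arbitrarily on either side of the bound):

```python
import datetime
import numpy as np
import pandas as pd

in_range = pd.DataFrame(
    {'d': np.array([datetime.datetime(2020, 12, 31)], dtype=object)})
out_of_range = pd.DataFrame(
    {'d': np.array([datetime.datetime(9999, 12, 31)], dtype=object)})

# Within Timestamp bounds the column converts to a datetime64 dtype;
# beyond them pandas keeps the original datetime objects (object dtype).
print(in_range['d'].dtype)
print(out_of_range['d'].dtype)
```

So the conversion is per-column, all or nothing: one out-of-range value keeps the entire column as objects.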
