I have a column that should have been created as type str, but it came out as object instead. How can I get it to be of type str? I've tried a few approaches that are documented as working, especially the ones in "How to convert column with dtype as object to string in Pandas Dataframe".
Given a dataframe dfm and a list of feature values feats:
# this should work as str's but does not
dfm.insert(colx,'top20pct_features', [str(x) for x in feats])
# let's try another way.. but also does not work
dfm.top20pct_features = dfm.top20pct_features.astype(str)
# another way.. same story..
dfm.top20pct_features = dfm.top20pct_features.str
print(dfm.info()) # reports the column as `object`
You can use convert_dtypes to benefit from the relatively recent string dtype:
df['top20pct_features'] = df['top20pct_features'].convert_dtypes()
Example:
import pandas as pd
df = pd.DataFrame({'top20pct_features': ['A', 'B', 'C']})
df.dtypes
top20pct_features object
dtype: object
df['top20pct_features'] = df['top20pct_features'].convert_dtypes()
df.dtypes
top20pct_features string
dtype: object
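For context, object is how pandas has traditionally stored Python str values, so a column can report object even when every element is already a str. A minimal sketch, assuming a pandas version that ships the string extension dtype (1.0 or later); the feats values here are made up for illustration:

```python
import pandas as pd

# Hypothetical feature values of mixed type, standing in for `feats`.
feats = [1.5, 2, "a"]

# Force object dtype so the starting point matches the question.
col = pd.Series([str(x) for x in feats], dtype=object)
all_str = all(isinstance(x, str) for x in col)  # elements are already str

# Opt in to the dedicated string dtype explicitly.
as_string = col.astype("string")
print(col.dtype, all_str, as_string.dtype)
```

astype("string") asks for the extension dtype by name, whereas astype(str) only converts the elements and keeps whatever dtype pandas uses for plain str values.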
I am aware that I can find the length of a pd.Series by using pd.Series.str.len() but is there a method to strip the last two characters? I know we can use Python to accomplish this but I was curious to see if it could be done in Pandas.
For example:
$1000.0000
1..0009
456.2233
Would end up as:
$1000.00
1..00
456.22
Any insight would be greatly appreciated.
Just do:
import pandas as pd
s = pd.Series(['$1000.0000', '1..0009', '456.2233'])
res = s.str[:-2]
print(res)
Output
0 $1000.00
1 1..00
2 456.22
dtype: object
Pandas supports the built-in string methods through the accessor str, from the documentation:
These are accessed via the str attribute and generally have names
matching the equivalent (scalar) built-in string methods
Try with
df_new = df.astype(str).applymap(lambda x : x[:-2])
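Or only one column (the .str accessor exists on a Series, not on a DataFrame; 'col' is a placeholder column name)
df_new = df['col'].astype(str).str[:-2]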
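A short sketch of the single-column case, assuming a hypothetical column named price that holds the values from the question. It also shows Series.str.slice, which should give the same result as the bracket form:

```python
import pandas as pd

# "price" is a made-up column name used only for this example.
df = pd.DataFrame({"price": ["$1000.0000", "1..0009", "456.2233"]})

# Select the column first: .str is a Series accessor.
trimmed = df["price"].astype(str).str[:-2]

# Equivalent spelling with an explicit method call.
same = df["price"].str.slice(stop=-2)

print(trimmed.tolist())
```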
In Pandas, when I select a label that only has one entry in the index I get back a Series, but when I select a label that has more than one entry I get back a data frame.
Why is that? Is there a way to ensure I always get back a data frame?
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame
In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series
Granted that the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.
In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame
In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame
The TLDR
When using loc
df.loc[:] = DataFrame
df.loc[label] = Series if the label occurs once in the index, DataFrame if it occurs more than once
df.loc[:, ["col_name"]] = DataFrame
df.loc[:, "col_name"] = Series
Not using loc
df["col_name"] = Series
df[["col_name"]] = DataFrame
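The return types can be checked directly. A small sketch, using a made-up two-column frame whose index repeats the label 3:

```python
import pandas as pd

# Two columns and a duplicated index label (3), for illustration only.
df = pd.DataFrame({"a": range(5), "b": range(5)}, index=[1, 2, 3, 3, 3])

kinds = {
    "df.loc[1]": type(df.loc[1]).__name__,        # unique row label
    "df.loc[3]": type(df.loc[3]).__name__,        # duplicated row label
    "df.loc[[1]]": type(df.loc[[1]]).__name__,    # list of row labels
    'df.loc[:, "a"]': type(df.loc[:, "a"]).__name__,
    'df.loc[:, ["a"]]': type(df.loc[:, ["a"]]).__name__,
    'df["a"]': type(df["a"]).__name__,
    'df[["a"]]': type(df[["a"]]).__name__,
}
for expr, kind in kinds.items():
    print(f"{expr:18} -> {kind}")
```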
You have an index with three index items 3. For this reason df.loc[3] will return a dataframe.
The reason is that you don't specify the column. So df.loc[3] selects three items of all columns (which is column 0), while df.loc[3,0] will return a Series. E.g. df.loc[1:2] also returns a dataframe, because you slice the rows.
Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.
If you want to be sure to always have a DataFrame, you can slice like df.loc[1:1]. Another option is boolean indexing (df.loc[df.index==1]) or the take method (df.take([0]), but this uses location, not labels!).
Use df['columnName'] to get a Series and df[['columnName']] to get a DataFrame.
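The three row-selection options mentioned here can be compared side by side. A minimal sketch on the frame from the question:

```python
import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

by_slice = df.loc[1:1]            # label slice, both ends inclusive
by_mask = df.loc[df.index == 1]   # boolean indexing on the index
by_position = df.take([0])        # position 0, not label 0

for result in (by_slice, by_mask, by_position):
    print(type(result).__name__, result.shape)
```

All three give a one-row DataFrame here only because the label 1 happens to sit at position 0.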
You wrote in a comment to joris' answer:
"I don't understand the design
decision for single rows to get converted into a series - why not a
data frame with one row?"
A single row isn't converted into a Series.
It IS a Series. (No, I don't think so after all; see the edit below.)
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data. For example, DataFrame is a
container for Series, and Panel is a container for DataFrame objects.
We would like to be able to insert and remove objects from these
containers in a dictionary-like fashion.
http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure
The data model of Pandas objects has been chosen like that. The reason certainly lies in advantages it provides that I don't know about (I don't fully understand the last sentence of the citation; maybe that's the reason).
Edit: I no longer agree with what I wrote above.
A DataFrame can't be composed of elements that are Series, because the following code gives the same type, Series, for a row as well as for a column:
import pandas as pd
df = pd.DataFrame(data=[11, 12, 13], index=[2, 3, 3])
print('-------- df -------------')
print(df)
print('\n------- df.loc[2] --------')
print(df.loc[2])
print('type(df.loc[2]) : ', type(df.loc[2]))
print('\n--------- df[0] ----------')
print(df[0])
print('type(df[0]) : ', type(df[0]))
result
-------- df -------------
0
2 11
3 12
3 13
------- df.loc[2] --------
0 11
Name: 2, dtype: int64
type(df.loc[2]) : <class 'pandas.core.series.Series'>
--------- df[0] ----------
2 11
3 12
3 13
Name: 0, dtype: int64
type(df[0]) : <class 'pandas.core.series.Series'>
So it makes no sense to claim that a DataFrame is composed of Series: which would these Series be, columns or rows? It's a misguided question and a misguided view.
Then what is a DataFrame ?
In the previous version of this answer, I asked this question while trying to answer the "Why is that?" part of the OP's question, and the similar query in one of his comments ("single rows to get converted into a series - why not a data frame with one row?"),
while the "Is there a way to ensure I always get back a data frame?" part has been answered by Dan Allan.
Since the Pandas docs cited above say that the pandas data structures are best seen as containers of lower dimensional data, it seemed to me that the why would be found in the nature of DataFrame structures.
However, I realized that this advice must not be taken as a precise description of the nature of Pandas' data structures.
It doesn't mean that a DataFrame is a container of Series.
It says that picturing a DataFrame as a container of Series (either rows or columns, depending on which view suits the reasoning at hand) is a good way to think about DataFrames, even if it isn't strictly the case in reality. "Good" means that this view lets you use DataFrames efficiently. That's all.
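The "rows or columns" duality can be seen by iterating over a frame both ways. A small sketch with made-up column names x and y:

```python
import pandas as pd

# Column names "x" and "y" are invented for this illustration.
df = pd.DataFrame({"x": [11, 12, 13], "y": [1.5, 2.5, 3.5]}, index=[2, 3, 3])

# Iterating column-wise yields (name, Series) pairs.
column_types = {name: type(col).__name__ for name, col in df.items()}

# Iterating row-wise also yields (label, Series) pairs.
row_types = [type(row).__name__ for _, row in df.iterrows()]

print(column_types, row_types)
```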
Then what is a DataFrame object ?
The DataFrame class produces instances whose structure originates in the NDFrame base class, itself derived from the PandasContainer base class, which is also a parent class of the Series class.
Note that this is correct for Pandas up to version 0.12. In the upcoming version 0.13, Series will also derive from the NDFrame class only.
# with pandas 0.12
from pandas import Series
print 'Series :\n',Series
print 'Series.__bases__ :\n',Series.__bases__
from pandas import DataFrame
print '\nDataFrame :\n',DataFrame
print 'DataFrame.__bases__ :\n',DataFrame.__bases__
print '\n-------------------'
from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__ :\n',NDFrame.__bases__
from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__ :\n',PandasContainer.__bases__
from pandas.core.base import PandasObject
print '\nPandasObject.__bases__ :\n',PandasObject.__bases__
from pandas.core.base import StringMixin
print '\nStringMixin.__bases__ :\n',StringMixin.__bases__
result
Series :
<class 'pandas.core.series.Series'>
Series.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)
DataFrame :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__ :
(<class 'pandas.core.generic.NDFrame'>,)
-------------------
NDFrame.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>,)
PandasContainer.__bases__ :
(<class 'pandas.core.base.PandasObject'>,)
PandasObject.__bases__ :
(<class 'pandas.core.base.StringMixin'>,)
StringMixin.__bases__ :
(<type 'object'>,)
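On a current pandas install the hierarchy looks different from the 0.12 listing above, so the same inspection is better done generically. A hedged sketch that only assumes NDFrame is still importable from pandas.core.generic (an internal module path, so it may move between versions):

```python
import pandas as pd
from pandas.core.generic import NDFrame  # internal path, assumed to exist

# Print the full method resolution order instead of walking __bases__ by hand.
for cls in (pd.Series, pd.DataFrame):
    print(cls.__name__, [base.__name__ for base in cls.__mro__])

# Since 0.13, both classes are expected to share NDFrame as a base.
shares_base = issubclass(pd.Series, NDFrame) and issubclass(pd.DataFrame, NDFrame)
print(shares_base)
```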
So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.
The ways these extracting methods work are described in this page:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
We find in it the method given by Dan Allan and other methods.
Why have these extraction methods been crafted the way they were?
Certainly because they were judged to offer the best possibilities and the most ease in data analysis.
That is precisely what this sentence expresses:
The best way to think about the pandas data structures is as flexible
containers for lower dimensional data.
The reason data is extracted from a DataFrame instance the way it is doesn't lie in the structure itself; it lies in why the structure was designed that way. I guess that the structure and functioning of the Pandas data structures have been shaped to be as intuitive as possible, and that to understand the details, one must read Wes McKinney's blog.
If the objective is to get a subset of the data set using the index, it is best to avoid using loc or iloc. Instead, you should use syntax similar to this:
df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
result = df[df.index == 3]
isinstance(result, pd.DataFrame) # True
result = df[df.index == 1]
isinstance(result, pd.DataFrame) # True
Every time we use [['column name']] we get a Pandas DataFrame object;
if we use ['column name'] we get a Pandas Series object.
If you also select on the index of the dataframe, the result can be either a DataFrame or a Series (when the column is passed as a list), or either a Series or a scalar (when the column is passed as a single label).
This function ensures that you always get a list from your selection (if the df, index and column are valid):
import pandas as pd
def get_list_from_df_column(df, index, column):
    df_or_series = df.loc[index, [column]]
    # df.loc[index, column] is also possible and returns a Series or a scalar
    if isinstance(df_or_series, pd.Series):
        # a unique index label gives a Series; get a list from it
        resulting_list = df_or_series.tolist()
    else:
        # use the column key to get a Series from the DataFrame
        resulting_list = df_or_series[column].tolist()
    return resulting_list
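A more compact variant is possible when index is a single label rather than a list or slice. This is only a sketch: column_values is a hypothetical helper name, and it relies on the fact that a list of row labels combined with a single column label gives a Series from loc.

```python
import pandas as pd

def column_values(df, index, column):
    # Hypothetical helper: wrapping the row label in a list makes .loc
    # return a Series for a single column label, whether the row label
    # is unique or duplicated in the index.
    return df.loc[[index], column].tolist()

# Made-up data with a repeated index label.
df = pd.DataFrame({"col": [10, 20, 30, 40, 50]}, index=[1, 2, 3, 3, 3])
unique_label = column_values(df, 1, "col")
repeated_label = column_values(df, 3, "col")
print(unique_label, repeated_label)
```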
So if I have a timestamp in pandas as such:
Timestamp('2014-11-07 00:05:00')
How can I create a new column that just has the 'time' component?
So I want
00:05:00
Currently, I'm using .apply as shown below, but this is slow (my dataframe is a couple million rows), and I'm looking for a faster way.
df['time'] = df['date_time'].apply(lambda x: x.time())
Instead of .apply, I tried using .astype(time), as I noticed .astype operations can be faster than .apply, but that apparently doesn't work on timestamps (AttributeError: 'Timestamp' object has no attribute 'astype')... any ideas?
You want .dt.time; see the docs for more examples of what is available under the .dt accessor.
df['date_time'].dt.time
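A minimal sketch of assigning the result to a new column, with two sample timestamps invented for the example:

```python
import datetime
import pandas as pd

# Sample data; the second timestamp is made up for illustration.
df = pd.DataFrame(
    {"date_time": pd.to_datetime(["2014-11-07 00:05:00", "2014-11-07 13:45:30"])}
)

# .dt.time returns a Series of datetime.time objects.
df["time"] = df["date_time"].dt.time
print(df)
```

The new column holds plain datetime.time objects, so it no longer has a datetime64 dtype.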
There are two DataFrames, df1 and df2, with a date column and a time column respectively.
The following snippets are useful for converting the types and comparing the values.
type(df1['date'].iloc[0]), type(df2['time'].iloc[0])
>>> (datetime.date, pandas._libs.tslibs.timestamps.Timestamp)
type(df1['date'].iloc[0]), type(df2['time'].iloc[0].date())
>>> (datetime.date, datetime.date)
df1['date'].iloc[0] == df2['time'].iloc[0].date()
>>> False
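The same comparison can be done for whole columns with the .dt.date accessor. A sketch with invented sample values, assuming df1['date'] holds datetime.date objects and df2['time'] is a datetime64 column:

```python
import datetime
import pandas as pd

# Made-up data: plain date objects on one side, timestamps on the other.
df1 = pd.DataFrame({"date": [datetime.date(2020, 1, 1), datetime.date(2020, 1, 2)]})
df2 = pd.DataFrame({"time": pd.to_datetime(["2020-01-01 08:00", "2020-01-03 09:30"])})

# Reduce the timestamps to dates, then compare row by row.
same_day = df1["date"] == df2["time"].dt.date
print(same_day.tolist())
```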
I face some confusion with the way pandas is handling time-related objects.
If I do
x = pd.datetime.fromtimestamp(1440502703064/1000.) # or
x = pd.datetime(1234,5,6)
then type(x) returns datetime.datetime in either of the cases. However if I have:
z = pd.DataFrame([
{'a': 'foo', 'ts': pd.datetime.fromtimestamp(1440502703064/1000.)}
])
then type(z['ts'][0]) returns pandas.tslib.Timestamp. When does this cast happen? Is it triggered by pandas or maybe by numpy? What is the type that I obtain in the latter case, and where is it documented?
I'm not 100% sure, since I haven't studied the underlying code, but the conversion from datetime.datetime happens the moment the value is "incorporated" into a DataFrame.
Outside a DataFrame, pandas will try to do the smart thing and return something sensible when using pd.datetime(.fromtimestamp): it returns a Python datetime.datetime object.
Inside, it uses something it can probably work better with internally. You can see the conversion occurring when creating a DataFrame by using a datetime.datetime object instead:
>>> from datetime import datetime
>>> import pandas as pd
>>> z = pd.DataFrame([{'a': 'foo', 'ts': datetime(2015, 8, 27)}])
>>> type(z['ts'][0])
pandas.tslib.Timestamp
Perhaps even clearer:
>>> pd.datetime == datetime
True
So the conversion happens during the DataFrame initialisation.
As for documentation, I searched around and found the source (note: probably not a very time-safe link), which says (doc-string):
TimeStamp is the pandas equivalent of python's Datetime and is
interchangable with it in most cases. It's the type used for the
entries that make up a DatetimeIndex, and other timeseries oriented
data structures in pandas.
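The "interchangeable in most cases" part can be checked on a recent pandas. A small sketch; the exact module path printed for Timestamp varies between versions, so it is not shown here:

```python
from datetime import datetime
import pandas as pd

z = pd.DataFrame([{"a": "foo", "ts": datetime(2015, 8, 27)}])
value = z["ts"].iloc[0]

is_timestamp = isinstance(value, pd.Timestamp)   # converted on construction
is_datetime = isinstance(value, datetime)        # Timestamp subclasses datetime
round_trip = value.to_pydatetime()               # back to a plain datetime

print(type(value), is_datetime, round_trip)
```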