I have a series object (1 column of a DataFrame) and would like to extract the value of the first element. Is there a way to do this simply without converting to a list and without knowing the key? Or is the only way to access it by converting it to a list first using tolist()[n]?
I think you can use iloc:
print(df)
col
0 a
1 b
2 c
3 d
4 e
print(df.iloc[0])
col a
Name: 0, dtype: object
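Note that df.iloc[0] on a DataFrame returns the first row. On a single column (a Series), the same positional indexer returns the scalar itself. A minimal sketch, assuming made-up data with a non-default index to show that the key does not need to be known:

```python
import pandas as pd

# Hypothetical data: the index labels are deliberately not 0..n-1.
df = pd.DataFrame({"col": list("abcde")}, index=[10, 11, 12, 13, 14])
series = df["col"]

# .iloc is purely positional, so this is the first element
# regardless of what the index labels are.
first = series.iloc[0]
print(first)
```

No conversion to a list is involved, and series.iloc[n] works the same way for any other position.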
import pandas as pd
df1 = pd.DataFrame({
"value": [1, 1, 1, 2, 2, 2]})
print(df1)
print("-------------------------")
print(df1.reset_index())
print("-------------------------")
print(df1.reset_index().index)
print("-------------------------")
print(df1.reset_index()["index"])
produces the output
value
0 1
1 1
2 1
3 2
4 2
5 2
-------------------------
index value
0 0 1
1 1 1
2 2 1
3 3 2
4 4 2
5 5 2
-------------------------
RangeIndex(start=0, stop=6, step=1)
-------------------------
0 0
1 1
2 2
3 3
4 4
5 5
Name: index, dtype: int64
I am wondering why print(df1.reset_index().index) and
print(df1.reset_index()["index"]) print different things in this case? The latter prints the "index" column, while the former prints the indices.
If we want to access the reset indices (the column), then it seems we have to use brackets?
The .index attribute of a pandas DataFrame always points to the Index (the row labels) of the DataFrame, not to a column named "index".
If we want to access the reset indices (the column), then it seems we
have to use brackets?
Yes, or you can assign a name when resetting the index (the names parameter requires pandas 1.5 or later), for example:
df1.reset_index(names='the_index').the_index
# 0 0
# 1 1
# 2 2
# 3 3
# 4 4
# 5 5
# Name: the_index, dtype: int64
Several things happened. First, when you don't specify an index, pandas uses a RangeIndex object as a virtual index of the dataframe. The dataframe is a collection of numpy arrays, which are naturally indexed 0, 1, 2, and so on. Since RangeIndex is just 0, 1, 2, ..., it doesn't actually create its values in memory. Had you printed the index of the original df1, it would be a RangeIndex, just like df1.reset_index().index.
reset_index has an optional drop parameter. By default, pandas will take the existing index and turn it into a column of the dataframe. This was a RangeIndex object but it had to be expanded into a realized column to fit with the other columns in the df. Had you included drop=True, there would be no "index" column.
A dataframe always has to have some index, so when you reset it, the new index is the default virtual RangeIndex you see.
DataFrames have a shortcut where some columns can be addressed by attribute name rather than by item (the square brackets). But if the column name doesn't meet Python's attribute naming rules, or if it clashes with an existing attribute, you can't reference it that way. .index is the dataframe index, so if you happen to also have a column "index", you need to access it via the square bracket item protocol.
One could argue that pandas should never have allowed the attribute access path because it can't be used consistently. I wouldn't argue that (except I totally would).
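The points above can be checked with a small sketch. It reuses df1 from the question; the variable names reset, index_attr, index_col and dropped are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({"value": [1, 1, 1, 2, 2, 2]})

reset = df1.reset_index()

# Attribute access resolves to the DataFrame's own index (a RangeIndex)...
index_attr = reset.index
# ...while bracket access returns the realized column named "index".
index_col = reset["index"]

# With drop=True the old index is discarded instead of becoming a column.
dropped = df1.reset_index(drop=True)

print(type(index_attr).__name__)
print(type(index_col).__name__)
print(list(dropped.columns))
```

Because the attribute name is already taken by the real index, bracket access is the only way to reach a column called "index".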
It does this because you are printing different things:
print(df1.reset_index().index)
is the same as:
df = df1.reset_index()
print(df.index)
This first moves the old index into a new "index" column and gives the dataframe a fresh default index, then prints that actual index of the df.
print(df1.reset_index()["index"])
is the equivalent of
df = df1.reset_index()
print(df["index"])
It first does the same reset, so the dataframe now has both an "index" and a "value" column. It then prints the column named "index" (which is NOT the index of the df).
If you want to make the "index" column the index again, you must use:
df = df.set_index("index")
In a column of my pandas DataFrame I have strings that need to be limited in length to a value that exists in another column of the same dataframe.
I have tried creating a new column and using normal Python string slicing with the other column as the limit.
Here is a MWE of the code I'm trying to run:
import pandas as pd
data = [[5, 'LONSTRING'], [3, 'LONGERSTRING'], [7, 'LONGESTSTRINGEVER']]
df = pd.DataFrame(data, columns=['String Limit', 'String'])
df['Short String'] = df['String'][:df['String Limit']]
print(df)
I expected a new column with shorter strings:
String Limit String Short String
0 5 LONSTRING LONST
1 3 LONGERSTRING LON
2 7 LONGESTSTRINGEVER LONGEST
Instead I get a TypeError:
TypeError: cannot do slice indexing on <class 'pandas.core.indexes.range.RangeIndex'> with these indexers [0 5
1 3
2 7
Name: String Limit, dtype: int64] of <class 'pandas.core.series.Series'>
It seems that string indexing can't be done this way because df['String Limit'] is the whole Series and not just the one row value - but are there any alternative ways of doing this?
The problem is that each string needs to be sliced with its own limit, so use DataFrame.apply with axis=1 to loop over the rows:
df['Short String'] = df.apply(lambda x: x['String'][:x['String Limit']], axis=1)
Or use zip with list comprehension:
df['Short String'] = [x[:y] for x, y in zip(df['String'], df['String Limit'])]
print(df)
String Limit String Short String
0 5 LONSTRING LONST
1 3 LONGERSTRING LON
2 7 LONGESTSTRINGEVER LONGEST
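As a quick sanity check, the two approaches can be run side by side on the question's data. This is only a sketch; the names via_apply and via_zip are illustrative:

```python
import pandas as pd

data = [[5, "LONSTRING"], [3, "LONGERSTRING"], [7, "LONGESTSTRINGEVER"]]
df = pd.DataFrame(data, columns=["String Limit", "String"])

# Row-wise apply: each row carries its own string and its own limit.
via_apply = df.apply(lambda row: row["String"][:row["String Limit"]], axis=1)

# zip pairs each string with its limit without building a row object per line.
via_zip = [s[:n] for s, n in zip(df["String"], df["String Limit"])]

print(via_apply.tolist())
print(via_zip)
```

Both give the same strings. The list comprehension avoids constructing a Series for every row, which tends to matter on larger frames.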
I have a Series, like this:
series = pd.Series({'a': 1, 'b': 2, 'c': 3})
I want to convert it to a dataframe like this:
a b c
0 1 2 3
pd.Series.to_frame() doesn't work; it gives a result like this:
0
a 1
b 2
c 3
How can I construct a DataFrame from Series, with index of Series as columns?
You can also try this:
df = pd.DataFrame(series).transpose()
Using the transpose() function you can interchange the index and the columns.
The output looks like this :
a b c
0 1 2 3
You don't need the transposition step, just wrap your Series inside a list and pass it to the DataFrame constructor:
pd.DataFrame([series])
a b c
0 1 2 3
Alternatively, call Series.to_frame, then transpose using the shortcut .T:
series.to_frame().T
a b c
0 1 2 3
You can also try this:
a = pd.Series.to_frame(series)
a['id'] = list(a.index)
Explanation:
The first line converts the series into a single-column DataFrame.
The second line adds a column to this DataFrame whose values are the same as the index.
Try reset_index. It will convert your index into a column in your dataframe.
df = series.to_frame().reset_index()
This
pd.DataFrame([series]) #method 1
produces a slightly different result than
series.to_frame().T #method 2
With method 1, the elements in the resulting dataframe retain their types, e.g. an int64 in the series will be kept as an int64.
With method 2, the elements in the resulting dataframe become objects IF there is an object-type element anywhere in the series, e.g. an int64 in the series will become an object.
This difference may cause different behavior in your subsequent operations, depending on the version of pandas.
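One way to see what your own pandas version does is to print the column dtypes for both methods. A minimal sketch, assuming a made-up series with mixed values (the names method1 and method2 are illustrative); the output is left for you to inspect, since it may depend on the pandas version:

```python
import pandas as pd

# Hypothetical mixed-type series, so the series itself has dtype object.
series = pd.Series({"a": 1, "b": "x", "c": 3.5})

method1 = pd.DataFrame([series])   # wrap the series in a list
method2 = series.to_frame().T      # single column, then transpose

# Compare the per-column dtypes produced by each construction.
print(method1.dtypes)
print(method2.dtypes)
```

If the dtypes differ, DataFrame.infer_objects() or an explicit astype() on the affected columns can bring the two results in line.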
I'm trying to use assign to create a new column in a pandas DataFrame. I need to use something like str.format to have the new column be pieces of existing columns. For instance...
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(3, 3))
gives me...
0 1 2
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
An assign for a totally new column works:
# works
print(df.assign(c = "a"))
0 1 2 c
0 -0.738703 -1.027115 1.129253 a
1 0.674314 0.525223 -0.371896 a
2 1.021304 0.169181 -0.884293 a
But if I want to build the new column from an existing column, it seems like pandas is putting the whole existing column into every row of the new column.
# doesn't work
print(df.assign(c = "a{}b".format(df[0])))
0 1 2 \
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
c
0 a0 -0.738703\n1 0.674314\n2 1.021304\n...
1 a0 -0.738703\n1 0.674314\n2 1.021304\n...
2 a0 -0.738703\n1 0.674314\n2 1.021304\n...
Thanks for the help.
In [131]: df.assign(c="a"+df[0].astype(str)+"b")
Out[131]:
0 1 2 c
0 0.833556 -0.106183 -0.910005 a0.833556419295b
1 -1.487825 1.173338 1.650466 a-1.48782514804b
2 -0.836795 -1.192674 -0.212900 a-0.836795026809b
'a{}b'.format(df[0]) is a str. "a"+df[0].astype(str)+"b" is a Series.
In [142]: type(df[0].astype(str))
Out[142]: pandas.core.series.Series
In [143]: type('{}'.format(df[0]))
Out[143]: str
When you assign a single string to the column c, that string is repeated for every row in df.
Thus, df.assign(c = "a{}b".format(df[0])) assigns the string 'a{}b'.format(df[0])
to each row of df:
In [138]: 'a{}b'.format(df[0])
Out[138]: 'a0 0.833556\n1 -1.487825\n2 -0.836795\nName: 0, dtype: float64b'
It is really no different than what happened with df.assign(c = "a").
In contrast, when you assign a Series to the column c, then the index of the Series is aligned with the index of df and the corresponding values are assigned to df['c'].
Under the hood, the Series.__add__ method (and its reflected counterpart Series.__radd__, which handles "a" + series) is defined in such a way that adding a string to a Series containing strings results in a new Series with the string concatenated with each value in the Series:
In [149]: "a"+df[0].astype(str)
Out[149]:
0 a0.833556419295
1 a-1.48782514804
2 a-0.836795026809
Name: 0, dtype: object
(The astype method was called to convert the floats in df[0] into strings.)
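If you would rather keep the str.format style from the question, one option is to apply the format call element by element with Series.map. This is a sketch with fixed, made-up values instead of random ones so the result is reproducible; via_concat and via_map are illustrative names:

```python
import pandas as pd

# Hypothetical fixed values in place of np.random.randn.
df = pd.DataFrame({0: [1.5, -2.0, 0.25]})

# Vectorized concatenation, as in the answer above.
via_concat = df.assign(c="a" + df[0].astype(str) + "b")

# Element-wise formatting: map calls "a{}b".format once per value,
# so each row gets its own string instead of the whole Series repr.
via_map = df.assign(c=df[0].map("a{}b".format))

print(via_map)
```

The map version also makes it easy to control the number format, for example with "a{:.2f}b".format.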
df['c'] = "a" + df[0].astype(str) + 'b'
df
0 1 2 c
0 -1.134154 -0.367397 0.906239 a-1.13415403091b
1 0.551997 -0.160217 -0.869291 a0.551996920472b
2 0.490102 -1.151301 0.541888 a0.490101854737b
I have a pandas DataFrame with a multi-level index ("instance" and "index"). I want to find all the first-level ("instance") index values which are non-unique and to print out those values.
My frame looks like this:
A
instance index
a 1 10
2 12
3 4
b 1 12
2 5
3 2
b 1 12
2 5
3 2
I want to find "b" as the duplicate 0-level index and print its value ("b") out.
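You can use the get_duplicates() method (note that it was deprecated in pandas 0.23 and removed in 1.0, so this only works on older versions):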
>>> df.index.get_level_values('instance').get_duplicates()
[0, 1]
(In my example data 0 and 1 both appear multiple times.)
The get_level_values() method can accept a label (such as 'instance') or an integer and retrieves the relevant part of the MultiIndex.
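On current pandas versions, a similar result can be obtained with duplicated(). The sketch below rebuilds the frame from the question and assumes that "duplicate" means a repeated (instance, index) pair, which is why "a" is not reported; repeated and dup_instances are illustrative names:

```python
import pandas as pd

# The frame from the question: the three "b" rows appear twice.
index = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("a", 3),
     ("b", 1), ("b", 2), ("b", 3),
     ("b", 1), ("b", 2), ("b", 3)],
    names=["instance", "index"],
)
df = pd.DataFrame({"A": [10, 12, 4, 12, 5, 2, 12, 5, 2]}, index=index)

# MultiIndex.duplicated() flags every repeat of a full (instance, index) tuple.
repeated = df.index[df.index.duplicated()]

# Keep only the first level and collapse it to unique labels.
dup_instances = repeated.get_level_values("instance").unique()
print(list(dup_instances))
```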
Assuming that your df has an index made of 'instance' and 'index' you could do this:
df1 = df.reset_index().pivot_table(index=['instance','index'], values='A', aggfunc='count')
df1[df1 > 1].index.get_level_values(0).drop_duplicates()
Which yields:
Index([u'b'], dtype='object')
Adding .values at the end (.drop_duplicates().values) will make an array:
array(['b'], dtype=object)
Or the same with one line using .groupby:
df[df.groupby(level=['instance','index']).count() > 1].dropna().index.get_level_values(0).drop_duplicates()
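This should give you the whole rows, which isn't quite what you asked for but might be close enough. Note that it flags every repeat of an instance label, so on a frame with several rows per instance it also returns rows for labels such as "a":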
df[df.index.get_level_values('instance').duplicated()]
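You want the duplicated method. Since "instance" is an index level rather than a column here, turn it into a column first:
df.reset_index()['instance'].duplicated()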