Here is the code that I am working with:
import pandas as pd
test3 = pd.Series([1,2,3], index = ['a','b','c'])
test3 = test3.reindex(index = ['f','g','z'])
So originally everything is fine and test3 has an index of 'a', 'b', 'c' and values 1, 2, 3. But then when I go to reindex test3, my values 1, 2, 3 are lost. Why is that? The desired output would be:
f 1
g 2
z 3
The docs are clear on this behaviour:
Conform Series to new index with optional filling logic, placing
NA/NaN in locations having no value in the previous index
If you just want to overwrite the index values, then do:
In [32]:
test3.index = ['f','g','z']
test3
Out[32]:
f 1
g 2
z 3
dtype: int64
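To see the documented behaviour side by side, here is a minimal sketch (toy values, same shape as the question):

```python
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# reindex aligns by label: 'f', 'g', 'z' do not exist in the old
# index, so every position becomes NaN (and the dtype turns float)
realigned = s.reindex(['f', 'g', 'z'])
print(realigned.isna().all())  # True

# overwriting the index relabels in place and keeps the values
s.index = ['f', 'g', 'z']
print(s.tolist())  # [1, 2, 3]
```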
Related
I want to set the following MultiIndex on a pandas.Series. Currently my index names are [None, None]; how can I change them?
Letters bins
A (2.212, 15.753] 4
(15.753, 29.227] 1
B (2.212, 15.753] 1
C (2.212, 15.753] 1
(15.753, 29.227] 3
(29.227, 42.701] 2
D (2.212, 15.753] 1
(56.174, 69.648] 1
E (56.174, 69.648] 1
I think you mean to set the index of a pandas DataFrame from existing columns. You can use this, see if it helps:
df.set_index(['Letters', 'bins'], inplace=True)
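A minimal, self-contained sketch of what `set_index` does here (plain strings stand in for the interval bins above):

```python
import pandas as pd

df = pd.DataFrame({
    'Letters': ['A', 'A', 'B'],
    'bins': ['(2.212, 15.753]', '(15.753, 29.227]', '(2.212, 15.753]'],
    'count': [4, 1, 1],
})

# move the two columns into a named MultiIndex
df.set_index(['Letters', 'bins'], inplace=True)
print(df.index.names)  # ['Letters', 'bins']
```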
I have found examples of how to remove a column based on all values or a threshold, but I have not been able to find a solution to my particular problem, which is dropping a column if its last row is NaN. The reason for this is that I'm using time-series data in which the collection of data doesn't all start at the same time, which is fine, but if I used one of the previous solutions it would remove 95% of the dataset. However, I do not want data whose most recent value is NaN, as that means it's defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
You can use .iloc, .loc and .notna() to sort out your problem.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                   "B": ["t", 2, "y", np.nan],
                   "C": ["x", 3, "z", 6]})
df = df.loc[:, df.iloc[-1, :].notna()]
You can use a boolean mask over the last row to select the columns to drop:
df.drop(columns=df.columns[df.iloc[-1].isna()])
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
for col in list(temp_df.columns):
    if str(temp_df[col].iloc[-1]) == 'nan':
        temp_df = temp_df.drop(columns=col)
This will work for you.
Basically what I'm doing here is looping over all columns and checking whether the string form of the last entry is 'nan' (str() of a float NaN gives 'nan'), then dropping that column.
list(temp_df.columns)
is a copy of the column labels, so dropping inside the loop is safe.
temp_df.drop(columns=col)
drops the column with that label.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that someone can compare anything they wish to.
Another method I found is pd.isnull(), a function in the pandas library, which works like this:
for col in list(temp_df.columns):
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(columns=col)
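Since the point of this approach is that you can compare against anything, the same idea also works without the explicit loop; a sketch using a boolean mask on the last row (the sentinel string 'defunct' here is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 'defunct'],
                   'B': [2, 'ok'],
                   'C': [3, 'ok']})

# keep only the columns whose last entry is not the sentinel
kept = df.loc[:, df.iloc[-1] != 'defunct']
print(list(kept.columns))  # ['B', 'C']
```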
I have a pandas DataFrame with a multi-index like this:
import pandas as pd
import numpy as np
arr = [1]*3 + [2]*3
arr2 = list(range(3)) + list(range(3))
mux = pd.MultiIndex.from_arrays([
arr,
arr2
], names=['one', 'two'])
df = pd.DataFrame({'a': np.arange(len(mux))}, mux)
df
a
one two
1 0 0
1 1 1
1 2 2
2 0 3
2 1 4
2 2 5
I have a function that takes a slice of a DataFrame and needs to assign a new column to the rows that have been sliced:
def work(df):
    b = df.copy()
    # do some work on the slice and create values for a new column of the slice
    b['b'] = b['a'] * 2
    # assign the new values back to the slice in a new column
    df['b'] = b['b']
#pass in a slice of the df with only records that have the last value for 'two'
work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])
However calling the function results in the error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
How can I create a new column 'b' in the original DataFrame and assign its values for only the rows that were passed to the function, leaving the rest of the rows nan?
The desired output is:
a b
one two
1 0 0 nan
1 1 1 nan
1 2 2 4
2 0 3 nan
2 1 4 nan
2 2 5 10
NOTE: In the work function I'm actually doing a bunch of complex operations involving calling other functions to generate the values for the new column so I don't think this will work. Multiplying by 2 in my example is just for illustrative purposes.
You actually don't have an error, but just a warning. Try this:
def work(df):
    b = df.copy()
    # do some work on the slice and create values for a new column of the slice
    b['b'] = b['a'] * 2
    # assign the new values back to the slice in a new column
    df['b'] = b['b']
    return df
#pass in a slice of the df with only records that have the last value for 'two'
new_df = work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])
Then:
df.reset_index().merge(new_df, how="left").set_index(["one","two"])
Output:
a b
one two
1 0 0 NaN
1 1 NaN
2 2 4.0
2 0 3 NaN
1 4 NaN
2 5 10.0
I don't think you need a separate function at all. Try this...
df['b'] = df['a'].where(df.index.isin(df.index.get_level_values('two')[-1:], level=1))*2
The Series.where() function being called on df['a'] here should return a series where values are NaN for rows that do not result from your query.
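A quick sketch of that behaviour on a toy Series, showing the NaN fill and the arithmetic afterwards:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# keep values where the condition holds, NaN elsewhere
masked = s.where(s.index.isin(['z']))
print(masked.isna().tolist())  # [True, True, False]

# arithmetic leaves the NaNs in place
doubled = masked * 2
print(doubled['z'])  # 60.0
```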
I'm trying to use assign to create a new column in a pandas DataFrame. I need to use something like str.format to have the new column be pieces of existing columns. For instance...
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(3, 3))
gives me...
0 1 2
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
an assign for a totally new column works
# works
print(df.assign(c = "a"))
0 1 2 c
0 -0.738703 -1.027115 1.129253 a
1 0.674314 0.525223 -0.371896 a
2 1.021304 0.169181 -0.884293 a
But if I want to build a new column from an existing one, it seems like pandas is putting the repr of the whole existing column into every row of the new column.
# doesn't work
print(df.assign(c = "a{}b".format(df[0])))
0 1 2 \
0 -0.738703 -1.027115 1.129253
1 0.674314 0.525223 -0.371896
2 1.021304 0.169181 -0.884293
c
0 a0 -0.738703\n1 0.674314\n2 1.021304\n...
1 a0 -0.738703\n1 0.674314\n2 1.021304\n...
2 a0 -0.738703\n1 0.674314\n2 1.021304\n...
Thanks for the help.
In [131]: df.assign(c="a"+df[0].astype(str)+"b")
Out[131]:
0 1 2 c
0 0.833556 -0.106183 -0.910005 a0.833556419295b
1 -1.487825 1.173338 1.650466 a-1.48782514804b
2 -0.836795 -1.192674 -0.212900 a-0.836795026809b
'a{}b'.format(df[0]) is a str. "a"+df[0].astype(str)+"b" is a Series.
In [142]: type(df[0].astype(str))
Out[142]: pandas.core.series.Series
In [143]: type('{}'.format(df[0]))
Out[143]: str
When you assign a single string to the column c, that string is repeated for every row in df.
Thus, df.assign(c = "a{}b".format(df[0])) assigns the string 'a{}b'.format(df[0])
to each row of df:
In [138]: 'a{}b'.format(df[0])
Out[138]: 'a0 0.833556\n1 -1.487825\n2 -0.836795\nName: 0, dtype: float64b'
It is really no different than what happened with df.assign(c = "a").
In contrast, when you assign a Series to the column c, then the index of the Series is aligned with the index of df and the corresponding values are assigned to df['c'].
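A tiny sketch of that alignment (toy labels): a Series covering only some row labels fills the remaining rows of the column with NaN:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])

# the Series only covers labels 0 and 2; label 1 gets NaN
df['c'] = pd.Series(['x', 'z'], index=[0, 2])
print(df['c'].isna().tolist())  # [False, True, False]
```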
Under the hood, the Series.__add__ method is defined in such a way so that addition of the Series containing strings with a string results in a new Series with the string concatenated with the values in the Series:
In [149]: "a"+df[0].astype(str)
Out[149]:
0 a0.833556419295
1 a-1.48782514804
2 a-0.836795026809
Name: 0, dtype: object
(The astype method was called to convert the floats in df[0] into strings.)
df['c'] = "a" + df[0].astype(str) + 'b'
df
0 1 2 c
0 -1.134154 -0.367397 0.906239 a-1.13415403091b
1 0.551997 -0.160217 -0.869291 a0.551996920472b
2 0.490102 -1.151301 0.541888 a0.490101854737b
I have this nice pandas dataframe:
And I want to group it by column "0" (which represents the year) and calculate the mean of the other columns for each year. I do so with this code:
df.groupby(0)[[2, 3, 4]].mean()
And that successfully calculates the mean of every column. The problem here being the empty row that appears on top:
That's just a display thing: the grouped column becomes the index, and this is just the way it is displayed. You will notice that even when you set pd.set_option('display.notebook_repr_html', False) you still get this line; it has no effect on operations on the grouped df:
In [30]:
df = pd.DataFrame({'a':np.random.randn(5), 'b':np.random.randn(5), 'c':np.arange(5)})
df
Out[30]:
a b c
0 0.766706 -0.575700 0
1 0.594797 -0.966856 1
2 1.852405 1.003855 2
3 -0.919870 -1.089215 3
4 -0.647769 -0.541440 4
In [31]:
df.groupby('c')[['a', 'b']].mean()
Out[31]:
a b
c
0 0.766706 -0.575700
1 0.594797 -0.966856
2 1.852405 1.003855
3 -0.919870 -1.089215
4 -0.647769 -0.541440
Technically speaking, it has assigned the name attribute:
In [32]:
df.groupby('c')[['a', 'b']].mean().index.name
Out[32]:
'c'
By default there will be no name if it has not been assigned:
In [34]:
print(df.index.name)
None
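If you would rather not have the index line at all, two common options (a sketch using the same toy frame) are reset_index after the fact or as_index=False up front:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(5),
                   'b': np.random.randn(5),
                   'c': np.arange(5)})

# move the grouping key back into a regular column afterwards...
flat = df.groupby('c')[['a', 'b']].mean().reset_index()

# ...or keep it as a column from the start
flat2 = df.groupby('c', as_index=False)[['a', 'b']].mean()

print(list(flat.columns))  # ['c', 'a', 'b']
```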