I have a multi-index dataframe and want to set a slice of one of its columns equal to a series, ordered (sorted) according to the match between the column slice's index and the series' index. The column's innermost index and the series' index contain the same labels, just in a different order (see the example below).
I can do this by first sorting the series according to the column's index and then assigning series.values (see below), but this feels like a workaround, and I was wondering whether it's possible to assign the series to the column slice directly.
example:
import pandas as pd
multi_index=pd.MultiIndex.from_product([['a','b'],['x','y']])
df=pd.DataFrame(0,multi_index,['p','q'])
s1=pd.Series([1,2],['y','x'])
df.loc['a','p']=s1[df.loc['a','p'].index].values
The code above gives the desired output, but I was wondering whether the last line could be done more simply, e.g.:
df.loc['a','p']=s1
but this sets the column slice to NaNs.
Desired output:
p q
a x 2 0
y 1 0
b x 0 0
y 0 0
obtained output from df.loc['a','p']=s1:
p q
a x NaN 0
y NaN 0
b x 0.0 0
y 0.0 0
It seems like a simple issue to me but I haven't been able to find the answer anywhere.
Have you tried something like this?
df.loc['a']['p'] = s1
The resulting df is:
p q
a x 2 0
y 1 0
b x 0 0
y 0 0
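A caveat: df.loc['a']['p'] = s1 relies on chained indexing, which only works while df.loc['a'] happens to return a view; it can raise SettingWithCopyWarning and will not modify df under copy-on-write in recent pandas. A minimal sketch of an alignment-based alternative, reusing the question's own example, reorders s1 with reindex before assigning:
import pandas as pd

multi_index = pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']])
df = pd.DataFrame(0, multi_index, ['p', 'q'])
s1 = pd.Series([1, 2], ['y', 'x'])

# Reorder s1 to match the inner-level labels of the 'a' slice,
# then assign the raw values so no further alignment takes place.
df.loc['a', 'p'] = s1.reindex(df.loc['a'].index).values
print(df)
     p  q
a x  2  0
  y  1  0
b x  0  0
  y  0  0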
Let's say we have a data frame that looks like this:
Index 1  Index 2  Value
a        z        1
a        z        2
b        y        1
c        y        1
And I have a list, e.g. ['c', 'a', 'd'].
Note that the list might contain index values which are not in the data frame.
Is there a way to select all the rows of the data frame whose index matches an element of the list?
So in this example the output would look like this:
Index 1  Index 2  Value
a        z        1
a        z        2
c        y        1
pd.Series.isin() tests if elements are part of a list (or set), which returns a boolean Series. You can combine that with pd.DataFrame.loc[] which accepts such a boolean series.
This combination will not throw an error when elements of the list are not part of the index, as opposed to using .loc[list] directly.
>>> df.loc[df['Index 1'].isin(['c', 'a', 'd'])]
Index 1 Index 2 Value
0 a z 1
1 a z 2
3 c y 1
Use MultiIndex.get_level_values with Index.isin; loc is not necessary here:
df[df.index.get_level_values('Index 1').isin(['c', 'a', 'd'])]
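For completeness, a minimal runnable sketch of that approach; the frame construction below is an assumption, reconstructed from the question's table:
import pandas as pd

df = pd.DataFrame(
    {'Value': [1, 2, 1, 1]},
    index=pd.MultiIndex.from_tuples(
        [('a', 'z'), ('a', 'z'), ('b', 'y'), ('c', 'y')],
        names=['Index 1', 'Index 2']))

# Boolean mask over the first index level; labels absent from the frame,
# like 'd', are simply ignored instead of raising a KeyError.
mask = df.index.get_level_values('Index 1').isin(['c', 'a', 'd'])
print(df[mask])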
I have found examples of how to remove a column when all its values, or more than some threshold of them, are NaN, but I have not been able to find a solution to my particular problem, which is dropping a column if its last row is NaN. The reason is that I'm using time series data in which the collection of data doesn't all start at the same time. That is fine, but if I used one of the previous solutions it would remove 95% of the dataset. What I do not want is columns whose most recent value is NaN, as that means they are defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try with dropna; with axis=1, subset takes the row labels to consider:
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
You can use .iloc, .loc and .notna() to sort out your problem.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                   "B": ["t", 2, "y", np.nan],
                   "C": ["x", 3, "z", 6]})
df = df.loc[:, df.iloc[-1, :].notna()]
You can use a boolean Series over the last row to select the columns to drop:
df.drop(df.columns[df.iloc[-1].isna()], axis=1)
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
for col in list(temp_df.columns):
    if str(temp_df[col].iloc[-1]) == 'nan':
        temp_df = temp_df.drop(col, axis=1)
This will work for you.
Basically what I'm doing here is looping over a snapshot of the column labels, checking whether the last entry reads as 'nan', and dropping that column if so.
list(temp_df.columns)
takes a copy of the column labels, so the loop isn't thrown off by columns being dropped along the way (dropping by position while iterating, as in drop(i, 1), shifts the remaining positions).
temp_df.drop(col, axis=1)
col is the column label and axis=1 indicates that you want to drop a column.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that you can compare against anything you wish.
Another option is pd.isnull(), a pandas function that also works on scalars:
for col in list(temp_df.columns):
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(col, axis=1)
I have an array of numbers (I think the format makes it a pivot table) that I want to turn into a "tidy" data frame. For example, I start with variable 1 down the left, variable 2 across the top, and the value of interest in the middle, something like this:
X Y
A 1 2
B 3 4
I want to turn that into a tidy data frame like this:
V1 V2 value
A X 1
A Y 2
B X 3
B Y 4
The row and column order don't matter to me, so the following is totally acceptable:
value V1 V2
2 A Y
4 B Y
3 B X
1 A X
For my first go at this, which was able to get me the correct final answer, I looped over the rows and columns. This was terribly slow, and I suspected that some machinery in Pandas would make it go faster.
It seems that melt is close to the magic I seek, but it doesn't get me all the way there. That first array turns into this:
V2 value
0 X 1
1 X 2
2 Y 3
3 Y 4
It gets rid of my V1 variable!
Nothing is special about melt, so I will be happy to read answers that use other approaches, particularly if melt is not much faster than my nested loops and another solution is. Nonetheless, how can I go from that array to the kind of tidy data frame I want as the output?
Example dataframe:
df = pd.DataFrame({"X":[1,3], "Y":[2,4]},index=["A","B"])
Use DataFrame.reset_index with DataFrame.rename_axis and then DataFrame.melt. If you want to order the columns, use DataFrame.reindex.
new_df = (df.rename_axis(index = 'V1')
.reset_index()
.melt('V1',var_name='V2')
.reindex(columns = ['value','V1','V2']))
print(new_df)
Another approach DataFrame.stack:
new_df = (df.stack()
.rename_axis(index = ['V1','V2'])
.rename('value')
.reset_index()
.reindex(columns = ['value','V1','V2']))
print(new_df)
value V1 V2
0 1 A X
1 3 B X
2 2 A Y
3 4 B Y
As @Scott Boston points out in the comments, there is another alternative for getting the names.
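The comment itself isn't quoted above, so as an assumption, one plausible alternative for getting the names is melt with ignore_index=False (pandas >= 1.1), which keeps the row labels through the melt:
import pandas as pd

df = pd.DataFrame({"X": [1, 3], "Y": [2, 4]}, index=["A", "B"])

# ignore_index=False preserves the original index during the melt,
# so it can be promoted to the 'V1' column afterwards.
new_df = (df.melt(var_name='V2', value_name='value', ignore_index=False)
            .rename_axis('V1')
            .reset_index())
print(new_df)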
Melt is a good approach, but it doesn't seem to play nicely with identifying the results by index. You can reset the index first to move it to its own column, then use that column as the id col.
test = pd.DataFrame([[1,2],[3,4]], columns=['X', 'Y'], index=['A', 'B'])
X Y
A 1 2
B 3 4
test = test.reset_index()
index X Y
0 A 1 2
1 B 3 4
test.melt('index',['X', 'Y'], 'prev cols')
index prev cols value
0 A X 1
1 B X 3
2 A Y 2
3 B Y 4
I have a series:
s
A 1
B 0
C 1
D -1
E -1
F 0
...
and a dataframe with a subset of the series index values:
df
one two three ....
A
C
D
F
...
The contents of the df are not relevant to my question.
I am looking for the most Pythonic way to check the series index against the dataframe index and, for every index element that appears in both, change the series value to zero.
The result I'm looking for is for the series to look like this based on sample s and df provided above:
s
A 0
B 0
C 0
D 0
E -1
F 0
Note that some series values were 0 to begin with, and they stay 0. The values whose index elements appear in both indices are changed to 0.
I can iterate through the index, but I'm looking for a more pythonic way to do this.
Thanks in advance.
Just do:
s[s.index.isin(df.index)] = 0
Yields:
A 0
B 0
C 0
D 0
E -1
F 0
dtype: int64
Could use update with a dummy Series of all 0s. Should be fast.
import pandas as pd
s.update(pd.Series(0, index=df.index))
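A quick check on the question's sample data, reconstructed here as an assumption (the df contents don't matter, only its index):
import pandas as pd

s = pd.Series([1, 0, 1, -1, -1, 0], index=list('ABCDEF'))
df = pd.DataFrame(index=list('ACDF'))

# update aligns on the index and overwrites the matching entries in place
s.update(pd.Series(0, index=df.index))
print(s)
A    0
B    0
C    0
D    0
E   -1
F    0
dtype: int64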
I have data from an Excel file in the format
0,1,0
1,0,0
0,0,1
I want to convert those data into a list where the ith element indicates the position of the nonzero element for the ith row. For example, the above would be:
[1,0,2]
I tried two ways to no avail:
Way one (NumPy)
import numpy as np
import pandas as pd

df = pd.read_excel(file, convert_float=False)
idx = np.where(df == 1)[1]
This gives me an odd result: idx is never the same length as the number of rows in df, even though for this data set the two numbers should always be equal. (I double checked, and there are no empty rows.)
Way two (Pandas)
idx = df.where(df==1)
This gives me output like:
52 NaN NaN NaN
53 1 NaN NaN
54 1 NaN NaN
This is the appropriate shape, but I don't know how to just get the column index.
Set up the dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[0,1,0],[1,0,0],[0,0,1]]))
Use np.argwhere to find the element indices:
np.argwhere(df.values ==1)
returns:
array([[0, 1],
[1, 0],
[2, 2]], dtype=int64)
so for row 0, column 1 contains a 1, given the df:
0 1 2
0 0 1 0
1 1 0 0
2 0 0 1
Note:
(you can get just the column indices by assigning indices = np.argwhere(df.values == 1) and then taking indices[:, 1], or np.array_split(indices, 2, 1)[1], for example)
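A minimal sketch that goes all the way to the flat list the question asks for, reusing the frame above:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]]))

# argwhere returns (row, column) pairs in row order, so the second
# column is already the per-row position of the nonzero element.
indices = np.argwhere(df.values == 1)
print(indices[:, 1].tolist())  # [1, 0, 2]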
Here is a solution that works for limited use cases, including this one. If you know that you will only ever have a single 1 in each row, you can transpose the original data frame so that the column indices of the original become the row indices of the transposed frame. With that you can take the label of the max entry in each row of the original (a column of the transpose) and return an array of those labels.
Your original data frame is not the best example for this solution because it is symmetrical and its transpose is the same as the original data frame. So for the sake of this solution we'll use a starting data frame that looks like:
df = pd.DataFrame({0:[0,0,1], 1:[1,0,0], 2:[0,1,0]})
# original data frame --> df
0 1 2
0 0 1 0
1 0 0 1
2 1 0 0
# transposed data frame --> df.T
0 1 2
0 0 0 1
1 1 0 0
2 0 1 0
Now to find, for each row of the original frame, the label of its max:
np.array(df.T.idxmax())
Which returns an array of values that represent the column indices of the original data frame that contain a 1:
[1 2 0]
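For what it's worth, the transpose isn't strictly required: idxmax accepts an axis argument, so the same idea can be written directly (still assuming a single 1 per row):
import numpy as np
import pandas as pd

df = pd.DataFrame({0: [0, 0, 1], 1: [1, 0, 0], 2: [0, 1, 0]})

# idxmax along axis=1 returns, for each row, the column label of the max
print(np.array(df.idxmax(axis=1)))  # [1 2 0]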