I don't understand this code:
d = {'col1': [5, 6, 4, 1, 2, 9, 15, 11]}
df = pd.DataFrame(data=d)
df.head(10)
df['col1'] = df.sort_values('col1')['col1']
print(df.sort_values('col1')['col1'])
This is what is printed:
3 1
4 2
2 4
0 5
1 6
5 9
7 11
6 15
My df doesn't change at all.
Why doesn't this code, df.sort_values('col1')['col1'], sort my dataframe?
Thanks
If you want to assign the sorted column back, you need to convert the output to a NumPy array to prevent index alignment. That is, df.sort_values('col1')['col1'] on its own sorts correctly and changes the index order, but in the assignment step pandas aligns the values back to the original index, so the order of the values does not change.
df['col1'] = df.sort_values('col1')['col1'].to_numpy()
If the DataFrame has a default index, another option is to recreate a default index on the sorted output (matching the original), so alignment assigns by the new index values:
df['col1'] = df.sort_values('col1')['col1'].reset_index(drop=True)
If you just want to sort the DataFrame by the col1 column:
df = df.sort_values('col1')
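To make the alignment behaviour concrete, here is a minimal sketch based on the question's data (the column names aligned/positional are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': [5, 6, 4, 1, 2, 9, 15, 11]})

# sort_values reorders the rows, but each value keeps its original index label
sorted_col = df.sort_values('col1')['col1']

# During assignment, pandas aligns on those labels, undoing the sort:
# this column ends up identical to the original col1
df['aligned'] = sorted_col

# Converting to a NumPy array drops the index, so values land by position
df['positional'] = sorted_col.to_numpy()

print(df)
```

The aligned column reproduces col1 exactly, while positional holds the sorted values, which is what the to_numpy() conversion above achieves.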
I have this dataset
In [4]: df = pd.DataFrame({'A':[1, 2, 3, 4, 5]})
In [5]: df
Out[5]:
A
0 1
1 2
2 3
3 4
4 5
I want to add a new column to the dataset based on the last value of the item, like this:

A   New Column
1
2   1
3   2
4   3
5   4
I tried to use apply with iloc, but it didn't work.
Can you help?
Thank you
With your shown samples, could you please try the following. You can use the shift function to build the new column: it moves all elements of the given column down one row, leaving a NaN in the first element.
import pandas as pd
df['New_Col'] = df['A'].shift()
OR
In case you would like to fill the NaN with zero, try the following; the approach is the same as above.
import pandas as pd
df['New_Col'] = df['A'].shift().fillna(0)
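Putting it together as a complete, runnable sketch on the sample frame from the question (the astype(int) cast is an optional extra to keep the column integer-typed after filling the NaN):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# shift() moves every value down one row; the first row becomes NaN,
# which fillna(0) replaces before casting back to int
df['New_Col'] = df['A'].shift().fillna(0).astype(int)

print(df)
#    A  New_Col
# 0  1        0
# 1  2        1
# 2  3        2
# 3  4        3
# 4  5        4
```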
Suppose I have the following Pandas DataFrame:
df = pd.DataFrame({
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
})
a b c
0 1 4 7
1 2 5 8
2 3 6 9
I want to generate a new pandas.Series so that the values of this series are selected, row by row, from a random column in the DataFrame. So, a possible output for that would be the series:
0 7
1 2
2 9
dtype: int64
(where in row 0 it randomly chose 'c', in row 1 it randomly chose 'a' and in row 2 it randomly chose 'c' again).
I know this can be done by iterating over the rows and using random.choice to choose each row, but iterating over the rows not only has bad performance but also is "unpandonic", so to speak. Also, df.sample(axis=1) would choose whole columns, so all of them would be chosen from the same column, which is not what I want. Is there a better way to do this with vectorized pandas methods?
Here is a fully vectorized solution. Note however that it does not use Pandas methods, but rather involves operations on the underlying numpy array.
import numpy as np
indices = np.random.choice(np.arange(len(df.columns)), len(df), replace=True)
Example output is [1, 2, 1] which corresponds to ['b', 'c', 'b'].
Then use this to slice the numpy array:
df['random'] = df.to_numpy()[np.arange(len(df)), indices]
Results:
a b c random
0 1 4 7 7
1 2 5 8 5
2 3 6 9 9
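For reproducibility, the same approach as a self-contained sketch using NumPy's Generator API with a fixed seed (the seed value is arbitrary):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# one random column index per row
indices = rng.integers(0, len(df.columns), size=len(df))

# fancy-index the underlying array: for row i, take column indices[i]
result = pd.Series(df.to_numpy()[np.arange(len(df)), indices], index=df.index)

print(result)
```

Each element of result is guaranteed to come from the corresponding row of df, with the column chosen independently per row.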
Maybe something like:
pd.Series([np.random.choice(i, 1)[0] for i in df.values])
This does the job (using the built-in random module):
import random
ddf = df.apply(lambda row: random.choice(row.tolist()), axis=1)
or using pandas sample (taking the single sampled element with .iloc[0] so the result stays a Series rather than expanding into a DataFrame):
ddf = df.apply(lambda row: row.sample().iloc[0], axis=1)
Both have the same behaviour. ddf is your Series.
pd.DataFrame(df.values[range(df.shape[0]),
             np.random.randint(0, df.shape[1], size=df.shape[0])])
output
   0
0  4
1  5
2  9
You're probably still going to need to iterate through each row while selecting a random value in each row - whether you do it explicitly with a for loop or implicitly with whatever function you decide to call.
You can, however, simplify this to a single line using a list comprehension, if it suits your style:
import random
result = pd.Series([random.choice(df.iloc[i]) for i in range(len(df))])
When I add a column using apply on other columns, does pandas store the result of this new column in the same row as the one used for the computation? If not, how can I make it do so?
The reason I'm not completely confident is the following example:
df = pd.DataFrame({'index':[0,1,2,3,4], 'value':[1,2,3,4,5]})
df2 = pd.DataFrame({'index':[0,2,1,3,5], 'value':[1,2,3,4,5]})
df['second_value'] = df['value'].apply(lambda x: x**2)
df['third_value'] = df2['value'].apply(lambda x: x**2)
df
The result this yields is:
   index  value  second_value  third_value
0      0      1             1            1
1      1      2             4            4
2      2      3             9            9
3      3      4            16           16
4      4      5            25           25
So what I see here is that pandas only checks the order. Can it happen that a DataFrame gets sorted at some point, which would mess this up, or can I assume that the order is always preserved when I perform
df['new_value'] = df['old_value'].apply(...)
?
EDIT: In my original code snippet I forgot to set the index, and that is actually where I was going wrong. I had df.set_index('index') and df2.set_index('index') before using apply; the problem is that this method returns a copy with the new index. So either assign these back to the original dataframes df and df2, or better, pass inplace=True in the method call so the index is set on the given dataframe without creating a copy.
That's not how you define an index. You need to pass a list/iterable to the index keyword argument when calling the pd.DataFrame constructor.
df = pd.DataFrame({'value' : [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'value' : [1, 2, 3, 4, 5]}, index=[0, 2, 1, 3, 4])
df['second'] = df['value'] ** 2
df['third'] = df2['value'] ** 2
df
value second third
0 1 1 1
1 2 4 9 # note these
2 3 9 4 # two rows
3 4 16 16
4 5 25 25
The assignment operations are always index aligned.
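A minimal sketch of that alignment rule: when the right-hand side carries a different index order, the values land by label, not by position (the column name aligned is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'value': [1, 2, 3, 4, 5]})

# a Series whose index is deliberately reversed
s = pd.Series([10, 20, 30, 40, 50], index=[4, 3, 2, 1, 0])

# assignment matches labels: row 0 receives s[0] == 50, row 4 receives s[4] == 10
df['aligned'] = s

print(df)
#    value  aligned
# 0      1       50
# 1      2       40
# 2      3       30
# 3      4       20
# 4      5       10
```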
I'm trying to re-insert into a pandas dataframe a column that I extracted and whose order I changed by sorting it.
Very simply, I have extracted a column from a pandas df:
col1 = df.col1
This column contains integers, and I used the .sort() method to order it from smallest to largest. Then I did some operations on the data.
col1.sort()
#do stuff that changes the values of col1.
Now the indexes of col1 are the same as the indexes of the overall df, but in a different order.
I was wondering how I can insert the column back into the original dataframe (replacing the col1 that is there at the moment)
I have tried both of the following methods:
1)
df.col1 = col1
2)
df.insert(column_index_of_col1, "col1", col1)
but both methods give me the following error:
ValueError: cannot reindex from a duplicate axis
Any help will be greatly appreciated.
Thank you.
Consider this DataFrame:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]}, index=[0, 0, 1])
df
Out:
A B
0 1 6
0 2 5
1 3 4
Assign the second column to b and sort it and take the square, for example:
b = df['B']
b = b.sort_values()
b = b**2
Now b is:
b
Out:
1 16
0 25
0 36
Name: B, dtype: int64
Without knowing the exact operation you've done on the column, there is no way to know whether 25 corresponds to the first row in the original DataFrame or the second one. You can take the inverse of the operation (take the square root and match, for example) but that would be unnecessary I think. If you start with an index that has unique elements (df = df.reset_index()) it would be much easier. In that case,
df['B'] = b
should work just fine.
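A short sketch of that reset_index fix under the same setup, assuming you are fine with discarding the duplicate labels:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 5, 4]}, index=[0, 0, 1])

# With duplicate labels, assigning an index-aligned Series raises
# "ValueError: cannot reindex from a duplicate axis".
# Resetting to a unique default index first makes alignment unambiguous.
df = df.reset_index(drop=True)

b = df['B'].sort_values() ** 2
df['B'] = b  # aligns cleanly now that every label is unique

print(df)
#    A   B
# 0  1  36
# 1  2  25
# 2  3  16
```

Note the values land back on their original rows (6 squared on row 0, and so on), because alignment matches the now-unique labels.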
Say I want a function that changes the value of a named column in a given row number of a DataFrame.
One option is to find the column's location and use iloc, like that:
def ChangeValue(df, rowNumber, fieldName, newValue):
    columnNumber = df.columns.get_loc(fieldName)
    df.iloc[rowNumber, columnNumber] = newValue
But I wonder if there is a way to use the magic of iloc and loc in one go, and skip the manual conversion.
Any ideas?
I suggest just using iloc combined with the Index.get_loc method. eg:
df.iloc[0:10, df.columns.get_loc('column_name')]
A bit clumsy, but simple enough.
MultiIndex has both get_loc and get_locs which takes a sequence; unfortunately Index just seems to have the former.
Using loc
One has to resort to either employing integer location iloc all the way (as suggested in this answer), or using plain label location loc all the way, as shown here:
df.loc[df.index[[0, 7, 13]], 'column_name']
According to this answer,
ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.
So you should especially be able to use df.ix[rowNumber, fieldName] in case type(df.index) != type(rowNumber). (Note, however, that ix has since been deprecated and was removed in pandas 1.0, so this only applies to older versions.)
Even though it does not hold for every case, I'd like to add an easy one, if you are looking for the top or bottom entries:
df.head(1)['column_name'] # first entry in 'column_name'
df.tail(5)['column_name'] # last 5 entries in 'column_name'
Edit: doing the following is not a good idea. I leave the answer as a counter example.
You can do this:
df.iloc[rowNumber].loc[fieldName] = newValue
Example
import pandas as pd
def ChangeValue(df, rowNumber, fieldName, newValue):
    df.iloc[rowNumber].loc[fieldName] = newValue

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=['A', 'B', 'C'])
print(df)
A B C
4 0 2 3
5 0 4 1
6 10 20 30
ChangeValue(df, 1, "B", 999)
print(df)
A B C
4 0 2 3
5 0 999 1
6 10 20 30
But be careful: if newValue is not of the same type, it does not work and fails silently:
ChangeValue(df, 1, "B", "Oops")
print(df)
A B C
4 0 2 3
5 0 999 1
6 10 20 30
There is some good info about working with columns data types here: Change column type in pandas
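To sidestep the silent failure entirely, a single iloc call taking both row and column (as in the accepted approach above) writes through to the frame itself rather than to a temporary copy; and if you do need to store a value of a different type, you can cast the column explicitly first. A sketch:

```python
import pandas as pd

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=['A', 'B', 'C'])

# Single-step indexing writes to the frame itself, never a temporary copy
df.iloc[1, df.columns.get_loc('B')] = 999
print(df.loc[5, 'B'])  # 999

# To store a mismatched type safely, cast the column to object first
df['B'] = df['B'].astype(object)
df.iloc[1, df.columns.get_loc('B')] = 'Oops'
print(df.loc[5, 'B'])  # Oops
```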