I'm trying to replace the last row(s) of a Pandas dataframe using iloc; however, I cannot get it to work. There are lots of solutions out there, but the simplest (slowest) is here:
How to do a FIFO push-operation for rows on Pandas dataframe in Python?
Why doesn't this method work in the code below?
def append_from_dataframe(self, timeframe, new_dataframe):
    new_dataframe.reset_index(inplace=True)
    temp_dataframe = self.timeframedict.get(timeframe)
    num_rows_existing = temp_dataframe.shape[0]
    num_rows_new = new_dataframe.shape[0]
    overlap = (num_rows_existing + num_rows_new) - 500
    # slow, replace with numpy array eventually
    if overlap >= 1:
        # number of rows to shift
        i = overlap * -1
        # shift the dataframe back in time
        temp_dataframe = temp_dataframe.shift(i)
        #self.timeframedict.get(timeframe) = self.timeframedict.get(timeframe).shift(overlap)
        # replace the last i rows with the new values
        temp_dataframe.iloc[i:] = new_dataframe
        self.timeframedict.update({timeframe: temp_dataframe})
    else:
        # TODO - see this https://stackoverflow.com/questions/10715965/add-one-row-in-a-pandas-dataframe
        self.timeframedict.update({timeframe: self.timeframedict.get(timeframe).append(new_dataframe)})
Contents of the dataframe to replace one row in the other:
ipdb> new_dataframe
Timestamp Open High Low Close Volume localtime
0 1533174420000 423.43 423.44 423.43 423.44 0.73765 1533174423776
temp_dataframe.shift(i) shifts the values back by one row and fills the vacated last row with NaN:
ipdb> temp_dataframe.iloc[i:]
Timestamp Open High Low Close Volume localtime
499 NaN NaN NaN NaN NaN NaN NaN
However temp_dataframe.iloc[i:] = new_dataframe does not replace anything.
Edit: I should add that after some more playing around, I can now replace one row with:
temp_dataframe.iloc[-1] = new_dataframe.iloc[0]
However, I cannot get the multiple-row version to work.
df = pd.DataFrame({'a':[1,2,3,4,5],'b':['foo','bar','foobar','foobaz','food']})
Output:
df
Out[117]:
a b
0 1 foo
1 2 bar
2 3 foobar
3 4 foobaz
4 5 food
Replace the last two rows (foobaz and food) with the second and first rows respectively:
df.iloc[-2:] = [df.iloc[1], df.iloc[0]]
df
Out[119]:
a b
0 1 foo
1 2 bar
2 3 foobar
3 2 bar
4 1 foo
You can also do this to achieve the same result:
df.iloc[-2:] = df.iloc[1::-1].values
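The reason .values helps here is index alignment: when you assign a DataFrame through .iloc, pandas aligns the right-hand side on its index labels, and the new rows' labels (0, 1, ...) don't match the target rows' labels, so the cells are filled with NaN. Passing .values hands pandas a bare NumPy array, which is assigned purely by position. A minimal sketch (variable names are my own):

```python
import pandas as pd

temp = pd.DataFrame({'a': [1.0, 2.0, 3.0]})   # target frame, index 0..2
new = pd.DataFrame({'a': [10.0, 20.0]})       # replacement rows, index 0..1

# Assigning `new` directly would align its index (0, 1) against the target
# labels (1, 2) and fill the mismatches with NaN; `.values` strips the index,
# so the assignment lands positionally.
temp.iloc[-2:] = new.values
print(temp['a'].tolist())  # [1.0, 10.0, 20.0]
```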
My questions regarding this topic are twofold (in bold):
I am using a Dataframe to manage a large amount of data in the following format:
Time  Data
1     Start
2     1
3     2
4     3
5     R
6     A
7     Start
8     3
9     R
The goal is to have a new column storing the time difference from the first message after a 'Start' message until the first 'R' message. This 'Start'-to-'Start' cycle repeats thousands of times in the dataframe. I am marking rows that contain the first message after a 'Start' using the following code:
df['Shift'] = df['Data'].shift()
df['first_in_cycle'] = df['Shift'].str.contains('Start')
df.drop(columns='Shift', inplace=True)
Then I attempt to mark the first occurrence of an 'R' using the following code:
counter = (df['Data'].str.contains('Start')).cumsum()
first_r = df[df['Data'].str.contains('R')].groupby(counter).transform('idxmin')
df.loc[first_r.index, 'first_R'] = True
However, the result of this is a Dataframe with all of the 'R' messages marked as true, which is incorrect. I am unsure of how to fix this issue.
The plan was to get both of the flag columns correct then merge them using .any(), removing all rows that are marked as false, then using .diff() to calculate the difference between the 'R' and 'Start'. Is this the best way to accomplish this?
The following is the desired output given the initial example:
Time  Data   Time Difference
1     Start
2     1
5     R      3
7     Start
8     3
9     R      1
Use:
#create default index if necessary
df = df.reset_index(drop=True)
#check Start and R
m1 = df['Data'].str.contains('Start')
m2 = df['Data'].str.contains('R')
#groups by Start
g = m1.cumsum()
#get first R per groups to mask
mask1 = m2.groupby(g).transform('idxmax').eq(df.index)
#get Start + one row after Start
mask2 = m1.groupby(g).shift(fill_value=True)
#boolean indexing
df = df[mask1 | mask2]
#get difference per groups
df['Time Difference'] = df['Time'].groupby(g).diff().mask(mask2)
print (df)
Time Data Time Difference
0 1 Start NaN
1 2 1 NaN
4 5 R 3.0
6 7 Start NaN
7 8 3 NaN
8 9 R 1.0
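For completeness, here is a runnable version of the steps above with the sample data from the question (the column names Time and Data are taken from the question; I filter the helper masks alongside the frame so everything stays index-aligned):

```python
import pandas as pd

df = pd.DataFrame({
    'Time': range(1, 10),
    'Data': ['Start', '1', '2', '3', 'R', 'A', 'Start', '3', 'R'],
})

m1 = df['Data'].str.contains('Start')
m2 = df['Data'].str.contains('R')
g = m1.cumsum()                                            # cycle id per Start..Start block
mask1 = m2.groupby(g).transform('idxmax').eq(df.index)     # first R of each cycle
mask2 = m1.groupby(g).shift(fill_value=True)               # Start row + the row right after it
keep = mask1 | mask2

out = df[keep].copy()
# diff per cycle, blanked on the Start rows so only the R rows carry a value
out['Time Difference'] = out['Time'].groupby(g[keep]).diff().mask(mask2[keep])
print(out)
```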
I have found examples of how to remove a column based on all values or a threshold, but I have not found a solution to my particular problem: dropping a column if its last row is NaN. The reason is that I'm using time series data in which collection doesn't all start at the same time, which is fine, but if I used one of the previous solutions it would remove 95% of the dataset. However, I don't want data whose most recent value is NaN, as that means the series is defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
You can use .iloc, .loc and .notna() to sort out your problem.
df = pd.DataFrame({"A": [np.nan, 1, "x", 4],
                   "B": ["t", 2, "y", np.nan],
                   "C": ["x", 3, "z", 6]})
df = df.loc[:,df.iloc[-1,:].notna()]
You can use a boolean mask to select the columns to drop:
df.drop(df.columns[df.iloc[-1].isna()], axis=1)
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
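A quick runnable check of the selection-based approach above, using the sample frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, 1, 'x', 4],
                   'B': ['t', 2, 'y', np.nan],
                   'C': ['x', 3, 'z', 6]})

# keep only the columns whose last value is not NaN
kept = df.loc[:, df.iloc[-1].notna()]
print(list(kept.columns))  # ['A', 'C']
```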
for col in list(temp_df.columns):
    if temp_df[col].iloc[-1] == 'nan':
        temp_df = temp_df.drop(col, axis=1)
This will work for you.
Basically what I'm doing here is looping over all columns and checking whether the last entry is 'nan', then dropping that column. Note that the comparison == 'nan' only matches the literal string 'nan'; iterating over a copy of temp_df.columns keeps the loop safe while columns are being dropped.
temp_df.drop(col, axis=1)
axis=1 says that you want to drop a column rather than a row.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that someone can compare anything they wish to.
Another method is pd.isnull(), a function in the pandas library that detects real missing values (not just the string 'nan'), which works like this:
for col in list(temp_df.columns):
    if pd.isnull(temp_df[col].iloc[-1]):
        temp_df = temp_df.drop(col, axis=1)
I have a dataframe as below:
I want to get, for each row, the name of the column that contains 1.
Use DataFrame.dot:
df1 = df.dot(df.columns)
If there are multiple 1s per row:
df2 = df.dot(df.columns + ';').str.rstrip(';')
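A sketch of how the dot trick behaves on a small 0/1 frame (the sample data here is my own):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 0],
                   'B': [0, 1, 0],
                   'C': [0, 1, 1]})

# 1 * 'name' is 'name' and 0 * 'name' is '', so the matrix product
# concatenates the names of the columns holding a 1 in each row.
matched = df.dot(df.columns)
multi = df.dot(df.columns + ';').str.rstrip(';')
print(matched.tolist())  # ['A', 'BC', 'C']
print(multi.tolist())    # ['A', 'B;C', 'C']
```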
Firstly
Your question is very ambiguous, and I recommend reading the link in #sammywemmy's comment. If I understand your problem correctly, we'll talk about this mask first:
df.columns[
(df == 1) # mask
.any(axis=0) # mask
]
What's happening? Let's work our way outward, starting from within df.columns[**HERE**]:
(df == 1) makes a boolean mask of the df with True/False(1/0)
.any() as per the docs:
"Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent".
This gives us a handy Series to mask the column names with.
We will use this example to automate for your solution below
Next:
Automate to get an output of (<row index>, [<col name>, <col name>, ...]) wherever there is a 1 in the row values. Although this will be slower on large datasets, it should do the trick:
import pandas as pd
data = {'foo':[0,0,0,0], 'bar':[0, 1, 0, 0], 'baz':[0,0,0,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data, index=['a','b','c','d'])
print(df)
foo bar baz spam
a 0 0 0 0
b 0 1 0 1
c 0 0 0 0
d 0 0 0 1
# group our df by index and create a dict with one-row dfs as values
df_dict = dict(
    list(
        df.groupby(df.index)
    )
)
Next step is a for loop that iterates the contents of each df in df_dict, checks them with the mask we created earlier, and prints the intended results:
for k, v in df_dict.items():  # k: index label, v: a one-row df
    check = v.columns[(v == 1).any()]
    if len(check) > 0:
        print((k, check.to_list()))
('b', ['bar', 'spam'])
('d', ['spam'])
Side note:
See how I generated sample data that can easily be reproduced? In the future, please try to post reproducible sample data with your questions. It helps you understand your problem better and makes it easier for us to answer.
Getting the column name breaks down into two cases.
First case: if you want the result in a new column, the condition should match exactly one column per row, because this approach yields only one column name for each row.
data = {'foo':[0,0,3,0], 'bar':[0, 5, 0, 0], 'baz':[0,0,2,0], 'spam':[0,1,0,1]}
df = pd.DataFrame(data)
df=df.replace(0,np.nan)
df
foo bar baz spam
0 NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0
2 3.0 NaN 2.0 NaN
3 NaN NaN NaN 1.0
If you are looking for the minimum or maximum:
max_col = df.idxmax(axis=1)
min_col = df.idxmin(axis=1)
out = df.assign(max=max_col, min=min_col)
out
foo bar baz spam max min
0 NaN NaN NaN NaN NaN NaN
1 NaN 5.0 NaN 1.0 bar spam
2 3.0 NaN 2.0 NaN foo baz
3 NaN NaN NaN 1.0 spam spam
Second case: if your condition is satisfied in multiple columns, for example you are looking for every column that contains 1, you want a list of names, because multiple matches will not fit in a single column of the same dataframe.
str_con = df.astype(str).apply(lambda x: x.str.contains('1.0', case=False, na=False)).any()
df.columns[str_con]
#output
Index(['spam'], dtype='object') #only spam contains 1
Or, with a numerical condition, the columns containing a value greater than 1:
num_con = df.apply(lambda x: x > 1.0).any()
df.columns[num_con]
#output
Index(['foo', 'bar', 'baz'], dtype='object') #these col has higher value than 1
Happy learning
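The 'columns containing 1' check can also be expressed numerically, without going through string representations; a sketch on the same sample data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [0, 0, 3, 0], 'bar': [0, 5, 0, 0],
                   'baz': [0, 0, 2, 0], 'spam': [0, 1, 0, 1]}).replace(0, np.nan)

# compare values directly instead of their string forms
has_one = df.eq(1.0).any()                 # True per column that contains a 1
print(list(df.columns[has_one]))           # ['spam']
print(list(df.columns[(df > 1.0).any()]))  # ['foo', 'bar', 'baz']
```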
I have a pandas DataFrame with a multi-index like this:
import pandas as pd
import numpy as np
arr = [1]*3 + [2]*3
arr2 = list(range(3)) + list(range(3))
mux = pd.MultiIndex.from_arrays([
    arr,
    arr2
], names=['one', 'two'])
df = pd.DataFrame({'a': np.arange(len(mux))}, mux)
df
a
one two
1 0 0
1 1 1
1 2 2
2 0 3
2 1 4
2 2 5
I have a function that takes a slice of a DataFrame and needs to assign a new column to the rows that have been sliced:
def work(df):
    b = df.copy()
    # do some work on the slice and create values for a new column of the slice
    b['b'] = b['a']*2
    # assign the new values back to the slice in a new column
    df['b'] = b['b']
#pass in a slice of the df with only records that have the last value for 'two'
work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])
However calling the function results in the error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
How can I create a new column 'b' in the original DataFrame and assign its values for only the rows that were passed to the function, leaving the rest of the rows nan?
The desired output is:
a b
one two
1 0 0 nan
1 1 1 nan
1 2 2 4
2 0 3 nan
2 1 4 nan
2 2 5 10
NOTE: In the work function I'm actually doing a bunch of complex operations, involving calls to other functions, to generate the values for the new column, so I don't think this will work. Multiplying by 2 in my example is just for illustrative purposes.
You actually don't have an error, but just a warning. Try this:
def work(df):
    b = df.copy()
    # do some work on the slice and create values for a new column of the slice
    b['b'] = b['a']*2
    # assign the new values back to the slice in a new column
    df['b'] = b['b']
    return df
#pass in a slice of the df with only records that have the last value for 'two'
new_df = work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])
Then:
df.reset_index().merge(new_df, how="left").set_index(["one","two"])
Output:
         a     b
one two
1   0    0   NaN
    1    1   NaN
    2    2   4.0
2   0    3   NaN
    1    4   NaN
    2    5  10.0
I don't think you need a separate function at all. Try this...
df['b'] = df['a'].where(df.index.isin(df.index.get_level_values('two')[-1:], level=1))*2
The Series.where() function being called on df['a'] here should return a series where values are NaN for rows that do not result from your query.
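A runnable version of this .where approach, rebuilt on the question's frame:

```python
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 2, 2, 2], [0, 1, 2, 0, 1, 2]], names=['one', 'two'])
df = pd.DataFrame({'a': np.arange(6)}, index=mux)

# rows whose 'two' level equals the last seen value of 'two' (here 2)
sel = df.index.isin(df.index.get_level_values('two')[-1:], level=1)
df['b'] = df['a'].where(sel) * 2   # NaN everywhere the mask is False
print(df)
```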
I have seen a variant of this question asked that keeps the top n rows of each group in a pandas dataframe and the solutions use n as an absolute number rather than a percentage here Pandas get topmost n records within each group. However, in my dataframe, each group has different numbers of rows in it and I want to keep the top n% rows of each group. How would I approach this problem?
You can construct a Boolean series of flags and filter before you groupby. First let's create an example dataframe and look at the number of rows for each unique value in the first series:
np.random.seed(0)
df = pd.DataFrame(np.random.randint(0, 2, (10, 3)))
print(df[0].value_counts())
0 6
1 4
Name: 0, dtype: int64
Then define a fraction, e.g. 50% below, and construct a Boolean series for filtering:
n = 0.5
g = df.groupby(0)
flags = (g.cumcount() + 1) <= g[1].transform('size') * n
Then apply the condition, set the index as the first series and (if required) sort the index:
df = df.loc[flags].set_index(0).sort_index()
print(df)
1 2
0
0 1 1
0 1 1
0 1 0
1 1 1
1 1 0
As you can see, the resultant dataframe has only three rows with index 0 and two with index 1, in each case half the number in the original dataframe.
Here is another option, which builds on some of the answers in the post you mentioned.
First of all, here is a quick function to either round up or round down. If we want the top 30% of rows of a dataframe 8 rows long, then we would try to take 2.4 rows, so we will need to round up or down.
My preferred option is to round up: for example, if we were to take 50% of the rows but had one group with only one row, we would still keep that row. I kept this separate so that you can change the rounding as you wish.
import math

def round_func(x, up=True):
    '''Round a float up or down to an int'''
    return math.ceil(x) if up else math.floor(x)
Next I make a dataframe to work with and set a parameter p, the fraction of the rows from each group that we should keep. Everything else follows; I have commented it so that hopefully you can follow.
import pandas as pd
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
p = 0.30  # top fraction to keep; currently set to 30%
df_top = df.groupby('id').apply(                       # group by the ids
    lambda x: x.reset_index()['value'].nlargest(       # in each group take the top rows by column 'value'
        round_func(x.count().max() * p)))              # calculate how many to keep from each group
df_top = df_top.reset_index().drop('level_1', axis=1)  # make the dataframe nice again
df looked like this
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
df_top looks like this
id value
0 1 3
1 2 4
2 2 3
3 3 1
4 4 1
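The same round-up idea can be written more compactly with math.ceil and groupby().head(); the structure below is my own sketch, reusing the sample data from this answer:

```python
import math

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})
p = 0.30  # keep the top 30% of rows per id, rounding up

# sort once globally, then take the first ceil(len * p) rows of each group
top = (df.sort_values('value', ascending=False)
         .groupby('id', group_keys=False)
         .apply(lambda g: g.head(math.ceil(len(g) * p))))
print(top.sort_index())
```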