I have a data frame df1:
df1 =
index col1 col2
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 6 7
What I would like to do is, for example, replace the last two rows of col2 with NaN, so that the resulting data frame would be:
index col1 col2
1 1 2
2 2 3
3 3 4
4 4 5
5 5 NaN
6 6 NaN
Index by position with DataFrame.iloc; this needs the integer position of the column, which Index.get_loc returns:
df.iloc[-2:, df.columns.get_loc('col2')] = np.nan
Or use DataFrame.loc with indexing df.index:
df.loc[df.index[-2:], 'col2'] = np.nan
print (df)
col1 col2
1 1 2.0
2 2 3.0
3 3 4.0
4 4 5.0
5 5 NaN
6 6 NaN
Finally, if you need an integer column, convert it to the nullable Int64 dtype:
df['col2'] = df['col2'].astype('Int64')
print (df)
col1 col2
1 1 2
2 2 3
3 3 4
4 4 5
5 5 <NA>
6 6 <NA>
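A minimal, self-contained sketch of the two approaches above, rebuilt from the question's data. The explicit cast to float64 before assigning NaN is an assumption added here, since newer pandas versions may warn or raise when NaN is written into an int64 column; the names base, by_pos, by_label and as_int are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question (index labels 1..6).
base = pd.DataFrame(
    {"col1": [1, 2, 3, 4, 5, 6], "col2": [2, 3, 4, 5, 6, 7]},
    index=range(1, 7),
)
# Assumption: cast up front so the column can hold NaN without an implicit upcast.
base["col2"] = base["col2"].astype("float64")

# Position-based: last two rows, column located with get_loc.
by_pos = base.copy()
by_pos.iloc[-2:, by_pos.columns.get_loc("col2")] = np.nan

# Label-based: take the last two index labels, then select by column name.
by_label = base.copy()
by_label.loc[by_label.index[-2:], "col2"] = np.nan

# Optional: nullable integer dtype, which shows missing values as <NA>.
as_int = by_label.astype({"col2": "Int64"})
print(as_int)
```

Both variants modify the frame in a single indexing step, which avoids chained assignment.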
Just try:
df.col2[-2:] = np.nan
It seems like the post is going to gather all the possible ways
df["col2"].iloc[-2:,] = np.nan
4 ways to do this. Ways 3 and 4 seem the best to me:
1)
df.at[5,'col2']=math.nan
df.at[6,'col2']=math.nan
2)
df.loc[5,'col2']=np.nan
df.loc[6,'col2']=np.nan
3) from the answer above
df.col2[-2:]=np.nan
4)
df['col2'][-2:]=np.nan
I have DataFrame object df with column like that:
[In]: df
[Out]:
id sum
0 1 NaN
1 1 NaN
2 1 2
3 1 NaN
4 1 4
5 1 NaN
6 2 NaN
7 2 NaN
8 2 3
9 2 NaN
10 2 8
11 2 NaN
... ... ...
[1810601 rows x 2 columns]
I have a lot of NaN values in my column and I want to fill them in the following way:
if a NaN is at the beginning (the first rows for an id), it should become 0
otherwise, a NaN should take the value from the previous index for the same id
Output should be like that:
[In]: df
[Out]:
id sum
0 1 0
1 1 0
2 1 2
3 1 2
4 1 4
5 1 4
6 2 0
7 2 0
8 2 3
9 2 3
10 2 8
11 2 8
... ... ...
[1810601 rows x 2 columns]
I tried to do it "step by step" using a loop with iterrows(), but that is a very inefficient method. I believe it can be done faster with pandas methods.
Try ffill with groupby, as suggested, then fill the remaining leading NaN values with 0:
df['sum'] = df.groupby('id')['sum'].ffill().fillna(0)
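As a hedged illustration of why the groupby matters, the sketch below rebuilds the twelve visible rows of the question and compares the grouped forward fill with a plain one. The names raw, grouped and plain are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

nan = np.nan
raw = pd.DataFrame({
    "id": [1] * 6 + [2] * 6,
    "sum": [nan, nan, 2, nan, 4, nan, nan, nan, 3, nan, 8, nan],
})

# Forward fill inside each id, then replace the leading NaN values with 0.
grouped = raw.groupby("id")["sum"].ffill().fillna(0)

# Without groupby, the last value of id 1 leaks into the first rows of id 2.
plain = raw["sum"].ffill().fillna(0)

print(pd.DataFrame({"id": raw["id"], "grouped": grouped, "plain": plain}))
```

The grouped result is aligned with the original index, so it can be assigned straight back to df['sum'].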
How would I select rows 2 through 4 of the following df to get the desired output shown below?
I tried to do df = df.index.between(2,4) but I got the following error: AttributeError: 'Int64Index' object has no attribute 'between'
col 1 col 2 col 3
0 1 1 2
1 5 4 2
2 2 1 5
3 1 2 2
4 3 2 4
5 4 3 2
Desired output
col 1 col 2 col 3
2 2 1 5
3 1 2 2
4 3 2 4
Try using loc for index selection using label slicing:
df.loc[2:4]
Output:
col 1 col 2 col 3
2 2 1 5
3 1 2 2
4 3 2 4
The easiest way to select rows from a DataFrame by position is .iloc[rows, columns]. For example, this selects lines 2 to 4 (positions 1 to 3; the stop position of the slice is excluded):
df1=pd.DataFrame({"a":[1,2,3,4,5,6,7],"b":[4,5,6,7,8,9,10]})
df1.iloc[1:4]
With loc (the bounds are named lo and hi to avoid shadowing the built-ins min and max):
lo=2
hi=4
between_range= range(lo, hi+1)
df.loc[between_range]
Use the following (iloc excludes the stop position, so the slice must end at 5 to include row 4):
df.iloc[2:5]
You want to use .iloc[rows,columns]. The stop position is excluded, so rows 2 through 4 are:
df.iloc[2:5, :]
between is not available on an Index, only on a Series. So, if you want to use a boolean mask, you first need to convert the index to a Series using to_series, like this:
df
# col1 col2 col3
# 0 1 1 2
# 1 5 4 2
# 2 2 1 5
# 3 1 2 2
# 4 3 2 4
# 5 4 3 2
df[df.index.to_series().between(2,4)]
# col1 col2 col3
# 2 2 1 5
# 3 1 2 2
# 4 3 2 4
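The difference between the label-based and position-based answers above is easy to miss, so here is a small sketch on the question's data. It assumes the default integer index, where labels and positions happen to coincide; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col 1": [1, 5, 2, 1, 3, 4],
    "col 2": [1, 4, 1, 2, 2, 3],
    "col 3": [2, 2, 5, 2, 4, 2],
})

by_label = df.loc[2:4]        # label slice: both ends included
by_position = df.iloc[2:5]    # position slice: stop excluded, so use 5
too_short = df.iloc[2:4]      # only rows 2 and 3

# Boolean mask built directly from index comparisons.
by_mask = df[(df.index >= 2) & (df.index <= 4)]

print(by_label)
```

With a non-default index (for example after filtering or sorting), loc and iloc no longer select the same rows, so the choice depends on whether "rows 2 through 4" means labels or positions.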
Let's say I have a dataframe df:
df = pd.DataFrame({'col1': [1,1,2,2,2], 'col2': ['A','B','A','B','C'], 'value': [2,4,6,8,10]})
col1 col2 value
0 1 A 2
1 1 B 4
2 2 A 6
3 2 B 8
4 2 C 10
I'm looking for a way to create any missing rows among the possible combinations of the existing col1 and col2 values, and fill in the missing rows with zeros.
The desired result would be:
col1 col2 value
0 1 A 2
1 1 B 4
2 2 A 6
3 2 B 8
4 2 C 10
5 1 C 0 <- Missing the "1-C" combination, so create it w/ value = 0
I've looked into using stack and unstack to make this work, but I'm not sure that's exactly what I need.
Thanks in advance
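Use pivot, then stack (recent pandas versions require the pivot arguments to be passed as keywords):
df.pivot(index='col1', columns='col2', values='value').fillna(0).stack().to_frame('values').reset_index()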
Out[564]:
col1 col2 values
0 1 A 2.0
1 1 B 4.0
2 1 C 0.0
3 2 A 6.0
4 2 B 8.0
5 2 C 10.0
Another way using unstack with fill_value=0 and stack, reset_index
df.set_index(['col1','col2']).unstack(fill_value=0).stack().reset_index()
Out[311]:
col1 col2 value
0 1 A 2
1 1 B 4
2 1 C 0
3 2 A 6
4 2 B 8
5 2 C 10
You could use reindex + MultiIndex.from_product:
index = pd.MultiIndex.from_product([df.col1.unique(), df.col2.unique()], names=['col1', 'col2'])
result = df.set_index(['col1', 'col2']).reindex(index, fill_value=0).reset_index()
print(result)
Output
col1 col2 value
0 1 A 2
1 1 B 4
2 1 C 0
3 2 A 6
4 2 B 8
5 2 C 10
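For reference, a self-contained sketch of the reindex approach on the question's data. Passing names= to MultiIndex.from_product is an addition made here so that reset_index yields col1 and col2 regardless of how index names are propagated; full_index is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [1, 1, 2, 2, 2],
    "col2": ["A", "B", "A", "B", "C"],
    "value": [2, 4, 6, 8, 10],
})

# Every combination of the observed col1 and col2 values.
full_index = pd.MultiIndex.from_product(
    [df["col1"].unique(), df["col2"].unique()],
    names=["col1", "col2"],
)

# Reindex onto the full grid; combinations that were missing get value 0.
result = (
    df.set_index(["col1", "col2"])
      .reindex(full_index, fill_value=0)
      .reset_index()
)
print(result)
```

Because fill_value=0 is applied during the reindex, the value column stays an integer column instead of being upcast to float.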
Let's suppose that I have the following dataset:
Stock_id Week Stock_value
1 1 2
1 2 4
1 4 7
1 5 1
2 3 8
2 4 6
2 5 5
2 6 3
I want to shift the values of the Stock_value column by one position so that I get the following:
Stock_id Week Stock_value
1 1 NA
1 2 2
1 4 4
1 5 7
2 3 NA
2 4 8
2 5 6
2 6 5
What I am doing is the following:
df = pd.read_csv('C:/Users/user/Desktop/test.txt', keep_default_na=True, sep='\t')
df = df.groupby('Store_id', as_index=False)['Waiting_time'].transform(lambda x:x.shift(periods=1))
But then this gives me:
Waiting_time
0 NaN
1 2.0
2 4.0
3 7.0
4 NaN
5 8.0
6 6.0
7 5.0
So it gives me the values shifted but it does not retain all the columns of the dataframe.
How do I also retain all the columns of the dataframe along with shifting the values of one column?
You can simplify the solution with DataFrameGroupBy.shift and assign the result back to a new column:
df['Waiting_time'] = df.groupby('Stock_id')['Stock_value'].shift()
This works the same as:
df['Waiting_time']=df.groupby('Stock_id')['Stock_value'].transform(lambda x:x.shift(periods=1))
print (df)
Stock_id Week Stock_value Waiting_time
0 1 1 2 NaN
1 1 2 4 2.0
2 1 4 7 4.0
3 1 5 1 7.0
4 2 3 8 NaN
5 2 4 6 8.0
6 2 5 5 6.0
7 2 6 3 5.0
When you select df.groupby('Store_id', as_index=False)['Waiting_time'], the result contains only the single column 'Waiting_time', so the other columns cannot be recovered from it.
As suggested by jezrael in the comment, you should do
df['new col'] = df.groupby('Store_id...
to add this new column to your previously existing DataFrame.
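A short sketch of that idea on the question's data, typed in by hand instead of read from the text file. The column name Prev_value is hypothetical and chosen here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Stock_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "Week": [1, 2, 4, 5, 3, 4, 5, 6],
    "Stock_value": [2, 4, 7, 1, 8, 6, 5, 3],
})

# shift() runs inside each Stock_id group and returns a Series aligned
# with df's index, so assigning it keeps every existing column.
df["Prev_value"] = df.groupby("Stock_id")["Stock_value"].shift()
print(df)
```

The first row of each group has no predecessor, so it becomes NaN and the new column is stored as float.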
Let's say I have two data frames, as shown below:
df=pd.DataFrame({'a':[1,4,3,2],'b':[1,2,3,4]})
df2=pd.DataFrame({'a':[1,2,3,4],'b':[1,2,3,4],'c':[34,56,7,55]})
I would like to sort df by the order of column 'a' in df2, so that df.a follows the order of df2.a and the rest of each row moves with it.
Desired output:
a b
0 1 1
1 2 4
2 3 3
3 4 2
(made it manually, and if there's any mistake with it, please tell me :D)
My own attempt:
df = df.set_index('a')
df = df.reindex(index=df2['a'])
df = df.reset_index()
print(df)
This works as expected!
But when I have longer data frames, like:
df=pd.DataFrame({'a':[1,4,3,2,3,4,5,3,5,6],'b':[1,2,3,4,5,5,5,6,6,7]})
df2=pd.DataFrame({'a':[1,2,3,4,3,4,5,6,4,5],'b':[1,2,4,3,4,5,6,7,4,3]})
It doesn't work as expected.
Note: I don't only want an explanation of why; I also need a solution that works for big data frames.
One possible solution is to create a helper column in both DataFrames, because column 'a' contains duplicated values and reindex needs unique labels:
df['g'] = df.groupby('a').cumcount()
df2['g'] = df2.groupby('a').cumcount()
df = df.set_index(['a','g']).reindex(index=df2.set_index(['a','g']).index)
print(df)
b
a g
1 0 1.0
2 0 4.0
3 0 3.0
4 0 2.0
3 1 5.0
4 1 5.0
5 0 5.0
6 0 7.0
4 2 NaN
5 1 6.0
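Or, starting from the DataFrames with the helper column g (before the set_index and reindex step), maybe you need merge: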
df3 = df.merge(df2[['a','g']], on=['a','g'])
print(df3)
a b g
0 1 1 0
1 4 2 0
2 3 3 0
3 2 4 0
4 3 5 1
5 4 5 1
6 5 5 0
7 5 6 1
8 6 7 0
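The merge above keeps the row order of df. If the goal is the order of df2, one possible variant (a sketch, not taken from the answer) is to put df2 on the left of a left merge, which preserves the order of the left keys. The names left, right and ordered are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4, 3, 2, 3, 4, 5, 3, 5, 6],
                   "b": [1, 2, 3, 4, 5, 5, 5, 6, 6, 7]})
df2 = pd.DataFrame({"a": [1, 2, 3, 4, 3, 4, 5, 6, 4, 5],
                    "b": [1, 2, 4, 3, 4, 5, 6, 7, 4, 3]})

# Helper column g numbers the repeats of each value of a: 0, 1, 2, ...
left = df2.assign(g=df2.groupby("a").cumcount())[["a", "g"]]
right = df.assign(g=df.groupby("a").cumcount())

# A left merge keeps the row order of df2; pairs missing from df get NaN in b.
ordered = left.merge(right, on=["a", "g"], how="left")
print(ordered)
```

Here the third occurrence of a == 4 exists only in df2, so its b is NaN, matching the reindex output shown earlier.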