replace rows in a pandas data frame - python

I want to start with an empty data frame and then add one row to it at a time.
I can even start with an all-zero data frame, data=pd.DataFrame(np.zeros(shape=(10,2)),columns=["a","b"]), and then replace one row at a time.
How can I do that?

Use .loc for label-based selection. It is important that you understand how to slice properly: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label and why you should avoid chained assignment: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [14]:
data=pd.DataFrame(np.zeros(shape=(10,2)),columns=["a","b"])
data
Out[14]:
a b
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
[10 rows x 2 columns]
In [15]:
data.loc[2:2,'a':'b']=5,6
data
Out[15]:
a b
0 0 0
1 0 0
2 5 6
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
[10 rows x 2 columns]

If you are replacing the entire row, you can just use the row label and do not need row/column slices.
...
data.loc[2]=5,6
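
For the "start empty and add one row at a time" part of the question, a minimal sketch (the column names and sample values are made up); pre-allocating a zero frame and overwriting rows with .loc, as above, is usually faster than growing the frame:
import numpy as np
import pandas as pd

# Grow an empty frame one row at a time (simple, but slow for many rows).
df = pd.DataFrame(columns=["a", "b"])
for i, row in enumerate([(1, 2), (3, 4), (5, 6)]):
    df.loc[i] = row          # assigning to a new label appends the row

# Pre-allocate zeros and overwrite rows in place (faster when the size is known).
data = pd.DataFrame(np.zeros((3, 2)), columns=["a", "b"])
data.loc[1] = 5, 6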

Related

Convert the last non-zero value to 0 for each row in a pandas DataFrame

I'm trying to modify my data frame so that the last nonzero value of the label-encoded features in each row is converted to 0. For example, I have this data frame, with the top row being the labels and the first column being the index:
df
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 1
1 0 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1 0
Columns 1-10 are the ones that have been encoded. What I want to convert this data frame to, without changing anything else is:
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
So the last nonzero value occurring in each row should be converted to 0. I was thinking of using the last_valid_index method, but that would also pick up the other remaining columns and change those, which I don't want. Any help is appreciated.
You can use cumsum to build a boolean mask and use it to set the last nonzero value in each row to zero.
v = df.cumsum(axis=1)
df[v.lt(v.max(axis=1), axis=0)].fillna(0, downcast='infer')
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
A similar option is to reverse the columns before calling cumsum; this lets you do it in a single line.
df[~df.iloc[:, ::-1].cumsum(1).le(1)].fillna(0, downcast='infer')
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
If you have more columns, apply these operations to the encoded slice only, then assign the result back.
u = df.iloc[:, :10]
df[u.columns] = u[~u.iloc[:, ::-1].cumsum(1).le(1)].fillna(0, downcast='infer')
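
For reference, here is a self-contained sketch of the first approach, rebuilding the example frame from the question (integer column labels 1-10 are assumed):
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]],
    columns=range(1, 11),
)

v = df.cumsum(axis=1)                     # row-wise running total of the 1s
mask = v.lt(v.max(axis=1), axis=0)        # False at and after the last 1 in each row
result = df[mask].fillna(0).astype(int)   # masked cells become NaN, then 0
print(result)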

Pandas get_dummies generates multiple columns for the same feature

I'm using a pandas series and trying to convert it to one hot encoding. I'm using the describe method in order to check how many unique categories the series has. The output is:
input['pattern'].describe(include='all')
count 9725
unique 7
top 1
freq 4580
Name: pattern, dtype: object
When I'm trying:
x = pd.get_dummies(input['pattern'])
x.describe(include= 'all')
I get 18 classes, 12 of which are completely zero. How did get_dummies produce classes that do not occur even once in the input?
From a discussion in the comments, it was deduced that your column contained a mixture of strings and integers.
For example,
s = pd.Series(['0', 0, '0', '6', 6, '6', '3', '3'])
s
0 0
1 0
2 0
3 6
4 6
5 6
6 3
7 3
dtype: object
Now, calling pd.get_dummies results in multiple columns for the same feature.
pd.get_dummies(s)
0 6 0 3 6
0 0 0 1 0 0
1 1 0 0 0 0
2 0 0 1 0 0
3 0 0 0 0 1
4 0 1 0 0 0
5 0 0 0 0 1
6 0 0 0 1 0
7 0 0 0 1 0
The fix is to ensure that all elements are of the same type. I'd recommend, for this case, converting to str.
s.astype(str).str.get_dummies()
0 3 6
0 1 0 0
1 1 0 0
2 1 0 0
3 0 0 1
4 0 0 1
5 0 0 1
6 0 1 0
7 0 1 0
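
Either form of the fix should work; a quick sketch on the toy series above:
import pandas as pd

s = pd.Series(['0', 0, '0', '6', 6, '6', '3', '3'])

# Casting to a single dtype first collapses the duplicate columns.
pd.get_dummies(s.astype(str))       # columns '0', '3', '6' only
s.astype(str).str.get_dummies()     # same columns (dtypes may differ across pandas versions)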

Create Pandas DataFrame from (row, column, value) data

I have a Pandas Dataframe with three columns: row, column, value. The row values are all integers below some N, and the column values are all integers below some M. The values are all positive integers.
How do I efficiently create a DataFrame with N rows and M columns, containing the value val at index (i, j) if (i, j, val) is a row in my original DataFrame, and some default value (0) otherwise? Furthermore, is it possible to create a sparse DataFrame immediately, since the data is already quite large, and N*M is still about 10 times the size of my data?
A NumPy solution would suit here for performance -
a = df.values
m, n = a[:, :2].max(0) + 1              # output shape from the max row/col indices
out = np.zeros((m, n), dtype=a.dtype)
out[a[:, 0], a[:, 1]] = a[:, 2]         # scatter values into their (row, col) slots
df_out = pd.DataFrame(out)
Sample run -
In [58]: df
Out[58]:
row col val
0 7 1 30
1 3 3 0
2 4 8 30
3 5 8 18
4 1 3 6
5 1 6 48
6 0 2 6
7 4 7 6
8 5 0 48
9 8 1 48
10 3 2 12
11 6 8 18
In [59]: df_out
Out[59]:
0 1 2 3 4 5 6 7 8
0 0 0 6 0 0 0 0 0 0
1 0 0 0 6 0 0 48 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 12 0 0 0 0 0 0
4 0 0 0 0 0 0 0 6 30
5 48 0 0 0 0 0 0 0 18
6 0 0 0 0 0 0 0 0 18
7 0 30 0 0 0 0 0 0 0
8 0 48 0 0 0 0 0 0 0
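
The sparse part of the question is not covered above. A hedged sketch, assuming SciPy is available and a pandas version with the sparse accessor (the column names row/col/val come from the sample frame):
import pandas as pd
from scipy import sparse

# Build a COO matrix straight from the (row, column, value) triplets, then wrap
# it in a DataFrame backed by sparse columns; missing cells default to 0.
n_rows = df['row'].max() + 1
n_cols = df['col'].max() + 1
mat = sparse.coo_matrix((df['val'], (df['row'], df['col'])), shape=(n_rows, n_cols))
df_sparse = pd.DataFrame.sparse.from_spmatrix(mat)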

Conditional replacement with index shift in Pandas

Having the following column in a dataframe:
0
0
0
0
0
5
I would like to check for values greater than a threshold. If one is found, set it to zero and move up by the difference (value - threshold), placing the threshold value at the new position. Let's say threshold=3; then the resulting column has to be:
0
0
0
3
0
0
Any idea for a fast transformation?
For this DataFrame:
df
Out:
A
0 0
1 0
2 0
3 0
4 0
5 5
6 0
7 0
8 0
9 0
10 6
11 0
12 0
threshold = 3
above_threshold = df['A'] > threshold
# Shift each exceeding value up by (value - threshold) and place the threshold there.
df.loc[df[above_threshold].index - (df.loc[above_threshold, 'A'] - threshold).values, 'A'] = threshold
# Zero out the original positions.
df.loc[above_threshold, 'A'] = 0
df
Out:
A
0 0
1 0
2 0
3 3
4 0
5 0
6 0
7 3
8 0
9 0
10 0
11 0
12 0
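
A self-contained sketch of the same approach, rebuilding the answer's example column (column name 'A' as above):
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 6, 0, 0]})
threshold = 3

above = df['A'] > threshold
# Each exceeding value moves up by (value - threshold); the threshold lands there.
targets = df[above].index - (df.loc[above, 'A'] - threshold).values
df.loc[targets, 'A'] = threshold
df.loc[above, 'A'] = 0   # clear the original positions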

how to create all zero dataframe in Python

I want to create a dataframe in Python with 24 columns (indicating 24 hours), which looks like this:
column name 0 1 2 3 ... 24
row 1 0 0 0 0 0
row 2 0 0 0 0 0
row 3 0 0 0 0 0
I would like to know how to initialize it. And in the future I may add a row 4, with all 0s; how do I do that? Thanks.
There's a trick here: when the DataFrame (or Series) constructor is passed a scalar as the first argument, this value is propagated:
In [11]: pd.DataFrame(0, index=np.arange(1, 4), columns=np.arange(24))
Out[11]:
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
[3 rows x 24 columns]
Note: np.arange is numpy's answer to python's range.
You can create an all-zero numpy array, convert it to a dataframe, and then add the column names.
import numpy
import pandas as pd
a = numpy.zeros(shape=(3, 24))
df = pd.DataFrame(a, columns=['col1', 'col2', etc..])
to set row names, assign the index directly
df.index = ['row1', 'row2', etc..]
if you must.
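
Neither answer shows the "later add row 4 with all 0" part of the question; a minimal sketch, assuming row labels 'row 1' through 'row 4' as in the example:
import numpy as np
import pandas as pd

df = pd.DataFrame(0, index=['row 1', 'row 2', 'row 3'], columns=np.arange(24))
# Assigning a scalar to a new label with .loc appends an all-zero row.
df.loc['row 4'] = 0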
