replace rows in a pandas data frame - python

I want to start with an empty data frame and then add one row to it at a time.
I can even start with an all-zero data frame, data=pd.DataFrame(np.zeros(shape=(10,2)),columns=["a","b"]), and then replace one row at a time.
How can I do that?

Use .loc for label-based selection. It is important that you understand how to slice properly: http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label and why you should avoid chained assignment: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [14]:
data=pd.DataFrame(np.zeros(shape=(10,2)),columns=["a","b"])
data
Out[14]:
a b
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
[10 rows x 2 columns]
In [15]:
data.loc[2:2,'a':'b']=5,6
data
Out[15]:
a b
0 0 0
1 0 0
2 5 6
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
[10 rows x 2 columns]

If you are replacing the entire row, you can just use the row label and do not need row/column slices.
...
data.loc[2]=5,6
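
For the "start empty and add one row at a time" part of the question, a minimal sketch (the column names and sample values are made up); pre-allocating a zero frame and overwriting rows with .loc, as above, is usually faster than growing the frame:
import numpy as np
import pandas as pd

# Grow an empty frame one row at a time (simple, but slow for many rows).
df = pd.DataFrame(columns=["a", "b"])
for i, row in enumerate([(1, 2), (3, 4), (5, 6)]):
    df.loc[i] = row          # assigning to a new label appends the row

# Pre-allocate zeros and overwrite rows in place (faster when the size is known).
data = pd.DataFrame(np.zeros((3, 2)), columns=["a", "b"])
data.loc[1] = 5, 6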

Related

Convert the last non-zero value to 0 for each row in a pandas DataFrame

I'm trying to modify my data frame so that the last nonzero value of the label-encoded features in each row is converted to 0. For example, I have this data frame, with the top row being the labels and the first column being the index:
df
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 1
1 0 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 1 0
Columns 1-10 are the ones that have been encoded. What I want to convert this data frame to, without changing anything else is:
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
So the last nonzero value occurring in each row should be converted to 0. I was thinking of using the last_valid_index method, but that would also pick up the other remaining columns and change those, which I don't want. Any help is appreciated.
You can use cumsum to build a boolean mask and use it to set the last nonzero value in each row to zero.
v = df.cumsum(axis=1)
df[v.lt(v.max(axis=1), axis=0)].fillna(0, downcast='infer')
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
A similar option is to reverse the columns before calling cumsum; this lets you do it in a single line.
df[~df.iloc[:, ::-1].cumsum(1).le(1)].fillna(0, downcast='infer')
1 2 3 4 5 6 7 8 9 10
0 0 1 0 0 0 0 0 0 1 0
1 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0
If you have more columns, apply these operations to the encoded slice only, then assign the result back.
u = df.iloc[:, :10]
df[u.columns] = u[~u.iloc[:, ::-1].cumsum(1).le(1)].fillna(0, downcast='infer')
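
For reference, here is a self-contained sketch of the first approach, rebuilding the example frame from the question (integer column labels 1-10 are assumed):
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
     [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]],
    columns=range(1, 11),
)

v = df.cumsum(axis=1)                     # row-wise running total of the 1s
mask = v.lt(v.max(axis=1), axis=0)        # False at and after the last 1 in each row
result = df[mask].fillna(0).astype(int)   # masked cells become NaN, then 0
print(result)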

Pandas get_dummies generates multiple columns for the same feature

I'm using a pandas series and trying to convert it to one hot encoding. I'm using the describe method in order to check how many unique categories the series has. The output is:
input['pattern'].describe(include='all')
count 9725
unique 7
top 1
freq 4580
Name: pattern, dtype: object
When I'm trying:
x = pd.get_dummies(input['pattern'])
x.describe(include= 'all')
I get 18 classes, 12 of which are completely zero. How did get_dummies produce classes that do not occur even once in the input?
From a discussion in the comments, it was deduced that your column contained a mixture of strings and integers.
For example,
s = pd.Series(['0', 0, '0', '6', 6, '6', '3', '3'])
s
0 0
1 0
2 0
3 6
4 6
5 6
6 3
7 3
dtype: object
Now, calling pd.get_dummies results in multiple columns for the same feature.
pd.get_dummies(s)
0 6 0 3 6
0 0 0 1 0 0
1 1 0 0 0 0
2 0 0 1 0 0
3 0 0 0 0 1
4 0 1 0 0 0
5 0 0 0 0 1
6 0 0 0 1 0
7 0 0 0 1 0
The fix is to ensure that all elements are of the same type. I'd recommend, for this case, converting to str.
s.astype(str).str.get_dummies()
0 3 6
0 1 0 0
1 1 0 0
2 1 0 0
3 0 0 1
4 0 0 1
5 0 0 1
6 0 1 0
7 0 1 0
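
Either form of the fix should work; a quick sketch on the toy series above:
import pandas as pd

s = pd.Series(['0', 0, '0', '6', 6, '6', '3', '3'])

# Casting to a single dtype first collapses the duplicate columns.
pd.get_dummies(s.astype(str))       # columns '0', '3', '6' only
s.astype(str).str.get_dummies()     # same columns (dtypes may differ across pandas versions)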

Create Pandas DataFrame from (row, column, value) data

I have a Pandas Dataframe with three columns: row, column, value. The row values are all integers below some N, and the column values are all integers below some M. The values are all positive integers.
How do I efficiently create a DataFrame with N rows and M columns, containing the value val at index (i, j) if (i, j, val) is a row in my original DataFrame, and some default value (0) otherwise? Furthermore, is it possible to create a sparse DataFrame immediately, since the data is already quite large, and N*M is still about 10 times the size of my data?
A NumPy solution would suit here for performance -
a = df.values
m, n = a[:, :2].max(0) + 1              # output shape from the max row/col indices
out = np.zeros((m, n), dtype=a.dtype)
out[a[:, 0], a[:, 1]] = a[:, 2]         # scatter values into their (row, col) slots
df_out = pd.DataFrame(out)
Sample run -
In [58]: df
Out[58]:
row col val
0 7 1 30
1 3 3 0
2 4 8 30
3 5 8 18
4 1 3 6
5 1 6 48
6 0 2 6
7 4 7 6
8 5 0 48
9 8 1 48
10 3 2 12
11 6 8 18
In [59]: df_out
Out[59]:
0 1 2 3 4 5 6 7 8
0 0 0 6 0 0 0 0 0 0
1 0 0 0 6 0 0 48 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 12 0 0 0 0 0 0
4 0 0 0 0 0 0 0 6 30
5 48 0 0 0 0 0 0 0 18
6 0 0 0 0 0 0 0 0 18
7 0 30 0 0 0 0 0 0 0
8 0 48 0 0 0 0 0 0 0
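
The sparse part of the question is not covered above. A hedged sketch, assuming SciPy is available and a pandas version with the sparse accessor (the column names row/col/val come from the sample frame):
import pandas as pd
from scipy import sparse

# Build a COO matrix straight from the (row, column, value) triplets, then wrap
# it in a DataFrame backed by sparse columns; missing cells default to 0.
n_rows = df['row'].max() + 1
n_cols = df['col'].max() + 1
mat = sparse.coo_matrix((df['val'], (df['row'], df['col'])), shape=(n_rows, n_cols))
df_sparse = pd.DataFrame.sparse.from_spmatrix(mat)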

Conditional replacement with index shift in Pandas

Having the following column in a dataframe:
0
0
0
0
0
5
I would like to check for values greater than a threshold. If one is found, set it to zero and move up by the difference (value - threshold), placing the threshold value at the new position. Let's say threshold=3; then the resulting column has to be:
0
0
0
3
0
0
Any idea for a fast transformation?
For this DataFrame:
df
Out:
A
0 0
1 0
2 0
3 0
4 0
5 5
6 0
7 0
8 0
9 0
10 6
11 0
12 0
threshold = 3
above_threshold = df['A'] > threshold
# Shift each exceeding value up by (value - threshold) and place the threshold there.
df.loc[df[above_threshold].index - (df.loc[above_threshold, 'A'] - threshold).values, 'A'] = threshold
# Zero out the original positions.
df.loc[above_threshold, 'A'] = 0
df
Out:
A
0 0
1 0
2 0
3 3
4 0
5 0
6 0
7 3
8 0
9 0
10 0
11 0
12 0
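
A self-contained sketch of the same approach, rebuilding the answer's example column (column name 'A' as above):
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 6, 0, 0]})
threshold = 3

above = df['A'] > threshold
# Each exceeding value moves up by (value - threshold); the threshold lands there.
targets = df[above].index - (df.loc[above, 'A'] - threshold).values
df.loc[targets, 'A'] = threshold
df.loc[above, 'A'] = 0   # clear the original positions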

how to create all zero dataframe in Python

I want to create a dataframe in Python with 24 columns (indicating 24 hours), which looks like this:
column name 0 1 2 3 ... 24
row 1 0 0 0 0 0
row 2 0 0 0 0 0
row 3 0 0 0 0 0
I would like to know how to initialize it. And in the future I may add a row 4, with all 0s; how do I do that? Thanks.
There's a trick here: when the DataFrame (or Series) constructor is passed a scalar as the first argument, this value is propagated:
In [11]: pd.DataFrame(0, index=np.arange(1, 4), columns=np.arange(24))
Out[11]:
0 1 2 3 4 5 6 7 8 9 ... 14 15 16 17 18 19 20 21 22 23
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
[3 rows x 24 columns]
Note: np.arange is numpy's answer to python's range.
You can create an all-zero numpy array, convert it to a dataframe, and then add the column names.
import numpy
import pandas as pd
a = numpy.zeros(shape=(3, 24))
df = pd.DataFrame(a, columns=['col1', 'col2', etc..])
to set row names, assign the index directly
df.index = ['row1', 'row2', etc..]
if you must.
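
Neither answer shows the "later add row 4 with all 0" part of the question; a minimal sketch, assuming row labels 'row 1' through 'row 4' as in the example:
import numpy as np
import pandas as pd

df = pd.DataFrame(0, index=['row 1', 'row 2', 'row 3'], columns=np.arange(24))
# Assigning a scalar to a new label with .loc appends an all-zero row.
df.loc['row 4'] = 0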
