I'd like to change a value in a dataframe by addressing the row by integer position (as with iloc) and the column by label (as with loc).
Is there any way to combine these two methods? It would be the same as saying: I want the 320th row of this dataframe and the column titled "columnTitle". Is this possible?
IIUC you can call iloc directly on the column:
In [193]:
df = pd.DataFrame(columns=list('abc'), data = np.random.randn(5,3))
df
Out[193]:
a b c
0 -0.810747 0.898848 -0.374113
1 0.550121 0.934072 -1.117936
2 -2.113217 0.131204 -0.048545
3 1.674282 -0.611887 0.696550
4 -0.076561 0.331289 -0.238261
In [194]:
df['b'].iloc[3] = 0
df
Out[194]:
a b c
0 -0.810747 0.898848 -0.374113
1 0.550121 0.934072 -1.117936
2 -2.113217 0.131204 -0.048545
3 1.674282 0.000000 0.696550
4 -0.076561 0.331289 -0.238261
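One caveat on the chained form `df['b'].iloc[3] = 0`: it can raise a SettingWithCopyWarning, and with pandas' Copy-on-Write behaviour enabled it may not modify `df` at all. A minimal sketch of a single-step alternative follows, assuming the column label is unique; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data: 5 rows, columns a/b/c, values 0.0 .. 14.0
df = pd.DataFrame(np.arange(15.0).reshape(5, 3), columns=list('abc'))

# Option 1: translate the column label to its integer position, then use iloc
df.iloc[3, df.columns.get_loc('b')] = 0

# Option 2: translate the row position to its index label, then use loc
df.loc[df.index[4], 'c'] = -1
```

Both options perform the assignment in one indexing call on `df` itself, so there is no intermediate Series that might be a copy.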
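Mixed integer- and label-based access is supported by ix. (Note: ix was deprecated in pandas 0.20 and removed in pandas 1.0, so this only works on older versions.)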
>>> df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
>>> df
A B C
0 -0.473002 0.400249 0.332440
1 -1.291438 0.042443 0.001893
2 0.294902 0.927790 0.999090
3 1.415020 0.428405 -0.291283
4 -0.195136 -0.400629 0.079696
>>> df.ix[[0, 3, 4], ['B', 'C']]
B C
0 0.400249 0.332440
3 0.428405 -0.291283
4 -0.400629 0.079696
>>> df.ix[[0, 3, 4], ['B', 'C']] = 0
>>> df
A B C
0 -0.473002 0.000000 0.000000
1 -1.291438 0.042443 0.001893
2 0.294902 0.927790 0.999090
3 1.415020 0.000000 0.000000
4 -0.195136 0.000000 0.000000
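On pandas versions where ix is no longer available, the same mixed selection can be expressed by converting one side to match the other. The following is a sketch under that assumption, using `Index.get_indexer` to turn labels into positions; the frame of ones is only illustrative data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((5, 3)), columns=list('ABC'))

rows = [0, 3, 4]    # integer positions
cols = ['B', 'C']   # column labels

# Convert the labels to positions and stay in iloc ...
df.iloc[rows, df.columns.get_indexer(cols)] = 0

# ... or convert positions to index labels and stay in loc
df.loc[df.index[[1, 2]], cols] = 5
```

The second form is only unambiguous when the index labels are unique.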
I have two sample dataframes with different index values, but identical column names and order:
df1 = pd.DataFrame([[1, '', 3], ['', 2, '']], columns=['A', 'B', 'C'], index=[2,4])
df2 = pd.DataFrame([['', 4, ''], [5, '', 6]], columns=['A', 'B', 'C'], index=[7,9])
df1
   A  B  C
2  1     3
4     2
df2
   A  B  C
7     4
9  5     6
I know how to concat the two dataframes, but that gives this:
   A  B  C
2  1     3
4     2
omitting the non-matching indexes from the other df.
The result I am trying to achieve is:
A B C
0 1 4 3
1 5 2 6
I want to combine the rows with the same index values from each df so that missing values in one df are replaced by the corresponding value in the other.
I have found that concat and merge are not up to the job.
I assume I have to have identical indexes in each df which correspond to the values I want to merge into one row. But, so far, no luck getting it to come out correctly. Any pandas transformational wisdom is appreciated.
This merge attempt did not do the trick:
df1.merge(df2, on='A', how='outer')
The solutions below were all offered before I edited the question. My fault there, I neglected to point out that my actual data has different indexes in the two dataframes.
Let us try mask
out = df1.mask(df1=='',df2)
Out[428]:
A B C
0 1 4 3
1 5 2 6
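Because mask aligns the two frames on their index, the one-liner above only fills values when both frames share index labels. A minimal sketch for the edited question (indexes 2, 4 versus 7, 9), assuming the rows are meant to be paired by position; the names `left` and `right` are introduced here for illustration, and the df2 values are taken from the displayed df2 table:

```python
import pandas as pd

df1 = pd.DataFrame([[1, '', 3], ['', 2, '']], columns=['A', 'B', 'C'], index=[2, 4])
df2 = pd.DataFrame([['', 4, ''], [5, '', 6]], columns=['A', 'B', 'C'], index=[7, 9])

# Drop the original labels so both frames are indexed 0..n-1
left = df1.reset_index(drop=True)
right = df2.reset_index(drop=True)

# Wherever left holds an empty string, take the value from right
out = left.mask(left == '', right)
print(out)
```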
for i in range(df1.shape[0]):
    for j in range(df1.shape[1]):
        if df1.iloc[i,j]=="":
            df1.iloc[i,j] = df2.iloc[i,j]
print(df1)
A B C
0 1 4 3
1 5 2 6
Since the indexes of your two dataframes are different, it's easier to give them the same index first.
index = [i for i in range(len(df1))]
df1.index = index
df2.index = index
ddf = df1.replace('', np.nan).fillna(df2)
This still works if df1 and df2 have different numbers of rows.
df1 = pd.DataFrame([[1, '', 3], ['', 2, ''],[7,8,9],[10,11,12]], columns=['A', 'B', 'C'],index=[7,8,9,10])
index1 = [i for i in range(len(df1))]
index2 = [i for i in range(len(df2))]
df1.index = index1
df2.index = index2
df1.replace('',np.nan).fillna(df2)
You can get
Out[17]:
A B C
0 1.0 5 3.0
1 4 2.0 6
2 7.0 8.0 9.0
3 10.0 11.0 12.0
How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies which columns to one-hot encode.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
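The dtype of the generated columns depends on the pandas version: the output above shows floats, other versions show small integers, and from pandas 2.0 the default is bool. A short sketch that pins the dtype explicitly, assuming integer 0/1 indicators are wanted (the `dtype` argument is available in reasonably recent pandas versions):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'], 'C': [1, 2, 3]})

# Encode only A and B; C is passed through unchanged and stays first
dummies = pd.get_dummies(df, columns=['A', 'B'], dtype=int)
print(dummies)
```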
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series, and see below for the workaround):
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                    'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
   A        B
   a  b  c  x  y  z
0  1  0  0  1  0  0
1  1  0  0  0  1  0
2  0  1  0  0  0  1
3  0  1  0  1  0  0
4  0  0  1  1  0  0
5  1  0  0  0  1  0
6  0  1  0  0  1  0
7  0  0  1  0  0  1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
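To make the difference between the two column layouts concrete, here is a small sketch that builds both from the same per-column dummies. The names `pieces`, `nested`, and `flat` are illustrative, and `dtype=int` is an added assumption so the indicators print as 0/1 on newer pandas.

```python
import pandas as pd

df = pd.DataFrame({'A': list('aabbcabc'), 'B': list('xyzxxyyz')})

# One dummy frame per original column
pieces = [pd.get_dummies(df[col], dtype=int) for col in df]

# With keys: two-level columns such as ('A', 'a'), ('B', 'x')
nested = pd.concat(pieces, axis=1, keys=df.columns)

# Without keys: a single level a, b, c, x, y, z
flat = pd.concat(pieces, axis=1)
```

The flat layout is only safe when the category values of different columns do not collide; otherwise the column names become ambiguous.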
Somebody may have something more clever, but here are two approaches, assuming you have a dataframe named df with columns 'Name' and 'Year' that you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for loop.
First separate the categorical data from the DataFrame using select_dtypes(include="object"),
then apply get_dummies to each column in turn inside the loop,
as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")
# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
my first post!
I'm running python 3.8.5 & pandas 1.1.0 on jupyter notebooks.
I want to divide several columns by the corresponding elements in another column of the same dataframe.
For example:
import pandas as pd
df = pd.DataFrame({'a': [2, 3, 4], 'b': [4, 6, 8], 'c':[6, 9, 12]})
df
a b c
0 2 4 6
1 3 6 9
2 4 8 12
I'd like to divide columns 'b' & 'c' by the corresponding values in 'a' and substitute the values in 'b' and 'c' with the result of this division. So the above dataframe becomes:
a b c
0 2 2 3
1 3 2 3
2 4 2 3
I tried
df.iloc[: , 1:] = df.iloc[: , 1:] / df['a']
but this gives:
a b c
0 2 NaN NaN
1 3 NaN NaN
2 4 NaN NaN
I got it working by doing:
for colname in df.columns[1:]:
df[colname] = (df[colname] / df['a'])
Is there a faster way of doing the above by avoiding the for loop?
thanks,
mk
Almost there. Use div with axis=0 so that the division is aligned on the row index instead of the column labels:
df.iloc[:,1:] = df.iloc[:,1:].div(df.a, axis=0)
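The NaNs in the original attempt come from alignment: with the plain `/` operator, the index of `df['a']` (0, 1, 2) is matched against the column labels ('b', 'c'), and nothing lines up. A self-contained sketch of the fix on the question's data, written with column labels rather than positions (that choice is mine, not the answerer's):

```python
import pandas as pd

df = pd.DataFrame({'a': [2, 3, 4], 'b': [4, 6, 8], 'c': [6, 9, 12]})

# axis=0 matches the index of df['a'] against the row index of the frame,
# so each row of b and c is divided by that row's value of a
df[['b', 'c']] = df[['b', 'c']].div(df['a'], axis=0)
print(df)
```

The result columns are floats (2.0, 3.0), since true division always produces floating-point values.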
Alternatively, divide each column explicitly:
df.b = df.b / df.a
df.c = df.c / df.a
or
df[['b', 'c']] = df.apply(lambda x: x[['b', 'c']] / x.a, axis=1)
Consider this:
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df
Out[128]:
B C
0 a 1
1 a 2
2 b 6
3 b 2
I want to create a variable that simply corresponds to the ordering of observations after sorting by 'C' within each groupby('B') group.
df.sort_values(['B','C'])
Out[129]:
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
How can I do that? I am thinking about creating a column that is one, and using cumsum but that seems too clunky...
I think you can use range with len(df):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
'B': ['a', 'a', 'b'],
'C': [5, 3, 2]})
print(df)
A B C
0 1 a 5
1 2 a 3
2 3 b 2
df.sort_values(by='C', inplace=True)
#or without inplace
#df = df.sort_values(by='C')
print(df)
A B C
2 3 b 2
1 2 a 3
0 1 a 5
df['order'] = range(1,len(df)+1)
print(df)
A B C order
2 3 b 2 1
1 2 a 3 2
0 1 a 5 3
EDIT by comment:
I think you can use groupby with cumcount:
import pandas as pd
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df.sort_values(['B','C'], inplace=True)
#or without inplace
#df = df.sort_values(['B','C'])
print(df)
B C
0 a 1
1 a 2
3 b 2
2 b 6
df['order'] = df.groupby('B', sort=False).cumcount() + 1
print(df)
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
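For reference, the same sort-then-count idea written as a self-contained Python 3 snippet that leaves the original frame untouched. The name `ordered` is introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6, 2]})

# Sort a copy so rows within each group of B are in ascending order of C
ordered = df.sort_values(['B', 'C'])

# cumcount numbers rows within each group starting at 0, hence the + 1
ordered['order'] = ordered.groupby('B').cumcount() + 1
print(ordered)
```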
Nothing wrong with Jezrael's answer but there's a simpler (though less general) method in this particular example. Just add groupby to JohnGalt's suggestion of using rank.
>>> df['order'] = df.groupby('B')['C'].rank()
B C order
0 a 1 1.0
1 a 2 2.0
2 b 6 2.0
3 b 2 1.0
In this case, you don't really need the ['C'] but it makes the ranking a little more explicit and if you had other unrelated columns in the dataframe then you would need it.
But if you are ranking by more than 1 column, you should use Jezrael's method.
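Two details of rank are worth keeping in mind: it returns floats, and by default tied values share an averaged rank. A hedged sketch, assuming an integer ordering with ties broken by order of appearance is what is wanted:

```python
import pandas as pd

df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6, 2]})

# method='first' gives tied values distinct ranks in order of appearance;
# astype(int) converts the float ranks to integers
df['order'] = df.groupby('B')['C'].rank(method='first').astype(int)
print(df)
```

Unlike the cumcount approach, this does not require sorting the frame first, so the rows keep their original order.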
I would like to subtract one row of another dataframe from every row of a dataframe.
(Difference from one row)
Is there an easy way to do this, like df - df2?
df = pd.DataFrame(abs(np.floor(np.random.rand(3, 5)*10)),
... columns=['a', 'b', 'c', 'd', 'e'])
df
Out[18]:
a b c d e
0 8 9 8 6 4
1 3 0 6 4 8
2 2 5 7 5 6
df2 = pd.DataFrame(abs(np.floor(np.random.rand(1, 5)*10)),
... columns=['a', 'b', 'c', 'd', 'e'])
df2
a b c d e
0 8 1 3 7 5
Here is an output that works for the first row; however, I want the row subtracted from the remaining rows as well...
df-df2
a b c d e
0 0 8 5 -1 -1
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
Pandas NDFrames generally try to perform operations on items with matching indices. df - df2 only performs subtraction on the first row, because the 0 indexed row is the only row with an index shared in common.
The operation you are looking for looks more like a NumPy array operation performed with "broadcasting":
In [21]: df.values-df2.values
Out[21]:
array([[ 0, 8, 5, -1, -1],
[-5, -1, 3, -3, 3],
[-6, 4, 4, -2, 1]], dtype=int64)
To package the result in a DataFrame:
In [22]: pd.DataFrame(df.values-df2.values, columns=df.columns)
Out[22]:
a b c d e
0 0 8 5 -1 -1
1 -5 -1 3 -3 3
2 -6 4 4 -2 1
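You can do this directly in pandas as well. (I used df2 = df.loc[[0]]) Note that with fill_value=0 the rows missing from df2 are treated as zeros, so rows 1 and 2 in the output are left unchanged rather than having row 0 subtracted from them.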
In [80]: df.sub(df2,fill_value=0)
Out[80]:
a b c d e
0 0 0 0 0 0
1 7 6 0 7 8
2 4 4 3 6 2
[3 rows x 5 columns]
Alternatively you could simply use the apply function on all rows of df.
df3 = df.apply(lambda x: x-df2.squeeze(), axis=1)
# axis=1 applies the function to each row instead of each column
# squeeze turns the one-row df2 into a Series so it can be subtracted
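The broadcast can also be done without apply by subtracting the row as a Series, which pandas aligns on column labels and repeats for every row. A minimal sketch using the values shown in the question (typed in by hand here rather than generated randomly):

```python
import pandas as pd

cols = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame([[8, 9, 8, 6, 4],
                   [3, 0, 6, 4, 8],
                   [2, 5, 7, 5, 6]], columns=cols)
df2 = pd.DataFrame([[8, 1, 3, 7, 5]], columns=cols)

# df2.iloc[0] is a Series indexed by column label; subtracting it
# along the columns applies it to each row of df
diff = df.sub(df2.iloc[0], axis='columns')
print(diff)
```

The plain operator form `df - df2.iloc[0]` behaves the same way, since DataFrame-minus-Series aligns on columns by default.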