I have a DataFrame and I would like to add some rows that don't exist yet. I have found the .loc method, but it appends the new values at the end rather than keeping the index sorted. For example:
import numpy as np
import pandas as pd
dfi = pd.DataFrame(np.arange(6).reshape(3,2),columns=['A','B'])
>>> dfi
A B
0 0 1
1 2 3
2 4 5
[3 rows x 2 columns]
Adding a nonexistent row through .loc:
dfi.loc[5,:] = 0
>>> dfi
A B
0 0 1
1 2 3
2 4 5
5 0 0
[4 rows x 2 columns]
So far so good. But this is what happens when adding another row whose index is smaller than the last one:
dfi.loc[3,:] = 0
>>> dfi
A B
0 0 1
1 2 3
2 4 5
5 0 0
3 0 0
[5 rows x 2 columns]
I would like the row with index 3 to be placed between rows 2 and 5. I could sort the DataFrame by index every time, but that would take too long. Is there another way?
My actual problem involves a DataFrame whose indexes are datetime objects. I didn't include the full details of that implementation here because they would obscure my real problem: adding rows to a DataFrame such that the result has an ordered index.
If your index is almost continuous, only missing a few values here and there, you may try the following:
In [15]:
df = pd.DataFrame(np.zeros((100,2)), columns=['A', 'B'])
df['A'] = np.nan
df['B'] = np.nan
In [16]:
df.iloc[[0,1,2]] = pd.DataFrame({'A': [0,2,4], 'B': [1,3,5]})
df.iloc[5] = [0,0]
df.iloc[3] = 0
print(df.dropna())
A B
0 0 1
1 2 3
2 4 5
3 0 0
5 0 0
[5 rows x 2 columns]
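If the index isn't nearly continuous (e.g. datetime labels), a minimal alternative sketch is to batch the insertions and sort the index once at the end, rather than after every single insert:
import numpy as np
import pandas as pd
dfi = pd.DataFrame(np.arange(6).reshape(3,2), columns=['A','B'])
# Insert all the new rows first, then sort a single time.
for label in [5, 3]:
    dfi.loc[label, :] = 0
dfi = dfi.sort_index()  # one sort instead of one per insertion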
Related
Is there a way to multiply each element of a row of a dataframe by an element of the same row from a particular column of another dataframe?
For example, such that:
df1:
1 2 3
2 2 2
3 2 1
and df2:
x 1 b
z 2 c
x 4 a
results in
1 2 3
4 4 4
12 8 4
So basically such that df1[i,:] * df2[i,j] = df3[i,:].
Multiply the first df by the column of the second df
Assuming your column names are 0, 1, 2:
df1.mul(df2[1], axis=0)
Output
0 1 2
0 1 2 3
1 4 4 4
2 12 8 4
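For completeness, a runnable sketch of the same idea, assuming the example frames above with default integer column labels:
import pandas as pd
df1 = pd.DataFrame([[1, 2, 3], [2, 2, 2], [3, 2, 1]])
df2 = pd.DataFrame([['x', 1, 'b'], ['z', 2, 'c'], ['x', 4, 'a']])
# Multiply each row of df1 by the matching element of df2's column 1,
# broadcasting along the index (axis=0).
print(df1.mul(df2[1], axis=0))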
Here you go.
I have created a variable that lets you select which column of the second dataframe you want to multiply with the numbers in the first dataframe.
arr1 = np.array(df1)  # df1 to array
which_df2col_to_multiply = 1  # select the (numeric) col from df2
arr2 = np.array(df2)[:, which_df2col_to_multiply]  # selected col to array
print(arr1 * arr2[:, None])  # scale each row of arr1 by its multiplier
This is the output:
[[1 2 3]
[4 4 4]
[12 8 4]]
I am trying to select only one row from a dask.dataframe using the command x.loc[0].compute(). It returns 4 rows, all having index=0. I tried reset_index, but there are still 4 rows with index=0 after resetting. (I think I reset correctly, because I used reset_index(drop=False) and could see the original index in the new column.)
I read the dask.dataframe documentation, and it says something along the lines that there might be more than one row with index=0 due to how dask structures the chunked data.
So, if I really want only one row by using index=0 for subsetting, how can I do this?
Edit
Your problem probably comes from reset_index; that issue is explained at the end of this answer. The earlier part just shows how to work around it.
For example, there is the following dask DataFrame:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
df.compute()
Out[1]:
col_1 col_2
0 1 a
0 2 b
1 3 c
2 4 d
3 5 e
4 6 f
5 7 g
It has a numerical index with repeated 0 values. Since loc is a
Purely label-location based indexer for selection by label
it selects both 0-labeled rows. If you do
df.loc[0].compute()
Out[]:
col_1 col_2
0 1 a
0 2 b
you get all the rows with label 0 (or whatever label you specified).
In pandas there is pd.DataFrame.iloc, which lets us select a row by its integer position. Unfortunately, in dask you can't do so, because iloc is
Purely integer-location based indexing for selection by position.
Only indexing the column positions is supported. Trying to select row positions will raise a ValueError.
To work around this problem, you can do some indexing tricks:
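One way to build such an index (a sketch; the 'x' and 'index' names below are taken from the output that follows): keep the old labels as a column, create a counter that is unique across partitions, and set it as the new index:
df = df.reset_index()                # keep the original labels in an 'index' column
df['x'] = 1
df['x'] = df['x'].cumsum() - 1       # 0..n-1 counter spanning all partitions
df = df.set_index('x', sorted=True)  # promote the counter to a unique, sorted index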
df.compute()
Out[2]:
index col_1 col_2
x
0 0 1 a
1 0 2 b
2 1 3 c
3 2 4 d
4 3 5 e
5 4 6 f
6 5 7 g
Now there's a new index ranging from 0 to the length of the DataFrame minus 1.
It's possible to slice it with loc and do the following (now selecting label 0 via loc means "select the first row"):
df.loc[0].compute()
Out[3]:
index col_1 col_2
x
0 0 1 a
About the duplicated 0 index label
If you need the original index, it's still there and can be accessed through:
df.loc[:, 'index'].compute()
Out[4]:
x
0 0
1 0
2 1
3 2
4 3
5 4
6 5
I guess you get such duplication from reset_index() or similar, because it generates a new 0-based index for each partition. For example, applying it to the original table of 2 partitions:
df.reset_index().compute()
Out[5]:
index col_1 col_2
0 0 1 a
1 0 2 b
2 1 3 c
3 2 4 d
0 3 5 e
1 4 6 f
2 5 7 g
I have an NxM dataframe and an NxL numpy matrix. I'd like to add the matrix to the dataframe to create L new columns, simply appending the columns and rows in the same order they appear. I tried merge() and join(), but I end up with errors:
assign() keywords must be strings
and
columns overlap but no suffix specified
respectively.
Is there a way I can add a numpy matrix as dataframe columns?
You can turn the matrix into a dataframe and use concat with axis=1:
For example, given a dataframe df and a numpy array mat:
>>> df
a b
0 5 5
1 0 7
2 1 0
3 0 4
4 6 4
>>> mat
array([[0.44926098, 0.29567859, 0.60728561],
[0.32180566, 0.32499134, 0.94950085],
[0.64958125, 0.00566706, 0.56473627],
[0.17357589, 0.71053224, 0.17854188],
[0.38348102, 0.12440952, 0.90359566]])
You can do:
>>> pd.concat([df, pd.DataFrame(mat)], axis=1)
a b 0 1 2
0 5 5 0.449261 0.295679 0.607286
1 0 7 0.321806 0.324991 0.949501
2 1 0 0.649581 0.005667 0.564736
3 0 4 0.173576 0.710532 0.178542
4 6 4 0.383481 0.124410 0.903596
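One caveat worth noting (a general pandas point, not part of the answer above): concat aligns on the index, so if df doesn't have a default RangeIndex, build the new frame with df's index to avoid misaligned rows and NaNs:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [5, 0], 'b': [5, 7]}, index=['r0', 'r1'])
mat = np.random.rand(2, 3)
# Reuse df's index so the matrix rows line up with df's rows.
out = pd.concat([df, pd.DataFrame(mat, index=df.index)], axis=1)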
Setup
df = pd.DataFrame({'a': [5,0,1,0,6], 'b': [5,7,0,4,4]})
mat = np.random.rand(5,3)
Using join:
df.join(pd.DataFrame(mat))
a b 0 1 2
0 5 5 0.884061 0.803747 0.727161
1 0 7 0.464009 0.447346 0.171881
2 1 0 0.353604 0.912781 0.199477
3 0 4 0.466095 0.136218 0.405766
4 6 4 0.764678 0.874614 0.310778
If there is the chance of overlapping column names, simply supply a suffix:
df = pd.DataFrame({0: [5,0,1,0,6], 1: [5,7,0,4,4]})
mat = np.random.rand(5,3)
df.join(pd.DataFrame(mat), rsuffix='_')
0 1 0_ 1_ 2
0 5 5 0.783722 0.976951 0.563798
1 0 7 0.946070 0.391593 0.273339
2 1 0 0.710195 0.827352 0.839212
3 0 4 0.528824 0.625430 0.465386
4 6 4 0.848423 0.467256 0.962953
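Alternatively (a sketch, with hypothetical m0, m1, ... names), give the matrix columns distinct names up front so no suffix is needed:
# Name the new columns before joining, e.g. m0, m1, m2.
mat_df = pd.DataFrame(mat, columns=[f'm{i}' for i in range(mat.shape[1])])
df.join(mat_df)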
In Python, when using pandas, duplicate rows can be dropped using drop_duplicates. Is there any way to separate the dataframe into two dataframes instead of actually "dropping" the rows?
If you want to split the dataframe by duplicates, you could use the boolean array returned by .duplicated():
>>> df = pd.DataFrame({"A": [1,1,2,3,2,4]})
>>> df
A
0 1
1 1
2 2
3 3
4 2
5 4
[6 rows x 1 columns]
>>> df_a, df_b= df[~df.duplicated()], df[df.duplicated()]
>>> df_a
A
0 1
2 2
3 3
5 4
[4 rows x 1 columns]
>>> df_b
A
1 1
4 2
[2 rows x 1 columns]
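The same mask-based split works with duplicated()'s optional arguments; for example (a sketch), keep=False marks every copy of a duplicated value rather than only the later ones:
# All copies of duplicated values go to df_dupes, unique values to df_unique.
mask = df.duplicated(keep=False)
df_unique, df_dupes = df[~mask], df[mask]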
I got lost in the pandas docs and features trying to figure out a way to group the columns of a DataFrame by the values of their sums.
For instance, let's say I have the following data:
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
I would like columns a, b and c to be grouped, since they all have a sum equal to 1. The resulting DataFrame would have column labels equal to the sums of the columns it grouped, like this:
1 9
0 2 2
1 1 3
2 0 4
Any idea to put me in the right direction? Thanks in advance!
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
1 9
0 2 2
1 1 3
2 0 4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over axis 0 (the index), giving you the two groups: 1 (columns a, b, and c) and 9 (column d). You want to group the columns (axis=1) and take the sum of each group.
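Note that groupby(..., axis=1) is deprecated in newer pandas versions; an equivalent sketch transposes first, groups the rows by the column sums, and transposes back:
# Group the transposed frame's rows by the column sums, then transpose back.
df.T.groupby(df.sum()).sum().T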
Because pandas is designed with database concepts in mind, it really expects information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(axis=1)
print(df.groupby('totals').sum().transpose())
#totals 1 9
#0 2 2
#1 1 3
#2 0 4