separating data frames instead of dropping in pandas - python

In Python, when using pandas, duplicate rows can be dropped using drop_duplicates. Is there any way to separate the DataFrame into two DataFrames instead of actually "dropping" the rows?

If you want to split the DataFrame by duplicates, you can use the boolean array returned by .duplicated():
>>> df = pd.DataFrame({"A": [1,1,2,3,2,4]})
>>> df
A
0 1
1 1
2 2
3 3
4 2
5 4
[6 rows x 1 columns]
>>> df_a, df_b = df[~df.duplicated()], df[df.duplicated()]
>>> df_a
A
0 1
2 2
3 3
5 4
[4 rows x 1 columns]
>>> df_b
A
1 1
4 2
[2 rows x 1 columns]
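
A side note beyond the answer above: by default .duplicated() marks only the repeats after the first occurrence. Its keep parameter controls which occurrence counts as the "original"; a minimal sketch:
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2, 3, 2, 4]})

# keep='first' (default): repeats after the first occurrence are True
# keep='last': occurrences before the last are True
# keep=False: every row whose value appears more than once is True
mask = df.duplicated(keep=False)
uniques, dupes = df[~mask], df[mask]
With keep=False, df[~mask] keeps only the values that never repeat (here 3 and 4), while df[mask] collects every member of each duplicate group.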

Related

delete the first n rows of each id in dataframe

I have a DataFrame with two columns. I want to delete the first 3 rows of each id. If an id has three or fewer rows, delete all of its rows as well. In the following, ids 3 and 1 have 3 and 2 rows respectively, so they should be deleted entirely; for ids 4 and 2, only their 4th and 5th rows are preserved.
import pandas as pd
df = pd.DataFrame()
df['id'] = [4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 3, 3, 3, 1, 1]
df['value'] = [2, 1, 1, 2, 3, 4, 6, -1, -2, 2, -3, 5, 7, -2, 5]
Here is the DataFrame I want:
id value
3 4 2
4 4 3
8 2 -2
9 2 2
Number each "id" using groupby + cumcount and filter the rows where the number is greater than 2:
out = df[df.groupby('id').cumcount() > 2]
Output:
id value
3 4 2
4 4 3
8 2 -2
9 2 2
Series.value_counts and Series.map give you a boolean mask based on how often each id occurs. On its own that mask keeps every row of the frequent ids, so to reproduce the required output it has to be combined with the cumcount condition:
new_df = df[df['id'].map(df['id'].value_counts().gt(3)) & df.groupby('id').cumcount().gt(2)]
id value
3 4 2
4 4 3
8 2 -2
9 2 2
Using cumcount is the way to go, but drop works as well:
out = df.groupby('id', sort=False).apply(lambda x: x.drop(x.index[:3])).reset_index(drop=True)
Output:
id value
0 4 2
1 4 3
2 2 -2
3 2 2
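
To see why the cumcount comparison works, it helps to look at the intermediate counter. A small self-contained sketch using the question's data (the rank_in_group column name is just for illustration):
import pandas as pd

df = pd.DataFrame({
    'id':    [4, 4, 4, 4, 4, 2, 2, 2, 2, 2, 3, 3, 3, 1, 1],
    'value': [2, 1, 1, 2, 3, 4, 6, -1, -2, 2, -3, 5, 7, -2, 5],
})

# cumcount numbers the rows within each id group starting at 0,
# so "> 2" keeps everything from the 4th row of each group onward;
# groups with 3 or fewer rows never reach 3 and vanish entirely
df['rank_in_group'] = df.groupby('id').cumcount()
print(df[df['rank_in_group'] > 2].drop(columns='rank_in_group'))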

Multiply a pandas DataFrame with a column of another DataFrame

Is there a way to multiply each element of a row of a DataFrame by an element of the same row from a particular column of another DataFrame?
For example, such that:
df1:
1 2 3
2 2 2
3 2 1
and df2:
x 1 b
z 2 c
x 4 a
results in
1 2 3
4 4 4
12 8 4
So basically such that df1[i,:] * df2[i,j] = df3[i,:].
Multiply the first df by the column of the second df.
Assuming your column names are 0, 1, 2:
df1.mul(df2[1], axis=0)
Output
0 1 2
0 1 2 3
1 4 4 4
2 12 8 4
Here you go. I have created a variable that lets you select which column of the second DataFrame you want to multiply with the numbers in the first DataFrame. The numeric values sit in column 1 of df2, and the selected column is broadcast down the rows:
import numpy as np

arr1 = np.array(df1) # df1 to array
which_df2col_to_multiply = 1 # select the numeric col from df2
arr2 = np.array(df2)[:, which_df2col_to_multiply] # selected col to array
print(arr2[:, None] * arr1) # broadcast arr2 across each row of arr1
This is the output:
[[1 2 3]
[4 4 4]
[12 8 4]]
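
For reference, a self-contained version of the pandas one-liner above, assuming both frames use default integer column labels as in the question:
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [2, 2, 2], [3, 2, 1]])
df2 = pd.DataFrame([['x', 1, 'b'], ['z', 2, 'c'], ['x', 4, 'a']])

# multiply each row of df1 by the matching entry in column 1 of df2
df3 = df1.mul(df2[1], axis=0)
print(df3)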

How to manually arrange rows in pandas dataframe

I have a small DataFrame produced from value_counts() that I want to plot with a categorical x axis. It's a bit bigger than this, but:
Age Income
25-30 10
65-70 5
35-40 2
I want to be able to manually reorder the rows. How do I do this?
You can reorder rows with .reindex:
>>> df
a b
0 1 4
1 2 5
2 3 6
>>> df.reindex([1, 2, 0])
a b
1 2 5
2 3 6
0 1 4
You can also create a sorting criterion and use it:
df = pd.DataFrame({'Age':['25-30','65-70','35-40'],'Income':[10,5,2]})
sort_criteria = {'25-30': 0, '35-40': 1, '65-70': 2}
df = df.loc[df['Age'].map(sort_criteria).sort_values(ascending=True).index]
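
Another common idiom, not from the original answers: declare Age as an ordered Categorical, which makes the custom order stick for sorting and for categorical plot axes alike. A sketch:
import pandas as pd

df = pd.DataFrame({'Age': ['25-30', '65-70', '35-40'], 'Income': [10, 5, 2]})

# declare the desired category order once, then sort by it
df['Age'] = pd.Categorical(df['Age'], categories=['25-30', '35-40', '65-70'], ordered=True)
df = df.sort_values('Age')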

How to add numpy matrix as new columns for pandas dataframe?

I have an NxM DataFrame and an NxL numpy matrix. I'd like to add the matrix to the DataFrame to create L new columns, simply appending the columns and rows in the same order they appear. I tried merge() and join(), but I end up with errors:
assign() keywords must be strings
and
columns overlap but no suffix specified
respectively.
Is there a way I can add a numpy matrix as dataframe columns?
You can turn the matrix into a DataFrame and use concat with axis=1.
For example, given a dataframe df and a numpy array mat:
>>> df
a b
0 5 5
1 0 7
2 1 0
3 0 4
4 6 4
>>> mat
array([[0.44926098, 0.29567859, 0.60728561],
[0.32180566, 0.32499134, 0.94950085],
[0.64958125, 0.00566706, 0.56473627],
[0.17357589, 0.71053224, 0.17854188],
[0.38348102, 0.12440952, 0.90359566]])
You can do:
>>> pd.concat([df, pd.DataFrame(mat)], axis=1)
a b 0 1 2
0 5 5 0.449261 0.295679 0.607286
1 0 7 0.321806 0.324991 0.949501
2 1 0 0.649581 0.005667 0.564736
3 0 4 0.173576 0.710532 0.178542
4 6 4 0.383481 0.124410 0.903596
Setup
df = pd.DataFrame({'a': [5,0,1,0,6], 'b': [5,7,0,4,4]})
mat = np.random.rand(5,3)
Using join:
df.join(pd.DataFrame(mat))
a b 0 1 2
0 5 5 0.884061 0.803747 0.727161
1 0 7 0.464009 0.447346 0.171881
2 1 0 0.353604 0.912781 0.199477
3 0 4 0.466095 0.136218 0.405766
4 6 4 0.764678 0.874614 0.310778
If there is a chance of overlapping column names, simply supply a suffix:
df = pd.DataFrame({0: [5,0,1,0,6], 1: [5,7,0,4,4]})
mat = np.random.rand(5,3)
df.join(pd.DataFrame(mat), rsuffix='_')
0 1 0_ 1_ 2
0 5 5 0.783722 0.976951 0.563798
1 0 7 0.946070 0.391593 0.273339
2 1 0 0.710195 0.827352 0.839212
3 0 4 0.528824 0.625430 0.465386
4 6 4 0.848423 0.467256 0.962953
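
One caveat worth knowing: both concat and join align on the index, so if df does not have the default RangeIndex, build the new frame with df's index explicitly, or you will get NaN-padded rows. A sketch under that assumption (the mat_0/mat_1 column names are just illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [5, 0, 1], 'b': [5, 7, 0]}, index=['r1', 'r2', 'r3'])
mat = np.random.rand(3, 2)

# reuse df's index so rows line up positionally instead of by mismatched labels
extra = pd.DataFrame(mat, index=df.index, columns=['mat_0', 'mat_1'])
out = pd.concat([df, extra], axis=1)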

Adding row in Pandas DataFrame keeping index order

I have a DataFrame and I would like to add some rows that do not yet exist in it. I have found the .loc method, but this adds the values at the end, not in sorted order. For example:
import numpy as np
import pandas as pd
dfi = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])
>>> dfi
A B
0 0 1
1 2 3
2 4 5
[3 rows x 2 columns]
Adding a nonexistent row through .loc:
dfi.loc[5,:] = 0
>>> dfi
A B
0 0 1
1 2 3
2 4 5
5 0 0
[4 rows x 2 columns]
So far everything ok. But this is what happens when trying to add another row, with index smaller than the last one:
dfi.loc[3,:] = 0
>>> dfi
A B
0 0 1
1 2 3
2 4 5
5 0 0
3 0 0
[5 rows x 2 columns]
I would like it to put the row with index 3 between row 2 and row 5. I could sort the DataFrame by index every time, but that would take too long. Is there another way?
My actual problem involves a DataFrame whose indexes are datetime objects. I didn't include the full details of that implementation here because they would obscure my real problem: adding rows to a DataFrame such that the result keeps an ordered index.
If your index is almost continuous, only missing a few values here and there, you might try the following: preallocate a frame of NaNs, assign into it by position, and drop the unused rows at the end.
df = pd.DataFrame(np.zeros((100, 2)), columns=['A', 'B'])
df['A'] = np.nan
df['B'] = np.nan
df.iloc[[0, 1, 2]] = pd.DataFrame({'A': [0, 2, 4], 'B': [1, 3, 5]})
df.iloc[5] = [0, 0]
df.iloc[3] = 0
print(df.dropna())
A B
0 0 1
1 2 3
2 4 5
3 0 0
5 0 0
[5 rows x 2 columns]
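
If preallocation is awkward (for example with datetime indexes), a simpler sketch is to insert with .loc and re-sort once after a batch of insertions rather than after every single one; sort_index is O(n log n), but a single pass is usually cheap:
import numpy as np
import pandas as pd

dfi = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])
dfi.loc[5, :] = 0
dfi.loc[3, :] = 0

# one sort restores an ordered index for all inserted rows at once
dfi = dfi.sort_index()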
