I have this dataframe and I need to drop all duplicates, but I need to keep the first AND last values.
For example:
1 0
2 0
3 0
4 0
output:
1 0
4 0
I tried df.column.drop_duplicates(keep=("first","last")) but it doesn't work, it returns
ValueError: keep must be either "first", "last" or False
Does anyone know a workaround for this?
Thanks
You could use pandas' concat function to create a dataframe with both the first and last values.
pd.concat([
df['X'].drop_duplicates(keep='first'),
df['X'].drop_duplicates(keep='last'),
])
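For example, on the question's data this would look like the sketch below ('X' stands in for your column name; the extra index de-duplication guards against values that occur only once, which would otherwise be kept twice):
import pandas as pd

df = pd.DataFrame({'X': [0, 0, 0, 0]}, index=[1, 2, 3, 4])

out = pd.concat([
    df['X'].drop_duplicates(keep='first'),
    df['X'].drop_duplicates(keep='last'),
]).sort_index()
out = out[~out.index.duplicated()]  # drop the second copy of unique values
print(out)
# 1    0
# 4    0
# Name: X, dtype: int64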
You can't drop both first and last, so the trick is to concat dataframes of first and last. When you concat, you have to avoid duplicating the rows that were already unique, so only concat the indexes of the second dataframe that are not in the first. (Not sure if merge/join would work better?)
import pandas as pd
d = {1:0,2:0,10:1, 3:0,4:0}
df = pd.DataFrame.from_dict(d, orient='index', columns=['cnt'])
print(df)
cnt
1 0
2 0
10 1
3 0
4 0
Then do this:
d1 = df.drop_duplicates(keep='first')
d2 = df.drop_duplicates(keep='last')
d3 = pd.concat([d1, d2.loc[d2.index.difference(d1.index)]])  # avoid set(), which newer pandas rejects as an indexer
d3
d3
Out[60]:
cnt
1 0
10 1
4 0
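A variant of the same idea that avoids the set arithmetic (a sketch using DataFrame.duplicated): keep a row if it is the first or the last occurrence of its value, so unique rows are kept exactly once.
mask = ~df.duplicated(keep='first') | ~df.duplicated(keep='last')
print(df[mask])
#     cnt
# 1     0
# 10    1
# 4     0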
Use a groupby on your column named column, then keep the first and last row of each group. If you ever want to check for duplicate values in more than one column, you can extend the columns you include in your groupby, as shown in the sketch after the output below.
df = pd.DataFrame({'column':[0,0,0,0]})
Input:
column
0 0
1 0
2 0
3 0
df.groupby('column', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[0, -1]]).reset_index(level=0, drop=True)
Output:
column
0 0
3 0
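As a sketch of that multi-column case (df2 and its columns are hypothetical names), the same pattern just lists several columns in the groupby:
df2 = pd.DataFrame({'a': [0, 0, 0, 1], 'b': [5, 5, 5, 6]})
out = (df2.groupby(['a', 'b'], as_index=False)
          .apply(lambda x: x if len(x) == 1 else x.iloc[[0, -1]])
          .reset_index(level=0, drop=True))
print(out)
#    a  b
# 0  0  5
# 2  0  5
# 3  1  6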
I have a list and I would like to convert it to a pandas dataframe. In the second column, I want all zeros, but I got an "object of type 'int' has no len()" error. What I did is this:
df = pd.DataFrame([all_equal_timestamps['A'], 0], columns=['data','label'])
How can I add a second column with all zeros to this dataframe in the easiest manner, and why did the code above give me this error?
Not sure what is in all_equal_timestamps, so I presume it's a list of elements. Do you mean to get this result?
import pandas as pd
all_equal_timestamps = {'A': ['1234', 'aaa', 'asdf']}
df = pd.DataFrame(all_equal_timestamps['A'], columns=['data']).assign(label=0)
# df['label'] = 0
print(df)
Output:
data label
0 1234 0
1 aaa 0
2 asdf 0
If you're creating a DataFrame from a list of lists, you'd need something like this (note the second row must be a list of '0' strings, not the single string produced by '0'*len(...)):
df = pd.DataFrame([ all_equal_timestamps['A'], ['0']*len(all_equal_timestamps['A']) ], columns=['data', 'label', 'anothercol'])
print(df)
Output:
data label anothercol
0 1234 aaa asdf
1 0 0 0
You can add a column named "new" with all zeros by using
df['new'] = 0
You can do it all in one line with assign:
timestamps = [1,0,3,5]
pd.DataFrame({"Data":timestamps}).assign(new=0)
Output:
Data new
0 1 0
1 0 0
2 3 0
3 5 0
I have some trouble with my Python work,
my steps are:
1) add the list to an ordinary DataFrame
2) delete the column(s) that hold the minimum of the list
My list is called 'each_c' and my ordinary DataFrame is called 'df_col'.
I want it to become like this:
Hope someone can help me, thanks!
This is clearly described in the documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
df_col.drop(columns=[3])
Convert each_c to a Series, append it with DataFrame.append, then get the index of the minimal value with Series.idxmin and pass it to drop. This removes only the first minimal column:
s = pd.Series(each_c)
df = df_col.append(s, ignore_index=True).drop(s.idxmin(), axis=1)
If need remove all columns if multiple minimals:
import numpy as np
import pandas as pd

each_c = [-0.025, 0.008, -0.308, -0.308]
s = pd.Series(each_c)
df_col = pd.DataFrame(np.random.random((10, 4)))
df = df_col.append(s, ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print(df)
0 1
0 0.602312 0.641220
1 0.586233 0.634599
2 0.294047 0.339367
3 0.246470 0.546825
4 0.093003 0.375238
5 0.765421 0.605539
6 0.962440 0.990816
7 0.810420 0.943681
8 0.307483 0.170656
9 0.851870 0.460508
10 -0.025000 0.008000
EDIT: If the solution raises the error:
IndexError: Boolean index has wrong length:
it means the columns are not the default range names 0, 1, 2, 3. A possible solution is to set the Series' index to the column names with rename:
import numpy as np
import pandas as pd

each_c = [-0.025, 0.008, -0.308, -0.308]
df_col = pd.DataFrame(np.random.random((10, 4)), columns=list('abcd'))
s = pd.Series(each_c).rename(dict(enumerate(df_col.columns)))  # df_col, not df, which is not defined yet
df = df_col.append(s, ignore_index=True)
df = df.loc[:, s.ne(s.min())]
print(df)
print (df)
a b
0 0.321498 0.327755
1 0.514713 0.575802
2 0.866681 0.301447
3 0.068989 0.140084
4 0.069780 0.979451
5 0.629282 0.606209
6 0.032888 0.204491
7 0.248555 0.338516
8 0.270608 0.731319
9 0.732802 0.911920
10 -0.025000 0.008000
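Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same idea can be sketched with pd.concat:
import numpy as np
import pandas as pd

each_c = [-0.025, 0.008, -0.308, -0.308]
df_col = pd.DataFrame(np.random.random((10, 4)))

s = pd.Series(each_c)
row = s.to_frame().T  # one-row frame whose columns match df_col
df = pd.concat([df_col, row], ignore_index=True)
df = df.loc[:, s.ne(s.min())]  # keep only the non-minimal columns
print(df)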
I have a pandas df (shown as an image in the original post). I want to delete any column if more than half of the values in the column are the same, and I don't know how to do this.
I tried using pandas.Series.value_counts, but with no luck.
You can iterate over the columns, count the occurrences of values with value_counts as you tried, and check whether the most common value accounts for more than 50% of the column's data.
n = len(df)
cols_to_drop = []
for e in df.columns:
    max_occ = df[e].value_counts().iloc[0]  # occurrences of the most common value
    if 2 * max_occ > n:  # more than half the length of the dataset
        cols_to_drop.append(e)
df = df.drop(cols_to_drop, axis=1)
You can use apply + value_counts, getting the first value to obtain the max count:
count = df.apply(lambda s: s.value_counts().iat[0])
col1 4
col2 2
col3 6
dtype: int64
Thus, simply turn it into a mask depending on whether the greatest count is more than half len(df), and slice:
count = df.apply(lambda s: s.value_counts().iat[0])
df.loc[:, count.le(len(df)/2)] # use 'lt' if needed to drop if exactly half
output:
col2
0 0
1 1
2 0
3 1
4 2
5 3
Used input:
df = pd.DataFrame({'col1': [0,1,0,0,0,1],
'col2': [0,1,0,1,2,3],
'col3': [0,0,0,0,0,0],
})
Boolean slicing with a comprehension:
df.loc[:, [
    df.shape[0] // s.value_counts().max() >= 2
    for _, s in df.items()  # iteritems was removed in pandas 2.0; items is equivalent
]]
col2
0 0
1 1
2 0
3 1
4 2
5 3
Credit to #mozway for input data.
I have two dataframes:
The first one looks like this:
                variable
entry subentry
0     1         X
      2         Y
      3         Z
and the second one looks like:
                variable
entry subentry
0     1         A
      2         B
I would like to merge the two dataframe such that I get:
                variable
entry subentry
0     1         X
      2         Y
      3         Z
1     1         A
      2         B
Simply using df1.append(df2, ignore_index=True) gives
  variable
0        X
1        Y
2        Z
3        A
4        B
In other words, it collapses the multindex into a single index. Is there a way around this?
Edit: Here is a code snippet that will reproduce the problem:
import numpy as np
import pandas as pd

arrays = [
    np.array([0, 0, 0]),
    np.array([0, 1, 2]),
]
arrays_2 = [
    np.array([0, 0]),
    np.array([0, 1]),
]
df1 = pd.DataFrame(np.random.randn(3, 1), index=arrays)
df2 = pd.DataFrame(np.random.randn(2, 1), index=arrays_2)
df = df1.append(df2, ignore_index=True)
print(df)
Edit: In practice, I am looking to combine N dataframes, each with a different number of "entry" rows, so I am looking for an approach that does not rely on me knowing the exact number of dataframes I am combining.
One way to try:
pd.concat([df1, df2], keys=[0,1]).droplevel(1)
Output:
0
0 0 -0.439749
1 -0.478744
2 0.719870
1 0 -1.055648
1 -2.007242
Use pd.concat to concat the dataframes together and, since entry is the same in both, use the keys parameter to create a new level with the naming you want your level to have. Finally, go back and drop the old index level (where the value was the same).
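For the N-dataframe case mentioned in the edit, the same pattern generalizes without knowing N up front (a sketch, assuming dfs is your list of frames built as above):
dfs = [df1, df2]  # any number of frames
combined = pd.concat(dfs, keys=range(len(dfs))).droplevel(1)
print(combined)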
I have a very large DataFrame that looks like this:
A B
SPH2008 3/21/2008 1 2
3/21/2008 1 2
3/21/2008 1 2
SPM2008 6/21/2008 1 2
6/21/2008 1 2
6/21/2008 1 2
And I have the following code, which is intended to flatten and acquire the unique pairs of the two index levels into a new DF:
indeces = [df.index.get_level_values(0), df.index.get_level_values(1)]
tmp = pd.DataFrame(data=indeces).T.drop_duplicates()
tmp.columns = ['ID', 'ExpirationDate']
tmp.sort_values('ExpirationDate', inplace=True)
However, this operation takes a remarkably long amount of time. Is there a more efficient way to do this?
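For reference, a minimal reproduction of the frame above (a sketch; the values are taken from the displayed example):
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('SPH2008', '3/21/2008')] * 3 + [('SPM2008', '6/21/2008')] * 3,
)
df = pd.DataFrame({'A': 1, 'B': 2}, index=idx)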
pandas.Index.drop_duplicates
pd.DataFrame([*df.index.drop_duplicates()], columns=['ID', 'ExpirationDate'])
ID ExpirationDate
0 SPH2008 3/21/2008
1 SPM2008 6/21/2008
With older versions of Python that can't unpack in that way
pd.DataFrame(df.index.drop_duplicates().tolist(), columns=['ID', 'ExpirationDate'])
IIUC, you can also groupby the levels of your multiindex, then create a dataframe from that with your desired columns:
>>> pd.DataFrame(df.groupby(level=[0,1]).groups.keys(), columns=['ID', 'ExpirationDate'])
ID ExpirationDate
0 SPH2008 3/21/2008
1 SPM2008 6/21/2008