I have a dataset of events with a date column which I need to display in a weekly plot and then process further. After some googling I found pd.Grouper(freq="W"), so I am using that to group the events by week and display them. My problem is that after the groupby and unstack I end up with a data frame containing a column that I cannot refer to except with iloc. This is an issue because in later plots I am grouping by other columns, so I need a way to refer to this column by name, not by iloc.
Here's a reproducible example of my dataset:
import pandas as pd
from datetime import datetime
from faker import Faker
fake = Faker()
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 2, 1)
# Generate data frame of 30 random dates in January 2023
df = pd.DataFrame(
{"date": [fake.date_time_between(start_date=start_date, end_date=end_date) for i in range(30)],
"dummy": [1 for i in range(30)]}) # There's probably a better way of counting than this
grouper = df.set_index("date").groupby([pd.Grouper(freq="W"), 'dummy'])
result = grouper['dummy'].count().unstack('dummy').fillna(0)
The result data frame that I get has weird indexes/columns that I am unable to navigate:
>>> print(result)
dummy 1
date
2023-01-01 1
2023-01-08 3
2023-01-15 4
2023-01-22 9
2023-01-29 8
2023-02-05 5
>>> print(result.columns)
Int64Index([1], dtype='int64', name='dummy')
The only column here is dummy, but even result.dummy gives an AttributeError.
I've also tried result.reset_index():
dummy date 1
0 2023-01-01 1
1 2023-01-08 3
2 2023-01-15 4
3 2023-01-22 9
4 2023-01-29 8
5 2023-02-05 5
But from this data frame I can only access the date column - the counts column named "1" cannot be accessed, as result.reset_index()["1"] also raises an error.
I am completely perplexed by what is going on here. Pandas is really powerful, but sometimes I find it incredibly unintuitive. I've checked several pages of the docs and checked whether there's another index level (there isn't). Can someone who's better at pandas help me out here?
I just want a way to convert the grouped data frame into something like this:
date counts
0 2023-01-01 1
1 2023-01-08 3
2 2023-01-15 4
3 2023-01-22 9
4 2023-01-29 8
5 2023-02-05 5
Where date and counts are columns and there is an unnamed index
You can solve this by simply doing:
import pandas as pd
from datetime import datetime
from faker import Faker

fake = Faker()
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 2, 1)

# Generate a data frame of 30 random dates in January 2023
df = pd.DataFrame(
    {"date": [fake.date_time_between(start_date=start_date, end_date=end_date) for i in range(30)],
     "dummy": [1 for i in range(30)]})  # There's probably a better way of counting than this

# Group directly on the "date" column via key=, then turn the counts into a plain column
result = df.groupby([pd.Grouper(freq="W", key='date'), 'dummy'])['dummy'].count()
result = result.reset_index(name='counts')
result = result.drop(['dummy'], axis=1)
which gives
date counts
0 2023-01-01 3
1 2023-01-08 7
2 2023-01-15 5
3 2023-01-22 5
4 2023-01-29 8
5 2023-02-05 2
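As an aside on why the original attempt felt impossible to navigate: after unstack('dummy'), the frame's single column is labeled with the integer 1 (the value of dummy), not the string "1", so it can still be selected by label. A minimal sketch, reusing the question's df:
# Reproduce the question's unstacked frame
wide = (df.set_index("date")
          .groupby([pd.Grouper(freq="W"), "dummy"])["dummy"]
          .count()
          .unstack("dummy")
          .fillna(0))

# The only column label is the integer 1, so select it by that label
counts = wide[1]

# wide["1"] raises a KeyError and wide.dummy raises an AttributeError,
# because "dummy" is the name of the columns Index, not a column label.

# Renaming the integer label gives an ordinary, named column
tidy = wide.reset_index().rename(columns={1: "counts"})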
I am retrieving Yahoo stock ticker data and want to convert the given currency to euros. For this purpose I am using the Python Library Currency Converter and the pandas method multiply.
One of the columns, trading volume, shouldn't be "converted" - what's the best way to skip it?
This is what I currently have:
import pandas as pd
import datetime
import pandas_datareader.data as web
from pandas import Series, DataFrame
from currency_converter import CurrencyConverter
start = datetime.datetime(2017, 1, 1)
end = datetime.datetime(2020, 12, 31)
c = CurrencyConverter()
df = web.DataReader("EXK", 'yahoo', start, end)
df.tail()
conversion = c.convert(1, 'USD', 'EUR')
eurodf = df.multiply(conversion,axis='rows')
eurodf.tail()
One approach I thought of was to join the "volume" column back on after the multiplication.
Alternatively, I could just target that one column and convert it back?
You can use .loc to select all columns except one. For example, given this data frame:
A B C
0 0 1 2
1 3 4 5
2 6 7 8
df.loc[:, df.columns.drop('B')] *= 10
Result:
A B C
0 0 1 20
1 30 4 50
2 60 7 80
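Applied to the question's ticker data, the same idea converts every price column while leaving the trading volume untouched. A sketch, assuming the volume column is named 'Volume' (the label pandas-datareader normally uses):
# Multiply every column except the trading volume by the conversion rate
price_cols = df.columns.drop('Volume')   # 'Volume' is an assumption about the column name
eurodf = df.copy()
eurodf.loc[:, price_cols] *= conversion
eurodf.tail()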
What is the easiest way to remove duplicate columns from a dataframe?
I am reading a text file that has duplicate columns via:
import pandas as pd
df=pd.read_table(fname)
The column names are:
Time, Time Relative, N2, Time, Time Relative, H2, etc...
All the Time and Time Relative columns contain the same data. I want:
Time, Time Relative, N2, H2
All my attempts at dropping, deleting, etc such as:
df=df.T.drop_duplicates().T
Result in uniquely valued index errors:
Reindexing only valid with uniquely valued index objects
Sorry for being a pandas noob. Any suggestions would be appreciated.
Additional Details
Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)
data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):
Time Time Relative [s] N2[%] Time Time Relative [s] H2[ppm]
2/12/2013 9:20:55 AM 6.177 9.99268e+001 2/12/2013 9:20:55 AM 6.177 3.216293e-005
2/12/2013 9:21:06 AM 17.689 9.99296e+001 2/12/2013 9:21:06 AM 17.689 3.841667e-005
2/12/2013 9:21:18 AM 29.186 9.992954e+001 2/12/2013 9:21:18 AM 29.186 3.880365e-005
... etc ...
2/12/2013 2:12:44 PM 17515.269 9.991756+001 2/12/2013 2:12:44 PM 17515.269 2.800279e-005
2/12/2013 2:12:55 PM 17526.769 9.991754e+001 2/12/2013 2:12:55 PM 17526.769 2.880386e-005
2/12/2013 2:13:07 PM 17538.273 9.991797e+001 2/12/2013 2:13:07 PM 17538.273 3.131447e-005
Here's a one-line solution to remove columns based on duplicate column names:
df = df.loc[:,~df.columns.duplicated()].copy()
How it works:
Suppose the columns of the data frame are ['alpha','beta','alpha']
df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False, the column name is unique up to that point; if it is True, the column name was seen earlier. For example, with the columns above, the returned value would be [False, False, True].
Pandas allows one to index using boolean values, whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (i.e. [True, True, False] = ~[False, False, True]).
Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.
The final .copy() is there to copy the dataframe to (mostly) avoid getting errors about trying to modify an existing dataframe later down the line.
Note: the above only checks columns names, not column values.
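A tiny runnable demo of the idea, using the ['alpha','beta','alpha'] columns from the example above:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['alpha', 'beta', 'alpha'])

print(df.columns.duplicated())              # [False False  True]
print(df.loc[:, ~df.columns.duplicated()])  # keeps the first 'alpha' and 'beta'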
To remove duplicated indexes
Since it is similar enough, do the same thing on the index:
df = df.loc[~df.index.duplicated(),:].copy()
To remove duplicates by checking values without transposing
Update and caveat: please be careful in applying this. Per the counter-example provided by DrWhat in the comments, this solution may not have the desired outcome in all cases.
df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()
This avoids the issue of transposing. Is it fast? No. Does it work? In some cases. Here, try it on this:
import numpy as np
import pandas as pd

# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0, 100, size=(736334, 1312)))

# to see size in gigs
# ldf.memory_usage().sum()/1e9  # it's about 3 gigs

# duplicate a column
ldf.loc[:, 'dup'] = ldf.loc[:, 101]

# take out duplicated columns by values
ldf = ldf.loc[:, ~ldf.apply(lambda x: x.duplicated(), axis=1).all()].copy()
It sounds like you already know the unique column names. If that's the case, then df = df[['Time', 'Time Relative', 'N2']] would work.
If not, your solution should work:
In [101]: vals = np.random.randint(0,20, (4,3))
vals
Out[101]:
array([[ 3, 13, 0],
[ 1, 15, 14],
[14, 19, 14],
[19, 5, 1]])
In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
df
Out[106]:
Time H1 N2 Time Relative N2 Time
0 3 13 0 3 13 0
1 1 15 14 1 15 14
2 14 19 14 14 19 14
3 19 5 1 19 5 1
In [107]: df.T.drop_duplicates().T
Out[107]:
Time H1 N2
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
You probably have something specific to your data that's messing it up. We could give more help if there's more details you could give us about the data.
Edit:
Like Andy said, the problem is probably with the duplicate column titles.
For a sample table file 'dummy.csv' I made up:
Time H1 N2 Time N2 Time Relative
3 13 13 3 13 0
1 15 15 1 15 14
14 19 19 14 19 14
19 5 5 19 5 1
using read_table gives unique columns and works properly:
In [151]: df2 = pd.read_table('dummy.csv')
df2
Out[151]:
Time H1 N2 Time.1 N2.1 Time Relative
0 3 13 13 3 13 0
1 1 15 15 1 15 14
2 14 19 19 14 19 14
3 19 5 5 19 5 1
In [152]: df2.T.drop_duplicates().T
Out[152]:
Time H1 Time Relative
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
If your version doesn't let you, you can hack together a solution to make them unique:
In [169]: df2 = pd.read_table('dummy.csv', header=None)
df2
Out[169]:
0 1 2 3 4 5
0 Time H1 N2 Time N2 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [171]: from collections import defaultdict
col_counts = defaultdict(int)
col_ix = df2.first_valid_index()
In [172]: cols = []
for col in df2.loc[col_ix]:
    cnt = col_counts[col]
    col_counts[col] += 1
    suf = '_' + str(cnt) if cnt else ''
    cols.append(col + suf)
cols
Out[172]:
['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
Time H1 N2 Time_1 N2_1 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [178]: df2.T.drop_duplicates().T
Out[178]:
Time H1 Time Relative
1 3 13 0
2 1 15 14
3 14 19 14
4 19 5 1
Transposing is inefficient for large DataFrames. Here is an alternative:
def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")
        vs = list(dcols.values())
        ks = list(dcols.keys())
        lvs = len(vs)
        for i in range(lvs):
            for j in range(i + 1, lvs):
                if vs[i] == vs[j]:
                    dups.append(ks[i])
                    break
    return dups
Use it like this:
dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)
Edit
A memory-efficient version that treats NaNs like any other value:
from pandas.core.common import array_equivalent  # in newer pandas: from pandas.core.dtypes.missing import array_equivalent
def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            ia = vs.iloc[:, i].values
            for j in range(i + 1, lcs):
                ja = vs.iloc[:, j].values
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break
    return dups
If I'm not mistaken, the following does what was asked without the memory problems of the transpose solution and with fewer lines than #kalu's function, keeping the first of any similarly named columns.
Cols = list(df.columns)
for i, item in enumerate(df.columns):
    if item in df.columns[:i]:
        Cols[i] = "toDROP"
df.columns = Cols
df = df.drop("toDROP", axis=1)
It looks like you were on the right path. Here is the one-liner you were looking for:
df.reset_index().T.drop_duplicates().T
But since there is no example data frame that produces the referenced error message Reindexing only valid with uniquely valued index objects, it is tough to say exactly what would solve the problem. If restoring the original index is important to you, do this:
original_index = df.index.names
df.reset_index().T.drop_duplicates().T.set_index(original_index)
Note that Gene Burinsky's answer (at the time of writing, the selected answer) keeps the first of each duplicated column. To keep the last:
df = df.loc[:, ~df.columns[::-1].duplicated()[::-1]]
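The double reversal can also be written with the keep parameter of Index.duplicated, which marks all but the last occurrence when keep='last':
df = df.loc[:, ~df.columns.duplicated(keep='last')]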
An update on #kalu's answer, which uses the latest pandas:
def find_duplicated_columns(df):
    dupes = []
    columns = df.columns
    for i in range(len(columns)):
        col1 = df.iloc[:, i]
        for j in range(i + 1, len(columns)):
            col2 = df.iloc[:, j]
            # break early if dtypes aren't the same (helps deal with
            # categorical dtypes)
            if col1.dtype is not col2.dtype:
                break
            # otherwise compare values
            if col1.equals(col2):
                dupes.append(columns[i])
                break
    return dupes
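A usage sketch, mirroring the drop pattern from #kalu's answer:
dupes = find_duplicated_columns(df)
df = df.drop(columns=dupes)  # note: drop() removes every column carrying a listed label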
Although #Gene Burinsky's answer is great, it has a potential problem in that the reassigned df may be either a copy or a view of the original df.
This means that subsequent assignments like df['newcol'] = 1 generate a SettingWithCopy warning and may fail (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing). The following solution prevents that issue:
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
I ran into this problem, and the one-liner provided by the first answer worked well. However, I had the extra complication that the second copy of the column had all of the data and the first copy did not.
My solution was to split the one data frame into two by toggling the negation operator, and then join them back together with a join statement using lsuffix. That way I could reference, and then delete, the copy of the column without the data.
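A hedged sketch of that split-and-join idea, on a toy frame (the column names and the '_empty' suffix are illustrative, not from my original data):
import pandas as pd

# Toy frame with a duplicated column name where only the second copy holds data
df = pd.DataFrame([[None, 1, 10.0], [None, 2, 20.0]],
                  columns=['Time', 'N2', 'Time'])

# Split into first occurrences and duplicated occurrences
mask = df.columns.duplicated()
first = df.loc[:, ~mask]
dups = df.loc[:, mask]

# Join them back; lsuffix renames the empty left-hand copy so it can be dropped
joined = first.join(dups, lsuffix='_empty').drop(columns='Time_empty')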
March 2021 update
The subsequent post by #CircArgs may have provided a succinct one-liner to accomplish what I described here.
First step: read the first row only (i.e. the column names) and drop the duplicate names.
Second step: read the file again, keeping only those columns.
cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)
The approach below identifies the duplicate columns, so you can review what went wrong when the dataframe was originally built.
dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]
Just in case somebody is still looking for a way to find duplicated values across columns in a pandas data frame, I came up with this solution:
def get_dup_columns(m):
    '''
    This will check every column in the data frame
    and verify if you have duplicated columns.
    It can help whenever you are cleaning big data sets of 50+ columns
    and clean them up a little bit for you.
    The result will be a list of tuples showing which columns are duplicates,
    for example
    (column A, column C)
    means that column A is duplicated with column C.
    More info: https://wanatux.com
    '''
    headers_list = [x for x in m.columns]
    duplicate_col2 = []
    y = 0
    while y <= len(headers_list) - 1:
        for x in range(1, len(headers_list) - 1):
            if m[headers_list[y]].equals(m[headers_list[x]]) == False:
                continue
            else:
                duplicate_col2.append((headers_list[y], headers_list[x]))
        headers_list.pop(0)
    return duplicate_col2
And you can call it like this:
duplicate_col = get_dup_columns(pd_excel)
It will show a result like the following:
[('column a', 'column k'),
('column a', 'column r'),
('column h', 'column m'),
('column k', 'column r')]
I am not sure why Gene Burinsky's answer did not work for me - I was getting back the same original dataframes with duplicated columns. My workaround was to force the selection over the underlying ndarray and rebuild the dataframe.
df = pd.DataFrame(df.values[:,~df.columns.duplicated()], columns=df.columns[~df.columns.duplicated()])
A simple column-wise comparison is the most efficient way (in terms of memory and time) to check duplicated columns by values. Here is an example:
import numpy as np
import pandas as pd
from itertools import combinations as combi
df = pd.DataFrame(np.random.uniform(0,1, (100,4)), columns=['a','b','c','d'])
df['a'] = df['d'].copy() # column 'a' is equal to column 'd'
# to keep the first
dupli_cols = [cc[1] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
# to keep the last
dupli_cols = [cc[0] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
df = df.drop(columns=dupli_cols)
In case you want to check for duplicate columns, this code can be useful
columns_to_drop = []
for cname in sorted(list(df)):
    for cname2 in sorted(list(df))[::-1]:
        if df[cname].equals(df[cname2]) and cname != cname2 and cname not in columns_to_drop:
            columns_to_drop.append(cname2)
            print(cname, cname2, 'Are equal')
df = df.drop(columns_to_drop, axis=1)
Fast and easy way to drop the duplicated columns by their values:
df = df.T.drop_duplicates().T
More info: Pandas DataFrame drop_duplicates manual.