I have a large dataframe of temperature measurements where some of the values are missing. The values are in two separate columns: one has the actual measurements (TEMP), while the other has only estimated temperatures (TEMP_ESTIMATED).
I'm trying to create a new column that combines these two values, so that the new column has the actual measurement whenever it exists (is not NaN) and otherwise the estimated value. Here is an example of the dataframe and how I would want it to look after the for loop.
I have tried many different ways to do this but none of them have worked so far. I'm still new to programming so I apologize if there are some obvious mistakes, just trying to learn more!
Here is what I tried last time, but the values were not added to the new column (I have already imported pandas, and all the temperature data is stored in a DataFrame called data):
for i in range(len(data)):
    if data.at[i, 'TEMP'] == 'NaN':
        data.at[i, 'TEMP_ALL'] = data.at[i, 'TEMP_ESTIMATED']
    else:
        data.at[i, 'TEMP_ALL'] = data.at[i, 'TEMP']
I would greatly appreciate any feedback on this or any alternate ways how to achieve the desired result, thank you!
You can try using np.where:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'DATE': ['20100101', '20100102', '20100103', '20100104', '20100105'],
                        'TEMP': [np.nan, np.nan, np.nan, 15, 20],
                        'TEMP_ESTIMATED': [10, 15, 16, 17, 22]})
df = df.rename_axis('index')
df['TEMP_ALL'] = np.where(np.isnan(df.TEMP), df.TEMP_ESTIMATED, df.TEMP)
           DATE  TEMP  TEMP_ESTIMATED  TEMP_ALL
index
0      20100101   nan              10        10
1      20100102   nan              15        15
2      20100103   nan              16        16
3      20100104    15              17        15
4      20100105    20              22        20
If your NaN values are strings, try:
df['TEMP_ALL'] = np.where(df.TEMP == 'NaN', df.TEMP_ESTIMATED, df.TEMP)
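As an aside (not from the original answer, just a common alternative), pandas can do the same fallback without np.where. Assuming the same df as above, fillna or combine_first take the measured TEMP and fall back to TEMP_ESTIMATED wherever TEMP is NaN:
# Prefer the measured value, fall back to the estimate where TEMP is missing
df['TEMP_ALL'] = df['TEMP'].fillna(df['TEMP_ESTIMATED'])
# or equivalently
df['TEMP_ALL'] = df['TEMP'].combine_first(df['TEMP_ESTIMATED'])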
I have a dataframe with 3 columns: Replaced_ID, New_ID, and Installation Date of New_ID.
Each New_ID replaces the Replaced_ID.
Replaced_ID New_ID Installation Date (of New_ID)
3 5 16/02/2018
5 7 17/05/2019
7 9 21/06/2019
9 11 23/08/2020
25 39 16/02/2017
39 41 16/08/2018
My goal is to get a dataframe which includes the first and last record of the sequence. I care only for the first Replaced_ID value and the last New_ID value.
i.e. from the above dataframe I want this:
Replaced_ID New_ID Installation Date (of New_ID)
3 11 23/08/2020
25 41 16/08/2018
Sorting by date and performing a shift is not the solution here, as far as I can tell.
Also, I tried joining the New_ID column with Replaced_ID, but that does not work because it only returns the immediately preceding link of the sequence.
I need to find a way to get the sequences [3, 5, 7, 9, 11] & [25, 39, 41] by combining the Replaced_ID & New_ID columns across all rows.
I care mostly about getting the first Replaced_ID value and the last New_ID value, and not the Installation Date, because I can perform a join at the end.
Any ideas here? Thanks.
First, let's create the DataFrame:
import pandas as pd
import numpy as np
from io import StringIO
data = """Replaced_ID,New_ID,Installation Date (of New_ID)
3,5,16/02/2018
5,7,17/05/2019
7,9,21/06/2019
9,11,23/08/2020
25,39,16/02/2017
39,41,16/08/2018
11,14,23/09/2020
41,42,23/10/2020
"""
### note that I've added two rows to check whether it works with non-consecutive rows
### defining some short hands
r = "Replaced_ID"
n = "New_ID"
i = "Installation Date (of New_ID)"
df = pd.read_csv(StringIO(data), header=0, sep=",")
df[i] = pd.to_datetime(df[i], dayfirst=True)  # dates are day/month/year
And now for my actual solution:
a = df[[r,n]].values.flatten()
### returns a flat list of r and n values which clearly show duplicate entries, i.e.:
# [ 3 5 5 7 7 9 9 11 25 39 39 41 11 14 41 42]
### now only get values that occur once,
# and reshape them nicely, such that the first column gives the lowest (replaced) id,
# and the second column gives the highest (new) id, i.e.:
# [[ 3 14]
# [25 42]]
u, c = np.unique( a, return_counts=True)
res = u[c == 1].reshape(2,-1)
### now filter the dataframe where "New_ID" is equal to the second column of res, i.e. [14,42]:
# and replace the entries in "r" with the "lowest possible values" of r
dfn = df[ df[n].isin(res[:,1].tolist()) ].copy()  ### .copy() avoids SettingWithCopy issues below
# print(dfn)
dfn.loc[:, r] = res[:, 0]
print(dfn)
Which yields:
Replaced_ID New_ID Installation Date (of New_ID)
6 3 14 2020-09-23
7 25 42 2020-10-23
Assuming the dates are sorted, you can create a helper series and then group by it and aggregate:
df['Installation Date (of New_ID)']=pd.to_datetime(df['Installation Date (of New_ID)'])
s = df['Replaced_ID'].ne(df['New_ID'].shift()).cumsum()
out = df.groupby(s).agg(
{"Replaced_ID":"first","New_ID":"last","Installation Date (of New_ID)":"last"}
)
print(out)
Replaced_ID New_ID Installation Date (of New_ID)
1 3 11 2020-08-23
2 25 41 2018-08-16
The helper series s differentiates the groups by comparing each Replaced_ID with the previous row's New_ID; when they do not match, it returns True. Then Series.cumsum takes a running total across the series, which creates separate group labels:
print(s)
0 1
1 1
2 1
3 1
4 2
5 2
For over two hours, I have not been able to solve this problem. I've found every single variation of a solution it seems, but none of them seem to work. It may be because I'm running on four hours of sleep per day, though. Anyway, what I'm trying to do is conditionally delete rows from a pandas dataframe. The dataframe is from a trending youtube videos CSV. One of the columns is "category_id."
I'm trying to remove all rows whose category is not 25 or 43. Every time I do this, the entire dataset is reduced down to 0 rows. I know what you're thinking: do rows with category 25 or 43 even exist? YES! They do!
A solution I really thought would work is as follows:
df.drop(df[df.category_id != 25].index, inplace=True)
df.drop(df[df.category_id != 43].index, inplace=True)
But then I inspect that dataframe and it is empty. How to fix this?
df = pd.DataFrame( {'category_id': [12, 14, 25, 7, 29, 43, 22, 95]} )
df
   category_id
0           12
1           14
2           25
3            7
4           29
5           43
6           22
7           95
df.drop(df[~df['category_id'].isin([25, 43])].index, inplace=True, axis=0)
df
   category_id
2           25
5           43
Currently, you're removing all the rows whose category_id isn't equal to 25, and then removing all the rows whose category_id isn't equal to 43, which leaves you with an empty dataframe.
I think it would be easier for you to find the indexes you would like to drop and then remove those from your dataframe.
indexes = df[(df['category_id'] != 25) & (df['category_id'] != 43)].index
df = df.drop(indexes)
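Alternatively (a small sketch, assuming the same category_id column), you can express the same filter as a boolean mask with isin and keep only the rows you want rather than dropping the ones you don't:
# Keep only rows whose category_id is 25 or 43
df = df[df['category_id'].isin([25, 43])]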
I'm trying to get a handle on slicing. I've got the following dataframe, df:
Feeder # 1 Feeder # 2
TimeStamp MW Month Day Hour TimeStamp MW Month Day Hour
0 2/3 1.2 1 30 22 2/3 2.4 1 30 22
1 2/4 2.3 1 31 23 2/3 4.1 1 31 23
2 2/5 3.4 2 1 0 2/3 3.7 2 1 0
There are 8 feeders in total.
If I want to select all the MW columns in all the Feeders, I can do:
df.xs('MW', level=1, axis=1,drop_level=False)
If I want Feeders 2 through 4, I can do:
df.loc[:,'Feeder #2':'Feeder #4']
BUT if I want columns MW through Day in just Feeders 2 through 4 via:
df.loc[:,pd.IndexSlice['Feeder #2':'Feeder #4','MW':'Day']]
I get the following error.
MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)
So if I sort the dataframe, then I'm able to do:
df.sortlevel(level=0,axis=1).loc[:,pd.IndexSlice['Feeder #2':'Feeder #4','Day':'MW']]
But sorting the dataframe destroys the original order of level 1 in the header-- everything gets alphabetized (lexsorted in Python-speak?). And my desired contents get jumbled: 'Day':'MW' yields the Day, Hour and MW columns. But what I want is 'MW':'Day' which would yield the MW, Month, and Day columns.
So my question is: is it possible to slice through my dataframe and preserve the order of the columns? Alternatively, can I lexsort the dataframe, perform the slices I need and then put the dataframe back in its original order?
Thanks in advance.
I think you can use CategoricalIndex to keep the order:
import pandas as pd
import numpy as np
level0 = "Feeder#1 Feeder#2 Feeder#3 Feeder#4".split()
level1 = "TimeStamp MW Month Day Hour".split()
idx0 = pd.CategoricalIndex(level0, level0, ordered=True)
idx1 = pd.CategoricalIndex(level1, level1, ordered=True)
columns = pd.MultiIndex.from_product([idx0, idx1])
df = pd.DataFrame(np.random.randint(0, 10, (10, 20)), columns=columns)
Then you can do this:
df.loc[:, pd.IndexSlice["Feeder#2":"Feeder#3", "MW":"Day"]]
Edit
To convert the levels of an existing frame to CategoricalIndex:
columns = df.columns
for i in range(columns.nlevels):
    level = pd.unique(columns.get_level_values(i))  # keeps the original left-to-right order
    cidx = pd.CategoricalIndex(level, level, ordered=True)
    print(cidx)
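To actually apply those ordered levels back to a frame you already have, one possible sketch (my own addition, assuming your columns are the two-level Feeder/measurement header from the question and that the label strings match yours exactly) is to rebuild each column label as an ordered Categorical and reassemble the MultiIndex:
cols = df.columns
arrays = []
for i in range(cols.nlevels):
    # categories in original left-to-right order, not lexsorted
    cats = pd.unique(cols.get_level_values(i))
    arrays.append(pd.Categorical(cols.get_level_values(i), categories=cats, ordered=True))
df.columns = pd.MultiIndex.from_arrays(arrays, names=cols.names)
# the slice from the question should now work without lexsorting
df.loc[:, pd.IndexSlice['Feeder # 2':'Feeder # 4', 'MW':'Day']]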
What is the easiest way to remove duplicate columns from a dataframe?
I am reading a text file that has duplicate columns via:
import pandas as pd
df=pd.read_table(fname)
The column names are:
Time, Time Relative, N2, Time, Time Relative, H2, etc...
All the Time and Time Relative columns contain the same data. I want:
Time, Time Relative, N2, H2
All my attempts at dropping, deleting, etc such as:
df=df.T.drop_duplicates().T
Result in uniquely valued index errors:
Reindexing only valid with uniquely valued index objects
Sorry for being a pandas noob. Any suggestions would be appreciated.
Additional Details
Pandas version: 0.9.0
Python Version: 2.7.3
Windows 7
(installed via Pythonxy 2.7.3.0)
data file (note: in the real file, columns are separated by tabs, here they are separated by 4 spaces):
Time Time Relative [s] N2[%] Time Time Relative [s] H2[ppm]
2/12/2013 9:20:55 AM 6.177 9.99268e+001 2/12/2013 9:20:55 AM 6.177 3.216293e-005
2/12/2013 9:21:06 AM 17.689 9.99296e+001 2/12/2013 9:21:06 AM 17.689 3.841667e-005
2/12/2013 9:21:18 AM 29.186 9.992954e+001 2/12/2013 9:21:18 AM 29.186 3.880365e-005
... etc ...
2/12/2013 2:12:44 PM    17515.269    9.991756e+001    2/12/2013 2:12:44 PM    17515.269    2.800279e-005
2/12/2013 2:12:55 PM 17526.769 9.991754e+001 2/12/2013 2:12:55 PM 17526.769 2.880386e-005
2/12/2013 2:13:07 PM 17538.273 9.991797e+001 2/12/2013 2:13:07 PM 17538.273 3.131447e-005
Here's a one line solution to remove columns based on duplicate column names:
df = df.loc[:,~df.columns.duplicated()].copy()
How it works:
Suppose the columns of the data frame are ['alpha','beta','alpha']
df.columns.duplicated() returns a boolean array: a True or False for each column. If it is False, then the column name is unique up to that point; if it is True, then the column name duplicates one that appeared earlier. For example, using the given example, the returned value would be [False, False, True].
Pandas allows one to index using boolean values, whereby it selects only the True values. Since we want to keep the unduplicated columns, we need the above boolean array to be flipped (i.e. [True, True, False] = ~[False, False, True]).
Finally, df.loc[:,[True,True,False]] selects only the non-duplicated columns using the aforementioned indexing capability.
The final .copy() is there to copy the dataframe to (mostly) avoid getting errors about trying to modify an existing dataframe later down the line.
Note: the above only checks columns names, not column values.
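A minimal illustration of those steps, using a throwaway frame with the ['alpha', 'beta', 'alpha'] header described above:
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['alpha', 'beta', 'alpha'])
print(df.columns.duplicated())              # [False False  True]
print(df.loc[:, ~df.columns.duplicated()])  # keeps the first 'alpha' and 'beta'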
To remove duplicated indexes
Since it is similar enough, do the same thing on the index:
df = df.loc[~df.index.duplicated(),:].copy()
To remove duplicates by checking values without transposing
Update and caveat: please be careful in applying this. Per the counter-example provided by DrWhat in the comments, this solution may not have the desired outcome in all cases.
df = df.loc[:,~df.apply(lambda x: x.duplicated(),axis=1).all()].copy()
This avoids the issue of transposing. Is it fast? No. Does it work? In some cases. Here, try it on this:
import numpy as np
import pandas as pd

# create a large(ish) dataframe
ldf = pd.DataFrame(np.random.randint(0, 100, size=(736334, 1312)))
#to see size in gigs
#ldf.memory_usage().sum()/1e9 #it's about 3 gigs
# duplicate a column
ldf.loc[:,'dup'] = ldf.loc[:,101]
# take out duplicated columns by values
ldf = ldf.loc[:,~ldf.apply(lambda x: x.duplicated(),axis=1).all()].copy()
It sounds like you already know the unique column names. If that's the case, then df = df[['Time', 'Time Relative', 'N2']] would work.
If not, your solution should work:
In [101]: vals = np.random.randint(0,20, (4,3))
vals
Out[101]:
array([[ 3, 13, 0],
[ 1, 15, 14],
[14, 19, 14],
[19, 5, 1]])
In [106]: df = pd.DataFrame(np.hstack([vals, vals]), columns=['Time', 'H1', 'N2', 'Time Relative', 'N2', 'Time'] )
df
Out[106]:
Time H1 N2 Time Relative N2 Time
0 3 13 0 3 13 0
1 1 15 14 1 15 14
2 14 19 14 14 19 14
3 19 5 1 19 5 1
In [107]: df.T.drop_duplicates().T
Out[107]:
Time H1 N2
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
You probably have something specific to your data that's messing it up. We could give more help if there's more details you could give us about the data.
Edit:
Like Andy said, the problem is probably with the duplicate column titles.
For a sample table file 'dummy.csv' I made up:
Time H1 N2 Time N2 Time Relative
3 13 13 3 13 0
1 15 15 1 15 14
14 19 19 14 19 14
19 5 5 19 5 1
using read_table gives unique columns and works properly:
In [151]: df2 = pd.read_table('dummy.csv')
df2
Out[151]:
Time H1 N2 Time.1 N2.1 Time Relative
0 3 13 13 3 13 0
1 1 15 15 1 15 14
2 14 19 19 14 19 14
3 19 5 5 19 5 1
In [152]: df2.T.drop_duplicates().T
Out[152]:
Time H1 Time Relative
0 3 13 0
1 1 15 14
2 14 19 14
3 19 5 1
If your version doesn't let you, you can hack together a solution to make them unique:
In [169]: df2 = pd.read_table('dummy.csv', header=None)
df2
Out[169]:
0 1 2 3 4 5
0 Time H1 N2 Time N2 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [171]: from collections import defaultdict
col_counts = defaultdict(int)
col_ix = df2.first_valid_index()
In [172]: cols = []
for col in df2.ix[col_ix]:
    cnt = col_counts[col]
    col_counts[col] += 1
    suf = '_' + str(cnt) if cnt else ''
    cols.append(col + suf)
cols
Out[172]:
['Time', 'H1', 'N2', 'Time_1', 'N2_1', 'Time Relative']
In [174]: df2.columns = cols
df2 = df2.drop([col_ix])
In [177]: df2
Out[177]:
Time H1 N2 Time_1 N2_1 Time Relative
1 3 13 13 3 13 0
2 1 15 15 1 15 14
3 14 19 19 14 19 14
4 19 5 5 19 5 1
In [178]: df2.T.drop_duplicates().T
Out[178]:
Time H1 Time Relative
1 3 13 0
2 1 15 14
3 14 19 14
4 19 5 1
Transposing is inefficient for large DataFrames. Here is an alternative:
def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        dcols = frame[v].to_dict(orient="list")
        vs = list(dcols.values())   # list() so the dict views can be indexed (Python 3)
        ks = list(dcols.keys())
        lvs = len(vs)
        for i in range(lvs):
            for j in range(i + 1, lvs):
                if vs[i] == vs[j]:
                    dups.append(ks[i])
                    break
    return dups
Use it like this:
dups = duplicate_columns(frame)
frame = frame.drop(dups, axis=1)
Edit
A memory-efficient version that treats NaNs like any other value:
from pandas.core.common import array_equivalent

def duplicate_columns(frame):
    groups = frame.columns.to_series().groupby(frame.dtypes).groups
    dups = []
    for t, v in groups.items():
        cs = frame[v].columns
        vs = frame[v]
        lcs = len(cs)
        for i in range(lcs):
            ia = vs.iloc[:, i].values
            for j in range(i + 1, lcs):
                ja = vs.iloc[:, j].values
                if array_equivalent(ia, ja):
                    dups.append(cs[i])
                    break
    return dups
If I'm not mistaken, the following does what was asked without the memory problems of the transpose solution and with fewer lines than #kalu's function, keeping the first of any similarly named columns.
Cols = list(df.columns)
for i, item in enumerate(df.columns):
    if item in df.columns[:i]:
        Cols[i] = "toDROP"
df.columns = Cols
df = df.drop("toDROP", axis=1)
It looks like you were on the right path. Here is the one-liner you were looking for:
df.reset_index().T.drop_duplicates().T
But since there is no example data frame that produces the referenced error message Reindexing only valid with uniquely valued index objects, it is tough to say exactly what would solve the problem. If restoring the original index is important to you, do this:
original_index = df.index.names
df.reset_index().T.drop_duplicates().T.set_index(original_index)
Note that Gene Burinsky's answer (at the time of writing the selected answer) keeps the first of each duplicated column. To keep the last:
df=df.loc[:, ~df.columns[::-1].duplicated()[::-1]]
An update on #kalu's answer, which uses the latest pandas:
def find_duplicated_columns(df):
    dupes = []
    columns = df.columns
    for i in range(len(columns)):
        col1 = df.iloc[:, i]
        for j in range(i + 1, len(columns)):
            col2 = df.iloc[:, j]
            # break early if dtypes aren't the same (helps deal with
            # categorical dtypes)
            if col1.dtype is not col2.dtype:
                break
            # otherwise compare values
            if col1.equals(col2):
                dupes.append(columns[i])
                break
    return dupes
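Usage is the same as with the original function (a short sketch):
dupes = find_duplicated_columns(df)
df = df.drop(columns=dupes)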
Although #Gene Burinsky's answer is great, it has a potential problem in that the reassigned df may be either a copy or a view of the original df.
This means that subsequent assignments like df['newcol'] = 1 generate a SettingWithCopy warning and may fail (https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing). The following solution prevents that issue:
duplicate_cols = df.columns[df.columns.duplicated()]
df.drop(columns=duplicate_cols, inplace=True)
I ran into this problem where the one-liner provided by the first answer worked well. However, I had the extra complication that the second copy of each duplicated column held all of the data, while the first copy did not.
The solution was to create two data frames by splitting the one data frame by toggling the negation operator on the duplicated-columns mask. Once I had the two data frames, I ran a join statement using the lsuffix. This way, I could then reference and delete the columns without the data.
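A rough sketch of that split-and-join idea (my own reconstruction, not the poster's exact code; it assumes the first copy of each duplicated name is the empty one):
mask = df.columns.duplicated()
empty_copy = df.loc[:, ~mask]   # first occurrence of each name (no data in this case)
data_copy = df.loc[:, mask]     # later occurrences that actually hold the data
joined = empty_copy.join(data_copy, lsuffix='_empty')
# drop the suffixed, data-less copies
joined = joined.drop(columns=[c for c in joined.columns if c.endswith('_empty')])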
March 2021 update
The subsequent post by #CircArgs may have provided a succinct one-liner to accomplish what I described here.
First step: read only the first row (i.e. the column names) and drop the duplicate names.
Second step: read the file again, keeping only those columns.
cols = pd.read_csv("file.csv", header=None, nrows=1).iloc[0].drop_duplicates()
df = pd.read_csv("file.csv", usecols=cols)
The snippet below identifies the duplicated column names, so you can review what went wrong when the dataframe was originally built.
dupes = pd.DataFrame(df.columns)
dupes[dupes.duplicated()]
Just in case somebody is still looking for a way to find duplicated values in the columns of a pandas DataFrame in Python, I came up with this solution:
def get_dup_columns(m):
    '''
    This will check every column in the data frame
    and verify whether you have duplicated columns.
    It can help whenever you are cleaning big data sets of 50+ columns
    and will clean things up a little bit for you.
    The result is a list of tuples showing which columns are duplicates,
    for example:
    (column A, column C)
    That means that column A is duplicated with column C.
    More info: https://wanatux.com
    '''
    headers_list = [x for x in m.columns]
    duplicate_col2 = []
    y = 0
    while y <= len(headers_list) - 1:
        # compare the current first column against every remaining column
        for x in range(1, len(headers_list)):
            if m[headers_list[y]].equals(m[headers_list[x]]):
                duplicate_col2.append((headers_list[y], headers_list[x]))
        headers_list.pop(0)
    return duplicate_col2
And you can call the function like this:
duplicate_col = get_dup_columns(pd_excel)
It will show a result like the following:
[('column a', 'column k'),
('column a', 'column r'),
('column h', 'column m'),
('column k', 'column r')]
I am not sure why Gene Burinsky's answer did not work for me; I was getting back the same original dataframe with duplicated columns. My workaround was to force the selection over the underlying ndarray and rebuild the dataframe.
df = pd.DataFrame(df.values[:,~df.columns.duplicated()], columns=df.columns[~df.columns.duplicated()])
A simple pairwise column comparison is an efficient way (in terms of memory and time) to check for duplicated columns by value. Here is an example:
import numpy as np
import pandas as pd
from itertools import combinations as combi
df = pd.DataFrame(np.random.uniform(0,1, (100,4)), columns=['a','b','c','d'])
df['a'] = df['d'].copy() # column 'a' is equal to column 'd'
# to keep the first
dupli_cols = [cc[1] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
# to keep the last
dupli_cols = [cc[0] for cc in combi(df.columns, r=2) if (df[cc[0]] == df[cc[1]]).all()]
df = df.drop(columns=dupli_cols)
In case you want to check for duplicate columns, this code can be useful
columns_to_drop = []
for cname in sorted(list(df)):
    for cname2 in sorted(list(df))[::-1]:
        if df[cname].equals(df[cname2]) and cname != cname2 and cname not in columns_to_drop:
            columns_to_drop.append(cname2)
            print(cname, cname2, 'Are equal')
df = df.drop(columns_to_drop, axis=1)
Fast and easy way to drop the duplicated columns by their values:
df = df.T.drop_duplicates().T
More info: Pandas DataFrame drop_duplicates manual.