I have no idea what's happening; the title is just a first-order approximation. I'm trying to join two data frames:
>>> df_sum.head()
TUCASEID t070101 t070102 t070103 t070104 t070105 t070199 \
0 20030100013280 0 0 0 0 0 0
1 20030100013344 0 0 0 0 0 0
2 20030100013352 60 0 0 0 0 0
3 20030100013848 0 0 0 0 0 0
4 20030100014165 0 0 0 0 0 0
t070201 t070299 shopping year
0 0 0 0 2003
1 0 0 0 2003
2 0 0 60 2003
3 0 0 0 2003
4 0 0 0 2003
>>> emp.head()
TUCASEID status
0 20030100013280 emp
1 20030100013344 emp
2 20030100013352 emp
4 20030100014165 emp
5 20030100014169 emp
Those are the data frames. I want to join them on the common column TUCASEID, and there are plenty of intersecting values:
>>> np.intersect1d(emp.TUCASEID, df_sum.TUCASEID)
array([20030100013280, 20030100013344, 20030100013352, ..., 20131212132462,
20131212132469, 20131212132475])
Now...
>>> df_sum.join(emp, on='TUCASEID', how='inner')
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3829, in join
rsuffix=rsuffix, sort=sort)
File "/usr/local/lib/python2.7/site-packages/pandas/core/frame.py", line 3843, in _join_compat
suffixes=(lsuffix, rsuffix), sort=sort)
File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 39, in merge
return op.get_result()
File "/usr/local/lib/python2.7/site-packages/pandas/tools/merge.py", line 193, in get_result
rdata.items, rsuf)
File "/usr/local/lib/python2.7/site-packages/pandas/core/internals.py", line 3873, in items_overlap_with_suffix
to_rename)
ValueError: columns overlap but no suffix specified: Index([u'TUCASEID'], dtype='object')
Well, that's odd: the only column that appears in both data frames is the one I'm joining on, but fine, let's play along[1]:
>>> df_sum.join(emp, on='TUCASEID', how='inner', rsuffix='r')
Empty DataFrame
Columns: [TUCASEID, t070101, t070102, t070103, t070104, t070105, t070199, t070201, t070299, shopping, year, TUCASEIDr, status]
Index: []
The result is empty despite the huge intersection. What's going on here?
>>> pd.__version__
'0.15.0'
[1]: I actually forced the join column to an integer dtype because the error mentioned "object"; it made no difference:
>>> emp.dtypes
TUCASEID int64
status object
dtype: object
>>> df_sum.dtypes
TUCASEID int64
(...)
shopping int64
year int64
dtype: object
df.join generally calls pd.merge under the hood (except in one special case, where it calls concat). Therefore, anything join can do, merge can do as well. Although this habit is perhaps not strictly correct, I tend to use df.join only when joining on the index and pd.merge when joining on columns.
Thus, I can reproduce the problem you describe:
import numpy as np
import pandas as pd
df_sum = pd.DataFrame(np.arange(6*2).reshape((6,2)),
                      index=list('ABCDEF'), columns=list('XY'))
emp = pd.DataFrame(np.arange(6*2).reshape((6,2)),
                   index=list('ABCDEF'), columns=list('XZ'))
print(df_sum.join(emp, on='X', rsuffix='_r', how='inner'))
# Empty DataFrame
# Columns: [X, Y, X_r, Z]
# Index: []
but pd.merge works as expected -- and without having to supply rsuffix:
print(pd.merge(df_sum, emp, on='X'))
yields
X Y Z
0 0 1 1
1 2 3 3
2 4 5 5
3 6 7 7
4 8 9 9
5 10 11 11
Under the hood, df_sum.join calls merge this way:
if isinstance(other, DataFrame):
    return merge(self, other, left_on=on, how=how,
                 left_index=on is None, right_index=True,
                 suffixes=(lsuffix, rsuffix), sort=sort)
So, even though you use df_sum.join(emp, on='...'), under the hood, Pandas converts this to pd.merge(df_sum, emp, left_on='...').
Furthermore, the merge is empty when called this way:
In [228]: pd.merge(df_sum, emp, left_on='X', left_index=False, right_index=True)
Out[228]:
Empty DataFrame
Columns: [X, X_x, Y, X_y, Z]
Index: []
because the left_on='X' needs to be on='X' for the merge to succeed as desired:
In [233]: pd.merge(df_sum, emp, on='X', left_index=False, right_index=True)
Out[233]:
X Y Z
A 0 1 1
B 2 3 3
C 4 5 5
D 6 7 7
E 8 9 9
F 10 11 11
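To tie this back to the question: a minimal sketch of the fix, assuming TUCASEID is an ordinary int64 column in both frames (as the dtypes output shows), would be either of the following, both of which should return the intersecting rows rather than an empty frame:
import pandas as pd

# Merge on the common column directly; no suffixes are needed because
# TUCASEID is the only overlapping column name.
merged = pd.merge(df_sum, emp, on='TUCASEID', how='inner')

# Or keep df_sum.join, but make TUCASEID the index of emp first, since
# join matches the left frame's 'on' column against the right frame's index.
merged_via_join = df_sum.join(emp.set_index('TUCASEID'), on='TUCASEID', how='inner')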
Related
I'd like to concatenate two dataframes A, B to a new one without duplicate rows (if rows in B already exist in A, don't add):
Dataframe A:
I II
0 1 2
1 3 1
Dataframe B:
I II
0 5 6
1 3 1
New Dataframe:
I II
0 1 2
1 3 1
2 5 6
How can I do this?
The simplest way is to just do the concatenation, and then drop duplicates.
>>> df1
A B
0 1 2
1 3 1
>>> df2
A B
0 5 6
1 3 1
>>> pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
A B
0 1 2
1 3 1
2 5 6
The reset_index(drop=True) is to fix up the index after the concat() and drop_duplicates(). Without it you will have an index of [0,1,0] instead of [0,1,2]. This could cause problems for further operations on this dataframe down the road if it isn't reset right away.
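On pandas 1.0 and later, drop_duplicates also accepts ignore_index=True, which relabels the result 0..n-1 in one step; a minimal sketch with two small made-up frames:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 3], 'B': [2, 1]})
df2 = pd.DataFrame({'A': [5, 3], 'B': [6, 1]})

# Equivalent to .drop_duplicates().reset_index(drop=True)
result = pd.concat([df1, df2]).drop_duplicates(ignore_index=True)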
If DataFrame A already contains duplicate rows, then concatenating and dropping duplicates will remove rows from DataFrame A that you might want to keep.
In that case you will need to create a new column with a cumulative count and then drop duplicates. It all depends on your use case, but this is common with time-series data.
Here is an example:
df_1 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 34},
])
df_2 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 14},
])

df_1['count'] = df_1.groupby(['date', 'id', 'value']).cumcount()
df_2['count'] = df_2.groupby(['date', 'id', 'value']).cumcount()

df_tot = pd.concat([df_1, df_2], ignore_index=False)
df_tot = df_tot.drop_duplicates()
df_tot = df_tot.drop(['count'], axis=1)
>>> df_tot
date id value
0 11/20/2015 4 24
1 11/20/2015 4 24
2 11/20/2015 6 34
1 11/20/2015 6 14
I'm surprised that pandas doesn't offer a native solution for this task.
I don't think that just dropping the duplicates (as Rian G suggested) is efficient if you work with large datasets.
It is probably most efficient to use sets to find the non-overlapping indices, and then use a list comprehension to translate them into 'row locations' (booleans), which is what you need to access rows with iloc[,]. Below is a function that performs the task. If you don't choose a specific column (col) to check for duplicates, then the indexes are used, as you requested. If you do choose a specific column, be aware that existing duplicate entries in 'a' will remain in the result.
import pandas as pd

def append_non_duplicates(a, b, col=None):
    if ((a is not None and type(a) is not pd.core.frame.DataFrame) or
            (b is not None and type(b) is not pd.core.frame.DataFrame)):
        raise ValueError('a and b must be of type pandas.core.frame.DataFrame.')
    if a is None:
        return b
    if b is None:
        return a
    if col is not None:
        # Compare on the values of the chosen column (by position).
        aind = a.iloc[:, col].values
        bind = b.iloc[:, col].values
    else:
        # Default: compare on the index, as requested.
        aind = a.index.values
        bind = b.index.values
    # Keys present in b but not in a, turned into a boolean mask over b's rows.
    take_rows = list(set(bind) - set(aind))
    take_rows = [i in take_rows for i in bind]
    return a.append(b.iloc[take_rows, :])
# Usage
a = pd.DataFrame([[1,2,3],[1,5,6],[1,12,13]], index=[1000,2000,5000])
b = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=[1000,2000,3000])
append_non_duplicates(a,b)
# 0 1 2
# 1000 1 2 3 <- from a
# 2000 1 5 6 <- from a
# 5000 1 12 13 <- from a
# 3000 7 8 9 <- from b
append_non_duplicates(a,b,0)
# 0 1 2
# 1000 1 2 3 <- from a
# 2000 1 5 6 <- from a
# 5000 1 12 13 <- from a
# 2000 4 5 6 <- from b
# 3000 7 8 9 <- from b
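If you are on a newer pandas release where DataFrame.append has been removed (pandas 2.0+), the same set-based filter can be expressed with pd.concat. A minimal sketch, keeping the positional col argument of the function above:
import pandas as pd

def append_non_duplicates_concat(a, b, col=None):
    # Same idea as append_non_duplicates, but built on pd.concat instead of
    # DataFrame.append. col is a positional column index, as before; None
    # means "compare on the index".
    if a is None:
        return b
    if b is None:
        return a
    akeys = pd.Index(a.iloc[:, col]) if col is not None else a.index
    bkeys = pd.Index(b.iloc[:, col]) if col is not None else b.index
    keep = ~bkeys.isin(akeys)          # boolean mask over b's rows
    return pd.concat([a, b.loc[keep]])
Called as append_non_duplicates_concat(a, b) or append_non_duplicates_concat(a, b, 0), it reproduces the two example outputs above.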
Another option:
concatenation = pd.concat([
    dfA,
    dfB[dfB['I'].isin(dfA['I']) == False],  # <-- rows of dfB whose value in column 'I' does not appear in dfA
])
The object concatenation will be:
   I  II
0  1   2
1  3   1
0  5   6
Add a reset_index(drop=True) if you want the clean 0, 1, 2 index from the question, as shown in the sketch below.
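The same filter is usually written with ~ instead of == False; a minimal sketch with the dfA/dfB frames above, including the index reset:
import pandas as pd

dfA = pd.DataFrame({'I': [1, 3], 'II': [2, 1]})
dfB = pd.DataFrame({'I': [5, 3], 'II': [6, 1]})

concatenation = pd.concat([
    dfA,
    dfB[~dfB['I'].isin(dfA['I'])],   # rows of dfB whose 'I' is not already in dfA
]).reset_index(drop=True)

print(concatenation)
#    I  II
# 0  1   2
# 1  3   1
# 2  5   6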
I've been successful creating functions in python and reading/writing files. However, I really need to apply certain functions to whole rows of data (not columns) and can't find out anything about how to do this. The goals are:
Read a csv or txt file into python (can-do)
Find a row of data and apply certain conditions and operations
Do the same with a second row of data
Then compare results from the rows to each other (done with a similarity function)
Print the resulting data into a separate file (easy peasy)
The real function's parameters include "if/then" conditions for ratios, sums, and square roots -- I won't include the whole function here. For the example, just use a sum.
Here's what I have so far (not much...):
import numpy as np
data = np.genfromtxt('file_to_read.csv',
                     dtype=float,
                     delimiter=",",
                     names=True)
np.sum()
print(data)
np.savetxt('test.csv', data, delimiter=',')
file_to_read.csv is this:
0,2,1
0,2,2
0,2,3
0,1,0
0,2,0
0,3,0
1,0,0
2,0,0
3,0,0
You can transpose your matrix or data frame (if using pandas) and work with columns.
Example (pandas):
Original DF
In [162]: df
Out[162]:
a b c
0 0 2 1
1 0 2 2
2 0 2 3
3 0 1 0
4 0 2 0
5 0 3 0
6 1 0 0
7 2 0 0
8 3 0 0
Transposed DF
In [163]: df.T
Out[163]:
0 1 2 3 4 5 6 7 8
a 0 0 0 0 0 0 1 2 3
b 2 2 2 1 2 3 0 0 0
c 1 2 3 0 0 0 0 0 0
Select rows where b>0 and c>1:
In [166]: df[(df.b>0) & (df.c>1)]
Out[166]:
a b c
1 0 2 2
2 0 2 3
Now calculate the sum of the cells for each matching row:
In [167]: df[(df.b>0) & (df.c>1)].sum(axis=1)
Out[167]:
1 4
2 5
dtype: int64
or product:
In [169]: df[(df.b>0) & (df.c>1)].product(axis=1)
Out[169]:
1 0
2 0
dtype: int64
PS: passing axis=1 instructs pandas/NumPy to aggregate along each row (across the columns) instead of along each column.
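Tying it back to the file in the question, here is a minimal end-to-end sketch (assuming the CSV really has no header row, so the names=True argument is dropped, and using made-up column names a, b, c):
import numpy as np
import pandas as pd

# Read the CSV as plain numbers and wrap it in a DataFrame.
data = np.genfromtxt('file_to_read.csv', dtype=float, delimiter=",")
df = pd.DataFrame(data, columns=['a', 'b', 'c'])

# Apply row-wise conditions, compute a row-wise result for the matching
# rows, and write it out.
matches = df[(df.b > 0) & (df.c > 1)]
row_sums = matches.sum(axis=1)
row_sums.to_csv('test.csv', header=False)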
I have a dataframe with about 100 columns that looks like this:
Id Economics-1 English-107 English-2 History-3 Economics-zz Economics-2 \
0 56 1 1 0 1 0 0
1 11 0 0 0 0 1 0
2 6 0 0 1 0 0 1
3 43 0 0 0 1 0 1
4 14 0 1 0 0 1 0
Histo Economics-51 Literature-re Literatureu4
0 1 0 1 0
1 0 0 0 1
2 0 0 0 0
3 0 1 1 0
4 1 0 0 0
My goal is to leave only global categories -- English, History, Literature -- and write the sum of the value of their components, respectively, in this dataframe. For instance, "English" would be the sum of "English-107" and "English-2":
Id Economics English History Literature
0 56 1 1 2 1
1 11 1 0 0 1
2 6 0 1 1 0
3 43 2 0 1 1
4 14 0 1 1 0
For this purpose, I have tried two methods. First method:
df = pd.read_csv(file_path, sep='\t')
df['History'] = df.loc[df[df.columns[pd.Series(df.columns).str.startswith('History')]].sum(axes=1)]
Second method:
df = pd.read_csv(file_path, sep='\t')
filter_col = [col for col in list(df) if col.startswith('History')]
df['History'] = 0 # initialize value, otherwise throws KeyError
for c in df[filter_col]:
    df['History'] = df[filter_col].sum(axes=1)
    print df['History', df[filter_col]]
However, both give the error:
TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed
My question is: how can I debug this error, or is there another solution for my problem? Note that I have a rather large dataframe with about 100 columns and 400,000 rows, so I'm looking for an optimized solution, e.g. one using loc in pandas.
I'd suggest that you do something different, which is to perform a transpose, groupby the prefix of the rows (your original columns), sum, and transpose again.
Consider the following:
df = pd.DataFrame({
    'a_a': [1, 2, 3, 4],
    'a_b': [2, 3, 4, 5],
    'b_a': [1, 2, 3, 4],
    'b_b': [2, 3, 4, 5],
})
Now
[s.split('_')[0] for s in df.T.index.values]
is the prefix of the columns. So
>>> df.T.groupby([s.split('_')[0] for s in df.T.index.values]).sum().T
a b
0 3 3
1 5 5
2 7 7
3 9 9
does what you want.
In your case, make sure to split using the '-' character.
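As a sketch with '-'-separated names like those in the question (a small made-up frame; a column such as Id, which contains no '-', simply keeps its own name as the group key):
import pandas as pd

df = pd.DataFrame({
    'Id': [56, 11],
    'English-107': [1, 0],
    'English-2': [0, 0],
    'History-3': [1, 0],
})

# Group the transposed rows (the original columns) by the part before '-',
# sum within each group, and transpose back.
prefixes = [c.split('-')[0] for c in df.T.index.values]
result = df.T.groupby(prefixes).sum().T
print(result)
#    English  History  Id
# 0        1        1  56
# 1        0        0  11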
You can use the following to create the sum of all columns starting with a specific name:
df['Economics'] = df[list(df.filter(regex='Economics'))].sum(axis=1)
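Applied to all the prefixes in the question, that pattern would look roughly like this (a sketch: the '^' anchor assumes each category name appears at the start of the column name, and 'Histo' is used as the prefix so the bare Histo column is picked up too; rename it to 'History' afterwards if desired):
import pandas as pd

categories = ['Economics', 'English', 'Histo', 'Literature']
sums = {cat: df.filter(regex='^' + cat).sum(axis=1) for cat in categories}
result = df[['Id']].assign(**sums)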
Using DSM's brilliant idea:
from __future__ import print_function
import pandas as pd
categories = set(['Economics', 'English', 'Histo', 'Literature'])
def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df)
print(df.groupby(correct_categories(df.columns),axis=1).sum())
Output:
Economics English Histo Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
Here is another version, which takes care of the 'Histo/History' naming problem.
from __future__ import print_function
import pandas as pd
#categories = set(['Economics', 'English', 'Histo', 'Literature'])
#
# mapping: common starting pattern: desired name
#
categories = {
    'Histo': 'History',
    'Economics': 'Economics',
    'English': 'English',
    'Literature': 'Literature'
}

def correct_categories(cols):
    return [categories[cat] for col in cols for cat in categories.keys() if col.startswith(cat)]
df = pd.read_csv('data.csv', sep=r'\s+', index_col='Id')
#print(df.columns, len(df.columns))
#print(correct_categories(df.columns), len(correct_categories(df.columns)))
#print(df.groupby(pd.Index(correct_categories(df.columns)),axis=1).sum())
rslt = df.groupby(correct_categories(df.columns),axis=1).sum()
print(rslt)
print('History\n', rslt['History'])
Output:
Economics English History Literature
Id
56 1 1 2 1
11 1 0 0 1
6 1 1 0 0
43 2 0 1 1
14 1 1 1 0
History
Id
56 2
11 0
6 0
43 1
14 1
Name: History, dtype: int64
PS: you may want to add any missing categories to the categories map/dictionary.
In short ... I have a Python Pandas data frame that is read in from an Excel file using 'read_table'. I would like to keep a handful of the series from the data, and purge the rest. I know that I can just delete what I don't want one-by-one using 'del data['SeriesName']', but what I'd rather do is specify what to keep instead of specifying what to delete.
If the simplest answer is to copy the existing data frame into a new data frame that only contains the series I want, and then delete the existing frame in its entirety, I would be satisfied with that solution ... but if that is indeed the best way, can someone walk me through it?
TIA ... I'm a newb to Pandas. :)
You can use the DataFrame drop function to remove columns. You have to pass the axis=1 option for it to work on columns and not rows. Note that it returns a copy so you have to assign the result to a new DataFrame:
In [1]: from pandas import *
In [2]: df = DataFrame(dict(x=[0,0,1,0,1], y=[1,0,1,1,0], z=[0,0,1,0,1]))
In [3]: df
Out[3]:
x y z
0 0 1 0
1 0 0 0
2 1 1 1
3 0 1 0
4 1 0 1
In [4]: df = df.drop(['x','y'], axis=1)
In [5]: df
Out[5]:
z
0 0
1 0
2 1
3 0
4 1
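On newer pandas versions (0.21 and later), the same drop can be spelled with the columns keyword instead of axis=1; a minimal sketch:
# Equivalent to df.drop(['x', 'y'], axis=1); it still returns a copy,
# so assign the result (or pass inplace=True).
df = df.drop(columns=['x', 'y'])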
Basically the same as Zelazny7's answer -- just specifying what to keep:
In [68]: df
Out[68]:
x y z
0 0 1 0
1 0 0 0
2 1 1 1
3 0 1 0
4 1 0 1
In [70]: df = df[['x','z']]
In [71]: df
Out[71]:
x z
0 0 0
1 0 0
2 1 1
3 0 0
4 1 1
*Edit*
You can specify a large number of columns through indexing/slicing into the DataFrame.columns object.
This object, of type pandas.Index, behaves like an immutable array of column labels (with some extended functionality).
See this extension of above examples:
In [4]: df.columns
Out[4]: Index([x, y, z], dtype=object)
In [5]: df[df.columns[1:]]
Out[5]:
y z
0 1 0
1 0 0
2 1 1
3 1 0
4 0 1
In [7]: df.drop(df.columns[1:], axis=1)
Out[7]:
x
0 0
1 0
2 1
3 0
4 1
You can also specify a list of columns to keep with the usecols option in pandas.read_table. This speeds up the loading process as well.
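For example (a sketch, with a hypothetical file name and column names):
import pandas as pd

# Only the listed columns are parsed; everything else is skipped at read time.
df = pd.read_table('data.txt', usecols=['x', 'z'])

# The same option exists for read_csv (and, with column names or letters, read_excel).
df = pd.read_csv('data.csv', usecols=['x', 'z'])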