Consider the following HDFStore and the DataFrames df and df2:
import pandas as pd
store = pd.HDFStore('test.h5')
midx = pd.MultiIndex.from_product([range(2), list('XYZ')], names=list('AB'))
df = pd.DataFrame(dict(C=range(6)), midx)
df
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
midx2 = pd.MultiIndex.from_product([range(2), list('VWX')], names=list('AB'))
df2 = pd.DataFrame(dict(C=range(6)), midx2)
df2
C
A B
0 V 0
W 1
X 2
1 V 3
W 4
X 5
I want to first write df to the store.
store.append('df', df)
store.get('df')
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
At a later point in time I will have another DataFrame with which I want to update the store: rows whose index values appear in the new DataFrame should be overwritten, while all other existing rows are kept.
When I do
store.append('df', df2)
store.get('df')
C
A B
0 X 0
Y 1
Z 2
1 X 3
Y 4
Z 5
0 V 0
W 1
X 2
1 V 3
W 4
X 5
This isn't at all what I want. Notice that (0, 'X') and (1, 'X') are repeated. I could manipulate the combined DataFrame in memory and overwrite the store, but I expect to be working with a lot of data, for which this wouldn't be feasible.
How do I update the store to get the following?
C
A B
0 V 0
W 1
X 2
Y 1
Z 2
1 V 3
W 4
X 5
Y 4
Z 5
You'll see that for each level of 'A', 'Y' and 'Z' are unchanged, 'V' and 'W' are new, and 'X' is updated.
What is the correct way to do this?
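For reference, the in-memory route dismissed above for large data could look like this sketch; combine_first keeps df2's values wherever the index values overlap:
# in-memory upsert (sketch): df2 wins on overlapping index values
combined = df2.combine_first(df)
store.put('df', combined, format='t')
This happens to produce exactly the sorted result shown above, but it requires holding both frames in memory.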
Idea: first remove the rows with matching index values from the HDFStore, then append df2.
Problem: I couldn't find a way to use where="index in df2.index" with a MultiIndex.
Solution: first flatten the MultiIndexes into plain string indexes:
df.index = df.index.get_level_values(0).astype(str) + '_' + df.index.get_level_values(1).astype(str)
df2.index = df2.index.get_level_values(0).astype(str) + '_' + df2.index.get_level_values(1).astype(str)
this yields:
In [348]: df
Out[348]:
C
0_X 0
0_Y 1
0_Z 2
1_X 3
1_Y 4
1_Z 5
In [349]: df2
Out[349]:
C
0_V 0
0_W 1
0_X 2
1_V 3
1_W 4
1_X 5
Make sure to use format='t' and data_columns=True when you create/append to the HDF5 file (this stores the data as a queryable table and indexes the index and all columns, allowing us to use them in the where clause):
store = pd.HDFStore('d:/temp/test1.h5')
store.append('df', df, format='t', data_columns=True)
store.close()
Now we can remove the rows with matching indexes from the HDFStore:
store = pd.HDFStore('d:/temp/test1.h5')
In [345]: store.remove('df', where="index in df2.index")
Out[345]: 2
and append df2:
In [346]: store.append('df', df2, format='t', data_columns=True, append=True)
Result:
In [347]: store.get('df')
Out[347]:
C
0_Y 1
0_Z 2
1_Y 4
1_Z 5
0_V 0
0_W 1
0_X 2
1_V 3
1_W 4
1_X 5
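Note that the rows now come back in remove-then-append order rather than sorted; sort on retrieval if you need an ordered index. Wrapped up, a minimal upsert helper along these lines (a sketch, assuming the flattened string index used above) could be:
def hdf_upsert(store, key, frame):
    # drop stored rows whose index values also appear in the new frame, then append
    if key in store:
        store.remove(key, where="index in frame.index")
    store.append(key, frame, format='t', data_columns=True)

hdf_upsert(store, 'df', df2)  # equivalent to the remove + append above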
I have two dataframes, example:
Df1 -
A B C D
x j 5 2
y k 7 3
z l 9 4
Df2 -
A B C D
z o 1 1
x p 2 1
y q 3 1
I want to subtract columns C and D of Df2 from columns C and D of Df1, matching rows on the key contained in column A.
I also want to ensure that column B remains untouched. Example:
Df3 -
A B C D
x j 3 1
y k 4 2
z l 8 3
I found an almost perfect answer in the following thread:
Subtracting columns based on key column in pandas dataframe
However, what the answer does not explain is how to handle other columns in the primary df (such as column B) that should not take part in the indexing or in the operation.
Is somebody please able to advise?
I was originally performing a loop that finds each value in the other df and subtracts it, but this takes too long to run with the size of data I am working with.
The idea is to specify the column(s) to match on and the column(s) to subtract, move all remaining column names into a MultiIndex so they stay untouched, then subtract with alignment on the matching level:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match + Df1.columns.difference(match + cols).tolist())
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index()
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
Or set the index only on the matching column(s) and fill the columns that were not subtracted back in from the original Df1:
match = ['A']
cols = ['C','D']
df1 = Df1.set_index(match)
df = df1.sub(Df2.set_index(match)[cols], level=0).reset_index().fillna(Df1)
print (df)
A B C D
0 x j 3 1
1 y k 4 2
2 z l 8 3
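An alternative sketch using merge (the '_sub' suffix is my own naming, not from the question): merge on the key with suffixes, subtract the suffixed columns positionally, then drop them:
merged = Df1.merge(Df2[['A', 'C', 'D']], on='A', suffixes=('', '_sub'))
merged[['C', 'D']] = merged[['C', 'D']] - merged[['C_sub', 'D_sub']].to_numpy()
Df3 = merged.drop(columns=['C_sub', 'D_sub'])
print (Df3)
   A  B  C  D
0  x  j  3  1
1  y  k  4  2
2  z  l  8  3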
I have a distance matrix with IDs as column and row names:
A B C D
A 0 1 2 3
B 1 0 4 5
C 2 4 0 6
D 3 5 6 0
How do I efficiently extract values from a large matrix, e.g. for IDs A and C, to get this matrix:
A C
A 0 2
C 2 0
Edit: IDs missing from the matrix should be ignored.
Use DataFrame.loc to get values by label:
vals = ['A','C']
df = df.loc[vals, vals]
print (df)
A C
A 0 2
C 2 0
EDIT: If some values don't match and you need to omit them, add Index.intersection:
vals = ['J','A','C']
new = df.columns.intersection(vals, sort=False)
df = df.loc[new, new]
print (df)
A C
A 0 2
C 2 0
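Put together as a runnable sketch (the matrix is reconstructed from the question; with sort=False the intersection keeps the matrix's own column order):
import pandas as pd

ids = list('ABCD')
df = pd.DataFrame([[0, 1, 2, 3],
                   [1, 0, 4, 5],
                   [2, 4, 0, 6],
                   [3, 5, 6, 0]], index=ids, columns=ids)

vals = ['J', 'A', 'C']                           # 'J' is not in the matrix
new = df.columns.intersection(vals, sort=False)  # -> Index(['A', 'C'])
print (df.loc[new, new])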
I am trying to copy Name from df2 into df1 where ID is common between both dataframes.
df1:
ID Name
1 A
2 B
4 C
16 D
7 E
df2:
ID Name
1 X
2 Y
7 Z
Expected Output:
ID Name
1 X
2 Y
4 C
16 D
7 Z
I have tried the following, but it didn't work. I am not able to understand how to assign the value here: I am assigning df2['Name'], which is wrong.
for i in df2["ID"].tolist():
    df1['Name'].loc[df1['ID'] == i] = df2['Name']
Try with update, which aligns on the index ('ID' here) and overwrites only the matching entries:
df1 = df1.set_index('ID')
df1.update(df2.set_index('ID'))
df1 = df1.reset_index()
df1
Out[476]:
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
If the order of rows does not matter, then concatenating the two dfs (df2 first, so its values win) and dropping the duplicated IDs will achieve the result:
pd.concat([df2, df1]).drop_duplicates(subset='ID')
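If the original row order of df1 does matter, one way to restore it (a sketch, assuming ID is unique in df1) is to reindex the result by df1's IDs:
result = pd.concat([df2, df1]).drop_duplicates(subset='ID')
result = result.set_index('ID').loc[df1['ID']].reset_index()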
Another solution would be:
s = df1["Name"]
df1.loc[:, "Name"] = df1["ID"].map(df2.set_index("ID")["Name"].to_dict()).fillna(s)
Output:
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
One more for consideration
df,dg = df1,df2
df = df.set_index('ID')
dg = dg.set_index('ID')
df.loc[dg.index,:] = dg # All columns
#df.loc[dg.index,'Name'] = dg.Name # Single column
df = df.reset_index()
>>> df
ID Name
0 1 X
1 2 Y
2 4 C
3 16 D
4 7 Z
Or, for a single column, use the commented line above (this assumes 'ID' is the index for both, and that every ID in dg also appears in df).
I have a dataframe in pandas like this:
id some_type some_date some_data
0 1 A 19/12/1995 X
1 2 A 10/04/1997 Y
2 2 B 05/03/2013 Z
3 2 B 09/05/2017 W
4 2 B 09/05/2017 R
5 3 A 01/07/1998 M
6 3 B 09/08/2009 N
For each value of id, I need the rows that have the max value of some_type and some_date, without dropping any value of some_data.
In other words, what I need is the following:
id some_type some_date some_data
0 1 A 19/12/1995 X
3 2 B 09/05/2017 W
4 2 B 09/05/2017 R
6 3 B 09/08/2009 N
You can do it with sort_values, groupby and apply, keeping the rows whose some_type and some_date equal the last (i.e. maximum) values in each group. Convert some_date to datetime first (as in the full example further below), otherwise the dd/mm/yyyy strings sort lexicographically:
df_output = (df.sort_values(by=['some_type','some_date'])
               .groupby('id')
               .apply(lambda df_g: df_g[(df_g['some_type'] == df_g['some_type'].iloc[-1]) &
                                        (df_g['some_date'] == df_g['some_date'].iloc[-1])])
               .reset_index(0, drop=True))
and the output is:
id some_type some_date some_data
0 1 A 1995-12-19 X
3 2 B 2017-09-05 W
4 2 B 2017-09-05 R
6 3 B 2009-09-08 N
EDIT: if you don't care about the indexes, you can also use merge:
# first get the last row per id after sorting
df_last = df.sort_values(['some_type','some_date']).groupby('id')[['some_type','some_date']].last().reset_index()
# now merge on id, some_type and some_date with how='inner' to keep only those rows
df_output = df.merge(df_last, how='inner')
You will get the same result, apart from the row indexes.
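If you do need the original row indexes, a sketch that carries them through the merge:
df_output = (df.reset_index()
               .merge(df_last, how='inner')
               .set_index('index')
               .rename_axis(None))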
Create a mask with groupby and transform, comparing each row against the group max(). But first convert to datetime:
df['some_date'] = pd.to_datetime(df['some_date'])
m = df.groupby('id')[['some_type','some_date']].transform(lambda x: x == x.max()).all(axis=1)
df = df[m]
Full example:
import pandas as pd
from io import StringIO
text = '''\
id some_type some_date some_data
1 A 19/12/1995 X
2 A 10/04/1997 Y
2 B 05/03/2013 Z
2 B 09/05/2017 W
2 B 09/05/2017 R
3 A 01/07/1998 M
3 B 09/08/2009 N'''
fileobj = StringIO(text)
df = pd.read_csv(fileobj, sep=r'\s+')
df['some_date'] = pd.to_datetime(df['some_date'])
m = df.groupby('id')[['some_type','some_date']].transform(lambda x: x == x.max()).all(axis=1)
df = df[m]
print(df)
Returns:
id some_type some_date some_data
0 1 A 1995-12-19 X
3 2 B 2017-09-05 W
4 2 B 2017-09-05 R
6 3 B 2009-09-08 N
I have 2 csv files. Each contains a data set with multiple columns and an ASSET_ID column. I used pandas to read each csv file in as df1 and df2. My problem has been trying to define a function that iterates over the ASSET_ID values in df1 and compares each one against all the ASSET_ID values in df2. From there I want to return all the rows of df1 whose ASSET_ID matched df2. Any help would be appreciated; I've been working on this for hours with little to show for it. The dtypes are float or int.
My configuration: Windows XP, Python 2.7, Anaconda distribution.
Creating a boolean mask of the values will index the rows where the two dfs match; there is no need to iterate, and it is much faster.
Example:
# define a list of values
a = list('abcdef')
b = range(6)
df = pd.DataFrame({'X':pd.Series(a),'Y': pd.Series(b)})
# c has x values for 'a' and 'd' so these should not match
c = list('xbcxef')
df1 = pd.DataFrame({'X':pd.Series(c),'Y': pd.Series(b)})
print(df)
print(df1)
X Y
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
[6 rows x 2 columns]
X Y
0 x 0
1 b 1
2 c 2
3 x 3
4 e 4
5 f 5
[6 rows x 2 columns]
In [4]:
# now index your df using boolean condition on the values
df[df.X == df1.X]
Out[4]:
X Y
1 b 1
2 c 2
4 e 4
5 f 5
[4 rows x 2 columns]
EDIT:
So if you have Series of different lengths then that won't work (== requires equal-length, aligned indexes), in which case you can use isin:
So create 2 dataframes of different lengths:
a = list('abcdef')
b = range(6)
d = range(10)
df = pd.DataFrame({'X':pd.Series(a),'Y': pd.Series(b)})
c = list('xbcxefxghi')
df1 = pd.DataFrame({'X':pd.Series(c),'Y': pd.Series(d)})
print(df)
print(df1)
X Y
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
[6 rows x 2 columns]
X Y
0 x 0
1 b 1
2 c 2
3 x 3
4 e 4
5 f 5
6 x 6
7 g 7
8 h 8
9 i 9
[10 rows x 2 columns]
Now use isin to select rows from df1 where the id's exist in df:
In [7]:
df1[df1.X.isin(df.X)]
Out[7]:
X Y
1 b 1
2 c 2
4 e 4
5 f 5
[4 rows x 2 columns]
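Applied to the question's data, the same idea is a one-liner (the csv file names here are assumptions, not from the question):
import pandas as pd

df1 = pd.read_csv('assets_1.csv')
df2 = pd.read_csv('assets_2.csv')

# all rows of df1 whose ASSET_ID appears anywhere in df2
matches = df1[df1['ASSET_ID'].isin(df2['ASSET_ID'])]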