I work with Python and am trying to merge two tables, df_agg and df_total. I used the argument how='left', expecting that all rows from the first table would be kept. It is important to note that the first table contains duplicates in the join column id, while the second table has no duplicates in id.
df_new = pd.merge(df_agg,df_total, on='id', how='left')
The merge command executes successfully, but the results are surprising: instead of df_new['total'] having the same sum as df_agg['total'], the sum of df_new['total'] is greater.
Can anybody help me understand what causes this problem, and suggest arguments to the function so that the sum is the same before and after merging?
It means id has duplicates in both DataFrames, so the new DataFrame has more rows than df_agg (a 'product' of the duplicated rows is created, one row for every combination).
df_agg = pd.DataFrame({'id': [1,1,2,3,3], 'a': range(5)})
df_total = pd.DataFrame({'id': [1,1,1,3,4], 'b': range(10,15)})
df_new = pd.merge(df_agg,df_total, on='id', how='left')
print (df_new)
   id  a     b
0   1  0  10.0
1   1  0  11.0
2   1  0  12.0
3   1  1  10.0
4   1  1  11.0
5   1  1  12.0
6   2  2   NaN
7   3  3  13.0
8   3  4  13.0
print (len(df_new), len(df_agg))
9 5
A possible solution is to remove the duplicates:
df_new = pd.merge(df_agg,df_total.drop_duplicates('id'), on='id', how='left')
print (df_new)
   id  a     b
0   1  0  10.0
1   1  1  10.0
2   2  2   NaN
3   3  3  13.0
4   3  4  13.0
print (len(df_new), len(df_agg))
5 5
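If your pandas version supports it (the validate argument was added in 0.21), you can also make merge enforce the expectation that the right table is unique in id. This is a sketch going beyond the original answer:

# Raises pandas.errors.MergeError if df_total has duplicate ids,
# instead of silently multiplying rows.
df_new = pd.merge(df_agg, df_total, on='id', how='left', validate='many_to_one')

With validate='many_to_one', a duplicated key on the right side raises an error up front, so whenever the merge succeeds the sums of df_agg['total'] and df_new['total'] will match.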
Given a DataFrame df1:
   value  mesh
0     10     2
1     12     3
2      5     2
obtain a new DataFrame df2 in which each row of df1 is repeated mesh times, each entry being the corresponding value of df1 divided by its mesh:
df2:
   value/mesh
0           5
1           5
2           4
3           4
4           4
5         2.5
6         2.5
More generally, given df1:
   value  mesh_value  other_value
0     10           2            0
1     12           3            1
2      5           2            2
obtain df2:
   value/mesh_value  other_value
0                 5            0
1                 5            0
2                 4            1
3                 4            1
4                 4            1
5               2.5            2
6               2.5            2
You can do this with map, using df1.eval('value/mesh') to compute each row's ratio and mapping each ratio back to its df1 row index (this assumes the ratios are unique):
df2['new'] = df2['value/mesh'].map(dict(zip(df1.eval('value/mesh'),df1.index)))
Out[243]:
0    0
1    0
2    1
3    1
4    1
5    2
6    2
Name: value/mesh, dtype: int64
Try as follows:
Use Series.div for value / mesh_value, and apply Series.reindex using np.repeat, with df.index as the input array and df.mesh_value as the repeats parameter.
Next, use pd.concat to combine the result with df.other_value along axis=1.
Finally, rename the column holding the result of value / mesh_value (its default name will be 0) using df.rename, and chain df.reset_index to restore a standard index.
import numpy as np

df2 = pd.concat([df.value.div(df.mesh_value).reindex(
    np.repeat(df.index, df.mesh_value)), df.other_value], axis=1)\
    .rename(columns={0: 'value_mesh_value'}).reset_index(drop=True)
print(df2)
   value_mesh_value  other_value
0               5.0            0
1               5.0            0
2               4.0            1
3               4.0            1
4               4.0            1
5               2.5            2
6               2.5            2
Or slightly different:
Use df.assign to add a column with the result of df.value.div(df.mesh_value), and reindex / rename in the same way as above.
Use df.drop to get rid of the columns you don't want (value, mesh_value), and use df.iloc to change the column order (we want ['value_mesh_value','other_value'] instead of the other way around, hence [1,0]). And again, reset the index.
We put all of this between brackets and assign it to df2.
df2 = (df.assign(tmp=df.value.div(df.mesh_value))
       .reindex(np.repeat(df.index, df.mesh_value))
       .rename(columns={'tmp': 'value_mesh_value'})
       .drop(columns=['value', 'mesh_value'])
       .iloc[:, [1, 0]]
       .reset_index(drop=True))
# same result
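For comparison, here is a shorter sketch of the same idea built on Index.repeat; it is not from the original answers and assumes the same df as above:

df2 = (df.loc[df.index.repeat(df.mesh_value)]
         .assign(value_mesh_value=lambda d: d.value / d.mesh_value)
         [['value_mesh_value', 'other_value']]
         .reset_index(drop=True))
# same result

df.index.repeat(df.mesh_value) repeats each row label mesh_value times, so a single loc call handles the row duplication and the division is computed once per repeated row.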
I have multiple dataframes with data for each quarter of the year. My goal is to concatenate all of them so I can sum values and have a vision for my entire year.
I managed to concatenate the four dataframes (which have the same column names and the same row names) into one. But I keep getting NaN in two columns, even though I have the data. It goes like this:
df1:
        my_data  1st_quarter
0  occurrence_1            2
1  occurrence_3            3
2  occurrence_2            0

df2:
        my_data  2nd_quarter
0  occurrence_1            5
1  occurrence_3           10
2  occurrence_2            3

df3:
        my_data  3th_quarter
0  occurrence_1           10
1  occurrence_3            2
2  occurrence_2            1
So I run this:
df_results = pd.concat(
    (df.set_index('my_data') for df in [df1, df2, df3]),
    axis=1, join='outer'
).reset_index()
What happens is this output:
           type  1st_quarter  2nd_quarter  3th_quarter
0  occurrence_1            2          NaN           10
1  occurrence_3            3           10            2
2  occurrence_2            0            3            1
3  occurrence_1          NaN            5          NaN
If I use join='inner', the first row disappears. Note that the rows have exactly the same name in all dataframes.
How can I solve the NaN problem? Or, after doing pd.concat, how can I reorganize my DF to "fill" the NaN with the correct numbers?
Update: My original dataset (which unfortunately I can't post publicly) has an inconsistency in the first row name. Any suggestions on how I can get around it? Can I rename a row? Or combine two rows after concatenating the dataframes?
I managed to get around this problem using combine_first with loc:
df_results.loc[0] = df_results.loc[0].combine_first(df_results.loc[3])
So I got this:
           type  1st_quarter  2nd_quarter  3th_quarter
0  occurrence_1            2            5           10
1  occurrence_3            3           10            2
2  occurrence_2            0            3            1
3  occurrence_1          NaN            5          NaN
Then I dropped the last row:
df_results = df_results.drop([3])
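If the inconsistency is something mechanical like stray whitespace or capitalization in the key (an assumption, since the real data isn't shown), normalizing the key column before concatenating avoids the patch-and-drop step entirely:

# Hypothetical fix: normalize the join key in each quarterly frame first.
for df in (df1, df2, df3):
    df['my_data'] = df['my_data'].str.strip().str.lower()

df_results = pd.concat(
    (df.set_index('my_data') for df in [df1, df2, df3]),
    axis=1, join='outer'
).reset_index()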
I have two files:
File 1:
key.1 10 6
key.2 5 6
key.3. 5 8
key.4. 5 10
key.5 4 12
File 2:
key.1 10 6
key.2 6 6
key.4 5 10
key.5 2 8
I have a rather complicated issue. I want to average between the two files for each loc. ID. But if an ID is unique to either of the files, I simply want to keep that value in the output file. So the output file would look like this:
key.1 10 6
key.2 5.5 6
key.3. 5 8
key.4. 5 10
key.5 3 10
This is an example. In reality I have 100s of columns that I would like to average.
The following solution uses Pandas, and assumes that your data is stored in plain text files 'file1.txt' and 'file2.txt'. Let me know if this assumption is incorrect - it is likely a minimal edit to alter for different file types. If I have misunderstood your meaning of the word 'file' and your data is already in DataFrames, you can ignore the first step.
First read in the data to DataFrames:
import pandas as pd
df1 = pd.read_table('file1.txt', sep=r'\s+', header=None)
df2 = pd.read_table('file2.txt', sep=r'\s+', header=None)
Giving us:
In [9]: df1
Out[9]:
       0   1   2
0  key.1  10   6
1  key.2   5   6
2  key.3   5   8
3  key.4   5  10
4  key.5   4  12

In [10]: df2
Out[10]:
       0   1   2
0  key.1  10   6
1  key.2   6   6
2  key.4   5  10
3  key.5   2   8
Then join these datasets on column 0:
combined = pd.merge(df1, df2, how='outer', on=0)
Giving:
       0  1_x  2_x   1_y   2_y
0  key.1   10    6  10.0   6.0
1  key.2    5    6   6.0   6.0
2  key.3    5    8   NaN   NaN
3  key.4    5   10   5.0  10.0
4  key.5    4   12   2.0   8.0
Which is a bit of a mess, but we can select only the columns we want after doing the calculations:
combined[1] = combined[['1_x', '1_y']].mean(axis=1)
combined[2] = combined[['2_x', '2_y']].mean(axis=1)
Selecting only useful columns:
results = combined[[0, 1, 2]]
Which gives us:
       0     1     2
0  key.1  10.0   6.0
1  key.2   5.5   6.0
2  key.3   5.0   8.0
3  key.4   5.0  10.0
4  key.5   3.0  10.0
Which is what you were looking for, I believe.
You didn't state what file format you wanted for the output, but the following will give you a tab-separated text file. Let me know if something different is preferred and I can edit.
results.to_csv('output.txt', sep='\t', header=None, index=False)
I should add that it would be better to give your columns relevant labels rather than using numbers as I have in this example - I just used the default integer values here since I don't know anything about your dataset.
This is one solution via pandas. The idea is to define indices for each dataframe and use Index.symmetric_difference (in set terminology, the items that appear in exactly one of the two indices) to find your unique keys.
Treat each case separately via two pd.concat calls, perform a groupby.mean on the common keys, and concatenate your isolated rows at the end.
import pandas as pd

# read the whitespace-separated files into dataframes
df1 = pd.read_csv('file1.csv', header=None, sep=r'\s+')
df2 = pd.read_csv('file2.csv', header=None, sep=r'\s+')

# set first column as index
df1 = df1.set_index(0)
df2 = df2.set_index(0)

# calculate symmetric difference of indices
x = df1.index.symmetric_difference(df2.index)
# Index(['key.3'], dtype='object', name=0)

# aggregate common and unique indices
df_common = pd.concat((df1[~df1.index.isin(x)], df2[~df2.index.isin(x)]))
df_unique = pd.concat((df1[df1.index.isin(x)], df2[df2.index.isin(x)]))

# calculate mean on common indices; concatenate unique rows
mean = pd.concat([df_common.groupby(df_common.index).mean(), df_unique])\
         .sort_index()\
         .reset_index()

# output to csv
mean.to_csv('out.csv', index=False)
Result:
       0     1     2
0  key.1  10.0   6.0
1  key.2   5.5   6.0
2  key.3   5.0   8.0
3  key.4   5.0  10.0
4  key.5   3.0  10.0
You can use itertools.groupby:
import itertools
import re
def rows(path):
    return [re.split(r'\s+', line.strip('\n')) for line in open(path)]

file_1 = [[re.sub(r'\.$', '', a), *map(int, filter(None, b))] for a, *b in rows('file1.txt')]
file_2 = [[re.sub(r'\.$', '', a), *map(int, filter(None, b))] for a, *b in rows('file2.txt')]
special_keys = {a for a, *_ in rows('file1.txt') + rows('file2.txt') if a.endswith('.')}
new_results = [[a, [c for _, *c in b]]
               for a, b in itertools.groupby(sorted(file_1 + file_2, key=lambda x: x[0]),
                                             key=lambda x: x[0])]
last_results = [(' ' * 4).join(['{}'] * 3).format(a + '.' if a + '.' in special_keys else a,
                                                  *[sum(i) / float(len(i)) for i in zip(*b)])
                for a, b in new_results]
Output:
['key.1 10.0 6.0', 'key.2 5.5 6.0', 'key.3. 5.0 8.0', 'key.4. 5.0 10.0', 'key.5 3.0 10.0']
One possible solution is to read the two files into dictionaries (the key being the key variable, and the value being a list of the elements after it). You can then get the keys of each dictionary, see which keys are shared (and if so, average the results) and which keys are unique (and if so, just output that key's values). This might not be the most efficient approach, but if you only have hundreds of columns it should be the simplest way to do it.
Look up set intersection and set difference, as they will help you find the common items and unique items.
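A minimal sketch of that dictionary approach, assuming whitespace-separated input files named file1.txt and file2.txt (hypothetical names) and tab-separated output:

def load(path):
    # Read 'key value value ...' lines into {key: [floats]}.
    data = {}
    with open(path) as f:
        for line in f:
            key, *values = line.split()
            data[key] = [float(v) for v in values]
    return data

d1, d2 = load('file1.txt'), load('file2.txt')
with open('output.txt', 'w') as out:
    for key in sorted(d1.keys() | d2.keys()):
        if key in d1 and key in d2:
            # common key: average the rows column-wise
            row = [(a + b) / 2 for a, b in zip(d1[key], d2[key])]
        else:
            # unique key: keep the single row's values as-is
            row = d1[key] if key in d1 else d2[key]
        out.write('\t'.join([key] + [str(v) for v in row]) + '\n')

The set union of the two key views yields every ID once, and the membership test distinguishes the shared keys (averaged) from the unique ones (copied through).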
I have two dataframes
df1:
Name  class  value
Sri       1      5
Ram       2      8
viv       3      4

df2:
Name  class  value
Sri       1      5
viv       4      4

My desired output is:

df:
Name  class  value
Sri       2     10
Ram       2      8
viv       7      8
Please help, thanks in advance!
You need set_index for both DataFrames, then add, and last reset_index:
df = df1.set_index('Name').add(df2.set_index('Name'), fill_value=0).reset_index()
print (df)
  Name  class  value
0  Ram    2.0    8.0
1  Sri    2.0   10.0
2  viv    7.0    8.0
If values in Name are not unique, use groupby and aggregate sum:
df = df1.groupby('Name').sum().add(df2.groupby('Name').sum(), fill_value=0).reset_index()
pd.concat + groupby + sum
You can concatenate your individual dataframes and then group by your key column:
df = pd.concat([df1, df2])\
       .groupby('Name')[['class', 'value']]\
       .sum().reset_index()
print(df)
  Name  class  value
0  Ram      2      8
1  Sri      2     10
2  viv      7      8
I have the following two DataFrames:
df1 = pd.DataFrame({'ids':[1,2,3,4,5],'cost':[0,0,1,1,0]})
df2 = pd.DataFrame({'ids':[1,5],'cost':[1,4]})
And I want to update the values of df1 with the ones on df2 whenever there is a match in the ids. The desired dataframe is this one:
df_result = pd.DataFrame({'ids':[1,2,3,4,5],'cost':[1,0,1,1,4]})
How can I get that from the above two dataframes?
I have tried using merge, but it returns fewer records and keeps both columns:
results = pd.merge(df1,df2,on='ids')
results.to_dict()
{'cost_x': {0: 0, 1: 0}, 'cost_y': {0: 1, 1: 4}, 'ids': {0: 1, 1: 5}}
You could do this with a left merge:
merged = pd.merge(df1, df2, on='ids', how='left')
merged['cost'] = merged.cost_x.where(merged.cost_y.isnull(), merged['cost_y'])
result = merged[['ids','cost']]
However you can avoid the need for the merge (and get better performance) if you set the ids as an index column; then pandas can use this to align the results for you:
df1 = df1.set_index('ids')
df2 = df2.set_index('ids')
df1.cost.where(~df1.index.isin(df2.index), df2.cost)
ids
1    1.0
2    0.0
3    1.0
4    1.0
5    4.0
Name: cost, dtype: float64
You can use set_index and combine_first to give precedence to the values in df2:
df_result = df2.set_index('ids').combine_first(df1.set_index('ids'))
df_result.reset_index()
You get:
   ids  cost
0    1     1
1    2     0
2    3     1
3    4     1
4    5     4
Another way to do it is to use a temporary merged dataframe, which you can discard after use.
import pandas as pd
df1 = pd.DataFrame({'ids':[1,2,3,4,5],'cost':[0,0,1,1,0]})
df2 = pd.DataFrame({'ids':[1,5],'cost':[1,4]})
dftemp = df1.merge(df2,on='ids',how='left', suffixes=('','_r'))
print(dftemp)
mask = ~pd.isnull(dftemp.cost_r)
df1.loc[mask, 'cost'] = dftemp.loc[mask, 'cost_r']
del dftemp
df1 = df1[['ids','cost']]
print(df1)
Output:

dftemp:
   cost  ids  cost_r
0     0    1     1.0
1     0    2     NaN
2     1    3     NaN
3     1    4     NaN
4     0    5     4.0
df1:
   ids  cost
0    1   1.0
1    2   0.0
2    3   1.0
3    4   1.0
4    5   4.0
A little late, but this did it for me and was faster than the accepted answer in my tests:
# Align df2's costs to df1's ids, then overwrite the matching non-NaN values in place.
df1.update(df2.set_index('ids').reindex(df1.set_index('ids').index).reset_index())