I'm new to pandas and have been having trouble using the merge, join and concatenate functions on a single row of data.
I'm iterating over a handful of rows in a table, and in each iteration I add some data I've found to the row I'm handling. I know, blasphemy! Thou shalt not iterate. Each iteration results in a call to a server, so I need to control the flow. There aren't that many rows. It's just for my own use. I promise I'll not iterate when I shouldn't.
That aside, my basic question is this: How do I add data to a given row where the new data has priority over existing data and has new columns?
Let's suppose I have a DataFrame df that I'm iterating over by row:
> df
c1 c2 c3
0 a b c
1 d e f
and when iterating on row 0, I get some new data that I want to add to row 0. That new data is in df_a:
> df_a
c4 c5 c6
0 g h i
I want to add data from df_a to row 0 of df so df is now:
> df
c1 c2 c3 c4 c5 c6
0 a b c g h i
1 d e f NaN NaN NaN
Next I iterate on row 1 and I get some columns which overlap and some which don't in df_b:
> df_b
c5 c7 c8
0 j k l
And again I want to add this data to row 1 so df now has
> df
c1 c2 c3 c4 c5 c6 c7 c8
0 a b c g h i NaN NaN
1 d e f NaN j NaN k l
I can't list column names because I don't know what they'll be, and new ones can appear beyond my control. Rows don't have a key because the whole thing gets thrown away after I disconnect. Data I find during each iteration always overwrites what's currently in df.
Thanks in advance!
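One rough sketch of how this could be done (untested against the real use case; row_label below is just illustrative): add any columns that df doesn't have yet, then overwrite the row with .loc so the freshly found data wins over whatever is already there.
import pandas as pd

# Sample frames matching the example above
df = pd.DataFrame({'c1': ['a', 'd'], 'c2': ['b', 'e'], 'c3': ['c', 'f']})
df_a = pd.DataFrame({'c4': ['g'], 'c5': ['h'], 'c6': ['i']})  # new data found for row 0

row_label = 0
for col in df_a.columns:
    if col not in df.columns:
        df[col] = pd.NA                                # unseen column: create it as all-missing
df.loc[row_label, df_a.columns] = df_a.iloc[0].values  # new data overwrites existing values
print(df)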
Related
I have a dataframe:
id value
a1 1,2
b2 4
c1 NaN
c5 9,10,11
I want to create a new column mean_value which is equal to the mean of the values in column value:
id value mean_value
a1 1,2 1.5
b2 4 4
c5 9,10,11 10
and I also want to remove the rows with NaN in that column. How can I do that?
Here's one way using str.split and mean:
df = df.assign(mean_value=df['value'].str.split(',', expand=True).astype(float)
                                     .mean(axis=1)).dropna()
Output:
id value mean_value
0 a1 1,2 1.5
1 b2 4 4.0
3 c5 9,10,11 10.0
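For reference, a minimal reproduction (assuming value holds comma-separated strings, with NaN for the missing entry):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['a1', 'b2', 'c1', 'c5'],
                   'value': ['1,2', '4', np.nan, '9,10,11']})

# Split each string into columns, convert to float, average across the row, then drop the NaN row
df = df.assign(mean_value=df['value'].str.split(',', expand=True).astype(float)
                                     .mean(axis=1)).dropna()
print(df)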
I'm new to pandas and have tried going through the docs and experimenting with various examples, but this problem I'm tackling has really stumped me.
I have the following two dataframes (DataA/DataB) which I would like to merge on a per global_index/item/value basis.
DataA DataB
row item_id valueA row item_id valueB
0 x A1 0 x B1
1 y A2 1 y B2
2 z A3 2 x B3
3 x A4 3 y B4
4 z A5 4 z B5
5 x A6 5 x B6
6 y A7 6 y B7
7 z A8 7 z B8
The list of items (item_ids) is finite, and each of the two dataframes represents the value of a trait (trait A, trait B) for an item at a given global_index value.
The global_index could roughly be thought of as a unit of "time".
The mapping between each data frame (DataA/DataB) and the global_index is done via the following two mapper DFs:
DataA_mapper
global_index start_row num_rows
0 0 3
1 3 2
3 5 3
DataB_mapper
global_index start_row num_rows
0 0 2
2 2 3
4 5 3
Simply put, for a given global_index (e.g. 1) the mapper defines the range of rows in the respective DF (DataA or DataB) that are associated with that global_index.
For example, for a global_index value of 0:
In DF DataA rows 0..2 are associated with global_index 0
In DF DataB rows 0..1 are associated with global_index 0
Another example, for a global_index value of 2:
In DF DataB rows 2..4 are associated with global_index 2
In DF DataA there are no rows associated with global_index 2
The ranges [start_row,start_row + num_rows) presented do not overlap each other and represent a unique sequence/range of rows in their respective dataframes (DataA, DataB)
In short, no row in either DataA or DataB will be found in more than one range.
I would like to merge the DFs so that I get the following dataframe:
row global_index item_id valueA valueB
0 0 x A1 B1
1 0 y A2 B2
2 0 z A3 NaN
3 1 x A4 B1
4 1 z A5 NaN
5 2 x A4 B3
6 2 y A2 B4
7 2 z A5 NaN
8 3 x A6 B3
9 3 y A7 B4
10 3 z A8 B5
11 4 x A6 B6
12 4 y A7 B7
13 4 z A8 B8
In the final dataframe, for any pair of global_index/item_id there will only ever be either:
a value for both valueA and valueB
a value only for valueA
a value only for valueB
The requirement is that if there is only one value for a given global_index/item (e.g. valueA but no valueB), the last seen value of the missing one should be used.
First, you can create the 'global_index' column using the function pd.cut:
for df, m in [(df_A, map_A), (df_B, map_B)]:
    bins = np.insert(m['num_rows'].cumsum().values, 0, 0)  # create bins and add zero at the beginning
    df['global_index'] = pd.cut(df['row'], bins=bins, labels=m['global_index'], right=False)
Next, you can use outer join to merge both data frames:
df = df_A.merge(df_B, on=['global_index', 'item_id'], how='outer')
And finally you can use functions groupby and ffill to fill missing values:
for val in ['valueA', 'valueB']:
    df[val] = df.groupby('item_id')[val].ffill()
Output:
item_id global_index valueA valueB
0 x 0 A1 B1
1 y 0 A2 B2
2 z 0 A3 NaN
3 x 1 A4 B1
4 z 1 A5 NaN
5 x 3 A6 B1
6 y 3 A7 B2
7 z 3 A8 NaN
8 x 2 A6 B3
9 y 2 A7 B4
10 z 2 A8 B5
11 x 4 A6 B6
12 y 4 A7 B7
13 z 4 A8 B8
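For completeness, a sketch assembling the three steps into one runnable script; the sample frames are reconstructed from the question, and the .astype(int) cast is an extra step (not strictly part of the answer) so the merge keys are plain integers rather than categoricals:
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'row': range(8),
                     'item_id': ['x', 'y', 'z', 'x', 'z', 'x', 'y', 'z'],
                     'valueA': ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8']})
df_B = pd.DataFrame({'row': range(8),
                     'item_id': ['x', 'y', 'x', 'y', 'z', 'x', 'y', 'z'],
                     'valueB': ['B1', 'B2', 'B3', 'B4', 'B5', 'B6', 'B7', 'B8']})
map_A = pd.DataFrame({'global_index': [0, 1, 3], 'start_row': [0, 3, 5], 'num_rows': [3, 2, 3]})
map_B = pd.DataFrame({'global_index': [0, 2, 4], 'start_row': [0, 2, 5], 'num_rows': [2, 3, 3]})

# Step 1: label each row with its global_index via pd.cut on the cumulative row counts
for df, m in [(df_A, map_A), (df_B, map_B)]:
    bins = np.insert(m['num_rows'].cumsum().values, 0, 0)
    df['global_index'] = pd.cut(df['row'], bins=bins, labels=m['global_index'],
                                right=False).astype(int)

# Step 2: outer join on the shared keys
df = df_A.merge(df_B, on=['global_index', 'item_id'], how='outer')

# Step 3: forward-fill the missing trait values per item
for val in ['valueA', 'valueB']:
    df[val] = df.groupby('item_id')[val].ffill()

print(df[['item_id', 'global_index', 'valueA', 'valueB']])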
I haven't tested this out, since I don't have any good test data, but I think something like this should work. Basically what this is doing is, rather than trying to pull off some sort of complicated join, it is building a series of lists to hold your data, which you can then put back together into a final dataframe at the end.
import pandas as pd

DataA = DataA.set_index('row')
DataB = DataB.set_index('row')
#we're going to create the new dataframe from scratch, creating a list for each column we want
global_index = []
AValues = []
AIndex = []
BValues = []
BIndex = []
#totalIndexes: the lookups below assume every global_index has rows in both frames,
#so take the indexes present in both mappers
totalIndexes = sorted(set(DataA_mapper['global_index']) & set(DataB_mapper['global_index']))
for indexNum in totalIndexes:
    #for each global index, we get the total number of rows to extract from DataA and DataB
    AStart = DataA_mapper.loc[DataA_mapper['global_index']==indexNum, 'start_row'].values[0]
    ARows = DataA_mapper.loc[DataA_mapper['global_index']==indexNum, 'num_rows'].values[0]
    AStop = AStart + ARows
    BStart = DataB_mapper.loc[DataB_mapper['global_index']==indexNum, 'start_row'].values[0]
    BRows = DataB_mapper.loc[DataB_mapper['global_index']==indexNum, 'num_rows'].values[0]
    BStop = BStart + BRows
    #Next we extract values from DataA and DataB, turn them into lists, and add them to our data
    AValues = AValues + list(DataA.iloc[AStart:AStop, 1].values)
    AIndex = AIndex + list(DataA.iloc[AStart:AStop, 0].values)
    BValues = BValues + list(DataB.iloc[BStart:BStop, 1].values)
    BIndex = BIndex + list(DataB.iloc[BStart:BStop, 0].values)
    #Create a temporary list of the current global_index, and add it to our data
    global_index_temp = []
    for row in range(max(ARows, BRows)):
        global_index_temp.append(indexNum)
    global_index = global_index + global_index_temp
#combine all these individual lists into a dataframe
finalData = list(zip(global_index, AIndex, BIndex, AValues, BValues))
df = pd.DataFrame(data=finalData, columns=['global_index', 'item1', 'item2', 'valueA', 'valueB'])
#lastly you just need to merge item1 and item2 to get your item_id column
I've tried to comment it nicely so that hopefully the general plan makes sense and you can follow along and correct my mistakes or rewrite it your own way.
df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the value of the column 'D' in both dataframes. If both dataframes had the same number of rows, I would just do this:
newDF = df1['D']-df2['D']
However, there are times when the number of rows is different. I want a result DataFrame like this:
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: if the 1st row of A, B, C from df1 and df2 is the same, then and only then compare the 1st row of column D for each dataframe. Similarly, repeat for all the rows.
Use merge and df.eval:
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1
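A usage sketch with the frames reconstructed from the question:
import pandas as pd

df1 = pd.DataFrame({'A': ['a1', 'a2', 'a3'], 'B': ['b1', 'b2', 'b3'],
                    'C': ['c1', 'c2', 'c3'], 'D': [1, 2, 4]})
df2 = pd.DataFrame({'A': ['a1', 'a2'], 'B': ['b1', 'b2'],
                    'C': ['c1', 'c2'], 'D': [2, 1]})

# Inner merge keeps only rows whose A, B, C match in both frames, then eval computes the difference
result = (df1.merge(df2, on=['A', 'B', 'C'], suffixes=['_df1', '_df2'])
             .eval('Diff = D_df1 - D_df2'))
print(result)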
This question already has answers here:
Check if pandas dataframe is subset of other dataframe
(3 answers)
Closed 3 years ago.
I have 2 csv files (csv1, csv2). In csv2, a new column or row might have been added.
I need to verify that csv1 is a subset of csv2. For csv1 to be a subset, each whole row should be present in both files, and elements from any new column or row should be ignored.
csv1:
c1,c2,c3
A,A,6
D,A,A
A,1,A
csv2:
c1,c2,c3,c4
A,A,6,L
A,changed,A,L
D,A,A,L
Z,1,A,L
Added,Anew,line,L
What I am trying is:
df1 = pd.read_csv(csv1_file)
df2 = pd.read_csv(csv2_file)
matching_cols=df1.columns.intersection(df2.columns).tolist()
sorted_df1 = df1.sort_values(by=list(matching_cols)).reset_index(drop=True)
sorted_df2 = df2.sort_values(by=list(matching_cols)).reset_index(drop=True)
print("truth data>>>\n",sorted_df1)
print("Test data>>>\n",sorted_df2)
df1_mask = sorted_df1[matching_cols].eq(sorted_df2[matching_cols])
# print(df1_mask)
print("compared data>>>\n",sorted_df1[df1_mask])
It gives the output as:
truth data>>>
c1 c2 c3
0 A 1 A
1 A A 6
2 D A A
Test data>>>
c1 c2 c3 c4
0 A A 6 L
1 A changed A L
2 Added Anew line L
3 D A A L
4 Z 1 A L
compared data>>>
c1 c2 c3
0 A NaN NaN
1 A NaN NaN
2 NaN NaN NaN
What I want is:
compared data>>>
c1 c2 c3
0 NaN NaN NaN
1 A A 6
2 D A A
Please help.
Thanks
If you need missing values in the last row because there is no match, use DataFrame.merge with a left join and the indicator parameter, then set the missing values by a mask and remove the helper column _merge:
import numpy as np

matching_cols = df1.columns.intersection(df2.columns)
df2 = df1[matching_cols].merge(df2[matching_cols], how='left', indicator=True)
df2.loc[df2['_merge'].ne('both')] = np.nan
df2 = df2.drop('_merge', axis=1)
print(df2)
c1 c2 c3
0 A A 6
1 D A A
2 NaN NaN NaN
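A usage sketch (the csv files are reproduced as DataFrames here; in practice they would come from pd.read_csv):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'c1': ['A', 'D', 'A'], 'c2': ['A', 'A', '1'], 'c3': ['6', 'A', 'A']})
df2 = pd.DataFrame({'c1': ['A', 'A', 'D', 'Z', 'Added'],
                    'c2': ['A', 'changed', 'A', '1', 'Anew'],
                    'c3': ['6', 'A', 'A', 'A', 'line'],
                    'c4': ['L', 'L', 'L', 'L', 'L']})

matching_cols = df1.columns.intersection(df2.columns)
out = df1[matching_cols].merge(df2[matching_cols], how='left', indicator=True)
out.loc[out['_merge'].ne('both')] = np.nan   # rows of df1 with no exact match in df2 become NaN
out = out.drop('_merge', axis=1)
print(out)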
I have two dataframes df1 and df2 with key as index.
dict_1={'key':[1,1,1,2,2,3], 'col1':['a1','b1','c1','d1','e1','f1']}
df1 = pd.DataFrame(dict_1).set_index('key')
dict_2={'key':[1,1,2], 'col2':['a2','b2','c2']}
df2 = pd.DataFrame(dict_2).set_index('key')
df1:
col1
key
1 a1
1 b1
1 c1
2 d1
2 e1
3 f1
df2
col2
key
1 a2
1 b2
2 c2
Note that there are unequal rows for each index. I want to concatenate these two dataframes such that, I have the following dataframe (say df3).
df3
col1 col2
key
1 a1 a2
1 b1 b2
2 d1 c2
i.e. concatenate the two columns so that the new dataframe has the lesser number of rows (of df1 and df2) for each index.
I tried
pd.concat([df1,df2],axis=1)
but I get the following error:
Value Error: Shape of passed values is (2,17), indices imply (2,7)
My question: How can I concatenate df1 and df2 to get df3? Should I use DataFrame.merge instead? If so, how?
Merge/join alone will get you a lot of (hard to get rid of) duplicates. But a little trick will help:
df1['count1'] = 1
df1['count1'] = df1['count1'].groupby(df1.index).cumsum()
df1
Out[198]:
col1 count1
key
1 a1 1
1 b1 2
1 c1 3
2 d1 1
2 e1 2
3 f1 1
The same thing for df2:
df2['count2'] = 1
df2['count2'] = df2['count2'].groupby(df2.index).cumsum()
And finally:
df_aligned = df1.reset_index().merge(df2.reset_index(), left_on = ['key','count1'], right_on = ['key', 'count2'])
df_aligned
Out[199]:
key col1 count1 col2 count2
0 1 a1 1 a2 1
1 1 b1 2 b2 2
2 2 d1 1 c2 1
Now, you can restore the index with set_index('key') and drop the no-longer-needed count columns.
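Putting it all together, a runnable sketch of the whole trick, including that cleanup step:
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1, 1, 2, 2, 3],
                    'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}).set_index('key')
df2 = pd.DataFrame({'key': [1, 1, 2],
                    'col2': ['a2', 'b2', 'c2']}).set_index('key')

# Number the repeated keys in each frame so the merge can pair them up positionally
df1['count1'] = 1
df1['count1'] = df1['count1'].groupby(df1.index).cumsum()
df2['count2'] = 1
df2['count2'] = df2['count2'].groupby(df2.index).cumsum()

df_aligned = df1.reset_index().merge(df2.reset_index(),
                                     left_on=['key', 'count1'],
                                     right_on=['key', 'count2'])

# Restore the key index and drop the helper columns
df3 = df_aligned.set_index('key').drop(columns=['count1', 'count2'])
print(df3)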
The biggest reason you are not going to be able to line up the two in the way that you want is that your keys are duplicated. How are you going to line up the a1 value in df1 with the a2 value in df2 when a1, a2, b1, b2, and c1 all have the same key?
Using merge is what you'll want if you can resolve the key issues:
df3 = df1.merge(df2, left_index=True, right_index=True, how='inner')
You can use inner, outer, left or right for how.
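For illustration, a small sketch of why the duplicated keys matter: with the frames from the question, an index merge produces every pairing of rows that share a key, not the row-by-row pairing shown as df3.
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1, 1, 2, 2, 3],
                    'col1': ['a1', 'b1', 'c1', 'd1', 'e1', 'f1']}).set_index('key')
df2 = pd.DataFrame({'key': [1, 1, 2],
                    'col2': ['a2', 'b2', 'c2']}).set_index('key')

df3 = df1.merge(df2, left_index=True, right_index=True, how='inner')
print(df3)
# key 1 yields 3 x 2 = 6 rows and key 2 yields 2 x 1 = 2 rows -- a cartesian product
# per key, so the duplicate keys need to be resolved first (e.g. with the counting trick above)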