I have two DataFrames created from pivot tables:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame({'axis1': ['Unix', 'Window', 'Apple', 'Linux'],
                   'A': [1, np.nan, 1, 1],
                   'B': [1, np.nan, np.nan, 1],
                   'C': [np.nan, 1, np.nan, 1],
                   'D': [1, np.nan, 1, np.nan]}).set_index('axis1')
print(df)
df2 = pd.DataFrame({'axis1': ['Unix', 'Window', 'Apple', 'Linux', 'A'],
                    'A': [1, 1, np.nan, np.nan, np.nan],
                    'E': [1, np.nan, 1, 1, 1]}).set_index('axis1')
print(df2)
The output looks like this:
A B C D
axis1
Unix 1 1 NaN 1
Window NaN NaN 1 NaN
Apple 1 NaN NaN 1
Linux 1 1 1 NaN
[4 rows x 4 columns]
A E
axis1
Unix 1 1
Window 1 NaN
Apple NaN 1
Linux NaN 1
A NaN 1
Let's say I want to combine them, but I only want values of 1.
So far I have this, but it is missing column E and row A:
>>> df.update(df2)
>>> df
A B C D
axis1
Unix 1 1 NaN 1
Window 1 NaN 1 NaN
Apple 1 NaN NaN 1
Linux 1 1 1 NaN
[4 rows x 4 columns]
How would I update it to include the additional axis values (row A and column E)?
You want to reindex your first DataFrame before you call update.
One robust way is to reindex to the union of the columns and rows of both DataFrames; maybe there is a smarter way, but I can't think of one at the moment:
df = df.reindex(columns=df2.columns.union(df.columns),
                index=df2.index.union(df.index))
Then call update on that, and it should work.
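Putting the pieces together, a minimal end-to-end sketch on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'axis1': ['Unix', 'Window', 'Apple', 'Linux'],
                   'A': [1, np.nan, 1, 1],
                   'B': [1, np.nan, np.nan, 1],
                   'C': [np.nan, 1, np.nan, 1],
                   'D': [1, np.nan, 1, np.nan]}).set_index('axis1')
df2 = pd.DataFrame({'axis1': ['Unix', 'Window', 'Apple', 'Linux', 'A'],
                    'A': [1, 1, np.nan, np.nan, np.nan],
                    'E': [1, np.nan, 1, 1, 1]}).set_index('axis1')

# Grow df to the union of both frames' row/column labels (new cells are NaN),
# then let update overlay df2's non-NaN values in place
df = df.reindex(columns=df2.columns.union(df.columns),
                index=df2.index.union(df.index))
df.update(df2)
```

After this, row A and column E exist, and update fills them from df2 while leaving df's other values alone.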
What is the most pandastic way to create running total columns at various levels (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,'X','X','X','X',np.nan,'X','X','X','X','X','X',np.nan,np.nan,'X','X'
df['desired_output_level_1'] = np.nan,np.nan,'1','1','1','1',np.nan,'2','2','2','2','2','2',np.nan,np.nan,'3','3'
df['desired_output_level_2'] = np.nan,np.nan,'1','2','3','4',np.nan,'1','2','3','4','5','6',np.nan,np.nan,'1','2'
output:
test desired_output_level_1 desired_output_level_2
0 NaN NaN NaN
1 NaN NaN NaN
2 X 1 1
3 X 1 2
4 X 1 3
5 X 1 4
6 NaN NaN NaN
7 X 2 1
8 X 2 2
9 X 2 3
10 X 2 4
11 X 2 5
12 X 2 6
13 NaN NaN NaN
14 NaN NaN NaN
15 X 3 1
16 X 3 2
The test column can only contain X's or NaNs, and the number of consecutive X's is random.
In the 'desired_output_level_1' column, I am trying to count up the series of X's (each run gets the next number).
In the 'desired_output_level_2' column, I am trying to number the rows within each series (its duration so far).
Can anyone help? Thanks in advance.
Perhaps not the most pandastic way, but this seems to yield what you are after.
Three key points:
We only operate on the rows that are not NaN, so let's create a mask:
mask = df['test'].notna()
For the level-1 computation, detect where a new series starts by shifting the rows by one: a row starts a series when it is not NaN but the previous row is. A cumulative sum of those start flags then numbers the series (this also handles a series that begins on the very first row):
df.loc[mask, "level_1"] = (mask & df["test"].shift(1).isna()).cumsum()
For the level-2 computation, it's a bit trickier. One way to do it is to run the computation for each level_1 group and use .transform to preserve the indexing:
df.loc[mask, "level_2"] = (
df.loc[mask, ["level_1"]]
.assign(level_2=1)
.groupby("level_1")["level_2"]
.transform("cumsum")
)
The last step (if needed) is to convert the computed values to strings, restricting the cast to the masked rows so the NaN rows stay NaN:
df.loc[mask, 'level_1'] = df.loc[mask, 'level_1'].astype(int).astype(str)
df.loc[mask, 'level_2'] = df.loc[mask, 'level_2'].astype(int).astype(str)
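A self-contained sketch of the whole approach on the question's data (using groupby().cumcount() as an equivalent of the transform('cumsum') trick for level 2):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'test': [np.nan, np.nan, 'X', 'X', 'X', 'X', np.nan,
                            'X', 'X', 'X', 'X', 'X', 'X', np.nan, np.nan,
                            'X', 'X']})

mask = df['test'].notna()
# A row starts a new series when it is an X but the previous row is NaN
starts = mask & df['test'].shift(1).isna()
df.loc[mask, 'level_1'] = starts.cumsum()  # series number: 1, 2, 3, ...
# Position within each series: 1-based cumulative count per level_1 group
df.loc[mask, 'level_2'] = df.loc[mask].groupby('level_1').cumcount() + 1
```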
The example below:
import pandas as pd
list1 = ['a','a','a','b','b','b','b','c','c','c']
list2 = range(len(list1))
df = pd.DataFrame(zip(list1, list2), columns= ['Item','Value'])
df
gives:
  Item  Value
0    a      0
1    a      1
2    a      2
3    b      3
4    b      4
5    b      5
6    b      6
7    c      7
8    c      8
9    c      9
Required: a GroupFirstValue column, as shown in the output below.
The idea is to use a lambda formula to get the 'first' value for each group. For example, "a"'s first value is 0, "b"'s first value is 3, and "c"'s first value is 7; that's why those numbers appear in the GroupFirstValue column.
Note: I know that I can do this in two steps: keep the original df, build a grouped-by df, and then merge them together. The idea is to see if this can be done more efficiently in a single step. Many thanks in advance!
Use groupby and first:
df.groupby('Item')['Value'].first()
Or you can use transform and assign the result to a new column in your frame:
df['new_col'] = df.groupby('Item')['Value'].transform('first')
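For example, on the question's frame, the transform version broadcasts each group's first value back onto every row of that group:

```python
import pandas as pd

df = pd.DataFrame({'Item': ['a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c'],
                   'Value': range(10)})

# 'first' is evaluated once per Item group; transform stretches the
# per-group result back out to the original row count
df['GroupFirstValue'] = df.groupby('Item')['Value'].transform('first')
```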
Use mask with duplicated, which keeps Value only on the first occurrence of each Item:
df['GroupFirstValue'] = df.Value.mask(df.Item.duplicated())
Out[109]:
Item Value GroupFirstValue
0 a 0 0.0
1 a 1 NaN
2 a 2 NaN
3 b 3 3.0
4 b 4 NaN
5 b 5 NaN
6 b 6 NaN
7 c 7 7.0
8 c 8 NaN
9 c 9 NaN
I have the following dictionary, which contains DataFrames as values, each always having a single column with the same title:
test = {'A': pd.DataFrame(np.random.randn(10), index=range(10),columns=['values']),
'B': pd.DataFrame(np.random.randn(6), index=range(6),columns=['values']),
'C': pd.DataFrame(np.random.randn(11), index=range(11),columns=['values'])}
From this, I would like to create a single DataFrame whose index values are the keys of the dictionary (so A, B, C) and whose columns are the union of the current index values across all the DataFrames (so in this case 0, 1, 2, ..., 10). The values of this DataFrame would be the corresponding 'values' entries from the DataFrame for each row, and NaN where missing.
Is there a handy way to do this?
IIUC, use pd.concat, keys, and unstack:
pd.concat([test[i] for i in test], keys=test.keys()).unstack(1)['values']
Better yet,
pd.concat(test).unstack(1)['values']
Output:
0 1 2 3 4 5 6 \
A -0.029027 -0.530398 -0.866021 1.331116 0.090178 1.044801 -1.586620
C 1.320105 1.244250 -0.162734 0.942929 -0.309025 -0.853728 1.606805
B -1.683822 1.015894 -0.178339 -0.958557 -0.910549 -1.612449 NaN
7 8 9 10
A -1.072210 1.654565 -1.188060 NaN
C 1.642461 -0.137037 -1.416697 -0.349107
B NaN NaN NaN NaN
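Since randn makes the output above hard to reproduce, here is the same reshape on small fixed data (the keys and values are just for illustration):

```python
import pandas as pd

test = {'A': pd.DataFrame({'values': [1.0, 2.0, 3.0]}, index=range(3)),
        'B': pd.DataFrame({'values': [4.0, 5.0]}, index=range(2))}

# concat stacks the frames with the dict keys as an outer index level;
# unstack(1) then pivots the original row index out into columns
out = pd.concat(test).unstack(1)['values']
```

Missing positions (here B's index 2) come out as NaN, as requested.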
Don't overcomplicate things: just use concat and transpose.
pd.concat(test, axis=1).T
0 1 2 3 4 5 \
A values -0.592711 0.266518 -0.774702 0.826701 -2.642054 -0.366401
B values -0.709410 -0.463603 0.058129 -0.054475 -1.060643 0.081655
C values 1.384366 0.662186 -1.467564 0.449142 -1.368751 1.629717
6 7 8 9 10
A values 0.431069 0.761245 -1.125767 0.614622 NaN
B values NaN NaN NaN NaN NaN
C values 0.988287 -1.508384 0.214971 -0.062339 -0.011547
If you were dealing with Series instead of one-column DataFrames, that would make more sense to begin with:
test = {'A': pd.Series(np.random.randn(10), index=range(10)),
        'B': pd.Series(np.random.randn(6), index=range(6)),
        'C': pd.Series(np.random.randn(11), index=range(11))}
pd.concat(test, axis=1).T
0 1 2 3 4 5 6 \
A -0.174565 -2.015950 0.051496 -0.433199 0.073010 -0.287708 -1.236115
B 0.935434 0.228623 0.205645 -0.602561 1.860035 -0.921963 NaN
C 0.944508 -1.296606 -0.079339 0.629038 0.314611 -0.429055 -0.911775
7 8 9 10
A -0.704886 -0.369263 -0.390684 NaN
B NaN NaN NaN NaN
C 0.815078 0.061458 1.726053 -0.503471
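The same idea with small deterministic Series (illustrative values), so the result is easy to check:

```python
import pandas as pd

test = {'A': pd.Series([1.0, 2.0, 3.0], index=range(3)),
        'B': pd.Series([4.0, 5.0], index=range(2))}

# With a dict of Series, concat along axis=1 aligns on the index and the
# dict keys become columns; transposing puts the keys on the row index
wide = pd.concat(test, axis=1).T
```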
Say I have two columns, A and B, in my dataframe:
A B
1 NaN
2 5
3 NaN
4 6
I want to get a new column, C, which fills in NaN cells in column B using values from column A:
A B C
1 NaN 1
2 5 5
3 NaN 3
4 6 6
How do I do this?
I'm sure this is a very basic question, but as I am new to Pandas, any help will be appreciated!
You can use combine_first (note the column names must match the frame, so uppercase here):
df['C'] = df['B'].combine_first(df['A'])
Docs: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.Series.combine_first.html
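A quick runnable version on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [np.nan, 5, np.nan, 6]})

# combine_first keeps B where it has a value and falls back to A elsewhere
df['C'] = df['B'].combine_first(df['A'])
```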
You can use where which is a vectorized if/else:
df['C'] = df['A'].where(df['B'].isnull(), df['B'])
A B C
0 1 NaN 1
1 2 5 5
2 3 NaN 3
3 4 6 6
df['C'] = df['B'].fillna(df['A'])
What .fillna does is fill the NaN values in a Series; we can pass it a scalar or an aligned Series.
Here we pass df['A'], so this method puts the corresponding values of A into the NaN slots of B, and the final answer ends up in C.
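In full, on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [np.nan, 5, np.nan, 6]})

# Wherever B is NaN, take the aligned value from A; elsewhere keep B
df['C'] = df['B'].fillna(df['A'])
```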
I have >1000 DataFrames, each with >20K rows and several columns, that need to be merged on a certain common column. The idea can be illustrated by this:
data1=pd.DataFrame({'name':['a','c','e'], 'value':[1,3,4]})
data2=pd.DataFrame({'name':['a','d','e'], 'value':[3,3,4]})
data3=pd.DataFrame({'name':['d','e','f'], 'value':[1,3,5]})
data4=pd.DataFrame({'name':['d','f','g'], 'value':[0,3,4]})
# some of them may have more or fewer columns than the others:
#data5=pd.DataFrame({'name':['d','f','g'], 'value':[0,3,4], 'score':[1,3,4]})
final_data=data1
for i, v in enumerate([data2, data3, data4]):
if i==0:
final_data=pd.merge(final_data, v, how='outer', left_on='name',
right_on='name', suffixes=('_0', '_%s'%(i+1)))
#in real case right_on may be = columns other than 'name'
#dependents on the dataframe, but this requirement can be
#ignored in this minimal example.
else:
final_data=pd.merge(final_data, v, how='outer', left_on='name',
right_on='name', suffixes=('', '_%s'%(i+1)))
Result:
name value_0 value_1 value value_3
0 a 1 3 NaN NaN
1 c 3 NaN NaN NaN
2 e 4 4 3 NaN
3 d NaN 3 1 0
4 f NaN NaN 5 3
5 g NaN NaN NaN 4
[6 rows x 5 columns]
It works, but can this be done without a loop?
Also, why is the name of the second-to-last column 'value' rather than value_2?
P.S.
I know that in this minimal example, the result can also be achieved by:
pd.concat([item.set_index('name') for item in [data1, data2, data3, data4]], axis=1)
But in the real case, due to the way the DataFrames were constructed and the information stored in the index columns, this is not an ideal solution without additional tricks. So let's not consider this route.
Does it even make sense to merge it, then? What's wrong with a panel?
> data = [data1, data2, data3, data4]
> p = pd.Panel(dict(zip(map(str, range(len(data))), data)))
> p.to_frame().T
major 0 1 2
minor name value name value name value
0 a 1 c 3 e 4
1 a 3 d 3 e 4
2 d 1 e 3 f 5
3 d 0 f 3 g 4
# and just for kicks
> p.transpose(2, 0, 1).to_frame().reset_index().pivot_table(values='value', index='name', columns='major')
major 0 1 2 3
name
a 1 3 NaN NaN
c 3 NaN NaN NaN
d NaN 3 1 0
e 4 4 3 NaN
f NaN NaN 5 3
g NaN NaN NaN 4
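Panel has since been removed from pandas, so on modern versions a loop-free alternative is needed. One sketch: rename each frame's non-key columns up front, then fold the outer merges with functools.reduce. This also answers the value_2 question: merge only applies suffixes to columns that collide at that particular step, so data3's 'value' hit no collision and kept its bare name, and only at the next step did the ('', '_3') suffix pair fire. Renaming first sidesteps suffixes entirely:

```python
from functools import reduce

import pandas as pd

data1 = pd.DataFrame({'name': ['a', 'c', 'e'], 'value': [1, 3, 4]})
data2 = pd.DataFrame({'name': ['a', 'd', 'e'], 'value': [3, 3, 4]})
data3 = pd.DataFrame({'name': ['d', 'e', 'f'], 'value': [1, 3, 5]})
data4 = pd.DataFrame({'name': ['d', 'f', 'g'], 'value': [0, 3, 4]})
frames = [data1, data2, data3, data4]

# Tag every non-key column with its frame number so names never collide
renamed = [f.rename(columns={c: '%s_%d' % (c, i) for c in f.columns if c != 'name'})
           for i, f in enumerate(frames)]
# Fold the pairwise outer merges without writing an explicit loop
final = reduce(lambda left, right: pd.merge(left, right, on='name', how='outer'),
               renamed)
```

Note this still performs the merges pairwise under the hood; it just removes the explicit loop and the suffix bookkeeping.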