Concat two Pandas DataFrame columns with different index lengths - python

How do I add columns from one Pandas DataFrame to another when the new columns have fewer rows? Specifically, I need the new column to be filled with NaN in the first few rows of the merged DataFrame instead of the last few rows. Please refer to the picture. Thanks.

Use:
import pandas as pd

df1 = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
})
df2 = pd.DataFrame({
    'SMA': list('rty')
})
df3 = df1.join(df2.set_index(df1.index[-len(df2):]))
Or:
df3 = pd.concat([df1, df2.set_index(df1.index[-len(df2):])], axis=1)
print (df3)
A B SMA
0 a 4 NaN
1 b 5 NaN
2 c 4 NaN
3 d 5 r
4 e 5 t
5 f 4 y
How it works:
First, the last len(df2) index values of df1 are selected:
print (df1.index[-len(df2):])
RangeIndex(start=3, stop=6, step=1)
Then the existing index of df2 is overwritten with these values by DataFrame.set_index:
print (df2.set_index(df1.index[-len(df2):]))
SMA
3 r
4 t
5 y
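The two steps above can be wrapped in a small function. This is only a sketch: join_tail_aligned is a hypothetical helper name, not a pandas function, and it assumes the shorter frame should line up with the last rows of the longer one.

```python
import pandas as pd

def join_tail_aligned(left, right):
    """Hypothetical helper: align `right` with the last len(right) rows of `left`."""
    if len(right) > len(left):
        raise ValueError("right must not have more rows than left")
    # Index labels of the last len(right) rows of left.
    tail_index = left.index[len(left) - len(right):]
    # Relabel the rows of right with those labels, then join on the index.
    return left.join(right.set_axis(tail_index, axis=0))

df1 = pd.DataFrame({'A': list('abcdef'), 'B': [4, 5, 4, 5, 5, 4]})
df2 = pd.DataFrame({'SMA': list('rty')})

df3 = join_tail_aligned(df1, df2)
print(df3)
```

Slicing with len(left) - len(right) instead of -len(right) avoids selecting the whole index when right is empty.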

Related

Merging df in python

Say I have two DataFrames
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
I want to merge so that any values in df1 are overwritten if there is a value in df2 at that location, and any new values in df2 are added, including the new rows and columns.
The result should be:
A B C
0 1 3 nan
1 2 8 10
2 nan 9 11
I've tried combine_first, but that causes only NaN values to be overwritten.
update has the issue where new rows are created rather than overwritten.
merge has many issues.
I've tried writing my own function
def take_right(df1, df2, j, i):
    print (df1)
    print (df2)
    try:
        s1 = df1[j][i]
    except:
        s1 = np.NaN
    try:
        s2 = df2[j][i]
    except:
        s2 = np.NaN
    if math.isnan(s2):
        #print(s1)
        return s1
    else:
        # print(s2)
        return s2
def combine_df(df1, df2):
    rows = (set(df1.index.values.tolist()) | set(df2.index.values.tolist()))
    #print(rows)
    columns = (set(df1.columns.values.tolist()) | set(df2.columns.values.tolist()))
    #print(columns)
    df = pd.DataFrame()
    #df.columns = columns
    for i in rows:
        #df[:][i]=[]
        for j in columns:
            df = df.insert(int(i), j, take_right(df1,df2,j,i), allow_duplicates=False)
            # print(df)
    return df
This won't add new columns or rows to an empty DataFrame.
Thank you!!
One approach is to create an empty output DataFrame with the union of the columns and indices of df1 and df2, and then use the DataFrame.update method to copy their values into out_df:
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
out_df = pd.DataFrame(
columns = df1.columns.union(df2.columns),
index = df1.index.union(df2.index),
)
out_df.update(df1)
out_df.update(df2)
out_df
Why does combine_first not work?
df = df2.combine_first(df1)
print(df)
Output:
A B C
0 1.0 3 NaN
1 2.0 8 10.0
2 NaN 9 11.0
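The order of the two frames matters here: the calling frame's values take priority, and the argument only fills positions that are missing or NaN in the caller. A minimal check against the result requested in the question (dtypes are ignored in the comparison, since they can vary between pandas versions):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({'B': [8, 9], 'C': [10, 11]}, index=[1, 2])

# df2 is the caller, so its values win wherever both frames have one.
result = df2.combine_first(df1)

# The result requested in the question.
expected = pd.DataFrame(
    {'A': [1, 2, np.nan], 'B': [3, 8, 9], 'C': [np.nan, 10, 11]},
    index=[0, 1, 2],
)
pd.testing.assert_frame_equal(result, expected, check_dtype=False)
print(result)
```

Calling df1.combine_first(df2) instead would keep the 4 from df1 at row 1, column B, which is the behaviour described in the question.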

Pandas Dataframe convert column of lists to multiple columns

I am trying to convert a dataframe that has lists of various sizes, for example something like this:
d={'A':[1,2,3],'B':[[1,2,3],[3,5],[4]]}
df = pd.DataFrame(data=d)
df
to something like this:
d1={'A':[1,2,3],'B-1':[1,0,0],'B-2':[1,0,0],'B-3':[1,1,0],'B-4':[0,0,1],'B-5':[0,1,0]}
df1 = pd.DataFrame(data=d1)
df1
Thank you for the help
Explode the lists, then apply get_dummies and sum over the original index (use max [credit to @JonClements] if you want true dummies rather than counts, in case a value can appear more than once in a list). Then join the result back. Note that sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent:
dfB = pd.get_dummies(df['B'].explode()).groupby(level=0).sum().add_prefix('B-')
#dfB = pd.get_dummies(df['B'].explode()).groupby(level=0).max().add_prefix('B-')
df = pd.concat([df['A'], dfB], axis=1)
# A B-1 B-2 B-3 B-4 B-5
#0 1 1 1 1 0 0
#1 2 0 0 1 0 1
#2 3 0 0 0 1 0
You can use pop to remove the column you explode so you don't need to specify df[list_of_all_columns_except_B] in the concat:
df = pd.concat([df, pd.get_dummies(df.pop('B').explode()).groupby(level=0).sum().add_prefix('B-')],
               axis=1)
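The difference between summing and taking the maximum only shows up when a list contains the same value more than once. A small sketch with made-up data (not the data from the question) where the value 1 appears twice in the first list:

```python
import pandas as pd

# Hypothetical data: the value 1 appears twice in the first list.
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 1, 2], [2]]})

# One indicator row per list element; the index repeats the original row label.
dummies = pd.get_dummies(df['B'].explode(), dtype=int)

counts = dummies.groupby(level=0).sum().add_prefix('B-')  # occurrences per row
flags = dummies.groupby(level=0).max().add_prefix('B-')   # 0/1 presence per row

print(counts)
print(flags)
```

Here counts holds 2 in column B-1 for the first row, while flags holds 1.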

Python: sample from dataframe, storing the non-sampled

I have a pandas DataFrame.
Say I want to sample two persons of each group, I use the following code to get a new dataframe:
sample_df = df.groupby("category").apply(lambda group_df: group_df.sample(2, random_state=1234))
I would like to create a dataframe where the non-sampled persons are stored.
The sample_df still has the indices of the original df, so I probably have to do something with that, but I'm not sure what...
Thanks in advance!
First, add group_keys=False to groupby so that category is not added as a level of a MultiIndex:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'category':list('aaabbb')
})
sample_df = (df.groupby("category", group_keys=False)
.apply(lambda group_df: group_df.sample(2, random_state=1234)))
print(sample_df)
A B category
0 a 4 a
1 b 5 a
3 d 5 b
4 e 5 b
The non-sampled rows can then be selected from the original DataFrame by boolean indexing: build a mask with Index.isin and invert it with ~:
non_sample_df = df[~df.index.isin(sample_df.index)]
print(non_sample_df)
A B category
2 c 4 a
5 f 4 b
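A shorter variant of the same idea, as a sketch: it assumes pandas 1.1 or later (for GroupBy.sample) and a unique index on df, so that dropping the sampled labels removes exactly the sampled rows.

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'category': list('aaabbb'),
})

# GroupBy.sample keeps the original index labels of the sampled rows.
sample_df = df.groupby('category').sample(n=2, random_state=1234)

# Dropping the sampled labels leaves the non-sampled rows.
non_sample_df = df.drop(sample_df.index)

print(sample_df)
print(non_sample_df)
```

Which rows end up in the sample depends on random_state, but the two frames always partition the original rows.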

Pandas - fill new column with values from following day

In the following dataframe
#Create data
data = {'Day': [1,1,2,2,3,3],
'Where': ['A','B','A','B','B','B'],
'What': ['x','y','x','x','x','y'],
'Dollars': [100,200,100,100,100,200]}
index = range(len(data['Day']))
columns = ['Day','Where','What','Dollars']
df = pd.DataFrame(data, index=index, columns=columns)
df
I would like to add a column with the future values. In this case, the first value should be 100 as on day 2 at A x was sold for 100 dollars. The complete column should contain the values 100, None, None, 100, None, None.
I thought that I could index the cells in the following way
df2 = df
df2['Tomorrow_Dollars'] = df[df.Day == df2.Day+1,'Dollars']
but this throws the following error
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Is there a solution to this or a smarter approach?
The idea is to create the missing combinations by reindexing with MultiIndex.from_product, then reshape with unstack so that each Day is a single row, which makes shift possible. Finally, reshape back and join to add the new column:
df1 = df.set_index(['Day','Where','What'])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
s = df1.reindex(mux)['Dollars'].unstack([1,2]).shift(-1).unstack().rename('Tomorrow_Dollars')
df = df.join(s, on=['Where','What','Day'])
print (df)
Day Where What Dollars Tomorrow_Dollars
0 1 A x 100 100.0
1 1 B y 200 NaN
2 2 A x 100 NaN
3 2 B x 100 100.0
4 3 B x 100 NaN
5 3 B y 200 NaN
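An alternative sketch that avoids reshaping is to shift the Day labels in a copy and merge it back. It assumes days are consecutive integers and that each (Day, Where, What) combination occurs at most once; the name next_day is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Day': [1, 1, 2, 2, 3, 3],
    'Where': ['A', 'B', 'A', 'B', 'B', 'B'],
    'What': ['x', 'y', 'x', 'x', 'x', 'y'],
    'Dollars': [100, 200, 100, 100, 100, 200],
})

# Relabel each row with the previous day, so that it lines up with "today".
next_day = df.rename(columns={'Dollars': 'Tomorrow_Dollars'})
next_day['Day'] = next_day['Day'] - 1

# A left merge keeps every original row; rows without a match get NaN.
out = df.merge(next_day, on=['Day', 'Where', 'What'], how='left')
print(out)
```

The new column matches the one requested in the question: 100, NaN, NaN, 100, NaN, NaN.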

reshape a pandas dataframe

suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe which looks like:
what does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
of course I could loop over the data and make a new list of list but there must be a better way. Any ideas ?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables don't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that, unlike melt, it must have an identification variable, i. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape, and numpy.repeat for the id column:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len (df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, and it is possible that it will be removed (along with pd.wide_to_long).
A possible solution is merging all three functions into one, maybe melt, but this is not implemented yet. Maybe it will be in some new version of pandas. Then my answer will be updated.
I solved this in 3 steps:
1. Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
2. Delete the data from df that will be added below (and that was used to make df2).
3. Append df2 to df.
Like so:
# step 1: create new dataframe
df2 = df[['A1', 'B1']]
df2.columns = ['A', 'B']
# step 2: delete that data from original
df = df.drop(["A1", "B1"], axis=1)
# step 3: append (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, df2], ignore_index=True)
Note that you need to specify ignore_index=True so the appended rows get new index labels rather than keeping their old ones.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
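The reshape attempt from the question can also be made to work in NumPy by splitting the columns into pairs first and then moving the pair axis to the front. This is a sketch that assumes the columns are ordered pair by pair (A, B, A1, B1, ...) and share one dtype:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=['A', 'B', 'A1', 'B1'])

n_rows = len(df)
n_groups = df.shape[1] // 2  # number of (A, B) column pairs

stacked = (df.to_numpy()
             .reshape(n_rows, n_groups, 2)  # axes: row, pair, column within pair
             .transpose(1, 0, 2)            # axes: pair, row, column within pair
             .reshape(-1, 2))               # stack the pairs on top of each other

out = pd.DataFrame(stacked, columns=['A', 'B'])
out['id'] = np.repeat(np.arange(1, n_groups + 1), n_rows)
print(out)
```

The transpose is the step missing from the order='F' attempt: it puts all rows of the first pair before any row of the second pair.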
