Combine two pandas DataFrames into a new one - python

I have two pandas DataFrames whose data come from different sources, but both DataFrames have the same column names. When they are combined, each source's value column needs its own name.
Like this:
speed_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 3, 4],
    'val': [5, 4, 2, 1]
})
temp_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2],
    'val': [9, 8, 7]
})
And I need to have a result like this:
final_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4],
    'speed': [5, 4, np.nan, 2, 1],
    'temp': [9, 8, 7, np.nan, np.nan]
})
Later I will deal with empty cells (here filled with NaN) by copying the values of the previous valid value. And get something like this:
final_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4],
    'speed': [5, 4, 4, 2, 1],
    'temp': [9, 8, 7, 7, 7]
})

Use pd.merge
In [406]: (pd.merge(speed_df, temp_df, how='outer', on='ts')
             .rename(columns={'val_x': 'speed', 'val_y': 'temp'})
             .sort_values(by='ts'))
Out[406]:
ts speed temp
0 0 5.0 9.0
1 1 4.0 8.0
4 2 NaN 7.0
2 3 2.0 NaN
3 4 1.0 NaN
In [407]: (pd.merge(speed_df, temp_df, how='outer', on='ts')
             .rename(columns={'val_x': 'speed', 'val_y': 'temp'})
             .sort_values(by='ts')
             .ffill())
Out[407]:
ts speed temp
0 0 5.0 9.0
1 1 4.0 8.0
4 2 4.0 7.0
2 3 2.0 7.0
3 4 1.0 7.0

Two DataFrame operations do the main work here: merge and a forward fill. Here is the code:
df = speed_df.merge(temp_df, how='outer', on='ts')
df = df.rename(columns=dict(val_x='speed', val_y='temp'))
df = df.sort_values('ts')
df = df.ffill()  # fillna(method='ffill') is deprecated; assign the result back
Hope this helps.
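For completeness, the val_x/val_y suffixes can be avoided entirely by renaming each val column before the merge. A minimal sketch of the full pipeline (variable names are illustrative, data as in the question):

```python
import pandas as pd

speed_df = pd.DataFrame({'ts': [0, 1, 3, 4], 'val': [5, 4, 2, 1]})
temp_df = pd.DataFrame({'ts': [0, 1, 2], 'val': [9, 8, 7]})

# Rename each 'val' column first, so the outer merge never has to
# disambiguate duplicate column names with _x/_y suffixes.
final_df = (speed_df.rename(columns={'val': 'speed'})
            .merge(temp_df.rename(columns={'val': 'temp'}), on='ts', how='outer')
            .sort_values('ts')
            .ffill()                    # carry the previous valid value forward
            .reset_index(drop=True))
print(final_df)
```

This reaches the second final_df from the question in one chain.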

You can do a full outer join using the pandas.merge function (how='outer' keeps the timestamps from both frames):
d = pd.merge(speed_df, temp_df, on='ts', how='outer').rename(
        columns={'val_x': 'speed', 'val_y': 'temp'})
d = d.sort_values('ts')
d['speed'] = d['speed'].fillna(4)  # hard-codes the previous valid value
d['temp'] = d['temp'].fillna(7)
That should return this:
   ts  speed  temp
0   0    5.0   9.0
1   1    4.0   8.0
4   2    4.0   7.0
2   3    2.0   7.0
3   4    1.0   7.0

Related

How to fill up each element in a list to given columns of a dataframe?

Let's say I have a dataframe as shown.
I now have a list like [6, 7, 6]. How do I fill these values into my 3 desired columns, i.e. [One, Two, Four], of the dataframe? Notice that I have not given column 'Three'. The final dataframe should look like:
You can append a Series:
df = pd.DataFrame([[2, 4, 4, 8]],
columns=['One', 'Two', 'Three', 'Four'])
values = [6, 3, 6]
lst = ['One', 'Two', 'Four']
df = df.append(pd.Series(values, index=lst), ignore_index=True)
or a dict:
df = df.append(dict(zip(lst, values)), ignore_index=True)
output:
One Two Three Four
0 2.0 4.0 4.0 8.0
1 6.0 3.0 NaN 6.0
you could do:
columnstobefilled = ["One", "Two", "Four"]
elementsfill = [6, 3, 6]
for column, element in zip(columnstobefilled, elementsfill):
    df[column] = element
Note that this assigns each scalar to the whole existing column rather than appending a new row.
Since you want the list values in specific places, you have to specify where each value should go. One way to do this is with a key-value pair object, a dictionary. Once you create it, you can use append to include it as a row in your dataframe:
d = {'One': 6, 'Two': 7, 'Four': 6}
df.append(d, ignore_index=True)
   One  Two  Three  Four
0  2.0  4.0    4.0   8.0
1  6.0  7.0    NaN   6.0
Dataset:
df = pd.DataFrame({'One': 2, 'Two': 4, 'Three': 4, 'Four': 8},
                  index=[0])
import pandas as pd
df = pd.DataFrame({'One': 2, 'Two': 4, 'Three': 4, 'Four': 8}, index=[0])
new_row = {'One': 6, 'Two': 7, 'Three': None, 'Four': 6}
df = df.append(new_row, ignore_index=True)  # append returns a new DataFrame, so assign it back
print(df)
output:
One Two Three Four
0 2.0 4.0 4.0 8.0
1 6.0 7.0 NaN 6.0
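Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On current versions, the same row-append can be sketched with pd.concat (same toy data as above; names are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[2, 4, 4, 8]], columns=['One', 'Two', 'Three', 'Four'])
values = [6, 3, 6]
cols = ['One', 'Two', 'Four']

# pd.concat replaces the removed DataFrame.append: build a one-row
# DataFrame from the partial dict and stack it under the original.
# Columns absent from the new row ('Three') come out as NaN.
row = pd.DataFrame([dict(zip(cols, values))])
df = pd.concat([df, row], ignore_index=True)
print(df)
```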

insert missing rows in df with dictionary values

Hello I have the following dataframe
df = pd.DataFrame(data={'grade_1': ['A', 'B', 'C'],
                        'grade_1_count': [19, 28, 32],
                        'grade_2': ['pass', 'fail', np.nan],
                        'grade_2_count': [39, 18, np.nan]})
whereby some grades are missing and need to be inserted into the grade_n columns according to the values in this dictionary
grade_dict = {'grade_1': ['A', 'B', 'C', 'D', 'E', 'F'],
              'grade_2': ['pass', 'fail', 'not present', 'borderline']}
and the corresponding row value in the _count column should be filled with np.nan
so the expected output is like this
expected_df = pd.DataFrame(data={'grade_1': ['A', 'B', 'C', 'D', 'E', 'F'],
                                 'grade_1_count': [19, 28, 32, 0, 0, 0],
                                 'grade_2': ['pass', 'fail', 'not present', 'borderline', np.nan, np.nan],
                                 'grade_2_count': [39, 18, 0, 0, np.nan, np.nan]})
So far I have this rather inelegant code that creates a column including all the correct grade categories, but I cannot reinsert it into the dataframe or fill the count columns with zeros (the np.nans just reflect empty cells due to coercing columns with different row lengths). I hope that makes sense. Any advice would be great. Thanks
x = []
for k, v in grade_dict.items():
    out = df[k].reindex(grade_dict[k], axis=0, fill_value=0)
    x = pd.concat([out], axis=1)
    x[k] = x.index
    x = x.reset_index(drop=True)
    df[k] = x.fillna(np.nan)
Here is a solution using two consecutive merges:
# set up combinations
from itertools import zip_longest
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
# merge
(df2.merge(df.filter(like='grade_1'), on='grade_1', how='left')
    .merge(df.filter(like='grade_2'), on='grade_2', how='left')
    .sort_index(axis=1)
)
output:
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D NaN borderline NaN
4 E NaN None NaN
5 F NaN None NaN
multiple merges:
df2 = pd.DataFrame(list(zip_longest(*grade_dict.values())), columns=grade_dict)
for col in grade_dict:
    df2 = df2.merge(df.filter(like=col), on=col, how='left')
df2
If you only need to merge on grade_1 without updating the non-NaNs of grade_2, you can cast grade_dict into a df and then use combine_first:
print(df.set_index("grade_1")
        .combine_first(pd.DataFrame(grade_dict.values(),
                                    index=grade_dict.keys()).T.set_index("grade_1"))
        .fillna({"grade_1_count": 0})
        .reset_index())
grade_1 grade_1_count grade_2 grade_2_count
0 A 19.0 pass 39.0
1 B 28.0 fail 18.0
2 C 32.0 not present NaN
3 D 0.0 borderline NaN
4 E 0.0 None NaN
5 F 0.0 None NaN
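Another possible sketch, assuming the same df and grade_dict as in the question: reindex each (grade, count) pair on its full category list, then concatenate the pairs side by side. The missing counts come out as NaN here; matching the expected output's zeros for grade_1_count would take one extra fillna. Variable names are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'grade_1': ['A', 'B', 'C'],
                   'grade_1_count': [19, 28, 32],
                   'grade_2': ['pass', 'fail', np.nan],
                   'grade_2_count': [39, 18, np.nan]})
grade_dict = {'grade_1': ['A', 'B', 'C', 'D', 'E', 'F'],
              'grade_2': ['pass', 'fail', 'not present', 'borderline']}

parts = []
for col in grade_dict:
    pair = (df[[col, col + '_count']]
            .dropna(subset=[col])        # drop rows that have no category
            .set_index(col)
            .reindex(grade_dict[col])    # insert the missing categories
            .reset_index())
    parts.append(pair)
# Shorter lists are padded with NaN rows when concatenated side by side.
out = pd.concat(parts, axis=1)
print(out)
```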

Combination of two dataframes - still show NaN values

I would like to fill my first dataframe with data from the second dataframe. Since I don't need any special condition, the combine_first function looks like the right choice for me.
Unfortunately, when I try to combine the two dataframes, the result is still the original dataframe.
My code:
import pandas as pd
df1 = pd.DataFrame({'Gen1': [5, None, 3, 2, 1],
                    'Gen2': [1, 2, None, 4, 5]})
df2 = pd.DataFrame({'Gen1': [None, 4, None, None, None],
                    'Gen2': [None, None, 3, None, None]})
df1.combine_first(df2)
Then, when I print(df1), I get df1 exactly as I initialized it on the second line.
Where did I make a mistake?
It works fine if you assign the output back; the very similar method DataFrame.update works in place:
df = df1.combine_first(df2)
print (df)
Gen1 Gen2
0 5.0 1.0
1 4.0 2.0
2 3.0 3.0
3 2.0 4.0
4 1.0 5.0
df1.update(df2)
print (df1)
Gen1 Gen2
0 5.0 1.0
1 4.0 2.0
2 3.0 3.0
3 2.0 4.0
4 1.0 5.0
combine_first returns a new dataframe containing the change instead of updating the existing dataframe, so you need to assign the returned dataframe back:
df1 = df1.combine_first(df2)
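The distinction can be demonstrated in a few lines (same toy frames as in the question): combine_first returns a new DataFrame and leaves its caller untouched, while update mutates the caller in place.

```python
import pandas as pd

df1 = pd.DataFrame({'Gen1': [5, None, 3, 2, 1],
                    'Gen2': [1, 2, None, 4, 5]})
df2 = pd.DataFrame({'Gen1': [None, 4, None, None, None],
                    'Gen2': [None, None, 3, None, None]})

# combine_first returns a new frame; df1 still has its NaN holes afterwards.
combined = df1.combine_first(df2)

# update fills df1 from df2's non-NaN values, modifying df1 in place.
df1.update(df2)
```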

flatten array of arrays json object column in a pandas dataframe

0 [{'review_id': 4873356, 'rating': '5.0'}, {'review_id': 4973356, 'rating': '4.0'}]
1 [{'review_id': 4635892, 'rating': '5.0'}, {'review_id': 4645839, 'rating': '3.0'}]
I have a situation where I want to flatten such JSON, as solved here: Converting array of arrays into flattened dataframe
But I want to create new columns so that the output is:
review_id_1 rating_1 review_id_2 rating_2
4873356 5.0 4973356 4.0
4635892 5.0 4645839 3.0
Any help is highly appreciated.
Try using:
print(pd.DataFrame(
    s.apply(lambda x: {k + str(i): v
                       for i, d in enumerate(x, 1)
                       for k, v in d.items()}).tolist()
))
Output:
rating1 rating2 review_id1 review_id2
0 5.0 4.0 4873356 4973356
1 5.0 3.0 4635892 4645839
This type of data munging tends to be manual.
# Sample data.
df = pd.DataFrame({
    'json_data': [
        [{'review_id': 4873356, 'rating': '5.0'}, {'review_id': 4973356, 'rating': '4.0'}],
        [{'review_id': 4635892, 'rating': '5.0'}, {'review_id': 4645839, 'rating': '3.0'}],
    ]
})
# Data transformation:
# Step 1: Temporary dataframe with one column per review position.
df2 = pd.DataFrame(df['json_data'].tolist())
# Step 2: Use a list comprehension to expand the records from each column, so the df now has 4 columns.
df2 = pd.concat([pd.DataFrame.from_records(df2[col]) for col in df2], axis=1)
# Step 3: Rename final columns.
df2.columns = ['review_id_1', 'rating_1', 'review_id_2', 'rating_2']
>>> df2
   review_id_1 rating_1  review_id_2 rating_2
0      4873356      5.0      4973356      4.0
1      4635892      5.0      4645839      3.0
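A variant of the first answer's comprehension that produces exactly the underscored column names asked for (a minimal sketch; s is assumed to be the Series shown in the question):

```python
import pandas as pd

s = pd.Series([
    [{'review_id': 4873356, 'rating': '5.0'}, {'review_id': 4973356, 'rating': '4.0'}],
    [{'review_id': 4635892, 'rating': '5.0'}, {'review_id': 4645839, 'rating': '3.0'}],
])

# Number the keys of each dict by its position in the list, then let
# pandas build the wide frame from the flattened dicts.
flat = pd.DataFrame(
    [{f'{key}_{i}': val for i, d in enumerate(row, 1) for key, val in d.items()}
     for row in s]
)
print(flat)
```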

pandas.DataFrame: How to align / group and sort data by index?

I'm new to pandas and still don't have a good overview about its power and how to use it. So the problem is hopefully simple :)
I have a DataFrame with a date-index and several columns (stocks and their Open and Close-prices). Here is some example data for two stocks A and B:
import pandas as pd
_ = pd.to_datetime
A_dt = [_('2018-01-04'), _('2018-01-01'), _('2018-01-05')]
B_dt = [_('2018-01-01'), _('2018-01-05'), _('2018-01-03'), _('2018-01-02')]
A_data = [(12, 11), (10, 9), (8, 9)]
B_data = [(2, 2), (3, 4), (4, 4), (5, 3)]
As you see the data is incomplete, different missing dates for each series. I want to put these data together in a single dataframe with sorted row-index dt and 4 columns (2 stocks x 2 time series each).
When I do it this way, everything works fine (except that I'd like to change the column-levels and don't know how to do it):
# MultiIndex on axis 0, then unstacking
i0_a = pd.MultiIndex.from_tuples([("A", x) for x in A_dt], names=['symbol', 'dt'])
i0_b = pd.MultiIndex.from_tuples([("B", x) for x in B_dt], names=['symbol', 'dt'])
df0_a = pd.DataFrame(A_data, index=i0_a, columns=["Open", "Close"])
df0_b = pd.DataFrame(B_data, index=i0_b, columns=["Open", "Close"])
df = pd.concat([df0_a, df0_b])
df = df.unstack('symbol') # this automatically sorts by dt.
print(df)
# Open Close
#symbol A B A B
#dt
#2018-01-01 10.0 2.0 9.0 2.0
#2018-01-02 NaN 5.0 NaN 3.0
#2018-01-03 NaN 4.0 NaN 4.0
#2018-01-04 12.0 NaN 11.0 NaN
#2018-01-05 8.0 3.0 9.0 4.0
However when I put the MultiIndex on the columns, things are different
# MultiIndex on axis 1
i1_a = pd.MultiIndex.from_tuples([("A", "Open"), ("A", "Close")], names=['symbol', 'series'])
i1_b = pd.MultiIndex.from_tuples([("B", "Open"), ("B", "Close")], names=['symbol', 'series'])
df1_a = pd.DataFrame(A_data, index=A_dt, columns=i1_a)
df1_b = pd.DataFrame(B_data, index=B_dt, columns=i1_b)
df = pd.concat([df1_a, df1_b])
print(df)
#symbol A B
#series Close Open Close Open
#2018-01-04 11.0 12.0 NaN NaN
#2018-01-01 9.0 10.0 NaN NaN
#2018-01-05 9.0 8.0 NaN NaN
#2018-01-01 NaN NaN 2.0 2.0
#2018-01-05 NaN NaN 4.0 3.0
#2018-01-03 NaN NaN 4.0 4.0
#2018-01-02 NaN NaN 3.0 5.0
Why isn't the data aligned automatically in this case, but in the other?
How can I align and sort it in the second example?
Which method would probably be faster on a large dataset (about 5000 stocks, 1000 timesteps and not only 2 series per stock (Open, Close), but about 20)? This will finally be used as input for a keras machine learning model.
Edit: With jezrael's answer I timed 3 different methods of concatenating / combining DataFrames. My first approach is the fastest, and using combine_first turns out to be an order of magnitude slower than the other methods. The size of the data is still kept very small in the example:
import timeit
setup = """
import pandas as pd
import numpy as np
stocks = 20
steps = 20
features = 10
data = []
index_method1 = []
index_method2 = []
cols_method1 = []
cols_method2 = []
df = None
for s in range(stocks):
    name = "stock{0}".format(s)
    index = np.arange(steps)
    data.append(np.random.rand(steps, features))
    index_method1.append(pd.MultiIndex.from_tuples([(name, x) for x in index], names=['symbol', 'dt']))
    index_method2.append(index)
    cols_method1.append([chr(65 + x) for x in range(features)])
    cols_method2.append(pd.MultiIndex.from_arrays([[name] * features, [chr(65 + x) for x in range(features)]], names=['symbol', 'series']))
"""
method1 = """
for s in range(stocks):
    df_new = pd.DataFrame(data[s], index=index_method1[s], columns=cols_method1[s])
    if s == 0:
        df = df_new
    else:
        df = pd.concat([df, df_new])
df = df.unstack('symbol')
"""
method2 = """
for s in range(stocks):
    df_new = pd.DataFrame(data[s], index=index_method2[s], columns=cols_method2[s])
    if s == 0:
        df = df_new
    else:
        df = df.combine_first(df_new)
"""
method3 = """
for s in range(stocks):
    df_new = pd.DataFrame(data[s], index=index_method2[s], columns=cols_method2[s])
    if s == 0:
        df = df_new.stack()
    else:
        df = pd.concat([df, df_new.stack()], axis=1)
df = df.unstack().swaplevel(0, 1, axis=1).sort_index(axis=1)
"""
print ("Multi-Index axis 0, then concat: {} s".format((timeit.timeit(method1, setup, number=1))))
print ("Multi-Index axis 1, combine_first: {} s".format((timeit.timeit(method2, setup, number=1))))
print ("Stack and then concat: {} s".format((timeit.timeit(method3, setup, number=1))))
Multi-Index axis 0, then concat: 0.134283173989 s
Multi-Index axis 1, combine_first: 5.02396191049 s
Stack and then concat: 0.272278263371 s
The problem is that the two DataFrames have different MultiIndexes in their columns, so pandas cannot align them.
The solution is to stack each to a Series, concat them into a 2-column DataFrame, then unstack; for the correct order of the MultiIndex, add swaplevel and sort_index:
df = (pd.concat([df1_a.stack(), df1_b.stack()], axis=1)
.unstack()
.swaplevel(0,1, axis=1)
.sort_index(axis=1))
print (df)
series Close Open
symbol A B A B
2018-01-01 9.0 2.0 10.0 2.0
2018-01-02 NaN 3.0 NaN 5.0
2018-01-03 NaN 4.0 NaN 4.0
2018-01-04 11.0 NaN 12.0 NaN
2018-01-05 9.0 4.0 8.0 3.0
But it is simpler to use combine_first:
df = df1_a.combine_first(df1_b)
print (df)
symbol A B
series Close Open Close Open
2018-01-01 9.0 10.0 2.0 2.0
2018-01-02 NaN NaN 3.0 5.0
2018-01-03 NaN NaN 4.0 4.0
2018-01-04 11.0 12.0 NaN NaN
2018-01-05 9.0 8.0 4.0 3.0
