I am doing the following with two dataframes, but it generates duplicate rows and the result is not ordered like the first dataframe.
import pandas as pd
dict1 = {
"time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
"value": [10, 20, 30, 40, 50, 60]
}
dict2 = {
"time": ["15:09", "15:09", "15:10"],
"counts": ["fg", "mn", "gl"],
"growth": [1, 3, 6]
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df1["time"] = df1["time"].str[:-4]
result = pd.merge(df1, df2, on="time", how="left")
This generates a result of 8 rows! I am stripping the last 4 characters (the .xxx millisecond part) from the time column in df1 so that it matches the time values in df2.
time value counts growth
0 15:09 10 fg 1.0
1 15:09 10 mn 3.0
2 15:09 20 fg 1.0
3 15:09 20 mn 3.0
4 15:10 30 gl 6.0
5 15:11 40 NaN NaN
6 15:12 50 NaN NaN
7 15:12 60 NaN NaN
There are duplicated rows because of the join.
Is it possible to join the dataframes on the time column while keeping df1's order and its finer time granularity? Is there a way to partially match the time values of the two dataframes and merge them? The ideal result would look like the following:
time value counts growth
0 15:09.123 10 fg 1.0
1 15:09.234 20 mn 3.0
2 15:10.123 30 gl 6.0
3 15:11.123 40 NaN NaN
4 15:12.123 50 NaN NaN
5 15:12.987 60 NaN NaN
Here is one way to do it.
Assumption: for any time value without the millisecond part, df1 and df2 contain the same number of rows.
# create a time column without the millisecond part
df1['time2'] = df1['time'].str[:-4]
# add a sequence number when there are multiple rows for a given time
df1['seq'] = df1.groupby('time2')['time2'].cumcount()
# do the same for df2
df2['seq'] = df2.groupby('time').cumcount()
# merge on the stripped time plus the sequence number
pd.merge(df1,
         df2,
         left_on=['time2', 'seq'],
         right_on=['time', 'seq'],
         how='left',
         suffixes=(None, '_y')).drop(columns=['time2', 'seq'])
time value time_y counts growth
0 15:09.123 10 15:09 fg 1.0
1 15:09.234 20 15:09 mn 3.0
2 15:10.123 30 15:10 gl 6.0
3 15:11.123 40 NaN NaN NaN
4 15:12.123 50 NaN NaN NaN
5 15:12.987 60 NaN NaN NaN
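For reference, this is what the helper columns look like on the example data before the merge (a quick check, computed from the code above):

print(df1[['time', 'time2', 'seq']])
#         time  time2  seq
# 0  15:09.123  15:09    0
# 1  15:09.234  15:09    1
# 2  15:10.123  15:10    0
# 3  15:11.123  15:11    0
# 4  15:12.123  15:12    0
# 5  15:12.987  15:12    1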
Merge on column 'time' with preserved order
Assumption: Data from df1 and df2 are in order of occurrence
import pandas as pd
dict1 = {
"time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
"value": [10, 20, 30, 40, 50, 60]
}
dict2 = {
"time": ["15:09", "15:09", "15:11"],
"counts": ["fg", "mn", "gl"],
"growth": [1, 3, 6]
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df1["time"] = df1["time"].str[:-4]
df1_keys = df1["time"].unique()
df_list = list()
for key in df1_keys:
    tmp_df1 = df1[df1["time"] == key]
    tmp_df1 = tmp_df1.reset_index(drop=True)
    tmp_df2 = df2[df2["time"] == key]
    tmp_df2 = tmp_df2.reset_index(drop=True)
    df_list.append(pd.merge(tmp_df1, tmp_df2, left_index=True, right_index=True, how="left"))
print(pd.concat(df_list, axis = 0))
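A small usage note: pd.concat keeps each group's 0-based index, so the combined frame ends up with repeated index labels; pass ignore_index=True if you want a clean sequential index:

print(pd.concat(df_list, axis=0, ignore_index=True))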
Related
I am trying to compare two pandas dataframes containing old and new values. Some of the ids have been deleted in the new data and some have been added, so the indexes are not identical. I am able to concatenate the dataframes using the ids as indexes, columns as the axis, and 'first' and 'second' as the keys, but cannot find a way to preserve the order of both dataframes. The order of data1 is preserved but I would like the order of data2 to also be preserved.
What I tried:
import pandas as pd

data1 = {'id': [1, 2, 3, 4, 5], 'value': [10, 25, 12, 100, 26]}
data2 = {'id': [1, 2, 6, 4, 5], 'value': [10, 24, 48, 100, 60]}
df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df_both = pd.concat([df.set_index('id'), df2.set_index('id')],
                    axis='columns', keys=['First', 'Second'])
print(df_both)
Resulting df_both:
First Second
value value
id
1 10.0 10.0
2 25.0 24.0
3 12.0 NaN
4 100.0 100.0
5 26.0 60.0
6 NaN 48.0
id 6 was placed at the bottom of the dataframe because the index did not exist in data1.
What I want as the resulting df_both:
First Second
value value
id
1 10.0 10.0
2 25.0 24.0
3 12.0 NaN
6 NaN 48.0
4 100.0 100.0
5 26.0 60.0
I would like for the deleted and new rows to maintain their position.
There is no built-in way to do that, but you can build the desired index yourself and reindex:
idx = pd.Index(pd.concat([df['id'], df2['id']]).sort_index(kind='stable').drop_duplicates())
df_both = (pd.concat([df.set_index('id'),
                      df2.set_index('id')],
                     axis=1, keys=['First', 'Second'])
             .reindex(idx))
print(df_both)
# Output
First Second
value value
id
1 10.0 10.0
2 25.0 24.0
3 12.0 NaN
6 NaN 48.0
4 100.0 100.0
5 26.0 60.0
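To make the trick explicit: the concatenated ids are stable-sorted by their original row positions, so the first occurrence of each id keeps its place:

print(list(idx))
# [1, 2, 3, 6, 4, 5]  -> id 6 slots in where it sat in df2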
You can add a temporary column holding each row's position within its own dataframe, use it to work out the combined id order, and then reindex the concatenated result.
data1 = {'id': [1, 2, 3, 4, 5], 'value': [10, 25, 12, 100, 26]}
data2 = {'id': [1, 2, 6, 4, 5], 'value': [10, 24, 48, 100, 60]}
df = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# temporary column: each row's position within its own dataframe
df['temp'] = range(len(df))
df2['temp'] = range(len(df2))

# id order across both frames, keeping the first position at which each id appears
order = (pd.concat([df[['id', 'temp']], df2[['id', 'temp']]])
           .sort_values('temp', kind='stable')
           .drop_duplicates('id')['id'])

df_both = (pd.concat([df.drop(columns='temp').set_index('id'),
                      df2.drop(columns='temp').set_index('id')],
                     axis='columns', keys=['First', 'Second'])
             .reindex(order))
print(df_both)
I have a dataframe like this:
import pandas as pd
data1 = {
"siteID": [1, 2, 3, 1, 2, 'nan', 'nan', 'nan'],
"date": [42, 30, 43, 29, 26, 34, 10, 14],
}
df = pd.DataFrame(data1)
But I want to delete any duplicates in siteID, keeping only the most up-to-date value AND keeping all 'nan' values.
I get close with this code:
df_no_dup = df.sort_values('date').drop_duplicates('siteID', keep='last')
which only keeps the row with the highest date value for each siteID. The issue is that most of the rows with 'nan' for siteID are also removed, whereas I want the deduplication to ignore them entirely. Is there any way to keep all the rows where siteID is equal to 'nan'?
Expected output:
siteID date
nan 10
nan 14
2 30
nan 34
1 42
3 43
I would use df.duplicated to build a custom condition, like this:
df.drop(df[df.sort_values('date').duplicated('siteID', keep='last') & (df.siteID!='nan')].index)
Result
siteID date
0 1 42
1 2 30
2 3 43
5 nan 34
6 nan 10
7 nan 14
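An equivalent formulation, in case it reads more naturally (a sketch, assuming siteID really holds the literal string 'nan' as in the question): keep a row if its siteID is 'nan' or if it carries the latest date for its siteID.

latest = df.groupby('siteID', sort=False)['date'].transform('max')
mask = (df['siteID'] == 'nan') | (df['date'] == latest)
print(df[mask].sort_values('date'))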
I am trying to merge multiple dataframes so that each following column starts at a specific index. For example, as you can see in the code below, I have 15 sets of data, from df20 to df90, and I tried merging each one with a copy of itself shifted to start 1,000 rows later.
So I want my output to be df20, followed by df25 starting at index 1000, then df30 starting at index 2000, then df35 at index 3000, and so on. I want to see all 15 columns, but my output only contains the last dataframe.
I have tried the code below, but it doesn't seem to work. Please help.
dframe = [df20, df25, df30, df35, df40, df45, df50, df55, df60, df65, df70, df75, df80, df85, df90]
for i in dframe:
    a = i.merge(i.set_index(i.index + 1000), how='outer', left_index=True, right_index=True)
print(a)
Output:
df90_x df90_y
0 0.000757 NaN
1 0.001435 NaN
2 0.002011 NaN
3 0.002497 NaN
4 0.001723 NaN
... ... ...
10995 NaN 1.223000e-12
10996 NaN 1.305000e-12
10997 NaN 1.809000e-12
10998 NaN 2.075000e-12
10999 NaN 2.668000e-12
[11000 rows x 2 columns]
Expected Output:
df20 df25 df30
0 0.000757 0 0
1 0.001435 0 0
2 0.002011 0 0
3 0.002497 0 0
4 0.001723 0 0
... ... ... ...
1000 1.223000e-12 0
1001 1.305000e-12 0
1002 1.809000e-12 0
1003 2.668000e-12 0
... ...
2000 0.1234
2001 0.4567
2002 0.8901
2003 0.2345
You can try this code if you want num_dataframe and len_dataframe to be variables:
import pandas as pd
import random
dframe = list()
num_dataframe = 3
len_dataframe = 5
for i in range(num_dataframe):
    dframe.append(pd.DataFrame({i: [random.randrange(1, 50, 1) for i in range(len_dataframe)]},
                               index=range(i*len_dataframe, (i+1)*len_dataframe)))
result = pd.concat([dframe[i] for i in range(num_dataframe)], axis=1)
result.fillna(0)
output:
And for your question, where you want 15 dataframes of length 1000 each, you can try this:
import pandas as pd
import numpy as np
dframe = list()
num_dataframe = 15
len_dataframe = 1000
for i in range(num_dataframe):
    dframe.append(pd.DataFrame({i: [np.random.random() for i in range(len_dataframe)]},
                               index=range(i*len_dataframe, (i+1)*len_dataframe)))
result = pd.concat([dframe[i] for i in range(num_dataframe)], axis=1)
result.fillna(0)
output:
As you mentioned in the comment, I have edited the post and added this code:
dframe = [df20, df25, df30, df35, df40, df45, df50, df55, df60, df65, df70, df75, df80, df85, df90]
result = pd.concat([dframe[i] for i in range(len(dframe))], axis=0, ignore_index=True)
result.fillna(0)
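An equivalent sketch that makes the 1,000-row offsets explicit (assuming each dfXX has a default RangeIndex and its own distinctly named column) is to shift each frame's index before concatenating along the columns:

dframe = [df20, df25, df30, df35, df40, df45, df50, df55, df60, df65, df70, df75, df80, df85, df90]
shifted = [d.set_index(d.index + i * 1000) for i, d in enumerate(dframe)]
result = pd.concat(shifted, axis=1).fillna(0)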
Please refer to the official pandas documentation for concat.
Concat multiple dataframes
df1 = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "A3"]},
    index=[0, 1, 2, 3]
)
df2 = pd.DataFrame(
    {"B": ["B4", "B5"]},
    index=[4, 5]
)
df3 = pd.DataFrame(
    {"C": ["C6", "C7", "C8", "C9", "C10"]},
    index=[6, 7, 8, 9, 10]
)
result = pd.concat([df1, df2, df3], axis=1)
display(result)
Output:
A B C
0 A0 NaN NaN
1 A1 NaN NaN
2 A2 NaN NaN
3 A3 NaN NaN
4 NaN B4 NaN
5 NaN B5 NaN
6 NaN NaN C6
7 NaN NaN C7
8 NaN NaN C8
9 NaN NaN C9
10 NaN NaN C10
Import files into a list via looping
method 1:
You can create a list of all the filenames and read each one into a dataframe:
filenames = ['sample_20.csv', 'sample_25.csv', 'sample_30.csv', ...]
dataframes = [pd.read_csv(f) for f in filenames]
method 1-1:
If you have lots of files, you need a faster way to build the name list:
filenames = ['sample_{}.csv'.format(i) for i in range(20, 95, 5)]
dataframes = [pd.read_csv(f) for f in filenames]
method 2:
from glob import glob
filenames = glob('sample*.csv')
dataframes = [pd.read_csv(f) for f in filenames]
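One caveat worth adding: glob returns the names in arbitrary filesystem order, so sort them if the frames must stay in df20..df90 order:

filenames = sorted(glob('sample*.csv'))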
I have created a data frame
data = [['Nan', 10], [4, 'Nan'], ['Nan', 12], ['Nan', 13], [5, 'Nan'], [6, 'Nan'], [7, 'Nan'], ['Nan', 8]]
df = pd.DataFrame(data, columns = ['min', 'max'])
print(df)
my dataset looks like,
min max
Nan 10
4 Nan
Nan 12
Nan 13
5 Nan
6 Nan
7 Nan
Nan 8
I want to create a new column which takes one value from min, then one value from max, and so on. If there are two consecutive min/max values (for example, 12 and 13 are two consecutive max values), I should take only the first one (use 12, then move on to select a min).
In short,
the new column should have one min value row, then one max value row, and so on.
OUTPUT should be
combined
10
4
12
5
8
You can use .where() to set a min/max value to NaN when the previous row already had a non-NaN value in that column, then remove the rows where both min and max are NaN, and finally fill the NaN values in min with the value of max in each row using .combine_first():
import numpy as np

df = df.replace('Nan', np.nan)
df['min'] = df['min'].where(df['min'].shift().isna())
df['max'] = df['max'].where(df['max'].shift().isna())
df = df.dropna(how='all')
df['combined'] = df['min'].combine_first(df['max'])
Result:
print(df)
min max combined
0 NaN 10.0 10.0
1 4.0 NaN 4.0
2 NaN 12.0 12.0
4 5.0 NaN 5.0
7 NaN 8.0 8.0
Stack the dataframe to reshape it into a MultiIndex Series, then reset the level-1 index, and use boolean indexing to keep only the rows where a min is followed by a max or vice versa:
s = df[df != 'Nan'].stack().reset_index(name='combined', level=1)
m = s['level_1'] != s['level_1'].shift()
s[m].drop(columns='level_1')
combined
0 10.0
1 4.0
2 12.0
4 5.0
7 8.0
What you can do is define a starting key for the first value you want to include, for example 'max', then iterate through the DataFrame, appending values while switching the key. At the same time, you have to check for NaN values, since you have a lot of those:
import numpy as np

# the frame holds the string 'Nan'; turn those into real NaN so the check below works
df = df.replace('Nan', np.nan)

combined = []
key = 'max'
for index, row in df.iterrows():
    if row[key] == row[key]:  # NaN != NaN, so this is only True for real values
        combined.append(row[key])
        key = 'min' if key == 'max' else 'max'
Here, I have just hardcoded in the first value, but if you do not want to do that you can just check which column in the first row has an actual value that is not 'NaN' and then make that the key.
Note: I have added the data to a list, because I am not sure how you plan to include this as a column when the lengths will be different.
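For the question's data, the loop above ends with the following list (values may come back as floats after the NaN replacement):

print(combined)
# [10, 4, 12, 5, 8]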
If my assumptions are correct, then this should work:
The values are the literal string 'Nan', not np.nan.
If the min column has 'Nan', then the max column has a number, and vice versa; no row has two numbers.
import numpy as np
import pandas as pd
data = [['Nan', 10], [4, 'Nan'], ['Nan', 12], ['Nan', 13], [5, 'Nan'], [6, 'Nan'], [7, 'Nan'], ['Nan', 8]]
df = pd.DataFrame(data, columns = ['min', 'max'])
df['combined'] = np.where(df['min']!='Nan', df['min'], df['max'])
This is the output I get
min max combined
0 Nan 10 10
1 4 Nan 4
2 Nan 12 12
3 Nan 13 13
4 5 Nan 5
5 6 Nan 6
6 7 Nan 7
7 Nan 8 8
I have a table which contains intervals
dfa = pd.DataFrame({'Start': [0, 101, 666], 'Stop': [100, 200, 1000]})
I have another table which contains timestamps and values
dfb = pd.DataFrame({'Timestamp': [102, 145, 113], 'ValueA': [1, 2, 21],
'ValueB': [1, 2, 21]})
I need to create a dataframe the same size as dfa, with added columns containing the result of some aggregation of ValueA/ValueB over all the rows in dfb whose Timestamp falls between Start and Stop.
So if I define my aggregation as
{'ValueA':[np.nanmean,np.nanmin],
'ValueB':[np.nanmax]}
my desired output would be:
ValueA           ValueB
nanmean  nanmin  nanmax  Start  Stop
NaN      NaN     NaN     0      100
8        1       21      101    200
NaN      NaN     NaN     666    1000
Use merge to do a cross join, with helper columns created by assign:
import numpy as np
import pandas as pd

d = {'ValueA': [np.nanmean, np.nanmin],
     'ValueB': [np.nanmax]}
df = dfa.assign(A=1).merge(dfb.assign(A=1), on='A', how='outer')
Then filter by Start and Stop and aggregate by dictionary:
df = (df[(df.Timestamp >= df.Start) & (df.Timestamp <= df.Stop)]
        .groupby(['Start', 'Stop']).agg(d))
Flatten MultiIndex by map with join:
df.columns = df.columns.map('_'.join)
print (df)
ValueA_nanmean ValueA_nanmin ValueB_nanmax
Start Stop
101 200 8 1 21
And last join to original:
df = dfa.join(df, on=['Start','Stop'])
print (df)
Start Stop ValueA_nanmean ValueA_nanmin ValueB_nanmax
0 0 100 NaN NaN NaN
1 101 200 8.0 1.0 21.0
2 666 1000 NaN NaN NaN
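On pandas 1.2 or newer, the helper column is not needed, since merge supports a cross join directly; the rest of the solution stays the same:

df = dfa.merge(dfb, how='cross')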
EDIT:
Solution with cut:
d = {'ValueA':[np.nanmean,np.nanmin],
'ValueB':[np.nanmax]}
#if not default index create it
dfa = dfa.reset_index(drop=True)
print (dfa)
Start Stop
0 0 100
1 101 200
2 666 1000
#add to bins first value of Start
bins = np.insert(dfa['Stop'].values, 0, dfa.loc[0, 'Start'])
print (bins)
[ 0 100 200 1000]
#binning
dfb['id'] = pd.cut(dfb['Timestamp'], bins=bins, labels = dfa.index)
print (dfb)
Timestamp ValueA ValueB id
0 102 1 1 1
1 145 2 2 1
2 113 21 21 1
#aggregate and flatten
df = dfb.groupby('id').agg(d)
df.columns = df.columns.map('_'.join)
#add to dfa
df = pd.concat([dfa, df], axis=1)
print (df)
Start Stop ValueA_nanmean ValueA_nanmin ValueB_nanmax
0 0 100 NaN NaN NaN
1 101 200 8.0 1.0 21.0
2 666 1000 NaN NaN NaN
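One caveat with the cut approach: the bins are contiguous, so a timestamp falling between two intervals (say 400, which is after Stop=200 but before Start=666) still lands in a bin. If that matters, here is a sketch (my addition, assuming the intervals do not overlap) that maps timestamps to the exact interval with an IntervalIndex:

# map each Timestamp to the interval that actually contains it (-1 = no interval)
ii = pd.IntervalIndex.from_arrays(dfa['Start'], dfa['Stop'], closed='both')
dfb['id'] = ii.get_indexer(dfb['Timestamp'])

# aggregate only the matched rows and attach the result to dfa by position
df = dfb[dfb['id'] != -1].groupby('id').agg(d)
df.columns = df.columns.map('_'.join)
df = dfa.join(df)
print(df)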