Pandas merge adding column - python

I have two dataframes, df1 and df2. df1 contains the columns subject_id and time, and df2 contains the columns subject_id and final_time. What I want to do is, for every subject_id in df1, add a column with the final_time from df2, but only for the subject_ids contained in df1. I have tried df1.merge(df2, how='left') but still get all of the subject_ids from df2, which is much longer and contains many duplicates of subject_id.
Example of what I am looking for:
df1
subject_id time
0 15 12:00
1 20 12:05
2 21 12:10
3 25 12:00
df2
subject_id final_time
0 15 12:30
1 15 12:30
2 15 12:30
3 20 12:45
4 20 12:45
5 21 12:50
6 25 1:00
7 25 1:00
8 25 1:00
What I am looking for
subject_id time final_time
0 15 12:00 12:30
1 20 12:05 12:45
2 21 12:10 12:50
3 25 12:00 1:00

You should use
df1.merge(df2, on='subject_id')
The default for how is 'inner', which only keeps entries whose key appears in both dataframes. on tells merge to match only on the column you are interested in.
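For reference, a minimal sketch using the example frames from the question. df2 repeats each subject_id, so the sketch also drops duplicates first so that the result has exactly one row per id (an extra step beyond the bare merge):
import pandas as pd
df1 = pd.DataFrame({'subject_id': [15, 20, 21, 25],
                    'time': ['12:00', '12:05', '12:10', '12:00']})
df2 = pd.DataFrame({'subject_id': [15, 15, 15, 20, 20, 21, 25, 25, 25],
                    'final_time': ['12:30', '12:30', '12:30', '12:45', '12:45',
                                   '12:50', '1:00', '1:00', '1:00']})
# drop the repeated subject_id rows in df2, then the inner merge keeps
# only the ids present in both frames, one row each
df1.merge(df2.drop_duplicates('subject_id'), on='subject_id')
which gives:
subject_id time final_time
0 15 12:00 12:30
1 20 12:05 12:45
2 21 12:10 12:50
3 25 12:00 1:00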

Works for me. Nothing in the result that isn't in df1:
df1 = pd.DataFrame(dict(subject_id=[1, 2, 3], time=[9, 8, 7]))
df2 = pd.DataFrame(dict(subject_id=[2, 2, 4], final_time=[6, 5, 4]))
df1.merge(df2, 'left')
subject_id time final_time
0 1 9 NaN
1 2 8 6.0
2 2 8 5.0
3 3 7 NaN

Related

Merging columns with the same name using pandas

I have the following data in CSV file:
time conc time conc time conc time conc
1:00 10 5:00 11 9:00 55 13:00 1
2:00 13 6:00 8 10:00 6 14:00 4
3:00 9 7:00 7 11:00 8 15:00 3
4:00 8 8:00 1 12:00 11 16:00 8
And I just wanted to merge them together as:
time conc
1:00 10
2:00 13
3:00 9
4:00 8
...
16:00 8
I've got more than 1,000 columns, but I'm new to pandas, so I'm just wondering how I can achieve this.
One approach is to cut the dataframe into two-column slices, then re-combine them using pd.concat() after renaming.
First load the dataframe normally:
df = pd.read_csv('time_conc.csv')
df
Which looks something like the below. Notice that pd.read_csv() has added a suffix to the duplicate column names:
time conc time.1 conc.1 time.2 conc.2 time.3 conc.3
0 1:00 10 5:00 11 9:00 55 13:00 1
1 2:00 13 6:00 8 10:00 6 14:00 4
2 3:00 9 7:00 7 11:00 8 15:00 3
3 4:00 8 8:00 1 12:00 11 16:00 8
Then slice using pd.DataFrame.iloc:
total_columns = len(df.columns)
columns_per_set = 2
column_sets = [df.iloc[:,set_start:set_start + columns_per_set].copy() for set_start in range(0, total_columns, columns_per_set)]
column_sets is now a list holding each pair of duplicate columns as a separate dataframe. Next, loop through the list to rename the columns back to the original:
for s in column_sets:
    s.columns = ['time', 'conc']
This modifies each two-column dataframe in place so that their column names match.
Finally, use pd.concat() to combine them by matching the column axis:
new_df = pd.concat(column_sets, axis=0, sort=False)
new_df
Which gives you the full two columns:
time conc
0 1:00 10
1 2:00 13
2 3:00 9
3 4:00 8
0 5:00 11
1 6:00 8
2 7:00 7
3 8:00 1
0 9:00 55
1 10:00 6
2 11:00 8
3 12:00 11
0 13:00 1
1 14:00 4
2 15:00 3
3 16:00 8
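As a small optional follow-up: pd.concat() keeps each slice's original 0-3 row labels, which is why the index above repeats. If a clean sequential index is preferred, reset it afterwards (or pass ignore_index=True to pd.concat()):
new_df = new_df.reset_index(drop=True)  # index becomes 0..15 instead of repeating 0..3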
Since your file has duplicated column names, Pandas will add suffixes. The DataFrame header by default will be like ['time', 'conc', 'time.1', 'conc.1', 'time.2', 'conc.2', 'time.3', 'conc.3' ...]
Assuming that the separator of your CSV file is a comma:
import pandas as pd
df = pd.read_csv('/path/to/your/file.csv', sep=',')
total_n = len(df.columns)
lst = []
for x in range(int(total_n / 2)):
    if x == 0:
        cols = ['time', 'conc']
    else:
        cols = ['time' + '.' + str(x), 'conc' + '.' + str(x)]
    df_sub = df[cols]  # Slice two columns each time
    df_sub.columns = ['time', 'conc']  # Slices should have the same column names
    lst.append(df_sub)
df = pd.concat(lst)  # Concatenate all the objects
Assuming that df is a DataFrame with the csv file data you can try the following:
# rename columns if needed
df.columns = ["time", "conc"]*(df.shape[1]//2)
# concatenate pairs of adjacent columns
pd.concat([df.iloc[:, [i, i+1]] for i in range(0, df.shape[1], 2)])
It gives:
time conc
0 1:00 10
1 2:00 13
2 3:00 9
3 4:00 8
0 5:00 11
.. ... ..
3 12:00 11
0 13:00 1
1 14:00 4
2 15:00 3
3 16:00 8

Finding historical seasonal average for given month in a monthly series in a dataframe time-series

I have a dataframe (snippet below) with an index in the format YYYYMM and several columns of values, including one called "month" into which I've extracted the MM data from the index column.
index st us stu px month
0 202001 2616757.0 3287969.0 0.795858 2.036 01
1 201912 3188693.0 3137911.0 1.016183 2.283 12
2 201911 3610052.0 2752828.0 1.311398 2.625 11
3 201910 3762043.0 2327289.0 1.616492 2.339 10
4 201909 3414939.0 2216155.0 1.540930 2.508 09
What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.
index st us stu px month stavg
0 202001 2616757.0 3287969.0 0.795858 2.036 01 xxx
1 201912 3188693.0 3137911.0 1.016183 2.283 12 xxx
2 201911 3610052.0 2752828.0 1.311398 2.625 11 xxx
3 201910 3762043.0 2327289.0 1.616492 2.339 10 xxx
4 201909 3414939.0 2216155.0 1.540930 2.508 09 xxx
I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.
Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.
Try defining a function and using the apply method:
df['year'] = (df['index'].astype(int) / 100).astype(int)

def get_stavg(df, year, month):
    # select the same month in the five previous years and average 'st'
    df_year_month = df.query('@year - 5 <= year < @year and month == @month')
    return df_year_month.st.mean()

df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)
If you are looking for a pandas-only solution, you could do something like the following.
Dummy Data
Here we create a dummy dataset with 10 years of data and only two months (Jan and Feb).
import pandas as pd
df1 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-JAN")})
df2 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-FEB")})
df1["n"] = df1.index*2
df2["n"] = df2.index*3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)
df.head(10)
date n
0 2010-01-01 0
1 2010-02-01 0
2 2011-01-01 2
3 2011-02-01 3
4 2012-01-01 4
5 2012-02-01 6
6 2013-01-01 6
7 2013-02-01 9
8 2014-01-01 8
9 2014-02-01 12
Groupby + rolling mean
df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
date n n_mean
0 2010-01-01 0 NaN
1 2010-02-01 0 NaN
2 2011-01-01 2 NaN
3 2011-02-01 3 NaN
4 2012-01-01 4 NaN
5 2012-02-01 6 NaN
6 2013-01-01 6 NaN
7 2013-02-01 9 NaN
8 2014-01-01 8 4.0
9 2014-02-01 12 6.0
10 2015-01-01 10 6.0
11 2015-02-01 15 9.0
12 2016-01-01 12 8.0
13 2016-02-01 18 12.0
14 2017-01-01 14 10.0
15 2017-02-01 21 15.0
16 2018-01-01 16 12.0
17 2018-02-01 24 18.0
18 2019-01-01 18 14.0
19 2019-02-01 27 21.0
By definition for the first 4 years the result is NaN.
Update
For your particular case
import pandas as pd
index = [f"{y}01" for y in range(2010, 2020)] +\
[f"{y}02" for y in range(2010, 2020)]
df = pd.DataFrame({"index":index})
df["st"] = df.index + 1
# dates/ index should be sorted
df = df.sort_values("index").reset_index(drop=True)
# extract month
df["month"] = df["index"].str[-2:]
df["st_mean"] = df.groupby("month")["st"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
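One detail worth flagging: rolling(5) as used above includes the current year's value in each average, while the question asks for the five preceding years only. If that distinction matters, a possible tweak (a sketch assuming the same sorted df as above; st_mean_prev is just an illustrative column name) is to shift each group by one period before rolling:
# average of the previous 5 values for the same month, excluding the current row
df["st_mean_prev"] = df.groupby("month")["st"] \
    .transform(lambda s: s.shift().rolling(5).mean())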

Faster way to construct a multiindex of dates in Pandas

I have a Pandas dataframe, df. Here are the first five rows:
Id StartDate EndDate
0 0 2015-08-11 2018-07-13
1 1 2014-02-15 2016-01-25
2 2 2014-12-20 NaT
3 3 2015-01-09 2015-01-14
4 4 2014-07-20 NaT
I want to construct a new dataframe, df2. df2 should have a row for each month between StartDate and EndDate, inclusive, for each Id in df. For example, since the first row of df has StartDate in August 2015 and EndDate in July 2018, df2 should have rows corresponding to August 2015, September 2015, ..., July 2018. If an Id in df has no EndDate, we will take it to be June 2019.
I would like df2 to use a multiindex with the first level being the corresponding Id in df, the second level being the year, and the third level being the month. For example, if the above five rows were all of df, then df2 should look like:
Id Year Month
0 2015 8
9
10
11
12
2016 1
2
3
4
5
6
7
8
9
10
11
12
2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
... ... ...
4 2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
2
3
4
5
6
7
8
9
10
11
12
2019 1
2
3
4
5
6
The following code does the trick, but takes about 20 seconds on my decent laptop for 10k Ids. Can I be more efficient somehow?
import numpy as np
import pandas as pd
def build_multiindex_for_id_(id_, enroll_month, enroll_year, cancel_month, cancel_year):
    # Given id_ and start/end dates,
    # returns 2d array to be converted to multiindex.
    # Each row of returned array represents a month/year
    # between enroll date and cancel date inclusive.
    year = enroll_year
    month = enroll_month
    multiindex_array = [[], [], []]
    while (month != cancel_month) or (year != cancel_year):
        multiindex_array[0].append(id_)
        multiindex_array[1].append(year)
        multiindex_array[2].append(month)
        month += 1
        if month == 13:
            month = 1
            year += 1
    multiindex_array[0].append(id_)
    multiindex_array[1].append(year)
    multiindex_array[2].append(month)
    return np.array(multiindex_array)
# Begin by constructing array for first id.
array_for_multiindex = build_multiindex_for_id_(0,8,2015,7,2018)
# Append the rest of the multiindices for the remaining ids.
for _, row in df.loc[1:].fillna(pd.to_datetime('2019-06-30')).iterrows():
    current_id_array = build_multiindex_for_id_(
        row['Id'],
        row['StartDate'].month,
        row['StartDate'].year,
        row['EndDate'].month,
        row['EndDate'].year)
    array_for_multiindex = np.append(array_for_multiindex, current_id_array, axis=1)
df2_index = pd.MultiIndex.from_arrays(array_for_multiindex).rename(['Id','Year','Month'])
pd.DataFrame(index=df2_index)
Here's my approach after some trial and error:
(df.melt(id_vars='Id')
   .fillna(pd.to_datetime('June 2019'))
   .set_index('value')
   .groupby('Id').apply(lambda x: x.asfreq('M').ffill())
   .reset_index('value')
   .assign(year=lambda x: x['value'].dt.year,
           month=lambda x: x['value'].dt.month)
   .set_index(['year', 'month'], append=True)
)
Output:
value Id variable
Id year month
0 2015 8 2015-08-31 NaN NaN
9 2015-09-30 NaN NaN
10 2015-10-31 NaN NaN
11 2015-11-30 NaN NaN
12 2015-12-31 NaN NaN
2016 1 2016-01-31 NaN NaN
2 2016-02-29 NaN NaN
3 2016-03-31 NaN NaN
4 2016-04-30 NaN NaN
5 2016-05-31 NaN NaN
6 2016-06-30 NaN NaN
7 2016-07-31 NaN NaN
8 2016-08-31 NaN NaN
9 2016-09-30 NaN NaN
10 2016-10-31 NaN NaN
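If, as in the question, only the (Id, year, month) MultiIndex is needed rather than the helper columns, you can keep just the index afterwards; a small sketch, assuming the chained expression above is assigned to a variable (result is a name introduced here for illustration):
# result = (df.melt(id_vars='Id') ... .set_index(['year', 'month'], append=True))
df2 = pd.DataFrame(index=result.index)  # empty frame carrying only the MultiIndex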

Number of active IDs in each period

I have a dataframe that looks like this
ID | START | END
1 |2016-12-31|2017-02-30
2 |2017-01-30|2017-10-30
3 |2016-12-21|2018-12-30
I want to know the number of active IDs on each possible day, so basically count the number of overlapping time periods.
What I did to calculate this was to create a new dataframe c_df with the columns date and count. The first column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then for every line in my original data frame I calculated a different range for the start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment the corresponding count cell in c_df by one.
All these loops, though, are not very efficient for big data sets and look ugly. Is there a more efficient way of doing this?
If your dataframe is small enough so that performance is not a concern, create a date range for each row, then explode them and count how many times each date exists in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
  .explode() \
  .value_counts() \
  .sort_index()
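The chained expression above returns a Series of per-date counts indexed by date. If a two-column frame like the question's c_df is preferred, it can be tidied up afterwards; an optional step, assuming the chained result is stored in counts (a name introduced here):
counts = counts.rename_axis('Date').reset_index(name='Count')  # columns: Date, Count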
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Work with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]
end = df['END'].values[:, None]
mask = (start <= dates) & (dates <= end)
result = pd.DataFrame({
    'Date': dates,
    'Count': mask.sum(axis=0)
})
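As a brief usage example on top of the snippet above, indexing result by date makes per-day lookups easy (the date below comes from the question's sample data):
result = result.set_index('Date')
result.loc['2017-01-30', 'Count']  # number of IDs active on that day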
Create an IntervalIndex and use a generator expression or list comprehension with contains to check each date against each interval. (Note: I made a smaller sample to test this solution on.)
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
                         'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1

How to combine rows with the same timestamp?

I'm trying to combine all rows of a dataframe that have the same time stamp into a single row. The df is 5k by 20.
A B ...
timestamp
11:00 NaN 10 ...
11:00 5 NaN ...
12:00 15 20 ...
... ... ...
I want to group the two 11:00 rows as follows:
A B ...
timestamp
11:00 5 10 ...
12:00 15 20 ...
... ... ...
Any help would be appreciated. Thank you.
I have tried
df.groupby( df.index ).sum()
You could melt ('unpivot') the DataFrame to convert it from wide form to long form, remove the null values, then aggregate via groupby.
import pandas as pd
df = pd.DataFrame({'timestamp': ['11:00', '11:00', '12:00'],
                   'A': [None, 5, 15],
                   'B': [10, None, 20]})
A B timestamp
0 NaN 10 11:00
1 5 NaN 11:00
2 15 20 12:00
df2 = pd.melt(df, id_vars = 'timestamp') # specify the value_vars if needed
timestamp variable value
0 11:00 A NaN
1 11:00 A 5
2 12:00 A 15
3 11:00 B 10
4 11:00 B NaN
5 12:00 B 20
df2.dropna(inplace=True)
df3 = df2.groupby(['timestamp', 'variable']).sum()
value
timestamp variable
11:00 A 5
B 10
12:00 A 15
B 20
df3.unstack()
value
variable A B
timestamp
11:00 5 10
12:00 15 20
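After unstack() the columns carry an extra top-level 'value' label. If plain A and B column names are wanted, that level can be dropped (droplevel requires pandas >= 0.24):
df4 = df3.unstack().droplevel(0, axis=1)  # columns become just A and B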
Use groupby after replacing the NaN values with 0s:
df.fillna(0, inplace=True)
df.groupby(df.index).sum()
Try using resample:
>>> df.resample('60Min', how='sum')
A B
2015-05-28 11:00:00 5 10
2015-05-28 12:00:00 15 20
More examples can be found in the Pandas Documentation.
Summing a number and a NaN in plain Python gives NaN, so missing values propagate. You probably need to use .aggregate() :)
