I'm trying to combine all rows of a dataframe that have the same time stamp into a single row. The df is 5k by 20.
A B ...
timestamp
11:00 NaN 10 ...
11:00 5 NaN ...
12:00 15 20 ...
... ... ...
group the 2 11:00 rows as follows
A B ...
timestamp
11:00 5 10 ...
12:00 15 20 ...
... ... ...
Any help would be appreciated. Thank you.
I have tried
df.groupby( df.index ).sum()
You could melt ('unpivot') the DataFrame to convert it from wide form to long form, remove the null values, then aggregate via groupby.
import pandas as pd
df = pd.DataFrame({'timestamp' : ['11:00','11:00','12:00'],
'A' : [None,5,15],
'B' : [10,None,20]
})
A B timestamp
0 NaN 10 11:00
1 5 NaN 11:00
2 15 20 12:00
df2 = pd.melt(df, id_vars = 'timestamp') # specify the value_vars if needed
timestamp variable value
0 11:00 A NaN
1 11:00 A 5
2 12:00 A 15
3 11:00 B 10
4 11:00 B NaN
5 12:00 B 20
df2.dropna(inplace=True)
df3 = df2.groupby(['timestamp', 'variable']).sum()
value
timestamp variable
11:00 A 5
B 10
12:00 A 15
B 20
df3.unstack()
value
variable A B
timestamp
11:00 5 10
12:00 15 20
groupby after replacing the NaN values with 0's.
df.fillna(0, inplace=True)
df.groupby(df.index).sum()
Try using resample:
>>> df.resample('60Min', how='sum')
A B
2015-05-28 11:00:00 5 10
2015-05-28 12:00:00 15 20
More examples can be found in the Pandas Documentation.
You cannot sum a number and a NaN in python. You probably need to use .aggregate() :)
Related
I have the following data in CSV file:
time conc time conc time conc time conc
1:00 10 5:00 11 9:00 55 13:00 1
2:00 13 6:00 8 10:00 6 14:00 4
3:00 9 7:00 7 11:00 8 15:00 3
4:00 8 8:00 1 12:00 11 16:00 8
And I just wanted to merge them together as:
time conc
1:00 10
2:00 13
3:00 9
4:00 8
...
16:00 8
I've got more than 1000 columns, but I'm new to pandas. So just wondering how I can achieve?
One approach is to cut the dataframe in two-column slices, then re-combine using pd.concat() after renaming.
First load the dataframe normally:
df = pd.read_csv('time_conc.csv')
df
Which looks something like the below. Notice that pd.read_csv() has added a suffix to the duplicate column names:
time conc time.1 conc.1 time.2 conc.2 time.3 conc.3
0 1:00 10 5:00 11 9:00 55 13:00 1
1 2:00 13 6:00 8 10:00 6 14:00 4
2 3:00 9 7:00 7 11:00 8 15:00 3
3 4:00 8 8:00 1 12:00 11 16:00 8
Then slice using pd.DataFrame.iloc:
total_columns = len(df.columns)
columns_per_set = 2
column_sets = [df.iloc[:,set_start:set_start + columns_per_set].copy() for set_start in range(0, total_columns, columns_per_set)]
column_sets is now a list holding each pair of duplicate columns as a separate dataframe. Next, loop through the list to rename the columns back to the original:
for s in column_sets:
s.columns = ['time', 'conc']
This modifies each two-column dataframe in place so that their column names match.
Finally, use pd.concat() to combine them by matching the column axis:
new_df = pd.concat(column_sets, axis=0, sort=False)
new_df
Which gives you the full two columns:
time conc
0 1:00 10
1 2:00 13
2 3:00 9
3 4:00 8
0 5:00 11
1 6:00 8
2 7:00 7
3 8:00 1
0 9:00 55
1 10:00 6
2 11:00 8
3 12:00 11
0 13:00 1
1 14:00 4
2 15:00 3
3 16:00 8
Since your file has duplicated column names, Pandas will add suffixes. The DataFrame header by default will be like ['time', 'conc', 'time.1', 'conc.1', 'time.2', 'conc.2', 'time.3', 'conc.3' ...]
Assuming that the separator of your CSV file is a comma:
import pandas as pd
df = pd.read_csv('/path/to/your/file.csv', sep=',')
total_n = len(df.columns)
lst = []
for x in range(int(total_n / 2 )):
if x == 0:
cols = ['time', 'conc']
else:
cols = ['time'+'.'+str(x), 'conc'+'.'+str(x)]
df_sub = df[cols] #Slice two columns each time
df_sub.columns = ['time', 'conc'] #Slices should have the same column names
lst.append(df_sub)
df = pd.concat(lst) #Concatenate all the objects
Assuming that df is a DataFrame with the csv file data you can try the following:
# rename columns if needed
df.columns = ["time", "conc"]*(df.shape[1]//2)
# concatenate pairs of adjacent columns
pd.concat([df.iloc[:, [i, i+1]] for i in range(0, df.shape[1], 2)])
It gives:
time conc
0 1:00 10
1 2:00 13
2 3:00 9
3 4:00 8
0 5:00 11
.. ... ..
3 12:00 11
0 13:00 1
1 14:00 4
2 15:00 3
3 16:00 8
I have a dataset with meteorological features for 2019, to which I want to join two columns of power consumption datasets for 2017, 2018. I want to match them by hour, day and month, but the data belongs to different years. How can I do that?
The meteo dataset is a 6 column similar dataframe with datetimeindexes belonging to 2019.
You can from the index 3 additional columns that represent the hour, day and month and use them for a later join. DatetimeIndex has attribtues for different parts of the timestamp:
import pandas as pd
ind = pd.date_range(start='2020-01-01', end='2020-01-20', periods=10)
df = pd.DataFrame({'number' : range(10)}, index = ind)
df['hour'] = df.index.hour
df['day'] = df.index.day
df['month'] = df.index.month
print(df)
number hour day month
2020-01-01 00:00:00 0 0 1 1
2020-01-03 02:40:00 1 2 3 1
2020-01-05 05:20:00 2 5 5 1
2020-01-07 08:00:00 3 8 7 1
2020-01-09 10:40:00 4 10 9 1
2020-01-11 13:20:00 5 13 11 1
2020-01-13 16:00:00 6 16 13 1
2020-01-15 18:40:00 7 18 15 1
2020-01-17 21:20:00 8 21 17 1
2020-01-20 00:00:00 9 0 20 1
I have a dataframe (snippet below) with index in format YYYYMM and several columns of values, including one called "month" in which I've extracted the MM data from the index column.
index st us stu px month
0 202001 2616757.0 3287969.0 0.795858 2.036 01
1 201912 3188693.0 3137911.0 1.016183 2.283 12
2 201911 3610052.0 2752828.0 1.311398 2.625 11
3 201910 3762043.0 2327289.0 1.616492 2.339 10
4 201909 3414939.0 2216155.0 1.540930 2.508 09
What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.
index st us stu px month stavg
0 202001 2616757.0 3287969.0 0.795858 2.036 01 xxx
1 201912 3188693.0 3137911.0 1.016183 2.283 12 xxx
2 201911 3610052.0 2752828.0 1.311398 2.625 11 xxx
3 201910 3762043.0 2327289.0 1.616492 2.339 10 xxx
4 201909 3414939.0 2216155.0 1.540930 2.508 09 xxx
I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.
Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.
Try defining a function and using apply method
df['year'] = (df['index'].astype(int)/100).astype(int)
def get_stavg(df, year, month):
# get year from index
df_year_month = df.query('#year - 5 <= year < #year and month == #month')
return df_year_month.st.mean()
df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)
If you are looking for a pandas only solution you could do something like
Dummy Data
Here we create a dummy datasets with 10 years of data with only two months (Jan and Feb).
import pandas as pd
df1 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-JAN")})
df2 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="AS-FEB")})
df1["n"] = df1.index*2
df2["n"] = df2.index*3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)
df.head(10)
date n
0 2010-01-01 0
1 2010-02-01 0
2 2011-01-01 2
3 2011-02-01 3
4 2012-01-01 4
5 2012-02-01 6
6 2013-01-01 6
7 2013-02-01 9
8 2014-01-01 8
9 2014-02-01 12
Groupby + rolling mean
df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
date n n_mean
0 2010-01-01 0 NaN
1 2010-02-01 0 NaN
2 2011-01-01 2 NaN
3 2011-02-01 3 NaN
4 2012-01-01 4 NaN
5 2012-02-01 6 NaN
6 2013-01-01 6 NaN
7 2013-02-01 9 NaN
8 2014-01-01 8 4.0
9 2014-02-01 12 6.0
10 2015-01-01 10 6.0
11 2015-02-01 15 9.0
12 2016-01-01 12 8.0
13 2016-02-01 18 12.0
14 2017-01-01 14 10.0
15 2017-02-01 21 15.0
16 2018-01-01 16 12.0
17 2018-02-01 24 18.0
18 2019-01-01 18 14.0
19 2019-02-01 27 21.0
By definition for the first 4 years the result is NaN.
Update
For your particular case
import pandas as pd
index = [f"{y}01" for y in range(2010, 2020)] +\
[f"{y}02" for y in range(2010, 2020)]
df = pd.DataFrame({"index":index})
df["st"] = df.index + 1
# dates/ index should be sorted
df = df.sort_values("index").reset_index(drop=True)
# extract month
df["month"] = df["index"].str[-2:]
df["st_mean"] = df.groupby("month")["st"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
I have a dataframe that looks like this
ID | START | END
1 |2016-12-31|2017-02-30
2 |2017-01-30|2017-10-30
3 |2016-12-21|2018-12-30
I want to know the number of active IDs in each possible day. So basically count the number of overlapping time periods.
What I did to calculate this was creating a new data frame c_df with the columns date and count. The first column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then for every line in my original data frame I calculated a different range for the start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment by one the corresponding count cell in c_df.
All these loops though are not very efficient for big data sets and look ugly. Is there a more efficient way of doing this?
If your dataframe is small enough so that performance is not a concern, create a date range for each row, then explode them and count how many times each date exists in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
.explode() \
.value_counts() \
.sort_index()
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Work with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]
end = df['END'].values[:, None]
mask = (start <= dates) & (dates <= end)
result = pd.DataFrame({
'Date': dates,
'Count': mask.sum(axis=0)
})
Create IntervalIndex and use genex or list comprehension with contains to check each date again each interval (Note: I made a smaller sample to test on this solution)
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1
I have two dataframes df1 and df2. df1 contains the columns subject_id and time and df2 contains the columns subject_id and final_time. What I want to do is for every subject_id in df1 add a column with final_time from df2 but only from the subject_ids's contained in df1. I have tried df1.merge(df2,how='left') but still get all of the subject_id's from df2 which is much longer and contains many duplicates of 'subject_id`.
Example of what I am looking for:
df1
subject_id time
0 15 12:00
1 20 12:05
2 21 12:10
3 25 12:00
df2
subject_id final_time
0 15 12:30
1 15 12:30
2 15 12:30
3 20 12:45
4 20 12:45
5 21 12:50
6 25 1:00
7 25 1:00
8 25 1:00
What I am looking for
subject_id time final_time
0 15 12:00 12:30
1 20 12:05 12:45
2 21 12:10 12:50
3 25 12:00 1:00
You should use
df1.merge(df2, on='subject_id')
The default for how is inner, which will only match those entries that are in both columns. on tells the merge to match only on the column you are interested in
Works for me. Nothing in results that aren't in df1
df1 = pd.DataFrame(dict(subject_id=[1, 2, 3], time=[9, 8, 7]))
df2 = pd.DataFrame(dict(subject_id=[2, 2, 4], final_time=[6, 5, 4]))
df1.merge(df2, 'left')
subject_id time final_time
0 1 9 NaN
1 2 8 6.0
2 2 8 5.0
3 3 7 NaN