I have the following data in a CSV file:
time conc time conc time conc time conc
1:00 10 5:00 11 9:00 55 13:00 1
2:00 13 6:00 8 10:00 6 14:00 4
3:00 9 7:00 7 11:00 8 15:00 3
4:00 8 8:00 1 12:00 11 16:00 8
And I just wanted to merge them together as:
time conc
1:00 10
2:00 13
3:00 9
4:00 8
...
16:00 8
I've got more than 1000 columns and I'm new to pandas, so I'm wondering how I can achieve this.
One approach is to cut the dataframe in two-column slices, then re-combine using pd.concat() after renaming.
First load the dataframe normally:
import pandas as pd
df = pd.read_csv('time_conc.csv')
df
Which looks something like the below. Notice that pd.read_csv() has added a suffix to the duplicate column names:
time conc time.1 conc.1 time.2 conc.2 time.3 conc.3
0 1:00 10 5:00 11 9:00 55 13:00 1
1 2:00 13 6:00 8 10:00 6 14:00 4
2 3:00 9 7:00 7 11:00 8 15:00 3
3 4:00 8 8:00 1 12:00 11 16:00 8
Then slice using pd.DataFrame.iloc:
total_columns = len(df.columns)
columns_per_set = 2
column_sets = [df.iloc[:,set_start:set_start + columns_per_set].copy() for set_start in range(0, total_columns, columns_per_set)]
column_sets is now a list holding each pair of duplicate columns as a separate dataframe. Next, loop through the list to rename the columns back to the original:
for s in column_sets:
    s.columns = ['time', 'conc']
This modifies each two-column dataframe in place so that their column names match.
Finally, use pd.concat() to stack them on top of each other (axis=0); the matching column names keep the values aligned:
new_df = pd.concat(column_sets, axis=0, sort=False)
new_df
Which gives you the full two columns:
time conc
0 1:00 10
1 2:00 13
2 3:00 9
3 4:00 8
0 5:00 11
1 6:00 8
2 7:00 7
3 8:00 1
0 9:00 55
1 10:00 6
2 11:00 8
3 12:00 11
0 13:00 1
1 14:00 4
2 15:00 3
3 16:00 8
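The output above keeps the original row labels, so the index repeats 0 to 3 for every slice. A minimal sketch of the same slice-and-concat idea wrapped in a helper, with ignore_index=True to get a clean 0..n-1 index. The helper name stack_column_pairs and the comma-separated in-memory sample are assumptions for illustration, not part of the original answer.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the CSV in the question
# (the comma separator is an assumption).
csv_text = """time,conc,time,conc,time,conc,time,conc
1:00,10,5:00,11,9:00,55,13:00,1
2:00,13,6:00,8,10:00,6,14:00,4
3:00,9,7:00,7,11:00,8,15:00,3
4:00,8,8:00,1,12:00,11,16:00,8
"""

def stack_column_pairs(df, names=("time", "conc")):
    """Stack consecutive groups of len(names) columns on top of each other."""
    width = len(names)
    pieces = []
    for start in range(0, df.shape[1], width):
        piece = df.iloc[:, start:start + width].copy()
        piece.columns = list(names)  # undo the .1, .2, ... suffixes
        pieces.append(piece)
    # ignore_index=True replaces the repeated 0..3 labels with 0..n-1
    return pd.concat(pieces, axis=0, ignore_index=True)

df = pd.read_csv(io.StringIO(csv_text))
stacked = stack_column_pairs(df)
print(stacked)
```

Because the helper slices by position, it does not depend on the exact suffixes that read_csv generates.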
Since your file has duplicated column names, Pandas will add suffixes. The DataFrame header by default will be like ['time', 'conc', 'time.1', 'conc.1', 'time.2', 'conc.2', 'time.3', 'conc.3' ...]
Assuming that the separator of your CSV file is a comma:
import pandas as pd
df = pd.read_csv('/path/to/your/file.csv', sep=',')
total_n = len(df.columns)
lst = []
for x in range(total_n // 2):
    if x == 0:
        cols = ['time', 'conc']
    else:
        cols = ['time'+'.'+str(x), 'conc'+'.'+str(x)]
    df_sub = df[cols].copy()  # Slice two columns each time (copy avoids SettingWithCopyWarning)
    df_sub.columns = ['time', 'conc']  # Slices should have the same column names
    lst.append(df_sub)
df = pd.concat(lst)  # Concatenate all the objects
Assuming that df is a DataFrame holding the CSV file data, you can try the following:
# rename columns if needed
df.columns = ["time", "conc"]*(df.shape[1]//2)
# concatenate pairs of adjacent columns
pd.concat([df.iloc[:, [i, i+1]] for i in range(0, df.shape[1], 2)])
It gives:
time conc
0 1:00 10
1 2:00 13
2 3:00 9
3 4:00 8
0 5:00 11
.. ... ..
3 12:00 11
0 13:00 1
1 14:00 4
2 15:00 3
3 16:00 8
I have a big data file that I'm trying to extract data from. I have two columns, Start Time and Date. What I would like to do is display each Date, followed by each Start Time, followed by a count of each of those start times. So the output would look like this:
Date Start Time
30/12/2021 15:00 2
30/12/2021 16:00 6
30/12/2021 17:00 3
This is what I've tried:
df = pd.read_excel(xls)
counter = df['Start Time'].value_counts()
date_counter = df['Date'].value_counts()
total = (df['Start Time']).groupby(df['Date']).sum()
pd.set_option("display.max_rows", None, "display.max_columns", None)
print(total)
input()
But this outputs like this:
Date Start Time
30/12/2021 15:0016:0016:0017:0018:0018:00
Any suggestions are much appreciated!
You're only grouping by one column. You need to group by both columns and get the count using size():
df.groupby(['Date', 'Start Time']).size()
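As a rough sketch of what this looks like end to end, the snippet below applies size() to the sample rows posted in this thread and flattens the result into a regular table. The column name "count" is an assumption chosen for illustration.

```python
import pandas as pd

# Sample rows copied from the example input in this thread.
df = pd.DataFrame({
    "Date": ["30/12/21", "30/12/21", "31/12/21", "30/12/21",
             "31/12/21", "30/12/21", "30/12/21", "31/12/21"],
    "Start Time": ["15:00", "16:00", "15:00", "15:00",
                   "16:00", "18:00", "13:00", "15:00"],
})

# size() counts rows per (Date, Start Time) pair; reset_index turns the
# grouped Series back into a flat table with a named count column.
counts = df.groupby(["Date", "Start Time"]).size().reset_index(name="count")
print(counts)
```

The rows come out sorted by the group keys, so each date's start times are listed together.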
You can also call value_counts() on the two key columns:
counts = df[['Date','Start Time']].value_counts()
For this input:
Date Start Time
0 30/12/21 15:00
1 30/12/21 16:00
2 31/12/21 15:00
3 30/12/21 15:00
4 31/12/21 16:00
5 30/12/21 18:00
6 30/12/21 13:00
7 31/12/21 15:00
it gives:
Date Start Time
31/12/21 15:00 2
30/12/21 15:00 2
31/12/21 16:00 1
30/12/21 18:00 1
16:00 1
13:00 1
I have a dataframe, df, for which I want to calculate the delta over a 7-day time period:
Monday Tuesday Wednesday Thursday Friday Sat Sun
5 10 15 20 25 30 35
1 2 3 4 5 6 7
I would like to find the delta for the first row, starting with Monday (5) and ending on Sun (35)
The delta for the first 7 day time period would be: 35 - 5 = 30
The next 7 day window delta would be: 7 - 1 = 6 and so on
The date would begin on 1/1/2020 and continue by 7 day or weekly increments.
Desired output: (New dataframe with the newly created Date and Delta columns)
Date Delta
1/1/2020 30
1/8/2020 6
This is what I am doing:
import pandas as pd
import numpy as np
df = pd.read_csv('df.csv')
df['Delta'] = df['Sunday'] - df['Monday']
df['Date'] = pd.date_range(start='1/1/2020', periods=len(df), freq='Weeks')
df2.to_csv('df2.csv')
Any suggestion is appreciated
Let's try building the dates with date_range, using a multiple in freq:
df['Delta'] = df.Sun.sub(df.Monday)
df['Date'] = pd.date_range(pd.Timestamp('2020-01-01'), periods=len(df), freq='7D')
or simply
df = df.assign(
    Delta=df.Sun.sub(df.Monday),
    Date=pd.date_range(pd.Timestamp('2020-01-01'), periods=len(df), freq='7D'),
)
Monday Tuesday Wednesday Thursday Friday Sat Sun Delta Date
0 5 10 15 20 25 30 35 30 2020-01-01
1 1 2 3 4 5 6 7 6 2020-01-08
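The question asked for a new dataframe holding only Date and Delta. A minimal sketch of that, assuming the column names Monday and Sun from the sample data and one row per week; the variable name result is hypothetical.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Monday": [5, 1], "Tuesday": [10, 2], "Wednesday": [15, 3],
    "Thursday": [20, 4], "Friday": [25, 5], "Sat": [30, 6], "Sun": [35, 7],
})

# One date per row, spaced 7 days apart, paired with Sun minus Monday.
result = pd.DataFrame({
    "Date": pd.date_range(start="2020-01-01", periods=len(df), freq="7D"),
    "Delta": (df["Sun"] - df["Monday"]).to_numpy(),
})
print(result)
```

Using periods=len(df) keeps the number of dates in step with the number of rows, whatever the size of the input.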
# necessary imports
import datetime
import pandas as pd
Can do:
numdays=5
base = datetime.datetime(2020,1,1)
date_list = [base + datetime.timedelta(days=7*x) for x in range(numdays)]
Then:
df=pd.DataFrame({'Date':date_list})
If you have another list of values, e.g. Deltas_list, that you want to include in this dataframe:
Deltas_list=[0,1,2,3,4]
Deltas=pd.Series(Deltas_list)
df['Delta']=Deltas
df will be:
Date Delta
0 2020-01-01 0
1 2020-01-08 1
2 2020-01-15 2
3 2020-01-22 3
4 2020-01-29 4
I have a dataframe (snippet below) with index in format YYYYMM and several columns of values, including one called "month" in which I've extracted the MM data from the index column.
index st us stu px month
0 202001 2616757.0 3287969.0 0.795858 2.036 01
1 201912 3188693.0 3137911.0 1.016183 2.283 12
2 201911 3610052.0 2752828.0 1.311398 2.625 11
3 201910 3762043.0 2327289.0 1.616492 2.339 10
4 201909 3414939.0 2216155.0 1.540930 2.508 09
What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.
index st us stu px month stavg
0 202001 2616757.0 3287969.0 0.795858 2.036 01 xxx
1 201912 3188693.0 3137911.0 1.016183 2.283 12 xxx
2 201911 3610052.0 2752828.0 1.311398 2.625 11 xxx
3 201910 3762043.0 2327289.0 1.616492 2.339 10 xxx
4 201909 3414939.0 2216155.0 1.540930 2.508 09 xxx
I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.
Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.
Try defining a function and using the apply method:
df['year'] = df['index'].astype(int) // 100  # get year from index
def get_stavg(df, year, month):
    # rows for the same month in the five preceding years
    df_year_month = df.query('@year - 5 <= year < @year and month == @month')
    return df_year_month.st.mean()
df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)
If you are looking for a pandas-only solution, you could do something like the following.
Dummy Data
Here we create a dummy dataset with 10 years of data for only two months (Jan and Feb).
import pandas as pd
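# "YS-JAN"/"YS-FEB" are the year-start aliases; the older "AS-..." spelling is deprecated in recent pandas
df1 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="YS-JAN")})
df2 = pd.DataFrame({"date":pd.date_range("2010-01-01", periods=10, freq="YS-FEB")})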
df1["n"] = df1.index*2
df2["n"] = df2.index*3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)
df.head(10)
date n
0 2010-01-01 0
1 2010-02-01 0
2 2011-01-01 2
3 2011-02-01 3
4 2012-01-01 4
5 2012-02-01 6
6 2013-01-01 6
7 2013-02-01 9
8 2014-01-01 8
9 2014-02-01 12
Groupby + rolling mean
df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
date n n_mean
0 2010-01-01 0 NaN
1 2010-02-01 0 NaN
2 2011-01-01 2 NaN
3 2011-02-01 3 NaN
4 2012-01-01 4 NaN
5 2012-02-01 6 NaN
6 2013-01-01 6 NaN
7 2013-02-01 9 NaN
8 2014-01-01 8 4.0
9 2014-02-01 12 6.0
10 2015-01-01 10 6.0
11 2015-02-01 15 9.0
12 2016-01-01 12 8.0
13 2016-02-01 18 12.0
14 2017-01-01 14 10.0
15 2017-02-01 21 15.0
16 2018-01-01 16 12.0
17 2018-02-01 24 18.0
18 2019-01-01 18 14.0
19 2019-02-01 27 21.0
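By construction, the result is NaN for the first 4 years of each month, because a 5-row window is not yet full. Note that the window includes the current row.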
Update
For your particular case
import pandas as pd
index = [f"{y}01" for y in range(2010, 2020)] +\
[f"{y}02" for y in range(2010, 2020)]
df = pd.DataFrame({"index":index})
df["st"] = df.index + 1
# dates/ index should be sorted
df = df.sort_values("index").reset_index(drop=True)
# extract month
df["month"] = df["index"].str[-2:]
df["st_mean"] = df.groupby("month")["st"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
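The rolling window here includes the current year, whereas the question asks for the average of the five preceding years. One possible variation is to shift each month's series by one row before rolling. This is a sketch that assumes exactly one row per month per year with no gaps; the dummy values are hypothetical, and stavg is the column name taken from the question.

```python
import pandas as pd

# Hypothetical data: Jan and Feb rows for 2010-2019, index in YYYYMM form.
index = [f"{y}{m}" for y in range(2010, 2020) for m in ("01", "02")]
df = pd.DataFrame({"index": index})
df["st"] = range(1, len(df) + 1)
df["month"] = df["index"].str[-2:]

# Rows must be in chronological order within each month.
df = df.sort_values("index").reset_index(drop=True)

# Within each month, shift by one year first so the window covers only
# the five preceding years, then take the rolling mean.
df["stavg"] = (
    df.groupby("month")["st"]
      .transform(lambda s: s.shift(1).rolling(5).mean())
)
print(df.tail())
```

With this variant the first five years of each month are NaN, which matches what the question said would be acceptable.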
I have two dataframes, df1 and df2. df1 contains the columns subject_id and time, and df2 contains the columns subject_id and final_time. What I want to do is, for every subject_id in df1, add a column with final_time from df2, but only for the subject_ids contained in df1. I have tried df1.merge(df2,how='left') but still get all of the subject_ids from df2, which is much longer and contains many duplicates of subject_id.
Example of what I am looking for:
df1
subject_id time
0 15 12:00
1 20 12:05
2 21 12:10
3 25 12:00
df2
subject_id final_time
0 15 12:30
1 15 12:30
2 15 12:30
3 20 12:45
4 20 12:45
5 21 12:50
6 25 1:00
7 25 1:00
8 25 1:00
What I am looking for
subject_id time final_time
0 15 12:00 12:30
1 20 12:05 12:45
2 21 12:10 12:50
3 25 12:00 1:00
You should use
df1.merge(df2, on='subject_id')
The default for how is 'inner', which keeps only the subject_id values that appear in both dataframes. on tells merge which column to match on.
Works for me. Nothing shows up in the result that isn't in df1:
df1 = pd.DataFrame(dict(subject_id=[1, 2, 3], time=[9, 8, 7]))
df2 = pd.DataFrame(dict(subject_id=[2, 2, 4], final_time=[6, 5, 4]))
df1.merge(df2, how='left')
subject_id time final_time
0 1 9 NaN
1 2 8 6.0
2 2 8 5.0
3 3 7 NaN
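As the output above shows, a subject_id that occurs several times in df2 produces several rows in the merge. If each subject_id has a single final_time, as in the question's sample, one possible fix is to drop the duplicates from df2 before merging. A sketch under that assumption; the name lookup is hypothetical.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({"subject_id": [15, 20, 21, 25],
                    "time": ["12:00", "12:05", "12:10", "12:00"]})
df2 = pd.DataFrame({"subject_id": [15, 15, 15, 20, 20, 21, 25, 25, 25],
                    "final_time": ["12:30", "12:30", "12:30", "12:45", "12:45",
                                   "12:50", "1:00", "1:00", "1:00"]})

# Keep one row per subject_id so the merge cannot multiply rows of df1.
lookup = df2.drop_duplicates(subset="subject_id")

# validate raises an error if lookup still has repeated keys.
merged = df1.merge(lookup, on="subject_id", how="left", validate="many_to_one")
print(merged)
```

If a subject_id could carry different final_time values, drop_duplicates would silently keep the first one, so that case needs an explicit rule instead.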
I'm trying to combine all rows of a dataframe that have the same time stamp into a single row. The df is 5k by 20.
A B ...
timestamp
11:00 NaN 10 ...
11:00 5 NaN ...
12:00 15 20 ...
... ... ...
I want to group the two 11:00 rows as follows:
A B ...
timestamp
11:00 5 10 ...
12:00 15 20 ...
... ... ...
Any help would be appreciated. Thank you.
I have tried
df.groupby( df.index ).sum()
You could melt ('unpivot') the DataFrame to convert it from wide form to long form, remove the null values, then aggregate via groupby.
import pandas as pd
df = pd.DataFrame({'timestamp' : ['11:00','11:00','12:00'],
'A' : [None,5,15],
'B' : [10,None,20]
})
A B timestamp
0 NaN 10 11:00
1 5 NaN 11:00
2 15 20 12:00
df2 = pd.melt(df, id_vars = 'timestamp') # specify the value_vars if needed
timestamp variable value
0 11:00 A NaN
1 11:00 A 5
2 12:00 A 15
3 11:00 B 10
4 11:00 B NaN
5 12:00 B 20
df2.dropna(inplace=True)
df3 = df2.groupby(['timestamp', 'variable']).sum()
value
timestamp variable
11:00 A 5
B 10
12:00 A 15
B 20
df3.unstack()
value
variable A B
timestamp
11:00 5 10
12:00 15 20
Use groupby after replacing the NaN values with 0s:
df.fillna(0, inplace=True)
df.groupby(df.index).sum()
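One side effect of filling with 0 is that a group whose values are all missing ends up as 0 rather than NaN. A hedged alternative sketch: group on the index and pass min_count=1 to sum(), which skips NaN within a group but leaves the result as NaN when nothing is there to add. The sample frame below is rebuilt from the question and assumes timestamp is the index.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with timestamp as the index.
df = pd.DataFrame(
    {"A": [np.nan, 5, 15], "B": [10, np.nan, 20]},
    index=pd.Index(["11:00", "11:00", "12:00"], name="timestamp"),
)

# sum() skips NaN inside each group; min_count=1 keeps an all-NaN group as NaN.
combined = df.groupby(level=0).sum(min_count=1)
print(combined)
```

This leaves the original df untouched, unlike fillna with inplace=True.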
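Try using resample (this needs a DatetimeIndex; the old how= argument has been removed, so call the aggregation as a method):
>>> df.resample('60min').sum()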
A B
2015-05-28 11:00:00 5 10
2015-05-28 12:00:00 15 20
More examples can be found in the Pandas Documentation.
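For reference, a small self-contained sketch of the resample approach. It assumes the index can be parsed into real timestamps; the date 2015-05-28 is simply the one shown in the output above.

```python
import numpy as np
import pandas as pd

# resample needs a DatetimeIndex, so the time strings are parsed first.
idx = pd.to_datetime(["2015-05-28 11:00", "2015-05-28 11:00", "2015-05-28 12:00"])
df = pd.DataFrame({"A": [np.nan, 5, 15], "B": [10, np.nan, 20]}, index=idx)

# Bucket rows into 60-minute bins and add up each bin, skipping NaN.
hourly = df.resample("60min").sum()
print(hourly)
```

Unlike a plain groupby on the index, resample also emits rows for empty hours between the first and last timestamp.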
Adding a number and NaN in plain Python gives NaN, although pandas' sum() skips NaN by default. If you need different handling, you could use .aggregate() with your own function :)