How to split a dataframe using pandas wide_to_long keeping first column as index and balance columns (in group of 3) into single dataframe.
I have a sample dataframe like the one below:
columns = [timestamp, BQ_0, BP_0, BO_0, BQ_1, BP_1, BO_1, BQ_2, BP_2, BO_2, BQ_3, BP_3, BO_3, BQ_4, BP_4, BO_4]
09:15:00 900 29450.00 2 20 29,436 1 100 29425.15 1 60 29352.05 1 20 29352.00 1
09:15:01 900 29450.00 2 20 29,436 1 100 29425.15 1 60 29352.05 1 20 29352.00 1
09:15:02 20 29412.40 1 20 29,410 1 80 29410.10 1 20 29407.60 1 20 29388.90 1
09:15:03 80 29430.20 1 80 29,430 1 80 29430.05 2 20 29430.00 1 20 29424.75 1
09:15:04 120 29445.80 1 40 29,440 2 40 29440.10 1 40 29440.05 1 20 29439.10 1
I want to melt this DataFrame into groups of [timestamp, BQ_, BP_, BO_] using pandas wide_to_long, where
BQ = Quantity, BP = Price, BO = Orders.
I want my result dataframe to look like below:
timestamp, BQ_, BP_, BO_
09:15:00 900 29450.00 2 <= 1st Row
09:15:00 20 29,436 1
09:15:00 100 29425.15 1
09:15:00 60 29352.05 1
09:15:00 20 29352.00 1
09:15:01 900 29450.00 2 <= 2nd Row
09:15:01 20 29,436 1
09:15:01 100 29425.15 1
09:15:01 60 29352.05 1
09:15:01 20 29352.00 1
09:15:02 20 29412.40 1 <= 3rd Row
09:15:02 20 29,410 1
...
Source : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html
pandas.wide_to_long(df, stubnames, i, j, sep='', suffix='\d+')
df : DataFrame
The wide-format DataFrame
stubnames : str or list-like
The stub name(s). The wide format variables are assumed to start with the stub names.
i : str or list-like
Column(s) to use as id variable(s)
j : str
The name of the sub-observation variable. What you wish to name your suffix in the long format.
sep : str, default ''
A character indicating the separation of the variable names in the wide format, to be stripped from the names in the long format. For example, if your column names are A-suffix1, A-suffix2, you can strip the hyphen by specifying sep='-'
New in version 0.20.0.
suffix : str, default '\d+'
A regular expression capturing the wanted suffixes. '\d+' captures numeric suffixes. Suffixes with no numbers could be specified with the negated character class '\D+'. You can also further disambiguate suffixes, for example, if your wide variables are of the form A-one, B-two, ..., and you have an unrelated column A-rating, you can ignore the last one by specifying suffix='(!?one|two)'
New in version 0.20.0.
Changed in version 0.23.0: When all suffixes are numeric, they are cast to int64/float64.
You can try it like this:
result = pd.wide_to_long(df, stubnames=['BQ_', 'BP_', 'BO_'], i=['timestamp'], j='Number')
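For reference, a minimal end-to-end sketch (using a small, two-level version of the sample data made up for illustration) could look like this:

import pandas as pd

# Tiny wide-format sample in the shape described above (values are illustrative).
df = pd.DataFrame({
    'timestamp': ['09:15:00', '09:15:01'],
    'BQ_0': [900, 900], 'BP_0': [29450.00, 29450.00], 'BO_0': [2, 2],
    'BQ_1': [20, 20],   'BP_1': [29436.00, 29436.00], 'BO_1': [1, 1],
})

# The stub names end with '_', so the numeric part of each column name
# becomes the value of the new 'Number' column.
result = pd.wide_to_long(df, stubnames=['BQ_', 'BP_', 'BO_'],
                         i=['timestamp'], j='Number').reset_index()
print(result.sort_values(['timestamp', 'Number']))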
Hi, my use case is that I have dynamic names for the datetime field, and the date will be something like a Unix timestamp. So every time there could be a different column name, and there can be multiple date fields. How can I do this? For now, if I hardcode the column name, this works for me:
df['date'] = pandas.to_datetime(df['date'], unit='s')
but I am not sure how to make this work for dynamic names and multiple fields with pandas.
As suggested in my comment, you can try to convert all columns to datetime, then just keep the one where the conversion succeeded:
# Inspired from #mozway, https://stackoverflow.com/a/75106101/15239951
def find_datetime_col(df):
    # Try to parse every column as datetime; failed conversions become NaT.
    mask = df.astype(str).apply(pd.to_datetime, errors='coerce').notna()
    # Return the column with the highest share of successful conversions.
    return mask.mean().idxmax()
col = find_datetime_col(df)
print(col)
# Output
Heading 1
Input dataframe:
>>> df
Heading 1 Heading 2 Heading 3 Heading 4
0 2023-01-01 34 12 34
1 2023-01-02 42 99 42
2 2023-01-03 42 99 42
3 2023-01-04 42 99 42
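If there can be several date fields and the raw values are Unix timestamps, a possible extension (the helper name, the 0.9 threshold and the plausible-date heuristic are my assumptions, not part of the answer above) is to convert every column whose values mostly parse as sensible dates:

import pandas as pd

def convert_unix_datetime_cols(df, threshold=0.9):
    # Convert every column whose values look like Unix timestamps (seconds).
    out = df.copy()
    for col in out.columns:
        numeric = pd.to_numeric(out[col], errors='coerce')
        parsed = pd.to_datetime(numeric, unit='s', errors='coerce')
        # Heuristic: most values must parse and land in a plausible date range.
        plausible = parsed.between('2000-01-01', '2100-01-01')
        if plausible.mean() >= threshold:
            out[col] = parsed
    return out

# Hypothetical usage, column names not known in advance:
# df = convert_unix_datetime_cols(df)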
I've got the following dataframe
lst=[['01012021','',100],['01012021','','50'],['01022021',140,5],['01022021',160,12],['01032021','',20],['01032021',200,25]]
df1=pd.DataFrame(lst,columns=['Date','AuM','NNA'])
I am looking for code that sums the columns AuM and NNA only if column AuM contains a value. The desired result is shown below:
lst=[['01012021','',100,''],['01012021','','50',''],['01022021',140,5,145],['01022021',160,12,172],['01032021','',20,'']]
df2=pd.DataFrame(lst,columns=['Date','AuM','NNA','Sum'])
It is not a good practice to use '' in place of NaN when you have numeric data.
That said, a generic solution to your issue would be to use sum with the skipna=False option:
df1['Sum'] = (df1[['AuM', 'NNA']]                    # you can use as many columns as you want
              .apply(pd.to_numeric, errors='coerce') # convert to numeric
              .sum(axis=1, skipna=False)             # sum only if all values are non-NaN
              .fillna('')                            # fill NaN with empty string (bad practice)
             )
output:
Date AuM NNA Sum
0 01012021 100
1 01012021 50
2 01022021 140 5 145.0
3 01022021 160 12 172.0
4 01032021 20
5 01032021 200 25 225.0
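As a follow-up to the note about empty strings: if you are free to keep real NaN in the result instead of '', a slightly cleaner sketch (same df1 as above) would be:

import pandas as pd

# Coerce the text columns to numbers once; '' becomes NaN.
nums = df1[['AuM', 'NNA']].apply(pd.to_numeric, errors='coerce')
# With skipna=False the sum is NaN as soon as AuM (or NNA) is missing.
df1['Sum'] = nums.sum(axis=1, skipna=False)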
I assume you mean to include the last row too:
df2 = (df1.assign(Sum=df1.loc[df1.AuM.ne(""), ["AuM", "NNA"]].sum(axis=1))
.fillna(""))
print(df2)
Result:
Date AuM NNA Sum
0 01012021 100
1 01012021 50
2 01022021 140 5 145.0
3 01022021 160 12 172.0
4 01032021 20
5 01032021 200 25 225.0
I have a Pandas dataframe of the form:
Date ID Temp
2019/03/27 1 23
2019/04/27 2 32
2019/04/27 1 42
2019/04/28 1 41
2019/01/27 2 33
2019/08/27 2 23
What do I need to do?
Select rows that are at least 30 days older than the latest measurement for each ID.
i.e. the latest date for ID = 2 is 2019/08/27, so for ID = 2 I need to select rows that are at least 30 days older. So the row with 2019/08/27 for ID = 2 will itself be dropped.
Similarly, the latest date for ID = 1 is 2019/04/28. This means I can select rows for ID = 1 only if the date is earlier than 2019/03/28 (30 days older). So the row 2019/04/27 with ID = 1 will be dropped.
How do I do this in pandas? Any help is greatly appreciated.
Thank you.
Final dataframe will be:
Date ID Temp
2019/03/27 1 23
2019/04/27 2 32
2019/01/27 2 33
In your case, use groupby + transform('last') and filter the original df:
Yourdf = df[df.Date < df.groupby('ID').Date.transform('last') - pd.Timedelta('30 days')].copy()
Date ID Temp
0 2019-03-27 1 23
1 2019-04-27 2 32
4 2019-01-27 2 33
Notice I am adding .copy() at the end to prevent the SettingWithCopyWarning.
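A minimal runnable sketch of the same idea (the sample frame is reconstructed from the question; I use transform('max') instead of 'last' so the result does not depend on the rows being sorted by date):

import pandas as pd

df = pd.DataFrame({
    'Date': ['2019/03/27', '2019/04/27', '2019/04/27',
             '2019/04/28', '2019/01/27', '2019/08/27'],
    'ID':   [1, 2, 1, 1, 2, 2],
    'Temp': [23, 32, 42, 41, 33, 23],
})
df['Date'] = pd.to_datetime(df['Date'])  # the comparison below needs real datetimes

# Keep only rows more than 30 days older than the latest date for their ID.
cutoff = df.groupby('ID').Date.transform('max') - pd.Timedelta('30 days')
Yourdf = df[df.Date < cutoff].copy()
print(Yourdf)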
I am working with expedition geodata. Could you help with enumerating stations, and records within the same station, based on expedition ID (ID), date (Date), latitude (Lat) and longitude (Lon)? The value column (Val) is not relevant for the enumeration. Assume that a station is a group of rows with the same (ID, Date, Lat, Lon), and an expedition is a group of rows with the same ID.
The dataframe is sorted by these four columns, as in the example.
Dataset and required columns
import pandas as pd
data = [[1, '2017/10/10', 70.1, 30.4, 10],
        [1, '2017/10/10', 70.1, 31.4, 20],
        [1, '2017/10/10', 70.1, 31.4, 10],
        [1, '2017/10/10', 70.1, 31.4, 10],
        [1, '2017/10/12', 70.1, 31.4, 20],
        [2, '2017/12/10', 70.1, 30.4, 20],
        [2, '2017/12/10', 70.1, 31.4, 20]]
df = pd.DataFrame(data, columns=['ID', 'Date', 'Lat', 'Lon', 'Val'])
Additional columns I need (St is the station number, Rec is the record number within the same station; output for the example above):
df['St'] = [1, 2, 2, 2, 3, 1, 2]
df['Rec'] = [1, 1, 2, 3, 1, 1, 1]
print(df)
I tried groupby/cumcount/agg/factorize but have not solved my problem. Any help is appreciated. Thanks!
To create 'St', you can use groupby on 'ID', then check with shift whether any of the columns 'Date', 'Lat', 'Lon' differs from the previous row, and use cumsum to get the numbers you want, such as:
df['St'] = (df.groupby(['ID'])
              .apply(lambda x: (x[['Date','Lat','Lon']].shift() != x[['Date','Lat','Lon']])
                               .any(axis=1).cumsum())).values
And to create 'Rec', you also need groupby, but on all the columns 'ID', 'Date', 'Lat', 'Lon', and then use cumcount and add 1, such as:
df['Rec'] = df.groupby(['ID','Date','Lat','Lon']).cumcount().add(1)
and you get:
ID Date Lat Lon Val St Rec
0 1 2017/10/10 70.1 30.4 10 1 1
1 1 2017/10/10 70.1 31.4 20 2 1
2 1 2017/10/10 70.1 31.4 10 2 2
3 1 2017/10/10 70.1 31.4 10 2 3
4 1 2017/10/12 70.1 31.4 20 3 1
5 2 2017/12/10 70.1 30.4 20 1 1
6 2 2017/12/10 70.1 31.4 20 2 1
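For larger frames, an apply-free variant of the 'St' step (a sketch of mine; it relies on the frame being sorted by ID, Date, Lat, Lon as stated in the question) can be built from a plain row-to-row comparison:

keys = ['ID', 'Date', 'Lat', 'Lon']
# A new station starts whenever any key column changes from the previous row;
# including 'ID' makes the counter restart at each expedition boundary.
changed = df[keys].ne(df[keys].shift()).any(axis=1)
df['St'] = changed.astype(int).groupby(df['ID']).cumsum()
df['Rec'] = df.groupby(keys).cumcount() + 1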
I have a DataFrame which looks like this:
date,time,metric_x
2016-02-27,00:00:28.0000000,31
2016-02-27,00:01:19.0000000,40
2016-02-27,00:02:55.0000000,39
2016-02-27,00:03:51.0000000,48
2016-02-27,00:05:22.0000000,42
2016-02-27,00:05:59.0000000,35
I wish to generate a new column:
df['time_slot'] = df.apply(lambda row: time_slot_convert(pd.to_datetime(row['time'])), axis =1)
Where,
def time_slot_convert(time):
    return time.hour + 1
This function finds the hour for the record, plus 1.
This is extremely slow. I understand that the data is read as a string. Is there a more efficient way to speed this up?
It is faster to remove apply:
df['time_slot'] = pd.to_datetime(df['time']).dt.hour + 1
print (df)
date time metric_x time_slot
0 2016-02-27 00:00:28.0000000 31 1
1 2016-02-27 00:01:19.0000000 40 1
2 2016-02-27 00:02:55.0000000 39 1
3 2016-02-27 00:03:51.0000000 48 1
4 2016-02-27 00:05:22.0000000 42 1
5 2016-02-27 00:05:59.0000000 35 1
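If the time column is guaranteed to be zero-padded HH:MM:SS strings (an assumption on my part), an even cheaper variant skips datetime parsing entirely and slices the hour out of the string:

# The hour is always the first two characters, e.g. '00:05:59.0000000' -> 0.
df['time_slot'] = df['time'].str[:2].astype(int) + 1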