I've got a function that returns two values (call them Site and Date). I'm trying to use df.apply to create two new columns, each holding one of the returned values. I don't want to apply the function two or more times because it will take ages, so I need some way to set the values of two (or more) columns from a single call. Here is my code.
df1[['Site','Site Date']] = df1.apply(
    lambda row: firstSite(biomass, row['lat'], row['long'], row['Date']),
    axis=1)
The input value biomass is a dataframe of coordinates; 'lat', 'long' and 'Date' are all columns from df1. If I apply this function to a single column, df1['Site'], it works perfectly, but when I try to assign the values to two columns I get this error.
ValueError: Shape of passed values is (999, 2), indices imply (999, 28)
def firstSite(biomass, lat, long, date):
    biomass['Date of Operation'] = pd.to_datetime(biomass['Date of Operation'])
    biomass = biomass[biomass['Date of Operation'] <= date]
    biomass['distance'] = biomass.apply(
        lambda row: distanceBetweenCm(lat, long, row['Lat'], row['Lng']),
        axis=1)
    biomass['Site Name'] = np.where((biomass['distance'] <= 2), biomass['Site Name'], "Null")
    biomass = biomass.drop_duplicates('Site Name')
    Site = biomass.loc[biomass['Date of Operation'].idxmin(), 'Site Name']
    Lat = biomass.loc[biomass['Date of Operation'].idxmin(), 'Lat']
    return Site, Lat
This function has a few tasks:
1 - It removes any rows from biomass where the date is after df1['Date'].
2 - If the distance between the coordinates is more than 2 km, the 'Site Name' is changed to 'Null'.
3 - It removes any duplicate site names, ensuring that there will only be one row with the value 'Null'.
4 - It returns the 'Site Name' and 'Lat' of the row with the earliest 'Date of Operation'.
I need my code to return the first (by date) record from biomass where the distance between the coordinates from df1 and biomass is less than 2 km.
Hopefully I'll then be able to return the first record for several different radii, such as the first biomass site within 2 km, 4 km, 6 km, 8 km and 10 km.
I think your function needs to return a Series with 2 values:
df1 = pd.DataFrame({'A':list('abcdef'),
'lat':[4,5,4,5,5,4],
'long':[7,8,9,4,2,3],
'Date':pd.date_range('2011-01-01', periods=6),
'E':[5,3,6,9,2,4],
'F':list('aaabbb')})
print (df1)
A Date E F lat long
0 a 2011-01-01 5 a 4 7
1 b 2011-01-02 3 a 5 8
2 c 2011-01-03 6 a 4 9
3 d 2011-01-04 9 b 5 4
4 e 2011-01-05 2 b 5 2
5 f 2011-01-06 4 b 4 3
biomass = 10
def firstSite(a, b, c, d):
    return pd.Series([a + b, d])
df1[['Site','Site Date']] = df1.apply(
    lambda row: firstSite(biomass, row['lat'], row['long'], row['Date']),
    axis=1)
print (df1)
A Date E F lat long Site Site Date
0 a 2011-01-01 5 a 4 7 14 2011-01-01
1 b 2011-01-02 3 a 5 8 15 2011-01-02
2 c 2011-01-03 6 a 4 9 14 2011-01-03
3 d 2011-01-04 9 b 5 4 15 2011-01-04
4 e 2011-01-05 2 b 5 2 15 2011-01-05
5 f 2011-01-06 4 b 4 3 14 2011-01-06
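If you prefer the function to name the new columns itself, it can return a Series with an index and you can join the result instead of assigning to a column list. A small sketch reusing the toy df1 and biomass above (it assumes df1 does not already have 'Site' / 'Site Date' columns):

def firstSite(a, b, c, d):
    return pd.Series({'Site': a + b, 'Site Date': d})

df1 = df1.join(df1.apply(lambda row: firstSite(biomass, row['lat'], row['long'], row['Date']),
                         axis=1))

In your real firstSite the only change needed is the last line: return pd.Series([Site, Lat]) (or a named Series as above) instead of return Site, Lat.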
So I have the following dataframe:
Period group ID
20130101 A 10
20130101 A 20
20130301 A 20
20140101 A 20
20140301 A 30
20140401 A 40
20130101 B 11
20130201 B 21
20130401 B 31
20140401 B 41
20140501 B 51
I need to count how many different ID there are by group in the last year. So my desired output would look like this:
Period group num_ids_last_year
20130101 A 2 # ID 10 and 20 in the last year
20130301 A 2
20140101 A 2
20140301 A 2 # ID 30 enters, ID 10 leaves
20140401 A 3 # ID 40 enters
20130101 B 1
20130201 B 2
20130401 B 3
20140401 B 2 # ID 11 and 21 leave
20140501 B 2 # ID 31 leaves, ID 51 enters
Period is in datetime format. I tried many things along the lines of:
df.groupby(['group','Period'])['ID'].nunique() # Get number of IDs by group in a given period.
df.groupby(['group'])['ID'].nunique() # Get total number of IDs by group.
df.set_index('Period').groupby('group')['ID'].rolling(window=1, freq='Y').nunique()
But the last one isn't even possible. Is there any straightforward way to do this? I'm thinking maybe some kind of combination of cumcount() and pd.DateOffset, or maybe ge(df.Period - dt.timedelta(365)), but I can't find the answer.
Thanks.
Edit: added the fact that I can find more than one ID in a given Period
Looking at your data structure, I am guessing you have MANY duplicates, so start by dropping them; drop_duplicates tends to be fast.
I am assuming that the df['Period'] column is of dtype datetime64[ns].
from dateutil.relativedelta import relativedelta

df = df.drop_duplicates()

results = dict()
for start in df['Period'].drop_duplicates():
    end = start.date() - relativedelta(years=1)
    screen = (df.Period <= start) & (df.Period >= end)  # screen for 1 year of data
    singles = df.loc[screen, ['group', 'ID']].drop_duplicates()  # unique (group, ID) pairs in that year
    x = singles.groupby('group').count()
    results[start] = x

results = pd.concat(results)
results
ID
group
2013-01-01 A 2
B 1
2013-02-01 A 2
B 2
2013-03-01 A 2
B 2
2013-04-01 A 2
B 3
2014-01-01 A 2
B 3
2014-03-01 A 2
B 1
2014-04-01 A 3
B 2
2014-05-01 A 3
B 2
Is that any faster?
p.s. if df['Period'] is not a datetime:
df['Period'] = pd.to_datetime(df['Period'],format='%Y%m%d', errors='ignore')
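To reshape results into the Period / group / num_ids_last_year layout from the question, something along these lines should work (a sketch; the outer level created by pd.concat is unnamed, so it shows up as level_0 after reset_index):

out = (results.reset_index()
              .rename(columns={'level_0': 'Period', 'ID': 'num_ids_last_year'})
              .sort_values(['group', 'Period'])
              .reset_index(drop=True))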
Here is a solution using groupby and rolling. Note: your desired output counts a year from YYYY0101 up to and including the next year's YYYY0101, so you need a rolling window of 366D instead of 365D.
import numpy as np

df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
df = df.set_index('Period')
df_final = (df.groupby('group')['ID'].rolling(window='366D')
              .apply(lambda x: np.unique(x).size, raw=True)
              .reset_index(name='ID_count')
              .drop_duplicates(['group','Period'], keep='last'))
Out[218]:
group Period ID_count
1 A 2013-01-01 2.0
2 A 2013-03-01 2.0
3 A 2014-01-01 2.0
4 A 2014-03-01 2.0
5 A 2014-04-01 3.0
6 B 2013-01-01 1.0
7 B 2013-02-01 2.0
8 B 2013-04-01 3.0
9 B 2014-04-01 2.0
10 B 2014-05-01 2.0
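Rolling apply returns floats; if you need integer counts, one option is a final cast (assuming no NaN counts remain):

df_final['ID_count'] = df_final['ID_count'].astype(int)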
Note: on 18M+ rows, I don't think this solution will finish within 10 minutes; I'd hope it takes about 30 minutes.
from dateutil.relativedelta import relativedelta
df.sort_values(by=['Period'], inplace=True) # if not already sorted
# create new output df
df1 = (df.groupby(['Period','group'])['ID']
         .apply(lambda x: list(x))
         .reset_index())
df1['num_ids_last_year'] = df1.apply(
    lambda x: len(set(df1.loc[(df1['Period'] >= x['Period'] - relativedelta(years=1)) &
                              (df1['Period'] <= x['Period']) &
                              (df1['group'] == x['group'])].ID
                         .apply(pd.Series).stack())),
    axis=1)
df1.sort_values(by=['group'], inplace=True)
df1.drop('ID', axis=1, inplace=True)
df1 = df1.reset_index(drop=True)
I'm trying to convert this dataframe into a series, or the series into a dataframe (basically one into the other), in order to be able to do operations with them. My second problem is that I want to delete the first column of the dataframe below (before or after converting doesn't really matter), or be able to delete a column from a series.
I searched for similar questions but they did not correspond to my issue.
Thanks in advance. Here are the dataframe and the series.
JOUR FL_AB_PCOUP FL_ABER_NEGA FL_AB_PMAX FL_AB_PSKVA FL_TROU_PDC \
0 2018-07-09 -0.448787 0.0 1.498464 -0.197012 1.001577
CDC_INCOMPLET_HORS_ABERRANTS CDC_COMPLET_HORS_ABERRANTS CDC_ABSENT \
0 -0.729002 -1.03586 1.032936
CDC_ABERRANTS PRM_X_PDC_ZERO mean.msr.pdc sd.msr.pdc sum.msr.pdc \
0 1.49976 -0.497693 -1.243274 -1.111366 0.558516
FL_AB_PCOUP 8.775974e-05
FL_ABER_NEGA 0.000000e+00
FL_AB_PMAX 1.865632e-03
FL_AB_PSKVA 2.027215e-05
FL_TROU_PDC 2.222952e-02
FL_AB_COMBI 1.931156e-03
CDC_INCOMPLET_HORS_ABERRANTS 1.562195e-03
CDC_COMPLET_HORS_ABERRANTS 9.758743e-01
CDC_ABSENT 2.063239e-02
CDC_ABERRANTS 1.931156e-03
PRM_X_PDC_ZERO 2.127753e+01
mean.msr.pdc 1.125987e+03
sd.msr.pdc 1.765955e+03
sum.msr.pdc 3.310615e+08
n.resil 3.884103e-04
dtype: float64
Setup:
df = pd.DataFrame({'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4]})
print (df)
B C D E
0 4 7 1 5
1 5 8 3 3
2 4 9 5 6
3 5 4 7 9
4 5 2 1 2
5 4 3 0 4
For DataFrame to Series, select a row, e.g. by position with iloc or by index label with loc:
#select some row, e.g. first
s = df.iloc[0]
print (s)
B 4
C 7
D 1
E 5
Name: 0, dtype: int64
And for Series to DataFrame use to_frame, with a transpose if necessary:
df = s.to_frame().T
print (df)
B C D E
0 4 7 1 5
Finally, to remove a column from a DataFrame use DataFrame.drop:
df = df.drop('B',axis=1)
print (df)
C D E
0 7 1 5
And to remove a value from a Series use Series.drop:
s = s.drop('C')
print (s)
B 4
D 1
E 5
Name: 0, dtype: int64
You can delete a particular column by its position with
df.drop(df.columns[i], axis=1)
where i is the column's integer position. As for converting, pd.Series(df) expects one-dimensional data, so for a DataFrame you normally select a single row or column first (as shown above); for a one-column frame there is also DataFrame.squeeze, sketched below.
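A minimal sketch of that one-column round trip (the column name 'x' is just an example):

import pandas as pd

one_col = pd.DataFrame({'x': [1.0, 2.0, 3.0]})
s = one_col.squeeze()    # one-column DataFrame -> Series named 'x'
back = s.to_frame()      # and back to a one-column DataFrame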
I have the following DataFrame (in reality I'm working with around 20 million rows):
shop month day sale
1 7 1 10
1 6 1 8
1 5 1 9
2 7 1 10
2 6 1 8
2 5 1 9
I want another column, "prev month sale", whose value equals the sale of the previous month for the same shop and day, e.g.
shop month day sale prev month sale
1 7 1 10 8
1 6 1 8 9
1 5 1 9 9
2 7 1 10 8
2 6 1 8 9
2 5 1 9 9
One solution using .concat(), set_index(), and .loc[]:
# Get index of (shop, previous month, day).
# This will serve as a unique index to look up prev. month sale.
prev = pd.concat((df.shop, df.month - 1, df.day), axis=1)
# Build a MultiIndex from the (shop, month-1, day) columns for the lookup
prev = pd.MultiIndex.from_arrays(prev.values.T)
# old: [tuple(i) for i in prev.values]
# Now call .loc on df to look up each prev. month sale.
sale_prev_month = df.set_index(['shop', 'month', 'day']).loc[prev]
# And finally just concat rather than merge/join operation
# because we want to ignore index & mimic a left join.
df = pd.concat((df, sale_prev_month.reset_index(drop=True)), axis=1)
shop month day sale sale
0 1 7 1 10 8.0
1 1 6 1 8 9.0
2 1 5 1 9 NaN
3 2 7 1 10 8.0
4 2 6 1 8 9.0
5 2 5 1 9 NaN
Your new column will be float, not int, because of the presence of NaNs.
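Two optional follow-ups: the lookup column can be renamed before the concat so you don't end up with two 'sale' columns, and a nullable integer dtype keeps whole numbers despite the NaNs. A sketch, assuming a pandas version with the Int64 extension dtype (note that newer pandas raises a KeyError for .loc with missing keys, so .reindex(prev) is used here in place of .loc[prev]):

sale_prev_month = (df.set_index(['shop', 'month', 'day'])
                     .reindex(prev)
                     .rename(columns={'sale': 'prev_month_sale'}))
df = pd.concat((df, sale_prev_month.reset_index(drop=True)), axis=1)
df['prev_month_sale'] = df['prev_month_sale'].astype('Int64')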
Update - an attempt with dask
I don't use dask day to day so this is probably woefully sub-par. Trying to work around the fact that dask does not implement pandas' MultiIndex. So, you can concatenate your three existing indices into a string column and lookup on that.
import numpy as np
import pandas as pd
import dask.dataframe as dd

# The three key columns that would form the MultiIndex in plain pandas
on = ['shop', 'month', 'day']

# Play around with npartitions or chunksize here!
df2 = dd.from_pandas(df, npartitions=10)

# Get a *single* index of unique (shop, month, day) IDs
# Dask doesn't support MultiIndex
empty = pd.Series(np.empty(len(df), dtype='object'))  # Passed to `meta`
current = df2.loc[:, on].apply(lambda row: '_'.join(row.astype(str)), axis=1,
                               meta=empty)
prev = df2.loc[:, on].assign(month=df2['month'] - 1)\
          .apply(lambda row: '_'.join(row.astype(str)), axis=1, meta=empty)
df2 = df2.set_index(current)
# We now have two dask.Series, `current` and `prev`, in the
# concatenated format "shop_month_day".
# We also have a dask.DataFrame, df2, which is indexed by `current`
# I would think we could just call df2.loc[prev].compute(), but
# that's throwing a KeyError for me, so slightly more expensive:
sale_prev_month = df2.compute().loc[prev.compute()][['sale']]\
.reset_index(drop=True)
# Now just concat as before
# Could re-break into dask objects here if you really needed to
df = pd.concat((df, sale_prev_month.reset_index(drop=True)), axis=1)
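For what it's worth, in plain pandas the same previous-month lookup can also be written as a left merge, which avoids the MultiIndex (and the dask workaround) entirely. A sketch, not benchmarked, assuming df is the original shop/month/day/sale frame:

prev_sales = df[['shop', 'month', 'day', 'sale']].rename(columns={'sale': 'prev_month_sale'})
prev_sales['month'] = prev_sales['month'] + 1   # shift so each row aligns with the *following* month
df = df.merge(prev_sales, on=['shop', 'month', 'day'], how='left')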
I am trying to generate a heatmap using seaborn, however I am having a small problem with the formatting of my data.
Currently, my data is in the form:
Name Diag Date
A 1 2006-12-01
A 1 1994-02-12
A 2 2001-07-23
B 2 1999-09-12
B 1 2016-10-12
C 3 2010-01-20
C 2 1998-08-20
I would like to create a heatmap (preferably in Python) showing Name on one axis against Diag - i.e. whether it occurred. I have tried to pivot the table using pd.pivot, however I was given the error
ValueError: Index contains duplicate entries, cannot reshape
this came from:
piv = df.pivot_table(index='Name',columns='Diag')
Time is irrelevant, but I would like to show which Names have had which Diag, and which Diag combinations cluster together. Do I need to create a new table for this, or is it possible with the data I have? In some cases a Name is not associated with every Diag.
EDIT:
I have since tried:
piv = df.pivot_table(index='Name',columns='Diag', values='Time', aggfunc='mean')
However as Time is in datetime format, I end up with:
pandas.core.base.DataError: No numeric types to aggregate
You need pivot_table with some aggregate function, because the same index/column pair has multiple values and pivot needs unique pairs only:
print (df)
Name Diag Time
0 A 1 12 <-duplicates for same A, 1 different value
1 A 1 13 <-duplicates for same A, 1 different value
2 A 2 14
3 B 2 18
4 B 1 1
5 C 3 9
6 C 2 8
df = df.pivot_table(index='Name',columns='Diag', values='Time', aggfunc='mean')
print (df)
Diag 1 2 3
Name
A 12.5 14.0 NaN
B 1.0 18.0 NaN
C NaN 8.0 9.0
Alternative solution:
df = df.groupby(['Name','Diag'])['Time'].mean().unstack()
print (df)
Diag 1 2 3
Name
A 12.5 14.0 NaN
B 1.0 18.0 NaN
C NaN 8.0 9.0
EDIT:
You can also check all duplicates by duplicated:
df = df.loc[df.duplicated(['Name','Diag'], keep=False), ['Name','Diag']]
print (df)
Name Diag
0 A 1
1 A 1
EDIT:
Taking the mean of datetimes is not easy - you need to convert the dates to nanoseconds, take the mean, and finally convert back to datetimes. There is also another problem - NaN has to be replaced by some scalar, e.g. 0, which is converted to the zero datetime, 1970-01-01.
import numpy as np

df.Date = pd.to_datetime(df.Date)
df['dates_in_ns'] = pd.Series(df.Date.values.astype(np.int64), index=df.index)
df = df.pivot_table(index='Name',
columns='Diag',
values='dates_in_ns',
aggfunc='mean',
fill_value=0)
df = df.apply(pd.to_datetime)
print (df)
Diag 1 2 3
Name
A 2000-07-07 12:00:00 2001-07-23 1970-01-01
B 2016-10-12 00:00:00 1999-09-12 1970-01-01
C 1970-01-01 00:00:00 1998-08-20 2010-01-20
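For the heatmap itself, since the dates don't matter, a simple 0/1 presence matrix may be all you need. A sketch with seaborn, assuming df is the original Name/Diag/Date frame and seaborn/matplotlib are installed:

import seaborn as sns
import matplotlib.pyplot as plt

presence = pd.crosstab(df['Name'], df['Diag']).clip(upper=1)  # 1 = Diag occurred for that Name, 0 = not
sns.heatmap(presence, annot=True, cbar=False)
plt.show()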
I have a pandas data frame mydf that has two columns, both of datetime dtype: mydate and mytime. I want to add three more columns: hour, weekday, and weeknum.
def getH(t):  # gives the hour
    return t.hour

def getW(d):  # gives the week number
    return d.isocalendar()[1]

def getD(d):  # gives the weekday
    return d.weekday()  # 0 for Monday, 6 for Sunday

mydf["hour"] = mydf.apply(lambda row: getH(row["mytime"]), axis=1)
mydf["weekday"] = mydf.apply(lambda row: getD(row["mydate"]), axis=1)
mydf["weeknum"] = mydf.apply(lambda row: getW(row["mydate"]), axis=1)
The snippet works, but it's not computationally efficient as it loops through the data frame at least three times. I would just like to know if there's a faster and/or more optimal way to do this. For example, using zip or merge? If, for example, I just create one function that returns three elements, how should I implement this? To illustrate, the function would be:
def getHWd(d, t):
    return t.hour, d.isocalendar()[1], d.weekday()
Here's one approach to do it using a single apply.
Say, df is like
In [64]: df
Out[64]:
mydate mytime
0 2011-01-01 2011-11-14
1 2011-01-02 2011-11-15
2 2011-01-03 2011-11-16
3 2011-01-04 2011-11-17
4 2011-01-05 2011-11-18
5 2011-01-06 2011-11-19
6 2011-01-07 2011-11-20
7 2011-01-08 2011-11-21
8 2011-01-09 2011-11-22
9 2011-01-10 2011-11-23
10 2011-01-11 2011-11-24
11 2011-01-12 2011-11-25
We'll take the lambda function out to a separate line for readability and define it like
In [65]: lambdafunc = lambda x: pd.Series([x['mytime'].hour,
x['mydate'].isocalendar()[1],
x['mydate'].weekday()])
And, apply and store the result to df[['hour', 'weeknum', 'weekday']] (the column order has to match the order of the Series above: hour, ISO week number, weekday)
In [66]: df[['hour', 'weeknum', 'weekday']] = df.apply(lambdafunc, axis=1)
And, the output is like
In [67]: df
Out[67]:
mydate mytime hour weeknum weekday
0 2011-01-01 2011-11-14 0 52 5
1 2011-01-02 2011-11-15 0 52 6
2 2011-01-03 2011-11-16 0 1 0
3 2011-01-04 2011-11-17 0 1 1
4 2011-01-05 2011-11-18 0 1 2
5 2011-01-06 2011-11-19 0 1 3
6 2011-01-07 2011-11-20 0 1 4
7 2011-01-08 2011-11-21 0 1 5
8 2011-01-09 2011-11-22 0 1 6
9 2011-01-10 2011-11-23 0 2 0
10 2011-01-11 2011-11-24 0 2 1
11 2011-01-12 2011-11-25 0 2 2
To complement John Galt's answer:
Depending on the task that is performed by lambdafunc, you may experience some speedup by storing the result of apply in a new DataFrame and then joining with the original:
lambdafunc = lambda x: pd.Series([x['mytime'].hour,
x['mydate'].isocalendar()[1],
x['mydate'].weekday()])
newcols = df.apply(lambdafunc, axis=1)
newcols.columns = ['hour', 'weeknum', 'weekday']
newdf = df.join(newcols)
Even if you do not see a speed improvement, I would recommend using the join. You will be able to avoid the (always annoying) SettingWithCopyWarning that may pop up when assigning directly on the columns:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
You can do this in a somewhat cleaner way by having the function you apply return a pd.Series with named elements:
def process(row):
    return pd.Series(dict(b=row["a"] * 2, c=row["a"] + 2))
my_df = pd.DataFrame(dict(a=range(10)))
new_df = my_df.join(my_df.apply(process, axis="columns"))
The result is:
a b c
0 0 0 2
1 1 2 3
2 2 4 4
3 3 6 5
4 4 8 6
5 5 10 7
6 6 12 8
7 7 14 9
8 8 16 10
9 9 18 11
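Applied to the original question, the same named-Series pattern would look roughly like this (a sketch; get_time_feats is just an illustrative name):

def get_time_feats(row):
    return pd.Series({'hour': row['mytime'].hour,
                      'weekday': row['mydate'].weekday(),
                      'weeknum': row['mydate'].isocalendar()[1]})

mydf = mydf.join(mydf.apply(get_time_feats, axis=1))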
def getWd(d):
    return d.weekday(), d.isocalendar()[1]

def getH(t):
    return t.hour

mydf["hour"] = mydf["mytime"].map(getH)
mydf["weekday"], mydf["weeknum"] = zip(*mydf["mydate"].map(getWd))