The funky way you index into pandas dataframes to change values trips me up: I can never tell whether I'm changing a value in the dataframe itself or only in a copy of it.
I'm also new to Python's syntax for operating on arrays, and I struggle to turn loops over indexes (like in C++) into vector operations in Python.
My problem is that I wish to add a column of pandas.Timestamp values to a dataframe based on values in other columns. Let's say I start with a dataframe like
import pandas as pd
import numpy as np
mydata = np.transpose([[11, 22, 33, 44, 66, 77],
                       pd.to_datetime(['2015-02-26', '2015-02-27', '2015-02-25', np.nan, '2015-01-24', '2015-03-24'], errors='coerce'),
                       pd.to_datetime(['2015-02-24', np.nan, '2015-03-24', '2015-02-26', '2015-02-27', '2015-02-25'], errors='coerce')])
df = pd.DataFrame(columns=['ID', 'BEFORE', 'AFTER'], data=mydata)
df.head(6)
which returns
ID BEFORE AFTER
0 11 2015-02-26 2015-02-24
1 22 2015-02-27 NaT
2 33 2015-02-25 2015-03-24
3 44 NaT 2015-02-26
4 66 2015-01-24 2015-02-27
5 77 2015-03-24 2015-02-25
I want to find the lesser of the dates BEFORE and AFTER and then make a new column called RELEVANT_DATE with the results. I can then drop BEFORE and AFTER. There are a zillion ways to do this but, for me, almost all of them don't work. The best I can do is this
# fix up NaT's only in specific columns, real data has more columns
futureDate = pd.to_datetime('2099-01-01')
df.fillna({'BEFORE':futureDate, 'AFTER':futureDate}, inplace=True)
# super clunky solution
numRows = np.shape(df)[0]
relevantDate = []
for index in range(numRows):
    if df.loc[index, 'AFTER'] >= df.loc[index, 'BEFORE']:
        relevantDate.append(df.loc[index, 'BEFORE'])
    else:
        relevantDate.append(df.loc[index, 'AFTER'])
# add relevant date column to df
df['RELEVANT_DATE'] = relevantDate
# delete irrelevant dates
df.drop(labels=['BEFORE', 'AFTER'], axis=1, inplace=True)
df.head(6)
returning
ID RELEVANT_DATE
0 11 2015-02-24
1 22 2015-02-27
2 33 2015-02-25
3 44 2015-02-26
4 66 2015-01-24
5 77 2015-02-25
This approach is super slow. With a few million rows it takes too long to be useful.
Can you provide a pythonic-style solution for this? Recall that I'm having trouble both with vectorizing these operations AND making sure they get set for real in the DataFrame.
Take the minimum across a row (axis=1). Set the index so you can bring 'ID' along for the ride.
df.set_index('ID').min(axis=1).rename('RELEVANT DATE').reset_index()
ID RELEVANT DATE
0 11 2015-02-24
1 22 2015-02-27
2 33 2015-02-25
3 44 2015-02-26
4 66 2015-01-24
5 77 2015-02-25
Or assign the new column to your existing DataFrame:
df['RELEVANT DATE'] = df[['BEFORE', 'AFTER']].min(axis=1)
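For completeness, a minimal sketch of the fully vectorized version, assuming BEFORE and AFTER are genuine datetime64 columns (convert with pd.to_datetime first if they came in as object dtype). Since min skips NaT by default (skipna=True), the far-future fillna step is optional:
# take the row-wise minimum, then drop the source columns
df['RELEVANT_DATE'] = df[['BEFORE', 'AFTER']].min(axis=1)
df = df.drop(columns=['BEFORE', 'AFTER'])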
Related
I have two dataframes, A and B, that contain different sets of patient data, and I need to append certain columns from B to A, but only for those rows that contain information from the same patient and visit, i.e. where A and B have matching values in two particular columns. B is longer than A, and not all rows in A are contained in B. I don't know how this would be possible without looping, but many people discourage looping over pandas dataframes (apart from the fact that my loop solution does not work, failing with "Can only compare identically-labeled Series objects"). I read the options here
How to iterate over rows in a DataFrame in Pandas
but don't see which one I could apply here and would appreciate any tips!
Toy example (the actual dataframe has about 300 rows):
dict_A = {
    'ID': ['A_190792','X_210392','B_050490','F_311291','K_010989'],
    'Visit_Date': ['2010-10-31','2011-09-24','2010-30-01','2012-01-01','2013-08-13'],
    'Score': [30, 23, 42, 23, 31],
}
A = pd.DataFrame(dict_A)
dict_B = {
    'ID': ['G_090891','A_190792','Z_060791','X_210392','B_050490','F_311291','K_010989','F_230989'],
    'Visit_Date': ['2013-03-01','2010-10-31','2013-04-03','2011-09-24','2010-30-01','2012-01-01','2013-08-13','2014-09-09'],
    'Diagnosis': ['F12', 'G42', 'F34', 'F90', 'G98','G87','F23','G42'],
}
B = pd.DataFrame(dict_B)
for idx, row in A.iterrows():
    A.loc[row,'Diagnosis'] = B['Diagnosis'][(B['Visit_Date']==A['Visit_Date']) & (B['ID']==A['ID'])]
    # Appends Diagnosis column from B to A for rows where ID and date match
I have seen this question: Append Columns to Dataframe 1 Based on Matching Column Values in Dataframe 2, but the only answer is quite specific to that case and also does not address whether a loop can or should be used.
I think you can use merge:
A['Visit_Date']=pd.to_datetime(A['Visit_Date'])
B['Visit_Date']=pd.to_datetime(B['Visit_Date'])
final=A.merge(B,on=['Visit_Date','ID'],how='outer')
print(final)
'''
ID Visit_Date Score Diagnosis
0 A_190792 2010-10-31 30.0 G42
1 X_210392 2011-09-24 23.0 F90
2 B_050490 2010-30-01 42.0 G98
3 F_311291 2012-01-01 23.0 G87
4 K_010989 2013-08-13 31.0 F23
5 G_090891 2013-03-01 NaN F12
6 Z_060791 2013-04-03 NaN F34
7 F_230989 2014-09-09 NaN G42
'''
If you only want the rows that appear in A, use a left merge:
A['Visit_Date']=pd.to_datetime(A['Visit_Date'])
B['Visit_Date']=pd.to_datetime(B['Visit_Date'])
final=A.merge(B,on=['Visit_Date','ID'],how='left')
print(final)
'''
ID Visit_Date Score Diagnosis
0 A_190792 2010-10-31 30 G42
1 X_210392 2011-09-24 23 F90
2 B_050490 2010-30-01 42 G98
3 F_311291 2012-01-01 23 G87
4 K_010989 2013-08-13 31 F23
'''
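One optional safeguard, under the assumption (worth checking on your real data) that each (ID, Visit_Date) pair is unique in both frames: merge's validate argument makes the merge raise a MergeError instead of silently duplicating rows.
final = A.merge(B, on=['Visit_Date', 'ID'], how='left', validate='one_to_one')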
I have three dataframes, each with more than 71K rows. Below are samples.
df_1 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001],'Col_A':[45,56,78,33]})
df_2 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887],'Col_B':[35,46,78,33,66]})
df_3 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887,1223],'Col_C':[5,14,8,13,16,8]})
Edit
As suggested, below is my desired output:
df_final
Device_ID Col_A Col_B Col_C
1001 45 35 5
1034 56 46 14
1223 78 78 8
1001 33 33 13
1887 NaN 66 16
1223 NaN NaN 8
Using pd.merge() or df_1.set_index('Device_ID').join([df_2.set_index('Device_ID'), df_3.set_index('Device_ID')], on='Device_ID') takes a very long time. One reason is the repeated values of Device_ID.
I am aware of the reduce method, but I suspect it may lead to the same situation.
Is there a better, more efficient way?
To get your desired outcome, you can use this:
result = pd.concat([df_1.drop('Device_ID', axis=1),
                    df_2.drop('Device_ID', axis=1),
                    df_3], axis=1).set_index('Device_ID')
If you don't want to use Device_ID as index, you can remove the set_index part of the code. Also, note that because of the presence of NaN's in some columns (Col_A and Col_B) in the final dataframe, Pandas will cast non-missing values to floats, as NaN can't be stored in an integer array (unless you have Pandas version 0.24, in which case you can read more about it here).
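If the rows of the three frames are not guaranteed to be positionally aligned, the reduce approach you mention would look roughly like the sketch below (not benchmarked). Note that with repeated Device_ID values each merge produces one row per matching key combination, which is likely what makes the key-based joins blow up in the first place.
from functools import reduce
import pandas as pd

dfs = [df_1, df_2, df_3]
# successive outer merges on Device_ID; duplicate keys multiply rows
merged = reduce(lambda left, right: pd.merge(left, right, on='Device_ID', how='outer'), dfs)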
I'm having trouble creating averages using pandas. My problem is that I want to create averages combining the months Nov, Dec, Jan, Feb, March for each winter; however, these months fall in different calendar years, so I can't just average the values falling within one calendar year. I have tried subsetting the data by month into two pieces as...
nd_npi_obs = ndjfm_npi_obs[ndjfm_npi_obs.index.month.isin([11,12])]
jfm_npi_obs = ndjfm_npi_obs[ndjfm_npi_obs.index.month.isin([1,2,3])]
...however I'm having trouble manipulating the dates (years) in order to do a simple average. I'm inexperienced with pandas and wonder whether there is a more elegant way than exporting to Excel and changing the year! The data is in the form...
Date
1899-01-01 00:00:00 100994.0
1899-02-01 00:00:00 100932.0
1899-03-01 00:00:00 100978.0
1899-11-01 00:00:00 100274.0
1899-12-01 00:00:00 100737.0
1900-01-01 100655.0
1900-02-01 100633.0
1900-03-01 100512.0
1900-11-01 101212.0
1900-12-01 100430.0
Interesting problem. Since you are averaging over five months that span two calendar years, resampling is trickier. You should be able to overcome this with logical indexing and by building a new dataframe. I assume your index is a datetime value.
index = pd.date_range('1899-09-01', '1902-03-01', freq='1M')
data = np.random.randint(0, 100, (index.size, 5))
df = pd.DataFrame(index=index, data=data, columns=list('ABCDE'))
# find rows that meet your criteria and average
idx1 = (df.index.year==1899) & (df.index.month >10)
idx2 = (df.index.year==1900) & (df.index.month < 4)
winterAve = df.loc[idx1 | idx2, :].mean(axis=0)
Just to visually check that the indexing/slicing is doing what we need....
>>>df.loc[idx1 | idx2, :]
Out[200]:
A B C D E
1899-11-30 48 91 87 29 47
1899-12-31 63 5 0 35 22
1900-01-31 37 8 89 86 38
1900-02-28 7 35 56 63 46
1900-03-31 72 34 96 94 35
You should be able to put this in a for loop to iterate over multiple years, etc.
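For completeness, a rough sketch of that loop over winters (same toy df and imports as above; here "winter N" means Nov/Dec of year N plus Jan-Mar of year N+1):
winter_means = {}
for year in range(1899, 1902):
    early = (df.index.year == year) & (df.index.month > 10)       # Nov, Dec of `year`
    late = (df.index.year == year + 1) & (df.index.month < 4)     # Jan-Mar of the next year
    winter_means[year] = df.loc[early | late, :].mean(axis=0)
winter_means = pd.DataFrame(winter_means).T    # one row per winter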
Group data by month using pd.Grouper
g = df.groupby(pd.Grouper(freq="M")) # DataFrameGroupBy (grouped by Month)
For each group, calculate the average of only the 'A' column
monthly_averages = g.aggregate({"A":np.mean})
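If you want one average per winter rather than per calendar month, a sketch along these lines (assuming the same DatetimeIndex df and imports as above, and using column 'A' as an example) avoids the loop: label each November-March observation with the year of that winter's November, then group on the label.
winter = df[df.index.month.isin([11, 12, 1, 2, 3])].copy()
# "winter_year" is the year of the winter's November: Jan-Mar rows belong
# to the winter that started the previous November.
winter['winter_year'] = np.where(winter.index.month >= 11,
                                 winter.index.year,
                                 winter.index.year - 1)
winter_averages = winter.groupby('winter_year')['A'].mean()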
I am a newbie in Python and I am struggling to code things that seem simple in PHP/SQL; I hope you can help me.
I have 2 Pandas Dataframes that I have simplified for a better understanding.
In the first dataframe, df2015, I have the Sales for 2015.
Note that, unfortunately, we do not have values for ALL the stores!
>>> df2015
Store Date Sales
0 1 2015-01-15 6553
1 3 2015-01-15 7016
2 6 2015-01-15 8840
3 8 2015-01-15 10441
4 9 2015-01-15 7952
And another dataframe, df2016, for the sales forecast in 2016, which lists ALL the stores. (As you can guess, SalesForecast is the column to fill.)
>>> df2016
Store Date SalesForecast
0 1 2016-01-15
1 2 2016-01-15
2 3 2016-01-15
3 4 2016-01-15
4 5 2016-01-15
I want to create a function that, for each row in df2016, retrieves the Sales value from df2015, increases it by (for example) 5%, and writes the new value into the SalesForecast column of df2016.
Let's say forecast is the function I have created and want to apply:
def forecast(store_id, date):
    sales2015 = df2015['Sales'].loc[(df2015['Store'].values == store_id) & (df2015['Date'].values == date)].values
    forecast2016 = sales2015 * 1.05
    return forecast2016
I have tested this function with hard-coded arguments as below, and it works:
>>> forecast(1,'2015-01-15')
array([ 6880.65])
But here is where my problem lies... how can I apply this function to the dataframes?
It would be very easy to do in PHP by looping over each row of df2016 and retrieving the values (if they exist) from df2015 with a SELECT ... WHERE Store = store_id AND Date = date... but it seems the logic is not the same with pandas dataframes and Python.
I have tried the apply function as follows:
df2016['SalesForecast'] = df2016.apply(df2016['Store'],df2016['Date'])
but I can't get the arguments right, or there is something else I am doing wrong.
I think I do not have the right method, or maybe my approach is not suitable for pandas and Python at all?
I believe you are almost there! What's missing is the function itself: you passed in only the arguments.
The apply function takes in a function and its args. The documentation is here.
Without having tried this on my own system, I would suggest applying row-wise (axis=1) and pulling the arguments out of each row:
df2016['SalesForecast'] = df2016.apply(lambda row: forecast(row['Store'], row['Date']), axis=1)
Note that forecast looks up df2015 by the exact date you pass in, so you would need to pass the matching 2015 date rather than the row's 2016 date for anything to be found.
One of the nice things about Pandas is that it handles missing data well. The trick is to use a common index on both dataframes. For instance, if we set the index of both dataframes to be the 'Store' column:
df2015.set_index('Store', inplace=True)
df2016.set_index('Store', inplace=True)
Then doing what you'd like is as simple as:
df2016['SalesForecast'] = df2015['Sales'] * 1.05
resulting in:
Date SalesForecast
Store
1 2016-01-15 6880.65
2 2016-01-15 NaN
3 2016-01-15 7366.80
4 2016-01-15 NaN
5 2016-01-15 NaN
That the SalesForecast for store 2 is NaN reflects the fact that store 2 doesn't exist in the df2015 dataframe.
Notice that if you just need to multiply the Sales column from df2015 by 1.05, you can just do so, all in df2015:
In [18]: df2015['Forecast'] = df2015['Sales'] * 1.05
In [19]: df2015
Out[19]:
Store Date Sales Forecast
0 1 2015-01-15 6553 6880.65
1 3 2015-01-15 7016 7366.80
2 6 2015-01-15 8840 9282.00
3 8 2015-01-15 10441 10963.05
4 9 2015-01-15 7952 8349.60
At this point, you can join that result onto df2016 if you need this to appear in the df2016 data set:
In [20]: pandas.merge(df2016, # left side of join
df2015, # right side of join
on='Store', # similar to SQL 'on' for 'join'
how='outer', # same as SQL, outer join.
suffixes=('_2016', '_2015')) # rename same-named
# columns w/suffix
Out[20]:
Store Date_2016 Date_2015 Sales Forecast
0 1 2016-01-15 2015-01-15 6553 6880.65
1 2 2016-01-15 NaN NaN NaN
2 3 2016-01-15 2015-01-15 7016 7366.80
3 4 2016-01-15 NaN NaN NaN
4 5 2016-01-15 NaN NaN NaN
5 6 2016-01-15 2015-01-15 8840 9282.00
6 7 2016-01-15 NaN NaN NaN
7 8 2016-01-15 2015-01-15 10441 10963.05
8 9 2016-01-15 2015-01-15 7952 8349.60
If the two DataFrames happen to have compatible indexes already, you can simply write the result column into df2016 directly, even if it is computed from another DataFrame like df2015. In general, though, you need to be careful with this, and it is often clearer to perform the join explicitly (as I did above using the merge function). Which way is best will depend on your application and your knowledge of the index columns.
For more general function application to a column, a whole DataFrame, or groups of sub-frames, refer to the documentation for this type of operation in Pandas.
There are also links with some cookbook examples and comparisons with the way you might express similar operations in SQL.
Note that I created data to replicate your example data with these commands:
import datetime
import pandas

df2015 = pandas.DataFrame([[1, datetime.date(2015, 1, 15), 6553],
                           [3, datetime.date(2015, 1, 15), 7016],
                           [6, datetime.date(2015, 1, 15), 8840],
                           [8, datetime.date(2015, 1, 15), 10441],
                           [9, datetime.date(2015, 1, 15), 7952]],
                          columns=['Store', 'Date', 'Sales'])
from itertools import zip_longest  # izip_longest on Python 2
df2016 = pandas.DataFrame(
    list(zip_longest(range(1, 10),
                     [datetime.date(2016, 1, 15)],
                     fillvalue=datetime.date(2016, 1, 15))),
    columns=['Store', 'Date']
)
I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one datetime64[ns]. I would like to find some way to 'group' records together which have datetimes which are very close to one another. For example, I might be interested in grouping the following transactions together if they occur within two seconds of each other by assigning a common ID called Grouped ID:
Transaction ID Time Grouped ID
1 08:10:02 1
2 08:10:03 1
3 08:10:50
4 08:10:55
5 08:11:00 2
6 08:11:01 2
7 08:11:02 2
8 08:11:03 3
9 08:11:04 3
10 08:15:00
Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals - once a full 2 second window has passed, a new window would begin with the next transaction (as shown in transactions 5 - 9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combine transactions within 50 ms) but stuck with seconds for ease of presentation above.
Thanks very much for any insight you can offer!
The solution I suggest requires you to index your data by your Time column.
You can build a list of datetimes with the desired frequency, use searchsorted to find the nearest datetimes in your index, and then use them for slicing (as suggested in the questions python pandas dataframe slicing by date conditions and Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval).
I'm using pandas 0.14.1 and the DateOffset object (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you might adapt the code. DateOffset goes down to the microsecond level.
Using the following code,
import pandas as pd
import pandas.tseries.offsets as pto
import numpy as np
# Create some test data
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)},
                  index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))
# Define periods to define groups (ticks)
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5 * pto.Milli())
# find nearest index positions matching the ticks
index_ticks = np.unique(df.index.searchsorted(ticks))
# make a dataframe with the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id'])
# set the group ids (positional slicing, hence iloc)
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i
# update original dataframe
df['Group id'] = dgroups['Group id']
I was able to obtain this kind of dataframe:
value Group id
2014-11-03 00:00:00 0 0
2014-11-03 00:00:00.001000 1 0
2014-11-03 00:00:00.002000 2 0
2014-11-03 00:00:00.003000 3 0
2014-11-03 00:00:00.004000 4 0
2014-11-03 00:00:00.005000 5 1
2014-11-03 00:00:00.006000 6 1
2014-11-03 00:00:00.007000 7 1
2014-11-03 00:00:00.008000 8 1
2014-11-03 00:00:00.009000 9 1
2014-11-03 00:00:00.010000 10 2
2014-11-03 00:00:00.011000 11 2
2014-11-03 00:00:00.012000 12 2
2014-11-03 00:00:00.013000 13 2
2014-11-03 00:00:00.014000 14 2
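For what it's worth, if you need windows anchored to the first transaction of each group (a new window starting with the first transaction that falls more than the threshold after the current window's start), a plain-Python sketch like the one below may be closer to the behaviour described in the question. It assumes the times are sorted and uses a 2-second threshold as an example; it is a simple O(n) loop rather than a vectorized operation, and the column names are only illustrative.
import pandas as pd

def anchored_window_ids(times, window=pd.Timedelta(seconds=2)):
    # Start a new window whenever a timestamp falls more than `window`
    # after the current window's first (anchor) timestamp.
    ids, anchor, current = [], None, 0
    for t in times:
        if anchor is None or t - anchor > window:
            current += 1
            anchor = t
        ids.append(current)
    return ids

# hypothetical usage, with 'Time' holding datetime64 values:
# df['Grouped ID'] = anchored_window_ids(df['Time'])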