I have a little script that creates a new column called class in my pandas dataset and assigns class values for a given time range. It works well, but I suddenly have thousands of time ranges to input, and I wondered if it might be possible to write some kind of loop that gets the three columns (start, finish, and class) from a pandas dataframe.
To complicate things, the time ranges are of irregular interval in dataframe 1 (e.g. a nanosecond, 30 seconds, 4 minutes) and in dataframe 2, (which contains accelerometer data) the time series data increases in increments of 0.010 seconds. Any help appreciated as I'm new to Python.
conditions = [(X['DATETIME'] < '2017-11-17 07:31:07') & (X['DATETIME']>= '2017-11-17 00:00:00'),(X['DATETIME'] < '2017-11-17 07:32:35') & (X['DATETIME']>= '2017-11-17 07:31:07'),(X['DATETIME'] < '2017-11-17 09:01:05') & (X['DATETIME']>= '2017-11-17 08:58:39')]
classes = ['0','1','2']
X['CLASS'] = np.select(conditions, classes, default='5')
There are many possible solutions to this; you could use for loops as you said, etc. But since you are new to Python, I think this answer will show you more about the power of Python and its great packages. I will use the numpy package here. I assume that your first table is in a pandas data frame named X, while the second is in one named conditions.
import numpy as np
# .values drops the index of the looked-up classes so they are assigned by position
X['CLASS'] = conditions['CLASS'].iloc[np.digitize(X['Datetime'].view('i8'),
                                                  conditions['Start'].view('i8')) - 1].values
Don't worry, I won't leave you there. np.digitize takes its first argument and bins it based on the bin borders defined by the second argument. So here you get, for the time in each row, the index of the condition it falls under.
There are a couple of details to be noted:
- .view('i8') provides a view of the datetime values as 64-bit integers, which the numpy package can work with easily (if you are interested, you can read more about the details)
- -1 is needed to realign the results (a value after the start of the first condition gets an index of 1, but we want the indices to start from 0)
- in the end we use the iloc function of the conditions['CLASS'] series to map these indices to the class values.
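To see the binning on something small, here is a minimal sketch with made-up data. The frames, column names and timestamps are assumptions for illustration only, and the integer conversion goes through to_numpy() and astype rather than .view('i8').

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table: each row opens a new class at 'Start'.
conditions = pd.DataFrame({
    "Start": pd.to_datetime([
        "2017-11-17 00:00:00",
        "2017-11-17 07:31:07",
        "2017-11-17 07:32:35",
    ]),
    "CLASS": ["0", "1", "2"],
})

# Hypothetical samples to classify.
X = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2017-11-17 00:00:05",
        "2017-11-17 07:31:07",
        "2017-11-17 07:32:00",
        "2017-11-17 09:00:00",
    ]),
})

# Express both columns as int64 nanoseconds since the epoch.
sample_ns = X["Datetime"].to_numpy().astype("datetime64[ns]").astype("int64")
start_ns = conditions["Start"].to_numpy().astype("datetime64[ns]").astype("int64")

# digitize returns i with start_ns[i-1] <= sample < start_ns[i]; subtract 1 for a 0-based row.
bin_index = np.digitize(sample_ns, start_ns) - 1

# Index a plain array so the classes are assigned by position, not by index label.
X["CLASS"] = conditions["CLASS"].to_numpy()[bin_index]
print(X)
```

Two caveats with this sketch: the starts must be sorted, and every range is taken to run until the next start, so gaps between ranges (a finish earlier than the next start) are not represented. A sample earlier than the first start gets index -1, which silently picks the last class, so check for that case in real data.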
I have the following data (see attached - easier this way). I am trying to find the first occurrence of the value 0 for each customer ID. Then, I plan to use code similar to below to create a Kaplan-Meier curve:
from lifelines import KaplanMeierFitter
## Example Data
durations = [5,6,6,2.5,4,4]
event_observed = [1, 0, 0, 1, 1, 1]
## create a kmf object
kmf = KaplanMeierFitter()
## Fit the data into the model
kmf.fit(durations, event_observed,label='Kaplan Meier Estimate')
## Create an estimate
kmf.plot(ci_show=False) ## ci_show is meant for Confidence interval, since our data set is too tiny, thus i am not showing it.
(this code is from here).
What's the simplest way to do this? Note that I want to ignore the NAs: I have plenty of them and there's no getting around that.
Thanks!
I'm gonna assume that all rows contain at least one non-NaN value.
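One thing we'd have to do first is ensure that we only operate on rows where there is indeed a zero; we can accomplish this with the row-wise min, e.g. min_series = df.min(axis=1) (this assumes no row contains negative values, so that a 0 is the row minimum).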
This will give us a series, and we just have to select on the rows that contain zero:
df.loc[min_series == 0]
Then, we can use idxmin:
df.idxmin(1, skipna=True)
This should spit out the period on which the first 0 is encountered (we've guaranteed that all rows contain a 0).
Then, this should give you what you're looking for!
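If it helps, here is a small sketch of the same idea on invented data. It is a variation, not the exact min/idxmin recipe above: it builds a boolean mask of the zeros and takes the first True per row, so it does not rely on 0 being the row minimum. The column names and customer IDs are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row per customer ID, one column per period.
df = pd.DataFrame(
    {
        "p1": [3.0, np.nan, 2.0],
        "p2": [0.0, 1.0, np.nan],
        "p3": [0.0, 0.0, 1.0],
    },
    index=pd.Index([101, 102, 103], name="customer_id"),
)

is_zero = df.eq(0)              # NaN compares as False, so NAs are ignored
has_zero = is_zero.any(axis=1)  # customers that have at least one 0

# idxmax on a boolean frame returns the first column holding True in each row.
first_zero = is_zero[has_zero].idxmax(axis=1)
print(first_zero)
```

Customer 103 never reaches 0, so it is left out of first_zero; for a Kaplan-Meier fit those customers would be the censored observations.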
I am sure this is not hard, but I can't figure it out!
I want to create a dataframe that starts at 1 for the first row and ends at 100,000 in increments of 1, 2, 4, 5, or whatever. I could do this in my sleep in Excel, but is there a slick way to do this without importing a .csv or .txt file?
I have needed to do this in variations many times and just settled on importing a .csv, but I am tired of that.
Example in Excel
Generating numbers
Generating numbers is not something specific to pandas; the numpy module or the built-in range function (as mentioned by @Grismer) can do the trick. Let's say you want to generate a series of numbers and assign them to a dataframe. There are multiple approaches, two of which I personally prefer.
range function
Take range(1,1000,1) as an example. This function takes up to three arguments, of which only the end is mandatory. When all three are given, the first defines the start number, the second the end number, and the last the step. So the example above produces the numbers 1 to 999 (note that the range is a half-open interval, closed at the start and open at the end).
numpy.arange function
To have the same results as the previous example, take numpy.arange(1,1000,1) as an example. The arguments are completely the same as the range's arguments.
Assigning to dataframe
Now, if you want to assign these numbers to a dataframe, you can easily do this by using the pandas module. Code below is an example of how to generate a dataframe:
import numpy as np
import pandas as pd
myRange = np.arange(1,1001,1) # Could also be myRange = range(1,1001,1)
df = pd.DataFrame({"numbers": myRange})
df.head(5)
which results in a dataframe like this (only the first five rows are shown):
   numbers
0        1
1        2
2        3
3        4
4        5
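For the original question (1 up to 100,000 in steps of 1, 2, 4, 5, ...), the same idea can be wrapped in a small helper. This is only a sketch; stepped_frame and its parameters are names made up for this example.

```python
import numpy as np
import pandas as pd

def stepped_frame(start, stop, step, column="numbers"):
    """Return a one-column DataFrame counting from start up to at most stop."""
    # np.arange excludes its end point, so pass stop + 1 to allow stop itself.
    return pd.DataFrame({column: np.arange(start, stop + 1, step)})

by_ones = stepped_frame(1, 100_000, 1)
by_fives = stepped_frame(1, 100_000, 5)

print(by_ones.tail(2))
print(by_fives.tail(2))
```

With a step of 5 starting from 1, the last value is 99,996, because 100,001 would overshoot the end.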
Difference of numpy.arange and range
To keep this answer short, I'd rather refer you to this answer by @hpaulj
I have a dataframe with 2 columns of zipcodes, and I would like to add another column with the distance between them. I am able to do this with a fairly low number of rows, but I am now working with a dataframe of about 500,000 rows. The code I have works, but on my current dataframe it has been running for about 30 minutes without completing, so I feel what I'm doing is extremely inefficient.
Here is the code
import pgeocode
dist = pgeocode.GeoDistance('us')
def distance_pairing(start,end):
    return dist.query_postal_code(start, end)
zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency wise that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
zips['zipstart'].values,
zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine_distance function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
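To illustrate why passing whole arrays is so much faster than apply, here is a stand-alone haversine written with numpy. It is a sketch of the general technique, not pgeocode's own code; haversine_km, the Earth radius constant and the coordinates are all assumptions for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between arrays of points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Hypothetical start and end points; each pair is a quarter of the way round the globe.
start_lat = np.array([0.0, 0.0])
start_lon = np.array([0.0, 0.0])
end_lat = np.array([0.0, 90.0])
end_lon = np.array([90.0, 0.0])

# One call handles every row at once; there is no Python-level loop.
distances = haversine_km(start_lat, start_lon, end_lat, end_lon)
print(distances)
```

Every operation inside the function works element-wise on the full arrays, which is what makes a single call on 500,000 rows cheap compared with 500,000 separate calls.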
I am trying to create new dataframe based on condition per groupby.
Suppose, I have dataframe with Name, Flag and Month.
import pandas as pd
import numpy as np
data = {'Name':['A', 'A', 'B', 'B'], 'Flag':[0, 1, 0, 1], 'Month':[1,2,1,2]}
df = pd.DataFrame(data)
need = df.loc[df['Flag'] == 0].groupby(['Name'], as_index = False)['Month'].min()
My condition is to find minimum month where flag equal to 0 per name.
I have used .loc to define my condition. It works fine, but I found that it performs quite poorly when applied to 10 million rows.
Any more efficient way to do so?
Thank you!
Just had this same scenario yesterday, where I took a 90 second process down to about 3 seconds. Because speed is your concern (like mine was), and not using solely Pandas itself, I would recommend using Numba and NumPy. The catch is you're going to have to brush up on your data structures and types to get a good grasp on what Numba is really doing with JIT. Once you do though, it rocks.
I would recommend finding a way to get every value in your DataFrame to an integer. For your name column, try unique IDs. Flag and month already look good.
name_ids = []
for i, name in enumerate(np.unique(df["Name"])):
    name_ids.append({i: name})
Then, create a function and loop the old-fashioned way:
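from numba import njit

@njit
def really_fast_numba_loop(data):
    # data should be a NumPy array (e.g. df.to_numpy()), not a DataFrame or dict
    for row in data:
        # do stuff
        pass
    return data
new_df = really_fast_numba_loop(data)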
The first time your function is called in your file, Numba has to compile it, so that call will be about the same speed as it would be elsewhere (or slower), but all the other calls will be lightning fast. So, the trick is finding the balance between what to put in the function and what to put in its outside loop.
In either case, when you're done processing your values, convert your name_ids back to strings and wrap your data in pd.DataFrame.
Et voila. You just beat Pandas iterrows/itertuples.
Comment back if you have questions!
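Before reaching for Numba, it may be worth trying the integer-ID idea with plain NumPy. The sketch below is an alternative to the loop above, not the Numba version itself; the extra "C" row and all variable names are made up for illustration. It uses pd.factorize for the IDs and np.minimum.at for the per-group minimum.

```python
import numpy as np
import pandas as pd

# Hypothetical data in the shape of the question, with one name that has no Flag == 0 row.
df = pd.DataFrame({
    "Name": ["A", "A", "B", "B", "C"],
    "Flag": [0, 1, 0, 0, 1],
    "Month": [1, 2, 4, 3, 5],
})

# Integer ID per name, plus the lookup table to turn IDs back into names.
codes, names = pd.factorize(df["Name"])
months = df["Month"].to_numpy()
keep = df["Flag"].to_numpy() == 0

# Start every group at a sentinel, then take the unbuffered minimum per group ID.
sentinel = np.iinfo(months.dtype).max
min_month = np.full(len(names), sentinel, dtype=months.dtype)
np.minimum.at(min_month, codes[keep], months[keep])

# Groups still holding the sentinel had no Flag == 0 row and are dropped.
found = min_month != sentinel
need = pd.DataFrame({"Name": names[found], "Month": min_month[found]})
print(need)
```

Whether this beats the groupby on 10 million rows depends on the data, so time both on a realistic sample before committing to either.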
I'm using Dask to load an 11m row csv into a dataframe and perform calculations. I've reached a position where I need conditional logic - If this, then that, else other.
If I were to use pandas, for example, I could do the following, where a numpy select statement is used along with an array of conditions and results. This statement takes about 35 seconds to run - not bad, but not great:
df["AndHeathSolRadFact"] = np.select(
    [
        (df['Month'].between(8,12)),
        (df['Month'].between(1,2) & (df['CloudCover']>30)) #parentheses needed: & binds tighter than >
    ], #list of conditions
    [1, 1], #Array of RESULTS (must match conditions)
    default=0) #DEFAULT if no match
What I am hoping to do is use dask to do this, natively, in a dask dataframe, without having to first convert my dask dataframe to a pandas dataframe, and then back again.
This allows me to:
- Use multithreading
- Use a dataframe that is larger than available ram
- Potentially speed up the result.
Sample CSV
Location,Date,Temperature,RH,WindDir,WindSpeed,DroughtFactor,Curing,CloudCover
1075,2019-20-09 04:00,6.8,99.3,143.9,5.6,10.0,93.0,1.0
1075,2019-20-09 05:00,6.4,100.0,93.6,7.2,10.0,93.0,1.0
1075,2019-20-09 06:00,6.7,99.3,130.3,6.9,10.0,93.0,1.0
1075,2019-20-09 07:00,8.6,95.4,68.5,6.3,10.0,93.0,1.0
1075,2019-20-09 08:00,12.2,76.0,86.4,6.1,10.0,93.0,1.0
Full Code for minimum viable sample
import dask.dataframe as dd
import dask.multiprocessing
import dask.threaded
import pandas as pd
import numpy as np
# Dataframes implement the Pandas API
import dask.dataframe as dd
from timeit import default_timer as timer
start = timer()
ddf = dd.read_csv(r'C:\Users\i5-Desktop\Downloads\Weathergrids.csv')
#Convert back to a Dask dataframe because we want that juicy parallelism
ddf2 = dd.from_pandas(df,npartitions=4)
del [df]
print(ddf2.head())
#print(ddf.tail())
end = timer()
print(end - start)
#Clean up remaining dataframes
del ddf2
So, the answer I was able to come up with that was the most performant was:
#Create a helper column where we store the value we want to set the column to later.
ddf['Helper'] = 1
#Create the column where we will be setting values, and give it a default value
ddf['AndHeathSolRadFact'] = 0
#Break the logic out into separate where clauses. Rather than looping we will be selecting those rows
#where the conditions are met and then set the value we want. We are required to use the helper
#column value because we cannot set values directly, but we can match from another column.
#First, a very simple clause. If Temperature is greater than or equal to 8, make
#AndHeathSolRadFact equal to the value in Helper
#Note that at the end, after the comma, we preserve the existing cell value if the condition is not met
ddf['AndHeathSolRadFact'] = (ddf.Helper).where(ddf.Temperature >= 8, ddf.AndHeathSolRadFact)
#A more complex example
#this is the same as the above, but demonstrates how to use a compound select statement where
#we evaluate multiple conditions and then set the value.
ddf['AndHeathSolRadFact'] = (ddf.Helper).where(((ddf.Temperature == 6.8) & (ddf.RH == 99.3)), ddf.AndHeathSolRadFact)
I'm a newbie at this, but I'm assuming this approach counts as being vectorised. It makes full use of the array and evaluates very quickly.
Adding the new column, filling it with 0, evaluating both select statements and replacing the values in the target rows only added 0.2s to the processing time on an 11m row dataset with npartitions = 4.
Earlier, similar approaches in pandas took 45 seconds or so.
The only thing left to do is to remove the helper column once we're done. Currently, I'm not sure how to do this.
It sounds like you're looking for dd.Series.where
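As a rough illustration of the where/mask pattern without a helper column, here is a sketch in plain pandas on a few rows from the sample CSV. It is written and checked against pandas only; dask.dataframe documents mask, where and drop under the same names, but treat the carry-over to Dask as an assumption to verify on your own data.

```python
import pandas as pd

# A few rows shaped like the sample CSV (values copied from it).
df = pd.DataFrame({
    "Temperature": [6.8, 6.4, 8.6, 12.2],
    "RH": [99.3, 100.0, 95.4, 76.0],
})

# Start from the default, then overwrite only where a condition holds.
# mask(cond, value) replaces entries where cond is True; where() keeps them instead.
flag = pd.Series(0, index=df.index)
flag = flag.mask(df["Temperature"] >= 8, 1)
flag = flag.mask((df["Temperature"] == 6.8) & (df["RH"] == 99.3), 1)
df["AndHeathSolRadFact"] = flag

# If a helper column was created anyway, drop returns a frame without it.
df["Helper"] = 1
df = df.drop(columns="Helper")
print(df)
```

Because mask accepts a scalar replacement, the Helper column is not needed to carry the value 1, and drop(columns=...) covers the leftover question of removing it.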