I have the following dataframe with datetime, lat and lon variables. The data is collected every second, so each date appears on many rows (60 per minute).
I am doing some calculations using the lat and lon values, and at the end I need to write this data to a Postgres table.
2016-07-27 06:43:45 50.62 3.15
2016-07-27 06:43:46 50.67 3.22
2016-07-28 07:23:45 52.32 3.34
2016-07-28 07:24:46 52.67 3.45
Currently I have 10 million records, and the computation takes a long time if I run it on the whole dataframe.
How can I loop over each date, write its rows to the DB and then clear the dataframe?
I have converted the datetime variable to date format
df['date'] = df['datetime'].dt.date
df = df.sort_values('datetime')
My computation is:
df.loc[(df['lat'] > 50.10) & (df['lat'] <= 50.62), 'var1'] = 1
df.loc[(df['lon'] > 3.00) & (df['lon'] <= 3.20), 'var2'] = 1
Writing it to the DB:
df.to_sql('Table1', engine, if_exists="replace", index=False)
Have you considered using the groupby() function? You can use it to treat each 'date' as a separate DataFrame and then run your computations on it.
for date, sub_df in df.groupby('date'):
    # your computations on sub_df
    pass
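A minimal sketch of how that loop might look end to end, assuming the `date`, `lat` and `lon` columns from the question. An in-memory SQLite connection stands in for the Postgres engine so the snippet runs on its own, and `process_day` is a hypothetical helper name. Note the use of `if_exists="append"`: with `"replace"`, each day's chunk would overwrite the previous one.

```python
import sqlite3
import pandas as pd

# Small stand-in for the 10-million-row frame from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2016-07-27 06:43:45", "2016-07-27 06:43:46",
        "2016-07-28 07:23:45", "2016-07-28 07:24:46",
    ]),
    "lat": [50.62, 50.67, 52.32, 52.67],
    "lon": [3.15, 3.22, 3.34, 3.45],
})
df["date"] = df["datetime"].dt.date

def process_day(day_df):
    """Hypothetical helper: flag rows whose lat/lon fall inside the ranges.

    Writes 1/0 (instead of 1/NaN) so every chunk has the same columns.
    """
    day_df = day_df.copy()
    day_df["var1"] = ((day_df["lat"] > 50.10) & (day_df["lat"] <= 50.62)).astype(int)
    day_df["var2"] = ((day_df["lon"] > 3.00) & (day_df["lon"] <= 3.20)).astype(int)
    return day_df

# Assumption: SQLite in memory replaces the Postgres engine for illustration.
conn = sqlite3.connect(":memory:")

for day, day_df in df.groupby("date"):
    result = process_day(day_df)
    # Append each day's rows; "replace" would drop the table on every pass.
    result.to_sql("Table1", conn, if_exists="append", index=False)

written = pd.read_sql("SELECT * FROM Table1", conn)
```

Only one day's rows are processed at a time, and `result` is rebound on each pass, so the previous chunk can be garbage-collected.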
I have a dataframe with 15,000 rows of binary data, each a string of 365 characters. I convert each binary string into 365 days, starting from 13/12/2020.
Because the data is so large, my program runs very slowly. Is there a way to optimize it?
Data example:
ID    Nature  Binary
1122  M       1001100100100010010001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010011001001100100110010011001000000100110011011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010000001001100100110010011001001100
Output:
ID    Nature  Date        Code
1122  M       13/12/2020  1
1122  M       14/12/2020  0
1122  M       ..........  ...
1122  M       11/12/2021  0
Code:
start_date = '2020-12-13'  # matches the 13/12/2020 start date described above
table_ = pd.DataFrame({'ID': df.id[0], 'Nature': df.Nature[0], 'Date': pd.date_range(start_date, periods=len(df.binairy[0]), freq='D'), 'Code': list(df.binairy[0])})
for i in range(1, len(df)):
    table_i = pd.DataFrame({'ID': df.id[i], 'Nature': df.Nature[i], 'Date': pd.date_range(start_date, periods=len(df.binairy[i]), freq='D'), 'Code': list(df.binairy[i])})
    table_ = pd.concat([table_, table_i], ignore_index=True)
table_
One way to reduce the computation time is to parallelize the work, using the multiprocessing library if you have multiple cores and/or multithreading (I assume you're running in a CPU-based environment).
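Before reaching for parallelism, it may be worth removing the row-by-row `pd.concat`, which copies the growing table on every iteration. The following is a vectorized sketch, not the asker's method: it assumes the column names `id`, `Nature` and `binairy` from the question's code and that every string has the same length. The two-row frame is a made-up stand-in.

```python
import numpy as np
import pandas as pd

# Tiny stand-in: 5-character strings instead of 365.
df = pd.DataFrame({
    "id": [1122, 1123],
    "Nature": ["M", "F"],
    "binairy": ["10011", "01100"],
})

start_date = "2020-12-13"
n_days = len(df["binairy"].iloc[0])   # assumes every string has this length
dates = pd.date_range(start_date, periods=n_days, freq="D")

table_ = pd.DataFrame({
    # Repeat each ID/Nature once per day, keeping row order.
    "ID": np.repeat(df["id"].to_numpy(), n_days),
    "Nature": np.repeat(df["Nature"].to_numpy(), n_days),
    # Tile the same date range once per source row.
    "Date": np.tile(dates.to_numpy(), len(df)),
    # Join all strings and split them into single characters in one pass.
    "Code": list("".join(df["binairy"])),
})
```

The whole output frame is built in one allocation, so the cost grows linearly with the number of rows rather than quadratically.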
Do you have to process the data in the dataframe, or can you load it into a database?
You can split the strings of 1s and 0s into rows with dates using a number table. For this implementation I borrowed the number table generator from this answer on SO, so the following assumes you have those views defined.
Create a table to hold your source data
create table sample_data (
    id int,
    nature char(1),
    bin_str varchar(365)
);
For testing I loaded 2500 rows by duplicating a row
insert sample_data(id, nature, bin_str) values (1,'M','1001100100100010010001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010011001001100100110010011001000000100110011011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010000001001100100110010011001001100');
insert sample_data(id, nature, bin_str)
select n, nature, bin_str
from sample_data s join generator_4k g
where g.n>1 and g.n<=2500;
Then split the binary string and add dates
select id,
       nature,
       -- n starts at 1, so subtract 1 to make the first row fall on the start date
       date_add('2020-12-13', INTERVAL n - 1 DAY) date,
       substring(bin_str, n, 1) code
from generator_4k
join sample_data
where generator_4k.n > 0 and generator_4k.n <= length(bin_str)
order by id, n;
id  nature  date        code
1   M       2020-12-13  1
1   M       2020-12-14  0
1   M       2020-12-15  0
.   .       ..........  .
1   M       2021-12-12  0
It took my local machine a few seconds to process 25000 rows, so depending on how slow your existing solution is YMMV.
I have a Pandas dataframe df that looks as follows:
created_time action_time
2021-03-05T07:18:12.281-0600 2021-03-05T08:32:19.153-0600
2021-03-04T15:34:23.373-0600 2021-03-04T15:37:32.360-0600
2021-03-01T04:57:47.848-0600 2021-03-01T08:37:39.083-0600
import pandas as pd
df = pd.DataFrame({'created_time':['2021-03-05T07:18:12.281-0600', '2021-03-04T15:34:23.373-0600', '2021-03-01T04:57:47.848-0600'],
'action_time':['2021-03-05T08:32:19.153-0600', '2021-03-04T15:37:32.360-0600', '2021-03-01T08:37:39.083-0600']})
I then create another column which represents the difference in minutes between these two columns:
df['elapsed_time'] = (pd.to_datetime(df['action_time']) - pd.to_datetime(df['created_time'])).dt.total_seconds() / 60
df['elapsed_time']
elapsed_time
74.114533
3.149783
219.853917
We assume that an "action" can only take place during business hours (which we assume start at 8:30am).
I would like to create another column named created_time_adjusted, which adjusts the created_time to 08:30am if the created_time is before 08:30am.
I can parse out the date and time string that I need, as follows:
df['created_time_adjusted'] = pd.to_datetime(df['created_time']).dt.date.astype(str) + 'T08:30:00.000-0600'
But, this doesn't deal with the conditional.
I'm aware of a few ways that I might be able to do this:
replace
clip
np.where
loc
What is the best (and least hacky) way to accomplish this?
Thanks!
First of all, I think your life would be easier if you converted the columns to datetime dtypes from the get-go. Then it's just a matter of running an apply op on the 'created_time' column.
df.created_time = pd.to_datetime(df.created_time)
df.action_time = pd.to_datetime(df.action_time)
df.elapsed_time = df.action_time-df.created_time
time_threshold = pd.to_datetime('08:30').time()
df['created_time_adjusted']=df.created_time.apply(lambda x:
x.replace(hour=8,minute=30,second=0)
if x.time()<time_threshold else x)
Output:
>>> df
created_time action_time created_time_adjusted
0 2021-03-05 07:18:12.281000-06:00 2021-03-05 08:32:19.153000-06:00 2021-03-05 08:30:00.281000-06:00
1 2021-03-04 15:34:23.373000-06:00 2021-03-04 15:37:32.360000-06:00 2021-03-04 15:34:23.373000-06:00
2 2021-03-01 04:57:47.848000-06:00 2021-03-01 08:37:39.083000-06:00 2021-03-01 08:30:00.848000-06:00
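Note that `replace(hour=8, minute=30, second=0)` leaves the microsecond field untouched, which is why the adjusted values above read 08:30:00.281000 rather than 08:30:00. A vectorized sketch that avoids this, assuming the column is already a timezone-aware datetime with a single fixed offset as in the example: build the 08:30 cutoff for each row's own date and keep whichever of the two timestamps is later.

```python
import pandas as pd

df = pd.DataFrame({
    "created_time": ["2021-03-05T07:18:12.281-0600",
                     "2021-03-04T15:34:23.373-0600",
                     "2021-03-01T04:57:47.848-0600"],
})
df["created_time"] = pd.to_datetime(df["created_time"])

# 08:30 local time on each row's own date (normalize() sets the time to midnight).
opening = df["created_time"].dt.normalize() + pd.Timedelta(hours=8, minutes=30)

# Keep created_time where it is already at or after opening; otherwise use opening.
df["created_time_adjusted"] = df["created_time"].where(
    df["created_time"] >= opening, opening
)
```

This avoids the Python-level `apply` loop, which may matter on large frames.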
import pandas as pd

df['created_time'] = pd.to_datetime(df['created_time'])  # coerce to datetime
indexed = df.set_index(df['created_time'])
# Isolate rows earlier than 08:30 (inclusive='left' needs pandas >= 1.4; older versions used include_end=False)
df1 = indexed.between_time('00:00:00', '08:30:00', inclusive='left').copy()
df1['created_time'] = df1['created_time'].dt.normalize() + pd.Timedelta(hours=8, minutes=30)  # adjust time
# Knit before and after 08:30 together (DataFrame.append was removed in pandas 2.0)
df2 = pd.concat([df1, indexed.between_time('08:30:00', '00:00:00', inclusive='left')]).reset_index(drop=True)
df2
I have a .txt file with three columns: time, ticker and price. The times are spaced at 20-second intervals. Loaded into a Jupyter notebook as a Pandas DataFrame, it looks like this:
time ticker price
0 09:30:35 EV 33.860
1 00:00:00 AMG 60.430
2 09:30:35 AMG 60.750
3 00:00:00 BLK 455.350
4 09:30:35 BLK 451.514
... ... ... ...
502596 13:00:55 TLT 166.450
502597 13:00:55 VXX 47.150
502598 13:00:55 TSLA 529.800
502599 13:00:55 BIDU 103.500
502600 13:00:55 ON 12.700
# NOTE: the first set of data has the data at market open for -
# every other time point, so that's what the 00:00:00 is.
#It is only limited to the 09:30:35 data.
I need to create a function that takes a ticker as input and creates a bar chart displaying the data with 5-minute ticks (the data is every 20 seconds, so one tick for every 15 points in time).
So far I've thought about extracting the "mm" part of hh:mm:ss into another column to get just the minutes, and then writing a for loop that looks something like this:
for num in df['mm']:
    if num % 5 == 0:
        print('tick')
then somehow appending the "tick" to the "time" column for every 5 minutes of data (I'm not sure how I would do this), then using the time column as the index and only using data with the "tick" index in it (some kind of if statement). I'm not sure if this makes sense but I'm drawing a blank on this.
You should have a look at the built-in functions in pandas. In the following example I'm using a date + time format but it shouldn't be hard to convert one to the other.
Generate data
%matplotlib inline
import pandas as pd
import numpy as np
dates = pd.date_range(start="2020-04-01", periods=150, freq="20S")
df1 = pd.DataFrame({"date":dates,
"price":np.random.rand(len(dates))})
df2 = df1.copy()
df1["ticker"] = "a"
df2["ticker"] = "b"
df = pd.concat([df1,df2], ignore_index=True)
df = df.sample(frac=1).reset_index(drop=True)
Resample Timeseries every 5 minutes
Here you can try to see the output of
df1.set_index("date")\
   .resample("5min")\
   .first()\
   .reset_index()
Here we keep just the first element of each 5-minute window (05:00, 10:00 and so on). In general, to do the same for every ticker we need a groupby:
out = df.groupby("ticker")\
        .apply(lambda x: x.set_index("date")\
                          .resample("5min")\
                          .first()\
                          .reset_index())\
        .reset_index(drop=True)
Plot function
def plot_tick(data, ticker):
    ts = data[data["ticker"]==ticker].reset_index(drop=True)
    ts.plot(x="date", y="price", kind="bar", title=ticker);
plot_tick(out, "a")
Then you can improve the plot or, eventually, try to use plotly.
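To apply the same idea to the layout in the question, where `time` holds only HH:MM:SS strings, one option is to drop the 00:00:00 placeholder rows and parse the strings into datetimes first. A rough sketch, assuming `time` is a string column and that all rows belong to one trading day; the date `2020-04-01` is an arbitrary placeholder and `five_minute_bars` is a hypothetical helper name.

```python
import pandas as pd

def five_minute_bars(data, ticker, day="2020-04-01"):
    """Return the first price in every 5-minute window for one ticker.

    `day` is an arbitrary placeholder date; the source file only stores times.
    """
    # Drop the market-open placeholder rows and keep a single ticker.
    ts = data[(data["ticker"] == ticker) & (data["time"] != "00:00:00")].copy()
    ts["date"] = pd.to_datetime(day + " " + ts["time"])
    return (ts.set_index("date")["price"]
              .resample("5min")
              .first()
              .dropna())

# Synthetic stand-in: one ticker sampled every 20 seconds for about 10 minutes.
times = pd.date_range("2020-04-01 09:30:35", periods=30, freq="20s").strftime("%H:%M:%S")
demo = pd.DataFrame({"time": times, "ticker": "EV", "price": range(30)})

bars = five_minute_bars(demo, "EV")
# bars.plot(kind="bar", title="EV") would draw the chart (requires matplotlib).
```

With 20-second data, consecutive bars are 15 rows apart, which matches the "every 15 points in time" spacing described in the question.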
I fully understand there are a few versions of this questions out there, but none seem to get at the core of my problem. I have a pandas Dataframe with roughly 72,000 rows from 2015 to now. I am using a calculation that finds the most impactful words for a given set of text (tf_idf). This calculation does not account for time, so I need to break my main Dataframe down into time-based segments, ideally every 15 and 30 days (or n days really, not week/month), then run the calculation on each time-segmented Dataframe in order to see and plot what words come up more and less over time.
I have been able to build part of this out semi-manually with the following:
from datetime import datetime
import pandas as pd

def dateRange():
    start = input("Enter a start date (MM-DD-YYYY) or '30' for last 30 days: ")
    if start != '30':
        # Keep the parsed values so the comparison uses real datetimes
        start = datetime.strptime(start, '%m-%d-%Y')
        end = input("Enter an end date (MM-DD-YYYY): ")
        end = datetime.strptime(end, '%m-%d-%Y')
        dataTime = data[(data['STATUSDATE'] > start) & (data['STATUSDATE'] <= end)]
    else:
        dataTime = data[data.STATUSDATE > datetime.now() - pd.to_timedelta('30day')]
    return dataTime

dataTime = dateRange()
dataTime2 = dateRange()
def calcForDateRange(dateRangeFrame):
    ##### LONG FUNCTION #####
    return word and number

calcForDateRange(dataTime)
calcForDateRange(dataTime2)
This works - however, I have to manually create the 2 dates which is expected as I created this as a test. How can I split the Dataframe by increments and run the calculation for each dataframe?
dicts are allegedly the way to do this. I tried:
dict_of_dfs = {}
for n, g in data.groupby(data['STATUSDATE']):
    dict_of_dfs[n] = g
for frame in dict_of_dfs:
    calcForDateRange(frame)
The dict result was 2015-01-02: Dataframe with no frame. How can I break this down into 100 or so Dataframes to run my function on?
Also, I do not fully understand how to break down ['STATUSDATE'] by a specific number of days.
I would like to avoid iterating as much as possible, but I know I will probably have to somewhere.
Thank you
Let us assume you have a data frame like this:
import numpy as np
import pandas as pd

date = pd.date_range(start='2018-01-01', end='2018-12-31', normalize=True)
x = np.random.randint(0, 1000, size=365)
df = pd.DataFrame(x, columns=["X"])
df['Date'] = date
df.head()
Output:
X Date
0 328 2018-01-01
1 188 2018-01-02
2 709 2018-01-03
3 259 2018-01-04
4 131 2018-01-05
So this data frame has 365 rows, one for each day of the year.
Now if you want to group this data into intervals of 20 days and assign each group to a dict, you can do the following
df_dict = {}
for k, v in df.groupby(pd.Grouper(key="Date", freq='20D')):
    df_dict[k.strftime("%Y-%m-%d")] = pd.DataFrame(v)
print(df_dict)
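If the per-period frames are only needed as input to a function, the intermediate dictionary of frames can be skipped and the function called inside the same loop. A sketch under that assumption, using deterministic data in the same shape as the example and a hypothetical stand-in `calc_for_date_range` in place of the real (long) calculation:

```python
import numpy as np
import pandas as pd

# Same shape as the example above: one row per day of 2018.
df = pd.DataFrame({
    "X": np.arange(365),
    "Date": pd.date_range(start="2018-01-01", end="2018-12-31"),
})

def calc_for_date_range(frame):
    """Hypothetical stand-in for the real (long) calculation."""
    return frame["X"].sum()

n_days = 15  # any window length in days
results = {}
for period_start, frame in df.groupby(pd.Grouper(key="Date", freq=f"{n_days}D")):
    if frame.empty:   # Grouper also yields windows with no rows; skip them
        continue
    results[period_start] = calc_for_date_range(frame)

# One result per window, indexed by the window's start date, ready to plot.
summary = pd.Series(results).sort_index()
```

Changing `n_days` to 30 gives the 30-day segmentation without any other edits.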
How about something like this? It creates a dictionary of non-empty dataframes keyed on the starting date of each period.
import datetime as dt
start = '12-31-2017'
interval_days = 30
start_date = pd.Timestamp(start)
end_date = pd.Timestamp(dt.date.today() + dt.timedelta(days=1))
dates = pd.date_range(start=start_date, end=end_date, freq=f'{interval_days}d')
sub_dfs = {d1.strftime('%Y%m%d'): df.loc[df.dates.ge(d1) & df.dates.lt(d2)]
for d1, d2 in zip(dates, dates[1:])}
# Remove empty dataframes.
sub_dfs = {k: v for k, v in sub_dfs.items() if not v.empty}
I'm performing a Cohort analysis using python, and I am having trouble creating a new column that sums up the total months a user has stayed with us.
I know the math behind the answer. All I have to do is:
Subtract the year they started our service from the year they canceled it.
Multiply that by 12.
Subtract the month they started our service from the month they canceled it.
Add those two numbers together.
So in Excel, it looks like this:
=(YEAR(C2)-YEAR(B2))*12+(MONTH(C2)-MONTH(B2))
C is the date the customer canceled, and B is the date they started.
The problem is that I am very new to Python and Pandas, and I am having trouble translating that function in Python
What I have tried so far:
df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 +
df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
df.head()
It returns with an error 'Series' is not callable, and I have a general understanding of what that means.
I then tried:
def LTVCalc (Plan_Start_Date, Plan_Cancel_Date):
    df['Lifetime'] = df.Plan_Cancel_Date('%Y') - df.Plan_Start_Date('%Y')*12 +
    df.Plan_Cancel_Date('%m') - df.Plan_Start_Date('%m')
    df.head()
But that didn't add the Column 'Lifetime' to the DataFrame.
Anyone able to help a rookie?
I think you first need to convert the columns with to_datetime and then use dt.year and dt.month:
df = pd.DataFrame({
'Plan_Cancel_Date': ['2018-07-07','2019-03-05','2020-10-08'],
'Plan_Start_Date': ['2016-02-07','2017-01-05','2017-08-08']
})
#print (df)
#if necessary convert to datetimes
df.Plan_Start_Date = pd.to_datetime(df.Plan_Start_Date)
df.Plan_Cancel_Date = pd.to_datetime(df.Plan_Cancel_Date)
df['Lifetime'] = ((df.Plan_Cancel_Date.dt.year - df.Plan_Start_Date.dt.year)*12 +
df.Plan_Cancel_Date.dt.month - df.Plan_Start_Date.dt.month)
print (df)
Plan_Cancel_Date Plan_Start_Date Lifetime
0 2018-07-07 2016-02-07 29
1 2019-03-05 2017-01-05 26
2 2020-10-08 2017-08-08 38
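As a sanity check, the same arithmetic can be written for a single pair of dates in plain Python, mirroring the Excel formula from the question. The helper name `months_between` is illustrative only; the dates are the first row of the sample frame above.

```python
from datetime import date

def months_between(start, cancel):
    """Calendar months from start to cancel, ignoring the day of the month."""
    return (cancel.year - start.year) * 12 + (cancel.month - start.month)

# Worked example: started 2016-02-07, canceled 2018-07-07.
# (2018 - 2016) * 12 = 24, plus (7 - 2) = 5, gives 29 months.
lifetime = months_between(date(2016, 2, 7), date(2018, 7, 7))
```

The month difference can be negative (for example, November to February of the following year is 12 + (2 - 11) = 3), which is why the year term has to be computed separately and added.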