Plot Count of Pandas Dataframe with Start_Date and End_Date - python

I am trying to plot a daily follower count for various twitter handles. The result would be something like the chart below, but filterable by more than one twitter handle: [example chart omitted]
Usually, I would do this by simply appending a new dataset pulled from Twitter to the original table, along with the date the log was pulled. However, that would leave me with a million lines in just a few days, and it wouldn't allow me to clearly see when a user has dropped off.
As an alternative, after pulling my data from Twitter, I structured my pandas dataframe like this:
Follower_ID Handles Start_Date End_Date
100 x 30/05/2017 NaN
101 x 21/04/2017 29/05/2017
201 y 14/06/2017 NaN
100 y 16/06/2017 28/06/2017
Where:
Handles: are the accounts I am pulling the Followers for
Follower_ID: is the user following a handle
So, for example, if I were Follower_ID 100, I could follow both handle x and handle y
I am wondering what would be the best way to prepare the data (pivot, clean through a function, groupby) so that then it can be plotted accordingly. Any ideas?

I ended up using iterrows in a naïve approach, so there could be a more efficient way that takes advantage of pandas reshaping, etc. But my idea was to make a function that takes in your dataframe and the handle you want to plot, and then returns another dataframe with that handle's daily follower counts. To do this, the function
filters the df to the desired handle only,
takes each date range (for example, 21/04/2017 to 29/05/2017),
turns that into a pandas date_range, and
puts all the dates in a single list.
At that point, collections.Counter on the single list is a simple way to tally up the results by day.
One note is that the null End_Dates should be coalesced to whatever end date you want on your graph. I call that the max_date when I wrangle the data. So altogether:
from io import StringIO
from collections import Counter
import pandas as pd

def get_counts(df, handle):
    """Inputs: your dataframe and the handle
    you want to plot.
    Returns a dataframe of daily follower counts.
    """
    # filter the df to the desired handle only
    df_handle = df[df['Handles'] == handle]
    all_dates = []
    for _, row in df_handle.iterrows():
        # take each date range (for example, 21/04/2017 to 29/05/2017),
        # turn that into a pandas `date_range`, and
        # put all the dates in a single list
        all_dates.extend(pd.date_range(row['Start_Date'],
                                       row['End_Date']).tolist())
    counts = pd.DataFrame.from_dict(Counter(all_dates), orient='index') \
               .rename(columns={0: handle}) \
               .sort_index()
    return counts
That's the function. Now reading and wrangling your data ...
data = StringIO("""Follower_ID Handles Start_Date End_Date
100 x 30/05/2017 NaN
101 x 21/04/2017 29/05/2017
201 y 14/06/2017 NaN
100 y 16/06/2017 28/06/2017""")
df = pd.read_csv(data, delim_whitespace=True)

# fill in missing end dates
max_date = pd.Timestamp('2017-06-30')
df['End_Date'].fillna(max_date, inplace=True)

# pandas timestamps (so that we can use pd.date_range);
# dayfirst=True because the dates are in DD/MM/YYYY format
df['Start_Date'] = pd.to_datetime(df['Start_Date'], dayfirst=True)
df['End_Date'] = pd.to_datetime(df['End_Date'], dayfirst=True)
print(get_counts(df, 'y'))
The last line prints this for handle y:
y
2017-06-14 1
2017-06-15 1
2017-06-16 2
2017-06-17 2
2017-06-18 2
2017-06-19 2
2017-06-20 2
2017-06-21 2
2017-06-22 2
2017-06-23 2
2017-06-24 2
2017-06-25 2
2017-06-26 2
2017-06-27 2
2017-06-28 2
2017-06-29 1
2017-06-30 1
You can plot this dataframe with your preferred package.
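If the iterrows loop ever becomes a bottleneck, here is a hedged, loop-free sketch of the same tally (same df and column names as above; assumes pandas 0.25+ for Series.explode):
def get_counts_vectorized(df, handle):
    # build one date_range per row, flatten them all, and tally with value_counts
    d = df[df['Handles'] == handle]
    dates = d.apply(lambda r: pd.date_range(r['Start_Date'], r['End_Date']), axis=1)
    return dates.explode().value_counts().sort_index().rename(handle).to_frame()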

Related

Filtering out improperly formatted datetime values in Python DataFrame

I have a DataFrame with one column storing the date.
However, some of these dates are properly formatted datetime objects like '2018-12-24 17:00:00' while others are not and are stored like '20181225'.
When I tried to plot these using plotly, the improperly formatted values got turned into EPOCH dates, which is a problem.
Is there any way I can get a copy of the DataFrame with only those rows with properly formatted dates?
I tried using
clean_dict = dailySum_df.where(dailySum_df[isinstance(dailySum_df['time'], datetime.datetime)])
but it doesn't work, failing with the 'Array conditional must be same shape as self' error.
dailySum_df = pd.DataFrame(list(cursors['dailySum']))
trace = go.Scatter(
    x=dailySum_df['time'],
    y=dailySum_df['countMessageIn']
)
data = [trace]
py.plot(data, filename='basic-line')
Apply dateutil.parser; see also my answer here:
import dateutil.parser as dparser

def myparser(x):
    try:
        return dparser.parse(x)
    except (ValueError, OverflowError):
        return None

df = pd.DataFrame({'time': ['2018-12-24 17:00:00', '20181225', 'no date at all'],
                   'countMessageIn': [1, 2, 3]})
df.time = df.time.apply(myparser)
df = df[df.time.notnull()]
Input:
time countMessageIn
0 2018-12-24 17:00:00 1
1 20181225 2
2 no date at all 3
Output:
time countMessageIn
0 2018-12-24 17:00:00 1
1 2018-12-25 00:00:00 2
Unlike Gustavo's solution this can handle rows with no recognizable date at all and it filters out such rows as required by your question.
If your original time column may contain other text besides the dates themselves, include the fuzzy=True parameter as shown here.
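For instance, a small hedged sketch of the fuzzy option (the log-style string here is invented for illustration):
import dateutil.parser as dparser

# fuzzy=True tells the parser to skip tokens that are not part of a date
dt = dparser.parse('logged 2018-12-24 17:00:00 by admin', fuzzy=True)
print(dt)  # expected: 2018-12-24 17:00:00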
Try parsing the dates column of your dataframe using dateutil.parser.parse and the Pandas apply function.

python time lags holidays

In pandas, I have two data frames: one containing the holidays of a particular country from http://www.timeanddate.com/holidays/austria, and another one containing a date column. I want to calculate the number of days after a holiday.
def compute_date_diff(x, y):
    difference = y - x
    differenceAsNumber = difference / np.timedelta64(1, 'D')
    return differenceAsNumber.astype(int)

for index, row in holidays.iterrows():
    secondDF[row['name'] + '_daysAfter'] = secondDF.dateColumn.apply(
        compute_date_diff, args=(row.day,))
However, this
- calculates the wrong difference, e.g. greater than a year when holidays contains data for more than one year, and
- is pretty slow.
How could I fix the flaw and increase performance? Is there a parallel apply? Or what about http://pandas.pydata.org/pandas-docs/stable/timeseries.html#holidays-holiday-calendars
As I am new to pandas, I am unsure how to obtain the current date/index of the date object while iterating inside apply. As far as I know, I cannot loop the other way round, e.g. over all the rows in secondDF, because I could not generate feature columns while iterating via apply.
To do this, join both data frames using a common column and then try this code:
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['to', 'fr', 'ans'])
df['to'] = [pd.Timestamp('2014-01-24'), pd.Timestamp('2014-01-27'), pd.Timestamp('2014-01-23')]
df['fr'] = [pd.Timestamp('2014-01-26'), pd.Timestamp('2014-01-27'), pd.Timestamp('2014-01-24')]
df['ans'] = (df['fr'] - df['to']) / np.timedelta64(1, 'D')
print(df)
output
to fr ans
0 2014-01-24 2014-01-26 2.0
1 2014-01-27 2014-01-27 0.0
2 2014-01-23 2014-01-24 1.0
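Since the question also asks about performance, a hedged alternative sketch (the frames and column names here are illustrative assumptions) is pd.merge_asof, which looks up the most recent holiday on or before each date in one vectorized pass:
import pandas as pd

holidays = pd.DataFrame({'holiday': pd.to_datetime(['2017-01-01', '2017-04-17', '2017-05-01'])})
secondDF = pd.DataFrame({'dateColumn': pd.to_datetime(['2017-01-03', '2017-04-20', '2017-05-02'])})

# both frames must be sorted on their join keys
merged = pd.merge_asof(secondDF.sort_values('dateColumn'),
                       holidays.sort_values('holiday'),
                       left_on='dateColumn', right_on='holiday',
                       direction='backward')
merged['days_after'] = (merged['dateColumn'] - merged['holiday']).dt.days
print(merged)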
I settled for something entirely different: now only the number of days since the nearest holiday will be calculated.
My function:
def get_nearest_holiday(holidays, pivot):
    # the result still needs to be converted to an int,
    # but at least the nearest holiday is found efficiently
    return min(holidays, key=lambda x: abs(x - pivot))
It is called as a lambda expression on a per-row basis.
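A hedged sketch of what that per-row call might look like (the frames and column names are invented for illustration):
import pandas as pd

holidays = list(pd.to_datetime(['2017-01-01', '2017-04-17', '2017-05-01']))
secondDF = pd.DataFrame({'dateColumn': pd.to_datetime(['2017-01-03', '2017-04-20'])})

# may come out negative when the nearest holiday lies in the future
secondDF['days_from_nearest_holiday'] = secondDF['dateColumn'].apply(
    lambda d: (d - get_nearest_holiday(holidays, d)).days)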

How to apply a function to each column of a pivot table in pandas?

Code:
df = pd.read_csv("example.csv", parse_dates=['ds'])
df2 = df.set_index(['ds', 'city']).unstack('city')
rm = pd.rolling_mean(df2, 3)
sd = pd.rolling_std(df2,3)
df2 output: [table not shown]
What I want: I want to see, for each city and each date, whether the number of bookings is greater than 1 std dev away from the rolling mean of bookings for that city. For example, in pseudocode:
for each (city column)
    for each (date)
        see whether the (number of bookings) - (same date and city rolling mean) > (same date and city std dev)
        print that date and city and number of bookings
What the problem is: I'm having trouble figuring out how to access the data I need from each of the data frames. The parts of the pseudocode in parentheses are what I need help figuring out.
What I tried:
df2['city']
list(df2)
Both give me errors.
df2[1:2]
Slicing works, but I feel like that's not the best way to access it.
You should use the apply function of the DataFrame API. A demo is below:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [1, 2, 3, 4, 5]})
df['C'] = df.apply(lambda row: row['A'] * row['B'], axis=1)
Output:
>>> df
A B C
0 1 1 1
1 2 2 4
2 3 3 9
3 4 4 16
4 5 5 25
More concretely for your case:
1. Precompute the "same date and city rolling mean" and "same date and city std dev". You can use the groupby function for this: it lets you aggregate the data by city and date, after which you can calculate the std dev and mean.
2. Put the std dev and mean back in your table, using a dictionary for it: some_dict = {('city', 'date'): [std_dev, mean], ...}. To put the data in the dataframe, use the apply function.
3. You then have all the data necessary to run your check, again via apply.
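Putting those pieces together, here is a minimal hedged sketch of the rolling check described in the question (the bookings numbers are invented, and the modern .rolling() API stands in for the deprecated pd.rolling_mean/pd.rolling_std):
import pandas as pd

# invented long-format bookings data: one row per (ds, city) pair
df = pd.DataFrame({
    'ds': pd.to_datetime(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'] * 2),
    'city': ['NYC'] * 4 + ['SF'] * 4,
    'bookings': [10, 12, 30, 11, 5, 6, 7, 20],
})

# one column per city, as with df2 in the question
df2 = df.set_index(['ds', 'city']).unstack('city')['bookings']

rm = df2.rolling(3).mean()   # rolling mean per city column
sd = df2.rolling(3).std()    # rolling std dev per city column

# True where bookings deviate from the rolling mean by more than one std dev
flags = (df2 - rm).abs() > sd

# stack back to (date, city) pairs and print the flagged bookings
print(df2.stack()[flags.stack()])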

Count repeating events in pandas timeseries

I've been working with very basic pandas for a few days but am struggling with my current task:
I have a (non-normalized) timeseries whose items contain a userid per timestamp, so something like (date, userid, payload). Think of a server logfile in which I would like to find how many IPs return within a certain time period.
Now I'd like to find how many of the users have multiple items within an interval, for example within 4 weeks. So it's more of a sliding window than constant intervals on the t-axis.
So my approaches were:
- reindex df_users on userids
- or a multiindex?
Sadly I didn't find a way to generate the results successfully.
So all in all I'm not sure how to realize that kind of search with pandas, or whether this is easier to implement in pure Python. Or do I just lack some keywords for the problem?
Some dummy data that I think fits your problem.
import pandas as pd

df = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'time': ['2013-1-1', '2013-1-2', '2013-1-3',
                            '2013-1-1', '2013-1-5', '2013-1-7',
                            '2013-1-1', '2013-1-7', '2013-1-12']})
df['time'] = pd.to_datetime(df['time'])
This approach requires some kind of non-missing numeric column to count with, so just add a dummy one.
df['dummy_numeric'] = 1
My approach to the problem is this. First, groupby the id and iterate, so we are working with one user id's worth of data at a time. Next, resample the irregular data up to daily values so it is normalized.
Then, using the rolling_count function, count the number of observations in each X day window (using 3 here). This works because the upsampled data will be filled with NaN and not counted. Notice that only the numeric column is being passed to rolling_count, and also note the use of double-brackets (which results in a DataFrame being selected rather than a series).
window_days = 3
ids = []
for _, df_gb in df.groupby('id'):
    df_gb = df_gb.set_index('time').resample('D')
    df_gb = pd.rolling_count(df_gb[['dummy_numeric']], window_days).reset_index()
    ids.append(df_gb)
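Note that pd.rolling_count and an aggregation-free resample('D') only exist in older pandas; a hedged equivalent of the loop under current versions (0.18+) would be:
window_days = 3
ids = []
for _, df_gb in df.groupby('id'):
    # upsample to daily frequency; the missing days become NaN rows
    daily = df_gb.set_index('time').resample('D').asfreq()
    # count the non-NaN observations in each 3-day window
    counts = daily[['dummy_numeric']].rolling(window_days, min_periods=1).count()
    ids.append(counts.reset_index())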
Combine all the data back together and mark the spans with more than one observation:
df_stack = pd.concat(ids, ignore_index=True)
df_stack['multiple_requests'] = (df_stack['dummy_numeric'] > 1).astype(int)
Then groupby and sum, and you should have the right answer.
df_stack.groupby('time')['multiple_requests'].sum()
Out[356]:
time
2013-01-01 0
2013-01-02 1
2013-01-03 1
2013-01-04 0
2013-01-05 0
2013-01-06 0
2013-01-07 1
2013-01-08 0
2013-01-09 0
2013-01-10 0
2013-01-11 0
2013-01-12 0
Name: multiple_requests, dtype: int32

A Multi-Index Construction for Intraday TimeSeries (10 min price data)

I have a file with intraday prices every ten minutes, with [0:41] times in a day; each date is repeated 42 times. The multi-index below should "collapse" the repeated dates into one for all times.
There are 62,035 rows x 3 columns: [date, time, price].
I would like to write a function to get the difference of the ten minute prices, restricting differences to each unique date.
In other words, 09:30 is the first time of each day and 16:20 is the last: I cannot overlap differences between days of price from 16:20 - 09:30. The differences should start as 09:40 - 09:30 and end as 16:20 - 16:10 for each unique date in the dataframe.
Here is my attempt. Any suggestions would be greatly appreciated.
def diffSeries(rounded, data):
    '''This function accepts a column called rounded from 'data'.
    The 2nd input 'data' is a dataframe.
    '''
    df = rounded.shift(1)
    idf = data.set_index(['date', 'time'])
    data['diff'] = ['000']
    for i in range(0, len(rounded)):
        for day in idf.index.levels[0]:
            for time in idf.index.levels[1]:
                if idf.index.levels[1] != 1620:
                    data['diff'] = rounded[i] - df[i]
                else:
                    day += 1
                    time += 2
    data[['date', 'time', 'price', 'II', 'diff']].to_csv('final.csv')
    return data['diff']
Then I call:
data=read_csv('file.csv')
rounded=roundSeries(data['price'],5)
diffSeries(rounded,data)
In the traceback, I get an AssertionError.
You can use groupby and then apply to achieve what you want:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
For a full example, suppose you create a test data set for 14 Nov to 16 Nov:
import pandas as pd
from numpy.random import randn
from datetime import datetime, time

# Create date range with 10 minute intervals, and filter out irrelevant times
times = pd.bdate_range(start=datetime(2012, 11, 14, 0, 0, 0),
                       end=datetime(2012, 11, 17, 0, 0, 0), freq='10T')
filtered_times = [x for x in times if x.time() >= time(9, 30) and x.time() <= time(16, 20)]
prices = randn(len(filtered_times))

# Create MultiIndex and data frame matching the format of your CSV
arrays = [[x.date() for x in filtered_times],
          [x.time() for x in filtered_times]]
tuples = list(zip(*arrays))
m_index = pd.MultiIndex.from_tuples(tuples, names=['date', 'time'])
data = pd.DataFrame({'prices': prices}, index=m_index)
You should get a DataFrame a bit like this:
                       prices
date       time
2012-11-14 09:30:00  0.696054
           09:40:00 -1.263852
           09:50:00  0.196662
           10:00:00 -0.942375
           10:10:00  1.915207
As mentioned above, you can then get the differences by grouping by the first index and then subtracting the previous row for each row:
diffs = data.groupby(lambda idx: idx[0]).apply(lambda row: row - row.shift(1))
Which gives you something like:
                       prices
date       time
2012-11-14 09:30:00       NaN
           09:40:00 -1.959906
           09:50:00  1.460514
           10:00:00 -1.139036
           10:10:00  2.857582
Since you are grouping by the date, the difference is never taken across days, i.e. from 16:20 to the next day's 09:30.
You might want to consider using a TimeSeries instead of a DataFrame, because it will give you far greater flexibility with this kind of data. Supposing you have already loaded your DataFrame from the CSV file, you can easily convert it into a TimeSeries and perform a similar function to get the differences:
dt_index = pd.DatetimeIndex([datetime.combine(i[0], i[1]) for i in data.index])
# or dt_index = pd.DatetimeIndex([datetime.combine(i.date, i.time) for i in data.index])
# if you don't have a multi-level index on data yet
ts = pd.Series(data.prices.values, dt_index)
diffs = ts.groupby(lambda idx: idx.date()).apply(lambda row: row - row.shift(1))
However, you would now have access to the built-in time series functions such as resampling. See here for more about time series in pandas.
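For example, with the ts built above, resampling the 10-minute prices becomes a hedged one-liner (modern pandas spelling):
# downsample the 10-minute series to hourly averages
hourly_means = ts.resample('H').mean()
# or take the last observed price of each day
daily_last = ts.resample('D').last()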
@MattiJohn's construction gives a filtered list of length 86,772 when run over 1/3/2007-8/30/2012 with 42 times per day (10 minute intervals). Observe the data cleaning issues.
Here the price data coming from the csv has length 62,034.
Hence, simply importing from the .csv, as follows, is problematic:
filtered_times = [x for x in times if x.time() >= time(9, 30) and x.time() <= time(16, 20)]
DF = pd.read_csv('MR10min.csv')
prices = DF.price
# i.e. rather than the generic prices = randn(len(filtered_times)) above
The fact that the real data falls short of the length it "should be" means there are data cleaning issues. Often we do not have all the times that bdate_range will generate (half days in the market, holidays, etc.).
Your solution is elegant. But I am not sure how to overcome the mismatch between the actual data and the a priori, prescribed dataframe.
Your second TimeSeries suggestion seems to still require constructing a datetime index similar to the first one. For example, if I were to use the following lines to get the actual data of interest:
DF = pd.read_csv('MR10min.csv')
data = DF.set_index(['date', 'time'])
dt_index = pd.DatetimeIndex([datetime.combine(i[0], i[1]) for i in data.index])
It will generate a:
TypeError: combine() argument 1 must be datetime.date, not str
How does one make a bdate_range array completely informed by the actual data available?
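One possible way around this (an untested sketch; it assumes the CSV's 'date' and 'time' columns hold plain strings) is to build the index from the data itself, so half days and holidays can never cause a mismatch, and to parse the strings before combining them, which also avoids the TypeError above:
import pandas as pd

DF = pd.read_csv('MR10min.csv')
# parse the two string columns into real timestamps instead of
# calling datetime.combine on raw strings
dt_index = pd.DatetimeIndex(pd.to_datetime(DF['date'] + ' ' + DF['time']))
ts = pd.Series(DF['price'].values, index=dt_index)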
Thank you to @MattiJohn and to anyone with interest in continuing this discussion.
