Pandas infrastructure data statistics plot with date per user

I am trying to display some infrastructure usage daily statistics with Pandas but I'm a beginner and can't figure it out after many hours of research.
Here are my data types per column:
Name              object
UserService       object
ItemSize          int64
ItemsCount        int64
ExtractionDate    datetime64[ns]
Each day I have a new extraction for each user, so I probably need to use groupby before plotting.
Data sample:
Name UserService ItemSize ItemsCount ExtractionDate
1 xyzf_s xyfz 40 1 2018-12-12
2 xyzf1 xyzf 53 5 2018-12-12
3 xyzf2 xyzf 71 4 2018-12-12
4 xyzf3 xyzf 91 3 2018-12-12
14 vo12 vo 41 5 2018-12-12
One of the graphs I am trying to display is as follows:
x axis should be the extraction date
y axis should be the items count (it's divided by 1000, so it's in thousands of items, from 1 to 100)
Each line on the graph should represent one user's evolution over time (to look for data spikes). I guess I would have to display only the top 10 or 50 users, because a graph of 1500 users would be difficult to read.
I'm also interested in any other way to exploit this data to look for increases and anomalies in data consumption.

Assuming the user is shown in the name columns and there is only one line per user per day, to get the plot you are explicitly asking for, you can use the following code:
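import matplotlib.pyplot as plt

# Limit to the first 10 users
users_to_plot = df['Name'].unique()[:10]
for u in users_to_plot:
    # Select this user's rows and plot them in date order
    values = df[df['Name'] == u].sort_values('ExtractionDate')
    plt.plot('ExtractionDate', 'ItemsCount', data=values, label=u)
plt.legend()
plt.show()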
It's important to look at the data and think about what information you are trying to extract and what it would look like. It's probably worth exploring a few individual users first to get an idea of the pattern you are trying to identify. Think about what makes that pattern unique and whether you can make it stand out on a graph.
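To pick the "top 10" rather than the first 10 users, one option is to pivot the data so that each user becomes a column. The following is only a minimal sketch: the second day's counts are invented for illustration, and ranking users by their peak ItemsCount is an assumption about what "top" means here.

```python
import pandas as pd

# Small made-up sample: the 2018-12-13 counts are invented for illustration
df = pd.DataFrame({
    "Name": ["xyzf1", "xyzf2", "vo12", "xyzf1", "xyzf2", "vo12"],
    "ItemsCount": [5, 4, 5, 9, 4, 6],
    "ExtractionDate": pd.to_datetime(["2018-12-12"] * 3 + ["2018-12-13"] * 3),
})

# One row per date, one column per user
wide = df.pivot_table(index="ExtractionDate", columns="Name",
                      values="ItemsCount", aggfunc="sum")

# Keep the users with the highest peak count (2 here, 10 or 50 in practice)
top_users = wide.max().nlargest(2).index
top = wide[top_users]

# Day-over-day change per user, a simple way to look for spikes
daily_change = top.diff()
```

Calling top.plot() would then draw one line per user with the date on the x axis, assuming matplotlib is installed.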

Related

Identifying outliers in an event sequence using a Python Dataframe

I'm experimenting with Machine Learning and LSTM models for river level prediction based on the current level and rainfall within the upstream catchment. I'm currently using TensorFlow and a Keras LSTM model.
I have hourly rainfall data in mm from eight sites within the catchment, and the river stage (level) in meters at a site downstream from the catchment area. The problem I face is that every now and then the rainfall sites are tested by pouring water into them. This creates a significant spike in the data that I would like to filter out.
Here's an example of what a typical rainfall event looks like within the dataframe:
DataFrame showing typical sequence of rainfall and river stage data
And here is an example of what it looks like when two sites have been tested
DataFrame showing abnormal rainfall data due to two sites being tested
I've come across several ways to statistically cluster data and identify outliers; however, none of these really worked on a timed sequence of events. Also, the rainfall site columns are listed in the DataFrame in the order in which they are located within the catchment, so there is a loose spatial coupling moving across the columns.
I was thinking of using something a little like a 3x3 or 3x4 convolution matrix, but rather than calculating new cell values it would find outliers by comparing the values from the central cells with the values in the outer cells. Any significant difference would indicate abnormal data.
The Pandas DataFrame API is quite large and I'm still getting familiar with it. Any suggestions on specific functions or methods I should be looking at would be much appreciated.
In the following example the 10:00:00 reading for Site 2 would be an obvious anomaly.
Timestamp  Site 1  Site 2  Site 3
09:00:00   0       0       0
10:00:00   0       20      0
11:00:00   0       0       0
20mm of rainfall at one site, with zero rainfall at the adjacent sites and at the same site for the hour before and the hour after, is a very clear and obvious case.
This is what a normal rainfall pattern might look like:
Timestamp  Site 1  Site 2  Site 3
09:00:00   6       4       0
10:00:00   0       20      2
11:00:00   0       0       11
This is a less obvious example:
Timestamp  Site 1  Site 2  Site 3
09:00:00   1       0       0
10:00:00   0       20      2
11:00:00   0       3       1
One possibility might be to compare the central cell value to the maximum of the surrounding cell values and flag the cell as abnormal if the difference is greater than 15 (or some other arbitrary threshold value).
The exact criteria will probably change as I discover more about the data. The mechanism for applying those criteria to the dataframe is what I'm looking for. For example, if the criteria were implemented as a lambda function that could reference adjacent cells, is there a way to apply that lambda function across the DataFrame?
An extra complication is how to deal with checking values for Site 1, where there is no preceding site to the left, and Site 3, where there is no following site to the right.
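The neighbour comparison described above could be sketched roughly as follows. This is a minimal illustration, not a tested solution for the real data: the function name flag_isolated_spikes is hypothetical, the threshold of 15 is the arbitrary value mentioned above, and missing neighbours at the edges (left of Site 1, right of Site 3, before the first or after the last timestamp) are assumed to be simply ignored.

```python
import numpy as np
import pandas as pd

def flag_isolated_spikes(df, threshold=15):
    """Flag cells that exceed the max of their 8 neighbours by more than threshold."""
    values = df.to_numpy(dtype=float)
    rows, cols = values.shape
    # Pad with -inf so that neighbours outside the frame never win the max
    padded = np.pad(values, 1, mode="constant", constant_values=-np.inf)
    neighbour_max = np.full(values.shape, -np.inf)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue  # skip the central cell itself
            shifted = padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
            neighbour_max = np.maximum(neighbour_max, shifted)
    mask = (values - neighbour_max) > threshold
    return pd.DataFrame(mask, index=df.index, columns=df.columns)

# The three examples from the question
times = ["09:00:00", "10:00:00", "11:00:00"]
sites = ["Site 1", "Site 2", "Site 3"]
obvious = pd.DataFrame([[0, 0, 0], [0, 20, 0], [0, 0, 0]], index=times, columns=sites)
normal = pd.DataFrame([[6, 4, 0], [0, 20, 2], [0, 0, 11]], index=times, columns=sites)
subtle = pd.DataFrame([[1, 0, 0], [0, 20, 2], [0, 3, 1]], index=times, columns=sites)

obvious_flags = flag_isolated_spikes(obvious)
normal_flags = flag_isolated_spikes(normal)
subtle_flags = flag_isolated_spikes(subtle)
```

With this threshold the 10:00:00 reading for Site 2 is flagged in the first and third examples but not in the normal pattern, because a neighbouring value of 11 brings the difference down to 9.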

How to get statistics of one column of dataframe using data from a second column?

I'm trying to write a program to give a deeper analysis of stock trading data but am coming up against a wall. I'm pulling all trades for a given timeframe and creating a new CSV file in order to use that file as the input for a predictive neural network.
The dataframe I currently have has three values: (1) the price of the stock; (2) the number of shares sold at that price; and (3) the unix timestamp of that particular trade. I'm having trouble getting any accurate statistical analysis of the data. For example, if I use .median(), the program only looks at the number of values listed rather than the fact that each value may have been traded hundreds of times based on the volume column.
As an example, this is the partial trading history for one of the stocks that I'm trying to analyze.
0 227.60 40 1570699811183
1 227.40 27 1570699821641
2 227.59 50 1570699919891
3 227.60 10 1570699919891
4 227.36 100 1570699967691
5 227.35 150 1570699967691 . . .
To better understand the issue, I've also grouped it by price and summed the other columns with groupby('p').sum(). I realize this means the timestamp is useless, but it makes visualization easier.
227.22 2 1570700275307
227.23 100 1570699972526
227.25 100 4712101657427
227.30 105 4712101371199
227.33 50 1570700574172
227.35 4008 40838209836171 . . .
Is there any way to use the number from the trade volume column to perform a statistical analysis of the price column? I've considered creating a new dataframe where each price is listed the number of times that it is traded, but am not sure how to do this.
Thanks in advance for any help!
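The idea of listing each price once per share traded can be sketched with numpy.repeat. This is a minimal illustration using the six trades shown above; the price column name 'p' comes from the groupby call in the question, while the volume column name 'v' is an assumption.

```python
import numpy as np
import pandas as pd

# The six trades listed above; 'v' is an assumed name for the volume column
trades = pd.DataFrame({
    "p": [227.60, 227.40, 227.59, 227.60, 227.36, 227.35],
    "v": [40, 27, 50, 10, 100, 150],
})

# Repeat each price once per share traded, then take ordinary statistics
expanded = np.repeat(trades["p"].to_numpy(), trades["v"].to_numpy())
weighted_median = np.median(expanded)

# The volume-weighted mean does not need the expanded array at all
weighted_mean = np.average(trades["p"], weights=trades["v"])
```

For very large volumes the expanded array can get big; in that case sorting by price and taking a cumulative sum of the volume gives the same median without materialising every share.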

Pandas series function that shows the probability of the up and down moves of the stock price

Days Adjusted stock price
price
0 100
1 50
2 200
3 210
4 220
5 34
6 35
7 36
8 89
Assuming this table is a pandas dataframe, can someone help me write a function that shows the probability of the up and down moves of the stock price? For example, what is the probability of the stock price having two up days in a row?
Thanks. I am new to Python and I have been trying to figure this out for a while!

Actual stock price movement prediction is both a broad and a deep subject usually associated with time series analysis which I would consider out of the scope of this question.
However, the naive approach would be to assume the Bernoulli model where each price move is considered independent both of any previous moves and of time.
In this case, the probability of the price moving up can be inferred by measuring all the up moves against all moves recorded.
# df is a single-column pandas DataFrame storing the price
((df['price'] - df['price'].shift(1)) > 0).sum()/(len(df) - 1)
which for the data you posted gives 0.75.
Given the above, the probability of the price going up for two consecutive days would be 0.75*0.75 approximately equating to 0.56.
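As a cross-check on the independence assumption, the two-up-days probability can also be counted directly from the data. The sketch below uses the prices from the question; comparing the independent estimate with an empirical count is an addition for illustration, not part of the Bernoulli model itself.

```python
import pandas as pd

df = pd.DataFrame({"price": [100, 50, 200, 210, 220, 34, 35, 36, 89]})

# True where the price rose compared with the previous day
up = (df["price"].diff().dropna() > 0).to_numpy()

# Probability of a single up move, as in the formula above
p_up = up.mean()

# Two up days in a row, assuming independent moves
p_two_up_independent = p_up ** 2

# Empirical alternative: share of consecutive move pairs that are both up
p_two_up_empirical = (up[1:] & up[:-1]).mean()
```

With only eight moves the two estimates are necessarily rough, so a noticeable gap between them is not by itself evidence against independence.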

How to remove transients in time-series data in Python (or Pandas)?

I have a time-series set of data recording the flow and temperature of a heat pump. The first few minutes when the system kicks on, the flows and temperatures aren't fully developed and I'd like to filter them out.
Time (min) Flow Supply T Return T
….
45 0 0 0
46 0 0 0
47 1.338375 92.711328 78.72152
48 2.267975 82.578552 74.239624
49 0.778125 96.073136 74.288664
50 0.778125 101.3998 74.686288
51 0.7885 102.1189 74.490528
….
For instance, I don't want to do any calculations with the data from the first 3 minutes of operation (minutes 47-49).
I can do that with a loop, but the data set is very large (>200 mb text file) and takes a really long time to loop through. I was wondering if there's a more efficient way to pull it out, perhaps using Pandas?
Any help or advice is appreciated! Thanks in advance!!

Please try the following; I think it should work. It keeps only the rows where the Flow value three rows earlier is neither 0 nor NaN, which drops the first three rows after the system starts. This assumes that Flow is 0 whenever there is no flow:
In [12]:
df[(df.Flow.shift(3)!=0) & (df.Flow.shift(3).notnull())]
Out[12]:
Time_(min) Flow Supply_T Return_T
5 50 0.778125 101.3998 74.686288
6 51 0.788500 102.1189 74.490528
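A variant that counts the minutes since each start-up may be easier to adapt. This is a sketch under the same assumption that Flow is exactly 0 when the system is off, with one row per minute; the variable names are made up for illustration.

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    "Time": [45, 46, 47, 48, 49, 50, 51],
    "Flow": [0, 0, 1.338375, 2.267975, 0.778125, 0.778125, 0.7885],
})

# True while the system is running (assumes Flow is exactly 0 when off)
running = df["Flow"] != 0

# Label each consecutive on/off stretch with its own id
run_id = (running != running.shift()).cumsum()

# Number of running rows seen so far within the current stretch
minutes_into_run = running.astype(int).groupby(run_id).cumsum()

# Keep running rows that are past the first 3 minutes of their stretch
steady = df[running & (minutes_into_run > 3)]
```

Unlike the shift-based filter, this does not keep the zero-flow rows that immediately follow a shutdown.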

random sampling with pandas dataframe

I'm relatively new to pandas (and python... and programming) and I'm trying to do a Monte Carlo simulation, but I have not been able to find a solution that takes a reasonable amount of time.
The data is stored in a data frame called "YTDSales" which has sales per day, per product
Date Product_A Product_B Product_C Product_D ... Product_XX
01/01/2014 1000 300 70 34500 ... 780
02/01/2014 400 400 70 20 ... 10
03/01/2014 1110 400 1170 60 ... 50
04/01/2014 20 320 0 71300 ... 10
...
15/10/2014 1000 300 70 34500 ... 5000
What I want to do is simulate different scenarios, using for the rest of the year (from October 15 to year end) the historical distribution that each product had. For example, with the data presented I would like to fill the rest of the year with sales between 20 and 1100.
What I've done is the following
import datetime as dt
import numpy as np
import pandas as pd

# creates range of "future dates"
last_historical = YTDSales.index.max()
year_end = dt.datetime(2014,12,30)
DatesEOY = pd.date_range(start=last_historical,end=year_end).shift(1)
# function that obtains a random sales number per product, between min and max
f = lambda x:np.random.randint(x.min(),x.max())
# create all the "future" dates and fill them with the output of f
for i in DatesEOY:
    YTDSales.loc[i]=YTDSales.apply(f)
The solution works, but takes about 3 seconds, which is a lot if I plan to run 1,000 iterations... Is there a way not to iterate?
Thanks

Use the size option for np.random.randint to get a sample of the needed size all at once.
One approach that I would consider is briefly as follows.
Allocate the space you'll need into a new array that will have index values from DatesEOY, columns from the original DataFrame, and all NaN values. Then concatenate onto the original data.
Now that you know the length of each random sample you'll need, use the extra size keyword in numpy.random.randint to sample all at once, per column, instead of looping.
Overwrite the data with this batch sampling.
Here's what this could look like:
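import numpy as np
import pandas

# Placeholder rows for the future dates, all NaN
new_df = pandas.DataFrame(index=DatesEOY, columns=YTDSales.columns)
num_to_sample = len(new_df)
# Draw every future value for one column in a single call
f = lambda col: np.random.randint(col.min(), col.max(), num_to_sample)
output = pandas.concat([YTDSales, new_df], axis=0)
output.iloc[len(YTDSales):] = np.column_stack([f(YTDSales[c]) for c in YTDSales.columns])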
Along the way, I chose to make a totally new DataFrame by concatenating the old one with the new "placeholder" one. This could obviously be inefficient for very large data.
Another approach is setting with enlargement, as you've done in your for-loop solution.
I did not play around with that approach long enough to figure out how to "enlarge" batches of indexes all at once. But if you figure that out, you can just "enlarge" the original data frame with all NaN values (at index values from DatesEOY), and then apply the function above to YTDSales instead of bringing output into it at all.
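The batch sampling can also be done for all columns in one call, without a placeholder frame. The sketch below is a variation on the approach above, not the same code: it uses numpy's Generator.integers (which accepts per-column bounds), a small made-up YTDSales frame with two of the products from the question, and a fixed seed so the run is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for a repeatable illustration

# Small made-up history: five days, two products
hist_index = pd.date_range("2014-10-11", "2014-10-15")
YTDSales = pd.DataFrame(
    {"Product_A": [1000, 400, 1110, 20, 1000],
     "Product_B": [300, 400, 400, 320, 300]},
    index=hist_index,
)

# Future dates, from the day after the last historical date to year end
DatesEOY = pd.date_range(YTDSales.index.max() + pd.Timedelta(days=1), "2014-12-30")

# Per-column bounds; integers() broadcasts them across every future row
lows = YTDSales.min().to_numpy()
highs = YTDSales.max().to_numpy()
samples = rng.integers(lows, highs, size=(len(DatesEOY), len(lows)))

future = pd.DataFrame(samples, index=DatesEOY, columns=YTDSales.columns)
simulated = pd.concat([YTDSales, future])
```

As with np.random.randint, the upper bound is exclusive, so the historical maximum itself is never drawn; for 1,000 scenarios only the sampling and concat lines need to be repeated.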
