I'm just getting into data visualization with pandas. At the moment I'm trying to visualize a DataFrame with matplotlib that looks like this:
Initiative_160608 Initiative_160570 Initiative_160056
Beschluss_BR 2009-05-15 2009-05-15 2006-04-07
Vorlage_BT 2009-05-22 2009-05-22 2006-04-26
Beratung_BT 2009-05-28 2009-05-28 2006-05-11
ABeschluss_BT 2009-06-17 2009-06-17 2006-05-17
Beschlussempf 2009-06-17 2009-06-17 2006-05-26
As you can see, I have a number of columns, each containing five different dates (every date marks one event in a chain of five events). Now to the problem:
My plan is to visualize this data as a stacked horizontal bar chart, using the timedeltas between the five events (how many days passed between the first and last event, including the dates in between). Each column should become one bar in the chart. The chart is not about the absolute time that has passed, but about the duration of each event relative to the overall duration of its column, which means that all bars should have the same overall length.
So far I haven't found anything similar, nor a solution of my own. I would be extremely thankful for any kind of hint on how to proceed with this data.
I'm not exactly sure if this is what you are looking for, but if each column is supposed to be a bar, and you want the time deltas within each column, then you need the difference in days between consecutive rows. I am guessing the first row should have a difference of 0 days, since it is the starting point.
Also for stacked barplots, the index is used to create the categories, but in your case, you want the columns as categories, and each bar to be composed of the different index values. This means you need to transpose your df eventually.
This solution is pretty ugly, but hopefully it helps.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    "Initiative_160608": ['2009-05-15', '2009-05-22', '2009-05-28', '2009-06-17', '2009-06-17'],
    "Initiative_160570": ['2009-05-15', '2009-05-22', '2009-05-28', '2009-06-17', '2009-06-17'],
    "Initiative_160056": ['2006-04-07', '2006-04-26', '2006-05-11', '2006-05-17', '2006-05-26']})
df.index = ['Beschluss_BR', 'Vorlage_BT', 'Beratung_BT', 'ABeschluss_BT', 'Beschlussempf']
# convert everything to dates
df = df.apply(lambda x: pd.to_datetime(x, format="%Y-%m-%d"))
def get_days(x):
    diff_list = []
    for i in range(len(x)):
        if i == 0:
            diff_list.append(x.iloc[i] - x.iloc[i])
        else:
            diff_list.append(x.iloc[i] - x.iloc[i-1])
    return diff_list
# get the difference in days, then convert back to numbers
df_diff = df.apply(lambda x: get_days(x), axis = 0)
df_diff = df_diff.apply(lambda x: x.dt.days)
# transpose the matrix so that each initiative becomes a stacked bar
df_diff = df_diff.transpose()
# replace 0 values with 0.2 so that the bars are visible
df_diff = df_diff.replace(0, 0.2)
df_diff.plot.bar(stacked = True)
plt.show()
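The question also asks for all bars to have the same overall length. One way to sketch that (an addition, not part of the answer above) is to divide each row of df_diff by its row sum before plotting, and use barh for horizontal bars. Here the day values are hard-coded from the example data so the snippet stands alone:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless-safe; drop this line when running interactively
import matplotlib.pyplot as plt

# df_diff as produced above: one row per initiative, one column per event gap (in days)
df_diff = pd.DataFrame({'Beschluss_BR': [0.2, 0.2, 0.2],
                        'Vorlage_BT': [7, 7, 19],
                        'Beratung_BT': [6, 6, 15],
                        'ABeschluss_BT': [20, 20, 6],
                        'Beschlussempf': [0.2, 0.2, 9]},
                       index=['Initiative_160608', 'Initiative_160570', 'Initiative_160056'])

# Normalize each row so every bar has the same total length (1.0)
df_norm = df_diff.div(df_diff.sum(axis=1), axis=0)
df_norm.plot.barh(stacked=True)
plt.show()
```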
I'm trying to create a graph with Seaborn that shows all of the Windows events in the Domain Controller that have taken place in a given time range. Since the range moves, the counts change: you might have, say, five events now, but when you run the program again in 10 minutes, you might get 25 events.
With that said, I've been able to parse these events (labeled Actions) from a mumbo-jumbo of other data in the log file and then create a DataFrame in Pandas. The script outputs the data as a dictionary. After creating the DataFrame, this is what the output looks like:
   logged-out  kerberos-authentication-ticket-requested  logged-in  created-process  exited-process
1           1                                         5          2                1               1
Note: The values you see above are the number of times the process took place within that time frame.
That would be good enough for me if a table were all I needed. When I try to put this DataFrame into Seaborn, I get an error because I don't know what to name the x and y axes: they are always changing. So my solution was to use df.melt() to convert those columns into rows, and then label the only two columns needed ('Action', 'Count'). But that's where I fumbled multiple times; I can't figure out how to use df.melt() correctly.
Here is my code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Ever-changing data
actions = {'logged-out': 1, 'kerberos-authentication-ticket-requested': 5, 'logged-in': 2,
           'created-process': 1, 'exited-process': 1}
#Create DataFrame
data = actions
index = 1 * pd.RangeIndex(start=1, stop=2) #add automatic index
df = pd.DataFrame(data,index=index,columns=['Action','Count'])
print(df)
#Convert Columns to Rows and Add
df.melt(id_vars=["Action", "Count"],
var_name="Action",
value_name="Count")
#Create graph
sns.barplot(data=df,x='Action',y='Count',
palette=['#476a6f','#e63946'],
dodge=False,saturation=0.65)
plt.savefig('fig.png')
plt.show()
Any help is appreciated.
You can use:
df.melt(var_name="Action", value_name="Count")
without using any id_vars!
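For completeness, a minimal sketch of the whole round trip (the dict values here are made up to match the table in the question):

```python
import pandas as pd

actions = {'logged-out': 1, 'kerberos-authentication-ticket-requested': 5,
           'logged-in': 2, 'created-process': 1, 'exited-process': 1}

df = pd.DataFrame(actions, index=[1])      # one row, one column per action
melted = df.melt(var_name="Action", value_name="Count")
# melted now has exactly the two columns seaborn needs, e.g.:
# sns.barplot(data=melted, x='Action', y='Count')
```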
I am working on some glacier borehole temperature data consisting of ~1,000 rows by 700 columns. The vertical index is depth (i.e. as you move down the array, depth increases) and the column headers are datetime values (i.e. as you move right along the array, you move forwards in time).
I am looking for a way to average all temperatures in the columns depending on a date sampling rate. For example, the early datetimes have a spacing of 10 minutes, but the later datetimes have a spacing of six hours.
It would be good to be able to put in the sampling as an input and get out data based on that sampling rate so that I can see which one works best.
It would also be good that if I choose, say, 3-hour sampling, spacings above 3 hours are simply ignored and the data is left unchanged in those ranges (i.e. datetime spacings of 10 minutes are averaged, but datetime spacings of 6 hours are left unaffected).
All of this needs to come out in either a pandas dataframe with date as column headers and depth as the index, or as a numpy array and separate list of datetimes.
I'm fairly new to Python, and this is my first question on stackoverflow!! Thanks :)
(I know the following is not totally correct use of Pandas, but it works for the figure slider I've produced!)
import numpy as np
import pandas as pd
#example array
T = np.array([[-2,   -2,   -2,   -2.1, -2.3, -2.6],
              [-2.2, -2.3, -3,   -3.1, -3.3, -3.3],
              [-4,   -4,   -4.5, -4.4, -4.6, -4.5]])
#example headers at 8 and then 4 hour spacing
headers = (pd.date_range(start='2018-04-24 00:00:00', end='2018-04-24 08:00:00', periods=3).tolist() +
           pd.date_range(start='2018-04-24 12:00:00', end='2018-04-25 12:00:00', periods=3).tolist())
#pandas dataframe in same setup as much larger one I'm using
T_df = pd.DataFrame(T, columns=headers)
One trick you can use is to convert your time series to a numeric series, and then use the groupby method.
For instance, imagine you have
df = pd.DataFrame([['10:00:00', 1.],['10:10:00', 2.],['10:20:00', 3.],['10:30:00', 4.]],columns=['Time', 'Value'])
df.Time = pd.to_datetime(df.Time, format='%X')
You can convert your time series by:
df['DeltaT'] = (df.Time - df.Time.iloc[0]).dt.total_seconds().astype(int)  # elapsed seconds, starting at 0
Then use the groupby method. You can for instance create a new column to floor the time interval you want:
myInterval = 1200.
df['group'] = (df['DeltaT']/myInterval).astype(int)
So you can use groupby followed by mean() (or a function you define)
df.groupby('group').mean()
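Applied to the borehole layout above (dates as column headers rather than rows), the same trick can be sketched by grouping the columns themselves. This is only a sketch, assuming the headers are plain Timestamps; columns spaced wider than the chosen interval land alone in their bin and are left unchanged by the mean:

```python
import numpy as np
import pandas as pd

# Example frame: depth rows, datetime columns (10-minute spacing, then 6-hour spacing)
np.random.seed(0)
cols = (pd.date_range('2018-04-24 00:00', periods=3, freq='10min').tolist() +
        pd.date_range('2018-04-24 06:00', periods=3, freq='6h').tolist())
T_df = pd.DataFrame(np.random.rand(4, 6), columns=cols)

# Floor the elapsed time of each column into bins of the chosen sampling interval
interval = pd.Timedelta('3h').total_seconds()
elapsed = pd.Series(T_df.columns).astype('int64') / 1e9   # seconds since epoch
groups = ((elapsed - elapsed.iloc[0]) // interval).astype(int)

# Average the columns within each bin (transpose, group the rows, transpose back)
averaged = T_df.T.groupby(groups.values).mean().T
# restore a representative datetime header for each bin (first date in the bin)
averaged.columns = [T_df.columns[groups.values == g][0] for g in averaged.columns]
```

Here the three 10-minute columns collapse into one averaged column, while the three 6-hour columns pass through untouched.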
Hope this helps!
I illustrate my question with the following example.
I have two pandas DataFrames.
The first has ten-second timesteps and is continuous. Example data for two days:
import pandas as pd
import random
t_10s = pd.date_range(start='1/1/2018', end='1/3/2018', freq='10s')
t_10s = pd.DataFrame(columns = ['b'],
data = [random.randint(0,10) for _ in range(len(t_10s))],
index = t_10s)
The next DataFrame has five-minute timesteps, but there is only data during daytime, and the logging starts at a different time each morning. Example data for two days, starting at two different times in the morning to resemble the real data:
t_5m1 = pd.date_range(start='1/1/2018 08:08:30', end='1/1/2018 18:03:30', freq='5min')
t_5m2 = pd.date_range(start='1/2/2018 08:10:25', end='1/2/2018 18:00:25', freq='5min')
t_5m = t_5m1.append(t_5m2)
t_5m = pd.DataFrame(columns = ['a'],
data = [0 for _ in range(len(t_5m))],
index = t_5m)
Now what I want to do is for each datapoint, x, in t_5m, to find the equivalent average of the t_10s data, in a five minute window surrounding x.
Now, I have found a way to do this with a list-comprehension as follows:
tstep = pd.to_timedelta(2.5, 'm')
t_5m['avg'] = [t_10s.loc[((t_10s.index >= t_5m.index[i] - tstep) &
(t_10s.index < t_5m.index[i] + tstep))].b.mean() for i in range(0,len(t_5m))]
However, I want to do this for a timeseries spanning at least two years and for many columns (not just b as here; my current solution is to loop over the relevant columns), and the code then gets very slow. Can anyone think of a trick to do this more efficiently? I have thought about using resample or groupby. That would work if I had a regular 5-minute interval, but since it is irregular between days, I cannot make it work. Grateful for any input!
Have looked around some, e.g. here, but couldn't find what I need.
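One thing I have sketched, but am not sure matches the loop exactly, is a centered rolling mean sampled at the 5-minute timestamps (assuming pandas ≥ 1.3 for center=True with time-based windows). The window is anchored on the 10-second grid rather than exactly on each x, so it is an approximation:

```python
import random
import pandas as pd

t_idx = pd.date_range(start='1/1/2018', end='1/3/2018', freq='10s')
t_10s = pd.DataFrame({'b': [random.randint(0, 10) for _ in range(len(t_idx))]},
                     index=t_idx)

t_5m_idx = (pd.date_range('1/1/2018 08:08:30', '1/1/2018 18:03:30', freq='5min')
            .append(pd.date_range('1/2/2018 08:10:25', '1/2/2018 18:00:25', freq='5min')))

# Centered 5-minute rolling mean over every column at once,
# then sample it at the (irregular) 5-minute timestamps.
rolled = t_10s.rolling('5min', center=True).mean()
avg = rolled.reindex(t_5m_idx, method='nearest')
```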
Ok so I have a dataframe object that's indexed as follows:
index, rev, metric1 (more metrics.....)
exp1, 92365, 0.018987
exp2, 92365, -0.070901
exp3, 92365, 0.150140
exp1, 87654, 0.003008
exp2, 87654, -0.065196
exp3, 87654, -0.174096
For each of these metrics I want to create individual stacked barplots comparing them based on their rev.
here's what I've tried:
df = df[['rev', 'metric1']]
df = df.groupby("rev")
df.plot(kind = 'bar')
This results in 2 individual bar graphs of the metric. Ideally I would have these two merged and stacked (right now stacked=True does nothing). Any help would be much appreciated.
This would give me my ideal result, however I don't think reorganizing to fit this is the best way to achieve my goal as I have many metrics and many revisions.
index, metric1(rev87654), metric1(rev92365)
exp1, 0.018987, 0.003008
exp2, -0.070901, -0.065196
exp3, 0.150140, -0.174096
This is my goal. (made by hand)
http://i.stack.imgur.com/5GRqB.png
Following this matplotlib gallery example:
http://matplotlib.org/examples/api/barchart_demo.html
There they get multiple bars to plot by calling bar once for each set.
You could access these values in pandas with indexing operations as follows:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16.2, 10), dpi=300)
Y = Tire2[Tire2.SL == Tire2.SL.unique()[0]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y))
ax.bar(X, Y, width=.4)
Y = Tire2[Tire2.SL == Tire2.SL.unique()[2]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y)) + .5
ax.bar(X, Y, width=.4, color='r')
Working from the inside out:
Get all of the unique values of 'SL' in one of the cols (rev in your case).
Get a Boolean vector of all rows where 'SL' equals the first (or nth) unique value.
Index Tire2 by that Boolean vector (this will pull out only those rows where the vector is True).
Access the values of SA (or a metric in your case). (I took only the [0:13] values because I was testing this on a huge data set.)
Bar-plot those values.
If your experiments are consistently in order in the frame (as shown), that's that. Otherwise you might need a little sorting to get your Y values in the right order; .sort_values(column_name) should take care of that. In my code, I'd slip it in between ...[0]] and .SA....
In general, this kind of operation can really help you out in wrangling big frames. .between is useful. And you can always add, multiply, etc. the Boolean vectors to construct more complex logic.
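A self-contained sketch of the same pattern, looping over the unique values instead of writing each pair of lines by hand (Tire2 here is a tiny made-up frame, just to make it runnable):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless-safe; drop this line when running interactively
import matplotlib.pyplot as plt

# A tiny stand-in for Tire2 (hypothetical data, only to show the indexing pattern)
Tire2 = pd.DataFrame({'SL': [1, 1, 1, 2, 2, 2],
                      'SA': [0.1, 0.4, 0.3, 0.2, 0.5, 0.6]})

fig, ax = plt.subplots()
for n, sl in enumerate(Tire2.SL.unique()):
    Y = Tire2[Tire2.SL == sl].SA.values      # rows where SL equals this value
    X = np.arange(len(Y)) + n * 0.4          # shift each group so bars sit side by side
    ax.bar(X, Y, width=0.4, label=f'SL={sl}')
ax.legend()
```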
I'm not sure how to get the plot you want automatically without doing exactly the reorganization you specify at the end. The answer by user3823992 gives you more detailed control of the plots, but if you want them more automatic here is some temporary reorganization that should work using the indexing similarly but also concatenating back into a DataFrame that will do the plot for you.
import numpy as np
import pandas as pd
exp = ['exp1','exp2','exp3']*2
rev = [1,1,1,2,2,2]
met1 = np.linspace(-0.5,1,6)
met2 = np.linspace(1.0,5.0,6)
met3 = np.linspace(-1,1,6)
df = pd.DataFrame({'rev':rev, 'met1':met1, 'met2':met2, 'met3':met3}, index=exp)
for met in df.columns:
    if met != 'rev':
        merged = df[df['rev'] == df.rev.unique()[0]][met]
        merged.name = merged.name + 'rev' + str(df.rev.unique()[0])
        for rev in df.rev.unique()[1:]:
            tmp = df[df['rev'] == rev][met]
            tmp.name = tmp.name + 'rev' + str(rev)
            merged = pd.concat([merged, tmp], axis=1)
        merged.plot(kind='bar')
This should give you three plots, one for each of my fake metrics.
EDIT : Or something like this might do also
df['exp'] = df.index
pt = pd.pivot_table(df, values='met1', index=['exp'], columns=['rev'])
pt.plot(kind='bar')
Thanks for reading. I've spent 3-4 hours searching for examples to solve this but can't find any that do; the ones I did try didn't seem to work with a pandas DataFrame object. Any help would be very much appreciated! :)
Ok this is my problem.
I have a Pandas DataFrame containing 12 columns.
I have 500,000 rows of data.
Most of the columns are useless. The variables/columns I am interested in are called x, y and profit.
Many of the x and y points are the same, so I'd like to group them into unique combinations and then add up all the profit for each unique combination.
Each unique combination is a bin (like a bin used in histograms)
Then I'd like to plot a 2d chart/heatmap etc of x,y for each bin and the colour to be total profit.
e.g.
x,y,profit
7,4,230.0
7,5,162.4
6,8,19.3
7,4,-11.6
7,4,180.2
7,5,15.7
4,3,121.0
7,4,1162.8
Note how for the values x=7, y=4, there are 3 rows that meet this criteria; the total profit should be:
230.0 - 11.6 + 1162.8 = 1381.2
So in bin x=7, y = 4, the profit is 1381.2
Note how for the values x=7, y=5, there are 2 instances; the total profit should be: 162.4 + 15.7 = 178.1
So in bin x=7, y = 5, the profit is 178.1
So finally, I just want to be able to plot: x,y,total_profit_of_bin
To help illustrate what I'm looking for, I found this on the internet; it is similar to what I'd like (ignore the axes & numbers):
http://2.bp.blogspot.com/-F8q_ZcI-HJg/T4_l7D0C7yI/AAAAAAAAAgE/Bqtx3eIHzRk/s1600/heatmap.jpg
Thank-you so much for taking the time to read:)
If within each 'bin' of x the values of y are equal, then you can use groupby.agg. That would look something like this:
import pandas as pd
import numpy as np
df = YourData
AggDF = df.groupby('x').agg({'y' : 'max', 'profit' : 'sum'})
AggDF
That would get you the data I think you want, then you could plot as you see fit. Do you need assistance with that also?
NB: this is only going to work the way you want if within each 'bin' (i.e. the data grouped according to the values of x) the values of y are equal. I assume this must be the case, as otherwise I don't think it would make much sense to be trying to graph x and y together.
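If the y values do vary within an x bin, a sketch worth trying is grouping on both columns and pivoting into a 2-D grid for a heatmap (the pcolormesh call is just one plotting option; the rows here follow the worked totals in the question):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # headless-safe; drop this line when running interactively
import matplotlib.pyplot as plt

df = pd.DataFrame({'x': [7, 7, 6, 7, 7, 4, 7],
                   'y': [4, 5, 8, 4, 5, 3, 4],
                   'profit': [230.0, 162.4, 19.3, -11.6, 15.7, 121.0, 1162.8]})

# Sum profit per unique (x, y) bin, then pivot into a 2-D grid
binned = df.groupby(['x', 'y'])['profit'].sum()
grid = binned.unstack('x')  # rows: y, columns: x; missing bins become NaN

fig, ax = plt.subplots()
im = ax.pcolormesh(grid.columns, grid.index, grid.values, shading='nearest')
fig.colorbar(im, ax=ax, label='total profit')
ax.set_xlabel('x')
ax.set_ylabel('y')
```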