Convert Varying Column Length to Rows in Pandas - python

I'm trying to create a graph with Seaborn that shows all of the Windows events in the Domain Controller that have taken place in a given time range, which means you have, say, five events now, but when you run the program again in 10 minutes, you might get 25 events.
With that said, I've been able to parse these events (labeled Actions) from a mumbo-jumbo of other data in the log file and then create a DataFrame in Pandas. The script outputs the data as a dictionary. After creating the DataFrame, this is what the output looks like:
   logged-out  kerberos-authentication-ticket-requested  logged-in  created-process  exited-process
1           1                                          5          2                1               1
Note: The values you see above are the number of times the process took place within that time frame.
That would be good enough for me if a table were all I needed. When I try to put this DataFrame into Seaborn, though, I get an error because I don't know what to name the x and y axes; they are always changing. So my solution was to use the df.melt() function to convert those columns into rows, and then label the only two columns needed ('Actions', 'Count'). But that's where I fumbled multiple times: I can't figure out how to use the df.melt() function correctly.
Here is my code:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#Ever-changing data
actions = {'logged-out': 2, 'kerberos-authentication-ticket-requested': 5, 'logged-in': 2,
           'created-process': 1, 'exited-process': 1, 'logged-out': 1}
#Create DataFrame
data = actions
index = 1 * pd.RangeIndex(start=1, stop=2) #add automatic index
df = pd.DataFrame(data,index=index,columns=['Action','Count'])
print(df)
#Convert Columns to Rows and Add
df.melt(id_vars=["Action", "Count"],
        var_name="Action",
        value_name="Count")
#Create graph
sns.barplot(data=df, x='Action', y='Count',
            palette=['#476a6f','#e63946'],
            dodge=False, saturation=0.65)
plt.savefig('fig.png')
plt.show()
Any help is appreciated.

You can use:
df.melt(var_name="Action", value_name="Count")
without using any id_vars!
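For completeness, a minimal end-to-end sketch of that fix. This is not the asker's exact program: the DataFrame is built directly from the dictionary as a single row (note that Python silently collapses the duplicate 'logged-out' key, keeping the last value):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Ever-changing data; duplicate keys collapse, so 'logged-out' ends up as 1
actions = {'logged-out': 1, 'kerberos-authentication-ticket-requested': 5,
           'logged-in': 2, 'created-process': 1, 'exited-process': 1}

# One row per run; the columns are whatever actions happened to occur
df = pd.DataFrame([actions])

# No id_vars: every column melts into an (Action, Count) row
long_df = df.melt(var_name='Action', value_name='Count')

sns.barplot(data=long_df, x='Action', y='Count',
            palette=['#476a6f', '#e63946'],
            dodge=False, saturation=0.65)
plt.savefig('fig.png')
plt.show()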

Related

Python 3.x Pandas Matplotlib Ignore non float dtype in series (ignore mixed data)

I have a program that browses for an Excel file in the local directory, loads the selected file, then looks for all cells within a certain range of columns after a certain number of rows (only columns C-G after row 2, for example). The problem is that all of the data should be floats, but they're mixed with strings and datetimes. I'm completely happy ignoring anything that isn't a float, since the strings are usually notes typed in as a warning and the dates are usually user error, not data that is needed.
Before it plots the data, it takes the 5 columns and combines them into a 1-D DataFrame. What I would like to do is iterate through that and throw out the data that doesn't match. I can write the pseudocode, but for some reason I can't wrap my head around the actual coding of it.
for each in df:
    if df[x].dtype != float64:
        continue
    else:
        df2 += df[x]
        continue
I know there's a simple way to do this, I just can't figure it out for the life of me.
Any help is greatly appreciated. TIA!
EDIT:
I found out it's actually a Series, not a DataFrame.
I tried adding the following code, but now my values range from 0 to 1000 when they should all be well within 0.5 of each other.
data = pd.to_numeric(data, errors='coerce').dropna()
Edit 2:
Example code
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_excel('testdata.xlsx', header=None)
data = data.stack()
data = pd.to_numeric(data, errors='coerce').dropna()
plt.violinplot(data)
plt.show()
The output shows a graph between 0.0000 and 0.0005 on the y axis and 0.8 and 1.2 on the x axis. The x axis is roughly what I expect given the small sample size, but the y axis was expected to be much larger (between 3.5 and 3.7 or more).
The sample data is in an Excel file, although I assume a CSV would be read in the same way. Row and column headers are included so you can replicate it if needed.
df = pd.DataFrame({'col': [3.615, 3.6155, "hello", 3.6153, 1/3/1900]})
Paste bin of full code: https://pastebin.com/ycfsK6gc
Drive link for sample xlsx: https://docs.google.com/spreadsheets/d/11Tp91y33OmhXsZYg3lL-I3XMXoZl7xuc/edit?usp=drivesdk&ouid=104875635689576986452&rtpof=true&sd=true
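As a minimal runnable sketch of the coercion approach on the sample column above (the date is written as a string here purely for illustration), pd.to_numeric with errors='coerce' does keep only the float-like entries of a flat Series:

import pandas as pd

df = pd.DataFrame({'col': [3.615, 3.6155, "hello", 3.6153, "1/3/1900"]})

# Strings and date-like text coerce to NaN, which dropna() then removes
clean = pd.to_numeric(df['col'], errors='coerce').dropna()
print(clean)  # 3.615, 3.6155, 3.6153 remain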

Pandas Dataframe - How to get a Rolling Sum grouped by a value?

Working with some COVID-19 data, how should I be calculating a 14 day rolling sum of case counts?
Here's my existing code:
import pandas as pd
import matplotlib.pyplot as plt
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv'
all_counties = pd.read_csv(url, dtype={"fips": str})
all_counties.date = pd.to_datetime(all_counties.date)
oregon = all_counties.loc[all_counties['state'] == 'Oregon']
oregon.set_index('date', inplace=True)
oregon['delta']=oregon.groupby(['state','county'])['cases'].diff().fillna(0)
oregon.head()
This code calculates the daily incremental case count (thanks to an earlier question's answers).
The next step is calculating the rolling 14 day sum, for which I have attempted this:
oregon['rolling_14']=oregon.groupby(['state','county'])['delta'].rolling(min_periods=1, window=14).sum()
It is unfortunately failing. If I have a single county's data, this works:
county['rolling_14']=county.rolling(min_periods=1, window=14).sum()
But unfortunately, this isn't viable when the data frame contains multiple counties' datasets.
The groupby().rolling() result has two extra index levels, namely state and county. Drop them and the assignment will work:
oregon['rolling_14'] = (oregon.groupby(['state','county'])['delta']
                        .rolling(min_periods=1, window=14).sum()
                        .reset_index(level=['state','county'], drop=True)
                       )
Also, since you are calling groupby more than once, reusing a lazy groupby object improves run time and tidies the code a bit:
groups = oregon.groupby(['state','county'])
oregon['delta'] = groups['cases'].diff().fillna(0)
oregon['rolling_14'] = (groups['delta']
                        .rolling(min_periods=1, window=14).sum()
                        .reset_index(level=['state','county'], drop=True)
                       )
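To see why the index levels have to be dropped, here is a tiny self-contained sketch (made-up numbers, not the NYT data) showing the MultiIndex that groupby().rolling() produces:

import pandas as pd

df = pd.DataFrame({
    'state':  ['Oregon'] * 4,
    'county': ['Baker', 'Baker', 'Benton', 'Benton'],
    'delta':  [1.0, 2.0, 3.0, 4.0],
}, index=pd.to_datetime(['2020-07-01', '2020-07-02'] * 2))

rolled = df.groupby(['state', 'county'])['delta'].rolling(min_periods=1, window=14).sum()
print(rolled.index.names)  # ['state', 'county', None]: two extra levels

# Dropping the group levels lines the result back up with the date index
df['rolling_14'] = rolled.reset_index(level=['state', 'county'], drop=True)
print(df)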

How can I create a stacked barchart with timedeltas using matplotlib?

I'm just getting into data visualization with pandas. At the moment I'm trying to visualize a DataFrame with matplotlib that looks like this:
Initiative_160608 Initiative_160570 Initiative_160056
Beschluss_BR 2009-05-15 2009-05-15 2006-04-07
Vorlage_BT 2009-05-22 2009-05-22 2006-04-26
Beratung_BT 2009-05-28 2009-05-28 2006-05-11
ABeschluss_BT 2009-06-17 2009-06-17 2006-05-17
Beschlussempf 2009-06-17 2009-06-17 2006-05-26
As you can see, I have a number of columns, each with five dates (every date marks one event in a chain of five events). Now to the problem:
My plan is to visualize this data as a stacked horizontal bar chart, using the timedeltas between the five events (how many days passed between the first and last event, including the dates in between). Every column should be one bar in the chart. The chart is not about the absolute time that has passed, but about the duration of each of the five stages relative to the overall duration of one column, which means all bars should have the same overall length.
So far I haven't found anything similar or worked out a solution myself. I would be extremely thankful for any kind of solution to proceed with the data shown.
I'm not exactly sure if this is what you are looking for, but if each column is supposed to be a bar, and you want the time deltas within each column, then you need the difference in days between each row, and I am guessing the first row should have a difference of 0 days (since it is the starting point).
Also for stacked barplots, the index is used to create the categories, but in your case, you want the columns as categories, and each bar to be composed of the different index values. This means you need to transpose your df eventually.
This solution is pretty ugly, but hopefully it helps.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    "Initiative_160608": ['2009-05-15', '2009-05-22', '2009-05-28', '2009-06-17', '2009-06-17'],
    "Initiative_160570": ['2009-05-15', '2009-05-22', '2009-05-28', '2009-06-17', '2009-06-17'],
    "Initiative_160056": ['2006-04-07', '2006-04-26', '2006-05-11', '2006-05-17', '2006-05-26']})
df.index = ['Beschluss_BR', 'Vorlage_BT', 'Beratung_BT', 'ABeschluss_BT', 'Beschlussempf']

# convert everything to dates
df = df.apply(lambda x: pd.to_datetime(x, format="%Y-%m-%d"))

def get_days(x):
    # difference in days between consecutive events; the first event gets 0
    diff_list = []
    for i in range(len(x)):
        if i == 0:
            diff_list.append(x.iloc[i] - x.iloc[i])
        else:
            diff_list.append(x.iloc[i] - x.iloc[i - 1])
    return diff_list

# get the difference in days, then convert back to numbers
df_diff = df.apply(get_days, axis=0)
df_diff = df_diff.apply(lambda x: x.dt.days)

# transpose the matrix so that each initiative becomes a stacked bar
df_diff = df_diff.transpose()

# replace 0 values with 0.2 so that the bars are visible
df_diff = df_diff.replace(0, 0.2)

df_diff.plot.bar(stacked=True)
plt.show()
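The question also asks for all bars to end up the same overall length. One way to get there, as a sketch building on the df_diff frame above (not part of the original answer), is to normalize each row to percentages before plotting, and use barh for the requested horizontal bars:

import matplotlib.pyplot as plt

# each stage duration as a share of that initiative's total duration
df_norm = df_diff.div(df_diff.sum(axis=1), axis=0) * 100
df_norm.plot.barh(stacked=True)
plt.xlabel('share of overall duration (%)')
plt.show()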

Categorizing CSV data by groups defined through string values

So I am trying to organize data from a CSV file with pandas so I can graph it in matplotlib. I have different rows of values, some of which are control and others experimental. I am able to separate the rows I want to graph, but I cannot seem to make the plotting work: I have attempted for loops (seen below), but I keep getting TypeError: 'type' object is not subscriptable.
import pandas as pd
import numpy as np
import matplotlib as plt
df = pd.read_csv('C:\\Users\\User\\Desktop\\Ubiome samples\\samples.csv')
group = (df['Group'])
count = (df['Count'])
time = (df['Time'])
for steps in range [group]:
    plt.plot([time], [count], 'bs')
plt.show()
There is a typo in your for loop:
for steps in range [group]:
Should be
for steps in range(group):
Your for loop tries to call __getitem__ on range, but since this method isn't defined for range, you get a TypeError: 'type' object is not subscriptable. Check the Python documentation for __getitem__() for more details.
However, you cannot use range on a pandas Series to loop over every item in it, since range expects integers as its input. Instead you should use:
for steps in group:
This will loop over every row in your csv file, and output the exact same plot for each row. I'm quite sure this is not what you actually want to do.
If I understand your question well, you want to plot each group of experimental/control values you have in your csv.
Then you should try (untested) :
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('C:\\Users\\User\\Desktop\\Ubiome samples\\samples.csv')
for group in df['Group'].unique():
    group_data = df[df['Group'] == group]
    plt.plot(group_data['Time'], group_data['Count'], 'bs')
plt.show()
for group in df['Group'].unique() will loop over every piece of data in the Group column, ignoring duplicates.
For instance, if your column has 1000 strings in it, but all of those strings are either "experimental" or "control", then this will loop over ['experimental', 'control'] (actually a numpy array; also note that unique() doesn't sort, so the order of the output depends on the order of the input).
df[df['Group'] == group] will then select all the rows where the column 'Group' is equal to group.
Check the pandas documentation on boolean indexing (masking) for more details.
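If you also want each group to be visually distinguishable, a small variation on the loop above (a sketch, not part of the original answer) uses groupby, which yields one (name, sub-frame) pair per unique Group value, and adds a legend entry per group:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('samples.csv')  # same columns as above: Group, Time, Count

# iterate the groups directly instead of masking manually
for name, group_data in df.groupby('Group'):
    plt.plot(group_data['Time'], group_data['Count'], 's', label=name)
plt.legend()
plt.show()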

Pandas python barplot by subgroup

Ok so I have a dataframe object that's indexed as follows:
index, rev, metric1 (more metrics.....)
exp1, 92365, 0.018987
exp2, 92365, -0.070901
exp3, 92365, 0.150140
exp1, 87654, 0.003008
exp2, 87654, -0.065196
exp3, 87654, -0.174096
For each of these metrics I want to create individual stacked barplots comparing them based on their rev.
here's what I've tried:
df = df[['rev', 'metric1']]
df = df.groupby("rev")
df.plot(kind = 'bar')
This results in 2 individual bar graphs of the metric. Ideally I would have these two merged and stacked (right now stacked=True does nothing). Any help would be much appreciated.
This would give me my ideal result; however, I don't think reorganizing to fit this is the best way to achieve my goal, as I have many metrics and many revisions.
index, metric1(rev87654), metric1(rev92365)
exp1, 0.018987, 0.003008
exp2, -0.070901, -0.065196
exp3, 0.150140, -0.174096
This is my goal. (made by hand)
http://i.stack.imgur.com/5GRqB.png
Following from this matplotlib gallery example:
http://matplotlib.org/examples/api/barchart_demo.html
there they get multiple bars to plot by calling bar once for each set.
You could access these values in pandas with indexing operations as follows:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16.2, 10), dpi=300)

Y = Tire2[Tire2.SL == Tire2.SL.unique()[0]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y))
ax.bar(X, Y, width=.4)

Y = Tire2[Tire2.SL == Tire2.SL.unique()[2]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y)) + .5
ax.bar(X, Y, width=.4, color='r')
Working from the inside out:
get all of the unique values of 'SL' in one of the cols (rev in your case)
get a Boolean vector of all rows where 'SL' equals the first (or nth) unique value
index Tire2 by that Boolean vector (this will pull out only those rows where the vector is True)
access the values of SA, or a metric in your case (I took only the [0:13] values because I was testing this on a huge data set)
bar plot those values
If your experiments are consistently in order in the frame (as shown), that's that. Otherwise you might need a little sorting to get your Y values in the right order; .sort_values(column_name) should take care of that. In my code, I'd slip it in between ...[0]] and .SA....
In general, this kind of operation can really help you out in wrangling big frames. .between is useful, and you can always add, multiply, etc. the Boolean vectors to construct more complex logic, as in the sketch below.
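A small illustration of combining Boolean vectors (made-up numbers in the shape of the question's frame):

import pandas as pd

df = pd.DataFrame({'rev': [92365, 92365, 87654, 87654],
                   'metric1': [0.02, -0.07, 0.15, 0.003]},
                  index=['exp1', 'exp2', 'exp1', 'exp2'])

# & combines the two Boolean vectors element-wise
mask = (df['rev'] == 92365) & df['metric1'].between(-0.1, 0.1)
print(df[mask])  # only rev 92365 rows whose metric1 falls in [-0.1, 0.1]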
I'm not sure how to get the plot you want automatically without doing exactly the reorganization you specify at the end. The answer by user3823992 gives you more detailed control of the plots, but if you want them more automatic, here is some temporary reorganization that should work, using similar indexing but concatenating back into a DataFrame that will do the plot for you.
import numpy as np
import pandas as pd
exp = ['exp1','exp2','exp3']*2
rev = [1,1,1,2,2,2]
met1 = np.linspace(-0.5,1,6)
met2 = np.linspace(1.0,5.0,6)
met3 = np.linspace(-1,1,6)
df = pd.DataFrame({'rev':rev, 'met1':met1, 'met2':met2, 'met3':met3}, index=exp)
for met in df.columns:
    if met != 'rev':
        merged = df[df['rev'] == df.rev.unique()[0]][met]
        merged.name = merged.name + 'rev' + str(df.rev.unique()[0])
        for rev in df.rev.unique()[1:]:
            tmp = df[df['rev'] == rev][met]
            tmp.name = tmp.name + 'rev' + str(rev)
            merged = pd.concat([merged, tmp], axis=1)
        merged.plot(kind='bar')
This should give you three plots, one for each of my fake metrics.
EDIT: Or something like this might do also:
df['exp'] = df.index
pt = pd.pivot_table(df, values='met1', index='exp', columns='rev')
pt.plot(kind='bar')
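To cover every metric with the pivot approach as well, a short sketch (an extension, not part of the original answer) loops the pivot over the metric columns; in this wide shape, stacked=True also behaves the way the question expects:

for met in ['met1', 'met2', 'met3']:
    pt = pd.pivot_table(df, values=met, index='exp', columns='rev')
    pt.plot(kind='bar', stacked=True)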
