Plotting by Index with different labels - python

I am using pandas and matplotlib to generate some charts.
My DataFrame:
Journal Papers per year in journal
0 Information and Software Technology 4
1 2012 International Conference on Cyber Securit... 4
2 Journal of Network and Computer Applications 4
3 IEEE Security & Privacy 5
4 Computers & Security 11
My DataFrame is the result of a groupby on a larger dataframe. What I want now is a simple bar chart, which in theory works fine with df_groupby_time.plot(kind='bar'). However, the chart I get (screenshot not included here) is not what I'm after:
What I want are different colored bars, and a legend which states which color corresponds to which paper.
Playing around with relabeling hasn't gotten me anywhere so far, and I no longer have any idea how to achieve what I want.
EDIT:
Setting the journal column as the index and plotting isn't what I want either:
df_groupby_time.set_index("Journals").plot(kind='bar')

I found a solution, based on this question here.
So, the dataframe needs to be transformed into a matrix where the values exist only on the main diagonal.
First, I save the Journal column for later in a variable.
new_cols = df["Journal"].values
Secondly, I wrote a function that takes a series (the Papers per year in journal column) and the previously saved new column labels as input parameters, and returns a dataframe where the values sit only on the main diagonal:
def values_into_main_diagonal(some_series, new_cols):
    """Puts the values of a series onto the main diagonal of a new df.
    some_series - any series given
    new_cols - the new column labels as list or numpy.ndarray"""
    x = [{i: some_series[i]} for i in range(len(some_series))]
    main_diag_df = pd.DataFrame(x)
    main_diag_df.columns = new_cols
    return main_diag_df
Thirdly, feeding the function the Papers per year in journal column and our saved new column names returns the following dataframe:
new_df:
1_journal 2_journal 3_journal 4_journal 5_journal
0 4 NaN NaN NaN NaN
1 NaN 4 NaN NaN NaN
2 NaN NaN 4 NaN NaN
3 NaN NaN NaN 5 NaN
4 NaN NaN NaN NaN 11
Finally, plotting new_df via new_df.plot(kind='bar', stacked=True) gives me what I want: the journals in different colors in the legend and NOT on the axis.
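Putting the pieces together, a minimal self-contained sketch of the whole approach (the journal names below are shortened placeholders, the counts are the ones from the table above, and values_into_main_diagonal is the function defined earlier):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Journal": ["Journal A", "Journal B", "Journal C", "Journal D", "Journal E"],
    "Papers per year in journal": [4, 4, 4, 5, 11],
})

new_cols = df["Journal"].values
new_df = values_into_main_diagonal(df["Papers per year in journal"], new_cols)

# one color and one legend entry per journal; the names appear in the legend, not on the axis
new_df.plot(kind='bar', stacked=True)
plt.show()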

Related

How to Fill NaNs in Column of Main Dataframe Based On Conditions Matching Secondary Dataframe of Values to Fill NaNs With Multiple Filler Values

I need to fill NA values in my main data frame based on a second dataframe I created with the groupby and mean functions. My original dataframe has about 1.5K NaNs I need to fill, so this needs to be reproducible at scale. I've created a fake dataframe that's a short, quick-and-dirty imitation of my data using a fake scenario; I can't share my real data with you.
My general idea is:
main_data[
    (main_data["Animal_Type"] == mean_data["Animal_Type"]) &
    (main_data["Cost_Type"] == mean_data["Cost_Type"])
] = main_data["Price"].fillna(mean_data["Price"])
Obviously, that doesn't work, but that's the general gist of how my logic is working. I found [this answer][1] but I can't seem to apply it properly to my problem. A lot of answers involve mask or assume my data is pretty small, with a single value to replace all my NaNs with. I have about 50 different means in my original dataset that are uniquely paired with an "Animal Type" per each "Cost Type". My original data frame is about 30K observations long and full of unique observations too. I can map, but that's only for a single column. I'm fairly new to coding, so a lot of the other answers were too complicated for me to understand and alter, too.
main_data.head(10)
Pet_ID Animal_Type Cost_Type Price
0 101 Goat Housing 6.0
1 102 Dog Housing 6.0
2 103 Horse Housing NaN
3 104 Horse Housing 5.0
4 105 Goat Housing 3.0
5 106 Dog Feeding 3.0
6 107 Cat Feeding 6.0
7 108 Horse Housing 6.0
8 109 Hamster Feeding 5.0
9 110 Horse Feeding 3.0
mean_data
Animal_Type Cost_Type Price
0 Cat Feeding 4.500000
1 Cat Housing 5.000000
2 Chicken Feeding 5.000000
3 Chicken Housing 4.500000
4 Dog Feeding 3.000000
5 Dog Housing 6.000000
6 Goat Feeding 5.000000
7 Goat Housing 5.000000
8 Hamster Feeding 5.250000
9 Hamster Housing 3.000000
10 Horse Feeding 3.500000
11 Horse Housing 5.666667
12 Rabit Feeding 3.000000
13 Rabit Housing 3.000000
My Reproducible code:
import random
import numpy as np
import pandas as pd

random.seed(10)
main_data = pd.DataFrame(columns = ["Pet_ID", "Animal_Type", "Cost_Type", "Price", "Cost"])
main_data["Pet_ID"] = pd.Series(list(range(101,150)))
main_data["Animal_Type"] = main_data.Animal_Type.apply(lambda x: random.choice(["Dog", "Cat", "Rabit", "Horse", "Goat", "Chicken", "Hamster"]))
main_data["Cost_Type"] = main_data.Animal_Type.apply(lambda x: random.choice(["Housing", "Feeding"]))
main_data["Price"] = main_data.Price.apply(lambda x: random.choice([3, 5, 6, np.nan]))
main_data["Cost"] = main_data.Cost.apply(lambda x: random.choice([2, 1, 3, np.nan]))
mean_data = main_data.groupby(["Animal_Type", "Cost_Type"])["Price"].mean().reset_index()
Edit: I have put together a couple of solutions, but I wouldn't say they're the most elegant or dependable, and probably not the most efficient either.
main_data = pd.merge(
main_data,
mean_data,
on = ["Animal_Type", "Cost_Type"],
how = "left"
)
main_data["Price_z"] = main_data["Price_x"].fillna(main_data["Price_y"])
Edit 2: I've added a "Cost" column with NaNs. I don't want this column touched by the Price fill, but I would like to apply the same methodology to it as we're using for the Price column.
[1]: Replace values based on multiple conditions with groupby mean in Pandas
I need to fill NA values in my main data frame based on a second dataframe I created by the groupby and mean functions.
You don't need that step. You can do this in one step by grouping into multiple dataframes, applying mean on each individual dataframe, and filling NA values within just that dataframe.
So, instead of creating the mean_data dataframe, do this:
def fill_by_mean(df):
    df["Price"] = df["Price"].fillna(df["Price"].mean())
    return df

main_data = main_data.groupby(["Animal_Type", "Cost_Type"]).apply(fill_by_mean)
Each individual call to fill_by_mean() sees a dataframe which looks like this:
Pet_ID Animal_Type Cost_Type Price
11 112 Rabit Feeding NaN
34 135 Rabit Feeding 3.0
38 139 Rabit Feeding 3.0
Then it gets the mean of the price column and fills NA values using that. Groupby then concatenates all of the individual dataframes back together.
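As an alternative sketch (not part of the original answer), the same group-wise fill can be written with groupby().transform(), which avoids the apply round-trip and extends naturally to the Cost column mentioned in Edit 2; this assumes main_data as built in the question:

filler = main_data.groupby(["Animal_Type", "Cost_Type"])["Price"].transform("mean")
main_data["Price"] = main_data["Price"].fillna(filler)
# the same two lines could be repeated for the Cost column if it should be filled the same way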

Reorganize Dataframe to Multi Index

I have the following dataframe after I appended the data from different sources of files:
Owed Due Date
Input NaN 51.83 08012019
Net NaN 35.91 08012019
Output NaN -49.02 08012019
Total -1.26 38.72 08012019
Input NaN 58.43 09012019
Net NaN 9.15 09012019
Output NaN -57.08 09012019
Total -3.48 10.50 09012019
Input NaN 66.50 10012019
Net NaN 9.64 10012019
Output NaN -64.70 10012019
Total -5.16 11.44 10012019
I have been trying to figure out how to reorganize this dataframe into a multi-index layout like this (desired-output screenshot not included here). I have tried melt and pivot, but with limited success in reshaping anything. I would appreciate some guidance!
P.S.: When using print(df), the date shows a two-digit day (e.g. 08). However, if I write this to a csv file, a single-digit day becomes 8 instead of 08. Hope someone can guide me on this too, thanks.
Here you go:
df.set_index('Date', append=True).unstack(0).dropna(axis=1)
set_index() moves Date to become an additional index level. Then unstack(0) moves the original index to become column names. Finally, drop the NaN columns and you have your desired result.
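A self-contained sketch reproducing the reshape on the data from the question (values copied from the post; keeping Date as a string rather than an integer also preserves the leading zero mentioned in the P.S.):

import numpy as np
import pandas as pd

labels = ['Input', 'Net', 'Output', 'Total'] * 3
df = pd.DataFrame({
    'Owed': [np.nan, np.nan, np.nan, -1.26,
             np.nan, np.nan, np.nan, -3.48,
             np.nan, np.nan, np.nan, -5.16],
    'Due':  [51.83, 35.91, -49.02, 38.72,
             58.43, 9.15, -57.08, 10.50,
             66.50, 9.64, -64.70, 11.44],
    'Date': ['08012019'] * 4 + ['09012019'] * 4 + ['10012019'] * 4,
}, index=labels)

result = df.set_index('Date', append=True).unstack(0).dropna(axis=1)
print(result)  # one row per Date, two-level columns such as ('Due', 'Input')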

Pandas - finding anomaly in paired column values in large Dataframe

I've been banging my head against a wall on this for a couple of hours, and would appreciate any help I could get.
I'm working with a large data set (over 270,000 rows), and am trying to find an anomaly within two columns that should have paired values.
From the snippet of output below, I'm looking at the Alcohol_Category_ID and Alcohol_Category_Name columns. The ID column has a numeric string value that should pair up 1:1 with a string descriptor in the Name column (e.g., "1031100.0" == "100 PROOF VODKA").
As you can see, both columns have the same count of non-null values. However, there are 72 unique IDs and only 71 unique Names. I take this to mean that one Name is incorrectly associated with two different IDs.
County Alcohol_Category_ID Alcohol_Category_Name Vendor_Number \
count 269843 270288 270288 270920
unique 99 72 71 116
top Polk 1031080.0 VODKA 80 PROOF 260
freq 49092 35366 35366 46825
first NaN NaN NaN NaN
last NaN NaN NaN NaN
mean NaN NaN NaN NaN
std NaN NaN NaN NaN
min NaN NaN NaN NaN
25% NaN NaN NaN NaN
50% NaN NaN NaN NaN
75% NaN NaN NaN NaN
max NaN NaN NaN NaN
My trouble is in actually isolating out where this duplication is occurring so that I can hopefully replace the erroneous ID with its correct value. I am having a dog of a time with this.
My dataframe is named i_a.
I've been trying to examine the pairings of values between these two columns with groupby and count statements like this:
i_a.groupby(["Alcohol_Category_Name", "Alcohol_Category_ID"]).Alcohol_Category_ID.count()
However, I'm not sure how to whittle it down from there. And there are too many pairings to make this easy to do visually.
Can someone recommend a way to isolate out the Alcohol_Category_Name associated with more than one Alcohol_Category_ID?
Thank you so much for your consideration!
EDIT: After considering the advice of Dmitry, I found the solution by continually paring down duplicates until I homed in on the value of interest, like so:
#Finding all unique pairings of Category IDs and Names
subset = i_a.drop_duplicates(["Alcohol_Category_Name", "Alcohol_Category_ID"])
#Now, determine which of the category names appears more than once (thus paired with more than one ID)
subset[subset["Alcohol_Category_Name"].duplicated()]
Thank you so much for your help. It seems really obvious in retrospect, but I could not figure it out for the life of me.
I think this snippet meets your needs:
> df = pd.DataFrame({'a':[1,2,3,1,2,3], 'b':[1,2,1,1,2,1]})
So df.a has 3 unique values mapping to 2 uniques in df.b.
> df.groupby('b')['a'].nunique()
b
1 2
2 1
That shows that df.b=1 maps to 2 uniques in a (and that df.b=2 maps to only 1).
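Applied to the dataframe from the question, a sketch along the same lines (it assumes i_a and the two column names exactly as posted):

# number of distinct IDs seen for each category name
ids_per_name = i_a.groupby("Alcohol_Category_Name")["Alcohol_Category_ID"].nunique()
print(ids_per_name[ids_per_name > 1])  # the name(s) paired with more than one ID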

Rolling standard deviation with Pandas, and NaNs

I have data that looks like this:
1472698113000000000 -28.84
1472698118000000000 -26.69
1472698163000000000 -27.65
1472698168000000000 -26.1
1472698238000000000 -27.33
1472698243000000000 -26.47
1472698248000000000 -25.24
1472698253000000000 -25.53
1472698283000000000 -27.3
...
This is a time series that grows. Each time it grows, I attempt to get the rolling standard deviation of the set, using pandas.rolling_std. Each time, the result includes NaNs, which I cannot use (I am trying to insert the result into InfluxDB, and it complains when it sees the NaNs.)
I've experimented with different window sizes. I am doing this on different series, of varying rates of growth and current sizes (some just a couple of measurements long, some hundreds or thousands).
Simply, I just want to have a rolling standard deviation in InfluxDB so that I can graph it and watch how the source data is changing over time, with respect to its mean. How can I overcome this NaN problem?
If you are doing something like
df.rolling(5).std()
and getting
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 5.032395e+10 1.037386
5 5.345559e+10 0.633024
6 4.263215e+10 0.967352
7 3.510698e+10 0.822879
8 1.767767e+10 0.971972
You can strip away the NaNs by using .dropna().
df.rolling(5).std().dropna():
4 5.032395e+10 1.037386
5 5.345559e+10 0.633024
6 4.263215e+10 0.967352
7 3.510698e+10 0.822879
8 1.767767e+10 0.971972
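An alternative sketch, not from the original answer: min_periods lets the rolling window produce values before it is completely full, which removes most of the leading NaNs (a sample standard deviation still needs at least two points, so the very first row is dropped here):

rolled = df.rolling(5, min_periods=2).std().dropna()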

Combining two columns from two dataframes; same indices but different lengths

Please be advised, I am a beginning programmer and a beginning python/pandas user. I'm a behavioral scientist and learning to use pandas to process and organize my data. As a result, some of this might seem completely obvious and it may seem like a question not worthy of the forum. Please have tolerance! To me, this is days of work, and I have indeed spent hours trying to figure out the answer to this question already. Thanks in advance for any help.
My data look like this. The "real" Actor and Recipient data are always 5-digit numbers, and the "Behavior" data are always letter codes. My problem is that I also use this format for special lines, denoted by markers like "date" or "s" in the Actor column. These markers indicate that the "Behavior" column holds this special type of data, and not actual Behavior data. So, I want to replace the markers in the Actor column with NaN values, and grab the special data from the behavior column to put in another column (in this example, the empty Activity column).
follow Activity Actor Behavior Recipient1
0 1 NaN date 2.1.3.2012 NaN
1 1 NaN s ss.hx NaN
2 1 NaN 50505 vo 51608
3 1 NaN 51608 vr 50505
4 1 NaN s ss.he NaN
So far, I have written some code in pandas to select out the "s" lines into a new dataframe:
def get_act_line(group):
    return group.ix[(group.Actor == 's')]

result = trimdata.groupby('follow').apply(get_act_line)
I've copied over the Behavior column in this dataframe to the Activity column, and replaced the Actor and Behavior values with NaN:
result.Activity = result.Behavior
result.Behavior = np.nan
result.Actor = np.nan
result.head()
So my new dataframe looks like this:
follow follow Activity Actor Behavior Recipient1
1 2 1 ss.hx NaN NaN NaN
34 1 hf.xa NaN NaN f.53702
74 1 hf.fe NaN NaN NaN
10 1287 10 ss.hf NaN NaN db
1335 10 fe NaN NaN db
What I would like to do now is to combine this dataframe with the original, replacing all of the values in these selected rows, but maintaining values for the other rows in the original dataframe.
This may seem like a simple question with an obvious solution, or perhaps I have gone about it all wrong to begin with!
I've worked through Wes McKinney's book, I've read the documentation on different types of merges, mapping, joining, transformations, concatenations, etc. I have browsed the forums and have not found an answer that helps me to figure this out. Your help will be very much appreciated.
One way you can do this (though there may be more optimal or elegant ways) is:
mask = (df['Actor']=='s')
df['Activity'] = df[mask]['Behavior']
df.ix[mask, 'Behavior'] = np.nan
where df is equivalent to your results dataframe. This should return (my column orders are slightly different):
Activity Actor Behavior Recipient1 follow
0 NaN date 2013-04-01 00:00:00 NaN 1
1 ss.hx NaN ss.hx NaN 1
2 NaN 50505 vo 51608 1
3 NaN 51608 vr 50505 1
4 ss.he NaN ss.hx NaN 1
References:
Explanation of df.ix from another SO post.
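On newer pandas versions .ix has been removed; a rough equivalent using .loc, which also blanks the Actor markers as the question asks (a sketch, not tested against the original data):

import numpy as np

mask = df['Actor'] == 's'
df.loc[mask, 'Activity'] = df.loc[mask, 'Behavior']   # copy the special data into Activity
df.loc[mask, ['Actor', 'Behavior']] = np.nan          # blank out the marker rows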
