Pandas - finding anomaly in paired column values in large Dataframe

Pandas - finding anomaly in paired column values in large Dataframe - python

I've been banging my head against a wall on this for a couple of hours, and would appreciate any help I could get.
I'm working with a large data set (over 270,000 rows), and am trying to find an anomaly within two columns that should have paired values.
From the snippet of output below - I'm looking at the Alcohol_Category_ID and Alcohol_Category_Name columns. The ID column has a numeric string value that should pair up 1:1 with a string descriptor in the Name column. (e.g., "1031100.0" == "100 PROOF VODKA".
As you can see, both columns have the same count of non-null values. However, there are 72 unique IDs and only 71 unique Names. I take this to mean that one Name is incorrectly associated with two different IDs.
County Alcohol_Category_ID Alcohol_Category_Name Vendor_Number \
count 269843 270288 270288 270920
unique 99 72 71 116
top Polk 1031080.0 VODKA 80 PROOF 260
freq 49092 35366 35366 46825
first NaN NaN NaN NaN
last NaN NaN NaN NaN
mean NaN NaN NaN NaN
std NaN NaN NaN NaN
min NaN NaN NaN NaN
25% NaN NaN NaN NaN
50% NaN NaN NaN NaN
75% NaN NaN NaN NaN
max NaN NaN NaN NaN
My trouble is in actually isolating out where this duplication is occurring so that I can hopefully replace the erroneous ID with its correct value. I am having a dog of a time with this.
My dataframe is named i_a.
I've been trying to examine the pairings of values between these two columns with groupby and count statements like this:
i_a.groupby(["Alcohol_Category_Name", "Alcohol_Category_ID"]).Alcohol_Category_ID.count()
However, I'm not sure how to whittle it down from there. And there are too many pairings to make this easy to do visually.
Can someone recommend a way to isolate out the Alcohol_Category_Name associated with more than one Alcohol_Category_ID?
Thank you so much for your consideration!
EDIT: After considering the advice of Dmitry, I found the solution by continually pairing down duplicates until I honed in on the value of interest, like so:
#Finding all unique pairings of Category IDs and Names
subset = i_a.drop_duplicates(["Alcohol_Category_Name", "Alcohol_Category_ID"])
#Now, determine which of the category names appears more than once (thus paired with more than one ID)
subset[subset["Alcohol_Category_Name"].duplicated()]
Thank you so much for your help. It seems really obvious in retrospect, but I could not figure it out for the life of me.

I think this snippet meets your needs:
> df = pd.DataFrame({'a':[1,2,3,1,2,3], 'b':[1,2,1,1,2,1]})
So df.a has 3 unique values mapping to 2 uniques in df.b.
> df.groupby('b')['a'].nunique()
b
1 2
2 1
That shows that df.b=1 maps to 2 uniques in a (and that df.b=2 maps to only 1).

Related

Pandas - instead of dropping rows with nan values I want to keep those rows and drop the others in a particular column

I have several column in my df, one is error. If that column has rows with a value (this one always has 99 as the error message value) I want to remove those rows and keep the ones that are nan.
df:
error
date
count
99
nan
nan
nan
2022-02-01
234
nan
2022-02-02
34643
99
nan
nan
nan
2022-03-02
23425
99
nan
nan
I know how to drop if nan, but I want to do the opposite for the error column

A more general solution than proposed by enke is:
df = df[df.error.isna()]
This way you retain only rows with NaN in error column,
regardless of the error value in original DataFrame.

Pandas's dataframe becoming empty after removing empty rows

I have the following data set:
Survived Not Survived
0 NaN 22.0
1 38.0 NaN
2 26.0 NaN
3 35.0 NaN
4 NaN 35.0
.. ... ...
886 NaN 27.0
887 19.0 NaN
888 NaN NaN
889 26.0 NaN
890 NaN 32.0
I want to remove all the rows which contains NaN so i wrote the following code(the dataset's name is titanic_feature_data):
titanic_feature_data = titanic_feature_data.dropna()
And when i try to display the new dataset i get the following result:
Empty DataFrame
Columns: [Survived, Not Survived]
Index: []
What's the problem ? and how can i fix it ?

By using titanic_feature_data.dropna(), you are removing all rows with at least one missing value. From the data you printed in your question, it looks like all rows contains at least one missing value. Is it possible that simply all your rows contains at least one missing value? If so, it makes total sense that your dataframe is empty after dropna(), right?
Having said that, perhaps you are looking to drop rows that have a missing value for one particular column, for example column Not Survived. Then you could use:
titanic_feature_data.dropna(subset='Not Survived')
Also, if you are confused about why certain rows are dropped, I recommend checking for missing values explicitly first, without dropping them. That way you can see which instances would have been dropped:
incomplete_rows = titanic_feature_data.isnull().any(axis=1)
incomplete_rows is a boolean series, which indicates whether a row contains any missing value or not. You can use this series to subset your dataframe and see which rows contain missing values (presumably all of them, given your example)
titanic_feature_data.loc[incomplete_rows, :]

Plotting by Index with different labels

I am using pandas and matplotlib to generate some charts.
My DataFrame:
Journal Papers per year in journal
0 Information and Software Technology 4
1 2012 International Conference on Cyber Securit... 4
2 Journal of Network and Computer Applications 4
3 IEEE Security & Privacy 5
4 Computers & Security 11
My Dataframe is a result of a groupby out of a larger dataframe. What I want now, is a simple barchart, which in theory works fine with a df_groupby_time.plot(kind='bar'). However, I get this:
What I want are different colored bars, and a legend which states which color corresponds to which paper.
Playing around with relabeling hasn't gotten me anywhere so far. And I have no idea anymore on how to achieve what I want.
EDIT:
Resetting the index and plotting isn't what I want:
df_groupby_time.set_index("Journals").plot(kind='bar')

I found a solution, based on this question here.
SO, the dataframe needs to be transformed into a matrix, were the values exist only on the main diagonal.
First, I save the column journals for later in a variable.
new_cols = df["Journal"].values
Secondly, I wrote a function, that takes a series, the column Papers per year in Journal, and the previously saved new columns, as input parameters, and returns a dataframe, where the values are only on the main diagonal.:
def values_into_main_diagonal(some_series, new_cols):
"""Puts the values of a series onto the main diagonal of a new df.
some_series - any series given
new_cols - the new column labels as list or numpy.ndarray"""
x = [{i: some_series[i]} for i in range(len(some_series))]
main_diag_df = pd.DataFrame(x)
main_diag_df.columns = new_cols
return main_diag_df
Thirdly, feeding the function the Papers per year in Journal column and our saved new columns names, returns the following dataframe:
new_df:
1_journal 2_journal 3_journal 4_journal 5_journal
0 4 NaN NaN NaN NaN
1 NaN 4 NaN NaN NaN
2 NaN NaN 4 NaN NaN
3 NaN NaN NaN 5 NaN
4 NaN NaN NaN NaN 11
Finally plotting the new_df via new_df.plot(kind='bar', stacked=True) gives me what I want. The Journals in different colors as the legend and NOT on the axis.:

Can I seperate the values of a dictionary into multiple columns and still be able to plot them?

I want to seperate the values of a dictionary into multiple columns and still be able to plot them. At this moment all the values are in one column.
So concretely I would like to split all the different values in the list of values. And use the amount of values in the longest list as the amount of columns. So for all the shorter lists I would like to fill in the gaps with something like 'NA' so I can still plot it in seaborn.
This is the dictionary that I used:
dictio = {'seq_7009': [6236.9764, 6367.049999999999], 'seq_418': [3716.3642000000004, 3796.4124000000006], 'seq_9143_unamb': [4631.958999999999], 'seq_2888': [5219.3359, 5365.4089], 'seq_1101': [4287.7417, 4422.8254], 'seq_107': [5825.695099999999, 5972.8073], 'seq_6946': [5179.3118, 5364.420900000001], 'seq_6162': [5531.503199999999, 5645.577399999999], 'seq_504': [4556.920899999999, 4631.959], 'seq_3535': [3396.1715999999997, 3446.1969999999997, 5655.896546], 'seq_4077': [4551.9108, 4754.0073,4565.987654,5668.9999976], 'seq_1626_unamb': [3724.3894999999998]}
This is the code for the dataframe:
df = pd.Series(dictio)
test=pd.DataFrame({'ID':df.index, 'Value':df.values})
seq_107 [5825.695099999999, 5972.8073]
seq_1101 [4287.7417, 4422.8254]
seq_1626_unamb [3724.3894999999998]
seq_2888 [5219.3359, 5365.4089]
seq_3535 [3396.1715999999997, 3446.1969999999997, 5655....
seq_4077 [4551.9108, 4754.0073, 4565.987654, 5668.9999976]
seq_418 [3716.3642000000004, 3796.4124000000006]
seq_504 [4556.920899999999, 4631.959]
seq_6162 [5531.503199999999, 5645.577399999999]
seq_6946 [5179.3118, 5364.420900000001]
seq_7009 [6236.9764, 6367.049999999999]
seq_9143_unamb [4631.958999999999]
Thanks in advance for the help!

Convert the Value column to a list of lists, and reload it into a new dataframe. Afterwards, call plot. Something like this -
df = pd.DataFrame(test.Value.tolist(), index=test.ID)
df
0 1 2 3
ID
seq_107 5825.6951 5972.8073 NaN NaN
seq_1101 4287.7417 4422.8254 NaN NaN
seq_1626_unamb 3724.3895 NaN NaN NaN
seq_2888 5219.3359 5365.4089 NaN NaN
seq_3535 3396.1716 3446.1970 5655.896546 NaN
seq_4077 4551.9108 4754.0073 4565.987654 5668.999998
seq_418 3716.3642 3796.4124 NaN NaN
seq_504 4556.9209 4631.9590 NaN NaN
seq_6162 5531.5032 5645.5774 NaN NaN
seq_6946 5179.3118 5364.4209 NaN NaN
seq_7009 6236.9764 6367.0500 NaN NaN
seq_9143_unamb 4631.9590 NaN NaN NaN
df.plot()

Combining two columns from two dataframes; same indices but different lengths

Please be advised, I am a beginning programmer and a beginning python/pandas user. I'm a behavioral scientist and learning to use pandas to process and organize my data. As a result, some of this might seem completely obvious and it may seem like a question not worthy of the forum. Please have tolerance! To me, this is days of work, and I have indeed spent hours trying to figure out the answer to this question already. Thanks in advance for any help.
My data look like this. The "real" Actor and Recipient data are always 5-digit numbers, and the "Behavior" data are always letter codes. My problem is that I also use this format for special lines, denoted by markers like "date" or "s" in the Actor column. These markers indicate that the "Behavior" column holds this special type of data, and not actual Behavior data. So, I want to replace the markers in the Actor column with NaN values, and grab the special data from the behavior column to put in another column (in this example, the empty Activity column).
follow Activity Actor Behavior Recipient1
0 1 NaN date 2.1.3.2012 NaN
1 1 NaN s ss.hx NaN
2 1 NaN 50505 vo 51608
3 1 NaN 51608 vr 50505
4 1 NaN s ss.he NaN
So far, I have written some code in pandas to select out the "s" lines into a new dataframe:
def get_act_line(group):
return group.ix[(group.Actor == 's')]
result = trimdata.groupby('follow').apply(get_act_line)
I've copied over the Behavior column in this dataframe to the Activity column, and replaced the Actor and Behavior values with NaN:
result.Activity = result.Behavior
result.Behavior = np.nan
result.Actor = np.nan
result.head()
So my new dataframe looks like this:
follow follow Activity Actor Behavior Recipient1
1 2 1 ss.hx NaN NaN NaN
34 1 hf.xa NaN NaN f.53702
74 1 hf.fe NaN NaN NaN
10 1287 10 ss.hf NaN NaN db
1335 10 fe NaN NaN db
What I would like to do now is to combine this dataframe with the original, replacing all of the values in these selected rows, but maintaining values for the other rows in the original dataframe.
This may seem like a simple question with an obvious solution, or perhaps I have gone about it all wrong to begin with!
I've worked through Wes McKinney's book, I've read the documentation on different types of merges, mapping, joining, transformations, concatenations, etc. I have browsed the forums and have not found an answer that helps me to figure this out. Your help will be very much appreciated.

One way you can do this (though there may be more optimal or elegant ways) is:
mask = (df['Actor']=='s')
df['Activity'] = df[mask]['Behavior']
df.ix[mask, 'Behavior'] = np.nan
where df is equivalent to your results dataframe. This should return (my column orders are slightly different):
Activity Actor Behavior Recipient1 follow
0 NaN date 2013-04-01 00:00:00 NaN 1
1 ss.hx NaN ss.hx NaN 1
2 NaN 50505 vo 51608 1
3 NaN 51608 vr 50505 1
4 ss.he NaN ss.hx NaN 1
References:
Explanation of df.ix from other STO post.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.