Please be advised, I am a beginning programmer and a beginning python/pandas user. I'm a behavioral scientist and learning to use pandas to process and organize my data. As a result, some of this might seem completely obvious and it may seem like a question not worthy of the forum. Please have tolerance! To me, this is days of work, and I have indeed spent hours trying to figure out the answer to this question already. Thanks in advance for any help.
My data look like this. The "real" Actor and Recipient data are always 5-digit numbers, and the "Behavior" data are always letter codes. My problem is that I also use this format for special lines, denoted by markers like "date" or "s" in the Actor column. These markers indicate that the "Behavior" column holds this special type of data, and not actual Behavior data. So, I want to replace the markers in the Actor column with NaN values, and grab the special data from the behavior column to put in another column (in this example, the empty Activity column).
follow Activity Actor Behavior Recipient1
0 1 NaN date 2.1.3.2012 NaN
1 1 NaN s ss.hx NaN
2 1 NaN 50505 vo 51608
3 1 NaN 51608 vr 50505
4 1 NaN s ss.he NaN
So far, I have written some code in pandas to select out the "s" lines into a new dataframe:
def get_act_line(group):
    return group.loc[group.Actor == 's']

result = trimdata.groupby('follow').apply(get_act_line)
I've copied over the Behavior column in this dataframe to the Activity column, and replaced the Actor and Behavior values with NaN:
result.Activity = result.Behavior
result.Behavior = np.nan
result.Actor = np.nan
result.head()
So my new dataframe looks like this:
             follow Activity Actor Behavior Recipient1
follow
1      2          1    ss.hx   NaN      NaN        NaN
       34         1    hf.xa   NaN      NaN    f.53702
       74         1    hf.fe   NaN      NaN        NaN
10     1287      10    ss.hf   NaN      NaN         db
       1335      10       fe   NaN      NaN         db
What I would like to do now is to combine this dataframe with the original, replacing all of the values in these selected rows, but maintaining values for the other rows in the original dataframe.
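For the combining step, one pattern (sketched here with a made-up miniature frame, not the real data) is to write the edited sub-dataframe back into the original by index with .loc, which replaces just those rows and leaves every other row intact:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the original frame
df = pd.DataFrame({
    'Actor': ['s', '50505', 's'],
    'Behavior': ['ss.hx', 'vo', 'ss.he'],
    'Activity': [np.nan, np.nan, np.nan],
})

# Sub-frame of the special rows; the index is preserved from df
result = df[df['Actor'] == 's'].copy()
result['Activity'] = result['Behavior']
result[['Actor', 'Behavior']] = np.nan

# Assigning by the sub-frame's index overwrites only those rows;
# rows not in result.index are left untouched
df.loc[result.index] = result
```

Because the assignment aligns on both the index and the column labels, the column order of the two frames does not have to match.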
This may seem like a simple question with an obvious solution, or perhaps I have gone about it all wrong to begin with!
I've worked through Wes McKinney's book, I've read the documentation on different types of merges, mapping, joining, transformations, concatenations, etc. I have browsed the forums and have not found an answer that helps me to figure this out. Your help will be very much appreciated.
One way you can do this (though there may be more optimal or elegant ways) is:
mask = df['Actor'] == 's'
df.loc[mask, 'Activity'] = df.loc[mask, 'Behavior']
df.loc[mask, 'Behavior'] = np.nan
df.loc[mask, 'Actor'] = np.nan
where df is equivalent to your original dataframe. This should return (my column order is slightly different):
  Activity  Actor    Behavior Recipient1  follow
0      NaN   date  2.1.3.2012        NaN       1
1    ss.hx    NaN         NaN        NaN       1
2      NaN  50505          vo      51608       1
3      NaN  51608          vr      50505       1
4    ss.he    NaN         NaN        NaN       1
References:
Explanation of df.ix from other STO post.
Related
What I am looking for, but can't seem to get right, is essentially this:
I'll read in a csv file to build a df, with the output being a table that I can manipulate or edit.
Ultimately, I'm looking for a method to rename the placeholder column headers with the first string in each column.
Currently:
import pandas as pd
df = pd.read_csv("NameOfCSV.csv")
df
Example Output:
   m1_name  m1_delta  m2_name  m2_delta  m3_name  m3_delta  ...  m10_name
0      CO2         1     NMHC         2      NaN         1  ...       NaN
1      CO2         2     NMHC         1      NaN         2  ...       NaN
2      CO2         1     NMHC         2      NaN         1  ...       NaN
3      CO2         2     NMHC         1      CH4         2  ...       NaN
What I am trying to understand is how to create a generic program that will grab the gas name within the column and rename the "m*_name" header, and any other corresponding "m*_blah" header, with the gas name of that "m*_name" column.
Example Desired Output where the headers reflect the gas name:
   CO2_name  CO2_delta  NMHC_name  NMHC_delta  NO2_name  NO2_delta  ...  m10_name
0       CO2          1       NMHC           2       NaN          1  ...       NaN
1       CO2          2       NMHC           1       NaN          2  ...       NaN
2       CO2          1       NMHC           2       NO2          1  ...       NaN
I've tried playing around with several functions (primarily .rename()), tried to find examples of similar problems, and dug through the documentation, but I have been unsuccessful in coming up with a solution. Beyond the couple of column headers in this example, there are about a dozen per gas name, so I was also trying to build a loop that finds the headers with the corresponding m-number ("m*_otherHeader") and renames those as well. The datasets I'm working with are dynamic, and the positions of these columns in the csv change as well. Some help would be appreciated, and/or pointing me toward some examples or the proper documentation to read through would be great!
df = df.drop_duplicates().set_axis(
    [
        f"{df[f'{x}_name'].mode()[0]}_{y}"
        if df[f'{x}_name'].mode()[0] != 0
        else f'{x}_{y}'
        for x, y in df.columns.str.split('_', n=1)
    ],
    axis=1,
)
Output:
>>> df
CO2_name CO2_delta NMHC_name NMHC_delta CH4_name CH4_delta m10_name
0 CO2 1 NMHC 2 CH4 1 0
1 CO2 2 NMHC 1 CH4 2 0
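An alternative, more explicit take (a sketch using a made-up miniature of the frame, and assuming each m*_name column holds the gas name as its first non-null value) is to build a rename mapping and hand it to df.rename:

```python
import pandas as pd

# Hypothetical frame in the question's layout
df = pd.DataFrame({
    'm1_name': ['CO2', 'CO2'],
    'm1_delta': [1, 2],
    'm2_name': ['NMHC', 'NMHC'],
    'm2_delta': [2, 1],
})

mapping = {}
for col in df.columns:
    prefix, _, suffix = col.partition('_')     # 'm1_delta' -> ('m1', '_', 'delta')
    name_col = f'{prefix}_name'
    if name_col in df.columns:
        gases = df[name_col].dropna()
        if not gases.empty:
            # First non-null value in the m*_name column gives the gas name
            mapping[col] = f'{gases.iloc[0]}_{suffix}'

df = df.rename(columns=mapping)
```

Because the mapping is keyed by the existing header strings, it does not matter where the columns sit in the csv, which matches the "dynamic positions" constraint in the question.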
I have the following data set:
Survived Not Survived
0 NaN 22.0
1 38.0 NaN
2 26.0 NaN
3 35.0 NaN
4 NaN 35.0
.. ... ...
886 NaN 27.0
887 19.0 NaN
888 NaN NaN
889 26.0 NaN
890 NaN 32.0
I want to remove all the rows which contain NaN, so I wrote the following code (the dataset's name is titanic_feature_data):
titanic_feature_data = titanic_feature_data.dropna()
And when I try to display the new dataset I get the following result:
Empty DataFrame
Columns: [Survived, Not Survived]
Index: []
What's the problem, and how can I fix it?
By using titanic_feature_data.dropna(), you are removing all rows with at least one missing value. From the data you printed in your question, it looks like every row contains at least one missing value. If so, it makes total sense that your dataframe is empty after dropna(), right?
Having said that, perhaps you are looking to drop rows that have a missing value for one particular column, for example column Not Survived. Then you could use:
titanic_feature_data.dropna(subset=['Not Survived'])
Also, if you are confused about why certain rows are dropped, I recommend checking for missing values explicitly first, without dropping them. That way you can see which instances would have been dropped:
incomplete_rows = titanic_feature_data.isnull().any(axis=1)
incomplete_rows is a boolean series which indicates whether a row contains any missing value. You can use this series to subset your dataframe and see which rows contain missing values (presumably all of them, given your example):
titanic_feature_data.loc[incomplete_rows, :]
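As a side note, if the intent were to drop only the rows that are missing in every column (like row 888 in the printout), dropna(how='all') does that; a minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the question's structure (made-up values)
titanic_feature_data = pd.DataFrame({
    'Survived': [np.nan, 38.0, np.nan],
    'Not Survived': [22.0, np.nan, np.nan],
})

# how='all' drops only rows where *every* column is missing,
# so rows with at least one recorded value survive the filter
kept = titanic_feature_data.dropna(how='all')
```

Here only the last row (NaN in both columns) is removed; the first two rows, each with one recorded value, are kept.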
I am using pandas and matplotlib to generate some charts.
My DataFrame:
Journal Papers per year in journal
0 Information and Software Technology 4
1 2012 International Conference on Cyber Securit... 4
2 Journal of Network and Computer Applications 4
3 IEEE Security & Privacy 5
4 Computers & Security 11
My Dataframe is a result of a groupby out of a larger dataframe. What I want now, is a simple barchart, which in theory works fine with a df_groupby_time.plot(kind='bar'). However, I get this:
What I want are different colored bars, and a legend which states which color corresponds to which paper.
Playing around with relabeling hasn't gotten me anywhere so far. And I have no idea anymore on how to achieve what I want.
EDIT:
Resetting the index and plotting isn't what I want:
df_groupby_time.set_index("Journals").plot(kind='bar')
I found a solution, based on this question here.
So, the dataframe needs to be transformed into a matrix, where the values exist only on the main diagonal.
First, I save the column journals for later in a variable.
new_cols = df["Journal"].values
Secondly, I wrote a function that takes a series (the column Papers per year in journal) and the previously saved new columns as input parameters, and returns a dataframe where the values sit only on the main diagonal:
def values_into_main_diagonal(some_series, new_cols):
    """Puts the values of a series onto the main diagonal of a new df.

    some_series - any series given
    new_cols - the new column labels as list or numpy.ndarray
    """
    x = [{i: some_series.iloc[i]} for i in range(len(some_series))]
    main_diag_df = pd.DataFrame(x)
    main_diag_df.columns = new_cols
    return main_diag_df
Thirdly, feeding the function the Papers per year in journal column and our saved new column names returns the following dataframe:
new_df:
1_journal 2_journal 3_journal 4_journal 5_journal
0 4 NaN NaN NaN NaN
1 NaN 4 NaN NaN NaN
2 NaN NaN 4 NaN NaN
3 NaN NaN NaN 5 NaN
4 NaN NaN NaN NaN 11
Finally, plotting the new_df via new_df.plot(kind='bar', stacked=True) gives me what I want: the journals in different colors in the legend and NOT on the axis.
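For comparison, the same effect can be had without the diagonal-matrix detour by calling matplotlib's bar directly, one bar per journal; a sketch with made-up numbers (plt.cm.tab10 is just one example palette):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical frame matching the question's groupby result
df = pd.DataFrame({
    'Journal': ['Information and Software Technology',
                'IEEE Security & Privacy',
                'Computers & Security'],
    'Papers per year in journal': [4, 5, 11],
})

fig, ax = plt.subplots()
colors = plt.cm.tab10.colors  # one distinct color per journal
for i, (journal, papers) in enumerate(
        zip(df['Journal'], df['Papers per year in journal'])):
    # Each bar is its own artist, so each gets its own color and legend entry
    ax.bar(i, papers, color=colors[i % len(colors)], label=journal)

ax.set_xticks([])  # journal names live in the legend, not on the axis
ax.legend()
fig.savefig('journals.png')
```

Each ax.bar call draws a single bar with its own label, which is what produces the per-journal legend without reshaping the dataframe.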
I have the following dataframe after I appended the data from different sources of files:
Owed Due Date
Input NaN 51.83 08012019
Net NaN 35.91 08012019
Output NaN -49.02 08012019
Total -1.26 38.72 08012019
Input NaN 58.43 09012019
Net NaN 9.15 09012019
Output NaN -57.08 09012019
Total -3.48 10.50 09012019
Input NaN 66.50 10012019
Net NaN 9.64 10012019
Output NaN -64.70 10012019
Total -5.16 11.44 10012019
I have been trying to figure out how to reorganize this dataframe to become multi index like this:
I have tried to use melt and pivot, but with limited success in even reshaping anything. I would appreciate some guidance!
P.S.: When using print(df), the date shows DD for the day (e.g. 08). However, if I write this to a csv file, it becomes 8 instead of 08 for single-digit days. Hope someone can guide me on this too, thanks.
Here you go:
df.set_index('Date', append=True).unstack(0).dropna(axis=1)
set_index() moves Date to become an additional index column. Then unstack(0) moves the original index to become column names. Finally, drop the NaN columns and you have your desired result.
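A minimal, self-contained version of that chain (using an abbreviated, made-up cut of the frame with just the Input and Total rows for two dates):

```python
import numpy as np
import pandas as pd

# Abbreviated reproduction of the question's frame
df = pd.DataFrame(
    {
        'Owed': [np.nan, -1.26, np.nan, -3.48],
        'Due': [51.83, 38.72, 58.43, 10.50],
        'Date': ['08012019', '08012019', '09012019', '09012019'],
    },
    index=['Input', 'Total', 'Input', 'Total'],
)

# Date joins the index, then the original labels pivot into a
# second column level; all-NaN columns (e.g. Owed/Input) are dropped
wide = df.set_index('Date', append=True).unstack(0).dropna(axis=1)
```

After this, wide is indexed by Date with MultiIndex columns such as ('Due', 'Input'), ('Due', 'Total') and ('Owed', 'Total').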
I've been banging my head against a wall on this for a couple of hours, and would appreciate any help I could get.
I'm working with a large data set (over 270,000 rows), and am trying to find an anomaly within two columns that should have paired values.
From the snippet of output below, I'm looking at the Alcohol_Category_ID and Alcohol_Category_Name columns. The ID column has a numeric string value that should pair up 1:1 with a string descriptor in the Name column (e.g., "1031100.0" == "100 PROOF VODKA").
As you can see, both columns have the same count of non-null values. However, there are 72 unique IDs and only 71 unique Names. I take this to mean that one Name is incorrectly associated with two different IDs.
County Alcohol_Category_ID Alcohol_Category_Name Vendor_Number \
count 269843 270288 270288 270920
unique 99 72 71 116
top Polk 1031080.0 VODKA 80 PROOF 260
freq 49092 35366 35366 46825
first NaN NaN NaN NaN
last NaN NaN NaN NaN
mean NaN NaN NaN NaN
std NaN NaN NaN NaN
min NaN NaN NaN NaN
25% NaN NaN NaN NaN
50% NaN NaN NaN NaN
75% NaN NaN NaN NaN
max NaN NaN NaN NaN
My trouble is in actually isolating out where this duplication is occurring so that I can hopefully replace the erroneous ID with its correct value. I am having a dog of a time with this.
My dataframe is named i_a.
I've been trying to examine the pairings of values between these two columns with groupby and count statements like this:
i_a.groupby(["Alcohol_Category_Name", "Alcohol_Category_ID"]).Alcohol_Category_ID.count()
However, I'm not sure how to whittle it down from there. And there are too many pairings to make this easy to do visually.
Can someone recommend a way to isolate out the Alcohol_Category_Name associated with more than one Alcohol_Category_ID?
Thank you so much for your consideration!
EDIT: After considering the advice of Dmitry, I found the solution by continually paring down duplicates until I honed in on the value of interest, like so:
#Finding all unique pairings of Category IDs and Names
subset = i_a.drop_duplicates(["Alcohol_Category_Name", "Alcohol_Category_ID"])
#Now, determine which of the category names appears more than once (thus paired with more than one ID)
subset[subset["Alcohol_Category_Name"].duplicated()]
Thank you so much for your help. It seems really obvious in retrospect, but I could not figure it out for the life of me.
I think this snippet meets your needs:
> df = pd.DataFrame({'a':[1,2,3,1,2,3], 'b':[1,2,1,1,2,1]})
So df.a has 3 unique values mapping to 2 uniques in df.b.
> df.groupby('b')['a'].nunique()
b
1 2
2 1
That shows that df.b=1 maps to 2 uniques in a (and that df.b=2 maps to only 1).
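Applied to the question's columns, the same idea isolates the offending name directly (sketched here with hypothetical IDs and names, since the real data isn't shown):

```python
import pandas as pd

# Toy frame with one Name deliberately mapped to two IDs (hypothetical values)
i_a = pd.DataFrame({
    'Alcohol_Category_ID': ['1031100.0', '1031100.0', '1031080.0', '1031090.0'],
    'Alcohol_Category_Name': ['100 PROOF VODKA', '100 PROOF VODKA',
                              'VODKA 80 PROOF', 'VODKA 80 PROOF'],
})

# Count distinct IDs per name; any name with more than one is the anomaly
ids_per_name = i_a.groupby('Alcohol_Category_Name')['Alcohol_Category_ID'].nunique()
anomalies = ids_per_name[ids_per_name > 1]
```

With 270,000 rows this avoids any visual scanning: anomalies holds exactly the names paired with more than one ID, and their ID counts.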