Pivoting in Python (pandas)

Quick question:
I have the following situation (table):
Imported data frame
Now what I would like to achieve is the following (or something along those lines; it does not have to be exactly that):
Goal
I do not want the following columns, so I drop them:
data.drop(data.columns[[0,5,6]], axis=1,inplace=True)
I assumed the following line of code could solve it, but I am missing something:
pivoted = data.pivot(index=["Intentional homicides and other crimes","Unnamed: 2"],columns='Unnamed: 3', values='Unnamed: 4')
produces
ValueError: Length of passed values is 3395, index implies 2
The difference from the linked question is that I do not want any aggregation functions; I just want to leave the values as they are.
Data can be found at: Data

The problem with the method pandas.DataFrame.pivot is that it does not handle duplicate values in the index. One way to solve this is to use the function pandas.pivot_table instead.
df = pd.read_csv('Crimes_UN_data.csv', skiprows=[0], encoding='latin1')
cols = list(df.columns)
cols[1] = 'Region'
df.columns = cols
pivoted = pd.pivot_table(df, values='Value', index=['Region', 'Year'], columns='Series', aggfunc=sum)
It should not actually sum anything, despite the aggfunc argument; without that argument, though, the call was throwing pandas.core.base.DataError: No numeric types to aggregate.
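For what it's worth, here is a minimal, self-contained sketch of the same idea with made-up numbers instead of the UN CSV; the column names mirror the renamed frame above:
import pandas as pd

# Toy data shaped like the UN file: one Value per (Region, Year, Series) combination,
# with repeated (Region, Year) pairs, which is what plain DataFrame.pivot dislikes
df = pd.DataFrame({
    "Region": ["Africa", "Africa", "Asia", "Asia"],
    "Year": [2010, 2010, 2010, 2011],
    "Series": ["Homicide rate", "Theft rate", "Homicide rate", "Homicide rate"],
    "Value": [12.5, 30.1, 2.8, 2.7],
})

# pivot_table tolerates the repeated (Region, Year) pairs; since each cell ends up
# with a single value, the aggfunc has nothing to combine. aggfunc="first" also
# works if you want to make "leave the values as is" explicit.
pivoted = pd.pivot_table(df, values="Value", index=["Region", "Year"],
                         columns="Series", aggfunc="sum")
print(pivoted)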

Related

Pandas, groupby by 2 non numeric columns

I have a dataframe with several columns, but I only need to use two non-numeric columns:
one is 'hashed_id', the other is 'event name' with 10 unique names.
I'm trying to group by the two non-numeric columns, so aggregation functions would not work here.
My solution is:
df_events = df.groupby('subscription_hash', 'event_name')['event_name']
df_events = pd.DataFrame(df_events, columns=["subscription_hash", 'event_name'])
I'm trying to get a format like:
subscription_hash event_name
0 (0000379144f24717a8d124d798008a0e672) AddToQueue
1 (0000379144f24717a8d124d798008a0e672) page_view
but instead getting:
subscription_hash event_name
0 (0000379144f24717a8d124d798008a0e672) 832433 AddToQueue
1 (0000379144f24717a8d124d798008a0e672) 245400 page_view
Please advise
Is your data clean? Where are those undesired numbers coming from?
From the docs, I see groupby being used by passing the column names as a list, together with an aggregate function:
df.groupby(['col1','col2']).mean()
Since your values are not numeric, maybe try the pivot method:
df.pivot(columns=['col1','col2'])
So I'd try first putting [] around your column names, then try the pivot.
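A small sketch of the groupby-with-a-list suggestion, on hypothetical data using the two column names from the posted code; a counting aggregation such as size() gives one row per (subscription_hash, event_name) pair:
import pandas as pd

# Hypothetical data with the two non-numeric columns
df = pd.DataFrame({
    "subscription_hash": ["hash_1", "hash_1", "hash_2"],
    "event_name": ["AddToQueue", "page_view", "page_view"],
})

# Column names passed as a list; size() counts the rows per group
df_events = (df.groupby(["subscription_hash", "event_name"])
               .size()
               .reset_index(name="count"))
print(df_events)
If all you need is the unique pairs themselves, df[["subscription_hash", "event_name"]].drop_duplicates() is another option.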

Passing list-likes to .loc or [] with any missing labels is no longer supported

I want to create a modified dataframe with the specified columns.
I tried the following, but it throws the error "Passing list-likes to .loc or [] with any missing labels is no longer supported":
# columns to keep
filtered_columns = ['text', 'agreeCount', 'disagreeCount', 'id', 'user.firstName', 'user.lastName', 'user.gender', 'user.id']
tips_filtered = tips_df.loc[:, filtered_columns]
# display tips
tips_filtered
Thank you
It looks like Pandas has deprecated this method of indexing. According to their docs:
This behavior is deprecated and will show a warning message pointing
to this section. The recommended alternative is to use .reindex()
Using the new recommended method, you can filter your columns using:
tips_filtered = tips_df.reindex(columns=filtered_columns)
NB: To reindex rows, you would use reindex(index=...) (more information here).
Some of the columns in the list are not included in the dataframe. If you do want to do that, let us try reindex:
tips_filtered = tips_df.reindex(columns=filtered_columns)
I encountered the same error with missing row index labels rather than columns.
For example, I would have a dataset of products with the following ids: ['a','b','c','d']. I store those products in a dataframe with indices ['a','b','c','d']:
df=pd.DataFrame(['product a','product b','product c', 'product d'],index=['a','b','c','d'])
Now let's assume I have an updated product index:
row_indices=['b','c','d','e'] in which 'e' corresponds to a new product: 'product e'. Note that 'e' was not present in my original index ['a','b','c','d'].
If I try to pass this updated index to my df dataframe: df.loc[row_indices,:],
I'll get this nasty error message:
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['e'], dtype='object').
To avoid this error I need to do intersection of my updated index with the original index:
df.loc[df.index.intersection(row_indices),:]
This is in line with the recommendation in the pandas docs.
This error pops up if you index on something which is not present. reset_index() worked for me, as I was indexing a subset of the actual dataframe with the original indices; in such a case the label may simply not be present in the dataframe.
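A minimal sketch of that situation with hypothetical data: a filtered subset keeps the original row labels, so some labels are "missing" until reset_index() renumbers them:
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40]})

subset = df[df["value"] != 20]           # keeps the original labels 0, 2, 3
# subset.loc[[0, 1]] would raise the "missing labels" KeyError, since label 1 is gone
subset = subset.reset_index(drop=True)   # labels become 0, 1, 2
print(subset.loc[[0, 1]])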
I had the same issue while trying to create new columns along with existing ones:
df = pd.DataFrame([[1,2,3]], columns=["a","b","c"])
def foobar(a, b):
    return a, b
df[["c","d"]] = df.apply(lambda row: foobar(row["a"], row["b"]), axis=1)
The solution was to add result_type="expand" as an argument to apply():
df[["c","d"]] = df.apply(lambda row: foobar(row["a"], row["b"]), axis=1, result_type="expand")

Getting the columns to display the month instead of tuples when using the pivot_table method

I pivot a dataframe using this line:
month_df = pd.pivot_table(loadexpense_df,index="Category",columns="Month",aggfunc={"Month":len}, fill_value=0)
The end result is displayed below:
Q1: How do I write logic to change the column name ('Month', 'Apr') to just Apr, and do the same for the rest of the column headers?
Q2: Also, can pivot_table return only the month as the header (e.g. Apr, May, etc.) instead of the tuple (e.g. ('Month', 'Apr'), ('Month', 'Aug'), etc.)?
Thanks
After calling pd.pivot_table() you get a DataFrame whose columns are a MultiIndex with two levels. You can solve both of your problems by setting the second level of the MultiIndex as the DataFrame's column labels:
month_df = pd.pivot_table(loadexpense_df,
                          index="Category",
                          columns="Month",
                          aggfunc={"Month": len},
                          fill_value=0)
month_df.columns = month_df.columns.get_level_values(level=1)
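As a side note (an alternative not mentioned in the original answer): dropping the first level of the column MultiIndex achieves the same result:
month_df.columns = month_df.columns.droplevel(0)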
Q1: The table's column names are contained in month_df.columns, and they are iterable. So:
columns = month_df.columns
tmp = []
for col in columns:
    if len(col) == 2:
        tmp.append(col[1])  # Taking the second tuple value
    else:
        tmp.append(col)
month_df.columns = tmp  # Reassigning the columns
Q2: I was unable to reproduce your issue with the code given. Mine printed out just fine from the beginning, but I still answered Q1, as I could generate the tuples by hand. However, the code
month_df = pd.pivot_table(loadexpense_df,
                          index="Category",
                          columns="Month",
                          aggfunc={"Month": len},
                          fill_value=0)
gives me only the names of the months as column names, and not tuples.

Pandas - Function to remove na

I'm trying to write a quick function but I'm struggling since I'm new to pandas/Python. I'm trying to remove NAs from two of my columns, but I keep getting the error below; my code is the following:
def remove_na():
    df.dropna(subset=['Column 1', 'Column 2'])
    df.reset_index(drop=True)

df = remove_rows()
df.head(3)
AttributeError: 'NoneType' object has no attribute 'dropna'
I want to use this function on different tables, which is why I thought it would make sense to wrap it up as a function. However, I just don't understand why it's not working here when, compared to other methods, it seems fine. Thank you.
I believe you can specify whether you want to remove NAs from columns or rows with the parameter axis, where 0 is the index and 1 is the columns. This would drop every column that contains NAs:
df.dropna(axis=1, inplace=True)
I think you can use apply with dropna:
df = df.apply(lambda x: pd.Series(x.dropna().values))
print (df)
Or you can also try this:
df=df.dropna(axis=0, how='any')
You're getting an error because the dropna function here returns a new dataframe as its output (it does not modify df in place).
You can either save it to a dataframe:
df = df.dropna(subset=['Column 1', 'Column 2'])
or pass the argument inplace=True:
df.dropna(subset=['Column 1', 'Column 2'], inplace=True)
In order to remove all the missing values from the data set at once using pandas, you can use the following. (Remember to specify the axis in the arguments so that the missing values are removed the way you intend.)
# making new data frame with dropped NA values
new_data = data.dropna(axis = 0, how ='any')
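To tie this back to the function from the question, here is a minimal sketch of a reusable helper (the name and parameters are only illustrative) that returns the cleaned frame instead of discarding dropna's result:
def remove_na(frame, cols):
    # Drop rows with NA in the given columns, then renumber the index
    cleaned = frame.dropna(subset=cols)
    return cleaned.reset_index(drop=True)

df = remove_na(df, ['Column 1', 'Column 2'])
df.head(3)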

Refer to pandas dataframe columns or index, depending on parameter

I am writing a function that operates on the labels of a pandas dataframe and I want to have a parameter axis to decide whether to operate on index or columns.
So I wrote something like:
if axis == 0:
    to_sort = df.index
elif axis == 1:
    to_sort = df.columns
else:
    raise AttributeError
where df is a pandas dataframe.
Is there a better way of doing this?
Note I am not asking for a code review, but more specifically asking if there is a pandas attribute (something like labels would make sense to me) that allows me to get index or columns depending on a parameter/index to be passed.
For example (code not working):
df.labels[0] # index
df.labels[1] # columns
Short answer: You can use iloc(axis=...)
Documentation: http://pandas.pydata.org/pandas-docs/stable/advanced.html
You can also specify the axis argument to .loc to interpret the passed
slicers on a single axis.
(They seem to have omitted iloc in regards to the axis parameter)
A complete example
df = pd.DataFrame({"A":['a1', 'a2'], "B":['b1', 'b2']})
print(df)
Output:
A B
0 a1 b1
1 a2 b2
With axis=0
print(df.iloc(axis=0)[0].index)
Output:
Index(['A', 'B'], dtype='object')
With axis=1
print(df.iloc(axis=1)[0].index)
Output:
RangeIndex(start=0, stop=2, step=1)
Looking at reindex documentation examples, I realized I can do something like this:
Let the parameter be axis={'index', 'columns'}
Get the relevant labels using getattr: labels = getattr(df, axis)
I'm open to other pandas-specific solutions.
If I were forced to use axis={1, 0}, then @Bharath's suggestion to use a helper function makes sense.
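A small sketch of that getattr idea (the helper name is just for illustration), assuming axis is passed as 'index' or 'columns':
def get_labels(df, axis="index"):
    # axis is expected to be 'index' or 'columns'
    if axis not in ("index", "columns"):
        raise ValueError("axis must be 'index' or 'columns'")
    return getattr(df, axis)

get_labels(df, "columns")  # e.g. Index(['A', 'B'], dtype='object') for the example above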
