I have a pandas dataframe like so:
import pandas as pd
import numpy as np
df = pd.DataFrame([['WY', 'M', 2014, 'Seth', 5],
                   ['WY', 'M', 2014, 'Spencer', 5],
                   ['WY', 'M', 2014, 'Tyce', 5],
                   ['NY', 'M', 2014, 'Seth', 25],
                   ['MA', 'M', 2014, 'Spencer', 23]],
                  columns=['state', 'sex', 'year', 'name', 'number'])
print(df)
How do I manipulate the data to get a dataframe like:
df1 = pd.DataFrame([['M', 2014, 'Seth', 30],
                    ['M', 2014, 'Spencer', 28],
                    ['M', 2014, 'Tyce', 5]],
                   columns=['sex', 'year', 'name', 'number'])
print(df1)
This is just part of a very large dataframe; how would I do this for every name in every year?
df[['sex','year','name','number']].groupby(['sex','year','name']).sum().reset_index()
For a brief description of what this does, from left to right:
Select only the columns we care about. We could replace this part with df.drop('state',axis=1)
Perform a groupby on the columns we care about.
Sum the remaining columns (in this case, just number).
Reset the index so that the columns ['sex','year','name'] are no longer a part of the index.
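Run end to end on the sample data above, the chain gives exactly the df1 asked for:

```python
import pandas as pd

df = pd.DataFrame([['WY', 'M', 2014, 'Seth', 5],
                   ['WY', 'M', 2014, 'Spencer', 5],
                   ['WY', 'M', 2014, 'Tyce', 5],
                   ['NY', 'M', 2014, 'Seth', 25],
                   ['MA', 'M', 2014, 'Spencer', 23]],
                  columns=['state', 'sex', 'year', 'name', 'number'])

# Drop 'state', group on the remaining key columns, sum 'number',
# and move the group keys back out of the index
result = (df[['sex', 'year', 'name', 'number']]
          .groupby(['sex', 'year', 'name'])
          .sum()
          .reset_index())
print(result)
#   sex  year     name  number
# 0   M  2014     Seth      30
# 1   M  2014  Spencer      28
# 2   M  2014     Tyce       5
```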
You can use pivot_table. The grouping columns belong in index here; passing them as columns would spread them across a wide one-row frame instead:
df.pivot_table(values='number', index=['sex','year','name'], aggfunc='sum').reset_index()
Group by the columns you want, sum number, and flatten the multi-index:
df.groupby(['sex','year','name'])['number'].sum().reset_index()
In your case the string column state cannot be summed, so either select number explicitly (as above) or drop non-numeric columns during the sum:
df.groupby(['sex','year','name']).sum(numeric_only=True).reset_index()
A lot of times (e.g. for time series) I need to use all the values in a column until the current row.
For instance, if my dataframe has 100 rows, I want to create a new column where the value in each row is a sum, average, product, or any other formula over all the previous rows, excluding the following ones:
Row 20 = formula(all_values_until_row_20)
Row 21 = formula(all_values_until_row_21)
etc
I think the easiest way to ask this question would be: How to implement the cumsum() function for a new column in pandas without using that specific method?
One approach, if you cannot use cumsum is to introduce a new column or index and then apply a lambda function that uses all rows that have the new column value less than the current row's.
import pandas as pd
df = pd.DataFrame({'x': range(20, 30), 'y': range(40, 50)}).set_index('x')
# Positional id so each row can see which rows precede it
df['Id'] = range(len(df.index))
# For each row, sum 'y' over all rows up to and including it (note this is O(n^2))
df['Sum'] = df.apply(lambda row: df[df['Id'] <= row['Id']]['y'].sum(), axis=1)
print(df)
Since there is no sample data, I will go with an assumed dataframe that has at least one numeric column and no NaN values.
I would start like below for cumulative sums and averages.
cumulative sum:
df['cum_sum'] = df['existing_col'].cumsum()
cumulative average (here index_col is assumed to hold a 1-based row count):
df['cum_avg'] = df['existing_col'].cumsum() / df['index_col']
or
df['cum_avg'] = df['existing_col'].expanding().mean()
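A quick sketch on a made-up column (the column name y and its values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'y': [10, 20, 30, 40]})

# Running sum of all rows up to and including the current one
df['cum_sum'] = df['y'].cumsum()

# Running average over the same expanding window
df['cum_avg'] = df['y'].expanding().mean()
print(df)
#     y  cum_sum  cum_avg
# 0  10       10     10.0
# 1  20       30     15.0
# 2  30       60     20.0
# 3  40      100     25.0
```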
If you can provide a sample DataFrame, I believe you can get better help.
I have a DataFrame in Pandas for example:
df = pd.DataFrame({"a": [0, 0, 1, 1, 0], "penalty": ["12", "15", "13", "100", "22"]})
How can I sum the values in column "penalty", but only for the rows where column "a" has the value 0?
You can filter your dataframe with this:
import pandas as pd
data ={'a':[0,0,1,1,0],'penalty':[12, 15,13,100, 22]}
df = pd.DataFrame(data)
print(df.loc[df['a'].eq(0), 'penalty'].sum())
This way you are selecting the column penalty from your dataframe where the column a is equal to 0. Afterwards, you perform the .sum() operation, returning your expected output (49). The only change I made was removing the quotation marks so that the values in penalty are interpreted as integers rather than strings. If the inputs are necessarily strings, you can convert them with df['penalty'] = df['penalty'].astype(int)
Filter the rows which have 0 in column a and calculate the sum of the penalty column.
import pandas as pd
data ={'a':[0,0,1,1,0],'penalty':[12, 15,13,100, 22]}
df = pd.DataFrame(data)
df[df.a == 0].penalty.sum()
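For completeness, a groupby variant returns the sum for every value of a at once (same sample data):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1, 0], 'penalty': [12, 15, 13, 100, 22]})

# One sum of 'penalty' per distinct value of 'a'
per_group = df.groupby('a')['penalty'].sum()
print(per_group.loc[0])  # 49  (rows where a == 0)
print(per_group.loc[1])  # 113 (rows where a == 1)
```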
Let's say I have a 100x100 pandas dataframe, consisting entirely of numerical values.
What I want to do is get the difference between the nth row and the (n-1)th row in each column.
Let's say the first column has the values (1, 2, 3, 4, ..., 100); what I would want is the output (1, 1, 1, ..., 1): it subtracts the first row from the second, the second from the third, and so on, for each column.
I have done it using a for-loop where it loops through each column, then each row. But I'm wondering if there's a more elegant solution
This is what I figure will work, haven't actually had a chance to try it yet for reasons....
outputframe = pd.DataFrame(0, index=range(99), columns=range(100))
for i in range(100):
    for x in range(1, 100):
        outputframe.iloc[x - 1, i] = df.iloc[x, i] - df.iloc[x - 1, i]
I believe this will give me the correct results, however, I'm wondering if there's possibly a more elegant solution
The key here is the pandas shift(n) method, which shifts the values by n rows relative to the index.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 100)))
df_new = df.shift(-1) - df
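Checked on a small version of the 1, 2, 3, ... column from the question. Note that df.shift(-1) - df places each difference on the earlier of the two rows, while df.diff() places it on the later row:

```python
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3, 4, 5]})

# Next row minus current row, stored at the current row; the last row becomes NaN
forward = df.shift(-1) - df
print(forward[0].tolist())   # [1.0, 1.0, 1.0, 1.0, nan]

# Current row minus previous row (same as df.diff()); the first row becomes NaN
backward = df.diff()
print(backward[0].tolist())  # [nan, 1.0, 1.0, 1.0, 1.0]
```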
As @ALollz says, .diff() will work fine and fast here.
The first row will get NaNs, so I'm reassigning the first row afterwards.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(100, 100)))
df_new = df.diff()
df_new.iloc[0] = df.iloc[0]
(Screenshots omitted: the original dataframe; the result after .diff(), with NaN in the first row; and the result after df_new.iloc[0] = df.iloc[0].)
I'm trying to learn Python and have been trying to figure out how to create a sum column for my data. I want to sum all the other columns. I create the new column, but all the sum values are zero. The data can be found here. My code is below; thank you for the help:
import pandas as pd
#Importing csv file to chinaimport_df datafram
filename=r'C:\Users\Ing PC\Documents\Intro to Data Analysis\Final Project\CHINA_DOLLAR_IMPORTS.csv'
chinaimport_df = pd.read_csv(filename)
# Remove rows with fewer than two non-NaN values (thresh=2, since the first column is words)
chinaimport_df = chinaimport_df.dropna(axis=0, thresh=2)
#Convert NANs to zeros
chinaimport_df=chinaimport_df.fillna(0)
#create a list of columns excluding the first column, to make sum func work later
col_list= list(chinaimport_df)
col_list.remove('Commodity')
print(col_list)
#adding column that sums
chinaimport_df['Total'] = chinaimport_df[col_list].sum(axis=1)
chinaimport_df.to_csv("output.csv", index=False)
IIUC this should do it. The numeric columns are strings containing thousands separators, which is why your sum came out zero; strip the commas and cast to float before summing:
import pandas as pd
df = pd.read_csv('CHINA_DOLLAR_IMPORTS.csv')
df['Total'] = df.replace(r',',"", regex=True).iloc[:, 1:].astype(float).sum(axis=1)
df.to_csv('output.csv', index=False)
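The linked CSV isn't reproduced here, so here is a minimal sketch assuming the same shape: a text Commodity column followed by numeric columns stored as strings with thousands separators (the column names and values are invented):

```python
import pandas as pd

# Invented stand-in for CHINA_DOLLAR_IMPORTS.csv
df = pd.DataFrame({'Commodity': ['Steel', 'Rice'],
                   '2017': ['1,200', '300'],
                   '2018': ['800', '1,100']})

# Strip the commas, cast to float, and sum everything after the first column
df['Total'] = (df.replace(r',', '', regex=True)
                 .iloc[:, 1:]
                 .astype(float)
                 .sum(axis=1))
print(df['Total'].tolist())  # [2000.0, 1400.0]
```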
I get an empty dataframe when I try to group values using pivot_table. Let's first create some dummy data:
import pandas as pd
df = pd.DataFrame({"size":['large','middle','xsmall','large','middle','small'],
"color":['blue','blue','red','black','red','red']})
When I use:
df1 = df.pivot_table(index='size', aggfunc='count')
it returns what I expect. Now I would like a complete pivot table with color as the columns:
df2 = df.pivot_table(index='size', aggfunc='count',columns='color')
But this results in an empty dataframe. Why? How can I get a simple pivot table that counts the combinations for me?
Thank you.
You need to use len as the aggfunc, like so:
df.pivot_table(index='size', aggfunc=len, columns='color')
If you want to use count, here are the steps:
First add a frequency columns, like so:
df['freq'] = df.groupby(['color', 'size'])['color'].transform('count')
Then create the pivot table using the frequency column:
df.pivot_table(values='freq', index='size', aggfunc='count', columns='color')
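Run end to end on the sample data above, the frequency-column version looks like this (NaN marks combinations that never occur):

```python
import pandas as pd

df = pd.DataFrame({'size': ['large', 'middle', 'xsmall', 'large', 'middle', 'small'],
                   'color': ['blue', 'blue', 'red', 'black', 'red', 'red']})

# Per-combination frequency, then pivot on it
df['freq'] = df.groupby(['color', 'size'])['color'].transform('count')
table = df.pivot_table(values='freq', index='size', aggfunc='count', columns='color')
print(table)
```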
You need another column to be used as the values for aggregation.
Add a column -
df['freq']=1
Your code will work.
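A minimal sketch of the dummy-column approach on the same sample data (values='freq' is passed explicitly here so the result columns stay flat):

```python
import pandas as pd

df = pd.DataFrame({'size': ['large', 'middle', 'xsmall', 'large', 'middle', 'small'],
                   'color': ['blue', 'blue', 'red', 'black', 'red', 'red']})

df['freq'] = 1  # constant column: gives aggfunc='count' something to count
table = df.pivot_table(values='freq', index='size', columns='color', aggfunc='count')
print(table)
```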