Pandas Dataframes sum value counts of different columns - python

I just started analyzing a small csv file and I created a DataFrame that looks like this:
The DataFrame contains columns diag1 to diag12; each column contains a string or a NaN value. My objective is to create a chart showing the number of occurrences of each code.
How should I sum the value_counts results? For example, for the diag1 and diag2 columns I should end up with a DataFrame or a Series holding the sum of both series: the code J20.9 should appear with a value of 194, 88 from the diag1 series and 106 from diag2.
How should I do this sum over the value counts of the columns diag1 to diag12?

Use this code:
from collections import Counter
import pandas as pd
final_count = Counter()
# add each column's counts to the running total
for col in df.columns:
    final_count = Counter(df[col]) + final_count
print(final_count)
The final_count will have counts for all the values. Use pd.Series(final_count) to convert it to a Series.
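If you prefer to stay within pandas, here is a minimal sketch of the same counting, assuming the diagnosis columns really are named diag1 through diag12 and df is the frame from the question:

diag_cols = ["diag{}".format(i) for i in range(1, 13)]  # assumed column names
# stack() flattens the twelve columns into one Series and drops the NaN values,
# value_counts() then tallies every code across all of them
code_counts = df[diag_cols].stack().value_counts()
print(code_counts)
code_counts.plot(kind='bar')  # chart of occurrences per code

With the numbers from the question, J20.9 would come out as 194 here as well.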

Related

Pandas dataframes: Create new column with a formula that uses all values of column X until each row (similar to cumsum)

A lot of times (e.g. for time series) I need to use all the values in a column until the current row.
For instance, if my dataframe has 100 rows, I want to create a new column where the value in each row is a (sum, average, product, or any other formula) of all the previous rows, excluding the next ones:
Row 20 = formula(all_values_until_row_20)
Row 21 = formula(all_values_until_row_21)
etc
I think the easiest way to ask this question would be: How to implement the cumsum() function for a new column in pandas without using that specific method?
One approach, if you cannot use cumsum, is to introduce a new column or index and then apply a lambda function that uses all rows whose value in that new column is less than or equal to the current row's.
import pandas as pd

df = pd.DataFrame({'x': range(20, 30), 'y': range(40, 50)}).set_index('x')
# running row position, used to select "all rows up to here"
df['Id'] = range(0, len(df.index))
# for each row, sum 'y' over every row whose Id is less than or equal to this row's
df['Sum'] = df.apply(lambda x: df[df['Id'] <= x['Id']]['y'].sum(), axis=1)
print(df)
Since there is no sample data, I will assume a dataframe with at least one column of numeric data and no NaN values.
I would start like below for cumulative sums and averages.
cumulative sum:
df['cum_sum'] = df['existing_col'].cumsum()
cumulative average:
df['cum_avg'] = df['existing_col'].cumsum() / df['index_col']  # where index_col holds the 1-based row position
or
df['cum_avg'] = df['existing_col'].expanding().mean()
If you can provide a sample DataFrame, you can get better help, I believe.
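For completeness, a small runnable sketch of those running aggregates on an assumed numeric column y (sample data made up here, not from the posts):

import pandas as pd

df = pd.DataFrame({'y': [4, 7, 1, 3, 9]})      # assumed sample data
df['cum_sum'] = df['y'].cumsum()               # running total of y
df['cum_avg'] = df['y'].expanding().mean()     # running mean of y
df['cum_sum_alt'] = df['y'].expanding().sum()  # same running total without cumsum()
print(df)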

How to subtract all the date columns from each other (in permutation) and store them in a new pandas DataFrame?

I was working on Jupyter and arrived at a situation where I had to take differences of each column from every other column taken in permutation and then store them in a separate DataFrame. I tried using nested loops but got stuck while assigning the values to the DataFrame.
n = 0
for i in range(len(list(df.columns))-1):
    for j in range(i+1, len(list(df.columns))-1):
        df1[n] = pd.DataFrame(abs((df.iloc[:,i] - df.iloc[:,j]).dt.days))
        n = n+1
df1
Also, I would like to have column headers in this format: D1-D2, D1-D3, etc. The difference in dates has to be a positive integer. I would really appreciate it if anyone could help me with this code. Thanks!
A snippet of the DataFrame
import itertools
import pandas as pd

# create a sample dataframe
df = pd.DataFrame(data={"co1": [1, 2, 3, 4], "co22": [4, 3, 2, 1], "co3": [2, 3, 2, 4]})
# iterate over all permutations of size 2 and write to a dictionary
newcols = {}
for col1, col2 in itertools.permutations(df.columns, 2):
    newcols["-".join([col1, col2])] = df[col1] - df[col2]
# create dataframe from dict
newdf = pd.DataFrame(newcols)
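For the datetime columns in the question, the same pattern can be applied with a .dt.days conversion; here is a sketch with made-up column names D1, D2, D3 (itertools.combinations is used instead of permutations so each pair appears only once, and abs() keeps the day differences positive):

import itertools
import pandas as pd

# hypothetical date columns standing in for the question's data
df = pd.DataFrame({
    "D1": pd.to_datetime(["2020-01-01", "2020-02-01"]),
    "D2": pd.to_datetime(["2020-01-10", "2020-01-20"]),
    "D3": pd.to_datetime(["2020-03-01", "2020-02-15"]),
})
newcols = {}
for col1, col2 in itertools.combinations(df.columns, 2):
    # absolute difference in whole days, so the result is a positive integer
    newcols["{}-{}".format(col1, col2)] = (df[col1] - df[col2]).abs().dt.days
newdf = pd.DataFrame(newcols)
print(newdf)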

Sum values in one column based on specific values in other column

I have a DataFrame in Pandas for example:
df = pd.DataFrame("a":[0,0,1,1,0], "penalty":["12", "15","13","100", "22"])
and how can I sum values in column "penalty" but I would like to sum only these values from column "penalty" which have values 0 in column "a" ?
You can filter your dataframe with this:
import pandas as pd
data ={'a':[0,0,1,1,0],'penalty':[12, 15,13,100, 22]}
df = pd.DataFrame(data)
print(df.loc[df['a'].eq(0), 'penalty'].sum())
This way you are selecting the column penalty from your dataframe where the column a is equal to 0. Afterwards, you perform the .sum() operation, which returns your expected output (49). The only change I made was to remove the quotation marks so that the values in the penalty column are interpreted as integers rather than strings. If the inputs are necessarily strings, you can simply convert them with df['penalty'] = df['penalty'].astype(int)
Filter the rows which have 0 in column a and calculate the sum of the penalty column.
import pandas as pd
data ={'a':[0,0,1,1,0],'penalty':[12, 15,13,100, 22]}
df = pd.DataFrame(data)
df[df.a == 0].penalty.sum()
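If you also want the corresponding sums for every value of a at once, a related one-liner (not from the original answers, just a sketch) is a groupby:

import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1, 0], 'penalty': [12, 15, 13, 100, 22]})
# sum of penalty for each distinct value of a: 49 for a == 0, 113 for a == 1
print(df.groupby('a')['penalty'].sum())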

I'm trying to sum multiple columns into a new sum column using a python pandas dataframe

I'm trying to learn Python and have been trying to figure out how to create a sum column for my data. I want to sum all the other columns. I create the new column, but all of the sum values are zero. The data can be found here. My code is below, thank you for the help:
import pandas as pd
#Importing csv file to chinaimport_df datafram
filename=r'C:\Users\Ing PC\Documents\Intro to Data Analysis\Final Project\CHINA_DOLLAR_IMPORTS.csv'
chinaimport_df = pd.read_csv(filename)
# Removing all rows that contain only missing values; thresh=2 since the first column is words
chinaimport_df = chinaimport_df.dropna(how='all',axis=0, thresh=2)
#Convert NANs to zeros
chinaimport_df=chinaimport_df.fillna(0)
#create a list of columns excluding the first column, to make sum func work later
col_list= list(chinaimport_df)
col_list.remove('Commodity')
print(col_list)
#adding column that sums
chinaimport_df['Total'] = chinaimport_df[col_list].sum(axis=1)
chinaimport_df.to_csv("output.csv", index=False)
IIUC this should do it. The value columns were most likely read in as strings because of the comma thousands separators, which is why the row sums came out as zero; stripping the commas and casting to float before summing fixes that.
import pandas as pd
df = pd.read_csv('CHINA_DOLLAR_IMPORTS.csv')
df['Total'] = df.replace(r',',"", regex=True).iloc[:, 1:].astype(float).sum(axis=1)
df.to_csv('output.csv', index=False)
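Alternatively, if the commas really are thousands separators, read_csv can strip them at load time via its thousands argument; a sketch using the file name from the question:

import pandas as pd

# thousands=',' makes read_csv parse values like "1,234" as numbers instead of strings
df = pd.read_csv('CHINA_DOLLAR_IMPORTS.csv', thousands=',')
df['Total'] = df.iloc[:, 1:].sum(axis=1)  # sum every column except the first ('Commodity')
df.to_csv('output.csv', index=False)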

Combining pandas dataframe values based on other column values

I have a pandas dataframe like so:
import pandas as pd
import numpy as np

df = pd.DataFrame([['WY', 'M', 2014, 'Seth', 5],
                   ['WY', 'M', 2014, 'Spencer', 5],
                   ['WY', 'M', 2014, 'Tyce', 5],
                   ['NY', 'M', 2014, 'Seth', 25],
                   ['MA', 'M', 2014, 'Spencer', 23]],
                  columns=['state', 'sex', 'year', 'name', 'number'])
print(df)
How do I manipulate the data to get a dataframe like:
df1 = pd.DataFrame([['M', 2014, 'Seth', 30],
                    ['M', 2014, 'Spencer', 28],
                    ['M', 2014, 'Tyce', 5]],
                   columns=['sex', 'year', 'name', 'number'])
print(df1)
This is just part of a very large dataframe, how would I do this for every name for every year?
df[['sex','year','name','number']].groupby(['sex','year','name']).sum().reset_index()
For a brief description of what this does, from left to right:
Select only the columns we care about. We could replace this part with df.drop('state',axis=1)
Perform a groupby on the columns we care about.
Sum the remaining columns (in this case, just number).
Reset the index so that the columns ['sex','year','name'] are no longer a part of the index.
You can use pivot_table:
df.pivot_table(values='number', index=['sex','year','name'], aggfunc='sum').reset_index()
Group by the columns you want, sum number, and flatten the multi-index:
df.groupby(['sex','year','name'])['number'].sum().reset_index()
In your case the column state is not sum-able, so you can shorten to:
df.groupby(['sex','year','name']).sum().reset_index()
