I have a pandas Series that I would like to aggregate in three different ways. The series is as follows:
import pandas as pd
timestamps = [1,1,1,2,3,3,3,4]
quantities = [10,0,2,6,7,2,8,0]
series = pd.Series(quantities, index=timestamps)
The timestamps contain three 1s, one 2, three 3s and one 4. I would like to generate the following series:
1. Sum of the duplicate index values:
pd.Series([12,6,17,0], index=[1,2,3,4])
2. Median of the duplicate index values:
pd.Series([2,6,7,0], index=[1,2,3,4])
3. The number of duplicate index values:
pd.Series([3,1,3,1], index=[1,2,3,4])
In numpy I would achieve this using a unique_elements_to_indices method:
from typing import Dict
import numpy as np
def unique_elements_to_indices(array: np.ndarray) -> Dict:
    mapping = {}
    for unique_element in np.unique(array):
        mapping[unique_element] = np.where(array == unique_element)[0]
    return mapping
... and then I would loop through the unique_elements and use np.where to locate the quantities for that given unique_element.
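For example, a rough sketch of that loop for the sum case (median and count would follow the same pattern):
import numpy as np

quantities_array = np.array(quantities)
timestamps_array = np.array(timestamps)

mapping = unique_elements_to_indices(timestamps_array)
# sum the quantities that share each unique timestamp
sums = {ts: quantities_array[idx].sum() for ts, idx in mapping.items()}
# sums == {1: 12, 2: 6, 3: 17, 4: 0}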
Is there a way to achieve this quickly in pandas, please?
Thanks.
For separate outputs it is possible to use the functions sum and median with the parameter level=0 to aggregate by the index (note that in newer pandas the level argument of these reductions is deprecated in favour of groupby):
print (series.sum(level=0))
print (series.median(level=0))
But generally, aggregate by index with groupby:
print (series.groupby(level=0).sum())
print (series.groupby(level=0).median())
# the difference between count and size is that count excludes NaN values
print (series.groupby(level=0).size())
print (series.groupby(level=0).count())
If you need them all together in a new DataFrame, use GroupBy.agg with a list of aggregation functions:
print(series.groupby(level=0).agg(['sum', 'median', 'size']))
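With the sample series above, this should produce something like:
   sum  median  size
1   12     2.0     3
2    6     6.0     1
3   17     7.0     3
4    0     0.0     1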
You could use .groupby for this:
import pandas as pd
timestamps = [1,1,1,2,3,3,3,4]
quantities = [10,0,2,6,7,2,8,0]
sr = pd.Series(quantities, index=timestamps)
print(sr.groupby(sr.index).sum())
print(sr.groupby(sr.index).median())
print(sr.groupby(sr.index).count())
When you are working with the pandas library, it is advisable to convert your data into a DataFrame. The easiest way in pandas is as below:
timestamps = [1,1,1,2,3,3,3,4]
quantities = [10,0,2,6,7,2,8,0]
d = {'quantities': quantities, 'timestamps': timestamps}
df = pd.DataFrame(d)
df.groupby('timestamps').sum().reset_index()
In a similar way you can also use the other functions, for example as sketched below. Kindly let me know if this works for you.
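For example, a quick sketch of the other aggregations on the same DataFrame (the agg call collects all three at once):
df.groupby('timestamps').median().reset_index()
df.groupby('timestamps').count().reset_index()
df.groupby('timestamps')['quantities'].agg(['sum', 'median', 'size'])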
I have a list of time series (one pandas DataFrame per device) and want to calculate the matrix profile for each device's time series.
One option is to iterate all the devices - which seems to be slow.
A second option would be to group by the devices and apply a UDF. The problem is that the UDF returns rows 1:1, i.e. not a single scalar value per group but the same number of output rows as input rows.
Is it still possible to somehow vectorize this calculation for each group when 1:1 (or at least non-scalar) values are returned?
import pandas as pd
df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)
print('***************************')
# slow version retaining all the rows
for g in df.bar.unique():
    print(g)
    this_group = df[df.bar == g]
    # perform a UDF which needs to have all the values per group,
    # i.e. in reality I want to calculate the matrix profile for each device's time series
    this_group['result'] = this_group.baz.apply(lambda x: 1)
    display(this_group)
print('***************************')
def my_non_scalar1_1_agg_function(x):
    display(pd.DataFrame(x))
    return x
# neatly vectorized application of a non_scalar function
# but this fails as: Must produce aggregated value
df = df.groupby(['bar']).baz.agg(my_non_scalar1_1_agg_function)
display(df)
For non-aggregating functions applied to each distinct group (i.e. functions that do not return a single scalar per group), you need to run the method on each group separately and then compile the results together.
Therefore, consider a list or dict comprehension over groupby(), followed by concat. Be sure the method takes and returns a full DataFrame, Series, or ndarray.
# LIST COMPREHENSION
df_list = [ myfunction(sub) for index, sub in df.groupby(['group_column']) ]
final_df = pd.concat(df_list)
# DICT COMPREHENSION
df_dict = { index: myfunction(sub) for index, sub in df.groupby(['group_column']) }
final_df = pd.concat(df_dict, ignore_index=True)
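As a minimal sketch using the toy frame from the question, with myfunction standing in for the real matrix-profile UDF:
import pandas as pd

df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})

def myfunction(sub):
    # placeholder per-group UDF: returns one output row per input row
    sub = sub.copy()
    sub['result'] = sub['baz'].rank()
    return sub

final_df = pd.concat([myfunction(sub) for _, sub in df.groupby('bar')])
print(final_df)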
Indeed, this (see also the link above in the comments) is a way to get it to work in a faster / more desirable way. Perhaps there is an even better alternative:
import pandas as pd
df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)
grouped_df = df.groupby(['bar'])
altered = []
for index, subframe in grouped_df:
    display(subframe)
    subframe = subframe  # obviously we need to apply the UDF here - not the idempotent operation (= doing nothing)
    altered.append(subframe)
    print(index)
    #print(subframe)

pd.concat(altered, ignore_index=True)
#pd.DataFrame(altered)
I have a large dataframe and would like to update specific values at known row and column indices. I would like to do this without an explicit for loop.
For example:
import string
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 10), index = range(10), columns = list(string.ascii_lowercase)[:10])
I have arbitrary arrays of indexes, columns, and values that I would like to use to update df. For example:
update_values = [0,-2,-3]
update_index = [3,5,7]
update_columns = ["d","g","i"]
I can loop over the arrays to update the original dataframe:
for i, j, v in zip(update_index, update_columns, update_values):
    df.loc[i, j] = v
but would like to use a technique not involving an explicit for loop.
Use the underlying numpy values
indexes = map(df.columns.get_loc, update_columns)
df.values[update_index, list(indexes)] = update_values
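A sketch of this approach on the frame from the question; note that it assumes the frame holds a single dtype, so that .values is (in classic pandas, without copy-on-write) a view on the underlying array and the assignment writes through. With mixed dtypes it would silently modify a temporary copy:
import string
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 10), index=range(10),
                  columns=list(string.ascii_lowercase)[:10])

update_values = [0, -2, -3]
update_index = [3, 5, 7]
update_columns = ["d", "g", "i"]

# translate the column labels into positional indices
col_positions = [df.columns.get_loc(c) for c in update_columns]
df.values[update_index, col_positions] = update_values
print(df.loc[update_index, update_columns])  # the three target cells now hold 0, -2, -3 on the diagonal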
Try using loc, which lets you specify the needed index labels and column names as loc[[index_labels], [column_labels]]:
df.loc[[3,5,7], ["d","g","i"]] = [0,-2,-3]
Note, though, that this assigns to the entire 3x3 block (every row/column combination), broadcasting the three values across each row, which is not the same as the element-wise pairs set by the loop in the question.
Let's say df is a pandas DataFrame.
I would like to find all columns of numeric type.
Something like:
isNumeric = is_numeric(df)
You could use the select_dtypes method of DataFrame. It has two parameters, include and exclude. So isNumeric would look like:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
Simple one-line answer to create a new dataframe with only numeric columns:
df.select_dtypes(include=np.number)
If you want the names of numeric columns:
df.select_dtypes(include=np.number).columns.tolist()
Complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': range(7, 10),
                   'B': np.random.rand(3),
                   'C': ['foo', 'bar', 'baz'],
                   'D': ['who', 'what', 'when']})
df
# A B C D
# 0 7 0.704021 foo who
# 1 8 0.264025 bar what
# 2 9 0.230671 baz when
df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only
# A B
# 0 7 0.704021
# 1 8 0.264025
# 2 9 0.230671
colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only
# ['A', 'B']
You can use the undocumented function _get_numeric_data() to filter only numeric columns:
df._get_numeric_data()
Example:
In [32]: data
Out[32]:
A B
0 1 s
1 2 s
2 3 s
3 4 s
In [33]: data._get_numeric_data()
Out[33]:
A
0 1
1 2
2 3
3 4
Note that this is a "private method" (i.e., an implementation detail) and is subject to change or total removal in the future. Use with caution.
df.select_dtypes(exclude = ['object'])
Update:
df.select_dtypes(include= np.number)
or with a newer version of pandas
df.select_dtypes('number')
Simple one-liner:
df.select_dtypes('number').columns
The following code will return a list of the names of the numeric columns of a data set.
cnames=list(marketing_train.select_dtypes(exclude=['object']).columns)
Here marketing_train is my data set, select_dtypes() is the function to select data types using the exclude and include arguments, and columns is used to fetch the column names of the data set.
The output of the above code will be the following:
['custAge',
'campaign',
'pdays',
'previous',
'emp.var.rate',
'cons.price.idx',
'cons.conf.idx',
'euribor3m',
'nr.employed',
'pmonths',
'pastEmail']
This is another simple way of finding the numeric columns in a pandas DataFrame:
numeric_clmns = df.dtypes[df.dtypes != "object"].index
We can include and exclude data types as per the requirements, as below:
train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number') #will include all the numeric types
Referring to the select_dtypes documentation (as shown in a Jupyter notebook):
- To select all numeric types, use np.number or 'number'
- To select strings you must use the object dtype, but note that this will return all object dtype columns. See the NumPy dtype hierarchy: http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html
- To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
- To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
- To select Pandas categorical dtypes, use 'category'
- To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'
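A small illustrative sketch of those selectors (the column names and data here are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'num': [1, 2, 3],
    'txt': ['a', 'b', 'c'],
    'when': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']),
    'cat': pd.Categorical(['x', 'y', 'x']),
})

print(df.select_dtypes(include='number').columns.tolist())    # ['num']
print(df.select_dtypes(include='object').columns.tolist())    # ['txt']
print(df.select_dtypes(include='datetime').columns.tolist())  # ['when']
print(df.select_dtypes(include='category').columns.tolist())  # ['cat']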
Although this is an old subject, I think the following formula is easier than all the other suggestions:
df[df.describe().columns]
Since describe() only works on numeric columns by default, the columns of its output will be exactly the numeric ones.
Please see the below code:
if dataset.select_dtypes(include=[np.number]).shape[1] > 0:
    display(dataset.select_dtypes(include=[np.number]).describe())

if dataset.select_dtypes(include=['object']).shape[1] > 0:
    display(dataset.select_dtypes(include=['object']).describe())
This way you can check whether the values are numeric, such as float and int, or strings. The second if statement checks for string values, which are referred to by the object dtype.
Adapting this answer, you could do
df.loc[:, df.applymap(np.isreal).all(axis=0)]
Here, df.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .all(axis=0) checks whether all values in each column are True, returning a Series of Booleans that can be used to select the desired columns.
A lot of the posted answers are inefficient. They either return/select a subset of the original dataframe (a needless copy) or, in the case of describe(), perform needless computation of statistics.
To just get the column names that are numeric, one can use a conditional list comprehension with the pd.api.types.is_numeric_dtype function:
numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
I'm not sure when this function was introduced.
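For instance, a quick sketch on a small made-up frame:
import pandas as pd

df = pd.DataFrame({'A': [7, 8, 9],
                   'B': [0.7, 0.2, 0.3],
                   'C': ['foo', 'bar', 'baz']})

numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
print(numeric_cols)  # ['A', 'B']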
def is_type(df, baseType):
    import numpy as np
    import pandas as pd
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data=test, index=df.columns, columns=["test"])

def is_float(df):
    import numpy as np
    return is_type(df, np.floating)

def is_number(df):
    import numpy as np
    return is_type(df, np.number)

def is_integer(df):
    import numpy as np
    return is_type(df, np.integer)
I have a dataframe. For each row of the dataframe: I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd
DF = pd.read_csv("...")
Equation_1 = f(x, y)
Equation_2 = g(x, y)
for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF, reading and entering new values for each row, this code iterates over DF and enters the same values for every row. I am not sure what I am doing wrong here.
Also, from what I have read it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. Not sure how I would go about this.
Thanks.
Turns out that this is extremely easy. All that must be done is to define two variables and assign the desired columns to them, then set the column to be filled equal to the equation containing those variables.
Pandas already knows that it must apply the equation to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g.,
import pandas as pd
df = pd.read_csv("...") # df is a large 2D array
A = df[0]
B = df[1]

def f(A, B):
    return ...  # the equation, written with vectorized (column-wise) operations

df[3] = f(A, B)
# If your equations are simple enough, do operations column-wise in Pandas:
import pandas as pd
test = pd.DataFrame([[1,2],[3,4],[5,6]])
test # Default column names are 0, 1
test[0] # This is column 0
test.iloc[:, 0] # This is COLUMN 0 selected by position, returned as a Series
test.columns=(['S','Q']) # Column names are easier to use
test #Column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test # results stored in DataFrame
# For more complicated stuff, try apply, as in Python pandas apply on more columns :
def toyfun(df):
    return df['S'] - df['Q']**2

test['out2'] = test[['S','Q']].apply(toyfun, axis=1)
# You can also define the column names when you generate the DataFrame:
test2 = pd.DataFrame([[1,2],[3,4],[5,6]],columns = (list('AB')))
I have this code using Pandas in Python:
all_data = {}
for ticker in ['FIUIX', 'FSAIX', 'FSAVX', 'FSTMX']:
    all_data[ticker] = web.get_data_yahoo(ticker, '1/1/2010', '1/1/2015')

prices = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
returns = prices.pct_change()
I know I can run a regression like this:
regs = sm.OLS(returns.FIUIX,returns.FSTMX).fit()
but how can I do this for each column in the dataframe? Specifically, how can I iterate over columns, in order to run the regression on each?
Specifically, I want to regress each other ticker symbol (FIUIX, FSAIX and FSAVX) on FSTMX, and store the residuals for each regression.
I've tried various versions of the following, but nothing I've tried gives the desired result:
resids = {}
for k in returns.keys():
    reg = sm.OLS(returns[k], returns.FSTMX).fit()
    resids[k] = reg.resid
Is there something wrong with the returns[k] part of the code? How can I use the k value to access a column? Or else is there a simpler approach?
for column in df:
    print(df[column])
You can use iteritems() (renamed to items() in recent pandas versions):
for name, values in df.iteritems():
    print('{name}: {value}'.format(name=name, value=values[0]))
This answer is to iterate over selected columns as well as all columns in a DF.
df.columns gives an Index containing all the column names in the DF. That isn't very helpful if you want to iterate over all the columns, but it comes in handy when you want to iterate over columns of your choosing only.
We can easily use Python's list slicing to slice df.columns according to our needs. For example, to iterate over all columns but the first one, we can do:
for column in df.columns[1:]:
    print(df[column])
Similarly to iterate over all the columns in reversed order, we can do:
for column in df.columns[::-1]:
    print(df[column])
We can iterate over all the columns in a lot of cool ways using this technique. Also remember that you can get the indices of all columns easily using:
for ind, column in enumerate(df.columns):
    print(ind, column)
You can index dataframe columns by position using iloc (the old .ix accessor has been removed from pandas).
df1.iloc[:, 1]
This returns the second column, for example (0 would be the first).
df1.iloc[0]
This returns the first row.
df1.iloc[0, 1]
This is the value at the intersection of row 0 and column 1.
And so on. So you can enumerate() returns.keys() and use the number to index the dataframe, as sketched below.
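A rough sketch of that idea, reusing the returns frame and sm from the question:
resids = {}
for i, name in enumerate(returns.keys()):
    col = returns.iloc[:, i]  # same column as returns[name], selected by position
    resids[name] = sm.OLS(col, returns.FSTMX).fit().resid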
A workaround is to transpose the DataFrame and iterate over the rows.
for column_name, column in df.transpose().iterrows():
    print(column_name)
Using list comprehension, you can get all the columns names (header):
[column for column in df]
Based on the accepted answer, if an index corresponding to each column is also desired:
for i, column in enumerate(df):
    print(i, df[column])
The above df[column] type is Series, which can simply be converted into numpy ndarrays:
for i, column in enumerate(df):
    print(i, np.asarray(df[column]))
I'm a bit late, but here's how I did this. The steps:
1. Create a list of all columns.
2. Use itertools to take x-column combinations.
3. Append each result's R-squared value to a results dataframe, along with the list of excluded columns.
4. Sort the result DF in descending order of R-squared to see which is the best fit.
This is the code I used on a DataFrame called aft_tmt. Feel free to extrapolate to your use case.
import pandas as pd
# setting options to print without truncating output
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
import statsmodels.formula.api as smf
import itertools
# This section gets the column names of the DF and removes some columns which I don't want to use as predictors.
itercols = aft_tmt.columns.tolist()
itercols.remove("sc97")
itercols.remove("sc")
itercols.remove("grc")
itercols.remove("grc97")
print(itercols)
len(itercols)
# results DF
regression_res = pd.DataFrame(columns = ["Rsq", "predictors", "excluded"])
# excluded cols
exc = []
# change 9 to the number of columns you want to combine from N columns.
#Possibly run an outer loop from 0 to N/2?
for x in itertools.combinations(itercols, 9):
    lmstr = "+".join(x)
    m = smf.ols(formula="sc ~ " + lmstr, data=aft_tmt)
    f = m.fit()
    exc = [item for item in x if item not in itercols]
    regression_res = regression_res.append(pd.DataFrame([[f.rsquared, lmstr, "+".join([y for y in itercols if y not in list(x)])]], columns=["Rsq", "predictors", "excluded"]))
regression_res.sort_values(by="Rsq", ascending = False)
I landed on this question as I was looking for a clean iterator of columns only (Series, no names).
Unless I am mistaken, there is no such thing, which, if true, is a bit annoying. In particular, one would sometimes like to assign a few individual columns (Series) to variables, e.g.:
x, y = df[['x', 'y']] # does not work
There is df.items() that gets close, but it gives an iterator of tuples (column_name, column_series). Interestingly, there is a corresponding df.keys() which returns df.columns, i.e. the column names as an Index, so a, b = df[['x', 'y']].keys() assigns properly a='x' and b='y'. But there is no corresponding df.values(), and for good reason, as df.values is a property and returns the underlying numpy array.
One (inelegant) way is to do:
x, y = (v for _, v in df[['x', 'y']].items())
but it's less pythonic than I'd like.
Most of these answers are going via the column name, rather than iterating the columns directly. They will also have issues if there are multiple columns with the same name. If you want to iterate the columns, I'd suggest:
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    ...
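For example, with a made-up frame that has a duplicated column name, label-based access returns both columns at once while positional iteration yields each one separately:
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'a'])

# df['a'] would return a two-column DataFrame, but positional iteration
# yields each column as its own Series
for series in (df.iloc[:, i] for i in range(df.shape[1])):
    print(series.tolist())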
Assuming X features and a y label (multi-column):
columns = [c for c in _df.columns if c in ['col1', 'col2','col3']] #or '..c not in..'
_df.set_index(columns, inplace=True)
print( _df.index)
X, y = _df.iloc[:,:4].values, _df.index.values