python pandas column with averages [duplicate] - python

This question already has an answer here:
Aggregation over Partition - pandas Dataframe
(1 answer)
Closed 7 months ago.
I have a dataframe with locations in column "A" and values in column "B". Locations occur multiple times in this DataFrame; I'd now like to add a third column in which I store the average of the column "B" values that share the same location in column "A".
-I know .mean() can be used to get an average
-I know how to filter with .loc()
I could make a list of all unique values in column A and compute the average for each of them with a for loop. However, this seems cumbersome to me. Any idea how this can be done more efficiently?

Sounds like what you need is GroupBy. Take a look here
Given
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5],
                   'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])
You can use
df.groupby('A').mean()
to group the values based on the common values in column "A" and find the mean.
Output:
     B         C
A
1  3.0  1.333333
2  4.0  1.500000
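If you want those group means back as a third column aligned with every row (which is what the question actually asks for), here is a minimal sketch using transform; the column name B_mean is only for illustration:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
                   'B': [np.nan, 2, 3, 4, 5]})

# broadcast each group's mean of B back onto every row of that group
df['B_mean'] = df.groupby('A')['B'].transform('mean')
print(df)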

I could make a list of all unique values in column A, and compute the
average for all of them by making a for loop.
This can be done using pandas.DataFrame.groupby; consider the following simple example:
import pandas as pd
df = pd.DataFrame({"A":["X","Y","Y","X","X"],"B":[1,3,7,10,20]})
means = df.groupby('A').agg('mean')
print(means)
gives the output
           B
A
X  10.333333
Y   5.000000

import pandas as pd
data = {'A': ['a', 'a', 'b', 'c'], 'B': [32, 61, 40, 45]}
df = pd.DataFrame(data)
df2 = df.groupby(['A']).mean()
print(df2)
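For reference, this prints the per-group means of B, which can be checked by hand from the sample data ((32 + 61) / 2 = 46.5 for 'a'):
      B
A
a  46.5
b  40.0
c  45.0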

Based on your description, I'm not sure if you are trying to simply calculate the averages for each group, or if you are wanting to maintain the long format of your data. I'll break down a solution for each option.
The data I'll use below can be generated by running the following...
import pandas as pd
df = pd.DataFrame([['group1', 2],
                   ['group2', 4],
                   ['group1', 5],
                   ['group2', 2],
                   ['group1', 2],
                   ['group2', 0]], columns=['A', 'B'])
Option 1 - Calculate Group Averages
This one is super simple. It uses the .groupby method, which is the bread and butter of crunching data in pandas.
df.groupby('A').B.mean()
Output:
A
group1    3.0
group2    2.0
If you wish for this to return a dataframe instead of a series, you can add .to_frame() to the end of the above line.
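For example, the same aggregation returned as a one-column DataFrame rather than a Series:
df.groupby('A').B.mean().to_frame()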
Option 2 - Calculate Group Averages and Maintain Long Format
By long format, I mean you want your data to be structured the same as it is currently, but with a third column (we'll call it C) containing a mean that is connected to the A column, i.e....
A        B  C (average)
group1   2  3
group2   4  2
group1   5  3
group2   2  2
group1   2  3
group2   0  2
Where the averages for each group are...
group1 = (2+5+2)/3 = 3
group2 = (4+2+0)/3 = 2
The most efficient solution would be to use .transform, which behaves like an SQL window function, but I think this method can be a little confusing when you're new to pandas.
import numpy as np
df.assign(C=df.groupby('A').B.transform(np.mean))
A less efficient, but more beginner friendly option would be to store the averages in a dictionary and then map each row to the group average.
I find myself using this option a lot for modeling projects, when I want to impute a historical average rather than the average of my sampled data.
To accomplish this, you can...
Create a dictionary containing the grouped averages
For every row in the dataframe, pass the group name into the dictionary
# Create the group averages
group_averages = df.groupby('A').B.mean().to_dict()
# For every row, pass the group name into the dictionary
new_column = df.A.map(group_averages)
# Add the new column to the dataframe
df = df.assign(C=new_column)
You can also, optionally, do all of this in a single line:
df = df.assign(C=df.A.map(df.groupby('A').B.mean().to_dict()))

Related

Map a dataframe to a column of cartesian products by column name

Note: "Cartesian product" might not be the right term, since we are working with data, not sets. It is more like a "free product" or "words".
There is more than one way to turn a dataframe into a list of lists; here is one way. In that case, the list of lists actually represents a list of columns, where the list index is the row index.
What I want to do, is take a data frame, select specific columns by name, then produce a new list where the inner lists are cartesian products of the elements from the selected columns. A simplified example is given here:
import pandas as pd
df = pd.DataFrame([[1,2,3],[3,4,5]])
magicMap(df)
df = [[1,3],[2,4],[3,5]]
With column names:
df # full of columns with names
magicMap(df, listOfCollumnNames)
df = [[c1r1,c2r1...],[c1r2, c2r2....], [c1r3, c2r3....]...]
Note: "cirj" is column i row j.
Is there a simple way to do this?
The code
import pandas as pd
df = pd.DataFrame([[1,2,3],[3,4,5]])
df2 = df.transpose()
goes from df
   0  1  2
0  1  2  3
1  3  4  5
to df2
   0  1
0  1  3
1  2  4
2  3  5
It looks like what you need is
df2.values.tolist()
[[1, 3], [2, 4], [3, 5]]
and to get the columns in the order you want, use df3 = df2.reindex(columns=column_names), where column_names is the order you want.
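Putting those pieces together, here is a minimal sketch of the "magicMap" helper the question describes; the function name and its column_names parameter are only illustrative:
import pandas as pd

def magic_map(df, column_names):
    # keep only the requested columns, in the requested order,
    # then turn each of those columns into an inner list
    return df[column_names].T.values.tolist()

df = pd.DataFrame([[1, 2, 3], [3, 4, 5]])
print(magic_map(df, [0, 1, 2]))  # [[1, 3], [2, 4], [3, 5]]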
You can also send the dataframe to a numpy array with:
df.T.to_numpy()
array([[1, 3],
       [2, 4],
       [3, 5]], dtype=int64)
If it must be a list, then use the other answer provided or use:
df.T.to_numpy().tolist()

Filter dataframe based on value_counts of other dataframe [duplicate]

I'm working in Python with a pandas DataFrame of video games, each with a genre. I'm trying to remove any video game with a genre that appears less than some number of times in the DataFrame, but I have no clue how to go about this. I did find a StackOverflow question that seems to be related, but I can't decipher the solution at all (possibly because I've never heard of R and my memory of functional programming is rusty at best).
Help?
Use groupby filter:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [12]: df
Out[12]:
   A  B
0  1  2
1  1  4
2  5  6

In [13]: df.groupby("A").filter(lambda x: len(x) > 1)
Out[13]:
   A  B
0  1  2
1  1  4
I recommend reading the split-apply-combine section of the docs.
A solution with better performance is GroupBy.transform with 'size', which gives the count per group as a Series the same length as the original df, so you can filter by boolean indexing:
df1 = df[df.groupby("A")['A'].transform('size') > 1]
Or use Series.map with Series.value_counts:
df1 = df[df['A'].map(df['A'].value_counts()) > 1]
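Both of these give the same rows as the filter example above; a quick sanity check on the same toy frame:
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

by_transform = df[df.groupby('A')['A'].transform('size') > 1]
by_map = df[df['A'].map(df['A'].value_counts()) > 1]

print(by_transform.equals(by_map))  # True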
#jezael's solution works very well. Here is a different approach to filtering based on value counts:
For example, if the dataset is :
df = pd.DataFrame({'a': [1,2,3,3,1,6], 'b': [11,2,33,4,55,6]})
Convert and save the counts as a dictionary:
count_freq = dict(df['a'].value_counts())
Create a new column by copying the target column, then map the dictionary onto that new column:
df['count_freq'] = df['a']
df['count_freq'] = df['count_freq'].map(count_freq)
Now we have a new column with the count frequency; you can define a threshold and filter easily with this column.
df[df.count_freq>1]
Additionally, in case one wants to filter and keep a 'count' column:
attr = 'A'
limit = 10
df2 = df.groupby(attr)[attr].agg(count='count')
df2 = df2.loc[df2['count'] > limit].reset_index()
print(df2)
#outputs rows with grouped 'A' count > 10 and columns ==> index, count, A
I might be a little late to this party but:
df = pd.DataFrame(df_you_have.groupby(['IdA', 'SomeOtherA'])['theA_you_want_to_count'].count())
df.reset_index(inplace=True)
This is how you create a new dataframe and then just filter it...
df[df['A']>100]

Converting a pandas series with dictionaries inside into dataframes, then append to the original dataframe

I have a dataframe wherein one of its columns has dictionaries inside of it (1 cell = 1 dictionary).
I would like for the key,value pair dictionaries to be two separate columns, and then append them to my original dataframe.
Here's the sample of my dataframe:
For example, I want LUSH ASIA LIMITED to be in one column and 1000 to be in another column. I understand that the values of my other columns will repeat when my dictionaries "explode".
I'm not sure if my logic is right, but the idea is to make each dictionary into a new dataframe, then just join them into the original dataframe. I'm not sure though how to do it.
Any advice? Thank you. :)
I think the idea is correct. Maybe this solution can help you. I am directly creating columns with dictionaries. If the columns were actually strings, you may need ast.literal_eval to turn your columns to actual dictionaries.
import pandas as pd
df = pd.DataFrame([[1,2], [3,4]], columns=['a', 'b'])
# create the dictionary columns - note one dictionary has 1 item only
df['c'] = [{'c': 5, 'e':7}, {'d': 6,}]
# pd.Series is a function that will directly split lists into different columns
# but to use it we must turn the dictionary into a list
# turn the items into a list by unpacking with [*x.items()]
# then flatten that list: for each sublist, take out the elements
df_new = df['c'].apply(lambda x: [l for s in [*x.items()] for l in s]).apply(pd.Series)
# you can then join the dataframes and drop the column with dictionaries if you want
final_df = df.join(df_new).drop('c', axis=1)
Input:
   a  b                 c
0  1  2  {'c': 5, 'e': 7}
1  3  4          {'d': 6}
Output: here the [0, 1, 2, 3] columns came from the step where we applied pd.Series. Note how NaN values were also introduced.
   a  b  0  1    2    3
0  1  2  c  5    e  7.0
1  3  4  d  6  NaN  NaN
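As mentioned above, if the dictionary column arrived as strings (for example after reading a CSV), here is a small sketch of converting it first with ast.literal_eval; the column name 'c' matches the toy example:
import ast
import pandas as pd

df = pd.DataFrame({'c': ["{'c': 5, 'e': 7}", "{'d': 6}"]})
# parse each string representation into a real dict before exploding it
df['c'] = df['c'].apply(ast.literal_eval)
print(df['c'].iloc[0])  # {'c': 5, 'e': 7}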

Loop through different Pandas Dataframes

I'm new to Python and have what is probably a basic question.
I have imported a number of Pandas Dataframes consisting of stock data for different sectors. So all columns are the same, just with different dataframe names.
I need to do a lot of different small operations on some of the columns, and I can figure out how to do it on one Dataframe at a time, but I need to figure out how to loop over the different frames and do the same operations on each.
For example, for one DF I do:
ConsumerDisc['IDX_EST_PRICE_BOOK']=1/ConsumerDisc['IDX_EST_PRICE_BOOK']
ConsumerDisc['IDX_EST_EV_EBITDA']=1/ConsumerDisc['IDX_EST_EV_EBITDA']
ConsumerDisc['INDX_GENERAL_EST_PE']=1/ConsumerDisc['INDX_GENERAL_EST_PE']
ConsumerDisc['EV_TO_T12M_SALES']=1/ConsumerDisc['EV_TO_T12M_SALES']
ConsumerDisc['CFtoEarnings']=ConsumerDisc['CASH_FLOW_PER_SH']/ConsumerDisc['TRAIL_12M_EPS']
And instead of just copying and pasting this code for the next 10 sectors, I want to do it in a loop somehow, but I can't figure out how to access the df via a variable, e.g.:
CS = ['ConsumerDisc']
CS['IDX_EST_PRICE_BOOK'] = 1/CS['IDX_EST_PRICE_BOOK']
so I could just create a list of df names and loop through it.
Hope you can give a small example as how to do this.
You're probably looking for something like this
for df in (df1, df2, df3):
    df['IDX_EST_PRICE_BOOK'] = 1/df['IDX_EST_PRICE_BOOK']
    df['IDX_EST_EV_EBITDA'] = 1/df['IDX_EST_EV_EBITDA']
    df['INDX_GENERAL_EST_PE'] = 1/df['INDX_GENERAL_EST_PE']
    df['EV_TO_T12M_SALES'] = 1/df['EV_TO_T12M_SALES']
    df['CFtoEarnings'] = df['CASH_FLOW_PER_SH']/df['TRAIL_12M_EPS']
Here we're iterating over the dataframes that we've put in a tuple data structure. Does that make sense?
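If you would rather keep the sector names around while looping (closer to the question's wish to loop over a list of names), here is a hedged sketch using a dict of dataframes; Energy is only a hypothetical second sector frame:
sectors = {'ConsumerDisc': ConsumerDisc, 'Energy': Energy}  # Energy is hypothetical

for name, sector_df in sectors.items():
    # invert the valuation ratios for this sector frame
    for col in ['IDX_EST_PRICE_BOOK', 'IDX_EST_EV_EBITDA',
                'INDX_GENERAL_EST_PE', 'EV_TO_T12M_SALES']:
        sector_df[col] = 1 / sector_df[col]
    sector_df['CFtoEarnings'] = sector_df['CASH_FLOW_PER_SH'] / sector_df['TRAIL_12M_EPS']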
Do you mean something like this?
import pandas as pd
d = {'a' : pd.Series([1, 2, 3, 10]), 'b' : pd.Series([2, 2, 6, 8])}
z = {'d' : pd.Series([4, 2, 3, 1]), 'e' : pd.Series([21, 2, 60, 8])}
df = pd.DataFrame(d)
zf = pd.DataFrame(z)
df.head()
    a  b
0   1  2
1   2  2
2   3  6
3  10  8
df = df.apply(lambda x: 1/x)
df.head()
          a         b
0  1.000000  0.500000
1  0.500000  0.500000
2  0.333333  0.166667
3  0.100000  0.125000
You have more functions, so you can create a function and then just apply it to each DataFrame. Alternatively, you could apply these lambda functions to only specific columns. So let's say you want to apply 1/column to every column but the last (going by your example, I am assuming it is at the end), you could do df.iloc[:, :-1].apply(lambda x: 1/x).

Equivalent of Pandas Factorize For Multiple Columns?

I have three binary-type columns of a dataframe whose values together constitute a meaningful grouping of the data. To refer to the group, I'm currently making a new column a hard-coded binary encoding like so:
data['type'] = data['a'] + 2 * data['b'] + 4 * data['c']
Pandas factorize will assign an integer for each distinct value of a sequence, but it doesn't seem to work with combinations of multiple columns. Is there a more general pandas function for situations like this? It would be nice if such a function generalized to K distinct categorical variables of arbitrary number of categories, rather than being limited to binary variables.
If such a thing doesn't exist, would there be interest in a pull request?
Here are two methods you can try:
df = pd.DataFrame({'a': [1, 1, 0],
                   'b': [0, 1, 0],
                   'c': [1, 1, 1]})
>>> df
   a  b  c
0  1  0  1
1  1  1  1
2  0  0  1
>>> ["".join(row) for row in df[['a', 'b', 'c']].values.astype(str)]
['101', '111', '001']
>>> [bytearray("".join(row), 'utf-8') for row in df[['a', 'b', 'c']].values.astype(str)]
[bytearray(b'101'), bytearray(b'111'), bytearray(b'001')]
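If what you want is a single integer code per distinct combination (closer in spirit to factorize), here is a hedged sketch using groupby(...).ngroup() rather than string concatenation:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 0],
                   'b': [0, 1, 0],
                   'c': [1, 1, 1]})

# one integer label per distinct (a, b, c) combination
df['type'] = df.groupby(['a', 'b', 'c']).ngroup()
print(df['type'].tolist())  # [1, 2, 0] with the default sorted group keys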
You may want to take a look at patsy which addresses things like categorical variable encoding and other model-related issues: see docs.
Patsy offers quite a few encoding schemes, including:
Treatment (default)
Backward difference coding
Orthogonal polynomial contrast coding
Deviation coding (also known as sum-to-zero coding), and
Helmert contrasts
