Python - Pandas: combination of unique rows and their statistics

I have been searching the web for a simple way, using Python/pandas, to get a dataframe consisting of only the unique rows and their basic stats (occurrences, mean, and so on) from an original dataframe.
So far my efforts have only come halfway:
I found how to get all the unique rows using
data.drop_duplicates
But then I'm not quite sure how to easily retrieve all the stats I want. I could do a for loop over a groupby, but that would be rather slow.
Another approach I thought of was to use groupby and then describe, e.g.,
data.groupby(allColumns)[columnImInterestedInForStats].describe()
But it turns out that, with 19 columns in allColumns, this only returns one row with no stats at all. Surprisingly, if I choose only a small subset for allColumns, I do get each unique combination of the subset and all their stats. My expectation was that passing all 19 columns to groupby() would give me all unique groups?
Data example:
df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3], list('AAABBBBABCBDDD'), ['1','3','3','2','4','2','5','3','6','3','5','1','1','1']]).T
df.columns = ['col1','col2','col3']
Desired result:
col2 col3  mean  count  and so on
A    1     1.1   1
     3     4.8   3
B    2     6.0   2
     4     2.5   1
     5     5.2   2
     6     3.4   1
C    3     3.4   1
D    1     5.5   3
into a dataframe.
I'm sure it must be something very trivial that I'm missing, but I can't find the proper answer. Thanks in advance.

You can achieve the desired effect using agg().
import pandas as pd
import numpy as np
df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6, 2.6, 3.4, 3.4, 2.6, 1.1, 1.1, 3.3],
                   list('AAABBBBABCBDDD'),
                   ['1', '3', '3', '2', '4', '2', '5', '3', '6', '3', '5', '1', '1', '1']]).T
df.columns = ['col1','col2','col3']
df['col1'] = df['col1'].astype(float)
df.groupby(['col2','col3'])['col1'].agg([np.mean,'count',np.max,np.min,np.median])
In place of 'col1' after df.groupby(...) you can pass a list of the columns you are interested in.
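As a small variation (a sketch, continuing from the df defined above and assuming a recent pandas version, where passing NumPy functions such as np.mean to agg() is discouraged), the string names of the aggregations work just as well:
df.groupby(['col2', 'col3'])['col1'].agg(['mean', 'count', 'max', 'min', 'median'])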

Related

How to assign values to columns in Pandas

I have a dataframe with the columns and data shown in the picture (the screenshot is not reproduced here), plus an array of values. How can I assign this data to the columns?
I tried different assignment operations, but they raise errors about the shape of the passed values. I expect the data (array) values to end up in the columns.
I assume that the problem you face is that there are many columns in the DataFrame and the array contains values for some but not all of these columns. Hence your problem with shape when you try to combine the two. What you need to do is define the column names for the values in the data array before combining the two. See the example code below, which forms another DataFrame with the correct column names and then finally joins things together.
import pandas as pd
df1 = pd.DataFrame({'a': [1.0, 2.0],
                    'b': [3.0, 5.0],
                    'c': [4.0, 7.0]})
data = [1.1, 2.1]
names = ['a', 'b']
df2 = pd.DataFrame({key: val for key, val in zip(names, data)}, index=[0])
df3 = pd.concat([df1, df2]).reset_index(drop=True)
print(df3)
this produces
     a    b    c
0  1.0  3.0  4.0
1  2.0  5.0  7.0
2  1.1  2.1  NaN
with NaN for the columns that were missing from the data to be added. You can change the NaN to any value you want using fillna.
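For example, to replace those NaN entries with 0 (assuming 0 is a sensible fill value for your data):
df3 = df3.fillna(0)
print(df3)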
Please refrain from adding code in the form of an image, it's hard to access. Here's an article explaining why.
Create a new dataframe (although you can choose to modify the existing one) with the column names and the data. I have inferred that your array is stored in a variable named data.
df_updated = pd.DataFrame(data, columns=df.columns)
Note: Thanks to Timus for suggesting the removal of redundant code.

How do I refer to an unnamed column in a query string in Pandas?

How do I refer to an unnamed DataFrame column in a query string when using pandas.DataFrame.query? I know I can refer to column names that are not valid Python variable names by surrounding them in backticks. However, that does not address unnamed columns.
For example, I would like to query for all rows in a DataFrame where an unnamed column contains a value greater than 0.5.
My code starts like:
import pandas as pd
import numpy as np
array=np.random.rand(10,3)
df=pd.DataFrame(array)
So far so good, but when I try to use pandas.DataFrame.query, what query string should I use to find rows where the value in the second column (which happens to be unnamed) is greater than 0.5?
The closest thing I can think of is
df.query('columns[1]>0.5')
which is flat-out wrong, because columns[1] returns the column label, 1, and does not reference the unnamed column.
I have looked at the Pandas documentation including
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query
https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-query
Any ideas?
There are ways to achieve what you are looking for:
Dummy DataFrame:
>>> df
          0         1         2
0  0.210862  0.894414  0.713472
1  0.804793  0.656390  0.842293
2  0.617104  0.763162  0.697050
3  0.158506  0.190683  0.740970
4  0.380092  0.984326  0.138277
5  0.665413  0.445192  0.525754
6  0.274770  0.870642  0.987045
7  0.619918  0.196403  0.221361
8  0.642992  0.572529  0.893655
9  0.101074  0.871377  0.130874
Solution:
Alternatively, when working with unnamed columns you can apply the condition across the whole DataFrame as follows. Keep in mind that this keeps non-matching values as NaN while displaying all the matching ones.
>>> df[df.iloc[:, df.columns] > 0.5]
          0         1         2
0       NaN  0.894414  0.713472
1  0.804793  0.656390  0.842293
2  0.617104  0.763162  0.697050
3       NaN       NaN  0.740970
4       NaN  0.984326       NaN
5  0.665413       NaN  0.525754
6       NaN  0.870642  0.987045
7  0.619918       NaN       NaN
8  0.642992  0.572529  0.893655
9       NaN  0.871377       NaN
Solution
Summary: Best options are given below. See further down for all other options.
df.query('@df[1] > 0.5')
df[df[1] > 0.5]
Unnamed columns in pandas are automatically named 0, 1, 2, ..., where these are numbers and not strings.
The following shows three main ways to achieve what you are looking for.
Option-1: Avoid renaming columns.
Option-1.1: using df.query('@df[1] > 0.5'). Here we use @df to specify that df is a variable.
Option-1.2: Here we use the other option df[df[1] > 0.5].
Option-2.x: Rename columns of the dataframe df by providing a dict: {0: 'A', 1: 'B', 2: 'C'}.
You can use df.query() in this case.
Option-3: Rename the columns of df using a dict-comprehension as C#, where # stands for the column number.
You can use df.query() in this case.
## Option-1: without renaming
# Option-1.1: with query
df.query('@df[1] > 0.5')
# Option-1.2: without using query
df[df[1] > 0.5]
## Option-2: rename columns (using a mapping provided manually)
# columns = {0: 'A', 1: 'B', 2: 'C'}
df = pd.DataFrame(arr).rename(columns={0: 'A', 1: 'B', 2: 'C'})
# Option-2.1
df[df['B'] > 0.5]
# Option-2.2
df[df.B > 0.5]
# Option-2.3
df.query('B > 0.5')
## Option-3: rename dynamically
df = pd.DataFrame(arr)
df = df.rename(columns=dict((x, 'C'+str(x)) for x in df.columns))
df.query('C1 > 0.5')
Output:
          0         1         2
3  0.413839  0.889178  0.564845
5  0.802746  0.941901  0.564068
6  0.904837  0.716764  0.151075
8  0.788026  0.749503  0.960260
Dummy Data
import pandas as pd
import numpy as np
arr = np.random.rand(10, 3)
df = pd.DataFrame(arr)
References
Docs: pandas.DataFrame.query
Stackoverflow: How to query a numerical column name in pandas?
Older Pandas Docs: multiindex-query-syntax v-13.0

Python: Standard Deviation within a list of dataframes

I have a list of, say, 50 dataframes, 'list1'; each dataframe has columns 'Speed' and 'Value', like this:
Speed  Value
1      12
2      17
3      19
4      21
5      25
I am trying to get the standard deviation of 'Value' for each speed, across all dataframes. The end goal is to get a list or df of the standard deviation for each speed, like this:
Speed  Standard Deviation
1      1.23
2      2.5
3      1.98
4      5.6
5      5.77
I've tried to pull the values into a new dataframe using a for loop, to then use 'statistics.stdev' on, but I can't seem to get it working. Any help is really appreciated!
Update!
pd.concat([d.set_index('Speed') for d in df_power], axis=1).std(1)
This worked. Although, I forgot to mention that the values for Speed are not always the same between dataframes; some dataframes are missing a few, and this ends up returning NaN in those instances.
You can concat and use std:
list_dfs = [df1, df2, df3, ...]
pd.concat([d.set_index('Speed') for d in list_dfs], axis=1).std(1)
You'll want to concatenate, groupby speed, and take the standard deviation.
1) Concatenate your dataframes
list1 = [df_1, df_2, ...]
full_df = pd.concat(list1, axis=0) # stack all dataframes
2) Groupby speed and take the standard deviation
std_per_speed_df = full_df.groupby('Speed')[['Value']].std()
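A minimal sketch of this approach, using made-up dataframes (hypothetical names df_1, df_2, df_3) where one of them is missing a speed; the groupby simply computes the std over whatever rows exist for each speed, so no NaN rows are produced:
import pandas as pd

df_1 = pd.DataFrame({'Speed': [1, 2, 3], 'Value': [12, 17, 19]})
df_2 = pd.DataFrame({'Speed': [1, 3], 'Value': [14, 18]})  # Speed 2 missing here
df_3 = pd.DataFrame({'Speed': [1, 2, 3], 'Value': [13, 16, 21]})

full_df = pd.concat([df_1, df_2, df_3], axis=0)  # stack all dataframes
std_per_speed_df = full_df.groupby('Speed')[['Value']].std()
print(std_per_speed_df)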
If the dataframes are all saved in the same folder, you can use pd.concat + groupby as already suggested, or you can use dask:
import dask.dataframe as dd
import pandas as pd
df = dd.read_csv("data/*")
out = df.groupby("Speed")["Value"].std()\
        .compute()\
        .reset_index(name="Standard Deviation")

pandas transform with NaN values in grouped columns [duplicate]

I have a DataFrame with many missing values in columns which I wish to groupby:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
In [4]: df.groupby('b').groups
Out[4]: {'4': [0], '6': [2]}
See that pandas has dropped the rows with NaN target values. (I want to include these rows!)
Since I need many such operations (many cols have missing values), and use more complicated functions than just medians (typically random forests), I want to avoid writing too complicated pieces of code.
Any suggestions? Should I write a function for this or is there a simple solution?
pandas >= 1.1
From pandas 1.1 you have better control over this behavior; NA values are now allowed in the grouper using dropna=False:
pd.__version__
# '1.1.0.dev0+2004.g8d10bfb6f'
# Example from the docs
df
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# without NA (the default)
df.groupby('b').sum()
     a  c
b
1.0  2  3
2.0  2  5

# with NA
df.groupby('b', dropna=False).sum()
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4
This is mentioned in the Missing Data section of the docs:
NA groups in GroupBy are automatically excluded. This behavior is consistent with R
One workaround is to use a placeholder before doing the groupby (e.g. -1):
In [11]: df.fillna(-1)
Out[11]:
   a   b
0  1   4
1  2  -1
2  3   6

In [12]: df.fillna(-1).groupby('b').sum()
Out[12]:
    a
b
-1  2
 4  1
 6  3
That said, this feels like a pretty awful hack... perhaps there should be an option to include NaN in groupby (see this github issue, which uses the same placeholder hack).
However, as described in another answer, "from pandas 1.1 you have better control over this behavior, NA values are now allowed in the grouper using dropna=False"
Ancient topic, but if someone still stumbles over this: another workaround is to convert to string via .astype(str) before grouping. That will preserve the NaNs.
df = pd.DataFrame({'a': ['1', '2', '3'], 'b': ['4', np.NaN, '6']})
df['b'] = df['b'].astype(str)
df.groupby(['b']).sum()
     a
b
4    1
6    3
nan  2
I am not able to add a comment to M. Kiewisch since I do not have enough reputation points (only have 41 but need more than 50 to comment).
Anyway, I just want to point out that M. Kiewisch's solution does not work as is and may need more tweaking. Consider for example
>>> df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.NaN, 6, 4]})
>>> df
   a    b
0  1  4.0
1  2  NaN
2  3  6.0
3  5  4.0
>>> df.groupby(['b']).sum()
     a
b
4.0  6
6.0  3
>>> df.astype(str).groupby(['b']).sum()
      a
b
4.0  15
6.0   3
nan   2
which shows that for group b=4.0, the corresponding value is 15 instead of 6. Here it is just concatenating 1 and 5 as strings instead of adding them as numbers.
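A less error-prone variant (just a sketch, assuming you only need the grouping key as a string) is to convert only the groupby column, so the numeric columns still sum as numbers:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 5], 'b': [4, np.nan, 6, 4]})
# group by a string version of b, but leave the numeric column 'a' untouched
print(df.groupby(df['b'].astype(str))[['a']].sum())
#      a
# b
# 4.0  6
# 6.0  3
# nan  2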
All answers provided thus far result in potentially dangerous behavior as it is quite possible you select a dummy value that is actually part of the dataset. This is increasingly likely as you create groups with many attributes. Simply put, the approach doesn't always generalize well.
A less hacky solution is to use DataFrame.drop_duplicates() to create a unique index of value combinations, each with its own ID, and then group on that ID. It is more verbose, but it does get the job done:
def safe_groupby(df, group_cols, agg_dict):
    # set name of group col to unique value
    group_id = 'group_id'
    while group_id in df.columns:
        group_id += 'x'
    # get final order of columns
    agg_col_order = (group_cols + list(agg_dict.keys()))
    # create unique index of grouped values
    group_idx = df[group_cols].drop_duplicates()
    group_idx[group_id] = np.arange(group_idx.shape[0])
    # merge unique index on dataframe
    df = df.merge(group_idx, on=group_cols)
    # group dataframe on group id and aggregate values
    df_agg = df.groupby(group_id, as_index=True)\
               .agg(agg_dict)
    # merge grouped value index to results of aggregation
    df_agg = group_idx.set_index(group_id).join(df_agg)
    # rename index
    df_agg.index.name = None
    # return reordered columns
    return df_agg[agg_col_order]
Note that you can now simply do the following:
from collections import OrderedDict

data_block = [np.tile([None, 'A'], 3),
              np.repeat(['B', 'C'], 3),
              [1] * (2 * 3)]
col_names = ['col_a', 'col_b', 'value']
test_df = pd.DataFrame(data_block, index=col_names).T
grouped_df = safe_groupby(test_df, ['col_a', 'col_b'],
                          OrderedDict([('value', 'sum')]))
This will return the successful result without having to worry about overwriting real data that is mistaken as a dummy value.
One small point about Andy Hayden's solution: it doesn't work (anymore?) because np.nan == np.nan yields False, so the replace function doesn't actually do anything.
What worked for me was this:
df['b'] = df['b'].apply(lambda x: x if not np.isnan(x) else -1)
(At least that's the behavior for Pandas 0.19.2. Sorry to add it as a different answer, I do not have enough reputation to comment.)
I answered this already, but for some reason the answer was converted to a comment. Nevertheless, this is the most efficient solution:
Not being able to include (and propagate) NaNs in groups is quite aggravating. Citing R is not convincing, as this behavior is not consistent with a lot of other things. Anyway, the dummy hack is also pretty bad. However, the size (includes NaNs) and the count (ignores NaNs) of a group will differ if there are NaNs.
dfgrouped = df.groupby(['b']).a.agg(['sum','size','count'])
dfgrouped['sum'][dfgrouped['size']!=dfgrouped['count']] = None
When these differ, you can set the value back to None for the result of the aggregation function for that group.
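A short sketch of that idea (here the NaN sits in the aggregated column a rather than in the grouping key, and .loc is used to avoid chained assignment):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3, 5], 'b': [4, 4, 6, 4]})
dfgrouped = df.groupby(['b']).a.agg(['sum', 'size', 'count'])
# group b=4 has size 3 but count 2, so its sum silently ignored a NaN;
# set the sum back to None so the missing value propagates
dfgrouped.loc[dfgrouped['size'] != dfgrouped['count'], 'sum'] = None
print(dfgrouped)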

Loop through different Pandas Dataframes

I'm new to Python, and have what is probably a basic question.
I have imported a number of Pandas Dataframes consisting of stock data for different sectors. So all columns are the same, just with different dataframe names.
I need to do a lot of different small operations on some of the columns, and I can figure out how to do it on one Dataframe at a time, but I need to figure out how to loop over the different frames and do the same operations on each.
For example for one DF i do:
ConsumerDisc['IDX_EST_PRICE_BOOK']=1/ConsumerDisc['IDX_EST_PRICE_BOOK']
ConsumerDisc['IDX_EST_EV_EBITDA']=1/ConsumerDisc['IDX_EST_EV_EBITDA']
ConsumerDisc['INDX_GENERAL_EST_PE']=1/ConsumerDisc['INDX_GENERAL_EST_PE']
ConsumerDisc['EV_TO_T12M_SALES']=1/ConsumerDisc['EV_TO_T12M_SALES']
ConsumerDisc['CFtoEarnings']=ConsumerDisc['CASH_FLOW_PER_SH']/ConsumerDisc['TRAIL_12M_EPS']
And instead of just copying and pasting this code for the next 10 sectors, I want to do it in a loop somehow, but I can't figure out how to access the df via a variable, e.g.:
CS=['ConsumerDisc']
CS['IDX_EST_PRICE_BOOK']=1/CS['IDX_EST_PRICE_BOOK']
so I could just create a list of df names and loop through it.
Hope you can give a small example of how to do this.
You're probably looking for something like this
for df in (df1, df2, df3):
    df['IDX_EST_PRICE_BOOK'] = 1/df['IDX_EST_PRICE_BOOK']
    df['IDX_EST_EV_EBITDA'] = 1/df['IDX_EST_EV_EBITDA']
    df['INDX_GENERAL_EST_PE'] = 1/df['INDX_GENERAL_EST_PE']
    df['EV_TO_T12M_SALES'] = 1/df['EV_TO_T12M_SALES']
    df['CFtoEarnings'] = df['CASH_FLOW_PER_SH']/df['TRAIL_12M_EPS']
Here we're iterating over the dataframes that we've put in a tuple data structure; does that make sense?
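If you also want to keep referring to each sector by name (as your CS = ['ConsumerDisc'] attempt suggests), a dict of DataFrames is a common alternative; a minimal sketch with made-up data and only a couple of the columns:
import pandas as pd

# hypothetical stand-ins for the sector DataFrames
ConsumerDisc = pd.DataFrame({'IDX_EST_PRICE_BOOK': [2.0, 4.0],
                             'CASH_FLOW_PER_SH': [3.0, 6.0],
                             'TRAIL_12M_EPS': [1.5, 2.0]})
Energy = ConsumerDisc.copy()

sectors = {'ConsumerDisc': ConsumerDisc, 'Energy': Energy}
for name, frame in sectors.items():
    frame['IDX_EST_PRICE_BOOK'] = 1 / frame['IDX_EST_PRICE_BOOK']
    frame['CFtoEarnings'] = frame['CASH_FLOW_PER_SH'] / frame['TRAIL_12M_EPS']

print(sectors['ConsumerDisc'])  # the original objects are modified in place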
Do you mean something like this?
import pandas as pd
d = {'a' : pd.Series([1, 2, 3, 10]), 'b' : pd.Series([2, 2, 6, 8])}
z = {'d' : pd.Series([4, 2, 3, 1]), 'e' : pd.Series([21, 2, 60, 8])}
df = pd.DataFrame(d)
zf = pd.DataFrame(z)
df.head()
    a  b
0   1  2
1   2  2
2   3  6
3  10  8
df = df.apply(lambda x: 1/x)
df.head()
          a         b
0  1.000000  0.500000
1  0.500000  0.500000
2  0.333333  0.166667
3  0.100000  0.125000
You have more functions, so you can create a function and then just apply that to each DataFrame. Alternatively, you could apply these lambda functions to only specific columns. So let's say you want to apply 1/column to every column but the last (going by your example, I am assuming it is at the end); you could do df.iloc[:, :-1].apply(lambda x: 1/x).
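As a sketch of the function-then-apply idea mentioned above (continuing from the df and zf defined at the start of this answer; the column lists are chosen purely for illustration):
def invert_columns(frame, cols):
    # take the reciprocal of the listed columns, leaving the rest untouched
    frame[cols] = 1 / frame[cols]
    return frame

df = invert_columns(df, ['a', 'b'])
zf = invert_columns(zf, ['d', 'e'])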
