Is there an easy and straightforward way to load the output from sp.stats.describe() into a DataFrame, including the value names? It doesn't seem to be a dictionary or anything similar. Of course I can manually attach the relevant column names (see below), but I was wondering whether it is possible to load it directly into a DataFrame with named columns.
import pandas as pd
import scipy as sp
import scipy.stats  # import the submodule so that sp.stats is available
data = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5]})
a = sp.stats.describe(data['a'])
pd.DataFrame(a)
pd.DataFrame(a).transpose().rename(columns={0: 'N', 1: 'Min,Max',
2: 'Mean', 3: 'Var',
4: 'Skewness',
5: 'Kurtosis'})
You can use _fields for the column names from the named tuple:
a = sp.stats.describe(data['a'])
df = pd.DataFrame([a], columns=a._fields)
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
It is also possible to create a dictionary from the named tuple with _asdict:
d = sp.stats.describe(data['a'])._asdict()
df = pd.DataFrame([d], columns=d.keys())
print (df)
nobs minmax mean variance skewness kurtosis
0 5 (1, 5) 3.0 2.5 0.0 -1.3
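If you want the same summary for every column at once (this is not part of the original answer, just a sketch reusing the same data frame and imports from the question), you can collect one _asdict() per column:
rows = {col: sp.stats.describe(data[col])._asdict() for col in data.columns}
df_all = pd.DataFrame.from_dict(rows, orient='index')
print (df_all)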
I have a dataframe with week number as int, item name, and ranking.
For instance:
item_name ranking week_number
0 test 4 1
1 test 3 2
I'd like to add a new column with the ranking evolution since the last week.
The math is very simple:
df['ranking_evolution'] = ranking_previous_week - df['ranking']
It would only require exception handling for week 1.
But I'm not sure how to return the ranking previous week.
I could do it by iterating over the rows but I'm wondering if there is a cleaner way so I can just declare a column?
The issue is that I'd have to compare the dataframe to itself.
I've candidly tried:
df['ranking_evolution'] = df['ranking'].loc[(df['item_name'] == df['item_name']) & (df['week_number'] == df['week_number'] - 1)] - df['ranking']
But this returns NaN values.
Even using a copy returned NaN values.
I assume this is a simplified example; you probably have different products and maybe missing weeks? A robust way would be to perform a self-merge on week_number + 1:
(df.merge(df.assign(week_number=df['week_number']+1),
on=['item_name', 'week_number'],
suffixes=(None, '_evolution'),
how='left')
.assign(ranking_evolution=lambda d: d['ranking_evolution'].sub(d['ranking']))
)
Output:
item_name ranking week_number ranking_evolution
0 test 4 1 NaN
1 test 3 2 1.0
In short, try this code to see the trick.
import pandas as pd
data = {
'item_name': ['test', 'test', 'test', 'test', 'test', 'test', 'test', 'test', 'test', 'test'],
'ranking': [4, 3, 2, 1, 2, 3, 4, 5, 6, 7],
'week_number': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
}
df = pd.DataFrame(data)
df['ranking_evolution'] = df['ranking'].diff(-1)  # the trick: diff(-1) is this row's ranking minus the next row's ranking
print(df)
Results
item_name ranking week_number ranking_evolution
test 4 1 1.0
test 3 2 1.0
test 2 3 1.0
test 1 4 -1.0
I currently have a file where I create a hierarchy from the product and calculate the percentage split based on the previous level.
My code looks like this:
data = [['product1', 'product1a', 'product1aa', 10],
['product1', 'product1a', 'product1aa', 5],
['product1', 'product1a', 'product1aa', 15],
['product1', 'product1a', 'product1ab', 10],
['product1', 'product1a', 'product1ac', 20],
['product1', 'product1b', 'product1ba', 15],
['product1', 'product1b', 'product1bb',15],
['product2', 'product2_a', 'product2_aa', 30]]
df = pd.DataFrame(data, columns = ["Product_level1", "Product_Level2", "Product_Level3", "Qty"])
prod_levels = ["Product_level1", "Product_Level2", "Product_Level3"]
df = df.groupby(prod_levels).sum("Qty")
df["Qty ratio"] = df["Qty"] / df["Qty"].sum(level=prod_levels[-2])
print(df)
This gives me this as a result:
Qty Qty ratio
Product_level1 Product_Level2 Product_Level3
product1 product1a product1aa 30 0.500000
product1ab 10 0.166667
product1ac 20 0.333333
product1b product1ba 15 0.500000
product1bb 15 0.500000
product2 product2_a product2_aa 30 1.000000
According to my version of pandas (1.3.2), I'm getting a FutureWarning that level is deprecated and that I should use a groupby instead.
FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum()
Unfortunately, I cannot seem to figure out the correct syntax to get the same results using groupby, to make sure this will work with future versions of pandas. I've tried variations of what's below, but none worked.
df["Qty ratio"] = df.groupby(["Product_level1", "Product_Level2", "Product_Level3"]).sum("Qty") / df.groupby(level=prod_levels[-1]).sum("Qty")
Can anyone suggest how I could approach this?
Thank you
The level keyword on many functions was deprecated in pandas 1.3; see Deprecate: level parameter for aggregations in DataFrame and Series (#39983).
The following functions are affected:
any
all
count
sum
prod
max
min
mean
median
skew
kurt
sem
var
std
mad
The level argument was always rewritten internally into a groupby operation. For this reason, it was deprecated to increase clarity and reduce redundancy in the library.
The general pattern: whatever level argument was passed to the aggregation should be moved to groupby instead.
Sample Data:
import pandas as pd
df = pd.DataFrame(
{'A': [1, 1, 2, 2],
'B': [1, 2, 1, 2],
'C': [5, 6, 7, 8]}
).set_index(['A', 'B'])
C
A B
1 1 5
2 6
2 1 7
2 8
With aggregate over level:
df['C'].sum(level='B')
B
1 12
2 14
Name: C, dtype: int64
FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead.
This now becomes groupby over level:
df['C'].groupby(level='B').sum()
B
1 12
2 14
Name: C, dtype: int64
In this specific example:
df["Qty ratio"] = df["Qty"] / df["Qty"].sum(level=prod_levels[-2])
Becomes
df["Qty ratio"] = df["Qty"] / df["Qty"].groupby(level=prod_levels[-2]).sum()
(Just move the level argument to groupby.)
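Putting this together with the question's own data (a sketch, assuming the data list and column names defined in the question above):
import pandas as pd

df = pd.DataFrame(data, columns=["Product_level1", "Product_Level2", "Product_Level3", "Qty"])
prod_levels = ["Product_level1", "Product_Level2", "Product_Level3"]
df = df.groupby(prod_levels).sum()
# the deprecated df["Qty"].sum(level=prod_levels[-2]) becomes a groupby on that level
df["Qty ratio"] = df["Qty"] / df["Qty"].groupby(level=prod_levels[-2]).sum()
print(df)
This should reproduce the table from the question without the FutureWarning.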
Using a numpy random number generator, generate arrays of the height and weight of the 88,000 people living in Utah.
The average height is 1.75 metres and the average weight is 70 kg. Assume a standard deviation of 3.
Combine these two arrays using the column_stack method and convert the result into a pandas DataFrame with the first column named 'height' and the second column named 'weight'.
I've got the randomly generated data. However, I can't seem to convert the array to a DataFrame:
import numpy as np
import pandas as pd
height = np.round(np.random.normal(1.75, 3, 88000), 2)
weight = np.round(np.random.normal(70, 3, 88000), 2)
np_height = np.array(height)
np_weight = np.array(weight)
Utah = np.round(np.column_stack((np_height, np_weight)), 2)
print(Utah)
df = pd.DataFrame(
[[np_height],
[np_weight]],
index = [0, 1],
columns = ['height', 'weight'])
print(df)
You want 2 columns, yet you passed the data [[np_height], [np_weight]] as two rows of one column each. You can pass the data as a dict instead.
df = pd.DataFrame({'height':np_height,
'weight':np_weight},
columns = ['height', 'weight'])
print(df)
The data in Utah is already in a suitable shape. Why not use that?
import numpy as np
import pandas as pd
height = np.round(np.random.normal(1.75, 3, 88000), 2)
weight = np.round(np.random.normal(70, 3, 88000), 2)
np_height = np.array(height)
np_weight = np.array(weight)
Utah = np.round(np.column_stack((np_height, np_weight)), 2)
df = pd.DataFrame(
data=Utah,
columns=['height', 'weight']
)
print(df.head())
height weight
0 3.57 65.32
1 -0.15 66.22
2 5.65 73.11
3 2.00 69.59
4 2.67 64.95
There is already an answer that deals with a relatively simple dataframe that is given here.
However, the dataframe I have at hand has multiple columns and a large number of rows. One DataFrame contains three dataframes stacked along axis=0 (the bottom of one is attached to the top of the next), separated by rows of NaN values.
How can I split this single DataFrame into three along the NaN rows?
Like in the answer you linked, you want to create a column which identifies the group number; then you can apply the same solution.
To do so, you need to test whether all the values of a row are NaN. pandas can check whether a Series is entirely NaN, and applying that check row-wise (with axis=1, so each row is treated as the Series) gives exactly that:
df["group_no"] = df.isnull().all(axis=1).cumsum()
At that point you can use the same technique from that answer to split the dataframes.
You might want to do a .dropna() at the end, because you will still have the NaN rows in your result.
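For completeness, a minimal sketch of the full split (the small stand-in frame below is only for illustration; in your case df is the stacked frame described in the question):
import numpy as np
import pandas as pd

# small stand-in for the stacked frame described in the question
df = pd.concat([
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    pd.DataFrame({"a": [np.nan], "b": [np.nan]}),  # separator row of NaNs
    pd.DataFrame({"a": [5, 6], "b": [7, 8]}),
], ignore_index=True)

# the counter increases at every all-NaN row, numbering the blocks
group_no = df.isnull().all(axis=1).cumsum()

# split on the group number, then drop the NaN separator row from each piece
pieces = [g.dropna(how="all") for _, g in df.groupby(group_no)]
pieces = [g for g in pieces if not g.empty]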
Ran into this same question in 2022. Here's what I did to split dataframes on rows with NaNs; the caveat is that this relies on pip install python-rle for run-length encoding:
import numpy as np
import pandas as pd
import rle

def nanchucks(df):
    # It chucks NaNs outta dataframes
    # True if the row contains at least one NaN
    df_nans = pd.isnull(df).sum(axis="columns").astype(bool)
    values, counts = rle.encode(df_nans)
    df_nans = pd.DataFrame({"values": values, "counts": counts})
    df_nans["cum_counts"] = df_nans["counts"].cumsum()
    df_nans["start_idx"] = df_nans["cum_counts"].shift(1)
    df_nans.loc[0, "start_idx"] = 0
    df_nans["start_idx"] = df_nans["start_idx"].astype(int)  # np.nan makes it a float column
    df_nans["end_idx"] = df_nans["cum_counts"] - 1
    # Only keep the chunks of data w/o NaNs
    df_nans = df_nans[df_nans["values"] == False]
    indices = []
    for idx, row in df_nans.iterrows():
        indices.append((row["start_idx"], row["end_idx"]))
    return [df.loc[df.index[i[0]]: df.index[i[1]]] for i in indices]
Examples:
sample_df1 = pd.DataFrame({
"a": [1, 2, np.nan, 3, 4],
"b": [1, 2, np.nan, 3, 4],
"c": [1, 2, np.nan, 3, 4],
})
sample_df2 = pd.DataFrame({
"a": [1, 2, np.nan, 3, 4],
"b": [1, 2, 3, np.nan, 4],
"c": [1, 2, np.nan, 3, 4],
})
print(nanchucks(sample_df1))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 3 3.0 3.0 3.0
# 4 4.0 4.0 4.0]
print(nanchucks(sample_df2))
# [ a b c
# 0 1.0 1.0 1.0
# 1 2.0 2.0 2.0,
# a b c
# 4 4.0 4.0 4.0]
Need a simple example of calculating RMSE with a pandas DataFrame. Suppose there is a function that returns, in a loop, the true and the predicted value:
def fun(data):
    ...
    return trueVal, predVal

for data in set:
    fun(data)
And then some code puts these results in the following data frame where x is a real value and p is a predicted value:
In [20]: d
Out[20]: {'p': [1, 10, 4, 5, 5], 'x': [1, 2, 3, 4, 5]}
In [21]: df = pd.DataFrame(d)
In [22]: df
Out[22]:
p x
0 1 1
1 10 2
2 4 3
3 5 4
4 5 5
Questions:
1) How do I put the results from the fun function into the df data frame?
2) How do I calculate RMSE using the df data frame?
Question 1
This depends on the format data is in, and I'd expect you already have your true values, so this function is just a pass-through.
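If you still need to build df from the loop, one possibility (a sketch; dataset is a stand-in name for the question's set iterable) is to collect the pairs first and build the frame once:
import pandas as pd

# fun(data) returns (trueVal, predVal), so the columns are x (true) and p (predicted)
records = [fun(data) for data in dataset]
df = pd.DataFrame(records, columns=['x', 'p'])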
Question 2
With pandas
((df.p - df.x) ** 2).mean() ** .5
With numpy
(np.diff(df.values) ** 2).mean() ** .5
Question 1
I understand you already have a dataframe df. To add the new values in new rows do the following:
for data in set:
    trueVal, predVal = fun(data)
    auxDf = pd.DataFrame([[predVal, trueVal]], columns=['p', 'x'])
    df = df.append(auxDf, ignore_index=True)  # append returns a new frame (and is deprecated in newer pandas; pd.concat is the modern replacement)
Question 2
To calculate RMSE using df, I recommend you use the scikit-learn function.
from sklearn.metrics import mean_squared_error
realVals = df.x
predictedVals = df.p
mse = mean_squared_error(realVals, predictedVals)
# If you want the root mean squared error
# rmse = mean_squared_error(realVals, predictedVals, squared = False)
It's very important that you don't have null values in the columns; otherwise it won't work.
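For example, a possible cleanup step (not part of the original answer) before scoring:
# drop rows where either column is missing, then compute the RMSE directly
clean = df.dropna(subset=['x', 'p'])
rmse = mean_squared_error(clean.x, clean.p, squared=False)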