I have a large data frame with 450 columns and 550,000 rows.
Among the columns I have:
73 float columns
30 date columns
the remaining columns as object
I would like to describe my variables, but not only with the usual describe(); I also want to include other descriptions in the same matrix. In the end, I want one description matrix covering all 450 variables, with a detailed description of:
- dtype
- count
- count null values
- % number of null values
- max
- min
- 50%
- 75%
- 25%
- ......
For now, I just have a basic call that describes my data like this:
DataFrame.describe(include='all')
Do you have a function or method to do this more extensive description?
Thanks.
You need to write custom statistics and add them as rows to the final describe DataFrame:
Notice:
The first row of the final df is count, which counts the non-NaN values.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, np.nan, np.nan, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
print (df)
A B C D E F
0 a 4.0 7 1 5 a
1 b NaN 8 3 3 a
2 c NaN 9 5 6 a
3 d 5.0 4 7 9 b
4 e 5.0 2 1 2 b
5 f 4.0 3 0 4 b
df1 = df.describe(include='all')
df1.loc['dtype'] = df.dtypes              # dtype of each column
df1.loc['size'] = len(df)                 # total number of rows
df1.loc['% count'] = df.isnull().mean()   # fraction of NaN values per column
print (df1)
A B C D E F
count 6 4 6 6 6 6
unique 6 NaN NaN NaN NaN 2
top e NaN NaN NaN NaN b
freq 1 NaN NaN NaN NaN 3
mean NaN 4.5 5.5 2.83333 4.83333 NaN
std NaN 0.57735 2.88097 2.71416 2.48328 NaN
min NaN 4 2 0 2 NaN
25% NaN 4 3.25 1 3.25 NaN
50% NaN 4.5 5.5 2 4.5 NaN
75% NaN 5 7.75 4.5 5.75 NaN
max NaN 5 9 7 9 NaN
dtype object float64 int64 int64 int64 object
size 6 6 6 6 6 6
% count 0 0.333333 0 0 0 0
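Building on the same idea, here is a small sketch (using the df and df1 from above) that also adds the null-count and null-percentage rows the question asks for; the row labels count null and % null are just examples:
df1.loc['count null'] = df.isnull().sum()      # number of NaN values per column
df1.loc['% null'] = df.isnull().mean() * 100   # percentage of NaN values per column
print (df1)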
In pandas there is no single alternative to describe(), but since it clearly isn't displaying all the values you need, you can use its various parameters accordingly.
By default, describe() on a DataFrame only summarizes numeric columns. If you think you have a numeric variable and it doesn't show up in describe(), change the type with:
df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)
You could also create new columns for handling the numeric part of a mixed-type column, or convert strings to numbers using a dictionary and the map() function.
describe() on a non-numeric Series will give you some statistics (like count, unique and the most frequently-occurring value).
To call describe() on just the objects (strings) use describe(include = ['O']).
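As a rough sketch of those suggestions (the columns col1, col2 and grade and the mapping dict are hypothetical, not from the question):
import pandas as pd

# hypothetical example data, only to illustrate the suggestions above
df = pd.DataFrame({'col1': ['1', '2', '3'],
                   'col2': ['4.5', '6', '7'],
                   'grade': ['low', 'high', 'medium']})

df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)              # force numeric dtype
df['grade_num'] = df['grade'].map({'low': 0, 'medium': 1, 'high': 2})  # strings -> numbers
print(df.describe(include=['O']))   # summary of the object (string) columns only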
I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save, in tdf['b'], the value from df2 at each index found by idxmax().
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2.iloc[2]. I expect the final result to be like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this using the .loc indexer like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, then the indices get messed up (for example, in tdf['b'], index 0 should be NaN but it gets a value after dropna()).
Is there any way to get what I want?
Simply use a map:
lookback = 3
# index of the maximum over each rolling window of `lookback` rows
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
# look up each of those indices in df2['a']; NaN stays NaN
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
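If you want the result stored back as in the question (assuming tdf already exists with the same index), the follow-up is just:
tdf['b'] = s.map(df2['a'])   # NaN indices simply stay NaN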
I need to count, for each value in one column, the number of NaN values in another column.
For example:
There are two columns X , Y.
X:
a
b
c
a
d
a
b
b
a
a
Y:
NaN
2
4
NaN
NaN
6
4
NaN
5
4
So how do I count the NaN values in Y grouped by the values a, b, c, d in X?
For example,
a has 2 NaN values.
b has 1 NaN value.
Per my comment, I have transposed your dataframe with df.set_index(0).T to get the following starting point.
In[1]:
0 X Y
1 a NaN
2 b 2
3 c 4
4 a NaN
5 d NaN
6 a 6
7 b 4
8 b NaN
9 a 5
10 a 4
From there, you can filter for null values with .isnull(). Then, you can use .groupby('X').size() to return the count of null values per group:
df[df['Y'].isnull()].groupby('X').size()
X
a 2
b 1
d 1
dtype: int64
Or, you could use value_counts() to achieve the same thing:
df[df['Y'].isnull()]['X'].value_counts()
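A self-contained sketch of the whole thing (the data is rebuilt here from the values in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X': list('abcadabbaa'),
    'Y': [np.nan, 2, 4, np.nan, np.nan, 6, 4, np.nan, 5, 4]
})

# keep only the rows where Y is NaN, then count them per value of X
print(df[df['Y'].isnull()].groupby('X').size())
print(df[df['Y'].isnull()]['X'].value_counts())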
I recently started working with Pandas and I'm currently trying to impute some missing values in my dataset.
I want to impute the missing values based on the median (for numerical entries) and mode (for categorical entries). However, I do not want to calculate the median and mode over the whole dataset, but per-group, based on a GroupBy of my column called "make".
For numerical NA values I did the following:
data = data.fillna(data.groupby("make").transform("median"))
...which works perfectly and replaces all my numerical NA values with the median of their "make".
However, for categorical NA values, I couldn't manage to do the same thing for the mode, i.e. replace all categorical NA values with the mode of their "make".
Does anyone know how to do it?
You can use GroupBy.transform with an if-else that uses median for numeric columns and mode for categorical columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': list('ebcded'),
    'B': [np.nan, np.nan, 4, 5, 5, 4],
    'C': [7, np.nan, 9, 4, 2, 3],
    'D': [1, 3, 5, np.nan, 1, 0],
    'F': list('aaabbb'),
    'make': list('aaabbb')
})
df.loc[[2, 4], 'A'] = np.nan
df.loc[[2, 5], 'F'] = np.nan
print(df)
A B C D F make
0 e NaN 7.0 1.0 a a
1 b NaN NaN 3.0 a a
2 NaN 4.0 9.0 5.0 NaN a
3 d 5.0 4.0 NaN b b
4 NaN 5.0 2.0 1.0 b b
5 d 4.0 3.0 0.0 NaN b
# median for numeric columns, most frequent value for everything else
f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.mode().iloc[0]
df = df.fillna(df.groupby('make').transform(f))
print (df)
A B C D F make
0 e 4.0 7.0 1.0 a a
1 b 4.0 8.0 3.0 a a
2 b 4.0 9.0 5.0 a a
3 d 5.0 4.0 0.5 b b
4 d 5.0 2.0 1.0 b b
5 d 4.0 3.0 0.0 b b
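If you prefer not to branch on dtype inside one lambda, here is a sketch of the same idea that handles numeric and categorical columns separately (starting again from the df built above, before the combined fillna step):
num_cols = df.select_dtypes(include='number').columns
obj_cols = df.select_dtypes(include='object').columns.difference(['make'])

# per-group median for the numeric columns
df[num_cols] = df[num_cols].fillna(df.groupby('make')[num_cols].transform('median'))
# per-group mode for the remaining object columns
df[obj_cols] = df[obj_cols].fillna(
    df.groupby('make')[obj_cols].transform(lambda x: x.mode().iloc[0]))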
I have a DataFrame with an Ids column and several columns with data, like the column "value" in this example.
For this DataFrame I want to move all the values that correspond to the same id into new columns in the same row, as shown below:
I guess there is an opposite function to "melt" that allows this, but I can't figure out how to pivot this DF.
The dicts for the input and output DFs are:
d = {"id":[1,1,1,2,2,3,3,4,5],"value":[12,13,1,22,21,23,53,64,9]}
d2 = {"id":[1,2,3,4,5],"value1":[12,22,23,64,9],"value2":[13,21,53,"",""],"value3":[1,"","","",""]}
Create a MultiIndex with cumcount, reshape with unstack, and change the column names with add_prefix:
df = (df.set_index(['id', df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index())
print (df)
id value0 value1 value2
0 1 12.0 13.0 1.0
1 2 22.0 21.0 NaN
2 3 23.0 53.0 NaN
3 4 64.0 NaN NaN
4 5 9.0 NaN NaN
Missing values can be replaced with fillna, but that mixes numeric and string data, so some functions may fail:
df = (df.set_index(['id', df.groupby('id').cumcount()])['value']
        .unstack()
        .add_prefix('value')
        .reset_index()
        .fillna(''))
print (df)
id value0 value1 value2
0 1 12.0 13 1
1 2 22.0 21
2 3 23.0 53
3 4 64.0
4 5 9.0
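If you want the columns to start at value1 (as in the desired d2) rather than value0, a small variation is to start the counter at 1, rebuilding the input from the dict d in the question:
df = pd.DataFrame(d)   # input DataFrame from the dict d above
df1 = (df.set_index(['id', df.groupby('id').cumcount().add(1)])['value']
         .unstack()
         .add_prefix('value')
         .reset_index())
print (df1)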
You can use GroupBy to collect the values into lists, then expand the series of lists into columns:
df = pd.DataFrame(d) # create input dataframe
res = df.groupby('id')['value'].apply(list).reset_index() # groupby to list
res = res.join(pd.DataFrame(res.pop('value').values.tolist())) # expand lists to columns
print(res)
id 0 1 2
0 1 12 13.0 1.0
1 2 22 21.0 NaN
2 3 23 53.0 NaN
3 4 64 NaN NaN
4 5 9 NaN NaN
In general, such operations are expensive because the number of output columns is arbitrary. Pandas / NumPy solutions work best when you can pre-allocate memory, which isn't possible here.
I have multiple datasets with different numbers of rows and the same number of columns.
I would like to find the NaN values in each column. For example, consider these two datasets:
dataset1:          dataset2:
   a    b             a    b
   1   10             2   11
   2    9             3   12
   3    8             4   13
   4  NaN           NaN   14
   5  NaN           NaN   15
   6  NaN           NaN   16
I want to find the NaN values in columns a and b of both datasets:
If a NaN occurs in column b, then remove all rows that have NaN values; if it occurs in column a, then fill those values with 0.
This is my code snippet:
a = pd.notnull(data['a'].values.any())
b = pd.notnull((data['b'].values.any()))
if a:
    data = data.dropna(subset=['a'])
if b:
    data[['a']] = data[['a']].fillna(value=0)
which does not work properly.
You just need fillna and dropna, no control flow required:
data = data.dropna(subset=['b']).fillna(0)
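Since you have multiple datasets, a small sketch applying that one-liner to each of them (assuming dataset1 and dataset2 are the DataFrames from the question):
# drop rows with NaN in column b, then fill the remaining NaNs (column a) with 0
cleaned = [d.dropna(subset=['b']).fillna(0) for d in (dataset1, dataset2)]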
Pass your conditions in a dict:
df=df.fillna({'a':0,'b':np.nan}).dropna()
You do not need 'b' here
df=df.fillna({'a':0}).dropna()
EDIT:
df.fillna({'a':0}).dropna()
Out[1319]:
a b
0 2.0 11
1 3.0 12
2 4.0 13
3 0.0 14
4 0.0 15
5 0.0 16