The following is the code that I am using:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
animals = DataFrame(np.arange(16).resize(4, 4), columns=['W', 'X', 'Y', 'Z'], index=['Dog', 'Cat', 'Bird', 'Mouse'])
print(animals)
The output I get for this is:
        W   X   Y   Z
Dog   NaN NaN NaN NaN
Cat   NaN NaN NaN NaN
Bird  NaN NaN NaN NaN
Mouse NaN NaN NaN NaN
The output that I expect is:
        W   X   Y   Z
Dog     0   1   2   3
Cat     4   5   6   7
Bird    8   9  10  11
Mouse  12  13  14  15
However, if I run just:
print(np.arange(16))
the output I get is:
[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
Use reshape:
import pandas as pd
animals = pd.DataFrame(np.arange(16).reshape(4, 4), columns=['W', 'X', 'Y', 'Z'], index=['Dog', 'Cat', 'Bird', 'Mouse'])
print(animals)
Or use numpy.resize():
np.resize(np.arange(16), (4, 4))
With np.resize(), you pass the array itself as the first argument, and it returns the resized array:
import pandas as pd
animals = pd.DataFrame(np.resize(np.arange(16),(4, 4)), columns=['W', 'X', 'Y', 'Z'], index=['Dog', 'Cat', 'Bird', 'Mouse'])
print(animals)
ndarray.resize() operates in place, so resize the array first and then create the DataFrame:
a = np.arange(16)
a.resize(4, 4)
import pandas as pd
animals = pd.DataFrame(a, columns=['W', 'X', 'Y', 'Z'], index=['Dog', 'Cat', 'Bird', 'Mouse'])
print(animals)
From the docs for resize: "Change shape and size of array in-place."
Thus, your call to resize returns None.
You want reshape, as in np.arange(16).reshape(4, 4).
Just to add to the answer above, the docs for resize state:
ndarray.resize(new_shape, refcheck=True)
Change shape and size of array in-place.
Therefore, unlike reshape, resize doesn't create a new array. In fact, np.arange(16).resize(4, 4) yields None, which is why you get the NaN values.
Using reshape returns a new array:
ndarray.reshape(shape, order='C')
Returns an array containing the same data with a new shape.
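A quick way to see the difference (a minimal sketch):
import numpy as np

a = np.arange(16)
b = np.arange(16)
print(a.reshape(4, 4) is None)  # False: reshape returns a new array
print(b.resize(4, 4) is None)   # True: resize works in place and returns None
print(b.shape)                  # (4, 4): b itself was reshaped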
Related
I need to use a slice on a DataFrameGroupBy object.
For example, assume there is a DataFrame with columns A-Z. If I want to use columns A-C, I can write .loc[:, 'A':'C'], but when I'm using a DataFrameGroupBy I can't use slicing, so I have to write [['A', 'B', 'C']].
Take a look here:
from numpy import around
from numpy.random import uniform
from pandas import DataFrame
from string import ascii_lowercase
data = around(a=uniform(low=1.0, high=50.0, size=(6, len(ascii_lowercase) + 1)), decimals=3)
df = DataFrame(data=data, columns=['group'] + list(ascii_lowercase), dtype='float64')
rows, columns = df.shape
df.loc[:rows // 2, 'group'] = 1.0
df.loc[rows // 2:, 'group'] = 2.0
print(df)
abc = df.groupby(by='group')[['a', 'b', 'c']].shift(periods=1)
print(abc)
Output of df is:
group a b c ... w x y z
0 1.0 22.380 36.873 10.073 ... 26.052 38.625 48.122 33.841
1 1.0 16.702 32.160 35.018 ... 12.990 17.878 19.297 16.330
2 1.0 9.957 25.202 7.106 ... 46.500 12.932 37.401 43.134
3 2.0 42.395 40.616 24.611 ... 30.436 33.521 42.136 2.690
4 2.0 2.069 29.891 2.217 ... 20.734 12.365 9.302 47.019
5 2.0 4.208 23.955 33.966 ... 45.439 16.488 32.892 9.345
Output of abc is:
a b c
0 NaN NaN NaN
1 22.380 36.873 10.073
2 16.702 32.160 35.018
3 NaN NaN NaN
4 42.395 40.616 24.611
5 2.069 29.891 2.217
How can I avoid using [['a', 'b', 'c']]? I have 105 columns that I would need to list there; I want to use slicing like .loc[:, 'a':'c'].
Thank you all :)
You can group by the Series df['group'], which makes it possible to filter the columns before the groupby so that only the filtered column names are passed:
abc = df.loc[:, 'a':'c'].groupby(by=df['group']).shift(periods=1)
print(abc)
a b c
0 NaN NaN NaN
1 37.999 21.197 39.527
2 35.560 27.214 23.211
3 NaN NaN NaN
4 49.053 11.319 37.279
5 27.881 38.529 46.550
Another idea is to use:
cols = df.loc[:, 'a':'c'].columns
abc = df.groupby(by='group')[cols].shift(periods=1)
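As a quick sanity check (a minimal sketch, assuming the df built in the question above), both approaches produce identical results:
cols = df.loc[:, 'a':'c'].columns
via_series = df.loc[:, 'a':'c'].groupby(by=df['group']).shift(periods=1)
via_cols = df.groupby(by='group')[cols].shift(periods=1)
print(via_series.equals(via_cols))  # True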
I created a mask to replace detected outliers with NaN values in a specific column of a dataframe. The code I wrote worked perfectly for the random dataframe I created, but the same code did not work for the actual dataframe I am working on.
Here is the code using random dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)), columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'))
df
lower = np.percentile(df['B'], 25)
upper = np.percentile(df['B'], 75)
outliers = [x for x in df['B'] if x < lower or x > upper]
print('Identified Outliers %d' % len(outliers))
mask = (df['B'] < lower) | (df['B'] > upper)
df['B'][mask] = np.nan
The code above worked perfectly for this dataframe; the number of identified outliers and the number of values replaced with NaN are equal.
Surprisingly, the same code did not work for the actual dataframe: it identified the number of outliers, but did not replace the outliers with NaN values.
Is there any particular reason for this? Does anything need to be done with the datatype of that column in the actual dataframe?
This could be due to incompatible dtypes. Define a function to encapsulate the functionality, then run it on dataframes whose columns have different dtypes; see the example below:
import numpy as np
import pandas as pd
def mask_column(df):
    print(df)
    col_to_mask = df.columns.values[1]
    lower = np.percentile(df[col_to_mask], 25)
    upper = np.percentile(df[col_to_mask], 75)
    outliers = [x for x in df[col_to_mask] if x < lower or x > upper]
    print('Identified Outliers %d' % len(outliers))
    mask = (df[col_to_mask] < lower) | (df[col_to_mask] > upper)
    df[col_to_mask][mask] = np.nan
    print(df)

df_1 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
                    columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'))
df_2 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
                    columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'))
df_2['B'] = df_1['B'].astype(float)
df_3 = pd.DataFrame(np.random.randint(0, 1000, size=(4, 10)),
                    columns=('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'))
df_3['B'] = df_1['B'].astype(str)

# mask_column(df_1)
# mask_column(df_2)
mask_column(df_3)
The first two function calls succeed in applying the boolean mask, but the third does not:
Traceback (most recent call last):
File "C:/Users/gtrm/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/scratch_56.py", line 34, in <module>
mask_column(df_3)
File "C:/Users/gtrm/AppData/Roaming/JetBrains/PyCharmCE2020.1/scratches/scratch_56.py", line 9, in mask_column
lower = np.percentile(df[col_to_mask], 25)
File "<__array_function__ internals>", line 5, in percentile
File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3705, in percentile
return _quantile_unchecked(
File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3824, in _quantile_unchecked
r, k = _ureduce(a, func=_quantile_ureduce_func, q=q, axis=axis, out=out,
File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3403, in _ureduce
r = func(a, **kwargs)
File "C:\Users\gtrm\AppData\Local\Continuum\anaconda3\envs\py38\lib\site-packages\numpy\lib\function_base.py", line 3941, in _quantile_ureduce_func
x1 = take(ap, indices_below, axis=axis) * weights_below
TypeError: can't multiply sequence by non-int of type 'float'
A B C D E F G H I J
0 450 524 545 697 94 703 97 894 710 974
1 238 367 48 224 698 116 974 943 235 244
2 503 107 937 700 506 411 818 511 932 641
3 993 148 284 580 218 957 917 73 96 853
I recommend using pandas.DataFrame.quantile.
With the default of axis=0, the specified quantile for each column is calculated.
By default, numeric_only=True, so only numeric values are considered, but if False is specified, this will work for datetime and timedelta data as well.
Columns that are not of numeric / datetime / timedelta type will be ignored.
Use pandas Boolean Indexing to filter the dataframe along all the columns at once.
To get the number of remaining numeric values, use filtered.count().
To find the number of NaN values, use df.count() - filtered.count().
See pandas.DataFrame.count for parameter specifics.
Regarding the "real" dataframe, it's not possible to determine the issue, as it is not available.
Use df.info() to verify the Dtype of columns are a numeric type.
import pandas as pd
import numpy as np
# test data and dataframe
np.random.seed(50)
df = pd.DataFrame(np.random.randint(0, 10000, size=(4,10)) / 10, columns=('A','B','C','D','E','F','G','H','I','J'))
df['k'] = ['a', 'b', 'c', 'd']
# display(df)
A B C D E F G H I J k
0 560.0 625.3 832.4 621.4 826.2 791.7 730.1 623.9 741.8 211.9 a
1 855.9 147.6 302.2 60.3 220.2 431.4 730.2 347.6 388.3 648.5 b
2 511.8 50.7 461.4 371.4 451.0 727.3 963.5 561.9 37.1 800.2 c
3 99.2 493.1 180.2 612.8 574.2 572.6 102.4 195.0 988.2 824.3 d
# calculate upper and lower quantiles
quantiles = df.quantile([.25, .75])
# display(quantiles)
A B C D E F G H I J
0.25 408.650 123.375 271.70 293.625 393.3 537.3 573.175 309.45 300.5 539.350
0.75 633.975 526.150 554.15 614.950 637.2 743.4 788.525 577.40 803.4 806.225
# filter the dataframe
filtered = df[(df < quantiles.loc[0.75]) & (df > quantiles.loc[0.25])]
# display(filtered)
A B C D E F G H I J k
0 560.0 NaN NaN NaN NaN NaN 730.1 NaN 741.8 NaN NaN
1 NaN 147.6 302.2 NaN NaN NaN 730.2 347.6 388.3 648.5 NaN
2 511.8 NaN 461.4 371.4 451.0 727.3 NaN 561.9 NaN 800.2 NaN
3 NaN 493.1 NaN 612.8 574.2 572.6 NaN NaN NaN NaN NaN
print(filtered.count())
[out]:
A 2
B 2
C 2
D 2
E 2
F 2
G 2
H 2
I 2
J 2
k 0
dtype: int64
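If df.info() shows that the problem column in the real dataframe has object (string) dtype, converting it to numeric first resolves the TypeError above. A minimal sketch, assuming the column is named 'B':
import pandas as pd

# coerce non-numeric entries to NaN and make the column numeric
df['B'] = pd.to_numeric(df['B'], errors='coerce')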
I have a list:
orig = [2, 3, 4, -5, -6, -7]
I want to create another list where the entries corresponding to positive values above are the sum of the positives, and those corresponding to negative values are the absolute sum of the negatives. So the desired output is:
final = [9, 9, 9, 18, 18, 18]
I am doing this:
raw = pd.DataFrame(orig, columns =['raw'])
raw
raw
0 2
1 3
2 4
3 -5
4 -6
5 -7
sum_pos = raw[raw > 0].sum()
sum_neg = -1 * raw[raw < 0].sum()
final = pd.DataFrame(index = raw.index, columns = ['final'])
final
final
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
final.loc[raw > 0, 'final'] = sum_pos
KeyError: "[('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w')\n ('r', 'a', 'w') ('r', 'a', 'w')] not in index"
So basically I was trying to create an empty dataframe like raw and then conditionally fill it. However, the above method fails.
Even when I try to create a new column instead of a new dataframe, it fails:
raw.loc[raw > 0, 'final'] = sum_pos
KeyError: "[('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w')\n ('r', 'a', 'w') ('r', 'a', 'w')] not in index"
The best solution I've found so far is this:
pd.DataFrame(np.where(raw > 0, sum_pos, sum_neg), index=raw.index, columns=['final'])
final
0 9.0
1 9.0
2 9.0
3 18.0
4 18.0
5 18.0
However, I don't understand what is wrong with the other approaches. Is there something I am missing here?
You can try grouping on np.sign, then sum and abs:
s = pd.Series(orig)
s.groupby(np.sign(s)).transform('sum').abs().tolist()
Output:
[9, 9, 9, 18, 18, 18]
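Unpacking the chain (a minimal sketch): np.sign(s) labels each element by its sign, transform('sum') broadcasts each group's sum back to the original positions, and abs() removes the minus sign from the negative group's total:
import numpy as np
import pandas as pd

orig = [2, 3, 4, -5, -6, -7]
s = pd.Series(orig)
print(np.sign(s).tolist())                               # [1, 1, 1, -1, -1, -1]
print(s.groupby(np.sign(s)).transform('sum').tolist())   # [9, 9, 9, -18, -18, -18]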
You're not aligning indexes. sum_pos is a series with a single element whose index is 'raw', and you are trying to assign that series to a part of a dataframe that doesn't have 'raw' in its index.
Pandas does almost everything by index alignment. To make this work, you need to extract the values from the sum_pos series:
final.loc[raw['raw'] > 0, 'final'] = sum_pos.values
print(final)
Output:
final
0 9.0
1 9.0
2 9.0
3 NaN
4 NaN
5 NaN
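The same pattern extends to the negative rows (a sketch reusing the OP's sum_neg, which was already sign-flipped), which fills the remaining NaNs and reproduces the desired output:
final.loc[raw['raw'] > 0, 'final'] = sum_pos.values
final.loc[raw['raw'] < 0, 'final'] = sum_neg.values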
I am trying to convert survey data on marital status, which looks as follows:
df['d11104'].value_counts()
[1] Married 1 250507
[2] Single 2 99131
[4] Divorced 4 32817
[3] Widowed 3 24839
[5] Separated 5 8098
[-1] keine Angabe 2571
Name: d11104, dtype: int64
So far, I did df['marstat'] = df['d11104'].cat.codes.astype('category'), yielding
df['marstat'].value_counts()
1 250507
2 99131
4 32817
3 24839
5 8098
0 2571
Name: marstat, dtype: int64
Now, I'd like to add labels to the column marstat such that the numerical values are maintained, i.e. I'd like to identify people by the condition df['marstat'] == 1 while at the same time having the labels ['Married', 'Single', 'Divorced', 'Widowed'] attached to this variable. How can this be done?
EDIT: Thanks to jpp's answer, I simply created a new variable and defined the labels by hand:
df['marstat_lb'] = df['marstat'].map({1: 'Married', 2: 'Single', 3: 'Widowed', 4: 'Divorced', 5: 'Separated'})
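With both columns in place (a minimal sketch), rows can be selected either by numeric code or by label:
by_code = df[df['marstat'] == 1]               # numeric code is maintained
by_label = df[df['marstat_lb'] == 'Married']   # human-readable label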
You can convert your result to a dataframe and include both the category code and name in the output.
A dictionary mapping codes to category names can be extracted by enumerating the categories. Minimal example below:
import pandas as pd
df = pd.DataFrame({'A': ['M', 'M', 'S', 'D', 'W', 'M', 'M', 'S',
'S', 'S', 'M', 'W']}, dtype='category')
print(df.A.cat.categories)
# Index(['D', 'M', 'S', 'W'], dtype='object')
res = df.A.cat.codes.value_counts().to_frame('count')
cat_map = dict(enumerate(df.A.cat.categories))
res['A'] = res.index.map(cat_map.get)
print(res)
# count A
# 1 5 M
# 2 4 S
# 3 2 W
# 0 1 D
For example, you can access "M" in res by either res['A'] == 'M' or res.index == 1.
A more straightforward solution is to simply apply value_counts and then add an extra column for the codes:
res = df.A.value_counts().to_frame('count').reset_index()
res['code'] = res['index'].cat.codes
index count code
0 M 5 1
1 S 4 2
2 W 2 3
3 D 1 0
I want to get a 2D numpy array from a column of a pandas dataframe df that has a numpy vector in each row. But if I do
df.values.shape
I get (3,) instead of the expected (3, 5)
(assuming that each numpy vector in the dataframe has 5 dimensions and that the dataframe has 3 rows).
What is the correct method?
Ideally, avoid getting into this situation by finding a different way to define the DataFrame in the first place. However, if your DataFrame looks like this:
s = pd.Series([np.random.randint(20, size=(5,)) for i in range(3)])
df = pd.DataFrame(s, columns=['foo'])
# foo
# 0 [4, 14, 9, 16, 5]
# 1 [16, 16, 5, 4, 19]
# 2 [7, 10, 15, 13, 2]
then you could convert it to a DataFrame of shape (3,5) by calling pd.DataFrame on a list of arrays:
pd.DataFrame(df['foo'].tolist())
# 0 1 2 3 4
# 0 4 14 9 16 5
# 1 16 16 5 4 19
# 2 7 10 15 13 2
pd.DataFrame(df['foo'].tolist()).values.shape
# (3, 5)
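Alternatively (a sketch), np.stack builds the 2-D array directly from the column:
import numpy as np

arr = np.stack(df['foo'].tolist())
print(arr.shape)  # (3, 5)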
I am not sure what you want, but df.values.shape seems to give the correct result.
import pandas as pd
import numpy as np
from pandas import DataFrame
df3 = DataFrame(np.random.randn(3, 5), columns=['a', 'b', 'c', 'd', 'e'])
print(df3)
# a b c d e
#0 -0.221059 1.206064 -1.359214 0.674061 0.547711
#1 0.246188 0.628944 0.528552 0.179939 -0.019213
#2 0.080049 0.579549 1.790376 -1.301700 1.372702
df3.values.shape
# (3, 5)
df3["a"]
#0 -0.221059
#1 0.246188
#2 0.080049
df3[:1]
# a b c d e
#0 -0.221059 1.206064 -1.359214 0.674061 0.547711