Starting from a pandas Series with non-unique values, one can count the number of occurrences of each unique value with .value_counts().
>> col = pd.Series([1.0, 1.0, 2.0, 3.0, 3.0, 3.0])
>> col
0 1.0
1 1.0
2 2.0
3 3.0
4 3.0
5 3.0
dtype: float64
>> stat = col.value_counts()
>> stat
3.0 3
1.0 2
2.0 1
dtype: int64
But suppose I start instead from a data frame with two columns, one for the unique values and another for the number of occurrences (like stat in the previous example). How can I expand those back into a single column?
Because I would like to calculate the median, mean, etc. of the data in such a dataframe, I think describing a single column is much easier than describing two. Or is there a method to describe a 'value_count' dataframe directly without expanding the data?
# turn `stat` into col ???
>> col.describe()
count 6.000000
mean 2.166667
std 0.983192
min 1.000000
25% 1.250000
50% 2.500000
75% 3.000000
max 3.000000
Adding testing data:
>> df = pd.DataFrame({"Name": ["A", "B", "C"], "Value": [1, 2, 3], "Count": [2, 6, 2]})
>> df
Name Value Count
0 A 1 2
1 B 2 6
2 C 3 2
>> df2 = _reverse_count(df)   # the hypothetical function I'm looking for
>> df2
Name Value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 2
5 B 2
6 B 2
7 B 2
8 C 3
9 C 3
You can use the repeat function from numpy:
import pandas as pd
import numpy as np
col = pd.Series([1.0, 1.0, 2.0, 3.0, 3.0, 3.0])
stats = col.value_counts()
pd.Series(np.repeat(stats.index, stats))
# 0 3.0
# 1 3.0
# 2 3.0
# 3 1.0
# 4 1.0
# 5 2.0
# dtype: float64
Update:
For multiple columns you can use:
df.loc[df.index.repeat(df['Count'])]
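For example, applied to the testing data above (a quick sketch; dropping the helper column and resetting the index are my additions, not part of the original answer):
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B", "C"], "Value": [1, 2, 3], "Count": [2, 6, 2]})
# repeat each row by its Count, drop the helper column, and renumber the index
df2 = (df.loc[df.index.repeat(df["Count"])]
         .drop(columns="Count")
         .reset_index(drop=True))
print(df2["Value"].describe())  # summary statistics on the expanded column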
I have a dataframe and I would like to add a column based on the values of the other columns.
If the problem were only that, I think a good solution would be this answer.
However, my problem is a bit more complicated.
Say I have:
import pandas as pd
a = pd.DataFrame([[5, 6], [1, 2], [3, 6], [4, 1]], columns=['a', 'b'])
print(a)
I have
a b
0 5 6
1 1 2
2 3 6
3 4 1
Now I want to add a column called 'result' where each of the values would be the result of applying this function
def process(a, b, c, d):
    return {"notthisone": 2 * a,
            "thisone": (a * b + c * d),
            }
to each row together with the next row of the dataframe.
This function is part of a library; it outputs two values, but we are only interested in the value of the key "thisone".
Also, if possible, we cannot decompose the operations of the function; we have to apply it as-is to the values.
For example, in the first row:
a=5, b=6, c=1, d=2 (c and d being the a and b of the next row), and we want to add the value of "thisone", so 5*6+1*2=32.
In the end I will have:
a b result
0 5 6 32
1 1 2 20
2 3 6 22
3 4 1 22 --> This is a special case: since there is no next row, just repeating the previous value would be fine
How can I do this?
I am thinking of traversing the dataframe with a loop but there must be a better and faster way...
EDIT:
I have done this so far
def p4(a, b):
    return {"notthisone": 2 * a,
            "thisone": (a * b),
            }
print(a.apply(lambda row: p4(row.a,row.b)["thisone"], axis=1))
and the result is
0 30
1 2
2 18
3 4
dtype: int64
So now I have to think of a way to incorporate the next row's values too.
If you only need the values of the very next row, I think it would be best to shift these values back into the current row (with different column names). Then they can all be accessed by row-wise apply(fn, axis=1).
# library function
def process(a, b, c, d):
    return {
        "notthisone": 2 * a,
        "thisone": (a * b + c * d),
    }

# toy data
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]], columns=["a", "b"])

# shift some data back one row
df[["c", "d"]] = df[["a", "b"]].shift(-1)

# apply your function row-wise
df["result"] = df.apply(
    lambda x: process(x["a"], x["b"], x["c"], x["d"])["thisone"], axis=1
)
Result:
     a    b    c    d  result
0  1.0  2.0  3.0  4.0    14.0
1  3.0  4.0  5.0  6.0    42.0
2  5.0  6.0  7.0  8.0    86.0
3  7.0  8.0  NaN  NaN     NaN
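The question asks for the last row to repeat the previous result; one way to get that (my addition, not part of the original answer) is a forward fill:
# carry the previous row's result into the trailing NaN row
df["result"] = df["result"].ffill()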
Use the iloc accessor to select the rows, turn them into numpy arrays, and take the product and sum. I used a list comprehension in this case. The last row will be null; forward-fill the resulting column. We could fillna at the df level, but that could impact other columns if the df is large and has nulls. Code below.
import numpy as np

a = a.assign(x=pd.Series([np.prod(a.iloc[x].to_numpy()) + np.prod(a.iloc[x + 1].to_numpy())
                          for x in np.arange(len(a)) if x != len(a) - 1]))
a = a.assign(x=a['x'].ffill())
a b x
0 5 6 32.0
1 1 2 20.0
2 3 6 22.0
3 4 1 22.0
I'm importing data from Excel where some rows may have notes in a column and are not truly part of the dataframe. Dummy example below:
H1 H2 H3
*highlighted cols are PII
sam red 5
pam blue 3
rod green 11
* this is the end of the data
When the above file is imported into dfPA it looks like:
dfPA:
Index H1 H2 H3
1 *highlighted cols are PII
2 sam red 5
3 pam blue 3
4 rod green 11
5 * this is the end of the data
I want to delete the first and last row. This is what I've done.
#get count of cols in df
input: cntcols = dfPA.shape[1]
output: 3
#get count of cols with nan in df
input: a = dfPA.shape[1] - dfPA.count(axis=1)
output:
1 2
2 3
3 3
4 3
5 2
(where a is a series)
#convert a from series to df
dfa = a.to_frame()
#delete rows where no. of nan's are greater than 'n'
n = 1
for r, row in dfa.iterrows():
    if (cntcols - dfa.iloc[r][0]) > n:
        i = row.name
        dfPA = dfPA.drop(index=i)
This doesn't work. Is there a way to do this?
You should use the pandas.DataFrame.dropna method. Its thresh parameter lets you define the minimum number of non-NaN values required to keep a row/column.
Imagine the following dataframe:
>>> import numpy as np
>>> df = pd.DataFrame([[1,np.nan,1,np.nan], [1,1,1,1], [1,np.nan,1,1], [np.nan,1,1,1]], columns=list('ABCD'))
A B C D
0 1.0 NaN 1 NaN
1 1.0 1.0 1 1.0
2 1.0 NaN 1 1.0
3 NaN 1.0 1 1.0
You can drop columns with NaN using:
>>> df.dropna(axis=1)
C
0 1
1 1
2 1
3 1
The thresh parameter defines the minimum number of non-NaN values to keep the column:
>>> df.dropna(thresh=3, axis=1)
A C D
0 1.0 1 NaN
1 1.0 1 1.0
2 1.0 1 1.0
3 NaN 1 1.0
If you want to reason in terms of the number of NaN:
# example for a minimum of 2 NaN to drop the column
>>> df.dropna(thresh=len(df)-(2-1), axis=1)
If the rows rather than the columns need to be filtered, remove the axis parameter or use axis=0:
>>> df.dropna(thresh=3)
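Applied back to the question's dfPA, a sketch (assuming the note rows have a value only in H1 and NaN elsewhere), keeping rows with at most n = 1 missing values:
n = 1  # maximum number of NaNs allowed per row
dfPA = dfPA.dropna(thresh=dfPA.shape[1] - n)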
I have a 21840x39 data frame. A few of my columns are numerically valued and I want to make sure they are all in the same data type (which I want to be a float).
Instead of naming all the columns out and converting them:
df[['A', 'B', 'C', ...]] = df[['A', 'B', 'C', ...]].astype(float)
Can I do a for loop that will allow me to say something like "convert to float from column 18 to column 35"?
I know how to do one column: df['A'] = df['A'].astype(float)
But how can I do multiple columns? I tried with list slicing within a loop but couldn't get it right.
The first idea is to convert the selected columns; Python counts from 0, so for columns 18 to 35 use:
df.iloc[:, 17:35] = df.iloc[:, 17:35].astype(float)
If that is not working (because of a possible bug), use another solution:
df = df.astype(dict.fromkeys(df.columns[17:35], float))
Sample - convert the 8th to 15th columns:
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
columns=list('abcdefghijklmnopqr')).astype(str)
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8 0 0 8 9 3 7 2 3 6 5
1 0 4 8 6 4 1 1 5 9 5 6 6 6 5 4 6 4 2
2 3 4 7 1 4 9 3 2 0 9 1 2 7 1 0 2 8 8
df = df.astype(dict.fromkeys(df.columns[7:15], float))
print (df)
a b c d e f g h i j k l m n o p q r
0 0 8 3 6 3 3 7 8.0 0.0 0.0 8.0 9.0 3.0 7.0 2.0 3 6 5
1 0 4 8 6 4 1 1 5.0 9.0 5.0 6.0 6.0 6.0 5.0 4.0 6 4 2
2 3 4 7 1 4 9 3 2.0 0.0 9.0 1.0 2.0 7.0 1.0 0.0 2 8 8
Tweaked @jezrael's code, as typing in column names is (I feel) a good option.
import pandas as pd
import numpy as np
np.random.seed(2020)
df = pd.DataFrame(np.random.randint(10, size=(3, 18)),
columns=list('abcdefghijklmnopqr')).astype(str)
print(df)
columns = list(df.columns)
#change the first and last column names below as required
df = df.astype(dict.fromkeys(
df.columns[columns.index('h'):(columns.index('o')+1)], float))
print (df)
Leaving the original answer below here, but note: never loop in pandas if vectorized alternatives exist.
If I had a dataframe and wanted to change columns 'col3' to 'col5' (human readable names) to floats I could...
import pandas as pd
import re
df = pd.read_csv('dummy_data.csv')
df
columns = list(df.columns)
#change the first and last column names below as required
start_column = columns.index('col3')
end_column = columns.index('col5')
for index, col in enumerate(columns):
    if (start_column <= index) & (index <= end_column):
        df[col] = df[col].astype(float)
df
...by just changing the column names. Perhaps it's easier to work with column names and say 'from this one' to 'that one' (inclusive).
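For what it's worth, a label-based variant (a sketch reusing the toy frame from the answer above, not part of the original code) gets the same 'from this column to that column' behaviour without a loop by slicing column names with .loc:
# select the column range by name ('h' through 'o', inclusive) and convert it
cols = df.loc[:, 'h':'o'].columns
df = df.astype({c: float for c in cols})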
I know rolling_mean() exists, but this is for a school project, so I'm trying to avoid using rolling_mean().
I'm trying to use the following function on a dataframe series:
def run_mean(array, period):
    ret = np.cumsum(array, dtype=float)
    ret[period:] = ret[period:] - ret[:-period]
    return ret[period - 1:] / period

data['run_mean'] = run_mean(data['ratio'], 150)
But I'm getting the error 'ValueError: cannot set using a slice indexer with a different length than the value'.
Using data['run_mean'] = pd.rolling_mean(raw_data['ratio'], 150) works just fine; what am I missing?
Fill the initial values up to period with NaN.
def run_mean(array, period):  # Vector
    ret = np.cumsum(array / period, dtype=float)  # First divide by period to avoid overflow.
    ret[period:] = ret[period:] - ret[:-period]
    ret[:period - 1] = np.nan
    return ret
run_mean(np.array(range(5)), 3)
Out[35]: array([ nan, nan, 1., 2., 3.])
To quote the pandas documentation,
A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels.
This example should illustrate what's going on:
In [1]: import numpy as np
...: import pandas as pd
In [2]: a = pd.Series(np.random.random(5))
In [3]: a
Out[3]:
0 0.740975
1 0.983654
2 0.274207
3 0.427542
4 0.874127
dtype: float64
In [4]: a[2:]
Out[4]:
2 0.274207
3 0.427542
4 0.874127
dtype: float64
In [5]: a[:-2]
Out[5]:
0 0.740975
1 0.983654
2 0.274207
dtype: float64
In [6]: a[2:] - a[:-2]
Out[6]:
0 NaN
1 NaN
2 0.0
3 NaN
4 NaN
dtype: float64
In [7]: a[2:] = _
The last statement will produce the ValueError you get.
Converting ret from a pandas Series to a numpy ndarray should give you the behaviour you're looking for.
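A minimal sketch of that fix (the frame and column name below are invented for illustration): casting to an ndarray first makes the slice arithmetic purely positional, so no label alignment happens.
import numpy as np
import pandas as pd

def run_mean(array, period):
    ret = np.cumsum(np.asarray(array), dtype=float)  # plain ndarray, no index alignment
    ret[period:] = ret[period:] - ret[:-period]
    return ret[period - 1:] / period

data = pd.DataFrame({'ratio': np.random.rand(300)})  # hypothetical data
means = run_mean(data['ratio'], 150)
# note: the result has len(data) - 150 + 1 values, so it cannot be assigned back
# as data['run_mean'] without padding the first 149 positions (e.g. with NaN)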
You're mixing up the use of : in DataFrame slicing.
Solution
What you want to use is shift()
def run_mean(array, period):
    ret = np.cumsum(array, dtype=float)
    roll = ret - ret.shift(period).fillna(0)
    return roll[(period - 1):] / period
Example Setup
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame((np.random.rand(6, 5) * 10).astype(int), columns=list('ABCDE'))
print(df)
A B C D E
0 9 5 2 7 9
1 8 7 2 9 2
2 7 2 1 3 8
3 2 0 6 5 5
4 6 6 4 3 5
5 4 8 8 1 0
Observe
print(df[:4])
A B C D E
0 9 5 2 7 9
1 8 7 2 9 2
2 7 2 1 3 8
3 2 0 6 5 5
print(df[:-4])
A B C D E
0 9 5 2 7 9
1 8 7 2 9 2
These are not the same length.
Demonstration (the output of run_mean(df, 3)):
A B C D E
2 8.000000 4.666667 1.666667 6.333333 6.333333
3 5.666667 3.000000 3.000000 5.666667 5.000000
4 5.000000 2.666667 3.666667 3.666667 6.000000
5 4.000000 4.666667 6.000000 3.000000 3.333333
I'm trying to sum across columns of a Pandas dataframe, and when I have NaNs in every column I'm getting sum = zero; I'd expected sum = NaN based on the docs. Here's what I've got:
In [136]: df = pd.DataFrame()
In [137]: df['a'] = [1,2,np.nan,3]
In [138]: df['b'] = [4,5,np.nan,6]
In [139]: df
Out[139]:
a b
0 1 4
1 2 5
2 NaN NaN
3 3 6
In [140]: df['total'] = df.sum(axis=1)
In [141]: df
Out[141]:
a b total
0 1 4 5
1 2 5 7
2 NaN NaN 0
3 3 6 9
The pandas.DataFrame.sum docs say "If an entire row/column is NA, the result will be NA", so I don't understand why "total" = 0 and not NaN for index 2. What am I missing?
From the pandas documentation for DataFrame.sum:
DataFrame.sum(self, axis=None, skipna=None, level=None, numeric_only=None, min_count=0, **kwargs)
min_count: int, default 0
The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.
New in version 0.22.0: Added with the default being 0. This means the sum of an all-NA or empty Series is 0, and the product of an all-NA or empty Series is 1.
As the latest pandas docs quoted above say, min_count defaults to 0, so the sum of an all-NA series is 0.
If you pass min_count=1, the result of summing an all-NA row will be NaN.
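Applied to the question's original frame, a quick sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, 3], 'b': [4, 5, np.nan, 6]})
df['total'] = df.sum(axis=1, min_count=1)  # the all-NaN row now yields NaN instead of 0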
Great link provided by Jeff.
Here you can find an example:
df1 = pd.DataFrame()
df1['a'] = [1, 2, np.nan, 3]
df1['b'] = [np.nan, 2, np.nan, 3]
df1
Out[4]:
a b
0 1.0 NaN
1 2.0 2.0
2 NaN NaN
3 3.0 3.0
df1.sum(axis=1, skipna=False)
Out[6]:
0 NaN
1 4.0
2 NaN
3 6.0
dtype: float64
df1.sum(axis=1, skipna=True)
Out[7]:
0 1.0
1 4.0
2 0.0
3 6.0
dtype: float64
df1.sum(axis=1, min_count=1)
Out[7]:
0 1.0
1 4.0
2 NaN
3 6.0
dtype: float64
A solution would be to select all cases where rows are all-nan, then set the sum to nan:
df['total'] = df.sum(axis=1)
df.loc[df['a'].isnull() & df['b'].isnull(),'total']=np.nan
or
df['total'] = df.sum(axis=1)
df.loc[df[['a','b']].isnull().all(1),'total']=np.nan
The latter option is probably more practical, because you can create a list of columns ['a','b', ... , 'z'] which you may want to sum.
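As a sketch of that idea (the rebuilt frame mirrors the question's data; factoring out the column list is my addition):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, 3], 'b': [4, 5, np.nan, 6]})

cols = ['a', 'b']  # extend this list with whatever columns should be summed
df['total'] = df[cols].sum(axis=1)
df.loc[df[cols].isnull().all(axis=1), 'total'] = np.nan  # all-NaN rows get NaN, not 0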
I got around this by casting the series to a numpy array, which computes the answer correctly.
print(np.array([np.nan,np.nan,np.nan]).sum()) # nan
print(pd.Series([np.nan,np.nan,np.nan]).sum()) # 0.0
print(pd.Series([np.nan,np.nan,np.nan]).to_numpy().sum()) # nan