Efficiently adding calculated rows based on index values to a pandas DataFrame - python

I have a pandas DataFrame in the following format:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
I want to append a calculated row that performs some math based on a given item's index value, e.g. adding a row that sums the values of all items with an index value < 2, with the new row having an index label of 'Red'. Ultimately, I am trying to add three rows that group the index values into categories:
A row with the sum of item values where index values are < 2, labeled as 'Red'
A row with the sum of item values where index values are 1 < x < 4, labeled as 'Blue'
A row with the sum of item values where index values are > 3, labeled as 'Green'
Ideal output would look like this:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Red 3 5 7
Blue 15 17 19
Green 27 29 31
My current solution involves transposing the DataFrame, applying a map function for each calculated column and then re-transposing, but I would imagine pandas has a more efficient way of doing this, likely using .append().
EDIT:
My inelegant pre-set-list solution (it originally used .transpose(), but I improved it using .groupby() and .append()):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(18).reshape((6,3)), columns=['a', 'b', 'c'])
df['x'] = ['Red', 'Red', 'Blue', 'Blue', 'Green', 'Green']
df2 = df.groupby('x').sum()
df = df.append(df2)
del df['x']
I much prefer the flexibility of BrenBarn's answer (see below).

Here is one way:
def group(ix):
    if ix < 2:
        return "Red"
    elif 2 <= ix < 4:
        return "Blue"
    else:
        return "Green"
>>> print d
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
>>> print d.append(d.groupby(d.index.to_series().map(group)).sum())
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Blue 15 17 19
Green 27 29 31
Red 3 5 7
For the general case, you need to define a function (or dict) to handle the mapping to different groups. Then you can just use groupby and its usual abilities.
For your particular case, it can be done more simply by directly slicing on the index value as Dan Allan showed, but that will break down if you have a more complex case where the groups you want are not simply definable in terms of contiguous blocks of rows. The method above will also easily extend to situations where the groups you want to create are not based on the index but on some other column (i.e., group together all rows whose value in column X is within range 0-10, or whatever).
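As an illustration of that last point, here is a minimal sketch of the column-based variant; the column name 'X', the bin edges, and the labels are invented for the example, and it assumes a pandas version where DataFrame.append still exists (as in the answers here):
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [3, 7, 12, 18, 25], 'y': np.arange(5)})

# Bucket rows by the value of column X (0-10, 10-20, 20-30) instead of by the index.
buckets = pd.cut(df['X'], bins=[0, 10, 20, 30], labels=['low', 'mid', 'high'])

# Same pattern as above: group on the buckets, sum, and append the summary rows.
summary = df.groupby(buckets).sum()
result = df.append(summary)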

The role of "transpose," which you say you used in your unshown solution, might be played more naturally by the orient keyword argument, which is available when you construct a DataFrame from a dictionary.
In [23]: df
Out[23]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
In [24]: dict = {'Red': df.loc[:1].sum(),
'Blue': df.loc[2:3].sum(),
'Green': df.loc[4:].sum()}
In [25]: DataFrame.from_dict(dict, orient='index')
Out[25]:
a b c
Blue 15 17 19
Green 27 29 31
Red 3 5 7
In [26]: df.append(_)
Out[26]:
a b c
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
5 15 16 17
Blue 15 17 19
Green 27 29 31
Red 3 5 7
Based on the numbers in your example, I take "> 3" to mean index values of 4 and above, hence the df.loc[4:] slice.
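A side note for readers on newer pandas, where DataFrame.append has been removed: the append step in either answer can be written with pd.concat instead. A minimal sketch, reusing the summary frame built in In [25]:
import pandas as pd

summary = pd.DataFrame.from_dict(
    {'Red': df.loc[:1].sum(), 'Blue': df.loc[2:3].sum(), 'Green': df.loc[4:].sum()},
    orient='index',
)
result = pd.concat([df, summary])   # equivalent of df.append(summary)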

Related

Compare 2 different columns of different dataframes

I have 2 dataframes, let's say:
df1 =>
colA colB colC
0 1 2
3 4 5
6 7 8
df2 (same number of rows and columns) =>
colD colE colF
10 11 12
13 14 15
16 17 18
I want to compare columns from both dataframes, for example:
df1['colB'] < df2['colF']
Currently I am getting:
ValueError: Can only compare identically-labeled Series objects
while comparing in:
EDIT:
df1.loc[df1['colB'] < df2['colF'], 'set_something'] = 1
Any help on how I can implement this? Thanks
You have the error because your series are not aligned (and might have duplicated indices).
If you just care about position, not indices, use the underlying numpy array:
df1['colB'] < df2['colF'].to_numpy()
If you want to assign the result back to a column, make sure to convert the column from the other DataFrame to an array:
df1['new'] = df1['colB'] < df2['colF'].to_numpy()
Or
df2['new'] = df1['colB'].to_numpy() < df2['colF']
This is a non-equi join; you should get better performance with some form of binary search, which conditional_join from pyjanitor does under the hood:
# pip install pyjanitor
import pandas as pd
import janitor
df1.conditional_join(df2, ('colB', 'colF', '<'))
colA colB colC colD colE colF
0 0 1 2 10 11 12
1 0 1 2 13 14 15
2 0 1 2 16 17 18
3 3 4 5 10 11 12
4 3 4 5 13 14 15
5 3 4 5 16 17 18
6 6 7 8 10 11 12
7 6 7 8 13 14 15
8 6 7 8 16 17 18
If it is based on an equality (df1.colB == df2.colF), then pd.merge should suffice and is efficient
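For completeness, a minimal sketch of that equality case, using the column names from the question; an inner merge keeps only the rows where the two columns match:
out = df1.merge(df2, left_on='colB', right_on='colF', how='inner')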

Compare even and odd rows in a Pandas Data Frame

I have a data frame like this:
Index  Time      Id
0      10:10:00  11
1      10:10:01  12
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
8      10:10:12  11
9      10:10:14  13
I want to compare the Id column for each pair, i.e. between rows 0 and 1, between rows 2 and 3, etc.
In other words, I want to compare even rows with odd rows and keep the rows of pairs that share the same Id.
My ideal output would be:
Index  Time      Id
2      10:10:02  12
3      10:10:04  12
4      10:10:06  13
5      10:10:07  13
6      10:10:08  11
7      10:10:10  11
I tried this, but it did not work:
df = df[
df[::2]["id"] ==df[1::2]["id"]
]
You can use a GroupBy.transform approach:
import numpy as np

# for each pair, is there only one kind of Id?
out = df[df.groupby(np.arange(len(df))//2)['Id'].transform('nunique').eq(1)]
Or, more efficiently, using the underlying numpy array:
# convert to numpy
a = df['Id'].to_numpy()
# are the odds equal to evens?
out = df[np.repeat((a[::2]==a[1::2]), 2)]
output:
Index Time Id
2 2 10:10:02 12
3 3 10:10:04 12
4 4 10:10:06 13
5 5 10:10:07 13
6 6 10:10:08 11
7 7 10:10:10 11
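One caveat with the numpy version: np.repeat(..., 2) assumes an even number of rows. A minimal sketch of a variant that also tolerates a trailing unpaired row (which is simply dropped); this is an addition here, not part of the original answer:
import numpy as np

a = df['Id'].to_numpy()
n_pairs = len(a) // 2                       # number of complete (even, odd) pairs
keep = np.repeat(a[:2 * n_pairs:2] == a[1:2 * n_pairs:2], 2)
keep = np.concatenate([keep, np.zeros(len(a) - len(keep), dtype=bool)])  # pad for an unpaired row, if any
out = df[keep]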

Dask.dataframe or Alternative: Scalable way of dropping rows of low frequency items

I am looking for a way to remove rows from a dataframe that contain low frequency items. I adapted the following snippet from this post:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, high=9, size=(100, 2)),
                  columns=['A', 'B'])
threshold = 10 # Anything that occurs less than this will be removed.
value_counts = df.stack().value_counts() # Entire DataFrame
to_remove = value_counts[value_counts <= threshold].index
df.replace(to_remove, np.nan, inplace=True)
The problem is that this code does not seem to scale.
The line to_remove = value_counts[value_counts <= threshold].index has now been running for several hours for my data (2 GB compressed HDFStore). I therefore need a better solution. Ideally out-of-core. I suspect dask.dataframe is suitable, but I fail to express the above code in terms of dask. The key functions stack and replace are absent from dask.dataframe.
I tried the following (works in normal pandas) to work around the lack of these two functions:
value_countss = [df[col].value_counts() for col in df.columns]
infrequent_itemss = [value_counts[value_counts < 3] for value_counts in value_countss]
rows_to_drop = set(i for indices in [df.loc[df[col].isin(infrequent_items.keys())].index.values for col, infrequent_items in zip(df.columns, infrequent_itemss)] for i in indices)
df.drop(rows_to_drop)
That does not actually work with dask though. It errors at infrequent_items.keys().
Even if it did work, given that this is the opposite of elegant, I suspect there must be a better way.
Can you suggest something?
Not sure if this will help you out, but it's too big for a comment:
df = pd.DataFrame(np.random.randint(0, high=20, size=(30,2)), columns = ['A', 'B'])
unique, counts = np.unique(df.values.ravel(), return_counts=True)
d = dict(zip(unique, counts))
threshold = 10
to_remove = [k for k, v in d.items() if v < threshold]
df.replace(to_remove, np.nan, inplace=True)
See:
How to count the occurrence of certain item in an ndarray in Python?
how to count occurrence of each unique value in pandas
Toy problem showed a 40x speedup from 400 us to 10 us in the step you mentioned.
The following code, which incorporates Evan's improvement, solves my issue:
unique, counts = np.unique(df.values.ravel(), return_counts=True)
d = dict(zip(unique, counts))
to_remove = {k for k, v in d.items() if v < threshold}
mask = df.isin(to_remove)
column_mask = (~mask).all(axis=1)
df = df[column_mask]
demo:
def filter_low_frequency(df, threshold=4):
    unique, counts = np.unique(df.values.ravel(), return_counts=True)
    d = dict(zip(unique, counts))
    to_remove = {k for k, v in d.items() if v < threshold}
    mask = df.isin(to_remove)
    column_mask = (~mask).all(axis=1)
    df = df[column_mask]
    return df
df = pd.DataFrame(np.random.randint(0, high=20, size=(10,10)))
print(df)
print(df.stack().value_counts())
df = filter_low_frequency(df)
print(df)
0 1 2 3 4 5 6 7 8 9
0 3 17 11 13 8 8 15 14 7 8
1 2 14 11 3 16 10 19 19 14 4
2 8 13 13 17 3 13 17 18 5 18
3 7 8 14 9 15 12 0 15 2 19
4 6 12 13 11 16 6 19 16 2 17
5 2 1 2 17 1 3 12 10 2 16
6 0 19 9 4 15 3 3 3 4 0
7 18 8 15 9 1 18 15 17 9 0
8 17 15 9 11 13 9 11 4 19 8
9 13 6 7 8 8 10 0 3 16 13
8 9
3 8
13 8
17 7
15 7
19 6
2 6
9 6
11 5
16 5
0 5
18 4
4 4
14 4
10 3
12 3
7 3
6 3
1 3
5 1
dtype: int64
0 1 2 3 4 5 6 7 8 9
6 0 19 9 4 15 3 3 3 4 0
8 17 15 9 11 13 9 11 4 19 8
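For reference, a rough out-of-core sketch of the same per-column idea with dask.dataframe; this is untested at the 2 GB scale, and the store path and key are placeholders:
import dask.dataframe as dd

ddf = dd.read_hdf('store.h5', key='/data')   # placeholder path and key
threshold = 10

# Count values per column, as in the pandas workaround above, and collect
# the infrequent ones into one small in-memory set.
to_remove = set()
for col in ddf.columns:
    counts = ddf[col].value_counts().compute()
    to_remove.update(counts[counts < threshold].index)

# Keep only rows in which no column holds an infrequent value.
mask = (~ddf.isin(list(to_remove))).all(axis=1)
result = ddf[mask]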

Replace by previous values

I have a dataframe like the one created below. The goal of this program is to replace a specific value with the previous one.
import pandas as pd
test = pd.DataFrame([2,2,3,1,1,2,4,6,43,23,4,1,3,3,1,1,1,4,5], columns = ['A'])
If one wants to replace every 1 with the previous value, a possible solution is:
for li in test[test['A'] == 1].index:
    test['A'].iloc[li] = test['A'].iloc[li-1]
However, it is very inefficient. Can you suggest a more efficient solution?
IIUC, replace 1 with np.nan, then ffill (requires import numpy as np):
test.replace(1,np.nan).ffill().astype(int)
Out[881]:
A
0 2
1 2
2 3
3 3
4 3
5 2
6 4
7 6
8 43
9 23
10 4
11 4
12 3
13 3
14 3
15 3
16 3
17 4
18 5
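A minimal alternative sketch, assuming the test frame from the question and that only column A should be touched: mask the 1s instead of replacing them across the whole frame, then forward-fill.
# Same result as above, but scoped to column A only.
test['A'] = test['A'].mask(test['A'].eq(1)).ffill().astype(int)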

Reassigning strings by value count in Pandas

I have a number of string-based columns in a pandas dataframe that I'm looking to use in scikit-learn classification models. I know I have to use OneHotEncoder to properly encode the variables, but first I want to reduce the variation in the columns by bucketing either the strings that appear less than x% of the time in the column, or those that are not among the top x strings by count in the column.
Here's an example:
df1 = pd.DataFrame({'a':range(22), 'b':list('aaaaaaaabbbbbbbcccdefg'), 'c':range(22)})
df1
a b c
0 0 a 0
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
5 5 a 5
6 6 a 6
7 7 a 7
8 8 b 8
9 9 b 9
10 10 b 10
11 11 b 11
12 12 b 12
13 13 b 13
14 14 b 14
15 15 c 15
16 16 c 16
17 17 c 17
18 18 d 18
19 19 e 19
20 20 f 20
21 21 g 21
As you can see, a, b, and c appear in column b more than 10% of the time, so I'd like to keep them. On the other hand, d, e, f, and g appear less than 10% (actually about 5% of the time), so I'd like to bucket these by changing them into 'other':
df1['b']
0 a
1 a
2 a
3 a
4 a
5 a
6 a
7 a
8 b
9 b
10 b
11 b
12 b
13 b
14 b
15 c
16 c
17 c
18 other
19 other
20 other
21 other
I'd similarly like to be able to say that I only want to keep the values that appear in the top 2 in terms of frequency, so that column b looks like this:
df1['b']
0 a
1 a
2 a
3 a
4 a
5 a
6 a
7 a
8 b
9 b
10 b
11 b
12 b
13 b
14 b
15 other
16 other
17 other
18 other
19 other
20 other
21 other
I don't see an obvious way to do this in Pandas, although I admittedly know a lot more about this in R. Any ideas? Any thoughts on how to make this robust to Nones, which may appear more than 10% of the time or sit in the top x number of values?
This is kinda contorted, but it's kind of a complicated question.
First, get the counts:
In [24]: sizes = df1["b"].value_counts()
In [25]: sizes
Out[25]:
b
a 8
b 7
c 3
d 1
e 1
f 1
g 1
dtype: int64
Now, pick the indices you don't like:
In [27]: bad = sizes.index[sizes < df1.shape[0]*0.1]
In [28]: bad
Out[28]: Index([u'd', u'e', u'f', u'g'], dtype='object')
Finally, assign "other" to those rows containing bad indices:
In [34]: df1.loc[df1["b"].isin(bad), "b"] = "other"
In [36]: df1
Out[36]:
a b c
0 0 a 0
1 1 a 1
2 2 a 2
3 3 a 3
4 4 a 4
5 5 a 5
6 6 a 6
7 7 a 7
8 8 b 8
9 9 b 9
10 10 b 10
11 11 b 11
12 12 b 12
13 13 b 13
14 14 b 14
15 15 c 15
16 16 c 16
17 17 c 17
18 18 other 18
19 19 other 19
20 20 other 20
21 21 other 21
[22 rows x 3 columns]
You can use sizes.sort() and get the last n values from the result in order to find just the top two indices.
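A minimal sketch of that top-n variant on current pandas (Series.sort() no longer exists there; nlargest plays the same role), with n=2 as in the question:
top = sizes.nlargest(2).index                       # the two most frequent labels
df1.loc[~df1["b"].isin(top), "b"] = "other"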
Edit: you should be able to do something like this, replacing all instances of "b" with filterByColumn:
def filterDataFrame(df1, filterByColumn):
    sizes = df1[filterByColumn].value_counts()
    ...
Here's my solution:
def cleanupData(inputCol, fillString, cutoffPercent=None, cutoffNum=31):
    col = inputCol
    col.fillna(fillString, inplace=True)
    valueCounts = col.value_counts()
    totalAmount = sum(valueCounts)
    if cutoffPercent is not None and cutoffNum is not None:
        raise NameError("both cutoff percent and number have values. Please only give one of these values")
    if cutoffPercent is not None:
        cutoffAmount = cutoffPercent * totalAmount
        valuesToKeep = valueCounts[valueCounts > cutoffAmount]
        valuesToKeep = valuesToKeep.index.tolist()
        numValuesKept = len(valuesToKeep)
        print "keeping " + str(numValuesKept) + " unique values in the returned column"
    if cutoffNum is not None:
        valueNames = valueCounts.index.tolist()
        valuesToKeep = valueNames[0:cutoffNum]
    newlist = []
    for row in col:
        if any(row in element for element in valuesToKeep):
            newlist.append(row)
        else:
            newlist.append("Other")
    return newlist
##
cleanupData(df1['b'], "Other", cutoffNum=2)
['a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other', 'Other']
Assign your frame to a variable (I call it x) and then count occurrences per value in column "b":
d = {s: sum(x["b"].values==s) for s in set(x["b"].values)}
You can then use these counts to assign a new value wherever d[s] is below a certain threshold:
x[x["b"].values==s] = "Call me Peter Maffay"
