Suppose I have the following DataFrame:
A B C D E F Cost
0 1 1 0 0 0 10
0 0 1 0 1 0 3
1 0 0 0 0 1 5
0 1 0 1 0 0 7
I want to construct a new DataFrame based on the values above.
Specifically, for each row, I want to join the names of the columns whose value is 1 into a new column name, and use that row's Cost as the value of the new column.
So the expected output would be something like:
BC CE AF BD
10 3 5 7
How can I achieve this?
We can take the dot product of the binary columns with the column names to build the key string from the 1s and 0s, then add the Cost column back:
cols = df.columns.difference(['Cost'])
new_df = df[cols].dot(cols).to_frame(name='key')
new_df['Cost'] = df['Cost']
new_df:
key Cost
0 BC 10
1 CE 3
2 AF 5
3 BD 7
The DataFrame can be transposed if needed:
cols = df.columns.difference(['Cost'])
new_df = df[cols].dot(cols).to_frame(name='key')
new_df['Cost'] = df['Cost']
new_df = new_df.set_index('key').T.rename_axis(columns=None)
new_df:
BC CE AF BD
Cost 10 3 5 7
DataFrame and imports:
import pandas as pd
df = pd.DataFrame({
"A": [0, 0, 1, 0],
"B": [1, 0, 0, 1],
"C": [1, 1, 0, 0],
"D": [0, 0, 0, 1],
"E": [0, 1, 0, 0],
"F": [0, 0, 1, 0],
"Cost": [10, 3, 5, 7],
})
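To see why the dot product produces the key strings, here is a minimal sketch that rebuilds the same keys by hand. It relies on Python string repetition (1 * "B" is "B", 0 * "B" is ""); the helper row_key and the variable names are illustrative assumptions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0, 0, 1, 0],
    "B": [1, 0, 0, 1],
    "C": [1, 1, 0, 0],
    "D": [0, 0, 0, 1],
    "E": [0, 1, 0, 0],
    "F": [0, 0, 1, 0],
    "Cost": [10, 3, 5, 7],
})
cols = df.columns.difference(["Cost"])


def row_key(row):
    # flag * name keeps the name when flag is 1 and yields "" when flag is 0
    return "".join(int(flag) * name for name, flag in row.items())


keys_by_hand = df[cols].apply(row_key, axis=1)
keys_by_dot = df[cols].dot(cols)  # same products, summed by the dot product
print(keys_by_dot.tolist())
```

Both Series hold one key per row, so either can be paired with the Cost column.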
You don't need a loop to do it. With datar, you can achieve it with dplyr-like syntax:
>>> from datar.all import *
>>>
>>> # Create the df
>>> df = tribble(
... f.A, f.B, f.C, f.D, f.E, f.F, f.Cost,
... 0, 1, 1, 0, 0, 0, 10,
... 0, 0, 1, 0, 1, 0, 3,
... 1, 0, 0, 0, 0, 1, 5,
... 0, 1, 0, 1, 0, 0, 7,
... )
>>> df
A B C D E F Cost
<int64> <int64> <int64> <int64> <int64> <int64> <int64>
0 0 1 1 0 0 0 10
1 0 0 1 0 1 0 3
2 1 0 0 0 0 1 5
3 0 1 0 1 0 0 7
>>> # replace value with column names
>>> df = df >> mutate(across(f[1:6], lambda x: if_else(x, x.name, "")))
>>> df
A B C D E F Cost
<object> <object> <object> <object> <object> <object> <int64>
0 B C 10
1 C E 3
2 A F 5
3 B D 7
>>> # unite the columns
>>> df = df >> unite('col', f[1:6], sep="")
>>> df
col Cost
<object> <int64>
0 BC 10
1 CE 3
2 AF 5
3 BD 7
>>> # reshape the result
>>> df >> column_to_rownames(f.col) >> t()
BC CE AF BD
<int64> <int64> <int64> <int64>
Cost 10 3 5 7
Disclaimer: I am the author of the datar package.
Here is how I would proceed:
Create the df
data = {
"A": [0, 0, 1, 0],
"B": [1, 0, 0, 1],
"C": [1, 1, 0, 0],
"D": [0, 0, 0, 1],
"E": [0, 1, 0, 0],
"F": [0, 0, 1, 0],
"Cost": [10, 3, 5, 7],
}
df = pd.DataFrame(data)
Get the column names
def make_df(row):
    row = row.to_dict()
    # Join the names of all flagged columns, skipping Cost itself.
    return "".join([k for k, v in row.items() if v and k != "Cost"])
df_ind = df.apply(make_df, axis=1)
Create the desired data frame
pd.DataFrame(df.Cost.values, index=df_ind.values).T
This will give you:
BC CE AF BD
10 3 5 7
Not as nice as previous answers, but straightforward & step-by-step:
outp_dict = {}
for index, row in df.iterrows():
    new_col = ""
    for col_name, value in row.items():
        # Compare with != (not "is not"): identity checks on strings are unreliable.
        if value and col_name != "Cost":
            new_col += str(col_name)
    # Look up Cost by label rather than by position.
    outp_dict[new_col] = row["Cost"]
outp_df = pd.DataFrame(outp_dict, index=[0])
Related
I have the below data table
A = [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
B = [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df = pd.DataFrame({'A':A, 'B':B})
I'd like to calculate the average of column A over each run of consecutive rows where column B equals 1. Rows where column B equals 0 are ignored, and the results go into a new dataframe like below:
Thanks for your help!
Keywords: groupby, shift, mean
Code:
df_result=df.groupby((df['B'].shift(1,fill_value=0)!= df['B']).cumsum()).mean()
df_result=df_result[df_result['B']!=0]
df_result
A B
1 2.0 1.0
3 3.0 1.0
As you might have noticed, you first need to identify the blocks of consecutive rows that share the same value.
One way to do so is by shifting B one row and then comparing it with itself.
df['B_shifted']=df['B'].shift(1,fill_value=0) # fill_value=0 to return int and replace Nan with 0's
df['A'] =[2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5]
df['B'] =[0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0]
df['B_shifted'] =[0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
(df['B_shifted'] != df['B'])=[F, T, F, F, T, F, T, F, F, T, F]
[↑ ][↑ ][↑ ][↑ ]
Now we can use the groupby pandas method as follows:
df_grouped=df.groupby((df['B_shifted'] != df['B']).cumsum())
Now if we loop over the DataFrameGroupBy object df_grouped,
we'll see the following tuples:
(0, A B B_shifted
0 2 0 0)
(1, A B B_shifted
1 3 1 0
2 1 1 1
3 2 1 1)
(2, A B B_shifted
4 4 0 1
5 1 0 0)
(3, A B B_shifted
6 5 1 0
7 3 1 1
8 1 1 1)
(4, A B B_shifted
9 7 0 1
10 5 0 0)
We can now simply calculate the mean and filter out the zero groups as follows:
df_result=df_grouped.mean()
df_result=df_result[df_result['B']!=0][['A','B']]
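The steps above can be condensed into a short, self-contained sketch. It filters the rows with B == 1 first and then groups by the run label; the names run_id, ones and run_means are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [2, 3, 1, 2, 4, 1, 5, 3, 1, 7, 5],
    "B": [0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0],
})

# A new run starts wherever B differs from the previous row;
# the cumulative sum turns those change points into run labels.
run_id = (df["B"] != df["B"].shift(fill_value=0)).cumsum()

# Keep only the rows inside runs of 1s, then average A per run.
ones = df["B"] == 1
run_means = df.loc[ones, "A"].groupby(run_id[ones]).mean()
print(run_means)
```

Filtering before grouping avoids computing means for the runs of 0s that are discarded anyway.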
Try:
m = (df.B != df.B.shift(1)).cumsum() * df.B
df_out = df.groupby(m[m > 0])["A"].mean().reset_index(drop=True).to_frame()
df_out["B"] = 1
print(df_out)
Prints:
A B
0 2 1
1 3 1
df1 = df.groupby((df['B'].shift() != df['B']).cumsum()).mean().reset_index(drop=True)
df1 = df1[df1['B'] == 1].astype(int).reset_index(drop=True)
df1
Output
A B
0 2 1
1 3 1
Explanation
We check whether each row's value of B differs from the previous row's value using Series.shift(). The cumulative sum of those changes labels each run of consecutive equal values; we group by that label, calculate the mean of each group, and assign the result to a new dataframe df1.
Since this gives the mean for every run of consecutive 0s and 1s, we then keep only the rows where B == 1.
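As a quick check of the direction of shift(), here is a small sketch on a hypothetical four-element Series (the values are made up for illustration). shift() moves values down, so each row is compared with the previous one, while shift(-1) would compare with the next.

```python
import pandas as pd

b = pd.Series([0, 1, 1, 0])

previous = b.shift()     # value from the row above; the first entry is NaN
following = b.shift(-1)  # value from the row below; the last entry is NaN

# NaN never equals anything, so the first row always starts a new run.
starts_new_run = b != previous
print(starts_new_run.tolist())
```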
I'm trying to get the count (where all row values are 1) of each possible combination between the eight columns of a dataframe. Basically I need to understand how many times different overlaps exist.
I've tried to use itertools.product to get all the combinations, but it doesn't seem to work.
import pandas as pd
import numpy as np
import itertools
df = pd.read_excel('filename.xlsx')
df.head(15)
a b c d e f g h
0 1 0 0 0 0 1 0 0
1 1 0 0 0 0 0 0 0
2 1 0 1 1 1 1 1 1
3 1 0 1 1 0 1 1 1
4 1 0 0 0 0 0 0 0
5 0 1 0 0 1 1 1 1
6 1 1 0 0 1 1 1 1
7 1 1 1 1 1 1 1 1
8 1 1 0 0 1 1 0 0
9 1 1 1 0 1 0 1 0
10 1 1 1 0 1 1 0 0
11 1 0 0 0 0 1 0 0
12 1 1 1 1 1 1 1 1
13 1 1 1 1 1 1 1 1
14 0 1 1 1 1 1 1 0
print(list(itertools.product(new_df.columns)))
The expected output would be a dataframe with the count (n) of rows for each of valid combinations (where values in the row are all 1).
For example:
a b
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 0 1
6 1 1
7 1 1
8 1 1
9 1 1
10 1 1
11 1 0
12 1 1
13 1 1
14 0 1
Would give
combination count
a 13
a_b 7
b 9
Note that the output would need to contain all the combinations possible between a and h, not just pairwise
Powerset Combinations
Use the powerset recipe from the itertools documentation with:
s = pd.Series({
    '_'.join(c): df[c].min(axis=1).sum()
    for c in map(list, filter(None, powerset(df)))
})
a 13
b 9
c 8
d 6
e 10
f 12
g 9
h 7
a_b 7
...
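For reference, a self-contained sketch of this approach on just the a and b columns from the question. The powerset function is the recipe from the itertools documentation; the variable name counts is an assumption.

```python
from itertools import chain, combinations

import pandas as pd


def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))


df = pd.DataFrame({
    "a": [1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    "b": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
})

# The row-wise minimum is 1 only when every selected column is 1.
counts = pd.Series({
    "_".join(c): int(df[list(c)].min(axis=1).sum())
    for c in powerset(df.columns) if c  # skip the empty combination
})
print(counts)
```

With n columns this evaluates 2**n - 1 combinations, so it is only practical for a small number of columns.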
Pairwise Combinations
This is a special case, and can be vectorized. You can use dot, then extract the upper triangular matrix values:
import numpy as np
from itertools import combinations
u = df.T.dot(df)
pd.DataFrame({
    'combination': [*map('_'.join, combinations(df, 2))],
    # pandas < 0.24
    # 'count': u.values[np.triu_indices_from(u, k=1)]
    # pandas >= 0.24
    'count': u.to_numpy()[np.triu_indices_from(u, k=1)]
})
combination count
0 a_b 7
1 a_c 7
2 a_d 5
3 a_e 8
4 a_f 10
5 a_g 7
6 a_h 6
7 b_c 6
8 b_d 4
9 b_e 9
As you happen to have 8 columns, np.packbits together with
np.bincount is rather convenient here:
import numpy as np
import pandas as pd
# make large example
ncol, nrow = 8, 1_000_000
df = pd.DataFrame(np.random.randint(0,2,(nrow,ncol)), columns=list("abcdefgh"))
from time import time
T = [time()]
# encode as binary numbers and count
counts = np.bincount(np.packbits(df.values.astype(np.uint8)),None,256)
# find sets in other sets
rng = np.arange(256, dtype=np.uint8)
contained = (rng & rng[:, None]) == rng[:, None]
# and sum
ccounts = (counts * contained).sum(1)
# if there are empty bins, remove them
nz = np.where(ccounts)[0].astype(np.uint8)
# helper to build bin labels
a2h = np.array(list("abcdefgh"))
# put labels to counts
result = pd.Series(ccounts[nz], index = ["_".join((*a2h[np.unpackbits(i).view(bool)],)) for i in nz])
from itertools import chain, combinations
def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))
T.append(time())
s = pd.Series({
    '_'.join(c): df[c].min(axis=1).sum()
    for c in map(list, filter(None, powerset(df)))
})
T.append(time())
print("packbits {:.3f} powerset {:.3f}".format(*np.diff(T)))
print("results equal", (result.sort_index()[1:]==s.sort_index()).all())
This gives the same result as the powerset approach but literally 1000x faster:
packbits 0.016 powerset 21.974
results equal True
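To make the bit trick easier to follow, here is a minimal sketch on a hypothetical three-row input (the data and the names codes, masks and superset_counts are assumptions for illustration). Each row is packed into one byte, the bytes are counted, and a combination's count is the sum over all bytes that contain its bits.

```python
import numpy as np

# Toy data: 3 rows, 8 binary columns a..h (a is the most significant bit).
rows = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],  # a and b set
    [1, 0, 0, 0, 0, 0, 0, 0],  # only a set
    [1, 1, 0, 0, 0, 0, 0, 0],  # a and b set
], dtype=np.uint8)

# One byte per row, then a histogram over all 256 possible bytes.
codes = np.packbits(rows, axis=1).ravel()
counts = np.bincount(codes, minlength=256)

# contained[i, j] is True when every bit of mask i is also set in mask j.
masks = np.arange(256, dtype=np.uint8)
contained = (masks[:, None] & masks[None, :]) == masks[:, None]

# For each mask, add up the rows whose byte is a superset of that mask.
superset_counts = (counts * contained).sum(axis=1)
print(superset_counts[0b10000000], superset_counts[0b11000000])
```

The 256 x 256 table only works because there are exactly 8 columns; more columns would need a wider integer encoding and a much larger table.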
If you have just values of 1 and 0, you could do:
df= pd.DataFrame({
'a': [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1],
'b': [1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0],
'c': [1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1],
'd': [1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1],
})
(df.a * df.b).sum()
This results in 17.
To get all combinations you can use combinations from itertools:
from itertools import combinations
analyze = [(col,) for col in df.columns]
analyze.extend(combinations(df.columns, 2))
for cols in analyze:
    num_ser = None
    for col in cols:
        if num_ser is None:
            num_ser = df[col]
        else:
            # Build a new Series; an in-place *= would overwrite the column in df.
            num_ser = num_ser * df[col]
    num = num_ser.sum()
    print(f'{cols} contains {num}')
This results in:
('a',) contains 26
('b',) contains 30
('c',) contains 23
('d',) contains 23
('a', 'b') contains 17
('a', 'c') contains 11
('a', 'd') contains 12
('b', 'c') contains 13
('b', 'd') contains 14
('c', 'd') contains 11
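A more compact variant of the same idea is sketched below with DataFrame.prod(axis=1), which multiplies the selected columns row by row without touching the original data. The small DataFrame and the name results are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data with 0/1 values only.
df = pd.DataFrame({
    "a": [1, 0, 1, 1],
    "b": [1, 1, 0, 1],
    "c": [0, 1, 1, 1],
})
original = df.copy()

groups = [(col,) for col in df.columns]
groups.extend(combinations(df.columns, 2))

# The row-wise product is 1 only when all selected columns are 1.
results = {cols: int(df[list(cols)].prod(axis=1).sum()) for cols in groups}
for cols, num in results.items():
    print(f"{cols} contains {num}")
```

Because prod returns a new Series, df is left unchanged and the loop can be rerun safely.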
A co-occurrence matrix is all you need.
Let's construct an example first:
import numpy as np
import pandas as pd
mat = np.zeros((5,5))
mat[0,0] = 1
mat[0,1] = 1
mat[1,0] = 1
mat[2,1] = 1
mat[3,3] = 1
mat[3,4] = 1
mat[2,4] = 1
cols = ['a','b','c','d','e']
df = pd.DataFrame(mat,columns=cols)
print(df)
a b c d e
0 1.0 1.0 0.0 0.0 0.0
1 1.0 0.0 0.0 0.0 0.0
2 0.0 1.0 0.0 0.0 1.0
3 0.0 0.0 0.0 1.0 1.0
4 0.0 0.0 0.0 0.0 0.0
Now we construct the co-occurrence matrix:
# construct the co-occurrence matrix:
co_df = df.T.dot(df)
print(co_df)
a b c d e
a 2.0 1.0 0.0 0.0 0.0
b 1.0 2.0 0.0 0.0 1.0
c 0.0 0.0 0.0 0.0 0.0
d 0.0 0.0 0.0 1.0 1.0
e 0.0 1.0 0.0 1.0 2.0
Finally the result you need:
result = {}
for c1 in cols:
    for c2 in cols:
        if c1 == c2:
            if c1 not in result:
                result[c1] = co_df[c1][c2]
        else:
            if '_'.join([c1,c2]) not in result:
                result['_'.join([c1,c2])] = co_df[c1][c2]
print(result)
{'a': 2.0, 'a_b': 1.0, 'a_c': 0.0, 'a_d': 0.0, 'a_e': 0.0, 'b_a': 1.0, 'b': 2.0, 'b_c': 0.0, 'b_d': 0.0, 'b_e': 1.0, 'c_a': 0.0, 'c_b': 0.0, 'c': 0.0, 'c_d': 0.0, 'c_e': 0.0, 'd_a': 0.0, 'd_b': 0.0, 'd_c': 0.0, 'd': 1.0, 'd_e': 1.0, 'e_a': 0.0, 'e_b': 1.0, 'e_c': 0.0, 'e_d': 1.0, 'e': 2.0}
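Since the co-occurrence matrix is symmetric, mirrored keys such as a_b and b_a carry the same count. If only one of each pair is wanted, a possible sketch is to walk the upper triangle with itertools.combinations_with_replacement. The integer DataFrame below mirrors the example above; the dict comprehension is an assumption, not part of the original answer.

```python
from itertools import combinations_with_replacement

import pandas as pd

# Same pattern of ones as the example matrix, stored as integers.
df = pd.DataFrame({
    "a": [1, 1, 0, 0, 0],
    "b": [1, 0, 1, 0, 0],
    "c": [0, 0, 0, 0, 0],
    "d": [0, 0, 0, 1, 0],
    "e": [0, 0, 1, 1, 0],
})
co_df = df.T.dot(df)

# Each unordered pair (including a column paired with itself) appears once.
result = {
    (c1 if c1 == c2 else f"{c1}_{c2}"): int(co_df.loc[c1, c2])
    for c1, c2 in combinations_with_replacement(df.columns, 2)
}
print(result)
```

This yields 15 entries for 5 columns instead of 25. Like the matrix itself, it only covers single columns and pairs.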
I am trying to create a new variable with the np.where() function.
myDF['newVar'] = np.where((myDF['var1']==1) |
(myDF['var2']==1) |
(myDF['var3']==1) ,
1, 0)
Is there a way to replace var1, var2, var3 by a list like this (same condition ==1 for each column):
listVars=['var1', 'var2', 'var3']
myDF['newVar'] = np.where((myDF[listVars]==1),
1, 0)
You can use pandas.DataFrame.any() with axis=1:
listVars=['var1', 'var2', 'var3']
myDF['newVar'] = np.where((myDF[listVars]==1).any(axis=1), 1, 0)
For example:
myDF = pd.DataFrame({
"var1": [1, 1, 1, 0, 0],
"var2": [1, 0, 1, 0, 1],
"var3": [0, 0, 0, 0, 0]
})
listVars=['var1', 'var2', 'var3']
myDF['newVar'] = np.where((myDF[listVars]==1).any(axis=1), 1, 0)
print(myDF)
# var1 var2 var3 newVar
#0 1 1 0 1
#1 1 0 0 1
#2 1 1 0 1
#3 0 0 0 0
#4 0 1 0 1
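Because the result is just a 0/1 flag, the boolean mask can also be cast directly instead of going through np.where. The sketch below shows both forms side by side on the same example; the column names newVar_where and newVar_cast are made up for the comparison.

```python
import numpy as np
import pandas as pd

myDF = pd.DataFrame({
    "var1": [1, 1, 1, 0, 0],
    "var2": [1, 0, 1, 0, 1],
    "var3": [0, 0, 0, 0, 0],
})
listVars = ["var1", "var2", "var3"]

# True for every row where at least one listed column equals 1.
mask = (myDF[listVars] == 1).any(axis=1)

myDF["newVar_where"] = np.where(mask, 1, 0)
myDF["newVar_cast"] = mask.astype(int)
print(myDF)
```

Replacing any with all would instead require every listed column to equal 1.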
I have a pandas Dataframe y with 1 million rows and 5 columns.
np.shape(y)
(1037889, 5)
The column values are all 0 or 1. Looks something like this:
y.head()
a, b, c, d, e
0, 0, 1, 0, 0
1, 0, 0, 1, 1
0, 1, 1, 1, 1
0, 0, 0, 0, 0
I want a Dataframe with 1 million rows and 1 column.
np.shape(y)
(1037889, )
where the column is just the 5 columns concatenated together.
New column
0, 0, 1, 0, 0
1, 0, 0, 1, 1
0, 1, 1, 1, 1
0, 0, 0, 0, 0
I keep trying different things like merge, concat, dstack, etc...
but can't seem to figure this out.
If you want the new column to hold all the data concatenated into a string, this is a good case for the apply() function:
>>> df = pd.DataFrame({'a':[0,1,0,0], 'b':[0,0,1,0], 'c':[0,1,1,0], 'd':[0,1,1,0]})
>>> df
a b c d
0 0 0 0 0
1 1 0 1 1
2 0 1 1 1
3 0 0 0 0
>>> df2 = df.apply(lambda row: ','.join(map(str, row)), axis=1)
>>> df2
0 0,0,0,0
1 1,0,1,1
2 0,1,1,1
3 0,0,0,0
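With around a million rows, a row-wise apply can be slow. One possible alternative, sketched here under the assumption that a comma-separated string per row is what is wanted, is to convert the whole frame to strings once and join each row of the resulting NumPy array. The five-column frame follows the question; the name joined is illustrative.

```python
import pandas as pd

y = pd.DataFrame({
    "a": [0, 1, 0, 0],
    "b": [0, 0, 1, 0],
    "c": [1, 0, 1, 0],
    "d": [0, 1, 1, 0],
    "e": [0, 1, 1, 0],
})

# Convert once to strings, then join each row of the underlying array.
as_text = y.astype(str).to_numpy()
joined = pd.Series([",".join(row) for row in as_text], index=y.index, name="joined")
print(joined)
```

Whether this is actually faster than apply on a given dataset is worth timing rather than assuming.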
I have data that arrives in this format:
[
(1, "000010101001010101011101010101110101", "aaa", ... ),
(0, "111101010100101010101110101010111010", "bb", ... ),
(0, "100010110100010101001010101011101010", "ccc", ... ),
(1, "000010101001010101011101010101110101", "ddd", ... ),
(1, "110100010101001010101011101010111101", "eeee", ... ),
...
]
In tuple format, it looks like this:
(Y, X, other_info, ... )
At the end of the day, I need to train a classifier (e.g. sklearn.linear_model.logistic.LogisticRegression) using Y and X.
What's the most straightforward way to turn the string of ones and zeros into something like a np.array, so that I can run it through the classifier? Seems like there should be an easy answer here, but I haven't been able to think of/google one.
A few notes:
I'm already using numpy/pandas/sklearn, so anything in those libraries is fair game.
For a lot of what I'm doing, it's convenient to have the other_info columns together in a DataFrame
The strings are pretty long (~20,000 columns), but the total data frame is not very tall (~500 rows).
Since you asked primarily for a way to convert a string of ones and zeros into a numpy array, I'll offer my solution as follows:
d = '0101010000' * 2000 # create a 20,000 long string of 1s and 0s
d_array = np.frombuffer(d.encode('ascii'), dtype='int8') - 48 # 48 is ascii 0. ascii 1 is 49 (np.fromstring's binary mode is deprecated)
This compares favourably with @DSM's solution in terms of speed:
In [21]: timeit numpy.fromstring(d, dtype='int8') - 48
10000 loops, best of 3: 35.8 us per loop
In [22]: timeit numpy.fromiter(d, dtype='int', count=20000)
100 loops, best of 3: 8.57 ms per loop
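The byte-offset trick can be checked on a short string. This sketch uses np.frombuffer on the ASCII-encoded bytes, which reads the same raw bytes without the deprecated binary mode of np.fromstring; the sample string and the helper name bits_from_string are assumptions.

```python
import numpy as np


def bits_from_string(s):
    # Each character is one ASCII byte: "0" is 48 and "1" is 49,
    # so subtracting ord("0") leaves the digits 0 and 1.
    return np.frombuffer(s.encode("ascii"), dtype=np.uint8) - ord("0")


d = "0101010000"
d_array = bits_from_string(d)
print(d_array)
```

This assumes the string contains only the characters 0 and 1; any other character would produce a value outside {0, 1}.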
How about something like this:
Make the dataframe:
In [82]: v = [
....: (1, "000010101001010101011101010101110101", "aaa"),
....: (0, "111101010100101010101110101010111010", "bb"),
....: (0, "100010110100010101001010101011101010", "ccc"),
....: (1, "000010101001010101011101010101110101", "ddd"),
....: (1, "110100010101001010101011101010111101", "eeee"),
....: ]
In [83]:
In [83]: df = pandas.DataFrame(v)
We can use fromiter or array to get an ndarray:
In [84]: d ="000010101001010101011101010101110101"
In [85]: np.fromiter(d, int) # better: np.fromiter(d, int, count=len(d))
Out[85]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
In [86]: np.array(list(d), int)
Out[86]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
There might be a slick vectorized way to do this, but I'd just apply the obvious per-entry function to the values and get on with my day:
In [87]: df[1]
Out[87]:
0 000010101001010101011101010101110101
1 111101010100101010101110101010111010
2 100010110100010101001010101011101010
3 000010101001010101011101010101110101
4 110100010101001010101011101010111101
Name: 1
In [88]: df[1] = df[1].apply(lambda x: np.fromiter(x, int)) # better with count=len(x)
In [89]: df
Out[89]:
0 1 2
0 1 [0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 aaa
1 0 [1 1 1 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 bb
2 0 [1 0 0 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 ccc
3 1 [0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 ddd
4 1 [1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 eeee
In [90]: df[1][0]
Out[90]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0,
1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1])
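A classifier generally expects a single 2-D feature matrix rather than a column of separate arrays. The sketch below stacks the per-row arrays into X and pulls out Y, assuming all strings have the same length. The shortened strings and the column names Y, X and other_info are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened version of the incoming tuples.
v = [
    (1, "0101", "aaa"),
    (0, "1110", "bb"),
    (1, "0011", "ccc"),
]
df = pd.DataFrame(v, columns=["Y", "X", "other_info"])

# One row of 0/1 integers per string, stacked into shape (n_rows, n_bits).
X = np.vstack([np.fromiter(s, dtype=int, count=len(s)) for s in df["X"]])
y = df["Y"].to_numpy()
print(X.shape, y.shape)
```

X and y can then be passed to an estimator's fit method, while other_info stays in the DataFrame.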