I am trying to create a new column based on my first column. For example,
I have a list a = ["A", "B", "C"] and an existing dataframe
Race Boy Girl
W 0 1
B 1 0
H 1 1
W 1 0
B 0 0
H 0 1
W 1 0
B 1 1
H 0 1
My goal is to create a new column and add values to it based on the W, B, H interval, so that the end result looks like:
Race Boy Girl New Column
W 0 1 A
B 1 0 A
H 1 1 A
W 1 0 B
B 0 0 B
H 0 1 B
W 1 0 C
B 1 1 C
H 0 1 C
The W, B, H interval is consistent, and I want to add a new value to the new column every time I see W. The data is longer than this.
I have tried all possible ways and I couldn't come up with the code. I will be glad if someone can help and also explain the process. Thanks.
Here is what you can do:
Use a loop to build a list that repeats the value you want for each row:
new_values = []
for i in range(len(dataframe['Race'])):
    # append the appropriate value for row i to new_values
Once you have that list you can add it to the dataframe by using:
dataframe['New Column'] = new_values
Maybe this works (the original snippet iterated over the dataframe's columns and used `=` instead of `==`; fixed below with iterrows and loc):
values = ['A','B','C',....]
i = -1
for idx, row in dataframe.iterrows():
    if row['Race'] == 'W':
        i += 1
    dataframe.loc[idx, 'new column'] = values[i]
Also, if the new column list is very long to type, you can use a list comprehension:
values = [x for x in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ']
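If you prefer not to spell the alphabet out at all, the standard library already has it; a small sketch:

```python
import string

# same list of single uppercase letters, without typing them out
labels = list(string.ascii_uppercase)
print(labels[:3])  # ['A', 'B', 'C']
```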
If your W, B, H values are in this exact order and form complete intervals, you may use np.repeat. As noted in your comment, np.repeat alone would suffice.
import numpy as np
a = ["A", "B", "C"] #list
n = df.Race.nunique() # length of each interval
df['New Col'] = np.repeat(a, n)
In [20]: df
Out[20]:
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
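If the data runs longer than three intervals, the labels can be generated rather than typed. A sketch on a hypothetical longer frame, assuming the strict W, B, H pattern holds throughout:

```python
import string

import numpy as np
import pandas as pd

# hypothetical frame with five complete W, B, H intervals
df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 5})

n = df.Race.nunique()        # length of each interval (3)
n_groups = len(df) // n      # number of intervals (5)
df['New Col'] = np.repeat(list(string.ascii_uppercase)[:n_groups], n)
print(df['New Col'].tolist())
```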
Here is a way with pandas. It increments each time you see a new 'W' and handles missing values of Race.
# use original post's definition of df
df['New Col'] = (
(df['Race'] == 'W') # True (1) for W; False (0) otherwise
.cumsum() # increments each time you hit True (1)
.map({1: 'A', 2: 'B', 3: 'C'}) # 1->A, 2->B, ...
)
print(df)
Race Boy Girl New Col
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
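The hard-coded map dict can be replaced by indexing into a label list, so the same cumsum idea scales past three groups. A sketch on a hypothetical four-interval frame:

```python
import string

import pandas as pd

df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 4})
labels = list(string.ascii_uppercase)

# cumsum yields 1 for the first W-group, 2 for the second, ...
group = df['Race'].eq('W').cumsum() - 1
df['New Col'] = group.map(lambda i: labels[i])
print(df['New Col'].tolist())
```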
There are multiple ways to solve this problem statement. You can iterate through the DataFrame and assign values to the new column at each interval.
Here's an approach I think will work.
#setting up the DataFrame you referred in the example
import pandas as pd
df = pd.DataFrame({'Race':['W','B','H','W','B','H','W','B','H'],
'Boy':[0,1,1,1,0,0,1,1,0],
'Girl':[1,0,1,0,0,1,0,1,1]})
#if you have 3 values to assign, create a list say A, B, C
#By creating a list, you have to manage only the list and the frequency
a = ['A','B','C']
#iterate thru the dataframe and assign the values in batches
for i, row in df.iterrows():      #the trick is to assign via df.loc[i]
    df.loc[i,'New'] = a[int(i/3)] #i is the index; assign the value from list a
#note: dividing by 3 distributes the values into equal batches
print(df)
The output of this will be:
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
I see that you are trying to get a solution that works for 17 sets of records. Here's the code and it works correctly.
import pandas as pd
df = pd.DataFrame({'Race':['W','B','H']*17,
'Boy':[0,1,1]*17,
'Girl':[1,0,1]*17})
#in the DataFrame, you can define the Boy and Girl value
#I think Race values are repeating so I just repeated it 17 times
#define a string of labels from A through Z
a = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
for (i,row) in df.iterrows():
df.loc[i,'New'] = a[int(i/3)] #still dividing it by 3 equal batches
print(df)
I didn't print all 17 sets; I just did 7 sets. It is still the same result.
Race Boy Girl New
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 0 1 B
4 B 1 0 B
5 H 1 1 B
6 W 0 1 C
7 B 1 0 C
8 H 1 1 C
9 W 0 1 D
10 B 1 0 D
11 H 1 1 D
12 W 0 1 E
13 B 1 0 E
14 H 1 1 E
15 W 0 1 F
16 B 1 0 F
17 H 1 1 F
18 W 0 1 G
19 B 1 0 G
20 H 1 1 G
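The hard-coded 3 in a[int(i/3)] can be derived from the data itself. A sketch, assuming a default RangeIndex and one row per Race value in each interval:

```python
import pandas as pd

df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 3,
                   'Boy':  [0, 1, 1, 1, 0, 0, 1, 1, 0],
                   'Girl': [1, 0, 1, 0, 0, 1, 0, 1, 1]})
a = ['A', 'B', 'C']
n = df['Race'].nunique()        # interval length, 3 here

for i, row in df.iterrows():
    df.loc[i, 'New'] = a[i // n]
print(df['New'].tolist())
```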
The old Pythonic fashion: use a function!
In [18]: df
Out[18]:
Race Boy Girl
0 W 0 1
1 B 1 0
2 H 1 1
3 W 1 0
4 B 0 0
5 H 0 1
6 W 1 0
7 B 1 1
8 H 0 1
The function:
def make_new_col(race_col, abc):
    race_col = iter(race_col)
    abc = iter(abc)
    abc_value = None  # fallback in case the column does not start with 'W'
    new_col = []
    while True:
        try:
            race = next(race_col)
        except StopIteration:
            break
        if race == 'W':
            abc_value = next(abc)
        new_col.append(abc_value)
    return new_col
Then do:
abc = ['A', 'B', 'C']
df['New Column'] = make_new_col(df['Race'], abc)
You get:
In [20]: df
Out[20]:
Race Boy Girl New Column
0 W 0 1 A
1 B 1 0 A
2 H 1 1 A
3 W 1 0 B
4 B 0 0 B
5 H 0 1 B
6 W 1 0 C
7 B 1 1 C
8 H 0 1 C
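The same advance-on-'W' logic can also be written without an explicit loop: label only the 'W' rows, then forward-fill. A sketch, not from the answer above:

```python
import pandas as pd

df = pd.DataFrame({'Race': ['W', 'B', 'H'] * 3})
abc = ['A', 'B', 'C']

is_w = df['Race'] == 'W'
df.loc[is_w, 'New Column'] = abc[:is_w.sum()]   # one label per 'W' row
df['New Column'] = df['New Column'].ffill()     # carry it down to B and H
print(df['New Column'].tolist())
```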
Related
I have a DataFrame that looks as follows:
import pandas as pd
df = pd.DataFrame({
'ids': range(4),
'strc': ['some', 'thing', 'abc', 'foo'],
'not_relevant': range(4),
'strc2': list('abcd'),
'strc3': list('lkjh')
})
ids strc not_relevant strc2 strc3
0 0 some 0 a l
1 1 thing 1 b k
2 2 abc 2 c j
3 3 foo 3 d h
For each value in ids I want to collect all values that are stored in the
columns that start with strc and put them in a separate columns called strc_list, so I want:
ids strc not_relevant strc2 strc3 strc_list
0 0 some 0 a l some
0 0 some 0 a l a
0 0 some 0 a l l
1 1 thing 1 b k thing
1 1 thing 1 b k b
1 1 thing 1 b k k
2 2 abc 2 c j abc
2 2 abc 2 c j c
2 2 abc 2 c j j
3 3 foo 3 d h foo
3 3 foo 3 d h d
3 3 foo 3 d h h
I know that I can select all required columns using
df.filter(like='strc', axis=1)
but I don't know how to continue from here. How can I get my desired outcome?
After filter, you need stack, droplevel, rename, and then a join back to df:
df1 = df.join(df.filter(like='strc', axis=1).stack().droplevel(1).rename('strc_list'))
Out[135]:
ids strc not_relevant strc2 strc3 strc_list
0 0 some 0 a l some
0 0 some 0 a l a
0 0 some 0 a l l
1 1 thing 1 b k thing
1 1 thing 1 b k b
1 1 thing 1 b k k
2 2 abc 2 c j abc
2 2 abc 2 c j c
2 2 abc 2 c j j
3 3 foo 3 d h foo
3 3 foo 3 d h d
3 3 foo 3 d h h
You can first store the desired values in a list using apply:
df['strc_list'] = df.filter(like='strc', axis=1).apply(list, axis=1)
0 [some, a, l]
1 [thing, b, k]
2 [abc, c, j]
3 [foo, d, h]
Then use explode to distribute them over separate rows:
df = df.explode('strc_list')
A one-liner could then look like this:
df.assign(strc_list=df.filter(like='strc', axis=1).apply(list, axis=1)).explode('strc_list')
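Putting the one-liner together with the example frame (note that explode requires pandas >= 0.25):

```python
import pandas as pd

df = pd.DataFrame({
    'ids': range(4),
    'strc': ['some', 'thing', 'abc', 'foo'],
    'not_relevant': range(4),
    'strc2': list('abcd'),
    'strc3': list('lkjh'),
})

out = (df.assign(strc_list=df.filter(like='strc', axis=1).apply(list, axis=1))
         .explode('strc_list'))
print(len(out))   # 4 ids * 3 strc columns = 12 rows
```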
This is my csv file:
A B C D J
0 1 0 0 0
0 0 0 0 0
1 1 1 0 0
0 0 0 0 0
0 0 7 0 7
Each time I need to select two columns and check this condition: if both values are 0, I delete the row. So, for example, I select A and B:
Input
A B
0 1
0 0
1 1
0 0
0 0
Output
A B
0 1
1 1
And then I select A and C, and so on.
I used this code for A and B but it returns errors:
import pandas as pd
df = pd.read_csv('Book1.csv')
a=df['A']
b=df['B']
indexes_to_drop = []
for i in df.index:
if df[(a==0) & (b==0)] :
indexes_to_drop.append(i)
df.drop(df.index[indexes_to_drop], inplace=True )
Any help please!
First we make your desired combinations of column A with all the rest, then we use iloc to select the correct rows per column combination:
idx_ranges = [[0,i] for i in range(1, len(df.columns))]
dfs = [df[df.iloc[:, idx].ne(0).any(axis=1)].iloc[:, idx] for idx in idx_ranges]
print(dfs[0], '\n')
print(dfs[1], '\n')
print(dfs[2], '\n')
print(dfs[3])
A B
0 0 1
2 1 1
A C
2 1 1
4 0 7
A D
2 1 0
A J
2 1 0
4 0 7
Do not iterate. Create a Boolean Series to slice your DataFrame:
cols = ['A', 'B']
m = df[cols].ne(0).any(axis=1)
df.loc[m]
A B C D J
0 0 1 0 0 0
2 1 1 1 0 0
You can get all combinations and store them in a dict with itertools.combinations. Use .loc to select both the rows and columns you care about.
from itertools import combinations
d = {c: df.loc[df[list(c)].ne(0).any(axis=1), list(c)]
     for c in combinations(df.columns, 2)}
d[('A', 'B')]
# A B
#0 0 1
#2 1 1
d[('C', 'J')]
# C J
#2 1 0
#4 7 7
I have a dataframe as follows:
data
0 a
1 a
2 a
3 a
4 a
5 b
6 b
7 b
8 b
9 b
I want to group the repeating values of a and b into a single row element as follows:
data
0 a
a
a
a
a
1 b
b
b
b
b
How do I go about doing this? I tried the following but it puts each repeating value in its own column
df.groupby('data')
This looks like a pivot problem, but since you are missing both the column key (created by cumcount) and the index key (created by factorize), it is hard to see at first:
pd.crosstab(pd.factorize(df.data)[0],df.groupby('data').cumcount(),df.data,aggfunc='sum')
Out[358]:
col_0 0 1 2 3 4
row_0
0 a a a a a
1 b b b b b
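Building both keys explicitly first can make the reshape easier to read; a sketch on the example data using pivot instead of crosstab:

```python
import pandas as pd

df = pd.DataFrame({'data': ['a'] * 5 + ['b'] * 5})

out = (df.assign(row=pd.factorize(df['data'])[0],       # 0 for a-block, 1 for b-block
                 col=df.groupby('data').cumcount())     # position within each block
         .pivot(index='row', columns='col', values='data'))
print(out.shape)   # (2, 5)
```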
Something like:
index = ((df['data'] != df['data'].shift()).cumsum() - 1).rename(None)
df = df.set_index(index)
data
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b
You can use pd.factorize followed by set_index:
df = df.assign(key=pd.factorize(df['data'], sort=False)[0]).set_index('key')
print(df)
data
key
0 a
0 a
0 a
0 a
0 a
1 b
1 b
1 b
1 b
1 b
I am using Python 2.7 with Pandas on a Windows 10 machine.
I have an n by n DataFrame where:
1) The index represents people's names
2) The column headers are the same people's names in the same order
3) Each cell of the DataFrame is the average number of times they email each other each day.
How would I transform that DataFrame into a DataFrame with 3 columns, where:
1) Column 1 would be the index of the n by n DataFrame
2) Column 2 would be the column headers of the n by n DataFrame
3) Column 3 would be the cell value corresponding to those two names from the index, column header combination of the n by n DataFrame
Edit
Apologies for not providing an example of what I am looking for. I would like to take df1 and turn it into rel_df, using the code below.
import pandas as pd
from itertools import permutations
df1 = pd.DataFrame()
df1['index'] = ['a', 'b','c','d','e']
df1.set_index('index', inplace = True)
df1['a'] = [0,1,2,3,4]
df1['b'] = [1,0,2,3,4]
df1['c'] = [4,1,0,3,4]
df1['d'] = [5,1,2,0,4]
df1['e'] = [7,1,2,3,0]
##df of all relationships to build
flds = pd.Series(df1.index)
combos = []
for L in range(0, len(flds)+1):
for subset in permutations(flds, L):
if len(subset) == 2:
combos.append(subset)
if len(subset) > 2:
break
rel_df = pd.DataFrame.from_records(data = combos, columns = ['fld1','fld2'])
rel_df['value'] = [1,4,5,7,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]
print df1
>>> print df1
a b c d e
index
a 0 1 4 5 7
b 1 0 1 1 1
c 2 2 0 2 2
d 3 3 3 0 3
e 4 4 4 4 0
>>> print rel_df
fld1 fld2 value
0 a b 1
1 a c 4
2 a d 5
3 a e 7
4 b a 1
5 b c 1
6 b d 1
7 b e 1
8 c a 2
9 c b 2
10 c d 2
11 c e 2
12 d a 3
13 d b 3
14 d c 3
15 d e 3
16 e a 4
17 e b 4
18 e c 4
19 e d 4
Use melt:
df1 = df1.reset_index()
pd.melt(df1, id_vars='index', value_vars=df1.columns.tolist()[1:])
(If in your actual code you're explicitly setting the index as you do here, just skip that step rather than doing the reset_index; melt doesn't work on an index.)
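A minimal sketch of the melt route on a tiny frame, dropping the self-pairs afterwards:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [0, 1], 'b': [1, 0]}, index=['a', 'b'])
df1.index.name = 'index'

out = pd.melt(df1.reset_index(), id_vars='index')
out = out[out['index'] != out['variable']]   # drop the a-a and b-b diagonal
print(out['value'].tolist())   # [1, 1]
```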
# Flatten your dataframe.
df = df1.stack().reset_index()
# Remove duplicates (e.g. fld1 = 'a' and fld2 = 'a').
df = df.loc[df.iloc[:, 0] != df.iloc[:, 1]]
# Rename columns.
df.columns = ['fld1', 'fld2', 'value']
>>> df
fld1 fld2 value
1 a b 1
2 a c 4
3 a d 5
4 a e 7
5 b a 1
7 b c 1
8 b d 1
9 b e 1
10 c a 2
11 c b 2
13 c d 2
14 c e 2
15 d a 3
16 d b 3
17 d c 3
19 d e 3
20 e a 4
21 e b 4
22 e c 4
23 e d 4
Hi all, I have a csv file which contains data in the format below:
A a
A b
B f
B g
B e
B h
C d
C e
C f
The first column contains items; the second column contains the available features from the feature vector = [a,b,c,d,e,f,g,h].
I want to convert this to an occurrence matrix that looks like below:
a,b,c,d,e,f,g,h
A 1,1,0,0,0,0,0,0
B 0,0,0,0,1,1,1,1
C 0,0,0,1,1,1,0,0
Can anyone tell me how to do this using pandas?
Here is another way to do it using pd.get_dummies().
import pandas as pd
# your data
# =======================
df
col1 col2
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
# processing
# ===================================
pd.get_dummies(df.col2).groupby(df.col1).apply(max)
a b d e f g h
col1
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
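For reference, the .apply(max) step can be written as the direct .max() aggregation; a self-contained sketch with the same data (column names col1/col2 assumed, as in the answer above):

```python
import pandas as pd

df = pd.DataFrame({'col1': list('AABBBBCCC'),
                   'col2': list('abfgehdef')})

# one dummy column per feature, then take the max per item
occ = pd.get_dummies(df['col2']).groupby(df['col1']).max().astype(int)
print(occ.loc['A'].tolist())   # [1, 1, 0, 0, 0, 0, 0]
```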
It's unclear whether your data has a typo, but you can use crosstab for this:
In [95]:
pd.crosstab(index=df['A'], columns = df['a'])
Out[95]:
a b d e f g h
A
A 1 0 0 0 0 0
B 0 0 1 1 1 1
C 0 1 1 1 0 0
In your sample data, the second column has the value a as its name, but in your expected output a appears inside the column as a value.
EDIT
OK, I fixed your input data so that it generates the correct result:
In [98]:
import pandas as pd
import io
t="""A a
A b
B f
B g
B e
B h
C d
C e
C f"""
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None, names=['A','a'])
df
Out[98]:
A a
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
In [99]:
ct = pd.crosstab(index=df['A'], columns = df['a'])
ct
Out[99]:
a a b d e f g h
A
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
This approach yields the same result as a scipy sparse COO matrix, much faster:
from scipy import sparse
df['col1'] = df['col1'].astype("category")
df['col2'] = df['col2'].astype("category")
df['ones'] = 1
user_items = sparse.coo_matrix((df.ones.astype(float),
(df.col1.cat.codes,
df.col2.cat.codes)))
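To check the result, the COO matrix can be converted to dense; a sketch with the example data and the same assumed col1/col2 names (category codes follow sorted order, so the columns come out as a, b, d, e, f, g, h):

```python
import pandas as pd
from scipy import sparse

df = pd.DataFrame({'col1': list('AABBBBCCC'),
                   'col2': list('abfgehdef')})
df['col1'] = df['col1'].astype('category')
df['col2'] = df['col2'].astype('category')
df['ones'] = 1

user_items = sparse.coo_matrix((df.ones.astype(float),
                                (df.col1.cat.codes, df.col2.cat.codes)))
dense = user_items.toarray()   # rows A, B, C; columns a, b, d, e, f, g, h
print(dense.shape)   # (3, 7)
```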