How does pandas convert one column of data into another? - python

I have a dataframe generated by pandas, as follows:
NO CODE
1 a
2 a
3 a
4 a
5 a
6 a
7 b
8 b
9 a
10 a
11 a
12 a
13 b
14 a
15 a
16 a
I want to convert the CODE column data to get the NUM column. The encoding rules are as follows:
NO CODE NUM
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 b b
8 b b
9 a 1
10 a 2
11 a 3
12 a 4
13 b b
14 a 1
15 a 2
16 a 3
thank you!

Try:
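import numpy as np
import pandas as pd

# True for rows whose CODE is 'a'
a_group = df.CODE.eq('a')
# Count rows within each consecutive run; rows that are not 'a' keep their CODE
df['NUM'] = np.where(a_group,
                     df.groupby(a_group.ne(a_group.shift()).cumsum())
                       .CODE.cumcount() + 1,
                     df.CODE)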
on
df = pd.DataFrame({'CODE':list('baaaaaabbaaaabbaa')})
yields
CODE NUM
-- ------ -----
0 b b
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 b b
8 b b
9 a 1
10 a 2
11 a 3
12 a 4
13 b b
14 b b
15 a 1
16 a 2
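
The key step in the answer above is labelling each consecutive run of equal values. The following is a minimal sketch of that step spelled out, using the question's CODE sequence; the names is_a, run_id and pos are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# The CODE column from the question (16 rows).
df = pd.DataFrame({'CODE': list('aaaaaabbaaaabaaa')})

is_a = df['CODE'].eq('a')
# A new run starts wherever the flag differs from the previous row.
run_id = is_a.ne(is_a.shift()).cumsum()
# Position of each row inside its run, starting at 1.
pos = df.groupby(run_id).cumcount() + 1
# Use the counter for 'a' rows and the original code for all other rows.
df['NUM'] = np.where(is_a, pos, df['CODE'])
print(df)
```

Because NUM mixes integers and strings, the resulting column has object dtype.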

IIUC
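# s increases by one at every 'b', so each run of 'a' rows shares one value of s
s = df.CODE.eq('b').cumsum()
# Keep 'b' rows as they are; number the 'a' rows within each group of s
df['NUM'] = df.CODE.where(df.CODE.eq('b'), s[~df.CODE.eq('b')].groupby(s).cumcount() + 1)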
df
Out[514]:
NO CODE NUM
0 1 a 1
1 2 a 2
2 3 a 3
3 4 a 4
4 5 a 5
5 6 a 6
6 7 b b
7 8 b b
8 9 a 1
9 10 a 2
10 11 a 3
11 12 a 4
12 13 b b
13 14 a 1
14 15 a 2
15 16 a 3

Related

categorize numerical series with python

I'm trying to figure out how to assign a category based on a column of increasing numbers. Here is an example of my dataframe:
df = pd.DataFrame({'A':[1,1,1,1,1,1,2,2,3,3,3,3,3],'B':[1,2,3,12,13,14,1,2,5,6,7,8,50]})
This produces:
df
Out[9]:
A B
0 1 1
1 1 2
2 1 3
3 1 12
4 1 13
5 1 14
6 2 1
7 2 2
8 3 5
9 3 6
10 3 7
11 3 8
12 3 50
Column B holds an increasing numerical series, but sometimes the series is interrupted and either continues from a different number or starts again. My desired output is:
Out[11]:
A B C
0 1 1 1
1 1 2 1
2 1 3 1
3 1 12 2
4 1 13 2
5 1 14 2
6 2 1 3
7 2 2 3
8 3 5 3
9 3 6 4
10 3 7 4
11 3 8 4
12 3 50 5
I would appreciate your suggestions, because I cannot find a clever way to do it. Thanks
Is this what you need?
df.B.diff().ne(1).cumsum()
Out[463]:
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
8 4
9 4
10 4
11 4
12 5
Name: B, dtype: int32
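
To see why this one-liner works, it may help to look at the intermediate steps. The sketch below uses the question's dataframe; the variable names step and breaks are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
                   'B': [1, 2, 3, 12, 13, 14, 1, 2, 5, 6, 7, 8, 50]})

# Difference from the previous row; NaN for the first row.
step = df['B'].diff()
# True wherever the series does not continue by exactly +1
# (the first row counts as a break because NaN != 1).
breaks = step.ne(1)
# Each break starts a new category, so the running total is the label.
df['C'] = breaks.cumsum()
print(df)
```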

islice and cycle with multiple levels

UPDATE:
Added the required pattern, as asked.
I have two lists, and the expected output is different from last time:
Numberset1 = [10,11,12]
Numberset2 = [1,2,3,4,5]
I want to produce the output by manipulating the lists. The expected output is:
10 1 1
10 1 2
10 1 3
10 1 4
10 1 5
10 2 2
10 2 3
10 2 4
10 2 5
10 2 1
10 3 3
10 3 4
10 3 5
10 3 1
10 3 2
10 4 4
10 4 5
10 4 1
10 4 2
10 4 3
10 5 5
10 5 1
10 5 2
10 5 3
10 5 4
11 2 2
11 2 3
11 2 4
11 2 5
11 2 1
11 3 3
11 3 4
11 3 5
11 3 1
11 3 2
11 4 4
11 4 5
11 4 1
11 4 2
11 4 3
11 5 5
11 5 1
11 5 2
11 5 3
11 5 4
11 5 1
11 1 1
11 1 2
11 1 3
11 1 4
11 1 5
12 3 3
12 3 4
12 3 5
12 3 1
12 3 2
12 4 4
12 4 5
12 4 1
12 4 2
12 4 3
12 4 4
12 4 5
12 5 5
12 5 1
12 5 2
12 5 3
12 1 1
12 1 2
12 1 3
12 1 4
12 1 5
12 2 2
12 2 3
12 2 4
12 2 5
12 2 1
The code I have tried is below. It was suggested in a previous question, and I tried to extend it to the next level of looping, but I could not get the desired output:
Numberset1 = [10,11,12]
Numberset2 = [1,2,3,4,5]
from itertools import cycle, islice
it = cycle(Numberset2)
for i in Numberset1:
    for a in Numberset2:
        for j in islice(it, len(Numberset2)):
            print(i, a, j)
        skipped1 = next(it)
    skipped1 = next(it)
The output I am getting is
10 1 1
10 1 2
10 1 3
10 1 4
10 1 5
10 2 2
10 2 3
10 2 4
10 2 5
10 2 1
10 3 3
10 3 4
10 3 5
10 3 1
10 3 2
10 4 4
10 4 5
10 4 1
10 4 2
10 4 3
10 5 5
10 5 1
10 5 2
10 5 3
10 5 4
11 1 2
11 1 3
11 1 4
11 1 5
11 1 1
11 2 3
11 2 4
11 2 5
11 2 1
11 2 2
11 3 4
11 3 5
11 3 1
11 3 2
11 3 3
11 4 5
11 4 1
11 4 2
11 4 3
11 4 4
11 5 1
11 5 2
11 5 3
11 5 4
11 5 5
12 1 3
12 1 4
12 1 5
12 1 1
12 1 2
12 2 4
12 2 5
12 2 1
12 2 2
12 2 3
12 3 5
12 3 1
12 3 2
12 3 3
12 3 4
12 4 1
12 4 2
12 4 3
12 4 4
12 4 5
12 5 2
12 5 3
12 5 4
12 5 5
12 5 1
Please note how this output differs from the expected output once the first column reaches 11.
How can cycle and islice be used for multiple levels of looping?
Pattern:
The first column should follow the order of the numbers in Numberset1. For the first number in Numberset1, the second column should follow the order of the numbers in Numberset2. The third column should also cycle through Numberset2, but each time the second column changes, the third column should start from that same number and wrap around. For the second number in Numberset1, the second column should itself start from the second number in Numberset2, and so on.
Here's a version that accomplishes the task using cycle and islice. To make the code cleaner I've created a generator function aligned_cycle which cycles through the items yielded by cycle until we get the one we want to start the current cycle with.
This updated version can cope with Numberset1 having greater length than Numberset2.
from itertools import cycle, islice

def aligned_cycle(seq, start_item):
    ''' Make a generator that cycles over the items in `seq`.
        The first item yielded equals `start_item`.
    '''
    if start_item not in seq:
        raise ValueError("{} not in {}".format(start_item, seq))
    it = cycle(seq)
    for u in it:
        if u == start_item:
            break
    yield u
    yield from it

Numberset1 = [10, 11, 12]
Numberset2 = [1, 2, 3, 4, 5]
cycle_length = len(Numberset2)

for i, u in zip(Numberset1, cycle(Numberset2)):
    for j in islice(aligned_cycle(Numberset2, u), cycle_length):
        for k in islice(aligned_cycle(Numberset2, j), cycle_length):
            print(i, j, k)
output
10 1 1
10 1 2
10 1 3
10 1 4
10 1 5
10 2 2
10 2 3
10 2 4
10 2 5
10 2 1
10 3 3
10 3 4
10 3 5
10 3 1
10 3 2
10 4 4
10 4 5
10 4 1
10 4 2
10 4 3
10 5 5
10 5 1
10 5 2
10 5 3
10 5 4
11 2 2
11 2 3
11 2 4
11 2 5
11 2 1
11 3 3
11 3 4
11 3 5
11 3 1
11 3 2
11 4 4
11 4 5
11 4 1
11 4 2
11 4 3
11 5 5
11 5 1
11 5 2
11 5 3
11 5 4
11 1 1
11 1 2
11 1 3
11 1 4
11 1 5
12 3 3
12 3 4
12 3 5
12 3 1
12 3 2
12 4 4
12 4 5
12 4 1
12 4 2
12 4 3
12 5 5
12 5 1
12 5 2
12 5 3
12 5 4
12 1 1
12 1 2
12 1 3
12 1 4
12 1 5
12 2 2
12 2 3
12 2 4
12 2 5
12 2 1
Jon Clements has written a more robust and more efficient version of aligned_cycle:
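from itertools import cycle, tee

def aligned_cycle(iterable, start_item):
    a, b = tee(iterable)
    b = cycle(b)
    # Advance both iterators together until `start_item` is found
    for u, v in zip(a, b):
        if u == start_item:
            break
    else:
        # `start_item` never appeared: yield nothing
        return
    yield u
    yield from b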
Thanks, Jon!
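
For comparison, the same pattern can also be produced without cycle and islice by rotating the list with slices. This is a different technique from the answer above, shown only as a sketch; it assumes Numberset2 is a list of distinct items, and the helper name rotations is hypothetical.

```python
def rotations(seq, start_index, count):
    """Yield `count` rotations of `seq`; the first one begins at `start_index`."""
    n = len(seq)
    for offset in range(count):
        k = (start_index + offset) % n
        # Items from position k to the end, followed by the items before k.
        yield seq[k:] + seq[:k]

Numberset1 = [10, 11, 12]
Numberset2 = [1, 2, 3, 4, 5]

rows = []
for pos, i in enumerate(Numberset1):
    # The second column starts one position later for each item of Numberset1.
    for rot in rotations(Numberset2, pos, len(Numberset2)):
        j = rot[0]           # second column: first item of the rotation
        for k in rot:        # third column: the whole rotation
            rows.append((i, j, k))

for row in rows:
    print(*row)
```

Like the zip-based loop above, this handles Numberset1 being longer than Numberset2, because the start position wraps around with the modulo.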

Python Pandas: DataFrame modification with diagonal value = 0 [duplicate]

I have a question about a pandas DataFrame. My df has the same labels in its index and its columns. It looks like this:
df:
DNA Cat2
Item A B C D E F F H I J .......
DNA Item
Cat2 A 812 62 174 0 4 46 46 7 2 15
B 62 427 27 0 0 12 61 2 4 11
C 174 27 174 0 0 13 22 5 2 4
D 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 130 10 57 33 4 5
F 46 12 13 0 10 187 4 5 0 0
......
In other words, df equals df.transpose(). All I want is a pandas function (or a numpy function applied to df.values) that sets the values where the index label equals the column label to zero. My ideal output is shown below.
df:
DNA Cat2
Item A B C D E F F H I J .......
DNA Item
Cat2 A 0 62 174 0 4 46 46 7 2 15
B 62 0 27 0 0 12 61 2 4 11
C 174 27 0 0 0 13 22 5 2 4
D 0 0 0 0 0 0 0 0 0 0
E 4 0 0 0 0 10 57 33 4 5
F 46 12 13 0 10 0 4 5 0 0
......
Is there a Python function that does this step very quickly? I tried a for loop with df.iloc[i,i]=0, but since my dataset is very big, it takes a long time to finish. Thanks in advance!
Setup
import numpy as np
import pandas as pd

np.random.seed([3,1415])
i = pd.MultiIndex.from_product(
    [['Cat2'], list('ABCDEFGHIJ')],
    names=['DNA', 'Item']
)
a = np.random.randint(5, size=(10, 10))
df = pd.DataFrame(a + a.T + 1, i, i)
df
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 1 6 6 7 7 7 4 4 8 2
B 6 1 3 6 1 6 6 4 8 5
C 6 3 9 8 9 6 7 8 4 9
D 7 6 8 1 6 9 4 5 4 3
E 7 1 9 6 9 7 3 7 2 6
F 7 6 6 9 7 9 3 4 6 6
G 4 6 7 4 3 3 9 4 5 5
H 4 4 8 5 7 4 4 5 4 5
I 8 8 4 4 2 6 5 4 9 7
J 2 5 9 3 6 6 5 5 7 3
Option 1
The simplest way is to multiply by one minus the identity matrix, which is 0 on the diagonal and 1 everywhere else:
df * (1 - np.eye(len(df), dtype=int))
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 0 6 6 7 7 7 4 4 8 2
B 6 0 3 6 1 6 6 4 8 5
C 6 3 0 8 9 6 7 8 4 9
D 7 6 8 0 6 9 4 5 4 3
E 7 1 9 6 0 7 3 7 2 6
F 7 6 6 9 7 0 3 4 6 6
G 4 6 7 4 3 3 0 4 5 5
H 4 4 8 5 7 4 4 0 4 5
I 8 8 4 4 2 6 5 4 0 7
J 2 5 9 3 6 6 5 5 7 0
Option 2
However, we can also use pd.DataFrame.mask with np.eye. Masking is nice because it still works when the data is not numeric.
df.mask(np.eye(len(df), dtype=bool), 0)
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 A 0 6 6 7 7 7 4 4 8 2
B 6 0 3 6 1 6 6 4 8 5
C 6 3 0 8 9 6 7 8 4 9
D 7 6 8 0 6 9 4 5 4 3
E 7 1 9 6 0 7 3 7 2 6
F 7 6 6 9 7 0 3 4 6 6
G 4 6 7 4 3 3 0 4 5 5
H 4 4 8 5 7 4 4 0 4 5
I 8 8 4 4 2 6 5 4 0 7
J 2 5 9 3 6 6 5 5 7 0
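
Another way to zero the diagonal is numpy's np.fill_diagonal. It is not one of the options in this answer; the sketch below assumes a small single-dtype frame and works on a copy of the values, so the original frame is left untouched.

```python
import numpy as np
import pandas as pd

labels = list('ABCD')
df = pd.DataFrame(np.arange(1, 17).reshape(4, 4), index=labels, columns=labels)

# Copy the values so that the original frame is not modified.
values = df.to_numpy(copy=True)
# fill_diagonal writes into the array in place.
np.fill_diagonal(values, 0)
result = pd.DataFrame(values, index=df.index, columns=df.columns)
print(result)
```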
Option 3
If the columns and index are not identical, or they are out of order, we can compare the labels for equality to tell us where to mask.
d = df.iloc[::-1]
d.mask(d.index == d.columns.values[:, None], 0)
DNA Cat2
Item A B C D E F G H I J
DNA Item
Cat2 J 2 5 9 3 6 6 5 5 7 0
I 8 8 4 4 2 6 5 4 0 7
H 4 4 8 5 7 4 4 0 4 5
G 4 6 7 4 3 3 0 4 5 5
F 7 6 6 9 7 0 3 4 6 6
E 7 1 9 6 0 7 3 7 2 6
D 7 6 8 0 6 9 4 5 4 3
C 6 3 0 8 9 6 7 8 4 9
B 6 0 3 6 1 6 6 4 8 5
A 0 6 6 7 7 7 4 4 8 2
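
The label comparison can be written out explicitly with numpy broadcasting. This is a sketch on a small frame with a plain (non-MultiIndex) index, which is an assumption made to keep the example short; same_label is an illustrative name.

```python
import numpy as np
import pandas as pd

labels = list('ABCD')
df = pd.DataFrame(np.arange(1, 17).reshape(4, 4), index=labels, columns=labels)
d = df.iloc[::-1]  # rows are now D, C, B, A while columns stay A, B, C, D

# same_label[r, c] is True where the label of row r equals the label of column c.
same_label = d.index.to_numpy()[:, None] == d.columns.to_numpy()
result = d.mask(same_label, 0)
print(result)
```

Putting the row labels on the first axis of the comparison keeps the boolean array aligned with the frame even when the label order is not symmetric.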

Holding a first value in a column while another column equals a value?

I would like to hold the first value in a column while another column equals zero. In Column B, the values alternate between -1, 0, and 1. In Column C, the values can be any integer. The objective is to hold the first value of Column C for as long as Column B equals zero. The current DataFrame is as follows:
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 8
5 5 0 9
6 6 0 1
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 8
11 5 0 9
12 6 0 10
The resulting DataFrame should be as follows:
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
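You first need to replace the values of column C with NaN where the condition does not hold, and then fill them in with ffill:
# True where B is non-zero in the current row or in the previous row
mask = df['B'].shift(fill_value=0).ne(0) | df['B'].ne(0)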
df['C'] = df.loc[mask, 'C']
df['C'] = df['C'].ffill().astype(int)
print (df)
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
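Alternatively, use where; if all the values are integers, add astype to convert the result back to int:
# True where B is non-zero in the current row or in the previous row
mask = df['B'].shift(fill_value=0).ne(0) | df['B'].ne(0)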
df['C'] = df['C'].where(mask).ffill().astype(int)
print (df)
A B C
1 8 1 9
2 2 1 1
3 3 0 7
4 9 0 7
5 5 0 7
6 6 0 7
7 1 1 9
8 6 1 10
9 3 0 4
10 8 0 4
11 5 0 4
12 6 0 4
13 3 1 9
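
The steps above can be put together in a self-contained sketch. It uses columns B and C from the question's current DataFrame (12 rows); the names nonzero, keep and C_held are illustrative and not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'B': [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    'C': [9, 1, 7, 8, 9, 1, 9, 10, 4, 8, 9, 10],
})

nonzero = df['B'].ne(0)
# Keep C where B is non-zero, and also on the first row of each run of zeros
# (the row whose previous B was non-zero).
keep = nonzero | nonzero.shift(fill_value=False)
# Blank out every other row, then carry the last kept value forward.
df['C_held'] = df['C'].where(keep).ffill().astype(int)
print(df)
```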

Reshaping dataframe in Pandas

Is there a quick pythonic way to transform this table
import numpy as np
import pandas as pd
index = pd.date_range('2000-1-1', periods=36, freq='M')
df = pd.DataFrame(np.random.randn(36,4), index=index, columns=list('ABCD'))
In[1]: df
Out[1]:
A B C D
2000-01-31 H 1.368795 0.106294 2.108814
2000-02-29 -1.713401 0.557224 0.115956 -0.851140
2000-03-31 -1.454967 -0.791855 -0.461738 -0.410948
2000-04-30 1.688731 -0.216432 -0.690103 -0.319443
2000-05-31 -1.103961 0.181510 -0.600383 -0.164744
2000-06-30 0.216871 -1.018599 0.731617 -0.721986
2000-07-31 0.621375 0.790072 0.967000 1.347533
2000-08-31 0.588970 -0.360169 0.904809 0.606771
...
into this table
2001 2000
12 11 10 9 8 7 6 5 4 3 2 1 12 11 10 9 8 7 6 5 4 3 2 1
A H
B
C
D
Please excuse the missing values. I added the "H" manually. I hope it gets clear what I am looking for.
To make the result easier to check, I've created a dataframe of the same shape but with integer values.
The core of the solution is pandas.DataFrame.transpose, but you first need to use the combination of index.year and index.month as the new index:
>>> df = pd.DataFrame(np.random.randint(10,size=(36, 4)), index=index, columns=list('ABCD'))
>>> df.set_index(keys=[df.index.year, df.index.month]).transpose()
2000 2001 2002
1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
A 0 0 8 7 8 0 7 1 5 1 5 4 2 1 9 5 2 0 5 3 6 4 9 3 5 1 7 3 1 7 6 5 6 8 4 1
B 4 9 9 5 2 0 8 0 9 5 2 7 5 6 3 6 8 8 8 8 0 6 3 7 5 9 6 3 9 7 1 4 7 8 3 3
C 3 2 4 3 1 9 7 6 9 6 8 6 3 5 3 2 2 1 3 1 1 2 8 2 2 6 9 6 1 5 6 5 4 6 7 5
D 8 1 3 9 2 3 8 7 3 2 1 0 1 3 9 1 8 6 4 7 4 6 3 2 9 8 9 9 0 7 4 7 3 6 5 2
Of course, this will not work properly if you have more than one record per year and month. In that case you need to group your data first:
>>> i = pd.date_range('2000-1-1', periods=36, freq='W') # weekly index
>>> df = pd.DataFrame(np.random.randint(10,size=(36, 4)), index=i, columns=list('ABCD'))
>>> df.groupby(by=[df.index.year, df.index.month]).sum().transpose()
2000
1 2 3 4 5 6 7 8 9
A 12 13 15 23 9 21 21 31 7
B 33 24 19 30 15 19 20 7 4
C 20 24 26 24 15 18 29 17 4
D 23 29 14 30 19 12 12 11 5
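
The question's target table lists the most recent year and month first. The following sketch gets that ordering by sorting the transposed columns in descending order. It assumes month-start dates (freq='MS') and deterministic integer data so the result is easy to check; the level names year and month are illustrative.

```python
import numpy as np
import pandas as pd

# 36 monthly timestamps; month-start dates are an assumption of this sketch.
index = pd.date_range('2000-01-01', periods=36, freq='MS')
df = pd.DataFrame(np.arange(36 * 4).reshape(36, 4), index=index, columns=list('ABCD'))

wide = df.copy()
# Replace the datetime index with a (year, month) MultiIndex.
wide.index = pd.MultiIndex.from_arrays(
    [df.index.year, df.index.month], names=['year', 'month'])
# Transpose, then sort the columns so the latest year and month come first.
wide = wide.T.sort_index(axis=1, ascending=False)
print(wide)
```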
