Question about iterating through Pandas series with conditional statements

I'm trying to generate a column that would have zeros everywhere except when a specific condition is met.
Right now, I have an existing series of 0s and 1s saved as a Series object. Let's call this Series A. I've created another series of the same size filled with zeros, let's call this Series B. What I'd like to do is, whenever I hit the last 1 in a sequence of 1s in Series A, then the next six rows of Series B should replace the 0s with 1s.
For example:
Series A
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
1
1
1
0
0
0
0...
Should produce Series B
0
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1...
Here's what I've tried so far:
for row in SeriesA:
    if row == 1:
        continue
    if SeriesA[row] == 1 and SeriesA[row + 1] == 0:
        SeriesB[row]=1
        SeriesB[row+1]=1
        SeriesB[row+2]=1
        SeriesB[row+3]=1
        SeriesB[row+4]=1
        SeriesB[row+5]=1
However, this just generates Series B full of zeros except for the first five rows, which become 1s. (Series A is all zeros until at least row 50.)
I think I'm not understanding how iterating works with Pandas, so any help is appreciated!
EDIT: Full(ish) code
import os
import numpy as np
import pandas as pd
df = pd.read_csv("Python_Datafile.csv", names = fields) #fields is a list with names for each column, the first column is called "Date".
df["Date"] = pd.to_datetime(df["Date"], format = "%m/%Y")
df.set_index("Date", inplace = True)
Recession = df["NBER"] # This is series A
Rin6 = Recession*0 # This is series B
gps = Recession.ne(Recession.shift(1)).where(Recession.astype(bool)).cumsum()
idx = Recession[::-1].groupby(gps).idxmax()
to_one = np.hstack(pd.date_range(start=x+pd.offsets.DateOffset(months=1), freq='M', periods=6) for x in idx)
Rin6[Rin6.index.isin(to_one)]= 1
Rin6.unique() # Returns -> array([0], dtype=int64)

You can create an ID for consecutive groups of 1s using .shift + .cumsum:
gps = s.ne(s.shift(1)).where(s.astype(bool)).cumsum()
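To see what this expression is doing, here is a small sketch using the example Series A from the question (the variable name s and the hard-coded values below are just for illustration):
import pandas as pd

# Example Series A: runs of 1s at positions 4-5 and 16-18
s = pd.Series([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0])

starts = s.ne(s.shift(1))             # True at the start of every run of equal values
flags = starts.where(s.astype(bool))  # keep those flags only where s is 1, NaN elsewhere
gps = flags.cumsum()                  # label the runs of 1s: first run -> 1, second run -> 2

print(gps.iloc[3:7])    # expect NaN, 1, 1, NaN    (first run of 1s)
print(gps.iloc[15:20])  # expect NaN, 2, 2, 2, NaN (second run of 1s)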
Then you can get the last index for each group by:
idx = s[::-1].groupby(gps).idxmax()
#0
#1.0 5
#2.0 18
#Name: 0, dtype: int64
Form the list of all indices with np.hstack:
import numpy as np
np.hstack(np.arange(x+1, x+7, 1) for x in idx)
#array([ 6, 7, 8, 9, 10, 11, 19, 20, 21, 22, 23, 24])
And set those indices to 1 in the second Series:
s2[np.hstack(np.arange(x+1, x+7, 1) for x in idx)] = 1
s2.ravel()
# array([0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0.,..
Update from your comment: Assuming you have a Series s whose index is made of datetimes with MonthStart frequency, and another Series s2 with the same index but all values 0, you can proceed in a similar fashion:
s = pd.Series([0,0,0,0,0,0,0,0,0,1,1]*5, index=pd.date_range('2010-01-01', freq='MS', periods=55))
s2 = s*0
gps = s.ne(s.shift(1)).where(s.astype(bool)).cumsum()
idx = s[::-1].groupby(gps).idxmax()
#1.0 2010-11-01
#2.0 2011-10-01
#3.0 2012-09-01
#4.0 2013-08-01
#5.0 2014-07-01
#dtype: datetime64[ns]
to_one = np.hstack(pd.date_range(start=x+pd.offsets.DateOffset(months=1), freq='MS', periods=6) for x in idx)
s2[s2.index.isin(to_one)]= 1
# I check .isin in case the indices extend beyond the indices in s2
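As a quick sanity check, slicing the result around the first run of 1s should look like this (values derived from the sample s defined above, whose first run ends at 2010-11-01):
s2['2010-10-01':'2011-06-01']
# 2010-10-01    0
# 2010-11-01    0
# 2010-12-01    1
# 2011-01-01    1
# 2011-02-01    1
# 2011-03-01    1
# 2011-04-01    1
# 2011-05-01    1
# 2011-06-01    0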

Calculating the sum of connections for all possible pairings in a network

Based on a Data Frame listing connections between a source and a destination
import pandas as pd
df = pd.DataFrame({'source':['A','B','B'],'destination':['B','C','C']})
print(df)
source destination
0 A B
1 B C
2 B C
I want to calculate a square matrix containing the number of connections for all pairings, i.e. the resulting DataFrame should be
A B C
A 0 1 0
B 0 0 2
C 0 0 0
where the indices represent the sources and the column labels the destinations.
How can I get there?
Use crosstab with DataFrame.reindex:
import numpy as np

v = np.unique(df.values)
df1 = pd.crosstab(df.source, df.destination).reindex(index=v, columns=v, fill_value=0)
print(df1)
destination A B C
source
A 0 1 0
B 0 0 2
C 0 0 0
Use pivot_table. The locations array collects all unique entries so the final index and columns can be expanded to include the all-zero rows and columns.
import numpy as np

locations = np.unique(df.values)
df.pivot_table(index='source',
               columns='destination',
               aggfunc=len, dropna=False
              ).loc[locations, locations].fillna(0)
destination A B C
source
A 0.0 1.0 0.0
B 0.0 0.0 2.0
C 0.0 0.0 0.0
Here is my solution in which I count connections after converting the letters to integers (indices):
import pandas as pd
import numpy as np

df = pd.DataFrame({'source': ['A', 'B', 'B'], 'destination': ['B', 'C', 'C']})
print(df)

nodes = np.unique(df.values)
n_nodes = len(nodes)  # assuming you have no letters missing
adj = np.zeros((n_nodes, n_nodes))
lett2num = lambda letter: ord(letter.lower()) - 96  # convert letter to number, case insensitive

for index, row in df.iterrows():
    i = lett2num(row.source) - 1
    j = lett2num(row.destination) - 1
    adj[i, j] += 1
It outputs for adj:
array([[0., 1., 0.],
       [0., 0., 2.],
       [0., 0., 0.]])

Check if positive value exists in dataframe

I have a dataframe:
A B C D
1 0 0 2
0 1 0 0
0 0 0 0
I need to select all values which are greater than 0 and put them in a list.
If a row doesn't contain any positive value, 0 should be written to the list.
So, the output for given dataframe should look like this:
[1,2,1,0]
How this can be resolved?
Here is a simple loop you could use (looping through df.values gives us rows as arrays):
output = []
for ar in df.values:
    nonzeros = ar[ar > 0]
    # If nonzeros is not empty proceed and extend the output
    if nonzeros.size:
        output.extend(nonzeros)
    # If not add 0
    else:
        output.append(0)
print(output)
returns:
[1, 2, 1, 0]
We can make extensive use of pandas + numpy here:
Mask all values which are greater than 0
m = df.gt(0)
A B C D
0 True False False True
1 False True False False
2 False False False False
Mask rows which don't contain any values above 0:
s1 = m.any(axis=1).astype(int).values
Get all the values greater than 0 in an array:
s2 = df.values[m]
Finally concat both arrays with each other:
np.concatenate([s2, s1[s1==0]]).tolist()
Output
[1, 2, 1, 0]
In your case, first stack the df, then apply your condition: if the row contains non-zero values we select them; if all are 0, we keep a single zero.
df.stack().groupby(level=0).apply(lambda x : x.head(1) if all(x==0) else x[x!=0]).tolist()
[1, 2, 1, 0]
Or without apply
np.concatenate(df.mask(df==0).stack().groupby(level=0).apply(list).reindex(df.index,fill_value=[0]).values)
array([1., 2., 1., 0.])
Shorten the process
np.concatenate(list(map(lambda x : [x[0]] if all(x==0) else x[x!=0],df.values)))
array([1, 2, 1, 0])
You could apply a custom function which processes each row of the DataFrame and returns a list, then sum the returned lists.
In [1]: import pandas as pd
In [2]: df = pd.read_clipboard()
In [3]: df
Out[3]:
A B C D
0 1 0 0 2
1 0 1 0 0
2 0 0 0 0
In [4]: def get_positive_values(row):
   ...:     # If all elements in a row are zeros
   ...:     # then return a list with a single zero
   ...:     if row.eq(0).all():
   ...:         return [0]
   ...:     # Else return a list with positive values only.
   ...:     return row[row.gt(0)].tolist()
   ...:
   ...:
In [5]: df.apply(get_positive_values, axis=1).sum()
Out[5]: [1, 2, 1, 0]

build a map out of a matrix with multiple outcomes

I have an input matrix of unknown n x m dimensions, populated by 1s and 0s.
For example, a 5x4 matrix:
A = array(
    [[1, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 1, 0],
     [0, 1, 1, 0],
     [1, 0, 1, 1]])
Goal
I need to create a 1 : 1 map between as many columns and rows as possible, where the element at that location is 1.
What I mean by a 1 : 1 map is that each column and row can be used once at most.
The ideal solution has the most mappings possible, i.e. the most rows and columns used. It should also avoid exhaustive combinations or operations that do not scale well with larger matrices (practically, maximum dimensions should be 100x100, but there is no declared limit, so they could go higher).
Here's a possible outcome of the above
array([[ 1., 0., 0., 0.],
       [ 0., 0., 0., 0.],
       [ 0., 0., 1., 0.],
       [ 0., 1., 0., 0.],
       [ 0., 0., 0., 1.]])
Some more Examples:
input:
0 1 1
0 1 0
0 1 1
output (one of several possible ones):
0 0 1
0 1 0
0 0 0
another (this shows one problem that can arise)
input:
0 1 1 1
0 1 0 0
1 1 0 0
a good output (again, one of several):
0 0 1 0
0 1 0 0
1 0 0 0
a bad output (still valid, but has fewer mappings)
0 1 0 0
0 0 0 0
1 0 0 0
To better show how there can be multiple outputs:
input:
0 1 1
1 1 0
one possible output:
0 1 0
1 0 0
a second possible output:
0 0 1
0 1 0
a third possible output
0 0 1
1 0 0
What have I done?
I have a really dumb way of handling it right now which is not at all guaranteed to work. Basically I just build a filter matrix out of an identity matrix (because it's the perfect map, every row and every column is used once), then I randomly swap its columns (n times) and filter the original matrix with it, recording the filter matrix with the best results.
My [non] solution:
import random
import numpy as np
# this is a starting matrix with random values
A = np.array(
    [[1, 0, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 1, 0],
     [0, 1, 1, 0],
     [1, 0, 1, 1]])
# add dummy row to make it square
new_col = np.zeros([5,1]) + 1
A = np.append(A, new_col, axis=1)
# make an identity matrix (the perfect map)
imatrix = np.diag([1]*5)
# randomly swap columns on the identity matrix until they match.
n = 1000
# this will hold the map that works the best
best_map_so_far = np.zeros([1,1])
for i in range(n):
    a, b = random.sample(range(5), 2)
    t = imatrix[:,a].copy()
    imatrix[:,a] = imatrix[:,b]
    imatrix[:,b] = t
    # is this map better than the previous best?
    if sum(sum(imatrix * A)) > sum(sum(best_map_so_far)):
        best_map_so_far = imatrix
    # could it be? a perfect map??
    if sum(sum(imatrix * A)) == A.shape[0]:
        break
        # jk.
# did we still fail
if sum(sum(imatrix * A)) != 5:
    print('haha')
# drop the dummy row
output = imatrix * A
output[:,:-1]
#... wow. it actually kind of works.
How about this?
let S be the solution vector, as wide as A, containing row numbers.
let Z be a vector containing the number of zeros in each column.
for each row:
    select the cells which contain 1 in A and no value in S.
    select from those cells those with the highest score in Z.
    select from those cells the first (or a random) cell.
    store the row number in the column of S corresponding to the cell.
Does that give you a sufficient solution? If so it should be much more efficient than what you have.
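Not tested exhaustively, but here is one possible reading of those steps as a quick sketch (the names S, Z and candidates follow the outline above; the sample A is the 5x4 matrix from the question):
import numpy as np

A = np.array([[1, 0, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 1, 1, 0],
              [1, 0, 1, 1]])

n_rows, n_cols = A.shape
S = np.full(n_cols, -1)    # S[c] = row assigned to column c, -1 means still unused
Z = (A == 0).sum(axis=0)   # number of zeros in each column

for r in range(n_rows):
    # cells in this row that contain 1 and whose column has no value in S yet
    candidates = [c for c in range(n_cols) if A[r, c] == 1 and S[c] == -1]
    if not candidates:
        continue
    # of those, take the column with the highest score in Z (first one on ties)
    S[max(candidates, key=lambda c: Z[c])] = r

# build the output map: one 1 per assigned row/column pair
out = np.zeros_like(A)
for c, r in enumerate(S):
    if r != -1:
        out[r, c] = 1
print(out)   # a valid 4-mapping for this A (one of several possible outcomes)
Like the outline itself, this is a single greedy pass over the rows, so it is not guaranteed to find the maximum possible number of mappings.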
Let me give it a go. The algorithm I suggest will not always give the optimal solution, but maybe somebody can improve it.
You can always interchange two columns or two rows without changing the problem. Further, by keeping track of the changes you can always go back to the original problem.
We are going to fill the main diagonal with 1s as far as it will go. Get the first 1 in the upper left corner by interchanging columns, or rows, or both. Now the first row and column are fixed and we don't touch them anymore. We now try to fill in the second element on the diagonal with 1, and then fix the second row and column. And so on.
If the bottom right submatrix is zero, we should try to bring a 1 there by interchanging two columns or two rows using the whole matrix but preserving the existing 1s in the diagonal. (Here lies the problem. It is easy to check efficiently if one interchange can help. But it could be that at least two interchanges are required, or maybe more.)
We stop when no more 1s can be obtained on the diagonal.
So, while the algorithm is not always optimal, maybe it is possible to come up with extra rules how to interchange columns and rows so as to populate the diagonal with 1s as far as possible.
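Here is a rough sketch of the basic diagonal-filling step described above, without the extra interchange rules for the case where the remaining submatrix contains no 1s (function and variable names are mine, not from the answer):
import numpy as np

def fill_diagonal_greedy(A):
    B = A.copy()
    n, m = B.shape
    row_order = np.arange(n)   # tracks which original row sits at each position
    col_order = np.arange(m)   # tracks which original column sits at each position
    k = 0
    while k < min(n, m):
        # look for any 1 in the not-yet-fixed bottom-right submatrix
        hits = np.argwhere(B[k:, k:] == 1)
        if hits.size == 0:
            break   # this is where the extra interchange rules would be needed
        i, j = hits[0][0] + k, hits[0][1] + k
        # interchange rows k and i, and columns k and j, to put that 1 at (k, k)
        B[[k, i], :] = B[[i, k], :]
        row_order[[k, i]] = row_order[[i, k]]
        B[:, [k, j]] = B[:, [j, k]]
        col_order[[k, j]] = col_order[[j, k]]
        k += 1
    # k ones now sit on the diagonal; translate them back to the original positions
    out = np.zeros_like(A)
    out[row_order[:k], col_order[:k]] = 1
    return out, k
For the 5x4 example from the question this also finds four mappings; when it stops early because the remaining submatrix has no 1s, the extra swap rules sketched above would be needed to possibly do better.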

Conversion of pandas dataframe to sparse key-item matrix with composite key

I have a data frame of 3 columns. Col 1 is a string order number, Col 2 is an integer day, and Col 3 is a product name.
I would like to convert this into a matrix where each row represents a unique order/day combination, and each column represents a 1/0 for the presence of a product name for that combination.
My approach so far makes use of a product dictionary, and a dictionary with a composite key of order # & day.
The final step, which iterates through the original dataframe in order to flip the bits in the matrix to 1s, is sloooow. Like 10 minutes for a matrix the size of 363K X 331 with a sparseness of ~97%.
Is there a different approach I should consider?
E.g.,
ord_nb day prod
1 1 A
1 1 B
1 2 B
1 2 C
1 2 D
would become
A B C D
1 1 0 0
0 1 1 1
My approach has been to create a dictionary of order/day pairs:
ord_day_dict = {}
print("Making a dictionary of ord-by-day keys...")
gp = df.groupby(['day', 'ord'])
for i, g in enumerate(gp.groups.items()):
    ord_day_dict[g[0][0], g[0][1]] = i
I append the index represention to the original dataframe:
df['ord_day_idx'] = 0 #Create a place holder column
for i, row in df.iterrows(): #populate the column with the index
    df.set_value(i,'ord_day_idx',ord_day_dict[(row['day'], row['ord_nb'])])
I then initialize a matrix the size of my ord/day X unique products:
n_items = df.prod_nm.unique().shape[0] #unique number of products
n_ord_days = len(ord_day_dict) #unique number of ord-by-day combos
df_fac_matrix = np.zeros((n_ord_days, n_items), dtype=np.float64)#-1)
I convert my products from strings into an index via a dictionary:
prod_dict = dict()
i = 0
for v in df.prod:
    if v not in prod_dict:
        prod_dict[v] = i
        i = i + 1
And finally iterate through the original dataframe to populate the matrix with 1s where a specific order on a specific day included a specific product.
for line in df.itertuples():
df_fac_matrix[line[4], line[3]] = 1.0 #in the order-by-day index row and the product index column of our ord/day-by-prod matrix, mark a 1
Here is one option you can try:
df.groupby(['ord_nb', 'day'])['prod'].apply(list).apply(lambda x: pd.Series(1, x)).fillna(0)
# A B C D
#ord_nb day
# 1 1 1.0 1.0 0.0 0.0
# 2 0.0 1.0 1.0 1.0
Here's a NumPy based approach to have an array as output -
a = df[['ord_nb','day']].values.astype(int)
row = np.unique(np.ravel_multi_index(a.T,a.max(0)+1),return_inverse=1)[1]
col = np.unique(df.prd.values,return_inverse=1)[1]
out_shp = row.max()+1, col.max()+1
out = np.zeros(out_shp, dtype=int)
out[row,col] = 1
Please note that the third column was assumed to be named 'prd' instead, to avoid a name conflict with the built-in.
Possible improvements with focus on performance -
If prd has single letter characters only starting from A, we could compute col simply with: df.prd.values.astype('S1').view('uint8')-65.
Alternatively, we could compute row with: np.unique(a[:,0]*(a[:,1].max()+1) + a[:,1],return_inverse=1)[1].
Saving memory with sparse array : For really huge arrays, we could save on memory by storing them as sparse matrices. Thus, the final steps to get such a sparse matrix would be -
from scipy.sparse import coo_matrix
d = np.ones(row.size,dtype=int)
out_sparse = coo_matrix((d,(row,col)), shape=out_shp)
Sample input, output -
In [232]: df
Out[232]:
ord_nb day prd
0 1 1 A
1 1 1 B
2 1 2 B
3 1 2 C
4 1 2 D
In [233]: out
Out[233]:
array([[1, 1, 0, 0],
[0, 1, 1, 1]])
In [241]: out_sparse
Out[241]:
<2x4 sparse matrix of type '<type 'numpy.int64'>'
with 5 stored elements in COOrdinate format>
In [242]: out_sparse.toarray()
Out[242]:
array([[1, 1, 0, 0],
[0, 1, 1, 1]])

Create three arrays from pandas series

For example, I have a pandas data series like this:
df = pd.DataFrame({'A': ['foo', 'bar', 'ololo'] * 4,
                   'B': np.random.randn(12),
                   'C': np.random.randint(0, 2, 12)})
ga = df.groupby(['A'])['C'].value_counts()
print ga
A
bar 1 3
0 1
foo 0 3
1 1
ololo 0 4
I want to create three arrays, like this:
First array
bar, foo, ololo
Second array (number of '1')
2 3 1
Third array (number of '0')
2 1 3
What's the simplest way to do this?
Starting with:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': ['foo', 'bar', 'ololo'] * 4,
    'B': np.random.randn(12),
    'C': np.random.randint(0, 2, 12)
})
counts = df.groupby('A')['C'].value_counts()
Gives (for counts):
A
bar 1 4
foo 1 4
ololo 0 3
1 1
dtype: int64
So, effectively we want to unstack and transpose so that 0/1 are the index, which we do by:
reshaped = counts.unstack().T.reindex([0, 1]).fillna(0)
DSM points out it's possible to avoid .reindex by doing the following:
reshaped = counts.unstack().T.loc[[0, 1]].fillna(0)
Which gives:
A bar foo ololo
0 0 0 3
1 4 4 1
We force a .reindex to ensure it always contains 0/1 (in cases where the randomness means that nothing turns up for 0 or 1) and force all column values to be 0 (.fillna(0)) where that's the case. You can then get your arrays by doing the following:
arrays = reshaped.columns.values, reshaped.loc[1].values, reshaped.loc[0].values
Which gives you:
(array(['bar', 'foo', 'ololo'], dtype=object),
array([ 4., 4., 1.]),
array([ 0., 0., 3.]))
