Nested loop results in table, Python

I need to loop a computation over two lists of elements and save the results in a table. So, say that
months = [1,2,3,4,5]
Region = ['Region1', 'Region2']
and that my code is of the form
df = []
for month in months:
    for region in Region:
        # ...your computation here...
        x = result
        df.append(x)
What I cannot achieve is rendering the final result as a table in which the rows are regions and the columns are months:
         1  2  3  4  5
Region1  a  b  c  d  e
Region2  f  g  h  i  j

Assuming that result holds the right number of items:
import pandas as pd

result = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
months = [1, 2, 3, 4, 5]
Region = ['Region1', 'Region2']
df = pd.DataFrame(
    [[Region[i]] + result[i*len(months):(i+1)*len(months)] for i in range(len(Region))],
    columns=["Region"] + months
).set_index("Region")
Output
         1  2  3  4  5
Region
Region1  a  b  c  d  e
Region2  f  g  h  i  j
This part
[[Region[i]] + result[i*len(months): ((i+1)*len(months))] for i in range(len(Region))]
is equivalent to something like this:
res = []
for i in range(len(Region)):
    row = [Region[i]] + result[i*len(months):(i+1)*len(months)]
    res.append(row)
where I use the length of months to slice result into equal parts, one per region, and add the name of the region at the beginning of each row.
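If the results already sit in a flat list like this, another option is to reshape it with numpy; a sketch, assuming the list holds exactly len(Region) * len(months) items in region-major order:
import numpy as np
import pandas as pd

result = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
months = [1, 2, 3, 4, 5]
Region = ['Region1', 'Region2']

# one row per region, one column per month
df = pd.DataFrame(np.array(result).reshape(len(Region), len(months)),
                  index=Region, columns=months)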

Another solution - more lines of code:
import pandas as pd
import ast

months = [1, 2, 3, 4, 5]
Regions = ['Region1', 'Region2']
df = pd.DataFrame()
for region in Regions:
    row = '{\'Region\': \'' + region + '\', '
    for month in months:
        # put your calculation code here
        x = month + 1
        row = row + '\'' + str(month) + '\':[' + str(x) + '],'
    row = row[:len(row)-1] + '}'  # drop the trailing comma, close the dict literal
    row = ast.literal_eval(row)   # parse the string back into a dict
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat([df, ...]) there
    df = df.append(pd.DataFrame(row))
df
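The string building and ast.literal_eval round trip can be avoided by constructing the dict directly; a sketch of the same loop (with x = month + 1 standing in for the real calculation, as above), which also stores plain scalars instead of one-element lists:
import pandas as pd

months = [1, 2, 3, 4, 5]
Regions = ['Region1', 'Region2']

rows = []
for region in Regions:
    row = {'Region': region}
    for month in months:
        x = month + 1  # put your calculation code here
        row[month] = x
    rows.append(row)

df = pd.DataFrame(rows).set_index('Region')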

This might work, depending on what results you want and how you want them.
import pandas as pd

months = [1, 2, 3, 4, 5]
Region = ['Region1', 'Region2']
df = pd.DataFrame(columns=[1, 2, 3])  # just to put in something
value = 59
for r in Region:
    value += 5
    for m in months:
        df.loc[r, m] = chr(m + value)
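This works because assigning to df.loc with a label that does not exist yet enlarges the frame ("setting with enlargement"): each new region becomes a row and each new month a column as the loops run. The chr(m + value) here just generates placeholder letters in place of a real computation.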

Related

Using Python for loop to take rolling sum product

I am trying to use a for loop to calculate the trailing sum product of a list of values, and a reverse counter. That is, at each iteration of the loop, the current value would be multiplied by 1, the previous value would be multiplied by 2, and so on back to n values, where n is the number of values between the start of the list and the current iteration.
The below screenshot shows how I would do this in Excel using SUMPRODUCT:
[screenshot: SUMPRODUCT in Excel]
In Python, I am able to use cumsum from Numpy to return the rolling cumulative sum of the list of values:
import numpy as np
import pandas as pd

data = [8, 6, 7, 5, 9]
df = pd.DataFrame(data=data, columns=['values'])
values = df['values']
test = []
for x, index in enumerate(values):
    test.append(np.cumsum(values)[x])
I am also able to use the below to get a reversed counter, but I'm not sure how to incorporate this into the first for loop, and I'm not sure how to get the reverse counter to reset to 1 at each iteration:
for i in reversed(range(len(values))):
    print(i + 1)
What is the most straightforward way to get the counter to reset at each iteration of the loop, and to incorporate it into a trailing sumproduct?
Thank you in advance.
As @JonClements suggested, you may just need a simple
(df['values'] * range(len(df), 0, -1)).sum()
without any for-loop
If you need a partial result, you can take part of the rows, i.e. sub_df = df[:3]:
sub_df = df[:3]
(sub_df['values'] * range(len(sub_df), 0, -1)).sum()
Minimal working example
import pandas as pd
data = [8, 6, 7, 5, 9]
df = pd.DataFrame(data, columns=['values'])
size = 3
sub_df = df[:size]
#result = (sub_df['values'] * range(len(sub_df), 0, -1)).sum()
result = (sub_df['values'] * range(size, 0, -1)).sum()
print(f'result [{size}]: {result}')
size = 4
sub_df = df[:size]
#result = (sub_df['values'] * range(len(sub_df), 0, -1)).sum()
result = (sub_df['values'] * range(size, 0, -1)).sum()
print(f'result [{size}]: {result}')
Result:
result [3]: 43
result [4]: 69
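If you want the trailing sumproduct at every position, as in the Excel column, the same expression can be applied to every prefix length; a sketch:
import pandas as pd

data = [8, 6, 7, 5, 9]
df = pd.DataFrame(data, columns=['values'])

# trailing sumproduct for each prefix length 1..len(df)
df['sumproduct'] = [(df['values'][:n] * range(n, 0, -1)).sum()
                    for n in range(1, len(df) + 1)]
print(df)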
EDIT:
If you want to put it all in a DataFrame:
import pandas as pd
data = [8, 6, 7, 5, 9]
df = pd.DataFrame(data, columns=['values'])
# ---
size = 3
sub_df = df[:size]
result = (sub_df['values'] * range(size, 0, -1)).sum()
print(f'result [{size}]: {result}')
df['counter C'] = '' # default value
df.loc[:size-1, 'counter C'] = range(size, 0, -1)
df['result C'] = '' # default value
df.loc[size-1,'result C'] = result
# ---
size = 4
sub_df = df[:size]
result = (sub_df['values'] * range(size, 0, -1)).sum()
print(f'result [{size}]: {result}')
df['counter D'] = '' # default value
df.loc[:size-1, 'counter D'] = range(size, 0, -1)
df['result D'] = '' # default value
df.loc[size-1, 'result D'] = result
print(df)
Result:
   values counter C result C counter D result D
0       8         3                  4
1       6         2                  3
2       7         1       43        2
3       5                            1       69
4       9

Given a dataframe, how do I bucket columns according to their names and merge columns in the same bucket into one?

Suppose I have a dataframe with (for example) 10 columns: a,b,c,d,e,f,g,h,i,j
I want to bucket these columns as follows: a,b,c into x; d,f,g into y; e,h,i into z; and j into j.
Each row of the output will have the x column value equal to the non-NaN a or b or c value of the original df. In case of multiple non-NaN values for a,b,c columns for a particular row in the original df, the output df will just contain a list of those non-NaN values.
To give an example, if the original df is (- just means NaN to save typing effort):
   a  b  c  d  e  f  g  h  i  j
0  1  -  -  -  2  -  4  3  -  -
1  -  6  -  0  4  -  -  -  -  2
2  -  3  2  -  -  -  -  1  -  9
The output will be:
        x  y      z  j
0       1  4  [2,3]  -
1       6  0      4  2
2  [3,2]  -      1  9
Is there an efficient way of doing this? I'm not even able to get started using conventional methods.
One way is to create a dictionary with your mappings, map it onto your column names, stack, apply your groupby operation, and unstack back to your original shape.
I couldn't see any logic in your mappings, so it will have to be a manual operation, I'm afraid.
buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'], 'z': ['e', 'h', 'i'], 'j': 'j'}
df.columns = df.columns.map({i: x for x, y in buckets.items() for i in y})
out = df.stack().groupby(level=[0, 1]).agg(list).unstack(1)[buckets.keys()]
print(out)
        x    y       z    j
0     [1]  [4]  [2, 3]  NaN
1     [6]  [0]     [4]  [2]
2  [3, 2]  NaN     [1]  [9]
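If you want the scalars and NaNs of the desired output rather than one-element lists, a follow-up sketch that unwraps them (applymap applies a function to every cell; it was renamed DataFrame.map in pandas 2.1):
# unwrap single-element lists back into scalars
out = out.applymap(lambda v: v[0] if isinstance(v, list) and len(v) == 1 else v)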
First create the dict for the mapping, then groupby:
d = {'a': 'x', 'b': 'x', 'c': 'x', 'd': 'y', 'f': 'y', 'g': 'y', 'e': 'z', 'h': 'z', 'i': 'z', 'j': 'j'}
out = df.groupby(d, axis=1).agg(lambda x: [y[y != '-'] for y in x.values])
Out[138]:
     j       x    y       z
0   []     [1]  [4]  [2, 3]
1  [2]     [6]  [0]     [4]
2  [9]  [3, 2]   []     [1]
Starting with a very basic approach, let's define our buckets and simply iterate, then clean up:
import numpy as np
import pandas as pd

buckets = {
    'x': ['a', 'b', 'c'],
    'y': ['d', 'f', 'g'],
    'z': ['e', 'h', 'i'],
    'j': ['j']
}

def clean(val):
    val = [x for x in val if not np.isnan(x)]
    if len(val) == 0:
        return np.nan
    elif len(val) == 1:
        return val[0]
    else:
        return val

new_df = pd.DataFrame()
for new_col, old_cols in buckets.items():
    new_df[new_col] = [clean(row) for row in df[old_cols].values.tolist()]
Here's how you can do it.
First, we define a method to perform the row-wise bucketing operation.
def bucket_rows(row):
    row = row.dropna().to_list()
    if len(row) == 0:
        row = [np.nan]
    return row
Then, we can use the pandas.DataFrame.apply method to map this function onto each row of a dataframe (here, a sub-dataframe, if you will, since we'll get the sub-df using the column names).
I have implemented everything in the following code snippet.
import numpy as np
import pandas as pd

bucket_cols = [["a", "b", "c"], ["d", "f", "g"], ["e", "h", "i"], ["j"]]
bucket_names = ["x", "y", "z", "j"]
buckets = {}

def bucket_rows(row):
    row = row.dropna().to_list()  # pd.Series.dropna removes the NaN values
    # if the list is empty, populate it with NaN
    if len(row) == 0:
        row = [np.nan]
    # return the bucketed row
    return row

# loop through the buckets and perform the bucketing operation
for idx, cols in enumerate(bucket_cols):
    bucket = df[cols].apply(bucket_rows, axis=1).to_list()
    buckets[bucket_names[idx]] = bucket

# create the bucketed df from the buckets dict
df_bucketed = pd.DataFrame(buckets)
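Because the buckets dict is keyed by the bucket names, pd.DataFrame(buckets) yields the x, y, z, j columns directly, with one list-valued cell per row.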

Algorithm for grouping records

I have a table that looks like:
Group  Name
1      A
1      B
2      R
2      F
3      B
3      C
And I need to group these records by the following rule:
If a group shares at least one Name with another group, then those two groups belong in the same group. In my case Group 1 contains A and B, and Group 3 contains B and C. They have the common name B, so they must be in the same group.
As a result I want to get something like this:
Group  Name  ResultGroup
1      A     1
1      B     1
2      R     2
2      F     2
3      B     1
3      C     1
I already found a solution, but my table has about 200k records, so it takes too much time (more than 12 hours). Is there a way to optimize it? Maybe using pandas or something like that?
def printList(l, head=""):
    if head != "":
        print(head)
    for i in l:
        print(i)

def find_group(groups, vals):
    for k in groups.keys():
        for v in vals:
            if v in groups[k]:
                return k
    return 0

task = [[1, "AAA"], [1, "BBB"], [3, "CCC"], [4, "DDD"], [5, "JJJ"], [6, "AAA"], [6, "JJJ"], [6, "CCC"], [9, "OOO"], [10, "OOO"], [10, "DDD"], [11, "LLL"], [12, "KKK"]]
ptrs = {}
groups = {}
group_id = 1
printList(task, "Initial table")
for i in range(0, len(task)):
    itask = task[i]
    resp = itask[1]
    val = [x[0] for x in task if x[1] == resp]
    minval = min(val)
    for v in val:
        if v not in ptrs:
            ptrs[v] = minval
    myGroup = find_group(groups, val)
    if myGroup == 0:
        groups[group_id] = list(set(val))
        myGroup = group_id
        group_id += 1
    else:
        groups[myGroup].extend(val)
        groups[myGroup] = list(set(groups[myGroup]))
    itask.append(myGroup)
    task[i] = itask
print()
printList(task, "Result table")
You can groupby 'Name' and keep the first Group:
df = pd.DataFrame({'Group': [1, 1, 2, 2, 3, 3], 'Name': ['A', 'B', 'R', 'F', 'B', 'C']})
df2 = df.groupby('Name').first().reset_index()
Then merge with the original data-frame and drop duplicates of the original group:
df3 = df.merge(df2, on='Name', how='left')
df3 = df3[['Group_x', 'Group_y']].drop_duplicates('Group_x')
df3.columns = ['Group', 'ResultGroup']
One more merge will give you the result:
df.merge(df3, on='Group', how='left')
Group  Name  ResultGroup
1      A     1
1      B     1
2      R     2
2      F     2
3      B     1
3      C     1
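The merge approach links each group to the first group recorded for a shared Name, which handles the example; if overlaps can chain across several groups (1 shares with 3, 3 shares with 5, and so on), a union-find pass over the rows is a standard way to get fully transitive groups in roughly linear time. A sketch (the find/union helpers are illustrative, not from any library):
import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 2, 2, 3, 3],
                   'Name': ['A', 'B', 'R', 'F', 'B', 'C']})

parent = {}  # group id -> parent group id

def find(g):
    # walk to the root, compressing the path along the way
    root = parent.get(g, g)
    if root != g:
        root = find(root)
        parent[g] = root
    return root

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)  # keep the smaller id as the root

seen = {}  # Name -> first group id that used it
for group, name in zip(df['Group'], df['Name']):
    if name in seen:
        union(group, seen[name])
    else:
        seen[name] = group

df['ResultGroup'] = df['Group'].map(find)
print(df)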

Iterate in a dataframe with strings

I'm trying to create a cognitive task called the 2-back test.
I created a semi-random list with certain conditions, and now I want to know what the correct answer for the participant should be.
I want a column in my dataframe saying yes or no: was the letter two positions back the same letter?
Here is my code:
from random import choice, shuffle
import pandas as pd

num = 60
letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
# letters_1 = [1, 2, 3, 4, 5, 6]
my_list = [choice(letters), choice(letters)]
probab = list(range(num - 2))
shuffle(probab)
# We want 20% of the letters to repeat the letter 2 letters back
pourc = 20
repeatnum = num * pourc // 100
for i in probab:
    ch = prev = my_list[-2]
    if i >= repeatnum:
        while ch == prev:
            ch = choice(letters)
    my_list.append(ch)
df = pd.DataFrame(my_list, columns=["letters"])
df.head(10)
  letters
0       F
1       I
2       D
3       I
4       H
5       C
6       L
7       G
8       D
9       L
# Create a list to store the data
response = []
# For each row in the column,
for i in df['letters']:
    # if more than a value,
    if i == [i - 2]:
        response.append('yes')
    else:
        response.append('no')
# Create a column from the list
df['response'] = response
First error:
if i == [i - 2]:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I use numbers instead of letters, I can get past this error, but I would prefer to keep letters.
And if I run it with numbers, I get no errors, but my new response column only contains 'no'. But I know that it should be 'yes' 12 times.
It seems like you want to compare the column with the same column shifted by two elements. Use shift + np.where:
import numpy as np

df['response'] = np.where(df.letters.eq(df.letters.shift(2)), 'yes', 'no')
df.head(10)
  letters response
0       F       no
1       I       no
2       D       no
3       I      yes
4       H       no
5       C       no
6       L       no
7       G       no
8       D       no
9       L       no
But I know that 12 times it should be 'yes'.
df.response.eq('yes').sum()
12

Pandas assign label based on index value

I have a dataframe with an index and multiple columns. Secondly, I have a few lists containing index values sampled on certain criteria. Now I want to create columns with labels based on whether or not the index of a certain row is present in a specified list.
There are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'

df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This is working, but it is quite slow: each call takes about 3 minutes, and considering I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
               np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
               np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
   A        2_name
0  8             A
1  8             A
2  3             B
3  7             B
4  7             A
5  0             B
6  4             B
7  2             A
8  5  not_assigned
9  2  not_assigned
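The speedup comes from Index.isin doing the membership test once, in vectorized code against a hash table, whereas the .map(lambda ...) version pays Python call overhead plus a linear scan of each list for every row. Converting the lists to sets would already help the original approach, but the numpy.where version avoids the per-row Python loop entirely.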
