I'm trying to create a cognitive task known as the 2-back test.
I created a semi-random list under certain conditions, and now I want to know what the correct answer for the participant should be.
I want a column in my dataframe saying, yes or no, whether the letter 2 positions back was the same letter.
Here is my code:
from random import choice, shuffle
import pandas as pd
num = 60
letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
# letters_1 = [1, 2, 3, 4, 5, 6]
my_list = [choice(letters), choice(letters)]
probab = list(range(num - 2))
shuffle(probab)
# We want 20% of the letters to repeat the letter 2 letters back
pourc = 20
repeatnum = num * pourc // 100
for i in probab:
    ch = prev = my_list[-2]
    if i >= repeatnum:
        while ch == prev:
            ch = choice(letters)
    my_list.append(ch)
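As a quick sanity check (a sketch, not part of the original code), the finished list should contain exactly repeatnum two-back repeats:

# count positions whose letter equals the letter two positions back
repeats = sum(my_list[k] == my_list[k - 2] for k in range(2, len(my_list)))
print(repeats)  # expected: repeatnum, i.e. 12 for num=60 and pourc=20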
df = pd.DataFrame(my_list, columns=["letters"])
df.head(10)
letters
0 F
1 I
2 D
3 I
4 H
5 C
6 L
7 G
8 D
9 L
# Create a list to store the data
response = []
# For each row in the column,
for i in df['letters']:
    # if it equals the value two letters back,
    if i == [i - 2]:
        response.append('yes')
    else:
        response.append('no')
# Create a column from the list
df['response'] = response
First error:
if i == [i - 2]:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I use numbers instead of letters, I can get past this error, but I would prefer to keep letters.
And if I run it with numbers, I get no errors, but my new response column only contains 'no'. Yet I know it should be 'yes' 12 times.
It seems like you want to compare the column with the same column shifted by two elements. Use shift + np.where:
import numpy as np

df['response'] = np.where(df.letters.eq(df.letters.shift(2)), 'yes', 'no')
df.head(10)
letters response
0 F no
1 I no
2 D no
3 I yes
4 H no
5 C no
6 L no
7 G no
8 D no
9 L no
But I know that 12 times it should be 'yes'.
df.response.eq('yes').sum()
12
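To see why this works: shift(2) moves the whole column down two rows, so each row lines up with the letter two trials earlier, and eq compares them element-wise. A minimal sketch using the first five letters from the sample output above:

s = pd.Series(list('FIDIH'))
print(pd.DataFrame({'letter': s, 'two_back': s.shift(2), 'match': s.eq(s.shift(2))}))
# rows 0 and 1 have no letter two back (NaN), so they can never match;
# row 3 matches because 'I' also appears at row 1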
Suppose I have a dataframe with (for example) 10 columns: a,b,c,d,e,f,g,h,i,j
I want to bucket these columns as follows: a,b,c into x, d,f,g into y, e,h,i into z and j into j.
Each row of the output will have the x column value equal to the non-NaN a or b or c value of the original df. In case of multiple non-NaN values for a,b,c columns for a particular row in the original df, the output df will just contain a list of those non-NaN values.
To give an example, if the original df is (- just means NaN to save typing effort):
a b c d e f g h i j
0 1 - - - 2 - 4 3 - -
1 - 6 - 0 4 - - - - 2
2 - 3 2 - - - - 1 - 9
The output will be:
x y z j
0 1 4 [2,3] -
1 6 0 4 2
2 [3,2] - 1 9
Is there an efficient way of doing this? I'm not even able to get started using conventional methods.
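For reference, the example frame can be built like this (a sketch; the '-' placeholders become NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, np.nan], 'b': [np.nan, 6, 3], 'c': [np.nan, np.nan, 2],
    'd': [np.nan, 0, np.nan], 'e': [2, 4, np.nan], 'f': [np.nan, np.nan, np.nan],
    'g': [4, np.nan, np.nan], 'h': [3, np.nan, 1], 'i': [np.nan, np.nan, np.nan],
    'j': [np.nan, 2, 9],
})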
One way is to create a dictionary with your mappings, map it onto your column names, stack, apply your groupby operation, and unstack back to your original shape.
I couldn't see any logic in your mappings, so it will have to be a manual operation, I'm afraid.
buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'], 'z': ['e', 'h', 'i'], 'j': 'j'}
df.columns = df.columns.map({i: x for x, y in buckets.items() for i in y})
out = df.stack().groupby(level=[0, 1]).agg(list).unstack(1)[buckets.keys()]
print(out)
x y z j
0 [1] [4] [2, 3] NaN
1 [6] [0] [4] [2]
2 [3, 2] NaN [1] [9]
First create the dict for the mapping, then groupby along axis=1:
d = {'a':'x','b':'x','c':'x','d':'y','f':'y','g':'y','e':'z','h':'z','i':'z','j':'j'}
out = df.groupby(d, axis=1).agg(lambda x: [y[y != '-'] for y in x.values])
Out[138]:
j x y z
0 [] [1] [4] [2, 3]
1 [2] [6] [0] [4]
2 [9] [3, 2] [] [1]
Starting with a very basic approach, let's define our buckets and simply iterate, then clean up:
buckets = {
    'x': ['a', 'b', 'c'],
    'y': ['d', 'f', 'g'],
    'z': ['e', 'h', 'i'],
    'j': ['j']
}
def clean(val):
    # drop NaNs, then unwrap empty lists and singletons
    val = [x for x in val if not np.isnan(x)]
    if len(val) == 0:
        return np.nan
    elif len(val) == 1:
        return val[0]
    else:
        return val
new_df = pd.DataFrame()
for new_col, old_cols in buckets.items():
    new_df[new_col] = [clean(row) for row in df[old_cols].values.tolist()]
Here's how you can do it.
First, we define a method to perform the row-wise bucketing operation.
def bucket_rows(row):
    row = row.dropna().to_list()
    if len(row) == 0:
        row = [np.nan]
    return row
Then, we can use the pandas.DataFrame.apply method to map this function onto each row of a dataframe (here, a sub-dataframe, if you will, since we'll select the sub-df using the column names).
I have implemented everything in the following code snippet.
import numpy as np
import pandas as pd

bucket_cols = [["a", "b", "c"], ["d", "f", "g"], ["e", "h", "i"], ["j"]]
bucket_names = ["x", "y", "z", "j"]
buckets = {}

def bucket_rows(row):
    row = row.dropna().to_list()  # pd.Series.dropna removes the NaN values
    # if the list is empty, populate it with NaN
    if len(row) == 0:
        row = [np.nan]
    # return the bucketed row
    return row

# looping through buckets and performing the bucketing operation
for idx, cols in enumerate(bucket_cols):
    bucket = df[cols].apply(bucket_rows, axis=1).to_list()
    buckets[bucket_names[idx]] = bucket  # key by bucket name so the columns come out as x, y, z, j

# creating the bucketed df from the buckets dict
df_bucketed = pd.DataFrame(buckets)
I need to loop a computation over two lists of elements and save the results in a table. So, say that
months = [1,2,3,4,5]
Region = ['Region1', 'Region2']
and that my code is of the type
df = []
for month in months:
    for region in Region:
        # code
        x = result
        df.append(x)
What I cannot achieve is rendering the final result in a table in which the rows are regions and the columns are months:
1 2 3 4 5
Region1 a b c d e
Region2 f g h i j
Assuming that there is the right number of items in result:
result = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
months = [1, 2, 3, 4, 5]
Region = ['Region1', 'Region2']
df = pd.DataFrame([[Region[i]] + result[i*len(months): ((i+1)*len(months))] for i in range(len(Region))], columns=["Region"] + months).set_index("Region")
Output
1 2 3 4 5
Region
Region1 a b c d e
Region2 f g h i j
This part
[[Region[i]] + result[i*len(months): ((i+1)*len(months))] for i in range(len(Region))]
is equivalent to something like this:
res = []
for i in range(len(Region)):
    row = [Region[i]] + result[i*len(months): ((i+1)*len(months))]
    res.append(row)
where result is sliced into equal parts of len(months) items, one per region, and the name of the region is added at the beginning of each row.
Another solution - more lines of code:
import pandas as pd
import ast
months = [1,2,3,4,5]
Regions = ['Region1', 'Region2']
df = pd.DataFrame()
for region in Regions:
    row = '{\'Region\': \'' + region + '\', '
    for month in months:
        # put your calculation code here
        x = month + 1
        row = row + '\'' + str(month) + '\':[' + str(x) + '],'
    row = row[:len(row) - 1] + '}'
    row = ast.literal_eval(row)
    df = pd.concat([df, pd.DataFrame(row)])  # DataFrame.append was removed in pandas 2.x
df
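The string round-trip through ast.literal_eval can also be skipped by building the dicts directly; a sketch of the same idea, with the same placeholder calculation:

import pandas as pd

months = [1, 2, 3, 4, 5]
Regions = ['Region1', 'Region2']
rows = []
for region in Regions:
    row = {'Region': region}
    for month in months:
        # put your calculation code here
        row[month] = month + 1
    rows.append(row)
df = pd.DataFrame(rows)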
This might work, depending on what and how you want the results.
import pandas as pd
months = [1, 2, 3, 4, 5]
Region = ['Region1', 'Region2']
df = pd.DataFrame(columns=[1, 2, 3]) # just to put in something
value = 59
for r in Region:
    value += 5
    for m in months:
        df.loc[r, m] = chr(m + value)
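Printing df then gives the requested layout (expected output sketched in the comments):

print(df)
#          1  2  3  4  5
# Region1  A  B  C  D  E
# Region2  F  G  H  I  J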
I have a pandas dataframe like the one below, which I try to sort by the 'dist' column. The sorted dataframe should start with E or F, as shown further down. I use sort_values, but it is not working for me. The function computes distances from the 'Start' location to a list of locations ['C', 'B', 'D', 'E', 'A', 'F'] and is then supposed to sort the dataframe in ascending order by the 'dist' column.
Could someone advise me why the sorting is not working?
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
loc_list
Out[194]: ['C', 'B', 'D', 'E', 'A', 'F']
from math import hypot
import numpy as np
import pandas as pd

def closest_locations(from_loc_point, to_loc_list):
    lresults = list()
    for list_index in range(len(to_loc_list)):
        # Euclidean distance between the two locations
        dist = hypot(locations[from_loc_point[0]][0] - locations[to_loc_list[list_index]][0],
                     locations[from_loc_point[0]][1] - locations[to_loc_list[list_index]][1])
        lista_dist = [from_loc_point[0], to_loc_list[list_index], dist]
        lresults.append(lista_dist[:])
    RESULTS = pd.DataFrame(np.array(lresults))
    RESULTS.columns = ['from', 'to', 'dist']
    RESULTS.sort_values(['dist'], ascending=[True], inplace=True)
    RESULTS.index = range(len(RESULTS))
    return RESULTS
closest_locations(['Start'], loc_list)
Out[189]:
from to dist
0 Start D 10.19803902718557
1 Start A 10.19803902718557
2 Start C 15.132745950421555
3 Start B 15.132745950421555
4 Start E 6.08276253029822
5 Start F 6.08276253029822
closest_two_loc.dtypes
Out[247]:
from object
to object
dist object
dtype: object
Is this what you want?
import numpy as np
import pandas as pd

locations = {'Start': (20,5), 'A': (10,3), 'B': (5,3), 'C': (5,7), 'D': (10,7), 'E': (14,4), 'F': (14,6)}
df = pd.DataFrame.from_dict(locations, orient='index').rename(columns={0: 'x', 1: 'y'})
df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc['Start', 'x'])**2 + (row['y'] - df.loc['Start', 'y'])**2), axis=1)
df.drop(['Start']).sort_values(by='dist')
x y dist
E 14 4 6.082763
F 14 6 6.082763
A 10 3 10.198039
D 10 7 10.198039
C 5 7 15.132746
B 5 3 15.132746
or if you want to wrap it in a function
def dist_from(df, col):
    df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc[col, 'x'])**2 + (row['y'] - df.loc[col, 'y'])**2), axis=1)
    df['from'] = col
    df = df.drop([col]).sort_values(by='dist')  # assign back: drop and sort_values are not in-place
    df.index.name = 'to'
    return df.reset_index().loc[:, ['from', 'to', 'dist']]
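A quick usage sketch, reusing the locations dict and frame construction from above:

df = pd.DataFrame.from_dict(locations, orient='index').rename(columns={0: 'x', 1: 'y'})
print(dist_from(df, 'Start'))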
You need to convert the values in the "dist" column to float:
df = closest_locations(['Start'], loc_list)
df.dist = list(map(lambda x: float(x), df.dist)) # convert each value to float
print(df.sort_values('dist')) # now it will sort properly
Output:
from to dist
4 Start E 6.082763
5 Start F 6.082763
0 Start D 10.198039
1 Start A 10.198039
2 Start C 15.132746
3 Start B 15.132746
Edit: As mentioned by @jezrael in the comments, the following is a more direct method:
df.dist = df.dist.astype(float)
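For what it's worth, the object dtype originates in the question's np.array(lresults): mixing strings and floats in one array upcasts everything to string. Passing the list of lists straight to the DataFrame constructor keeps dist numeric (a sketch):

RESULTS = pd.DataFrame(lresults, columns=['from', 'to', 'dist'])
print(RESULTS.dtypes)  # dist comes out float64, so sort_values works directly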
I have a dataframe with an index and multiple columns. Secondly, I have a few lists containing index values sampled on certain criteria. Now I want to create columns with labels based on whether or not the index of a certain row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'
df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This is working, but it is quite slow. Each call takes about 3 minutes, and considering I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
A 2_name
0 8 A
1 8 A
2 3 B
3 7 B
4 7 A
5 0 B
6 4 B
7 2 A
8 5 not_assigned
9 2 not_assigned
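If more label groups ever show up, np.select generalizes the nested np.where; a sketch using the same lists:

conditions = [df.index.isin(random_m + model_m),
              df.index.isin(random_y + model_y)]
df['2_name'] = np.select(conditions, ['A', 'B'], default='not_assigned')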
I want to count the number of times each value appears in the dataframe.
Here is my dataframe - df:
status
1 N
2 N
3 C
4 N
5 S
6 N
7 N
8 S
9 N
10 N
11 N
12 S
13 N
14 C
15 N
16 N
17 N
18 N
19 S
20 N
I want a dictionary of counts:
ex. counts = {N: 14, C:2, S:4}
I have tried df['status']['N'] but it gives a KeyError, and also df['status'].value_counts, but to no avail.
You can use value_counts and to_dict:
print(df['status'].value_counts())
N 14
S 4
C 2
Name: status, dtype: int64
counts = df['status'].value_counts().to_dict()
print(counts)
{'S': 4, 'C': 2, 'N': 14}
An alternative one-liner using the underdog Counter:
In [3]: from collections import Counter
In [4]: dict(Counter(df.status))
Out[4]: {'C': 2, 'N': 14, 'S': 4}
You can try it this way.
df.stack().value_counts().to_dict()
Can you convert df into a list?
If so:
a = ['a', 'a', 'a', 'b', 'b', 'c']
c = dict()
for i in set(a):
    c[i] = a.count(i)
Using a dict comprehension:
c = {i: a.count(i) for i in set(a)}
See my response in this thread for a Pandas DataFrame output,
count the frequency that a value occurs in a dataframe column
For dictionary output, you can modify as follows:
def column_list_dict(x):
    column_list_df = []
    for col_name in x.columns:
        # pair each column name with its number of distinct values
        y = col_name, len(x[col_name].unique())
        column_list_df.append(y)
    return dict(column_list_df)
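A quick check against the status frame above shows what this returns (note it counts distinct values per column, not per-value frequencies):

print(column_list_dict(df))
# {'status': 3}  (three distinct statuses: N, C, S)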