I'm trying to create a cognitive task known as the 2-back test.
I created a semi-random list under certain conditions, and now I want to know what the correct answer for the participant should be.
I want a column in my dataframe saying, yes or no, whether the letter 2 positions back was the same letter.
Here is my code:
from random import choice, shuffle
import pandas as pd
num = 60
letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
# letters_1 = [1, 2, 3, 4, 5, 6]
my_list = [choice(letters), choice(letters)]
probab = list(range(num - 2))
shuffle(probab)
# We want 20% of the letters to repeat the letter 2 letters back
pourc = 20
repeatnum = num * pourc // 100
for i in probab:
    ch = prev = my_list[-2]
    if i >= repeatnum:
        while ch == prev:
            ch = choice(letters)
    my_list.append(ch)
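As a quick sanity check (a sketch, not part of the original code), the finished list should contain exactly repeatnum two-back repeats:

# count positions whose letter equals the letter two positions back
repeats = sum(my_list[k] == my_list[k - 2] for k in range(2, len(my_list)))
print(repeats)  # expected: repeatnum, i.e. 12 for num=60 and pourc=20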
df = pd.DataFrame(my_list, columns=["letters"])
df.head(10)
letters
0 F
1 I
2 D
3 I
4 H
5 C
6 L
7 G
8 D
9 L
# Create a list to store the data
response = []
# For each row in the column,
for i in df['letters']:
    # if it equals the value two letters back,
    if i == [i - 2]:
        response.append('yes')
    else:
        response.append('no')
# Create a column from the list
df['response'] = response
First error:
if i == [i - 2]:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I use numbers instead of letters, I can get past this error, but I would prefer to keep letters.
And if I run it with numbers, I get no errors, but my new response column only contains 'no'. Yet I know it should be 'yes' 12 times.
It seems like you want to compare the column with the same column shifted by two elements. Use shift + np.where:
import numpy as np

df['response'] = np.where(df.letters.eq(df.letters.shift(2)), 'yes', 'no')
df.head(10)
letters response
0 F no
1 I no
2 D no
3 I yes
4 H no
5 C no
6 L no
7 G no
8 D no
9 L no
But I know that 12 times it should be 'yes'.
df.response.eq('yes').sum()
12
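To see why this works: shift(2) moves the whole column down two rows, so each row lines up with the letter two trials earlier, and eq compares them element-wise. A minimal sketch using the first five letters from the sample output above:

s = pd.Series(list('FIDIH'))
print(pd.DataFrame({'letter': s, 'two_back': s.shift(2), 'match': s.eq(s.shift(2))}))
# rows 0 and 1 have no letter two back (NaN), so they can never match;
# row 3 matches because 'I' also appears at row 1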
Suppose I have a dataframe with (for example) 10 columns: a,b,c,d,e,f,g,h,i,j
I want to bucket these columns as follows: a,b,c into x, d,f,g into y, e,h,i into z and j into j.
Each row of the output will have the x column value equal to the non-NaN a or b or c value of the original df. In case of multiple non-NaN values for a,b,c columns for a particular row in the original df, the output df will just contain a list of those non-NaN values.
To give an example, if the original df is (- just means NaN to save typing effort):
a b c d e f g h i j
0 1 - - - 2 - 4 3 - -
1 - 6 - 0 4 - - - - 2
2 - 3 2 - - - - 1 - 9
The output will be:
x y z j
0 1 4 [2,3] -
1 6 0 4 2
2 [3,2] - 1 9
Is there an efficient way of doing this? I'm not even able to get started using conventional methods.
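For reference, the example frame can be built like this (a sketch; the '-' placeholders become NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, np.nan], 'b': [np.nan, 6, 3], 'c': [np.nan, np.nan, 2],
    'd': [np.nan, 0, np.nan], 'e': [2, 4, np.nan], 'f': [np.nan, np.nan, np.nan],
    'g': [4, np.nan, np.nan], 'h': [3, np.nan, 1], 'i': [np.nan, np.nan, np.nan],
    'j': [np.nan, 2, 9],
})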
One way is to create a dictionary with your mappings, map it onto your column names, stack, apply your groupby operation, and unstack back to your original shape.
I couldn't see any logic in your mappings, so it will have to be a manual operation, I'm afraid.
buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'], 'z': ['e', 'h', 'i'], 'j': 'j'}
df.columns = df.columns.map({i: x for x, y in buckets.items() for i in y})
out = df.stack().groupby(level=[0, 1]).agg(list).unstack(1)[buckets.keys()]
print(out)
x y z j
0 [1] [4] [2, 3] NaN
1 [6] [0] [4] [2]
2 [3, 2] NaN [1] [9]
First create the dict for the mapping, then groupby along axis=1:
d = {'a':'x','b':'x','c':'x','d':'y','f':'y','g':'y','e':'z','h':'z','i':'z','j':'j'}
out = df.groupby(d, axis=1).agg(lambda x: [y[y != '-'] for y in x.values])
Out[138]:
j x y z
0 [] [1] [4] [2, 3]
1 [2] [6] [0] [4]
2 [9] [3, 2] [] [1]
Starting with a very basic approach, let's define our buckets and simply iterate, then clean up:
buckets = {
    'x': ['a', 'b', 'c'],
    'y': ['d', 'f', 'g'],
    'z': ['e', 'h', 'i'],
    'j': ['j']
}
def clean(val):
    # drop NaNs, then unwrap empty lists and singletons
    val = [x for x in val if not np.isnan(x)]
    if len(val) == 0:
        return np.nan
    elif len(val) == 1:
        return val[0]
    else:
        return val
new_df = pd.DataFrame()
for new_col, old_cols in buckets.items():
    new_df[new_col] = [clean(row) for row in df[old_cols].values.tolist()]
Here's how you can do it.
First, we define a method to perform the row-wise bucketing operation.
def bucket_rows(row):
    row = row.dropna().to_list()
    if len(row) == 0:
        row = [np.nan]
    return row
Then, we can use the pandas.DataFrame.apply method to map this function onto each row of a dataframe (here, a sub-dataframe, if you will, since we'll select the sub-df using the column names).
I have implemented everything in the following code snippet.
import numpy as np
import pandas as pd

bucket_cols = [["a", "b", "c"], ["d", "f", "g"], ["e", "h", "i"], ["j"]]
bucket_names = ["x", "y", "z", "j"]
buckets = {}

def bucket_rows(row):
    row = row.dropna().to_list()  # pd.Series.dropna removes the NaN values
    # if the list is empty, populate it with NaN
    if len(row) == 0:
        row = [np.nan]
    # return the bucketed row
    return row

# looping through buckets and performing the bucketing operation
for idx, cols in enumerate(bucket_cols):
    bucket = df[cols].apply(bucket_rows, axis=1).to_list()
    buckets[bucket_names[idx]] = bucket  # key by bucket name so the columns come out as x, y, z, j

# creating the bucketed df from the buckets dict
df_bucketed = pd.DataFrame(buckets)
I need to loop a computation over two lists of elements and save the results in a table. So, say that
months = [1,2,3,4,5]
Region = ['Region1', 'Region2']
and that my code is of the type
df = []
for month in months:
    for region in Region:
        # code
        x = result
        df.append(x)
What I cannot achieve is rendering the final result in a table in which the rows are regions and the columns are months:
1 2 3 4 5
Region1 a b c d e
Region2 f g h i j
Assuming that there is the right number of items in result:
result = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
months = [1, 2, 3, 4, 5]
Region = ['Region1', 'Region2']
df = pd.DataFrame([[Region[i]] + result[i*len(months): ((i+1)*len(months))] for i in range(len(Region))], columns=["Region"] + months).set_index("Region")
Output
1 2 3 4 5
Region
Region1 a b c d e
Region2 f g h i j
This part
[[Region[i]] + result[i*len(months): ((i+1)*len(months))] for i in range(len(Region))]
is equivalent to something like this:
res = []
for i in range(len(Region)):
    row = [Region[i]] + result[i*len(months): ((i+1)*len(months))]
    res.append(row)
where result is sliced into equal parts of len(months) items, one per region, and the name of the region is added at the beginning of each row.
Another solution - more lines of code:
import pandas as pd
import ast
months = [1,2,3,4,5]
Regions = ['Region1', 'Region2']
df = pd.DataFrame()
for region in Regions:
    row = '{\'Region\': \'' + region + '\', '
    for month in months:
        # put your calculation code here
        x = month + 1
        row = row + '\'' + str(month) + '\':[' + str(x) + '],'
    row = row[:len(row) - 1] + '}'
    row = ast.literal_eval(row)
    df = pd.concat([df, pd.DataFrame(row)])  # DataFrame.append was removed in pandas 2.x
df
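The string round-trip through ast.literal_eval can also be skipped by building the dicts directly; a sketch of the same idea, with the same placeholder calculation:

import pandas as pd

months = [1, 2, 3, 4, 5]
Regions = ['Region1', 'Region2']
rows = []
for region in Regions:
    row = {'Region': region}
    for month in months:
        # put your calculation code here
        row[month] = month + 1
    rows.append(row)
df = pd.DataFrame(rows)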
This might work, depending on what and how you want the results.
import pandas as pd
months = [1, 2, 3, 4, 5]
Region = ['Region1', 'Region2']
df = pd.DataFrame(columns=[1, 2, 3]) # just to put in something
value = 59
for r in Region:
    value += 5
    for m in months:
        df.loc[r, m] = chr(m + value)
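Printing df then gives the requested layout (expected output sketched in the comments):

print(df)
#          1  2  3  4  5
# Region1  A  B  C  D  E
# Region2  F  G  H  I  J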
I have a pandas dataframe like the one below, which I try to sort by the 'dist' column. The sorted dataframe should start with E or F, as shown further down. I use sort_values, but it is not working for me. The function computes distances from the 'Start' location to a list of locations ['C', 'B', 'D', 'E', 'A', 'F'] and is then supposed to sort the dataframe in ascending order by the 'dist' column.
Could someone advise me why the sorting is not working?
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
loc_list
Out[194]: ['C', 'B', 'D', 'E', 'A', 'F']
from math import hypot
import numpy as np
import pandas as pd

def closest_locations(from_loc_point, to_loc_list):
    lresults = list()
    for list_index in range(len(to_loc_list)):
        # Euclidean distance between the two locations
        dist = hypot(locations[from_loc_point[0]][0] - locations[to_loc_list[list_index]][0],
                     locations[from_loc_point[0]][1] - locations[to_loc_list[list_index]][1])
        lista_dist = [from_loc_point[0], to_loc_list[list_index], dist]
        lresults.append(lista_dist[:])
    RESULTS = pd.DataFrame(np.array(lresults))
    RESULTS.columns = ['from', 'to', 'dist']
    RESULTS.sort_values(['dist'], ascending=[True], inplace=True)
    RESULTS.index = range(len(RESULTS))
    return RESULTS
closest_locations(['Start'], loc_list)
Out[189]:
from to dist
0 Start D 10.19803902718557
1 Start A 10.19803902718557
2 Start C 15.132745950421555
3 Start B 15.132745950421555
4 Start E 6.08276253029822
5 Start F 6.08276253029822
closest_two_loc.dtypes
Out[247]:
from object
to object
dist object
dtype: object
Is this what you want?
import numpy as np
import pandas as pd

locations = {'Start': (20,5), 'A': (10,3), 'B': (5,3), 'C': (5,7), 'D': (10,7), 'E': (14,4), 'F': (14,6)}
df = pd.DataFrame.from_dict(locations, orient='index').rename(columns={0: 'x', 1: 'y'})
df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc['Start', 'x'])**2 + (row['y'] - df.loc['Start', 'y'])**2), axis=1)
df.drop(['Start']).sort_values(by='dist')
x y dist
E 14 4 6.082763
F 14 6 6.082763
A 10 3 10.198039
D 10 7 10.198039
C 5 7 15.132746
B 5 3 15.132746
or if you want to wrap it in a function
def dist_from(df, col):
    df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc[col, 'x'])**2 + (row['y'] - df.loc[col, 'y'])**2), axis=1)
    df['from'] = col
    df = df.drop([col]).sort_values(by='dist')  # assign back: drop and sort_values are not in-place
    df.index.name = 'to'
    return df.reset_index().loc[:, ['from', 'to', 'dist']]
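A quick usage sketch, reusing the locations dict and frame construction from above:

df = pd.DataFrame.from_dict(locations, orient='index').rename(columns={0: 'x', 1: 'y'})
print(dist_from(df, 'Start'))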
You need to convert the values in the "dist" column to float:
df = closest_locations(['Start'], loc_list)
df.dist = list(map(lambda x: float(x), df.dist)) # convert each value to float
print(df.sort_values('dist')) # now it will sort properly
Output:
from to dist
4 Start E 6.082763
5 Start F 6.082763
0 Start D 10.198039
1 Start A 10.198039
2 Start C 15.132746
3 Start B 15.132746
Edit: As mentioned by @jezrael in the comments, the following is a more direct method:
df.dist = df.dist.astype(float)
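For what it's worth, the object dtype originates in the question's np.array(lresults): mixing strings and floats in one array upcasts everything to string. Passing the list of lists straight to the DataFrame constructor keeps dist numeric (a sketch):

RESULTS = pd.DataFrame(lresults, columns=['from', 'to', 'dist'])
print(RESULTS.dtypes)  # dist comes out float64, so sort_values works directly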
I have a dataframe with an index and multiple columns. Secondly, I have a few lists containing index values sampled on certain criteria. Now I want to create columns with labels based on whether or not the index of a certain row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'
df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This is working, but it is quite slow. Each call takes about 3 minutes, and considering I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
A 2_name
0 8 A
1 8 A
2 3 B
3 7 B
4 7 A
5 0 B
6 4 B
7 2 A
8 5 not_assigned
9 2 not_assigned
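If more label groups ever show up, np.select generalizes the nested np.where; a sketch using the same lists:

conditions = [df.index.isin(random_m + model_m),
              df.index.isin(random_y + model_y)]
df['2_name'] = np.select(conditions, ['A', 'B'], default='not_assigned')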
I want to count the number of times each value appears in the dataframe.
Here is my dataframe - df:
status
1 N
2 N
3 C
4 N
5 S
6 N
7 N
8 S
9 N
10 N
11 N
12 S
13 N
14 C
15 N
16 N
17 N
18 N
19 S
20 N
I want a dictionary of counts:
ex. counts = {N: 14, C:2, S:4}
I have tried df['status']['N'] but it gives a KeyError, and also df['status'].value_counts, but to no avail.
You can use value_counts and to_dict:
print(df['status'].value_counts())
N 14
S 4
C 2
Name: status, dtype: int64
counts = df['status'].value_counts().to_dict()
print(counts)
{'S': 4, 'C': 2, 'N': 14}
An alternative one-liner using the underdog Counter:
In [3]: from collections import Counter
In [4]: dict(Counter(df.status))
Out[4]: {'C': 2, 'N': 14, 'S': 4}
You can try it this way.
df.stack().value_counts().to_dict()
Can you convert df into a list?
If so:
a = ['a', 'a', 'a', 'b', 'b', 'c']
c = dict()
for i in set(a):
    c[i] = a.count(i)
Using a dict comprehension:
c = {i: a.count(i) for i in set(a)}
See my response in this thread for a Pandas DataFrame output,
count the frequency that a value occurs in a dataframe column
For dictionary output, you can modify as follows:
def column_list_dict(x):
    column_list_df = []
    for col_name in x.columns:
        # pair each column name with its number of distinct values
        y = col_name, len(x[col_name].unique())
        column_list_df.append(y)
    return dict(column_list_df)
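A quick check against the status frame above shows what this returns (note it counts distinct values per column, not per-value frequencies):

print(column_list_dict(df))
# {'status': 3}  (three distinct statuses: N, C, S)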