I need some help in converting the following code to a more efficient one without using iterrows().
for index, row in df.iterrows():
    alist = row['index_vec'].strip("[] ").split(",")
    blist = [int(i) for i in alist]
    for col in blist:
        df.loc[index, str(col)] = df.loc[index, str(col)] + 1
The above code reads the string in the 'index_vec' column, parses it into integers, and then increments the corresponding column by one for each integer.
Take the 0th row as an example: its string value is "[370, 370, -1]", so the code increments column "370" by 2 and column "-1" by 1.
The use of iterrows() is very slow to process a large dataframe. I'd like to get some help in speeding it up. Thank you.
You can also use apply and set axis=1 to go row-wise, then create a custom function and pass it into apply:
Example starting df:
index_vec 1201 370 -1
0 [370, -1, -1] 0 0 1
1 [1201, 1201] 0 1 1
import pandas as pd
df = pd.DataFrame({'index_vec': ["[370, -1, -1]", "[1201, 1201]"], '1201': [0, 0], '370': [0, 1], '-1': [1, 1]})
def add_counts(x):
    counts = pd.Series(x['index_vec'].strip("[]").split(", ")).value_counts()
    x[counts.index] = x[counts.index] + counts
    return x

df = df.apply(add_counts, axis=1)
print(df)
Outputs:
index_vec 1201 370 -1
0 [370, -1, -1] 0 1 3
1 [1201, 1201] 2 1 1
Let us do:
a = df['index_vec'].str.strip("[] ").str.split(",").explode()
s = pd.crosstab(a.index, a).reindex_like(df).fillna(0)
df = df.add(s)
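The snippet above needs a couple of small adjustments to run on the sample frame from the first answer: the exploded values carry leading spaces that have to be stripped, and the string column 'index_vec' has to be excluded from the addition. A minimal, self-contained sketch (assuming string column names, as in the question):

import pandas as pd

df = pd.DataFrame({'index_vec': ["[370, -1, -1]", "[1201, 1201]"],
                   '1201': [0, 0], '370': [0, 1], '-1': [1, 1]})

# One parsed value per original row index; strip the spaces left by split(",").
a = df['index_vec'].str.strip("[] ").str.split(",").explode().str.strip()

# Per-row counts of each value, aligned to df's numeric columns.
s = pd.crosstab(a.index, a).reindex(columns=df.columns.drop('index_vec'),
                                    fill_value=0)

# Increment the numeric columns by the counts.
df[s.columns] = df[s.columns].add(s, fill_value=0)
print(df)

This reproduces the counts shown in the first answer's output.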
I am new to Python and data frames, and I am trying to solve a machine learning problem, but I am stuck on one step and really need to find a way to solve it.
I have 3 binary-valued data frames, each 15×40.
Iterating over each data frame, I need to find, for every row, the minimum number of columns that uniquely distinguish that row of THAT data frame from the rows of the OTHER data frames.
Once a row of a data frame can be uniquely identified by the minimum possible number of columns, I will look for rows in the same data frame with the same values in those columns and remove them (generating a rule).
This way, I believe I can find the minimum number of columns, and their values, that distinguish that data frame's entries from the other data frames.
Is there an easy way to do this in Python or pandas?
I am stuck, and so far I have had no success.
Example:
data frame 1:
1 0 1 0
0 1 1 0
1 0 1 1
data frame 2:
1 1 1 0
1 1 1 0
1 1 1 1
data frame 3:
0 0 1 0
0 0 1 0
1 1 0 1
Expected output is something like this:
2 Rules to uniquely define data frame 1:
rule 1: first 2 columns with values 1, 0 define the first and third rows
rule 2: first 2 columns with values 0, 1 define the second row
2 Rules to uniquely define data frame 2:
rule 1: first 3 columns with values 1, 1, 1 define the first and second rows
rule 2: first 4 columns with values 1, 1, 1, 1 define the third row
2 Rules to uniquely define data frame 3:
rule 1: first 2 columns with values 0, 0 define the first and second rows
rule 2: last 2 columns with values 0, 1 define the third row
This is how I want to define rules based on column values to uniquely identify a data frame with the minimum number of columns.
Pseudo code I am trying to follow:
for each row i in a data frame:
    count the occurrence of each column value in the other data frames
    order all the columns in row i according to their number of occurrences
    find the minimum number of columns in the ordered column list that
        uniquely differentiates row i from the rows in the other data frames
    remove all the rows of this data frame that also satisfy the found
        columns and their values
    if the length of the data frame is not 0, continue
Is there any library or a simple method to do it?
I solved this particular problem as follows.
It is a working solution that prints an array of rules at the end.
The rules contain one array per data frame.
Each array consists of dictionaries of the form {columnName: columnValue}.
import pandas as pd
import itertools
df0 = pd.DataFrame([[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 1]])
df1 = pd.DataFrame([[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1]])
df2 = pd.DataFrame([[0, 0, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1]])
print(df0)
print(df1)
print(df2)
list_dfs = [df0, df1, df2]
def find_rules(list_dfs):
    rules_sets = []
    for idx, df in enumerate(list_dfs):
        trgt_df = df
        # All rows of the *other* data frames, concatenated into one frame.
        other_df = [x for i, x in enumerate(list_dfs) if i != idx]
        other_df = pd.concat(other_df, ignore_index=True)

        def count_occur(value, col_name):
            # How often this value occurs in this column of the other frames.
            return other_df[col_name].value_counts().get(value, 0)

        # For each row, pair every column with [row value, occurrence count],
        # sorted so the rarest (most discriminating) columns come first.
        df_dict = []
        for idx, row in trgt_df.iterrows():
            listz = {}
            for col_name in list(trgt_df.columns):
                listz[col_name] = [row[col_name],
                                   count_occur(row[col_name], col_name)]
            df_dict.append(sorted(listz.items(), key=lambda x: x[1][1]))

        rules = []

        def check_for_uniqueness(list_of_attr):
            # True if no row of the other frames matches every (column, value) pair.
            for row in other_df.itertuples(index=False):
                conditions = len(list_of_attr)
                for atr in list_of_attr:
                    if row[atr[0]] == atr[1][0]:
                        conditions = conditions - 1
                if conditions == 0:
                    return False
            return True

        def find_col_val(row, val):
            for r in row:
                if r[0] == val:
                    return r[1][0]

        def mark_similar(df_cur, list_of_attr):
            # Drop the rows that are already covered by the rule just found.
            new = []
            for idx, row in enumerate(df_cur):
                combinations = len(list_of_attr)
                for atr in list_of_attr:
                    if find_col_val(row, atr[0]) == atr[1][0]:
                        combinations = combinations - 1
                if combinations == 0:
                    new.append(idx)
            return [x for i, x in enumerate(df_cur) if i not in new]

        def return_dictionary(list_of_attr):
            dic = {}
            for idx, el in enumerate(list_of_attr):
                dic[el[0]] = el[1][0]
            return dic

        def possible_combinations(stuff):
            # All non-empty column subsets, smallest first.
            lists = []
            for L in range(0, len(stuff) + 1):
                for subset in itertools.combinations(stuff, L):
                    lists.append(list(subset))
            del lists[0]
            return lists

        def X2R(df_dict):
            # Record the first (smallest) column combination that uniquely
            # identifies a row as a rule, then drop the rows it covers.
            for elm in df_dict:
                combinations = possible_combinations(list(range(0, len(elm))))
                for combin in combinations:
                    column_combinations = []
                    for i in combin:
                        column_combinations.append(elm[i])
                    if check_for_uniqueness(column_combinations):
                        rules.append(return_dictionary(column_combinations))
                        return mark_similar(df_dict, column_combinations)

        while len(df_dict):
            df_dict = X2R(df_dict)
        rules_sets.append(rules)
    return rules_sets
rules = find_rules(list_dfs)
print(rules)
I have a pandas DataFrame that is used for a heatmap, and I would like the minimal value of each column to lie along the diagonal.
I've sorted the columns using
data = data.loc[:, data.min().sort_values().index]
This works. Now I just need to sort the rows so that the min value of the first column is in row 0, the min value of the second column is in row 1, and so on.
Example
import seaborn as sns
import pandas as pd
data = [[5,1,9],
[7,8,6],
[5,3,2]]
data = pd.DataFrame(data)
#sns.heatmap(data)
data = data.loc[:, data.min().sort_values().index]
#sns.heatmap(data) # Gives result in step 1
# Step 1: columns sorted by min value, 1, 2, 5
data = [[1,9,5],
[8,6,7],
[3,2,5]]
data = pd.DataFrame(data)
#sns.heatmap(data)
# How do I perform step two, maintaining column order?
# Step 2: rows sorted by min value
data = [[1,9,5],
[3,2,5],
[8,6,7]]
data = pd.DataFrame(data)
sns.heatmap(data)
Is this possible in pandas in a clever way?
Setup
import numpy as np
import pandas as pd

data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
You can accomplish this by using argsort of the diagonal elements of your sorted DataFrame, then indexing the DataFrame using these values.
Step 1
Use your initial sort:
data = data.loc[:, data.min().sort_values().index]
1 2 0
0 1 9 5
1 8 6 7
2 3 2 5
Step 2
Use np.argsort with np.diag:
data.iloc[np.argsort(np.diag(data))]
1 2 0
0 1 9 5
2 3 2 5
1 8 6 7
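Putting both steps together on the example data (a short, self-contained sketch; it assumes the default RangeIndex, as in the question):

import numpy as np
import pandas as pd

data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
data = data.loc[:, data.min().sort_values().index]  # step 1: order columns by their minimum
data = data.iloc[np.argsort(np.diag(data))]         # step 2: order rows by the sorted diagonal
print(data)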
I'm not quite sure, but you've already done the following to sort the columns:
data = data.loc[:, data.min().sort_values().index]
The same trick can also be applied to sort the rows:
data = data.loc[data.min(axis=1).sort_values().index, :]
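For reference, applying the column sort followed by this row sort to the question's example reproduces the layout asked for in step 2 (a small sketch, not part of the original answer):

import pandas as pd

data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
data = data.loc[:, data.min().sort_values().index]        # columns sorted by min: 1, 2, 5
data = data.loc[data.min(axis=1).sort_values().index, :]  # rows sorted by min: 1, 2, 6
print(data)
#    1  2  0
# 0  1  9  5
# 2  3  2  5
# 1  8  6  7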
To move some values around so that the min value within each column is placed along the diagonal, you could try something like this:
for i in range(len(data)):
    min_index = data.iloc[:, i].idxmin()
    if data.iloc[i, i] != data.iloc[min_index, i]:
        data.iloc[i, i], data.iloc[min_index, i] = data.iloc[min_index, i], data.iloc[i, i]
Basically just swap the min with the diagonal.
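A quick self-contained sketch of this swap approach on the question's example, after the columns have been sorted by their minimum value. Note that, unlike the row-reordering answers above, it moves individual cells, so the rows no longer correspond to the original rows:

import pandas as pd

data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
data = data.loc[:, data.min().sort_values().index]

for i in range(len(data)):
    # idxmin returns a label; with the default RangeIndex it matches the position.
    min_index = data.iloc[:, i].idxmin()
    if data.iloc[i, i] != data.iloc[min_index, i]:
        data.iloc[i, i], data.iloc[min_index, i] = data.iloc[min_index, i], data.iloc[i, i]

print(data)  # the diagonal now holds each column's minimum: 1, 2, 5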
I have some trouble populating a pandas DataFrame. I am following the instructions found here to produce a MultiIndex DataFrame. The example works fine, except that I want to store an array instead of a single value.
activity = 'Open_Truck'
id = 1
index = pd.MultiIndex.from_tuples([(activity, id)], names=['activity', 'id'])
v = pd.Series(np.random.randn(1, 5), index=index)
Exception: Data must be 1-dimensional
If I replace randn(1, 5) with randn(1) it works fine. For randn(1, 1) I have to use randn(1, 1).flatten('F'), but that also works.
When trying:
v = pd.Series(np.random.randn(1, 5).flatten('F'), index=index)
ValueError: Wrong number of items passed 5, placement implies 1
My intention is to add one feature vector (an np.array in the real scenario, not np.random.randn) for each activity and id in each row.
So, how do I manage to add an array to a MultiIndex DataFrame?
Edit:
As I am new to pandas, I mixed up Series and DataFrame. I can achieve the above using a DataFrame, which is two-dimensional by default:
arrays = [np.array(['Open_Truck']*2),
np.array(['1', '2'])]
df = pd.DataFrame(np.random.randn(2, 4), index=arrays)
df
                     0         1         2         3
Open_Truck 1 -0.210923  0.184874 -0.060210  0.301924
           2  0.773249  0.175522 -0.408625 -0.331581
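For completeness, a small sketch of the same idea with an explicit MultiIndex built from tuples, one feature vector per (activity, id) pair (the activity names and vector size here are purely illustrative):

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples([('Open_Truck', 1), ('Open_Truck', 2)],
                                  names=['activity', 'id'])
features = pd.DataFrame(np.random.randn(2, 4), index=index)
print(features)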
The problem is that the MultiIndex contains only one tuple while the data length is 5, so the lengths do not match. Repeat the tuple 5 times:
activity = 'Open_Truck'
id = 1
# repeat the tuple 5 times
index = pd.MultiIndex.from_tuples([(activity, id)] * 5, names=['activity', 'id'])
print (index)
MultiIndex(levels=[['Open_Truck'], [1]],
labels=[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
names=['activity', 'id'])
print (len(index))
5
v = pd.Series(np.random.randn(1, 5).flatten('F'), index=index)
print (v)
activity id
Open_Truck 1 -1.348832
1 -0.706780
1 0.242352
1 0.224271
1 1.112608
dtype: float64
In the first approach the lengths match (both are 1), because there is only one tuple in the list:
activity = 'Open_Truck'
id = 1
index = pd.MultiIndex.from_tuples([(activity, id)], names=['activity', 'id'])
print (len(index))
1
v = pd.Series(np.random.randn(1), index=index)
print (v)
activity id
Open_Truck 1 -1.275131
dtype: float64
Suppose we have a toy example like below.
np.random.seed(seed=1)
df = pd.DataFrame(np.random.randint(low=0,
high=2,
size=(5, 2)))
df
0 1
0 1 1
1 0 0
2 1 1
3 1 1
4 1 0
We want to return the indices of all rows like a certain row. Suppose I want the indices of all rows like row 0, which has a 1 in both column 0 and column 1.
I would want a data structure that has: (0, 2, 3).
I think you can do it like this
df.index[df.eq(df.iloc[0]).all(1)].tolist()
[0, 2, 3]
One way may be to use lambda:
df.index[df.apply(lambda row: all(row == df.iloc[0]), axis=1)].tolist()
Another way may be to use a mask:
df.index[df[df == df.iloc[0].values].notnull().all(axis=1)].tolist()
Result:
[0, 2, 3]
Given the following dataframe:
df = pd.DataFrame({'s1':[1,2,3,4], 's2':[4,3,2,1], 's3':[7,4,3,1], 's4':[9,4,3,1]})
I want to do the following:
1. Map a predicate >2 over ['s1', 's2'] and a predicate >4 over ['s3', 's4']; if true, set the field to 1, else 0.
2. Remove all rows where s1, s2, s3 and s4 are all 0.
3. Group by permutations, for example how many rows are [0, 1, 1, 0], etc.
4. Query for different counts, for example how many rows have s3=1 or s2=1?
The problem I'm having doing this on a larger dataset is that I have to split the dataset up into series, iterate over each series, and then put them back into a dataframe. I want to do all the transformations and queries in only one pass over the data.
Update:
I have been trying something like this.
binary = pd.DataFrame({'s1':[1,0,1,0], 's2':[0,0,1,0], 's3':[1,0,1,1]})
binary.loc[(binary != 0).any(axis=1)]
binary.groupby(['s1', 's2','s3']).count() # it works for 2 values but not 3.
Items 1 and 2
To map the predicate, use the gt function. Then use any to select rows that have at least one True value (i.e. exclude rows that are all False).
You can use astype(int) when applying the predicate, but it doesn't seem necessary until after you filter for rows that are all False.
# Apply predicate.
df[['s1', 's2']] = df[['s1', 's2']].gt(2)
df[['s3', 's4']] = df[['s3', 's4']].gt(4)
# Remove rows that are all False and convert to 0/1.
df = df.loc[df.any(axis=1), :].astype(int)
The resulting binary DataFrame df:
s1 s2 s3 s4
0 0 1 1 1
1 0 1 0 0
2 1 0 0 0
3 1 0 0 0
Item 3
To get a count of all row combinations at once, use apply to get a Series containing a tuple of each row, and use value_counts:
# Counts of permutations.
perms = df.apply(tuple, axis=1).value_counts()
The resulting output:
(1, 0, 0, 0) 2
(0, 1, 0, 0) 1
(0, 1, 1, 1) 1
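If you only need the count for one specific pattern rather than the whole table, a direct comparison works too (a small sketch that repeats the setup above so it runs on its own):

import pandas as pd

df = pd.DataFrame({'s1': [1, 2, 3, 4], 's2': [4, 3, 2, 1],
                   's3': [7, 4, 3, 1], 's4': [9, 4, 3, 1]})
df[['s1', 's2']] = df[['s1', 's2']].gt(2)
df[['s3', 's4']] = df[['s3', 's4']].gt(4)
df = df.loc[df.any(axis=1), :].astype(int)

# Count rows that match the pattern [0, 1, 1, 1] across s1..s4.
print(df.eq([0, 1, 1, 1]).all(axis=1).sum())  # 1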
Item 4
Sum over a Boolean array corresponding to your condition:
# Count of rows where s3=1 or s2=1.
row_count = ((df['s3'] == 1) | (df['s2'] == 1)).sum()
This yields 2 as expected.