I am having some trouble populating a pandas DataFrame. I am following the instructions found here to produce a MultiIndex DataFrame. The example works fine, except that I want to store an array instead of a single value.
activity = 'Open_Truck'
id = 1
index = pd.MultiIndex.from_tuples([(activity, id)], names=['activity', 'id'])
v = pd.Series(np.random.randn(1, 5), index=index)
Exception: Data must be 1-dimensional
If I replace randn(1, 5) with randn(1) it works fine. For randn(1, 1) I have to use randn(1, 1).flatten('F'), but then it also works.
When trying:
v = pd.Series(np.random.randn(1, 5).flatten('F'), index=index)
ValueError: Wrong number of items passed 5, placement implies 1
My intention is to add one feature vector (these are np.arrays in the real scenario, not np.random.randn) for each activity and id in each row.
So, how do I manage to add an array to a MultiIndex DataFrame?
Edit:
As I am new to pandas, I mixed up Series and DataFrame. I can achieve the above using a DataFrame, which is two-dimensional by default:
arrays = [np.array(['Open_Truck']*2),
np.array(['1', '2'])]
df = pd.DataFrame(np.random.randn(2, 4), index=arrays)
df
                     0         1         2         3
Open_Truck 1 -0.210923  0.184874 -0.060210  0.301924
           2  0.773249  0.175522 -0.408625 -0.331581
The problem is that the MultiIndex has only one tuple while the data has length 5, so the lengths do not match:
activity = 'Open_Truck'
id = 1
#repeat the tuple 5 times
index = pd.MultiIndex.from_tuples([(activity, id)] * 5, names=['activity', 'id'])
print (index)
MultiIndex(levels=[['Open_Truck'], [1]],
labels=[[0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
names=['activity', 'id'])
print (len(index))
5
v = pd.Series(np.random.randn(1, 5).flatten('F'), index=index)
print (v)
activity id
Open_Truck 1 -1.348832
1 -0.706780
1 0.242352
1 0.224271
1 1.112608
dtype: float64
In the first approach the lengths match (both are 1), because there is one tuple in the list:
activity = 'Open_Truck'
id = 1
index = pd.MultiIndex.from_tuples([(activity, id)], names=['activity', 'id'])
print (len(index))
1
v = pd.Series(np.random.randn(1), index=index)
print (v)
activity id
Open_Truck 1 -1.275131
dtype: float64
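If the goal is one feature vector per (activity, id) pair, a DataFrame with the same one-tuple MultiIndex also works, since the 1x5 array becomes a single row with 5 columns. A minimal sketch reusing the names from the question:
import numpy as np
import pandas as pd

activity = 'Open_Truck'
id = 1  # shadows the built-in, kept to mirror the question
index = pd.MultiIndex.from_tuples([(activity, id)], names=['activity', 'id'])

# one row per (activity, id); the 5 feature values become 5 columns
v = pd.DataFrame(np.random.randn(1, 5), index=index)
print(v)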
I have a pandas dataframe with a column such as:
df1 = pd.DataFrame({ 'val': [997.95, 997.97, 989.17, 999.72, 984.66, 1902.15]})
I have 2 types of events that can be detected from this column, and I want to label them 1 and 2.
I need to get the indices of each label, and to do so I need to find where the 'val' column has changed a lot (±7) from the previous row.
Expected output:
one = [0, 1, 3, 5]
two = [2, 4]
Use Series.diff with a mask to test for values less than 0, and last use boolean indexing to get the indices:
m = df1.val.diff().lt(0)
#if need test less like -7
#m = df1.val.diff().lt(-7)
one = df1.index[~m]
two = df1.index[m]
print (one)
Int64Index([0, 1, 3, 5], dtype='int64')
print (two)
Int64Index([2, 4], dtype='int64')
If need lists:
one = df1.index[~m].tolist()
two = df1.index[m].tolist()
Details:
print (df1.val.diff())
0 NaN
1 0.02
2 -8.80
3 10.55
4 -15.06
5 917.49
Name: val, dtype: float64
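If a single label column is preferred over two index lists, the same mask can feed numpy.where. A sketch assuming the drop-of-more-than-7 reading of the question:
import numpy as np

m = df1.val.diff().lt(-7)
df1['label'] = np.where(m, 2, 1)
print(df1)
# rows 2 and 4 get label 2, matching the expected output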
I am new to Python and data frames and am trying to solve a machine learning problem, but I am stuck on one part and really need to find a way to solve it.
I have 3 binary-valued data frames, each 15x40.
Iterating over each data frame, I need to find, for every row, the minimum number of columns that uniquely distinguishes that row of THAT data frame from the rows of the OTHER data frames.
If a row of a data frame can be uniquely identified based on the minimum possible number of columns with respect to the other data frames, I will look for rows with the same column values in that data frame and remove them (generating a rule).
This way, I hope to find the minimum number of columns, and their values, that distinguish a data frame's rows from those of the other data frames.
Is there any easy way to do this in Python or pandas? So far I have had no success.
Example:
data frame 1:
1 0 1 0
0 1 1 0
1 0 1 1
data frame 2:
1 1 1 0
1 1 1 0
1 1 1 1
data frame 3:
0 0 1 0
0 0 1 0
1 1 0 1
Expected output is something like this:
2 Rules to uniquely define data frame 1:
rule 1: the first 2 columns with values 1, 0 define the first and third rows
rule 2: the first 2 columns with values 0, 1 define the second row
2 Rules to uniquely define data frame 2:
rule 1: the first 3 columns with values 1, 1, 1 define the first and second rows
rule 2: the first 4 columns with values 1, 1, 1, 1 define the third row
2 Rules to uniquely define data frame 3:
rule 1: the first 2 columns with values 0, 0 define the first and second rows
rule 2: the last 2 columns with values 0, 1 define the third row
This is how I want to define rules, based on column values, that uniquely identify a data frame's rows using the minimum number of columns.
Pseudocode I am trying to follow:
for each row i in a data frame:
    count the occurrence of each column value in the other data frames
    order the columns of row i by their number of occurrences
    find the minimum number of columns in the ordered column list that
        uniquely differentiates row i from the rows of the other data frames
    remove all rows of this data frame that also satisfy the found
        columns and their values
    if the length of this data frame is not 0, continue
Is there any library or a simple method to do it?
I solved this particular problem this way.
It is a working solution; it prints an array of rules at the end.
The rules contain one array per data frame.
Each array consists of dictionaries of the form {columnName: columnValue}.
import pandas as pd
import itertools
df0 = pd.DataFrame([[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 1, 1]])
df1 = pd.DataFrame([[1, 1, 1, 0], [1, 1, 1, 0], [1, 1, 1, 1]])
df2 = pd.DataFrame([[0, 0, 1, 0], [0, 0, 1, 0], [1, 1, 0, 1]])
print(df0)
print(df1)
print(df2)
list_dfs = [df0, df1, df2]
def find_rules(list_dfs):
    rules_sets = []
    for idx, df in enumerate(list_dfs):
        trgt_df = df
        # concatenate all the other data frames into one frame to compare against
        other_df = [x for i, x in enumerate(list_dfs) if i != idx]
        other_df = pd.concat(other_df, ignore_index=True)

        def count_occur(value, col_name):
            # how often this value appears in the given column of the other data frames
            return other_df[col_name].value_counts().get(value, 0)

        # for every row, store [value, occurrence count] per column, sorted by rarity
        df_dict = []
        for idx, row in trgt_df.iterrows():
            listz = {}
            for col_name in list(trgt_df.columns):
                listz[col_name] = [row[col_name],
                                   count_occur(row[col_name], col_name)]
            df_dict.append(sorted(listz.items(), key=lambda x: x[1][1]))

        rules = []

        def check_for_uniquness(list_of_attr):
            # True if no row of the other data frames matches all given (column, value) pairs
            for row in other_df.itertuples(index=False):
                conditions = len(list_of_attr)
                for atr in list_of_attr:
                    if row[atr[0]] == atr[1][0]:
                        conditions = conditions - 1
                if conditions == 0:
                    return False
            return True

        def find_col_val(row, val):
            for r in row:
                if r[0] == val:
                    return r[1][0]

        def mark_similar(df_cur, list_of_attr):
            # drop the rows of the current data frame already covered by the rule
            new = []
            for idx, row in enumerate(df_cur):
                combinations = len(list_of_attr)
                for atr in list_of_attr:
                    if find_col_val(row, atr[0]) == atr[1][0]:
                        combinations = combinations - 1
                if combinations == 0:
                    new.append(idx)
            return [x for i, x in enumerate(df_cur) if i not in new]

        def return_dictionary(list_of_attr):
            # turn the (column, [value, count]) pairs into {columnName: columnValue}
            dic = {}
            for idx, el in enumerate(list_of_attr):
                dic[el[0]] = el[1][0]
            return dic

        def possible_combinations(stuff):
            # all non-empty column-index combinations, shortest first
            lists = []
            for L in range(0, len(stuff) + 1):
                for subset in itertools.combinations(stuff, L):
                    lists.append(list(subset))
            del lists[0]
            return lists

        def X2R(df_dict):
            # find the smallest unique column combination, record it as a rule,
            # and return the rows not yet covered by any rule
            for elm in df_dict:
                combinations = possible_combinations(list(range(0, len(elm))))
                for combin in combinations:
                    column_combinations = []
                    for i in combin:
                        column_combinations.append(elm[i])
                    if check_for_uniquness(column_combinations):
                        rules.append(return_dictionary(column_combinations))
                        return mark_similar(df_dict, column_combinations)

        while len(df_dict):
            df_dict = X2R(df_dict)
        rules_sets.append(rules)
    return rules_sets
rules = find_rules(list_dfs)
print(rules)
I have a dataframe (used_dataframe) that contains duplicates. I need to create a list that contains the indices of those duplicates.
For this I used a function I found here:
Find indices of duplicate rows in pandas DataFrame
def duplicates(x):
    #dataframe = pd.read_csv(x)
    #df = dataframe.iloc[: , 1:]
    df = x
    duplicateRowsDF = df[df.duplicated()]
    df = df[df.duplicated(keep=False)]
    tuppl = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()  # this is the function!
    n = 1
    indicees = [x[n] for x in tuppl]
    return indicees
duplicates(used_df)
The next function I need is one where I remove the duplicates from the dataset, which I did like this:
def handling_duplicate_entries(mn):
    x = tidy(mn)
    indices = duplicates(tidy(mn))
    used_df = x
    used_df['indexcol'] = range(0, len(tidy(mn)))
    dropped = used_df[~used_df['indexcol'].isin(indices)]
    finito = dropped.drop(columns=['indexcol'])
    return finito

handling_duplicate_entries(used_df)
And it works, but then I want to check my solution (to verify that all duplicates have been removed).
I do this with duplicates(handling_duplicate_entries(used_df)), which should return an empty list to show that there are no duplicates, but instead it returns the error 'DataFrame' object has no attribute 'tolist'.
In the question linked above this has also been raised in a comment but not solved, and to be quite frank I would love to find a different solution for the duplicates function because I don't quite understand it, but so far I haven't.
OK, I'll do my best.
If you are trying to find the duplicate indices and want to store those values in a list, you can use the following code. I have also included a small example that creates a dataframe containing the duplicated rows (original) and the data without any duplicated rows.
import pandas as pd
# Toy dataset
data = {
'A': [0, 0, 3, 0, 3, 0],
'B': [0, 1, 3, 2, 3, 0],
'C': [0, 1, 3, 2, 3, 0]
}
df = pd.DataFrame(data)
# count how often each complete row occurs
group = df.groupby(list(df.columns)).size()
# keep only the rows that occur more than once
group = group[group > 1].reset_index(name='count')
# replace the occurrence count with a running group id
group = group.drop(columns=['count']).reset_index().rename(columns={'index': 'count'})
# map the duplicated rows back to their original indices
idxs = df.reset_index().merge(group, how='right')['index'].values
duplicates = df.loc[idxs]
no_duplicates = df.loc[~df.index.isin(idxs)]
duplicates
A B C
0 0 0 0
5 0 0 0
2 3 3 3
4 3 3 3
no_duplicates
A B C
1 0 1 1
3 0 2 2
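For the original question's duplicates helper, pandas' built-in duplicated and drop_duplicates may be all that is needed. A sketch with hypothetical helper names, assuming every later copy of an identical row should count as a duplicate:
def duplicate_indices(df):
    # indices of every row that repeats an earlier, identical row
    return df.index[df.duplicated(keep='first')].tolist()

def drop_duplicate_rows(df):
    # keep the first occurrence of each row, drop the rest
    return df.drop_duplicates(keep='first')

print(duplicate_indices(df))                        # [4, 5] for the toy dataset above
print(duplicate_indices(drop_duplicate_rows(df)))   # [] - nothing left to flag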
So, I have the following sample dataframe (included only one row for clarity/simplicity):
df = pd.DataFrame({'base_number': [2],
'std_dev': [1]})
df['amount_needed'] = 5
df['upper_bound'] = df['base_number'] + df['std_dev']
df['lower_bound'] = df['base_number'] - df['std_dev']
For each given row, I would like to generate rows such that the total number of rows per original row is given by df['amount_needed'] (so 5 in this example). I would like those 5 new rows to be spread across the range given by df['lower_bound'] and df['upper_bound']. So for the example above, I would like the following result as an output:
df_new = pd.DataFrame({'base_number': [1, 1.5, 2, 2.5, 3]})
Of course, this process will be done for all rows of a much larger dataframe, with many other columns that aren't relevant to this particular issue, which is why I'm trying to find a way to automate it.
One row of df will create one series (or one data frame). Here's one way to iterate over df and create the series with the values you specified:
for row in df.itertuples():
    arr = np.linspace(row.lower_bound,
                      row.upper_bound,
                      row.amount_needed)
    s = pd.Series(arr).rename('base_number')
    print(s)
0 1.0
1 1.5
2 2.0
3 2.5
4 3.0
Name: base_number, dtype: float64
I ended up using jsmart's contribution and building on it to generate a new dataframe, conserving the original ids in order to merge the other columns from the old dataframe onto the new one by id as needed (the whole process is shown below):
import itertools
import numpy as np
import pandas as pd

amount_needed = 5
df = pd.DataFrame({'base_number': [2, 4, 8, 0],
                   'std_dev': [1, 2, 3, 0]})
df['amount_needed'] = amount_needed
df['upper_bound'] = df['base_number'] + df['std_dev']
df['lower_bound'] = df['base_number'] - df['std_dev']

s1 = pd.Series([], dtype=int)
for row in df.itertuples():
    arr = np.linspace(row.lower_bound,
                      row.upper_bound,
                      row.amount_needed)
    s = pd.Series(arr).rename('base_number')
    s1 = pd.concat([s1, s])

df_new = pd.DataFrame({'base_number': s1})

# repeat each original id amount_needed times so every expanded row keeps its source id
ids_og = list(range(1, len(df) + 1))
ids_og = [ids_og] * amount_needed
ids_og = sorted(list(itertools.chain.from_iterable(ids_og)))
df_new['id'] = ids_og
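To bring the remaining columns of the original frame over to df_new, one option is to give the original rows the same id and merge on it; std_dev stands in here for whatever extra columns the real data has:
df['id'] = range(1, len(df) + 1)
# attach any other needed columns from the original frame by id
df_new = df_new.merge(df[['id', 'std_dev']], on='id', how='left')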
I have a dataframe with an index and multiple columns. Secondly, I have a few lists containing index values sampled on certain criteria. Now I want to create columns with labels based on whether or not the index of a certain row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'
df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This is working, but it is quite slow. Each call takes about 3 minutes, and considering I have to execute these functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
A 2_name
0 8 A
1 8 A
2 3 B
3 7 B
4 7 A
5 0 B
6 4 B
7 2 A
8 5 not_assigned
9 2 not_assigned
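If more than two label groups are ever needed, numpy.select avoids nesting numpy.where calls while keeping the vectorised speed. A sketch with the sample lists above:
conditions = [df.index.isin(random_m + model_m),
              df.index.isin(random_y + model_y)]
choices = ['A', 'B']
df['2_name'] = np.select(conditions, choices, default='not_assigned')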