Converting a list with no tuples into a data frame - python

Normally when you want to create a turn a set of data into a Data Frame, you make a list for each column, create a dictionary from those lists, then create a data frame from the dictionary.
The data frame I want to create has 75 columns, all with the same number of rows. Defining lists one-by-one isn't going work. Instead I decided to make a single list and iteratively put a certain chunk of each row onto a Data Frame.
Here I will make an example where I turn a list into a data frame:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Example list
df =
a b c d e
0 0 2 4 6 8
1 1 3 5 7 9
# Result I want from the example list
Here is my test code:
import pandas as pd
import numpy as np
dict = {'a':[], 'b':[], 'c':[], 'd':[], 'e':[]}
df = pd.DataFrame(dict)
# Here is my test data frame, it contains 5 columns and no rows.
lst = np.arange(10).tolist()
# This is my test list, it looks like this lst = [0, 2, …, 9]
for i in range(len(lst)):
df.iloc[:, i] = df.iloc[:, i]\
.append(pd.Series(lst[2 * i:2 * i + 2]))
# This code is supposed to put two entries per column for the whole data frame.
# For the first column, i = 0, so [2 * (0):2 * (0) + 2] = [0:2]
# df.iloc[:, 0] = lst[0:2], so df.iloc[:, 0] = [0, 1]
# Second column i = 1, so [2 * (1):2 * (1) + 2] = [2:4]
# df.iloc[:, 1] = lst[2:4], so df.iloc[:, 1] = [2, 3]
# This is how the code was supposed to allocate lst to df.
# However it outputs an error.
When I run this code I get this error:
ValueError: cannot reindex from a duplicate axis
When I add ignore_index = True such that I have
for i in range(len(lst)):
df.iloc[:, i] = df.iloc[:, i]\
.append(pd.Series(lst[2 * i:2 * i + 2]), ignore_index = True)
I get this error:
IndexError: single positional indexer is out-of-bounds
After running the code, I check the results of df. The output is the same whether I ignore index or not.
In: df
Out:
a b c d e
0 0 NaN NaN NaN NaN
1 1 NaN NaN NaN NaN
It seems that the first loop runs fine, but the error occurs when trying to fill the second column.
Does anybody know how to get this to work? Thank you.

IIUC:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
alst = np.array(lst)
df = pd.DataFrame(alst.reshape(2,-1, order='F'), columns = [*'abcde'])
print(df)
Output:
a b c d e
0 0 2 4 6 8
1 1 3 5 7 9

Related

Get index of row where column value changes from previous row

I have a pandas dataframe with a column such as :
df1 = pd.DataFrame({ 'val': [997.95, 997.97, 989.17, 999.72, 984.66, 1902.15]})
I have 2 types of events that can be detected from this column, I wanna label them 1 and 2 .
I need to get the indexes of each label , and to do so I need to find where the 'val' column has changed a lot (± 7 ) from previous row.
Expected output:
one = [0, 1, 3, 5]
two = [2, 4 ]
Use Series.diff with mask for test less values like 0, last use boolean indexing with indices:
m = df1.val.diff().lt(0)
#if need test less like -7
#m = df1.val.diff().lt(-7)
one = df1.index[~m]
two = df1.index[m]
print (one)
Int64Index([0, 1, 3, 5], dtype='int64')
print (two)
nt64Index([2, 4], dtype='int64')
If need lists:
one = df1.index[~m].tolist()
two = df1.index[m].tolist()
Details:
print (df1.val.diff())
0 NaN
1 0.02
2 -8.80
3 10.55
4 -15.06
5 917.49
Name: val, dtype: float64

Python Dataframe find closest matching value with a tolerance

I have a data frame consisting of lists as elements. I want to find the closest matching values within a percentage of a given value.
My code:
df = pd.DataFrame({'A':[[1,2],[4,5,6]]})
df
A
0 [1, 2]
1 [3, 5, 7]
# in each row, lets find a the values and their index that match 5 with 20% tolerance
val = 5
tol = 0.2 # find values matching 5 or 20% within 5 (4 or 6)
df['Matching_index'] = (df['A'].map(np.array)-val).map(abs).map(np.argmin)
Present solution:
df
A Matching_index
0 [1, 2] 1 # 2 matches closely with 5 but this is wrong
1 [4, 5, 6] 1 # 5 matches with 5, correct.
Expected solution:
df
A Matching_index
0 [1, 2] NaN # No matching value, hence NaN
1 [4, 5, 6] 1 # 5 matches with 5, correct.
Idea is get difference with val and then replace to missing values if not match tolerance, last get np.nanargmin which raise error if all missing values, so added next condition with np.any:
def f(x):
a = np.abs(np.array(x)-val)
m = a <= val * tol
return np.nanargmin(np.where(m, a, np.nan)) if m.any() else np.nan
df['Matching_index'] = df['A'].map(f)
print (df)
A Matching_index
0 [1, 2] NaN
1 [4, 5, 6] 1.0
Pandas solution:
df1 = pd.DataFrame(df['A'].tolist(), index=df.index).sub(val).abs()
df['Matching_index'] = df1.where(df1 <= val * tol).dropna(how='all').idxmin(axis=1)
I'm not sure it you want all indexes or just a counter.
Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[[1,2],[4,5,6,7,8]]})
val = 5
tol = 0.3
def closest(arr,val,tol):
idxs = [ idx for idx,el in enumerate(arr) if (np.abs(el - val) < val*tol)]
result = len(idxs) if len(idxs) != 0 else np.nan
return result
df['Matching_index'] = df['A'].apply(closest, args=(val,tol,))
df
If you want all the indexes, just return idxs instead of len(idxs).

Removing duplicates in dataframe via creating a list of their indices pandas

i have a dataframe (=used_dataframe), that contains duplicates. I am required to create a list that contains the indices of those duplicates
For this I used a function I found here:
Find indices of duplicate rows in pandas DataFrame
def duplicates(x):
#dataframe = pd.read_csv(x)
#df = dataframe.iloc[: , 1:]
df = x
duplicateRowsDF = df[df.duplicated()]
df = df[df.duplicated(keep=False)]
tuppl = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist() #this is the function!
n = 1 # N. . .
indicees = [x[n] for x in tuppl]
return indicees
duplicates(used_df)
The next function I need is one, where I remove the duplicates from the dataset which i did like this:
x= tidy(mn)
indices = duplicates(tidy(mn))
used_df = x
used_df['indexcol'] = range(0, len(tidy(mn)))
dropped = used_df[~used_df['indexcol'].isin(indices)]
finito = dropped.drop(columns=['indexcol'])
return finito
handling_duplicate_entries(used_df)
And it works - but when I want to check my solution (to assess, that all duplicates have been removed)
Which I do by duplicates(handling_duplicate_entries(used_df))which should return an empty dataframe to show that there are no duplicates, it returns the error 'DataFrame' object has no attribute 'tolist'.
In the question of the link above, this has also been added as a comment but not solved - and to be quite frank I would love to find a different solution for the duplicates function because I don't quite understand it but so far I haven't.
Ok. I'll try to do my best.
So if you are trying to find the duplicate indices, and want to store those values in a list you can use the following code. Also I have included a small example to create a dataframe containing the duplicated values (original), and the data without any duplicated data.
import pandas as pd
# Toy dataset
data = {
'A': [0, 0, 3, 0, 3, 0],
'B': [0, 1, 3, 2, 3, 0],
'C': [0, 1, 3, 2, 3, 0]
}
df = pd.DataFrame(data)
group = df.groupby(list(df.columns)).size()
group = group[group>1].reset_index(name = 'count')
group = group.drop(columns=['count']).reset_index().rename(columns={'index':'count'})
idxs = df.reset_index().merge(group, how = 'right')['index'].values
duplicates = df.loc[idxs]
no_duplicates = df.loc[~df.index.isin(idxs)]
duplicates
A B C
0 0 0 0
5 0 0 0
2 3 3 3
4 3 3 3
no_duplicates
A B C
1 0 1 1
3 0 2 2

create a DataFrame from for loop

I have defined a function to create a dataframe, but I get two lists in each column, how could I get each element of the list as a separate row in the dataframe as shown below.
a = [1, 2, 3, 4]
def function():
result = []
for i in range(0, len(a)):
number = [i for i in a]
operation = [8*i for i in a]
result.append({'number': number, 'operation': operation})
df = pd.DataFrame(result, columns=['number','operation'])
return df
function()
Result:
number operation
0 [1, 2, 3, 4] [8, 16, 24, 32]
What I really want to:
number operation
0 1 8
1 2 16
2 3 24
3 4 34
Can anyone help me please? :)
Your problems are twofold, firstly you are pushing the entire list of values (instead of the "current" value) into the result array on each pass through your for loop, and secondly you are overwriting the dataframe each time as well. It would be simpler to use a list comprehension to generate the values for the dataframe:
import pandas as pd
a = [1, 2, 3, 4]
def function():
result = [{'number' : i, 'operation' : 8*i} for i in a]
df = pd.DataFrame(result)
return df
print(function())
Output:
number operation
0 1 8
1 2 16
2 3 24
3 4 32
import numpy as np
a = [1, 2, 3, 4]
def function():
for i in range(0, len(a)):
number = [i for i in a]
operation = [8*i for i in a]
v=np.rot90(np.array((number,operation)))
result=np.flipud(v)
df = pd.DataFrame(result, columns=['number','operation'])
return df
print (function())
number operation
0 1 8
1 2 16
2 3 24
3 4 32
You are almost there. Just replace number = [i for i in a] with number = a[i] and operation = [8*i for i in a] with operation = 8 * a[i]
(FYI: No need to create pandas dataframe inside loop. You can get same output with pandas dataframe creation outside loop)
Refer to the below code:
a = [1, 2, 3, 4]
def function():
result = []
for i in range(0, len(a)):
number = a[i]
operation = 8*a[i]
result.append({'number': number, 'operation': operation})
df = pd.DataFrame(res, columns=['number','operation'])
return df
function()
number operation
0 1 8
1 2 16
2 3 24
3 4 32

Balanced row sample from dataframe with pandas given categorical target column

Given a dataframe my goal is to sample rows such that values in one column are as balanced as possible.
Say I have a dataframe below, the sample size is 3 and target column is c
a | b | c
1 | 2 | 0
3 | 4 | 0
5 | 6 | 1
7 | 8 | 2
9 | 10| 2
11| 12| 2
One of possible samples would be
a | b | c
1 | 2 | 0
5 | 6 | 1
7 | 8 | 2
In case of sample size is not a multiple of the number of unique classes, it is fine to have difference in 1 item or so.
How would I approach this in pandas?
EDIT: provided solution that worked for me in answers
I first generated sample sizes for each unique value of column c so that it is balanced. The remainders are distributed over the first few elements
unique_values = df['c'].unique()
sample_sizes = [(k//len(df.columns))] * len(unique_values)
i = 0
while i < k%len(df.columns):
sample_sizes[i]+= 1
i= I+1
This bit generates the samples based on the generated sample sizes
df2= pd.concat([df.loc[df['c'] == unique_values[i]].sample() for i in range(len(sample_sizes)) for j in range(sample_sizes[i])])
You can just get a random sample of the dataframe based on the minimum count of the target column.
column = 'c'
df = df.groupby(column).sample(n=df[column].value_counts().min(), random_state='42')
First, we create your example dataframe
columns = ['a', 'b', 'c']
data = [[1, 2, 0], [4, 4, 0], [5, 6, 1], [7, 8, 2], [9, 10, 2], [11, 12, 2]]
df = pd.DataFrame(data = data, columns = columns)
Now, with the following function you can do what you want
def balanced_sample(dataframe, sample_size, target_column):
# extract existing possible classes
target_columns_values = dataframe.loc[:, target_column].unique().tolist()
# count number of classes
target_columns_unique_classes_size = len(target_columns_values)
# checking if sample size is multiple of number of classes
if sample_size%target_columns_unique_classes_size !=0:
print('Sample size is not a multiple of the number of unique classes')
# to have difference in 1 item or so
instances_per_class = round(sample_size/target_columns_unique_classes_size)
# other possibilitie is to use
# sample_size//target_columns_unique_classes_size instead of round(...)
# but then, instances_per_class will be always <= than
# sample_size/target_columns_unique_classes_size
# checking if there is enought examples per class
values_per_class = dataframe.loc[:, target_column].value_counts()
for idx in values_per_class.index:
if instances_per_class>values_per_class[idx]:
print('Class {} has only {} example, so it is impossible to use {}
sample size, i.e., {} per class'.format(idx, values_per_class[idx],
sample_size, instances_per_class))
return pd.DataFrame(columns = dataframe.columns)
# creating the result dataframe
data = []
for classes in target_columns_values:
class_values = dataframe[dataframe.loc[:, target_column] ==
classes].sample(instances_per_class).values.tolist()
data+=class_values
result_dataframe = pd.DataFrame(columns = dataframe.columns, data = data)
return result_dataframe
Now we check the function:
And with other options:
I hope you find it useful, if you have any doubt, comment it here and I will try to answer you.
Question is a bit ambiguous but let say you want to randomly select 1 row for each column c category one could do:
import pandas as pd
data = [
[1, 2, 0], [1, 4, 0], [2, 2, 1],
[4, 5, 1], [3, 7, 2], [3, 3, 2],
[1, 2, 6], [3, 2, 6], [5, 2, 6]
]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])
sample = df.groupby('c').apply(lambda x: x.sample(n=1).squeeze())
a b c
c
0 1 4 0
1 2 2 1
2 3 3 2
6 1 2 6
I am posting the solution that works for me. It is not the most beautiful or efficient code. But that's honest work.
df = pd.read_csv(path)
target_col = 't'
unique_values = df[target_col].unique()
k = 8 #sample size
per_class_sample_size = int(k/unique_values.shape[0])
arr_samples_per_class = [0] * len(unique_values)
leftover = k - (per_class_sample_size * len(unique_values))
for i, v in enumerate(unique_values):
occ = df[df[target_col] == v].shape[0]
if leftover > 0 and occ > per_class_sample_size:
sz = per_class_sample_size + 1
leftover -= 1
else:
sz = per_class_sample_size if occ >= per_class_sample_size else occ
arr_samples_per_class[i] = sz
fdf = None
for v, sz in zip(unique_values, arr_samples_per_class):
ss = df.loc[df[target_col] == v].sample(sz)
fdf = ss if fdf is None else pd.concat([fdf, ss], axis=0)

Categories

Resources