Array in DataFrame Panda Python - python

I have this DataFrame. The column ArraysDate contains an array with several elements in each row. I want to be able to index those elements and run a for loop over them, the way I would with an array in Java. I have not found a solution; could you give me some ideas?
For example, with CustomerNumber = 4, ArraysDate has 4 elements, which I want to refer to as i1, i2, i3, i4 when doing calculations on ArraysDate.
Thank you.
CustomerNumber ArraysDate
1 [ 1 13 ]
2 [ 3 ]
3 [ 0 ]
4 [ 2 60 30 40]

If I understand correctly, you want to get the array stored in 'ArraysDate' for a given value of 'CustomerNumber'.
You can use loc for that:
import pandas as pd
data = {'c': [1, 2, 3, 4], 'date': [[1,2],[3],[0],[2,60,30,40]]}
df = pd.DataFrame(data)
df.loc[df['c']==4, 'date']
df.loc[df['c']==4, 'date'] = df.loc[df['c']==4, 'date'].apply(lambda i: sum(i))
Result:
[2, 60, 30, 40]
c date
0 1 [1, 2]
1 2 [3]
2 3 [0]
3 4 132
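If the goal is to loop over the individual elements (i1, i2, ...) rather than sum them, a minimal sketch could look like the following. It assumes the arrays are stored as plain Python lists and reuses the column names from the question; `explode` is only one possible way to get one row per element.

```python
import pandas as pd

df = pd.DataFrame({
    "CustomerNumber": [1, 2, 3, 4],
    "ArraysDate": [[1, 13], [3], [0], [2, 60, 30, 40]],
})

# Take the list stored for customer 4 (first matching row).
dates = df.loc[df["CustomerNumber"] == 4, "ArraysDate"].iloc[0]

# Loop with a 1-based counter, so i1 is the first element.
for i, value in enumerate(dates, start=1):
    print(f"i{i} = {value}")

# Alternative: one row per array element, which makes
# column-wise calculations on the elements possible.
exploded = df.explode("ArraysDate")
print(exploded)
```

After `explode`, each element sits in its own row next to its CustomerNumber, so the usual groupby or filtering tools apply.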

You can use the lambda to sum all items in the array per row.
Step 1: Create a dataframe
import pandas as pd
import numpy as np
d = {'ID': [[1,2,3],[1,2,43]]}
df = pd.DataFrame(data=d)
Step 2: Sum the items in the array
df['ID2']=df['ID'].apply(lambda x: sum(x))
df

Related

Removing duplicates in dataframe via creating a list of their indices pandas

I have a dataframe (used_df) that contains duplicates. I am required to create a list that contains the indices of those duplicates.
For this I used a function I found here:
Find indices of duplicate rows in pandas DataFrame
def duplicates(x):
    # dataframe = pd.read_csv(x)
    # df = dataframe.iloc[:, 1:]
    df = x
    duplicateRowsDF = df[df.duplicated()]
    df = df[df.duplicated(keep=False)]
    tuppl = df.groupby(list(df)).apply(lambda x: tuple(x.index)).tolist()  # this is the function!
    n = 1  # N. . .
    indicees = [x[n] for x in tuppl]
    return indicees

duplicates(used_df)
The next function I need is one that removes the duplicates from the dataset, which I wrote like this:
def handling_duplicate_entries(mn):
    x = tidy(mn)
    indices = duplicates(tidy(mn))
    used_df = x
    used_df['indexcol'] = range(0, len(tidy(mn)))
    dropped = used_df[~used_df['indexcol'].isin(indices)]
    finito = dropped.drop(columns=['indexcol'])
    return finito

handling_duplicate_entries(used_df)
And it works. But when I want to check my solution (to verify that all duplicates have been removed),
which I do by calling duplicates(handling_duplicate_entries(used_df)), which should return an empty result to show that there are no duplicates, it raises the error 'DataFrame' object has no attribute 'tolist'.
In the linked question this problem was also raised in a comment but never solved. To be frank, I would love to find a different solution for the duplicates function because I don't quite understand it, but so far I haven't found one.
OK, I'll do my best.
If you are trying to find the indices of the duplicated rows and store them in a list, you can use the following code. I have also included a small example that builds one dataframe containing the duplicated rows (duplicates) and one containing the rows that have no duplicates (no_duplicates).
import pandas as pd
# Toy dataset
data = {
'A': [0, 0, 3, 0, 3, 0],
'B': [0, 1, 3, 2, 3, 0],
'C': [0, 1, 3, 2, 3, 0]
}
df = pd.DataFrame(data)
group = df.groupby(list(df.columns)).size()
group = group[group>1].reset_index(name = 'count')
group = group.drop(columns=['count']).reset_index().rename(columns={'index':'count'})
idxs = df.reset_index().merge(group, how = 'right')['index'].values
duplicates = df.loc[idxs]
no_duplicates = df.loc[~df.index.isin(idxs)]
duplicates
A B C
0 0 0 0
5 0 0 0
2 3 3 3
4 3 3 3
no_duplicates
A B C
1 0 1 1
3 0 2 2
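As a shorter alternative to the groupby and merge above, here is a hedged sketch built only on `DataFrame.duplicated` and `drop_duplicates`. It assumes "duplicate" means a row that is identical in all columns, and it returns an empty list when nothing is duplicated, which is the case that fails in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [0, 0, 3, 0, 3, 0],
    "B": [0, 1, 3, 2, 3, 0],
    "C": [0, 1, 3, 2, 3, 0],
})

# Index labels of repeated rows, not counting the first occurrence.
dup_idx = df.index[df.duplicated()].tolist()

# Index labels of every row that has at least one identical twin.
all_dup_idx = df.index[df.duplicated(keep=False)].tolist()

# Keep the first occurrence of each row and drop the repeats.
deduped = df.drop_duplicates()

# Checking the cleaned frame gives an empty list instead of an error.
remaining = deduped.index[deduped.duplicated()].tolist()

print(dup_idx, all_dup_idx, remaining)
```

Note that this keeps one copy of each duplicated row, whereas no_duplicates above drops every copy; pass keep=False to drop_duplicates to get the latter behaviour.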

How can I make this kind of aggregation in pandas?

I have a dataframe that has categorical columns and numerical columns, and I want to aggregate the values in the numerical columns (max, min, sum...) depending on the value of the categorical ones (so I have to create a new column for each value that each categorical column can take).
To make it more understandable, here is a toy example.
Say that I have this dataframe:
import pandas as pd
df = pd.DataFrame({
'ref' : [1, 1, 1, 2, 2, 3],
'value_type' : ['A', 'B', 'A', 'C', 'C', 'A'],
'amount' : [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
value_type amount
ref
1 A 100
1 B 50
1 A 20
2 C 300
2 C 150
3 A 70
And I want to group the amounts on the values of value_type, grouping also for each reference. The result in this case (supposing that only the sum was needed) will be this one:
df_result = pd.DataFrame({
'ref' : [1, 2, 3],
'sum_amount_A' : [120, 0, 70],
'sum_amount_B' : [50, 0, 0],
'sum_amount_C' : [0, 450, 0]
}).set_index('ref')
sum_amount_A sum_amount_B sum_amount_C
ref
1 120 50 0
2 0 0 450
3 70 0 0
I have tried something that works, but it's extremely inefficient. It takes several minutes to process approximately 30,000 rows.
What I have done is this (I have a dataframe called df_final with a single row for each index value ref):
df_grouped = df.groupby(['ref'])
for ref in df_grouped.groups:
    df_aux = df.loc[[ref]]
    column = 'A'  # I have more columns, but for illustration one is enough
    for value in df_aux[column].unique():
        df_aux_column_value = df_aux.loc[df_aux[column] == value]
        df_final.at[ref, 'sum_' + column + '_' + str(value)] = np.sum(df_aux_column_value[column])
I'm sure there should be better ways of doing this aggregation... Thanks in advance!
EDIT:
The answer given is correct when there is only one column to group by. In the real dataframe I have several categorical columns, and I want to calculate some aggregation functions over the values of each column separately. That is, I don't want an aggregated value for each combination of values across the columns, only for each column on its own.
Let's make an example.
import pandas as pd
df = pd.DataFrame({
'ref' : [1, 1, 1, 2, 2, 3],
'sexo' : ['Hombre', 'Hombre', 'Hombre', 'Mujer', 'Mujer', 'Hombre'],
'lugar_trabajo' : ['Campo', 'Ciudad', 'Campo', 'Ciudad', 'Ciudad', 'Campo'],
'dificultad' : ['Alta', 'Media', 'Alta', 'Media', 'Baja', 'Alta'],
'amount' : [100, 50, 20, 300, 150, 70]
}).set_index(['ref'])
This dataframe looks like that:
sexo lugar_trabajo dificultad amount
ref
1 Hombre Campo Alta 100
1 Hombre Ciudad Media 50
1 Hombre Campo Alta 20
2 Mujer Ciudad Media 300
2 Mujer Ciudad Baja 150
3 Hombre Campo Alta 70
If I group by several columns, or make a pivot table (which in a way is equivalent, as far as I know), doing this:
df.pivot_table(index='ref',columns=['sexo','lugar_trabajo','dificultad'],values='amount',aggfunc=[np.sum,np.min,np.max,len], dropna=False)
I will get a dataframe with 48 columns (because I have 3 * 2 * 2 different values, and 4 agg functions).
One way to achieve the result that I want is this:
df_agregado = pd.DataFrame(df.index).set_index('ref')
for col in ['sexo','lugar_trabajo','dificultad']:
    df_agregado = pd.concat([df_agregado, df.pivot_table(index='ref',columns=[col],values='amount',aggfunc=[np.sum,np.min,np.max,len])],axis=1)
I do each group-by on its own and concatenate all of them. This way I get 28 columns (2 * 4 + 3 * 4 + 2 * 4). It works and it's fast, but it's not very elegant. Is there another way of getting this result?
A more efficient way is to use pandas built-in functions instead of for loops. There are two main steps.
First, group not only by the index but by both the index and the column:
res = df.groupby(['ref','value_type']).sum()
print(res)
The output is like this at this step:
amount
ref value_type
1 A 120
B 50
2 C 450
3 A 70
Second, you need to unstack the multi index, as follows:
df2 = res.unstack(level='value_type',fill_value=0)
The output will be your desired output:
amount
value_type A B C
ref
1 120 50 0
2 0 0 450
3 70 0 0
As an optional step you can use droplevel to flatten it:
df2.columns = df2.columns.droplevel()
value_type A B C
ref
1 120 50 0
2 0 0 450
3 70 0 0
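Putting the steps together, a minimal sketch for the single-column case might look like this. The sum_amount_ prefix is taken from the desired result in the question; selecting the amount column before summing is my own choice so that no column level has to be dropped afterwards.

```python
import pandas as pd

df = pd.DataFrame({
    "ref": [1, 1, 1, 2, 2, 3],
    "value_type": ["A", "B", "A", "C", "C", "A"],
    "amount": [100, 50, 20, 300, 150, 70],
}).set_index("ref")

# Sum amount per (ref, value_type), then move value_type into columns.
wide = (
    df.groupby(["ref", "value_type"])["amount"]
    .sum()
    .unstack(fill_value=0)
)

# Rename A/B/C to sum_amount_A/B/C and drop the columns' axis name.
wide = wide.add_prefix("sum_amount_")
wide.columns.name = None

print(wide)
```

Because "amount" is selected before the sum, the result has a single level of columns and droplevel is not needed.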

Converting a list with no tuples into a data frame

Normally, when you want to turn a set of data into a data frame, you make a list for each column, create a dictionary from those lists, then create a data frame from the dictionary.
The data frame I want to create has 75 columns, all with the same number of rows. Defining lists one by one isn't going to work. Instead, I decided to make a single list and iteratively put a chunk of it into each column of a data frame.
Here I will make an example where I turn a list into a data frame:
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Example list
df =
a b c d e
0 0 2 4 6 8
1 1 3 5 7 9
# Result I want from the example list
Here is my test code:
import pandas as pd
import numpy as np
dict = {'a':[], 'b':[], 'c':[], 'd':[], 'e':[]}
df = pd.DataFrame(dict)
# Here is my test data frame, it contains 5 columns and no rows.
lst = np.arange(10).tolist()
# This is my test list, it looks like this lst = [0, 1, …, 9]
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]))
# This code is supposed to put two entries per column for the whole data frame.
# For the first column, i = 0, so [2 * (0):2 * (0) + 2] = [0:2]
# df.iloc[:, 0] = lst[0:2], so df.iloc[:, 0] = [0, 1]
# Second column i = 1, so [2 * (1):2 * (1) + 2] = [2:4]
# df.iloc[:, 1] = lst[2:4], so df.iloc[:, 1] = [2, 3]
# This is how the code was supposed to allocate lst to df.
# However it outputs an error.
When I run this code I get this error:
ValueError: cannot reindex from a duplicate axis
When I add ignore_index = True such that I have
for i in range(len(lst)):
    df.iloc[:, i] = df.iloc[:, i]\
        .append(pd.Series(lst[2 * i:2 * i + 2]), ignore_index = True)
I get this error:
IndexError: single positional indexer is out-of-bounds
After running the code, I check the results of df. The output is the same whether I ignore index or not.
In: df
Out:
a b c d e
0 0 NaN NaN NaN NaN
1 1 NaN NaN NaN NaN
It seems that the first loop runs fine, but the error occurs when trying to fill the second column.
Does anybody know how to get this to work? Thank you.
IIUC:
import numpy as np
import pandas as pd
lst = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
alst = np.array(lst)
# order='F' fills the 2 x 5 array column by column
df = pd.DataFrame(alst.reshape(2,-1, order='F'), columns = [*'abcde'])
print(df)
Output:
a b c d e
0 0 2 4 6 8
1 1 3 5 7 9
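The same idea should scale to the 75 columns mentioned in the question. This is only a sketch: the two rows per column and the col_0 ... col_74 names are assumptions for illustration, since the real row count and column names are not given.

```python
import numpy as np
import pandas as pd

n_cols = 75   # number of columns, as stated in the question
n_rows = 2    # assumed rows per column, for illustration only

# Flat list in which each consecutive chunk of n_rows values is one column.
flat = list(range(n_cols * n_rows))

# Hypothetical column names; replace with the real ones.
columns = [f"col_{i}" for i in range(n_cols)]

# order="F" fills the array column by column, so the first n_rows
# values of the flat list become the first column, and so on.
values = np.array(flat).reshape(n_rows, n_cols, order="F")
df = pd.DataFrame(values, columns=columns)

print(df.iloc[:, :3])
```

Building the whole array once avoids appending to the frame column by column, which is what triggers the errors in the question.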

Faster way of iterating through lists in Python

I have list vector_list of length 800,000, where the elements are lists of size 768. I'm trying to add 768 columns to a pandas dataframe where each column is 800,000 long and represents an element from each list. Here's my code:
active = pd.DataFrame()
for i in range(len(vector_list[0])):
    element_list = []
    for j in range(len(vector_list)):
        element_list.append(vector_list[j][i])
    active['Element {}'.format(i)] = element_list
Just to reiterate,
len(vector_list) = 800,000
len(vector_list[0]) = 768
Is there a more clever, faster way to do this?
Pass the list directly to the DataFrame constructor.
import pandas as pd
_list = [[1, 2], [3, 4], [5, 6], [7, 8]]
df = pd.DataFrame(_list)
print(df.head())
Output
0 1
0 1 2
1 3 4
2 5 6
3 7 8
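To keep the 'Element i' column names from the question, the names can be passed to the constructor as well. A small sketch, using a tiny stand-in for the real 800,000 x 768 vector_list:

```python
import pandas as pd

# Small stand-in for the list of 800,000 lists of length 768.
vector_list = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# One name per position inside the inner lists.
columns = [f"Element {i}" for i in range(len(vector_list[0]))]

# Each inner list becomes a row; each position becomes a column.
active = pd.DataFrame(vector_list, columns=columns)

print(active)
```

This builds the frame in one call instead of looping over every element in Python.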

Python Linear Regression input values

I have an Excel sheet with 2 columns and 1000 rows.
I want to give this as input to the linear regression fit in sklearn.
When I create a dataframe using pandas, how can I pass the inputs?
Something like df_x=pd.DataFrame(...)
I managed it successfully without a dataframe as:
npMatrix=np.matrix(raw_data)
X,Y=npMatrix[:,1],npMatrix[:,2]
md1=LinearRegression().fit(X,Y)
Can you help me with how to access the rows in pandas?
I think you can convert a pandas dataframe to a numpy array with np.array().
This is discussed here: Quora: How does python-pandas go along with scikit-learn library?
The example, by Muktabh Mayank, is copied below:
>>> from pandas import *
>>> from numpy import *
>>> new_df = DataFrame(array([[1,2,3,4],[5,6,7,8],[9,8,10,11],[16,45,67,88]]))
>>> new_df.index= ["A1","A2","A3","A4"]
>>> new_df.columns= ["X1","X2","X3","X4"]
>>> new_df
X1 X2 X3 X4
A1 1 2 3 4
A2 5 6 7 8
A3 9 8 10 11
A4 16 45 67 88
>>> array(new_df)
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 8, 10, 11],
[16, 45, 67, 88]], dtype=int64)
>>>
And btw, people are actually working on bridging sklearn and pandas: sklearn-pandas
You can read the Excel file:
df = pd.read_excel(...)
You can select a single column by its label. If the columns are labelled with numbers (for example, when the sheet is read with header=None):
X = df[0]
Y = df[1]
If the columns have names, e.g. "column1", "column2":
X = df["column1"]
Y = df["column2"]
But this gives a single column as a Series.
If you need a single column as a DataFrame, use a list of columns:
X = df[ [0] ]
Y = df[ [1] ]
More: How to get column by number in Pandas?
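For completeness, here is a hedged end-to-end sketch of fitting the regression straight from a DataFrame. The column names x and y and the in-memory data are assumptions standing in for whatever pd.read_excel(...) returns; scikit-learn needs to be installed.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Stand-in for the frame returned by pd.read_excel(...);
# the column names "x" and "y" are hypothetical.
df = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0],
    "y": [1.0, 3.0, 5.0, 7.0],
})

X = df[["x"]]  # list of columns -> 2-D DataFrame, as fit() expects for features
y = df["y"]    # single column -> 1-D Series for the target

model = LinearRegression().fit(X, y)

slope = float(model.coef_[0])
intercept = float(model.intercept_)
print(slope, intercept)
```

The double brackets matter: fit() expects a 2-D feature matrix, which is why the feature is selected as a DataFrame rather than a Series.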
