I have a problem with an array column in a DataFrame.
For example, I have this data:
CustomerNumber ArraysDate
1 [ 1 4 13 ]
2 [ 3 ]
3 [ 0 ]
4 [ 2 60 30 40]
I want to calculate the sum of the elements in ArraysDate.
I created a function:
def Caculator(n, x, value):
    v = 0
    for i in n-x:
        v = sum(value)
    return v
and call it like this:
s['Sum'] = Caculator(s['n'],1,s['ArraysDate'])
where n is the number of elements in the ArraysDate column. I want to calculate
Sum = t_1 + t_2 + ... + t_(n-x)
Expected result:
CustomerNumber ArraysDate Sum
1 [ 1 4 13 ] 5
2 [ 3 ] 0
3 [ 0 ] 0
4 [ 2 60 30 40] 92
IIUC you can use:
df['Sum']=df.ArraysDate.apply(lambda x: sum(x[:len(x)-1]))
#or df.ArraysDate.str[:-1].apply(sum)
print(df)
CustomerNumber ArraysDate Sum
0 1 [1, 4, 13] 5
1 2 [3] 0
2 3 [0] 0
3 4 [2, 60, 30, 40] 92
DF: df = pd.DataFrame({'CustomerNumber': [1, 2, 3, 4], 'ArraysDate': [[1,4,13],[3],[0],[2,60,30,40]]})
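If you need the general Sum = t_1 + ... + t_(n-x) from the question, a hedged sketch that parameterizes x, the number of trailing elements to drop (x as a plain variable is my assumption, mirroring the question's notation):
x = 1  # drop the last x elements before summing
df['Sum'] = df['ArraysDate'].apply(lambda a: sum(a[:len(a) - x]))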
Maybe something like:
def Caculator(x, arrayDates):
    vList = []
    for i in range(arrayDates.count()):
        v = 0
        for num in range(0, len(arrayDates[i]) - x):
            v = v + arrayDates[i][num]
        vList.append(v)
    return vList
for the DataFrame s:
data = [[1, [1, 4, 13]], [2, [3]], [3, [0]], [4, [2, 60, 30, 40]]]
s = pd.DataFrame(data, columns = ['CustomerNumber', 'ArraysDate'])
and call the function like this:
s['Sum'] = Caculator(1,s['ArraysDate'])
You can compute the sum over the ArraysDate column of a pandas DataFrame like this:
import pandas as pd
import numpy as np
d={'CustomerNumber':pd.Series([1,2,3,4]),
'ArraysDate':pd.Series([[1,4,13],[3],[0],[2,60,30,40]])}
df=pd.DataFrame(d)
df['sum']=[np.sum(i[0:(len(i)-1)]) for i in df['ArraysDate']]
print(df)
Output:
CustomerNumber ArraysDate sum
0 1 [1, 4, 13] 5.0
1 2 [3] 0.0
2 3 [0] 0.0
3 4 [2, 60, 30, 40] 92.0
Given the data set:
#Create Series
s = pd.Series([[1,2,3,],[1,10,11],[2,11,12]],['buz','bas','bur'])
k = pd.Series(['y','n','o'],['buz','bas','bur'])
#Create DataFrame df from two series
df = pd.DataFrame({'first':s,'second':k})
I was able to create new columns based on all possible values of 'first':
def text_to_list(df, col):
    val = df[col].explode().unique()
    return val

unique = text_to_list(df, 'first')

for options in unique:
    df[options] = 0
Now I need to set the value to 1 in each row and column where that value exists in the original list in 'first'.
I'm pretty sure it's a combination of .isin and/or .apply, but I'm struggling.
The end result should be, for each row:
buz: cols 1,2,3 are 1
bas: cols 1,10,11 are 1
bur: cols 2,11,12 are 1
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
Adding the solution provided by https://stackoverflow.com/users/3558077/ashutosh-porwal:
df1=df.join(pd.get_dummies(df['first'].apply(pd.Series).stack()).sum(level=0))
print(df1)
Note: this solution did not require my hack of creating the columns beforehand by exploding column 'first'.
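A hedged note for newer pandas: sum(level=0) was deprecated and removed in pandas 2.0, so an equivalent sketch groups on the index level instead:
# explode the lists, build indicator columns, then collapse back to one row per index label
df1 = df.join(pd.get_dummies(df['first'].explode()).groupby(level=0).sum())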
From your update it seems that what you need is simply:
for opt in unique:
    df[opt] = df['first'].apply(lambda x: int(opt in x))
Output:
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
Data:
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([[1,2,3,],[1,10,11],[2,11,12]],['buz','bas','bur'])
>>> k = pd.Series(['y','n','o'],['buz','bas','bur'])
>>> df = pd.DataFrame({'first':s,'second':k})
>>> df
first second
buz [1, 2, 3] y
bas [1, 10, 11] n
bur [2, 11, 12] o
Solution:
>>> df[df['first'].explode().to_list()] = 0
>>> df = df[['first', 'second']].join(df.apply(lambda x:x.loc[x['first']], axis=1).replace({0 : 1, np.nan : 0}).astype(int))
>>> df
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
Use pd.merge and pivot_table:
out = df.reset_index().explode('first') \
.pivot_table(values='index', index='second', columns='first',
aggfunc='any', fill_value=False, sort=False).astype(int)
out = df.merge(out, on='second')
Output:
>>> out
first second 1 2 3 10 11 12
0 [1, 2, 3] y 1 1 1 0 0 0
1 [1, 10, 11] n 1 0 0 1 1 0
2 [2, 11, 12] o 0 1 0 0 1 1
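For completeness, a hedged pd.crosstab sketch that builds the same indicator matrix directly from the exploded column:
e = df['first'].explode()
out = df.join(pd.crosstab(e.index, e))  # rows: original index labels; columns: unique list values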
I have two dataframes of errors in 3 axes (x, y, z):
df1 = pd.DataFrame([[0, 1, 2], [-1, 0, 1], [-2, 0, 3]], columns = ['x', 'y', 'z'])
df2 = pd.DataFrame([[1, 1, 3], [1, 0, 2], [1, 0, 3]], columns = ['x', 'y', 'z'])
I'm looking for a fast way to find the Cartesian sum of the square of each row of the two dataframes.
EDIT: My current solution:
import itertools
import numpy as np

cartesian_sum = list(np.sum(list(tup), axis=0).tolist()
                     for tup in itertools.product((df1**2).to_numpy().tolist(),
                                                  (df2**2).to_numpy().tolist()))
cartesian_sum
>>>
[[1, 2, 13],
[1, 1, 8],
[1, 1, 13],
[2, 1, 10],
[2, 0, 5],
[2, 0, 10],
[5, 1, 18],
[5, 0, 13],
[5, 0, 18]]
is too slow (~2.4 ms; for comparison, the solutions based purely on pandas run at ~8-10 ms).
This is similar to the related question (link here), but using itertools is slow. Is there a faster way of doing this in Python?
I think you need a cross join first, then remove column a, square the values, convert the columns to a MultiIndex, and sum per the first level:
df = df1.assign(a=1).merge(df2.assign(a=1), on='a').drop('a', axis=1) ** 2
df.columns = df.columns.str.split('_', expand=True)
df = df.sum(level=0, axis=1)  # pandas < 2.0; in newer versions: df.T.groupby(level=0).sum().T
print (df)
x y z
0 1 2 13
1 1 1 8
2 1 1 13
3 2 1 10
4 2 0 5
5 2 0 10
6 5 1 18
7 5 0 13
8 5 0 18
Details:
print (df1.assign(a=1).merge(df2.assign(a=1), on='a'))
x_x y_x z_x a x_y y_y z_y
0 0 1 2 1 1 1 3
1 0 1 2 1 1 0 2
2 0 1 2 1 1 0 3
3 -1 0 1 1 1 1 3
4 -1 0 1 1 1 0 2
5 -1 0 1 1 1 0 3
6 -2 0 3 1 1 1 3
7 -2 0 3 1 1 0 2
8 -2 0 3 1 1 0 3
One idea to improve performance:
#https://stackoverflow.com/a/53699013/2901002
def cartesian_product_simplified_changed(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    a = np.column_stack([left.values[ia2.ravel()] ** 2, right.values[ib2.ravel()] ** 2])
    # split at the column count; the original used la (the row count),
    # which only works when the frames happen to be square
    n = left.shape[1]
    a = a[:, :n] + a[:, n:]
    return a
a = cartesian_product_simplified_changed(df1, df2)
print (a)
[[ 1 2 13]
[ 1 1 8]
[ 1 1 13]
[ 2 1 10]
[ 2 0 5]
[ 2 0 10]
[ 5 1 18]
[ 5 0 13]
[ 5 0 18]]
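For reference, a hedged pure-NumPy broadcasting sketch (assuming both frames share the same column layout):
# add every squared row of df1 to every squared row of df2 in one shot
sq = df1.to_numpy()[:, None, :] ** 2 + df2.to_numpy()[None, :, :] ** 2
out = pd.DataFrame(sq.reshape(-1, df1.shape[1]), columns=df1.columns)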
I have turned a dataframe that has tuples of length 2 as its index
1 2 -1
(0, 1) 0 1 0
(0, 2) 1 0 0
(0, -1) 0 0 0
(1, 1) 1 0 0
(1, 2) 0 1 0
(1, -1) 1 1 1
into a NumPy 2D array, and managed to split it into a 3D array (grouping on the first index value) with np.array_split:
arr = np.array(np.array_split(arr, 2))
with the result:
[[[0 1 0]
[1 0 0]
[0 0 0]]
[[1 0 0]
[0 1 0]
[1 1 1]]]
I want to make a function to do the split even further: for example, to create a 5D tensor from length-4 indices such as (0, 0, 0, 0).
Any ideas on how to do this recursively?
Use the following code to generate sample data:
import pandas as pd
import numpy as np
import itertools
def create_fake_data_frame(nlevels=2, ncols=3):
    result = pd.DataFrame(
        index=itertools.product(*(nlevels * [[0, 1]])),
        data=np.arange(ncols * 2**nlevels).reshape(2**nlevels, ncols)
    )
    result = convert_index_of_tuples_to_multiindex(result)
    return result

def convert_index_of_tuples_to_multiindex(df):
    return df.set_index(pd.MultiIndex.from_tuples(df.index))
# Increase nlevels to get dataframes with more levels in their MultiIndex
df = create_fake_data_frame(nlevels=3)
print(df)
This is the result:
0 1 2
0 0 0 0 1 2
1 3 4 5
1 0 6 7 8
1 9 10 11
1 0 0 12 13 14
1 15 16 17
1 0 18 19 20
1 21 22 23
Then, modify the dataframe in such a way that each row contains a single
column, whose value is a list of the values in the corresponding row of
the original dataframe:
def data_frame_with_single_column_of_lists(df):
    if len(df.columns) <= 1:
        return df
    result = df.apply(collapse_columns_into_lists, axis=1)
    return result

def collapse_columns_into_lists(s):
    result = s.copy()
    result['lists'] = result.values.tolist()
    result = result[['lists']]
    return result
df = data_frame_with_single_column_of_lists(df)
print(df)
The output will be like this:
lists
0 0 0 [0, 1, 2]
1 [3, 4, 5]
1 0 [6, 7, 8]
1 [9, 10, 11]
1 0 0 [12, 13, 14]
1 [15, 16, 17]
1 0 [18, 19, 20]
1 [21, 22, 23]
Finally, use the following code to get a tensor:
def increase_list_nesting_by_removing_an_index_level(df):
    def list_of_lists(series):
        result = series.to_frame().set_index(series.index.droplevel(-1))
        result = result.apply(lambda x: x['lists'], axis=1).to_frame()
        result = [x[0] for x in result.values.tolist()]
        return result
    grouped = df.groupby(df.index.droplevel(-1))
    result = grouped.agg(list_of_lists)
    if type(result.index[0]) == tuple:
        result = convert_index_of_tuples_to_multiindex(result)
    return result

def tensor_from_data_frame(df):
    if df.index.nlevels <= 1:
        return np.array([i[0] for i in df.values])
    result = increase_list_nesting_by_removing_an_index_level(df)
    result = tensor_from_data_frame(result)
    return result
tensor = tensor_from_data_frame(df)
print(tensor)
The result will be like this:
[[[[ 0 1 2]
[ 3 4 5]]
[[ 6 7 8]
[ 9 10 11]]]
[[[12 13 14]
[15 16 17]]
[[18 19 20]
[21 22 23]]]]
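A hedged shortcut: when the MultiIndex is the complete, sorted product of binary levels (as in the sample data), a plain reshape of the original wide frame yields the same tensor without recursion:
# df here is the wide frame from create_fake_data_frame, before the list-column step
df = create_fake_data_frame(nlevels=3)
tensor = df.to_numpy().reshape([2] * df.index.nlevels + [-1])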
I would like to add a column to a data frame that records whether another column is increasing, decreasing, or staying the same, with:
1 -> increasing, 0 -> same, -1 -> decreasing
So if df['battery'] = [1,2,3,4,7,9,3,3,3]
I would like state to be df['state'] = [1,1,1,1,1,-1,0,0]
This should do the trick!
a = [1, 2, 3, 4, 7, 9, 3, 3, 3]
b = []
for x in range(len(a) - 1):
    b.append((a[x+1] > a[x]) - (a[x+1] < a[x]))
print(b)
You could use pd.Series.diff method to get the difference between consecutive values, and then assign the necessary state values by using boolean indexing:
import pandas as pd
df = pd.DataFrame()
df['battery'] = [1,2,3,4,7,9,3,3,3]
diff = df['battery'].diff()
df.loc[diff > 0, 'state'] = 1
df.loc[diff == 0, 'state'] = 0
df.loc[diff < 0, 'state'] = -1
print(df)
# battery state
# 0 1 NaN
# 1 2 1.0
# 2 3 1.0
# 3 4 1.0
# 4 7 1.0
# 5 9 1.0
# 6 3 -1.0
# 7 3 0.0
# 8 3 0.0
Or, alternatively, one could use np.select:
import numpy as np
diff = df['battery'].diff()
df['state'] = np.select([diff < 0, diff > 0], [-1, 1], 0)
# Be careful, default 0 will replace the first NaN as well.
print(df)
# battery state
# 0 1 0
# 1 2 1
# 2 3 1
# 3 4 1
# 4 7 1
# 5 9 1
# 6 3 -1
# 7 3 0
# 8 3 0
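If the first row should keep its NaN instead of defaulting to 0, a hedged variant adds an explicit condition for the zero case, so NaN falls through to the default:
df['state'] = np.select([diff < 0, diff > 0, diff == 0], [-1, 1, 0], np.nan)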
So here's your dataframe:
>>> import pandas as pd
>>> data = [[[1,2,3,4,7,9,3,3,3]]]
>>> df = pd.DataFrame(data, columns = ['battery'])
>>> df
battery
0 [1, 2, 3, 4, 7, 9, 3, 3, 3]
And finally use apply and a lambda function in order to generate the required result:
>>> df['state'] = df.apply(lambda row: [1 if t - s > 0 else -1 if t-s < 0 else 0 for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 1, 1, -1, 0, 0]
Alternatively, if you want the exact difference between each element in the list, you can use the following:
>>> df['state'] = df.apply(lambda row: [t - s for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 3, 2, -6, 0, 0]
Try np.sign (pd.np was removed in recent pandas versions, so import NumPy directly):
import numpy as np
np.sign(df.battery.diff().fillna(1))
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 -1.0
7 0.0
8 0.0
Name: battery, dtype: float64
I'd like to create a new dataframe using the same values as another dataframe, unless there is a 0 value. Where there is a 0 value, I'd like to use the average of the entries before and after it.
For Example:
df = A B C
5 2 1
3 4 5
2 1 0
6 8 7
I'd like the result to look like the df below:
df_new = A B C
5 2 1
3 4 5
2 1 6
6 8 7
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [5, 3, 2, 6], 'B': [2, 4, 1, 8], 'C': [1, 5, 0, 7]})
Nrows = len(df)

def run(col):
    originalValues = list(df[col])
    # positions where the column is zero
    values = list(np.where(np.array(originalValues) == 0)[0])
    # keep interior positions only: the original bound (x < Nrows) would
    # raise an IndexError for a zero in the last row
    indices2replace = filter(lambda x: 0 < x < Nrows - 1, values)
    for index in indices2replace:
        originalValues[index] = 0.5 * (originalValues[index+1] + originalValues[index-1])
    return originalValues

newDF = pd.DataFrame(map(run, df.columns)).transpose()
newDF.columns = df.columns
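A hedged vectorized alternative using mask and shift (it assumes zeros sit in interior rows and are not vertically adjacent, as in the example):
# wherever an entry is 0, take the mean of the entries directly above and below
df_new = df.mask(df == 0, (df.shift() + df.shift(-1)) / 2)
print(df_new)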