How to write multiple rows for a single user id into a dataframe - python

How should I write multiple rows to a single user id?
Example:
id = ['userid1','userid2'....'useridn']
ndarray1 = [1,2,3,4,5...]
ndarray2 = [1,2,3,4,5...]
.
.
ndarrayn = [1,2,3,4,5...]
Expected output (a DataFrame):
id       value
userid1  1
userid1  2
userid1  3
.        .
.        .
userid2  1
Can anybody suggest how I should do it?

import pandas as pd

id = ['userid1', 'userid2', 'userid3', 'userid4']
ndarray1 = [1, 2, 3, 4]
ndarray2 = [1, 2, 3, 4]
ndarray3 = [1, 2, 3, 4]
ndarray4 = [1, 2, 3, 4]
n = 4

ID = []
value = []
for i in id:
    a = str(id.index(i) + 1)  # suffix of the matching ndarray variable
    for j in range(0, n):
        ID.append(i)
        value.append(eval('ndarray' + a)[j])  # look up ndarray1..ndarray4 by name
df = pd.DataFrame({'ID': ID, 'Value': value})
Output:
ID Value
0 userid1 1
1 userid1 2
2 userid1 3
3 userid1 4
4 userid2 1
5 userid2 2
6 userid2 3
7 userid2 4
8 userid3 1
9 userid3 2
10 userid3 3
11 userid3 4
12 userid4 1
13 userid4 2
14 userid4 3
15 userid4 4

A different approach:
import pandas as pd

id = ['userid1', 'userid2', 'userid3']
ndarray1 = [1, 2, 3, 4]
ndarray2 = [1, 2, 3, 4]
ndarray3 = [1, 2, 3, 4]
concat = [ndarray1, ndarray2, ndarray3]

n = []
user = []
for i in range(len(concat)):
    for j in range(len(concat[i])):
        user.append(id[i])
        n.append(concat[i][j])
df = pd.DataFrame(data=[user, n]).T  # user ids first, so they land in the 'id' column
df.columns = ['id', 'value']
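Both approaches build the long format row by row. For larger inputs, a vectorized sketch of the same reshaping may help, assuming the per-user arrays are collected into a list in the same order as id (the names ids and arrays here are illustrative, not from the original):
import numpy as np
import pandas as pd

ids = ['userid1', 'userid2', 'userid3']
arrays = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]

df = pd.DataFrame({
    'id': np.repeat(ids, [len(a) for a in arrays]),  # each id repeated once per value
    'value': np.concatenate(arrays),
})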

Related

Compare two pandas DataFrames in the most efficient way

Let's consider two pandas dataframes:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
We want to do the following:
If df[1] > check_df[1] or df[2] > check_df[1] or df[3] > check_df[1], then we assign 1 to df, and -1 otherwise.
If df[2] > check_df[2] or df[3] > check_df[2] or df[4] > check_df[2], then we assign 1 to df, and -1 otherwise.
We apply the same algorithm to the end of the DataFrame.
My primitive code is the following:
df_copy = df.copy()
for i in range(len(df) - 3):
    moving_df = df.iloc[i:i+3]
    if (moving_df > check_df.iloc[i]).any()[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
df_copy
0
0 -1
1 1
2 -1
3 1
4 1
5 -1
6 3
7 6
8 7
Could you please advise whether there is any way to do this without a loop?
IIUC, this is easily done with a rolling max:
N = 3
df['out'] = np.where(df[0].rolling(N, min_periods=1).max().shift(1-N).gt(check_df[0]),
                     1, -1)
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 -1
8 7 -1
To keep the last items as-is:
m = df[0].rolling(N).max().shift(1-N)
df['out'] = np.where(m.gt(check_df[0]), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 6
8 7 7
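For intuition, here is an explicit (slow) loop that should be equivalent to the variant above, assuming N = 3: for each row i, compare the forward window df[0][i:i+N] against check_df[0][i], and keep rows whose window runs past the end unchanged:
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
N = 3

out = []
for i in range(len(df)):
    window = df[0].iloc[i:i + N]           # forward-looking window
    if len(window) < N:                    # incomplete window at the tail
        out.append(df[0].iloc[i])          # keep the last items as-is
    else:
        out.append(1 if (window > check_df[0].iloc[i]).any() else -1)
df['out_loop'] = out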
Although @mozway has already provided a very smart solution, I would like to share my approach as well, which was inspired by this post.
You could create your own object that compares a series with a rolling series. The comparison can be performed with the typical operators, i.e. >, < or ==. If at least one comparison holds, the object returns a pre-defined value (given in the list returns_tf, whose first element is returned if the comparison is true, and whose second element is returned if it is false).
Possible Code:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
class RollingComparison:
    def __init__(self, comparing_series: pd.Series, rolling_series: pd.Series, window: int):
        self.comparing_series = comparing_series.values[:-1*window]
        self.rolling_series = rolling_series.values
        self.window = window

    def rolling_window_mask(self, option: str = "smaller"):
        shape = self.rolling_series.shape[:-1] + (self.rolling_series.shape[-1] - self.window + 1, self.window)
        strides = self.rolling_series.strides + (self.rolling_series.strides[-1],)
        rolling_window = np.lib.stride_tricks.as_strided(self.rolling_series, shape=shape, strides=strides)[:-1]
        rolling_window_mask = (
            self.comparing_series.reshape(-1, 1) < rolling_window if option == "smaller" else (
                self.comparing_series.reshape(-1, 1) > rolling_window if option == "greater"
                else self.comparing_series.reshape(-1, 1) == rolling_window
            )
        )
        return rolling_window_mask.any(axis=1)

    def assign(self, option: str = "rolling", returns_tf: list = [1, -1]):
        mask = self.rolling_window_mask(option)
        return np.concatenate((np.where(mask, returns_tf[0], returns_tf[1]), self.rolling_series[-1*self.window:]))
The assignments can be achieved as follows:
roller = RollingComparison(check_df[0], df[0], 3)
check_df["rolling_smaller_checking"] = roller.assign(option="smaller")
check_df["rolling_greater_checking"] = roller.assign(option="greater")
check_df["rolling_equals_checking"] = roller.assign(option="equal")
Output (the column rolling_smaller_checking equals your desired output):
0 rolling_smaller_checking rolling_greater_checking rolling_equals_checking
0 3 -1 1 1
1 2 1 -1 1
2 5 -1 1 1
3 4 1 1 1
4 3 1 -1 1
5 6 -1 1 1
6 4 3 3 3
7 2 6 6 6
8 1 7 7 7

pandas: translate a column of lists into new binary (yes/no) columns, one per value occurring in the lists

Given the data set:
#Create Series
s = pd.Series([[1,2,3,],[1,10,11],[2,11,12]],['buz','bas','bur'])
k = pd.Series(['y','n','o'],['buz','bas','bur'])
#Create DataFrame df from two series
df = pd.DataFrame({'first':s,'second':k})
I was able to create new columns based on all possible values of 'first':
def text_to_list(df, col):
    val = df[col].explode().unique()
    return val

unique = text_to_list(df, 'first')
for options in unique:
    df[options] = 0
Now I need to check off (i.e. set the value to 1) each row and column where that value exists in the original list in 'first'.
I'm pretty sure it's a combination of .isin and/or .apply, but I'm struggling.
The end result should be, for each row:
buz: cols 1,2,3 are 1
bas: cols 1,10,11 are 1
bur: cols 2,11,12 are 1
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
Adding the solution provided by https://stackoverflow.com/users/3558077/ashutosh-porwal:
df1 = df.join(pd.get_dummies(df['first'].apply(pd.Series).stack()).sum(level=0))
print(df1)
Note: this solution did not require my hack of creating the columns beforehand by exploding column 'first'.
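One caveat: DataFrame.sum(level=0) is deprecated in recent pandas (and removed in 2.0). A sketch of an equivalent form, using explode() instead of apply(pd.Series).stack() and groupby(level=0) for the aggregation:
import pandas as pd

s = pd.Series([[1, 2, 3], [1, 10, 11], [2, 11, 12]], ['buz', 'bas', 'bur'])
k = pd.Series(['y', 'n', 'o'], ['buz', 'bas', 'bur'])
df = pd.DataFrame({'first': s, 'second': k})

# One dummy column per distinct value; rows re-aggregated by the original index.
df1 = df.join(pd.get_dummies(df['first'].explode()).groupby(level=0).sum())
print(df1)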
From your update it seems that what you need is simply:
for opt in unique:
    df[opt] = df['first'].apply(lambda x: int(opt in x))
Output:
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
Data:
>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([[1,2,3,],[1,10,11],[2,11,12]],['buz','bas','bur'])
>>> k = pd.Series(['y','n','o'],['buz','bas','bur'])
>>> df = pd.DataFrame({'first':s,'second':k})
>>> df
first second
buz [1, 2, 3] y
bas [1, 10, 11] n
bur [2, 11, 12] o
Solution:
>>> df[df['first'].explode().to_list()] = 0
>>> df = df[['first', 'second']].join(df.apply(lambda x:x.loc[x['first']], axis=1).replace({0 : 1, np.nan : 0}).astype(int))
>>> df
first second 1 2 3 10 11 12
buz [1, 2, 3] y 1 1 1 0 0 0
bas [1, 10, 11] n 1 0 0 1 1 0
bur [2, 11, 12] o 0 1 0 0 1 1
Use pd.merge and pivot_table:
out = df.reset_index().explode('first') \
        .pivot_table(values='index', index='second', columns='first',
                     aggfunc='any', fill_value=False, sort=False).astype(int)
out = df.merge(out, on='second')
Output:
>>> out
first second 1 2 3 10 11 12
0 [1, 2, 3] y 1 1 1 0 0 0
1 [1, 10, 11] n 1 0 0 1 1 0
2 [2, 11, 12] o 0 1 0 0 1 1

Fast Cartesian sum of rows of dataframe

I have two dataframes of errors in 3 axes (x, y, z):
df1 = pd.DataFrame([[0, 1, 2], [-1, 0, 1], [-2, 0, 3]], columns = ['x', 'y', 'z'])
df2 = pd.DataFrame([[1, 1, 3], [1, 0, 2], [1, 0, 3]], columns = ['x', 'y', 'z'])
I'm looking for a fast way to find the Cartesian sum of the square of each row of the two dataframes.
EDIT: My current solution:
cartesian_sum = list(np.sum(list(tup), axis=0).tolist()
                     for tup in itertools.product((df1**2).to_numpy().tolist(),
                                                  (df2**2).to_numpy().tolist()))
cartesian_sum
>>>
[[1, 2, 13],
[1, 1, 8],
[1, 1, 13],
[2, 1, 10],
[2, 0, 5],
[2, 0, 10],
[5, 1, 18],
[5, 0, 13],
[5, 0, 18]]
is too slow (~2.4 ms, compared to the solutions based purely on Pandas running ~8-10 ms).
This is similar to the related question (link here), but using itertools is so slow. Is there a faster way of doing this in Python?
I think you need a cross join first, then drop column a, square, convert the columns to a MultiIndex, and sum per the first level:
df = df1.assign(a=1).merge(df2.assign(a=1), on='a').drop('a', axis=1) ** 2
df.columns = df.columns.str.split('_', expand=True)
df = df.sum(level=0, axis=1)
print (df)
x y z
0 1 2 13
1 1 1 8
2 1 1 13
3 2 1 10
4 2 0 5
5 2 0 10
6 5 1 18
7 5 0 13
8 5 0 18
Details:
print (df1.assign(a=1).merge(df2.assign(a=1), on='a'))
x_x y_x z_x a x_y y_y z_y
0 0 1 2 1 1 1 3
1 0 1 2 1 1 0 2
2 0 1 2 1 1 0 3
3 -1 0 1 1 1 1 3
4 -1 0 1 1 1 0 2
5 -1 0 1 1 1 0 3
6 -2 0 3 1 1 1 3
7 -2 0 3 1 1 0 2
8 -2 0 3 1 1 0 3
One idea to improve performance:
# https://stackoverflow.com/a/53699013/2901002
def cartesian_product_simplified_changed(left, right):
    la, lb = len(left), len(right)
    ncols = left.shape[1]  # number of coordinate columns (x, y, z)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    a = np.column_stack([left.values[ia2.ravel()] ** 2, right.values[ib2.ravel()] ** 2])
    a = a[:, :ncols] + a[:, ncols:]  # was a[:, :la] + a[:, la:], which only works when la == ncols
    return a
a = cartesian_product_simplified_changed(df1, df2)
print (a)
[[ 1 2 13]
[ 1 1 8]
[ 1 1 13]
[ 2 1 10]
[ 2 0 5]
[ 2 0 10]
[ 5 1 18]
[ 5 0 13]
[ 5 0 18]]
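A broadcasting sketch of the same Cartesian sum of squares, assuming both frames share the same columns: square each frame, then add every row of the first to every row of the second through a new axis (pure NumPy, so it avoids both itertools and the merge):
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[0, 1, 2], [-1, 0, 1], [-2, 0, 3]], columns=['x', 'y', 'z'])
df2 = pd.DataFrame([[1, 1, 3], [1, 0, 2], [1, 0, 3]], columns=['x', 'y', 'z'])

# (n1, 1, k) + (1, n2, k) broadcasts to (n1, n2, k); reshape flattens the pairs.
a = (df1.to_numpy() ** 2)[:, None, :] + (df2.to_numpy() ** 2)[None, :, :]
out = pd.DataFrame(a.reshape(-1, df1.shape[1]), columns=df1.columns)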

AttributeError: 'DataFrame' object has no attribute 'set_value'

I'm using Flask and getting an error at set_value. I'm reading the input from HTML and passing it to the code.
@app.route('/home', methods=['POST'])
def first():
    source = request.files['first']
    destination = request.files['second']
    df = pd.read_csv(source)
    df1 = pd.read_csv(destination)
    val1 = int(request.form['val1'])
    val2 = int(request.form['val2'])
    val3 = int(request.form['val3'])
    target = request.form['str']
    df2 = df[df.columns[val2]]
    count = 0
    for j in df[df.columns[val1]]:
        x = df1.loc[df1[df1.columns[val3]] == j].index.values
        for i in x:
            df1.set_value(i, target, df2[count])
        count = count + 1
    df1.to_csv('result.csv', index=False)
Check your pandas version.
df.set_value() has been deprecated since pandas 0.21.0 (and was removed in 1.0).
Use df.at instead:
import pandas as pd

df = pd.DataFrame({"A": [1, 5, 3, 4, 2],
                   "B": [3, 2, 4, 3, 4],
                   "C": [2, 2, 7, 3, 4],
                   "D": [4, 3, 6, 12, 7]})
df.at[2, 'B'] = 100
A B C D
0 1 3 2 4
1 5 2 2 3
2 3 100 7 6
3 4 3 3 12
4 2 4 4 7
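Applied to the question's loop, the only change needed is the line that writes the scalar. A minimal self-contained sketch with hypothetical stand-ins for the uploaded CSVs (df and df1 below are not the real data):
import pandas as pd

# Hypothetical stand-ins for the question's df and df1:
df = pd.DataFrame({'key': ['a', 'b'], 'val': [10, 20]})
df1 = pd.DataFrame({'key': ['b', 'a', 'b'], 'target': [0, 0, 0]})

count = 0
for j in df['key']:
    x = df1.loc[df1['key'] == j].index.values
    for i in x:
        df1.at[i, 'target'] = df['val'][count]  # replaces df1.set_value(i, 'target', ...)
    count = count + 1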

Calculate sums of arrays in a DataFrame column - Python

I have a problem with an array column in a DataFrame.
For example, I have this data:
CustomerNumber  ArraysDate
1               [ 1 4 13 ]
2               [ 3 ]
3               [ 0 ]
4               [ 2 60 30 40 ]
I want to calculate the sum of the elements in ArraysDate.
I created a function:
def Caculator(n, x, value):
    v = 0
    for i in n - x:
        v = sum(value)
    return v
and call it with:
s['Sum'] = Caculator(s['n'], 1, s['ArraysDate'])
where n is the element count of the ArraysDate column.
I want to calculate
Sum = t1 + t2 + ... + t_(n-x)
Expected result:
CustomerNumber ArraysDate Sum
1 [ 1 4 13 ] 5
2 [ 3 ] 0
3 [ 0 ] 0
4 [ 2 60 30 40] 92
IIUC you can use:
df['Sum'] = df.ArraysDate.apply(lambda x: sum(x[:len(x)-1]))
# or: df.ArraysDate.str[:-1].apply(sum)
print(df)
CustomerNumber ArraysDate Sum
0 1 [1, 4, 13] 5
1 2 [3] 0
2 3 [0] 0
3 4 [2, 60, 30, 40] 92
DF: df = pd.DataFrame({'CustomerNumber': [1, 2, 3, 4], 'ArraysDate': [[1,4,13],[3],[0],[2,60,30,40]]})
Maybe something like:
def Caculator(x, arrayDates):
    vList = []
    for i in range(arrayDates.count()):
        v = 0
        for num in range(0, len(arrayDates[i]) - x):
            v = v + arrayDates[i][num]
        vList.append(v)
    return vList
for the DataFrame s:
data = [[1, [1, 4, 13]], [2, [3]], [3, [0]], [4, [2, 60, 30, 40]]]
s = pd.DataFrame(data, columns = ['CustomerNumber', 'ArraysDate'])
and call the function like this:
s['Sum'] = Caculator(1,s['ArraysDate'])
You can compute the sum over the ArraysDate column of a pandas DataFrame like this:
import pandas as pd
import numpy as np

d = {'CustomerNumber': pd.Series([1, 2, 3, 4]),
     'ArraysDate': pd.Series([[1, 4, 13], [3], [0], [2, 60, 30, 40]])}
df = pd.DataFrame(d)
df['sum'] = [np.sum(i[0:(len(i)-1)]) for i in df['ArraysDate']]
print(df)
Output:
CustomerNumber ArraysDate sum
0 1 [1, 4, 13] 5.0
1 2 [3] 0.0
2 3 [0] 0.0
3 4 [2, 60, 30, 40] 92.0
