I have a dataframe df (see the program below) whose column names and count are not fixed.
However, there is a list ls containing the columns of df that need to be appended together.
I tried
df['combined'] = df[ls].apply(lambda x: '{}{}{}'.format(x[0], x[1], x[2]), axis=1)
but here I am assuming that the list ls has 3 elements, which is hard-coding and incorrect. What if the list has 10 elements? I want to dynamically read the list and append the columns of the dataframe.
import pandas as pd
def main():
    df = pd.DataFrame({
        'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7],
        'col_3': [14, 15, 16, 19],
        'col_4': [22, 23, 24, 25],
        'col_5': [30, 31, 32, 33],
    })
    ls = ['col_1', 'col_4', 'col_3']
    df['combined'] = df[ls].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1)
    print(df)

if __name__ == '__main__':
    main()
You can use ''.join after converting the columns' data type to str:
df[ls].astype(str).apply(''.join, axis=1)
#0 02214
#1 12315
#2 22416
#3 32519
#dtype: object
You can use a cumulative sum over the strings for more speed, i.e.
df['combined'] = df[ls].astype(str).cumsum(1).iloc[:, -1]
Output:
0 02214
1 12315
2 22416
3 32519
Name: combined, dtype: object
If you need to add a space then first add ' ' and then take the sum, i.e.
n = (df[ls].astype(str) + ' ').sum(1)
0 0 22 14
1 1 23 15
2 2 24 16
3 3 25 19
dtype: object
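Note that this leaves a trailing space on each row. A small alternative sketch, using agg with a separator join, avoids that:
df['combined'] = df[ls].astype(str).agg(' '.join, axis=1)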
Timings:
ndf = pd.concat([df]*10000)
%%timeit
ndf[ls].astype(str).cumsum(1).iloc[:,-1].values
1 loop, best of 3: 538 ms per loop
%%timeit
ndf[ls].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 1.93 s per loop
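Another vectorized option (not included in the timings above) is to reduce elementwise string addition over the columns in ls, which avoids the row-wise apply entirely; a minimal sketch:
from functools import reduce

# Series + Series on str-cast columns concatenates per row,
# so reducing over the columns in ls builds the combined string
df['combined'] = reduce(lambda left, right: left + right,
                        (df[c].astype(str) for c in ls))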
Related
I have a pandas dataframe like below
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
I am trying to attain the value of this expression (a shoelace-style sum over consecutive rows):
0.5 * sum_i (x_i * y_{i+1} - x_{i+1} * y_i)
I haven't got an idea how to multiply the first value in a column with the 2nd value in another column, as in the expression.
Try pd.DataFrame.shift(), but I think you need to pass -1 to shift, judging by the summation notation you posted. i + 1 implies using the next x or y, so shift needs a negative integer to shift one row ahead. Positive integers in shift go backwards.
Can you confirm -202.5 is the right answer?
0.5 * ((df.x * df.y.shift(-1)) - (df.x.shift(-1) * df.y)).sum()
>>> -202.5
I think the code below computes the correct value in expresion_end:
import pandas as pd
data = [[5, 10], [4, 20], [15, 30], [20, 15], [12, 14], [5, 5]]
df = pd.DataFrame(data, columns=['x', 'y'])
df["x+1"]=df["x"].shift(periods=-1)
df["y+1"]=df["y"].shift(periods=-1)
df["exp"]=df["x"]*df["y+1"]-df["x+1"]*df["y"]
expresion_end=0.5*df["exp"].sum()
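For the sample data above this evaluates to:
print(expresion_end)
# -202.5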
You can use pandas.DataFrame.shift(). You can compute shift(-1) once and use it for both 'x' and 'y'.
>>> df_tmp = df.shift(-1)
>>> (df['x']*df_tmp['y'] - df_tmp['x']*df['y']).sum() * 0.5
-202.5
# Explanation
>>> df[['x+1', 'y+1']] = df.shift(-1)
>>> df
x y x+1 y+1
0 5 10 4.0 20.0 # x*(y+1) - y*(x+1) = 5*20 - 10*4
1 4 20 15.0 30.0
2 15 30 20.0 15.0
3 20 15 12.0 14.0
4 12 14 5.0 5.0
5 5 5 NaN NaN
I have a basic dataframe which is the result of a groupby on unclean data:
df:
Name1 Value1 Value2
A 10 30
B 40 50
I have created a list as follows:
Segment_list = df['Name1'].unique()
Segment_list
array(['A', 'B'], dtype=object)
Now I want to traverse the list and find the amount in Value1 for each iteration, so I am using:
for Segment_list in enumerate(Segment_list):
    print(df['Value1'])
But I am getting both values instead of one at a time. I just need one value per iteration. Is this possible?
Expected output:
10
40
I recommend using pandas.DataFrame.groupby to get the values for each group.
For the most part, using a for-loop with pandas is an indication that it's probably not being done correctly or efficiently.
Additional resources:
Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects
Stack Overflow Pandas Tag Info Page
Option 1:
import pandas as pd
import numpy as np
import random
np.random.seed(365)
random.seed(365)
rows = 25
data = {'n': [random.choice(['A', 'B', 'C']) for _ in range(rows)],
        'v1': np.random.randint(40, size=rows),
        'v2': np.random.randint(40, size=rows)}
df = pd.DataFrame(data)
# groupby n
for g, d in df.groupby('n'):
    # print(g)  # use or not, as needed
    print(d.v1.values[0])  # selects the first value of each group and prints it
[out]: # first value of each group
5
33
18
Option 2:
dfg = df.groupby(['n'], as_index=False).agg({'v1': list})
# display(dfg)
n v1
0 A [5, 26, 39, 39, 10, 12, 13, 11, 28]
1 B [33, 34, 28, 31, 27, 24, 36, 6]
2 C [18, 27, 9, 36, 35, 30, 3, 0]
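To then get the values one at a time, you could iterate the aggregated lists; a small sketch:
for lst in dfg.v1:
    for v in lst:
        print(v)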
Option 3:
As stated in the comments, your data is already the result of groupby, and it will only ever have one value in the column for each group.
dfg = df.groupby('n', as_index=False).sum()
# display(dfg)
n v1 v2
0 A 183 163
1 B 219 188
2 C 158 189
# print the value for each group in v1
for v in dfg.v1.to_list():
    print(v)
[out]:
183
219
158
Option 4:
Print all rows for each column
dfg = df.groupby('n', as_index=False).sum()
for col in dfg.columns[1:]:  # selects all columns after n
    for v in dfg[col].to_list():
        print(v)
[out]:
183
219
158
163
188
189
I agree with @Trenton's comment that the whole point of using dataframes is to avoid looping through them like this. Rethink this using a function. However, the closest way to make what you've written work is something like this:
Segment_list = df['Name1'].unique()
for Index in Segment_list:
    print(df['Value1'][df['Name1'] == Index].iloc[0])
Depending on what you want to happen if there are two entries for a Name (presumably this can happen, since you use .unique()), this will print the sum of the Values:
df.groupby('Name1').sum()['Value1']
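With the example df, this prints:
Name1
A    10
B    40
Name: Value1, dtype: int64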
Given these two dataframes:
df1 =
Name Start End
0 A 10 20
1 B 20 30
2 C 30 40
df2 =
0 1
0 5 10
1 15 20
2 25 30
df2 has no column names, but you can assume column 0 is an offset of df1.Start and column 1 is an offset of df1.End. I would like to transpose df2 onto df1 to get the Start and End differences. The final df1 dataframe should look like this:
Name Start End Start_Diff_0 End_Diff_0 Start_Diff_1 End_Diff_1 Start_Diff_2 End_Diff_2
0 A 10 20 5 10 -5 0 -15 -10
1 B 20 30 15 20 5 10 -5 0
2 C 30 40 25 30 15 20 5 10
I have a solution that works, but I'm not satisfied with it because it takes too long to run when processing a dataframe that has millions of rows. Below is a sample test case to simulate processing 30,000 rows. As you can imagine, running the original solution (method_1) on a 1GB dataframe is going to be a problem. Is there a faster way to do this using Pandas, Numpy, or maybe another package?
UPDATE: I've added the provided solutions to the benchmarks.
# Import required modules
import numpy as np
import pandas as pd
import timeit
# Original
def method_1():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Store data for new columns in a dictionary
    new_columns = {}
    for index1, row1 in df1.iterrows():
        for index2, row2 in df2.iterrows():
            key_start = 'Start_Diff_' + str(index2)
            key_end = 'End_Diff_' + str(index2)
            if key_start in new_columns:
                new_columns[key_start].append(row1[1] - row2[0])
            else:
                new_columns[key_start] = [row1[1] - row2[0]]
            if key_end in new_columns:
                new_columns[key_end].append(row1[2] - row2[1])
            else:
                new_columns[key_end] = [row1[2] - row2[1]]
    # Add dictionary data as new columns
    for key, value in new_columns.items():
        df1[key] = value
# jezrael - https://stackoverflow.com/a/60843750/452587
def method_2():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Convert selected columns to 2d numpy array
    a = df1[['Start', 'End']].to_numpy()
    b = df2[[0, 1]].to_numpy()
    # Output is 3d array; convert it to 2d array
    c = (a - b[:, None]).swapaxes(0, 1).reshape(a.shape[0], -1)
    # Generate column names and add to the original with DataFrame.join
    cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
    df1 = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))
# sammywemmy - https://stackoverflow.com/a/60844078/452587
def method_3():
    df1 = pd.DataFrame([['A', 10, 20], ['B', 20, 30], ['C', 30, 40]] * 10000, columns=['Name', 'Start', 'End'])
    df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
    # Create numpy arrays of df1 and df2
    df1_start = df1.loc[:, 'Start'].to_numpy()
    df1_end = df1.loc[:, 'End'].to_numpy()
    df2_start = df2[0].to_numpy()
    df2_end = df2[1].to_numpy()
    # Use np.tile to create shapes that allow elementwise subtraction
    tiled_start = np.tile(df1_start, (len(df2), 1)).T
    tiled_end = np.tile(df1_end, (len(df2), 1)).T
    # Subtract df2 from df1
    start = np.subtract(tiled_start, df2_start)
    end = np.subtract(tiled_end, df2_end)
    # Create columns for start and end
    start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
    end_columns = [f'End_Diff_{num}' for num in range(len(df2))]
    # Create dataframes of start and end
    start_df = pd.DataFrame(start, columns=start_columns)
    end_df = pd.DataFrame(end, columns=end_columns)
    # Lump start and end into one dataframe
    lump = pd.concat([start_df, end_df], axis=1)
    # Sort the columns by the digits at the end
    filtered = lump.columns[lump.columns.str.contains(r'\d')]
    cols = sorted(filtered, key=lambda x: x[-1])
    lump = lump.reindex(cols, axis='columns')
    # Hook lump back to df1
    df1 = pd.concat([df1, lump], axis=1)
print('Method 1:', timeit.timeit(method_1, number=3))
print('Method 2:', timeit.timeit(method_2, number=3))
print('Method 3:', timeit.timeit(method_3, number=3))
Output:
Method 1: 50.506279182
Method 2: 0.08886280600000163
Method 3: 0.10297686199999845
I suggest using numpy here - convert the selected columns to a 2d numpy array in the first step:
a = df1[['Start','End']].to_numpy()
b = df2[[0,1]].to_numpy()
Output is 3d array, convert it to 2d array:
c = (a - b[:, None]).swapaxes(0,1).reshape(a.shape[0],-1)
print (c)
[[ 5 10 -5 0 -15 -10]
[ 15 20 5 10 -5 0]
[ 25 30 15 20 5 10]]
Last, generate the column names and add them to the original with DataFrame.join:
cols = [item for x in range(b.shape[0]) for item in (f'Start_Diff_{x}', f'End_Diff_{x}')]
df = df1.join(pd.DataFrame(c, columns=cols, index=df1.index))
print (df)
Name Start End Start_Diff_0 End_Diff_0 Start_Diff_1 End_Diff_1 \
0 A 10 20 5 10 -5 0
1 B 20 30 15 20 5 10
2 C 30 40 25 30 15 20
Start_Diff_2 End_Diff_2
0 -15 -10
1 -5 0
2 5 10
Don't use iterrows(). If you're simply subtracting values, use vectorization with Numpy (Pandas also offers vectorization, but Numpy is faster).
For instance:
df2 = pd.DataFrame([[5, 10], [15, 20], [25, 30]], columns=None)
col_names = "Start_Diff_1 End_Diff_1".split()
df3 = pd.DataFrame(df2.to_numpy() - 10, columns=col_names)
Here df3 equals:
Start_Diff_1 End_Diff_1
0 -5 0
1 5 10
2 15 20
You can also change column names by doing:
df2.columns = "Start_Diff_0 End_Diff_0".split()
You can use f-strings to build column names in a loop, i.e. f"Start_Diff_{i}", where i is the loop counter; for example:
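# sketch: build the paired names with f-strings
cols = []
for i in range(len(df2)):
    cols += [f'Start_Diff_{i}', f'End_Diff_{i}']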
You can also combine multiple dataframes with:
df = pd.concat([df1, df2],axis=1)
This is one way to go about it:
#create numpy arrays of df1 and 2
df1_start = df1.loc[:,'Start'].to_numpy()
df1_end = df1.loc[:,'End'].to_numpy()
df2_start = df2[0].to_numpy()
df2_end = df2[1].to_numpy()
#use np tile to create shapes
#that allow element wise subtraction
tiled_start = np.tile(df1_start,(len(df2),1)).T
tiled_end = np.tile(df1_end,(len(df2),1)).T
#subtract df2 from df1
start = np.subtract(tiled_start,df2_start)
end = np.subtract(tiled_end, df2_end)
#create columns for start and end
start_columns = [f'Start_Diff_{num}' for num in range(len(df2))]
end_columns = [f'End_Diff_{num}' for num in range(len(df2))]
#create dataframes of start and end
start_df = pd.DataFrame(start,columns=start_columns)
end_df = pd.DataFrame(end, columns = end_columns)
#lump start and end into one dataframe
lump = pd.concat([start_df,end_df],axis=1)
#sort the columns by the digits at the end
filtered = lump.columns[lump.columns.str.contains(r'\d')]
cols = sorted(filtered, key=lambda x: x[-1])
lump = lump.reindex(cols,axis='columns')
#hook lump back to df1
final = pd.concat([df1,lump],axis=1)
Say I have a dataframe
A B C D
2019-01-01 1 10 100 12
2019-01-02 2 20 200 23
2019-01-03 3 30 300 34
And an array to group the columns by
array([0, 1, 0, 2])
I wish to group the dataframe by the array (on the column axis), apply a function to each group, and get back a Series with one entry per column, containing the result of the applied function for that column's group.
So, for the above (with the applied function taking the group's sum), would want to output:
A 606
B 60
C 606
D 69
dtype: int64
My best attempt:
func = lambda a: np.full(a.shape[1], np.sum(a.values))
df.groupby(groups, axis=1).apply(func)
0 [606, 606]
1 [60]
2 [69]
dtype: object
(in this example the applied function returns equal values inside a group, but this can't be guaranteed for the real case)
I can not see how to do this with pandas grouping syntax, unless I am missing something. Could anyone lend a hand, thanks!
Try this:
import numpy as np
import pandas as pd
groups = [0, 1, 0, 2]
df = pd.DataFrame({'A': [1, 2, 3],
'B': [10, 20, 30],
'C': [100, 200, 300],
'D': [12, 23, 34]})
temp = df.apply(sum).to_frame()
temp.index = pd.MultiIndex.from_arrays(
    np.stack([temp.index, groups]),
    names=("df columns", "groups")
)
temp_filter = temp.groupby(level=1).agg(sum)
result = (temp.join(temp_filter, rsuffix='0')
          .set_index(temp.index.get_level_values(0))["00"])
# df columns
# A 606
# B 60
# C 606
# D 69
# Name: 00, dtype: int64
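For the sum case specifically, a much shorter sketch (assuming groups aligns with df.columns) is to sum each column first and then broadcast the group totals back with transform:
result = df.sum().groupby(groups).transform('sum')
# A    606
# B     60
# C    606
# D     69
# dtype: int64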
I have this DataFrame. The column ArraysDate contains arrays with many elements. I want to be able to index them and loop over each array, like a for loop in Java. I have not found a solution; please suggest some ideas.
For example, with CustomerNumber = 4, ArraysDate has 4 elements, understood as i1, i2, i3, i4, to use in calculations on ArraysDate.
Thank you
CustomerNumber ArraysDate
1 [ 1 13 ]
2 [ 3 ]
3 [ 0 ]
4 [ 2 60 30 40]
If I understand correctly, you want to get an array of data from 'ArraysDate' based on column 'CustomerNumber'.
Basically, you can use loc
import pandas as pd
data = {'c': [1, 2, 3, 4], 'date': [[1,2],[3],[0],[2,60,30,40]]}
df = pd.DataFrame(data)
df.loc[df['c']==4, 'date']
df.loc[df['c']==4, 'date'] = df.loc[df['c']==4, 'date'].apply(lambda i: sum(i))
Result:
[2, 60, 30, 40]
c date
0 1 [1, 2]
1 2 [3]
2 3 [0]
3 4 132
You can use a lambda with apply to sum all the items in the array in each row.
Step 1: Create a dataframe
import pandas as pd
import numpy as np
d = {'ID': [[1,2,3],[1,2,43]]}
df = pd.DataFrame(data=d)
Step 2: Sum the items in the array
df['ID2']=df['ID'].apply(lambda x: sum(x))
df
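which gives:
           ID  ID2
0   [1, 2, 3]    6
1  [1, 2, 43]   46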