I have the following code, which I would like to speed up.
EDIT: We would like the columns in 'colsi' to be shifted by the group columns in 'colsj'. Pandas allows us to shift multiple columns at once by vectorizing over 'colsi', so I loop through each group column and perform the vectorized shifts, then fill the NAs with the medians of the columns in 'colsi'. The reindex just creates the new blank columns before they are assigned. The issue is that I have many group columns, and looping through each one is becoming time-consuming.
EDIT2: My goal is to engineer new columns from the lag of each group. I have many group columns and many columns to be shifted. 'colsi' contains the columns to be shifted and 'colsj' contains the group columns. I can vectorize over 'colsi', but looping through each group column in 'colsj' is still time-consuming.
colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()

for j in colsj:
    # names for the new lag columns
    newcols = [j + i + '_n' for i in colsi]
    # medians re-indexed to the new column names so fillna can align on them
    newmed = med.copy()
    newmed.index = newcols
    # create the blank columns before assigning to them
    df = df.reindex(columns=df.columns.tolist() + newcols)
    # vectorized shift of all of colsi within each group of j
    df[newcols] = df.groupby(j)[colsi].shift()
    # fill the leading NaNs with the medians of the original columns
    df[newcols] = df[newcols].fillna(newmed)
Parallelization seems to be a good way to do it. Leaning on this code, I attempted the following, but it didn't work:
from multiprocessing.pool import ThreadPool

pool = ThreadPool(processes=3)

colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()

def funct(j):
    newcols = [j + i + '_n' for i in colsi]
    newmed = med.copy()
    newmed.index = newcols
    df = df.reindex(columns=df.columns.tolist() + newcols)
    df[newcols] = df.groupby(j)[colsi].shift()
    df[newcols] = df[newcols].fillna(newmed)

for j in colsj:
    pool.apply_async(funct, (j))
I do not have any knowledge of how to go about parallel processing, so I am not sure what's missing here. Please advise.
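For reference, a minimal sketch (an assumption on my part, not a tested fix) of how this is usually wired up: have each worker return its new columns instead of assigning to df inside the function (that assignment makes df a local name, and any exception raised by apply_async is silently swallowed unless .get() is called), then combine the results in the main thread. It reuses df, colsi, colsj, and med from above:

from multiprocessing.pool import ThreadPool

import pandas as pd

def funct(j):
    # build and return the shifted columns; do not touch the shared df here
    newcols = [j + i + '_n' for i in colsi]
    shifted = df.groupby(j)[colsi].shift()
    shifted.columns = newcols
    # fill the leading NaNs with the medians of the original columns
    return shifted.fillna(dict(zip(newcols, med)))

pool = ThreadPool(processes=3)
results = [pool.apply_async(funct, (j,)) for j in colsj]
pool.close()
pool.join()

# .get() re-raises any worker exception and returns that worker's new columns
df = pd.concat([df] + [r.get() for r in results], axis=1)

Whether this actually helps depends on how much of the groupby work releases the GIL; the same pattern works with a process pool if the DataFrame is cheap enough to pickle.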
I have tried a lot but could not find a way to do the following, and I am not even sure whether it is possible in pandas.
Assume I have a dataframe like in (1).
When I use dataframe.groupby() on "col-a" I get (2), and I can process the grouped dataframe as usual, for example by applying a function. My question is:
Is it possible to group the dataframe like in (3) before processing (the row having "1" in col-x to be included in group2 via a condition or something similar), or is it possible to apply a function that includes that row belonging to group1 in group2 while processing?
Thank you all for your attention.
One last request, and maybe the most important one :). Although I started learning pandas a while ago, as a retired software developer I still have difficulty understanding its inner mechanics. Could a pandas pro please recommend a document, book, method, or other resource for learning pandas' basic principles well? I really love it.
groupby can use a defined function to select groups. When a function is passed, it is called on each index label to determine that row's group, and it can combine column values in any way you want. Using your example, this could be done along these lines:
import pandas as pd

df = pd.DataFrame({'col_a': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b'],
                   'col_x': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
                   'col_calc': [1, 1, 1, 1, 1, 99, 1, 1, 1, 1, 1, 1]})

def func(mdf, idx, col1, col2):
    # look up the two column values for this row label
    x = mdf[col1].loc[idx]
    y = mdf[col2].loc[idx]
    if x == 'a' and y == 0:
        return 'g1'
    if x == 'b' or y == 1:
        return 'g2'

df2 = df.groupby(lambda x: func(df, x, 'col_a', 'col_x'))['col_calc'].sum()
print(df2)
which gives:
g1 5
g2 105
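The same grouping can also be expressed without a per-row Python call by building the labels up front (a vectorized variant of the answer above, not part of it):

import numpy as np

# vectorized group labels: 'b' rows, or rows flagged with col_x == 1, go to g2
labels = np.where((df['col_a'] == 'b') | (df['col_x'] == 1), 'g2', 'g1')
df2 = df.groupby(labels)['col_calc'].sum()
print(df2)
# g1      5
# g2    105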
I have a DataFrame that is composed of 3760 rows. I want to split it into 10 parts of equal length and then use each new array as a column for a new DataFrame.
A way that I found to do this is:
alfa = np.array_split(dff, 10)
caa = pd.concat([alfa[0].reset_index(drop=True), alfa[1].reset_index(drop=True), alfa[2].reset_index(drop=True), alfa[3].reset_index(drop=True),
alfa[4].reset_index(drop=True), alfa[5].reset_index(drop=True), alfa[6].reset_index(drop=True), alfa[7].reset_index(drop=True),
alfa[8].reset_index(drop=True), alfa[9].reset_index(drop=True)], axis=1)
Not very cool, not very efficient.
Then I tried
teta = pd.concat(np.array_split(dff, 10), axis=1, ignore_index=True)
But it doesn't work as I wanted since it gives me this:
I assume that is because ignore_index works on axis 1.
Is there a better way to do it?
You can use a list comprehension to concat your columns. This code assumes your column is named init_col:
chunks = 10
cols = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J']

splits = np.array_split(dff, chunks)
out = pd.concat(
    [splits[i]
     .reset_index(drop=True)
     .rename(columns={"init_col": cols[i]})
     for i in range(chunks)],
    axis=1
)
It seems the original DataFrame is really just a single column of values? In that case, perhaps you could use numpy.reshape:
new_df = pd.DataFrame(dff.to_numpy().reshape(10,-1).T, columns=dff.columns.tolist()*10)
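For a concrete picture of what the reshape does, here is a toy single-column frame in place of dff (the sizes are just for illustration):

import pandas as pd

# toy stand-in for dff: 12 rows in one column, split into 3 columns of 4
dff = pd.DataFrame({'init_col': range(12)})
new_df = pd.DataFrame(dff.to_numpy().reshape(3, -1).T,
                      columns=dff.columns.tolist() * 3)
print(new_df)
#    init_col  init_col  init_col
# 0         0         4         8
# 1         1         5         9
# 2         2         6        10
# 3         3         7        11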
I have a dataframe:
import pandas as pd
df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
that I would like to slice into two new dataframes such that the first contains every nth value, while the second contains the remaining values not in the first.
For example, in the case of n=3, the second dataframe would keep two values from the original dataframe, skip one, keep two, skip one, etc. This slice is illustrated in the following image where the original dataframe values are blue, and these are split into a green set and a red set:
I have achieved this successfully using a combination of iloc and isin:
df1 = df.iloc[::3]
df2 = df[~df.val.isin(df1.val)]
but what I would like to know is:
Is this the most Pythonic way to achieve this? It seems inefficient and not particularly elegant to take what I want out of a dataframe and then get the rest by checking what is not in the new dataframe. Instead, is there an iloc expression, like the one used to generate df1, which could do the second part of the slicing and replace the isin line? Even better, is there a single expression that could execute the entire two-step slice in one step?
Use modulo 3 and compare for values not equal to 0 (the same positions as the sliced rows):
#for default RangeIndex
df2 = df[df.index % 3 != 0]
#for any Index
df2 = df[np.arange(len(df)) % 3 != 0]
print (df2)
val
1 b
2 c
4 e
5 f
7 h
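As for the single-expression part of the question, one option (a sketch, not taken from the answer above) is to build the boolean mask once and take both complementary slices from it:

import numpy as np
import pandas as pd

df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})

# one mask, two complementary slices: df1 keeps every 3rd row, df2 the rest
mask = np.arange(len(df)) % 3 == 0
df1, df2 = df[mask], df[~mask]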
Is there a way to use the .groupby() function to consolidate repeating rows in a data frame, separate out non-similar elements by a ',', and have the resulting .groupby() data frame retain the original datatype of the non-similar elements / convert non-similar items to an object?
As I understand, a column in pandas can hold multiple datatypes, so I feel like this should be possible.
I can use the .agg() function to separate out non-similar elements by a ',', but it doesn't work with non-string elements. I'd like to separate out the datatypes for error checking later when looking for rows with bad entries after the .groupby().
# Libraries
import pandas as pd
import numpy as np

# Example dataframe
col = ['Acol', 'Bcol', 'Ccol', 'Dcol']
df = pd.DataFrame(columns=col)
df['Acol'] = [1, 1, 2, 3]
df['Bcol'] = ['a', 'b', 'c', 'd']
df['Ccol'] = [1, 2, 3, 4]
df['Dcol'] = [1, 'a', 2, ['a', 'b']]

# Code
outdf = df.groupby(by='Acol').agg(lambda x: ','.join(x)).reset_index()
outdf
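No answer is attached here, but as a sketch of one common workaround (my assumption, not the poster's solution): cast each element to str before joining, or aggregate into lists if the original datatypes need to be preserved for the later error checking. This reuses the df defined above:

# join after casting every element to str (the result is a plain string)
outdf_str = df.groupby(by='Acol').agg(lambda x: ','.join(map(str, x))).reset_index()

# or keep the grouped values as Python lists, preserving each element's type
outdf_list = df.groupby(by='Acol').agg(list).reset_index()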
I am using Pandas to select columns from a dataframe, olddf. Let's say the variable names are 'a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3', ..., 'startswith10'.
My approach was to create a list of all variables with a common starting value.
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
I'd like to then select columns within that list as well as others, by name, so I don't have to type them all out. However, this doesn't work:
newdf = olddf['a','b',filter_col]
And this doesn't either:
newdf = olddf[['a','b'],filter_col]
I'm a newbie so this is probably pretty simple. Is the reason this doesn't work because I'm mixing a list improperly?
Thanks.
Use
newdf = olddf[['a','b']+filter_col]
since adding lists concatenates them:
In [264]: ['a', 'b'] + ['startswith1']
Out[264]: ['a', 'b', 'startswith1']
Alternatively, you could use the filter method:
newdf = olddf.filter(regex=r'^(startswith|[ab])')
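A quick sanity check of both approaches on a toy frame using the column names from the question:

import pandas as pd

olddf = pd.DataFrame(columns=['a', 'b', 'c', 'startswith1', 'startswith2'])
filter_col = [col for col in list(olddf) if col.startswith('startswith')]

print(olddf[['a', 'b'] + filter_col].columns.tolist())
# ['a', 'b', 'startswith1', 'startswith2']
print(olddf.filter(regex=r'^(startswith|[ab])').columns.tolist())
# ['a', 'b', 'startswith1', 'startswith2']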