Avoiding writing a dataframe with a large number of columns - python

I have a dataframe that looks like this:
student school class  answer question
a       scl    first  True   x
a       scl    first  False  y
a       scl    first  True   y
b       scl    first  False  x
c       scl    sec    False  y
c       scl    sec    True   z
d       scl    sec    True   x
d       scl    sec    True   z
e       scl    third  True   z
e       scl    third  False  z
Note that it is possible to answer a question multiple times. Note also that not everyone may answer the same set of questions. I want to see which class performed better per question. So for each question I want a ranking of the classes: once considering only each student's first answer, and once considering all answers.
What I did so far is just a ranking of the classes independent of what question was answered:
# only the first answer is considered
df1 = df.drop_duplicates(subset=["student", "school", "class", "question"], keep="first")
(df1.groupby(['school', 'class'])
    ['answer'].mean()
    .rename('ClassRanking')
    .sort_values(ascending=False)
    .reset_index()
)
# all the answers are considered
(df.groupby(['school', 'class'])
   ['answer'].mean()
   .rename('ClassRanking')
   .sort_values(ascending=False)
   .reset_index()
)
So I do have a ranking of the classes. But I don't know how to compare the classes per question, because I don't want to create a dataframe with 50 columns when I have 50 classes.
Edit:
I would imagine a dataframe like this, but this is a bit ugly when I have 50 classes:
df_all=
question  class_first_res  class_sec_res  class_third_res
x         0.5              1              None
y         0.5              0              None
z         None             1              0.5
df_first_attempt=
question  class_first_res  class_sec_res  class_third_res
x         0.5              1              None
y         0                0              None
z         None             1              1

If I understood you correctly:
df_first = (df.drop_duplicates(subset=['student', 'class', 'question'], keep='first')
              .groupby(['class', 'question'])['answer']
              .apply(lambda x: x.sum() / len(x))
              .reset_index())
df_first = df_first.sort_values(by=['question']).rename(columns={'answer': 'ClassRanking'})
df_first = (df_first.pivot_table(index='question', columns='class', values='ClassRanking')
                    .reset_index().rename_axis(None, axis=1))
df_overall = (df.groupby(['class', 'question'])['answer']
                .apply(lambda x: x.sum() / len(x))
                .reset_index())
df_overall = df_overall.sort_values(by=['question']).rename(columns={'answer': 'ClassRanking'})
df_overall = (df_overall.pivot_table(index='question', columns='class', values='ClassRanking')
                        .reset_index().rename_axis(None, axis=1))
df_first:
  question  first  sec  third
0        x    0.5  1.0    NaN
1        y    0.0  0.0    NaN
2        z    NaN  1.0    1.0
df_overall:
  question  first  sec  third
0        x    0.5  1.0    NaN
1        y    0.5  0.0    NaN
2        z    NaN  1.0    0.5

You could try this.
pd.pivot_table(df, index="class", columns="question", values="answer")
It is similar to your examples, but with classes as rows instead of columns; the content is the same.
On the other hand, if you want a ranking of all the classes based on their average success across the questions, you could do this:
pd.pivot_table(df, index="question", columns="class", values="answer").mean()
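If the concern is ending up with 50 wide columns, one further option (my own sketch, not from the original answers) is to keep the result in long format, so additional classes only add rows:
# long-format sketch: one row per (question, school, class) combination,
# sorted so the best class per question comes first
per_question = (df.groupby(['question', 'school', 'class'])['answer']
                  .mean()
                  .rename('success_rate')
                  .reset_index()
                  .sort_values(['question', 'success_rate'], ascending=[True, False]))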

Related

Pandas: how to multiply each element of a Series to each element of a column in a Dataframe

I am trying to find a solution to do the following operation using either numpy or pandas: multiply each element of a Series with the corresponding column of a DataFrame, element-wise.
For instance, the result matrix has [0, 0, 0] as its first column, which is the result of multiplying series['a'] with column a element-wise; more specifically it is equal to: [0 x 0.5, 0 x 0.4, 0 x 0.1].
If there is no method for such a problem, I might just expand the Series to a DataFrame by duplicating its values and multiply the two DataFrames.
input data:
series = pd.Series([0, 10, 0, 100, 1], index=list('abcde'))
df = pd.DataFrame([[0.5, 0.4, 0.2, 0.7, 0.8],
                   [0.4, 0.5, 0.1, 0.1, 0.5],
                   [0.1, 0.9, 0.8, 0.3, 0.8]],
                  columns=list('abcde'))
This is actually very simple. Because the Series' index aligns with the DataFrame's columns, you only need to do:
series*df
output:
     a    b    c     d    e
0  0.0  4.0  0.0  70.0  0.8
1  0.0  5.0  0.0  10.0  0.5
2  0.0  9.0  0.0  30.0  0.8
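If you prefer to make the alignment axis explicit, an equivalent call (my own note, not part of the original answer) is df.mul:
df.mul(series, axis=1)  # same result as series * df; aligns the Series index with the columns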

Pandas - New column based on the value of another column N rows back, when N is stored in a column

I have a pandas dataframe with example data:
idx  price  lookback
0        5
1        7         1
2        4         2
3        3         1
4        7         3
5        6         1
Lookback can be positive or negative but I want to take the absolute value of it for how many rows back to take the value from.
I am trying to create a new column that contains the value of price from lookback + 1 rows ago, for example:
idx  price  lookback  lb_price
0        5       NaN       NaN
1        7         1       NaN
2        4         2       NaN
3        3         1         7
4        7         3         5
5        6         1         3
I started with what felt like the most obvious way, this did not work:
df['lb_price'] = df['price'].shift(df['lookback'].abs() + 1)
I then tried using a lambda, this did not work but I probably did it wrong:
sbc = lambda c, x: pd.Series(zip(*[c.shift(x + 1)]))
df['lb_price'] = sbc(df['price'], df['lookback'].abs())
I also tried a loop (which was extremely slow, but worked) but I am sure there is a better way:
lookback = np.nan
for i in range(len(df)):
    if df.loc[i, 'lookback']:
        if not np.isnan(df.loc[i, 'lookback']):
            lookback = abs(int(df.loc[i, 'lookback']))
    if not np.isnan(lookback) and (lookback + 1) <= i:
        df.loc[i, 'lb_price'] = df.loc[i - (lookback + 1), 'price']
I have seen examples using lambda, df.apply, and perhaps Series.map but they are not clear to me as I am quite a novice with Python and Pandas.
I am looking for the fastest way I can do this, if there is a way without using a loop.
Also, for what it's worth, I plan to use this computed column to create yet another column, which I can do as follows:
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']
But if I can combine all of it into one really efficient way of doing it, that would be ideal.
Solution!
Several provided solutions worked great (thank you!), but all needed some small tweaks to deal with my potentially negative numbers and the fact that it is lookback + 1, not lookback - 1, so I felt it was prudent to post my modifications here.
All of them were significantly faster than my original loop, which took 5m 26s to process my dataset.
I marked the one I observed to be the fastest as accepted, since improving the speed of my loop was the main objective.
Edited Solutions
From Manas Sambare - 41 seconds
df['lb_price'] = df.apply(
    lambda x: df['price'][x.name - (abs(int(x['lookback'])) + 1)]
    if not np.isnan(x['lookback']) and x.name >= (abs(int(x['lookback'])) + 1)
    else np.nan,
    axis=1)
From mannh - 43 seconds
def get_lb_price(row, df):
    if not np.isnan(row['lookback']):
        lb_idx = row.name - (abs(int(row['lookback'])) + 1)
        if lb_idx >= 0:
            return df.loc[lb_idx, 'price']
    return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
From Bill - 18 seconds
lookup_idxs = df.index.values - (abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]
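For reference, a combined sketch (my own addition, under the same assumptions as the version above) that chains the vectorized lookup with the streak-roc column from the question:
lookup_idxs = df.index.values - (abs(df['lookback'].values) + 1)
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df['price'].to_numpy()[lookup_idxs[valid_lookups].astype(int)]
# percent change from the looked-up price, as described in the question
df['streak-roc'] = 100 * (df['price'] - df['lb_price']) / df['lb_price']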
By getting the row's index inside of the df.apply() call using row.name, you can generate the 'lb_price' data relative to the row you are currently on.
%time
df.apply(
    lambda x: df['price'][x.name - int(x['lookback'] + 1)]
    if not np.isnan(x['lookback']) and x.name >= x['lookback'] + 1
    else np.nan,
    axis=1)
# > CPU times: user 2 µs, sys: 0 ns, total: 2 µs
# > Wall time: 4.05 µs
FYI: There is an error in your example as idx[5]'s lb_price should be 3 and not 7.
Here is an example which uses a regular function
def get_lb_price(row, df):
    lb_idx = row.name - abs(row['lookback']) - 1
    if lb_idx >= 0:
        return df.loc[lb_idx, 'price']
    else:
        return np.nan

df['lb_price'] = df.apply(get_lb_price, axis=1, args=(df,))
Here's a vectorized version (i.e. no for loops) using numpy array indexing.
lookup_idxs = df.index.values - df['lookback'].values - 1
valid_lookups = lookup_idxs >= 0
df['lb_price'] = np.nan
df.loc[valid_lookups, 'lb_price'] = df.price.to_numpy()[lookup_idxs[valid_lookups].astype(int)]
print(df)
Output:
     price  lookback  lb_price
idx
0        5       NaN       NaN
1        7       1.0       NaN
2        4       2.0       NaN
3        3       1.0       7.0
4        7       3.0       5.0
5        6       1.0       3.0
This solution loops over the values of the column lookback and calculates the index of the wanted value in the column price, which I store in an array.
The rule is that the lookback value has to be a number and that the wanted index must not be smaller than 0.
new = np.zeros(df.shape[0])
price = df.price.values
for i, lookback in enumerate(df.lookback.values):
    # lookback has to be a number and the index is not allowed to be less than 0
    # 0 < i - lookback is equivalent to 0 <= i - (lookback + 1)
    if not np.isnan(lookback) and 0 < i - lookback:
        new[i] = price[int(i - (lookback + 1))]
    else:
        new[i] = np.nan
df['lb_price'] = new

What is the difference between the args 'index' and 'values' for the pandas interpolate function?

What is the difference between the pandas DataFrame interpolate function called with args 'index' and 'values' respectively? It's ambiguous from the documentation:
pandas.DataFrame.interpolate
method : str, default ‘linear’
Interpolation technique to use. One of:
‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
‘time’: Works on daily and higher resolution data to interpolate given length of interval.
‘index’, ‘values’: use the actual numerical values of the index."
Both appear to use the numerical values of the index, is this the case?
UPDATE:
Following ansev's answer, they do indeed do the same thing
I think it's pretty clear if you imagine you are interpolating points. The values of your DataFrame represent the Y values, and the task is to fill the missing values in Y with some logic; an interpolation function is used for that. For the variable X there are two options: assume a fixed step, independent of the index, or take the actual values of the index into account.
Example with linear interpolation:
Here the index increases by 1 for each row, so there is no difference between the methods.
df=pd.DataFrame({'Y':[1,np.nan,3]})
print(df)
Y
0 1.0
1 NaN
2 3.0
print(df.interpolate(method = 'index'))
Y
0 1.0
1 2.0
2 3.0
print(df.interpolate())
Y
0 1.0
1 2.0
2 3.0
but if we change the index values...
df.index = [0,1,10000]
print(df.interpolate(method = 'index'))
Y
0 1.0000
1 1.0002 # 1 + (3-1)*((1-0)/(10000-0))
10000 3.0000
print(df.interpolate())
Y
0 1.0
1 2.0
10000 3.0
df.index = [0,0.1,1]
print(df.interpolate(method = 'index'))
Y
0.0 1.0
0.1 1.2 # 1 + (3-1)*((0.1-0)/(1-0))
1.0 3.0
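To confirm the equivalence noted in the update (a quick check of my own, not from the original answer), you can compare the two methods directly:
df = pd.DataFrame({'Y': [1, np.nan, 3]}, index=[0, 1, 10000])
print(df.interpolate(method='index').equals(df.interpolate(method='values')))  # True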

Use .apply to recode nan rows into a different value

I am trying to create a new groupid based on the original groupid, which has the values 0 and 1. I used the following code, but it failed to recode the NaN rows into 2.
final['groupid2'] = final['groupid'].apply(lambda x: 2 if x == np.nan else x)
I tried the following code also, but it gave an attribute error
final['groupid2'] = final['groupid'].apply(lambda x: 2 if x.isnull() else x)
Could someone please explain why this is the case? Thanks
Use pd.isnull to check scalars if you need to use apply:
final = pd.DataFrame({'groupid': [1, 0, np.nan],
                      'B': [400, 500, 600]})
final['groupid2'] = final['groupid'].apply(lambda x: 2 if pd.isnull(x) else x)
print (final)
groupid B groupid2
0 1.0 400 1.0
1 0.0 500 0.0
2 NaN 600 2.0
Details:
The value x in the lambda function is a scalar, because Series.apply loops over each value of the column, so calling the Series method isnull() on it failed with an attribute error. The first attempt failed for a different reason: x == np.nan is always False, because NaN never compares equal to anything, including itself.
For better testing it is possible to rewrite the lambda function as:
def f(x):
    print(x)
    print(pd.isnull(x))
    return 2 if pd.isnull(x) else x

final['groupid2'] = final['groupid'].apply(f)
1.0
False
0.0
False
nan
True
But Series.fillna is better here:
final['groupid2'] = final['groupid'].fillna(2)
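An equivalent alternative (my own note, not part of the original answer) that makes the condition explicit is np.where:
final['groupid2'] = np.where(final['groupid'].isna(), 2, final['groupid'])  # recode NaN to 2, keep other values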

pandas merge by coordinates

I am trying to merge two pandas tables where I find all rows in df2 which have coordinates close to each row in df1. Example follows.
df1:
   x  y val
0  0  1   A
1  1  3   B
2  2  9   C
df2:
     x    y val
0  1.2  2.8   a
1  0.9  3.1   b
2  2.0  9.5   c
desired result:
   x  y val_x val_y
0  0  1     A   NaN
1  1  3     B     a
2  1  3     B     b
3  2  9     C     c
Each row in df1 can have 0, 1, or many corresponding entries in df2, and finding the match should be done with a cartesian distance:
(x1 - x2)^2 + (y1 - y2)^2 < 1
The input dataframes have different sizes, even though they don't in this example. I can get close by iterating over the rows in df1 and finding the close values in df2, but am not sure what to do from there:
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    # ?? What now?
Any help would be very much appreciated. I made this example in an IPython notebook, which you can view/access here: http://nbviewer.ipython.org/gist/anonymous/49a3d821420c04169f02
I found an answer, though I am not really happy with having to loop over the rows in df1. In this case there are only a few hundred rows, so I can deal with it, but it won't scale as well as a vectorized approach. Solution:
df2_list = []
df1['merge_row'] = df1.index.values  # make a column to merge on, holding the index values
for i, row in df1.iterrows():
    df2_subset = df2.loc[(df2.x - row.x)**2 + (df2.y - row.y)**2 < 1.0]
    df2_subset['merge_row'] = i  # add the merge key
    df2_list.append(df2_subset)
df2_found = pd.concat(df2_list)
result = pd.merge(df1, df2_found, on='merge_row', how='left')
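A vectorized alternative (a sketch of my own, assuming SciPy is available; note that query_ball_point uses distance <= r rather than the strict < 1 in the question) is to build a KD-tree on df2 and merge on the collected index pairs:
from scipy.spatial import cKDTree

# find, for every point of df1, all df2 points within distance 1
tree = cKDTree(df2[['x', 'y']].to_numpy())
matches = tree.query_ball_point(df1[['x', 'y']].to_numpy(), r=1.0)

# one row per (df1 row, df2 row) pair; df1 rows with no match stay NaN after the merge
pairs = pd.DataFrame([(i, j) for i, js in enumerate(matches) for j in js],
                     columns=['merge_row', 'df2_idx'])
df1['merge_row'] = df1.index.values
result = (df1.merge(pairs, on='merge_row', how='left')
             .merge(df2, left_on='df2_idx', right_index=True,
                    how='left', suffixes=('_x', '_y')))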
