Pandas DataFrame with Function: Columns Varying - python

Given the following DataFrame:
import pandas as pd
import numpy as np
d = pd.DataFrame({'Label': ['a', 'a', 'b', 'b'],
                  'Count1': [10, 20, 30, 40],
                  'Count2': [20, 45, 10, 35],
                  'Count3': [40, 30, np.nan, 22],
                  'Nobs1': [30, 30, 70, 70],
                  'Nobs2': [65, 65, 45, 45],
                  'Nobs3': [70, 70, 22, 32]})
d
Label Count1 Count2 Count3 Nobs1 Nobs2 Nobs3
0 a 10 20 40.0 30 65 70
1 a 20 45 30.0 30 65 70
2 b 30 10 NaN 70 45 22
3 b 40 35 22.0 70 45 32
I would like to apply the z test for proportions on each combination of column groups (1 and 2, 1 and 3, 2 and 3) per row. By column group, I mean, for example, "Count1" and "Nobs1".
For example, one such test would be:
from statsmodels.stats.proportion import proportions_ztest

count = np.array([10, 20])  # from the first row of Count1 and Count2, respectively
nobs = np.array([30, 65])   # from the first row of Nobs1 and Nobs2, respectively
pv = proportions_ztest(count=count, nobs=nobs, value=0, alternative='two-sided')[1]  # this returns just the p-value, which is of interest
pv
0.80265091465415639
I would want the result (pv) to go into a new column (first row) called "p_1_2" or something logical that corresponds to its respective columns.
In summary, here are the challenges I'm facing:
1. How to apply this per row.
2. ...for each paired combination, mentioned above.
3. ...where the column names and number of pairs of "Count" and "Nobs" columns may vary (assuming that there will always be a "Nobs" column for each "Count" column).
Related to 3: for example, I might have a column called "18-24" and another called "18-24_Nobs".
Thanks in advance!

For 1) and 2), here is one test; the additional tests can be coded similarly or within an additional loop:
from statsmodels.stats.proportion import proportions_ztest

for i, row in d.iterrows():
    d.loc[i, 'test'] = proportions_ztest(count=row['Count1':'Count2'].values,
                                         nobs=row['Nobs1':'Nobs2'].values,
                                         value=0, alternative='two-sided')[1]
For 3), it should be possible to handle these cases with pure Python inside the loop.
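For all three points together, here is a sketch (not a definitive implementation) that loops over every pairwise combination of column groups; the pairs list is an assumption you would build from your own column-naming scheme, e.g. with plain Python as suggested for 3):
from itertools import combinations
from statsmodels.stats.proportion import proportions_ztest

# (count_col, nobs_col) pairs; for names like "18-24" / "18-24_Nobs" this could be built as
# [(c, c + '_Nobs') for c in d.columns if c + '_Nobs' in d.columns]
pairs = [('Count1', 'Nobs1'), ('Count2', 'Nobs2'), ('Count3', 'Nobs3')]

# One new p-value column per pair of column groups, one test per row
for (c1, n1), (c2, n2) in combinations(pairs, 2):
    col = 'p_{}_{}'.format(c1, c2)
    for i, row in d.iterrows():
        d.loc[i, col] = proportions_ztest(count=row[[c1, c2]].astype(float).values,
                                          nobs=row[[n1, n2]].astype(float).values,
                                          value=0, alternative='two-sided')[1]
# Rows with missing counts (e.g. the NaN in Count3) may need to be skipped or handled separately.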

searching rows for a value that is 90% of the max value in that row and then outputting the index of the column for CSV files, using pandas

I am a Python newbie, so I am struggling with this problem.
I am using pandas to read in CSV files with multiple rows (up to 200,000, depending on the CSV file) and 495 columns.
I want to search along each row separately to find its max value, then take the value that is 90% of that max and find the column number of the nearest match. I want to do this for all rows separately.
For example:
row 1 has a max value of 12,098, which is in column 300.
90% of 12,098 gives a value of 10,888. It is unlikely there will be an exact match, so I want to find the nearest match in that row and then get the column number (index) where it sits, which could be column 300, for example.
I then want to repeat this for every row.
This is what I have done so far:
1. Search my rows of data to find the max value:
maxValues = df.max(axis = 1)
2. Calculate 90% of this:
newmax = maxValues / 10 * 9
3. Find the value closest to that newmax in the row, and then tell me the column number where that value is - this is the part I can't do. I have tried:
arr = pulses.to_numpy()
x = newmax.values
difference_array = np.absolute(arr-x).axis=1
index = difference_array.argmin().axis=1
which provides the following error: operands could not be broadcast together with shapes (114,495) (114,)
I can do up to step 2 above, but can't figure out step 3. I have tried converting them to arrays, as you can see, but this only produces errors.
Let's say we have the following dataframe:
import pandas as pd
d= {'a':[0,1], 'b':[10, 20], 'c':[30, 40], 'd':[15, 30]}
df = pd.DataFrame(data=d)
To go row by row you can use the apply function.
Since you operate on just one row at a time, you can find its maximum with max.
To find the value closest to 0.9 of the maximum, you need the smallest absolute difference between the numbers.
To insert values by row index into the initial dataframe, use at.
So a code would be like this:
percent = 0.9

def foo(row):
    max_val = row.max()
    max_col = row[row == max_val].index[0]
    second_max_val = percent * max_val
    idx = row.name
    df.at[idx, 'max'] = max_col
    df.at[idx, '0.9max'] = (abs(row.loc[row.index != max_col] - second_max_val)).idxmin()
    return row

df.apply(lambda row: foo(row), axis=1)
print(df)
Your error occurs because you are comparing a two-dimensional array with a one-dimensional one (arr - x).
Consider this sample data frame:
import pandas as pd
import numpy as np
N = 5
df = pd.DataFrame({
    "col1": np.random.randint(100, size=(N,)),
    "col2": np.random.randint(100, size=(N,)),
    "col3": np.random.randint(100, size=(N,)),
    "col4": np.random.randint(100, size=(N,)),
    "col5": np.random.randint(100, size=(N,))
})
col1 col2 col3 col4 col5
0 48 21 74 76 95
1 66 1 13 56 83
2 91 67 96 93 28
3 49 76 39 95 84
4 65 31 61 68 24
IIUC, you could use the following code (no iteration needed, relies only on numpy and pandas) to find the index positions of the columns that are closest to 0.9 times the maximum value of each row. If two values are equally close, the first index is returned. The code needs only about five seconds for 2 million rows.
Code:
np.argmin(df.sub(df.max(axis=1) * 0.9, axis=0).apply(np.abs).values, axis=1)
Output:
array([3, 4, 0, 4, 2])
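For reference, the broadcasting error from the original attempt can also be fixed directly by giving the per-row targets a trailing axis of length 1, so the (114, 495) array and the (114,) vector line up; a minimal sketch reusing the arr and newmax names from the question:
import numpy as np

# Column vector of per-row targets broadcasts across all 495 columns
difference_array = np.abs(arr - newmax.values[:, None])  # shape (n_rows, n_cols)
index = difference_array.argmin(axis=1)                  # positional column index per row
# df.columns[index] gives the corresponding column labels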

Statistical significance test after matching

I have a dataframe consisting of four columns and around 20000 rows, like this.
import pandas as pd
import numpy as np
d = {'x': [1,1,0,1,0,0,1],'BPM':[70,55,45,np.nan,35,25,np.nan],'AGE': [50, 47,21, 50,24,47,16], 'WEIGHT': [50,100,50,np.nan,np.nan,100,27]}
df = pd.DataFrame(data=d)
x BPM AGE WEIGHT
1 70 50 50
1 55 47 100
0 45 21 50
1 nan 24 nan
0 35 50 nan
0 25 47 100
1 nan 16 27
Is there any significant difference in "BPM" between class '1' and class '0' after matching AGE and WEIGHT?
There are two classes, 0 and 1, and the number of samples is not equal in the two classes. I understand that first I have to match values, and then I can apply a t-test. I am new to this field, so I do not understand how to proceed.
You could calculate the t-scores by hand.
mean_bpm_df = df.groupby(['AGE','WEIGHT','x']).mean().unstack(level=-1)
mean_bpm_df.columns = ['mean_bpm_0','mean_bpm_1']
std_count_df = df.drop(columns='x').groupby(['AGE','WEIGHT']).agg(['std','count'])
std_count_df.columns = ['std_bpm','count_bpm']
t_df = (mean_bpm_df.mean_bpm_0 - mean_bpm_df.mean_bpm_1) / (std_count_df.std_bpm / np.sqrt(std_count_df.count_bpm))
Now, if you also want the p-values, those can be calculated by hand too. Assume a 2-sided t-test (you can modify this if needed).
from scipy.stats import t
p_df = pd.DataFrame(index=t_df.index, data=2*(1 - t.cdf(abs(t_df), std_count_df.count_bpm-1)))
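If by "matching" you mean pairing class-1 and class-0 rows that share the same AGE and WEIGHT, here is a minimal sketch of that route using scipy.stats.ttest_rel; this is an alternative to the by-hand t-scores above, and with the tiny sample shown only one pair matches, hence the guard:
from scipy.stats import ttest_rel

# Exact one-to-one matching on AGE and WEIGHT, then a paired t-test on BPM
treated = df[df['x'] == 1][['AGE', 'WEIGHT', 'BPM']]
control = df[df['x'] == 0][['AGE', 'WEIGHT', 'BPM']]
matched = treated.merge(control, on=['AGE', 'WEIGHT'], suffixes=('_1', '_0')).dropna()

if len(matched) > 1:  # ttest_rel needs at least two matched pairs
    t_stat, p_val = ttest_rel(matched['BPM_1'], matched['BPM_0'])
    print(t_stat, p_val)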

Correlation analysis with multiple data in a single cell

I have a dataset with some rows containing singular answers and others having multiple answers. Like so:
year length Animation
0 1971 121 1,2,3
1 1939 71 1,3
2 1941 7 0,2
3 1996 70 1,2,0
4 1975 71 3,2,0
With the singular answers I managed to create a heatmap using df.corr(), but I can't figure out the best approach for the rows with multiple answers.
I could split them and add additional columns for each answer like:
year length Animation
0 1971 121 1
1 1971 121 2
2 1971 121 3
3 1939 71 1
4 1939 71 3 ...
and then do the exact same df.corr(), or add additional Animation_01, Animation_02 ... columns, but there must be a smarter way to work around this issue?
EDIT: Actual data snippet
You should compute a frequency table between two categorical variables using pd.crosstab() and perform subsequent analyses based on this table. df.corr(x, y) is NOT mathematically meaningful when one of x and y is categorical, whether or not it is encoded as a number.
N.B.1 If x is categorical but y is numerical, there are two options to describe the linkage between them:
Group y into quantiles (bins) and treat it as categorical
Perform a linear regression of y against one-hot encoded dummy variables of x
Option 2 is more precise in general, but the statistics are beyond the scope of this question. This post will focus on the case of two categorical variables.
N.B.2 For sparse matrix output please see this post.
Sample Solution
Data & Preprocessing
import pandas as pd
import io
import matplotlib.pyplot as plt
from seaborn import heatmap
df = pd.read_csv(io.StringIO("""
year  length  Animation
0  1971  121  1,2,3
1  1939  71   1,3
2  1941  7    0,2
3  1996  70   1,2,0
4  1975  71   3,2,0
"""), sep=r"\s{2,}", engine="python")
# convert string to list
df["Animation"] = df["Animation"].str.split(',')
# expand list column into new rows
df = df.explode("Animation")
# (optional)
df["Animation"] = df["Animation"].astype(int)
Frequency Table
Note: grouping of length is ignored for simplicity
ct = pd.crosstab(df["Animation"], df["length"])
print(ct)
# Out[65]:
# length 7 70 71 121
# Animation
# 0 1 1 1 0
# 1 0 1 1 1
# 2 1 1 1 1
# 3 0 0 2 1
Visualization
ax = heatmap(ct, cmap="viridis",
             yticklabels=df["Animation"].drop_duplicates().sort_values(),
             xticklabels=df["length"].drop_duplicates().sort_values(),
             )
ax.set_title("Title", fontsize=20)
plt.show()
Example Analysis
Based on the frequency table, you can ask questions about the distribution of y given a certain (subset of) x value(s), or vice versa. This should better describe the linkage between two categorical variables, as the categorical variables have no order.
For example,
Q: What length does Animation=3 produce?
A: 66.7% chance to give 71
33.3% chance to give 121
otherwise unobserved
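Such conditional distributions can be read straight off the crosstab by normalizing each row; a minimal sketch continuing from the ct computed above:
# Conditional distribution of length given Animation: divide each row by its total
cond = ct.div(ct.sum(axis=1), axis=0)
print(cond.loc[3])   # row for Animation == 3
# length
# 7      0.000000
# 70     0.000000
# 71     0.666667
# 121    0.333333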
You want to break Animation (or Preferred_positions in your data snippet) up into a series of one-hot columns, one one-hot column for every unique string in the original column. Each column will have values of either zero or one: a one corresponds to rows where that string appeared in the original column.
First, you need to get all the unique substrings in Preferred_positions (see this answer for how to deal with a column of lists).
positions = df.Preferred_positions.str.split(',').explode().unique()
Then you can create the positions columns in a loop based on whether the given position is in Preferred_positions for each row.
for position in positions:
    df[position] = df.Preferred_positions.apply(
        lambda x: 1 if position in x else 0
    )
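As a possible shortcut (an assumption on my part, not part of the answer above): if Preferred_positions is stored as a comma-separated string, pandas can build the same one-hot columns in one call with Series.str.get_dummies:
# One 0/1 column per unique comma-separated value; assumes no stray whitespace around the commas
one_hot = df['Preferred_positions'].str.get_dummies(sep=',')
df = df.join(one_hot)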

creating new dataframe with manhattan distance in python

I need to create a dataframe containing the Manhattan distance between two dataframes with the same columns, and I need the indexes of each dataframe to be used as the index and column names. For example, let's say I have these two dataframes:
x_train :
index a b c
11 2 5 7
23 4 2 0
312 2 2 2
x_test :
index a b c
22 1 1 1
30 2 0 0
The columns match but the sizes and indexes do not. The expected dataframe would look like this:
dist_dataframe:
index 11 23 312
22 11 5 3
30 12 4 4
and what I have right now is this:
def manhattan_distance(a, b):
    return sum(abs(e1-e2) for e1, e2 in zip(a,b))

def calc_distance(X_test, X_train):
    dist_dataframe = pd.DataFrame(index=X_test.index, columns=X_train.index)
    for i in X_train.index:
        for j in X_test.index:
            dist_dataframe.loc[i,j] = manhattan_distance(X_train.loc[[i]], X_test.loc[[j]])
    return dist_dataframe
what I get from the code I have is this dataframe:
dist_dataframe:
index
index 11 23 312
22 NaN NaN NaN
30 NaN NaN NaN
I get the right dataframe size, except that it has two header rows called "index" from the creation of the new dataframe, and I also get an error no matter what I do in the Manhattan calculation line. Can anyone help me out here, please?
Problem in your code
There is a very small problem in your code, i.e. accessing values in dist_dataframe. So, instead of dist_dataframe.loc[i,j], you should reverse the order of i and j and make it dist_dataframe.loc[j,i].
More efficient solution
It will work fine, but since you are a new contributor, I would also like to point out the efficiency of your code. Always try to replace loops with pandas built-in functions; since they are written in C, they are much faster. So here is a more efficient solution:
def manhattan_distance(a, b):
    return sum(abs(e1-e2) for e1, e2 in zip(a,b))

def xtrain_distance(row):
    distances = {}
    for i, each in x_train.iterrows():
        distances[i] = manhattan_distance(each, row)
    return distances

result = x_test.apply(xtrain_distance, axis=1)

# converting into dataframe
pd.DataFrame(dict(result)).transpose()
It produces the same output on your example, where you won't see any time difference. But when run on a larger size (the same data scaled up 20 times, i.e. 60 x_train samples and 40 x_test samples), here is the time difference:
Your solution took: 929 ms
This solution took: 207 ms
It got about 4x faster just by eliminating one for loop. Note that it can be made more efficient still, but for the sake of demonstration I have used this solution.
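As noted, this can be pushed further; a sketch of a fully vectorized variant using scipy.spatial.distance.cdist with the cityblock (Manhattan) metric, assuming x_train and x_test hold only numeric feature columns as in the question:
import pandas as pd
from scipy.spatial.distance import cdist

# Pairwise Manhattan distances: one row per x_test index, one column per x_train index
dist_dataframe = pd.DataFrame(cdist(x_test.values, x_train.values, metric='cityblock'),
                              index=x_test.index,
                              columns=x_train.index)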

Pandas: row operations on a column, given one reference value on a different column

I am working with a database that looks like the below. For each fruit (just apples and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales, and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
import pandas as pd

dataset = {'apple_yearly_avg': [57],
           'apple_sales': [100],
           'apple_monthly_avg': [80],
           'apple_st_dev': [12],
           'pears_monthly_avg': [33],
           'pears_yearly_avg': [35],
           'pears_sales': [40],
           'pears_st_dev': [8]}

df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # clear index
df.columns = ['Description', 'Value']  # name the 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate a fruit price, say 'pears', and subtract each average sales from current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear['some_op'] = df_pear['Value'] - df_pear['temp']
The above works by creating a temporary column holding the pear_sales value of 40, backfilling it, and then using it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary column? Also, I do get the common warning saying I should use .loc[row_indexer, col_indexer], even though the output still works.
For the second set of operations, I need to add 5 rows (equal to 'new_purchases') to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row

counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again I'm worried about readability, since there's another temporary column created and the indexing is rather ugly.
Thank you.
I think there is a cleaner way to perform both of your tasks, for each fruit in one go:
Add 2 columns, Fruit and Descr, the result of splitting Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group, grouped by Fruit, drop Fruit
column and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
.reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
Description Value some_op
0 apple_yearly_avg 57 0
1 apple_sales 100 43
2 apple_monthly_avg 80 23
3 apple_st_dev 12 -45
4 apple_st_dev 12 6100
5 apple_st_dev 12 4900
6 apple_st_dev 12 3700
7 apple_st_dev 12 2500
8 apple_st_dev 12 1300
9 pears_monthly_avg 33 -2
10 pears_sales 40 5
11 pears_yearly_avg 35 0
12 pears_st_dev 8 -27
13 pears_st_dev 8 1640
14 pears_st_dev 8 1320
15 pears_st_dev 8 1000
16 pears_st_dev 8 680
17 pears_st_dev 8 360
Edit
I'm in doubt whether Description should also be replicated to the new rows from the "st_dev" row. If you want some other content there, set it in the reformat function, after wrk2 is created.
