Pandas DataFrame - insert copy of row with some changes - python

Say I have
import pandas as pd
x = pd.DataFrame.from_dict({'A':[1,2,3,4,5,6], 'B':[10, 20, 30, 44, 48, 81]})
And I want to insert a copy of the row with index 5 (x.loc[5]), but with 2 added to its 'A' value and 7 added to its 'B' value. How can I do this?
Obviously in the real example the dataframe has many more columns, that's why it makes sense for me to copy a row rather than manually populate the value for each column in it.

First copy the row you need from the original dataframe, then adjust the values in the copy, then concat it back:
x1 = x.loc[[5], :].copy()  # explicit copy, so the edits below never touch x
x1.A += 2
x1.B += 7
x_new = pd.concat([x,x1]).sort_index()
x_new
Out[291]:
   A   B
0  1  10
1  2  20
2  3  30
3  4  44
4  5  48
5  6  81
5  8  88
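A variant of the same idea, as a minimal sketch: it uses assign to build the modified copy in one step and ignore_index=True so the result gets a fresh, unique index instead of a duplicated label 5. The name new_row is only illustrative.

```python
import pandas as pd

x = pd.DataFrame({'A': [1, 2, 3, 4, 5, 6], 'B': [10, 20, 30, 44, 48, 81]})

# Select row 5 as a one-row DataFrame; assign returns a new object,
# so the original frame x is left untouched.
new_row = x.loc[[5]].assign(A=lambda r: r['A'] + 2, B=lambda r: r['B'] + 7)

# Append the modified copy and renumber the index 0..6.
x_new = pd.concat([x, new_row], ignore_index=True)
print(x_new)
```

All other columns of the copied row are carried over unchanged, which is the point when the real dataframe has many columns.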

Related

How to merge an itertools generated dataframe and a normal dataframe in pandas?

I have generated a dataframe containing all possible pairwise combinations of electrocardiogram (ECG) leads with itertools, using the code below:
source = [ 'I-s', 'II-s', 'III-s', 'aVR-s', 'aVL-s', 'aVF-s', 'V1-s', 'V2-s', 'V3-s', 'V4-s', 'V5-s', 'V6-s', 'V1Long-s', 'IILong-s', 'V5Long-s', 'Information-s' ]
target = [ 'I-t', 'II-t', 'III-t', 'aVR-t', 'aVL-t', 'aVF-t', 'V1-t', 'V2-t', 'V3-t', 'V4-t', 'V5-t', 'V6-t', 'V1Long-t', 'IILong-t', 'V5Long-t', 'Information-t' ]
from itertools import product
test = pd.DataFrame(list(product(source, target)), columns=['source', 'target'])
The test dataframe contains 256 rows, one for each possible source/target combination.
The value for each combination is set to zero as follows:
test['value'] = 0
The test df looks like this:
I have another dataframe called diagramDF that contains the combinations where the value column is non-zero. The diagramDF is significantly smaller than the test dataframe.
source target value
0 I-s II-t 137
1 II-s I-t 3
2 II-s III-t 81
3 II-s IILong-t 13
4 II-s V1-t 21
5 III-s II-t 3
6 III-s aVF-t 19
7 IILong-s II-t 13
8 IILong-s V1Long-t 353
9 V1-s aVL-t 11
10 V1Long-s IILong-t 175
11 V1Long-s V3-t 4
12 V1Long-s aVF-t 4
13 V2-s V3-t 8
14 V3-s V2-t 6
15 V3-s V6-t 2
16 V5-s aVR-t 5
17 V6-s III-t 4
18 aVF-s III-t 79
19 aVF-s V1Long-t 235
20 aVL-s I-t 1
21 aVL-s aVF-t 16
22 aVR-s aVL-t 1
Note that the first two columns source and target have the same notations
I have tried to replace the zero values of the test dataframe with the nonzero values of the diagramDF using merge like below:
df = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
However, I get an error informing me that:
ValueError: The column label 'source' is not unique. For a
multi-index, the label must be a tuple with elements corresponding to
each level
Is there something that I am getting wrong? Is there a more efficient and fast way to do this?
This might help:
pd.merge(test, diagramDF, how='left', on=['source', 'target'],right_index=True,left_index=True)
Check this:
test = test.reset_index()
diagramDF = diagramDF.reset_index()
new = pd.merge(test, diagramDF, how='left', on=['source', 'target'])
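As a rough sketch of the overall goal (filling the full grid of combinations with the known non-zero values), the following assumes that source and target are ordinary, uniquely named columns in both frames, and that test does not yet carry its own value column. The lead lists are shortened to three entries and diagramDF is rebuilt by hand purely for illustration.

```python
from itertools import product

import pandas as pd

# Shortened lead lists, for illustration only.
source = ['I-s', 'II-s', 'III-s']
target = ['I-t', 'II-t', 'III-t']
test = pd.DataFrame(list(product(source, target)), columns=['source', 'target'])

# A hand-built stand-in for diagramDF holding only the non-zero pairs.
diagramDF = pd.DataFrame({'source': ['I-s', 'II-s', 'II-s'],
                          'target': ['II-t', 'I-t', 'III-t'],
                          'value': [137, 3, 81]})

# Left merge keeps every combination; pairs missing from diagramDF get NaN.
merged = test.merge(diagramDF, how='left', on=['source', 'target'])

# Replace the NaN placeholders with zero and restore an integer dtype.
merged['value'] = merged['value'].fillna(0).astype(int)
print(merged)
```

Leaving the value column out of test before merging avoids the value_x / value_y suffixes that pandas adds when both frames share a non-key column name.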

Pandas: row operations on a column, given one reference value on a different column

I am working with a database that looks like the below. For each fruit (just apple and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
dataset = {'apple_yearly_avg': [57],
'apple_sales': [100],
'apple_monthly_avg':[80],
'apple_st_dev': [12],
'pears_monthly_avg': [33],
'pears_yearly_avg': [35],
'pears_sales': [40],
'pears_st_dev':[8]}
df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # move the old index into a regular column
df.columns = ['Description', 'Value']  # name the 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate a fruit price, say 'pears', and subtract each average sales from current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear ['some_op'] = df_pear['Value'] - df_pear['temp']
The above works by creating a temporary column holding pears_sales (40), backfilling it, and then using it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary array? Also, I do get the common warning saying I should use '.loc[row_indexer, col_indexer]', even though the output still works.
For the second set of operations, I need to add '5' rows equal to 'new_purchases' to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row
counter = 1
for i in range(len(df_pear)-1, len(df_pear)-new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again, I'm worried about readability since there's another temporary column created, and then the indexing is rather ugly?
Thank you.
I think there is a cleaner way to perform both of your tasks, for each
fruit in one go:
Add 2 columns, Fruit and Descr, the result of splitting of Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract (yearly) average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[ sal * (1 + dev * i) for i in range(5, 0, -1) ])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group, grouped by Fruit, drop Fruit
column and save the result back in df:
df = df.groupby('Fruit').apply(reformat)\
    .reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
Description Value some_op
0 apple_yearly_avg 57 0
1 apple_sales 100 43
2 apple_monthly_avg 80 23
3 apple_st_dev 12 -45
4 apple_st_dev 12 6100
5 apple_st_dev 12 4900
6 apple_st_dev 12 3700
7 apple_st_dev 12 2500
8 apple_st_dev 12 1300
9 pears_monthly_avg 33 -2
10 pears_sales 40 5
11 pears_yearly_avg 35 0
12 pears_st_dev 8 -27
13 pears_st_dev 8 1640
14 pears_st_dev 8 1320
15 pears_st_dev 8 1000
16 pears_st_dev 8 680
17 pears_st_dev 8 360
Edit
I'm in doubt whether Description should also be replicated to new
rows from "st_dev" row. If you want some other content there, set it
in reformat function, after wrk2 is created.
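For Question 1 on its own (one fruit, no temporary column, no warning), a minimal sketch could look like the following. It assumes the current-sales row is the one whose Description ends in '_sales', and it takes an explicit .copy() of the filtered rows, which is the usual way to avoid the '.loc[row_indexer, col_indexer]' warning. Unlike the backfill version, the st_dev row also gets a value here rather than NaN.

```python
import pandas as pd

dataset = {'apple_yearly_avg': [57], 'apple_sales': [100],
           'apple_monthly_avg': [80], 'apple_st_dev': [12],
           'pears_monthly_avg': [33], 'pears_yearly_avg': [35],
           'pears_sales': [40], 'pears_st_dev': [8]}
df = pd.DataFrame(dataset).T.reset_index()
df.columns = ['Description', 'Value']

# Explicit copy: later column assignments act on an independent frame.
df_pear = df[df['Description'].str.contains('pear')].copy()

# Look up the single current-sales value as a scalar instead of a helper column.
sales = df_pear.loc[df_pear['Description'].str.endswith('_sales'), 'Value'].iloc[0]

# Same subtraction as in the question: each value minus current sales.
df_pear['some_op'] = df_pear['Value'] - sales
print(df_pear)
```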

Finding the indexes of the N maximum values across an axis in Pandas

I know that there is a method .argmax() that returns the indexes of the maximum values across an axis.
But what if we want to get the indexes of the 10 highest values across an axis?
How could this be accomplished?
E.g.:
data = pd.DataFrame(np.random.random_sample((50, 40)))
You can use argsort:
s = pd.Series(np.random.permutation(30))
sorted_positions = np.argsort(s.to_numpy())  # positions ordered from smallest to largest value
top_10 = s.index[sorted_positions[-10:]]  # index labels of the 10 largest values
print(top_10)
The labels printed depend on the random permutation.
IIUC, say, if you want to get the index of the top 10 largest numbers of column col:
data[col].nlargest(10).index
Give this a try. This will take the 10 largest values across a row and put them into a dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random_sample((50, 40)))
df2 = pd.DataFrame(np.sort(df.values)[:,-10:])
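The snippet above returns the largest values themselves. To get their positions instead, which is what the question asks for, one possible sketch uses np.argsort along axis 1. The names top_pos and top_cols and the fixed seed are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed only so the example is repeatable
data = pd.DataFrame(rng.random((50, 40)))
n = 10

# argsort sorts each row ascending; keep the last n positions and
# reverse them so the position of the largest value comes first.
top_pos = np.argsort(data.to_numpy(), axis=1)[:, -n:][:, ::-1]

# Translate positions into column labels, one row of labels per input row.
top_cols = pd.DataFrame(data.columns.to_numpy()[top_pos], index=data.index)
print(top_cols.head())
```

For the other axis (top n row labels per column), the same approach should work with axis=0 and data.index in place of data.columns.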

Trying to call a cell value, why is my list of column values being interpreted as index values?

I need to do some maths using the following dataframe. In a for loop iterating through VALUE column cells, I need to grab the corresponding FracDist.
VALUE FracDist
0 11 0.022133
1 21 0.021187
2 22 0.001336
3 23 0.000303
4 24 0.000015
5 31 0.000611
6 41 0.040523
7 42 0.285630
8 43 0.161956
9 52 0.296993
10 71 0.160705
11 82 0.008424
12 90 0.000130
13 95 0.000053
First I made a list of VALUE values which I can use in a for loop, which worked as expected:
IN: LCvals = df['VALUE'].tolist()
    print(LCvals)
OUT: [11, 21, 22, 23, 24, 31, 41, 42, 43, 52, 71, 82, 90, 95]
When I try to grab a cell from the dataframe's FracDist column based on which VALUE row the for loop is on, that is where a problem comes up. Instead of looking up rows using VALUE from the VALUE column, the code is trying to lookup rows using VALUE as the index. So what I get:
IN: for val in LCvals:
        print(val)
        print(LCdf.loc[val]['FracDist'])
OUT: 11
0.00842444155517
21
KeyError: 'the label [21] is not in the [index]'
Note that the FracDist row that is grabbed for VALUE=11 is from index 11, not VALUE 11.
What needs to change in that for loop code to query rows based on VALUE in the VALUE column rather than VALUE as a spot in the index?
Here pd.DataFrame.loc will index first by row label and then, if a second argument is supplied, by column label. This is by design. See also Indexing and Selecting Data.
Don't, under any circumstances, use chained indexing. For example, Boolean indexing followed by column label selection via LCdf.loc[LCdf['VALUE']==val]['FracDist'] is not recommended.
If you wish to iterate a single series, you can use pd.Series.items. But here you are using 'VALUE' as if it were an index, so you can use set_index first:
for val, dist in df.set_index('VALUE')['FracDist'].items():
    print(val, dist)
11 0.022133
21 0.021187
...
90 0.00013
95 5.3e-05
If you pass an integer to .loc, it returns (in this case) the row whose index label equals that integer. You could use a Boolean mask on the column instead: LCdf.loc[LCdf['VALUE'] == val, 'FracDist'].
Edit: Here is a better (more efficient) answer:
for index, row in LCdf.iterrows():
    print(row['VALUE'])
    print(row['FracDist'])
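To make the difference between index-based and column-based lookup concrete, here is a small sketch on the first three rows of the table from the question. The names frac_by_value and dist_21 are illustrative only.

```python
import pandas as pd

# First three rows of the question's table, rebuilt by hand.
LCdf = pd.DataFrame({'VALUE': [11, 21, 22],
                     'FracDist': [0.022133, 0.021187, 0.001336]})

# Option 1: make VALUE the index once, then .loc looks up by VALUE.
frac_by_value = LCdf.set_index('VALUE')['FracDist']
for val in LCdf['VALUE']:
    print(val, frac_by_value.loc[val])

# Option 2: keep the default index and select with a Boolean mask,
# passing row mask and column label to .loc in a single call.
dist_21 = LCdf.loc[LCdf['VALUE'] == 21, 'FracDist'].iloc[0]
print(dist_21)
```

Option 1 assumes the VALUE entries are unique; with duplicates, .loc would return a Series rather than a single number.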

Pandas DataFrame with Function: Columns Varying

Given the following DataFrame:
import pandas as pd
import numpy as np
d=pd.DataFrame({'Label':['a','a','b','b'],'Count1':[10,20,30,40],'Count2':[20,45,10,35],
                'Count3':[40,30,np.nan,22],'Nobs1':[30,30,70,70],'Nobs2':[65,65,45,45],
                'Nobs3':[70,70,22,32]})
d
  Label  Count1  Count2  Count3  Nobs1  Nobs2  Nobs3
0     a      10      20    40.0     30     65     70
1     a      20      45    30.0     30     65     70
2     b      30      10     NaN     70     45     22
3     b      40      35    22.0     70     45     32
I would like to apply the z test for proportions on each combination of column groups (1 and 2, 1 and 3, 2 and 3) per row. By column group, I mean, for example, "Count1" and "Nobs1".
For example, one such test would be:
from statsmodels.stats.proportion import proportions_ztest
count = np.array([10, 20])  # from first row of Count1 and Count2, respectively
nobs = np.array([30, 65])  # from first row of Nobs1 and Nobs2, respectively
pv = proportions_ztest(count=count, nobs=nobs, value=0, alternative='two-sided')[1]  # [1] selects just the p-value, which is of interest
pv
0.80265091465415639
I would want the result (pv) to go into a new column (first row) called "p_1_2" or something logical that corresponds to its respective columns.
In summary, here are the challenges I'm facing:
1. How to apply this per row.
2. ...for each paired combination, mentioned above.
3. ...where the column names and number of pairs of "Count" and "Nobs" columns may vary (assuming that there will always be a "Nobs" column for each "Count" column).
Related to 3: For example, I might have a column called "18-24" and another called "18-24_Nobs".
Thanks in advance!
For 1) and 2), here is one test; additional tests can be coded similarly or within an additional loop:
for i, row in d.iterrows():
    d.loc[i, 'test'] = proportions_ztest(count=row['Count1':'Count2'].values,
                                         nobs=row['Nobs1':'Nobs2'].values,
                                         value=0, alternative='two-sided')[1]
For 3), it should be possible to handle these cases with pure Python inside the loop.
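One way the additional loop over pairs could look, as a sketch only: it assumes the columns follow the "Count<k>" / "Nobs<k>" naming from the example, and it replaces proportions_ztest with a small hand-written pooled two-proportion z-test (the hypothetical helper ztest_pvalues) so that whole columns can be processed at once without iterrows. Other naming schemes such as "18-24" / "18-24_Nobs" would need a different way of building the suffixes list.

```python
import math
from itertools import combinations

import numpy as np
import pandas as pd

d = pd.DataFrame({'Label': ['a', 'a', 'b', 'b'],
                  'Count1': [10, 20, 30, 40], 'Count2': [20, 45, 10, 35],
                  'Count3': [40, 30, np.nan, 22],
                  'Nobs1': [30, 30, 70, 70], 'Nobs2': [65, 65, 45, 45],
                  'Nobs3': [70, 70, 22, 32]})


def ztest_pvalues(c1, n1, c2, n2):
    """Two-sided p-values of a pooled two-proportion z-test, column-wise.

    Hypothetical stand-in for calling proportions_ztest once per row.
    """
    pooled = (c1 + c2) / (n1 + n2)
    se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (c1 / n1 - c2 / n2) / se
    # Two-sided normal tail probability: erfc(|z| / sqrt(2)).
    return (z.abs() / math.sqrt(2)).map(math.erfc)


# Collect the group suffixes ('1', '2', '3') from the Count columns.
suffixes = sorted(c[len('Count'):] for c in d.columns if c.startswith('Count'))

# One new column per pair of groups, named p_<a>_<b>.
for a, b in combinations(suffixes, 2):
    d[f'p_{a}_{b}'] = ztest_pvalues(d[f'Count{a}'], d[f'Nobs{a}'],
                                    d[f'Count{b}'], d[f'Nobs{b}'])
print(d[['Label', 'p_1_2', 'p_1_3', 'p_2_3']])
```

Rows with a missing count (the NaN in Count3) simply get NaN in the affected p-value columns.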
