pandas modifying sections then recombining - python

I have been working on modifying an Excel document with pandas. I only need to work with small sections at a time, so breaking each section into a separate DataFrame, modifying it, and then recombining it back into the whole seems like the best solution. Is this feasible? I've tried a couple of options with merge() and concat(), but they don't seem to give me the results I am looking for.
As previously stated, I've tried using the merge() function to recombine them. With the larger table I just got a MemoryError, and when I tested it with smaller DataFrames, rows weren't maintained.
Here's a small-scale example of what I am looking to do:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 5, 6],
                    'B': [3, 10, 11, 13, 324],
                    'C': [64, '', '', '', ''],
                    'D': [32, 45, 67, 80, 100]})  # example df
print(df1)
df2 = df1[['B', 'C']]  # section taken
df2.at[2, 'B'] = 1     # modify area
print(df2)
df1 = df1.merge(df2)   # merge dataframes
print(df1)
output:
A B C D
0 1 3 64 32
1 2 10 45
2 3 11 67
3 5 13 80
4 6 324 100
B C
0 3 64
1 10
2 1
3 13
4 324
A B C D
0 1 3 64 32
1 2 10 45
2 5 13 80
3 6 324 100
What I would like to see:
A B C D
0 1 3 64 32
1 2 10 45
2 3 11 67
3 5 13 80
4 6 324 100
B C
0 3 64
1 10
2 1
3 13
4 324
A B C D
0 1 3 64 32
1 2 10 45
2 3 1 67
3 5 13 80
4 6 324 100
As I said before, in my actual code I just get a MemoryError if I try this, due to the size of the DataFrame.

No need for merging here, you can just re-assign back the values from df2 into df1:
...
df1.loc[df2.index, df2.columns] = df2 #recover changes into original dataframe
print(df1)
giving as expected:
A B C D
0 1 3 64 32
1 2 10 45
2 3 1 67
3 5 13 80
4 6 324 100
df1.update(df2) gives the same result (thanks to Quang Hoang for the precision)
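For reference, here's the label-based write-back run end-to-end on the question's sample data (a sketch; the .copy() is added so the slice is an independent frame and won't trigger a SettingWithCopyWarning):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 5, 6],
                    'B': [3, 10, 11, 13, 324],
                    'C': [64, '', '', '', ''],
                    'D': [32, 45, 67, 80, 100]})

df2 = df1[['B', 'C']].copy()  # independent section to work on
df2.at[2, 'B'] = 1            # modify the section

# Write the section back by label; only rows/columns present in df2 change.
df1.loc[df2.index, df2.columns] = df2
# df1.update(df2) would overwrite the same cells in place.
print(df1)
```

Because this assigns by label rather than joining, no intermediate merged table is built, which avoids the MemoryError seen with merge().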

Related

How to pivot dataframe and transpose 1 row

I want to pivot this dataframe and convert the columns to a second level multiindex or column.
Original dataframe:
Type VC C B Security
0 Standard 2 2 2 A
1 Standard 16 13 0 B
2 Standard 52 35 2 C
3 RI 10 10 0 A
4 RI 10 15 31 B
5 RI 10 15 31 C
Desired dataframe:
Type A B C
0 Standard VC 2 16 52
1 Standard C 2 13 35
2 Standard B 2 0 2
3 RI VC 10 10 10
11 RI C 10 15 15
12 RI B 0 31 31
You could try as follows:
Use df.pivot and then transpose using df.T.
Next, chain df.sort_index to rearrange the entries, and apply df.swaplevel to change the order of the MultiIndex.
Lastly, consider getting rid of the Security as columns.name, and adding an index.name for the unnamed variable, e.g. Subtype here.
If you want the MultiIndex as columns, you can of course simply use df.reset_index at this stage.
res = (df.pivot(index='Security', columns='Type').T
.sort_index(level=[1,0], ascending=[False, False])
.swaplevel(0))
res.columns.name = None
res.index.names = ['Type','Subtype']
print(res)
A B C
Type Subtype
Standard VC 2 16 52
C 2 13 35
B 2 0 2
RI VC 10 10 10
C 10 15 15
B 0 31 31
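For reference, the chain above run end-to-end on the sample data (a self-contained sketch; the frame is rebuilt from the question's table):

```python
import pandas as pd

df = pd.DataFrame({'Type': ['Standard'] * 3 + ['RI'] * 3,
                   'VC': [2, 16, 52, 10, 10, 10],
                   'C':  [2, 13, 35, 10, 15, 15],
                   'B':  [2, 0, 2, 0, 31, 31],
                   'Security': list('ABCABC')})

# Pivot so Security becomes the index, transpose, sort both index
# levels descending, then swap the levels so Type comes first.
res = (df.pivot(index='Security', columns='Type').T
         .sort_index(level=[1, 0], ascending=[False, False])
         .swaplevel(0))
res.columns.name = None
res.index.names = ['Type', 'Subtype']
print(res)
```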

Pandas groupby filter only last two rows

I am working on a pandas manipulation and want to select, for each group in column "A", only the last two rows by column "B".
How can I do this without reset_index and filter (i.e. inside the groupby)?
import pandas as pd
df = pd.DataFrame({
'A': list('aaabbbbcccc'),
'B': [0,1,2,5,7,2,1,4,1,0,2],
'V': range(10,120,10)
})
df
My attempt
df.groupby(['A','B'])['V'].sum()
Required output
A  B
a  1     20
   2     30
b  5     40
   7     50
c  2    110
   4     80
IIUC, you want to get the rows with the two highest B per A.
You can compute a descending rank per group and keep those ≤ 2.
df[df.groupby('A')['B'].rank('first', ascending=False).le(2)]
Output:
A B V
1 a 1 20
2 a 2 30
3 b 5 40
4 b 7 50
7 c 4 80
10 c 2 110
Try:
df.sort_values(['A', 'B']).groupby(['A']).tail(2)
Output:
A B V
1 a 1 20
2 a 2 30
3 b 5 40
4 b 7 50
10 c 2 110
7 c 4 80
def function1(dd: pd.DataFrame):
    return dd.sort_values('B').iloc[-2:, 1:]

df.groupby(['A']).apply(function1).droplevel(1)
Output:
B V
A
a 1 20
a 2 30
b 5 40
b 7 50
c 2 110
c 4 80
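A further variant, sketched with nlargest (it should match the rank-based answer on this data, though ties behave slightly differently): nlargest(2) returns the two highest B per group, and the second level of its index holds the original row labels, which loc can use to pull the full rows:

```python
import pandas as pd

df = pd.DataFrame({
    'A': list('aaabbbbcccc'),
    'B': [0, 1, 2, 5, 7, 2, 1, 4, 1, 0, 2],
    'V': range(10, 120, 10),
})

# The two highest B per group; level 1 of the result's MultiIndex
# holds the original row labels.
idx = df.groupby('A')['B'].nlargest(2).index.get_level_values(1)
res = df.loc[idx].sort_index()
print(res)
```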

Map two pandas dataframe and add a column to the first dataframe

I have posted two sample dataframes below. I would like to map one column of a DataFrame against the index of a column in another DataFrame and place the looked-up values back in the first DataFrame, as shown below:
import numpy as np
import pandas as pd

A = np.array([0, 1, 1, 3, 5, 2, 5, 4, 2, 0])
B = np.array([55, 75, 86, 98, 100, 111])
df1 = pd.Series(A, name='data').to_frame()
df2 = pd.Series(B, name='values_for_replacement').to_frame()
Below is the first DataFrame, df1:
data
0 0
1 1
2 1
3 3
4 5
5 2
6 5
7 4
8 2
9 0
And below is the second DataFrame, df2:
values_for_replacement
0 55
1 75
2 86
3 98
4 100
5 111
Below is the output needed (mapped with respect to the index of df2):
data new_data
0 0 55
1 1 75
2 1 75
3 3 98
4 5 111
5 2 86
6 5 111
7 4 100
8 2 86
9 0 55
I would kindly like to know how one can achieve this using some pandas function like map.
Looking forward to some answers. Many thanks in advance.
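One way to do this, sketched with Series.map: when map is given a Series, it looks each value up by that Series' index, which is exactly the relationship described in the question:

```python
import numpy as np
import pandas as pd

A = np.array([0, 1, 1, 3, 5, 2, 5, 4, 2, 0])
B = np.array([55, 75, 86, 98, 100, 111])
df1 = pd.Series(A, name='data').to_frame()
df2 = pd.Series(B, name='values_for_replacement').to_frame()

# Each value in 'data' is treated as an index label into df2's column.
df1['new_data'] = df1['data'].map(df2['values_for_replacement'])
print(df1)
```

Values in data with no matching label in df2's index would come back as NaN.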

How can I normalize data in a pandas dataframe to the starting value of a time series?

I would like to analyze a dataset from a clinical study using pandas.
Patients come to the clinic at different visits and some parameters are measured. I would like to normalize the blood parameters to the values of the first visit (baseline values), i.e. Normalized = Parameter[Visit X] / Parameter[Visit 1].
The dataset looks roughly like the following example:
import pandas as pd
import numpy as np
rng = np.random.RandomState(0)
df = pd.DataFrame({'Patient': ['A','A','A','B','B','B','C','C','C'],
'Visit': [1,2,3,1,2,3,1,2,3],
'Parameter': rng.randint(0, 100, 9)},
columns = ['Patient', 'Visit', 'Parameter'])
df
Patient Visit Parameter
0 A 1 44
1 A 2 47
2 A 3 64
3 B 1 67
4 B 2 67
5 B 3 9
6 C 1 83
7 C 2 21
8 C 3 36
Now I would like to add a column that contains each parameter normalized to the baseline value, i.e. the value at Visit 1. The simplest thing would be to add a column that contains only the Visit 1 value for each patient, and then divide the parameter column by this added column. However, I fail to create such a column holding the baseline value for each respective patient. But maybe there are also one-line solutions without adding another column.
The result should look like this:
Patient Visit Parameter Normalized
0 A 1 44 1.0
1 A 2 47 1.07
2 A 3 64 1.45
3 B 1 67 1.0
4 B 2 67 1.0
5 B 3 9 0.13
6 C 1 83 1.0
7 C 2 21 0.25
8 C 3 36 0.43
IIUC, GroupBy.transform
df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
.transform('first'))
print(df)
Patient Visit Parameter Normalized
0 A 1 44 1.000000
1 A 2 47 1.068182
2 A 3 64 1.454545
3 B 1 67 1.000000
4 B 2 67 1.000000
5 B 3 9 0.134328
6 C 1 83 1.000000
7 C 2 21 0.253012
8 C 3 36 0.433735
df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
.transform('first')).round(2)
print(df)
Patient Visit Parameter Normalized
0 A 1 44 1.00
1 A 2 47 1.07
2 A 3 64 1.45
3 B 1 67 1.00
4 B 2 67 1.00
5 B 3 9 0.13
6 C 1 83 1.00
7 C 2 21 0.25
8 C 3 36 0.43
If you need to create a new DataFrame:
df2 = df.assign(Normalized = df['Parameter'].div(df.groupby('Patient')['Parameter'].transform('first')))
You could also use a lambda, as suggested.
Or:
df2 = df.copy()
df2['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
.transform('first'))
What #ansev said: GroupBy.transform
If you wish to preserve the Parameter column, just run the last line he wrote but with Normalized instead of Parameter as the new column name:
df = df.assign(Normalized = lambda x: x['Parameter'].div(x.groupby('Patient')['Parameter'].transform('first')))
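The explicit helper-column route described in the question also works; transform('first') is precisely what builds that column (a sketch on the sample data, with the baseline kept as its own column):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'Patient': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'Visit': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'Parameter': rng.randint(0, 100, 9)})

# Broadcast each patient's first (baseline) value onto all of their rows,
# then divide to normalize.
df['Baseline'] = df.groupby('Patient')['Parameter'].transform('first')
df['Normalized'] = df['Parameter'] / df['Baseline']
print(df)
```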

How to remove ugly row in pandas.dataframe

So I am filling DataFrames from 2 different files. While those 2 files should have the same structure (the values should be different, though), the resulting DataFrames look different. When printing them I get:
a b c d
0 70402.14 70370.602112 0.533332 98
1 31362.21 31085.682726 1.912552 301
... ... ... ... ...
753919 64527.16 64510.008206 0.255541 71
753920 58077.61 58030.943621 0.835758 152
a b c d
index
0 118535.32 118480.657338 0.280282 47
1 49536.10 49372.999416 0.429902 86
... ... ... ... ...
753970 52112.95 52104.717927 0.356051 116
753971 37044.40 36915.264944 0.597472 165
So in the second DataFrame there is that "index" row that doesn't make any sense to me, and it causes trouble in my following code. I neither wrote the code that fills the files into the DataFrames nor created those files. So I am rather interested in checking whether such a row exists and how I might be able to remove it. Does anyone have an idea about this?
The second dataframe has an index level named "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [130]: df.index.name = None
In [131]: df
Out[131]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
index 0 1 2
0 0 0 1 2
1 1 3 4 5
2 2 6 7 8
3 3 9 10 11
4 4 12 13 14
In [140]: df.reset_index().set_index('index')
Out[140]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The index is just the first column - it numbers the rows by default, but you can change it in a number of ways (e.g. filling it with values from one of the columns).
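If mutating the frame in place is undesirable, rename_axis should do the same thing as a chainable call (a small sketch):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(5, 3))
df.index.name = 'index'

# rename_axis(None) returns a copy with the index name cleared,
# so it fits into a method chain instead of assigning to df.index.name.
df = df.rename_axis(None)
print(df)
```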
