Python DataFrame using Columns as Index

I am relatively new to Python and I've run into a problem that I cannot seem to search my way out of. I have written a function to query a third-party API. The function runs as expected and retrieves the correct results. However, my dataframe comes back with the columns and rows transposed. I have used this same function before without issue. I know I can transpose the result into the correct position, but since I want to use this function as part of a larger function, it is important that the query return the values with the columns and rows as intended.
I've included my snippet below as well as the results and desired outcome.
import pandas as pd

def get_tiering():
    df = vendorAPI.getportfoliocustomcolumns('prod').as_dataframe()
    records = df.to_dict('records')
    return {rec['Patient']: rec for rec in records}

tiersdf = pd.DataFrame(get_tiering())
print(tiersdf)
Result: [6 rows x 198 columns]
Desired: [198 rows x 6 columns]
Is there some DataFrame setting that I inadvertently changed? I am using Spyder version 2.2 with Python 3.9. Any guidance you can provide would be appreciated.
Thank you for your time.

Did you try tiersdf.T?
This is my sample df
p1 p2 p3
height 65 66 5
weight 62 22 6
age 32 55 8
bp 12 44 6
hr 2 8 3
and this is what I got after the transpose (.T):
height weight age bp hr
p1 65 62 32 12 2
p2 66 22 55 44 8
p3 5 6 8 6 3

I would suggest you try to change
records = df.to_dict('records')
return {rec['Patient']: rec for rec in records}
to
return df.transpose().to_dict('series')
Please let me know if it works; otherwise, please share the exact output of vendorAPI.getportfoliocustomcolumns('prod').
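For what it's worth, the transposition comes from how pd.DataFrame treats a dict of dicts: the outer keys (here the patient names) become columns and the inner keys become the index. A minimal sketch with hypothetical fields, showing both that behaviour and from_dict(orient='index') as a way to keep one row per patient:
import pandas as pd

# hypothetical records keyed by patient, mimicking what get_tiering() returns
tiering = {
    'p1': {'Patient': 'p1', 'Tier': 1, 'Region': 'East'},
    'p2': {'Patient': 'p2', 'Tier': 3, 'Region': 'West'},
}

print(pd.DataFrame(tiering))                             # outer keys become columns: 3 rows x 2 columns
print(pd.DataFrame.from_dict(tiering, orient='index'))   # one row per patient: 2 rows x 3 columns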


Group by a category

I have run KMeans clustering and now I need to analyse each individual cluster. For example, look at cluster 1, see which clients are in it, and draw conclusions.
dfRFM['idcluster'] = num_cluster
dfRFM.head()
idcliente Recencia Frecuencia Monetario idcluster
1 3 251 44 -90.11 0
2 8 1011 44 87786.44 2
6 88 537 36 8589.57 0
7 98 505 2 -179.00 0
9 156 11 15 35259.50 0
How do I group so I only see results from, let's say, idcluster 0, sorted by, let's say, "Monetario"? Thanks!
To filter a dataframe, the most common way is to use df[df[colname] == val]. Then you can use df.sort_values(). In your case, that would look like this:
dfRFM_id0 = dfRFM[dfRFM['idcluster']==0].sort_values('Monetario')
The way this filtering works is that dfRFM['idcluster'] == 0 returns a Series of True/False values. We then effectively have dfRFM[(True, False, True, True, ...)], so the dataframe returns only the rows where the mask is True; that is, it selects the data where the condition holds.
I think you actually just need to filter your DF!
df_new = dfRFM[dfRFM.idcluster == 0]
and then sort by Monetario
df_new = df_new.sort_values(by = 'Monetario')
Group by is really best when you want to look at a cluster as a whole - for example, to see the average values of Recencia, Frecuencia, and Monetario for all of group 0.
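To illustrate that, a small sketch using the column names from the question (the dfRFM frame is assumed to be the one shown above):
# mean of each metric per cluster - a per-cluster summary rather than a row listing
cluster_summary = dfRFM.groupby('idcluster')[['Recencia', 'Frecuencia', 'Monetario']].mean()
print(cluster_summary)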

Creating a new dataframe with Manhattan distance in Python

I need to create a dataframe containing the Manhattan distance between two dataframes with the same columns, and I need the indexes of each dataframe to become the index and column names of the result. For example, let's say I have these two dataframes:
x_train :
index a b c
11 2 5 7
23 4 2 0
312 2 2 2
x_test :
index a b c
22 1 1 1
30 2 0 0
The columns match but the sizes and indexes do not. The expected dataframe would look like this:
dist_dataframe:
index 11 23 312
22 11 5 3
30 12 4 4
and what I have right now is this:
def manhattan_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

def calc_distance(X_test, X_train):
    dist_dataframe = pd.DataFrame(index=X_test.index, columns=X_train.index)
    for i in X_train.index:
        for j in X_test.index:
            dist_dataframe.loc[i, j] = manhattan_distance(X_train.loc[[i]], X_test.loc[[j]])
    return dist_dataframe
what I get from the code I have is this dataframe:
dist_dataframe:
index
index 11 23 312
22 NaN NaN NaN
30 NaN NaN NaN
I get the right dataframe size, except that it has two header rows labelled index, which come from the creation of the new dataframe. I also get an error no matter what I do on the Manhattan calculation line. Can anyone help me out here, please?
Problem in your code
There is a very small problem in your code, i.e. accessing values in dist_dataframe. Instead of dist_dataframe.loc[i,j], you should reverse the order of i and j and write dist_dataframe.loc[j,i].
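A sketch of the corrected loop (it also passes Series via .loc[i] rather than one-row DataFrames via .loc[[i]], which is my reading of the error the question mentions, since zipping two DataFrames iterates column labels instead of values):
import pandas as pd

def manhattan_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

def calc_distance(X_test, X_train):
    dist_dataframe = pd.DataFrame(index=X_test.index, columns=X_train.index)
    for i in X_train.index:
        for j in X_test.index:
            # row label first (test index j), column label second (train index i)
            dist_dataframe.loc[j, i] = manhattan_distance(X_train.loc[i], X_test.loc[j])
    return dist_dataframe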
More efficient solution
It will work fine, but since you are a new contributor, I would also like to point out the efficiency of your code. Always try to replace loops with pandas built-in functions; since they are implemented in C, they are much faster. Here is a more efficient solution:
def manhattan_distance(a, b):
    return sum(abs(e1 - e2) for e1, e2 in zip(a, b))

def xtrain_distance(row):
    distances = {}
    for i, each in x_train.iterrows():
        distances[i] = manhattan_distance(each, row)
    return distances

result = x_test.apply(xtrain_distance, axis=1)
# converting into a dataframe
pd.DataFrame(dict(result)).transpose()
It produces the same output, and on your example you won't see any time difference. But when run on larger data (the same data scaled 20 times, i.e. 60 x_train samples and 40 x_test samples), here is the time difference:
Your solution took: 929 ms
This solution took: 207 ms
It got 4x faster just by eliminating one for loop. Note that it can be made more efficient, but for the sake of demonstration I have used this solution.
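As a side note on the "can be made more efficient" remark: if SciPy is available, the whole matrix can be computed without any Python-level loop (a sketch, assuming x_train and x_test contain only the numeric feature columns):
import pandas as pd
from scipy.spatial.distance import cdist

# pairwise Manhattan (cityblock) distances in one vectorised call;
# rows are x_test indexes, columns are x_train indexes, as in the expected output
dist_dataframe = pd.DataFrame(
    cdist(x_test.values, x_train.values, metric='cityblock'),
    index=x_test.index,
    columns=x_train.index,
)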

Comparing values in two dataframes and generate report if difference is greater set point

I have two data frames (master and slave) that look like the ones below.
# Master
C D E F G
0 5 44 4.0 33 22
1 1 0 4.5 565 11
# Slave
C D E F G
0 5 44 4.0 33.0 22
1 1 4 6.5 562.5 10
Expected results (highlight the cells where the difference is > 1):
C D E F G
0 5 44 4.0 33.0 22
1 1 4 6.5 562.5 10
Where 4, 6.5, 562.5 are highlighted
I would like to compare the two data frames and highlight, in a newly created data frame, the cells where the difference exceeds the SET VALUE (> 1). The SET VALUE = 1 is constant for the entire data frame.
Please note that the difference should be based on the absolute value, i.e. ABS(master - slave).
I would like to use the numpy np.isclose function to achieve my goal.
This should work for bigger data frames with 200 rows and 300 columns; the data frame displayed here is kept small for clarity.
Cell D2: highlight is required since (D2_master) - (D2_slave) = 0 - 4 = -4
Cell E2: highlight is required since (E2_master) - (E2_slave) = 4.5 - 6.5 = -2
Cell F2: highlight is required since (F2_master) - (F2_slave) = 565 - 562.5 = 2.5
Cell G2: no highlight since (G2_master) - (G2_slave) = 11 - 10 = 1 (the difference is within the limit)
I just started coding in Python and using pandas on my own, and I admit I am a bit lost.
Thanks for reading all this, and thanks in advance for any suggestions and feedback!
Code
for ind, row in dfmaster.iterrows():
    print(row)
    (dfmaster.iloc()) = np.isclose((dfmaster.iloc()), (dfmaster.iloc()), atol=1)  # .any()
Let's try Styler.apply:
def highlight_error(df):
    return pd.DataFrame(np.where(df.sub(slave).abs() > 1, 'background-color:red', ''),
                        df.index, df.columns)

master.style.apply(highlight_error, axis=None)
In a Jupyter notebook, the cells 4, 6.5 and 562.5 in the second row are rendered with a red background (image omitted).
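Since the question specifically mentions np.isclose, the same highlighting mask can also be built with it (a sketch reusing the master/slave frames and the Styler approach above; atol=1 is the SET VALUE, and rtol=0 so only the absolute tolerance applies):
import numpy as np
import pandas as pd

def highlight_error_isclose(df):
    # np.isclose(..., atol=1, rtol=0) is True where |master - slave| <= 1;
    # negating it marks exactly the cells whose difference exceeds 1
    mask = ~np.isclose(df.values, slave.values, atol=1, rtol=0)
    return pd.DataFrame(np.where(mask, 'background-color:red', ''),
                        df.index, df.columns)

master.style.apply(highlight_error_isclose, axis=None)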

Pandas groupby expanding optimization of syntax

I am using the data from the example shown here:
http://pandas.pydata.org/pandas-docs/stable/groupby.html. Go to the subheading: New syntax to window and resample operations
At the command prompt, the new syntax works as shown in the pandas documentation. But I want to add a new column with the expanded data to the existing dataframe, as would be done in a saved program.
Before a syntax upgrade to the groupby expanding code, I was able to use the following single line code:
df = pd.DataFrame({'A': [1] * 10 + [5] * 10, 'B': np.arange(20)})
df['Sum of B'] = df.groupby('A')['B'].transform(lambda x: pd.expanding_sum(x))
This gives the expected results, but also gives an 'expanding_sum is deprecated' message. Expected results are:
A B Sum of B
0 1 0 0
1 1 1 1
2 1 2 3
3 1 3 6
4 1 4 10
5 1 5 15
6 1 6 21
7 1 7 28
8 1 8 36
9 1 9 45
10 5 10 10
11 5 11 21
12 5 12 33
13 5 13 46
14 5 14 60
15 5 15 75
16 5 16 91
17 5 17 108
18 5 18 126
19 5 19 145
I want to use the new syntax to replace the deprecated syntax. If I try the new syntax, I get the error message:
df['Sum of B'] = df.groupby('A').expanding().B.sum()
TypeError: incompatible index of inserted column with frame index
I did some searching on here, and saw something that might have helped, but it gave me a different message:
df['Sum of B'] = df.groupby('A').expanding().B.sum().reset_index(level = 0)
ValueError: Wrong number of items passed 2, placement implies 1
The only way I can get it to work is to assign the result to a temporary df, then merge the temporary df into the original df:
temp_df = df.groupby('A').expanding().B.sum().reset_index(level = 0).rename(columns = {'B' : 'Sum of B'})
new_df = pd.merge(df, temp_df, on = 'A', left_index = True, right_index = True)
print (new_df)
This code gives the expected results as shown above.
I've tried different variations using transform as well, but have not been able to come up with coding this in one line as I did before the deprecation. Is there a single line syntax that will work? Thanks.
It seems you need a cumsum:
df.groupby('A')['B'].cumsum()
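Assigned back to the frame in one line (a grouped cumulative sum is exactly the expanding sum within each group):
df['Sum of B'] = df.groupby('A')['B'].cumsum()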
TL;DR
df['Sum of B'] = df.groupby('A')['B'].transform(lambda x: x.expanding().sum())
Explanation
We start from the offending line:
df.groupby('A')['B'].transform(lambda x: pd.expanding_sum(x))
Let's read carefully the warning you mentioned:
FutureWarning: pd.expanding_sum is deprecated for Series and will be
removed in a future version, replace with
Series.expanding(min_periods=1).sum()
After reading the Pandas 0.17.0 docs for pandas.expanding_sum, it becomes clear that the Series the warning is talking about is the first parameter of pd.expanding_sum, i.e. in our case x.
Now we apply the code transformation suggested in the warning. So pd.expanding_sum(x) becomes x.expanding(min_periods=1).sum().
According to the Pandas 0.22.0 docs for pandas.Series.expanding, min_periods has a default value of 1, so in your case it can be omitted altogether, hence the final result.
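To tie the two answers together, a quick check (using the frame from the question) that the cumsum one-liner and the transform/expanding one-liner produce the same column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1] * 10 + [5] * 10, 'B': np.arange(20)})

via_cumsum = df.groupby('A')['B'].cumsum()
via_expanding = df.groupby('A')['B'].transform(lambda x: x.expanding().sum())

# expanding().sum() returns floats while cumsum keeps ints, so compare values
assert (via_cumsum == via_expanding).all()
df['Sum of B'] = via_cumsum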

Filling in missing data in Python

I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value, it will move from X1 to X2 which are separated several millimeters apart. Then it will move to the next line and over again.
I am able to process the data in python with pandas/numpy but it doesn't work too well when there are any missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however, a line or a few will be missing, making it impossible to process and plot. Currently I have not been able to fix this automatically and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate property 1 and 2 with values of 0. This can be time consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line)? For instance, by having an 'if' condition that writes each line between X(start) and X(end) to a separate array? I've tried doing that, but with no success.
I've attached my current (crude) code:
df = pd.read_csv('FileName.csv', delimiter=',', skiprows=0)
rows = [-1] + np.where(df['X'] == 'New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r+1: rows[i+1]]
maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
    data2.append(dff[yaxes].ix[:, 2])
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the 'reindex' and 'reset_index' functions, but I haven't managed to make them work.
I would appreciate any suggestions.
Does this meet what you want?
Q1: fill X using reindex, and the other columns using fillna.
Q2: Passing each chunk to read_csv via StringIO is easier (the unicode call is Python 2; see the Python 3 sketch after the output below).
# read the file and split the input on the 'New' separator
f = open('temp.csv', 'r')
chunks = f.read().split('New')

# read each chunk as a separate dataframe, using the first column (X) as index
dfs = [pd.read_csv(StringIO(unicode(chunk)), header=None, index_col=0) for chunk in chunks]

def pad(df):
    # reindex; you should know the range of x
    df = df.reindex(np.arange(44, 48))
    # pad y forward / backward, assuming y has a single value per chunk
    df[1] = df[1].fillna(method='bfill')
    df[1] = df[1].fillna(method='ffill')
    # pad the remaining columns with zeros
    df = df.fillna(0)
    # move the index back into a column
    return df.reset_index(drop=False)

dfs = [pad(df) for df in dfs]
dfs[0]
# 0 1 2 3
# 0 44 11 500 1
# 1 45 11 120 2
# 2 46 11 320 3
# 3 47 11 700 4
# dfs[1]
# 0 1 2 3
# 0 44 12 0 0
# 1 45 12 100 5
# 2 46 12 1500 6
# 3 47 12 2500 7
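The snippet above is written for Python 2 (hence the unicode call). Under Python 3, where every str is already Unicode, the reading part becomes (a sketch; the pad function is unchanged):
from io import StringIO
import pandas as pd

with open('temp.csv', 'r') as f:
    chunks = f.read().split('New')

dfs = [pd.read_csv(StringIO(chunk), header=None, index_col=0) for chunk in chunks]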
First Question
I've included print statements inside the function to show how it works (commented out below).
In [89]:
def replace_missing(df, Ids):
    # check which values are missing
    missing = np.setdiff1d(Ids, df[0])
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        #print('---missing df---')
        #print(missing_df)
        missing_df[0] = missing
        #print('---missing df---')
        #print(missing_df)
        missing_df[1].replace(0, df[1].iloc[0], inplace=True)
        #print('---missing df---')
        #print(missing_df)
        df = pd.concat([df, missing_df])
        #print('---final df---')
        #print(df)
    return df
In [91]:
Ids = np.arange(44,48)
final_df = df1.groupby(df1[1] , as_index = False).apply(replace_missing , Ids).reset_index(drop = True)
final_df
Out[91]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
45 12 100 5
46 12 1500 6
47 12 2500 7
44 12 0 0
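Note that the filled-in row (44, 12, 0, 0) ends up at the bottom of its group because pd.concat appends it. If the original X-then-Y ordering matters for plotting, a sort restores it (a small sketch using the answer's integer column labels, where column 1 is Y and column 0 is X):
final_df = final_df.sort_values([1, 0]).reset_index(drop=True)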
Second question
In [92]:
group = final_df.groupby(final_df[1])
In [99]:
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Out[104]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
