How would I pivot this basic table using pandas? - python

What I want is this:
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
48944305 A02AG A03AX13 N02BE01 R05X NaN NaN NaN
I don't know how many atc_1...atc_7...?atc_100 columns there will need to be in advance. I just need to gather all associated atc_codes into one row with each visit_id.
This seems like a group_by and then a pivot but I have tried many times and failed. I also tried to self-join a la SQL using pandas' merge() but that doesn't work either.
The end result is that I will paste together atc_1, atc_7, ... atc_100 to form one long atc_code. This composite atc_code will be my "Y" or "labels" column of my dataset that I am trying to predict.
Thank you!

Use cumcount first for count values per groups which create columns by function pivot. Then add missing columns with reindex_axis and change column names by add_prefix. Last reset_index:
g = df.groupby('visit_id').cumcount() + 1
print (g)
0 1
1 2
2 3
3 4
4 5
5 1
6 2
7 3
8 4
dtype: int64
df = pd.pivot(index=df['visit_id'], columns=g, values=df['atc_code'])
.reindex_axis(range(1, 8), 1)
.add_prefix('atc_')
.reset_index()
print (df)
visit_id atc_1 atc_2 atc_3 atc_4 atc_5 atc_6 atc_7
0 48944282 A02AG J01CA04 J095AX02 N02BE01 R05X NaN NaN
1 48944305 A02AG A03AX13 N02BE01 R05X None NaN NaN

Related

Merge the all elements of multiple columns into one column in series while keeping NaNs

Context: I have 5 years of weight data. The first column is the date (month and day), the succeeding columns are the years with corresponding weight for each day of the month. I want to have a full plot of all of my data among other things and so I want to combine all into just two columns. First column is the dates from 2018 to 2022, then the second column is the corresponding weight to each date. I have managed the date part, but can't combine the weight data. In essence, I want to turn ...
0 1
0 1 4.0
1 2 NaN
2 3 6.0
Into ...
0
0 1
1 2
2 3
3 4
4 NaN
5 6.0
pd.concat only puts the year columns next to each other. .join, .merge, melt, stack. agg don't work either. How do I do this?
sample code:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'2018': [1, 2, 3]})
df2 = pd.DataFrame({'2019': [4, np.NaN, 6]})
merged_df = pd.concat([df1,df2],axis=1, ignore_index=True, levels = 0)
print(merged_df)
P.S. I particularly don't want to input any index names (like id_vars="2018") because I want this process to be automated as the years go by with more data.
concat, merge, melt, join, stack, agg. i want to combine all column data into just one series
I think np.ravel(merged_df,order='F') will do the job for you.
If you want it in the form of a dataframe then pd.DataFrame(np.ravel(merged_df,order='F')).
It's not fully clear what's your I/O but based on your first example, you can use concat like this :
pd.concat([df["0"], df["1"].rename("0")], ignore_index=True)
Output :
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
5 6.0
Name: 0, dtype: float64

How can I create a new column that could tell me if specific columns have NaN values

Saying that I have a DataFrame with 6 Columns and I want to add a new colum that could give me a 1 when column 3 to 6 is NaN and a 0 when not all of them are NaN
like this
How Can I do that?
Or How could I remove those rows where all those columns are NaN except from the fisrt 2 columns?
IIUC, here's one way:
df['<New Col Name>'] = df.set_index(['Date', 'Name']).isna().all(1).astype(int)

How Can I drop a column if the last row is nan

I have found examples of how to remove a column based on all or a threshold but I have not been able to find a solution to my particular problem which is dropping the column if the last row is nan. The reason for this is im using time series data in which the collection of data doesnt all start at the same time which is fine but if I used one of the previous solutions it would remove 95% of the dataset. I do however not want data whose most recent column is nan as it means its defunct.
A B C
nan t x
1 2 3
x y z
4 nan 6
Returns
A C
nan x
1 3
x z
4 6
You can also do something like this
df.loc[:, ~df.iloc[-1].isna()]
A C
0 NaN x
1 1 3
2 x z
3 4 6
Try with dropna
df = df.dropna(axis=1, subset=[df.index[-1]], how='any')
Out[8]:
A C
0 NaN x
1 1 3
2 x z
3 4 6
You can use .iloc, .loc and .notna() to sort out your problem.
df = pd.DataFrame({"A":[np.nan, 1,"x",4],
"B":["t",2,"y",np.nan],
"C":["x",3,"z",6]})
df = df.loc[:,df.iloc[-1,:].notna()]
You can use a boolean Series to select the column to drop
df.drop(df.loc[:,df.iloc[-1].isna()], axis=1)
Out:
A C
0 NaN x
1 1 3
2 x z
3 4 6
for i in range(temp_df.shape[1]):
if temp_df.iloc[-1,i] == 'nan':
temp_df = temp_df.drop(i,1)
This will work for you.
Basically what I'm doing here is looping over all columns and checking if last entry is 'nan', then dropping that column.
temp_df.shape[1]
this is the numbers of columns.
pandas.df.drop(i,1)
i represents the column index and 1 represents that you want to drop the column.
EDIT:
I read the other answers on this same post and it seems to me that notna would be best (I would use it), but the advantage of this method is that someone can compare anything they wish to.
Another method I found is isnull() which is a function in the pandas library which will work like this:
for i in range(temp_df.shape[1]):
if temp_df.iloc[-1,i].isnull():
temp_df = temp_df.drop(i,1)

comparing each value in two columns

How can I compare two columns in a dataframe and create a new column based on the difference of those two columns efficiently?
I have a feature in my table that has a lot of missing values and I need to backfill those information by using other tables in the database that contain that same feature. I have used np.select to compare the feature in my original table with the same feature in other table, but I feel like there should be an easy method.
Eg: pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
I expect the new column to contain values [1,2,"different",4,np.nan]. Any help will be appreciated!
pandas.Series.combine_first or pandas.DataFrame.combine_first could be useful here. These operate like a SQL COALESCE and combine the two columns by choosing the first non-null value if one exists.
df = pd.DataFrame({'A': [1,2,3,4,np.nan], 'B':[1,np.nan,30,4,np.nan]})
C = df.A.combine_first(df.B)
C looks like:
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
Then, to capture your requirement that two different non-null values should give "different" when combined, just find those indices and update the values.
mask = ~df.A.isna() & ~df.B.isna() & (df.A != df.B)
C[mask] = 'different'
C now looks like:
0 1
1 2
2 different
3 4
4 NaN
Another way is to use pd.DataFrame.iterrows with nunique:
import pandas as pd
df['C'] = [s['A'] if s.nunique()<=1 else 'different' for _, s in df.iterrows()]
Output:
A B C
0 1.0 1.0 1
1 2.0 NaN 2
2 3.0 30.0 different
3 4.0 4.0 4
4 NaN NaN NaN

Python adding two dataframes based on index (edited)

(no idea how to introduce a matrix here for readability)
I have two dataframes obtained with Panda and Python.
df1 = pd.DataFrame({'Index': ['0','1','2'], 'number':[3,'dd',1], 'people':[3,'s',3]})
df1 = df1.set_index('Index')
df2 = pd.DataFrame({'Index': ['0','1','2'], 'quantity':[3,2,'hi'], 'persons':[1,5,np.nan]})
I would like to sum the quantities of columns based on Index. Columns do not have the same name and may contain strings. (I have in fact 50 columns on each df). I want to consider nan as 0. The result should look:
df3
Index column 1 column 2
0 6 4
1 nan nan
2 nan nan
I was wondering how could this be done.
Note:
For sure a double while or for would do the trick, just not very elegant...
indices=0
columna=0
while indices<len(df.index)-1:
while columna<numbercolumns-1:
df3.iloc[indices,columna]=df1.iloc[indices,columna] +df2.iloc[indices,columna]
indices += 1
columna += 1
Thank you.
You can try of concatenating both dataframes, then add based on the index group
df1.columns = df.columns
df1.people = pd.to_numeric(df1.people,errors='coerce')
pd.concat([df,df1]).groupby('Index').sum()
Out:
number people
Index
A 8 5.0
B 2 2.0
C 2 5.0
F 3 3.0

Categories

Resources