Dividing two columns of an unstacked dataframe - python

I have two columns in a pandas dataframe.
Column 1 is ed and contains strings:
ed column = ['a','a','b','c','c','a']
Column 2 is job and also contains strings:
job column = ['aa','bb','aa','aa','bb','cc']
I then generate a two column frequency table like this:
my_counts = pdata.groupby(['ed', 'job']).size().unstack().fillna(0)
How do I then divide the frequencies in one column by the frequencies in another column of that frequency table? I want to take that ratio and use it with argsort() so that I can sort by the calculated ratio, but I don't know how to reference each column of the resulting table.

I initialized the data as follows:
ed_col = ['a','a','b','c','c','a']
job_col = ['aa','bb','aa','aa','bb','cc']
pdata = pd.DataFrame({'ed':ed_col, 'job':job_col})
my_counts = pdata.groupby(['ed', 'job']).size().unstack().fillna(0)
Now my_counts looks like this:
job  aa  bb  cc
ed
a     1   1   1
b     1   0   0
c     1   1   0
To access a column, you could use my_counts.aa or my_counts['aa'].
To access a row, you could use my_counts.loc['a'].
So the frequencies of aa divided by bb are my_counts['aa'] / my_counts['bb']
and now, if you want to get it sorted, you can do:
my_counts.iloc[(my_counts['aa'] / my_counts['bb']).argsort()]
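One thing to watch: where a bb count is 0 (the b row here), the ratio comes out as inf, which simply sorts last. If label-based sorting is preferred over positional argsort(), a roughly equivalent sketch:

ratio = my_counts['aa'] / my_counts['bb']        # the 'b' row gives inf because its bb count is 0
sorted_counts = my_counts.loc[ratio.sort_values().index]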

Related

Pandas dataframe groupby and aggregate with conditions

Is there a way to group my dataframe based on specific columns and also include an empty value, but only when all of the values of that specific column are empty?
Example:
I have a dataframe that looks like this:
I am trying to group the dataframe based on Name and Subject.
and my expected output looks like this:
So, if a person takes more than one subject but one of them is empty, drop that row so it won't be included when aggregating the other rows. If a person takes only one subject and it is empty, then don't drop the row.
[Updated]
Original dataframe
The outcome will still be the same. It will take the first row's value if all of a person's subjects are empty.
[Updated] Another new dataframe
The outcome will have the same number of subjects, but there will be 3 years.
Here is a proposition with GroupBy.agg:
df = df.drop_duplicates(subset=["ID", "Name", "Subject"])
m = (df.groupby(["ID", "Name"])["Subject"].transform("size").gt(1)
     & df["Subject"].isnull())
out = df.loc[~m].groupby(["ID", "Name"], as_index=False).agg(list)
Output:
print(out)
   ID Name          Subject    Year
0   1   CC  [Math, English]  [1, 3]
1   2   DD        [Physics]     [2]
2   3   EE      [Chemistry]     [1]
3   4   FF            [nan]     [0]
4   5   GG            [nan]     [0]
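The question's dataframes were posted as images; a hypothetical input consistent with the output above (the exact values are assumed) might look like:

import numpy as np
import pandas as pd

# Assumed input; the original was posted as an image, so these values are a guess
df = pd.DataFrame({
    "ID":      [1, 1, 1, 2, 3, 4, 5],
    "Name":    ["CC", "CC", "CC", "DD", "EE", "FF", "GG"],
    "Subject": ["Math", "English", np.nan, "Physics", "Chemistry", np.nan, np.nan],
    "Year":    [1, 3, 0, 2, 1, 0, 0],
})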

Column by column pairplotting of 2 dataframes

I want to be able to plot two dataframes against each other pairing each column successively (but not all columns against all columns). The dataframes are identical in size and column headers but differ in the values. So the dataframes are of the form:
df_X =
   A  B  C
0  1  1  1
1  2  2  2
...
df_Y =
   A  B  C
0  3  3  3
1  4  4  4
...
At the moment I can do this manually on subplots by starting with a merged dataframe that has a two-level column header:
df_merge =
col   A     B     C
      X  Y  X  Y  X  Y
0     1  3  1  3  1  3
1     2  4  2  4  2  4
...
col = ['A', 'B', 'C']
_, ax = plt.subplots(3, 1)
for i in range(3):
    ax[i].scatter(df_merge[col[i]]['X'], df_merge[col[i]]['Y'])
This works, but I am wondering if there is a better way of achieving this, particularly when trying to then calculate the numerical correlation value between the pairs, which would again involve another loop and several more lines of code.
You can get correlation with something like:
df_merge[["{}_X".format(col[i]), "{}_Y".format(col[i])]].corr()
You can generally assume that most statistical functions can be applied in a single line to dataframe content either with built-in Pandas functions (https://pandas.pydata.org/docs/user_guide/computation.html), or scipy/numpy functions which you can apply.
To title each plot with the correlation, for example, you can do:
thisAX.set_title("Corr: {}".format(df_merge[["{}_X".format(col[i]), "{}_Y".format(col[i])]].corr()))
(I flattened your column names to make display a bit simpler, and I reversed one of the number pairs to show negative correlation)
Note: when feeding a two-column selection into .corr(), you'll get a dataframe returned - to get the X:Y correlation, you can pick out a single value with .corr()["{}_X".format(col[i])]["{}_Y".format(col[i])] (those are just the column and index names of the correlation)
Here's a lightly styled version of the same plot (again, using the flattened version of your column names)
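As a rough alternative sketch, keeping df_X and df_Y as separate frames (column names as above, toy values assumed for illustration), the pairing and the per-pair correlation can be done without the merged frame:

import matplotlib.pyplot as plt
import pandas as pd

df_X = pd.DataFrame({'A': [1, 2], 'B': [1, 2], 'C': [1, 2]})
df_Y = pd.DataFrame({'A': [3, 4], 'B': [4, 3], 'C': [3, 4]})  # one pair reversed for variety

fig, axes = plt.subplots(len(df_X.columns), 1)
for ax, col in zip(axes, df_X.columns):
    ax.scatter(df_X[col], df_Y[col])
    r = df_X[col].corr(df_Y[col])           # Pearson correlation of the paired columns
    ax.set_title("{} Corr: {:.2f}".format(col, r))
plt.tight_layout()
plt.show()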

Calculate quantile for each observation in a dataframe

I am new to Python and I have the following dataframe structure:
data = {'name': ["a","b","c","d","e","f","g","h"], 'value1': [1,2,3,4,5,6,7,8],'value2': [1,2,3,4,5,6,7,8]}
data = pd.DataFrame.from_dict(data)
data = data.transpose()
What I want to calculate is a new dataframe where, for each row, each column holds the quantile (within that column's data) of the original value.
In other words, I am trying to understand how to apply pandas' quantile functionality to return a dataframe with each entry equal to the quantile of the corresponding column value.
I tried the following, but I don't think it works:
x.quantile(q=0.9, axis=0)
or:
x.apply(quantile,axis=0)
Many thanks in advance.
This is because you transpose your data. As per the pandas documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html):
When the DataFrame has mixed dtypes, we get a transposed DataFrame
with the object dtype
Your dataframe after loading looks like below, which means it has 'mixed dtypes' (one column is object / category and the other two are integers).
  name  value1  value2
0    a       1       1
1    b       2       2
2    c       3       3
3    d       4       4
4    e       5       5
5    f       6       6
6    g       7       7
7    h       8       8
When you transpose, everything is converted to the object dtype, which means the quantile function no longer treats the values as numbers.
Try removing the transpose step and use the axis argument to choose the direction along which quantiles are calculated.
By the way, you can do transposition with:
df = df.T
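A minimal sketch of both options on the untransposed data; the rank(pct=True) line is an assumption about what "quantile for each observation" is meant to be:

import pandas as pd

data = pd.DataFrame({'name': ["a", "b", "c", "d", "e", "f", "g", "h"],
                     'value1': [1, 2, 3, 4, 5, 6, 7, 8],
                     'value2': [1, 2, 3, 4, 5, 6, 7, 8]})

# Column-wise 0.9 quantile over the numeric columns (no transpose needed)
print(data[['value1', 'value2']].quantile(q=0.9, axis=0))

# If the goal is each observation's quantile within its own column,
# rank(pct=True) gives that directly (assumption about the intended result)
print(data[['value1', 'value2']].rank(pct=True))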

Copying (assembling) the column from smaller data frames into the bigger data frame with pandas

I have a data frame with measurements for several groups of participants, and I am doing some calculations for each group. I want to add a column in a big data frame (all participants), from secondary data frames (partial list of participants).
When I merge a couple of times (merging a new data frame into the existing one), it creates duplicate columns instead of a single one.
As the sizes of the dataframes are different, I cannot compare them directly.
I tried:
# df1 - main bigger dataframe, df2 - smaller dataframe containing a group of df1
for i in range(len(df1)):
    # checking indices to place the data with the correct participant:
    if df1.index[i] not in df2['index']:
        pass
    else:
        df1['rate'][i] = list(df2['rate'][df2['index'] == i])
It does not work properly though. Can you please help with the correct way of assembling the column?
Update: where the index of the main dataframe matches the "index" column of the calculated dataframe, copy the rate value from the calculation into the main df.
Main dataframe df1:
index  rate
1      0
2      0
3      0
4      0
5      0
6      0
Dataframe with calculated values:
index  rate
1      value
4      value
6      value
Output df:
index  rate
1      value
2      0
3      0
4      value
5      0
6      value
Try this – using .join() to merge dataframes on their indices and combining two columns using .combine_first():
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
EDIT:
This assumes both dataframes use a matching index. If that is not the case for df2, run this first:
df2 = df2.set_index('index')
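Put together with toy data matching the tables above (the calculated rates 0.5, 0.7, 0.9 are placeholder values), a sketch looks like:

import pandas as pd

# df1: main dataframe, rate initialised to 0
df1 = pd.DataFrame({'rate': [0, 0, 0, 0, 0, 0]}, index=[1, 2, 3, 4, 5, 6])

# df2: calculated rates for a subset of participants (placeholder values)
df2 = pd.DataFrame({'index': [1, 4, 6], 'rate': [0.5, 0.7, 0.9]}).set_index('index')

df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
print(df[["rate"]])   # rows 1, 4, 6 carry the calculated values, the rest keep 0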

How to append column values of one dataframe to column of another dataframe

I'm working with 2 dataframes, A & B. Dataframe A is populated with values, while dataframe B is empty except for a header structure.
I want to take the values of a column in dataframe A and append them to the corresponding column in dataframe B.
I've placed the values of the dataframe A column I want to append in a list. I've tried setting the destination column values to equal the list of start column values:
dataframeB['x'] = list(dataframeA['A'])
This yields the following error:
ValueError: Length of values does not match length of index
The result I expect is Dataframe A's column A transferring over to Dataframe B's column x.
Dataframe A:
A  B  C  D
1  2  3  4
1  2  3  4
Dataframe B:
x  y
-  -
Create the dataframe with the data already in it...
dataframeB = pd.DataFrame({'x': dataframeA['A']})
Then you can add columns in from the other dataframe:
dataframeB['y'] = dataframeA['B']
Result:
x y
1 2
1 2
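A self-contained sketch of the above, with dataframeA values assumed from the example:

import pandas as pd

# Hypothetical dataframeA matching the example values above
dataframeA = pd.DataFrame({'A': [1, 1], 'B': [2, 2], 'C': [3, 3], 'D': [4, 4]})

# Build dataframeB directly from the columns of interest
dataframeB = pd.DataFrame({'x': dataframeA['A']})
dataframeB['y'] = dataframeA['B']
print(dataframeB)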
