Pandas dataframe groupby and aggregate with conditions - python

Is there a way to group my dataframe by specific columns and include empty values as well, but only when all of the values in the specific column are empty?
Example:
I have a dataframe that looks like this:
I am trying to group the dataframe based on Name and Subject.
and my expected output looks like this:
So, if a person takes more than one subject but one of them is empty, then drop that row so it won't be included when aggregating the other rows. If a person takes only one subject and it is empty, then don't drop the row.
[Updated]
Original dataframe
The outcome will still be the same. It will take the first row's value if all subjects of a person are empty.
[Updated] Another new dataframe
The outcome will have the same number of subjects, but there will be 3 years.

Here is an approach with GroupBy.agg:
# drop exact duplicate (ID, Name, Subject) rows first
df = df.drop_duplicates(subset=["ID", "Name", "Subject"])
# mask: empty Subject rows belonging to a person who has more than one row
m = (df.groupby(["ID", "Name"])["Subject"].transform("size").gt(1)
     & df["Subject"].isnull())
# drop the masked rows, then aggregate the rest into lists
out = df.loc[~m].groupby(["ID", "Name"], as_index=False).agg(list)
Output:
print(out)
ID Name Subject Year
0 1 CC [Math, English] [1, 3]
1 2 DD [Physics] [2]
2 3 EE [Chemistry] [1]
3 4 FF [nan] [0]
4 5 GG [nan] [0]
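
To make this reproducible: the question's dataframes were posted as images, so here is a hypothetical input, reconstructed to be consistent with the output above, with the full snippet applied to it:

import numpy as np
import pandas as pd

# assumed sample input (reconstructed from the output shown above)
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3, 4, 5],
    "Name": ["CC", "CC", "CC", "DD", "EE", "FF", "GG"],
    "Subject": ["Math", "English", np.nan, "Physics", "Chemistry", np.nan, np.nan],
    "Year": [1, 3, 0, 2, 1, 0, 0],
})

df = df.drop_duplicates(subset=["ID", "Name", "Subject"])
m = (df.groupby(["ID", "Name"])["Subject"].transform("size").gt(1)
     & df["Subject"].isnull())
out = df.loc[~m].groupby(["ID", "Name"], as_index=False).agg(list)
print(out)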

Related

append or join value from one dataframe to every row in another dataframe in Pandas

I'm normally OK on the joining and appending front, but this one has got me stumped.
I've got one dataframe with only one row in it. I have another with multiple rows. I want to append the value from one of the columns of my first dataframe to every row of my second.
df1:
id  Value
1   word

df2:
id  data
1   a
2   b
3   c

Output I'm seeking:
df2:
id  data  Value
1   a     word
2   b     word
3   c     word
I figured that this was along the right lines, but it listed out NaN for all rows:
df2 = df2.append(df1[df1['Value'] == 1])
I guess I could just join on the id value and then copy the value to all rows, but I assumed there was a cleaner way to do this.
Thanks in advance for any help you can provide!
Just get the first element of the Value column of df1 and assign it to a Value column of df2:
df2['Value'] = df1.loc[0, 'Value']
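As a quick, self-contained check (hypothetical frames mirroring the question's data):

import pandas as pd

df1 = pd.DataFrame({"id": [1], "Value": ["word"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "data": ["a", "b", "c"]})

# assigning a scalar broadcasts it to every row of df2
df2["Value"] = df1.loc[0, "Value"]
print(df2)  # each row now carries Value == 'word'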

How to sum up all values in a row where the column contains a specific string?

So I have a dataframe of categories of venues in each neighbourhood. It looks like:
The values in each row represent the no. of each venue in the specific neighbourhood.
I want to find out the total number of restaurants in each neighbourhood. To do so, I know I have to sum up the values in a row where the column contains the string "Restaurant".
I've tried using the str.contains function, but that sums up True cases: how many times a column containing the string "Restaurant" has a value > 0 in that row. What I'd like instead is to sum up the total no. of restaurants in the neighbourhood.
You can use pd.Index.str.contains with df.loc here.
df['sum_rest'] = df.loc[:,df.columns.str.contains('Restaurant')].sum(axis=1)
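For instance, a minimal sketch with made-up venue columns:

import pandas as pd

# hypothetical neighbourhood counts
df = pd.DataFrame({
    "Afghan Restaurant": [1, 0],
    "American Restaurant": [2, 3],
    "Coffee Shop": [5, 1],
})

# sum only the columns whose name contains 'Restaurant', row by row
df["sum_rest"] = df.loc[:, df.columns.str.contains("Restaurant")].sum(axis=1)
print(df["sum_rest"])  # row 0 -> 3, row 1 -> 3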
Here's a way to do that:
df = pd.DataFrame({"restaurant_a": [1,2,3], "shop": [2,3,4], "restaurant_b": [4,5,6]})
df["sum_rest"] = df[[x for x in df.columns if "restaurant" in x]].sum(axis = "columns")
df
The result is:
restaurant_a shop restaurant_b sum_rest
0 1 2 4 5
1 2 3 5 7
2 3 4 6 9
Define a list of columns containing "Restaurant":
lr = ["Afgan Restaurant", "American Restaurant", "Argentinian Restaurant"]
Then compute the row-wise sums over those columns and put the result in a new column (note this needs import numpy as np, and .apply must run row-wise with axis=1):
df["sum_restaurant"] = df.loc[:, lr].apply(lambda row: np.sum(row.to_numpy()), axis=1)

Transpose all rows in one column of dataframe to multiple columns based on certain conditions

I would like to convert one column of data to multiple columns in dataframe based on certain values/conditions.
Please find the code to generate the input dataframe:
df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65,
                                 'Gender', '1.Male', '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})
The data looks as shown below:
Please note that I may not know the column names in advance, but the data usually follows this format. What I have shown above is sample data; the real data might have around 600-700 columns, with the data arranged in this fashion.
What I would like to do is convert the values which start with non-digits (characters) into new columns in the dataframe. It can be a new dataframe.
I attempted to write a for loop but failed due to the below error. Can you please help me achieve this outcome?
for i in range(3, len(df1)):
    #str(df1['VARIABLE'][i].contains('^\d'))
    if (df1['VARIABLE'][i].astype(str).contains('^\d') == True):
Through the above loop, I was trying to check whether the first char is a digit; if yes, then retain it as a value (ex: 1, 2, 3 etc.), and if it's a character (ex: gender, ethnicity etc.), then create a new column. But I guess this is an incorrect and lengthy approach.
For example, in the above example, the columns would be studyid,age_interview,Gender,Ethnicity.
The final output would look like this
Can you please let me know if there is an elegant approach to do this?
You can use groupby to do something like:
m = ~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list)
                          .values.tolist())
            .set_index(0).T)
print(new_df.rename_axis(None, axis=1))
studyid age_interview Gender Ethnicity
1 1 65 1.Male 1.Chinese
2 None None 2.Female 2.Indian
3 None None None 3.Malay
Explanation: m is a helper series which helps separate the groups:
print(m.cumsum())
0 1
1 1
2 2
3 2
4 3
5 3
6 3
7 4
8 4
9 4
10 4
Then we group this helper series and apply list:
df1.groupby(m.cumsum()).VARIABLE.apply(list)
VARIABLE
1 [studyid, 1]
2 [age_interview, 65]
3 [Gender, 1.Male, 2.Female]
4 [Ethnicity, 1.Chinese, 2.Indian, 3.Malay]
Name: VARIABLE, dtype: object
At this point we have each group as a list with the column name as the first entry.
So we create a dataframe with this and set the first column as index and transpose to get our desired output.
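Putting the pieces together, here is the whole approach as one runnable script (using the question's df1):

import pandas as pd

df1 = pd.DataFrame({'VARIABLE': ['studyid', 1, 'age_interview', 65,
                                 'Gender', '1.Male', '2.Female',
                                 'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']})

# rows whose first character is not a digit start a new group
m = ~df1['VARIABLE'].str[0].str.isdigit().fillna(True)
new_df = (pd.DataFrame(df1.groupby(m.cumsum()).VARIABLE.apply(list)
                          .values.tolist())
            .set_index(0).T)
print(new_df.rename_axis(None, axis=1))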
Use itertools.groupby and then construct pd.DataFrame:
import pandas as pd
import itertools
l = ['studyid', 1, 'age_interview', 65,
     'Gender', '1.Male', '2.Female',
     'Ethnicity', '1.Chinese', '2.Indian', '3.Malay']
l = list(map(str, l))
grouped = [list(g) for k, g in itertools.groupby(l, key=lambda x:x[0].isnumeric())]
d = {k[0]: v for k,v in zip(grouped[::2],grouped[1::2])}
pd.DataFrame.from_dict(d, orient='index').T
Output:
Gender studyid age_interview Ethnicity
0 1.Male 1 65 1.Chinese
1 2.Female None None 2.Indian
2 None None None 3.Malay

pandas dataframe with identical column names - is it a valid process?

I was able to produce a pandas dataframe with identical column names.
Is this normal for a pandas dataframe?
How can I choose only one of the two columns?
Using the identical name returns both columns of the dataframe as output.
Example given below:
# Producing a new empty pd dataset
dataset=pd.DataFrame()
# fill in a list with values to be added to the dataset later
cases=[1]*10
# Adding the list of values in the dataset, and naming the variable / column
dataset["id"]=cases
# making a list of columns as it is displayed below:
data_columns = ["id", "id"]
# Then, we call the pd dataframe using the defined column names:
dataset_new=dataset[data_columns]
# dataset_new
# It has as a result two columns with identical names.
# How can I process only one of the two dataset columns?
id id
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
You can use .iloc to access either column.
dataset_new.iloc[:,0]
or
dataset_new.iloc[:,1]
and of course you can rename your columns, just like you did when you set them both to 'id', using:
dataset_new.columns = ['id_1', 'id_2']
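A compact, self-contained sketch of both options, reproducing the duplicated-column frame from the question:

import pandas as pd

dataset = pd.DataFrame({"id": [1] * 10})
dataset_new = dataset[["id", "id"]]     # two columns, both named 'id'

print(dataset_new.iloc[:, 0])           # first 'id' column, selected by position
dataset_new.columns = ['id_1', 'id_2']  # rename to remove the ambiguity
print(dataset_new['id_2'])              # now each column is selectable by name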
df = pd.DataFrame()
lst = ['1', '2', '3']
df[0] = lst
df[1] = lst
df.rename(columns={0: 'id'}, inplace=True)
df.rename(columns={1: 'id'}, inplace=True)
# both columns are now named 'id', so select the second one by position
print(df.iloc[:, [1]])

Dividing two columns of an unstacked dataframe

I have two columns in a pandas dataframe.
Column 1 is ed and contains strings (e.g. 'a','a','b','c','c','a')
ed column = ['a','a','b','c','c','a']
Column 2 is job and also contains strings (e.g. 'aa','bb','aa','aa','bb','cc')
job column = ['aa','bb','aa','aa','bb','cc'] #these are example values from column 2 of my pandas data frame
I then generate a two column frequency table like this:
my_counts= pdata.groupby(['ed','job']).size().unstack().fillna(0)
Now how do I divide the frequencies in one column by the frequencies in another column of that frequency table? I want to take that ratio and use it with argsort() so that I can sort by the calculated ratio, but I don't know how to reference each column of the resulting table.
I initialized the data as follows:
ed_col = ['a','a','b','c','c','a']
job_col = ['aa','bb','aa','aa','bb','cc']
pdata = pd.DataFrame({'ed':ed_col, 'job':job_col})
my_counts= pdata.groupby(['ed','job']).size().unstack().fillna(0)
Now my_counts looks like this:
job aa bb cc
ed
a 1 1 1
b 1 0 0
c 1 1 0
To access a column, you could use my_counts.aa or my_counts['aa'].
To access a row, you could use my_counts.loc['a'].
So the frequencies of aa divided by bb are my_counts['aa'] / my_counts['bb']
and now, if you want to get it sorted, you can do:
my_counts.iloc[(my_counts['aa'] / my_counts['bb']).argsort()]
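One caveat with the sample counts above: row b has a 0 in column bb, so its ratio is inf, and argsort places it last. A self-contained run:

import pandas as pd

pdata = pd.DataFrame({'ed': ['a', 'a', 'b', 'c', 'c', 'a'],
                      'job': ['aa', 'bb', 'aa', 'aa', 'bb', 'cc']})
my_counts = pdata.groupby(['ed', 'job']).size().unstack().fillna(0)

ratio = my_counts['aa'] / my_counts['bb']  # a: 1.0, b: inf (1/0), c: 1.0
print(my_counts.iloc[ratio.argsort()])     # b (inf) sorts last; a and c tie at 1.0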
