Explode List containing many dictionaries in Pandas dataframe - python

I have a dataset that looks like the following (in a dataframe):
**_id**  **paper_title**  **references**                                                                    **full_text**
1        XYZ              [{'abc': 'something', 'def': 'something'}, {'def': 'something'}, ...many others]  something
2        XYZ              [{'abc': 'something', 'def': 'something'}, {'def': 'something'}, ...many others]  something
3        XYZ              [{'abc': 'something'}, {'def': 'something'}, ...many others]                      something
Expected:
**_id**  **paper_title**  **abc**    **def**    **full_text**
1        XYZ              something  something  something
                          something  something
                          ...
                          (one row per dictionary in the list, repeated for the _id column)
2        XYZ              something  something  something
                          something  something
                          ...
                          (one row per dictionary in the list, repeated for the _id column)
I have tried df['column_name'].apply(pd.Series).apply(pd.Series) to split the list and dictionaries into dataframe columns, but it doesn't help: it splits the list, but not the dictionaries inside it.
First row of my dataframe:
df.head(1)

Assuming each entry of your references column is a list of dictionaries, each holding one key:value pair under a key named 'reference':
print(df)
id paper_title references full_text
0 1 xyz [{'reference': 'description1'}, {'reference': ... some text
1 2 xyz [{'reference': 'descriptiona'}, {'reference': ... more text
2 3 xyz [{'reference': 'descriptioni'}, {'reference': ... even more text
Then you can use concat to separate out your references with their index:
import pandas as pd

# Build one small DataFrame per list of references, keyed by the original row index,
# then drop the inner level so each reference keeps only its source row's index
df1 = pd.concat([pd.DataFrame(i) for i in df['references']], keys=df.index).reset_index(level=1, drop=True)
print(df1)
reference
0 description1
0 description2
0 description3
1 descriptiona
1 descriptionb
1 descriptionc
2 descriptioni
2 descriptionii
2 descriptioniii
Then use DataFrame.join to join the columns back together on their index:
df = df.drop('references', axis=1).join(df1).reset_index(drop=True)
print(df)
id paper_title full_text reference
0 1 xyz some text description1
1 1 xyz some text description2
2 1 xyz some text description3
3 2 xyz more text descriptiona
4 2 xyz more text descriptionb
5 2 xyz more text descriptionc
6 3 xyz even more text descriptioni
7 3 xyz even more text descriptionii
8 3 xyz even more text descriptioniii
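The keys=df.index argument in the concat step is what makes this work: it tags each per-row block of references with the index of the row it came from, so the later join can replicate id, paper_title and full_text once per reference.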

After a lot of reading in the pandas documentation, I found that the explode method combined with apply(pd.Series) is the easiest way to do what the question asks.
Here is the code:
df = df.explode('references')
# explode() turns each element of the list into its own row, repeating the other columns
df = df['references'].apply(pd.Series).merge(df, left_index=True, right_index=True, how='outer')
# split the dictionaries inside each cell into columns and merge them back onto the
# original dataframe, like a union (A U B) in set theory
Side note: after the merge, check for duplicated values in the columns, since the original columns are repeated for every exploded row.
I hope this helps anyone whose DataFrame/Series has columns holding lists of multiple dictionaries and who wants to split each dictionary key into a new column with its values as rows.
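For completeness, here is a compact variant of the same idea as a minimal sketch, assuming pandas >= 1.0 (for pd.json_normalize) and sample data shaped like the question's:
import pandas as pd

df = pd.DataFrame({
    '_id': [1, 2],
    'paper_title': ['XYZ', 'XYZ'],
    'references': [
        [{'abc': 'a1', 'def': 'd1'}, {'def': 'd2'}],
        [{'abc': 'a2'}, {'def': 'd3'}],
    ],
    'full_text': ['something', 'something'],
})

# one row per dictionary in the list (explode requires pandas >= 0.25)
exploded = df.explode('references').reset_index(drop=True)

# expand each dictionary into columns and join back onto the other columns
expanded = pd.concat(
    [exploded.drop(columns='references'),
     pd.json_normalize(exploded['references'].tolist())],
    axis=1,
)
print(expanded)
json_normalize fills keys that are missing from a dictionary with NaN, so a row whose dictionary has only a 'def' key simply gets an empty 'abc' cell.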

Related

Pandas dataframe groupby and aggregate with conditions

Is there a way to group my dataframe based on specific columns and include an empty value as well, but only when all of the values of the specific column are empty?
Example:
I have a dataframe that looks like this:
I am trying to group the dataframe based on Name and Subject.
My expected output looks like this:
So, if a person takes more than one subject and one of them is empty, drop that row so it won't be included when aggregating the other rows. If a person takes only one subject and it is empty, don't drop the row.
[Updated] Original dataframe:
The outcome will still be the same: it takes the first row's value when all of a person's subjects are empty.
[Updated] Another new dataframe:
The outcome will have the same number of subjects, but there will be three Year values.
Here is a proposition with GroupBy.agg:
# Drop exact (ID, Name, Subject) duplicates first
df = df.drop_duplicates(subset=["ID", "Name", "Subject"])

# Mask rows whose Subject is empty while that person has more than one row
m = (df.groupby(["ID", "Name"])["Subject"].transform("size").gt(1)
     & df["Subject"].isnull())

# Aggregate the remaining rows into lists per (ID, Name)
out = df.loc[~m].groupby(["ID", "Name"], as_index=False).agg(list)
Output:
print(out)
ID Name Subject Year
0 1 CC [Math, English] [1, 3]
1 2 DD [Physics] [2]
2 3 EE [Chemistry] [1]
3 4 FF [nan] [0]
4 5 GG [nan] [0]
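The sample tables in the question were posted as images; a hypothetical input consistent with the output above could look like this (np.nan stands in for the empty cells):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 3, 4, 5],
    "Name": ["CC", "CC", "CC", "DD", "EE", "FF", "GG"],
    "Subject": ["Math", "English", np.nan, "Physics", "Chemistry", np.nan, np.nan],
    "Year": [1, 3, 0, 2, 1, 0, 0],
})
Here CC's empty-subject row is masked out because CC has other subjects, while FF and GG keep their single empty row.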

Create Table with two lists python

I have two lists: a list of table titles (title_df) and a list of contents (prediction_df) to be sorted under those titles. I want to populate the contents by title and produce a table as the result.
title_df=['a','b']
prediction_df=['1','2','3','800800','802100','800905']
My table has three rows and two columns.
Use numpy.reshape: 2 is the number of columns and -1 tells NumPy to infer the number of rows from the data; then pass the result to the DataFrame constructor:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(prediction_df, (-1, 2), order='F'), columns=title_df)
print(df)
a b
0 1 800800
1 2 802100
2 3 800905
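Note that order='F' matters here: prediction_df lists all of column a first and then all of column b, so a column-major (Fortran-style) fill is needed; the default order='C' would interleave the values row by row instead.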

How to compare two DataFrames and return a matrix containing values where columns matched

I have two data frames as follows:
df1
id start end label item
0 1 0 3 food burger
1 1 4 6 drink cola
2 2 0 3 food fries
df2
id start end label item
0 1 0 3 food burger
1 1 4 6 food cola
2 2 0 3 drink fries
I would like to compare the two data frames (by checking where they match in the id, start, end columns) and create a matrix of size 2 x (number of items) for each id. The cells should contain the label corresponding to an item. In this example:
M_id1: [[food, drink],
        [food, food]]

M_id2: [[food],
        [drink]]
I tried looking at the pandas documentation but didn't really find anything that could help me.
You can merge df1 and df2 on the columns id, start, end, then group the merged dataframe by id and build key-value pairs inside a dict comprehension, where the key is the id and the value is the corresponding matrix of labels:
# inner-join on the key columns; the matching label columns become label_x and label_y
m = df1.merge(df2, on=['id', 'start', 'end'])
# for each id, keep only the label columns and transpose to a 2 x (number of items) matrix
dct = {f'M_id{k}': g.filter(like='label').to_numpy().T for k, g in m.groupby('id')}
To access the matrix of labels use dictionary lookup:
>>> dct['M_id1']
array([['food', 'drink'], ['food', 'food']], dtype=object)
>>> dct['M_id2']
array([['food'], ['drink']], dtype=object)
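Put together, a runnable version of this answer using the sample frames from the question:
import pandas as pd

df1 = pd.DataFrame({
    'id':    [1, 1, 2],
    'start': [0, 4, 0],
    'end':   [3, 6, 3],
    'label': ['food', 'drink', 'food'],
    'item':  ['burger', 'cola', 'fries'],
})
# df2 differs from df1 only in its label column
df2 = df1.assign(label=['food', 'food', 'drink'])

m = df1.merge(df2, on=['id', 'start', 'end'])
dct = {f'M_id{k}': g.filter(like='label').to_numpy().T for k, g in m.groupby('id')}
print(dct['M_id1'])  # rows: [['food' 'drink'], ['food' 'food']]
print(dct['M_id2'])  # rows: [['food'], ['drink']]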

How to merge multiple dataframe rows into one by key?

I have a pandas dataframe like this:
key columnA
1 1199
1 8674
2 8674
2 0183
2 3957
3 0183
3 3647
Expected result:
key columnA
1 11998674
2 867401833957
3 01833647
Is there something that merges the rows by key while concatenating the different values in columnA?
# columnA must be a string so the values concatenate (and leading zeros survive)
df['columnA'] = df['columnA'].astype(str)
Method 1:
df.groupby('key').agg({'columnA': sum})
Method 2:
df.groupby('key').agg({'columnA': "".join})
Optionally, convert the column back to int (note that this drops leading zeros, e.g. in 01833647).
If you want to add separators:
# assuming the separator is ":"
df.groupby('key').agg({'columnA': ":".join})
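A runnable version with the sample data from the question, using method 2:
import pandas as pd

df = pd.DataFrame({
    'key': [1, 1, 2, 2, 2, 3, 3],
    'columnA': ['1199', '8674', '8674', '0183', '3957', '0183', '3647'],
})

# concatenate the string values within each key
out = df.groupby('key', as_index=False).agg({'columnA': ''.join})
print(out)
#    key       columnA
# 0    1      11998674
# 1    2  867401833957
# 2    3      01833647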

How to add a MultiIndex after loading csv data into a pandas dataframe?

I am trying to add additional index rows to an existing pandas dataframe after loading csv data into it.
So let's say I load my data like this:
from io import StringIO
import pandas as pd

columns = ['Relative_Pressure', 'Volume_STP']
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True, index_col=False, header=None)
df.columns = columns
where contents is a string in csv format. The resulting DataFrame might look something like this:
For clarity, I would now like to add additional index rows to the DataFrame as shown here:
However, in the link those multiple index rows are generated right when the DataFrame is created. I would like to add rows, e.g. for unit or descr, to the columns after the fact.
How could I do this?
You can create a MultiIndex on the columns by specifically creating the index and then assigning it to the columns separately from reading in the data.
I'll use the example from the link you provided. The first method is to create the MultiIndex when you make the dataframe:
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
As stated, this is not what you are after. Instead, you can make the dataframe (from your file for example) and then make the MultiIndex from a set of lists and then assign it to the columns:
df = pd.DataFrame({'desc A':[1,2,3], 'desc B':[4,5,6]})
# Output
desc A desc B
0 1 4
1 2 5
2 3 6
# Create a multiindex from lists
index = pd.MultiIndex.from_arrays((['A', 'B'], [1, 2], ['desc A', 'desc B']))
# Assign to the columns
df.columns = index
# Output
A B
1 2
desc A desc B
0 1 4
1 2 5
2 3 6
# Name the columns
df.columns.names = ['NAME','LENGTH','DESCRIPTION']
# Output
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
There are other ways to construct a MultiIndex, for example from_tuples and from_product. You can read more about MultiIndex in the pandas documentation.
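For reference, a minimal sketch of those two alternatives (the from_tuples call below is equivalent to the from_arrays call above):
import pandas as pd

# from_tuples: one tuple per column
index = pd.MultiIndex.from_tuples(
    [('A', 1, 'desc A'), ('B', 2, 'desc B')],
    names=['NAME', 'LENGTH', 'DESCRIPTION'],
)

# from_product: the Cartesian product of the level values,
# useful when every combination of levels is a real column
grid = pd.MultiIndex.from_product([['A', 'B'], [1, 2]], names=['NAME', 'LENGTH'])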
