How to merge multiple dataframe rows into one by key? - python

I have a pandas dataframe like this:
key columnA
1 1199
1 8674
2 8674
2 0183
2 3957
3 0183
3 3647
Expected result:
key columnA
1 11998674
2 867401833957
3 01833647
Is there something that merges the rows by key while concatenating the different values in columnA?

df['columnA'] = df['columnA'].astype(str)
method 1:
df.groupby('key').agg({'columnA': sum})
method 2:
df.groupby('key').agg({'columnA': "".join})
Optionally, convert the column back to int afterwards (note that this drops leading zeros, such as the one in 01833647).
If you want to add separators:
# assuming separator is ":"
df.groupby('key').agg({'columnA': ":".join})
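As a self-contained sketch of the join approach above, the snippet below rebuilds the question's sample data and concatenates the values per key. It assumes columnA is stored as strings, which is what the leading zeros in the sample suggest; the variable name merged is only illustrative.

```python
import pandas as pd

# Sample data from the question; columnA is kept as strings so that
# leading zeros (e.g. "0183") survive.
df = pd.DataFrame({
    "key": [1, 1, 2, 2, 2, 3, 3],
    "columnA": ["1199", "8674", "8674", "0183", "3957", "0183", "3647"],
})

# Join all strings that share a key, in their original row order.
merged = df.groupby("key")["columnA"].agg("".join).reset_index()
print(merged)
```

Replacing "".join with ":".join gives the separated variant.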

Related

Pandas. What is the best way to insert additional rows in dataframe based on cell values?

I have dataframe like this:
id name emails
1 a a#e.com,b#e.com,c#e.com,d#e.com
2 f f#gmail.com
I need to go through the emails and, where a row has more than one, create additional rows for the extra emails that do not correspond to the name. The result should look like this:
id name emails
1 a a#e.com
2 f f#gmail.com
3 NaN b#e.com
4 NaN c#e.com
5 NaN d#e.com
What is the best way to do this, other than iterrows with append or concat? Is it OK to modify a dataframe while iterating over it?
Thanks.
First split the values with Series.str.split and expand them into rows with DataFrame.explode. Then compare each name with the part of the email before #, and set the name to a missing value where they do not match. Finally, sort so that rows with missing names move to the end of the DataFrame, and assign a new range to the id column:
df = df.assign(emails = df['emails'].str.split(',')).explode('emails')
mask = df['name'].eq(df['emails'].str.split('#').str[0])
df['name'] = np.where(mask, df['name'], np.nan)
df = df.sort_values('name', key=lambda x: x.isna(), ignore_index=True)
df['id'] = range(1, len(df) + 1)
print (df)
id name emails
0 1 a a#e.com
1 2 f f#gmail.com
2 3 NaN b#e.com
3 4 NaN c#e.com
4 5 NaN d#e.com
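The steps above can be put together as a runnable sketch on the question's sample data. Two details here are assumptions rather than part of the original answer: the result is stored in a new variable out, and the sort passes kind="mergesort" so that rows with equal sort keys are guaranteed to keep their relative order (the default sort algorithm does not promise this).

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "name": ["a", "f"],
    "emails": ["a#e.com,b#e.com,c#e.com,d#e.com", "f#gmail.com"],
})

# One row per email address.
out = df.assign(emails=df["emails"].str.split(",")).explode("emails")

# Keep the name only where it equals the part of the email before '#'.
matches = out["name"].eq(out["emails"].str.split("#").str[0])
out["name"] = out["name"].where(matches)

# Move rows without a name to the end; mergesort is a stable sort.
out = out.sort_values(
    "name", key=lambda s: s.isna(), kind="mergesort", ignore_index=True
)
out["id"] = range(1, len(out) + 1)
print(out)
```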

Python : Remove all data from a column of a dataframe except the last value that we store in the first row

Let's say that I have a simple Dataframe.
data1 = [12,34,465,678,896]
df1 = pd.DataFrame(data1, columns= ['Data'])
print(df1)
Data
0 12
1 34
2 465
3 678
4 896
I want to delete all the data except the last value of the column, which I want to store in the first row. The column may have thousands of rows. So I would like this result:
Data
0 896
1
2
3
4
What are the simplest functions to do that efficiently ?
Thank you
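You can use iloc, where 0 is the first row of the Data column, -1 is the last row, and 1: is every row except the first. Index the DataFrame directly instead of chaining df1['Data'].iloc[...], because a chained assignment can end up writing to a temporary copy:
df1['Data'] = df1['Data'].astype(object)  # let the column hold both numbers and ''
col = df1.columns.get_loc('Data')  # integer position of the column
df1.iloc[0, col] = df1.iloc[-1, col]
df1.iloc[1:, col] = ''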
df1
Out[1]:
Data
0 896
1
2
3
4
Use the loc accessor together with Python's tuple assignment (x, y = a, b) to set both values in one statement:
df1.loc[0, 'Data'], df1.loc[1:, 'Data'] = df1['Data'].values[-1], ''
Data
0 896
1
2
3
4
You can use the .reverse() method of Python lists, something like this:
my_data = df1['Data'].to_list() # Get a list from the Series.
my_data.reverse() # Reverse the order, so the last value comes first.
my_data[1:] = [""]*len(my_data[1:]) # Fill with empty strings from the second item on.
df1['Data'] = my_data
Output:
Data
0 896
1
2
3
4
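For comparison, here is a minimal sketch of the same task that avoids chained indexing and mixed-dtype assignment. It assumes that blanking the remaining rows with empty strings is acceptable, which requires casting the column to object first; the variable name last is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([12, 34, 465, 678, 896], columns=["Data"])

# Cast to object so that numbers and empty strings can share the column.
df1["Data"] = df1["Data"].astype(object)

last = df1["Data"].iloc[-1]           # remember the last value
df1.loc[df1.index[1:], "Data"] = ""   # blank every row except the first
df1.loc[df1.index[0], "Data"] = last  # store the last value in the first row
print(df1)
```

Using df1.index[...] instead of hard-coded labels keeps the sketch valid when the index does not start at 0.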

Extract row data from dictionary of dataframes based on filter on a column value

The dictionary dict_set has dataframes as its values.
I'm trying to extract rows from each of these dataframes by filtering on the value in column 'A'.
dict_set={}
dict_set['a']=pd.DataFrame({'A':[1,2,3],'B':[1,2,3]})
dict_set['b']=pd.DataFrame({'A':[1,4,5],'B':[1,5,6]})
df=pd.concat([dict_set[x][dict_set[x]['A']==1] for x in dict_set.keys()],axis=0)
output being the below.
A B
0 1 1
0 1 1
But I would want the output to be
A B x
0 1 1 a
0 1 1 b
Basically, I want the value of x to be present in the new dataframe formed as a column, say column x in the dataframe formed such that df[x] would give me the x values. Is there a simple way to do this?
Try this:
pd.concat([df.query("A == 1") for df in dict_set.values()], keys=dict_set.keys())\
.reset_index(level=0)\
.rename(columns={'level_0':'x'})
Output:
x A B
0 a 1 1
0 b 1 1
Details:
First, get the dataframes from the dictionary with a list comprehension and filter each one. Here I chose query, but you could also use boolean indexing with df[df['A'] == 1]. Then call pd.concat with the keys parameter set to the dictionary keys. Lastly, call reset_index with level=0 and rename the resulting level_0 column to x.
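An alternative sketch, not taken from the answer above, tags each filtered frame with its dictionary key through DataFrame.assign before concatenating. This places x as the last column, matching the layout requested in the question.

```python
import pandas as pd

dict_set = {
    "a": pd.DataFrame({"A": [1, 2, 3], "B": [1, 2, 3]}),
    "b": pd.DataFrame({"A": [1, 4, 5], "B": [1, 5, 6]}),
}

# Filter each frame on A == 1 and record which key it came from.
df = pd.concat(
    [frame[frame["A"] == 1].assign(x=key) for key, frame in dict_set.items()]
)
print(df)
```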

Explode List containing many dictionaries in Pandas dataframe

I have a dataset that looks like this (in a dataframe):
**_id** **paper_title** **references** **full_text**
1 XYZ [{'abc':'something','def':'something'},{'def':'something'},...many others] something
2 XYZ [{'abc':'something','def':'something'},{'def':'something'},...many others] something
3 XYZ [{'abc':'something'},{'def':'something'},...many others] something
Expected:
**_id** **paper_title** **abc** **def** **full_text**
1 XYZ something something something
something something
.
.
(all the dic in list with respect to_id column)
2 XYZ something something something
something something
.
.
(all the dic in list with respect to_id column)
I have tried df['column_name'].apply(pd.Series).apply(pd.Series) to split the lists and dictionaries into dataframe columns, but it doesn't help because it does not split the dictionaries.
First row of my dataframe:
df.head(1)
Assuming the references column of your original DataFrame holds lists of dictionaries, each with a single key:value pair whose key is named 'reference':
print(df)
id paper_title references full_text
0 1 xyz [{'reference': 'description1'}, {'reference': ... some text
1 2 xyz [{'reference': 'descriptiona'}, {'reference': ... more text
2 3 xyz [{'reference': 'descriptioni'}, {'reference': ... even more text
Then you can use concat to separate out your references with their index:
df1 = pd.concat([pd.DataFrame(i) for i in df['references']], keys = df.index).reset_index(level=1,drop=True)
print(df1)
reference
0 description1
0 description2
0 description3
1 descriptiona
1 descriptionb
1 descriptionc
2 descriptioni
2 descriptionii
2 descriptioniii
Then use DataFrame.join to join the columns back together on their index:
df = df.drop('references', axis=1).join(df1).reset_index(drop=True)
print(df)
id paper_title full_text reference
0 1 xyz some text description1
1 1 xyz some text description2
2 1 xyz some text description3
3 2 xyz more text descriptiona
4 2 xyz more text descriptionb
5 2 xyz more text descriptionc
6 3 xyz even more text descriptioni
7 3 xyz even more text descriptionii
8 3 xyz even more text descriptioniii
After reading a lot of the pandas documentation, I found that the explode method combined with apply(pd.Series) is the easiest way to do what I asked for in the question.
Here is the code:
df = df.explode('reference')
# It explodes the lists to rows of the subset columns
df = df['reference'].apply(pd.Series).merge(df, left_index=True, right_index=True, how ='outer')
# Expand each dictionary into its own columns, then outer-merge with the exploded dataframe on the index (like a union, A ∪ B, in set theory)
Side note: after merging, check the result for duplicates, as many columns will contain repeated values.
I hope this helps anyone whose dataframe or Series has columns containing lists of dictionaries and who wants to split the dictionary keys into new columns, with the values as rows.
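A hedged sketch of the explode approach on made-up data follows. The sample values, the variable names exploded, expanded and result, and the reset_index step are all assumptions added for illustration. Resetting the index after explode gives every row a unique label, so the final join does not multiply rows that share an index value. The sketch also assumes every list is non-empty.

```python
import pandas as pd

# Hypothetical data shaped like the question's dataframe.
df = pd.DataFrame({
    "_id": [1, 2],
    "paper_title": ["XYZ", "XYZ"],
    "references": [
        [{"abc": "a1", "def": "d1"}, {"def": "d2"}],
        [{"abc": "a3"}],
    ],
    "full_text": ["text 1", "text 2"],
})

# One row per dictionary, with a fresh unique index.
exploded = df.explode("references").reset_index(drop=True)

# One column per dictionary key; keys missing from a dictionary become NaN.
expanded = pd.DataFrame(exploded["references"].tolist())

# Attach the new columns to the remaining ones, row by row.
result = exploded.drop(columns="references").join(expanded)
print(result)
```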

Adding several columns at the same time with multiindex

I have a dataframe with a variable number of columns, which are organised in a MultiIndex. I'm trying to add several columns to the same MultiIndex structure.
I've tried to add the new columns the way I would if there were only one column, but it doesn't work.
I have tried this:
df = pd.DataFrame(np.random.rand(4,2), columns=pd.MultiIndex.from_tuples([('plus_zero', 'A'), ('plus_zero', 'B')]))
df['plus_one'] = df['plus_zero'] + 1
But I get ValueError: Wrong number of items passed 2, placement implies 1.
The original df should look like this:
plus_zero
A B
0 0.602891 0.701130
1 0.395749 0.960206
2 0.268238 0.140606
3 0.165802 0.971707
And the result I want:
plus_zero plus_one
A B A B
0 0.602891 0.701130 1.602891 1.701130
1 0.395749 0.960206 1.395749 1.960206
2 0.268238 0.140606 1.268238 1.140606
3 0.165802 0.971707 1.165802 1.971707
Use pd.concat.
You must pass the top-level names of the column groups through keys and set axis=1 (or axis='columns'):
pd.concat([df.loc[:, 'plus_zero'], df.loc[:, 'plus_zero'] + 1],
          keys=['plus_zero', 'plus_one'],
          axis=1)
plus_zero plus_one
A B A B
0 0.049735 0.013907 1.049735 1.013907
1 0.782054 0.449790 1.782054 1.449790
2 0.148571 0.172844 1.148571 1.172844
3 0.875560 0.393258 1.875560 1.393258
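If the dataframe already holds other top-level groups that should be kept, one possible variation is to build only the new block and append it to the existing frame. This is a sketch under that assumption, and the variable name plus_one is illustrative. Passing a dict to pd.concat uses its keys as the new top column level.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.rand(4, 2),
    columns=pd.MultiIndex.from_tuples([("plus_zero", "A"), ("plus_zero", "B")]),
)

# Wrap the computed block under a new top-level label.
plus_one = pd.concat({"plus_one": df["plus_zero"] + 1}, axis=1)

# Append it column-wise; existing columns are left untouched.
df = pd.concat([df, plus_one], axis=1)
print(df)
```

Because the inner labels are taken from df["plus_zero"], this works for any number of sub-columns.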
