Merge duplicate cells of a column - python

My current Excel looks like:
----------------
| Type | Val |
|--------------|
| A | 1 |
|--------------|
| A | 2 |
|--------------|
| B | 3 |
|--------------|
| B | 4 |
|--------------|
| B | 5 |
|--------------|
| C | 6 |
----------------
This is the required Excel:
----------------------
| Type | Val | Sum |
|--------------------|
| A | 1 | 3 |
| |------| |
| | 2 | |
|--------------------|
| B | 3 | 12 |
| |------| |
| | 4 | |
| |------| |
| | 5 | |
|--------------------|
| C | 6 | 6 |
----------------------
Is it possible in Python using Pandas or any other module?

IIUC use:
df['Sum']=df.groupby('Type').transform('sum')
df.loc[df[['Type','Sum']].duplicated(),['Type','Sum']]=''
print(df)
  Type  Val Sum
0    A    1   3
1         2
2    B    3  12
3         4
4         5
5    C    6   6
P.S.: you can also set these columns as the index:
df = df.set_index(['Type','Sum'])  # then export to Excel without passing index=False
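For what it's worth, to_excel merges repeated MultiIndex labels into merged cells by default (its merge_cells parameter defaults to True), so setting the index this way reproduces the merged layout from the question. A minimal sketch, assuming Sum is kept as computed rather than blanked out, and an illustrative output file name:
import pandas as pd

df['Sum'] = df.groupby('Type')['Val'].transform('sum')
# merge_cells=True is the default; repeated index labels become merged cells in the sheet
df.set_index(['Type', 'Sum']).to_excel('merged.xlsx', merge_cells=True)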

To get the first two levels merged, you can set all three columns as a MultiIndex - only the order of the columns differs:
#specify column name after groupby
df['Sum'] = df.groupby('Type')['Val'].transform('sum')
df = df.set_index(['Type','Sum', 'Val'])
df.to_excel('file.xlsx')
But in my opinion the best approach is to keep the duplicated values:
df['Sum'] = df.groupby('Type')['Val'].transform('sum')
print (df)
  Type  Val  Sum
0    A    1    3
1    A    2    3
2    B    3   12
3    B    4   12
4    B    5   12
5    C    6    6
df.to_excel('file.xlsx', index=False)

You can use
import pandas as pd
df = pd.DataFrame({'Type': ['A','A','B','B','B','C'], 'Val': [1,2,3,4,5,6]})
df_result = df.merge(df.groupby(by='Type', as_index=False).agg({'Val': 'sum'}).rename(columns={'Val': 'Sum'}), on='Type')
which gives the output as
print(df_result)
  Type  Val  Sum
0    A    1    3
1    A    2    3
2    B    3   12
3    B    4   12
4    B    5   12
5    C    6    6
Is this what you are looking for?

How to unmerge the features of a dataframe from one column into several single columns separated by "\" via pandas?

More visually, I would like to move from this dataframe:
|   | A\B\C\D | Unnamed:1 | Unnamed:2 | Unnamed:3 | Unnamed:4 |
|---|---------|-----------|-----------|-----------|-----------|
| 0 | 1\2\3\4 | NaN       | NaN       | NaN       | NaN       |
| 1 | 1\2\3\4 | NaN       | NaN       | NaN       | NaN       |
| 2 | a\2\7\C | NaN       | NaN       | NaN       | NaN       |
| 3 | d\2\u\4 | NaN       | NaN       | NaN       | NaN       |
to this one:
|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 |
| 1 | 1 | 2 | 3 | 4 |
| 2 | a | 2 | 7 | C |
| 3 | d | 2 | u | 4 |
Thanks!
Try splitting the values first and then split the column name:
df2 = df.iloc[:,0].str.split('\\', expand = True)
df2.columns = df.columns[0].split('\\')
df2
result:
   A  B  C  D
0  1  2  3  4
1  1  2  3  4
2  a  2  7  C
3  d  2  u  4
You can use DataFrame constructor:
out = pd.DataFrame(df.iloc[:, 0].str.split('\\').tolist(),
                   columns=df.columns[0].split('\\'))
print(out)
# Output
   A  B  C  D
0  1  2  3  4
1  1  2  3  4
2  a  2  7  C
3  d  2  u  4
The question is: why do you have such input in the first place? Are you reading your data from a CSV file without using the right separator?
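If the data really does come from a text file with backslash-separated fields, reading it with the right separator avoids the problem entirely - a minimal sketch, assuming a hypothetical file named data.csv:
import pandas as pd

# the python engine treats multi-character separators as regex; r'\\' matches a literal backslash
df = pd.read_csv('data.csv', sep=r'\\', engine='python')
print(df)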

Create a column from a choice of other columns using IF style statement

Given the following table:
+---------+---------+-------------+
| field_a | field_b | which_field |
+---------+---------+-------------+
| 1 | 2 | a |
| 1 | 2 | b |
| 3 | 4 | a |
| 3 | 4 | b |
+---------+---------+-------------+
I'd like to create a column called output where the value for each row is taken from either field_a or field_b based upon the value in which_field. So the resulting table would look like this:
+---------+---------+-------------+--------+
| field_a | field_b | which_field | output |
+---------+---------+-------------+--------+
| 1 | 2 | a | 1 |
| 1 | 2 | b | 2 |
| 3 | 4 | a | 3 |
| 3 | 4 | b | 4 |
+---------+---------+-------------+--------+
I've reviewed a number of examples using loc and np.where but these only seem to be able to handle assigning a fixed value rather than the value from a choice of columns.
This is an MRE - in reality there could be multiple which_field fields so it would be great to get an answer that can cope with multiple conditions.
Thanks in advance!
Use DataFrame.melt with DataFrame.loc:
df1 = df.melt('which_field', ignore_index=False)
df['output'] = df1.loc[('field_' + df1['which_field']).eq(df1['variable']), 'value']
print (df)
   field_a  field_b which_field  output
0        1        2           a       1
1        1        2           b       2
2        3        4           a       3
3        3        4           b       4
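Not from the original answer, but as a purely positional alternative that also avoids apply: look up each row's target column with NumPy fancy indexing. A sketch, assuming every which_field value corresponds to an existing field_* column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'field_a': [1, 1, 3, 3],
                   'field_b': [2, 2, 4, 4],
                   'which_field': ['a', 'b', 'a', 'b']})

# positional index of the column each row should read from
col_pos = df.columns.get_indexer('field_' + df['which_field'])
# pick one value per row via (row, column) fancy indexing
df['output'] = df.to_numpy()[np.arange(len(df)), col_pos]
print(df)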

Count the number of duplicates grouped by ID - pandas

I'm not sure if this is a duplicate question, but here it goes.
Assuming I have the following table:
import pandas as pd

lst = [1,1,1,2,2,3,3,4,5]
lst2 = ['A','A','B','D','E','A','A','A','E']
df = pd.DataFrame(list(zip(lst, lst2)),
                  columns=['ID', 'val'])
which produces the following table:
+----+-----+
| ID | Val |
+----+-----+
| 1 | A |
+----+-----+
| 1 | A |
+----+-----+
| 1 | B |
+----+-----+
| 2 | D |
+----+-----+
| 2 | E |
+----+-----+
| 3 | A |
+----+-----+
| 3 | A |
+----+-----+
| 4 | A |
+----+-----+
| 5 | E |
+----+-----+
The goal is to count the duplicates on Val grouped by ID:
+----+-----+--------------+
| ID | Val | is_duplicate |
+----+-----+--------------+
| 1 | A | 1 |
+----+-----+--------------+
| 1 | A | 1 |
+----+-----+--------------+
| 1 | B | 0 |
+----+-----+--------------+
| 2 | D | 0 |
+----+-----+--------------+
| 2 | E | 0 |
+----+-----+--------------+
| 3 | A | 1 |
+----+-----+--------------+
| 3 | A | 1 |
+----+-----+--------------+
| 4 | A | 0 |
+----+-----+--------------+
| 5 | E | 0 |
+----+-----+--------------+
I tried the following code, but it counts the duplicates overall rather than per ID:
df_grouped = df.groupby(['val']).size().reset_index(name='count')
while the following only flags duplicates, again without grouping by ID:
df.duplicated(subset=['val'])
What would be the best approach for this?
Let us try duplicated
df['is_dup']=df.duplicated(subset=['ID','val'],keep=False).astype(int)
df
Out[21]:
   ID val  is_dup
0   1   A       1
1   1   A       1
2   1   B       0
3   2   D       0
4   2   E       0
5   3   A       1
6   3   A       1
7   4   A       0
8   5   E       0
You can use .groupby on the relevant columns and take the count per group. Appending > 1 then marks groups that contain more than one row, producing a boolean True/False column. Finally, use .astype(int) to convert the boolean to an int, which turns True into 1 and False into 0:
df['is_duplicate'] = (df.groupby(['ID','val'])['val'].transform('count') > 1).astype(int)
Out[7]:
   ID val  is_duplicate
0   1   A             1
1   1   A             1
2   1   B             0
3   2   D             0
4   2   E             0
5   3   A             1
6   3   A             1
7   4   A             0
8   5   E             0

Pandas: add column with index of matching row from other dataframe [duplicate]

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 3 years ago.
Cleaning up sharepoint list for upload to mssql with proper table relationships.
Basically, two dataframes (data, config), both share some common columns (country, business).
What I want to do is insert a new column in datadf that, for each row, contains the index of the matching row in configdf, based on the values in the country and business columns.
dataframe data:
-----|---------|----------|-----
... | Country | Business | ...
-----|---------|----------|-----
| A | 1 |
-----|---------|----------|-----
| A | 1 |
-----|---------|----------|-----
| A | 2 |
-----|---------|----------|-----
| A | 2 |
-----|---------|----------|-----
| B | 1 |
-----|---------|----------|-----
| B | 1 |
-----|---------|----------|-----
| B | 2 |
-----|---------|----------|-----
| C | 1 |
-----|---------|----------|-----
| C | 2 |
-----|---------|----------|-----
dataframe config (ID = index):
----|---------|----------|-----
ID | Country | Business | ...
----|---------|----------|-----
1 | A | 1 |
----|---------|----------|-----
2 | A | 2 |
----|---------|----------|-----
3 | B | 1 |
----|---------|----------|-----
4 | B | 2 |
----|---------|----------|-----
5 | C | 1 |
----|---------|----------|-----
6 | C | 2 |
----|---------|----------|-----
what I want to add to dataframe data:
-----|---------|----------|-----------|-----
... | Country | Business | config_ID | ...
-----|---------|----------|-----------|-----
| A | 1 | 1 |
-----|---------|----------|-----------|-----
| A | 1 | 1 |
-----|---------|----------|-----------|-----
| A | 2 | 2 |
-----|---------|----------|-----------|-----
| A | 2 | 2 |
-----|---------|----------|-----------|-----
| B | 1 | 3 |
-----|---------|----------|-----------|-----
| B | 1 | 3 |
-----|---------|----------|-----------|-----
| B | 2 | 4 |
-----|---------|----------|-----------|-----
| C | 1 | 5 |
-----|---------|----------|-----------|-----
| C | 2 | 6 |
-----|---------|----------|-----------|-----
----Found something that works----
datadf['config_ID'] = datadf.apply(lambda x: configdf[(configdf.country == x.country) & (configdf.business_unit == x.business_unit)].index[0], axis=1)
It gets the job done, although I am open for other suggestions, especially if it could work with df.insert()
You can use the numpy.where function to match the data frames.
For example:
import numpy as np
import pandas as pd

datadf = pd.DataFrame([['USA','Business1'],['AUS','Business2'],['UK','Business3'],['IND','Business4']],
                      columns=['country','business'])
configdf = pd.DataFrame([['AUS','Business2'],['IND','Business4'],['USA','Business1'],['UK','Business3']],
                        columns=['country','business'])
datadf['new_col'] = datadf.apply(lambda x: (np.where(x == configdf)[0][0]), axis=1)
print(datadf)
Output:
  country   business  new_col
0     USA  Business1        2
1     AUS  Business2        0
2      UK  Business3        3
3     IND  Business4        1
EDIT1:
Well, in that case, you can use
datadf['new_col'] = datadf.apply(lambda x: (np.where((x['country'] == configdf['country']) & (x['business'] == configdf['business']))[0][0]),axis=1)
Output based on your sample data frames datadf and configdf:
  country  business  new_col
0       A         1        0
1       A         1        0
2       A         2        1
3       A         2        1
4       B         1        2
5       B         1        2
6       B         2        3
7       C         1        4
8       C         2        5
Here is a solution using pandas merge.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html#pandas.DataFrame.merge
import pandas as pd

# make the two dataframes
data = pd.DataFrame({'Country': ['A','A','A','A','B','B','B','C','C'],
                     'Business': [1,1,2,2,1,1,2,1,2]})
configdf = pd.DataFrame({'Country': ['A','A','B','B','C','C'],
                         'Business': [1,2,1,2,1,2]})
# make a column with the index values
configdf.reset_index(inplace=True)
# merge the two dataframes based on the selected columns
newdf = data.merge(configdf, on=['Country', 'Business'])
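One small follow-up (my addition, not part of the original answer): the column brought in by reset_index is named index, so you may want to rename it; and since the question mentioned df.insert, the matched IDs can also be placed at a specific position:
# rename the merged-in index column to something meaningful
newdf = newdf.rename(columns={'index': 'config_ID'})

# or compute the IDs separately and insert them at a chosen position
ids = data.merge(configdf, on=['Country', 'Business'], how='left')['index']
data.insert(0, 'config_ID', ids.to_numpy())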

Get next value from a row that satisfies a condition in pandas

I have a DataFrame that looks something like this:
|   | event_type | object_id |
|---|------------|-----------|
| 0 | A          | 1         |
| 1 | D          | 1         |
| 2 | A          | 1         |
| 3 | D          | 1         |
| 4 | A          | 2         |
| 5 | A          | 2         |
| 6 | D          | 2         |
| 7 | A          | 3         |
| 8 | D          | 3         |
| 9 | A          | 3         |
What I want to do is get the index of the next row where the event_type is A and the object_id is still the same; as an additional column, this would look like this:
|   | event_type | object_id | next_A |
|---|------------|-----------|--------|
| 0 | A          | 1         | 2      |
| 1 | D          | 1         | 2      |
| 2 | A          | 1         | NaN    |
| 3 | D          | 1         | NaN    |
| 4 | A          | 2         | 5      |
| 5 | A          | 2         | NaN    |
| 6 | D          | 2         | NaN    |
| 7 | A          | 3         | 9      |
| 8 | D          | 3         | 9      |
| 9 | A          | 3         | NaN    |
and so on.
I want to avoid using .apply() because my DataFrame is quite large; is there a vectorized way to do this?
EDIT: for multiple A/D pairs for the same object_id, I'd like it to always use the next index of A, like this:
|   | event_type | object_id | next_A |
|---|------------|-----------|--------|
| 0 | A          | 1         | 2      |
| 1 | D          | 1         | 2      |
| 2 | A          | 1         | 4      |
| 3 | D          | 1         | 4      |
| 4 | A          | 1         | NaN    |
You can do it with groupby like:
def populate_next_a(object_df):
    # index of each 'A' row within the group, NaN elsewhere
    object_df['a_index'] = pd.Series(object_df.index, index=object_df.index)[object_df.event_type == 'A']
    # backfill so every row sees the index of the next 'A' at or after it
    object_df['a_index'].fillna(method='bfill', inplace=True)
    # for 'A' rows themselves, look one step further to the following 'A'
    object_df['next_A'] = object_df['a_index'].where(object_df.event_type != 'A', object_df['a_index'].shift(-1))
    object_df = object_df.drop('a_index', axis=1)
    return object_df
result = df.groupby(['object_id']).apply(populate_next_a)
print(result)
  event_type  object_id  next_A
0          A          1     2.0
1          D          1     2.0
2          A          1     NaN
3          D          1     NaN
4          A          2     5.0
5          A          2     NaN
6          D          2     NaN
7          A          3     9.0
8          D          3     9.0
9          A          3     NaN
GroupBy.apply will not have as much overhead as a simple apply.
Note that you cannot store integers together with NaN in a regular int column (see http://pandas.pydata.org/pandas-docs/stable/gotchas.html#support-for-integer-na), so the values end up as floats.
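If the float representation is a problem, newer pandas versions provide the nullable Int64 dtype, which can hold missing values alongside integers - a minimal sketch:
# cast the float column with NaN to the nullable integer dtype
result['next_A'] = result['next_A'].astype('Int64')
And since the question asked for a vectorized approach, here is an alternative sketch of my own (not from the answer above) that avoids per-group Python by combining groupby shift with groupby bfill - for every row it finds the index of the next 'A' strictly after it within the same object_id:
import pandas as pd

df = pd.DataFrame({'event_type': list('ADADAADADA'),
                   'object_id':  [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]})

# positions of 'A' rows, NaN elsewhere
a_pos = pd.Series(df.index, index=df.index).where(df['event_type'].eq('A'))
# shift up by one within each object_id so a row only sees later positions, then backfill per group
df['next_A'] = a_pos.groupby(df['object_id']).shift(-1).groupby(df['object_id']).bfill()
print(df)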
