Here's my starting dataframe:
StartDF = pd.DataFrame({'A': {0: 1, 1: 1, 2: 2, 3: 4, 4: 5, 5: 5, 6: 5, 7: 5}, 'B': {0: 2, 1: 2, 2: 4, 3: 2, 4: 2, 5: 4, 6: 4, 7: 5}, 'C': {0: 10, 1: 1000, 2: 250, 3: 100, 4: 550, 5: 100, 6: 3000, 7: 250}})
I need to create a list of individual dataframes based on duplicate values in columns A and B, so it should look like this:
df1 = pd.DataFrame({'A': {0: 1, 1: 1}, 'B': {0: 2, 1: 2}, 'C': {0: 10, 1: 1000}})
df2 = pd.DataFrame({'A': {0: 2}, 'B': {0: 4}, 'C': {0: 250}})
df3 = pd.DataFrame({'A': {0: 4}, 'B': {0: 2}, 'C': {0: 100}})
df4 = pd.DataFrame({'A': {0: 5}, 'B': {0: 2}, 'C': {0: 550}})
df5 = pd.DataFrame({'A': {0: 5, 1: 5}, 'B': {0: 4, 1: 4}, 'C': {0: 100, 1: 3000}})
df6 = pd.DataFrame({'A': {0: 5}, 'B': {0: 5}, 'C': {0: 250}})
I've seen a lot of answers that explain how to DROP duplicates, but I need to keep the duplicate values because the information in column C will usually be different between rows regardless of duplicates in columns A and B. All of the row data needs to be preserved in the new dataframes.
Additional note, the starting dataframe (StartDF) will change in length, so each time this is run, the number of individual dataframes created will be variable. Ultimately, I need to print the newly created dataframes to their own csv files (I know how to do this part). Just need to know how to break out the data from the original dataframe in an elegant way.
You can use a groupby, iterate over each group, and build a list using a list comprehension.
df_list = [g for _, g in StartDF.groupby(['A', 'B'])]
print(*df_list, sep='\n\n')
A B C
0 1 2 10
1 1 2 1000
A B C
2 2 4 250
A B C
3 4 2 100
A B C
4 5 2 550
A B C
5 5 4 100
6 5 4 3000
A B C
7 5 5 250
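Since the question also mentions writing each dataframe to its own csv file, here is a minimal sketch of that step (the group_<A>_<B>.csv naming is just an illustrative choice, not from the original post):

# write each (A, B) group straight to its own CSV file;
# the filename pattern is a hypothetical convention
for (a, b), g in StartDF.groupby(['A', 'B']):
    g.to_csv(f'group_{a}_{b}.csv', index=False)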
I have the following dataframe (the real one has a lot more columns and rows, so just using this as an example):
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
I'd like to write a function that performs a calculation on specific columns of the dataframe. The calculation is in the code below.
Since I only want to apply the calculation to specific columns, I've set up a list of those columns; and since there is a pre-defined 'factor' we need to take into account, I set that up too:
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row):
    return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
Then, I apply the function to the dataframe, and I want to overwrite the original column values with the new ones, so I do this:
for cols in df.columns:
    df[cols] = df[cols].apply(multiply_columns)
But I get the following error:
~\AppData\Local\Temp/ipykernel_8544/3939806184.py in multiply_columns(row)
3
4 def multiply_columns(row):
----> 5 return ((row[cols] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)
6
7
TypeError: string indices must be integers
But the values I'm using in the calculation aren't strings:
sample object
sample id int64
replicate int64
taste float64
smell float64
shape float64
volume int64
weight float64
dtype: object
The desired output would be:
{'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 0.0074, 1: 0.028366667, 2: 0.2183, 3: 3.08333e-05},
'smell': {0: 0.123333333, 1: 0.141833333, 2: 0.01295, 3: 0.032683333},
'shape': {0: 2.46667e-05, 1: 0.001233333, 2: 0.00074, 3: 0.067833333},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}}
Can anyone kindly show me the error of my ways?
This has a few issues.
If you wanted to index elements in row, note that the index you're using is a string (the column name) rather than an integer position. To get the integer positions of the column names you're interested in, you could use this:
cols = ['taste', 'smell', 'shape']
cols_idx = [df.columns.get_loc(col) for col in cols]
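Those integer positions would then be used with positional indexing; a hypothetical one-liner, assuming the df defined below:

# hypothetical use of the integer positions: pick the 'taste', 'smell',
# and 'shape' values of the first row by position
row_values = df.iloc[0, cols_idx]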
However, if I understand your question, you can perform this operation on whole columns directly, with the understanding that it will be applied to each row. Here is a test case that worked for me:
import pandas as pd
df = pd.DataFrame({'sample': {0: 'orange', 1: 'orange', 2: 'banana', 3: 'banana'},
'sample id': {0: 1, 1: 1, 2: 5, 3: 5},
'replicate': {0: 1, 1: 2, 2: 1, 3: 2},
'taste': {0: 1.2, 1: 4.6, 2: 35.4, 3: 0.005},
'smell': {0: 20.0, 1: 23.0, 2: 2.1, 3: 5.3},
'shape': {0: 0.004, 1: 0.2, 2: 0.12, 3: 11.0},
'volume': {0: 23, 1: 23, 2: 23, 3: 23},
'weight': {0: 12.0, 1: 1.3, 2: 2.4, 3: 3.2}})
cols = ['taste', 'smell', 'shape']
factor = 72
for col in cols:
    df[col] = ((df[col] / df['volume']) * (factor * df['volume'] / df['weight']) / 1000)
Note that your line
for cols in df.columns:
indicated you should run this operation on every column (cols was rebound to each column name and was no longer your list).
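For what it's worth, the per-column loop can also be written as a single vectorized statement over all three columns at once; a sketch of my own, equivalent to the loop above:

# vectorized variant: divide the selected columns by 'volume' row-wise,
# then multiply by the per-row factor term
df[cols] = (df[cols].div(df['volume'], axis=0)
            .mul(factor * df['volume'] / df['weight'], axis=0) / 1000)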
You have to pass the column to the function as well.
cols = ['taste', 'smell', 'shape']
factor = 72
def multiply_columns(row, col):
    return ((row[col] / row['volume']) * (factor * row['volume'] / row['weight']) / 1000)

for col in cols:
    df[col] = df.apply(lambda x: multiply_columns(x, col), axis=1)
Also, the output I'm getting is a bit different from your desired output, even though I used the same formula.
sample sample id replicate taste smell shape volume weight
0 orange 1 1 0.00720000000 0.12000000000 0.00002400000 23 12.00000000000
1 orange 1 2 0.25476923077 1.27384615385 0.01107692308 23 1.30000000000
2 banana 5 1 1.06200000000 0.06300000000 0.00360000000 23 2.40000000000
3 banana 5 2 0.00011250000 0.11925000000 0.24750000000 23 3.20000000000
If I have the following DataFrame, how can I convert the value in each row to the proportion of the total of the columns?
Input:
pd.DataFrame(
{'A': {0: 1, 1: 1},
'B': {0: 1, 1: 2},
'C': {0: 1, 1: 9},})
Output:
pd.DataFrame(
{'A': {0: 0.5, 1: 0.5},
'B': {0: 0.333, 1: 0.666},
'C': {0: 0.1, 1: 0.9},})
How about apply?
import pandas as pd
df = pd.DataFrame(
{'A': {0: 1, 1: 1},
'B': {0: 1, 1: 2},
'C': {0: 1, 1: 9},})
df = df.apply(lambda col: col / sum(col))
print(df)
# A B C
# 0 0.5 0.333333 0.1
# 1 0.5 0.666667 0.9
Try df.div, which divides each column by its column sum:
out = df.div(df.sum())
Out[549]:
A B C
0 0.5 0.333333 0.1
1 0.5 0.666667 0.9
I have a dataframe:
{'ARTICLE_ID': {0: 111, 1: 111, 2: 222, 3: 222, 4: 222}, 'CITEDIN_ARTICLE_ID': {0: 11, 1: 11, 2: 11, 3: 22, 4: 22}, 'enrollment': {0: 10, 1: 10, 2: 10, 3: 10, 4: 10}, 'Trial_year': {0: 2017, 1: 2017, 2: 2017, 3: 2017, 4: 2017}, 'AUTHOR_ID': {0: 'aaa', 1: 'aaa', 2: 'aaa', 3: 'aaa', 4: 'aaa'}, 'AUTHOR_RANK': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}}
I am grouping it by two columns
df_grouped = df.groupby(['AUTHOR_ID', 'Trial_year']).agg({'ARTICLE_ID': "count",
'enrollment': ["count", 'sum']}).reset_index()
As a result, I receive this dataframe, where column names have two levels
{('AUTHOR_ID', ''): {0: 'aaa'}, ('Trial_year', ''): {0: 2017}, ('ARTICLE_ID', 'count'): {0: 5}, ('enrollment', 'count'): {0: 5}, ('enrollment', 'sum'): {0: 50}}
My ideal output is a dataframe with one level of column names, renamed like this:
`AUTHOR_ID`, `Trial_year`, `ARTICLE_ID_count`, `enrollment_count`, `enrollment_sum`
You can modify the columns:
df_grouped.columns = [f"{i}_{j}" if j!='' else i for i,j in df_grouped.columns]
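An equivalent flattening (a variant of my own, not from the original answer) maps '_'.join over the MultiIndex and strips the trailing underscore left by empty second levels:

# same result: join the two levels, then drop the trailing '_' that
# entries like ('AUTHOR_ID', '') would otherwise leave behind
df_grouped.columns = df_grouped.columns.map('_'.join).str.rstrip('_')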
or use NamedAgg from the beginning:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'])
.agg(ARTICLE_ID_count=('ARTICLE_ID', "count"),
enrollment_count=('enrollment','count'),
enrollment_sum=('enrollment','sum')).reset_index())
You can also pass a dictionary to groupby.agg for slightly more concise code:
df_grouped = (df.groupby(['AUTHOR_ID', 'Trial_year'], as_index=False)
.agg(**{'_'.join(pair): pair for pair in [('ARTICLE_ID', 'count'),
('enrollment','count'),
('enrollment','sum')]}))
Output:
AUTHOR_ID Trial_year ARTICLE_ID_count enrollment_count enrollment_sum
0 aaa 2017 5 5 50
I have a large list, which I separated into smaller lists containing random occurrences of 1s and 0s. Also, the first two lists were made with different parameters from the last two.
Example:
list_of_lists[0] =[1,0,1,1,1,0,1,1,1,0]
list_of_lists[1] =[0,0,0,0,0,0,0,0,0,0]
list_of_lists[2] =[1,1,1,1,1,1,1,1,1,1]
list_of_lists[3] =[0,0,1,1,1,1,1,1,1,0]
I would like to count the occurrences of 1s and 0s in each list, and append them into a dictionary to plot the occurrences.
My trial is as follows:
counts_each = dict()
for i in range(4):  # all 4 lists
    for k in list_of_lists[i]:  # elements of the lists
        counts_each[k] = counts_each.get(k, 0) + 1
print(counts_each)
which counts the overall occurrences of 1s and 0s across all the lists:
{0: 16, 1: 24}
If I do:
list_counts = []
for i in range(4):
    counts_each = dict()
    for k in list_of_lists[i]:
        counts_each[k] = counts_each.get(k, 0) + 1
        list_counts.append(counts_each)
print(list_counts)
It does not accumulate all of the counts:
[{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 10},
{0: 10},
{0: 10},
{0: 10},
{0: 10},
{0: 10},
{0: 10},
{0: 10},
{0: 10},
{0: 10},
{1: 10},
{1: 10},
{1: 10},
{1: 10},
{1: 10},
{1: 10},
{1: 10},
{1: 10},
{1: 10},
{1: 10},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7},
{0: 3, 1: 7}]
I would be glad to have some insight into what I am doing wrong.
Thank you.
You can let the collections module do all the counting work for you.
from collections import Counter
list_of_lists = [[] for _ in range(4)]
list_of_lists[0] =[1,0,1,1,1,0,1,1,1,0]
list_of_lists[1] =[0,0,0,0,0,0,0,0,0,0]
list_of_lists[2] =[1,1,1,1,1,1,1,1,1,1]
list_of_lists[3] =[0,0,1,1,1,1,1,1,1,0]
counters = [Counter(l) for l in list_of_lists]
print(*counters, sep="\n")
OUTPUT
Counter({1: 7, 0: 3})
Counter({0: 10})
Counter({1: 10})
Counter({1: 7, 0: 3})
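Since the question mentions plotting the occurrences, here is a minimal plotting sketch built on those counters (it assumes matplotlib, which the question does not name):

import matplotlib.pyplot as plt

# one pair of bars per list: counts of 0s next to counts of 1s
positions = range(len(counters))
plt.bar([p - 0.2 for p in positions], [c.get(0, 0) for c in counters], width=0.4, label='0s')
plt.bar([p + 0.2 for p in positions], [c.get(1, 0) for c in counters], width=0.4, label='1s')
plt.xticks(positions, [f'list {p}' for p in positions])
plt.ylabel('occurrences')
plt.legend()
plt.show()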
You could use a Dict Comprehension, given your nested list:
list_of_lists = [[1,0,1,1,1,0,1,1,1,0], [0,0,0,0,0,0,0,0,0,0], [1,1,1,1,1,1,1,1,1,1], [0,0,1,1,1,1,1,1,1,0]]
use it in this way:
{ idx: {0: lst.count(0), 1: lst.count(1)} for idx, lst in enumerate(list_of_lists) }
#=> {0: {0: 3, 1: 7}, 1: {0: 10, 1: 0}, 2: {0: 0, 1: 10}, 3: {0: 3, 1: 7}}
In the above case I used the index as a key, but you could just use a list comprehension to get a list of dictionaries:
[ {0: lst.count(0), 1: lst.count(1)} for lst in list_of_lists ]
#=> [{0: 3, 1: 7}, {0: 10, 1: 0}, {0: 0, 1: 10}, {0: 3, 1: 7}]
Chris Doyle's answer is excellent, but perhaps your goal is to understand the problem with your solution, specifically.
You have not included your expected output. If I am correct that your issue with your current solution is the repetition of the counts, and you want an output like this:
[{1: 7, 0: 3}, {0: 10}, {1: 10}, {0: 3, 1: 7}]
Then the issue appears to be with the indenting of the line list_counts.append(counts_each). You are doing this each time through the k loop (looping through the items in the list) when I think you want to do it only after finishing the count for a given list:
list_counts = []
for i in range(4):
    counts_each = dict()
    for k in list_of_lists[i]:
        counts_each[k] = counts_each.get(k, 0) + 1
    list_counts.append(counts_each)
print(list_counts)
I want to use https://github.com/datamade/dedupe to deduplicate some records in python. Looking at their examples:
data_d = {}
for row in data:
    clean_row = [(k, preProcess(v)) for (k, v) in row.items()]
    row_id = int(row['id'])
    data_d[row_id] = dict(clean_row)
the dictionary consumes quite a lot of memory compared to, e.g., a dictionary created by pandas from a pd.DataFrame, or even a normal pd.DataFrame.
If this format is required, how can I efficiently convert a pd.DataFrame to such a dictionary?
Edit:
Example of what pandas generates:
{'column1': {0: 1389225600000000000,
1: 1388707200000000000,
2: 1388707200000000000,
3: 1389657600000000000,....
Example of what dedupe expects:
{'1': {column1: 1389225600000000000, column2: "ddd"},
'2': {column1: 1111, column2: "ddd"} ...}
It appears that df.to_dict(orient='index') will produce the representation you are looking for:
import pandas
data = [[1, 2, 3], [4, 5, 6]]
columns = ['a', 'b', 'c']
df = pandas.DataFrame(data, columns=columns)
df.to_dict(orient='index')
results in
{0: {'a': 1, 'b': 2, 'c': 3}, 1: {'a': 4, 'b': 5, 'c': 6}}
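If dedupe needs a particular id column as the key rather than the positional index, a hedged variation on the same call (it assumes the frame has an 'id' column, which this toy example does not):

# set the 'id' column as the index first, so it becomes the outer dict key
data_d = df.set_index('id').to_dict(orient='index')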
You can try something like this:
df = pd.DataFrame({'A': [1,2,3,4,5], 'B': [6,7,8,9,10]})
A B
0 1 6
1 2 7
2 3 8
3 4 9
4 5 10
print(df.T.to_dict())
{0: {'A': 1, 'B': 6}, 1: {'A': 2, 'B': 7}, 2: {'A': 3, 'B': 8}, 3: {'A': 4, 'B': 9}, 4: {'A': 5, 'B': 10}}
This is the same output as in chthonicdaemon's answer, so his answer is probably better. I am using pandas.DataFrame.T to transpose the index and columns.
A python dictionary is not required; you just need an object that allows indexing by column name, i.e. row['col_name'].
So, assuming data is a pandas dataframe, you should just be able to do something like:
data_d = {}
for row_id, row in data.iterrows():
    data_d[row_id] = row
That said, the memory overhead of python dicts is not going to be where you have memory bottlenecks in dedupe.