Unique, filtering pandas dataframe along with changing the data inside the dataframe in one single loop [duplicate] - python

This question already has answers here:
Split pandas dataframe based on groupby (4 answers)
Closed yesterday.
I have a dataframe:
Type  Sub_Type  Count
A     AA        1
A     AA        2
B     BA        3
B     BA        4
B     AA        5
A     BA        6
So I am trying to write a loop that finds the unique combinations of Type and Sub_Type and, for each unique combination, splits the group off into its own dataframe.
Once a new dataframe is created, the values in the Count column should also change: they should be renumbered starting from 1 within each group, as shown in the desired output below.
The output should be:
df1:
Type  Sub_Type  Count
A     AA        1
A     AA        2

df2:
Type  Sub_Type  Count
B     BA        1
B     BA        2

df3:
Type  Sub_Type  Count
B     AA        1

df4:
Type  Sub_Type  Count
A     BA        1
Once the dataframes are made, the loop should end.
Please suggest a loop for this problem.

import pandas as pd

df = pd.DataFrame({'Type': ["A", "A", "B", "B", "B", "A"],
                   'Sub_Type': ["AA", "AA", "BA", "BA", "AA", "BA"],
                   'Count': [3, 9, 3, 4, 3, 1],
                   })

# one dict entry per unique (Type, Sub_Type) combination
datdict = {}
i = 0
for frame, data in df.groupby(['Type', 'Sub_Type']):
    datdict[i] = data
    i += 1
Dataframe 0, for instance:
print(datdict[0])
Type Sub_Type Count
0 A AA 3
1 A AA 9
datdict contains:
>>> datdict
{0:   Type Sub_Type  Count
 0    A       AA      3
 1    A       AA      9,
 1:   Type Sub_Type  Count
 5    A       BA      1,
 2:   Type Sub_Type  Count
 4    B       AA      3,
 3:   Type Sub_Type  Count
 2    B       BA      3
 3    B       BA      4}
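The loop above keeps the original Count values. To also renumber Count from 1 within each group, as the desired output asks, one option is to overwrite the column while grouping. A minimal sketch, using enumerate and assign instead of a manual counter:

datdict = {}
for i, (frame, data) in enumerate(df.groupby(['Type', 'Sub_Type'])):
    # replace Count with 1..n inside each group
    datdict[i] = data.assign(Count=range(1, len(data) + 1))

The loop ends on its own once the groups are exhausted, so no explicit stopping condition is needed.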

Related

Create different dataframes from a main dataframe and order by frequency

I have a dataframe which looks like this
a b name .....
1 1 abc
2 2 xyz
3 3 abc
4 4 dfg
Now I need to create multiple dataframes based on the names, e.g. df_abc should have all the rows where name is "abc", and so on. I tried using a for loop, but I'm new to Python and was not able to solve it. Thanks!
df_abc
a b name
1 1 abc
3 3 abc
You can use .groupby, which yields (name, group) tuples. With a dict comprehension you can then access these dataframes as my_dict["abc"].
df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [1, 2, 3, 4], "name": ["abc", "xyz", "abc", "dfg"]}
)
my_dict = {name: group for name, group in df.groupby("name")}
for val, df_val in my_dict.items():
    print(f"df:{df_val}\n")
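Individual lookups then work like normal dict access; with the example data above:

>>> my_dict["abc"]
   a  b name
0  1  1  abc
2  3  3  abc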
You could create a dictionary of dataframes that holds the different subsets of data, filtered on the unique values of the 'name' column. Then you can reference each dataframe as you would reference a dictionary entry.
See the example below:
import pandas as pd
from io import StringIO

d = """
a b name
1 1 abc
2 2 xyz
3 3 abc
4 4 dfg
"""
df = pd.read_csv(StringIO(d), sep=" ")[['a', 'b', 'name']]

dfs = {}
# iterate over the unique names so each subset is built only once
for item in df['name'].unique():
    dfs[item] = df.loc[df['name'] == item]
>>> dfs.keys()
dict_keys(['abc', 'xyz', 'dfg'])
>>> dfs['abc']
a b name
0 1 1 abc
2 3 3 abc
>>> dfs['xyz']
a b name
1 2 2 xyz
>>> dfs['dfg']
a b name
3 4 4 dfg

updating and including two Pandas' DataFrames

I would like to update a Pandas DataFrame by summation, and if an ID does not exist in the merged DataFrame, I would like to include that ID's row as-is. For example, let's say there are two DataFrames like this:
import pandas as pd
d1 = pd.DataFrame({'ID': ["A", "B", "C", "D"], "value": [2, 3, 4, 5]})
d2 = pd.DataFrame({'ID': ["B", "D", "E"], "value": [1, 3, 2]})
Then, the final output that I would like to produce is as follows:
ID value
0 A 2
1 B 4
2 C 4
3 D 8
4 E 2
Do you have any ideas on this? I have tried the update and concat functions, but could not produce the result I want. Thanks in advance.
Use concat and aggregate sum:
df = pd.concat([d1, d2]).groupby('ID', as_index=False).sum()
print (df)
ID value
0 A 2
1 B 4
2 C 4
3 D 8
4 E 2
Another idea, if the IDs are unique within each DataFrame: convert ID to the index and use DataFrame.add:
df = d1.set_index('ID').add(d2.set_index('ID'), fill_value=0).reset_index()
print (df)
ID value
0 A 2.0
1 B 4.0
2 C 4.0
3 D 8.0
4 E 2.0
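Note that fill_value pushes the result through NaN handling, so the values come back as floats. Assuming integer output is wanted, a cast can be chained on; a minimal sketch:

df = (d1.set_index('ID')
        .add(d2.set_index('ID'), fill_value=0)
        .astype(int)  # restore the integer dtype lost to the float fill
        .reset_index())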

Map a Pandas Series with duplicate keys to a DataFrame

Env: Python 3.9.6, Pandas 1.3.5
I have a DataFrame and a Series like below
df = pd.DataFrame({"C1": ["A", "B", "C", "D"]})
sr = pd.Series(data=[1, 2, 3, 4, 5],
               index=["A", "A", "B", "C", "D"])
"""
[DataFrame]
C1
0 A
1 B
2 C
3 D
[Series]
A 1
A 2
B 3
C 4
D 5
"""
What I tried:
df["C2"] = df["C1"].map(sr)
But InvalidIndexError occurred because the series has duplicate keys ("A").
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Is there any method to make a DataFrame like below?
C1 C2
0 A 1
1 A 2
2 B 3
3 C 4
4 D 5
or
C1 C2
0 A 1
1 B 3
2 C 4
3 D 5
4 A 2
Row indices do not matter.
The question was heavily edited and now has a very different meaning.
You want a simple merge:
df.merge(sr.rename('C2'),
         left_on='C1', right_index=True)
Output:
C1 C2
0 A 1
0 A 2
1 B 3
2 C 4
3 D 5
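The merge keeps df's index, which is why the label 0 appears twice above. Since row indices do not matter here, a reset can be chained on if a clean RangeIndex is preferred; a small sketch:

out = df.merge(sr.rename('C2'),
               left_on='C1', right_index=True).reset_index(drop=True)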
old answer
First, I don't reproduce your issue (tested with 3M rows on pandas 1.3.5).
Then why do you use slicing and not map? This would have the advantage of systematically outputting the correct number of rows (NaN if the key is absent):
Example:
import numpy as np

sr = pd.Series({10: "A", 13: "B", 16: "C", 18: "D"})
df = pd.DataFrame({"C1": np.random.randint(10, 20, size=3000000)})
df['C2'] = df['C1'].map(sr)
print(df.head())
output:
C1 C2
0 10 A
1 18 D
2 10 A
3 13 B
4 15 NaN

Counting values with condition in one DataFrame and adding the result to another DataFrame

I have two DataFrames:
df1 = pd.DataFrame({"id": [1, 2, 3, 4]})
df2 = pd.DataFrame({"id": [1, 1, 2, 4, 4, 4],
                    "text": ["a", "a", "b", "a", "b", "b"]})
Output df1:
   id
0   1
1   2
2   3
3   4
Output df2:
   id text
0   1    a
1   1    a
2   2    b
3   4    a
4   4    b
5   4    b
My goal is to add three columns in df1.
In count_all I would like to count the corresponding ids in df2. E.g. id 4 exists 3 times in df2.
In count_a I would like to count the corresponding ids in df2 where the text value == a.
In count_b I would like to count the corresponding ids in df2 where the text value == b.
id count_all count_a count_b
0 1 2 2 0
1 2 1 0 1
2 3 0 0 0
3 4 3 1 2
How can this be done with pandas?
Use crosstab with the margins parameter to get a total ('All') column, add the missing index values and set the column order with DataFrame.reindex, rename the columns with DataFrame.add_prefix, and finally join to df1 with DataFrame.join:
df = (df1.join(pd.crosstab(df2['id'], df2['text'], margins=True)
                 .reindex(index=df1['id'].unique(),
                          columns=['All'] + df2['text'].unique().tolist(),
                          fill_value=0)
                 .add_prefix('count_'), on='id'))
print (df)
id count_All count_a count_b
0 1 2 2 0
1 2 1 0 1
2 3 0 0 0
3 4 3 1 2
Here is another way:
df1.join(df2.groupby('id').agg(
    count_all=('id', 'count'),
    count_a=('text', lambda x: x.eq('a').sum()),
    count_b=('text', lambda x: x.eq('b').sum())), on='id').fillna(0)
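One caveat: fillna(0) leaves the three count columns as floats wherever an id (here id 3) has no match in df2. Assuming integer counts are wanted, they can be cast back; a short sketch building on the line above:

out = df1.join(df2.groupby('id').agg(
    count_all=('id', 'count'),
    count_a=('text', lambda x: x.eq('a').sum()),
    count_b=('text', lambda x: x.eq('b').sum())), on='id').fillna(0)
cols = ['count_all', 'count_a', 'count_b']
out[cols] = out[cols].astype(int)  # floats back to ints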

What is the pandas equivalent of sort foo | uniq -c (and how to label the count column as 'Count')?

Having spent many hours trying to solve this, I have managed to get close to an answer, but not exactly there. I haven't found an example that does exactly what I want, yet it seems to be a very simple thing to do.
df = pd.DataFrame({'Name': ["A", "B", "C", "A"],
                   'ID': [1, 2, 3, 1]})
print("\ndf")
print(df)
emits
Name ID
0 A 1
1 B 2
2 C 3
3 A 1
What can I do to get this output?
Name ID Count
A 1 2
B 2 1
C 3 1
The below answer should help you:
import pandas as pd
df = pd.DataFrame({'Name': ["A", "B", "C", "A"],
                   'ID': [1, 2, 3, 1]})
df = df.groupby(["Name", "ID"])["Name"].count().reset_index(name="Count")
print(df)
Output:
Name ID Count
0 A 1 2
1 B 2 1
2 C 3 1
Alternatively, size with reset_index can name the column directly, matching the requested 'Count' label:
df.groupby(['Name', 'ID']).size().reset_index(name='Count')
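For a one-liner closer to sort foo | uniq -c | sort -rn (counts sorted descending), DataFrame.value_counts, available since pandas 1.1, also works; a small sketch, where the order among tied counts may differ:

print(df.value_counts().reset_index(name='Count'))
#   Name  ID  Count
# 0    A   1      2
# 1    B   2      1
# 2    C   3      1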
