Updating and combining two pandas DataFrames - python

I would like to update a pandas DataFrame by summation, and if an ID does not exist in the merged DataFrame, I would like to include that ID's corresponding row. For example, let's say there are two DataFrames like this:
import pandas as pd
d1 = pd.DataFrame({'ID': ["A", "B", "C", "D"], "value": [2, 3, 4, 5]})
d2 = pd.DataFrame({'ID': ["B", "D", "E"], "value": [1, 3, 2]})
Then, the final output that I would like to produce is as follows:
ID value
0 A 2
1 B 4
2 C 4
3 D 8
4 E 2
Do you have any ideas on this? I have tried update and concat, but I could not produce the result I want. Thanks in advance.

Use concat and aggregate by sum:
df = pd.concat([d1, d2]).groupby('ID', as_index=False).sum()
print(df)
ID value
0 A 2
1 B 4
2 C 4
3 D 8
4 E 2
Another idea, if ID is unique within each DataFrame: convert ID to the index and use DataFrame.add:
df = d1.set_index('ID').add(d2.set_index('ID'), fill_value=0).reset_index()
print(df)
ID value
0 A 2.0
1 B 4.0
2 C 4.0
3 D 8.0
4 E 2.0
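Note that fill_value=0 upcasts value to float, which is why the output above shows 2.0, 4.0, etc. If you want integers back, a small follow-up:
df = (d1.set_index('ID')
        .add(d2.set_index('ID'), fill_value=0)
        .astype({'value': int})  # cast back after the float upcast
        .reset_index())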

Related

Converting category type column values to columns with corresponding value count [duplicate]

I have a Pandas dataframe built like:
Col1 Col2
1    A
1    B
1    B
2    A
2    A
3    A
3    NaN
For every value of Col1, I want to count every value of Col2, ignoring the NaN values, and put the counts in the associated columns, obtaining something like:
Col1 A B
1    1 2
2    2 0
3    1 0
How can I do that in Pandas? I have a lot of values in Col1 and lots of columns like Col2.
Thank you very much!
You can try crosstab:
out = pd.crosstab(df.Col1, df.Col2).reset_index()
Out[66]:
Col2 Col1 A B
0 1 1 2
1 2 2 0
2 3 1 0
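Note that crosstab ignores NaN by default, which is what the question asks for. If you ever need row/column totals as well, it accepts margins (a minimal sketch):
out = pd.crosstab(df.Col1, df.Col2, margins=True, margins_name="Total")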
You can simply do this:
df.groupby(list(df.columns)).size().unstack()
output:
Col2 A B
Col1
1 1.0 2.0
2 2.0 NaN
3 1.0 NaN
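If you prefer integer zeros instead of NaN for the missing combinations, unstack accepts a fill_value (a small variation on the above):
df.groupby(list(df.columns)).size().unstack(fill_value=0)
Col2 A B
Col1
1 1 2
2 2 0
3 1 0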
You can just add the columns and assign the result to a new column:
sum_column = df["col1"] + df["col2"]
df["col3"] = sum_column
If you have several columns and want to count the values over all of them, here's a perhaps inelegant but working solution:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 1, 1, 2, 2, 2, 3, 3, 3],
                   ["A", "B", "A", "C", "b", "A", "A", "A", np.nan],
                   ["a", "a", "C", "b", "A", "b", "a", "b", np.nan]]).T
df.columns = ["Col1", "Col2", "Col3"]
result = {}
for i, group in df.groupby("Col1"):
    # stack the value columns into one Series and count each value (NaN is dropped)
    result[i] = group.drop(columns=["Col1"]).stack().value_counts()
result = pd.DataFrame(result).T.fillna(0)
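A more compact variant of the same idea (a sketch, assuming the value columns are Col2 and Col3): melt the frame into long form and let crosstab count everything at once, since crosstab drops NaN by default:
long = df.melt(id_vars="Col1", value_vars=["Col2", "Col3"])
counts = pd.crosstab(long["Col1"], long["value"])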
There are a number of ways to achieve this (@BENY has already provided the crosstab option):
pd.get_dummies:
(pd
.get_dummies(df, columns=["Col2"], prefix="", prefix_sep="")
.groupby("Col1")
.sum()
)
A B
Col1
1 1 2
2 2 0
3 1 0
pivot_table:
(df.pivot_table(index="Col1", columns="Col2", aggfunc="size", fill_value=0))
Col2 A B
Col1
1 1 2
2 2 0
3 1 0
value_counts:
(df
.value_counts()
.unstack(fill_value=0)
.rename_axis(columns=None)
)
A B
Col1
1 1 2
2 2 0
3 1 0
Use pandas.pivot_table(). Documentation: https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html

Map a Pandas Series with duplicate keys to a DataFrame

Env: Python 3.9.6, Pandas 1.3.5
I have a DataFrame and a Series like below
df = pd.DataFrame({"C1" : ["A", "B", "C", "D"]})
sr = pd.Series(data = [1, 2, 3, 4, 5],
index = ["A", "A", "B", "C", "D"])
"""
[DataFrame]
C1
0 A
1 B
2 C
3 D
[Series]
A 1
A 2
B 3
C 4
D 5
"""
Here is what I tried:
df["C2"] = df["C1"].map(sr)
But an InvalidIndexError occurred because the Series has duplicate keys ("A"):
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Is there any method to build a DataFrame like one of the below?
C1 C2
0 A 1
1 A 2
2 B 3
3 C 4
4 D 5
or
C1 C2
0 A 1
1 B 3
2 C 4
3 D 5
4 A 2
Row indices do not matter.
The question was heavily edited and now has a very different meaning.
You want a simple merge:
df.merge(sr.rename('C2'),
         left_on='C1', right_index=True)
Output:
C1 C2
0 A 1
0 A 2
1 B 3
2 C 4
3 D 5
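Since the question says row indices do not matter, you can optionally reset them after the merge:
(df.merge(sr.rename('C2'),
          left_on='C1', right_index=True)
   .reset_index(drop=True))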
old answer
First, I can't reproduce your issue (tested with 3M rows on pandas 1.3.5).
Then why do you use slicing and not map? map has the advantage of always outputting the correct number of rows (NaN if the key is absent):
Example:
sr = pd.Series({10:"A", 13:"B", 16:"C", 18:"D"})
df = pd.DataFrame({"C1":np.random.randint(10, 20, size=3000000)})
df['C2'] = df['C1'].map(sr)
print(df.head())
output:
C1 C2
0 10 A
1 18 D
2 10 A
3 13 B
4 15 NaN

How to concat the row output of iterrows to another pandas DataFrame with the same columns?

Assume I have the following two pandas DataFrames:
df1 = pd.DataFrame({"A": [1, 2, 3],
"B": ["a", "b", "c"],
"C": [7, 43, 15]})
df2 = pd.DataFrame({"A": [4, 5],
"B": ["c", "d"],
"C": [12, 19]})
Now, I want to iterate over the rows in df1, and if a certain condition is met for that row, add the row to df2.
For example:
for i, row in df1.iterrows():
    if row["C"] == 43:
        df2 = pd.concat([row, df2])
df2.head()
Should give me output:
A B C
4 c 12
5 d 19
2 b 43
But instead I get an output where the column names of the DataFrames appear in the rows:
0 A B C
A 2 NaN NaN NaN
B b NaN NaN NaN
C 43 NaN NaN NaN
0 NaN 4.0 c 12.0
1 NaN 5.0 d 19.0
How to solve this?
I think you just need concat with boolean indexing on df1.
pd.concat([df2, df1[df1['C'] == 43]], ignore_index=True)
The df1[df1['C'] == 43] part takes a slice of df1 where column C equals 43, and concat appends it to df2.
Output:
A B C
0 4 c 12
1 5 d 19
2 2 b 43
Change your code to this:
for i, row in df1.iterrows():
    if row["C"] == 43:
        # append the row in place; assumes df2 has a default RangeIndex
        df2.loc[len(df2.index)] = row
df2.head()
Use pd.concat().
pd.concat() takes a list of DataFrames and stacks them vertically when no axis is given.
In your case, pass df2 along with df1[df1["C"] == 43], which returns only the rows whose column C equals 43.
reset_index(drop=True) keeps the output from having duplicate index values.
df2 = pd.concat([df2, df1[df1["C"] == 43]]).reset_index(drop=True)
print(df2)
A B C
0 4 c 12
1 5 d 19
2 2 b 43

What is the pandas equivalent of sort foo | uniq -c (and how to label the count column as 'Count')?

Having spent many hours trying to solve this, I have managed to get close to an answer, but not exactly there. I haven't found an example that does exactly what I want, yet it seems to be a very simple thing to do.
df = pd.DataFrame({'Name': ["A", "B", "C", "A"],
                   'ID': [1, 2, 3, 1]})
print("\ndf")
print(df)
emits
Name ID
0 A 1
1 B 2
2 C 3
3 A 1
What can I do to get this output?
Name ID Count
A 1 2
B 2 1
C 3 1
The below answer should help you:
import pandas as pd
df = pd.DataFrame({'Name': ["A", "B", "C", "A"],
                   'ID': [1, 2, 3, 1]})
df = df.groupby(["Name", "ID"])["Name"].count().reset_index(name="Count")
print(df)
Output:
Name ID Count
0 A 1 2
1 B 2 1
2 C 3 1
Alternatively, with size (renaming the generated 0 column to 'Count', as the question asks):
df.groupby(['Name', 'ID']).size().reset_index().rename(columns={0: 'Count'})
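On pandas 1.1+, DataFrame.value_counts gives an equivalent one-liner, assuming you want the column labelled 'Count' (note the result is sorted by count, descending):
df.value_counts(['Name', 'ID']).reset_index(name='Count')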
