What is the pandas equivalent of sort foo | uniq -c (and how to label the count column as 'Count')?

Having spent many hours trying to solve this, I have managed to get close to an answer, but not exactly there. I haven't found an example that does exactly what I want, yet it seems to be a very simple thing to do.
import pandas as pd

df = pd.DataFrame({'Name': ["A", "B", "C", "A"],
                   'ID': [1, 2, 3, 1]})
print("\ndf")
print(df)
emits
  Name  ID
0    A   1
1    B   2
2    C   3
3    A   1
What can I do to get this output?
Name  ID  Count
A      1      2
B      2      1
C      3      1

The answer below should help you:
import pandas as pd
df = pd.DataFrame({'Name': ["A", "B", "C", "A"],
                   'ID': [1, 2, 3, 1]})
df = df.groupby(["Name", "ID"])["Name"].count().reset_index(name="Count")
print(df)
Output:
  Name  ID  Count
0    A   1      2
1    B   2      1
2    C   3      1

Alternatively, size() counts the rows in each group directly:
df.groupby(['Name', 'ID']).size().reset_index().rename(columns={0: 'Count'})
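On pandas 1.1+, DataFrame.value_counts is arguably the closest single call to sort foo | uniq -c, since it counts unique rows directly and sorts by count descending. A small sketch using the original df from the question (row order among equal counts may vary):
out = df.value_counts().reset_index(name='Count')  # counts unique (Name, ID) rows
print(out)
  Name  ID  Count
0    A   1      2
1    B   2      1
2    C   3      1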

Related

Updating and combining two pandas DataFrames

I would like to update a pandas DataFrame by summation, and if an ID does not exist in the merged DataFrame, I would like to include that ID's corresponding row. For example, let's say there are two DataFrames like this:
import pandas as pd
d1 = pd.DataFrame({'ID': ["A", "B", "C", "D"], "value": [2, 3, 4, 5]})
d2 = pd.DataFrame({'ID': ["B", "D", "E"], "value": [1, 3, 2]})
Then, the final output that I would like to produce is as follows:
ID value
0 A 2
1 B 4
2 C 4
3 D 8
4 E 2
Do you have any ideas on this? I have tried the update and concat functions, but neither produces the result I want. Thanks in advance.
Use concat and aggregate sum:
df = pd.concat([d1, d2]).groupby('ID', as_index=False).sum()
print (df)
ID value
0 A 2
1 B 4
2 C 4
3 D 8
4 E 2
Another idea, if the IDs are unique within each DataFrame: convert ID to the index and use DataFrame.add:
df = d1.set_index('ID').add(d2.set_index('ID'), fill_value=0).reset_index()
print (df)
ID value
0 A 2.0
1 B 4.0
2 C 4.0
3 D 8.0
4 E 2.0
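Note that fill_value=0 upcasts the summed column to float, which is why the values print as 2.0, 4.0, and so on. If you want integers back, a small follow-up cast works; a sketch continuing from the snippet above:
df = d1.set_index('ID').add(d2.set_index('ID'), fill_value=0).reset_index()
df['value'] = df['value'].astype(int)  # undo the float upcast caused by fill_value
print(df)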

How to remove some rows from a Pandas dataframe to balance it

I have a CSV file and, after reading it with pandas, it has two columns: file_path and label. The labels are only zeros and ones, and the frequency count is as follows:
data["labels"].value_counts()
0 197664
1 78444
I would like to remove some of the rows whose label is 0, say 20k for example, so that the frequency counts become:
data["labels"].value_counts()
0 177664
1 78444
You can drop the last 20K rows that match the condition using pandas drop:
df.drop(df[df.labels == 0].index[-20000:], inplace=True)
mydict = {
    "file_path": ["a", "b", "c", "d", "e", "f", "g"],
    "label": [0, 1, 0, 1, 1, 1, 0]
}
df = pd.DataFrame(mydict)
  file_path  label
0         a      0
1         b      1
2         c      0
3         d      1
4         e      1
5         f      1
6         g      0
If your labels are 0 or 1 and you want only the rows with label 1, you can group the dataset by the "label" column and then use get_group():
get_1 = df.groupby("label").get_group(1)
get_1
  file_path  label
1         b      1
3         d      1
4         e      1
5         f      1
Usually I split, filter, then concat:
df1 = df.iloc[:20000]                            # first 20,000 rows
df2 = df.drop(df1.index)                         # everything else
new = pd.concat([df1[df1['labels'] != 0], df2])  # drop zero-label rows from the first chunk only
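If the goal is simply to drop a fixed number of zero-label rows, randomly sampling the rows to drop may be safer than taking the first or last ones, since it avoids any ordering bias in the file. A sketch using DataFrame.sample (random_state=42 is an arbitrary seed for reproducibility):
zeros = data[data['labels'] == 0]                               # all rows with label 0
data = data.drop(zeros.sample(n=20000, random_state=42).index)  # remove 20,000 random zero-label rows
print(data['labels'].value_counts())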

Converting category type column values to columns with corresponding value count [duplicate]

This question already has an answer here:
simple pivot table of pandas dataframe
(1 answer)
Closed last year.
I have a Pandas dataframe built like:
Col1  Col2
   1     A
   1     B
   1     B
   2     A
   2     A
   3     A
   3   NaN
For every value of Col1, I want to count the occurrences of each value of Col2, ignoring the NaN values, and put the counts in the corresponding columns, obtaining something like:
Col1  A  B
   1  1  2
   2  2  0
   3  1  0
How can I do that in Pandas? I have a lot of values in Col1 and lots of columns like Col2.
Thank you very much!
You can try crosstab
out = pd.crosstab(df.Col1, df.Col2).reset_index()
Out[66]:
Col2  Col1  A  B
0        1  1  2
1        2  2  0
2        3  1  0
Simply group by all of the columns and unstack; it works:
df.groupby(list(df.columns)).size().unstack()
output:
Col2    A    B
Col1
1     1.0  2.0
2     2.0  NaN
3     1.0  NaN
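If you would rather avoid the NaN/float upcast in that output, unstack accepts a fill_value; a small variant of the snippet above:
df.groupby(list(df.columns)).size().unstack(fill_value=0)  # integer counts, 0 instead of NaN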
You can just add the columns and set the values to a new column.
sum_column = df["col1"] + df["col2"]
df["col3"] = sum_column
If you have several columns and want to count the values over all columns, here's a maybe not too elegant, but working solution:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 1, 1, 2, 2, 2, 3, 3, 3],
                   ["A", "B", "A", "C", "b", "A", "A", "A", np.nan],
                   ["a", "a", "C", "b", "A", "b", "a", "b", np.nan]]).T
df.columns = ["Col1", "Col2", "Col3"]
result = {}
for i, group in df.groupby("Col1"):
    result[i] = group.drop(columns=["Col1"]).stack().value_counts()
result = pd.DataFrame(result).T.fillna(0)
There are a number of ways to achieve this (@BENY has already provided the crosstab option):
pd.get_dummies:
(pd
.get_dummies(df, columns=["Col2"], prefix="", prefix_sep="")
.groupby("Col1")
.sum()
)
A B
Col1
1 1 2
2 2 0
3 1 0
pivot_table:
(df.pivot_table(index="Col1", columns="Col2", aggfunc="size", fill_value=0))
Col2 A B
Col1
1 1 2
2 2 0
3 1 0
value_counts:
(df
.value_counts()
.unstack(fill_value=0)
.rename_axis(columns=None)
)
A B
Col1
1 1 2
2 2 0
3 1 0
Use pandas.pivot_table(); documentation is available at https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html.

Label encoding multiple columns with the same category

Consider the following dataframe:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame(data=[["France", "Italy", "Belgium"], ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])
df = df.apply(LabelEncoder().fit_transform)
print(df)
It currently outputs:
a b c
0 0 1 0
1 1 0 0
My goal is to make it output something like this by passing in the columns I want to share categorical values:
a b c
0 0 1 2
1 1 0 2
Pass axis=1 to call LabelEncoder().fit_transform once for each row.
(By default, df.apply(func) calls func once for each column).
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])
encoder = LabelEncoder()
df = df.apply(encoder.fit_transform, axis=1)
print(df)
yields
a b c
0 1 2 0
1 2 1 0
Alternatively, you could make the data category dtype and use the category codes as labels:
import pandas as pd
df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])
stacked = df.stack().astype('category')
result = stacked.cat.codes.unstack()
print(result)
also yields
a b c
0 1 2 0
1 2 1 0
This should be significantly faster since it does not require calling encoder.fit_transform once for each row (which might give terrible performance if you have lots of rows).
You can do this with pd.factorize.
df = df.stack()
df[:] = pd.factorize(df)[0]
df.unstack()
a b c
0 0 1 2
1 1 0 2
If you want to encode only some columns of the dataframe:
temp = df[['a', 'b']].stack()
temp[:] = temp.factorize()[0]
df[['a', 'b']] = temp.unstack()
a b c
0 0 1 Belgium
1 1 0 Belgium
If the encoding order doesn't matter, you can do:
df_new = pd.DataFrame(
    columns=df.columns,
    data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape)
)
df_new
Out[27]:
a b c
0 1 2 0
1 2 1 0
Here's an alternative solution using categorical data. It is similar to @unutbu's but preserves the ordering of factorization. In other words, the first value found will have code 0.
df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]],
                  columns=["a", "b", "c"])
# get unique values in order
vals = df.T.stack().unique()
# convert to categories and then extract codes
for col in df:
    df[col] = pd.Categorical(df[col], categories=vals)
    df[col] = df[col].cat.codes
print(df)
a b c
0 0 1 2
1 1 0 2
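If you prefer to stay with scikit-learn, another option along the same lines is to fit a single LabelEncoder on all of the values and then transform each column with it, so every column shares one vocabulary. A sketch (LabelEncoder assigns codes in sorted order, so Belgium=0, France=1, Italy=2 here):
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])
encoder = LabelEncoder().fit(df.values.ravel())  # one shared vocabulary for all columns
print(df.apply(encoder.transform))               # each column encoded with the same mapping
   a  b  c
0  1  2  0
1  2  1  0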
