Transform data by conditional addition - Python

I need help transforming the data as follows:
From a dataset in this form (df1):
   ID  apples  oranges  pears  apples_pears  oranges_pears
0   1       1        0      0             1              0
1   2       0        1      0             1              0
2   3       0        1      1             0              1
to a dataset like the following (df2):
   ID  apples  oranges  pears
0   1       2        0      1
1   2       1        1      1
2   3       0        2      2
What I'm trying to accomplish is to get the total value of apples from all the columns whose names contain the word "apple". E.g. in df1 there are 2 column names in which the word "apple" appears; if you sum up all the apples from the first row, there is a total of 2. I want a single apples column in the new dataset (df2). Note that a 1 for apples_pears counts as a 1 for EACH of apples and pears.
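For reference, a minimal sketch that reconstructs the sample frame from the table above (assuming exactly the values shown):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'apples': [1, 0, 0],
                   'oranges': [0, 1, 1],
                   'pears': [0, 0, 1],
                   'apples_pears': [1, 1, 0],
                   'oranges_pears': [0, 0, 1]})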

The idea is to split the DataFrame into two new ones: for the first, rename the columns to the value before the underscore; for the second, keep only the columns containing _ via DataFrame.filter and rename them to the value after the underscore. Finally, join both with concat and sum per column name:
import pandas as pd

df1 = df.set_index('ID')
df2 = df1.filter(like='_')                       # only the combined columns, e.g. apples_pears
df1.columns = df1.columns.str.split('_').str[0]  # apples_pears -> apples
df2.columns = df2.columns.str.split('_').str[1]  # apples_pears -> pears
df = pd.concat([df1, df2], axis=1).sum(level=0, axis=1).reset_index()
print(df)
   ID  apples  oranges  pears
0   1       2        0      1
1   2       1        1      1
2   3       0        2      2
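Note that sum(level=..., axis=1) is deprecated in recent pandas and removed in 2.0. A sketch of the same column-wise grouping that works on current versions, reusing df1 and df2 from above (transposing so the grouping runs on the index):
# group the renamed columns by name and sum; transposing avoids the removed sum(level=...) path
df = (pd.concat([df1, df2], axis=1)
        .T.groupby(level=0).sum().T
        .reset_index())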


Updated all values in a pandas dataframe based on all instances of a specific value in another dataframe

My apologies beforehand! I have done this before a few times, but I am having some brain fog. I have two dataframes, df1 and df2. I would like to update all values in df2 if they match a specific value in df1, while not changing the other values in df2. I can do this pretty easily with np.where on the columns of a single dataframe, but I'm blanking on how I did it with two dataframes!
Goal: set values in df2 to 0 if they are 0 in df1; otherwise keep the df2 value.
Example:
df1
   A  B  C
0  4  0  1
1  0  2  0
2  1  4  0
df2
   A  B  C
0  1  8  1
1  9  2  7
2  1  4  6
Expected df2 after our element swap:
   A  B  C
0  1  0  1
1  0  2  0
2  1  4  0
Brain fog is bad! Thank you for the assistance!
Using fillna:
>>> df2[df1 != 0].fillna(0)
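Note that the boolean mask turns the excluded cells into NaN, which forces a float dtype. A sketch of assigning the result back and restoring integers:
# keep df2 where df1 is nonzero, fill the rest with 0, cast back to int
df2 = df2[df1 != 0].fillna(0).astype(int)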
You can try:
df2[df1.eq(0)] = 0
print(df2)
   A  B  C
0  1  0  1
1  0  2  0
2  1  4  0
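An equivalent one-liner, assuming both frames share the same index and columns, is DataFrame.mask (a sketch):
# set df2 to 0 wherever df1 equals 0, keep the df2 value elsewhere
df2 = df2.mask(df1.eq(0), 0)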

How to make values in one column dependent on another?

I have one column called "A" with only the values 0 or 1, and another column called "B". If the column A value is 0, I want the column B value to be "Cat"; if it is 1, I want the column B value to be "Dog".
Sample DataFrame column:
print(df)
   A
0  0
1  1
Is there any way to fill the B column as such without a for loop?
Desired:
print(df)
   A    B
0  0  Cat
1  1  Dog
Thanks
You can simply try the below using map.
Sample Data
print(df)
   A
0  0
1  1
2  0
3  1
4  1
5  1
6  0
7  0
8  1
Result:
df['B'] = df['A'].map({0: 'Cat', 1: 'Dog'})
print(df)
   A    B
0  0  Cat
1  1  Dog
2  0  Cat
3  1  Dog
4  1  Dog
5  1  Dog
6  0  Cat
7  0  Cat
8  1  Dog
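One caveat with map: any value of A missing from the dict becomes NaN. If that can happen, a sketch with a fallback (the 'unknown' label here is hypothetical):
# unmapped values of A become NaN; give them a default label
df['B'] = df['A'].map({0: 'Cat', 1: 'Dog'}).fillna('unknown')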
Next time, please post your research and minimal reproducible code; see the comments.
import pandas as pd

d = {'A': [0, 1, 0]}
df = pd.DataFrame(data=d)
m = {0: 'Cat', 1: 'Dog'}  # aligned with the desired output above
df['B'] = df['A'].map(m)
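Since A is binary, np.where is another loop-free option (a sketch following the desired output above):
import numpy as np

# vectorized: 0 -> Cat, 1 -> Dog over the whole column
df['B'] = np.where(df['A'].eq(0), 'Cat', 'Dog')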

Groupby a column in Dataframe and create another dataframe with grouped data

I have a dataframe like the following:
data:
  items status
0   jet   fail
1   car   fail
2   car   pass
3  bike   fail
4   car   fail
5   jet   fail
6  bike   pass
7   jet   fail
8   jet   fail
9  bike   pass
I want to group the data by items and create a new dataframe with the counts of each value.
Expected output:
df:
  unique  count  pass  fail
0    jet      4     0     4
1    car      3     1     2
2   bike      3     2     1
One method would be to get a list of unique items, loop over it to find the count, pass, and fail values for each, and then combine those lists into a dataframe. But how can I do that efficiently?
Use crosstab with DataFrame.rename_axis to set the new index name, then insert the count column at position 0 with DataFrame.insert, and finally convert the index to a column with DataFrame.reset_index:
df = pd.crosstab(df['items'], df['status']).rename_axis(columns=None, index='unique')
df.insert(0, 'count', df.sum(axis=1))  # row totals as the first column
df = df.reset_index()
print(df)
  unique  count  fail  pass
0   bike      3     1     2
1    car      3     2     1
2    jet      4     4     0
If count should be the last column, you can use the margins parameter and remove the last row:
df = (pd.crosstab(df['items'], df['status'],
                  margins=True,
                  margins_name='count')
        .rename_axis(columns=None, index='unique')
        .iloc[:-1]  # drop the totals row that margins adds
        .reset_index())
print(df)
  unique  fail  pass  count
0   bike     1     2      3
1    car     2     1      3
2    jet     4     0      4
You could get the values separately and combine them with pd.concat:
A = df.groupby("items").size().rename("count")
A
items
bike    3
car     3
jet     4
Name: count, dtype: int64
B = (
    df.groupby(["items", "status"])
      .size()
      .unstack(fill_value=0)
      .rename_axis(columns=None)
)
B
      fail  pass
items
bike     1     2
car      2     1
jet      4     0
pd.concat((A, B), axis=1).reset_index()
  items  count  fail  pass
0  bike      3     1     2
1   car      3     2     1
2   jet      4     4     0
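Both approaches sort the groups alphabetically, while the expected output lists items in order of first appearance (jet, car, bike). If that order matters, one option is to reindex by the original order (a sketch reusing A and B from above):
# restore first-appearance order after the alphabetical groupby
order = df['items'].unique()
out = pd.concat((A, B), axis=1).reindex(order).reset_index()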

How to count the most popular value from multiple value pandas column

I have the following problem: I have a pandas dataframe with shop IDs and shop categories, looking something like this:
             id                                 cats
0      10002718                      182,45001,83079
1      10004056                                 9798
2      10009726                             17,45528
3      10009752                             64324,17
4       1001107                    44607,83520,76557
...         ...                                  ...
24922   9992184                                45716
24923   9997866                                77063
24924   9998461         45001,44605,3238,72627,83785
24925   9998954                    69908,78574,77890
24926   9999728  45653,44605,83648,85023,84481,68822
So the problem is that each shop can have multiple categories, and the task is to count the frequency of each category. What's the easiest way to do it? In the end I need a dataframe with these columns:
   cats  count
0     1    133
1     2      1
2     3     15
3     4     12
Use Series.str.split with Series.explode and Series.value_counts:
df1 = (df['cats'].str.split(',')
                 .explode()
                 .value_counts()
                 .rename_axis('cats')
                 .reset_index(name='count'))
Or add expand=True to split into a DataFrame, then use DataFrame.stack:
df1 = (df['cats'].str.split(',', expand=True)
                 .stack()
                 .value_counts()
                 .rename_axis('cats')
                 .reset_index(name='count'))
print(df1.head(10))
    cats  count
0     17      2
1  44605      2
2  45001      2
3  83520      1
4  64324      1
5  44607      1
6  45653      1
7  69908      1
8  83785      1
9  83079      1
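A non-pandas alternative for the counting step is collections.Counter over the split lists (a sketch producing the same two-column frame):
from collections import Counter
from itertools import chain

# count every category across all shops, then rebuild the cats/count frame
counts = Counter(chain.from_iterable(df['cats'].str.split(',')))
df1 = (pd.Series(counts, name='count')
         .sort_values(ascending=False)  # match value_counts order
         .rename_axis('cats')
         .reset_index())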

How to return a dataframe of number of duplicates based on filtering the two columns in python

ID  group  categories
 1      0         red
 2      1        blue
 3      1       green
 4      1       green
 1      0        blue
 1      0        blue
 2      1         red
 3      0         red
 4      0         red
 4      1         red
Hi, I am new to Python. I am trying to get the count of duplicate IDs based on multiple conditions on the other two columns. So I filter for category 'red' and group 0, and then I want the IDs that repeat more than once.
df1 = df[(df['categories']=='red')& (df['group'] == 0)]
df1['ID'].value_counts()[df1['ID'].value_counts()>1]
There are almost 10 categories in the categories column, so I was wondering whether there is a way to write a function or for loop instead of repeating the same steps for each one. The final goal is to see how many duplicate IDs are in each group given the category is 'red'/'blue'/'green'. Thanks in advance.
P.S.: the group values don't change; it is a binary variable.
output:
ID  count
 1      3
 2      2
 3      2
 4      3
I think you can use groupby with SeriesGroupBy.value_counts:
s = df.groupby(['ID','group'])['categories'].value_counts()
print(s)
ID  group  categories
1   0      blue          2
           red           1
2   1      blue          1
           red           1
3   0      red           1
    1      green         1
4   0      red           1
    1      green         1
           red           1
Name: categories, dtype: int64
out = s[s > 1].reset_index(name='count')
print(out)
   ID  group categories  count
0   1      0       blue      2
Another solution is to get the duplicates first by filtering with duplicated, and then count:
df = df[df.duplicated(['ID','group','categories'], keep=False)]
print(df)
   ID  group categories
4   1      0       blue
5   1      0       blue
df1 = df.groupby(['ID','group'])['categories'].value_counts().reset_index(name='count')
print(df1)
   ID  group categories  count
0   1      0       blue      2
EDIT: To count the categories (all rows) per ID on the original df, use GroupBy.size:
df1 = df.groupby('ID').size().reset_index(name='count')
print(df1)
   ID  count
0   1      3
1   2      2
2   3      2
3   4      3
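To repeat the asker's original filter for every category without copy-pasting, one option is a loop over groupby on the original df (a sketch of the two-step approach per category):
# for each category, count repeats of each (ID, group) pair and keep IDs seen more than once
for cat, sub in df.groupby('categories'):
    counts = sub.groupby(['ID', 'group']).size()
    print(cat)
    print(counts[counts > 1])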
