Python Pandas: how to find matching values by label

I have a CSV file that looks something like this:
mark  time      value1  value2
1     14:22:02  5       2
1     14:22:05  8       4
2     14:25:02  1       1
2     14:26:05  4       7
3     15:12:08  5       2
3     15:12:11  5       4
3     15:12:15  5       2
3     15:12:17  8       4
I would like to output all the matches between mark 1 and mark 3.
Expected result:
"Number of matches" is the number of times a value that appears under mark 1 also appears under mark 3. That is, if there is a 5 under mark 1 in the value1 column, it counts how many rows under mark 3 also have 5 in value1.
Matched on both value columns:
mark  value1  value2  Number of matches
1-3   5       2       2
1-3   8       4       1

For value1:
mark  value1  Number of matches
1-3   5       3
1-3   8       1

For value2:
mark  value2  Number of matches
1-3   2       2
1-3   4       2

You can use a groupby on the filtered DataFrame, then filter again to keep only groups with a count > 1:
target = ['value1', 'value2']

(df.loc[df['mark'].isin([1, 3])]
   .astype({'mark': 'str'})
   .groupby(target, as_index=False)
   .agg(**{'mark': ('mark', lambda g: '-'.join(dict.fromkeys(g))),
           'Num matches': ('mark', 'count')})
   .loc[lambda d: d['Num matches'].gt(1)]
)
Output:
  value1 value2 mark  Num matches
0      5      2  1-3            3
2      8      4  1-3            2
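For reference, the sample above can be rebuilt as a DataFrame directly (a minimal sketch; in practice you would load the file with pd.read_csv):

import pandas as pd

# Reconstruction of the sample CSV shown in the question.
df = pd.DataFrame({
    'mark':   [1, 1, 2, 2, 3, 3, 3, 3],
    'time':   ['14:22:02', '14:22:05', '14:25:02', '14:26:05',
               '15:12:08', '15:12:11', '15:12:15', '15:12:17'],
    'value1': [5, 8, 1, 4, 5, 5, 5, 8],
    'value2': [2, 4, 1, 7, 2, 4, 2, 4],
})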


How to get unique count of two columns based on unique combination of other two columns in pandas

This is a follow-up question to this one.
Say I have this dataset:
import numpy as np
import pandas as pd

dff = pd.DataFrame(
    np.array([["2020-11-13", 3, 4, 0, 0], ["2020-10-11", 3, 4, 0, 1],
              ["2020-11-13", 1, 4, 1, 1], ["2020-11-14", 3, 4, 0, 0],
              ["2020-11-13", 5, 4, 0, 1], ["2020-11-14", 2, 4, 1, 1],
              ["2020-11-12", 1, 4, 0, 1], ["2020-11-14", 2, 4, 0, 1],
              ["2020-11-15", 5, 4, 1, 1], ["2020-11-11", 1, 2, 0, 0],
              ["2020-11-15", 1, 2, 0, 1], ["2020-11-18", 2, 4, 0, 1],
              ["2020-11-17", 1, 2, 0, 0], ["2020-11-20", 3, 4, 0, 0]]),
    columns=['Timestamp', 'Name', 'slot', 'A', 'B'])
I want a count for each Name and slot combination, but rows that repeat the same (A, B) combination should only be counted once. For example, if I simply group by Name and slot I get:
dff.groupby(['Name', "slot"]).Timestamp.count().reset_index(name="count")
  Name slot  count
0    1    2      3
1    1    4      2
2    2    4      3
3    3    4      4
4    5    4      2
However, for Name == 1 and slot == 2 the combination A == 0, B == 0 occurs twice, so instead of 3 I want the count to be 2.
This is the table I would ideally want.
Name  slot  count
1     2     2
1     4     2
2     4     2
3     4     2
5     4     2
I tried:
filter_one = dff.groupby(['A','B']).Timestamp.transform(min)
dff1 = dff.loc[dff.Timestamp == filter_one]
dff1.groupby(['Name', "slot"]).Timestamp.count().reset_index(name="count")
But this gives me:
  Name slot  count
0    1    2      1
1    1    4      1
2    3    4      1
It also does not work if I drop duplicates for A and B.
If I understand correctly, you may just need to drop the duplicates based on the combination of the grouper columns with A and B before grouping:
u = dff.drop_duplicates(['Name','slot','A','B'])
u.groupby(['Name', "slot"]).Timestamp.count().reset_index(name="count")
   Name slot  count
0     1    2      2
1     1    4      2
2     2    4      2
3     3    4      2
4     5    4      2
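If you prefer not to materialize the deduplicated frame, the same counts can be computed per group directly (a sketch equivalent to the answer above):

# Count the distinct (A, B) pairs within each (Name, slot) group.
(dff.groupby(['Name', 'slot'])
    .apply(lambda g: len(g[['A', 'B']].drop_duplicates()))
    .reset_index(name='count'))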

Filter rows with more than 1 value in a set and count their occurrences

Let's assume I have the following data frame.
Id  Combinations
1   (A,B)
2   (C,)
3   (A,D)
4   (D,E,F)
5   (F)
I would like to keep only the rows whose Combinations set has more than one value, and count the number of occurrences of each value across the whole Combinations column. For example, Id 2 and 5 should be removed, since their sets contain only one value.
The result I am looking for is:
Id  Combinations  Frequency
1   A             2
1   B             1
3   A             2
3   D             2
4   D             2
4   E             1
4   F             2
Can anyone help to get the above result in Python pandas?
First, if necessary, convert the values to lists:
df['Combinations'] = df['Combinations'].str.strip('(,)').str.split(',')
If you need the counts after filtering out single-value rows (by Series.str.len in boolean indexing), use DataFrame.explode and count the values with Series.map plus Series.value_counts:
df1 = df[df['Combinations'].str.len().gt(1)].explode('Combinations')
df1['Frequency'] = df1['Combinations'].map(df1['Combinations'].value_counts())
print(df1)
  Id Combinations  Frequency
0  1            A          2
0  1            B          1
2  3            A          2
2  3            D          2
3  4            D          2
3  4            E          1
3  4            F          1
Or, if you need the counts before removing the single-value rows, filter with Series.duplicated as the last step:
df2 = df.explode('Combinations')
df2['Frequency'] = df2['Combinations'].map(df2['Combinations'].value_counts())
df2 = df2[df2['Id'].duplicated(keep=False)]
Alternative:
df2 = df2[df2.groupby('Id').Id.transform('size') > 1]
Or:
df2 = df2[df2['Id'].map(df2['Id'].value_counts()) > 1]
print(df2)
  Id Combinations  Frequency
0  1            A          2
0  1            B          1
2  3            A          2
2  3            D          2
3  4            D          2
3  4            E          1
3  4            F          2
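As a side note, the map/value_counts pattern can also be written with groupby/transform, which avoids building the intermediate counts Series (a sketch; the result on this data is identical):

# 'size' counts how often each Combinations value occurs in the exploded frame.
df2['Frequency'] = df2.groupby('Combinations')['Combinations'].transform('size')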

How to get the difference of the max and min of each row and add it as a column in a dataframe?

I have the following dataframe. The values are customer ratings.
Ind  Department   Value1  Value2  Value3  Value4
1    Electronics  5       4       3       2
2    Clothing     4       3       2       1
3    Grocery      3       3       5       1
Here I would like to add a column range that is the difference between the max and min values of each row. Expected output is below:
Ind  Department   Value1  Value2  Value3  Value4  range
1    Electronics  5       4       3       2       3
2    Clothing     4       3       2       1       3
3    Grocery      3       3       5       1       4
df['range'] = df.max(axis=1) - df.min(axis=1)
Note that this takes the max and min over every column, so the non-numeric Department column must be excluded first (newer pandas versions raise a TypeError otherwise; alternatively pass numeric_only=True). If you want to specify column positions to calculate the range:
df['range'] = df.iloc[:, col1index:col2index].max(axis=1) - df.iloc[:, col1index:col2index].min(axis=1)
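If the value columns are not contiguous, selecting the numeric columns by dtype avoids hard-coded positions (a sketch; note that Ind is numeric too, so it is dropped explicitly):

# Take the row-wise spread over the numeric rating columns only.
num = df.select_dtypes('number').drop(columns='Ind')
df['range'] = num.max(axis=1) - num.min(axis=1)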
You can try numpy's ptp (peak-to-peak, i.e. max minus min):
np.ptp(df.loc[:, 'Value1':].values, axis=1)
array([3, 3, 4], dtype=int64)
df['range'] = np.ptp(df.loc[:, 'Value1':].values, axis=1)
Filter for only the Value columns and compute the difference of the max and min per row:
boxes = df.filter(like="Value")
df["range"] = boxes.max(1) - boxes.min(1)
df
   Ind   Department  Value1  Value2  Value3  Value4  range
0    1  Electronics       5       4       3       2      3
1    2     Clothing       4       3       2       1      3
2    3      Grocery       3       3       5       1      4
Same end result, but a longer route in my opinion - set the first two columns as the index, get the difference of the max and min for each row, and reset the index:
(df
 .set_index(["Ind", "Department"])
 .assign(max_min=lambda x: x.max(1) - x.min(1))
 .reset_index()
)

Sum DataFrame rows where a column contains a substring

I have this DataFrame:
df1:
Date  Value  Info
1     1      XXX.othertext2
1     4      somerandomtext
1     2      XXX.othertext2
1     3      XXX.othertext3
1     2      XXX.othertext3
1     1      XXX.othertext2
1     1      XXX.othertext3
2     6      somerandomtext
2     9      XXX.othertext2
I want to sum, within each Date, each run of rows that starts with XXX.othertext2 and continues until the next XXX.othertext2 or somerandomtext (i.e. the sum of the first XXX.othertext2 plus all following XXX.othertext3 rows). The resulting row's Info value will be XXX.othertext2:
newdf:
Date  Value  Info
1     1      XXX.othertext2
1     4      somerandomtext
1     7      XXX.othertext2
1     2      XXX.othertext2
2     6      somerandomtext
2     9      XXX.othertext2
Here's one option, with a custom grouper:
grouper = (df1.Info.str.contains('some') | (df1.Info == 'XXX.othertext2')).cumsum()
df1.groupby(['Date', grouper])['Value'].sum().reset_index()
You can refine it more with a regex if necessary.
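If you also want to keep the Info label of each summed run, as in the expected output above, one option is a named aggregation (a sketch; grp is just a name given to the grouper so it can be dropped cleanly afterwards):

# Name the run-grouper, sum each run's Value, and keep the label
# of the row that opened the run.
grouper = (df1.Info.str.contains('some')
           | (df1.Info == 'XXX.othertext2')).cumsum().rename('grp')

newdf = (df1.groupby(['Date', grouper])
            .agg(Value=('Value', 'sum'),
                 Info=('Info', 'first'))
            .reset_index()
            .drop(columns='grp'))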

Python Pandas: Counting keys and summing up their values in a data frame

Data frame df1 contains key, value pairs:
   key  val
0    1    7
1    2    5
2    2    5
3    3    4
4    3    4
5    3    4
How do I get a data frame df2 that, for each key, has a record with two fields: cnt, equal to the number of times the key occurs in df1, and sum, equal to the sum of that key's values? Like this:
   cnt  key  sum
0    1    1    7
1    2    2   10
2    3    3   12
You can use agg with a list of summary functions:
df1.groupby('key').val.agg(['count', 'sum']).reset_index()
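That gives a column named count rather than cnt. If you want the cnt column name from the question directly, named aggregation does it in one step (a sketch of the same computation):

# Named aggregation: each keyword becomes the output column name.
df2 = df1.groupby('key')['val'].agg(cnt='count', sum='sum').reset_index()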
