Pandas: how to count how many times strings appear in columns? - python

It's easier to explain starting from a simple example df:
df1:
   A  B  C  D
0  a  6  1  b/5/4
1  a  6  1  a/1/6
2  c  9  3  9/c/3
There are four columns in df1 (A, B, C, D). The task is to find out how many of column D's strings (split on '/') appear in columns A, B and C. Here is the expected output with more explanation:
df2 (expected output):
   A  B  C  D      E (new column)
0  a  6  1  b/5/4  0   <-- found 0 of column D's strings in columns A, B, C
1  a  6  1  a/1/6  3   <-- found a, 1 and 6, so it should return 3
2  c  9  3  9/c/3  3   <-- found all strings (3 in total)
Does anyone have a good idea for this? Thanks!

You can use a list comprehension with set operations:
df['E'] = [len(set(l).intersection(s.split('/')))
           for l, s in zip(df.drop(columns='D').astype(str).to_numpy().tolist(),
                           df['D'])]
Output:
   A  B  C  D      E
0  a  6  1  b/5/4  0
1  a  6  1  a/1/6  3
2  c  9  3  9/c/3  3
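To see what each iteration of the comprehension computes, here is a minimal sketch for the second row (the variable names are illustrative only):
row_values = ['a', '6', '1']                        # columns A, B, C cast to strings
d_parts = 'a/1/6'.split('/')                        # ['a', '1', '6']
print(len(set(row_values).intersection(d_parts)))   # -> 3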

A loop-based alternative:
import pandas as pd

dt = {'A': ['a', 'a', 'c'], 'B': [6, 6, 9], 'C': [1, 1, 3],
      'D': ['b/5/4', 'a/1/6', 'c/9/3']}
E = []
nu_data = pd.DataFrame(data=dt)
for itxid, itx in enumerate(nu_data['D']):
    match = 0
    str_list = itx.split('/')              # parts of column D for this row
    for keyid, keys in enumerate(dt):
        if keyid < len(dt) - 1:            # skip the last key, 'D' itself
            for seg_str in str_list:
                if str(dt[keys][itxid]) == seg_str:
                    match += 1
    E.append(match)
nu_data['E'] = E
print(nu_data)
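For reference, the same per-row logic can also be written with DataFrame.apply (a sketch, assuming the column names above; it compares distinct values, which matches this data):
nu_data['E'] = nu_data.apply(
    lambda row: len({str(row['A']), str(row['B']), str(row['C'])}
                    & set(row['D'].split('/'))),
    axis=1)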

Related

Sequence length in a dataframe in Python

I have a dataframe in Python that has a column like the one below:
Type
A
A
B
B
B
I want to add another column to my data frame according to the sequence of Type:
Type Seq
A 1
A 2
B 1
B 2
B 3
I was doing it in R with the following command:
setDT(df)[ , Seq := seq_len(.N), by = rleid(Type) ]
I am not sure how to do it in Python.
Use Series.rank:
df['seq'] = df['Type'].rank(method = 'dense').astype(int)
Type seq
0 A 1
1 A 1
2 B 2
3 B 2
4 B 2
Edit for updated question
df['seq'] = df.groupby('Type').cumcount() + 1
df
Output:
Type seq
0 A 1
1 A 2
2 B 1
3 B 2
4 B 3
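Note that the R code numbers rows within consecutive runs (rleid), so cumcount per Type only matches it while each Type appears in one contiguous block. If the same Type can reappear later (e.g. A A B B B A), a run id restores the rleid behavior; a sketch:
run_id = (df['Type'] != df['Type'].shift()).cumsum()
df['Seq'] = df.groupby(run_id).cumcount() + 1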
Use pd.factorize:
import pandas as pd
df['seq'] = pd.factorize(df['Type'])[0] + 1
df
Output:
Type seq
0 A 1
1 A 1
2 B 2
3 B 2
4 B 2
In pandas:
(df.Type != df.Type.shift()).cumsum()
Out[58]:
0 1
1 1
2 2
3 2
4 2
Name: Type, dtype: int32
More info: this is the pandas analogue of R data.table's rleid.
v = c('A','A','B','B','B','A')
data.table::rleid(v)
[1] 1 1 2 2 2 3
df
  Type
0    A
1    A
2    B
3    B
4    B
5    A   # rleid assigns a new number when the run changes
(df.Type != df.Type.shift()).cumsum()
Out[60]:
0    1
1    1
2    2
3    2
4    2
5    3   # check: matches rleid
Might not be the best way, but try this:
df.loc[df['Type'] == 'A', 'Seq'] = 1
Similarly, for B:
df.loc[df['Type'] == 'B', 'Seq'] = 2
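A generalized sketch of the same idea, assuming you want one number per distinct Type rather than hard-coding each value:
mapping = {t: i + 1 for i, t in enumerate(df['Type'].unique())}
df['Seq'] = df['Type'].map(mapping)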
A strange (and not recommended) way of doing it is to use the built-in ord() function to get the Unicode code-point of the character.
That is:
df['Seq'] = df['Type'].apply(lambda x: ord(x.lower()) - 96)
A much better way of doing it is to change the type of the strings to categories:
df['Seq'] = df['Type'].astype('category').cat.codes
You may have to increment the codes if you want different numbers.
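A quick check of what cat.codes produces for the sample (codes start at 0, hence the possible increment):
import pandas as pd

s = pd.Series(['A', 'A', 'B', 'B', 'B'])
print(s.astype('category').cat.codes + 1)   # 1 1 2 2 2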

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what's the best way to duplicate rows whose groupby(['Type']) count is less than 3 until each group reaches 3? df is the input and df1 is my desired outcome. You can see that row 3 from df was duplicated 2 times at the end. This is only an example deck; the real data has approximately 20 million lines and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I don't know the best way to write func:
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]        # extra copies needed per rare Type
df['repeat_num'] = df.Type.map(repeat_map).fillna(0, downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type', 'Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: sort=False for append is available in pandas >= 0.23.0; remove it if you are using a lower version.
EDIT: If the data contains multiple value columns, set every column except one as the index, then repeat and reset_index:
df = df.append(df.set_index(['Type', 'Val_1', 'Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
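Since DataFrame.append was removed in pandas 2.0, here is a sketch of the same logic using pd.concat with index.repeat instead:
import pandas as pd

df = pd.DataFrame({'Type': list('aaabccc'), 'Val': [1, 2, 3, 1, 3, 2, 1]})
counts = df['Type'].value_counts()
repeat_map = 3 - counts[counts < 3]                  # extra copies needed per rare Type
repeat_num = df['Type'].map(repeat_map).fillna(0).astype(int)
extra = df.loc[df.index.repeat(repeat_num)]          # duplicates of the rare rows only
df = pd.concat([df, extra], ignore_index=True)
print(df)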

Start counting at zero by group

Consider the following dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame({'group': list('aaabbabc')})
>>> df
group
0 a
1 a
2 a
3 b
4 b
5 a
6 b
7 c
I want to count the cumulative number of times each group has occurred. My desired output looks like this:
>>> df
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
My initial approach was to do something like this:
df['n'] = df.groupby('group').apply(lambda x: list(range(x.shape[0])))
Basically assigning a length n array, zero-indexed, to each group. But that has proven difficult to transpose and join.
You can use groupby + cumcount, and horizontally concat the new column:
>>> pd.concat([df, df.group.groupby(df.group).cumcount()], axis=1).rename(columns={0: 'n'})
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
Simply use groupby on the column name, in this case group, then apply cumcount, and finally assign the result to a new column in the dataframe.
df['n']=df.groupby('group').cumcount()
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
You can use the apply method by passing a lambda expression as a parameter.
The idea is that the count for a row is the number of appearances of its group among the previous rows.
df['n'] = df.apply(lambda x: list(df['group'])[:int(x.name)].count(x['group']), axis=1)
Output
group n
0 a 0
1 a 1
2 a 2
3 b 0
4 b 1
5 a 3
6 b 2
7 c 0
Note: the cumcount method is built with the help of the apply function; you can read about this in the pandas documentation. Also note that the lambda above rescans the whole column for every row, so it is quadratic in the number of rows; cumcount is preferable for large frames.

Sum values in third column while putting together corresponding values in first and second columns

I have data stored in three columns (k, v, t) in a csv file. For instance:
Data:
k v t
a 1 2
b 2 3
c 3 4
a 2 3
b 3 2
b 3 4
c 3 5
b 2 3
I want to get the following data. Basically, sum all the values of t that have the same k and v.
a 1 5
b 2 6
b 3 6
c 3 9
This is the code I have so far:
aList = []
aList2 = []
aList3 = []
for i in range(len(data)):
    if data['k'][i] == 'a':
        if data['v'][i] == 1:
            aList.append(data['t'][i])
        elif data['v'][i] == 2:
            aList2.append(data['t'][i])
        else:
            aList3.append(data['t'][i])
I use "for loop" and "if" but it is too long. Can I use numpy in a short and clean way? or any other better way?
Here is one solution using pandas.
First create a dataframe, then perform a groupby operation. The code below assumes your data is stored in a csv file.
import pandas as pd

df = pd.read_csv('file.csv')
g = df.groupby(['k', 'v'], as_index=False)['t'].sum()
Result
   k  v  t
0  a  1  2
1  a  2  3
2  b  2  6
3  b  3  6
4  c  3  9
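To test without a file, the same groupby can be run on the sample data via io.StringIO (a self-contained sketch):
import io

import pandas as pd

csv_text = """k,v,t
a,1,2
b,2,3
c,3,4
a,2,3
b,3,2
b,3,4
c,3,5
b,2,3
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.groupby(['k', 'v'], as_index=False)['t'].sum())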

Selecting specific columns with specific values in pandas

So I have a data frame of 30 columns and I want to filter it for values found in 10 of those columns and return all the rows that match. In the example below, I want to search for values equal to 1 in all df columns whose names end with "good":
df[df[[i for i in df.columns if i.endswith('good')]].isin([1])]
df[df[[i for i in df.columns if i.endswith('good')]] == 1]
Both of these find those columns, but everything that does not match appears as NaN. My question is: how can I query specific columns for specific values and drop the non-matching rows instead of getting NaN?
You can filter the columns first with str.endswith, select them with [], and compare with eq. Finally, add any to keep rows with at least one 1:
cols = df.columns[df.columns.str.endswith('good')]
df1 = df[df[cols].eq(1).any(axis=1)]
Sample:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [1, 1, 4, 5, 5, 1],
                   'C good': [7, 8, 9, 4, 2, 3],
                   'D good': [1, 3, 5, 7, 1, 0],
                   'E good': [5, 3, 6, 9, 2, 1],
                   'F': list('aaabbb')})
print(df)
   A  B  C good  D good  E good  F
0  a  1       7       1       5  a
1  b  1       8       3       3  a
2  c  4       9       5       6  a
3  d  5       4       7       9  b
4  e  5       2       1       2  b
5  f  1       3       0       1  b
cols = df.columns[df.columns.str.endswith('good')]
print(df[cols].eq(1))
   C good  D good  E good
0   False    True   False
1   False   False   False
2   False   False   False
3   False   False   False
4   False    True   False
5   False   False    True
df1 = df[df[cols].eq(1).any(axis=1)]
print(df1)
   A  B  C good  D good  E good  F
0  a  1       7       1       5  a
4  e  5       2       1       2  b
5  f  1       3       0       1  b
Your solution was really close; only add any:
df1 = df[df[[i for i in df.columns if i.endswith('good')]].isin([1]).any(axis=1)]
print(df1)
   A  B  C good  D good  E good  F
0  a  1       7       1       5  a
4  e  5       2       1       2  b
5  f  1       3       0       1  b
EDIT:
If you need only the 1s and want to remove all other rows and columns:
df1 = df.loc[:, df.columns.str.endswith('good')]
df2 = df1.loc[df1.eq(1).any(axis=1), df1.eq(1).any(axis=0)]
print(df2)
   D good  E good
0       1       5
4       1       2
5       0       1
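An equivalent column selection can also be written with DataFrame.filter, which avoids building the column list by hand (a sketch of the same row filter):
good = df.filter(regex='good$')
df1 = df[good.eq(1).any(axis=1)]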
