Delete elements after they appear a certain number of times - python

I have a data frame similar to this:
import pandas as pd

df = pd.DataFrame([['a', 'b'],
                   ['c', 'a'],
                   ['c', 'd'],
                   ['a', 'e'],
                   ['p', 'g'],
                   ['d', 'a'],
                   ['c', 'g']], columns=['col1', 'col2'])
I need to delete rows after an element has appeared a certain number of times. For example, say I want each value to appear a maximum of 2 times in this dataframe (across both columns); the final dataframe could look like this:
[['a','b'],
['a','c'],
['c','d'],
['p','g']
]
The order of the deleted rows doesn't matter here; I just want to cap the number of times any value appears in my dataframe.
Many Thanks!

IIUC, try:
n = 2
s = df.stack()  # one long Series of every value, indexed by (row, column)
# keep only the first n occurrences of each value, reshape back to wide,
# then drop rows that lost one of their two values
s[(s.groupby(s).cumcount() + 1).le(n)].unstack().dropna()
  col1 col2
0    a    b
1    c    a
2    c    d
4    p    g
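To see why this works, here is a small sketch (against the sample df above) that pairs each stacked value with a running count of how often it has already appeared; any value whose count reaches n gets dropped:

s = df.stack()
# 'seen_before' is 0 for a value's first appearance, 1 for its second, ...
print(pd.concat({'value': s, 'seen_before': s.groupby(s).cumcount()}, axis=1))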

Here is one way, using stack, then cumcount with all:
s = df.stack()
s = s.groupby(s).cumcount().unstack()
df[(s <= 1).all(1)]
Out[206]:
  col1 col2
0    a    b
1    c    a
2    c    d
4    p    g

You can stack the data, cumcount, and unstack back:
s = df.stack()
df[s.groupby(s).cumcount().unstack().lt(2).all(1)]
Output:
  col1 col2
0    a    b
1    c    a
2    c    d
4    p    g

Related

Creating a New Column in a Pandas Dataframe in a more pythonic way

I am trying to find a better, more pythonic way of accomplishing the following:
I want to add a new column to business_df called 'dot_prod', which is the dot product of a fixed vector (fixed_vector) and a vector from another data frame (rating_df). The rows of both business_df and rating_df have the same index values (business_id).
I have this loop, which appears to work; however, I know it's super clumsy (and takes forever). Essentially it loops once per row, calculates the dot product, then dumps it into the business_df dataframe.
n = 0
for i in range(business_df.shape[0]):
    dot_prod = np.dot(fixed_vector, rating_df.iloc[n])
    business_df['dot_prod'][n] = dot_prod
    n += 1
IIUC, you are looking for apply across axis=1 like:
business_df['dot_prod'] = rating_df.apply(lambda x: np.dot(fixed_vector, x), axis=1)
>>> fixed_vector = [1, 2, 3]
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4], 'col3': [5, 6]})
>>> df
   col1  col2  col3
0     1     3     5
1     2     4     6
>>> df['col4'] = np.dot(fixed_vector, [df['col1'], df['col2'], df['col3']])
>>> df
   col1  col2  col3  col4
0     1     3     5    22
1     2     4     6    28
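As a side note, if rating_df's columns already line up element-for-element with fixed_vector, the whole column can be computed with a single matrix-vector product instead of one np.dot per row; a minimal sketch, assuming numeric columns in matching order:

import numpy as np

# one matrix-vector product for all rows at once;
# assumes rating_df's column order matches fixed_vector
business_df['dot_prod'] = rating_df.to_numpy() @ np.asarray(fixed_vector)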

Reshaping DataFrame with pandas

So I'm working with pandas on python. I collect data indexed by timestamps in multiple ways.
This means I can have one index with 2 features available (and the others as NaN values, which is normal) or all features; it depends.
My problem arises when I add data with multiple values for the same indices; see the example below:
Imagine this is the set we're adding new data to:
Index col1 col2
1     a    A
2     b    B
3     c    C
This is the data we will add:
Index new col
1     z
1     y
Then the result is this:
Index col1 col2 new col
1     a    A    NaN
1     NaN  NaN  z
1     NaN  NaN  y
2     b    B    NaN
3     c    C    NaN
So instead of that, I would like the result to be:
Index col1 col2 new col1 new col2
1     a    A    z        y
2     b    B    NaN      NaN
3     c    C    NaN      NaN
I want that, instead of having multiple indices for 1 feature, there is 1 index for multiple features. I don't know if this is understandable. Another way to say it: I want the number of values per timestamp to equal the number of features, instead of the number of indices.
This solution assumes the data that you need to add is a series.
Original df:
df = pd.DataFrame(np.random.randint(0,3,size=(3,3)),columns = list('ABC'),index = [1,2,3])
Data to add (series):
s = pd.Series(['x','y'],index = [1,1])
Solution:
df.join(s.to_frame()
         .assign(cc=lambda x: x.groupby(level=0)      # number repeats per index: 1, 2, ...
                               .cumcount().add(1))
         .set_index('cc', append=True)[0]             # Series with (index, cc) MultiIndex
         .unstack()                                   # one column per cc value
         .rename('New Col{}'.format, axis=1))         # cc 1 -> 'New Col1', etc.
Output:
   A  B  C New Col1 New Col2
1  1  2  2        x        y
2  0  1  2      NaN      NaN
3  2  2  0      NaN      NaN
Alternative answer (maybe more simplistic, probably less pythonic). I think you need to look at converting wide data to long data and back again in general (pivot and transpose are good things to look up for this), but I also think there are some possible problems in your question: you don't mention new col 1 and new col 2 in the declaration of the subsequent arrays.
Here's my declarations of your data frames:
d = {'index': [1, 2, 3],'col1': ['a', 'b', 'c'], 'col2': ['A', 'B', 'C']}
df = pd.DataFrame(data=d)
e1 = {'index': [1], 'new col1': ['z']}
dfe1 = pd.DataFrame(data=e1)
e2 = {'index': [1], 'new col2': ['y']}
dfe2 = pd.DataFrame(data=e2)
They look like this:
index new col1
1 z
and this:
index new col2
1 y
Notice that I declare your new columns as part of the data frames. Once they're declared like that, it's just a matter of merging:
dfr1 = pd.merge(df, dfe1, on='index', how="outer")
dfr2 = pd.merge(dfr1, dfe2, on='index', how="outer")
And the output looks like this:
index col1 col2 new col1 new col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
I think one problem may arise in the way you first create your second data frame.
Actually, expanding the number of features depending on the content is what makes this reformatting a bit annoying here (as you could see for yourself when writing two new column names out of the bare assumption that this reflects the number of features observed at each timestamp).
Here is yet another solution; it tries to be a bit more explicit about the steps taken than rhug123's answer.
# Initial dataFrames
a = pd.DataFrame({'col1':['a', 'b', 'c'], 'col2':['A', 'B', 'C']}, index=range(1, 4))
b = pd.DataFrame({'new col':['z', 'y']}, index=[1, 1])
Now the only important step is basically transposing your second DataFrame, and here you also need to introduce two new column names.
We will do this by grouping the second dataframe by its index and collecting its content (y, z, ...) into lists:
c = b.groupby(b.index)['new col'].apply(list)  # one row per timestamp; all its features grouped in a list
# New column names:
cols = ['New col%d' % (k + 1) for k in range(b.value_counts().sum())]
# Expanding dataframe "c" into one column per new feature
d = pd.DataFrame(c.to_list(), index=b.index.unique(), columns=cols)
# Merge
a.join(d, how='outer')
Output:
col1 col2 New col1 New col2
1 a A z y
2 b B NaN NaN
3 c C NaN NaN
Finally, one problem encountered with both my answer and the one from rhug123 is that, as of now, they won't deal correctly with another feature arriving at a different timestamp. Not sure what the OP expects here.
For example if b is:
new col
1 z
1 y
2 x
The merged output will be:
col1 col2 New col1 New col2
1 a A z y
2 b B x None
3 c C NaN NaN
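A hedged sketch of how that case could be handled: number the features within each timestamp with cumcount before pivoting, so every timestamp gets its own set of 'New colN' slots regardless of how many features it has (reusing a and b from above):

# number each feature within its timestamp (1, 2, ...), then pivot those numbers to columns
k = b.groupby(level=0).cumcount().add(1).rename('k')
wide = b.set_index(k, append=True)['new col'].unstack('k')
wide.columns = ['New col%d' % i for i in wide.columns]
a.join(wide, how='outer')

With the three-row b above, index 2's single feature lands in New col1, with NaN in New col2.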

Extract regex match from pandas column and use as index to access list element

I have a pandas dataframe that looks like this:
col1 col2 col3
0 A,B,C 0|0 1|1
1 D,E,F 2|2 3|3
2 G,H,I 4|4 0|0
My goal is to apply a function on col2 through the last column of the dataframe that splits the corresponding string in col1, using the comma as the delimiter, and uses the first number as the index to get the corresponding list element. For numbers that are greater than the length of the list, I'd like to replace with the 0th element of the list.
Expected output:
col1 col2 col3
0 A,B,C A B
1 D,E,F F D
2 G,H,I G G
In reality, my dataframe has thousands of columns with millions of entries that need this replacement, so I need a method that doesn't refer to 'col2' and 'col3' explicitly (and ideally one that is computationally efficient).
You can use this code to create the original dataframe:
df = pd.DataFrame(
    {
        'col1': ['A,B,C', 'D,E,F', 'G,H,I'],
        'col2': ['0|0', '2|2', '4|4'],
        'col3': ['1|1', '3|3', '0|0']
    }
)
Taking into account that you could have a lot of columns and that the length of the arrays in col1 could vary, you can use the following generalization, which only loops over the columns:
for col in df.columns[1:]:
    df[col] = (df['col1'] + ',' + df[col].str.split('|').str[0]).str.split(',') \
              .apply(lambda x: x[int(x[-1])] if int(x[-1]) < len(x[:-1]) else x[0])
which outputs for your example:
>>> print(df)
col1 col2 col3
0 A,B,C A B
1 D,E,F F D
2 G,H,I G G
Explanation:
first you get the index as a string from colX and append it to the string in col1, so that you get something like 'A,B,C,0'; splitting on commas gives a list whose last element is the index that you need (['A', 'B', 'C', '0']):
(df['col1'] + ',' + df[col].str.split('|').str[0]).str.split(',')
Then you apply a function that returns the i-th element, i being the last element of the list; if i is bigger than the length of the list minus 1, it returns just the first element of the list instead.
(df['col1'] + ',' + df[col].str.split('|').str[0]).str.split(',') \
    .apply(lambda x: x[int(x[-1])] if int(x[-1]) < len(x[:-1]) else x[0])
Last but not least, you just put it in a loop over your desired columns.
I would first collapse your strange 'x|x' format, keeping only the first number:
df['col2'] = df['col2'].str.split('|', expand=True).iloc[:, 0]
df['col3'] = df['col3'].str.split('|', expand=True).iloc[:, 0]
Then split the letter mappings while keeping them aligned by row.
ndf = pd.concat([df, df['col1'].str.split(',', expand=True)], axis=1)
After that, map them back by row while making sure to prevent overflows:
def bad_mapping(row, c):
    value = int(row[c])
    if value <= 2:  # adjust if needed
        return row[value]
    else:
        return row[0]

for c in ['col2', 'col3']:
    ndf['mapped_' + c] = ndf.apply(lambda r: bad_mapping(r, c), axis=1)
Output looks like:
col1 col2 col3 0 1 2 mapped_col2 mapped_col3
0 A,B,C 0 1 A B C A B
1 D,E,F 2 3 D E F F D
2 G,H,I 4 0 G H I G G
Drop columns with df.drop(columns=['your', 'columns', 'here'], inplace=True) as needed.
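Given that the question mentions millions of entries, here is a rough vectorized sketch using the original df from above. It assumes every list in col1 has the same length (an assumption; the earlier answer notes lengths could vary): build the letter matrix once, then index it with numpy instead of applying a Python function per row.

import numpy as np

# assumes every list in col1 has the same number of elements
letters = df['col1'].str.split(',', expand=True).to_numpy()   # shape (n_rows, n_letters)
for col in df.columns[1:]:
    idx = df[col].str.split('|').str[0].astype(int).to_numpy()
    idx = np.where(idx < letters.shape[1], idx, 0)            # overflow -> first element
    df[col] = letters[np.arange(len(df)), idx]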

Pandas - Find duplicated entries in one column within rows with equal values in another column

Assume a dataframe df like the following:
col1 col2
0 a A
1 b A
2 c A
3 c B
4 a B
5 b B
6 a C
7 a C
8 c C
I would like to find those values of col2 for which there are duplicate entries a in col1. In this example the result should be ['C'], since for df['col2'] == 'C', col1 has two a entries.
I tried this approach
df[(df['col1'] == 'a') & (df['col2'].duplicated())]['col2'].to_list()
but this only works if the a within a block of rows defined by col2 is at the beginning or end of the block, depending on how you set the keep keyword of duplicated(). In this example it returns ['B', 'C'], which is not what I want.
Use Series.duplicated only for filtered rows:
df1 = df[df['col1'] == 'a']
out = df1.loc[df1['col2'].duplicated(keep=False), 'col2'].unique().tolist()
print (out)
['C']
Another idea is to use DataFrame.duplicated on both columns and chain it with rows matching only a:
out = df.loc[df.duplicated(subset=['col1', 'col2'], keep=False) &
             (df['col1'] == 'a'), 'col2'].unique().tolist()
print (out)
['C']
You can group col1 by col2 and count occurrences of 'a':
>>> s = df.col1.groupby(df.col2).sum().str.count('a').gt(1)
>>> s[s].index.values
array(['C'], dtype=object)
A more generalised solution using Groupby.count and index.get_level_values:
In [2632]: x = df.groupby(['col1', 'col2']).col2.count().to_frame()
In [2642]: res = x[x.col2 > 1].index.get_level_values(1).tolist()
In [2643]: res
Out[2643]: ['C']
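For comparison, a minimal sketch that filters to the 'a' rows first and then sizes the col2 groups returns the same result on the sample df:

counts = df.loc[df['col1'] == 'a'].groupby('col2').size()
print(counts[counts > 1].index.tolist())   # ['C']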

How to Access Element of Pandas Series that is a List

I have a Dataframe series that contains a list of strings in each row. I'd like to create another series that holds the last string of each row's list.
So one row may have a list, e.g.
['a', 'b', 'c', 'd']
I'd like to create another pandas series made up of the last element of each row's list, normally accessed with a -1 reference, in this case 'd'. The lists for each observation (i.e. row) are of varying length. How can this be done?
I believe you need indexing with str; it works with all iterables:
df = pd.DataFrame({'col':[['a', 'b', 'c', 'd'],['a', 'b'],['a'], []]})
df['last'] = df['col'].str[-1]
print (df)
col last
0 [a, b, c, d] d
1 [a, b] b
2 [a] a
3 [] NaN
strings are iterables too:
df = pd.DataFrame({'col':['abcd','ab','a', '']})
df['last'] = df['col'].str[-1]
print (df)
col last
0 abcd d
1 ab b
2 a a
3 NaN
Why not make the list column into an info dataframe? Then you can use the index for the join:
Infodf = pd.DataFrame(df.col.values.tolist(), index=df.index)
Infodf
Out[494]:
0 1 2 3
0 a b c d
1 a b None None
2 a None None None
3 None None None None
I think I overlooked the question, and both PiR and Jez provided valuable suggestions to help me achieve the final result.
Infodf.ffill(axis=1).iloc[:, -1]
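For the Infodf above, that forward-fill across columns followed by taking the last column yields d, b, a, and NaN for the four rows respectively, matching the .str[-1] answer.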
