python remove duplicate substring parsed by comma - python

I have an input Pandas Series like this:
I would like to remove duplicates in each row. For example, change M,S,S to M,S.
I tried
fifa22['player_positions'] = fifa22['player_positions'].str.split(',').apply(pd.unique)
But the result is a Series of ndarrays.
I would like to convert each result to a simple string, without the square brackets. Wondering what to do, thanks!

If this is only for this one column, you can use map.
import pandas as pd
df = pd.DataFrame({
'player_positions' : "M,S,S S S,M M,M M,M M M,S S,M,M,S".split(' ')
})
print(df)
player_positions
0 M,S,S
1 S
2 S,M
3 M,M
4 M,M
5 M
6 M,S
7 S,M,M,S
out = df['player_positions'].map(lambda x: ','.join(set(x.split(','))))
print(out)
0 M,S
1 S
2 M,S
3 M
4 M
5 M
6 M,S
7 M,S
If you want to concatenate in any other way just change the , in ','.join(...) to anything else.
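Note that set does not guarantee any particular order, so a row such as S,M may come back as M,S. If the original order matters, one option is dict.fromkeys, which keeps the first occurrence of each item (dicts preserve insertion order in Python 3.7+). A minimal sketch, where the helper name dedupe_keep_order and the sample rows are chosen here for illustration:

```python
import pandas as pd

# Sample rows taken from the example above.
df = pd.DataFrame({"player_positions": ["M,S,S", "S", "S,M", "M,M", "S,M,M,S"]})

def dedupe_keep_order(value, sep=","):
    """Drop repeated items, keeping the first occurrence of each (hypothetical helper)."""
    # dict.fromkeys keeps keys in insertion order, unlike set.
    return sep.join(dict.fromkeys(value.split(sep)))

out = df["player_positions"].map(dedupe_keep_order)
print(out.tolist())
```

Here S,M stays S,M instead of being reordered.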

Related

Find where word is present in string with where statement [duplicate]

I am having an issue replacing a string with a value from another column.
I want to replace 'Length' with df['Length'].
df["Length"]= df["Length"].replace('Length', df['Length'], regex = True)
Below is my data
Input:
**Formula** **Length**
Length 5
Length+1.5 6
Length-2.5 5
Length 4
5 5
Expected Output:
**Formula** **Length**
5 5
6+1.5 6
5-2.5 5
4 4
5 5
However, the code above replaces the entire cell instead of only 'Length'.
I found this happens because df['column'] is used as the replacement; if I use any other string, the offset that follows (e.g. +1.5 or -2.5) is kept.
I am getting the output below:
**Formula** **Length**
5 5
6 6
5 5
4 4
5 5
Is there a replace method that takes its values from another column?
Thank you.
To replace using values from another column, use DataFrame.apply:
df["Formula"]= df.apply(lambda x: x['Formula'].replace('Length', str(x['Length'])), axis=1)
print (df)
Formula Length
0 5 5
1 6+1.5 6
2 5-2.5 5
3 4 4
4 5 5
Or list comprehension:
df["Formula"]= [x.replace('Length', str(y)) for x, y in df[['Formula','Length']].to_numpy()]
Just wanted to add that the list comprehension is much faster:
df = pd.DataFrame({'a': ['aba'] * 1000000, 'c': ['c'] * 1000000})
%timeit df.apply(lambda x: x['a'].replace('b', x['c']), axis=1)
# 1 loop, best of 5: 11.8 s per loop
%timeit [x.replace('b', str(y)) for x, y in df[['a', 'c']].to_numpy()]
# 1 loop, best of 5: 1.3 s per loop
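A variant of the list comprehension that pairs the two columns with zip instead of going through to_numpy() could look like the sketch below. The frame is rebuilt from the question's input so the snippet runs on its own; no timing is claimed for this version.

```python
import pandas as pd

# Input data from the question.
df = pd.DataFrame({
    "Formula": ["Length", "Length+1.5", "Length-2.5", "Length", "5"],
    "Length": [5, 6, 5, 4, 5],
})

# Pair each formula with the length on the same row and substitute it in.
df["Formula"] = [
    formula.replace("Length", str(length))
    for formula, length in zip(df["Formula"], df["Length"])
]
print(df)
```

Rows that do not contain 'Length' (the last one here) are left unchanged, since str.replace is a no-op when the substring is absent.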

How to find the number of unique values in comma separated strings stored in an pandas data frame column?

x Unique_in_x
5,5,6,7,8,6,8 4
5,9,8,0 4
5,9,8,0 4
3,2 2
5,5,6,7,8,6,8 4
Unique_in_x is my expected column. Sometimes the x column might contain strings as well.
You can use a list comprehension with a set:
df['Unique_in_x'] = [len(set(x.split(','))) for x in df['x']]
Or using a split and nunique:
df['Unique_in_x'] = df['x'].str.split(',', expand=True).nunique(1)
Output:
x Unique_in_x
0 5,5,6,7,8,6,8 4
1 5,9,8,0 4
2 5,9,8,0 4
3 3,2 2
4 5,5,6,7,8,6,8 4
You can find the unique values of the split list with np.unique() and then take the length:
import pandas as pd
import numpy as np
df['Unique_in_x'] = df['x'].apply(lambda x : len(np.unique(x.split(','))))
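Another way to express the same count, staying within pandas, is to split, explode into one token per row, and count distinct tokens per original row. This is only a sketch using the question's data; the tokens are compared as strings, so it should behave the same whether the items are digits or words.

```python
import pandas as pd

# Data from the question.
df = pd.DataFrame({"x": ["5,5,6,7,8,6,8", "5,9,8,0", "5,9,8,0", "3,2", "5,5,6,7,8,6,8"]})

# One token per row; the original row label is repeated in the index.
tokens = df["x"].str.split(",").explode()

# Count distinct tokens per original row label.
df["Unique_in_x"] = tokens.groupby(level=0).nunique()
print(df)
```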

How to simplify pandas columns sum?

I am trying to sum columns, as follows:
The data frame:
ID name grade_math grade_chemistry grade_physic CS_math CS_chemistry CS_physic
1 A 4 2.75 3 3 2 3
2 B 3 4 4 3 2 3
3 C 2 2 2 3 2 3
The formula is:
df['total'] = (df['grade_math']*df['CS_math']) + (df['grade_chemistry']*df['CS_chemistry']) + (df['grade_physic']*df['CS_physic'])
I tried to simplify it like this:
df['total'] = sum(df[f'grade{i}'] * df[f'CS{i}'] for i in range(1, 3))
but I realized this logic is totally wrong. Any suggestions?
You were close in your logic. What you're after is this:
sum(df[f'grade_{subject}'] * df[f'CS_{subject}'] for subject in ["math", "chemistry", "physic"])
The issue was that with for i in range(1, 3) you were iterating over numbers. Placing them into the f-strings therefore produces names like grade1 and CS1, which don't exist among the columns of your dataframe.
The solution instead iterates over the common suffixes ("math", "chemistry", and "physic"), so that the resulting f-strings match column names in the dataframe.
Use:
sum(df[f'grade_{i}'] * df[f'CS_{i}'] for i in ['math', 'chemistry', 'physic'])
Output:
0 26.5
1 29.0
2 16.0
dtype: float64
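If the list of subjects is not known in advance, it can be derived from the column names rather than typed out. The sketch below assumes every grade_<subject> column has a matching CS_<subject> column, as in the question's data; the variable names prefix and subjects are illustrative.

```python
import pandas as pd

# Data frame from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "name": ["A", "B", "C"],
    "grade_math": [4, 3, 2],
    "grade_chemistry": [2.75, 4, 2],
    "grade_physic": [3, 4, 2],
    "CS_math": [3, 3, 3],
    "CS_chemistry": [2, 2, 2],
    "CS_physic": [3, 3, 3],
})

# Collect the suffix of every column that starts with "grade_".
prefix = "grade_"
subjects = [col[len(prefix):] for col in df.columns if col.startswith(prefix)]

# Assumes a CS_<subject> column exists for each subject found above.
df["total"] = sum(df[f"grade_{s}"] * df[f"CS_{s}"] for s in subjects)
print(df["total"])
```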

Converting a 1D list into a 2D DataFrame

I have scraped a webpage table, and the table items are in a sequential 1D list, with repeated headers. I want to reconstitute the table into a DataFrame.
I have an algorithm to do this, but I'd like to know if there is a more pythonic/efficient way to achieve this? NB. I don't necessarily know how many columns there are in my table. Here's an example:
input = ['A',1,'B',5,'C',9,
'A',2,'B',6,'C',10,
'A',3,'B',7,'C',11,
'A',4,'B',8,'C',12]
output = {}
it = iter(input)
val = next(it)
while val:
    if val in output:
        output[val].append(next(it))
    else:
        output[val] = [next(it)]
    val = next(it,None)
df = pd.DataFrame(output)
print(df)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
If your data is always "well behaved", then something like this should suffice:
import pandas as pd
data = ['A',1,'B',5,'C',9,
'A',2,'B',6,'C',10,
'A',3,'B',7,'C',11,
'A',4,'B',8,'C',12]
result = {}
for k,v in zip(data[::2], data[1::2]):
    result.setdefault(k, []).append(v)
df = pd.DataFrame(result)
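You can also use numpy reshape (here l is the input list; note that numpy converts a list mixing strings and numbers to an array of strings, so the resulting values are strings):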
import numpy as np
cols = sorted(set(l[::2]))
df = pd.DataFrame(np.reshape(l, (int(len(l)/len(cols)/2), len(cols)*2)).T[1::2].T, columns=cols)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Explanation:
# get columns
cols = sorted(set(l[::2]))
# reshape list into list of lists
shape = (int(len(l)/len(cols)/2), len(cols)*2)
np.reshape(l, shape)
# get only the values of the data
.T[1::2].T
# this transposes the data and slices every second step
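A pandas-only alternative that keeps the values' original types, and does not need the number of columns up front, is to number each header's occurrences and pivot on that. This is a sketch under the same "well behaved" assumption (strictly alternating header, value); the column names key, value and row are made up for the example.

```python
import pandas as pd

data = ['A', 1, 'B', 5, 'C', 9,
        'A', 2, 'B', 6, 'C', 10,
        'A', 3, 'B', 7, 'C', 11,
        'A', 4, 'B', 8, 'C', 12]

# Even positions are headers, odd positions are the matching values.
pairs = pd.DataFrame({"key": data[::2], "value": data[1::2]})

# The n-th occurrence of a header belongs to the n-th row of the table.
pairs["row"] = pairs.groupby("key").cumcount()

# Spread the headers into columns and drop the helper axis names.
df = (
    pairs.pivot(index="row", columns="key", values="value")
         .rename_axis(index=None, columns=None)
)
print(df)
```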

Splitting and copying a row in pandas

I have a task that is completely driving me mad. Let's suppose we have this df:
import pandas as pd
k = {'random_col':{0:'a',1:'b',2:'c'},'isin':{0:'ES0140074008', 1:'ES0140074008ES0140074010', 2:'ES0140074008ES0140074016ES0140074024'},'n_isins':{0:1,1:2,2:3}}
k = pd.DataFrame(k)
What I want to do is to repeat each row a number of times governed by the column n_isins, which is the length of the column isin divided by 12, as isins are always strings of 12 characters.
So, I need row 0 once, row 1 twice and row 2 three times. In my real data the count goes up to 6, so it is a hard task. I began by using booleans and slicing the column isin, but that got me nowhere. Hopefully my explanation is good enough. I also need the column isin sliced like this [0:11] + ' ' + [12:23]... splitting by the 'E', but I think I know how to do that; I only mention it because it is the criterion that determines how many times each row has to be copied. Thanks in advance!
I think you need numpy.repeat with loc, then remove the duplicates in the index with reset_index. Finally, for the new column, use a custom splitting function with numpy.concatenate:
import numpy as np
n = np.repeat(k.index, k['n_isins'])
df = k.loc[n].reset_index(drop=True)
print (df)
isin n_isins random_col
0 ES0140074008 1 a
1 ES0140074008ES0140074010 2 b
2 ES0140074008ES0140074010 2 b
3 ES0140074008ES0140074016ES0140074024 3 c
4 ES0140074008ES0140074016ES0140074024 3 c
5 ES0140074008ES0140074016ES0140074024 3 c
#https://stackoverflow.com/a/7111143/2901002
def chunks(s, n):
    """Produce `n`-character chunks from `s`."""
    for start in range(0, len(s), n):
        yield s[start:start+n]
s = np.concatenate(k['isin'].apply(lambda x: list(chunks(x, 12))))
df['new'] = pd.Series(s, index = df.index)
print (df)
isin n_isins random_col new
0 ES0140074008 1 a ES0140074008
1 ES0140074008ES0140074010 2 b ES0140074008
2 ES0140074008ES0140074010 2 b ES0140074010
3 ES0140074008ES0140074016ES0140074024 3 c ES0140074008
4 ES0140074008ES0140074016ES0140074024 3 c ES0140074016
5 ES0140074008ES0140074016ES0140074024 3 c ES0140074024
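On pandas versions that provide DataFrame.explode (0.25 and later), the repeat and the split can be combined: build the list of 12-character pieces per row and explode it, which repeats the other columns automatically. A minimal sketch; the helper split_isins is a name introduced here, and the n_isins column is not needed for this approach.

```python
import pandas as pd

k = pd.DataFrame({
    "random_col": ["a", "b", "c"],
    "isin": [
        "ES0140074008",
        "ES0140074008ES0140074010",
        "ES0140074008ES0140074016ES0140074024",
    ],
})

def split_isins(s, width=12):
    """Cut `s` into consecutive pieces of `width` characters (hypothetical helper)."""
    return [s[i:i + width] for i in range(0, len(s), width)]

# One list of ISINs per row, then one row per ISIN.
df = (
    k.assign(new=k["isin"].map(split_isins))
     .explode("new")
     .reset_index(drop=True)
)
print(df)
```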
