manipulate a column of dataframe with conditions - python

I want to change each string's suffix to be a prefix in a column of a dataframe, made with the following code for example:
import pandas as pd
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
           a   b
1  100000.ss  10
2  200000.zz  18
I tried the one-line code below, but the result shows the if/else doesn't work. Why?
df['a'] = df['a'].apply(lambda x: 'ss.'+x[:6] if x.find("ss") else 'zz.'+x[:6])
           a   b
1  ss.100000  10
2  ss.200000  18

Each x in your lambda is a string. x.find returns -1 when the substring is not found, and -1 is truthy; the only falsy return value is 0, so your condition is almost always True and the lambda always takes the 'ss' branch. Change your lambda to this:
df['a'].apply(lambda x: 'ss.'+x[:6] if x.find("ss") != -1 else 'zz.'+x[:6])
Out[4]:
1 ss.100000
2 zz.200000
Name: a, dtype: object
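To see this directly, check the truthiness of str.find's return values; the only falsy result is 0, i.e. a match at the very start of the string:
>>> '100000.ss'.find("ss")  # found at index 7 -> truthy
7
>>> '200000.zz'.find("ss")  # not found -> -1, also truthy
-1
>>> bool(7), bool(-1), bool(0)
(True, True, False)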
Anyway, you don't need apply for this. Just use the pandas str accessor:
df['a'].str[-2:] + '.' + df['a'].str[:-3]
Out[10]:
1 ss.100000
2 zz.200000
Name: a, dtype: object

Why do the hard work when there is a library that does it for you?
import pandas as pd
from pathlib import Path
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
df.assign(
    a=lambda x: x["a"].apply(lambda s: f"{Path(s).suffix[1:]}.{Path(s).stem}")
)
output
           a   b
1  ss.100000  10
2  zz.200000  18

There may be shorter ways to do this; here is a solution using str.split:
import pandas as pd
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
df[['First','Last']] = df.a.str.split(".",expand=True)
df['a']=df['Last']+'.'+df['First']
df.drop(['First','Last'],axis=1)
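For what it's worth, the helper columns can be skipped entirely. A minimal sketch, assuming each value contains exactly one '.': split on the dot, reverse each resulting list, and rejoin:
df['a'] = df['a'].str.split('.').str[::-1].str.join('.')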

Extract substring numbers from string pandas

I have a list like this:
lis=["proc_movieclip1_0.450-16.450.wav", "proc_movieclip1_17.700-23.850.wav", "proc_movieclip1_25.800-29.750.wav"]
I've converted it into a df with:
import numpy as np
import pandas as pd
dfs = pd.DataFrame(lis)
dfs.columns=['path']
dfs
so dfs looks like this:
path
0 proc_movieclip1_0.450-16.450.wav
1 proc_movieclip1_17.700-23.850.wav
2 proc_movieclip1_25.800-29.750.wav
I just want to extract the number range in the string as a new column, as follows:
range
0.450-16.450
17.700-23.850
25.800-29.750
What I've tried:
dfs.path.str.extract(r'(\d+)')
output
0
0 1
1 1
2 1
Also tried:
dfn = dfs.assign(path = lambda x: x['path'].str.extract(r'(\d+)'))
I got the same output as above... Am I missing anything?
You need to use a more complex regex here:
dfs['path'].str.extract(r'(\d+(?:\.\d+)?-\d+(?:\.\d+)?)')
output:
0
0 0.450-16.450
1 17.700-23.850
2 25.800-29.750
If you're unfamiliar with regex, you could use the str.split() method instead:
def Extractor(string):
    num1, num2 = string.split('_')[-1][:-4].split('-')
    return (float(num1), float(num2))
Result:
>>> Extractor('proc_movieclip1_0.450-16.450.wav')
(0.45, 16.45)
Lambda one-liner:
lambda x: tuple([float(y) for y in x.split('_')[-1][:-4].split('-')])
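The same split logic can also stay vectorized through the str accessor instead of a Python-level function. A sketch, assuming the range always sits between the last underscore and the four-character '.wav' suffix:
# last '_'-separated chunk, minus the trailing '.wav'
dfs['range'] = dfs['path'].str.split('_').str[-1].str[:-4]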

Check if a pandas Dataframe string column contains all the elements given in an array

I have a dataframe as shown below:
>>> import pandas as pd
>>> df = pd.DataFrame(data = [['app;',1,2,3],['app; web;',4,5,6],['web;',7,8,9],['',1,4,5]],columns = ['a','b','c','d'])
>>> df
a b c d
0 app; 1 2 3
1 app; web; 4 5 6
2 web; 7 8 9
3 1 4 5
I have an input array that looks like this: ["app","web"]
For each of these values I want to check against a specific column of a dataframe and return a decision as shown below:
>>> df.a.str.contains("app")
0 True
1 True
2 False
3 False
Since str.contains only lets me look for an individual value, I was wondering if there's some other direct way to do the same, something like:
df.a.str.contains(["app","web"]) # Returns TypeError: unhashable type: 'list'
My end goal is not to do an absolute match (df.a.isin(["app", "web"])) but rather a 'contains' logic that returns True if those characters are present anywhere in that cell of the data frame.
Note: I can of course use the apply method with my own function for the same logic, such as:
elementsToLookFor = ["app","web"]
df['match'] = df['a'].apply(lambda element: all(a in element for a in elementsToLookFor))
But I am more interested in the optimal approach, so I would prefer a native pandas function, or failing that, the most optimized custom solution.
This should work too:
l = ["app","web"]
df['a'].str.findall('|'.join(l)).map(lambda x: len(set(x)) == len(l))
also this should work as well:
pd.concat([df['a'].str.contains(i) for i in l],axis=1).all(axis = 1)
So many solutions; which one is the most efficient?
The str.contains-based answers are generally fastest, though str.findall is also very fast on smaller dfs:
values = ['app', 'web']
pattern = ''.join(f'(?=.*{value})' for value in values)

def replace_dummies_all(df):
    return df.a.str.replace(' ', '').str.get_dummies(';')[values].all(1)

def findall_map(df):
    return df.a.str.findall('|'.join(values)).map(lambda x: len(set(x)) == len(values))

def lower_contains(df):
    return df.a.astype(str).str.lower().str.contains(pattern)

def contains_concat_all(df):
    return pd.concat([df.a.str.contains(l) for l in values], axis=1).all(1)

def contains(df):
    return df.a.str.contains(pattern)
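To actually rank them, a minimal timeit sketch along these lines could be used (df here is the example frame from the question; absolute numbers will vary with machine, pandas version and data size):
import timeit
for func in (replace_dummies_all, findall_map, lower_contains, contains_concat_all, contains):
    # time 1000 evaluations of each candidate on the same frame
    t = timeit.timeit(lambda: func(df), number=1000)
    print(f'{func.__name__}: {t:.4f}s')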
Try with str.get_dummies
df.a.str.replace(' ','').str.get_dummies(';')[['web','app']].all(1)
0 False
1 True
2 False
3 False
dtype: bool
Update
df['a'].str.contains(r'^(?=.*web)(?=.*app)')
Update 2 (to ensure case insensitivity and that the column dtype is str, without which the logic may fail):
elementList = ['app','web']
valueString = ''
for eachValue in elementList:
    valueString += f'(?=.*{eachValue})'
df[header] = df[header].astype(str).str.lower()  # ensure case insensitivity and string dtype
result = df[header].str.contains(valueString)
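Applied to the question's example frame (using 'a' as the column), the lookahead pattern gives the expected mask; each (?=.*value) independently asserts that value occurs somewhere in the string, so every value must be present for a match:
valueString = ''.join(f'(?=.*{v})' for v in ['app', 'web'])
df['a'].astype(str).str.lower().str.contains(valueString)
0    False
1     True
2    False
3    False
Name: a, dtype: bool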

Iterate over two dataframes' columns and str.encode in utf8

I'm currently running Python 2.7 and have two dataframes x and y. I would like to use some sort of list comprehension to iterate over both frames' columns and use str.encode('utf-8') on each object column to get rid of unicode.
This works perfectly fine and is easily readable, but I wanted to try something faster and more efficient:
for col in y:
    if y[col].dtype == 'O':
        y[col] = y[col].str.encode("utf-8")
for col in x:
    if x[col].dtype == 'O':
        x[col] = x[col].str.encode("utf-8")
Other methods I have tried:
1.) [y[col].str.encode("utf-8") for col in y if y[col].dtype=='O']
2.) y.columns = [(y[col].str.encode("utf-8") if y[col].dtype=='O' else y[col]) for col in y]
3.) y.apply(lambda x: (y[col].str.encode("utf-8") for col in y if y[col].dtype=='O'))
I am getting ValueErrors and length mismatch errors for 2.) and 3.).
You can use select_dtypes to get object columns, then call apply over each column to encode it:
u = df.select_dtypes(include=[object])
df[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
Write a small function to do this and call it for each dataframe.
def encode_df(df):
    u = df.select_dtypes(include=[object])
    df[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
    return df
x, y = encode_df(x), encode_df(y)
You can also apply any element-wise function row by row with iterrows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3,4], 'b':[11,12,13,14]})
def f(x):
    return x**2
pd.DataFrame([[f(i) for i in tuple(v)] for k,v in df.iterrows()], columns=df.columns)
Out[54]:
a b
0 1 121
1 4 144
2 9 169
3 16 196

pandas standalone series and from dataframe different behavior

Here is my code and warning message. If I change s to be a standalone Series by using s = pd.Series(np.random.randn(5)), there are no such errors. I'm using Python 2.7 on Windows.
It seems a Series created standalone and a Series created from a column of a data frame behave differently? Thanks.
My purpose is to change the Series values themselves, rather than change a copy.
Source code,
import pandas as pd
sample = pd.read_csv('123.csv', header=None, skiprows=1,
                     dtype={0: str, 1: str, 2: str, 3: float})
sample.columns = pd.Index(data=['c_a', 'c_b', 'c_c', 'c_d'])
sample['c_d'] = sample['c_d'].astype('int64')
s = sample['c_d']
#s = pd.Series(np.random.randn(5))
for i in range(len(s)):
    if s.iloc[i] > 0:
        s.iloc[i] = s.iloc[i] + 1
    else:
        s.iloc[i] = s.iloc[i] - 1
Warning message,
C:\Python27\lib\site-packages\pandas\core\indexing.py:132: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Content of 123.csv,
c_a,c_b,c_c,c_d
hello,python,numpy,0.0
hi,python,pandas,1.0
ho,c++,vector,0.0
ho,c++,std,1.0
go,c++,std,0.0
Edit 1: the lambda solution seems not to work; I printed s before and after, and it's the same value:
import pandas as pd
sample = pd.read_csv('123.csv', header=None, skiprows=1,
                     dtype={0: str, 1: str, 2: str, 3: float})
sample.columns = pd.Index(data=['c_a', 'c_b', 'c_c', 'c_d'])
sample['c_d'] = sample['c_d'].astype('int64')
s = sample['c_d']
print s
s.apply(lambda x:x+1 if x>0 else x-1)
print s
0 0
1 1
2 0
3 1
4 0
Name: c_d, dtype: int64
0 0
1 1
2 0
3 1
4 0
By doing s = sample['c_d'], s refers to a column of your original DataFrame sample, so making a change to the values of s also changes sample. That's why you got the warning.
You can do s = sample['c_d'].copy() instead, so that changing the values of s doesn't change the c_d column of the DataFrame sample.
I suggest you use the apply function instead:
s.apply(lambda x:x+1 if x>0 else x-1)
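One caveat, which the question's Edit 1 runs into: apply returns a new Series rather than modifying in place, so the result has to be assigned back, for example:
sample['c_d'] = sample['c_d'].apply(lambda x: x + 1 if x > 0 else x - 1)
Writing the result back through the DataFrame like this also avoids the SettingWithCopyWarning.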

Hash each value in a pandas data frame

In Python, I am trying to find the quickest way to hash each value in a pandas data frame.
I know any string can be hashed using:
hash('a string')
But how do I apply this function on each element of a pandas data frame?
This may be a very simple thing to do, but I have just started using Python.
Pass the hash function to apply on the str column:
In [37]:
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas']})
df
Out[37]:
a
0 asds
1 asdds
2 asdsadsdas
In [39]:
df['hash'] = df['a'].apply(hash)
df
Out[39]:
a hash
0 asds 4065519673257264805
1 asdds -2144933431774646974
2 asdsadsdas -3091042543719078458
If you want to do this to every element then call applymap:
In [42]:
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas'],'b':['asewer','werwer','tyutyuty']})
df
Out[42]:
a b
0 asds asewer
1 asdds werwer
2 asdsadsdas tyutyuty
In [43]:
df.applymap(hash)
Out[43]:
a b
0 4065519673257264805 7631381377676870653
1 -2144933431774646974 -6124472830212927118
2 -3091042543719078458 -1784823178011532358
Pandas also has a function to apply a hash function on an array or column:
import pandas as pd
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas']})
df["hash"] = pd.util.hash_array(df["a"].to_numpy())
In addition to @EdChum's answer, a heads-up: hash() does not return the same value for a given string across runs or machines (in Python 3, string hashing is salted per process). Depending on your use case, you may be better off using a stable hash:
import hashlib
def md5hash(s: str):
    return hashlib.md5(s.encode('utf-8')).hexdigest()  # or SHA, ...
df['a'].apply(md5hash)
# or
df.applymap(md5hash)
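Note that DataFrame.applymap was deprecated in pandas 2.1 in favour of DataFrame.map, so on recent versions the element-wise call would be:
df.map(md5hash)  # pandas >= 2.1 replacement for applymap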
