Extract substring numbers from string pandas - python

I have a list like this:
lis = ["proc_movieclip1_0.450-16.450.wav", "proc_movieclip1_17.700-23.850.wav", "proc_movieclip1_25.800-29.750.wav"]
I've converted it into a DataFrame with:
import pandas as pd
dfs = pd.DataFrame(lis)
dfs.columns = ['path']
dfs
so dfs looks like this:
path
0 proc_movieclip1_0.450-16.450.wav
1 proc_movieclip1_17.700-23.850.wav
2 proc_movieclip1_25.800-29.750.wav
I just want to extract the numeric range in the string as a new column, as follows:
range
0.450-16.450
17.700-23.850
25.800-29.750
What I've tried:
dfs.path.str.extract(r'(\d+)')
output
0
0 1
1 1
2 1
Also tried:
dfn = dfs.assign(path=lambda x: x['path'].str.extract(r'(\d+)'))
I got the same output as above... Am I missing anything?

You need to use a more complex regex here:
dfs['path'].str.extract(r'(\d+(?:\.\d+)?-\d+(?:\.\d+)?)')
output:
0
0 0.450-16.450
1 17.700-23.850
2 25.800-29.750

If you're unfamiliar with regex, you could use the str.split() method instead:
def Extractor(string):
    num1, num2 = string.split('_')[-1][:-4].split('-')
    return (float(num1), float(num2))
Result:
>>> Extractor('proc_movieclip1_0.450-16.450.wav')
(0.45, 16.45)
Lambda one-liner:
lambda x: tuple([float(y) for y in x.split('_')[-1][:-4].split('-')])
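Putting the regex answer together, a minimal sketch (assuming the goal is a new column literally named range, as in the question) would be:

```python
import pandas as pd

lis = ["proc_movieclip1_0.450-16.450.wav",
       "proc_movieclip1_17.700-23.850.wav",
       "proc_movieclip1_25.800-29.750.wav"]
dfs = pd.DataFrame(lis, columns=['path'])

# expand=False makes str.extract return a Series, which assigns cleanly as a column
dfs['range'] = dfs['path'].str.extract(r'(\d+(?:\.\d+)?-\d+(?:\.\d+)?)', expand=False)
print(dfs)
```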

Related

Compute number of floats in an int range - Python

I have the following dataframe containing floats as input and would like to compute how many values fall in the ranges 0-90 and 90-180. The output dataframe was obtained using the FREQUENCY() function in Excel.
[Input dataframe]
[Desired output]
I'd like to do the same thing with Python but haven't found a solution. Do you have any suggestions?
I can also provide source files if needed.
Here's one way: divide the columns by 90, then use groupby and count:
import numpy as np
import pandas as pd
data = [
    [87.084, 5.293],
    [55.695, 0.985],
    [157.504, 2.995],
    [97.701, 179.593],
    [97.67, 170.386],
    [118.713, 177.53],
    [99.972, 176.665],
    [124.849, 1.633],
    [72.787, 179.459],
]
df = pd.DataFrame(data,columns=['Var1','Var2'])
df = (df / 90).astype(int)
df1 = pd.DataFrame([["0-90"], ["90-180"]])
df1['Var1'] = df.groupby('Var1').count()
df1['Var2'] = df.groupby('Var2').count()
print(df1)
Output:
0 Var1 Var2
0 0-90 3 4
1 90-180 6 5
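As an alternative sketch (my own variant, not part of the answer above), pd.cut with explicit bin edges plus value_counts gives the same frequencies without the integer-division trick:

```python
import pandas as pd

data = [[87.084, 5.293], [55.695, 0.985], [157.504, 2.995],
        [97.701, 179.593], [97.67, 170.386], [118.713, 177.53],
        [99.972, 176.665], [124.849, 1.633], [72.787, 179.459]]
df = pd.DataFrame(data, columns=['Var1', 'Var2'])

bins = [0, 90, 180]
# count how many values of each column fall into (0, 90] and (90, 180]
counts = df.apply(lambda col: pd.cut(col, bins).value_counts().sort_index())
print(counts)
```

The bin edges are half-open on the left, so a value of exactly 90 lands in the first bucket; adjust the edges if a different boundary convention is needed.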

Sorting a pandas DataFrame by one level of a MultiIndex with a "key"

My question is basically the same as the one here:
Sorting a pandas DataFrame by one level of a MultiIndex
i.e., I want to sort a MultiIndex dataframe along one level, BUT I am facing the problem that the following index:
["foo2", "foo1", "foo10"] is sorted as ["foo1", "foo10", "foo2"] instead of ["foo1", "foo2", "foo10"],
and I cannot pass a "key" argument as with the list.sort() function (see example below).
How should I manage that?
Should I reset_index, sort the column, and then set the index again ?
import pandas as pd
import re
def atoi(text):
    return int(text) if text.isdigit() else text
def natural_keys(text):
    return [atoi(c) for c in re.split(r'(\d+)', text)]
# example on a list
L1=["foo2","foo1","foo10"]
print(sorted(L1))
print(sorted(L1,key=natural_keys))
print()
df = pd.DataFrame([{'I1':'foo2','I2':'b','val':2},{'I1':'foo1','I2':'a','val':1},{'I1':'foo10','I2':'c','val':3}])
df = df.set_index(['I1','I2'])
sorted_df = df.sort_index(level=0)
print(sorted_df)
print()
expected_df = pd.DataFrame([{'I1':'foo1','I2':'a','val':1},{'I1':'foo2','I2':'b','val':2},{'I1':'foo10','I2':'c','val':3}])
expected_df = expected_df.set_index(['I1','I2'])
print(expected_df)
SORTED DF:
           val
I1    I2
foo1  a      1
foo10 c      3
foo2  b      2
EXPECTED DF:
           val
I1    I2
foo1  a      1
foo2  b      2
foo10 c      3
Thanks
As explained by Jon Clements, if you are on pandas >= 1.1.0 you can use the key argument of sort_index.
But if you also want to discriminate between several numbers in your index, e.g. to keep
foo_1_bar_2
foo_2_bar_1
in this order, then you need to combine several functions:
import pandas as pd
import re
def atoi(text):
    return int(text) if text.isdigit() else text
def natural_keys(text):
    return [atoi(c) for c in re.split(r'(\d+)', text)]
def sort_index(index):
    return [sorted(index, key=natural_keys, reverse=False).index(val) for val in index]
df = pd.DataFrame([{'I1':'foo2','I2':'b','val':2},{'I1':'foo1','I2':'a','val':1},{'I1':'foo10','I2':'c','val':3}])
df = df.set_index(['I1','I2'])
sorted_df = df.sort_index(level=0, key=sort_index)
I have not found any simple solution on previous versions of pandas.
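Another sketch on pandas >= 1.1.0 (my own assumption, not from the answer above): skip the rank-based key entirely and zero-pad every digit run, so that plain lexicographic sorting matches natural order:

```python
import pandas as pd

df = pd.DataFrame([{'I1': 'foo2', 'I2': 'b', 'val': 2},
                   {'I1': 'foo1', 'I2': 'a', 'val': 1},
                   {'I1': 'foo10', 'I2': 'c', 'val': 3}])
df = df.set_index(['I1', 'I2'])

# zero-pad each run of digits so 'foo2' sorts before 'foo10' lexicographically
def pad_numbers(idx):
    return idx.str.replace(r'\d+', lambda m: m.group().zfill(10), regex=True)

sorted_df = df.sort_index(level=0, key=pad_numbers)
print(sorted_df)
```

The padding width (10 here) just needs to exceed the longest digit run you expect in the index.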

manipulate a column of dataframe with conditions

In order to change a string's suffix to be a prefix in a dataframe column, made with the following code for example:
import pandas as pd
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
a b
1 100000.ss 10
2 200000.zz 18
I tried the one-line code below, but the result shows the if/else statement doesn't work. Why?
df['a'] = df['a'].apply(lambda x: 'ss.'+x[:6] if x.find("ss") else 'zz.'+x[:6])
a b
1 ss.100000 10
2 ss.200000 18
Each x in your lambda function is a string. x.find returns -1 if the substring is not found, and -1 is truthy (only 0 is falsy among integers). Therefore, your condition is True for every row and the lambda always returns 'ss.' + .... Change your lambda to this:
df['a'].apply(lambda x: 'ss.'+x[:6] if x.find("ss") != -1 else 'zz.'+x[:6])
Out[4]:
1 ss.100000
2 zz.200000
Name: a, dtype: object
Anyway, you don't need apply for this issue. Just use the pandas str accessor:
df['a'].str[-2:] + '.' + df['a'].str[:-3]
Out[10]:
1 ss.100000
2 zz.200000
Name: a, dtype: object
Why do the hard work when there is a library that does it for you?
import pandas as pd
from pathlib import Path
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
df.assign(
    a=lambda x: x["a"].apply(lambda s: f"{Path(s).suffix[1:]}.{Path(s).stem}")
)
output
a b
ss.100000 10
zz.200000 18
There may be options with fewer lines; here is one solution:
import pandas as pd
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
df[['First', 'Last']] = df.a.str.split(".", expand=True)
df['a'] = df['Last'] + '.' + df['First']
df = df.drop(['First', 'Last'], axis=1)
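One more sketch (my own variant, not from the answers above): a single regex swap with str.replace handles both suffixes without branching:

```python
import pandas as pd

df = pd.DataFrame({'a': ['100000.ss', '200000.zz'], 'b': [10, 18]}, index=[1, 2])

# swap "number.suffix" into "suffix.number" in one vectorized call;
# the greedy (.*) keeps everything up to the last dot as the first group
df['a'] = df['a'].str.replace(r'^(.*)\.(\w+)$', r'\2.\1', regex=True)
print(df)
```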

Convert string (comma separated) to int list in pandas Dataframe

I have a Dataframe with a column which contains integers and sometimes a string which contains multiple numbers which are comma separated (like "1234567, 89012345, 65425774").
I want to convert that string to an integer list so it's easier to search for specific numbers.
In [1]: import pandas as pd
In [2]: raw_input = "1111111111 666 10069759 9695011 9536391,2261003 9312405 15542804 15956127 8409044 9663061 7104622 3273441 3336156 15542815 15434808 3486259 8469323 7124395 15956159 3319393 15956184 15956217 13035908 3299927"
In [3]: df = pd.DataFrame({'x':raw_input.split()})
In [4]: df.head()
Out[4]:
x
0 1111111111
1 666
2 10069759
3 9695011
4 9536391,2261003
Since your column contains strings and integers, you probably want something like this:
def to_integers(column_value):
    if not isinstance(column_value, int):
        return [int(v) for v in column_value.split(',')]
    else:
        return column_value
df.loc[:, 'column_name'] = df.loc[:, 'column_name'].apply(to_integers)
Your best solution to cases like this, where a column has one or more values, is splitting the data into multiple columns.
Try something like:
ids = df.ID.str.split(',', expand=True)
for i in range(ids.shape[1]):
    df['ID' + str(i + 1)] = ids.iloc[:, i]
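To make the apply approach above concrete, here is a minimal end-to-end sketch on a shortened version of the sample data (the column is named x as in the question):

```python
import pandas as pd

raw = "1111111111 666 10069759 9695011 9536391,2261003"
df = pd.DataFrame({'x': raw.split()})

# every cell is a string here, so a plain split/int conversion is enough
df['x'] = df['x'].apply(lambda s: [int(v) for v in s.split(',')])

# searching for a specific number is now a per-row membership test
hits = df['x'].apply(lambda lst: 2261003 in lst)
print(df[hits])
```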

Is there a way to use str.count() function with a LIST of values instead of a single string?

I am trying to count the number of times that any string from a list_of_strings appears in a csv file cell.
For example, the following would work fine.
import pandas as pd
data_path = "SurveryResponses.csv"
df = pd.read_csv(data_path)
totalCount = 0
for row in df['rowName']:
    if type(row) == str:
        print(row.count('word_of_interest'))
However, I would like to be able to enter a list of strings (['str1', 'str2', 'str3']) rather than just one 'word_of_interest', such that if any of those strings appears the count value will increase by one.
Is there a way to do this?
Perhaps something along the lines of
totalCount = 0
words_of_interest = ['cat', 'dog', 'foo', 'bar']
for row in df['rowName']:
    if type(row) == str:
        if sum(word in row for word in words_of_interest) > 0:
            totalCount += 1
Use the str accessor:
df['rowName'].str.count('word_of_interest')
If you need to convert the column to string first, use astype:
df['rowName'].astype(str).str.count('word_of_interest')
Assuming list_of_strings = ['str1', 'str2', 'str3'], you can try the following:
if any(map(lambda x: x in row, list_of_strings)):
    totalCount += 1
You can use this method to count how many cells exactly match an entry in an external list:
strings = ['string1','string2','string3']
sum([1 if sr in strings else 0 for sr in df.rowName])
Here is an example:
import io
import pandas as pd
filedata = """animal,amount
"['cat','dog']",2
"['cat','horse']",2"""
df = pd.read_csv(io.StringIO(filedata))
Returns this dataframe:
animal amount
0 ['cat','dog'] 2
1 ['cat','horse'] 2
Search for word cat (looping through all columns as series):
search = "cat"
# sums the True values for each Series, then wraps a sum around all the sums
# sum([2, 0]) in this case
sum([sum(df[cols].astype(str).str.contains(search)) for cols in df.columns])
Returns 2
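Another sketch (my own, not from the answers above): joining the words into a regex alternation lets a single vectorized str.count call do the work; re.escape guards against words containing regex metacharacters:

```python
import re
import pandas as pd

words_of_interest = ['cat', 'dog', 'foo', 'bar']
s = pd.Series(['the cat sat', 'dog and cat', 'nothing here'])

# one alternation pattern instead of a Python loop per word
pattern = '|'.join(re.escape(w) for w in words_of_interest)
per_row = s.str.count(pattern)            # occurrences in each row
rows_with_hit = int((per_row > 0).sum())  # rows containing at least one word
print(per_row.tolist(), rows_with_hit)
```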
