How to split digits and text - python

I have a dataset like this
data = pd.DataFrame({ 'a' : [5, 5, '2 bad']})
I want to convert this to
{ 'a.digits' : [5, 5, 2], 'a.text' : [nan, nan, 'bad']}
I can get 'a.digits' as below
data['a.digits'] = data['a'].replace('[^0-9]', '', regex = True)
5 2
2 1
Name: a, dtype: int64
When I do
data['a'] = data['a'].replace('[^\D]', '', regex = True)
or
data['a'] = data['a'].replace('[^a-zA-Z]', '', regex = True)
I get
5 2
bad 1
Name: a, dtype: int64
What's wrong? How do I remove the digits?

Something like this would suffice?
In [8]: import numpy as np
In [9]: import re
In [10]: data['a.digits'] = data['a'].apply(lambda x: int(re.sub(r'[\D]', '', str(x))))
In [12]: data['a.text'] = data['a'].apply(lambda x: re.sub(r'[\d]', '', str(x)))
In [13]: data.replace('', np.nan)
Out[13]:
       a  a.digits a.text
0      5         5    NaN
1      5         5    NaN
2  2 bad         2    bad

Assuming there is a space between 2 and the word bad, you can do this:
data['Text'] = data['a'].str.split(' ').str[1]
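A possible explanation for the behaviour in the question: column a holds mixed types, and a regex-based replace only rewrites the string cells, so the integer 5 values pass through unchanged. A minimal sketch, assuming it is acceptable to cast the whole column to strings first (the name s is only for illustration), uses the vectorized str.extract:

```python
import pandas as pd

data = pd.DataFrame({'a': [5, 5, '2 bad']})

# Cast to str so the integer cells can be handled by the .str methods.
s = data['a'].astype(str)

# First run of digits in each cell, converted back to a numeric dtype.
data['a.digits'] = pd.to_numeric(s.str.extract(r'(\d+)', expand=False))

# First run of letters in each cell; cells without letters become NaN.
data['a.text'] = s.str.extract(r'([A-Za-z]+)', expand=False)

print(data)
```

Rows without any letters get NaN in 'a.text' automatically, so no separate replace step is needed.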

Related

Remove opening and closing parenthesis with word in pandas

Given a data frame:
df =
multi
0 MULTIPOLYGON(((3 11, 2 33)))
1 MULTIPOLYGON(((4 22, 5 66)))
I was trying to remove the word 'MULTIPOLYGON' and the parentheses '(((' and ')))'.
My try:
df['multi'] = df['multi'].str.replace(r"\(.*\)","")
df['multi'] = df['multi'].map(lambda x: x.lstrip('MULTIPOLYGON()').rstrip('aAbBcC'))
df.values =
array([[''],
[''],
...
[''],
[''],
[''],
['7.5857754821 44.9628409423']
Desired output:
df =
multi
3 11, 2 33
4 22, 5 66
Try this:
import pandas as pd
import re
def f(x):
    x = ' '.join(re.findall(r'[0-9, ]+', x))
    return x
def f2(x):
    x = re.findall(r'[0-9, ]+', x)
    return pd.Series(x[0].split(','))
df =pd.DataFrame({'a':['MULTIPOLYGON(((3 11, 2 33)))' ,'MULTIPOLYGON(((4 22, 5 6)))']})
df['a'] = df['a'].apply(f)
print(df)
#or for different columns you can do
df =pd.DataFrame({'a':['MULTIPOLYGON(((3 11, 2 33)))' ,'MULTIPOLYGON(((4 22, 5 6)))']})
#df['multi'] = df.a.str.replace('[^0-9. ]', '', regex=True)
#print(df)
list_of_cols = ['c1','c2']
df[list_of_cols] = df['a'].apply(f2)
del df['a']
print(df)
output:
a
0 3 11, 2 33
1 4 22, 5 6
c1 c2
0 3 11 2 33
1 4 22 5 6
You can also use str.replace with a regex:
# removes anything that's not a digit or a space or a dot
df['multi'] = df['multi'].str.replace(r'[^0-9. ]', '', regex=True)
You can use df.column.str in the following way.
df['a'] = df['a'].str.findall(r'[0-9.]+')
df = pd.DataFrame(df['a'].tolist())
print(df)
output:
0 1
0 3.49 11.10
1 4.49 22.12
This will work for any number of columns. But in the end you have to name those columns.
df.columns = ['a'+str(i) for i in range(df.shape[1])]
This method also works when rows contain different numbers of numeric values, for example:
df =pd.DataFrame({'a':['MULTIPOLYGON(((3.49)))' ,'MULTIPOLYGON(((4.49 22.12)))']})
a
0 MULTIPOLYGON(((3.49)))
1 MULTIPOLYGON(((4.49 22.12)))
So the expected output is
0 1
0 3.49 None
1 4.49 22.12
After naming the columns using,
df.columns = ['a'+str(i) for i in range(df.shape[1])]
You get,
a0 a1
0 3.49 None
1 4.49 22.12
apply is a rather slow method in pandas, since it is basically a loop that iterates over each row and applies your function. Pandas has vectorized methods; here we can use str.extract to pull out the pattern:
df['multi'] = df['multi'].str.extract(r'(\d\.\d+\s\d+\.\d+)', expand=False)
multi
0 3.49 11.10
1 4.49 22.12
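For the data in the question itself, note that in the original attempt the pattern \(.*\) matches from the first opening parenthesis to the last closing one, so the coordinates are deleted along with the parentheses. A minimal sketch, assuming every value has the shape WORD(((...))) with no nested groups in between, captures only the text inside the parentheses:

```python
import pandas as pd

df = pd.DataFrame({'multi': ['MULTIPOLYGON(((3 11, 2 33)))',
                             'MULTIPOLYGON(((4 22, 5 66)))']})

# \(+      one or more opening parentheses
# ([^()]*) capture everything that is not a parenthesis
# \)+      one or more closing parentheses
df['multi'] = df['multi'].str.extract(r'\(+([^()]*)\)+', expand=False)

print(df)
```

Because the capture group excludes parentheses rather than listing the allowed characters, commas, decimal points and minus signs inside the coordinates are kept.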

Calculate average of column x if column y meets criteria, for each y

How do I retrieve the values of column Z, and their average,
for the rows where a value is > 1?
data=[9,2,3,4,5,6,7,8]
df = pd.DataFrame(np.random.randn(8, 5),columns=['A', 'B', 'C', 'D','E'])
fd=pd.DataFrame(data,columns=['Z'])
df=pd.concat([df,fd], axis=1)
l=[]
for x, y in df.iterrows():
    for i, s in y.items():
        if s > 1:
            l.append(x)
print(df['Z'])
The expected output will most likely be a dictionary with the column name as key and the average of Z as its values.
Using a dictionary comprehension:
res = {col: df.loc[df[col] > 1, 'Z'].mean() for col in df.columns[:-1]}
# {'A': 9.0, 'B': 5.0, 'C': 8.0, 'D': 7.5, 'E': 6.666666666666667}
Setup used for above:
np.random.seed(0)
data = [9,2,3,4,5,6,7,8]
df = pd.DataFrame(np.random.randn(8, 5),columns=['A', 'B', 'C', 'D','E'])
fd = pd.DataFrame(data, columns=['Z'])
df = pd.concat([df, fd], axis=1)
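To make the dictionary comprehension easier to check by hand, here is a small sketch with made-up values (the numbers below are illustrative and not taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [2.0, 0.5, 1.5],
    'B': [0.0, 3.0, 0.2],
    'Z': [10, 20, 60],
})

# For each column except the last one (Z), select the rows where that
# column is greater than 1 and average the matching Z values.
res = {col: df.loc[df[col] > 1, 'Z'].mean() for col in df.columns[:-1]}

print(res)
```

Column A exceeds 1 in rows 0 and 2, so its entry is the mean of 10 and 60; column B exceeds 1 only in row 1, so its entry is 20. A column with no value above 1 would yield NaN, because the mean of an empty selection is NaN.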
Do you mean this?
df[df['Z']>1].loc[:,'Z'].mean(axis=0)
or
df[df['Z']>1]['Z'].mean()
I don't know if I understood your question correctly but do you mean this:
import pandas as pd
import numpy as np
data=[9,2,3,4,5,6,7,8]
columns = ['A', 'B', 'C', 'D','E']
df = pd.DataFrame(np.random.randn(8, 5),columns=columns)
fd=pd.DataFrame(data,columns=['Z'])
df=pd.concat([df,fd], axis=1)
print('df = \n', str(df))
anyGreaterThanOne = (df[columns] > 1).any(axis=1)
print('anyGreaterThanOne = \n', str(anyGreaterThanOne))
filtered = df[anyGreaterThanOne]
print('filtered = \n', str(filtered))
Zmean = filtered['Z'].mean()
print('Zmean = ', str(Zmean))
Result:
df =
A B C D E Z
0 -2.170640 -2.626985 -0.817407 -0.389833 0.862373 9
1 -0.372144 -0.375271 -1.309273 -1.019846 -0.548244 2
2 0.267983 -0.680144 0.304727 0.302952 -0.597647 3
3 0.243549 1.046297 0.647842 1.188530 0.640133 4
4 -0.116007 1.090770 0.510190 -1.310732 0.546881 5
5 -1.135545 -1.738466 -1.148341 0.764914 -1.140543 6
6 -2.078396 0.057462 -0.737875 -0.817707 0.570017 7
7 0.187877 0.363962 0.637949 -0.875372 -1.105744 8
anyGreaterThanOne =
0 False
1 False
2 False
3 True
4 True
5 False
6 False
7 False
dtype: bool
filtered =
A B C D E Z
3 0.243549 1.046297 0.647842 1.188530 0.640133 4
4 -0.116007 1.090770 0.510190 -1.310732 0.546881 5
Zmean = 4.5

Pandas - how to slice value_counts?

I would like to slice a pandas value_counts() :
>sur_perimetre[col].value_counts()
44341006.0 610
14231009.0 441
12131001.0 382
12222009.0 364
12142001.0 354
But I get an error :
> sur_perimetre[col].value_counts()[:5]
KeyError: 5.0
The same with ix :
> sur_perimetre[col].value_counts().ix[:5]
KeyError: 5.0
How would you deal with that ?
EDIT
Maybe :
pd.DataFrame(sur_perimetre[col].value_counts()).reset_index()[:5]
Method 1:
You need to observe that value_counts() returns a Series object. You can process it like any other series and get the values. You can even construct a new dataframe out of it.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: vc = df.C1.value_counts()
In [4]: type(vc)
Out[4]: pandas.core.series.Series
In [5]: vc.values
Out[5]: array([2, 1, 1, 1, 1])
In [6]: vc.values[:2]
Out[6]: array([2, 1])
In [7]: vc.index.values
Out[7]: array([3, 5, 4, 2, 1])
In [8]: df2 = pd.DataFrame({'value':vc.index, 'count':vc.values})
In [9]: df2
Out[9]:
count value
0 2 3
1 1 5
2 1 4
3 1 2
4 1 1
Method 2:
Then I tried to reproduce the error you mentioned. But using a single column in a DataFrame, I didn't get any error with the same notation you used.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: df['C1'].value_counts()[:3]
Out[3]:
3 2
5 1
4 1
Name: C1, dtype: int64
In [4]: df.C1.value_counts()[:5]
Out[4]:
3 2
5 1
4 1
2 1
1 1
Name: C1, dtype: int64
In [5]: pd.__version__
Out[5]: u'0.17.1'
Hope it helps!
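The KeyError in the question likely arises because the index of the value_counts() result holds float labels, so [:5] is treated as a label-based slice rather than a positional one. A minimal sketch, using made-up float values in place of sur_perimetre[col], slices by position explicitly:

```python
import pandas as pd

# Illustrative data: float values, so value_counts() has a float index.
s = pd.Series([44341006.0] * 3 + [14231009.0] * 2 + [12131001.0])

vc = s.value_counts()

# Both forms select by position, regardless of the index dtype.
top2 = vc.iloc[:2]
same = vc.head(2)

print(top2)
```

iloc and head never interpret their argument as an index label, so they behave the same for integer, float and string indexes.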

Extract non-empty values from the regex array output in python

I have a column of type numpy.ndarray which looks like:
col
['','','5','']
['','8']
['6','','']
['7']
[]
['5']
I want the output like this:
col
5
8
6
7
0
5
How can I do this in Python? Any help is highly appreciated.
To convert the data to numeric values you could use:
import numpy as np
import pandas as pd
data = list(map(np.array, [ ['','','5',''], ['','8'], ['6','',''], ['7'], [], ['5']]))
df = pd.DataFrame({'col': data})
df['col'] = pd.to_numeric(df['col'].str.join('')).fillna(0).astype(int)
print(df)
yields
col
0 5
1 8
2 6
3 7
4 0
5 5
To convert the data to strings use:
df['col'] = df['col'].str.join('').replace('', '0')
The result looks the same, but the dtype of the column is object since the values are strings.
If there is more than one number in some rows and you wish to pick the largest,
then you'll have to loop through each item in each row, convert each string to
a numeric value and take the max:
import numpy as np
import pandas as pd
data = list(map(np.array, [ ['','','5','6'], ['','8'], ['6','',''], ['7'], [], ['5']]))
df = pd.DataFrame({'col': data})
df['col'] = [max([int(xi) if xi else 0 for xi in x] or [0]) for x in df['col']]
print(df)
yields
col
0 6 # <-- note ['','','5','6'] was converted to 6
1 8
2 6
3 7
4 0
5 5
For versions of pandas prior to 0.17, you could use df.convert_objects instead:
import numpy as np
import pandas as pd
data = list(map(np.array, [ ['','','5',''], ['','8'], ['6','',''], ['7'], [], ['5']]))
df = pd.DataFrame({'col': data})
df['col'] = df['col'].str.join('').replace('', '0')
df = df.convert_objects(convert_numeric=True)
x = np.array([['', '', '5', ''], ['', '8'], ['6', '', ''], ['7'], [], ['5']],
             dtype=object)
In [20]: for a in x:
   ....:     if len(a) == 0:
   ....:         print(0)
   ....:     else:
   ....:         for b in a:
   ....:             if b:
   ....:                 print(b)
   ....:
5
8
6
7
0
5
I'll leave you with this :
>>> l=['', '5', '', '']
>>> l = [x for x in l if not len(x) == 0]
>>> l
['5']
You can do the same thing using lambda and filter
>>> l
['', '1', '']
>>> l = list(filter(lambda x: not len(x) == 0, l))
>>> l
['1']
The next step would be iterating through the rows of the array and implementing one of these two ideas.
Someone shows how this is done here: Iterating over Numpy matrix rows to apply a function each?
edit: maybe this is down-voted, but I made it on purpose to not give the final code.
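For readers who want the two pieces combined, a minimal sketch follows. It assumes each row is a plain list of strings holding at most one non-empty entry, and that an empty row should map to 0 as in the question; the names rows and result are illustrative:

```python
rows = [['', '', '5', ''], ['', '8'], ['6', '', ''], ['7'], [], ['5']]

result = []
for row in rows:
    # Keep only the non-empty strings of this row.
    non_empty = [x for x in row if x]
    # Use the first remaining value, or 0 when nothing is left.
    result.append(int(non_empty[0]) if non_empty else 0)

print(result)
```

If a row can hold several numbers, the choice of non_empty[0] would need to be replaced by whatever rule applies, for example taking the maximum.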

Strip all characters from column header before a :

I have columns named like this:
1:Arnston 2:Berg 3:Carlson 53:Brown
and I want to strip all the characters before and including :. I know I can rename the columns, but that would be pretty tedious since my numbers go up to 100.
My desired output is:
Arnston Berg Carlson Brown
Assuming that you have a frame looking something like this:
>>> df
1:Arnston 2:Berg 3:Carlson 53:Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
You can use the vectorized string operators to split each entry at the first colon and then take the second part:
>>> df.columns = df.columns.str.split(":", n=1).str[1]
>>> df
Arnston Berg Carlson Brown
0 5 0 2 1
1 9 3 2 9
2 9 2 9 7
import re
s = '1:Arnston 2:Berg 3:Carlson 53:Brown'
s_minus_numbers = re.sub(r'\d+:', '', s)
Gets you
'Arnston Berg Carlson Brown'
The best solution IMO is to use pandas' str attribute on the columns. This allows for the use of regular expressions without having to import re:
df.columns = df.columns.str.extract(r'\d+:(.*)', expand=False)
Where the regex means: select everything ((.*)) after one or more digits (\d+) and a colon (:).
You can do it with a list comprehension:
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
print('Before: {!r}'.format(columns))
columns = [col.split(':')[1] for col in columns]
print('After: {!r}'.format(columns))
Output
Before: ['1:Arnston', '2:Berg', '3:Carlson', '53:Brown']
After: ['Arnston', 'Berg', 'Carlson', 'Brown']
Another way is with a regular expression using re.sub():
import re
columns = '1:Arnston 2:Berg 3:Carlson 53:Brown'.split()
pattern = re.compile(r'^.+:')
columns = [pattern.sub('', col) for col in columns]
print(columns)
Output
['Arnston', 'Berg', 'Carlson', 'Brown']
df = pd.DataFrame({'1:Arnston':[5,9,9],
'2:Berg':[0,3,2],
'3:Carlson':[2,2,9] ,
'53:Brown':[1,9,7]})
[x.split(':')[1] for x in df.columns.factorize()[1]]
output:
['Arnston', 'Berg', 'Carlson', 'Brown']
You could use str.replace and pass a regular expression:
In [52]: df
Out[52]:
1:Arnston 2:Berg 3:Carlson 53:Brown
0 1.340711 1.261500 -0.512704 -0.064384
1 0.462526 -0.358382 0.168122 -0.660446
2 -0.089622 0.656828 -0.838688 -0.046186
3 1.041807 0.775830 -0.436045 0.162221
4 -0.422146 0.775747 0.106112 -0.044917
In [51]: df.columns.str.replace(r'\d+:', '', regex=True)
Out[51]: Index(['Arnston', 'Berg', 'Carlson', 'Brown'], dtype='object')
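Note that str.replace returns a new Index, so the result has to be assigned back for the rename to take effect. A minimal sketch, reusing the column names from the question with made-up values, shows the assignment together with a rename-based alternative that avoids regular expressions:

```python
import pandas as pd

df = pd.DataFrame({'1:Arnston': [5, 9, 9],
                   '2:Berg': [0, 3, 2],
                   '3:Carlson': [2, 2, 9],
                   '53:Brown': [1, 9, 7]})

# Alternative: rename with a plain function; split at the first colon
# and keep the last part (the whole name if there is no colon).
renamed = df.rename(columns=lambda c: c.split(':', 1)[-1])

# Regex version: drop the leading digits and colon, then assign back.
df.columns = df.columns.str.replace(r'^\d+:', '', regex=True)

print(list(df.columns))
```

The rename variant returns a new DataFrame and leaves the original untouched, which can be convenient when the old names are still needed.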
