Extract non-empty values from the regex array output in Python

I have a column of type numpy.ndarray which looks like:
col
['','','5','']
['','8']
['6','','']
['7']
[]
['5']
I want the output like this:
col
5
8
6
7
0
5
How can I do this in Python? Any help is highly appreciated.

To convert the data to numeric values you could use:
import numpy as np
import pandas as pd
data = list(map(np.array, [ ['','','5',''], ['','8'], ['6','',''], ['7'], [], ['5']]))
df = pd.DataFrame({'col': data})
df['col'] = pd.to_numeric(df['col'].str.join(''), errors='coerce').fillna(0).astype(int)
print(df)
yields
col
0 5
1 8
2 6
3 7
4 0
5 5
To convert the data to strings use:
df['col'] = df['col'].str.join('').replace('', '0')
The result looks the same, but the dtype of the column is object since the values are strings.
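As a quick check (starting from a fresh df as built above), the dtype confirms the values stayed strings:
print(df['col'].tolist())   # ['5', '8', '6', '7', '0', '5']
print(df['col'].dtype)      # object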
If there is more than one number in some rows and you wish to pick the largest,
then you'll have to loop through each item in each row, convert each string to
a numeric value and take the max:
import numpy as np
import pandas as pd
data = list(map(np.array, [ ['','','5','6'], ['','8'], ['6','',''], ['7'], [], ['5']]))
df = pd.DataFrame({'col': data})
df['col'] = [max([int(xi) if xi else 0 for xi in x] or [0]) for x in df['col']]
print(df)
yields
col
0 6 # <-- note ['','','5','6'] was converted to 6
1 8
2 6
3 7
4 0
5 5
For versions of pandas prior to 0.17, you could use df.convert_objects instead:
import numpy as np
import pandas as pd
data = list(map(np.array, [ ['','','5',''], ['','8'], ['6','',''], ['7'], [], ['5']]))
df = pd.DataFrame({'col': data})
df['col'] = df['col'].str.join('').replace('', '0')
df = df.convert_objects(convert_numeric=True)
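Note that convert_objects was deprecated in pandas 0.17 and removed in later releases, so on a modern install the pd.to_numeric route shown above is the one to use; a minimal equivalent sketch:
df['col'] = pd.to_numeric(df['col'].str.join(''), errors='coerce').fillna(0).astype(int)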

x = np.array([['', '', '5', ''], ['', '8'], ['6', '', ''], ['7'], [], ['5']],
             dtype=object)
In [20]: for a in x:
   ....:     if len(a) == 0:
   ....:         print(0)
   ....:     else:
   ....:         for b in a:
   ....:             if b:
   ....:                 print(b)
   ....:
5
8
6
7
0
5
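If you want to collect the values instead of printing them, the same logic can build a list (a small sketch along those lines):
result = []
for a in x:
    if len(a) == 0:
        result.append('0')
    else:
        for b in a:
            if b:
                result.append(b)
print(result)  # ['5', '8', '6', '7', '0', '5']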

I'll leave you with this:
>>> l = ['', '5', '', '']
>>> l = [x for x in l if not len(x) == 0]
>>> l
['5']
You can do the same thing using lambda and filter (in Python 3, filter returns an iterator, so wrap it in list()):
>>> l
['', '1', '']
>>> l = list(filter(lambda x: not len(x) == 0, l))
>>> l
['1']
The next step would be iterating through the rows of the array and implementing one of these two ideas.
Someone shows how this is done here: Iterating over Numpy matrix rows to apply a function each?
edit: this may get down-voted, but I deliberately stopped short of giving the final code.

Related

replace values in dataframe using values from two lists

I am cleaning some dataframes and want to replace a set of values with different values as shown below.
import pandas as pd
dftmp = pd.DataFrame({
    'a': ['yes', 'true', 'false', 'no', 'na', 'NA', 'TRUE'],
    'b': ['yes', 'true', 'false', 'no', 'FALSE', 'ofcourse', 'yes we can'],
    'c': ['any', 'other', 'random', 'column', 'in', 'the', 'db']
})
a b c
0 yes yes any
1 true true other
2 false false random
3 no no column
4 na FALSE in
5 NA ofcourse the
6 TRUE yes we can db
# Replace with Y, N, NA. (The actual mapping of old to replacement values, and the
# columns in which to replace, are imported from another dataframe and will change
# over dataframes and time.)
# The next 3 variables are populated from another database and can change.
cols = ['a', 'b']
lstold = [['Yes, True'], ['No, False'], ['NA']]
lstnew = ['Y', 'N', 'NA']
for col in cols:
    dlsts = dict(zip(lstnew, lstold))
    for key, val in dlsts.items():
        try:
            valsold = val.split(', ')
        except:
            print('single item list. continue')
        for valold in valsold:
            dftmp[col] = dftmp[col].replace(f'(?i){valold}', key, regex=True)
I've almost got the desired result - the issue is in row 6, column b, which should remain 'yes we can' instead of becoming 'Y we can':
a b c
0 Y Y any
1 Y Y other
2 N N random
3 N N column
4 NA N in
5 NA ofcourse the
6 Y Y we can db
How do I stop the 'Yes' in 'Yes we can' from being replaced?
Can this be done without using 3 for loops? I fear it will take a lot more time with my bigger datasets.
Thanks
You could try this:
lstold = [['Yes, True'], ['No, False'], ['NA']]
# flatten and split the comma-joined strings so the old and new lists line up
lstold = [s.lower() for sub in lstold for item in sub for s in item.split(', ')]
lstnew = ['Y', 'Y', 'N', 'N', 'NA']
for i in dftmp.columns:
    dftmp[i] = dftmp[i].str.lower().replace(lstold, lstnew)
This is the output:
a b c
0 Y Y any
1 Y Y other
2 N N random
3 N N column
4 NA N in
5 NA ofcourse the
6 Y yes we can db
The sublists in your lstold aren't actually lists of separate values; each one holds a single comma-joined string. I changed that in the sample I'm showing so that each value is its own element of the list. Assuming you can do that, perhaps this is what you are looking for.
import pandas as pd
dftmp = pd.DataFrame({
    'a': ['yes', 'true', 'false', 'no', 'na', 'NA', 'TRUE'],
    'b': ['yes', 'true', 'false', 'no', 'FALSE', 'ofcourse', 'yes we can'],
    'c': ['any', 'other', 'random', 'column', 'in', 'the', 'db']
})
cols = ['a', 'b']
lstold = [['Yes', 'True'], ['No', 'False'], ['NA']]
lstnew = ['Y', 'N', 'NA']
m = {}
for c, l in enumerate(lstold):
    for s in l:
        m[s.lower()] = lstnew[c]
for col in cols:
    dftmp[col].update(dftmp[col].str.lower().map(m))
Output
a b c
0 Y Y any
1 Y Y other
2 N N random
3 N N column
4 NA N in
5 NA ofcourse the
6 Y yes we can db
Couldn't reduce it to fewer than three for loops, but thanks to the responses to my question here I was able to stop the replacement of substrings:
cols = ['a', 'b']
lstold = [['Yes, True'], ['No, False'], ['NA']]
lstnew = ['Y', 'N', 'NA']
for col in cols:
    dlsts = dict(zip(lstnew, lstold))
    for key, val in dlsts.items():
        try:
            valsold = val.split(', ')
        except:
            print('single item list. continue')
        for valold in valsold:
            df[col] = df[col].replace(rf'(?i)^{valold}$', key, regex=True)
If one doesn't need to ignore case or worry about replacing substrings, then one of the for loops can be dropped as follows:
cols = ['a', 'b']
lstold = [['Yes, True'], ['No, False'], ['NA']]
lstnew = ['Y', 'N', 'NA']
for col in cols:
    dlsts = dict(zip(lstnew, lstold))
    for key, val in dlsts.items():
        df[col] = df[col].str.replace(val, key, case=False, regex=False)
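The two inner loops can also be dropped entirely by building one dict of anchored, case-insensitive patterns up front and handing it to replace in a single call per column (a sketch, assuming the same mapping as above):
patterns = {r'(?i)^yes$': 'Y', r'(?i)^true$': 'Y',
            r'(?i)^no$': 'N', r'(?i)^false$': 'N',
            r'(?i)^na$': 'NA'}
for col in cols:
    dftmp[col] = dftmp[col].replace(patterns, regex=True)
The ^ and $ anchors are what keep 'yes we can' intact.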

Match to re-code letters and numbers in python (pandas)

I have a variable that is mixed with letters and numbers. The letters range from A:Z and the numbers range from 2:8. I want to re-code this variable so that it is all numeric with the letters A:Z now becoming numbers 1:26 and the numbers 2:8 becoming numbers 27:33.
For example, I would like this variable:
Var1 = c('A',2,3,8,'C','W',6,'T')
To become this:
Var1 = c(1,27,28,33,3,23,31,20)
In R I can do this using 'match' like this:
Var1 = as.numeric(match(Var1, c(LETTERS, 2:8)))
How can I do this using python? Pandas?
Thank you
Make a dictionary and map the values:
import string
import numpy as np
dct = dict(zip(list(string.ascii_uppercase) + list(np.arange(2, 9)), np.arange(1, 34)))
# If they are strings of numbers, not integers use:
#dct = dict(zip(list(string.ascii_uppercase) + ['2', '3', '4', '5', '6', '7', '8'], np.arange(1, 34)))
df.col_name = df.col_name.map(dct)
An example:
import pandas as pd
df = pd.DataFrame({'col': [2, 4, 6, 3, 5, 'A', 'B', 'D', 'F', 'Z', 'X']})
df.col.map(dct)
Outputs:
0 27
1 29
2 31
3 28
4 30
5 1
6 2
7 4
8 6
9 26
10 24
Name: col, dtype: int64
I think this could help you:
Replacing letters with numbers with its position in alphabet
Then you just need to apply it on your df column:
df.Var1.apply(alphabet_position)
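alphabet_position is defined in that linked question; a minimal sketch of such a helper, extended to this question's rules (letters A-Z become 1-26, integers 2-8 become 27-33), might look like:
def alphabet_position(val):
    # integers 2-8 shift up by 25 so they land on 27-33
    if isinstance(val, int):
        return val + 25
    # letters map to their 1-based position in the alphabet
    return ord(val.lower()) - 96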
You can also try this:
for i in range(len(var1)):
    if type(var1[i]) == int:
        var1[i] = var1[i] + 25
    else:
        var1[i] = ord(var1[i].lower()) - 96
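For example, run against the question's data, the loop rewrites the list in place (assuming the numbers are real ints, not strings):
var1 = ['A', 2, 3, 8, 'C', 'W', 6, 'T']
# ... run the loop above ...
print(var1)  # [1, 27, 28, 33, 3, 23, 31, 20]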

Overwrite value in Pandas DataFrame by iteration?

I have a dataframe like this:
import pandas as pd
lis = [['a', 'b', 'c'],
       ['17', '10', '6'],
       ['5', '30', 'x'],
       ['78', '50', '2'],
       ['4', '58', 'x']]
df = pd.DataFrame(lis[1:], columns=lis[0])
How can I write a function that says: if 'x' is in column c, then overwrite that value with the corresponding one in column b? The result would be this:
[['a', 'b', 'c'],
 ['17', '10', '6'],
 ['5', '30', '30'],
 ['78', '50', '2'],
 ['4', '58', '58']]
By using .loc and np.where
import numpy as np
df.c=np.where(df.c=='x',df.b,df.c)
df
Out[569]:
a b c
0 17 10 6
1 5 30 30
2 78 50 2
3 4 58 58
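The .loc half of that idea (boolean mask on c, assign the aligned values from b) would be:
df.loc[df.c == 'x', 'c'] = df.b
Both forms leave non-'x' rows untouched.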
This should do the trick
import numpy as np
df.c = np.where(df.c == 'x',df.b, df.c)
I am not into pandas but if you want to change the lis you could do it like so:
>>> [x if x[2] != "x" else [x[0], x[1], x[1]] for x in lis]
[['a', 'b', 'c'],
 ['17', '10', '6'],
 ['5', '30', '30'],
 ['78', '50', '2'],
 ['4', '58', '58']]

Concatenate all columns in a pandas dataframe

I have multiple pandas dataframes which may have different numbers of columns; the number of columns typically varies from 50 to 100. I need to create a final column that is simply all the columns concatenated. Basically, the string in the first row of the new column should be the concatenation of the strings in the first row of all the columns. I wrote the loop below, but I feel there might be a better, more efficient way to do this. Any ideas on how to do this?
num_columns = df.columns.shape[0]
col_names = df.columns.values.tolist()
df.loc[:, 'merged'] = ""
for each_col_ind in range(num_columns):
    print('Concatenating', col_names[each_col_ind])
    df.loc[:, 'merged'] = df.loc[:, 'merged'] + df[col_names[each_col_ind]]
A solution with sum - but the output is float, so converting to int and str is necessary:
df['new'] = df.sum(axis=1).astype(int).astype(str)
Another solution uses apply with ''.join, but it is the slowest:
df['new'] = df.apply(''.join, axis=1)
Lastly, a very fast numpy solution - convert to a numpy array and then sum:
df['new'] = df.values.sum(axis=1)
Timings:
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
#[30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
#print (df)
cols = list('ABC')
#not_a_robot solution
In [259]: %timeit df['concat'] = pd.Series(df[cols].fillna('').values.tolist()).str.join('')
100 loops, best of 3: 17.4 ms per loop
In [260]: %timeit df['new'] = df[cols].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 386 ms per loop
In [261]: %timeit df['new1'] = df[cols].values.sum(axis=1)
100 loops, best of 3: 6.5 ms per loop
In [262]: %timeit df['new2'] = df[cols].astype(str).sum(axis=1).astype(int).astype(str)
10 loops, best of 3: 68.6 ms per loop
EDIT: If the dtypes of some columns are not object (i.e. not strings), cast them with DataFrame.astype:
df['new'] = df.astype(str).values.sum(axis=1)
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
Gives us:
df
Out[6]:
A B C concat
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
To select a given set of columns:
df['concat'] = pd.Series(df[['A', 'B']].fillna('').values.tolist()).str.join('')
df
Out[8]:
A B C concat
0 1 4 7 14
1 2 5 8 25
2 3 6 9 36
However, I've noticed that approach can sometimes result in NaNs being populated where they shouldn't, so here's another way:
>>> from functools import reduce
>>> df['concat'] = df[cols].apply(lambda x: reduce(lambda a, b: a + b, x), axis=1)
>>> df
A B C concat
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
Although it should be noted that this approach is a lot slower:
$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[df[["a", "b"]].apply(lambda x: reduce(lambda a, b: a + b, x)) for _ in range(10)]'
10 loops, best of 3: 451 msec per loop
Versus
$ python3 -m timeit 'import pandas as pd;from functools import reduce; df=pd.DataFrame({"a": ["this", "is", "a", "string"] * 5000, "b": ["this", "is", "a", "string"] * 5000});[pd.Series(df[["a", "b"]].fillna("").values.tolist()).str.join(" ") for _ in range(10)]'
10 loops, best of 3: 98.5 msec per loop
I don't have enough reputation to comment, so I'm building my answer off of blacksite's response.
For clarity, LunchBox commented that it failed for Python 3.7.0. It also failed for me on Python 3.6.3. Here is the original answer by blacksite:
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
Here is my modification for Python 3.6.3:
df['concat'] = pd.Series(df.fillna('').values.tolist()).map(lambda x: ''.join(map(str,x)))
The solutions given above that use numpy arrays have worked great for me.
However, one thing to be careful about is the indexing when you get the numpy.ndarray from df.values, since the axis labels are removed from df.values.
So to take one of the solutions offered above (the one that I use most often) as an example:
df['concat'] = pd.Series(df.fillna('').values.tolist()).str.join('')
This portion:
df.fillna('').values
does not preserve the indices of the original DataFrame. Not a problem when the DataFrame has the common 0, 1, 2, ... row indexing scheme, but this solution will not work when the DataFrame is indexed in any other way. You can fix this by adding an index= argument to pd.Series():
df['concat'] = pd.Series(df.fillna('').values.tolist(),
                         index=df.index).str.join('')
I always add the index= argument just to be safe, even when I'm sure the DataFrame is row-indexed as 0, 1, 2, ...
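A quick sketch of the pitfall on a frame with a non-default index (toy data assumed for illustration):
df = pd.DataFrame({'A': ['1', '2'], 'B': ['3', '4']}, index=[10, 20])
# Without index=, the new Series is indexed 0, 1 and nothing aligns with 10, 20:
df['bad'] = pd.Series(df.fillna('').values.tolist()).str.join('')
# With index=, the joined strings line up with the original rows:
df['good'] = pd.Series(df[['A', 'B']].fillna('').values.tolist(),
                       index=df.index).str.join('')
print(df)
#     A  B  bad good
# 10  1  3  NaN   13
# 20  2  4  NaN   24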
This lambda approach offers some flexibility with columns chosen and separator type:
Setup:
df = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4', '5', '6'], 'C': ['7', '8', '9']})
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Concatenate All Columns - no separator:
cols = ['A', 'B', 'C']
df['combined'] = df[cols].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
A B C combined
0 1 4 7 147
1 2 5 8 258
2 3 6 9 369
Concatenate Two Columns A and C with '_' separator:
cols = ['A', 'C']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
A B C combined
0 1 4 7 1_7
1 2 5 8 2_8
2 3 6 9 3_9
As a solution to Gary Dorman's question in the comments - "I would want to have a delimiter in place so when you're looking at your overall column, you can see how it's broken out." - you could use:
df_tmp = df.astype(str) + ','
df_tmp.sum(axis=1).str.rstrip(',')
before:
1.2.3.480tcp
6.6.6.680udp
7.7.7.78080tcp
8.8.8.88080tcp
9.9.9.98080tcp
after:
1.2.3.4,80,tcp
6.6.6.6,80,udp
7.7.7.7,8080,tcp
8.8.8.8,8080,tcp
9.9.9.9,8080,tcp
which looks better (like CSV :)
This additional sep step is about 30% slower on my machine.
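On newer pandas versions, DataFrame.agg can join the row values with the separator directly, which skips the trailing-comma cleanup (a sketch, not benchmarked here):
df['new'] = df.astype(str).agg(','.join, axis=1)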

How to split digits and text

I have a dataset like this
data = pd.DataFrame({ 'a' : [5, 5, '2 bad']})
I want to convert this to
{ 'a.digits' : [5, 5, 2], 'a.text' : [nan, nan, 'bad']}
I can get 'a.digits' as below:
data['a.digits'] = data['a'].replace('[^0-9]', '', regex = True)
5 2
2 1
Name: a, dtype: int64
When I do
data['a'] = data['a'].replace('[^\D]', '', regex = True)
or
data['a'] = data['a'].replace('[^a-zA-Z]', '', regex = True)
I get
5 2
bad 1
Name: a, dtype: int64
What's wrong? How to remove digits?
Would something like this suffice?
In [8]: import numpy as np
In [9]: import re
In [10]: data['a.digits'] = data['a'].apply(lambda x: int(re.sub(r'[\D]', '', str(x))))
In [12]: data['a.text'] = data['a'].apply(lambda x: re.sub(r'[\d]', '', str(x)))
In [13]: data.replace('', np.nan, regex=True)
Out[13]:
a a.digits a.text
0 5 5 NaN
1 5 5 NaN
2 2 bad 2 bad
Assuming there is a space between 2 and the word bad, you can do this:
data['Text'] = data['a'].str.split(' ').str[1]
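If you'd rather pull both pieces out in one pass, Series.str.extract with two capture groups is an alternative (a sketch; the text group comes back as an empty string for purely numeric rows, so it's swapped for NaN afterwards):
import numpy as np
parts = data['a'].astype(str).str.extract(r'(\d+)\s*(.*)')
data['a.digits'] = parts[0].astype(int)
data['a.text'] = parts[1].replace('', np.nan)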
