I have the following data set:
column1
HL111
PG3939HL11
HL339PG
RC--HL--PG
I am attempting to write a function that does the following:
Loop through each row of column1
Pull only the alphabet and put into an array
If the array has "HL" in it, remove it from the array UNLESS HL is the only word in the array.
Take the first word in the array and output results.
So for the above example, my array (step2) would look like this:
[HL]
[PG,HL]
[HL,PG]
[RC,HL,PG]
and my desired final output (step4) would look like this:
desired_column
HL
PG
PG
RC
I have the code for step 2, and it seems to work fine
df['array_column'] = (df.column1.str.extractall('([A-Z]+)')
.unstack()
.values.tolist())
But I don't know how to get from here to my final output (step4).
You may achieve what you need by replacing all non-letters first, then extracting pairs of letters and then applying some custom logic to extract the necessary value from the array:
>>> df['array_column'].str.replace('[^A-Z]+', '').str.findall('([A-Z]{2})').apply(lambda d: [''] if len(d) == 0 else d).apply(lambda x: 'HL' if len(x) == 1 and x[0] == 'HL' else [m for m in x if m != 'HL'][0])
0 HL
1 PG
2 PG
3 RC
Name: array_column, dtype: object
>>>
Details
.replace('[^A-Z]+', '') - remove all chars other the uppercase letters
.str.findall('([A-Z]{2})') - extract pairs of letters
.apply(lambda d: [''] if len(d) == 0 else d) will add an empty item if there is no regex match in the previous step
.apply(lambda x: 'HL' if len(x) == 1 and x[0] == 'HL' else [m for m in x if m != 'HL'][0]) - custom logic: if the list length is 1 and it is equal to HL, keep it, else remove all HL and get the first element
This is one approach using apply
Demo:
import re
import pandas as pd
def checkValue(value):
value = re.findall(r"[A-Z]{2}", value)
if (len(value) > 1) and ("HL" in value):
return [i for i in value if i != "HL"][0]
else:
return value[0]
df = pd.DataFrame({"column1": ["HL111", "PG3939HL11", "HL339PG", "RC--HL--PG"]})
print(df.column1.apply(checkValue))
Output:
0 HL
1 PG
2 PG
3 RC
Name: column1, dtype: object
You can do something like this (or probably something more elegant), what you had already gets you to a fairly nice structure where you can use groupby to complete your solution
def extract_relevant_str(grp):
ret_val = None
if "HL" in grp[0].tolist() and len(grp) == 1:
ret_val = "HL"
elif len(grp) >= 1:
ret_val = grp.loc[grp[0] != "HL", 0].iloc[0]
return ret_val
items = df.column1.str.extractall('([A-Z]+)')
items.reset_index().groupby("level_0").apply(extract_relevant_str)
Output:
level_0
0 HL
1 PG
2 PG
3 RC
dtype: object
Related
So I have a data frame like this:
FileName
01011RT0TU7
11041NT4TU8
51391RST0U2
01011645RT0TU9
11311455TX0TU8
51041545ST3TU9
What I want is another column in the DataFrame like this:
FileName |RdwyId
01011RT0TU7 |01011000
11041NT4TU8 |11041000
51391RST0U2 |51391000
01011645RT0TU9|01011645
11311455TX0TU8|11311455
51041545ST3TU9|51041545
Essentially, if the first 5 characters are digits then concat with "000", if the first 8 characters are digits then simply move them to the RdwyId column
I am noob so I have been playing with this:
Test 1:
rdwyre1=re.compile(r'\d\d\d\d\d')
rdwyre2=re.compile(r'\d\d\d\d\d\d\d\d')
rdwy1=rdwyre1.findall(str(thous["FileName"]))
rdwy2=rdwyre2.findall(str(thous["FileName"]))
thous["RdwyId"]=re.sub(r'\d\d\d\d\d', str(thous["FileName"].loc[:4])+"000",thous["FileName"])
Test 2:
thous["RdwyId"]=np.select(
[
re.search(r'\d\d\d\d\d',thous["FileName"])!="None",
rdwyre2.findall(str(thous["FileName"]))!="None"
],
[
rdwyre1.findall(str(thous["FileName"]))+"000",
rdwyre2.findall(str(thous["FileName"])),
],
default="Unknown"
)
Test 3:
thous=thous.assign(RdwyID=lambda x: str(rdwyre1.search(x).group())+"000" if bool(rdwyre1.search(x))==True else str(rdwyre2.search(x).group()))
None of the above have worked. Could anyone help me figure out where I am going wrong? and how to fix it?
You can use numpy select, which replicates CASE WHEN for multiple conditions, and Pandas' str.isnumeric method:
cond1 = df.FileName.str[:8].str.isnumeric() # first condition
choice1 = df.FileName.str[:8] # result if first condition is met
cond2 = df.FileName.str[:5].str.isnumeric() # second condition
choice2 = df.FileName.str[:5] + "000" # result if second condition is met
condlist = [cond1, cond2]
choicelist = [choice1, choice2]
df.loc[:, "RdwyId"] = np.select(condlist, choicelist)
df
FileName RdwyId
0 01011RT0TU7 01011000
1 11041NT4TU8 11041000
2 51391RST0U2 51391000
3 01011645RT0TU9 01011645
4 11311455TX0TU8 11311455
5 51041545ST3TU9 51041545
def filt(list1):
for i in list1:
if i[:8].isdigit():
print(i[:8])
else:
print(i[:5]+"000")
# output
01011000
11041000
51391000
01011645
11311455
51041545
I mean, if your case is very specific, you can tweak it and apply it to your dataframe.
To a dataframe.
def filt(i):
if i[:8].isdigit():
return i[:8]
else:
return i[:5]+"000"
d = pd.DataFrame({"names": list_1})
d["filtered"] = d.names.apply(lambda x: filt(x)) #.apply(filt) also works im used to lambdas
#output
names filtered
0 01011RT0TU7 01011000
1 11041NT4TU8 11041000
2 51391RST0U2 51391000
3 01011645RT0TU9 01011645
4 11311455TX0TU8 11311455
5 51041545ST3TU9 51041545
Using regex:
c1 = re.compile(r'\d{5}')
c2 = re.compile(r'\d{8}')
rdwyId = []
for f in thous['FileName']:
m = re.match(c2, f)
if m:
rdwyId.append(m[0])
continue
m = re.match(c1, f)
if m:
rdwyId.append(m[0] + "000")
thous['RdwyId'] = rdwyId
Edit: replaced re.search with re.match as it's more efficient, since we are only looking for matches at the beginning of the string.
Let us try findall with ljust
df['new'] = df.FileName.str.findall(r"(\d+)[A-z]").str[0].str.ljust(8,'0')
Out[226]:
0 01011000
1 11041000
2 51391000
3 01011645
4 11311455
5 51041545
Name: FileName, dtype: object
I'm the process of cleaning a data frame, and one particular column contains values that are comprised of lists. I'm trying to find the average of those lists and update the existing column with an int while preserving the indices. I can successfully and efficiently convert those values to a list, but I lose the index values in the process. The code I've written below is too memory-tasking to execute. Is there a simpler code that would work?
data: https://docs.google.com/spreadsheets/d/1Od7AhXn9OwLO-SryT--erqOQl_NNAGNuY4QPSJBbI18/edit?usp=sharing
def Average(lst):
sum1 = 0
average = 0
if len(x) == 1:
for obj in x:
sum1 = int(obj)
if len(x)>1:
for year in x:
sum1 += int(year)
average = sum1/len(x)
return mean(average)
hello = hello[hello.apply([lambda x: mean(x) for x in hello])]
Here's the loop I used to convert the values into a list:
df_list1 = []
for x in hello:
sum1 = 0
average = 0
if len(x) == 1:
for obj in x:
df_list1.append(int(obj))
if len(x)>1:
for year in x:
sum1 += int(year)
average = sum1/len(x)
df_list1.append(int(average))
Use apply and np.mean.
import numpy as np
df = pd.DataFrame(data={'listcol': [np.random.randint(1, 10, 5) for _ in range(3)]}, index=['a', 'b', 'c'])
# np.mean will return NaN on empty list
df['listcol'] = df['listcol'].fillna([])
# can use this if all elements in lists are numeric
df['listcol'] = df['listcol'].apply(lambda x: np.mean(x))
# use this instead if list has numbers stored as strings
df['listcol'] = df['listcol'].apply(lambda x: np.mean([int(i) for i in x]))
Output
>>>df
listcol
a 5.0
b 5.2
c 4.4
I'm trying to write a function that will take a string, and given an integer, will remove all the adjacent duplicates larger than the integer and output the remaining string. I have this function right now that removes all the duplicates in a string, and I'm not sure how to put the integer constraint into it:
def remove_duplicates(string):
s = set()
list = []
for i in string:
if i not in s:
s.add(i)
list.append(i)
return ''.join(list)
string = "abbbccaaadddd"
print(remove_duplicates(string))
This outputs
abc
What I would want is a function like
def remove_duplicates(string, int):
.....
Where if for the same string I input int=2, I want to remove my n characters without removing all the characters. Output should be
abbccaadd
I'm also concerned about run time and complexity for very large strings, so if my initial approach is bad, please suggest a different approach. Any help is appreciated!
Not sure I understand your question correctly. I think that, given m repetitions of a character, you want to remove up to k*n duplicates such that k*n < m.
You could try this, using groupby:
>>> from itertools import groupby
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for k, g in groupby(string) for c in k * (len(list(g)) % n or n))
'abccadd'
Here, k * (len(list(g)) % n or n) means len(g) % n repetitions, or n if that number is 0.
Oh, you changed it... now my original answer with my "interpretation" of your output actually works. You can use groupby together with islice to get at most n characters from each group of duplicates.
>>> from itertools import groupby, islice
>>> string = "abbbccaaadddd"
>>> n = 2
>>> ''.join(c for _, g in groupby(string) for c in islice(g, n))
'abbccaadd'
Create group of letters, but compute the length of the groups, maxed out by your parameter.
Then rebuild the groups and join:
import itertools
def remove_duplicates(string,maxnb):
groups = ((k,min(len(list(v)),maxnb)) for k,v in itertools.groupby(string))
return "".join(itertools.chain.from_iterable(v*k for k,v in groups))
string = "abbbccaaadddd"
print(remove_duplicates(string,2))
this prints:
abbccaadd
can be a one-liner as well (cover your eyes!)
return "".join(itertools.chain.from_iterable(v*k for k,v in ((k,min(len(list(v)),maxnb)) for k,v in itertools.groupby(string))))
not sure about the min(len(list(v)),maxnb) repeat value which can be adapted to suit your needs with a modulo (like len(list(v)) % maxnb), etc...
You should avoid using int as a variable name as it is a python keyword.
Here is a vanilla function that does the job:
def deduplicate(string: str, treshold: int) -> str:
res = ""
last = ""
count = 0
for c in string:
if c != last:
count = 0
res += c
last = c
else:
if count < treshold:
res += c
count += 1
return res
I have trouble with shortening my code with lambda if possible. bp is my data name.
My data looks like this:
user label
1 b
2 b
3 c
I expect to have
user label Y
1 b 1
2 b 1
3 c 0
Here is my code:
counts = bp['Label'].value_counts()
def score_to_numeric(x):
if counts['b'] > counts['s']:
if x == 'b':
return 1
else:
return 0
else:
if x =='b':
return 0
else:
return 1
bp['Y'] = bp['Label'].apply(score_to_numeric) # apply above function to convert data
It is a function converting a categorical data 'b' or 's' in column named 'Label' into numeric data: 0 or 1. The line counts = bp['Label'].value_counts() counts the number of 'b' or 's' in column 'Label'. Then, in score_to_numeric, if the count of 'b' is more than 's', then give value 1 to b in a new column called 'Y', and vice versa.
I would like to shorten my code into 3-4 lines at most. I think perhaps using a lambda statement will do this, but I'm not familiar enough with lambdas.
Since True and False evaluate to 1 and 0, respectively, you can simply return the Boolean expression, converted to integer.
def score_to_numeric(x):
return int((counts['b'] > counts['s']) == \
(x == 'b'))
It returns 1 iff both expressions have the same Boolean value.
I don't think you need to use the apply method. Something simple like this should work:
value_counts = bp.Label.value_counts()
bp.Label[bp.Label == 'b'] = 1 if value_counts['b'] > value_counts['s'] else 0
bp.Label[bp.Label == 's'] = 1 if value_counts['s'] > value_counts['b'] else 0
You could do the following
counts = bp['Label'].value_counts()
t = 1 if counts['b'] > counts['s'] else 0
bp['Y'] = bp['Label'].apply(lambda x: t if x == 'b' else 1 - t)
I have two list of lists which basically need to be mapped to each other based on their matching items (list). The output is a list of pairs that were mapped. When the list to be mapped is of length one, we can look for direct matches in the other list. The problem arises, when the list to be mapped is of length > 1 where I need to find, if the list in A is a subset of B.
Input:
A = [['point'], ['point', 'floating']]
B = [['floating', 'undefined', 'point'], ['point']]
My failed Code:
C = []
for a in A:
for b in B:
if a == b:
C.append([a, b])
else:
if set(a).intersection(b):
C.append([a, b])
print C
Expected Output:
C = [
[['point'], ['point']],
[['point', 'floating'], ['floating', 'undefined', 'point']]
]
Just add a length condition to the elif statement:
import pprint
A = [['point'], ['point', 'floating']]
B = [['floating', 'undefined', 'point'], ['point']]
C = []
for a in A:
for b in B:
if a==b:
C.append([a,b])
elif all (len(x)>=2 for x in [a,b]) and not set(a).isdisjoint(b):
C.append([a,b])
pprint.pprint(C)
output:
[[['point'], ['point']],
[['point', 'floating'], ['floating', 'undefined', 'point']]]
Just for interests sake, here's a "one line" implementation using itertools.ifilter.
from itertools import ifilter
C = list(ifilter(
lambda x: x[0] == x[1] if len(x[0]) == 1 else set(x[0]).issubset(x[1]),
([a,b] for a in A for b in B)
))
EDIT:
Having reading the most recent comments on the question, I think I may have misinterpreted what exactly is considered to be a match. In which case, something like this may be more appropriate.
C = list(ifilter(
lambda x: x[0] == x[1] if len(x[0])<2 or len(x[1])<2 else set(x[0]).intersection(x[1]),
([a,b] for a in A for b in B)
))
Either way, the basic concept is the same. Just change the condition in the lamba to match exactly what you want to match.