I am trying to find a way to search for substrings in strings, for a problem like this:
findin = pd.Series({1:'abcab', 2: 'abab',3: 'abcdaa', 4:'cabca'})
what = pd.Series({1:'b',2: 'a',3: 'bc',4: 'abc'})
where "what" is what I am seeking and "findin" is the values I want to search
I would like the output to be something like
1 4
0 3
1
1
Every method I have tried complains about the different number of values that come out; I keep getting "Data must be 1-dimensional", for example with:
list(map(lambda x, y: x.find(y), findin, what))
I feel like expand needs to be here, but where does it go?
You can use a regex in a function and apply it on the findin Series:
import re

c = iter(range(1, 5))

def func(x):
    # pull the next index label from the counter to pair x with its pattern in what
    ind = next(c)
    return [i.start() for i in re.finditer(what[ind], x)]

findin.apply(func)
Out:
1 [1, 4]
2 [0, 2]
3 [1]
4 [1]
dtype: object
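If the external counter feels fragile (it is consumed after a single pass), here is a sketch of an index-based alternative, assuming findin and what share the same index labels:
import re
import pandas as pd

findin = pd.Series({1: 'abcab', 2: 'abab', 3: 'abcdaa', 4: 'cabca'})
what = pd.Series({1: 'b', 2: 'a', 3: 'bc', 4: 'abc'})

# pair each string with its pattern through the shared index
out = pd.Series({idx: [m.start() for m in re.finditer(what[idx], s)]
                 for idx, s in findin.items()})
print(out)
# 1    [1, 4]
# 2    [0, 2]
# 3       [1]
# 4       [1]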
I have a df like:
df = pd.DataFrame({'Temp' : ['ko1234', 'ko1234|ko445|ko568', 'map123', 'ko895', 'map123|ko889|ko665', 'ko635|map789|map777', 'ko985']})
Out:
ko1234
ko1234|ko445|ko568
map123
ko895
map123|ko889|ko665
ko635|map789|map777
ko985
I need two things:
I want to keep only the words starting with ko, leaving an empty slot for rows where nothing matches, so:
ko1234
ko1234|ko445|ko568
ko895
ko889|ko665
ko635
ko985
In another case I would like to do this:
if there is only one word keep it
if there are more words divided by a "|" keep only the second one, so:
ko1234
ko445
map123
ko895
ko889
map789
ko985
What is the best way to do this?
Here is how to do it using .apply (or .transform; the result will be the same).
The functions are applied to each element of the Series lists, where each element is a list of words (that were separated by "|" in the column Temp):
lists = df['Temp'].str.split('|')

def starting_with_ko(lst):
    # keep only the words starting with 'ko'; empty string if none match
    ko = [word for word in lst if word.startswith('ko')]
    return '|'.join(ko) if ko else ''

def choose_element(lst):
    # a single word is kept as-is; otherwise take the second word
    if len(lst) == 1:
        return lst[0]
    else:
        return lst[1]
out1 = lists.apply(starting_with_ko)
out2 = lists.apply(choose_element)
Results:
>>> out1
0 ko1234
1 ko1234|ko445|ko568
2
3 ko895
4 ko889|ko665
5 ko635
6 ko985
dtype: object
>>> out2
0 ko1234
1 ko445
2 map123
3 ko895
4 ko889
5 map789
6 ko985
dtype: object
We can split, then explode, and remove the unwanted items with startswith. Here s stands for df['Temp'], and fillna('') keeps the blank row that the first part asks for:
s = df['Temp']
out = s.str.split('|').explode().str.strip()
out1 = out[out.str.startswith('ko')].groupby(level=0).agg('|'.join).reindex(s.index).fillna('')
out2 = s.str.split('|').str[1].fillna(s)
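For the question's df this should give:
print(out1.tolist())
# ['ko1234', 'ko1234|ko445|ko568', '', 'ko895', 'ko889|ko665', 'ko635', 'ko985']
print(out2.tolist())
# ['ko1234', 'ko445', 'map123', 'ko895', 'ko889', 'map789', 'ko985']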
Problem Statement:
I have a DataFrame that has to be filtered with multiple conditions.
Each condition is optional: if the user enters an invalid value for a certain condition, that condition is skipped completely, defaulting to the DataFrame without that specific filter.
While I can implement this quite easily with multiple if-conditions, modifying the DataFrame sequentially, I am looking for something more elegant and scalable (with increasing input parameters), preferably using built-in pandas functionality.
Reproducible Example
Dummy dataframe -
df = pd.DataFrame({'One': ['a', 'a', 'a', 'b'],
                   'Two': ['x', 'y', 'y', 'y'],
                   'Three': ['l', 'm', 'm', 'l']})
print(df)
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Let's say that invalid values are the values that don't belong to the respective column. So, for column 'One', all values are invalid except 'a' and 'b'. If the user inputs 'a' then I should be able to filter the DataFrame with df[df['One']=='a']; however, if the user inputs any invalid value, no such filter should be applied and the original dataframe df is returned.
My attempt (with multiple parameters):
def valid_filtering(df, inp):
    if inp[0] in df['One'].values:
        df = df[df['One'] == inp[0]]
    if inp[1] in df['Two'].values:
        df = df[df['Two'] == inp[1]]
    if inp[2] in df['Three'].values:
        df = df[df['Three'] == inp[2]]
    return df
With all valid inputs -
inp = ['a','y','m'] #<- all filters valid so df is filtered before returning
print(valid_filtering(df, inp))
One Two Three
1 a y m
2 a y m
With few invalid inputs -
inp = ['a','NA','NA'] #<- only first filter is valid, so other 2 filters are ignored
print(valid_filtering(df, inp))
One Two Three
0 a x l
1 a y m
2 a y m
P.S. Additional question - is there a way to get DataFrame indexing to behave as -
df[df['One']=='valid'] -> returns filtered df
df[df['One']=='invalid'] -> returns original df
Because this would help me rewrite my filtering -
df[(df['One']=='valid') & (df['Two']=='invalid') & (df['Three']=='valid')] -> Filtered by col One and Three
EDIT: Solution -
An updated solution inspired by the code and logic provided by @corralien and @Ben.T:
df.loc[(df.eq(inp)|~df.eq(inp).any(0)).all(1)]
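For instance, a quick check with the partially valid input from the question (inp is aligned positionally with df's columns):
inp = ['a', 'NA', 'NA']
print(df.loc[(df.eq(inp) | ~df.eq(inp).any(0)).all(1)])
#   One Two Three
# 0   a   x     l
# 1   a   y     m
# 2   a   y     m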
Here is one way: create a Boolean dataframe by comparing each column of df with its value from inp. Then use any along the rows to find the columns with at least one True, and all along the columns of that selection to keep the rows that are True in every wanted column.
def valid_filtering(df, inp):
    # check where inp values are the same as in df
    # (the row [inp] is repeated so the comparison frame matches df's shape)
    m = df == pd.DataFrame(data=[inp] * len(df), index=df.index, columns=df.columns)
    # select the columns with at least one True
    cols = m.columns[m.any()]
    # select the rows that are all True amongst the wanted columns
    rows = m[cols].all(axis=1)
    # return df with the selected rows
    return df.loc[rows]
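A couple of sanity checks of this function against the question's df:
print(valid_filtering(df, ['a', 'y', 'm']))
#   One Two Three
# 1   a   y     m
# 2   a   y     m
print(valid_filtering(df, ['a', 'NA', 'NA']))
#   One Two Three
# 0   a   x     l
# 1   a   y     m
# 2   a   y     m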
Note that if you don't have the same number of filters as columns in your original df, you can use a dictionary instead; it works too, and in the example below the column Three is ignored because it is all False.
d = {'One': 'a', 'Two': 'y'}
m = (df==pd.DataFrame(d, index=df.index).reindex(columns=df.columns))
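Completing the dictionary variant with the same selection logic as in the function above, this sketch should keep only the rows matching both valid filters:
cols = m.columns[m.any()]
print(df.loc[m[cols].all(axis=1)])
#   One Two Three
# 1   a   y     m
# 2   a   y     m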
The key is: if a column comes back all False (~b.any(), an invalid filter), return True instead, to accept all values of that column:
import numpy as np

mask = df.eq(inp).apply(lambda b: np.where(~b.any(), True, b))
out = df.loc[mask.all(axis="columns")]
Case 1: inp = ['a','y','m'] (with all valid inputs)
>>> out
One Two Three
1 a y m
2 a y m
Case 2: inp = ['a','NA','NA'] (with few invalid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
Case 3: inp = ['NA','NA','NA'] (with no valid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Case 4: inp = ['b','x','m'] (with all valid inputs but no results)
>>> out
Empty DataFrame
Columns: [One, Two, Three]
Index: []
Of course, you can increase input parameters:
df["Four"] = ['i','j','k','k']
inp = ['a','NA','m','k']
>>> out
One Two Three Four
2 a y m k
Another way with list comprehension:
def valid_filtering(df, inp):
    series = [df[column] == inp[i]
              for i, column in enumerate(df.columns)
              if len(df[df[column] == inp[i]].values) > 0]
    for s in series:
        df = df[s]
    return df
Output of print(valid_filtering(df, ['a','NA','NA'])):
One Two Three
0 a x l
1 a y m
2 a y m
I am mapping specific keywords against text data using applymap in Python. Let's say I want to check how often the keyword "hello" matches the text data over all rows. applymap gives me the desired matrix outcome, however it only gives "True" or "False" instead of the number of appearances.
I tried to connect count() with my applymap function, but I could not make it work.
The minimal working example is as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'text': ['hello hello', 'yes no hello', 'good morning']})
keys = ['hello']
keyword = pd.DataFrame({0:keys})
res = []
for a in df['text']:
    res.append(keyword.applymap(lambda x: x in a))
map = pd.concat(res, axis=1).T
map.index = np.arange(len(map))
#Output
map
0
0 True
1 True
2 False
#Desired output, with 'hello' appearing twice in the first row of df, once in the second, and zero times in the third.
0
0 2
1 1
2 0
I am looking for a way to keep my applymap function to obtain the matrix form, but replace the True (1) and False (0) with the number of appearances, such as the desired output shows above.
Instead of testing for membership:
res.append(keyword.applymap(lambda x: x in a))  # True/False: is x a substring of a?
You should use:
res.append(keyword.applymap(lambda x: str.count(a, x)))  # count occurrences of x in a
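For reference, a minimal end-to-end version of the question's loop with the counting lambda swapped in; a.count(x) is the method form of str.count(a, x), and applymap was renamed DataFrame.map in newer pandas:
import pandas as pd
import numpy as np

df = pd.DataFrame({'text': ['hello hello', 'yes no hello', 'good morning']})
keyword = pd.DataFrame({0: ['hello']})

res = []
for a in df['text']:
    # count non-overlapping occurrences of each keyword in this row's text
    res.append(keyword.applymap(lambda x: a.count(x)))

counts = pd.concat(res, axis=1).T
counts.index = np.arange(len(counts))
print(counts)
#    0
# 0  2
# 1  1
# 2  0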
Hello I have this list:
b = [['2018-12-14', '2019-01-11', '2019-01-25', '2019-02-08', '2019-02-22', '2019-07-26'],
     ['2018-06-14', '2018-07-11', '2018-07-25', '2018-08-08', '2018-08-22', '2019-01-26'],
     ['2017-12-14', '2018-01-11', '2018-01-25', '2018-02-08', '2018-02-22', '2018-07-26']]
# each inner sequence holds dates of dtype datetime64[ns]
and I want to know if it's possible to compare this list of dates with another date. I am doing it like this:
r = df.loc[(b[1] > vdate)]
with:
vdate = dt.date(2018, 9, 19)
the output is correct because it selects the values that satisfy the condition. But the problem is that I want to do that for all the list values. Something like:
r = df.loc[(b > vdate)] # Without [1]
but this gives an error as output, as I expected.
I tried a for loop and it seems to work, but I am not sure:
g = []
for i in range(len(b)):
    r = df.loc[b[i] > vdate]
    g.append(r)
Thank you so much for your time and any help would be perfect.
One may use the apply function as stated by @Joseph Developer, but a simple list comprehension does not require you to write a function. The following gives you a list of booleans telling you whether each date is greater than vdate:
is_after_b = [x > vdate for x in b]
And if you want to include this directly in your DataFrame you may write:
df['is_after_b'] = [x > vdate for x in df.b]
This assumes that b is a column of df, which would also guarantee that the length of b matches your DataFrame's rows.
EDIT
I did not consider that b was a list of lists; you would need to flatten b using:
flat_b = [item for sublist in b for item in sublist]
And you can now use :
is_after_b = [x > vdate for x in flat_b]
If you want to go through the entire list, use the .apply() method to process your list through a function:
ds['new_list'] = ds['list_dates'].apply(function)
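A hedged sketch of that suggestion; the frame, column, and function names (ds, list_dates, after_vdate) are placeholders, not from the question:
import datetime as dt
import pandas as pd

vdate = dt.date(2018, 9, 19)
ds = pd.DataFrame({'list_dates': [
    pd.to_datetime(['2018-06-14', '2019-01-26']),
    pd.to_datetime(['2017-12-14', '2018-02-22']),
]})

def after_vdate(dates):
    # keep only the dates in this row's list that fall after vdate
    return [d for d in dates if d.date() > vdate]

ds['new_list'] = ds['list_dates'].apply(after_vdate)
print(ds['new_list'])  # row 0 keeps one date, row 1 keeps none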
I have a list of phone numbers that have been dialed (nums_dialed).
I also have a set of phone numbers which are the numbers in a client's office (client_nums).
How do I efficiently figure out how many times I've called a particular client (total)?
For example:
>>> nums_dialed = [1, 2, 2, 3, 3]
>>> client_nums = set([2, 3])
>>> ???
total = 4
The problem is that I have a large-ish dataset: len(client_nums) ~ 10^5 and len(nums_dialed) ~ 10^3.
Which client has 10^5 numbers in his office? Do you do work for an entire telephone company?
Anyway:
print(sum(1 for num in nums_dialed if num in client_nums))
That will give you the number as fast as possible.
If you want to do it for multiple clients, using the same nums_dialed list, then you could cache the data on each number first:
import collections

nums_dialed_dict = collections.defaultdict(int)
for num in nums_dialed:
    nums_dialed_dict[num] += 1
Then just sum the counts for each client:
sum(nums_dialed_dict[num] for num in this_client_nums)
That would be a lot quicker than iterating over the entire list of numbers again for each client.
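A hedged usage sketch, with two hypothetical client sets sharing one cache:
import collections

nums_dialed = [1, 2, 2, 3, 3]
nums_dialed_dict = collections.defaultdict(int)
for num in nums_dialed:
    nums_dialed_dict[num] += 1

client_a = {2, 3}  # hypothetical client sets
client_b = {1, 3}
print(sum(nums_dialed_dict[num] for num in client_a))  # -> 4
print(sum(nums_dialed_dict[num] for num in client_b))  # -> 3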
>>> client_nums = set([2, 3])
>>> nums_dialed = [1, 2, 2, 3, 3]
>>> count = 0
>>> for num in nums_dialed:
... if num in client_nums:
... count += 1
...
>>> count
4
>>>
Should be quite efficient even for the large numbers you quote.
Using collections.Counter from Python 2.7:
dialed_count = collections.Counter(nums_dialed)
count = sum(dialed_count[t] for t in client_nums)
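Runnable form with the question's data; Counter returns 0 for numbers that were never dialed, so no key checks are needed:
import collections

nums_dialed = [1, 2, 2, 3, 3]
client_nums = {2, 3}
dialed_count = collections.Counter(nums_dialed)
print(sum(dialed_count[t] for t in client_nums))  # -> 4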
That's a very popular way to combine sorted lists in a single pass:
nums_dialed = [1, 2, 2, 3, 3]
client_nums = [2, 3]

nums_dialed.sort()
client_nums.sort()

c = 0
i = iter(nums_dialed)
j = iter(client_nums)
try:
    a = next(i)
    b = next(j)
    while True:
        if a < b:
            a = next(i)
            continue
        if a > b:
            b = next(j)
            continue
        # a == b
        c += 1
        a = next(i)  # next dialed
except StopIteration:
    pass
print(c)
Because "set" is unordered collection (don't know why it uses hashes, but not binary tree or sorted list) and it's not fair to use it there. You can implement own "set" through "bisect" if you like lists or through something more complicated that will produce ordered iterator.
The method I use is to simply convert the set into a list and then use the len() function to count its elements (len(set_var) alone works just as well):
set_var = {"abc", "cba"}
print(len(list(set_var)))
Output:
2