How to split based on string matching? - python

I have two lists, one that contains the user input and the other one that contains the mapping.
The user input looks like this :
The mapping looks like this :
I am trying to split the strings in the user input list. Sometimes users enter one record as CO109CO45, but these are really two codes that don't belong together. They need to be separated with a comma or a space, like CO109,CO45.
There are many records with the same problem, and I was thinking of using the mapping list to match and split. Can this be done? What do you suggest? Thanks in advance for your help!

Use a combination of lookahead and lookbehind in the regex passed to split.
import pandas as pd
df = pd.DataFrame({'RCode': ['CO109', 'CO109CO109']})
print(df)
RCode
0 CO109
1 CO109CO109
df.RCode.str.split(r'(?<=\d)(?=\D)')
0 [CO109]
1 [CO109, CO109]
Name: RCode, dtype: object

You can try with regex:
import pandas as pd
l = ['CO2740CO96', 'CO12', 'CO973', 'CO870CO397', 'CO584', 'CO134CO42CO685']
df = pd.DataFrame({'code': l})
# One or more letters followed by one or more digits is one code
df.code = df.code.str.findall(r'[A-Za-z]+\d+')
print(df)
Output:
code
0 [CO2740, CO96]
1 [CO12]
2 [CO973]
3 [CO870, CO397]
4 [CO584]
5 [CO134, CO42, CO685]

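I usually use something like this, for an input original_list:
output_list = [
    [
        ('CO' + target).strip(' ,')
        for target in item.split('CO')
        # split() yields an empty first piece when the item starts with 'CO'; skip it
        if target.strip(' ,')
    ]
    for item in original_list
]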
There are probably more efficient ways of doing it, but you don't need the overhead of dataframes / pandas, or the hard-to-read aspects of regexes.
If you have a manageable number of prefixes ("CO", "PR", etc.), you can set up a recursive function that splits on each of them. Alternatively, you can use .find() with the full codes.
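The question's idea of splitting with the mapping list can be sketched as follows. This is a minimal sketch, not the answerer's code: it assumes the mapping is a non-empty list of full codes, and the names split_codes and known_codes are hypothetical. Codes are tried longest first so that a short code such as CO10 does not cut CO109 in half.

```python
import re

def split_codes(record, known_codes):
    """Split a concatenated record into the known codes it contains."""
    # Longest codes first, so 'CO109' is preferred over its prefix 'CO10'.
    ordered = sorted(known_codes, key=len, reverse=True)
    pattern = "|".join(re.escape(code) for code in ordered)
    return re.findall(pattern, record)

# Hypothetical mapping list; the real one is not shown in the question.
known_codes = ["CO109", "CO45", "CO10"]
print(split_codes("CO109CO45", known_codes))  # ['CO109', 'CO45']
```

Anything in the record that is not a known code is silently dropped, so it may be worth comparing the joined result against the original string to catch unmapped codes.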

Related

Python: Unify multiple lists into one

Could you help me with the following challenge I am currently facing:
I have multiple lists, each of which contains multiple strings. Each string has the following format:
"ID-Type" - where ID is a number and type is a common Python type. One such example can be found here:
["1-int", "2-double", "1-string", "5-list", "5-int"],
["3-string", "1-int", "1-double", "5-double", "5-string"]
Before calculating further, I want to preprocess these lists to unify them in the following way:
Count how often each type appears in each list
Generate a new list, combining both results
Create a mapping from initial list to that new list
As an example
In the above lists, we have the following types:
List 1: 2 int, 1 double, 1 string, 1 list
List 2: 2 string, 2 double, 1 int
The resulting table should now contain:
2 int, 2 double, 2 string, 1 list (in order to be able to contain both lists), like this:
[
"int_1-int",
"int_2-int",
"double_1-double",
"double_2-double",
"string_1-string",
"string_2-string",
"list_1-list"
]
And lastly, in order to map input to output, the idea is to have a corresponding dictionary to map this transformation, e.g., for list_1:
{
"1-int": "int_1-int",
"2-double": "double_1-double",
"1-string": "string_1-string",
"5-list": "list_1-list",
"5-int": "int_2-int"
}
I want to avoid doing this with nested loops and multiple iterations. Are there any libraries, or is there a smart vectorized solution, for this?
Just add them:
Example :
['it'] + ['was'] + ['annoying']
You should read the Python tutorial to learn basic info like this.
Just another method....
import itertools
ab = itertools.chain(['it'], ['was'], ['annoying'])
list(ab)
In general, this approach doesn't really make sense unless you specifically need to have the items in the resulting list and dict in this exact format. But here's how you can do it:
def process_type_list(type_list):
    mapping = dict()
    for i in type_list:
        i_type = i.split('-')[1]
        n_occur = 1
        map_val = f'{i_type}_{n_occur}-{i_type}'
        # Bump the counter until the target name is unused
        while map_val in mapping.values():
            n_occur += 1
            map_val = f'{i_type}_{n_occur}-{i_type}'
        mapping[i] = map_val
    return mapping
l1 = ["1-int", "2-double", "1-string", "5-list", "5-int"]
l2 = ["3-string", "1-int", "1-double", "5-double", "5-string"]
l1_mapping = process_type_list(l1)
l2_mapping = process_type_list(l2)
Additionally, Python does not have a double type. Python's float is implemented as a C double (use decimal.Decimal if you need fine control over the precision).
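The function above builds the per-list mapping but not the combined list from the question. A minimal sketch of that step, assuming every item has the "ID-Type" shape, is to take the per-type maximum across lists with collections.Counter. The helper names type_counts and unified_list are hypothetical.

```python
from collections import Counter

def type_counts(type_list):
    """Count how often each type occurs in one list of 'ID-Type' strings."""
    return Counter(item.split("-", 1)[1] for item in type_list)

def unified_list(*type_lists):
    """Build the combined list, sized by the maximum count of each type."""
    combined = Counter()
    for type_list in type_lists:
        # Counter union keeps the larger count for every key.
        combined |= type_counts(type_list)
    return [
        f"{t}_{n}-{t}"
        for t, count in combined.items()
        for n in range(1, count + 1)
    ]

l1 = ["1-int", "2-double", "1-string", "5-list", "5-int"]
l2 = ["3-string", "1-int", "1-double", "5-double", "5-string"]
print(unified_list(l1, l2))
```

For the two example lists this yields 2 int, 2 double, 2 string and 1 list entries, in order of first appearance of each type.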
I am pretty sure that this is what you want to do:
To make a joint list:
['any item'] + ['any item 2']
If you want to turn the list into a dictionary:
dict(zip(['key 1', 'key 2'], ['value 1', 'value 2']))
Another method of joining 2 lists:
a = ['list item', 'another list item']
a.extend(['another list item', 'another list item'])

Replace s.str.startswith parameters only in a series

I have a df on which I want to filter a column and replace the str.startswith parameter. Example:
df = pd.DataFrame(data={'fname':['Anky','Anky','Tom','Harry','Harry','Harry'],'lname':['sur1','sur1','sur2','sur3','sur3','sur3'],'role':['','abc','def','ghi','','ijk'],'mobile':['08511663451212','','0851166346','','0851166347',''],'Pmobile':['085116634512','1234567890','8885116634','','+353051166347','0987654321'],'Isactive':['Active','','','','Active','']})
by executing the below line :
df['Pmobile'][df['Pmobile'].str.startswith(('08','8','+353'),na=False)]
I get :
0 085116634512
2 8885116634
4 +353051166347
How do I replace only the prefixes I passed to s.str.startswith(), for example ('08','8','+3538'), without touching any other digits except the leading ones listed in the tuple (on the fly)?
I found this the most convenient and concise:
df.Pmobile = df.Pmobile.replace(r'^(?:08|88|\+3538)', '', regex=True)
You can use pandas' replace with a regex.
Below is sample code:
df.Pmobile.replace(regex={r'^08':'',r'^8':'',r'^[+]353':''})
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html
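To reuse the same tuple that was passed to startswith, the anchored pattern can be built from it. The following is a sketch using only the standard re module so the behaviour is easy to check on its own; the names prefixes and strip_prefix are hypothetical.

```python
import re

# Same prefixes as in the question's startswith() call.
prefixes = ('08', '8', '+353')

# Escape each prefix ('+' is a regex metacharacter) and try longer ones first.
alternatives = '|'.join(
    re.escape(p) for p in sorted(prefixes, key=len, reverse=True)
)
pattern = re.compile('^(?:' + alternatives + ')')

def strip_prefix(number):
    """Remove one leading prefix; leave the rest of the number untouched."""
    return pattern.sub('', number, count=1)

print(strip_prefix('085116634512'))   # 5116634512
print(strip_prefix('+353051166347'))  # 051166347
print(strip_prefix('1234567890'))     # 1234567890
```

On the DataFrame itself, the same pattern string should work as df['Pmobile'].str.replace(pattern.pattern, '', regex=True).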

Comparing a list to a dataframe column and create new column with numbers

I have a dataframe where one column contains urls. I want to compare it to a list of string values and wherever they match add a number to a new column.
The column looks something like this:
source
www.fox5.com/some_article
www.nyt.com/some_article
www.fox40news.com/some_article
www.cnn.com/another_article
...
I want to compare it to this list:
sources = ['fox', 'yahoo', 'abcnews', 'google', 'cnn', 'nyt', 'nbc',
'washingtonpost', 'wsj', 'huffingtonpost']
and where the sources value is contained in the source column add the corresponding number of the list location to a new column. So the resulting new column would look something like this:
sources sourcenum
www.fox5.com/some_article 1
www.nyt.com/some_article 6
www.fox40news.com/some_article 1
www.cnn.com/another_article 5
... ...
I've tried using a for loop with a count:
count = 1
for x in sources:
    if x in df.source.values:
        df.sourcenum = count
    count += 1
but the output is just all 0's.
I also tried using numpy's where, but that doesn't accept 10 arguments.
The list could be changed to a dictionary like so if that would work better
sources = {'fox':1, 'yahoo':2, 'abcnews':3, 'google':4, 'cnn':5, 'nyt':6,
'nbc':7, 'washingtonpost':8, 'wsj':9, 'huffingtonpost':10}
Any help would be appreciated, thanks.
One way is to use a generator expression with enumerate. In the below implementation we cycle through an enumerated sources list. next extracts the first instance of a partial match. If no partial match exists, 0 is returned.
sources = ['fox', 'yahoo', 'abcnews', 'google', 'cnn', 'nyt', 'nbc',
'washingtonpost', 'wsj', 'huffingtonpost']
def sourcer(x):
    return next((i for i, j in enumerate(sources, 1) if j in x), 0)
df['sourcenum'] = df['source'].apply(sourcer)
print(df)
source sourcenum
0 www.fox5.com/some_article 1
1 www.nyt.com/some_article 6
2 www.fox40news.com/some_article 1
3 www.cnn.com/another_article 5
It looks like a regular expression can help here. Python has the 're' module, though I'm not a Python expert.
The idea is to compose a pattern from your sources list and match that pattern against the strings. The matched text then tells you which source was found, and from that you can look up the number you need.
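A minimal sketch of that idea, using the sources list from the question: join the names into one alternation and look up the position of whichever name matched. The function name source_number is hypothetical, and returning 0 for no match is an assumption.

```python
import re

sources = ['fox', 'yahoo', 'abcnews', 'google', 'cnn', 'nyt', 'nbc',
           'washingtonpost', 'wsj', 'huffingtonpost']

# One alternation over all source names, escaped in case of metacharacters.
pattern = re.compile('|'.join(re.escape(s) for s in sources))

def source_number(url):
    """Return the 1-based list position of the first source found, else 0."""
    match = pattern.search(url)
    return sources.index(match.group(0)) + 1 if match else 0

print(source_number('www.fox40news.com/some_article'))  # 1
print(source_number('www.nyt.com/some_article'))        # 6
```

Note that search returns the leftmost match in the URL, so if a URL contains several source names, the one appearing first in the string wins rather than the one appearing first in the list. With pandas this could be applied as df['source'].apply(source_number).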
You can also use the tldextract package to get the domain name of the URL.
Then apply the get_close_matches function from the difflib package to get the closest string.
Finally, use .index to get the corresponding index number from the list of sources:
import tldextract
from difflib import get_close_matches
df['sourcenum'] = df['source'].apply(lambda row:sources.index(
get_close_matches(
tldextract.extract(row).domain, sources, cutoff=.5)[0])+1)
print(df)
Result:
source sourcenum
0 www.fox5.com/some_article 1
1 www.nyt.com/some_article 6
2 www.fox40news.com/some_article 1
3 www.cnn.com/another_article 5
Note: in the code above, cutoff=.5 is passed to get_close_matches because with the default cutoff no close match was found for fox40news.

Iterate over a list of strings in python

I am trying to set up a data set that records how often several different names are mentioned in a list of articles. So for each article, I want to know how often nameA, nameB and so forth are mentioned. However, I have trouble iterating over the list.
My code is the following:
for element in list_of_names:
    for i in list_of_articles:
        list_of_namecounts = len(re.findall(element, i))
list_of_names = a string with several names [nameA nameB nameC]
list_of_articles = a list with 40.000 strings that are articles
Example of article in list_of_articles:
Index: 1
Type: str
Size: Amsterdam - de financiële ...
The error I get is: expected string or buffer
I thought that, when iterating over the list of strings, the re.findall command would work with lists like this, but I am also fairly new to Python. Any idea how to solve my issue here?
Thank you!
If your list is ['apple', 'apple', 'banana'] and you want the result: number of apple = 2, then:
from collections import Counter
list_count = Counter(list_of_articles)
for element in list_of_names:
    list_of_namecounts = list_count[element]
And, assuming list_of_namecounts should be a list:
list_of_namecounts = []
for element in list_of_names:
    list_of_namecounts.append(list_count[element])
See this for more understanding
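For the per-article counts described in the question, a sketch could look like the following. The sample data is hypothetical. The reported error usually means re.findall was handed something that is not a string, so this sketch assumes some entries may be non-strings (for example None) and gives them zero counts.

```python
import re

# Hypothetical sample data standing in for the real inputs.
list_of_names = "nameA nameB nameC".split()
list_of_articles = ["nameA met nameB, then nameA left.", None, "nameC only"]

def count_names(article, names):
    """Count occurrences of each name in one article."""
    # Non-string entries get zero counts instead of raising a TypeError.
    if not isinstance(article, str):
        return {name: 0 for name in names}
    return {name: len(re.findall(re.escape(name), article)) for name in names}

# One dict of counts per article, in the same order as list_of_articles.
list_of_namecounts = [count_names(a, list_of_names) for a in list_of_articles]
print(list_of_namecounts[0])  # {'nameA': 2, 'nameB': 1, 'nameC': 0}
```

Splitting the names string first matters: iterating directly over a string yields single characters, not names.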

How to combine initialization and assignment of dictionary in Python?

I would like to figure out if any deal is selected twice or more.
The following example is stripped down for sake of readability. But in essence I thought the best solution would be using a dictionary, and whenever any deal-container (e.g. deal_pot_1) contains the same deal twice or more, I would capture it as an error.
The following code served me well, however by itself it throws an exception...
if deal_pot_1:
    duplicates[deal_pot_1.pk] += 1
if deal_pot_2:
    duplicates[deal_pot_2.pk] += 1
if deal_pot_3:
    duplicates[deal_pot_3.pk] += 1
...if I didn't initialize this beforehand like the following.
if deal_pot_1:
    duplicates[deal_pot_1.pk] = 0
if deal_pot_2:
    duplicates[deal_pot_2.pk] = 0
if deal_pot_3:
    duplicates[deal_pot_3.pk] = 0
Is there any way to simplify/combine this?
There are basically two options:
Use a collections.defaultdict(int). Upon access of an unknown key, it will initialise the corresponding value to 0.
For a dictionary d, you can do
d[x] = d.get(x, 0) + 1
to initialise and increment in a single statement.
Edit: A third option is collections.Counter, as pointed out by Mark Byers.
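The three options can be compared side by side. This is a sketch with hypothetical primary keys standing in for the deal_pot_N.pk values from the question.

```python
from collections import Counter, defaultdict

# Hypothetical primary keys of the selected deals; 11 is chosen twice.
selected = [11, 42, 11]

# Option 1: defaultdict(int) creates missing keys with the value 0.
duplicates = defaultdict(int)
for pk in selected:
    duplicates[pk] += 1

# Option 2: plain dict with get() supplying the initial 0.
plain = {}
for pk in selected:
    plain[pk] = plain.get(pk, 0) + 1

# Option 3: Counter does the counting in one call.
counter = Counter(selected)
repeated = [pk for pk, n in counter.items() if n > 1]
print(repeated)  # [11]
```

All three produce the same counts; Counter is the shortest when the keys are already available as an iterable.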
It looks like you want collections.Counter.
Look at collections.defaultdict. It looks like you want defaultdict(int).
So you only want to know if there are duplicated values? Then you could use a set:
duplicates = set()
for value in values:
    if value in duplicates:
        raise Exception('Duplicate!')
    duplicates.add(value)
If you would like to find all duplicates:
maybe_duplicates = set()
confirmed_duplicates = set()
for value in values:
    if value in maybe_duplicates:
        confirmed_duplicates.add(value)
    else:
        maybe_duplicates.add(value)
if confirmed_duplicates:
    raise Exception('Duplicates: ' + ', '.join(map(str, confirmed_duplicates)))
A set is probably the way to go here - collections.defaultdict is probably more than you need.
Don't forget to come up with a canonical order for your hands - like sort the cards from least to greatest, by suit and face value. Otherwise you might not detect some duplicates.
