How to remove extra quotes from strings in a set? - python

I am working on a method that should return a list in the format List[Tuple[Set[str], Set[str]]], but I can't figure out how to remove the extra quotes. self._history[dates[dat]]._ridings is a list of strings.
What this method does is check the voting areas in an election and compare those areas with the next or previous election, to see which areas were in the first but not in the second, and which were in the second but not in the first; those are the two elements of the tuple. There could be more than one area, and each may or may not be present in a given election. self._history[dates[dat]]._ridings is a list of the areas in one election, and I add 1 to dat so that it compares with the next election; this way every election is compared with the previous one.
I have tried the split and replace methods, but they don't seem to work, since it is a set, not a string.
list1 = []
dates = list(self._history)
for dat in range(0, len(dates) - 1):
    a = set(self._history[dates[dat]]._ridings)
    b = set(self._history[dates[dat + 1]]._ridings)
    list1.append(tuple([(a - b), (b - a)]))
return list1
Expected : [({"St. Paul's...St. Paul's"})]
Actual : [({'"St. Paul.... Paul\'s"'})]

Use slicing to remove the character at index 0 and the character at index (string length - 1).
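For example, a minimal sketch, assuming each riding string has a literal double quote baked in at each end (ridings here stands in for one of the sets your method builds):

ridings = {'"St. Paul\'s"'}  # sample set; the quotes are part of the string itself

# drop the first and last character of each string
cleaned = {r[1:-1] for r in ridings}

# or, more defensively, strip only leading/trailing double quotes
cleaned = {r.strip('"') for r in ridings}

print(cleaned)  # {"St. Paul's"}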


Why is re not removing some values from my list?

I'm asking more out of curiosity at this point since I found a work-around, but it's still bothering me.
I have a list of dataframes (x) that all have the same column names. I'm trying to use pandas and re to make a list of the subset of column names that have the format
"D(number) S(number)"
so I wrote the following function:
def extract_sensor_columns(x):
    sensor_name = list(x[0].columns)
    for j in sensor_name:
        if bool(re.match('D(\d+)S(\d+)', j)) == False:
            sensor_name.remove(j)
    return sensor_name
The list that I'm generating has 103 items (98 wanted items, 5 unwanted items). This function removes three of the five columns that I want to get rid of, but keeps the columns labeled 'Pos' and 'RH'. I generated the sensor_name list outside of the function and tested the truth value of
bool(re.match('D(\d+)S(\d+)', sensor_name[j]))
for all five of the items that I wanted to get rid of, and they all gave the value False. The other thing I tried is changing the conditional to ==True, which even more strangely gave me 54 items (all of the unwanted column names and half of the wanted column names).
If I rewrite the function to add the column names that have a given format (rather than remove column names that don't follow the format), I get the list I want.
def extract_sensor_columns(x):
    sensor_name = []
    for j in list(x[0].columns):
        if bool(re.match('D(\d+)S(\d+)', j)) == True:
            sensor_name.append(j)
    return sensor_name
Why is the first block of code acting so strangely?
In general, do not modify a list while iterating over it. The problem lies in the fact that you remove elements of the iterable in the first (wrong) case, whereas in the second (correct) case you append the elements you want to an empty list.
Consider this:
arr = list(range(10))
for el in arr:
    print(el)

for i, el in enumerate(arr):
    print(el)
    arr.remove(arr[i + 1])
The second loop only prints the even numbers (0, 2, 4, 6, 8), because every next element is removed while iterating.
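A safe alternative is to build a new list rather than mutating the one you iterate over. A minimal sketch of the original function rewritten as a list comprehension:

import re

def extract_sensor_columns(x):
    # keep only the column names matching the pattern; nothing is removed in place
    return [j for j in x[0].columns if re.match(r'D(\d+)S(\d+)', j)]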

How can multiple matches be applied in a dataframe?

I have a little performance issue with this problem:
There is a dataframe of shape (3_000_000, 6), call it A, and another one of shape (72_000, 6), call it B. To keep it simple, suppose both of them have only string columns.
In the B dataframe there are fields that can contain a ?, i.e. a question mark, in the value.
For example, the CITY column: New ?ork instead of New York. The task is to find the right string in the A dataframe.
So another example:
B Dataframe:

CITY     | ADDRESS | ADDRESS_TYPE
New ?ork | D?ck    | str?et

A Dataframe:

CITY     | ADDRESS | ADDRESS_TYPE
New York | Duck    | street
My plan was to make a multiprocess investigation that iterates over the B dataframe and, for each row, performs a multistep filter, where I first filter with A = A[A.CITY.str.match(city) == True], where city is a regular expression: New ?ork => New .ork.
So I first pre-filter on the city, then on the address, and so on...
matched_rows = A[A.CITY.str.match(city) == True]
matched_rows = matched_rows[matched_rows.ADDRESS.str.match(address) == True]
...
Besides that, there is another really important thing: there is a unique ID (not the index) column in the A dataframe that uniquely identifies the rows:
ABC123 - New York - Duck - Street
For me, it's really important to find the appropriate A row for every single B row.
The solution works, but its performance is horrible, and I don't really know how else to approach it.
5 cores were able to decrease the runtime to ~200 minutes, but I hope there is a better solution for this problem.
So, my question is: is there any better, more performant way to apply multiple matches to a dataframe?
As indicated by @dwhswenson, the only strategy that comes to mind is to reduce the size of the dataframes you test, which is not so much a programming problem as a data management problem. This is going to depend on your dataset and what kind of work you want to do, but one naive strategy would be to store the indices of rows whose column values start with 'a', 'b', etc., and then select a dataframe to match over based on the query string. So you'd do something like:
import itertools

alphabet = 'abcdefghijklmnopqrstuvwxyz'
keys = list(itertools.chain.from_iterable(([c, l] for c in A.columns) for l in alphabet))

A_indexes = {}
for k in keys:
    begins_with = lambda x: (x[0] == k[1]) or (x[0] == k[1].upper())
    A_indexes[k[0], k[1]] = A[k[0]].loc[A[k[0]].apply(begins_with)].index
Then make a function that takes a column name and a string to match, and returns a view of A that contains only the rows whose entries for that column begin with the same letter as the string to match:
def get_view(column, string_to_match):
    # the dictionary keys are (column, letter) tuples, so index with a tuple
    return A.loc[A_indexes[column, string_to_match[0].lower()]]
You'd have to double the number of indices of course for the case in which the first letter of the string to match is a wildcard, but this is only an example of the kind of thing you could do to slice the dataset before doing a regex on every row.
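As a hypothetical usage example (the names follow the sketch above), the pre-sliced view can then be regex-matched exactly as in the question, but over far fewer rows:

city = 'New ?ork'.replace('?', '.')  # turn the ? into a regex wildcard
view = get_view('CITY', city)        # only rows whose CITY starts with 'n'/'N'
matched_rows = view[view.CITY.str.match(city) == True]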
If you want to work with the unique ID above, you could make a more complex index-lookup dictionary for views into A:
import itertools

alphabet = 'abcdefghijklmnopqrstuvwxyz'
number_of_fields_in_ID = 6
separator = ' - '
keys = itertools.product(alphabet, repeat=number_of_fields_in_ID)

def check_ID(id, key):
    # key is a tuple of first letters, one per ID field
    id = id.split(separator)
    matches_key = True
    for i, first_letter in enumerate(key):
        if id[i][0].lower() != first_letter:
            matches_key = False
            break
    return matches_key

A_indexes = {}
for k in keys:
    A_indexes[k] = A.loc[A['unique_ID'].apply(check_ID, args=(k,))].index
This will apply the kludgy function check_ID to your 3e6-element series A['unique_ID'] 26**number_of_fields_in_ID times, indexing the dataframe once per iteration (this is more than 4.5e5 iterations if the ID has 4 fields), so there may be a significant up-front cost in compute time depending on your dataset. Whether or not this is worth it depends, first, on what you want to do afterwards (are you doing a single set of queries to make a second dataset and then you're done, or do you need to run a lot of arbitrary queries in the future?) and, second, on whether the first letters of each ID field are roughly evenly distributed over the alphabet (or over the decimal digits if you know a field is numeric, or both if it's alphanumeric). If you just have two or three cities, for instance, you wouldn't make indexes for the whole alphabet in that field. But again, doing the lookup this way is naive; if you go this route you'd come up with a lookup method that's tailored to your data.

How many times can a word be created using the input string?

Here is the problem I'm trying to solve:
Write a program to perform the following operations:
Read two inputs - a sequence of characters S & another shorter sequence Y from two separate lines of input
S only contains lower case characters among a-z
Calculate and print how many times the given word Y can be generated from the given sequence S
Characters from string S can be used in order
Each character can be used only once
Sample Input:
apqrctklatc //input
cat //the word that we need to create from input
Output:
word cat can be formed 2 times
Use this:
s = 'apqrctklatc'
y = 'cat'
yc = []
for i in y:
    # count how often each letter of y occurs in s
    # (note: this assumes the letters of y are distinct, as in 'cat')
    yc.append(s.count(i))
print(min(yc))
This, according to me, is the simplest solution.
Let's see how it works:
1) It loops through the second string ('cat').
2) It counts how many times each letter of that string occurs in the other string, i.e. 'apqrctklatc', and makes a list of the counts.
3) It finds the minimum value of the list formed, i.e. yc.
My solution is:
Step 1: Count the number of times that each distinct character appears in the 2nd input. Save the result to a map, called mapA, for example: a - 2 times, b - 3 times, etc.
Step 2: Iterate through the 1st input and count the number of times that each character in mapA appears. Save the result to a map, called mapB.
Step 3: Initialize a variable, called result, with a high integer value (max_int is a good choice). Iterate through mapA (or mapB, since both maps have the same set of keys). For each key in mapA, calculate the floor of mapB.value / mapA.value. If it is smaller than result, set result to that value.
Step 4: Return result, which is the answer you need.
For other cases that would make your result unexpected, such as the 1st and 2nd inputs having no common character, make sure that you catch all of them before following these steps. I hope you can finish it without sample code.
Good luck.
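For reference, a minimal sketch of the steps above using collections.Counter (the floor division also handles a word with repeated letters):

import collections

def count_formations(s, y):
    mapA = collections.Counter(y)  # Step 1: counts per distinct character of the word
    mapB = collections.Counter(s)  # Step 2: counts of those characters in the sequence
    # Steps 3-4: the minimum of floor(mapB.value / mapA.value) over the word's characters;
    # a character missing from s yields 0, which covers the no-common-character case
    return min(mapB[ch] // need for ch, need in mapA.items())

print(count_formations('apqrctklatc', 'cat'))  # prints 2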
You can try this:
import re

s = "apqrctklatc"
y = "cat"
ylst = [x for x in y]
print(ylst)
ycount = []
for ychar in ylst:
    count = len(re.findall(ychar, s))
    ycount.append(count)
print("word", y, "can be formed", min(ycount), "times")
It works for me as well; the printed output matches the expected result.

Shift a string and find the number of indices that match between each shift of the string

Having trouble trying to calculate the number of matches between 2 strings:
First, I need to shift a string by 1 to 5 positions and calculate the number of matches for each shift; if the number of matches is greater than before, then remember that shift number, otherwise keep the previous one. A match is counted when a character at an index in the previous string equals the character at the same index in the shifted string.
original string: ABCDZZ
shift 1: ABCDZ match = 1 (index 4; "ABCDZ" is the new string, the original shifted by 1)
shift 2: ABCD match = 0 ("ABCD" is the new string, the original shifted by 2)
shift 3: ABC match = 0 ("ABC" is the new string, the original shifted by 3)
shift 4: AB match = 0 ("AB" is the new string, the original shifted by 4)
shift 5: A match = 0 ("A" is the new string, the original shifted by 5)
So my output would be shift 1, since that had the greatest number of matches.
Doing it manually is fairly simple, but I am having trouble doing this with for loops. Mainly, how do I get the for loop to iterate while comparing the indices of each new string? Is that more effective, or would it make more sense to slice the string based on the number of shifts? If anyone could help with even just the pseudocode for this, that would help; I'm not sure my logic is right when I do it. Is there any way to figuratively create two rows and compare them that way? The indices of each string won't line up unless I put a space at the front as a placeholder every time I shift.
Edit:
I was thinking of splitting the string into a list of characters and then comparing it, index by index, to the "new" string, which would be list[:-1] with a placeholder prepended, essentially standing in for the spaces introduced by the shift while removing the last character. But going forward I have to use the "new" string, so I'm not sure how to keep those changes while still using variables in a for loop. The actual string is much larger than this one, so the letters are unknown in advance. I just need to find a way to compare the indices of the previous string to the indices of the shifted one.
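One way to sketch this, assuming a shift of k means comparing s[k:] with s[:-k] position by position (which reproduces the example above):

def best_shift(s, max_shift=5):
    best_k, best_matches = 0, -1
    for k in range(1, max_shift + 1):
        # align the original against itself shifted by k and count equal positions
        matches = sum(1 for a, b in zip(s[k:], s[:-k]) if a == b)
        if matches > best_matches:  # keep the shift with the most matches
            best_k, best_matches = k, matches
    return best_k, best_matches

print(best_shift("ABCDZZ"))  # (1, 1): shift 1 has the single match, at index 4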

Moving a for loop into a reduce method

I am trying to plot the locations of ~4k postcodes onto a UK map. I am using a library that takes a postcode and gives back latitude, longitude, etc.; however, the postcode must always contain a space before the last 3 characters in the string, for example:
'AF23 4FR' would be viable as the space is before the last 3 chars in the string..
'AF234FR' would not be allowed as there is no space..
I have to go over each item in my list and check that there is a space before the n-3 position in the string. I can do this with a simple for loop, but I would prefer to use a reduce function. I am struggling to work out how to rework the check and logic of the following into a reduce method; would it even be worth it in this scenario?
for index, p in enumerate(data_set):
    if (p.find(' ') == -1):
        first = p[:-3]   # everything up to the last 3 characters
        second = p[-3:]  # the last 3 characters
        data_set[index] = first + ' ' + second
You're pretty much there... Create a generator with the spaces removed from your strings, then apply slicing and formatting, and use a list comprehension to generate a new list of formatted values, e.g.:
pcs_without_spaces = (pc.replace(' ', '') for pc in data_set)
formatted = ['{} {}'.format(pc[:-3], pc[-3:]) for pc in pcs_without_spaces]
That way, you don't need additional logic for whether or not the value already contains a space: as long as your postcode is going to be valid after slicing, just removing the spaces and treating everything with the same logic is enough.
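If you specifically want the reduce form from the title, a minimal sketch using functools.reduce (the list comprehension above is arguably more idiomatic; data_set here is sample data):

import functools

data_set = ['AF23 4FR', 'AF234FR']  # sample postcodes

def add_postcode(acc, pc):
    pc = pc.replace(' ', '')  # normalize: drop any existing space
    return acc + ['{} {}'.format(pc[:-3], pc[-3:])]  # reinsert it before the last 3 chars

formatted = functools.reduce(add_postcode, data_set, [])
print(formatted)  # ['AF23 4FR', 'AF23 4FR']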
