Check if character within string; pass if True, do stuff if False - python

I am writing code to process a list of URLs, but some of the URLs have issues and I need to skip them in my for loop. I've tried this:
x_data = []
y_data = []
for item in drop['URL']:
    if re.search("J", str(item)) == True:
        pass
    else:
        print(item)
        var = urllib.request.urlopen(item)
        hdul = ft.open(var)
        data = hdul[0].data
        start = hdul[0].header['WMIN']
        finish = hdul[0].header['WMAX']
        start_log = np.log10(start)
        finish_log = np.log10(finish)
        redshift = hdul[0].header['Z']
        length = len(data[0])
        xaxis = np.linspace(start, finish, length)
        # calculating emitted wavelength from observed and redshift
        x_axis_nr = [xaxis[j]/(redshift+1) for j in range(len(xaxis))]
        gauss_kernel = Gaussian1DKernel(5/3)
        flux = np.convolve(data[0], gauss_kernel)
        wavelength = np.convolve(x_axis_nr, gauss_kernel)
        x_data.append(x_axis_nr)
        y_data.append(data[0])
where drop is a previously defined pandas DataFrame. Previous questions on this topic suggested regex might be the way to go, and I have tried this to filter out any URL containing the letter J (only the bad ones contain it).
I get this:
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0581.fit
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0582.fit
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0584.fit
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0587.fit
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0589.fit
http://www.gama-survey.org/dr3/data/spectra/sdss/spec-0915-52443-0592.fit
http://www.gama-survey.org/dr3/data/spectra/2qz/J113606.3+001155a.fit
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-8-2a3083a3a6d7> in <module>
14 finish_log = np.log10(finish)
15 redshift = hdul[0].header['Z']
---> 16 length = len(data[0])
17
18 xaxis = np.linspace(start, finish, length)
TypeError: object of type 'numpy.float32' has no len()
which is the same kind of error I was getting before trying to remove the J URLs, so clearly my regex is not working. I would appreciate some advice on how to filter these, and am happy to provide more information as required.

There's no need to compare the result of re.search with True. From the documentation you can see that search returns a match object when a match is found:
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
So, when a match object is compared with True the result is False, and your else branch is executed.
In [35]: re.search('J', 'http://www.gama-survey.org/dr3/data/spectra/2qz/J113606.3+001155a.fit') == True
Out[35]: False
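A minimal sketch of the fix for the loop above: rely on the truthiness of the match object (a match is truthy, None is falsy), or drop the regex and use the in operator.
for item in drop['URL']:
    if re.search("J", str(item)):  # match object is truthy, None is falsy
        continue  # skip the bad URL
    # ... process the good URL as before ...

# or, without a regex:
for item in drop['URL']:
    if "J" in str(item):
        continue
    # ... process the good URL as before ...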

List Indexing using range() and len() not working

I am trying to parse through a data structure and I have used a for loop initializing a variable i and using the range() function. I originally set my range to be the size of the records, 25,173, but then I kept receiving an
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In [63], line 41
37 input_list.append(HOME_AVG_PTS)
39 return input_list
---> 41 Data["AVG_PTS_HOME"]= extract_wl(AVG_PTS_HOME)
Cell In [63], line 28, in extract_wl(input_list)
26 if i==25172:
27 HOME_AVG_PTS[0]= HOME_AVG_PTS[0]/count
---> 28 elif int(Data(["TEAM_ID_AWAY"][i])) == const_home_team_id:
29 if type(PTS_AWAY[i]) is int:
30 count+= 1
IndexError: list index out of range
So I tried changing my for loop to be in the range of the function with the issue i.e.
for i in range(len(Data["TEAM_ID_AWAY"])):
But I keep receiving the same error still
The Data variable holds the contents of a CSV file which I have read with the pandas module and put into Data. You can assume all the column headers I have used are valid and furthermore that they all have length 25173.
[Image showing the range and values for Data["TEAM_ID_HOME"]]
AVG_PTS_HOME = []
def extract_wl(input_list):
    for j in range(25173):
        const_season_id = Data["SEASON_ID"][j]
        #print(const_season_id)
        const_game_id = int(Data["GAME_ID"][j])
        #print(const_game_id)
        const_home_team_id = Data["TEAM_ID_HOME"][j]
        #print(const_home_team_id)
        #if j==10:
        #break
        print("Iteration #", j)
        print(len(Data["TEAM_ID_AWAY"]))
        count = 0
        HOME_AVG_PTS = [0.0]
        for i in range(len(Data["TEAM_ID_AWAY"])):
            if (int(Data["GAME_ID"][i]) < const_game_id and int(Data["SEASON_ID"][i]) == const_season_id):
                if int(Data["TEAM_ID_HOME"][i]) == const_home_team_id:
                    if type(PTS_HOME[i]) is int:
                        count += 1
                        HOME_AVG_PTS[0] += PTS_HOME[i]
                    if i == 25172:
                        HOME_AVG_PTS[0] = HOME_AVG_PTS[0]/count
                elif int(Data(["TEAM_ID_AWAY"][i])) == const_home_team_id:
                    if type(PTS_AWAY[i]) is int:
                        count += 1
                        HOME_AVG_PTS[0] += PTS_AWAY[i]
                    if i == 25172:
                        HOME_AVG_PTS[0] = float(HOME_AVG_PTS[0]/count)
        print(HOME_AVG_PTS)
        input_list.append(HOME_AVG_PTS)
    return input_list
Data["AVG_PTS_HOME"] = extract_wl(AVG_PTS_HOME)
Can anyone point out why I am having this error or help me resolve it? In the meantime I think I am going to just create a separate function which takes a list of all the AWAY_IDs and then parse through that instead.
Hardcoding the length of the list is likely to lead to bugs down the line even if you're able to get it all working correctly with a particular data set. Iterating over the lists directly is much less error-prone and generally produces code that's easier to modify. If you want to iterate over multiple lists in parallel, use the zip function. For example, this:
for j in range(25173):
    const_season_id = Data["SEASON_ID"][j]
    const_game_id = int(Data["GAME_ID"][j])
    const_home_team_id = Data["TEAM_ID_HOME"][j]
    # do stuff with const_season_id, const_game_id, const_home_team_id
can be written as:
for (const_season_id, game_id, const_home_team_id) in zip(
        Data["SEASON_ID"], Data["GAME_ID"], Data["TEAM_ID_HOME"]
):
    const_game_id = int(game_id)
    # do stuff with const_season_id, const_game_id, const_home_team_id
The zip function will never give you an IndexError because it will automatically stop iterating when it reaches the end of any of the input lists. (If you want to stop iterating when you reach the end of the longest list instead, there's zip_longest.)
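For completeness, a small sketch of the zip_longest variant (the two lists here are made up just to show the padding behaviour):
from itertools import zip_longest

a = [1, 2, 3]
b = ['x', 'y']
for num, letter in zip_longest(a, b, fillvalue=None):
    # the shorter list is padded with the fill value once it runs out
    print(num, letter)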
elif int(Data(["TEAM_ID_AWAY"][i])) == const_home_team_id
Your parentheses are in the wrong place.
You have parentheses around (["TEAM_ID_AWAY"][i]), therefore it is trying to take the i'th index of the single-element list ["TEAM_ID_AWAY"].
You want Data["TEAM_ID_AWAY"][i], not Data(["TEAM_ID_AWAY"][i]).
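In context, the corrected condition (everything else unchanged) reads:
elif int(Data["TEAM_ID_AWAY"][i]) == const_home_team_id: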

python: data cleaning - detect pattern for fraudulent email addresses

I am cleaning a data set with fraudulent email addresses that I am removing.
I established multiple rules for catching duplicates and fraudulent domains. But there is one scenario where I can't think of how to code a rule in Python to flag them.
So I have for example rules like this:
# delete punctuation
df['email'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))

# flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')

# flag duplicates
df['duplicate'] = df.email.duplicated(keep=False)
This is the data where I can't figure out a rule to catch it. Basically I am looking for a way to flag addresses that start the same way but then have consecutive numbers at the end.
abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com
My solution isn't efficient, nor pretty. But check it out and see if it works for you @jeangelj. It definitely works for the examples you provided. Good luck!
import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...] # for example the 16 email addresses you gave in your question
shuffle(emails) # everyday i'm shuffling
emails = sorted(emails) # sort that shit!
names = [email.split('@')[0] for email in emails]

T = 0.7 # <- set your string similarity threshold here!!

split_indices = []
for i in range(1, len(emails)):
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
        split_indices.append(i) # we want to remember where dissimilar email address occurs

grouped = []
for i in split_indices:
    grouped.append(emails[:i])
    grouped.append(emails[i:])
# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings = []
for group in grouped:
    prefix_strings.append(os.path.commonprefix(group))

# finally
ham = []
spam = []
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])
In [30]: ham
Out[30]: ['abc7020@gmail.com', 'attn1@gmail.com']

In [31]: spam
Out[31]:
['abc7020.10@gmail.com',
 'abc7020.11@gmail.com',
 'abc7020.12@gmail.com',
 'abc7020.13@gmail.com',
 'abc7020.14@gmail.com',
 'abc7020.15@gmail.com',
 'abc7020.1@gmail.com',
 'attn12345678@gmail.com',
 'attn1234567@gmail.com',
 'attn123456@gmail.com',
 'attn12345@gmail.com',
 'attn1234@gmail.com',
 'attn123@gmail.com',
 'attn12@gmail.com']
# THE TRUTH YALL!
You can use a regular expression to do this; example below:
import re

a = "attn12345@gmail.comf"
b = "abc7020.14@gmail.com"
c = "abc7020@gmail.com"
d = "attn12345678@gmail.com"

pattern = re.compile(r"[0-9]{3,500}\.?[0-9]{0,500}?@")
if pattern.search(a):
    print("spam1")
if pattern.search(b):
    print("spam2")
if pattern.search(c):
    print("spam3")
if pattern.search(d):
    print("spam4")
If you run the code you will see:
$ python spam.py
spam1
spam2
spam3
spam4
The benefit of this method is that it's standardized (regular expressions) and that you can adjust the strength of the match easily by adjusting the values within {}, which means you can have a global configuration file where you set/adjust the values. You can also adjust the regular expression easily without having to rewrite code.
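As a rough sketch of that idea (the variable names and values here are made up for illustration), the repetition bounds could be loaded from one place and formatted into the pattern:
import re

min_digits, max_digits = 3, 500  # hypothetical values read from a config file
pattern = re.compile(r"[0-9]{%d,%d}\.?[0-9]{0,%d}?@" % (min_digits, max_digits, max_digits))

print(bool(pattern.search("attn12345@gmail.com")))  # True -> flagged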
First, take a look at the regexp question here.
Second, try to filter the email address like this:
# Let's say email = 'attn1234@gmail.com'
email = 'attn1234@gmail.com'
email_name = email.split('@', maxsplit=1)[0]
# Here you get email_name = 'attn1234'

import re
m = re.search(r'\d+$', email_name)
# if the string ends in digits, m will be a Match object, or None otherwise
if m is not None:
    print('%s is good' % email)
else:
    print('%s is BAD' % email)
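A rough way to apply that check to the whole column, in the same style as the yopmail flag above (the column name 'trailing_digits' is just an example):
import re
import numpy as np

def has_trailing_digits(email):
    # True if the local part (before the @) ends in one or more digits
    return re.search(r'\d+$', email.split('@', 1)[0]) is not None

df['trailing_digits'] = np.where(df['email'].apply(has_trailing_digits), 'Y', '0')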
You could pick a diff threshold using edit distance (aka Levenshtein distance). In python:
$ pip install editdistance
$ ipython2
>>> import editdistance
>>> threshold = 5  # This could be anything, really
>>> data = ["attn1@gmail.com...", ...]  # set up data to be the set you gave
>>> fraudulent_emails = set([email for email in data for other in data
...                          if other != email and editdistance.eval(email, other) < threshold])
If you wanted to be smarter about it, you could run through the resulting list and, instead of turning it into a set, keep track of how many other email addresses it was near - then use that as a 'weight' to determine fake-ness.
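A sketch of that weighting idea, reusing the hypothetical data list and threshold from above:
>>> from collections import Counter
>>> weights = Counter()
>>> for email in data:
...     # count how many *other* addresses are within the edit-distance threshold
...     weights[email] = sum(1 for other in data
...                          if other != email and editdistance.eval(email, other) < threshold)
>>> # a higher count suggests the address sits in a cluster of near-duplicates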
This gets you not only the given cases (where the fraudulent addresses all share a common start and differ only in a numerical suffix), but also catches number or letter padding, e.g. at the beginning or in the middle of an email address.
ids = [s.split('@')[0] for s in email_list]

det = np.zeros((len(ids), len(ids)), dtype=bool)
for i in range(len(ids)):
    for j in range(i + 1, len(ids)):
        mi = ids[i]
        mj = ids[j]
        # flag j if it is i plus exactly one extra trailing digit
        if len(mj) == len(mi) + 1 and mj.startswith(mi):
            try:
                int(mj[-1])
                det[j, i] = True
                det[i, j] = True
            except ValueError:
                continue

spam_indices = np.where(np.sum(det, axis=0) != 0)[0].tolist()
Here's one way to approach it that should be pretty efficient.
We do it by grouping the email addresses by length, so that we only need to check whether each address matches the level below, using a slice and a set-membership check.
The code:
First, read in the data:
import pandas as pd
import numpy as np

string = '''
abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com
foo123@bar.com
foo1@bar.com
'''
x = pd.DataFrame({'x': string.split()})

# remove duplicates:
x = x[~x.x.duplicated()]
We strip off the @foo.bar part, then filter to only those that end with a number, and add on a 'length' column:
# split on @, expand means into two columns
emails = x.x.str.split('@', expand=True)
# filter to rows whose last character is a digit
emails = emails.loc[emails.loc[:, 0].str[-1].str.isdigit(), :]
# add a length of email column for the next step
emails['lengths'] = emails.loc[:, 0].str.len()
Now, all we have to do is take each length and length - 1, and see if the email, with its last character dropped, appears in the set of addresses one character shorter (and we also check the opposite direction, in case it is the shortest repeat in the group):
# unique lengths to check
lengths = emails.lengths.unique()
# mask to hold results
mask = pd.Series([0]*len(emails), index=emails.index)
# for each length
for j in lengths:
    # we subset those of that length
    totest = emails['lengths'] == j
    # and those who might be the shorter version
    against = emails['lengths'] == j - 1
    # we make a set of unique values, for a hashed lookup
    againstset = set([i for i in emails.loc[against, 0]])
    # we cut off the last char of each in to test
    tests = emails.loc[totest, 0].str[:-1]
    # we check matches, by checking the set
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value=0)
    # vice versa, otherwise we miss the smallest one in the group
    againstset = set([i for i in emails.loc[totest, 0].str[:-1]])
    tests = emails.loc[against, 0]
    mask = mask.add(tests.apply(lambda x: x in againstset), fill_value=0)
The resulting mask can be converted to boolean and used to subset the original (deduplicated) dataframe; the indices match the original, so we can subset like this:
x.loc[~mask.astype(bool), :]
                    x
0   abc7020@gmail.com
16     foo123@bar.com
17       foo1@bar.com
You can see that we have not removed your first value, as the '.' means it did not match - you can remove the punctuation first.
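For instance, a rough sketch of that pre-step (the 'x_clean' column name is made up; it drops dots from the part before the @ so 'abc7020.1' lines up with 'abc7020'):
local = x.x.str.split('@', expand=True)
x['x_clean'] = local[0].str.replace('.', '', regex=False) + '@' + local[1]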
I have an idea on how to solve this:
fuzzywuzzy
Create a set of unique emails, for-loop over them and compare them with fuzzywuzzy.
Example:
from fuzzywuzzy import fuzz

for email in emailset:
    for row in data:
        emailcomp = re.search(pattern=r'(.+)@.+', string=email).groups()[0]
        rowemail = re.search(pattern=r'(.+)@.+', string=row['email']).groups()[0]
        if row['email'] == email:
            continue
        elif fuzz.partial_ratio(emailcomp, rowemail) > 80:
            'flagging operation'
I took some liberties with how the data is represented, but I feel the variable names are mnemonic enough for you to understand what I am getting at. It is a very rough piece of code, in that I have not thought through how to stop repetitive flagging.
Anyways, the elif part compares the two email addresses without @gmail.com (or any other domain, e.g. @yahoo.com); if the ratio is above 80 (play around with this number), use your flagging operation.
For example:
fuzz.partial_ratio("abc7020.1", "abc7020")
100

Error building bitstring in python 3.5 : the datatype is being set to U32 without my control

I'm using a function to build an array of strings (which happen to contain only 0s and 1s), which are rather large. The function works when I am building smaller strings, but somehow the data type seems to be restricting the size of the string to 32 characters (U32), without my having asked for it. Am I missing something simple?
As I build the strings, I am first casting them as lists so as to more easily manipulate individual characters before joining them into a string again. Am I somehow limiting my ability to use 'larger' data types by my method? The value of np.max(CM1) in this case is something like ~300 (one recent run yielded 253), but the strings only come out 32 characters long...
''' Function to derive genome and count mutations in provided list of cells '''
def derive_genome_biopsy(biopsy_list, family_dict, CM1):
    derived_genomes_inBx = np.zeros(len(biopsy_list)).astype(str)
    for position, cell in np.ndenumerate(biopsy_list):
        if cell == 0: continue
        temp_parent = 2
        bitstring = list('1')
        bitstring += (np.max(CM1)-1)*'0'
        if cell == 1:
            derived_genomes_inBx[position] = ''.join(bitstring)
            continue
        else:
            while temp_parent > 1:
                temp_parent = family_dict[cell]
                bitstring[cell-1] = '1'
                if temp_parent == 1: break
                cell = family_dict[cell]
            derived_genomes_inBx[position] = ''.join(bitstring)
    return derived_genomes_inBx
The specific error message I get is:
Traceback (most recent call last):
File "biopsyCA.py", line 77, in <module>
if genome[site] == '1':
IndexError: string index out of range
family_dict is a dictionary which carries a list of parents and children that the algorithm above works through to reconstruct the 'genome' of individuals from the branching family tree. It basically sets positions in the bitstring to '1' if your parent had it, then if your grandparent did, and so on, until you get to the first bit, which is always '1', and then it should be done.
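For illustration only (the real dictionary comes from the simulation; these numbers are made up), family_dict maps each cell to its parent, so walking it upwards always ends at cell 1:
family_dict = {2: 1, 3: 1, 4: 2, 5: 4}  # child -> parent; cell 1 is the root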
The 32 character limitation comes from the conversion of the float64 array to a string array in this line:
derived_genomes_inBx = np.zeros(len(biopsy_list)).astype(str)
The resulting array contains datatype U32 values (32-character unicode strings), which limits the contents to 32 characters.
To change this limit, use 'U300' or larger instead of str.
You may also use list(map(str, np.zeros(len(biopsy_list)))) to get a more flexible list of strings, and convert it back to a numpy array with numpy.array() after you have populated it.
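A minimal sketch of both suggestions (300 here is just an example width, taken from the ~300 value mentioned in the question):
import numpy as np

n = 10  # stand-in for len(biopsy_list)

# option 1: fixed-width unicode dtype wide enough for the longest bitstring
genomes = np.zeros(n).astype('U300')

# option 2: a plain Python list of strings, converted back once populated
genomes_list = list(map(str, np.zeros(n)))
# ... populate genomes_list ...
genomes = np.array(genomes_list)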
Thanks to help from a number of folks here and locally, I finally got this working; the working function is:
''' Function to derive genome and count mutations in provided list of cells '''
def derive_genome_biopsy(biopsy_list, family_dict, CM1):
    derived_genomes_inBx = list(map(str, np.zeros(len(biopsy_list))))
    for biopsy in range(0, len(biopsy_list)):
        if biopsy_list[biopsy] == 0:
            bitstring = (np.max(CM1))*'0'
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
            continue
        bitstring = list('1')
        bitstring += (np.max(CM1)-1)*'0'
        if biopsy_list[biopsy] == 1:
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
            continue
        else:
            temp_parent = family_dict[biopsy_list[biopsy]]
            bitstring[biopsy_list[biopsy]-1] = '1'
            while temp_parent > 1:
                temp_parent = family_dict[position]
                bitstring[temp_parent-1] = '1'
                if temp_parent == 1: break
            derived_genomes_inBx[biopsy] = ''.join(bitstring)
    return derived_genomes_inBx
The original problem was, as Teppo Tammisto pointed out, an issue with the 'str' dtype defaulting to a 32-character ('U32') format. Once I changed to using the list(map(str, ...)) approach a few more issues arose with the original code, which I've now fixed. When I finish this thesis chapter I'll publish the whole family of functions used to virtually 'biopsy' a cellular automaton model (well, just an array really) and reconstruct 'genomes' from family tree data and the current automaton state vector.
Thanks all!

Pass integer from one def to another in Python

I'm trying to cross-compare two outputs labeled "S" in compareDNA (calculating Hamming distance), but I cannot figure out how to pass an integer from one def to another. I've tried returning the variable, but I am unable to call it (in a different def) after returning it.
I'm attempting to see which of the outputs of compareDNA(Udnalin, Mdnalin) and compareDNA(Udnalin, Hdnalin) is higher, to determine which has the greater Hamming distance.
How does one pass an integer from one def to another?
import sys

def main():
    var()

def var():
    Mdna = open("mouseDNA.txt", "r")
    Mdnalin = Mdna.readline()
    print(Mdnalin)
    Mdna.close

    Hdna = open("humanDNA.txt", "r")
    Hdnalin = Hdna.readline()
    print(Hdnalin)
    Hdna.close

    Udna = open("unknownDNA.txt", "r")
    Udnalin = Udna.readline()
    print(Udnalin)
    Udna.close

    S = 0
    S1 = 0
    S2 = 0
    print("Udnalin + Mdnalin")
    compareDNA(Udnalin, Mdnalin)
    S1 = S
    print("Udnalin + Hdnalin")
    compareDNA(Udnalin, Hdnalin)

def compareDNA(i, j):
    diffs = 0
    length = len(i)
    for x in range(length):
        if i[x] != j[x]:
            diffs += 1
    S = length - diffs / length
    S = round(S, 2)
    return S
    # print("Mouse")
    # print("Human")
    # print("RATMA- *cough* undetermined")

main()
You probably want to assign the value returned by each call to compareDNA to a separate variable in your var function. Then you can do whatever you want with those values (what exactly you want to do is not clear from your question). Try something like this:
S1 = compareDNA(Udnalin, Mdnalin) # bind the return value from this call to S1
S2 = compareDNA(Udnalin, Hdnalin) # and this one to S2
# do something with S1 and S2 here!
If what you want to do is especially simple (e.g. comparing them to see which is larger), you could even use the return values directly in an expression, such as the condition in an if statement:
if compareDNA(Udnalin, Mdnalin) > compareDNA(Udnalin, Hdnalin):
    print("Unknown DNA is closer to a Mouse")
else:
    print("Unknown DNA is closer to a Human")
There's one further thing I'd like to point out, which is unrelated to the core of your question: You should use with statements to handle closing your files, rather than manually trying to close them. Your current code doesn't actually close the files correctly (you're missing the parentheses after .close in each case which are needed to make it a function call).
If you use a with statement instead, the files will be closed automatically at the end of the block (even if there is an exception):
with open("mouseDNA.txt", "r") as Mdna:
Mdnalin = Mdna.readline()
print(Mdnalin)
with open("humanDNA.txt", "r") as Hdna:
Hdnalin = Hdna.readline()
print(Hdnalin)
with open("unknownDNA.txt", "r") as Udna:
Udnalin = Udna.readline()
print(Udnalin)

Pyparsing : issues with setResultsName

I'm parsing multiple choice questions with multiple answers that look like this:
ParserElement.setDefaultWhitespaceChars(u""" \t""")
in_ = """1) first stem.
= option one one key
= option one two key
- option one three distractor
= option one four key
2) second stem ?
- option two one distractor
- option two two distractor
= option one three key
3) third stem.
- option three one key
= option three two distractor
"""
The equal sign represents a correct answer, the dash a distractor.
My grammar looks like this:
newline = Suppress(u"\n")
end_number = Suppress(oneOf(u') / ('))
end_stem = Suppress(oneOf(u"? .")) + newline
end_phrase = Optional(u'.').suppress() + newline
phrase = OneOrMore(Word(alphas)) + end_phrase
prefix = Word(u"-", max=1)('distractor') ^ Word(u"=", max=1)('key')
stem = Group(OneOrMore(Word(alphas))) + end_stem
number = Word(nums) + end_number
question = (number + stem('stem') +
            Group(OneOrMore(Group(prefix('prefix') + phrase('phrase'))))('options'))
And when I'm parsing the results:
for match, start, end in question.scanString(in_):
    for o in match.options:
        try:
            print('key', o.prefix.key)
        except:
            print('distractor', o.prefix.distractor)
I get:
AttributeError: 'unicode' object has no attribute 'distractor'
I'm pretty sure the result names are chainable. If so, what am I doing wrong? I can easily work around this but it's unsatisfactory not knowing what I did wrong and what I misunderstood.
The problem is that o is actually the prefix -- when you call o.prefix, you're actually going one level deeper than you need to, and are retrieving the string the prefix maps to, not the ParseResults object.
You can see this by modifying the code so that it prints out the parse tree:
for match, start, end in question.scanString(in_):
    for o in match.options:
        print(o.asXML())
        try:
            print('key', o.prefix.key)
        except:
            print('distractor', o.prefix.distractor)
The code will then print out:
<prefix>
<key>=</key>
<phrase>option</phrase>
<ITEM>one</ITEM>
<ITEM>one</ITEM>
<ITEM>key</ITEM>
</prefix>
Traceback (most recent call last):
File "so07.py", line 37, in <module>
print('distractor', o.prefix.distractor)
AttributeError: 'str' object has no attribute 'distractor'
The problem then becomes clear -- if o is the prefix, then it doesn't make sense to do o.prefix. Rather, you need to simply call o.key or o.distractor.
Also, it appears that if you try and call o.key where no key exists, then pyparsing will return an empty string rather than throwing an exception.
So, your fixed code should look like this:
for match, start, end in question.scanString(in_):
    for o in match.options:
        if o.key != '':
            print('key', o.key)
        else:
            print('distractor', o.distractor)
