Separating if conditions for which there can be some overlapping cases - python

Given a pandas dataframe wb, brought into pandas from an Excel-exported CSV with read_csv().
Column ad_tag_name comes in groups of 3. I want to append _level2 to the value of this column in every second row of each group of 3, and _level3 in every third row of each group of 3, so that each group of 3 ends up reading: name, name_level2, name_level3.
I have decided to use mod division, with the logic that "if it divides evenly by both 2 and 3, then append _level3; if it divides evenly only by 2, then append _level2; if it divides evenly only by 3, then append _level3. Otherwise, leave it alone."
for index, elem in enumerate(wb['ad_requests']):
    if np.mod(index+1,2) == 0 and np.mod(index+1,3) == 0:
        wb.at[index,'\xef\xbb\xbf"ad_tag_name"'] = wb.at[index,'\xef\xbb\xbf"ad_tag_name"'] + "_level3"
    elif np.mod(index+1,3) == 0:
        wb.at[index,'\xef\xbb\xbf"ad_tag_name"'] = wb.at[index,'\xef\xbb\xbf"ad_tag_name"'] + "_level3"
    elif np.mod(index+1,2) == 0:
        wb.at[index,'\xef\xbb\xbf"ad_tag_name"'] = wb.at[index,'\xef\xbb\xbf"ad_tag_name"'] + "_level2"
Yet when I save the resulting CSV and examine it, the pattern I see is: no suffix, _level2, _level3, _level2, no suffix, _level3, no suffix, _level2, _level3, and then this repeats. So it's correct in 8 out of 9 cases, but really that is an accident. I don't like the fact that there may be some overlap between the ifs/elifs I have defined, and I am sure this flawed logic is at the root of the problem.
How can we re-write the conditions so that they properly achieve the logic I have in mind?
Python: 2.7.10
Pandas: 0.18.0

While pandas can provide some elegant shortcuts, it can also lead one down rabbit-holes of trial-and-error.
Sometimes going back to basics, to what Python provides built in, is the way to go.
for i in range(len(wb))[2::3]:
    wb.at[i,'\xef\xbb\xbf"ad_tag_name"'] = wb.at[i,'\xef\xbb\xbf"ad_tag_name"'] + "_level3"
for i in range(len(wb))[1::3]:
    wb.at[i,'\xef\xbb\xbf"ad_tag_name"'] = wb.at[i,'\xef\xbb\xbf"ad_tag_name"'] + "_level2"
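If you would rather keep the mod-based conditions from the question, the only divisor that matters is 3: within each group of 3, a remainder of 0 marks the third row and a remainder of 2 marks the second row. A minimal sketch along those lines (same BOM-prefixed column name as above):

col = '\xef\xbb\xbf"ad_tag_name"'
for index in range(len(wb)):
    if (index + 1) % 3 == 0:      # third row of each group of 3
        wb.at[index, col] = wb.at[index, col] + "_level3"
    elif (index + 1) % 3 == 2:    # second row of each group of 3
        wb.at[index, col] = wb.at[index, col] + "_level2"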

Related

Unable to change value of dataframe at specific location

So I'm trying to go through my dataframe in pandas, and if the values of two columns equal something, I change a value in that location. Here is a simplified version of the loop I've been using (I changed the values of the if/else conditions because the original used regex and was quite complicated):
pro_cr = ["IgA", "IgG", "IgE"] # CR's considered productive
rows_changed = 0
prod_to_unk = 0
unk_to_prod = 0
changed_ids = []
for index in df_sample.index:
    if num == 1 and color == "red":
        pass
    elif num == 2 and color == "blue":
        prod_to_unk += 1
        changed_ids.append(df_sample.loc[index, "Sequence ID"])
        df_sample.at[index, "Functionality"] = "unknown"
        rows_changed += 1
    elif num == 3 and color == "green":
        unk_to_prod += 1
        changed_ids.append(df_sample.loc[index, "Sequence ID"])
        df_sample.at[index, "Functionality"] = "productive"
        rows_changed += 1
    else:
        pass
print("Number of productive columns changed to unknown: {}".format(prod_to_unk))
print("Number of unknown columns changed to productive: {}".format(unk_to_prod))
print("Total number of rows changed: {}".format(rows_changed))
So the main problem is the changing code:
df_sample.at[index, "Functionality"] = "unknown" # or productive
If I run this code without those lines, it works properly: it finds all the correct locations, tells me how many were changed and what their IDs are, which I can use to validate against the CSV file.
If I use df_sample["Functionality"][index] = "unknown" # or productive the code runs, but checking the rows that have been changed shows that they were not changed at all.
When I use df.at[row, column] = value I get "AttributeError: 'BlockManager' object has no attribute 'T'"
I have no idea why this is showing up. There are no duplicate columns. Hope this was clear (if not let me know and I'll try to clarify it). Thanks!
To be honest, I've never used df.at - but try using df.loc instead:
df_sample.loc[index, "Functionality"] = "unknown"
You can also use iat.
Example: df.iat[i, j], where i is the row position and j is the column position.
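As a minimal sketch of both suggestions (using a hypothetical toy frame, since df_sample itself isn't shown):

import pandas as pd

# hypothetical toy frame standing in for df_sample
df_sample = pd.DataFrame({
    "Sequence ID": ["s1", "s2", "s3"],
    "Functionality": ["productive", "unknown", "productive"],
})

for index in df_sample.index:
    # label-based assignment, as suggested above
    df_sample.loc[index, "Functionality"] = "unknown"

# positional assignment: row 0, column position of "Functionality"
col = df_sample.columns.get_loc("Functionality")
df_sample.iat[0, col] = "productive"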

Assign value via iloc to Pandas Data Frame

Code snippet:
for row in df.itertuples():
    current_index = df.index.get_loc(row.Index)
    if current_index < max_index - 29:
        if df.iloc[current_index + 30].senkou_span_a == 0.0:
            df.iloc[current_index + 30].senkou_span_a = 6700
        if df.iloc[current_index + 30].senkou_span_b == 0.0:
            df.iloc[current_index + 30].senkou_span_b = 6700.0
In the last line, where I am assigning a value via iloc, the assignment goes through without error, but the resulting value is still 0.0. I ran into this before when I was assigning the value to a copy of the df, but that doesn't seem to be the case this time. I have been staring at this all day. I hope it is just something silly.
df is time-series (DatetimeIndex) financial data. I have verified that the correct index exists, and no exceptions are thrown.
There is other code to assign values and those work just fine, omitted that code for the sake of brevity.
EDIT
This line works:
df.iloc[current_index + 30, df.columns.get_loc('senkou_span_b')] = 6700
why does this one, and not the original?
I'm not sure exactly what's causing your problem (I'm pretty new to Python), but here are some things that came to mind that might be helpful:
Assuming that the value you're replacing is always 0 or 0.0, maybe try switching from = to += to add instead of assign?
Is your dataframe in tuple format when you attempt to assign the value? Your issue might be that tuples are immutable.
If senkou_span_a refers to a column that you're isolating, maybe try using only iloc to isolate the value, like df.iloc[current_index + 30, 1] == 0.0
Hope this helped!
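For reference, a minimal sketch on a toy frame contrasting the two assignment styles from the question; the single-step row/column form is the one the EDIT reports working:

import pandas as pd

# toy frame with hypothetical values, standing in for the real time-series data
df = pd.DataFrame({"senkou_span_a": [0.0, 0.0], "senkou_span_b": [0.0, 0.0]})

# chained form: df.iloc[0] produces an intermediate Series, and the attribute
# assignment usually lands on that temporary object rather than on df
df.iloc[0].senkou_span_a = 6700.0

# single-step form: one indexer call receives both row and column positions,
# so pandas writes straight into df (the pattern from the question's EDIT)
df.iloc[1, df.columns.get_loc("senkou_span_a")] = 6700.0

print(df)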

How to optimize searching and comparing rows in pandas?

I have two dfs. Base has 100k rows, Snps has 54k rows.
This is structure of dfs:
base:
SampleNum SampleIdInt SecondName
1 ASA2123313 A2123313
2 ARR4112234 R4112234
3 AFG4234122 G4234122
4 GGF412233 F412233
5 GTF423512 F423512
6 POL23523552 L23523552
...
And this is Snps df:
SampleNum SampleIdInt
1 ART2114155
2 KWW4112234
3 AFG4234122
4 GGR9999999
5 YUU33434324
6 POL23523552
...
Now look, for example, at the 2nd row in Snps and base: they have the same number (the first 3 chars are not important to me now).
So I created a common list containing the numbers from snps which also appear in base, i.e. all rows with the SAME numbers between the dfs (common has 15k elements).
common_list = [4112234, 4234122, 23523552]
And now I want create three new lists.
confirmedSnps = where the whole SampleIdInt is identical to the one in base. In this example: AFG4234122. For these I am sure that SecondName will be proper.
un_comfirmedSnps = where I have a matching number but the first three chars are different. Example: KWW4112234 in snps and ARR4112234 in base. In this case, I'm not sure that SecondName is proper, so I need to check it later.
And last, the moreThanOne list. That list should collect all duplicate rows. For example, if base contains both KWW4112234 and AFG4112234, both should go to that list.
I wrote some code. It works fine, but the problem is time. I have 15k elements to filter, and each element takes about 4 seconds to process, which means the whole loop would run for about 17 hours!
I'm looking for help optimizing that code.
That's my code:
comfirmedSnps = []
un_comfirmedSnps = []
moreThanOne = []
for i in range(len(common)):
    testa = baza[baza['SampleIdInt'].str.contains(common[i])]
    testa = testa.SampleIdInt.unique()
    print("StepOne")
    testb = snps[snps['SampleIdInt'].str.contains(common[i])]
    testb = testb.SampleIdInt.unique()
    print("StepTwo")
    if len(testa) == 1 and len(testb) == 1:
        if (testa == testb) == True:
            comfirmedSnps.append(testb)
        else:
            un_comfirmedSnps.append(testb)
    else:
        print("testa has more than one contains records. ")
        moreThanOne.append(testb)
    print("StepTHREE")
    print(i, "/", range(len(common)))
I added the Step prints to check which part takes most of the time. It's the code between StepOne and StepTwo; the first and third steps run instantly.
Can someone help me with this case? I'm sure most of you will see a better solution to this problem.
What you are trying to do is commonly called a join, which, annoyingly enough, is called merge in pandas. There's just the minor annoyance of the three initial letters to deal with, but that's easy:
snps['numeric_id'] = snps['SampleIdInt'].apply(lambda s: s[3:])
base['numeric_id'] = base['SampleIdInt'].apply(lambda s: s[3:])
now you can compute the three dataframes:
confirmed = snps.merge(base, on='SampleIdInt')
both = snps.merge(base, on='numeric_id', suffixes=('_snps', '_base'))
unconfirmed = both[both['SampleIdInt_snps'] != both['SampleIdInt_base']]
more_than_one = snps.groupby('numeric_id').filter(lambda g: len(g) > 1)
I bet it won't work, but hopefully you get the idea.
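For illustration, here is a small self-contained sketch of the same merge idea on toy frames built from the sample IDs above (not the real 100k/54k data):

import pandas as pd

# toy frames using IDs from the question, purely for illustration
base = pd.DataFrame({"SampleIdInt": ["ARR4112234", "AFG4234122", "POL23523552"]})
snps = pd.DataFrame({"SampleIdInt": ["KWW4112234", "AFG4234122", "POL23523552"]})

# strip the three-letter prefix so the frames can be joined on the number alone
base["numeric_id"] = base["SampleIdInt"].str[3:]
snps["numeric_id"] = snps["SampleIdInt"].str[3:]

# exact matches: the whole SampleIdInt is identical in both frames
confirmed = snps.merge(base, on="SampleIdInt")

# same number, different prefix: join on the number, then drop the exact matches
both = snps.merge(base, on="numeric_id", suffixes=("_snps", "_base"))
unconfirmed = both[both["SampleIdInt_snps"] != both["SampleIdInt_base"]]

print(confirmed)
print(unconfirmed)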

Python / Pandas: Looping through a list of numbers

I am trying to create a loop involving Pandas/ Python and an Excel file. The column in question is named "ITERATION" and it has numbers ranging from 1 to 6. I'm trying to query the number of hits in the Excel file in the following iteration ranges:
1 to 2
3
4 to 6
I've already made a preset data frame named "df".
iteration_list = ["1,2", "3", "4,5,6"]
i = 1
for k in iteration_list:
    table = df.query('STATUS == ["Sold", "Refunded"]')
    table["ITERATION"] = table["ITERATION"].apply(str)
    table = table.query('ITERATION == ["%s"]' % k)
    table = pd.pivot_table(table, columns=["Month"], values=["ID"], aggfunc=len)
    table.to_excel(writer, startrow=i)
    i = i + 3
The snippet above works only for the number "3". The other 2 scenarios don't seem to work as it literally searches for the string "1,2". I've tried other ways such as:
iteration_list = [1:2, 3, 4:6]
iteration_list = [{1:2}, 3, {4:6}]
to no avail.
Does anyone have any suggestions?
EDIT
After looking over Stidgeon's answer, I came up with the following alternative. Stidgeon's answer DOES produce output, but not the one I'm looking for (it gives 6 outputs, one for each iteration from 1 to 6, in each loop).
Above, my list was the following:
iteration_list = ["1,2", "3", "4,5,6"]
If you play around with the quotation marks, you can input exactly what you want, since your string is substituted literally into this line where the %s is:
table = table.query('ITERATION == ["%s"]' % k)
You can essentially adjust the quotation marks in the list to fit your precise needs. Here is a list that works:
iteration_list = ['1", "2', '3', '4", "5", "6']
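To see why this works, substituting the first entry into the query template gives:

>>> 'ITERATION == ["%s"]' % '1", "2'
'ITERATION == ["1", "2"]'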
Just focusing on getting the values out of the list of strings, this works for me (though - as always - there may be more Pythonic approaches):
lst = ['1,2', '3', '4,5,6']
for item in lst:
    items = item.split(',')
    for _ in items:
        print int(_)
Though instead of printing at the end, you can pass the value to your script.
This will work if all your strings are either single numbers or numbers separated by commas. If the data aren't consistently formatted like that, you may have to tweak this code.
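One way to wire that splitting idea back into the pivot loop from the question might look like this sketch (df and writer are assumed to exist exactly as in the question, and ITERATION is assumed to hold plain integers):

import pandas as pd

iteration_list = ["1,2", "3", "4,5,6"]
start_row = 1
for group in iteration_list:
    # turn "1,2" into [1, 2] and compare against ITERATION as numbers
    wanted = [int(v) for v in group.split(",")]
    table = df.query('STATUS == ["Sold", "Refunded"]')
    table = table[table["ITERATION"].isin(wanted)]
    table = pd.pivot_table(table, columns=["Month"], values=["ID"], aggfunc=len)
    table.to_excel(writer, startrow=start_row)
    start_row += 3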

How to match fields from two lists and further filter based upon the values in subsequent fields?

EDIT: My question was answered on reddit. Here is the link if anyone is interested in the answer to this problem https://www.reddit.com/r/learnpython/comments/42ibhg/how_to_match_fields_from_two_lists_and_further/
I am attempting to get the pos and alt strings from file1 to match up with what is in file2, which is fairly simple. However, file2 also has values from the 17th split element/column to the last element/column (the 340th), containing strings such as 1/1:1.2.2:51:12, which I also want to filter on.
I want to extract the rows from file2 that contain/match the pos and alt from file1. Thereafter, I want to further filter the matched results so they only contain certain values in the 17th split element/column onwards. But to do so, the values would have to be split by ":", so I can filter for split[0] == "1/1" and split[2] > 50. The problem is I have no idea how to do this.
I imagine I will have to iterate over these and split, but I am not sure how, as the code is presently in a loop and the values I want to filter are in columns, not rows.
Any advice would be greatly appreciated; I have sat with this problem since Friday and have yet to find a solution.
import os, itertools, re

file1 = open("file1.txt", "r")
file2 = open("file2.txt", "r")

matched = []
for (x), (y) in itertools.product(file2, file1):
    if not x.startswith("#"):
        cells_y = y.split("\t")
        pos_y = cells_y[0]
        alt_y = cells_y[3]
        cells_x = x.split("\t")
        pos_x = cells_x[0] + ":" + cells_x[1]
        alt_x = cells_x[4]
        if pos_y in pos_x and alt_y in alt_x:
            matched.append(x)

for z in matched:
    cells_z = z.split("\t")
    if cells_z[16:len(cells_z)]:
Your requirement is not clear, but you might mean this:
for (x), (y) in itertools.product(file2, file1):
    if x.startswith("#"):
        continue
    cells_y = y.split("\t")
    pos_y = cells_y[0]
    alt_y = cells_y[3]
    cells_x = x.split("\t")
    pos_x = cells_x[0] + ":" + cells_x[1]
    alt_x = cells_x[4]
    if pos_y != pos_x: continue
    if alt_y != alt_x: continue
    extra_match = False
    # check the 17th column onwards of the file2 line for a "1/1:...:>50" entry
    for f in range(16, len(cells_x)):
        extra = cells_x[f].split(':')
        if extra[0] != '1/1': continue
        if int(extra[2]) <= 50: continue
        extra_match = True
        break
    if not extra_match: continue
    xy = x + y
    matched.append(xy)
I chose to concatenate x and y into the matched array, since I wasn't sure whether or not you would want all the data. If not, feel free to go back to just appending x or y.
You may want to look into the csv library, which can use tab as a delimiter. You can also use a generator and/or guards to make the code a bit more pythonic and efficient. I think your approach with indexes works pretty well, but it would be easy to break when trying to modify down the road, or to update if your file lines change shape. You may wish to create objects (I use NamedTuples in the last part) to represent your lines and make it much easier to read/refine down the road.
Lastly, remember that Python short-circuits conditions combined with 'and' in an 'if':
for example:
if x_evaluation and y_evaluation:
    do some stuff
when x_evaluation returns False, Python will skip y_evaluation entirely. In your code, cells_x[0]+":"+cells_x[1] is evaluated every single time you iterate the loop. Instead of storing this value, I wait until the easier alt comparison evaluates to True before doing this (comparatively) heavier/uglier check.
import csv

def filter_matching_alt_and_pos(first_file, second_file):
    for x in csv.reader(open(first_file, 'rb'), delimiter='\t'):
        for y in csv.reader(open(second_file, 'rb'), delimiter='\t'):
            # continue will skip the rest of this loop and go to the next value for y
            # this way, we can abort as soon as one value isn't what we want
            # .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
            if x[3] == y[4] and x[0] == ":".join(y[:2]):
                yield x

def match_datestamp_and_alt_and_pos(first_file, second_file):
    for z in filter_matching_alt_and_pos(first_file, second_file):
        for element in z[16:]:
            # I am not sure I fully understood your filter needs for the 2nd half. Here, I split each element from the 17th onward and look for the two cases you mentioned. This seems like it might be heavy, but at least we're using generators!
            parts = element.split(":")
            # same idea as before: the negative check plus continue aborts as early as possible to avoid needless indexing and checks
            # once again, I do the lighter check before the heavier one
            if not parts[0] == "1/1":
                continue
            # WARNING: if you aren't 100% sure the third piece is an int, this is very dangerous
            if not int(parts[2]) > 50:
                continue
            # one matching element is enough; yield the line once and move on
            yield z
            break

if __name__ == '__main__':
    first_file = "first.txt"
    second_file = "second.txt"
    # match_datestamp_and_alt_and_pos returns a generator; loop through it for the lines which matched all 4 cases
    for line in match_datestamp_and_alt_and_pos(first_file=first_file, second_file=second_file):
        print line
namedtuples for the first part
from collections import namedtuple

# only the fields we care about are named; the remaining columns are ignored
FirstFileElement = namedtuple("FirstFileElement", "pos unused1 unused2 alt")
SecondFileElement = namedtuple("SecondFileElement", "pos1 pos2 unused2 unused3 alt")

def filter_matching_alt_and_pos(first_file, second_file):
    for x in csv.reader(open(first_file, 'rb'), delimiter='\t'):
        for y in csv.reader(open(second_file, 'rb'), delimiter='\t'):
            # continue will skip the rest of this loop and go to the next value for y
            # this way, we can abort as soon as one value isn't what we want
            # .. todo:: we could make a filter function and even use the filter() built-in depending on needs!
            x_element = FirstFileElement(*x[:4])
            y_element = SecondFileElement(*y[:5])
            if x_element.alt == y_element.alt and x_element.pos == ":".join([y_element.pos1, y_element.pos2]):
                yield x
