Hello, I'm fairly new to Python and I'm having problems with a table I scraped using Selenium. Note: the table had merged rows, which is why I created column 'c' as a copy of 'b' for later treatment.
As you can see in the image, in certain rows ([2, 11, 15, 27]) the columns are shifted. Basically, what I need to do is shift the values in the columns from 'd' onward one cell to the left, in only those rows.
I think I am pretty close to it, but I get an error. So far here's the code I have:
# Identify the rows to shift
list_channels = [str(row["a"]) for index, row in table.iterrows() if len(str(row["a"])) > 2 and str(row["a"])[2] == "-"]
rows_to_shift = table[table["a"].isin(list_channels)].index
# Assign shifted columns
table.iloc[rows_to_shift, 2:] = table.iloc[rows_to_shift, 2:].shift(-1, axis=1) # ValueError
What baffles me is that the right-hand side of the assignment above (table.iloc[rows_to_shift, 2:].shift(-1, axis=1)) shows a 4-row dataframe with indexes [2, 11, 15, 27] and the columns from 'd' onward laid out exactly the way I want, but when I perform the assignment I get the error: ValueError: shape mismatch: value array of shape (4,28) could not be broadcast to indexing result of shape (4,1)
OK, now here is where it gets a bit weird: that last assignment stopped execution and showed me the error, but if I now print the table, it seems to be right (!?). The table was indeed modified and the values are now where intended.
This script is intended for a high-profile group of users (senior executives), so I can't just ignore the error and force the program to continue, because I can't risk the script crashing when they use it.
Can anyone point me in the right direction? I've been trying to solve this for days and I'm starting to go in circles.
EDIT: I've since found that even table.iloc[rows_to_shift, 2:] = table.iloc[rows_to_shift, 2:] gives an error, and now I'm completely lost.
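For anyone hitting the same wall: a common workaround for this kind of shape-mismatch error is to assign the underlying NumPy array, so pandas cannot attempt label alignment during the assignment. A minimal sketch; the column names and the '-' rule below are placeholders for the real data:
import pandas as pd

table = pd.DataFrame({
    "a": ["10-x", "20", "30-y"],
    "b": ["b0", "b1", "b2"],
    "c": ["c0", "c1", "c2"],
    "d": ["d0", "d1", "d2"],
    "e": ["e0", "e1", "e2"],
})

# Rows whose 'a' value has '-' in the third position are the shifted ones.
rows_to_shift = table.index[table["a"].str.slice(2, 3) == "-"]

# .to_numpy() hands pandas raw values, sidestepping the label alignment
# that usually raises the shape-mismatch ValueError on this assignment.
cols = table.columns[2:]
table.loc[rows_to_shift, cols] = (
    table.loc[rows_to_shift, cols].shift(-1, axis=1).to_numpy()
)
print(table)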
Related
I am trying to get some metrics on some data at my company.
Basically, I have this dataframe that I have titled rawData.
rawData contains a number of columns, mostly parameters I am interested in. The specifics are not too important, I don't think, so we can just think of these as parameter1, parameter2, and so on.
There is an additional column, which I have titled overallResult. This column will always contain either the string PASS, or FAIL. I am trying to extract a sub-dataframe from my raw data based on the overallResult. It sounds simple enough, but I am messing up my implementation somehow.
I make my mask like this:
mask = rawData[overallResult].eq(truthyVal), where in this case truthyVal is 'PASS'
The mask is created successfully, but when I apply it like this:
filteredData = rawData[mask]
expecting filteredData to contain everything rawData does, but only the rows where truthyVal appears, it always gives me this error: cannot reindex on an axis with duplicate labels.
From what I understand, the mask contains a boolean list of my overallResult column, true if truthyVal is found on that row, and false if not. I am pretty sure that I am not applying my mask correctly here. There must be some small extra step I am overlooking, and at this point I am frustrated because it seems so simple, so IDK, any ideas?
You have the principle correct as the following basic example shows:
import pandas as pd
df = pd.DataFrame({'data': [1, 2, 3, 4, 5, 6],
                   'test': ['pass', 'fail', 'pass', 'fail', 'pass', 'fail']})
mask = df['test'].eq('pass')
print(df[mask])
To decipher your error message it would be necessary to see a data sample that produces it; you might get some useful insights here.
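Only a guess without a data sample, but that particular message usually means the index of rawData (or of whatever the mask was built from) contains duplicate labels, so pandas fails when it tries to realign the mask. A quick check and a common workaround, sketched on a toy frame:
import pandas as pd

# Toy frame with a duplicated index label; note the labels 0, 0, 1.
rawData = pd.DataFrame({'overallResult': ['PASS', 'FAIL', 'PASS']},
                       index=[0, 0, 1])
print(rawData.index.has_duplicates)  # True would explain the error

# Giving every row a unique label avoids any reindexing against
# duplicates when the mask is applied.
rawData = rawData.reset_index(drop=True)
mask = rawData['overallResult'].eq('PASS')
filteredData = rawData[mask]
print(filteredData)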
I have an arbitrary number of columns to check the condition as to whether any of them is equal to 1, then I want to create a new column based on the results. I want to do something along the lines of how-to-test-multiple-columns-of-pandas-for-a-condition-at-once-and-update-them:
cols = ['col_1', ..., 'col_n']
test['col_n+1'] = np.where(test[cols] > 0, 1, 0)
However, when I run this, I get an error of:
ValueError: Wrong number of items passed 5, placement implies 1
I understand why this is being thrown, but cannot find a pythonic way of doing this (I'm able to iterate through the dataframe and individually evaluate each column, etc., but the code is ugly)
import pandas as pd

test = pd.DataFrame({'col1': [10, 20, 30, 40], 'col2': [5, 10, 15, 20], 'col3': [6, 12, 18, 24]})
col = ['col2', 'col3']
# Flag rows where any of the selected columns has a value greater than 19
test['test'] = test[col].gt(19).any(axis=1).astype(int)
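If you'd rather stay close to the np.where form from the question, collapsing the column-wise test with .any(axis=1) first works just as well:
import numpy as np

# Same result: 1 where any selected column exceeds 19, else 0
test['test'] = np.where(test[col].gt(19).any(axis=1), 1, 0)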
I'm trying to find a vectorized way of determining the first instance where my column of data has a sign change. I looked at this question and it gets close to what I want, except it evaluates my first zeros as true. I'm open to different solutions including changing how the data is set up in the first place. I'll detail what I'm doing below.
I have two columns, let's call them positive and negative, that look at a third column. The third column has values ranging between [-5, 5]. When this column is [3, 5], my positive column gets a +1 on that same row; all other rows are 0 in that column. Likewise, when the third column is between [-5, -3], my negative column gets a -1 in that row; all other rows are 0.
I combine these columns into one column. You can conceptualize this as 'turn machine on, keep it on/off, turn it off, keep it on/off, turn machine on ... etc.' The problem I'm having is that my combined column looks something like below:
pos = [1,1,1,0, 0, 0,0,0,0,0,1, 0,1]
neg = [0,0,0,0,-1,-1,0,0,0,0,0,-1,0]
com = [1,1,1,0,-1,-1,0,0,0,0,1,-1,1]
# Below is what I want to have as the final column.
cor = [1,0,0,0,-1, 0,0,0,0,0,1,-1,1]
The problem with what I've linked is that it gets close, but it evaluates the first 0 as a sign change as well. 0s should be ignored, and the things I've tried only seem to create new errors. For the sake of completeness, this is what the linked code outputs:
lnk = [True,False,False,True,True,False,True,False,False,False,True,True,True]
As you can see, it handles the 1 and -1 values fine (no flip there), but it flips on the zeros. I'm not sure whether I should change how the combined column is made, change the logic that creates the component columns, or both. The big thing is that I need this code vectorized for performance reasons.
Any help would be greatly appreciated!
Let's suppose your dataframe is named df with columns pos and neg; then you can try something like the following:
df.loc[:, "switch_pos"] = (np.diff(df.pos, prepend=0) > 0)*1
df.loc[:, "switch_neg"] = (np.diff(df.neg, prepend=0) > 0)*(-1)
You can then combine your two switch columns, e.g. by summing them.
Explanation
np.diff gives you the row-by-row difference: for the pos column it yields 1 for a 0-to-1 transition and -1 for a 1-to-0 transition. Given your desired output, you only want the 0-to-1 transitions, which is why you keep only the greater-than-zero values (and, symmetrically, only the less-than-zero values for neg).
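Putting it together with the arrays from the question (df is assumed to carry them as columns pos and neg):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pos": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    "neg": [0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, -1, 0],
})
df["switch_pos"] = (np.diff(df.pos, prepend=0) > 0) * 1
df["switch_neg"] = (np.diff(df.neg, prepend=0) < 0) * (-1)

# Summing the two switch columns reproduces the desired 'cor' column:
print((df["switch_pos"] + df["switch_neg"]).tolist())
# [1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, -1, 1]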
I have a dataframe containing two columns: one filled with strings (irrelevant here), and one whose cells each hold (a reference to) a dataframe.
Now I want to keep only the rows where the dataframe in the second column has entries, i.e. len(df.index) > 0 (there should be rows left; I don't care about columns).
I know that sorting out rows like this works perfectly fine for me if I use it in a list comprehension and can do it on every entry by its own, like in the following example:
[do_x for a, inner_df
in zip(outer_df.index, outer_df["inner"])
if len(inner_df.index) > 0]
But if I try using it for conditional indexing to create a shorter version of the dataframe, it will produce the error KeyError: True.
I thought that putting len() around it could be a problem, so I also tried different approaches to check for zero rows. In the following I show four examples of how I tried it:
# a) with the length of the index
outer_df = outer_df.loc[len(outer_df["inner"].index) > 0, :]
# b) same, but with a lambda, just like in the pandas docs user guide
# (I used it on the other versions too, with no change in result)
outer_df = outer_df.loc[lambda df: len(df["inner"]) > 0, :]
# c) switching
outer_df = outer_df.loc[outer_df["inner"].index.size > 0, :]
# d) even "shorter" version
outer_df = outer_df.loc[not outer_df["inner"].empty, :]
So... where is my error and can I even do it with conditional indexing or do I need to find another way?
Edit: Changed and added some sentences above for more clarity, plus added everything below.
I know that the filtering here works by creating a boolean Series the same length as the dataframe (True/False resulting from a comparison) and keeping only the rows that contain a True.
I do not, however, see a fundamental difference between my attempt to create such a list and the following examples (source: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/):
# 1. difference: the resulting Series is *not* altered
# it just gets compared directly, here with the value 80
# -> I thought this might be the problem, but then there is also #2
df = df[df['Percentage'] > 80]
# or
df = df.loc[df['Percentage'] > 80]
# 2. Here the entry is checked in a similar way to my c and d
options = ['x', 'y']
df = df[df['Stream'].isin(options)]
# or
df = df.loc[df['Stream'].isin(options)]
In both number 2 here and my versions c and d, the entry in the cell (a string / a dataframe) is checked for something (membership in a list / emptiness).
Not sure if I understand your question or where you are stuck; however, I will just write my comment in this answer so that I can easily edit the post.
First, let's try typing in myvar = df['Percentage'] > 80 and see what myvar is. See if the content of myvar makes sense to you.
There is really only one true rule of .loc[]: the TRUTH TABLE, i.e. a boolean Series with one True/False per row.
Because the df[stuff] expression so often appears inside .loc[ df[stuff] expression ], you might get the impression that df[stuff] has some special meaning. For example, df[df['Percentage'] > 80] asks for any Percentage greater than 80, which looks quite intuitive! So... df['Percentage'] > 80 must be a "special syntax"? In reality, df['Percentage'] > 80 isn't anything special; it is just another truth table. The stuff expression will always be a truth table, that's it.
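Applied to the nested-dataframe question above, the truth table has to be built per cell, since len() of the whole column can never be a boolean Series. A sketch with .map(), assuming the column names from that question:
import pandas as pd

outer_df = pd.DataFrame({
    "label": ["x", "y", "z"],
    "inner": [pd.DataFrame({"a": [1]}), pd.DataFrame(), pd.DataFrame({"a": [2, 3]})],
})

# One True/False per row: the truth table that .loc[] expects.
mask = outer_df["inner"].map(lambda d: len(d.index) > 0)
outer_df = outer_df.loc[mask, :]
print(outer_df["label"].tolist())  # ['x', 'z']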
I have an initial column in a dataframe that contains several bits of information (weight and count of items) that I am trying to pull out and do some calculations with.
When I pull out my desired numbers everything looks fine if I print out the variable I store the series in.
Below is my code for how I am parsing out my numbers from the initial column. I just stacked a few methods and used regex to tease it out.
[Hopefully it is fairly easy to read: after some cleaning, my target weight numbers are always in the 3rd-to-last position after the split() // and my target count numbers are always in the 2nd-to-last position after the split]
weight = df['Item'].str.replace('1.0gal','128oz').str.replace('YYY','').str.split().str[-3].str.extract(r'(\d+)', expand=False).astype(np.float64)
count = df['Item'].str.replace('NN','').str.split().str[-2].replace('XX','1ct').str.extract(r'(\d+)', expand=False).astype(np.float64)
Variable 'weight' returns a series like [32, 32, 0.44, 5.3, 64] and that is what I want to see.
HOWEVER, when I try to set these values into a new column in the dataframe, it leaves off everything to the right of the decimal point; for example, my new column shows up as [32, 32, 0, 5, 64].
This is throwing off my calculated columns as well.
However, if I do the math in a separate variable and print that out, it shows up right (decimals and all). But something about assigning it to the dataframe zeros out my weight and screws up any calculations thereafter.
Any and all help is greatly appreciated!
Cast the series values to string; then, after you insert the values into a DataFrame column, convert the column to numeric. For example:
weight = weight.astype(str)
df['new_column'] = weight
df['new_column'] = pd.to_numeric(df['new_column'])
Check out: Change column type in pandas
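A self-contained version of that round-trip, with made-up numbers mirroring the series in the question:
import pandas as pd

weight = pd.Series([32.0, 32.0, 0.44, 5.3, 64.0])
df = pd.DataFrame(index=weight.index)

# String round-trip, then let pandas re-infer a numeric dtype.
df['new_column'] = weight.astype(str)
df['new_column'] = pd.to_numeric(df['new_column'])
print(df['new_column'].tolist())  # [32.0, 32.0, 0.44, 5.3, 64.0]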