Pulling Numerics Out of a Column Drops Numbers Right of the Decimal? - python

I have an initial column in a dataframe that contains several bits of information (the weight and count of items) that I am trying to pull out and do some calculations with.
When I pull out my desired numbers, everything looks fine if I print the variable I store the series in.
Below is my code for parsing the numbers out of the initial column. I just chained a few methods and used a regex to tease them out.
[Hopefully it is fairly easy to read: after some cleaning, my target weight numbers are always in the 3rd-to-last position after the split(), and my target count numbers are always in the 2nd-to-last position after the split.]
import numpy as np

weight = df['Item'].str.replace('1.0gal','128oz').str.replace('YYY','').str.split().str[-3].str.extract(r'(\d+)', expand=False).astype(np.float64)
count = df['Item'].str.replace('NN','').str.split().str[-2].replace('XX','1ct').str.extract(r'(\d+)', expand=False).astype(np.float64)
The variable weight returns a series like [32, 32, 0.44, 5.3, 64], and that is what I want to see.
HOWEVER, when I try to set these values into a new column in the dataframe, it leaves off everything to the right of the decimal place; for example, my new column shows up as [32, 32, 0, 5, 64].
This is throwing off my calculated columns as well.
However, if I do the math in a separate variable and print that out, it shows up right (decimals and all). But something about assigning it to the dataframe zeros out my weight and throws off every calculation thereafter.
Any and all help is greatly appreciated!

Cast the series values to string, then, after you insert the values into a DataFrame column, convert the column back to numeric. For example,
weight = weight.astype(str)
df['new_column'] = weight
df['new_column'] = pd.to_numeric(df['new_column'])
check out: Change column type in pandas
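A minimal sketch of that round trip, using the example values from the question (the toy Series and the column name new_column are just illustrative):
import pandas as pd

# toy stand-in for the extracted weight series
weight = pd.Series([32.0, 32.0, 0.44, 5.3, 64.0])

df = pd.DataFrame(index=weight.index)
df['new_column'] = weight.astype(str)               # insert as strings first
df['new_column'] = pd.to_numeric(df['new_column'])  # convert back to float64
print(df['new_column'].tolist())                    # [32.0, 32.0, 0.44, 5.3, 64.0]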


Python - Can't divide column values

I'm trying to divide each value of a specific column in the df_rent dataframe by simply accessing each value and dividing it by 1000. But it's returning an error and I cannot understand the reason.
The type of the column is float64.
for i in df_rent['distance_new']:
    df_rent[i] = df_rent[i] / 1000
    print(df_rent[i])
The error occurs because, if you loop over df_rent['distance_new'], i is assigned the value of the first cell in 'distance_new', then the second, then the third; it is not a pointer or index. What you should do is rather simple:
df_rent['distance_new'] /= 1000
In case someone doesn't understand: the /= operator takes the LHS value, divides it by the RHS, then replaces the LHS value with the result. The LHS can be an int, a float, or, in this case, a whole column. This solution also works on multiple columns if you slice them correctly; it will look something like
df_rent.loc[:,['distance_new','other_col_1','other_col2']] /= 1000
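To see the vectorized version in action, here is a minimal sketch on a hypothetical frame standing in for df_rent:
import pandas as pd

df_rent = pd.DataFrame({"distance_new": [2300.0, 4500.0, 6700.0]})
df_rent["distance_new"] /= 1000           # one vectorized operation over the whole column
print(df_rent["distance_new"].tolist())   # [2.3, 4.5, 6.7]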
i contains the values iterated over in df_rent['distance_new'].
E.g., if df_rent['distance_new'] = [23, 45, 67],
i iterates over the value 23, then 45, then 67;
i does not simplify to an index (see the quick demo below).
Also note that you cannot assign to a list position that doesn't exist yet; lists can only modify existing positions, so you would need to append first, or use a dictionary, where assignment creates the key.
Also, post the values of a sample array so you can communicate your problem better.
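A quick demo of that point, using the example values above (a hypothetical stand-in for the real column):
import pandas as pd

s = pd.Series([23, 45, 67], name="distance_new")
for i in s:
    print(i)   # prints the cell values 23, 45, 67 -- never row labels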

Find Sign when Sign Changes in Pandas Column while Ignoring Zeros using Vectorization

I'm trying to find a vectorized way of determining the first instance where my column of data has a sign change. I looked at this question and it gets close to what I want, except it evaluates my first zeros as true. I'm open to different solutions including changing how the data is set up in the first place. I'll detail what I'm doing below.
I have two columns, let's call them positive and negative, that look at a third column. The third column has values ranging between [-5, 5]. When this column is in [3, 5], my positive column gets a +1 on that same row; all other rows get 0 in that column. Likewise, when the third column is in [-5, -3], my negative column gets a -1 in that row; all other rows get 0.
I combine these columns into one column. You can conceptualize this as 'turn machine on, keep it on/off, turn it off, keep it on/off, turn machine on ... etc.' The problem I'm having is that my combined column looks something like below:
pos = [1,1,1,0, 0, 0,0,0,0,0,1, 0,1]
neg = [0,0,0,0,-1,-1,0,0,0,0,0,-1,0]
com = [1,1,1,0,-1,-1,0,0,0,0,1,-1,1]
# Below is what I want to have as the final column.
cor = [1,0,0,0,-1, 0,0,0,0,0,1,-1,1]
The problem with what I've linked is that it gets close, but it evaluates the first 0 as a sign change as well. Zeros should be ignored; I tried a few things but seem to be creating new errors. For the sake of completeness, this is what the linked code outputs:
lnk = [True,False,False,True,True,False,True,False,False,False,True,True,True]
As you can see, it handles the 1 and -1 not flipping fine, but it flips on the zeros. I'm not sure if I should change how the combined column is made, change the logic for the creation of the component columns, or both. The big thing is that I need to vectorize this code for performance reasons.
Any help would be greatly appreciated!
Let's suppose your dataframe is named df with columns pos and neg; then you can try something like the following:
df.loc[:, "switch_pos"] = (np.diff(df.pos, prepend=0) > 0)*1
df.loc[:, "switch_neg"] = (np.diff(df.neg, prepend=0) < 0)*(-1)
You can then combine your two switch columns.
Explanations
np.diff gives you the row-by-row difference, which (for the pos column) is 1 for a 0 to 1 transition and -1 for a 1 to 0 transition. Considering your desired output, you only want to keep the 0 to 1 transitions, which is why you keep only the greater-than-zero output. For the neg column the logic mirrors this: a switch-on is a 0 to -1 drop, so you keep the less-than-zero differences.
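Putting it together, here is a minimal end-to-end sketch using the pos/neg arrays from the question (column names are illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "pos": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
    "neg": [0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, -1, 0],
})
df["switch_pos"] = (np.diff(df.pos, prepend=0) > 0) * 1     # keep 0 -> 1 transitions
df["switch_neg"] = (np.diff(df.neg, prepend=0) < 0) * (-1)  # keep 0 -> -1 transitions
df["combined"] = df.switch_pos + df.switch_neg
print(df["combined"].tolist())
# [1, 0, 0, 0, -1, 0, 0, 0, 0, 0, 1, -1, 1]  -- matches the desired 'cor'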

How to iterate columns with control statement?

I have the following code right now:
import pandas as pd
df_area=pd.DataFrame({"area":["Coesfeld","Recklinghausen"],"factor":[2,5]})
df_timeseries=pd.DataFrame({"Coesfeld":[1000,2000,3000,4000],"Recklinghausen":[2000,5000,6000,7000]})
columns_in_timeseries=list(df_timeseries)
columns_to_iterate=columns_in_timeseries[0:]
newlist=[]
for i, k in enumerate(columns_to_iterate):
    new = df_area.loc[i, "factor"] * df_timeseries[k]
    newlist.append(new)
newframe=pd.DataFrame(newlist)
df1_transposed = newframe.T
The code multiplies each factor from an area with the time series for that area. In this example the code iterates over the rows and columns immediately after multiplying. In the next step I want to expand the df_area dataframe like the following:
df_area=pd.DataFrame({"area":["Coesfeld","Coesfeld","Recklinghausen","Recklinghausen"],"factor":[2,3,5,6]})
As you can see, I have different factors for the same area. The goal is to advance to the next column in df_timeseries only when the area in df_area changes. My first instinct is to use an if statement, but right now I have no idea how to realize that with the for loop.
I can't shake off the suspicion that there is something wrong about your whole approach. A first red flag is your use of wide format instead of long format – in my experience, that's probably going to cause you unnecessary trouble.
Be that as it may, here's a function that takes a data frame with time series data and a second data frame with multiplier values and area names as arguments. The two data frames use the same structure as your examples df_timeseries (area names as columns, time series values as cell values) and df_area (area names as values in the column area, multipliers as values in the column factor). I'm pretty sure that this is not a good way to organize your data, but that's up to you to decide.
The function iterates through the rows of the second data frame (the df_area-like one). It uses the area value to select the correct series from the first data frame (the df_timeseries-like one) and multiplies this series by the factor value from that row. Each result becomes an element of the returned list, built with a list comprehension.
def do_magic(df1, df2):
    return [df1[area] * factor for area, factor in zip(df2.area, df2.factor)]
You can insert this directly into your code to replace your loop:
df_area = pd.DataFrame({"area": ["Coesfeld", "Recklinghausen"],
"factor": [2, 5]})
df_timeseries = pd.DataFrame({"Coesfeld": [1000, 2000, 3000, 4000],
"Recklinghausen": [2000, 5000, 6000, 7000]})
newlist = do_magic(df_timeseries, df_area)
newframe = pd.DataFrame(newlist)
df1_transposed = newframe.T
It also works with your expanded df_area. The resulting list will consist of four series (two for Coesfeld, two for Recklinghausen).
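As a sketch, the expanded case would look like this (values taken straight from the question):
df_area = pd.DataFrame({"area": ["Coesfeld", "Coesfeld", "Recklinghausen", "Recklinghausen"],
                        "factor": [2, 3, 5, 6]})
newlist = do_magic(df_timeseries, df_area)
newframe = pd.DataFrame(newlist)
df1_transposed = newframe.T   # four columns: Coesfeld x2, Coesfeld x3, Recklinghausen x5, Recklinghausen x6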

Is there a way to replace ranged data (e.g. 18-25) by its mean in a dataframe?

I have the Black Friday dataset.
Here is how it looks (image not shown).
The Age is given in ranges like 1-17, 18-25, etc. I want to replace all such ranges by their mean. I could traverse each element of the Age column, parse the string, and replace it with the mean, but that would probably be inefficient.
So I want to know: is there any shorter way to do that, or is there any alternative way to process ranged data? (In Python, of course.)
There are several ways to transform this variable. In the picture I can see that there are not only bins but also the value '55+', which needs to be considered.
1) One-liner:
df['age'].apply(lambda x: np.mean([int(x.split('-')[0]), int(x.split('-')[1])]) if '+' not in x else x[:-1])
It checks whether the value contains '+' (like 55+); if so, the value without the '+' is returned. Otherwise the bin is split into two values, they are converted to ints, and their mean is calculated.
2) Using dictionary for transformation:
mapping = {'1-17': 9, '18-25': 21.5, '55+': 55}
df['age'].apply(lambda x: mapping[x])
You need to add all values to the mapping dictionary (calculating the means manually or automatically). Then you apply this transformation to the series.
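For completeness, here is a minimal self-contained sketch combining both ideas, with hypothetical Age values standing in for the real dataset:
import numpy as np
import pandas as pd

age = pd.Series(['1-17', '18-25', '55+', '18-25'])

def bin_to_mean(x):
    if '+' in x:
        return float(x.rstrip('+'))        # treat '55+' as 55
    lo, hi = x.split('-')
    return np.mean([int(lo), int(hi)])     # midpoint of the bin

print(age.apply(bin_to_mean).tolist())     # [9.0, 21.5, 55.0, 21.5]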

Trying to divide a dataframe column by a float yields NaN

Background
I deal with a CSV datasheet that prints out columns of numbers. I am working on a program that takes the first column, asks the user for a time as a float (i.e., 45 and a half hours = 45.5), and then subtracts that number from the first column. I have been successful in that regard. Now I need to find the row index of the "zero" time point. I use min to find that index and then use it to read off the following column, A1.1. I need to find the reading at time 0 so I can normalize column A1.1 to it, so that on a graph the reading at the 0 time point is 1 in column A1.1 (and eventually all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1']= df['A1']-time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line correctly identifies a series that I can pull from for all my other columns. Next, r1 correctly identifies the proper A1.1 value, and this value appears to be a float when I use type(r1).
However, when I divide df[' A1.1'] by r1, it yields only one correct value, the one where r1/r1 = 1. All other values come out NaN.
My questions:
How do I divide a column by a float, and why am I getting NaN?
Is there a faster way to do this, as I need to do it for 16 columns (i.e. A2/r2, A3/r3, etc.)?
Do I need to use inplace=True anywhere to make the operations stick prior to resaving the data, or is that only for adding/deleting rows?
Example
A dataframe that looks like this: http://i.imgur.com/ObUzY7p.png
Zero time sets properly (image not shown).
After dividing the column: http://i.imgur.com/TpLUiyE.png
This should work:
df['A1.1']=df['A1.1']/df['A1.1'].min()
I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work is that r1 is a Series. Try r1? (in IPython) instead of type(r1), and pandas will tell you that r1 is a Series, not an individual float.
To do it in one attempt, you have to iterate over each column, like this:
for c in df:
    df[c] = df[c]/df[c].min()
If you want to divide every value in the column by r1, you can use apply, for example:
import pandas as pd
df = pd.DataFrame([1, 2, 3, 4, 5])
# apply an anonymous function to the first column ([0]): divide every value
# in the column by 3
df = df[0].apply(lambda x: x/3.0)
print(df)
So you'd probably want something like this:
df = df["A1.1"].apply(lambda x: x/r1)
This really only answers part 2 of your question. apply is probably your best bet for running a function over many rows and columns quickly. As for why you're getting NaNs when dividing by a float: is it possible the values in your columns are anything other than floats or integers?
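For what it's worth, the NaNs are exactly what you'd see if r1 were a one-row Series rather than a scalar: pandas aligns Series/Series division on the index, so only the row whose label matches survives. A minimal sketch of the scalar fix, with hypothetical data standing in for the CSV and the column names from the question:
import pandas as pd

df = pd.DataFrame({'A1': [-2.0, 0.0, 3.0], ' A1.1': [0.5, 1.0, 2.0]})

zero_row = df[df['A1'] == df['A1'].min()]
r1 = zero_row[' A1.1'].iloc[0]    # .iloc[0] extracts a scalar float, not a one-row Series
df[' A1.1'] = df[' A1.1'] / r1    # scalar division broadcasts to every row
print(df[' A1.1'].tolist())       # [1.0, 2.0, 4.0]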
