Custom column on dataframe based on other columns - python

I have a dataframe as seen below:
df =

   class
       0
       1
       1
     ...
       0

The dataframe has 113,269 rows, of which 46,337 are 1s and 66,932 are 0s.
What I would like to do is create a feature named id filled with random numbers from 0 to 50.
As a result, each id will have some 0s and some 1s assigned to it.
Through tests I noticed that each id ends up with a distribution of 1s and 0s similar to that of the whole dataset (zeros/ones = 1.444).
What I want is to be able to change this ratio manually for as many ids (clients) as possible.
Any ideas?
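One possible approach (a rough sketch, not from the question; the column names class and id, the random seed, the chosen id 7 and the target ratio are all assumptions for illustration) is to assign the ids with numpy and then reassign rows for a chosen id until it reaches the desired zeros/ones ratio:

import numpy as np
import pandas as pd

# Illustrative data standing in for the real class column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"class": rng.integers(0, 2, size=1000)})

# Assign each row a random id in [0, 50).
df["id"] = rng.integers(0, 50, size=len(df))

# With uniform random assignment, each id tends to mirror the overall zeros/ones ratio.
ratio = df.groupby("id")["class"].apply(lambda s: (s == 0).sum() / max((s == 1).sum(), 1))
print(ratio.describe())

# To force a different ratio for one id (here id 7 and a target of 1.0, both arbitrary),
# push the surplus class-0 rows of that id back into the random pool.
target_id, target_ratio = 7, 1.0
grp = df[df["id"] == target_id]
n_ones = (grp["class"] == 1).sum()
surplus = grp[grp["class"] == 0].index[int(target_ratio * n_ones):]
df.loc[surplus, "id"] = rng.integers(0, 50, size=len(surplus))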

Related

Compare two dataframe columns with binary data

I have two columns with binary data (1s and 0s) and I want to check the percent similarity between one column and the other. Obviously, as they are binary, it is important that the coincidence is based on the position of each cell, not on the global amount of 0s and 1s. For example:
column_1 column_2
0 1
1 1
0 0
1 0
In that case, both columns contain the same number of 0s and 1s (which would suggest a 100% coincidence); however, taking into account the order or position of each value, there is only a 50% coincidence. That last statement is the one I'm trying to figure out.
I know I could do it with a loop... however in case of larger lists that could be a problem.
This builds a boolean vector that is True where column_1 equals column_2 and False elsewhere, sums it up, and divides by the number of samples:
sim = sum( df.column_1 == df.column_2 ) / len(df.column_1)
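On the sample above this gives 0.5, i.e. a 50% match; a minimal, self-contained illustration:

import pandas as pd

# Reproduce the example columns, which match in 2 of 4 positions.
df = pd.DataFrame({"column_1": [0, 1, 0, 1],
                   "column_2": [1, 1, 0, 0]})

sim = sum(df.column_1 == df.column_2) / len(df.column_1)
print(sim)  # 0.5 -- equivalently, (df.column_1 == df.column_2).mean()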

Last member of each element of an id list of indices in relational dataset

Suppose I have two datasets in python: households and people (individuals). A key or id (int64) connects a household with one or more individuals. I want to create a binary variable called "last_member" that takes a value of 0 if there are more individuals in the same household, and 1 if this individual is the last member of the household.
A trivial example would be the following:
last_member id ...
0 1 ...
0 1 ...
1 1 ...
1 2 ...
0 3 ...
1 3 ...
...
I can get the number of unique ids either from the households dataset or from the individuals dataset itself.
I have a feeling that either numpy's where function or pandas' aggregate functions are strong candidates for such a solution. Still, I can't wrap my head around an efficient approach that does not involve, say, looping over the list of indices.
I coded a function that runs efficiently and solves the problem. The idea is to create the variable "last_member" full of zeros. This variable lets us compute the number of members per id using pandas' groupby. Then we compute the cumulative sum (minus 1, because of python's indexing) to find the indices where we would like to change the values of the "last_member" variable to 1.
import numpy as np
import pandas as pd

def create_last_member_variable(data):
    """Creates a last_member variable based on the index of the id variable."""
    # Assumes the frame is sorted by id and uses the default RangeIndex.
    data["last_member"] = 0
    n_members = data.groupby(["id"]).count()["last_member"]
    row_idx = np.cumsum(n_members) - 1
    data.loc[row_idx, "last_member"] = 1
    return data
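A quick check on the trivial example from the question (assuming, as the function does, a frame sorted by id with a default integer index):

data = pd.DataFrame({"id": [1, 1, 1, 2, 3, 3]})
print(create_last_member_variable(data)["last_member"].tolist())
# [0, 0, 1, 1, 0, 1]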

Balance dataset using pandas

This is for a machine learning program.
I am working with a dataset that has a csv which contains an id, for a .tif image in another directory, and a label, 1 or 0. There are 220,025 rows in the csv. I have loaded this csv as a pandas dataframe. Currently in the dataframe, there are 220,025 rows, with 130,908 rows with label 0 and 89,117 rows with label 1.
There are 41,791 more rows with label 0 than label 1. I want to randomly drop the extra rows with label 0. After that, I want to decrease the sample size from 178,234 to just 50,000, with 25,000 ids for each label.
Another approach might be to randomly drop 105,908 rows with label 0 and 64,117 rows with label 1.
How can I do this using pandas?
I have already looked at using .groupby followed by .sample, but that drops an equal number of rows from both labels, while I only want to drop rows from one label.
Sample of the csv:
id,label
f38a6374c348f90b587e046aac6079959adf3835,0
c18f2d887b7ae4f6742ee445113fa1aef383ed77,1
755db6279dae599ebb4d39a9123cce439965282d,0
bc3f0c64fb968ff4a8bd33af6971ecae77c75e08,0
068aba587a4950175d04c680d38943fd488d6a9d,0
acfe80838488fae3c89bd21ade75be5c34e66be7,0
a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da,1
7f6ccae485af121e0b6ee733022e226ee6b0c65f,1
559e55a64c9ba828f700e948f6886f4cea919261,0
8eaaa7a400aa79d36c2440a4aa101cc14256cda4,0
Personally, I would break it up into the following steps:
Since you have more 0s than 1s, we're first going to ensure that we even out the number of each. Here, I'm using the sample data you pasted in as df
Count the number of 1s (since this is our smaller value)
ones_subset = df.loc[df["label"] == 1, :]
number_of_1s = len(ones_subset)
print(number_of_1s)
3
Sample only the zeros to match number_of_1s
zeros_subset = df.loc[df["label"] == 0, :]
sampled_zeros = zeros_subset.sample(number_of_1s)
print(sampled_zeros)
Stick these 2 chunks (all of the 1s from our ones_subset and our matched sampled_zeros) together to make one clean dataframe that has an equal number of 1 and 0 labels
clean_df = pd.concat([ones_subset, sampled_zeros], ignore_index=True)
print(clean_df)
id label
0 c18f2d887b7ae4f6742ee445113fa1aef383ed77 1
1 a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da 1
2 7f6ccae485af121e0b6ee733022e226ee6b0c65f 1
3 559e55a64c9ba828f700e948f6886f4cea919261 0
4 f38a6374c348f90b587e046aac6079959adf3835 0
5 068aba587a4950175d04c680d38943fd488d6a9d 0
Now that we have a cleaned up dataset, we can proceed with the last step:
Use the groupby(...).sample(...) approach you mentioned to further downsample this dataset, taking it from a dataset that has 3 of each label (three 1s and three 0s) to a smaller matched size (two 1s and two 0s)
downsampled_df = clean_df.groupby("label").sample(2)
print(downsampled_df)
id label
4 f38a6374c348f90b587e046aac6079959adf3835 0
5 068aba587a4950175d04c680d38943fd488d6a9d 0
1 a24ce148f6ffa7ef8eefb4efb12ebffe8dd700da 1
0 c18f2d887b7ae4f6742ee445113fa1aef383ed77 1
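At full scale the same two steps apply. A hedged sketch of the full-size version, assuming the whole csv has already been loaded as df and a pandas version that supports groupby().sample() (1.1 or later):

import pandas as pd

# df is the full dataframe loaded from the csv (220,025 rows).
# Step 1: downsample label 0 to match the 89,117 rows with label 1.
ones = df[df["label"] == 1]
zeros = df[df["label"] == 0].sample(len(ones), random_state=0)
balanced = pd.concat([ones, zeros], ignore_index=True)

# Step 2: take 25,000 rows per label, for a final size of 50,000.
final = balanced.groupby("label").sample(25000, random_state=0)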

Pivot across multiple columns with repeating values in each column

I am trying to pivot a pandas dataframe, but the data is following a strange format that I cannot seem to pivot. The data is structured as below:
Date, Location, Action1, Quantity1, Action2, Quantity2, ... ActionN, QuantityN
<date> 1 Lights 10 CFloor 1 ... Null Null
<date2> 2 CFloor 2 CWalls 4 ... CBasement 15
<date3> 2 CWalls 7 CBasement 4 ... NUll Null
Essentially, each action will always have a quantity attached to it (which may be 0), but null actions will never have a quantity (the quantity will just be null). The format I am trying to achieve is the following:
Lights CFloor CBasement CWalls
1 10 1 0 0
2 0 2 19 11
The index of the rows becomes the location, while the columns become any unique action found across the multiple activity columns. When pulling the data together, the value of each row/column is the sum of each quantity associated with the action (i.e. Action1 corresponds to Quantity1). Is there a way to do this with the native pandas pivot function?
My current code performs a ravel across all the activity columns to get a list of all unique activities. It will also grab all the unique locations from the Location column. Once I have the unique columns, I create an empty dataframe and fill it with zeros:
Lights CFloor CBasement CWalls
1 0 0 0 0
2 0 0 0 0
I then iterate back over the old data frame with the itertuples() method (I was told it was significantly faster than iterrows()) and populate the new dataframe. This empty dataframe acts as a template that is stored in memory and filled later.
import pandas as pd

#Creates a template from the dataframe
def create_template(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    activities = df[act_cols]
    flat_acts = activities.values.ravel('K')
    unique_locations = pd.unique(df['Location'])
    unique_acts = pd.unique(flat_acts)
    pivot_template = pd.DataFrame(index=unique_locations, columns=unique_acts).fillna(0)
    return pivot_template

#Fills the template from the dataframe
def create_pivot(df, pivot_frmt):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    for row in df.itertuples():
        for act, quantity in zip(act_cols, quant_cols):
            act_val = getattr(row, act)
            if pd.notna(act_val):
                quantity_val = getattr(row, quantity)
                location = getattr(row, 'Location')
                pivot_frmt.loc[location, act_val] += quantity_val
    return pivot_frmt
While my solution works, it is incredibly slow when dealing with a large dataset and has taken 10 seconds or more to complete this type of operation. Any help would be greatly appreciated!
After experimenting with various pandas functions, such as melt and pivoting on multiple columns simultaneously, I found a solution that worked for me:
For every quantity-activity pair, I build a partial frame of the final dataset and store it in a list. Once every pair has been addressed I will end up with multiple dataframes that all have the same row counts, but potentially different column counts. I solved this issue by simply concatenating the columns and if any columns are repeated, I then sum them to get the final result.
import numpy as np
import pandas as pd

def test_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    dfs = []
    for act, quant in zip(act_cols, quant_cols):
        # One partial pivot per Activity/Quantity pair.
        partial = pd.crosstab(index=df['Location'], columns=df[act],
                              values=df[quant], aggfunc=np.sum).fillna(0)
        dfs.append(partial)
    finalDf = pd.concat(dfs, axis=1)
    # Columns repeated across the partial frames are summed together.
    finalDf = finalDf.groupby(finalDf.columns, axis=1).sum()
    return finalDf
There are two assumptions that I make during this approach:
The indexes maintain their order across all partial dataframes
There are an equivalent number of indexes across all partial dataframes
While this is probably not the most elegant solution, it achieves the desired result and reduced the processing time by a very significant margin (from roughly 10 s to 0.2 s on ~4k rows). If anybody has a better way to deal with this type of scenario and do the process outlined above in one shot, I would love to see your response!
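For reference, one possible one-shot variant (an illustrative sketch, not from the original post) melts the activity/quantity pairs into long form and pivots once; the column names follow the Activity01/Quantity01 convention used above:

import pandas as pd

def melt_pivot(df):
    act_cols = ['Activity01', 'Activity02', 'Activity03', 'Activity04']
    quant_cols = ['Quantity01', 'Quantity02', 'Quantity03', 'Quantity04']
    # Stack each Activity/Quantity pair into a long (Location, action, quantity) frame.
    parts = [
        df[['Location', act, quant]].rename(columns={act: 'action', quant: 'quantity'})
        for act, quant in zip(act_cols, quant_cols)
    ]
    long_df = pd.concat(parts, ignore_index=True).dropna(subset=['action'])
    # A single pivot_table then sums quantities per (Location, action).
    return long_df.pivot_table(index='Location', columns='action',
                               values='quantity', aggfunc='sum', fill_value=0)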

Adding the quantities of products in a dataframe column in Python

I'm trying to calculate the sum of weights in a column of an excel sheet that contains the product title with the help of Numpy/Pandas. I've already managed to load the sheet into a dataframe, and isolate the rows that contain the particular product that I'm looking for:
dframe = xlsfile.parse('Sheet1')
dfFent = dframe[dframe['Product:'].str.contains("ABC") == True]
But I can't seem to find a way to sum up the weights, due to the obvious complexity of the problem (as shown below). For example, the column 'Product Title' contains values like:
1 gm ABC
98% pure 12 grams ABC
0.25 kg ABC Powder
ABC 5gr
where ABC is the product whose weight I'm looking to add up. Is there any way that I can add these weights all up to get a total of 268 gm? Any help or resources pointing to the solution would be highly appreciated. Thanks! :)
You can use extractall for values with units or percentages:
(?P<a>\d+\.\d+|\d+) extracts a float or int into column a
\s* matches zero or more spaces between the number and the unit
(?P<b>[a-z%]+) extracts the lowercase unit or percent sign following the number into column b
#add all possible units to a dictionary
d = {'gm':1,'gr':1,'grams':1,'kg':1000,'%':.01}
df1 = df['Product:'].str.extractall('(?P<a>\d+\.\d+|\d+)\s*(?P<b>[a-z%]+)')
print (df1)
a b
match
0 0 1 gm
1 0 98 %
1 12 grams
2 0 0.25 kg
3 0 5 gr
Then convert the first column to numeric and map the second through the dictionary of units. Then reshape with unstack, multiply the columns row-wise with prod, and finally sum:
a = df1['a'].astype(float).mul(df1['b'].map(d)).unstack().prod(axis=1).sum()
print (a)
267.76
Similar solution:
a = df1['a'].astype(float).mul(df1['b'].map(d)).prod(level=0).sum()
You need to do some data wrangling to get the column consistent and in the same format. You may do some matching to get the Product column aligned and consistent, similar to date-time formatting.
For example, you may do the following things (a rough sketch follows below):
Make a separate column with only the numeric values (float)
Change % values to decimals and multiply by the quantity
Convert kg values to grams
With no strings left, sum the float-only column to get the total.
Pandas can work well with this problem.
Note: there is no shortcut here; you need to get rid of the strings mixed in with the decimal values before the sum can be calculated.
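A rough sketch of those steps (illustrative only, reusing the extractall idea from the answer above; the 'Product:' column name and the unit table are taken from the question and the first answer):

import pandas as pd

df = pd.DataFrame({'Product:': ['1 gm ABC', '98% pure 12 grams ABC',
                                '0.25 kg ABC Powder', 'ABC 5gr']})

unit_factor = {'gm': 1, 'gr': 1, 'grams': 1, 'kg': 1000, '%': 0.01}

# Pull out every number/unit pair, convert the numbers to float,
# and turn each unit (or %) into a multiplicative factor.
parts = df['Product:'].str.extractall(r'(\d+\.?\d*)\s*([a-z%]+)')
weights = parts[0].astype(float) * parts[1].map(unit_factor)

# Multiply the factors within each row (e.g. 98% * 12 grams) and sum across rows.
total = weights.groupby(level=0).prod().sum()
print(total)  # 267.76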
