Pandas qcut function duplicates parameter - python

Maybe I'm missing the point, but why doesn't the pandas qcut function accept "ignore" as an argument for duplicates?
As it stands, small datasets with duplicate values raise the error
"Bin edges must be unique"
together with the advice to use the "drop" option. But if you want a fixed number of bins, is there really no way to get one?
Small code example that does not work (with [1, 1, 2, 3], several of the lower quantile edges are all 1, so the bin edges are not unique):
import pandas as pd
import numpy as np
data=pd.Series([1,1,2,3])
pd.qcut(data,10,labels=np.arange(0,10),duplicates="raise")
Small example that works, but does not give the requested number of bins:
import pandas as pd
import numpy as np
data=pd.Series([1,1,2,3])
pd.qcut(data,4,labels=np.arange(0,3),duplicates="drop")
A possible solution:
Insert a third option "ignore" at https://github.com/pandas-dev/pandas/blob/06d230151e6f18fdb8139d09abf539867a8cd481/pandas/core/reshape/tile.py#L405
and change the if/else block at https://github.com/pandas-dev/pandas/blob/06d230151e6f18fdb8139d09abf539867a8cd481/pandas/core/reshape/tile.py#L418-L424
to
if duplicates == "raise":
    raise ValueError(
        f"Bin edges must be unique: {repr(bins)}.\n"
        f"You can drop duplicate edges by setting the 'duplicates' kwarg"
    )
elif duplicates == "drop":
    bins = unique_bins
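In the meantime, a workaround that keeps a fixed number of bins is to rank the values first, so the computed quantile edges are guaranteed to be unique. A minimal sketch, assuming ties may be broken arbitrarily (method="first"):
import pandas as pd
import numpy as np

data = pd.Series([1, 1, 2, 3])

# Ranking with method="first" breaks the ties, so the 10 quantile edges are
# distinct and qcut can always return the requested number of bins.
binned = pd.qcut(data.rank(method="first"), 10, labels=np.arange(0, 10))
print(binned)
Note that the bin a tied value lands in then depends on its position in the Series, which may or may not be acceptable for your use case.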

Related

How to filldown (forward fill) a filtered dataframe column in pandas

Problem statement: I need to fill down (forward fill) a filtered dataframe.
I have a large dataframe. It is not shown here, but I have included a dummy dataframe as an example below.
The only Status/State combinations I want are UP/GOOD and DOWN/BAD.
My dataset currently has undesired DOWN/GOOD combinations, and I'm trying to correct them to DOWN/BAD by filling down a filtered dataframe. Please advise on the code below; it is not working. There are a couple of other solutions to this problem, but I would like to use the fill-down (.ffill) method.
Thanks!
(Screenshots omitted: unfiltered dataset; filtered dataset with the Down rows shown; desired result.)
Code:
"""This is a dummy dataframe"""
import pandas as pd
import numpy as np
dummydata=[["Up","Good"],["Up","Good"],["Up","Good"],["Down","Bad"],["Up","Good"],
["Down","Good"],["Down","Good"],["Down","Good"],["Up","Good"],["Up","Good"],
["Down","Bad"],["Up","Good"],["Up","Good"],["Up","Good"],["Up","Good"]]
df=pd.DataFrame(dummydata, columns=['Status','State'])
filt=df['Status']=="Down"
df2=df.loc[filt]
df2.loc[df2['State']=='Good','State']=""
df2.loc[df2.State=='','State']= np.nan
df2.loc[df2['State']=='','State']=df2['State'].ffill()
print(df,df2)
My current fill-down method is not working; the code is provided above. Any help will be appreciated.
There is no need for all those intermediate blanking steps here.
The simplest would be:
df2['State'] = df2['State'].replace('Good', np.nan).ffill()
Output:
Status State
3 Down Bad
5 Down Bad
6 Down Bad
7 Down Bad
10 Down Bad
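If you also need the corrected values back in the original dataframe, one way (a small sketch reusing the filt mask from the question) is to assign the fixed slice back; .loc aligns on the index:
# Write the corrected State values back into df, matching rows by index.
df.loc[filt, 'State'] = df2['State']
print(df)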
Use this:
df.loc[df['Status']=='Down','State']='Bad'
df[df['Status']=='Down']
Status State
3 Down Bad
5 Down Bad
6 Down Bad
7 Down Bad
10 Down Bad
You can directly filter the rows with a condition and assign the value.
Code:
import pandas as pd
import numpy as np
dummydata=[["Up","Good"],["Up","Good"],["Up","Good"],["Down","Bad"],["Up","Good"],
["Down","Good"],["Down","Good"],["Down","Good"],["Up","Good"],["Up","Good"],
["Down","Bad"],["Up","Good"],["Up","Good"],["Up","Good"],["Up","Good"]]
df=pd.DataFrame(dummydata, columns=['Status','State'])
#Filter Status == Down and State == Good
df.loc[((df["Status"] == "Down") & (df["State"] =="Good")), "State"] = "Bad"
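To check the result, you can look at the Down rows afterwards (just a verification step, not part of the original answer):
print(df[df["Status"] == "Down"])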

AttributeError: 'SingleBlockManager' object has no attribute 'log'

I am working with big data: a million rows and 1000 columns. I have already referred to this post here; please don't mark this as a duplicate.
If sample data is required, you can use the below:
import pandas as pd
from numpy import *
m = pd.DataFrame(array([[1, 0],
                        [2, 3]]))
I have some continuous variables with 0 values in them.
I would like to compute a logarithmic transformation of all those continuous variables.
However, I encounter a divide-by-zero error. So, I tried the suggestions below, based on the post linked above:
df['salary'] = np.log(df['salary'], where=0<df['salary'], out=np.nan*df['salary'])  # not working: "python stopped working" problem
from numpy import ma
ma.log(df['app_reg_diff']) # error
My questions are as follows
a) How to avoid divide by zero error when applying for 1000 columns? How to do this for all continuous columns?
b) How to exclude zeros from log transformation and get the log values for rest of the non-zero observations?
You can replace the zero values with a value you like and do the logarithm operation normally.
import numpy as np
import pandas as pd
m = pd.DataFrame(np.array([[1,0], [2,3]]))
m[m == 0] = 1
print(np.log(m))
Here you get zeros for the items that were zero. You could, for example, replace them with -1 instead to get NaN.
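If you would rather keep the zeros out of the transformation entirely (question b) and apply it to every continuous column at once (question a), one possible sketch, assuming the columns in question are all numeric, is to mask the non-positive values so they come out as NaN:
import numpy as np
import pandas as pd

m = pd.DataFrame(np.array([[1, 0], [2, 3]]))

# where() keeps values > 0 and turns everything else into NaN, so np.log
# never sees a zero and the zero observations simply stay NaN in the result.
# For a mixed dataframe you could first take m.select_dtypes(include='number').
logged = np.log(m.where(m > 0))
print(logged)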

Why am I receiving a KeyError after slicing my data? [duplicate]

This question already has answers here:
Problem with getting rid of specific columns [closed]
(2 answers)
Closed 3 years ago.
I have code that slices data and is then supposed to calculate different indices from the columns.
My code worked well, but today I had to slice the data differently, and since then I get a KeyError whenever I try to compute the indices.
Unfortunately I can't share my original data, but I hope this code helps show what happened here.
This is my code with some explanations:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df_plants = pd.read_csv('my_data')
#My data contains columns with numerical data and their column title is numbers
#here I have changed the numbers titles into float
float_cols = [float(i) for i in df_plants.columns.tolist()[4:] if type(i)==str]
df_plants.columns.values[4:] = float_cols
#detector edges removal
#Here my goal is to remove some of the columns that has wrong data.
#this part was added today and might be the reason for the problem
cols = df_plants.columns.tolist()
df_plants=df_plants[cols[:4] + cols[11:]].copy()
#Trying to calculate indices:
filter_plants['NDVI']=(filter_plants['801.03']-filter_plants['680.75'])/(filter_plants['801.03']+filter_plants['680.75'])
KeyError: '801.03'
In order to solve this problem, I tried adding this line again before the calculation:
float_cols = [float(i) for i in df_plants.columns.tolist()[4:] ]
df_plants.columns.values[4:] = float_cols
but I still got the KeyError.
My end goal is to be able to do calculations with my indices; I believe the problem relates to the change in the type of the column labels.
Try changing the last line to:
filter_plants['NDVI']=(filter_plants[801.03]-filter_plants[680.75])/(filter_plants[801.03]+filter_plants[680.75])
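The KeyError comes from the fact that the numeric labels are now floats, so the string key '801.03' no longer matches. For the slicing step itself, a more robust, untested sketch (assuming the same file layout as in the question, and reassigning df_plants.columns as a whole rather than mutating .values in place) is to select columns by position with iloc, so it does not matter whether the labels are strings or floats, and then index with float keys as suggested above:
import pandas as pd

df_plants = pd.read_csv('my_data')

# Convert the numeric header strings (everything after the 4 metadata columns) to float.
cols = df_plants.columns.tolist()
df_plants.columns = cols[:4] + [float(c) for c in cols[4:]]

# Detector edge removal: keep the first 4 columns plus everything from column 11 on,
# selecting by position so the label types don't matter.
df_plants = pd.concat([df_plants.iloc[:, :4], df_plants.iloc[:, 11:]], axis=1)

# The spectral column labels are now floats, so index them with float keys.
df_plants['NDVI'] = (
    (df_plants[801.03] - df_plants[680.75])
    / (df_plants[801.03] + df_plants[680.75])
)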

Ignoring NaN/null values while looping through data

I wasn't able to find any clear answers to what I assume is a simple question. This is for Python 3. What are some of your tips and tricks for applying functions, loops, etc. to your data when a column has both null and non-null values?
Here is the example I ran into when I was cleaning some data today. I have a function that takes two columns from my merged dataframe and then calculates a ratio showing how similar the two strings are.
imports:
from difflib import SequenceMatcher
import pandas as pd
import numpy as np
import pyodbc
import difflib
import os
from functools import partial
import datetime
my function:
def apply_sm(merged, c1, c2):
    return difflib.SequenceMatcher(None, merged[c1], merged[c2]).ratio()
Here is how I call the function in my code:
merged['NameMatchRatio'] = merged.apply(partial(apply_sm, c1='CLIENT NAME', c2='ClientName'), axis=1)
CLIENT NAME has no null values, while ClientName does have null values (which throw errors when I try to apply my function). How can I apply my function while ignoring the NaN values (in either column, just in case)?
Thank you for your time and assistance.
You can use math.isnan to check whether a value is NaN and skip it. Alternatively, you can replace NaN with zero or something else and then apply your function. It really depends on what you want to achieve.
A simple example:
import math
test_variable = math.nan
if math.isnan(test_variable):
    print("it is a nan value")
Just incorporate this logic into your code as you deem fit.
def apply_sm(merged, c1, c2):
    if not merged[[c1, c2]].isnull().any():
        return difflib.SequenceMatcher(None, merged[c1], merged[c2]).ratio()
    return 0.0  # <-- you could handle the Null case here
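With this version, the call from the question can stay exactly as it was; rows where either column is null simply get a ratio of 0.0:
merged['NameMatchRatio'] = merged.apply(partial(apply_sm, c1='CLIENT NAME', c2='ClientName'), axis=1)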

Categorizing CSV data by groups defined through string values

So I am trying to organize data from a CSV file using pandas so I can graph it in matplotlib. I have different rows of values, some of which are control and others experimental. I am able to separate the rows to graph, but I cannot seem to make it work; I have attempted for loops (seen below) to graph, although I keep getting 'TypeError: 'type' object is not subscriptable'.
import pandas as pd
import numpy as np
import matplotlib as plt
df = pd.read_csv('C:\\Users\\User\\Desktop\\Ubiome samples\\samples.csv')
group = (df['Group'])
count = (df['Count'])
time = (df['Time'])
for steps in range [group]:
    plt.plot([time],[count],'bs')
    plt.show()
There is a typo in your for loop:
for steps in range [group]:
Should be
for steps in range(group):
Your for loop tries to subscript the range type itself, and since that isn't supported, you get a TypeError: 'type' object is not subscriptable. Check the Python documentation for __getitem__() for more details.
However, you cannot use range on a pandas Series to loop over every item in it, since range expects integers as its input. Instead you should use:
for steps in group:
This will loop over every row in your csv file, and output the exact same plot for each row. I'm quite sure this is not what you actually want to do.
If I understand your question correctly, you want to plot each group of experimental/control values you have in your csv.
Then you should try (untested):
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('C:\\Users\\User\\Desktop\\Ubiome samples\\samples.csv')
for group in df['Group'].unique():
    group_data = df[df['Group'] == group]
    plt.plot(group_data['Time'], group_data['Count'], 'bs')
    plt.show()
for group in df['Group'].unique() will loop over every piece of data in the Group column, ignoring duplicates.
For instance, if your column has 1000 strings in it, but all of these strings are either "experimental" or "control", then this will loop over ['experimental', 'control'] (actually a numpy array; also note that unique() doesn't sort, so the order of the output depends on the order of the input).
df[df['Group'] == group] will then select all the rows where the column 'Group' is equal to group.
Check the pandas documentation on the where method and boolean masking for more details.
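Since 'bs' draws every group with the same blue squares, you may prefer all groups on a single figure with their own labels. A small variation (still untested, same assumed column names and file path as above):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('C:\\Users\\User\\Desktop\\Ubiome samples\\samples.csv')

# Plot every group on the same axes, letting matplotlib pick a colour per group.
for group in df['Group'].unique():
    group_data = df[df['Group'] == group]
    plt.plot(group_data['Time'], group_data['Count'], 's', label=group)

plt.legend()
plt.show()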
