Coming from R, the code would be
x <- data.frame(vals = c(100,100,100,100,100,100,200,200,200,200,200,200,200,300,300,300,300,300))
x$state <- cumsum(c(1, diff(x$vals) != 0))
This marks every row where the difference from the previous row is non-zero, so I can use it to spot transitions in the data, like so:
   vals  state
1   100      1
...
7   200      2
...
14  300      3
What would be a clean equivalent in Python?
Additional question
The answer to the original question is posted below, but it won't work properly for a grouped dataframe in pandas.
Data here: https://pastebin.com/gEmPHAb7. Notice that there are 2 different filenames.
When the data is imported as df_all, I group it with the following and then apply the solution posted below.
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
Using diff and cumsum, as in your R example:
df['state'] = (df['vals'].diff() != 0).cumsum()
This uses the fact that True has integer value 1, so the cumulative sum goes up by one at every change.
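For instance, here is a minimal runnable sketch (built from the same vals pattern as in the question) that shows the intermediate boolean Series and the resulting state column:

import pandas as pd

# Minimal sketch, using the same vals pattern as in the question.
df = pd.DataFrame({'vals': [100]*6 + [200]*7 + [300]*5})

changed = df['vals'].diff() != 0   # True on the first row and at every transition
df['state'] = changed.cumsum()     # True/False are summed as 1/0, so each transition bumps the state

print(df.loc[[0, 6, 13]])          # first row of each state: 1, 2, 3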
Bonus question
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
I think you misunderstand what groupby does. All groupby does is create groups based on the criterion (filename in this instance). You then need to add another operation to tell pandas what should happen within each group.
Common operations are mean and sum, or more advanced ones such as apply and transform.
You can find more information in the pandas groupby documentation.
If you can explain in more detail what you want to achieve with the groupby, I can help you find the correct method. If you want to perform the above operation per filename, you probably need something like this:
def get_state(group):
    return (group.diff() != 0).cumsum()

df_all['state'] = df_all.groupby('filename')['Fit'].transform(get_state)
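The same thing can also be written inline with a lambda if you prefer not to define a named helper; a minimal equivalent sketch, assuming the same df_all frame and 'Fit' column as above:

df_all['state'] = df_all.groupby('filename')['Fit'].transform(lambda g: (g.diff() != 0).cumsum())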
Related
I have a pd dataframe which includes the columns CompTotal and CompFreq.
I wanted to add a third column, NormalizedAnnualCompensation, which uses the following logic:
If the CompFreq is Yearly then use the existing value in CompTotal
If the CompFreq is Monthly then multiply the value in CompTotal by 12
If the CompFreq is Weekly then multiply the value in CompTotal by 52
I eventually used np.where() to basically write a nested if statement like the ones I'm used to bodging together in Excel (I'm pretty new to coding in general); that's below.
My question is: could I have done it better? This doesn't feel very Pythonic based on what I've read and what I've been taught so far.
df['NormalizedAnnualCompensation'] = np.where(df['CompFreq'] == 'Yearly', df.CompTotal,
                                        np.where(df['CompFreq'] == 'Monthly', df.CompTotal * 12,
                                            np.where(df['CompFreq'] == 'Weekly', df.CompTotal * 52, 'NA')))
Thanks in advance.
There is no such thing as the "proper" way to do things, so you already got the correct one!
Still, you can certainly learn by asking for different approaches (although this probably goes beyond the scope of what Stack Overflow intends to be).
For example, you may consider staying within pandas by using boolean masks and setting only specific regions of the DataFrame via pd.DataFrame.loc:
df["NormalizedAnnualCompensation"] = "NA"
mask = df["CompFreq"]=="Yearly"
df.loc[mask, "NormalizedAnnualCompensation"] = df.loc[mask, "CompTotal"]
mask = df["CompFreq"]=="Monthly"
df.loc[mask, "NormalizedAnnualCompensation"] = df.loc[mask, "CompTotal"] * 12
mask = df["CompFreq"]=="Weekly"
df.loc[mask, "NormalizedAnnualCompensation"] = df.loc[mask, "CompTotal"] * 52
If you really only want to compare that column for equality and fill a fixed value for each case (i.e. if CompTotal were a constant over the whole dataframe), you could consider simply using pd.Series.map; compare the following minimal example achieving a similar thing:
In [1]: pd.Series(np.random.randint(4, size=10)).map({0: "zero", 1: "one", 2: "two"}).fillna(
...: "NA"
...: )
Out[1]:
0 NA
1 two
2 NA
3 zero
4 two
5 zero
6 one
7 two
8 NA
9 two
dtype: object
np.where() is good for simple if-then-else processing. However, if you have multiple conditions to test, nested np.where() calls become complicated and difficult to read. In that case, you can get cleaner and more readable code by using np.select(), as follows:
condlist = [df['CompFreq']=='Yearly', df['CompFreq']=='Monthly', df['CompFreq']=='Weekly']
choicelist = [df.CompTotal, df.CompTotal * 12, df.CompTotal * 52]
df['NormalizedAnnualCompensation'] = np.select(condlist, choicelist, default='NA')
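One caveat worth noting: with default='NA' (a string), np.select coerces the whole result to strings, so the new column ends up as object dtype. If you would rather keep it numeric, a small variation (just an assumption about what you want downstream) is to use np.nan as the default:

# Variation on the np.select call above; np.nan keeps the column numeric
# instead of coercing the computed values to strings.
df['NormalizedAnnualCompensation'] = np.select(condlist, choicelist, default=np.nan)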
I have a Pandas DataFrame like:
COURSE  BIB#  COURSE 1  COURSE 2  STRAIGHT-GLIDING     MEAN  PRESTASJON
1          2    20.220    22.535             19.91  21.3775    1.073707
0          1    21.235    23.345             20.69  22.2900    1.077332
This is from a pilot study, and the DataFrame may be much longer when we perform the real experiment. Now that I have calculated the performance for each BIB#, I want to allocate them into two different groups based on their performance. I have therefore written the following code:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
This sorts values in the DataFrame. Now I want to assign even rows to one group and odd rows to another. How can I do this?
I have no idea what I am looking for. I have looked through the documentation for the random module in Python, but that is not exactly what I am looking for. I have seen some questions/posts pointing to a scikit-learn stratification function, but I don't know if that is a good choice. Alternatively, is there a way to write a loop that accomplishes this? I appreciate your help.
Here is a figure to illustrate what I want to accomplish.
How about this:
threshold = 0.5
df1['group'] = df1['PRESTASJON'] > threshold
Or if you want values for your groups:
df['group'] = np.where(df['PRESTASJON'] > threshold, 'A', 'B')
Here, 'A' will be assigned to the 'group' column where PRESTASJON exceeds the threshold, otherwise 'B'.
UPDATE: Per OP's update on the post, if you want to group them alternately into two groups:
# sort your dataframe based on the PRESTASJON column
df1 = df1.sort_values(by='PRESTASJON')
# create a new column with default value 'A' and assign alternate rows to 'B'
df1['group'] = 'A'
df1.iloc[1::2, -1] = 'B'
Are you splitting the dataframe into alternating rows? If so, you can do:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
for i, d in df1.groupby(np.arange(len(df1)) % 2):
    print(f'group {i}')
    print(d)
Another way without groupby:
df1 = df1.sort_values(by='PRESTASJON', ascending=True)
mask = np.arange(len(df1)) % 2
group1 = df1.loc[mask==0]
group2 = df1.loc[mask==1]
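If you want the 'A'/'B' labels from the first answer but assigned alternately after sorting, the same modulo trick can feed np.where; a small sketch, assuming df1 is already sorted by PRESTASJON as above:

# Sketch: label alternating rows 'A' and 'B' on the sorted frame.
df1['group'] = np.where(np.arange(len(df1)) % 2 == 0, 'A', 'B')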
This question already has an answer here: How do you shift Pandas DataFrame with a multiindex?
So, I was wondering if I am doing this correctly, because maybe there is a much better way to do this and I am wasting a lot of time.
I have a 3 level index dataframe, like this:
IndexA  IndexB  IndexC  ColumnA  ColumnB
A       B       C1      HiA      HiB
A       B       C2      HiA2     HiB2
I need to do a search for every row, saving data from other rows. I know this sounds strange, but it makes sense with my data. For example:
I want to add ColumnB data from my second row to the first one, and vice-versa, like this:
IndexA  IndexB  IndexC  ColumnA  ColumnB  NewData
A       B       C1      HiA      HiB      HiB2
A       B       C2      HiA2     HiB2     HiB
In order to do this search, I do an apply on my df, like this:
df['NewData'] = df.apply(lambda r: my_function(df, r.IndexA, r.IndexB, r.IndexC), axis=1)
Where my function is:
def my_function(df, indexA, indexB, indexC):
    idx = pd.IndexSlice
    # Here I do calculations (subtraction) to know which C exactly I want
    # newIndexC = indexC - someConstantValue
    try:
        res = df.loc[idx[indexA, indexB, newIndexC], 'ColumnB']
        return res
    except KeyError:
        return -1
I tried to simplify this problem a lot, sorry if it sounds confusing. Basically my data frame has 20 million rows, and this search takes 2 hours. I know it is bound to take a while, because there are a lot of lookups, but I wanted to know if there is a faster way to do this search.
More information:
On indexA I have different groups of values. Example: Countries.
On indexB I have different groups of dates.
On indexC I have different groups of values.
Answer:
df['NewData'] = df.groupby(level=['IndexA', 'IndexB'])['ColumnB'].shift(7)
All you're really doing is a shift. You can speed it up 1000x like this:
df['NewData'] = df['ColumnB'].shift(-someConstantValue)
You'll need to roll the data from the top someConstantValue rows around to the bottom; I'm leaving that as an exercise.
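If you do want that wrap-around, one minimal sketch (assuming someConstantValue is a positive integer and df is ordered as above) uses np.roll, which shifts the values and wraps the first someConstantValue rows of ColumnB around to the bottom in one step:

import numpy as np

# Sketch of the wrap-around variant; someConstantValue is a placeholder from the question.
df['NewData'] = np.roll(df['ColumnB'].to_numpy(), -someConstantValue)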
So, I've got a dataframe with 308 different ORIGIN_CITY_NAME values and 12 different UNIQUE_CARRIER values.
I am trying to remove the cities where the number of unique carrier airlines is < 5, so I computed a per-city flag for that condition.
Now, I'd like to take this result and manipulate my original data, df, in such a way that I can remove the rows where ORIGIN_CITY_NAME corresponds to True.
I had an idea in mind, which is to use the isin() function or the apply(lambda) approach in Python, but I'm not familiar with how to go about it. Is there a more elegant way to do this? Thank you!
filter was made for this
df.groupby('ORIGIN_CITY_NAME').filter(
    lambda d: d.UNIQUE_CARRIER.nunique() >= 5
)
However, to continue in the vein you were attempting to get results from...
I'd use map
mask = df.groupby('ORIGIN_CITY_NAME').UNIQUE_CARRIER.nunique() >= 5
df[df.ORIGIN_CITY_NAME.map(mask)]
Or transform
mask = df.groupby('ORIGIN_CITY_NAME').UNIQUE_CARRIER.transform(
lambda x: x.nunique() >= 5
)
df[mask]
My database structure is such that I have units that belong to several groups and have different variables (I focus on one, X, for this question). Then we have year-based records, so the database looks like this:
   unitid  groupid  year   X
0       1        1  1990   5
1       2        1  1990   2
2       2        1  1991   3
3       3        2  1990  10
etc. Now what I would like to do is measure some "intensity" variable, that is going to be the number of units per group and year, and I would like to put it back into the database.
So far, I am doing
asd = df.drop_duplicates(subset=['unitid', 'year'])
groups = asd.groupby(['year', 'groupid'])
intensity = groups.size()
And intensity then looks like
year  groupid
1961  2000       4
      2030       3
      2040       1
      2221       1
      2300       2
However, I don't know how to put them back into the old dataframe. I can access them through intensity[0], but intensity.loc() gives a "LocIndexer is not callable" error.
Secondly, it would be very nice if I could scale intensity. Instead of "units per group-year", it would be "units per group-year, scaled by the average/max units per group-year in that year". That is, if my simple intensity variable (for time t and group g) is called intensity(t, g), I would like to create relativeIntensity(t, g) = intensity(t, g) / mean(intensity(t=t, g=:)) - if this fake code helps at all in making myself clear.
Thanks!
Update
Just putting the answer here (explicitly) for readability. The first part was solved by
intensity = intensity.reset_index()
df['intensity'] = intensity[0]
It's a multi-index. You can reset the index by calling .reset_index() on your resultant dataframe. Or you can disable it when you compute the group-by operation by specifying as_index=False in the groupby(), like:
intensity = asd.groupby(["year", "groupid"], as_index=False).size()
As to your second question, I'm not sure what you mean by "Instead of 'units per group-year', it would be 'units per group-year, scaled by average/max units per group-year in that year'". If you want to compute "intensity" as intensity / mean(intensity), you can use the transform method, like:
asd.groupby(["year", "groupid"])["X"].transform(lambda x: x/mean(x))
Is this what you're looking for?
Update
If you want to compute intensity / mean(intensity), where mean(intensity) is based only on the year and not year/groupid subsets, then you first have to create the mean(intensity) based on the year only, like:
intensity["mean_intensity_only_by_year"] = intensity.groupby(["year"])["X"].transform(mean)
And then compute intensity / mean(intensity) for each year/groupid subset, where mean(intensity) is derived from the year subset only:
intensity["relativeIntensity"] = intensity.groupby(["year", "groupid"]).apply(lambda x: pd.DataFrame(
{"relativeIntensity": x["X"] / x["mean_intensity_only_by_year"] }
))
Maybe this is what you're looking for, right?
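As a side note on the snippet above: since mean_intensity_only_by_year is already aligned row by row with X, a plain column division should give the same result without the second groupby; a minimal sketch, assuming the column names used above:

# Minimal sketch: element-wise division of the already-aligned columns.
intensity["relativeIntensity"] = intensity["X"] / intensity["mean_intensity_only_by_year"]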
Actually, days later, I found out that the first answer to this double question was wrong. Perhaps someone can elaborate on what .size() actually does, but I am posting this so that anyone who googles this question does not follow my wrong path.
It turned out that .size() had far fewer rows than the original object (even if I used reset_index()), and however I tried to stack the sizes back into the original object, a lot of rows were left with NaN. The following, however, works:
groups = asd.groupby(['year', 'groupid'])
intensity = groups.apply(lambda x: len(x))
asd.set_index(['year', 'groupid'], inplace=True)
asd['intensity'] = intensity
Alternatively, one can do
groups = asd.groupby(['fyearq' , 'sic'])
# change index to save groupby-results
asd= asd.set_index(['fyearq', 'sic'])
asd['competition'] = groups.size()
And the second part of my question is answered through
# relativeSize
def computeMeanInt(group):
    group = group.reset_index()
    # every group has exactly one weight in the mean:
    sectors = group.drop_duplicates(subset=['group'])
    n = len(sectors)
    val = sum(sectors.competition)
    return float(val) / n
result = asd.groupby(level=0).apply(computeMeanInt)
asd= asd.reset_index().set_index('fyearq')
asd['meanIntensity'] = result
# if you don't reset index, everything crashes (too intensive, bug, whatever)
asd.reset_index(inplace=True)
asd['relativeIntensity'] = asd['intensity']/asd['meanIntensity']
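For anyone reading this today, a shorter route is groupby(...).transform('size'), which returns one value per original row, so no index juggling is needed to put the counts back. A sketch, assuming the asd frame with the 'unitid', 'year' and 'groupid' columns from the start of the question:

# Sketch: broadcast each (year, groupid) row count back onto every row of that group.
asd['intensity'] = asd.groupby(['year', 'groupid'])['unitid'].transform('size')

# Yearly mean of the per-group counts, taking each (year, groupid) cell once,
# then mapped back by year to scale the intensity.
per_year_mean = (asd.drop_duplicates(subset=['year', 'groupid'])
                    .groupby('year')['intensity'].mean())
asd['relativeIntensity'] = asd['intensity'] / asd['year'].map(per_year_mean)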