How can I efficiently get the first x% of a DataFrame? - python

Say I have a DataFrame with 1000 rows. If I wish to create a series of only the first 5% (or the first 50 rows) what is the best way to do this in terms of percentages? (I don't want to simply do df.head(50))
I would like the code to able to adapt I wanted to change x to say 20% or 30%.

This should work:
your_percenteage = 5 #or 20, 30 etc
df = df.iloc[:round(len(df)/100*your_percentage)]

All you need to do is calculate the percenteage before you you call .head()
Example:
percenteage = 20
rows_to_keep = round(percenteage / 100 * len(df))
df = df.head(rows_to_keep)

Related

How to replace two entire columns in a df by adding 5 to the previous value?

I'm new to Python and stackoverflow, so please forgive the bad edit on this question.
I have a df with 11 columns and 3 108 730 rows.
Columns 1 and 2 represent the X and Y (mathematical) coordinates, respectively and the other columns represent different frequencies in Hz.
The df looks like this:
df before adjustment
I want to plot this df in ArcGIS but for that I need to replace the (mathematical) coordinates that currently exist by the real life geograhical coordinates.
The trick is that I was only given the first geographical coordinate which is x=1055000 and y=6315000.
The other rows in columns 1 and 2 should be replaced by adding 5 to the previous row value so for example, for the x coordinates it should be 1055000, 1055005, 1055010, 1055015, .... and so on.
I have written two for loops that replace the values accordingly but my problem is that it takes much too long to run because of the size of the df and I haven't yet got a result after some hours because I used the row number as the range like this:
for i in range(0,3108729):
if i == 0:
df.at[i,'IDX'] = 1055000
else:
df.at[i,'IDX'] = df.at[i-1,'IDX'] + 5
df.head()
and like this for the y coordinates:
for j in range(0,3108729):
if j == 0:
df.at[j,'IDY'] = 6315000
else:
df.at[j,'IDY'] = df.at[j-1,'IDY'] + 5
df.head()
I have run the loops as a test with range(0,5) and it works but I'm sure there is a way to replace the coordinates in a more time-efficient manner without having to define a range? I appreciate any help !!
You can just build a range series in one go, no need to iterate:
df.loc[:, 'IDX'] = 1055000 + pd.Series(range(len(df))) * 5
df.loc[:, 'IDY'] = 6315000 + pd.Series(range(len(df))) * 5

Efficient way of vertically expanding a Dataframe by row and keeping the same values?

I am doing this educational challenge on kaggle https://www.kaggle.com/c/competitive-data-science-predict-future-sales
The training set is a file of daily sales numbers of some products and the test set we need to predict is the sales of similar items for the month of november.
Now I would like to use my model to make daily predictions and thus expand the test data set by 30 for each row.
I have the following code:
for row in test.itertuples():
df = pd.DataFrame(index = nov15, columns = test.columns)
df['shop_id'] = row.shop_id
df['item_category_id'] = row.item_category_id
df['item_price'] = row.item_price
df['item_id'] = row.item_id
df = df.reset_index()
df.columns = ['date', 'item_id', 'shop_id', 'item_category_id', 'item_price']
df = df[train.columns]
tt = pd.concat([tt, df])
nov15 is a pandas daterange from 1/nov/2015 to 30/nov/2015
tt is just an empty dataset I fill by expanding it by 30 rows (nov 1 to 30) for every row in the test set.
test is the original dataframe I am copying the rows from
It runs, but it takes hours to complete.
Knowing pandas and learning from previous experiences, there is probably an efficient way to do this.
Thank you for your help!
So I have found a "more" efficient way, and then someone over at Reddit's r/learnpython has told me about the correct and most efficient way.
This above dilemma is easily solved by pandas explode function.
And these two lines do what I did above, but within seconds:
test['date'] = [nov15 for _ in range(len(test))]
test = test.explode('date')
Now my more efficient way or second solution, which is in no way anywhere close to equivalent or good was to simply make 30 copies of the dataframe with a column 'date' added.

Create a matrix with a set of ranges in columns and a set of ranges in rows with Pandas

I have a data frame in which one column 'F' has values from 0 to 100 and a second column 'E' has values from 0 to 500. I want to create a matrix in which frequencies fall within ranges in both 'F' and 'E'. For example, I want to know the frequency in range 20 to 30 for 'F' and range 400 to 500 for 'E'.
What I expect to have is the following matrix:
matrix of ranges
I have tried to group ranges using pd.cut() and groupby() but I don't know how to join data.
I really appreciate your help in creating the matrix with pandas.
you can use the cut function to create the bin "tag/name" for each column.
after you cat pivot the data frame.
df['rows'] = pd.cut(df['F'], 5)
df['cols'] = pd.cut(df['E'], 5)
df = df.groupby(['rows', 'cols']).agg('sum').reset_index([0,1], False) # your agg func here
df = df.pivot(columns='cols', index='rows')
So this is the way I found to create the matrix, that was obviously inspired by #usher's answer. I know it's more convoluted but wanted to share it. Thanks again #usher
E=df.E
F=df.F
bins_E=pd.cut(E, bins=(max(E)-min(E))/100)
bins_F=pd.cut(F, bins=(max(F)-min(F))/10)
bins_EF=bins_E.to_frame().join(bins_F)
freq_EF=bins_EF.groupby(['E', 'F']).size().reset_index(name="counts")
Mat_FE = freq_EF.pivot(columns='E', index='F')

Pandas groupby: compute (relative) sizes and save in original dataframe

my database structure is such that I have units, that belong to several groups and have different variables (I focus on one, X, for this question). Then we have year-based records. So the database then looks like
unitid, groupid, year, X
0 1 1, 1990, 5
1 2 1, 1990, 2
2 2 1, 1991, 3
3 3 2, 1990, 10
etc. Now what I would like to do is measure some "intensity" variable, that is going to be the number of units per group and year, and I would like to put it back into the database.
So far, I am doing
asd = df.drop_duplicates(cols=['unitid', 'year'])
groups = asd.groupby(['year', 'groupid'])
intensity = groups.size()
And intensity then looks like
year groupid
1961 2000 4
2030 3
2040 1
2221 1
2300 2
However, I don't know how to put them back into the old dataframe. I can access them through intensity[0], but intensity.loc() gives a LocIndexer not callable error.
Secondly, it would be very nice if I could scale intensity. Instead of "units per group-year", it would be "units per group-year, scaled by average/max units per group-year in that year". If {t,g} denotes a group-year cell, that would be:
That is, if my simple intensity variable (for time and group) is called intensity(t, g), I would like to create relativeIntensity(t,g) = intensity(t,g)/mean(intensity(t=t,g=:)) - if this fake code helps at all making myself clear.
Thanks!
Update
Just putting the answer here (explicitly) for readability. The first part was solved by
intensity = intensity.reset_index()
df['intensity'] = intensity[0]
It's a multi-index. You can reset the index by calling .reset_index() to your resultant dataframe. Or you can disable it when you compute the group-by operation, by specifying as_index=False to the groupby(), like:
intensity = asd.groupby(["year", "groupid"], as_index=False).size()
As to your second question, I'm not sure what you mean in Instead of "units per group-year", it would be "units per group-year, scaled by average/max units per group-year in that year".. If you want to compute "intensity" by intensity / mean(intensity), you can use the transform method, like:
asd.groupby(["year", "groupid"])["X"].transform(lambda x: x/mean(x))
Is this what you're looking for?
Update
If you want to compute intensity / mean(intensity), where mean(intensity) is based only on the year and not year/groupid subsets, then you first have to create the mean(intensity) based on the year only, like:
intensity["mean_intensity_only_by_year"] = intensity.groupby(["year"])["X"].transform(mean)
And then compute the intensity / mean(intensity) for all year/groupid subset, where the mean(intensity) is derived only from year subset:
intensity["relativeIntensity"] = intensity.groupby(["year", "groupid"]).apply(lambda x: pd.DataFrame(
{"relativeIntensity": x["X"] / x["mean_intensity_only_by_year"] }
))
Maybe this is what you're looking for, right?
Actually, days later, I found out that the first answer to this double question was wrong. Perhaps someone can elaborate to what .size() actually does, but this is just in case someone googles this question does not follow my wrong path.
It turned out that .size() had way less rows than the original object (also if I used reset_index(), and however I tried to stack the sizes back into the original object, there were a lot of rows left with NaN. The following, however, works
groups = asd.groupby(['year', 'groupid'])
intensity = groups.apply(lambda x: len(x))
asd.set_index(['year', 'groupid'], inplace=True)
asd['intensity'] = intensity
Alternatively, one can do
groups = asd.groupby(['fyearq' , 'sic'])
# change index to save groupby-results
asd= asd.set_index(['fyearq', 'sic'])
asd['competition'] = groups.size()
And the second part of my question is answered through
# relativeSize
def computeMeanInt(group):
group = group.reset_index()
# every group has exactly one weight in the mean:
sectors = group.drop_duplicates(cols=['group'])
n = len(sectors)
val = sum(sectors.competition)
return float(val) / n
result = asd.groupby(level=0).apply(computeMeanInt)
asd= asd.reset_index().set_index('fyearq')
asd['meanIntensity'] = result
# if you don't reset index, everything crashes (too intensive, bug, whatever)
asd.reset_index(inplace=True)
asd['relativeIntensity'] = asd['intensity']/asd['meanIntensity']

Creating Pandas 2d heatmap based on accumulated values (rather than actual frequency)?

Thanks for reading, I've spent 3-4 hours searching for examples to solve this but can't find any that solve.. the ones I did try didn't seem to work with pandas DataFrame object.. any help would be very much appreciated!!:)
Ok this is my problem.
I have a Pandas DataFrame containing 12 columns.
I have 500,000 rows of data.
Most of the columns are useless. The variables/columns I am interested in are called: x,y and profit
Many of the x and y points are the same,
so i'd like to group them into a unique combination then add up all the profit for each unique combination.
Each unique combination is a bin (like a bin used in histograms)
Then I'd like to plot a 2d chart/heatmap etc of x,y for each bin and the colour to be total profit.
e.g.
x,y,profit
7,4,230.0
7,5,162.4
6,8,19.3
7,4,-11.6
7,4,180.2
7,5,15.7
4,3,121.0
7,4,1162.8
Note how values x=7, y=4, there are 3 rows that meet this criteria.. well the total profit should be:
230.0 - 11.6 +1162.8 = 1381.2
So in bin x=7, y = 4, the profit is 1381.2
Note for values x=7, y=5, there are 2 instances.. total profit should be: 162.4 + 15.7 = 178.1
So in bin x=7, y = 5, the profit is 178.1
So finally, I just want to be able to plot: x,y,total_profit_of_bin
e.g. To help illustrate what I'm looking for, I found this on internet, it is similar to what I'd like, (ignore the axis & numbers)
http://2.bp.blogspot.com/-F8q_ZcI-HJg/T4_l7D0C7yI/AAAAAAAAAgE/Bqtx3eIHzRk/s1600/heatmap.jpg
Thank-you so much for taking the time to read:)
If for 'bin' of x where the values are x are equal, and the values of y are equal, then you can use groupby.agg. That would look something like this
import pandas as pd
import numpy as np
df = YourData
AggDF = df.groupby('x').agg({'y' : 'max', 'profit' : 'sum'})
AggDF
That would get you the data I think you want, then you could plot as you see fit. Do you need assistance with that also?
NB this is only going to work in the way you want it to if within each 'bin' i.e. the data grouped according to the values of x, the values of y are equal. I assume this must be the case, as otherwise I don't think it would make much sense to be trying to graph x and y together.

Categories

Resources