Thanks for reading. I've spent 3-4 hours searching for examples to solve this but can't find any that do; the ones I did try didn't seem to work with a pandas DataFrame object. Any help would be very much appreciated!! :)
Ok this is my problem.
I have a Pandas DataFrame containing 12 columns.
I have 500,000 rows of data.
Most of the columns are useless. The variables/columns I am interested in are called: x, y and profit.
Many of the x and y points are the same,
so I'd like to group them into each unique combination and then add up all the profit for each unique combination.
Each unique combination is a bin (like a bin used in histograms).
Then I'd like to plot a 2D chart/heatmap etc. of x and y for each bin, with the colour being the total profit.
e.g.
x,y,profit
7,4,230.0
7,5,162.4
6,8,19.3
7,4,-11.6
7,4,180.2
7,5,15.7
4,3,121.0
7,4,1162.8
Note how for values x=7, y=4 there are 4 rows that meet this criteria; the total profit should be:
230.0 - 11.6 + 180.2 + 1162.8 = 1561.4
So in bin x=7, y=4, the profit is 1561.4.
Note for values x=7, y=5 there are 2 instances; the total profit should be: 162.4 + 15.7 = 178.1
So in bin x=7, y=5, the profit is 178.1.
So finally, I just want to be able to plot: x,y,total_profit_of_bin
e.g. To help illustrate what I'm looking for, I found this on the internet; it is similar to what I'd like (ignore the axes & numbers):
http://2.bp.blogspot.com/-F8q_ZcI-HJg/T4_l7D0C7yI/AAAAAAAAAgE/Bqtx3eIHzRk/s1600/heatmap.jpg
Thank-you so much for taking the time to read:)
If, within each 'bin' of x (i.e. the rows where the values of x are equal), the values of y are also equal, then you can use groupby.agg. That would look something like this:
import pandas as pd
import numpy as np
df = YourData  # your existing DataFrame
AggDF = df.groupby('x').agg({'y' : 'max', 'profit' : 'sum'})  # one row per value of x: keep y, sum the profit
AggDF
That would get you the data I think you want, then you could plot as you see fit. Do you need assistance with that also?
NB: this is only going to work the way you want if, within each 'bin' (i.e. the data grouped according to the values of x), the values of y are equal. I assume this must be the case, as otherwise I don't think it would make much sense to be trying to graph x and y together.
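For the plotting part, a minimal sketch (assuming the columns really are named x, y and profit) that sums profit per unique (x, y) pair and draws the result as a heatmap could look something like this:
import matplotlib.pyplot as plt

# total profit for every unique (x, y) combination
binned = df.groupby(['x', 'y'])['profit'].sum().reset_index()

# pivot into a grid: rows are y, columns are x, values are total profit
grid = binned.pivot(index='y', columns='x', values='profit')

# (x, y) pairs that never occur are NaN and stay blank in the image
plt.imshow(grid.values, origin='lower', cmap='RdYlGn', aspect='auto',
           extent=[grid.columns.min(), grid.columns.max(),
                   grid.index.min(), grid.index.max()])
plt.colorbar(label='total profit')
plt.xlabel('x')
plt.ylabel('y')
plt.show()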
I have a DataFrame where some columns are correlated and some are not. I want to display only the uncorrelated columns as output. Can anyone help me out in solving this? I don't want to plot, just display the uncorrelated column names.
First of all, calculate the correlation:
import pandas as pd
import numpy as np

myDataFrame = pd.DataFrame(data)  # data is your raw input
correl = myDataFrame.corr()
Define what you mean by "uncorrelated". I will use an absolute correlation below 0.5 here:
uncor_level=0.5
The following code will give you the names of the pairs that are uncorrelated
pairs = np.full([len(correl)**2, 2], None)  # define an empty array to store the results
z = 0
for x in range(0, len(correl)):       # loop over each row (index)
    for y in range(0, len(correl)):   # loop over each column
        if abs(correl.iloc[x, y]) < uncor_level:
            pair = [correl.index[x], correl.columns[y]]
            pairs[z] = pair
            z = z + 1
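As a side note, the same pairs can be found without the explicit loops by stacking the correlation matrix; this is just a compact sketch using the correl and uncor_level defined above:
# flatten the correlation matrix into (row, column) -> value pairs
corr_pairs = correl.stack()

# keep only the pairs whose absolute correlation is below the threshold
uncorrelated = corr_pairs[corr_pairs.abs() < uncor_level]
print(uncorrelated.index.tolist())  # list of (row, column) name tuples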
I would like to ask for any suggestion on how to calculate a p-value for each row in my pandas DataFrame. My dataframe looks like this - there are columns with the means of Data1 and Data2, and also columns with the standard error of those means. Each row represents one atom. Thus I need to calculate a p-value for each row (which means, e.g., comparing the mean of atom 1 from Data1 with the mean of atom 1 from Data2).
SEM-DATA1 MEAN-DATA1 SEM-DATA2 MEAN-DATA2
0 0.001216 0.145842 0.000959 0.143103
1 0.002687 0.255069 0.001368 0.250505
2 0.005267 0.321345 0.003722 0.305767
3 0.027265 0.906731 0.033637 0.731638
4 0.029974 0.773725 0.150025 0.960804
I found here on Stack that many people recommend using scipy, but I don't know how to apply it in the way I need.
Is it possible?
Thank You.
You are comparing two samples, df['MEAN-DATA1'] and df['MEAN-DATA2'], so you should do this:
from scipy import stats
stats.ttest_ind(df['MEAN-DATA1'],df['MEAN-DATA2'])
which returns:
Ttest_indResult(statistic=0.01001479441863673, pvalue=0.9922547232600507)
or, if you only want the p-value:
a = stats.ttest_ind(df['MEAN-DATA1'],df['MEAN-DATA2'])
a[1]
which gives
0.9922547232600507
EDIT
A clarification is in order here. A t-test (or the acquisition of a "p-value") is aimed at finding out whether two distributions come from the same population (or sample). Testing two single values against each other will give NaN.
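If a per-row comparison really is what is needed, one possible approach (not what the answer above does, just a sketch) is to treat each row's means and standard errors as summary statistics and compute a z statistic per row:
import numpy as np
from scipy import stats

# z statistic per row (per atom) from the two means and their standard errors
z = (df['MEAN-DATA1'] - df['MEAN-DATA2']) / np.sqrt(df['SEM-DATA1']**2 + df['SEM-DATA2']**2)

# two-sided p-value for each row
df['p_value'] = 2 * stats.norm.sf(np.abs(z))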
Is there any library to plot a histogram by percentiles based on a series? I have been digging around pandas but I do not see any available methods for such. I do know of a long workaround, which is to manually calculate the number of occurrences for each percentile I want, but I think there is probably a better solution.
Currently, this is what I have to get the individual counts:
# Sample series
tenth = df.col.quantile(0.1)
twenty = df.col.quantile(0.2)
twenty_count = ((df.col > tenth) & (df.col <= twenty)).sum()  # rows between the 10th and 20th percentile
And so on...
However, using describe, I manage to get this:
df.describe(percentiles = [x/10.0 for x in range(1, 11)])
IIUC
df.col.rank(pct=True).hist()
However, this is a bad idea.
Consider the following dataframe df
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    col=np.random.randn(1000),
    col2=np.random.rand(1000)
))
Then
df.col.rank(pct=True).hist()
Which is a silly graph, because percentile ranks are uniform by construction, so the histogram comes out flat no matter what the underlying distribution looks like.
Instead, divide by the maximum absolute value
(df / df.abs().max()).hist()
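As for the counts per percentile bin mentioned in the question, a small sketch (assuming the column is df.col) using pd.cut on the decile edges could be:
# decile edges of the column, then count how many values fall into each bin
edges = df.col.quantile([i / 10.0 for i in range(11)]).values
counts = pd.cut(df.col, bins=edges, include_lowest=True).value_counts().sort_index()
print(counts)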
Ok so I have a dataframe object that's indexed as follows:
index, rev, metric1 (more metrics.....)
exp1, 92365, 0.018987
exp2, 92365, -0.070901
exp3, 92365, 0.150140
exp1, 87654, 0.003008
exp2, 87654, -0.065196
exp3, 87654, -0.174096
For each of these metrics I want to create individual stacked barplots comparing them based on their rev.
Here's what I've tried:
df = df[['rev', 'metric1']]
df = df.groupby("rev")
df.plot(kind = 'bar')
This results in 2 individual bar graphs of the metric. Ideally I would have these two merged and stacked (right now stacked=True does nothing). Any help would be much appreciated.
This would give me my ideal result, however I don't think reorganizing to fit this is the best way to achieve my goal as I have many metrics and many revisions.
index, metric1(rev87654), metric1(rev92365)
exp1, 0.018987, 0.003008
exp2, -0.070901, -0.065196
exp3, 0.150140, -0.174096
This is my goal. (made by hand)
http://i.stack.imgur.com/5GRqB.png
Following from this matplotlib gallery example:
http://matplotlib.org/examples/api/barchart_demo.html
there they get multiple bar groups to plot by calling bar once for each set.
You could access these values in pandas with indexing operations as follows:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16.2, 10), dpi=300)
# bars for the first unique value of SL
Y = Tire2[Tire2.SL == Tire2.SL.unique()[0]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y))
ax.bar(X, Y, width=.4)
# bars for the third unique value of SL, shifted right so the two sets sit side by side
Y = Tire2[Tire2.SL == Tire2.SL.unique()[2]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y)) + .5
ax.bar(X, Y, width=.4, color='r')
Working from the inside out:
get all of the unique values of 'SL' in one of the cols (rev in your case)
get a Boolean vector of all rows where 'SL' equals the first (or nth) unique value
index Tire2 by that Boolean vector (this will pull out only those rows where the vector is True)
access the values of SA (or a metric in your case); I took only the [0:13] values because I was testing this on a huge data set
bar-plot those values
If your experiments are consistently in order in the frame (as shown), that's that. Otherwise you might need to run a little sorting to get your Y values in the right order; .sort(column name) should take care of that. In my code, I'd slip it in between ...[0]] and .SA...
In general, this kind of operation can really help you out in wrangling big frames. .between is useful, and you can always add, multiply etc. the Boolean vectors to construct more complex logic.
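For instance, a tiny (hypothetical) example of combining such Boolean vectors on the frame from the question:
# rows where rev lies in a range and metric1 is positive
mask = df['rev'].between(87000, 93000) & (df['metric1'] > 0)
subset = df[mask]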
I'm not sure how to get the plot you want automatically without doing exactly the reorganization you specify at the end. The answer by user3823992 gives you more detailed control of the plots, but if you want them more automatic, here is some temporary reorganization that should work: it uses the indexing similarly, but also concatenates back into a DataFrame that will do the plot for you.
import numpy as np
import pandas as pd
exp = ['exp1','exp2','exp3']*2
rev = [1,1,1,2,2,2]
met1 = np.linspace(-0.5,1,6)
met2 = np.linspace(1.0,5.0,6)
met3 = np.linspace(-1,1,6)
df = pd.DataFrame({'rev':rev, 'met1':met1, 'met2':met2, 'met3':met3}, index=exp)
for met in df.columns:
    if met != 'rev':
        merged = df[df['rev'] == df.rev.unique()[0]][met]
        merged.name = merged.name + 'rev' + str(df.rev.unique()[0])
        for rev in df.rev.unique()[1:]:
            tmp = df[df['rev'] == rev][met]
            tmp.name = tmp.name + 'rev' + str(rev)
            merged = pd.concat([merged, tmp], axis=1)
        merged.plot(kind='bar')
This should give you three plots, one for each of my fake metrics.
EDIT: Or something like this might also do:
df['exp'] = df.index
pt = pd.pivot_table(df, values='met1', index=['exp'], columns=['rev'])
pt.plot(kind='bar')
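And, assuming the df['exp'] column created just above, the same pivot can be looped over every metric so each one gets its own bar chart (just a sketch on the fake data):
for met in ['met1', 'met2', 'met3']:
    pt = pd.pivot_table(df, values=met, index=['exp'], columns=['rev'])
    pt.plot(kind='bar', title=met)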
My database structure is such that I have units that belong to several groups and have different variables (I focus on one, X, for this question). Then we have year-based records. So the database looks like:
   unitid  groupid  year   X
0       1        1  1990   5
1       2        1  1990   2
2       2        1  1991   3
3       3        2  1990  10
etc. Now what I would like to do is measure some "intensity" variable, that is going to be the number of units per group and year, and I would like to put it back into the database.
So far, I am doing
asd = df.drop_duplicates(subset=['unitid', 'year'])
groups = asd.groupby(['year', 'groupid'])
intensity = groups.size()
And intensity then looks like
year  groupid
1961  2000       4
      2030       3
      2040       1
      2221       1
      2300       2
However, I don't know how to put them back into the old dataframe. I can access them through intensity[0], but intensity.loc() gives a LocIndexer not callable error.
Secondly, it would be very nice if I could scale intensity: instead of "units per group-year", it would be "units per group-year, scaled by the average/max units per group-year in that year".
That is, if my simple intensity variable (for time t and group g) is called intensity(t, g), I would like to create relativeIntensity(t, g) = intensity(t, g) / mean_over_g'(intensity(t, g')) - if this fake code helps at all in making myself clear.
Thanks!
Update
Just putting the answer here (explicitly) for readability. The first part was solved by
intensity = intensity.reset_index()
df['intensity'] = intensity[0]
It's a multi-index. You can reset the index by calling .reset_index() on your resultant dataframe, or you can disable it when you compute the group-by operation by specifying as_index=False to groupby(), like:
intensity = asd.groupby(["year", "groupid"], as_index=False).size()
As to your second question, I'm not sure what you mean by 'Instead of "units per group-year", it would be "units per group-year, scaled by average/max units per group-year in that year"'. If you want to compute intensity / mean(intensity), you can use the transform method, like:
asd.groupby(["year", "groupid"])["X"].transform(lambda x: x / x.mean())
Is this what you're looking for?
Update
If you want to compute intensity / mean(intensity), where mean(intensity) is based only on the year and not on year/groupid subsets, then you first have to create the mean(intensity) based on the year only, like:
intensity["mean_intensity_only_by_year"] = intensity.groupby(["year"])["X"].transform("mean")
And then compute the intensity / mean(intensity) for each year/groupid subset, where the mean(intensity) is derived only from the year subset:
intensity["relativeIntensity"] = intensity.groupby(["year", "groupid"]).apply(lambda x: pd.DataFrame(
    {"relativeIntensity": x["X"] / x["mean_intensity_only_by_year"]}
))
Maybe this is what you're looking for, right?
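A more compact way to get the same ratio (keeping the column names this answer assumes) would be a single transform over the year groups:
# divide each cell's value by the mean across all groups in the same year
intensity["relativeIntensity"] = intensity["X"] / intensity.groupby("year")["X"].transform("mean")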
Actually, days later, I found out that the first answer to this double question was wrong. Perhaps someone can elaborate on what .size() actually does, but this is here in case someone who googles this question would otherwise follow my wrong path.
It turned out that .size() had far fewer rows than the original object (also if I used reset_index()), and however I tried to stack the sizes back into the original object, a lot of rows were left with NaN. The following, however, works:
groups = asd.groupby(['year', 'groupid'])
intensity = groups.apply(lambda x: len(x))   # one count per (year, groupid) cell
asd.set_index(['year', 'groupid'], inplace=True)
asd['intensity'] = intensity                 # aligns on the (year, groupid) index
Alternatively, one can do
groups = asd.groupby(['fyearq' , 'sic'])
# change index to save groupby-results
asd= asd.set_index(['fyearq', 'sic'])
asd['competition'] = groups.size()
And the second part of my question is answered through
# relativeSize
def computeMeanInt(group):
    group = group.reset_index()
    # every group has exactly one weight in the mean:
    sectors = group.drop_duplicates(subset=['group'])
    n = len(sectors)
    val = sum(sectors.competition)
    return float(val) / n
result = asd.groupby(level=0).apply(computeMeanInt)
asd= asd.reset_index().set_index('fyearq')
asd['meanIntensity'] = result
# if you don't reset index, everything crashes (too intensive, bug, whatever)
asd.reset_index(inplace=True)
asd['relativeIntensity'] = asd['intensity']/asd['meanIntensity']
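For what it's worth, the same pipeline can also be sketched more compactly with transform and map (same asd, same 'fyearq' and 'sic' columns; an untested sketch, not what I actually ran):
# per-(fyearq, sic) cell size attached to every row
asd['intensity'] = asd.groupby(['fyearq', 'sic'])['sic'].transform('size')

# mean cell size per year, counting each (fyearq, sic) cell exactly once
cell_means = asd.drop_duplicates(subset=['fyearq', 'sic']).groupby('fyearq')['intensity'].mean()

asd['meanIntensity'] = asd['fyearq'].map(cell_means)
asd['relativeIntensity'] = asd['intensity'] / asd['meanIntensity']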