groupby for pandas data frame gives wrong results - python

For my course project I am trying to replicate, in Python, a paper whose code was written in Stata. I am having difficulty replicating the results of a collapse command in their do-file. The corresponding line in the do-file is
collapse lexptot, by(clwpop right)
while I have
df.groupby(['cwpop', 'right'])['lexptot'].agg(['mean'])
The lexptot variable is the natural logarithm of a variable 'exptot', which I calculated previously using np.log(dfs['exptot']).
Does anyone have an idea what is going wrong here? The means I calculate are typically around 1.5 higher than the means calculated in Stata.

Once you update the question with more relevant details I may be able to say more, but this is what I think might help you:
df.groupby(['cwpop', 'right']).mean()['lexptot']
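Note, by the way, that the do-file collapses by clwpop while your Python line groups by cwpop, so it is worth double-checking the column name. For completeness, here is a minimal sketch of the whole pipeline, assuming df holds the same data as the Stata file with columns exptot, cwpop and right:
import numpy as np

# sketch only -- df is assumed to match the Stata dataset
df['lexptot'] = np.log(df['exptot'])  # np.log is the natural log, like Stata's log()

# equivalent of: collapse lexptot, by(cwpop right)
means = df.groupby(['cwpop', 'right'])['lexptot'].mean()
print(means)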

Related

Creating a 6 variable Venn Diagram with Boolean values in 6 columns

I am a novice at Python, so I apologize if this is confusing. I am trying to create a six-variable Venn diagram. I was trying to use matplotlib-venn, but I cannot work out how to build the sets it needs. My data is thousands of rows long with a unique index, and each column holds Boolean values for one category. It looks something like this:
| A | B | C | D | E | F |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 |
Ideally I'd like to make a Venn diagram that shows how many people fall into each overlap of the categories (e.g. this many people are in A and B and C). How would I go about doing this? If anyone could point me in the right direction, I'd be really grateful.
I found someone who had a similar problem to mine, and the solution at the end of that forum thread is what I'd like to end up with, except with six variables: https://community.plotly.com/t/how-to-visualize-3-columns-with-boolean-values/36181/4
Thank you for any help!
Perhaps you might try to be more specific about your needs and what you have tried.
Making a six-set Venn diagram is not trivial at all, even more so if you want to make the areas proportional. I made a program in C++ (nVenn) with a translation to R (nVennR) that can do that. I suppose it might be used from Python, but I have never tried, and I do not know if that is what you want. Also, interpreting six-set Venn diagrams is not easy; you may want to check UpSet for a different kind of representation. In the meantime, I can point you to a web page I made that explains how nVenn works (link).
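If the goal is just to know how many people fall into each combination of the categories (the numbers a Venn diagram would display), a plain pandas sketch may be enough; the small DataFrame below is a hypothetical stand-in for the real data:
import pandas as pd

# hypothetical sample matching the layout in the question
df = pd.DataFrame([[0, 0, 1, 0, 1, 1],
                   [1, 1, 0, 0, 0, 0],
                   [0, 0, 0, 1, 0, 0]],
                  columns=list('ABCDEF')).astype(bool)

# one count per combination of categories, i.e. per region of the Venn diagram
combo_counts = df.groupby(list(df.columns)).size()
print(combo_counts)

# or: a set of row indices per category, the form set-based tools such as
# matplotlib-venn (for up to three sets) expect
sets = {col: set(df.index[df[col]]) for col in df.columns}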

How to add rivers in FloPy?

I am working on a steady-state problem, and I have a raster with head values for all rivers. I have to add this to my FloPy model (MODFLOW 6), so I have looked at the RIV and DRN packages hoping to find a variable I can use to pass in the data from my raster (read into an array using GDAL), but no luck yet. I haven't found any examples to use as a guide so far.
I've been thinking of adding it as STRT in the IC package, setting STRT for the rivers in the first layer (I did this for the heads of another layer). But I'm not quite sure that's okay. Is that the right way? Any suggestions on how to do it, please?
Also, what values should one put to say that there are no heads in a layer?
Clearly, I'm a newbie :p Thank you in advance :).
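For what it's worth, river cells in MODFLOW 6 are normally supplied through the RIV package's stress-period data (one record per cell: cellid, stage, conductance, riverbed bottom) rather than through STRT in the IC package, which only sets initial heads. A rough sketch, where gwf (an existing flopy.mf6.ModflowGwf model), stage_array (the raster read with GDAL, aligned to the model grid) and the conductance and riverbed-bottom values are all assumptions:
import numpy as np
import flopy

# sketch only -- gwf and stage_array are assumed to exist; NaN marks non-river cells
nrow, ncol = stage_array.shape
riv_records = []
for i in range(nrow):
    for j in range(ncol):
        stage = stage_array[i, j]
        if np.isnan(stage):
            continue                  # no river in this cell
        cond = 100.0                  # riverbed conductance -- placeholder value
        rbot = stage - 1.0            # riverbed bottom -- placeholder value
        riv_records.append([(0, i, j), stage, cond, rbot])  # layer 0 = top layer

riv = flopy.mf6.ModflowGwfriv(gwf, stress_period_data={0: riv_records})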

Divide one "column" by another in Tab Delimited file

I have many files with three million lines each, in identical tab-delimited format. All I need to do is divide the number in the 14th "column" by the number in the 12th "column", then set the number in the 14th column to the result.
Although this is a very simple operation, I'm really struggling to work out how to achieve it. I've spent a good few hours searching this website, but unfortunately the answers I've seen have gone completely over my head, as I'm a novice coder!
The tools I have are Notepad++ and UltraEdit (which can run JavaScript, although I'm not familiar with it), and Python 3.6 (I have very basic Python knowledge). Other answers have suggested something called "awk", but when I looked it up it seemed to need Unix, and I only have Windows. What's the best tool for getting this done? I'm more than willing to learn something new.
In Python there are a few ways to handle CSV-like data; for your particular use case I think pandas is what you are looking for.
You can load your file with df = pandas.read_csv(filename, sep='\t', header=None); then performing your division and replacement is as easy as df[13] /= df[11].
Finally, you can write your data back in tab-delimited format with df.to_csv().
I leave it to you to fill in the missing details of the pandas functions, but I promise it is very easy, and learning it will probably benefit you for a long time.
Hope this helps
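To make those details concrete, here is a minimal sketch, where 'input.txt' and 'output.txt' are placeholder filenames and the file is assumed to have no header row:
import pandas as pd

# read a tab-delimited file with no header row; columns are numbered from 0
df = pd.read_csv('input.txt', sep='\t', header=None)

# 14th column (index 13) divided by 12th column (index 11), stored back in the 14th
df[13] = df[13] / df[11]

# write the result back out, still tab-delimited, without pandas' extra index column
df.to_csv('output.txt', sep='\t', header=False, index=False)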

How to find the minimum value of netCDF array excluding zero

I am using Python 2 and dealing with netCDF data.
The array is a variable called cloud water mixing ratio, an output from the WRF climate model, with 4 dimensions:
QC(time (25), vertical level (69), latitude (119), longitude (199))
I'm trying to get the minimum value of the values in this array. From an initial analysis using NCVIEW visualisation, I found that the minimum value is approximately 1e-5 and the maximum is 1e-3.
I've used
var = fh.variables['QC']
var[:].max()
var[:].min()
The max works fine, but the min gives me 0.0.
Then I tried a solution from here, which is
var[var>0].min()
but I also got zero. Then I realised that the above code is meant for arrays with negative values, while mine has none.
I've tried looking for solutions here and there but found nothing that works for my situation. Please, if anyone could point me to the right directions, I'd appreciate it a lot.
Thanks.
var[var>0].min is a function; you need to call it using ()
var[var>0].min() should work much better
Sorry for not being able to post the data; I don't have the privilege to share it. I tried creating a random 4-D array similar to the data and used all the solutions you provided, especially the one by @Joao Abrantes, and they all seemed to work fine. So I thought maybe there was some problem with the data.
Fortunately, there is nothing wrong with the data. I have discussed this with my friend and we have finally found the solution.
The solution is
qc[:][qc[:]>0].min()
I have to specify the [:] after the variable instead of just doing
qc[qc>0].min()
There is also another way, which is to convert the variable to a NumPy array first, because qc = fh.variables['QC']
returns a netCDF4.Variable. After adding the second line qc2 = qc[:], it becomes a numpy.ndarray.
qc = fh.variables['QC']
qc2 = qc[:] # create numpy array
qc2[qc2>0].min()
I'm sorry if my question was not clear when I posted it yesterday; I only learned about this today.
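Putting it all together, a minimal sketch ('wrfout.nc' is a placeholder for the actual file name):
from netCDF4 import Dataset

fh = Dataset('wrfout.nc')
qc = fh.variables['QC']    # a netCDF4.Variable, not yet an array
qc2 = qc[:]                # slicing reads the data into a (possibly masked) NumPy array
print(qc2.max())           # overall maximum
print(qc2[qc2 > 0].min())  # minimum over strictly positive values only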

Python: Is pandas the right thing to use? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers. Closed 9 years ago.
I'm trying to work with a 1.8 MB txt file. There are a couple of header lines; afterwards it's all space-separated data. I can pull the data in using pandas. What I want to do with the data is:
1) Cut out the non-essential data, i.e. roughly the first 1675 lines and the last 3-10 lines (the exact count varies day to day). I can remove the first lines, kind of. The big problem with this idea right now is knowing for sure where that 1675 pointer location is. Using something like
df = df[df.year > 1978]
only moves the initial 'pointer' to 1675. If I try
dataf = df[df.year > 1978]
it just gives me a pure copy of what I would have with the first line. It still keeps the pointer at the same 1675 start point; it won't let me access any of the first 1675 rows, but they are still obviously there.
df.year[0]
comes back with an error suggesting row 0 doesn't exist. I have to go out and search to find what the first readable row is. Instead of flat-out removing the rows and moving the new pointer up to 0, it just moves the pointer to 1675 and won't allow access to anything lower than that. I also still haven't found a way to determine the last row number programmatically; through the shell it's easy, but I need to do it in the program so I can set up the loop for point 2.
2) I want to be able to take averages of the data: 'x'-day moving averages, creating a new column with the new data once I have calculated the moving average. I think I can create the new column with a Series statement, but I haven't tried it yet, as I haven't been able to get this far.
3) After all this and some more math I want to be able to graph the data with a homemade graph. I think this should be easy once I have everything else completed. I have already created the sample graph and can plot the points/lines on the graph once I have the data to work with.
Is pandas the right library for the project, or should I be trying to use something else? So far, the more research I do, the more lost I get: everything I try gets me a little further but sets me even further back at the same time. In a similar question I saw someone mention using something else for doing math on the data block, but there wasn't any indication of what he used.
It sounds like your main trouble is indexing. If you want to refer to the "first" thing in a DataFrame, use df.iloc[0]. But DataFrame indexes are really powerful regardless.
http://pandas.pydata.org/pandas-docs/stable/indexing.html
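A small sketch of that, together with the moving-average step from point 2; the filename, the 'value' column and the 30-row window are all assumptions:
import pandas as pd

# sketch only -- 'data.txt', the 'value' column and the window size are placeholders
df = pd.read_csv('data.txt', sep=r'\s+')         # space-separated data with a header line

df = df[df.year > 1978].reset_index(drop=True)   # drop unwanted rows and renumber from 0
print(df.year[0])                                # now the first kept row, no error

df['ma30'] = df['value'].rolling(window=30).mean()  # x-day moving average as a new column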
I think you are headed in the right direction. Pandas gives you nice, high level control over your data so that you can manipulate it much more easily than using traditional logic. It will take some work to learn. Work through their tutorials and you should be fine. But don't gloss over them or you'll miss some important details.
I'm not sure why you are concerned that the lines you want to ignore aren't actually deleted; as long as they aren't used in your analysis, it doesn't really matter. Unless you are facing memory constraints, it's probably irrelevant. But if you do find you can't afford to keep them around, I'm sure there is a way to really remove them, even if it's a bit sideways.
Processing a few megabytes of data is pretty easy these days, and pandas will handle it without any problems. I believe you can easily pass pandas data to numpy for your statistical calculations, but you should double-check that before taking my word for it. Also, they mention matplotlib on the pandas website, so I am guessing basic graphing will be easy as well.
