I'm having trouble selecting specific values of a row with pandas.
I have a CSV file with the confirmed cases of Coronavirus in each country on each day. So obviously some countries started having cases on different days and progressed in different ways.
Dataframe of countries I'm trying to plot:
I would like to filter the rows for each country from the 50th confirmed case onward, which occurs on a different day for each country.
I tried to use the command df[df['column']>50], but this works for a single column and I want to do it for all columns (one way to extend this is sketched below the question).
All my life I have worked with just procedural programming in Python, without libraries, but this week I decided to start using some of them, so my understanding of libraries is very limited and I don't know how to combine a for loop with a library function, which I think is what is needed here. This is also my first question on Stack Overflow, so if I am doing something wrong please tell me. Thank you!
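For reference, a minimal sketch of one way the per-column version might look, assuming one row per day and one column per country (the country names and numbers below are made up, not taken from the actual CSV):

import pandas as pd

# Hypothetical layout: one row per day, one column per country.
df = pd.DataFrame({
    'Brazil': [0, 10, 60, 120, 200],
    'Italy': [30, 80, 150, 300, 500],
})

# Keep, for each country, only the days with more than 50 confirmed cases,
# then re-index so every country starts at "day 0 since passing 50 cases".
aligned = df.apply(lambda col: col[col > 50].reset_index(drop=True))
print(aligned)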
I have a question about how to fill in a second-level index after a two-level groupby in pandas.
I have a dataframe of patient information. I'm trying to track when these reports were generated so I can chart them in pyplot. The three things that matter for what I'm trying to do are when the report was generated, the technology that generated the report, and the count of each technology per month. I have this line of code so far
frame.groupby([pd.Grouper(key="reportDate", freq='M'), pd.Grouper(key="sourceFilePathTechnology")], observed=False).count()
which generates the following table.
I'm close to what I'm trying to get, but I'm missing something and I can't find what I'm looking for in the documentation or in another SO post. The final missing step is that I would like to have every technology represented in the sourceFilePathTechnology index for every month. So 2016-03-31 only has FSG, when I need it to also have NTP and MOL, even if the count is 0, and I need this for every month in the reportDate index. Does anyone know how I can resolve this?
Thank you to anyone who can offer some input!
Found my answer. I needed to google "pandas group by and count 0" and came across this post: Pandas groupby for zero values.
The answer was:
frame.groupby([pd.Grouper(key="reportDate", freq='M'), pd.Grouper(key="sourceFilePathTechnology")], observed=False).count().unstack(fill_value=0).stack()
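To make the effect concrete, here is a small self-contained sketch of the same idea; only the reportDate and sourceFilePathTechnology column names come from the question, the rest of the data is made up:

import pandas as pd

# Made-up stand-in for the report dataframe described above.
frame = pd.DataFrame({
    "reportDate": pd.to_datetime(["2016-03-05", "2016-04-10", "2016-04-20"]),
    "sourceFilePathTechnology": ["FSG", "NTP", "MOL"],
    "reportId": [1, 2, 3],
})

counts = (
    frame.groupby([pd.Grouper(key="reportDate", freq='M'),
                   pd.Grouper(key="sourceFilePathTechnology")])
         .count()
         .unstack(fill_value=0)  # missing technology/month pairs become 0
         .stack()                # back to the original two-level index
)
print(counts)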
I am an Excel newbie and an aspiring data analyst. I have this data and I want to find the distribution of city-wise shopping experience. Column M has the shopping experience rated from 1 to 5.
What I tried
I am not able to google how to do this at all. I tried running a correlation, but the built-in Excel data analysis tool does not let me run it on non-numeric data, and I am not able to group the City cells either. I thought of replacing every city with a numeric alias, but I don't know how to do that either. How should I search for, or approach, this problem?
Update: I was thinking of some way to get this out of the cities column.
I am thinking this is better done in Python.
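If you do go the Python route, a rough pandas sketch of the kind of per-city distribution being described might look like this (the column names and values below are assumptions for illustration, not the real spreadsheet):

import pandas as pd

# Made-up stand-in for the spreadsheet; the column names are assumptions.
df = pd.DataFrame({
    "City": ["Delhi", "Delhi", "Mumbai", "Pune", "Mumbai"],
    "Shopping Experience": [5, 3, 4, 2, 4],
})

# Count of each rating (1-5) per city; absent combinations show as 0.
print(pd.crosstab(df["City"], df["Shopping Experience"]))

# Average rating per city, roughly what AVERAGEIF does in Excel.
print(df.groupby("City")["Shopping Experience"].mean())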
How about something like this? I have just taken the cities and data to show AVERAGEIF, SUMIF and COUNTIF:
I used Data validation to provide the list to select from.
I have many files with three million lines each in an identical tab-delimited format. All I need to do is divide the number in the 14th "column" by the number in the 12th "column", then set the number in the 14th column to the result.
Although this is a very simple operation, I'm really struggling to work out how to achieve it. I've spent a good few hours searching this website, but unfortunately the answers I've seen have gone completely over my head, as I'm a novice coder!
The tools I have are Notepad++ and UltraEdit (which has the ability to use JavaScript, although I'm not familiar with this), and Python 3.6 (I have very basic Python knowledge). Other answers have suggested using something called "awk", but when I looked this up it needs Unix - I only have Windows. What's the best tool for getting this done? I'm more than willing to learn something new.
In Python there are a few ways to handle CSV-like files. For your particular use case, I think pandas is what you are looking for.
You can load a file with df = pandas.read_csv(path, sep='\t'), passing sep='\t' because your files are tab-delimited. If the files have no header row, also pass header=None so the columns are labelled 0, 1, 2, ...; then performing your division and replacement is as easy as df[13] /= df[11] (the 14th column is at position 13, the 12th at position 11).
Finally you can write your data back out with df.to_csv(), again passing sep='\t' to keep the tab-delimited format.
I leave it to you to fill in the remaining details of the pandas functions, but I promise it is very easy, and learning it will probably benefit you for a long time.
Hope this helps
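For completeness, a minimal sketch of the whole round trip under those assumptions (tab-separated, no header row; the file names are placeholders, not from the question):

import pandas as pd

# Read one tab-delimited file; header=None assumes there is no header row,
# so the columns are labelled 0..N-1 (14th column = 13, 12th column = 11).
df = pd.read_csv("input.txt", sep="\t", header=None)

# Divide the 14th column by the 12th and store the result back in the 14th.
df[13] = df[13] / df[11]

# Write the result back out, tab-delimited, with no index column or header row.
df.to_csv("output.txt", sep="\t", header=False, index=False)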
I'm working through Wes McKinney's Python for Data Analysis. While I'm working in Python 3 and the book is written in Python 2, this is generally not an issue, and if anything a good exercise.
However, I've reached an impasse on Chapter 2, example 3: US Baby Names 1880 - 2010 (pg. 34). The purpose of the following code is to insert into the dataframe a column titled 'prop' that contains, for each year and gender, the fraction of babies given each name:
def add_prop(group):
    births = group.births.astype(float)
    group['prop'] = births / births.sum()
    return group

names = names.groupby(['year', 'sex']).apply(add_prop)
'names' is a dataframe with five columns ('name', 'sex', 'births', 'year', and this adds 'prop') and approximately 1.7 million rows. In order to test whether prop was added correctly, you then check that the proportions sum to approximately 1 with np.allclose(names.groupby(['year','sex']).prop.sum(), 1).
My problem is that the function runs unpredictably. Perhaps once out of every 15 or 20 runs np.allclose will be true, and the function will have been applied to the dataframe correctly; otherwise np.allclose is false. Additionally, it's wrong in different ways. Later you use this dataframe to graph the proportion of births represented in the top 1000 names by sex, and the shape of that graph changes constantly from run to run. I know the problem is in how the proportion is being calculated and added, because the rest of the dataframe doesn't vary.
What is introducing unpredictability into this example? While I suspect it's the .apply() command, I'm not sure and don't know how to test my hypothesis. It's been suggested to me that part of the code block is deprecated, but Jupyter Notebook doesn't come up with a warning and I haven't been able to find anything online. I've gone over my code twice, and overall it's virtually identical to the book's, and is identical in the case of this block. Thanks in advance.
I think the issue is in using a float data type for prop. Floats are bad where you need accuracy. That's why, in the book, he says the sums of the props should add to 1, or be "sufficiently close to (but perhaps not exactly equal to) 1".
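To see the kind of gap being described, a tiny example (not from the book) of why np.allclose is used instead of an exact comparison:

import numpy as np

print(0.1 + 0.2 == 0.3)             # False: binary floating point cannot represent these exactly
print(np.allclose(0.1 + 0.2, 0.3))  # True: equal within a small tolerance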
I'm new to Python myself, so I don't know the best solution. In databases I avoid floats and use the decimal data type. Regardless, if we're going to find the proportion of each of over a million records, it's going to be tough to maintain exact accuracy.
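For what it's worth, the same proportion can also be computed without apply, using transform (a common pandas idiom rather than anything from the book); it might at least help isolate whether .apply is the source of the run-to-run variability:

# Equivalent proportion calculation without apply: divide each birth count
# by the total births for its (year, sex) group.
names['prop'] = names['births'] / names.groupby(['year', 'sex'])['births'].transform('sum')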
A question for experienced pandas users on your approach to working with DataFrame data.
Invariably we want to use pandas to explore relationships among data elements. Sometimes we use groupby-type functions to get summary-level data on subsets of the data. Sometimes we use plots and charts to compare one column of data against another. I'm sure there are other applications I haven't thought of.
When I speak with other fairly novice users like myself, they generally try to extract portions of a "large" dataframe into smaller dfs that are sorted or formatted properly for the analysis or plot at hand. This approach certainly has disadvantages: if you strip out a subset of the data into a smaller df and then want to run an analysis against a column you left in the bigger df, you have to go back and recut everything.
My question is: is it best practice for more experienced users to leave the large dataframe intact and pull out the data syntactically, so that the effect is the same as or similar to cutting out a smaller df? Or is it better to actually cut out smaller dfs to work with?
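For illustration only, a sketch of the two styles being compared; the data and column names are made up:

import pandas as pd

# Hypothetical data; the column names are made up for illustration.
df = pd.DataFrame({
    "country": ["US", "US", "BR", "BR"],
    "cases": [100, 150, 80, 120],
    "deaths": [2, 3, 1, 2],
})

# Style 1: cut out a smaller dataframe and keep working with it.
us = df[df["country"] == "US"].copy()
print(us["cases"].mean())

# Style 2: leave the large dataframe intact and select what you need inline.
print(df.loc[df["country"] == "US", "cases"].mean())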
Thanks in advance.