How to manipulate a matrix of data in Python?

I'd like to write code that creates a histogram from a matrix of data containing information about movies. The matrix has several columns; I'm interested in the column of movie release years and the column that says whether or not each movie passes the Bechdel test (the data set uses "Pass" and "Fail" as indicators). Knowing the column indices of these two columns (release year and pass/fail), how can I create a histogram of the movies that fail the test, with the x axis containing bins of movie years? The bin sizes are not too important; whatever pyplot defaults to would be fine.
What I can do (which is not a lot) is this:
plt.hist(year_by_Test_binary[:,0])
which creates a pretty but meaningless histogram of how many movies were released in bins of years (the matrix has years in the 0th column).
If you couldn't already tell, I am python-illiterate and struggling. Any help would be appreciated.

Assuming n is the column index of the Bechdel test result, and that your data is a NumPy array:
plt.hist([matrix[matrix[:,n] == 'Pass', 0], matrix[matrix[:,n] == 'Fail', 0]])
We're giving plt.hist two vectors of years: one for movies that pass and one for movies that fail. It will then draw one histogram per category, so you can visually compare how the two groups change over time.
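A runnable sketch of the idea above, with a tiny made-up matrix (the column layout follows the question: years in column 0, the Pass/Fail result in column n):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Illustrative stand-in for the real data: column 0 = release year,
# column 1 = Bechdel test result.
matrix = np.array([
    ["1999", "Pass"],
    ["1999", "Fail"],
    ["2005", "Fail"],
    ["2010", "Pass"],
])
n = 1  # column holding the test result

# Boolean indexing: keep only rows where column n is "Fail",
# take their years (column 0), and convert to integers.
fail_years = matrix[matrix[:, n] == "Fail", 0].astype(int)

plt.hist(fail_years)  # default bins, as the question allows
plt.xlabel("release year")
plt.ylabel("movies failing the test")
```

The same boolean-mask pattern with `== 'Pass'` gives the second vector for a side-by-side comparison.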

To convert your data to a matrix (NumPy array), use:
numpy.asarray(data)
and to plot it you can use:
plt.plot(data)
or, for a histogram:
plt.hist(data, bins)
where bins sets the number (or edges) of the histogram bins.
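A minimal sketch of that answer, with made-up data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Convert a plain Python list to a NumPy array, then histogram it
# into a chosen number of bins.
data = np.asarray([1, 2, 2, 3, 3, 3, 4])
counts, edges, _ = plt.hist(data, bins=4)
```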

Related

How can I take certain elements from a larger DataFrame and make another DataFrame with these elements in Python?

I am currently working on a project that uses a DataFrame of almost 24000 basketball games from the years 2004-2021. What I want to do in the end is make a single DataFrame that has only one row for each year, where the column values are the mean for that category. What I have so far is a mask function that can separate by year, but I want to make a for loop that will go through the list of years, get the mean of each, and then concatenate them into a new DataFrame. The code might help explain this better.
# Now I want to separate this into data sets based on year, so I'll make a
# function to separate by year. In my original dataset, "SEASON" is the year.
def mask(year):
    season_mask = stats['SEASON'] == year
    year_mask = stats[season_mask]
    return year_mask
How can I make this into a loop that separates by year, finds the mean values of all columns in that year, and combines them into one DataFrame that should have 18 rows spanning 2004-2021?
If you are using Pandas DataFrames, it's best to let pandas do the work for you.
I assume you want to calculate the mean of some category in your dataframe, grouped by the year. To do this we can create a function like so:
def foo(df, category):
    return df.groupby(by=["year"])[category].mean()
If you want the mean of all the categories, just use:
df.groupby(by=["year"]).mean()
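The groupby approach above in runnable form, with a tiny made-up stats frame (the column names "SEASON" and "PTS" are illustrative stand-ins for the real dataset's columns):

```python
import pandas as pd

stats = pd.DataFrame({
    "SEASON": [2004, 2004, 2005, 2005],
    "PTS":    [100,  110,   90,   95],
})

# One row per season; every numeric column is averaged. This replaces the
# mask-function-plus-loop-plus-concat approach in a single call.
per_year = stats.groupby("SEASON").mean()
```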

How to deal with missing value in Pandas DataFrame from open data?

I have downloaded ten open datasets of air pollution in 2010-2019 (which has been transferred to Pandas DataFrame by 'read_csv') that have some missing values.
The rows are ordered by each day including several items (like PM2.5, SO2,...). Most of the data include 17 or 18 items. There are 27 columns which separately are Year, Station, Item, 00, 01, ..., 23.
In this case, I already used
df = df.fillna(np.nan).apply(lambda x: pd.to_numeric(x, errors='coerce'))
and df.interpolate(axis=1, inplace=True)
But now, if the data have missing values from '00' to any time following, the interpolate function does not work. If I want to fill all these blanks, I need to merge the last day's data that is not null and use interpolate again.
However, different days have different items numbers, which means there are still some rows that can't be filled.
In a nutshell, I'm now trying to concat all the data by the key of the items and use interpolate.
By the way, after data cleaning, I would like to apply to xgboost and linear regression to predict PM2.5. Is there any way recommended to deal with the data?
(Or any demo code online?)
For example, the data would be like:
(image: one of the datasets)
I used df.groupby('date').size() and got:
(image: size of different days)
Or in other words, how to split different days and concat together?
groupby(['date','items'])? And then how to merge?
Or, is that possible to interpolate from the last value of the last row?
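A hedged sketch of one idea from the question, not a full answer: group by the item, flatten each item's hourly columns across consecutive days into one long series, interpolate, and reshape back, so a gap at hour "00" can borrow from the previous day's last value. The column names and data here are assumptions based on the question's description (27 columns: Year, Station, Item, 00..23; only 3 hours shown for brevity):

```python
import numpy as np
import pandas as pd

hours = [f"{h:02d}" for h in range(3)]  # "00".."02" as a stand-in for "00".."23"
df = pd.DataFrame({
    "date": ["day1", "day2"],
    "Item": ["PM2.5", "PM2.5"],
    "00":   [1.0, np.nan],  # day2 hour 00 is missing
    "01":   [2.0, 5.0],
    "02":   [3.0, 6.0],
})

# For each item, interpolate across days joined end-to-end, so the gap at
# day2's "00" is filled between day1's "02" and day2's "01".
for item, idx in df.groupby("Item").groups.items():
    block = df.loc[idx, hours]
    long = pd.Series(block.to_numpy().ravel()).interpolate()
    df.loc[idx, hours] = long.to_numpy().reshape(block.shape)
```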

Need to create bins having equal population. Also need to generate a report that contains cross tab between bins and cut

I'm using the diamonds dataset, below are the columns
Question: create bins having equal population. Also generate a report that contains a cross tab between bins and cut, representing the number in each cell as a percentage of the total.
I have the above query. Although being a beginner, I created the Volume column and tried to create bins with equal population using qcut, but I'm not able to proceed further. Could someone help me out with the approach to solve the question?
pd.qcut(diamond['Volume'], q=4)
You are on the right path: pd.qcut() attempts to break the data you provide into q equal-sized bins (though it may have to adjust a little, depending on the shape of your data).
pd.qcut() also lets you specify labels=False as an argument, which will give you back the number of the bin into which each observation falls. This is a little confusing, so here's a quick explanation: you could pass labels=['A','B','C','D'] (given your request for 4 bins), which would return the label of the bin into which each row falls. By telling pd.qcut that you don't have labels to give the bins, the function returns a bin number instead, just without a specific label. Otherwise, the function gives back the interval (value range) into which each observation (row) fell.
The reason you want the bin number is because of your next request: a cross-tab for the bin-indicator column and cut. First, create a column with the bin numbering:
diamond['binned_volume'] = pd.qcut(diamond['Volume'], q=4, labels=False)
Next, use the pd.crosstab() method to get your table:
pd.crosstab(diamond['binned_volume'], diamond['cut'], normalize=True)
The normalize=True argument makes the table report each entry as the count divided by the grand total, which is the last part of your question, I believe. (normalize='index' would instead give percentages within each bin.)
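The two steps end to end, with made-up diamond-like data (the real dataset's "Volume" column is assumed to exist already, as in the question):

```python
import pandas as pd

diamond = pd.DataFrame({
    "Volume": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "cut":    ["Fair", "Good", "Fair", "Good",
               "Fair", "Good", "Fair", "Good"],
})

# Equal-population bins, returned as bin numbers 0..3
diamond["binned_volume"] = pd.qcut(diamond["Volume"], q=4, labels=False)

# Cross-tab of bin vs. cut, each cell as a fraction of the grand total
table = pd.crosstab(diamond["binned_volume"], diamond["cut"], normalize=True)
```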

Creating a line graph for a top X in a dataframe (Pandas)

I'm trying to make a line graph for my dataframe that has the names of 10 customers on the X axis and their amount of purchases they made on the Y axis.
I have over 100 customers in my data frame, so I created a new data frame that is grouped by customers and which shows the sum of their orders and I wish to only display the top 10 customers on my graph.
I have tried using
TopCustomers.nlargest(10, 'Company', keep='first')
But I run into the error nlargest() got multiple values for argument 'keep' and if I don't use keep, I get told it's a required argument.
TopCustomers is composed of TopCustomers = raw.groupby(raw['Company'])['Orders'].sum()
Sorting is not required at the moment, but it'd be good to know in advance.
On an additional note: the list of customer names is rather lengthy, and after playing with some dummy data, I see that the labels for the X axis are stacked on top of each other. Is there a way to make the figure bigger so that all 10 are clearly visible, and maybe mark a dot where X and Y meet?
We can do sort_values and tail:
TopCustomers.sort_values().tail(10)
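A runnable sketch with made-up data. Note that TopCustomers is a Series (the output of groupby().sum()), so Series.nlargest takes only the count, no column name, which is why nlargest(10, 'Company', keep='first') raised the "multiple values for argument 'keep'" error: 'Company' was consumed as the keep argument. Only 3 customers are shown here:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

raw = pd.DataFrame({
    "Company": ["A", "B", "C", "A", "B"],
    "Orders":  [5,   9,   1,   2,   4],
})
TopCustomers = raw.groupby("Company")["Orders"].sum()

top = TopCustomers.nlargest(2)       # use 10 on the real data
top.plot(marker="o")                 # a dot at each x,y point
plt.xticks(rotation=45, ha="right")  # keep long names readable
```

sort_values().tail(10) gives the same set in ascending order; rotating the tick labels (or plt.figure(figsize=...)) addresses the overlapping-label issue from the follow-up question.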

Plotting the proportion of each of one feature that contain a specific value of another feature

I have a data frame with multiple features, including two categorical: 'race' (5 unique values) and 'income' (2 unique values: <=$50k and >$50k)
I've figured out how to do a cross-tabulation table between the two.
However, I can't figure out a short way on how to create a table or bar graph that shows what percentage of each of the five races falls in the <=$50k income group
The code below gives me a table where the rows are the individual races, with the counts for each of the two income categories and the total counts for each race. I can't figure out how to add another column on the right that simply takes the count for <=$50k, divides it by the total, and lists the proportion.
ct_race_income=pd.crosstab(adult_df.race, adult_df.income, margins=True)
Here's a bunch of code where I do it the long way: calculating each proportion and then creating a new dataframe for the purposes of making a bar chart. However, I want to code all of this in many fewer lines
total_white=len(adult_df[adult_df.race=="White"])
total_black=len(adult_df[adult_df.race=="Black"])
total_hisp=len(adult_df[adult_df.race=="Hispanic"])
total_asian=len(adult_df[adult_df.race=="Asian"])
total_amer_indian=len(adult_df[adult_df.race=="Amer-Indian"])
prop_white=(len(adult_df_lowincome[adult_df_lowincome.race=="White"])/total_white)
prop_black=(len(adult_df_lowincome[adult_df_lowincome.race=="Black"])/total_black)
prop_hisp=(len(adult_df_lowincome[adult_df_lowincome.race=="Hispanic"])/total_hisp)
prop_asian=(len(adult_df_lowincome[adult_df_lowincome.race=="Asian"])/total_asian)
prop_amer_indian=(len(adult_df_lowincome[adult_df_lowincome.race=="Amer-Indian"])/total_amer_indian)
prop_lower_income=pd.DataFrame()
prop_lower_income['Race']=["White","Black","Hispanic", "Asian", "American Indian"]
prop_lower_income['Ratio']=[prop_white, prop_black, prop_hisp, prop_asian, prop_amer_indian]
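The long way above can be collapsed to one line: pd.crosstab with normalize="index" divides each row by its row total, giving the proportion of each race in each income group. Sample data here is made up:

```python
import pandas as pd

adult_df = pd.DataFrame({
    "race":   ["White", "White", "Black", "Black", "Black"],
    "income": ["<=50K", ">50K",  "<=50K", "<=50K", ">50K"],
})

# Each row sums to 1; the "<=50K" column is the per-race proportion
# the question asks for. ct["<=50K"].plot.bar() would chart it.
ct = pd.crosstab(adult_df.race, adult_df.income, normalize="index")
```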
