I've been trying to extract specific data from a given dataset and add it to a new one as a specific set of organized columns. I'm doing this by reading a CSV file and using pandas string functions. The problem is that even though the data is extracted correctly, pandas adds the second column as all NaN, even though there is data stored in the affected variable. Please see my code below; any idea how to fix this?
processor = pd.DataFrame()

Hospital_beds = "SH.MED.BEDS.ZS"
Mask1 = data["IndicatorCode"].str.contains(Hospital_beds)
stage = data[Mask1]
Hospital_Data = stage["Value"]

Birth_Rate = "SP.DYN.CBRT.IN"
Mask = data["IndicatorCode"].str.contains(Birth_Rate)
stage = data[Mask]
Birth_Data = stage["Value"]

processor["Countries"] = stage["CountryCode"]
processor["Birth Rate per 1000 people"] = Birth_Data
processor["Hospital beds per 100 people"] = Hospital_Data
processor.head(10)
The problem here is that the indices do not match up. When you initially populate the processor DataFrame, you are using the rows from the original DataFrame that contain the birth rate data. These rows are different from the ones that contain the hospital beds data, so when you do
processor["Hospital beds per 100 people"] = Hospital_Data
pandas will create the new column, but since there are no indices in processor matching those of Hospital_Data, it will contain only null values.
Probably what you first want to do is re-index the original data using the country code and the year:
data.set_index(['CountryCode','Year'], inplace=True)
You can then create a view of just the indicators you are interested in:
indicators = ['SH.MED.BEDS.ZS', 'SP.DYN.CBRT.IN']
dview = data[data.IndicatorCode.isin(indicators)]
Finally, you can pivot on the indicator code to view each indicator on the same line:
dview.pivot(columns='IndicatorCode')['Value']
But note this will still contain a lot of NaNs. That is simply because the hospital bed data is updated very infrequently (or, e.g. in Aruba, not at all), but you can filter these out as needed.
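To make this concrete, here's a minimal runnable sketch of that flow on a couple of made-up rows (the column names come from the question; the values are invented):
import pandas as pd

# Toy stand-in for the World Bank indicators data
data = pd.DataFrame({
    'CountryCode': ['ABW', 'ABW', 'AFG', 'AFG'],
    'Year': [2010, 2010, 2010, 2010],
    'IndicatorCode': ['SH.MED.BEDS.ZS', 'SP.DYN.CBRT.IN',
                      'SH.MED.BEDS.ZS', 'SP.DYN.CBRT.IN'],
    'Value': [4.0, 12.5, 0.4, 47.8],
})

data.set_index(['CountryCode', 'Year'], inplace=True)
indicators = ['SH.MED.BEDS.ZS', 'SP.DYN.CBRT.IN']
dview = data[data.IndicatorCode.isin(indicators)]

# One row per (country, year), one column per indicator
# (recent pandas lets pivot() append to an existing MultiIndex)
print(dview.pivot(columns='IndicatorCode')['Value'])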
I am currently working on a project that uses a DataFrame of almost 24,000 basketball games from the years 2004-2021. What I want to do in the end is make a single DataFrame that has only one row per year, where each column value is the mean of that category for the year. What I have so far is a mask function that can separate by year, but I want to make a for loop that will go through the list of years, take the mean of each, and concatenate the results into a new DataFrame. The code might help explain this better.
# Now I want to separate this into datasets based on year, so I'll make a function.
# This will be used to separate by year; in my original dataset, "SEASON" is the year.
def mask(year):
    mask = stats['SEASON'] == year
    year_mask = stats[mask]
    return year_mask
How can I make this into a loop that separates the data by year, finds the mean values of all columns for that year, and combines them into one DataFrame that should have 18 rows spanning 2004-2021?
If you are using pandas DataFrames, it's best to let pandas do the work for you.
I assume you want to calculate the mean of some category in your DataFrame, grouped by the year. To do this we can create a function like so:
def foo(df, category):
    return df.groupby(by=["year"])[category].mean()
If you want the mean of all the categories, just use:
df.groupby(by=["year"]).mean()
Background info
I'm working on a DataFrame in which I have successfully joined two different datasets of football players using fuzzymatcher. These datasets did not have keys for an exact match, so the join had to be done on the players' names. An example of matching the name columns of the two databases is the following:
long_name    name
L. Messi     Lionel Andrés Messi Cuccittini
As part of the validation process for an 18,000-row database, I want to check the two date-of-birth columns in the merged DataFrame, df, ensuring that they match like the example below:
dob          birth_date
1987-06-24   1987-06-24
Both date columns have been converted from strings to dates using pd.to_datetime(), e.g.
df['birth_date'] = pd.to_datetime(df['birth_date'])
My question
I have another column called 'value'. I want to update my pandas DataFrame so that if the two date columns match, the entry is unchanged; however, if the two date columns don't match, I want the data in the value column to be changed to null. This is something I can do quite easily in Excel with a date-diff calculation, but I'm unsure how to do it in pandas.
My current code is the following:
df.loc[(df['birth_date'] != df['dob']),'value'] = np.nan
Reason for this step (feel free to skip)
The reason for this step is that it will quickly show me the fuzzy matches that are inaccurate (approx. 10% of the total database) and allow me to quickly fix them.
Ideally I also need to work on the matching algorithm to ensure a perfect date match; however, it currently works quite well in its current state and the project is nearly complete. I'd be happy to hear any advice on this, if it is something you know about.
Many thanks in advance!
IIUC:
Try np.where. It works as follows:
np.where(condition, x, y)
That is: where the condition holds, assign x; otherwise assign y. Here the condition is df['birth_date'] != df['dob'], x is np.nan, and y is the prevailing df['value']:
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
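A quick sanity check on made-up rows (column names from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'dob': pd.to_datetime(['1987-06-24', '1990-01-01']),
    'birth_date': pd.to_datetime(['1987-06-24', '1991-05-05']),
    'value': [100.0, 50.0],
})

# Where the dates disagree, blank out value; otherwise keep it
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
print(df)  # the second row's value becomes NaN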
I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I organized the dataset in alphabetical order by country name and created an alphabetical list of the individual countries. new_data.tail() confirms that Zimbabwe is listed last; there are 80,336 rows in total, hence the sorting.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is to create a new DataFrame with 76 columns (country names), with each country's 'trust' data in the rows underneath its column.
df = pd.DataFrame()
for country in new_country_set:
    pink = new_data.loc[new_data['country'] == country]
    df[country] = pink.trust
The data does not get included for any column after the first. I believe this is because the number of rows of 'trust' data varies by country: while the first column has 1000 rows, some countries have as many as 2500 data points and as few as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have uses this exact data structure for its template data, which is why I'm attempting to put the data in a DataFrame. Besides, I simply haven't been able to do it, so I want to know how.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = pink.trust.rename(country).to_frame().combine_first(df)
which ensures that the index of df is always the union of the indexes of all the columns added so far.
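To illustrate the idea on two made-up series of different lengths (the names and values here are hypothetical):
import pandas as pd

df = pd.DataFrame()
albania = pd.Series([0.3, 0.5], name='Albania')        # 2 rows
brazil = pd.Series([0.7, 0.6, 0.9], name='Brazil')     # 3 rows

df = albania.to_frame().combine_first(df)
df = brazil.to_frame().combine_first(df)
print(df)  # 3 rows; Albania is NaN in the row it has no data for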
I think in this case pd.pivot(columns='var', values='val') will work for you, especially since you already have a DataFrame. This function transfers the values of a particular column into the column names. See the documentation for additional info. I hope that helps.
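For example, here's a minimal sketch on made-up data with the question's 'country' and 'trust' columns; a per-country row counter gives pivot a unique index for every cell:
import pandas as pd

new_data = pd.DataFrame({
    'country': ['Albania', 'Albania', 'Brazil', 'Brazil', 'Brazil'],
    'trust': [0.3, 0.5, 0.7, 0.6, 0.9],
})

# Number the rows within each country so each (row, country) pair is unique
new_data['row'] = new_data.groupby('country').cumcount()
wide = new_data.pivot(index='row', columns='country', values='trust')
print(wide)  # countries with fewer data points get NaN in the extra rows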
I have an Excel data file with thousands of rows and columns.
I am using Python and have started using pandas DataFrames to analyze the data.
What I want to do in column D is calculate the annual change for the values in column C, for each year and for each ID.
I can do this in Excel: if the org ID is the same as the one in the prior row, calculate the annual change, leaving blank the cells for the first period of each ID (the ones highlighted in blue in my sheet). I don't know how to do this using Python. Can anyone help?
Assuming the dataframe is already sorted:
df.groupby('ID').Cash.pct_change()
However, you can speed things up if you can assume the data is sorted, because then it isn't necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
    df.ID != df.ID.shift()
)
Both of these produce the column values you are looking for. To add the column, assign the result to a column or create a new DataFrame with it:
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
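A tiny made-up example shows the result; the first row of each ID stays NaN, like the blue cells in Excel:
import pandas as pd

df = pd.DataFrame({
    'ID': ['A', 'A', 'A', 'B', 'B'],
    'Cash': [100.0, 110.0, 121.0, 50.0, 60.0],
})

df['AnnChange'] = df.groupby('ID').Cash.pct_change()
print(df)  # AnnChange: NaN, 0.10, 0.10, NaN, 0.20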
I'm trying to use pandas on a movie dataset to find the 10 critics with the most reviews, and to list their names in a table with the name of the magazine publication they work for and the dates of their first and last review.
The movie dataset starts as a CSV file, which in Excel looks something like this:
critic    fresh   date     publication  title         reviewtext
r.ebert   fresh   1/2/12   Movie Mag    Toy Story     'blahblah'
n.bob     rotten  4/2/13   Time         Ghostbusters  'blahblah'
r.ebert   rotten  3/31/09  Movie Mag    CasaBlanca    'blahblah'
(you can assume that a critic posts reviews at only one magazine/publication)
Then my basic code starts out like this:
reviews = pd.read_csv('reviews.csv')
reviews = reviews[~reviews.quote.isnull()]
reviews = reviews[reviews.fresh != 'none']
reviews = reviews[reviews.quote.str.len() > 0]

most_rated = reviews.groupby('critic').size().sort_values(ascending=False)[:30]
print(most_rated)
output>>>
critic
r.ebert 2
n.bob 1
I know how to isolate the top ten critics and the number of reviews they've made (shown above), but I'm still not familiar with pandas groupby, and using it seems to get rid of the rest of the columns (things like publication and dates along with it). When that code runs, it only prints the movie critics and how many reviews they've done, not any of the other column data.
Honestly, I'm lost as to how to do it. Do I need to merge data from the original reviews back onto my sorted DataFrame? Do I need to write a function to apply to the groupby object? Tips or suggestions would be very helpful!
As DanB says, groupby() just splits your DataFrame into groups. Then you apply some number of functions to each group, and pandas stitches the results together as best it can, indexed by the original group identifiers. Other than that, as far as I understand, there's no "memory" of what the original group looked like.
Instead, you have to specify what you want the output to contain. There are a few ways to do this; I'd look into agg and apply. agg is for functions that return a single value for the whole group, whereas apply is much more flexible.
If you specify what you are looking to do, I can be more helpful. For now, I'll just give you two examples.
Suppose you want, for each reviewer, the number of reviews as well as the dates of the first and last reviews and the movies that were reviewed first and last. Since each of these is a single value per group, use agg:
grouped_reviews = reviews.groupby('critic')
grouped_reviews.agg({'date': ['first', 'last', 'size'],
                     'title': ['first', 'last']})
Suppose you want to return a DataFrame of the first and last review by each reviewer. We can use apply, which works with any function that outputs a pandas object. So we'll write a function that takes each group and returns a DataFrame of just its first and last rows:
def get_first_and_last(df):
    return pd.concat((df.iloc[0], df.iloc[-1]), axis=1, ignore_index=True)
grouped_reviews.apply(get_first_and_last)
If you are more specific about what you are looking to do, I can give you a more specific answer.
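For example, here's a runnable sketch of the agg route on a toy version of the reviews table (column names taken from the question; the data is made up):
import pandas as pd

reviews = pd.DataFrame({
    'critic': ['r.ebert', 'n.bob', 'r.ebert'],
    'date': pd.to_datetime(['1/2/12', '4/2/13', '3/31/09'],
                           format='%m/%d/%y'),
    'publication': ['Movie Mag', 'Time', 'Movie Mag'],
    'title': ['Toy Story', 'Ghostbusters', 'CasaBlanca'],
})

# Sort by date so 'first' and 'last' mean earliest and latest review
reviews = reviews.sort_values('date')
grouped_reviews = reviews.groupby('critic')
print(grouped_reviews.agg({'date': ['first', 'last', 'size'],
                           'publication': 'first'}))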