Extract unique value with multiple columns from DataFrame - python

I have a dataframe where I want to extract values from two columns but the criteria set is unique values from one of the columns. In the image below, I want to extract unique values of 'education' along with its corresponding values from 'education-num'. I can easily extract the unique values with df['education'].unique() and I am stuck with not being able to extract the 'education-num'.
image of the dataframe.
(Originally the task was to compute the population of people with education of Bachelors, Masters and Doctorate and I assume this would be easier when comparing the 'education-num' rather than logical operators on string. But if there's any way we could do it directly from the 'education' that would also be helpful.
Edit: Turns out the Dataframe.isin helps to select rows by the list of string as given in the solution here.)
P.S. stack-overflow didn't allow me to post the image directly and posted a link to it instead...đŸ˜’

Select columns by subset and call DataFrame.drop_duplicates:
df1 = df[['education', 'education-num']].drop_duplicates()
If need count population use:
df2 = df.groupby(['education', 'education-num']).size().reset_index(name='count')

Related

Sort the DataFrames columns which are dynamically generated

I have a dataframe which is similar to this
d1 = pd.DataFrame({'name':['xyz','abc','dfg'],
'age':[15,34,22],
'sex':['s1','s2','s3'],
'w-1(6)':[96,66,74],
'w-2(5)':[55,86,99],
'w-3(4)':[11,66,44]})
Note that in my original DataFrame the week numbers are generated dynamically (i.e) The columns
w-1(6),w-2(5) and w-3(4) are generated dynamically and change every week. I want to sort all the three columns of the week based on descending order of the values.
But the names of the columns cannot be used as they change every week.
Is there any possible way to achieve this?
Edit : The numbers might not always present for all the three weeks, in the sense that if W-1 has no data, i wont have that column in the dataset at all. So that would mean only two week columns and not three.
You can use the column indices.
d1.sort_values(by=[d1.columns[3], d1.columns[4], d1.columns[5]] , ascending=False)

How could I create a column with matchin values from different datasets with different lengths

I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are in total 5 Regions. Every Region consists of an x amount of ZipCodes. I would like to use the two different datasets to create a new column.
I tried some codes already, however, I failed because the series are not identically labeled. How should I tackle this problem?
I have two datasets, one of them has 1518 rows x 3 columns and the other one has
46603 rows x 3 columns.
As you can see in the picture:
df1 is the first dataset with the Postcode and Regio columns, which are the ZipCodes assigned to the corresponding Regio.
df2 is the second dataset where the Regio column is missing as you can see. I would like to add a new column into the df2 dataset which contains the corresponding Regio.
I hope someone could help me out.
Kind regards.
I believe you need to map the zipcode from dataframe 2 to the region column from the first dataframe. Assuming Postcode and ZipCode are same.
First create a dictionary from df1 and then replace the zipcode values based on the dictionary values
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2.ZipCode.replace(zip_dict)

Pandas - select lowest value to date

I'm new to Pandas.
I've got a dataframe where I want to group by user and then find their lowest score up until that date in the their speed column.
So I can't just use df.groupby(['user'])['speed'].transform('min) as this would give the min of all values not just form the current row to the first.
What can I use to get what I need?
Without seeing your dataset it's hard to help you directly. The problem does boil down to the following. You need to select the range of data you want to work with (so select rows for the date range and columns for the user/speed).
That would look something like x = df.loc[["2-4-2018","2-4-2019"], ['users', 'speed']]
From there you could do a simple x['users'].min() for the value or x['users'].idxmin() for the index of the value.
I haven't played around for a bit with Dataframes, but you're looking for how to slice Dataframes.

How to create a new python DataFrame with multiple columns of differing row lengths?

I'm organizing a new dataframe in order to easily insert data into a Bokeh visualization code snippet. I think my problem is due to differing row lengths, but I am not sure.
Below, I organized the dataset in alphabetical order, by country name, and created an alphabetical list of the individual countries. new_data.tail() Although Zimbabwe is listed last, there are 80336 rows, hence the sorting.
df_ind_data = pd.DataFrame(ind_data)
new_data = df_ind_data.sort_values(by=['country'])
new_data = new_data.reset_index(drop=True)
country_list = list(ind_data['country'])
new_country_set = sorted(set(country_list))
My goal is create a new DataFrame, with 76 cols (country names), with the specific 'trust' data in the rows underneath each country column.
df = pd.DataFrame()
for country in new_country_set:
pink = new_data.loc[(new_data['country'] == country)]
df[country] = pink.trust
Output here
As you can see, the data does not get included for the rest of the columns after the first. I believe this is due to the fact that the number of rows of 'trust' data for each country varies. While the first column has 1000 rows, there are some with as many as 2500 data points, and as little as 500.
I have attempted a few different methods to specify the number of rows in 'df', but to no avail.
The visualization code snippet I have utilizes this same exact data structure for the template data, so that it why I'm attempting to put it in a dataframe. Plus, I can't do it, so I want to know how to do it.
Yes, I can put it in a dictionary, but I want to put it in a dataframe.
You should use combine_first when you add a new column so that the dataframe index gets extended. Instead of
df[country] = pink.trust
you should use
df = pink.trust.combine_first(df)
which ensures that your index is always union of all added columns.
I think in this case pd.pivot(columns = 'var', values = 'val') , will work for you, especially when you already have dataframe. This function will transfer values from particular column into column names. You could see the documentation for additional info. I hope that helps.

How to calculate based on multiple conditions using Python data frames?

I have excel data file with thousands of rows and columns.
I am using python and have started using pandas dataframes to analyze data.
What I want to do in column D is to calculate annual change for values in column C for each year for each ID.
I can use excel to do this – if the org ID is same are that in the prior row, calculate annual change (leaving the cells highlighted in blue because that’s the first period for that particular ID). I don’t know how to do this using python. Can anyone help?
Assuming the dataframe is already sorted
df.groupby(‘ID’).Cash.pct_change()
However, you can speed things up with the assumption things are sorted. Because it’s not necessary to group in order to calculate percentage change from one row to next
df.Cash.pct_change().mask(
df.ID != df.ID.shift()
)
These should produce the column values you are looking for. In order to add the column, you’ll need to assign to a column or create a new dataframe with the new column
df[‘AnnChange’] = df.groupby(‘ID’).Cash.pct_change()

Categories

Resources