This question already has answers here:
"Drop random rows" from pandas dataframe
(2 answers)
Closed 2 years ago.
I have a dataset which contains a column 'location' with countries.
id location
0 001 United State
1 002 United State
2 003 Germany
3 004 Brazil
4 005 China
Now I only want the rows with specific countries.
I did it like this:
df2 = df[(df['location'].str.contains('United States')) | (df['location'].str.contains('Germany'))]
That works.
Now I want only half of the rows with 'United States'.
(The reason is I have a really large dataset and most of the rows are 'United States'. For the sake of performance in further operations, I want to cut half of it, or really any percentage.)
Can anyone help me do that in a fast and clean way? I'm struggling.
TY <3
You can use sample for that, together with drop:
df.drop(df[df['location'] == 'United State'].sample(frac=.5).index)
The filter inside returns all the rows whose location equals 'United State'. sample then randomly takes 50% of those rows, .index returns their index labels, and drop removes them from the dataframe.
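If you would rather build the reduced dataframe directly instead of dropping rows, here is a minimal alternative sketch (assuming the column really is named location and the value is spelled 'United State' as in the sample) that keeps every non-US row plus a random half of the US rows:
import pandas as pd
us_mask = df['location'] == 'United State'
# keep all non-US rows, plus a random 50% sample of the US rows
# (random_state only makes the sample reproducible)
df2 = pd.concat([df[~us_mask], df[us_mask].sample(frac=0.5, random_state=42)])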
I have a DataFrame with more than 11,000 observations. I want to loop through 2 of its several columns. Below are the 2 columns I want to loop through:
df1: EWZ sales_territory
0 90164.0 north
1 246794.0 north
2 216530.0 north
3 80196.0 north
4 12380.0 north
11002 224.0 east
11003 1746.0 east
11004 7256.0 east
11005 439.0 east
11006 13724.0 east
The data shown here is the first 5 and last 5 observations of the columns.
The sales_territory column has north,south,east and west observations. EWZ is the population size.
I want to select all the east rows that have the same value of population, and likewise for north, south and west, and append them to a variable. I.e., if there are 20 north rows that have 20,000 as the population size, I want to select them. Same with the others.
I tried using nested if statements, but frankly speaking I don't know how to specify the condition for the EWZ column. I tried iterrows(), but I could not find my way out.
How do I figure this out?
You don't have to use a for loop. You can use:
df1.groupby(['EWZ', 'sales_territory']).apply(your_function)
and achieve the desired result. If you want to get just the unique (EWZ, sales_territory) pairs, you can drop duplicates using df1[['EWZ', 'sales_territory']].drop_duplicates()
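For example, a minimal sketch (using the column names from the question) that finds the (EWZ, sales_territory) pairs occurring more than once, which is one way to read "same value of population":
# count how many rows share each (population, territory) pair
counts = df1.groupby(['EWZ', 'sales_territory']).size()
repeated = counts[counts > 1]
# equivalently, pull out every row whose (EWZ, sales_territory) pair is duplicated
matching_rows = df1[df1.duplicated(subset=['EWZ', 'sales_territory'], keep=False)]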
If you don't care about the EWZ values while selecting, you can use 4 loc statements, like:
df1.loc[df1['sales_territory'] == 'north', 'new_column'] = value
df1.loc[df1['sales_territory'] == 'east', 'new_column'] = value
df1.loc[df1['sales_territory'] == 'south', 'new_column'] = value
df1.loc[df1['sales_territory'] == 'west', 'new_column'] = value
This question already has answers here:
Count unique values per groups with Pandas [duplicate]
(4 answers)
Closed 2 years ago.
One of the columns in the DataFrame is STNAME (state name). I want to create a pandas Series with index = STNAME and value = number of entries in the DataFrame. A sample of the expected output is shown below:
STNAME
Michigan 83
Arizona 15
Wisconsin 72
Montana 56
North Carolina 100
Utah 29
New Jersey 21
Wyoming 23
My current solution is the following, but it seems a bit clumsy due to the need to pick an arbitrary column, rename that column, etc. I would like to know if there is a better way to do this.
import numpy as np

grouped = df.groupby('STNAME')
# Note: COUNTY is an arbitrary column name I picked from the dataframe
grouped_df = grouped['COUNTY'].agg(np.size)
grouped_df.name = 'Num Counties'
You can achieve this with value_counts(), which returns a pd.Series containing counts of unique values:
freq = df['STNAME'].value_counts()
The index will be STNAME, and the values will be the corresponding frequencies (the first element is the most frequently occurring one).
Note that NA's will be excluded by default.
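If you prefer to keep the groupby style from your attempt, here is a small equivalent sketch (assuming the column really is named STNAME as in the sample output):
# same counts via groupby; sorted by STNAME rather than by frequency
counts = df.groupby('STNAME').size()
# or keep value_counts() and just give the resulting Series a name
freq = df['STNAME'].value_counts().rename('Num Counties')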
I am fairly new to data science. Apologies if the question is unclear.
**My data is in the following format:**
year age_group pop Gender Ethnicity
0 1957 0 - 4 Years 264727 Mixed Mixed
1 1957 5 - 9 Years 218097 Male Indian
2 1958 10 - 14 Years 136280 Female Indian
3 1958 15 - 19 Years 135679 Female Chinese
4 1959 20 - 24 Years 119266 Mixed Mixed
...
Here 'Mixed' means both Male and Female for Gender, and Indian, Chinese and others for Ethnicity, whereas pop is the population.
I have some rows with missing values like the following:
year age_group pop Gender Ethnicity
344 1958 70 - 74 Years NaN Mixed Mixed
345 1958 75 - 79 Years NaN Male Indian
346 1958 80 - 84 Years NaN Mixed Mixed
349 1958 75 Years & Over NaN Mixed Mixed
350 1958 80 Years & Over NaN Female Chinese
...
These can't be dropped or filled with mean/median/previous values.
I am looking for cold deck or any other imputation technique that would allow me to fill pop based on the values in year, age_group, Gender and Ethnicity.
Please do provide any sample code or documentation that would help me.
It's hard to give a specific answer without knowing what you might want to use the data for. But here are some questions you should ask:
How many null values are there?
If there are only a few, e.g. fewer than 20, and you have the time, then you can look at each one individually. In this case, you might want to look up census data on Google etc. and make a guesstimate for each cell.
If there are more than can be individually assessed then we'll need to work some other magic.
Do you know how the other variables should relate to population?
Think about how other variables should relate to population. For example, if you know there are 500 males in one age cohort of a certain ethnicity but you don't know how many females... well, 500 females would be a fair guess.
This would only cover some of the nulls, but it is a logical assumption. You might be able to step through imputations of decreasing strength:
- Find all single-sex null values where the corresponding gender cohort is known, and assume a 50:50 gender ratio for the cohort (a sketch of this step follows the list).
- Find all null values where the older and younger cohorts are known, and impute this cohort's population linearly between them.
- Something else...
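A minimal sketch of that first idea, assuming the column names from the sample above and using the cohort mean of the rows that are known as the stand-in value:
# within each (year, age_group, Ethnicity) cohort, fill a missing population
# with the mean of the populations already known for that cohort
cohort = ['year', 'age_group', 'Ethnicity']
df['pop'] = df['pop'].fillna(df.groupby(cohort)['pop'].transform('mean'))
Anything still missing after this (cohorts with no known rows at all) would need one of the weaker fallbacks.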
This is a lot of work -- but again -- what do you want the data for? If you're looking for a graph it's probably not worth it. But if you're doing a larger study / trying to win a kaggle competition... then maybe it is?
What real world context do you have?
For example, if this data is for population in a certain country, then you might know the age distribution curve of that country? You could then impute values for ethnicities along the age distribution curve given where other age cohorts for the same ethnicity sit. This is brutally simplistic, but might be ok for your use case.
Do you need this column?
If there are lots of nulls, then whatever imputation you do will likely add a good degree of error. So what are you doing with the data? If you don't have to have the column, and there are a lot of nulls, then perhaps you're better off without it.
Hope that helps -- good luck!
This question already has answers here:
Aggregate unique values from multiple columns with pandas GroupBy
(3 answers)
Comma separated values from pandas GroupBy
(4 answers)
Closed 3 years ago.
I want to combine the rows and separate distinct entries by a comma.
I tried the following. Starting from
Postcode Borough Neighbourhood
M3A North York Parkwoods
M3A North York Victoria Village
I typed in the following command
df.groupby(['Postcode','Borough'])["Neighbourhood"].apply(lambda item: ', '.join(item))
But that gives me
Neighbourhood
Postcode Borough
M3A North York Parkwoods, Victoria Village
The problem is that the last column is somehow 'above' all the others. Can't I do this in a way that retains the old column structure?
Thanks!
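One way to keep the old column structure is to reset the index after grouping, so Postcode and Borough become ordinary columns again. A minimal sketch, assuming the same column names as above:
# reset_index turns the Postcode/Borough grouping keys back into regular columns
out = df.groupby(['Postcode', 'Borough'])['Neighbourhood'].apply(', '.join).reset_index()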
This question already has answers here:
Concatenate strings from several rows using Pandas groupby
(8 answers)
Pandas groupby: How to get a union of strings
(8 answers)
Closed 4 years ago.
How do I turn the below input data (a Pandas dataframe fed from an Excel file):
ID Category Speaker Price
334014 Real Estate Perspectives Tom Smith 100
334014 E&E Tom Smith 200
334014 Real Estate Perspectives Janet Brown 100
334014 E&E Janet Brown 200
into this:
ID Category Speaker Price
334014 Real Estate Perspectives Tom Smith, Janet Brown 100
334014 E&E Tom Smith, Janet Brown 200
So basically I want to group by Category, concatenate the Speakers, but not aggregate Price.
I tried different approaches with Pandas dataframe.groupby() and .agg(), but to no avail. Maybe there is a simpler pure Python solution?
There are 2 possible solutions. Either aggregate by multiple columns and join:
dataframe.groupby(['ID','Category','Price'])['Speaker'].apply(','.join)
Or, if you need to group only by the Price column, then it is necessary to aggregate all the other columns by first or last:
dataframe.groupby('Price').agg({'Speaker':','.join, 'ID':'first', 'Price':'first'})
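As a usage note, both results above are indexed by the grouping keys; a small sketch (same column names) with as_index=False keeps ID, Category and Price as ordinary columns instead:
# as_index=False keeps the grouping keys as columns in the output
out = dataframe.groupby(['ID', 'Category', 'Price'], as_index=False)['Speaker'].agg(', '.join)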
Try this:
df.groupby(['ID', 'Category'], as_index=False).agg(lambda x: x.iloc[0] if x.dtype == 'int64' else ', '.join(x))