How can I combine columns in a pandas dataframe with duplicate names? - python

I have a pandas dataframe in which there are multiple columns for Australia, one per province, titled Australia, Australia.1, Australia.2 and so on. The same is true for other countries such as the USA, the UK and Canada.
I want only one column for each of these countries. For example, I want a single column named Australia holding the sum of the values across its provinces, and I want to avoid duplicate column names. How can I do this with a pandas dataframe in Python?

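You can transpose the dataframe and reset the index. Then strip the trailing period and number from each name and group by that column, which in my example is index. The result has one row per country; transpose it again with df.T if you want the countries back as columns:
df = df.T.reset_index()
# raw string plus regex=True: since pandas 2.0, str.replace matches literally by default
df['index'] = df['index'].str.replace(r'\.\d+$', '', regex=True)
df.iloc[:,2:] = df.iloc[:,2:].astype(float)
df = df.groupby('index').sum()
df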
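As a rough sketch of the same idea without the intermediate reset_index step, the stripped column names can be used directly as group keys. The toy data and the names country and combined are made up for illustration; only the Australia, Australia.1 naming pattern comes from the question.

```python
import pandas as pd

# Toy data (hypothetical); column names mimic the "Australia", "Australia.1" pattern.
df = pd.DataFrame({
    "Australia": [1, 2],
    "Australia.1": [3, 4],
    "Canada": [5, 6],
})

# Strip a trailing ".<digits>" suffix so all province columns share one key.
country = df.columns.str.replace(r"\.\d+$", "", regex=True)

# Group the transposed rows by country, sum them, and transpose back.
combined = df.T.groupby(country).sum().T
print(combined)
```

Here combined keeps the original row index and has one summed column per country.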

Related

Python pandas - series to dataframe

How do I print out only the country names that exist in the dataframe among series with country names as index?
The following will filter for rows with an index value that is also in the index of df2 using .isin().
df1.loc[df1.index.isin(df2.index)]
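A minimal sketch of that filter on made-up data, assuming df1 is a Series indexed by country name and df2 is a DataFrame with a country index (the values and the names mask and matches are hypothetical):

```python
import pandas as pd

# Hypothetical data: df1 is a Series indexed by country name, df2 a DataFrame.
df1 = pd.Series([10, 20, 30], index=["France", "Japan", "Peru"])
df2 = pd.DataFrame({"value": [1, 2]}, index=["Japan", "Peru"])

# Boolean mask: True where a label in df1's index also appears in df2's index.
mask = df1.index.isin(df2.index)
matches = df1.loc[mask]

# Only the country names that exist in both.
print(matches.index.tolist())
```

If only the names are needed, df1.index.intersection(df2.index) would be another option.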

Split distinct values in a column into multiple columns

I want to create a DataFrame that breaks down the genres of movies into separate columns, with each individual genre column having a value of 1 for movies that are of that genre.
That is, I want to go from the movie dataframe to a dataframe with a distinct column created per genre, 1 for true and 0 for false.
I'm using Databricks PySpark.
many thanks!
I would first collect the unique values of the dataframe column into a list, and then iterate over that list.
The dataframe is called df here:
from pyspark.sql import functions as F
unique_vals = df.select('genres').distinct().rdd.flatMap(lambda x: x).collect()
Now let's iterate over the list, adding one 0/1 column per genre:
df1 = df
for i in unique_vals:
    # 1 where the row's genre equals i, otherwise 0
    df1 = df1.withColumn(i, F.when(F.col('genres') == i, 1).otherwise(0))
df1.show()
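I think a pivot would also work. Pass the column(s) that identify a movie to groupby(); with no grouping columns, all rows collapse into a single output row:
from pyspark.sql.functions import lit
df.groupby().pivot('genres').agg(lit(1)).fillna(0)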

Select rows in a pandas dataframe based on condition from another dataframe with a different size

Consider a 100x200 dataframe (called df1) representing clinical data from 100 patients. Each patient can be identified through one number in column 'ID' and another number in column 'CENTER'.
Now, consider a second 40x170 dataframe, df2, containing data from a subset of 40 patients randomly selected from df1 and tested 6 months later on different variables. Like df1, df2 contains the columns 'ID' and 'CENTER'. I am trying to select these 40 patients in df1 based on their ID and CENTER numbers, but can't find an easy way to do so using pandas. Any ideas?
You could try this:
df3 = df1[df1.ID.isin(df2.ID) & df1.CENTER.isin(df2.CENTER)]
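One caveat, sketched below on made-up data: two separate isin checks test ID and CENTER independently, so a row can pass by matching the ID of one patient and the CENTER of another. If a patient is identified by the (ID, CENTER) pair, which is an assumption about the data, an inner merge on both columns may be closer to what is wanted. The names loose, keys and strict and the score column are hypothetical.

```python
import pandas as pd

# Hypothetical data: the same ID values are reused across centers.
df1 = pd.DataFrame({
    "ID": [1, 2, 1, 2],
    "CENTER": [10, 10, 20, 20],
    "score": [0.1, 0.2, 0.3, 0.4],
})
df2 = pd.DataFrame({"ID": [1, 2], "CENTER": [10, 20]})

# Independent checks: every row passes, because each ID and each CENTER
# occurs somewhere in df2, even if not together.
loose = df1[df1.ID.isin(df2.ID) & df1.CENTER.isin(df2.CENTER)]

# Pairwise match: an inner merge on both keys keeps only the
# (ID, CENTER) combinations that actually occur in df2.
keys = df2[["ID", "CENTER"]].drop_duplicates()
strict = df1.merge(keys, on=["ID", "CENTER"], how="inner")

print(len(loose), len(strict))
```

When IDs are unique across all centers, both approaches select the same rows.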

How can I groupby and aggregate pandas dataframe with many columns

I am working on a pandas dataframe with 168 columns. The first three columns contain the name of the country, the latitude and the longitude. The rest of the columns contain numerical data. Each row represents a country, but for some countries there are multiple rows. I need to aggregate those rows by summing. I can aggregate the first three columns with the following code:
df = df.groupby('Country', as_index=False).agg({'Lat':'first','Long':'first'})
However, I couldn't find a way to include the remaining 165 columns in that code without explicitly writing out all the column names. In addition, the column names represent dates and are named like 5/27/20, 5/28/20, 5/29/20, etc., so I need to keep the column names.
How can I do that? Thanks.
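Maybe you can generate the dictionary from the column names, keeping 'first' for Lat and Long and using 'sum' for the remaining columns (the grouping column Country is left out of the dictionary):
df = df.groupby('Country', as_index=False).agg({c: 'first' if c in ('Lat', 'Long') else 'sum' for c in df.columns if c != 'Country'})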
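To see the effect on a small table, here is a sketch with made-up values. It assumes the coordinate columns are named Lat and Long as in the question, that every other non-key column is a date column to be summed, and it uses the hypothetical names how and result.

```python
import pandas as pd

# Made-up data: two rows for Australia, one for Chile.
df = pd.DataFrame({
    "Country": ["Australia", "Australia", "Chile"],
    "Lat": [-35.0, -33.0, -30.0],
    "Long": [149.0, 151.0, -71.0],
    "5/27/20": [1, 2, 3],
    "5/28/20": [4, 5, 6],
})

# 'first' for the coordinates, 'sum' for every remaining (date) column.
# The grouping key itself is excluded from the mapping.
how = {
    c: "first" if c in ("Lat", "Long") else "sum"
    for c in df.columns
    if c != "Country"
}

result = df.groupby("Country", as_index=False).agg(how)
print(result)
```

The date column names are carried over unchanged, since they are only used as dictionary keys.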

How do I create a new variable in my dataframe filling the values with the dataframe name?

I have a bunch of datasets with the same headers, each referring to a different country.
I am trying to create a new column in each of the pandas dataframes that is filled with the dataframe's name (which is the name of the country!).
How do I do it?
EDIT:
I failed to mention that I created the datasets like this:
us = pd.concat([coeff, pvalues], axis = 1).reset_index()
us.columns = ['Factor',"Coeff","P-value"]
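Before you concat/join your dataframes, add a new column to each one with the country's name as its value, then concat. This assumes each dataframe has a name attribute that you set yourself (pandas does not provide one):
print(df.name)
>>> Iran
print(df2.name)
>>> United States of America
df['Name'] = df.name
df2['Name'] = df2.name
# the frames share the same headers, so stack them row-wise (the default axis=0) rather than side by side
countryDF = pd.concat([df, df2], ignore_index=True)
I don't know what further manipulation you want to do (e.g. cutting out columns).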
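Since a DataFrame does not know the name of the variable it is bound to, one possible sketch keeps the country names as dictionary keys instead. The frames dictionary, its values, and the names labelled and country_df are hypothetical; only the Factor and Coeff headers come from the question.

```python
import pandas as pd

# Hypothetical per-country frames with identical headers, keyed by country name.
frames = {
    "Iran": pd.DataFrame({"Factor": ["a", "b"], "Coeff": [0.1, 0.2]}),
    "United States of America": pd.DataFrame({"Factor": ["a", "b"], "Coeff": [0.3, 0.4]}),
}

# assign() returns a copy with the new column, leaving the originals untouched.
labelled = [frame.assign(Name=country) for country, frame in frames.items()]

# Stack the rows (axis=0 is the default) and renumber the index.
country_df = pd.concat(labelled, ignore_index=True)
print(country_df)
```

Keeping the names in a dictionary avoids relying on a custom attribute, which pandas may not preserve through operations that return a new DataFrame.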
