I have a series with country names as its index. How do I print only the country names that also exist in the dataframe?
The following uses .isin() to keep only the rows of df1 whose index value also appears in the index of df2:
df1.loc[df1.index.isin(df2.index)]
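As a minimal sketch of how this behaves, here is the same filter on made-up data. The country names, values, and the variable names s, mask and matched are assumptions for illustration only; the question does not show its data.

```python
import pandas as pd

# Hypothetical data: a Series indexed by country name and a smaller DataFrame.
s = pd.Series([10, 20, 30], index=["France", "Japan", "Brazil"], name="value")
df2 = pd.DataFrame({"pop": [67, 125]}, index=["France", "Japan"])

# Boolean mask: True where the Series' index label also appears in df2's index.
mask = s.index.isin(df2.index)

# Keep only the matching rows, then print just the country names.
matched = s.loc[mask]
print(matched.index.tolist())
```

If only the names are needed, s.index.intersection(df2.index) should give the same labels without filtering the values.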
Related
I have two dataframes (df3 and dfA): df3 contains unique ISIN values only, while dfA contains several rows per ISIN.
I want to add the column 'Trdr' from dfA to df3, matching on the column 'ISIN'.
The issue is that dfA has multiple rows with the same ISIN, each with a 'Trdr' value. So when I merge the datasets, I get one row for every dfA row that shares that ISIN.
In layman's terms, I want a VLOOKUP: take the first matching value that pandas finds in dfA and assign it to df3.
df3 = pd.merge(df3, dfA[['Trdr', 'ISIN']], how="left", on='ISIN')
df3 before the merge, showing unique ISINs only
dfA, showing the Trdr values I'm trying to merge
df3 after the merge, showing multiple rows for VODAFONE rather than one unique row as in the initial screenshot
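One possible way to get VLOOKUP-like behaviour is to reduce dfA to one row per ISIN before merging. The sketch below assumes that "first" means the first row in dfA's current order; the ISIN strings, names and trader codes are placeholders, not the poster's data, and first_trdr and merged are hypothetical variable names.

```python
import pandas as pd

# Placeholder data: df3 has one row per ISIN, dfA has repeated ISINs.
df3 = pd.DataFrame({
    "ISIN": ["XX0000000001", "XX0000000002"],
    "Name": ["VODAFONE", "OTHER"],
})
dfA = pd.DataFrame({
    "ISIN": ["XX0000000001", "XX0000000001", "XX0000000002"],
    "Trdr": ["AB", "CD", "EF"],
})

# Keep only the first Trdr seen for each ISIN, so the lookup table is unique.
first_trdr = dfA[["ISIN", "Trdr"]].drop_duplicates(subset="ISIN", keep="first")

# A left merge against a unique key cannot add rows to df3.
merged = df3.merge(first_trdr, how="left", on="ISIN")
print(merged)
```

Because the right-hand side now has unique ISINs, the merged frame keeps the same number of rows as df3.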
I have a pandas dataframe in which some countries are spread over several columns.
There are multiple columns for Australia, one per province, titled Australia, Australia.1, Australia.2 and so on. The same is true for other countries such as the USA, the UK and Canada. I want only one column for each of these countries. For example, I want a single column named Australia holding the sum of the values across its provinces, with no duplicate column names. How can I do this with a pandas dataframe in Python?
You can transpose the dataframe and reset the index. Then strip the trailing period and number from each name and group by that column, which in my example is called index:
df = df.T.reset_index()
df['index'] = df['index'].str.replace(r'\.\d+$', '', regex=True)
df.iloc[:,2:] = df.iloc[:,2:].astype(float)
df = df.groupby('index').sum()
df
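The same idea can be sketched end to end on a small made-up frame. This variant groups the transposed frame by the cleaned names directly instead of resetting the index, and it assumes every column is numeric. The column names, values, and the names base_names and summed are hypothetical.

```python
import pandas as pd

# Made-up data: province-level columns carry a ".N" suffix.
df = pd.DataFrame({
    "Australia": [1, 2],
    "Australia.1": [3, 4],
    "Canada": [5, 6],
    "Canada.1": [7, 8],
    "Japan": [9, 10],
})

# Strip a trailing ".<digits>" so every province maps to its country name.
base_names = df.columns.str.replace(r"\.\d+$", "", regex=True)

# Transpose so columns become rows, sum rows sharing a name, transpose back.
summed = df.T.groupby(base_names).sum().T
print(summed)
```

Anchoring the pattern with $ avoids touching a name that merely contains a period followed by digits somewhere in the middle.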
I am working on a pandas dataframe with 168 columns. The first three columns contain the name of the country, its latitude and its longitude. The rest of the columns contain numerical data. Each row represents a country, but for some countries there are multiple rows. I need to aggregate those rows by summing. I can aggregate the first three columns with the following code:
df = df.groupby('Country', as_index=False).agg({'Lat':'first','Long':'first'})
However, I couldn't find a way to include in that code remaining 165 columns without explicitly writing all the column names. In addition, column names represent dates and are named like 5/27/20,5/28/20,5/29/20, etc. So I need to keep the column names.
How can I do that? Thanks.
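Maybe you can generate the dictionary from the column names, keeping 'first' for Lat and Long and using 'sum' for every date column:
agg_map = {c: 'first' if c in ('Lat', 'Long') else 'sum' for c in df.columns if c != 'Country'}
df = df.groupby('Country', as_index=False).agg(agg_map)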
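A small runnable sketch of generating the aggregation dictionary from the column names, assuming (as the question describes) that Lat and Long should take the first value and every date column should be summed. The data and the names agg_map and out are made up for illustration.

```python
import pandas as pd

# Made-up data: country "A" appears twice and should collapse to one row.
df = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Lat": [1.0, 1.5, 2.0],
    "Long": [3.0, 3.5, 4.0],
    "5/27/20": [1, 2, 3],
    "5/28/20": [4, 5, 6],
})

# Build the mapping from the existing column names; the group key is skipped.
agg_map = {
    c: "first" if c in ("Lat", "Long") else "sum"
    for c in df.columns
    if c != "Country"
}

out = df.groupby("Country", as_index=False).agg(agg_map)
print(out)
```

The date column names pass through unchanged because they are used as the dictionary keys.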
How do I sort a dataframe by two columns, so that rows are ordered by the first column and the second column is consulted only when the first-column values are equal?
Use sort_values() on the dataframe and pass both columns in priority order:
df.sort_values(by=['col1', 'col2'])
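To illustrate the tie-breaking, here is a minimal sketch on made-up values. The column names col1 and col2 follow the answer above; the data and the name sorted_df are assumptions.

```python
import pandas as pd

# Made-up data with repeated values in col1.
df = pd.DataFrame({"col1": [2, 1, 2, 1], "col2": [1, 9, 0, 3]})

# col1 is the primary key; col2 only decides the order within equal col1 values.
sorted_df = df.sort_values(by=["col1", "col2"])
print(sorted_df)
```

sort_values() returns a new dataframe and keeps the original index labels; passing ignore_index=True renumbers the rows instead.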
I am using a dask dataframe with Python 2.7 and want to drop duplicated index values from my df.
With pandas I would use
df = df[~df.index.duplicated(keep = "first")]
and it works.
When I try the same with a dask dataframe I get
AttributeError: 'Index' object has no attribute 'duplicated'
I could reset the index and then use the column that was the index to drop the duplicates, but I would like to avoid that if possible.
I could use df.compute() and then drop the duplicated index values, but this df is too big for memory.
How can I drop the duplicated index values from my dataframe using a dask dataframe?
I think you need to convert the index to a Series with to_series(). keep='first' can be omitted because it is the default for duplicated():
df = df[~df.index.to_series().duplicated()]