Combining two pandas dataframes based on column AND row VALUES - python

First, I haven't found this asked before - probably because I'm not using the right words to ask it. So if it has been asked, please send me in that direction.
How can I combine two pandas data frames based on column AND row values? My main dataframe has a column 'year' and a column 'county', among others. Ideally, I want to add a 'percent' column taken from the second data frame below.
For example, I have this image of my first df:
and I have another data frame with the same 'year' column and every other column name is a string value in the original "main" dataframe's 'county' column:
How can I combine these two data frames in a way that adds another column to the 'main df'? It would be helpful to first put the second data frame in the format where there are three columns: 'year', 'county', and 'percent'. If anyone can help me with this part, I can merge it.

I think what you will want to do is transform the second dataframe so that it has a row for each year/county combination, and then use a left join to combine the two. I believe the `melt` method will do this transformation. Try this:
# Reshape the second dataframe from wide to long: one row per year/county pair
melted_second_df = second_df.melt(id_vars=["year"], var_name="county", value_name="percent")

# Left join on both keys keeps every row of the main dataframe
combined_df = first_df.merge(
    right=melted_second_df,
    on=["year", "county"],
    how="left",
)
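For example, with hypothetical frames shaped like the ones in the images (column names other than 'year', 'county', and 'percent' are invented here purely for illustration), the two steps would look like this:

import pandas as pd

# A stand-in for the main dataframe: one row per year/county observation.
first_df = pd.DataFrame({
    "year": [2000, 2000, 2001],
    "county": ["Adams", "Butler", "Adams"],
    "population": [10000, 8000, 10500],
})

# A stand-in for the second dataframe: one column per county.
second_df = pd.DataFrame({
    "year": [2000, 2001],
    "Adams": [0.12, 0.13],
    "Butler": [0.08, 0.09],
})

# Wide -> long, then attach 'percent' to the main dataframe.
melted_second_df = second_df.melt(id_vars=["year"], var_name="county", value_name="percent")
combined_df = first_df.merge(melted_second_df, on=["year", "county"], how="left")
print(combined_df)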

Related

Multiplying the same column number for two different data frames single expression

Is there a way to multiply the 1st column from one df by the 1st column in a second df, then the 2nd column from one df by the 2nd column in the second df, and so on and so forth?
I can do it in a for loop, but was wondering if there was a way to do it in a single expression.
Thank you!
This is what I have so far, but it isn't working; I just get NaNs:
bfAvg = (tankbase.iloc[:,:4].multiply(tankwater.iloc[:,:4],axis=0))
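The all-NaN result usually comes from label alignment: DataFrame arithmetic lines columns up by name, so two frames whose first four columns have different names multiply to NaN everywhere. A minimal sketch of one common workaround, using made-up frames in place of tankbase and tankwater:

import pandas as pd

# Invented stand-ins whose column labels deliberately differ.
tankbase = pd.DataFrame([[1, 2, 3, 4]], columns=["b1", "b2", "b3", "b4"])
tankwater = pd.DataFrame([[10, 20, 30, 40]], columns=["w1", "w2", "w3", "w4"])

# Mismatched labels align to nothing, so this is all NaN:
all_nan = tankbase.iloc[:, :4].multiply(tankwater.iloc[:, :4])

# Dropping the labels on one side multiplies strictly by position instead:
bfAvg = tankbase.iloc[:, :4].multiply(tankwater.iloc[:, :4].to_numpy())
print(bfAvg)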

GroupBy using select columns with apply(list) and retaining other columns of the dataframe

data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc@gmail.com', 'pqr@hotmail.com', 'xyz@yahoo.com', 'abc@gmail.com', 'pqr@hotmail.com', 'xyz@gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb']}
df = pd.DataFrame(data, columns=['order_num', 'email', 'product_code'])
My data frame looks something like this:
Image of data frame
For the sake of simplicity, I omitted the other columns while making the example. What I need to do is groupby on the column called order_num, apply(list) on product_code, sort the groups based on a timestamp column, and retain columns like email as they are.
I tried doing something like:
df.groupby(['order_num', 'email', 'timestamp'])['product_code'].apply(list).sort_values(by='timestamp').reset_index()
Output: Expected output appearance
but I do not wish to group by the other columns. Is there any alternative way to perform the list operation? I tried using transform but it threw a size-mismatch error, and I don't think it's the right way to go either.
If there are a lot of other columns and you need to group by order_num only, use Series.map to build a new column filled with lists, then remove duplicates with DataFrame.drop_duplicates on order_num, and finally sort if necessary:
df['product_code'] = df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
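Put together with the sample data from the question, a self-contained version looks like this (the final sort is shown commented out because the example frame has no timestamp column):

import pandas as pd

data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc@gmail.com', 'pqr@hotmail.com', 'xyz@yahoo.com',
                  'abc@gmail.com', 'pqr@hotmail.com', 'xyz@gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb']}
df = pd.DataFrame(data, columns=['order_num', 'email', 'product_code'])

# Map each order_num to the full list of its product codes...
df['product_code'] = df['order_num'].map(
    df.groupby('order_num')['product_code'].apply(list)
)
# ...then keep one row per order so columns like email survive untouched.
df = df.drop_duplicates('order_num')
# df = df.sort_values(by='timestamp')  # only once a real timestamp column exists
print(df)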

pandas max function results in inoperable DataFrame

I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, does anyone have a better way to generate this new DataFrame?
It is a Series; for a one-column DataFrame use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
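A short sketch with a made-up df1 shows the difference: max(axis=1) returns a Series with no name (which is why the column shows up as 0 when it is turned into a frame), while to_frame gives a one-column DataFrame with a real label that is easy to inspect and rename (the names 'maximum' and 'peak' below are arbitrary):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 5], "b": [3, 2], "c": [4, 4], "d": [0, 9]})

row_max = df1.max(axis=1)          # a Series, not a DataFrame
df2 = row_max.to_frame('maximum')  # one-column DataFrame with a proper column label

print(df2.dtypes)                              # the dtype can now be inspected
df2 = df2.rename(columns={'maximum': 'peak'})  # ...and the column renamed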

Replacing one column onto another in python/pandas, but keeping the replaced column's values if the replacing column has a NaN value?

I have two columns in a data frame that I want to merge together. The attached image shows the columns:
Image of the two columns I want to merge
I want the "precio_uf_y" column to take precedent over the "precio_uf_x" column a new column, but if there is a NaN value in the "precio_uf_y" column I want the value in the "precio_uf_x" column to go to the new column. My ideal new merged column would look like this:
Desired new column
I have tried different merge functions, and taking min and max with numpy, but maybe there is a way to write a function with these parameters?
Thank you in advance for any help.
You can use df.apply with axis=1:
import numpy as np

def get_new_val(x):
    # Fall back to precio_uf_x whenever precio_uf_y is missing
    if np.isnan(x.precio_uf_y):
        return x.precio_uf_x
    else:
        return x.precio_uf_y

df["new_precio_uf"] = df.apply(get_new_val, axis=1)

I have a pandas dataframe which I would like to be sliced after every 4 columns

I have a pandas dataframe which I would like to slice after every 4 columns and then stack the slices vertically on top of each other, keeping the date as the index. Is this possible by using np.vstack()? Thanks in advance!
ORIGINAL DATAFRAME
Please refer to the image for the dataframe.
I want something like this:
WANT IT MODIFIED TO THIS
Until you provide a Minimal, Complete, and Verifiable example, I will not test this answer, but the following should work:
Given that we have the data stored in a pandas DataFrame called df, we can use pd.melt:
moltendfs = []
for i in range(4):
    moltendfs.append(df.iloc[:, i::4].reset_index().melt(id_vars='date'))

newdf = pd.concat(moltendfs, axis=1)
We use iloc to take only every fourth column, starting with the i-th column. Then we reset_index in order to be able to keep the date column as our identifier variable. We use melt in order to melt our DataFrame. Finally we simply concatenate all of these molten DataFrames together side by side.
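If the goal is taken literally (blocks of four consecutive columns stacked on top of each other, as the np.vstack wording suggests), a pandas-only sketch with an invented one-row frame would be:

import pandas as pd

# Invented frame: a date index and eight columns, i.e. two blocks of four.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7, 8]],
    index=pd.to_datetime(["2020-01-01"]),
    columns=list("abcdefgh"),
)
df.index.name = "date"

chunks = []
for start in range(0, df.shape[1], 4):
    chunk = df.iloc[:, start:start + 4].copy()
    chunk.columns = range(chunk.shape[1])  # give every block the same labels
    chunks.append(chunk)

stacked = pd.concat(chunks, axis=0)  # blocks end up underneath one another
print(stacked)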
