Melt dataframe with first two rows as variables - python

Apologies, but this has me stumped. I thought I could pass the following dataframe into a simple pd.melt, using iloc to reference my variables, but it wasn't working for me (I'll post the error below).
Sample df:
  Date      0151        0561        0522    0912
0 Date      AVG Review  AVG Review  Review  Review
1 Date      NaN         NaN         NaN     NaN
2 01/01/18  2           2.5         4       5
So as you can see, my IDs are in the top row, the type of review is in the second row, the dates sit in the first column, and the review observations are in the rows by date.
What I'm trying to do is melt this df to get the following:
ID, Date, Review, Score
0151, 01/01/18, Average Review, 2
I thought I could be cheeky and just pass the following:
pd.melt(df, id_vars=[df.iloc[0]], value_vars=df.iloc[1])
but this threw the error 'Series' objects are mutable, thus they cannot be hashed.
I've had a look at similar answers using pd.melt, and perhaps reshape or unpivot, but I'm lost on how I should proceed.
Any help is much appreciated.
Edit for Nixon:
My first row has my unique IDs.
The 2nd row has my observation, which in this case is a type of review (average, normal).
The 3rd row onwards has the variables assigned to the above observation - let's call this score.
The 1st column has my dates, which have the score across by row.

An alternative to pd.melt is to set your rows as column levels of a multiindex and then stack them. Your metadata will be stored as an index rather than column though. Not sure if that matters.
df = pd.DataFrame([
    ['Date', '0151', '0561', '0522', '0912'],
    ['Date', 'AVG Review', 'AVG Review', 'Review', 'Review'],
    ['Date', 'NaN', 'NaN', 'NaN', 'NaN'],
    ['01/01/18', 2, 2.5, 4, 5],
])

df = df.set_index(0)
df.index.name = 'Date'

# Use the first two rows (ID and review type) as the column levels
df.columns = pd.MultiIndex.from_arrays([df.iloc[0, :], df.iloc[1, :]], names=['ID', 'Review'])

# Drop the header/placeholder rows, then stack both column levels into the index
df = df.drop(df.index[[0, 1, 2]])
df.stack('ID').stack('Review')
Output:
Date      ID    Review
01/01/18  0151  AVG Review      2
          0522  Review          4
          0561  AVG Review    2.5
          0912  Review          5
dtype: object
You can easily revert index to columns with reset_index.
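For example (a minimal sketch, assuming the df built above; out and Score are names made up for illustration):
out = df.stack('ID').stack('Review').reset_index(name='Score')
# out now has flat columns: Date, ID, Review, Score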

Related

How to get calendar years as column names and month and day as index for one timeseries

I have looked for solutions but can't seem to find any that point me in the right direction; hopefully someone here can help. I have a stock price dataset with a frequency of Month Start. I am trying to get an output where the calendar years are the column names and the day and month are the index (there will only be 12 rows since it is monthly data). The rows will be filled with the stock prices corresponding to the year and month. Unfortunately I have no code, since I have looked at for loops, groupby, etc. but can't seem to figure this one out.
You might want to split the date into month and year and apply a pivot:
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, month=s.month)
       .pivot_table(index='month', columns='year', values='Close', fill_value=0)
      )
output:
year   2003  2004
month
1         0     2
2         0     3
3         0     4
12        1     0
Used input:
df = pd.DataFrame({'Close': [1, 2, 3, 4]},
                  index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
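If you want the index to show month and day (e.g. 12-01) rather than just the month number, a variant of the same idea (just a sketch; monthday is a made-up column name) would be:
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, monthday=s.strftime('%m-%d'))
       .pivot_table(index='monthday', columns='year', values='Close', fill_value=0)
      )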
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd

# Test dataframe
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
                   'Close': [6.661, 7.053, 6.625, 8.999]})

# Split each date string into a list of the form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))

# Separate the date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)

# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
       Close
Year    2003   2004
Date
01-01    NaN  7.053
02-01    NaN  6.625
12-01  6.661  8.999

Pandas: How to delete duplicates in rows and do multiple topic matching

I have the following dataframe dfstart where the first column holds different comments containing a variety of different topics. The labels column contains keywords that are associated with the topics.
Using a second dataframe matchlist
I want to create the final dataframe dffinal where for each comment you can see both the labels and the topics that occur in that comment. I also want the labels to only occur once per row.
I tried eliminating the duplicate labels through a for loop:
for label in matchlist['label']:
    if dfstart[label[n]] == dfstart[label[n-1]]:
        dfstart['label'] == np.nan
However, this doesn't seem to work. Further, I have managed to merge dfstart with matchlist to get the first topic displayed in the dataframe. The code I used for that is:
df2 = pd.merge(df, matchlist, on='label1')
Of course, I could keep renaming the label column in matchlist and keep repeating the process, but this would take a long time and would not be efficient because my real dataframe is much larger than this toy example. So I was wondering if there was a more elegant way to do this.
Here are three toy dataframes:
import numpy as np
import pandas as pd

d = {'comment': ["comment1", "comment2", "comment3"],
     'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"]}
dfstart = pd.DataFrame(data=d)
dfstart[['label1', 'label2', 'label3']] = dfstart.label.str.split(",", expand=True)

d3 = {'label': ["boxing", "election", "rain"], 'topic': ["sport", "politics", "weather"]}
matchlist = pd.DataFrame(data=d3)

d2 = {'comment': ["comment1", "comment2", "comment3"],
      'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"],
      'label1': ["boxing", "boxing", "election"],
      'label2': ["election", np.nan, "rain"],
      'label3': ["rain", np.nan, np.nan],
      'topic1': ["sports", "sports", "politics"],
      'topic2': ["politics", np.nan, "weather"],
      'topic3': ["weather", np.nan, np.nan]}
dffinal = pd.DataFrame(data=d2)
Thanks for your help!
Use str.extractall instead of str.split so you can obtain all matches in one go, then flatten the results, map them to your matchlist, and finally concat everything together:
d = {'comment': ["comment1", "comment2", "comment3"],
     'label': ["boxing, election, rain", "boxing, boxing", "election, rain, election"]}
df = pd.DataFrame(d)
matchlist = pd.DataFrame({'label': ["boxing", "election", "rain"],
                          'topic': ["sport", "politics", "weather"]})

s = matchlist.set_index("label")["topic"]

# One named capture group per known label; keep the first match of each per comment
found = (df["label"].str.extractall("|".join(f"(?P<label{num}>{i})" for num, i in enumerate(s.index, 1)))
                    .groupby(level=0).first())

print(pd.concat([df, found,
                 found.apply(lambda d: d.map(s))
                      .rename(columns={f"label{i}": f"topic{i}" for i in range(1, 4)})], axis=1))
    comment                     label  label1    label2 label3 topic1    topic2   topic3
0  comment1    boxing, election, rain  boxing  election   rain  sport  politics  weather
1  comment2            boxing, boxing  boxing       NaN    NaN  sport       NaN      NaN
2  comment3  election, rain, election     NaN  election   rain    NaN  politics  weather
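If you prefer to stay closer to your split-based approach, here is a rough alternative sketch (not from the answer above; mapping, dedupe, label_cols and topic_cols are made-up helper names). It de-duplicates the labels per comment first, then maps them through matchlist:
import pandas as pd

# Assumes dfstart and matchlist as defined in the question
mapping = matchlist.set_index('label')['topic']

def dedupe(labels):
    # Split on commas, strip whitespace, keep only the first occurrence of each label
    return list(dict.fromkeys(part.strip() for part in labels.split(',')))

deduped = dfstart['label'].apply(dedupe)

# Expand the de-duplicated lists into label1, label2, ... columns
label_cols = (pd.DataFrame(deduped.tolist(), index=dfstart.index)
                .rename(columns=lambda i: f'label{i + 1}'))

# Map each label column to its topic and rename label1 -> topic1, etc.
topic_cols = (label_cols.apply(lambda col: col.map(mapping))
                        .rename(columns=lambda c: c.replace('label', 'topic')))

dffinal = pd.concat([dfstart[['comment', 'label']], label_cols, topic_cols], axis=1)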

How to handle records in dataframe with same ID but some different values in columns in python

I am working on a dataframe in pandas with bank (loan) details for customers. There is a problem because some unique loan IDs have been recorded twice, with different values for some of the features. I am attaching a screenshot to be more specific.
Now you see, for instance, that this unique Loan ID has been recorded 2 times. I want to drop the second one with the NaN values, but I can't do it manually because there are 4900 similar cases. Any idea?
The problem is not the NaN values, the problem is the double records. I want to drop rows with NaN values only for the double records, not for the entire dataframe.
Thanks in advance.
Count how many rows there are per ID, and then drop the NaN rows only where that count is greater than 1:
# Count how many rows share the same Loan ID / Credit ID
df['flag'] = df.groupby(['Loan ID', 'Credit ID'])['Loan ID'].transform('count')
# Keep single records as-is; for duplicated IDs, keep only the rows with Credit Score and Annual Income present
df = df.loc[(df['flag'] == 1) | df[['Credit Score', 'Annual Income']].notna().all(axis=1)].drop('flag', axis=1)
Alternatively, instead of dropping the NaN rows, just take the rows where Credit Score is not NaN:
df = df[df['Credit Score'].notna()]
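Another rough sketch (this assumes Loan ID alone identifies a record, which may not hold for your data): sort so the most complete row of each loan comes first, then keep one row per Loan ID.
# Keep the row with the fewest missing values for each Loan ID
df = (df.assign(nan_count=df.isna().sum(axis=1))
        .sort_values('nan_count')
        .drop_duplicates(subset='Loan ID', keep='first')
        .drop(columns='nan_count')
        .sort_index())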

Pandas: convert to time series with frequency counts + maintaining index

I currently have a pandas df that looks like:
Company  Date      Title
Apple    1/2/2020  Sr. Exec
Google   2/2/2020  Manager
Google   2/2/2020  Analyst
How do I get it to maintain the index while counting the frequency of 'Title' per date, as shown below?
Company  1/2/2020  2/2/2020
Apple    1         0
Google   0         2
I've tried using groupby() on the date, but it doesn't break the dates out into the top row, and I need to export the resulting df to CSV, so groupby didn't work.
It looks like what you want is a pivot table:
pivot = df.pivot_table(
    index="Company",
    columns="Date",
    values="Title",
    aggfunc=len,
    fill_value=0
).reset_index()
A quick explanation of what is happening here:
Rows will be made for each unique value in the 'Company' column.
Values from the 'Date' column will become column headers.
We want to count how frequently a title occurs at a given date in a given company, so we set 'Title' as the value and pass the aggfunc (aggregation function) len to tell pandas to count the values.
Since there could be an instance where Google doesn't have any analysts on the 20th of February 2020, we supply a fill_value of 0, preventing empty (null) values.
Finally, we reset the index so that 'Company' is just a column, not the index of the dataframe.
You will end up with a new index, but this is inevitable since you will no longer have rows with duplicated values in the 'Company' column.
The pivot_table method is extremely powerful. Look here for the full documentation.
Like this:
(pd.pivot_table(df, index='Company', columns='Date', values='Title', aggfunc='count')
   .reset_index()
   .rename_axis(None, axis=1)
   .fillna(0))
Output:
  Company  1/2/2020  2/2/2020
0   Apple       1.0       0.0
1  Google       0.0       2.0
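If a plain count of rows per Company and Date is all you need, a shorter sketch (not part of either answer above) is pd.crosstab:
# Cross-tabulate Company against Date; each cell is the number of matching rows (i.e. titles)
counts = pd.crosstab(df['Company'], df['Date']).reset_index()
print(counts)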

Extend a pandas dataframe to include 'missing' weeks

I have a pandas dataframe which contains time series data, so the index of the dataframe is of type datetime64 at weekly intervals; each date falls on the Monday of each calendar week.
There are only entries in the dataframe when an order was recorded, so if no order was placed in a week, there is no corresponding record in the dataframe. I would like to "pad" this dataframe so that every week in a given date range is included, with a corresponding zero quantity entered.
I have managed to get this working by creating a dummy dataframe which includes an entry for each week that I want with a zero quantity, then merging the two dataframes and dropping the dummy column. This results in a third, padded dataframe.
I don't feel this is a great solution to the problem, and being new to pandas I wanted to know if there is a more specific and/or pythonic way to achieve this, probably without having to create a dummy dataframe and then merge.
The code I used for my current solution is below:
# Create the dummy product
# Week holds the week date of the order; we want to set this as the index later
group_by_product_name = df_all_products.groupby(['Week', 'Product Name'])['Qty'].sum()
first_date = group_by_product_name.head(1)          # First date in the entire dataset
last_date = group_by_product_name.tail().index[-1]  # Last date in the dataset

bdates = pd.bdate_range(start=first_date, end=last_date, freq='W-MON')
qty = np.zeros(bdates.shape)
dummy_product = {'Week': bdates, 'DummyQty': qty}
df_dummy_product = pd.DataFrame(dummy_product)
df_dummy_product.set_index('Week', inplace=True)

group_by_product_name = df_all_products.groupby('Week')['Qty'].sum()
df_temp = pd.concat([df_dummy_product, group_by_product_name], axis=1, join='outer')
df_temp.fillna(0, inplace=True)
df_temp.drop(columns=['DummyQty'], axis=1, inplace=True)
The problem with this approach is that sometimes (I don't know why) the indexes don't match correctly; I think the dtype of the index on one of the dataframes somehow becomes object instead of staying datetime64. So I am sure there is a better way to solve this problem than my current solution.
EDIT
Here is a sample dataframe with "missing entries"
df1 = pd.DataFrame({'Week': ['2018-05-28', '2018-06-04', '2018-06-11', '2018-06-25'],
                    'Qty': [100, 200, 300, 500]})
df1.set_index('Week', inplace=True)
df1.head()
Here is an example of the padded dataframe that contains the additional missing dates within the date range:
df_zero = pd.DataFrame({'Week': ['2018-05-21', '2018-05-28', '2018-06-04', '2018-06-11',
                                 '2018-06-18', '2018-06-25', '2018-07-02'],
                        'Dummy Qty': [0, 0, 0, 0, 0, 0, 0]})
df_zero.set_index('Week', inplace=True)
df_zero.head()
And this is the intended outcome after concatenating the two dataframes
df_padded = pd.concat([df_zero, df1], axis=1, join='outer')
df_padded.fillna(0, inplace=True)
df_padded.drop(columns=['Dummy Qty'], inplace=True)
df_padded.head(6)
Note that the missing entries are added before and between other entries where necessary in the final dataframe.
Edit 2:
As requested, here is an example of what the initial product dataframe would look like:
df_all_products = pd.DataFrame({'Week': ['2018-05-21', '2018-05-28', '2018-05-21', '2018-06-11',
                                         '2018-06-18', '2018-06-25', '2018-07-02'],
                                'Product Name': ['A', 'A', 'B', 'A', 'B', 'A', 'A'],
                                'Qty': [100, 200, 300, 400, 500, 600, 700]})
OK, given your original data you can achieve the expected result by using pivot and resample for any missing weeks, like the following:
# Week needs to be a datetime for resample to work
df_all_products['Week'] = pd.to_datetime(df_all_products['Week'])

results = (df_all_products.groupby(['Week', 'Product Name'])['Qty'].sum()
           .reset_index()
           .pivot(index='Week', columns='Product Name', values='Qty')
           .resample('W-MON').asfreq()
           .fillna(0))
Output results:
Product Name      A      B
Week
2018-05-21    100.0  300.0
2018-05-28    200.0    0.0
2018-06-04      0.0    0.0
2018-06-11    400.0    0.0
2018-06-18      0.0  500.0
2018-06-25    600.0    0.0
2018-07-02    700.0    0.0
So if you want to get the df for Product Name A, you can do results['A'].
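For the simple single-column case from the question (df1), another rough sketch, assuming you know the start and end week, is to reindex against a full weekly range:
# Build the full range of Mondays and reindex, filling missing weeks with 0
df1.index = pd.to_datetime(df1.index)
full_weeks = pd.date_range(start='2018-05-21', end='2018-07-02', freq='W-MON')
df_padded = df1.reindex(full_weeks, fill_value=0)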
