Pythonic way to convert Pandas dataframe from wide to long [duplicate] - python

This question already has answers here:
Pandas Melt Function
(2 answers)
Closed 1 year ago.
I have a JSON file, which I then convert to a Pandas dataframe called stocks. The stocks dataframe is in wide format and I'd like to convert it to long format.
Here's what the stocks dataframe looks like after it's ingested and converted from JSON:
TSLA MSFT GE DELL
0 993.22 320.72 93.19 57.25
I would like to convert the stocks dataframe into the following format:
ticker price
0 TSLA 993.22
1 MSFT 320.72
2 GE 93.19
3 DELL 57.25
Here is my attempt (which works):
stocks = pd.read_json('stocks.json', lines=True).T.reset_index()
stocks.columns = ['ticker', 'price']
Is there a more Pythonic way to do this? Thanks!

pandas provides the melt function for this job.
pd.melt(stocks, var_name="ticker", value_name="price")
# ticker price
#0 TSLA 993.22
#1 MSFT 320.72
#2 GE 93.19
#3 DELL 57.25
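For completeness, here is a self-contained sketch of the melt approach (it builds the wide frame in memory instead of reading stocks.json, so the input construction is my own assumption):
import pandas as pd

# Sketch: the wide frame as shown in the question, built directly
stocks = pd.DataFrame({'TSLA': [993.22], 'MSFT': [320.72], 'GE': [93.19], 'DELL': [57.25]})

# Wide -> long: column names become the 'ticker' values, cells become 'price'
long_df = stocks.melt(var_name='ticker', value_name='price')
print(long_df)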

A perhaps more intuitive method than melt would be to transpose and reset the index:
df = df.T.reset_index().set_axis(['ticker', 'price'], axis=1)
Output:
>>> df
ticker price
0 TSLA 993.22
1 MSFT 320.72
2 GE 93.19
3 DELL 57.25
Edit: oops, saw that the OP already did that! :)
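A related one-liner (my own sketch, not from either answer above): since the wide frame has a single row, you can also pull that row out as a Series and reset its index:
# Sketch: take the single row as a Series, name the index, and reset it
out = stocks.iloc[0].rename_axis('ticker').reset_index(name='price')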

Related

How to get calendar years as column names and month and day as index for one timeseries

I have looked for solutions but found none that point me in the right direction; hopefully someone on here can help. I have a stock price data set with a frequency of Month Start. I am trying to get an output where the calendar years are the column names and the day and month are the index (there will only be 12 rows since it is monthly data). The rows will be filled with the stock prices corresponding to the year and month. Unfortunately, I have no code, since I have looked at for loops, groupby, etc. but can't seem to figure this one out.
You might want to split the date into month and year and apply a pivot:
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, month=s.month)
       .pivot_table(index='month', columns='year', values='Close', fill_value=0)
       )
output:
year   2003  2004
month
1         0     2
2         0     3
3         0     4
12        1     0
Used input:
df = pd.DataFrame({'Close': [1, 2, 3, 4]},
                  index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])
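If you want the index to literally show day and month (e.g. 12-01) rather than the month number, a possible variation (my own sketch, assuming month-start data as in the question) is to pivot on a formatted date string:
# Sketch: use an 'MM-DD' string as the pivot index instead of the month number
s = pd.to_datetime(df.index)
out = (df
       .assign(year=s.year, month_day=s.strftime('%m-%d'))
       .pivot_table(index='month_day', columns='year', values='Close', fill_value=0)
       )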
You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd
# Test Dataframe
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
                   'Close': [6.661, 7.053, 6.625, 8.999]})
# Split datestring into list of form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))
# Separate date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)
# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
       Close
Year    2003   2004
Date
01-01    NaN  7.053
02-01    NaN  6.625
12-01  6.661  8.999
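A small follow-up (my own note, not part of the original answer): passing values='Close' to the final pivot call drops the extra 'Close' level from the column index, so the columns are just the years:
# Sketch: same pivot, but without MultiIndex columns
df = df.pivot(columns='Year', index='Date', values='Close')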

How to replace slow 'apply' method in pandas DataFrame

I have a DataFrame with currency transactions:
import pandas as pd

data = [[1653663281618, -583.8686, 'USD'],
        [1653741652125, -84.0381, 'USD'],
        [1653776860252, -33.8723, 'CHF'],
        [1653845294504, -465.4614, 'CHF'],
        [1653847155140, 22.285, 'USD'],
        [1653993629537, -358.04640000000006, 'USD']]
df = pd.DataFrame(data=data, columns=['time', 'qty', 'currency_1'])
I need to add a new column "balance" that holds, for each row, the sum of the 'qty' column over all previous transactions in the same currency. I have a simple function:
def balance(row):
    table = df[(df['time'] < row['time']) & (df['currency_1'] == row['currency_1'])]
    return table['qty'].sum()

df['balance'] = df.apply(balance, axis=1)
But my real DataFrame is very large, and the .apply method is extremely slow.
Is there a way to avoid using the apply function in this case?
Something like np.where?
You could just use pandas cumsum here:
EDIT
After adding a condition:
I don't know how transform performs compared to apply; I'd say just try it on your real data. I can't think of an easier solution at the moment.
df['balance'] = df.groupby('currency_1')['qty'].transform(lambda x: x.shift().cumsum())
print(df)
time qty currency_1 balance
0 1653663281618 -583.8686 USD NaN
1 1653741652125 -84.0381 USD -583.8686
2 1653776860252 -33.8723 CHF NaN
3 1653845294504 -465.4614 CHF -33.8723
4 1653847155140 22.2850 USD -667.9067
5 1653993629537 -358.0464 USD -645.6217
old answer:
df['Balance'] = df['qty'].shift(fill_value=0).cumsum()
print(df)
time qty currency_1 Balance
0 1653663281618 -583.8686 USD 0.0000
1 1653741652125 -84.0381 USD -583.8686
2 1653776860252 -33.8723 USD -667.9067
3 1653845294504 -465.4614 USD -701.7790
4 1653847155140 22.2850 USD -1167.2404
5 1653993629537 -358.0464 USD -1144.9554
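Another vectorized option (my own sketch, not from the answer above): the per-currency running total minus the current row's qty gives the same "sum of previous transactions", with 0 instead of NaN for the first row of each currency:
# Sketch: cumulative sum per currency, excluding the current row
df['balance'] = df.groupby('currency_1')['qty'].cumsum() - df['qty']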

Grouping dates together by year in Pandas

I have a dataset of property prices and they are currently listed by 'DATE_SOLD'. I'd like to be able to count them by year. The dataset looks like this -
SALE_DATE COUNTY SALE_PRICE
0 2010-01-01 Dublin 343000.0
1 2010-01-03 Laois 185000.0
2 2010-01-04 Dublin 438500.0
3 2010-01-04 Meath 400000.0
4 2010-01-04 Kilkenny 160000.0
This is the code I've tried -
by_year = property_prices['SALE_DATE'] = pd.to_datetime(property_prices['SALE_DATE'])
print(by_year)
I think I'm close but as a biblical noob it's quite frustrating!
Thank you for any help you can provide; this site has been awesome so far in finding little tips and tricks to make my life easier
You are close. As you did, you can use pd.to_datetime to convert your SALE_DATE column to a datetime column. Then group by the year, using dt.year to get the year of each datetime, and call size() on the result, which computes the size of each group, i.e. the number of sales per year.
property_prices['SALE_DATE'] = pd.to_datetime(property_prices['SALE_DATE'])
property_prices.groupby(property_prices.SALE_DATE.dt.year).size()
Which prints:
SALE_DATE
2010 5
dtype: int64
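An equivalent one-liner (my own sketch) uses value_counts on the year and sorts by the index:
# Sketch: same yearly counts, assuming SALE_DATE has already been converted to datetime
property_prices['SALE_DATE'].dt.year.value_counts().sort_index()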
import pandas as pd
sample_dict = {'Date':['2010-01-11', '2020-01-22', '2010-03-12'], 'Price':[1000,2000,3500]}
df = pd.DataFrame(sample_dict)
# Creating 'year' column using the Date column
df['year'] = df.apply(lambda row: row.Date.split('-')[0], axis=1)
# Groupby function
df1 = df.groupby('year')
# Print the first value in each group
df1.first()
Output:
            Date  Price
year
2010  2010-01-11   1000
2020  2020-01-22   2000
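As a side note (my own sketch, not part of the answer above): if the Date column is parsed as datetime first, the year column can be derived without apply or string splitting:
# Sketch: derive the year directly from parsed dates
df['year'] = pd.to_datetime(df['Date']).dt.year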

How can I group my CSV's list of dates into their Months?

I have a CSV file which contains two columns: the first is a date column in the format 01/01/2020 and the second is a number for each month representing that month's sales volume. The dates range from 2004 to 2019 and my task is to create a 12-bar chart, with each bar representing the average sales volume for that month across every year's data. I attempted to use a groupby function but got an error relating to not having numeric types to aggregate. I am very new to Python, so apologies for the beginner questions. I have posted my code so far below. Thanks in advance for any help with this :)
# -*- coding: utf-8 -*-
import csv
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
file = "GlasgowSalesVolume.csv"
data = pd.read_csv(file)
typemean = (data.groupby(['Date', 'SalesVolume'], as_index=False).mean()
            .groupby('Date')['SalesVolume'].mean())
Output:
DataError: No numeric types to aggregate
I prepared a DataFrame limited to just 2 years and 3 months:
Date Sales
0 01/01/2019 3
1 01/02/2019 4
2 01/03/2019 8
3 01/01/2020 10
4 01/02/2020 20
5 01/03/2020 30
For now the Date column is of string type, so the first step is to
convert it to datetime64:
df.Date = pd.to_datetime(df.Date, dayfirst=True)
Now to compute your result, run:
result = df.groupby(df.Date.dt.month).Sales.mean()
The result is a Series containing:
Date
1 6.5
2 12.0
3 19.0
Name: Sales, dtype: float64
The index is the month number (1 through 12) and the value is the mean for the
respective month, taken across all years.
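Since the original question asks for a 12-bar chart, the resulting Series can be plotted directly (a sketch, assuming matplotlib is available as in the question's imports):
import matplotlib.pyplot as plt

# Sketch: one bar per month, bar height = average sales volume across years
ax = result.plot.bar()
ax.set_xlabel('Month')
ax.set_ylabel('Average sales volume')
plt.show()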

return streams for multiple securities in pandas

Suppose I have a table which looks like this:
Ticker Date ClosingPrice
0 A 01-02-2010 11.4
1 A 01-03-2010 11.5
...
1000 AAPL 01-02-2010 634
1001 AAPL 01-02-2010 635
So, in other words, we have a sequence of time series spliced together, one per ticker symbol. Now, I would like to generate a column of daily returns. If I had only one symbol, that would be very easy with the pandas pct_change() function, but how do I do it for multiple time series as above? (I can do a sequence of groupbys, make each a DataFrame, do the return computation, then splice them all together with pd.concat(), but that does not seem optimal.)
use groupby
df.set_index(['Ticker', 'Date']).ClosingPrice.groupby(level=0).pct_change()
Ticker Date
A 01-02-2010 NaN
01-03-2010 0.008772
AAPL 01-02-2010 NaN
01-02-2010 0.001577
Name: ClosingPrice, dtype: float64
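A common equivalent without setting the index first (my own sketch, not from the answer above): group directly on the ticker column and assign the result back as a new column:
# Sketch: per-ticker daily returns kept alongside the original columns
df['daily_return'] = df.groupby('Ticker')['ClosingPrice'].pct_change()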
