I have a dataframe that looks like this:
Date City_State HousingPrice DowPrice NasqPrice
0 1996-04 New York, NY 169300.0 5579.864351 1135.628092
1 1996-04 Los Angeles, CA 157700.0 5579.864351 1135.628092
2 1996-04 Houston, TX 86500.0 5579.864351 1135.628092
3 1996-04 Chicago, IL 114000.0 5579.864351 1135.628092
4 1996-04 Phoenix, AZ 88100.0 5579.864351 1135.628092
5 1996-05 New York, NY 169800.0 5616.707742 1220.540472
6 1996-05 Los Angeles, CA 157600.0 5616.707742 1220.540472
I'm trying to reshape the dataframe so that I could plot it.
Is there a simple way to move the DowPrice and NasqPrice into the City_State column so it looks something like this, without having to split the dataframe in two, reshape them and then merge them back?
Date Category Price
0 1996-04 New York, NY 169300.0
1 1996-04 Los Angeles, CA 157700.0
2 1996-04 Houston, TX 86500.0
3 1996-04 DowPrice 5579.864351
4 1996-04 NasqPrice 1135.628092
This should do the trick:
df = pd.concat([
    df.groupby("Date")["DowPrice"].first().to_frame().rename(
        columns={"DowPrice": "Price"}
    ).assign(Category="Dow"),
    df.groupby("Date")["NasqPrice"].first().to_frame().rename(
        columns={"NasqPrice": "Price"}
    ).assign(Category="Nasdaq"),
    df.set_index("Date").rename(
        columns={"City_State": "Category", "HousingPrice": "Price"}
    ).drop(["NasqPrice", "DowPrice"], axis=1)
], axis=0, sort=False).reset_index()
Output (I removed the spaces in the category names on purpose, just as a shortcut while building sample data from your df; with the code above on your real dataframe they will appear normally):
Date Price Category
0 1996-04 5579.864351 Dow
1 1996-05 5616.707742 Dow
2 1996-04 1135.628092 Nasdaq
3 1996-05 1220.540472 Nasdaq
4 1996-04 169300.0 NewYork,NY
5 1996-04 157700.0 LosAngeles,CA
6 1996-04 86500.0 Houston,TX
7 1996-04 114000.0 Chicago,IL
8 1996-04 88100.0 Phoenix,AZ
9 1996-05 169800.0 NewYork,NY
10 1996-05 157600.0 LosAngeles,CA
You can use reshape/melt to get there, but your intent is not totally clear.
You want Price to denote:
HousingPrice for each (City_State, Date), if Category was a City_State
else DowPrice/NasqPrice for that Date
So you want to reshape/melt multiple columns, selecting the source column depending on the value of Category.
You could do something like this and append it to itself, although I guess I'm reshaping and merging...
df.append(
    df[['Date', 'DowPrice', 'NasqPrice']].drop_duplicates()
      .melt('Date')
      .rename(columns={'variable': 'City_State', 'value': 'HousingPrice'})
).drop(columns=['DowPrice', 'NasqPrice'])
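Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same idea works with pd.concat. A sketch on a small stand-in dataframe:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["1996-04", "1996-04", "1996-05"],
    "City_State": ["New York, NY", "Houston, TX", "New York, NY"],
    "HousingPrice": [169300.0, 86500.0, 169800.0],
    "DowPrice": [5579.86, 5579.86, 5616.71],
    "NasqPrice": [1135.63, 1135.63, 1220.54],
})

# melt the two index columns into (Date, City_State, HousingPrice) rows
melted = (df[["Date", "DowPrice", "NasqPrice"]]
          .drop_duplicates()
          .melt("Date")
          .rename(columns={"variable": "City_State", "value": "HousingPrice"}))

# pd.concat replaces the removed DataFrame.append
out = (pd.concat([df, melted], ignore_index=True)
         .drop(columns=["DowPrice", "NasqPrice"]))
```

The result has one row per city per date plus one row per index per date, all in the three columns Date, City_State, HousingPrice.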
I think this may be what you are asking for.
If you are looking to read the data from a csv:
import csv as cs

with open('/Documents/prices.csv', newline='') as csvfile:
    spamreader = cs.reader(csvfile, delimiter=',')
    for row in spamreader:
        print(','.join(row))
This is the simplest approach I could find if you first export the data to a CSV file from the pandas dataframe.
import pandas as pd
data = pd.read_csv('/Documents/prices.csv')
part1 = data.filter(items = ['Date', 'Category', 'HousingPrice'])
However, it seems like you might want to plot the housing price over time against the DowPrice and NasqPrice. I would just split the dataframe into three series and plot those.
Where the three series are:
part1 = data.filter(items = ['Date', 'Category', 'HousingPrice'])
d2 = pd.DataFrame(data.filter(items = ['Date', 'NasqPrice']))
d3 = pd.DataFrame(data.filter(items = ['Date', 'DowPrice']))
Or just simply (note the column is 'Date', not 'date'):
lines = data.plot.line(x='Date', y=['HousingPrice', 'DowPrice', 'NasqPrice'])
Related
Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city&country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts.
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
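DataFrame.value_counts only exists since pandas 1.1; on older versions, groupby(...).size() gives the same counts without a dummy column:

```python
import pandas as pd

df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')],
                  columns=['city', 'country'])

# size() counts rows per group, so no placeholder column is needed
out = df.groupby(['city', 'country']).size().reset_index(name='n')
```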
I have a dataset, df, where I would like to group by one column and then take the count of each category within a second column
name location sku
svc1 ny hey1
svc2 ny hey1
svc3 ny hey1
svc4 ny hey1
lo1 ny ok1
lo2 ny ok1
fab1 ny hi
fab2 ny hi
fab3 ny hi
hello ca no
hello ca no
desired
location sku count
ny hey1 4
ny ok1 2
ny hi 3
ca no 2
doing
df2 = pd.DataFrame()
df2['sku'] = df.groupby('location')['sku'].nth(0)
df2['count'] = df.groupby('sku').count()
However, I am getting NAN for count, and I am not getting all of the data listed under sku.
Any suggestion is appreciated.
You are looking to group by two columns:
df.groupby(['location','sku']).size().reset_index(name='count')
Or groupby one column and value_counts the other:
# this should be slightly faster
(df.groupby('location')['sku'].value_counts()
   .reset_index(name='count'))
Output:
location sku count
0 ca no 2
1 ny hey1 4
2 ny hi 3
3 ny ok1 2
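pd.crosstab is a third option if a wide table is acceptable; stacking it back recovers the long format. A sketch on stand-in data matching the example:

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["ny"] * 9 + ["ca"] * 2,
    "sku": ["hey1"] * 4 + ["ok1"] * 2 + ["hi"] * 3 + ["no"] * 2,
})

# wide table of counts: one row per location, one column per sku
wide = pd.crosstab(df["location"], df["sku"])

# back to the long format, dropping the zero-count combinations
long_counts = wide.stack().reset_index(name="count")
long_counts = long_counts[long_counts["count"] > 0]
```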
I have a data frame from which I'm trying to create a tsplot using seaborn. The first issue, after converting my date column from string to a DateTime object, is that a day has been automatically added.
The original data frame looks like this:
zillowvisualtop5.head()
Out[161]:
City_State Year Price
0 New York, NY 1996-04 169300.0
1 Los Angeles, CA 1996-04 157700.0
2 Houston, TX 1996-04 86500.0
3 Chicago, IL 1996-04 114000.0
6 Phoenix, AZ 1996-04 88100.0
(Note that the date is in year-month format)
After I convert it into a DateTime object so that I can plot it using seaborn, a day gets added after the month.
zillowvisualtop5['Year'] = pd.to_datetime(zillowvisualtop5['Year'], format= '%Y-%m')
zillowvisualtop5.head()
Out[165]:
City_State Year Price
0 New York, NY 1996-04-01 169300.0
1 Los Angeles, CA 1996-04-01 157700.0
2 Houston, TX 1996-04-01 86500.0
3 Chicago, IL 1996-04-01 114000.0
6 Phoenix, AZ 1996-04-01 88100.0
The solution that I've found seems to suggest converting to strftime but I need my time to be in DateTime format so I can plot it using seaborn.
The problem you are having is that a DateTime object always carries all the components of a date and a time. There is no way to have a DateTime object that only has year and month info.
However, you can use the matplotlib converters like this:
import pandas as pd
from pandas.plotting import register_matplotlib_converters
import seaborn as sns

register_matplotlib_converters()  # let matplotlib handle pandas datetimes

cols = ['City', 'Price', 'Month']
data = [['New York', 125, '1996-04'],
        ['New York', 79, '1996-05'],
        ['New York', 85, '1996-06'],
        ['Houston', 90, '1996-04'],
        ['Houston', 95, '1996-05'],
        ['Houston', 127, '1996-06']]
df = pd.DataFrame(data, columns=cols)
print(df)

chart = sns.lineplot(x='Month', y='Price', hue='City', data=df)
Does that get the results you were looking for?
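If the appended day is purely a display issue, another option is to keep real datetime values and only format the tick labels. A sketch with plain matplotlib (seaborn's lineplot returns a matplotlib Axes, so the same formatter call applies there too):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the example runs anywhere
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

df = pd.DataFrame({
    "City_State": ["New York, NY", "New York, NY", "Houston, TX", "Houston, TX"],
    "Year": ["1996-04", "1996-05", "1996-04", "1996-05"],
    "Price": [169300.0, 169800.0, 86500.0, 87000.0],
})
df["Year"] = pd.to_datetime(df["Year"], format="%Y-%m")  # day defaults to 1

fig, ax = plt.subplots()
for city, grp in df.groupby("City_State"):
    ax.plot(grp["Year"], grp["Price"], label=city)

# display only year-month on the axis; the data itself stays datetime
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m"))
ax.legend()
```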
I have 2 dataframes. One with the City, dates and sales
sales = [['20101113','Miami',35],['20101114','New York',70],['20101114','Los Angeles',4],['20101115','Chicago',36],['20101114','Miami',12]]
df2 = pd.DataFrame(sales,columns=['Date','City','Sales'])
print (df2)
Date City Sales
0 20101113 Miami 35
1 20101114 New York 70
2 20101114 Los Angeles 4
3 20101115 Chicago 36
4 20101114 Miami 12
The second has some dates and cities.
date = [['20101114','New York'],['20101114','Los Angeles'],['20101114','Chicago']]
df = pd.DataFrame(date,columns=['Date','City'])
print (df)
I want to extract the sales from the first dataframe that match the city and date in the second dataframe, and add those sales to the second dataframe. If a date is missing in the first table, then the sales for the next later date should be retrieved.
The new dataframe should look like this
Date City Sales
0 20101114 New York 70
1 20101114 Los Angeles 4
2 20101114 Chicago 36
I am having trouble extracting and merging tables. Any suggestions?
This is pd.merge_asof, which allows you to join on a combination of exact matches and then a "close" match for some column.
import pandas as pd
df['Date'] = pd.to_datetime(df.Date)
df2['Date'] = pd.to_datetime(df2.Date)
pd.merge_asof(df.sort_values('Date'),
              df2.sort_values('Date'),
              by='City', on='Date',
              direction='forward')
Output:
Date City Sales
0 2010-11-14 New York 70
1 2010-11-14 Los Angeles 4
2 2010-11-14 Chicago 36
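Run end to end on the sample frames, this also demonstrates the forward match: Chicago has no 2010-11-14 row in the sales table, so it picks up the value from the next date:

```python
import pandas as pd

sales = [['20101113', 'Miami', 35], ['20101114', 'New York', 70],
         ['20101114', 'Los Angeles', 4], ['20101115', 'Chicago', 36],
         ['20101114', 'Miami', 12]]
df2 = pd.DataFrame(sales, columns=['Date', 'City', 'Sales'])

date = [['20101114', 'New York'], ['20101114', 'Los Angeles'], ['20101114', 'Chicago']]
df = pd.DataFrame(date, columns=['Date', 'City'])

df['Date'] = pd.to_datetime(df['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])

# exact match on City, nearest-following match on Date
out = pd.merge_asof(df.sort_values('Date'), df2.sort_values('Date'),
                    by='City', on='Date', direction='forward')
```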
I'm new to Python and fairly new to SO.
I have a pandas dataframe named df which looks like:
Text
Date Location
2015-07-08 San Diego, CA 1
2015-07-07 Bellevue, WA 1
Los Angeles, CA 1
New York, NY 1
Los Angeles, CA 1
Unknown 1
I want to pivot the data using:
import pandas, numpy as np
df_pivoted = df.pivot_table(df, values=['Text'], index=['Date'],
columns=['Location'],aggfunc=np.sum)
The idea is to generate a heat map that shows the count of "Text" by "Location" and "Date".
I get error:
TypeError: pivot_table() got multiple values for keyword argument 'values'
When using a simplified approach:
df = df.pivot_table('Date', 'Location', 'Text')
I get error:
raise DataError('No numeric types to aggregate')
I'm using Python 2.7 and Pandas 0.16.2
In[2]: df.dtypes
Out[2]:
Date datetime64[ns]
Text object
Location object
dtype: object
Anyone having an idea?
import pandas as pd
import numpy as np
# just try to replicate your dataframe
# ==============================================
date = ['2015-07-08', '2015-07-07', '2015-07-07', '2015-07-07', '2015-07-07', '2015-07-07']
location = ['San Diego, CA', 'Bellevue, WA', 'Los Angeles, CA', 'New York, NY', 'Los Angeles, CA', 'Unknown']
text = [1] * 6
df = pd.DataFrame({'Date': date, 'Location': location, 'Text': text})
Out[141]:
Date Location Text
0 2015-07-08 San Diego, CA 1
1 2015-07-07 Bellevue, WA 1
2 2015-07-07 Los Angeles, CA 1
3 2015-07-07 New York, NY 1
4 2015-07-07 Los Angeles, CA 1
5 2015-07-07 Unknown 1
# processing
# ==============================================
pd.pivot_table(df, index='Date', columns='Location', values='Text', aggfunc=np.sum)
Out[142]:
Location Bellevue, WA Los Angeles, CA New York, NY San Diego, CA Unknown
Date
2015-07-07 1 2 1 NaN 1
2015-07-08 NaN NaN NaN 1 NaN
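Since the end goal is a heat map of counts, pd.crosstab is a shortcut worth knowing: it counts the (Date, Location) pairs directly, with no need for a numeric Text column at all:

```python
import pandas as pd

date = ['2015-07-08', '2015-07-07', '2015-07-07', '2015-07-07', '2015-07-07', '2015-07-07']
location = ['San Diego, CA', 'Bellevue, WA', 'Los Angeles, CA', 'New York, NY',
            'Los Angeles, CA', 'Unknown']
df = pd.DataFrame({'Date': date, 'Location': location})

# rows: dates, columns: locations, cells: pair counts (0 where absent)
counts = pd.crosstab(df['Date'], df['Location'])
```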