I have a data frame that I'm trying to plot with seaborn's tsplot. The first issue I'm having is that after converting my dates from string to DateTime objects, a day has been automatically added.
The original data frame looks like this:
zillowvisualtop5.head()
Out[161]:
City_State Year Price
0 New York, NY 1996-04 169300.0
1 Los Angeles, CA 1996-04 157700.0
2 Houston, TX 1996-04 86500.0
3 Chicago, IL 1996-04 114000.0
6 Phoenix, AZ 1996-04 88100.0
(Note that the date is in year-month format)
After I convert it into a DateTime object so that I can plot it using seaborn, I get the issue of having a day added after the month.
zillowvisualtop5['Year'] = pd.to_datetime(zillowvisualtop5['Year'], format= '%Y-%m')
zillowvisualtop5.head()
Out[165]:
City_State Year Price
0 New York, NY 1996-04-01 169300.0
1 Los Angeles, CA 1996-04-01 157700.0
2 Houston, TX 1996-04-01 86500.0
3 Chicago, IL 1996-04-01 114000.0
6 Phoenix, AZ 1996-04-01 88100.0
The solution that I've found seems to suggest converting with strftime, but I need my time to stay in DateTime format so I can plot it using seaborn.
The problem that you are having is that a DateTime object will always include all the components of a date object and a time object. There is no way to have a DateTime object that only has year and month info (source: here).
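If a month-level label is really all you need, one option worth knowing about (my own addition, not from the linked source) is pandas' Period type, which stores only the year and month. Note that seaborn/matplotlib still want timestamps for actual plotting, so this is mainly useful for display and grouping:

```python
import pandas as pd

# a small frame with year-month strings, mirroring the question's data
df = pd.DataFrame({'City_State': ['New York, NY', 'Houston, TX'],
                   'Year': ['1996-04', '1996-04'],
                   'Price': [169300.0, 86500.0]})

# Timestamps always carry a day component; to_datetime fills in day 1
df['Year_ts'] = pd.to_datetime(df['Year'], format='%Y-%m')

# A Period column stores only the year and month
df['Year_period'] = df['Year_ts'].dt.to_period('M')
print(df['Year_period'].iloc[0])  # 1996-04
```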
However, you can use the matplotlib converters like this:
import pandas as pd
from pandas.plotting import register_matplotlib_converters
import seaborn as sns

# register the converters so matplotlib can handle pandas datetime values
register_matplotlib_converters()

cols = ['City', 'Price', 'Month']
data = [['New York', 125, '1996-04'],
        ['New York', 79, '1996-05'],
        ['New York', 85, '1996-06'],
        ['Houston', 90, '1996-04'],
        ['Houston', 95, '1996-05'],
        ['Houston', 127, '1996-06']]
df = pd.DataFrame(data, columns=cols)
print(df)
chart = sns.lineplot(x='Month', y='Price', hue='City', data=df)
Does that get the results you were looking for?
Related
I have a data frame where the columns are "city" and "datetime". The data indicates the arrival of VIPs into the city.
City datetime
New York 2022-12-06 10:37:25
New York 2022-12-06 10:42:34
New York 2022-12-06 10:47:12
New York 2022-12-06 10:52:10
New York 2022-12-06 02:37:25
As you can see, the last row stands out from the rest. Each of the first four entries is less than 10 minutes apart from the row above it, while the gap before the last row's datetime is more than 10 minutes.
Now I want to group the rows into 2 different groups: the first four rows as one group, and the last row alone as another.
Desired output:
City count dates
New York 4 ['2022-12-06 10:37:25', '2022-12-06 10:42:34', '2022-12-06 10:47:12', '2022-12-06 10:52:10']
New York 1 ['2022-12-06 02:37:25']
This is my first time using this forum. Any help is greatly appreciated.
I have tried groupby on the "city" column, but it just groups every row with the same city name. I want to group each city based on the datetime instead.
You can simply use groupby with Grouper:
import pandas as pd

# create df
df = pd.DataFrame({
    'City': ['New York', 'New York', 'New York', 'New York', 'New York'],
    'datetime': ['2022-12-06 10:37:25', '2022-12-06 10:42:34', '2022-12-06 10:47:12',
                 '2022-12-06 10:52:10', '2022-12-06 02:37:25']
})

# convert and set datetime col as index
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
df['Date'] = df.index

# group into 15-minute bins anchored at the first timestamp
grouped = df.groupby(['City', pd.Grouper(freq='15min', origin='start')])
new_df = grouped.count()
new_df['Dates'] = grouped['Date'].apply(list)
new_df = new_df.reset_index().drop('datetime', axis=1)
print(new_df)
Learn more: pandas.Grouper
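Note that the Grouper above bins into fixed 15-minute windows, which happens to match the sample data. If the intent is instead to start a new group whenever the gap to the previous arrival exceeds 10 minutes, as the question describes, a common sketch (my own variation; the group column name is arbitrary) is a cumulative sum over the gap condition:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York'] * 5,
    'datetime': pd.to_datetime(['2022-12-06 10:37:25', '2022-12-06 10:42:34',
                                '2022-12-06 10:47:12', '2022-12-06 10:52:10',
                                '2022-12-06 02:37:25']),
})

# sort, then start a new group whenever the gap to the previous row
# within the same city exceeds 10 minutes
df = df.sort_values(['City', 'datetime'])
gap = df.groupby('City')['datetime'].diff() > pd.Timedelta(minutes=10)
df['group'] = gap.cumsum()

counts = df.groupby(['City', 'group'])['datetime'].count()
dates = df.groupby(['City', 'group'])['datetime'].apply(list)
out = pd.DataFrame({'count': counts, 'dates': dates}).reset_index()
print(out)  # one row per burst of arrivals: counts 1 and 4
```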
This question already has answers here:
Pandas df.apply does not modify DataFrame
(2 answers)
Closed 1 year ago.
As a reproducible example, I created the following dataframe:
import pandas as pd

dictionary = {'Metropolitan area': ['New York City', 'New York City', 'Los Angeles', 'Los Angeles'],
              'Population (2016 est.)[8]': [20153634, 20153634, 13310447, 13310447],
              'NBA': ['Knicks', ' ', ' ', 'Clippers']}
df = pd.DataFrame(dictionary)
To substitute any space present in df['NBA'] with None, I created the following function:
def transform(x):
    if len(x) < 2:
        return None
    else:
        return x
which I apply over df['NBA'] using .apply method:
df['NBA'].apply(transform)
After doing this, I get the following output, which seems to show the operation was successful:
> 0 Knicks
1 Missing Value
2 Missing Value
3 Clippers
Name: NBA, dtype: object
But, and here is the problem, when I display df, df['NBA'] has not been transformed; I get that column as it was from the beginning, with the spaces still present and not replaced by None:
Metropolitan area Population (2016 est.)[8] NBA
0 New York City 20153634 Knicks
1 New York City 20153634
2 Los Angeles 13310447
3 Los Angeles 13310447 Clippers
What am I doing wrong? Am I misunderstanding the .apply method?
The command df['NBA'].apply(transform) on its own performs the operation but does not save the result to the original DataFrame in memory,
so you just have to assign the new column back:
df['NBA'] = df['NBA'].apply(transform)
and the resulting DataFrame should be:
Metropolitan area Population (2016 est.)[8] NBA
0 New York City 20153634 Knicks
1 New York City 20153634 None
2 Los Angeles 13310447 None
3 Los Angeles 13310447 Clippers
Assign the results of apply back to the column.
df['NBA'] = df['NBA'].apply(transform)
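As an aside, the same substitution can also be done without apply; a vectorized sketch using Series.mask (my own variation, which leaves NaN rather than None as the missing marker):

```python
import pandas as pd

df = pd.DataFrame({'NBA': ['Knicks', ' ', ' ', 'Clippers']})

# mask() keeps values where the condition is False and replaces the rest
# with NaN; here: entries that are empty after stripping whitespace
df['NBA'] = df['NBA'].mask(df['NBA'].str.strip() == '')
print(df)
```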
I have a dataframe that looks like this:
Date City_State HousingPrice DowPrice NasqPrice
0 1996-04 New York, NY 169300.0 5579.864351 1135.628092
1 1996-04 Los Angeles, CA 157700.0 5579.864351 1135.628092
2 1996-04 Houston, TX 86500.0 5579.864351 1135.628092
3 1996-04 Chicago, IL 114000.0 5579.864351 1135.628092
4 1996-04 Phoenix, AZ 88100.0 5579.864351 1135.628092
5 1996-05 New York, NY 169800.0 5616.707742 1220.540472
6 1996-05 Los Angeles, CA 157600.0 5616.707742 1220.540472
I'm trying to reshape the dataframe so that I could plot it.
Is there a simple way to move the DowPrice and NasqPrice into the City_State column so it looks something like this, without having to split the dataframe in two, reshape them and then merge them back?
Date Category Price
0 1996-04 New York, NY 169300.0
1 1996-04 Los Angeles, CA 157700.0
2 1996-04 Houston, TX 86500.0
3 1996-04 DowPrice 5579.864351
4 1996-04 NasqPrice 1135.628092
This should do the trick:
df = pd.concat([
    df.groupby("Date")["DowPrice"].first().to_frame().rename(
        columns={"DowPrice": "Price"}
    ).assign(Category="Dow"),
    df.groupby("Date")["NasqPrice"].first().to_frame().rename(
        columns={"NasqPrice": "Price"}
    ).assign(Category="Nasdaq"),
    df.set_index("Date").rename(
        columns={"City_State": "Category", "HousingPrice": "Price"}
    ).drop(["NasqPrice", "DowPrice"], axis=1)
], axis=0, sort=False).reset_index()
Output (I removed the spaces in the categories on purpose, just as a shortcut to get the data from your df; you will see them intact when using the code above):
Date Price Category
0 1996-04 5579.864351 Dow
1 1996-05 5616.707742 Dow
2 1996-04 1135.628092 Nasdaq
3 1996-05 1220.540472 Nasdaq
4 1996-04 169300.0 NewYork,NY
5 1996-04 157700.0 LosAngeles,CA
6 1996-04 86500.0 Houston,TX
7 1996-04 114000.0 Chicago,IL
8 1996-04 88100.0 Phoenix,AZ
9 1996-05 169800.0 NewYork,NY
10 1996-05 157600.0 LosAngeles,CA
You can use reshape/melt to do whatever you want, but your intent is not totally clear.
You want Price to denote:
HousingPrice for each (City_State, Date), if Category was a City_State
else DowPrice/NasqPrice for that Date
So you want to reshape/melt multiple columns, selecting depending on the value of Category
You could do something like this and append it to itself, although I guess I'm reshaping and merging...
# note: df.append was removed in pandas 2.0; pd.concat does the same job here
pd.concat([
    df,
    df[['Date', 'DowPrice', 'NasqPrice']].drop_duplicates()
        .melt('Date')
        .rename(columns={'variable': 'City_State', 'value': 'HousingPrice'})
]).drop(columns=['DowPrice', 'NasqPrice'])
I think this may be what you are asking for.
If you are looking to read the data from a csv:
import csv as cs

with open('/Documents/prices.csv', newline='') as csvfile:
    spamreader = cs.reader(csvfile, delimiter=',')
    for row in spamreader:
        print(','.join(row))
This is the easiest approach I could find if you export the data as a CSV file from a pandas dataframe.
import pandas as pd
data = pd.read_csv('/Documents/prices.csv')
part1 = data.filter(items = ['Date', 'Category', 'HousingPrice'])
However, it seems like you might want to plot the housing price over time against the Dow price and the Nasdaq price. I would just split the dataframe into three series and plot those.
Where the three series are:
part1 = data.filter(items = ['Date', 'Category', 'HousingPrice'])
d2 = pd.DataFrame(data.filter(items = ['Date', 'NasqPrice']))
d3 = pd.DataFrame(data.filter(items = ['Date', 'DowPrice']))
Or just simply (note that the column is named 'Date', with a capital D, in your dataframe):
lines = data.plot.line(x='Date', y=['HousingPrice', 'DowPrice', 'NasqPrice'])
I have a dataframe (df) and I would like to create a new column called country. It is calculated by looking at the region column: where the region value is present in the EnglandRegions list, the country value is set to England; otherwise it is the value from the region column.
Please see below for my desired output:
name salary region B1salary country
0 Jason 42000 London 42000 England
1 Molly 52000 South West England
2 Tina 36000 East Midland England
3 Jake 24000 Wales Wales
4 Amy 73000 West Midlands England
You can see that all the values in country are set to England except for the value assigned to Jakes record that is set to Wales (as Wales is not in the EnglandRegions list). The code below produces the following error:
File "C:/Users/stacey/Documents/scripts/stacey.py", line 20
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
^
SyntaxError: invalid syntax
The code is as follows:
import pandas as pd
import numpy as np
EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'salary': [42000, 52000, 36000, 24000, 73000],
        'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']}
df = pd.DataFrame(data, columns = ['name', 'salary', 'region'])
df['B1salary'] = np.where((df['salary']>=40000) & (df['salary']<=50000) , df['salary'], '')
df['country'] = np.where((df.loc[df['region'].isin(EnglandRegions)),'England', df['region'])
print(df)
The specific issue the error is referencing is that you are missing a ] to close your .loc. However, fixing that alone won't work anyway. Try:
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
This is essentially the same pattern you already had in the line above it for B1salary.
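Putting it together, a minimal runnable version of the corrected script, using the data from the question, might look like:

```python
import pandas as pd
import numpy as np

EnglandRegions = ["London", "South West", "East Midland", "West Midlands", "East Anglia"]

df = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
                   'salary': [42000, 52000, 36000, 24000, 73000],
                   'region': ['London', 'South West', 'East Midland', 'Wales', 'West Midlands']})

# no .loc needed: isin() already returns the boolean mask np.where expects
df['country'] = np.where(df['region'].isin(EnglandRegions), 'England', df['region'])
print(df['country'].tolist())  # ['England', 'England', 'England', 'Wales', 'England']
```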
I'm new to Python and fairly new to SO.
I have a pandas dataframe named df which looks like:
Text
Date Location
2015-07-08 San Diego, CA 1
2015-07-07 Bellevue, WA 1
Los Angeles, CA 1
New York, NY 1
Los Angeles, CA 1
Unknown 1
I want to pivot the data using:
import pandas, numpy as np
df_pivoted = df.pivot_table(df, values=['Text'], index=['Date'],
columns=['Location'],aggfunc=np.sum)
The idea is to generate a heat map that shows the count of "Text" by "Location" and "Date".
I get error:
TypeError: pivot_table() got multiple values for keyword argument 'values'
When using a simplified approach:
df = df.pivot_table('Date', 'Location', 'Text')
I get error:
raise DataError('No numeric types to aggregate')
I'm using Python 2.7 and Pandas 0.16.2
In[2]: df.dtypes
Out[2]:
Date datetime64[ns]
Text object
Location object
dtype: object
Anyone having an idea?
import pandas as pd
import numpy as np
# just try to replicate your dataframe
# ==============================================
date = ['2015-07-08', '2015-07-07', '2015-07-07', '2015-07-07', '2015-07-07', '2015-07-07']
location = ['San Diego, CA', 'Bellevue, WA', 'Los Angeles, CA', 'New York, NY', 'Los Angeles, CA', 'Unknown']
text = [1] * 6
df = pd.DataFrame({'Date': date, 'Location': location, 'Text': text})
Out[141]:
Date Location Text
0 2015-07-08 San Diego, CA 1
1 2015-07-07 Bellevue, WA 1
2 2015-07-07 Los Angeles, CA 1
3 2015-07-07 New York, NY 1
4 2015-07-07 Los Angeles, CA 1
5 2015-07-07 Unknown 1
# processing
# ==============================================
pd.pivot_table(df, index='Date', columns='Location', values='Text', aggfunc=np.sum)
Out[142]:
Location Bellevue, WA Los Angeles, CA New York, NY San Diego, CA Unknown
Date
2015-07-07 1 2 1 NaN 1
2015-07-08 NaN NaN NaN 1 NaN
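Since Text is just a count of 1 per row here, the same table can also be produced with pd.crosstab, which counts co-occurrences directly (an equivalent alternative, not part of the original answer; it yields 0 instead of NaN for missing combinations):

```python
import pandas as pd

date = ['2015-07-08', '2015-07-07', '2015-07-07', '2015-07-07', '2015-07-07', '2015-07-07']
location = ['San Diego, CA', 'Bellevue, WA', 'Los Angeles, CA', 'New York, NY',
            'Los Angeles, CA', 'Unknown']
df = pd.DataFrame({'Date': date, 'Location': location})

# crosstab counts how often each (Date, Location) pair occurs
table = pd.crosstab(df['Date'], df['Location'])
print(table)
```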