I'm new to Python and fairly new to SO.
I have a pandas dataframe named df which looks like:
                            Text
Date       Location
2015-07-08 San Diego, CA       1
2015-07-07 Bellevue, WA        1
           Los Angeles, CA     1
           New York, NY        1
           Los Angeles, CA     1
           Unknown             1
I want to pivot the data using:
import pandas as pd
import numpy as np

df_pivoted = df.pivot_table(df, values=['Text'], index=['Date'],
                            columns=['Location'], aggfunc=np.sum)
The idea is to generate a heat map that shows the count of "Text" by "Location" and "Date".
I get this error:
TypeError: pivot_table() got multiple values for keyword argument 'values'
When using a simplified approach:
df = df.pivot_table('Date', 'Location', 'Text')
I get this error:
raise DataError('No numeric types to aggregate')
I'm using Python 2.7 and Pandas 0.16.2
In[2]: df.dtypes
Out[2]:
Date datetime64[ns]
Text object
Location object
dtype: object
Does anyone have an idea?
import pandas as pd
import numpy as np
# just try to replicate your dataframe
# ==============================================
date = ['2015-07-08', '2015-07-07', '2015-07-07', '2015-07-07', '2015-07-07', '2015-07-07']
location = ['San Diego, CA', 'Bellevue, WA', 'Los Angeles, CA', 'New York, NY', 'Los Angeles, CA', 'Unknown']
text = [1] * 6
df = pd.DataFrame({'Date': date, 'Location': location, 'Text': text})
Out[141]:
Date Location Text
0 2015-07-08 San Diego, CA 1
1 2015-07-07 Bellevue, WA 1
2 2015-07-07 Los Angeles, CA 1
3 2015-07-07 New York, NY 1
4 2015-07-07 Los Angeles, CA 1
5 2015-07-07 Unknown 1
# processing
# ==============================================
pd.pivot_table(df, index='Date', columns='Location', values='Text', aggfunc=np.sum)
Out[142]:
Location Bellevue, WA Los Angeles, CA New York, NY San Diego, CA Unknown
Date
2015-07-07 1 2 1 NaN 1
2015-07-08 NaN NaN NaN 1 NaN
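Since the stated goal is a heat map of the "Text" counts by "Location" and "Date", a minimal sketch building on the pivoted frame (assuming seaborn and matplotlib are installed) could be:

import matplotlib.pyplot as plt
import seaborn as sns

pivoted = pd.pivot_table(df, index='Date', columns='Location',
                         values='Text', aggfunc='sum')
# fill the missing combinations with 0 so they render as zero counts
sns.heatmap(pivoted.fillna(0), annot=True, cmap='Blues')
plt.show()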
I have one dataframe (df1), which is my raw data, from which I want to filter or extract part of the data. I have another dataframe (df2), which holds my filter conditions. The catch is that if a filter-condition column is blank, that column's condition should be skipped and only the other columns' conditions applied.
Example below:
DF1:
City  District  Town  Country  Continent
NY    WASHIN    DC    US       America
CZCH  SEATLLE   DC    CZCH     Europe
NY    MARYLAND  DC    US       S America
NY    WASHIN    NY    US       America
NY    SEAGA     NJ    UK       Europe
DF2: (sample filter condition table; this table can have multiple conditions)
City  District  Town  Country  Continent
NY              DC
                NJ
Notice that I have left the District, Country and Continent columns blank, as I may or may not use them later. I cannot delete these columns.
OUTPUT DF should look like this:
City  District  Town  Country  Continent
NY    WASHIN    DC    US       America
NY    MARYLAND  DC    US       S America
NY    SEAGA     NJ    UK       Europe
So basically I need a filter-condition table that extracts rows from the raw data for the fields I fill in. I cannot change or delete columns in DF2; I can only leave a column blank if I don't require that filter condition.
Thanks in advance,
Nitz
If DF2 always has one row:
df = df1.merge(df2.dropna(axis=1))
print (df)
  City  District Town Country  Continent
0   NY    WASHIN   DC      US    America
1   NY  MARYLAND   DC      US  S America
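For clarity, dropna(axis=1) keeps only the filled-in condition columns, so the merge joins on exactly those keys. A small illustration, assuming DF2 holds the single row City='NY', Town='DC':

print(df2.dropna(axis=1))
#   City Town
# 0   NY   DC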
If DF2 has multiple rows with missing values:
Sample data:
nan = np.nan
df1 = pd.DataFrame({'City': ['NY', 'CZCH', 'NY', 'NY', 'NY'],
                    'District': ['WASHIN', 'SEATLLE', 'MARYLAND', 'WASHIN', 'SEAGA'],
                    'Town': ['DC', 'DC', 'DC', 'NY', 'NJ'],
                    'Country': ['US', 'CZCH', 'US', 'US', 'UK'],
                    'Continent': ['America', 'Europe', 'S America', 'America', 'Europe']})
df2 = pd.DataFrame({'City': ['NY', nan], 'District': [nan, nan], 'Town': ['DC', 'NJ'],
                    'Country': [nan, nan], 'Continent': [nan, nan]})
First remove the missing values by reshaping with DataFrame.stack:
print (df2.stack())
0 City NY
Town DC
1 Town NJ
dtype: object
Then, for each df2 row (each group), compare the corresponding df1 columns against the values from df2:
m = [df1[list(v.droplevel(0).index)].eq(v.droplevel(0)).all(axis=1)
     for k, v in df2.stack().groupby(level=0)]
print (m)
[0 True
1 False
2 True
3 False
4 False
dtype: bool, 0 False
1 False
2 False
3 False
4 True
dtype: bool]
Use np.logical_or.reduce and filter with boolean indexing:
print (np.logical_or.reduce(m))
[ True False True False True]
df = df1[np.logical_or.reduce(m)]
print (df)
City District Town Country Continent
0 NY WASHIN DC US America
2 NY MARYLAND DC US S America
4 NY SEAGA NJ UK Europe
Another possible solution, using numpy broadcasting (it works even when df2 has more than one row):
df1.loc[np.sum(np.sum(df1.values == df2.values[:, None], axis=2) ==
               np.sum(df2.notna().values, axis=1)[:, None], axis=0) == 1]
Output:
City District Town Country Continent
0 NY WASHIN DC US America
2 NY MARYLAND DC US S America
4 NY SEAGA NJ UK Europe
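Both approaches can be wrapped into a small reusable helper. This is a sketch (the name filter_by_conditions is made up here), treating NaN cells in df2 as wildcards:

import numpy as np

def filter_by_conditions(df1, df2):
    # build one boolean mask per df2 row, ignoring NaN (wildcard) cells
    masks = []
    for _, row in df2.iterrows():
        cond = row.dropna()
        if cond.empty:
            continue  # a fully blank condition row matches nothing here
        masks.append(df1[cond.index].eq(cond).all(axis=1))
    # keep df1 rows matching at least one condition row
    return df1[np.logical_or.reduce(masks)] if masks else df1

print(filter_by_conditions(df1, df2))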
I have a dataset with statistics by region. I would like to build several city-level datasets based on it, and while creating each one, add a column with the name of the city.
That is, from one dataset I would like to get three.
I'll give you an example. Initial dataset:
df
date name_region
2022-01-01 California
2022-01-02 California
2022-01-03 California
Next, I have a list with cities: city_list = ['Los Angeles', 'San Diego', 'San Jose']
As an output, I want to have 3 datasets (or more, depending on the number of items in the list):
df_city_1
date name_region city
2022-01-01 California Los Angeles
2022-01-02 California Los Angeles
2022-01-03 California Los Angeles
df_city_2
date name_region city
2022-01-01 California San Diego
2022-01-02 California San Diego
2022-01-03 California San Diego
df_city_3
date name_region city
2022-01-01 California San Jose
2022-01-02 California San Jose
2022-01-03 California San Jose
It would be ideal if, at the same time, the data set could be accessed by a key determined by an element in the list:
df_city['Los Angeles']
date name_region city
2022-01-01 California Los Angeles
2022-01-02 California Los Angeles
2022-01-02 California Los Angeles
How can I do that? I only found a way to split into several datasets when the original set already contains the unique values of the column (in this case, the cities), but that does not suit me very well.
Another possible solution:
dfs = []
for city in city_list:
    dfs.append(df.assign(city=city))
cities = dict(zip(city_list, dfs))
cities['Los Angeles']
Output:
date name_region city
0 2022-01-01 California Los Angeles
1 2022-01-02 California Los Angeles
2 2022-01-02 California Los Angeles
@ouroboros1, whom I thank, suggests a very nice way of shortening my code:
cities = dict(zip(city_list, [df.assign(city = city) for city in city_list]))
You can use a dictionary comprehension, and add the column city each time using df.assign.
import pandas as pd
data = {'date': {0: '2022-01-01', 1: '2022-01-02', 2: '2022-01-02'},
        'name_region': {0: 'California', 1: 'California', 2: 'California'}}
df = pd.DataFrame(data)
city_list = ['Los Angeles', 'San Diego', 'San Jose']
# "df_city" as a `dict`
df_city = {city: df.assign(city=city) for city in city_list}
# accessing each `df` by key (i.e. a `list` element)
print(df_city['Los Angeles'])
date name_region city
0 2022-01-01 California Los Angeles
1 2022-01-02 California Los Angeles
2 2022-01-02 California Los Angeles
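As a small follow-up, if you later need all cities in one long frame instead of a dict, the per-city frames can be concatenated; a sketch using the df_city dict built above:

# stack every city frame into a single long DataFrame
combined = pd.concat(df_city.values(), ignore_index=True)
print(combined)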
I have a dataframe that looks like this:
Date City_State HousingPrice DowPrice NasqPrice
0 1996-04 New York, NY 169300.0 5579.864351 1135.628092
1 1996-04 Los Angeles, CA 157700.0 5579.864351 1135.628092
2 1996-04 Houston, TX 86500.0 5579.864351 1135.628092
3 1996-04 Chicago, IL 114000.0 5579.864351 1135.628092
4 1996-04 Phoenix, AZ 88100.0 5579.864351 1135.628092
5 1996-05 New York, NY 169800.0 5616.707742 1220.540472
6 1996-05 Los Angeles, CA 157600.0 5616.707742 1220.540472
I'm trying to reshape the dataframe so that I could plot it.
Is there a simple way to move the DowPrice and NasqPrice into the City_State column so it looks something like this, without having to split the dataframe in two, reshape them and then merge them back?
Date Category Price
0 1996-04 New York, NY 169300.0
1 1996-04 Los Angeles, CA 157700.0
2 1996-04 Houston, TX 86500.0
3 1996-04 DowPrice 5579.864351
4 1996-04 NasqPrice 1135.628092
This should do the trick:
df = pd.concat([
    df.groupby("Date")["DowPrice"].first().to_frame()
      .rename(columns={"DowPrice": "Price"}).assign(Category="Dow"),
    df.groupby("Date")["NasqPrice"].first().to_frame()
      .rename(columns={"NasqPrice": "Price"}).assign(Category="Nasdaq"),
    df.set_index("Date")
      .rename(columns={"City_State": "Category", "HousingPrice": "Price"})
      .drop(["NasqPrice", "DowPrice"], axis=1)
], axis=0, sort=False).reset_index()
Output (I removed the spaces in the category names on purpose, just as a shortcut when recreating your df; you will see them intact when using the code above):
Date Price Category
0 1996-04 5579.864351 Dow
1 1996-05 5616.707742 Dow
2 1996-04 1135.628092 Nasdaq
3 1996-05 1220.540472 Nasdaq
4 1996-04 169300.0 NewYork,NY
5 1996-04 157700.0 LosAngeles,CA
6 1996-04 86500.0 Houston,TX
7 1996-04 114000.0 Chicago,IL
8 1996-04 88100.0 Phoenix,AZ
9 1996-05 169800.0 NewYork,NY
10 1996-05 157600.0 LosAngeles,CA
You can use reshape/melt to do whatever you want, but your intent is not totally clear. You want Price to denote:
- HousingPrice for each (City_State, Date), if Category is a City_State
- otherwise DowPrice/NasqPrice for that Date
So you want to reshape/melt multiple columns, selecting depending on the value of Category.
You could do something like this and append it to itself, although I guess I'm reshaping and merging...
df.append(
    df[['Date', 'DowPrice', 'NasqPrice']].drop_duplicates()
      .melt('Date')
      .rename(columns={'variable': 'City_State', 'value': 'HousingPrice'})
).drop(columns=['DowPrice', 'NasqPrice'])
I think this may be what you are asking for.
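Note that DataFrame.append was removed in pandas 2.0. A roughly equivalent sketch using pd.concat, which also renames to the Category/Price names from the desired output, might be:

long_df = pd.concat([
    # housing rows: City_State becomes the Category
    df.drop(columns=['DowPrice', 'NasqPrice'])
      .rename(columns={'City_State': 'Category', 'HousingPrice': 'Price'}),
    # one Dow/Nasdaq row per Date, melted into Category/Price
    df[['Date', 'DowPrice', 'NasqPrice']].drop_duplicates()
      .melt('Date', var_name='Category', value_name='Price'),
], ignore_index=True)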
If you are looking to read the data from a csv:
import csv as cs

with open('/Documents/prices.csv', newline='') as csvfile:
    spamreader = cs.reader(csvfile, delimiter=',')
    for row in spamreader:
        print(','.join(row))
This is the easiest way I could find if you export the data as a csv file and use pandas dataframes.
import pandas as pd
data = pd.read_csv('/Documents/prices.csv')
part1 = data.filter(items = ['Date', 'Category', 'HousingPrice'])
However, it seems like you might want to plot the housing price against the Dow price and the Nasdaq price over the dates. I would just split the dataframe into three series and then plot those.
Where the three series are:
part1 = data.filter(items = ['Date', 'Category', 'HousingPrice'])
d2 = pd.DataFrame(data.filter(items = ['Date', 'NasqPrice']))
d3 = pd.DataFrame(data.filter(items = ['Date', 'DowPrice']))
Or just simply (this may be wrong and need an edit):
lines = data.plot.line(x='Date', y=['HousingPrice', 'DowPrice', 'NasqPrice'])
I have a data frame from which I'm trying to create a tsplot using seaborn. The first issue I'm having is that after converting my date column from string to a DateTime object, a day has been automatically added.
The original data frame looks like this:
zillowvisualtop5.head()
Out[161]:
City_State Year Price
0 New York, NY 1996-04 169300.0
1 Los Angeles, CA 1996-04 157700.0
2 Houston, TX 1996-04 86500.0
3 Chicago, IL 1996-04 114000.0
6 Phoenix, AZ 1996-04 88100.0
(Note that the date is in year-month format)
After I convert it into a DateTime object so that I can plot it using seaborn, I get the issue of having a day added after the month.
zillowvisualtop5['Year'] = pd.to_datetime(zillowvisualtop5['Year'], format= '%Y-%m')
zillowvisualtop5.head()
Out[165]:
City_State Year Price
0 New York, NY 1996-04-01 169300.0
1 Los Angeles, CA 1996-04-01 157700.0
2 Houston, TX 1996-04-01 86500.0
3 Chicago, IL 1996-04-01 114000.0
6 Phoenix, AZ 1996-04-01 88100.0
The solution that I've found seems to suggest converting to strftime but I need my time to be in DateTime format so I can plot it using seaborn.
The problem that you are having is that a DateTime object will always include all the components of a date object and a time object. There is no way to have a DateTime object that only has year and month info (source: here).
However, you can use the matplotlib converters like this:
import pandas as pd
from pandas.plotting import register_matplotlib_converters
import seaborn as sns

# register pandas' datetime converters with matplotlib
register_matplotlib_converters()

cols = ['City', 'Price', 'Month']
data = [['New York', 125, '1996-04'],
        ['New York', 79, '1996-05'],
        ['New York', 85, '1996-06'],
        ['Houston', 90, '1996-04'],
        ['Houston', 95, '1996-05'],
        ['Houston', 127, '1996-06']]
df = pd.DataFrame(data, columns=cols)
print(df)

# seaborn treats the year-month strings as ordered categories, so no day is shown
chart = sns.lineplot(x='Month', y='Price', hue='City', data=df)
Does that get the results you were looking for?
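If you do need a real datetime axis rather than categorical strings, a sketch (assuming matplotlib) is to convert the column and then format the ticks back to year-month:

import matplotlib.dates as mdates
import matplotlib.pyplot as plt

# convert to datetime; the day defaults to the 1st but is hidden by the formatter
df['Month'] = pd.to_datetime(df['Month'], format='%Y-%m')
ax = sns.lineplot(x='Month', y='Price', hue='City', data=df)
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.show()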
CSV file: (sample1.csv)
Location_City, Location_State, Name, hobbies
Los Angeles, CA, John, "['Music', 'Running']"
Texas, TX, Jack, "['Swimming', 'Trekking']"
I want to convert the hobbies column of the CSV into the following output:
Location_City, Location_State, Name, hobbies
Los Angeles, CA, John, Music
Los Angeles, CA, John, Running
Texas, TX, Jack, Swimming
Texas, TX, Jack, Trekking
I have read the csv into a dataframe, but I don't know how to convert it:
data = pd.read_csv("sample1.csv")
df=pd.DataFrame(data)
df
You can use findall or extractall to get lists from the hobbies column, then flatten them with chain.from_iterable and repeat the other columns:
from itertools import chain

# each row becomes a list of hobbies; the result is already object dtype
a = df['hobbies'].str.findall("'(.*?)'")
lens = a.str.len()

df1 = pd.DataFrame({
    'Location_City': df['Location_City'].values.repeat(lens),
    'Location_State': df['Location_State'].values.repeat(lens),
    'Name': df['Name'].values.repeat(lens),
    'hobbies': list(chain.from_iterable(a.tolist())),
})
Or create a Series with extractall, remove the first index level and join it back to the original DataFrame:
df1 = (df.join(df.pop('hobbies').str.extractall("'(.*?)'")[0]
.reset_index(level=1, drop=True)
.rename('hobbies'))
.reset_index(drop=True))
print (df1)
Location_City Location_State Name hobbies
0 Los Angeles CA John Music
1 Los Angeles CA John Running
2 Texas TX Jack Swimming
3 Texas TX Jack Trekking
We can solve this using the pandas.DataFrame.explode function, which was introduced in version 0.25.0; if you have that version or higher, you can use the code below.
explode function reference: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html
import pandas as pd
import ast
data = {
    'Location_City': ['Los Angeles', 'Texas'],
    'Location_State': ['CA', 'TX'],
    'Name': ['John', 'Jack'],
    'hobbies': ["['Music', 'Running']", "['Swimming', 'Trekking']"]
}
df = pd.DataFrame(data)
# Convert the string representation of a list into an actual list object
df['hobbies'] = df['hobbies'].apply(ast.literal_eval)
# Exploding the list
df = df.explode('hobbies')
print(df)
Location_City Location_State Name hobbies
0 Los Angeles CA John Music
0 Los Angeles CA John Running
1 Texas TX Jack Swimming
1 Texas TX Jack Trekking
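One small follow-up: explode keeps the original index (note the repeated 0 and 1 above). If a fresh 0..n-1 index is preferred, the explode step above can be written as (pandas >= 1.1):

df = df.explode('hobbies', ignore_index=True)
# or, on older versions: df = df.explode('hobbies').reset_index(drop=True)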