How to recursively select group of columns from panel data in pandas? - python

I have cross-sectional data consisting of yearly crime frequencies in the Chicago area and house prices. I want to select groups of columns from the dataset repeatedly (one group per year) because I want to use them as features for training a regression model. Is there a quick way to do this?
example data snippet:
here is the screenshot of my data:
here is example data snippet on the cloud for browsing data.
My attempt:
Here is one example where I select a group of columns as features for training the ML model.
import re
import urllib.request
import pandas as pd
# download data from cloud
u = "https://filebin.net/ml0sjn455gr8pvh3/crime_realEstate?t=7dkm15wq"
urllib.request.urlretrieve(u, "Ktest.csv")
# or just manually download data first and read
crime_realEstate = pd.read_csv('crime_realEstate.csv')
cols_2012 = crime_realEstate.filter(regex='_2012').columns
crime_realEstate['Area_Name']=crime_realEstate['Area_Name'].apply(lambda x: re.sub(' ', '_', str(x)))
regDF_2012 = crime_realEstate[cols_2012]
regDF_2012 = regDF_2012.assign(community_code=crime_realEstate['community_area'])
regDF_2012.dropna(inplace=True)
X_feats = regDF_2012.drop(['Avg_Price_2012'], axis=1)
y_label = regDF_2012['Avg_Price_2012'].values
Basically, I want to do the same thing for regDF_2013, regDF_2014, and so on in a loop, so the data is easier to manipulate and access.
Any ideas on how to make this happen? Thanks.

Melt your dataframe. This way you have a separate column for each variable, and the rows are indexed by Area_Name:
import pandas as pd
crime_realEstate = pd.read_csv("Ktest.csv", delimiter="\t", index_col=0)
crime_melted = pd.melt(crime_realEstate, id_vars=['Area_Name', 'community_area'])
crime_melted["crime"] = crime_melted["variable"].apply(lambda x: x[:-5])
crime_melted["year"] = crime_melted["variable"].apply(lambda x: x[-4:])
crime_melted.drop(columns=["variable"], inplace=True)
crime_melted.set_index("Area_Name", inplace=True)
The resulting dataframe is (example rows):
                 community_area       value    crime  year
Area_Name
Grand Boulevard            38.0  135.000000  assault  2012
Grand Boulevard            38.0  108.000000  assault  2013
Grand Boulevard            38.0  116.000000  assault  2014
Grand Boulevard            38.0   78.000000  assault  2015
Grand Boulevard            38.0  105.000000  assault  2016
Rows can be selected by index label using loc:
crime_melted.loc["Grand Boulevard"]
A separate column for every variable is what you need for machine learning :-)
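To get back to the per-year feature sets the question asks for, one option is to group the melted frame by year and pivot each group so that every crime type becomes a column. The following is a minimal sketch under stated assumptions: the small `wide` frame and its values are made up to mimic the question's column naming (`<name>_<year>`), and the `datasets` dictionary is a hypothetical container, not something from the original post.

```python
import pandas as pd

# Hypothetical wide-format data shaped like the question's columns
wide = pd.DataFrame({
    "Area_Name": ["A", "B", "C"],
    "community_area": [1, 2, 3],
    "assault_2012": [10, 20, 30],
    "assault_2013": [11, 21, 31],
    "Avg_Price_2012": [100.0, 200.0, 300.0],
    "Avg_Price_2013": [110.0, 210.0, 310.0],
})

# Same melt as in the answer: split "<name>_<year>" into two columns
long = pd.melt(wide, id_vars=["Area_Name", "community_area"])
long["crime"] = long["variable"].str[:-5]
long["year"] = long["variable"].str[-4:]

# One (features, label) pair per year, keyed by the year string
datasets = {}
for year, grp in long.groupby("year"):
    # Each variable of this year becomes its own column again
    table = grp.pivot(index="Area_Name", columns="crime", values="value").dropna()
    X = table.drop(columns=["Avg_Price"])
    y = table["Avg_Price"].values
    datasets[year] = (X, y)
```

With this layout, `datasets["2013"]` plays the role of `regDF_2013` in the question, and adding another year to the data needs no code change. Note that `community_area` is not carried into the features here; it would have to be added back if it is needed.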

Related

Create a new DataFrame from selected data from another DataFrame

I want to create a box plot using pandas. I have data with average temperatures, and I want to select three cities and create three box plots to compare their temperatures. To achieve this, I created a result DataFrame to store the data; the values for the cities are supposed to be stored in three columns (one column per city).
However, the following code only shows the plot for the first city. The problem is with the DataFrame. A separate query correctly returns a series of values, but when I insert it into the result DataFrame, a column of NaN values is stored there. What am I missing here?
import pandas
import matplotlib.pyplot as plt
import wget
wget.download("https://raw.githubusercontent.com/pesikj/python-012021/master/zadani/5/temperature.csv")
temperatures = pandas.read_csv("temperature.csv")
helsinki = temperatures[temperatures["City"] == "Helsinki"]["AvgTemperature"]
miami = temperatures[temperatures["City"] == "Miami Beach"]["AvgTemperature"]
tokyo = temperatures[temperatures["City"] == "Tokyo"]["AvgTemperature"]
result = pandas.DataFrame()
result["Helsinki"] = helsinki
result["Miami Beach"] = miami
result["Tokyo"] = tokyo
result.plot(kind="box",whis=[0,100])
plt.show()
Pivot into City columns using pivot_table() and select the 3 cities you want:
result = temperatures.pivot_table(
    index='Day',
    columns='City',
    values='AvgTemperature',
)[['Helsinki', 'Miami Beach', 'Tokyo']]
# City  Helsinki  Miami Beach  Tokyo
# Day
# 1         29.6         74.6   59.1
# 2         29.5         76.8   62.3
# ...
# 29        35.3         77.7   58.4
# 30        35.7         78.0   51.5
result.plot(kind='box', whis=[0,100])
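As for why the original code produced NaN columns: pandas aligns on the index when a Series is assigned as a column. The first assignment gives the empty result frame the Helsinki rows' index, and the later Series have row labels that do not overlap with it. A minimal sketch of the effect, using a tiny made-up `temperatures` frame rather than the real CSV:

```python
import pandas as pd

# Made-up stand-in for the temperature data
temperatures = pd.DataFrame({
    "City": ["Helsinki", "Helsinki", "Tokyo", "Tokyo"],
    "AvgTemperature": [29.6, 29.5, 59.1, 62.3],
})
helsinki = temperatures[temperatures["City"] == "Helsinki"]["AvgTemperature"]
tokyo = temperatures[temperatures["City"] == "Tokyo"]["AvgTemperature"]

result = pd.DataFrame()
result["Helsinki"] = helsinki  # result takes the index labels 0, 1
result["Tokyo"] = tokyo        # labels 2, 3 do not match 0, 1, so NaN is stored

# Dropping the original row labels lets the columns line up by position
fixed = pd.DataFrame({
    "Helsinki": helsinki.reset_index(drop=True),
    "Tokyo": tokyo.reset_index(drop=True),
})
```

Resetting the index is only a positional fix; the pivot above is the more robust choice because it aligns the cities on a meaningful key (the day).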
Since you're using data science packages, consider using seaborn, which does the job of filtering/grouping data for you whenever you call one of its plot functions:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Load dataset
url = "https://raw.githubusercontent.com/pesikj/python-012021/master/zadani/5/temperature.csv"
temperatures = pd.read_csv(url)
# Filter for cities of interest
cities = ['Helsinki', 'Miami Beach', 'Tokyo']
filtered_temperatures = temperatures.loc[temperatures['City'].isin(cities)]
# Let seaborn do the grouping
sns.violinplot(data=filtered_temperatures, x='City', y='AvgTemperature')
plt.show()

How to construct new DataFrame based on data from for loops?

I have a data set (datacomplete2), where I have data for each country for two different years. I want to calculate the difference between these years for each country (for values life, health, and lifegdp) and create a new data frame with the results.
The code:
# collect the per-country differences (2016 minus 2000)
life, health, lifegdp = [], [], []
for i in datacomplete2['Country'].unique():
    life.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'life'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'life'])
    health.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'health'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'health'])
    lifegdp.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'lifegdp'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'lifegdp'])
newData = pd.DataFrame([life, health, lifegdp, datacomplete2['Country'].unique()], columns = ['life', 'health', 'lifegdp', 'country'])
newData
I think the for loop for calculating is correct, and the problem is in creating the new DataFrame. When I try to run the code, I get an error message: 4 columns passed, passed data had 210 columns.
I have 210 countries so I assume it somehow throws these values to the columns?
Here is also a link to a sneak peek of the data I'm using: https://i.imgur.com/jbGFPpk.png
The data as text would look like:
                 Country      Code  Year       life    health    lifegdp
0                Algeria       DZA  2000  70.292000  3.489033  20.146558
1                Algeria       DZA  2016  76.078000  6.603844  11.520259
2                 Angola       AGO  2000  47.113000  1.908599  24.684593
3                 Angola       AGO  2016  61.547000  2.713149  22.684710
4    Antigua and Barbuda       ATG  2000  73.541000  4.480701  16.412834
...                  ...       ...   ...        ...       ...        ...
415              Vietnam       VNM  2016  76.253000  5.659194  13.474181
416                World  OWID_WRL  2000  67.684998  8.617628   7.854249
417                World  OWID_WRL  2016  72.035337  9.978453   7.219088
418               Zambia       ZMB  2000  44.702000  7.152371   6.249955
419               Zambia       ZMB  2016  61.874000  4.477207  13.819775
Quick help required !!!
I started coding like two weeks ago so I'm very novice with this stuff.
Anurag Reddy's answer is a good concise solution if you know the dates in advance. To present an alternative and slightly more general answer - this problem is a good example use case for pandas.DataFrame.diff.
Note you don't actually need to sort the data in your example data but I've included a sort_values() line below to account for unsorted DataFrames.
import pandas as pd
# Read the raw datafile in
df = pd.read_csv("example.csv")
# Sort the data if required
df.sort_values(by=["Country", "Year"], inplace=True)
# Remove columns where you don't need the difference
new_df = df.drop(["Code", "Year"], axis=1)
# Group the data by country, take the difference between the rows, remove NaN rows, and reset the index to sequential integers
new_df = new_df.groupby(["Country"], as_index=False).diff().dropna().reset_index(drop=True)
# Add back the country names and codes as columns in the new DataFrame
new_df.insert(loc=0, column="Country", value=df["Country"].unique())
new_df.insert(loc=1, column="Code", value=df["Code"].unique())
You could do this instead:
country_list = df.Country.unique().tolist()
# drop() returns a new DataFrame, so assign the result back
df = df.drop(columns=['Code'])
df_2016 = df.loc[(df['Country'].isin(country_list))&(df['Year']==2016)].drop(columns=['Year'])
df_2000 = df.loc[(df['Country'].isin(country_list))&(df['Year']==2000)].drop(columns=['Year'])
# subtract aligns on the Country index, giving 2016 minus 2000 for each country
df_2016.set_index('Country').subtract(df_2000.set_index('Country'), fill_value=0)
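On the error message in the question itself: when `pd.DataFrame` receives a list of lists, it treats each inner list as a row, so four lists of 210 entries become 4 rows with 210 columns, which cannot match four column names. Passing a dict of column name to values avoids this. A minimal sketch, where the three countries and all numbers are invented for illustration:

```python
import pandas as pd

# Invented example values: one entry per country in each list
countries = ["Aland", "Borduria", "Syldavia"]
life = [5.8, 14.4, 2.1]
health = [3.1, 0.8, 1.5]

# A list of lists is read row by row: 2 rows, 3 columns
by_rows = pd.DataFrame([life, health])

# A dict maps each name to a whole column: 3 rows, 3 columns
by_cols = pd.DataFrame({"country": countries, "life": life, "health": health})
```

For this to work with the loop from the question, each appended item would need to be a single number rather than a one-element Series, which is one more reason to prefer the vectorized answers above.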

When answering this using the excel.csv and Pandas? I wrote out some code but it gives errors

I have to answer the following questions based on this data:
Which country has the most elite-level ramen bowls?
Which large brand produces the least consistent scores?
Assume review # is time-based (lower # means earlier). Has the average number of stars changed over time?
Is the amount of detail in the variety indicative of quality?
What "Style" of ramen would you prefer?
My code is in a Jupyter notebook on the Google Colab platform.
import os
import json
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
df = pd.read_csv('file path')
## Question 1.
frame1 = df[df.Style == 'Bowl']
frame1 = frame1.groupby('Country')['Stars'].mean
##now I get an error. I have seen it with a max instead of a mean and working but mean should work still.
Could someone help me through this?
First of all, if you check your data, the Stars column contains some string entries with the value "Unrated". It is good practice to remove these rows, keeping only the numeric data by filtering with a regular expression.
df = df[df.Stars.str.contains(r"\d")]
After that, convert the Stars column to float type:
df["Stars"] = df["Stars"].astype(float)
And now you can calculate any aggregate value (note that mean has to be called with parentheses, as mean()):
frame1 = df[df.Style == 'Bowl']
frame1 = frame1.groupby('Country')['Stars'].mean()
Output:
Country
Canada         2.281250
China          3.527778
Hong Kong      3.735000
Japan          4.140278
Malaysia       4.281250
Philippines    3.375000
Singapore      4.096154
South Korea    3.865809
Taiwan         3.263514
Thailand       3.142045
UK             3.250000
USA            3.400000
Vietnam        3.362500
Name: Stars, dtype: float64
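An alternative to the regular-expression filter, sketched here on a small made-up frame (the rows below are not from the ramen dataset), is `pd.to_numeric` with `errors="coerce"`. It turns any non-numeric entry such as "Unrated" into NaN, which can then be dropped.

```python
import pandas as pd

# Made-up rows that imitate the relevant columns of the dataset
df = pd.DataFrame({
    "Country": ["Japan", "Japan", "USA", "USA", "Malaysia"],
    "Style": ["Bowl", "Bowl", "Bowl", "Cup", "Bowl"],
    "Stars": ["4", "5", "Unrated", "3.5", "4.25"],
})

# Non-numeric strings become NaN instead of raising an error
df["Stars"] = pd.to_numeric(df["Stars"], errors="coerce")
df = df.dropna(subset=["Stars"])

# Mean rating of bowls per country; note the parentheses on mean()
means = df[df["Style"] == "Bowl"].groupby("Country")["Stars"].mean()
```

This version does not depend on what the non-numeric placeholder looks like, which may be convenient if the column contains more than one kind of bad value.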

How to append two dataframe objects containing same column data but different column names?

I want to append an expense df to a revenue df but can't properly do so. Can anyone offer how I may do this?
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/'+ symbol +'/financials'
df=pd.read_html(url)
revenue = pd.concat(df[0:1]) # the revenue dataframe obj
revenue = revenue.dropna(axis='columns') # drop naN column
header = revenue.iloc[:0] # revenue df header row
expense = pd.concat(df[1:2]) # the expense dataframe obj
expense = expense.dropna(axis='columns') # drop naN column
statement = revenue.append(expense) #results in a dataframe with an added column (Unnamed:0)
Columns of revenue = pd.concat(df[0:1]):
Fiscal year is January-December. All values CAD millions. | 2015 | 2016 | 2017 | 2018 | 2019
Columns of expense = pd.concat(df[1:2]):
Unnamed: 0 | 2015 | 2016 | 2017 | 2018 | 2019
How can I append the expense dataframe to the revenue dataframe so that I am left with a single dataframe object?
Thanks,
Rename columns.
df = df.rename(columns={'old_name': 'new_name',})
Then append with merge(), join(), or concat().
I managed to append the dataframes with the following code. Thanks, David, for putting me on the right track. I admit this is not the best way to do it: in a runtime environment I don't know the text of the column to rename, and I've hard-coded it here. Ideally, it would be best to reference a placeholder at df.iloc[:0,0] instead, but I'm having a tough time getting that to work.
df=pd.read_html(url)
revenue = pd.concat(df[0:1])
revenue = revenue.dropna(axis='columns')
revenue.rename({'Fiscal year is January-December. All values CAD millions.':'LineItem'},axis=1,inplace=True)
header = revenue.iloc[:0]
expense = pd.concat(df[1:2])
expense = expense.dropna(axis='columns')
expense.rename({'Unnamed: 0':'LineItem'}, axis=1, inplace=True)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
statement = pd.concat([revenue, expense], ignore_index=True)
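One way to avoid hard-coding the header text is to rename the first column by position, looking its current name up from `columns[0]`. The sketch below uses two tiny made-up frames in place of the scraped tables, and `with_line_item` is a hypothetical helper name. It combines the frames with `pd.concat`, since `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Made-up stand-ins for the scraped revenue and expense tables
revenue = pd.DataFrame({
    "Fiscal year is January-December. All values CAD millions.": ["Sales"],
    "2019": [100],
})
expense = pd.DataFrame({"Unnamed: 0": ["COGS"], "2019": [60]})


def with_line_item(frame):
    """Return a copy whose first column is named 'LineItem', whatever it was called."""
    return frame.rename(columns={frame.columns[0]: "LineItem"})


statement = pd.concat(
    [with_line_item(revenue), with_line_item(expense)],
    ignore_index=True,
)
```

This assumes the line-item labels are always in the first column of each table, which matches the two tables shown in the question but is not guaranteed for every page.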
With the df=pd.read_html(url) construct, a list of DataFrames is returned when scraping MarketWatch financials. The function below returns a single DataFrame of all balance sheet elements. The same code applies to quarterly and annual income and cash flow statements.
def getBalanceSheet(url):
    # read_html returns a list of DataFrames, one per table on the page
    df = pd.read_html(url)
    # count the tables that have the unnamed line-item column
    count = sum(1 for table in df if 'Unnamed: 0' in table)
    statement = pd.concat(df[0:1])
    statement = statement.dropna(axis='columns')
    if 'q' in url:  # quarterly
        statement.rename({'All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    else:
        statement.rename({'Fiscal year is January-December. All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    for rowidx in range(count):
        part = pd.concat(df[rowidx+1:rowidx+2])
        part = part.dropna(axis='columns')
        part.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        statement = pd.concat([statement, part], ignore_index=True)
    return statement

Adding information from a smaller table to a large one with Pandas

I would like to add the regional information to the main table that contains entity and account columns. In this way, each row in the main table should be duplicated, just like the append tool in Alteryx.
Is there a way to do this operation with Pandas in Python?
Thanks!
Older pandas versions have no built-in method for this: you need to build the Cartesian product of the two DataFrames yourself (check the explanation of merging DataFrames in pandas). Since pandas 1.2, merge() also accepts how="cross" for exactly this.
But for your specific problem, try this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['Entity', 'Account'])
df1.Entity = ['Entity1', 'Entity1']
df1.Account = ['Sales', 'Cost']
df2 = pd.DataFrame(columns=['Region'])
df2.Region = ['North America', 'Europa', 'Asia']
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
resultdf = cartesian_product_simplified(df1, df2)
print(resultdf)
Output:
         0      1              2
0  Entity1  Sales  North America
1  Entity1  Sales         Europa
2  Entity1  Sales           Asia
3  Entity1   Cost  North America
4  Entity1   Cost         Europa
5  Entity1   Cost           Asia
as expected.
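On pandas 1.2 or newer, the same result can be obtained with a cross merge, which also keeps the column names. A short sketch using the same example frames as above:

```python
import pandas as pd

df1 = pd.DataFrame({
    "Entity": ["Entity1", "Entity1"],
    "Account": ["Sales", "Cost"],
})
df2 = pd.DataFrame({"Region": ["North America", "Europa", "Asia"]})

# how="cross" pairs every row of df1 with every row of df2
resultdf = df1.merge(df2, how="cross")
print(resultdf)
```

The NumPy helper above remains useful on older pandas versions, where `how="cross"` is not available.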
By the way, next time please provide the DataFrame as code, not as a screenshot or a link. It helps us save time (please check how to ask).
