I have a large data set : https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-19/technology.csv
here is the head of the dataset:
I have grouped this dataset by variable and taken the average of each country's technology adoption with this:
df.groupby(['variable','iso3c'])[['value']].mean()
here is the output
                           value
variable     iso3c
BCG          AFG       45.763158
             AGO       56.648649
             ALB       93.875000
             ARE       86.650000
             ARG       93.700000
...                          ...
visitorrooms VNM    46920.636364
             YEM     5527.280000
             ZAF    48431.850000
             ZMB     3518.000000
             ZWE     4696.440000
Now, I want to sort within the variables by largest values to smallest. I thought of doing this:
df.groupby(['variable','iso3c'])[['value']].mean().sort_values(['variable','value'])
but this is the output
                           value
variable     iso3c
BCG          SWE    1.722500e+01
             SOM    3.812500e+01
             AFG    4.576316e+01
             TCD    4.586111e+01
             ETH    5.141026e+01
...                          ...
visitorrooms ESP    5.755948e+05
             JPN    6.531027e+05
             DEU    7.400641e+05
             ITA    9.286496e+05
             USA    3.040499e+06
[16933 rows x 1 columns]
I have no idea what happens to the values here. How do I fix this?
Nothing has happened to the values. Their range is so large that pandas is displaying them in scientific notation (for example, 1.722500e+01 is 17.225).
Option 1: You can chain your sort_values() with a round(x), where x is the number of decimal places you want.
Option 2: Set the pandas display options (for example display.precision or display.float_format) to a form you find more comfortable to work with.
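As a rough sketch of Option 2, the display.float_format option can be set temporarily with pd.option_context. The two values below are taken from the output in the question; the two-decimal format string is only an assumption about what reads well.

```python
import pandas as pd

# Two values from the question's output: a small one and a very large one
s = pd.Series([17.225, 3040499.0], name="value")

# Apply fixed-point formatting with two decimals only inside this block
with pd.option_context("display.float_format", "{:.2f}".format):
    text = s.to_string()

print(text)
```

Using pd.set_option("display.float_format", "{:.2f}".format) instead would change the display for the whole session.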
# Import pandas
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-07-19/technology.csv')
# Data Pre-Process
df_v2 = df.groupby(['variable','iso3c'])['value'].mean().reset_index()
df_v2.sort_values(['variable','value'],ascending=[True, False] ,inplace=True)
# Showing Output
df_v2
Hi Brother,
I have attached the code for you. If you have any questions, please let me know.
Thanks
Leon
I have a data set (datacomplete2), where I have data for each country for two different years. I want to calculate the difference between these years for each country (for values life, health, and lifegdp) and create a new data frame with the results.
The code:
for i in datacomplete2['Country'].unique():
    life.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'life'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'life'])
    health.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'health'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'health'])
    lifegdp.append(datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2016), 'lifegdp'] - datacomplete2.loc[(datacomplete2['Country']==i)&(datacomplete2['Year']==2000), 'lifegdp'])
newData = pd.DataFrame([life, health, lifegdp, datacomplete2['Country'].unique()], columns = ['life', 'health', 'lifegdp', 'country'])
newData
I think the for loop for calculating is correct, and the problem is in creating the new DataFrame. When I try to run the code, I get an error message: 4 columns passed, passed data had 210 columns.
I have 210 countries so I assume it somehow throws these values to the columns?
Here is also a link to a sneak peek of the data I'm using: https://i.imgur.com/jbGFPpk.png
The data as text would look like:
Country Code Year life health lifegdp
0 Algeria DZA 2000 70.292000 3.489033 20.146558
1 Algeria DZA 2016 76.078000 6.603844 11.520259
2 Angola AGO 2000 47.113000 1.908599 24.684593
3 Angola AGO 2016 61.547000 2.713149 22.684710
4 Antigua and Barbuda ATG 2000 73.541000 4.480701 16.412834
... ... ... ... ... ... ...
415 Vietnam VNM 2016 76.253000 5.659194 13.474181
416 World OWID_WRL 2000 67.684998 8.617628 7.854249
417 World OWID_WRL 2016 72.035337 9.978453 7.219088
418 Zambia ZMB 2000 44.702000 7.152371 6.249955
419 Zambia ZMB 2016 61.874000 4.477207 13.819775
Quick help required !!!
I started coding like two weeks ago so I'm very novice with this stuff.
Anurag Reddy's answer is a good concise solution if you know the dates in advance. To present an alternative and slightly more general answer - this problem is a good example use case for pandas.DataFrame.diff.
Note you don't actually need to sort the data in your example data but I've included a sort_values() line below to account for unsorted DataFrames.
import pandas as pd
# Read the raw datafile in
df = pd.read_csv("example.csv")
# Sort by country and year if required, so each difference is the later year minus the earlier one
df.sort_values(by=["Country", "Year"], inplace=True)
# Remove columns where you don't need the difference
new_df = df.drop(["Code", "Year"], axis=1)
# Group the data by country, take the difference between the rows, remove NaN rows, and reset the index to sequential integers
new_df = new_df.groupby(["Country"], as_index=False).diff().dropna().reset_index(drop=True)
# Add back the country names and codes as columns in the new DataFrame
new_df.insert(loc=0, column="Country", value=df["Country"].unique())
new_df.insert(loc=1, column="Code", value=df["Code"].unique())
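On the error message in the question: when pd.DataFrame receives a list of lists, each inner list becomes a row, so four lists of 210 items give 4 rows and 210 columns, which cannot be matched to four column names. A dict that maps column names to lists avoids this. A minimal sketch, using two countries and differences worked out by hand from the sample rows in the question:

```python
import pandas as pd

# Differences (2016 minus 2000) worked out from the sample rows in the question
countries = ["Algeria", "Angola"]
life = [5.786, 14.434]
health = [3.114811, 0.804550]

# Each dict key becomes a column, each list supplies that column's values
new_data = pd.DataFrame({"country": countries, "life": life, "health": health})
print(new_data)
```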
You could do this instead
country_list = df.Country.unique().tolist()
# drop() returns a new DataFrame, so assign the result back
df = df.drop(columns=['Code'])
df_2016 = df.loc[(df['Country'].isin(country_list))&(df['Year']==2016)].reset_index(drop=True)
df_2000 = df.loc[(df['Country'].isin(country_list))&(df['Year']==2000)].reset_index(drop=True)
df_2016 = df_2016.drop(columns=['Year'])
df_2000 = df_2000.drop(columns=['Year'])
# With Country as the index, the subtraction is aligned country by country
df_2016.set_index('Country').subtract(df_2000.set_index('Country'), fill_value=0)
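Another way to look at the same problem, sketched here on four rows from the question's sample data, is to pivot the years into columns and subtract one year's slice from the other. The variable names wide and new_data are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Algeria", "Algeria", "Angola", "Angola"],
    "Year": [2000, 2016, 2000, 2016],
    "life": [70.292, 76.078, 47.113, 61.547],
    "health": [3.489033, 6.603844, 1.908599, 2.713149],
})

# One row per country, columns indexed by (measure, Year)
wide = df.pivot(index="Country", columns="Year", values=["life", "health"])

# Select each year's slice from the column MultiIndex and subtract
diff = wide.xs(2016, axis=1, level="Year") - wide.xs(2000, axis=1, level="Year")
new_data = diff.reset_index()
print(new_data)
```

This assumes every country has exactly one row for each of the two years; a duplicated (Country, Year) pair makes pivot raise an error.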
I have to answer these questions based on this data:
Which country has the most elite level ramen bowls?
Which large brand produces the least consistent scores?
Assume review # is time based (Lower # means earlier)… Has the average stars changed over time?
Is the amount of detail in variety indicative of quality?
What “Style” of ramen would you prefer?
These are the questions I am answering.
My code is in a Jupyter notebook on the Google Colab platform.
import os
import json
from google.colab import drive
drive.mount('/content/drive')
import pandas as pd
df = pd.read_csv('file path')
## Question 1.
frame1 = df[df.Style == 'Bowl']
frame1 = frame1.groupby('Country')['Stars'].mean
##now I get an error. I have seen it with a max instead of a mean and working but mean should work still.
Could someone help me through this?
First of all, if you check your data, the Stars column contains some string entries with the value "Unrated". It is good practice to drop these rows by keeping only the entries that contain a digit, using a regular expression.
df = df[df.Stars.str.contains(r"\d")]
After that, convert the Stars column to float type:
df["Stars"] = df["Stars"].astype(float)
Now you can calculate any aggregated value:
frame1 = df[df.Style == 'Bowl']
frame1 = frame1.groupby('Country')['Stars'].mean()
Output:
Country
Canada 2.281250
China 3.527778
Hong Kong 3.735000
Japan 4.140278
Malaysia 4.281250
Philippines 3.375000
Singapore 4.096154
South Korea 3.865809
Taiwan 3.263514
Thailand 3.142045
UK 3.250000
USA 3.400000
Vietnam 3.362500
Name: Stars, dtype: float64
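An alternative to the regex filter, shown here as a sketch on made-up rows (the real file is not reproduced), is pd.to_numeric with errors="coerce", which turns entries such as "Unrated" into NaN so they can be dropped.

```python
import pandas as pd

# Made-up rows that mimic the Country, Style and Stars columns
df = pd.DataFrame({
    "Country": ["Japan", "Japan", "USA", "USA", "USA"],
    "Style": ["Bowl", "Bowl", "Bowl", "Bowl", "Cup"],
    "Stars": ["4", "5", "Unrated", "3.5", "1"],
})

# Non-numeric strings become NaN, then those rows are removed
df["Stars"] = pd.to_numeric(df["Stars"], errors="coerce")
df = df.dropna(subset=["Stars"])

means = df[df.Style == "Bowl"].groupby("Country")["Stars"].mean()
print(means)
```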
I would like to add the regional information to the main table that contains entity and account columns. In this way, each row in the main table should be duplicated, just like the append tool in Alteryx.
Is there a way to do this operation with Pandas in Python?
Thanks!
Before version 1.2, pandas had no built-in method for this, so you need to build the Cartesian product of the two DataFrames yourself; see this explanation of merging DataFrames in pandas.
But for your specific problem, try this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(columns=['Entity', 'Account'])
df1.Entity = ['Entity1', 'Entity1']
df1.Account = ['Sales', 'Cost']
df2 = pd.DataFrame(columns=['Region'])
df2.Region = ['North America', 'Europa', 'Asia']
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
resultdf = cartesian_product_simplified(df1, df2)
print(resultdf)
output:
0 1 2
0 Entity1 Sales North America
1 Entity1 Sales Europa
2 Entity1 Sales Asia
3 Entity1 Cost North America
4 Entity1 Cost Europa
5 Entity1 Cost Asia
as expected.
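In pandas 1.2 and later, merge accepts how="cross", which builds the same Cartesian product and keeps the column names. A minimal sketch with the same example data:

```python
import pandas as pd

df1 = pd.DataFrame({"Entity": ["Entity1", "Entity1"], "Account": ["Sales", "Cost"]})
df2 = pd.DataFrame({"Region": ["North America", "Europa", "Asia"]})

# Every row of df1 is paired with every row of df2
result = df1.merge(df2, how="cross")
print(result)
```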
By the way, next time please provide the DataFrame as code, not as a screenshot or a link. It helps us save time (please check how to ask).
I have cross-sectional data consisting of yearly crime frequencies and house prices in the Chicago area. I want to select groups of columns from the dataset in a loop, because I want to use them as features for training a regression model. Is there a quick way to do this?
example data snippet:
here is the screenshot of my data:
here is example data snippet on the cloud for browsing data.
my attempt:
here is one example that I could select group of columns as features for the training ML model.
import re
import urllib.request
import pandas as pd
# download data from cloud
u = "https://filebin.net/ml0sjn455gr8pvh3/crime_realEstate?t=7dkm15wq"
crime_realEstate = urllib.request.urlretrieve (u, "Ktest.csv")
# or just manually download data first and read
crime_realEstate = pd.read_csv('crime_realEstate.csv')
cols_2012 = crime_realEstate.filter(regex='_2012').columns
crime_realEstate['Area_Name']=crime_realEstate['Area_Name'].apply(lambda x: re.sub(' ', '_', str(x)))
regDF_2012 = crime_realEstate[cols_2012]
regDF_2012 = regDF_2012.assign(community_code=crime_finalDF['community_area'])
regDF_2012.dropna(inplace=True)
X_feats = regDF_2012.drop(['Avg_Price_2012'], axis=1)
y_label = regDF_2012['Avg_Price_2012'].values
Basically, I want to do the same thing for regDF_2013, regDF_2014 and so on in a loop, for better manipulation and easier access to the data.
Any idea how to make this happen? Thanks
Melt your dataframe. This way you have a separate column for each variable, indexed by Area_Name:
import pandas as pd
crime_realEstate = pd.read_csv("Ktest.csv", delimiter="\t", index_col=0)
crime_melted = pd.melt(crime_realEstate, id_vars=['Area_Name', 'community_area'])
crime_melted["crime"] = crime_melted["variable"].apply(lambda x: x[:-5])
crime_melted["year"] = crime_melted["variable"].apply(lambda x: x[-4:])
crime_melted.drop(columns=["variable"], inplace=True)
crime_melted.set_index("Area_Name", inplace=True)
Resulting dataframe is (example rows):
community_area value crime year
Area_Name
Grand Boulevard 38.0 135.000000 assault 2012
Grand Boulevard 38.0 108.000000 assault 2013
Grand Boulevard 38.0 116.000000 assault 2014
Grand Boulevard 38.0 78.000000 assault 2015
Grand Boulevard 38.0 105.000000 assault 2016
Index can be accessed by using loc:
crime_melted.loc["Grand Boulevard"]
Separate column for every variable is what you need for machine learning :-)
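If a separate feature matrix per year is still wanted, as in the question, the filter call can be put in a loop over the year suffixes. This is only a sketch: the toy frame and its column names (assault_2012, Avg_Price_2012 and so on) are assumptions standing in for the real file.

```python
import pandas as pd

# Toy stand-in for crime_realEstate; the column names are assumptions
crime_realEstate = pd.DataFrame({
    "Area_Name": ["A", "B", "C"],
    "assault_2012": [135.0, 90.0, 60.0],
    "Avg_Price_2012": [200.0, 250.0, 300.0],
    "assault_2013": [108.0, 95.0, None],
    "Avg_Price_2013": [210.0, 260.0, 310.0],
})

features = {}
for year in (2012, 2013):
    # Keep only the columns whose name ends with this year, then drop incomplete rows
    reg_df = crime_realEstate.filter(regex=f"_{year}$").dropna()
    target = f"Avg_Price_{year}"
    X_feats = reg_df.drop(columns=[target])
    y_label = reg_df[target].values
    features[year] = (X_feats, y_label)
```

features[2013] then holds the feature frame and label array for 2013, so no separately named regDF_2013 variable is needed.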
I am a noob and I have a large CSV file with data structured like this (with a lot more columns):
State daydiff
CT 5.5
CT 6.5
CT 6.25
NY 3.2
NY 3.225
PA 7.522
PA 4.25
I want to output a new CSV where the daydiff is averaged for each State like this:
State daydiff
CT 6.083
NY 3.2125
PA 5.886
I have tried numerous ways and the cleanest seemed to leverage pandas groupby but when i run the code below:
import pandas as pd
df = pd.read_csv('C:...input.csv')
df.groupby('State')['daydiff'].mean()
df.to_csv('C:...AverageOutput.csv')
I get a file that is identical to the original file but with a counter added in the first column with no header:
,State,daydiff
0,CT,5.5
1,CT,6.5
2,CT,6.25
3,NY,3.2
4,NY,3.225
5,PA,7.522
6,PA,4.25
I was also hoping to limit the new average in daydiff to two decimal places (hundredths). Thanks
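The counter appears because to_csv writes the index by default; df.to_csv('C:...AverageOutput.csv', index=False) suppresses it. The file is otherwise identical to the input because groupby(...).mean() returns a new object and leaves df unchanged, so the result has to be assigned to a variable and that variable written out. The grouped result uses State as its index, so in that case the index should be kept.
You can control the output format of daydiff by converting it to a string: df.daydiff = df.daydiff.apply(lambda x: '{:.2f}'.format(x))
Your complete code should be: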
import pandas as pd
df = pd.read_csv('C:...input.csv')
# Assign the grouped means to a new variable and format them to two decimals
df2 = df.groupby('State')['daydiff'].mean().apply(lambda x: '{:.2f}'.format(x))
df2.to_csv('C:...AverageOutput.csv')
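If the averages should stay numeric rather than become strings, round(2) is an alternative. The sketch below uses the sample rows from the question and in-memory buffers in place of the real files, so the io.StringIO objects are stand-ins, not part of the original code.

```python
import io
import pandas as pd

# In-memory stand-in for the input CSV, using the rows from the question
csv_in = io.StringIO(
    "State,daydiff\nCT,5.5\nCT,6.5\nCT,6.25\nNY,3.2\nNY,3.225\nPA,7.522\nPA,4.25\n"
)
df = pd.read_csv(csv_in)

# as_index=False keeps State as an ordinary column instead of the index
averages = df.groupby("State", as_index=False)["daydiff"].mean().round(2)

# Write without the integer index; a file path could be used instead of the buffer
out = io.StringIO()
averages.to_csv(out, index=False)
print(out.getvalue())
```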