How to convert from pandas dataframe to a dictionary - python

I have looked at Convert a Pandas DataFrame to a dictionary for guidance on converting my dataframe to a dictionary. However, I can't seem to change my code so that the output is a dictionary.
Below is my code.
import pandas as pd
import collections

governmentcsv = pd.read_csv('government-procurement-via-gebiz.csv', parse_dates=True)  # read csv; it contains dates (parse_dates=True)
extract = governmentcsv.loc[:, ['supplier_name', 'award_date']]  # keep only these columns
extract.award_date = pd.to_datetime(extract.award_date)

def extract_supplier_not_na_2015():
    notNAFifteen = extract[(extract.supplier_name != 'na') & (extract.award_date.dt.year == 2015)]  # extract only year 2015
    notNAFifteen.reset_index(drop=True, inplace=True)  # reset index
    notNAFifteen.index += 1  # and make the index start from 1
    #SortednotNAFifteen = collections.orderedDictionary(notNAFifteen)
    return notNAFifteen

print(extract_supplier_not_na_2015())
The OUTPUT is:
supplier_name award_date
1 RMA CONTRACTS PTE. LTD. 2015-01-28
2 TESCOM (SINGAPORE) SOFTWARE SYSTEMS TESTING PT... 2015-07-01
3 MKS GLOBAL PTE. LTD. 2015-04-24
4 CERTIS TECHNOLOGY (SINGAPORE) PTE. LTD. 2015-06-26
5 RHT COMPLIANCE SOLUTIONS PTE. LTD. 2015-08-14
6 CLEANMAGE PTE. LTD. 2015-07-30
7 SOLUTIONSATWORK PTE. LTD. 2015-11-23
8 Ufinity Pte Ltd 2015-05-04
9 NCS PTE. LTD. 2015-01-28

I think I found this dataset:
https://data.gov.sg/dataset/government-procurement
Anyway, here is the code:
import pandas as pd

df = pd.read_csv('government-procurement-via-gebiz.csv',
                 encoding='unicode_escape',
                 parse_dates=['award_date'],
                 infer_datetime_format=True,
                 usecols=['supplier_name', 'award_date'],
                 )

df = df[(df['supplier_name'] != 'Unknown') & (df['award_date'].dt.year == 2015)].reset_index(drop=True)

# Faster way:
d1 = df.set_index('supplier_name').to_dict()['award_date']

# Alternatively:
d2 = dict(zip(df['supplier_name'], df['award_date']))
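Note that dictionary keys are unique, so if the same supplier_name appears more than once, both d1 and d2 silently keep only the last award_date for that supplier. A minimal sketch of one way to keep every date instead, assuming the same df as above:
# Collect all award dates per supplier instead of overwriting duplicates
d3 = df.groupby('supplier_name')['award_date'].apply(list).to_dict()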

Related

Using Python and Pandas for a filtering task

You’ll need to bring all your filtering skills together for this task. We’ve provided you a list of companies in the developers variable. Filter df however you choose so that you only get games that meet the following conditions:
Sold in all 3 regions (North America, Europe, and Japan)
The Japanese sales were greater than the combined sales from North America and Europe
The game developer is one of the companies in the developers list
There is no column that explicitly says whether a game was sold in each region, but you can infer that a game was not sold in a region if its sales are 0 for that region.
Use the cols variable to select only the 'name', 'developer', 'na_sales', 'eu_sales', and 'jp_sales' columns from the filtered DataFrame, and assign the result to a variable called df_filtered. Print the whole DataFrame.
You can use a filter mask or a query string for this task. In either case, you need to check whether the 'jp_sales' column is greater than the sum of 'na_sales' and 'eu_sales', check that each sales column is greater than 0, and use isin() to check whether the 'developer' column contains one of the values in developers. Use [cols] to select only those columns and then print df_filtered.
  developer  na_sales  eu_sales  jp_sales  critic_score  user_score
0  Nintendo     41.36     28.96      3.77          76.0         8.0
1       NaN     29.08      3.58      6.81           NaN         NaN
2  Nintendo     15.68     12.76      3.79          82.0         8.3
3  Nintendo     15.61     10.93      3.28          80.0         8.0
4       NaN     11.27      8.89     10.22           NaN         NaN
This is my code. It's pretty difficult, and I'm having trouble producing a df_filtered variable with code that runs:
import pandas as pd
df = pd.read_csv('/datasets/vg_sales.csv')
df['user_score'] = pd.to_numeric(df['user_score'], errors='coerce')
developers = ['SquareSoft', 'Enix Corporation', 'Square Enix']
cols = ['name', 'developer', 'na_sales', 'eu_sales', 'jp_sales']
df_filtered = df([cols ]> 0 | cols['jp_sales'] > sum(cols['eu_sales']+cols['na_sales']) | df['developer'].isin(developers))
print(df_filtered)
If I understand correctly, this looks like multi-condition dataframe filtering:
df[
    df["developer"].isin(developers)
    & (df["jp_sales"] > df["na_sales"] + df["eu_sales"])
    & ~df["na_sales"].isnull()
    & ~df["eu_sales"].isnull()
    & ~df["jp_sales"].isnull()
]
It will not return any rows for the sample dataset given in the question, because the conditions (JP sales exceeding NA plus EU sales combined, and the developer being in the given list) are never met together there. But it works for data that satisfies them:
import numpy as np
import pandas as pd

data = [
    ("SquareSoft", 41.36, 28.96, 93.77, 76.0, 8.0),
    (np.nan, 29.08, 3.58, 6.81, np.nan, np.nan),
    ("SquareSoft", 15.68, 12.76, 3.79, 82.0, 8.3),
    ("Nintendo", 15.61, 10.93, 30.28, 80.0, 8.0),
    (np.nan, 11.27, 8.89, 10.22, np.nan, np.nan),
]
columns = ["developer", "na_sales", "eu_sales", "jp_sales", "critic_score", "user_score"]
developers = ['SquareSoft', 'Enix Corporation', 'Square Enix']
df = pd.DataFrame(data=data, columns=columns)
[Out]:
developer na_sales eu_sales jp_sales critic_score user_score
0 SquareSoft 41.36 28.96 93.77 76.0 8.0
Try this:
developers = ['SquareSoft', 'Enix Corporation', 'Square Enix']
cols = ['name', 'developer', 'na_sales', 'eu_sales', 'jp_sales']

cond = (
    # Sold in all 3 regions
    df[["na_sales", "eu_sales", "jp_sales"]].gt(0).all(axis=1)
    # JP sales greater than NA and EU sales combined
    & df["jp_sales"].gt(df["na_sales"] + df["eu_sales"])
    # Developer is in a predefined list
    & df["developer"].isin(developers)
)

if cond.any():
    df_filtered = df.loc[cond, cols]
else:
    print("No match found")

Aggregating two values in panel data with the same date and id but different type

I have been trying to work with this dataset, which includes quantities for two types of sales (0 and 1) for different counties across different dates. Some dates, however, include both type 1 and type 0 sales. How can I merge type 1 and type 0 sales for the same date and same id? The dataset has over 40k rows and I have no idea where to start. I was thinking about writing an if loop, but I have no idea how to write it. It can be in Python or R.
Essentially, I have a table that looks like this:
Date        City       Type  Quantity
2020-01-01  Rio           1        10
2020-01-01  Rio           0        16
2020-03-01  Rio           0        23
2020-03-01  Rio           1        27
2020-05-01  Rio           1        29
2020-08-01  Rio           0        36
2020-01-01  Sao Paulo     0        50
2020-01-01  Sao Paulo     1        62
2020-03-01  Sao Paulo     0        30
2020-04-01  Sao Paulo     1        32
2020-05-01  Sao Paulo     0        65
2020-05-01  Sao Paulo     1       155
I want to combine, for example, Rio's quantities for both type 1 and 0 on 2020-01-01, as well as 2020-03-01, and the same thing for Sao Paulo and all subsequent counties. I want to aggregate types 1 and 0 quantities but still preserve the date and city columns.
Try something like this:
import pandas as pd
df = pd.read_csv('your_file_name.csv')
df.pivot_table(values='Quantity', index=['Date', 'City'], aggfunc='sum')
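The pivot above leaves Date and City in the index; if you want them back as ordinary columns, as in the original table, chaining reset_index() should do it. A small sketch, assuming the column names from the question's table:
# One row per (Date, City), with type 0 and type 1 quantities summed
out = df.pivot_table(values='Quantity', index=['Date', 'City'], aggfunc='sum').reset_index()
print(out)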
You can use the pandas groupby and agg functions to do this operation. Here is some example code:
import pandas as pd

df = pd.DataFrame({'date': ['3/10/2000', '3/11/2000', '3/12/2000', '3/10/2000'],
                   'id': [0, 1, 0, 0], 'sale_type': [0, 0, 0, 1], 'amount': [2, 3, 4, 2]})
df['date'] = pd.to_datetime(df['date'])
df.groupby(['date', 'id']).agg({'amount': 'sum'})
>>>
               amount
date       id
2000-03-10 0        4
2000-03-11 1        3
2000-03-12 0        4
My version of the code:
# -*- coding: utf-8 -*-
import pandas as pd

# generating a sample dataframe
df = pd.DataFrame([['10-01-2020', 311100, 'ABADIA', 'MG', 'MINAS', 'IVERMECTIONA', 0, 68],
                   ['10-01-2020', 311100, 'ABADIA', 'MG', 'MINAS', 'IVERMECTIONA', 1, 120]],
                  columns=['date', 'code1', 'code2', 'code3', 'code4', 'code5', 'type_of_sales', 'count_sales'])

# printing content of dataframe
print(df)

# grouping by the columns we want to keep in the result set and aggregating the additive column
df = df.groupby(['date', 'code1', 'code2', 'code3', 'code4', 'code5']).agg({'count_sales': ['sum']})

# aligning levels of column headers
df = df.droplevel(axis=1, level=0).reset_index()

# renaming the aggregated column back to its previous name
df = df.rename(columns={'sum': 'count_sales'})
print(df)
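As a side note, grouping with as_index=False and summing the column directly avoids the extra header level, so the droplevel and rename steps aren't needed; a sketch applied to the original sample dataframe above:
# Same aggregation without creating MultiIndex columns
df = df.groupby(['date', 'code1', 'code2', 'code3', 'code4', 'code5'],
                as_index=False)['count_sales'].sum()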

keep order while using python pandas pivot

import pandas as pd

df = {'Region': ['France', 'France', 'France', 'France'], 'total': [1, 2, 3, 4], 'date': ['12/30/19', '12/31/19', '01/01/20', '01/02/20']}
df = pd.DataFrame.from_dict(df)
print(df)
Region total date
0 France 1 12/30/19
1 France 2 12/31/19
2 France 3 01/01/20
3 France 4 01/02/20
The dates are ordered. Now if I use pivot:
pandas_temp = df.pivot(index='Region',values='total', columns='date')
print(pandas_temp)
date 01/01/20 01/02/20 12/30/19 12/31/19
Region
France 3 4 1 2
I am losing the order. How can I keep it?
Convert the values to datetimes before pivoting, and then, if necessary, convert back to your custom format:
df['date'] = pd.to_datetime(df['date'])
pandas_temp = df.pivot(index='Region',values='total', columns='date')
pandas_temp = pandas_temp.rename(columns=lambda x: x.strftime('%m/%d/%y'))
#alternative
#pandas_temp.columns = pandas_temp.columns.strftime('%m/%d/%y')
print(pandas_temp)
date 12/30/19 12/31/19 01/01/20 01/02/20
Region
France 1 2 3 4
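An alternative that keeps the original order without converting to datetimes is to reindex the pivoted columns by first-appearance order; a minimal sketch assuming the same df:
pandas_temp = df.pivot(index='Region', values='total', columns='date')
# df['date'].unique() preserves the order in which dates first appear,
# so the columns come back as 12/30/19, 12/31/19, 01/01/20, 01/02/20
pandas_temp = pandas_temp[df['date'].unique()]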

How to use pandas to pull out the counties with the largest amount of water used in a given year?

I am new to python and pandas and I am struggling to figure out how to pull out the 10 counties with the most water used for irrigation in 2014.
%matplotlib inline
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('info.csv')  # reads csv
data['Year'] = pd.to_datetime(data['Year'], format='%Y')  # converts string to datetime
data.index = data['Year']  # makes year the index
del data['Year']  # delete the duplicate year column
This is what the data looks like (this is only partial of the data):
County WUCode RegNo Year SourceCode SourceID Annual CountyName
1 IR 311 2014 WELL 1 946 Adams
1 IN 311 2014 INTAKE 1 268056 Adams
1 IN 312 2014 WELL 1 48 Adams
1 IN 312 2014 WELL 2 96 Adams
1 IR 312 2014 INTAKE 1 337968 Adams
3 IR 315 2014 WELL 5 81900 Putnam
3 PS 315 2014 WELL 6 104400 Putnam
I have a couple of questions:
I am not sure how to pull out only the rows with "IR" in the WUCode column with pandas, and I am not sure how to print a table with the 10 counties with the highest water usage for irrigation in 2014.
I have been able to use the .loc function to pull out the information I need, with something like this:
data.loc['2014', ['CountyName', 'Annual', 'WUCode']]
From here I am kind of lost. Help would be appreciated!
import numpy as np
import pandas as pd
import string

df = pd.DataFrame(data={"Annual": np.random.randint(20, 1000000, 1000),
                        "Year": np.random.randint(2012, 2016, 1000),
                        "CountyName": np.random.choice(list(string.ascii_letters), 1000)},
                  columns=["Annual", "Year", "CountyName"])
Say df looks like:
Annual Year CountyName
0 518966 2012 s
1 44511 2013 E
2 332010 2012 e
3 382168 2013 c
4 202816 2013 y
For the year 2014...
df[df['Year'] == 2014]
Group by CountyName...
df[df['Year'] == 2014].groupby("CountyName")
Look at Annual...
df[df['Year'] == 2014].groupby("CountyName")["Annual"]
Get the sum...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum()
Sort the result descending...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False)
Take the top 10...
df[df['Year'] == 2014].groupby("CountyName")["Annual"].sum().sort_values(ascending=False).head(10)
This example prints out (your actual result may vary since my data was random):
CountyName
Q 5191814
y 4335358
r 4315072
f 3985170
A 3685844
a 3583360
S 3301817
I 3231621
t 3228578
u 3164965
This may work for you:
res = df[df['WUCode'] == 'IR'].groupby(['Year', 'CountyName'])['Annual'].sum()\
      .reset_index()\
      .sort_values('Annual', ascending=False)\
      .head(10)
# Year CountyName Annual
# 0 2014 Adams 338914
# 1 2014 Putnam 81900
Explanation
Filter by WUCode, as required, and groupby Year and CountyName.
Use reset_index so your result is a dataframe rather than a series.
Use sort_values and extract top 10 via pd.DataFrame.head.
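As a side note, pd.Series.nlargest folds the sort and head steps into one call; a sketch under the same assumptions:
res = (df[df['WUCode'] == 'IR']
       .groupby(['Year', 'CountyName'])['Annual'].sum()
       .nlargest(10)
       .reset_index())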

Not able to pull data from Census API because it is rejecting calls

I am trying to run this script to extract data from the US Census, but the Census API is rejecting my requests. I did a bit of work on it but am stumped. Any ideas on how to deal with this?
import pandas as pd
import requests
from pandas.compat import StringIO
#Sourced from the following site https://github.com/mortada/fredapi
from fredapi import Fred
fred = Fred(api_key='xxxx')
import StringIO
import datetime
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO as stio
else:
    from io import StringIO as stio
year_list = '2013','2014','2015','2016','2017'
month_list = '01','02','03','04','05','06','07','08','09','10','11','12'
#############################################
#Get the total exports from the United States
#############################################
exports = pd.DataFrame()
for i in year_list:
    for s in month_list:
        try:
            link = "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=CTY_CODE,CTY_NAME,ALL_VAL_MO,ALL_VAL_YR&time="
            str1 = ''.join([i])
            txt = '-'
            str2 = ''.join([s])
            total_link = link + str1 + txt + str2
            r = requests.get(total_link, headers={'User-agent': 'your bot 0.1'})
            df = pd.read_csv(StringIO(r.text))
            ##################### change starts here #####################
            ##################### since it is a dataframe itself, the method to create a dataframe from a list won't work ########################
            # Drop the total sales line
            df.drop(df.index[0])
            # Rename column names
            df.columns = ['CTY_CODE', 'CTY_NAME', 'EXPORT MTH', 'EXPORT YR', 'time', 'UN']
            # Change the ["1234" to 1234
            df['CTY_CODE'] = df['CTY_CODE'].str[2:-1]
            # Change the 2017-01] to 2017-01
            df['time'] = df['time'].str[:-1]
            ##################### change ends here #####################
            exports = exports.append(df, ignore_index=False)
        except:
            print(i)
            print(s)
Here you go:
import ast
import itertools
import pandas as pd
import requests

base = "https://api.census.gov/data/timeseries/intltrade/exports/hs?get=CTY_CODE,CTY_NAME,ALL_VAL_MO,ALL_VAL_YR&time="

year_list = ['2013', '2014', '2015', '2016', '2017']
month_list = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12']

exports = []
rejects = []
for year, month in itertools.product(year_list, month_list):
    url = '%s%s-%s' % (base, year, month)
    r = requests.get(url, headers={'User-agent': 'your bot 0.1'})
    if r.text:
        r = ast.literal_eval(r.text)
        df = pd.DataFrame(r[2:], columns=r[0])
        exports.append(df)
    else:
        rejects.append((int(year), int(month)))

exports = pd.concat(exports).reset_index().drop('index', axis=1)
Your result looks like this:
CTY_CODE CTY_NAME ALL_VAL_MO ALL_VAL_YR time
0 1010 GREENLAND 233446 233446 2013-01
1 1220 CANADA 23170845914 23170845914 2013-01
2 2010 MEXICO 17902453702 17902453702 2013-01
3 2050 GUATEMALA 425978783 425978783 2013-01
4 2080 BELIZE 17795867 17795867 2013-01
5 2110 EL SALVADOR 207606613 207606613 2013-01
6 2150 HONDURAS 429806151 429806151 2013-01
7 2190 NICARAGUA 75752432 75752432 2013-01
8 2230 COSTA RICA 598484187 598484187 2013-01
9 2250 PANAMA 1046236431 1046236431 2013-01
10 2320 BERMUDA 47156737 47156737 2013-01
11 2360 BAHAMAS 256292297 256292297 2013-01
... ... ... ... ...
13883 0024 LAFTA 27790655209 193139639307 2017-07
13884 0025 EURO AREA 15994685459 121039479852 2017-07
13885 0026 APEC 76654291110 550552655105 2017-07
13886 0027 ASEAN 6030380132 44558200533 2017-07
13887 0028 CACM 2133048149 13333440411 2017-07
13888 1XXX NORTH AMERICA 41622877949 299981278306 2017-07
13889 2XXX CENTRAL AMERICA 4697852283 30756310800 2017-07
13890 3XXX SOUTH AMERICA 8117215081 55039567414 2017-07
13891 4XXX EUROPE 25201247938 189925038230 2017-07
13892 5XXX ASIA 38329181070 274304503490 2017-07
13893 6XXX AUSTRALIA AND OC... 2389798925 16656777753 2017-07
13894 7XXX AFRICA 1809443365 13022520158 2017-07
Walkthrough:
itertools.product iterates over the product of (year, month) combinations, joining them with your base url
if the text of the response object is not blank (periods such as 2017-12 will be blank), create a DataFrame out of the literally-evaluated text, which is a list of lists. Use the first element as columns and ignore the second element.
otherwise, add the (year, month) combo to rejects, a list of tuples of the items not found
I used exports = [] because it is much more efficient to concatenate a list of DataFrames than to append to an existing DataFrame
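Since the API returns JSON, requests' built-in r.json() is arguably a cleaner parse than ast.literal_eval; a minimal sketch of the per-month fetch under the same assumptions (same URL, header row first, second row ignored):
import pandas as pd
import requests

def fetch_month(year, month):
    """Fetch one month of export data; return a DataFrame, or None if the period is empty."""
    base = ("https://api.census.gov/data/timeseries/intltrade/exports/hs"
            "?get=CTY_CODE,CTY_NAME,ALL_VAL_MO,ALL_VAL_YR&time=")
    r = requests.get('%s%s-%s' % (base, year, month),
                     headers={'User-agent': 'your bot 0.1'})
    if not r.text:
        return None
    rows = r.json()  # list of lists: header row first
    return pd.DataFrame(rows[2:], columns=rows[0])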
