Transform Hot Encoding - python

I have this data:
ID Month
001 June
001 July
001 August
002 July
I want the result to be like this:
ID June July August
001 1 1 1
002 0 1 0
I have tried one-hot encoding; my code is like this:
one_hot = pd.get_dummies(frame['Month'])
frame = frame.drop('Month',axis = 1)
frame = frame.join(one_hot)
However, the result is like this:
ID June July August
001 1 0 0
001 0 1 0
001 0 0 1
002 0 1 0
May I know which part of my code is wrong?

get_dummies encodes each row independently, so it returns one row per input row and never combines the rows that share an ID. To get one row per ID, you can use pd.crosstab:
>>> out = pd.crosstab(df.ID, df.Month)
>>> out
Month August July June
ID
1 1 1 1
2 0 1 0
To preserve the order of appearance of Months, you can reindex:
>>> out.reindex(df.Month.unique(), axis=1)
Month June July August
ID
1 1 1 1
2 0 1 0
If the same ID and Month pair can appear more than once, crosstab will show a count greater than 1. To see it as 1 instead, apply this afterwards:
out = out.ne(0).astype(int)
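As a self-contained sketch of the crosstab approach, the snippet below rebuilds the sample data with one extra duplicate row (002, July), which is an assumption added here to show the count going above 1. The names counts and flags are illustrative only.

```python
import pandas as pd

# Sample data from the question, plus one duplicated (ID, Month) pair
# (the second "002, July" row is an assumption added for illustration).
df = pd.DataFrame({
    "ID": ["001", "001", "001", "002", "002"],
    "Month": ["June", "July", "August", "July", "July"],
})

# crosstab counts how often each (ID, Month) pair occurs
counts = pd.crosstab(df["ID"], df["Month"])

# Turn counts into 0/1 flags and restore the order of first appearance
flags = counts.ne(0).astype(int).reindex(df["Month"].unique(), axis=1)

print(counts)
print(flags)
```

Here counts holds 2 for (002, July), while flags holds 1, which is the 0/1 layout asked for in the question.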

If you need one-hot encoding, convert ID to the index and aggregate with max so the output is always 0 or 1:
one_hot = (pd.get_dummies(frame.set_index('ID')['Month'], dtype=int)
           .groupby(level=0).max()
           .reindex(frame['Month'].unique(), axis=1))
print (one_hot)
June July August
ID
1 1 1 1
2 0 1 0

Related

Convert text file into dataframe with custom multiple delimiter in python

I'm new to Python. I have a txt file that contains data like this:
0: 480x640 2 persons, 1 cat, 1 clock, 1: 480x640 2 persons, 1 chair, Done. date (0.635s) Tue, 05 April 03:54:02
0: 480x640 3 persons, 1 cat, 1 laptop, 1 clock, 1: 480x640 4 persons, 2 chairs, Done. date (0.587s) Tue, 05 April 03:54:05
0: 480x640 3 persons, 1 chair, 1: 480x640 4 persons, 2 chairs, Done. date (0.582s) Tue, 05 April 03:54:07
I want to convert it into a pandas dataframe using multiple delimiters.
I tried this code:
import pandas as pd
student_csv = pd.read_csv('output.txt', names=['a', 'b','date','status'], sep='[0: 480x640, 1: 480x640 , date]')
student_csv.to_csv('txttocsv.csv', index = None)
How can I convert it into a pandas dataframe like this?
a b c
2 persons 2 persons, Done Tue, 05 April 03:54:02
How do I convert the text file into a dataframe?
It's hard to know exactly what your splitting rules are, but you can use a regex as the delimiter.
Here is a working example that splits the lists and the date into columns; you will probably have to tweak it to your exact rules:
df = pd.read_csv('output.txt', sep=r'(?:,\s*|^)(?:\d+: \d+x\d+|Done[^)]+\)\s*)',
header=None, engine='python', names=(None, 'a', 'b', 'date')).iloc[:, 1:]
output:
a b date
0 2 persons, 1 cat, 1 clock 2 persons, 1 chair Tue, 05 April 03:54:02
1 3 persons, 1 cat, 1 laptop, 1 clock 4 persons, 2 chairs Tue, 05 April 03:54:05
2 3 persons, 1 chair 4 persons, 2 chairs Tue, 05 April 03:54:07
You can use | in the sep argument to specify multiple delimiters:
df = pd.read_csv('data.txt', sep=r'0: 480x640|1: 480x640|date \(.*\)',
engine='python', names=('None', 'a', 'b', 'c')).drop('None', axis=1)
print(df)
a b \
0 2 persons, 1 cat, 1 clock, 2 persons, 1 chair, Done.
1 3 persons, 1 cat, 1 laptop, 1 clock, 4 persons, 2 chairs, Done.
2 3 persons, 1 chair, 4 persons, 2 chairs, Done.
c
0 Tue, 05 April 03:54:02
1 Tue, 05 April 03:54:05
2 Tue, 05 April 03:54:07
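If the regex separator becomes hard to tune, another option is to match each line explicitly with the re module and build the dataframe from the captured groups. This is only a sketch: the pattern assumes every line has exactly two detection blocks followed by "Done. date (...)" and the timestamp, as in the three sample lines, and the column names a, b and date are taken from the answers above.

```python
import re
import pandas as pd

# Sample lines copied from the question
lines = [
    "0: 480x640 2 persons, 1 cat, 1 clock, 1: 480x640 2 persons, 1 chair, Done. date (0.635s) Tue, 05 April 03:54:02",
    "0: 480x640 3 persons, 1 cat, 1 laptop, 1 clock, 1: 480x640 4 persons, 2 chairs, Done. date (0.587s) Tue, 05 April 03:54:05",
    "0: 480x640 3 persons, 1 chair, 1: 480x640 4 persons, 2 chairs, Done. date (0.582s) Tue, 05 April 03:54:07",
]

# Assumed layout:
# "<n>: <WxH> <items>, <n>: <WxH> <items>, Done. date (<elapsed>) <timestamp>"
pattern = re.compile(
    r"^\d+: \d+x\d+ (.*?), \d+: \d+x\d+ (.*?), Done\. date \([^)]*\) (.*)$"
)

# Keep only the lines that match the assumed layout
rows = [m.groups() for m in map(pattern.match, lines) if m]
df = pd.DataFrame(rows, columns=["a", "b", "date"])
print(df)
```

Lines that do not fit the assumed layout are silently skipped here; for real data you may prefer to log or inspect them instead.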

Panel data regression with fixed effects using Python

I have the following panel stored in df:
   state district  year   y  constant        x1      x2  time
0     01    01001  2009  12         1  0.956007  639673     1
1     01    01001  2010  20         1  0.972175  639673     2
2     01    01001  2011  22         1  0.988343  639673     3
3     01    01002  2009   0         1         0   33746     1
4     01    01002  2010   1         1  0.225071   33746     2
5     01    01002  2011   5         1  0.450142   33746     3
6     01    01003  2009   0         1         0   45196     1
7     01    01003  2010   5         1  0.427477   45196     2
8     01    01003  2011   9         1  0.854955   45196     3
y is the number of protests in each district
constant is a column full of ones
x1 is the proportion of the district's area covered by a mobile network provider
x2 is the population count in each district (note that it is fixed in time)
How can I run the following model in Python?
Here's what I tried
# Transform `x2` to match model
df['x2'] = df['x2'].multiply(df['time'], axis=0)
# District fixed effects
df['delta'] = pd.Categorical(df['district'])
# State-time fixed effects
df['eta'] = pd.Categorical(df['state'] + df['year'].astype(str))
# Set indexes
df = df.set_index(['district','year'])
from linearmodels.panel import PanelOLS
m = PanelOLS(dependent=df['y'], exog=df[['constant','x1','x2','delta','eta']])
ValueError: exog does not have full column rank. If you wish to proceed with model estimation irrespective of the numerical accuracy of coefficient estimates, you can set rank_check=False.
What am I doing wrong?
I dug around the documentation and the solution turned out to be quite simple.
After setting the indexes and turning the fixed effect columns to pandas.Categorical types (see question above):
# Import model
from linearmodels.panel import PanelOLS
# Model
m = PanelOLS(dependent=df['y'],
exog=df[['constant','x1','x2']],
entity_effects=True,
time_effects=False,
other_effects=df['eta'])
m.fit(cov_type='clustered', cluster_entity=True)
That is, DO NOT pass your fixed effect columns to exog.
You should pass them to entity_effects (boolean), time_effects (boolean) or other_effects (pandas.Categorical).
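To give some intuition for what absorbing entity effects means, the sketch below applies the textbook within transformation: each variable is demeaned by district before a one-regressor least-squares fit. This is an illustration of the idea only, not of how linearmodels implements it. It uses just y and x1 from the sample rows in the question and ignores x2 and the state-year effects, so the value of beta is not the estimate the full model would give.

```python
import pandas as pd

# y and x1 for the three sample districts shown in the question
df = pd.DataFrame({
    "district": ["01001"] * 3 + ["01002"] * 3 + ["01003"] * 3,
    "year": [2009, 2010, 2011] * 3,
    "y": [12, 20, 22, 0, 1, 5, 0, 5, 9],
    "x1": [0.956007, 0.972175, 0.988343,
           0.0, 0.225071, 0.450142,
           0.0, 0.427477, 0.854955],
})

# Within transformation: subtract each district's own mean,
# which removes any district-specific intercept.
cols = ["y", "x1"]
within = df[cols] - df.groupby("district")[cols].transform("mean")

# Least-squares slope of demeaned y on demeaned x1 (single regressor)
beta = (within["x1"] * within["y"]).sum() / (within["x1"] ** 2).sum()
print(beta)
```

Because the district means are removed first, no district dummy columns are needed in the regressor matrix, which is the same reason the fixed effect columns are kept out of exog above.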

Count how many values occur in a month Pandas

I have a dataframe as follows:
Country
From Date
02/04/2020 Canada
04/02/2020 Ireland
10/03/2020 France
11/03/2020 Italy
15/03/2020 Hungary
.
.
.
10/10/2020 Canada
And I simply want to do a groupby() or something similar which will tell me how many times a country occurs per month
eg.
Canada Ireland France . . .
2010 1 3 4 1
2 4 3 2
.
.
.
10 4 4 4
Is there a simple way to do this?
Any help much appreciated!
A different way to solve this is to use groupby, value_counts and unstack.
It goes like this:
I assume your "From Date" column is of date type (datetime64[ns]); if not, convert it:
df['From Date']=pd.to_datetime(df['From Date'], format= '%d/%m/%Y')
Convert the date to a string with year and month:
df['From Date'] = df['From Date'].dt.strftime('%Y-%m')
Group by From Date and count the values:
df.groupby(['From Date'])['Country'].value_counts().unstack().fillna(0).astype(int)
Result (from the sample in your question):
Country Canada France Hungary Ireland Italy
From Date
2020-02 0 0 0 1 0
2020-03 0 1 1 0 1
2020-04 1 0 0 0 0
Note the unstack, which places the countries along the columns; fillna(0), which shows zero for any country with no entries in a month; and astype(int), which avoids values such as 1.0.
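You can also use crosstab. This assumes From Date is the index; uncomment the first line if it is not already a datetime index: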
# df.index=pd.to_datetime(df.index, format= '%d/%m/%Y')
pd.crosstab(df.index.strftime('%Y-%m'), df['Country'])
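For reference, here is a self-contained sketch of the crosstab approach on the five sample rows from the question. It assumes From Date is an ordinary column holding day/month/year strings; the name counts is illustrative.

```python
import pandas as pd

# The five sample rows visible in the question
df = pd.DataFrame({
    "From Date": ["02/04/2020", "04/02/2020", "10/03/2020",
                  "11/03/2020", "15/03/2020"],
    "Country": ["Canada", "Ireland", "France", "Italy", "Hungary"],
})

# Parse day-first dates, then reduce each date to a "YYYY-MM" label
df["From Date"] = pd.to_datetime(df["From Date"], format="%d/%m/%Y")
month = df["From Date"].dt.strftime("%Y-%m")

# One row per month, one column per country, cells are occurrence counts
counts = pd.crosstab(month, df["Country"])
print(counts)
```

crosstab fills missing month and country combinations with 0, so no separate fillna step is needed.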

Count a certain value for each country

I am attempting to replicate the Excel COUNTIF function with pandas but have hit a roadblock.
I have this dataframe. I need to count the Yes values for each country per quarter. The expected results are shown in the Quarter_1 and Quarter_2 columns below.
result.head(3)
Country Jan 1 Feb 1 Mar 1 Apr 1 May 1 Jun 1 Quarter_1 Quarter_2
FRANCE Yes Yes No No No No 2 0
BELGIUM Yes Yes No Yes No No 2 1
CANADA Yes No No Yes No No 1 1
I tried the following, but pandas returns a single total, showing a 5 for every value under Quarter_1. How can I calculate this per country? Any assistance would be appreciated!
result['Quarter_1'] = (len(result[result['Jan 1'] == 'Yes']) + len(result[result['Feb 1'] == 'Yes'])
    + len(result[result['Mar 1'] == 'Yes']))
We can use the position of each column and floor division by 3 to assign the columns to quarters. Then we group on these and take the sum.
Finally we add the prefix Quarter_:
import numpy as np

df = df.set_index('Country')
# column positions 0..5 -> quarter numbers 1, 1, 1, 2, 2, 2
grps = np.arange(len(df.columns)) // 3 + 1
dfn = (
    df.join(df.eq('Yes')
              .T.groupby(grps)   # group the month columns by quarter
              .sum().T
              .astype(int)
              .add_prefix('Quarter_'))
      .reset_index()
)
Or using a list comprehension to rename your columns:
df = df.set_index('Country')
grps = np.arange(len(df.columns)) // 3
dfn = df.eq('Yes').T.groupby(grps).sum().T.astype(int)
dfn.columns = [f'Quarter_{col+1}' for col in dfn.columns]
df = df.join(dfn).reset_index()
Country Jan 1 Feb 1 Mar 1 Apr 1 May 1 Jun 1 Quarter_1 Quarter_2
0 FRANCE Yes Yes No No No No 2 0
1 BELGIUM Yes Yes No Yes No No 2 1
2 CANADA Yes No No Yes No No 1 1
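The attempt in the question returns 5 everywhere because len(result[mask]) is a single number, the count of matching rows in the whole frame, and that one number is then broadcast to every row. A row-wise sum over a boolean frame gives the per-country count instead. The sketch below shows both on the three sample rows; the names q1_cols, q2_cols and scalar_total are illustrative.

```python
import pandas as pd

# The three sample rows from the question (month columns only)
result = pd.DataFrame({
    "Country": ["FRANCE", "BELGIUM", "CANADA"],
    "Jan 1": ["Yes", "Yes", "Yes"],
    "Feb 1": ["Yes", "Yes", "No"],
    "Mar 1": ["No", "No", "No"],
    "Apr 1": ["No", "Yes", "Yes"],
    "May 1": ["No", "No", "No"],
    "Jun 1": ["No", "No", "No"],
})

q1_cols = ["Jan 1", "Feb 1", "Mar 1"]
q2_cols = ["Apr 1", "May 1", "Jun 1"]

# What the original attempt computes: one total for the whole frame
scalar_total = sum(len(result[result[c] == "Yes"]) for c in q1_cols)

# Per-row counts: compare to "Yes", then sum across the columns (axis=1)
result["Quarter_1"] = result[q1_cols].eq("Yes").sum(axis=1)
result["Quarter_2"] = result[q2_cols].eq("Yes").sum(axis=1)

print(scalar_total)
print(result)
```

Listing the columns explicitly is more verbose than the floor-division grouping above, but it does not depend on the month columns being in order.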

Python group by 2 columns, output multiple columns

I have a tab-delimited file with movie genre and year in 2 columns:
Comedy 2013
Comedy 2014
Drama 2012
Mystery 2011
Comedy 2013
Comedy 2013
Comedy 2014
Comedy 2013
News 2012
Sport 2012
Sci-Fi 2013
Comedy 2014
Family 2013
Comedy 2013
Drama 2013
Biography 2013
I want to group the genres together by year and print out in the following format (does not have to be in alphabetical order):
Year 2011 2012 2013 2014
Biography 0 0 1 0
Comedy 0 0 5 3
Drama 0 1 1 0
Family 0 0 1 0
Mystery 1 0 0 0
News 0 1 0 0
Sci-Fi 0 0 1 0
Sport 0 1 0 0
How should I approach it? At the moment I'm creating my output through MS Excel, but I would like to do it through Python.
If you would rather not use pandas, you can do it as follows:
from collections import Counter

# load file, skipping blank lines
with open('tab.txt') as f:
    lines = [l for l in f.read().split('\n') if l.strip()]
# replace separating whitespace with exactly one space
lines = [' '.join(l.split()) for l in lines]
# find all years and genres
genres = sorted(set(l.split()[0] for l in lines))
years = sorted(set(l.split()[1] for l in lines))
# count genre-year combinations
C = Counter(lines)
# print table
print("Year".ljust(10), end=' ')
for y in years:
    print(y.rjust(6), end=' ')
print()
for g in genres:
    print(g.ljust(10), end=' ')
    for y in years:
        print(str(C[g + ' ' + y]).rjust(6), end=' ')
    print()
The most interesting function is probably Counter, which counts the number of occurrences of each element. To make sure that the length of the separating whitespace does not influence the counting, I replace it with a single space beforehand.
The easiest way to do this is using the pandas library, which provides lots of ways of interacting with tables of data:
df = pd.read_clipboard(names=['genre', 'year'])
df.pivot_table(index='genre', columns='year', aggfunc=len, fill_value=0)
Output:
year 2011 2012 2013 2014
genre
Biography 0 0 1 0
Comedy 0 0 5 3
Drama 0 1 1 0
Family 0 0 1 0
Mystery 1 0 0 0
News 0 1 0 0
Sci-Fi 0 0 1 0
Sport 0 1 0 0
If you're only just starting with Python, you might find trying to learn pandas is a bit too much on top of learning the language, but once you have some Python knowledge, pandas provides very intuitive ways to interact with data.
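As a self-contained variant that does not depend on the clipboard, the sketch below reads the same tab-delimited rows from an in-memory string and counts the genre and year pairs with pd.crosstab. The in-memory string stands in for the real file, and the column names genre and year follow the answer above.

```python
import io
import pandas as pd

# The tab-delimited rows from the question, held in memory for the example
data = (
    "Comedy\t2013\nComedy\t2014\nDrama\t2012\nMystery\t2011\n"
    "Comedy\t2013\nComedy\t2013\nComedy\t2014\nComedy\t2013\n"
    "News\t2012\nSport\t2012\nSci-Fi\t2013\nComedy\t2014\n"
    "Family\t2013\nComedy\t2013\nDrama\t2013\nBiography\t2013\n"
)

# With a real file, pass its path instead of the StringIO object
df = pd.read_csv(io.StringIO(data), sep="\t", names=["genre", "year"])

# Rows are genres, columns are years, cells are occurrence counts
table = pd.crosstab(df["genre"], df["year"])
print(table)
```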
