I have customers at specific locations, and a customer may change location from year to year. I would like to create a transition matrix that shows me which customers moved from one location to another.
Tidy dataframe:
year cust loc
2019 C1 LA
2019 C2 LA
2019 C3 LB
2019 C4 LC
2019 C5 LA
2019 C6 LA
2020 C1 LB
2020 C2 LA
2020 C4 LC
2020 C5 LC
2020 C6 LC
2020 C7 LD
     LA   LB   LC      LD   drop
LA        C1   C5,C6
LB                          C3
LC
LD
I am looking for an elegant way to achieve this in pandas. Any clever ideas that avoid extensive nested looping?
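One possible approach, sketched on the sample data above (the comma-joined cell format and the `drop` label are assumptions about the desired output): self-merge the two years on customer, mark customers missing in 2020 as dropped, then pivot. Note that customers new in 2020 (C7) fall outside a 2019-row matrix.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2019]*6 + [2020]*6,
    'cust': ['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C1', 'C2', 'C4', 'C5', 'C6', 'C7'],
    'loc':  ['LA', 'LA', 'LB', 'LC', 'LA', 'LA', 'LB', 'LA', 'LC', 'LC', 'LC', 'LD'],
})

# self-merge on customer to pair each 2019 location with the 2020 one
wide = df[df['year'] == 2019].merge(df[df['year'] == 2020],
                                    on='cust', how='left',
                                    suffixes=('_2019', '_2020'))
wide['loc_2020'] = wide['loc_2020'].fillna('drop')  # customers absent in 2020

# pivot: rows = old location, columns = new location, cells = customer lists
matrix = wide.pivot_table(index='loc_2019', columns='loc_2020',
                          values='cust', aggfunc=','.join)
print(matrix)
```

`aggfunc=','.join` works because `pivot_table` passes each group's customers as an iterable of strings, so no explicit looping is needed.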
The data which I have is in the below format:
col_1 col_2 col_3
NaN NaN NaN
Date 21-04-2022 NaN
Id Name status
01 A11 Pass
02 A22 F_1
03 A33 P_2
SUMMARY 'Total :$20 Approved $ 10' NaN
NaN NaN NaN
Date 22-04-2022 NaN
Id Name status
04 A12 P_2
05 A23 F_1
06 A34 P_2
SUMMARY 'Total :$30 Approved $ 20' NaN
Expected Output :
df_1 -
Id Name status
01 A11 Pass
02 A22 F_1
03 A33 P_2
SUMMARY 'Total :$20 Approved $ 10' NaN
df_2 -
Id Name status
04 A12 P_2
05 A23 F_1
06 A34 P_2
SUMMARY 'Total :$30 Approved $ 20' NaN
Above is just sample data. The actual number of columns I have is around 24K, so a large number of DataFrames will be created.
How can this be approached?
You can use:
grp = df['col_1'].eq('Id').cumsum()                    # create virtual groups
msk = ~df.isna().all(axis=1) & df['col_1'].ne('Date')  # keep wanted rows

# create a dict with subset dataframes
dfs = {f'df{name}': pd.DataFrame(temp.values[1:], columns=temp.iloc[0].tolist())
       for name, temp in df[msk].groupby(grp)}
Output:
>>> dfs['df1']
Id Name status
0 01 A11 Pass
1 02 A22 F_1
2 03 A33 P_2
3 SUMMARY Total :$20 Approved $ 10 NaN
>>> dfs['df2']
Id Name status
0 04 A12 P_2
1 05 A23 F_1
2 06 A34 P_2
3 SUMMARY Total :$30 Approved $ 20 NaN
Update: export to Excel:
with pd.ExcelWriter('data.xlsx') as writer:
    for name, temp in dfs.items():
        temp.to_excel(writer, index=False, sheet_name=name)
You could create an auxiliary column of booleans and use that to slice the dataframe into smaller parts:
import pandas as pd
df = pd.DataFrame({'col_1': [1, 2, 'Id', 3, 4, 5, 'SUMMARY',
                             1, 2, 'Id', 3, 4, 5, 'SUMMARY']})
mask = df['col_1'].eq('Id') | df['col_1'].eq('SUMMARY').shift(fill_value=False)
df['group_id'] = mask.cumsum()
dfs = list()
for group_id in df['group_id'].unique():
    if group_id % 2 != 0:
        dfs.append(df[df['group_id'].eq(group_id)])
print(dfs[0])
print(dfs[1])
Highly inspired by piRSquared's answer here, you can approach your goal like this:
import pandas as pd
import numpy as np
df.columns = ["Id", "Name", "Status"]
# is the row a SUMMARY separator?
m = df["Id"].eq("SUMMARY")
l_df = list(filter(lambda d: not d.empty, np.split(df, np.flatnonzero(m) + 1)))
_ = [exec(f"globals()['df_{idx}'] = df.reset_index(drop=True)"
          ".loc[3:].reset_index(drop=True)")
     for idx, df in enumerate(l_df, start=1)]
NB: We used globals to create the variables/sub-dataframes dynamically.
# Output:
print(len(l_df), "DataFrames were created!")
2 DataFrames were created!
print(df_1, type(df_1)), print(df_2, type(df_2))
Id Name Status
0 01 A11 Pass
1 02 A22 F_1
2 03 A33 P_2
3 SUMMARY 'Total :$20 Approved $ 10' NaN <class 'pandas.core.frame.DataFrame'>
Id Name Status
0 04 A12 P_2
1 05 A23 F_1
2 06 A34 P_2
3 SUMMARY 'Total :$30 Approved $ 20' NaN <class 'pandas.core.frame.DataFrame'>
With the following code, you can select the row you want as the column names and use the rows you want as the new dataframe's data:
import numpy as np
import pandas as pd

df_1 = pd.DataFrame(data=np.array(df.iloc[3:7]), columns=np.array(df.iloc[2:3])[0])
df_2 = pd.DataFrame(data=np.array(df.iloc[11:15]), columns=np.array(df.iloc[10:11])[0])
Output df_1:
Id Name status
01 A11 Pass
02 A22 F_1
03 A33 P_2
SUMMARY 'Total :$20 Approved $ 10' NaN
Output df_2:
Id Name status
04 A12 P_2
05 A23 F_1
06 A34 P_2
SUMMARY 'Total :$30 Approved $ 20' NaN
Given this string
result = '''Check here to visit our corporate website
Results
Candidate Information
Examination Number
986542346
Candidate Name
JOHN DOE JAMES
Examination
MFFG FOR SCHOOL CANDIDATES 2021
Centre
LORDYARD
Subject Grades
DATA PROCESSING
B3
ECONOMICS
B3
CIVIC EDUCATION
B3
ENGLISH LANGUAGE
A1
MATHEMATICS
B3
AGRICULTURAL SCIENCE
OUTSTANDING
BIOLOGY
A1
CHEMISTRY
B2
PHYSICS
C5
C Information
Card Use
1 of 5'''
How can I extract the NAME (JOHN DOE JAMES), the SUBJECTS, and the GRADES into different lists?
I have tried the following for the subjects and grades, but it is not giving me the desired results. Firstly, where a subject name is more than one word it only returns the last word, e.g. instead of DATA PROCESSING I am getting PROCESSING. Secondly, it skips AGRICULTURAL SCIENCE (subject) and OUTSTANDING (grade).
Please note that I am new to regex. Thanks in advance.
import re

pattern = re.compile(r'[A-Z]+\n{1}[A-Z][0-9]')
searches = pattern.findall(result)
if searches:
    print(searches)
for search in searches:
    print(search)
OUTPUT FOR THE FIRST PRINT STATEMENT:
['PROCESSING\nB3', 'ECONOMICS\nB3', 'EDUCATION\nB3', 'LANGUAGE\nA1', 'MATHEMATICS\nB3', 'BIOLOGY\nA1', 'CHEMISTRY\nB2', 'PHYSICS\nC5']
OUTPUT FOR THE SECOND PRINT STATEMENT:
PROCESSING
B3
ECONOMICS
B3
EDUCATION
B3
LANGUAGE
A1
MATHEMATICS
B3
BIOLOGY
A1
CHEMISTRY
B2
PHYSICS
C5
Here's a way to do this without using regexes. Note that I am assuming "OUTSTANDING" is intended to be a grade. That takes special processing.
result = '''Check here to visit our corporate website Results Candidate Information Examination Number 986542346 Candidate Name JOHN DOE JAMES Examination MFFG FOR SCHOOL CANDIDATES 2021 Centre LORDYARD Subject Grades DATA PROCESSING B3 ECONOMICS B3 CIVIC EDUCATION B3 ENGLISH LANGUAGE A1 MATHEMATICS B3 AGRICULTURAL SCIENCE OUTSTANDING BIOLOGY A1 CHEMISTRY B2 PHYSICS C5 C Information Card Use 1 of 5'''
i = result.find('Name')
j = result.find('Examination',i)
k = result.find('Centre')
l = result.find('Subject Grades')
m = result.find('Information Card')
name = result[i+5:j-1]
exam = result[j+12:k-1]
grades = result[l+15:m].split()
print("Name:", name)
print("Exam:", exam)
print("Grades:")
subject = []
for word in grades:
    if len(word) == 2 or word == 'OUTSTANDING':
        print(' '.join(subject), "......", word)
        subject = []
    else:
        subject.append(word)
Output:
Name: JOHN DOE JAMES
Exam: MFFG FOR SCHOOL CANDIDATES 2021
Grades:
DATA PROCESSING ...... B3
ECONOMICS ...... B3
CIVIC EDUCATION ...... B3
ENGLISH LANGUAGE ...... A1
MATHEMATICS ...... B3
AGRICULTURAL SCIENCE ...... OUTSTANDING
BIOLOGY ...... A1
CHEMISTRY ...... B2
PHYSICS ...... C5
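If a regex is still preferred, a multiline pattern that anchors whole lines keeps multi-word subjects together and treats OUTSTANDING as a grade. A sketch on a trimmed version of the input (the full string behaves the same way under the same line-based layout):

```python
import re

result = '''Candidate Name
JOHN DOE JAMES
Subject Grades
DATA PROCESSING
B3
AGRICULTURAL SCIENCE
OUTSTANDING
PHYSICS
C5
C Information'''

# the line immediately after 'Candidate Name' is the name
name = re.search(r'Candidate Name\n(.+)', result).group(1)

# a subject is a full line of capitals/spaces; a grade is letter+digit or OUTSTANDING
pairs = re.findall(r'^([A-Z][A-Z ]+)\n([A-Z][0-9]|OUTSTANDING)$', result, re.M)
subjects = [s for s, g in pairs]
grades = [g for s, g in pairs]
print(name, subjects, grades)
```

The `^...$` anchors (with `re.M`) are what fix both reported problems: the first group matches the whole subject line rather than a single word, and the `OUTSTANDING` alternative captures the non-standard grade.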
I am trying to decode a datetime from byte form. I tried various encodings (seconds, minutes, and hours since 1970-01-01; minutes since 0001-01-01; etc.). I also tried the MySQL encoding (https://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html) with no effect either.
Please help me find the datetime encoding.
bf aa b8 c3 e5 2f
d7 be ba c3 e5 2f
80 a0 c0 c3 e5 2f
a7 fc bf c3 e5 2f
ae fd f2 c3 e5 2f
9e dd fa c3 e5 2f
c7 ce fa c3 e5 2f
b9 f5 82 c4 e5 2f
f8 95 f2 c3 e5 2f
Everything is around 01/14/2022 12:00
Each value is a timestamp encoded as a VARINT. A long time ago I made the following functions to decode/encode VARINTs:
def to_varint(src):
    buf = b""
    while True:
        towrite = src & 0x7f
        src >>= 7
        if src:
            buf += bytes((towrite | 0x80,))
        else:
            buf += bytes((towrite,))
            break
    return buf

def from_varint(src):
    shift = 0
    result = 0
    for i in src:
        result |= (i & 0x7f) << shift
        shift += 7
    return result
So, using these functions, we can decode your values:
from datetime import datetime
...
values = """\
bf aa b8 c3 e5 2f
d7 be ba c3 e5 2f
80 a0 c0 c3 e5 2f
a7 fc bf c3 e5 2f
ae fd f2 c3 e5 2f
9e dd fa c3 e5 2f
c7 ce fa c3 e5 2f
b9 f5 82 c4 e5 2f
f8 95 f2 c3 e5 2f"""
for value in values.splitlines():
    timestamp = from_varint(bytes.fromhex(value))
    dt = datetime.fromtimestamp(timestamp / 1000)
    print(dt)
To get the current timestamp in this encoding you can use the following code:
current = to_varint(int(datetime.now().timestamp() * 1000)).hex(" ")
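As a sanity check, a minimal standalone decoder for the same little-endian base-128 scheme (equivalent to `from_varint` above) turns the first sample into a millisecond timestamp that lands on 2022-01-14, matching the expected date:

```python
from datetime import datetime, timezone

def decode_varint(src: bytes) -> int:
    # low 7 bits of each byte, least-significant group first
    result = 0
    for shift, b in enumerate(src):
        result |= (b & 0x7F) << (7 * shift)
    return result

ts_ms = decode_varint(bytes.fromhex("bf aa b8 c3 e5 2f"))
dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
print(ts_ms, dt)  # 1642161116479 -> 2022-01-14 11:51:56 UTC
```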
The picture below shows a few lines of the printed lists I have in Python. I would like to get: a list of the unique boroughs, a corresponding list of the unique years, and, for each borough and year, the weighted average of "average" with "nobs" as weights (the "type" variable indicates whether there were one, two, or three types in a given year and borough).
I know how to get a weighted average using the entire lists:
weighted_avg = np.average(average, weights=nobs)
But I don't know how to calculate one for each unique borough-year pair.
I'm new to Python; please help if you know how to do it.
Assuming that the 'type' column doesn't affect your calculations, you can get the average using groupby. Here's the data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'borough': ['b1', 'b2']*6, 'year': [2008, 2009, 2010, 2011]*3,
                   'average': np.random.randint(low=100, high=200, size=12),
                   'nobs': np.random.randint(low=1, high=40, size=12)})
print(df):
borough year average nobs
0 b1 2008 166 1
1 b2 2009 177 35
2 b1 2010 114 27
3 b2 2011 187 18
4 b1 2008 193 2
5 b2 2009 105 27
6 b1 2010 114 36
7 b2 2011 144 3
8 b1 2008 114 39
9 b2 2009 157 6
10 b1 2010 133 17
11 b2 2011 176 12
We add a new column which is the product of the average and nobs columns, then divide the per-group sums:
df['average x nobs'] = df['average'] * df['nobs']
sums = df.groupby(['borough', 'year']).sum()
newdf = pd.DataFrame({'weighted average': sums['average x nobs'] / sums['nobs']})
print(newdf):
weighted average
borough year
b1 2008 119.000000
2010 118.037500
b2 2009 146.647059
2011 179.090909
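An equivalent one-step alternative (a sketch with made-up deterministic numbers, rather than the random data above) applies `np.average` per group, so no helper column is needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'borough': ['b1', 'b1', 'b2'],
                   'year':    [2008, 2008, 2009],
                   'average': [100, 200, 150],
                   'nobs':    [1, 3, 2]})

# weighted mean of 'average' within each borough-year group
wavg = df.groupby(['borough', 'year']).apply(
    lambda g: np.average(g['average'], weights=g['nobs']))
print(wavg)  # b1/2008 -> (100*1 + 200*3) / 4 = 175.0
```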
Sorry for the possibly confusing title; here's what I'm trying to do:
I'm trying to merge my Parcels dataframe with my Municipality Code lookup table. The Parcels dataframe:
df1.head()
PARID OWNER1
0 B10 2 1 0131 WILSON ROBERT JR
1 B10 2 18B 0131 COMUNALE MICHAEL J & MARY ANN
2 B10 2 18D 0131 COMUNALE MICHAEL J & MARY ANN
3 B10 2 19F 0131 MONROE & JEFFERSON HOLDINGS LLC
4 B10 4 11 0131 NOEL JAMES H
The Municipality Code dataframe:
df_LU.head()
PARID Municipality
0 01 Allen Twp.
1 02 Bangor
2 03 Bath
3 04 Bethlehem
4 05 Bethlehem Twp.
The last two digits in the first column of df1 ('31' in 'B10 2 1 0131') are the Municipality Code that I need to merge with the Municipality Code DataFrame. But among my 30,000 or so records, about 200 end with letters, as shown below:
PARID OWNER1
299 D11 10 10 0131F HOWARD THEODORE P & CLAUDIA S
1007 F10 4 3 0134F KNEEBONE JUDY ANN
1011 F10 5 2 0134F KNEEBONE JUDY ANN
1114 F8 18 10 0626F KNITTER WILBERT D JR & AMY J
1115 F8 18 8 0626F KNITTER DONALD
For these rows, the two digits before the final letter are the code I need to extract (like the '31' in 'D11 10 10 0131F').
If I just use
pd.DataFrame(df1['PARID'].str[-2:])
This will give me:
PARID
...
299 1F
...
While what I need is:
PARID
...
299 31
...
My code for accomplishing this is pretty lengthy; it essentially involves:
1. Merging all the rows that end with two digits.
2. Finding the indices of the rows whose 'PARID' field ends with a letter.
3. Merging the results from step 2 with the Municipality lookup dataframe again.
Here is the code:
#Do the extraction and merge for the rows that end with numbers
df_2015= df1[['PARID','OWNER1']]
df_2015['PARID'] = df_2015['PARID'].str[-2:]
df_15r =pd.merge(df_2015, df_LU, how = 'left', on = 'PARID')
df_15r
#The pivot result for rows generated from above.
Result15_First = df_15r.groupby('Municipality').count()
Result15_First.to_clipboard()
#Check the ID field for rows that end with letters
check15 = df_2015['PARID'].unique()
check15
C = pd.DataFrame({'ID':check15})
NC = C.dropna()
LNC = NC[NC['ID'].str.endswith('F')]
MNC = NC[NC['ID'].str.endswith('A')]
F = [LNC, MNC]
NNC = pd.concat(F, axis = 0)
s = NNC['ID'].tolist()
s
# Identify the records in s
df_p15 = df_2015.loc[df_2015['PARID'].isin(s)]
df_p15
# Separate out a dataframe with just the rows that end with a letter
df15= df1[['PARID','OWNER1']]
df15c = df15[df15.index.isin(df_p15.index)]
df15c
#This step is to create the look up field from the new data frame, the two numbers before the ending letter.
df15c['PARID1'] = df15c['PARID'].str[-3:-1]
df15c
#Then I will join the look up table
df_15t =df15c.merge(df_LU.set_index('PARID'), left_on = 'PARID1', right_index = True)
df_15b = df_15t.groupby('Municipality').count()
df_15b
It wasn't until I finished that I realized how lengthy my code was for a seemingly simple task. If there is a better way to achieve, which is a sure thing, please let me know. Thanks.
You can use pandas string methods to extract the last two digits:
df1['PARID'].str.extract(r'.*(\d{2})', expand=False)
You get
0 31
1 31
2 13
3 13
4 31
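The same pattern also covers the ~200 IDs that end with a letter, because the greedy `.*` backtracks until the capture group lands on the last pair of digits, ignoring any trailing letter. A quick check on sample IDs from the question:

```python
import pandas as pd

s = pd.Series(['B10 2 1 0131', 'D11 10 10 0131F', 'M5 N4 12 0231A'])
# greedy .* backtracks to the LAST two digits, so a trailing letter is fine
codes = s.str.extract(r'.*(\d{2})', expand=False)
print(codes.tolist())  # ['31', '31', '31']
```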
You can use str.replace to remove all non-digits. After that, you should be able to use .str[-2:].
import pandas as pd

df1 = pd.DataFrame({'PARID': pd.Series(["M3N6V2 B7 13A 0131", "M3N6V2 B7 13B 0131",
                                        "Y2 7 B13 0213", "Y2 7 B14 0213",
                                        "M5 N4 12 0231A"]),
                    'Owner': pd.Series(["Tom", "Jerry", "Jack", "Chris", "Alex"])})
df1['PARID'].str.replace(r'\D+', '', regex=True).str[-2:]
import pandas as pd
df = pd.DataFrame([['M3N6V2 B7 13A 0131','M3N6V2 B7 13B 0131','Y2 7 B13 0213', 'Y2 7 B14 0213', 'M5 N4 12 0231A' ], ['Tom', 'Jerry', 'Jack', 'Chris', 'Alex']])
df = df.T
df.columns = ['PARID', 'Owner']
print(df)
prints your left DataFrame
PARID Owner
0 M3N6V2 B7 13A 0131 Tom
1 M3N6V2 B7 13B 0131 Jerry
2 Y2 7 B13 0213 Jack
3 Y2 7 B14 0213 Chris
4 M5 N4 12 0231A Alex
and now for your right DataFrame
import numpy as np
df['IDPART'] = None
for row in df.index:
    if df.at[row, 'PARID'][-1].isalpha():
        df.at[row, 'IDPART'] = df.at[row, 'PARID'][-3:-1]
    else:
        df.at[row, 'IDPART'] = df.at[row, 'PARID'][-2:]
df['IDPART'] = df['IDPART'].apply(int)  # converting the join column to integers
print(df)
print(df)
gives:
PARID Owner IDPART
0 M3N6V2 B7 13A 0131 Tom 31
1 M3N6V2 B7 13B 0131 Jerry 31
2 Y2 7 B13 0213 Jack 13
3 Y2 7 B14 0213 Chris 13
4 M5 N4 12 0231A Alex 31
and then merge
merged = pd.merge(df, otherdf, how = 'left', left_on = 'IDPART', right_on = 'PARID', left_index=False, right_index=False)
print(merged)
gives:
PARID_x Owner IDPART PARID_y Municipality
0 M3N6V2 B7 13A 0131 Tom 31 31 Tatamy
1 M3N6V2 B7 13B 0131 Jerry 31 31 Tatamy
2 Y2 7 B13 0213 Jack 13 13 Allentown
3 Y2 7 B14 0213 Chris 13 13 Allentown
4 M5 N4 12 0231A Alex 31 31 Tatamy
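Pulling the threads together, a compact version (a sketch: the `df_LU` row here is made up for illustration, and the regex keeps the code as a string so it matches the lookup table's string keys directly):

```python
import pandas as pd

df1 = pd.DataFrame({'PARID':  ['B10 2 1 0131', 'D11 10 10 0131F'],
                    'OWNER1': ['WILSON ROBERT JR', 'HOWARD THEODORE P & CLAUDIA S']})
df_LU = pd.DataFrame({'PARID': ['31'], 'Municipality': ['Tatamy']})  # made-up mapping

# two digits, optionally followed by one trailing letter, at the end of the ID
df1['code'] = df1['PARID'].str.extract(r'(\d{2})[A-Z]?$', expand=False)
merged = df1.merge(df_LU, left_on='code', right_on='PARID',
                   how='left', suffixes=('', '_LU'))
print(merged[['PARID', 'code', 'Municipality']])
```

This replaces the separate letter-suffix handling with a single vectorized extract followed by one merge.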