How to generate label by sparse cumcount - python

Here's my master dataset
Id Data Category Code
1 tey Airport AIR_02
2 fg Hospital HEA_04
3 dffs Airport AIR_01
4 dsfs Hospital HEA_03
5 fdsf Airport AIR_04
Here's the data I want to merge
Id Data Category
1 tetyer Airport
2 fgdss Hospital
3 dffsdsa Airport
4 dsfsas Hospital
5 fdsfada Airport
My Expected Output
Id Data Category Code
1 tey Airport AIR_02
2 fg Hospital HEA_04
3 dffs Airport AIR_01
4 dsfs Hospital HEA_03
5 fdsf Airport AIR_04
6 tetyer Airport AIR_03
7 fgdss Hospital HEA_01
8 dffsdsa Airport AIR_05
9 dsfsas Hospital HEA_02
10 fdsfada Airport AIR_06
Note:
HEA_01 is not available in the existing dataset. Every Hospital code starts with HEA_ and every Airport code starts with AIR_; the numeric suffix (01, 02, etc.) is assigned by availability.

Use:
#split Code by _
df1[['a','b']] = df1['Code'].str.split('_', expand=True)
#converting values to integers
df1['b'] = df1['b'].astype(int)
#aggregate for list and first value for mapping
df11 = df1.groupby(['Category']).agg({'a':'first', 'b':list})
#get difference by np.arange with used values
def f(x):
    L = df11['b'][x.name]
    a = np.arange(1, len(x) + len(L) + 1)
    #keep only as many unused numbers as there are rows in the group
    return np.setdiff1d(a, L)[:len(x)]
df2['Code'] = df2.groupby('Category')['Category'].transform(f)
#created Code with join
df2['Code'] = df2['Category'].map(df11['a']) + '_' + df2['Code'].astype(str).str.zfill(2)
print (df2)
Id Data Category Code
0 1 tetyer Airport AIR_03
1 2 fgdss Hospital HEA_01
2 3 dffsdsa Airport AIR_05
3 4 dsfsas Hospital HEA_02
4 5 fdsfada Airport AIR_06
df = pd.concat([df1.drop(columns=['a','b']), df2], ignore_index=True)
print (df)
Id Data Category Code
0 1 tey Airport AIR_02
1 2 fg Hospital HEA_04
2 3 dffs Airport AIR_01
3 4 dsfs Hospital HEA_03
4 5 fdsf Airport AIR_04
5 1 tetyer Airport AIR_03
6 2 fgdss Hospital HEA_01
7 3 dffsdsa Airport AIR_05
8 4 dsfsas Hospital HEA_02
9 5 fdsfada Airport AIR_06
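The core of the approach above is picking, for each prefix, the lowest numbers that are not yet in use. As a rough illustration of just that step, here is a minimal plain-Python sketch. The function name next_free_codes and its parameters are made up for this example and are not part of pandas or of the answer:

```python
from itertools import count

def next_free_codes(used, prefix, n, width=2):
    """Return the n lowest codes for `prefix` that do not appear in `used`."""
    # numeric suffixes already taken for this prefix
    taken = {int(code.split('_')[1]) for code in used
             if code.startswith(prefix + '_')}
    # lazily walk 1, 2, 3, ... and skip the taken numbers
    free = (i for i in count(1) if i not in taken)
    return [f'{prefix}_{next(free):0{width}d}' for _ in range(n)]

existing = ['AIR_02', 'HEA_04', 'AIR_01', 'HEA_03', 'AIR_04']
airport_codes = next_free_codes(existing, 'AIR', 3)
hospital_codes = next_free_codes(existing, 'HEA', 2)
print(airport_codes)   # ['AIR_03', 'AIR_05', 'AIR_06']
print(hospital_codes)  # ['HEA_01', 'HEA_02']
```

Note that this hands out codes per prefix in one batch, so interleaved categories would still need to be mapped back to their rows.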

To solve this, I would define a class to act as a code filler. The advantage of this approach is that you can then easily add more data without needing to recompute everything:
class CodeFiller():
    def __init__(self, df, col='Code', maps=None):
        # collect the numeric suffixes already used for each prefix
        codes = df[col].str.split('_', expand=True).groupby(0)[1].agg(set).to_dict()
        self.maps = maps
        # one generator per prefix, so its state persists between calls
        self.gens = {prefix: self.code_gen(prefix, codes[prefix]) for prefix in codes}

    def code_gen(self, prefix, codes):
        from itertools import count
        for i in count(1):
            num = f'{i:02}'
            # skip suffixes that already exist in the data
            if num not in codes:
                yield f'{prefix}_{num}'

    def __call__(self, prefix):
        if self.maps:
            prefix = self.maps[prefix]
        return next(self.gens[prefix])

refs = {'Airport': 'AIR', 'Hospital': 'HEA'}
filler = CodeFiller(df1, maps=refs)
df3 = pd.concat([df1, df2.assign(Code=df2['Category'].map(filler))], ignore_index=True)
output:
Id Data Category Code
0 1 tey Airport AIR_02
1 2 fg Hospital HEA_04
2 3 dffs Airport AIR_01
3 4 dsfs Hospital HEA_03
4 5 fdsf Airport AIR_04
5 1 tetyer Airport AIR_03
6 2 fgdss Hospital HEA_01
7 3 dffsdsa Airport AIR_05
8 4 dsfsas Hospital HEA_02
9 5 fdsfada Airport AIR_06
Now imagine more data arrives later: you can simply keep using the same filler (reusing df2 here for the example):
pd.concat([df3, df2.assign(Code=df2['Category'].map(filler))], ignore_index=True)
output:
Id Data Category Code
[...]
10 1 tetyer Airport AIR_07
11 2 fgdss Hospital HEA_05
12 3 dffsdsa Airport AIR_08
13 4 dsfsas Hospital HEA_06
14 5 fdsfada Airport AIR_09
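Both results above keep each frame's own Id values (1 to 5 twice), whereas the expected output in the question numbers the rows 1 to 10. A minimal sketch of renumbering after concatenation, using small stand-in frames rather than the full data:

```python
import pandas as pd

# stand-ins for df1 and df2, reduced to two rows each
df1 = pd.DataFrame({'Id': [1, 2], 'Data': ['tey', 'fg']})
df2 = pd.DataFrame({'Id': [1, 2], 'Data': ['tetyer', 'fgdss']})

combined = pd.concat([df1, df2], ignore_index=True)
# overwrite Id with a running number starting at 1
combined['Id'] = range(1, len(combined) + 1)
ids = combined['Id'].tolist()
print(ids)  # [1, 2, 3, 4]
```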

Related

How can I count the number of occurrences of more than one column (e.g. city & country)?

Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city&country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts.
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
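DataFrame.value_counts was added in pandas 1.1. On older versions, or if you prefer an explicit grouping, groupby(...).size() also counts rows per group without a helper column. A minimal sketch on the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame(
    [('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')],
    columns=['city', 'country'],
)
# size() counts the rows in each (city, country) group
counts = df.groupby(['city', 'country']).size().reset_index(name='n')
print(counts)
```

Unlike value_counts, which orders by count, the groupby result is ordered by the group keys.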

Replace the values in a column based on frequency

I have a dataframe (3.7 million rows) with a column with different country names
id Country
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 INDIA
6 USA
7 USA
8 ITALY
9 USA
10 RUSSIA
I want to replace INDIA and ITALY with "Miscellanous" because each of them accounts for less than 15% of the rows in the column.
My alternate solution is to replace the names with their frequency using
df.column_name = df.column_name.map(df.column_name.value_counts())
Use:
df.loc[df.groupby('Country')['id']
.transform('size')
.div(len(df))
.lt(0.15),
'Country'] = 'Miscellanous'
Or
df.loc[df['Country'].map(df['Country'].value_counts(normalize=True)
.lt(0.15)),
'Country'] = 'Miscellanous'
If you want to put every country whose frequency is below a threshold into the "Misc" category:
threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()
df['Country'].map(mappings)
Here is another option:
s = df['Country'].value_counts(normalize=True)
rare = s[s < .15].index.tolist()
df['Country'] = df['Country'].replace(rare, 'Miscellanous')
You can use dictionary and map for this:
d = df.Country.value_counts(normalize=True).to_dict()
df.Country.map(lambda x : x if d[x] > 0.15 else 'Miscellanous' )
Output:
id
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 Miscellanous
6 USA
7 USA
8 Miscellanous
9 USA
10 RUSSIA
Name: Country, dtype: object
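As a cross-check of the threshold logic, independent of pandas, here is a minimal plain-Python sketch on the ten sample rows from the question (the variable names are illustrative only):

```python
from collections import Counter

countries = ['RUSSIA', 'USA', 'RUSSIA', 'RUSSIA', 'INDIA',
             'USA', 'USA', 'ITALY', 'USA', 'RUSSIA']
threshold = 0.15

counts = Counter(countries)
total = len(countries)
# names whose share of all rows falls below the threshold
rare = {name for name, n in counts.items() if n / total < threshold}
result = ['Miscellanous' if name in rare else name for name in countries]
print(sorted(rare))  # ['INDIA', 'ITALY']
```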

How to read multiple files from a directory and append them in python?

I am working with files in a folder and need a better way to loop through the files and append a column from each one to make a master file. For two files I was reading them as two dataframes and appending a series. However, I have now run into a situation with more than 100 files.
File 1 is as below:
Num  Department   Product  Salesman  Location             rating1
1    Electronics  TV       3         Bigmart, Delhi       5
2    Electronics  TV       1         Bigmart, Mumbai      4
3    Electronics  TV       2         Bigmart, Bihar       3
4    Electronics  TV       2         Bigmart, Chandigarh  5
5    Electronics  Camera   2         Bigmart, Jharkhand   5
Similarly, file 2:
Num  Department   Product  Salesman  Location             rating2
1    Electronics  TV       3         Bigmart, Delhi       2
2    Electronics  TV       1         Bigmart, Mumbai      4
3    Electronics  TV       2         Bigmart, Bihar       4
4    Electronics  TV       2         Bigmart, Chandigarh  5
5    Electronics  Camera   2         Bigmart, Jharkhand   3
What I am trying to achieve is to read the rating column from each of the other files and append it as a new column. Expected:
Num  Department   Product  Salesman  Location             rating1  rating2
1    Electronics  TV       3         Bigmart, Delhi       5        2
2    Electronics  TV       1         Bigmart, Mumbai      4        4
3    Electronics  TV       2         Bigmart, Bihar       3        4
4    Electronics  TV       2         Bigmart, Chandigarh  5        5
5    Electronics  Camera   2         Bigmart, Jharkhand   5        3
I modified some of the code posted here. The following code worked:
import os
import pandas as pd

def read_folder(folder):
    files = [i for i in os.listdir(folder) if 'xlsx' in i]
    df = pd.read_excel(folder+'/{}'.format(files[0]))
    for f in files[1:]:
        df2 = pd.read_excel(folder+'/{}'.format(f))
        # column 5 is the rating column; rows are matched by position
        df = df.merge(df2.iloc[:,5],left_index=True,right_index=True)
    return df
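Merging on the index, as above, relies on every file listing its rows in the same order. If that is not guaranteed, one option is to merge on the shared key columns instead. The sketch below uses two small in-memory frames as stand-ins for the files, and the choice of key columns is an assumption for the example:

```python
from functools import reduce
import pandas as pd

# columns that identify a row in every file (assumed for this example)
keys = ['Num', 'Product']

f1 = pd.DataFrame({'Num': [1, 2], 'Product': ['TV', 'Camera'], 'rating1': [5, 4]})
# same rows as f1 but in a different order
f2 = pd.DataFrame({'Num': [2, 1], 'Product': ['Camera', 'TV'], 'rating2': [3, 2]})

# fold the list of frames into one, matching rows by key rather than position
merged = reduce(lambda left, right: left.merge(right, on=keys, how='left'), [f1, f2])
print(merged)
```

With 100 files, the list passed to reduce would simply hold all the frames read from the folder.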
This function reads every file in a folder and returns the list of per-file data frames together with a single concatenated data frame:
import pandas as pd
import os

def read_folder(csv_folder):
    files = os.listdir(csv_folder)
    df = []
    for f in files:
        print(f)
        csv_file = csv_folder + "/" + f
        df.append(pd.read_csv(csv_file))
    df_full = pd.concat(df, ignore_index=True)
    return df, df_full
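As I understand your last comment, you need to add the rating columns and create one file. After reading all the files you can do the operation below.
# df is the list of per-file data frames built in read_folder
final_df = df[0]
i = 2  # df[0] already carries the first rating column
for d in df[1:]:
    # the rating is the last column of each file (rating1, rating2, ...)
    final_df["rating_" + str(i)] = d.iloc[:, -1]
    i = i+1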
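One caveat: os.listdir returns names in arbitrary order, so the rating columns may not come out in file order. A minimal sketch of sorting by the number embedded in each file name; the file names and the helper numeric_key are hypothetical:

```python
import re

def numeric_key(name):
    """Sort key: the first run of digits in the file name, as an int."""
    match = re.search(r'\d+', name)
    return int(match.group()) if match else -1

# hypothetical file names, in the arbitrary order a directory listing might return
names = ['ratings10.csv', 'ratings2.csv', 'ratings1.csv']
ordered = sorted(names, key=numeric_key)
print(ordered)  # ['ratings1.csv', 'ratings2.csv', 'ratings10.csv']
```

A plain sorted(names) would place ratings10.csv before ratings2.csv, which is why the key converts the digits to an integer.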
This version of read_folder() returns a list of data frames. It also adds a helper column (for ratings).
import pandas as pd
from pathlib import Path
def read_folder(csv_folder):
    ''' Input is a folder with csv files; return list of data frames.'''
    csv_folder = Path(csv_folder).absolute()
    csv_files = [f for f in csv_folder.iterdir() if f.name.endswith('csv')]
    # the assign() method adds a helper column
    dfs = [
        pd.read_csv(csv_file).assign(rating_src = f'rating-{idx}')
        for idx, csv_file in enumerate(csv_files, 1)
    ]
    return dfs
Now assemble the data frames into the desired shape:
dfs = read_folder(csv_folder)
dfs = (pd.concat((d for d in dfs))
         .set_index(['Num', 'Department', 'Product', 'Salesman', 'Location', 'rating_src'])
         .squeeze()
         .unstack(level='rating_src')
         .reset_index()
      )
dfs.columns.name = ''

printing multiple sections of text between two markers in python

I converted this page (it's squad lists for different sports teams) from PDF to text using this code:
import PyPDF3
import sys
import tabula
import pandas as pd
#One method
pdfFileObj = open(sys.argv[1],'rb')
pdfReader = PyPDF3.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count +=1
    text += pageObj.extractText()
print(text)
The output looks like this:
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
3
Darragh Leahy
DF
.... some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK
2
Colm Horgan
DF
3
Alan Bennett
DF
....some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: DERRY CITY
1
Peter Cherrie
GK
2
Conor McDermott
DF
3
Ciaran Coll
DF
I wanted to transform this output to a tab delimited file with three columns: team name, player name, and number. So for the example I gave, the output would look like:
Bohemians James Talbot 1
Bohemians Derek Pender 2
Bohemians Darragh Leahy 3
Cork City Mark McNulty 1
Cork City Colm Horgan 2
Cork City Alan Bennett 3
Derry City Peter Cherrie 1
Derry City Conor McDermott 2
Derry City Ciaran Coll 3
I know I need to first (1) divide the file into sections based on team, and then (2) within each team section, combine each name and number into a pair so that every number is assigned to a name.
I wrote this little bit of code to parse the big file into each sports team:
import sys
fileopen = open(sys.argv[1])
recording = False
for line in fileopen:
    if not recording:
        if line.startswith('PREMI'):
            recording = True
    elif line.startswith('2019 SEA'):
        recording = False
    else:
        print(line)
But I'm stuck, because the above code won't divide the text into one block per team (i.e. I need multiple blocks of text extracted to separate strings or lists). Can someone advise how to divide up the text file per team, so that in this example I am left with three blocks of text? I can then work on each team's block to pair numbers and names.
So, this is not necessarily true to form and it doesn't take into account the other libraries you used, but it is designed to give you a start. You can reformat it however you wish.
>>> string = '''2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
3
Darragh Leahy
DF
.... some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK
2
Colm Horgan
DF
3
Alan Bennett
DF
....some more names....
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: DERRY CITY
1
Peter Cherrie
GK
2
Conor McDermott
DF
3
Ciaran Coll
DF'''
>>> def reorder(string):
        import re
        headers = ['Team', 'Name', 'Number']
        print('\n')
        print(headers)
        print()
        # split the text into one paragraph per team
        paragraphs = re.findall(r'2019[\S\s]+?(?=2019|$)', string)
        for paragraph in paragraphs:
            club = re.findall(r'(?i)CLUB:[\s]*([\S\s]+?)\n', paragraph)
            # each match is a (number, name) pair
            names_numbers = re.findall(r'(?i)([\d]+)[\n]{1,3}[\s]*([\S\ ]+)', paragraph)
            for i in range(len(names_numbers)):
                if len(club) == 1:
                    print(club[0]+' | '+names_numbers[i][1]+' | '+names_numbers[i][0])
>>> reorder(string)
['Team', 'Name', 'Number']
BOHEMIANS | James Talbot | 1
BOHEMIANS | Derek Pender | 2
BOHEMIANS | Darragh Leahy | 3
CORK CITY | Mark McNulty | 1
CORK CITY | Colm Horgan | 2
CORK CITY | Alan Bennett | 3
DERRY CITY | Peter Cherrie | 1
DERRY CITY | Conor McDermott | 2
DERRY CITY | Ciaran Coll | 3
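For the tab-delimited output the question asks for, a regex-free alternative is to walk the lines once, remember the current club, and treat each all-digit line as a squad number followed by the player's name. The sketch below is one possible reading of the extracted text layout, shortened to a few players. The function name parse_squads is made up for this example:

```python
text = """2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: BOHEMIANS
1
James Talbot
GK
2
Derek Pender
DF
2019 SEASON
PREMIER DIVISION SQUAD NUMBERS
CLUB: CORK CITY
1
Mark McNulty
GK"""

def parse_squads(text):
    """Return (team, player, number) tuples from the extracted PDF text."""
    rows, club = [], None
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith('CLUB:'):
            club = line.split(':', 1)[1].strip().title()
        elif line.isdigit() and club and i + 1 < len(lines):
            # a squad number is followed by the player's name
            rows.append((club, lines[i + 1], line))
            i += 1  # skip the name line just consumed
        i += 1
    return rows

rows = parse_squads(text)
for row in rows:
    print('\t'.join(row))
```

The position lines (GK, DF) are ignored, and the "2019 SEASON" header is not mistaken for a number because the line is not made up of digits only.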

Tallying number of times certain strings occur in Python

I'm working on a database of incidents affecting different sectors in different countries and want to create a table tallying the incident rate breakdown for each country.
The database looks like this at the moment:
Incident Name | Country Affected | Sector Affected
incident_1 | US,TW,CN | Engineering,Media
incident_2 | FR,RU,CN | Government
etc., etc.
My aim would be to build a table which looks like this:
Country | Engineering | Media | Government
CN | 3 | 0 | 5
etc.
Right now my method is basically to use an if statement to check whether the country column contains a specific string (for example 'CN') and, if this returns True, to run Counter from collections to create a dictionary of the initial tally, then save this.
My issue is how to scale this up to a level where it can be run across the entire database AND how to actually save the dictionary produced by Counter.
Use pd.Series.str.get_dummies and pd.DataFrame.dot:
c = df['Country Affected'].str.get_dummies(sep=',')
s = df['Sector Affected'].str.get_dummies(sep=',')
c.T.dot(s)
    Engineering  Government  Media
CN            1           1      1
FR            0           1      0
RU            0           1      0
TW            1           0      1
US            1           0      1
A bigger example:
np.random.seed([3,1415])
countries = ['CN', 'FR', 'RU', 'TW', 'US', 'UK', 'JP', 'AU', 'HK']
sectors = ['Engineering', 'Government', 'Media', 'Commodidty']
def pick_rnd(x):
    i = np.random.randint(1, len(x))
    j = np.random.choice(x, i, False)
    return ','.join(j)

df = pd.DataFrame({
    'Country Affected': [pick_rnd(countries) for _ in range(10)],
    'Sector Affected': [pick_rnd(sectors) for _ in range(10)]
})
df
Country Affected Sector Affected
0 CN Government,Media
1 FR,TW,JP,US,UK,CN,RU,AU Commodidty,Government
2 HK,AU,JP Commodidty
3 RU,CN,FR,JP,UK Media,Commodidty,Engineering
4 CN,RU,FR,JP,TW,HK,US,UK Government,Media,Commodidty
5 FR,CN Commodidty
6 FR,HK,JP,TW,US,AU,CN Commodidty
7 CN,HK,RU,TW,UK,US,FR,JP Media,Commodidty
8 JP,UK,AU Engineering,Media
9 RU,UK,FR Media
Then
c = df['Country Affected'].str.get_dummies(sep=',')
s = df['Sector Affected'].str.get_dummies(sep=',')
c.T.dot(s)
    Commodidty  Engineering  Government  Media
AU           3            1           1      1
CN           6            1           3      4
FR           6            1           2      4
HK           4            0           1      2
JP           6            2           2      4
RU           4            1           2      4
TW           4            0           2      2
UK           4            2           2      5
US           4            0           2      2
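Since the question mentions Counter, the same tally can also be sketched without pandas by counting every (country, sector) pair per incident. This is only an illustration on the two sample incidents from the question, with made-up variable names:

```python
from collections import Counter
from itertools import product

# (countries, sectors) for the two sample incidents
incidents = [
    ('US,TW,CN', 'Engineering,Media'),
    ('FR,RU,CN', 'Government'),
]

tally = Counter()
for countries, sectors in incidents:
    # every country in an incident is paired with every sector in it
    tally.update(product(countries.split(','), sectors.split(',')))

print(tally[('CN', 'Engineering')])  # 1
print(tally[('FR', 'Media')])        # 0
```

A Counter returns 0 for pairs it has never seen, which matches the zeros in the table produced by c.T.dot(s).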
