Column selection based on specific text - Python

I want to extract specific columns that contain specific names. Below you can see my data
import numpy as np
import pandas as pd
data = {
    'Names': ['Store (007) Total amount of Sales ',
              'Store perc (65) Total amount of sales ',
              'Mall store, aid (005) Total amount of sales',
              'Increase in the value of sales / Additional seling (22) Total amount of sales',
              'Dividends (0233) Amount of income tax',
              'Other income (098) Total amount of Sales',
              'Other income (0245) Amount of Income Tax',
              ],
    'Sales': [10, 10, 9, 7, 5, 5, 5],
}
df = pd.DataFrame(data, columns=['Names', 'Sales'])
df
This data has some specific rows that I need selected into a separate data frame. The keywords for this selection are the phrases Total amount of Sales or Total amount of sales. These phrases are placed after the closing bracket ). Also please take into account that the text is not trimmed, so trailing spaces are possible.
So can anybody help me solve this?

Use Series.str.contains with case=False for case-insensitive matching in boolean indexing:
df1 = df[df['Names'].str.contains('Total amount of Sales', case=False)]
print (df1)
Names Sales
0 Store (007) Total amount of Sales 10
1 Store perc (65) Total amount of sales 10
2 Mall store, aid (005) Total amount of sales 9
3 Increase in the value of sales / Additional se... 7
5 Other income (098) Total amount of Sales 5
Or, if you need to match only sales or Sales, use a character class:
df2 = df[df['Names'].str.contains('Total amount of [Ss]ales')]
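Since the question notes the phrase always follows the closing bracket and the strings may carry trailing spaces, a regex sketch can anchor the match more strictly; the pattern is an assumption based on the sample rows:
# pattern assumed from the sample data: the phrase follows ')' and may have trailing spaces
df3 = df[df['Names'].str.contains(r'\)\s*Total amount of sales\s*$',
                                  case=False, regex=True)]
Here \) requires the closing bracket before the phrase and \s*$ tolerates any trailing whitespace, so untrimmed strings still match.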


Tiering pandas column based on unique id and range cutoffs

I have one df that categorizes income into tiers across males and females and thousands of zip codes. I need to add a column to df2 that maps each person's income level by zip code (average, above average, etc.).
The idea is to assign the highest cutoff exceeded by a given person's income, or assign the lowest tier by default.
The income level for each tier also varies by zip code. For certain zip codes there is a limited number of tiers (e.g. no very high incomes). There are also separate tiers for males by zip code, not shown due to space.
I think I need to create some sort of dictionary, but I am not sure how to handle this. Any help would go a long way, thanks.
Edit: The first df acts as a key, and I am looking to use it to assign the corresponding row value from the column 'Income Level' to df2.
E.g. for a unique id in df2, compare df2['Annual Income'] to the matching id in df['Annual Income cutoff'], then assign the highest possible income level from df as a new row value in df2.
import pandas as pd
import numpy as np
data = [['female', 10009, 'very high', 10000000],
        ['female', 10009, 'high', 100000],
        ['female', 10009, 'above average', 75000],
        ['female', 10009, 'average', 50000]]
df = pd.DataFrame(data, columns=['Sex', 'Area Code', 'Income level', 'Annual Income cutoff'])
print(df)
Sex Area Code Income level Annual Income cutoff
0 female 10009 very high 10000000
1 female 10009 high 100000
2 female 10009 above average 75000
3 female 10009 average 50000
data_2 = [['female',10009, 98000], ['female', 10009, 56000]]
df2 = pd.DataFrame(data_2, columns = ['Sex', 'Area Code', 'Annual Income'])
print(df2)
Sex Area Code Annual Income
0 female 10009 98000
1 female 10009 56000
output_data = [['female',10009, 98000, 'above average'], ['female', 10009, 56000, 'average']]
final_output = pd.DataFrame(output_data, columns = ['Sex', 'Area Code', 'Annual Income', 'Income Level'])
print(final_output)
Sex Area Code Annual Income Income Level
0 female 10009 98000 above average
1 female 10009 56000 average
One way to do this is to use pd.merge_asof:
pd.merge_asof(df2.sort_values('Annual Income'),
              df.sort_values('Annual Income cutoff'),
              left_on='Annual Income',
              right_on='Annual Income cutoff',
              by=['Sex', 'Area Code'], direction='backward')
Output:
Sex Area Code Annual Income Income level Annual Income cutoff
0 female 10009 56000 average 50000
1 female 10009 98000 above average 75000
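To match the desired final_output exactly, a small follow-up sketch (column names taken from the question) drops the helper cutoff column and renames the tier column:
out = pd.merge_asof(df2.sort_values('Annual Income'),
                    df.sort_values('Annual Income cutoff'),
                    left_on='Annual Income',
                    right_on='Annual Income cutoff',
                    by=['Sex', 'Area Code'], direction='backward')
# drop the cutoff used for matching and align the column name with final_output
out = (out.drop(columns='Annual Income cutoff')
          .rename(columns={'Income level': 'Income Level'}))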

How to lookup/find the value in a two-columns range from another Dataframe? - Python Pandas Dataframe

I have a question about Pandas DataFrames. There are two tables: the first is a mapping table, and the second is transactional data.
In the mapping table, there are two columns with a range of From and To.
Below are the two dataframes:
1) df1 is the mapping table, with ranges of account numbers that map to specific tax types.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Category': ['FBT Tax', 'CIT', 'GST', 'Stamp Duty', 'Sales Tax'],
                    'GL From': ['10000000', '20000000', '30000000', '40000000', '50000000'],
                    'GL To': ['10009999', '20009999', '30009999', '40009999', '50009999']})
Category GL From GL To
0 FBT Tax 10000000 10009999
1 CIT 20000000 20009999
2 GST 30000000 30009999
3 Stamp Duty 40000000 40009999
4 Sales Tax 50000000 50009999
2) df2 is the transactional table (there are more columns, skipped for this demo), with the account number that I want to look up in the ranges in df1.
df2 = pd.DataFrame({'Date': ['1/10/19', '2/10/19', '3/10/19', '10/11/19', '12/12/19', '30/08/19', '01/07/19'],
                    'GL Account': ['20000456', '30000199', '20004689', '40008900', '50000876', '10000325', '70000199'],
                    'Product LOB': ['Computer', 'Mobile Phone', 'TV', 'Fridge', 'Dishwasher', 'Tablet', 'Table']})
Date GL Account Product LOB
0 1/10/19 20000456 Computer
1 2/10/19 30000199 Mobile Phone
2 3/10/19 20004689 TV
3 10/11/19 40008900 Fridge
4 12/12/19 50000876 Dishwasher
5 30/08/19 10000325 Tablet
6 01/07/19 70000199 Table
In both df1 and df2, the account numbers are of string dtype, so I created a simple function to convert them to integers.
def to_integer(col):
    return pd.to_numeric(col, downcast='integer')
I have tried both np.dot and .loc to map the Category column, but I encountered this error:
ValueError: Can only compare identically-labeled Series objects
result = np.dot((to_integer(df2['GL Account']) >= to_integer(df1['GL From'])) &
                (to_integer(df2['GL Account']) <= to_integer(df1['GL To'])), df1['Category'])
result = df1.loc[(to_integer(df2['GL Account']) >= to_integer(df1['GL From'])) &
                 (to_integer(df2['GL Account']) <= to_integer(df1['GL To'])), "Category"]
What I want to achieve is like below:
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 2/10/19 30000199 Mobile Phone GST
2 3/10/19 20004689 TV CIT
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
6 01/07/19 70000199 Table NaN
Is there any way to map between two dataframes based on a From-To range?
Pandas >= 0.25.0
We can do a cartesian merge by first assigning an artificial column called key to both frames and joining on it. Then we can use query to filter everything between the correct ranges. Notice that we use backticks (`) to reference columns with spaces in their names; this requires pandas >= 0.25.0:
df2.assign(key=1).merge(df1.assign(key=1), on='key')\
   .drop(columns='key')\
   .query('`GL Account`.between(`GL From`, `GL To`)')\
   .drop(columns=['GL From', 'GL To'])\
   .reset_index(drop=True)
If you use a left join, replace the .query part with the following to keep the rows which didn't match in the join:
.query('`GL Account`.between(`GL From`, `GL To`) | `GL From`.isna()')
Pandas < 0.25.0
Alternatively, use simple boolean indexing after the same cartesian merge:
mrg = df2.assign(key=1).merge(df1.assign(key=1), on='key')\
         .drop(columns='key')
mrg[mrg['GL Account'].between(mrg['GL From'], mrg['GL To'])]\
   .drop(columns=['GL From', 'GL To'])\
   .reset_index(drop=True)
Output
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 2/10/19 30000199 Mobile Phone GST
2 3/10/19 20004689 TV CIT
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
In case your data follows the pattern provided, you can create a column that holds the lower bound of each account's range and then merge on it:
df1['GL From'] = df1['GL From'].astype(int) #make it integer
### create lower bound
df2['lbound'] = df2['GL Account'].astype(int)//10000000*10000000
### merge
df2.merge(df1, left_on='lbound', right_on='GL From')\
   .drop(['lbound', 'GL From', 'GL To'], axis=1)
Output
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 3/10/19 20004689 TV CIT
2 2/10/19 30000199 Mobile Phone GST
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
Added
In case the data does not follow a specific pattern, you can use np.intersect1d with np.where to find the intersection of the lower-bound and upper-bound conditions, and therefore the index of the matched range.
For instance:
### func to get the index where the account is greater or equal to `FROM` and lower or equal to `TO`
@np.vectorize
def match_ix(acc_no):
    return np.intersect1d(np.where(acc_no >= df1['GL From'].values),
                          np.where(acc_no <= df1['GL To'].values))
## Apply to dataframe
df2['right_ix'] = match_ix(df2['GL Account'])
## Merge using the index. Use how='left' for the left join to preserve unmatched rows
df2.merge(df1, left_on='right_ix', right_on=df1.index, how='left')\
   .drop(['right_ix', 'GL From', 'GL To'], axis=1)
Output
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 3/10/19 20004689 TV CIT
2 2/10/19 30000199 Mobile Phone GST
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
In terms of performance, this is quicker and avoids the MemoryError you might hit on full cartesian joins:
### Using 100* the sample provided
tempdf2 = pd.concat([df2]*100)
tempdf1 = pd.concat([df1]*100)
#23 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
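As a side note, a sketch using pd.IntervalIndex can avoid the cartesian join entirely, assuming (as in the sample) that the From-To ranges do not overlap:
# build an interval per category; closed='both' makes both bounds inclusive
bounds = pd.IntervalIndex.from_arrays(df1['GL From'].astype(int),
                                      df1['GL To'].astype(int),
                                      closed='both')
# position of each account in the intervals; -1 means no match (e.g. 70000199)
pos = bounds.get_indexer(df2['GL Account'].astype(int))
# map positions back to categories; -1 is absent from df1's index, so it becomes NaN
df2['Category'] = df1['Category'].reindex(pos).to_numpy()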

(Python) How to group unique values in column with total of another column

This is a sample what my dataframe looks like:
company_name  country_code  state_code  software  finance  commerce  etc......
google        USA           CA          1         0        0
jimmy         GBR           unknown     0         0        1
I would like to be able to group the industry of a company with its state code. For example, I would like the total number of companies per industry in each state (e.g. 200 software companies in CA, 100 finance companies in NY).
I am currently just counting the number of total companies in each state using:
usa_df['state_code'].value_counts()
But I can't figure out how to group the number of each type of industry in each individual state.
If the 1s and 0s are boolean flags for each category then you should just need sum.
df[df.country_code == 'USA'].groupby('state_code').sum().reset_index()
# state_code commerce finance software
#0 CA 0 0 1
df.groupby(['state_code']).agg({'software' : 'sum', 'finance' : 'sum', ...})
This will group by the state_code, and sum up the number of 'software', 'finance', etc in each grouping.
Could also do a pivot_table:
df.pivot_table(index='state_code', values=['software', 'finance', ...], aggfunc='sum')
This may help you:
result_dataframe = dataframe_name.groupby('state_code').sum()
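If you prefer the counts in long form (one row per state and industry), a melt-based sketch works too; the industry column names here are assumed from the sample:
usa = df[df['country_code'] == 'USA']
# reshape so each row is one (state, industry) flag, then count the 1s
long = usa.melt(id_vars='state_code',
                value_vars=['software', 'finance', 'commerce'],
                var_name='industry', value_name='flag')
counts = long.groupby(['state_code', 'industry'])['flag'].sum().reset_index()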

Merging 2 data frames without changing associated values

I currently have 2 datasets
1 = Drugs prescribed per hospital
2 = Crimes committed
I have been able to assign the nearest hospital's ID to each crime, so I can identify which hospital is closest.
What I really would like to do is to assign the amount of drugs prescribed (via the value_counts method) to the hospital ID in the Crime data, so that I can then plot a scatter matrix of where the crimes took place against the total quantity of drugs prescribed by the closest hospital.
I have tried using the following
df = Crimes.merge(hosp[['hosp no', 'Total Quantity']],
                  left_on='hosp_no', right_on='hosp no').drop(columns='hosp no')
df
However, when I use the above code, the Hosp ID associated with each crime changes, and I don't want it to!
I am new to Jupyter Notebook, so I would be most grateful for any help!
Thank you in advance
Crimes df
ID Type Hosp No
0 Anti-Social 222
Hosp df
Hosp no Total Quantity Drug name
222 1000 Paracetamol
So basically Hosp 222 has prescribed 1000 Paracetamol drugs how can I assign the number 1000 to the Crime df where Hosp No = 222 to look like this:
Crimes df
ID Type Hosp No Total Quantity
0 Anti-Social 222 1000
If the columns you are merging on share the same name, you don't need the on parameter. Since you need the column added to Crimes, we can use how='left':
Crimes = Crimes.merge(Hosp[['Hosp No', 'Total Quantity']], how = 'left')
ID Type Hosp No Total Quantity
0 0 Anti-Social 222 1000
Let me know if this is the desired output or you need anything else
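If the key columns are named differently (as in the question's own attempt, 'hosp_no' vs 'hosp no'), a sketch with left_on/right_on plus how='left' keeps every crime row and its original Hosp ID; the column names here follow the question's attempt:
# a left join preserves all crime rows, even those whose hospital has no match
df = Crimes.merge(hosp[['hosp no', 'Total Quantity']],
                  left_on='hosp_no', right_on='hosp no',
                  how='left').drop(columns='hosp no')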

Python 3.4 : Pandas DataFrame not responding to ordered dictionary

I am populating a DataFrame with an ordered dictionary, but the pandas DataFrame is alphabetically organizing the columns.
code
from collections import OrderedDict
import pandas as pd

full_dict = OrderedDict()
# inside a loop over tickers:
labels = income_data[0:-1:4]
year1 = income_data[1:-1:4]
key = eachTicker
value = OrderedDict(zip(labels, year1))
full_dict[key] = value
df = pd.DataFrame(full_dict)
print(df)
As you can see below full_dict is a zipped dictionary from multiple lists, namely : labels and year1
output of full_dict
print(full_dict)
OrderedDict([('AAPL', OrderedDict([('Total Revenue', 182795000), ('Cost of Revenue', 112258000), ('Gross Profit', 70537000), ('Research Development', 6041000), ('Selling General and Administrative', 11993000), ('Non Recurring', 0), ('Others', 0), ('Total Operating Expenses', 0), ('Operating Income or Loss', 52503000), ('Total Other Income/Expenses Net', 980000), ('Earnings Before Interest And Taxes', 53483000), ('Interest Expense', 0), ('Income Before Tax', 53483000), ('Income Tax Expense', 13973000), ('Minority Interest', 0), ('Net Income From Continuing Ops', 39510000), ('Discontinued Operations', 0), ('Extraordinary Items', 0), ('Effect Of Accounting Changes', 0), ('Other Items', 0), ('Net Income', 39510000), ('Preferred Stock And Other Adjustments', 0), ('Net Income Applicable To Common Shares', 39510000)]))])
The outputted DataFrame is ordered alphabetically and I do not know why. I want it to be ordered as in full_dict
code output
AAPL AMZN LNKD
Cost of Revenue 112258000 62752000 293797
Discontinued Operations 0 0 0
Earnings Before Interest And Taxes 53483000 99000 31205
Effect Of Accounting Changes 0 0 0
Extraordinary Items 0 0 0
Gross Profit 70537000 26236000 1924970
Income Before Tax 53483000 -111000 31205
Income Tax Expense 13973000 167000 46525
Interest Expense 0 210000 0
Minority Interest 0 0 -427
Net Income 39510000 -241000 -15747
Net Income Applicable To Common Shares 39510000 -241000 -15747
Net Income From Continuing Ops 39510000 -241000 -15747
Non Recurring 0 0 0
Operating Income or Loss 52503000 178000 36135
Other Items 0 0 0
Others 0 0 236946
Preferred Stock And Other Adjustments 0 0 0
Research Development 6041000 0 536184
Selling General and Administrative 11993000 26058000 1115705
Total Operating Expenses 0 0 0
Total Other Income/Expenses Net 980000 -79000 -4930
Total Revenue 182795000 88988000 2218767
This looks like a bug in the DataFrame ctor in that it's not respecting the key order when the orient is 'columns'. A workaround is to use from_dict and transpose the result when you specify orient='index':
In [31]:
df = pd.DataFrame.from_dict(d, orient='index').T
df
Out[31]:
AAPL
Total Revenue 182795000
Cost of Revenue 112258000
Gross Profit 70537000
Research Development 6041000
Selling General and Administrative 11993000
Non Recurring 0
Others 0
Total Operating Expenses 0
Operating Income or Loss 52503000
Total Other Income/Expenses Net 980000
Earnings Before Interest And Taxes 53483000
Interest Expense 0
Income Before Tax 53483000
Income Tax Expense 13973000
Minority Interest 0
Net Income From Continuing Ops 39510000
Discontinued Operations 0
Extraordinary Items 0
Effect Of Accounting Changes 0
Other Items 0
Net Income 39510000
Preferred Stock And Other Adjustments 0
Net Income Applicable To Common Shares 39510000
EDIT
The bug is due to line 5746 in index.py:
def _union_indexes(indexes):
    if len(indexes) == 0:
        raise AssertionError('Must have at least 1 Index to union')
    if len(indexes) == 1:
        result = indexes[0]
        if isinstance(result, list):
            result = Index(sorted(result))  # <------ culprit
        return result
When it constructs the index, it extracts the keys using result = indexes[0], but then it checks whether the result is a list and, if so, sorts it: result = Index(sorted(result)). This is why you get the alphabetical order.
Issue here
duplicate issue
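Alternatively, a sketch that keeps the plain constructor but restores the intended row order by reindexing on the first ticker's keys (assuming full_dict as built above):
# the constructor sorts the index (the bug above); reindex restores
# the OrderedDict's insertion order
first_keys = list(next(iter(full_dict.values())).keys())
df = pd.DataFrame(full_dict).reindex(first_keys)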
