I want to create a column which shows the data type of the data within an Excel spreadsheet, i.e. whether the data in any given cell is a string, an integer, a float, etc. Currently I'm working with mocked-up data to test with, and I hope to eventually use this for larger Excel files with more field headers.
My current high-level method is as follows:
Read the Excel file and create a dataframe.
Re-format this table to create a column of all the data I wish to label with a data type (i.e. whether it is a string, integer or float), alongside the respective field headers.
Create a 'Data Type' column which holds these labels for each piece of data, populated according to a dictionary of regular expressions.
import os
from glob import glob
import pandas as pd
from os import path
import re
sample_file = 'C:/Users/951297/Documents/Python Scripts/DD\\Fund_Data.xlsx'
dataf = pd.read_excel(sample_file)
dataf
FUND ID FUND NAME AMOUNT
0 10101 Holdings company A 10000
1 20202 Holdings company B 2000.5
2 30303 Holdings company C 3000
# Create column list of data attributes
stackdf = dataf.stack().reset_index()
stackdf = stackdf.rename(columns={'level_0':'index','level_1':'fh',0:'attribute'})
# Create a duplicate column of attribute to apply regex
stackdf_regex = stackdf.iloc[:,2:].rename(columns = {'attribute':'Data Type'})
# Dictionary of regex to replace values within the 'Data Type' column depending on the attribute
repl_dict = {re.compile(r'^[\d]+$'): 'Integer',
             re.compile(r'^[a-zA-Z0-9_ ]*$'): 'String',
             re.compile(r'[\d]+\.'): 'Float'}
#concatenate tables
pd.concat([stackdf, stackdf_regex], axis=1)
This is the reformatted table I wish to apply my regular expressions onto:
index fh attribute Data Type
0 0 FUND ID 10101 10101
1 0 FUND NAME Holdings company A Holdings company A
2 0 AMOUNT 10000 10000
3 1 FUND ID 20202 20202
4 1 FUND NAME Holdings company B Holdings company B
5 1 AMOUNT 2000.5 2000.5
6 2 FUND ID 30303 30303
7 2 FUND NAME Holdings company C Holdings company C
8 2 AMOUNT 3000 3000
This is the desired output:
index fh attribute Data Type
0 0 FUND ID 10101 Integer
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 Integer
3 1 FUND ID 20202 Integer
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 Float
6 2 FUND ID 30303 Integer
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 Integer
However, the following code produces the table below:
stackdf_regex = stackdf_regex.replace({'Data Type':repl_dict}, regex=True)
pd.concat([stackdf, stackdf_regex], axis=1)
index fh attribute Data Type
0 0 FUND ID 10101 10101
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 10000
3 1 FUND ID 20202 20202
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 2000.5
6 2 FUND ID 30303 30303
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 3000
Perhaps my regular expressions are incorrect, or my understanding of applying regular expressions to a dataframe is lacking. Happy to receive any suggestions on this current method, or another suitable/efficient method I have not considered.
Note: I hope to eventually expand the regex dictionary to account for more data types, and I understand it may not be efficient to check every cell for a pattern on larger datasets, but I'm still in the early stages.
You can use np.select, where each of the conditions tests a given regex against the 'Data Type' column using Series.str.contains, and choices corresponds to the conditions:
import numpy as np

# cast to str so the regexes also apply to numeric cells
s = df['Data Type'].astype(str)

conditions = [
    s.str.contains(r'^\d+$'),
    s.str.contains(r'^[\w\s]+$'),
    s.str.contains(r'^\d+\.\d+$')]

choices = ['Integer', 'String', 'Float']
df['Data Type'] = np.select(conditions, choices, default=None)
# print(df)
index fh attribute Data Type
0 0 FUND ID 10101 Integer
1 0 FUND NAME Holdings company A String
2 0 AMOUNT 10000 Integer
3 1 FUND ID 20202 Integer
4 1 FUND NAME Holdings company B String
5 1 AMOUNT 2000.5 Float
6 2 FUND ID 30303 Integer
7 2 FUND NAME Holdings company C String
8 2 AMOUNT 3000 Integer
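Since np.select evaluates the conditions in order, expanding to more data types is just a matter of appending condition/choice pairs. A minimal sketch, assuming ISO-style dates such as 2020-08-19 (adjust the pattern to however dates actually appear in your file):

conditions.append(s.str.contains(r'^\d{4}-\d{2}-\d{2}$'))  # assumed date format
choices.append('Date')

df['Data Type'] = np.select(conditions, choices, default=None)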
I have a data frame as below:
col value
0 companyId 123456
1 company_name small company
2 department IT
3 employee_name Jack
4 rank Grade 8
5 department finance
6 employee_name Tim
7 rank Grade 6
and I would like the data frame to be reshaped to a tabular format, ideally like this:
companyId company_name department employee_name rank
0 123456 small company IT Jack Grade 8
1 123456 small company finance Tim Grade 6
Can anyone help me please? Thanks.
Making two assumptions, you could reshape your data:
1- the company fields are identified by known headers, and all subsequent rows are data for that company's employees
2- there is a known first item that starts each employee record (here, department)
headers = ['companyId', 'company_name']
first_item = 'department'
masks = {h: df['col'].eq(h) for h in headers}
df2 = (df
       # move the company headers into new columns
       .assign(**{h: df['value'].where(m).ffill().bfill() for h, m in masks.items()})
       # and drop their rows
       .loc[~pd.concat(masks, axis=1).any(axis=1)]
       # compute a unique identifier per employee record
       .assign(idx=lambda d: d['col'].eq(first_item).cumsum())
       # pivot the data
       .pivot(index=['idx'] + headers, columns='col', values='value')
       .reset_index(headers)
      )
output:
companyId company_name department employee_name rank
1 123456 small company IT Jack Grade 8
2 123456 small company finance Tim Grade 6
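For reference, the eight-row input above can be reconstructed for testing with a minimal sketch like this (values copied from the question):

import pandas as pd

df = pd.DataFrame({
    'col': ['companyId', 'company_name', 'department', 'employee_name',
            'rank', 'department', 'employee_name', 'rank'],
    'value': ['123456', 'small company', 'IT', 'Jack',
              'Grade 8', 'finance', 'Tim', 'Grade 6'],
})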
Example on a more complex input:
col value
0 companyId 123456
1 company_name small company
2 department IT
3 employee_name Jack
4 rank Grade 8
5 department finance
6 employee_name Tim
7 rank Grade 6
8 companyId 67890
9 company_name other company
10 department IT
11 employee_name Jane
12 rank Grade 9
13 department management
14 employee_name Tina
15 rank Grade 12
output:
companyId company_name department employee_name rank
1 123456 small company IT Jack Grade 8
2 123456 small company finance Tim Grade 6
3 67890 other company IT Jane Grade 9
4 67890 other company management Tina Grade 12
I have a string: String = 'Oil - this company'
In my dataframe df1:
id CompanyName
1 Oil - this company
2 oil
3 oily
4 comp
I want to keep the rows whose CompanyName is contained in the string.
My final df should be: df1
id CompanyName
1 Oil - this company
2 oil
I tried:
df = df[df['CompanyName'].str.contains(String)]
but that tests the opposite containment (whether the whole string occurs inside each CompanyName), so it deleted the second row (2, oil).
Is there any way to keep the company names that are contained in the string?
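One way to get the desired rows, as a sketch: assuming you want the rows whose CompanyName occurs as a whole word inside String, case-insensitively (the word-boundary check is what excludes 'comp'), you can test each name against the string rather than the other way around:

import re
import pandas as pd

String = 'Oil - this company'
df1 = pd.DataFrame({'id': [1, 2, 3, 4],
                    'CompanyName': ['Oil - this company', 'oil', 'oily', 'comp']})

# keep rows whose CompanyName appears as a whole word inside String
mask = df1['CompanyName'].apply(
    lambda name: re.search(rf'\b{re.escape(name)}\b', String, re.IGNORECASE) is not None)
df1 = df1[mask]
print(df1)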
I'm trying to extract all column names from multiple Excel files and then map each filename to its extracted columns; however, I'm struggling to work around a TypeError: Index does not support mutable operations.
Below are my two files:
Fund_Data.xlsx:
FUND ID FUND NAME AMOUNT client code Price description Trade Date Trade Datetime
0 10101 Holdings company A 10000.5 1234 124.3 abcd 2020-08-19 2020-08-19 12:30:00
1 20202 Holdings company B -2000.5 192 -24.2 abcd 2020-08-20 2020-08-20 12:30:00
2 30303 Holdings company C 3000.5 123 192 NaN 2020-08-21 2020-08-21 12:30:00
3 10101 Holdings company A 10000 1234567 5.5 NaN 2020-08-22 2020-08-22 12:30:00
4 20202 Holdings company B 10000.5 9999 3.887 abcd 2020-08-23 2020-08-23 12:30:00
Stocks.xlsx
ID STOCK VALUE
1 3i 100
2 Admiral Group 200
3 Anglo American 300
4 Antofagasta 100
5 Ashtead 200
6 Associated British Foods 300
7 AstraZeneca 400
8 Auto Trader Group 500
9 Avast 600
And here is my code so far:
import os
import pandas as pd

f = []
directory = 'C:/Users/rrai020/Documents/Python Scripts/DD'
for (dirpath, dirnames, filenames) in os.walk(directory):
    for x in filenames:
        if x.endswith('xlsx'):
            f.append(x)
# f = ['Fund_Data.xlsx', 'Stocks.xlsx'] created a list from filenames in directory ^^^

data = pd.DataFrame()  # initialize empty df
for filename in f:
    df = pd.read_excel(filename, dtype=object).columns  # read in each excel to df
    df['filename'] = filename  # add a column with the filename -- this line raises the TypeError
    data = data.append(df)  # add all small df's to big df
print(data)
I'm trying to achieve the following output (or similar):
Field Name Filename
FUND ID Fund_Data.xlsx
FUND NAME Fund_Data.xlsx
AMOUNT Fund_Data.xlsx
client code Fund_Data.xlsx
Price Fund_Data.xlsx
description Fund_Data.xlsx
Trade Date Fund_Data.xlsx
Trade Datetime Fund_Data.xlsx
ID Stocks.xlsx
STOCK Stocks.xlsx
VALUE Stocks.xlsx
I would like the code to be flexible so that it can work for more than the 2 files I have here. Apologies if this is trivial, I'm still learning!
The problem is with the dataframe that you're appending: pd.read_excel(...).columns is an Index, not a DataFrame, so assigning a new column to it raises the TypeError. We need to create a dataframe with Field Name and Filename columns for each file inside the loop, and then append that to data.
Here's an option:
data = pd.DataFrame()
for filename in f:
    # read in each excel file; only the column names are needed
    df = pd.read_excel(filename, dtype=object)
    # create a dataframe with (Field Name, Filename) columns for the current file
    x = pd.DataFrame({'Field Name': df.columns, 'Filename': filename})
    # append to the global dataframe
    data = data.append(x)
data
Output:
  Field Name      Filename
0 FUND ID         Fund_Data.xlsx
1 FUND NAME       Fund_Data.xlsx
2 AMOUNT          Fund_Data.xlsx
3 client code     Fund_Data.xlsx
4 Price           Fund_Data.xlsx
5 description     Fund_Data.xlsx
6 Trade Date      Fund_Data.xlsx
7 Trade Datetime  Fund_Data.xlsx
0 ID              Stocks.xlsx
1 STOCK           Stocks.xlsx
2 VALUE           Stocks.xlsx
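Note that DataFrame.append is deprecated (and removed in pandas 2.0); a sketch of the more idiomatic pattern is to collect the per-file frames in a list and concatenate once at the end:

frames = []
for filename in f:
    df = pd.read_excel(filename, dtype=object)
    frames.append(pd.DataFrame({'Field Name': df.columns, 'Filename': filename}))

# a single concat avoids repeatedly copying the growing dataframe
data = pd.concat(frames, ignore_index=True)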
I have a rather "cross-platform" question, spanning SQL and Python. I hope it is not too general.
One of my tables, say customers, consists of my customer id's and their associated demographic information. Another table, say transaction, contains all purchases from the customers in the respective shops.
I am interested in analyzing basket compositions together with demographics in Python. Hence, I would like to have the shops as columns, with the sum of each customer's purchases at each shop, in my dataframe.
For clarity,
select *
from customer
where id=1 or id=2
gives me
id age gender
1 35 MALE
2 57 FEMALE
and
select *
from transaction
where id=1 or id=2
gives me
customer_id shop amount
1 2 250
1 2 500
2 3 100
2 7 200
2 11 125
Which should end up in a (preferably) Pandas dataframe as
id age gender shop_2 shop_3 shop_7 shop_11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
Such that the last columns are the aggregated baskets of the customers.
I have tried to create a python dictionary of the purchases and amounts for each customer in SQL in the following way:
select customer_id, array_agg(concat(cast(shop as varchar), ' : ', cast(amount as varchar))) as basket
from transaction
group by customer_id
Resulting in
id basket
1 ['2 : 250', '2 : 500']
2 ['3 : 100', '7 : 200', '11 : 125']
which could easily be joined on the customer table.
However, this solution is not optimal, since the values inside the [] are strings rather than integers. Hence, it involves a lot of manipulation and looping in Python to get it into the format I want.
Is there any way where I can aggregate the purchases in SQL making it easier for python to read and aggregate into columns?
One simple solution would be to do the aggregation in pandas using pivot_table on the second dataframe and then merge with the first:
df2 = df2.pivot_table(columns='shop', values='amount', index='customer_id', aggfunc='sum', fill_value=0.0).reset_index()
df = pd.merge(df1, df2, left_on='id', right_on='customer_id')
Resulting dataframe:
id age gender 2 3 7 11
1 35 MALE 750 0 0 0
2 57 FEMALE 0 100 200 125
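If you want the shop_2, shop_3, ... column names from the question rather than the bare shop numbers, one option (a sketch) is to prefix the pivoted columns before resetting the index:

df2 = (df2.pivot_table(columns='shop', values='amount', index='customer_id',
                       aggfunc='sum', fill_value=0.0)
          .add_prefix('shop_')
          .reset_index())
df = pd.merge(df1, df2, left_on='id', right_on='customer_id').drop(columns='customer_id')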
I have tried to find a solution to this but have failed
I have my master df with transactional data and specifically credit card names:
transactionId, amount, type, person
1 -30 Visa john
2 -100 Visa Premium john
3 -12 Mastercard jenny
I am grouping by person and then aggregating by number of records and amount.
person numbTrans Amount
john 2 -130
jenny 1 -12
This is fine, but I need to add the dimension of credit card type to my df.
I have grouped a df of the credit cards in use:
index CreditCardName
0 Visa
1 Visa Premium
2 Mastercard
So, what I can't work out is how to create a new column in my master dataframe called 'CreditCardId', which uses the credit card name ('Visa'/'Visa Premium'/'Mastercard') to pull in the index from the grouped df:
transactionId, amount, type, CreditCardId, person
1 -30 Visa 0 john
2 -100 Visa Premium 1 john
3 -12 Mastercard 2 jenny
I need this as I am doing some simple k-means clustering, which requires ints, not strings (or at least I think it does).
Thanks in advance,
Rob
If you set 'CreditCardName' as the index of the second df, then you can just call map:
In [80]:
# set up dummy data
import io
import pandas as pd

temp = """transactionId,amount,type,person
1,-30,Visa,john
2,-100,Visa Premium,john
3,-12,Mastercard,jenny"""
temp1 = """index,CreditCardName
0,Visa
1,Visa Premium
2,Mastercard"""
df = pd.read_csv(io.StringIO(temp))
# crucially, set the index column to be the credit card name
df1 = pd.read_csv(io.StringIO(temp1), index_col=[1])
df
Out[80]:
transactionId amount type person
0 1 -30 Visa john
1 2 -100 Visa Premium john
2 3 -12 Mastercard jenny
In [81]:
df1
Out[81]:
index
CreditCardName
Visa 0
Visa Premium 1
Mastercard 2
In [82]:
# now we can call map passing the series, naturally the map will align on index and return the index value for our new column
df['CreditCardId'] = df['type'].map(df1['index'])
df
Out[82]:
transactionId amount type person CreditCardId
0 1 -30 Visa john 0
1 2 -100 Visa Premium john 1
2 3 -12 Mastercard jenny 2
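As an aside, if any integer encoding will do for the k-means step and you don't care which id maps to which card, pd.factorize produces the codes directly, without building a lookup table (a sketch):

# factorize returns (codes, uniques); codes are 0-based ints in order of appearance
df['CreditCardId'], card_names = pd.factorize(df['type'])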