I need to convert a huge number of files in a structured text format into Excel (CSV would work) so I can merge them with some other data I have.
Here is a sample of the text:
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: NORTHQUEST CAPITAL FUND INC
CENTRAL INDEX KEY: 0001142728
IRS NUMBER: 223772454
STATE OF INCORPORATION: NJ
FISCAL YEAR END: 1231
FILING VALUES:
FORM TYPE: NSAR-A
SEC ACT: 1940 Act
SEC FILE NUMBER: 811-10419
FILM NUMBER: 03805344
BUSINESS ADDRESS:
STREET 1: 16 RIMWOOD LANE
CITY: COLTS NECK
STATE: NJ
ZIP: 07722
BUSINESS PHONE: 7328423504
FORMER COMPANY:
FORMER CONFORMED NAME: NORTHPOINT CAPITAL FUND INC
DATE OF NAME CHANGE: 20010615
</SEC-HEADER>
<DOCUMENT>
<TYPE>NSAR-A
<SEQUENCE>1
<FILENAME>answer.fil
<DESCRIPTION>ANSWER.FIL
<TEXT>
<PAGE> PAGE 1
000 A000000 06/30/2003
000 C000000 0001142728
000 D000000 N
000 E000000 NF
000 F000000 Y
000 G000000 N
000 H000000 N
000 I000000 6.1
000 J000000 A
001 A000000 NORTHQUEST CAPITAL FUND, INC.
001 B000000 811-10493
001 C000000 7328921057
002 A000000 16 RIMWOOD LANE
002 B000000 COLTS NECK
002 C000000 NJ
002 D010000 07722
003 000000 N
004 000000 N
005 000000 N
006 000000 N
007 A000000 N
007 B000000 0
007 C010100 1
007 C010200 2
007 C010300 3
007 C010400 4
007 C010500 5
007 C010600 6
007 C010700 7
007 C010800 8
007 C010900 9
007 C011000 10
008 A000001 EMERALD RESEARCH CORP.
008 B000001 A
008 C000001 801-60455
008 D010001 BRICK
008 D020001 NJ
008 D030001 08724
013 A000001 SANVILLE & COMPANY
013 B010001 ABINGTON
013 B020001 PA
013 B030001 19001
015 A000001 FLEET BANK
015 B000001 C
015 C010001 POINT PLEASANT BEACH
015 C020001 NJ
015 C030001 08742
015 E030001 X
018 000000 Y
019 A000000 N
019 B000000 0
<PAGE> PAGE 2
020 A000001 SCHWAB
020 B000001 94-1737782
020 C000001 0
020 A000002 BESTVEST BROOKERAGE
020 B000002 23-1452837
020 C000002 0
and it continues through page 8 with the same structure.
The company-header information should go into its own columns; for the rest, the first two fields on each line should together form the column name, and the third field should be the row's value.
I was trying to work it out with pyparsing but haven't managed to do so successfully.
Any comment on the approach would be helpful.
The way you describe it, these are like key:value pairs for each file. I would handle the parsing part like this:
import sys
import re
import csv

# two kinds of lines: "KEY: VALUE" headers and "NNN CODE VALUE" fixed fields
colonseparated = re.compile(r' *(.+) *: *(.+) *')
fixedfields = re.compile(r'(\d{3} +\w{6,7}) +(.*)')  # codes are 6 or 7 chars (e.g. "003  000000")
matchers = [colonseparated, fixedfields]

with open('out.csv', 'w', newline='') as out:
    outfile = csv.writer(out)
    outfile.writerow(['Filename', 'Key', 'Value'])
    for filename in sys.argv[1:]:
        with open(filename) as infile:
            for line in infile:
                line = line.strip()
                for matcher in matchers:
                    match = matcher.match(line)
                    if match:
                        outfile.writerow([filename] + list(match.groups()))
                        break  # first matching pattern wins; don't write the line twice
You can save this as something like parser.py and run it with python parser.py *.infile (or whatever your filename convention is). It will create a CSV file with three columns: a filename, a key and a value. You can open this in Excel and then use a pivot table to get the values into the correct format.
Alternatively you can use this:
import csv

headers = []
rows = {}
filenames = []

with open('out.csv') as f_in, open('flat.csv', 'w', newline='') as f_out:
    infile = csv.reader(f_in)
    outfile = csv.writer(f_out)
    next(infile)  # skip the header row
    for filename, key, value in infile:
        if filename not in rows:
            rows[filename] = {}
            filenames.append(filename)
        if key not in headers:
            headers.append(key)
        rows[filename][key] = value
    outfile.writerow(['Filename'] + headers)  # keep the filename so rows stay identifiable
    for filename in filenames:
        outfile.writerow([filename] + [rows[filename].get(header, '') for header in headers])
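If pandas is available, the second script can also be done as a pivot; a minimal sketch, assuming out.csv carries the Filename/Key/Value header written above:

import pandas as pd

# read the three-column file produced by parser.py
df = pd.read_csv('out.csv', dtype=str)
# one row per file, one column per key; 'first' keeps the first value
# if a key repeats within a file
flat = df.pivot_table(index='Filename', columns='Key',
                      values='Value', aggfunc='first')
flat.to_csv('flat.csv')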
Related
I am trying to extract, from a URL string like "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte...",
the entire Make name, i.e. "Mercedes-Benz",
BUT my pattern only returns the first letter, i.e. "M".
Please help me come up with the correct pattern to use on a pandas df.
Thank you
CODE:
URLS_by_City['Make'] = URLS_by_City['Page'].str.extract('.+([A-Z])\w+(?=[\/])+', expand=True)
Clean_Make = URLS_by_City.dropna(subset=["Make"])
Clean_Make  # WENT FROM 5K rows --> to 2688 rows
Page City Pageviews Unique Pageviews Avg. Time on Page Entrances Bounce Rate % Exit **Make**
71 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Jose 310 149 00:00:27 149 2.00% 47.74% **B**
103 /used/Audi/2015-Audi-SQ5-286f67180a0e09a872992... Menlo Park 250 87 00:02:36 82 0.00% 32.40% **A**
158 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... San Francisco 202 98 00:00:18 98 2.04% 48.02% **B**
165 /used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cf... San Francisco 194 93 00:00:42 44 2.22% 29.38% **A**
168 /used/Mercedes-Benz/2021-Mercedes-Benz-Sprinte... (not set) 192 91 00:00:11 91 2.20% 47.40% **B**
... ... ... ... ... ... ... ... ... ...
4995 /used/Subaru/2019-Subaru-Crosstrek-5717b3040a0... Union City 10 3 00:02:02 0 0.00% 30.00% **S**
4996 /used/Tesla/2017-Tesla-Model+S-15605a190a0e087... San Jose 10 5 00:01:29 5 0.00% 50.00% **T**
4997 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Las Vegas 10 4 00:00:09 2 0.00% 40.00% **T**
4998 /used/Tesla/2018-Tesla-Model+3-0f3ea14d0a0e09a... Austin 10 4 00:03:29 2 0.00% 40.00% **T**
4999 /used/Tesla/2018-Tesla-Model+3-5f29cdc70a0e09a... Orinda 10 4 00:04:00 1 0.00% 0.00% **T**
TRIED:
example_url = "/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1"
pattern = ".+([a-zA-Z0-9()])\w+(?=[/])+"
wanted_make = URLS_by_City['Page'].str.extract(pattern)
wanted_make
0
0 r
1 r
2 NaN
3 NaN
4 r
... ...
4995 r
4996 l
4997 l
4998 l
4999 l
It worked in an online regex tool, but unfortunately not in my Jupyter notebook.
EXAMPLE PAGES - the make segment (Mercedes-Benz, Audi, LEXUS, ...) is what should match:
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2020-Audi-S8-c6df09610a0e09af26b5cff998e0f96e.htm
/used/Mercedes-Benz/2021-Mercedes-Benz-Sprinter+2500-9f3d32130a0e09af63592c3c48ac5c24.htm?store_code=AudiOakland&ads_adgroup=139456079219&ads_adid=611973748445&ads_digadprovider=adpearance&adpdevice=m&campaign_id=17820707224&adpprov=1
/used/Audi/2021-Audi-RS+5-b92922bd0a0e09a91b4e6e9a29f63e8f.htm
/used/LEXUS/2018-LEXUS-GS+350-dffb145e0a0e09716bd5de4955662450.htm
/used/Porsche/2014-Porsche-Boxster-0423401a0a0e09a9358a179195e076a9.htm
/used/Audi/2014-Audi-A6-1792929d0a0e09b11bc7e218a1fa7563.htm
/used/Honda/2018-Honda-Civic-8e664dd50a0e0a9a43aacb6d1ab64d28.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/used-inventory/index.htm
/new-inventory/index.htm
/new-inventory/index.htm?normalFuelType=Hybrid&normalFuelType=Electric
/
I have tried completing your requirement in a Jupyter notebook. Here is the code:
I created a dummy pandas dataframe (data_df) holding the sample URLs.
I then created a pattern based on the structure of the string to be extracted:
pattern = r"^/used/(.*)/(?=20[0-9]{2})"
I used the pattern to extract the required data from the URLs and saved it in another column of the same dataframe:
data_df['Car Maker'] = data_df['urls'].str.extract(pattern)
I hope this is helpful.
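Since the screenshots aren't reproduced here, a minimal sketch of the dummy dataframe and the result (URLs taken from the question's examples):

import pandas as pd

data_df = pd.DataFrame({'urls': [
    '/used/Audi/2014-Audi-A6-1792929d0a0e09b11bc7e218a1fa7563.htm',
    '/used/LEXUS/2018-LEXUS-GS+350-dffb145e0a0e09716bd5de4955662450.htm',
    '/new-inventory/index.htm',
]})
pattern = r"^/used/(.*)/(?=20[0-9]{2})"
data_df['Car Maker'] = data_df['urls'].str.extract(pattern)
print(data_df['Car Maker'].tolist())  # ['Audi', 'LEXUS', nan]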
I would use:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'([^/]+)/\d{4}\b')
This targets the URL path segment immediately before the portion with the year. You could also try this version:
URLS_by_City["Make"] = URLS_by_City["Page"].str.extract(r'/[^/]+/([^/]+)')
The code below will give you the Model and VIN values:
pattern2 = r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm'
pattern3 = r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm'
data_df['Model'] = data_df['urls'].str.extract(pattern2)
data_df['VIN'] = data_df['urls'].str.extract(pattern3)
Here is the output for one of the example URLs:
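(A minimal runnable sketch; the one-row dataframe below is an assumption, not your real data_df.)

import pandas as pd

data_df = pd.DataFrame({'urls': [
    '/used/Audi/2014-Audi-A6-1792929d0a0e09b11bc7e218a1fa7563.htm',
]})
pattern2 = r'^/used/[a-zA-Z\-]*/([0-9]{4}[a-zA-Z0-9\-+]*)-[a-z0-9]*\.htm'
pattern3 = r'^/used/[a-zA-Z\-]*/[0-9]{4}[a-zA-Z0-9\-+]*-([a-z0-9]*)\.htm'
data_df['Model'] = data_df['urls'].str.extract(pattern2)
data_df['VIN'] = data_df['urls'].str.extract(pattern3)
print(data_df[['Model', 'VIN']])
#           Model                               VIN
# 0  2014-Audi-A6  1792929d0a0e09b11bc7e218a1fa7563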
I have a dataset (df) with 2 columns of arbitrary length, and I need to split it up based on the value.
BUS                                                                             CODE
150 H.S.London-lon3 11£150 H.S.London-lon3 16£150 H.S.London-lon3 120           GERI
400 Airport Luton-ptr5 12£400 Airport Luton-ptr5 15£400 Airport Luton-ptr5 17   24£JTR
005 Plaza-cata-md6 08£005 Plaza-cata-md6 012£005 Plaza-cata-md6 18              78£TDE
I've been trying to split it to look like this:
bus  directions     zone  time  code  name
150  H.S.London     lon3  11    NaN   GERI
400  Airport Luton  ptr5  12    24    JTR
005  Plaza-cata     md6   08    78    TDE
So far I have tried to split it with patterns, but it isn't working, and I'm out of ideas for how else to split it.
bus = r'(?P<bus>[\d]+) (?P<direction>[\w\W]+)-(?P<zone>[\w]+)'
code = r'(?P<code>[\S]+)£(?P<name>\d+)'
df.BUS.str.extract(bus).join(df.CODE.str.extract(code))
I was wondering if anyone had a good solution to this.
You can use .str.extract with regex pattern containing named capturing groups:
code = r'^(?P<code>\d+)?.*?(?P<name>[A-Za-z]+)'
bus = r'^(?P<bus>\d+)\s(?P<directions>.*?)-(?P<zone>[^\-]+)\s(?P<time>\d+)'
df['BUS'].str.extract(bus).join(df['CODE'].str.extract(code))
bus directions zone time code name
0 150 H.S.London lon3 11 NaN GERI
1 400 Airport Luton ptr5 12 24 JTR
2 005 Plaza-cata md6 08 78 TDE
You could use split:
For your code column:
new_cols = ['code','name']
df[new_cols] = df.CODE.str.split(pat = '£', expand = True)
I'm sure you can find a way to do this for your first column, and if you end up with duplicates, remove them after splitting.
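One caveat on the sample data (a sketch, assuming rows like 'GERI' simply contain no '£'): split leaves the lone value in the first column, so it has to be shifted into name to match the desired output:

import pandas as pd

df = pd.DataFrame({'CODE': ['GERI', '24£JTR', '78£TDE']})
parts = df.CODE.str.split(pat='£', expand=True)
# rows without a '£' put their lone value in column 0;
# move it to the name slot so code stays missing
no_sep = parts[1].isna()
parts.loc[no_sep, 1] = parts.loc[no_sep, 0]
parts.loc[no_sep, 0] = float('nan')
df[['code', 'name']] = parts
print(df)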
I have a pandas dataframe of employees that I need to filter on 2 columns: department and level. Say we have the department 'Human Resources', which contains levels 1 through 5, and I'm specifically looking for Human Resources levels 2, 4 and 5.
I have my desired departments and levels stored in dictionary, for example:
departments = dict({'Human Resources' : ['2','4','5'] ,'IT' : ['1','3','5','6'], etc.... })
My dataframe lists every employee, across all departments and all levels (plus lots more). I now want to filter that dataframe using the dictionary above. So in the Human Resources example, I just want the employees who are in 'Human Resources' and are in levels 2, 4 and 5.
An example of the df would be:
employee_ID Department Level
001 Human Resources 1
002 Human Resources 1
003 Human Resources 2
004 Human Resources 3
005 Human Resources 4
006 Human Resources 4
007 Human Resources 5
008 IT 1
009 IT 2
010 IT 3
011 IT 4
012 IT 5
013 IT 6
Using the dictionary I've displayed above, my expected result would be
employee_ID Department Level
003 Human Resources 2
005 Human Resources 4
006 Human Resources 4
007 Human Resources 5
008 IT 1
010 IT 3
012 IT 5
013 IT 6
I have no idea how I'd do this.
You can use groupby on Departement and isin on Level, looking up each department's allowed levels via the group's name (x.name).
# example data
import pandas as pd

departments = dict({'Human Resources': ['2', '4', '5'], 'IT': ['1', '3', '5', '6']})
df = pd.DataFrame({'Id': range(10),
                   'Departement': ['Human Resources'] * 5 + ['IT'] * 5,
                   'Level': [str(i) for i in range(1, 6)] * 2})  # strings, to match the dict values

# filter
print(df[df.groupby('Departement')['Level']
         .apply(lambda x: x.isin(departments[x.name]))])
Id Departement Level
1 1 Human Resources 2
3 3 Human Resources 4
4 4 Human Resources 5
5 5 IT 1
7 7 IT 3
9 9 IT 5
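If you'd rather avoid groupby().apply, an equivalent boolean mask selects the same rows; a sketch reusing the df and departments defined above:

import pandas as pd

# start with an all-False mask, then OR in each department's allowed levels
mask = pd.Series(False, index=df.index)
for dept, levels in departments.items():
    mask |= (df['Departement'] == dept) & df['Level'].isin(levels)
print(df[mask])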
I have a pandas dataframe I would like to iterate over. For instance, a simplified version of my dataframe could be:
abc begin end ID Lat Long
def1 001 123 CAT 13.167 52.411
def2 002 129 DOG 13.685 52.532
def3 003 145 MOOSE 13.698 52.131
def1 004 355 CAT 13.220 52.064
def2 005 361 CAT 13.304 52.121
def3 006 399 DOG 12.020 52.277
def1 007 411 MOOSE 13.699 52.549
def2 008 470 MOOSE 11.011 52.723
I would like to iterate over each unique ID and create a (shapely) LineString from the matching Lat / Long columns.
grp = df.groupby('ID')
for x in grp.groups.items():
# this is where I need the most help
For the above example I would like to get three iterations with 3 LineStrings put back into a single dictionary.
{'CAT':LINESTRING (13.167 52.411, 13.22 52.064, 13.304 52.121), 'DOG':LINESTRING (13.685 52.532, 12.02 52.277), 'MOOSE':LINESTRING (13.698 52.131, 12.699 52.549, 13.011 52.723)}
I don't have shapely installed, but I guess you can easily convert what's in d into the format you need.
d = {}
df.groupby('ID').apply(lambda x: d.update({x.ID.iloc[0]:x[['Lat','Long']].values.tolist()}))
{'CAT': [[13.167, 52.411], [13.22, 52.064], [13.304, 52.121]],
'DOG': [[13.685, 52.532], [12.02, 52.277]],
'MOOSE': [[13.698, 52.131], [13.699, 52.549], [11.011, 52.723]]}
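If shapely is available, the LineString dictionary requested in the question falls out of the same groupby directly; a sketch (assumes every ID has at least the two points a LineString needs):

from shapely.geometry import LineString

# one LineString per ID, built from that group's Lat/Long pairs
lines = {name: LineString(list(zip(grp['Lat'], grp['Long'])))
         for name, grp in df.groupby('ID')}
print(lines['DOG'])  # LINESTRING (13.685 52.532, 12.02 52.277)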
I am new to pandas/Python. I have used Excel and Stata pretty extensively.
I get a .csv file with multiple tables in it from a supplier that will not change their format.
The tables have headers and a blank row in between them.
The number of rows in each table can vary
The number of tables also seems to vary (i just discovered!)
There are 23 possible tables that can come in the file
I have managed to create one big dataframe from the file
I can't seem to group it by column 0 (the Record Identifier)
Here is the code i have so far:
%matplotlib inline
import csv
from pandas import Series, DataFrame
import pandas as pd
# rough plan: if len(row) == 0, new_table_coming_up = 1;
# if len(row) > 0 and new_table_coming_up == 0, stay in the current table
import numpy as np
import matplotlib.pyplot as plt
import io
df = pd.read_csv(r'C:\Users\file.csv',names=range(25))
table_names = ["WAREHOUSE","SUPPLIER","PRODUCT","BRAND","INVENTORY","CUSTOMER","CONTACT","CHAIN","ROUTE","INVOICE","INVOICETRANS","SURVEY","FORECAST","PURCHASE","PURCHASETRANS","PRICINGMARKET","PRICINGMARKETCUSTOMER","PRICINGLINE","PRICINGLINEPRODUCT","EMPLOYEE"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,1]: g.iloc[0] for k,g in df.groupby(groups)}
here is a sample of the .csv file with the first 3 tables:
Record Identifier Sender ID Receiver ID Action Warehouse ID Warehouse Name System Close Date DBA Address Address 2 City State Postal Code Phone Fax Primary Contact Email FEIN DUNS GLN
WAREHOUSE COX SUPPLIERX Change 1 Richmond 20160127 Company 700 Court Anywhere CA 99999 5555555555 5555555555 na na 0 50682020
Record Identifier Sender ID Receiver ID Sender Supplier ID Supplier Name Supplier Family
SUPPLIER COX SUPPLIERX 16 SUPPLIERX SUPPLIERX
Record Identifier Sender ID Receiver ID Supplier Product Number Sender Product ID Product Name Sender Brand ID Active Cases Per Pallet Cases Per Layer Case GTIN Carrier GTIN Unit GTIN Package Name Case Weight Case Height Case Width Case Length Case Ounces Case Equivalents Retail Units Per Case Consumable Units Per Case Selling Unit Of Measure Container Material
PRODUCT COX SUPPLIERX 53030 LAG DOGTOWN PALE ALE 4/6/12OZ NR 217 Active 70 10 7.2383E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53071 LAG DOGTOWN PALE ALE 1/2 KEG 217 Active 8 8 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 2100008003 53122 LAG CAPPUCCINO STOUT 12/22OZ NR 221 Active 75 15 7.2383E+11 7.2383E+11 7.2383E+11 12/22oz NR 33.6 9.5 10.75 14.2083 264 0.916667 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53130 LAG SUCKS ALE 4/6/12OZ NR 1473 Active 70 10 7.23831E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53132 LAG SUCKS ALE 12/32oz NR 1473 Active 50 10 7.23831E+11 7.2383E+11 7.2383E+11 12/32oz NR 38.2 9.5 10.75 20.6667 384 1.333333 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53170 LAG SUCKS ALE 1/4 KEG 1473 Inactive 1 1 0 1.11111E+11 KEG-1/4 BBL 87.2 11.75 17 17 992 3.444444 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53171 LAG FARMHOUSE SAISON 1/2 KEG 1478 Inactive 16 1 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53172 LAG SUCKS ALE 1/2 KEG 1473 Active 80 4 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53255 LAG FARMHOUSE HOP STOOPID ALE 12/22 222 Active 75 15 7.23831E+11 7.2383E+11 7.2383E+11 12/22oz NR 33.6 9.5 10.75 14.2083 264 0.916667 12 12 Case Aluminum
PRODUCT COX SUPPLIERX 53271 LAG FARMHOUSE HOP STOOPID 1/2 KEG 222 Active 8 8 0 KEG-1/2 BBL 160.6 23.5 15.75 15.75 1984 6.888889 1 1 Each Aluminum
PRODUCT COX SUPPLIERX 53330 LAG CENSORED ALE 4/6/12OZ NR 218 Active 70 10 7.23831E+11 7.2383E+11 7.2383E+11 4/6/12oz NR 31.9 9.5 10.75 15.5 288 1 4 24 Case Aluminum
PRODUCT COX SUPPLIERX 53331 LAG CENSORED ALE 2/12/12 OZ NR 218 Inactive 60 1 7.2383E+11 7.2383E+11 7.2383E+11 2/12/12oz NR 31.9 9.5 10.75 15.5 288 1 2 24 Case Aluminum
PRODUCT COX SUPPLIERX 53333 LAG CENSORED ALE 24/12 OZ NR 218 Inactive 70 1 7.2383E+11 24/12oz NR 31.9 9.5 10.75 15.5 288 1 1 24 Case Aluminum
The first thing you need is simply to load your data cleanly. I'm going to assume your input file is tab-separated, even though your code doesn't specify that. This code works for me:
from io import StringIO

import pandas as pd

subfiles = [StringIO()]

with open('t.txt') as bigfile:
    for line in bigfile:
        if line.strip() == "":  # blank line: a new subfile starts
            subfiles.append(StringIO())
        else:  # continuation of the same subfile
            subfiles[-1].write(line)

for subfile in subfiles:
    subfile.seek(0)  # rewind before handing the buffer to pandas
    if not subfile.getvalue().strip():
        continue  # skip empty chunks (e.g. consecutive blank lines)
    table = pd.read_csv(subfile, sep='\t')
    print('*****************')
    print(table)
Basically what I do is to break up the original file into subfiles by looking for blank lines. Once that's done, reading the chunks with Pandas is straightforward, so long as you specify the correct sep character.
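To go a step further and key each chunk by its table name: in the sample, the first column of every data row is the Record Identifier (WAREHOUSE, SUPPLIER, PRODUCT, ...), so a dict of named tables can be built from the same subfiles list; a sketch, assuming each header really parses with a 'Record Identifier' column:

import pandas as pd

tables = {}
for subfile in subfiles:
    subfile.seek(0)
    if not subfile.getvalue().strip():
        continue
    t = pd.read_csv(subfile, sep='\t')
    # the Record Identifier column repeats the table name on every row
    tables[t['Record Identifier'].iloc[0]] = t
print(list(tables))  # e.g. ['WAREHOUSE', 'SUPPLIER', 'PRODUCT']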
This worked; I then used the slicer to create the tables:
df = pd.read_csv('filelocation.csv', delim_whitespace=True, names=range(25))
table_names=["WAREHOUSE","SUPPLIER","PRODUCT"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,1]: g.iloc[0] for k,g in df.groupby(groups)}