I am trying to open this CSV file and then parse the data into columns, but the way the data comes in is causing me problems. When I run a Python script, every line comes back as a single string enclosed in a list, like [' DATA HERE ']. I want to parse the data into columns like 'Account #', 'Service Address', 'City', etc., matching the column names already shown below. The structure is awkward because the column headers are stacked on two lines: for example, the header 'Account #' has a second header, 'Rate Code', directly below it. I'm not sure of the best way to go about this and would like some input from the experts.
Python Script
import csv
with open('C:/Users/DEMO/Documents/statement-9-28-18.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    for line in csv_reader:
        print(line)
Result
[' XYZ COMPANY DATE : 09/28/18 ']
[' PAGE : 1 ']
[' ELECTRIC BILL STATEMENT ']
[' ']
[' CUSTOMER NAME: XYZ CUSTOMER SUMMARY BILL NUMBER: 12345-67890 IF YOU HAVE ANY QUESTIONS, ']
[' CUSTOMER NUMBER: 1111111 PLEASE CONTACT: ']
[' MAILING ADDRESS: 4122 RICHARDSON ST ']
[' BILLING DATE: 09/28/18 SUMB#XYZ.COM45 ']
[' SANFORD FL 32771 PAST DUE DATE: 10/09/18 (305)333-3333 ']
[' ']
[' ']
[' READ SVC B MAXIMUM TOTAL DUE METER NO REMARKS ']
[' ACCOUNT # SERVICE ADDRESS CITY DATE DAY C KWH KWD AMOUNT ']
[' RATE CODE CY CUSTOMER NAME MAILING ADDRESS ']
[' ---------------------------------------------------------------------------------------------------------------------------------- ']
[' 11111-22222 485 JOHNSON AVE APT 1405 MIAMI 09/26/18 28 C 140 29.11 BAT0123 ']
[' RS-1 XYZ COMPANY 485 JOHNSON AVE ']
[' ']
[' 22222-33333 485 JOHNSON AVE APT 3541 MIAMI 09/26/18 28 C 130 28.08 BAT0123 ']
[' RS-1 XYZ COMPANY 485 JOHNSON AVE ']
[' ']
[' 33333-44444 485 JOHNSON AVE APT 4544 MIAMI 09/26/18 28 C 172 32.42 BAT0123 ']
[' RS-1 XYZ COMPANY 485 JOHNSON AVE ']
[' ']
[' 55555-66666 485 JOHNSON ST AVE APT 1111 MIAMI 09/26/18 28 C 243 39.81 BAT0123 ']
[' RS-1 XYZ COMPANY 485 JOHNSON AVE ']
Question: I want to parse the data into columns
Note: The simple regex (\w+) also splits on '-', '/' and '.', so values such as 11111-22222, 09/26/18 and 29.11 come back in pieces. Expanding the regex to your needs avoids this.
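import re

rc = re.compile(r'(\w+)')
with open('C:/Users/DEMO/Documents/statement-9-28-18.csv', 'r') as itxt:
    # Iterate over the file line by line; itxt.readline() would return only
    # the first line, and the loop would then walk over its characters.
    for n, line in enumerate(itxt, 1):
        # Rows 13 and 14 hold the header
        if n in [13, 14]:
            findall = rc.findall(line)
            print("{}".format(findall))
        # Data starts at row 16; every third row is blank
        if n >= 16 and n % 3 > 0:
            findall = rc.findall(line)
            print("{}".format(findall))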
Output:
['ACCOUNT', 'SERVICE', 'ADDRESS', 'CITY', 'DATE', 'DAY', 'C', 'KWH', 'KWD', 'AMOUNT']
['RATE', 'CODE', 'CY', 'CUSTOMER', 'NAME', 'MAILING', 'ADDRESS']
['11111', '22222', '485', 'JOHNSON', 'AVE', 'APT', '1405', 'MIAMI', '09', '26', '18', '28', 'C', '140', '29', '11', 'BAT0123']
['RS', '1', 'XYZ', 'COMPANY', '485', 'JOHNSON', 'AVE']
['22222', '33333', '485', 'JOHNSON', 'AVE', 'APT', '3541', 'MIAMI', '09', '26', '18', '28', 'C', '130', '28', '08', 'BAT0123']
['RS', '1', 'XYZ', 'COMPANY', '485', 'JOHNSON', 'AVE']
['33333', '44444', '485', 'JOHNSON', 'AVE', 'APT', '4544', 'MIAMI', '09', '26', '18', '28', 'C', '172', '32', '42', 'BAT0123']
['RS', '1', 'XYZ', 'COMPANY', '485', 'JOHNSON', 'AVE']
['55555', '66666', '485', 'JOHNSON', 'ST', 'AVE', 'APT', '1111', 'MIAMI', '09', '26', '18', '28', 'C', '243', '39', '81', 'BAT0123']
['RS', '1', 'XYZ', 'COMPANY', '485', 'JOHNSON', 'AVE']
Tested with Python: 3.4.2
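As a rough sketch of what "expanding the regex" could look like, the example below keeps account numbers, dates and amounts intact and pairs each account line with the rate-code line that follows it. The sample lines are copied from the printed result above; the field names (address_city, name_mailing and so on) are made up for illustration. Because the whitespace in the printed result is collapsed, the address and city cannot be separated reliably here, so they are kept together, and the pattern assumes the KWD column is empty as it is in every sample row.

```python
import re

# Sample lines copied from the question's printed result (whitespace collapsed).
lines = [
    " 11111-22222 485 JOHNSON AVE APT 1405 MIAMI 09/26/18 28 C 140 29.11 BAT0123 ",
    " RS-1 XYZ COMPANY 485 JOHNSON AVE ",
    " ",
    " 22222-33333 485 JOHNSON AVE APT 3541 MIAMI 09/26/18 28 C 130 28.08 BAT0123 ",
    " RS-1 XYZ COMPANY 485 JOHNSON AVE ",
]

# Expanded token regex: word characters plus '#', '.', '/' and '-'.
token = re.compile(r"[\w#./-]+")

# First line of a record. The field names are hypothetical, chosen to mirror
# the report header; the free-text address/city part is anchored by the date.
account_row = re.compile(
    r"^\s*(?P<account>\d+-\d+)\s+(?P<address_city>.+?)\s+"
    r"(?P<read_date>\d{2}/\d{2}/\d{2})\s+(?P<days>\d+)\s+(?P<svc>\w)\s+"
    r"(?P<kwh>\d+)\s+(?P<amount>\d+\.\d{2})\s+(?P<meter>\S+)\s*$"
)

records = []
for line in lines:
    match = account_row.match(line)
    if match:
        # Start a new record from the account line.
        records.append(match.groupdict())
    elif line.strip() and records:
        # Second line of a record: rate code, then customer name and mailing address.
        rate_code, _, rest = line.strip().partition(" ")
        records[-1]["rate_code"] = rate_code
        records[-1]["name_mailing"] = rest

for record in records:
    print(record)
```

If the original file is truly fixed-width, slicing each line by column position would likely be more robust than a regex, but the column offsets cannot be recovered from the output shown here.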
Related
I have a file called customer.CR that contains multiple rows of lists. It looks like this:
1361886|5303477|CR|WY|WY & NW RAILROAD CO|UNKNOWN||UNKNOWN|WY|00000|C|100.000000000|29|HOLDER|
1280535|5394419|CR|WY|CHAMBERS JERRY|7800 E UNION AVE # 1100||DENVER|CO|802372715|P|100.000000000|15|LESSEE|
1324915|5312567|CR|WY|EXXONMOBIL OIL CORP|PO BOX 650232||DALLAS|TX|752650232|C|100.000000000|15|LESSEE|
...
I want to convert this file into a pandas DataFrame that looks like this, but with many rows and 14 columns (only 2 columns are shown here, but I need all 14):
Column A  Column B
1361886   5303477
1280535   5394419
I ran the following command:
import csv

myfile = dbdir + '/Data/customer.CR.load'
with open(myfile, newline = '') as csvfile:
    csvreader = csv.reader(csvfile, delimiter='|')
    for row in csvreader:
        print(row)
I got the following result.
['1361886', '5303477', 'CR', 'WY', 'WY & NW RAILROAD CO', 'UNKNOWN', '', 'UNKNOWN', 'WY', '00000', 'C', '100.000000000', '29', 'HOLDER', '']
['1280535', '5394419', 'CR', 'WY', 'CHAMBERS JERRY', '7800 E UNION AVE # 1100', '', 'DENVER', 'CO', '802372715', 'P', '100.000000000', '15', 'LESSEE', '']
...
...
...
['1324915', '5312567', 'CR', 'WY', 'EXXONMOBIL OIL CORP', 'PO BOX 650232', '', 'DALLAS', 'TX', '752650232', 'C', '100.000000000', '15', 'LESSEE', '']
['1353999', '5325242', 'CR', 'WY', 'ULTRA RESOURCES INC', '1550 WYNKOOP ST STE 300', '', 'DENVER', 'CO', '802021648', 'C', '20.000000000', '15', 'LESSEE', '']
Now I want to convert the rows to a pandas DataFrame. I wrote the following command:
df = pd.DataFrame([row], index=None)
df
But in the output, I get just one row. My question is: how should I change my code so that I get a DataFrame that has all the rows and columns? (Columns are separated by a vertical pipe '|' in the file.)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
> 0 1306746 5466867 CR WY HUBER EMERICK M 311 S CONWELL ST CASPER WY 826012938 P 7.598157000 15 LESSEE
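The single row appears because `row` still holds only the last line once the loop has finished. One possible approach, sketched below, is to let pandas read the pipe-delimited file directly. The sample uses an in-memory copy of three rows from the question in place of the real file; `dtype=str` and `keep_default_na=False` are choices made here to preserve values such as '00000' and the empty fields, and the slice to 14 columns assumes the trailing '|' on every line produces one empty extra column.

```python
import io

import pandas as pd

# In-memory stand-in for customer.CR, using rows quoted in the question.
sample = io.StringIO(
    "1361886|5303477|CR|WY|WY & NW RAILROAD CO|UNKNOWN||UNKNOWN|WY|00000|C|100.000000000|29|HOLDER|\n"
    "1280535|5394419|CR|WY|CHAMBERS JERRY|7800 E UNION AVE # 1100||DENVER|CO|802372715|P|100.000000000|15|LESSEE|\n"
    "1324915|5312567|CR|WY|EXXONMOBIL OIL CORP|PO BOX 650232||DALLAS|TX|752650232|C|100.000000000|15|LESSEE|\n"
)

# header=None: the file has no header row, so columns are numbered 0, 1, 2, ...
# dtype=str keeps leading zeros; keep_default_na=False keeps '' instead of NaN.
df = pd.read_csv(sample, sep="|", header=None, dtype=str, keep_default_na=False)

# Each line ends with '|', which yields an empty 15th column; keep the first 14.
df = df.iloc[:, :14]

print(df.shape)
print(df.head())
```

With the real file, the path can be passed to `pd.read_csv` in place of `sample`. Staying with the `csv` module also works: collect every row with `rows = list(csvreader)` inside the `with` block and then call `pd.DataFrame(rows)`.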
I have a data frame that looks like this:
data = {'State': ['24', '24', '24',
'24','24','24','24','24','24','24','24','24'],
'County code': ['001', '001', '001',
'001','002','002','002','002','003','003','003','003'],
'TT code': ['123', '123', '123',
'123','124','124','124','124','125','125','125','125'],
'BLK code': ['221', '221', '221',
'221','222','222','222','222','223','223','223','223'],
'Age Code': ['1', '1', '2', '2','2','2','2','2','2','1','2','1']}
df = pd.DataFrame(data)
Essentially, I want to keep only the rows for TT codes where every Age Code is 2 and there are no 1's. So I just want the data frame where:
'State': ['24', '24', '24', '24'],
'County code': ['002','002','002','002',],
'TT code': ['124','124','124','124',],
'BLK code': ['222','222','222','222'],
'Age Code': ['2','2','2','2']
Is there a way to do this?
IIUC, you want to keep only the TT groups where there are only Age groups with value '2'?
You can use groupby.transform('all') on the boolean Series:
df[df['Age Code'].eq('2').groupby(df['TT code']).transform('all')]
output:
State County code TT code BLK code Age Code
4 24 002 124 222 2
5 24 002 124 222 2
6 24 002 124 222 2
7 24 002 124 222 2
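To make the one-liner easier to follow, here is the same idea split into its intermediate steps on a reduced copy of the question's data (only the two columns that matter for the filter). The variable names `is_two` and `keep` are introduced here for illustration.

```python
import pandas as pd

# Reduced copy of the question's data: only the columns used by the filter.
df = pd.DataFrame({
    'TT code': ['123', '123', '123', '123', '124', '124', '124', '124',
                '125', '125', '125', '125'],
    'Age Code': ['1', '1', '2', '2', '2', '2', '2', '2', '2', '1', '2', '1'],
})

# Step 1: True where the row's Age Code is the string '2'.
is_two = df['Age Code'].eq('2')

# Step 2: within each TT code group, broadcast "are all rows True?" back to
# every row of that group.
keep = is_two.groupby(df['TT code']).transform('all')

# Step 3: boolean indexing keeps only rows whose whole group passed.
result = df[keep]
print(result)
```

Groups 123 and 125 each contain at least one '1', so every row in them is dropped, not just the rows with '1'.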
This should work.
df111['Age Code'] = "2"
I am just wondering why strings were chosen for values that are integers.
I have a list like the one below, which I read from a CSV using Python:
list_FN = [[' Braund', ' Mr. Owen Harris ', '1'], [' Heikkinen', ' Miss. Laina ', '0'], [' Allen', ' Mr. William Henry ', '0'],....]
I want to use a regular expression to separate the title, first name, and last name throughout the list and store the data in a new list.
I am new to regular expressions and am having trouble finding a solution.
So far, what I have done is:
import csv

with open('/home/username/Desktop/FairDealCustomerData.csv', newline='') as f:
    reader = csv.reader(f)
    data1 = list(reader)
print(data1)
l1 = data1[0]
print(l1)
Output:
[[' Braund', ' Mr. Owen Harris ', '1'], [' Heikkinen', ' Miss. Laina ', '0'], [' Allen', ' Mr. William Henry ', '0'], [' Moran', ' Mr. James ', '0']...
[' Braund', ' Mr. Owen Harris ', '1']
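One way this could be approached, as a minimal sketch: treat the first field as the last name and use a regular expression on the second field to pull out the title (assumed here to be a word ending in a period, such as "Mr." or "Miss.") and the remaining given names. The pattern and the names `name_pattern` and `parsed` are illustrative assumptions based only on the three sample rows shown.

```python
import re

# First three rows from the question's list.
list_FN = [
    [' Braund', ' Mr. Owen Harris ', '1'],
    [' Heikkinen', ' Miss. Laina ', '0'],
    [' Allen', ' Mr. William Henry ', '0'],
]

# Title: letters followed by a period. First name(s): everything after it,
# with surrounding whitespace stripped by the pattern itself.
name_pattern = re.compile(r"^\s*(?P<title>[A-Za-z]+\.)\s+(?P<first>.*?)\s*$")

parsed = []
for last, rest, _flag in list_FN:
    match = name_pattern.match(rest)
    if match is None:
        # Row does not follow the assumed "Title. First names" layout; skip it.
        continue
    parsed.append([match.group('title'), match.group('first'), last.strip()])

print(parsed)
```

Rows whose second field does not start with a period-terminated title are skipped in this sketch; real data may need a broader pattern.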
A=[['1'], ['2'], ['3'], ['4'], ['5'], ['6'], ['7'], ['8'], ['9'], ['10'], ['11'], ['12'], ['13'], ['14'], ['15'], ['16'], ['17'], ['18'], ['19'], ['20'], ['21'], ['22'], ['23'], ['24'], ['25'], ['26'], ['27'], ['29'], ['30'], ['31'], ['32'], ['33'], ['34'], ['35'], ['36']]
B=[['Andaman and Nicobar Islands', ' '], ['Andhra Pradesh'], ['Arunachal Pradesh'], ['Assam'], ['Bihar'], ['Chandigarh', ' '], ['Chhattisgarh'], ['Dadra and Nagar Haveli', ' '], ['Daman and Diu', ' '], ['National Capital Territory of Delhi', ' '], ['Goa'], ['Gujarat'], ['Haryana'], ['Himachal Pradesh'], ['Jammu and Kashmir'], ['Jharkhand'], ['Karnataka'], ['Kerala'], ['Lakshadweep', ' '], ['Madhya Pradesh'], ['Maharashtra'], ['Manipur'], ['Meghalaya'], ['Mizoram'], ['Nagaland'], ['Odisha'], ['Puducherry', ' '], ['Rajasthan'], ['Sikkim'], ['Tamil Nadu'], ['Telangana'], ['Tripura'], ['Uttar Pradesh'], ['Uttarakhand'], ['West Bengal']]
C=[['Port Blair'], ['Hyderabad', ' ', '(', 'de jure', ' to 2024)', '\n', 'Amaravati', ' ', '(', 'de facto', ' from 2017)', '[3]', ' ', '[4]', ' ', '[a]'], ['Itanagar'], ['Dispur'], ['Patna'], ['Chandigarh', '[c]'], ['Naya Raipur', '[d]'], ['Silvassa'], ['Daman'], ['New Delhi'], ['Panaji', '[e]'], ['Gandhinagar'], ['Chandigarh'], ['Shimla', '\n', 'Dharamshala', ' (W/2nd)', '[8]', '\n'], ['Srinagar', '\xa0(Summer)', '\n', 'Jammu', '\xa0(Winter)'], ['Ranchi'], ['Bengaluru'], ['Thiruvananthapuram'], ['Kavaratti'], ['Bhopal'], ['Mumbai', '[g]', '\n', 'Nagpur', '\xa0(W/2nd)', '[h]'], ['Imphal'], ['Shillong'], ['Aizawl'], ['Kohima'], ['Bhubaneswar'], ['Puducherry'], ['Jaipur'], ['Gangtok', '[j]'], ['Chennai', '[k]'], ['Hyderabad', '[l]'], ['Agartala'], ['Lucknow'], ['Dehradun', '[m]'], ['Kolkata']]
I have the above three lists and I want to convert them to a pandas DataFrame in the following format:
Numbers State/UT Capital
1 Andaman and Nicobar Islands Port Blair
2 Andhra Pradesh Hyderabad
You can use itertools and zip to help with this:
from itertools import chain
import pandas as pd
df = pd.DataFrame({'Numbers': list(chain.from_iterable(A)),
                   'State/UT Capital': [' '.join([i[0], j[0]]) for i, j in zip(B, C)]})
Result:
Numbers State/UT Capital
0 1 Andaman and Nicobar Islands Port Blair
1 2 Andhra Pradesh Hyderabad
2 3 Arunachal Pradesh Itanagar
3 4 Assam Dispur
.........
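If the desired layout is read as three separate columns (Numbers, State/UT, Capital) instead of two, a variant along the following lines might fit better. It is only a sketch on the first three entries of each list, and it assumes, like the answer above, that the first element of every inner list is the value of interest and that the three lists have the same length.

```python
import pandas as pd

# First three entries of each list from the question.
A = [['1'], ['2'], ['3']]
B = [['Andaman and Nicobar Islands', ' '], ['Andhra Pradesh'], ['Arunachal Pradesh']]
C = [['Port Blair'],
     ['Hyderabad', ' ', '(', 'de jure', ' to 2024)'],
     ['Itanagar']]

# Take the first element of each inner list; strip() guards against stray spaces.
df = pd.DataFrame({
    'Numbers': [a[0] for a in A],
    'State/UT': [b[0].strip() for b in B],
    'Capital': [c[0].strip() for c in C],
})

print(df)
```

Taking only the first element drops secondary capitals and footnote markers such as '[c]', which matches the requested output but loses that information.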
I have the following list of strings:
data = ['1 General Electric (GE) 24581660 $18.19 0.04 0.22 ',
'2 Qudian ADR (QD) 24227349 12.22 -3.93 -24.33 ',
'3 Square Cl A (SQ) 16233308 48.86 0.05 0.10 ',
'4 Teva Pharmaceutical Industries ADR (TEVA) 15830425 13.70 0.22 1.63 ',
'5 Vale ADR (VALE) 14768221 10.98 0.21 1.95 ',
'6 Bank of America (BAC) 13938799 26.59 -0.07 -0.26 ',
'7 Entercom Communications Cl A (ETM) 13087209 12.00 0.10 0.84 ',
'8 Chesapeake Energy (CHK) 12948648 3.92 -0.05 -1.26 ',
"9 Macy's (M) 12684478 21.07 0.44 2.13 "]
Where the format of every string is: count, stock name, volume, followed by a few more numeric values.
I need to split these strings into a list where each element is one of the items in the string format above, and this is how I attempted to do that:
for i in range(1, len(data)-1):
    split = data[i].split()
    temp = "{} {} {}".format(split[1], split[2], split[3])
    del split[2 : 4]
    split[1] = temp
    print(split)
However, I believe this is inefficient and it doesn't work when the name is more or less than two words. How would I handle this? Would I have to adjust how I generate the list of strings (data) in the first place?
EDIT:
final_data = [
    re.split('(?<=\))\s+|(?<=[\d\$-])\s(?=[\d\$-])|(?<=\d)\s(?=[a-zA-Z])', i)
    for i in data[1]]
final_data = [i[:-1]+[i[-1][:-1]] for i in final_data]
print(final_data)
Output:
~/workspace $ python extract.py 2017-11-27-04-26-51-ss.xhtml
[[''],
[''],
[''],
...,
[''],
[''],
['']]
You can use re.split:
import re
data = ['1 General Electric (GE) 24581660 $18.19 0.04 0.22 ', '2 Qudian ADR (QD) 24227349 12.22 -3.93 -24.33 ', '3 Square Cl A (SQ) 16233308 48.86 0.05 0.10 ', '4 Teva Pharmaceutical Industries ADR (TEVA) 15830425 13.70 0.22 1.63 ', '5 Vale ADR (VALE) 14768221 10.98 0.21 1.95 ', '6 Bank of America (BAC) 13938799 26.59 -0.07 -0.26 ', '7 Entercom Communications Cl A (ETM) 13087209 12.00 0.10 0.84 ', '8 Chesapeake Energy (CHK) 12948648 3.92 -0.05 -1.26 ', "9 Macy's (M) 12684478 21.07 0.44 2.13 "]
final_data = [re.split(r'(?<=[a-zA-Z])\s+(?=\()|(?<=\))\s+|(?<=[\d\$-])\s+(?=[\d\$-])|(?<=\d)\s+(?=[a-zA-Z])', i) for i in data]
Output:
[['1', 'General Electric', '(GE)', '24581660', '$18.19', '0.04', '0.22 '], ['2', 'Qudian ADR', '(QD)', '24227349', '12.22', '-3.93', '-24.33 '], ['3', 'Square Cl A', '(SQ)', '16233308', '48.86', '0.05', '0.10 '], ['4', 'Teva Pharmaceutical Industries ADR', '(TEVA)', '15830425', '13.70', '0.22', '1.63 '], ['5', 'Vale ADR', '(VALE)', '14768221', '10.98', '0.21', '1.95 '], ['6', 'Bank of America', '(BAC)', '13938799', '26.59', '-0.07', '-0.26 '], ['7', 'Entercom Communications Cl A', '(ETM)', '13087209', '12.00', '0.10', '0.84 '], ['8', 'Chesapeake Energy', '(CHK)', '12948648', '3.92', '-0.05', '-1.26 '], ['9', "Macy's", '(M)', '12684478', '21.07', '0.44', '2.13 ']]
With the parentheses removed:
final_data = [[b[1:-1] if b.startswith('(') and b.endswith(')') else b for b in i] for i in final_data]
Output:
[['1', 'General Electric', 'GE', '24581660', '$18.19', '0.04', '0.22 '], ['2', 'Qudian ADR', 'QD', '24227349', '12.22', '-3.93', '-24.33 '], ['3', 'Square Cl A', 'SQ', '16233308', '48.86', '0.05', '0.10 '], ['4', 'Teva Pharmaceutical Industries ADR', 'TEVA', '15830425', '13.70', '0.22', '1.63 '], ['5', 'Vale ADR', 'VALE', '14768221', '10.98', '0.21', '1.95 '], ['6', 'Bank of America', 'BAC', '13938799', '26.59', '-0.07', '-0.26 '], ['7', 'Entercom Communications Cl A', 'ETM', '13087209', '12.00', '0.10', '0.84 '], ['8', 'Chesapeake Energy', 'CHK', '12948648', '3.92', '-0.05', '-1.26 '], ['9', "Macy's", 'M', '12684478', '21.07', '0.44', '2.13 ']]
You can split strings on characters.
All of the strings in your original data list have two sections: the stock name and then the numeric values. If you split each string on the closing parenthesis, you get a list holding one string for the stock name and one string containing the numbers. The numbers are consistently separated by a single space, so you can then split that string on the space character.
https://docs.python.org/3/library/stdtypes.html#str.split
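The split-on-parenthesis idea can be sketched without regular expressions, using `str.partition`, `str.rpartition` and `str.split`. This is one possible reading of the suggestion rather than the answerer's own code; it assumes every string contains exactly one "(TICKER)" group and starts with the count followed by a space.

```python
# Three of the strings from the question.
data = [
    '1 General Electric (GE) 24581660 $18.19 0.04 0.22 ',
    '4 Teva Pharmaceutical Industries ADR (TEVA) 15830425 13.70 0.22 1.63 ',
    "9 Macy's (M) 12684478 21.07 0.44 2.13 ",
]

rows = []
for line in data:
    # Everything before ')' is "count name (TICKER"; everything after is numbers.
    head, _, numbers = line.partition(')')
    # Split the ticker off at the last ' (' in the head.
    count_and_name, _, ticker = head.rpartition(' (')
    # The count is the first space-separated token; the rest is the name.
    count, _, name = count_and_name.partition(' ')
    # split() with no argument also discards the trailing space.
    rows.append([count, name, ticker] + numbers.split())

for row in rows:
    print(row)
```

Because the name is taken as "everything between the count and the ticker", it works for names of any length, which was the limitation of the original attempt.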