Python - Extracting text by Column Header from Given Row

I created a text file from multiple email messages.
Each of the three blocks below was written to the text file by a different email message and sender.
Cusip NAME Original Current Cashflow Collat Offering
362341D71 GSAA 2005-15 2A2 10,000 8,783 FCF 5/25 65.000
026932AC7 AHM 2007-1 GA1C 9,867 7,250 Spr Snr OA 56.250
Name O/F C/F Cpn FICO CAL WALB 60+ Notes Offer
CSMC 06-9 7A1 25.00 11.97 L+45 728 26 578 35.21 FLT,AS,0.0% 50-00
LXS 07-10H 2A1 68.26 34.01 L+16 744 6 125 33.98 SS,9.57% 39-00
CUSIP Name BID x Off SIZE C/E 60++ WAL ARM CFLW
86360KAA6 SAMI 06-AR3 11A1 57-00 x 59-00 73+MM 46.9% 67.0% 65 POA SSPT
86361HAQ7 SAMI 06-AR7 A12 19-08 x 21-08 32+MM 15.4% 61.1% 61 POA SRMEZ
For each 'Name', I need a way to pull out the price info (the data under the headers 'Offering', 'Offer', and 'Off'). This process will be replicated over the whole text file, and the extracted data ('Name' and 'Price') will be written to an Excel file via XLWT. Notice that the format of the price data varies by block.

The formatting makes this a little tricky, since the names can contain spaces, which makes csv difficult to use. One way around this is to use the header row to find the location and width of the columns you are interested in, via regex. You can try something like this:
import re

for email in emails:
    print(email)
    lines = email.split('\n')
    # Find the start and width of the Name and Off/Offer/Offering
    # columns in the header row (case-insensitive)
    name = re.search(r'name\s*', lines[0], re.I)
    price = re.search(r'off(er(ing)?)?\s*', lines[0], re.I)
    for line in lines[1:]:
        # Slice each data row at the header's column boundaries
        n = line[name.start():name.end()].strip()
        p = line[price.start():price.end()].strip()
        print((n, p))
    print()
This assumes that emails is a list where each entry is an email. Here is the output:
Cusip NAME Original Current Cashflow Collat Offering
362341D71 GSAA 2005-15 2A2 10,000 8,783 FCF 5/25 65.000
026932AC7 AHM 2007-1 GA1C 9,867 7,250 Spr Snr OA 56.250
('GSAA 2005-15 2A2', '65.000')
('AHM 2007-1 GA1C', '56.250')
Name O/F C/F Cpn FICO CAL WALB 60+ Notes Offer
CSMC 06-9 7A1 25.00 11.97 L+45 728 26 578 35.21 FLT,AS,0.0% 50-00
LXS 07-10H 2A1 68.26 34.01 L+16 744 6 125 33.98 SS,9.57% 39-00
('CSMC 06-9 7A1', '50-00')
('LXS 07-10H 2A1', '39-00')
CUSIP Name BID x Off SIZE C/E 60++ WAL ARM CFLW
86360KAA6 SAMI 06-AR3 11A1 57-00 x 59-00 73+MM 46.9% 67.0% 65 POA SSPT
86361HAQ7 SAMI 06-AR7 A12 19-08 x 21-08 32+MM 15.4% 61.1% 61 POA SRMEZ
('SAMI 06-AR3 11A1', '59-00')
('SAMI 06-AR7 A12', '21-08')
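Since the question mentions writing the results to Excel via XLWT, a minimal sketch of that last step might look like this (the pairs list is a stand-in for the (name, price) tuples printed above):

import xlwt

# Stand-in data: the (name, price) tuples produced by the loop above
pairs = [('GSAA 2005-15 2A2', '65.000'), ('CSMC 06-9 7A1', '50-00')]

wb = xlwt.Workbook()
ws = wb.add_sheet('Prices')
ws.write(0, 0, 'Name')
ws.write(0, 1, 'Price')
for i, (name, price) in enumerate(pairs, start=1):
    ws.write(i, 0, name)
    ws.write(i, 1, price)
wb.save('prices.xls')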

Alternatively, just use the csv module, and apply consistent formatting to your numbers.
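For instance, a minimal sketch of that approach, reusing the extracted (name, price) pairs from the answer above:

import csv

# Stand-in data: the (name, price) tuples extracted above
pairs = [('GSAA 2005-15 2A2', '65.000'), ('CSMC 06-9 7A1', '50-00')]

with open('prices.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Price'])
    writer.writerows(pairs)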

Related

How to save pandas dataframe rows as separate files with the first row fixed for all?

I have a DataFrame with multiple columns and rows. The rows are student names with marks, and the columns are marking criteria. I want to save the first row (column names) along with each row in separate files, with the name of the student as the file name.
Example of my data:
Marking_Rubric  Requirements and Delivery  Coding Standards  Documentation  Runtime - Effectiveness  Efficiency     Total  Comments
                Weight 45.00%              Weight 10.00%     Weight 25.00%  Weight 10.00%            Weight 10.00%
John Doe        54                         50                90             45                       50             31     Limited documentation
Jane Doe        23                         12                87             10                       34             98     No comments
Desired output:
Marking_Rubric  Requirements and Delivery  Coding Standards  Documentation  Runtime - Effectiveness  Efficiency  Total  Comments
John Doe        54                         50                90             45                       50          31     Limited documentation

Marking_Rubric  Requirements and Delivery  Coding Standards  Documentation  Runtime - Effectiveness  Efficiency  Total  Comments
Jane Doe        23                         12                87             10                       34          98     No comments
Just note that you have to give each file a unique name; otherwise, files with the same name will overwrite each other.
import pandas as pd

# `````````````````````````````````````````````````````````````````````````
### create dummy data
column1_list = ['John Doe', 'John Doe', 'Not John Doe',
                'special ß ß %&^ character name', 'no special character name again']
column2_list = [53, 23, 100, 0, 10]
column3_list = [50, 12, 200, 0, 10]
df = pd.DataFrame({'Marking_Rubric': column1_list,
                   'Requirements and Delivery': column2_list,
                   'Coding Standards': column3_list})
# `````````````````````````````````````````````````````````````````````````
### create a unique identifier that will be used as the file name; otherwise
### you will overwrite files with the same name
df['row_number'] = df.index
df['Marking_Rubric_Rowed'] = df.Marking_Rubric + " " + df.row_number.astype(str)
df
Output 1
# `````````````````````````````````````````````````````````````````````````
### loop over the length of your dataframe and save each row as a csv
for x in range(0, len(df)):
    ### try to save the file
    try:
        ### take the current row of data, then use its identifier column as
        ### the name of the file; change the column here for a different name
        df[x:x+1].to_csv(df[x:x+1].Marking_Rubric_Rowed.iloc[0] + '.csv',
                         index=False)
    ### catch and print the exception if something went wrong
    except Exception as e:
        print(e)
        ### continue the loop; you could also "break" to stop it instead
        continue
Output 2
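As an aside, the same loop can also be written with iterrows; a small sketch, assuming the same df and Marking_Rubric_Rowed column created above:

for i, row in df.iterrows():
    # df.loc[[i]] keeps the row as a one-row DataFrame, so the header is written
    df.loc[[i]].to_csv(row['Marking_Rubric_Rowed'] + '.csv', index=False)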

Python Pandas Question: Index / Match with Missing Values + Duplicates + Everything In Between

Basically, I have a smaller table of assets purchased this year and a table of assets the company holds. I want to look up the SYMBOL values for certain CUSIPs in the holdings table and merge them into the purchases dataset, keyed on CUSIP. If a CUSIP in the purchases table has no match, the code can return blank or NaN. If there are duplicate CUSIPs in the holdings dataset, return the first value. I have tried four different ways of merging these tables without much luck; I run into a memory error for some reason.
The equivalent Excel formula would be:
=IFNA(INDEX(asset_holdings!ADMIN_SYMBOLS,MATCH(asset_purchases!CUSIP_n, asset_holdings!CUSIPs, 0)),"")
Holdings Table
CUSIP      SYMBOL
353187EV5  1A
74727PAY7  3A
80413TAJ8  FE
02765UCR3  3G
000000000  3G
74727PAYA  3E
000000000  4E
Purchase Table
CUSIP      SHARES
353187EV5  10
74727PAY7  67
80413TAJ8  35
02765UCR4  3666
74727PAY7  3613
74727PAYA  13
000000000  14
Desired Result
CUSIP      SHARES  SYMBOL
353187EV5  10      1A
74727PAY7  67      3A
80413TAJ8  35      FE
02765UCR4  3666    ""
74727PAY7  3613    3A
74727PAYA  13      3E
000000000  14      3G
C:\ProgramData\Continuum\Anaconda\lib\site-packages\pandas\core\reshape\merge.py in _get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
1140 join_func = _join_functions[how]
1141
-> 1142 return join_func(lkey, rkey, count, **kwargs)
1143
1144
pandas\_libs\join.pyx in pandas._libs.join.left_outer_join()
MemoryError:
What I tried:
dfnew = dfPurchases.merge(dfHoldings[['CUSIP','SYMBOL']],how='left', on='CUSIP')
dfPurchases = dfPurchases.set_index('CUSIP')
dfPurchases['SYMBOL'] = dfHoldings.lookup(dfHoldings['CUSIP'], df1['SYMBOL'])
Let me restate the question a little so you can check that I have understood it correctly. You want to do a left outer join of the purchase dataset with the holdings dataset. But since your holdings dataset has duplicate CUSIP ids, it will not be a one-to-one join.
Now you have two options:
Accept multiple rows for one row of the purchase dataset
Make CUSIP id unique in the Holdings dataset and then perform the merge
First way:
import pandas as pd
left = pd.read_csv('purchase.csv')
right = pd.read_csv('holdings.csv')
result = pd.merge(left, right, on="CUSIP", how='left')
print(result)
But as per your question, the result above isn't acceptable, so we are going to make the CUSIP column unique in the right dataset:
import pandas as pd
left = pd.read_csv('purchase.csv')
right = pd.read_csv('holdings.csv')
# keep='first' is the default, but it is spelled out here for clarity
right_unique = right.drop_duplicates('CUSIP', keep='first')
result = pd.merge(left, right_unique, on="CUSIP", how='left', validate="many_to_one")
print(result)
Bonus: you can also explore the validate parameter by adding it to the first version and seeing the validation errors it raises.
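For example, a minimal sketch of that check (the two small frames below are stand-ins for purchase.csv and holdings.csv): adding validate="many_to_one" to the first, un-deduplicated merge raises a MergeError because the right side has duplicate CUSIP keys.

import pandas as pd

left = pd.DataFrame({'CUSIP': ['353187EV5', '000000000'], 'SHARES': [10, 14]})
right = pd.DataFrame({'CUSIP': ['000000000', '000000000'], 'SYMBOL': ['3G', '4E']})

try:
    # Fails: the right-hand frame is not unique on CUSIP
    pd.merge(left, right, on='CUSIP', how='left', validate='many_to_one')
except pd.errors.MergeError as e:
    print(e)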

What is the most efficient way of converting all "HH:MM:SS" to seconds and invalid strings to NaT?

I have a DataFrame storing marathon segment splits (5K, 10K, ...) and identifiers (age, gender, country) as columns and individuals as rows. Each cell for a marathon segment split column may contain either a string in "HH:MM:SS" format or a "-" (to represent that the marathon segment split data for that individual is invalid or does not exist).
What is the most efficient way of converting all "-" to NaT and "HH:MM:SS" to seconds?
Here is some sample data:
Age M/F Country 5K ... 15K 20K Half Official Time
2323 38 M CHI 0:21:40 ... 1:03:54 1:25:07 1:29:43 2:58:47
2324 23 M USA 0:21:26 ... 1:02:09 1:22:17 1:26:34 2:58:47
2325 36 M USA 0:21:08 ... 1:02:55 1:23:56 1:28:30 2:58:47
2326 37 M POL 0:20:34 ... 1:02:03 1:22:52 1:27:24 2:58:47
2327 32 M DEN - ... 1:03:02 1:24:06 1:28:39 2:58:48
I've referenced this answer, but my data has already been read from a CSV file (I do not want to change how the CSV is read in), and that approach does not seem to accept "-". Conversion to datetime objects with the following code:
df.loc[:, "5K":] = df.loc[:, "5K":].apply(pd.to_datetime, format="%H:%M:%S", errors="coerce")
causes each cell in the marathon split columns to be prefixed with "1900-01-01".
If you're measuring runtimes, a more appropriate conversion function might be pd.to_timedelta:
df.loc[:, "5K":].apply(pd.to_timedelta, unit='S', errors='coerce'))
The main thing to keep in mind here: for durations, rather than points on the timeline, pd.to_timedelta is conceptually more appropriate than pd.to_datetime. Also note that pd.to_timedelta parses "HH:MM:SS" strings directly, so no unit argument is needed (unit applies only to numeric input).
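A minimal end-to-end sketch, using a couple of the sample values above (the column names and the "-" placeholder are taken from the question):

import pandas as pd

df = pd.DataFrame({
    "5K":   ["0:21:40", "0:21:26", "-"],
    "Half": ["1:29:43", "1:26:34", "1:28:39"],
})

# "-" (and any other unparsable string) becomes NaT via errors='coerce'
splits = df.loc[:, "5K":].apply(pd.to_timedelta, errors="coerce")

# Timedelta columns expose .dt.total_seconds(); NaT propagates as NaN
seconds = splits.apply(lambda col: col.dt.total_seconds())
print(seconds)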

Issue when importing excel file with thousands of lines - ValueError: Passed header=1 but only 1 lines in file

I am trying to import an Excel file with headers on the second row:
selected_columns = ['Student name', 'Age', 'Faculty']
data = pd.read_excel(path_in + 'Results\\' + 'Survey_data.xlsx', header=1, usecols=selected_columns).rename(columns={'Student Name': 'First name'}).drop_duplicates()
Currently, the Excel file looks something like this:
Student name Surname Faculty Major Scholarship Age L1 TFM Date Failed subjects
Ana Ruiz Economics Finance N 20 N 0
Linda Peterson Mathematics Mathematics Y 22 N 2021-12-04 0
Gregory Olsen Engineering Industrial Engineering N 21 N 0
Ana Watson Business Marketing N 22 N 0
I have tried including the last column in the selected_columns list, but it returns the same error. I would greatly appreciate it if someone could let me know why Python is not reading all the lines.
Thanks in advance.

read_table error while reading .idx file

I'm trying to read an .idx file that is about 1.89 GB in size. If I write:
indexfile = pd.read_table(r"C:\Edgar Zip files\2001\company.idx")
I get the output as:
Company Name Form Type CIK Date Filed File Name
0 033 ASSET MANAGEMENT LLC / ...
1 033 ASSET MANAGEMENT LLC / ...
2 1 800 CONTACTS INC ...
3 1 800 CONTACTS INC ...
4 1 800 FLOWERS COM INC ...
All the columns are merged together into a single column.
If I do:
indexfile = pd.read_table(r"C:\Edgar Zip files\2001\company.idx", sep=" ")
I get the error:
CParserError: Error tokenizing data. C error: Expected 69 fields in line 4, saw 72
I can use:
indexfile = pd.read_table(r"C:\Edgar Zip files\2001\company.idx", error_bad_lines=False)
But that will just remove most of my data.
Is there any workaround?
PS: Here is a link to a sample .idx file: SEC EDGAR. Download the company.idx file.
Your column entries also have spaces in them, so use two spaces as the separator.
indexfile = pd.read_table(r"C:\Edgar Zip files\2001\company.idx", sep="  ")  # multi-character sep makes pandas fall back to the python engine
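Alternatively, since EDGAR .idx files are fixed-width rather than delimited, pd.read_fwf may be a better fit. A sketch, where skiprows=10 is an assumption about the length of the file's preamble and may need tuning:

import pandas as pd

# read_fwf infers column boundaries from the data itself;
# skiprows=10 is a guess at the preamble length in company.idx
indexfile = pd.read_fwf(r"C:\Edgar Zip files\2001\company.idx", skiprows=10)
print(indexfile.head())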
