I'm trying to read an .idx file that is about 1.89 GB in size. If I write:
indexfile = pd.read_table(r"C:\Edgar Zip files\2001\company.idx")
I get the output as:
Company Name Form Type CIK Date Filed File Name
0 033 ASSET MANAGEMENT LLC / ...
1 033 ASSET MANAGEMENT LLC / ...
2 1 800 CONTACTS INC ...
3 1 800 CONTACTS INC ...
4 1 800 FLOWERS COM INC ...
where all the columns are merged together into a single column.
If I do:
indexfile = pd.read_table(r"C:\Edgar Zip files\2001\company.idx", sep=" ")
I get the error:
CParserError: Error tokenizing data. C error: Expected 69 fields in line 4, saw 72
I can use:
indexfile = pd.read_table(r"C:\Edgar Zip files\2001\company.idx", error_bad_lines=False)
But that will just remove most of my data.
Is there any workaround?
PS: Link to a sample .idx file: SEC EDGAR. Download the company.idx file.
Your column entries also contain spaces, so use two spaces as the separator:
indexfile = pd.read_table(r"C:\Edgar Zip files\2001\company.idx", sep="  ")
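Since the columns are padded with runs of spaces while single spaces occur inside names, a regex separator matching two or more whitespace characters works as well. A minimal sketch on made-up rows that mimic the company.idx layout (not real EDGAR data; the real file's exact spacing may differ):

```python
import pandas as pd
from io import StringIO

# Made-up sample rows imitating the company.idx layout.
sample = (
    "Company Name            Form Type  CIK      Date Filed\n"
    "1 800 CONTACTS INC      10-K       1050122  2001-03-30\n"
    "1 800 FLOWERS COM INC   10-K       1084869  2001-03-30\n"
)

# Runs of two or more spaces split the columns; single spaces
# inside "Company Name" values survive intact. A multi-character
# separator requires the python parsing engine.
indexfile = pd.read_csv(StringIO(sample), sep=r"\s{2,}", engine="python")
```

The same `sep=r"\s{2,}", engine="python"` arguments can then be passed with the real file path instead of the `StringIO` sample.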
Related
I have a similar use case, but my file has special characters, so this command fails. How can I convert such fixed-length files? Example:
AADP0067 907000075 0 11DP999999PANE E VINO ITALIAN RESTAURANT, DELICATESSEN & BAKERY 00AADP0067 907000075 0 11DP999999PANE E VINO ITALIAN RESTAURANT, DELICATESSEN & BAKERY 00
expected output: AADP0067 AADP0067
But I am facing issues because the file contains ',' and '&'.
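If the records are truly fixed width, slicing by character position ignores the embedded commas and ampersands entirely. A sketch assuming the ID occupies the first 8 characters of each record (the column width is an assumption, not taken from a spec):

```python
import pandas as pd
from io import StringIO

line = ("AADP0067 907000075 0 11DP999999"
        "PANE E VINO ITALIAN RESTAURANT, DELICATESSEN & BAKERY 00")

# colspecs gives explicit (start, end) character positions, so the
# embedded ',' and '&' never act as separators.
df = pd.read_fwf(StringIO(line), colspecs=[(0, 8)], header=None, names=["id"])
```

Additional `(start, end)` pairs in `colspecs` would pull out the other fixed-width fields the same way.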
I am trying to import an Excel file with headers on the second row.
selected_columns = ['Student name','Age','Faculty']
data = pd.read_excel(path_in + 'Results\\' + 'Survey_data.xlsx', header=1, usecols=selected_columns).rename(columns={'Student Name':'First name'}).drop_duplicates()
Currently, the Excel file looks something like this:
Student name Surname Faculty Major Scholarship Age L1 TFM Date Failed subjects
Ana Ruiz Economics Finance N 20 N 0
Linda Peterson Mathematics Mathematics Y 22 N 2021-12-04 0
Gregory Olsen Engineering Industrial Engineering N 21 N 0
Ana Watson Business Marketing N 22 N 0
I have tried including the last column in the selected_columns list, but it returns the same error. I would greatly appreciate it if someone could let me know why Python is not reading all the lines.
Thanks in advance.
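One common pitfall with this pattern: when usecols is given column names, they must match the sheet's header row exactly, including case (note the question's rename key 'Student Name' versus the header 'Student name'). A sketch of the same header=1/usecols/rename pipeline, using read_csv on an in-memory table as a stand-in for read_excel (the layout below is invented for illustration):

```python
import pandas as pd
from io import StringIO

# Stand-in for the spreadsheet: a junk first row, real headers on row 2,
# plus a duplicated row for drop_duplicates() to remove.
sheet = ("exported survey,,\n"
         "Student name,Age,Faculty\n"
         "Ana,20,Economics\n"
         "Linda,22,Mathematics\n"
         "Linda,22,Mathematics\n")

# header=1 makes pandas take the second row as column names;
# the usecols names must match that row exactly, case included.
data = (
    pd.read_csv(StringIO(sheet), header=1,
                usecols=["Student name", "Age", "Faculty"])
      .rename(columns={"Student name": "First name"})
      .drop_duplicates()
)
```

With read_excel the header and usecols parameters behave the same way, so the pipeline transfers directly to the .xlsx file.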
I have something which looks like this (called lines):
[' id\t Name\t Type\t User\t Q\t country\t Final-score\t Progress\t website',
'abcde\t jen\t engineer\t jenabc\t RUNNING\t UK\t 75%\t N/A',
'fres\t Penny\t dr\t dr123\t RUNNING\t DENMARK\t 67%\t N/A']
Each line inside the quotation marks, separated by ',', is a DataFrame row. However, I cannot convert it to a DataFrame.
new_df = pd.read_csv(StringIO(",".join(lines[1:])),sep = "\t")
I do [1:] since the first line is just a comment. I get the error: ParserError: Error tokenizing data. C error: Expected 963 fields in line 3, saw 1099
I'd like my DataFrame to be such that the first row is the headers and the rest are the contents separated by \t. How can I do this?
Join the rows with newlines rather than commas, so each list entry becomes its own line:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO("\n".join(lines)), sep=r"\s+")
print(df)
Prints:
id Name Type User Q country Final-score Progress website
0 abcde jen engineer jenabc RUNNING UK 75% NaN NaN
1 fres Penny dr dr123 RUNNING DENMARK 67% NaN NaN
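End to end on the sample data from the question, the newline join parses cleanly; sep=r"\s+" treats any run of tabs and spaces as one separator, and the short data rows are padded with NaN on the right:

```python
import pandas as pd
from io import StringIO

lines = [' id\t Name\t Type\t User\t Q\t country\t Final-score\t Progress\t website',
         'abcde\t jen\t engineer\t jenabc\t RUNNING\t UK\t 75%\t N/A',
         'fres\t Penny\t dr\t dr123\t RUNNING\t DENMARK\t 67%\t N/A']

# One string per row, joined by newlines; \s+ collapses the "\t " runs.
df = pd.read_csv(StringIO("\n".join(lines)), sep=r"\s+")
```

Note that 'N/A' is in pandas' default NA values, so the Progress column also comes out as NaN here.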
Hi guys. I've got a bit of a unique issue trying to merge two big data files together. Both files have a column of the same data (patent number), with all other columns different.
The idea is to join them such that these patent number columns align so the other data is readable and connected.
Just the first few lines of the .dat file looks like:
IL 1 Chicago 10030271 0 3930271
PA 1 Bedford 10156902 0 3930272
MO 1 St. Louis 10112031 0 3930273
IL 1 Chicago 10030276 0 3930276
And the .asc:
02 US corporation No change 11151713 TRANSCO PROD INC 58419
02 US corporation No change 11151720 SECURE TELECOM INC 502530
02 US corporation No change 11151725 SOA SYSTEMS INC 520365
02 US corporation No change 11151738 REVTEK INC 473150
The .dat file is too large to open fully in Excel, so I don't think reorganizing it there is an option (or rather, I don't know whether it is through any macros I've found online yet).
Quite a newbie question, I feel, but does anyone know how I could link these data sets together (preferably using Python) with this patent-number unique identifier?
You will want to write a program that reads in the data from the two files you would like to merge: open each file, parse the data line by line, and then write it to a new file in any order you like. This is achievable through Python file IO.
Pseudocode:
def filehandler(filename1, filename2):
    fd1 = open(filename1, "r")
    fd2 = open(filename2, "r")
    while True:
        line1 = fd1.readline()
        if not line1:
            break  # exit the loop when there is nothing left to read
        line1_fields = line1.split()
        # each line of the first file is split into a list of fields delimited by whitespace
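Alternatively, since both files share the patent number, pandas can do the join itself once each file is parsed into a DataFrame. A sketch with made-up column names, assuming the last .dat field holds the patent number (the .asc file would need its own parser, e.g. read_fwf, because fields like company names contain spaces):

```python
import pandas as pd
from io import StringIO

# Parse .dat-style rows; the column names here are hypothetical.
dat = pd.read_csv(
    StringIO("IL 1 Chicago 10030271 0 3930271\n"
             "PA 1 Bedford 10156902 0 3930272\n"),
    sep=r"\s+", header=None,
    names=["state", "count", "city", "doc_id", "flag", "patent"],
)

# Stand-in for the parsed .asc data, with invented matching patent numbers.
asc = pd.DataFrame({"patent": [3930271, 3930272],
                    "assignee": ["TRANSCO PROD INC", "SECURE TELECOM INC"]})

# Align rows on the shared patent-number column.
merged = dat.merge(asc, on="patent", how="inner")
```

read_csv processes the large .dat file without loading it into Excel, and chunksize can be added if it does not fit in memory.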
I created a text file from multiple email messages.
Each of the three tuples below was written to the text file from a different email message and sender.
Cusip NAME Original Current Cashflow Collat Offering
362341D71 GSAA 2005-15 2A2 10,000 8,783 FCF 5/25 65.000
026932AC7 AHM 2007-1 GA1C 9,867 7,250 Spr Snr OA 56.250
Name O/F C/F Cpn FICO CAL WALB 60+ Notes Offer
CSMC 06-9 7A1 25.00 11.97 L+45 728 26 578 35.21 FLT,AS,0.0% 50-00
LXS 07-10H 2A1 68.26 34.01 L+16 744 6 125 33.98 SS,9.57% 39-00
CUSIP Name BID x Off SIZE C/E 60++ WAL ARM CFLW
86360KAA6 SAMI 06-AR3 11A1 57-00 x 59-00 73+MM 46.9% 67.0% 65 POA SSPT
86361HAQ7 SAMI 06-AR7 A12 19-08 x 21-08 32+MM 15.4% 61.1% 61 POA SRMEZ
By 'Name' I need a way to pull out the Price info (Price info = the data under the words 'Offering', 'Offer' and 'Off'). This process will be replicated over the whole text file, and the extracted data ('Name' and 'Price') will be written to an Excel file via xlwt. Notice that the format of the price data varies by tuple.
The formatting for this makes it a little tricky, since your names can have spaces, which makes csv difficult to use. One way to get around this is to use the header row to find the location and width of the columns you are interested in, using a regex. You can try something like this:
import re

for email in emails:
    print(email)
    lines = email.split('\n')
    # find the start position and width of each column in the header row;
    # the trailing \s* extends the match to cover the column's full width
    name = re.search(r'name\s*', lines[0], re.I)
    price = re.search(r'off(er(ing)?)?\s*', lines[0], re.I)
    for line in lines[1:]:
        # slice each data row at the header's column boundaries
        n = line[name.start():name.end()].strip()
        p = line[price.start():price.end()].strip()
        print((n, p))
    print()
This assumes that emails is a list where each entry is an email. Here is the output:
Cusip NAME Original Current Cashflow Collat Offering
362341D71 GSAA 2005-15 2A2 10,000 8,783 FCF 5/25 65.000
026932AC7 AHM 2007-1 GA1C 9,867 7,250 Spr Snr OA 56.250
('GSAA 2005-15 2A2', '65.000')
('AHM 2007-1 GA1C', '56.250')
Name O/F C/F Cpn FICO CAL WALB 60+ Notes Offer
CSMC 06-9 7A1 25.00 11.97 L+45 728 26 578 35.21 FLT,AS,0.0% 50-00
LXS 07-10H 2A1 68.26 34.01 L+16 744 6 125 33.98 SS,9.57% 39-00
('CSMC 06-9 7A1', '50-00')
('LXS 07-10H 2A1', '39-00')
CUSIP Name BID x Off SIZE C/E 60++ WAL ARM CFLW
86360KAA6 SAMI 06-AR3 11A1 57-00 x 59-00 73+MM 46.9% 67.0% 65 POA SSPT
86361HAQ7 SAMI 06-AR7 A12 19-08 x 21-08 32+MM 15.4% 61.1% 61 POA SRMEZ
('SAMI 06-AR3 11A1', '59-00')
('SAMI 06-AR7 A12', '21-08')
Just use the csv module, and use consistent formatting for your numbers.