I have a large (for my experience level, anyway) text file of astrophysical data and I'm trying to get a handle on Python/pandas. As a noob to Python, it's coming along slowly. Here is a sample of the text file; the full file is about 145 MB. When I try to read it in pandas I get confused, because I don't know whether to use pd.read_table('example.txt') or pd.read_csv('example.csv'). In either case I can't call on a specific column without IPython freaking out, such as here. I know I'm doing something absent-minded. Can anyone explain what that might be? I've done this same procedure with smaller files and it works great, but this one seems to be limiting its output, or just not working at all.
Thanks.
It looks like your columns are separated by varying amounts of whitespace, so you'll need to specify that as the separator. Try read_csv('example.csv', sep=r'\s+'); \s+ is the regular expression for "one or more whitespace characters". Also, you should remove the # character from the beginning of the first line, as it will otherwise be read as an extra column and will mess up the parsing.
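A minimal sketch of what that ends up looking like (the file name and column name are placeholders, and it assumes the leading # has been stripped so the first line supplies the headers):

import pandas as pd

# sep=r'\s+' splits on runs of whitespace of any length
df = pd.read_csv('example.txt', sep=r'\s+')
print(df.columns)
print(df['col_name'].head())  # 'col_name' is a placeholder column name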
Related
I am trying to extract data from a .txt file which contains certain measurement values that I would like to use inside Python. I am doing this with the numpy module (numpy.genfromtxt), which saves the values into an array.
Nevertheless, whenever there is a decimal value, it is written with a comma (e.g. 1,456), which Python does not accept as a decimal separator. Sadly, this is the way the data has been given to me. Now, I would like to write Python code that goes through all elements of the array, looks for commas, and changes them to dots (I have multiple files and I would like to automate this process, even though I could technically do it manually :) ).
As I started programming with C and C++, I would have done this with pointers and loops. However, the pointer concept does not seem to exist in Python, or is at least not advised. I would be very glad if any of you could tell me whether there is a way to approach this problem in Python. Thank you very much!
Welcome to SO. Please give more details. We cannot answer if you do not include the code you have written so far, sample data, and the full error message from your attempt to solve this problem, so that we can reproduce it and help.
See MRE here: https://stackoverflow.com/help/minimal-reproducible-example
Read the file content and replace the "," characters like so:
with open('file.txt', 'r') as f:
    content = f.read().replace(',', '.')
    # do whatever with "content"
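If the goal is still to end up with a NumPy array, the cleaned string can be fed straight to numpy.genfromtxt through an in-memory buffer. A rough sketch (the file name and delimiter are assumptions about your data):

import io
import numpy as np

# Read the raw text, swap decimal commas for dots, then parse the cleaned
# string with genfromtxt; adjust delimiter to whatever separates your columns.
with open('file.txt', 'r') as f:
    content = f.read().replace(',', '.')
data = np.genfromtxt(io.StringIO(content), delimiter=';')
print(data.shape)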
I have to process CSV files (around 500,000 of them) in Python 3, which all come from the same source. However, for unknown reasons, some of them (a small minority) don't use the same quoting mechanism as the others.
Unfortunately, I have no way to distinguish a CSV quoting rule from the source itself (I can't make a rule based on the file name).
Here are the three kinds of CSV I have:
"field_A";"Field_B";"Field_C"
"";"john; the real one";"krasinski"
field_A;Field_B;Field_C
;john "the original";krasinski
"field_A";"Field_B";"Field_C";"Field_D"
"";"john;"krasinski;"2019-12-12"
I can parse the first kind by setting my quote character to ", and the second by using no quote character at all, but I have no clue how to process the last one, which mixes fields that must be quoted with fields that cannot be quoted (because of the stray single ").
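For reference, here is roughly how I read the first two kinds today (just a sketch; the file names are placeholders):

import csv

# Kind 1: fields wrapped in double quotes, ';' as the delimiter.
with open('quoted.csv', newline='') as f:
    rows_quoted = list(csv.reader(f, delimiter=';', quotechar='"'))

# Kind 2: no quoting at all, so any double quotes are kept literally.
with open('unquoted.csv', newline='') as f:
    rows_unquoted = list(csv.reader(f, delimiter=';', quoting=csv.QUOTE_NONE))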
Is there a way for me to deal with these different kinds of files?
In the end, I'd like to get the value krasinski for Field_C in every case, for example, but I either get "krasinski" and krasinski, or krasinski and an error.
All the solutions that came to me feel "wrong".
I'm running Python 3.8, and currently using what appears to be the recommended library but I'm not committed to it.
Thanks for your help!
I have a process that uses pandas.DataFrame.to_csv() to create .csv files of my data, and I wish to run this process several times, each time appending new data to the already existing file (the output's column names are the same for each run).
Of course, one way to work around my problem would be to create a file for each run and then concatenate them, but I would like to do this without changing much in my scripts.
Is it possible to append to an existing .csv file in pandas 0.14? (Unfortunately, I cannot upgrade my version.)
I was thinking I could do something with the 'mode' argument (http://pandas.pydata.org/pandas-docs/version/0.14.0/generated/pandas.DataFrame.to_csv.html?highlight=to_csv#pandas.DataFrame.to_csv), but I can't seem to find the right way to do it.
Any suggestions?
Yes, you can use write mode 'a'. You may also need/want to pass header=False so the column names aren't written again on each append.
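For example (a quick sketch; the DataFrame and the file name are placeholders):

import os
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Write the header only if the file doesn't exist yet, then append without it.
write_header = not os.path.exists('output.csv')
df.to_csv('output.csv', mode='a', header=write_header, index=False)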
I'm a little unclear why you don't want to do .read_csv() into df.append() into df.to_csv(), but that seems like an option too
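Roughly like this (a sketch; the file name and the sample frame are placeholders, and DataFrame.append was still the usual way to do this in the pandas 0.14 era):

import pandas as pd

new_df = pd.DataFrame({'a': [5], 'b': [6]})

# Read what is already on disk, append the new rows, and write it back out.
existing = pd.read_csv('output.csv')
combined = existing.append(new_df, ignore_index=True)
combined.to_csv('output.csv', index=False)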
I am getting a UnicodeDecodeError in my Python script, and I know that the offending character is not Latin (or English), and I know which row it is in (there are thousands of columns). How do I go through my SQL code to find this character (or these characters)?
Do a binary search. Break the files (or the scripts, or whatever) in half, and process both files. One will (should) fail, and the other shouldn't. If they both have errors, doesn't matter, just pick one.
Continue splitting the broken files until you've narrowed it down to something more manageable that you can probe into, even down to the actual line that's failing.
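If you want Python to do the narrowing for you, a rough sketch of the same idea is to decode the file line by line and report the first line that fails (the file name and the expected encoding here are assumptions):

# Decode each line separately; the first one that blows up is the culprit.
with open('dump.sql', 'rb') as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as err:
            print(f'Line {lineno}: {err}')
            print(raw)
            break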
I'm a Python (and regex) rookie with relatively little programming experience outside of statistical packages (SAS & Stata). So far, I've gotten by using Python tutorials and answers to other questions on stackoverflow, but I'm stuck. I'm running Python 3.4 on Mac OS X.
I've written a script which downloads and parses SEC filings. The script has four main steps:
Open the URL and load the contents into a string variable
Remove HTML encoding using BeautifulSoup
Remove other encoding with regex statements (like jpg definitions, embedded zip files, etc.)
Save the resulting text file.
My goal is to remove as much of the "non-text" information as possible from each filing before saving to my local drive. I have another script written where I do the actual analysis on the residual text.
I'm running into a problem with step 3 on at least one filing. The line that is causing the hangup is:
_content1 = re.sub(r'(?i).*\.+(xls|xlsx|pdf|zip|jpg|gif|xml)+?[\d\D]+?(end)',r'',_content1)
where _content1 is a string variable containing the contents of the SEC filing. The regex statement is supposed to capture blocks beginning with a line ending in a file extension (xls, pdf, etc.) and ending with the word "end".
The above code has worked fine for entire years' worth of filings (i.e., I've analyzed all of 2001 and 2002 without issue), but my script is getting hung up on one particular filing in 2013 (http://www.sec.gov/Archives/edgar/data/918160/0000918160-13-000024.txt). I'm unsure how to debug as I'm not getting any error message. The script just hangs up on that one line of code (I've verified this with print statements before and after). Interestingly, if I replace the above line of code with this:
_content1 = re.sub(r'(?i)begin*.*(xls|xlsx|pdf|zip|jpg|gif|xml)+?[\d\D]+?(end)',r'',_content1)
Then everything works fine. Unfortunately, certain kinds of embedded files in the filings don't start with "begin" (like zip files), so it won't work for me.
I'm hoping one of the resident experts can identify something in my regex substitution statement that would cause a problem, as going match-by-match through the linked SEC filing probably isn't feasible (at least I wouldn't know where to begin). Any help is greatly appreciated.
Thanks,
JRM
EDIT:
I was able to get my script working by using the following REGEX:
_content1 = re.sub(r'(?i)begin|\n+?.+?(xls|xlsx|pdf|zip|jpg|gif|xml)+?[\d\D]+?(end)',r'\n',_content1)
This seems to be accomplishing what I want, but I am still curious as to why the original didn't work if anyone has a solution.
I think your biggest problem is the lack of anchors. Your original regex begins with .*, which can start matching anywhere and won't stop matching until it reaches a newline or the end of the text. Then it starts backtracking, giving back one character at a time, trying to match the first falsifiable component of the pattern: the dot and the letters of the file extension.
So it starts at the beginning of the file and consumes potentially thousands of characters, only to backtrack all the way to the beginning before giving up. Then it bumps ahead one position and does the same thing starting at the second character, and again from the third, the fourth, and so on. I know it seems incredibly inefficient, but that's the tradeoff we make for the power and compactness of regexes.
Try this regex:
r"(?im)^[^<>\n]+\.(?:xlsx?|pdf|zip|jpg|gif|xml)\n(?:(?!end$)\S+\n)+end\n"
The start anchor (^) in multiline mode makes sure the match can only start at the beginning of a line. I used [^<>\n]+ for the first part of the line because I'm working with the file you linked to; if you've removed all the HTML and XML markup, you might be able to use .+ instead.
Then I used (?:(?!end$)\S+\n)+ to match one or more complete lines that don't consist entirely of end. It's probably more efficient than your [\d\D]+?, but the most important difference is that, when I do match end, I know it's at the beginning of a line (and the $ ensures it's at the end of the line).
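A quick, self-contained demonstration of how that pattern behaves (the sample text below is invented, not taken from an actual filing):

import re

pattern = r"(?im)^[^<>\n]+\.(?:xlsx?|pdf|zip|jpg|gif|xml)\n(?:(?!end$)\S+\n)+end\n"

# A made-up embedded-file block: a line ending in .zip, a body of
# space-free encoded lines, and a bare "end" line.
sample = (
    "Some narrative text to keep.\n"
    "exhibit99.zip\n"
    "M9V7S5RR00``$``\n"
    "M2W1KQ3ZZ11``$``\n"
    "end\n"
    "More narrative text to keep.\n"
)

print(re.sub(pattern, "", sample))  # only the two narrative lines survive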
Try using the following REGEX
_content1 = re.sub(r'(?i).*?\.+(xls|xlsx|pdf|zip|jpg|gif|xml)+?[\d\D]+?(end)',r'',_content1)
I've changed your * to the non-greedy *?, which is most likely what you want.