Python and CSV with formulae

I have a CSV file with formulae, like this:
1;;2.74;0
=A1+C1;=A2;=C1
What's the best way to convert the formulae into numbers, so that the file becomes the following?
1;;2.74;0
3.74;3.74;2.74
The only way I know is to read it with csv.reader as a list of lists and then loop through each element. But it seems there must be a simpler way.
P.S. I have a hint to use eval.

The CSV format does not support formulas; it is a plain-text-only format.
Although some popular software, like MS Excel, will calculate the formulas, I am not aware of a parser that does this. You may, however, attempt to write your own parser. How well that works will depend on how advanced the formulas in the CSV are.
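That said, for simple arithmetic the eval hint from the question can take you quite far. A minimal sketch, assuming formulas use only basic arithmetic, cell references are one column letter plus a 1-based row number (A1, C1, ...), and a formula only refers to cells that have already been computed (delimiter and sample data are from the question):

```python
import csv
import io
import re

raw = "1;;2.74;0\n=A1+C1;=A2;=C1"

rows = list(csv.reader(io.StringIO(raw), delimiter=";"))

def cell_value(ref, grid):
    col = ord(ref[0]) - ord("A")   # 'A' -> column 0
    row = int(ref[1:]) - 1         # '1' -> row 0
    return grid[row][col]

for r, row in enumerate(rows):
    for c, cell in enumerate(row):
        if cell.startswith("="):
            # Substitute every cell reference with its value,
            # then let eval do the arithmetic.
            expr = re.sub(
                r"[A-Z]\d+",
                lambda m: str(cell_value(m.group(0), rows)),
                cell[1:],
            )
            rows[r][c] = str(eval(expr))

print(rows)  # [['1', '', '2.74', '0'], ['3.74', '3.74', '2.74']]
```

Keep in mind that eval will happily execute arbitrary Python, so don't run this on files you don't trust.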

Related

Is it possible to convert 'dynamic' excel formulas to python code?

Is it possible to convert excel formulas to python code? For example:
"=TEXT(SORT(PROPER(UNIQUE(FILTER("
"ws_1!A:A,ws_2!B:B=ws_3!C3"
')))), "")'
Or is it not possible? I was looking into Pycel, xlcalculator, and the formulas module, but unfortunately I cannot find a more complicated example than sum(A,B).
I could probably do it with pandas, but then it won't keep recalculating inside the spreadsheet. Or can I save some Python script to a cell instead of a formula?
If you have any idea how to translate easier formulas, e.g. the one below, or know any library that can do it, I would be grateful for the tips:
'=IFERROR(VLOOKUP(C2,ws!A2:B3,2,0), "Invalid")'
My motivation is to avoid a long Excel formula in Python code, and to make it testable.
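For the concrete VLOOKUP example, a pandas translation is straightforward and easy to unit-test. A hedged sketch, with made-up frame and column names standing in for the worksheet ranges: an exact-match VLOOKUP corresponds to a left merge on the key column, and IFERROR's fallback corresponds to filling the rows that found no match afterwards.

```python
import pandas as pd

# "ws" plays the role of ws!A2:B3; "main" holds the
# lookup keys from column C.  Names are illustrative.
ws = pd.DataFrame({"key": ["a", "b"], "value": [10, 20]})
main = pd.DataFrame({"C": ["a", "x"]})

# Exact-match VLOOKUP == left merge; IFERROR(..., "Invalid")
# == filling the missing matches.
out = main.merge(ws, left_on="C", right_on="key", how="left")
main["looked_up"] = out["value"].fillna("Invalid").values

print(main["looked_up"].tolist())  # [10.0, 'Invalid']
```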

detect partial strings and rearrange a csv accordingly

I am very, very new to Python and still learning my way around. I am trying to process some data and I have a very big raw_data.csv file that reads as follows:
ARB1,k_abc,t_def,s_ghi,1.321
ARB2,ref,k_jkl,t_mno,s_pqr,0.31
ARB3,k_jkl,t_mno,s_pqr,qrs,0.132
ARB4,sql,k_jkl,t_mno,s_pqr,ets,0.023
I want to append this data in an existing all_data.csv and it should look like
ARB1,k_abc,t_def,s_ghi,1.321
ARB2,k_jkl,t_mno,s_pqr,0.31
ARB3,k_jkl,t_mno,s_pqr,0.132
ARB4,k_jkl,t_mno,s_pqr,0.023
As you can see, the code has to detect the partial strings and the numbers and rearrange them in an orderly way (excluding the cells that don't have them). I was trying to use the csv module with very little luck. Can anyone help, please?
You can parse this using pandas.read_csv. Alternatively, if you don't want to use pandas, I would recommend simply reading the data a line at a time and splitting on commas using Python's string operations. You can build a 2-D list that you populate row by row as you read in more data.
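As an illustration of the line-at-a-time approach, here is a sketch that assumes the rule implied by the example: keep the leading ID, any cell starting with k_, t_ or s_, and the trailing number, and drop everything else (the sample rows are the ones from the question):

```python
import csv
import io

# The sample rows from the question; in real use you'd open
# "raw_data.csv" instead of this inline string.
raw = """\
ARB1,k_abc,t_def,s_ghi,1.321
ARB2,ref,k_jkl,t_mno,s_pqr,0.31
ARB3,k_jkl,t_mno,s_pqr,qrs,0.132
ARB4,sql,k_jkl,t_mno,s_pqr,ets,0.023
"""

cleaned = []
for row in csv.reader(io.StringIO(raw)):
    ident, middle, number = row[0], row[1:-1], row[-1]
    # Keep only the cells with the expected prefixes.
    kept = [f for f in middle if f.startswith(("k_", "t_", "s_"))]
    cleaned.append([ident] + kept + [number])

for row in cleaned:
    print(",".join(row))
```

To append the result to all_data.csv, open that file with mode "a" and write the cleaned rows with csv.writer.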

can you subset while reading in a csv in python

I have daily weather data in CSV since 1980, >10 GB in size. The column I am interested in is the date, and I want the user to be able to select a date so that only the results from that date are returned.
I wonder if it is possible to read and subset at the same time, to save memory and computation.
I am relatively new to Python and tried:
d=pd.read_csv('weather.csv',sep='\t')['Date' == 'yyyymmdd']
to no avail.
Is it possible to read in only the data that is present for a single day (i.e. 20011004)?
Short answer: from a CSV you'll not be able to do so.
Long answer: the CSV format is very handy for humans to read, but it's the worst for machines to operate on. You'll need to parse line by line until you find the lines where the date matches the requested one.
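In practice that line-by-line scan is easy to do with pandas itself: read_csv accepts a chunksize, so you can stream the file in pieces and keep only the rows for the requested date, without ever holding the whole 10 GB in memory. A sketch (the column names and the tiny inline sample are assumptions):

```python
import io
import pandas as pd

# A tiny stand-in for the 10 GB file; in real use pass the
# filename to read_csv instead of this StringIO.
csv_text = "Date\tTemp\n20011003\t11.2\n20011004\t9.8\n20011004\t10.1\n"

wanted = 20011004  # the user's chosen date, yyyymmdd
chunks = pd.read_csv(io.StringIO(csv_text), sep="\t", chunksize=2)

# Filter each chunk as it is read, so only matching rows accumulate.
day = pd.concat(chunk[chunk["Date"] == wanted] for chunk in chunks)
print(len(day))  # 2
```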
A possible solution: you should convert the CSV into a format more amenable to such operations. My suggestion would be something like HDF5. You can read the whole CSV with pandas and then save it as an HDF5 file with d.to_hdf('weather.h5', format='table'). You can check the pandas HDF documentation here. This should allow you to handle the data in a more memory- and CPU-efficient way.
Binary files can implement indexes and sorting in such a way that you don't have to go through all the data to check for those pieces you need. The same ideas apply to databases.
Addendum: There are other options for binary formats, like Parquet (which may be even better; you should test) or Feather (if you want some level of "native" interoperability with R). You might want to check the following post for some insights regarding loading/saving times and file sizes across formats.

Divide one "column" by another in Tab Delimited file

I have many files with three million lines in identical tab delimited format. All I need to do is divide the number in the 14th "column" by the number in the 12th "column", then set the number in the 14th column to the result.
Although this is a very simple function I'm actually really struggling to work out how to achieve this. I've spent a good few hours searching this website but unfortunately the answers I've seen have completely gone over the top of my head as I'm a novice coder!
The tools I have are Notepad++ and UltraEdit (which has the ability to use JavaScript, although I'm not familiar with this), and Python 3.6 (I have very basic Python knowledge). Other answers have suggested something called "awk", but when I looked it up it needs Unix, and I only have Windows. What's the best tool for getting this done? I'm more than willing to learn something new.
In Python there are a few ways to handle CSV; for your particular use case I think pandas is what you are looking for.
You can load your file with df = pandas.read_csv(your_file, sep='\t', header=None); with header=None the columns stay numbered from 0, so performing your division and replacement is as easy as df[13] /= df[11].
Finally you can write your data back in csv format with df.to_csv().
I leave it to you to fill in the missing details of the pandas functions, but I promise it is very easy and you'll probably benefit from learning it for a long time.
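Filled in, the whole recipe might look like this. The input is a made-up one-row file with 14 tab-separated columns; header=None keeps the columns numbered from 0, so the 14th column is df[13] and the 12th is df[11]:

```python
import io
import pandas as pd

# Made-up one-row input with 14 tab-separated columns.
fields = ["x"] * 14
fields[11] = "2"    # 12th column
fields[13] = "10"   # 14th column
tsv = "\t".join(fields) + "\n"

# In real use: pd.read_csv("yourfile.txt", sep="\t", header=None)
df = pd.read_csv(io.StringIO(tsv), sep="\t", header=None)
df[13] /= df[11]    # 14th column divided by the 12th, stored back
out = df.to_csv(sep="\t", header=False, index=False)
print(out)
```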
Hope this helps

Preferred data format for R dataframe

I am writing a data-harvesting code in Python. I'd like to produce a data frame file that would be as easy to import into R as possible. I have full control over what my Python code will produce, and I'd like to avoid unnecessary data processing on the R side, such as converting columns into factor/numeric vectors and such. Also, if possible, I'd like to make importing that data as easy as possible on the R side, preferably by calling a single function with a single argument of file name.
How should I store data into a file to make this possible?
You can write data to CSV using Python's csv module (http://docs.python.org/2/library/csv.html), then it's a simple matter of using read.csv in R. (See ?read.csv.)
When you read in data to R using read.csv, unless you specify otherwise, character strings will be converted to factors, numeric fields will be converted to numeric. Empty values will be converted to NA.
The first thing you should do after you import some data is look at it with str() (see ?str) to ensure the classes of the columns meet your expectations. Many times I have made a mistake, mixed a character value into a numeric field, and ended up with a factor instead of a numeric.
One thing to note is that you may have to set your own NA strings. For example, if you have "-", ".", or some other such character denoting a blank, you'll need to use the na.strings argument (which can accept a vector of strings ie, c("-",".")) to read.csv.
If you have date fields, you will need to convert them properly. R does not necessarily recognize dates or times without you specifying what they are (see ?as.Date)
If you know in advance what each column is going to be you can specify the class using colClasses.
A thorough read through of ?read.csv will provide you with more detailed information. But I've outlined some common issues.
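On the Python side, the csv-module version of this suggestion might look like the sketch below. The file and column names are made up; writing empty strings for missing values lets read.csv turn them into NA on its own:

```python
import csv

# Hypothetical harvested records; an empty string marks a missing
# value, which read.csv converts to NA by default.
rows = [
    {"name": "a", "score": 1.5, "date": "2024-01-01"},
    {"name": "b", "score": "", "date": "2024-01-02"},
]

with open("harvest.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "score", "date"])
    writer.writeheader()
    writer.writerows(rows)

# In R: df <- read.csv("harvest.csv")
```

On the R side you may still want colClasses (e.g. to parse the date column) and na.strings if your missing-value marker is something other than an empty field.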
Brandon's suggestion of using CSV is great if your data isn't enormous, and particularly if it doesn't contain a whole honking lot of floating point values, in which case the CSV format is extremely inefficient.
An option that handles huge datasets a little better might be to construct an equivalent DataFrame in pandas, use its facilities to dump it to HDF5, and then open that file in R. See this question for an example.
This other approach feels like overkill, but you could also directly transfer the dataframe in-memory to R using pandas's experimental R interface and then save it from R directly.
