I am very, very new to Python and still learning my way around. I am trying to process some data, and I have a very big raw_data.csv file that reads as follows:
ARB1,k_abc,t_def,s_ghi,1.321
ARB2,ref,k_jkl,t_mno,s_pqr,0.31
ARB3,k_jkl,t_mno,s_pqr,qrs,0.132
ARB4,sql,k_jkl,t_mno,s_pqr,ets,0.023
I want to append this data in an existing all_data.csv and it should look like
ARB1,k_abc,t_def,s_ghi,1.321
ARB2,k_jkl,t_mno,s_pqr,0.31
ARB3,k_jkl,t_mno,s_pqr,0.132
ARB4,k_jkl,t_mno,s_pqr,0.023
As you can see, the code has to detect the partial strings and the numbers and rearrange them in an orderly way (excluding the cells that don't contain them). I was trying to use the csv module with very little luck. Can anyone help, please?
You can parse this using pandas.read_csv. Alternatively, if you don't want to use pandas, I would recommend simply reading the data in a line at a time and splitting it on commas using Python's string operations. You can create a 2-D array that you populate row by row as you read in more data.
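For the line-by-line route, a minimal sketch could look like the following; it assumes the wanted fields are exactly the leading identifier, any field starting with k_, t_ or s_, and the trailing number, as in the sample above.

import csv

wanted_prefixes = ("k_", "t_", "s_")
with open("raw_data.csv", newline="") as src, open("all_data.csv", "a", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if not row:
            continue                       # skip blank lines
        keys = [field for field in row if field.startswith(wanted_prefixes)]
        writer.writerow([row[0]] + keys + [row[-1]])   # identifier, matching fields, trailing number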
I have written an optimization algorithm that tests some functions on historical stock data, then returns a 2-D list of the pandas dataframes generated by each run and the function parameters used. This list takes the form of [[df,params],[df,params], ... [df,params],[df,params]]. After it has been generated, I would like to save this data to be processed in another script, but I am having trouble. Currently I am converting this list to a dataframe and using the to_csv() method from pandas, but this is mangling my data when I open it in another file - I expect the data types to be [[dataframe,list],[dataframe,list],...,[dataframe,list],[dataframe,list]], but they instead become [[str,str],[str,str],...,[str,str],[str,str]]. I open the file using the read_csv() method from pandas, then I convert the resulting dataframe back into a list using the df.values.tolist() method.
To clarify, I save the list to a .csv like this, where out is the list:
out = pd.DataFrame(out)
out.to_csv('optimized_ticker.csv')
And I open the .csv and convert it back from a dataframe to a list like this:
df = pd.read_csv('optimized_ticker.csv')
out_list = df.values.tolist()   # avoid shadowing the built-in list
I figured that the problem was my dataframes had commas in there somewhere, so I tried changing the delimiter on the .csv to a few different things, but the issue persisted. How can I fix this issue, so that my datatypes aren't mangled into strings? It is not imperative that I use the .csv format, so if there's a filetype more suited to the job I can switch to using it. The only purpose of saving the data is so that I can process it with any number of other scripts without having to re-run the simulation each time.
The best way to save a pandas dataframe is not via CSV if its only purpose is to be read by another pandas script. Parquet offers a much more robust option: it saves the datatypes for each column, it can be compressed, and you won't have to worry about things like commas in values. Just use the following:
out.to_parquet('optimized_ticker.parquet')
df = pd.read_parquet('optimized_ticker.parquet')
EDIT:
As mentioned in the comments, pickle is also a possibility, so the solution depends on your case. Google will be your best friend in figuring out whether to use pickle, parquet, or feather.
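If the nested [df, params] structure should round-trip exactly as it is, a small pickle sketch (the file name here is just an example) would be:

import pickle

# save the nested [[df, params], ...] list as-is
with open("optimized_ticker.pkl", "wb") as f:
    pickle.dump(out, f)

# later, in the other script
with open("optimized_ticker.pkl", "rb") as f:
    out = pickle.load(f)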
I was looking for how to export the resulting data to a .txt or .csv file, but I could not find a solution simple enough for my level of understanding of the process.
I need to sort words by frequency and pick out the top 100 words. I did it (not in the best way, I know; I did everything in Google Colaboratory):
from collections import Counter

# count every word in the 'body' column and keep the 100 most frequent
DATA = Counter(" ".join(test_data['body']).split()).most_common(100)
DATA
Question:
How do I save the result, these top 100 words, to a .csv or .txt text file, and possibly an Excel version as well (or all three at once)?
I'm just learning and don't know a lot; I'm trying to figure it out and understand.
Here is a link to the Colab notebook. For me the problem is that the words are Russian, and all the tutorials are for English texts, which are easier to work with than Russian text.
https://colab.research.google.com/drive/1LZ3RHPTjTib8lUjzKGcCJgzYnODSjewL?usp=sharing
As I can see in the Colab file you are using pandas, so the best way would be to use pandas' to_csv function to write the csv, and a txt file as well by changing the sep (delimiter) argument.
You can write to Excel using to_excel.
Now, since you want the top 100 of a particular column only, you can first extract that into a separate dataframe by indexing on the column, sorting it (if it isn't already sorted), and using head for the first 100 rows, as in the sketch below.
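Putting that together with the Counter result from the question, one possible sketch (the column and file names here are just examples) is:

import pandas as pd

top_words = pd.DataFrame(DATA, columns=["word", "count"])    # DATA is the most_common(100) list above
top_words.to_csv("top_words.csv", index=False)               # .csv, comma-separated
top_words.to_csv("top_words.txt", sep="\t", index=False)     # .txt, tab-separated
top_words.to_excel("top_words.xlsx", index=False)            # Excel version; needs openpyxl installed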
I have many files: 1.csv, 2.csv ... N.csv. I want to read them all and aggregate them into a DataFrame. But reading the files sequentially in one process will definitely be slow. So how can I improve this? Also, I am working in a Jupyter notebook.
Also, I am a little confused about the "cost of passing parameters or return values between Python processes".
I know the question may be a duplicate, but I found that most of the answers use multiprocessing to solve it. Multiprocessing does get around the GIL problem, but in my experience (which may be wrong) passing large data (like a DataFrame) as a parameter to a subprocess is slower than a for loop in a single process, because the data needs serializing and de-serializing. And I am not sure about returning large values from a subprocess either.
Is it most efficient to use a Queue, joblib, or Ray?
Reading csv is fast. I would read all the csv files into a list and then concat the list into one dataframe. Here is a bit of code from my use case: I find all the .csv files in my path and save the file names in the variable result, then loop over the file names, read each csv, and store it in a list which I later concat into one dataframe.
import pandas as pd

data = []
for item in result:                    # result holds the .csv file names found earlier
    data.append(pd.read_csv(item))     # read each file by its own name, not the folder path
main_df = pd.concat(data, axis=0)
I am not saying this is the best approach, but this works great for me :)
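If sequential reading does turn out to be the bottleneck the question worries about, a thread pool is one hedged alternative to multiprocessing: the DataFrames are returned in-process, so there is no pickling between processes. Whether it actually helps depends on how much time goes to I/O versus parsing. A rough sketch, assuming file names 1.csv ... N.csv as in the question:

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

files = [f"{i}.csv" for i in range(1, 11)]   # 1.csv ... 10.csv; adjust N as needed

with ThreadPoolExecutor() as pool:
    frames = list(pool.map(pd.read_csv, files))

main_df = pd.concat(frames, axis=0)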
I am writing a program that will process a bunch of data and fill a column in Excel. I am using openpyxl, strictly in write_only mode. Each column will have a fixed 75-cell size, and each cell in the row will have the same formula applied to it. However, I can only process the data one column at a time; I cannot process an entire row and then iterate through all of the rows.
How can I write to a column, then move onto the next column once I have filled the previous one?
This is a rather open-ended question, but may I suggest using pandas. Without some kind of example of what you are trying to achieve it's difficult to make a great recommendation, but I have used pandas a ton in the past for automating the processing of Excel files. Basically you would just load whatever data into a pandas DataFrame, then do your transformations/calculations, and when you are done write it back to either the same or a new Excel file (or a number of other formats).
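As a rough sketch of that workflow (the file and column names are placeholders, and reading/writing .xlsx this way needs openpyxl installed):

import pandas as pd

df = pd.read_excel("input.xlsx")            # load the existing sheet into a DataFrame
df["result"] = df["value"] * 2              # whatever per-cell calculation you need
df.to_excel("output.xlsx", index=False)     # write the processed data back out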
Because the OOXML file format is row-oriented, you must write in rows in write-only mode; it is simply not possible otherwise.
What you might be able to do is create some kind of transitional object that you can fill with columns and then use to write to openpyxl. A pandas DataFrame would probably be suitable for this, and openpyxl supports converting DataFrames into rows.
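A possible sketch of that, using openpyxl's dataframe_to_rows helper (the column names, output file name, and 75-row length are just illustrative):

import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

df = pd.DataFrame()
df["col_a"] = list(range(75))      # fill one 75-cell column at a time
df["col_b"] = list(range(75))

wb = Workbook(write_only=True)
ws = wb.create_sheet()
for row in dataframe_to_rows(df, index=False, header=True):
    ws.append(row)                 # write-only mode still receives rows
wb.save("columns.xlsx")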
I am writing a data-harvesting code in Python. I'd like to produce a data frame file that would be as easy to import into R as possible. I have full control over what my Python code will produce, and I'd like to avoid unnecessary data processing on the R side, such as converting columns into factor/numeric vectors and such. Also, if possible, I'd like to make importing that data as easy as possible on the R side, preferably by calling a single function with a single argument of file name.
How should I store data into a file to make this possible?
You can write data to CSV using Python's csv module (http://docs.python.org/2/library/csv.html); then it's a simple matter of using read.csv in R (see ?read.csv).
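For instance, a header row followed by plain values is usually all R needs; a minimal sketch with made-up column names and values:

import csv

rows = [("id", "name", "value"),    # the header row becomes the column names in R
        (1, "a", 0.5),
        (2, "b", 1.5)]
with open("data_for_r.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

On the R side, read.csv("data_for_r.csv") then loads it in a single call.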
When you read in data to R using read.csv, unless you specify otherwise, character strings will be converted to factors, numeric fields will be converted to numeric. Empty values will be converted to NA.
The first thing you should do after you import some data is to look at str() of it (see ?str) to make sure the classes of the data contained within meet your expectations. Many times I have made a mistake, mixed a character value into a numeric field, and ended up with a factor instead of a numeric.
One thing to note is that you may have to set your own NA strings. For example, if you have "-", ".", or some other such character denoting a blank, you'll need to use the na.strings argument to read.csv (it accepts a vector of strings, e.g. c("-",".")).
If you have date fields, you will need to convert them properly. R does not necessarily recognize dates or times without you specifying what they are (see ?as.Date)
If you know in advance what each column is going to be you can specify the class using colClasses.
A thorough read through of ?read.csv will provide you with more detailed information. But I've outlined some common issues.
Brandon's suggestion of using CSV is great if your data isn't enormous, and particularly if it doesn't contain a whole honking lot of floating point values, in which case the CSV format is extremely inefficient.
An option that handles huge datasets a little better might be to construct an equivalent DataFrame in pandas, use its facilities to dump it to HDF5, and then open it in R that way. See this question for an example.
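On the Python side that would be something like the following (pandas' to_hdf needs the PyTables package installed; the file and key names are just examples):

import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
df.to_hdf("data.h5", key="df", mode="w")    # write the frame into an HDF5 store

The .h5 file can then be read in R with an HDF5 package such as rhdf5 or hdf5r.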
This other approach feels like overkill, but you could also directly transfer the dataframe in-memory to R using pandas's experimental R interface and then save it from R directly.