I have data for different stocks on various dates,
and each stock is stored in a different text file (not in CSV).
How can I gather the data into a single file?
You can use something like pandas: load each file into a data frame and then merge all of the data frames together. You can find some info here:
https://pandas.pydata.org/pandas-docs/stable/io.html
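A rough sketch of that approach, assuming the files are tab-separated and live in a folder called stock_data (both are assumptions to adjust):
import glob
import pandas as pd

# read every .txt file in the folder into its own DataFrame
frames = [pd.read_csv(path, sep="\t") for path in glob.glob("stock_data/*.txt")]

# stack them into one DataFrame and write a single CSV
combined = pd.concat(frames, ignore_index=True)
combined.to_csv("all_stocks.csv", index=False)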
If you want to merge all the text files into a single one, use the type command on Windows (on Linux, use cat). First, cd to the directory your text files are in (for example, E:\data):
cd E:\data\
type *.txt > all.csv
and then you can use pandas:
import pandas as pd
df = pd.read_csv("all.csv")
...
So I have an Excel sheet that contains, in this order:
Sample_name | column data | column data2 | column data ... n
I also have a .txt file that contains
Sample_name
What I want to do is filter the Excel file for only the sample names contained in the .txt file. My current idea is to go through each column of the Excel sheet and see if it matches any name in the .txt file, and if it does, grab the whole column. However, this seems like an inefficient way to do it. I also need to do this using Python. I was hoping someone could give me an idea of how to approach this better. Thank you very much.
Excel PowerQuery should do the trick:
Load .txt file as a table (list)
Load sheet with the data columns as another table
Merge (e.g. Left join) first table with second table
Optional: adjust/select the columns to be included or excluded in the resulting table
In Python, the same can be accomplished with pandas data frames (joining the two data frames); see the sketch below.
P.S. Pandas supports loading CSV files and txt files (as a variant of CSV) into a data frame
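A minimal sketch of that pandas route; the file names and the Sample_name column label are assumptions based on the question:
import pandas as pd

# the .txt file is assumed to hold one sample name per line
samples = pd.read_csv("sample_names.txt", header=None, names=["Sample_name"])
data = pd.read_excel("data.xlsx")

# left join: keep every name from the .txt file and pull in its data columns
filtered = samples.merge(data, on="Sample_name", how="left")
filtered.to_excel("filtered.xlsx", index=False)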
I have to work with 50+ .txt files, each containing 2 columns and 631 rows, on which I have to do different operations (sometimes combining them with each other) before doing data analysis. I was hoping there was a way to import each text file into a different dataframe in pandas instead of doing it individually. The code I've been using for individual files is:
df = pd.read_table(file_name, skiprows=1, index_col=0)
print(df)
I use index_col=0 because the first column is the x-value. I use skiprows=1 because I have to drop the title, which is the first row of each .txt file (and matches the file name in the folder). I was thinking maybe I could use the glob package, import everything as a single data frame from the folder, and then split it into different dataframes while keeping the first column as the name of each variable. Is there a feasible way to import all of these files at once from a folder into different dataframes, stored under the first column name? All .txt files would be data frames of 2 columns x 631 rows, not including the first title row. All values in the columns are integers.
Thank you
Yes. If you store your file names in a list named filelist (perhaps built with glob), you can use the following to read all the files and store them in a dict:
dfdict = {f: pd.read_table(f,...) for f in filelist}
Then you can use each data frame with dfdict["filename.txt"].
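A fuller sketch under the assumptions from the question (tab-separated files, a title on the first row, x-values in the first column); the folder path is hypothetical:
import glob
import pandas as pd

filelist = glob.glob("data_folder/*.txt")   # adjust to your folder

# one DataFrame per file, keyed by its file name
dfdict = {f: pd.read_table(f, skiprows=1, index_col=0) for f in filelist}

for name, df in dfdict.items():
    print(name, df.shape)   # each should be 631 rows by 1 data column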
I have a Spark dataframe (hereafter spark_df) and I'd like to convert it to .csv format. I tried the two following methods:
spark_df_cut.write.csv('/my_location/my_file.csv')
spark_df_cut.repartition(1).write.csv("/my_location/my_file.csv", sep=',')
I get no error message for either of them, and both seem to complete, but I cannot find any output .csv file in the target location! Any suggestions?
I'm on a cloud-based Jupyter notebook using Spark 2.3.1.
spark_df_cut.write.csv('/my_location/my_file.csv')
This will create a directory named my_file.csv at the specified path and write the data in CSV format into part-* files inside it.
We cannot control the names of the files while writing the dataframe, so look for a directory named my_file.csv in your location (/my_location/my_file.csv).
If you want a file name ending with .csv, you need to rename the part file afterwards with the filesystem's rename method.
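A sketch of that rename step through the Hadoop FileSystem API exposed to PySpark (assuming repartition(1) so there is exactly one part file; the paths are hypothetical):
# get a handle on the filesystem Spark is configured with
Path = spark._jvm.org.apache.hadoop.fs.Path
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# find the single part-* file inside the output directory and rename it
src_dir = Path("/my_location/my_file.csv")
part = [s.getPath() for s in fs.listStatus(src_dir)
        if s.getPath().getName().startswith("part-")][0]
fs.rename(part, Path("/my_location/my_file_renamed.csv"))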
spark_df_cut.write.csv saves the data as part files. There is no direct way in Spark to save a single .csv file that can be opened directly in Excel or similar tools, but there are several workarounds. One is to convert the Spark DataFrame to a pandas DataFrame and use its to_csv method, like below:
df = spark.read.csv(path='game.csv', sep=',')
pdf = df.toPandas()                         # collect the data to the driver as a pandas DataFrame
pdf.to_csv(path_or_buf='<path>/real.csv')   # write a single local .csv file
This will save the data as a single .csv file (note that toPandas() collects everything into driver memory, so it only works when the data fits).
Another approach is to use the hdfs command-line tool to read the part files and concatenate them into a single file (for example, hdfs dfs -getmerge /my_location/my_file.csv /local/path/my_file.csv).
Please post if you need more help.
I'm working with the latest version of Spark (2.1.1). I read multiple CSV files into a dataframe with spark.read.csv.
After processing this dataframe, how can I save it to an output CSV file with a specific name?
For example, there are 100 input files (in1.csv,in2.csv,in3.csv,...in100.csv).
The rows that belong to in1.csv should be saved as in1-result.csv, the rows that belong to in2.csv should be saved as in2-result.csv, and so on. (The default file names are like part-xxxx-xxxxx, which is not readable.)
I have seen partitionBy(col), but it looks like it can only partition by column.
Another question is that I want to plot my dataframe. Spark has no built-in plotting library. Many people use df.toPandas() to convert to pandas and plot it. Is there a better solution? My data is very big, so toPandas() will cause a memory error. I'm working on a server and want to save the plot as an image instead of showing it.
I suggest the solution below for writing the DataFrame into specific directories related to the input files:
In a loop, for each file:
read the CSV file
add a new column with information about the input file using the withColumn transformation
union all DataFrames using the union transformation
do the required preprocessing
save the result using partitionBy, providing the column with the input file information, so that rows related to the same input file are saved in the same output directory
Code could look like:
from pyspark.sql.functions import lit

all_df = None
for file in files:  # where files is the list of input CSV files that you want to read
    df = spark.read.csv(file)
    # tag every row with the file it came from
    df = df.withColumn("input_file", lit(file))
    if all_df is None:
        all_df = df
    else:
        all_df = all_df.union(df)

# do preprocessing on all_df, producing `result`

# each input file's rows end up in their own subdirectory, named input_file=<file>
result.write.partitionBy("input_file").csv(outdir)
I would like to upload a large number of binary values from a file (a .phys file) into Python and then export these values to Excel for graphing purposes. Excel only supports ~32,000 rows at a time, but I have up to 3 million values in some cases. I am able to load the data set into Python using:
f = open(r"c:\DR005289_F00001.PHYS", "rb")
How do I then export this data to Excel in a format Excel can support? For example, how could I break the data up into columns? I don't care how many values are in each column; it can be an arbitrary split depending on what Excel can support.
This has served me well: use xlwt to put all the data into the file.
I would create a list of lists to break the data into columns, then write each list (pick a length, 10k?) to the Excel file.
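For instance, a sketch along those lines using xlwt; the struct format (8-byte little-endian floats) and the 30,000-values-per-column chunk are assumptions to adjust to your .phys layout (an .xls sheet tops out at 65,536 rows and 256 columns):
import struct
import xlwt

CHUNK = 30000   # values per column; keep it under the row limit

with open(r"c:\DR005289_F00001.PHYS", "rb") as f:
    raw = f.read()

# assumed layout: a flat sequence of 8-byte little-endian floats
count = len(raw) // 8
values = struct.unpack("<%dd" % count, raw[:count * 8])

# break the data into a list of lists, one inner list per Excel column
columns = [values[i:i + CHUNK] for i in range(0, len(values), CHUNK)]

book = xlwt.Workbook()
sheet = book.add_sheet("data")
for col, column in enumerate(columns):
    for row, value in enumerate(column):
        sheet.write(row, col, value)
book.save("DR005289_F00001.xls")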