I'm working with the latest version of Spark (2.1.1). I read multiple CSV files into a DataFrame with spark.read.csv.
After processing this DataFrame, how can I save it to output CSV files with specific names?
For example, there are 100 input files (in1.csv, in2.csv, in3.csv, ... in100.csv).
The rows that belong to in1.csv should be saved as in1-result.csv, the rows that belong to in2.csv should be saved as in2-result.csv, and so on. (The default file names look like part-xxxx-xxxxx, which is not readable.)
I have seen partitionBy(col), but it looks like it can only partition by column.
Another question: I want to plot my DataFrame. Spark has no built-in plot library. Many people use df.toPandas() to convert to pandas and plot from there. Is there a better solution? My data is very big, so toPandas() causes a memory error. I'm working on a server and want to save the plot as an image instead of showing it.
I suggest the following solution for writing the DataFrame to directories that correspond to the input files:
In a loop, for each file:
read the CSV file
add a new column with information about the input file using the withColumn transformation
union all DataFrames using the union transformation
do the required preprocessing
save the result using partitionBy on the column with the input file information, so that rows related to the same input file are saved in the same output directory
Code could look like:
from pyspark.sql.functions import lit

all_df = None
for file in files:  # files is the list of input CSV files that you want to read
    # withColumn returns a new DataFrame and needs a Column, hence lit()
    df = spark.read.csv(file).withColumn("input_file", lit(file))
    all_df = df if all_df is None else all_df.union(df)

result = all_df  # do preprocessing here
result.write.partitionBy("input_file").csv(outdir)
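As a possible alternative to the loop, the source file can be tagged while reading all files in one call. The sketch below assumes the inputs end in .csv and uses pyspark.sql.functions.input_file_name, which returns the full file URI, so regexp_extract reduces it to the base name. The helper names read_tagged and result_name are made up for illustration.

```python
import os


def result_name(path, suffix="-result"):
    """Map an input path such as /data/in1.csv to in1-result.csv."""
    stem, ext = os.path.splitext(os.path.basename(path))
    return f"{stem}{suffix}{ext}"


def read_tagged(spark, pattern):
    """Read every CSV matching `pattern` and record each row's source file.

    Assumption: file names end in ".csv"; rows from other files get an
    empty string in the new column.
    """
    # Imported here so result_name stays usable without a Spark install.
    from pyspark.sql.functions import input_file_name, regexp_extract

    df = spark.read.csv(pattern)
    # input_file_name() yields the full URI; keep only the base name.
    base = regexp_extract(input_file_name(), r"([^/]+)\.csv$", 1)
    return df.withColumn("input_file", base)
```

With this, read_tagged(spark, "input/*.csv") would give an input_file column holding in1, in2, and so on, which keeps the partition directory names short.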
So I have an excel sheet that contains in this order:
Sample_name | column data | column data2 | column data ... n
I also have a .txt file that contains
Sample_name
What I want to do is filter the Excel file for only the sample names contained in the .txt file. My current idea is to go through each column of the Excel sheet and check whether it matches any name in the .txt file; if it does, grab the whole column. However, this seems like an inefficient way to do it. I also need to do this using Python. I was hoping someone could give me an idea of how to approach this better. Thank you very much.
Excel PowerQuery should do the trick:
Load .txt file as a table (list)
Load sheet with the data columns as another table
Merge (e.g. Left join) first table with second table
Optional: adjust/select the columns to be included or excluded in the resulting table
The same can be accomplished in Python with pandas by joining two data frames.
P.S. pandas supports loading CSV files and .txt files (as a variant of CSV) into a data frame.
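The join described above might look like this in pandas. This is a minimal sketch: the sample values are made up, the helper name filter_samples is hypothetical, and in practice the two frames would come from pd.read_excel and pd.read_csv on the real files.

```python
import io

import pandas as pd


def filter_samples(data, names, key="Sample_name", how="left"):
    """Join the list of wanted names onto the data so only those samples remain."""
    wanted = names[[key]].drop_duplicates()
    return wanted.merge(data, on=key, how=how)


# Made-up stand-ins for the .txt list and the Excel sheet.
names = pd.read_csv(io.StringIO("Sample_name\nS1\nS3\n"))
data = pd.DataFrame(
    {
        "Sample_name": ["S1", "S2", "S3"],
        "data": [10, 20, 30],
        "data2": [0.1, 0.2, 0.3],
    }
)

filtered = filter_samples(data, names)
print(filtered)
```

With the left join, a name that is missing from the sheet still appears in the result with empty data columns; passing how="inner" drops such names instead.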
I have a bunch of parquet files, each containing a subset of my dataset. Let's say that the files are named data-N.parquet with N being an integer.
I can read them all and subsequently convert to a pandas dataframe:
import glob
import pyarrow.parquet as pq

files = glob.glob("data-**.parquet")
ds = pq.ParquetDataset(
    files,
    metadata_nthreads=64,
).read_table(use_threads=True)
df = ds.to_pandas()
This works just fine. What I would like to have is an additional column in the final data frame, indicating which file each row originates from.
As far as I understand, the ds data is partitioned, with one partition per file. So it would be a matter of including the partition key in the data frame.
Is this feasible?
The partition key is, at the moment, included in the dataframe. However, all existing partitioning schemes use directory names for the key. So if your data was /N/data.parquet or /batch=N/data.parquet this will happen (you will need to supply a partitioning object when you read the dataset).
There is no way today (in pyarrow) to get the filename in the returned results.
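One possible workaround, at the cost of giving up the single dataset read, is to load the files one at a time and tag each frame before concatenating. The sketch below does this on the pandas side; the helper name read_with_source and the column name source_file are made up, and the reader is a parameter so the same helper can be tried on other formats.

```python
import os

import pandas as pd


def read_with_source(paths, reader=pd.read_parquet, column="source_file"):
    """Read each path separately and record its base name in a new column."""
    frames = [
        reader(path).assign(**{column: os.path.basename(path)})
        for path in paths
    ]
    return pd.concat(frames, ignore_index=True)
```

For example, read_with_source(sorted(glob.glob("data-*.parquet"))) would return one data frame whose source_file column holds data-1.parquet, data-2.parquet, and so on.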
I can group large datasets and write multiple CSV or Excel files from a pandas data frame. How can I do the same with a PySpark data frame, to group 700K records into around 230 groups and write 230 CSV files, one per country?
Using pandas
grouped = df.groupby("country_code")
# run this to generate separate Excel files
for country_code, group in grouped:
    group.to_excel(excel_writer=f"{country_code}.xlsx", sheet_name=country_code, index=False)
With a PySpark data frame, when I try this:
for country_code, df_country in df.groupBy('country_code'):
    print(country_code,df_country.show(1))
It returns,
TypeError: 'GroupedData' object is not iterable
If your requirement is to save each country's data separately, you can achieve it by partitioning the data on write. Instead of a file you will get a folder for each country, because Spark can't save data to a single named file directly.
Spark creates a folder whenever a DataFrame writer is called.
df.write.partitionBy('country_code').csv(path)
The output will be multiple folders with corresponding country's data
path/country_code=india/part-0000.csv
path/country_code=australia/part-0000.csv
If you want one file inside each folder you can repartition your data as
df.repartition('country_code').write.partitionBy('country_code').csv(path)
Use partitionBy at the time of writing so that every partition is based on the column you specify (country_code in your case).
Here's more on this.
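If plain files with readable names are needed afterwards, the partition folders can be collected in a post-processing step outside Spark. The following is a rough sketch using only the standard library. It assumes the output sits on a local filesystem and that the part files were written without a header row (the Spark CSV writer's default), so they can simply be concatenated. The helper name collect_partitions and the suffix parameter are made up for illustration.

```python
import shutil
from pathlib import Path


def collect_partitions(outdir, column, dest, suffix="-result.csv"):
    """Merge each `column=value` folder under outdir into dest/<value><suffix>."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for folder in sorted(Path(outdir).glob(f"{column}=*")):
        if not folder.is_dir():
            continue
        value = folder.name.split("=", 1)[1]
        target = dest / f"{value}{suffix}"
        with open(target, "wb") as out:
            # Concatenate part files in name order; marker files such as
            # _SUCCESS and hidden .crc files do not match "part-*".
            for part in sorted(folder.glob("part-*")):
                with open(part, "rb") as src:
                    shutil.copyfileobj(src, out)
        written.append(target)
    return written
```

Calling collect_partitions(path, "country_code", "final", suffix=".csv") would then produce final/india.csv, final/australia.csv, and so on. Note that Spark escapes some special characters in partition directory names, so values containing them would need decoding first.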
I have to work with 50+ .txt files each containing 2 columns and 631 rows where I have to do different operations to each (sometimes with each other) before doing data analysis. I was hoping there was a way to import each text file under a different dataframe in pandas instead of doing it individually. The code I've been using individually has been
df = pd.read_table(file_name, skiprows=1, index_col=0)
print(df)
I use index_col=0 because the first row is the x-value. I use skiprows=1 because I have to drop the title which is the first row (and file name in folder) of each .txt file. I was thinking maybe I could use glob package and importing all as a single data frame from the folder and then splitting it into different dataframes while keeping the first column as the name of each variable? Is there a feasible way to import all of these files at once under different dataframes from a folder and storing them under the first column name? All .txt files would be data frames of 2 col x 631 rows not including the first title row. All values in the columns are integers.
Thank you
Yes. If you store your file names in a list named filelist (maybe using glob), you can use the following command to read all the files and store them in a dict.
dfdict = {f: pd.read_table(f,...) for f in filelist}
Then you can use each data frame with dfdict["filename.txt"].
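Put together with glob, this could be sketched as follows. The helper name load_folder is made up, the keys are the file names without their extension (which the question says match the title row), and the default read options are the ones from the question.

```python
import glob
import os

import pandas as pd


def load_folder(folder, pattern="*.txt", **read_kwargs):
    """Return {file name without extension: DataFrame} for matching files."""
    # Defaults taken from the question; callers may override them.
    options = {"skiprows": 1, "index_col": 0, **read_kwargs}
    frames = {}
    for path in sorted(glob.glob(os.path.join(folder, pattern))):
        key = os.path.splitext(os.path.basename(path))[0]
        frames[key] = pd.read_table(path, **options)
    return frames
```

Each data frame is then available as, for example, load_folder("data")["filename"], and operations across files can loop over the dict's items.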
I've created a CSV file with the column names and saved it using the pandas library. This file will be used as a historical record, where rows will be added one by one at different moments. To add a row to this existing CSV, I transform the record into a DataFrame and then call to_csv() with mode='a' in order to append the record to the file. The problem is that I would like to see an index automatically generated in the file every time I add a new row. I already know that when I import this file as a DataFrame an index is generated automatically, but that only happens inside the IDLE interface. When I open the CSV with Excel, for example, the file doesn't have an index.
When writing to CSV, you can set index=True in the to_csv method (this is also the default). This ensures that the index of your DataFrame is written explicitly to the CSV file.
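Note that a one-row DataFrame built for each append has index 0 by default, so every appended row would carry the same index. One way around that, sketched below, is to derive the next index from the number of lines already in the file. The helper name append_record is made up, and the sketch assumes the file is created by this helper (so its header includes the index column) and that no field contains an embedded newline.

```python
import os

import pandas as pd


def append_record(path, record):
    """Append one record (a dict) to the CSV at `path` with a running index."""
    exists = os.path.exists(path)
    if exists:
        with open(path) as handle:
            # Data rows so far: total lines minus the header line.
            next_index = max(sum(1 for _ in handle) - 1, 0)
    else:
        next_index = 0
    row = pd.DataFrame([record], index=[next_index])
    # Write the header (including the index label) only when creating the file.
    row.to_csv(path, mode="a", index=True, header=not exists, index_label="index")
```

Calling append_record("history.csv", {"a": 1, "b": 2}) repeatedly would then yield rows numbered 0, 1, 2, and so on, visible when the file is opened in Excel.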