I want to create a routine in Python that reads an Excel file in a given folder, modifies it, and saves it. Specifically, I want to change a column of dates in mm/yyyy format into two columns holding the same dates as separate mm and yyyy values.
This is what the initial spreadsheet looks like, and this is what I would like to change it to (the before/after screenshots from the original post are not reproduced here).
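A minimal sketch of that transformation with pandas; the file paths and the "Date" column name are assumptions, not part of the original question:

import pandas as pd

# Hedged sketch: "input.xlsx", "output.xlsx" and the "Date" column are assumed names.
df = pd.read_excel("input.xlsx")
# Split the "mm/yyyy" strings into separate month and year columns.
df[["Month", "Year"]] = df["Date"].str.split("/", expand=True)
df = df.drop(columns=["Date"])
df.to_excel("output.xlsx", index=False)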
So I have an Excel sheet that contains, in this order:
Sample_name | column data | column data2 | column data ... n
I also have a .txt file that contains
Sample_name
What I want to do is filter the Excel file for only the sample names contained in the .txt file. My current idea is to go through each column of the Excel sheet and see if it matches any name in the .txt file; if it does, grab the whole column. However, this seems like an inefficient way to do it. I also need to do this using Python. I was hoping someone could give me an idea of how to approach this better. Thank you very much.
Excel PowerQuery should do the trick:
Load .txt file as a table (list)
Load sheet with the data columns as another table
Merge (e.g. Left join) first table with second table
Optional: adjust/select the columns to be included or excluded in the resulting table
In Python, the same can be accomplished with Pandas data frames (joining two data frames; see the sketch below)
P.S. Pandas supports loading CSV files and txt files (as a variant of CSV) into a data frame
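A minimal sketch of the Pandas variant; the file names and the assumption that Sample_name is a column in both tables are mine, not the asker's:

import pandas as pd

# Hypothetical file names; the .txt file holds one sample name per line.
samples = pd.read_csv("samples.txt", header=None, names=["Sample_name"])
data = pd.read_excel("data.xlsx")

# An inner merge keeps only the rows whose Sample_name appears in the .txt file
# (use how="left" on samples instead to keep every listed sample).
filtered = data.merge(samples, on="Sample_name", how="inner")
filtered.to_excel("filtered.xlsx", index=False)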
So basically I have multiple Excel files (with different names) in a folder, and I want to copy the same cell (for example B3) from all the files and put all the values into one column of a new Excel file.
The file above is what I want to import (multiple files like that). I want to copy the names and emails and save them to a new file like the one below. (The before/after screenshots are not reproduced here.)
So you want to read multiple files, get a specific cell and then create a new data frame and save it as a new Excel file:
import glob

import pandas as pd

cells = []
for f in glob.glob("*.xlsx"):
    data = pd.read_excel(f, sheet_name='Sheet1')
    cells.append(data.iloc[3, 5])  # 0-based row/column indices

pd.Series(cells).to_excel('file.xlsx')
In my particular example I took cell F4 (row index 3, column index 5; note that iloc is 0-based and, with the default header row, counts from the first data row). You can obviously take any other cell that you like, or even more than one cell, saving each to its own list and combining the lists at the end. You could also have more complex logic, where one cell's value decides which cell to look at next.
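For the names-and-emails case in the question, the same pattern with two lists might look like this (a sketch; the exact cell positions and output column names are assumptions):

import glob

import pandas as pd

names, emails = [], []
for f in glob.glob("*.xlsx"):
    data = pd.read_excel(f, sheet_name='Sheet1', header=None)
    names.append(data.iloc[2, 1])   # cell B3 (0-based indexing, no header row)
    emails.append(data.iloc[3, 1])  # cell B4

pd.DataFrame({'Name': names, 'Email': emails}).to_excel('combined.xlsx', index=False)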
The key point is that you want to iterate through a bunch of files and for each of them:
read the file
extract whatever data you are interested in
set this data aside somewhere
Once you've gone through all the files combine all the data in any way that you like and then save it to disk in a format of your choice.
I have a dataframe below and want to write its contents to a .json file.
While creating the output files I do not want the _SUCCESS and part-xxxx log files, so I tried to collect() the values from the dataframe and used json.dumps() to create the file. But I am losing the column names and formats, as opposed to the expected format in the picture.
Please help!
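For what it's worth, a hedged sketch of one way to keep the column names when collecting a PySpark DataFrame and dumping it yourself; df stands in for the asker's dataframe, and the output path is an assumption:

import json

# Row.asDict() keeps the column names, so they survive the dump.
records = [row.asDict() for row in df.collect()]
with open("output.json", "w") as fh:
    json.dump(records, fh, indent=2)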
I'm working with the latest version of Spark (2.1.1). I read multiple CSV files into a dataframe with spark.read.csv.
After processing this dataframe, how can I save it to an output CSV file with a specific name?
For example, there are 100 input files (in1.csv, in2.csv, in3.csv, ... in100.csv).
The rows that belong to in1.csv should be saved as in1-result.csv, the rows that belong to in2.csv as in2-result.csv, and so on. (The default file names look like part-xxxx-xxxxx, which is not readable.)
I have seen partitionBy(col), but it looks like it can only partition by a column.
Another question: I want to plot my dataframe. Spark has no built-in plotting library. Many people use df.toPandas() to convert to pandas and plot it. Is there a better solution? My data is very big, so toPandas() will cause a memory error. I'm working on a server and want to save the plot as an image instead of showing it.
I suggest the solution below for writing the DataFrame into specific directories related to each input file:
In a loop, for each file:
read the CSV file
add a new column with the input file's name, using the withColumn transformation
union all the DataFrames using the union transformation
Then do the required preprocessing
and save the result using partitionBy, providing the column with the input-file information, so that rows related to the same input file are saved in the same output directory.
Code could look like:
from pyspark.sql.functions import lit

all_df = None
for file in files:  # files is the list of input CSV paths you want to read
    df = spark.read.csv(file)
    # lit() wraps the file name so it becomes a constant column value
    df = df.withColumn("input_file", lit(file))
    if all_df is None:
        all_df = df
    else:
        all_df = all_df.union(df)

# do preprocessing on all_df, producing `result`
result.write.partitionBy("input_file").csv(outdir)
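Note that partitionBy encodes the value in the path, so the rows from in1.csv end up under outdir/input_file=in1.csv/ (still as part-xxxx files); renaming those to in1-result.csv would need a separate post-processing step outside Spark.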
How could I write a list of items [1,2,3,4,5] to an Excel file in a specific tab, starting at a specific row and column location, using the pandas module? Does it involve the pandas.DataFrame.to_excel function, and do I need to convert my list into a dataframe before writing it to the Excel file?
Would I make the list into a Series first, then convert the Series into a dataframe, and then write the dataframe to the Excel file?
Yes, you will need to use to_excel. The one-liner below creates a list, converts it to a Series, converts that to a DataFrame, and writes an Excel file:
pd.Series([i for i in range(10)]).to_frame().to_excel('test.xlsx')
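To hit the specific tab and row/column location asked about, to_excel also takes sheet_name, startrow and startcol; a sketch with placeholder values:

import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5])
# Write to the tab "MyTab", starting at cell C5 (startrow/startcol are 0-based),
# suppressing the DataFrame's index and header labels.
df.to_excel('test.xlsx', sheet_name='MyTab', startrow=4, startcol=2,
            index=False, header=False)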