I have a multi-index (multi-column to be exact) pandas data frame in Python that I saved using the .to_csv() method. Now I would like to continue my analysis in R. For that I need to read in the .csv file. I know that R does not really support multi-index data frames like pandas does but it can handle ftables using the stats package. I tried to use read.ftable() but I can't figure out how to set the arguments right to correctly import the .csv file.
Here's some code to create a .csv file that has the same structure as my original data:
require(stats)
# create example .csv file with a multiindex as it would be saved when using pandas
fileConn <- file('test.csv')
long_string = paste("col_level_1,a,b,c,d\ncol_level_2,cat,dog,tiger,lion\ncol_level_3,foo,foo,foo,foo\nrow_level_1,,,,\n1,",
"\"0,525640810622065\",\"0,293400380474675\",\"0,591895790442417\",\"0,675403394728461\"\n2,\"0,253176104907883\",",
"\"0,107715459748816\",\"0,211636325794272\",\"0,618270276545688\"\n3,\"0,781049927692169\",\"0,72968971635063\",",
"\"0,913378426593516\",\"0,739497259262532\"\n4,\"0,498966730971063\",\"0,395825713762063\",\"0,252543611974303\",",
"\"0,240732390893718\"\n5,\"0,204075522469035\",\"0,227454178487449\",\"0,476571725142606\",\"0,804041968683541\"\n6,",
"\"0,281453400066927\",\"0,010059089264751\",\"0,873336799707968\",\"0,730105129502755\"\n7,\"0,834572206714808\",",
"\"0,668889079581709\",\"0,516135581764696\",\"0,999861473609101\"\n8,\"0,301692961056344\",\"0,702428450077691\",",
"\"0,211660363912457\",\"0,626178589354395\"\n9,\"0,22051883447221\",\"0,934567760412661\",\"0,757627523007149\",",
"\"0,721590060307143\"",sep="")
writeLines(long_string, fileConn)
close(fileConn)
When opening the .csv file in a reader of your choice, it should look like this:
How can I read this in using R?
I found one solution without using read.ftable() based on this post. Note that this won't give you the data in ftable format:
headers <- read.csv(file='./test.csv', header=FALSE, nrows=3, as.is=TRUE, row.names=1)
dat <- read.table('./test.csv', skip=4, header=FALSE, sep=',', row.names=1)
headers_collapsed <- apply(headers, 2, paste, collapse='.')
colnames(dat) <- headers_collapsed
# The values use a decimal comma, so they are read as character;
# convert them to numeric by swapping the comma for a dot:
dat[] <- lapply(dat, function(x) as.numeric(sub(",", ".", x, fixed=TRUE)))
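For reference, the pandas side that produces a file with this layout might look like the following sketch. The level names (col_level_1 etc.) are taken from the file shown above; the decimal=',' setting is an assumption to reproduce the European-style decimal commas, which pandas then quotes because they contain the separator:

```python
import io

import numpy as np
import pandas as pd

# Three-level column MultiIndex mirroring the example file
cols = pd.MultiIndex.from_arrays(
    [["a", "b", "c", "d"],
     ["cat", "dog", "tiger", "lion"],
     ["foo", "foo", "foo", "foo"]],
    names=["col_level_1", "col_level_2", "col_level_3"],
)
df = pd.DataFrame(np.random.rand(9, 4), columns=cols)
df.index = pd.RangeIndex(1, 10, name="row_level_1")

buf = io.StringIO()
# One header row per column level, then a row with the index name,
# then the data rows with quoted decimal-comma values
df.to_csv(buf, decimal=",")
print(buf.getvalue().splitlines()[0])  # col_level_1,a,b,c,d
```

This writes exactly the four-row header structure that read.csv(nrows=3) and read.table(skip=4) pick apart on the R side.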
I want to save a groupby DataFrame to a CSV file, but when I save it, the grouped structure is not preserved.
This is the data:
dataframe image
I run df.groupby(['Date','Name']).sum() and get:
output image of groupby dataframe
But when I save it with df.to_csv("abcd.csv"), the CSV file looks like this:
csv file image
What I actually want the saved file to look like is this:
image of the desired output (Excel file)
Please tell me the solution.
Thank you.
CSV files are plain text: by convention, each row is separated by a newline and each value within a row is separated by a comma. Plain text cannot represent merged or spanned cells, so to achieve the formatting you want, you can write the DataFrame to an Excel file instead of a CSV:
gdf = df.groupby(['Date','Name']).sum()
gdf.to_excel("<path_to_file>")
You will need an Excel writer engine installed. For .xlsx files with modern pandas that is openpyxl (xlwt is only needed for the legacy .xls format and is no longer supported by recent pandas versions):
pip install openpyxl
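If the output must stay a CSV, a common workaround is to flatten the group index with reset_index() so every row carries its Date and Name explicitly. A minimal sketch with made-up data matching the column names in the question:

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "Name": ["A", "B", "A"],
    "Amount": [10, 20, 30],
})

# Summing per (Date, Name) produces a MultiIndex result; reset_index()
# turns the index levels back into ordinary columns so the CSV is flat.
gdf = df.groupby(["Date", "Name"]).sum().reset_index()

buf = io.StringIO()
gdf.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # Date,Name,Amount
```

Every row then repeats its Date and Name, which is the closest a plain CSV can get to the grouped Excel layout.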
I have a spark dataframe (hereafter spark_df) and I'd like to convert that to .csv format. I tried two following methods:
spark_df_cut.write.csv('/my_location/my_file.csv')
spark_df_cut.repartition(1).write.csv("/my_location/my_file.csv", sep=',')
where I get no error message for any of them and both get completed [it seems], but I cannot find any output .csv file in the target location! Any suggestion?
I'm on a cloud-based Jupyter notebook using Spark 2.3.1.
spark_df_cut.write.csv('/my_location/my_file.csv')
This will create a directory named my_file.csv at the path you specified and write the data in CSV format into part-* files inside it.
Spark does not let you control the names of the output files, so look for a directory named my_file.csv at your location (/my_location/my_file.csv).
If you want a file name that ends in .csv, you need to rename the part file afterwards, e.g. with the Hadoop FileSystem fs.rename method.
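For output written to the local filesystem, the rename step can be sketched with the standard library. This is a hypothetical helper, not part of Spark; on HDFS you would use the Hadoop FileSystem API (fs.rename) instead. It assumes repartition(1) or coalesce(1) was used so there is exactly one part file:

```python
import glob
import os
import shutil


def collect_spark_csv(output_dir: str, target_path: str) -> None:
    """Move the single part-* file Spark wrote into `output_dir`
    to `target_path`, giving it a normal .csv name."""
    parts = glob.glob(os.path.join(output_dir, "part-*"))
    if len(parts) != 1:
        raise ValueError(f"expected exactly one part file, found {len(parts)}")
    shutil.move(parts[0], target_path)
```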
spark_df_cut.write.csv saves the data as part files. Spark has no built-in way to write a single .csv file that can be opened directly in Excel or similar tools, but there are several workarounds. One is to convert the Spark DataFrame to a pandas DataFrame and use its to_csv method, like below:
df = spark.read.csv(path='game.csv', sep=',')
pdf = df.toPandas()
pdf.to_csv(path_or_buf='<path>/real.csv')
This will save the data as a single .csv file.
Another approach is to merge the part files with an HDFS command, e.g. hdfs dfs -getmerge <output_dir> <local_file>, which concatenates them into a single local file.
Please post if you need more help.
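The merge step that hdfs dfs -getmerge performs can be sketched for part files on the local filesystem with the standard library. This is an illustrative helper, not part of Spark, and it assumes the part files were written without header rows (with header=True each part file would repeat the header):

```python
import glob
import os


def merge_part_files(output_dir: str, target_path: str) -> None:
    """Concatenate Spark part-* files from `output_dir`
    into a single CSV at `target_path`."""
    with open(target_path, "w") as out:
        for part in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
            with open(part) as f:
                out.write(f.read())
```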
I'm working with the latest version of Spark (2.1.1). I read multiple CSV files into a DataFrame with spark.read.csv.
After processing this DataFrame, how can I save it to an output CSV file with a specific name?
For example, there are 100 input files (in1.csv,in2.csv,in3.csv,...in100.csv).
The rows that belong to in1.csv should be saved as in1-result.csv. The rows that belong to in2.csv should be saved as in2-result.csv and so on. (The default file name will be like part-xxxx-xxxxx which is not readable)
I have seen partitionBy(col), but it looks like it can only partition by a column.
Another question: I want to plot my DataFrame. Spark has no built-in plotting library. Many people use df.toPandas() to convert to pandas and plot it. Is there a better solution? My data is very big, so toPandas() will cause a memory error. I'm working on a server and want to save the plot as an image instead of showing it.
I suggest below solution for writing DataFrame in specific directories related to input file:
in loop for each file:
read csv file
add a new column with information about the input file using the withColumn transformation
union all DataFrames using union transformation
do required preprocessing
save result using partitionBy by providing column with input file information, so that rows related to the same input file will be saved in the same output directory
Code could look like:
from pyspark.sql.functions import lit

all_df = None
for file in files:  # where files is the list of input CSV files you want to read
    df = spark.read.csv(file)
    # withColumn returns a new DataFrame, so the result must be assigned;
    # lit() wraps the file name so it is added as a constant column
    df = df.withColumn("input_file", lit(file))
    if all_df is None:
        all_df = df
    else:
        all_df = all_df.union(df)

# do preprocessing on all_df, producing `result`
# partitionBy takes the column name as a string
result.write.partitionBy("input_file").csv(outdir)
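The effect of this partitionBy step, one output file per input file, can be illustrated locally with pandas. This is a toy sketch, not Spark code; the file names and columns are made up, and it also shows the inN-result.csv naming asked for in the question:

```python
import os
import tempfile

import pandas as pd

# Rows tagged with the input file they came from, as in the loop above
df = pd.DataFrame({
    "input_file": ["in1.csv", "in1.csv", "in2.csv"],
    "value": [1, 2, 3],
})

outdir = tempfile.mkdtemp()
for name, group in df.groupby("input_file"):
    # One result file per input file, e.g. in1-result.csv
    result_name = name.replace(".csv", "-result.csv")
    group.drop(columns="input_file").to_csv(
        os.path.join(outdir, result_name), index=False)

print(sorted(os.listdir(outdir)))  # ['in1-result.csv', 'in2-result.csv']
```

Spark's partitionBy produces directories named input_file=in1.csv/ instead of renamed files, so a final rename pass over those directories would still be needed for the exact names.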