Saving my dataset as a csv file in python

I currently have a JSON file that I downloaded from GitHub containing a dataset, which I edited by adding columns of values. How would I export my newly edited dataset as a csv file that I could upload back to GitHub?
Currently my data is saved as:
import pandas as pd
url = 'https://raw.githubusercontent.com/xxx.json' #example of raw url taken from github
df = pd.read_json(url) #dataset from json file
df['H_values'] = output #new column added of values
Since I updated the original dataset (df) with a column called "H_values" (the last line of the code), I would like to export this version of the dataset as a csv file. Thanks!

Simple:
df.to_csv("output.csv")
Check the pandas documentation for more information on the function and its parameters.
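For instance, a few commonly used parameters (this continues with the question's df; see the documentation for the full list):
# index=False drops the row-index column; sep and encoding control the output format
df.to_csv("output.csv", index=False, sep=",", encoding="utf-8")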

The documentation for dataframes suggests an option to convert one into a *.csv-like string: Link
Now you just need to save this in the file:
with open("Output.csv", "w") as text_file:
text_file.write(df.to_csv(index=False))
For more options I refer to the doc link above.

Related

Getting "ParserError" when I try to read a .txt file using pd.read_csv()

I am trying to convert this dataset, COCOMO81, to .arff.
Before converting it to .arff, I am trying to convert it to .csv.
I am following this LINK to do this.
I got the dataset from the promise site. I copied the entire page to Notepad as cocomo81.txt, and now I am trying to convert that cocomo81.txt file to .csv using Python.
(I intend to convert the .csv file to .arff later using Weka.)
However, when I run
import pandas as pd
read_file = pd.read_csv(r"cocomo81.txt")
I get THIS ParserError.
To fix this, I followed this solution and modified my command to
read_file = pd.read_csv(r"cocomo81.txt",on_bad_lines='warn')
I got a bunch of warnings - you can see what it looks like here
and then I ran
read_file.to_csv(r'.\cocomo81csv.csv',index=None)
But it seems that the fix for ParserError didn't work in my case because my cocomo81csv.csv file looks like THIS in Excel.
Can someone please help me understand where I am going wrong and how can I use datasets from the promise repository in .arff format?
Looks like it's a csv file with comment lines at the top. The comment lines are indicated by % characters (but also #?), and the actual csv data starts at line 230.
You should skip the first rows and set the column names manually; try something like this:
# set column names manually
col_names = ["rely", "data", "cplx", "time", "stor", "virt", "turn", "acap", "aexp", "pcap", "vexp", "lexp", "modp", "tool", "sced", "loc", "actual" ]
filename = "cocomo81.arff.txt"
# read the csv data, skipping the first 229 comment/header lines so parsing starts at line 230
df = pd.read_csv(filename, skiprows=229, sep=',', decimal='.', header=None, names=col_names)
print(df)
You first need to parse the txt file.
The column names can be taken from the #attribute lines:
#attribute rely numeric
#attribute data numeric
#attribute cplx numeric
#attribute time numeric
..............................
And for the csv file, keep only the data after #data, which is at the end of the file. You can just copy/paste it.
0.88,1.16,0.7,1,1.06,1.15,1.07,1.19,1.13,1.17,1.1,1,1.24,1.1,1.04,113,2040
0.88,1.16,0.85,1,1.06,1,1.07,1,0.91,1,0.9,0.95,1.1,1,1,293,1600
1,1.16,0.85,1,1,0.87,0.94,0.86,0.82,0.86,0.9,0.95,0.91,0.91,1,132,243
0.75,1.16,0.7,1,1,0.87,1,1.19,0.91,1.42,1,0.95,1.24,1,1.04,60,240
...................................................................
And then read the resulting csv file
pd.read_csv(file, names=["rely", "data", "cplx", ...])
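If you would rather not copy/paste by hand, here is a rough sketch of automating that split, assuming the file keeps the attribute/data markers shown above (the marker characters vary between promise files, so adjust the prefixes as needed):
import pandas as pd

col_names = []
data_start = 0
with open("cocomo81.txt") as f:
    for i, line in enumerate(f):
        stripped = line.strip()
        if stripped.startswith(("#attribute", "@attribute")):
            # e.g. "#attribute rely numeric" -> column name "rely"
            col_names.append(stripped.split()[1])
        elif stripped in ("#data", "@data"):
            # everything after this marker is the actual csv data
            data_start = i + 1
            break

df = pd.read_csv("cocomo81.txt", skiprows=data_start, header=None, names=col_names)
print(df.head())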

Multiple txt files as separate rows in a csv file without breaking into lines (in pandas dataframe)

I have many txt files (which have been converted from pdf) in a folder. I want to create a csv/excel dataset where each text file will become a row. Right now I am opening the files in pandas dataframe and then trying to save it to a csv file. When I print the dataframe, I get one row per txt file. However, when saving to csv file, the texts get broken and create multiple rows/lines for each txt file rather than just one row. Do you know how I can solve this problem? Any help would be highly appreciated. Thank you.
Following is the code I am using now.
import glob
import os
import pandas as pd
file_list = glob.glob(os.path.join(os.getcwd(), "K:\\text_all", "*.txt"))
corpus = []
for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())
df = pd.DataFrame({'col':corpus})
print (df)
df.to_csv('K:\\out.csv')
Update
If this solution is not possible, it would also be helpful to transform the data a bit in the pandas dataframe. I want to create a column with the names of the txt files, that is, the name of each txt file in the folder will become the identifier of the respective text file. I will then save it in tsv format so that the lines do not get separated because of commas, as suggested by someone here.
I need something like following.
identifier col
txt1 example text in this file
txt2 second example text in this file
...
txtn final example text in this file
Use
import csv
df.to_csv('K:\\out.csv', quoting=csv.QUOTE_ALL)
so that every field is wrapped in quotes; a compliant csv reader will then treat embedded newlines as part of a quoted field rather than as row breaks. A sketch covering the update follows below.
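For the update, a rough sketch (replacing the newlines with spaces is an assumption about what "without breaking into lines" should mean; drop that step if you need the original line breaks):
import glob
import os
import pandas as pd

file_list = glob.glob(os.path.join("K:\\text_all", "*.txt"))
rows = []
for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        # flatten newlines so each document stays on one physical line
        text = f_input.read().replace("\r", " ").replace("\n", " ")
    # the file name without its extension becomes the identifier
    rows.append({"identifier": os.path.splitext(os.path.basename(file_path))[0],
                 "col": text})

df = pd.DataFrame(rows, columns=["identifier", "col"])
df.to_csv("K:\\out.tsv", sep="\t", index=False)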

Python: How to create a new dataframe starting from the first row with a specific value

I am reading csv files into python using:
df = pd.read_csv(r"C:\csvfile.csv")
But the file has some summary data, and the raw data starts where a value "valx" is found. If "valx" is not found then the file is useless. I would like to create new dataframes that start where "valx" is found. I have been trying for a while with no success. Any help on how to achieve this is greatly appreciated.
Unfortunately, pandas only accepts skiprows for skipping a fixed number of rows at the beginning. You might want to parse the file before creating the dataframe.
As an example:
import csv
with open(r"C:\csvfile.csv", "r", newline="") as f:
    rows = list(csv.reader(f))

# keep everything from the first row that contains "valx" onwards
data = None
for idx, row in enumerate(rows):
    if 'valx' in row:
        data = rows[idx:]
        break
Using the standard library csv module, you can read the file and check whether valx appears; if it is found, the rows from that point on end up in the data variable (data stays None when "valx" is missing, i.e. the file is useless).
From there you can use the data variable to create your dataframe.
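For example, assuming the "valx" row holds the column headers (an assumption about the file layout; adjust if the headers live elsewhere):
import pandas as pd

if data is not None:
    # first kept row becomes the header, the rest become the records
    df = pd.DataFrame(data[1:], columns=data[0])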

How to create a hierarchical csv file?

I have invoice data for N invoices in Excel, and I want to create a CSV of that file so that it can be imported whenever needed... so how can I achieve this?
Here is a screenshot:
Assuming you have a folder "excel" full of Excel files within your project directory, and another folder "csv" where you intend to put the generated CSV files, you can easily batch-convert all the Excel files in the "excel" directory to "csv" using Pandas.
It is assumed that you already have Pandas installed on your system; otherwise, install it via: pip install pandas. The commented snippet below illustrates the process:
import os
import pandas as pd

excelDir = "excel"  # folder containing the Excel files
csvDir = "csv"      # folder where the generated CSV files go

# loop through the excel folder; at each iteration, check whether the
# current file is an Excel file, and if so convert it to csv and save it
for fileName in os.listdir(excelDir):
    # do we have an Excel file?
    if fileName.endswith(".xls") or fileName.endswith(".xlsx"):
        # if we do, then we do the conversion using pandas
        targetXLFile = os.path.join(excelDir, fileName)
        targetCSVFile = os.path.join(csvDir, fileName) + ".csv"
        # read the Excel file in (needs an engine such as openpyxl for .xlsx)
        dFrame = pd.read_excel(targetXLFile)
        # once done reading, simply save the data to csv
        dFrame.to_csv(targetCSVFile)
Hope this does the trick for you. Cheers and good luck.
Instead of putting the total output into one csv, you could go with the following steps:
1. Convert your Excel content to csv files or csv objects.
2. Tag each object with its invoice id and save it into a dictionary. Your dictionary's data structure could look like {'invoice-id': csv-object, 'invoice-id2': csv-object2, ...}.
3. Write a custom function which reads your csv object and gives you name, product-id, qty, etc. A sketch of this idea follows below.
Hope this helps.
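A minimal sketch of that dictionary idea, assuming the Excel sheet has an "invoice_id" column plus per-line fields such as "name", "product-id", and "qty" (all hypothetical names; adapt them to the real layout):
import pandas as pd

# hypothetical source file; each row is one line item of some invoice
df = pd.read_excel("invoices.xlsx")

# one dataframe per invoice, keyed by invoice id
invoices = {invoice_id: group for invoice_id, group in df.groupby("invoice_id")}

def lookup(invoice_id, field):
    # return one field (e.g. 'name', 'product-id', 'qty') for a given invoice
    return invoices[invoice_id][field].tolist()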

Pyspark: write df to file with specific name, plot df

I'm working with the latest version of Spark (2.1.1). I read multiple csv files into a dataframe with spark.read.csv.
After processing this dataframe, how can I save it to an output csv file with a specific name?
For example, there are 100 input files (in1.csv,in2.csv,in3.csv,...in100.csv).
The rows that belong to in1.csv should be saved as in1-result.csv. The rows that belong to in2.csv should be saved as in2-result.csv and so on. (The default file name will be like part-xxxx-xxxxx which is not readable)
I have seen partitionBy(col), but it looks like it can only partition by column.
Another question is that I want to plot my dataframe. Spark has no built-in plotting library. Many people use df.toPandas() to convert to pandas and plot it. Is there any better solution? My data is very big, so toPandas() will cause a memory error. I'm working on a server and want to save the plot as an image instead of showing it.
I suggest the following solution for writing the DataFrame into directories tied to the input files:
In a loop over the files:
- read the csv file
- add a new column with information about the input file, using a withColumn transformation
- union all the DataFrames using the union transformation
Then:
- do the required preprocessing
- save the result using partitionBy, passing the column with the input-file information, so that rows from the same input file land in the same output directory
Code could look like:
from pyspark.sql.functions import lit

all_df = None
for file in files:  # files is the list of input CSV paths you want to read
    df = spark.read.csv(file)
    # tag every row with the file it came from (lit turns the string into a column)
    df = df.withColumn("input_file", lit(file))
    all_df = df if all_df is None else all_df.union(df)

# ... do the required preprocessing on all_df, producing `result` ...
result.write.partitionBy("input_file").csv(outdir)
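The plotting part of the question is not addressed above. A common workaround (my suggestion, not part of the original answer) is to aggregate or sample in Spark first, so only a small dataframe ever reaches toPandas(), and to render straight to a file since the job runs on a server:
import matplotlib
matplotlib.use("Agg")  # headless backend: render to a file, no display needed
import matplotlib.pyplot as plt

# sample a small fraction before converting; 0.01 is a hypothetical value
sample_pdf = result.sample(False, 0.01, seed=42).toPandas()
# "x_col" and "y_col" are hypothetical column names; use your own
sample_pdf.plot(x="x_col", y="y_col", kind="scatter")
plt.savefig("plot.png")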
